Distributing embarrassingly parallel tasks GNU Parallel

 Copyright (c) 2021 UL HPC Team <hpc-team@uni.lu>

GNU Parallel is a tool for executing tasks in parallel, typically on a single machine. When coupled with the Slurm command srun, parallel becomes a powerful way of distributing a set of tasks amongst a number of workers. This is particularly useful when the number of tasks is significantly larger than the number of available workers (i.e. $SLURM_NTASKS), and each tasks is independent of the others.

This tutorial is part of the practical session "HPC Management of Sequential and Embarrassingly Parallel Jobs". See the associated slides.

Installation

The parallel command is available at the system level across the ULHPC clusters, yet under a relatively old version:

(access)$ which parallel
/usr/bin/parallel
(access)$ parallel --version
GNU parallel 20190922
[...]

You may want to build the up-to-date version. The process is quite straight-forward and we will illustrate this process using the GNU Stow utility which is quite useful for keeping track of system-wide and per-user installations of software built from source, as parallel in this case. GNU Stow manual - tutorial

Prepare the installation directories within your HOME, together with the stowdir:

### Access to ULHPC cluster if not yet done - here iris
(laptop)$> ssh aion-cluster
(access)$> cd      # go to your HOME
(access)$> mkdir -p bin include lib share/{doc,man} src
# create stowdir
(access)$> mkdir stow

Get the latest stable sources of GNU Parallel under src and compile them within an interactive job:

(access)$> cd ~/src
# Download the latest sources
(access)$> wget http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
# Not strictly necessary, but help to keep track of time of action
(access)$> mv parallel-latest.tar.bz2  parallel-latest-$(date +%F).tar.bz2
# Uncompress
(access)$> tar xf parallel-latest-$(date +%F).tar.bz2
(access)$> cd parallel-20211022/
### Have an interactive job for the compilation process
# ... either directly
(access)$> si
# ... or using the HPC School reservation 'hpcschool' if needed  - use 'sinfo -T' to check if active and its name
# (access)$> si --reservation=hpcschool

GNU Parallel is one of the many software that can be build very easily through the Autotools build system i.e ./configure; make; make install. However, this process wants to install by default the built software under /usr/local where you have NO rights to write files. So if you don't pay attention, the installation step will fail. To circumvent the problem:

we will install parallel in $HOME (--prefix option)
more specifically, we will install the built software withing the stow directory, in a specific sub-directory that allow to specify the precise version generated.

Proceed as follows:

(node)$> ./configure --prefix=$HOME/stow/parallel-20211022
(node)$> make
(node)$> make install

That's all folk. You can now use stow to enable this build:

(node)$> cd ~/stow
(node)$> stow parallel-20211022
# Check the result:
(node)$> ll ~/bin/parallel
lrwxrwxrwx. 1 svarrette clusterusers 38 Dec 12 14:50 /home/users/svarrette/bin/parallel -> ../stow/parallel-20211022/bin/parallel
# As ~/bin is part of your PATH at one of the first position, now the command
# 'parallel' is resolved as the newly built version
(node)$> which parallel
~/bin/parallel
(node)$>  parallel --version
GNU parallel 20211022   # In the unlikely case where you don't get the updated version, restart the bash session
[...]

At any moment of time, you can disable this build as follows:

(node)$> cd ~/stow
(node)$> stow -D parallel-20211022
# Now the command 'parallel' is resolved to the system one
# IGNORE the 'BUG in find_stowed_path?' message due to https://github.com/aspiers/stow/issues/65
(node)$> ll ~/bin/parallel
ls: cannot access /home/users/svarrette/bin/parallel: No such file or directory
(node)$> which parallel
/usr/bin/parallel
(node)$> parallel --version     # you may need to source ~/.bashrc
GNU parallel 20190922

You can quit your interactive job (CTRL-D)

Discovering the `parallel` command

The GNU Parallel syntax can be a little distributing, but basically it supports two modes:

Reading command arguments on the command line:

parallel    [-j N] [OPTIONS]    COMMAND {} ::: TASKLIST

Reading command arguments from an input file:

parallel    –a  TASKLIST.LST    [-j N] [OPTIONS]    COMMAND {}
parallel    [-j N] [OPTIONS]    COMMAND {} :::: TASKLIST.LST

If your COMMAND embed a pipe stage, you have to escape the pipe symbol as follows \|. Let's make some tests. The -j <N> option permits to define the jobs per machine - in particular you may want to use -j 1 to enable a sequential resolution of the parallel command

In all cases, the parallel command is available at the system across the ULHPC clusters. Run it once.

(access)$> parallel --version
GNU parallel 20160222
Copyright (C) 2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using programs that use GNU Parallel to process data for publication
please cite as described in 'parallel --bibtex'.

If you want to avoid the disclaimer requesting to cite the paper describing GNU parallel, you have to run as indicated parallel --bibtex, type: 'will cite' and press enter.

Let's play with a TASKLIST from the command line:

(access)$> parallel echo {} ::: A B C
A
B
C
(access)$> parallel echo {} ::: {1..3}
1
2
3
# As above
(access)$> parallel echo {} ::: $(seq 1 3)
1
2
3
# Use index to refer to a given TASKLIST
(access)$> parallel echo {1} {2} ::: A B C ::: {1..3}
A 1
A 2
A 3
B 1
B 2
B 3
C 1
C 2
C 3
# Combine (link) TASKLIST with same size - Use '--link' on more recent parallel version
(access)$> parallel --xapply "echo {1} {2}" ::: A B C ::: {1..3}
A 1
B 2
C 3
# This can be useful to output command text with arguments
(access)$> parallel --xapply echo myapp_need_argument {1} {2} ::: A B C ::: {1..3}
myapp_need_argument A 1
myapp_need_argument B 2
myapp_need_argument C 3
# /!\ IMPORTANT: you can then **execute** these commands as above  by removing 'echo'
#     DON'T do that unless you know what you're doing
# You can filter out some elements:
(access)$> parallel --xapply echo myapp_need_argument {1} {2} \| grep -v 2 ::: A B C ::: {1..3}
myapp_need_argument A 1
myapp_need_argument C 3

Let's play now with a TASKLIST from an input file.

Let's assume you wish to process some images from the OpenImages V4 data set. A copy of this data set is available on the ULHPC facility, under /work/projects/bigdata_sets/OpenImages_V4/. Let's create a CSV file which contains a random selection of 10 training files within this dataset (prefixed by a line number). You may want to do it as follows (copy the full command):

#                                                       training set     select first 10K  random sort  take only top 10   prefix by line number      print to stdout AND in file
#                                                         ^^^^^^           ^^^^^^^^^^^^^   ^^^^^^^^     ^^^^^^^^^^^^^      ^^^^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(access)$> find /work/projects/bigdata_sets/OpenImages_V4/train/ -print | head -n 10000 | sort -R   |  head -n 10       | awk '{ print ++i","$0 }' | tee openimages_v4_filelist.csv
1,/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg
2,/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg
3,/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg
4,/work/projects/bigdata_sets/OpenImages_V4/train/f66087cdf8e172cd.jpg
5,/work/projects/bigdata_sets/OpenImages_V4/train/5efed414dd8b23d0.jpg
6,/work/projects/bigdata_sets/OpenImages_V4/train/1be054cb3021f6aa.jpg
7,/work/projects/bigdata_sets/OpenImages_V4/train/61446dee2ee9eb27.jpg
8,/work/projects/bigdata_sets/OpenImages_V4/train/dba2da75d899c3e7.jpg
9,/work/projects/bigdata_sets/OpenImages_V4/train/7ea06f092abc005e.jpg
10,/work/projects/bigdata_sets/OpenImages_V4/train/2db694eba4d4bb04.jpg

Let's manipulate the file content with parallel (prefer the -a <filename> syntax):

# simple echo of the file
(access)$> parallel -a openimages_v4_filelist.csv echo {}
1,/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg
2,/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg
3,/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg
[...]
# print specific column of the CSV file
(access)$> parallel --colsep '\,' -a openimages_v4_filelist.csv echo {1}
1
2
3
4
5
6
7
8
9
10
(access)$> parallel --colsep '\,' -a openimages_v4_filelist.csv echo {2}
/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/f66087cdf8e172cd.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/5efed414dd8b23d0.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/1be054cb3021f6aa.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/61446dee2ee9eb27.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/dba2da75d899c3e7.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/7ea06f092abc005e.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/2db694eba4d4bb04.jpg

# reformat and change order
(access)$> parallel --colsep '\,' -a openimages_v4_filelist.csv echo {2} {1}
/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg 1
/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg 2
/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg 3
/work/projects/bigdata_sets/OpenImages_V4/train/f66087cdf8e172cd.jpg 4
/work/projects/bigdata_sets/OpenImages_V4/train/5efed414dd8b23d0.jpg 5
/work/projects/bigdata_sets/OpenImages_V4/train/1be054cb3021f6aa.jpg 6
/work/projects/bigdata_sets/OpenImages_V4/train/61446dee2ee9eb27.jpg 7
/work/projects/bigdata_sets/OpenImages_V4/train/dba2da75d899c3e7.jpg 8
/work/projects/bigdata_sets/OpenImages_V4/train/7ea06f092abc005e.jpg 9
/work/projects/bigdata_sets/OpenImages_V4/train/2db694eba4d4bb04.jpg 10

The ULHPC team has designed a generic launcher for GNU parallel: see ../basics/scripts/launcher.parallel.sh.

Its usage is explicited in the HPC Management of Sequential and Embarrassingly Parallel Jobs tutorials.

Distributing embarrassingly parallel tasks GNU Parallel

Installation

Discovering the parallel command

Discovering the `parallel` command