# Distributing embarrassingly parallel tasks with GNU Parallel

 Copyright (c) 2020 UL HPC Team <hpc-team@uni.lu>


GNU Parallel is a tool for executing tasks in parallel, typically on a single machine. When coupled with the Slurm command srun, parallel becomes a powerful way of distributing a set of tasks amongst a number of workers. This is particularly useful when the number of tasks is significantly larger than the number of available workers (i.e. $SLURM_NTASKS), and each task is independent of the others.

## Installation

The parallel command is available at the system level across the ULHPC clusters, but only in a relatively old version:

(access)$> parallel --version
GNU parallel 20160222
[...]


You may want to build an up-to-date version. The process is straightforward, and we illustrate it with the GNU Stow utility, which is useful for keeping track of system-wide and per-user installations of software built from source, such as parallel in this case (see the GNU Stow manual and tutorial).

Prepare the installation directories within your HOME, together with the stowdir:

### Access to ULHPC cluster if not yet done - here iris
(laptop)$> ssh iris-cluster
(access)$> cd      # go to your HOME
(access)$> mkdir bin include lib share src
# create stowdir
(access)$> mkdir stow


Get the latest stable sources of GNU Parallel under src and compile them within an interactive job:

(access)$> cd ~/src
# Download the latest sources
(access)$> wget http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
# Not strictly necessary, but helps to keep track of the time of the action
(access)$> mv parallel-latest.tar.bz2 parallel-latest-$(date +%F).tar.bz2
# Uncompress
(access)$> tar xf parallel-latest-$(date +%F).tar.bz2
# Adapt the directory name to the version actually downloaded
(access)$> cd parallel-20201122/
### Have an interactive job for the compilation process
# ... either directly
(access)$> si
# ... or using the HPC School reservation 'hpcschool' if needed  - use 'sinfo -T' to check if active and its name
# (access)$> si --reservation=hpcschool

GNU Parallel is one of the many software packages that can be built very easily through the Autotools build system, i.e. ./configure; make; make install. However, by default this process installs the built software under /usr/local, where you have NO right to write files, so if you don't pay attention the installation step will fail. To circumvent the problem:

• we will install parallel in $HOME (--prefix option)
• more specifically, we will install the built software within the stow directory, in a dedicated sub-directory whose name identifies the precise version built.

Proceed as follows:

(node)$> ./configure --prefix=$HOME/stow/parallel-20201122
(node)$> make
(node)$> make install
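
The directory name parallel-20201122 is hard-coded above, while the "latest" tarball changes over time. The following is a minimal sketch, not part of the original procedure, of how the versioned directory and the matching stow prefix could be derived from the archive itself. The helper names `tarball_topdir` and `stow_prefix` are hypothetical, and the demonstration uses a dummy archive so that nothing is downloaded.

```shell
#!/usr/bin/env bash
# Sketch: derive the versioned source directory from a tarball instead of
# hard-coding it. Helper names are illustrative assumptions.

# Print the top-level directory contained in a tar archive
tarball_topdir() {
    tar tf "$1" | head -n 1 | cut -d/ -f1
}

# Print the per-version installation prefix below ~/stow
stow_prefix() {
    printf '%s/stow/%s\n' "$HOME" "$1"
}

# Demonstration on a dummy (uncompressed) archive that mimics the layout
# of the real sources: a single top-level directory parallel-<version>/
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/parallel-20201122"
touch "$demo_dir/parallel-20201122/configure"
tar cf "$demo_dir/parallel-latest.tar" -C "$demo_dir" parallel-20201122

version_dir=$(tarball_topdir "$demo_dir/parallel-latest.tar")
# Show the configure command that would be used for this version
echo "./configure --prefix=$(stow_prefix "$version_dir")"

rm -rf "$demo_dir"
```

With the real archive, the same helper would be called on parallel-latest-$(date +%F).tar.bz2, assuming the archive keeps a single top-level directory.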


That's all, folks. You can now use stow to enable this build:

(node)$> cd ~/stow
(node)$> stow parallel-20201122
# Check the result:
(node)$> ll ~/bin/parallel
lrwxrwxrwx. 1 svarrette clusterusers 38 Dec 12 14:50 /home/users/svarrette/bin/parallel -> ../stow/parallel-20201122/bin/parallel
# As ~/bin is one of the first entries of your PATH, the command
# 'parallel' now resolves to the newly built version
(node)$> which parallel
~/bin/parallel
(node)$> parallel --version
GNU parallel 20201122
[...]

At any time, you can disable this build as follows:

(node)$> cd ~/stow
(node)$> stow -D parallel-20201122
# Now the command 'parallel' is resolved to the system one
(node)$> ll ~/bin/parallel
ls: cannot access /home/users/svarrette/bin/parallel: No such file or directory
(node)$> which parallel
/usr/bin/parallel
(node)$> parallel --version     # you may need to source ~/.bashrc
GNU parallel 20160222
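
To make the enable/disable cycle less magic, here is a small sketch that reproduces the visible effect of stow and stow -D with plain ln -s and rm inside a throw-away directory. It is only an illustration of the symlink layout seen in the ll output above, not a description of how stow is implemented. The sandbox layout and the variable names are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: mimic the effect of 'stow' / 'stow -D' with plain symlinks.
# Everything happens in a temporary sandbox, your real ~/bin is untouched.

sandbox=$(mktemp -d)
mkdir -p "$sandbox/bin" "$sandbox/stow/parallel-20201122/bin"
# Placeholder standing for the freshly built binary
echo "placeholder for the built parallel" \
    > "$sandbox/stow/parallel-20201122/bin/parallel"

# "Enable" the build: a relative symlink from bin/ into the package tree
ln -s ../stow/parallel-20201122/bin/parallel "$sandbox/bin/parallel"
link_target=$(readlink "$sandbox/bin/parallel")
echo "bin/parallel -> $link_target"

# "Disable" the build: only the symlink goes away, the package tree stays
rm "$sandbox/bin/parallel"
if [ -e "$sandbox/bin/parallel" ]; then still_linked=yes; else still_linked=no; fi
if [ -f "$sandbox/stow/parallel-20201122/bin/parallel" ]; then build_kept=yes; else build_kept=no; fi
echo "symlink present: $still_linked, build kept: $build_kept"

rm -rf "$sandbox"
```

Because the built files stay under the stow directory, several versions can coexist there and be switched by enabling one at a time.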


You can now quit your interactive job (CTRL-D).

## Discovering the parallel command

The GNU Parallel syntax can be a little disturbing, but basically it supports two modes:

• Reading command arguments on the command line:

parallel    [-j N] [OPTIONS]    COMMAND {} ::: TASKLIST

• Reading command arguments from an input file:

parallel    -a  TASKLIST.LST    [-j N] [OPTIONS]    COMMAND {}
parallel    [-j N] [OPTIONS]    COMMAND {} :::: TASKLIST.LST
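
As a quick sanity check of the syntaxes listed above, the following sketch feeds the same three tasks through :::, -a and :::: and stores the results for comparison. The temporary file and the helper `have_gnu_parallel` are assumptions made for the example. The -k (--keep-order) option is added so that the output order follows the input order, and the block does nothing if GNU parallel is not found in the PATH.

```shell
#!/usr/bin/env bash
# Sketch: the same TASKLIST given on the command line and through a file.

# True only if the 'parallel' found in PATH is GNU parallel
have_gnu_parallel() {
    command -v parallel >/dev/null 2>&1 &&
        parallel --version 2>/dev/null | grep -q 'GNU parallel'
}

# Hypothetical task list: one task per line
tasklist=$(mktemp)
printf '%s\n' A B C > "$tasklist"

if have_gnu_parallel; then
    from_cli=$(parallel -k echo {} ::: A B C)           # arguments on the command line
    from_a=$(parallel -k -a "$tasklist" echo {})        # input file with -a
    from_file=$(parallel -k echo {} :::: "$tasklist")   # input file with ::::
    printf 'command line:\n%s\n' "$from_cli"
    printf 'with -a:\n%s\n' "$from_a"
    printf 'with :::: \n%s\n' "$from_file"
else
    echo "GNU parallel not found in PATH, nothing to run" >&2
fi

rm -f "$tasklist"
```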


If your COMMAND embeds a pipe stage, you have to escape the pipe symbol as \|. The -j <N> option defines the number of jobs run concurrently per machine; in particular, you may want to use -j 1 to force a sequential execution of the parallel command. Let's make some tests.
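
The effect of escaping the pipe can be sketched as follows: with \| the pipe belongs to every job, whereas an unescaped | is interpreted by your shell and applies once to the combined output of parallel. The tr command is just an arbitrary filter chosen for illustration, and `have_gnu_parallel` is a hypothetical helper. For such a simple filter both forms print the same text, the difference being where the filter runs.

```shell
#!/usr/bin/env bash
# Sketch: escaped pipe (inside each job) versus unescaped pipe (after parallel).

# True only if the 'parallel' found in PATH is GNU parallel
have_gnu_parallel() {
    command -v parallel >/dev/null 2>&1 &&
        parallel --version 2>/dev/null | grep -q 'GNU parallel'
}

if have_gnu_parallel; then
    # Each job runs: echo <arg> | tr a-z A-Z   (3 invocations of tr)
    # -j 1 runs the jobs one after the other, -k keeps the input order
    per_job=$(parallel -k -j 1 echo {} \| tr a-z A-Z ::: a b c)

    # The jobs only run 'echo <arg>'; the shell pipes the whole output
    # of parallel into a single tr process
    whole=$(parallel -k -j 1 echo {} ::: a b c | tr a-z A-Z)

    printf 'pipe inside each job:\n%s\n' "$per_job"
    printf 'pipe after parallel:\n%s\n' "$whole"
else
    echo "GNU parallel not found in PATH, nothing to run" >&2
fi
```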

In all cases, the parallel command is available at the system level across the ULHPC clusters. Run it once:

(access)$> parallel --version
GNU parallel 20160222
Copyright (C) 2007,2008,2009,2010,2011,2012,2013,2014,2015,2016 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using programs that use GNU Parallel to process data for publication
please cite as described in 'parallel --bibtex'.

If you want to avoid the disclaimer requesting to cite the paper describing GNU parallel, run parallel --bibtex as indicated, type 'will cite' and press enter.

Let's play with a TASKLIST from the command line:

(access)$> parallel echo {} ::: A B C
A
B
C
(access)$> parallel echo {} ::: {1..3}
1
2
3
# As above
(access)$> parallel echo {} ::: $(seq 1 3)
1
2
3
# Use index to refer to a given TASKLIST
(access)$> parallel echo {1} {2} ::: A B C ::: {1..3}
A 1
A 2
A 3
B 1
B 2
B 3
C 1
C 2
C 3
(access)$> parallel --xapply "echo {1} {2}" ::: A B C ::: {1..3}
A 1
B 2
C 3
# This can be useful to output command text with arguments
(access)$> parallel --xapply echo myapp_need_argument {1} {2} ::: A B C ::: {1..3}
myapp_need_argument A 1
myapp_need_argument B 2
myapp_need_argument C 3
# /!\ IMPORTANT: you can then **execute** these commands as above  by removing 'echo'
#     DON'T do that unless you know what you're doing
# You can filter out some elements:
(access)$> parallel --xapply echo myapp_need_argument {1} {2} \| grep -v 2 ::: A B C ::: {1..3}
myapp_need_argument A 1
myapp_need_argument C 3

Let's play now with a TASKLIST from an input file. Let's assume you wish to process some images from the OpenImages V4 data set. A copy of this data set is available on the ULHPC facility, under /work/projects/bigdata_sets/OpenImages_V4/. Let's create a CSV file which contains a random selection of 10 training files within this dataset (prefixed by a line number). You may want to do it as follows (copy the full command):

# Pipeline stages, from left to right:
#   find [...]/train/ -print          training set
#   head -n 10000                     select first 10K
#   sort -R                           random sort
#   head -n 10                        take only top 10
#   awk '{ print ++i","$0 }'          prefix by line number
#   tee openimages_v4_filelist.csv    print to stdout AND in file
(access)$> find /work/projects/bigdata_sets/OpenImages_V4/train/ -print | head -n 10000 | sort -R | head -n 10 | awk '{ print ++i","$0 }' | tee openimages_v4_filelist.csv
1,/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg
2,/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg
3,/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg
4,/work/projects/bigdata_sets/OpenImages_V4/train/f66087cdf8e172cd.jpg
5,/work/projects/bigdata_sets/OpenImages_V4/train/5efed414dd8b23d0.jpg
6,/work/projects/bigdata_sets/OpenImages_V4/train/1be054cb3021f6aa.jpg
7,/work/projects/bigdata_sets/OpenImages_V4/train/61446dee2ee9eb27.jpg
8,/work/projects/bigdata_sets/OpenImages_V4/train/dba2da75d899c3e7.jpg
9,/work/projects/bigdata_sets/OpenImages_V4/train/7ea06f092abc005e.jpg
10,/work/projects/bigdata_sets/OpenImages_V4/train/2db694eba4d4bb04.jpg

Let's manipulate the file content with parallel (prefer the -a <filename> syntax):

# simple echo of the file
(access)$> parallel -a openimages_v4_filelist.csv echo {}
1,/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg
2,/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg
3,/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg
[...]
# print specific column of the CSV file
(access)$> parallel --colsep '\,' -a openimages_v4_filelist.csv echo {1}
1
2
3
4
5
6
7
8
9
10
(access)$> parallel --colsep '\,' -a openimages_v4_filelist.csv echo {2}
/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/f66087cdf8e172cd.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/5efed414dd8b23d0.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/1be054cb3021f6aa.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/61446dee2ee9eb27.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/dba2da75d899c3e7.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/7ea06f092abc005e.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/2db694eba4d4bb04.jpg

# reformat and change order
(access)$> parallel --colsep '\,' -a openimages_v4_filelist.csv echo {2} {1}
/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg 1
/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg 2
/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg 3
/work/projects/bigdata_sets/OpenImages_V4/train/f66087cdf8e172cd.jpg 4
/work/projects/bigdata_sets/OpenImages_V4/train/5efed414dd8b23d0.jpg 5
/work/projects/bigdata_sets/OpenImages_V4/train/1be054cb3021f6aa.jpg 6
/work/projects/bigdata_sets/OpenImages_V4/train/61446dee2ee9eb27.jpg 7
/work/projects/bigdata_sets/OpenImages_V4/train/dba2da75d899c3e7.jpg 8
/work/projects/bigdata_sets/OpenImages_V4/train/7ea06f092abc005e.jpg 9
/work/projects/bigdata_sets/OpenImages_V4/train/2db694eba4d4bb04.jpg 10
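
To clarify what the column placeholders do in the last command, here is a sequential, plain-shell sketch of the same reordering, assuming a two-column CSV without commas inside the fields. It only illustrates the semantics of {1} and {2} after splitting on ','; it is not how parallel works internally and it runs one line at a time. The function name `swap_columns` and the three-line sample file are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: sequential equivalent of
#   parallel --colsep '\,' -a <file> echo {2} {1}
# for a CSV with exactly two columns.

# Print "<column 2> <column 1>" for every line of the given CSV file
swap_columns() {
    while IFS=, read -r index path; do
        printf '%s %s\n' "$path" "$index"
    done < "$1"
}

# Small sample with the first three entries of the file list shown above
filelist=$(mktemp)
cat > "$filelist" <<'EOF'
1,/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg
2,/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg
3,/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg
EOF

swapped=$(swap_columns "$filelist")
printf '%s\n' "$swapped"

rm -f "$filelist"
```

The benefit of parallel over such a loop is that each line becomes an independent job that can run concurrently with the others.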


The ULHPC team has designed a generic launcher for GNU parallel: see ../basics/scripts/launcher.parallel.sh.

Its usage is explained in the HPC Management of Sequential and Embarrassingly Parallel Jobs tutorial.