By ULHPC Licence GitHub issues Github Documentation Status GitHub forks

Distributing embarrassingly parallel tasks GNU Parallel

 Copyright (c) 2020 UL HPC Team <hpc-team@uni.lu>

GNU Parallel) is a tool for executing tasks in parallel, typically on a single machine. When coupled with the Slurm command srun, parallel becomes a powerful way of distributing a set of tasks amongst a number of workers. This is particularly useful when the number of tasks is significantly larger than the number of available workers (i.e. $SLURM_NTASKS), and each tasks is independent of the others.

Installation

The parallel command is available at the system level across the ULHPC clusters, yet under a relatively old version:

(access)$> parallel --version
GNU parallel 20160222
[...]

You may want to build the up-to-date version. The process is quite straight-forward and we will illustrate this process using the GNU Stow utility which is quite useful for keeping track of system-wide and per-user installations of software built from source, as parallel in this case. GNU Stow manual - tutorial

Prepare the installation directories within your HOME, together with the stowdir:

### Access to ULHPC cluster if not yet done - here iris
(laptop)$> ssh iris-cluster
(access)$> cd      # go to your HOME
(access)$> mkdir bin include lib share src
# create stowdir
(access)$> mkdir stow

Get the latest stable sources of GNU Parallel under src and compile them within an interactive job:

(access)$> cd ~/src
# Download the latest sources
(access)$> wget http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
# Not strictly necessary, but help to keep track of time of action
(access)$> mv parallel-latest.tar.bz2  parallel-latest-$(date +%F).tar.bz2
# Uncompress
(access)$> tar xf parallel-latest-$(date +%F).tar.bz2
(access)$> cd parallel-20201122/
### Have an interactive job for the compilation process
# ... either directly
(access)$> si
# ... or using the HPC School reservation 'hpcschool' if needed  - use 'sinfo -T' to check if active and its name
# (access)$> si --reservation=hpcschool

GNU Parallel is one of the many software that can be build very easily through the Autotools build system i.e ./configure; make; make install. However, this process wants to install by default the built software under /usr/local where you have NO rights to write files. So if you don't pay attention, the installation step will fail. To circumvent the problem:

  • we will install parallel in $HOME (--prefix option)
  • more specifically, we will install the built software withing the stow directory, in a specific sub-directory that allow to specify the precise version generated.

Proceed as follows:

(node)$> ./configure --prefix=$HOME/stow/parallel-20201122
(node)$> make
(node)$> make install

That's all folk. You can now use stow to enable this build:

(node)$> cd ~/stow
(node)$> stow parallel-20201122
# Check the result:
(node)$> ll ~/bin/parallel
lrwxrwxrwx. 1 svarrette clusterusers 38 Dec 12 14:50 /home/users/svarrette/bin/parallel -> ../stow/parallel-20201122/bin/parallel
# As ~/bin is part of your PATH at one of the first position, now the command
# 'parallel' is resolved as the newly built version
(node)$> which parallel
~/bin/parallel
(node)$>  parallel --version
GNU parallel 20201122
[...]

At any moment of time, you can disable this build as follows:

(node)$> cd ~/stow
(node)$> stow -D parallel-20201122
# Now the command 'parallel' is resolved to the system one
(node)$> ll ~/bin/parallel
ls: cannot access /home/users/svarrette/bin/parallel: No such file or directory
(node)$> which parallel
/usr/bin/parallel
(node)$> parallel --version     # you may need to source ~/.bashrc
GNU parallel 20160222

You can quit your interactive job (CTRL-D)

Discovering the parallel command

The GNU Parallel syntax can be a little distributing, but basically it supports two modes:

  • Reading command arguments on the command line:

    parallel    [-j N] [OPTIONS]    COMMAND {} ::: TASKLIST
    
  • Reading command arguments from an input file:

    parallel    –a  TASKLIST.LST    [-j N] [OPTIONS]    COMMAND {}
    parallel    [-j N] [OPTIONS]    COMMAND {} :::: TASKLIST.LST
    

If your COMMAND embed a pipe stage, you have to escape the pipe symbol as follows \|. Let's make some tests. The -j <N> option permits to define the jobs per machine - in particular you may want to use -j 1 to enable a sequential resolution of the parallel command

In all cases, the parallel command is available at the system across the ULHPC clusters. Run it once.

(access)$> parallel --version
GNU parallel 20160222
Copyright (C) 2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using programs that use GNU Parallel to process data for publication
please cite as described in 'parallel --bibtex'.

If you want to avoid the disclaimer requesting to cite the paper describing GNU parallel, you have to run as indicated parallel --bibtex, type: 'will cite' and press enter.

Let's play with a TASKLIST from the command line:

(access)$> parallel echo {} ::: A B C
A
B
C
(access)$> parallel echo {} ::: {1..3}
1
2
3
# As above
(access)$> parallel echo {} ::: $(seq 1 3)
1
2
3
# Use index to refer to a given TASKLIST
(access)$> parallel echo {1} {2} ::: A B C ::: {1..3}
A 1
A 2
A 3
B 1
B 2
B 3
C 1
C 2
C 3
# Combine (link) TASKLIST with same size - Use '--link' on more recent parallel version
(access)$> parallel --xapply "echo {1} {2}" ::: A B C ::: {1..3}
A 1
B 2
C 3
# This can be useful to output command text with arguments
(access)$> parallel --xapply echo myapp_need_argument {1} {2} ::: A B C ::: {1..3}
myapp_need_argument A 1
myapp_need_argument B 2
myapp_need_argument C 3
# /!\ IMPORTANT: you can then **execute** these commands as above  by removing 'echo'
#     DON'T do that unless you know what you're doing
# You can filter out some elements:
(access)$> parallel --xapply echo myapp_need_argument {1} {2} \| grep -v 2 ::: A B C ::: {1..3}
myapp_need_argument A 1
myapp_need_argument C 3

Let's play now with a TASKLIST from an input file.

Let's assume you wish to process some images from the OpenImages V4 data set. A copy of this data set is available on the ULHPC facility, under /work/projects/bigdata_sets/OpenImages_V4/. Let's create a CSV file which contains a random selection of 10 training files within this dataset (prefixed by a line number). You may want to do it as follows (copy the full command):

#                                                       training set     select first 10K  random sort  take only top 10   prefix by line number      print to stdout AND in file
#                                                         ^^^^^^           ^^^^^^^^^^^^^   ^^^^^^^^     ^^^^^^^^^^^^^      ^^^^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(access)$> find /work/projects/bigdata_sets/OpenImages_V4/train/ -print | head -n 10000 | sort -R   |  head -n 10       | awk '{ print ++i","$0 }' | tee openimages_v4_filelist.csv
1,/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg
2,/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg
3,/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg
4,/work/projects/bigdata_sets/OpenImages_V4/train/f66087cdf8e172cd.jpg
5,/work/projects/bigdata_sets/OpenImages_V4/train/5efed414dd8b23d0.jpg
6,/work/projects/bigdata_sets/OpenImages_V4/train/1be054cb3021f6aa.jpg
7,/work/projects/bigdata_sets/OpenImages_V4/train/61446dee2ee9eb27.jpg
8,/work/projects/bigdata_sets/OpenImages_V4/train/dba2da75d899c3e7.jpg
9,/work/projects/bigdata_sets/OpenImages_V4/train/7ea06f092abc005e.jpg
10,/work/projects/bigdata_sets/OpenImages_V4/train/2db694eba4d4bb04.jpg

Let's manipulate the file content with parallel (prefer the -a <filename> syntax):

# simple echo of the file
(access)$> parallel -a openimages_v4_filelist.csv echo {}
1,/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg
2,/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg
3,/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg
[...]
# print specific column of the CSV file
(access)$> parallel --colsep '\,' -a openimages_v4_filelist.csv echo {1}
1
2
3
4
5
6
7
8
9
10
(access)$> parallel --colsep '\,' -a openimages_v4_filelist.csv echo {2}
/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/f66087cdf8e172cd.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/5efed414dd8b23d0.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/1be054cb3021f6aa.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/61446dee2ee9eb27.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/dba2da75d899c3e7.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/7ea06f092abc005e.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/2db694eba4d4bb04.jpg

# reformat and change order
(access)$> parallel --colsep '\,' -a openimages_v4_filelist.csv echo {2} {1}
/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg 1
/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg 2
/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg 3
/work/projects/bigdata_sets/OpenImages_V4/train/f66087cdf8e172cd.jpg 4
/work/projects/bigdata_sets/OpenImages_V4/train/5efed414dd8b23d0.jpg 5
/work/projects/bigdata_sets/OpenImages_V4/train/1be054cb3021f6aa.jpg 6
/work/projects/bigdata_sets/OpenImages_V4/train/61446dee2ee9eb27.jpg 7
/work/projects/bigdata_sets/OpenImages_V4/train/dba2da75d899c3e7.jpg 8
/work/projects/bigdata_sets/OpenImages_V4/train/7ea06f092abc005e.jpg 9
/work/projects/bigdata_sets/OpenImages_V4/train/2db694eba4d4bb04.jpg 10

The ULHPC team has designed a generic launcher for GNU parallel: see ../basics/scripts/launcher.parallel.sh.

Its usage is explicited in the HPC Management of Sequential and Embarrassingly Parallel Jobs tutorials.