Distributing embarrassingly parallel tasks GNU Parallel
Copyright (c) 2020 UL HPC Team <hpc-team@uni.lu>
GNU Parallel) is a tool for executing tasks in parallel, typically on a single machine. When coupled with the Slurm command srun
, parallel becomes a powerful way of distributing a set of tasks amongst a number of workers. This is particularly useful when the number of tasks is significantly larger than the number of available workers (i.e. $SLURM_NTASKS
), and each tasks is independent of the others.
Installation
The parallel
command is available at the system level across the ULHPC clusters, yet under a relatively old version:
(access)$> parallel --version
GNU parallel 20160222
[...]
You may want to build the up-to-date version. The process is quite straight-forward and we will illustrate this process using the GNU Stow utility which is quite useful for keeping track of system-wide and per-user installations of software built from source, as parallel
in this case. GNU Stow manual - tutorial
Prepare the installation directories within your HOME, together with the stowdir:
### Access to ULHPC cluster if not yet done - here iris
(laptop)$> ssh iris-cluster
(access)$> cd # go to your HOME
(access)$> mkdir bin include lib share src
# create stowdir
(access)$> mkdir stow
Get the latest stable sources of GNU Parallel under src
and compile them within an interactive job:
(access)$> cd ~/src
# Download the latest sources
(access)$> wget http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
# Not strictly necessary, but help to keep track of time of action
(access)$> mv parallel-latest.tar.bz2 parallel-latest-$(date +%F).tar.bz2
# Uncompress
(access)$> tar xf parallel-latest-$(date +%F).tar.bz2
(access)$> cd parallel-20201122/
### Have an interactive job for the compilation process
# ... either directly
(access)$> si
# ... or using the HPC School reservation 'hpcschool' if needed - use 'sinfo -T' to check if active and its name
# (access)$> si --reservation=hpcschool
GNU Parallel is one of the many software that can be build very easily through the Autotools build system i.e ./configure; make; make install
.
However, this process wants to install by default the built software under /usr/local
where you have NO rights to write files. So if you don't pay attention, the installation step will fail.
To circumvent the problem:
- we will install
parallel
in$HOME
(--prefix
option) - more specifically, we will install the built software withing the stow directory, in a specific sub-directory that allow to specify the precise version generated.
Proceed as follows:
(node)$> ./configure --prefix=$HOME/stow/parallel-20201122
(node)$> make
(node)$> make install
That's all folk.
You can now use stow
to enable this build:
(node)$> cd ~/stow
(node)$> stow parallel-20201122
# Check the result:
(node)$> ll ~/bin/parallel
lrwxrwxrwx. 1 svarrette clusterusers 38 Dec 12 14:50 /home/users/svarrette/bin/parallel -> ../stow/parallel-20201122/bin/parallel
# As ~/bin is part of your PATH at one of the first position, now the command
# 'parallel' is resolved as the newly built version
(node)$> which parallel
~/bin/parallel
(node)$> parallel --version
GNU parallel 20201122
[...]
At any moment of time, you can disable this build as follows:
(node)$> cd ~/stow
(node)$> stow -D parallel-20201122
# Now the command 'parallel' is resolved to the system one
(node)$> ll ~/bin/parallel
ls: cannot access /home/users/svarrette/bin/parallel: No such file or directory
(node)$> which parallel
/usr/bin/parallel
(node)$> parallel --version # you may need to source ~/.bashrc
GNU parallel 20160222
You can quit your interactive job (CTRL-D)
Discovering the parallel
command
The GNU Parallel syntax can be a little distributing, but basically it supports two modes:
-
Reading command arguments on the command line:
parallel [-j N] [OPTIONS] COMMAND {} ::: TASKLIST
-
Reading command arguments from an input file:
parallel –a TASKLIST.LST [-j N] [OPTIONS] COMMAND {} parallel [-j N] [OPTIONS] COMMAND {} :::: TASKLIST.LST
If your COMMAND embed a pipe stage, you have to escape the pipe symbol as follows \|
.
Let's make some tests. The -j <N>
option permits to define the jobs per machine - in particular you may want to use -j 1
to enable a sequential resolution of the parallel command
In all cases, the parallel
command is available at the system across the ULHPC clusters. Run it once.
(access)$> parallel --version
GNU parallel 20160222
Copyright (C) 2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site: http://www.gnu.org/software/parallel
When using programs that use GNU Parallel to process data for publication
please cite as described in 'parallel --bibtex'.
If you want to avoid the disclaimer requesting to cite the paper describing GNU parallel, you have to run as indicated parallel --bibtex
, type: 'will cite' and press enter.
Let's play with a TASKLIST from the command line:
(access)$> parallel echo {} ::: A B C
A
B
C
(access)$> parallel echo {} ::: {1..3}
1
2
3
# As above
(access)$> parallel echo {} ::: $(seq 1 3)
1
2
3
# Use index to refer to a given TASKLIST
(access)$> parallel echo {1} {2} ::: A B C ::: {1..3}
A 1
A 2
A 3
B 1
B 2
B 3
C 1
C 2
C 3
# Combine (link) TASKLIST with same size - Use '--link' on more recent parallel version
(access)$> parallel --xapply "echo {1} {2}" ::: A B C ::: {1..3}
A 1
B 2
C 3
# This can be useful to output command text with arguments
(access)$> parallel --xapply echo myapp_need_argument {1} {2} ::: A B C ::: {1..3}
myapp_need_argument A 1
myapp_need_argument B 2
myapp_need_argument C 3
# /!\ IMPORTANT: you can then **execute** these commands as above by removing 'echo'
# DON'T do that unless you know what you're doing
# You can filter out some elements:
(access)$> parallel --xapply echo myapp_need_argument {1} {2} \| grep -v 2 ::: A B C ::: {1..3}
myapp_need_argument A 1
myapp_need_argument C 3
Let's play now with a TASKLIST from an input file.
Let's assume you wish to process some images from the OpenImages V4 data set.
A copy of this data set is available on the ULHPC facility, under /work/projects/bigdata_sets/OpenImages_V4/
.
Let's create a CSV file which contains a random selection of 10 training files within this dataset (prefixed by a line number).
You may want to do it as follows (copy the full command):
# training set select first 10K random sort take only top 10 prefix by line number print to stdout AND in file
# ^^^^^^ ^^^^^^^^^^^^^ ^^^^^^^^ ^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(access)$> find /work/projects/bigdata_sets/OpenImages_V4/train/ -print | head -n 10000 | sort -R | head -n 10 | awk '{ print ++i","$0 }' | tee openimages_v4_filelist.csv
1,/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg
2,/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg
3,/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg
4,/work/projects/bigdata_sets/OpenImages_V4/train/f66087cdf8e172cd.jpg
5,/work/projects/bigdata_sets/OpenImages_V4/train/5efed414dd8b23d0.jpg
6,/work/projects/bigdata_sets/OpenImages_V4/train/1be054cb3021f6aa.jpg
7,/work/projects/bigdata_sets/OpenImages_V4/train/61446dee2ee9eb27.jpg
8,/work/projects/bigdata_sets/OpenImages_V4/train/dba2da75d899c3e7.jpg
9,/work/projects/bigdata_sets/OpenImages_V4/train/7ea06f092abc005e.jpg
10,/work/projects/bigdata_sets/OpenImages_V4/train/2db694eba4d4bb04.jpg
Let's manipulate the file content with parallel (prefer the -a <filename>
syntax):
# simple echo of the file
(access)$> parallel -a openimages_v4_filelist.csv echo {}
1,/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg
2,/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg
3,/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg
[...]
# print specific column of the CSV file
(access)$> parallel --colsep '\,' -a openimages_v4_filelist.csv echo {1}
1
2
3
4
5
6
7
8
9
10
(access)$> parallel --colsep '\,' -a openimages_v4_filelist.csv echo {2}
/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/f66087cdf8e172cd.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/5efed414dd8b23d0.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/1be054cb3021f6aa.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/61446dee2ee9eb27.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/dba2da75d899c3e7.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/7ea06f092abc005e.jpg
/work/projects/bigdata_sets/OpenImages_V4/train/2db694eba4d4bb04.jpg
# reformat and change order
(access)$> parallel --colsep '\,' -a openimages_v4_filelist.csv echo {2} {1}
/work/projects/bigdata_sets/OpenImages_V4/train/6196380ea79283e0.jpg 1
/work/projects/bigdata_sets/OpenImages_V4/train/7f23f40740731c03.jpg 2
/work/projects/bigdata_sets/OpenImages_V4/train/dbfc1b37f45b3957.jpg 3
/work/projects/bigdata_sets/OpenImages_V4/train/f66087cdf8e172cd.jpg 4
/work/projects/bigdata_sets/OpenImages_V4/train/5efed414dd8b23d0.jpg 5
/work/projects/bigdata_sets/OpenImages_V4/train/1be054cb3021f6aa.jpg 6
/work/projects/bigdata_sets/OpenImages_V4/train/61446dee2ee9eb27.jpg 7
/work/projects/bigdata_sets/OpenImages_V4/train/dba2da75d899c3e7.jpg 8
/work/projects/bigdata_sets/OpenImages_V4/train/7ea06f092abc005e.jpg 9
/work/projects/bigdata_sets/OpenImages_V4/train/2db694eba4d4bb04.jpg 10
The ULHPC team has designed a generic launcher for GNU parallel: see ../basics/scripts/launcher.parallel.sh
.
Its usage is explicited in the HPC Management of Sequential and Embarrassingly Parallel Jobs tutorials.