Serial tasks in action: Object recognition with Tensorflow and Python Imageai
Copyright (c) 2013-2019 UL HPC Team <hpc-team@uni.lu>
The following github repositories will be used:
- ULHPC/launcher-scripts
- UPDATE (Dec 2020) This repository is deprecated and kept for archiving purposes only. Consider the up-to-date launchers listed at the root of the ULHPC/tutorials repository, under
launchers/
- UPDATE (Dec 2020) This repository is deprecated and kept for archiving purposes only. Consider the up-to-date launchers listed at the root of the ULHPC/tutorials repository, under
- ULHPC/tutorials
In this exercise, we will process some images from the OpenImages V4 data set with an object recognition tools.
Create a file which contains the list of parameters (random list of images):
(access)$> find /work/projects/bigdata_sets/OpenImages_V4/train/ -print | head -n 10000 | sort -R | head -n 50 | tail -n +2 > $SCRATCH/PS2/param_file
Step 0: Prepare the environment
(access)$> si -N 1
Load the default Python module
(node) module load lang/Python
(node) module list
Create a new python virtual env
(node) cd $SCRATCH/PS2
(node) virtualenv venv
Enable your newly created virtual env, and install the required modules inside
(node) source venv/bin/activate
(node) pip install tensorflow scipy opencv-python pillow matplotlib keras
(node) pip install https://github.com/OlafenwaMoses/ImageAI/releases/download/2.0.2/imageai-2.0.2-py3-none-any.whl
(node) exit
Step 1: Naive workflow
We will use the launcher NAIVE_AKA_BAD_launcher_serial.sh
(full path: $SCRATCH/PS2/launcher-scripts/bash/serial/NAIVE_AKA_BAD_launcher_serial.sh
).
Edit the following variables:
MODULE_TO_LOAD
must contain the list of modules to load before executing$TASK
,TASK
must contain the path of the executable,ARG_TASK_FILE
must contain the path of your parameter file.(node)$> nano $SCRATCH/PS2/launcher-scripts/bash/serial/NAIVE_AKA_BAD_launcher_serial.sh MODULE_TO_LOAD=(lang/Python) TASK="$SCRATCH/PS2/tutorials/sequential/examples/scripts/run_object_recognition.sh" ARG_TASK_FILE=$SCRATCH/PS2/param_file
Using Slurm on Iris
Launch the job, in interactive mode and execute the launcher:
(access)$> si -N 1
(node)$> source venv/bin/activate
(node)$> $SCRATCH/PS2/launcher-scripts/bash/serial/NAIVE_AKA_BAD_launcher_serial.sh
Or in passive mode (the output will be written in a file named BADSerial-<JOBID>.out
)
(access)$> sbatch $SCRATCH/PS2/launcher-scripts/bash/serial/NAIVE_AKA_BAD_launcher_serial.sh
You can use the command scontrol show job <JOBID>
to read all the details about your job:
(access)$> scontrol show job 207001
JobId=207001 JobName=BADSerial
UserId=hcartiaux(5079) GroupId=clusterusers(666) MCS_label=N/A
Priority=8791 Nice=0 Account=ulhpc QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:23 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2018-11-23T10:01:04 EligibleTime=2018-11-23T10:01:04
StartTime=2018-11-23T10:01:05 EndTime=2018-11-23T11:01:05 Deadline=N/A
And the command sacct
to see the start and end date
(access)$> sacct --format=start,end --j 207004
Start End
------------------- -------------------
2018-11-23T10:01:20 2018-11-23T10:02:31
2018-11-23T10:01:20 2018-11-23T10:02:31
In all cases, you can connect to a reserved node using the command srun
and check the status of the system using standard linux command (free
, top
, htop
, etc)
(access)$> sjoin <JOBID>
During the execution, you can see the job in the queue with the command squeue
:
(access)$> squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
207001 batch BADSeria hcartiaux R 2:27 1 iris-110
Using the system monitoring tool ganglia, check the activity on your node.
Step 2: Optimal method using GNU parallel (GNU Parallel)
We will use the launcher launcher_serial.sh
(full path: $SCRATCH/PS2/launcher-scripts/bash/serial/launcher_serial.sh
).
Edit the following variables:
(access)$> nano $SCRATCH/PS2/launcher-scripts/bash/serial/launcher_serial.sh
MODULE_TO_LOAD=(lang/Python)
TASK="$SCRATCH/PS2/tutorials/sequential/examples/scripts/run_object_recognition.sh"
ARG_TASK_FILE=$SCRATCH/PS2/param_file
Submit the (passive) job with sbatch
(access)$> sbatch $SCRATCH/PS2/launcher-scripts/bash/serial/launcher_serial.sh
Question: compare and explain the execution time with both launchers:
-
Naive workflow: time = ?
-
Parallel workflow: time = ?
Conclusion
At the end, please clean up your home and scratch directories :)
Please do not store unnecessary files on the cluster's storage servers:
(access)$> rm -rf $SCRATCH/PS2