Bioinformatics software on the UL HPC platform
Copyright (c) 2014-2018 UL HPC Team hpc-sysadmins@uni.lu
Authors: Valentin Plugaru and Sarah Peter
The objective of this tutorial is to exemplify the execution of several Bioinformatics packages on top of the UL HPC platform.
The targeted applications are:
The tutorial will:
- show you how to load and run pre-configured versions of these applications on the clusters
- show you how to download and use updated versions of Bowtie2/TopHat
- discuss the parallelization capabilities of these applications
Prerequisites
When you look at the software page you will notice that some of the applications are part of the bioinfo software set. The modules in this set are not visible by default. To use them within a job you have to do:
(node)$> module use /opt/apps/resif/data/stable/bioinfo/modules/all
If you want them to always be available, you can add the following line to your .bash_private
:
command -v module >/dev/null 2>&1 && module use /opt/apps/resif/data/stable/bioinfo/modules/all
This tutorial relies on several input files for the bioinformatics packages, thus you will need to download them before following the instructions in the next sections:
(access)$> mkdir -p ~/bioinfo-tutorial/gromacs ~/bioinfo-tutorial/tophat ~/bioinfo-tutorial/mpiblast
(access)$> cd ~/bioinfo-tutorial
(access)$> wget --no-check-certificate https://raw.github.com/ULHPC/tutorials/devel/bio/basics/gromacs/pr.tpr -O gromacs/pr.tpr
(access)$> wget --no-check-certificate https://raw.github.com/ULHPC/tutorials/devel/bio/basics/tophat/test_data.tar.gz -O tophat/test_data.tar.gz
(access)$> wget --no-check-certificate https://raw.github.com/ULHPC/tutorials/devel/bio/basics/tophat/test2_path -O tophat/test2_path
(access)$> wget --no-check-certificate https://raw.github.com/ULHPC/tutorials/devel/bio/basics/mpiblast/test.fa -O mpiblast/test.fa
Or simply clone the full tutorials repository and make a link to the Bioinformatics tutorial:
(access)$> git clone https://github.com/ULHPC/tutorials.git
(access)$> ln -s tutorials/advanced/Bioinformatics/ ~/bioinfo-tutorial
ABySS
Characterization: CPU intensive, data intensive, native parallelization
Description
ABySS: Assembly By Short Sequences
ABySS is a de novo, parallel, paired-end sequence assembler that is designed for short reads. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes [*].
Example
This example will be ran in an interactive session, with batch-mode executions being proposed later on as exercises.
-
Gaia
# Connect to Gaia (Linux/OS X): (yourmachine)$> ssh access-gaia.uni.lu # Request 1 full node in an interactive job: (access-gaia)$> oarsub -I -l nodes=1,walltime=00:30:00
-
Iris
# Connect to Iris (Linux/OS X): (yourmachine)$> ssh access-iris.uni.lu # Request half a node in an interactive job: (access-iris)$> si -t 0-0:30:0 -N 1 -c 1 --ntasks-per-node=14
# Load bioinfo software set
(node)$> module use /opt/apps/resif/data/stable/bioinfo/modules/all
# Check the ABySS versions installed on the clusters:
(node)$> module avail 2>&1 | grep -i abyss
# Load the default ABySS version:
(node)$> module load bio/ABySS
# Check that it has been loaded, along with its dependencies:
(node)$> module list
# All the ABySS binaries are now in your path (check with TAB autocompletion)
(node)$> abyss-<TAB>
In the ABySS package only the ABYSS-P
application is parallelized using MPI and can be run on several cores (and across several nodes) using
the abyss-pe
launcher.
# Create a test directory and go to it
(node)$> mkdir ~/bioinfo-tutorial/abyss
(node)$> cd ~/bioinfo-tutorial/abyss
# Set the input files' directory in the environment
(node)$> export ABYSSINPDIR=/mnt/isilon/projects/ulhpc-tutorials/bioinformatics/abyss
# Give a name to the experiment
(node)$> export ABYSSNAME='abysstest'
-
Gaia
# Set the number of cores to use based on OAR's host file (node)$> export ABYSSNPROC=$(cat $OAR_NODEFILE | wc -l) # Launch the paired end assembler: (node)$> abyss-pe mpirun="mpirun -x PATH -x LD_LIBRARY_PATH -hostfile $OAR_NODEFILE" name=${ABYSSNAME} np=${ABYSSNPROC} k=31 n=10 lib=pairedend pairedend="${ABYSSINPDIR}/SRR001666_1.fastq.bz2 ${ABYSSINPDIR}/SRR001666_2.fastq.bz2" > ${ABYSSNAME}.out 2> ${ABYSSNAME}.err
-
Iris
# Set the number of cores to use based on SLURM environment variables (node)$> export ABYSSNPROC=$(expr $SLURM_NNODES \* $SLURM_NTASKS_PER_NODE \* $SLURM_CPUS_PER_TASK) # Create a hostfile (node)$> srun hostname | sort -n > hostfile # Launch the paired end assembler: (node)$> abyss-pe mpirun="mpirun -x PATH -x LD_LIBRARY_PATH -hostfile hostfile" name=${ABYSSNAME} np=${ABYSSNPROC} k=31 n=10 lib=pairedend pairedend="${ABYSSINPDIR}/SRR001666_1.fastq.bz2 ${ABYSSINPDIR}/SRR001666_2.fastq.bz2" > ${ABYSSNAME}.out 2> ${ABYSSNAME}.err
Question: Why do we use the -x VARIABLE parameters for mpirun?
Several options seen on the abyss-pe
command line are crucial:
- we explicitly set the mpirun command
- we export several environment variables to all the remote nodes, otherwise required paths (for the binaries, libraries) would not be known by the MPI processes running there
- we do not specify
-np $ABYSSNPROC
in the mpirun command, as it set withabyss-pe
's np parameter and internally passed on to mpirun
The execution should take around 12 minutes, meanwhile we can check its progress by monitoring the .out/.err output files:
(access)$> tail -f ~/bioinfo-tutorial/abyss/abysstest.*
# We exit the tail program with CTRL-C
On Gaia, we can also connect to the job (recall oarsub -C $JOBID) from a different terminal or Screen window and see the different ABySS phases with htop
.
Because the abyss-pe
workflow (pipeline) includes several processing steps with different applications of which only ABYSS-P is MPI-parallel,
the speedup obtained by using more than one node will be limited to ABYSS-P's execution. Several of the other applications that are part of the
processing stages are however parallelized using OpenMP and pthreads and will thus take advantage of the cores available on the node where
abyss-pe
was started.
The used input dataset is a well known Illumina run of E. coli.
Proposed exercises
Several exercises are proposed for ABySS:
- create a launcher for ABySS using the commands shown in the previous section
- launch jobs using 1 node: 4, 8 and 12 cores, then 2 and 4 nodes and measure the speedup obtained
- unpack the two input files and place them on a node's /dev/shm, then rerun the experiment with 4, 8 and 12 cores and measure the speedup
GROMACS
Characterization: CPU intensive, little I/O
Description
GROMACS: GROningen MAchine for Chemical Simulations
GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers [*].
Example
This example will be ran in an interactive session, with batch-mode executions being proposed later on as exercises.
-
Gaia
# Connect to Gaia (Linux/OS X): (yourmachine)$> ssh access-gaia.uni.lu # Request 1 full node in an interactive job: (access-gaia)$> oarsub -I -l nodes=1,walltime=00:30:00
-
Iris
# Connect to Iris (Linux/OS X): (yourmachine)$> ssh access-iris.uni.lu # Request half a node in an interactive job: (access-iris)$> si -t 0-0:30:0 -N 1 -c 1 --ntasks-per-node=14
# Check the GROMACS versions installed on the clusters:
(node)$> module avail 2>&1 | grep -i gromacs
There used to be two versions of GROMACS available on Gaia, a hybrid
and a mt
version
- the hybrid version is OpenMP and MPI-enabled, all binaries have a '_mpi' suffix
- the mt version is only OpenMP-enabled, as such it can take advantage of only one node's cores (however it may be faster on single-node executions than the hybrid version)
Currently only the following version of GROMACS is available:
- bio/GROMACS/2016.3-intel-2017a-hybrid
We will perform our tests with the hybrid version:
# Load the MPI-enabled Gromacs, without CUDA support:
(node)$> module load bio/GROMACS
# Check that it has been loaded, along with its dependencies:
(node)$> module list
# Check the capabilities of the mdrun binary, note its suffix:
(node)$> gmx_mpi -version 2>/dev/null
# Go to the test directory
(node)$> cd ~/bioinfo-tutorial/gromacs
# Set the number of OpenMP threads to 1
(node)$> export OMP_NUM_THREADS=1
-
Gaia
# Perform a position restrained Molecular Dynamics run (node)$> mpirun -np 12 -hostfile $OAR_NODEFILE -envlist OMP_NUM_THREADS,PATH,LD_LIBRARY_PATH gmx_mpi mdrun -v -s pr -e pr -o pr -c after_pr -g prlog > test.out 2>&1
-
Iris
# Perform a position restrained Molecular Dynamics run (node)$> srun -n 12 gmx_mpi mdrun -v -s pr -e pr -o pr -c after_pr -g prlog > test.out 2>&1
We notice here that we are running gmx_mpi
in parallel with mpirun/srun on 12/14 cores, and we explicitly export the OMP_NUM_THREADS
variable to any remote node such that only one thread per MPI process will be created.
Question: What will happen if we do not set the number of OpenMP threads to 1?
GROMACS has many parallelization options and several parameters can be tuned to give you better performance depending on your workflow, see the references in the last section of this tutorial.
The used input corresponds to the Ribonuclease S-peptide example, which has been changed to perform 50k steps in the Molecular Dynamics run with position restraints on the peptide.
Proposed exercises
Several exercises are proposed for GROMACS:
- create a launcher for GROMACS using the commands shown in the previous section
- launch jobs using 1 node: 1, 2, 4, 8, 10 and 12 cores and measure the speedup obtained
- check what happens when executing mdrun with 16 and 24 cores
- launch a job using one full node that has GPU cards and run the GPU-enabled GROMACS to see if a speedup is obtained
Bowtie2/TopHat
Characterization: data intensive, RAM intensive
Description
Bowtie2: Fast and sensitive read alignment
Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes [*].
TopHat : A spliced read mapper for RNA-Seq
TopHat is a program that aligns RNA-Seq reads to a genome in order to identify exon-exon splice junctions. It is built on the ultrafast short read mapping program Bowtie [*].
Example
This example will show you how to use the latest version of TopHat in conjunction with the latest Bowtie2, by using the versions prebuilt for Linux by the developers.
-
Gaia
# Connect to Gaia (Linux/OS X): (yourmachine)$> ssh access-gaia.uni.lu # Request 1 full node in an interactive job: (gaia-frontend)$> oarsub -I -l nodes=1,walltime=00:30:00
-
Iris
# Connect to Iris (Linux/OS X): (yourmachine)$> ssh access-iris.uni.lu # Request half a node in an interactive job: (access-iris)$> si -t 0-0:30:0 -N 1 -c 14 --ntasks-per-node=1
# Create a folder for the new software and go to it
(node)$> mkdir ~/bioinfo-tutorial/newsoft
(node)$> cd ~/bioinfo-tutorial/newsoft
# Download latest Bowtie2 and Tophat, plus the SAM tools dependency:
(node)$> wget https://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.3.4.1/bowtie2-2.3.4.1-linux-x86_64.zip
(node)$> wget http://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
(node)$> wget https://github.com/samtools/samtools/releases/download/1.8/samtools-1.8.tar.bz2
# Unpack the three archives
(node)$> unzip bowtie2-2.3.4.1-linux-x86_64.zip
(node)$> tar xzvf tophat-2.1.1.Linux_x86_64.tar.gz
(node)$> tar xjvf samtools-1.8.tar.bz2
# SAMtools requires compilation:
(node)$> module load tools/bzip2/1.0.6-intel-2017a
(node)$> cd samtools-1.8 && ./configure && make && cd ..
# Create a file containing the paths to the binaries, to be sourced when needed
(node)$> echo "export PATH=$HOME/bioinfo-tutorial/newsoft/bowtie2-2.3.4.1-linux-x86_64:\$PATH" > newsoft
(node)$> echo "export PATH=$HOME/bioinfo-tutorial/newsoft/tophat-2.1.1.Linux_x86_64:\$PATH" >> newsoft
(node)$> echo "export PATH=$HOME/bioinfo-tutorial/newsoft/samtools-1.8:\$PATH" >> newsoft
(node)$> source newsoft
# You can now check that both main applications can be run:
(node)$> bowtie2 --version
(node)$> tophat2 --version
Now we will make a quick TopHat test, using the provided sample files:
# Go to the test directory, unpack the sample dataset and go to it
(node)$> cd ~/bioinfo-tutorial/tophat
(node)$> tar xzvf test_data.tar.gz
(node)$> cd test_data
# Launch TopHat, with Bowtie2 in serial mode
(node)$> tophat -r 20 test_ref reads_1.fq reads_2.fq
# Launch TopHat, with Bowtie2 in parallel mode
(node)$> tophat -p 12 -r 20 test_ref reads_1.fq reads_2.fq
We can see that for this fast execution, increasing the number of threads does not improve the calculation time due to the relatively high overhead of thread creation. Note that TopHat / Bowtie are not MPI applications and as such can take advantage of at most one compute node.
Next, we will make a longer test, where it will be interesting to monitor the TopHat pipeline (with htop
for example) to see the transitions between the serial
and parallel stages (left as an exercise).
# Load the file which will export $TOPHATTEST2 in the environment
(node)$> source ~/bioinfo-tutorial/tophat/test2_path
# Launch TopHat, with Bowtie2 in parallel mode
(node)$> tophat2 -p 12 -g 1 -r 200 --mate-std-dev 30 -o ./ $TOPHATTEST2/chr10.hs $TOPHATTEST2/SRR027888.SRR027890_chr10_1.fastq $TOPHATTEST2/SRR027888.SRR027890_chr10_2.fastq
The input data for the first test corresponds to the TopHat test set, while the second test is an example of aligning reads to the chromosome 10 of the human genome as given here.
Proposed exercises
The following exercises are proposed for TopHat/Bowtie2:
- create a launcher for TopHat using the commands shown in the previous section
- launch jobs with 1, 2, 4, 8 and 10 cores on one node, using the second test files, and measure the speedup obtained
mpiBLAST
Characterization: data intensive, little RAM overhead, native parallelization
Description
mpiBLAST: Open-Source Parallel BLAST
mpiBLAST is a freely available, open-source, parallel implementation of NCBI BLAST. By efficiently utilizing distributed computational resources through database fragmentation, query segmentation, intelligent scheduling, and parallel I/O, mpiBLAST improves NCBI BLAST performance by several orders of magnitude while scaling to hundreds of processors [*].
Example
This example will be ran in an interactive session, with batch-mode executions being proposed later on as exercises.
-
Gaia
# Connect to Gaia (Linux/OS X): (yourmachine)$> ssh access-gaia.uni.lu # Request 1 full node in an interactive job: (access-gaia)$> oarsub -I -l nodes=1,walltime=00:30:00 # Load the bioinfo software set (node)$> module use $RESIF_ROOTINSTALL/bioinfo/modules/all
-
Iris
# Connect to Iris (Linux/OS X): (yourmachine)$> ssh access-iris.uni.lu # Request half a node in an interactive job: (access-iris)$> s -t 0-0:30:0 -N 1 -c 1 --ntasks-per-node=14 # Load the bioinfo software set (node)$> module use /opt/apps/resif/data/stable/bioinfo/modules/all
# Check the mpiBLAST versions installed on the clusters:
(node)$> module avail 2>&1 | grep -i mpiblast
# Load the default mpiBLAST version:
(node)$> module load bio/mpiBLAST
# Check that it has been loaded, along with its dependencies:
(node)$> module list
# The mpiBLAST binaries should now be in your path
(node)$> mpiformatdb --version
(node)$> mpiblast --version
mpiBLAST requires access to NCBI substitution matrices and pre-formatted BLAST databases. For the purposes of this tutorial, a FASTA (NR) database has been formatted and split into 12 fragments, enabling the parallel alignment of a query against the database.
A .ncbirc
file containing the paths to the necessary data files can be downloaded from here
and placed in your $HOME
directory (make sure to backup an existing $HOME/.ncbirc
before overwriting it with the one in this tutorial).
Question: Knowing that the databases can take tens of gigabytes, what is an appropriate storage location for them on the clusters?
We will run a test using mpiBLAST. Note that mpiBLAST requires running with at least 3 processes, 2 dedicated for scheduling tasks and coordinating file output, with the additional processes performing the search.
-
Gaia
# Go to the test directory and execute mpiBLAST with one core for search (node)$> cd ~/bioinfo-tutorial/mpiblast (node)$> mpirun -np 3 mpiblast -p blastp -d nr -i test.fa -o test.out # Note the speedup when using 12 cores (node)$> mpirun -np 12 mpiblast -p blastp -d nr -i test.fa -o test.out
-
Iris
# Go to the test directory and execute mpiBLAST with one core for search (node)$> cd ~/bioinfo-tutorial/mpiblast (node)$> srun -n 3 mpiblast -p blastp -d nr -i test.fa -o test.out # Note the speedup when using 14 cores (node)$> srun -n 14 mpiblast -p blastp -d nr -i test.fa -o test.out
Proposed exercises
The following exercises are proposed for mpiBLAST:
- create a launcher for mpiBLAST, making sure to export the required environment to the remote nodes
- launch jobs with 8, 14 and 24 cores across two nodes and measure the speedup obtained