class: title-slide
# Reproducible pipelines in R ## with `targets` .center[<img src="img/logo_targets.png" width="100px"/>] ## HPC SChool 2021 PS11: R session ### A. Ginolhac ### DLSM University of Luxembourg
--- ## Introduction to R .Large[Not the scope of this session] .pull-left[ .Large[.bold[Beginner] user, check out [this lecture](https://rworkshop.uni.lu/lectures/lecture01_intro.html#10)] ![](img/intro_p9.png) ] .pull-right[ .Large[.bold[Advanced] user, interested in programming, check out [this lecture](https://rworkshop.uni.lu/lectures/lecture12_tidyeval.html)] ![](img/lecture_tidyeval.png) ] --- class: middle, center, inverse # targets ## a Make-like workflow manager for
--- # targets and companion package tarchetypes
.flex.items-center[ .w-10[ ] .w-70.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt3.mr1[ .large[.bbox[A workflow manager for R]] .large[ - Saving you time and stress - Understand how it is implemented in `targets` + Define your `targets` + Connect `targets` to create the **dependencies**
+ Check **dependencies** with `visnetwork` + Embrace **dynamic** branching
+ Run **only** what needs to be executed + Bundle **dependencies** in a Rmarkdown document with [`tar_render()`](https://wlandau.github.io/tarchetypes/reference/tar_render.html) + Increase reproducibility with the package manager [`renv`](https://rstudio.github.io/renv/articles/renv.html) - Example with RNA-seq data from .bold[Wendkouni Nadège MINOUNGOU] ]]] --- # Folder structure
.left-column[ ```r ├── .git/ *├── _targets.R ├── _targets/ ├── Repro.Rproj ├── R *│ ├── functions.R *│ └── utils.R ├── run.R* ├── renv/ ├── renv.lock └── report.Rmd ``` ] .right-column[ .large[ - With [`renv`](https://rstudio.github.io/renv/). Snapshot your package environment (and restore! 😌) - `_targets.R` is the only mandatory file - Use a `R` sub-folder for functions, gets closer to a
package - In a RStudio project - Version tracked with
- `Rmarkdown` file allows to gather results in a report - Optional: an executable `run.sh` allows to use Build Tools in RStudio .Large[Targets Markdown] .large[ Bundle `globals` and `pipeline` inside a **Rmarkdown** document. - Makes development easier - Documentation can be embedded - `targets` engine recognizes by `knitr` and takes care of writing all
scripts ] ] ] --- # `renv` features
.pull-left[ - `hydrate()` parses your code and finds library calls - `install()` from .bold[CRAN] with dependencies (also from
) - `snapshot()` registers changes, hashes and origin - `restore()` to a certain point in time ```r > renv::snapshot() The following package(s) will be updated in the lockfile: # CRAN =============================== - RcppParallel [5.0.2 -> 5.0.3] - cli [2.3.0 -> 2.3.1] - pkgload [1.1.0 -> 1.2.0] - tint [0.1.3 -> *] # GitHub ============================= - targets [ropensci/targets@main: 598d7a23 -> bdc1b29c] Do you want to proceed? [y/N]: ``` ] -- .pull-right[ `renv.lock` file after a `snapshot` ``` "R": { "Version": "4.0.3", "Repositories": [ { "Name": "CRAN", "URL": "https://cloud.r-project.org" } ] }, "Bioconductor": { "Version": "3.12" }, "Packages": { "AnnotationDbi": { "Package": "AnnotationDbi", "Version": "1.52.0", "Source": "Bioconductor", "Hash": "ca5106b296b3aa6af713ce197be547c1" }, "BH": { "Package": "BH", "Version": "1.75.0-0", "Source": "Repository", "Repository": "CRAN", "Hash": "e4c04affc2cac20c8fec18385cd14691" }, "targets": { "Package": "targets", "Version": "0.1.0.9000", "Source": "GitHub", "RemoteType": "github", "RemoteUsername": "ropensci", "RemoteRepo": "targets", "RemoteRef": "main", "RemoteSha": "598d7a23661d4c760209c7991bf10584eadcf7c8", "RemoteHost": "api.github.com", "Hash": "ee66061fd5c757ec600071965d457818" }, [...] ``` ] --- #Example with RNA-seq data across 3 cell lines ### PCA shows that differences between cells .bold.red[`>>`] biological effect (roman numbers)
-- ### Solution: Split counts and metadata for each cell .Large[Do we copy code 3 times?] --- # Define targets = explicit dependencies
.pull-left[ ### `_targets.R`, define 4 targets Last `target` depends on the **3** upstreams ```r library(targets) source("R/functions.R") source("R/plotting.R") list( * tar_target(cells, c("HepG2", "HuH7", "Hep3B")), * tar_qs(dds, read_rds(here::here("data", "all.rds")), packages = "DESeq2"), * tar_fst_tbl(annotation, gtf_to_tbl(here::here("data", "gencode.v36.annotation.gtf")), packages = c("tibble", "rtracklayer")), * tar_qs(sub_dds, subset_dds(dds, filter(annotation, type == "gene"), .cell = cells), pattern = map(cells), # dynamic branching packages = c("DESeq2", "tidyverse")) [...] ) ``` ] -- .pull-right[ .bold[Dynamic branching] makes dependencies easier to read. .large[ > .orange.bold[Of course, someone has to write for loops, it doesn't have to be you] .tr[ — _.bold[Jenny Bryan]_]] ![](img/targets_outdated.png) ] .footnote[Figure from `tar_visnetwork()`] --- class: hide_logo # Running targets
.pull-left[ ``` ● run target annotation ● run target cells ● run target dds ● run branch sub_dds_3078b1e0 condition time_h HepG2_I1 control 0 HepG2_I2 HIL6 2 using pre-existing size factors estimating dispersions gene-wise dispersion estimates: 2 workers mean-dispersion relationship final dispersion estimates, fitting model and testing: 2 workers ● run branch sub_dds_d05c5da7 condition time_h HuH7_I1 control 0 HuH7_I2 HIL6 2 using pre-existing size factors estimating dispersions gene-wise dispersion estimates: 2 workers mean-dispersion relationship final dispersion estimates, fitting model and testing: 2 workers ● run branch sub_dds_c60d7096 condition time_h Hep3B_I1 control 0 Hep3B_I2 HIL6 2 using pre-existing size factors estimating dispersions gene-wise dispersion estimates: 2 workers mean-dispersion relationship final dispersion estimates, fitting model and testing: 2 workers ● end pipeline ``` ] -- .pull-right[ .Large[Options to display time and object sizes] ![](img/targets_dag_done.png) ] --- # Re-running
.pull-left[ ``` ✓ skip target annotation ✓ skip target cells ✓ skip target dds ✓ skip branch sub_dds_3078b1e0 ✓ skip branch sub_dds_d05c5da7 ✓ skip branch sub_dds_c60d7096 ✓ skip pipeline ``` .Large[All good, nothing to be done ✔️. Actually `targets` tracks all objects and so functions A more complete dependency graph shows .bold[functions] ] ] -- .pull-right[ ![](img/targets_dag.png) .Large[Let's add the PCA per cell type now] ] --- class: hide_logo # PCA, add 4 targets
.pull-left[ ### Smaller targets avoid unnecessary re-running steps ```r [...] tar_target(rcounts, vst(sub_dds, blind = TRUE), pattern = map(sub_dds), packages = c("DESeq2")), tar_target(pca_df, pca_meta(rcounts), pattern = map(rcounts), packages = c("DESeq2", "tidyr", "dplyr")), tar_target(pca_cell, tibble(cell = cells, pca = list(plot_pca_meta(pca_df))), pattern = map(cells, pca_df), packages = c("ggplot2", "tibble")) [...] ``` .large[ .bold[Translate into]: - For every cell data, compute regularized counts (`vst`: variance stabilization) - For every regularized counts, compute PCA (`df`: data.frame, _i. e_ a table) - For every cell names / PCA tables, plot PCA in a table for easier labeling ] ] -- .pull-right[ ![](img/targets_dag_undone.png) ] --- # PCA results .flex[ .w-33.b--green.ba.bw2.br3.shadow-5.ph4.mt2.mr2[ ### Running <video width="310" height="340"> <source src="https://biostat2.uni.lu/practicals/data/targets_pca.mp4" type="video/mp4"> </video> ] .w-33.b--green.ba.bw2.br3.shadow-5.ph4.mt2.mr2[ ### Awesome feature: load results IN a Rmarkdown document .green.bold[Separate] `code` from content ![](img/targets_tar_read.png) ] .w-33.b--green.ba.bw2.br3.shadow-5.ph4.mt2[ ### How to display a plot ![](img/targets_pca_hep3b.png) ]] --- # The full picture ### Adding step by step ### desired analyses #### Whole analysis takes 24 minutes and 4.54 seconds .large[ > .orange.bold[Of course, someone has to</br>remember the dependencies, </br>it doesn't have to be you] .tl[ — _could be William Landau via .bold[Jenny Bryan]_]]
--- # Is it worth the effort? -- .left-column[ .huge.bold.green[Yes] .Large[ #### For you - Autonomy - Skills - _Free_ time - Confidence over results - Reproducibility - Fun 🥳 ] ] -- .right-column[ ### Better project design .Large[Thinking at what is a [good `targets`](https://books.ropensci.org/targets/targets.html#what-a-target-should-do) helps tremendously the coding] >1. Are large enough to subtract a decent amount of runtime when skipped. 2. Are small enough that some targets can be skipped even if others need to run. 3. Invoke no side effects (tar_target(format = "file") can save files.) 4.Return a single value that is + Easy to understand and introspect. + Meaningful to the project [...] .tr[.bold[William Landau]] ### Reproducibility .bold[Both] thanks to .bold[`targets`] and `renv` via `git` ```r > renv::history() commit author_date committer_date subject 1e8dd2278 2021-02-23 15:29:57 2021-02-23 15:29:57 reformat creating config files 24c1222db 2021-02-15 17:07:01 2021-02-15 17:07:01 highlight gene type in the DEG patterns 326c8a726 2021-02-04 16:16:38 2021-02-04 16:16:38 cluster LRT genes by they dynamic patterns 4c6791796 2021-01-26 13:08:15 2021-01-26 13:08:15 gene types in upset plots for lengths 5865ee70b 2021-01-21 16:36:48 2021-01-21 16:37:08 add upset plots [...] ``` ] --- # Scalability and parallelization
.pull-left[ .large[ - Scale-up with .bold[dynamic] branching - Parallelization on .bold[HPC] using: + `tar_make_clustermq(workers = 3L)` ([`clustermq`](https://mschubert.github.io/clustermq/articles/userguide.html#configuration) by **Michael Schubert**) + `tar_make_future(workers = 3L)` ([`future`](https://future.futureverse.org/) by **Henrik Bengtsson**) - [.bold[Cloud]](https://books.ropensci.org/targets/cloud.html) computing in .bold[AWS] is also available - [.bold[Static] branching](https://books.ropensci.org/targets/static.html#when-to-use-static-branching) to get explicit branch names. ] ] .pull-right[ </br></br> ![](https://wlandau.github.io/bayes-lund-2021/images/pipeline_graph.png) .footnote[Source: .bold[William Landau]: talk at [Bayes Lund](https://wlandau.github.io/bayes-lund-2021/#8)] ] --- # Reports as Rmarkdown documents
.left-column[ `targets`, written by [William Landau](https://wlandau.github.io/) (pictured), is flexible, robust and still allows for a customized report. All computing is done only when needed, and code is away from writing content. Pipelines can now also be a **Rmd**! Once `knitted` the report can be sent to the inquirer. ] ### Targets Markdown New in `targets` > **0.6**. Instructions at [William bookdown](https://books.ropensci.org/targets/markdown.html) Test it as the Rmd template (and excellent [video](https://www.youtube.com/watch?v=FODSavXGjYg) from R Lille meetup by **Landau**): .center[![](https://books.ropensci.org/targets/man/figures/target_markdown.png)] --- # Bonus: watch the pipeline running live 🍿
.left-column[ - `targets` events watched live 🎞 - Here, after changing a threshold in the LRT step - `branches` can be monitored too - 2 videos joined as I fixed an .red.bold[error] at 1'42" - Option to display functions (unset here) ] .right-column[ ### `tar_watch()` shiny app from `targets` <video> <source src="https://biostat2.uni.lu/practicals/data/tar_watch.mp4" type="video/mp4"> </video> ] --- # Before we stop .flex[ .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt1.mr1[ .large[.gbox[Highlights] * `targets`: a Makefile-like approach for project design + dependencies manager + re-run only what's needed ]] .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt1.ml1[ .large[.bbox[Acknowledgments 🙏 👏] * **Eric Koncina** early adopter of `targets` * **Wendkouni N. Minoungou** for the RNA-seq data * [**William Landau**](https://github.com/wlandau) main developer of `targets` * [Xie Yihui](https://github.com/yihui) and [Garrick Aden-Buie](https://github.com/gadenbuie) for [`xarigan`](https://github.com/yihui/xaringan)/[`xaringanExtra`](https://github.com/gadenbuie/xaringanExtra) * [Jennifer Bryan](https://github.com/jennybc) ] ]] .flex[ .w-50.bg-washed-green.b--green.ba.bw2.br3.shadow-5.ph3.mt1.mr1[ .large[.ybox[Further reading
]] - [Main website](https://docs.ropensci.org/targets/) - [Targetopia](https://wlandau.github.io/targetopia/) **Landau** universe of targets-derived - [Video](https://www.youtube.com/watch?v=FODSavXGjYg) from R Lille meetup by **William Landau**. June 2021 45'' - [Video](https://www.youtube.com/watch?v=odcBA4ETLn8) from Bayes Lund by **William Landau**. October 2021 - [Documentation](https://books.ropensci.org/targets/) as bookdown by **Landau** ] .w-50.pv2.ph3.mt2.ml1[ .huge[.bbox[Thank you for your attention!]] ]]