class: center, middle, inverse, title-slide

# Research pipelines
### George G Vega Yon
### USC IMAGE project
### 2019/10/24

---

# Today's talk

## Reproducible Research

## Tools to ease your work

We will spend most of the time looking at other people's work

---
background-image: url("http://giphygifs.s3.amazonaws.com/media/88VqLDyQttwpa/giphy.gif")

---

## Reproducible research (RR)

(or how to avoid "it works on my computer!")

--

**A major new issue across the sciences**

* Accessible Reproducible Research ([Mesirov, __Science__ 2010](http://science.sciencemag.org/content/327/5964/415))
* Again, and Again, and Again, ... ([Jasny et al., __Science__ 2011](http://science.sciencemag.org/content/334/6060/1225))
* Challenges in Irreproducible Research ([__Nature__ topic](http://www.nature.com/news/reproducibility-1.17552))
* Reproducibility of computational workflows is automated using continuous analysis ([Beaulieu-Jones & Greene, __Nature Biotechnology__ 2017](https://www.nature.com/articles/nbt.3780?foxtrotcallback=true))

--

**A major productivity problem for researchers**

* It is not only a good idea for science, but also for saving you time!

---

# Reproducible research

--

**Polished paper**

- Ready to be submitted
- Written using your fav doc editor

--

**Intermediate report**

- Not ready to be published
- Not necessarily using your fav doc editor

What do these two have in common?

---
background-image: url("http://www.phdcomics.com/comics/archive/phd012618s.gif")
class: center, top, inverse

... both should have pretty figures AND be reproducible

---

# The minimum

**Data** People must have a way to get your data

- Include it with the paper.
- Put it in an online repository like [Zenodo](https://zenodo.org) (see, for example, the [Gene Ontology's profile](https://zenodo.org/communities/gene-ontology/?page=1&size=20)).
- Include instructions about how to get the data (e.g. in an experimental setting, how did you get the samples).
--

**Analysis** Source code/steps of your analysis

- Include it with the paper
- Put it in an online repository like [GitHub](https://github.com) or [GitLab](https://gitlab.com)

--

**Pro tip:** *Avoid the "contact the corresponding author for..." lines.*

---

# Tiers of reproducibility

### ~~minimum~~ Basic

* **Data** Use a public (or at least shareable) dataset, or publish your own.
* **Analysis** Source code/steps of your analysis.

### Plus

* **Tools** Use open source software (like R, Python, etc.).
* **Decency** Write your code neatly (like Emil does)
* **Tidy** Organize your work in a structured way (folders + README files; [here](https://github.com/gvegayon/fctc) is one example)

### Premium

* **Plug-n-play** Use a container (like [Docker](https://www.docker.com/) or [Singularity](https://sylabs.io/))

---
class: center, middle

# Research Pipeline

![](fig/pipeline.svg)

Source: Diagram by [문건웅](https://www.linkedin.com/in/%EA%B1%B4%EC%9B%85-%EB%AC%B8-5ab72599/?locale=en_US) shown [here](https://www.slideshare.net/ssuser7e30b2/reproducible-research1)

---
class: center, bottom, inverse
background-image: url("https://media.giphy.com/media/wLtPSVuz0co5G/giphy.gif")

Now, a lot of it has to do with automation...
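---

## A scripted pipeline (sketch)

A minimal sketch of what the "Tidy" tier plus automation can look like, using base R only. The folder names (`data`, `output`), the file names, and the helper `run_pipeline()` are hypothetical choices made for this illustration; they are not taken from any particular project.

```r
# Hypothetical helper: runs every step of a toy analysis from one entry point
run_pipeline <- function(root = tempfile("project-")) {

  # 1. Tidy structure: one folder per kind of artifact
  dirs <- file.path(root, c("data", "output"))
  for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

  # 2. Data: save the exact dataset used (here, a built-in one)
  data_file <- file.path(root, "data", "cars.csv")
  write.csv(datasets::cars, data_file, row.names = FALSE)

  # 3. Analysis: read the data back and fit the model
  dat <- read.csv(data_file)
  fit <- lm(dist ~ speed, data = dat)

  # 4. Outputs: write the coefficient table and the session information
  coef_file <- file.path(root, "output", "coefficients.csv")
  write.csv(coef(summary(fit)), coef_file)
  writeLines(
    capture.output(sessionInfo()),
    file.path(root, "output", "session-info.txt")
  )

  invisible(list(root = root, fit = fit, coef_file = coef_file))
}

res <- run_pipeline()
list.files(res$root, recursive = TRUE)
```

Calling `run_pipeline()` again regenerates every output from the saved data, so no table has to be copied by hand, and the session information records which R and package versions produced the results.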
---
class: middle

## Automating your research: automatic reports

- For quick reports, use tools like [**rmarkdown**](https://cran.r-project.org/package=rmarkdown), which is based on [**pandoc**](https://pandoc.org)
- Automate tabular outputs using [**kableExtra**](https://cran.r-project.org/package=kableExtra) or [**pander**](https://cran.r-project.org/package=pander)
- Automate tables: in R, you can use [**texreg**](https://cran.r-project.org/package=texreg) or [**xtable**](https://cran.r-project.org/package=xtable)

---

## Automating your research: automatic reports (example)

.pull-left[
```r
# Example data shipped with the MASS package
data(anorexia, package = "MASS")

# Model 1: post-treatment weight by treatment
anorex.1 <- glm(
  Postwt ~ Treat,
  data = anorexia
)

# Model 2: adds pre-treatment weight
anorex.2 <- glm(
  Postwt ~ Prewt + Treat,
  data = anorexia
)

# Render both models side by side as an HTML table
library(texreg)
htmlreg(
  list(anorex.1, anorex.2),
  doctype = FALSE
)
```
]

<div style="font-size: small;" class="pull-right">
<table cellspacing="0" align="center" style="border: none;">
<caption align="bottom" style="margin-top:0.3em;">Statistical models</caption>
<tr>
<th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b></b></th>
<th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Model 1</b></th>
<th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Model 2</b></th>
</tr>
<tr>
<td style="padding-right: 12px; border: none;">(Intercept)</td>
<td style="padding-right: 12px; border: none;">85.70<sup style="vertical-align: 0px;">***</sup></td>
<td style="padding-right: 12px; border: none;">49.77<sup style="vertical-align: 0px;">***</sup></td>
</tr>
<tr>
<td style="padding-right: 12px; border: none;"></td>
<td style="padding-right: 12px; border: none;">(1.35)</td>
<td style="padding-right: 12px; border: none;">(13.39)</td>
</tr>
<tr>
<td style="padding-right: 12px; border: none;">TreatCont</td>
<td style="padding-right: 12px; border: none;">-4.59<sup
style="vertical-align: 0px;">*</sup></td> <td style="padding-right: 12px; border: none;">-4.10<sup style="vertical-align: 0px;">*</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(1.97)</td> <td style="padding-right: 12px; border: none;">(1.89)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">TreatFT</td> <td style="padding-right: 12px; border: none;">4.80<sup style="vertical-align: 0px;">*</sup></td> <td style="padding-right: 12px; border: none;">4.56<sup style="vertical-align: 0px;">*</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(2.23)</td> <td style="padding-right: 12px; border: none;">(2.13)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Prewt</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">0.43<sup style="vertical-align: 0px;">**</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">(0.16)</td> </tr> <tr> <td style="border-top: 1px solid black;">AIC</td> <td style="border-top: 1px solid black;">495.28</td> <td style="border-top: 1px solid black;">489.97</td> </tr> <tr> <td style="padding-right: 12px; border: none;">BIC</td> <td style="padding-right: 12px; border: none;">504.39</td> <td style="padding-right: 12px; border: none;">501.36</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Log Likelihood</td> <td style="padding-right: 12px; border: none;">-243.64</td> <td style="padding-right: 12px; border: none;">-239.99</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Deviance</td> <td style="padding-right: 12px; border: none;">3665.06</td> <td style="padding-right: 12px; border: none;">3311.26</td> </tr> <tr> <td style="border-bottom: 2px solid black;">Num. 
obs.</td>
<td style="border-bottom: 2px solid black;">72</td>
<td style="border-bottom: 2px solid black;">72</td>
</tr>
<tr>
<td style="padding-right: 12px; border: none;" colspan="4"><span style="font-size:0.8em"><sup style="vertical-align: 0px;">***</sup>p < 0.001; <sup style="vertical-align: 0px;">**</sup>p < 0.01; <sup style="vertical-align: 0px;">*</sup>p < 0.05</span></td>
</tr>
</table>
</div>

--

Check out the CRAN Task View for [Reproducible Research](https://cran.r-project.org/web/views/ReproducibleResearch.html)

---
class: center, bottom, inverse
background-image: url("http://www.phdcomics.com/comics/archive/phd101212s.gif")
background-size: 40%

<!-- ![](http://www.phdcomics.com/comics/archive/phd101212s.gif){style="width:50%" } -->

Version control to the rescue...

---
class: center, middle

# Git: The stupid version control system

<img src="fig/git.svg" width="600px" />

(Live demo now)

---
class: center, middle, inverse

# Thank you!

## Some other resources

rOpenSci's ["Reproducibility in Science"](https://ropensci.github.io/reproducibility-guide/)

Karl Broman's ["Tools for Reproducible Research Spring, 2016"](http://kbroman.org/Tools4RR/)

The [**workflowr**](https://github.com/jdblischak/workflowr) package

Colin Fay's ["An introduction to Docker for R Users"](https://colinfay.me/docker-r-reproducibility/)

R packages that work with Docker/Singularity: [**babelwhale**](https://cran.r-project.org/package=babelwhale) and [**dockerfiler**](https://cran.r-project.org/package=dockerfiler)