Introduction
A central tenet of the scientific method is that results should be independently verifiable (and, ideally, extendable) by other researchers. As computational methods play an increasing role in many disciplines, key scientific results are often produced by computer code. Verifying and extending such results requires that the code be “reproducible”; that is, it can be accessed and run, with outputs that can be corroborated against published results 1–9. Unfortunately, this ideal is not usually achieved in practice; most scientific articles do not come with code that can reproduce their results 10–13.
There are many barriers to sharing reproducible code and corresponding computational results 14. One barrier is simply that keeping code and results sufficiently organized and documented is difficult: it is burdensome even for experienced programmers who are well-trained in relevant computational tools such as version control (discussed later), and even harder for the many domain scientists who write code with little formal training in computing and informatics 15. Further, modern interactive computer environments (e.g., R, Python), while greatly enhancing code development 16, also make it easier to create results that are irreproducible. For example, it is all too easy to run interactive code without recording or controlling the seed of a pseudo-random number generator, or to generate results in a “contaminated” environment that contains objects whose values are critical but unrecorded. Both these issues can lead to results that are difficult or impossible to reproduce. Finally, even when analysts produce code that is reproducible in principle, sharing it in a way that makes it easy for others to retrieve and use (e.g., via GitHub or Bitbucket) involves technologies that many scientists are not familiar with 13, 17.
In light of this, there is a pressing need for easy-to-use tools to help analysts maintain reproducible code, document progress, and disseminate code and results to collaborators and to the scientific community. We have developed an open source R 18 package, workflowr, to address this need. The workflowr package aims to instill a particular “workflow” — a sequence of steps to be repeated and integrated into research practice — that helps make projects more reproducible and accessible. To achieve this, workflowr integrates four key features that facilitate reproducible code development: (1) version control 19, 20; (2) literate programming 21; (3) automatic checks and safeguards that improve code reproducibility; and (4) sharing code and results via a browsable website. These features exploit powerful existing tools, whose mastery would take considerable study. However, the workflowr interface is designed to be simple so that learning it does not become another barrier in itself and novice users can quickly enjoy its many benefits. By simply following the workflowr “workflow”, R users can create projects whose results and figures are easily accessible on a static website — thereby conveniently shareable with collaborators by sending them a URL — and accompanied by source code and reproducibility safeguards. The Web-based interface, updated with version control, also makes it easy to navigate through different parts of the project and browse the project history, including previous versions of figures and results, and the code used to produce them. By using workflowr, all this can be achieved with minimal experience in version control systems and Web technologies.
The workflowr package builds on several software technologies and R packages, without which this work would have been impossible. Workflowr builds on the invaluable R Markdown literate programming system implemented in knitr 22, 23 and rmarkdown 21, 24, which in turn build on pandoc, the “Markdown” markup language, and various Web technologies such as Cascading Style Sheets and Bootstrap 25. Several popular R packages extend knitr and rmarkdown for specific aims such as writing blogs (blogdown 26), monographs (bookdown 27), and software documentation (pkgdown 28). Analogously, workflowr extends rmarkdown with additional features such as the reproducibility safeguards, and adds integration with the version control system Git 19, 20. Git was designed to support large-scale, distributed software development, but in workflowr it serves a different purpose: to record, and provide access to, the development history of a project. Workflowr also uses another feature of Git, “remotes”, to enable collaborative project development across multiple locations, and to help users create browsable projects via integration with popular online services such as GitHub Pages and GitLab Pages. These features are implemented using the R package git2r 29, which provides an interface to the libgit2 C library. Finally, beyond extending the R programming language, workflowr is also integrated with the popular RStudio interactive development environment 30.
In addition to the tools upon which workflowr directly builds, there are many other related tools that directly or indirectly advance open and reproducible data analysis. A comprehensive review of such tools is beyond the scope of this article, but we note that many of these tools are complementary to workflowr in that they tackle aspects of reproducibility that workflowr currently leaves to the user, such as management and deployment of computational environments and dependencies (e.g., conda, Homebrew, Singularity, Docker, Kubernetes, packrat 31, checkpoint 32, switchr 33, RSuite 34); development and management of computational pipelines (e.g., GNU Make, Snakemake 35, drake 36); management and archiving of data objects (e.g., archivist 37, Dryad 38, Zenodo); and distribution of open source software (e.g., CRAN, Bioconductor 39, Bioconda 40). Most of these tools or services could be used in combination with workflowr. There are additional, ambitious efforts to develop cloud-based services that come with many computational reproducibility features (e.g., Code Ocean, Binder, Gigantum, The Whole Tale). Many of these platforms manage individual projects as Git repositories, so workflowr could, in principle, be installed and used on these platforms, possibly to enhance their existing features. Other R packages with utilities to facilitate reproducibility that could complement workflowr include ProjectTemplate 41, rrtools 42, and usethis 43, as well as many of the R packages listed in the “Reproducible Research” CRAN Task View.
Of the available software tools facilitating reproducible research, perhaps the closest in scope to workflowr are the R package adapr 44 and the Python-based toolkit Sumatra 45. Like workflowr, both adapr and Sumatra use version control to maintain a project development history. Unlike workflowr, both place considerable emphasis on managing and documenting dependencies (software and data), whereas workflowr only records this information. In contrast, workflowr places more emphasis on literate programming — the publishing of text and code in a readable form — and more closely integrates other features such as tracking project development history via Git with literate programming.
The workflowr R package is available from CRAN and GitHub, and is distributed under the flexible open source MIT license (see Software availability). The R package and its dependencies are straightforward to install while being highly customizable for more dedicated users. Extensive documentation, tutorials, and user support can be found at the GitHub site. In the remainder of this article, we describe the workflowr interface, explain its design, and give examples illustrating how workflowr is used in practice.
Operation
In this section, we give an overview of workflowr’s main features from a user’s perspective. For step-by-step instructions on starting a workflowr project, see the “Getting started with workflowr” vignette.
For basic usage, only five functions are needed (summarized here, and described in more detail later):
Figure 1.
The workflowr package helps organize project files and results.
A) The function
The primary output of workflowr is a project website for browsing the results generated by the Rmd analysis files (Figure 1B). The use of websites to organize information is, of course, now widespread. Nonetheless, we believe they are under-utilized for organizing the results of scientific projects. In particular, hypertext provides an ideal way to connect different analyses that have been performed, and to provide easy access to relevant external data (e.g., related work or helpful background information); see Figure 1B and Use cases below.
Organizing the project:
The function
In addition to creating a default file structure for a data analysis project,
In some cases, a user will have an existing project (with files that may or may not be tracked by Git), and would like to incorporate workflowr into the project
Finally,
Generating results reproducibly:
In a workflowr project, analyses are performed using the R Markdown literate programming system 21. The user develops their R code inside Rmd files in the
1. It creates a clean R session for executing the code. This is critical for reproducibility—results should not depend on the current state of the user’s R environment, and all objects necessary to run the code should be defined in the code or loaded by packages.
2. It automatically sets the working directory in a consistent manner (the exact setting is controlled by a configuration file; see Implementation below). This prevents one of the most common failures to reproduce in R—not setting the working directory before running the R script, resulting in incorrectly resolved relative file paths.
3. It sets a seed for the pseudorandom number generator before executing the code. This ensures that analyses that use random numbers always return the same result.
4. It records information about the computing environment, including the operating system, the version of R used, and the packages that were used to produce the results.
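The effect of these safeguards can be approximated in base R. The sketch below is an illustration only, not the workflowr implementation: `run_with_safeguards()` and the seed value are hypothetical, and a fresh environment inside the current process stands in for the separate clean R session described above.

```r
# Hypothetical helper, not part of workflowr. It mimics three of the
# safeguards inside a single R process.
run_with_safeguards <- function(code, seed = 1L) {
  # Evaluate in a fresh environment whose parent is the base namespace, so
  # objects in the user's global workspace are not visible to the code.
  env <- new.env(parent = baseenv())

  # Fix the pseudorandom number generator seed before running the code.
  set.seed(seed)
  value <- eval(code, envir = env)

  # Record the computing environment alongside the result.
  list(value = value, seed = seed, session = utils::sessionInfo())
}

analysis <- quote(mean(stats::rnorm(100)))
first  <- run_with_safeguards(analysis, seed = 20190916L)
second <- run_with_safeguards(analysis, seed = 20190916L)

# The two runs agree because the seed and the visible objects are controlled.
identical(first$value, second$value)
```

A real clean session gives stronger isolation than this, since it also resets loaded packages and options; the sketch only shows why controlling the seed and the workspace makes repeated runs agree.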
Finally,
Figure 2.
The workflowr reproducibility report summarizes the reproducibility checks inside the results webpage.
(A) A button is added to the top of each webpage. Clicking on the button (1) reveals the full reproducibility report with multiple tabs. If any of the reproducibility checks have failed, a red warning symbol (!) is shown. Clicking on the "Checks" tab (2) summarizes the reproducibility checks, with icons next to each check indicating a pass or failure. Clicking on an individual item (3) reveals a more detailed description of the reproducibility check, with an explanation of why it passed or failed. In (A), the Rmd file contains changes that have not yet been committed, so one of the reproducibility checks has failed (uncommitted changes are acceptable during active development, but not acceptable when results are published). In this case, the recommendation is given to run
Keeping track of the project’s development:
As a project progresses, many versions of the results will be generated as results are scrutinized, analyses are revised, errors are corrected, and new data are considered. Keeping track of a project’s evolution is important for documenting progress and retracing the development of the analyses. This is sometimes done without version control tools by copying code and results whenever an important change is made. This typically results in a large collection of files with names such as
The version control system, Git, provides a more systematic and reliable way to keep track of a project’s development history. However, Git was designed to manage source code for large-scale software projects, and using it for scientific analyses brings some specific challenges. The relative complexity of Git provides a high barrier to entry, discouraging many researchers from adopting it for their projects. And Git is not ideally suited to data analysis projects where one wants to coordinate the tracking of source code, data, and the results generated by the code and data. Using Git commands to identify the version of the code that was used to generate a result can be non-trivial.
The
Figure 3.
The function
The function performs a three-step procedure to store the code and results in a project development history, and ensure that the results HTML file is always created from a unique and identifiable versioned Rmd analysis file. (1) The first step commits the changes to the Rmd analysis file. (2) The second step builds the results HTML file from the Rmd file. These two steps ensure that the results were generated from the committed version of the Rmd file. Furthermore, the unique version of the Git repository is inserted directly into the HTML file so that the source code used to generate the results is easily identified and accessed. If the code generates an error, the entire process is aborted and the previous commit made in the first step is undone. (3) The results HTML file, as well as any related figure files, are committed to the Git repository. Thus, the versioning of Rmd analysis files and corresponding HTML results files are coordinated whenever
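The three-step procedure in this caption can be modelled with a toy example. This is a sketch under stated assumptions rather than the package's code: the "repository" is an in-memory list instead of a Git repository, and `publish_analysis()`, `render_ok()` and `render_fail()` are hypothetical names introduced here.

```r
# Toy model of the three-step procedure. The "repository" is an in-memory
# list of commits, not a real Git repository.
publish_analysis <- function(repo, rmd_source, build_fun) {
  # Step 1: commit the analysis source. Its position in the history stands
  # in for the unique Git version identifier.
  committed <- c(repo, list(list(kind = "rmd", content = rmd_source)))
  version <- length(committed)

  # Step 2: build the results from the committed source, embedding the
  # version. If the build fails, return the repository as it was before
  # step 1, which undoes the source commit.
  html <- tryCatch(build_fun(rmd_source, version), error = function(e) NULL)
  if (is.null(html)) {
    return(repo)
  }

  # Step 3: commit the results next to the source they were built from.
  c(committed, list(list(kind = "html", content = html)))
}

render_ok <- function(src, version) {
  paste0("<p>", src, "</p><!-- built from version ", version, " -->")
}
render_fail <- function(src, version) stop("error in analysis code")

repo <- list()
repo <- publish_analysis(repo, "first analysis", render_ok)
length(repo)  # source commit followed by results commit

repo <- publish_analysis(repo, "broken analysis", render_fail)
length(repo)  # unchanged: the failed build left no trace in the history
```

Returning the untouched history on failure is what keeps every results entry paired with the exact source version that produced it.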
Even experienced Git users will benefit from using
1. Every commit to an (Rmd) analysis file is associated with a commit to the results file generated by that analysis file.
2. An analysis file is only published and committed if it runs successfully; on failure,
Publishing an analysis is not necessarily final — after calling
Checking in on the project’s development:
As a workflowr project grows, it is important to be able to get an overview of the project’s status and identify files that may need attention. This functionality is provided by the
Figure 4.
The workflowr package is an R Markdown-aware version control system.
The function
Sharing code and results:
The version-controlled website created by workflowr is self-contained, so it can be hosted by most Web servers with little effort. Once the website is available online, the code and results can be shared with collaborators and colleagues by providing them with the website’s URL. Similarly, the workflowr repository can also serve as a companion resource for a manuscript by referencing the website URL in the paper.
Since a workflowr project is also a Git repository, the most convenient way to make the website available online is to use a Git hosting service. The workflowr package includes functions
The results files in a workflowr website include links to past versions of analysis and figures, making it easy for collaborators to benefit from the versioning of analyses without knowing anything about Git. For example, if a collaborator wants to download a previous version of a figure generated several months ago, this can be done by navigating the links on the workflowr website.
Installation
The workflowr package is available on CRAN. It works with R versions 3.2.5 or later, and can be installed on any major platform that is supported by R (Linux, macOS, Windows). It is regularly tested on all major operating systems via several continuous integration services (AppVeyor, CircleCI, Travis CI). It is also regularly tested by CRAN using machines running Debian GNU/Linux, Fedora, macOS, Solaris, and Windows.
Because workflowr uses the rmarkdown package to build the HTML pages, it requires the document conversion software pandoc to be installed. The easiest way for R users to install pandoc is to install RStudio.
Installing Git is not required because the R package dependency git2r includes libgit2, a minimal Git implementation (nonetheless, installing Git may be useful for occasional management of the Git repository outside regular workflowr usage).
Customization
Workflowr projects are highly customizable. For example, the look of the webpages can be customized, via options provided by the rmarkdown package, by editing the
Implementation
Here we give an overview of the workflowr package implementation. All workflowr commands can be invoked from R (or RStudio) so long as the working directory in R is set to the directory containing a workflowr project, or any subdirectory of a workflowr project (this is similar to how Git commands are invoked). To determine the root directory of a workflowr project from a subdirectory, whenever a command is called from the R console, workflowr uses the rprojroot 46 R package to search for the RStudio project file stored at the root of the project (the RStudio project file is a required file, so if this file is deleted, the workflowr commands will not work).
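The upward search for the project file can be sketched in base R. The helper below is hypothetical and only imitates the idea; it does not use the rprojroot package, and the directory and file names in the demonstration are invented for illustration.

```r
# Hypothetical helper: walk up from a starting directory until a directory
# containing an RStudio project file (*.Rproj) is found.
find_project_root <- function(start = getwd()) {
  dir <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    if (length(list.files(dir, pattern = "\\.Rproj$")) > 0) {
      return(dir)
    }
    parent <- dirname(dir)
    # At the filesystem root, dirname() returns its input unchanged.
    if (identical(parent, dir)) {
      stop("No RStudio project file found in any parent directory")
    }
    dir <- parent
  }
}

# Demonstration in a temporary directory (all names are made up).
project <- file.path(tempdir(), "demo-project")
dir.create(file.path(project, "analysis"), recursive = TRUE,
           showWarnings = FALSE)
file.create(file.path(project, "demo-project.Rproj"))

root <- find_project_root(file.path(project, "analysis"))
basename(root)
```

The search fails with an error when no project file exists, which mirrors the point above that the commands stop working if that file is deleted.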
Organizing the project:
The function
Generating results reproducibly:
The
The
In the rmarkdown package, the rendering of individual webpages from Rmd files is controlled by a separate function,
Most of the workflowr content is added as a preprocessing step prior to executing the R code in the Rmd file. To achieve this,
The process for embedding links to past versions of files (that is, files added to previous commits in a Git repository) requires some additional explanation. Links to past versions are included only if the user has set up a remote repository hosted by either GitHub or GitLab. Clicking on a link to a past version of an Rmd file (or figure file) in a Web browser will load a webpage displaying the R Markdown source code (or figure file) as it is saved in the given commit. For past versions of the webpages, we use an independent service, raw.githack.com, which displays the HTML file in the browser like any other webpage (this is because GitHub and GitLab only show the raw HTML code). These links will point to valid webpages only after the remote repository (on GitHub or GitLab) is updated, e.g., using
The
To execute the code,
By default, the rmarkdown package renders an Rmd file in the directory where the Rmd file is stored; that is, the R working directory is automatically changed to the directory containing the target Rmd file. By default,
Keeping track of the project’s development:
One of the steps in
Checking in on the project’s development:
The
Using git2r, it is mostly straightforward to determine the status of each file. The only complicated step is determining whether published Rmd files have been modified. If all changes to an Rmd file have been committed to the Git history, an Rmd file is considered “modified” if it has modifying commits that are more recent than commits modifying the corresponding HTML file.
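The comparison described here reduces to a check on commit times. The following is a minimal sketch, assuming the commit times for each file are already available as date-times; `is_modified()` and the example dates are hypothetical, and no Git repository or git2r call is involved.

```r
# Hypothetical helper: commit times are supplied directly instead of being
# read from a Git repository.
is_modified <- function(rmd_commit_times, html_commit_times) {
  # Applies only when all changes to the Rmd file have been committed:
  # the file is "modified" if its latest commit is newer than the latest
  # commit to the corresponding HTML file.
  max(rmd_commit_times) > max(html_commit_times)
}

rmd_commits  <- as.POSIXct(c("2019-01-10", "2019-03-02"), tz = "UTC")
html_commits <- as.POSIXct(c("2019-01-10", "2019-02-15"), tz = "UTC")

# The Rmd file was committed again after the last HTML commit.
is_modified(rmd_commits, html_commits)

# Considering only the first Rmd commit, the HTML file is the newer one.
is_modified(rmd_commits[1], html_commits)
```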
Sharing the code and results:
To use
Use cases
Workflowr was officially released on CRAN in April 2018. As of September 2019, it has been downloaded from CRAN over 7,000 times, and it has been adopted by many researchers. The most common use cases are 1) documenting research development and including the project website in the accompanying academic paper, and 2) developing reproducible course materials to share with students. Here we highlight some successful examples.
Repositories for research projects
Human dermal fibroblast clonality project
A workflowr project accompanying a scientific paper on computational methods for decoding the clonal substructures of somatic tissues from DNA sequencing data 50. The webpages describe how to reproduce the data processing and analysis, along with the outputs and plots.
Characterizing and inferring quantitative cell cycle phase in single-cell RNA-seq data analysis
A workflowr project supporting a paper on measuring cell cycle phase and gene expression levels in human induced pluripotent stem cells 51. The repository contains the processed data and the code implementing the analyses. The full results can be browsed on the website.
Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions
A workflowr project containing the code and data used to produce the results from the GTEx data set that were presented in Urbut et al. 52.
Investigations on "truncated adaptive shrinkage"
A workflowr project created by a Ph.D. student to keep track of his investigations into controlling false discoveries in the presence of correlation and heteroskedastic noise. This repository illustrates the use of workflowr as a scientific notebook: the webpages contain written notes, mathematical equations, source code, and the outputs generated from running the code.
Repositories for courses
Stanford STATS 110
A workflowr website for a statistics course taught at Stanford. The website includes working R examples, homework, the course syllabus, and other course materials.
Single-cell RNA-seq workshop
A workflowr website for a workshop on analysis of single-cell RNA-seq data offered by the Harvard Faculty of Arts and Sciences Informatics group as part of a two-week long bioinformatics course. The R examples demonstrate how to use several bioinformatics packages such as Seurat and msigdbr to prepare and analyze single-cell RNA-seq data sets.
Introduction to GIS in R
A workflowr website for a workshop given at the 2018 Evolutionary Biology Conference. The website includes working R demonstrations, setup instructions, and exercises.
Summary
Our main aim in developing workflowr is to lower barriers to open and reproducible code. Workflowr provides a core set of commands that can be easily integrated into research practice, and combined with other tools, to make projects more accessible and reproducible. The R package is straightforward to install, easy to learn, and highly customizable.
Since the first official release of workflowr (version 1.0.1, released in April 2018), the core functionality has remained intact, and we expect it to remain that way. The core features of workflowr have been carefully tested and revised, in large part thanks to feedback and issue reports from the user community. Our next aim is to implement several enhancements, including:
Create a centralized workflowr project website to make it easier for researchers to share and discover workflowr projects.
Provide additional functions to simplify website hosting on other popular platforms such as Netlify and Heroku.
As workflowr projects grow, it becomes increasingly important to document not only the evolution of the code and results over time, but also how the results interrelate with one another. Therefore, we aim to implement syntax that allows file dependencies to be recorded in the Rmd files, and incorporate checking of dependencies as part of the workflowr reproducibility safeguards.
As workflowr has been used in a variety of settings, we have also uncovered some limitations. Here we report on some of the more common issues that have arisen.
One limitation is that Git (and hence workflowr) is not well suited to tracking very large files. Therefore, large data files must be left out of the project development history, which reduces reproducibility. One possible workaround is to use Git LFS (Large File Storage) or related tools that allow large data files to be tracked and stored remotely inside a Git repository. This, however, requires considerable expertise to install and configure Git LFS, so it is not a satisfactory solution for some workflowr users. Also note that sensitive or secure data can be added to a workflowr project so long as the storage and access practices meet the data security requirements (workflowr has options to simplify creation and management of projects with security requirements).
Since workflowr builds on Git, users who already have experience with Git can use Git directly to manage their workflowr projects. This provides additional flexibility, but is not without risk; for example, Git commands such as
Finally, workflowr records information about the computing environment used to generate the results, but it does not provide any facilities for replicating the environment. This is an area with many recent software advances — there are many widely used tools for managing and deploying computational environments, from container technologies such as Docker to package managers such as Anaconda and packrat. We view these tools as being complementary to workflowr, and one future direction would be to develop easy-to-use functions that configure such tools for use in a workflowr project.
Data availability
All data underlying the results are available as part of the article and no additional source data are required.
Software availability
Software available from:
Source code available from:
Archived source code at time of publication:
License: MIT
Copyright: © 2019 Blischak JD et al. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Making scientific analyses reproducible, well documented, and easily shareable is crucial to maximizing their impact and ensuring that others can build on them. However, accomplishing these goals is not easy, requiring careful attention to organization, workflow, and familiarity with tools that are not a regular part of every scientist's toolbox. We have developed an R package, workflowr, to help all scientists, regardless of background, overcome these challenges. Workflowr aims to instill a particular "workflow" (a sequence of steps to be repeated and integrated into research practice) that helps make projects more reproducible and accessible. This workflow integrates four key elements: (1) version control (via Git); (2) literate programming (via R Markdown); (3) automatic checks and safeguards that improve code reproducibility; and (4) sharing code and results via a browsable website. These features exploit powerful existing tools, whose mastery would take considerable study. However, the workflowr interface is simple enough that novice users can quickly enjoy its many benefits. By simply following the workflowr "workflow", R users can create projects whose results, figures, and development history are easily accessible on a static website (thereby conveniently shareable with collaborators by sending them a URL) and accompanied by source code and reproducibility safeguards. The workflowr R package is open source and available on CRAN, with full documentation and source code available at https://github.com/jdblischak/workflowr.