The knitr package by Yihui Xie is a wonderful tool for reproducible data science. I especially like using it with R Markdown documents, where with some simple markup in an easy-to-read document I can easily combine R code and narrative text to generate an attractive document with words, tables and pictures in HTML, PDF or Word format. Say, something like this:
In that document, the numerical weather records and the chart were generated by R, combined into a document using R Markdown, and then rendered as a Word file with knitr. (You can find the R Markdown file to generate that report, and the R script to download the data, in my weather-report repository.)
Another useful tool for reproducible data science is the checkpoint package. It helps you manage the ever-changing ecosystem of R packages on CRAN by making it easy to "lock in" specific versions of R packages. With a single call to the checkpoint function (say checkpoint("2017-04-25"), for April 25, 2017) you can automatically find all the packages used by your current R project (i.e. the current folder) and install them as they were on the specified date. A colleague or collaborator can use the same script to get the same versions too, and so be confident of reproducing your results without worrying that a newer package version might have changed them. By the way, those package versions get installed in a special folder (.checkpoint, in your home directory), so they won't change the results of any other R projects, either.
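As a minimal sketch of that call in practice, the snippet below wraps it in a small helper. The name pin_packages and the date check are my own additions for illustration, not part of the checkpoint package; the only checkpoint call used is checkpoint() with a snapshot date, as described above.

```r
# Hypothetical helper (not part of checkpoint): validate the snapshot date,
# then hand it to checkpoint::checkpoint().
pin_packages <- function(snapshot_date = "2017-04-25") {
  # Reject anything that is not a YYYY-MM-DD date before touching the library path
  stopifnot(!is.na(as.Date(snapshot_date, format = "%Y-%m-%d")))

  # Scans the current project folder for packages and installs them
  # as they were on CRAN on the given date
  checkpoint::checkpoint(snapshot_date)
}

# Typical use, as the first line of a script or in the first chunk of an
# .Rmd file (requires the checkpoint package and network access):
# pin_packages("2017-04-25")
```

Calling checkpoint("2017-04-25") directly does the same job; the wrapper only adds an early failure for a mistyped date.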
RStudio includes a very useful tool for working with R Markdown and knitr: you can press the "Knit" toolbar button to process the document with a single click. For that to work, certain R packages must be available behind the scenes. In normal circumstances RStudio will offer to install them, but the process doesn't work when a checkpoint folder is active. A simple workaround is to include a file in the same folder (I call mine knitr-packages.R) with the following lines:
library("formatR")
library("htmltools")
library("caTools")
library("bitops")
library("base64enc")
library("rprojroot")
library("rmarkdown")
library("evaluate")
library("stringi")
Although you never run that file directly, the checkpoint process will discover it and ensure the necessary packages are installed for RStudio to perform its magic. (In my tests this works with recent versions of RStudio, including the latest, 1.0.143.) All you need to do is make sure you run checkpoint from the R command line (just press Control-ENTER on the corresponding line in the .Rmd file) before attempting to knit. Simple!
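If you would rather not type that helper file by hand, it can be generated with a few lines of base R. This is only a sketch: the function name write_knitr_packages is hypothetical, and the demo writes to a temporary folder so it has no side effects. In a real project you would pass "knitr-packages.R" in the project folder instead.

```r
# Packages the Knit button needs, taken from the list above
knit_pkgs <- c("formatR", "htmltools", "caTools", "bitops", "base64enc",
               "rprojroot", "rmarkdown", "evaluate", "stringi")

# Hypothetical helper: write one library() call per package to `path`
write_knitr_packages <- function(path, pkgs = knit_pkgs) {
  lines <- sprintf('library("%s")', pkgs)
  writeLines(lines, path)
  invisible(path)
}

# Demo in a temporary folder; use "knitr-packages.R" in a real project
demo_path <- file.path(tempdir(), "knitr-packages.R")
write_knitr_packages(demo_path)
readLines(demo_path)
```

Keeping the package names in one vector makes it easy to extend the list if a later RStudio version asks for something more.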