While spreadsheets are fine tools for collecting and sharing data, it is often tempting to also use them for in-depth analysis that is better suited to reproducible systems like R. Historian Jesse Sadler recently published the useful guide Excel vs R: A Brief Introduction to R, which offers practical advice to data analysts currently using spreadsheets on how to transition to R:
Quantitative research often begins with the humble process of counting. Historical documents are never as plentiful as a historian would wish, but counting words, material objects, court cases, etc. can lead to a better understanding of the sources and the subject under study. When beginning the process of counting, the first instinct is to open a spreadsheet. The end result might be the production of tables and charts created in the very same spreadsheet document. In this post, I want to show why this spreadsheet-centric workflow is problematic and recommend the use of a programming language such as R as an alternative for both analyzing and visualizing data.
The post provides a good overview of the pros and cons of using spreadsheets for data analysis, and then offers a useful introduction -- aimed at spreadsheet users -- to using R for the problematic parts. It includes:
- Basics of the R command line
- An overview of the Tidyverse, a suite of R packages for data manipulation
- Working with data in R: numbers, strings and dates
- Manipulating data frames by linking operations together with the pipe operator
- Visualizing data with the ggplot2 package
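To give a flavor of the workflow the guide teaches, here is a minimal sketch (not taken from Sadler's guide; the data frame and column names are made up for illustration) that chains dplyr operations with the pipe operator and then plots the result with ggplot2:

```r
# A made-up data frame of letters, one row per letter received
library(dplyr)
library(ggplot2)

letters_df <- data.frame(
  year   = c(1585, 1585, 1586, 1586, 1586, 1587),
  sender = c("A", "B", "A", "C", "B", "A")
)

# Chain operations with the pipe: group by year, count, sort descending
per_year <- letters_df %>%
  group_by(year) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

# Visualize the counts as a bar chart
ggplot(per_year, aes(x = factor(year), y = n)) +
  geom_col() +
  labs(x = "Year", y = "Letters received")
```

Each step in the pipeline is a separate, readable verb, which is the reproducibility advantage over a spreadsheet: the analysis is recorded as code rather than hidden in cell formulas.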
The guide is built around a worked analysis of an interesting historical data set: the correspondence network from 6,600 letters written to the 16th-century Dutch diplomat Daniel van der Meulen. You can find the complete guide, including a link to download the data for the examples, at the link below.
Jesse Sadler: Excel vs R: A Brief Introduction to R