Making your code run faster is often the primary goal when using parallel programming techniques in R, but sometimes the effort of converting your code to use a parallel framework leads only to disappointment, at least initially. Norman Matloff, author of Parallel Computing for Data Science: With Examples in R, C++ and CUDA, has shared chapter 2 of that book online, and it describes some of the issues that can lead to poor performance. They include:
- Communications overhead, particularly an issue with fine-grained parallelism consisting of a very large number of relatively small tasks (a small illustration follows this list);
- Load balance, where the computing resources aren't contributing equally to the problem;
- Impacts from use of RAM and virtual memory, such as cache misses and page faults;
- Network effects, such as latency and bandwidth, that impact performance and communication overhead;
- Interprocess conflicts and thread scheduling;
- Data access and other I/O considerations.
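To get a feel for the first point, here is a minimal sketch using base R's parallel package (it is not taken from Matloff's chapter, and the worker count and problem size are arbitrary). It times the same trivial computation dispatched one element per task versus pre-chunked into one task per worker; on most machines the fine-grained version is noticeably slower purely because of the per-task communication.

```r
library(parallel)

x <- runif(1e4)
cl <- makeCluster(2)   # two local workers; adjust to taste

## Fine-grained: clusterApply() ships each element to a worker as its own
## task, so there is one round of communication per element.
t_fine <- system.time(clusterApply(cl, x, sqrt))

## Coarser-grained: parLapply() first splits x into one chunk per worker,
## so the same work needs only a few rounds of communication.
t_chunked <- system.time(parLapply(cl, x, sqrt))

stopCluster(cl)

c(fine_grained = t_fine[["elapsed"]], chunked = t_chunked[["elapsed"]])
```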
The chapter is well worth a read for anyone writing parallel code in R (or indeed any programming language). It's also worth checking out Norm Matloff's keynote from the useR!2017 conference, embedded below.
Norm Matloff: Understanding overhead issues in parallel computation