If you've ever wanted to play around with big data sets in a Spark cluster from R with the sparklyr package, but haven't gotten started because setting up a Spark cluster is too hard, well ... rest easy. You can get up and running in about 5 minutes using the guide SparklyR on Azure with AZTK, and you don't even have to install anything yourself. I'll summarize the steps below, but basically you'll run a command-line utility to launch a cluster in Azure with everything you need already installed, and then connect to RStudio Server using your browser to analyze data with sparklyr.
Step 1: Install the Azure Distributed Data Engineering Toolkit (aztk). For this, you'll need a Unix command line with Python 3 installed. I'm on Windows, so I used a bash shell from the Windows Subsystem for Linux and it worked great. (I just had to use pip3 instead of pip to install, since the default there is Python 2.) The same process should work with other Linux distros or from a Mac terminal.
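If you're not sure which pip your shell will pick up, a quick check along these lines can help before you install anything. This is only a sketch: PIP_CMD is a variable name made up for illustration, and the fallback to "python3 -m pip" is an assumption about how your system is set up.

```shell
# Prefer pip3; fall back to "python3 -m pip" if pip3 is not on the PATH.
if command -v pip3 >/dev/null 2>&1; then
  PIP_CMD="pip3"
else
  PIP_CMD="python3 -m pip"
fi
echo "Python 3 installs will use: $PIP_CMD"

# Print the pip version and the Python it belongs to, if it is available.
$PIP_CMD --version 2>/dev/null || echo "pip for Python 3 does not appear to be installed yet"
```

If the version line mentions Python 2, you're looking at the wrong pip.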
Step 2: Log into the Azure Portal with your Azure subscription. If you don't have an Azure subscription, you can sign up for free and get $200 in Azure credits.
Step 3: Back at the command line, set up authentication in the secrets.yaml file. You'll be using the Azure portal to retrieve the necessary keys, and you'll need to create an Azure Batch account if you don't have one already. (Batch is the HPC cluster and job-management service in Azure.) You can find step-by-step details in the aztk documentation.
Step 4: Configure your cluster defaults in the cluster.yaml file. Here you can define the default VM instance size used for the cluster nodes; for example, vm_size: standard_a2 gives you basic 2-core nodes. (You can override this on the command line, but it's convenient to set it here.) You'll also need to specify a Docker image here that will be used to set up the nodes, and for use with sparklyr you'll need one that includes R and the version of Spark you want. I used:
docker_repo: aztk/r-base:spark2.2.0-r3.4.1-base
This provides an image with Spark 2.2.0, R 3.4.1, and a suite of R packages pre-installed, including sparklyr and tidyverse. (You could provide your own dockerfile here, if you need other things installed on the nodes.)
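To see the two settings from this step side by side, here is a small sketch that writes them to a scratch file and prints them back. It deliberately does not touch your real cluster.yaml, which contains other keys as well; SKETCH_FILE is just a temporary file used for illustration.

```shell
# Write only the two settings discussed above to a temporary scratch file.
SKETCH_FILE="$(mktemp)"
cat > "$SKETCH_FILE" <<'EOF'
vm_size: standard_a2
docker_repo: aztk/r-base:spark2.2.0-r3.4.1-base
EOF

# Show the lines as they would appear in the YAML file.
grep -E '^(vm_size|docker_repo):' "$SKETCH_FILE"
```

Running the same grep against your own cluster.yaml is a quick way to confirm that both values are set the way you expect.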
Step 5: Provision a Spark cluster. This is the easy bit: just use the command-line tool like this:
aztk spark cluster create --id mysparklyr4 --size 4
In this case, it will launch a cluster of 4 nodes, each with 2 cores (per the vm_size option configured above). Each node will be pre-installed with R and the other software in the Docker image specified in Step 4. (Warning: the default quotas for Azure Batch are laughably low: for me it was 24 cores total at first. You can get your limit raised fairly easily, but it can take a day to get approval.) Provisioning a cluster takes about 5 minutes; while you're waiting you can check on the progress by clicking on the cluster name in the "Pools" section of your Azure Batch account within the Azure Portal.
Once it's ready, you'll also need to provide a password for the head node unless you set up ssh keys in the secrets.yaml file.
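Because of that core quota, it's worth doing the arithmetic before you ask for a bigger cluster. The sketch below uses the numbers from this post (4 nodes, 2 cores each, a 24-core quota); the variable names are made up for illustration, and your own quota may differ.

```shell
NODES=4            # value passed to --size
CORES_PER_NODE=2   # standard_a2 nodes, as configured in Step 4
CORE_QUOTA=24      # the Batch core quota I started with; check your own account

TOTAL_CORES=$((NODES * CORES_PER_NODE))
MAX_NODES=$((CORE_QUOTA / CORES_PER_NODE))

if [ "$TOTAL_CORES" -le "$CORE_QUOTA" ]; then
  echo "Cluster needs $TOTAL_CORES cores, which fits within the quota of $CORE_QUOTA"
else
  echo "Cluster needs $TOTAL_CORES cores, which exceeds the quota of $CORE_QUOTA"
fi
echo "Largest cluster of this node size under the quota: $MAX_NODES nodes"
```

With these numbers the cluster uses 8 of the 24 cores, and the quota would allow at most 12 nodes of this size.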
Step 6: Connect to the head node of the Spark cluster. Normally you'd need to find the IP address first, but aztk makes it easy with its ssh command:
aztk spark cluster ssh --id mysparklyr4
(You'll need to provide a password here, if you set one up in Step 5.) This gives you a shell on the head node, but more importantly it maps the ports for Spark and RStudio Server, so that you can connect to them using http://localhost URLs in the next step. Don't exit from this shell until you're done with the next steps, or the port mappings will be cancelled.
Step 7: Connect to RStudio Server
Open a browser window on your desktop, and browse to http://localhost:8787. This will open up RStudio Server in your browser. (The default login is rstudio/rstudio.) To be clear, RStudio Server is running on the head node of your cluster in Azure Batch, not on your local machine: the port mapping from the previous step is redirecting your local port 8787 to the remote cluster.
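If the page doesn't load, the usual culprit is that the ssh session from Step 6 has been closed. A rough way to check from a second local terminal is sketched below; it assumes curl is installed, and the variable names are made up for illustration.

```shell
RSTUDIO_URL="http://localhost:8787"

# Any HTTP response at all (even a redirect to the login page) counts as reachable.
if command -v curl >/dev/null 2>&1 &&
   curl --silent --output /dev/null --max-time 5 "$RSTUDIO_URL"; then
  REACHABLE="yes"
  echo "Something is answering at $RSTUDIO_URL"
else
  REACHABLE="no"
  echo "Nothing is answering at $RSTUDIO_URL; is the aztk ssh session still open?"
fi
```

This only tells you that the local port is answering, not that the login will succeed.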
From here, you can use RStudio as you normally would. In particular, the sparklyr package is already installed, so you can connect to the Spark cluster directly and use RStudio Server's built-in features for working with Spark.
One of the nice things about using RStudio Server is that you can shut down your browser or even your machine, and RStudio Server will preserve its state so that you can pick up exactly where you left off next time you log in. (Just use aztk spark cluster ssh to reapply the port mappings first, if necessary.)
Step 8: When you're finished, shut down your cluster using the aztk spark cluster delete command. (While you can delete the nodes from the Pools view in the Azure portal, the command does some additional cleanup for you.) You'll be charged for each node in the cluster at the usual VM rates for as long as the cluster is provisioned. (One cost-saving option is to use low-priority VMs for the nodes, for savings of up to 90% compared to the usual rates.)
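Since deleting is destructive, here is a cautious sketch of the command for the cluster created in Step 5. It assumes the delete command takes the same --id option as the create and ssh commands above; CONFIRM_DELETE is a made-up safety switch for this example, not an aztk feature.

```shell
CLUSTER_ID="mysparklyr4"   # the id used when the cluster was created
DELETE_CMD="aztk spark cluster delete --id $CLUSTER_ID"

echo "About to run: $DELETE_CMD"
if [ "${CONFIRM_DELETE:-no}" = "yes" ]; then
  # Only reached when you explicitly opt in, e.g. CONFIRM_DELETE=yes
  $DELETE_CMD
else
  echo "Set CONFIRM_DELETE=yes to actually delete the cluster"
fi
```

By default this only prints the command, so you can double-check the cluster id before anything is removed.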
That's it! Once you get used to it, it's all quick and easy; the longest part is waiting for the cluster to spin up in Step 5. This is just a summary, so for the full details see the guide SparklyR on Azure with AZTK.