by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft)
The Azure Data Science Virtual Machine (DSVM) is a curated VM which provides commonly-used tools and software for data science and machine learning, pre-installed. AzureDSVM is a new R package that enables seamless interaction with the DSVM from a local R session, by providing functions for the following tasks:
- Deployment, deallocation, deletion of one or multiple DSVMs;
- Remote execution of local R scripts: compute contexts available in Microsoft R Server can be enabled for enhanced computation efficiency for either a single DSVM or a cluster of DSVMs;
- Retrieval of resource consumption and the total cost of using DSVM(s).
AzureDSVM is built on the AzureSMR package and depends on the same set of R packages, such as httr and jsonlite. It requires the same initial setup on Azure Active Directory (for authentication).
To install AzureDSVM with the devtools package:
library(devtools)
devtools::install_github("Azure/AzureDSVM")
library("AzureDSVM")
When deploying a Data Science Virtual Machine, the machine name, size, OS type, etc. must be specified. AzureDSVM supports DSVMs on Ubuntu, CentOS, Windows, and Windows with the Deep Learning Toolkit (on GPU-class instances). For example, the following code fires up a D4 v2 Ubuntu DSVM located in Southeast Asia:
deployDSVM(context, resource.group="example", location="southeastasia", size="Standard_D4_v2", os="Ubuntu", hostname="mydsvm", username="myname", pubkey="pubkey")
where context is an azureActiveContext object, created by the AzureSMR::createAzureContext() function, that encapsulates credentials (Tenant ID, Client ID, etc.) for Azure authentication.
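For illustration, the inputs to the deployment call above might be prepared as follows. This is a minimal sketch, not the package's prescribed workflow: the environment variable names (AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_AUTH_KEY) and the helper read_pubkey() are hypothetical, and the default key path is an assumption about a typical SSH setup.

```r
# Hypothetical helper: read an SSH public key file into a single string.
read_pubkey <- function(path = "~/.ssh/id_rsa.pub") {
  path <- path.expand(path)
  if (!file.exists(path)) stop("Public key file not found: ", path)
  # A public key file normally holds one line; drop surrounding whitespace.
  trimws(paste(readLines(path, warn = FALSE), collapse = ""))
}

# Credentials are read from environment variables (names are assumptions)
# so that they are not hard-coded in scripts.
creds <- Sys.getenv(c("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_AUTH_KEY"))

if (all(nzchar(creds)) && requireNamespace("AzureSMR", quietly = TRUE)) {
  # Only attempted when all credentials and the AzureSMR package are present.
  context <- AzureSMR::createAzureContext(
    tenantID = creds[["AZURE_TENANT_ID"]],
    clientID = creds[["AZURE_CLIENT_ID"]],
    authKey  = creds[["AZURE_AUTH_KEY"]]
  )
} else {
  message("Azure credentials or the AzureSMR package are missing; no context created.")
}
```

The resulting context object and the string returned by read_pubkey() would then stand in for the context and pubkey arguments of deployDSVM().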
In addition to launching a single DSVM, the AzureDSVM package makes it easy to launch a cluster with multiple virtual machines. Multi-deployment supports:
- creating a collection of independent DSVMs which can be distributed to a group of data scientists for collaborative projects, as well as
- clustering a set of connected DSVMs for high-performance computation.
To create a cluster of 5 Ubuntu DSVMs with default VM size, use:
cluster<-deployDSVMCluster(context, resource.group=RG, location="southeastasia", hostnames="mydsvm", usernames="myname", pubkeys="pubkey", count=5)
To execute a local script on a remote cluster of DSVMs with a specified Microsoft R Server compute context, use the executeScript function. (NOTE: only Linux-based DSVM instances are supported at the moment, because remote execution is carried out over SSH. Microsoft R Server 9.x allows remote interaction for both Linux and Windows, and more details can be found here.) Here, we use the RxForeachDoPar context (as indicated by the compute.context option):
executeScript(context, resource.group="example", machines="dsvm_names_in_the_cluster", remote="fqdn_of_dsvm_used_as_master", user="myname", script="path_to_the_script_for_remote_execution", master="fqdn_of_dsvm_used_as_master", slaves="fqdns_of_dsvms_used_as_slaves", compute.context="clusterParallel")
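The master and slaves arguments take fully qualified domain names. Below is a minimal sketch of assembling them from VM names, assuming the DSVMs follow the usual Azure public DNS pattern of name, location, then cloudapp.azure.com. The helper dsvm_fqdn() and the VM names are hypothetical; the names actually assigned by a deployment should be checked before use.

```r
# Hypothetical helper: build FQDNs from VM names and an Azure location,
# assuming the pattern <name>.<location>.cloudapp.azure.com.
dsvm_fqdn <- function(names, location) {
  paste(names, location, "cloudapp.azure.com", sep = ".")
}

# Illustrative VM names; replace with the names of the deployed DSVMs.
machines <- c("mydsvm001", "mydsvm002", "mydsvm003")
fqdns    <- dsvm_fqdn(machines, "southeastasia")

master <- fqdns[1]    # first node is treated as the master
slaves <- fqdns[-1]   # remaining nodes are treated as slaves

print(master)
print(slaves)
```

Here the first machine is arbitrarily chosen as the master; any node in the cluster could be picked instead.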
Consumption and cost information for DSVMs can be retrieved with:
consum <- expenseCalculator(context, instance="mydsvm", time.start="time_stamp_of_starting_point", time.end="time_stamp_of_ending_point", granularity="Daily", currency="USD", locale="en-US", offerId="offer_id_of_azure_subscription", region="southeastasia")
print(consum)
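The two time stamps above are placeholders. One way to build a midnight-aligned reporting window in base R is sketched below. The "YYYY-MM-DD HH:MM:SS" string format is an assumption and should be verified against the package documentation; the variable names are illustrative only.

```r
# Build a reporting window covering the seven full days before today (UTC).
# The "YYYY-MM-DD HH:MM:SS" string format is an assumption, not confirmed
# by the text above.
window_end   <- as.POSIXct(format(Sys.Date()), tz = "UTC")   # today at midnight
window_start <- window_end - as.difftime(7, units = "days")  # seven days earlier

time_end   <- format(window_end,   "%Y-%m-%d %H:%M:%S", tz = "UTC")
time_start <- format(window_start, "%Y-%m-%d %H:%M:%S", tz = "UTC")

print(c(start = time_start, end = time_end))
```

Aligning both ends to midnight keeps the window on whole-day boundaries, which fits the "Daily" granularity used in the call.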
Detailed introductions and tutorials can be found in the AzureDSVM GitHub repository, linked below.
GitHub (Azure): AzureDSVM