With the release of Cloudera Enterprise Data Hub 5.11, you can now run Spark, Hive, and MapReduce workloads in a Cloudera cluster on Azure Data Lake Store (ADLS). Running on ADLS has the following benefits:
- Grow or shrink a cluster independent of the size of the data.
- Data persists independently as you spin up or tear down a cluster. Other clusters and compute engines, such as Azure Data Lake Analytics or Azure SQL Data Warehouse, can run workloads on the same data.
- Enable role-based access controls integrated with Azure Active Directory and authorize users and groups with fine-grained POSIX-based ACLs.
- Cloud HDFS with performance optimized for analytics workloads, supporting concurrent reads and writes of hundreds of terabytes of data.
- No limits on account size or individual file size.
- Data is encrypted at rest by default using service-managed or customer-managed keys in Azure Key Vault, and is encrypted with SSL while in transit.
- High data durability at lower cost: data replication is managed by Data Lake Store and exposed through an HDFS-compatible interface, so you don't have to replicate data both in HDFS and at the cloud storage infrastructure level.
To get started, you can use the Cloudera Enterprise Data Hub template or the Cloudera Director template on Azure Marketplace to create a Cloudera cluster. Once the cluster is up, use one or both of the following approaches to enable ADLS.
Add a Data Lake Store for cluster-wide access
Step 1: ADLS uses Azure Active Directory for identity management and authentication. To access ADLS from a Cloudera cluster, first create a service principal in Azure AD. You will need the Application ID, Authentication Key, and Tenant ID of the service principal.
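If you don't already have a service principal, one way to create one is with the Azure CLI 2.0 (az). This is a minimal sketch, assuming az is installed and you are signed in; the display name is a placeholder:
# Creates a service principal with a password credential; "cloudera-adls-sp" is a hypothetical name.
az ad sp create-for-rbac --name cloudera-adls-sp
# In the JSON output, appId is the Application ID, password is the Authentication Key,
# and tenant is the Tenant ID used in the steps below.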
Step 2: Assign permissions on the Data Lake Store to the service principal created in the previous step. In the Azure portal, navigate to the Data Lake Store and select Data Explorer. Then navigate to the target path, select Access, and add the service principal with the appropriate access rights. Refer to this document for details on access control in ADLS.
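The same ACL entry can also be set from the command line. The following is a sketch, assuming your Azure CLI 2.0 installation includes the Data Lake Store commands (az dls); the account name, path, and service principal object ID are placeholders:
# Grants the service principal read/write/execute on the target path.
az dls fs access set-entry --account <your adls account> --path /<target path> --acl-spec user:<service principal object ID>:rwx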
Step 3: Go to Cloudera Manager -> HDFS -> Configuration and add the following properties to core-site.xml (for example, through the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml):
Use the service principal property values obtained from Step 1 to set these parameters:
<property>
  <name>dfs.adls.oauth2.client.id</name>
  <value>Application ID</value>
</property>
<property>
  <name>dfs.adls.oauth2.credential</name>
  <value>Authentication Key</value>
</property>
<property>
  <name>dfs.adls.oauth2.refresh.url</name>
  <value>https://login.microsoftonline.com/<Tenant ID>/oauth2/token</value>
</property>
<property>
  <name>dfs.adls.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>
Step 4: Verify you can access ADLS by running a Hadoop command, for example:
hdfs dfs -ls adl://<your adls account>.azuredatalakestore.net/<path to file>/
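With the cluster-wide configuration in place, MapReduce jobs can read and write ADLS paths directly. For example, a quick smoke test with teragen, writing to a placeholder output path:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teragen 1000 adl://<your adls account>.azuredatalakestore.net/<path to output>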
Specify a Data Lake Store in the Hadoop command line
Instead of, or in addition to, configuring a Data Lake Store for cluster-wide access, you can also provide ADLS access information on the command line of a MapReduce or Spark job. With this method, if you use an Azure AD refresh token instead of a service principal and encrypt the credentials in a .jceks file under a user's home directory, you gain the following benefits:
- Each user can use their own credentials instead of a cluster-wide credential
- Nobody can see another user's credential because it's encrypted in a .jceks file in the user's home directory
- No need to store credentials in clear text in a configuration file
- No need to wait for someone who has rights to create service principals in Azure AD
The following steps show an example of how to set this up using the refresh token obtained by signing in with the Azure cross-platform command-line tool (Azure CLI).
Step 1: Sign in to the Azure CLI by running the command "azure login", then get the refreshToken and _clientId from .azure/accessTokens.json under the user's home directory.
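For example, one way to pull these two values out of the file, assuming jq is installed and the first token entry is the account you just signed in with:
# Extracts the refresh token and client ID from the Azure CLI token cache.
jq -r '.[0].refreshToken' ~/.azure/accessTokens.json
jq -r '.[0]._clientId' ~/.azure/accessTokens.json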
Step 2: Run the following commands to set up credentials to access ADLS:
export HADOOP_CREDSTORE_PASSWORD=<your encryption password>
hadoop credential create dfs.adls.oauth2.client.id -value <_clientId from Step 1> -provider jceks://hdfs/user/<username>/cred.jceks
hadoop credential create dfs.adls.oauth2.refresh.token -value '<refreshToken from Step 1>' -provider jceks://hdfs/user/<username>/cred.jceks
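To confirm both entries were stored, you can list the aliases in the credential store from the same shell (so HADOOP_CREDSTORE_PASSWORD is still exported):
hadoop credential list -provider jceks://hdfs/user/<username>/cred.jceks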
Step 3: Verify you can access ADLS by running a Hadoop command, for example:
hdfs dfs -Ddfs.adls.oauth2.access.token.provider.type=RefreshToken -Dhadoop.security.credential.provider.path=jceks://hdfs/user/<username>/cred.jceks -ls adl://<your adls account>.azuredatalakestore.net/<path to file>
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teragen \
  -Dmapred.child.env="HADOOP_CREDSTORE_PASSWORD=$HADOOP_CREDSTORE_PASSWORD" \
  -Dyarn.app.mapreduce.am.env="HADOOP_CREDSTORE_PASSWORD=$HADOOP_CREDSTORE_PASSWORD" \
  -Ddfs.adls.oauth2.access.token.provider.type=RefreshToken \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/<username>/cred.jceks \
  1000 adl://<your adls account>.azuredatalakestore.net/<path to file>
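A similar pattern works for Spark jobs. The following spark-submit sketch passes the same Hadoop properties through Spark's spark.hadoop.* prefix and exports the credential store password to the application master and executors; the main class, application jar, and arguments are placeholders:
spark-submit \
  --conf spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD=$HADOOP_CREDSTORE_PASSWORD \
  --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=$HADOOP_CREDSTORE_PASSWORD \
  --conf spark.hadoop.dfs.adls.oauth2.access.token.provider.type=RefreshToken \
  --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/<username>/cred.jceks \
  --class <your main class> <your application jar> adl://<your adls account>.azuredatalakestore.net/<path to file>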
Limitations of ADLS support in EDH 5.11
- Only Spark, Hive, and MapReduce workloads are supported on ADLS. Support for ADLS in Impala, HBase, and other services will come in future releases.
- ADLS is supported as secondary storage. To access ADLS, use fully qualified URLs in the form adl://<your adls account>.azuredatalakestore.net/<path to file>.