Using Azure WASB, ADLS, and ADLS2 with Splice Machine

This topic walks you through the steps required to configure Azure Blob (WASB), Azure Data Lake Storage (ADLS), and Azure Data Lake Storage Gen 2 (ADLS2) with Splice Machine.

This topic contains these sections:

Configuring WASB and ADLS Storage Access

This section shows you how to configure WASB and ADLS storage for Splice Machine access, in these subsections:

Configuring and Uploading Your Data

  1. Log in to the Azure portal:

    If needed, first create an Azure account. Then log in to the portal at https://portal.azure.com.

  2. Create a resource group:

    Follow the instructions on this page to create a resource group: https://docs.microsoft.com/en-us/azure/azure-resource-manager/manage-resource-groups-portal.

    For example, create the myResourceGroup resource group.

  3. Create a Data Lake Storage Gen1 account:

    Follow the instructions on this page: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-get-started-portal.

    For example:

    name: myadls1
    resource group: myResourceGroup
    
  4. Upload your data file:

    In the Azure Data Explorer, create a new folder, and then upload your file (for example, myData) to that folder.

  5. Create credentials:

    Create credentials, as described on this page: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory#create-an-active-directory-application.

    Register an AD application: myApp
    Assign the application to a role: Owner
    Copy IDs:
        Tenant ID: <tenantID>
        Client ID: <clientID>
    Create a new application secret
        myApp secret: <clientSecret>
    
  6. Assign the Azure AD application permissions to the files in your Azure Data Lake Storage Gen1 account.

  7. Get the OAuth 2.0 token endpoint:

    For example: https://login.microsoftonline.com/<tenantID>/oauth2/token

  8. Access ADLS from your cluster:

    To access your ADLS storage from Cloudera, follow the instructions on this page: https://www.cloudera.com/documentation/enterprise/5-12-x/topics/admin_adls_config.html. For example:

    hadoop fs -Ddfs.adls.oauth2.access.token.provider.type=ClientCredential \
       -Ddfs.adls.oauth2.client.id=<clientID> \
       -Ddfs.adls.oauth2.credential="<clientSecret>" \
       -Ddfs.adls.oauth2.refresh.url=https://login.microsoftonline.com/<tenantID>/oauth2/token \
       -ls adl://myadls1.azuredatalakestore.net/myData
    
  9. Create a WASB storage account:

    Follow the instructions on this page: https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account.

    Resource group: myResourceGroup
    Storage account name: mywasb
    Get access key: <accessKey>
    Create container: myData
    

Copying Data Between WASB and ADLS

To copy data, follow the instructions on this page: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-copy-data-wasb-distcp. For example:

hadoop fs -Ddfs.adls.oauth2.access.token.provider.type=ClientCredential \
   -Ddfs.adls.oauth2.client.id=<clientID> \
   -Ddfs.adls.oauth2.credential="<clientSecret>" \
   -Ddfs.adls.oauth2.refresh.url=https://login.microsoftonline.com/<tenantID>/oauth2/token \
   -Dfs.azure.account.key.mywasb.blob.core.windows.net="<accessKey>" \
   -cp adl://myadls1.azuredatalakestore.net/myData/* wasbs://myData@mywasb.blob.core.windows.net/

hadoop fs -Dfs.azure.account.key.mywasb.blob.core.windows.net="<accessKey>" \
   -ls wasbs://myData@mywasb.blob.core.windows.net/

Importing Your Data from Azure WASB or ADLS Storage

If you’re using Cloudera, to import your data into Splice Machine, you need to add property values under this Cloudera Manager setting:

HDFS->Configuration->Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml

Add the following values:

Name                                                Value
dfs.adls.oauth2.access.token.provider.type          ClientCredential
dfs.adls.oauth2.client.id                           <clientID>
dfs.adls.oauth2.credential                          <clientSecret>
dfs.adls.oauth2.refresh.url                         https://login.microsoftonline.com/<tenantID>/oauth2/token
fs.azure.account.key.mywasb.blob.core.windows.net   <accessKey>
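
If you use Cloudera Manager's XML view of the safety valve, the name/value pairs above translate into a core-site.xml fragment like the following. This is a sketch: the angle-bracketed values are this topic's placeholders (substitute your real tenant ID, client ID, secret, and access key), not literal XML.

```xml
<property>
  <name>dfs.adls.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>
<property>
  <name>dfs.adls.oauth2.client.id</name>
  <value><clientID></value>
</property>
<property>
  <name>dfs.adls.oauth2.credential</name>
  <value><clientSecret></value>
</property>
<property>
  <name>dfs.adls.oauth2.refresh.url</name>
  <value>https://login.microsoftonline.com/<tenantID>/oauth2/token</value>
</property>
<property>
  <name>fs.azure.account.key.mywasb.blob.core.windows.net</name>
  <value><accessKey></value>
</property>
```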

Then, you can import data with statements like the following:

splice> call SYSCS_UTIL.IMPORT_DATA('mySchema', 'myTable1', null, 'wasbs://myData@mywasb.blob.core.windows.net/myTbl.tbl', '|', null, null, null, null, 0, '/BAD', true, null);
splice> call SYSCS_UTIL.IMPORT_DATA('mySchema', 'myTable2', null, 'adl://myadls1.azuredatalakestore.net/myData/myTbl.tbl', '|', null, null, null, null, 0, '/BAD', true, null);

Configuring ADLS2 Storage Access

You’ll find an introduction to ADLS2 here: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction.

You can currently use ADLS2 with:

  • Hadoop 3.2+
  • Cloudera 6.1+
  • Hortonworks 3.1.x+

The remainder of this section shows you how to configure ADLS2 storage for Splice Machine access, in these subsections:

Configuring ADLS2 for Splice Machine

  1. Log in to the Azure portal:

    If needed, first create an Azure account. Then log in to the portal at https://portal.azure.com.

  2. Create a resource group:

    Follow the instructions on this page to create a resource group: https://docs.microsoft.com/en-us/azure/azure-resource-manager/manage-resource-groups-portal.

    For example, create the myResourceGroup resource group.

  3. Create an ADLS2 storage account:

    Follow the instructions on this page: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-quickstart-create-account.

    For example:

    Name: myadls2
    Resource group: myResourceGroup
    Location: West US 2
    Account kind: StorageV2
    
  4. Manage the storage account:

    • Add a file system; for example, myData.
    • Get access key: <accessKeyADLS2>

Copying Your Data from WASB to ADLS2

For Cloudera, follow the instructions on this page: https://www.cloudera.com/documentation/enterprise/latest/topics/admin_adls2_config.html. For example:

hadoop fs \
-Dfs.azure.account.key.myadls2.dfs.core.windows.net="<accessKeyADLS2>" \
-Dfs.azure.account.key.mywasb.blob.core.windows.net="<accessKey>" \
-cp wasbs://myData@mywasb.blob.core.windows.net/* abfs://myData@myadls2.dfs.core.windows.net/
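
The three URI schemes used in this topic follow a common pattern. As a quick reference, here's a sketch that assembles them from the example names used above (myadls1, mywasb, myadls2, and myData are this topic's examples; substitute your own):

```shell
# Example account and container names from this topic; substitute your own.
ADLS_ACCOUNT=myadls1    # ADLS Gen1 account
WASB_ACCOUNT=mywasb     # Blob (WASB) storage account
ADLS2_ACCOUNT=myadls2   # ADLS Gen2 storage account
CONTAINER=myData        # container / file system / folder name

# ADLS Gen1: adl://<account>.azuredatalakestore.net/<path>
echo "adl://${ADLS_ACCOUNT}.azuredatalakestore.net/${CONTAINER}"

# WASB over TLS: wasbs://<container>@<account>.blob.core.windows.net/<path>
echo "wasbs://${CONTAINER}@${WASB_ACCOUNT}.blob.core.windows.net/"

# ADLS Gen2: abfs://<file-system>@<account>.dfs.core.windows.net/<path>
echo "abfs://${CONTAINER}@${ADLS2_ACCOUNT}.dfs.core.windows.net/"
```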

Importing Your Data from Azure ADLS2

If you’re using Cloudera, to import your data into Splice Machine, you need to add a property value under this Cloudera Manager setting:

HDFS->Configuration->Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml

Add this value:

Name                                                Value
fs.azure.account.key.myadls2.dfs.core.windows.net   <accessKeyADLS2>
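
In Cloudera Manager's XML view of the safety valve, this property translates into a fragment like the following (a sketch; <accessKeyADLS2> is this topic's placeholder for your real access key, not literal XML):

```xml
<property>
  <name>fs.azure.account.key.myadls2.dfs.core.windows.net</name>
  <value><accessKeyADLS2></value>
</property>
```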

You can then import data with a statement like the following:

splice> call SYSCS_UTIL.IMPORT_DATA('SPLICE', 'CUSTOMER', null, 'abfs://myData@myadls2.dfs.core.windows.net/myTable.tbl', '|', null, null, null, null, 0, '/BAD', true, null);