Splice Machine Best Practices for Importing Data

This section contains best practice and troubleshooting information related to importing data into our On-Premise Database product.

Using Bulk Import on a KMS-Enabled Cluster

If you are a Splice Machine On-Premise Database customer and want to use bulk import on a cluster with Cloudera Key Management Service (KMS) enabled, you must complete these extra configuration steps:

  1. Make sure that the bulkImportDirectory is in the same encryption zone as HBase.
  2. Add these properties to hbase-site.xml to load the secure BulkLoad endpoint and to put its staging directory in the same encryption zone as HBase:
     <property>
       <name>hbase.bulkload.staging.dir</name>
       <value><YourStagingDirectory></value>
     </property>
     <property>
       <name>hbase.coprocessor.region.classes</name>
       <value>org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint</value>
     </property>

    Replace <YourStagingDirectory> with the path to your staging directory, and make sure that directory is in the same encryption zone as HBase; for example:

        <value>/hbase/load/staging</value>
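
    As a quick sanity check on the two properties above, the following Python sketch parses the text of an hbase-site.xml file and reports whether both settings are present. The function names and the sample configuration are hypothetical and used only for illustration. The sketch assumes that hbase.coprocessor.region.classes may list several comma-separated class names, so it looks for SecureBulkLoadEndpoint among them rather than expecting an exact match.

```python
import xml.etree.ElementTree as ET

SECURE_ENDPOINT = "org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint"


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_bulkload_config(xml_text):
    """Return a list of problems; an empty list means both settings are present."""
    props = read_properties(xml_text)
    problems = []
    if not props.get("hbase.bulkload.staging.dir"):
        problems.append("hbase.bulkload.staging.dir is not set")
    # The coprocessor property may hold several comma-separated class names.
    raw_classes = props.get("hbase.coprocessor.region.classes", "")
    classes = [c.strip() for c in raw_classes.split(",")]
    if SECURE_ENDPOINT not in classes:
        problems.append("SecureBulkLoadEndpoint is not in hbase.coprocessor.region.classes")
    return problems


# Hypothetical sample that mirrors the properties shown above.
sample = """<configuration>
  <property>
    <name>hbase.bulkload.staging.dir</name>
    <value>/hbase/load/staging</value>
  </property>
  <property>
    <name>hbase.coprocessor.region.classes</name>
    <value>org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint</value>
  </property>
</configuration>"""

print(check_bulkload_config(sample))
```

    To check a real cluster, read the contents of your own hbase-site.xml and pass the text to check_bulkload_config. The location of that file depends on your installation.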
    

For more information about KMS, see https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_sg_kms.html.
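
The "same encryption zone" requirement can also be checked with a little path arithmetic. The Python sketch below assumes you already have the list of encryption zone root paths for your cluster (for instance, as reported by the hdfs crypto -listZones command) and tests whether two HDFS paths fall under the same zone. The function names, the zone list, and the example paths (including /hbase as the HBase root) are hypothetical.

```python
import posixpath


def zone_of(path, zones):
    """Return the longest encryption-zone root that contains path, or None."""
    path = posixpath.normpath(path)
    best = None
    for zone in zones:
        root = posixpath.normpath(zone)
        # A path is inside a zone if it equals the root or sits below it.
        inside = path == root or path.startswith(root.rstrip("/") + "/")
        if inside and (best is None or len(root) > len(best)):
            best = root
    return best


def same_zone(path_a, path_b, zones):
    """True only if both paths are inside the same encryption zone."""
    zone_a = zone_of(path_a, zones)
    return zone_a is not None and zone_a == zone_of(path_b, zones)


# Hypothetical zone roots and paths, for illustration only.
zones = ["/hbase", "/data/secure"]
print(same_zone("/hbase/load/staging", "/hbase", zones))
print(same_zone("/tmp/bulk", "/hbase", zones))
```

With these sample values, the staging directory under /hbase shares a zone with the HBase root, while /tmp/bulk is outside every zone and so fails the check.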

Bulk Import of Very Large Datasets with Spark

When you use Splice Machine with Spark on Cloudera, bulk import of very large datasets can fail because of direct memory usage. Use the following settings to resolve this issue:

Update Shuffle-to-Mem Setting

In Cloudera Manager, modify the following setting in the Java Configuration Options for HBase Master:

-Dsplice.spark.reducer.maxReqSizeShuffleToMem=134217728
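
The value in this setting is easier to reason about as a size. Assuming the number is interpreted as a byte count (an assumption; the text above does not state the unit), this short Python sketch converts it to mebibytes. The variable and function names are hypothetical.

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def bytes_to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / MIB


# Value taken from the setting shown above.
max_req_size_shuffle_to_mem = 134217728
print(bytes_to_mib(max_req_size_shuffle_to_mem))
```

Under that assumption, the setting corresponds to 128 MiB.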

Update the YARN User Classpath

In Cloudera Manager, modify the following settings in the YARN (MR2 Included) Service Environment Advanced Configuration Snippet (Safety Valve):

YARN_USER_CLASSPATH=/opt/cloudera/parcels/SPARK2/lib/spark2/yarn/spark-2.2.0.cloudera1-yarn-shuffle.jar:/opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-library-2.11.8.jar
YARN_USER_CLASSPATH_FIRST=true
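
The jar file names in this classpath embed specific Spark and Scala versions, so the exact paths may differ with other parcel versions. The Python sketch below, with a hypothetical helper name, splits a colon-separated classpath and lists any entries that do not exist on the host where it runs.

```python
import os


def missing_classpath_entries(classpath, exists=os.path.exists):
    """Return the entries of a colon-separated classpath that do not exist."""
    return [entry for entry in classpath.split(":") if entry and not exists(entry)]


# Classpath value copied from the setting shown above.
yarn_user_classpath = (
    "/opt/cloudera/parcels/SPARK2/lib/spark2/yarn/spark-2.2.0.cloudera1-yarn-shuffle.jar:"
    "/opt/cloudera/parcels/SPARK2/lib/spark2/jars/scala-library-2.11.8.jar"
)

for entry in missing_classpath_entries(yarn_user_classpath):
    print("missing:", entry)
```

Running this on a YARN node prints one line for each jar that is not found; no output means both entries exist on that host.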