Best Practices: Troubleshooting On-Premise Bulk Import
This section contains the following tips for troubleshooting ingestion of data into our On-Premise Database product:
- Using Bulk Import on a KMS-Enabled Cluster
- Bulk Import: Out of Heap Memory
- Bulk Import: Network Timeout
- Bulk Import: Out of Direct Memory (Cloudera)
There are currently no troubleshooting issues to address if you’re using the Splice Machine Database-as-Service product.
Using Bulk Import on a KMS-Enabled Cluster
If you are a Splice Machine On-Premise Database customer and want to use bulk import on a cluster with Cloudera Key Management Service (KMS) enabled, you must complete these extra configuration steps:
- Make sure that the bulkImportDirectory is in the same encryption zone as HBase.
- Add these properties to hbase-site.xml to load the secure Apache BulkLoad endpoint and to put its staging directory in the same encryption zone as HBase:

    <property>
      <name>hbase.bulkload.staging.dir</name>
      <value><YourStagingDirectory></value>
    </property>
    <property>
      <name>hbase.coprocessor.region.classes</name>
      <value>org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint</value>
    </property>
Replace <YourStagingDirectory> with the path to your staging directory, and make sure that directory is in the same encryption zone as HBase; for example:
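If HBase lives in an encryption zone rooted at /hbase, the staging directory property might look like this (the path here is illustrative only; substitute your own):

    <property>
      <name>hbase.bulkload.staging.dir</name>
      <!-- hypothetical path inside the same encryption zone as HBase -->
      <value>/hbase/bulkload-staging</value>
    </property>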
For more information about KMS, see https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_sg_kms.html.
Bulk Import: Out of Heap Memory
If you run out of heap memory while bulk importing an extremely large amount of data with our On-Premise product, you can resolve the issue by setting the HBase client's hfile.block.cache.size property to a very small value.
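For example, this hbase-site.xml snippet shrinks the client-side block cache to 5% of the heap; the 0.05 value is illustrative rather than a verified recommendation, so tune it for your cluster:

    <property>
      <name>hfile.block.cache.size</name>
      <!-- fraction of the client heap reserved for the HBase block cache; 0.05 is illustrative -->
      <value>0.05</value>
    </property>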
This setting should be applied only to the HBase client.
Bulk Import: Network Timeout
If you encounter a network timeout during bulk ingestion with our On-Premise product, you can resolve it by increasing the value of the shuffle.io.connectionTimeout property as follows:
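A sketch of the change as a spark-defaults.conf entry, assuming the property's full Spark name, spark.shuffle.io.connectionTimeout (which by default falls back to spark.network.timeout); the 600s value is illustrative, so raise it until shuffle connections stop timing out:

    # illustrative timeout for shuffle fetch connections
    spark.shuffle.io.connectionTimeout 600s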
Bulk Import: Out of Direct Memory (Cloudera)
When using the On-Premise version of Splice Machine with Spark on Cloudera, bulk import of very large datasets can fail due to direct memory usage. Use the following settings to resolve this issue:
Update the Shuffle-to-Mem Setting
Modify the following setting in Cloudera Manager's Java Configuration Options for HBase Master:
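For example, lowering Spark's shuffle-to-mem threshold forces shuffle requests larger than the threshold to be fetched to disk instead of held in Netty direct buffers; this sketch assumes Splice Machine's splice.spark property prefix and an illustrative 128 MB threshold:

    -Dsplice.spark.reducer.maxReqSizeShuffleToMem=134217728

Here 134217728 bytes is 128 MB; shuffle requests above this size spill to disk rather than consuming direct memory.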
Update the YARN User Classpath
Modify the following settings in Cloudera Manager's YARN (MR2 Included) Service Environment Advanced Configuration Snippet (Safety Valve):
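A sketch of such an environment snippet, assuming the goal is to place the Splice Machine jars on the YARN classpath ahead of the bundled ones; YARN_USER_CLASSPATH and YARN_USER_CLASSPATH_FIRST are standard YARN environment variables, but the jar location below is hypothetical:

    # hypothetical location of the Splice Machine jars; adjust to your install
    YARN_USER_CLASSPATH=/opt/splice/default/lib/*
    # load the user classpath before YARN's bundled jars
    YARN_USER_CLASSPATH_FIRST=true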
Due to how YARN manages memory, you need to modify your YARN configuration when bulk-importing large datasets. Make this change in the ResourceManager Advanced Configuration Snippet (Safety Valve) for yarn-site.xml:
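A sketch of the kind of change this step makes, assuming the intent is to raise the largest container YARN will grant; both the property choice and the 32 GB value are assumptions, so size it to your nodes:

    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <!-- illustrative ceiling for a single container request, in MB -->
      <value>32768</value>
    </property>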
You may also need to temporarily make this additional configuration update as a workaround for memory allocation issues. Note that this update is not recommended for production usage, as it affects all YARN jobs and could cause your cluster to become unstable:
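One workaround that matches this warning is disabling YARN's physical and virtual memory checks, which stops YARN from killing containers that exceed their allocation but removes that safety net for every job on the cluster; the snippet below assumes this is the intended change:

    <property>
      <name>yarn.nodemanager.pmem-check-enabled</name>
      <!-- disables physical-memory enforcement for all containers -->
      <value>false</value>
    </property>
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <!-- disables virtual-memory enforcement for all containers -->
      <value>false</value>
    </property>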