The SYSCS_UTIL.BULK_IMPORT_HFILE system procedure imports data into your Splice Machine database by splitting the table or index file into HFiles and then importing those HFiles. The splitting can be managed automatically by this procedure, or you can pre-split the data before calling it.
Unlike our standard
SYSCS_UTIL.IMPORT_DATA procedure, our bulk HFile
procedure does not perform constraint checking while loading your data.
Our Importing Data Tutorial includes topics to help with bulk HFile import:
- Importing Data Tutorial provides an overview of the different data loading procedures you can use, and includes a decision tree to help you determine which is appropriate for your situation.
- Importing Data: Using Bulk HFile Import describes how to use SYSCS_UTIL.BULK_IMPORT_HFILE and the different methods for computing the split keys used to create the HFiles.
- Importing Data: Bulk HFile Examples includes examples of using bulk HFile import.
call SYSCS_UTIL.BULK_IMPORT_HFILE (
    schemaName,
    tableName,
    insertColumnList | null,
    fileOrDirectoryName,
    columnDelimiter | null,
    characterDelimiter | null,
    timestampFormat | null,
    dateFormat | null,
    timeFormat | null,
    maxBadRecords,
    badRecordDirectory | null,
    oneLineRecords | null,
    charset | null,
    bulkImportDirectory,
    skipSampling
);
If you specify skipSampling=true to indicate that you've computed pre-splits for your input data, the parameter values that you pass to the pre-splitting procedure (SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX) must match the values that you pass to this procedure for the same-named parameters.
This table includes a brief description of each parameter; additional information is available in the Import Parameters topic of our Importing Data tutorial.
| Parameter | Description | Example |
|-----------|-------------|---------|
| schemaName | The name of the schema of the table into which to import. | SPLICE |
| tableName | The name of the table into which to import. | playerTeams |
| insertColumnList | The names, in single quotes, of the columns to import. If this is null, all of the columns in the file are imported. | |
| fileOrDirectoryName | Either a single file or a directory. If this is a single file, that file is imported; if this is a directory, all of the files in that directory are imported. You can import compressed or uncompressed files. On a cluster, the files to be imported MUST be on S3, HDFS (or MapR-FS). If you're using our Database Service product, files can only be imported from S3. See the Configuring an S3 Bucket for Splice Machine Access topic for information about accessing data on S3. | |
| columnDelimiter | The character used to separate columns. Specify null if using the comma (,) character as your delimiter. | |
| characterDelimiter | The character used to delimit strings in the imported data. | '"', '''' |
| timestampFormat | The format of timestamps stored in the file. You can set this to null if there are no timestamps in the file, or if the timestamps in the file match the pattern "yyyy-MM-dd HH:mm:ss". All of the timestamps in the file you are importing must use the same format. | yyyy-MM-dd HH:mm:ss |
| dateFormat | The format of datestamps stored in the file. You can set this to null if there are no date columns in the file, or if the dates in the file match the pattern "yyyy-MM-dd". | yyyy-MM-dd |
| timeFormat | The format of time values stored in the file. You can set this to null if there are no time columns in the file, or if the times in the file match the pattern "HH:mm:ss". | HH:mm:ss |
| maxBadRecords | The number of rejected (bad) records that are tolerated before the import fails. If this count of rejected records is reached, the import fails, and any successful record imports are rolled back. Specify 0 to indicate that no bad records are tolerated, and specify -1 to indicate that all bad records should be logged and allowed. | 25 |
| badRecordDirectory | The directory in which bad record information is logged. Splice Machine logs information to the <import_file_name>.bad file in this directory. On a cluster, this directory MUST be on S3, HDFS (or MapR-FS). If you're using our Database Service product, files can only be imported from S3. | |
| oneLineRecords | A Boolean value that specifies whether (true) each record in the import file is contained in one input line, or (false) records can span multiple lines. | |
| charset | The character encoding of the import file. The default value is UTF-8. | null |
| bulkImportDirectory (outputDirectory) | The name of the directory into which the generated HFiles are written prior to being imported into your database. These files will be deleted after the import has finished. | |
| skipSampling | A Boolean value that specifies how the split keys used for the import are computed. Set this to true if you have pre-split your data; set it to false to have this procedure sample your data and compute the splits for you. Pre-splitting allows you more control over the splits, but adds a layer of complexity. You can learn about computing splits for your input data in the Importing Data: Using Bulk HFile Import topic of our Importing Data tutorial. | |
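For example, a minimal call to import a hypothetical CSV file into the SPLICE.playerTeams table used in the examples above might look like the following sketch; the file and directory paths are illustrative only, and most optional parameters are left null to accept the defaults:

call SYSCS_UTIL.BULK_IMPORT_HFILE(
    'SPLICE',            -- schemaName
    'playerTeams',       -- tableName
    null,                -- insertColumnList (null: import all columns)
    '/data/mydata.csv',  -- fileOrDirectoryName (hypothetical path)
    null,                -- columnDelimiter (default: comma)
    null,                -- characterDelimiter (default)
    null, null, null,    -- timestampFormat, dateFormat, timeFormat
    0,                   -- maxBadRecords (fail on the first bad record)
    '/badRecords',       -- badRecordDirectory (hypothetical path)
    true,                -- oneLineRecords
    null,                -- charset (default: UTF-8)
    '/hfiles',           -- bulkImportDirectory (hypothetical path)
    false);              -- skipSampling (let the procedure compute splits)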
Before it generates HFiles, SYSCS_UTIL.BULK_IMPORT_HFILE must use split keys to determine how to split the data file into multiple HFiles. Splitting the file into evenly-sized HFiles yields optimal data loading performance.
You have these choices for determining how the data is split:
- You can call SYSCS_UTIL.BULK_IMPORT_HFILE with the skipSampling parameter set to false; the procedure then samples and analyzes the data in your file and splits the data into temporary HFiles based on that analysis. Example 1: Automatic Splitting shows how to use the automatic splits computation built into this procedure.
- You can pre-split your data by first creating a CSV file that specifies the split keys to use to perform the pre-splits, and then calling SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX to pre-split your table or index file. You then call SYSCS_UTIL.BULK_IMPORT_HFILE with the skipSampling parameter set to true to import the data. Example 2: Computed Pre-Splits shows how to pre-split your data using SPLIT_TABLE_OR_INDEX before performing the import; a sketch of this workflow follows this list.
- If you want even more control over how your data is split into evenly-sized regions, you can specify the row boundaries for pre-splitting yourself in a CSV file. You then supply that file as a parameter to the SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX_AT_POINTS procedure, which performs the pre-splitting, after which you call SYSCS_UTIL.BULK_IMPORT_HFILE with the skipSampling parameter set to true. We recommend that only expert customers use this procedure.
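Here is a rough sketch of the computed pre-splits workflow from the second choice above. The paths and split-key file are hypothetical, and the SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX parameter list is assumed here to mirror the same-named parameters of SYSCS_UTIL.BULK_IMPORT_HFILE, per the note above; see the Importing Data: Using Bulk HFile Import topic for the authoritative syntax.

-- Step 1 (assumed syntax): pre-split the table using split keys read
-- from a hypothetical CSV file with one split key per line.
call SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX(
    'SPLICE',                     -- schemaName
    'playerTeams',                -- tableName
    null,                         -- indexName (null: split the table itself)
    null,                         -- insertColumnList
    '/data/splitKeys.csv',        -- file containing the split keys
    null, null, null, null, null, -- delimiter and format parameters (defaults)
    0,                            -- maxBadRecords
    '/badRecords',                -- badRecordDirectory
    true,                         -- oneLineRecords
    null);                        -- charset

-- Step 2: import with skipSampling set to true because the table is
-- now pre-split; same-named parameter values must match Step 1.
call SYSCS_UTIL.BULK_IMPORT_HFILE(
    'SPLICE', 'playerTeams', null, '/data/mydata.csv',
    null, null, null, null, null,
    0, '/badRecords', true, null,
    '/hfiles',                    -- bulkImportDirectory
    true);                        -- skipSampling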
SYSCS_UTIL.BULK_IMPORT_HFILE automatically deletes the temporary HFiles after the import process has completed.
A few important notes:
Splice Machine advises you to run a full compaction (with the SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_TABLE system procedure) after importing large amounts of data into your database; a sketch follows these notes.
On a cluster, the files to be imported MUST be on S3, HDFS (or MapR-FS), as must the badRecordDirectory directory. If you're using our Database Service product, files can only be imported from S3. In addition, the files must be readable by the hbase user, and the badRecordDirectory directory must be writable by the hbase user, either by setting the user explicitly or by opening up the permissions; for example:
sudo -su hdfs hadoop fs -chmod 777 /badRecordDirectory
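Following the compaction advice above, a post-import compaction of the sample table might look like this sketch (the schema and table names are illustrative):

call SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_TABLE('SPLICE', 'playerTeams');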
SYSCS_UTIL.BULK_IMPORT_HFILE displays a summary of the import
process results that looks like this:
rowsImported |failedRows |files |dataSize |failedLog
-----------------------------------------------------
94           |0          |1     |4720     |NONE
You’ll find examples of using this procedure in the Importing Data: Bulk HFile Examples topic of our Importing Data Tutorial.