The SYSCS_UTIL.BULK_IMPORT_HFILE system procedure imports data into
your Splice Machine database by splitting the table or index file into HFiles and then
importing those HFiles. The splitting can be managed automatically by this procedure, or you
can pre-split the data before calling it.
Unlike our standard
SYSCS_UTIL.IMPORT_DATA procedure, our bulk HFile
procedure does not perform constraint checking while loading your data.
Our Best Practices: Ingestion chapter includes an overview and examples of using bulk HFile import.
call SYSCS_UTIL.BULK_IMPORT_HFILE (
        schemaName,
        tableName,
        insertColumnList | null,
        fileOrDirectoryName,
        columnDelimiter | null,
        characterDelimiter | null,
        timestampFormat | null,
        dateFormat | null,
        timeFormat | null,
        maxBadRecords,
        badRecordDirectory | null,
        oneLineRecords | null,
        charset | null,
        bulkImportDirectory,
        skipSampling
);
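For example, a call that accepts most of the defaults might look like the following sketch; the schema, table, and file paths here are hypothetical, so substitute your own values:

```sql
call SYSCS_UTIL.BULK_IMPORT_HFILE(
    'SPLICE',                  -- schemaName
    'playerTeams',             -- tableName
    null,                      -- insertColumnList: null imports all columns
    '/data/playerTeams.csv',   -- fileOrDirectoryName (hypothetical path)
    null,                      -- columnDelimiter: null uses the comma
    null,                      -- characterDelimiter: null uses the double-quote
    null, null, null,          -- timestampFormat, dateFormat, timeFormat
    0,                         -- maxBadRecords: fail on the first bad record
    '/badRecordDirectory',     -- badRecordDirectory (hypothetical path)
    true,                      -- oneLineRecords: one input line per record
    null,                      -- charset: null uses UTF-8
    '/hfileTemp',              -- bulkImportDirectory (hypothetical path)
    false);                    -- skipSampling: let the procedure compute splits
```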
If you have specified
skipSampling=true to indicate that you’ve computed pre-splits for your input data, the parameter values that you pass to the pre-splitting procedure (SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX) must match the values that you pass to this procedure for the same-named parameters.
This table includes a brief description of each parameter; additional information is available in the Ingestion Parameter Values topic of our Importing Data tutorial.
|schemaName||The name of the schema of the table into which to import.||SPLICE|
|tableName||The name of the table into which to import.||playerTeams|
|insertColumnList||The names, in single quotes, of the columns to import. If this is null, all of the columns in the file are imported. The individual column names in the list must match the column names of the table.|||
|fileOrDirectoryName||Either a single file or a directory. If this is a single file, that file is imported; if this is a directory, all of the files in that directory are imported. You can import compressed or uncompressed files. On a cluster, the files to be imported MUST be on S3, HDFS (or MapR-FS). See the Configuring an S3 Bucket for Splice Machine Access topic for information about accessing data on S3.|||
|columnDelimiter||The character used to separate columns. Specify null if using the comma (,) character as your delimiter.|||
|characterDelimiter||The character used to delimit strings in the imported data.||'"', ''''|
|timestampFormat||The format of timestamps stored in the file. You can set this to null if there are no timestamp columns in the file, or if the format of any timestamps in the file matches the pattern "yyyy-MM-dd HH:mm:ss". All of the timestamps in the file you are importing must use the same format.|||
|dateFormat||The format of datestamps stored in the file. You can set this to null if there are no date columns in the file, or if the format of any dates in the file matches the pattern "yyyy-MM-dd".|||
|timeFormat||The format of time values stored in the file. You can set this to null if there are no time columns in the file, or if the format of any times in the file match pattern: "HH:mm:ss".||HH:mm:ss|
|maxBadRecords||The number of rejected (bad) records that are tolerated before the import fails. If this count of rejected records is reached, the import fails, and any successful record imports are rolled back. Specify 0 to indicate that no bad records are tolerated, and specify -1 to indicate that all bad records should be logged and allowed.||25|
|badRecordDirectory||The directory in which bad record information is logged. On a cluster, this directory MUST be on S3, HDFS (or MapR-FS). If you're using our Database Service product, files can only be imported from S3.|||
|oneLineRecords||A Boolean value that specifies whether (true) each record in the import file is contained in one input line, or (false) records can span multiple lines.|||
|charset||The character encoding of the import file. The default value is UTF-8.||null|
|bulkImportDirectory (outputDirectory)||The name of the directory into which the generated HFiles are written prior to being imported into your database. These files will be deleted after the import has finished.|||
|skipSampling||A Boolean value that specifies whether the procedure skips sampling your data and instead uses splits that you have pre-computed. This allows you more control over the splits, but adds a layer of complexity. You can learn about computing splits for your input data in the Using Bulk HFile Import topic of our Best Practices Guide.|||
Before it generates HFiles,
SYSCS_UTIL.BULK_IMPORT_HFILE must use split keys to determine how to split
the data file into multiple HFiles. Splitting the file into evenly-sized HFiles yields optimal data loading performance.
You have these choices for determining how the data is split:
* You can call SYSCS_UTIL.BULK_IMPORT_HFILE with the skipSampling parameter set to false; the procedure then samples and analyzes the data in your file and splits the data into temporary HFiles based on that analysis. You’ll find an example in the Best Practices: Bulk Importing Flat Files chapter of our Best Practices Guide.
* You can pre-split your data by first creating a CSV file that specifies the split keys to use to perform the pre-splits, and then calling SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX to pre-split your table or index file. You’ll also find an example of this in the Best Practices: Bulk Importing Flat Files chapter of our Best Practices Guide.
SYSCS_UTIL.BULK_IMPORT_HFILE automatically deletes the temporary HFiles after the import process has completed.
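The pre-split approach can be sketched as the following pair of calls. This is only an illustration: the file paths and split-key CSV are hypothetical, and the parameter layout assumed here for SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX should be verified against the reference page for that procedure in your release. Note that the shared, same-named parameter values must match across the two calls.

```sql
-- Pre-split the table using a CSV file of split keys (hypothetical path):
call SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX(
    'SPLICE', 'playerTeams',
    null,                                -- indexName: null splits the table itself
    null,                                -- columnList
    '/data/playerTeamsSplitKeys.csv',    -- CSV file containing the split keys
    null, null,                          -- columnDelimiter, characterDelimiter
    null, null, null,                    -- timestampFormat, dateFormat, timeFormat
    0, '/badRecordDirectory', true, null);

-- Then import, telling the procedure to skip sampling:
call SYSCS_UTIL.BULK_IMPORT_HFILE(
    'SPLICE', 'playerTeams', null, '/data/playerTeams.csv',
    null, null, null, null, null,
    0, '/badRecordDirectory', true, null,
    '/hfileTemp',
    true);                               -- skipSampling: splits already computed
```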
A few important notes:
Splice Machine advises you to run a full compaction (with the
SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_TABLE system procedure) after importing large amounts of data into your database.
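For example, a compaction call after a large import might look like this (the schema and table names are illustrative):

```sql
call SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_TABLE('SPLICE', 'playerTeams');
```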
On a cluster, the files to be imported MUST be on S3, HDFS (or MapR-FS), as must the
badRecordDirectory directory. If you’re using our Database Service product, files can only be imported from S3.
In addition, the files must be readable by the
hbase user, and the
badRecordDirectory directory must be writable by the
hbase user, either by setting the user explicitly, or by opening up the permissions; for example:
sudo -su hdfs hadoop fs -chmod 777 /badRecordDirectory
SYSCS_UTIL.BULK_IMPORT_HFILE displays a summary of the import
process results that looks like this:
rowsImported |failedRows |files |dataSize |failedLog
-------------------------------------------------------------
94           |0          |1     |4720     |NONE
You’ll find examples of using this procedure in the Best Practices: Bulk Importing Flat Files topic of our Best Practices Guide.