SYSCS_UTIL.BULK_IMPORT_HFILE

The SYSCS_UTIL.BULK_IMPORT_HFILE system procedure imports data into your Splice Machine database by splitting the table or index file into HFiles and then importing those HFiles. The splitting can be managed automatically by this procedure, or you can pre-split the data before calling SYSCS_UTIL.BULK_IMPORT_HFILE.

Unlike our standard SYSCS_UTIL.IMPORT_DATA procedure, our bulk HFile procedure does not perform constraint checking while loading your data.

Our Importing Data tutorial includes topics to help with bulk HFile import.

Syntax

call SYSCS_UTIL.BULK_IMPORT_HFILE (
    schemaName,
    tableName,
    insertColumnList | null,
    fileOrDirectoryName,
    columnDelimiter | null,
    characterDelimiter | null,
    timestampFormat | null,
    dateFormat | null,
    timeFormat | null,
    maxBadRecords,
    badRecordDirectory | null,
    oneLineRecords | null,
    charset | null,
    bulkImportDirectory,
    skipSampling
);

If you specify skipSampling=true to indicate that you’ve computed pre-splits for your input data, the parameter values that you pass to the pre-split procedure must match the values that you pass to this procedure for the same-named parameters.
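
For orientation, here's what a call might look like using the sample values from the parameter table below; the schema, table, and file paths are illustrative, not required names:

call SYSCS_UTIL.BULK_IMPORT_HFILE(
    'SPLICE',                         -- schemaName
    'playerTeams',                    -- tableName
    null,                             -- insertColumnList: import all columns
    '/data/mydata/mytable.csv',       -- fileOrDirectoryName
    null,                             -- columnDelimiter: default comma
    null,                             -- characterDelimiter: default
    null,                             -- timestampFormat
    null,                             -- dateFormat
    null,                             -- timeFormat
    0,                                -- maxBadRecords: no bad records tolerated
    '/importErrsDir',                 -- badRecordDirectory
    true,                             -- oneLineRecords
    null,                             -- charset: default UTF-8
    'hdfs:///tmp/test_hfile_import/', -- bulkImportDirectory
    false                             -- skipSampling: compute splits automatically
);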

Parameters

This table includes a brief description of each parameter; additional information is available in the Import Parameters topic of our Importing Data tutorial.

schemaName
    The name of the schema of the table into which to import.
    Example: SPLICE

tableName
    The name of the table into which to import.
    Example: playerTeams

insertColumnList
    The names, in single quotes, of the columns to import. If this is null, all columns are imported.
    Example: 'ID, TEAM'

fileOrDirectoryName
    Either a single file or a directory. If this is a single file, that file is imported; if this is a directory, all of the files in that directory are imported. You can import compressed or uncompressed files.
    On a cluster, the files to be imported MUST be on S3, HDFS, or MapR-FS. If you're using our Database Service product, files can only be imported from S3.
    Examples: /data/mydata/mytable.csv, 's3a://splice-benchmark-data/flat/TPCH/100/region'

columnDelimiter
    The character used to separate columns. Specify null to use the default comma (,) delimiter.
    Examples: '|', '\t'

characterDelimiter
    The character used to delimit strings in the imported data.
    Examples: '"', ''''

timestampFormat
    The format of timestamps stored in the file. You can set this to null if there are no timestamp columns in the file, or if all timestamps in the file match the java.sql.Timestamp default format, which is: "yyyy-MM-dd HH:mm:ss". All of the timestamps in the file you are importing must use the same format.
    Example: 'yyyy-MM-dd HH:mm:ss.SSZ'

dateFormat
    The format of dates stored in the file. You can set this to null if there are no date columns in the file, or if all dates in the file match the pattern "yyyy-MM-dd".
    Example: yyyy-MM-dd

timeFormat
    The format of time values stored in the file. You can set this to null if there are no time columns in the file, or if all times in the file match the pattern "HH:mm:ss".
    Example: HH:mm:ss

maxBadRecords
    The number of rejected (bad) records that are tolerated before the import fails. If this count is reached, the import fails, and any successfully imported records are rolled back. Specify 0 to indicate that no bad records are tolerated, and specify -1 to indicate that all bad records should be logged and allowed.
    Example: 25

badRecordDirectory
    The directory in which bad record information is logged. Splice Machine logs information to the <import_file_name>.bad file in this directory; for example, bad records in an input file named foo.csv would be logged to a file named badRecordDirectory/foo.csv.bad.
    On a cluster, this directory MUST be on S3, HDFS, or MapR-FS. If you're using our Database Service product, this directory must be on S3.
    Example: 'importErrsDir'

oneLineRecords
    A Boolean value that specifies whether (true) each record in the import file is contained in one input line, or (false) a record can span multiple lines.
    Example: true

charset
    The character encoding of the import file. The default value is UTF-8.
    Example: null

bulkImportDirectory (outputDirectory)
    The name of the directory into which the generated HFiles are written prior to being imported into your database. These files are deleted after the import has finished.
    Example: hdfs:///tmp/test_hfile_import/

skipSampling
    A Boolean value that specifies how the split keys used for the bulk HFile import are computed. Set this to false to have SYSCS_UTIL.BULK_IMPORT_HFILE automatically determine splits for you.

    If skipSampling is true, you need to use SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX (recommended) or SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX_AT_POINTS (for expert users) to pre-compute splits for your table before calling SYSCS_UTIL.BULK_IMPORT_HFILE. This gives you more control over the splits, but adds a layer of complexity; you can learn about computing splits for your input data in the Importing Data: Using Bulk HFile Import topic of our Importing Data tutorial.

    If skipSampling is false, SYSCS_UTIL.BULK_IMPORT_HFILE samples your input data and computes the table splits for you by performing the following steps. It:

      1. Scans (samples) the data.
      2. Collects a rowkey histogram.
      3. Uses that histogram to calculate the split keys for the table.
      4. Uses the calculated split keys to split the table into HFiles.

    Example: false
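
As an illustration of the delimiter and format parameters above, a pipe-delimited file with zoned timestamps might be imported as follows; the bucket, paths, and table names are hypothetical:

call SYSCS_UTIL.BULK_IMPORT_HFILE(
    'SPLICE', 'playerTeams', null,
    's3a://my-bucket/data/playerTeams/',  -- a directory: every file in it is imported
    '|',                                  -- pipe-delimited columns
    '"',                                  -- strings delimited with double quotes
    'yyyy-MM-dd HH:mm:ss.SSZ',            -- timestamps include a zone offset
    'yyyy-MM-dd',                         -- dateFormat
    'HH:mm:ss',                           -- timeFormat
    -1,                                   -- log and allow all bad records
    's3a://my-bucket/importErrsDir/',     -- bad-record logs, also on S3
    true, null,
    'hdfs:///tmp/test_hfile_import/',     -- temporary HFile directory
    false);                               -- sample the data and compute splits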

Usage

Before it generates HFiles, SYSCS_UTIL.BULK_IMPORT_HFILE must use split keys to determine how to split the data file into multiple HFiles. Splitting the file into evenly-sized HFiles yields optimal data loading performance.

You have these choices for determining how the data is split:

  • You can call SYSCS_UTIL.BULK_IMPORT_HFILE with the skipSampling parameter set to false; this procedure then samples and analyzes the data in your file and splits the data into temporary HFiles based on that analysis. Example 1: Automatic Splitting shows how to use the automatic splits computation built into the SYSCS_UTIL.BULK_IMPORT_HFILE procedure.

  • You can pre-split your data by first creating a CSV file that specifies the split keys to use to perform the pre-splits, and then calling the SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX procedure to pre-split your table or index. You then call SYSCS_UTIL.BULK_IMPORT_HFILE with the skipSampling parameter set to true to import the data; a sketch of this flow follows this list. Example 2: Computed Pre-Splits shows how to pre-split your data using SPLIT_TABLE_OR_INDEX before performing the import.

  • If you want even more control over how your data is split into evenly-sized regions, you can specify the row boundaries for the pre-splits yourself in a CSV file. You then supply that file as a parameter to the SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX_AT_POINTS procedure, which performs the pre-splitting; after that, you call SYSCS_UTIL.BULK_IMPORT_HFILE with the skipSampling parameter set to true. We recommend that only expert users use this procedure.
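
Here is a rough sketch of the second approach. The table, key column, split points, and file paths are all hypothetical, and the exact SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX parameter list should be confirmed against that procedure's documentation:

-- splitKeys.csv (hypothetical): one primary-key value per split point, e.g.
--   250000
--   500000
--   750000

-- Step 1: pre-split the table from the split-key file. Per the note in the
-- Syntax section, the delimiter, format, and bad-record parameters must match
-- the import call below.
call SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX(
    'SPLICE', 'playerTeams',
    null,                            -- indexName: null when splitting a table
    'ID',                            -- key column(s) the split values refer to
    '/data/mydata/splitKeys.csv',    -- the CSV of split keys
    null, null, null, null, null,    -- delimiters and date/time formats
    0, '/importErrsDir', true, null);

-- Step 2: import with skipSampling=true so the pre-computed splits are used
call SYSCS_UTIL.BULK_IMPORT_HFILE(
    'SPLICE', 'playerTeams', null, '/data/mydata/mytable.csv',
    null, null, null, null, null,
    0, '/importErrsDir', true, null,
    'hdfs:///tmp/test_hfile_import/', true);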

SYSCS_UTIL.BULK_IMPORT_HFILE automatically deletes the temporary HFiles after the import process has completed.

Usage Notes

A few important notes:

  • Splice Machine advises you to run a full compaction (with the SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_TABLE system procedure) after importing large amounts of data into your database; see the example after these notes.

  • On a cluster, the files to be imported MUST be on S3, HDFS (or MapR-FS), as must the badRecordDirectory directory. If you’re using our Database Service product, files can only be imported from S3.

    In addition, the files must be readable by the hbase user, and the badRecordDirectory directory must be writable by the hbase user, either by setting the user explicitly or by opening up the permissions; for example:

    sudo -su hdfs hadoop fs -chmod 777 /badRecordDirectory
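
For example, after a large bulk import into the illustrative table used in this topic, you could compact it with:

call SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_TABLE('SPLICE', 'playerTeams');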

Results

SYSCS_UTIL.BULK_IMPORT_HFILE displays a summary of the import process results that looks like this:

rowsImported   |failedRows   |files   |dataSize   |failedLog
-------------------------------------------------------------
94             |0            |1       |4720       |NONE

Examples

You’ll find examples of using this procedure in the Bulk HFile Import Examples topic of our Importing Data tutorial.

See Also