SYSCS_UTIL.IMPORT_DATA

The SYSCS_UTIL.IMPORT_DATA system procedure imports each record of the input data as a new record in a table. You can import all or a subset of the columns from the input data into your database using the insertColumnList parameter.

After a successful import completes, a simple report displays, showing how many files were imported and how many record imports succeeded or failed.

Selecting an Import Procedure

Splice Machine provides four system procedures for importing data:

  • This procedure, SYSCS_UTIL.IMPORT_DATA, imports each input record into a new record in your database.
  • The SYSCS_UTIL.UPSERT_DATA_FROM_FILE procedure updates existing records and adds new records to your database. It only differs from SYSCS_UTIL.MERGE_DATA_FROM_FILE in that upserting overwrites the generated or default value of a column that is not specified in your insertColumnList parameter when updating a record.
  • The SYSCS_UTIL.MERGE_DATA_FROM_FILE procedure updates existing records and adds new records to your database. It only differs from SYSCS_UTIL.UPSERT_DATA_FROM_FILE in that merging does not overwrite the generated or default value of a column that is not specified in your insertColumnList parameter when updating a record.
  • The SYSCS_UTIL.BULK_IMPORT_HFILE procedure takes advantage of HBase bulk loading to import table data into your database by temporarily converting the table file that you’re importing into HFiles, importing those directly into your database, and then removing the temporary HFiles. This procedure has improved performance for large tables; however, the bulk HFile import requires extra work on your part and lacks constraint checking.

Our Importing Data Tutorial includes a decision tree and brief discussion to help you determine which procedure best meets your needs.

Syntax

call SYSCS_UTIL.IMPORT_DATA (
        schemaName,
        tableName,
        insertColumnList | null,
        fileOrDirectoryName,
        columnDelimiter | null,
        characterDelimiter | null,
        timestampFormat | null,
        dateFormat | null,
        timeFormat | null,
        badRecordsAllowed,
        badRecordDirectory | null,
        oneLineRecords | null,
        charset | null
        );
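As a minimal illustration, the call below imports every column of a hypothetical CSV file using the default formats and delimiters; the schema name, table name, and file path here are placeholders, not part of this documentation's example database:

    splice> CALL SYSCS_UTIL.IMPORT_DATA('MYSCHEMA', 'MYTABLE',
        null,                 -- import all columns
        '/data/mytable.csv',  -- a single input file
        null, null,           -- default comma delimiter and quote character
        null, null, null,     -- no timestamp, date, or time columns
        0,                    -- fail on the first bad record
        null, true, null);    -- no bad record log; one-line records; UTF-8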

Parameters

The following table summarizes the parameters used by SYSCS_UTIL.IMPORT_DATA and other Splice Machine data importation procedures. Each parameter name links to a more detailed description in our Importing Data Tutorial.

Table Info

  schemaName
      The name of the schema of the table into which to import.
      Example: SPLICE

  tableName
      The name of the table into which to import.
      Example: playerTeams

Data Location

  insertColumnList
      The names, in single quotes, of the columns to import. If this is null, all columns are imported.
      Example: 'ID, TEAM'

  fileOrDirectoryName
      Either a single file or a directory. If this is a single file, that file is imported; if this is a directory, all of the files in that directory are imported. You can import compressed or uncompressed files.

      The SYSCS_UTIL.MERGE_DATA_FROM_FILE procedure only works with single files; you cannot specify a directory name when calling SYSCS_UTIL.MERGE_DATA_FROM_FILE.

      On a cluster, the files to be imported MUST be on S3, HDFS, or MapR-FS. If you're using our Database Service product, files can only be imported from S3.
      Examples: /data/mydata/mytable.csv, 's3a://splice-benchmark-data/flat/TPCH/100/region'

Data Formats

  oneLineRecords
      A Boolean value that specifies whether each record in the import file is contained in one input line (true) or can span multiple lines (false).
      Example: true

  charset
      The character encoding of the import file. The default value is UTF-8.
      Example: null

  columnDelimiter
      The character used to separate columns. Specify null to use the comma (,) character as your delimiter.
      Example: '|'

  characterDelimiter
      The character used to delimit strings in the imported data.
      Example: '"'

  timestampFormat
      The format of timestamps stored in the file. You can set this to null if there are no timestamp columns in the file, or if the format of any timestamps in the file matches the java.sql.Timestamp default format, which is: "yyyy-MM-dd HH:mm:ss".

      All of the timestamps in the file you are importing must use the same format.
      Example: 'yyyy-MM-dd HH:mm:ss.SSZ'

  dateFormat
      The format of dates stored in the file. You can set this to null if there are no date columns in the file, or if the format of any dates in the file matches the pattern "yyyy-MM-dd".
      Example: yyyy-MM-dd

  timeFormat
      The format of time values stored in the file. You can set this to null if there are no time columns in the file, or if the format of any times in the file matches the pattern "HH:mm:ss".
      Example: HH:mm:ss

Problem Logging

  badRecordsAllowed
      The number of rejected (bad) records that are tolerated before the import fails. If this count of rejected records is reached, the import fails, and any successful record imports are rolled back. Specify 0 to indicate that no bad records are tolerated, and specify -1 to indicate that all bad records should be logged and allowed.
      Example: 25

  badRecordDirectory
      The directory in which bad record information is logged. Splice Machine logs information to the <import_file_name>.bad file in this directory; for example, bad records in an input file named foo.csv would be logged to a file named badRecordDirectory/foo.csv.bad.

      On a cluster, this directory MUST be on S3, HDFS, or MapR-FS. If you're using our Database Service product, this directory must be on S3.
      Example: 'importErrsDir'

Bulk HFile Import

  bulkImportDirectory (outputDirectory)
      For SYSCS_UTIL.BULK_IMPORT_HFILE, this is the name of the directory into which the generated HFiles are written prior to being imported into your database.

      For the SYSCS_UTIL.COMPUTE_SPLIT_KEY procedure, where this parameter is named outputDirectory, it specifies the directory into which the split keys are written.
      Example: hdfs:///tmp/test_hfile_import/

  skipSampling
      A Boolean value that specifies how the split keys used for the bulk HFile import are computed. Set this to false to have SYSCS_UTIL.BULK_IMPORT_HFILE automatically determine splits for you.

      This parameter is only used with the SYSCS_UTIL.BULK_IMPORT_HFILE system procedure.
      Example: false
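To illustrate the problem logging parameters, this hypothetical call tolerates any number of bad records and logs them to an HDFS directory; the schema, table, and paths are placeholders:

    splice> CALL SYSCS_UTIL.IMPORT_DATA('MYSCHEMA', 'MYTABLE',
        null,
        'hdfs:///data/mytable.csv',
        null, null, null, null, null,
        -1,                       -- log and allow all bad records
        'hdfs:///badRecordsDir',  -- rejected rows go to mytable.csv.bad here
        true, null);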

Results

SYSCS_UTIL.IMPORT_DATA displays a summary of the import process results that looks like this:

rowsImported   |failedRows   |files   |dataSize   |failedLog
-------------------------------------------------------------
94             |0            |1       |4720       |NONE

This procedure also logs rejected record activity into .bad files in the badRecordDirectory directory; one file for each imported file.

Importing and Updating Records

What distinguishes SYSCS_UTIL.IMPORT_DATA from the similar SYSCS_UTIL.UPSERT_DATA_FROM_FILE and SYSCS_UTIL.MERGE_DATA_FROM_FILE procedures is how each works with these specific conditions:

  • You are importing only a subset of data from the input data into your table, either because the table contains fewer columns than the input file does, or because you’ve specified a subset of the columns in your insertColumnList parameter.
  • Inserting and updating data in a column with generated values.
  • Inserting and updating data in a column with default values.
  • Handling of missing values.

The Importing Data Tutorial: Input Handling topic describes how each of these conditions is handled by the different system procedures.
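For comparison, here is a sketch of the same kind of import run as an upsert, which updates existing rows rather than failing on primary key collisions. The names and path are placeholders, and this assumes (per the procedure descriptions above) that SYSCS_UTIL.UPSERT_DATA_FROM_FILE takes the same parameter list as SYSCS_UTIL.IMPORT_DATA:

    splice> CALL SYSCS_UTIL.UPSERT_DATA_FROM_FILE('MYSCHEMA', 'MYTABLE',
        'ID, TEAM',            -- subset of columns to insert or update
        '/data/mytable.csv',
        null, null, null, null, null, 0, null, true, null);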

Record Import Failure Reasons

Typical reasons for a row (record) import to fail include:

  • Improper data expected for a column.
  • Improper number of columns of data.
  • A primary key violation: SYSCS_UTIL.IMPORT_DATA only works correctly if the table into which you are inserting/updating has primary keys.

Usage Notes

A few important notes:

  • Splice Machine advises you to run a full compaction (with the SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_TABLE system procedure) after importing large amounts of data into your database.

  • On a cluster, the files to be imported MUST be on S3, HDFS (or MapR-FS), as must the badRecordDirectory directory. If you’re using our Database Service product, files can only be imported from S3.

    In addition, the files must be readable by the hbase user, and the badRecordDirectory directory must be writable by the hbase user, either by setting the user explicitly or by opening up the permissions; for example:

    sudo -su hdfs hadoop fs -chmod 777 /badRecordDirectory
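For instance, the full compaction recommended above can be run as follows; the schema and table names here are placeholders:

    splice> CALL SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_TABLE('MYSCHEMA', 'MYTABLE');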

Examples

This section presents a couple of simple examples.

The Importing Data Usage Examples topic contains a more extensive set of examples.

Example 1: Importing our doc examples player data

This example shows the IMPORT_DATA call used to import the Players table into our documentation examples database:

splice> CALL SYSCS_UTIL.IMPORT_DATA('SPLICEBBALL', 'Players',
    'ID, Team, Name, Position, DisplayName, BirthDate',
    '/Data/DocExamplesDb/Players.csv',
    null, null, null, null, null, 0, null, true, null);
rowsImported        |failedRows          |files      |dataSize            |failedLog
--------------------------------------------------------------------------------------
94                  |0                   |1          |4720                |NONE
1 row selected

Example 2: Specifying a timestamp format for an entire table

Use a single timestamp format for the entire table by explicitly specifying a single timestampFormat. For example, suppose your input file contains the following records:

Mike,2013-04-21 09:21:24.98-05
Mike,2013-04-21 09:15:32.78-04
Mike,2013-03-23 09:45:00.68-05

You can then import the data with the following call:

splice> CALL SYSCS_UTIL.IMPORT_DATA('app','tabx','c1,c2',
   '/path/to/ts3.csv',
   ',', '''',
   'yyyy-MM-dd HH:mm:ss.SSZ',
   null, null, 0, null, true, null);

Note that for any of the import cases shown above, the time shown in the imported table depends on the server's timezone setting. In other words, given the same csv file, if it is imported on servers set to different time zones, the values shown in the table will differ. Additionally, daylight saving time may account for a 1-hour difference if a timezone is specified.

See Importing Data Usage Examples for more examples.

See Also