Importing Data: Specifying the Import Parameter Values

This topic first shows you the syntax of each of the four import procedures, and then provides detailed information about the input parameters you need to specify when calling one of the Splice Machine data ingestion procedures.

Import Procedures Syntax

Three of our four data import procedures use identical parameters:

SYSCS_UTIL.IMPORT_DATA (
SYSCS_UTIL.UPSERT_DATA_FROM_FILE (
SYSCS_UTIL.MERGE_DATA_FROM_FILE (
    schemaName,
    tableName,
    insertColumnList | null,
    fileOrDirectoryName,
    columnDelimiter | null,
    characterDelimiter | null,
    timestampFormat | null,
    dateFormat | null,
    timeFormat | null,
    badRecordsAllowed,
    badRecordDirectory | null,
    oneLineRecords | null,
    charset | null
  );

The fourth procedure, SYSCS_UTIL.BULK_IMPORT_HFILE, adds a couple extra parameters at the end:

SYSCS_UTIL.BULK_IMPORT_HFILE
  ( schemaName,
    tableName,
    insertColumnList | null,
    fileName,
    columnDelimiter | null,
    characterDelimiter | null,
    timestampFormat | null,
    dateFormat | null,
    timeFormat | null,
    maxBadRecords,
    badRecordDirectory | null,
    oneLineRecords | null,
    charset | null,
    bulkImportDirectory,
    skipSampling
  );

Overview of Parameters Used in Import Procedures

All of the Splice Machine data import procedures share a number of parameters that describe the table into which you’re importing data, a number of input data format details, and how to handle problematic records.

The following table summarizes these parameters. Each parameter name links to its reference description, found below the table:

Category Parameter Description Example Value
Table Info schemaName The name of the schema of the table in which to import. SPLICE
tableName The name of the table in which to import playerTeams
Data Location insertColumnList The names, in single quotes, of the columns to import. If this is null, all columns are imported. 'ID, TEAM'
fileOrDirectoryName

Either a single file or a directory. If this is a single file, that file is imported; if this is a directory, all of the files in that directory are imported. You can import compressed or uncompressed files.

The SYSCS_UTIL.MERGE_DATA_FROM_FILE procedure only works with single files; you cannot specify a directory name when calling SYSCS_UTIL.MERGE_DATA_FROM_FILE.

On a cluster, the files to be imported MUST be on S3, HDFS (or MapR-FS). If you're using our Database Service product, files can only be imported from S3.

/data/mydata/mytable.csv

's3a://splice-benchmark-data/flat/TPCH/100/region'

Data Formats oneLineRecords A Boolean value that specifies whether (true) each record in the import file is contained in one input line, or (false) if a record can span multiple lines. true
charset The character encoding of the import file. The default value is UTF-8. null
columnDelimiter The character used to separate columns, Specify null if using the comma (,) character as your delimiter. '|'
characterDelimiter The character is used to delimit strings in the imported data. '"'
timestampFormat

The format of timestamps stored in the file. You can set this to null if there are no time columns in the file, or if the format of any timestamps in the file match the Java.sql.Timestamp default format, which is: "yyyy-MM-dd HH:mm:ss".

All of the timestamps in the file you are importing must use the same format.

'yyyy-MM-dd HH:mm:ss.SSZ'

dateFormat The format of datestamps stored in the file. You can set this to null if there are no date columns in the file, or if the format of any dates in the file match pattern: "yyyy-MM-dd". yyyy-MM-dd
timeFormat The format of time values stored in the file. You can set this to null if there are no time columns in the file, or if the format of any times in the file match pattern: "HH:mm:ss". HH:mm:ss
Problem Logging badRecordsAllowed The number of rejected (bad) records that are tolerated before the import fails. If this count of rejected records is reached, the import fails, and any successful record imports are rolled back. Specify 0 to indicate that no bad records are tolerated, and specify -1 to indicate that all bad records should be logged and allowed. 25
badRecordDirectory

The directory in which bad record information is logged. Splice Machine logs information to the <import_file_name>.bad file in this directory; for example, bad records in an input file named foo.csv would be logged to a file named badRecordDirectory/foo.csv.bad.

On a cluster, this directory MUST be on S3, HDFS (or MapR-FS). If you're using our Database Service product, files can only be imported from S3.

'importErrsDir'
Bulk HFile Import bulkImportDirectory (outputDirectory)

For SYSCS_UTIL.BULK_IMPORT_HFILE, this is the name of the directory into which the generated HFiles are written prior to being imported into your database.

For the SYSCS_UTIL.COMPUTE_SPLIT_KEY procedure, where it is named outputDirectory, this parameter specifies the directory into which the split keys are written.

hdfs:///tmp/test_hfile_import/
skipSampling

The skipSampling parameter is a Boolean value that specifies how you want the split keys used for the bulk HFile import to be computed. Set to false to have SYSCS_UTIL.BULK_IMPORT_HFILE automatically determine splits for you.

This parameter is only used with the SYSCS_UTIL.BULK_IMPORT_HFILE system procedure.

false

Import Parameters Reference

This section provides reference documentation for all of the data importation parameters.

schemaName

The schemaName is a string that specifies the name of the schema of the table into which you are importing data.

Example: SPLICE

tableName

The tableName is a string that specifies the name of the table into which you are importing data.

Example: playerTeams

insertColumnList

The insertColumnList parameter is a string that specifies the names, in single quotes, of the columns you wish to import. If this is null, all columns are imported.

  • If you don't specify an insertColumnList and your input file contains more columns than are in the table, then the the extra columns at the end of each line in the input file are ignored. For example, if your table contains columns (a, b, c) and your file contains columns (a, b, c, d, e), then the data in your file's d and e columns will be ignored.
  • If you do specify an insertColumnList, and the number of columns doesn't match your table, then any other columns in your table will be replaced by the default value for the table column (or NULL if there is no default for the column). For example, if your table contains columns (a, b, c) and you only want to import columns (a, c), then the data in table's b column will be replaced with the default value for that column.

Example: ID, TEAM

See Importing and Updating Records for additional information about handling of missing, generated, and default values during data importation.

fileOrDirectoryName

The fileOrDirectoryName (or fileName) parameter is a string that specifies the location of the data that you’re importing. This parameter is slightly different for different procedures:

  • For the SYSCS_UTIL.UPSERT_DATA_FROM_FILE or SYSCS_UTIL.UPSERT_DATA_FROM_FILE procedures, this is either a single file or a directory. If this is a single file, that file is imported; if this is a directory, all of the files in that directory are imported.

  • For the SYSCS_UTIL.MERGE_DATA_FROM_FILE and SYSCS_UTIL.BULK_IMPORT_HFILE procedure, this can only be a single file (directories are not allowed).

On a cluster, the files to be imported MUST be on S3, HDFS (or MapR-FS), as must the badRecordDirectory directory. If you’re using our Database Service product, files can only be imported from S3. The files must also be readable by the hbase user.

Example: data/mydata/mytable.csv

Importing from S3

If you are importing data that is stored in an S3 bucket on AWS, you need to specify the data location in an s3a URL that includes access key information.

Example: s3a://splice-benchmark-data/flat/TPCH/100/region

See Specifying Your Input Data Location for additional information about specifying your input data location.

Importing Compressed Files

Note that files can be compressed or uncompressed, including BZIP2 compressed files.

Importing multiple files at once improves parallelism, and thus speeds up the import process. Uncompressed files can be imported faster than compressed files. When using compressed files, the compression algorithm makes a difference; for example,

  • gzip-compressed files cannot be split during importation, which means that import work on such files cannot be performed in parallel.
  • In contrast, bzip2-compressed files can be split and thus can be imported using parallel tasks. Note that bzip2 is CPU intensive compared to LZ4 or LZ0, but is faster than gzip because files can be split.

oneLineRecords

The oneLineRecords parameter is a Boolean value that specifies whether each line in the import file contains one complete record:

  • If you specify true or null, then each record is expected to be found on a single line in the file.
  • If you specify false, records can span multiple lines in the file.

Multi-line record files are slower to load, because the file cannot be split and processed in parallel; if you import a directory of multiple line files, each file as a whole is processed in parallel, but no splitting takes place.

Example: true

charset

The charset parameter is a string that specifies the character encoding of the import file. The default value is UTF-8.

Currently, any value other than UTF-8 is ignored, and UTF-8 is used.

Example: null

columnDelimiter

The columnDelimiter parameter is a string that specifies the character used to separate columns, You can specify null if using the comma (,) character as your delimiter.

In addition to using plain text characters, you can specify the following special characters as delimiters:

Special character Display
\t Tab
\f Formfeed
\b Backspace
\\ Backslash
^a (or ^A)

Control-a

If you are using a script file from the splice> command line, your script can contain the actual Control-a character as the value of this parameter.

Example: '|'

See Column Delimiters for additional information about column delimiters.

characterDelimiter

The characterDelimiter parameter is a string that specifies which character is used to delimit strings in the imported data. You can specify null or the empty string to use the default string delimiter, which is the double-quote (").

In addition to using plain text characters, you can specify the following special characters as delimiters:

Special character Display
\t Tab
\f Formfeed
\b Backspace
\\ Backslash
^a (or ^A)

Control-a

If you are using a script file from the splice> command line, your script can contain the actual Control-a character as the value of this parameter.

Notes:

  • If your input contains control characters such as newline characters, make sure that those characters are embedded within delimited strings.

  • To use the single quote (') character as your string delimiter, you need to escape that character. This means that you specify four quotes ('''') as the value of this parameter. This is standard SQL syntax.

Example: ''

See Character Delimeters for additional information about character delimiters.

timestampFormat

The timestampFormat parameter specifies the format of timestamps in your input data. You can set this to null if either:

  • there are no time columns in the file
  • all time stamps in the input match the Java.sql.Timestamp default format, which is: "yyyy-MM-dd HH:mm:ss".

All of the timestamps in the file you are importing must use the same format.

Splice Machine uses the following Java date and time pattern letters to construct timestamps:

Pattern Letter Description Format(s)
y year yy or yyyy
M month MM
d day in month dd
h hour (0-12) hh
H hour (0-23) HH
m minute in hour mm
s seconds ss
S tenths of seconds

S, SS, SSS, SSSS, SSSSS or SSSSSS*

*Specify SSSSSS to allow a variable number (any number) of digits after the decimal point.

z time zone text e.g. Pacific Standard time
Z time zone, time offset e.g. -0800

The default timestamp format for Splice Machine imports is: yyyy-MM-dd HH:mm:ss, which uses a 24-hour clock, does not allow for decimal digits of seconds, and does not allow for time zone specification.

The standard Java library does not support microsecond precision, so you cannot specify millisecond (S) values in a custom timestamp format and import such values with the desired precision.

Timestamps and Importing Data at Different Locations

Note that timestamp values are relative to the geographic location at which they are imported, or more specifically, relative to the timezone setting and daylight saving time status where the data is imported.

This means that timestamp values from the same data file may appear differently after being imported in different timezones.

Examples

The following tables shows valid examples of timestamps and their corresponding format (parsing) patterns:

Timestamp value Format Pattern Notes
2013-03-23 09:45:00 yyyy-MM-dd HH:mm:ss This is the default pattern.
2013-03-23 19:45:00.98-05 yyyy-MM-dd HH:mm:ss.SSZ This pattern allows up to 2 decimal digits of seconds, and requires a time zone specification.
2013-03-23 09:45:00-07 yyyy-MM-dd HH:mm:ssZ This patterns requires a time zone specification, but does not allow for decimal digits of seconds.
2013-03-23 19:45:00.98-0530 yyyy-MM-dd HH:mm:ss.SSZ This pattern allows up to 2 decimal digits of seconds, and requires a time zone specification.

2013-03-23 19:45:00.123

2013-03-23 19:45:00.12

yyyy-MM-dd HH:mm:ss.SSS

This pattern allows up to 3 decimal digits of seconds, but does not allow a time zone specification.

Note that if your data specifies more than 3 decimal digits of seconds, an error occurs.

2013-03-23 19:45:00.1298 yyyy-MM-dd HH:mm:ss.SSSS This pattern allows up to 4 decimal digits of seconds, but does not allow a time zone specification.

See Time and Date Formats in Input Records for additional information about date, time, and timestamp values.

dateFormat

The dateFormat parameter specifies the format of datestamps stored in the file. You can set this to null if either:

  • there are no date columns in the file
  • the format of any dates in the input match this pattern: "yyyy-MM-dd".

Example: yyyy-MM-dd

See Time and Date Formats in Input Records for additional information about date, time, and timestamp values.

timeFormat

The timeFormat parameter specifies the format of time values in your input data. You can set this to null if either:

  • there are no time columns in the file
  • the format of any times in the input match this pattern: "HH:mm:ss".

Example: HH:mm:ss

See Time and Date Formats in Input Records for additional information about date, time, and timestamp values.

badRecordsAllowed

The badRecordsAllowed parameter is integer value that specifies the number of rejected (bad) records that are tolerated before the import fails. If this count of rejected records is reached, the import fails, and any successful record imports are rolled back.

These values have special meaning:

  • If you specify -1 as the value of this parameter, all record import failures are tolerated and logged.
  • If you specify 0 as the value of this parameter, the import will fail if even one record is bad.

Example: 25

badRecordDirectory

The badRecordDirectory parameter is a string that specifies the directory in which bad record information is logged. The default value is the directory in which the import files are found.

Splice Machine logs information to the <import_file_name>.bad file in this directory; for example, bad records in an input file named foo.csv would be logged to a file named badRecordDirectory/foo.csv.bad.

The badRecordDirectory directory must be writable by the hbase user, either by setting the user explicity, or by opening up the permissions; for example:

sudo -su hdfs hadoop fs -chmod 777 /badRecordDirectory

Example: 'importErrsDir'

bulkImportDirectory

This parameter is only used with the SYSCS_UTIL.BULK_IMPORT_HFILE system procedure.

The bulkImportDirectory parameter is a string that specifies the name of the directory into which the generated HFiles are written prior to being imported into your database. The generated files are automatically removed after they’ve been imported.

Example: 'hdfs:///tmp/test_hfile_import/'

Please review the Bulk HFile Import Walkthrough topic to understand how importing bulk HFiles works.

skipSampling

This parameter is only used with the SYSCS_UTIL.BULK_IMPORT_HFILE system procedure.

The skipSampling parameter is a Boolean value that specifies how you want the split keys used for the bulk HFile import to be computed:

  • If skipSampling is true, you need to use our   SYSCS_UTIL.COMPUTE_SPLIT_KEY and SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX_AT_POINTS system procedures to manually split your table before calling SYSCS_UTIL.BULK_IMPORT_HFILE. This allows you more control over the splits, but adds a layer of complexity.

  • If skipSampling is false, then SYSCS_UTIL.BULK_IMPORT_HFILE samples your input data and computes the table splits for you, in the following steps. It:

    1. Scans (sample) the data
    2. Collects a rowkey histogram
    3. Uses that histogram to calculate the split key for the table
    4. Uses the calculated split key to split the table into HFiles

Example: false

Please review the Bulk HFile Import Walkthrough topic to understand how importing bulk HFiles works.

See Also