Importing Data: Specifying the Import Parameter Values

This topic first shows you the syntax of each of the four Splice Machine import procedures, and then provides detailed information about the input parameters you specify when calling them.

Import Procedures Syntax

Three of our four data import procedures use identical parameters:

SYSCS_UTIL.IMPORT_DATA (
SYSCS_UTIL.UPSERT_DATA_FROM_FILE (
SYSCS_UTIL.MERGE_DATA_FROM_FILE (
    schemaName,
    tableName,
    insertColumnList | null,
    fileOrDirectoryName,
    columnDelimiter | null,
    characterDelimiter | null,
    timestampFormat | null,
    dateFormat | null,
    timeFormat | null,
    badRecordsAllowed,
    badRecordDirectory | null,
    oneLineRecords | null,
    charset | null
  );
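
For example, a complete call to SYSCS_UTIL.IMPORT_DATA might look like the following; the schema, table, and file names here are illustrative:

    splice> CALL SYSCS_UTIL.IMPORT_DATA(
        'SPLICE',                    -- schemaName
        'playerTeams',               -- tableName
        null,                        -- insertColumnList: import all columns
        '/data/mydata/mytable.csv',  -- fileOrDirectoryName
        null,                        -- columnDelimiter: default comma
        null,                        -- characterDelimiter: default double-quote
        null,                        -- timestampFormat: default format
        null,                        -- dateFormat: default format
        null,                        -- timeFormat: default format
        0,                           -- badRecordsAllowed: fail on the first bad record
        'importErrsDir',             -- badRecordDirectory
        true,                        -- oneLineRecords: each record is on one line
        null);                       -- charset: default UTF-8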

The fourth procedure, SYSCS_UTIL.BULK_IMPORT_HFILE, adds two extra parameters at the end:

SYSCS_UTIL.BULK_IMPORT_HFILE (
    schemaName,
    tableName,
    insertColumnList | null,
    fileName,
    columnDelimiter | null,
    characterDelimiter | null,
    timestampFormat | null,
    dateFormat | null,
    timeFormat | null,
    maxBadRecords,
    badRecordDirectory | null,
    oneLineRecords | null,
    charset | null,
    bulkImportDirectory,
    skipSampling
  );
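
A call to SYSCS_UTIL.BULK_IMPORT_HFILE follows the same pattern, with the two extra parameters at the end; again, the names and paths here are illustrative:

    splice> CALL SYSCS_UTIL.BULK_IMPORT_HFILE(
        'SPLICE', 'playerTeams', null,
        '/data/mydata/mytable.csv',
        null, null, null, null, null,     -- delimiters and formats: use defaults
        0,                                -- maxBadRecords
        'importErrsDir', true, null,
        'hdfs:///tmp/test_hfile_import/', -- bulkImportDirectory
        false);                           -- skipSampling: compute splits automatically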

Overview of Parameters Used in Import Procedures

All of the Splice Machine data import procedures share a set of parameters that describe the table into which you're importing data, the format of the input data, and how to handle problematic records.

The following list summarizes these parameters, grouped by category. Each parameter name links to its reference description, found below the list:

Table Info

  • schemaName — The name of the schema of the table into which to import.
    Example: SPLICE
  • tableName — The name of the table into which to import.
    Example: playerTeams

Data Location

  • insertColumnList — The names, in single quotes, of the columns to import. If this is null, all columns are imported.
    Example: 'ID, TEAM'
  • fileOrDirectoryName — Either a single file or a directory. If this is a single file, that file is imported; if this is a directory, all of the files in that directory are imported. You can import compressed or uncompressed files. The SYSCS_UTIL.MERGE_DATA_FROM_FILE procedure only works with single files; you cannot specify a directory name when calling SYSCS_UTIL.MERGE_DATA_FROM_FILE. On a cluster, the files to be imported MUST be on S3, HDFS, or MapR-FS. If you're using our Database Service product, files can only be imported from S3.
    Examples: /data/mydata/mytable.csv, 's3a://splice-benchmark-data/flat/TPCH/100/region'

Data Formats

  • oneLineRecords — A Boolean value that specifies whether (true) each record in the import file is contained in one input line, or (false) a record can span multiple lines.
    Example: true
  • charset — The character encoding of the import file. The default value is UTF-8.
    Example: null
  • columnDelimiter — The character used to separate columns. Specify null if using the comma (,) character as your delimiter.
    Examples: '|', '\t'
  • characterDelimiter — The character used to delimit strings in the imported data.
    Examples: '"', ''''
  • timestampFormat — The format of timestamps stored in the file. You can set this to null if there are no timestamp columns in the file, or if the format of any timestamps in the file matches the java.sql.Timestamp default format, which is "yyyy-MM-dd HH:mm:ss". All of the timestamps in the file you are importing must use the same format.
    Example: 'yyyy-MM-dd HH:mm:ss.SSZ'
  • dateFormat — The format of dates stored in the file. You can set this to null if there are no date columns in the file, or if the format of any dates in the file matches the pattern "yyyy-MM-dd".
    Example: yyyy-MM-dd
  • timeFormat — The format of time values stored in the file. You can set this to null if there are no time columns in the file, or if the format of any times in the file matches the pattern "HH:mm:ss".
    Example: HH:mm:ss

Problem Logging

  • badRecordsAllowed — The number of rejected (bad) records that are tolerated before the import fails. If this count of rejected records is reached, the import fails, and any successful record imports are rolled back. Specify 0 to indicate that no bad records are tolerated, and -1 to indicate that all bad records should be logged and allowed.
    Example: 25
  • badRecordDirectory — The directory in which bad record information is logged. Splice Machine logs information to the <import_file_name>.bad file in this directory; for example, bad records in an input file named foo.csv would be logged to a file named badRecordDirectory/foo.csv.bad. On a cluster, this directory MUST be on S3, HDFS, or MapR-FS. If you're using our Database Service product, this directory must be on S3.
    Example: 'importErrsDir'

Bulk HFile Import

  • bulkImportDirectory — For SYSCS_UTIL.BULK_IMPORT_HFILE, the name of the directory into which the generated HFiles are written prior to being imported into your database.
    Example: hdfs:///tmp/test_hfile_import/
  • skipSampling — A Boolean value that specifies how the split keys used for the bulk HFile import are computed. Set this to false to have SYSCS_UTIL.BULK_IMPORT_HFILE automatically determine splits for you. This parameter is only used with the SYSCS_UTIL.BULK_IMPORT_HFILE system procedure.
    Example: false

Import Parameters Reference

This section provides reference documentation for all of the data import parameters.

schemaName

The schemaName is a string that specifies the name of the schema of the table into which you are importing data.

Example: SPLICE

tableName

The tableName is a string that specifies the name of the table into which you are importing data.

Example: playerTeams

insertColumnList

The insertColumnList parameter is a string that specifies the names, in single quotes, of the columns you wish to import. If this is null, all columns are imported.

  • If you don't specify an insertColumnList and your input file contains more columns than are in the table, then the extra columns at the end of each line in the input file are ignored. For example, if your table contains columns (a, b, c) and your file contains columns (a, b, c, d, e), then the data in your file's d and e columns is ignored.
  • If you do specify an insertColumnList, and the number of columns in the list doesn't match the number of columns in your table, then any table columns not named in the list are filled with the column's default value (or NULL if the column has no default). For example, if your table contains columns (a, b, c) and you only import columns (a, c), then the table's b column is filled with its default value.

Example: 'ID, TEAM'

See Importing and Updating Records for additional information about handling of missing, generated, and default values during data importation.
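
As an illustrative sketch, assume a table created with columns (ID, TEAM, SCORE). The following call imports only the ID and TEAM columns; the SCORE column is filled with its default value (or NULL if it has no default):

    splice> CALL SYSCS_UTIL.IMPORT_DATA('SPLICE', 'playerTeams',
        'ID, TEAM',                  -- insertColumnList: import only these columns
        '/data/mydata/mytable.csv',
        null, null, null, null, null,
        0, 'importErrsDir', true, null);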

fileOrDirectoryName

The fileOrDirectoryName (or fileName) parameter is a string that specifies the location of the data that you’re importing. This parameter is slightly different for different procedures:

  • For the SYSCS_UTIL.IMPORT_DATA and SYSCS_UTIL.UPSERT_DATA_FROM_FILE procedures, this is either a single file or a directory. If this is a single file, that file is imported; if this is a directory, all of the files in that directory are imported.

  • For the SYSCS_UTIL.MERGE_DATA_FROM_FILE and SYSCS_UTIL.BULK_IMPORT_HFILE procedures, this can only be a single file (directories are not allowed).

On a cluster, the files to be imported MUST be on S3, HDFS, or MapR-FS, as must the badRecordDirectory directory. If you're using our Database Service product, files can only be imported from S3. The files must also be readable by the hbase user.

Example: data/mydata/mytable.csv

Importing from S3

If you are importing data that is stored in an S3 bucket on AWS, you need to specify the data location in an s3a URL that includes access key information.

Example: s3a://splice-benchmark-data/flat/TPCH/100/region

See Specifying Your Input Data Location for additional information about specifying your input data location.
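
One possible form, assuming your Hadoop configuration accepts credentials embedded in the URL, places the access key and secret key directly in the s3a URL; the key values and delimiter here are placeholders:

    splice> CALL SYSCS_UTIL.IMPORT_DATA('SPLICE', 'REGION', null,
        's3a://ACCESS_KEY:SECRET_KEY@splice-benchmark-data/flat/TPCH/100/region',
        '|',                         -- columnDelimiter: assuming pipe-delimited input
        null, null, null, null,
        0, 'importErrsDir', true, null);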

Importing Compressed Files

Files to be imported can be compressed or uncompressed, including BZIP2-compressed files.

Importing multiple files at once improves parallelism, and thus speeds up the import process. Uncompressed files can be imported faster than compressed files. When using compressed files, the compression algorithm makes a difference; for example:

  • gzip-compressed files cannot be split during importation, which means that import work on such files cannot be performed in parallel.
  • In contrast, bzip2-compressed files can be split and thus can be imported using parallel tasks. Note that bzip2 is CPU-intensive compared to LZ4 or LZO, but imports faster than gzip because its files can be split.

oneLineRecords

The oneLineRecords parameter is a Boolean value that specifies whether each line in the import file contains one complete record:

  • If you specify true or null, then each record is expected to be found on a single line in the file.
  • If you specify false, records can span multiple lines in the file.

Multi-line record files are slower to load because such a file cannot be split and processed in parallel; if you import a directory of multi-line files, the files are processed in parallel with one another, but no individual file is split.

Example: true
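
For example, a quoted string value that contains a newline makes a record span two physical lines. A small illustrative input file:

    1,"Boston
    Red Sox"
    2,"New York Yankees"

To import this file, set oneLineRecords to false (other values here are illustrative):

    splice> CALL SYSCS_UTIL.IMPORT_DATA('SPLICE', 'playerTeams', null,
        '/data/mydata/multiline.csv',
        null, null, null, null, null,
        0, 'importErrsDir',
        false,                       -- oneLineRecords: records can span lines
        null);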

charset

The charset parameter is a string that specifies the character encoding of the import file. The default value is UTF-8.

Currently, any value other than UTF-8 is ignored, and UTF-8 is used.

Example: null

columnDelimiter

The columnDelimiter parameter is a string that specifies the character used to separate columns. You can specify null if using the comma (,) character as your delimiter.

In addition to using plain text characters, you can specify the following special characters as delimiters:

Special character   Display
'\t'                Tab
'\f'                Formfeed
'\b'                Backspace
'\\'                Backslash
'^a' (or '^A')      Control-a*
''''                Single quote (')

*If you are using a script file from the splice> command line, your script can contain the actual Control-a character as the value of this parameter.

Notes:

  • To use the single quote (') character as your column delimiter, you need to escape that character. This means that you specify four quotes ('''') as the value of this parameter. This is standard SQL syntax.

Example: '|'
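
For example, to import a pipe-delimited file (file name illustrative):

    splice> CALL SYSCS_UTIL.IMPORT_DATA('SPLICE', 'playerTeams', null,
        '/data/mydata/mytable.txt',
        '|',                         -- columnDelimiter: columns separated by pipes
        null, null, null, null,
        0, 'importErrsDir', true, null);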

characterDelimiter

The characterDelimiter parameter is a string that specifies which character is used to delimit strings in the imported data. You can specify null or the empty string to use the default string delimiter, which is the double-quote (").

In addition to using plain text characters, you can specify the following special characters as delimiters:

Special character   Display
'\t'                Tab
'\f'                Formfeed
'\b'                Backspace
'\\'                Backslash
'^a' (or '^A')      Control-a*
''''                Single quote (')

*If you are using a script file from the splice> command line, your script can contain the actual Control-a character as the value of this parameter.

Notes:

  • If your input contains control characters such as newline characters, make sure that those characters are embedded within delimited strings.

  • To use the single quote (') character as your string delimiter, you need to escape that character. This means that you specify four quotes ('''') as the value of this parameter. This is standard SQL syntax.

Example: '"'
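
For example, if strings in your input file are delimited with single quotes, you pass the escaped single quote as the characterDelimiter (file name illustrative):

    splice> CALL SYSCS_UTIL.IMPORT_DATA('SPLICE', 'playerTeams', null,
        '/data/mydata/mytable.csv',
        null,
        '''',                        -- characterDelimiter: strings delimited by single quotes
        null, null, null,
        0, 'importErrsDir', true, null);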

timestampFormat

The timestampFormat parameter specifies the format of timestamps in your input data. You can set this to null if either:

  • there are no timestamp columns in the file
  • all of the timestamps in the input match the java.sql.Timestamp default format, which is: "yyyy-MM-dd HH:mm:ss".

All of the timestamps in the file you are importing must use the same format.

Splice Machine uses the following Java date and time pattern letters to construct timestamps:

Pattern Letter   Description              Format(s)
y                year                     yy or yyyy
M                month                    MM
d                day in month             dd
h                hour (0-12)              hh
H                hour (0-23)              HH
m                minute in hour           mm
s                seconds                  ss
S                fractional seconds       S, SS, SSS, SSSS, SSSSS, or SSSSSS*
z                time zone text           e.g. Pacific Standard Time
Z                time zone, time offset   e.g. -0800

*Specify SSSSSS to allow a variable number (any number) of digits after the decimal point.

The default timestamp format for Splice Machine imports is: yyyy-MM-dd HH:mm:ss, which uses a 24-hour clock, does not allow for decimal digits of seconds, and does not allow for time zone specification.
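
For example, to import a file whose timestamps look like 2013-03-23 19:45:00.98-05, you would pass the matching pattern as the timestampFormat parameter (other values here are illustrative):

    splice> CALL SYSCS_UTIL.IMPORT_DATA('SPLICE', 'playerTeams', null,
        '/data/mydata/mytable.csv',
        null, null,
        'yyyy-MM-dd HH:mm:ss.SSZ',   -- timestampFormat: matches the input data
        null, null,
        0, 'importErrsDir', true, null);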

The standard Java library does not support microsecond precision, so even if you specify additional fractional-second (S) digits in a custom timestamp format, those values cannot be imported with microsecond precision.

Converting Out-of-Range Timestamp Values Upon Import

The Splice Machine Timestamp data type has a range of 1677-09-21 00:12:43.147 GMT to 2262-04-11 23:47:16.853 GMT. If you import data that falls outside this range (some customers have used dummy timestamp values like 9999-01-01 in their data), your import will fail because the value is out of range for a timestamp. Note that this is not an issue with Date values.

You can work around this issue by telling Splice Machine to convert any out-of-range timestamp values that you're importing to in-range timestamp values. To do so, follow these steps:

  1. Set the derby.database.convertOutOfRangeTimeStamps value to true as follows:
        CALL SYSCS_UTIL.SYSCS_SET_GLOBAL_DATABASE_PROPERTY('derby.database.convertOutOfRangeTimeStamps', 'true');
    
  2. Clear the Splice Machine compiled statement cache as follows:
        CALL SYSCS_UTIL.SYSCS_EMPTY_STATEMENT_CACHE();
    
  3. Import your data.

Once you’ve set the property and cleared the cache, when you import data into Splice Machine, any timestamp value in the imported data that is less than the minimum value (1677-09-21 00:12:43.147 GMT) is converted to that minimum value, and any timestamp value greater than the maximum value (2262-04-11 23:47:16.853 GMT) is converted to that maximum value.

Timestamps and Importing Data at Different Locations

Note that timestamp values are relative to the geographic location at which they are imported, or more specifically, relative to the timezone setting and daylight saving time status where the data is imported.

This means that timestamp values from the same data file may appear differently after being imported in different timezones.

Examples

The following examples show valid timestamp values and their corresponding format (parsing) patterns:

  • 2013-03-23 09:45:00
    Pattern: yyyy-MM-dd HH:mm:ss. This is the default pattern.

  • 2013-03-23 19:45:00.98-05
    Pattern: yyyy-MM-dd HH:mm:ss.SSZ. This pattern allows up to 2 decimal digits of seconds, and requires a time zone specification.

  • 2013-03-23 09:45:00-07
    Pattern: yyyy-MM-dd HH:mm:ssZ. This pattern requires a time zone specification, but does not allow decimal digits of seconds.

  • 2013-03-23 19:45:00.98-0530
    Pattern: yyyy-MM-dd HH:mm:ss.SSZ. This pattern allows up to 2 decimal digits of seconds, and requires a time zone specification.

  • 2013-03-23 19:45:00.123 or 2013-03-23 19:45:00.12
    Pattern: yyyy-MM-dd HH:mm:ss.SSS. This pattern allows up to 3 decimal digits of seconds, but does not allow a time zone specification. If your data specifies more than 3 decimal digits of seconds, an error occurs.

  • 2013-03-23 19:45:00.1298
    Pattern: yyyy-MM-dd HH:mm:ss.SSSS. This pattern allows up to 4 decimal digits of seconds, but does not allow a time zone specification.

See Time and Date Formats in Input Records for additional information about date, time, and timestamp values.

dateFormat

The dateFormat parameter specifies the format of datestamps stored in the file. You can set this to null if either:

  • there are no date columns in the file
  • all dates in the input match the pattern "yyyy-MM-dd".

Example: yyyy-MM-dd

See Time and Date Formats in Input Records for additional information about date, time, and timestamp values.

timeFormat

The timeFormat parameter specifies the format of time values in your input data. You can set this to null if either:

  • there are no time columns in the file
  • all times in the input match the pattern "HH:mm:ss".

Example: HH:mm:ss

See Time and Date Formats in Input Records for additional information about date, time, and timestamp values.

badRecordsAllowed

The badRecordsAllowed parameter is an integer value that specifies the number of rejected (bad) records that are tolerated before the import fails. If this count of rejected records is reached, the import fails, and any successful record imports are rolled back.

These values have special meaning:

  • If you specify -1 as the value of this parameter, all record import failures are tolerated and logged.
  • If you specify 0 as the value of this parameter, the import will fail if even one record is bad.

Example: 25

badRecordDirectory

The badRecordDirectory parameter is a string that specifies the directory in which bad record information is logged. The default value is the directory in which the import files are found.

Splice Machine logs information to the <import_file_name>.bad file in this directory; for example, bad records in an input file named foo.csv would be logged to a file named badRecordDirectory/foo.csv.bad.

The badRecordDirectory directory must be writable by the hbase user, either by setting the user explicitly, or by opening up the permissions; for example:

sudo -su hdfs hadoop fs -chmod 777 /badRecordDirectory

Example: 'importErrsDir'
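
For example, the following call (paths illustrative) tolerates any number of bad records, logging them to importErrsDir/mytable.csv.bad rather than failing the import:

    splice> CALL SYSCS_UTIL.IMPORT_DATA('SPLICE', 'playerTeams', null,
        '/data/mydata/mytable.csv',
        null, null, null, null, null,
        -1,                          -- badRecordsAllowed: log all bad records, never fail
        'importErrsDir',             -- badRecordDirectory
        true, null);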

bulkImportDirectory

This parameter is only used with the SYSCS_UTIL.BULK_IMPORT_HFILE system procedure.

The bulkImportDirectory parameter is a string that specifies the name of the directory into which the generated HFiles are written prior to being imported into your database. The generated files are automatically removed after they’ve been imported.

Example: 'hdfs:///tmp/test_hfile_import/'

If you’re using this procedure with our On-Premise database product, on a cluster with Cloudera Key Management Service (KMS) enabled, there are a few extra configuration steps required. Please see this troubleshooting note for details.

Please review the Importing Data: Using Bulk HFile Import topic to understand how importing bulk HFiles works.

skipSampling

This parameter is only used with the SYSCS_UTIL.BULK_IMPORT_HFILE system procedure.

The skipSampling parameter is a Boolean value that specifies how you want the split keys used for the bulk HFile import to be computed:

  • If skipSampling is true, you need to use our SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX system procedure to compute splits for your table before calling SYSCS_UTIL.BULK_IMPORT_HFILE. This allows you more control over the splits, but adds a layer of complexity. You can learn about computing splits for your input data in the Importing Data: Using Bulk HFile Import topic of this tutorial; a sketch of this flow follows the example value below.

  • If skipSampling is false, then SYSCS_UTIL.BULK_IMPORT_HFILE samples your input data and computes the table splits for you. It performs the following steps:

    1. Scans (samples) the data
    2. Collects a rowkey histogram
    3. Uses that histogram to calculate the split key for the table
    4. Uses the calculated split key to split the table into HFiles

Example: false
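
As a sketch of the skipSampling=true flow (table, file, and directory names are hypothetical; see the Importing Data: Using Bulk HFile Import topic for the parameters of the split procedure):

    -- First, compute the split keys for the table with
    -- SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX, as described in the
    -- Importing Data: Using Bulk HFile Import topic.
    -- Then run the bulk import with skipSampling set to true:
    splice> CALL SYSCS_UTIL.BULK_IMPORT_HFILE(
        'SPLICE', 'playerTeams', null,
        '/data/mydata/mytable.csv',
        null, null, null, null, null,
        0, 'importErrsDir', true, null,
        'hdfs:///tmp/test_hfile_import/',
        true);                       -- skipSampling: splits were pre-computed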

Please review the Bulk HFile Import Walkthrough topic to understand how importing bulk HFiles works.
