Importing Data: Input Considerations

This topic provides detailed information about how the parameter values you specify when importing data are handled by Splice Machine’s built-in import procedures.

For a summary of our import procedures and guidance on which to use, please see Importing Data: Overview.

For reference descriptions of the parameters used by those import procedures, please see Importing Data: Parameter Usage.

This topic includes the following sections:

  • Specifying Your Input Data Location: How to specify the location of your input data when importing.
  • Input Data File Format: Information about input data files, including importing compressed files and multi-line records.
  • Delimiters in Your Input Data: The use of column and character delimiters in your input data.
  • Time and Date Formats in Input Records: All about the date, time, and timestamp values in your input data.
  • Importing and Updating Records: Importing new records and updating existing database records, including the handling of missing, generated, and default values in the input data.
  • Importing CLOBs and BLOBs: Importing CLOBs and BLOBs into your Splice Machine database.
  • Scripting Your Imports: How to script your import processes.

Specifying Your Input Data Location

Some customers get confused by the fileOrDirectoryName parameter that’s used in our import procedures. How you use it depends on whether you are importing a single file or a directory of files, and whether you’re importing data into the standalone version or a cluster version of Splice Machine. This section contains these three subsections:

Standalone Version Input File Path

If you are running a standalone environment, the path is to a file or directory on the local file system. For example:

/users/myname/mydata/mytable.csv
/users/myname/mydatadir

HBase Input File Path

If you are running this on a cluster, the path is to a file or directory on HDFS (or the MapR File System). For example:

/data/mydata/mytable.csv
/data/myname/mydatadir

AWS S3 Input File Path

Finally, if you’re importing data from an S3 bucket, you need to supply your AWS access and secret key codes, and you need to specify an s3a URL. This is also true for logging bad record information to an S3 bucket directory, as will be the case when using our Database-as-Service product.

For information about configuring Splice Machine access on AWS, please review our Configuring an S3 Bucket for Splice Machine Access topic, which walks you through using your AWS dashboard to generate and apply the necessary credentials.

Once you’ve established your access keys, you can include them inline; for example:

call SYSCS_UTIL.IMPORT_DATA ('TPCH', 'REGION', null, 's3a://(access key):(secret key)@splice-benchmark-data/flat/TPCH/100/region', '|', null, null, null, null, -1, 's3a://(access key):(secret key)@splice-benchmark-data/flat/TPCH/100/importLog', true, null);

Alternatively, you can specify the keys once in the core-site.xml file on your cluster, and then simply specify the s3a URL; for example:

call SYSCS_UTIL.IMPORT_DATA ('TPCH', 'REGION', null, 's3a://splice-benchmark-data/flat/TPCH/100/region', '|', null, null, null, null, 0, '/BAD', true, null);

To add your access and secret access keys to the core-site.xml file, define the fs.s3a.awsAccessKeyId and fs.s3a.awsSecretAccessKey properties in that file:

<property>
   <name>fs.s3a.awsAccessKeyId</name>
   <value>access key</value>
</property>
<property>
   <name>fs.s3a.awsSecretAccessKey</name>
   <value>secret key</value>
</property>

Input Data File Format

This section contains the following information about the format of the input data files that you’re importing:

Importing Compressed Files

We recommend importing files that are either uncompressed, or have been compressed with bz2 or lz4 compression.

If you import files compressed with gzip, Splice Machine cannot split the file and distribute its contents across your cluster nodes to take advantage of parallel processing, which means that import performance suffers significantly with gzip files.

Importing Multi-line Records

If your data contains line feed characters like CTRL-M, you need to set the oneLineRecords parameter to false. Splice Machine will accommodate the line feeds; however, the import will take longer because Splice Machine will not be able to break the file up and distribute it across the cluster.

To improve import performance, avoid including line feed characters in your data and set the oneLineRecords parameter to true.
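
For example, here’s a minimal sketch of such a call; the schema, table, and paths are hypothetical, and the oneLineRecords parameter (which follows the badRecordDirectory parameter) is set to false:

-- Import a file whose records span multiple lines;
-- oneLineRecords, the parameter after the bad record directory, is false.
call SYSCS_UTIL.IMPORT_DATA ('SPLICE', 'MYTABLE', null, '/data/mytable/multiline.csv', null, null, null, null, null, 0, '/BAD/mytable', false, null);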

Importing Large Datasets in Groups of Files

If you have a lot of data (hundreds of millions or billions of records), you may be tempted to create one massive file that contains all of your records and import that file. Splice Machine recommends against this; instead, we urge you to manage your data in smaller files. Specifically, we suggest that you split your data into files that:

  • are approximately 40 GB in size
  • contain approximately 50 million records, depending on how wide your table is

If you have a lot of files, group them into multiple directories, and import each directory individually. For example, here is a structure our Customer Success engineers like to use:

  • /data/mytable1/group1
  • /data/mytable1/group2
  • /data/mytable1/group3
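
You can then import each group directory with its own procedure call; for example, here’s a sketch using a hypothetical schema, table, and bad record directory:

-- One call per group directory; each call imports every file in that directory.
call SYSCS_UTIL.IMPORT_DATA ('SPLICE', 'MYTABLE1', null, '/data/mytable1/group1', null, null, null, null, null, 0, '/BAD/mytable1', true, null);
call SYSCS_UTIL.IMPORT_DATA ('SPLICE', 'MYTABLE1', null, '/data/mytable1/group2', null, null, null, null, null, 0, '/BAD/mytable1', true, null);
call SYSCS_UTIL.IMPORT_DATA ('SPLICE', 'MYTABLE1', null, '/data/mytable1/group3', null, null, null, null, null, 0, '/BAD/mytable1', true, null);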

If you are importing a lot of data, our SYSCS_UTIL.BULK_IMPORT_HFILE bulk import procedure greatly improves data loading performance by splitting the data into HFiles, importing them, and then deleting the HFiles. You can have SYSCS_UTIL.BULK_IMPORT_HFILE use sampling to determine the keys for splitting your data, or you can use the SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX procedure to compute the splits and then call the bulk import procedure. For more information, see the Importing Data: Using Bulk HFile Import topic of this tutorial.
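
Here’s a minimal sketch of a sampling-based bulk import. This assumes that SYSCS_UTIL.BULK_IMPORT_HFILE takes the same parameters as SYSCS_UTIL.IMPORT_DATA, followed by a staging directory for the generated HFiles and a flag that controls whether sampling is skipped; see Importing Data: Parameter Usage for the authoritative parameter list. The table, input path, and staging directory are hypothetical:

-- Bulk-import a directory of files, letting sampling compute the splits;
-- '/hfile-staging' is a hypothetical staging directory for the HFiles,
-- and the final 'false' means sampling is NOT skipped.
call SYSCS_UTIL.BULK_IMPORT_HFILE ('SPLICE', 'MYTABLE1', null, '/data/mytable1/group1', null, null, null, null, null, 0, '/BAD/mytable1', true, null, '/hfile-staging', false);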

Delimiters in Your Input Data

This section discusses the delimiters that you use in your input data, in these subsections:

Use Special Characters for Delimiters

One common gotcha we see in customer imports is input data that includes a character you’ve designated as a column or character delimiter. You’ll end up with records in your bad record directory and can spend hours trying to determine the issue, only to discover that the data includes a delimiter character. This can happen with columns that contain data such as product descriptions.

Column Delimiters

The standard column delimiter is a comma (,); however, we’ve all worked with string data that contains commas and have learned to use a different column delimiter. Some customers use the pipe (|) character, but frequently discover that it also appears in some descriptive data in the table they’re importing.

In addition to using plain text characters, you can specify the following special characters as delimiters:

Special character   Display
\t                  Tab
\f                  Formfeed
\b                  Backspace
\\                  Backslash
^a (or ^A)          Control-a

If you are using a script file from the splice> command line, your script can contain the actual Control-a character as the value of this parameter.

We recommend using a control character like CTRL-A for your column delimiter. This is known as the SOH character, and is represented by 0x01 in hexadecimal. Unfortunately, there’s no way to enter this character from the keyboard in the Splice Machine command line interface; instead, you need to create a script file and type the control character using a text editor like vi or vim:

  • Open your script file in vi or vim.
  • Enter into INSERT mode.
  • Type CTRL-V then CTRL-A for the value of the column delimiter parameter in your procedure call. Note that this typically echoes as ^A when you type it in vi or vim.
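
With the control character in place, the call in your script might look like the following sketch, in which ^A stands for the actual Control-a character you typed; the schema, table, and paths are hypothetical:

-- '^A' represents the literal Control-a character typed via CTRL-V CTRL-A,
-- used here as the column delimiter (the fifth parameter).
call SYSCS_UTIL.IMPORT_DATA ('SPLICE', 'MYTABLE', null, '/data/mytable/data.csv', '^A', null, null, null, null, 0, '/BAD/mytable', true, null);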

Character Delimiters

By default, the character delimiter is a double quote. This can produce the same kind of problems that we see with using a comma for the column delimiter: column values that include embedded quotes or use the double quote as the symbol for inches. You can use escape characters to include the embedded quotes, but it’s easier to use a special character for your delimiter.

We recommend using a control character like CTRL-G for your character delimiter. Unfortunately, there’s no way to enter this character from the keyboard in the Splice Machine command line interface; instead, you need to create a script file and type the control character using a text editor like vi or vim:

  • Open your script file in vi or vim.
  • Enter into INSERT mode.
  • Type CTRL-V then CTRL-G for the value of the character delimiter parameter in your procedure call. Note that this typically echoes as ^G when you type it in vi or vim.
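
The resulting call might look like the following sketch, in which ^G stands for the actual Control-g character you typed; the schema, table, and paths are hypothetical:

-- '^G' represents the literal Control-g character typed via CTRL-V CTRL-G,
-- used here as the character delimiter (the sixth parameter).
call SYSCS_UTIL.IMPORT_DATA ('SPLICE', 'MYTABLE', null, '/data/mytable/data.csv', ',', '^G', null, null, null, 0, '/BAD/mytable', true, null);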

Time and Date Formats in Input Records

Perhaps the most common difficulty that customers have with importing their data is with date, time, and timestamp values.

Splice Machine adheres to the Java SimpleDateFormat syntax for all date, time, and timestamp values. SimpleDateFormat is described here:

https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html

Splice Machine’s implementation of SimpleDateFormat is case-sensitive; this means, for example, that a lowercase h is used to represent an hour value between 1 and 12 (with an am/pm marker), whereas an uppercase H is used to represent an hour between 0 and 23.

All Values Must Use the Same Format

Splice Machine’s Import procedures only allow you to specify one format each for the date, time, and timestamp columns in the table data you are importing. This means that, for example, every date in the table data must be in the same format.

  • All of the Date values in the file (or group of files) you are importing must use the same date format.
  • All of the Time values in the file (or group of files) you are importing must use the same time format.
  • All of the Timestamp values in the file (or group of files) you are importing must use the same timestamp format.
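
For example, here’s a sketch of a call that declares all three formats explicitly; the timestamp, date, and time format parameters follow the character delimiter parameter, and the schema, table, paths, and formats shown are hypothetical:

-- Every timestamp must match 'yyyy-MM-dd HH:mm:ss.SSS', every date 'yyyy-MM-dd',
-- and every time 'HH:mm:ss' throughout the imported file or directory.
call SYSCS_UTIL.IMPORT_DATA ('SPLICE', 'MYTABLE', null, '/data/mytable/data.csv', null, null, 'yyyy-MM-dd HH:mm:ss.SSS', 'yyyy-MM-dd', 'HH:mm:ss', 0, '/BAD/mytable', true, null);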

Converting Out-of-Range Timestamp Values Upon Import

The Splice Machine Timestamp data type has a range of 1677-09-21 00:12:43.147 GMT to 2262-04-11 23:47:16.853 GMT. If you import data that falls outside this range (some customers have used dummy timestamp values like 9999-01-01 in their data), your import will fail because the value is out of range for a timestamp. Note that this is not an issue with Date values.

We have a very simple workaround: you can tell Splice Machine to convert any out-of-range timestamp values that you’re importing to in-range timestamp values. To do so, follow these steps:

  1. Set the derby.database.convertOutOfRangeTimeStamps value to true as follows:
        CALL SYSCS_UTIL.SYSCS_SET_GLOBAL_DATABASE_PROPERTY('derby.database.convertOutOfRangeTimeStamps', 'true');
    
  2. Clear the Splice Machine compiled statement cache as follows:
        CALL SYSCS_UTIL.SYSCS_EMPTY_STATEMENT_CACHE();
    
  3. Import your data.

Once you’ve set the property and cleared the cache, when you import data into Splice Machine, any timestamp value in the imported data that is less than the minimum value (1677-09-21 00:12:43.147 GMT) is converted to that minimum value, and any timestamp value greater than the maximum value (2262-04-11 23:47:16.853 GMT) is converted to that maximum value.

Additional Notes

A few additional notes:

  • Splice Machine suggests that, if your data contains any date or timestamp values that are not in the format yyyy-MM-dd HH:mm:ss, you create a simple table that has just one or two columns and test importing the format, as sketched below. This is a simple way to confirm that the imported data is what you expect.
  • Detailed information about each of these data types is found in our SQL Reference Manual.
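
A minimal sketch of such a test, using a hypothetical one-column table and sample file:

-- Create a throwaway table with a single TIMESTAMP column,
-- then import a small sample file to verify that the format parses as expected.
CREATE TABLE tsTest (ts TIMESTAMP);
call SYSCS_UTIL.IMPORT_DATA ('SPLICE', 'TSTEST', null, '/data/tstest/sample.csv', null, null, 'yyyy-MM-dd HH:mm:ss', null, null, 0, '/BAD/tstest', true, null);

After the import, select from the table and compare the values against your source file.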

Importing and Updating Records

This section describes certain aspects of how records are imported and updated when you import data into your database, including these subsections:

Inserting and Updating Column Values When Importing Data

This section summarizes what happens when you are importing, upserting, or merging records into a database table, based on:

  • Whether you are importing a new record or updating an existing record.
  • If the column is specified in your insertColumnList parameter.
  • If the table column is a generated value or has a default value.

The important difference in actions taken when importing data occurs when you are updating an existing record with UPSERT or MERGE and your column list does not contain the name of a table column:

  • For newly inserted records, the default or auto-generated value is always inserted, as usual.
  • If you are updating an existing record in the table with UPSERT, the default or auto-generated value in that record is overwritten with a new value.
  • If you are updating an existing record in the table with MERGE, the column value is not updated.

Importing a New Record Into a Database Table

The following table shows the actions taken when you are importing new records into a table in your database. These actions are the same for all three import procedures (IMPORT, UPSERT, and MERGE):

Column included in insertColumnList?   Table column conditions   Action Taken
YES                                    N/A                       Import value inserted into table column if valid; if not valid, a bad record error is logged.
NO                                     Has Default Value         Default value is inserted into table column.
NO                                     Is Generated Value        Generated value is inserted into table column.
NO                                     None                      NULL is inserted into table column.

The table below shows what happens with default and generated column values when adding new records to a table using one of our import procedures; we use an example database table created with this statement:

CREATE TABLE myTable (
                colA INT,
                colB CHAR(12) DEFAULT 'myDefaultVal',
                colC INT);

insertColumnList   Values in import record   Values inserted into database   Notes
"colA,colB,colC"   1,,2                      [1,NULL,2]
"colA,colB,colC"   3,de,4                    [3,de,4]
"colA,colB,colC"   1,2,                      Error: column B wrong type
"colA,colB,colC"   1,DEFAULT,2               [1,"DEFAULT",2]                 DEFAULT is imported as a literal value
Empty              1,,2                      [1,myDefaultVal,2]
Empty              3,de,4                    [3,de,4]
Empty              1,2,                      Error: column B wrong type
"colA,colC"        1,2                       [1,myDefaultVal,2]
"colA,colC"        3,4                       [3,myDefaultVal,4]

Note that the value DEFAULT in the imported file is not interpreted to mean that the default value should be applied to that column; instead:

  • If the target column in your database has a string data type, such as CHAR or VARCHAR, the literal value "DEFAULT" is inserted into your database.
  • If the target column is not a string data type, an error occurs.

Importing Into a Table that Contains Generated or Default Values

When you export a table with generated columns to a file, the actual column values are exported, so importing that same file into a different database will accurately replicate the original table values.

If you are importing previously exported records into a table with a generated column, and you want to import some records with actual values and apply generated or default values to other records, you need to split your import file into two files and import each:

  • Import the file containing records with non-default values with the column name included in the insertColumnList.
  • Import the file containing records with default values with the column name excluded from the insertColumnList.
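
For example, here’s a sketch using the myTable definition shown earlier, in which colB has a default value; the file paths are hypothetical:

-- Records that supply their own value for colB: include colB in the column list.
call SYSCS_UTIL.IMPORT_DATA ('SPLICE', 'MYTABLE', 'colA,colB,colC', '/data/mytable/with-values.csv', null, null, null, null, null, 0, '/BAD/mytable', true, null);
-- Records that should receive the default for colB: omit colB from the column list.
call SYSCS_UTIL.IMPORT_DATA ('SPLICE', 'MYTABLE', 'colA,colC', '/data/mytable/defaults.csv', null, null, null, null, null, 0, '/BAD/mytable', true, null);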

Updating a Table Record with UPSERT

The following table shows the action taken when you are using the SYSCS_UTIL.UPSERT_DATA_FROM_FILE procedure to update an existing record in a database table:

Column included in insertColumnList?   Table column conditions   Action Taken
YES                                    N/A                       Import value updated in table column if valid; if not valid, a bad record error is logged.
NO                                     Has Default Value         Table column is overwritten with the default value.
NO                                     Is Generated Value        Table column is overwritten with a newly generated value.
NO                                     None                      Table column is overwritten with NULL.

Updating a Table Record with MERGE

The following table shows the action taken when you are using the SYSCS_UTIL.MERGE_DATA_FROM_FILE procedure to update an existing record in a database table:

Column included in insertColumnList?   Table column conditions   Action Taken
YES                                    N/A                       Import value updated in table column if valid; if not valid, a bad record error is logged.
NO                                     N/A                       Table column is not updated.

Importing CLOBs and BLOBs

When importing CLOBs, be sure to review these tips to avoid common problems:

  • Be sure that the data you’re importing does not include a special character that you’ve designated as a column or character delimiter. Otherwise, you’ll end up with records in your bad record directory and can spend hours trying to determine the issue, only to discover that the data includes a delimiter character.
  • If your data contains line feed characters like CTRL-M, you need to set the oneLineRecords parameter to false to allow Splice Machine to properly handle the data; however, the import will take longer because Splice Machine will not be able to break the file up and distribute it across the cluster. To improve import performance, avoid including line feed characters in your data and set the oneLineRecords parameter to true.

At this time, the Splice Machine import procedures do not work with columns of type BLOB. You can, however, create a virtual table interface (VTI) that reads the BLOBs and inserts them into your database.

Scripting Your Imports

You can make import tasks much easier and more convenient by creating import scripts. An import script is simply a call to one of the import procedures; once you’ve verified that it works, you can clone and reuse the script to run unattended imports.

An import script is a file in which you store splice> commands that you can execute with the run command. For example, here’s a text file named myimports.sql that we can use to import two csv files into our database:

call SYSCS_UTIL.IMPORT_DATA ('SPLICE','mytable1',null,'/data/mytable1/data.csv',null,null,null,null,null,0,'/BAD/mytable1',null,null);
call SYSCS_UTIL.IMPORT_DATA ('SPLICE','mytable2',null,'/data/mytable2/data.csv',null,null,null,null,null,0,'/BAD/mytable2',null,null);

To run an import script, use the splice> run command; for example:

splice> run 'myimports.sql';

You can also start up the splice> command line interpreter with the name of a file to run; for example:

sqlshell.sh -f myimports.sql

In fact, you can script almost any sequence of Splice Machine commands in a file and run that script within the command line interpreter or when you start the interpreter.
