Best Practices: Basic Flat File Ingestion

This topic shows you how to use Splice Machine’s basic data ingestion methods, IMPORT_DATA and MERGE_DATA_FROM_FILE, to import data from flat files into your database. These highly performant procedures provide numerous data handling options and perform constraint checking, which means that they detect and report erroneous records (bad data) in the input file.

If you’re ingesting all new data, use IMPORT_DATA; if you are also ingesting updates to existing records in your database table, use MERGE_DATA_FROM_FILE. You can only merge data into a table that has a primary key.
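
If you’re not sure whether a table has a primary key, one way to check is to query the system catalog. This is a minimal sketch, assuming Splice Machine’s Derby-derived SYS.SYSCONSTRAINTS and SYS.SYSTABLES catalog tables, in which a TYPE value of 'P' marks a primary key constraint:

    -- List any primary key constraint defined on the PLAYERTEAMS table
    SELECT c.CONSTRAINTNAME
    FROM SYS.SYSCONSTRAINTS c
    JOIN SYS.SYSTABLES t ON c.TABLEID = t.TABLEID
    WHERE t.TABLENAME = 'PLAYERTEAMS'
      AND c.TYPE = 'P';

If this query returns no rows, the table has no primary key and MERGE_DATA_FROM_FILE cannot be used with it.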

Our Bulk HFile Import procedure, BULK_HFILE_IMPORT, offers boosted ingestion speed when importing large (> 100GB) data sets, but does not perform constraint checking.

The remainder of this topic contains these sections:

  * About Basic Flat File Ingestion
  * Example: Basic Import of a Flat File
  * Example: Basic Merge of a Flat File
  * Parameters Used With the Basic Import Procedures
  * For Additional Information

About Basic Flat File Ingestion

Here’s what a call to the IMPORT_DATA procedure looks like, with the required parameter values shown as angle-bracketed placeholders:

call SYSCS_UTIL.IMPORT_DATA('<schemaName>', '<tableName>', null,
        '<inFilePath>', null, null, null, null, null, -1,
        '<badRecordLogDirectory>', true, null);

All of the null parameter values specify that default values should be used. All of the parameters are described, along with their default values, in Table 1. Here’s a call with actual values plugged in:

call SYSCS_UTIL.IMPORT_DATA('SPLICE', 'playerTeams', null, 'myData.csv',
       null, null, null, null, null, 0, 'importErrsDir', true, null);

The MERGE_DATA_FROM_FILE procedure uses exactly the same parameters.
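
For reference, here’s the equivalent template call with MERGE_DATA_FROM_FILE substituted; the parameter list is identical:

call SYSCS_UTIL.MERGE_DATA_FROM_FILE('<schemaName>', '<tableName>', null,
        '<inFilePath>', null, null, null, null, null, -1,
        '<badRecordLogDirectory>', true, null);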

In the above calls, the parameter values have the following meanings:

'SPLICE'           Import the data into a table in the `SPLICE` schema in our database.
'playerTeams'      Import the data into the `playerTeams` table.
null               Import all columns of data in the file.
'myData.csv'       The input file path.
null               Columns in the input file are separated by the `,` character.
null               Character strings in the input file are delimited with `"` characters.
null               Timestamp values are in yyyy-MM-dd HH:mm:ss format.
null               Date values are in yyyy-MM-dd format.
null               Time values are in HH:mm:ss format.
0                  Zero tolerance for bad records: any input error terminates the import.
'importErrsDir'    Information about any bad records is logged in this directory.
true               Each record is contained in one line in the input file.
null               The input uses UTF-8 character encoding.
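
If your file deviates from these defaults, supply the corresponding parameter values instead of null. For example, this hypothetical call (the file name and timestamp format string are illustrative, and it assumes Splice Machine accepts '\t' for the tab character, as it does for other special delimiter characters) ingests a tab-delimited file whose timestamps include milliseconds:

call SYSCS_UTIL.IMPORT_DATA('SPLICE', 'playerTeams', null, 'myData.tsv',
        '\t', null, 'yyyy-MM-dd HH:mm:ss.SSS', null, null, 0,
        'importErrsDir', true, null);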

Example: Basic Import of a Flat File

Here’s a very basic example of importing a flat file into a table in your Splice Machine database. Follow these steps:

  1. Create a simple table named testImport:

    CREATE SCHEMA test;
    SET SCHEMA test;
    
    CREATE TABLE testImport (
             a1 INT,
             b1 INT,
             c1 INT GENERATED BY DEFAULT AS IDENTITY(start with 1, increment by 1),
             d1 INT DEFAULT 999,
             PRIMARY KEY (a1)
    );
    


  2. Access a simple file named ttest.csv from an S3 bucket on AWS. That file contains this data:

    0|0
    1|2
    2|4
    3|6
    4|8
    


  3. Use IMPORT_DATA to import that data into the testImport table:

    splice> CALL SYSCS_UTIL.IMPORT_DATA('TEST', 'testImport', null,
                        's3a://mypublicbucket/ttest.csv',
                        '|', null, null, null, null, 0,
                        'hdfs:///tmp/test_import/', false, null);
    
    rowsImported        |failedRows          |files      |dataSize            |failedLog
    -------------------------------------------------------------------------------------
    5                   |0                   |1          |20                  |NONE
    

    Note that this IMPORT_DATA call logs bad import records to a file on HDFS, and uses almost all default parameter values. The exception: our data file uses the | character to delimit columns. All of the parameters are summarized in Table 1 below.

  4. Use a SELECT statement to verify that all went well:

    splice> SELECT * FROM testImport;
    A1         |B1         |C1         |D1
    -----------------------------------------------
    0          |0          |1          |999
    1          |2          |2          |999
    2          |4          |3          |999
    3          |6          |4          |999
    4          |8          |5          |999

    5 rows selected
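
    To see the constraint checking described earlier in action, you could re-run the same IMPORT_DATA call with a badRecordsAllowed value of -1. Since all five primary key values now exist in testImport, each input row would violate the primary key constraint and be logged to the bad record directory rather than imported. This is a sketch of that call, not verified output:

    splice> CALL SYSCS_UTIL.IMPORT_DATA('TEST', 'testImport', null,
                        's3a://mypublicbucket/ttest.csv',
                        '|', null, null, null, null, -1,
                        'hdfs:///tmp/test_import/', false, null);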
    

Example: Basic Merge of a Flat File

Here’s a very basic example of using MERGE_DATA_FROM_FILE to add new records and update a few existing records in a table. This example ingests into the same table that we just used in the IMPORT_DATA example above.

  1. Access a simple file named mergetest.csv from an S3 bucket on AWS. That file contains the following data. Note that the rows with key values 2 and 4 already exist in the table:

    2|22
    4|44
    5|55
    6|66
    


  2. Use MERGE_DATA_FROM_FILE to merge that data into the testImport table:

    splice> CALL SYSCS_UTIL.MERGE_DATA_FROM_FILE('TEST', 'testImport', null,
                        's3a://mypublicbucket/mergetest.csv',
                        '|', null, null, null, null, 0,
                        'hdfs:///tmp/test_import/', false, null);
    
    rowsUpdated   |rowsInserted  |failedRows     |files  |dataSize           |failedLog
    -------------------------------------------------------------------------------------
    2             |2             |0              |1      |20                 |NONE
    
    1 row selected
    


  3. Use a SELECT statement to verify that all went well:

    splice> SELECT * FROM testImport;
    A1         |B1         |C1         |D1
    -----------------------------------------------
    0          |0          |1          |999
    1          |2          |2          |999
    2          |22         |3          |999
    3          |6          |4          |999
    4          |44         |5          |999
    5          |55         |10001      |999
    6          |66         |10002      |999
    
    7 rows selected
    

    Note that this MERGE_DATA_FROM_FILE call uses exactly the same parameter values as the previous call to IMPORT_DATA, except that it imports a different file. As you can see, two rows (A1=2 and A1=4) were updated with new B1 values, and two new rows were added by this merge call.

Parameters Used With the Basic Import Procedures

The following table summarizes the parameters you use when calling the IMPORT_DATA or MERGE_DATA_FROM_FILE procedures.

The Data Ingestion Parameter Values topic in this chapter provides reference information for all of these parameters.

Table 1: Basic Import Parameters

Parameter Name        Description
--------------------  -------------------------------------------------------------------
schemaName            The schema to import into.
tableName             The table to import into.
insertColumnList      A list of the columns to import; the default is to import all columns.
fileOrDirectoryName   The file, or directory of files, to import.
columnDelimiter       The character used to separate columns in the input; the default is the comma (,) character.
characterDelimiter    The character used to delimit strings in the imported data; the default is the double-quote (") character.
timestampFormat       The format of timestamps stored in the file; the default is "yyyy-MM-dd HH:mm:ss".
dateFormat            The format of date values stored in the file; the default is "yyyy-MM-dd".
timeFormat            The format of time values stored in the file; the default is "HH:mm:ss".
badRecordsAllowed     The number of rejected (bad) records tolerated before the import fails. A value of `0` means that the import terminates as soon as a single bad record is detected; a value of `-1` means that all bad records are tolerated and logged.
badRecordDirectory    The directory in which bad record information is logged.
oneLineRecords        A Boolean value that specifies whether each record in the import file is contained in one input line (true) or can span multiple lines (false).
charset               The character encoding of the import file; the default is UTF-8.
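
Putting the full signature together, here’s a hypothetical call that spells out every parameter explicitly instead of relying on null defaults (the column list, file path, and bad record directory are illustrative):

call SYSCS_UTIL.IMPORT_DATA(
        'SPLICE',                 -- schemaName
        'playerTeams',            -- tableName
        'ID,TEAM',                -- insertColumnList: import just these two columns
        '/data/playerTeams.csv',  -- fileOrDirectoryName
        ',',                      -- columnDelimiter
        '"',                      -- characterDelimiter
        'yyyy-MM-dd HH:mm:ss',    -- timestampFormat
        'yyyy-MM-dd',             -- dateFormat
        'HH:mm:ss',               -- timeFormat
        0,                        -- badRecordsAllowed: stop on the first bad record
        'importErrsDir',          -- badRecordDirectory
        true,                     -- oneLineRecords: each record is on one line
        'UTF-8');                 -- charset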

For Additional Information

Our SQL Reference Manual includes reference pages for each of these system procedures, with full information about the parameters, additional examples, and discussion of handling special cases and input errors.