SYSCS_UTIL.MERGE_DATA_FROM_FILE

The SYSCS_UTIL.MERGE_DATA_FROM_FILE system procedure imports data to update an existing record or create a new record in your database. You can choose to import all or a subset of the columns from the input data into your database using the insertColumnList parameter.

After a successful import completes, a simple report displays, showing how many files were imported, and how many record imports succeeded or failed.

This procedure is one of several built-in system procedures provided by Splice Machine for importing data into your database. See our Best Practices: Ingestion for help with selecting the right process for your situation.

Syntax

call SYSCS_UTIL.MERGE_DATA_FROM_FILE (
               schemaName,
               tableName,
               insertColumnList | null,
               fileName,
               columnDelimiter | null,
               characterDelimiter | null,
               timestampFormat | null,
               dateFormat | null,
               timeFormat | null,
               badRecordsAllowed,
               badRecordDirectory | null,
               oneLineRecords | null,
               charset | null
);

Parameters

The following table summarizes the parameters used by SYSCS_UTIL.MERGE_DATA_FROM_FILE. Each parameter name links to a more detailed description in our Ingestion Parameter Values.

</tr> </tr>
Parameter Description Example Value
schemaName The name of the schema of the table into which to import. SPLICE
tableName The name of the table into which to import. playerTeams
insertColumnList

The names, in single quotes, of the columns to import. If this is null, all columns are imported.

The individual column names in the insertColumnList do not need to be double-quoted, even if they contain special characters. However, if you do double-quote any column name, you must double-quote all of the column names.

'ID, TEAM'
fileName

The file is imported; you can import compressed or uncompressed files.

On a cluster, the file to be imported MUST be on S3, HDFS (or MapR-FS). If you're using our Database Service product, the file can only be imported from S3.

/data/mydata/mytable.csv
columnDelimiter The character used to separate columns, Specify null if using the comma (,) character as your delimiter. '|', '\t'
characterDelimiter The character used to delimit strings in the imported data. '"', ''''
timestampFormat

The format of timestamps stored in the file. You can set this to null if there are no time columns in the file, or if the format of any timestamps in the file match the Java.sql.Timestamp default format, which is: "yyyy-MM-dd HH:mm:ss".

All of the timestamps in the file you are importing must use the same format.

'yyyy-MM-dd HH:mm:ss.SSZ'

dateFormat The format of datestamps stored in the file. You can set this to null if there are no date columns in the file, or if the format of any dates in the file match pattern: "yyyy-MM-dd". yyyy-MM-dd
timeFormat The format of time values stored in the file. You can set this to null if there are no time columns in the file, or if the format of any times in the file match pattern: "HH:mm:ss". HH:mm:ss
badRecordsAllowed The number of rejected (bad) records that are tolerated before the import fails. If this count of rejected records is reached, the import fails, and any successful record imports are rolled back. Specify 0 to indicate that no bad records are tolerated, and specify -1 to indicate that all bad records should be logged and allowed. 25
badRecordDirectory

The directory in which bad record information is logged. Splice Machine logs information to the <import_file_name>.bad file in this directory; for example, bad records in an input file named foo.csv would be logged to a file named badRecordDirectory/foo.csv.bad.

On a cluster, this directory MUST be on S3, HDFS (or MapR-FS). If you're using our Database Service product, files can only be imported from S3.

'importErrsDir'
oneLineRecords A Boolean value that specifies whether (true) each record in the import file is contained in one input line, or (false) if a record can span multiple lines. true
charset The character encoding of the import file. The default value is UTF-8. null

The SYSCS_UTIL.MERGE_DATA_FROM_FILE procedure only imports single files; it does not process directories.

Results

SYSCS_UTIL.MERGE_DATA_FROM_FILE displays a summary of the import process results that looks like this:

rowsImported   |failedRows   |files   |dataSize   |failedLog
-------------------------------------------------------------
94             |0            |1       |4720       |NONE

This procedure also logs rejected record activity into .bad files in the badRecordDirectory directory; one file for each imported file.

Importing and Updating Records

The SYSCS_UTIL.SYSCS_MERGE_DATA_FROM_FILE imports new records into your database in the same way as does the SYSCS_UTIL.IMPORT_DATA procedure. SYSCS_UTIL.SYSCS_MERGE_DATA_FROM_FILE can also update existing records in your database; for this to work, the table you’re importing into must have a primary key. Because this procedure has to determine if a record already exists and how to update it, MERGE_DATA is slightly slower than using IMPORT_DATA; if you know that you’re ingesting all new records, you’ll get better performance with IMPORT_DATA.

Record Import Failure Reasons

When merging data from a file, the input file you generate must contain:

  • the columns to be changed
  • all NON_NULL columns

Typical reasons for a row (record) import to fail include:

  • Improper data expected for a column.
  • Improper number of columns of data.
  • A primary key violation:  SYSCS_UTIL.MERGE_DATA_FROM_FILE will only work correctly if the table into which you are inserting/updating has primary keys.

A few important notes:

  • Splice Machine advises you to run a full compaction (with the SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_TABLE system procedure) after importing large amounts of data into your database.

  • On a cluster, the files to be imported MUST be on S3, HDFS (or MapR-FS), as must the badRecordDirectory directory. If you’re using our Database Service product, files can only be imported from S3.

    In addition, the files must be readable by the hbase user, and the badRecordDirectory directory must be writable by the hbase user, either by setting the user explicity, or by opening up the permissions; for example:

    sudo -su hdfs hadoop fs -chmod 777 /badRecordDirectory

Examples

This section presents a couple simple examples.

The Importing Flat Files topic contains a more extensive set of examples.

Example 1: Updating our doc examples player data

This example shows the MERGE_DATA call used to update the Players in our documentation examples database:

splice> CALL SYSCS_UTIL.MERGE_DATA_FROM_FILE('SPLICEBBALL', 'Players',
    'ID, Team, Name, Position, DisplayName, BirthDate',
    '/Data/DocExamplesDb/Players.csv',
    null, null, null, null, null, 0, null, true, null);
rowsImported        |failedRows          |files      |dataSize            |failedLog--------------------------------------------------------------------------------------
94                  |0                   |1          |4720                |NONE
1 row selected

Example 2: Basic Merge of a Flat File

Here’s a very basic example of using MERGE_DATA_FROM_FILE to add new records and update a few existing records in a table. This example ingests into the same table that we just used in the IMPORT_DATA example above.

  1. Access a simple file named mergetest.csv from an S3 bucket on AWS. That file contains the following data. Note that the rows with key values 2 and 4 already exist in the table:

    2|22
    4|44
    5|55
    6|66
    


  2. Use MERGE_DATA to import that data into the testImport table:

    splice> CALL SYSCS_UTIL.MERGE_DATA_FROM_FILE('TEST', 'testImport', null,
                        's3a:/mypublicbucket/mergetest.csv',
                        '|', null, null, null, null, 0,
                        'hdfs:///tmp/test_import/', false, null);
    
    rowsUpdated   |rowsInserted  |failedRows     |files  |dataSize           |failedLog
    -------------------------------------------------------------------------------------
    2             |2             |0              |1      |20                 |NONE
    
    1 row selected
    


  3. Use a SELECT statement to verify that all went well:

    splice> SELECT * FROM testImport;
    A1         |B1         |C1         |D1
    -----------------------------------------------
    0          |0          |1          |999
    1          |2          |2          |999
    2          |22         |3          |999
    3          |6          |4          |999
    4          |44         |5          |999
    5          |55         |10001      |999
    6          |66         |10002      |999
    
    7 rows selected
    

    Note that this MERGE_DATA_FROM_FILE call uses exactly the same parameter values as does the previous call to IMPORT_DATA, with the exception of importing a different file. As you can see, two rows (A1=2 and A1=4) were updated with new B1 values, and two new rows were added by this merge call.

Example 3: Using single quotes to delimit strings

This example uses single quotes instead of double quotes as the character delimiter in the input:

1,This field is one line,Able
2,'This field has two lines
This is the second line of the field',Baker
3,This field is also just one line,Charlie

Note that you must escape single quotes in SQL, which means that you actually define the character delimiter parameter with four single quotes, as follow

SYSCS_UTIL.MERGE_DATA_FROM_FILE('SPLICE','MYTABLE',null,'data.csv','\t','''',null,null,null,0,'BAD', false, null);

See Importing Flat Files for more examples.

See Also