SYSCS_UTIL.MERGE_DATA_FROM_FILE system procedure imports data to update an existing record or create a new record in your database. You can choose to import all or a subset of the columns from the input data into your database using the
After a successful import completes, a simple report displays, showing how many files were imported, and how many record imports succeeded or failed.
Selecting an Import Procedure
Splice Machine provides four system procedures for importing data:
SYSCS_UTIL.IMPORT_DATAprocedure imports each input record into a new record in your database.
SYSCS_UTIL.UPSERT_DATA_FROM_FILEprocedure updates existing records and adds new records to your database. It only differs from
SYSCS_UTIL.MERGE_DATA_FROM_FILEin that upserting overwrites the generated or default value of a column that is not specified in your
insertColumnListparameter when updating a record.
- This procedure,
SYSCS_UTIL.MERGE_DATA_FROM_FILEprocedure updates existing records and adds new records to your database. It only differs from
SYSCS_UTIL.UPSERT_DATA_FROM_FILEin that merging does not overwrite the generated or default value of a column that is not specified in your
insertColumnListparameter when updating a record.
SYSCS_BULK_IMPORT_HFILEprocedure takes advantage of HBase bulk loading to import table data into your database by temporarily converting the table file that you’re importing into HFiles, importing those directly into your database, and then removing the temporary HFiles. This procedure has improved performance for large tables; however, the bulk HFile import requires extra work on your part and lacks constraint checking.
Our Importing Data Tutorial includes a decision tree and brief discussion to help you determine which procedure best meets your needs.
call SYSCS_UTIL.MERGE_DATA_FROM_FILE ( schemaName, tableName, insertColumnList | null, fileOrDirectoryName, columnDelimiter | null, characterDelimiter | null, timestampFormat | null, dateFormat | null, timeFormat | null, badRecordsAllowed, badRecordDirectory | null, oneLineRecords | null, charset | null );
The following table summarizes the parameters used by
SYSCS_UTIL.MERGE_DATA_FROM_FILE and other Splice Machine data importation procedures. Each parameter name links to a more detailed description in our Importing Data Tutorial.
|Table Info||schemaName||The name of the schema of the table in which to import.||SPLICE|
|tableName||The name of the table in which to import||playerTeams|
|Data Location||insertColumnList||The names, in single quotes, of the columns to import. If this is
Either a single file or a directory. If this is a single file, that file is imported; if this is a directory, all of the files in that directory are imported. You can import compressed or uncompressed files.
On a cluster, the files to be imported
|Data Formats||oneLineRecords||A Boolean value that specifies whether (
|charset||The character encoding of the import file. The default value is UTF-8.||null|
|columnDelimiter||The character used to separate columns, Specify
|characterDelimiter||The character is used to delimit strings in the imported data.||'"'|
The format of timestamps stored in the file. You can set this to
All of the timestamps in the file you are importing must use the same format.
|dateFormat||The format of datestamps stored in the file. You can set this to
|timeFormat||The format of time values stored in the file. You can set this to null if there are no time columns in the file, or if the format of any times in the file match pattern: "HH:mm:ss".||HH:mm:ss|
|Problem Logging||badRecordsAllowed||The number of rejected (bad) records that are tolerated before the import fails. If this count of rejected records is reached, the import fails, and any successful record imports are rolled back. Specify 0 to indicate that no bad records are tolerated, and specify -1 to indicate that all bad records should be logged and allowed.||25|
The directory in which bad record information is logged. Splice Machine logs information to the
On a cluster, this directory MUST be on S3, HDFS (or MapR-FS). If you're using our Database Service product, files can only be imported from S3.
|Bulk HFile Import||bulkImportDirectory (outputDirectory)||
This parameter is only used with the
SYSCS_UTIL.MERGE_DATA_FROM_FILEprocedure only imports single files; it does not process directories. This means that the
fileOrDirectoryNameparameter value must be a file name.
SYSCS_UTIL.MERGE_DATA_FROM_FILE displays a summary of the import
process results that looks like this:
rowsImported |failedRows |files |dataSize |failedLog ------------------------------------------------------------- 94 |0 |1 |4720 |NONE
This procedure also logs rejected record activity into
.bad files in
badRecordDirectory directory; one file for each imported file.
Importing and Updating Records
- You are importing only a subset of data from the input data into your table, either because the table contains less columns than does the input file, or because you’ve specified a subset of the columns in your
- Inserting and updating data in a column with generated values.
- Inserting and updating data in a column with default values.
- Handling of missing values.
The Importing Data Tutorial: Input Handling topic describes how each of these conditions is handled by the different system procedures.
Record Import Failure Reasons
When upserting data from a file, the input file you generate must contain:
- the columns to be changed
Typical reasons for a row (record) import to fail include:
- Improper data expected for a column.
- Improper number of columns of data.
- A primary key violation:
SYSCS_UTIL.MERGE_DATA_FROM_FILEwill only work correctly if the table into which you are inserting/updating has primary keys.
A few important notes:
Splice Machine advises you to run a full compaction (with the
SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_TABLEsystem procedure) after importing large amounts of data into your database.
On a cluster, the files to be imported MUST be on S3, HDFS (or MapR-FS), as must the
badRecordDirectorydirectory. If you’re using our Database Service product, files can only be imported from S3.
In addition, the files must be readable by the
hbaseuser, and the
badRecordDirectorydirectory must be writable by the
hbaseuser, either by setting the user explicity, or by opening up the permissions; for example:
sudo -su hdfs hadoop fs -chmod 777 /badRecordDirectory
This section presents a couple simple examples.
The Importing Data Usage Examples topic contains a more extensive set of examples.
Example 1: Updating our doc examples player data
This example shows the
MERGE_DATA call used to update the Players in our documentation examples database:
splice> CALL SYSCS_UTIL.MERGE_DATA_FROM_FILE('SPLICEBBALL', 'Players', 'ID, Team, Name, Position, DisplayName, BirthDate', '/Data/DocExamplesDb/Players.csv', null, null, null, null, null, 0, null, true, null); rowsImported |failedRows |files |dataSize |failedLog-------------------------------------------------------------------------------------- 94 |0 |1 |4720 |NONE 1 row selected
Example 2: Using single quotes to delimit strings
This example uses single quotes instead of double quotes as the character delimiter in the input:
1,This field is one line,Able 2,'This field has two lines This is the second line of the field',Baker 3,This field is also just one line,Charlie
Note that you must escape single quotes in SQL, which means that you actually define the character delimiter parameter with four single quotes, as follow
SYSCS_UTIL.MERGE_DATA_FROM_FILE('SPLICE','MYTABLE',null,'data.csv','\t','''',null,null,null,0,'BAD', false, null);
See Importing Data Usage Examples for more examples.