SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX

The SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX system procedure pre-splits a table or index that you are importing in HFile format. You must use this procedure in conjunction with the SYSCS_UTIL.BULK_IMPORT_HFILE system procedure to import your data in HFile format.

Syntax

call SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX (
        schemaName,
        tableName,
        indexName,
        columnList | null,
        fileName,
        columnDelimiter | null,
        characterDelimiter | null,
        timestampFormat | null,
        dateFormat | null,
        timeFormat | null,
        maxBadRecords,
        badRecordDirectory | null,
        oneLineRecords | null,
        charset | null
        );

Parameters

The parameter values that you pass into this procedure should match the values for the same-named parameters that you use when you subsequently call the SYSCS_UTIL.BULK_IMPORT_HFILE procedure to perform the import.

This table includes a brief description of each parameter; additional information is available in the Import Parameters topic of our Importing Data tutorial.

Parameter Description Example Value
schemaName The name of the schema of the table into which to import. SPLICE
tableName The name of the table into which to import. playerTeams
indexName The name of the index into which to import; specify null if you are splitting a table rather than an index. playerTeamsIdx
columnList A comma-separated list of the columns used for split keys. See the Example below.
fileName

The name of the file in which you have specified the split keys.

On a cluster, this file MUST be on S3, HDFS (or MapR-FS). If you're using our Database Service product, this file must be on S3.

/data/mydata/mytable.csv
columnDelimiter The character used to separate columns. Specify null if using the comma (,) character as your delimiter. '|', '\t'
characterDelimiter The character used to delimit strings in the imported data. '"', ''''
timestampFormat

The format of timestamps stored in the file. You can set this to null if there are no timestamp columns in the file, or if the format of any timestamps in the file matches the java.sql.Timestamp default format, which is: "yyyy-MM-dd HH:mm:ss".

All of the timestamps in the file you are importing must use the same format.

'yyyy-MM-dd HH:mm:ss.SSZ'

dateFormat The format of dates stored in the file. You can set this to null if there are no date columns in the file, or if the format of any dates in the file matches the pattern: "yyyy-MM-dd". yyyy-MM-dd
timeFormat The format of time values stored in the file. You can set this to null if there are no time columns in the file, or if the format of any times in the file matches the pattern: "HH:mm:ss". HH:mm:ss
maxBadRecords The number of rejected (bad) records that are tolerated before the import fails. If this count of rejected records is reached, the import fails, and any successful record imports are rolled back. Specify 0 to indicate that no bad records are tolerated, and specify -1 to indicate that all bad records should be logged and allowed. 25
badRecordDirectory

The directory in which bad record information is logged. Splice Machine logs information to the <import_file_name>.bad file in this directory; for example, bad records in an input file named foo.csv would be logged to a file named badRecordDirectory/foo.csv.bad.

On a cluster, this directory MUST be on S3, HDFS (or MapR-FS). If you're using our Database Service product, it must be on S3.

'importErrsDir'
oneLineRecords A Boolean value that specifies whether each record in the import file is contained in one input line (true) or can span multiple lines (false). true
charset The character encoding of the import file. The default value is UTF-8. null
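The bad-record file naming described for badRecordDirectory above can be sketched as follows. This is an illustrative helper only, not part of the procedure's API; the function name is hypothetical:

```python
import os

def bad_record_path(bad_record_dir: str, import_file: str) -> str:
    """Derive the path Splice Machine uses for rejected records:
    <badRecordDirectory>/<import_file_name>.bad
    """
    return os.path.join(bad_record_dir, os.path.basename(import_file) + ".bad")

print(bad_record_path("importErrsDir", "/data/mydata/foo.csv"))
# importErrsDir/foo.csv.bad
```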

Usage

You can use the SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX procedure to pre-split a table or index that you're importing with the SYSCS_UTIL.BULK_IMPORT_HFILE procedure. Alternatively, you can have SYSCS_UTIL.BULK_IMPORT_HFILE sample the data in your file and compute the split keys itself.

When you pre-split your data, make sure that you set the skipSampling parameter to true when calling SYSCS_UTIL.BULK_IMPORT_HFILE; that tells the bulk import procedure that you have already split your data.
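Because the delimiter, format, and bad-record parameters must agree between the two calls, one way to avoid drift is to build both statements from a single set of values. A minimal sketch, with parameter order mirroring the Example later on this page; the table, file, and directory names are placeholders:

```python
# Shared values that must match between the split and bulk-import calls.
params = {
    "schema": "TPCH",
    "table": "LINEITEM",
    "column_delim": "|",
    "max_bad": -1,
    "bad_dir": "/BAD",
}

split_sql = (
    "call SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX("
    "'{schema}', '{table}', null, 'L_ORDERKEY,L_LINENUMBER', "
    "'hdfs:///tmp/test_hfile_import/lineitemKey.csv', "
    "'{column_delim}', null, null, null, null, "
    "{max_bad}, '{bad_dir}', true, null)"
).format(**params)

# The final argument of BULK_IMPORT_HFILE is skipSampling; it is set to
# true here because the table has already been pre-split.
bulk_sql = (
    "call SYSCS_UTIL.BULK_IMPORT_HFILE("
    "'{schema}', '{table}', null, '/TPCH/1/lineitem', "
    "'{column_delim}', null, null, null, null, "
    "{max_bad}, '{bad_dir}', true, null, "
    "'hdfs:///tmp/test_hfile_import/', true)"
).format(**params)

print(split_sql)
print(bulk_sql)
```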

The Importing Data: Using Bulk HFile Import section of our Importing Data Tutorial describes the different methods for using our bulk HFile import functionality.

Example

This example details the steps used to import data in HFile format by:

  • specifying the split keys manually in a CSV file
  • using SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX to pre-split the file you’re importing
  • calling SYSCS_UTIL.BULK_IMPORT_HFILE to import the file

Follow these steps:

  1. Create a directory on HDFS for the import; for example:

    sudo -su hdfs hadoop fs -mkdir hdfs:///tmp/test_hfile_import
    

    Make sure that the directory you create has permissions set to allow Splice Machine to write your CSV and HFiles there.

  2. Create table and index:

    CREATE TABLE TPCH.LINEITEM (
        L_ORDERKEY BIGINT NOT NULL,
        L_PARTKEY INTEGER NOT NULL,
        L_SUPPKEY INTEGER NOT NULL,
        L_LINENUMBER INTEGER NOT NULL,
        L_QUANTITY DECIMAL(15,2),
        L_EXTENDEDPRICE DECIMAL(15,2),
        L_DISCOUNT DECIMAL(15,2),
        L_TAX DECIMAL(15,2),
        L_RETURNFLAG VARCHAR(1),
        L_LINESTATUS VARCHAR(1),
        L_SHIPDATE DATE,
        L_COMMITDATE DATE,
        L_RECEIPTDATE DATE,
        L_SHIPINSTRUCT VARCHAR(25),
        L_SHIPMODE VARCHAR(10),
        L_COMMENT VARCHAR(44),
        PRIMARY KEY(L_ORDERKEY,L_LINENUMBER)
        );
    
    CREATE INDEX L_SHIPDATE_IDX on TPCH.LINEITEM(
        L_SHIPDATE,
        L_PARTKEY,
        L_EXTENDEDPRICE,
        L_DISCOUNT
        );
    
  3. Determine the split row keys for your table and set up the pre-splits:

    1. Find primary key values that can horizontally split the table into roughly equal sized partitions.

      For this example, we provide 3 keys in a file named lineitemKey.csv, which will be specified as the value of the fileName parameter. Note that each of our three keys includes a second column that is null:

      1500000|
      3000000|
      4500000|
      

      For every N lines of split data you specify, you’ll end up with N+1 regions; for example, the above 3 splits will produce these 4 regions:

      0 -> 1500000
      1500000 -> 3000000
      3000000 -> 4500000
      4500000 -> (last possible key)
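
      Evenly spaced keys like those above can also be generated rather than typed by hand. A hypothetical sketch; the key range and region count are assumptions you would base on your own data:

      ```python
      def split_key_lines(max_key: int, regions: int) -> list[str]:
          """Produce regions - 1 split keys evenly spaced over [0, max_key].
          Each line ends with '|' to leave the second key column
          (L_LINENUMBER in this example) null.
          """
          step = max_key // regions
          return [f"{step * i}|" for i in range(1, regions)]

      # 4 regions over keys 0..6000000 yield the 3 splits shown above.
      lines = split_key_lines(6_000_000, 4)
      print("\n".join(lines))
      # 1500000|
      # 3000000|
      # 4500000|
      ```

      Writing these lines to lineitemKey.csv produces the split file used in this example.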
      
    2. Specify the column names in the csv file in the columnList parameter; in our example, the primary key columns are:

      'L_ORDERKEY,L_LINENUMBER'
      
    3. Invoke SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX to pre-split your table file:

      call SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX('TPCH',
              'LINEITEM',null, 'L_ORDERKEY,L_LINENUMBER',
              'hdfs:///tmp/test_hfile_import/lineitemKey.csv',
              '|', null, null, null,
              null, -1, '/BAD', true, null);
      
  4. Compute the split keys for your index:

    1. Find index values that can horizontally split the index into roughly equal sized partitions.

    2. For this example, we provide 2 index values in a file named shipDateIndex.csv, which will be specified as the value of the fileName parameter. Note that each of our keys includes null column values:

      1994-01-01|||
      1996-01-01|||
      
    3. Specify the column names in the csv file in the columnList parameter; in our example, the index columns are:

      'L_SHIPDATE,L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT'
      
    4. Invoke SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX to pre-split your index file:

      call SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX('TPCH',
              'LINEITEM', 'L_SHIPDATE_IDX',
              'L_SHIPDATE,L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT',
              'hdfs:///tmp/test_hfile_import/shipDateIndex.csv',
              '|', null, null,
              null, null, -1, '/BAD', true, null);
      
  5. Import the HFiles Into Your Database

    Once you have pre-split your table and indexes, call SYSCS_UTIL.BULK_IMPORT_HFILE to generate and import the HFiles into your Splice Machine database:

    call SYSCS_UTIL.BULK_IMPORT_HFILE('TPCH', 'LINEITEM', null,
                '/TPCH/1/lineitem', '|', null, null, null, null,
                -1, '/BAD', true, null,
                'hdfs:///tmp/test_hfile_import/', true);
    

    The generated HFiles are automatically deleted after being imported.

See Also