The Native Spark DataSource Methods

This topic describes the following methods of the SplicemachineContext class:

analyzeSchema

This method collects statistics for an entire schema; it is the same as using the ANALYZE SCHEMA command in the splice> command line interface.

analyzeSchema(schemaName: String): Unit

schemaName

The name of the schema that you want analyzed.
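
For example, assuming an existing SplicemachineContext instance named splicemachineContext and a hypothetical schema named MY_SCHEMA, the call looks like this:

    // Collect statistics for every table in the MY_SCHEMA schema
    splicemachineContext.analyzeSchema("MY_SCHEMA")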

analyzeTable

This method collects statistics for a specific table; it is the same as using the ANALYZE TABLE command in the splice> command line interface.

analyzeTable( tableName: String,
              estimateStatistics: Boolean = false,
              samplePercent: Double = 0.10 ): Unit

tableName

The name of the table that you want analyzed.

estimateStatistics

A Boolean that specifies whether statistics are estimated by sampling the table at the percentage given in samplePercent; sampling can significantly reduce the overhead associated with generating statistics. Setting this parameter to false (the default) generates statistics from the full table.

samplePercent

A value between 0 and 1 that specifies the fraction of the table to sample when generating statistics. This value defaults to 0.10 (10 percent), and is only used if estimateStatistics is set to true.
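
For example, the following sketch generates estimated statistics from a 5 percent sample of a hypothetical table (splicemachineContext and the table name are assumptions):

    // Sample 5% of MY_SCHEMA.MY_TABLE and estimate its statistics from that sample
    splicemachineContext.analyzeTable( "MY_SCHEMA.MY_TABLE",
                                       estimateStatistics = true,
                                       samplePercent = 0.05 )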

bulkImportHFile

This method efficiently imports data into your Splice Machine database by first generating HFiles and then importing those HFiles; it is the same as using the Splice Machine SYSCS_UTIL.BULK_IMPORT_HFILE system procedure.

You can either pass the data to this method in a DataFrame, or you can pass the data in an RDD, and pass in a structure (the Catalyst schema) that specifies the organization of the data.

bulkImportHFile( dataFrame: DataFrame,
                 schemaTableName: String,
                 options: scala.collection.mutable.Map[String, String] ): Unit

bulkImportHFile( rdd: JavaRDD[Row],
                 schema: StructType,
                 schemaTableName: String,
                 options: scala.collection.mutable.Map[String, String] ): Unit

dataFrame

The DataFrame containing the rows that you want imported into your database table.

schemaTableName

The combined schema and table names, in the form: mySchema.myTable.

rdd

The RDD containing the data that you want imported into your database table.

schema

The Catalyst schema of the master table; a structure that specifies the layout of the data in the RDD.

options

A collection of (key, value) pairs specifying the import options. For example, you can specify that sampling is not to be used with a statement like this:

    val bulkImportOptions = scala.collection.mutable.Map( "skipSampling" -> "true" )
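
A hypothetical end-to-end call might look like the following; the DataFrame df, the table name, and the bulkImportDirectory option key (the staging directory for the generated HFiles) are all assumptions to adapt to your environment:

    // Generate HFiles for the rows in df under /tmp/bulkImport, then import them
    // into MY_SCHEMA.MY_TABLE. bulkImportDirectory is an assumed option key;
    // check the option list for your release.
    val options = scala.collection.mutable.Map(
      "bulkImportDirectory" -> "/tmp/bulkImport",
      "skipSampling"        -> "true" )
    splicemachineContext.bulkImportHFile(df, "MY_SCHEMA.MY_TABLE", options)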

createTable

This method creates a new table in your Splice Machine database; it is the same as using the Splice Machine CREATE TABLE SQL statement.

createTable( tableName: String,
             structType: StructType,
             keys: Seq[String],
             createTableOptions: String ): Unit

tableName

The name of the table.

structType

A structure that specifies the table’s schema.

keys

A sequence of column names that make up the table's primary key.

createTableOptions

A string that specifies the table options.
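
A minimal sketch, using hypothetical names and assuming splicemachineContext already exists; passing an empty string for createTableOptions (no extra options) is also an assumption:

    import org.apache.spark.sql.types._

    // Two-column table whose primary key is the ID column
    val tableSchema = StructType(Seq(
      StructField("ID", IntegerType, nullable = false),
      StructField("NAME", StringType) ))
    splicemachineContext.createTable("MY_SCHEMA.MY_TABLE", tableSchema, Seq("ID"), "")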

delete

This method deletes the rows identified by the contents of a Spark DataFrame or Spark RDD from a Splice Machine table; it is the same as using the Splice Machine DELETE FROM SQL statement.

You can either pass the data to this method in a DataFrame, or you can pass the data in an RDD, and pass in a structure that specifies the organization of the data.

delete( dataFrame: DataFrame,
        schemaTableName: String ): Unit

delete( rdd: JavaRDD[Row],
        schema: StructType,
        schemaTableName: String ): Unit

dataFrame

The DataFrame containing the rows that you want deleted from your database table.

schemaTableName

The combined schema and table names, in the form: mySchema.myTable.

rdd

The RDD containing the data that you want deleted from your database table.

schema

The Catalyst schema of the master table; a structure that specifies the layout of the data in the RDD.
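
For example, assuming a DataFrame named rowsToDelete whose rows identify the records to remove (all names hypothetical):

    // Delete the rows in rowsToDelete from MY_SCHEMA.MY_TABLE
    splicemachineContext.delete(rowsToDelete, "MY_SCHEMA.MY_TABLE")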

df and internalDf

These methods execute an SQL string within Splice Machine and return the results in a Spark DataFrame.

The only difference between the df and internalDf methods is that the internalDf method runs internally and temporarily persists data on HDFS; this has a slight performance impact, but allows for checking permissions on Views. For more information, please see the Accessing Database Objects section in our Using the Native Spark DataSource topic.

df( sql: String ): Dataset[Row]

internalDf( sql: String ): Dataset[Row]

sql

The SQL string.
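
For example, with a hypothetical table and columns:

    // Run a query in Splice Machine and get the result back as a Spark DataFrame
    val resultDf = splicemachineContext.df(
      "SELECT ID, NAME FROM MY_SCHEMA.MY_TABLE WHERE ID > 100")
    resultDf.show()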

dropTable

This method removes the specified table; it is the same as using the Splice Machine DROP TABLE SQL statement.

You can pass the schema and table names into this method separately, or in combined (schema.table) format.

dropTable(schemaTableName: String): Unit

dropTable( schemaName: String,
           tableName: String ): Unit

schemaTableName

The combined schema and table names, in the form: mySchema.myTable.

schemaName

The schema name.

tableName

The name of the table.

export

This method exports a dataFrame in CSV format.

export( dataFrame: DataFrame,
        location: String,
        compression: Boolean,
        replicationCount: Int,
        fileEncoding: String,
        fieldSeparator: String,
        quoteCharacter: String): Unit

dataFrame

The dataFrame that you want to export.

location

The directory in which you want the export file(s) written.

compression

Whether or not to compress the exported files. You can specify one of the following values:

Value   Description
true    The exported files are compressed using deflate/gzip.
false   Exported files are not compressed.

replicationCount

The file system block replication count to use for the exported CSV files.

You can specify any positive integer value. The default value is 1.

fileEncoding

The character set encoding to use for the exported CSV files.

You can specify any character set encoding that is supported by the Java Virtual Machine (JVM). The default encoding is UTF-8.

fieldSeparator

The character to use for separating fields in the exported CSV files.

The default separator character is the comma (,).

quoteCharacter

The character to use for quoting output in the exported CSV files.

The default quote character is the double quotation mark (").
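
A hypothetical call that writes uncompressed, comma-separated CSV files (df and the output directory are assumptions):

    // Export df to /tmp/export: no compression, replication count 1, UTF-8 encoding,
    // comma as the field separator, double quotation mark as the quote character
    splicemachineContext.export(df, "/tmp/export", false, 1, "UTF-8", ",", "\"")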

exportBinary

This method exports a dataFrame in binary format, generating one or more binary files, which are stored in the directory that you specify in the location parameter. More than one output file can be generated to enhance the parallelism and performance of this operation.

exportBinary( dataFrame: DataFrame,
              location: String,
              compression: Boolean,
              format: String): Unit

dataFrame

The dataFrame that you want to export.

location

The directory in which you want the export file(s) written.

compression

Whether or not to compress the exported files. You can specify one of the following values:

Value   Description
true    The exported files are compressed using Snappy.
false   Exported files are not compressed.

If compression=true, then each of the generated files is named with this format: part-r-<N>.snappy.parquet; if not, then each file is named with this format: part-r-<N>.parquet. In either case, the value of N is a sequence of numbers and letters.

format

The format in which to write the exported file(s). The only format supported at this time is parquet.
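
For example, with a hypothetical DataFrame and output directory:

    // Export df as uncompressed Parquet files under /tmp/exportParquet
    splicemachineContext.exportBinary(df, "/tmp/exportParquet", false, "parquet")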

getConnection

This method returns the current connection.

getConnection(): Connection

getSchema

This method returns the Catalyst schema of the specified table.

getSchema( schemaTableName: String ): StructType

schemaTableName

The combined schema and table names, in the form: mySchema.myTable.
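
For example, with a hypothetical table name:

    // Retrieve the table's Catalyst schema and print it as a tree
    val tableSchema = splicemachineContext.getSchema("MY_SCHEMA.MY_TABLE")
    println(tableSchema.treeString)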

insert

This method inserts the contents of a Spark DataFrame or Spark RDD into a Splice Machine table; it is the same as using the Splice Machine INSERT INTO SQL statement.

You can either pass the data to this method in a DataFrame, or you can pass the data in an RDD, and pass in a structure that specifies the organization of the data.

insert( dataFrame: DataFrame,
        schemaTableName: String ): Unit

insert( rdd: JavaRDD[Row],
        schema: StructType,
        schemaTableName: String ): Unit

dataFrame

The DataFrame containing the rows that you want inserted into your database table.

schemaTableName

The combined schema and table names, in the form: mySchema.myTable.

rdd

The RDD containing the data that you want inserted into your database table.

schema

The Catalyst schema of the master table.
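
For example, assuming a DataFrame named df whose columns match the target table (all names hypothetical):

    // Insert the rows of df into MY_SCHEMA.MY_TABLE
    splicemachineContext.insert(df, "MY_SCHEMA.MY_TABLE")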

rdd and internalRdd

These methods create a Spark RDD from a Splice Machine table.

The only difference between the rdd and internalRdd methods is that the internalRdd method runs internally and temporarily persists data on HDFS; this has a slight performance impact, but allows for checking permissions on Views. For more information, please see the Accessing Database Objects section in our Using the Native Spark DataSource topic.

rdd( schemaTableName: String,
     columnProjection: Seq[String] = Nil ): RDD[Row]

internalRdd( schemaTableName: String,
             columnProjection: Seq[String] = Nil ): RDD[Row]

schemaTableName

The combined schema and table names, in the form: mySchema.myTable.

columnProjection

The names of the columns in the underlying table that you want to project into the RDD; this is a sequence of column-name strings.
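
For example, with hypothetical table and column names:

    // Create an RDD containing only the ID and NAME columns of MY_SCHEMA.MY_TABLE
    val tableRdd = splicemachineContext.rdd("MY_SCHEMA.MY_TABLE", Seq("ID", "NAME"))
    println(tableRdd.count())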

splitAndInsert

This method improves the performance of inserting data from the Native DataSource: instead of inserting into a single HBase region and having HBase split that region, we pre-split the table based on the data we’re inserting, and then insert the dataFrame. The table splits are computed by sampling the data in the dataFrame; Splice Machine uses the sampling percentage specified by the sampleFraction parameter value.

splitAndInsert( dataFrame: DataFrame,
                schemaTableName: String,
                sampleFraction: Double): Unit

dataFrame

The dataFrame that you want sampled and then inserted into your database table.

schemaTableName

The combined schema and table names, in the form: mySchema.myTable.

sampleFraction

A value between 0 and 1 that specifies the percentage of data in the dataFrame that should be sampled to determine the splits. For example, specify 0.005 if you want 0.5% of the data sampled.
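
For example, the following sketch samples 0.5% of a hypothetical DataFrame to compute the splits before inserting it:

    // Pre-split MY_SCHEMA.MY_TABLE based on a 0.5% sample of df, then insert df
    splicemachineContext.splitAndInsert(df, "MY_SCHEMA.MY_TABLE", 0.005)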

tableExists

This method returns true if the specified table exists in your Splice Machine database.

You can pass the schema and table names into this method separately, or in combined (schema.table) format.

tableExists( schemaTableName: String ): Boolean

tableExists( schemaName: String,
             tableName: String ): Boolean

schemaTableName

The combined schema and table names, in the form: mySchema.myTable.

schemaName

The schema name.

tableName

The name of the table.
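
For example, the following sketch drops a hypothetical table only if it exists:

    // Check for the table before dropping it
    if (splicemachineContext.tableExists("MY_SCHEMA", "MY_TABLE")) {
      splicemachineContext.dropTable("MY_SCHEMA", "MY_TABLE")
    }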

truncateTable

This method quickly removes all content from the specified table and returns it to its initial empty state.

It is the same as using the Splice Machine TRUNCATE TABLE SQL statement.

truncateTable( tableName: String ): Unit

tableName

The name of the table.

update

This method updates a Splice Machine table with the contents of a Spark DataFrame or Spark RDD; it is the same as using the Splice Machine UPDATE SQL statement.

You can either pass the data to this method in a DataFrame, or you can pass the data in an RDD, and pass in a structure that specifies the organization of the data.

update( dataFrame: DataFrame,
        schemaTableName: String ): Unit

update( rdd: JavaRDD[Row],
        schema: StructType,
        schemaTableName: String ): Unit

dataFrame

The DataFrame containing the rows that you want updated in your database table.

schemaTableName

The combined schema and table names, in the form: mySchema.myTable.

rdd

The RDD containing the data that you want updated in your database table.

schema

The Catalyst schema of the master table.
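
For example, assuming a DataFrame named changedRows whose rows carry the updated values for existing records (a hypothetical name):

    // Update the matching rows of MY_SCHEMA.MY_TABLE with the values in changedRows
    splicemachineContext.update(changedRows, "MY_SCHEMA.MY_TABLE")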

upsert

This method upserts (inserts new records and updates existing records) the contents of a Spark DataFrame or Spark RDD into a Splice Machine table; it is the same as using the Splice Machine SYSCS_UTIL.UPSERT_DATA_FROM_FILE system procedure.

You can either pass the data to this method in a DataFrame, or you can pass the data in an RDD, and pass in a structure that specifies the organization of the data.

upsert(dataFrame: DataFrame,
       schemaTableName: String): Unit

upsert(rdd: JavaRDD[Row],
       schema: StructType,
       schemaTableName: String): Unit

dataFrame

The DataFrame containing the rows that you want upserted into your database table.

schemaTableName

The combined schema and table names, in the form: mySchema.myTable.

rdd

The RDD containing the data that you want upserted into your database table.

schema

The Catalyst schema of the master table.
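
For example, assuming a DataFrame named df (a hypothetical name):

    // Insert new rows into MY_SCHEMA.MY_TABLE and update rows that already exist
    splicemachineContext.upsert(df, "MY_SCHEMA.MY_TABLE")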

See Also