Using the Splice Machine Native Spark DataSource
This topic provides general information about the Splice Machine Native Spark DataSource (aka the Splice Machine Spark Adapter), in these subsections:
- Native Spark DataSource Overview
- Connecting with the Native Spark DataSource
- Database Permissions and the Native Spark DataSource
- Accessing Database Objects with Internal Access
The other topics in this chapter provide additional information about the Native Spark DataSource:
- Native Spark DataSource API provides reference information for the Native Spark DataSource API methods.
- Native Spark DataSource Examples includes examples that show you how to launch a Spark app with our Spark Submit script, and how to use the Native Spark DataSource interactively, with the Spark Shell.
- Using Our Native Spark DataSource with Zeppelin presents an example of using our Native Spark DataSource in a Zeppelin notebook.
Native Spark DataSource Overview
The Splice Machine Native Spark DataSource, which is also referred to as the Spark Adapter, allows you to directly connect Spark DataFrames and Splice Machine database tables. You can efficiently insert, upsert, select, update, and delete data in your Splice Machine tables directly from Spark in a transactionally consistent manner. With the Spark Adapter, data moves between Spark and your database without serialization/deserialization, which can substantially improve performance compared with traditional over-the-wire transfers.
To use the adapter, you simply instantiate a SplicemachineContext object in your Spark code. You can run Spark applications that interface with your Splice Machine database interactively in the Spark shell or Zeppelin notebooks, or you can launch a Spark app by using our Spark Submit script.
You can craft applications that use Spark and our Native Spark DataSource in Scala, Python, and Java. Note that you can use the Native Spark DataSource in the Splice Machine ML Manager and Zeppelin Notebook interfaces.
Connecting with the Native Spark DataSource
When using the Native Spark DataSource, you can specify some optional properties for the JDBC connection you’re using to access your Splice Machine database. To do so, map those options using a SpliceJDBCOptions object, and then create your SplicemachineContext with that map. For example:
    val options = Map(
      JDBCOptions.JDBC_URL -> "jdbc:splice://<jdbcUrlString>",
      SpliceJDBCOptions.JDBC_INTERNAL_QUERIES -> "true"
    )
    val spliceContext = new SplicemachineContext( options )
SpliceJDBCOptions properties that you can currently specify in the JDBC connect URL are:
|JDBC_INTERNAL_QUERIES||false||A string with value
The path to the temporary directory that you want to use when persisting temporary data from internally executed queries.
The user running a query must have write permission on this directory, or your connected application may freeze or fail.
Note that a typical JDBC URL for connecting to a Splice Machine database looks like this:
Database Permissions and the Native Spark DataSource
You must make sure that each user who is going to use the Splice Machine Native Spark DataSource has execute permission on the SYSCS_UTIL.SYSCS_HDFS_OPERATION system procedure.
SYSCS_UTIL.SYSCS_HDFS_OPERATION is a Splice Machine system procedure that is used internally to efficiently perform direct HDFS operations. This procedure is not documented because it is intended only for use by the Splice Machine code itself; however, the Native Spark DataSource uses it, so any user of the Adapter must have permission to execute the procedure.
Here’s an example of granting execute permission for two users:
    splice> grant execute on procedure SYSCS_UTIL.SYSCS_HDFS_OPERATION to someuser;
    0 rows inserted/updated/deleted
    splice> grant execute on procedure SYSCS_UTIL.SYSCS_HDFS_OPERATION to anotheruser;
    0 rows inserted/updated/deleted
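When more than a couple of users need this permission, it may be convenient to generate the grant statements rather than type each one. The following is a minimal sketch in plain Scala, not part of the Splice Machine API: the GrantStatements object and its grantExecute helper are hypothetical names introduced here for illustration, and the output is simply text that you could paste into the splice> prompt.

```scala
// Hypothetical helper (not part of Splice Machine): builds one GRANT
// statement per user for the procedure named in this section.
object GrantStatements {
  val Procedure = "SYSCS_UTIL.SYSCS_HDFS_OPERATION"

  // Returns the statements in the same order as the given user names.
  def grantExecute(users: Seq[String]): Seq[String] =
    users.map(user => s"grant execute on procedure $Procedure to $user;")

  def main(args: Array[String]): Unit =
    // Prints one statement per line for the two users from the example above.
    grantExecute(Seq("someuser", "anotheruser")).foreach(println)
}
```

The sketch only formats strings; it does not connect to the database or validate that the user names exist.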
Additional Property Setting for Kerberos
If you’re using the Native Spark DataSource on a Kerberized cluster, you must set the following property value in your hbase-site.xml settings file:
Accessing Database Objects with Internal Access
By default, Native Spark DataSource queries execute in the Spark application, which is highly performant and allows access to almost all Splice Machine features. However, when your Native Spark DataSource application uses our Access Control List (ACL) feature, there is a restriction with regard to checking permissions.
The specific problem is that the Native Spark DataSource does not have the ability to check permissions at the view level or column level; instead, it checks permissions on the base table. This means that if your Native Spark DataSource application doesn’t have access to the table underlying a view or column, it will not have access to that view or column; as a result, a query against the view or column fails and throws an exception.
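The restriction can be illustrated with a toy model. This is a sketch of the behavior described in this section, not the adapter's implementation: the PermissionModel object, the View case class, the canQueryView function, and the table and view names are all hypothetical, and the internalAccess flag stands in for the internal access setting discussed in this section.

```scala
// Toy model only: illustrates which object the permission check targets.
// It is not how the Native Spark DataSource is implemented.
object PermissionModel {
  // A view and the base table it is defined on (hypothetical names).
  final case class View(name: String, baseTable: String)

  // grants: names of the objects the user is allowed to query.
  def canQueryView(grants: Set[String], view: View, internalAccess: Boolean): Boolean =
    if (internalAccess) grants.contains(view.name) // view-level check
    else grants.contains(view.baseTable)           // base-table check

  def main(args: Array[String]): Unit = {
    val view = View("SALES.ORDERS_VIEW", "SALES.ORDERS")
    val grants = Set("SALES.ORDERS_VIEW") // user may query the view only

    println(canQueryView(grants, view, internalAccess = false)) // prints false
    println(canQueryView(grants, view, internalAccess = true))  // prints true
  }
}
```

In this model, a user who has been granted access to the view but not to its base table is rejected when the check targets the base table, which corresponds to the failing query described above.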
The workaround for this problem is to tell the Native Spark DataSource to use internal access to the database; this enables view/column permission checking, at a slight cost in performance. With internal access, the adapter runs queries in Splice Machine and temporarily persists data in HDFS while running the query.
The ACL feature is enabled by setting the property splice.authentication.token.enabled = true.