Optimizing Splice Machine Queries
This topic introduces you to Splice Machine query optimization techniques, including information about executing SQL expressions without a table context, and using optimization hints.
Introduction to Query Optimization
Here are a few mechanisms you can use to optimize your Splice Machine queries:
|Use Explain Plan||
You can use the Splice Machine Explain Plan facility to display the execution plan for a statement without actually executing the statement. You can use Explain Plan to help determine options such as which join strategy to use or which index to select.
See the About Explain Plan topic.
|Use Statistics||Your database administrator can refine which statistics are collected on your database, which in turn enhances the operation of the query optimizer. See the Using Statistics topic.|
You can speed up a query by having an index on the criteria used in the
Composite indexes work well for optimizing queries.
|Use Splice Machine query hints||The Splice Machine query optimizer allows you to provide hints to help in the optimization process.|
Using Select Without a Table
Sometimes you want to execute SQL scalar expressions without having a table context. For example, you might want to create a query that evaluates an expression and returns a table with a single row and one column. Or you might want to evaluate a list of comma-separated expressions and return a table with a single row and multiple columns one for each expression.
In Splice Machine, you can execute queries without a table by using the
sysibm.sysdummy1 dummy table. Here’s the syntax:
select expression FROM sysibm.sysdummy1
And here’s an example:
splice> select 1+ 1 from sysibm.sysdummy1;
Using Splice Machine Query Hints
You can use hints to help the Splice Machine query interface optimize your database queries.
The Splice Machine optimizer is constantly being improved, and new hint types sometimes get added. One recent addition is the ability to specify that a query should be run on (or not on) Spark, if possible.
Types of Hints
There are different kinds of hints you can supply, each of which is described in a section below; here’s a summary:
|Hint Type||Example||Used to Specify|
|Index||--splice-properties index=my_index||Which index to use or not use|
|Join Order||--splice-properties joinOrder=fixed||Which join order to use for two tables|
|Join Strategy||--splice-properties joinStrategy=sortmerge||How a join is processed (in conjunction with the Join Order hint)|
|Pinned Table||--splice-properties pin=true||That you want the pinned (cached in memory) version of a table used in a query|
|Spark||--splice-properties useSpark=true||That you want a query to run (or not run) on Spark|
|Delete||--splice-properties bulkDeleteDirectory='/path'||That you are deleting a large amount of data and want to bypass the normal write pipeline to speed up the deletion.|
Including Hints in Your Queries
Hints MUST ALWAYS be at the end of a line, meaning that you must always terminate hints with a newline character.
You cannot add the semicolon that terminates the command immediately after a hint; the semicolon must go on the next line, as shown in the examples in this topic.
Many of the examples in this section show usage of hints on the
splice> command line. Follow the same rules when using hints
Hints can be used in two locations: after a table identifier or after a
FROM clause. Some hint types can be use after a table identifier, and
some can be used after a
|Hint after a:||Hint types||Example|
This example shows proper placement of the hint and semicolon when the hint is at the end of statement:
SELECT * FROM my_table --splice-properties index=my_index;
If your command is broken into multiple lines, you still must add the hints at the end of the line, and you can add hints at the ends of multiple lines; for example:
SELECT * FROM my_table_1 --splice-properties index=my_index , my_table_2 --splice-properties index=my_index_2 WHERE my_table_1.id = my_table_2.parent_id;
In the above query, the first command line ends with the first index hint, because hints must always be the last thing on a command line. That’s why the comma separating the table specifications appears at the beginning of the next line.
Use index hints to tell the query interface how to use certain indexes for an operation.
To force the use of a particular index, you can specify the index name; for example:
splice> SELECT * FROM my_table --splice-properties index=my_index > ;
To tell the query interface to not use an index for an operation, specify the null index. For example:
splice> SELECT * FROM my_table --splice-properties index=null > ;
And to tell the query interface to use specific indexes on specific tables for an operation, you can add multiple hints. For example:
splice> SELECT * FROM my_table_1 --splice-properties index=my_index > , my_table_2 --splice-properties index=my_index_2 > WHERE my_table_1.id = my_table_2.parent_id;
Important Note About Placement of Index Hints
index hint in a query MUST be specified alongside the table
containing the index, or an error will occur.
For example, if we have a table named
points with an index named
ie_point and another table named
rewards with an index named
ie_rewards, then this hint works as expected:
SELECT * FROM member_info m, rewards r, points p --SPLICE-PROPERTIES index=ie_point WHERE...
But the following hint will generate an error because
not an index on the points table.
SELECT * FROM member_info m, rewards r, points p --SPLICE-PROPERTIES index=ie_rewards WHERE...
JoinOrder hints to tell the query interface in which order to join
two tables. You can specify these values for a
joinOrder=FIXEDto tell the query optimizer to order the table join according to how where they are named in the
joinOrder=UNFIXEDto specify that the query optimizer can rearrange the table order.
joinOrder=UNFIXEDis the default, which means that you don’t need to specify this hint to allow the optimizer to rearrange the table order.
Here are examples:
You can use a
JoinStrategy hint in conjunction with a
to tell the query interface how to process a join. For example, this
query specifies that the
SORTMERGE join strategy should be used:
SELECT * FROM --SPLICE-PROPERTIES joinOrder=fixed mytable1 e, mytable2 t --SPLICE-PROPERTIES joinStrategy=SORTMERGE WHERE e.id = t.parent_id;
And this uses a
joinOrder hint along with two
SELECT * FROM --SPLICE-PROPERTIES joinOrder=fixed keyword k JOIN campaign c --SPLICE-PROPERTIES joinStrategy=NESTEDLOOP ON k.campaignid = c.campaignid JOIN adgroup g --SPLICE-PROPERTIES joinStrategy=NESTEDLOOP ON k.adgroupid = g.adgroupid WHERE adid LIKE '%us_gse%'
You can specify these join strategies:
|JoinStrategy Value||Strategy Description|
Read the results of the Right Result Set (RHS) into memory, then for each row in the left result set (LHS), perform a local lookup to determine the right side of the join.
Read the Right and Left result sets simultaneously in order and join them together as they are read.
For each row on the left, fetch the values on the right that match the join.
Re-sort both the left and right sides according to the join keys, then perform a
Pinned Table Hint
You can use the
pin hint to specify to specify that you want a query
to run against a pinned version of a table.
splice> PIN TABLE myTable;splice> SELECT COUNT(*) FROM my_table --splice-properties pin=true> ;
You can read more about pinning tables in the
PIN TABLE statement topic.
You can use the
useSpark hint to specify to the optimizer that you
want a query to run on (or not on) Spark. The Splice Machine query
optimizer automatically determines whether to run a query through our
Spark engine or our HBase engine, based on the type of query; you can
override this by using a hint:
The Splice Machine optimizer uses its estimated cost for a query to decide whether to use spark. If your statistics are out of date, the optimizer may end up choosing the wrong engine for the query.
splice> SELECT COUNT(*) FROM my_table --splice-properties useSpark=true> ;
You can also specify that you want the query to run on HBase and not on Spark. For example:
splice> SELECT COUNT(*) FROM your_table --splice-properties useSpark=false> ;
You can use the
bulkDeleteDirectory hint to specify that you want to
use our bulk delete feature to optimize the deletion of a large amount
of data. Similar to our bulk import
feature, bulk delete generates HFiles,
which allows us to bypass the Splice Machine write pipeline and HBase
write path when performing the deletion. This can significantly speed up
the deletion process.
You need to specify the directory to which you want the temporary HFiles written; you must have write permissions on this directory to use this feature. If you’re specifying an S3 bucket on AWS, please review our Configuring an S3 Bucket for Splice Machine Access tutorial before proceeding.
splice> DELETE FROM my_table --splice-properties bulkDeleteDirectory='/bulkFilesPath' ;
We recommend performing a major compaction on your database after
deleting a large amount of data; you should also be aware of our new
system procedure, which you can call before a compaction to specify that
you want the data physically (not just logically) deleted during