Best Practices: Ingesting Data - Overview
In this Ingesting Data Best Practices topic, we’ll introduce you to the various methods you can use to ingest data into your Splice Machine database, guide you to the best method to use for your specific situation, and then direct you to example(s) that walk you through the steps to implement your ingestion solution.
To get started, use Table 1, below, to determine which use case applies to your situation, then click the How it Works link in the table. You’ll see a summary of how to ingest your data, and a link to a separate page that contains one or more detailed examples.
Pick Your Use Case
Where the data that you’re ingesting is coming from defines which approach you should take to importing that data into Splice Machine; how to use each approach is described, along with examples, in a section within this topic:
|Your Use Case||What to Read||Relative Complexity|
|You have flat files to import||☛||Ingesting Flat Files||✓✓
|You're writing a Spark app||☛||Ingesting Data in a Spark App||✓✓|
|You're writing a streaming app||☛||Ingesting Streaming Data||✓✓✓|
|You're accessing data in an external table||☛||Importing Data From an External Table||✓|
Importing Flat Files
The most common ingestion scenario is importing flat files into your Splice Machine database; typically, CSV files stored on a local computer, in the cloud, or in HDFS on your cluster. Splice Machine offers two primary methods for importing your flat files: bulk HFile imports and basic flat file imports. Each method has a few variants, which we’ll describe later.
Bulk HFile imports are highly performant because the import function pre-splits your data into Hadoop HFiles and imports them directly, which enhances the parallelism of the ingestion processing. When importing larger datasets, this can yield 10x ingestion performance compared to basic import methods. Splice Machine recommends that you use bulk HFile importing large files containing new data.
Basic flat file imports are also performant, and have two important features that may be of importance to you: 1) you can update existing records in addition to adding new records (if and only if the table you’re importing into has a Primary Key), and 2) constraint checking is applied to inserts and updates when using basic import methods.
No matter which method you decide upon, we strongly recommend debugging your ingest process with a small data sample before jumping into importing a large dataset; this will help you to quickly debug any input problems.
Determining Which Method to Use
There are three basic determinants for determining which method to use are:
- Are you importing all new data or do will you also be updating records in an existing table?
- Do you need constraint checking applied to the data you’re importing?
- Are you importing data files larger than 100GB?
The following table summarizes the decision framework for ingesting flat files:
|All New Data?||Apply Constraints?||Data Size||Strategy|
The table into which you're merging data must have a primary key.
Manual splitting adds complexity and requires knowledge of your data: you must determine the key values that will split your input data into approximately evenly-sized partitions (regions).
Note that all of the file import procedures require you to specify the same information about your data, such as: which delimiters are used, how dates and times are formatted, which character set is used, and how to handle invalidly formatted input records. You basically point the procedure at your source data and destination table/index, specify a few parameters, and start loading your data.
About Bulk HFile Import
When you use bulk HFile import, Splice Machine first notes the points in your data where it can be split. During ingestion, the
BULK_IMPORT_HFILE procedure splits the file into temporary Hadoop HFiles (store files) at those points. The store files are then loaded directly into your Region Servers. Ideally, the temporary HFiles are of approximately equal size, which spreads your data evenly across the Region Servers.
Splice Machine offers two bulk import variations:
Automatic Splitting is the easier method because the
BULK_HFILE_IMPORTprocedure takes care of determining the key values that will evenly split your input data into HFiles. The procedure does this by sampling your data to figure out how to split it into (approximately) equal-sized HFiles.
In Manual Splitting, you need to determine those key values yourself, which means that you need to know your data well enough to do that. Because
BULK_HFILE_IMPORTdoes not have to sample your data, manual splitting can increase performance, though it adds complexity.
Contrasting How Standard and Bulk Import Work
To help you better understand how standard and bulk import work differently, consider these significant differences in import processing:
When using standard import methods, the data you’re importing is added to your table; when the table fills a region, the table is split across two regions, and the new region starts filling with table data. With bulk import methods, all of the table regions are pre-allocated, and imported data is copied into those regions in parallel.
When using standard import methods, Splice Machine uses the standard HBase write path, which writes to the HBase Write-ahead-Log (WAL), writes to memstore, flushes to disk, and causes compaction and region splitting. Bulk import bypasses all of this, which results in significantly improved performance.
For Additional Information and Examples
See the Bulk Importing Flat Files topic in this Best Practices chapter for detailed information about and examples of using bulk HFile ingestion.
See the Basic Flat File Ingestion topic in this Best Practices chapter for more information about and examples of using basic flat file ingestion.
Ingesting Data in a Spark App
The Splice Machine Native Spark DataSource allows you to directly insert data into your database from a Spark DataFrame, which provides great performance by eliminating the need to serialize and deserialize the data. You can write Spark apps (for use with spark-submit) that take advantage of the Native Spark DataSource, or you can use the DataSource in your Zeppelin notebooks.
Ingesting data into your Splice Machine database from Spark is simple: once the data is in a Spark DataFrame, you use the Native Spark DataSource’s
merge operations to insert the data into a table in your database. These operations are extremely quick, because Splice Machine reads the data into the table directly, without serializing it, sending it over a wire, and deserializing. You can similarly move data from a Splice Machine table into a Spark DataFrame with a single, non-serialized operation.
The Ingesting Data in a Spark App topic in this Best Practices chapter contains an example of using the Native Spark DataSource in a standalone Spark application.
Ingesting Streaming Data
It’s also easy to stream data into your Splice Machine database using Spark. The Ingesting Streaming Data topic in this Best Practices chapter presents a sample program that uses Spark and Kafka to ingest real-time data into a Splice Machine table.
Ingesting Data With an External Table
Splice Machine supports external tables, which are flat files that you can reference, query, and update inside your database. One common use for external tables is to facilitate ingesting the data in a flat file into a table that you’ve created in your database. All you need to do is:
- Use the
CREATE EXTERNAL TABLEstatement to specify where your external file is and how it is formatted.
- Use the
INSERT INTOstatement to select data from the external file and insert it into a table in your database.
The Ingesting Data From an External Table topic in this Best Practices chapter contains an example of importing data from an ORC file.