Glossary

Term Definition
ACID Transactions ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction.
  • Atomicity means that if one part of the transaction fails, the entire transaction fails.
  • Consistency ensures that any transaction will bring the database from one valid state to another, which means that any data written to the database must be valid according to all rules defined in the database.
  • Isolation ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed in serial order.
  • Durability means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors.
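Atomicity and rollback can be illustrated with a short sketch using Python's built-in sqlite3 module (chosen only because it ships with the standard library; the table and account names are hypothetical):

```python
import sqlite3

# In-memory database for illustration; any transactional RDBMS behaves similarly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

# Transfer funds as one transaction: if any statement fails, undo everything.
try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
    conn.execute("INSERT INTO accounts VALUES ('alice', 0)")  # violates PRIMARY KEY
    conn.commit()
except sqlite3.IntegrityError:
    conn.rollback()  # atomicity: the partial transfer is undone

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

Because the third statement fails, the rollback discards both updates and the balances are unchanged.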
Auto-sharding The database is automatically and transparently partitioned (sharded) across low cost commodity nodes, allowing scale-out of read and write queries, without requiring changes to the application.
BI Tools Business Intelligence Tools
CDH Cloudera's Distribution Including Apache Hadoop, a popular Hadoop platform.
Column-Oriented Data Model A model for storing data in a database as sections of columns, rather than as rows of data. In a column-oriented database, all of the values in a column are serialized together.
Concurrency The ability for multiple users to access data at the same time.
CRM Customer Relationship Management
Cross-table, cross-row transactions A transaction (a group of SQL statements) can modify multiple rows (cross-row) in multiple tables (cross-table).
CRUD Create, Read, Update, Delete. The four basic functions of persistent storage.
DAG Directed Acyclic Graph. A directed graph with no directed cycles, meaning that no path through the graph loops back to its starting point. DAGs are used for various computational purposes, including query optimization in some databases.
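A small sketch of the acyclicity property, using Python's standard-library graphlib. The step names here are hypothetical query-plan stages, not part of any particular database:

```python
from graphlib import TopologicalSorter

# A tiny DAG: each key lists the steps that must complete before it can run.
dag = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
}

# Because the graph has no cycles, a valid linear ordering always exists.
order = list(TopologicalSorter(dag).static_order())
```

The join step always appears after both scans, and the aggregate always appears last; a cycle in the graph would make `static_order` raise an error instead.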
Database Statistics   A form of dynamic metadata that assists the query optimizer in making better decisions by tracking distribution of values in indexes and/or columns.
Database Transaction A sequence of database operations performed as a single logical unit of work.
ERP Enterprise Resource Planning is business management software that a company can use to collect, store, manage and interpret data from many business activities.
Foreign Key A column or columns in one table that references a column (typically the primary key column) of another table. Foreign keys are used to ensure referential integrity.
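A minimal sketch of foreign key enforcement, again using sqlite3 from the Python standard library (the table names are hypothetical; note that SQLite enforces foreign keys only when the pragma is enabled):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id))""")
conn.execute("INSERT INTO customers VALUES (1)")
conn.execute("INSERT INTO orders VALUES (10, 1)")  # valid: customer 1 exists

try:
    conn.execute("INSERT INTO orders VALUES (11, 99)")  # no customer 99
    violated = False
except sqlite3.IntegrityError:
    violated = True  # referential integrity rejected the orphan row
```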
Full join support Databases use join operations to combine fields from multiple tables by using values common to each. Full join support means that the Database Management System supports all five ANSI-standard types of join operations: Inner join, left outer join, right outer join, full outer join, and cross join.
Hadoop An Apache open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
HBase A column-oriented database management system that is part of the Apache Hadoop framework and runs on top of HDFS.
HCatalog Apache HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools to more easily read and write data on the grid.
HDFS Hadoop Distributed File System. A distributed file system that stores data on commodity hardware and is part of the Apache Hadoop framework. It links together the file systems on many local nodes to make them into one big file system.
HDP Hortonworks Data Platform includes Apache Hadoop and is used for storing, processing, and analyzing large volumes of data. The platform is designed to deal with data from many sources and formats.
HIVE Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
HR Human Resources.
JDBC Java Database Connectivity. An API specification for connecting with databases using programs written in Java.
JSON An open standard format that uses human-readable text to transmit data objects consisting of attribute-value pairs. It is used primarily to transmit data between a server and web applications, as an alternative to XML.
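A short round-trip sketch using Python's standard json module (the record contents are made up for illustration):

```python
import json

# A data object as attribute-value pairs.
record = {"id": 42, "name": "Ada", "tags": ["db", "sql"]}

text = json.dumps(record)    # Python object -> human-readable JSON text
parsed = json.loads(text)    # JSON text -> Python object
```

The serialized text is plain, human-readable JSON, and parsing it back yields an equal object.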
JVM Java Virtual Machine. The code execution component of the Java platform.
Key-Value Data Model A fundamental and open-ended data model that allows for extension without modifying existing data. Data is represented in pairs: name (or key) and a value that is associated with that name. Also known as key-value pair, name-value pair, field-value pair, and attribute-value pair.
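As a minimal in-memory analogue, a Python dict exhibits the same open-ended key-value shape (the key naming scheme below is just an illustrative convention):

```python
# A key-value store maps opaque keys to values; new keys can be added
# at any time without modifying existing data.
store = {}
store["user:1:name"] = "Ada"              # write (put)
store["user:1:email"] = "ada@example.com"
name = store.get("user:1:name")           # read (get)
missing = store.get("user:2:name")        # absent keys yield None, not an error
```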
Map Reduce MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. MapReduce programs typically include these steps:
  1. Each worker node applies the Map() function to filter and sort local data.
  2. Worker nodes redistribute (shuffle) data based on output keys produced by the Map() step, so that all data belonging to one key is located on the same node.
  3. Worker nodes process each group of output data in parallel to produce results.
  4. The MapReduce system collects the results and sorts them to produce the final outcome.
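The steps above can be sketched as a single-process word count. This is only a toy model of the programming model; a real framework distributes each phase across worker nodes:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would across nodes.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the grouped values for each key into a final result.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data", "big cluster data"]
counts = reduce_phase(shuffle(map_phase(docs)))
```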
MapR MapR is a complete distribution for Apache Hadoop that packages more than a dozen projects from the Hadoop ecosystem to provide a broad set of big data capabilities.
Multi-partition transactions Transactions that can span a table distributed as multiple partitions across multiple nodes in a cluster. Splice Machine, for example, supports multi-partition transactions.
MVCC MultiVersion Concurrency Control is a method used to control concurrent access to a database. Concurrency control is needed to bypass the potential problem of one user viewing (reading) a data value while another user is writing to the same value.
MySQL An open source Relational Database Management System (RDBMS) that uses Structured Query Language (SQL).
NewSQL NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.  
NoSQL A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
ODBC Open DataBase Connectivity. An open standard API for accessing database management systems, designed to be independent of any specific database or operating system.
OLAP OnLine Analytical Processing. An approach to quickly answering multi-dimensional analytical queries in the Business Intelligence world. OLAP tools allow users to analyze multidimensional data interactively from multiple perspectives.
OLTP OnLine Transaction Processing. A class of information processing systems that facilitate and manage transaction-oriented applications.
Query Optimizer A critical database management system (DBMS) component that analyzes SQL queries and determines efficient execution mechanisms, known as query plans. The optimizer typically generates several plans and then selects the most efficient plan to run the query.
Referential Integrity A property of data that requires every value of one column in a table to exist as a value of another column in a different (or the same) table. This term is generally used to describe the function of foreign keys.
Relational Data Model The model, developed by E.F. Codd, upon which relational databases are based. Relational tables have these properties:
  • Data is presented as a collection of relations
  • Each relation is depicted as a table
  • Columns are attributes that belong to the entity modeled by the table
  • Each row represents a single entity (a record)
  • Every table has a set of attributes, a key, that uniquely identifies each entity
REST REpresentational State Transfer. A simple, stateless, client-server protocol used for networked applications, which uses HTTP requests to communicate among machines. RESTful applications use HTTP requests to post (create or update) data, to read data, and to delete data. Collectively, these are known as CRUD operations.
Rollback An operation that returns the database to some previous state, typically used for recovering from database server crashes: by rolling back any transaction that was active at the time of the crash, the database is restored to a consistent state.
Scale Out A database architecture that doesn't rely on a single controller and scales by adding processing power coupled with additional storage.
Scale Up An architecture that uses a fixed controller resource for all processing. Scaling capacity happens by adding processing power to the controller or (eventually) upgrading to a new (and expensive) controller.
Sharding Horizontal partitioning in a database: the data is split among multiple machines while ensuring that the data is always accessed from the correct place. See Auto-sharding.
Spark Apache Spark is an open-source cluster computing framework that uses in-memory primitives to reduce storage access and boost database performance. Spark allows user applications to load data into a cluster's memory and repeatedly query that data.
Trigger A database trigger is code that is run automatically in response to specific events on specific tables or views in your database. Triggers are typically configured to maintain data integrity, such as ensuring that an updated value is in range.
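A small sketch of a range-enforcing trigger, using sqlite3 from the Python standard library (the table, trigger name, and rule are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price INTEGER)")
# Trigger: automatically reject any update that would set a negative price.
conn.execute("""
CREATE TRIGGER price_in_range
BEFORE UPDATE ON products
WHEN NEW.price < 0
BEGIN
    SELECT RAISE(ABORT, 'price must be non-negative');
END""")
conn.execute("INSERT INTO products VALUES ('widget', 10)")

try:
    conn.execute("UPDATE products SET price = -5 WHERE name = 'widget'")
    blocked = False
except sqlite3.IntegrityError:
    blocked = True  # the trigger aborted the invalid update
```

The update never takes effect: the trigger fires before the write and aborts the statement, so the stored price is unchanged.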
YARN Yet Another Resource Negotiator. YARN assigns CPU, memory, and storage to applications running on a Hadoop cluster, and enables application frameworks other than MapReduce (like Spark) to run on Hadoop.
ZooKeeper Part of the Apache Hadoop framework, ZooKeeper provides a centralized infrastructure and services that enable synchronization across a cluster. ZooKeeper maintains common objects needed in large cluster environments. Examples of these objects include configuration information, hierarchical naming space, and so on. Applications can leverage these services to coordinate distributed processing across large clusters.