Splice Machine Window Functions

An SQL window function performs a calculation across a set of table rows that are related to the current row, either by proximity in the table, or by the value of a specific column or set of columns; these columns are known as the partition.

This topic provides a very quick summary of window functions, as implemented in Splice Machine. For more general information about SQL window functions, we recommending visiting some of the sources listed in the Additional Information section at the end of this topic.

Here’s a quick example of using a window function to operate on the following table:

OrderID CustomerID Amount
123 1 100
144 1 250
167 1 150
202 1 250
209 1 325
224 1 125
66 2 100
94 2 200
127 2 300
444 2 400

This query will find the first Order ID for each specified Customer ID in the above table:

SELECT OrderID, CustomerID,
    FIRST_VALUE(OrderID) OVER (
        PARTITION BY CustomerID
        ORDER BY OrderID
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW )
AS FirstOrderID
FROM ORDERS
WHERE CustomerID IN (1,2);

This works by partitioning (grouping) the selected rows by CustomerID, ordering them for purposes of applying the function to the rows in the partition, and then using the FIRST_VALUE window function to evaluate the OrderID values in each partition and find the first value in each. The results for our sample table are:

OrderID CustomerID FirstOrderID
123 1 123
144 1 123
167 1 123
202 1 123
209 1 123
224 1 123
66 2 66
94 2 66
127 2 66
444 2 66

See the Window Frames section below for a further explanation of this query.

About Window Functions

Window functions:

  • Operate on a window, or set of rows. The rows considered by a window function are produced by the query’s FROM clause as filtered by its WHERE, GROUP BY, and HAVING clauses, if any. This means that any row that doesn’t meet the WHERE condition is not seen by a window function.
  • Are similar to aggregate functions, except that a window function does not group rows into a single output row. Instead, a window function returns a value for every row in the window. This is sometimes referred to as tuple-based aggregation.
  • The values are calculated from the set of rows in the window.
  • Always contain an OVER clause, which determines how the rows of the query are divided and sequenced for processing by the window function.
  • The OVER clause can contain a PARTITION clause that specifies the set of rows in the table that form the window, relative to the current row.
  • The OVER clause can contain an optional ORDER BY clause that specifies in which order rows are processed by the window function. This ORDER BY clause is independent of the ORDER BY clause that specifies the order in which rows are output.

    Note that the ROW NUMBER must contain an ORDER BY clause.

  • The OVER clause can also contain an optional frame clause that further restricts which of the rows in the partition are sent to the function for evaluation.

About Windows, Partitions, and Frames

Using window functions can seem complicated because they involve a number of overlapping terms, including window, sliding window, partition, set, and window frame. An additional complication is that window frames can be specified using either rows or ranges.

Let’s start with basic terminology definitions:

Terms Description
window function A function that operates on a set of rows and produces output for each row.
window partition

The grouping of rows within a table.

Note that window partitions retains the rows, unlike aggregates,

window ordering The sequence of rows within each partition; this is the order in which the rows are passed to the window function for evaluation.
window frame A frame of rows within a window partition, relative to the current row. The window frame is used to further restrict the set of rows operated on by a function, and is sometimes referred to as the row or range clause.
OVER clause

This is the clause used to define how the rows of the table are divided, or partitioned, for processing by the window function. It also orders the rows within the partition.

See the The OVER Clause section below for more information.

partitioning clause

An optional part of an OVER clause that divides the rows into partitions, similar to using the GROUP BY clause. The default partition is all rows in the table, though window functions are generally calculated over a partition.

See the The Partition Clause section below for more information.

ordering clause

Defines the ordering of rows within each partition.

See the The Order Clause section below for more information.

frame clause

Further refines the set of rows when you include an ORDER BY clause in your window function specification, by allowing you to include or exclude rows or values within the ordering.

See the The Frame Clause section below for examples and more information.

The OVER Clause

A window function alway contains an OVER clause, which determines how the rows of the query are divided, or partitioned, for processing by the window function.

expression OVER( 
     [partitionClause]
     [orderClause]
     [frameClause] );

expression

Any value expression that does not itself contain window function calls.

When you use an aggregate function such as AVG with an OVER clause, the aggregated value is computed per partition.

The Partition Clause

The partition clause, which is optional, specifies how the window function is broken down over groups, in the same way that GROUP BY specifies groupings for regular aggregate functions. Some example partitions are:

  • departments within an organization
  • regions within a geographic area
  • quarters within years for sales

If you omit the partition clause, the default partition, which contains all rows in the table, is used.However, since window functions are used to perform calculations over subsets (partitions) of rows in a table, you generally should specify a partition clause.

Syntax

PARTITION BY expression [, ...]

expression [,…]

A list of expressions that define the partitioning.

If you omit this clause, there is one partition that contains all rows in the entire table.

Here’s a simple example of using the partition clause to compute the average order amount per customer:

SELECT OrderID, CustomerID, Amount,
    Avg(Amount) OVER (
        PARTITION BY CustomerID
        ORDER BY OrderID
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW )
AS AverageOrderAmt FROM ORDERS
WHERE CustomerID IN (1,2);
OrderID CustomerID Amount AverageOrderAmt
123 1 100 200
144 1 250 200
167 1 150 200
202 1 250 200
209 1 325 200
224 1 125 200
66 2 100 250
94 2 200 250
127 2 300 250
444 2 400 250

The Order Clause

You can also control the order in which rows are processed by window functions using ORDER BY within your OVER clause. This is optional, though it is important for any ranking or cumulative functions.

Syntax

ORDER BY expression
   [ ASC | DESC | USING operator ]   [ NULLS FIRST | NULLS LAST ]
   [, ...]

Some notes about the ORDER BY clause in an OVER clause:

  • Ascending order (ASC) is the default ordering.
  • If you specify NULLS LAST, then NULL values are returned last; this is the default when you use ASC order.
  • If you specify NULLS FIRST, then NULL values are returned first; this is the default when you use DESC order.
  • The ORDER BY clause in your OVER clause does not have to match the order in which the rows are output.
  • You can only specify a frame clause if you include an ORDER BY clause in your OVER clause.

The Frame Clause

The optional frame clause defines which of the rows in the partition (the frame) should be evaluated by the window function. You can limit which rows in the partition are passed to the function in two ways:

  • Specify a ROWS frame to limit the frame to a fixed number of rows from the partition that precede or follow the current row.
  • Specify RANGE to only include rows in the frame whose evaluated value falls within a certain range of the current row’s value. This is the default, and the current default range is 1, which means that only rows whose value matches that of the current row are passed to the function.

Some sources refer to the frame clause as the Rows or Ranges clause. If you omit this clause, the default is to include all rows

Window frames can only be used when you include an ORDER BY clause within the OVER clause.

Syntax

This clause specifies two offsets: one determines the start of the window frame, and the other determines the end of the window frame.

[RANGE | ROWS] frameStart |
[RANGE | ROWS] BETWEEN frameStart AND frameEnd

RANGE

The frame includes rows whose values are within a specified range of the current row’s value.

The range is determined by the ORDER BY column(s). Rows with identical values for their ORDER BY columns are referred to as peer rows.

ROWS

The frame includes a fixed number of rows based on their position in the table relative to the current row.

frameStart

Specifies the start of the frame.

For ROWS mode, you can specify:

UNBOUNDED PRECEDING
| value PRECEDING
| CURRENT ROW
| value FOLLOWING

value

A non-negative integer value.

For RANGE mode, you can only specify:

CURRENT ROW
| UNBOUNDED FOLLOWING

frameEnd

Specifies the end of the frame. The default value is CURRENT ROW.

For ROWS mode, you can specify:

value PRECEDING
| CURRENT ROW
| value FOLLOWING
| UNBOUNDED FOLLOWING

value

A non-negative integer value.

For RANGE mode, you can only specify:

CURRENT ROW
| UNBOUNDED FOLLOWING

Ranges and Rows

Probably the easiest way to understand how RANGE and ROWS work is by way of some simple OVER clause examples:

Example 1:

This clause can be used to apply a window function to all rows in the partition from the top of the partition to the current row:

OVER (PARTITION BY customerID ORDER BY orderDate)
Example 2:

Both of these clauses specify the same set of rows as Example 1:

OVER (PARTITION BY customerID ORDER BY orderDate UNBOUNDED PRECEDING preceding)

OVER (PARTITION BY customerID ORDER BY orderDate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
Example 3:

This clause can be used to apply a window function to the current row and the 3 preceding row’s values in the partition:

OVER (PARTITION BY customerID ORDER BY orderDate ROWS 3 preceding)

FrameStart and FrameEnd

Some important notes about the frame clause:

  • UNBOUNDED PRECEDING means that the frame starts with the first row of the partition.
  • UNBOUNDED FOLLOWING means that the frame ends with the last row of the partition.
  • You must specify the frameStart first and the frameEnd last within the frame clause.
  • In ROWS mode, CURRENT ROW means that the frame starts or ends with the current row; in RANGE mode, CURRENT ROW means that the frame starts or ends with the current row’s first or last peer in the ORDER BY ordering.
  • The default frameClause is to include all values from the start of the partition through the current row: 

    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    

Common Frame Clauses

When learning about window functions, you may find references to these specific frame clause types:

Frame Clause Type Example
Recycled BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Cumulative BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Rolling BETWEEN 2 PRECEDING AND 2 FOLLOWING

Examples

This is a simple example that doesn’t use a frame clause:

  1. Rank each year within a player by the number of home runs hit by that player:

    RANK() OVER (PARTITION BY playerID ORDER BY H desc);
    

Here are some examples of window functions using frame clauses:

  1. Compute the running sum of G for each player:

    SUM(G) OVER (PARTITION BY playerID ORDER BY yearID
      RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);
    
  2. Compute the career year:

    YearID - min(YEARID) OVER (PARTITION BY playerID
       RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) + 1;
    
  3. Compute a rolling average of games by player:

    AVG(G) OVER (PARTITION BY playerID ORDER BY yearID
       ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING);
    

The Ranking Functions

A subset of our window functions are known as ranking functions:

  • DENSE_RANK ranks each row in the result set. If values in the ranking column are the same, they receive the same rank. The next number in the ranking sequence is then used to rank the row or rows that follow, which means that DENSE_RANK always returns consecutive numbers.
  • RANK ranks each row in the result set. If values in the ranking column are the same, they receive the same rank. However, the next number in the ranking sequence is then skipped, which means that RANK can return non-consecutive numbers.
  • ROW NUMBER assigns a sequential number to each row in the result set.

All ranking functions must include an ORDER BY clause in the OVER() clause, since that is how they compute ranking values.

Window Function Restrictions

Because window functions are only allowed in ORDER BY clauses, and because window functions are computed after both WHERE and HAVING, you sometimes need to use subqueries with window functions to accomplish what seems like it could be done in a simpler query.

For example, because you cannot use an OVER clause in a WHERE clause, a query like the following is not possible:

SELECT *
FROM Batting
WHERE rank() OVER (PARTITION BY playerID ORDER BY G) = 1;

And because WHERE and HAVING are computed before the windowing functions, this won’t work either:

SELECT playerID, rank() OVER (PARTITION BY playerID ORDER BY G) as player_rank FROM Batting
WHERE player_rank = 1;

Instead, you need to use a subquery:

SELECT *
FROM (
   SELECT playerID, G, rank() OVER (PARTITION BY playerID ORDER BY G) as "pos"
   FROM Batting
) tmp
WHERE "pos" = 1;

And note that the above subquery will add a rank column to the original columns,

Window Functions Included in This Release

Splice Machine is currently expanding the set of SQL functions already able to take advantage of windowing functionality.

The  OVER clause topic completes the complete reference information for OVER.

Here is a list of the functions that currently support windowing:

Additional Information

There are numerous articles about window functions that you can find online. Here are a few you might find valuable: