Learning SQL Notes #14: Analytic Functions

Analytic Function Concepts

Data Windows

SELECT quarter(payment_date) quarter, 
 monthname(payment_date) month_nm, 
 sum(amount) monthly_sales, 
 max(sum(amount)) 
    over () max_overall_sales,/*controlled by where and group by and return the highest monthly total payment in 2005*/
 max(sum(amount))
    over (partition by quarter(payment_date)) max_qrtr_sales /*controlled by where and group by and return the highest monthly total payment in each quarter in 2005*/
FROM payment 
WHERE year(payment_date) = 2005
GROUP BY quarter(payment_date), monthname(payment_date);

The analytic functions used to generate these additional columns group rows into two different sets: one set containing all rows in the same quarter and another set containing all of the rows. To accommodate this type of analysis, analytic functions include the ability to group rows into windows, which effectively partition the data for use by the analytic function without changing the overall result set. Windows are defined using the over clause combined with an optional partition by subclause. In the previous query, both analytic functions include an over clause, but the first one is empty, indicating that the window should include the entire result set, whereas the second one specifies that the window should include only rows within the same quarter. Data windows may contain anywhere from a single row to all of the rows in the result set, and different analytic functions can define different data windows.

Localized Sorting

SELECT quarter(payment_date) quarter, 
 monthname(payment_date) month_nm, 
 sum(amount) monthly_sales, 
 rank() over (order by sum(amount) desc) sales_rank /* order by only controls the rank()*/
FROM payment
WHERE year(payment_date) = 2005
GROUP BY quarter(payment_date), monthname(payment_date)
ORDER BY 1, month(payment_date);/* order by only controls the presentation*/

or you may insert partition by quarter(payment_date) into the over() above to obtain rank within each quarter.

Ranking

Ranking Functions

There are multiple ranking functions available in the SQL standard, with each one taking a different approach to how ties are handled:

row_number

Returns a unique number for each row, with rankings arbitrarily assigned in case of a tie

rank

Returns the same ranking in case of a tie, with gaps in the rankings

dense_rank

Returns the same ranking in case of a tie, with no gaps in the rankings

SELECT customer_id, count(*) num_rentals, 
 row_number() over (order by count(*) desc) row_number_rnk, 
 rank() over (order by count(*) desc) rank_rnk, 
 dense_rank() over (order by count(*) desc) dense_rank_rnk 
FROM rental
GROUP BY customer_id
ORDER BY 2 desc;
customer_idnum_rentalsrow_number_rnkrank_rnkdense_rank_rnk
14442333
23642433
7541554

To get back to the original request, how would you identify the top 10 customers? There are three possible solutions:

  • Use the row_number function to identify customers ranked from 1 to 10, which results in exactly 10 customers in this example, but in other cases might exclude customers having the same number of rentals as the 10th ranked customer.

  • Use the rank function to identify customers ranked 10 or less, which also results in exactly 10 customers.

  • Use the dense_rank function to identify customers ranked 10 or less, which yields a list of 37 customers.

Generating Multiple Rankings

SELECT customer_id, 
 monthname(rental_date) rental_month, 
 count(*) num_rentals, 
 rank() over (partition by monthname(rental_date) 
 order by count(*) desc) rank_rnk 
FROM rental
GROUP BY customer_id, monthname(rental_date)
ORDER BY 2, 3 desc;

so that rank() starts from 1 for each month.

Looking at the results, you can see that the rankings are reset to 1 for each month. In order to generate the desired results for the marketing department (top five custom‐ ers from each month), you can simply wrap the previous query in a subquery and add a filter condition to exclude any rows with a ranking higher than five:

SELECT customer_id, rental_month, num_rentals, rank_rnk ranking
FROM 
(SELECT customer_id,
monthname(rental_date) rental_month, count(*) num_rentals,
rank() over (partition by monthname(rental_date) order by count(*) desc) rank_rnk
FROM rental
GROUP BY customer_id, monthname(rental_date) 
) cust_rankings 
WHERE rank_rnk <= 5
ORDER BY rental_month, num_rentals desc, rank_rnk;

Since analytic functions can be used only in the SELECT clause, you will often need to nest queries if you need to do any filtering or grouping based on the results from the analytic function.

Window FunctionReturn TypeDescription
CUME_DIST()DOUBLE PRECISIONThe CUME_DIST() window function calculates the relative rank of the current row within a window partition: (number of rows preceding or peer with current row) / (total rows in the window partition)
DENSE_RANK()BIGINTThe DENSE_RANK () window function determines the rank of a value in a group of values based on the ORDER BY expression and the OVER clause. Each value is ranked within its partition. Rows with equal values receive the same rank. There are no gaps in the sequence of ranked values if two or more rows have the same rank.
NTILE()INTEGERThe NTILE window function divides the rows for each window partition, as equally as possible, into a specified number of ranked groups. The NTILE window function requires the ORDER BY clause in the OVER clause.
PERCENT_RANK()DOUBLE PRECISIONThe PERCENT_RANK () window function calculates the percent rank of the current row using the following formula: (x - 1) / (number of rows in window partition - 1) where x is the rank of the current row.
RANK()BIGINTThe RANK window function determines the rank of a value in a group of values. The ORDER BY expression in the OVER clause determines the value. Each value is ranked within its partition. Rows with equal values for the ranking criteria receive the same rank. Drill adds the number of tied rows to the tied rank to calculate the next rank and thus the ranks might not be consecutive numbers. For example, if two rows are ranked 1, the next rank is 3. The DENSE_RANK window function differs in that no gaps exist if two or more rows tie.
ROW_NUMBER()BIGINTThe ROW_NUMBER window function determines the ordinal number of the current row within its partition. The ORDER BY expression in the OVER clause determines the number. Each value is ordered within its partition. Rows with equal values for the ORDER BY expressions receive different row numbers nondeterministically.

Reporting Functions

Calculate total by month/by total

SELECT monthname(payment_date) payment_month, 
 amount,
 sum(amount) over (partition by monthname(payment_date)) monthly_total,
 sum(amount) over () grand_total 
FROM payment 
WHERE amount >= 10
ORDER BY 1;
payment_monthamountmonthly_totalgrand_total
August10.99521.531262.86
August11.99521.531262.86

Calculate percentage:

SELECT monthname(payment_date) payment_month, 
 amount,
 round(sum(amount) / sum(sum(amount)) over () * 100, 2) pct_of_total
FROM payment 
GROUP BY monthname(payment_date);
payment_monthmonth_totalpct_of_total
May4824.437.16
June9631.8814.29
July28373.8942.09
August24072.1335.71
February514.180.76

Quasi-ranking functions:

SELECT monthname(payment_date) payment_month, 
 sum(amount) month_total, 
 CASE sum(amount) 
  WHEN max(sum(amount)) over () THEN 'Highest'
  WHEN min(sum(amount)) over () THEN 'Lowest'
  ELSE 'Middle'
 END descriptor
FROM payment
GROUP BY monthname(payment_date);
payment_monthmonth_totaldescriptor
May4824.43Middle
June9631.88Middle
July28373.89Highest
August24072.13Middle
February514.18Lowest

Window Frames

SELECT yearweek(payment_date) payment_week, 
 sum(amount) week_total, 
 sum(sum(amount)) 
  over (order by yearweek(payment_date)
   rows unbounded preceding) rolling_sum
FROM payment
GROUP BY yearweek(payment_date) 
ORDER BY 1;
SELECT yearweek(payment_date) payment_week, 
 sum(amount) week_total, 
 avg(sum(amount)) 
  over (order by yearweek(payment_date) 
  rows between 1 preceding and 1 following) rolling_3wk_avg 
FROM payment
GROUP BY yearweek(payment_date)
ORDER BY 1;
SELECT date(payment_date), sum(amount), 
 avg(sum(amount)) 
  over (order by date(payment_date) 
  range between interval 3 day preceding and interval 3 day following) range 
FROM payment
WHERE payment_date BETWEEN '2005-07-01' AND '2005-09-01'
GROUP BY date(payment_date)
ORDER BY 1;

Lag and Lead

Window FunctionArgument TypeReturn TypeDescription
LAG()Any supported Drill data typesSame as the expression typeThe LAG() window function returns the value for the row before the current row in a partition. If no row exists, null is returned.
LEAD()Any supported Drill data typesSame as the expression typeThe LEAD() window function returns the value for the row after the current row in a partition. If no row exists, null is returned.
FIRST_VALUEAny supported Drill data typesSame as the expression typeThe FIRST_VALUE window function returns the value of the specified expression with respect to the first row in the window frame.
LAST_VALUEAny supported Drill data typesSame as the expression typeThe LAST_VALUE window function returns the value of the specified expression with respect to the last row in the window frame.
SELECT yearweek(payment_date) payment_week, 
 sum(amount) week_total, 
  lag(sum(amount), 1)
   over (order by yearweek(payment_date)) prev_wk_tot, 
  lead(sum(amount), 1)
   over (order by yearweek(payment_date)) next_wk_tot,
FROM payment
GROUP BY yearweek(payment_date)
ORDER BY 1;
SELECT yearweek(payment_date) payment_week, 
 sum(amount) week_total, 
  round((sum(amount) - lag(sum(amount), 1) 
  over (order by yearweek(payment_date))) / lag(sum(amount), 1)
    over (order by yearweek(payment_date)) * 100, 1) pct_diff 
FROM payment
GROUP BY yearweek(payment_date)
ORDER BY 1;

Column Value Concatenation

SELECT f.title, 
 group_concat(a.last_name order by a.last_name separator ', ') actors
FROM actor a 
INNER JOIN film_actor fa 
 ON a.actor_id = fa.actor_id 
INNER JOIN film f 
 ON fa.film_id = f.film_id 
GROUP BY f.title
HAVING count(*) = 3;

Related