Learning SQL Notes #7: Grouping and Aggregates (CH. 8)

Siqi Zheng

Last updated on Jun 7, 2021 2 min read SQL

Grouping Concepts
Aggregate Functions
Generating Groups
Group Filter Conditions

Grouping Concepts

SELECT customer_id, count(*) 
FROM rental
GROUP BY customer_id
HAVING count(*) >= 40
ORDER BY 2 DESC;

WARNING:

~~WHERE count(*) >= 40~~ since aggregate functions should come with HAVING.

R codes:

library(tidyverse)
rental %>%
  group_by(customer_id) %>%
  summarize(counts=n()) %>%
  filter(counts>=40) %>%
  arrange(desc(counts))

Aggregate Functions

Some aggregate functions in SQL/R:

SQL	R
count()	count()
sum()	sum()
average()	mean()
min()	min()
max()	max()
group_concat()	paste()
first()	[1]
last()	[-1]

SELECT COUNT(DISTINCT col1) 
FROM string_tbl;

R codes:

length(unique(string_tbl$col1))

NULLS are ignored unless you use count(*) where all rows will be counted.

Generating Groups

Single-Column/Multicolumn Grouping

Grouping can be done on 1 or more columns with aggregate functions.

SELECT actor_id, count(*) 
FROM film_actor
GROUP BY actor_id;

SELECT fa.actor_id, f.rating, count(*) 
FROM film_actor fa 
INNER JOIN film f 
ON fa.film_id = f.film_id 
GROUP BY fa.actor_id, f.rating
ORDER BY 1,2;

R codes are analogous to the codes in the last section.

Grouping via Expressions

SELECT extract(YEAR FROM rental_date) year, 
COUNT(*) how_many 
FROM rental
GROUP BY extract(YEAR FROM rental_date);

R codes:

library(tidyverse)
rental %>%
  mutate(year=year(rental_date)) %>%
  group_by(year) %>%
  summarize(counts=n()) %>%

Generating Rollups

Find total counts for each distinct actor.

/*MySQL*/
SELECT fa.actor_id, f.rating, count(*)  
FROM film_actor fa  
INNER JOIN film f  
ON fa.film_id = f.film_id  
GROUP BY fa.actor_id, f.rating WITH ROLLUP
ORDER BY 1,2;

/*Oracle*/
GROUP BY ROLLUP(fa.actor_id, f.rating)
GROUP BY a, ROLLUP(b, c)

actor_id	rating	count(*)
NULL	NULL	5462
1	NULL	19
1	G	4
1	PG	6
1	PG-13	1
1	R	3
1	NC-17	5
2	NULL	25
2	G	7

R codes:

library(reshape2)
library(zoo)

m <- melt(df, measure.vars = "sales")
dout <- dcast(m, year + month + region ~ variable, fun.aggregate = sum, margins = "month")

dout$month <- na.locf(replace(dout$month, dout$month  == "(all)", NA))

See here: https://stackoverflow.com/questions/36169073/how-to-do-group-by-rollup-in-r-like-sql