Learning SQL Notes #7: Grouping and Aggregates (CH. 8)

Grouping Concepts
SELECT customer_id, count(*)
FROM rental
GROUP BY customer_id
HAVING count(*) >= 40
ORDER BY 2 DESC;
WARNING:
WHERE count(*) >= 40 since aggregate functions should come with HAVING.
R codes:
library(tidyverse)
rental %>%
group_by(customer_id) %>%
summarize(counts=n()) %>%
filter(counts>=40) %>%
arrange(desc(counts))
Aggregate Functions
Some aggregate functions in SQL/R:
| SQL | R |
|---|---|
| count() | count() |
| sum() | sum() |
| average() | mean() |
| min() | min() |
| max() | max() |
| group_concat() | paste() |
| first() | [1] |
| last() | [-1] |
SELECT COUNT(DISTINCT col1)
FROM string_tbl;
R codes:
length(unique(string_tbl$col1))
NULLS are ignored unless you use count(*) where all rows will be counted.
Generating Groups
Single-Column/Multicolumn Grouping
Grouping can be done on 1 or more columns with aggregate functions.
SELECT actor_id, count(*)
FROM film_actor
GROUP BY actor_id;
SELECT fa.actor_id, f.rating, count(*)
FROM film_actor fa
INNER JOIN film f
ON fa.film_id = f.film_id
GROUP BY fa.actor_id, f.rating
ORDER BY 1,2;
R codes are analogous to the codes in the last section.
Grouping via Expressions
SELECT extract(YEAR FROM rental_date) year,
COUNT(*) how_many
FROM rental
GROUP BY extract(YEAR FROM rental_date);
R codes:
library(tidyverse)
rental %>%
mutate(year=year(rental_date)) %>%
group_by(year) %>%
summarize(counts=n()) %>%
Generating Rollups
Find total counts for each distinct actor.
/*MySQL*/
SELECT fa.actor_id, f.rating, count(*)
FROM film_actor fa
INNER JOIN film f
ON fa.film_id = f.film_id
GROUP BY fa.actor_id, f.rating WITH ROLLUP
ORDER BY 1,2;
/*Oracle*/
GROUP BY ROLLUP(fa.actor_id, f.rating)
GROUP BY a, ROLLUP(b, c)
| actor_id | rating | count(*) |
|---|---|---|
| NULL | NULL | 5462 |
| 1 | NULL | 19 |
| 1 | G | 4 |
| 1 | PG | 6 |
| 1 | PG-13 | 1 |
| 1 | R | 3 |
| 1 | NC-17 | 5 |
| 2 | NULL | 25 |
| 2 | G | 7 |
R codes:
library(reshape2)
library(zoo)
m <- melt(df, measure.vars = "sales")
dout <- dcast(m, year + month + region ~ variable, fun.aggregate = sum, margins = "month")
dout$month <- na.locf(replace(dout$month, dout$month == "(all)", NA))
See here: https://stackoverflow.com/questions/36169073/how-to-do-group-by-rollup-in-r-like-sql
Group Filter Conditions
HAVINGwith aggregate functions;WHEREwith original columns;
