Learning SQL Notes #8: Subqueries

What Is a Subquery?
A subquery is a query contained within another SQL statement (which I refer to as the containing statement for the rest of this discussion). A subquery is always enclosed within parentheses, and it is usually executed prior to the containing statement. Like any query, a subquery returns a result set that may consist of:
- A single row with a single column
- Multiple rows with a single column
- Multiple rows having multiple columns
SELECT customer_id, first_name, last_name FROM customer WHERE customer_id = (SELECT MAX(customer_id) FROM customer);
Subquery Types
Noncorrelated Subqueries
Multiple-Row, Single-Column Subqueries
The in and not in operators
SELECT city_id, city FROM city WHERE country_id <> (SELECT country_id FROM country WHERE country = 'India');
Note: Subquery should not return more than one row when you use WHERE to filter a condition with inequality/equality in this case.
What you can do is use the following subqueries:
SELECT country_id
FROM country
WHERE country IN ('Canada','Mexico');
or
SELECT country_id
FROM country
WHERE country = 'Canada' OR country = 'Mexico';
in the following ways:
SELECT city_id, city
FROM city
WHERE country_id IN
(SELECT country_id
FROM country
WHERE country IN ('Canada','Mexico'));
or the opposite:
SELECT city_id, city
FROM city
WHERE country_id NOT IN
(SELECT country_id
FROM country
WHERE country IN ('Canada','Mexico'));
The all operator
The all operator allows you to make comparisons between a single value and every value in a set:
SELECT first_name, last_name
FROM customer
WHERE customer_id <> ALL
(SELECT customer_id
FROM payment
WHERE amount = 0);
or the equivalent:
SELECT first_name, last_name
FROM customer
WHERE customer_id NOT IN
(SELECT customer_id
FROM payment
WHERE amount = 0);
Any attempt to equate a value to null yields unknown, so when using not in or <> all to compare a value to a set of values, you must be careful to ensure that the set of values does not contain a null value.
The subquery in this example returns the total number of film rentals for all custom‐ ers in North America, and the containing query returns all customers whose total number of film rentals exceeds any of the North American customers.
SELECT customer_id, count(*)
FROM rental
GROUP BY customer_id
HAVING count(*) > ALL
(SELECT count(*)
FROM rental r
INNER JOIN customer c
ON r.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
INNER JOIN country co
ON ct.country_id = co.country_id
WHERE co.country IN ('United States','Mexico','Canada')
GROUP BY r.customer_id
);
The any operator (OR)
A condition using the any operator evaluates to true as soon as a single comparison is favorable.
SELECT customer_id, sum(amount)
FROM payment
GROUP BY customer_id
HAVING sum(amount) > ANY
(SELECT sum(amount)
FROM payment p
INNER JOIN customer c
ON r.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
INNER JOIN country co
ON ct.country_id = co.country_id
WHERE co.country IN ('Bolivia','Paraguay','Chile')
GROUP BY co.country
);
Multicolumn Subqueries
SELECT actor_id, film_id
FROM film_actor
WHERE (actor_id, film_id) IN
(SELECT a.actor_id, f.film_id
FROM actor a
CROSS JOIN film f
WHERE a.last_name = 'MONROE'
AND f.rating = 'PG');
Correlated Subqueries
A correlated subquery, on the other hand, is dependent on its containing statement from which it references one or more columns.
SELECT c.first_name, c.last_name FROM customer c WHERE 20 = (SELECT count(*) FROM rental r WHERE r.customer_id = c.customer_id); /*customers who have rented exactly 20 films*/
The exists Operator
You use the exists operator when you want to identify that a relationship exists without regard for the quantity.
SELECT c.first_name, c.last_name
FROM customer c
WHERE (NOT) EXISTS
(SELECT r.rental_date, r.customer_id, 'ABCD' str, 2 * 3 / 7 nmbr /*can be replaced by anything*/
FROM rental r
WHERE r.customer_id = c.customer_id
AND date(r.rental_date) < '2005-05-25');
Since the condition in the containing query only needs to know how many rows have been returned, the actual data the subquery returned is irrelevant.
Data Manipulation Using Correlated Subqueries
UPDATE customer c
SET c.last_update =
(SELECT max(r.rental_date)
FROM rental r
WHERE r.customer_id = c.customer_id);
UPDATE customer c SET c.last_update =
(SELECT max(r.rental_date) FROM rental r WHERE r.customer_id = c.customer_id) WHERE EXISTS
(SELECT 1 FROM rental r
WHERE r.customer_id = c.customer_id);
/*executes only if the condition in the update statement’s where clause evaluates to true (meaning that at least one rental was found for the customer), thus protecting the data in the last_update column from being
overwritten with a null.*/
DELETE FROM customer WHERE 365 < ALL
(SELECT datediff(now(), r.rental_date) days_since_last_rental FROM rental r
WHERE r.customer_id = customer.customer_id);
/*removes rows from the customer table where there have been no film rentals in the past year*/
When to Use Subqueries
Subqueries as Data Sources
SELECT c.first_name, c.last_name, pymnt.num_rentals, pymnt.tot_payments
FROM customer c
INNER JOIN
(SELECT customer_id, count(*) num_rentals, sum(amount) tot_payments
FROM payment
GROUP BY customer_id ) pymnt /*execute first*/
ON c.customer_id = pymnt.customer_id;
Data fabrication
First we have a table for some standards (small/average/heavy) with lower and upper bounds.
SELECT 'Small Fry' name, 0 low_limit, 74.99 high_limit UNION ALL
SELECT 'Average Joes' name, 75 low_limit, 149.99 high_limit
UNION ALL
SELECT 'Heavy Hitters' name, 150 low_limit, 9999999.99 high_limit;
Then we have transformed the original tables into the desired one.
SELECT pymnt_grps.name, count(*) num_customers
FROM
(SELECT customer_id, count(*) num_rentals, sum(amount) tot_payments
FROM payment
GROUP BY customer_id) pymnt
INNER JOIN (SELECT 'Small Fry' name, 0 low_limit, 74.99 high_limit
UNION ALL
SELECT 'Average Joes' name, 75 low_limit, 149.99 high_limit
UNION ALL
SELECT 'Heavy Hitters' name, 150 low_limit, 9999999.99 high_limit ) pymnt_grps
ON pymnt.tot_payments
BETWEEN pymnt_grps.low_limit AND pymnt_grps.high_limit
GROUP BY pymnt_grps.name;
Task-oriented subqueries
SELECT c.first_name, c.last_name, ct.city,
sum(p.amount) tot_payments, count(*) tot_rentals
FROM payment p
INNER JOIN customer c
ON p.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
GROUP BY c.first_name, c.last_name, ct.city;
We only need names/cities/addresses for display purpose only, so we can use subqueries to group the data first before joining other tables. A more efficient code chunk for the same task:
SELECT c.first_name, c.last_name, ct.city, pymnt.tot_payments, pymnt.tot_rentals
FROM (SELECT customer_id, count(*) tot_rentals, sum(amount) tot_payments
FROM payment
GROUP BY customer_id) pymnt
INNER JOIN customer c
ON pymnt.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id;
Common table expressions
WITH actors_s AS
(SELECT actor_id, first_name, last_name
FROM actor
WHERE last_name LIKE 'S%'
) /*can be used in the subsequent queries*/
...
Subqueries as Expression Generators
Correlated scalar subqueries. The customer table is accessed three times (once in each of the three subqueries) rather than just once.
SELECT (SELECT c.first_name
FROM customer c
WHERE c.customer_id = p.customer_id ) first_name, (SELECT c.last_name
FROM customer c
WHERE c.customer_id = p.customer_id ) last_name, (SELECT ct.city
FROM customer c
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
WHERE c.customer_id = p.customer_id
) city,
sum(p.amount) tot_payments, count(*) tot_rentals
FROM payment p
GROUP BY p.customer_id;
Similarly,
INSERT INTO film_actor (actor_id, film_id, last_update) VALUES (
(SELECT actor_id
FROM actor
WHERE first_name = 'JENNIFER' AND last_name = 'DAVIS'), (SELECT film_id FROM film
WHERE title = 'ACE GOLDFINGER'),
now()
);
Subquery Wrap-Up
- Return a single column and row, a single column with multiple rows, and multi‐ ple columns and rows
- Are independent of the containing statement (noncorrelated subqueries)
- Reference one or more columns from the containing statement (correlated subqueries)
- Are used in conditions that utilize comparison operators as well as the special-purpose operators in, not in, exists, and not exists
- Can be found in select, update, delete, and insert statements
- Generate result sets that can be joined to other tables (or subqueries) in a query
- Can be used to generate values to populate a table or to populate columns in a query’s result set
- Are used in the select, from, where, having, and order by clauses of queries
Happy learning!
