SQL | Siqi Zheng

Learning SQL Notes #16: SQL and Big Data

Fri, 11 Jun 2021 15:00:00 +0000

Introduction to Apache Drill
Querying Files Using Drill
Querying MySQL Using Drill
Querying MongoDB Using Drill
Drill with Multiple Data Sources
Future of SQL

The data landscape has changed quite a bit over the past decade, and SQL is changing to meet the needs of today’s rapidly evolving environments. Many organizations that had used relational databases exclusively just a few years ago are now also housing data in Hadoop clusters, data lakes, and NoSQL databases. At the same time, companies are struggling to find ways to gain insights from the ever-growing volumes of data, and the fact that this data is now spread across multiple data stores, perhaps both on-site and in the cloud, makes this a daunting task.

Because SQL is used by millions of people and has been integrated into thousands of applications, it makes sense to leverage SQL to harness this data and make it actionable. Over the past several years, a new breed of tools has emerged to enable SQL access to structured, semi-structured, and unstructured data: tools such as Presto, Apache Drill, and Toad Data Point. This chapter explores one of these tools, Apache Drill, to demonstrate how data in different formats and stored on different servers can be brought together for reporting and analysis.

Introduction to Apache Drill

Compelling features:

Facilitates queries across multiple data formats, including delimited data, JSON, Parquet, and log files
Connects to relational databases, Hadoop, NoSQL, HBase, and Kafka, as well as specialized data formats such as PCAP, BlockChain, and others
Allows creation of custom plug-ins to connect to most any other data store
Requires no up-front schema definitions
Supports the SQL:2003 standard
Works with popular business intelligence (BI) tools like Tableau and Apache Superset

Using Drill, you can connect to any number of data sources and begin querying, without the need to first set up a metadata repository.

Querying Files Using Drill

Let’s start by using Drill to query data in a file. Drill understands how to read several different file formats, including packet capture (PCAP) files, which are in binary format and contain information about packets traveling over a network. All I have to do when I want to query a PCAP file is to configure Drill’s dfs (distributed filesystem) plug-in to include the path to the directory containing my files, and I’m ready to write queries.

Drill includes partial support for information_schema, so you can find out high-level information about the data files in your workspace:

SELECT file_name, is_directory, is_file, permission
FROM information_schema.`files`
WHERE schema_name = 'dfs.data';

To see the column names without returning any rows:

SELECT * FROM dfs.data.`attack-trace.pcap`
WHERE 1=2;

Count the number of packets sent from each IP address to each destination port:

SELECT src_ip, dst_port,
count(*) AS packet_count
FROM dfs.data.`attack-trace.pcap`
GROUP BY src_ip, dst_port;

Aggregate packet information for each second:

SELECT trunc(extract(second from `timestamp`)) as packet_time,
count(*) AS num_packets,
sum(packet_length) AS tot_volume
FROM dfs.data.`attack-trace.pcap`
GROUP BY trunc(extract(second from `timestamp`));

Put backticks (`) around timestamp because it is a reserved word.

You can query files stored locally, on your network, in a distributed filesystem, or in the cloud. Drill has built-in support for many file types, but you can also build your own plug-in to allow Drill to query any type of file.

Querying MySQL Using Drill

Why Apache Drill? Because you can write queries using Drill that combine data from different sources, so you might write a query that joins data from MySQL, Hadoop, and comma-delimited files, for example.

The first step is to choose a database:

apache drill (information_schema)> use mysql.sakila;
show tables;

Simple joins and the group by, order by, and having clauses work in Drill as well. However, Drill works with many relational databases, not just MySQL, so some features of the language may differ (e.g., data conversion functions). For more information, read Drill’s documentation about its SQL implementation.

Querying MongoDB Using Drill

After using Drill to query the sample Sakila data in MySQL, the next logical step is to convert the Sakila data to another commonly used format, store it in a nonrelational database, and use Drill to query the data. I decided to convert the data to JSON and store it in MongoDB, which is one of the more popular NoSQL platforms for document storage. Drill includes a plug-in for MongoDB and also understands how to read JSON documents, so it was relatively easy to load the JSON files into Mongo and begin writing queries.

After the JSON files have been loaded, the Mongo database contains two collections (films and customers), and the data in these collections spans nine different tables from the MySQL Sakila database.

Group the data by rating and actor:

SELECT g_pg_films.Rating,
g_pg_films.actor_list.`First name` first_name,
g_pg_films.actor_list.`Last name` last_name,
count(*) num_films
FROM
(SELECT f.Rating, flatten(Actors) actor_list
FROM films f
WHERE f.Rating IN ('G','PG')
) g_pg_films
GROUP BY g_pg_films.Rating,
g_pg_films.actor_list.`First name`,
g_pg_films.actor_list.`Last name`
HAVING count(*) > 9;

The following query returns all customers who have spent more than $80 renting films rated either G or PG:

SELECT first_name, last_name,
sum(cast(cust_payments.payment_data.Amount
as decimal(4,2))) tot_payments
FROM
(SELECT cust_data.first_name,
cust_data.last_name,
f.Rating,
flatten(cust_data.rental_data.Payments)
payment_data
FROM films f
INNER JOIN
(SELECT c.`First Name` first_name,
c.`Last Name` last_name, flatten(c.Rentals) rental_data
FROM customers c
) cust_data
ON f._id = cust_data.rental_data.filmID
WHERE f.Rating IN ('G','PG')
) cust_payments
GROUP BY first_name, last_name
HAVING
sum(cast(cust_payments.payment_data.Amount as decimal(4,2))) > 80;

The innermost query, which I named cust_data, flattens the Rentals list so that the cust_payments query can join to the films collection and also flatten the Payments list. The outermost query groups the data by customer name and applies a having clause to filter out customers who spent $80 or less on films rated G or PG.
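
To make the effect of flatten more concrete, here is a minimal Python sketch of the same logic run against in-memory documents. The field names follow the queries above, but the sample documents and the helper names (flatten, g_pg_big_spenders) are made up for illustration, and this is not how Drill implements the function.

```python
from decimal import Decimal

def flatten(rows, field):
    """Yield one copy of each row per element of the list stored in `field`."""
    for row in rows:
        for item in row[field]:
            yield {**row, field: item}

def g_pg_big_spenders(films, customers, threshold):
    """Total payments per customer for G/PG films, keeping totals above threshold."""
    rated = {f["_id"] for f in films if f["Rating"] in ("G", "PG")}
    totals = {}
    # First flatten: one row per (customer, rental).
    for cust in flatten(customers, "Rentals"):
        if cust["Rentals"]["filmID"] not in rated:
            continue
        # Second flatten: one row per (rental, payment).
        for rental in flatten([cust["Rentals"]], "Payments"):
            key = (cust["First Name"], cust["Last Name"])
            amount = Decimal(rental["Payments"]["Amount"])
            totals[key] = totals.get(key, Decimal("0")) + amount
    return {k: v for k, v in totals.items() if v > threshold}

# Hypothetical sample documents (not taken from the Sakila data).
films = [{"_id": 1, "Rating": "G"}, {"_id": 2, "Rating": "R"}]
customers = [
    {"First Name": "ANN", "Last Name": "LEE",
     "Rentals": [
         {"filmID": 1, "Payments": [{"Amount": "4.99"}, {"Amount": "2.99"}]},
         {"filmID": 2, "Payments": [{"Amount": "9.99"}]},
     ]},
]
result = g_pg_big_spenders(films, customers, Decimal("5"))
```

Only the two payments for the G-rated film are summed, so the R-rated rental does not count toward the threshold.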

Drill with Multiple Data Sources

As long as Drill is configured to connect to both databases, a single query can join them. You just need to qualify each table with its plug-in and schema name so that Drill knows where to find the data:

FROM mysql.sakila.film f
FROM mongo.sakila.customers c
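
As a loose analogy only (this is SQLite, not Drill), the idea of joining tables that live in different stores by qualifying their names can be tried with Python's sqlite3 module. The database alias other and the tables below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second, separate in-memory database under the alias "other".
conn.execute("ATTACH DATABASE ':memory:' AS other")
conn.execute("CREATE TABLE main.film (film_id INTEGER, title TEXT)")
conn.execute("CREATE TABLE other.rental (film_id INTEGER, customer TEXT)")
conn.execute("INSERT INTO main.film VALUES (1, 'ALPHA'), (2, 'BETA')")
conn.execute("INSERT INTO other.rental VALUES (1, 'ANN'), (1, 'BOB'), (2, 'ANN')")

# Each table is qualified by the database it lives in.
rows = conn.execute("""
    SELECT f.title, count(*) AS num_rentals
    FROM main.film f
      INNER JOIN other.rental r ON r.film_id = f.film_id
    GROUP BY f.title
    ORDER BY f.title
""").fetchall()
```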

Future of SQL

The future of relational databases is somewhat unclear. It is possible that the big data technologies of the past decade will continue to mature and gain market share. It’s also possible that a new set of technologies will emerge, overtaking Hadoop and NoSQL, and taking additional market share from relational databases. However, most companies still run their core business functions using relational databases, and it should take a long time for this to change.

The future of SQL seems a bit clearer, however. While the SQL language started out as a mechanism for interacting with data in relational databases, tools like Apache Drill act more like an abstraction layer, facilitating the analysis of data across various database platforms. In this author’s opinion, this trend will continue, and SQL will remain a critical tool for data analysis and reporting for many years.

Learning SQL Notes #15: Working with Large Databases

Fri, 11 Jun 2021 09:00:00 +0000

Partitioning
Clustering
Sharding
Big Data

While relational databases face various challenges as data volumes continue to grow, there are strategies such as partitioning, clustering, and sharding that allow companies to continue to utilize relational databases by spreading data across multiple storage tiers and servers. Other companies have decided to move to big data platforms such as Hadoop in order to handle huge data volumes.

Partitioning

The following tasks become more difficult and/or time consuming as a table grows past a few million rows:

Query execution requiring full table scans
Index creation/rebuild
Data archival/deletion
Generation of table/index statistics
Table relocation (e.g., move to a different tablespace)
Database backups

The best way to prevent administrative issues from occurring in the future is to break large tables into pieces, or partitions, when the table is first created (although tables can be partitioned later, it is easier to do so initially). Administrative tasks can be performed on individual partitions, often in parallel, and some tasks can skip one or more partitions entirely.

Partitioning Concepts

While every partition must have the same schema definition (columns, column types, etc.), there are several administrative features that can differ for each partition:

Partitions may be stored on different tablespaces, which can be on different physical storage tiers.
Partitions can be compressed using different compression schemes.
Local indexes (more on this shortly) can be dropped for some partitions.
Table statistics can be frozen on some partitions, while being periodically refreshed on others.
Individual partitions can be pinned into memory or stored in the database’s flash storage tier.

Table Partitioning

The partitioning scheme available in most relational databases is horizontal partitioning, which assigns entire rows to exactly one partition. Tables may also be partitioned vertically, which involves assigning sets of columns to different partitions, but this must be done manually. When partitioning a table horizontally, you must choose a partition key, which is the column whose values are used to assign a row to a particular partition. In most cases, a table’s partition key consists of a single column, and a partitioning function is applied to this column to determine in which partition each row should reside.

Index Partitioning

If your partitioned table has indexes, you will get to choose whether a particular index should stay intact, known as a global index, or be broken into pieces such that each partition has its own index, which is called a local index. Global indexes span all partitions of the table and are useful for queries that do not specify a value for the partition key.

Partitioning Methods

Range partitioning

The most common usage is to break up tables by date ranges.

CREATE TABLE sales
(sale_id INT NOT NULL,
cust_id INT NOT NULL,
store_id INT NOT NULL,
sale_date DATE NOT NULL,
amount DECIMAL(9,2)
)
PARTITION BY RANGE (yearweek(sale_date))
(PARTITION s1 VALUES LESS THAN (202002),
PARTITION s2 VALUES LESS THAN (202003),
PARTITION s3 VALUES LESS THAN (202004),
PARTITION s4 VALUES LESS THAN (202005),
PARTITION s5 VALUES LESS THAN (202006),
PARTITION s999 VALUES LESS THAN (MAXVALUE)
);
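
The way a row lands in one of these partitions can be sketched in Python. This is a simplified illustration that takes the yearweek value as given; the helper name range_partition is hypothetical, and it does not reproduce MySQL's yearweek() calculation.

```python
from bisect import bisect_right

# Upper bounds from the VALUES LESS THAN clauses, in order.
BOUNDS = [202002, 202003, 202004, 202005, 202006]
# One more name than bounds: s999 catches everything up to MAXVALUE.
NAMES = ["s1", "s2", "s3", "s4", "s5", "s999"]

def range_partition(yearweek_value):
    """Return the first partition whose upper bound is greater than the value."""
    return NAMES[bisect_right(BOUNDS, yearweek_value)]
```

For example, range_partition(202002) is s2, because 202002 is not less than the bound for s1 but is less than 202003.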

Read and modify partitions:

SELECT partition_name, partition_method, partition_expression
FROM information_schema.partitions 
WHERE table_name = 'sales'
ORDER BY partition_ordinal_position;

ALTER TABLE sales REORGANIZE PARTITION s999 INTO
(PARTITION s6 VALUES LESS THAN (202007),
PARTITION s7 VALUES LESS THAN (202008),
PARTITION s999 VALUES LESS THAN (MAXVALUE)
);

List partitioning

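List partitioning assigns rows to partitions based on an enumerated set of values:

PARTITION BY LIST COLUMNS (geo_region_cd)
(PARTITION ASIA VALUES IN ('CHN','JPN','IND'))

To add a value to an existing partition, reorganize it:

ALTER TABLE sales REORGANIZE PARTITION ASIA INTO
(PARTITION ASIA VALUES IN ('CHN','JPN','IND', 'KOR'));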

Hash partitioning

Hash partitioning endeavors to distribute rows evenly across a fixed set of partitions. The server does this by applying a hashing function to the column value.

PARTITION BY HASH (cust_id)
PARTITIONS 4
(PARTITION H1,
PARTITION H2,
PARTITION H3,
PARTITION H4
);
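
A rough Python sketch of the idea, assuming the hashing function is a simple modulus of the integer key and that partitions are numbered in the order they are listed (the helper name hash_partition is hypothetical):

```python
from collections import Counter

PARTITIONS = ["H1", "H2", "H3", "H4"]

def hash_partition(cust_id):
    """Map an integer key to a partition by taking it modulo the partition count."""
    return PARTITIONS[cust_id % len(PARTITIONS)]

# Sequential customer IDs spread evenly across the four partitions.
distribution = Counter(hash_partition(cust_id) for cust_id in range(1, 1001))
```

With evenly distributed keys each partition receives about the same number of rows, which is the point of this method.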

Composite partitioning

If you need finer-grained control of how data is allocated to your partitions, you can employ composite partitioning, which allows you to use two different types of partitioning for the same table. With composite partitioning, the first partitioning method defines the partitions, and the second partitioning method defines the subpartitions.

CREATE TABLE sales
(sale_id INT NOT NULL,
cust_id INT NOT NULL,
store_id INT NOT NULL,
sale_date DATE NOT NULL,
amount DECIMAL(9,2)
)
PARTITION BY RANGE (yearweek(sale_date))
SUBPARTITION BY HASH (cust_id)
(PARTITION s1 VALUES LESS THAN (202002)
(SUBPARTITION s1_h1, SUBPARTITION s1_h2, SUBPARTITION s1_h3, SUBPARTITION s1_h4),
PARTITION s2 VALUES LESS THAN (202003)
(SUBPARTITION s2_h1, SUBPARTITION s2_h2, SUBPARTITION s2_h3, SUBPARTITION s2_h4),
PARTITION s3 VALUES LESS THAN (202004)
(SUBPARTITION s3_h1, SUBPARTITION s3_h2, SUBPARTITION s3_h3, SUBPARTITION s3_h4),
PARTITION s4 VALUES LESS THAN (202005)
(SUBPARTITION s4_h1, SUBPARTITION s4_h2, SUBPARTITION s4_h3, SUBPARTITION s4_h4),
PARTITION s5 VALUES LESS THAN (202006)
(SUBPARTITION s5_h1, SUBPARTITION s5_h2, SUBPARTITION s5_h3, SUBPARTITION s5_h4),
PARTITION s999 VALUES LESS THAN (MAXVALUE)
(SUBPARTITION s999_h1, SUBPARTITION s999_h2, SUBPARTITION s999_h3, SUBPARTITION s999_h4)
);

Query a single partition or subpartition by name:

SELECT *
FROM sales PARTITION (s3);

SELECT *
FROM sales PARTITION (s3_h3);

Partitioning Benefits

One major advantage to partitioning is that you may only need to interact with as few as one partition, rather than the entire table. Skipping partitions that cannot contain the requested rows is known as partition pruning.

If you execute a query that includes a join to a partitioned table and the query includes a condition on the partitioning column, the server can exclude any partitions that do not contain data pertinent to the query. This is known as a partition-wise join, and it is similar to partition pruning in that only those partitions that contain data needed by the query will be included.
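
The pruning decision can be sketched in Python for the range-partitioned sales table defined earlier. This is an illustration under simplifying assumptions: the condition is expressed directly as an inclusive range of partition key values, and the helper name partitions_to_scan is hypothetical. Whether a real server can prune for a given expression depends on the database.

```python
from bisect import bisect_right

# Bounds and names copied from the range-partitioned sales table.
BOUNDS = [202002, 202003, 202004, 202005, 202006]
NAMES = ["s1", "s2", "s3", "s4", "s5", "s999"]

def partitions_to_scan(low, high):
    """Partitions that can hold key values in the inclusive range [low, high]."""
    first = bisect_right(BOUNDS, low)
    last = bisect_right(BOUNDS, high)
    return NAMES[first:last + 1]
```

A condition covering a single week touches one partition, while an unbounded condition still has to visit all six.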

From an administrative standpoint, one of the main benefits to partitioning is the ability to quickly delete data that is no longer needed.

Another administrative advantage to partitioned tables is the ability to perform updates on multiple partitions simultaneously, which can greatly reduce the time needed to touch every row in a table.

Clustering

Clustering allows multiple servers to act as a single database.

Shared-disk/shared-cache configurations: every server in the cluster has access to all disks, and data cached in one server can be accessed by any other server in the cluster. With this type of architecture, an application server could attach to any one of the database servers in the cluster, with connections automatically failing over to another server in the cluster in case of failure.

Of the commercial database vendors, Oracle is the leader in this space, with many of the world’s biggest companies using the Oracle Exadata platform to host extremely large databases accessed by thousands of concurrent users. However, even this platform fails to meet the needs of the biggest companies, which led Google, Facebook, Amazon, and other companies to blaze new trails.

Sharding

Sharding partitions the data across multiple databases (called shards), so it is similar to table partitioning but on a larger scale and with far more complexity. If you were to employ this strategy for a social media company with roughly one billion users, you might decide to implement 100 separate databases, each one hosting the data for approximately 10 million users.

You will need to choose a sharding key, which is the value used to determine to which database to connect.
While large tables will be divided into pieces, with individual rows assigned to a single shard, smaller reference tables may need to be replicated to all shards, and a strategy needs to be defined for how reference data can be modified and changes propagated to all shards.
If individual shards become too large (e.g., the social media company now has two billion users), you will need a plan for adding more shards and redistributing data across the shards.
When you need to make schema changes, you will need to have a strategy for deploying the changes across all of the shards so that all schemas stay in sync.
If application logic needs to access data stored in two or more shards, you need to have a strategy for how to query across multiple databases and also how to implement transactions across multiple databases.
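
Two of these points, choosing a sharding key and redistributing data when shards are added, can be illustrated with a small Python sketch. It assumes the sharding key is a numeric user ID routed by modulus, which is only one of several possible schemes; the names shard_for and shard_name are hypothetical.

```python
def shard_for(user_id, num_shards=100):
    """Pick a shard number from the sharding key using a simple modulus."""
    return user_id % num_shards

def shard_name(user_id, num_shards=100):
    """Name of the database a connection for this user would be routed to."""
    return f"shard_{shard_for(user_id, num_shards):03d}"

# How many of 10,000 sample users would change shards if the shard count doubled?
moved = sum(1 for uid in range(10_000) if shard_for(uid, 100) != shard_for(uid, 200))
```

With plain modulus routing, doubling the shard count relocates half of the sample users, which is why a redistribution plan is needed before shards fill up.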

Big Data

One way to define the boundaries of big data is with the “3 Vs”:

Volume

In this context, volume generally means billions or trillions of data points.

Velocity

This is a measure of how quickly data arrives.

Variety

This means that data is not always structured (as in rows and columns in a relational database) but can also be unstructured (e.g., emails, videos, photos, audio files, etc.).

So, one way to characterize big data is any system designed to handle a huge amount of data of various formats arriving at a rapid pace.

Hadoop

Hadoop is best described as an ecosystem, or a set of technologies and tools that work together. Some of the major components of Hadoop include:

Hadoop Distributed File System (HDFS)

Like the name implies, HDFS enables file management across a large number of servers.

MapReduce

This technology processes large amounts of structured and unstructured data by breaking a task into many small pieces that can be run in parallel across many servers.

YARN

This is a resource manager and job scheduler for Hadoop clusters.

Together, these technologies allow for the storage and processing of files across hundreds or even thousands of servers acting as a single logical system. While Hadoop is widely used, querying the data using MapReduce generally requires a programmer, which has led to the development of several SQL interfaces, including Hive, Impala, and Drill.
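
To give a feel for why MapReduce-style querying needs a programmer, here is a single-process Python sketch of the map, shuffle, and reduce steps for a word count. It only mimics the programming model; nothing here runs on Hadoop, and the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

# Each chunk could be processed on a different server in a real cluster.
chunks = ["big data big", "data lake"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
```

The equivalent in SQL would be a one-line group by, which is the motivation for the SQL interfaces mentioned above.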

NoSQL and Document Databases

What happens, however, if the structure of the data isn’t known beforehand or if the structure is known but changes frequently? The answer for many companies is to combine both the data and schema definition into documents using a format such as XML or JSON and then store the documents in a database. By doing so, various types of data can be stored in the same database without the need to make schema modifications, which makes storage easier but puts the burden on query and analytic tools to make sense of the data stored in the documents.

Document databases are a subset of what are called NoSQL databases, which typically store data using a simple key-value mechanism. For example, using a document database such as MongoDB, you could utilize the customer ID as the key to store a JSON document containing all of the customer’s data, and other users can read the schema stored within the document to make sense of the data stored within.
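
The key-value idea can be sketched with a plain Python dictionary standing in for the database. This is a toy model, not MongoDB's API; the put and get helpers and the sample documents are hypothetical.

```python
import json

store = {}  # customer ID -> JSON text

def put(customer_id, document):
    """Store a document as JSON text under the customer ID."""
    store[customer_id] = json.dumps(document)

def get(customer_id):
    """Load the document back; its structure travels with the data."""
    return json.loads(store[customer_id])

# Two documents with different structures share one store, with no schema change.
put(1, {"first_name": "ANN", "rentals": [{"film": "ALPHA"}]})
put(2, {"first_name": "BOB", "email": "bob@example.com"})

# A reader discovers each document's fields by inspecting the document itself.
fields = {customer_id: sorted(get(customer_id)) for customer_id in store}
```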

Cloud Computing

Prior to the advent of big data, most companies had to build their own data centers to house the database, web, and application servers used across the enterprise. With the advent of cloud computing, you can choose to essentially outsource your data center to platforms such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud. One of the biggest benefits to hosting your services in the cloud is instant scalability, which allows you to quickly dial up or down the amount of computing power needed to run your services. Startups love these platforms because they can start writing code without spending any money up front for servers, storage, networks, or software licenses.

As far as databases are concerned, a quick look at AWS’s database and analytics offerings yields the following options:

Relational databases (MySQL, Aurora, PostgreSQL, MariaDB, Oracle, and SQL Server)
In-memory database (ElastiCache)
Data warehousing database (Redshift)
NoSQL database (DynamoDB)
Document database (DocumentDB)
Graph database (Neptune)
Time-series database (Timestream)
Hadoop (EMR)
Data lakes (Lake Formation)

While relational databases dominated the landscape up until the mid-2000s, it’s pretty easy to see that companies are now mixing and matching various platforms and that relational databases may become less popular over time.

Conclusion

Databases are getting larger, but at the same time storage, clustering, and partitioning technologies are becoming more robust. Working with huge amounts of data can be quite challenging, regardless of the technology stack. Whether you use relational databases, big data platforms, or a variety of database servers, SQL is evolving to facilitate data retrieval from various technologies.

Learning SQL Notes #14: Analytic Functions

Fri, 11 Jun 2021 01:00:00 +0000

Analytic Function Concepts
- Data Windows
- Localized Sorting
Ranking
- Ranking Functions
- Generating Multiple Rankings
Reporting Functions

Analytic Function Concepts

Data Windows

SELECT quarter(payment_date) quarter,
monthname(payment_date) month_nm,
sum(amount) monthly_sales,
/* window is the entire result set left after where and group by: highest monthly total in 2005 */
max(sum(amount))
over () max_overall_sales,
/* window is one quarter of that result set: highest monthly total in each quarter of 2005 */
max(sum(amount))
over (partition by quarter(payment_date)) max_qrtr_sales
FROM payment
WHERE year(payment_date) = 2005
GROUP BY quarter(payment_date), monthname(payment_date);

The analytic functions used to generate these additional columns group rows into two different sets: one set containing all rows in the same quarter and another set containing all of the rows. To accommodate this type of analysis, analytic functions include the ability to group rows into windows, which effectively partition the data for use by the analytic function without changing the overall result set. Windows are defined using the over clause combined with an optional partition by subclause. In the previous query, both analytic functions include an over clause, but the first one is empty, indicating that the window should include the entire result set, whereas the second one specifies that the window should include only rows within the same quarter. Data windows may contain anywhere from a single row to all of the rows in the result set, and different analytic functions can define different data windows.
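
The two windows can be tried out with Python's sqlite3 module, assuming the bundled SQLite is version 3.25 or later (the first release with window functions). The monthly table and its figures are invented for the example and stand in for the grouped payment data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly (qtr INTEGER, month_nm TEXT, monthly_sales REAL)")
conn.executemany(
    "INSERT INTO monthly VALUES (?, ?, ?)",
    [(2, "May", 100.0), (2, "June", 300.0), (3, "July", 500.0), (3, "August", 400.0)],
)

rows = conn.execute("""
    SELECT qtr, month_nm, monthly_sales,
           max(monthly_sales) OVER () AS max_overall_sales,
           max(monthly_sales) OVER (PARTITION BY qtr) AS max_qrtr_sales
    FROM monthly
    ORDER BY qtr, monthly_sales
""").fetchall()
```

All four input rows come back, each carrying both the overall maximum and the maximum for its own quarter.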

Localized Sorting

SELECT quarter(payment_date) quarter,
monthname(payment_date) month_nm,
sum(amount) monthly_sales,
rank() over (order by sum(amount) desc) sales_rank /* order by only controls the rank()*/
FROM payment
WHERE year(payment_date) = 2005
GROUP BY quarter(payment_date), monthname(payment_date)
ORDER BY 1, month(payment_date);/* order by only controls the presentation*/

Alternatively, add partition by quarter(payment_date) inside the over clause to rank the months within each quarter.

Ranking

Ranking Functions

There are multiple ranking functions available in the SQL standard, with each one taking a different approach to how ties are handled:

row_number

Returns a unique number for each row, with rankings arbitrarily assigned in case of a tie

rank

Returns the same ranking in case of a tie, with gaps in the rankings

dense_rank

Returns the same ranking in case of a tie, with no gaps in the rankings

SELECT customer_id, count(*) num_rentals,
row_number() over (order by count(*) desc) row_number_rnk,
rank() over (order by count(*) desc) rank_rnk,
dense_rank() over (order by count(*) desc) dense_rank_rnk
FROM rental
GROUP BY customer_id
ORDER BY 2 desc;

customer_id	num_rentals	row_number_rnk	rank_rnk	dense_rank_rnk
144	42	3	3	3
236	42	4	3	3
75	41	5	5	4

To get back to the original request, how would you identify the top 10 customers? There are three possible solutions:

Use the row_number function to identify customers ranked from 1 to 10, which results in exactly 10 customers in this example, but in other cases might exclude customers having the same number of rentals as the 10th ranked customer.
Use the rank function to identify customers ranked 10 or less, which also results in exactly 10 customers.
Use the dense_rank function to identify customers ranked 10 or less, which yields a list of 37 customers.
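
The difference between the three functions on ties is easy to reproduce with Python's sqlite3 module, again assuming SQLite 3.25 or later. The table and counts below are made up, and customer_id is added as a tiebreaker to row_number only so that the output is repeatable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust (customer_id INTEGER, num_rentals INTEGER)")
conn.executemany(
    "INSERT INTO cust VALUES (?, ?)",
    [(1, 46), (2, 45), (3, 42), (4, 42), (5, 41)],
)

rows = conn.execute("""
    SELECT customer_id, num_rentals,
           row_number() OVER (ORDER BY num_rentals DESC, customer_id) AS row_number_rnk,
           rank()       OVER (ORDER BY num_rentals DESC) AS rank_rnk,
           dense_rank() OVER (ORDER BY num_rentals DESC) AS dense_rank_rnk
    FROM cust
    ORDER BY num_rentals DESC, customer_id
""").fetchall()
```

Customers 3 and 4 tie at 42 rentals: row_number gives them 3 and 4, while rank and dense_rank both give 3. The next customer is ranked 5 by rank (a gap) and 4 by dense_rank (no gap).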

Generating Multiple Rankings

SELECT customer_id,
monthname(rental_date) rental_month,
count(*) num_rentals,
rank() over (partition by monthname(rental_date) 
order by count(*) desc) rank_rnk
FROM rental
GROUP BY customer_id, monthname(rental_date)
ORDER BY 2, 3 desc;

Because of the partition by clause, the rankings are reset to 1 for each month. To generate the desired results for the marketing department (top five customers from each month), wrap the previous query in a subquery and add a filter condition to exclude any rows with a ranking higher than five:

SELECT customer_id, rental_month, num_rentals, rank_rnk ranking
FROM
(SELECT customer_id,
monthname(rental_date) rental_month, count(*) num_rentals,
rank() over (partition by monthname(rental_date) order by count(*) desc) rank_rnk
FROM rental
GROUP BY customer_id, monthname(rental_date)
) cust_rankings
WHERE rank_rnk <= 5
ORDER BY rental_month, num_rentals desc, rank_rnk;

Since analytic functions can be used only in the SELECT list (and the ORDER BY clause), you will often need to nest queries if you need to do any filtering or grouping based on the results from the analytic function.

Window Function	Return Type	Description
CUME_DIST()	DOUBLE PRECISION	The CUME_DIST() window function calculates the relative rank of the current row within a window partition: (number of rows preceding or peer with current row) / (total rows in the window partition)
DENSE_RANK()	BIGINT	The DENSE_RANK () window function determines the rank of a value in a group of values based on the ORDER BY expression and the OVER clause. Each value is ranked within its partition. Rows with equal values receive the same rank. There are no gaps in the sequence of ranked values if two or more rows have the same rank.
NTILE()	INTEGER	The NTILE window function divides the rows for each window partition, as equally as possible, into a specified number of ranked groups. The NTILE window function requires the ORDER BY clause in the OVER clause.
PERCENT_RANK()	DOUBLE PRECISION	The PERCENT_RANK () window function calculates the percent rank of the current row using the following formula: (x - 1) / (number of rows in window partition - 1) where x is the rank of the current row.
RANK()	BIGINT	The RANK window function determines the rank of a value in a group of values. The ORDER BY expression in the OVER clause determines the value. Each value is ranked within its partition. Rows with equal values for the ranking criteria receive the same rank. Drill adds the number of tied rows to the tied rank to calculate the next rank and thus the ranks might not be consecutive numbers. For example, if two rows are ranked 1, the next rank is 3. The DENSE_RANK window function differs in that no gaps exist if two or more rows tie.
ROW_NUMBER()	BIGINT	The ROW_NUMBER window function determines the ordinal number of the current row within its partition. The ORDER BY expression in the OVER clause determines the number. Each value is ordered within its partition. Rows with equal values for the ORDER BY expressions receive different row numbers nondeterministically.
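
The two formulas in the table can be worked through in plain Python. This sketch assumes an ascending order by over a single partition, and returning 0.0 for a one-row partition is a choice made here to avoid dividing by zero; the function names simply mirror the SQL ones.

```python
def percent_rank(values, x):
    """(rank - 1) / (rows - 1), where rank is 1 plus the rows ordered before x."""
    if len(values) < 2:
        return 0.0
    rank = 1 + sum(1 for v in values if v < x)
    return (rank - 1) / (len(values) - 1)

def cume_dist(values, x):
    """(rows preceding or peer with x) / (total rows in the partition)."""
    return sum(1 for v in values if v <= x) / len(values)

scores = [10, 20, 20, 30]
```

For the tied value 20, cume_dist counts three of the four rows (0.75), while percent_rank uses its rank of 2 and gives 1/3.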

Reporting Functions

Calculate the monthly totals and the grand total:

SELECT monthname(payment_date) payment_month,
amount,
sum(amount) over (partition by monthname(payment_date)) monthly_total,
sum(amount) over () grand_total 
FROM payment
WHERE amount >= 10
ORDER BY 1;

payment_month	amount	monthly_total	grand_total
August	10.99	521.53	1262.86
August	11.99	521.53	1262.86

Calculate percentage:

SELECT monthname(payment_date) payment_month,
sum(amount) month_total,
round(sum(amount) / sum(sum(amount)) over () * 100, 2) pct_of_total
FROM payment
GROUP BY monthname(payment_date);

payment_month	month_total	pct_of_total
May	4824.43	7.16
June	9631.88	14.29
July	28373.89	42.09
August	24072.13	35.71
February	514.18	0.76

Quasi-ranking functions:

SELECT monthname(payment_date) payment_month,
sum(amount) month_total,
CASE sum(amount)
WHEN max(sum(amount)) over () THEN 'Highest'
WHEN min(sum(amount)) over () THEN 'Lowest'
ELSE 'Middle'
END descriptor
FROM payment
GROUP BY monthname(payment_date);

payment_month	month_total	descriptor
May	4824.43	Middle
June	9631.88	Middle
July	28373.89	Highest
August	24072.13	Middle
February	514.18	Lowest

Window Frames

SELECT yearweek(payment_date) payment_week,
sum(amount) week_total,
sum(sum(amount))
over (order by yearweek(payment_date)
rows unbounded preceding) rolling_sum
FROM payment
GROUP BY yearweek(payment_date)
ORDER BY 1;

SELECT yearweek(payment_date) payment_week,
sum(amount) week_total,
avg(sum(amount))
over (order by yearweek(payment_date)
rows between 1 preceding and 1 following) rolling_3wk_avg
FROM payment
GROUP BY yearweek(payment_date)
ORDER BY 1;

SELECT date(payment_date), sum(amount),
avg(sum(amount))
over (order by date(payment_date)
range between interval 3 day preceding and interval 3 day following) rolling_7_day_avg
FROM payment
WHERE payment_date BETWEEN '2005-07-01' AND '2005-09-01'
GROUP BY date(payment_date)
ORDER BY 1;
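
What the first two frames compute can be shown with a plain Python sketch over a list of weekly totals that is already sorted by week. The numbers are invented and the function names are hypothetical.

```python
def rolling_sum(totals):
    """rows unbounded preceding: sum of the current row and all earlier rows."""
    out, running = [], 0
    for total in totals:
        running += total
        out.append(running)
    return out

def centered_avg(totals):
    """rows between 1 preceding and 1 following; the frame shrinks at the edges."""
    out = []
    for i in range(len(totals)):
        frame = totals[max(0, i - 1): i + 2]
        out.append(sum(frame) / len(frame))
    return out

week_totals = [10, 20, 30, 70]
```

Note that the first and last rows average only two values, because no row exists before the first or after the last.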

Lag and Lead

Window Function	Argument Type	Return Type	Description
LAG()	Any supported Drill data types	Same as the expression type	The LAG() window function returns the value for the row before the current row in a partition. If no row exists, null is returned.
LEAD()	Any supported Drill data types	Same as the expression type	The LEAD() window function returns the value for the row after the current row in a partition. If no row exists, null is returned.
FIRST_VALUE	Any supported Drill data types	Same as the expression type	The FIRST_VALUE window function returns the value of the specified expression with respect to the first row in the window frame.
LAST_VALUE	Any supported Drill data types	Same as the expression type	The LAST_VALUE window function returns the value of the specified expression with respect to the last row in the window frame.

SELECT yearweek(payment_date) payment_week,
sum(amount) week_total,
lag(sum(amount), 1)
over (order by yearweek(payment_date)) prev_wk_tot,
lead(sum(amount), 1)
over (order by yearweek(payment_date)) next_wk_tot
FROM payment
GROUP BY yearweek(payment_date)
ORDER BY 1;

SELECT yearweek(payment_date) payment_week,
sum(amount) week_total,
round((sum(amount) - lag(sum(amount), 1)
over (order by yearweek(payment_date))) / lag(sum(amount), 1)
over (order by yearweek(payment_date)) * 100, 1) pct_diff
FROM payment
GROUP BY yearweek(payment_date)
ORDER BY 1;
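
The percentage calculation in the last query amounts to the following Python sketch, where None plays the role of the null that lag returns for the first week. The weekly totals are invented.

```python
def pct_diff(totals):
    """Percentage change from the previous row, rounded to one decimal place."""
    if not totals:
        return []
    out = [None]  # the first row has no previous row, as with lag()
    for prev, cur in zip(totals, totals[1:]):
        out.append(round((cur - prev) / prev * 100, 1))
    return out

changes = pct_diff([100, 150, 120])
```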

Column Value Concatenation

SELECT f.title,
group_concat(a.last_name order by a.last_name separator ', ') actors
FROM actor a
INNER JOIN film_actor fa
ON a.actor_id = fa.actor_id
INNER JOIN film f
ON fa.film_id = f.film_id
GROUP BY f.title
HAVING count(*) = 3;

Learning SQL Notes #13: Metadata

Thu, 10 Jun 2021 01:00:00 +0000

Data About Data
information_schema
Working with Metadata

A database server also needs to store information about all of the database objects (tables, views, indexes, etc.) that were created to store this data in a database. This chapter discusses how and where this information, known as metadata, is stored, how you can access it, and how you can use it to build flexible systems.

Data About Data

Metadata is essentially data about data. Every time you create a database object, the database server needs to record various pieces of information. For example, if you were to create a table with multiple columns, a primary key constraint, three indexes, and a foreign key constraint, the database server would need to store all the following information:

Table name
Table storage information (tablespace, initial size, etc.)
Storage engine
Column names
Column data types
Default column values
not null column constraints
Primary key columns
Primary key name
Name of primary key index
Index names
Index types (B-tree, bitmap)
Indexed columns
Index column sort order (ascending or descending)
Index storage information
Foreign key name
Foreign key columns
Associated table/columns for foreign keys

This data is collectively known as the data dictionary or system catalog. The database server needs to store this data persistently, and it needs to be able to quickly retrieve this data in order to verify and execute SQL statements. Additionally, the database server must safeguard this data so that it can be modified only via an appropriate mechanism, such as the alter table statement.

Every database server uses a different mechanism to publish metadata, such as:

A set of views, such as Oracle Database’s user_tables and all_constraints views
A set of system-stored procedures, such as SQL Server’s sp_tables procedure or Oracle Database’s dbms_metadata package
A special database, such as MySQL’s information_schema database

information_schema

All of the objects available within the information_schema database (or schema, in the case of SQL Server) are views. Unlike the describe utility, the views within information_schema can be queried and, thus, used programmatically.

Table Name	Description	Introduced	Deprecated
`ADMINISTRABLE_ROLE_AUTHORIZATIONS`	Grantable users or roles for current user or role	8.0.19
`APPLICABLE_ROLES`	Applicable roles for current user	8.0.19
`CHARACTER_SETS`	Available character sets
`CHECK_CONSTRAINTS`	Table and column CHECK constraints	8.0.16
`COLLATION_CHARACTER_SET_APPLICABILITY`	Character set applicable to each collation
`COLLATIONS`	Collations for each character set
`COLUMN_PRIVILEGES`	Privileges defined on columns
`COLUMN_STATISTICS`	Histogram statistics for column values
`COLUMNS`	Columns in each table
`COLUMNS_EXTENSIONS`	Column attributes for primary and secondary storage engines	8.0.21
`CONNECTION_CONTROL_FAILED_LOGIN_ATTEMPTS`	Current number of consecutive failed connection attempts per account
`ENABLED_ROLES`	Roles enabled within current session	8.0.19
`ENGINES`	Storage engine properties
`EVENTS`	Event Manager events
`FILES`	Files that store tablespace data
`INNODB_BUFFER_PAGE`	Pages in InnoDB buffer pool
`INNODB_BUFFER_PAGE_LRU`	LRU ordering of pages in InnoDB buffer pool
`INNODB_BUFFER_POOL_STATS`	InnoDB buffer pool statistics
`INNODB_CACHED_INDEXES`	Number of index pages cached per index in InnoDB buffer pool
`INNODB_CMP`	Status for operations related to compressed InnoDB tables
`INNODB_CMP_PER_INDEX`	Status for operations related to compressed InnoDB tables and indexes
`INNODB_CMP_PER_INDEX_RESET`	Status for operations related to compressed InnoDB tables and indexes
`INNODB_CMP_RESET`	Status for operations related to compressed InnoDB tables
`INNODB_CMPMEM`	Status for compressed pages within InnoDB buffer pool
`INNODB_CMPMEM_RESET`	Status for compressed pages within InnoDB buffer pool
`INNODB_COLUMNS`	Columns in each InnoDB table
`INNODB_DATAFILES`	Data file path information for InnoDB file-per-table and general tablespaces
`INNODB_FIELDS`	Key columns of InnoDB indexes
`INNODB_FOREIGN`	InnoDB foreign-key metadata
`INNODB_FOREIGN_COLS`	InnoDB foreign-key column status information
`INNODB_FT_BEING_DELETED`	Snapshot of INNODB_FT_DELETED table
`INNODB_FT_CONFIG`	Metadata for InnoDB table FULLTEXT index and associated processing
`INNODB_FT_DEFAULT_STOPWORD`	Default list of stopwords for InnoDB FULLTEXT indexes
`INNODB_FT_DELETED`	Rows deleted from InnoDB table FULLTEXT index
`INNODB_FT_INDEX_CACHE`	Token information for newly inserted rows in InnoDB FULLTEXT index
`INNODB_FT_INDEX_TABLE`	Inverted index information for processing text searches against InnoDB table FULLTEXT index
`INNODB_INDEXES`	InnoDB index metadata
`INNODB_METRICS`	InnoDB performance information
`INNODB_SESSION_TEMP_TABLESPACES`	Session temporary-tablespace metadata	8.0.13
`INNODB_TABLES`	InnoDB table metadata
`INNODB_TABLESPACES`	InnoDB file-per-table, general, and undo tablespace metadata
`INNODB_TABLESPACES_BRIEF`	Brief file-per-table, general, undo, and system tablespace metadata
`INNODB_TABLESTATS`	InnoDB table low-level status information
`INNODB_TEMP_TABLE_INFO`	Information about active user-created InnoDB temporary tables
`INNODB_TRX`	Active InnoDB transaction information
`INNODB_VIRTUAL`	InnoDB virtual generated column metadata
`KEY_COLUMN_USAGE`	Which key columns have constraints
`KEYWORDS`	MySQL keywords
`MYSQL_FIREWALL_USERS`	Firewall in-memory data for account profiles		8.0.26
`MYSQL_FIREWALL_WHITELIST`	Firewall in-memory data for account profile allowlists		8.0.26
`ndb_transid_mysql_connection_map`	NDB transaction information
`OPTIMIZER_TRACE`	Information produced by optimizer trace activity
`PARAMETERS`	Stored routine parameters and stored function return values
`PARTITIONS`	Table partition information
`PLUGINS`	Plugin information
`PROCESSLIST`	Information about currently executing threads
`PROFILING`	Statement profiling information
`REFERENTIAL_CONSTRAINTS`	Foreign key information
`RESOURCE_GROUPS`	Resource group information
`ROLE_COLUMN_GRANTS`	Column privileges for roles available to or granted by currently enabled roles	8.0.19
`ROLE_ROUTINE_GRANTS`	Routine privileges for roles available to or granted by currently enabled roles	8.0.19
`ROLE_TABLE_GRANTS`	Table privileges for roles available to or granted by currently enabled roles	8.0.19
`ROUTINES`	Stored routine information
`SCHEMA_PRIVILEGES`	Privileges defined on schemas
`SCHEMATA`	Schema information
`SCHEMATA_EXTENSIONS`	Schema options	8.0.22
`ST_GEOMETRY_COLUMNS`	Columns in each table that store spatial data
`ST_SPATIAL_REFERENCE_SYSTEMS`	Available spatial reference systems
`ST_UNITS_OF_MEASURE`	Acceptable units for ST_Distance()	8.0.14
`STATISTICS`	Table index statistics
`TABLE_CONSTRAINTS`	Which tables have constraints
`TABLE_CONSTRAINTS_EXTENSIONS`	Table constraint attributes for primary and secondary storage engines	8.0.21
`TABLE_PRIVILEGES`	Privileges defined on tables
`TABLES`	Table information
`TABLES_EXTENSIONS`	Table attributes for primary and secondary storage engines	8.0.21
`TABLESPACES`	Tablespace information
`TABLESPACES_EXTENSIONS`	Tablespace attributes for primary storage engines	8.0.21
`TP_THREAD_GROUP_STATE`	Thread pool thread group states
`TP_THREAD_GROUP_STATS`	Thread pool thread group statistics
`TP_THREAD_STATE`	Thread pool thread information
`TRIGGERS`	Trigger information
`USER_ATTRIBUTES`	User comments and attributes	8.0.21
`USER_PRIVILEGES`	Privileges defined globally per user
`VIEW_ROUTINE_USAGE`	Stored functions used in views	8.0.13
`VIEW_TABLE_USAGE`	Tables and views used in views	8.0.13
`VIEWS`	View information
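
Because these objects are views, metadata can be read from application code with ordinary queries. The following is a minimal sketch in Python, assuming SQLite (bundled with Python) as a stand-in for MySQL. SQLite has no information_schema, so the sketch uses its rough analogues: the sqlite_master table and the table_info pragma. The category table is a simplified, hypothetical version of sakila.category.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category ("
    " category_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " last_update TEXT NOT NULL)"
)

# sqlite_master plays roughly the role of information_schema.tables.
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# PRAGMA table_info plays roughly the role of information_schema.columns.
# Each row is (cid, name, type, notnull, dflt_value, pk).
column_names = [row[1] for row in conn.execute("PRAGMA table_info(category)")]

print(table_names)
print(column_names)
```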

Working with Metadata

Schema Generation Scripts

Suppose you need to generate a script that creates the various tables, indexes, views, and so on, that the team has deployed. As an example, the following query builds a script that creates the sakila.category table; it can serve as a template for other tables.

SELECT 'CREATE TABLE category (' create_table_statement
UNION ALL
SELECT cols.txt
FROM
(SELECT concat(' ',column_name, ' ', column_type,
CASE
WHEN is_nullable = 'NO' THEN ' not null'
ELSE ''
END,
CASE
WHEN extra IS NOT NULL AND extra LIKE 'DEFAULT_GENERATED%'
THEN concat(' DEFAULT ',column_default,substr(extra,18))
WHEN extra IS NOT NULL THEN concat(' ', extra)
ELSE ''
END,
',') txt
FROM information_schema.columns
WHERE table_schema = 'sakila' AND table_name = 'category'
ORDER BY ordinal_position
) cols
UNION ALL
SELECT concat(' constraint primary key (')
FROM information_schema.table_constraints
WHERE table_schema = 'sakila' AND table_name = 'category'
AND constraint_type = 'PRIMARY KEY'
UNION ALL
SELECT cols.txt
FROM
(SELECT concat(CASE WHEN ordinal_position > 1 THEN ' ,'
ELSE ' ' END, column_name) txt
FROM information_schema.key_column_usage
WHERE table_schema = 'sakila' AND table_name = 'category'
AND constraint_name = 'PRIMARY'
ORDER BY ordinal_position
) cols
UNION ALL
SELECT ' )'
UNION ALL
SELECT ')';
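
The same idea can be sketched in a procedural language: read the column metadata, then assemble the create table statement in a loop. This is only an illustration, assuming SQLite as a stand-in for MySQL; build_create_table is a hypothetical helper, and the table_info pragma stands in for information_schema.columns.

```python
import sqlite3

def build_create_table(conn, table_name):
    """Assemble a CREATE TABLE statement from SQLite column metadata."""
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    # The table name is interpolated directly, so it must come from a trusted source.
    columns = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    lines = []
    for _cid, name, col_type, notnull, default, _pk in columns:
        line = f"  {name} {col_type}"
        if notnull:
            line += " NOT NULL"
        if default is not None:
            line += f" DEFAULT {default}"
        lines.append(line)
    # pk is the 1-based position of the column within the primary key (0 if none).
    pk_columns = [c[1] for c in sorted(columns, key=lambda c: c[5]) if c[5] > 0]
    if pk_columns:
        lines.append(f"  PRIMARY KEY ({', '.join(pk_columns)})")
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(lines) + "\n)"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category ("
    " category_id INTEGER NOT NULL,"
    " name VARCHAR(25) NOT NULL,"
    " last_update TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,"
    " PRIMARY KEY (category_id))"
)
ddl = build_create_table(conn, "category")
print(ddl)

# Round trip: the generated statement should be accepted by a fresh database.
check = sqlite3.connect(":memory:")
check.execute(ddl)
```

Running the generated statement against an empty database is a cheap way to confirm that the script is at least syntactically valid.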

Deployment Verification

After the deployment scripts have been run, it’s a good idea to run a verification script to ensure that the new schema objects are in place with the appropriate columns, indexes, primary keys, and so forth. Here’s a query that returns the number of columns, number of indexes, and number of primary key constraints (0 or 1) for each table in the Sakila schema:

SELECT tbl.table_name,
(SELECT count(*)
FROM information_schema.columns clm
WHERE clm.table_schema = tbl.table_schema
AND clm.table_name = tbl.table_name) num_columns,
(SELECT count(*)
FROM information_schema.statistics sta
WHERE sta.table_schema = tbl.table_schema
AND sta.table_name = tbl.table_name) num_indexes,
(SELECT count(*)
FROM information_schema.table_constraints tc
WHERE tc.table_schema = tbl.table_schema
AND tc.table_name = tbl.table_name
AND tc.constraint_type = 'PRIMARY KEY') num_primary_keys
FROM information_schema.tables tbl
WHERE tbl.table_schema = 'sakila' AND tbl.table_type = 'BASE TABLE'
ORDER BY 1;

TABLE_NAME	num_columns	num_indexes	num_primary_keys
actor	4	2	1
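
A verification script usually goes one step further and compares the actual counts with the expected ones. Here is a rough sketch, assuming SQLite as a stand-in for MySQL; the expected dictionary and the actual_counts helper are hypothetical. The index count differs from the MySQL output because SQLite does not record a separate index for an integer primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE actor (
        actor_id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        last_update TEXT NOT NULL
    );
    CREATE INDEX idx_actor_last_name ON actor (last_name);
    """
)

# Hypothetical expectations for this deployment: (num_columns, num_indexes).
expected = {"actor": (4, 1)}

def actual_counts(conn, table_name):
    """Count the columns and the indexes recorded in sqlite_master for one table."""
    num_columns = len(conn.execute(f"PRAGMA table_info({table_name})").fetchall())
    num_indexes = conn.execute(
        "SELECT count(*) FROM sqlite_master WHERE type = 'index' AND tbl_name = ?",
        (table_name,),
    ).fetchone()[0]
    return num_columns, num_indexes

# Keep only the tables whose actual counts differ from the expected ones.
mismatches = {
    name: (want, actual_counts(conn, name))
    for name, want in expected.items()
    if actual_counts(conn, name) != want
}
print(mismatches or "schema matches expectations")
```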

Dynamic SQL Generation

Most relational database servers, including SQL Server, Oracle Database, and MySQL, allow SQL statements to be submitted to the server as strings. Submitting strings to a database engine rather than utilizing its SQL interface is generally known as dynamic SQL execution.

Oracle’s PL/SQL language: execute immediate
SQL Server: sp_executesql
MySQL: prepare, execute, deallocate

SET @qry = 'SELECT customer_id, first_name, last_name FROM customer';
PREPARE dynsql1 FROM @qry;
EXECUTE dynsql1;
DEALLOCATE PREPARE dynsql1;
/*conditions can be specified at runtime*/
SET @qry = 'SELECT customer_id, first_name, last_name FROM customer WHERE customer_id = ?';
PREPARE dynsql2 FROM @qry;
SET @custid = 9;
EXECUTE dynsql2 USING @custid;
SET @custid = 145;
EXECUTE dynsql2 USING @custid;
DEALLOCATE PREPARE dynsql2;

Or you can do the following:

SELECT concat('SELECT ', concat_ws(',', cols.col1, cols.col2),
' FROM customer WHERE customer_id = ?')
INTO @qry 
FROM (SELECT
max(CASE WHEN ordinal_position = 1 THEN column_name
ELSE NULL END) col1,
max(CASE WHEN ordinal_position = 2 THEN column_name
ELSE NULL END) col2
FROM information_schema.columns
WHERE table_schema = 'sakila' AND table_name = 'customer'
GROUP BY table_name
) cols;

PREPARE dynsql3 FROM @qry;
SET @custid = 45;
EXECUTE dynsql3 USING @custid;
DEALLOCATE PREPARE dynsql3;

Note: Generally, it would be better to generate the query using a procedural language that includes looping constructs, such as Java, PL/SQL, Transact-SQL, or MySQL’s Stored Procedure Language.
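
As a rough illustration of that note, the sketch below builds the same kind of query in Python. It assumes SQLite as a stand-in for MySQL, so the column names come from the table_info pragma rather than information_schema.columns, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " customer_id INTEGER PRIMARY KEY,"
    " first_name TEXT,"
    " last_name TEXT,"
    " email TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [
        (9, "MARGARET", "MOORE", "margaret.moore@example.org"),
        (45, "JANET", "PHILLIPS", "janet.phillips@example.org"),
    ],
)

# Read the column names from metadata, then keep the first three.
all_columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]
selected = all_columns[:3]

# Identifiers are assembled as text; the customer_id value stays a bound
# parameter, like the ? placeholder in the prepared statements above.
qry = f"SELECT {', '.join(selected)} FROM customer WHERE customer_id = ?"

row_9 = conn.execute(qry, (9,)).fetchone()
row_45 = conn.execute(qry, (45,)).fetchone()
print(qry)
print(row_9, row_45)
```

Keeping the value as a bound parameter means the query text can be reused for different customers, and user-supplied values never become part of the SQL string.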

Learning SQL Notes #12: Views

Wed, 09 Jun 2021 01:00:00 +0000

Well-designed applications generally expose a public interface while keeping implementation details private, thereby enabling future design changes without impacting end users. When designing your database, you can achieve a similar result by keeping your tables private and allowing your users to access data only through a set of views.

What Are Views?

CREATE VIEW customer_vw
(customer_id,
first_name,
last_name,
email
)
AS
SELECT customer_id,
first_name,
last_name,
concat(substr(email,1,2), '*****', substr(email, -4)) email
FROM customer;
/*view the View*/
describe customer_vw;
/*group by, having, where, join etc. can also be used*/
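
To see the masking view at work end to end, here is a small sketch in Python, assuming SQLite as a stand-in for MySQL. SQLite concatenates with || where MySQL uses concat(), and the single customer row is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name TEXT,
        email TEXT
    );
    INSERT INTO customer VALUES (1, 'MARY', 'SMITH', 'mary.smith@example.org');

    -- Only the first two and last four characters of the email are exposed.
    CREATE VIEW customer_vw AS
    SELECT customer_id,
           first_name,
           last_name,
           substr(email, 1, 2) || '*****' || substr(email, -4) AS email
    FROM customer;
    """
)

# Queries against the view never see the full address.
masked = conn.execute(
    "SELECT email FROM customer_vw WHERE customer_id = 1"
).fetchone()[0]
print(masked)
```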

Why Use Views?

Data Security

Oracle Database users have another option for securing both rows and columns of a table: Virtual Private Database (VPD). VPD allows you to attach policies to your tables, after which the server will modify a user’s query as necessary to enforce the policies.

Data Aggregation

CREATE VIEW sales_by_film_category AS
SELECT c.name AS category,
SUM(p.amount) AS total_sales
FROM payment AS p
INNER JOIN rental AS r
ON p.rental_id = r.rental_id
INNER JOIN inventory AS i
ON r.inventory_id = i.inventory_id
INNER JOIN film AS f
ON i.film_id = f.film_id
INNER JOIN film_category AS fc
ON f.film_id = fc.film_id
INNER JOIN category AS c
ON fc.category_id = c.category_id
GROUP BY c.name
ORDER BY total_sales DESC;

You have great flexibility! You can create a film_category_sales table, load it with aggregated data, and modify the sales_by_film_category view definition to retrieve data from this table if this improves the performance significantly.

Hiding Complexity

One of the most common reasons for deploying views is to shield end users from complexity.

CREATE VIEW film_stats AS
SELECT f.film_id, f.title, f.description, f.rating,
(SELECT c.name
FROM category c
INNER JOIN film_category fc
ON c.category_id = fc.category_id
WHERE fc.film_id = f.film_id) category_name,
(SELECT count(*)
FROM film_actor fa
WHERE fa.film_id = f.film_id ) num_actors,
(SELECT count(*)
FROM inventory i
WHERE i.film_id = f.film_id ) inventory_cnt,
(SELECT count(*)
FROM inventory i
INNER JOIN rental r
ON i.inventory_id = r.inventory_id
WHERE i.film_id = f.film_id ) num_rentals
FROM film f;

If someone uses this view but does not reference the category_name, num_actors, inventory_cnt, or num_rentals column, then none of the subqueries will be executed. This approach allows the view to be used for supplying descriptive information from the film table without unnecessarily joining five other tables.

Joining Partitioned Data

Some database designs break large tables into multiple pieces in order to improve performance. For example, if the payment table became large, the designers may decide to break it into two tables: payment_current, which holds the latest six months of data, and payment_historic, which holds all data up to six months ago. You can make it look like all payment data is stored in a single table.
```
CREATE VIEW payment_all
(payment_id,
customer_id,
staff_id,
rental_id,
amount,
payment_date,
last_update
) AS
SELECT payment_id, customer_id, staff_id, rental_id, amount, payment_date, last_update
FROM payment_historic
UNION ALL
SELECT payment_id, customer_id, staff_id, rental_id, amount, payment_date, last_update
FROM payment_current;
```
Using a view in this case is a good idea because it allows the designers to change the structure of the underlying data without the need to force all database users to modify their queries.
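
A small sketch of this pattern in Python, assuming SQLite as a stand-in for MySQL. The two tables are cut down to three columns and filled with made-up rows; the point is only that queries against payment_all do not need to know about the split.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE payment_historic (
        payment_id INTEGER PRIMARY KEY, amount REAL, payment_date TEXT);
    CREATE TABLE payment_current (
        payment_id INTEGER PRIMARY KEY, amount REAL, payment_date TEXT);

    INSERT INTO payment_historic VALUES (1, 2.5, '2005-05-25');
    INSERT INTO payment_historic VALUES (2, 4.0, '2005-06-15');
    INSERT INTO payment_current VALUES (3, 1.5, '2006-02-14');

    CREATE VIEW payment_all AS
    SELECT payment_id, amount, payment_date FROM payment_historic
    UNION ALL
    SELECT payment_id, amount, payment_date FROM payment_current;
    """
)

# The caller queries one object and gets rows from both underlying tables.
total_rows, total_amount = conn.execute(
    "SELECT count(*), sum(amount) FROM payment_all"
).fetchone()
print(total_rows, total_amount)
```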

Updatable Views

In the case of MySQL, a view is updatable if the following conditions are met:

No aggregate functions are used (max(), min(), avg(), etc.).
The view does not employ group by or having clauses.
No subqueries exist in the select or from clause, and any subqueries in the where clause do not refer to tables in the from clause.
The view does not utilize union, union all, or distinct.
The from clause includes at least one table or updatable view.
The from clause uses only inner joins if there is more than one table or view.

Updating Simple Views

UPDATE customer_vw
SET last_name = 'SMITH-ALLEN'
WHERE customer_id = 1;

Views that contain derived columns cannot be used for inserting data, even if the derived columns are not included in the statement. Likewise, you cannot modify a column that is derived from an expression, such as the email column in customer_vw.

Updating Complex Views

For complex views with more than one table, you are allowed to modify both of the underlying tables separately, but not within a single statement. In order to insert data through a complex view, you would need to know from where each column is sourced. Since many views are created to hide complexity from end users, this seems to defeat the purpose if the users need to have explicit knowledge of the view definition.

Learning SQL Notes #11: Indexes and Constraints

Tue, 08 Jun 2021 05:00:00 +0000

Indexes
Constraints
- Constraint Creation

Indexes

When a row is inserted into a table, the server simply places the data in the next available location within the file (the server maintains a list of free space for each table).

To find all customers whose last name begins with Y, the server must visit each row in the customer table and inspect the contents of the last_name column; if the last name begins with Y, then the row is added to the result set. This type of access is known as a table scan.

An index is simply a mechanism for finding a specific item within a resource. A database server uses indexes to locate rows in a table. Indexes are special tables that, unlike normal data tables, are kept in a specific order. Instead of containing all of the data about an entity, however, an index contains only the column (or columns) used to locate rows in the data table, along with information describing where the rows are physically located. Therefore, the role of indexes is to facilitate the retrieval of a subset of a table’s rows and columns without the need to inspect every row in the table.

Index Creation

/*MySQL*/
ALTER TABLE customer
ADD INDEX idx_email (email);
/*MySQL: remove the index*/
ALTER TABLE customer
DROP INDEX idx_email;
/*SQL Server*/
CREATE INDEX idx_email
ON customer (email);
/*MySQL: list the indexes on a table*/
SHOW INDEX FROM customer \G;

Indexes can also be defined when the table is created:

CREATE TABLE customer (
customer_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
...
PRIMARY KEY (customer_id),
KEY idx_fk_store_id (store_id),
KEY idx_fk_address_id (address_id),
KEY idx_last_name (last_name),
...

Unique indexes

/*MySQL*/
ALTER TABLE customer
ADD UNIQUE INDEX idx_email (email);
/*SQL Server/Oracle Database*/
CREATE UNIQUE INDEX idx_email
ON customer (email);

You should not build unique indexes on your primary key column(s), since the server already checks uniqueness for primary key values.
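
The effect of a unique index is easy to check from code. A minimal sketch in Python, assuming SQLite as a stand-in for MySQL and a made-up email address: the second insert of the same value is rejected by the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_email ON customer (email)")
conn.execute("INSERT INTO customer (email) VALUES ('alan.kahn@example.org')")

try:
    # Same email again: the unique index should reject this row.
    conn.execute("INSERT INTO customer (email) VALUES ('alan.kahn@example.org')")
    duplicate_rejected = False
except sqlite3.IntegrityError as exc:
    duplicate_rejected = True
    print("rejected:", exc)

row_count = conn.execute("SELECT count(*) FROM customer").fetchone()[0]
print(duplicate_rejected, row_count)
```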

Multicolumn indexes

/*MySQL*/
ALTER TABLE customer
ADD INDEX idx_full_name (last_name, first_name);
/*SQL Server/Oracle Database*/
CREATE INDEX idx_full_name
ON customer (last_name, first_name);

Types of Indexes

B-tree indexes

All the indexes shown thus far are balanced-tree indexes, which are more commonly known as B-tree indexes. MySQL, Oracle Database, and SQL Server all default to B-tree indexing.

B-tree indexes are organized as trees, with one or more levels of branch nodes leading to a single level of leaf nodes.
To locate a value, the server looks at the top branch node (called the root node) and follows the link to the branch node that covers the value, continuing down until it reaches a leaf node.
As rows are inserted, updated, and deleted, the server can add or remove branch nodes to redistribute the values more evenly and can even add or remove an entire level of branch nodes.

Bitmap indexes

If a column such as customer.active holds only two different values (stored as 1 for active and 0 for inactive) and there are far more active customers, it can be difficult to maintain a balanced B-tree index as the number of customers grows.

For columns that contain only a small number of values across a large number of rows (known as low-cardinality data), Oracle Database includes bitmap indexes, which generate a bitmap for each value stored in the column.

/*Oracle Database*/
CREATE BITMAP INDEX idx_active ON customer (active);

Bitmap indexes are commonly used in data warehousing environments, where large amounts of data are generally indexed on columns containing relatively few values (e.g., sales quarters, geographic regions, products, salespeople).

Text indexes

How Indexes Are Used

/*MySQL*/
EXPLAIN
SELECT customer_id, first_name, last_name
FROM customer
WHERE first_name LIKE 'S%' AND last_name LIKE 'P%';
/*SQL Server*/
SET SHOWPLAN_TEXT ON
/*Oracle Database*/
explain plan

For this query, the server can employ any of the following strategies:

Scan all rows in the customer table.
Use the index on the last_name column to find all customers whose last name starts with P; then visit each row of the customer table to find only rows whose first name starts with S.
Use the index on the last_name and first_name columns to find all customers whose last name starts with P and whose first name starts with S.

Looking at the query results, the possible_keys column tells you that the server could decide to use either the idx_last_name or the idx_full_name index, and the key column tells you that the idx_full_name index was chosen. Furthermore, the type column tells you that a range scan will be utilized, meaning that the database server will be looking for a range of values in the index, rather than expecting to retrieve a single row.
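
The switch from a table scan to an index lookup can be observed directly. The sketch below is in Python and assumes SQLite as a stand-in for MySQL, so it uses SQLite's EXPLAIN QUERY PLAN rather than MySQL's EXPLAIN, and an equality condition instead of LIKE to keep the example simple. The plan_details helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " customer_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)

query = "SELECT customer_id, first_name, last_name FROM customer WHERE last_name = ?"

def plan_details(conn):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query, ("PARKER",)).fetchall()
    return [row[-1] for row in rows]

plan_before = plan_details(conn)  # no usable index yet: a table scan is expected
conn.execute("CREATE INDEX idx_last_name ON customer (last_name)")
plan_after = plan_details(conn)   # the new index should now appear in the plan

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, so it is safer to look for the index name than to compare whole strings.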

The Downside of Indexes

Every index is a table (a special type of table but still a table). Therefore, every time a row is added to or removed from a table, all indexes on that table must be modified. When a row is updated, any indexes on the column or columns that were affected need to be modified as well. Therefore, the more indexes you have, the more work the server needs to do to keep all schema objects up-to-date, which tends to slow things down.

Indexes also require disk space as well as some amount of care from your administrators, so the best strategy is to add an index when a clear need arises. If you need an index for only special purposes, such as a monthly maintenance routine, you can always add the index, run the routine, and then drop the index until you need it again. In the case of data warehouses, where indexes are crucial during business hours as users run reports and ad hoc queries but are problematic when data is being loaded into the warehouse overnight, it is a common practice to drop the indexes before data is loaded and then re-create them before the warehouse opens for business.

In general, you should strive to have neither too many indexes nor too few. If you aren’t sure how many indexes you should have, you can use this strategy as a default:

Make sure all primary key columns are indexed (most servers automatically create unique indexes when you create primary key constraints). For multicolumn primary keys, consider building additional indexes on a subset of the primary key columns or on all the primary key columns but in a different order than the primary key constraint definition.
Build indexes on all columns that are referenced in foreign key constraints. Keep in mind that the server checks to make sure there are no child rows when a parent is deleted, so it must issue a query to search for a particular value in the column. If there’s no index on the column, the entire table must be scanned.
Index any columns that will frequently be used to retrieve data. Most date columns are good candidates, along with short (2- to 50-character) string columns.

Constraints

A constraint is simply a restriction placed on one or more columns of a table. There are several different types of constraints, including:

Primary key constraints: Identify the column or columns that guarantee uniqueness within a table

Foreign key constraints: Restrict one or more columns to contain only values found in another table’s primary key columns (may also restrict the allowable values in other tables if update cascade or delete cascade rules are established)

Unique constraints: Restrict one or more columns to contain unique values within a table (primary key constraints are a special type of unique constraint)

Check constraints: Restrict the allowable values for a column

If the server allows you to change a customer’s ID in the customer table without changing the same customer ID in the rental table, then you will end up with rental data that no longer points to valid customer records (known as orphaned rows). With primary and foreign key constraints in place, however, the server will either raise an error if an attempt is made to modify or delete data that is referenced by other tables or propagate the changes to other tables for you.

Note: If you want to use foreign key constraints with the MySQL server, you must use the InnoDB storage engine for your tables.

Constraint Creation

CREATE TABLE customer (
...
PRIMARY KEY (customer_id), 
KEY idx_fk_store_id (store_id),
KEY idx_fk_address_id (address_id),
KEY idx_last_name (last_name),
CONSTRAINT fk_customer_address FOREIGN KEY (address_id) REFERENCES address (address_id) ON DELETE RESTRICT ON UPDATE CASCADE,
CONSTRAINT fk_customer_store FOREIGN KEY (store_id) REFERENCES store (store_id) ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*For existing tables, you can add the constraints with alter table*/
ALTER TABLE customer
ADD CONSTRAINT fk_customer_address FOREIGN KEY (address_id)
REFERENCES address (address_id) ON DELETE RESTRICT ON UPDATE CASCADE;
ALTER TABLE customer
ADD CONSTRAINT fk_customer_store FOREIGN KEY (store_id)
REFERENCES store (store_id) ON DELETE RESTRICT ON UPDATE CASCADE;
/*if you want to drop them*/
ALTER TABLE customer
DROP CONSTRAINT fk_customer_address;
ALTER TABLE customer
DROP CONSTRAINT fk_customer_store;

The foreign key definitions include two on clauses:
on delete restrict, which causes the server to raise an error if a row that is referenced in the child table (customer) is deleted from the parent table (address or store)
on update cascade, which causes the server to propagate a change to the primary key value of a parent table (address or store) to the child table (customer)

Parameter	Description
`ON DELETE NO ACTION`	Default action. If there are any existing references to the key being deleted, the transaction will fail at the end of the statement. The key can be updated, depending on the `ON UPDATE` action. Alias: `ON DELETE RESTRICT`
`ON UPDATE NO ACTION`	Default action. If there are any existing references to the key being updated, the transaction will fail at the end of the statement. The key can be deleted, depending on the `ON DELETE` action. Alias: `ON UPDATE RESTRICT`
`ON DELETE RESTRICT` / `ON UPDATE RESTRICT`	`RESTRICT` and `NO ACTION` are currently equivalent until options for deferring constraint checking are added. To set an existing foreign key action to `RESTRICT`, the foreign key constraint must be dropped and recreated.
`ON DELETE CASCADE` / `ON UPDATE CASCADE`	When a referenced foreign key is deleted or updated, all rows referencing that key are deleted or updated, respectively. If there are other alterations to the row, such as a `SET NULL` or `SET DEFAULT`, the delete will take precedence. Note that `CASCADE` does not list objects it drops or updates, so it should be used cautiously.
`ON DELETE SET NULL` / `ON UPDATE SET NULL`	When a referenced foreign key is deleted or updated, respectively, the columns of all rows referencing that key will be set to `NULL`. The column must allow `NULL` or this update will fail.
`ON DELETE SET DEFAULT` / `ON UPDATE SET DEFAULT`	When a referenced foreign key is deleted or updated, the columns of all rows referencing that key are set to the default value for that column. If the default value for the column is null, or if no default value is provided and the column does not have a `NOT NULL` constraint, this will have the same effect as `ON DELETE SET NULL` or `ON UPDATE SET NULL`. The default value must still conform with all other constraints, such as `UNIQUE`.
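
The behavior of on delete restrict and on update cascade can be tried out with a small sketch. It is written in Python and assumes SQLite as a stand-in for MySQL. SQLite enforces foreign keys only after PRAGMA foreign_keys = ON, and the tables are cut down to the key columns with made-up values.

```python
import sqlite3

# isolation_level=None: every statement is committed as soon as it finishes.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript(
    """
    CREATE TABLE address (address_id INTEGER PRIMARY KEY);
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        address_id INTEGER NOT NULL,
        CONSTRAINT fk_customer_address FOREIGN KEY (address_id)
            REFERENCES address (address_id) ON DELETE RESTRICT ON UPDATE CASCADE
    );
    INSERT INTO address VALUES (123);
    INSERT INTO customer VALUES (1, 123);
    """
)

# on delete restrict: removing a referenced parent row raises an error.
try:
    conn.execute("DELETE FROM address WHERE address_id = 123")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True

# on update cascade: changing the parent key propagates to the child row.
conn.execute("UPDATE address SET address_id = 9999 WHERE address_id = 123")
child_address_id = conn.execute(
    "SELECT address_id FROM customer WHERE customer_id = 1"
).fetchone()[0]
print(delete_blocked, child_address_id)
```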

Learning SQL Notes #10: Transactions

Tue, 08 Jun 2021 01:00:00 +0000

Multiuser Databases
- Locking
- Lock Granularities
What Is a Transaction?

Transactions: Mechanism used to group a set of SQL statements together such that either all or none of the statements succeed.

Multiuser Databases

Locking

Locks are the mechanism the database server uses to control simultaneous use of data resources. When some portion of the database is locked, any other users wishing to modify (or possibly read) that data must wait until the lock has been released. Most database servers use one of two locking strategies:

Database writers must request and receive from the server a write lock to modify data, and database readers must request and receive from the server a read lock to query data. While multiple users can read data simultaneously, only one write lock is given out at a time for each table (or portion thereof), and read requests are blocked until the write lock is released. $\Rightarrow$ long wait times if there are many concurrent read and write requests. (Microsoft SQL Server/MySQL)
Database writers must request and receive from the server a write lock to modify data, but readers do not need any type of lock to query data. Instead, the server ensures that a reader sees a consistent view of the data (the data seems the same even though other users may be making modifications) from the time her query begins until her query has finished. This approach is known as versioning. $\Rightarrow$ problematic if there are long-running queries while data is being modified. (Oracle Database/MySQL)

Lock Granularities

Table locks ($\Rightarrow$ less bookkeeping, longer waiting time): Keep multiple users from modifying data in the same table simultaneously

Page locks: Keep multiple users from modifying data on the same page (a page is a segment of memory generally in the range of 2 KB to 16 KB) of a table simultaneously

Row locks ($\Rightarrow$ more bookkeeping, shorter waiting time): Keep multiple users from modifying the same row in a table simultaneously

SQL Server will, under certain circumstances, escalate locks from row to page, and from page to table, whereas Oracle Database will never escalate locks.

What Is a Transaction?

Transactions would be unnecessary in an ideal world, but in practice:

Database servers do not enjoy 100% uptime
Users do not always allow programs to finish executing
Applications do not always complete without encountering fatal errors that halt execution

A transaction is a device for grouping together multiple SQL statements such that either all or none of the statements succeed (a property known as atomicity).

Ex:

If you attempt to transfer $500 from your savings account to your checking account, you would be a bit upset if the money were successfully withdrawn from your savings account but never made it to your checking account. Whatever the reason for the failure (the server was shut down for maintenance, the request for a page lock on the account table timed out, etc.), you want your $500 back. To protect against this kind of error, the program that handles your transfer request would first begin a transaction, then issue the SQL statements needed to move the money from your savings to your checking account, and, if everything succeeds, end the transaction by issuing the commit command. If something unexpected happens, however, the program would issue a rollback command, which instructs the server to undo all changes made since the transaction began.
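
The transfer scenario can be sketched as follows. This is an illustration in Python, assuming SQLite as a stand-in for the database server; the account table, its check constraint, and the transfer helper are all hypothetical. The deposit is issued first on purpose, so that a failed withdrawal leaves something for the rollback to undo.

```python
import sqlite3

# isolation_level=None leaves transaction control to the explicit
# BEGIN / COMMIT / ROLLBACK statements below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('savings', 300), ('checking', 100)")

def transfer(conn, source, target, amount):
    """Move amount between accounts; either both updates apply or neither."""
    conn.execute("BEGIN")
    try:
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE name = ?",
            (amount, target),
        )
        # Fails the CHECK constraint if the source account would go negative.
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE name = ?",
            (amount, source),
        )
        conn.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        # Undo the deposit that already succeeded inside this transaction.
        conn.execute("ROLLBACK")
        return False

ok = transfer(conn, "savings", "checking", 500)  # savings holds only 300
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, balances)
```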

Starting a Transaction

Database servers handle transaction creation in one of two ways:

An active transaction is always associated with a database session, so there is no need or method to explicitly begin a transaction. When the current transaction ends, the server automatically begins a new transaction for your session. Because every statement runs inside a transaction, the changes made by even a single statement can be rolled back. (Oracle Database)
Unless you explicitly begin a transaction, individual SQL statements are automatically committed independently of one another. To begin a transaction, you must first issue a command. (Microsoft SQL Server/MySQL)

The SQL:2003 standard includes a start transaction command to be used when you want to explicitly begin a transaction. While MySQL conforms to the standard, SQL Server users must instead issue the command begin transaction. With both servers, until you explicitly begin a transaction, you are in what is known as autocommit mode, which means that individual statements are automatically committed by the server.

A word of advice: shut off autocommit mode each time you log in, and get in the habit of running all of your SQL statements within a transaction.

Both MySQL and SQL Server allow you to turn off autocommit mode for individual sessions, in which case the servers will act just like Oracle Database regarding transactions. With SQL Server, you issue the following command to disable autocommit mode:

SET IMPLICIT_TRANSACTIONS ON

MySQL allows you to disable autocommit mode via the following:

SET AUTOCOMMIT=0

Once you have left autocommit mode, all SQL commands take place within the scope of a transaction and must be explicitly committed or rolled back.

Ending a Transaction

End a transaction with commit to make its changes permanent, or with rollback to undo all changes made since the transaction began.

Some scenarios in practice:

The server shuts down, in which case your transaction will be rolled back automatically when the server is restarted. ✔
You issue an SQL schema statement, such as alter table, which will cause the current transaction to be committed and a new transaction to be started.
- Be careful that the statements that comprise a unit of work are not inadvertently broken up into multiple transactions by the server!
You issue another start transaction command, which will cause the previous transaction to be committed. ✔
The server prematurely ends your transaction because the server detects a deadlock and decides that your transaction is the culprit. In this case, the transaction will be rolled back, and you will receive an error message.
- Most of the time, the terminated transaction can be restarted and will succeed without encountering another deadlock situation.
  Message: Deadlock found when trying to get lock; try restarting transaction

Transaction Savepoints

You may not want to undo all of the work that has transpired. For these situations, you can establish one or more savepoints

SAVEPOINT my_savepoint;

within a transaction and use them to roll back to a particular location within your transaction

ROLLBACK TO SAVEPOINT my_savepoint;

rather than rolling all the way back to the start of the transaction.

Choosing a Storage Engine

When using Oracle Database or Microsoft SQL Server, a single set of code is responsible for low-level database operations, such as retrieving a particular row from a table based on primary key value. The MySQL server, however, has been designed so that multiple storage engines may be utilized to provide low-level database functionality, including resource locking and transaction management. As of version 8.0, MySQL includes the following storage engines:

MyISAM A nontransactional engine employing table locking

MEMORY A nontransactional engine used for in-memory tables

CSV A nontransactional engine that stores data in comma-separated files

InnoDB A transactional engine employing row-level locking

Merge A specialty engine used to make multiple identical MyISAM tables appear as a single table (a.k.a. table partitioning)

Archive A specialty engine used to store large amounts of unindexed data, mainly for archival purposes

MySQL is flexible enough to allow you to choose a storage engine on a table-by-table basis.

You may explicitly specify a storage engine when creating a table, or you can change an existing table to use a different engine.

show table status like 'customer' \G
/*Second row: Engine: InnoDB*/
ALTER TABLE customer ENGINE = INNODB;

An example of using a savepoint within a transaction:

START TRANSACTION;
UPDATE product
SET date_retired = CURRENT_TIMESTAMP()
WHERE product_cd = 'XYZ';
SAVEPOINT before_close_accounts;
UPDATE account
SET status = 'CLOSED', close_date = CURRENT_TIMESTAMP(), last_activity_date = CURRENT_TIMESTAMP()
WHERE product_cd = 'XYZ';
ROLLBACK TO SAVEPOINT before_close_accounts;
COMMIT;
/*The net effect of this transaction is that the mythical XYZ product is retired but none of the accounts are closed.*/

When using savepoints, remember the following:

Despite the name, nothing is saved when you create a savepoint. You must eventually issue a commit if you want your transaction to be made permanent.
If you issue a rollback without naming a savepoint, all savepoints within the transaction will be ignored, and the entire transaction will be undone.

If you are using SQL Server, you will need to use the proprietary command save transaction to create a savepoint and rollback transaction to roll back to a savepoint, with each command being followed by the savepoint name.
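The savepoint example can be tried end to end with a small sketch. It assumes SQLite (through Python's sqlite3 module) in place of MySQL, and the two tables are cut down to hypothetical minimal columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions only
conn.executescript("""
    CREATE TABLE product (product_cd TEXT PRIMARY KEY, date_retired TEXT);
    CREATE TABLE account (account_id INTEGER PRIMARY KEY, product_cd TEXT, status TEXT);
    INSERT INTO product VALUES ('XYZ', NULL);
    INSERT INTO account VALUES (1, 'XYZ', 'ACTIVE'), (2, 'XYZ', 'ACTIVE');
""")

conn.execute("BEGIN")
conn.execute("UPDATE product SET date_retired = CURRENT_TIMESTAMP WHERE product_cd = 'XYZ'")
conn.execute("SAVEPOINT before_close_accounts")
conn.execute("UPDATE account SET status = 'CLOSED' WHERE product_cd = 'XYZ'")
# Undo only the work done after the savepoint; the product update is kept.
conn.execute("ROLLBACK TO SAVEPOINT before_close_accounts")
conn.execute("COMMIT")

retired = conn.execute("SELECT date_retired IS NOT NULL FROM product").fetchone()[0]
statuses = [row[0] for row in conn.execute("SELECT status FROM account ORDER BY account_id")]
print(retired, statuses)
```

After the commit the product is retired while both accounts remain active, matching the net effect described in the SQL comment.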

Learning SQL Notes #9: Conditional Logic

Mon, 07 Jun 2021 01:00:00 +0000

What Is Conditional Logic?
- The case Expression
  - Searched case Expressions
  - Simple case Expressions (A less flexible ver. of the previous expression)
- Examples of case Expressions

What Is Conditional Logic?

Conditional logic is simply the ability to take one of several paths during program execution.

Analogous to if-else in Python and R.

SELECT first_name, last_name,
CASE
WHEN active = 1 THEN 'ACTIVE'
ELSE 'INACTIVE'
END activity_type
FROM customer;

The case Expression

The case expression is part of the SQL standard (SQL92 release) and has been implemented by Oracle Database, SQL Server, MySQL, PostgreSQL, IBM UDB, and others.
case expressions are built into the SQL grammar and can be included in select, insert, update, and delete statements.

Searched case Expressions

CASE
WHEN category.name IN ('Children','Family','Sports','Animation')
THEN 'All Ages'
WHEN category.name = 'Horror'
THEN 'Adult'
WHEN category.name IN ('Music','Games')
THEN 'Teens'
ELSE 'Other'
END

SELECT c.first_name, c.last_name,
CASE
WHEN active = 0 THEN 0
ELSE
(SELECT count(*) FROM rental r
WHERE r.customer_id = c.customer_id)
END num_rentals /*Create new variables*/
FROM customer c;

Simple case Expressions (A less flexible ver. of the previous expression)

CASE V0
WHEN V1 THEN E1
WHEN V2 THEN E2 ...
WHEN VN THEN EN
[ELSE ED]
END

V0 represents a value, and the symbols V1, V2, …, VN represent values that are to be compared to V0.

Examples of case Expressions

Result Set Transformations

SELECT monthname(rental_date) rental_month,
count(*) num_rentals
FROM rental
WHERE rental_date BETWEEN '2005-05-01' AND '2005-08-01'
GROUP BY monthname(rental_date);

rental_month	num_rentals
May	1156
June	2311
July	6709

SELECT
SUM(CASE WHEN monthname(rental_date) = 'May' THEN 1
ELSE 0 END) May_rentals,
SUM(CASE WHEN monthname(rental_date) = 'June' THEN 1
ELSE 0 END) June_rentals,
SUM(CASE WHEN monthname(rental_date) = 'July' THEN 1
ELSE 0 END) July_rentals
FROM rental
WHERE rental_date BETWEEN '2005-05-01' AND '2005-08-01';

May_rentals	June_rentals	July_rentals
1156	2311	6709

When the monthname() function returns the desired value for that column, the case expression returns the value 1; otherwise, it returns a 0. When summed over all rows, each column returns the number of rentals for that month. Obviously, such transformations are practical for only a small number of values.
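A small sketch of the same pivot, assuming SQLite via Python's sqlite3 module. SQLite has no monthname(), so strftime('%m', ...) is swapped in, and the six rental rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, rental_date TEXT)")
conn.executemany(
    "INSERT INTO rental (rental_date) VALUES (?)",
    [("2005-05-24 22:53:30",), ("2005-05-31 08:00:00",),
     ("2005-06-15 10:00:00",),
     ("2005-07-05 12:00:00",), ("2005-07-06 09:30:00",), ("2005-07-31 23:00:00",)],
)

# Each CASE yields 1 for rows in its month and 0 otherwise, so SUM counts them.
pivot_sql = """
    SELECT
      SUM(CASE WHEN strftime('%m', rental_date) = '05' THEN 1 ELSE 0 END) AS may_rentals,
      SUM(CASE WHEN strftime('%m', rental_date) = '06' THEN 1 ELSE 0 END) AS june_rentals,
      SUM(CASE WHEN strftime('%m', rental_date) = '07' THEN 1 ELSE 0 END) AS july_rentals
    FROM rental
    WHERE rental_date BETWEEN '2005-05-01' AND '2005-08-01'
"""
may, june, july = conn.execute(pivot_sql).fetchone()
print(may, june, july)
```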

Checking for Existence

Sometimes you will want to determine whether a relationship exists between two entities without regard for the quantity.

SELECT a.first_name, a.last_name,
CASE
WHEN EXISTS (SELECT 1 FROM film_actor fa
INNER JOIN film f ON fa.film_id = f.film_id
WHERE fa.actor_id = a.actor_id
AND f.rating = 'G') THEN 'Y'
ELSE 'N'
END g_actor
FROM actor a
WHERE a.last_name LIKE 'S%' OR a.first_name LIKE 'S%';

(Avoid) Division-by-Zero Errors

...
sum(p.amount) /
CASE WHEN count(p.amount) = 0 THEN 1
ELSE count(p.amount)
END avg_payment
...

Conditional Updates

UPDATE customer
SET active =
CASE
WHEN 90 <= (SELECT datediff(now(), max(rental_date))
FROM rental r
WHERE r.customer_id = customer.customer_id)
THEN 0
ELSE 1
END
WHERE active = 1;
/*if the number returned by the subquery is 90 or higher, the customer is marked as inactive.*/

Handling Null Values

...
CASE
WHEN a.address IS NULL THEN 'Unknown'
ELSE a.address
END address,
...

Note: For calculations, null values often cause a null result. When performing calculations, case expressions are useful for translating a null value into a number (usually 0 or 1) that will allow the calculation to yield a non-null value.
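The null-handling and division-by-zero guards can be combined in one query. This is a sketch only, assuming SQLite via Python's sqlite3 module and a made-up payment table in which customer 2 has no recorded amount.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO payment VALUES (?, ?)", [(1, 4.0), (1, 6.0), (2, None)])

sql = """
    SELECT customer_id,
           SUM(amount) AS raw_total,                      -- null when every amount is null
           CASE WHEN SUM(amount) IS NULL THEN 0
                ELSE SUM(amount) END AS safe_total,       -- null translated to 0
           CASE WHEN SUM(amount) IS NULL THEN 0 ELSE SUM(amount) END
             / CASE WHEN COUNT(amount) = 0 THEN 1
                    ELSE COUNT(amount) END AS avg_payment -- denominator is never 0
    FROM payment
    GROUP BY customer_id
    ORDER BY customer_id
"""
rows = conn.execute(sql).fetchall()
print(rows)
```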

Learning SQL Notes #8: Subqueries

Sun, 06 Jun 2021 01:00:00 +0000

What Is a Subquery?
Subquery Types
- Noncorrelated Subqueries
  - Multiple-Row, Single-Column Subqueries
  - Multicolumn Subqueries
- Correlated Subqueries
  - The exists Operator
  - Data Manipulation Using Correlated Subqueries
When to Use Subqueries
- Subqueries as Data Sources
- Subqueries as Expression Generators
Subquery Wrap-Up

What Is a Subquery?

A subquery is a query contained within another SQL statement (which I refer to as the containing statement for the rest of this discussion). A subquery is always enclosed within parentheses, and it is usually executed prior to the containing statement. Like any query, a subquery returns a result set that may consist of:

A single row with a single column
Multiple rows with a single column
Multiple rows having multiple columns

SELECT customer_id, first_name, last_name
FROM customer
WHERE customer_id = (SELECT MAX(customer_id) FROM customer);

Subquery Types

Noncorrelated Subqueries

Multiple-Row, Single-Column Subqueries

The in and not in operators

SELECT city_id, city
FROM city
WHERE country_id <> (SELECT country_id FROM country WHERE country = 'India');

Note: When a subquery is compared using an equality or inequality operator (=, <>, <, >), it must return at most one row; otherwise the server raises an error.

When several values are needed, first write a query that returns the whole set. Either of the following forms works:

SELECT country_id
FROM country
WHERE country IN ('Canada','Mexico');

SELECT country_id
FROM country
WHERE country = 'Canada' OR country = 'Mexico';

and then use it as a subquery with the in operator:

SELECT city_id, city
FROM city
WHERE country_id IN
(SELECT country_id
FROM country
WHERE country IN ('Canada','Mexico'));

or the opposite:

SELECT city_id, city
FROM city
WHERE country_id NOT IN
(SELECT country_id
FROM country
WHERE country IN ('Canada','Mexico'));

The all operator

The all operator allows you to make comparisons between a single value and every value in a set:

SELECT first_name, last_name
FROM customer
WHERE customer_id <> ALL
(SELECT customer_id
FROM payment
WHERE amount = 0);

or the equivalent:

SELECT first_name, last_name
FROM customer
WHERE customer_id NOT IN
(SELECT customer_id
FROM payment
WHERE amount = 0);

Any attempt to equate a value to null yields unknown, so when using not in or <> all to compare a value to a set of values, you must be careful to ensure that the set of values does not contain a null value.
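The null pitfall is easy to reproduce. The sketch below assumes SQLite via Python's sqlite3 module and a tiny made-up data set in which one zero-amount payment has a null customer_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE payment (payment_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'MARY'), (2, 'PATRICIA'), (3, 'LINDA');
    INSERT INTO payment VALUES (1, 1, 0), (2, 2, 4.99), (3, NULL, 0);
""")

# The subquery returns {1, NULL}; comparing any id to NULL yields unknown,
# so NOT IN is never true and no rows come back.
naive = conn.execute("""
    SELECT first_name FROM customer
    WHERE customer_id NOT IN (SELECT customer_id FROM payment WHERE amount = 0)
""").fetchall()

# Filtering nulls out of the set restores the expected behavior.
guarded = conn.execute("""
    SELECT first_name FROM customer
    WHERE customer_id NOT IN
      (SELECT customer_id FROM payment
       WHERE amount = 0 AND customer_id IS NOT NULL)
    ORDER BY customer_id
""").fetchall()
print(naive, guarded)
```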

The subquery in the following example returns the total number of film rentals for each customer in North America, and the containing query returns all customers whose total number of film rentals exceeds that of every North American customer.

SELECT customer_id, count(*)
FROM rental
GROUP BY customer_id
HAVING count(*) > ALL
(SELECT count(*)
FROM rental r
INNER JOIN customer c
ON r.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
INNER JOIN country co
ON ct.country_id = co.country_id
WHERE co.country IN ('United States','Mexico','Canada')
GROUP BY r.customer_id
);

The any operator (OR)

A condition using the any operator evaluates to true as soon as a single comparison is favorable.

SELECT customer_id, sum(amount)
FROM payment
GROUP BY customer_id
HAVING sum(amount) > ANY
(SELECT sum(amount)
FROM payment p
INNER JOIN customer c
ON p.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
INNER JOIN country co
ON ct.country_id = co.country_id
WHERE co.country IN ('Bolivia','Paraguay','Chile')
GROUP BY co.country
);

Multicolumn Subqueries

SELECT actor_id, film_id
FROM film_actor
WHERE (actor_id, film_id) IN
(SELECT a.actor_id, f.film_id
FROM actor a
CROSS JOIN film f
WHERE a.last_name = 'MONROE'
AND f.rating = 'PG');

Correlated Subqueries

A correlated subquery, on the other hand, is dependent on its containing statement from which it references one or more columns.

SELECT c.first_name, c.last_name
FROM customer c
WHERE 20 =
(SELECT count(*)
FROM rental r
WHERE r.customer_id = c.customer_id);
/*customers who have rented exactly 20 films*/

The exists Operator

You use the exists operator when you want to identify that a relationship exists without regard for the quantity.

SELECT c.first_name, c.last_name
FROM customer c
WHERE EXISTS /*or NOT EXISTS for the opposite test*/
(SELECT r.rental_date, r.customer_id, 'ABCD' str, 2 * 3 / 7 nmbr /*can be replaced by anything*/
FROM rental r
WHERE r.customer_id = c.customer_id
AND date(r.rental_date) < '2005-05-25');

Since the condition in the containing query only needs to know whether any rows were returned, the actual data the subquery returned is irrelevant.

Data Manipulation Using Correlated Subqueries

UPDATE customer c
SET c.last_update =
(SELECT max(r.rental_date)
FROM rental r
WHERE r.customer_id = c.customer_id);
UPDATE customer c
SET c.last_update =
(SELECT max(r.rental_date)
FROM rental r
WHERE r.customer_id = c.customer_id)
WHERE EXISTS
(SELECT 1 FROM rental r
WHERE r.customer_id = c.customer_id);
/*executes only if the condition in the update statement’s where clause evaluates to true (meaning that at least one rental was found for the customer), thus protecting the data in the last_update column from being overwritten with a null.*/
DELETE FROM customer
WHERE 365 < ALL
(SELECT datediff(now(), r.rental_date) days_since_last_rental
FROM rental r
WHERE r.customer_id = customer.customer_id);
/*removes rows from the customer table where there have been no film rentals in the past year*/
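A sketch of why the where exists guard matters, assuming SQLite via Python's sqlite3 module. SQLite does not accept a table alias in the set clause, so the table name is used directly, and the two customers are made up (only customer 1 has rentals).

```python
import sqlite3

SETUP = """
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, last_update TEXT);
    CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, customer_id INTEGER, rental_date TEXT);
    INSERT INTO customer VALUES (1, '2005-01-01'), (2, '2005-01-01');
    INSERT INTO rental VALUES (1, 1, '2005-05-24'), (2, 1, '2005-06-15');
"""
SUBQUERY = ("(SELECT max(r.rental_date) FROM rental r "
            "WHERE r.customer_id = customer.customer_id)")

def run_update(where_clause):
    """Apply the correlated update to a fresh database and return all customer rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    conn.execute(f"UPDATE customer SET last_update = {SUBQUERY} {where_clause}")
    return conn.execute(
        "SELECT customer_id, last_update FROM customer ORDER BY customer_id").fetchall()

# Without the guard, customer 2 (no rentals) is overwritten with null.
unguarded = run_update("")
# With the guard, only customers that have at least one rental are touched.
guarded = run_update(
    "WHERE EXISTS (SELECT 1 FROM rental r WHERE r.customer_id = customer.customer_id)")
print(unguarded, guarded)
```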

When to Use Subqueries

Subqueries as Data Sources

SELECT c.first_name, c.last_name, pymnt.num_rentals, pymnt.tot_payments
FROM customer c
INNER JOIN
(SELECT customer_id, count(*) num_rentals, sum(amount) tot_payments
FROM payment
GROUP BY customer_id ) pymnt /*execute first*/
ON c.customer_id = pymnt.customer_id;

Data fabrication

First we have a table for some standards (small/average/heavy) with lower and upper bounds.

SELECT 'Small Fry' name, 0 low_limit, 74.99 high_limit
UNION ALL
SELECT 'Average Joes' name, 75 low_limit, 149.99 high_limit
UNION ALL
SELECT 'Heavy Hitters' name, 150 low_limit, 9999999.99 high_limit;

Then we join this fabricated table to the aggregated payment data to count the customers in each group.

SELECT pymnt_grps.name, count(*) num_customers
FROM
(SELECT customer_id, count(*) num_rentals, sum(amount) tot_payments
FROM payment
GROUP BY customer_id) pymnt
INNER JOIN (SELECT 'Small Fry' name, 0 low_limit, 74.99 high_limit
UNION ALL
SELECT 'Average Joes' name, 75 low_limit, 149.99 high_limit
UNION ALL
SELECT 'Heavy Hitters' name, 150 low_limit, 9999999.99 high_limit ) pymnt_grps
ON pymnt.tot_payments
BETWEEN pymnt_grps.low_limit AND pymnt_grps.high_limit
GROUP BY pymnt_grps.name;
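The range join can be checked on a handful of rows. This sketch assumes SQLite via Python's sqlite3 module; the payments are invented so that the four customers total 50, 100, 80, and 200.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO payment VALUES (?, ?)",
    [(1, 20.0), (1, 30.0), (2, 100.0), (3, 80.0), (4, 150.0), (4, 50.0)],
)

sql = """
    SELECT pymnt_grps.name, count(*) AS num_customers
    FROM (SELECT customer_id, sum(amount) AS tot_payments
          FROM payment
          GROUP BY customer_id) pymnt
      INNER JOIN (SELECT 'Small Fry' AS name, 0 AS low_limit, 74.99 AS high_limit
                  UNION ALL
                  SELECT 'Average Joes', 75, 149.99
                  UNION ALL
                  SELECT 'Heavy Hitters', 150, 9999999.99) pymnt_grps
      ON pymnt.tot_payments BETWEEN pymnt_grps.low_limit AND pymnt_grps.high_limit
    GROUP BY pymnt_grps.name
    ORDER BY pymnt_grps.name
"""
groups = conn.execute(sql).fetchall()
print(groups)
```

Each customer's total falls into exactly one range, so the join assigns one group per customer before the outer group by counts them.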

Task-oriented subqueries

SELECT c.first_name, c.last_name, ct.city,
sum(p.amount) tot_payments, count(*) tot_rentals
FROM payment p
INNER JOIN customer c
ON p.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
GROUP BY c.first_name, c.last_name, ct.city;

The names, cities, and addresses are needed for display purposes only, so we can use a subquery to group the payment data first and then join the other tables. A more efficient query for the same task:

SELECT c.first_name, c.last_name, ct.city, pymnt.tot_payments, pymnt.tot_rentals
FROM (SELECT customer_id, count(*) tot_rentals, sum(amount) tot_payments
FROM payment
GROUP BY customer_id) pymnt
INNER JOIN customer c
ON pymnt.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id;

Common table expressions

WITH actors_s AS
(SELECT actor_id, first_name, last_name
FROM actor
WHERE last_name LIKE 'S%'
) /*can be used in the subsequent queries*/
...
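A complete statement built on a common table expression might look like the sketch below. It assumes SQLite via Python's sqlite3 module; the query that follows the with clause and the sample rows are illustrative, not taken from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER);
    INSERT INTO actor VALUES (1, 'JOE', 'SWANK'), (2, 'ED', 'CHASE'), (3, 'CAMERON', 'STREEP');
    INSERT INTO film_actor VALUES (1, 10), (1, 11), (2, 10), (3, 12);
""")

# actors_s is defined once at the top and then used like an ordinary table.
sql = """
    WITH actors_s AS
      (SELECT actor_id, first_name, last_name
       FROM actor
       WHERE last_name LIKE 'S%')
    SELECT s.last_name, count(*) AS num_films
    FROM actors_s s
      INNER JOIN film_actor fa ON fa.actor_id = s.actor_id
    GROUP BY s.last_name
    ORDER BY s.last_name
"""
rows = conn.execute(sql).fetchall()
print(rows)
```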

Subqueries as Expression Generators

Correlated scalar subqueries. The customer table is accessed three times (once in each of the three subqueries) rather than just once.

SELECT
(SELECT c.first_name
FROM customer c
WHERE c.customer_id = p.customer_id
) first_name,
(SELECT c.last_name
FROM customer c
WHERE c.customer_id = p.customer_id
) last_name,
(SELECT ct.city
FROM customer c
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
WHERE c.customer_id = p.customer_id
) city,
sum(p.amount) tot_payments, count(*) tot_rentals
FROM payment p
GROUP BY p.customer_id;

Similarly,

INSERT INTO film_actor (actor_id, film_id, last_update)
VALUES (
(SELECT actor_id
FROM actor
WHERE first_name = 'JENNIFER' AND last_name = 'DAVIS'),
(SELECT film_id
FROM film
WHERE title = 'ACE GOLDFINGER'),
now()
);

Subquery Wrap-Up

To wrap up, subqueries:

Return a single column and row, a single column with multiple rows, and multiple columns and rows
Are independent of the containing statement (noncorrelated subqueries)
Reference one or more columns from the containing statement (correlated subqueries)
Are used in conditions that utilize comparison operators as well as the special-purpose operators in, not in, exists, and not exists
Can be found in select, update, delete, and insert statements
Generate result sets that can be joined to other tables (or subqueries) in a query
Can be used to generate values to populate a table or to populate columns in a query’s result set
Are used in the select, from, where, having, and order by clauses of queries

Happy learning!

Learning SQL Notes #7: Grouping and Aggregates (CH. 8)

Sat, 05 Jun 2021 01:00:00 +0000

Grouping Concepts
Aggregate Functions
Generating Groups
Group Filter Conditions

Grouping Concepts

SELECT customer_id, count(*)
FROM rental
GROUP BY customer_id
HAVING count(*) >= 40
ORDER BY 2 DESC;

WARNING:

~~WHERE count(*) >= 40~~ does not work: a condition on an aggregate function belongs in the HAVING clause, because WHERE is evaluated before the rows are grouped.

R codes:

library(tidyverse)
rental %>%
group_by(customer_id) %>%
summarize(counts=n()) %>%
filter(counts>=40) %>%
arrange(desc(counts))

Aggregate Functions

Some aggregate functions in SQL/R:

SQL	R
count()	count()
sum()	sum()
avg()	mean()
min()	min()
max()	max()
group_concat()	paste()
first()	[1]
last()	[-1]

SELECT COUNT(DISTINCT col1)
FROM string_tbl;

R codes:

length(unique(string_tbl$col1))

NULLS are ignored unless you use count(*) where all rows will be counted.
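A quick check of how nulls affect the three forms of count(), sketched with SQLite via Python's sqlite3 module and four made-up values, one of them null.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE string_tbl (col1 TEXT)")
conn.executemany("INSERT INTO string_tbl VALUES (?)", [("a",), ("b",), ("a",), (None,)])

# count(*) counts rows; count(col1) skips nulls; DISTINCT also removes duplicates.
all_rows, non_null, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(col1), COUNT(DISTINCT col1) FROM string_tbl"
).fetchone()
print(all_rows, non_null, distinct)
```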

Generating Groups

Single-Column/Multicolumn Grouping

Grouping can be done on 1 or more columns with aggregate functions.

SELECT actor_id, count(*)
FROM film_actor
GROUP BY actor_id;
SELECT fa.actor_id, f.rating, count(*)
FROM film_actor fa
INNER JOIN film f
ON fa.film_id = f.film_id
GROUP BY fa.actor_id, f.rating
ORDER BY 1,2;

R codes are analogous to the codes in the last section.

Grouping via Expressions

SELECT extract(YEAR FROM rental_date) year,
COUNT(*) how_many
FROM rental
GROUP BY extract(YEAR FROM rental_date);

R codes:

library(tidyverse)
library(lubridate) # provides year()
rental %>%
mutate(year=year(rental_date)) %>%
group_by(year) %>%
summarize(counts=n())

Generating Rollups

Find total counts for each distinct actor.

/*MySQL*/
SELECT fa.actor_id, f.rating, count(*)
FROM film_actor fa
INNER JOIN film f
ON fa.film_id = f.film_id
GROUP BY fa.actor_id, f.rating WITH ROLLUP
ORDER BY 1,2;
/*Oracle*/
GROUP BY ROLLUP(fa.actor_id, f.rating)
GROUP BY a, ROLLUP(b, c)

actor_id	rating	count(*)
NULL	NULL	5462
1	NULL	19
1	G	4
1	PG	6
1	PG-13	1
1	R	3
1	NC-17	5
2	NULL	25
2	G	7

R codes:

library(reshape2)
library(zoo)
m <- melt(df, measure.vars = "sales")
dout <- dcast(m, year + month + region ~ variable, fun.aggregate = sum, margins = "month")
dout$month <- na.locf(replace(dout$month, dout$month == "(all)", NA))

See here: https://stackoverflow.com/questions/36169073/how-to-do-group-by-rollup-in-r-like-sql

Group Filter Conditions

HAVING with aggregate functions;
WHERE with original columns;
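The two kinds of filter can be combined in one query. The sketch below assumes SQLite via Python's sqlite3 module and six invented rental rows; it also shows that an aggregate placed in the where clause is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rental (customer_id INTEGER, rental_date TEXT)")
conn.executemany(
    "INSERT INTO rental VALUES (?, ?)",
    [(1, "2005-05-01"), (1, "2005-06-01"), (1, "2004-01-01"),
     (2, "2005-05-02"), (3, "2004-03-03"), (3, "2004-04-04")],
)

# WHERE removes individual rows first; HAVING then filters the resulting groups.
frequent_2005 = conn.execute("""
    SELECT customer_id, count(*)
    FROM rental
    WHERE rental_date >= '2005-01-01'
    GROUP BY customer_id
    HAVING count(*) >= 2
""").fetchall()

# An aggregate function inside WHERE is refused by the server.
try:
    conn.execute("SELECT customer_id FROM rental WHERE count(*) >= 2 GROUP BY customer_id")
    where_rejected = False
except sqlite3.Error:
    where_rejected = True
print(frequent_2005, where_rejected)
```

Customer 3 has two rentals overall, but both are removed by the where clause before grouping, so only customer 1 passes the having filter.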

Learning SQL Notes #6: Data Generation, Manipulation, and Conversion

Fri, 04 Jun 2021 01:00:00 +0000

Working with String Data
- String Generation
  - Including single quotes
  - Including special characters
- String Manipulation
  - String functions that return numbers
Working with Numeric Data
- Performing Arithmetic Functions & Controlling Number Precision & Handling Signed Data
Working with Temporal Data
Appendix for Codes

Working with String Data

String Generation

Types:

CHAR Holds fixed-length, blank-padded strings.

varchar Holds variable-length strings.

text (MySQL and SQL Server) or clob (Oracle Database) Holds very large variable-length strings (generally referred to as documents in this context).

CREATE TABLE string_tbl
(char_fld CHAR(30),
vchar_fld VARCHAR(30),
text_fld TEXT
);
INSERT INTO string_tbl (char_fld, vchar_fld, text_fld)
VALUES ('This is char data',
'This is varchar data',
'This is text data');

If you want to have a longer string, you can

UPDATE string_tbl
SET vchar_fld = 'This is a piece of extremely long varchar data';

but then:

ERROR 1406 (22001): Data too long for column 'vchar_fld' at row 1

NOTE: Since MySQL 5.7, the default behavior is “strict” mode, which means that exceptions are thrown when problems arise, whereas in older versions of the server the string would have been truncated and a warning issued.

SELECT @@session.sql_mode;
SET sql_mode='ansi'; /*Go back to the older ver.*/
SELECT @@session.sql_mode;

Now the extra characters are truncated and a warning is issued instead of an error.

Including single quotes

SELECT quote(text_fld)
FROM string_tbl;

Output:

QUOTE(text_fld)
'This string didn\'t work, but it does now'

Including special characters

The SQL Server and MySQL servers include the built-in function char() so that you can build strings from any of the 255 characters in the ASCII character set (Oracle Database users can use the chr() function).

SELECT CHAR(128,129,130,131,132,133,134,135,136,137);

Output:

CHAR(128,129,130,131,132,133,134,135,136,137)
Çüéâäàåçêë

R codes:

coderange <- c(128,129,130,131,132,133,134,135,136,137)
rawToChar(as.raw(coderange),multiple=TRUE)

You can also concatenate two strings:

SELECT CONCAT('danke sch', CHAR(148), 'n');

Output:

CONCAT('danke sch', CHAR(148), 'n')
danke schön

R codes:

paste0('danke sch', rawToChar(as.raw(148)), 'n') # paste0() adds no separator; paste() would insert spaces

See: https://www.r-bloggers.com/2011/03/ascii-code-table-in-r/

Oracle Database/PostgreSQL users can use the concatenation operator (||) instead of the concat() function, as in:

SELECT 'danke sch' || CHR(148) || 'n' FROM dual;

SQL Server versions before 2012 do not include a concat() function, so there you will need to use the concatenation operator (+), as in:

SELECT 'danke sch' + CHAR(148) + 'n'

String Manipulation

String functions that return numbers

To find the length of a string:

LENGTH()
SELECT LENGTH(char_fld) char_length,
LENGTH(vchar_fld) varchar_length,
LENGTH(text_fld) text_length
FROM string_tbl;

R codes:

nchar() # length() counts vector elements, not characters

To find the index of a character in a string:

POSITION()
SELECT POSITION('characters' IN vchar_fld)
FROM string_tbl;

R codes:

regexpr('y', x, fixed = TRUE) # position of the first match in string x, -1 if not found

Note: When working with databases, the first character in a string is at position 1. A return value of 0 from position() indicates that the substring could not be found, not that the substring was found at the first position in the string.

If you want to start your search at something other than the first character of your target string, you will need to use the locate() function, which is similar to the position() function except that it allows an optional third parameter, which is used to define the search’s start position. The locate() function is also proprietary, whereas the position() function is part of the SQL:2003 standard.

SELECT LOCATE('is', vchar_fld, 5)
FROM string_tbl;

R codes:

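pos <- regexpr('is', substring(x, 5), fixed = TRUE) # search from the 5th character on
ifelse(pos > 0, pos + 4, 0) # shift back to a position in the full string, 0 if not found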
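The 1-based convention and the optional start position can be mirrored in a few lines. locate() below is a hypothetical Python helper, not a library function; it assumes start positions of 1 or more and, unlike MySQL with a case-insensitive collation, compares case-sensitively.

```python
def locate(substr: str, s: str, start: int = 1) -> int:
    """Hypothetical helper mirroring LOCATE(substr, str, pos): 1-based, 0 if absent."""
    if start < 1:
        return 0
    # str.find() is 0-based and returns -1 when nothing is found,
    # so adding 1 gives a 1-based position, or 0 for "not found".
    return s.find(substr, start - 1) + 1

text = "This is varchar data"
first = locate("is", text)          # the "is" inside "This"
from_five = locate("is", text, 5)   # search starts at the 5th character
missing = locate("xyz", text)
print(first, from_five, missing)
```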

Oracle Database instr(): Mimics the position() function when provided with two arguments and mimics the locate() function when provided with three arguments.

SQL Server charindex(): similar to Oracle’s instr() function.

strcmp() (MySQL ONLY) takes two strings as arguments and returns one of the following:

−1 if the first string comes before the second string in sort order
0 if the strings are identical
1 if the first string comes after the second string in sort order

SELECT vchar_fld
FROM string_tbl
ORDER BY vchar_fld;

vchar_fld
12345
abcd
QRSTUV
qrstuv
xyz

SELECT STRCMP('12345','12345') 12345_12345,
STRCMP('abcd','xyz') abcd_xyz,
STRCMP('abcd','QRSTUV') abcd_QRSTUV,
STRCMP('qrstuv','QRSTUV') qrstuv_QRSTUV, /*Case insensitive*/
STRCMP('12345','xyz') 12345_xyz,
STRCMP('xyz','qrstuv') xyz_qrstuv;

12345_12345	abcd_xyz	abcd_QRSTUV	qrstuv_QRSTUV	12345_xyz	xyz_qrstuv
0	−1	−1	0	−1	1

To add or replace characters in the middle of a string, use insert() (MySQL), which takes four parameters: the original string, the start position, the number of characters to replace (0 to insert without replacing), and the replacement string.

SELECT INSERT('goodbye world', 9, 0, 'cruel ') string;
/*goodbye cruel world*/
SELECT INSERT('goodbye world', 1, 7, 'hello') string;
/*hello world*/
SELECT SUBSTRING('goodbye cruel world', 9, 5);
/*cruel*/
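The four parameters of insert() map onto simple slicing. mysql_insert() below is a hypothetical Python helper written for illustration; it assumes a non-negative length and returns the original string when the position falls outside it.

```python
def mysql_insert(s: str, pos: int, length: int, new: str) -> str:
    """Hypothetical helper mirroring INSERT(str, pos, len, newstr) with 1-based pos."""
    if pos < 1 or pos > len(s):
        return s  # position outside the string: nothing changes
    # Keep everything before pos, add the new text, then skip `length` characters.
    return s[:pos - 1] + new + s[pos - 1 + length:]

inserted = mysql_insert("goodbye world", 9, 0, "cruel ")  # length 0: pure insertion
replaced = mysql_insert("goodbye world", 1, 7, "hello")   # replaces 'goodbye'
# SUBSTRING(str, 9, 5) corresponds to a slice starting at index 8.
piece = "goodbye cruel world"[8:8 + 5]
print(inserted, replaced, piece)
```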

For other servers:

/*Oracle*/
SELECT REPLACE('goodbye world', 'goodbye', 'hello') FROM dual;
/*hello world*/
SELECT substr('goodbye cruel world', 9, 5) FROM dual;
/*cruel*/
/*SQL Server*/
SELECT STUFF('hello world', 1, 5, 'goodbye cruel')
/*goodbye cruel world*/
SELECT SUBSTRING('goodbye cruel world', 9, 5);
/*cruel*/

Working with Numeric Data

SELECT (37 * 59) / (78 - (8 * 6));

Performing Arithmetic Functions & Controlling Number Precision & Handling Signed Data

Function name	Description
acos( x )	Calculates the arc cosine of x
asin( x )	Calculates the arc sine of x
atan( x )	Calculates the arc tangent of x
cos( x )	Calculates the cosine of x
sin( x )	Calculates the sine of x
tan( x )	Calculates the tangent of x
cot( x )	Calculates the cotangent of x
exp( x )	Calculates e^x
ln( x )	Calculates the natural log of x
sqrt( x )	Calculates the square root of x

Some useful functions in R and SQL (See Appendix for full results):

SQL	R
MOD( x, y )	%%
POW( x, y )	^
CEIL( x )	ceiling()
FLOOR( x )	floor()
ROUND( x )	round()
TRUNCATE( x )	trunc()
SIGN( x )	sign()
ABS( x )	abs()

Working with Temporal Data

Dealing with Time Zones

/*MySQL*/
SELECT @@global.time_zone, @@session.time_zone;
SET time_zone = 'Europe/Zurich';
/*Oracle Database*/
ALTER SESSION SET TIME_ZONE = 'Europe/Zurich'

From:

@@global.time_zone	@@session.time_zone
SYSTEM	SYSTEM

To:

@@global.time_zone	@@session.time_zone
SYSTEM	Europe/Zurich

R codes:

Sys.timezone()
Sys.setenv(TZ = "Europe/Zurich")

Generating Temporal Data

You can generate temporal data via any of the following means:

Copying data from an existing date, datetime, or time column
Executing a built-in function that returns a date, datetime, or time
Building a string representation of the temporal data to be evaluated by the server

String representations of temporal data

Component	Definition	Range
YYYY	Year, including century	1000 to 9999
MM	Month	01 (January) to 12 (December)
DD	Day	01 to 31
HH	Hour	00 to 23
HHH	Hours (elapsed)	−838 to 838
MI	Minute	00 to 59
SS	Second	00 to 59

Type	Default format
date	YYYY-MM-DD
datetime	YYYY-MM-DD HH:MI:SS
timestamp	YYYY-MM-DD HH:MI:SS
time	HHH:MI:SS

String-to-date conversions

A simple query that returns a datetime value using the cast() function

SQL	R (lubridate)
CAST('2019-09-17 15:30:00' AS DATETIME)	as_datetime()
STR_TO_DATE('September 17, 2019', '%M %d, %Y')	as.Date(…, format=…)
CAST('2019-09-17' AS DATE)	as.Date()
CAST('108:17:57' AS TIME)	hms()

/*MySQL*/
SELECT str_to_date();
/*Oracle Database*/
SELECT to_date();
/*SQL server*/
SELECT convert();
/*Current System Time*/
SELECT CURRENT_DATE(), CURRENT_TIME(), CURRENT_TIMESTAMP();

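Format components for MySQL's str_to_date(). R's strptime() shares many of these but not all; for example, R uses %B for the month name and %M for minutes: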

Format component	Description
%M	Month name (January to December)
%m	Month numeric (01 to 12)
%d	Day numeric (01 to 31)
%j	Day of year (001 to 366)
%W	Weekday name (Sunday to Saturday)
%Y	Year, four-digit numeric
%y	Year, two-digit numeric
%H	Hour (00 to 23)
%h	Hour (01 to 12)
%i	Minutes (00 to 59)
%s	Seconds (00 to 59)
%f	Microseconds (000000 to 999999)
%p	A.M. or P.M.

Manipulating Temporal Data

Interval types for DATE_ADD() and EXTRACT()

Interval name	Description
second	Number of seconds
minute	Number of minutes
hour	Number of hours
day	Number of days
month	Number of months
year	Number of years
minute_second	Number of minutes and seconds, separated by “:”
hour_second	Number of hours, minutes, and seconds, separated by “:”
year_month	Number of years and months, separated by “-”

Temporal functions that return dates

The same result can be achieved on three different servers:

/*MySQL*/
UPDATE employee
SET birth_date = DATE_ADD(birth_date, INTERVAL '9-11' YEAR_MONTH)
WHERE emp_id = 4789;
/*Oracle Database*/
UPDATE employee
SET birth_date = ADD_MONTHS(birth_date, 119)
WHERE emp_id = 4789;
/*SQL server*/
UPDATE employee
SET birth_date = DATEADD(MONTH, 119, birth_date)
WHERE emp_id = 4789
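The interval '9-11' (9 years and 11 months) is the same as 119 months, which is why the Oracle and SQL Server versions use that number. A sketch of the arithmetic in Python, where add_months() is a hypothetical helper and the birth date is made up:

```python
import calendar
from datetime import date

def add_months(d: date, months: int) -> date:
    """Hypothetical helper: shift d by whole months, clamping to the month's last day."""
    month_index = d.year * 12 + (d.month - 1) + months
    year, month_zero_based = divmod(month_index, 12)
    last_day = calendar.monthrange(year, month_zero_based + 1)[1]
    return date(year, month_zero_based + 1, min(d.day, last_day))

years, months = 9, 11
total_months = years * 12 + months                   # the 119 used above
shifted = add_months(date(1970, 3, 27), total_months)
clamped = add_months(date(2019, 1, 31), 1)           # January 31 plus one month
print(total_months, shifted, clamped)
```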

Temporal functions that return strings

Some other functions for temporal data:

/*MySQL*/
SELECT LAST_DAY('2019-09-17'); /*Extract last day of Sept*/
SELECT DAYNAME('2019-09-18'); /*Wednesday*/
SELECT EXTRACT(YEAR FROM '2019-09-18 22:19:05'); /*2019*/
/*SQL Server*/
SELECT DATEPART(YEAR, GETDATE())

Temporal functions that return numbers

SELECT DATEDIFF('2019-09-03', '2019-06-21');
/*74*/
SELECT DATEDIFF('2019-09-03 23:59:59', '2019-06-21 00:00:01');
/*74, time has no effects*/
SELECT DATEDIFF('2019-06-21', '2019-09-03');
/*-74*/
/*SQL Server*/
SELECT DATEDIFF(DAY, '2019-06-21', '2019-09-03')
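The same day counts can be reproduced with Python's datetime module. datediff() below is a hypothetical helper that mirrors the MySQL argument order (end first, then start) and drops the time of day before subtracting.

```python
from datetime import date, datetime

def datediff(end, start) -> int:
    """Hypothetical helper mirroring MySQL's DATEDIFF(end, start) in whole days."""
    def to_date(value):
        # datetime is a subclass of date, so strip the time part explicitly.
        return value.date() if isinstance(value, datetime) else value
    return (to_date(end) - to_date(start)).days

plain = datediff(date(2019, 9, 3), date(2019, 6, 21))
with_times = datediff(datetime(2019, 9, 3, 23, 59, 59), datetime(2019, 6, 21, 0, 0, 1))
reversed_args = datediff(date(2019, 6, 21), date(2019, 9, 3))
print(plain, with_times, reversed_args)
```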

Conversion Functions

SELECT CAST('1456328' AS SIGNED INTEGER);
/*1456328*/
SELECT CAST('999ABC111' AS UNSIGNED INTEGER);
/*999 with warnings about truncation*/

Appendix for Codes

SELECT MOD(10,4);
/*2*/
SELECT MOD(20.75,4); /*Real argument*/
/*0.75*/
SELECT POW(2,8);
/*256*/
SELECT CEIL(72.445), FLOOR(72.445);
/*73 72*/
SELECT CEIL(72.000000001), FLOOR(72.999999999);
/*73 72*/
SELECT ROUND(72.49999), ROUND(72.5), ROUND(72.50001);
/*72 73 73*/
SELECT ROUND(72.0909, 1), ROUND(72.0909, 2), ROUND(72.0909, 3);
/*72.1 72.09 72.091*/
SELECT TRUNCATE(72.0909, 1), TRUNCATE(72.0909, 2), TRUNCATE(72.0909, 3);
/*72.0 72.09 72.090*/
/*SQL Server*/
SELECT ROUND(72.0909, 1, 1)

R codes:

%%
^
ceiling()
floor()
round()
trunc()
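One caveat when moving between languages: Python's built-in round() rounds ties to the nearest even number, so it does not reproduce the ROUND(72.5) result shown above. A sketch using the decimal module, where sql_round() and sql_truncate() are hypothetical helpers:

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def sql_round(x: str, places: int = 0) -> Decimal:
    """Hypothetical helper: round half away from zero, like ROUND() on exact values."""
    quantum = Decimal(1).scaleb(-places)  # 1, 0.1, 0.01, ...
    return Decimal(x).quantize(quantum, rounding=ROUND_HALF_UP)

def sql_truncate(x: str, places: int = 0) -> Decimal:
    """Hypothetical helper: drop extra digits without rounding, like TRUNCATE()."""
    quantum = Decimal(1).scaleb(-places)
    return Decimal(x).quantize(quantum, rounding=ROUND_DOWN)

builtin = round(72.5)                    # ties go to the even neighbor
half_up = sql_round("72.5")
two_places = sql_round("72.0909", 2)
truncated = sql_truncate("72.0909", 3)
print(builtin, half_up, two_places, truncated)
```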

SELECT account_id, SIGN(balance), ABS(balance)
FROM account;

R codes:

sign()
abs()

Hope I can finish this before July. Stay safe.

Learning SQL Notes #5: Querying Multiple Tables (CH. 5)

Thu, 03 Jun 2021 20:00:00 +0000

Cross Join (Cartesian Product)
Inner Joins
Joining Three or More Tables
Using Subqueries as Tables
Using the Same Table Twice
Self-Joins
Outer Joins
- Three-Way Outer Joins
Natural Joins

A join instructs the server to use a column shared by two tables as the link between them, which allows columns from both tables to be included in the query’s result set.

Cross Join (Cartesian Product)

If the query does not specify how the two tables should be joined, the database server generates the Cartesian product, which is every combination of rows from the two tables.

JOIN b
CROSS JOIN b

R codes:

merge(x = df1, y = df2, by = NULL)
library(data.table)
CJ(a, b)

Can be used to create a list of consecutive numbers.
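A small sketch of generating consecutive numbers with a cross join, assuming SQLite via Python's sqlite3 module. The two three-row derived tables (named ones and threes here for illustration) combine into the numbers 0 through 8.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Every row of `ones` is paired with every row of `threes` (3 x 3 = 9 rows),
# and adding the two values yields each number from 0 to 8 exactly once.
sql = """
    SELECT ones.num + threes.num AS n
    FROM (SELECT 0 AS num UNION ALL SELECT 1 UNION ALL SELECT 2) ones
      CROSS JOIN
         (SELECT 0 AS num UNION ALL SELECT 3 UNION ALL SELECT 6) threes
    ORDER BY n
"""
numbers = [row[0] for row in conn.execute(sql)]
print(numbers)
```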

Inner Joins

If a value exists for the address_id column in one table but not the other, then the join fails for the rows containing that value, and those rows are excluded from the result set. Inner join only returns rows that satisfy the join condition.

INNER JOIN b
ON a.id=b.id

R codes:

merge(df1, df2, by = "id")
library(plyr)
join(df1, df2,
type = "inner")

Joining Three or More Tables

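For inner joins, the order in which the tables appear in the from clause is not important; the server decides the join order.

To force the listed order in MySQL: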

SELECT STRAIGHT_JOIN COL1

Using Subqueries as Tables

See subquery notes.

Using the Same Table Twice

Films in which either one of two actors appears:

SELECT f.title
FROM film f
INNER JOIN film_actor fa
ON f.film_id = fa.film_id
INNER JOIN actor a
ON fa.actor_id = a.actor_id
WHERE (a.first_name = 'CATE' AND a.last_name = 'MCQUEEN')
OR (a.first_name = 'CUBA' AND a.last_name = 'BIRCH');

If we want films in which both actors appear, we cannot simply replace OR with AND: no single actor row can match both names, so the query would return an empty set. Instead, join the film_actor and actor tables twice:

SELECT f.title
FROM film f
/*once: */
INNER JOIN film_actor fa1
ON f.film_id = fa1.film_id
INNER JOIN actor a1
ON fa1.actor_id = a1.actor_id
/*twice: */
INNER JOIN film_actor fa2
ON f.film_id = fa2.film_id
INNER JOIN actor a2
ON fa2.actor_id = a2.actor_id
/*filter condition is applied*/
WHERE (a1.first_name = 'CATE' AND a1.last_name = 'MCQUEEN')
AND (a2.first_name = 'CUBA' AND a2.last_name = 'BIRCH');

Self-Joins

Some tables include a self-referencing foreign key, which means that it includes a column that points to the primary key within the same table.

Imagine that the film table includes the column prequel_film_id, which points to the film’s parent (e.g., the film Fiddler Lost II would use this column to point to the parent film Fiddler Lost).

Using a self-join, you can write a query that lists every film that has a prequel, along with the prequel’s title:

SELECT f.title, f_prnt.title prequel
FROM film f
INNER JOIN film f_prnt
ON f_prnt.film_id = f.prequel_film_id
WHERE f.prequel_film_id IS NOT NULL;

A possible outcome:

title	prequel
FIDDLER LOST II	FIDDLER LOST

Outer Joins

SELECT f.film_id, f.title, count(i.inventory_id) num_copies
FROM film f
LEFT OUTER JOIN inventory i
ON f.film_id = i.film_id
GROUP BY f.film_id, f.title;

A left outer join includes all rows from the table on the left side of the join (film, in this case) and then includes columns from the table on the right side of the join (inventory) if the join is successful.
The num_copies column definition was changed from count(*) to count(i.inventory_id), which will count the number of non-null values of the inventory.inventory_id column.
A left outer join B $\equiv$ B right outer join A.
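To see why count(i.inventory_id) is used instead of count(*), here is a small sketch with Python's sqlite3 module. The two films and two inventory rows are invented; the second film deliberately has no copies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
INSERT INTO film VALUES (1, 'FILM ONE'), (2, 'FILM TWO');
INSERT INTO inventory VALUES (100, 1), (101, 1);  -- no copies of film 2
""")

rows = conn.execute("""
SELECT f.film_id, f.title,
       count(*) AS row_count,               -- counts the NULL-padded row too
       count(i.inventory_id) AS num_copies  -- counts only non-null ids
FROM film f
LEFT OUTER JOIN inventory i ON f.film_id = i.film_id
GROUP BY f.film_id, f.title
ORDER BY f.film_id
""").fetchall()

for row in rows:
    print(row)
```

For the film without inventory, count(*) reports 1 because the outer join still produces one row, while count(i.inventory_id) reports 0.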

Three-Way Outer Joins

SELECT f.film_id, f.title, i.inventory_id, r.rental_date
FROM film f LEFT OUTER JOIN inventory i
ON f.film_id = i.film_id
LEFT OUTER JOIN rental r
ON i.inventory_id = r.inventory_id
WHERE f.film_id BETWEEN 13 AND 15;

Natural Joins

A natural join lets the database server determine the join conditions from the columns that have identical names in both tables.

SELECT c.first_name, c.last_name, date(r.rental_date)
FROM customer c
NATURAL JOIN rental r;

Empty set (0.04 sec)

Because you specified a natural join, the server inspected the table definitions and added the join condition r.customer_id = c.customer_id to join the two tables. This would have worked fine, but in the Sakila schema all of the tables include the column last_update to show when each row was last modified, so the server is also adding the join condition r.last_update = c.last_update, which causes the query to return no data.

The only way around this issue is to use a subquery to restrict the columns for at least one of the tables:

SELECT cust.first_name, cust.last_name, date(r.rental_date)
FROM
(SELECT customer_id, first_name, last_name
FROM customer
) cust
NATURAL JOIN rental r;
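The last_update pitfall can be reproduced with a small sketch in Python's sqlite3 module (SQLite also supports NATURAL JOIN). The two cut-down tables and their single rows are assumptions made for the example, not the real Sakila definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER, first_name TEXT, last_update TEXT);
CREATE TABLE rental (rental_id INTEGER, customer_id INTEGER,
                     rental_date TEXT, last_update TEXT);
INSERT INTO customer VALUES (1, 'MARY', '2006-02-15');
INSERT INTO rental VALUES (1000, 1, '2005-05-25', '2006-02-16');
""")

# Both customer_id and last_update are shared column names, so the natural
# join requires both to match, and the last_update values differ.
natural_rows = conn.execute("""
SELECT c.first_name, r.rental_date
FROM customer c
NATURAL JOIN rental r
""").fetchall()

# Restricting the left side to customer_id leaves one shared column.
restricted_rows = conn.execute("""
SELECT cust.first_name, r.rental_date
FROM (SELECT customer_id, first_name FROM customer) cust
NATURAL JOIN rental r
""").fetchall()

print(natural_rows)
print(restricted_rows)
```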

Learning SQL Notes #4.5: Regular Expression

Wed, 02 Jun 2021 20:00:00 +0000

Adapted from https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference

Character Escapes

The backslash character (\) in a regular expression indicates that the character that follows it either is a special character (as shown in the following table), or should be interpreted literally. For more information, see Character Escapes.

Escaped character	Description	Pattern	Matches
`\a`	Matches a bell character, \u0007.	`\a`	`"\u0007"` in `"Error!" + '\u0007'`
`\b`	In a character class, matches a backspace, \u0008.	`[\b]{3,}`	`"\b\b\b\b"` in `"\b\b\b\b"`
`\t`	Matches a tab, \u0009.	`(\w+)\t`	`"item1\t"`, `"item2\t"` in `"item1\titem2\t"`
`\r`	Matches a carriage return, \u000D. (`\r` is not equivalent to the newline character, `\n`.)	`\r\n(\w+)`	`"\r\nThese"` in `"\r\nThese are\ntwo lines."`
`\v`	Matches a vertical tab, \u000B.	`[\v]{2,}`	`"\v\v\v"` in `"\v\v\v"`
`\f`	Matches a form feed, \u000C.	`[\f]{2,}`	`"\f\f\f"` in `"\f\f\f"`
`\n`	Matches a new line, \u000A.	`\r\n(\w+)`	`"\r\nThese"` in `"\r\nThese are\ntwo lines."`
`\e`	Matches an escape, \u001B.	`\e`	`"\x001B"` in `"\x001B"`
`\` nnn	Uses octal representation to specify a character (nnn consists of two or three digits).	`\w\040\w`	`"a b"`, `"c d"` in `"a bc d"`
`\x` nn	Uses hexadecimal representation to specify a character (nn consists of exactly two digits).	`\w\x20\w`	`"a b"`, `"c d"` in `"a bc d"`
`\c` X `\c` x	Matches the ASCII control character that is specified by X or x, where X or x is the letter of the control character.	`\cC`	`"\x0003"` in `"\x0003"` (Ctrl-C)
`\u` nnnn	Matches a Unicode character by using hexadecimal representation (exactly four digits, as represented by nnnn).	`\w\u0020\w`	`"a b"`, `"c d"` in `"a bc d"`
`\`	When followed by a character that is not recognized as an escaped character in this and other tables in this topic, matches that character. For example, `\*` is the same as `\x2A`, and `\.` is the same as `\x2E`. This allows the regular expression engine to disambiguate language elements (such as `*` or `?`) and character literals (represented by `\*` or `\?`).	`\d+[\+-x\*]\d+`	`"2+2"` and `"3*9"` in `"(2+2) 3*9"`

Character Classes

A character class matches any one of a set of characters. Character classes include the language elements listed in the following table. For more information, see Character Classes.

Character class	Description	Pattern	Matches
`[` character_group `]`	Matches any single character in character_group. By default, the match is case-sensitive.	`[ae]`	`"a"` in `"gray"` `"a"`, `"e"` in `"lane"`
`[^` character_group `]`	Negation: Matches any single character that is not in character_group. By default, characters in character_group are case-sensitive.	`[^aei]`	`"r"`, `"g"`, `"n"` in `"reign"`
`[` first `-` last `]`	Character range: Matches any single character in the range from first to last.	`[A-Z]`	`"A"`, `"B"` in `"AB123"`
`.`	Wildcard: Matches any single character except \n. To match a literal period character (. or `\u002E`), you must precede it with the escape character (`\.`).	`a.e`	`"ave"` in `"nave"` `"ate"` in `"water"`
`\p{` name `}`	Matches any single character in the Unicode general category or named block specified by name.	`\p{Lu}` `\p{IsCyrillic}`	`"C"`, `"L"` in `"City Lights"` `"Д"`, `"Ж"` in `"ДЖem"`
`\P{` name `}`	Matches any single character that is not in the Unicode general category or named block specified by name.	`\P{Lu}` `\P{IsCyrillic}`	`"i"`, `"t"`, `"y"` in `"City"` `"e"`, `"m"` in `"ДЖem"`
`\w`	Matches any word character.	`\w`	`"I"`, `"D"`, `"A"`, `"1"`, `"3"` in `"ID A1.3"`
`\W`	Matches any non-word character.	`\W`	`" "`, `"."` in `"ID A1.3"`
`\s`	Matches any white-space character.	`\w\s`	`"D "` in `"ID A1.3"`
`\S`	Matches any non-white-space character.	`\s\S`	`" _"` in `"int __ctr"`
`\d`	Matches any decimal digit.	`\d`	`"4"` in `"4 = IV"`
`\D`	Matches any character other than a decimal digit.	`\D`	`" "`, `"="`, `" "`, `"I"`, `"V"` in `"4 = IV"`

Anchors

Anchors, or atomic zero-width assertions, cause a match to succeed or fail depending on the current position in the string, but they do not cause the engine to advance through the string or consume characters. The metacharacters listed in the following table are anchors. For more information, see Anchors.

Assertion	Description	Pattern	Matches
`^`	By default, the match must start at the beginning of the string; in multiline mode, it must start at the beginning of the line.	`^\d{3}`	`"901"` in `"901-333-"`
`$`	By default, the match must occur at the end of the string or before `\n` at the end of the string; in multiline mode, it must occur before the end of the line or before `\n` at the end of the line.	`-\d{3}$`	`"-333"` in `"-901-333"`
`\A`	The match must occur at the start of the string.	`\A\d{3}`	`"901"` in `"901-333-"`
`\Z`	The match must occur at the end of the string or before `\n` at the end of the string.	`-\d{3}\Z`	`"-333"` in `"-901-333"`
`\z`	The match must occur at the end of the string.	`-\d{3}\z`	`"-333"` in `"-901-333"`
`\G`	The match must occur at the point where the previous match ended.	`\G\(\d\)`	`"(1)"`, `"(3)"`, `"(5)"` in `"(1)(3)(5)[7](9)"`
`\b`	The match must occur on a boundary between a `\w` (alphanumeric) and a `\W` (nonalphanumeric) character.	`\b\w+\s\w+\b`	`"them theme"`, `"them them"` in `"them theme them them"`
`\B`	The match must not occur on a `\b` boundary.	`\Bend\w*\b`	`"ends"`, `"ender"` in `"end sends endure lender"`

Grouping Constructs

Grouping constructs delineate subexpressions of a regular expression and typically capture substrings of an input string. Grouping constructs include the language elements listed in the following table. For more information, see Grouping Constructs.

Grouping construct	Description	Pattern	Matches
`(` subexpression `)`	Captures the matched subexpression and assigns it a one-based ordinal number.	`(\w)\1`	`"ee"` in `"deep"`
`(?<` name `>` subexpression `)` or `(?'` name `'` subexpression `)`	Captures the matched subexpression into a named group.	`(?<double>\w)\k<double>`	`"ee"` in `"deep"`
`(?<` name1 `-` name2 `>` subexpression `)` or `(?'` name1 `-` name2 `'` subexpression `)`	Defines a balancing group definition. For more information, see the "Balancing Group Definition" section in Grouping Constructs.	`(((?'Open'\()[^\(\)]*)+((?'Close-Open'\))[^\(\)]*)+)*(?(Open)(?!))$`	`"((1-3)*(3-1))"` in `"3+2^((1-3)*(3-1))"`
`(?:` subexpression `)`	Defines a noncapturing group.	`Write(?:Line)?`	`"WriteLine"` in `"Console.WriteLine()"` `"Write"` in `"Console.Write(value)"`
`(?imnsx-imnsx:` subexpression `)`	Applies or disables the specified options within subexpression. For more information, see Regular Expression Options.	`A\d{2}(?i:\w+)\b`	`"A12xl"`, `"A12XL"` in `"A12xl A12XL a12xl"`
`(?=` subexpression `)`	Zero-width positive lookahead assertion.	`\b\w+\b(?=.+and.+)`	`"cats"`, `"dogs"` in `"cats, dogs and some mice."`
`(?!` subexpression `)`	Zero-width negative lookahead assertion.	`\b\w+\b(?!.+and.+)`	`"and"`, `"some"`, `"mice"` in `"cats, dogs and some mice."`
`(?<=` subexpression `)`	Zero-width positive lookbehind assertion.	`\b\w+\b(?<=.+and.+)` / `\b\w+\b(?<=.+and.*)`	`"some"`, `"mice"` in `"cats, dogs and some mice."` / `"and"`, `"some"`, `"mice"` in `"cats, dogs and some mice."`
`(?<!` subexpression `)`	Zero-width negative lookbehind assertion.	`\b\w+\b(?<!.+and.+)` / `\b\w+\b(?<!.+and.*)`	`"cats"`, `"dogs"`, `"and"` in `"cats, dogs and some mice."` / `"cats"`, `"dogs"` in `"cats, dogs and some mice."`
`(?>` subexpression `)`	Atomic group.	`(?>a\|ab)c`	`"ac"` in `"ac"`; nothing in `"abc"`

Lookarounds at a glance

When the regular expression engine hits a lookaround expression, it takes a substring reaching from the current position to the start (lookbehind) or end (lookahead) of the original string, and then runs Regex.IsMatch on that substring using the lookaround pattern. Success of this subexpression's result is then determined by whether it's a positive or negative assertion.

Lookaround	Name	Function
`(?=check)`	Positive Lookahead	Asserts that what immediately follows the current position in the string is "check"
`(?<=check)`	Positive Lookbehind	Asserts that what immediately precedes the current position in the string is "check"
`(?!check)`	Negative Lookahead	Asserts that what immediately follows the current position in the string is not "check"
`(?<!check)`	Negative Lookbehind	Asserts that what immediately precedes the current position in the string is not "check"
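The four lookarounds can be tried out with a short sketch. It uses Python's re module rather than the .NET engine this reference describes; the syntax of these four constructs is the same, but Python only accepts fixed-width lookbehind patterns. The sample string is made up.

```python
import re

text = "price: 100 USD, cost: 200 EUR"

# Positive lookahead: digits immediately followed by " USD".
followed_by_usd = re.findall(r"\d+(?= USD)", text)

# Negative lookahead: whole numbers NOT followed by " USD". The \b anchors
# stop the engine from backtracking to a partial number such as "10".
not_followed_by_usd = re.findall(r"\b\d+\b(?! USD)", text)

# Positive lookbehind: digits immediately preceded by "cost: ".
preceded_by_cost = re.findall(r"(?<=cost: )\d+", text)

# Negative lookbehind: whole numbers NOT preceded by "cost: ".
not_preceded_by_cost = re.findall(r"(?<!cost: )\b\d+", text)

print(followed_by_usd, not_followed_by_usd)
print(preceded_by_cost, not_preceded_by_cost)
```

In every case the lookaround text itself ("USD", "cost: ") is checked but not included in the match.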

Once an atomic group has matched, the engine does not backtrack into it, even when the remainder of the pattern then fails because of that match. This can significantly improve performance when quantifiers occur within the atomic group or the remainder of the pattern.

Quantifiers

A quantifier specifies how many instances of the previous element (which can be a character, a group, or a character class) must be present in the input string for a match to occur. Quantifiers include the language elements listed in the following table. For more information, see Quantifiers.

Quantifier	Description	Pattern	Matches
`*`	Matches the previous element zero or more times.	`\d*\.\d`	`".0"`, `"19.9"`, `"219.9"`
`+`	Matches the previous element one or more times.	`"be+"`	`"bee"` in `"been"`, `"be"` in `"bent"`
`?`	Matches the previous element zero or one time.	`"rai?n"`	`"ran"`, `"rain"`
`{` n `}`	Matches the previous element exactly n times.	`",\d{3}"`	`",043"` in `"1,043.6"`, `",876"`, `",543"`, and `",210"` in `"9,876,543,210"`
`{` n `,}`	Matches the previous element at least n times.	`"\d{2,}"`	`"166"`, `"29"`, `"1930"`
`{` n `,` m `}`	Matches the previous element at least n times, but no more than m times.	`"\d{3,5}"`	`"166"`, `"17668"` `"19302"` in `"193024"`
`*?`	Matches the previous element zero or more times, but as few times as possible.	`\d*?\.\d`	`".0"`, `"19.9"`, `"219.9"`
`+?`	Matches the previous element one or more times, but as few times as possible.	`"be+?"`	`"be"` in `"been"`, `"be"` in `"bent"`
`??`	Matches the previous element zero or one time, but as few times as possible.	`"rai??n"`	`"ran"`, `"rain"`
`{` n `}?`	Matches the preceding element exactly n times.	`",\d{3}?"`	`",043"` in `"1,043.6"`, `",876"`, `",543"`, and `",210"` in `"9,876,543,210"`
`{` n `,}?`	Matches the previous element at least n times, but as few times as possible.	`"\d{2,}?"`	`"166"`, `"29"`, `"1930"`
`{` n `,` m `}?`	Matches the previous element between n and m times, but as few times as possible.	`"\d{3,5}?"`	`"166"`, `"17668"` `"193"`, `"024"` in `"193024"`
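The difference between greedy and lazy quantifiers is easiest to see side by side. This sketch uses Python's re module as a stand-in for the .NET engine; the two behave the same way for these patterns, and the inputs are taken from the table above.

```python
import re

# Greedy: each quantifier takes as many characters as it can.
greedy_plus = re.findall(r"be+", "been bent")
greedy_range = re.findall(r"\d{3,5}", "193024")

# Lazy (quantifier followed by ?): takes as few characters as it can.
lazy_plus = re.findall(r"be+?", "been bent")
lazy_range = re.findall(r"\d{3,5}?", "193024")

print(greedy_plus, lazy_plus)
print(greedy_range, lazy_range)
```

With the greedy range the leftover "4" is too short to form a second match, while the lazy version stops at three digits and therefore finds two matches.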

Backreference Constructs

A backreference allows a previously matched subexpression to be identified subsequently in the same regular expression. The following table lists the backreference constructs supported by regular expressions in .NET. For more information, see Backreference Constructs.

Backreference construct	Description	Pattern	Matches
`\` number	Backreference. Matches the value of a numbered subexpression.	`(\w)\1`	`"ee"` in `"seek"`
`\k<` name `>`	Named backreference. Matches the value of a named expression.	`(?<char>\w)\k<char>`	`"ee"` in `"seek"`

Alternation Constructs

Alternation constructs modify a regular expression to enable either/or matching. These constructs include the language elements listed in the following table. For more information, see Alternation Constructs.

Alternation construct	Description	Pattern	Matches
`\|`	Matches any one element separated by the vertical bar (`\|`) character.	`th(e\|is\|at)`	`"the"`, `"this"` in `"this is the day."`
`(?(` expression `)` yes `\|` no `)`	Matches yes if the regular expression pattern designated by expression matches; otherwise, matches the optional no part. expression is interpreted as a zero-width assertion.	`(?(A)A\d{2}\b\|\b\d{3}\b)`	`"A10"`, `"910"` in `"A10 C103 910"`
`(?(` name `)` yes `\|` no `)`	Matches yes if name, a named or numbered capturing group, has a match; otherwise, matches the optional no.	`(?<quoted>")?(?(quoted).+?"\|\S+\s)`	`"Dogs.jpg "`, `"\"Yiska playing.jpg\""` in `"Dogs.jpg \"Yiska playing.jpg\""`

Substitutions

Substitutions are regular expression language elements that are supported in replacement patterns. For more information, see Substitutions.

Character	Description	Pattern	Replacement pattern	Input string	Result string
`$` number	Substitutes the substring matched by group number.	`\b(\w+)(\s)(\w+)\b`	`$3$2$1`	`"one two"`	`"two one"`
`${` name `}`	Substitutes the substring matched by the named group name.	`\b(?<word1>\w+)(\s)(?<word2>\w+)\b`	`${word2} ${word1}`	`"one two"`	`"two one"`
`$$`	Substitutes a literal "$".	`\b(\d+)\s?USD`	`$$$1`	`"103 USD"`	`"$103"`
`$&`	Substitutes a copy of the whole match.	`\$?\d*\.?\d+`	`$&`	`"$1.30"`	`"$1.30"`
`` $` ``	Substitutes all the text of the input string before the match.	`B+`	`` $` ``	`"AABBCC"`	`"AAAACC"`
`$'`	Substitutes all the text of the input string after the match.	`B+`	`$'`	`"AABBCC"`	`"AACCCC"`
`$+`	Substitutes the last group that was captured.	`B+(C+)`	`$+`	`"AABBCCDD"`	`"AACCDD"`
`$_`	Substitutes the entire input string.	`B+`	`$_`	`"AABBCC"`	`"AAAABBCCCC"`
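For comparison, here is how a few of these replacements look in Python's re module. This is a sketch of the equivalent idea, not of the .NET API: Python writes group references as \1 or \g<name> instead of $1 or ${name}, and a plain $ in a Python replacement string is a literal dollar sign.

```python
import re

# Numbered groups: swap two words (.NET replacement: $3$2$1).
swapped = re.sub(r"\b(\w+)(\s)(\w+)\b", r"\3\2\1", "one two")

# Named groups (.NET replacement: ${word2} ${word1}).
named = re.sub(
    r"\b(?P<word1>\w+)(\s)(?P<word2>\w+)\b",
    r"\g<word2> \g<word1>",
    "one two",
)

# Literal dollar sign followed by group 1 (.NET replacement: $$$1).
dollars = re.sub(r"\b(\d+)\s?USD", r"$\1", "103 USD")

# Whole match, written \g<0> in Python (.NET: $&), inserted twice here.
doubled = re.sub(r"B+", r"\g<0>\g<0>", "AABBCC")

print(swapped, named, dollars, doubled)
```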

Regular Expression Options

You can specify options that control how the regular expression engine interprets a regular expression pattern. Many of these options can be specified either inline (in the regular expression pattern) or as one or more RegexOptions constants. This quick reference lists only inline options. For more information about inline and RegexOptions options, see the article Regular Expression Options.

You can specify an inline option in two ways:

By using the miscellaneous construct (?imnsx-imnsx), where a minus sign (-) before an option or set of options turns those options off. For example, (?i-mn) turns case-insensitive matching (i) on, turns multiline mode (m) off, and turns unnamed group captures (n) off. The option applies to the regular expression pattern from the point at which the option is defined, and is effective either to the end of the pattern or to the point where another construct reverses the option.
By using the grouping construct (?imnsx-imnsx:subexpression), which defines options for the specified group only.

The .NET regular expression engine supports the following inline options:

Option	Description	Pattern	Matches
`i`	Use case-insensitive matching.	`\b(?i)a(?-i)a\w+\b`	`"aardvark"`, `"aaaAuto"` in `"aardvark AAAuto aaaAuto Adam breakfast"`
`m`	Use multiline mode. `^` and `$` match the beginning and end of a line, instead of the beginning and end of a string.	For an example, see the "Multiline Mode" section in Regular Expression Options.
`n`	Do not capture unnamed groups.	For an example, see the "Explicit Captures Only" section in Regular Expression Options.
`s`	Use single-line mode.	For an example, see the "Single-line Mode" section in Regular Expression Options.
`x`	Ignore unescaped white space in the regular expression pattern.	`\b(?x) \d+ \s \w+`	`"1 aardvark"`, `"2 cats"` in `"1 aardvark 2 cats IV centurions"`

Miscellaneous Constructs

Miscellaneous constructs either modify a regular expression pattern or provide information about it. The following table lists the miscellaneous constructs supported by .NET. For more information, see Miscellaneous Constructs.

Construct	Definition	Example
`(?imnsx-imnsx)`	Sets or disables options such as case insensitivity in the middle of a pattern. For more information, see Regular Expression Options.	`\bA(?i)b\w+\b` matches `"ABA"`, `"Able"` in `"ABA Able Act"`
`(?#` comment `)`	Inline comment. The comment ends at the first closing parenthesis.	`\bA(?#Matches words starting with A)\w+\b`
`#` [to end of line]	X-mode comment. The comment starts at an unescaped `#` and continues to the end of the line.	`(?x)\bA\w+\b#Matches words starting with A`

Learning SQL Notes #4: Query Primer (CH. 7)

Thu, 27 May 2021 20:00:00 +0000

Working with Sets

Working with Sets

Set Theory in Practice

Both data sets must have the same number of columns.
The data types of each column across the two data sets must be the same (or the server must be able to convert one to the other).

Set Operators

The UNION Operator

The union and union all operators allow you to combine multiple data sets. The difference between the two is that union sorts the combined set and removes duplicates, whereas union all does not.

https://www.sqlshack.com/sql-union-vs-union-all-in-sql-server/

SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
UNION ALL
SELECT a.first_name, a.last_name
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%';

first_name	last_name
JENNIFER	DAVIS
JENNIFER	DAVIS
JUDY	DEAN
JODIE	DEGENERES
JULIANNE	DENCH

R codes:

library(dplyr)
union_all(df1,df2)

In contrast, UNION removes the duplicate JENNIFER DAVIS row:

SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
UNION
SELECT a.first_name, a.last_name
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%';

first_name	last_name
JENNIFER	DAVIS
JUDY	DEAN
JODIE	DEGENERES
JULIANNE	DENCH

R codes:

library(dplyr)
union(df1,df2)

The INTERSECT Operator (Not for MySQL before 8.0.31!)

SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
INTERSECT
SELECT a.first_name, a.last_name
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%';

first_name	last_name
JENNIFER	DAVIS

R codes:

library(dplyr)
intersect(df1,df2)

The EXCEPT Operator (Not for MySQL before 8.0.31!)

SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
EXCEPT
SELECT a.first_name, a.last_name
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%';

first_name	last_name
JUDY	DEAN
JODIE	DEGENERES
JULIANNE	DENCH

R codes:

library(dplyr)
setdiff(df1,df2)
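The four set operators can be compared on one small data set. This sketch runs on SQLite through Python's sqlite3 module, which supports INTERSECT and EXCEPT; the two tables are cut down to the name columns and the rows are chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (first_name TEXT, last_name TEXT);
CREATE TABLE actor (first_name TEXT, last_name TEXT);
INSERT INTO customer VALUES ('JENNIFER', 'DAVIS'), ('JUDY', 'DEAN');
INSERT INTO actor VALUES ('JENNIFER', 'DAVIS'), ('JODIE', 'DEGENERES');
""")


def compound(operator):
    """Combine the same two SELECTs with the given set operator."""
    sql = f"""
    SELECT first_name, last_name FROM customer
    {operator}
    SELECT first_name, last_name FROM actor
    ORDER BY last_name, first_name
    """
    return conn.execute(sql).fetchall()


results = {op: compound(op) for op in ("UNION ALL", "UNION", "INTERSECT", "EXCEPT")}
for op, rows in results.items():
    print(op, rows)
```

JENNIFER DAVIS is the only row present in both tables, so she appears twice under UNION ALL, once under UNION, alone under INTERSECT, and not at all under EXCEPT.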

Set A

actor_id
10
11
12
10
10

Set B

actor_id
10
10

The operation A except B yields the following:

actor_id
11
12

The operation A except all B yields the following:

actor_id
10
11
12

The difference between the two operations is that except removes all occurrences of duplicate data from set A, whereas except all removes only one occurrence of duplicate data from set A for every occurrence in set B.
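The two behaviours can be imitated in plain Python, which may help to check the results above by hand. This is only an analogy using sets and collections.Counter, not how a database server implements the operators.

```python
from collections import Counter

set_a = [10, 11, 12, 10, 10]
set_b = [10, 10]

# EXCEPT: work on distinct values, so every 10 disappears from A.
except_rows = sorted(set(set_a) - set(set_b))

# EXCEPT ALL: multiset difference, one occurrence removed from A
# for every occurrence in B (three 10s minus two 10s leaves one).
except_all_rows = sorted((Counter(set_a) - Counter(set_b)).elements())

print(except_rows)
print(except_all_rows)
```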

Set Operation Rules

The following sections outline some rules that you must follow when working with compound queries.

Sorting Compound Query Results

Sort

SELECT a.first_name fname, a.last_name lname /*aliases can be helpful*/
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%'
UNION ALL
SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
ORDER BY lname, fname;

Order

In general, compound queries containing three or more queries are evaluated in order from top to bottom, with the following caveats:

The ANSI SQL specification calls for the intersect operator to have precedence over the other set operators.
You may dictate the order in which queries are combined by enclosing multiple queries in parentheses.

NOT FOR MySQL:

You can also wrap adjoining queries in parentheses to override the default top-to-bottom processing of compound queries.

SELECT a.first_name, a.last_name
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%'
UNION
(SELECT a.first_name, a.last_name
FROM actor a
WHERE a.first_name LIKE 'M%' AND a.last_name LIKE 'T%'
UNION ALL
SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
);

Learning SQL Notes #3: Query Primer (CH. 3)

Wed, 26 May 2021 20:00:00 +0000

Query Mechanics
Query Clauses
Filtering
- WHERE

Complete sometime this summer:

Finish Join Notes;
Finish GROUP BY Notes;

Query Mechanics

Do you have permission to execute the statement?
Do you have permission to access the desired data?
Is your statement syntax correct?

Query Clauses

Clause name	Purpose
select	Determines which columns to include in the query’s result set
from	Identifies the tables from which to retrieve data and how the tables should be joined
where	Filters out unwanted data
group by	Used to group rows together by common column values
having	Filters out unwanted groups
order by	Sorts the rows of the final result set by one or more columns

SELECT

Literals, such as numbers or strings
Expressions, such as transaction.amount * -1
Built-in function calls, such as ROUND(transaction.amount, 2)
User-defined function calls

SELECT version(), user(), database();

Results:

version()	user()	database()
8.0.15	root@localhost	sakila

SELECT row1 AS r1 FROM table1; /*Column Aliases*/
SELECT DISTINCT row1 FROM table1; /*Removing Duplicates-should know beforehand whether duplicates are possible*/

R codes:

unique()

FROM

Permanent tables (i.e., created using the create table statement)

Derived tables (i.e., rows returned by a subquery and held in memory)

SELECT *
FROM
(SELECT first_name, last_name, email
FROM customer
WHERE first_name = 'JESSIE'
) AS cust;

Temporary tables (i.e., volatile data held in memory): any data inserted into a temporary table will disappear at some point
```
CREATE TEMPORARY TABLE actors_j
(actor_id smallint(5),
first_name varchar(45),
last_name varchar(45)
);
```
Virtual tables (i.e., created using the create view statement): When you issue a query against a view, your query is merged with the view definition to create a final query to be executed.
```
CREATE VIEW cust_vw AS
SELECT customer_id, first_name, last_name, active
FROM customer;
```

Table Links

See JOIN in the next note.

Table Aliases

FROM customer AS c;

GROUP BY and HAVING (CH. 8)

[] Haven’t done

ORDER BY

ORDER BY col1, col2, etc;

R codes:

df[order(df$col1),]
require(tidyverse)
df %>%
arrange(col1)

ORDER BY col1;
ORDER BY col1 desc;

R codes:

df[order(-df$col1),]
require(tidyverse)
df %>%
arrange(desc(col1))

SELECT col1, col2, col3
FROM table1
ORDER BY 3; /*equivalent to ORDER BY col3*/

Filtering

WHERE

(...) AND (...)
(...) OR (...)

See operators and expressions for details.

OR operator

Intermediate result	Final result
WHERE true OR true	true
WHERE true OR false	true
WHERE false OR true	true
WHERE false OR false	false

AND operator

Intermediate result	Final result
WHERE (true OR true) AND true	true
WHERE (true OR false) AND true	true
WHERE (false OR true) AND true	true
WHERE (false OR false) AND true	false
WHERE (true OR true) AND false	false
WHERE (true OR false) AND false	false
WHERE (false OR true) AND false	false
WHERE (false OR false) AND false	false

NOT operator

Intermediate result	Final result
WHERE NOT (true OR true) AND true	false
WHERE NOT (true OR false) AND true	false
WHERE NOT (false OR true) AND true	false
WHERE NOT (false OR false) AND true	true
WHERE NOT (true OR true) AND false	false
WHERE NOT (true OR false) AND false	false
WHERE NOT (false OR true) AND false	false
WHERE NOT (false OR false) AND false	false

Expressions

An expression can be any of the following:

A number
A column in a table or view
A string literal, such as 'Maple Street'
A built-in function, such as concat('Learning', ' ', 'SQL')
A subquery
A list of expressions, such as ('Boston', 'New York', 'Chicago')

Operators:

Comparison operators, such as =, !=, <, <=, >, >=, <>, like, in, between, is null, exists
Arithmetic operators, such as +, -, *, /, DIV (integer division), and % or MOD (modulus)

Note:

= can be used with dates, strings, and numbers;
BETWEEN ... AND can be used with dates, strings, and numbers;
BETWEEN ... AND is inclusive of both endpoints;
col1 (NOT) IN ('A','B'), or (NOT) IN a subquery;
built-in functions also work: left(name, 1) IN ('A','B');
wildcards/regular expressions:
- Strings beginning/ending with a certain character
- Strings beginning/ending with a substring
- Strings containing a certain character anywhere within the string
- Strings containing a substring anywhere within the string
- Strings with a specific format, regardless of individual characters

Wildcard character	Matches
_	Exactly one character
%	Any number of characters (including 0)
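The two wildcards can be tried out quickly with Python's sqlite3 module. The person table and its four names are invented for the example; note that SQLite's LIKE is case-insensitive for ASCII letters by default, which does not matter here because everything is upper case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (name TEXT);
INSERT INTO person VALUES ('JENNIFER'), ('JUDY'), ('ANDY'), ('JO');
""")


def like(pattern):
    """Return the names matching a LIKE pattern, in alphabetical order."""
    rows = conn.execute(
        "SELECT name FROM person WHERE name LIKE ? ORDER BY name", (pattern,)
    )
    return [name for (name,) in rows]


starts_with_j = like("J%")     # beginning with a certain character
ends_with_y = like("%Y")       # ending with a certain character
second_letter_u = like("_U%")  # one arbitrary character, then U
exactly_two = like("J_")       # J followed by exactly one character

print(starts_with_j, ends_with_y, second_letter_u, exactly_two)
```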

NULL

Null is used for various cases where a value cannot be supplied, such as:

Not applicable Such as the employee ID column for a transaction that took place at an ATM machine
Value not yet known Such as when the federal ID is not known at the time a customer row is created
Value undefined Such as when an account is created for a product that has not yet been added to the database

Note:

An expression can be null, but it can never equal null. Test for null with IS NULL / IS NOT NULL rather than = NULL.
Two nulls are never equal to each other.
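Both rules can be observed directly. This sketch uses SQLite through Python's sqlite3 module, where SQL NULL comes back as Python's None; the one-column table t is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULL = NULL is itself NULL (unknown), not true; IS NULL gives a real answer.
comparison = conn.execute("SELECT NULL = NULL, NULL IS NULL").fetchone()

conn.executescript("""
CREATE TABLE t (x INTEGER);
INSERT INTO t VALUES (1), (NULL);
""")

# "= NULL" never matches a row; "IS NULL" finds the null row.
equals_null = conn.execute("SELECT count(*) FROM t WHERE x = NULL").fetchone()[0]
is_null = conn.execute("SELECT count(*) FROM t WHERE x IS NULL").fetchone()[0]

print(comparison, equals_null, is_null)
```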

Learning SQL Notes #2: Data Types

Wed, 26 May 2021 01:00:00 +0000

Character Data
Numeric Data
Temporal Data
BONUS: Find Current Time

Character Data

char(20) /* fixed-length */
varchar(20) /* variable-length */

R has no built-in way to constrain the length of a character column, but stringr::str_trunc() can shorten strings to a maximum width.

Note:

If the data being loaded into a text column exceeds the maximum size for that type, the data will be truncated;
Trailing spaces will not be removed when data is loaded into the column;
When using text columns for sorting or grouping, only the first 1,024 bytes are used, although this limit may be increased if necessary.

CREATE DATABASE european_sales CHARACTER SET latin1;

Numeric Data

Boolean: 0 False, 1 True.

System-generated primary keys: 1 to $\infty$, integers;

mediumint −8,388,608 to 8,388,607
mediumint unsigned 0 to 16,777,215
int −2,147,483,648 to 2,147,483,647
int unsigned 0 to 4,294,967,295
bigint −2^63 to 2^63 - 1
bigint unsigned 0 to 2^64 - 1

Item number: positive integers in a range;

tinyint −128 to 127
tinyint unsigned 0 to 255
smallint −32,768 to 32,767
smallint unsigned 0 to 65,535

unsigned columns store only non-negative values;

High-precision scientific or manufacturing data;
```
float( p , s ) −3.402823466E+38 to −1.175494351E-38 and 1.175494351E-38 to 3.402823466E+38
double( p , s ) −1.7976931348623157E+308 to −2.2250738585072014E-308
and 2.2250738585072014E-308 to 1.7976931348623157E+308
```
p and s are optional parameters: the precision p is the total number of allowable digits on both sides of the decimal point, and the scale s is the number of allowable digits to the right of the decimal point, so at most p - s digits may appear to the left.

Temporal Data

The future date that a particular event is expected to happen, such as shipping a customer’s order
```
date YYYY-MM-DD 1000-01-01 to 9999-12-31
```

The date that a customer’s order was shipped

datetime YYYY-MM-DD HH:MI:SS 1000-01-01 00:00:00.000000 to 9999-12-31 23:59:59.999999

The date and time that a user modified a particular row in a table

timestamp YYYY-MM-DD HH:MI:SS 1970-01-01 00:00:00.000000 to 2038-01-18 22:14:07.999999

An employee’s birth date

date YYYY-MM-DD 1000-01-01 to 9999-12-31

The year corresponding to a row in a yearly_sales fact table in a data warehouse
```
year YYYY 1901-2155
```
The elapsed time needed to complete a wiring harness on an automobile assembly line
```
time HHH:MI:SS −838:59:59.000000 to 838:59:59.000000
```

BONUS: Find Current Time

To find the current date/time:

SELECT now();
/*2019-04-04 20:44:26 Timezone not included*/

R codes:

Sys.time()
# "2021-05-25 10:58:06 EDT", Timezone included

In Oracle, add FROM dual; to the query: dual is a one-row dummy table (think about a dummy variable!).

Learning SQL Notes #1

Tue, 25 May 2021 18:00:00 +0000

Introduction to Databases
Table Creation (CH. 2)

Introduction to Databases

SQL was initially created to be the language for generating, manipulating, and retrieving data from relational databases.
A database is a set of related information.
Database systems are computerized data storage and retrieval mechanisms.
Nonrelational Database Systems:
- In a hierarchical database system, for example, data is represented as one or more tree structures. The hierarchical database system provides tools for locating a particular customer’s tree and then traversing the tree to find the desired accounts and/or transactions. Each node in the tree may have either zero or one parent and zero, one, or many children.
- Network database system exposes sets of records and sets of links that define relationships between different records.
Data can be represented as sets of tables. Rather than using pointers to navigate between related entities, redundant data is used to link records in different tables: relational model.

More about Relational Databases

The numbers of columns and rows are limited in practice by physical constraints and maintainability;
Primary key includes information that uniquely identifies a row in that table;
1. If it consists of more than one column, it is a compound key;
2. If it is built from real-world data, say, first name, it is a natural key;
3. If it is an artificial id, it is a surrogate key;
4. It should NEVER be allowed to change!
5. Possible error:
```
ERROR 1062 (23000): Duplicate entry '1' for key 'PRIMARY'
```
Besides the primary key, a table may contain columns that identify rows in other tables: these are foreign keys, and they connect the entities in different tables;
Make sure that there is only one place in the database that holds, say, the customer’s name; otherwise, the data might be changed in one place but not another, causing the data in the database to be unreliable. The process of refining a database design to ensure that each independent piece of information is in only one place (except for foreign keys) is known as normalization. (Think about the concept of Tidy Data in R!)
Two-column primary key is also possible depending on the context (CH.2);

A foreign key constraint limits the id to values that exist in another table (CH.2); Possible error:

ERROR 1452 (23000): Cannot add or update a child row: a foreign key constraint fails ('sakila'.'favorite_food', CONSTRAINT 'fk_fav_food_person_id' FOREIGN KEY
('person_id') REFERENCES 'person' ('person_id'))
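A comparable failure can be provoked in SQLite through Python's sqlite3 module. This is a sketch under assumptions: the two tables are simplified versions of person and favorite_food, the rows are invented, SQLite only enforces foreign keys after PRAGMA foreign_keys = ON, and its error text differs from the MySQL message above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, fname TEXT);
CREATE TABLE favorite_food (
    person_id INTEGER,
    food TEXT,
    CONSTRAINT pk_favorite_food PRIMARY KEY (person_id, food),
    CONSTRAINT fk_fav_food_person_id FOREIGN KEY (person_id)
        REFERENCES person (person_id)
);
INSERT INTO person VALUES (1, 'William');
INSERT INTO favorite_food VALUES (1, 'pizza');  -- person 1 exists: accepted
""")

try:
    # person 999 does not exist, so the child row is rejected
    conn.execute("INSERT INTO favorite_food VALUES (999, 'lasagna')")
    fk_error = None
except sqlite3.IntegrityError as exc:
    fk_error = str(exc)

food_count = conn.execute("SELECT count(*) FROM favorite_food").fetchone()[0]
print(fk_error, food_count)
```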

Ways to generate primary keys:

Look at the largest value currently in the table and add one.
Let the database server provide the value for you.

ALTER TABLE table_name MODIFY col_0 SMALLINT UNSIGNED AUTO_INCREMENT;
set foreign_key_checks=0; /*IMPORTANT*/
ALTER TABLE person
MODIFY person_id SMALLINT UNSIGNED AUTO_INCREMENT;
set foreign_key_checks=1; /*IMPORTANT*/

Find Databases

To see the mysql> prompt:

mysql -u root -p

Then type show databases; to display all databases;

Find a Table

To select the database that holds the table, type use database_name;

The database can also be given when starting the client:

mysql -u root -p database_name

In R, one can find loaded data frames under the global environment.

Create a Table

CREATE TABLE table_name /*Create a table with name: ……*/
(col_0 smallint,
col_1 VARCHAR(30),
col_2 timestamp,
CONSTRAINT pk_col_0 PRIMARY KEY (col_0) /*set col_0 as primary key*/
); /*The most basic method to create a table*/

R codes:

df <- data.frame(
x1 = c(7, 3, 2, 9, 0),
x2 = c(4, 4, 1, 1, 8),
x0 = c(5, 3, 9, 2, 4)
)
# A primary key can only be enforced manually, e.g. by checking anyDuplicated(df$x0) == 0

Add a Row

INSERT INTO table_name (col_0, col_1, col_2) /*The table*/
VALUES (27, 'Rdm Name', 'Acme Paper Corporation'); /*The values*/
/*The most basic method to insert a full row into a database*/

Query OK, 1 row affected $\Rightarrow$ one row was added to the database

R codes:

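# A one-row data frame keeps each column's type (c() would coerce everything to character)
new_row <- data.frame(col_0 = 27, col_1 = 'Rdm Name', col_2 = 'Acme Paper Corporation')
df <- rbind(df, new_row) # rbind() returns a new data frame, so assign it back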

You are not required to provide data for every column in the table unless the column cannot be NULL;

MySQL will convert a string to a date for you as long as it follows the expected format (YYYY-MM-DD by default); otherwise it raises an error such as:

ERROR 1292 (22007): Incorrect date value: 'DEC-21-1980' for column 'birth_date' at row 1
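
One way around this error is to reformat the string before it reaches the database. Below is a small sketch in Python, assuming the input always looks like 'DEC-21-1980' and that month abbreviations are in English; the helper name to_mysql_date is hypothetical.

```python
from datetime import datetime

def to_mysql_date(text, input_format="%b-%d-%Y"):
    """Parse a date string in input_format and return it as YYYY-MM-DD."""
    return datetime.strptime(text, input_format).strftime("%Y-%m-%d")

birth_date = to_mysql_date("DEC-21-1980")
print(birth_date)
```

Inside MySQL itself, str_to_date('DEC-21-1980', '%b-%d-%Y') performs the same conversion on the server.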

Change a Cell

UPDATE table_name
/*Fix column*/ /*Insert the values*/
SET name = 'Certificate of Deposit'
WHERE col_2 = 'CD'; /*Fix row, otherwise all will be replaced*/

R codes:

df[df$col_2=='CD', "name"] <- 'Certificate of Deposit'
# Fix column, fix row

Delete a Row

DELETE /*No column list: DELETE always removes whole rows*/
FROM table_name
WHERE col_2 = 'CD'; /*Fix row, otherwise all will be deleted*/

R codes:

df <- df[df$col_2 != 'CD', ] # Keep every row except the matching ones
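
The effect of the WHERE clause in UPDATE and DELETE can be checked with a short sketch in Python's sqlite3 module. SQLite stands in for MySQL here, and the product table and its three rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (product_cd TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [("CD", "cert of deposit"), ("CHK", "checking account"), ("SAV", "savings account")],
)

# WHERE restricts the change to the matching row; rowcount says how many rows were hit.
updated = conn.execute(
    "UPDATE product SET name = 'Certificate of Deposit' WHERE product_cd = 'CD'"
).rowcount

deleted = conn.execute("DELETE FROM product WHERE product_cd = 'CD'").rowcount

remaining = [row[0] for row in conn.execute("SELECT product_cd FROM product ORDER BY product_cd")]
print(updated, deleted, remaining)
```

Checking the affected row count after each statement is a cheap way to notice a missing or wrong WHERE clause.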

Table Overview

DESC favorite_food;

R codes:

str(df)
summary(df)
dplyr::glimpse(df) # glimpse() comes from the dplyr package

Describe the table.
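
For comparison, a rough analogue of DESC in SQLite is PRAGMA table_info, shown here through Python's sqlite3 module. The column types in this favorite_food definition are assumptions for the sake of the example, not the definition used in the book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE favorite_food ("
    " person_id SMALLINT, food VARCHAR(20),"
    " CONSTRAINT pk_favorite_food PRIMARY KEY (person_id, food))"
)

# Each row of table_info is (cid, name, type, notnull, default, pk);
# pk is the column's 1-based position within the primary key, or 0.
columns = [
    (row[1], row[2], row[5])
    for row in conn.execute("PRAGMA table_info(favorite_food)")
]
for name, col_type, pk_position in columns:
    print(name, col_type, pk_position)
```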

Show Tables

show tables;

Drop a Table

drop table xxx;

Export to XML

Type the following in CMD:

mysql -u lrngsql -p --xml bank

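Every query run in that session is then returned as XML. In SQL Server (not MySQL), add the FOR XML clause to the query instead:

SELECT * FROM table_name
FOR XML AUTO, ELEMENTS /*IMPORTANT*/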

No easy way to do so in R.

Table Creation (CH. 2)

1 Design

What info is needed? Make a list.

Compound objects, such as names or addresses, need to be separated into multiple columns;
If a column is a list containing zero, one, or more independent items, we need another table;
Need primary key column(s) to guarantee uniqueness.

3 Building SQL Schema Statements

Another type of constraint, called a check constraint, restricts the allowable values for a particular column. A check constraint can be attached to a column definition:

eye_color CHAR(2) CHECK (eye_color IN ('BR','BL','GR'))

Possible error:

ERROR 1265 (01000): Data truncated for column 'eye_color' at row 1
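
A check constraint of this shape can be tried out in Python's sqlite3 module. This is only a stand-in for MySQL: SQLite raises its own "CHECK constraint failed" error rather than ERROR 1265, and the table and values below are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    " person_id INTEGER PRIMARY KEY,"
    " eye_color CHAR(2) CHECK (eye_color IN ('BR','BL','GR')))"
)

conn.execute("INSERT INTO person (person_id, eye_color) VALUES (1, 'BR')")  # allowed value

try:
    # 'ZZ' is not in the allowed list, so the row is refused.
    conn.execute("INSERT INTO person (person_id, eye_color) VALUES (2, 'ZZ')")
    check_outcome = "accepted"
except sqlite3.IntegrityError:
    check_outcome = "rejected"

stored = [row[0] for row in conn.execute("SELECT eye_color FROM person")]
print(check_outcome, stored)
```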

MySQL does provide another character data type called enum that merges the check constraint into the data type definition.

eye_color ENUM('BR','BL','GR')

R codes:

Enum <- function(...) {
  # Capture the unevaluated argument names as strings
  values <- sapply(match.call(expand.dots = TRUE)[-1L], deparse)
  # Every value must be unique
  stopifnot(identical(unique(values), values))
  res <- setNames(seq_along(values), values)
  res <- as.environment(as.list(res))
  # Lock the environment so values cannot be added or changed
  lockEnvironment(res, bindings = TRUE)
  res
}
FRUITS <- Enum(APPLE, BANANA, MELON)

See https://stackoverflow.com/questions/33838392/enum-like-arguments-in-r for further details.

After processing the create table statement, the MySQL server returns the message “Query OK, 0 rows affected,” which tells me that the statement had no syntax errors.