<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Blog | Siqi Zheng</title><link>https://siqi-zheng.rbind.io/post/</link><atom:link href="https://siqi-zheng.rbind.io/post/index.xml" rel="self" type="application/rss+xml"/><description>Blog</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><image><url>https://siqi-zheng.rbind.io/images/icon_hu1f65844ca26c0df97a9719a407d829c0_98767_512x512_fill_lanczos_center_2.png</url><title>Blog</title><link>https://siqi-zheng.rbind.io/post/</link></image><item><title>Outline for the research article</title><link>https://siqi-zheng.rbind.io/post/2021-11-12-research-reproducibility/</link><pubDate>Fri, 12 Nov 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-11-12-research-reproducibility/</guid><description>&lt;p>Please click the book icon above for the full text!&lt;/p></description></item><item><title>Supplement Solutions for New Questions in Chapter 1.4 to 1.6 in Understanding Analysis Second Edition</title><link>https://siqi-zheng.rbind.io/post/2021-08-10-analysis-sol-1-2/</link><pubDate>Tue, 10 Aug 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-08-10-analysis-sol-1-2/</guid><description>&lt;p>&lt;strong>Note: There may be LaTeX display issues due to blogdown rendering limitations. A complete well-formatted solution can be found by clicking the download icon above.&lt;/strong>&lt;/p>
&lt;p>One may notice that most questions in the second edition are the same as those in the first edition. However, there are still some new or modified questions in the latest edition that remain unanswered.&lt;/p>
&lt;p>Therefore, in the following posts, I am going to present a collection of solutions to these new questions found on the internet and worked out by myself. To be more concise and clear, I also rewrote some of my solutions according to the internet sources (links are attached at the end of each question). The solution to the first edition can be found here: &lt;a href="https://github.com/mikinty/Understanding-Analysis-Abbott-Solutions">https://github.com/mikinty/Understanding-Analysis-Abbott-Solutions&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.5.2.&lt;/strong> Review the proof of Theorem 1.5.6, part (ii) showing that $\Bbb R$ is uncountable, and then find the flaw in the following erroneous proof that $\Bbb Q$ is uncountable:
Assume, for contradiction, that $\Bbb Q$ is countable. Thus we can write $\Bbb Q = {r_1, r_2, r_3, \dots}$ and, as before, construct a nested sequence of closed intervals with $r_n \not \in I_n$. Our construction implies $\cap^\infty_{n=1} I_n = \emptyset$ while NIP implies $\cap^\infty_{n=1} I_n \neq \emptyset$. This contradiction implies $\Bbb Q$ must therefore be uncountable.&lt;/p>
&lt;p>(1) The intersection $\cap^\infty_{n=1} I_n$ need not be empty: the construction only guarantees that no rational $r_n$ lies in it, so it may still contain an irrational number.
(2) In any case, NIP is a property of closed intervals of real numbers; it does not hold for intervals of rationals.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1914901/false-proofs-claiming-that-mathbbq-is-uncountable">https://math.stackexchange.com/questions/1914901/false-proofs-claiming-that-mathbbq-is-uncountable&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.5.4.&lt;/strong> (a) Show $(a, b) \sim R$ for any interval $(a, b)$.
We know from &lt;strong>Example 1.4.9&lt;/strong> that the function $f(x) = x/(x^2 − 1)$ takes the interval $(−1, 1)$ onto $\Bbb R$ in a 1–1 fashion. Then we map $(a,b)$ onto $(-1,1)$ by the bijective linear function $g(x)=2x/(b-a)-(b+a)/(b-a)$, so $f \circ g$ is the required bijection.&lt;/p>
&lt;p>(b) Show that an unbounded interval like $(a,\infty) = {x : x &amp;gt; a}$ has the same cardinality as $\Bbb R$ as well.
We know from &lt;strong>Example 1.4.9&lt;/strong> that the function $f(x) = x/(x^2 − 1)$ takes the interval $(−1, 1)$ onto $\Bbb R$ in a 1–1 fashion. Then we map $(a,\infty)$ onto $(-1,1)$ by the bijection $g(x)=\frac{2(x-a)}{x-a+1}-1$: it is continuous and strictly increasing, with $g(x) \to -1$ as $x \to a$ and $g(x) \to 1$ as $x \to \infty$.&lt;/p>
&lt;p>(c) Using open intervals makes it more convenient to produce the required 1–1, onto functions, but it is not really necessary. Show that $[0, 1) \sim (0, 1)$ by exhibiting a 1–1 onto function between the two sets.
$f:[0,1) \rightarrow (0,1)$ by $f(0)=1/2$, $f(1/n)=1/(n+1)$ for integer $n \geq 2$, and $f(x)=x$ otherwise.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1425492/explicit-bijection-between-0-1-and-0-1">https://math.stackexchange.com/questions/1425492/explicit-bijection-between-0-1-and-0-1&lt;/a>&lt;/p>
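&lt;p>The bijection in (c) is concrete enough to sanity-check numerically. The sketch below is ours, not part of the original solution (the helper name &lt;code>f&lt;/code> is an assumption); it implements the map with exact rational arithmetic and checks it is 1–1 into $(0,1)$ on a sample:&lt;/p>

```python
from fractions import Fraction

def f(x):
    # the map from the solution: 0 -> 1/2, 1/n -> 1/(n+1) for n >= 2,
    # and every other point of [0, 1) is fixed
    if x == 0:
        return Fraction(1, 2)
    if x.numerator == 1 and x.denominator >= 2:
        return Fraction(1, x.denominator + 1)
    return x

sample = [Fraction(0)] + [Fraction(1, n) for n in range(2, 50)] + [Fraction(3, 7), Fraction(9, 10)]
images = [f(x) for x in sample]
assert len(set(images)) == len(images)          # 1-1 on the sample
assert all(y > 0 and 1 > y for y in images)     # image lies inside (0, 1)
```

Finite sampling only illustrates the behaviour, of course; injectivity and surjectivity are what the displayed definition proves.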
&lt;p>&lt;strong>Exercise 1.5.5.&lt;/strong> (a) Why is $A \sim A$ for every set $A$?
The identity map $f(x)=x$ is 1–1 and onto from $A$ to itself, so $A \sim A$.&lt;/p>
&lt;p>(b) Given sets $A$ and $B$, explain why $A \sim B$ is equivalent to asserting $B \sim A$.
If $f: A \rightarrow B$ is 1–1 and onto, then the inverse mapping $f^{-1}: B \rightarrow A$ is also 1–1 and onto, so $A \sim B$ implies $B \sim A$.&lt;/p>
&lt;p>(c) For three sets $A,B,$ and $C$, show that $A \sim B$ and $B \sim C$ implies $A \sim C$. These three properties are what is meant by saying that $\sim$ is an equivalence relation.
If $f: A \rightarrow B$ and $g: B \rightarrow C$ are 1–1 and onto, then the composition $g \circ f: A \rightarrow C$ is 1–1 and onto, so $A \sim C$.&lt;/p>
&lt;p>&lt;strong>Exercise 1.5.6.&lt;/strong> (a) Give an example of a countable collection of disjoint open intervals.
$A_n = (n, n+1)$, $n\in \Bbb N$&lt;/p>
&lt;p>(b) Give an example of an uncountable collection of disjoint open intervals, or argue that no such collection exists.
No such collection exists. Every collection of disjoint open intervals in $\Bbb R$ is countable: by the density of $\Bbb Q$ in $\Bbb R$ we can choose a rational number inside each interval, these rationals are distinct because the intervals are disjoint, and the rationals are countable.&lt;/p>
&lt;p>&lt;strong>Exercise 1.5.7.&lt;/strong> Consider the open interval $(0,1)$, and let $S$ be the set of points in the open unit square; that is, $S = {(x, y) : 0 &amp;lt; x,y &amp;lt; 1}$.&lt;/p>
&lt;p>(a) Find a 1–1 function that maps $(0, 1)$ into, but not necessarily onto, $S$. (This is easy.)
$f(x) = (x,x),x \in (0,1)$&lt;/p>
&lt;p>(b) Use the fact that every real number has a decimal expansion to produce a 1–1 function that maps $S$ into $(0, 1)$. Discuss whether the formulated function is onto. (Keep in mind that any terminating decimal expansion such as $.235$ represents the same real number as $.234999 \dots$)&lt;/p>
&lt;p>For any point with two coordinates $(0.d_1d_2\dots,0.e_1e_2\dots)$, we map it to the real number $(0.d_1e_1d_2e_2\dots)$. We restrict the choice of point in its simplest form so that $(0.2,0.5)$ will be chosen for $0.25$ instead of $(0.2999\dots,0.4999\dots)$, which is equal to $(0.3,0.5)$, corresponding to $0.35$.&lt;/p>
&lt;p>This function (mapping), however, is not onto. Consider $1/11=0.090909\dots$, which could only be produced by the point $(0,0.999\dots)$; but this point cannot be selected, since it is equal to $(0,1)$, which does not lie in the open square $S$. Therefore no point in the unit square maps to $1/11$.&lt;/p>
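&lt;p>The digit-interleaving map is easy to experiment with on terminating expansions. A small illustration of ours (not part of the original solution; digits are passed as strings of equal length for simplicity):&lt;/p>

```python
def interleave(x_digits, y_digits):
    # x_digits, y_digits: the decimal digits of the two coordinates after
    # the point, in "simplest form" as required by the solution
    merged = "".join(d + e for d, e in zip(x_digits, y_digits))
    return "0." + merged

# the worked example from the text: (0.2, 0.5) is sent to 0.25
print(interleave("2", "5"))
print(interleave("123", "456"))
```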
&lt;p>&lt;strong>Exercise 1.5.8.&lt;/strong> Let $B$ be a set of positive real numbers with the property that adding together any finite subset of elements from $B$ always gives a sum of $2$ or less. Show $B$ must be finite or countable.&lt;/p>
&lt;p>For each $n\in \Bbb N$, let $$B_n=\left\{b\in B \,\middle|\, b\geqslant\frac{2}{n}\right\}\subset B.$$&lt;/p>
&lt;p>Of course, $B_n$ can have no more than $n-1$ distinct elements; otherwise, the sum of $n$ distinct elements of $B_n$ would be greater than $2$.&lt;/p>
&lt;p>But $$B=\bigcup_{n\in\Bbb N}B_n.$$ Since $\Bbb N$ is countable and each $B_n$ is finite, $B$ is at most countable, i.e. finite or countable.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/2446630/showing-a-set-is-finite-or-countable">https://math.stackexchange.com/questions/2446630/showing-a-set-is-finite-or-countable&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.5.10.&lt;/strong> (a) Let $C \subseteq [0,1]$ be uncountable. Show there exists $a \in (0,1)$ such that $C \cap [a,1]$ is uncountable.&lt;/p>
&lt;p>Suppose that $C\cap [\tfrac{1}{n}, 1]$ is countable for all $n$. Then $$C\cap [0,1] = C\cap\big(\{0\}\cup \bigcup_{n=1}^\infty [\tfrac{1}{n},1]\big) = (C\cap \{0\}) \cup \bigcup_{n=1}^\infty (C\cap [\tfrac{1}{n}, 1])$$ would be countable too.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1452550/let-c-subseteq-0-1-be-uncountable-show-there-exists-a-in-0-1-such-tha">https://math.stackexchange.com/questions/1452550/let-c-subseteq-0-1-be-uncountable-show-there-exists-a-in-0-1-such-tha&lt;/a>&lt;/p>
&lt;p>(b) Now let $A$ be the set of all $a \in (0, 1)$ such that $C \cap [a,1]$ is uncountable, and set $\alpha = \sup A$. Is $C \cap [\alpha,1]$ an uncountable set?&lt;/p>
&lt;p>We want to show: if $C\subseteq [0,1]$ is uncountable, $A = {a\in (0,1)\mid C\cap[a,1] \text{ is uncountable}}$, and $\alpha = \sup A$, then $C\cap [\alpha,1]$ is countable.&lt;/p>
&lt;p>First, $A$ is nonempty: for $n\in\Bbb N$ let $C_n = C\cap [\frac 1 n, 1]$. Some $C_n$ must be uncountable, otherwise $C= \bigcup_n C_n$ is a countable union of countable sets and therefore countable. So for some $n$, $1/n \in A$.&lt;/p>
&lt;p>Clearly $0 \lt \alpha \le 1$.&lt;/p>
&lt;p>If $\alpha =1$ then of course the claim is true.
If $\alpha \lt 1$, let $(b_n)$ be a decreasing sequence in $(\alpha, 1)$ with $\alpha = \inf_n b_n$. By definition of $A$ and $\alpha$, $C\cap[b_n,1]$ is countable for every $n$: otherwise $b_n\in A$, which would force $b_n \le \alpha$, contradicting $b_n \gt \alpha$. Thus
$$\begin{align}
C\cap (\alpha,1] &amp;amp;= C\cap \bigcup_n [b_n, 1] \\
&amp;amp;= \bigcup_n (C\cap [b_n, 1])
\end{align}$$
is a countable union of countable sets, so it&amp;rsquo;s countable; and $C\cap[\alpha,1]$ contains at most one additional point ($\alpha$ itself), so it is countable as well.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1639608/intersection-of-uncountable-sets">https://math.stackexchange.com/questions/1639608/intersection-of-uncountable-sets&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.5.11 (Schröder–Bernstein Theorem).&lt;/strong> Assume there exists a 1–1 function $f: X \rightarrow Y$ and another 1–1 function $g: Y \rightarrow X$. Then there exists a 1–1, onto function $h: X \rightarrow Y$ and hence $X \sim Y$.&lt;/p>
&lt;p>The strategy is to partition $X$ and $Y$ into components $X = A \cup A'$ and $Y = B \cup B'$ with $A \cap A' = \emptyset$ and $B \cap B' = \emptyset$, in such a way that $f$ maps $A$ onto $B$ and $g$ maps $B'$ onto $A'$.&lt;/p>
&lt;p>(a) Explain how achieving this would lead to a proof that $X \sim Y$.
$f: A \rightarrow B$ is a 1–1, onto function;
$g: B' \rightarrow A'$ is a 1–1, onto function;
Then $h(x)=f(x)$ if $x \in A$ and $h(x)=g^{-1}(x)$ if $x \in A'$ defines a 1–1, onto function from $X$ to $Y$, and hence $X \sim Y$.&lt;/p>
&lt;p>(b) Set $A_1 = X \setminus g(Y)$ (what happens if $A_1 = \emptyset$?) and inductively define a sequence of sets by letting $A_{n+1} = g(f(A_n))$. Show that ${A_n : n \in \Bbb{N}}$ is a pairwise disjoint collection of subsets of $X$, while ${f(A_n) : n \in \Bbb{N} }$ is a similar collection in $Y$.&lt;/p>
&lt;p>For $k \ge 2$, since $A_k = g(f(A_{k-1})) \subseteq g(Y)$, $A_k$ and $A_1$ are disjoint.&lt;/p>
&lt;p>For $2 \le m \lt n$, if there exists $a \in A_m \cap A_n$, then for some $a_{m-1} \in A_{m-1}$ and $a_{n-1} \in A_{n-1}$, $g(f(a_{m-1})) = a = g(f(a_{n-1}))$. Since both $f$ and $g$ are injective, $a_{m-1} = a_{n-1}$. Hence $A_m \cap A_n \ne \emptyset$ implies $A_{m-1} \cap A_{n-1} \ne \emptyset$. By induction, we conclude that $A_1 \cap A_{n-m+1} \ne \emptyset$, which contradicts the previous paragraph. Therefore $A_m$ and $A_n$ are disjoint ($2 \le m \lt n$). Finally, since the $A_n$ are pairwise disjoint and $f$ is injective, the sets $f(A_n)$ are pairwise disjoint as well.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1726578/understanding-a-proof-of-schr%C3%B6der-bernstein-theorem">https://math.stackexchange.com/questions/1726578/understanding-a-proof-of-schr%C3%B6der-bernstein-theorem&lt;/a>&lt;/p>
&lt;p>(c) Let $A = \cup_{n=1}^\infty A_n$ and $B = \cup_{n=1}^\infty f(A_n)$. Show that $f$ maps $A$ onto $B$.
This is immediate: every $b \in B$ satisfies $b = f(a)$ for some $a \in A_n \subseteq A$.&lt;/p>
&lt;p>(d) Let $A' = X\setminus A$ and $B' = Y \setminus B$. Show $g$ maps $B'$ onto $A'$.
Suppose there is an element $a' \in A'$ with $a' \not\in g(B')$. Since $a'$ cannot be in $A_1 = X \setminus g(Y)$, we have $a' = g(b)$ for some $b \in Y$; and since $a' \not\in g(B')$, this $b$ lies in $B$, i.e. $b \in f(A_n)$ for some $n$. Writing $b = f(a)$ with $a \in A_n$, we get $a'=g(f(a))\in A_{n+1} \subseteq A$. But this contradicts $a' \in A' = X \setminus A$.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1726578/understanding-a-proof-of-schr%C3%B6der-bernstein-theorem">https://math.stackexchange.com/questions/1726578/understanding-a-proof-of-schr%C3%B6der-bernstein-theorem&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.6.9.&lt;/strong> Using the various tools and techniques developed in the last two sections (including the exercises from Section 1.5), give a compelling argument showing that $\cal P(\Bbb N) \sim \Bbb R$.&lt;/p>
&lt;p>First note that $\Bbb R$ injects into $\cal P(\Bbb Q)$ by mapping $r$ to ${q\in\Bbb Q\mid q \lt r}$. Since $\Bbb Q$ is countable there is a bijection between $\cal P(\Bbb Q)$ and $\cal P(\Bbb N)$. So $\Bbb R$ injects into $\cal P(\Bbb N)$.&lt;/p>
&lt;p>Then note that we can map $x\in 2^{\Bbb N}$ to the continued fraction defined by the sequence $x$, or to the point of $[0,1]$ defined by $\sum\frac{x(n)}{3^{n+1}}$; the latter map can be shown injective by a somewhat easier proof. By the Schröder–Bernstein Theorem (Exercise 1.5.11), $\cal P(\Bbb N) \sim \Bbb R$.&lt;/p></description></item><item><title>Supplement Solutions for New Questions in Chapter 1.2 to 1.4 in Understanding Analysis Second Edition</title><link>https://siqi-zheng.rbind.io/post/2021-08-05-analysis-sol-1-1/</link><pubDate>Thu, 05 Aug 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-08-05-analysis-sol-1-1/</guid><description>&lt;p>&lt;strong>Note: There may be LaTeX display issues due to blogdown rendering limitations. A complete well-formatted solution can be found by clicking the download icon above.&lt;/strong>&lt;/p>
&lt;p>One may notice that most questions in the second edition are the same as those in the first edition. However, there are still some new or modified questions in the latest edition that remain unanswered.&lt;/p>
&lt;p>Therefore, in the following posts, I am going to present a collection of solutions to these new questions found on the internet and worked out by myself. To be more concise and clear, I also rewrote some of my solutions according to the internet sources (links are attached at the end of each question). The solution to the first edition can be found here: &lt;a href="https://github.com/mikinty/Understanding-Analysis-Abbott-Solutions">https://github.com/mikinty/Understanding-Analysis-Abbott-Solutions&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.2.2.&lt;/strong> Show that there is no rational number $r$ satisfying $2^r=3$.&lt;/p>
&lt;p>Suppose, for contradiction, that such a rational $r$ exists. Since $2^r = 3 \gt 1$, we must have $r \gt 0$, so we can write $r=\frac{a}{b}$ with positive integers $a,b$.&lt;/p>
&lt;p>Then, we get $$2^{\frac{a}{b}}=3$$&lt;/p>
&lt;p>which can be expressed as&lt;/p>
&lt;p>$$2^a=3^b$$&lt;/p>
&lt;p>This is clearly a contradiction because the left side is even and the right side is odd.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1427219/prove-there-is-no-rational-r-satisfying-2r-3">https://math.stackexchange.com/questions/1427219/prove-there-is-no-rational-r-satisfying-2r-3&lt;/a>&lt;/p>
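&lt;p>The parity argument can be corroborated by brute force on a small range of exponents (a sanity check of ours, not a proof):&lt;/p>

```python
# 2^a is always even and 3^b always odd, so they can never coincide;
# verify exhaustively for small exponents
assert all(2 ** a != 3 ** b for a in range(1, 60) for b in range(1, 40))
print("no collision for a under 60, b under 40")
```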
&lt;p>&lt;strong>Exercise 1.2.4.&lt;/strong> Expressing $\Bbb N$ as an infinite union of disjoint infinite subsets.&lt;/p>
&lt;p>Let $A_{i}$ consist of all the numbers of the form $2^{i-1}m$ where $2\nmid m$; that is, $A_i$ consists of all the numbers that have exactly a factor of $2^{i-1}$ in them. So
$$\begin{align}
A_1 &amp;amp;= \{1, 3, 5, 7, 9, 11, \dots\}\\
A_2 &amp;amp;= \{2,\ 6 =2^1\cdot 3,\ 10 = 2^1\cdot 5,\ 14 = 2^1\cdot 7,\ \dots\}\\
A_3 &amp;amp;= \{4 = 2^2,\ 12=2^2\cdot 3,\ 20=2^2\cdot 5,\ \dots\}\\
A_4 &amp;amp;= \{8 = 2^3,\ 24=2^3\cdot 3,\ 40=2^3\cdot 5,\ \dots\}\\
&amp;amp;\ \ \vdots
\end{align}
$$&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/847465/expressing-bbb-n-as-an-infinite-union-of-disjoint-infinite-subsets">https://math.stackexchange.com/questions/847465/expressing-bbb-n-as-an-infinite-union-of-disjoint-infinite-subsets&lt;/a>&lt;/p>
&lt;p>As pointed out in the link above, any prime number works here; for example, with $3$:&lt;/p>
&lt;p>$A_1 = \Bbb N \setminus {x: x = 3b, b \in \Bbb N}$&lt;/p>
&lt;p>$A_2 = {3a,a\in A_1}$&lt;/p>
&lt;p>$A_3 = {3^2a,a\in A_1}$&lt;/p>
&lt;p>$A_4 = {3^3a,a\in A_1}$&lt;/p>
&lt;p>$\vdots$&lt;/p>
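&lt;p>Both constructions assign each natural number to exactly one set. For the powers-of-two version this can be checked mechanically; the sketch below is ours (the helper name &lt;code>index&lt;/code> is an assumption) and confirms the first few sets match the display above:&lt;/p>

```python
from collections import defaultdict

def index(n):
    # n belongs to A_i exactly when 2^(i-1) is the largest power of 2 dividing n
    i = 1
    while n % 2 == 0:
        n //= 2
        i += 1
    return i

buckets = defaultdict(list)
for n in range(1, 41):
    buckets[index(n)].append(n)   # every n lands in exactly one bucket

assert buckets[1][:6] == [1, 3, 5, 7, 9, 11]
assert buckets[2][:4] == [2, 6, 10, 14]
assert buckets[3][:3] == [4, 12, 20]
assert buckets[4][:3] == [8, 24, 40]
```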
&lt;p>&lt;strong>Exercise 1.2.8.&lt;/strong>&lt;/p>
&lt;p>Give an example of each or state that the request is impossible:&lt;/p>
&lt;p>(a) $f : \Bbb N \rightarrow \Bbb N$ that is 1–1 but not onto.
$f(x) = x^2+2$: it is strictly increasing, hence 1–1, but not onto, since $1 \in \Bbb N$ while $f(x)&amp;gt;1$ $\forall x \in \Bbb N$&lt;/p>
&lt;p>(b) $f : \Bbb N \rightarrow \Bbb N$ that is onto but not 1–1.
$f(1)=1$ and $f(x) = x-1$ for $x \geq 2$: every $n \in \Bbb N$ equals $f(n+1)$, so $f$ is onto, but $f(1)=f(2)$ while $1 \neq 2$&lt;/p>
&lt;p>(c) $f : \Bbb N \rightarrow \Bbb Z$ that is 1–1 and onto.
$f(x) = x/2$ for even $x$ and $f(x) = -(x-1)/2$ for odd $x$, which lists $\Bbb Z$ as $0, 1, -1, 2, -2, \dots$&lt;/p>
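&lt;p>The alternating enumeration of $\Bbb Z$ is easy to test directly (our sketch; it is one of many possible bijections):&lt;/p>

```python
def f(n):
    # n in {1, 2, 3, ...}: evens go to 1, 2, 3, ..., odds to 0, -1, -2, ...
    return n // 2 if n % 2 == 0 else -(n - 1) // 2

assert [f(n) for n in range(1, 8)] == [0, 1, -1, 2, -2, 3, -3]
# 1-1 and onto a symmetric block of integers
assert {f(n) for n in range(1, 102)} == set(range(-50, 51))
```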
&lt;p>&lt;strong>Exercise 1.2.10.&lt;/strong> Decide which of the following are true statements. Provide a short justification for those that are valid and a counterexample for those that are not:&lt;/p>
&lt;p>(a) Two real numbers satisfy a &amp;lt; b if and only if a &amp;lt; b + $\epsilon$ for every $\epsilon$ &amp;gt; 0.
FALSE. The forward direction holds, but the converse fails if we take $a=b=5$: $5 \lt 5+\epsilon$ for every $\epsilon \gt 0$, yet $5 \not\lt 5$.&lt;/p>
&lt;p>(b) Two real numbers satisfy a &amp;lt; b if a &amp;lt; b + $\epsilon$ for every $\epsilon$ &amp;gt; 0.
The statement is FALSE, again taking $a=b=5$.&lt;/p>
&lt;p>(c) Two real numbers satisfy a ≤ b if and only if a &amp;lt; b + $\epsilon$ for every $\epsilon$ &amp;gt; 0.
Forward (trivial):
$a \le b \lt b + \epsilon$.
Reverse:
Suppose $a \lt b + \epsilon$, $\forall \epsilon \gt 0$.
Let $\delta = a - b$; then $b + \delta = b + a - b = a$, so $a \not\lt b + \delta$. Hence $\delta$ cannot be positive, so $\delta = a - b \le 0$ and $a \le b$.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1633992/if-true-prove-that-2-real-numbers-satisfy-ab-iff-ab-epsilon-forall-e/1633997">https://math.stackexchange.com/questions/1633992/if-true-prove-that-2-real-numbers-satisfy-ab-iff-ab-epsilon-forall-e/1633997&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.2.12.&lt;/strong> Let $y_1 = 6$, and for each $n\in \Bbb N$ define $y_{n+1} = (2y_n − 6)/3$.&lt;/p>
&lt;p>(a) Use induction to prove that the sequence satisfies $y_n &amp;gt; −6$ for all $n \in \Bbb N$.&lt;/p>
&lt;ul>
&lt;li>Base Case: $y_1 = 6 &amp;gt; -6$&lt;/li>
&lt;li>Inductive case. Assume $y_k&amp;gt;-6$.&lt;/li>
&lt;li>$y_{k+1}=\frac{2y_k}{3}-2&amp;gt;\frac{2\times(-6)}{3}-2=-4-2=-6$&lt;/li>
&lt;li>By induction our original claim is proved.&lt;/li>
&lt;/ul>
&lt;p>(b) Use another induction argument to show the sequence $(y_1, y_2, y_3, \dots)$ is decreasing.&lt;/p>
&lt;ul>
&lt;li>Base Case: $y_2 = 2 &amp;lt; 6 = y_1$&lt;/li>
&lt;li>Inductive case. Assume $y_{k+1}&amp;lt;y_k$.&lt;/li>
&lt;li>$y_{k+2}=\frac{2y_{k+1}}{3}-2
=\frac{2y_{k+1}}{3}+\frac{-6}{3}
&amp;lt;\frac{2y_{k+1}}{3}+\frac{y_{k+1}}{3}
=y_{k+1}$, where the strict inequality uses $-6 \lt y_{k+1}$ from part (a)&lt;/li>
&lt;li>By induction our original claim is proved.&lt;/li>
&lt;/ul>
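&lt;p>Both induction claims are easy to confirm numerically for the first few terms (a sanity check of ours, using exact rational arithmetic):&lt;/p>

```python
from fractions import Fraction

# y1 = 6, y_{n+1} = (2*y_n - 6)/3
ys = [Fraction(6)]
for _ in range(29):
    ys.append((2 * ys[-1] - 6) / 3)

assert all(y > -6 for y in ys)                 # part (a): bounded below by -6
assert all(a > b for a, b in zip(ys, ys[1:]))  # part (b): strictly decreasing
print(float(ys[-1]))  # the terms approach the fixed point -6
```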
&lt;p>&lt;strong>Exercise 1.3.2.&lt;/strong> Give an example of each of the following, or state that the request is impossible.&lt;/p>
&lt;p>(a) A set B with inf B $\geq$ sup B.
$B={1}$&lt;/p>
&lt;p>(b) A finite set that contains its infimum but not its supremum.
Impossible. A nonempty finite set contains its maximum element, which is its supremum (and $\emptyset$ has no infimum or supremum at all).&lt;/p>
&lt;p>(c) A bounded subset of Q that contains its supremum but not its infimum.
$C={1/x|x\in\Bbb N}$ contains its supremum 1 but not its infimum 0.&lt;/p>
&lt;p>&lt;strong>Exercise 1.3.4.&lt;/strong> Let $A_1,A_2,A_3,\dots$ be a collection of nonempty sets, each of which is bounded above.&lt;/p>
&lt;p>(a)Find a formula for $sup(A_1 \cup A_2)$. Extend this to $sup(\cup^n_{k=1}A_k)$.
$sup {sup A_1, sup A_2}$
$sup {sup A_1, sup A_2 \dots sup A_n}$&lt;/p>
&lt;p>(b) Consider $sup(\cup^{\infty}_{k=1}A_k)$. Does the formula in (a) extend to the infinite case?&lt;/p>
&lt;p>No. Consider $A_i = {i}$: then $sup(\cup^n_{k=1}A_k)=n$ for every finite $n$, but $\cup^{\infty}_{k=1}A_k=\Bbb N$ is unbounded above, so $sup(\cup^{\infty}_{k=1}A_k)$ does not exist.&lt;/p>
&lt;p>&lt;strong>Exercise 1.3.6.&lt;/strong> Given sets A and B, define $A+B = {a+b : a \in A$ and $b \in B}$.
Follow these steps to prove that if A and B are nonempty and bounded above then sup(A + B) = supA + supB.&lt;/p>
&lt;p>(a) Let s = sup A and t = sup B. Show s + t is an upper bound for A + B.
Take $a \in A$ and $b \in B$, by definition, $a\leq s$ and $b \leq t$ and $a+b \in A+B$. So $a+b \leq s+t$.&lt;/p>
&lt;p>(b) Now let $u$ be an arbitrary upper bound for A + B, and temporarily fix $a \in A$. Show $t \leq u − a$.
&lt;p>Temporarily fix $a \in A$. For every $b \in B$ we have $a + b \in A+B$, so $a + b \leq u$ and hence $b \leq u - a$. Thus $u - a$ is an upper bound for $B$, and so by definition of $\sup B$, $$\sup B = t \leq u - a.$$&lt;/p>
&lt;p>(c) Finally, show sup(A + B) = s + t.
Taking $u = \sup(A+B)$ in (b) and rearranging gives ${a} \leq \sup(A +B) − \sup B$ for all $a \in A$.&lt;/p>
&lt;p>Hence, $\sup(A +B) − \sup B$ is an upper bound for $A$.&lt;/p>
&lt;p>By the definition of supremum, this means ${\sup A} \leq \sup(A + B) − \sup B$, i.e. $\sup A + \sup B \leq \sup(A + B)$, i.e.&lt;/p>
&lt;p>$$s+t \leq sup(A+B)$$&lt;/p>
&lt;p>Also, by inequality $a+b \leq s+t$ in (a) and the definition of supremum:
$$sup(A+B)\leq s+t$$&lt;/p>
&lt;p>We conclude that
$$sup(A+B)= s+t.$$&lt;/p>
&lt;p>(d) Construct another proof of this same fact using Lemma 1.3.8.&lt;/p>
&lt;p>Let $\epsilon \gt 0.$ Then there exists $a \in A$ and $b \in B$ such that $a \gt \sup A − \frac{\epsilon}{2}$ and $b \gt \sup B − \frac{\epsilon}{2}.$
Then $a + b \in A + B$. We have
$${\sup(A + B)} \geq a + b {\gt \sup A + \sup B - \epsilon} \implies { \sup(A + B) \gt \sup A + \sup B - \epsilon }.$$ Since $\epsilon$ is arbitrary, $\sup(A + B) \geq \sup A + \sup B=s+t$&lt;/p>
&lt;p>For the reverse inequality, take $a \in A$ and $b \in B$; by definition, $a\leq s$ and $b \leq t$, so every element $a+b$ of $A+B$ satisfies $a+b \leq s+t$. By the definition of supremum,
$$sup(A+B)\leq s+t$$&lt;/p>
&lt;p>We conclude that
$$sup(A+B)= s+t.$$&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/4551/how-can-i-prove-supab-sup-a-sup-b-if-ab-ab-mid-a-in-a-b-in-b">https://math.stackexchange.com/questions/4551/how-can-i-prove-supab-sup-a-sup-b-if-ab-ab-mid-a-in-a-b-in-b&lt;/a>&lt;/p>
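&lt;p>For finite sets the supremum is just the maximum, so the identity $\sup(A+B)=\sup A+\sup B$ can be spot-checked (our illustration; it is not a substitute for the proof, which covers arbitrary bounded sets):&lt;/p>

```python
def sumset(A, B):
    # A + B = {a + b : a in A, b in B}
    return {a + b for a in A for b in B}

A, B = {1, 3, 5}, {-2, 0, 4}
assert max(sumset(A, B)) == max(A) + max(B)   # 9 == 5 + 4
```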
&lt;p>&lt;strong>Exercise 1.3.8.&lt;/strong> Compute, without proofs, the suprema and infima (if they exist) of the following sets:&lt;/p>
&lt;p>(a) ${m/n : m, n \in N$ with $m &amp;lt; n}$.
sup: $1$ inf: $0$&lt;/p>
&lt;p>(b) ${(−1)^m/n : m, n \in N}$.
sup: $1$ inf: $-1$&lt;/p>
&lt;p>(c) ${n/(3n+ 1) : n \in N}$.
sup: $\frac{1}{3}$ inf: $\frac{1}{4}$&lt;/p>
&lt;p>(d) ${m/(m+ n) : m, n \in N}$.
sup: 1 inf: 0&lt;/p>
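&lt;p>These answers can be spot-checked by evaluating each set over a large finite range (our quick check; finite sampling only suggests, it does not prove, the stated suprema and infima):&lt;/p>

```python
from fractions import Fraction

# (a) {m/n : m < n}: values stay strictly between 0 and 1
a = [Fraction(m, n) for n in range(2, 80) for m in range(1, n)]
assert min(a) > 0 and 1 > max(a)

# (c) {n/(3n+1)}: increasing, starting at 1/4 and staying below 1/3
c = [Fraction(n, 3 * n + 1) for n in range(1, 200)]
assert min(c) == Fraction(1, 4)
assert Fraction(1, 3) > max(c)
```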
&lt;p>&lt;strong>Exercise 1.3.9.&lt;/strong>&lt;/p>
&lt;p>(a) If supA &amp;lt; supB, show that there exists an element $b \in B$ that is an upper bound for A.
Take $\epsilon=supB-supA$. By Lemma 1.3.8 there exists $b \in B$ with $b&amp;gt;supB-\epsilon=supA$. Then $b \gt supA \geq a$ for every $a \in A$, so $b$ is an upper bound for $A$, as desired.&lt;/p>
&lt;p>(b) Give an example to show that this is not always the case if we only assume supA ≤ supB.
Take $A={0}$ and $B={-1/n,n \in \Bbb N}$&lt;/p>
&lt;p>&lt;strong>Exercise 1.3.10 (Cut Property).&lt;/strong>&lt;/p>
&lt;p>(a) Use the Axiom of Completeness to prove the Cut Property.
Suppose we have the axiom of completeness and assume you have $A$ and $B$ as in the statement of the cut property. Then, as $B$ is nonempty, $A$ has an upper bound. Let $c$ be the least upper bound for $A$.&lt;/p>
&lt;p>For $a\in A$, $a\le c$, because $c$ is an upper bound for $A$;
For $b\in B$, $c\le b$, because $b$ is an upper bound for $A$ and $c$ is the least upper bound.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1616583/use-the-axiom-of-completeness-to-prove-the-cut-property">https://math.stackexchange.com/questions/1616583/use-the-axiom-of-completeness-to-prove-the-cut-property&lt;/a>&lt;/p>
&lt;p>(b) Show that the implication goes the other way.
Suppose we know the Cut Property. Consider a nonempty set $E$ with an upper bound. Then let&lt;/p>
&lt;p>$B={x\in\mathbb{R}: x\geq e \forall e\in E}$
i.e. $B$ is the set of all upper bounds of $E$&lt;/p>
&lt;p>and let $A$ be the complement of $B$.
$A={x\in\mathbb{R}: x\lt e$ for some $e\in E}$&lt;/p>
&lt;p>Since $E$ is non-empty and bounded above, $B$ is nonempty, and so is $A$. The union of $A$ and $B$ is $\mathbb{R}$ by construction. Suppose $a\in A$ and $b\in B$. If $b\le a$, then $e\leq b \leq a$ for all $e\in E$, so $a\in B$: a contradiction, since $A$ and $B$ are disjoint.&lt;/p>
&lt;p>Since $b&amp;gt;a$ for all $a \in A$ and $b \in B$, we know there exists $d$ such that $a \leq d$ and $d \leq b$ by Cut Property. We want to show that $d$ is the supremum for E.&lt;/p>
&lt;p>To show that $d$ is an upper bound of $E$, suppose some $s$ in $E$ exceeds $d$. Since $(s + d)/2$ exceeds $d$, it
belongs to $B$, so by the definition of $B$ it must be an upper bound of $E$, which is impossible since $s &amp;gt; (s + d)/2$. To show that $d$ is a least upper bound of $E$, suppose that some $a &amp;lt; d$ is an upper bound of $E$. But $a$ (being less than $d$) is in $A$, so it can’t be an upper bound of $E$.&lt;/p>
&lt;p>Source: &lt;a href="https://arxiv.org/abs/1204.4483">https://arxiv.org/abs/1204.4483&lt;/a>&lt;/p>
&lt;p>(c) Give a concrete example showing that the Cut Property is not a valid statement when $\Bbb R$ is replaced by $\Bbb Q$.
Hint: find a break point of $\Bbb Q$, e.g. $\sqrt{3},\sqrt{5},\dots$
Working inside $\Bbb Q$, consider $A = (-\infty, 0) \cup {x \ge 0 : x^2 \le 3}$ and $B = {x \ge 0 : x^2 \gt 3}$.
If such a number $c$ existed, we would have $c^2 = 3$. But there is no rational number for which this is true.&lt;/p>
&lt;p>&lt;strong>Exercise 1.3.11.&lt;/strong> Without worrying about formal proofs for the moment, decide if the following statements about suprema and infima are true or false. For any that are false, supply an example where the claim in question does not appear to hold.&lt;/p>
&lt;p>(a) TRUE. Since $A \subset B$, $\sup B$ is an upper bound for $A$. Since $\sup A$ is the least upper bound for $A$ by definition, it must be less than or equal to $\sup B$.&lt;/p>
&lt;p>(b) TRUE. Taking $c=(sup A + inf B)/2$ works for nonempty sets $A$ and $B$.&lt;/p>
&lt;p>(c) FALSE. Consider $A = (-\infty, 3)$ and $B = (3, \infty)$: here $a&amp;lt;3&amp;lt;b$ for all $a \in A$ and $b \in B$, yet $sup A = 3 = inf B$.&lt;/p>
&lt;p>&lt;strong>Exercise 1.4.2.&lt;/strong> Let $A \subseteq \Bbb R$ be nonempty and bounded above, and let $s \in \Bbb R$ have the property that for all $n \in \Bbb N$, s + 1/n is an upper bound for A and s − 1/n is not an upper bound for A. Show s = supA.&lt;/p>
&lt;p>Suppose s is not an upper bound for A. Then $\exists a \in A$ such that $s \lt a$. Take $\delta = a - s$ and $n_0 \in \Bbb N$ to be large enough so that $1/\delta &amp;lt; n_0$ i.e. $1/n_0 &amp;lt; \delta$. By definition, $s+1/n_0$ is an upper bound for $A$, but $s+1/n_0&amp;lt;s+\delta=a\in A$: a contradiction.&lt;/p>
&lt;p>Let $\epsilon&amp;gt;0$. Take $n_1 \in \Bbb N$ to be large enough so that $1/\epsilon &amp;lt; n_1$ i.e. $1/n_1 &amp;lt; \epsilon$. By definition, $\exists a \in A$ such that $ s-\epsilon \lt s-1/n_1 \lt a$. Hence s = sup A.&lt;/p>
&lt;p>&lt;strong>Exercise 1.4.4.&lt;/strong> Let $a \lt b$ be real numbers and consider the set $T=\mathbb{Q}\cap[a,b]$. Show $\sup T=b$&lt;/p>
&lt;p>If $x\in T$, then $x\in [a,b]$, and if $x\in [a,b]$, then $x\leq b$ i.e. $b$ is an upper bound for T.&lt;/p>
&lt;p>To show that $b$ is the least upper bound of $T$, suppose that some $c \lt b$ is an upper bound of $T$. Since the rationals are dense in $\Bbb R$, there exists a rational $t$ with $\max(a,c) \lt t \lt b$. Then $t \in \mathbb{Q}\cap[a,b] = T$, so $t \leq c$ by definition of upper bound, contradicting $c \lt t$.&lt;/p>
&lt;p>&lt;strong>Exercise 1.4.6.&lt;/strong> Which of the following sets are dense in $\Bbb R$? Take $p \in \Bbb Z$ and $q \in \Bbb N$ in every case.&lt;/p>
&lt;p>(a) The set of all rational numbers $p/q$ with $q \leq 10$.
Not dense in $\Bbb R$.
For any distinct $\frac pq$ and $\frac{p'}{q'}$ with $q,q'\le 10$, the difference $$ \frac pq-\frac{p'}{q'}=\frac{pq'-p'q}{qq'}$$ is a fraction with non-zero numerator and denominator $\le 10^2$, hence is $\ge \frac{1}{10^2}$ in absolute value. For example, no element in this set can be found between $1/500$ and $2/500$.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1638526/how-do-you-show-a-set-is-dense-for-example-is-the-set-of-all-rational-numbers">https://math.stackexchange.com/questions/1638526/how-do-you-show-a-set-is-dense-for-example-is-the-set-of-all-rational-numbers&lt;/a>&lt;/p>
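&lt;p>The $1/10^2$ gap bound can be confirmed by enumerating every fraction in $[0,1]$ with denominator at most $10$ (our check; in fact the smallest gap is $1/90$, attained by Farey neighbours such as $8/9$ and $9/10$):&lt;/p>

```python
from fractions import Fraction

vals = sorted({Fraction(p, q) for q in range(1, 11) for p in range(0, q + 1)})
gaps = [b - a for a, b in zip(vals, vals[1:])]

assert min(gaps) >= Fraction(1, 100)   # the bound used in the solution
assert min(gaps) == Fraction(1, 90)    # the gap between 8/9 and 9/10
```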
&lt;p>(b) The set of all rational numbers $p/q$ with $q$ a power of $2$.
Dense in $\Bbb R$.
Consider two arbitrary real numbers $a,b$ with $a\lt b$.
By the Archimedean Property there exists $n \in \mathbb N$ such that $$0\lt \frac{1}{n} \lt b-a \;\;\text{which implies}\;\; 0\lt \frac{1}{2^{n}}\lt \frac{1}{n}\lt b-a$$
Thus we have $1\lt b2^n-a2^n$.
As the distance between $a2^n$ and $b2^n$ is greater than $1$, there exists an integer $m$ such that $a2^{n}\lt m\lt b2^{n}$,
which implies that $a \lt \frac{m}{2^{n}} \lt b$. Since $a$ and $b$ were arbitrary, the claim is proved.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/3968925/proof-of-dyadic-rational-numbers-are-dense-in-mathbb-r">https://math.stackexchange.com/questions/3968925/proof-of-dyadic-rational-numbers-are-dense-in-mathbb-r&lt;/a>&lt;/p>
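&lt;p>The proof is constructive, so it translates directly into code. A sketch of ours following the same two steps (the helper name &lt;code>dyadic_between&lt;/code> is hypothetical):&lt;/p>

```python
import math
from fractions import Fraction

def dyadic_between(a, b):
    # step 1: choose n so that 1/2^n is smaller than b - a (Archimedean Property)
    n = 0
    while Fraction(1, 2 ** n) >= b - a:
        n += 1
    # step 2: a*2^n and b*2^n are more than 1 apart, so an integer m
    # lies strictly between them
    m = math.floor(a * 2 ** n) + 1
    return Fraction(m, 2 ** n)

x = dyadic_between(Fraction(1, 3), Fraction(2, 5))
assert x > Fraction(1, 3) and Fraction(2, 5) > x
assert x.denominator & (x.denominator - 1) == 0   # denominator is a power of 2
```

For example, between $1/3$ and $2/5$ the routine returns $3/8$.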
&lt;p>(c) The set of all rational numbers $p/q$ with $10|p| \geq q$.&lt;/p>
&lt;p>Not dense in $\Bbb R$.
Any $p/q$ in this set satisfies $|p/q| \geq 1/10$ (and $p=0$ is impossible, since $10\cdot 0 \lt q$), so the open interval $(-1/10,1/10)$ contains no element of the set.
For example, no element in this set can be found between $-1/30$ and $-1/20$.&lt;/p>
&lt;p>Source: &lt;a href="https://www.reddit.com/r/HomeworkHelp/comments/7ruu7u/real_analysis_density_of_subsets_of_q_in_r/">https://www.reddit.com/r/HomeworkHelp/comments/7ruu7u/real_analysis_density_of_subsets_of_q_in_r/&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.4.8.&lt;/strong> Give an example of each or state that the request is impossible. When a request is impossible, provide a compelling argument for why this is the case.&lt;/p>
&lt;p>(a) Two sets A and B with $A \cap B = \emptyset$, supA = supB, $supA \not \in A $ and $supB \not \in B$.
$A={x\in (0,1):x\not\in \Bbb Q}$, the irrationals in $(0,1)$
$B={x\in (0,1):x\in \Bbb Q}$, the rationals in $(0,1)$&lt;/p>
&lt;p>(b) A sequence of nested open intervals $J_1 \supseteq J_2 \supseteq J_3 \supseteq \dots $ with $\cap^\infty_{n=1}J_n$ nonempty but containing only a finite number of elements.
$J_n = (5-1/n,5+1/n), n \in \Bbb N$, with $\cap^\infty_{n=1}J_n={5}$, a single point&lt;/p>
&lt;p>(c) A sequence of nested unbounded closed intervals $L_1 \supseteq L_2 \supseteq L_3 \supseteq \dots $ with $\cap^\infty_{n=1}L_n=\emptyset$ (An unbounded closed interval has the form $[a,\infty) = {x \in \Bbb R : x \geq a}$.)
$L_n = [n,\infty), n \in \Bbb N$, with $\cap^\infty_{n=1}L_n=\emptyset$&lt;/p>
&lt;p>(d) A sequence of closed bounded (not necessarily nested) intervals $I_1, I_2, I_3, \dots$ with the property that $\cap^N_{n=1} I_n \neq \emptyset$ for all $N \in \Bbb N$, but $\cap^\infty_{n=1} I_n = \emptyset$.
The answer is negative, because then $\cap^N_{n=1} I_n$ for all $N \in \Bbb N$ is a decreasing sequence of non-empty closed and bounded intervals and therefore its intersection is non-empty.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/2619781/intersection-of-a-sequence-of-closed-intervals">https://math.stackexchange.com/questions/2619781/intersection-of-a-sequence-of-closed-intervals&lt;/a>&lt;/p>
&lt;h2 id="appendix-for-unused-sources">Appendix for unused sources&lt;/h2>
&lt;p>By definition, $d$ is an upper bound for A. So it is an upper bound for $E$, because if there exists $e \in E$ with $d&amp;lt;e$, then $d&amp;lt;\frac{d+e}{2}$. $\frac{d+e}{2}$ cannot be in $B$ (indeed, it&amp;rsquo;s not an upper bound for $E$, because it&amp;rsquo;s less than $e$) so it must be in $A$, but this contradicts that $d$ is an upper bound for $A$.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/2228772/assume-mathbbr-possesses-the-cut-property-and-let-e-be-a-nonempty-that-is-b">https://math.stackexchange.com/questions/2228772/assume-mathbbr-possesses-the-cut-property-and-let-e-be-a-nonempty-that-is-b&lt;/a>&lt;/p>
&lt;p>If possible, suppose $A$ has a greatest member, say $a'$. Then $a' \in A \Rightarrow a' \not\in B$, so $a'$ is not an upper bound of $E$, and there exists $s \in E$ such that $a' &amp;lt; s$. Since $(a'+s)/2 &amp;gt; a'$ and $a'$ is the greatest member of $A$, we must have $(a'+s)/2 \in B$, so $(a'+s)/2$ is an upper bound of $E$; but $(a'+s)/2 &amp;lt; s$ with $s \in E$, a contradiction. Hence $A$ has no greatest member, and so $B$ has a least member. Therefore the set of upper bounds of a non-empty set $E$ bounded above has a least member, which is the completeness axiom in $\Bbb R$. Hence the theorem is proved.&lt;/p>
&lt;p>Read more: &lt;a href="https://www.emathzone.com/tutorials/real-analysis/dedekind-property.html#ixzz72M0m5FcR">https://www.emathzone.com/tutorials/real-analysis/dedekind-property.html#ixzz72M0m5FcR&lt;/a>&lt;/p></description></item><item><title>Modelling the dynamics of a Chlamydia infection</title><link>https://siqi-zheng.rbind.io/post/2021-08-04-bayes-bio-model-1/</link><pubDate>Wed, 04 Aug 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-08-04-bayes-bio-model-1/</guid><description>&lt;p>The assignment requirements can be found above by clicking the &amp;lsquo;assignment&amp;rsquo; icon.&lt;/p>
&lt;ul>
&lt;li>
&lt;a href="#task-1">Task 1&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#task-2">Task 2&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#a-note-on-selection-of-priors">A note on selection of priors&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#dose-1">Dose 1&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#prior-predictive-check-for-model-on-dose-1-data">Prior Predictive Check for Model on Dose 1 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#posterior-predictive-check-for-model-on-dose-1-data">Posterior Predictive Check for Model on Dose 1 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#leave-one-out-cross-validation-for-dose-1">Leave one out cross validation for dose 1&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#dose-2">Dose 2&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#prior-predictive-check-for-model-on-dose-2-data">Prior Predictive Check for Model on Dose 2 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#posterior-predictive-check-for-model-on-dose-2-data">Posterior Predictive Check for Model on Dose 2 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#leave-one-out-cross-validation-for-dose-2">Leave one out cross validation for dose 2&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#dose-3">Dose 3&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#prior-predictive-check-for-model-on-dose-3-data">Prior Predictive Check for Model on Dose 3 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#posterior-predictive-check-for-model-on-dose-3-data">Posterior Predictive Check for Model on Dose 3 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#leave-one-out-cross-validation-for-dose-3">Leave one out cross validation for dose 3&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#dose-4">Dose 4&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#prior-predictive-check-for-model-on-dose-4-data">Prior Predictive Check for Model on Dose 4 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#posterior-predictive-check-for-model-on-dose-4-data">Posterior Predictive Check for Model on Dose 4 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#leave-one-out-cross-validation-for-dose-4">Leave one out cross validation for dose 4&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#remarks">Remarks&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#task-3">Task 3&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="task-1">Task 1&lt;/h2>
&lt;p>We have
$$ \frac{dE}{dt} = 0.004 - 2 E(t) - \kappa_1C(t) E(t) $$
$$ \frac{dC}{dt} = P \kappa_2 I(t) - \mu C(t) - \kappa_1 C(t) E(t) $$
$$ \frac{dI}{dt} = \kappa_1 C(t) E(t) - \gamma I(t) - \kappa_2 I(t) $$&lt;/p>
&lt;p>The following code loads the data. Note that doses 1 to 4 correspond to doses 10&lt;sup>1&lt;/sup> to 10&lt;sup>4&lt;/sup>.&lt;/p>
&lt;pre>&lt;code>data = readRDS(&amp;quot;rank_et_al_2003_data.RDS&amp;quot;)
dose1 &amp;lt;- data[1:10,]
dose2 &amp;lt;- data[11:20,]
dose3 &amp;lt;- data[21:30,]
dose4 &amp;lt;- data[31:40,]
dose5 &amp;lt;- data[41:50,]
&lt;/code>&lt;/pre>
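&lt;p>As a quick sanity check (a sketch, assuming the dataset has 50 rows with 10 per dose, as the slicing above implies), we can confirm the slices line up with the dose labels:&lt;/p>
&lt;pre>&lt;code># Hypothetical check: 10 rows per dose, each slice holding a single dose level
stopifnot(nrow(data) == 50, nrow(dose1) == 10)
table(data$dose) # counts of rows per dose level
&lt;/code>&lt;/pre>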
&lt;p>The following code checks whether the model is specified correctly.&lt;/p>
&lt;pre>&lt;code>library(deSolve) # provides ode()
model &amp;lt;- function (t, y, params) {
dy1 &amp;lt;- (40 * 10 ^ (-4)) - 2 * y[1] - params[1] * y[2] * y[1]
dy2 &amp;lt;- params[2] * params[3] * y[3] - params[4] * y[2] - params[1] * y[2] * y[1]
dy3 &amp;lt;- params[1] * y[2] * y[1] - params[5] * y[3] - params[3] * y[3]
list(c(dy1, dy2, dy3))
}
yini &amp;lt;- c(E = 0.96, C = 0.001, I = 0)
params &amp;lt;- c(
kappa1 = 1000,
P = 1000,
kappa2 = 1.3, # 0.4-1.3
mu = 1.2,     # coefficient of the -mu*C(t) term (position 4)
gamma = 1.2
)
out &amp;lt;- ode(y = yini, times = seq(0, 30, 0.1), func = model, parms = params)
df_out &amp;lt;- as.data.frame(out)
# A summary of the numerical solution for C(t)
summary(df_out$C)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001 1.728 1.728 15.164 1.832 248.361
# An overview of the data
head(data)
## t C dose number
## 1 3 0.000 10^1 2
## 2 6 46.829 10^1 2
## 3 9 9.106 10^1 2
## 4 12 18.862 10^1 2
## 5 15 25.366 10^1 2
## 6 18 21.463 10^1 2
&lt;/code>&lt;/pre>
&lt;p>The following code is r2.stan (the Stan model for the differential equations).&lt;/p>
&lt;pre>&lt;code>functions {
vector rhs(real t, vector y,
real P, real kappa1, real kappa2, real gamma, real mu) {
vector[3] dydt;
dydt[1] = (40 * 1e-4) - 2 * y[1] - kappa1 * y[2] * y[1];
dydt[2] = P * kappa2 * y[3] - mu * y[2] - kappa1 * y[2] * y[1];
dydt[3] = kappa1 * y[2] * y[1] - gamma * y[3] - kappa2 * y[3];
return dydt;
}
}
data {
int&amp;lt;lower=0&amp;gt; N;
vector [N] y;
real t[N]; // This must be an array!
// Control
int&amp;lt;lower=0, upper = 1&amp;gt; only_prior;
}
parameters {
real&amp;lt;lower = 0, upper = 2000&amp;gt; P;
real&amp;lt;lower = 0, upper = 2000&amp;gt; kappa1;
real&amp;lt;lower = 0.01, upper = 1.5&amp;gt; kappa2;
real&amp;lt;lower = 0.01, upper = 1.5&amp;gt; gamma;
real&amp;lt;lower = 0.01, upper = 1.5&amp;gt; mu;
real&amp;lt;lower = 0&amp;gt; c;
}
transformed parameters {
// [a, b] makes a row_vector, [a, b]' makes a column vector
// 0 is an int! 0.0 is a real!
vector[N] C; // Outputted
{ // Local computation - isn't saved or outputted!
vector[3] solution[N] = ode_bdf(rhs, [0.96, c, 0.0]', 0, t, P, kappa1, kappa2, gamma, mu);
for(i in 1:N){
C[i] = solution[i,2];
}
}
}
model{
c ~ uniform(0,1);
P ~ uniform(0, 2000);
kappa1 ~ uniform(0, 2000);
kappa2 ~ uniform(0.01, 1.5);
gamma ~ uniform(0.01, 1.5);
mu ~ uniform(0.01, 1.5);
if(only_prior == 0) {
y ~ normal(C, 50);
}
}
generated quantities {
vector&amp;lt;lower=0&amp;gt;[N] y_pred;
vector&amp;lt;lower=0&amp;gt;[N] log_lik;
for (i in 1:N) {
y_pred[i] = abs(normal_rng(C[i], 50)); // abs because the minimum is 0, i.e. we are only interested in
// values larger than zero; sd = 50 because the standard deviation of the dataset is around 50
log_lik[i] = abs(normal_lpdf(y[i] | C[i], 50));
}
}
&lt;/code>&lt;/pre>
&lt;p>When 0.4 &amp;lt; &lt;em>κ&lt;/em>&lt;sub>2&lt;/sub> &amp;lt; 1.4 and 0.01 &amp;lt; &lt;em>C&lt;/em>(0) &amp;lt; 1, the maximum of &lt;em>C&lt;/em>(&lt;em>t&lt;/em>) lies between 100 and 250. We might therefore take the mean of &lt;em>κ&lt;/em>&lt;sub>2&lt;/sub> to be 0.9 and its standard deviation to be 0.3, so that values in the range $[0.3,1.5]$ are highly probable. In practice, a uniform distribution is used instead, because we do not know the exact distribution and it is hard to estimate the standard deviation of each parameter.&lt;/p>
&lt;p>Therefore, we assume weakly informative priors for the other parameters in Stan as well: in particular, uniform distributions that cover the example values. This is largely for two reasons. First, most parameters that require priors do not come with enough information for us to set up a good distribution. Second, since there are five different doses in the dataset, a distribution that can take values over a large range is preferred. As you will see later on, this choice provides a fair estimate of the parameters.&lt;/p>
&lt;p>We assume the parameters on the 10&lt;sup>3&lt;/sup> scale range from 0 to 2000. The lower bound 0 is self-explanatory; the upper bound is taken as 1000 + (1000 - lower bound) = 2000. Similarly, we take 1.5 as the upper bound for the other parameters because of the range of &lt;em>κ&lt;/em>&lt;sub>2&lt;/sub>.&lt;/p>
&lt;p>To estimate the rate of change of C at t = 0, we take the average C on day 3 for each dose and divide it by 3 (days). This assumes that C is increasing over that interval, so that C(0) is not underestimated, which is reasonable given that C increases until the dose takes effect. However, this approach has some limitations, which we address further in task 3. C(0) is therefore an average of the estimates of C(0) from all 5 doses.&lt;/p>
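&lt;p>A minimal sketch of this day-3 estimate (hypothetical helper names; it assumes the column names shown by &lt;code>head(data)&lt;/code> above):&lt;/p>
&lt;pre>&lt;code># Hypothetical sketch: average C on day 3 for each dose, divided by 3 (days)
c0_day3 &amp;lt;- sapply(list(dose1, dose2, dose3, dose4, dose5),
                  function(d) mean(d$C[d$t == 3]) / 3)
mean(c0_day3) # average of the estimates from the 5 doses
&lt;/code>&lt;/pre>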
&lt;pre>&lt;code>mod &amp;lt;- cmdstan_model(&amp;quot;r2.stan&amp;quot;)
&lt;/code>&lt;/pre>
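&lt;p>The fits used below (&lt;code>fit&lt;/code>, &lt;code>fit2&lt;/code>, &amp;hellip;) are produced by calls along the following lines (a sketch; the data list names follow the &lt;code>data&lt;/code> block of r2.stan, and setting &lt;code>only_prior = 1&lt;/code> samples from the prior only):&lt;/p>
&lt;pre>&lt;code># Sketch of a cmdstanr fitting call for dose 1
fit &amp;lt;- mod$sample(
  data = list(N = nrow(dose1), y = dose1$C, t = dose1$t, only_prior = 0),
  chains = 4, parallel_chains = 4
)
&lt;/code>&lt;/pre>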
&lt;h2 id="task-2">Task 2&lt;/h2>
&lt;h3 id="a-note-on-selection-of-priors">A note on selection of priors&lt;/h3>
&lt;p>I attempted different ranges of values for the uniform distributions; however, the posterior predictive check shows that the noise is higher than expected and the model does not fit very well. I also tried other models, including normal distributions and uniform distributions with other parameters, but the results were no better, so the uniform distribution is used. Still, I would suggest that future researchers collect more data (one data point per day) or apply other models designed specifically for this question.&lt;/p>
&lt;h3 id="dose-1">Dose 1&lt;/h3>
&lt;h4 id="prior-predictive-check-for-model-on-dose-1-data">Prior Predictive Check for Model on Dose 1 Data&lt;/h4>
&lt;pre>&lt;code>mcmc_hist(fit$draws(&amp;quot;C&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose1-prior-predictive-check-1.png" alt="">&lt;/p>
&lt;pre>&lt;code>mcmc_hist(fit$draws(&amp;quot;y_pred&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose1-prior-predictive-check-2.png" alt="">&lt;/p>
&lt;h4 id="posterior-predictive-check-for-model-on-dose-1-data">Posterior Predictive Check for Model on Dose 1 Data&lt;/h4>
&lt;pre>&lt;code>yrep = fit$draws() %&amp;gt;% reshape2::melt() %&amp;gt;% filter(str_detect(variable, &amp;quot;y_pred&amp;quot;)) %&amp;gt;%
extract(col = variable, into = &amp;quot;ind&amp;quot;,
regex = &amp;quot;y_pred\\[([0-9]*)\\]&amp;quot;,
convert = TRUE) %&amp;gt;%
pivot_wider(id_cols = c(&amp;quot;chain&amp;quot;,&amp;quot;iteration&amp;quot;),
names_from = &amp;quot;ind&amp;quot;) %&amp;gt;%
select(-c(&amp;quot;chain&amp;quot;, &amp;quot;iteration&amp;quot;)) %&amp;gt;% as.matrix
ppc_stat(dose1$C, yrep, stat = &amp;quot;min&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose1-posterior-predictive-check-1.png" alt="">&lt;/p>
&lt;h4 id="leave-one-out-cross-validation-for-dose-1">Leave one out cross validation for dose 1&lt;/h4>
&lt;pre>&lt;code>## By default, it looks for something called &amp;quot;log_lik&amp;quot;, but you can override this
## with the variables = argument. Eg if you called your log-likelihood &amp;quot;ll&amp;quot;,
## you could run loo1 &amp;lt;- fit$loo(save_psis=TRUE, variable = &amp;quot;ll&amp;quot;)
loo1 &amp;lt;- fit$loo(save_psis=TRUE)
print(loo1)
##
## Computed from 4000 by 10 log-likelihood matrix
##
## Estimate SE
## elpd_loo 49.1 0.4
## p_loo 0.3 0.3
## looic -98.3 0.7
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 6 60.0% 2223
## (0.5, 0.7] (ok) 0 0.0% &amp;lt;NA&amp;gt;
## (0.7, 1] (bad) 0 0.0% &amp;lt;NA&amp;gt;
## (1, Inf) (very bad) 4 40.0% 2420
## See help('pareto-k-diagnostic') for details.
plot(loo1)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="loo1-1.png" alt="">&lt;/p>
&lt;p>All except four of our points are good according to the leave-one-out cross-validation for this model. This is not great, but we may take a look at the estimate of C(0) to determine whether we really identify C(0) fairly.&lt;/p>
&lt;h3 id="dose-2">Dose 2&lt;/h3>
&lt;h4 id="prior-predictive-check-for-model-on-dose-2-data">Prior Predictive Check for Model on Dose 2 Data&lt;/h4>
&lt;pre>&lt;code>mcmc_hist(fit2$draws(&amp;quot;C&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose2-prior-predictive-check-1.png" alt="">&lt;/p>
&lt;pre>&lt;code>mcmc_hist(fit2$draws(&amp;quot;y_pred&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose2-prior-predictive-check-2.png" alt="">&lt;/p>
&lt;h4 id="posterior-predictive-check-for-model-on-dose-2-data">Posterior Predictive Check for Model on Dose 2 Data&lt;/h4>
&lt;pre>&lt;code>yrep2 = fit2$draws() %&amp;gt;% reshape2::melt() %&amp;gt;% filter(str_detect(variable, &amp;quot;y_pred&amp;quot;)) %&amp;gt;%
extract(col = variable, into = &amp;quot;ind&amp;quot;,
regex = &amp;quot;y_pred\\[([0-9]*)\\]&amp;quot;,
convert = TRUE) %&amp;gt;%
pivot_wider(id_cols = c(&amp;quot;chain&amp;quot;,&amp;quot;iteration&amp;quot;),
names_from = &amp;quot;ind&amp;quot;) %&amp;gt;%
select(-c(&amp;quot;chain&amp;quot;, &amp;quot;iteration&amp;quot;)) %&amp;gt;% as.matrix
ppc_stat(dose2$C, yrep2, stat = &amp;quot;min&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose2-posterior-predictive-check-1.png" alt="">&lt;/p>
&lt;h4 id="leave-one-out-cross-validation-for-dose-2">Leave one out cross validation for dose 2&lt;/h4>
&lt;pre>&lt;code>loo2 &amp;lt;- fit2$loo(save_psis=TRUE)
print(loo2)
##
## Computed from 4000 by 10 log-likelihood matrix
##
## Estimate SE
## elpd_loo 49.7 0.7
## p_loo 0.5 0.4
## looic -99.4 1.4
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 6 60.0% 2138
## (0.5, 0.7] (ok) 0 0.0% &amp;lt;NA&amp;gt;
## (0.7, 1] (bad) 0 0.0% &amp;lt;NA&amp;gt;
## (1, Inf) (very bad) 4 40.0% 1931
## See help('pareto-k-diagnostic') for details.
plot(loo2)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="loo2-1.png" alt="">&lt;/p>
&lt;h3 id="dose-3">Dose 3&lt;/h3>
&lt;h4 id="prior-predictive-check-for-model-on-dose-3-data">Prior Predictive Check for Model on Dose 3 Data&lt;/h4>
&lt;pre>&lt;code>mcmc_hist(fit3$draws(&amp;quot;C&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose3-prior-predictive-check-1.png" alt="">&lt;/p>
&lt;pre>&lt;code>mcmc_hist(fit3$draws(&amp;quot;y_pred&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose3-prior-predictive-check-2.png" alt="">&lt;/p>
&lt;h4 id="posterior-predictive-check-for-model-on-dose-3-data">Posterior Predictive Check for Model on Dose 3 Data&lt;/h4>
&lt;pre>&lt;code>yrep3 = fit3$draws() %&amp;gt;% reshape2::melt() %&amp;gt;% filter(str_detect(variable, &amp;quot;y_pred&amp;quot;)) %&amp;gt;%
extract(col = variable, into = &amp;quot;ind&amp;quot;,
regex = &amp;quot;y_pred\\[([0-9]*)\\]&amp;quot;,
convert = TRUE) %&amp;gt;%
pivot_wider(id_cols = c(&amp;quot;chain&amp;quot;,&amp;quot;iteration&amp;quot;),
names_from = &amp;quot;ind&amp;quot;) %&amp;gt;%
select(-c(&amp;quot;chain&amp;quot;, &amp;quot;iteration&amp;quot;)) %&amp;gt;% as.matrix
ppc_stat(dose3$C, yrep3, stat = &amp;quot;min&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose3-posterior-predictive-check-1.png" alt="">&lt;/p>
&lt;h4 id="leave-one-out-cross-validation-for-dose-3">Leave one out cross validation for dose 3&lt;/h4>
&lt;pre>&lt;code>loo3 &amp;lt;- fit3$loo(save_psis=TRUE)
print(loo3)
##
## Computed from 4000 by 10 log-likelihood matrix
##
## Estimate SE
## elpd_loo 51.0 1.3
## p_loo 3.7 2.1
## looic -101.9 2.5
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k &amp;lt; 0.5).
## See help('pareto-k-diagnostic') for details.
plot(loo3)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="loo3-1.png" alt="">&lt;/p>
&lt;h3 id="dose-4">Dose 4&lt;/h3>
&lt;h4 id="prior-predictive-check-for-model-on-dose-4-data">Prior Predictive Check for Model on Dose 4 Data&lt;/h4>
&lt;pre>&lt;code>mcmc_hist(fit4$draws(&amp;quot;C&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose4-prior-predictive-check-1.png" alt="">&lt;/p>
&lt;pre>&lt;code>mcmc_hist(fit4$draws(&amp;quot;y_pred&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose4-prior-predictive-check-2.png" alt="">&lt;/p>
&lt;h4 id="posterior-predictive-check-for-model-on-dose-4-data">Posterior Predictive Check for Model on Dose 4 Data&lt;/h4>
&lt;pre>&lt;code>yrep4 = fit4$draws() %&amp;gt;% reshape2::melt() %&amp;gt;% filter(str_detect(variable, &amp;quot;y_pred&amp;quot;)) %&amp;gt;%
extract(col = variable, into = &amp;quot;ind&amp;quot;,
regex = &amp;quot;y_pred\\[([0-9]*)\\]&amp;quot;,
convert = TRUE) %&amp;gt;%
pivot_wider(id_cols = c(&amp;quot;chain&amp;quot;,&amp;quot;iteration&amp;quot;),
names_from = &amp;quot;ind&amp;quot;) %&amp;gt;%
select(-c(&amp;quot;chain&amp;quot;, &amp;quot;iteration&amp;quot;)) %&amp;gt;% as.matrix
ppc_stat(dose4$C, yrep4, stat = &amp;quot;min&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose4-posterior-predictive-check-1.png" alt="">&lt;/p>
&lt;h4 id="leave-one-out-cross-validation-for-dose-4">Leave one out cross validation for dose 4&lt;/h4>
&lt;pre>&lt;code>loo4 &amp;lt;- fit4$loo(save_psis=TRUE)
print(loo4)
##
## Computed from 4000 by 10 log-likelihood matrix
##
## Estimate SE
## elpd_loo 52.5 2.9
## p_loo 3.3 2.2
## looic -105.0 5.9
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k &amp;lt; 0.5).
## See help('pareto-k-diagnostic') for details.
plot(loo4)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="loo4-1.png" alt="">&lt;/p>
&lt;h3 id="remarks">Remarks&lt;/h3>
&lt;p>From the prior predictive check, we can see the estimated distribution is more heavy-tailed than the actual distribution. In the posterior predictive check, T(y) is the skewness. The model captures the observed statistic to some extent for all 4 doses, but leave-one-out cross-validation shows that the model fits the data for doses 3 and 4 better, as all points from doses 3 and 4 are good.&lt;/p>
&lt;h2 id="task-3">Task 3&lt;/h2>
&lt;pre>&lt;code>vec_c0 &amp;lt;- unlist(c0_estimate)
hist(vec_c0, breaks=20)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="unnamed-chunk-2-1.png" alt="">
The advantage of estimating C(0) this way is that the approach uses all available data and keeps its statistical power, since we are conditioning on all of the data; however, it takes a long time to produce the results. For instance, a laptop with one core will need more than an hour for this task.&lt;/p></description></item><item><title>Mathematics Theorems and Proofs in Applied Multivariate Statistical Analysis (CH.1)</title><link>https://siqi-zheng.rbind.io/post/2021-07-23-amsa-chapter-1/</link><pubDate>Wed, 21 Jul 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-07-23-amsa-chapter-1/</guid><description>&lt;h2 id="details-in-chapter-1-johnson--wichern-2002">Details in Chapter 1 (Johnson &amp;amp; Wichern, 2002)&lt;/h2>
&lt;p>P78 (2-48)&lt;/p>
&lt;p>&lt;strong>Cauchy-Schwarz Inequality&lt;/strong>. Let $\mathbf{b}$ and $\mathbf{d}$ be any two $p\times 1$ vectors. Then
$$
\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}\leq(\mathbf{b}^{\prime} \mathbf{b}){(\mathbf{d}^{\prime} \mathbf{d})}
$$
with equality if and only if $\mathbf{b}=c\mathbf{d}$ (or $c\mathbf{d}=\mathbf{b}$) for some constant c.&lt;/p>
&lt;p>Proof. The inequality is obvious if either $\mathbf{b}=\mathbf{0}$ or $\mathbf{d}=\mathbf{0}$. Excluding this possibility, consider the vector $\mathbf{b}-x \mathbf{d}$, where $x$ is an arbitrary scalar. Since the length of $\mathbf{b}-x \mathbf{d}$ is positive for $\mathbf{b}-x \mathbf{d} \neq \mathbf{0}$, in this case
$$
\begin{aligned}
0&amp;lt;(\mathbf{b}-x \mathbf{d})^{\prime}(\mathbf{b}-x \mathbf{d}) &amp;amp;=\mathbf{b}^{\prime} \mathbf{b}-x \mathbf{d}^{\prime} \mathbf{b}-\mathbf{b}^{\prime}(x \mathbf{d})+x^{2} \mathbf{d}^{\prime} \mathbf{d} \\
&amp;amp;=\mathbf{b}^{\prime} \mathbf{b}-2 x\left(\mathbf{b}^{\prime} \mathbf{d}\right)+x^{2}\left(\mathbf{d}^{\prime} \mathbf{d}\right)
\end{aligned}
$$
The last expression is quadratic in $x .$ If we complete the square by adding and subtracting the scalar $\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2} / \mathbf{d}^{\prime} \mathbf{d}$, we get
$$
\begin{gathered}
0&amp;lt;\mathbf{b}^{\prime} \mathbf{b}-\frac{\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}}{\mathbf{d}^{\prime} \mathbf{d}}+\frac{\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}}{\mathbf{d}^{\prime} \mathbf{d}}-2 x\left(\mathbf{b}^{\prime} \mathbf{d}\right)+x^{2}\left(\mathbf{d}^{\prime} \mathbf{d}\right) \\
=\mathbf{b}^{\prime} \mathbf{b}-\frac{\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}}{\mathbf{d}^{\prime} \mathbf{d}}+\left(\mathbf{d}^{\prime} \mathbf{d}\right)\left(x-\frac{\mathbf{b}^{\prime} \mathbf{d}}{\mathbf{d}^{\prime} \mathbf{d}}\right)^{2}
\end{gathered}
$$
The term in brackets is zero if we choose $x=\mathbf{b}^{\prime} \mathbf{d} / \mathbf{d}^{\prime} \mathbf{d}$, so we conclude that
$$
0&amp;lt;\mathbf{b}^{\prime} \mathbf{b}-\frac{\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}}{\mathbf{d}^{\prime} \mathbf{d}}
$$
or $\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}&amp;lt;\left(\mathbf{b}^{\prime} \mathbf{b}\right)\left(\mathbf{d}^{\prime} \mathbf{d}\right)$ if $\mathbf{b} \neq x \mathbf{d}$ for every scalar $x$.
Note that if $\mathbf{b}=c \mathbf{d}, 0=(\mathbf{b}-c \mathbf{d})^{\prime}(\mathbf{b}-c \mathbf{d})$, and the same argument produces
$\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}=\left(\mathbf{b}^{\prime} \mathbf{b}\right)\left(\mathbf{d}^{\prime} \mathbf{d}\right)$&lt;/p>
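&lt;p>A quick numerical spot-check of the inequality in R (an illustration only, not part of the proof):&lt;/p>
&lt;pre>&lt;code># (b'd)^2 should not exceed (b'b)(d'd)
b &amp;lt;- c(1, 2, 3); d &amp;lt;- c(4, -1, 2)
(sum(b * d))^2 &amp;lt;= sum(b^2) * sum(d^2) # TRUE: 64 &amp;lt;= 294
&lt;/code>&lt;/pre>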
&lt;p>&lt;strong>Extended Cauchy-Schwarz Inequality&lt;/strong>. Let $\mathbf{b}$ and $\mathbf{d}$ be any two $p\times 1$ vectors, and let $\mathbf{B}$ be a positive definite matrix. Then
$$
\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}\leq(\mathbf{b}^{\prime} \mathbf{B} \mathbf{b}){(\mathbf{d}^{\prime} \mathbf{B}^{-1} \mathbf{d})}
$$&lt;/p>
&lt;p>with equality if and only if $\mathbf{b}=c\mathbf{B}^{-1}\mathbf{d}$ or $\mathbf{d}=c\mathbf{B}\mathbf{b}$ for some constant c.&lt;/p>
&lt;p>Proof. The inequality is obvious when $\mathbf{b}=\mathbf{0}$ or $\mathbf{d}=\mathbf{0}$. For cases other than these, consider the square-root matrix $\mathbf{B}^{1 / 2}$ defined in terms of its eigenvalues $\lambda_{i}$ and the normalized eigenvectors $\mathbf{e}_{i}$ as $\mathbf{B}^{1 / 2}=\sum_{i=1}^{p} \sqrt{\lambda_{i}} \mathbf{e}_{i} \mathbf{e}_{i}^{\prime} .$ If we set
$$
\mathbf{B}^{-1 / 2}=\sum_{i=1}^{p} \frac{1}{\sqrt{\lambda_{i}}} \mathbf{e}_{i} \mathbf{e}_{i}^{\prime}
$$
it follows that
$$
\mathbf{b}^{\prime} \mathbf{d}=\mathbf{b}^{\prime} \mathbf{I} \mathbf{d}=\mathbf{b}^{\prime} \mathbf{B}^{1 / 2} \mathbf{B}^{-1 / 2} \mathbf{d}=\left(\mathbf{B}^{1 / 2} \mathbf{b}\right)^{\prime}\left(\mathbf{B}^{-1 / 2} \mathbf{d}\right)
$$
and the proof is completed by applying the Cauchy-Schwarz inequality to the vectors $\left(\mathbf{B}^{1 / 2} \mathbf{b}\right)$ and $\left(\mathbf{B}^{-1 / 2} \mathbf{d}\right)$&lt;/p>
&lt;p>Explicitly, let $\mathbf{u}=\mathbf{B}^{1 / 2} \mathbf{b}$ and $\mathbf{v}=\mathbf{B}^{-1 / 2} \mathbf{d}$; then
$$
\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}=\left(\mathbf{u}^{\prime} \mathbf{v}\right)^{2} \leq\left(\mathbf{u}^{\prime} \mathbf{u}\right)\left(\mathbf{v}^{\prime} \mathbf{v}\right)=\left(\mathbf{B}^{1 / 2} \mathbf{b}\right)^{\prime}\left(\mathbf{B}^{1 / 2} \mathbf{b}\right)\left(\mathbf{B}^{-1 / 2} \mathbf{d}\right)^{\prime}\left(\mathbf{B}^{-1 / 2} \mathbf{d}\right)=\left(\mathbf{b}^{\prime} \mathbf{B} \mathbf{b}\right)\left(\mathbf{d}^{\prime} \mathbf{B}^{-1} \mathbf{d}\right)
$$&lt;/p>
&lt;p>The extended Cauchy-Schwarz inequality gives rise to the following maximization result.&lt;/p>
&lt;p>&lt;strong>Maximization Lemma.&lt;/strong> Let $\underset{(p \times p)}{\mathbf{B}}$ be positive definite and $\underset{(p \times 1)}{\mathbf{d}}$ be a given vector. Then, for an arbitrary nonzero vector $\underset{(p \times 1)}{\mathbf{x}}$,
$$
\max _{\mathbf{x} \neq \mathbf{0}} \frac{\left(\mathbf{x}^{\prime} \mathbf{d}\right)^{2}}{\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}}=\mathbf{d}^{\prime} \mathbf{B}^{-1} \mathbf{d}
$$
with the maximum attained when $\underset{(p \times 1)}{\mathbf{x}}=c \mathbf{B}^{-1} \mathbf{d}$ for any constant $c \neq 0$. Proof. By the extended Cauchy-Schwarz inequality, $\left(\mathbf{x}^{\prime} \mathbf{d}\right)^{2} \leq\left(\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}\right)\left(\mathbf{d}^{\prime} \mathbf{B}^{-1} \mathbf{d}\right)$.
Because $\mathbf{x} \neq \mathbf{0}$ and $\mathbf{B}$ is positive definite, $\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}&amp;gt;0$. Dividing both sides of the
inequality by the positive scalar $\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}$ yields the upper bound
$$
\frac{\left(\mathbf{x}^{\prime} \mathbf{d}\right)^{2}}{\boldsymbol{x}^{\prime} \mathbf{B} \mathbf{x}} \leq \mathbf{d}^{\prime} \mathbf{B}^{-1} \mathbf{d}
$$&lt;/p>
&lt;p>Taking the maximum over $\mathbf{x}$ gives Equation $(2-50)$ because the bound is attained for $\mathbf{x}=c \mathbf{B}^{-1} \mathbf{d} .$&lt;/p>
&lt;p>A final maximization result will provide us with an interpretation of eigenvalues.&lt;/p>
&lt;p>&lt;strong>Maximization of Quadratic Forms for Points on the Unit Sphere.&lt;/strong> Let $\mathbf{B}$ be a positive definite matrix with eigenvalues $\lambda_{1} \geq \lambda_{2} \geq \cdots \geq \lambda_{p} \geq 0$ and associated normalized eigenvectors $\mathbf{e}_{\mathbf{1}}, \mathbf{e}_{2}, \ldots, \mathbf{e}_{p}$. Then&lt;/p>
&lt;p>$$
\max_{\mathbf{x} \neq \mathbf{0}} \frac{\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}}{\mathbf{x}^{\prime} \mathbf{x}}=\lambda_{1}\quad \text { (attained when } \mathbf{x}=\mathbf{e}_{1} \text {)}
$$&lt;/p>
&lt;p>$$
\min_{\mathbf{x} \neq \mathbf{0}} \frac{\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}}{\mathbf{x}^{\prime} \mathbf{x}}=\lambda_{p} \quad \text { (attained when } \mathbf{x}=\mathbf{e}_{p} \text {)}
$$&lt;/p>
&lt;p>Moreover,&lt;/p>
&lt;p>$$
\max_{\mathbf{x} \perp \mathbf{e}_{1},\ldots,\mathbf{e}_{k}} \frac{\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}}{\mathbf{x}^{\prime} \mathbf{x}}=\lambda_{k+1} \quad \text { (attained when } \mathbf{x}=\mathbf{e}_{k+1} \text {, } k=1,2,\ldots,p-1 \text {)}
$$&lt;/p>
&lt;p>where the symbol $\perp$ is read &amp;ldquo;is perpendicular to.&amp;rdquo;&lt;/p>
&lt;p>Proof. Let $\underset{( p \times p)}{\mathbf{P}}$ be the orthogonal matrix whose columns are the eigenvectors
$\mathbf{e}_{1}, \mathbf{e}_{2}, \ldots, \mathbf{e}_{p}$ and $\mathbf{\Lambda}$ be the diagonal matrix with eigenvalues $\lambda_{1}, \lambda_{2}, \ldots, \lambda_{p}$ along the
main diagonal. Let $\mathbf{B}^{1 / 2}=\mathbf{P} \mathbf{\Lambda}^{1 / 2} \mathbf{P}^{\prime}$ and $\underset{(p \times 1)}{\mathbf{y}}=\underset{(p \times p)(p \times 1)}{\mathbf{P}^{\prime} \mathbf{x}}$.
Consequently, $\mathbf{x} \neq \mathbf{0}$ implies $\mathbf{y} \neq \mathbf{0}$. Thus,
$$
\begin{aligned}
\frac{\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}}{\mathbf{x}^{\prime} \mathbf{x}} &amp;amp;=\frac{\mathbf{x}^{\prime} \mathbf{B}^{1 / 2} \mathbf{B}^{1 / 2} \mathbf{x}}{\mathbf{x}^{\prime} \underbrace{\mathbf{P P}^{\prime}}_{\mathbf{I} \atop(p \times p)} \mathbf{x}}=\frac{\mathbf{x}^{\prime} \mathbf{P} \mathbf{\Lambda}^{1 / 2} \mathbf{P}^{\prime} \mathbf{P} \mathbf{\Lambda}^{1 / 2} \mathbf{P}^{\prime} \mathbf{x}}{\mathbf{y}^{\prime} \mathbf{y}}=\frac{\mathbf{y}^{\prime} \mathbf{\Lambda} \mathbf{y}}{\mathbf{y}^{\prime} \mathbf{y}} \\
&amp;amp;=\frac{\sum_{i=1}^{p} \lambda_{i} y_{i}^{2}}{\sum_{i=1}^{p} y_{i}^{2}} \leq \lambda_{1} \frac{\sum_{i=1}^{p} y_{i}^{2}}{\sum_{i=1}^{p} y_{i}^{2}}=\lambda_{1}
\end{aligned}
$$&lt;/p>
&lt;p>Setting $\mathbf{x}=\mathbf{e}_{1}$ gives
$$
\mathbf{y}=\mathbf{P}^{\prime} \mathbf{e}_{1}=\left[\begin{array}{c}
1 \\
0 \\
\vdots \\
0
\end{array}\right]
$$
since
$$
\mathbf{e}_{k}^{\prime} \mathbf{e}_{1}= \begin{cases}1, &amp;amp; k=1 \\ 0, &amp;amp; k \neq 1\end{cases}
$$
For this choice of $\mathbf{x}$, we have $\mathbf{y}^{\prime} \mathbf{\Lambda} \mathbf{y} / \mathbf{y}^{\prime} \mathbf{y}=\lambda_{1} / 1=\lambda_{1}$, or
$$
\frac{\mathbf{e}_{1}^{\prime} \mathbf{B e}_{1}}{\mathbf{e}_{1}^{\prime} \mathbf{e}_{1}}=\mathbf{e}_{1}^{\prime} \mathbf{B e}_{1}=\lambda_{1}
$$
A similar argument produces the second part of $(2-51)$. Now, $\mathbf{x}=\mathbf{P y}=y_{1} \mathbf{e}_{1}+y_{2} \mathbf{e}_{2}+\cdots+y_{p} \mathbf{e}_{p}$, so $\mathbf{x} \perp \mathbf{e}_{1}, \ldots, \mathbf{e}_{k}$ implies
$$
0=\mathbf{e}_{i}^{\prime} \mathbf{x}=y_{1} \mathbf{e}_{i}^{\prime} \mathbf{e}_{1}+y_{2} \mathbf{e}_{i}^{\prime} \mathbf{e}_{2}+\cdots+y_{p} \mathbf{e}_{i}^{\prime} \mathbf{e}_{p}=y_{i}, \quad i \leq k
$$
Therefore, for $x$ perpendicular to the first $k$ eigenvectors $\mathbf{e}_{i}$, the left-hand side of the inequality in $(2-53)$ becomes
$$
\frac{\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}}{\mathbf{x}^{\prime} \mathbf{x}}=\frac{\sum_{i=k+1}^{p} \lambda_{i} y_{i}^{2}}{\sum_{i=k+1}^{p} y_{i}^{2}}
$$
Taking $y_{k+1}=1, y_{k+2}=\cdots=y_{p}=0$ gives the asserted maximum.
For a fixed $\mathbf{x}_{0} \neq \mathbf{0}, \mathbf{x}_{0}^{\prime} \mathbf{B} \mathbf{x}_{0} / \mathbf{x}_{0}^{\prime} \mathbf{x}_{0}$ has the same value as $\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}$, where
$\mathbf{x}^{\prime}=\mathbf{x}_{0}^{\prime} / \sqrt{\mathbf{x}_{0}^{\prime} \mathbf{x}_{0}}$ is of unit length. Consequently, Equation (2-51) says that the largest eigenvalue, $\lambda_{1}$, is the maximum value of the quadratic form $\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}$ for all points $\mathbf{x}$ whose distance from the origin is unity. Similarly, $\lambda_{p}$ is the smallest value of the quadratic form for all points $x$ one unit from the origin. The largest and smallest eigenvalues thus represent extreme values of $\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}$ for points on the unit sphere. The &amp;ldquo;intermediate&amp;rdquo; eigenvalues of the $p \times p$ positive definite matrix $B$ also have an interpretation as extreme values when $\mathbf{x}$ is further restricted to be perpendicular to the earlier choices.&lt;/p>
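&lt;p>This interpretation is easy to verify numerically in R (a sketch using a randomly generated positive definite matrix):&lt;/p>
&lt;pre>&lt;code>set.seed(1)
A &amp;lt;- matrix(rnorm(9), 3, 3)
B &amp;lt;- crossprod(A) + diag(3) # A'A + I is positive definite
e &amp;lt;- eigen(B)
x &amp;lt;- e$vectors[, 1]         # normalized leading eigenvector
drop(t(x) %*% B %*% x / crossprod(x)) # equals e$values[1], the largest eigenvalue
&lt;/code>&lt;/pre>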
&lt;h2 id="an-example-of-the-application-of-cauchy-schwarz-inequality-cramér-1946">An Example of the Application of Cauchy-Schwarz Inequality (Cramér, 1946)&lt;/h2>
&lt;p>In statistical problems, large amounts of data are collected to study a phenomenon. With a desire to derive a mathematical model to describe it, we may find, numerically, a function $\widetilde{\phi}$ to approximate a parameter $\phi$. $\widetilde{\phi}$ is called an unbiased estimator of $\phi$ if $E(\widetilde{\phi})=\phi . \quad$ That is
$$
\int_{-\infty}^{\infty} \widetilde{\phi} f_{\theta}(x) d x=\phi(\theta)
$$
Here, $\theta$ and $x$ are independent parameters. Differentiating this with respect to $\theta$ and interchanging integration and differentiation (provided of course that this is permissible) gives:
$$
\int_{-\infty}^{\infty} \widetilde{\phi}(x) \frac{\partial f_{\theta}}{\partial \theta}(x) d x=\phi^{\prime}(\theta)
$$
The rate of change of information is the function
$$
S(x):=\frac{\partial}{\partial \theta} \log f_{\theta}(x)
$$
called the score statistic. Plainly, $S(x)=\frac{1}{f_{\theta}(x)} \frac{\partial f_{\theta}}{\partial \theta}(x)$, so that we can write
$$
\int_{-\infty}^{\infty} \widetilde{\phi}(x) S(x) f_{\theta}(x) d x=\phi^{\prime}(\theta) .
$$
Also, the expectation of $S(x)$ is
$$
E(S(x))=\int_{-\infty}^{\infty} S(x) f_{\theta}(x) d x=\int_{-\infty}^{\infty} \frac{\partial f_{\theta}}{\partial \theta}(x) d x=\frac{\partial}{\partial \theta} \int_{-\infty}^{\infty} f_{\theta}(x) d x=0
$$
since
$$
\int_{-\infty}^{\infty} f_{\theta}(x) d x=1
$$
because the total probability is $1 .$ Thus, (4.1) can be re-written as
$$
\int_{-\infty}^{\infty}(\widetilde{\phi}(x)-\phi(\theta)) S(x) f_{\theta}(x) d x=\phi^{\prime}(\theta) .
$$
Applying the Cauchy-Schwarz inequality, we obtain
$$
\phi^{\prime}(\theta)^{2} \leq\left(\int_{-\infty}^{\infty}(\widetilde{\phi}(x)-\phi(\theta))^{2} f_{\theta}(x) d x\right)\left(\int_{-\infty}^{\infty} S(x)^{2} f_{\theta}(x) d x\right)
$$&lt;/p>
&lt;p>Writing
$$
I(\theta):=\int_{-\infty}^{\infty}\left(\frac{\partial \log f_{\theta}}{\partial \theta}\right)^{2} f_{\theta}(x) d x
$$
(called Fisher information in statistical parlance), we can write our inequality as:&lt;/p>
&lt;p>&lt;strong>Theorem 5&lt;/strong> (The Cramér-Rao inequality). For an unbiased estimator $\widetilde{\phi}$ of $\phi$, we have
$$
\int_{-\infty}^{\infty}(\widetilde{\phi}(x)-\phi(\theta))^{2} f_{\theta}(x) d x \geq \frac{\phi^{\prime}(\theta)^{2}}{I(\theta)} .
$$
Often, this is applied with $\phi(\theta)=\theta$ so that $\phi^{\prime}(\theta)=1$. The inequality then gives a limit on the accuracy of any unbiased estimator of $\theta$. Sometimes it is referred to as the information inequality. It was discovered independently by C. R. Rao [10] and H. Cramér [2] in 1945 and has played a pivotal role in statistical inference. An enlightening survey of the Cramér-Rao inequality was written by K. R. Parthasarathy [9], where the reader can find a discussion of Riemannian metrics for studying population models.&lt;/p>
&lt;p>Regarding Theorem 5, there is considerable interest in estimators that actually achieve the Cramér-Rao lower bound. Such estimators are said to be asymptotically efficient. Under certain regularity conditions, maximum likelihood estimators are asymptotically efficient. In such cases the Fisher information about $\theta$ in the data is equal to the inverse of the variance of the estimator.&lt;/p>
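&lt;p>To make the bound concrete, here is a small numerical illustration (a hypothetical Python sketch, not part of the original derivation): for $X_1,\dots,X_n$ i.i.d. $N(\theta,\sigma^2)$ with $\sigma$ known, the Fisher information is $I(\theta)=n/\sigma^2$, so the Cramér-Rao lower bound is $\sigma^2/n$ — exactly the variance of the sample mean, which therefore attains the bound.&lt;/p>

```python
import numpy as np

# Hypothetical sketch (not from the original text): check the Cramer-Rao
# bound for estimating the mean theta of N(theta, sigma^2) with sigma known.
# Here I(theta) = n / sigma^2, so the bound on the variance of any unbiased
# estimator is sigma^2 / n, which the sample mean attains.
rng = np.random.default_rng(0)
n, sigma, theta, reps = 50, 2.0, 1.0, 20000
samples = rng.normal(theta, sigma, size=(reps, n))
estimates = samples.mean(axis=1)   # unbiased estimator of theta
crlb = sigma**2 / n                # Cramer-Rao lower bound, sigma^2 / n
emp_var = estimates.var()          # empirical variance of the estimator
print(crlb, emp_var)               # the two agree closely
```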
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ol>
&lt;li>Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press,
Princeton.&lt;/li>
&lt;li>Johnson, R. A., Wichern, D. W. (2002). Applied multivariate statistical analysis. Upper Saddle River, NJ: Prentice Hall. ISBN: 0130925535&lt;/li>
&lt;/ol></description></item><item><title>A Comparison between Two Ways of Coding for Bayesian Statistical Modeling in Toronto Rental Price (with brms)</title><link>https://siqi-zheng.rbind.io/post/2021-07-20-bayes-rental-price/</link><pubDate>Tue, 20 Jul 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-07-20-bayes-rental-price/</guid><description>&lt;h2 id="data-preparation">Data Preparation&lt;/h2>
&lt;p>According to research on the distribution of housing prices in Tokyo (Ohnishi, Mizuno, Shimizu &amp;amp; Watanabe, 2011), housing prices follow a lognormal distribution. We would therefore like to examine whether an exponential model works for the mean rental price $\mu_{ij}$ of unit type $j$ in year $i$ in Toronto, with two predictors (year and unit size). Data were adapted from the Canada Mortgage and Housing Corporation (2021). For copyright reasons, the dataset will not be attached on GitHub; the link to the dataset can be found under the Bibliography section. An overview of the data is provided below.&lt;/p>
&lt;pre>&lt;code class="language-r"># Data Wrangling
library(tidyverse)
# Bayes Models
library(brms)
library(tidybayes)
library(bayesplot)
library(loo)
# Prior libraries
library(extraDistr)
library(cmdstanr)
library(posterior)
# html widgets
library(kableExtra)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-r">head(df_bay)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## # A tibble: 6 x 4
## neighbourhood year_temp unit_size rent
## &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt;
## 1 Banbury-Don Mills/York Mills 0 0 bedroom 881
## 2 Banbury-Don Mills/York Mills 0 1 bedroom 1097
## 3 Banbury-Don Mills/York Mills 0 2 bedrooms 1253
## 4 Banbury-Don Mills/York Mills 0 3 bedrooms 1565
## 5 Bathurst Manor 0 0 bedroom 797
## 6 Bathurst Manor 0 1 bedroom 1101
&lt;/code>&lt;/pre>
&lt;p>Note: Year 0 represents year 2016 and year 4 is 2020.&lt;/p>
&lt;pre>&lt;code class="language-r">df_bay_2016 = df_bay %&amp;gt;%
filter(year_temp==1) %&amp;gt;%
select(rent)
hist(df_bay_2016$rent, main='Distribution of Mean Rental Prices in Neighborhoods in Toronto in 2017', xlab='Mean Price')
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dist-1.png" alt="">&lt;!-- -->&lt;/p>
&lt;pre>&lt;code class="language-r">hist(log(df_bay_2016$rent), main='Distribution of Log Mean Rental Prices in Neighborhoods\nin Toronto in 2017', xlab='Log Mean Price')
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dist-2.png" alt="">&lt;!-- -->&lt;/p>
&lt;h2 id="two-matematically-equivalent-approaches">Two Matematically Equivalent Approaches&lt;/h2>
&lt;p>We consider the following equivalent approaches and test if both models agree with each other.&lt;/p>
&lt;p>We are going to consider the following two approaches (with brms in R):&lt;/p>
&lt;p>Approach 1: $\mu_{ij}\sim \mathrm{Lognormal}((b_{0j}+\beta_0)+(b_{1j}+\beta_1)x_i,\sigma^2)$ (i.e. family=lognormal())&lt;/p>
&lt;p>Approach 2: $\log(\mu_{ij})\sim N((b_{0j}+\beta_0)+(b_{1j}+\beta_1)x_i,\sigma^2)$&lt;/p>
&lt;p>where&lt;/p>
&lt;p>$b_{0j}$: Random intercept due to j type of unit size;&lt;/p>
&lt;p>$\beta_0$: Baseline intercept, which may have no practical meaning;&lt;/p>
&lt;p>$x_i$: Variable year $i$, where year 0 corresponds to 2016 and year 4 to 2020;&lt;/p>
&lt;p>$b_{1j}$: Random slope due to j type of unit size;&lt;/p>
&lt;p>$\beta_1$: Coefficient of variable year;&lt;/p>
&lt;p>$\sigma^2$: Actual variation in rental price.&lt;/p>
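&lt;p>The equivalence of the two approaches comes down to the definition of the lognormal distribution: $Y\sim \mathrm{Lognormal}(\mu,\sigma)$ means precisely that $\log Y\sim N(\mu,\sigma)$. A quick simulation (a hypothetical sketch in Python rather than R; the values 7.0 and 0.25 are illustrative, chosen to be on the scale of log rent) confirms the two views coincide:&lt;/p>

```python
import numpy as np

# Hypothetical sketch (Python rather than R): Y ~ Lognormal(mu, sigma) is
# by definition exp(Z) with Z ~ N(mu, sigma), so fitting rent with
# family = lognormal() (Approach 1) and fitting log(rent) with a Gaussian
# family (Approach 2) target the same likelihood.  mu = 7.0 and
# sigma = 0.25 are illustrative values on the scale of log(rent).
rng = np.random.default_rng(1)
mu, sigma, size = 7.0, 0.25, 100_000
y_lognormal = rng.lognormal(mu, sigma, size)          # Approach 1 view
y_via_normal = np.exp(rng.normal(mu, sigma, size))    # Approach 2 view
print(np.log(y_lognormal).mean(), np.log(y_via_normal).mean())  # both ~ 7.0
print(np.log(y_lognormal).std(), np.log(y_via_normal).std())    # both ~ 0.25
```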
&lt;h2 id="priors">Priors&lt;/h2>
&lt;p>Weakly informative priors are chosen based on our belief in the baseline price of a studio at the intercept, which is around 700 Canadian dollars per month, in Toronto. Moreover, we expect the unit size has small effects on the intercept (normal distribution is chosen for this reason). At the same time, we also want to ensure that we do not miss the possibility of large parameters with Cauchy distribution as priors:&lt;/p>
&lt;p>$\beta_0\sim N(700,100)$&lt;/p>
&lt;p>$b_{0j}\sim N(0,1)$&lt;/p>
&lt;p>$b_{1j}\sim N(0,{\tau_1}^2 )$&lt;/p>
&lt;p>$\beta_1\sim N(0,{\tau_2}^2 )$&lt;/p>
&lt;p>${\tau_1}^2,{\tau_2}^2,\sigma\sim Cauchy(0,1)$&lt;/p>
&lt;p>Correlation of $(b_{0j},b_{1j})\sim \mathrm{LKJCholesky}(1.5)$ (the Cholesky-factored LKJ correlation distribution)&lt;/p>
&lt;p>Note: Prior Predictive Check is not the main focus of this project, so it is omitted to save space for the model comparison below. One should, however, conduct prior predictive check to be more rigorous.&lt;/p>
&lt;h2 id="model-1-estimates">Model 1 Estimates&lt;/h2>
&lt;pre>&lt;code class="language-r">priors &amp;lt;- c(prior(normal(700, 100), class = Intercept),
prior(normal(0, 1), class = b),
prior(cauchy(0, 1), class = sd),
prior(cauchy(0, 1), class = sigma),
prior(lkj_corr_cholesky(1.5), class = cor))
# priors &amp;lt;- c(prior(normal(0,1), class = Intercept),
# prior(cauchy(0,0.5), class = sd))
if (!file.exists(&amp;quot;models/bayes_mod1.rds&amp;quot;)){
mod_1 &amp;lt;- brm(rent ~ (1 + year_temp | unit_size ) + year_temp ,
data = df_bay,
prior = priors,
family=lognormal(),
warmup = 1000, # burn-in
iter = 5000, # number of iterations
chains = 2, # number of MCMC chains
control = list(adapt_delta = 0.95))
saveRDS(mod_1, file= &amp;quot;models/bayes_mod1.rds&amp;quot;)
} else {
mod_1 &amp;lt;- readRDS(&amp;quot;models/bayes_mod1.rds&amp;quot;)
}
fixef(mod_1) %&amp;gt;%
kable(booktabs = T, caption = &amp;quot;Fixed Effects for Model 1&amp;quot;) %&amp;gt;%
kable_styling(latex_options = c(&amp;quot;HOLD_position&amp;quot;, &amp;quot;scale_down&amp;quot;))
&lt;/code>&lt;/pre>
&lt;table class="table" style="margin-left: auto; margin-right: auto;">
&lt;caption>Fixed Effects for Model 1&lt;/caption>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left;"> &lt;/th>
&lt;th style="text-align:right;"> Estimate &lt;/th>
&lt;th style="text-align:right;"> Est.Error &lt;/th>
&lt;th style="text-align:right;"> Q2.5 &lt;/th>
&lt;th style="text-align:right;"> Q97.5 &lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left;"> Intercept &lt;/td>
&lt;td style="text-align:right;"> 7.052757 &lt;/td>
&lt;td style="text-align:right;"> 0.2196256 &lt;/td>
&lt;td style="text-align:right;"> 6.5951566 &lt;/td>
&lt;td style="text-align:right;"> 7.5166879 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> year_temp &lt;/td>
&lt;td style="text-align:right;"> 0.054498 &lt;/td>
&lt;td style="text-align:right;"> 0.0087226 &lt;/td>
&lt;td style="text-align:right;"> 0.0406854 &lt;/td>
&lt;td style="text-align:right;"> 0.0670868 &lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-r">data.frame(ranef(mod_1)) %&amp;gt;%
kable(booktabs = T, caption = &amp;quot;Random Effects for Model 1&amp;quot;) %&amp;gt;%
kable_styling(latex_options = c(&amp;quot;HOLD_position&amp;quot;, &amp;quot;scale_down&amp;quot;))
&lt;/code>&lt;/pre>
&lt;table class="table" style="margin-left: auto; margin-right: auto;">
&lt;caption>Random Effects for Model 1&lt;/caption>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left;"> &lt;/th>
&lt;th style="text-align:right;"> unit_size.Estimate.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Est.Error.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q2.5.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q97.5.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Estimate.year_temp &lt;/th>
&lt;th style="text-align:right;"> unit_size.Est.Error.year_temp &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q2.5.year_temp &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q97.5.year_temp &lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left;"> 0 bedroom &lt;/td>
&lt;td style="text-align:right;"> -0.2945081 &lt;/td>
&lt;td style="text-align:right;"> 0.2198822 &lt;/td>
&lt;td style="text-align:right;"> -0.7599386 &lt;/td>
&lt;td style="text-align:right;"> 0.1641069 &lt;/td>
&lt;td style="text-align:right;"> 0.0028919 &lt;/td>
&lt;td style="text-align:right;"> 0.0091980 &lt;/td>
&lt;td style="text-align:right;"> -0.0093780 &lt;/td>
&lt;td style="text-align:right;"> 0.0213398 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> 1 bedroom &lt;/td>
&lt;td style="text-align:right;"> -0.0838120 &lt;/td>
&lt;td style="text-align:right;"> 0.2196423 &lt;/td>
&lt;td style="text-align:right;"> -0.5421174 &lt;/td>
&lt;td style="text-align:right;"> 0.3796284 &lt;/td>
&lt;td style="text-align:right;"> 0.0012612 &lt;/td>
&lt;td style="text-align:right;"> 0.0086730 &lt;/td>
&lt;td style="text-align:right;"> -0.0118502 &lt;/td>
&lt;td style="text-align:right;"> 0.0167562 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> 2 bedrooms &lt;/td>
&lt;td style="text-align:right;"> 0.0867104 &lt;/td>
&lt;td style="text-align:right;"> 0.2198136 &lt;/td>
&lt;td style="text-align:right;"> -0.3782345 &lt;/td>
&lt;td style="text-align:right;"> 0.5462856 &lt;/td>
&lt;td style="text-align:right;"> -0.0003455 &lt;/td>
&lt;td style="text-align:right;"> 0.0088410 &lt;/td>
&lt;td style="text-align:right;"> -0.0151367 &lt;/td>
&lt;td style="text-align:right;"> 0.0140831 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> 3 bedrooms &lt;/td>
&lt;td style="text-align:right;"> 0.2633750 &lt;/td>
&lt;td style="text-align:right;"> 0.2195523 &lt;/td>
&lt;td style="text-align:right;"> -0.1990381 &lt;/td>
&lt;td style="text-align:right;"> 0.7254240 &lt;/td>
&lt;td style="text-align:right;"> -0.0016846 &lt;/td>
&lt;td style="text-align:right;"> 0.0087438 &lt;/td>
&lt;td style="text-align:right;"> -0.0178807 &lt;/td>
&lt;td style="text-align:right;"> 0.0113221 &lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="model-2-estimates">Model 2 Estimates&lt;/h2>
&lt;pre>&lt;code class="language-r"># priors &amp;lt;- c(prior(normal(0,1), class = Intercept),
# prior(cauchy(0,0.5), class = sd))
if (!file.exists(&amp;quot;models/bayes_mod2.rds&amp;quot;)){
mod_2 &amp;lt;- brm(log(rent) ~ (1 + year_temp | unit_size ) + year_temp ,
data = df_bay,
prior = priors,
warmup = 1000, # burn-in
iter = 5000, # number of iterations
chains = 2, # number of MCMC chains
control = list(adapt_delta = 0.95))
saveRDS(mod_2, file= &amp;quot;models/bayes_mod2.rds&amp;quot;)
} else {
mod_2 &amp;lt;- readRDS(&amp;quot;models/bayes_mod2.rds&amp;quot;)
}
fixef(mod_2) %&amp;gt;%
kable(booktabs = T, caption = &amp;quot;Fixed Effects for Model 2&amp;quot;) %&amp;gt;%
kable_styling(latex_options = c(&amp;quot;HOLD_position&amp;quot;, &amp;quot;scale_down&amp;quot;))
&lt;/code>&lt;/pre>
&lt;table class="table" style="margin-left: auto; margin-right: auto;">
&lt;caption>Fixed Effects for Model 2&lt;/caption>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left;"> &lt;/th>
&lt;th style="text-align:right;"> Estimate &lt;/th>
&lt;th style="text-align:right;"> Est.Error &lt;/th>
&lt;th style="text-align:right;"> Q2.5 &lt;/th>
&lt;th style="text-align:right;"> Q97.5 &lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left;"> Intercept &lt;/td>
&lt;td style="text-align:right;"> 7.0518325 &lt;/td>
&lt;td style="text-align:right;"> 0.2324473 &lt;/td>
&lt;td style="text-align:right;"> 6.5815863 &lt;/td>
&lt;td style="text-align:right;"> 7.5796947 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> year_temp &lt;/td>
&lt;td style="text-align:right;"> 0.0549225 &lt;/td>
&lt;td style="text-align:right;"> 0.0075329 &lt;/td>
&lt;td style="text-align:right;"> 0.0401083 &lt;/td>
&lt;td style="text-align:right;"> 0.0683509 &lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-r">data.frame(ranef(mod_2)) %&amp;gt;%
kable(booktabs = T, caption = &amp;quot;Random Effects for Model 2&amp;quot;) %&amp;gt;%
kable_styling(latex_options = c(&amp;quot;HOLD_position&amp;quot;, &amp;quot;scale_down&amp;quot;))
&lt;/code>&lt;/pre>
&lt;table class="table" style="margin-left: auto; margin-right: auto;">
&lt;caption>Random Effects for Model 2&lt;/caption>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left;"> &lt;/th>
&lt;th style="text-align:right;"> unit_size.Estimate.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Est.Error.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q2.5.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q97.5.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Estimate.year_temp &lt;/th>
&lt;th style="text-align:right;"> unit_size.Est.Error.year_temp &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q2.5.year_temp &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q97.5.year_temp &lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left;"> 0 bedroom &lt;/td>
&lt;td style="text-align:right;"> -0.2939128 &lt;/td>
&lt;td style="text-align:right;"> 0.2329494 &lt;/td>
&lt;td style="text-align:right;"> -0.8206667 &lt;/td>
&lt;td style="text-align:right;"> 0.1726338 &lt;/td>
&lt;td style="text-align:right;"> 0.0026946 &lt;/td>
&lt;td style="text-align:right;"> 0.0079855 &lt;/td>
&lt;td style="text-align:right;"> -0.0107282 &lt;/td>
&lt;td style="text-align:right;"> 0.0205602 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> 1 bedroom &lt;/td>
&lt;td style="text-align:right;"> -0.0830413 &lt;/td>
&lt;td style="text-align:right;"> 0.2324192 &lt;/td>
&lt;td style="text-align:right;"> -0.6096193 &lt;/td>
&lt;td style="text-align:right;"> 0.3857650 &lt;/td>
&lt;td style="text-align:right;"> 0.0009799 &lt;/td>
&lt;td style="text-align:right;"> 0.0076378 &lt;/td>
&lt;td style="text-align:right;"> -0.0127195 &lt;/td>
&lt;td style="text-align:right;"> 0.0172185 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> 2 bedrooms &lt;/td>
&lt;td style="text-align:right;"> 0.0882536 &lt;/td>
&lt;td style="text-align:right;"> 0.2321475 &lt;/td>
&lt;td style="text-align:right;"> -0.4375356 &lt;/td>
&lt;td style="text-align:right;"> 0.5553330 &lt;/td>
&lt;td style="text-align:right;"> -0.0009177 &lt;/td>
&lt;td style="text-align:right;"> 0.0076263 &lt;/td>
&lt;td style="text-align:right;"> -0.0167288 &lt;/td>
&lt;td style="text-align:right;"> 0.0138163 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> 3 bedrooms &lt;/td>
&lt;td style="text-align:right;"> 0.2643750 &lt;/td>
&lt;td style="text-align:right;"> 0.2323713 &lt;/td>
&lt;td style="text-align:right;"> -0.2665589 &lt;/td>
&lt;td style="text-align:right;"> 0.7340761 &lt;/td>
&lt;td style="text-align:right;"> -0.0022045 &lt;/td>
&lt;td style="text-align:right;"> 0.0079285 &lt;/td>
&lt;td style="text-align:right;"> -0.0200957 &lt;/td>
&lt;td style="text-align:right;"> 0.0117391 &lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Key Findings for both models:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Both models yield similar estimates for both fixed effects and random effects.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The rental price increases by around 5.6% each year on average, higher than Canada&amp;rsquo;s inflation rate of 3.6%;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The baseline price for a 3-bedroom apartment is about 73% higher than that of a studio, so a hierarchical model is necessary;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The slope does not vary much for each room type, so a random intercept model may be sufficient for analysis.&lt;/p>
&lt;/li>
&lt;/ul>
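&lt;p>The roughly 5.6% figure above follows from the log-scale slope: a log-linear coefficient $b$ implies a multiplicative change of $e^b$ per year. A quick arithmetic check (a Python sketch, using the fixed-effect estimates reported in the tables above):&lt;/p>

```python
import math

# The year_temp coefficients are slopes on the log scale, so the implied
# annual growth rate is exp(b) - 1.  The values below are the fixed-effect
# estimates reported in the tables above for models 1 and 2.
b_model1, b_model2 = 0.054498, 0.0549225
growth1 = math.exp(b_model1) - 1
growth2 = math.exp(b_model2) - 1
print(f"{growth1:.1%} {growth2:.1%}")  # both about 5.6% per year
```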
&lt;h2 id="posterior-predictive-check-density">Posterior Predictive Check (Density)&lt;/h2>
&lt;pre>&lt;code class="language-r">pp_check(mod_1) + labs(title=&amp;quot;Distribution of observed and replicated rental price for model 1&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Using 10 posterior samples for ppc type 'dens_overlay' by default.
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pp_check_density-1.png" alt="">&lt;!-- -->&lt;/p>
&lt;pre>&lt;code class="language-r">pp_check(mod_2) + labs(title=&amp;quot;Distribution of observed and replicated rental price for model 2&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Using 10 posterior samples for ppc type 'dens_overlay' by default.
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pp_check_density-2.png" alt="">&lt;!-- -->&lt;/p>
&lt;p>Both models are reasonable from the comparison above.&lt;/p>
&lt;h2 id="posterior-predictive-check-test-statistic">Posterior Predictive Check (Test Statistic)&lt;/h2>
&lt;pre>&lt;code class="language-r">pp_check(mod_1, type = &amp;quot;stat&amp;quot;, stat = 'mean', nsamples = 5000) + labs(title=&amp;quot;Comparison between the distribution of the mean rental price in simulated datasets\nand the mean of the actual data for Model 1&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pp_check_test_stats-1.png" alt="">&lt;!-- -->&lt;/p>
&lt;pre>&lt;code class="language-r">pp_check(mod_2, type = &amp;quot;stat&amp;quot;, stat = 'mean', nsamples = 5000) + labs(title=&amp;quot;Comparison between the distribution of the log mean rental price in simulated datasets\nand the log mean of the actual data for Model 2&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pp_check_test_stats-2.png" alt="">&lt;!-- -->&lt;/p>
&lt;p>Both models are reasonable from the comparison above.&lt;/p>
&lt;h2 id="leave-one-out-cross-validationloo-cv">Leave-one-out Cross-validation(LOO-CV)&lt;/h2>
&lt;pre>&lt;code class="language-r">loo1b &amp;lt;- loo(mod_1, save_psis = TRUE)
loo2b &amp;lt;- loo(mod_2, save_psis = TRUE)
plot(loo1b, main = &amp;quot;PSIS diagnostic plot for model 1&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="loo-1.png" alt="">&lt;!-- -->&lt;/p>
&lt;pre>&lt;code class="language-r">plot(loo2b, main = &amp;quot;PSIS diagnostic plot for model 2&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="loo-2.png" alt="">&lt;!-- -->&lt;/p>
&lt;p>Pareto $k$ estimates give an indication of how ‘influential’ each point is: the higher the value of $k$, the more influential the point. Points with $k$ over 0.5 are problematic; fortunately, there are no such influential points in either model.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Though the two models differ subtly in their results, RStan produces fair estimates for both. One may want to explore rental prices further with more predictors, and gather more data to validate the model.&lt;/p>
&lt;h2 id="bibliography">Bibliography&lt;/h2>
&lt;p>Canada Mortgage and Housing Corporation. (2021). Toronto — Historical Average Rents by Bedroom Type. https://www03.cmhc-schl.gc.ca/hmip-pimh/en/TableMapChart/Table?TableId=2.2.11&amp;amp;GeographyId=2270&amp;amp;GeographyTypeId=3&amp;amp;DisplayAs=Table&amp;amp;GeograghyName=Toronto&lt;/p>
&lt;li>
&lt;a href="#introduction-to-apache-drill">Introduction to Apache Drill&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#querying-files-using-drill">Querying Files Using Drill&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#querying-mysql-using-drill">Querying MySQL Using Drill&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#querying-mongodb-using-drill">Querying MongoDB Using Drill&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#drill-with-multiple-data-sources">Drill with Multiple Data Sources&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#future-of-sql">Future of SQL&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The data landscape has changed quite a bit over the past decade, and SQL is changing to meet the needs of today’s rapidly evolving environments. Many organizations that had used relational databases exclusively just a few years ago are now also housing data in Hadoop clusters, data lakes, and NoSQL databases. At the same time, companies are struggling to find ways to gain insights from the ever-growing volumes of data, and the fact that this data is now spread across multiple data stores, perhaps both on-site and in the cloud, makes this a daunting task.&lt;/p>
&lt;p>Because SQL is used by millions of people and has been integrated into thousands of applications, it makes sense to leverage SQL to harness this data and make it actionable. Over the past several years, a new breed of tools has emerged to enable SQL access to structured, semi-structured, and unstructured data: tools such as Presto, Apache Drill, and Toad Data Point. This chapter explores one of these tools, Apache Drill, to demonstrate how data in different formats and stored on different servers can be brought together for reporting and analysis.&lt;/p>
&lt;h1 id="introduction-to-apache-drill">Introduction to Apache Drill&lt;/h1>
&lt;p>Compelling features:&lt;/p>
&lt;ul>
&lt;li>Facilitates queries across multiple data formats, including delimited data, JSON, Parquet, and log files&lt;/li>
&lt;li>Connects to relational databases, Hadoop, NoSQL, HBase, and Kafka, as well as specialized data formats such as PCAP, BlockChain, and others&lt;/li>
&lt;li>Allows creation of custom plug-ins to connect to most any other data store&lt;/li>
&lt;li>Requires no up-front schema definitions&lt;/li>
&lt;li>Supports the SQL:2003 standard&lt;/li>
&lt;li>Works with popular business intelligence (BI) tools like Tableau and Apache Superset&lt;/li>
&lt;/ul>
&lt;p>Using Drill, you can connect to any number of data sources and begin querying, without the need to first set up a metadata repository.&lt;/p>
&lt;h1 id="querying-files-using-drill">Querying Files Using Drill&lt;/h1>
&lt;p>Let’s start by using Drill to query data in a file. Drill understands how to read several different file formats, including packet capture (PCAP) files, which are in binary format and contain information about packets traveling over a network. All I have to do when I want to query a PCAP file is to configure Drill’s dfs (distributed filesystem) plug-in to include the path to the directory containing my files, and I’m ready to write queries.&lt;/p>
&lt;p>Drill includes partial support for information_schema, so you can find out high-level information about the data files in your workspace:&lt;/p>
&lt;pre>
SELECT file_name, is_directory, is_file, permission
FROM &lt;b>information_schema.`files`&lt;/b>
WHERE schema_name = 'dfs.data';
SELECT * FROM dfs.data.`attack-trace.pcap`
&lt;b>WHERE 1=2;&lt;/b> # To see the column name
&lt;/pre>
&lt;p>Counts the number of packets sent from each IP address to each destination port:&lt;/p>
&lt;pre>
SELECT src_ip, dst_port,
count(*) AS packet_count
FROM dfs.data.`attack-trace.pcap`
GROUP BY src_ip, dst_port;
&lt;/pre>
&lt;p>Aggregates packet information for each second:&lt;/p>
&lt;pre>
SELECT trunc(extract(second from `timestamp`)) as packet_time,
count(*) AS num_packets,
sum(packet_length) AS tot_volume
FROM dfs.data.`attack-trace.pcap`
GROUP BY trunc(extract(second from `timestamp`));
&lt;/pre>
&lt;p>Put backticks (`) around timestamp because it is a reserved word.&lt;/p>
&lt;p>You can query files stored locally, on your network, in a distributed filesystem, or in the cloud. Drill has built-in support for many file types, but you can also build your own plug-in to allow Drill to query any type of file.&lt;/p>
&lt;h1 id="querying-mysql-using-drill">Querying MySQL Using Drill&lt;/h1>
&lt;p>Why Apache Drill? Because you can write queries using Drill that combine data from different sources, so you might write a query that joins data from MySQL, Hadoop, and comma-delimited files, for example.&lt;/p>
&lt;p>The first step is to choose a database:&lt;/p>
&lt;pre>
apache drill (information_schema)> &lt;b>use mysql.sakila&lt;/b>;
&lt;b>show tables;&lt;/b>
&lt;/pre>
&lt;p>Simple joins, group by, order by, and having clauses work in Drill as well. However, Drill works with many relational databases, not just MySQL, so some features of the language may differ (e.g., data conversion functions). For more information, read
&lt;a href="http://drill.apache.org/docs/sql-reference/" target="_blank" rel="noopener">Drill’s documentation about their SQL implementation&lt;/a>.&lt;/p>
&lt;h1 id="querying-mongodb-using-drill">Querying MongoDB Using Drill&lt;/h1>
&lt;p>After using Drill to query the sample Sakila data in MySQL, the next logical step is to convert the Sakila data to another commonly used format, store it in a nonrelational database, and use Drill to query the data. I decided to convert the data to JSON and store it in MongoDB, which is one of the more popular NoSQL platforms for document storage. Drill includes a plug-in for MongoDB and also understands how to read JSON documents, so it was relatively easy to load the JSON files into Mongo and begin writing queries.&lt;/p>
&lt;p>After the JSON files have been loaded, the Mongo database contains two collections (films and customers), and the data in these collections spans nine different tables from the MySQL Sakila database.&lt;/p>
&lt;p>Group the data by rating and actor:&lt;/p>
&lt;pre>
SELECT g_pg_films.Rating,
g_pg_films.actor_list.`First name` first_name,
g_pg_films.actor_list.`Last name` last_name,
count(*) num_films
FROM
(SELECT f.Rating, flatten(Actors) actor_list
FROM films f
WHERE f.Rating IN ('G','PG')
) g_pg_films
GROUP BY g_pg_films.Rating,
g_pg_films.actor_list.`First name`,
g_pg_films.actor_list.`Last name`
HAVING count(*) > 9;
&lt;/pre>
&lt;p>The query should return all customers who have spent more than $80 to rent films rated either G or PG.&lt;/p>
&lt;pre>
SELECT first_name, last_name,
sum(cast(cust_payments.payment_data.Amount
as decimal(4,2))) tot_payments
FROM
(SELECT cust_data.first_name,
cust_data.last_name,
f.Rating,
flatten(cust_data.rental_data.Payments)
payment_data
FROM films f
INNER JOIN
(SELECT c.`First Name` first_name,
c.`Last Name` last_name, flatten(c.Rentals) rental_data
FROM customers c
) cust_data
ON f._id = cust_data.rental_data.filmID
WHERE f.Rating IN ('G','PG')
) cust_payments
GROUP BY first_name, last_name
HAVING
sum(cast(cust_payments.payment_data.Amount as decimal(4,2))) > 80;
&lt;/pre>
&lt;p>The innermost query, which I named cust_data, flattens the Rentals list so that the cust_payments query can join to the films collection and also flatten the Payments list. The outermost query groups the data by customer name and applies a having clause to filter out customers who spent $80 or less on films rated G or PG.&lt;/p>
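&lt;p>The effect of flatten() in the queries above can be mimicked in a few lines of code (a hypothetical Python sketch, not Drill itself): each element of an embedded list becomes its own row, with the parent document&amp;rsquo;s other fields repeated alongside it.&lt;/p>

```python
# Hypothetical Python sketch of Drill's flatten(): emit one row per element
# of an embedded list, repeating the parent document's other fields.
def flatten(docs, list_field):
    for doc in docs:
        parent = {k: v for k, v in doc.items() if k != list_field}
        for item in doc[list_field]:
            yield {**parent, list_field: item}

films = [{"Title": "ACADEMY DINOSAUR", "Rating": "PG",
          "Actors": [{"First name": "PENELOPE"}, {"First name": "CHRISTIAN"}]}]
rows = list(flatten(films, "Actors"))
print(len(rows))  # 2 rows: one per actor, with Rating repeated on each
```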
&lt;h1 id="drill-with-multiple-data-sources">Drill with Multiple Data Sources&lt;/h1>
&lt;p>As long as Drill is configured to connect to both databases, you just need to describe where to find the data.&lt;/p>
&lt;pre>
&lt;b>FROM mysql.sakila.film f&lt;/b>
&lt;b>FROM mongo.sakila.customers c&lt;/b>
&lt;/pre>
&lt;h1 id="future-of-sql">Future of SQL&lt;/h1>
&lt;p>The future of relational databases is somewhat unclear. It is possible that the big data technologies of the past decade will continue to mature and gain market share. It’s also possible that a new set of technologies will emerge, overtaking Hadoop and NoSQL, and taking additional market share from relational databases. However, most companies still run their core business functions using relational databases, and it should take a long time for this to change.&lt;/p>
&lt;p>The future of SQL seems a bit clearer, however. While the SQL language started out as a mechanism for interacting with data in relational databases, tools like Apache Drill act more like an abstraction layer, facilitating the analysis of data across various database platforms. In this author’s opinion, this trend will continue, and SQL will remain a critical tool for data analysis and reporting for many years.&lt;/p></description></item><item><title>Learning SQL Notes #15: Working with Large Databases</title><link>https://siqi-zheng.rbind.io/post/2021-06-11-sql-notes-15/</link><pubDate>Fri, 11 Jun 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-11-sql-notes-15/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#partitioning">Partitioning&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#partitioning-concepts">Partitioning Concepts&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#table-partitioning">Table Partitioning&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#index-partitioning">Index Partitioning&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#partitioning-methods">Partitioning Methods&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#range-partitioning">Range partitioning&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#list-partitioning">List partitioning&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#hash-partitioning">Hash partitioning&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#composite-partitioning">Composite partitioning&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#partitioning-benefits">Partitioning Benefits&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#clustering">Clustering&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#sharding">Sharding&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#big-data">Big Data&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#hadoop">Hadoop&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#nosql-and-document-databases">NoSQL and Document Databases&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#cloud-computing">Cloud Computing&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#conclusion">Conclusion&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>While relational databases face various challenges as data volumes continue to grow, strategies such as partitioning, clustering, and sharding allow companies to continue utilizing relational databases by spreading data across multiple storage tiers and servers. Other companies have decided to move to big data platforms such as Hadoop in order to handle huge data volumes.&lt;/p>
&lt;h1 id="partitioning">Partitioning&lt;/h1>
&lt;p>The following tasks become more difficult and/or time consuming as a table grows past a few million rows:&lt;/p>
&lt;ul>
&lt;li>Query execution requiring full table scans&lt;/li>
&lt;li>Index creation/rebuild&lt;/li>
&lt;li>Data archival/deletion&lt;/li>
&lt;li>Generation of table/index statistics&lt;/li>
&lt;li>Table relocation (e.g., move to a different tablespace)&lt;/li>
&lt;li>Database backups&lt;/li>
&lt;/ul>
&lt;p>The best way to prevent administrative issues from occurring in the future is to break large tables into pieces, or &lt;em>partitions&lt;/em>, when the table is first created (although tables can be partitioned later, it is easier to do so initially). Administrative tasks can be performed on individual partitions, often in parallel, and some tasks can skip one or more partitions entirely.&lt;/p>
&lt;h2 id="partitioning-concepts">Partitioning Concepts&lt;/h2>
&lt;p>While every partition must have the same schema definition (columns, column types, etc.), there are several administrative features that can differ for each partition:&lt;/p>
&lt;ul>
&lt;li>Partitions may be stored on different tablespaces, which can be on different physical storage tiers.&lt;/li>
&lt;li>Partitions can be compressed using different compression schemes.&lt;/li>
&lt;li>Local indexes (more on this shortly) can be dropped for some partitions.&lt;/li>
&lt;li>Table statistics can be frozen on some partitions, while being periodically refreshed on others.&lt;/li>
&lt;li>Individual partitions can be pinned into memory or stored in the database’s flash storage tier.&lt;/li>
&lt;/ul>
&lt;h2 id="table-partitioning">Table Partitioning&lt;/h2>
&lt;p>The partitioning scheme available in most relational databases is &lt;em>horizontal partitioning&lt;/em>, which assigns entire rows to exactly one partition. Tables may also be partitioned &lt;em>vertically&lt;/em>, which involves assigning sets of columns to different partitions, but this must be done manually. When partitioning a table horizontally, you must choose a &lt;em>partition key&lt;/em>, which is the column whose values are used to assign a row to a particular partition. In most cases, a table’s partition key consists of a single column, and a &lt;em>partitioning function&lt;/em> is applied to this column to determine in which partition each row should reside.&lt;/p>
&lt;h2 id="index-partitioning">Index Partitioning&lt;/h2>
&lt;p>If your partitioned table has indexes, you will get to choose whether a particular index should stay intact, known as a &lt;em>global index&lt;/em>, or be broken into pieces such that each partition has its own index, which is called a &lt;em>local index&lt;/em>. Global indexes span all partitions of the table and are useful for queries that do not specify a value for the partition key.&lt;/p>
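&lt;p>As a sketch of the difference (using Oracle-style syntax, since MySQL builds only local indexes on partitioned tables; the index names are illustrative):&lt;/p>
&lt;pre>&lt;code class="language-sql">-- a LOCAL index is split into one piece per table partition
CREATE INDEX sales_cust_idx ON sales (cust_id) LOCAL;

-- omitting LOCAL yields a single index spanning all partitions,
-- useful for queries that do not filter on the partition key
CREATE INDEX sales_store_idx ON sales (store_id);
&lt;/code>&lt;/pre>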
&lt;h2 id="partitioning-methods">Partitioning Methods&lt;/h2>
&lt;h3 id="range-partitioning">Range partitioning&lt;/h3>
&lt;p>The most common use of range partitioning is to break up tables by date ranges.&lt;/p>
&lt;pre>&lt;code class="language-sql">CREATE TABLE sales
(sale_id INT NOT NULL,
cust_id INT NOT NULL,
store_id INT NOT NULL,
sale_date DATE NOT NULL,
amount DECIMAL(9,2)
)
PARTITION BY RANGE (yearweek(sale_date))
(PARTITION s1 VALUES LESS THAN (202002),
PARTITION s2 VALUES LESS THAN (202003),
PARTITION s3 VALUES LESS THAN (202004),
PARTITION s4 VALUES LESS THAN (202005),
PARTITION s5 VALUES LESS THAN (202006),
PARTITION s999 VALUES LESS THAN (MAXVALUE)
);
&lt;/code>&lt;/pre>
&lt;p>To inspect and modify the partitions:&lt;/p>
&lt;pre>
SELECT partition_name, partition_method, partition_expression
&lt;b>FROM information_schema.partitions &lt;/b>
WHERE table_name = 'sales'
ORDER BY partition_ordinal_position;
ALTER TABLE sales &lt;b>REORGANIZE PARTITION&lt;/b> s999 INTO
(PARTITION s6 VALUES LESS THAN (202007),
PARTITION s7 VALUES LESS THAN (202008),
PARTITION s999 VALUES LESS THAN (MAXVALUE)
);
&lt;/pre>
&lt;h3 id="list-partitioning">List partitioning&lt;/h3>
&lt;pre>&lt;code class="language-sql">PARTITION BY LIST COLUMNS (geo_region_cd)
(PARTITION ASIA VALUES IN ('CHN','JPN','IND'))
ALTER TABLE sales REORGANIZE PARTITION ASIA INTO
(PARTITION ASIA VALUES IN ('CHN','JPN','IND', 'KOR'));
&lt;/code>&lt;/pre>
&lt;h3 id="hash-partitioning">Hash partitioning&lt;/h3>
&lt;p>The server distributes rows evenly across the partitions by applying a &lt;em>hashing function&lt;/em> to the partition key&amp;rsquo;s column value.&lt;/p>
&lt;pre>&lt;code class="language-sql">PARTITION BY HASH (cust_id)
PARTITIONS 4
(PARTITION H1,
PARTITION H2,
PARTITION H3,
PARTITION H4
);
&lt;/code>&lt;/pre>
&lt;h3 id="composite-partitioning">Composite partitioning&lt;/h3>
&lt;p>If you need finer-grained control of how data is allocated to your partitions, you can employ &lt;em>composite partitioning&lt;/em>, which allows you to use two different types of partitioning for the same table. With composite partitioning, the first partitioning method defines the partitions, and the second partitioning method defines the &lt;em>subpartitions&lt;/em>.&lt;/p>
&lt;pre>&lt;code class="language-sql">CREATE TABLE sales
(sale_id INT NOT NULL,
cust_id INT NOT NULL,
store_id INT NOT NULL,
sale_date DATE NOT NULL,
amount DECIMAL(9,2)
)
PARTITION BY RANGE (yearweek(sale_date))
SUBPARTITION BY HASH (cust_id)
(PARTITION s1 VALUES LESS THAN (202002)
(SUBPARTITION s1_h1, SUBPARTITION s1_h2, SUBPARTITION s1_h3, SUBPARTITION s1_h4),
PARTITION s2 VALUES LESS THAN (202003)
(SUBPARTITION s2_h1, SUBPARTITION s2_h2, SUBPARTITION s2_h3, SUBPARTITION s2_h4),
PARTITION s3 VALUES LESS THAN (202004)
(SUBPARTITION s3_h1, SUBPARTITION s3_h2,
SUBPARTITION s3_h3,
SUBPARTITION s3_h4),
PARTITION s4 VALUES LESS THAN (202005)
(SUBPARTITION s4_h1, SUBPARTITION s4_h2, SUBPARTITION s4_h3, SUBPARTITION s4_h4),
PARTITION s5 VALUES LESS THAN (202006)
(SUBPARTITION s5_h1, SUBPARTITION s5_h2, SUBPARTITION s5_h3, SUBPARTITION s5_h4),
PARTITION s999 VALUES LESS THAN (MAXVALUE)
(SUBPARTITION s999_h1, SUBPARTITION s999_h2, SUBPARTITION s999_h3,
SUBPARTITION s999_h4)
);
SELECT *
FROM sales PARTITION (s3);
SELECT *
FROM sales PARTITION (s3_h3);
&lt;/code>&lt;/pre>
&lt;h2 id="partitioning-benefits">Partitioning Benefits&lt;/h2>
&lt;p>One major advantage to partitioning is that a query may need to interact with only a single partition, rather than the entire table.&lt;/p>
&lt;p>If you execute a query that includes a join to a partitioned table and the query includes a condition on the partitioning column, the server can exclude any partitions that do not contain data pertinent to the query. This is known as a &lt;em>partitionwise join&lt;/em>, and it is similar to partition pruning in that only those partitions that contain data needed by the query will be included.&lt;/p>
&lt;p>From an administrative standpoint, one of the main benefits to partitioning is the ability to quickly delete data that is no longer needed.&lt;/p>
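&lt;p>For example, with the range-partitioned &lt;code>sales&lt;/code> table shown earlier, an entire week of obsolete data can be removed as a quick metadata operation instead of millions of individual row deletes (MySQL syntax):&lt;/p>
&lt;pre>&lt;code class="language-sql">-- removes partition s1 and every row stored in it
ALTER TABLE sales DROP PARTITION s1;
&lt;/code>&lt;/pre>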
&lt;p>Another administrative advantage to partitioned tables is the ability to perform updates on multiple partitions simultaneously, which can greatly reduce the time needed to touch every row in a table.&lt;/p>
&lt;h1 id="clustering">Clustering&lt;/h1>
&lt;p>&lt;em>Clustering&lt;/em> allows multiple servers to act as a single database.&lt;/p>
&lt;p>Shared-disk/shared-cache configurations: every server in the cluster has access to all disks, and data cached in one server can be accessed by any other server in the cluster. With this type of architecture, an application server could attach to any one of the database servers in the cluster, with connections automatically failing over to another server in the cluster in case of failure.&lt;/p>
&lt;p>Of the commercial database vendors, Oracle is the leader in this space, with many of the world’s biggest companies using the Oracle Exadata platform to host extremely large databases accessed by thousands of concurrent users. However, even this platform fails to meet the needs of the biggest companies, which led Google, Facebook, Amazon, and other companies to blaze new trails.&lt;/p>
&lt;h1 id="sharding">Sharding&lt;/h1>
&lt;p>&lt;em>Sharding&lt;/em> partitions the data across multiple databases (called &lt;em>shards&lt;/em>), so it is similar to table partitioning but on a larger scale and with far more complexity. If you were to employ this strategy for the social media company, you might decide to implement 100 separate databases, each one hosting the data for approximately 10 million users.&lt;/p>
&lt;ul>
&lt;li>You will need to choose a &lt;em>sharding key&lt;/em>, which is the value used to determine to which database to connect.&lt;/li>
&lt;li>While large tables will be divided into pieces, with individual rows assigned to a single shard, smaller reference tables may need to be replicated to all shards, and a strategy needs to be defined for how reference data can be modified and changes propagated to all shards.&lt;/li>
&lt;li>If individual shards become too large (e.g., the social media company now has two billion users), you will need a plan for adding more shards and redistributing data across the shards.&lt;/li>
&lt;li>When you need to make schema changes, you will need to have a strategy for deploying the changes across all of the shards so that all schemas stay in sync.&lt;/li>
&lt;li>If application logic needs to access data stored in two or more shards, you need to have a strategy for how to query across multiple databases and also how to implement transactions across multiple databases.&lt;/li>
&lt;/ul>
&lt;h1 id="big-data">Big Data&lt;/h1>
&lt;p>One way to define the boundaries of big data is with the “3 Vs”:&lt;/p>
&lt;p>&lt;em>Volume&lt;/em>&lt;/p>
&lt;p>In this context, volume generally means billions or trillions of data points.&lt;/p>
&lt;p>&lt;em>Velocity&lt;/em>&lt;/p>
&lt;p>This is a measure of how quickly data arrives.&lt;/p>
&lt;p>&lt;em>Variety&lt;/em>&lt;/p>
&lt;p>This means that data is not always structured (as in rows and columns in a relational database) but can also be unstructured (e.g., emails, videos, photos, audio files, etc.).&lt;/p>
&lt;p>So, one way to characterize big data is any system designed to handle a huge amount of data of various formats arriving at a rapid pace.&lt;/p>
&lt;h2 id="hadoop">Hadoop&lt;/h2>
&lt;p>Hadoop is best described as an &lt;em>ecosystem&lt;/em>, or a set of technologies and tools that work together. Some of the major components of Hadoop include:&lt;/p>
&lt;p>&lt;em>Hadoop Distributed File System (HDFS)&lt;/em>&lt;/p>
&lt;p>Like the name implies, HDFS enables file management across a large number of servers.&lt;/p>
&lt;p>&lt;em>MapReduce&lt;/em>&lt;/p>
&lt;p>This technology processes large amounts of structured and unstructured data by breaking a task into many small pieces that can be run in parallel across many servers.&lt;/p>
&lt;p>&lt;em>YARN&lt;/em>&lt;/p>
&lt;p>This is a resource manager and job scheduler for HDFS.&lt;/p>
&lt;p>Together, these technologies allow for the storage and processing of files across hundreds or even thousands of servers acting as a single logical system. While Hadoop is widely used, querying the data using MapReduce generally requires a programmer, which has led to the development of several SQL interfaces, including Hive, Impala, and Drill.&lt;/p>
&lt;h2 id="nosql-and-document-databases">NoSQL and Document Databases&lt;/h2>
&lt;p>What happens, however, if the structure of the data isn’t known beforehand or if the structure is known but changes frequently? The answer for many companies is to combine both the data and schema definition into documents using a format such as XML or JSON and then store the documents in a database. By doing so, various types of data can be stored in the same database without the need to make schema modifications, which makes storage easier but puts the burden on query and analytic tools to make sense of the data stored in the documents.&lt;/p>
&lt;p>Document databases are a subset of what are called NoSQL databases, which typically store data using a simple key-value mechanism. For example, using a document database such as MongoDB, you could utilize the customer ID as the key to store a JSON document containing all of the customer’s data, and other users can read the schema stored within the document to make sense of the data stored within.&lt;/p>
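&lt;p>For instance, a hypothetical customer document stored under the customer ID key might bundle data and structure together like this (all field names are illustrative):&lt;/p>
&lt;pre>&lt;code class="language-json">{
  "cust_id": 12345,
  "name": "Mary Smith",
  "addresses": [
    {"type": "home", "city": "Toronto"},
    {"type": "work", "city": "Markham"}
  ],
  "rentals": [
    {"film": "ACADEMY DINOSAUR", "date": "2005-07-15"}
  ]
}
&lt;/code>&lt;/pre>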
&lt;h2 id="cloud-computing">Cloud Computing&lt;/h2>
&lt;p>Prior to the advent of big data, most companies had to build their own data centers to house the database, web, and application servers used across the enterprise. With the advent of cloud computing, you can choose to essentially outsource your data center to platforms such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud. One of the biggest benefits to hosting your services in the cloud is &lt;strong>instant scalability&lt;/strong>, which allows you to quickly dial up or down the amount of computing power needed to run your services. Startups love these platforms because they can start writing code without spending any money up front for servers, storage, networks, or software licenses.&lt;/p>
&lt;p>As far as databases are concerned, a quick look at AWS’s database and analytics offerings yields the following options:&lt;/p>
&lt;ul>
&lt;li>Relational databases (MySQL, Aurora, PostgreSQL, MariaDB, Oracle, and SQL Server)&lt;/li>
&lt;li>In-memory database (ElastiCache)&lt;/li>
&lt;li>Data warehousing database (Redshift)&lt;/li>
&lt;li>NoSQL database (DynamoDB)&lt;/li>
&lt;li>Document database (DocumentDB)&lt;/li>
&lt;li>Graph database (Neptune)&lt;/li>
&lt;li>Time-series database (TimeStream)&lt;/li>
&lt;li>Hadoop (EMR)&lt;/li>
&lt;li>Data lakes (Lake Formation)&lt;/li>
&lt;/ul>
&lt;p>While relational databases dominated the landscape up until the mid-2000s, it’s pretty easy to see that companies are now mixing and matching various platforms and that relational databases may become less popular over time.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Databases are getting larger, but at the same time storage, clustering, and partitioning technologies are becoming more robust. Working with huge amounts of data can be quite challenging, regardless of the technology stack. Whether you use relational databases, big data platforms, or a variety of database servers, SQL is evolving to facilitate data retrieval from various technologies.&lt;/p></description></item><item><title>Learning SQL Notes #14: Analytic Functions</title><link>https://siqi-zheng.rbind.io/post/2021-06-11-sql-notes-14/</link><pubDate>Fri, 11 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-11-sql-notes-14/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#analytic-function-concepts">Analytic Function Concepts&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#data-windows">Data Windows&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#localized-sorting">Localized Sorting&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#ranking">Ranking&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#ranking-functions">Ranking Functions&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#generating-multiple-rankings">Generating Multiple Rankings&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#reporting-functions">Reporting Functions&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#window-frames">Window Frames&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#lag-and-lead">Lag and Lead&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#column-value-concatenation">Column Value Concatenation&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h1 id="analytic-function-concepts">Analytic Function Concepts&lt;/h1>
&lt;h2 id="data-windows">Data Windows&lt;/h2>
&lt;pre>
SELECT quarter(payment_date) quarter,
monthname(payment_date) month_nm,
sum(amount) monthly_sales,
&lt;b>max(sum(amount))
&lt;b>over () max_overall_sales,&lt;/b> /* empty over(): the window is the entire result set, returning the highest monthly total in 2005 */
&lt;b>max(sum(amount))
&lt;b>over (partition by quarter(payment_date)) max_qrtr_sales&lt;/b> /* window partitioned by quarter, returning the highest monthly total within each quarter of 2005 */
FROM payment
WHERE year(payment_date) = 2005
GROUP BY quarter(payment_date), monthname(payment_date);
&lt;/pre>
&lt;p>The analytic functions used to generate these additional columns group rows into two different sets: one set containing all rows in the same quarter and another set containing all of the rows. To accommodate this type of analysis, analytic functions include the ability to group rows into &lt;em>windows&lt;/em>, which effectively partition the data for use by the analytic function without changing the overall result set. Windows are defined using the &lt;code>over&lt;/code> clause combined with an optional &lt;code>partition by&lt;/code> subclause. In the previous query, both analytic functions include an over clause, but the first one is empty, indicating that the window should include the entire result set, whereas the second one specifies that the window should include only rows within the same quarter. Data windows may contain anywhere from a single row to all of the rows in the result set, and different analytic functions can define different data windows.&lt;/p>
&lt;h2 id="localized-sorting">Localized Sorting&lt;/h2>
&lt;pre>
SELECT quarter(payment_date) quarter,
monthname(payment_date) month_nm,
sum(amount) monthly_sales,
&lt;b>rank() over (order by sum(amount) desc)&lt;/b> sales_rank /* this order by controls only the rank() calculation */
FROM payment
WHERE year(payment_date) = 2005
GROUP BY quarter(payment_date), monthname(payment_date)
ORDER BY 1, month(payment_date); /* this order by controls only the presentation order */
&lt;/pre>
&lt;p>Alternatively, you may insert &lt;code>partition by quarter(payment_date)&lt;/code> into the &lt;code>over()&lt;/code> clause above to obtain a rank within each quarter.&lt;/p>
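&lt;p>The per-quarter version of the ranking query would then read as follows (the alias &lt;code>qtr_sales_rank&lt;/code> is illustrative):&lt;/p>
&lt;pre>
SELECT quarter(payment_date) quarter,
 monthname(payment_date) month_nm,
 sum(amount) monthly_sales,
 rank() over (&lt;b>partition by quarter(payment_date)&lt;/b>
     order by sum(amount) desc) qtr_sales_rank
FROM payment
WHERE year(payment_date) = 2005
GROUP BY quarter(payment_date), monthname(payment_date)
ORDER BY 1, month(payment_date);
&lt;/pre>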
&lt;h1 id="ranking">Ranking&lt;/h1>
&lt;h2 id="ranking-functions">Ranking Functions&lt;/h2>
&lt;p>There are multiple ranking functions available in the SQL standard, with each one taking a different approach to how ties are handled:&lt;/p>
&lt;p>&lt;code>row_number&lt;/code>&lt;/p>
&lt;p>Returns a unique number for each row, with rankings arbitrarily assigned in case of a tie&lt;/p>
&lt;p>&lt;code>rank&lt;/code>&lt;/p>
&lt;p>Returns the same ranking in case of a tie, with gaps in the rankings&lt;/p>
&lt;p>&lt;code>dense_rank&lt;/code>&lt;/p>
&lt;p>Returns the same ranking in case of a tie, with no gaps in the rankings&lt;/p>
&lt;pre>
SELECT customer_id, count(*) num_rentals,
row_number() over (order by count(*) desc) row_number_rnk,
rank() over (order by count(*) desc) rank_rnk,
dense_rank() over (order by count(*) desc) dense_rank_rnk
FROM rental
GROUP BY customer_id
ORDER BY 2 desc;
&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">customer_id&lt;/th>
&lt;th>num_rentals&lt;/th>
&lt;th>row_number_rnk&lt;/th>
&lt;th>rank_rnk&lt;/th>
&lt;th align="right">dense_rank_rnk&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">144&lt;/td>
&lt;td>42&lt;/td>
&lt;td>3&lt;/td>
&lt;td>3&lt;/td>
&lt;td align="right">3&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">236&lt;/td>
&lt;td>42&lt;/td>
&lt;td>4&lt;/td>
&lt;td>3&lt;/td>
&lt;td align="right">3&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;b>75&lt;/b>&lt;/td>
&lt;td>&lt;b>41&lt;/b>&lt;/td>
&lt;td>&lt;b>5&lt;/b>&lt;/td>
&lt;td>&lt;b>5&lt;/b>&lt;/td>
&lt;td align="right">&lt;b>4&lt;/b>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>How, then, would you identify the top 10 customers? There are three possible solutions:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Use the row_number function to identify customers ranked from 1 to 10, which results in exactly 10 customers in this example, but in other cases might exclude customers having the same number of rentals as the 10th ranked customer.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the rank function to identify customers ranked 10 or less, which also results in exactly 10 customers.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the dense_rank function to identify customers ranked 10 or less, which yields a list of 37 customers.&lt;/p>
&lt;/li>
&lt;/ul>
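&lt;p>For the second option, the rank-based filter has to live in a subquery, since analytic functions may not appear in the &lt;code>where&lt;/code> clause (a sketch):&lt;/p>
&lt;pre>
SELECT customer_id, num_rentals
FROM
 (SELECT customer_id, count(*) num_rentals,
     rank() over (order by count(*) desc) rank_rnk
  FROM rental
  GROUP BY customer_id
 ) cust_rankings
&lt;b>WHERE rank_rnk &lt;= 10&lt;/b>
ORDER BY num_rentals desc;
&lt;/pre>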
&lt;h2 id="generating-multiple-rankings">Generating Multiple Rankings&lt;/h2>
&lt;pre>
SELECT customer_id,
monthname(rental_date) rental_month,
count(*) num_rentals,
rank() over (&lt;b>partition by monthname(rental_date) &lt;/b>
order by count(*) desc) rank_rnk
FROM rental
GROUP BY customer_id, monthname(rental_date)
ORDER BY 2, 3 desc;
&lt;/pre>
&lt;p>The &lt;code>partition by&lt;/code> subclause causes &lt;code>rank()&lt;/code> to restart at 1 for each month.&lt;/p>
&lt;p>Looking at the results, you can see that the rankings are reset to 1 for each month. In order to generate the desired results for the marketing department (top five customers from each month), you can simply wrap the previous query in a subquery and add a filter condition to exclude any rows with a ranking higher than five:&lt;/p>
&lt;pre>
SELECT customer_id, rental_month, num_rentals, rank_rnk ranking
FROM
(SELECT customer_id,
monthname(rental_date) rental_month, count(*) num_rentals,
rank() over (partition by monthname(rental_date) order by count(*) desc) rank_rnk
FROM rental
GROUP BY customer_id, monthname(rental_date)
) cust_rankings
&lt;b>WHERE rank_rnk &lt;= 5&lt;/b>
ORDER BY rental_month, num_rentals desc, rank_rnk;
&lt;/pre>
&lt;p>Since analytic functions can be used only in the SELECT clause, you will often need to &lt;strong>nest queries&lt;/strong> if you need to do any filtering or grouping based on the results from the analytic function.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Window Function&lt;/th>
&lt;th>Return Type&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>CUME_DIST()&lt;/td>
&lt;td>DOUBLE PRECISION&lt;/td>
&lt;td>The CUME_DIST() window function calculates the relative rank of the current row within a window partition: (number of rows preceding or peer with current row) / (total rows in the window partition)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DENSE_RANK()&lt;/td>
&lt;td>BIGINT&lt;/td>
&lt;td>The DENSE_RANK () window function determines the rank of a value in a group of values based on the ORDER BY expression and the OVER clause. Each value is ranked within its partition. Rows with equal values receive the same rank. There are no gaps in the sequence of ranked values if two or more rows have the same rank.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>NTILE()&lt;/td>
&lt;td>INTEGER&lt;/td>
&lt;td>The NTILE window function divides the rows for each window partition, as equally as possible, into a specified number of ranked groups. The NTILE window function requires the ORDER BY clause in the OVER clause.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>PERCENT_RANK()&lt;/td>
&lt;td>DOUBLE PRECISION&lt;/td>
&lt;td>The PERCENT_RANK () window function calculates the percent rank of the current row using the following formula: (x - 1) / (number of rows in window partition - 1) where x is the rank of the current row.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>RANK()&lt;/td>
&lt;td>BIGINT&lt;/td>
&lt;td>The RANK window function determines the rank of a value in a group of values. The ORDER BY expression in the OVER clause determines the value. Each value is ranked within its partition. Rows with equal values for the ranking criteria receive the same rank. Drill adds the number of tied rows to the tied rank to calculate the next rank and thus the ranks might not be consecutive numbers. For example, if two rows are ranked 1, the next rank is 3. The DENSE_RANK window function differs in that no gaps exist if two or more rows tie.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>ROW_NUMBER()&lt;/td>
&lt;td>BIGINT&lt;/td>
&lt;td>The ROW_NUMBER window function determines the ordinal number of the current row within its partition. The ORDER BY expression in the OVER clause determines the number. Each value is ordered within its partition. Rows with equal values for the ORDER BY expressions receive different row numbers nondeterministically.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h1 id="reporting-functions">Reporting Functions&lt;/h1>
&lt;p>Calculate each payment&amp;rsquo;s monthly total and the grand total:&lt;/p>
&lt;pre>
SELECT monthname(payment_date) payment_month,
amount,
&lt;b>sum(amount) over (partition by monthname(payment_date)) monthly_total,
sum(amount) over () grand_total &lt;/b>
FROM payment
WHERE amount >= 10
ORDER BY 1;
&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">payment_month&lt;/th>
&lt;th>amount&lt;/th>
&lt;th>monthly_total&lt;/th>
&lt;th align="right">grand_total&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">August&lt;/td>
&lt;td>10.99&lt;/td>
&lt;td>521.53&lt;/td>
&lt;td align="right">1262.86&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">August&lt;/td>
&lt;td>11.99&lt;/td>
&lt;td>521.53&lt;/td>
&lt;td align="right">1262.86&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Calculate each month&amp;rsquo;s percentage of the total:&lt;/p>
&lt;pre>
SELECT monthname(payment_date) payment_month,
sum(amount) month_total,
&lt;b>round(sum(amount) / sum(sum(amount)) over () * 100, 2) pct_of_total&lt;/b>
FROM payment
GROUP BY monthname(payment_date);
&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">payment_month&lt;/th>
&lt;th>month_total&lt;/th>
&lt;th align="right">pct_of_total&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">May&lt;/td>
&lt;td>4824.43&lt;/td>
&lt;td align="right">7.16&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">June&lt;/td>
&lt;td>9631.88&lt;/td>
&lt;td align="right">14.29&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">July&lt;/td>
&lt;td>28373.89&lt;/td>
&lt;td align="right">42.09&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">August&lt;/td>
&lt;td>24072.13&lt;/td>
&lt;td align="right">35.71&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">February&lt;/td>
&lt;td>514.18&lt;/td>
&lt;td align="right">0.76&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Quasi-ranking functions:&lt;/p>
&lt;pre>
SELECT monthname(payment_date) payment_month,
sum(amount) month_total,
&lt;b>CASE sum(amount)
WHEN max(sum(amount)) over () THEN 'Highest'
WHEN min(sum(amount)) over () THEN 'Lowest'
ELSE 'Middle'
END descriptor&lt;/b>
FROM payment
GROUP BY monthname(payment_date);
&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">payment_month&lt;/th>
&lt;th>month_total&lt;/th>
&lt;th align="right">descriptor&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">May&lt;/td>
&lt;td>4824.43&lt;/td>
&lt;td align="right">Middle&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">June&lt;/td>
&lt;td>9631.88&lt;/td>
&lt;td align="right">Middle&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">July&lt;/td>
&lt;td>28373.89&lt;/td>
&lt;td align="right">&lt;b>Highest&lt;/b>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">August&lt;/td>
&lt;td>24072.13&lt;/td>
&lt;td align="right">Middle&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">February&lt;/td>
&lt;td>514.18&lt;/td>
&lt;td align="right">&lt;b>Lowest&lt;/b>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="window-frames">Window Frames&lt;/h2>
&lt;pre>
SELECT yearweek(payment_date) payment_week,
sum(amount) week_total,
sum(sum(amount))
&lt;b>over (order by yearweek(payment_date)
rows unbounded preceding)&lt;/b> rolling_sum
FROM payment
GROUP BY yearweek(payment_date)
ORDER BY 1;
&lt;/pre>
&lt;pre>
SELECT yearweek(payment_date) payment_week,
sum(amount) week_total,
avg(sum(amount))
over (order by yearweek(payment_date)
&lt;b>rows between 1 preceding and 1 following&lt;/b>) rolling_3wk_avg
FROM payment
GROUP BY yearweek(payment_date)
ORDER BY 1;
&lt;/pre>
&lt;pre>
SELECT date(payment_date), sum(amount),
avg(sum(amount))
over (order by date(payment_date)
&lt;b>range between interval 3 day preceding and interval 3 day following&lt;/b>) rolling_7day_avg
FROM payment
WHERE payment_date BETWEEN '2005-07-01' AND '2005-09-01'
GROUP BY date(payment_date)
ORDER BY 1;
&lt;/pre>
&lt;h2 id="lag-and-lead">Lag and Lead&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Window Function&lt;/th>
&lt;th>Argument Type&lt;/th>
&lt;th>Return Type&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>LAG()&lt;/td>
&lt;td>Any supported Drill data types&lt;/td>
&lt;td>Same as the expression type&lt;/td>
&lt;td>The LAG() window function returns the value for the row before the current row in a partition. If no row exists, null is returned.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>LEAD()&lt;/td>
&lt;td>Any supported Drill data types&lt;/td>
&lt;td>Same as the expression type&lt;/td>
&lt;td>The LEAD() window function returns the value for the row after the current row in a partition. If no row exists, null is returned.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>FIRST_VALUE&lt;/td>
&lt;td>Any supported Drill data types&lt;/td>
&lt;td>Same as the expression type&lt;/td>
&lt;td>The FIRST_VALUE window function returns the value of the specified expression with respect to the first row in the window frame.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>LAST_VALUE&lt;/td>
&lt;td>Any supported Drill data types&lt;/td>
&lt;td>Same as the expression type&lt;/td>
&lt;td>The LAST_VALUE window function returns the value of the specified expression with respect to the last row in the window frame.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>
SELECT yearweek(payment_date) payment_week,
sum(amount) week_total,
&lt;b>lag(sum(amount), 1)
over (order by yearweek(payment_date)) prev_wk_tot,&lt;/b>
&lt;b>lead(sum(amount), 1)
over (order by yearweek(payment_date)) next_wk_tot&lt;/b>
FROM payment
GROUP BY yearweek(payment_date)
ORDER BY 1;
&lt;/pre>
&lt;pre>
SELECT yearweek(payment_date) payment_week,
sum(amount) week_total,
&lt;b>round((sum(amount) - lag(sum(amount), 1)
over (order by yearweek(payment_date))) / lag(sum(amount), 1)
over (order by yearweek(payment_date)) * 100, 1) pct_diff&lt;/b>
FROM payment
GROUP BY yearweek(payment_date)
ORDER BY 1;
&lt;/pre>
&lt;h2 id="column-value-concatenation">Column Value Concatenation&lt;/h2>
&lt;pre>
SELECT f.title,
&lt;b>group_concat(a.last_name order by a.last_name separator ', ') actors&lt;/b>
FROM actor a
INNER JOIN film_actor fa
ON a.actor_id = fa.actor_id
INNER JOIN film f
ON fa.film_id = f.film_id
GROUP BY f.title
HAVING count(*) = 3;
&lt;/pre></description></item><item><title>Learning SQL Notes #13: Metadata</title><link>https://siqi-zheng.rbind.io/post/2021-06-10-sql-notes-13/</link><pubDate>Thu, 10 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-10-sql-notes-13/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#data-about-data">Data About Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#information_schema">information_schema&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#working-with-metadata">Working with Metadata&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#schema-generation-scripts">Schema Generation Scripts&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#deployment-verification">Deployment Verification&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#dynamic-sql-generation">Dynamic SQL Generation&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>A database server also needs to store information about all of the database objects (tables, views, indexes, etc.) that were created to store this data in a database. This chapter discusses how and where this information, known as &lt;em>metadata&lt;/em>, is stored, how you can access it, and how you can use it to build flexible systems.&lt;/p>
&lt;h1 id="data-about-data">Data About Data&lt;/h1>
&lt;p>Metadata is essentially data about data. Every time you create a database object, the database server needs to record various pieces of information. For example, if you were to create a table with multiple columns, a primary key constraint, three indexes, and a foreign key constraint, the database server would need to store all the following information:&lt;/p>
&lt;ul>
&lt;li>Table name&lt;/li>
&lt;li>Table storage information (tablespace, initial size, etc.)&lt;/li>
&lt;li>Storage engine&lt;/li>
&lt;li>Column names&lt;/li>
&lt;li>Column data types&lt;/li>
&lt;li>Default column values&lt;/li>
&lt;li>not null column constraints&lt;/li>
&lt;li>Primary key columns&lt;/li>
&lt;li>Primary key name&lt;/li>
&lt;li>Name of primary key index&lt;/li>
&lt;li>Index names&lt;/li>
&lt;li>Index types (B-tree, bitmap)&lt;/li>
&lt;li>Indexed columns&lt;/li>
&lt;li>Index column sort order (ascending or descending)&lt;/li>
&lt;li>Index storage information&lt;/li>
&lt;li>Foreign key name&lt;/li>
&lt;li>Foreign key columns&lt;/li>
&lt;li>Associated table/columns for foreign keys&lt;/li>
&lt;/ul>
&lt;p>This data is collectively known as the &lt;em>data dictionary&lt;/em> or &lt;em>system catalog&lt;/em>. The database server needs to store this data persistently, and it needs to be able to quickly retrieve this data in order to verify and execute SQL statements. Additionally, the database server must safeguard this data so that it can be modified only via an appropriate mechanism, such as the &lt;code>alter&lt;/code> table statement.&lt;/p>
&lt;p>Every database server uses a different mechanism to publish metadata, such as:&lt;/p>
&lt;ul>
&lt;li>A set of views, such as Oracle Database’s user_tables and all_constraints views&lt;/li>
&lt;li>A set of system-stored procedures, such as SQL Server’s sp_tables procedure or Oracle Database’s dbms_metadata package&lt;/li>
&lt;li>A special database, such as MySQL’s information_schema database&lt;/li>
&lt;/ul>
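&lt;p>As a self-contained sketch of the same idea — runnable anywhere Python is available — here is how SQLite publishes its data dictionary through the &lt;code>sqlite_master&lt;/code> catalog table (an analog of MySQL's information_schema, used here purely for illustration):&lt;/p>

```python
import sqlite3

# Create a couple of schema objects; the server records them in its catalog.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE category (
           category_id INTEGER PRIMARY KEY,
           name TEXT NOT NULL
       )"""
)
conn.execute("CREATE INDEX idx_category_name ON category (name)")

# The catalog can be queried with ordinary SELECT statements.
rows = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()
for obj_type, obj_name in rows:
    print(obj_type, obj_name)
```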
&lt;h1 id="information_schema">information_schema&lt;/h1>
&lt;p>All of the objects available within the information_schema database (or &lt;em>schema&lt;/em>, in the case of SQL Server) are views. Unlike the describe utility, the views within information_schema can be queried and, thus, used programmatically.&lt;/p>
&lt;table frame="box" rules="all" summary="A reference that lists all INFORMATION_SCHEMA tables.">&lt;col style="width: 22%">&lt;col style="width: 55%">&lt;col style="width: 11%">&lt;col style="width: 11%">&lt;thead>&lt;tr>&lt;th>Table Name&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Introduced&lt;/th>
&lt;th>Deprecated&lt;/th>
&lt;/tr>&lt;/thead>&lt;tbody>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ADMINISTRABLE_ROLE_AUTHORIZATIONS&lt;/code>&lt;/th>&lt;td>Grantable users or roles for current user or role&lt;/td>&lt;td>8.0.19&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">APPLICABLE_ROLES&lt;/code>&lt;/th>&lt;td>Applicable roles for current user&lt;/td>&lt;td>8.0.19&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">CHARACTER_SETS&lt;/code>&lt;/th>&lt;td>Available character sets&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">CHECK_CONSTRAINTS&lt;/code>&lt;/th>&lt;td>Table and column CHECK constraints&lt;/td>&lt;td>8.0.16&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">COLLATION_CHARACTER_SET_APPLICABILITY&lt;/code>&lt;/th>&lt;td>Character set applicable to each collation&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">COLLATIONS&lt;/code>&lt;/th>&lt;td>Collations for each character set&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">COLUMN_PRIVILEGES&lt;/code>&lt;/th>&lt;td>Privileges defined on columns&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">COLUMN_STATISTICS&lt;/code>&lt;/th>&lt;td>Histogram statistics for column values&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">COLUMNS&lt;/code>&lt;/th>&lt;td>Columns in each table&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">COLUMNS_EXTENSIONS&lt;/code>&lt;/th>&lt;td>Column attributes for primary and secondary storage engines&lt;/td>&lt;td>8.0.21&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">CONNECTION_CONTROL_FAILED_LOGIN_ATTEMPTS&lt;/code>&lt;/th>&lt;td>Current number of consecutive failed connection attempts per account&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ENABLED_ROLES&lt;/code>&lt;/th>&lt;td>Roles enabled within current session&lt;/td>&lt;td>8.0.19&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ENGINES&lt;/code>&lt;/th>&lt;td>Storage engine properties&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">EVENTS&lt;/code>&lt;/th>&lt;td>Event Manager events&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">FILES&lt;/code>&lt;/th>&lt;td>Files that store tablespace data&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_BUFFER_PAGE&lt;/code>&lt;/th>&lt;td>Pages in InnoDB buffer pool&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_BUFFER_PAGE_LRU&lt;/code>&lt;/th>&lt;td>LRU ordering of pages in InnoDB buffer pool&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_BUFFER_POOL_STATS&lt;/code>&lt;/th>&lt;td>InnoDB buffer pool statistics&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_CACHED_INDEXES&lt;/code>&lt;/th>&lt;td>Number of index pages cached per index in InnoDB buffer pool&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_CMP&lt;/code>&lt;/th>&lt;td>Status for operations related to compressed InnoDB tables&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_CMP_PER_INDEX&lt;/code>&lt;/th>&lt;td>Status for operations related to compressed InnoDB tables and indexes&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_CMP_PER_INDEX_RESET&lt;/code>&lt;/th>&lt;td>Status for operations related to compressed InnoDB tables and indexes&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_CMP_RESET&lt;/code>&lt;/th>&lt;td>Status for operations related to compressed InnoDB tables&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_CMPMEM&lt;/code>&lt;/th>&lt;td>Status for compressed pages within InnoDB buffer pool&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_CMPMEM_RESET&lt;/code>&lt;/th>&lt;td>Status for compressed pages within InnoDB buffer pool&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_COLUMNS&lt;/code>&lt;/th>&lt;td>Columns in each InnoDB table&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_DATAFILES&lt;/code>&lt;/th>&lt;td>Data file path information for InnoDB file-per-table and general tablespaces&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FIELDS&lt;/code>&lt;/th>&lt;td>Key columns of InnoDB indexes&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FOREIGN&lt;/code>&lt;/th>&lt;td>InnoDB foreign-key metadata&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FOREIGN_COLS&lt;/code>&lt;/th>&lt;td>InnoDB foreign-key column status information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FT_BEING_DELETED&lt;/code>&lt;/th>&lt;td>Snapshot of INNODB_FT_DELETED table&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FT_CONFIG&lt;/code>&lt;/th>&lt;td>Metadata for InnoDB table FULLTEXT index and associated processing&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FT_DEFAULT_STOPWORD&lt;/code>&lt;/th>&lt;td>Default list of stopwords for InnoDB FULLTEXT indexes&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FT_DELETED&lt;/code>&lt;/th>&lt;td>Rows deleted from InnoDB table FULLTEXT index&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FT_INDEX_CACHE&lt;/code>&lt;/th>&lt;td>Token information for newly inserted rows in InnoDB FULLTEXT index&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FT_INDEX_TABLE&lt;/code>&lt;/th>&lt;td>Inverted index information for processing text searches against InnoDB table FULLTEXT index&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_INDEXES&lt;/code>&lt;/th>&lt;td>InnoDB index metadata&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_METRICS&lt;/code>&lt;/th>&lt;td>InnoDB performance information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_SESSION_TEMP_TABLESPACES&lt;/code>&lt;/th>&lt;td>Session temporary-tablespace metadata&lt;/td>&lt;td>8.0.13&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_TABLES&lt;/code>&lt;/th>&lt;td>InnoDB table metadata&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_TABLESPACES&lt;/code>&lt;/th>&lt;td>InnoDB file-per-table, general, and undo tablespace metadata&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_TABLESPACES_BRIEF&lt;/code>&lt;/th>&lt;td>Brief file-per-table, general, undo, and system tablespace metadata&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_TABLESTATS&lt;/code>&lt;/th>&lt;td>InnoDB table low-level status information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_TEMP_TABLE_INFO&lt;/code>&lt;/th>&lt;td>Information about active user-created InnoDB temporary tables&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_TRX&lt;/code>&lt;/th>&lt;td>Active InnoDB transaction information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_VIRTUAL&lt;/code>&lt;/th>&lt;td>InnoDB virtual generated column metadata&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">KEY_COLUMN_USAGE&lt;/code>&lt;/th>&lt;td>Which key columns have constraints&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">KEYWORDS&lt;/code>&lt;/th>&lt;td>MySQL keywords&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">MYSQL_FIREWALL_USERS&lt;/code>&lt;/th>&lt;td>Firewall in-memory data for account profiles&lt;/td>&lt;td>&lt;/td>&lt;td>8.0.26&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">MYSQL_FIREWALL_WHITELIST&lt;/code>&lt;/th>&lt;td>Firewall in-memory data for account profile allowlists&lt;/td>&lt;td>&lt;/td>&lt;td>8.0.26&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ndb_transid_mysql_connection_map&lt;/code>&lt;/th>&lt;td>NDB transaction information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">OPTIMIZER_TRACE&lt;/code>&lt;/th>&lt;td>Information produced by optimizer trace activity&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">PARAMETERS&lt;/code>&lt;/th>&lt;td>Stored routine parameters and stored function return values&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">PARTITIONS&lt;/code>&lt;/th>&lt;td>Table partition information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">PLUGINS&lt;/code>&lt;/th>&lt;td>Plugin information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">PROCESSLIST&lt;/code>&lt;/th>&lt;td>Information about currently executing threads&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">PROFILING&lt;/code>&lt;/th>&lt;td>Statement profiling information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">REFERENTIAL_CONSTRAINTS&lt;/code>&lt;/th>&lt;td>Foreign key information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">RESOURCE_GROUPS&lt;/code>&lt;/th>&lt;td>Resource group information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ROLE_COLUMN_GRANTS&lt;/code>&lt;/th>&lt;td>Column privileges for roles available to or granted by currently enabled roles&lt;/td>&lt;td>8.0.19&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ROLE_ROUTINE_GRANTS&lt;/code>&lt;/th>&lt;td>Routine privileges for roles available to or granted by currently enabled roles&lt;/td>&lt;td>8.0.19&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ROLE_TABLE_GRANTS&lt;/code>&lt;/th>&lt;td>Table privileges for roles available to or granted by currently enabled roles&lt;/td>&lt;td>8.0.19&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ROUTINES&lt;/code>&lt;/th>&lt;td>Stored routine information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">SCHEMA_PRIVILEGES&lt;/code>&lt;/th>&lt;td>Privileges defined on schemas&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">SCHEMATA&lt;/code>&lt;/th>&lt;td>Schema information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">SCHEMATA_EXTENSIONS&lt;/code>&lt;/th>&lt;td>Schema options&lt;/td>&lt;td>8.0.22&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ST_GEOMETRY_COLUMNS&lt;/code>&lt;/th>&lt;td>Columns in each table that store spatial data&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ST_SPATIAL_REFERENCE_SYSTEMS&lt;/code>&lt;/th>&lt;td>Available spatial reference systems&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ST_UNITS_OF_MEASURE&lt;/code>&lt;/th>&lt;td>Acceptable units for ST_Distance()&lt;/td>&lt;td>8.0.14&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">STATISTICS&lt;/code>&lt;/th>&lt;td>Table index statistics&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TABLE_CONSTRAINTS&lt;/code>&lt;/th>&lt;td>Which tables have constraints&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TABLE_CONSTRAINTS_EXTENSIONS&lt;/code>&lt;/th>&lt;td>Table constraint attributes for primary and secondary storage engines&lt;/td>&lt;td>8.0.21&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TABLE_PRIVILEGES&lt;/code>&lt;/th>&lt;td>Privileges defined on tables&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TABLES&lt;/code>&lt;/th>&lt;td>Table information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TABLES_EXTENSIONS&lt;/code>&lt;/th>&lt;td>Table attributes for primary and secondary storage engines&lt;/td>&lt;td>8.0.21&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TABLESPACES&lt;/code>&lt;/th>&lt;td>Tablespace information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TABLESPACES_EXTENSIONS&lt;/code>&lt;/th>&lt;td>Tablespace attributes for primary storage engines&lt;/td>&lt;td>8.0.21&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TP_THREAD_GROUP_STATE&lt;/code>&lt;/th>&lt;td>Thread pool thread group states&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TP_THREAD_GROUP_STATS&lt;/code>&lt;/th>&lt;td>Thread pool thread group statistics&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TP_THREAD_STATE&lt;/code>&lt;/th>&lt;td>Thread pool thread information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TRIGGERS&lt;/code>&lt;/th>&lt;td>Trigger information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">USER_ATTRIBUTES&lt;/code>&lt;/th>&lt;td>User comments and attributes&lt;/td>&lt;td>8.0.21&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">USER_PRIVILEGES&lt;/code>&lt;/th>&lt;td>Privileges defined globally per user&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">VIEW_ROUTINE_USAGE&lt;/code>&lt;/th>&lt;td>Stored functions used in views&lt;/td>&lt;td>8.0.13&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">VIEW_TABLE_USAGE&lt;/code>&lt;/th>&lt;td>Tables and views used in views&lt;/td>&lt;td>8.0.13&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">VIEWS&lt;/code>&lt;/th>&lt;td>View information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;/tbody>&lt;/table>
&lt;h1 id="working-with-metadata">Working with Metadata&lt;/h1>
&lt;h2 id="schema-generation-scripts">Schema Generation Scripts&lt;/h2>
&lt;p>Suppose you need a script that re-creates the tables, indexes, views, and other objects your team has deployed. As an example, the following query builds a template-like SQL script for creating the sakila.category table.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT 'CREATE TABLE category (' create_table_statement
UNION ALL
SELECT cols.txt
FROM
(SELECT concat(' ',column_name, ' ', column_type,
CASE
WHEN is_nullable = 'NO' THEN ' not null' ELSE ''
END, CASE
WHEN extra IS NOT NULL AND extra LIKE 'DEFAULT_GENERATED%' THEN concat(' DEFAULT ',column_default,substr(extra,18)) WHEN extra IS NOT NULL THEN concat(' ', extra)
ELSE '' END, ',') txt
FROM information_schema.columns
WHERE table_schema = 'sakila' AND table_name = 'category'
ORDER BY ordinal_position
) cols
UNION ALL
SELECT concat(' constraint primary key (')
FROM information_schema.table_constraints
WHERE table_schema = 'sakila' AND table_name = 'category'
AND constraint_type = 'PRIMARY KEY'
UNION ALL
SELECT cols.txt
FROM
(SELECT concat(CASE WHEN ordinal_position &amp;gt; 1 THEN ' ,'
ELSE ' ' END, column_name) txt
FROM information_schema.key_column_usage
WHERE table_schema = 'sakila' AND table_name = 'category'
AND constraint_name = 'PRIMARY'
ORDER BY ordinal_position
) cols
UNION ALL
SELECT ' )'
UNION ALL
SELECT ')';
&lt;/code>&lt;/pre>
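&lt;p>The same schema-generation technique can also be sketched procedurally. Here is a minimal, self-contained version that reads column metadata from SQLite's &lt;code>PRAGMA table_info&lt;/code> catalog (standing in for information_schema.columns; the table is a toy stand-in for sakila.category):&lt;/p>

```python
import sqlite3

# Create a toy version of the category table whose DDL we will reconstruct.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE category (
           category_id INTEGER NOT NULL,
           name TEXT NOT NULL,
           last_update TEXT,
           PRIMARY KEY (category_id)
       )"""
)

# Each PRAGMA table_info row: (cid, name, type, notnull, dflt_value, pk)
cols = conn.execute("PRAGMA table_info(category)").fetchall()
lines = [
    "  %s %s%s," % (name, ctype, " not null" if notnull else "")
    for _cid, name, ctype, notnull, _dflt, _pk in cols
]
pk_cols = [name for _cid, name, _t, _nn, _d, pk in cols if pk]
lines.append("  constraint primary key (%s)" % ", ".join(pk_cols))
ddl = "CREATE TABLE category (\n%s\n);" % "\n".join(lines)
print(ddl)
```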
&lt;h2 id="deployment-verification">Deployment Verification&lt;/h2>
&lt;p>After the deployment scripts have been run, it’s a good idea to run a verification script to ensure that the new schema objects are in place with the appropriate columns, indexes, primary keys, and so forth. Here’s a query that returns the number of columns, number of indexes, and number of primary key constraints (0 or 1) for each table in the Sakila schema:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT tbl.table_name,
(SELECT count(*)
FROM information_schema.columns clm
WHERE clm.table_schema = tbl.table_schema
AND clm.table_name = tbl.table_name) num_columns,
(SELECT count(*)
FROM information_schema.statistics sta
WHERE sta.table_schema = tbl.table_schema
AND sta.table_name = tbl.table_name) num_indexes,
(SELECT count(*)
FROM information_schema.table_constraints tc
WHERE tc.table_schema = tbl.table_schema
AND tc.table_name = tbl.table_name
AND tc.constraint_type = 'PRIMARY KEY') num_primary_keys
FROM information_schema.tables tbl
WHERE tbl.table_schema = 'sakila' AND tbl.table_type = 'BASE TABLE'
ORDER BY 1;
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">TABLE_NAME&lt;/th>
&lt;th>num_columns&lt;/th>
&lt;th>num_indexes&lt;/th>
&lt;th align="right">num_primary_keys&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">actor&lt;/td>
&lt;td>4&lt;/td>
&lt;td>2&lt;/td>
&lt;td align="right">1&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
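&lt;p>The verification idea carries over to any catalog. A minimal sketch against SQLite's catalogs (the table and index names are made up for illustration):&lt;/p>

```python
import sqlite3

# Deploy a toy schema, then count columns and indexes per table from the catalog.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE actor (actor_id INTEGER PRIMARY KEY,
                        first_name TEXT, last_name TEXT, last_update TEXT);
    CREATE INDEX idx_actor_last_name ON actor (last_name);
    """
)

report = {}
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
for (tbl,) in tables:
    num_columns = len(conn.execute(f"PRAGMA table_info({tbl})").fetchall())
    num_indexes = len(conn.execute(f"PRAGMA index_list({tbl})").fetchall())
    report[tbl] = (num_columns, num_indexes)
print(report)
```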
&lt;h2 id="dynamic-sql-generation">Dynamic SQL Generation&lt;/h2>
&lt;p>Most relational database servers, including SQL Server, Oracle Database, and MySQL, allow SQL statements to be submitted to the server as strings. Submitting strings to a database engine rather than utilizing its SQL interface is generally known as &lt;em>dynamic SQL execution&lt;/em>.&lt;/p>
&lt;p>&lt;em>Oracle’s PL/SQL language&lt;/em>&lt;/p>
&lt;p>&lt;code>execute immediate&lt;/code>&lt;/p>
&lt;p>&lt;em>SQL Server&lt;/em>&lt;/p>
&lt;p>&lt;code>sp_executesql&lt;/code>&lt;/p>
&lt;p>&lt;em>MySQL&lt;/em>&lt;/p>
&lt;p>&lt;code>prepare, execute, deallocate&lt;/code>&lt;/p>
&lt;pre>&lt;code class="language-sql">SET @qry = 'SELECT customer_id, first_name, last_name FROM customer';
PREPARE dynsql1 FROM @qry;
EXECUTE dynsql1;
DEALLOCATE PREPARE dynsql1;
/*conditions can be specified at runtime*/
SET @qry = 'SELECT customer_id, first_name, last_name FROM customer WHERE customer_id = ?';
PREPARE dynsql2 FROM @qry;
SET @custid = 9;
EXECUTE dynsql2 USING @custid;
SET @custid = 145;
EXECUTE dynsql2 USING @custid;
DEALLOCATE PREPARE dynsql2;
&lt;/code>&lt;/pre>
&lt;p>Or you can do the following:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT concat('SELECT ', concat_ws(',', cols.col1, cols.col2),
       ' FROM customer WHERE customer_id = ?')
INTO @qry
FROM (SELECT
        max(CASE WHEN ordinal_position = 1 THEN column_name
            ELSE NULL END) col1,
        max(CASE WHEN ordinal_position = 2 THEN column_name
            ELSE NULL END) col2
      FROM information_schema.columns
      WHERE table_schema = 'sakila' AND table_name = 'customer'
      GROUP BY table_name
     ) cols;
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-sql">PREPARE dynsql3 FROM @qry;
SET @custid = 45; Query OK, 0 rows affected (0.00 sec)
EXECUTE dynsql3 USING @custid;
DEALLOCATE PREPARE dynsql3;
&lt;/code>&lt;/pre>
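&lt;p>In client code, the PREPARE / EXECUTE ... USING pattern corresponds to parameterized queries. A self-contained sqlite3 sketch (the table and rows are invented for illustration):&lt;/p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(9, "MARGARET", "MOORE"), (145, "LUCILLE", "HOLMES")],
)

# The statement text stays fixed; only the bound value changes between calls,
# just as with EXECUTE dynsql2 USING @custid.
qry = "SELECT customer_id, first_name, last_name FROM customer WHERE customer_id = ?"
for custid in (9, 145):
    print(conn.execute(qry, (custid,)).fetchone())
```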
&lt;p>Note: Generally, it would be better to generate the query using a procedural language that includes looping constructs, such as Java, PL/SQL, Transact-SQL, or MySQL’s Stored Procedure Language.&lt;/p></description></item><item><title>Learning SQL Notes #12: Views</title><link>https://siqi-zheng.rbind.io/post/2021-06-09-sql-notes-12/</link><pubDate>Wed, 09 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-09-sql-notes-12/</guid><description>&lt;p>Well-designed applications generally expose a public interface while keeping implementation details private, thereby enabling future design changes without impacting end users. When designing your database, you can achieve a similar result by keeping your tables private and allowing your users to access data only through a set of &lt;em>views&lt;/em>.&lt;/p>
&lt;h1 id="what-are-views">What Are Views?&lt;/h1>
&lt;pre>&lt;code class="language-sql">CREATE VIEW customer_vw
(customer_id,
first_name,
last_name,
email
)
AS
SELECT customer_id,
first_name,
last_name,
concat(substr(email,1,2), '*****', substr(email, -4)) email
FROM customer;
/*view the View*/
describe customer_vw;
/*group by, having, where, join etc. can also be used*/
&lt;/code>&lt;/pre>
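&lt;p>For a runnable illustration of the masking view, here is the same idea in SQLite (which uses &lt;code>||&lt;/code> in place of concat; the sample row is made up):&lt;/p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, email TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'MARY.SMITH@sakilacustomer.org')")

# SQLite has no concat(); the || operator plays the same role.
conn.execute(
    """CREATE VIEW customer_vw AS
       SELECT customer_id,
              substr(email, 1, 2) || '*****' || substr(email, -4) AS email
       FROM customer"""
)
masked = conn.execute("SELECT email FROM customer_vw").fetchone()[0]
print(masked)  # → MA*****.org
```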
&lt;h1 id="why-use-views">Why Use Views?&lt;/h1>
&lt;ul>
&lt;li>
&lt;p>Data Security&lt;/p>
&lt;p>Oracle Database users have another option for securing both rows and columns of a table: Virtual Private Database (VPD). VPD allows you to attach policies to your tables, after which the server will modify a user’s query as necessary to enforce the policies.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Data Aggregation&lt;/p>
&lt;pre>&lt;code class="language-sql">CREATE VIEW sales_by_film_category AS
SELECT c.name AS category,
SUM(p.amount) AS total_sales
FROM payment AS p
INNER JOIN rental AS r
ON p.rental_id = r.rental_id
INNER JOIN inventory AS i
ON r.inventory_id = i.inventory_id
INNER JOIN film AS f
ON i.film_id = f.film_id
INNER JOIN film_category AS fc
ON f.film_id = fc.film_id
INNER JOIN category AS c
ON fc.category_id = c.category_id
GROUP BY c.name
ORDER BY total_sales DESC;
&lt;/code>&lt;/pre>
&lt;p>You have great flexibility! You can create a film_category_sales table, load it with aggregated data, and modify the sales_by_film_category view definition to retrieve data from this table if this improves the performance significantly.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hiding Complexity&lt;/p>
&lt;p>One of the most common reasons for deploying views is to shield end users from complexity.&lt;/p>
&lt;pre>&lt;code class="language-sql">CREATE VIEW film_stats AS
SELECT f.film_id, f.title, f.description, f.rating,
(SELECT c.name
FROM category c
INNER JOIN film_category fc
ON c.category_id = fc.category_id
WHERE fc.film_id = f.film_id) category_name,
(SELECT count(*)
FROM film_actor fa
WHERE fa.film_id = f.film_id ) num_actors,
(SELECT count(*)
FROM inventory i
WHERE i.film_id = f.film_id ) inventory_cnt,
(SELECT count(*)
FROM inventory i
INNER JOIN rental r
ON i.inventory_id = r.inventory_id
WHERE i.film_id = f.film_id ) num_rentals
FROM film f;
&lt;/code>&lt;/pre>
&lt;p>If someone uses this view but does not reference the category_name, num_actors, inventory_cnt, or num_rentals column, then none of the subqueries will be executed. This approach allows the view to be used for supplying descriptive information from the film table without unnecessarily joining five other tables.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Joining Partitioned Data&lt;/p>
&lt;p>Some database designs break large tables into multiple pieces in order to improve performance. For example, if the payment table became large, the designers may decide to break it into two tables: payment_current, which holds the latest six months of data, and payment_historic, which holds all data up to six months ago. You can make it look like all payment data is stored in a single table.&lt;/p>
&lt;pre>&lt;code class="language-sql">CREATE VIEW payment_all
(payment_id,
customer_id,
staff_id,
rental_id, amount,
payment_date,
last_update
) AS
SELECT payment_id, customer_id, staff_id, rental_id, amount, payment_date, last_update
FROM payment_historic
UNION ALL
SELECT payment_id, customer_id, staff_id, rental_id, amount, payment_date, last_update
FROM payment_current;
&lt;/code>&lt;/pre>
&lt;p>Using a view in this case is a good idea because it allows the designers to change the structure of the underlying data without the need to force all database users to modify their queries.&lt;/p>
&lt;/li>
&lt;/ul>
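&lt;p>The partition-stitching view above can be reproduced end to end in a small, self-contained SQLite sketch (toy schema and made-up rows):&lt;/p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE payment_historic (payment_id INTEGER, amount REAL);
    CREATE TABLE payment_current  (payment_id INTEGER, amount REAL);
    INSERT INTO payment_historic VALUES (1, 2.99), (2, 0.99);
    INSERT INTO payment_current  VALUES (3, 4.99);

    -- UNION ALL stitches the two pieces back into one logical table.
    CREATE VIEW payment_all AS
        SELECT payment_id, amount FROM payment_historic
        UNION ALL
        SELECT payment_id, amount FROM payment_current;
    """
)
rows = conn.execute(
    "SELECT payment_id, amount FROM payment_all ORDER BY payment_id"
).fetchall()
print(rows)  # all three rows, regardless of which table holds them
```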
&lt;h1 id="updatable-views">Updatable Views&lt;/h1>
&lt;p>In the case of MySQL, a view is updatable if the following conditions are met:&lt;/p>
&lt;ul>
&lt;li>No aggregate functions are used (max(), min(), avg(), etc.).&lt;/li>
&lt;li>The view does not employ group by or having clauses.&lt;/li>
&lt;li>No subqueries exist in the select or from clause, and any subqueries in the where clause do not refer to tables in the from clause.&lt;/li>
&lt;li>The view does not utilize union, union all, or distinct.&lt;/li>
&lt;li>The from clause includes at least one table or updatable view.&lt;/li>
&lt;li>The from clause uses only inner joins if there is more than one table or view.&lt;/li>
&lt;/ul>
&lt;h2 id="updating-simple-views">Updating Simple Views&lt;/h2>
&lt;pre>&lt;code class="language-sql">UPDATE customer_vw
SET last_name = 'SMITH-ALLEN'
WHERE customer_id = 1;
&lt;/code>&lt;/pre>
&lt;p>No &lt;code>insert&lt;/code> is allowed for views that contain derived columns, even if the derived columns are not included in the statement, and you cannot modify columns that are derived from an expression.&lt;/p>
&lt;h2 id="updating-complex-views">Updating Complex Views&lt;/h2>
&lt;p>For complex views built on more than one table, you are allowed to modify each of the underlying tables separately, but not within a single statement. In order to insert data through a complex view, you would need to know where each column is sourced from. Since many views are created to hide complexity from end users, this seems to defeat the purpose if the users need explicit knowledge of the view definition.&lt;/p>
&lt;li>
&lt;a href="#indexes">Indexes&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#index-creation">Index Creation&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#unique-indexes">Unique indexes&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#multicolumn-indexes">Multicolumn indexes&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#types-of-indexes">Types of Indexes&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#b-tree-indexes">B-tree indexes&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#bitmap-indexes">Bitmap indexes&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#text-indexes">Text indexes&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#how-indexes-are-used">How Indexes Are Used&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#the-downside-of-indexes">The Downside of Indexes&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#constraints">Constraints&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#constraint-creation">Constraint Creation&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h1 id="indexes">Indexes&lt;/h1>
&lt;p>When a row is inserted, the server simply places the data in the next available location within the file (the server maintains a list of free space for each table).&lt;/p>
&lt;p>To find all customers whose last name begins with Y, the server must visit each row in the customer table and inspect the contents of the last_name column; if the last name begins with Y, then the row is added to the result set. This type of access is known as a &lt;em>table&lt;/em> &lt;em>scan&lt;/em>.&lt;/p>
&lt;p>An index is simply a mechanism for finding a specific item within a resource. A database server uses indexes to locate rows in a table. Indexes are special tables that, unlike normal data tables, &lt;em>are&lt;/em> kept in a specific order. Instead of containing &lt;em>all&lt;/em> of the data about an entity, however, an index contains only the column (or columns) used to locate rows in the data table, along with information describing where the rows are physically located. Therefore, the role of indexes is to facilitate the retrieval of a subset of a table’s rows and columns &lt;em>without&lt;/em> the need to inspect every row in the table.&lt;/p>
&lt;h2 id="index-creation">Index Creation&lt;/h2>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
ALTER TABLE customer
ADD INDEX idx_email (email);
/*OR*/
ALTER TABLE customer
DROP INDEX idx_email;
/*SQL Server*/
CREATE INDEX idx_email
ON customer (email);
SHOW INDEX FROM customer \G;
&lt;/code>&lt;/pre>
&lt;p>Indexes can also be defined when the table is created:&lt;/p>
&lt;pre>
CREATE TABLE customer (
customer_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
...
&lt;b>PRIMARY KEY (customer_id),
KEY idx_fk_store_id (store_id),
KEY idx_fk_address_id (address_id),
KEY idx_last_name (last_name),&lt;/b>
...
&lt;/pre>
&lt;h3 id="unique-indexes">Unique indexes&lt;/h3>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
ALTER TABLE customer
ADD UNIQUE INDEX idx_email (email);
/*SQL Server/Oracle Database*/
CREATE UNIQUE INDEX idx_email
ON customer (email);
&lt;/code>&lt;/pre>
&lt;p>You should not build unique indexes on your primary key column(s), since the server already checks uniqueness for primary key values.&lt;/p>
&lt;h3 id="multicolumn-indexes">Multicolumn indexes&lt;/h3>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
ALTER TABLE customer
ADD INDEX idx_full_name (last_name, first_name);
/*SQL Server/Oracle Database*/
CREATE UNIQUE INDEX idx_email
ON customer (email);
&lt;/code>&lt;/pre>
&lt;h2 id="types-of-indexes">Types of Indexes&lt;/h2>
&lt;h3 id="b-tree-indexes">B-tree indexes&lt;/h3>
&lt;p>All the indexes shown thus far are &lt;em>balanced-tree indexes&lt;/em>, which are more commonly known as &lt;em>B-tree indexes&lt;/em>. MySQL, Oracle Database, and SQL Server all default to B-tree indexing.&lt;/p>
&lt;ul>
&lt;li>B-tree indexes are organized as trees, with one or more levels of branch nodes leading to a single level of leaf nodes.&lt;/li>
&lt;li>To find a row, the server starts at the top branch node (called the root node) and follows links down through the branch nodes to the leaf node containing the value and its row location.&lt;/li>
&lt;li>The server can add or remove branch nodes to redistribute the values more evenly and can even add or remove an entire level of branch nodes.&lt;/li>
&lt;/ul>
&lt;h3 id="bitmap-indexes">Bitmap indexes&lt;/h3>
&lt;p>Consider a column such as customer.active: if it holds only two different values (stored as 1 for active and 0 for inactive) and there are far more active customers, it can be difficult to maintain a balanced B-tree index as the number of customers grows.&lt;/p>
&lt;p>For columns that contain only a small number of values across a large number of rows (known as &lt;em>low-cardinality data&lt;/em>), Oracle Database includes bitmap indexes, which generate a bitmap for each value stored in the column.&lt;/p>
&lt;pre>&lt;code class="language-sql">/*Oracle Database*/
CREATE BITMAP INDEX idx_active ON customer (active);
&lt;/code>&lt;/pre>
&lt;p>Bitmap indexes are commonly used in data warehousing environments, where large amounts of data are generally indexed on columns containing relatively few values (e.g., sales quarters, geographic regions, products, salespeople).&lt;/p>
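&lt;p>The idea behind a bitmap index can be sketched in a few lines of Python: one list of 0/1 flags per distinct value, with a 1 wherever a row holds that value. This is a toy illustration of the concept only, not Oracle Database&amp;rsquo;s implementation.&lt;/p>

```python
# Toy sketch of a bitmap index: one bitmap (list of 0/1 flags) per distinct value.
# Illustration of the idea only, not Oracle Database's implementation.

def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to a bitmap with one bit per row."""
    index = {}
    for i, row in enumerate(rows):
        for bitmap in index.values():   # extend every bitmap to cover this row
            bitmap.append(0)
        value = row[column]
        if value not in index:
            index[value] = [0] * (i + 1)
        index[value][i] = 1
    return index

def rows_matching(index, value):
    """Positions of the rows whose bit is set for `value`."""
    return [i for i, bit in enumerate(index.get(value, [])) if bit]
```

&lt;p>For a column like customer.active, only two bitmaps exist no matter how many rows there are, which is why low-cardinality columns suit this structure.&lt;/p>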
&lt;h3 id="text-indexes">Text indexes&lt;/h3>
&lt;h2 id="how-indexes-are-used">How Indexes Are Used&lt;/h2>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
EXPLAIN
SELECT customer_id, first_name, last_name
FROM customer
WHERE first_name LIKE 'S%' AND last_name LIKE 'P%';
/*SQL Server*/
SET SHOWPLAN_TEXT ON
/*Oracle Database: prefix the query with*/
EXPLAIN PLAN FOR
&lt;/code>&lt;/pre>
&lt;p>For this query, the server can employ any of the following strategies:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Scan all rows in the customer table.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the index on the last_name column to find all customers whose last name starts with P; then visit each row of the customer table to find only rows whose first name starts with S.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the index on the last_name and first_name columns to find all customers whose last name starts with P and whose first name starts with S.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Looking at the query results, the &lt;code>possible_keys&lt;/code> column tells you that the server could decide to use either the &lt;code>idx_last_name&lt;/code> or the &lt;code>idx_full_name&lt;/code> index, and the key column tells you that the &lt;code>idx_full_name&lt;/code> index was chosen. Furthermore, the &lt;code>type&lt;/code> column tells you that a range scan will be utilized, meaning that the database server will be looking for a range of values in the index, rather than expecting to retrieve a single row.&lt;/p>
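&lt;p>The same kind of experiment can be run locally with Python&amp;rsquo;s sqlite3 module, where &lt;code>EXPLAIN QUERY PLAN&lt;/code> plays the role of MySQL&amp;rsquo;s &lt;code>EXPLAIN&lt;/code>. The table and data below are made up for the demo, and an equality search on the leading index column is used because SQLite applies its &lt;code>LIKE&lt;/code> optimization only under extra configuration.&lt;/p>

```python
import sqlite3

# SQLite stand-in for MySQL's EXPLAIN: EXPLAIN QUERY PLAN reports whether a
# query uses an index. Table, index, and rows are made up for this demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, "
             "first_name TEXT, last_name TEXT)")
conn.execute("CREATE INDEX idx_full_name ON customer (last_name, first_name)")
conn.executemany("INSERT INTO customer (first_name, last_name) VALUES (?, ?)",
                 [("Susan", "Park"), ("Sam", "Price"), ("Ann", "Quinn")])

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT first_name FROM customer WHERE last_name = 'Park'"
).fetchall()
# The last (detail) column of a plan row names the index chosen for the search.
```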
&lt;h2 id="the-downside-of-indexes">The Downside of Indexes&lt;/h2>
&lt;p>&lt;strong>Every index is a table&lt;/strong> (a special type of table but still a table). Therefore, every time a row is added to or removed from a table, all indexes on that table must be modified. When a row is updated, any indexes on the column or columns that were affected need to be modified as well. Therefore, the more indexes you have, the more work the server needs to do to keep all schema objects up-to-date, which tends to slow things down.&lt;/p>
&lt;p>Indexes also require &lt;strong>disk space&lt;/strong> as well as some amount of care from your administrators, so the best strategy is to add an index when a clear need arises. If you need an index for only special purposes, such as a monthly maintenance routine, you can always add the index, run the routine, and then drop the index until you need it again. In the case of data warehouses, where indexes are crucial during business hours as users run reports and ad hoc queries but are problematic when data is being loaded into the warehouse overnight, it is a common practice to drop the indexes before data is loaded and then re-create them before the warehouse opens for business.&lt;/p>
&lt;p>In general, you should strive to have neither too many indexes nor too few. If you aren’t sure how many indexes you should have, you can use this strategy as a default:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Make sure all primary key columns are indexed (most servers automatically create unique indexes when you create primary key constraints). For multicolumn primary keys, consider building additional indexes on a subset of the primary key columns or on all the primary key columns but in a different order than the primary key constraint definition.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Build indexes on all columns that are referenced in foreign key constraints. Keep in mind that the server checks to make sure there are no child rows when a parent is deleted, so it must issue a query to search for a particular value in the column. If there’s no index on the column, the entire table must be scanned.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Index any columns that will frequently be used to retrieve data. Most date columns are good candidates, along with short (2- to 50-character) string columns.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h1 id="constraints">Constraints&lt;/h1>
&lt;p>A constraint is simply a restriction placed on one or more columns of a table. There are several different types of constraints, including:&lt;/p>
&lt;p>&lt;em>Primary key constraints&lt;/em>
Identify the column or columns that guarantee uniqueness within a table&lt;/p>
&lt;p>&lt;em>Foreign key constraints&lt;/em>
Restrict one or more columns to contain only values found in another table’s primary key columns (may also restrict the allowable values in other tables if update cascade or delete cascade rules are established)&lt;/p>
&lt;p>&lt;em>Unique constraints&lt;/em>
Restrict one or more columns to contain unique values within a table (primary key constraints are a special type of unique constraint)&lt;/p>
&lt;p>&lt;em>Check constraints&lt;/em>
Restrict the allowable values for a column&lt;/p>
&lt;p>If the server allows you to change a customer’s ID in the customer table without changing the same customer ID in the rental table, then you will end up with rental data that no longer points to valid customer records (known as &lt;em>orphaned rows&lt;/em>). With primary and foreign key constraints in place, however, the server will either raise an error if an attempt is made to modify or delete data that is referenced by other tables or propagate the changes to other tables for you.&lt;/p>
&lt;p>Note: If you want to use foreign key constraints with the MySQL server, you must use the &lt;em>InnoDB&lt;/em> storage engine for your tables.&lt;/p>
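&lt;p>This protection against orphaned rows can be demonstrated with Python&amp;rsquo;s sqlite3 module. SQLite enforces foreign keys only after &lt;code>PRAGMA foreign_keys = ON&lt;/code>, and the tables here are pared-down stand-ins for the customer and rental tables:&lt;/p>

```python
import sqlite3

# SQLite demonstration of foreign key enforcement (analogous to InnoDB's
# behavior). isolation_level=None (autocommit) keeps the PRAGMA effective.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, "
             "customer_id INTEGER REFERENCES customer (customer_id))")
conn.execute("INSERT INTO customer (customer_id) VALUES (1)")
conn.execute("INSERT INTO rental (customer_id) VALUES (1)")

# Deleting a customer that still has rentals would create orphaned rows,
# so the server raises an error instead of performing the delete.
try:
    conn.execute("DELETE FROM customer WHERE customer_id = 1")
    deleted = True
except sqlite3.IntegrityError:
    deleted = False
remaining = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```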
&lt;h3 id="constraint-creation">Constraint Creation&lt;/h3>
&lt;pre>
CREATE TABLE customer (
...
&lt;b>PRIMARY KEY (customer_id), &lt;/b>
KEY idx_fk_store_id (store_id),
KEY idx_fk_address_id (address_id),
KEY idx_last_name (last_name),
&lt;b>CONSTRAINT fk_customer_address FOREIGN KEY (address_id) REFERENCES address (address_id) ON DELETE RESTRICT ON UPDATE CASCADE,
CONSTRAINT fk_customer_store FOREIGN KEY (store_id) REFERENCES store (store_id) ON DELETE RESTRICT ON UPDATE CASCADE&lt;/b>
)ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*For existing tables, you can do:*/
ALTER TABLE customer
&lt;b>ADD CONSTRAINT&lt;/b> fk_customer_address FOREIGN KEY (address_id)
REFERENCES address (address_id) ON DELETE RESTRICT ON UPDATE CASCADE;
ALTER TABLE customer
&lt;b>ADD CONSTRAINT&lt;/b> fk_customer_store FOREIGN KEY (store_id)
REFERENCES store (store_id) ON DELETE RESTRICT ON UPDATE CASCADE;
/*if you want to drop them*/
ALTER TABLE customer
&lt;b>DROP CONSTRAINT&lt;/b> fk_customer_address;
ALTER TABLE customer
&lt;b>DROP CONSTRAINT&lt;/b> fk_customer_store;
&lt;/pre>
&lt;ul>
&lt;li>on delete restrict, which will cause the server to raise an error if a row is deleted in the parent table (address or store) that is referenced in the child table (customer)&lt;/li>
&lt;li>on update cascade, which will cause the server to propagate a change to the primary key value of a parent table (address or store) to the child table (customer)&lt;/li>
&lt;/ul>
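&lt;p>Both behaviors can be observed in a small sqlite3 session. SQLite accepts the same &lt;code>ON DELETE RESTRICT&lt;/code>/&lt;code>ON UPDATE CASCADE&lt;/code> clauses; the address and customer tables are simplified stand-ins:&lt;/p>

```python
import sqlite3

# SQLite sketch of the two foreign key actions described above.
# isolation_level=None (autocommit) keeps the PRAGMA effective.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE address (address_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, "
             "address_id INTEGER REFERENCES address (address_id) "
             "ON DELETE RESTRICT ON UPDATE CASCADE)")
conn.execute("INSERT INTO address (address_id) VALUES (10)")
conn.execute("INSERT INTO customer (customer_id, address_id) VALUES (1, 10)")

# ON UPDATE CASCADE: changing the parent key propagates to the child row.
conn.execute("UPDATE address SET address_id = 20 WHERE address_id = 10")
child = conn.execute("SELECT address_id FROM customer").fetchone()[0]

# ON DELETE RESTRICT: deleting a still-referenced parent raises an error.
try:
    conn.execute("DELETE FROM address WHERE address_id = 20")
    restricted = False
except sqlite3.IntegrityError:
    restricted = True
```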
&lt;table>&lt;thead>
&lt;tr>
&lt;th>Parameter&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>&lt;tbody>
&lt;tr>
&lt;td>&lt;code>ON DELETE NO ACTION&lt;/code>&lt;/td>
&lt;td>&lt;em>Default action.&lt;/em> If there are any existing references to the key being deleted, the transaction will fail at the end of the statement. The key can be updated, depending on the &lt;code>ON UPDATE&lt;/code> action. &lt;br>&lt;br>Alias: &lt;code>ON DELETE RESTRICT&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ON UPDATE NO ACTION&lt;/code>&lt;/td>
&lt;td>&lt;em>Default action.&lt;/em> If there are any existing references to the key being updated, the transaction will fail at the end of the statement. The key can be deleted, depending on the &lt;code>ON DELETE&lt;/code> action. &lt;br>&lt;br>Alias: &lt;code>ON UPDATE RESTRICT&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ON DELETE RESTRICT&lt;/code> / &lt;code>ON UPDATE RESTRICT&lt;/code>&lt;/td>
&lt;td>&lt;code>RESTRICT&lt;/code> and &lt;code>NO ACTION&lt;/code> are currently equivalent until options for deferring constraint checking are added. To set an existing foreign key action to &lt;code>RESTRICT&lt;/code>, the foreign key constraint must be dropped and recreated.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ON DELETE CASCADE&lt;/code> / &lt;code>ON UPDATE CASCADE&lt;/code>&lt;/td>
&lt;td>When a referenced foreign key is deleted or updated, all rows referencing that key are deleted or updated, respectively. If there are other alterations to the row, such as a &lt;code>SET NULL&lt;/code> or &lt;code>SET DEFAULT&lt;/code>, the delete will take precedence. &lt;br>&lt;br>Note that &lt;code>CASCADE&lt;/code> does not list objects it drops or updates, so it should be used cautiously.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ON DELETE SET NULL&lt;/code> / &lt;code>ON UPDATE SET NULL&lt;/code>&lt;/td>
&lt;td>When a referenced foreign key is deleted or updated, respectively, the columns of all rows referencing that key will be set to &lt;code>NULL&lt;/code>. The column must allow &lt;code>NULL&lt;/code> or this update will fail.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ON DELETE SET DEFAULT&lt;/code> / &lt;code>ON UPDATE SET DEFAULT&lt;/code>&lt;/td>
&lt;td>When a referenced foreign key is deleted or updated, the columns of all rows referencing that key are set to the default value for that column. &lt;br/>&lt;br/> If the default value for the column is null, or if no default value is provided and the column does not have a &lt;a href='https://siqi-zheng.rbind.io/docs/v21.1/not-null'>&lt;code>NOT NULL&lt;/code>&lt;/a> constraint, this will have the same effect as &lt;code>ON DELETE SET NULL&lt;/code> or &lt;code>ON UPDATE SET NULL&lt;/code>. The default value must still conform with all other constraints, such as &lt;code>UNIQUE&lt;/code>.&lt;/td>
&lt;/tr>
&lt;/tbody>&lt;/table></description></item><item><title>Learning SQL Notes #10: Transactions</title><link>https://siqi-zheng.rbind.io/post/2021-06-08-sql-notes-10/</link><pubDate>Tue, 08 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-08-sql-notes-10/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#multiuser-databases">Multiuser Databases&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#locking">Locking&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#lock-granularities">Lock Granularities&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#what-is-a-transaction">What Is a Transaction?&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#starting-a-transaction">Starting a Transaction&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#ending-a-transaction">Ending a Transaction&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#transaction-savepoints">Transaction Savepoints&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#choosing-a-storage-engine">Choosing a Storage Engine&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Transactions: Mechanism used to group a set of SQL statements together such that either all or none of the statements succeed.&lt;/p>
&lt;h1 id="multiuser-databases">Multiuser Databases&lt;/h1>
&lt;h2 id="locking">Locking&lt;/h2>
&lt;p>Locks are the mechanism the database server uses to &lt;strong>control simultaneous use&lt;/strong> of data resources. When some portion of the database is locked, any other users wishing to modify (or possibly read) that data must wait until the lock has been released. Most database servers use one of two locking strategies:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Database writers must request and receive from the server a write lock to modify data, and database readers must request and receive from the server a read lock to query data. While multiple users can read data simultaneously, only one write lock is given out at a time for each table (or portion thereof), and read requests are blocked until the write lock is released. $\Rightarrow$ long wait times if there are many concurrent read and write requests. (Microsoft SQL Server/MySQL)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Database writers must request and receive from the server a write lock to modify data, but readers do not need any type of lock to query data. Instead, the server ensures that a reader sees a consistent view of the data (the data seems the same even though other users may be making modifications) from the time her query begins until her query has finished. This approach is known as &lt;em>versioning&lt;/em>. $\Rightarrow$ problematic if there are long-running queries while data is being modified. (Oracle Database/MySQL)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="lock-granularities">Lock Granularities&lt;/h2>
&lt;p>&lt;em>Table locks&lt;/em> $\Rightarrow$ less bookkeeping, longer waiting time
Keep multiple users from modifying data in the same table simultaneously&lt;/p>
&lt;p>&lt;em>Page locks&lt;/em>
Keep multiple users from modifying data on the same page (a page is a segment of memory generally in the range of 2 KB to 16 KB) of a table simultaneously&lt;/p>
&lt;p>&lt;em>Row locks&lt;/em> $\Rightarrow$ More bookkeeping, shorter waiting time
Keep multiple users from modifying the same row in a table simultaneously&lt;/p>
&lt;p>SQL Server will, under certain circumstances, &lt;em>escalate&lt;/em> locks from row to page, and from page to table, whereas Oracle Database will never escalate locks.&lt;/p>
&lt;h1 id="what-is-a-transaction">What Is a Transaction?&lt;/h1>
&lt;p>Problems occur when one of the ideal situations fails:&lt;/p>
&lt;ul>
&lt;li>Database servers do not enjoy 100% uptime&lt;/li>
&lt;li>Users do not always allow programs to finish executing&lt;/li>
&lt;li>Applications do not always complete without encountering fatal errors that halt execution&lt;/li>
&lt;/ul>
&lt;p>&lt;em>Transaction&lt;/em> is a device for grouping together multiple SQL statements such that either all or none of the statements succeed (a property known as atomicity).&lt;/p>
&lt;p>Ex:&lt;/p>
&lt;p>If you attempt to transfer $500 from your savings account to your checking account, you would be a bit upset if the money were successfully withdrawn from your savings account but never made it to your checking account. Whatever the reason for the failure (the server was shut down for maintenance, the request for a page lock on the account table timed out, etc.), you want your $500 back. To protect against this kind of error, the program that handles your transfer request would first &lt;strong>begin a transaction&lt;/strong>, then issue the SQL statements needed to move the money from your savings to your checking account, and, &lt;strong>if everything succeeds&lt;/strong>, end the transaction by issuing the &lt;strong>commit&lt;/strong> command. If something &lt;strong>unexpected&lt;/strong> &lt;strong>happens&lt;/strong>, however, the program would issue a &lt;strong>rollback&lt;/strong> command, which instructs the server to undo all changes made since the transaction began.&lt;/p>
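&lt;p>The transfer scenario can be sketched with Python&amp;rsquo;s sqlite3 module. Passing &lt;code>isolation_level=None&lt;/code> disables the driver&amp;rsquo;s implicit transaction handling so that &lt;code>BEGIN&lt;/code>, &lt;code>COMMIT&lt;/code>, and &lt;code>ROLLBACK&lt;/code> are issued explicitly; account numbers and amounts are made up:&lt;/p>

```python
import sqlite3

# SQLite sketch of the transfer example: the two updates succeed together
# or not at all. isolation_level=None means we manage transactions ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 1000), (2, 0)])

# Attempt 1: something goes wrong after the withdrawal, so we roll back.
conn.execute("BEGIN")
conn.execute("UPDATE account SET balance = balance - 500 WHERE account_id = 1")
conn.execute("ROLLBACK")  # the withdrawal is undone; no money is lost
after_rollback = dict(conn.execute("SELECT account_id, balance FROM account"))

# Attempt 2: both statements succeed, so we commit.
conn.execute("BEGIN")
conn.execute("UPDATE account SET balance = balance - 500 WHERE account_id = 1")
conn.execute("UPDATE account SET balance = balance + 500 WHERE account_id = 2")
conn.execute("COMMIT")
after_commit = dict(conn.execute("SELECT account_id, balance FROM account"))
```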
&lt;h2 id="starting-a-transaction">Starting a Transaction&lt;/h2>
&lt;p>Database servers handle transaction creation in one of two ways:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>An active transaction is always associated with a database session, so there is no need or method to explicitly begin a transaction. When the current transaction ends, the server automatically begins a new transaction for your session. &lt;em>You can undo some changes.&lt;/em> (Oracle Database)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Unless you explicitly begin a transaction, individual SQL statements are automatically committed independently of one another. To begin a transaction, you must first issue a command. (Microsoft SQL Server/MySQL)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>The SQL:2003 standard includes a &lt;code>start transaction&lt;/code> command to be used when you want to explicitly begin a transaction. While MySQL conforms to the standard, SQL Server users must instead issue the command &lt;code>begin transaction&lt;/code>. With both servers, until you explicitly begin a transaction, you are in what is known as &lt;em>autocommit mode&lt;/em>, which means that individual statements are automatically committed by the server.&lt;/p>
&lt;p>A word of advice: shut off autocommit mode each time you log in, and get in the habit of running all of your SQL statements within a transaction.&lt;/p>
&lt;p>Both MySQL and SQL Server allow you to turn off autocommit mode for individual sessions, in which case the servers will act just like Oracle Database regarding transactions. With SQL Server, you issue the following command to disable autocommit mode:&lt;/p>
&lt;p>&lt;code>SET IMPLICIT_TRANSACTIONS ON&lt;/code>&lt;/p>
&lt;p>MySQL allows you to disable autocommit mode via the following:&lt;/p>
&lt;p>&lt;code>SET AUTOCOMMIT=0&lt;/code>&lt;/p>
&lt;p>Once you have left autocommit mode, all SQL commands take place within the scope of a transaction and must be explicitly committed or rolled back.&lt;/p>
&lt;h2 id="ending-a-transaction">Ending a Transaction&lt;/h2>
&lt;p>End with &lt;code>commit&lt;/code> if yes and &lt;code>rollback&lt;/code> if no.&lt;/p>
&lt;p>Some scenarios in practice:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The server shuts down, in which case your transaction will be rolled back automatically when the server is restarted. ✔&lt;/p>
&lt;/li>
&lt;li>
&lt;p>You issue an SQL schema statement, such as alter table, which will cause the current transaction to be committed and a new transaction to be started.&lt;/p>
&lt;ul>
&lt;li>be careful that the statements that comprise a unit of work are not inadvertently broken up into multiple transactions by the server!&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>You issue another start transaction command, which will cause the previous transaction to be committed. ✔&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The server prematurely ends your transaction because the server detects a deadlock and decides that your transaction is the culprit. In this case, the transaction will be rolled back, and you will receive an error message.&lt;/p>
&lt;ul>
&lt;li>Most of the time, the terminated transaction can be restarted and will succeed without encountering another deadlock situation.&lt;br>
&lt;code>Message: Deadlock found when trying to get lock; try restarting transaction&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="transaction-savepoints">Transaction Savepoints&lt;/h2>
&lt;p>You may not want to undo &lt;em>all&lt;/em> of the work that has transpired. For these situations, you can establish one or more &lt;em>savepoints&lt;/em>&lt;/p>
&lt;pre>&lt;code class="language-sql">SAVEPOINT my_savepoint;
&lt;/code>&lt;/pre>
&lt;p>within a transaction and use them to roll back to a particular location within your transaction&lt;/p>
&lt;pre>&lt;code class="language-sql">ROLLBACK TO SAVEPOINT my_savepoint;
&lt;/code>&lt;/pre>
&lt;p>rather than rolling all the way back to the start of the transaction.&lt;/p>
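&lt;p>A runnable sketch, again using Python&amp;rsquo;s sqlite3 module (SQLite supports the same &lt;code>SAVEPOINT&lt;/code> and &lt;code>ROLLBACK TO SAVEPOINT&lt;/code> commands; the product and account tables are simplified stand-ins):&lt;/p>

```python
import sqlite3

# SQLite sketch of a savepoint: part of a transaction is rolled back
# while the rest is committed. isolation_level=None = explicit transactions.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE product (product_cd TEXT, date_retired TEXT)")
conn.execute("CREATE TABLE account (product_cd TEXT, status TEXT)")
conn.execute("INSERT INTO product VALUES ('XYZ', NULL)")
conn.execute("INSERT INTO account VALUES ('XYZ', 'ACTIVE')")

conn.execute("BEGIN")
conn.execute("UPDATE product SET date_retired = '2021-06-08' "
             "WHERE product_cd = 'XYZ'")
conn.execute("SAVEPOINT before_close_accounts")
conn.execute("UPDATE account SET status = 'CLOSED' WHERE product_cd = 'XYZ'")
conn.execute("ROLLBACK TO SAVEPOINT before_close_accounts")  # undo only this update
conn.execute("COMMIT")

retired = conn.execute("SELECT date_retired FROM product").fetchone()[0]
status = conn.execute("SELECT status FROM account").fetchone()[0]
# net effect: the product is retired, but the account stays open
```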
&lt;h3 id="choosing-a-storage-engine">Choosing a Storage Engine&lt;/h3>
&lt;p>When using Oracle Database or Microsoft SQL Server, a single set of code is responsible for low-level database operations, such as retrieving a particular row from a table based on primary key value. The MySQL server, however, has been designed so that multiple storage engines may be utilized to provide low-level database functionality, including resource locking and transaction management. As of version 8.0, MySQL includes the following storage engines:&lt;/p>
&lt;p>&lt;em>MyISAM&lt;/em>
A nontransactional engine employing table locking&lt;/p>
&lt;p>&lt;em>MEMORY&lt;/em>
A nontransactional engine used for in-memory tables&lt;/p>
&lt;p>&lt;em>CSV&lt;/em>
A nontransactional engine that stores data in comma-separated files&lt;/p>
&lt;p>&lt;em>InnoDB&lt;/em>
A transactional engine employing row-level locking&lt;/p>
&lt;p>&lt;em>Merge&lt;/em>
A specialty engine used to make multiple identical &lt;em>MyISAM&lt;/em> tables appear as a single table (a.k.a. table partitioning)&lt;/p>
&lt;p>&lt;em>Archive&lt;/em>
A specialty engine used to store large amounts of unindexed data, mainly for archival purposes&lt;/p>
&lt;p>MySQL is flexible enough to allow you to choose a storage engine on a table-by-table basis.&lt;/p>
&lt;p>You may explicitly specify a storage engine when creating a table, or you can change an existing table to use a different engine.&lt;/p>
&lt;pre>&lt;code class="language-sql">show table status like 'customer' \G;
/*Second row: Engine: InnoDB*/
ALTER TABLE customer ENGINE = INNODB;
&lt;/code>&lt;/pre>
&lt;p>One example is shown below:&lt;/p>
&lt;pre>&lt;code class="language-sql">START TRANSACTION;
UPDATE product
SET date_retired = CURRENT_TIMESTAMP()
WHERE product_cd = 'XYZ';
SAVEPOINT before_close_accounts;
UPDATE account
SET status = 'CLOSED', close_date = CURRENT_TIMESTAMP(), last_activity_date = CURRENT_TIMESTAMP()
WHERE product_cd = 'XYZ';
ROLLBACK TO SAVEPOINT before_close_accounts;
COMMIT;
/*The net effect of this transaction is that the mythical XYZ product is retired but none of the accounts are closed.*/
&lt;/code>&lt;/pre>
&lt;p>When using savepoints, remember the following:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Despite the name, nothing is saved when you create a savepoint. You must eventually issue a commit if you want your transaction to be made permanent.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If you issue a rollback without naming a savepoint, all savepoints within the transaction will be ignored, and the entire transaction will be undone.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>If you are using &lt;em>SQL Server&lt;/em>, you will need to use the proprietary command &lt;code>save transaction&lt;/code> to create a savepoint and &lt;code>rollback transaction&lt;/code> to roll back to a savepoint, with each command being followed by the savepoint name.&lt;/p></description></item><item><title>Learning SQL Notes #9: Conditional Logic</title><link>https://siqi-zheng.rbind.io/post/2021-06-07-sql-notes-9/</link><pubDate>Mon, 07 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-07-sql-notes-9/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#what-is-conditional-logic">What Is Conditional Logic?&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#the-case-expression">The case Expression&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#searched-case-expressions">Searched case Expressions&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#simple-case-expressions-a-less-flexible-ver-of-the-previous-expression">Simple case Expressions (A less flexible ver. of the previous expression)&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#examples-of-case-expressions">Examples of case Expressions&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#result-set-transformations">Result Set Transformations&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#checking-for-existence">Checking for Existence&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#avoid-division-by-zero-errors">(Avoid) Division-by-Zero Errors&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#conditional-updates">Conditional Updates&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#handling-null-values">Handling Null Values&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h1 id="what-is-conditional-logic">What Is Conditional Logic?&lt;/h1>
&lt;p>Conditional logic is simply the ability to take one of several paths during program execution.&lt;/p>
&lt;p>Analogous to if-else in Python and R.&lt;/p>
&lt;PRE>
SELECT first_name, last_name,
&lt;B>CASE&lt;/B>
WHEN active = 1 THEN 'ACTIVE'
ELSE 'INACTIVE'
&lt;B>END&lt;/B> activity_type
FROM customer;
&lt;/PRE>
&lt;h2 id="the-case-expression">The case Expression&lt;/h2>
&lt;ul>
&lt;li>The case expression is part of the SQL standard (SQL92 release) and has been implemented by Oracle Database, SQL Server, MySQL, PostgreSQL, IBM UDB, and others.&lt;/li>
&lt;li>case expressions are built into the SQL grammar and can be included in select, insert, update, and delete statements.&lt;/li>
&lt;/ul>
&lt;h3 id="searched-case-expressions">Searched case Expressions&lt;/h3>
&lt;pre>&lt;code class="language-sql">CASE
WHEN category.name IN ('Children','Family','Sports','Animation')
THEN 'All Ages'
WHEN category.name = 'Horror'
THEN 'Adult'
WHEN category.name IN ('Music','Games')
THEN 'Teens'
ELSE 'Other'
END
&lt;/code>&lt;/pre>
&lt;PRE>
SELECT c.first_name, c.last_name,
CASE
WHEN active = 0 THEN 0
&lt;B>ELSE
(SELECT count(*) FROM rental r
WHERE r.customer_id = c.customer_id)&lt;/B>
END num_rentals /*Create new variables*/
FROM customer c;
&lt;/PRE>
&lt;h3 id="simple-case-expressions-a-less-flexible-ver-of-the-previous-expression">Simple case Expressions (A less flexible ver. of the previous expression)&lt;/h3>
&lt;PRE>
CASE &lt;B>V0&lt;/B>
WHEN V1 THEN E1
WHEN V2 THEN E2 ...
WHEN VN THEN EN
[ELSE ED]
END
&lt;/PRE>
&lt;p>V0 represents a value, and the symbols V1, V2, &amp;hellip;, VN represent values that are to be compared to V0.&lt;/p>
&lt;h2 id="examples-of-case-expressions">Examples of case Expressions&lt;/h2>
&lt;h3 id="result-set-transformations">Result Set Transformations&lt;/h3>
&lt;pre>&lt;code class="language-sql">SELECT monthname(rental_date) rental_month,
count(*) num_rentals
FROM rental
WHERE rental_date BETWEEN '2005-05-01' AND '2005-08-01'
GROUP BY monthname(rental_date);
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">rental_month&lt;/th>
&lt;th align="right">num_rentals&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">May&lt;/td>
&lt;td align="right">1156&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">June&lt;/td>
&lt;td align="right">2311&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">July&lt;/td>
&lt;td align="right">6709&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-sql">SELECT
SUM(CASE WHEN monthname(rental_date) = 'May' THEN 1
ELSE 0 END) May_rentals,
SUM(CASE WHEN monthname(rental_date) = 'June' THEN 1
ELSE 0 END) June_rentals,
SUM(CASE WHEN monthname(rental_date) = 'July' THEN 1
ELSE 0 END) July_rentals
FROM rental
WHERE rental_date BETWEEN '2005-05-01' AND '2005-08-01';
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">May_rentals&lt;/th>
&lt;th>June_rentals&lt;/th>
&lt;th align="right">July_rentals&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">1156&lt;/td>
&lt;td>2311&lt;/td>
&lt;td align="right">6709&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>When the monthname() function returns the desired value for that column, the case expression returns the value 1; otherwise, it returns 0. When summed over all rows, each column returns the number of rentals for that month. Obviously, such transformations are practical for only a small number of values.&lt;/p>
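&lt;p>A runnable version of the pivot, using Python&amp;rsquo;s sqlite3 module. SQLite has no &lt;code>monthname()&lt;/code>, so &lt;code>strftime('%m', ...)&lt;/code> stands in, and the rental dates are made up:&lt;/p>

```python
import sqlite3

# SQLite version of the pivot: one CASE expression per month turns rows
# into columns. strftime('%m', ...) substitutes for MySQL's monthname().
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rental (rental_date TEXT)")
conn.executemany("INSERT INTO rental VALUES (?)",
                 [("2005-05-14",), ("2005-06-02",),
                  ("2005-06-20",), ("2005-07-09",)])

row = conn.execute("""
    SELECT
      SUM(CASE WHEN strftime('%m', rental_date) = '05' THEN 1 ELSE 0 END) May_rentals,
      SUM(CASE WHEN strftime('%m', rental_date) = '06' THEN 1 ELSE 0 END) June_rentals,
      SUM(CASE WHEN strftime('%m', rental_date) = '07' THEN 1 ELSE 0 END) July_rentals
    FROM rental
    WHERE rental_date BETWEEN '2005-05-01' AND '2005-08-01'
""").fetchone()
# one row with one count per month
```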
&lt;h3 id="checking-for-existence">Checking for Existence&lt;/h3>
&lt;p>Sometimes you will want to determine whether a relationship exists between two entities &lt;strong>without regard for the quantity&lt;/strong>.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT a.first_name, a.last_name,
CASE
WHEN EXISTS (SELECT 1 FROM film_actor fa
INNER JOIN film f ON fa.film_id = f.film_id
WHERE fa.actor_id = a.actor_id
AND f.rating = 'G') THEN 'Y'
ELSE 'N'
END g_actor
FROM actor a
WHERE a.last_name LIKE 'S%' OR a.first_name LIKE 'S%';
&lt;/code>&lt;/pre>
&lt;h3 id="avoid-division-by-zero-errors">(Avoid) Division-by-Zero Errors&lt;/h3>
&lt;pre>&lt;code class="language-sql">...
sum(p.amount) /
CASE WHEN count(p.amount) = 0 THEN 1
ELSE count(p.amount)
END avg_payment
...
&lt;/code>&lt;/pre>
&lt;h3 id="conditional-updates">Conditional Updates&lt;/h3>
&lt;pre>&lt;code class="language-sql">UPDATE customer
SET active =
CASE
WHEN 90 &amp;lt;= (SELECT datediff(now(), max(rental_date))
FROM rental r
WHERE r.customer_id = customer.customer_id)
THEN 0
ELSE 1
END
WHERE active = 1;
/*if the number returned by the subquery is 90 or higher, the customer is marked as inactive.*/
&lt;/code>&lt;/pre>
&lt;h3 id="handling-null-values">Handling Null Values&lt;/h3>
&lt;pre>&lt;code class="language-sql">...
CASE
WHEN a.address IS NULL THEN 'Unknown'
ELSE a.address
END address,
...
&lt;/code>&lt;/pre>
&lt;p>Note: For calculations, null values often cause a null result. When performing calculations, case expressions are useful for translating a null value into a number (usually 0 or 1) that will allow the calculation to yield a non-null value.&lt;/p></description></item><item><title>Learning SQL Notes #8: Subqueries</title><link>https://siqi-zheng.rbind.io/post/2021-06-06-sql-notes-8/</link><pubDate>Sun, 06 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-06-sql-notes-8/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#what-is-a-subquery">What Is a Subquery?&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#subquery-types">Subquery Types&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#noncorrelated-subqueries">Noncorrelated Subqueries&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#multiple-row-single-column-subqueries">Multiple-Row, Single-Column Subqueries&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#the-in-and-not-in-operators">The in and not in operators&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#the-all-operator">The all operator&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#the-any-operator-or">The any operator (OR)&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#multicolumn-subqueries">Multicolumn Subqueries&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#correlated-subqueries">Correlated Subqueries&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#the-exists-operator">The exists Operator&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#data-manipulation-using-correlated-subqueries">Data Manipulation Using Correlated Subqueries&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#when-to-use-subqueries">When to Use Subqueries&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#subqueries-as-data-sources">Subqueries as Data Sources&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#data-fabrication">Data fabrication&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#task-oriented-subqueries">Task-oriented subqueries&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#common-table-expressions">Common table expressions&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#subqueries-as-expression-generators">Subqueries as Expression Generators&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#subquery-wrap-up">Subquery Wrap-Up&lt;/a>&lt;/li>
&lt;/ul>
&lt;h1 id="what-is-a-subquery">What Is a Subquery?&lt;/h1>
&lt;p>A &lt;em>subquery&lt;/em> is a query contained within another SQL statement (which I refer to as the containing statement for the rest of this discussion). A subquery is always enclosed within parentheses, and it is usually executed prior to the containing statement. Like any query, a subquery returns a result set that may consist of:&lt;/p>
&lt;ul>
&lt;li>A single row with a single column&lt;/li>
&lt;li>Multiple rows with a single column&lt;/li>
&lt;li>Multiple rows having multiple columns&lt;/li>
&lt;/ul>
&lt;pre>
SELECT customer_id, first_name, last_name
FROM customer
WHERE customer_id = &lt;b>(SELECT MAX(customer_id) FROM customer);&lt;/b>
&lt;/pre>
&lt;h1 id="subquery-types">Subquery Types&lt;/h1>
&lt;h2 id="noncorrelated-subqueries">Noncorrelated Subqueries&lt;/h2>
&lt;h3 id="multiple-row-single-column-subqueries">Multiple-Row, Single-Column Subqueries&lt;/h3>
&lt;h4 id="the-in-and-not-in-operators">The in and not in operators&lt;/h4>
&lt;pre>
SELECT city_id, city
FROM city
WHERE country_id &lt;> &lt;b>(SELECT country_id FROM country WHERE country = 'India');&lt;/b>
&lt;/pre>
&lt;p>Note: A subquery compared with an equality or inequality operator (&lt;code>=&lt;/code>, &lt;code>&amp;lt;&amp;gt;&lt;/code>) in a &lt;code>WHERE&lt;/code> clause must return no more than one row; otherwise the statement fails.&lt;/p>
&lt;p>If the subquery may return multiple rows, you can instead use subqueries such as:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT country_id
FROM country
WHERE country IN ('Canada','Mexico');
&lt;/code>&lt;/pre>
&lt;p>or&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT country_id
FROM country
WHERE country = 'Canada' OR country = 'Mexico';
&lt;/code>&lt;/pre>
&lt;p>in the following ways:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT city_id, city
FROM city
WHERE country_id IN
(SELECT country_id
FROM country
WHERE country IN ('Canada','Mexico'));
&lt;/code>&lt;/pre>
&lt;p>or the opposite:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT city_id, city
FROM city
WHERE country_id NOT IN
(SELECT country_id
FROM country
WHERE country IN ('Canada','Mexico'));
&lt;/code>&lt;/pre>
&lt;h4 id="the-all-operator">The all operator&lt;/h4>
&lt;p>The all operator allows you to make comparisons between a single value and every value in a set:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT first_name, last_name
FROM customer
WHERE customer_id &amp;lt;&amp;gt; ALL
(SELECT customer_id
FROM payment
WHERE amount = 0);
&lt;/code>&lt;/pre>
&lt;p>or the equivalent:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT first_name, last_name
FROM customer
WHERE customer_id NOT IN
(SELECT customer_id
FROM payment
WHERE amount = 0);
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Any attempt to equate a value to null yields unknown, so when using &lt;code>not in&lt;/code> or &lt;code>&amp;lt;&amp;gt; all&lt;/code> to compare a value to a set of values, you must be careful to ensure that the set of values does not contain a null value.&lt;/strong>&lt;/p>
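&lt;p>A minimal sketch of the pitfall (the literal ids here are illustrative): a single null in the set makes &lt;code>not in&lt;/code> return no rows at all.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT first_name, last_name
FROM customer
WHERE customer_id NOT IN (122, 452, NULL);
/*returns an empty result set, because no customer_id
can be proven unequal to null*/
&lt;/code>&lt;/pre>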
&lt;p>The subquery in this example returns the total number of film rentals for each customer in North America, and the containing query returns all customers whose total number of film rentals exceeds that of every North American customer.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT customer_id, count(*)
FROM rental
GROUP BY customer_id
HAVING count(*) &amp;gt; ALL
(SELECT count(*)
FROM rental r
INNER JOIN customer c
ON r.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
INNER JOIN country co
ON ct.country_id = co.country_id
WHERE co.country IN ('United States','Mexico','Canada')
GROUP BY r.customer_id
);
&lt;/code>&lt;/pre>
&lt;h4 id="the-any-operator-or">The any operator (OR)&lt;/h4>
&lt;p>A condition using the any operator evaluates to true as soon as a single comparison is favorable.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT customer_id, sum(amount)
FROM payment
GROUP BY customer_id
HAVING sum(amount) &amp;gt; ANY
(SELECT sum(amount)
FROM payment p
INNER JOIN customer c
ON p.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
INNER JOIN country co
ON ct.country_id = co.country_id
WHERE co.country IN ('Bolivia','Paraguay','Chile')
GROUP BY co.country
);
&lt;/code>&lt;/pre>
&lt;h3 id="multicolumn-subqueries">Multicolumn Subqueries&lt;/h3>
&lt;pre>&lt;code class="language-sql">SELECT actor_id, film_id
FROM film_actor
WHERE (actor_id, film_id) IN
(SELECT a.actor_id, f.film_id
FROM actor a
CROSS JOIN film f
WHERE a.last_name = 'MONROE'
AND f.rating = 'PG');
&lt;/code>&lt;/pre>
&lt;h2 id="correlated-subqueries">Correlated Subqueries&lt;/h2>
&lt;p>A &lt;em>correlated&lt;/em> &lt;em>subquery&lt;/em>, on the other hand, is &lt;em>dependent&lt;/em> on its containing statement, from which it references one or more columns.&lt;/p>
&lt;pre>
SELECT c.first_name, c.last_name
FROM customer c
WHERE 20 =
(SELECT count(*)
FROM rental r
WHERE r.customer_id = &lt;b>c.customer_id&lt;/b>);
/*customers who have rented exactly 20 films*/
&lt;/pre>
&lt;h3 id="the-exists-operator">The exists Operator&lt;/h3>
&lt;p>You use the exists operator when you want to identify that a relationship exists without regard for the quantity.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name
FROM customer c
WHERE EXISTS /*or NOT EXISTS*/
(SELECT r.rental_date, r.customer_id, 'ABCD' str, 2 * 3 / 7 nmbr /*can be replaced by anything*/
FROM rental r
WHERE r.customer_id = c.customer_id
AND date(r.rental_date) &amp;lt; '2005-05-25');
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Since the condition in the containing query only needs to know whether any rows were returned, the actual data the subquery returned is irrelevant.&lt;/strong>&lt;/p>
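&lt;p>Because the returned data does not matter, a common convention is to select a constant in the subquery; here is the query above rewritten in that style:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name
FROM customer c
WHERE EXISTS
(SELECT 1
FROM rental r
WHERE r.customer_id = c.customer_id
AND date(r.rental_date) &amp;lt; '2005-05-25');
&lt;/code>&lt;/pre>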
&lt;h3 id="data-manipulation-using-correlated-subqueries">Data Manipulation Using Correlated Subqueries&lt;/h3>
&lt;pre>&lt;code class="language-sql">UPDATE customer c
SET c.last_update =
(SELECT max(r.rental_date)
FROM rental r
WHERE r.customer_id = c.customer_id);
UPDATE customer c
SET c.last_update =
(SELECT max(r.rental_date) FROM rental r
WHERE r.customer_id = c.customer_id)
WHERE EXISTS
(SELECT 1 FROM rental r
WHERE r.customer_id = c.customer_id);
/*executes only if the condition in the update statement’s where clause evaluates to true (meaning that at least one rental was found for the customer), thus protecting the data in the last_update column from being overwritten with a null.*/
DELETE FROM customer WHERE 365 &amp;lt; ALL
(SELECT datediff(now(), r.rental_date) days_since_last_rental FROM rental r
WHERE r.customer_id = customer.customer_id);
/*removes rows from the customer table where there have been no film rentals in the past year*/
&lt;/code>&lt;/pre>
&lt;h1 id="when-to-use-subqueries">When to Use Subqueries&lt;/h1>
&lt;h2 id="subqueries-as-data-sources">Subqueries as Data Sources&lt;/h2>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name, pymnt.num_rentals, pymnt.tot_payments
FROM customer c
INNER JOIN
(SELECT customer_id, count(*) num_rentals, sum(amount) tot_payments
FROM payment
GROUP BY customer_id ) pymnt /*executed first*/
ON c.customer_id = pymnt.customer_id;
&lt;/code>&lt;/pre>
&lt;h3 id="data-fabrication">Data fabrication&lt;/h3>
&lt;p>First, we fabricate a table of customer payment groups (small/average/heavy) with lower and upper payment limits.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT 'Small Fry' name, 0 low_limit, 74.99 high_limit UNION ALL
SELECT 'Average Joes' name, 75 low_limit, 149.99 high_limit
UNION ALL
SELECT 'Heavy Hitters' name, 150 low_limit, 9999999.99 high_limit;
&lt;/code>&lt;/pre>
&lt;p>Then we join the grouped payment totals against this fabricated table to produce the desired summary.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT pymnt_grps.name, count(*) num_customers
FROM
(SELECT customer_id, count(*) num_rentals, sum(amount) tot_payments
FROM payment
GROUP BY customer_id) pymnt
INNER JOIN (SELECT 'Small Fry' name, 0 low_limit, 74.99 high_limit
UNION ALL
SELECT 'Average Joes' name, 75 low_limit, 149.99 high_limit
UNION ALL
SELECT 'Heavy Hitters' name, 150 low_limit, 9999999.99 high_limit ) pymnt_grps
ON pymnt.tot_payments
BETWEEN pymnt_grps.low_limit AND pymnt_grps.high_limit
GROUP BY pymnt_grps.name;
&lt;/code>&lt;/pre>
&lt;h3 id="task-oriented-subqueries">Task-oriented subqueries&lt;/h3>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name, ct.city,
sum(p.amount) tot_payments, count(*) tot_rentals
FROM payment p
INNER JOIN customer c
ON p.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
GROUP BY c.first_name, c.last_name, ct.city;
&lt;/code>&lt;/pre>
&lt;p>We need the names/cities/addresses for display purposes only, so we can use a subquery to group the payment data first and then join the other tables. A more efficient query for the same task:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name, ct.city, pymnt.tot_payments, pymnt.tot_rentals
FROM (SELECT customer_id, count(*) tot_rentals, sum(amount) tot_payments
FROM payment
GROUP BY customer_id) pymnt
INNER JOIN customer c
ON pymnt.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id;
&lt;/code>&lt;/pre>
&lt;h3 id="common-table-expressions">Common table expressions&lt;/h3>
&lt;pre>&lt;code class="language-sql">WITH actors_s AS
(SELECT actor_id, first_name, last_name
FROM actor
WHERE last_name LIKE 'S%'
) /*can be used in the subsequent queries*/
...
&lt;/code>&lt;/pre>
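&lt;p>A complete sketch of how the CTE above might be referenced by the statement that follows it (the follow-up query is my own illustration):&lt;/p>
&lt;pre>&lt;code class="language-sql">WITH actors_s AS
(SELECT actor_id, first_name, last_name
FROM actor
WHERE last_name LIKE 'S%'
)
SELECT first_name, last_name
FROM actors_s
ORDER BY last_name;
&lt;/code>&lt;/pre>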
&lt;h2 id="subqueries-as-expression-generators">Subqueries as Expression Generators&lt;/h2>
&lt;p>Correlated scalar subqueries: in this example, the customer table is accessed three times (once in each of the three subqueries) rather than just once.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT (SELECT c.first_name
FROM customer c
WHERE c.customer_id = p.customer_id ) first_name, (SELECT c.last_name
FROM customer c
WHERE c.customer_id = p.customer_id ) last_name, (SELECT ct.city
FROM customer c
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
WHERE c.customer_id = p.customer_id
) city,
sum(p.amount) tot_payments, count(*) tot_rentals
FROM payment p
GROUP BY p.customer_id;
&lt;/code>&lt;/pre>
&lt;p>Similarly,&lt;/p>
&lt;pre>&lt;code class="language-sql">INSERT INTO film_actor (actor_id, film_id, last_update) VALUES (
(SELECT actor_id
FROM actor
WHERE first_name = 'JENNIFER' AND last_name = 'DAVIS'), (SELECT film_id FROM film
WHERE title = 'ACE GOLDFINGER'),
now()
);
&lt;/code>&lt;/pre>
&lt;h1 id="subquery-wrap-up">Subquery Wrap-Up&lt;/h1>
&lt;ul>
&lt;li>Return a single column and row, a single column with multiple rows, and multiple columns and rows&lt;/li>
&lt;li>Are independent of the containing statement (noncorrelated subqueries)&lt;/li>
&lt;li>Reference one or more columns from the containing statement (correlated subqueries)&lt;/li>
&lt;li>Are used in conditions that utilize comparison operators as well as the special-purpose operators in, not in, exists, and not exists&lt;/li>
&lt;li>Can be found in select, update, delete, and insert statements&lt;/li>
&lt;li>Generate result sets that can be joined to other tables (or subqueries) in a query&lt;/li>
&lt;li>Can be used to generate values to populate a table or to populate columns in a query’s result set&lt;/li>
&lt;li>Are used in the select, from, where, having, and order by clauses of queries&lt;/li>
&lt;/ul>
&lt;p>Happy learning!&lt;/p>
&lt;p>&lt;img src="2.jpg" alt="">&lt;/p></description></item><item><title>Learning SQL Notes #7: Grouping and Aggregates (CH. 8)</title><link>https://siqi-zheng.rbind.io/post/2021-06-05-sql-notes-7/</link><pubDate>Sat, 05 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-05-sql-notes-7/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#grouping-concepts">Grouping Concepts&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#aggregate-functions">Aggregate Functions&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#generating-groups">Generating Groups&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#single-columnmulticolumn-grouping">Single-Column/Multicolumn Grouping&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#grouping-via-expressions">Grouping via Expressions&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#generating-rollups">Generating Rollups&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#group-filter-conditions">Group Filter Conditions&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="grouping-concepts">Grouping Concepts&lt;/h2>
&lt;pre>&lt;code class="language-sql">SELECT customer_id, count(*)
FROM rental
GROUP BY customer_id
HAVING count(*) &amp;gt;= 40
ORDER BY 2 DESC;
&lt;/code>&lt;/pre>
&lt;p>WARNING:&lt;/p>
&lt;p>&lt;del>WHERE count(*) &amp;gt;= 40&lt;/del> is invalid: conditions on aggregate functions belong in the &lt;code>HAVING&lt;/code> clause, not in &lt;code>WHERE&lt;/code>.&lt;/p>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">library(tidyverse)
rental %&amp;gt;%
group_by(customer_id) %&amp;gt;%
summarize(counts=n()) %&amp;gt;%
filter(counts&amp;gt;=40) %&amp;gt;%
arrange(desc(counts))
&lt;/code>&lt;/pre>
&lt;h2 id="aggregate-functions">Aggregate Functions&lt;/h2>
&lt;p>Some aggregate functions in SQL/R:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">SQL&lt;/th>
&lt;th align="right">R&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">count()&lt;/td>
&lt;td align="right">count()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">sum()&lt;/td>
&lt;td align="right">sum()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">average()&lt;/td>
&lt;td align="right">mean()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">min()&lt;/td>
&lt;td align="right">min()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">max()&lt;/td>
&lt;td align="right">max()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">group_concat()&lt;/td>
&lt;td align="right">paste()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">first()&lt;/td>
&lt;td align="right">[1]&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">last()&lt;/td>
&lt;td align="right">[-1]&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-sql">SELECT COUNT(DISTINCT col1)
FROM string_tbl;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">length(unique(string_tbl$col1))
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>NULLs are ignored by aggregate functions; the exception is &lt;code>count(*)&lt;/code>, which counts rows rather than column values.&lt;/strong>&lt;/p>
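&lt;p>A quick illustration (assuming some rows in &lt;code>string_tbl&lt;/code> have a null &lt;code>col1&lt;/code>):&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT count(*) num_rows, count(col1) num_vals
FROM string_tbl;
/*num_rows counts every row; num_vals skips rows where col1 is null*/
&lt;/code>&lt;/pre>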
&lt;h2 id="generating-groups">Generating Groups&lt;/h2>
&lt;h3 id="single-columnmulticolumn-grouping">Single-Column/Multicolumn Grouping&lt;/h3>
&lt;p>Grouping can be done on 1 or more columns with aggregate functions.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT actor_id, count(*)
FROM film_actor
GROUP BY actor_id;
SELECT fa.actor_id, f.rating, count(*)
FROM film_actor fa
INNER JOIN film f
ON fa.film_id = f.film_id
GROUP BY fa.actor_id, f.rating
ORDER BY 1,2;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes are analogous to the codes in the last section.&lt;/p>
&lt;h3 id="grouping-via-expressions">Grouping via Expressions&lt;/h3>
&lt;pre>&lt;code class="language-sql">SELECT extract(YEAR FROM rental_date) year,
COUNT(*) how_many
FROM rental
GROUP BY extract(YEAR FROM rental_date);
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">library(tidyverse)
rental %&amp;gt;%
mutate(year=year(rental_date)) %&amp;gt;%
group_by(year) %&amp;gt;%
summarize(counts=n())
&lt;/code>&lt;/pre>
&lt;h3 id="generating-rollups">Generating Rollups&lt;/h3>
&lt;p>Find total counts for each distinct actor.&lt;/p>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
SELECT fa.actor_id, f.rating, count(*)
FROM film_actor fa
INNER JOIN film f
ON fa.film_id = f.film_id
GROUP BY fa.actor_id, f.rating WITH ROLLUP
ORDER BY 1,2;
/*Oracle*/
GROUP BY ROLLUP(fa.actor_id, f.rating)
GROUP BY a, ROLLUP(b, c) /*partial rollup: subtotals computed within each value of a*/
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">actor_id&lt;/th>
&lt;th>rating&lt;/th>
&lt;th align="right">count(*)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">NULL&lt;/td>
&lt;td>NULL&lt;/td>
&lt;td align="right">5462&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">1&lt;/td>
&lt;td>NULL&lt;/td>
&lt;td align="right">19&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">1&lt;/td>
&lt;td>G&lt;/td>
&lt;td align="right">4&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">1&lt;/td>
&lt;td>PG&lt;/td>
&lt;td align="right">6&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">1&lt;/td>
&lt;td>PG-13&lt;/td>
&lt;td align="right">1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">1&lt;/td>
&lt;td>R&lt;/td>
&lt;td align="right">3&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">1&lt;/td>
&lt;td>NC-17&lt;/td>
&lt;td align="right">5&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">2&lt;/td>
&lt;td>NULL&lt;/td>
&lt;td align="right">25&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">2&lt;/td>
&lt;td>G&lt;/td>
&lt;td align="right">7&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">library(reshape2)
library(zoo)
m &amp;lt;- melt(df, measure.vars = &amp;quot;sales&amp;quot;)
dout &amp;lt;- dcast(m, year + month + region ~ variable, fun.aggregate = sum, margins = &amp;quot;month&amp;quot;)
dout$month &amp;lt;- na.locf(replace(dout$month, dout$month == &amp;quot;(all)&amp;quot;, NA))
&lt;/code>&lt;/pre>
&lt;p>See here: &lt;a href="https://stackoverflow.com/questions/36169073/how-to-do-group-by-rollup-in-r-like-sql">https://stackoverflow.com/questions/36169073/how-to-do-group-by-rollup-in-r-like-sql&lt;/a>&lt;/p>
&lt;h2 id="group-filter-conditions">Group Filter Conditions&lt;/h2>
&lt;ul>
&lt;li>&lt;code>HAVING&lt;/code> filters with aggregate functions (applied after grouping);&lt;/li>
&lt;li>&lt;code>WHERE&lt;/code> filters with original columns (applied before grouping);&lt;/li>
&lt;/ul>
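&lt;p>Both filter types can appear in one query. This sketch (the rating filter and the cutoff of 9 are illustrative) shows &lt;code>WHERE&lt;/code> acting before grouping and &lt;code>HAVING&lt;/code> after:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT fa.actor_id, f.rating, count(*)
FROM film_actor fa
INNER JOIN film f
ON fa.film_id = f.film_id
WHERE f.rating IN ('G','PG') /*row filter, applied before grouping*/
GROUP BY fa.actor_id, f.rating
HAVING count(*) &amp;gt; 9; /*group filter, applied after grouping*/
&lt;/code>&lt;/pre>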
&lt;p>&lt;img src="2.gif" alt="">&lt;/p></description></item><item><title>Learning SQL Notes #6: Data Generation, Manipulation, and Conversion</title><link>https://siqi-zheng.rbind.io/post/2021-06-04-sql-notes-6/</link><pubDate>Fri, 04 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-04-sql-notes-6/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#working-with-string-data">Working with String Data&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#string-generation">String Generation&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#including-single-quotes">Including single quotes&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#including-special-characters">Including special characters&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#string-manipulation">String Manipulation&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#string-functions-that-return-numbers">String functions that return numbers&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#working-with-numeric-data">Working with Numeric Data&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#performing-arithmetic-functions--controlling-number-precision--handling-signed-data">Performing Arithmetic Functions &amp;amp; Controlling Number Precision &amp;amp; Handling Signed Data&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#working-with-temporal-data">Working with Temporal Data&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#dealing-with-time-zones">Dealing with Time Zones&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#generating-temporal-data">Generating Temporal Data&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#string-representations-of-temporal-data">String representations of temporal data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#string-to-date-conversions">String-to-date conversions&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#manipulating-temporal-data">Manipulating Temporal Data&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#temporal-functions-that-return-dates">Temporal functions that return dates&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#temporal-functions-that-return-strings">Temporal functions that return strings&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#temporal-functions-that-return-numbers">Temporal functions that return numbers&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#conversion-functions">Conversion Functions&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#appendix-for-codes">Appendix for Codes&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="working-with-string-data">Working with String Data&lt;/h2>
&lt;h3 id="string-generation">String Generation&lt;/h3>
&lt;p>Types:&lt;/p>
&lt;p>char
Holds fixed-length, blank-padded strings.&lt;/p>
&lt;p>varchar
Holds variable-length strings.&lt;/p>
&lt;p>text (MySQL and SQL Server) or clob (Oracle Database)
Holds very large variable-length strings (generally referred to as documents in this context).&lt;/p>
&lt;pre>&lt;code class="language-sql">CREATE TABLE string_tbl
(char_fld CHAR(30),
vchar_fld VARCHAR(30),
text_fld TEXT
);
INSERT INTO string_tbl (char_fld, vchar_fld, text_fld)
VALUES ('This is char data',
'This is varchar data',
'This is text data');
&lt;/code>&lt;/pre>
&lt;p>If you want to have a longer string, you can&lt;/p>
&lt;pre>&lt;code class="language-sql">UPDATE string_tbl
SET vchar_fld = 'This is a piece of extremely long varchar data';
&lt;/code>&lt;/pre>
&lt;p>but then:&lt;/p>
&lt;pre>&lt;code>ERROR 1406 (22001): Data too long for column 'vchar_fld' at row 1
&lt;/code>&lt;/pre>
&lt;p>NOTE: Since MySQL 5.7, the default behavior is “strict” mode, which means that exceptions are thrown when problems arise, whereas in older versions of the server &lt;strong>the string would have been truncated and a warning issued&lt;/strong>.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT @@session.sql_mode;
SET sql_mode='ansi'; /*revert to the older, non-strict behavior*/
SELECT @@session.sql_mode;
&lt;/code>&lt;/pre>
&lt;p>Now the extra characters will be truncated, with a warning issued instead of an error.&lt;/p>
&lt;h4 id="including-single-quotes">Including single quotes&lt;/h4>
&lt;pre>&lt;code class="language-sql">SELECT quote(text_fld)
FROM string_tbl;
&lt;/code>&lt;/pre>
&lt;p>Output:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">QUOTE(text_fld)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&amp;lsquo;This string didn't work, but it does now&amp;rsquo;&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="including-special-characters">Including special characters&lt;/h4>
&lt;p>The SQL Server and MySQL servers include the built-in function &lt;code>char()&lt;/code> so that you can build strings from any character in the extended ASCII character set (codes 0 through 255); Oracle Database users can use the &lt;code>chr()&lt;/code> function.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT CHAR(128,129,130,131,132,133,134,135,136,137);
&lt;/code>&lt;/pre>
&lt;p>Output:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">CHAR(128,129,130,131,132,133,134,135,136,137)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">Çüéâäàåçêë&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">coderange &amp;lt;- c(128,129,130,131,132,133,134,135,136,137)
rawToChar(as.raw(coderange),multiple=TRUE)
&lt;/code>&lt;/pre>
&lt;p>You can also concatenate two strings:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT CONCAT('danke sch', CHAR(148), 'n');
&lt;/code>&lt;/pre>
&lt;p>Output:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">CONCAT(&amp;lsquo;danke sch&amp;rsquo;, CHAR(148), &amp;lsquo;n&amp;rsquo;)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">danke schön&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">paste('danke sch', rawToChar(as.raw(148)), 'n')
paste0()
&lt;/code>&lt;/pre>
&lt;p>See: &lt;a href="https://www.r-bloggers.com/2011/03/ascii-code-table-in-r/">https://www.r-bloggers.com/2011/03/ascii-code-table-in-r/&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Oracle Database/PostgreSQL users can use the concatenation operator (&lt;code>||&lt;/code>) instead of the &lt;code>concat()&lt;/code> function, as in:&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-sql">SELECT 'danke sch' || CHR(148) || 'n' FROM dual;
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>SQL Server does not include a &lt;code>concat()&lt;/code> function, so you will need to use the concatenation operator (+), as in:&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-sql">SELECT 'danke sch' + CHAR(148) + 'n'
&lt;/code>&lt;/pre>
&lt;h3 id="string-manipulation">String Manipulation&lt;/h3>
&lt;h4 id="string-functions-that-return-numbers">String functions that return numbers&lt;/h4>
&lt;p>To find the length of a string:&lt;/p>
&lt;pre>&lt;code class="language-sql">LENGTH()
SELECT LENGTH(char_fld) char_length,
LENGTH(vchar_fld) varchar_length,
LENGTH(text_fld) text_length
FROM string_tbl;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">length()
&lt;/code>&lt;/pre>
&lt;p>To find the index of a character in a string:&lt;/p>
&lt;pre>&lt;code class="language-sql">POSITION()
SELECT POSITION('characters' IN vchar_fld)
FROM string_tbl;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">match('y',x)
which('y' %in% x)
&lt;/code>&lt;/pre>
&lt;p>Note: When working with databases, the &lt;strong>first&lt;/strong> character in a string is at position &lt;strong>1&lt;/strong>. A return value of &lt;strong>0&lt;/strong> from &lt;code>instr()&lt;/code> indicates that the substring &lt;strong>could not be found&lt;/strong>, not that the substring was found at the first position in the string.&lt;/p>
&lt;p>If you want to start your search at something &lt;strong>other than the first character&lt;/strong> of your target string, you will need to use the &lt;code>locate()&lt;/code> function, which is similar to the &lt;code>position()&lt;/code> function except that it allows an optional &lt;strong>third parameter&lt;/strong>, which is used to define the search’s start position. The &lt;code>locate()&lt;/code> function is also proprietary, whereas the &lt;code>position()&lt;/code> function is part of the SQL:2003 standard.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT LOCATE('is', vchar_fld, 5)
FROM string_tbl;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">match('y',x[5:])
which('y' %in% x[5:])
&lt;/code>&lt;/pre>
&lt;p>Oracle Database
&lt;code>instr()&lt;/code>: Mimics the &lt;code>position()&lt;/code> function when provided with two arguments and mimics the &lt;code>locate()&lt;/code> function when provided with three arguments.&lt;/p>
&lt;p>SQL Server
&lt;code>charindex()&lt;/code>: similar to Oracle’s &lt;code>instr()&lt;/code> function.&lt;/p>
&lt;p>&lt;code>strcmp()&lt;/code> (MySQL ONLY) takes two strings as arguments and returns one of the following:&lt;/p>
&lt;ul>
&lt;li>−1 if the first string comes before the second string in sort order&lt;/li>
&lt;li>0 if the strings are identical&lt;/li>
&lt;li>1 if the first string comes after the second string in sort order&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-sql">SELECT vchar_fld
FROM string_tbl
ORDER BY vchar_fld;
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">vchar_fld&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">12345&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">abcd&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">QRSTUV&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">qrstuv&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">xyz&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-sql">SELECT STRCMP('12345','12345') 12345_12345,
STRCMP('abcd','xyz') abcd_xyz,
STRCMP('abcd','QRSTUV') abcd_QRSTUV,
STRCMP('qrstuv','QRSTUV') qrstuv_QRSTUV, /*Case insensitive*/
STRCMP('12345','xyz') 12345_xyz,
STRCMP('xyz','qrstuv') xyz_qrstuv;
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">12345_12345&lt;/th>
&lt;th>abcd_xyz&lt;/th>
&lt;th>abcd_QRSTUV&lt;/th>
&lt;th>qrstuv_QRSTUV&lt;/th>
&lt;th>12345_xyz&lt;/th>
&lt;th align="right">xyz_qrstuv&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">0&lt;/td>
&lt;td>−1&lt;/td>
&lt;td>−1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>−1&lt;/td>
&lt;td align="right">1&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>To add or replace characters in the &lt;em>middle&lt;/em> of a string, use
&lt;code>insert()&lt;/code>, which takes
4 parameters: the original string, the start position, the number of characters to replace (0 for inserting a string), and the replacement string.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT INSERT('goodbye world', 9, 0, 'cruel ') string;
/*goodbye cruel world*/
SELECT INSERT('goodbye world', 1, 7, 'hello') string;
/*hello world*/
SELECT SUBSTRING('goodbye cruel world', 9, 5);
/*cruel*/
&lt;/code>&lt;/pre>
&lt;p>For other SQL,&lt;/p>
&lt;pre>&lt;code class="language-sql">/*Oracle*/
SELECT REPLACE('goodbye world', 'goodbye', 'hello') FROM dual;
/*hello world*/
SELECT substr('goodbye cruel world', 9, 5);
/*cruel*/
/*SQL Server*/
SELECT STUFF('hello world', 1, 5, 'goodbye cruel')
/*goodbye cruel world*/
SELECT SUBSTRING('goodbye cruel world', 9, 5);
/*cruel*/
&lt;/code>&lt;/pre>
&lt;h2 id="working-with-numeric-data">Working with Numeric Data&lt;/h2>
&lt;pre>&lt;code class="language-sql">SELECT (37 * 59) / (78 - (8 * 6));
&lt;/code>&lt;/pre>
&lt;h3 id="performing-arithmetic-functions--controlling-number-precision--handling-signed-data">Performing Arithmetic Functions &amp;amp; Controlling Number Precision &amp;amp; Handling Signed Data&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Function name&lt;/th>
&lt;th align="right">Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">acos( x )&lt;/td>
&lt;td align="right">Calculates the arc cosine of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">asin( x )&lt;/td>
&lt;td align="right">Calculates the arc sine of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">atan( x )&lt;/td>
&lt;td align="right">Calculates the arc tangent of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">cos( x )&lt;/td>
&lt;td align="right">Calculates the cosine of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">sin( x )&lt;/td>
&lt;td align="right">Calculates the sine of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">tan( x )&lt;/td>
&lt;td align="right">Calculates the tangent of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">cot( x )&lt;/td>
&lt;td align="right">Calculates the cotangent of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">exp( x )&lt;/td>
&lt;td align="right">Calculates ex&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">ln( x )&lt;/td>
&lt;td align="right">Calculates the natural log of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">sqrt( x )&lt;/td>
&lt;td align="right">Calculates the square root of x&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Some useful functions in R and SQL (See Appendix for full results):&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">SQL&lt;/th>
&lt;th align="right">R&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">MOD( x )&lt;/td>
&lt;td align="right">%%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">POW( x )&lt;/td>
&lt;td align="right">^&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">CEIL( x )&lt;/td>
&lt;td align="right">ceiling()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">FLOOR( x )&lt;/td>
&lt;td align="right">floor()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">ROUND( x )&lt;/td>
&lt;td align="right">round()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">TRUNCATE( x )&lt;/td>
&lt;td align="right">trunc()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">SIGN( x )&lt;/td>
&lt;td align="right">sign()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">ABS( x )&lt;/td>
&lt;td align="right">abs()&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="working-with-temporal-data">Working with Temporal Data&lt;/h2>
&lt;h3 id="dealing-with-time-zones">Dealing with Time Zones&lt;/h3>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
SELECT @@global.time_zone, @@session.time_zone;
SET time_zone = 'Europe/Zurich';
/*Oracle Database*/
ALTER SESSION SET TIME_ZONE = 'Europe/Zurich';
&lt;/code>&lt;/pre>
&lt;p>From:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">@@global.time_zone&lt;/th>
&lt;th align="right">@@session.time_zone&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">SYSTEM&lt;/td>
&lt;td align="right">SYSTEM&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>To:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">@@global.time_zone&lt;/th>
&lt;th align="right">@@session.time_zone&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">SYSTEM&lt;/td>
&lt;td align="right">Europe/Zurich&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">Sys.timezone()
Sys.setenv(TZ = &amp;quot;Europe/Zurich&amp;quot;)
&lt;/code>&lt;/pre>
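&lt;p>As a rough Python analogue (a sketch using the standard-library &lt;code>zoneinfo&lt;/code> module, Python 3.9+), the same session-level shift looks like this:&lt;/p>

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# A timestamp stored in UTC (the "SYSTEM" zone in the tables above)
utc_ts = datetime(2021, 6, 5, 12, 0, 0, tzinfo=timezone.utc)

# Re-express it for a session running in Europe/Zurich (UTC+2 in summer)
zurich_ts = utc_ts.astimezone(ZoneInfo("Europe/Zurich"))
print(zurich_ts.hour)  # 14 -- same instant, displayed two hours later
```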
&lt;h3 id="generating-temporal-data">Generating Temporal Data&lt;/h3>
&lt;p>You can generate temporal data via any of the following means:&lt;/p>
&lt;ul>
&lt;li>Copying data from an existing date, datetime, or time column&lt;/li>
&lt;li>Executing a built-in function that returns a date, datetime, or time&lt;/li>
&lt;li>Building a string representation of the temporal data to be evaluated by the server&lt;/li>
&lt;/ul>
&lt;h4 id="string-representations-of-temporal-data">String representations of temporal data&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Component&lt;/th>
&lt;th>Definition&lt;/th>
&lt;th align="right">Range&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">YYYY&lt;/td>
&lt;td>Year, including century&lt;/td>
&lt;td align="right">1000 to 9999&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">MM&lt;/td>
&lt;td>Month&lt;/td>
&lt;td align="right">01 (January) to 12 (December)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">DD&lt;/td>
&lt;td>Day&lt;/td>
&lt;td align="right">01 to 31&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">HH&lt;/td>
&lt;td>Hour&lt;/td>
&lt;td align="right">Range 00 to 23&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">HHH&lt;/td>
&lt;td>Hours (elapsed)&lt;/td>
&lt;td align="right">−838 to 838&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">MI&lt;/td>
&lt;td>Minute&lt;/td>
&lt;td align="right">00 to 59&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">SS&lt;/td>
&lt;td>Second&lt;/td>
&lt;td align="right">00 to 59&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Type&lt;/th>
&lt;th align="right">Default format&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">date&lt;/td>
&lt;td align="right">YYYY-MM-DD&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">datetime&lt;/td>
&lt;td align="right">YYYY-MM-DD HH:MI:SS&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">timestamp&lt;/td>
&lt;td align="right">YYYY-MM-DD HH:MI:SS&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">time&lt;/td>
&lt;td align="right">HHH:MI:SS&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="string-to-date-conversions">String-to-date conversions&lt;/h4>
&lt;ul>
&lt;li>A simple query that returns a datetime value using the &lt;code>cast()&lt;/code> function&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">SQL&lt;/th>
&lt;th align="right">R (lubridate)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">CAST(&amp;lsquo;2019-09-17 15:30:00&amp;rsquo; AS DATETIME)&lt;/td>
&lt;td align="right">as_datetime()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">STR_TO_DATE(&amp;lsquo;September 17, 2019&amp;rsquo;, &amp;lsquo;%M %d, %Y&amp;rsquo;)&lt;/td>
&lt;td align="right">as.Date(&amp;hellip;, format=&amp;hellip;)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">CAST(&amp;lsquo;2019-09-17&amp;rsquo; AS DATE)&lt;/td>
&lt;td align="right">as.Date()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">CAST(&amp;lsquo;108:17:57&amp;rsquo; AS TIME)&lt;/td>
&lt;td align="right">as.POSIXlt()&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
SELECT STR_TO_DATE('September 17, 2019', '%M %d, %Y');
/*Oracle Database*/
SELECT TO_DATE('2019-09-17', 'YYYY-MM-DD') FROM dual;
/*SQL server*/
SELECT CONVERT(DATETIME, '2019-09-17');
/*Current System Time*/
SELECT CURRENT_DATE(), CURRENT_TIME(), CURRENT_TIMESTAMP();
&lt;/code>&lt;/pre>
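&lt;p>For comparison, Python&amp;rsquo;s &lt;code>strptime()&lt;/code> performs the same string-to-date conversions; note that MySQL&amp;rsquo;s &lt;code>%M&lt;/code> (month name) corresponds to Python&amp;rsquo;s &lt;code>%B&lt;/code>:&lt;/p>

```python
from datetime import datetime

# MySQL: STR_TO_DATE('September 17, 2019', '%M %d, %Y')
d = datetime.strptime("September 17, 2019", "%B %d, %Y").date()
print(d)  # 2019-09-17

# MySQL: CAST('2019-09-17 15:30:00' AS DATETIME)
dt = datetime.strptime("2019-09-17 15:30:00", "%Y-%m-%d %H:%M:%S")
```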
&lt;p>Format components for MySQL (note that R&amp;rsquo;s &lt;code>strptime()&lt;/code> codes differ in places: in R, &lt;code>%M&lt;/code> means minutes and &lt;code>%B&lt;/code> the full month name):&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Format component&lt;/th>
&lt;th align="right">Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">%M&lt;/td>
&lt;td align="right">Month name (January to December)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%m&lt;/td>
&lt;td align="right">Month numeric (01 to 12)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%d&lt;/td>
&lt;td align="right">Day numeric (01 to 31)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%j&lt;/td>
&lt;td align="right">Day of year (001 to 366)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%W&lt;/td>
&lt;td align="right">Weekday name (Sunday to Saturday)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%Y&lt;/td>
&lt;td align="right">Year, four-digit numeric&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%y&lt;/td>
&lt;td align="right">Year, two-digit numeric&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%H&lt;/td>
&lt;td align="right">Hour (00 to 23)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%h&lt;/td>
&lt;td align="right">Hour (01 to 12)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%i&lt;/td>
&lt;td align="right">Minutes (00 to 59)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%s&lt;/td>
&lt;td align="right">Seconds (00 to 59)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%f&lt;/td>
&lt;td align="right">Microseconds (000000 to 999999)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%p&lt;/td>
&lt;td align="right">A.M. or P.M.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="manipulating-temporal-data">Manipulating Temporal Data&lt;/h3>
&lt;p>&lt;strong>Interval types for &lt;code>DATE_ADD()&lt;/code> and &lt;code>EXTRACT()&lt;/code>&lt;/strong>&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Interval name&lt;/th>
&lt;th align="right">Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">second&lt;/td>
&lt;td align="right">Number of seconds&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">minute&lt;/td>
&lt;td align="right">Number of minutes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">hour&lt;/td>
&lt;td align="right">Number of hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">day&lt;/td>
&lt;td align="right">Number of days&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">month&lt;/td>
&lt;td align="right">Number of months&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">year&lt;/td>
&lt;td align="right">Number of years&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">minute_second&lt;/td>
&lt;td align="right">Number of minutes and seconds, separated by “:”&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">hour_second&lt;/td>
&lt;td align="right">Number of hours, minutes, and seconds, separated by “:”&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">year_month&lt;/td>
&lt;td align="right">Number of years and months, separated by “-”&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="temporal-functions-that-return-dates">Temporal functions that return dates&lt;/h4>
&lt;p>The same update can be performed on three different servers:&lt;/p>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
UPDATE employee
SET birth_date = DATE_ADD(birth_date, INTERVAL '9-11' YEAR_MONTH)
WHERE emp_id = 4789;
/*Oracle Database*/
UPDATE employee
SET birth_date = ADD_MONTHS(birth_date, 119)
WHERE emp_id = 4789;
/*SQL server*/
UPDATE employee
SET birth_date = DATEADD(MONTH, 119, birth_date)
WHERE emp_id = 4789
&lt;/code>&lt;/pre>
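&lt;p>Python&amp;rsquo;s standard library has no direct &lt;code>ADD_MONTHS&lt;/code> equivalent, but the 9-year-11-month (119-month) shift above can be sketched with a small helper (&lt;code>add_months&lt;/code> is a hypothetical name, not a library function):&lt;/p>

```python
import calendar
from datetime import date

def add_months(d, months):
    """Shift a date by whole months, clamping the day
    when the target month is shorter."""
    total = d.month - 1 + months
    year = d.year + total // 12
    month = total % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

# INTERVAL '9-11' YEAR_MONTH is 119 months
print(add_months(date(1970, 1, 15), 119))  # 1979-12-15
```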
&lt;h4 id="temporal-functions-that-return-strings">Temporal functions that return strings&lt;/h4>
&lt;p>Some other functions for temporal data:&lt;/p>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
SELECT LAST_DAY('2019-09-17'); /*Extract last day of Sept*/
SELECT DAYNAME('2019-09-18'); /*Wednesday*/
SELECT EXTRACT(YEAR FROM '2019-09-18 22:19:05'); /*2019*/
/*SQL Server*/
SELECT DATEPART(YEAR, GETDATE())
&lt;/code>&lt;/pre>
&lt;h4 id="temporal-functions-that-return-numbers">Temporal functions that return numbers&lt;/h4>
&lt;pre>&lt;code class="language-sql">SELECT DATEDIFF('2019-09-03', '2019-06-21');
/*74*/
SELECT DATEDIFF('2019-09-03 23:59:59', '2019-06-21 00:00:01');
/*74, time has no effect*/
SELECT DATEDIFF('2019-06-21', '2019-09-03');
/*-74*/
/*SQL Server*/
SELECT DATEDIFF(DAY, '2019-06-21', '2019-09-03')
&lt;/code>&lt;/pre>
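&lt;p>In Python, subtracting two &lt;code>date&lt;/code> objects gives the same day counts:&lt;/p>

```python
from datetime import date, datetime

# DATEDIFF('2019-09-03', '2019-06-21')
print((date(2019, 9, 3) - date(2019, 6, 21)).days)  # 74

# MySQL's DATEDIFF ignores the time of day; mimic that by
# comparing only the .date() parts
a = datetime(2019, 9, 3, 23, 59, 59)
b = datetime(2019, 6, 21, 0, 0, 1)
print((a.date() - b.date()).days)  # 74
```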
&lt;h3 id="conversion-functions">Conversion Functions&lt;/h3>
&lt;pre>&lt;code class="language-sql">SELECT CAST('1456328' AS SIGNED INTEGER);
/*1456328*/
SELECT CAST('999ABC111' AS UNSIGNED INTEGER);
/*999 with warnings about truncation*/
&lt;/code>&lt;/pre>
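&lt;p>Python raises an error on &lt;code>int('999ABC111')&lt;/code> rather than truncating; MySQL&amp;rsquo;s forgiving cast can be sketched with a regular expression (&lt;code>lenient_int&lt;/code> is a made-up helper, not a built-in):&lt;/p>

```python
import re

def lenient_int(s):
    """Mimic MySQL's lenient string-to-integer cast:
    read the leading (optionally signed) digits, else 0."""
    m = re.match(r"[+-]?\d+", s)
    return int(m.group()) if m else 0

print(lenient_int("1456328"))    # 1456328
print(lenient_int("999ABC111"))  # 999
```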
&lt;h2 id="appendix-for-codes">Appendix for Codes&lt;/h2>
&lt;pre>&lt;code class="language-sql">SELECT MOD(10,4);
/*2*/
SELECT MOD(20.75,4); /*Real argument*/
/*0.75*/
SELECT POW(2,8);
/*256*/
SELECT CEIL(72.445), FLOOR(72.445);
/*73 72*/
SELECT CEIL(72.000000001), FLOOR(72.999999999);
/*73 72*/
SELECT ROUND(72.49999), ROUND(72.5), ROUND(72.50001);
/*72 73 73*/
SELECT ROUND(72.0909, 1), ROUND(72.0909, 2), ROUND(72.0909, 3);
/*72.1 72.09 72.091*/
SELECT TRUNCATE(72.0909, 1), TRUNCATE(72.0909, 2), TRUNCATE(72.0909, 3);
/*72.0 72.09 72.090*/
/*SQL Server*/
SELECT ROUND(72.0909, 1, 1)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">%%
^
ceiling()
floor()
round()
trunc()
&lt;/code>&lt;/pre>
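&lt;p>The same functions in Python (mostly in the &lt;code>math&lt;/code> module); note that Python&amp;rsquo;s built-in &lt;code>round()&lt;/code> uses banker&amp;rsquo;s rounding, unlike MySQL&amp;rsquo;s &lt;code>ROUND()&lt;/code>:&lt;/p>

```python
import math

print(10 % 4)                                 # 2   (MOD / %%)
print(2 ** 8)                                 # 256 (POW / ^)
print(math.ceil(72.445), math.floor(72.445))  # 73 72
print(math.trunc(72.9))                       # 72  (TRUNCATE / trunc)
print(round(72.0909, 2))                      # 72.09

# Caveat: half-way values round to the nearest even number,
# so SELECT ROUND(72.5) gives 73 in MySQL but:
print(round(72.5))  # 72
```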
&lt;pre>&lt;code class="language-sql">SELECT account_id, SIGN(balance), ABS(balance)
FROM account;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">sign()
abs()
&lt;/code>&lt;/pre>
&lt;p>Hope I can finish this before July. Stay safe.&lt;/p>
&lt;p>&lt;img src="2.gif" alt="">&lt;/p></description></item><item><title>Learning SQL Notes #5: Querying Multiple Tables (CH. 5)</title><link>https://siqi-zheng.rbind.io/post/2021-06-03-sql-notes-5/</link><pubDate>Thu, 03 Jun 2021 20:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-03-sql-notes-5/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#cross-join-cartesian-product">Cross Join (Cartesian Product)&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#inner-joins">Inner Joins&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#joining-three-or-more-tables">Joining Three or More Tables&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#using-subqueries-as-tables">Using Subqueries as Tables&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#using-the-same-table-twice">Using the Same Table Twice&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#self-joins">Self-Joins&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#outer-joins">Outer Joins&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#three-way-outer-joins">Three-Way Outer Joins&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#natural-joins">Natural Joins&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>A join instructs the server to use a column as the &lt;em>transportation&lt;/em> between tables, thus allowing columns from both tables to be included in the query’s result set.&lt;/p>
&lt;h2 id="cross-join-cartesian-product">Cross Join (Cartesian Product)&lt;/h2>
&lt;p>If the query doesn’t specify how the two tables should be joined, the database server generates the &lt;em>Cartesian
product&lt;/em>, which is &lt;strong>every combination&lt;/strong> of rows from the two tables.&lt;/p>
&lt;pre>&lt;code class="language-sql">JOIN b
CROSS JOIN b
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">merge(x = df1, y = df2, by = NULL)
library(data.table)
CJ(a, b)
&lt;/code>&lt;/pre>
&lt;p>A cross join can also be used to fabricate a table of consecutive numbers.&lt;/p>
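&lt;p>A minimal Python sketch of the Cartesian product, using &lt;code>itertools.product&lt;/code>:&lt;/p>

```python
from itertools import product

a = ["x", "y"]
b = [1, 2, 3]

# Every combination of rows from the two "tables"
cross = list(product(a, b))
print(len(cross))  # 6, i.e. 2 * 3 rows
```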
&lt;h2 id="inner-joins">Inner Joins&lt;/h2>
&lt;p>If a value exists for the address_id column in one table but &lt;em>not&lt;/em> the other, then the join fails for the rows containing that value, and those rows are &lt;strong>excluded&lt;/strong> from the result set. Inner join only returns rows that satisfy the &lt;strong>join condition&lt;/strong>.&lt;/p>
&lt;pre>
INNER JOIN b
&lt;b>ON a.id=b.id&lt;/b>
&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">merge(df1, df2, by = &amp;quot;id&amp;quot;)
library(plyr)
join(df1, df2,
type = &amp;quot;inner&amp;quot;)
&lt;/code>&lt;/pre>
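&lt;p>The exclusion behaviour is easy to check with Python&amp;rsquo;s built-in &lt;code>sqlite3&lt;/code> module (the &lt;code>customer&lt;/code>/&lt;code>address&lt;/code> tables here are toy stand-ins):&lt;/p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (address_id INTEGER, name TEXT);
    CREATE TABLE address  (address_id INTEGER, city TEXT);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO address  VALUES (1, 'Zurich');  -- no row for id 2
""")
rows = con.execute(
    "SELECT c.name, a.city FROM customer c "
    "INNER JOIN address a ON c.address_id = a.address_id"
).fetchall()
print(rows)  # [('Ann', 'Zurich')] -- Bob has no match and is excluded
```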
&lt;h2 id="joining-three-or-more-tables">Joining Three or More Tables&lt;/h2>
&lt;p>Join order is not important!&lt;/p>
&lt;p>To force MySQL to join the tables in the order listed:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT STRAIGHT_JOIN COL1
&lt;/code>&lt;/pre>
&lt;h2 id="using-subqueries-as-tables">Using Subqueries as Tables&lt;/h2>
&lt;p>See subquery notes.&lt;/p>
&lt;h2 id="using-the-same-table-twice">Using the Same Table Twice&lt;/h2>
&lt;p>Either one of the actors in the movie:&lt;/p>
&lt;pre>&lt;code class="language-SQL">SELECT f.title
FROM film f
INNER JOIN film_actor fa
ON f.film_id = fa.film_id
INNER JOIN actor a
ON fa.actor_id = a.actor_id
WHERE (a.first_name = 'CATE' AND a.last_name = 'MCQUEEN')
OR (a.first_name = 'CUBA' AND a.last_name = 'BIRCH');
&lt;/code>&lt;/pre>
&lt;p>If we want movies featuring both actors, you cannot simply replace OR with AND, since no single row can match two different actors and the query would return an empty set. Instead, you need to join the film_actor and actor tables twice:&lt;/p>
&lt;pre>&lt;code class="language-SQL">SELECT f.title
FROM film f
/*once: */
INNER JOIN film_actor fa1
ON f.film_id = fa1.film_id
INNER JOIN actor a1
ON fa1.actor_id = a1.actor_id
/*twice: */
INNER JOIN film_actor fa2
ON f.film_id = fa2.film_id
INNER JOIN actor a2
ON fa2.actor_id = a2.actor_id
/*filter condition is applied*/
WHERE (a1.first_name = 'CATE' AND a1.last_name = 'MCQUEEN')
AND (a2.first_name = 'CUBA' AND a2.last_name = 'BIRCH');
&lt;/code>&lt;/pre>
&lt;h2 id="self-joins">Self-Joins&lt;/h2>
&lt;p>Some tables include a self-referencing foreign key, which means that it includes a column that points to the primary key within the same table.&lt;/p>
&lt;p>Imagine that the film table includes the column prequel_film_id, which points to the film’s parent (e.g., the film Fiddler Lost II would use this column to point to the parent film Fiddler Lost).&lt;/p>
&lt;p>Using a self-join, you can write a query that lists every film that has a prequel, along with the prequel’s title:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT f.title, f_prnt.title prequel
FROM film f
INNER JOIN film f_prnt
ON f_prnt.film_id = f.prequel_film_id
WHERE f.prequel_film_id IS NOT NULL;
&lt;/code>&lt;/pre>
&lt;p>A possible outcome:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">title&lt;/th>
&lt;th align="right">prequel&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">FIDDLER LOST II&lt;/td>
&lt;td align="right">FIDDLER LOST&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
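&lt;p>The same self-join can be reproduced with &lt;code>sqlite3&lt;/code> and a two-row toy &lt;code>film&lt;/code> table:&lt;/p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE film (film_id INTEGER, title TEXT, prequel_film_id INTEGER);
    INSERT INTO film VALUES (1, 'FIDDLER LOST', NULL),
                            (2, 'FIDDLER LOST II', 1);
""")
rows = con.execute("""
    SELECT f.title, f_prnt.title AS prequel
    FROM film f
    INNER JOIN film f_prnt ON f_prnt.film_id = f.prequel_film_id
    WHERE f.prequel_film_id IS NOT NULL
""").fetchall()
print(rows)  # [('FIDDLER LOST II', 'FIDDLER LOST')]
```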
&lt;h2 id="outer-joins">Outer Joins&lt;/h2>
&lt;pre>&lt;code class="language-sql">SELECT f.film_id, f.title, count(i.inventory_id) num_copies
FROM film f
LEFT OUTER JOIN inventory i
ON f.film_id = i.film_id
GROUP BY f.film_id, f.title;
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>
&lt;p>A left outer join includes all rows from the table on the left side of the join (film, in this case) and then includes columns from the table on the right side of the join (inventory) only when the join succeeds.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The num_copies column definition was changed from count(*) to count(i.inventory_id), which will count the number of non-null values of the inventory.inventory_id column.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A left outer join B $\equiv$ B right outer join A.&lt;/p>
&lt;/li>
&lt;/ul>
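&lt;p>The zero-copies case can be verified with &lt;code>sqlite3&lt;/code> (toy data; film 2 has no inventory rows, yet still appears with a count of 0):&lt;/p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE film (film_id INTEGER, title TEXT);
    CREATE TABLE inventory (inventory_id INTEGER, film_id INTEGER);
    INSERT INTO film VALUES (1, 'ALONE TRIP'), (2, 'ALICE FANTASIA');
    INSERT INTO inventory VALUES (10, 1), (11, 1);
""")
rows = con.execute("""
    SELECT f.title, count(i.inventory_id) AS num_copies
    FROM film f
    LEFT OUTER JOIN inventory i ON f.film_id = i.film_id
    GROUP BY f.film_id, f.title
    ORDER BY f.film_id
""").fetchall()
print(rows)  # [('ALONE TRIP', 2), ('ALICE FANTASIA', 0)]
```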
&lt;h3 id="three-way-outer-joins">Three-Way Outer Joins&lt;/h3>
&lt;pre>
SELECT f.film_id, f.title, i.inventory_id, r.rental_date
FROM film f LEFT OUTER JOIN inventory i
ON f.film_id = i.film_id
&lt;b>LEFT OUTER JOIN rental r
ON i.inventory_id = r.inventory_id&lt;/b>
WHERE f.film_id BETWEEN 13 AND 15;
&lt;/pre>
&lt;h2 id="natural-joins">Natural Joins&lt;/h2>
&lt;p>A natural join lets the database server determine what the join conditions need to be.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name, date(r.rental_date)
FROM customer c
NATURAL JOIN rental r;
&lt;/code>&lt;/pre>
&lt;p>&lt;code>Empty set (0.04 sec)&lt;/code>&lt;/p>
&lt;p>Because you specified a natural join, the server inspected the table definitions and added the join condition r.customer_id = c.customer_id. This would have worked fine, but in the Sakila schema every table includes a last_update column recording when each row was last modified, so the server also adds the join condition r.last_update = c.last_update, which causes the query to return no data.&lt;/p>
&lt;p>The only way around this issue is to use a subquery to restrict the columns for at least one of the tables:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT cust.first_name, cust.last_name, date(r.rental_date)
FROM
(SELECT customer_id, first_name, last_name
FROM customer
) cust
NATURAL JOIN rental r;
&lt;/code>&lt;/pre></description></item><item><title>Learning SQL Notes #4.5: Regular Expression</title><link>https://siqi-zheng.rbind.io/post/2021-06-02-sql-notes-4-5/</link><pubDate>Wed, 02 Jun 2021 20:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-02-sql-notes-4-5/</guid><description>&lt;p>Adapted from &lt;a href="https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference">https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference&lt;/a>&lt;/p>
&lt;h2 id="character-escapes">Character Escapes&lt;/h2>
&lt;p>The backslash character (\) in a regular expression indicates that the character that follows it either is a special character (as shown in the following table), or should be interpreted literally. For more information, see &lt;a href="character-escapes-in-regular-expressions" data-linktype="relative-path">Character Escapes&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Escaped character&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>\a&lt;/code>&lt;/td>
&lt;td>Matches a bell character, \u0007.&lt;/td>
&lt;td>&lt;code>\a&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\u0007&amp;quot;&lt;/code> in &lt;code>&amp;quot;Error!&amp;quot; + '\u0007'&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\b&lt;/code>&lt;/td>
&lt;td>In a character class, matches a backspace, \u0008.&lt;/td>
&lt;td>&lt;code>[\b]{3,}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\b\b\b\b&amp;quot;&lt;/code> in &lt;code>&amp;quot;\b\b\b\b&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\t&lt;/code>&lt;/td>
&lt;td>Matches a tab, \u0009.&lt;/td>
&lt;td>&lt;code>(\w+)\t&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;item1\t&amp;quot;&lt;/code>, &lt;code>&amp;quot;item2\t&amp;quot;&lt;/code> in &lt;code>&amp;quot;item1\titem2\t&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\r&lt;/code>&lt;/td>
&lt;td>Matches a carriage return, \u000D. (&lt;code>\r&lt;/code> is not equivalent to the newline character, &lt;code>\n&lt;/code>.)&lt;/td>
&lt;td>&lt;code>\r\n(\w+)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\r\nThese&amp;quot;&lt;/code> in &lt;code>&amp;quot;\r\nThese are\ntwo lines.&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\v&lt;/code>&lt;/td>
&lt;td>Matches a vertical tab, \u000B.&lt;/td>
&lt;td>&lt;code>[\v]{2,}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\v\v\v&amp;quot;&lt;/code> in &lt;code>&amp;quot;\v\v\v&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\f&lt;/code>&lt;/td>
&lt;td>Matches a form feed, \u000C.&lt;/td>
&lt;td>&lt;code>[\f]{2,}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\f\f\f&amp;quot;&lt;/code> in &lt;code>&amp;quot;\f\f\f&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\n&lt;/code>&lt;/td>
&lt;td>Matches a new line, \u000A.&lt;/td>
&lt;td>&lt;code>\r\n(\w+)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\r\nThese&amp;quot;&lt;/code> in &lt;code>&amp;quot;\r\nThese are\ntwo lines.&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\e&lt;/code>&lt;/td>
&lt;td>Matches an escape, \u001B.&lt;/td>
&lt;td>&lt;code>\e&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\x001B&amp;quot;&lt;/code> in &lt;code>&amp;quot;\x001B&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\&lt;/code> &lt;em>nnn&lt;/em>&lt;/td>
&lt;td>Uses octal representation to specify a character (&lt;em>nnn&lt;/em> consists of two or three digits).&lt;/td>
&lt;td>&lt;code>\w\040\w&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;a b&amp;quot;&lt;/code>, &lt;code>&amp;quot;c d&amp;quot;&lt;/code> in &lt;code>&amp;quot;a bc d&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\x&lt;/code> &lt;em>nn&lt;/em>&lt;/td>
&lt;td>Uses hexadecimal representation to specify a character (&lt;em>nn&lt;/em> consists of exactly two digits).&lt;/td>
&lt;td>&lt;code>\w\x20\w&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;a b&amp;quot;&lt;/code>, &lt;code>&amp;quot;c d&amp;quot;&lt;/code> in &lt;code>&amp;quot;a bc d&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\c&lt;/code> &lt;em>X&lt;/em>&lt;br/>&lt;br/> &lt;code>\c&lt;/code> &lt;em>x&lt;/em>&lt;/td>
&lt;td>Matches the ASCII control character that is specified by &lt;em>X&lt;/em> or &lt;em>x&lt;/em>, where &lt;em>X&lt;/em> or &lt;em>x&lt;/em> is the letter of the control character.&lt;/td>
&lt;td>&lt;code>\cC&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\x0003&amp;quot;&lt;/code> in &lt;code>&amp;quot;\x0003&amp;quot;&lt;/code> (Ctrl-C)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\u&lt;/code> &lt;em>nnnn&lt;/em>&lt;/td>
&lt;td>Matches a Unicode character by using hexadecimal representation (exactly four digits, as represented by &lt;em>nnnn&lt;/em>).&lt;/td>
&lt;td>&lt;code>\w\u0020\w&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;a b&amp;quot;&lt;/code>, &lt;code>&amp;quot;c d&amp;quot;&lt;/code> in &lt;code>&amp;quot;a bc d&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\&lt;/code>&lt;/td>
&lt;td>When followed by a character that is not recognized as an escaped character in this and other tables in this topic, matches that character. For example, &lt;code>\*&lt;/code> is the same as &lt;code>\x2A&lt;/code>, and &lt;code>\.&lt;/code> is the same as &lt;code>\x2E&lt;/code>. This allows the regular expression engine to disambiguate language elements (such as * or ?) and character literals (represented by &lt;code>\*&lt;/code> or &lt;code>\?&lt;/code>).&lt;/td>
&lt;td>&lt;code>\d+[\+-x\*]\d+&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;2+2&amp;quot;&lt;/code> and &lt;code>&amp;quot;3*9&amp;quot;&lt;/code> in &lt;code>&amp;quot;(2+2) * 3*9&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="character-classes">Character Classes&lt;/h2>
&lt;p>A character class matches any one of a set of characters. Character classes include the language elements listed in the following table. For more information, see &lt;a href="character-classes-in-regular-expressions" data-linktype="relative-path">Character Classes&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Character class&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>[&lt;/code> &lt;em>character_group&lt;/em> &lt;code>]&lt;/code>&lt;/td>
&lt;td>Matches any single character in &lt;em>character_group&lt;/em>. By default, the match is case-sensitive.&lt;/td>
&lt;td>&lt;code>[ae]&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;a&amp;quot;&lt;/code> in &lt;code>&amp;quot;gray&amp;quot;&lt;/code>&lt;br/>&lt;br/> &lt;code>&amp;quot;a&amp;quot;&lt;/code>, &lt;code>&amp;quot;e&amp;quot;&lt;/code> in &lt;code>&amp;quot;lane&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>[^&lt;/code> &lt;em>character_group&lt;/em> &lt;code>]&lt;/code>&lt;/td>
&lt;td>Negation: Matches any single character that is not in &lt;em>character_group&lt;/em>. By default, characters in &lt;em>character_group&lt;/em> are case-sensitive.&lt;/td>
&lt;td>&lt;code>[^aei]&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;r&amp;quot;&lt;/code>, &lt;code>&amp;quot;g&amp;quot;&lt;/code>, &lt;code>&amp;quot;n&amp;quot;&lt;/code> in &lt;code>&amp;quot;reign&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>[&lt;/code> &lt;em>first&lt;/em> &lt;code>-&lt;/code> &lt;em>last&lt;/em> &lt;code>]&lt;/code>&lt;/td>
&lt;td>Character range: Matches any single character in the range from &lt;em>first&lt;/em> to &lt;em>last&lt;/em>.&lt;/td>
&lt;td>&lt;code>[A-Z]&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;A&amp;quot;&lt;/code>, &lt;code>&amp;quot;B&amp;quot;&lt;/code> in &lt;code>&amp;quot;AB123&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>.&lt;/code>&lt;/td>
&lt;td>Wildcard: Matches any single character except \n.&lt;br/>&lt;br/> To match a literal period character (. or &lt;code>\u002E&lt;/code>), you must precede it with the escape character (&lt;code>\.&lt;/code>).&lt;/td>
&lt;td>&lt;code>a.e&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ave&amp;quot;&lt;/code> in &lt;code>&amp;quot;nave&amp;quot;&lt;/code>&lt;br/>&lt;br/> &lt;code>&amp;quot;ate&amp;quot;&lt;/code> in &lt;code>&amp;quot;water&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\p{&lt;/code> &lt;em>name&lt;/em> &lt;code>}&lt;/code>&lt;/td>
&lt;td>Matches any single character in the Unicode general category or named block specified by &lt;em>name&lt;/em>.&lt;/td>
&lt;td>&lt;code>\p{Lu}&lt;/code>&lt;br/>&lt;br/> &lt;code>\p{IsCyrillic}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;C&amp;quot;&lt;/code>, &lt;code>&amp;quot;L&amp;quot;&lt;/code> in &lt;code>&amp;quot;City Lights&amp;quot;&lt;/code>&lt;br/>&lt;br/> &lt;code>&amp;quot;Д&amp;quot;&lt;/code>, &lt;code>&amp;quot;Ж&amp;quot;&lt;/code> in &lt;code>&amp;quot;ДЖem&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\P{&lt;/code> &lt;em>name&lt;/em> &lt;code>}&lt;/code>&lt;/td>
&lt;td>Matches any single character that is not in the Unicode general category or named block specified by &lt;em>name&lt;/em>.&lt;/td>
&lt;td>&lt;code>\P{Lu}&lt;/code>&lt;br/>&lt;br/> &lt;code>\P{IsCyrillic}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;i&amp;quot;&lt;/code>, &lt;code>&amp;quot;t&amp;quot;&lt;/code>, &lt;code>&amp;quot;y&amp;quot;&lt;/code> in &lt;code>&amp;quot;City&amp;quot;&lt;/code>&lt;br/>&lt;br/> &lt;code>&amp;quot;e&amp;quot;&lt;/code>, &lt;code>&amp;quot;m&amp;quot;&lt;/code> in &lt;code>&amp;quot;ДЖem&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\w&lt;/code>&lt;/td>
&lt;td>Matches any word character.&lt;/td>
&lt;td>&lt;code>\w&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;I&amp;quot;&lt;/code>, &lt;code>&amp;quot;D&amp;quot;&lt;/code>, &lt;code>&amp;quot;A&amp;quot;&lt;/code>, &lt;code>&amp;quot;1&amp;quot;&lt;/code>, &lt;code>&amp;quot;3&amp;quot;&lt;/code> in &lt;code>&amp;quot;ID A1.3&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\W&lt;/code>&lt;/td>
&lt;td>Matches any non-word character.&lt;/td>
&lt;td>&lt;code>\W&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot; &amp;quot;&lt;/code>, &lt;code>&amp;quot;.&amp;quot;&lt;/code> in &lt;code>&amp;quot;ID A1.3&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\s&lt;/code>&lt;/td>
&lt;td>Matches any white-space character.&lt;/td>
&lt;td>&lt;code>\w\s&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;D &amp;quot;&lt;/code> in &lt;code>&amp;quot;ID A1.3&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\S&lt;/code>&lt;/td>
&lt;td>Matches any non-white-space character.&lt;/td>
&lt;td>&lt;code>\s\S&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot; _&amp;quot;&lt;/code> in &lt;code>&amp;quot;int __ctr&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\d&lt;/code>&lt;/td>
&lt;td>Matches any decimal digit.&lt;/td>
&lt;td>&lt;code>\d&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;4&amp;quot;&lt;/code> in &lt;code>&amp;quot;4 = IV&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\D&lt;/code>&lt;/td>
&lt;td>Matches any character other than a decimal digit.&lt;/td>
&lt;td>&lt;code>\D&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot; &amp;quot;&lt;/code>, &lt;code>&amp;quot;=&amp;quot;&lt;/code>, &lt;code>&amp;quot; &amp;quot;&lt;/code>, &lt;code>&amp;quot;I&amp;quot;&lt;/code>, &lt;code>&amp;quot;V&amp;quot;&lt;/code> in &lt;code>&amp;quot;4 = IV&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
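&lt;p>The portable classes (&lt;code>\w&lt;/code>, &lt;code>\d&lt;/code>, negated sets) behave the same way in Python&amp;rsquo;s &lt;code>re&lt;/code> module, though the .NET-only &lt;code>\p{...}&lt;/code> blocks do not exist there:&lt;/p>

```python
import re

print(re.findall(r"\w", "ID A1.3"))    # ['I', 'D', 'A', '1', '3']
print(re.findall(r"\d", "4 = IV"))     # ['4']
print(re.findall(r"[^aei]", "reign"))  # ['r', 'g', 'n']
```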
&lt;h2 id="anchors">Anchors&lt;/h2>
&lt;p>Anchors, or atomic zero-width assertions, cause a match to succeed or fail depending on the current position in the string, but they do not cause the engine to advance through the string or consume characters. The metacharacters listed in the following table are anchors. For more information, see &lt;a href="anchors-in-regular-expressions" data-linktype="relative-path">Anchors&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Assertion&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>^&lt;/code>&lt;/td>
&lt;td>By default, the match must start at the beginning of the string; in multiline mode, it must start at the beginning of the line.&lt;/td>
&lt;td>&lt;code>^\d{3}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;901&amp;quot;&lt;/code> in &lt;code>&amp;quot;901-333-&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>$&lt;/code>&lt;/td>
&lt;td>By default, the match must occur at the end of the string or before &lt;code>\n&lt;/code> at the end of the string; in multiline mode, it must occur before the end of the line or before &lt;code>\n&lt;/code> at the end of the line.&lt;/td>
&lt;td>&lt;code>-\d{3}$&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;-333&amp;quot;&lt;/code> in &lt;code>&amp;quot;-901-333&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\A&lt;/code>&lt;/td>
&lt;td>The match must occur at the start of the string.&lt;/td>
&lt;td>&lt;code>\A\d{3}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;901&amp;quot;&lt;/code> in &lt;code>&amp;quot;901-333-&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\Z&lt;/code>&lt;/td>
&lt;td>The match must occur at the end of the string or before &lt;code>\n&lt;/code> at the end of the string.&lt;/td>
&lt;td>&lt;code>-\d{3}\Z&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;-333&amp;quot;&lt;/code> in &lt;code>&amp;quot;-901-333&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\z&lt;/code>&lt;/td>
&lt;td>The match must occur at the end of the string.&lt;/td>
&lt;td>&lt;code>-\d{3}\z&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;-333&amp;quot;&lt;/code> in &lt;code>&amp;quot;-901-333&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\G&lt;/code>&lt;/td>
&lt;td>The match must occur at the point where the previous match ended.&lt;/td>
&lt;td>&lt;code>\G\(\d\)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;(1)&amp;quot;&lt;/code>, &lt;code>&amp;quot;(3)&amp;quot;&lt;/code>, &lt;code>&amp;quot;(5)&amp;quot;&lt;/code> in &lt;code>&amp;quot;(1)(3)(5)[7](9)&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\b&lt;/code>&lt;/td>
&lt;td>The match must occur on a boundary between a &lt;code>\w&lt;/code> (alphanumeric) and a &lt;code>\W&lt;/code> (nonalphanumeric) character.&lt;/td>
&lt;td>&lt;code>\b\w+\s\w+\b&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;them theme&amp;quot;&lt;/code>, &lt;code>&amp;quot;them them&amp;quot;&lt;/code> in &lt;code>&amp;quot;them theme them them&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\B&lt;/code>&lt;/td>
&lt;td>The match must not occur on a &lt;code>\b&lt;/code> boundary.&lt;/td>
&lt;td>&lt;code>\Bend\w*\b&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ends&amp;quot;&lt;/code>, &lt;code>&amp;quot;ender&amp;quot;&lt;/code> in &lt;code>&amp;quot;end sends endure lender&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="grouping-constructs">Grouping Constructs&lt;/h2>
&lt;p>Grouping constructs delineate subexpressions of a regular expression and typically capture substrings of an input string. Grouping constructs include the language elements listed in the following table. For more information, see &lt;a href="grouping-constructs-in-regular-expressions" data-linktype="relative-path">Grouping Constructs&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Grouping construct&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>(&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Captures the matched subexpression and assigns it a one-based ordinal number.&lt;/td>
&lt;td>&lt;code>(\w)\1&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ee&amp;quot;&lt;/code> in &lt;code>&amp;quot;deep&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?&amp;lt;&lt;/code> &lt;em>name&lt;/em> &lt;code>&amp;gt;&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;br/> or &lt;br/>&lt;code>(?'&lt;/code> &lt;em>name&lt;/em> &lt;code>'&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Captures the matched subexpression into a named group.&lt;/td>
&lt;td>&lt;code>(?&amp;lt;double&amp;gt;\w)\k&amp;lt;double&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ee&amp;quot;&lt;/code> in &lt;code>&amp;quot;deep&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?&amp;lt;&lt;/code> &lt;em>name1&lt;/em> &lt;code>-&lt;/code> &lt;em>name2&lt;/em> &lt;code>&amp;gt;&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code> &lt;br/> or &lt;br/> &lt;code>(?'&lt;/code> &lt;em>name1&lt;/em> &lt;code>-&lt;/code> &lt;em>name2&lt;/em> &lt;code>'&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Defines a balancing group definition. For more information, see the &amp;quot;Balancing Group Definition&amp;quot; section in &lt;a href="grouping-constructs-in-regular-expressions" data-linktype="relative-path">Grouping Constructs&lt;/a>.&lt;/td>
&lt;td>&lt;code>(((?'Open'\()[^\(\)]*)+((?'Close-Open'\))[^\(\)]*)+)*(?(Open)(?!))$&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;((1-3)*(3-1))&amp;quot;&lt;/code> in &lt;code>&amp;quot;3+2^((1-3)*(3-1))&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?:&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Defines a noncapturing group.&lt;/td>
&lt;td>&lt;code>Write(?:Line)?&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;WriteLine&amp;quot;&lt;/code> in &lt;code>&amp;quot;Console.WriteLine()&amp;quot;&lt;/code>&lt;br/>&lt;br/> &lt;code>&amp;quot;Write&amp;quot;&lt;/code> in &lt;code>&amp;quot;Console.Write(value)&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?imnsx-imnsx:&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Applies or disables the specified options within &lt;em>subexpression&lt;/em>. For more information, see &lt;a href="regular-expression-options" data-linktype="relative-path">Regular Expression Options&lt;/a>.&lt;/td>
&lt;td>&lt;code>A\d{2}(?i:\w+)\b&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;A12xl&amp;quot;&lt;/code>, &lt;code>&amp;quot;A12XL&amp;quot;&lt;/code> in &lt;code>&amp;quot;A12xl A12XL a12xl&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?=&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Zero-width positive lookahead assertion.&lt;/td>
&lt;td>&lt;code>\b\w+\b(?=.+and.+)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;cats&amp;quot;&lt;/code>, &lt;code>&amp;quot;dogs&amp;quot;&lt;/code>&lt;br/>in&lt;br/>&lt;code>&amp;quot;cats, dogs and some mice.&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?!&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Zero-width negative lookahead assertion.&lt;/td>
&lt;td>&lt;code>\b\w+\b(?!.+and.+)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;and&amp;quot;&lt;/code>, &lt;code>&amp;quot;some&amp;quot;&lt;/code>, &lt;code>&amp;quot;mice&amp;quot;&lt;/code>&lt;br/>in&lt;br/>&lt;code>&amp;quot;cats, dogs and some mice.&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?&amp;lt;=&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Zero-width positive lookbehind assertion.&lt;/td>
&lt;td>&lt;code>\b\w+\b(?&amp;lt;=.+and.+)&lt;/code>&lt;br/>&lt;br/>———————————&lt;br/>&lt;br/>&lt;code>\b\w+\b(?&amp;lt;=.+and.*)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;some&amp;quot;&lt;/code>, &lt;code>&amp;quot;mice&amp;quot;&lt;/code>&lt;br/>in&lt;br/>&lt;code>&amp;quot;cats, dogs and some mice.&amp;quot;&lt;/code>&lt;br/>————————————&lt;br/>&lt;code>&amp;quot;and&amp;quot;&lt;/code>, &lt;code>&amp;quot;some&amp;quot;&lt;/code>, &lt;code>&amp;quot;mice&amp;quot;&lt;/code>&lt;br/>in&lt;br/>&lt;code>&amp;quot;cats, dogs and some mice.&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?&amp;lt;!&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Zero-width negative lookbehind assertion.&lt;/td>
&lt;td>&lt;code>\b\w+\b(?&amp;lt;!.+and.+)&lt;/code>&lt;br/>&lt;br/>———————————&lt;br/>&lt;br/>&lt;code>\b\w+\b(?&amp;lt;!.+and.*)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;cats&amp;quot;&lt;/code>, &lt;code>&amp;quot;dogs&amp;quot;&lt;/code>, &lt;code>&amp;quot;and&amp;quot;&lt;/code>&lt;br/>in&lt;br/>&lt;code>&amp;quot;cats, dogs and some mice.&amp;quot;&lt;/code>&lt;br/>————————————&lt;br/>&lt;code>&amp;quot;cats&amp;quot;&lt;/code>, &lt;code>&amp;quot;dogs&amp;quot;&lt;/code>&lt;br/>in&lt;br/>&lt;code>&amp;quot;cats, dogs and some mice.&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?&amp;gt;&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Atomic group.&lt;/td>
&lt;td>&lt;code>(?&amp;gt;a|ab)c&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ac&amp;quot;&lt;/code> in &lt;code>&amp;quot;ac&amp;quot;&lt;/code>&lt;br/>&lt;br/>&lt;em>nothing&lt;/em> in &lt;code>&amp;quot;abc&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
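&lt;p>The table above uses .NET syntax. For readers following along in another engine, Python's built-in &lt;code>re&lt;/code> module supports the same ideas but spells named groups &lt;code>(?P&amp;lt;name&amp;gt;...)&lt;/code> and named backreferences &lt;code>(?P=name)&lt;/code>, and has no balancing groups. A minimal sketch:&lt;/p>

```python
import re

# Numbered capture plus backreference: (\w)\1 finds a doubled character.
doubled = re.search(r"(\w)\1", "deep").group()

# Named capture: Python's (?P<double>\w)(?P=double) plays the role of
# .NET's (?<double>\w)\k<double>.
m = re.search(r"(?P<double>\w)(?P=double)", "deep")

# Noncapturing group: Write(?:Line)? matches with or without the suffix.
with_suffix = re.search(r"Write(?:Line)?", "Console.WriteLine()").group()
without = re.search(r"Write(?:Line)?", "Console.Write(value)").group()

assert doubled == "ee"
assert m.group("double") == "e"
assert (with_suffix, without) == ("WriteLine", "Write")
```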
&lt;h3 id="lookarounds-at-a-glance">Lookarounds at a glance&lt;/h3>
&lt;p>When the regular expression engine reaches a &lt;strong>lookaround expression&lt;/strong>, it takes the substring running from the current position to the start (lookbehind) or end (lookahead) of the original string, and then runs
&lt;a href="https://siqi-zheng.rbind.io/en-us/dotnet/api/system.text.regularexpressions.regex.ismatch" data-linktype="absolute-path">Regex.IsMatch&lt;/a> on that substring using the lookaround pattern. Whether the assertion as a whole succeeds then depends on whether it is a positive or negative lookaround.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Lookaround&lt;/th>
&lt;th>Name&lt;/th>
&lt;th>Function&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>(?=check)&lt;/code>&lt;/td>
&lt;td>Positive Lookahead&lt;/td>
&lt;td>Asserts that what immediately follows the current position in the string is &amp;quot;check&amp;quot;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?&amp;lt;=check)&lt;/code>&lt;/td>
&lt;td>Positive Lookbehind&lt;/td>
&lt;td>Asserts that what immediately precedes the current position in the string is &amp;quot;check&amp;quot;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?!check)&lt;/code>&lt;/td>
&lt;td>Negative Lookahead&lt;/td>
&lt;td>Asserts that what immediately follows the current position in the string is not &amp;quot;check&amp;quot;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?&amp;lt;!check)&lt;/code>&lt;/td>
&lt;td>Negative Lookbehind&lt;/td>
&lt;td>Asserts that what immediately precedes the current position in the string is not &amp;quot;check&amp;quot;&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
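&lt;p>Lookaheads translate directly to most engines. The two lookahead rows above can be reproduced with Python's &lt;code>re&lt;/code> (note that Python's lookbehind, unlike .NET's, only accepts fixed-width patterns, so the variable-width lookbehind examples in the table will not port as written):&lt;/p>

```python
import re

s = "cats, dogs and some mice."

# Positive lookahead: words that still have "and" somewhere after them.
followed_by_and = re.findall(r"\b\w+\b(?=.+and.+)", s)

# Negative lookahead: words with no "and" after them.
not_followed = re.findall(r"\b\w+\b(?!.+and.+)", s)

assert followed_by_and == ["cats", "dogs"]
assert not_followed == ["and", "some", "mice"]
```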
&lt;p>Once an &lt;strong>atomic group&lt;/strong> has matched, the engine will not backtrack into it to try a different match, even if the remainder of the pattern fails as a result. This can significantly improve performance when quantifiers occur within the atomic group or in the remainder of the pattern.&lt;/p>
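&lt;p>Engines without native atomic groups can emulate them with a lookahead plus a backreference, because a lookaround, once satisfied, is itself never re-entered on backtracking. A sketch in Python's &lt;code>re&lt;/code> (which only gained native &lt;code>(?&amp;gt;...)&lt;/code> in version 3.11):&lt;/p>

```python
import re

# (?=(a|ab))\1c emulates the atomic group (?>a|ab)c: the lookahead
# commits to its first successful alternative ("a") and the engine
# never backtracks into it.
atomic = r"(?=(a|ab))\1c"
ac = re.search(atomic, "ac")     # matches "ac"
abc = re.search(atomic, "abc")   # no match: "a" is committed, "bc" != "c"

# A plain group backtracks, retries "ab", and so matches "abc".
plain = re.search(r"(a|ab)c", "abc")

assert ac.group() == "ac"
assert abc is None
assert plain.group() == "abc"
```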
&lt;h2 id="quantifiers">Quantifiers&lt;/h2>
&lt;p>A quantifier specifies how many instances of the previous element (which can be a character, a group, or a character class) must be present in the input string for a match to occur. Quantifiers include the language elements listed in the following table. For more information, see &lt;a href="quantifiers-in-regular-expressions" data-linktype="relative-path">Quantifiers&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Quantifier&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>*&lt;/code>&lt;/td>
&lt;td>Matches the previous element zero or more times.&lt;/td>
&lt;td>&lt;code>\d*\.\d&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;.0&amp;quot;&lt;/code>, &lt;code>&amp;quot;19.9&amp;quot;&lt;/code>, &lt;code>&amp;quot;219.9&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>+&lt;/code>&lt;/td>
&lt;td>Matches the previous element one or more times.&lt;/td>
&lt;td>&lt;code>&amp;quot;be+&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;bee&amp;quot;&lt;/code> in &lt;code>&amp;quot;been&amp;quot;&lt;/code>, &lt;code>&amp;quot;be&amp;quot;&lt;/code> in &lt;code>&amp;quot;bent&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>?&lt;/code>&lt;/td>
&lt;td>Matches the previous element zero or one time.&lt;/td>
&lt;td>&lt;code>&amp;quot;rai?n&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ran&amp;quot;&lt;/code>, &lt;code>&amp;quot;rain&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>{&lt;/code> &lt;em>n&lt;/em> &lt;code>}&lt;/code>&lt;/td>
&lt;td>Matches the previous element exactly &lt;em>n&lt;/em> times.&lt;/td>
&lt;td>&lt;code>&amp;quot;,\d{3}&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;,043&amp;quot;&lt;/code> in &lt;code>&amp;quot;1,043.6&amp;quot;&lt;/code>, &lt;code>&amp;quot;,876&amp;quot;&lt;/code>, &lt;code>&amp;quot;,543&amp;quot;&lt;/code>, and &lt;code>&amp;quot;,210&amp;quot;&lt;/code> in &lt;code>&amp;quot;9,876,543,210&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>{&lt;/code> &lt;em>n&lt;/em> &lt;code>,}&lt;/code>&lt;/td>
&lt;td>Matches the previous element at least &lt;em>n&lt;/em> times.&lt;/td>
&lt;td>&lt;code>&amp;quot;\d{2,}&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;166&amp;quot;&lt;/code>, &lt;code>&amp;quot;29&amp;quot;&lt;/code>, &lt;code>&amp;quot;1930&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>{&lt;/code> &lt;em>n&lt;/em> &lt;code>,&lt;/code> &lt;em>m&lt;/em> &lt;code>}&lt;/code>&lt;/td>
&lt;td>Matches the previous element at least &lt;em>n&lt;/em> times, but no more than &lt;em>m&lt;/em> times.&lt;/td>
&lt;td>&lt;code>&amp;quot;\d{3,5}&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;166&amp;quot;&lt;/code>, &lt;code>&amp;quot;17668&amp;quot;&lt;/code>&lt;br/>&lt;br/> &lt;code>&amp;quot;19302&amp;quot;&lt;/code> in &lt;code>&amp;quot;193024&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>*?&lt;/code>&lt;/td>
&lt;td>Matches the previous element zero or more times, but as few times as possible.&lt;/td>
&lt;td>&lt;code>\d*?\.\d&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;.0&amp;quot;&lt;/code>, &lt;code>&amp;quot;19.9&amp;quot;&lt;/code>, &lt;code>&amp;quot;219.9&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>+?&lt;/code>&lt;/td>
&lt;td>Matches the previous element one or more times, but as few times as possible.&lt;/td>
&lt;td>&lt;code>&amp;quot;be+?&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;be&amp;quot;&lt;/code> in &lt;code>&amp;quot;been&amp;quot;&lt;/code>, &lt;code>&amp;quot;be&amp;quot;&lt;/code> in &lt;code>&amp;quot;bent&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>??&lt;/code>&lt;/td>
&lt;td>Matches the previous element zero or one time, but as few times as possible.&lt;/td>
&lt;td>&lt;code>&amp;quot;rai??n&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ran&amp;quot;&lt;/code>, &lt;code>&amp;quot;rain&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>{&lt;/code> &lt;em>n&lt;/em> &lt;code>}?&lt;/code>&lt;/td>
&lt;td>Matches the previous element exactly &lt;em>n&lt;/em> times.&lt;/td>
&lt;td>&lt;code>&amp;quot;,\d{3}?&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;,043&amp;quot;&lt;/code> in &lt;code>&amp;quot;1,043.6&amp;quot;&lt;/code>, &lt;code>&amp;quot;,876&amp;quot;&lt;/code>, &lt;code>&amp;quot;,543&amp;quot;&lt;/code>, and &lt;code>&amp;quot;,210&amp;quot;&lt;/code> in &lt;code>&amp;quot;9,876,543,210&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>{&lt;/code> &lt;em>n&lt;/em> &lt;code>,}?&lt;/code>&lt;/td>
&lt;td>Matches the previous element at least &lt;em>n&lt;/em> times, but as few times as possible.&lt;/td>
&lt;td>&lt;code>&amp;quot;\d{2,}?&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;166&amp;quot;&lt;/code>, &lt;code>&amp;quot;29&amp;quot;&lt;/code>, &lt;code>&amp;quot;1930&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>{&lt;/code> &lt;em>n&lt;/em> &lt;code>,&lt;/code> &lt;em>m&lt;/em> &lt;code>}?&lt;/code>&lt;/td>
&lt;td>Matches the previous element between &lt;em>n&lt;/em> and &lt;em>m&lt;/em> times, but as few times as possible.&lt;/td>
&lt;td>&lt;code>&amp;quot;\d{3,5}?&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;166&amp;quot;&lt;/code>, &lt;code>&amp;quot;17668&amp;quot;&lt;/code>&lt;br/>&lt;br/> &lt;code>&amp;quot;193&amp;quot;&lt;/code>, &lt;code>&amp;quot;024&amp;quot;&lt;/code> in &lt;code>&amp;quot;193024&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
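&lt;p>The greedy/lazy distinction is easiest to see on a string with repeated delimiters. A small illustration of my own (not from the table above) using Python's &lt;code>re&lt;/code>:&lt;/p>

```python
import re

s = '"x" "y"'

# Greedy .+ grabs as much as possible, spanning both quoted fields.
greedy = re.search(r'".+"', s).group()

# Lazy .+? stops at the first closing quote.
lazy = re.search(r'".+?"', s).group()

# The lazy form is what you want when extracting repeated fields.
fields = re.findall(r'".+?"', s)

assert greedy == '"x" "y"'
assert lazy == '"x"'
assert fields == ['"x"', '"y"']
```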
&lt;h2 id="backreference-constructs">Backreference Constructs&lt;/h2>
&lt;p>A backreference allows a previously matched subexpression to be identified subsequently in the same regular expression. The following table lists the backreference constructs supported by regular expressions in .NET. For more information, see &lt;a href="backreference-constructs-in-regular-expressions" data-linktype="relative-path">Backreference Constructs&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Backreference construct&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>\&lt;/code> &lt;em>number&lt;/em>&lt;/td>
&lt;td>Backreference. Matches the value of a numbered subexpression.&lt;/td>
&lt;td>&lt;code>(\w)\1&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ee&amp;quot;&lt;/code> in &lt;code>&amp;quot;seek&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\k&amp;lt;&lt;/code> &lt;em>name&lt;/em> &lt;code>&amp;gt;&lt;/code>&lt;/td>
&lt;td>Named backreference. Matches the value of a named expression.&lt;/td>
&lt;td>&lt;code>(?&amp;lt;char&amp;gt;\w)\k&amp;lt;char&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ee&amp;quot;&lt;/code> in &lt;code>&amp;quot;seek&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
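&lt;p>Backreferences work the same way in most engines; only the named syntax varies (Python writes .NET's &lt;code>\k&amp;lt;name&amp;gt;&lt;/code> as &lt;code>(?P=name)&lt;/code>). A quick sketch, including the doubled-word idiom built from &lt;code>\b&lt;/code> and &lt;code>\1&lt;/code>:&lt;/p>

```python
import re

# \1 matches whatever group 1 just captured.
pair = re.search(r"(\w)\1", "seek").group()

# Named form: Python's (?P=char) instead of .NET's \k<char>.
named = re.search(r"(?P<char>\w)(?P=char)", "seek").group()

# Classic use: find an immediately repeated word.
dup = re.search(r"\b(\w+)\s+\1\b", "them theme them them").group()

assert pair == "ee"
assert named == "ee"
assert dup == "them them"
```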
&lt;h2 id="alternation-constructs">Alternation Constructs&lt;/h2>
&lt;p>Alternation constructs modify a regular expression to enable either/or matching. These constructs include the language elements listed in the following table. For more information, see &lt;a href="alternation-constructs-in-regular-expressions" data-linktype="relative-path">Alternation Constructs&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Alternation construct&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>|&lt;/code>&lt;/td>
&lt;td>Matches any one element separated by the vertical bar (&lt;code>|&lt;/code>) character.&lt;/td>
&lt;td>&lt;code>th(e|is|at)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;the&amp;quot;&lt;/code>, &lt;code>&amp;quot;this&amp;quot;&lt;/code> in &lt;code>&amp;quot;this is the day.&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?(&lt;/code> &lt;em>expression&lt;/em> &lt;code>)&lt;/code> &lt;em>yes&lt;/em> &lt;code>|&lt;/code> &lt;em>no&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Matches &lt;em>yes&lt;/em> if the regular expression pattern designated by &lt;em>expression&lt;/em> matches; otherwise, matches the optional &lt;em>no&lt;/em> part. &lt;em>expression&lt;/em> is interpreted as a zero-width assertion.&lt;/td>
&lt;td>&lt;code>(?(A)A\d{2}\b|\b\d{3}\b)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;A10&amp;quot;&lt;/code>, &lt;code>&amp;quot;910&amp;quot;&lt;/code> in &lt;code>&amp;quot;A10 C103 910&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?(&lt;/code> &lt;em>name&lt;/em> &lt;code>)&lt;/code> &lt;em>yes&lt;/em> &lt;code>|&lt;/code> &lt;em>no&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Matches &lt;em>yes&lt;/em> if &lt;em>name&lt;/em>, a named or numbered capturing group, has a match; otherwise, matches the optional &lt;em>no&lt;/em>.&lt;/td>
&lt;td>&lt;code>(?&amp;lt;quoted&amp;gt;&amp;quot;)?(?(quoted).+?&amp;quot;|\S+\s)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;Dogs.jpg &amp;quot;&lt;/code>, &lt;code>&amp;quot;\&amp;quot;Yiska playing.jpg\&amp;quot;&amp;quot;&lt;/code> in &lt;code>&amp;quot;Dogs.jpg \&amp;quot;Yiska playing.jpg\&amp;quot;&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
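&lt;p>Conditional matching with &lt;code>(?(&lt;/code>&lt;em>name&lt;/em>&lt;code>)&lt;/code>&lt;em>yes&lt;/em>&lt;code>|&lt;/code>&lt;em>no&lt;/em>&lt;code>)&lt;/code> also exists in Python's &lt;code>re&lt;/code>. A small sketch (the balanced-parentheses pattern is my own illustration, not from the table):&lt;/p>

```python
import re

# Plain alternation with a noncapturing group.
hits = re.findall(r"th(?:e|is|at)", "this is the day.")

# Conditional: the closing ")" is required only if group 1 matched "(".
pat = r"(\()?\d+(?(1)\))"
balanced = re.fullmatch(pat, "(123)")
bare = re.fullmatch(pat, "123")
broken = re.fullmatch(pat, "(123")

assert hits == ["this", "the"]
assert balanced is not None and bare is not None
assert broken is None
```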
&lt;h2 id="substitutions">Substitutions&lt;/h2>
&lt;p>Substitutions are regular expression language elements that are supported in replacement patterns. For more information, see &lt;a href="substitutions-in-regular-expressions" data-linktype="relative-path">Substitutions&lt;/a>. The metacharacters listed in the following table are atomic zero-width assertions.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Character&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Replacement pattern&lt;/th>
&lt;th>Input string&lt;/th>
&lt;th>Result string&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>$&lt;/code> &lt;em>number&lt;/em>&lt;/td>
&lt;td>Substitutes the substring matched by group &lt;em>number&lt;/em>.&lt;/td>
&lt;td>&lt;code>\b(\w+)(\s)(\w+)\b&lt;/code>&lt;/td>
&lt;td>&lt;code>$3$2$1&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;one two&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;two one&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>${&lt;/code> &lt;em>name&lt;/em> &lt;code>}&lt;/code>&lt;/td>
&lt;td>Substitutes the substring matched by the named group &lt;em>name&lt;/em>.&lt;/td>
&lt;td>&lt;code>\b(?&amp;lt;word1&amp;gt;\w+)(\s)(?&amp;lt;word2&amp;gt;\w+)\b&lt;/code>&lt;/td>
&lt;td>&lt;code>${word2} ${word1}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;one two&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;two one&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>$$&lt;/code>&lt;/td>
&lt;td>Substitutes a literal &amp;quot;$&amp;quot;.&lt;/td>
&lt;td>&lt;code>\b(\d+)\s?USD&lt;/code>&lt;/td>
&lt;td>&lt;code>$$$1&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;103 USD&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;$103&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>$&amp;amp;&lt;/code>&lt;/td>
&lt;td>Substitutes a copy of the whole match.&lt;/td>
&lt;td>&lt;code>\$?\d*\.?\d+&lt;/code>&lt;/td>
&lt;td>&lt;code>**$&amp;amp;**&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;$1.30&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;**$1.30**&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>$`&lt;/code>&lt;/td>
&lt;td>Substitutes all the text of the input string before the match.&lt;/td>
&lt;td>&lt;code>B+&lt;/code>&lt;/td>
&lt;td>&lt;code>$`&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AABBCC&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AAAACC&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>$'&lt;/code>&lt;/td>
&lt;td>Substitutes all the text of the input string after the match.&lt;/td>
&lt;td>&lt;code>B+&lt;/code>&lt;/td>
&lt;td>&lt;code>$'&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AABBCC&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AACCCC&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>$+&lt;/code>&lt;/td>
&lt;td>Substitutes the last group that was captured.&lt;/td>
&lt;td>&lt;code>B+(C+)&lt;/code>&lt;/td>
&lt;td>&lt;code>$+&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AABBCCDD&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AACCDD&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>$_&lt;/code>&lt;/td>
&lt;td>Substitutes the entire input string.&lt;/td>
&lt;td>&lt;code>B+&lt;/code>&lt;/td>
&lt;td>&lt;code>$_&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AABBCC&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AAAABBCCCC&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
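&lt;p>Replacement syntax is where engines diverge most: Python's &lt;code>re.sub&lt;/code> uses &lt;code>\1&lt;/code> and &lt;code>\g&amp;lt;name&amp;gt;&lt;/code> where .NET uses &lt;code>$1&lt;/code> and &lt;code>${name}&lt;/code>, and &lt;code>\g&amp;lt;0&amp;gt;&lt;/code> in place of &lt;code>$&amp;amp;&lt;/code>. Reproducing the first rows of the table:&lt;/p>

```python
import re

# $3$2$1 in .NET becomes \3\2\1 in Python.
swapped = re.sub(r"\b(\w+)(\s)(\w+)\b", r"\3\2\1", "one two")

# ${word2} ${word1} becomes \g<word2> \g<word1>.
named = re.sub(r"\b(?P<word1>\w+)(\s)(?P<word2>\w+)\b",
               r"\g<word2> \g<word1>", "one two")

# $& (the whole match) becomes \g<0>.
wrapped = re.sub(r"\$?\d*\.?\d+", r"**\g<0>**", "$1.30")

assert swapped == "two one"
assert named == "two one"
assert wrapped == "**$1.30**"
```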
&lt;h2 id="regular-expression-options">Regular Expression Options&lt;/h2>
&lt;p>You can specify options that control how the regular expression engine interprets a regular expression pattern. Many of these options can be specified either inline (in the regular expression pattern) or as one or more &lt;a href="https://siqi-zheng.rbind.io/en-us/dotnet/api/system.text.regularexpressions.regexoptions" data-linktype="absolute-path">RegexOptions&lt;/a> constants. This quick reference lists only inline options. For more information about inline and &lt;a href="https://siqi-zheng.rbind.io/en-us/dotnet/api/system.text.regularexpressions.regexoptions" data-linktype="absolute-path">RegexOptions&lt;/a> options, see the article &lt;a href="regular-expression-options" data-linktype="relative-path">Regular Expression Options&lt;/a>.&lt;/p>
&lt;p>You can specify an inline option in two ways:&lt;/p>
&lt;ul>
&lt;li>By using the &lt;a href="miscellaneous-constructs-in-regular-expressions" data-linktype="relative-path">miscellaneous construct&lt;/a> &lt;code>(?imnsx-imnsx)&lt;/code>, where a minus sign (-) before an option or set of options turns those options off. For example, &lt;code>(?i-mn)&lt;/code> turns case-insensitive matching (&lt;code>i&lt;/code>) on, turns multiline mode (&lt;code>m&lt;/code>) off, and turns unnamed group captures (&lt;code>n&lt;/code>) off. The option applies to the regular expression pattern from the point at which the option is defined, and is effective either to the end of the pattern or to the point where another construct reverses the option.&lt;/li>
&lt;li>By using the &lt;a href="grouping-constructs-in-regular-expressions" data-linktype="relative-path">grouping construct&lt;/a>&lt;code>(?imnsx-imnsx:&lt;/code>&lt;em>subexpression&lt;/em>&lt;code>)&lt;/code>, which defines options for the specified group only.&lt;/li>
&lt;/ul>
&lt;p>The .NET regular expression engine supports the following inline options:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Option&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>i&lt;/code>&lt;/td>
&lt;td>Use case-insensitive matching.&lt;/td>
&lt;td>&lt;code>\b(?i)a(?-i)a\w+\b&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;aardvark&amp;quot;&lt;/code>, &lt;code>&amp;quot;aaaAuto&amp;quot;&lt;/code> in &lt;code>&amp;quot;aardvark AAAuto aaaAuto Adam breakfast&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>m&lt;/code>&lt;/td>
&lt;td>Use multiline mode. &lt;code>^&lt;/code> and &lt;code>$&lt;/code> match the beginning and end of a line, instead of the beginning and end of a string.&lt;/td>
&lt;td>For an example, see the &amp;quot;Multiline Mode&amp;quot; section in &lt;a href="regular-expression-options" data-linktype="relative-path">Regular Expression Options&lt;/a>.&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>n&lt;/code>&lt;/td>
&lt;td>Do not capture unnamed groups.&lt;/td>
&lt;td>For an example, see the &amp;quot;Explicit Captures Only&amp;quot; section in &lt;a href="regular-expression-options" data-linktype="relative-path">Regular Expression Options&lt;/a>.&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>s&lt;/code>&lt;/td>
&lt;td>Use single-line mode.&lt;/td>
&lt;td>For an example, see the &amp;quot;Single-line Mode&amp;quot; section in &lt;a href="regular-expression-options" data-linktype="relative-path">Regular Expression Options&lt;/a>.&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>x&lt;/code>&lt;/td>
&lt;td>Ignore unescaped white space in the regular expression pattern.&lt;/td>
&lt;td>&lt;code>\b(?x) \d+ \s \w+&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;1 aardvark&amp;quot;&lt;/code>, &lt;code>&amp;quot;2 cats&amp;quot;&lt;/code> in &lt;code>&amp;quot;1 aardvark 2 cats IV centurions&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
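&lt;p>Inline options carry over to Python's &lt;code>re&lt;/code> almost unchanged, with one caveat: Python requires global flags such as &lt;code>(?i)&lt;/code> to appear at the very start of the pattern (an error from 3.11 on), so mid-pattern toggling must use the scoped &lt;code>(?i:...)&lt;/code> form. A sketch:&lt;/p>

```python
import re

# Global flag at the start of the pattern.
ci = re.findall(r"(?i)\bcat\b", "Cat cat CAT")

# Scoped flag: only the suffix is case-insensitive.
scoped = re.search(r"A\d{2}(?i:xl)\b", "A12XL")
strict = re.search(r"A\d{2}xl\b", "A12XL")

# Verbose mode ignores unescaped whitespace in the pattern.
verbose = re.fullmatch(r"(?x) \d+ \s \w+", "1 aardvark")

assert ci == ["Cat", "cat", "CAT"]
assert scoped.group() == "A12XL" and strict is None
assert verbose is not None
```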
&lt;h2 id="miscellaneous-constructs">Miscellaneous Constructs&lt;/h2>
&lt;p>Miscellaneous constructs either modify a regular expression pattern or provide information about it. The following table lists the miscellaneous constructs supported by .NET. For more information, see &lt;a href="miscellaneous-constructs-in-regular-expressions" data-linktype="relative-path">Miscellaneous Constructs&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Construct&lt;/th>
&lt;th>Definition&lt;/th>
&lt;th>Example&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>(?imnsx-imnsx)&lt;/code>&lt;/td>
&lt;td>Sets or disables options such as case insensitivity in the middle of a pattern. For more information, see &lt;a href="regular-expression-options" data-linktype="relative-path">Regular Expression Options&lt;/a>.&lt;/td>
&lt;td>&lt;code>\bA(?i)b\w+\b&lt;/code> matches &lt;code>&amp;quot;ABA&amp;quot;&lt;/code>, &lt;code>&amp;quot;Able&amp;quot;&lt;/code> in &lt;code>&amp;quot;ABA Able Act&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?#&lt;/code> &lt;em>comment&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Inline comment. The comment ends at the first closing parenthesis.&lt;/td>
&lt;td>&lt;code>\bA(?#Matches words starting with A)\w+\b&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>#&lt;/code> [to end of line]&lt;/td>
&lt;td>X-mode comment. The comment starts at an unescaped &lt;code>#&lt;/code> and continues to the end of the line.&lt;/td>
&lt;td>&lt;code>(?x)\bA\w+\b#Matches words starting with A&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
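&lt;p>Both comment styles also work in Python's &lt;code>re&lt;/code>, the second via &lt;code>re.VERBOSE&lt;/code>:&lt;/p>

```python
import re

# (?#...) inline comment is ignored by the engine.
first = re.search(r"\bA(?#words starting with A)\w+\b", "ABA Able Act")

# In verbose mode, # starts a comment running to the end of the line.
pat = re.compile(r"""
    \bA\w+\b   # words starting with A
""", re.VERBOSE)
all_a = pat.findall("ABA Able Act")

assert first.group() == "ABA"
assert all_a == ["ABA", "Able", "Act"]
```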
&lt;h2 id="see-also">See also&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://download.microsoft.com/download/D/2/4/D240EBF6-A9BA-4E4F-A63F-AEB6DA0B921C/Regular%20expressions%20quick%20reference.docx" data-linktype="external">Regular Expressions - Quick Reference (download in Word format)&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://download.microsoft.com/download/D/2/4/D240EBF6-A9BA-4E4F-A63F-AEB6DA0B921C/Regular%20expressions%20quick%20reference.pdf" data-linktype="external">Regular Expressions - Quick Reference (download in PDF format)&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Learning SQL Notes #4: Query Primer (CH. 7)</title><link>https://siqi-zheng.rbind.io/post/2021-05-27-sql-notes-4/</link><pubDate>Thu, 27 May 2021 20:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-05-27-sql-notes-4/</guid><description>&lt;h1 id="working-with-sets">Working with Sets&lt;/h1>
&lt;ul>
&lt;li>
&lt;a href="#working-with-sets">Working with Sets&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#set-theory-in-practice">Set Theory in Practice&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#set-operators">Set Operators&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#the-union-operator">The UNION Operator&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#the-intersect-operator-not-for-mysql">The INTERSECT Operator (Not for MySQL!)&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#the-except-operator-not-for-mysql">The EXCEPT Operator (Not for MySQL!)&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#set-operation-rules">Set Operation Rules&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#sorting-compound-query-results">Sorting Compound Query Results&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#sort">Sort&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#order">Order&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="set-theory-in-practice">Set Theory in Practice&lt;/h2>
&lt;ul>
&lt;li>Both data sets must have the &lt;strong>same number of columns&lt;/strong>.&lt;/li>
&lt;li>The &lt;strong>data&lt;/strong> &lt;strong>types&lt;/strong> of each column across the two data sets must be the &lt;strong>same&lt;/strong> (or the server must be able to convert one to the other).&lt;/li>
&lt;/ul>
&lt;h2 id="set-operators">Set Operators&lt;/h2>
&lt;h3 id="the-union-operator">The UNION Operator&lt;/h3>
&lt;p>The &lt;code>union&lt;/code> and &lt;code>union all&lt;/code> operators allow you to combine multiple data sets. The difference between the two is that &lt;code>union&lt;/code> sorts the combined set and &lt;em>removes duplicates&lt;/em>, whereas &lt;code>union all&lt;/code> does not.&lt;/p>
&lt;p>&lt;img src="union_all.png" alt="">
&lt;a href="https://www.sqlshack.com/sql-union-vs-union-all-in-sql-server/">https://www.sqlshack.com/sql-union-vs-union-all-in-sql-server/&lt;/a>&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
UNION ALL
SELECT a.first_name, a.last_name
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%';
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">first_name&lt;/th>
&lt;th align="right">last_name&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">JENNIFER&lt;/td>
&lt;td align="right">DAVIS&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JENNIFER&lt;/td>
&lt;td align="right">DAVIS&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JUDY&lt;/td>
&lt;td align="right">DEAN&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JODIE&lt;/td>
&lt;td align="right">DEGENERES&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JULIANNE&lt;/td>
&lt;td align="right">DENCH&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Equivalent &lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">library(dplyr)
union_all(df1,df2)
&lt;/code>&lt;/pre>
&lt;p>By contrast, &lt;code>UNION&lt;/code> removes the duplicate Jennifer Davis row:&lt;/p>
&lt;p>&lt;img src="uinon.png" alt="">
&lt;a href="https://www.sqlshack.com/sql-union-vs-union-all-in-sql-server/">https://www.sqlshack.com/sql-union-vs-union-all-in-sql-server/&lt;/a>&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
UNION
SELECT a.first_name, a.last_name
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%';
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">first_name&lt;/th>
&lt;th align="right">last_name&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">JENNIFER&lt;/td>
&lt;td align="right">DAVIS&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JUDY&lt;/td>
&lt;td align="right">DEAN&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JODIE&lt;/td>
&lt;td align="right">DEGENERES&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JULIANNE&lt;/td>
&lt;td align="right">DENCH&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Equivalent &lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">library(dplyr)
union(df1,df2)
&lt;/code>&lt;/pre>
&lt;h3 id="the-intersect-operator-not-for-mysql">The INTERSECT Operator (Not for MySQL!)&lt;/h3>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
INTERSECT
SELECT a.first_name, a.last_name
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%';
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">first_name&lt;/th>
&lt;th align="right">last_name&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">JENNIFER&lt;/td>
&lt;td align="right">DAVIS&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Equivalent &lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">library(dplyr)
intersect(df1,df2)
&lt;/code>&lt;/pre>
&lt;h3 id="the-except-operator-not-for-mysql">The EXCEPT Operator (Not for MySQL!)&lt;/h3>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
EXCEPT
SELECT a.first_name, a.last_name
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%';
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">first_name&lt;/th>
&lt;th align="right">last_name&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">JUDY&lt;/td>
&lt;td align="right">DEAN&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JODIE&lt;/td>
&lt;td align="right">DEGENERES&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JULIANNE&lt;/td>
&lt;td align="right">DENCH&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Equivalent &lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">library(dplyr)
setdiff(df1,df2)
&lt;/code>&lt;/pre>
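&lt;p>Since MySQL historically lacks &lt;code>INTERSECT&lt;/code> and &lt;code>EXCEPT&lt;/code> (they were only added in MySQL 8.0.31), a convenient way to experiment with the queries above is SQLite through Python's built-in &lt;code>sqlite3&lt;/code> module, which supports both. A toy version of the customer/actor example (made-up rows, not the real Sakila data):&lt;/p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE customer (first_name TEXT, last_name TEXT);
    CREATE TABLE actor    (first_name TEXT, last_name TEXT);
    INSERT INTO customer VALUES ('JENNIFER','DAVIS'), ('JUDY','DEAN');
    INSERT INTO actor    VALUES ('JENNIFER','DAVIS'), ('JODIE','DEGENERES');
""")

# Rows present in both tables.
both = cur.execute(
    "SELECT * FROM customer INTERSECT SELECT * FROM actor").fetchall()

# Rows in customer but not in actor.
only_cust = cur.execute(
    "SELECT * FROM customer EXCEPT SELECT * FROM actor").fetchall()

assert both == [('JENNIFER', 'DAVIS')]
assert only_cust == [('JUDY', 'DEAN')]
```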
&lt;p>&lt;em>Set A&lt;/em>&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">actor_id&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">10&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">11&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">12&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">10&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">10&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;em>Set B&lt;/em>&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">actor_id&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">10&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">10&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The operation &lt;code>A except B&lt;/code> yields the following:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">actor_id&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">11&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">12&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The operation &lt;code>A except all B&lt;/code> yields the following:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">actor_id&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">10&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">11&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">12&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The difference between the two operations is that &lt;code>except&lt;/code> removes &lt;em>all&lt;/em> occurrences of duplicate data from set A, whereas &lt;code>except all&lt;/code> removes only one occurrence of duplicate data from set A &lt;em>for every occurrence&lt;/em> in set B.&lt;/p>
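&lt;p>In databases that support both operators (e.g., PostgreSQL; MySQL supports neither), the two variants could be written as follows. The table names &lt;code>set_a&lt;/code> and &lt;code>set_b&lt;/code> are hypothetical stand-ins for sets A and B above:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT actor_id FROM set_a
EXCEPT              /* removes every 10: returns 11, 12 */
SELECT actor_id FROM set_b;

SELECT actor_id FROM set_a
EXCEPT ALL          /* removes one 10 per 10 in B: returns 10, 11, 12 */
SELECT actor_id FROM set_b;
&lt;/code>&lt;/pre>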
&lt;h2 id="set-operation-rules">Set Operation Rules&lt;/h2>
&lt;p>The following sections outline some rules that you must follow when working with compound queries.&lt;/p>
&lt;h3 id="sorting-compound-query-results">Sorting Compound Query Results&lt;/h3>
&lt;h4 id="sort">Sort&lt;/h4>
&lt;pre>&lt;code class="language-sql">SELECT a.first_name fname, a.last_name lname /*aliases can be helpful*/
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%' UNION ALL
SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%' ORDER BY lname, fname;
&lt;/code>&lt;/pre>
&lt;h4 id="order">Order&lt;/h4>
&lt;p>In general, compound queries containing three or more queries are evaluated in order from top to bottom, with two exceptions:&lt;/p>
&lt;ul>
&lt;li>The ANSI SQL specification calls for the intersect operator to have precedence over the other set operators.&lt;/li>
&lt;li>You may dictate the order in which queries are combined by enclosing multiple queries in parentheses.&lt;/li>
&lt;/ul>
&lt;p>NOT FOR MySQL:&lt;/p>
&lt;p>You can also wrap adjoining queries in parentheses to override the default top-to-bottom processing of compound queries.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT a.first_name, a.last_name FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%' UNION (SELECT a.first_name, a.last_name FROM actor a
WHERE a.first_name LIKE 'M%' AND a.last_name LIKE 'T%' UNION ALL
SELECT c.first_name, c.last_name FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
)
&lt;/code>&lt;/pre></description></item><item><title>Learning SQL Notes #3: Query Primer (CH. 3)</title><link>https://siqi-zheng.rbind.io/post/2021-05-26-sql-notes-3/</link><pubDate>Wed, 26 May 2021 20:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-05-26-sql-notes-3/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#query-mechanics">Query Mechanics&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#query-clauses">Query Clauses&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#select">SELECT&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#from">FROM&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#table-links">Table Links&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#table-aliases">Table Aliases&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#group-by-and-having-ch-8">GROUP BY and HAVING (CH. 8)&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#order-by">ORDER BY&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#filtering">Filtering&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#where">WHERE&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#or-operator">OR operator&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#and-operator">AND operator&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#not-operator">NOT operator&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#expressions">Expressions&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#null">NULL&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Complete sometime this summer:&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Finish Join Notes;&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Finish GROUP BY Notes;&lt;/li>
&lt;/ul>
&lt;h2 id="query-mechanics">Query Mechanics&lt;/h2>
&lt;ul>
&lt;li>Do you have permission to execute the statement?&lt;/li>
&lt;li>Do you have permission to access the desired data?&lt;/li>
&lt;li>Is your statement syntax correct?&lt;/li>
&lt;/ul>
&lt;h2 id="query-clauses">Query Clauses&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Clause name&lt;/th>
&lt;th align="right">Purpose&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">select&lt;/td>
&lt;td align="right">Determines which columns to include in the query’s result set&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">from&lt;/td>
&lt;td align="right">Identifies the tables from which to retrieve data and how the tables should be joined&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">where&lt;/td>
&lt;td align="right">Filters out unwanted data&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">group by&lt;/td>
&lt;td align="right">Used to group rows together by common column values&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">having&lt;/td>
&lt;td align="right">Filters out unwanted groups&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">order by&lt;/td>
&lt;td align="right">the rows of the final result set by one or more columns&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="select">SELECT&lt;/h3>
&lt;ul>
&lt;li>Literals, such as numbers or strings&lt;/li>
&lt;li>Expressions, such as transaction.amount * −1&lt;/li>
&lt;li>Built-in function calls, such as ROUND(transaction.amount, 2)&lt;/li>
&lt;li>User-defined function calls&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-SQL">SELECT version(), user(), database();
&lt;/code>&lt;/pre>
&lt;p>Results:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">version()&lt;/th>
&lt;th align="center">user()&lt;/th>
&lt;th align="right">database()&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">8.0.15&lt;/td>
&lt;td align="center">root@localhost&lt;/td>
&lt;td align="right">sakila&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-SQL">SELECT row1 AS r1;/*Column Aliases*/
SELECT DISTINCT row1 /*Removing Duplicates-should know beforehand whether duplicates are possible*/
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">unique()
&lt;/code>&lt;/pre>
&lt;h3 id="from">FROM&lt;/h3>
&lt;ul>
&lt;li>Permanent tables (i.e., created using the create table statement)&lt;/li>
&lt;li>Derived tables (i.e., rows returned by a subquery and held in memory)
&lt;pre>&lt;code class="language-sql">SELECT *
FROM
(SELECT first_name, last_name, email
FROM customer
WHERE first_name = 'JESSIE'
) AS cust;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>Temporary tables (i.e., volatile data held in memory): any data inserted into a temporary table will disappear when your database session ends
&lt;pre>&lt;code class="language-sql">CREATE TEMPORARY TABLE actors_j
(actor_id smallint(5),
first_name varchar(45),
last_name varchar(45)
);
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>Virtual tables (i.e., created using the create view statement): When you issue a query against a view, your query is &lt;strong>merged&lt;/strong> with the view definition to create a final query to be executed.
&lt;pre>&lt;code class="language-SQL">CREATE VIEW cust_vw AS
SELECT customer_id, first_name, last_name, active
FROM customer;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h4 id="table-links">Table Links&lt;/h4>
&lt;p>See JOIN in the next note.&lt;/p>
&lt;h4 id="table-aliases">Table Aliases&lt;/h4>
&lt;pre>&lt;code class="language-SQL">FROM customer AS c;
&lt;/code>&lt;/pre>
&lt;h3 id="group-by-and-having-ch-8">GROUP BY and HAVING (CH. 8)&lt;/h3>
&lt;p>&lt;input disabled="" type="checkbox"> Haven&amp;rsquo;t done this yet; see CH. 8.&lt;/p>
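&lt;p>Until then, a minimal sketch of the pattern (assuming sakila&amp;rsquo;s &lt;code>rental&lt;/code> table):&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT customer_id, count(*)
FROM rental
GROUP BY customer_id           /*one group per customer*/
HAVING count(*) &amp;gt;= 40;      /*keep only frequent renters*/
&lt;/code>&lt;/pre>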
&lt;h3 id="order-by">ORDER BY&lt;/h3>
&lt;ol>
&lt;li>
&lt;pre>&lt;code class="language-sql">ORDER BY col1, col2, etc;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">df[order(col1),]
require(tidyverse)
df %&amp;gt;%
arrange(col1)
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;pre>&lt;code class="language-sql">ORDER BY col1;
ORDER BY col1 desc;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">df[order(-col1),]
require(tidyverse)
df %&amp;gt;%
arrange(desc(col1))
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;pre>&lt;code class="language-sql">SELECT col1, col2, col3;
FROM table1
ORDER BY 3; /*equivalent to ORDER BY col3*/
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
&lt;h2 id="filtering">Filtering&lt;/h2>
&lt;h3 id="where">WHERE&lt;/h3>
&lt;pre>&lt;code class="language-SQL">(...) AND (...)
(...) OR (...)
&lt;/code>&lt;/pre>
&lt;p>See &lt;strong>operators&lt;/strong> and &lt;strong>expressions&lt;/strong> for details.&lt;/p>
&lt;h4 id="or-operator">OR operator&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Intermediate result&lt;/th>
&lt;th align="right">Final result&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">WHERE true OR true&lt;/td>
&lt;td align="right">true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE true OR false&lt;/td>
&lt;td align="right">true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE false OR true&lt;/td>
&lt;td align="right">true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE false OR false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="and-operator">AND operator&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Intermediate result&lt;/th>
&lt;th align="right">Final result&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">WHERE (true OR true) AND true&lt;/td>
&lt;td align="right">true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE (true OR false) AND true&lt;/td>
&lt;td align="right">true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE (false OR true) AND true&lt;/td>
&lt;td align="right">true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE (false OR false) AND true&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE (true OR true) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE (true OR false) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE (false OR true) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE (false OR false) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="not-operator">NOT operator&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Intermediate result&lt;/th>
&lt;th align="right">Final result&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">WHERE NOT (true OR true) AND true&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE NOT (true OR false) AND true&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE NOT (false OR true) AND true&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE NOT (false OR false) AND true&lt;/td>
&lt;td align="right">true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE NOT (true OR true) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE NOT (true OR false) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE NOT (false OR true) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE NOT (false OR false) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="expressions">Expressions&lt;/h4>
&lt;p>An expression can be any of the following:&lt;/p>
&lt;ul>
&lt;li>A number&lt;/li>
&lt;li>A column in a table or view&lt;/li>
&lt;li>A string literal, such as &amp;lsquo;Maple Street&amp;rsquo;&lt;/li>
&lt;li>A built-in function, such as concat(&amp;lsquo;Learning&amp;rsquo;, &amp;lsquo; &amp;rsquo;, &amp;lsquo;SQL&amp;rsquo;)&lt;/li>
&lt;li>A subquery&lt;/li>
&lt;li>A list of expressions, such as (&amp;lsquo;Boston&amp;rsquo;, &amp;lsquo;New York&amp;rsquo;, &amp;lsquo;Chicago&amp;rsquo;)&lt;/li>
&lt;/ul>
&lt;p>Operators:&lt;/p>
&lt;ul>
&lt;li>Comparison operators, such as =, !=, &amp;lt;, &amp;lt;=, &amp;gt;, &amp;gt;=, &amp;lt;&amp;gt;, like, in, between, is null, exists&lt;/li>
&lt;li>Arithmetic operators, such as +, −, *, /, DIV (integer division) and (% or MOD) for modulus&lt;/li>
&lt;/ul>
&lt;p>Note:&lt;/p>
&lt;ol>
&lt;li>= can be used for date/string/number;&lt;/li>
&lt;li>&amp;lsquo;between and&amp;rsquo; can be used for date/string/number;&lt;/li>
&lt;li>&amp;lsquo;between and&amp;rsquo; is inclusive;&lt;/li>
&lt;li>col1 (not) in (&amp;lsquo;A&amp;rsquo;,&amp;lsquo;B&amp;rsquo;)/subqueries;&lt;/li>
&lt;li>built-in function: left(name, 1) in (&amp;lsquo;A&amp;rsquo;,&amp;lsquo;B&amp;rsquo;);&lt;/li>
&lt;li>wildcards/regular expressions:
&lt;ul>
&lt;li>Strings beginning/ending with a certain &lt;strong>character&lt;/strong>&lt;/li>
&lt;li>Strings beginning/ending with a &lt;strong>substring&lt;/strong>&lt;/li>
&lt;li>Strings containing a certain &lt;strong>character&lt;/strong> &lt;strong>anywhere&lt;/strong> within the string&lt;/li>
&lt;li>Strings containing a &lt;strong>substring anywhere&lt;/strong> within the string&lt;/li>
&lt;li>Strings with a &lt;strong>specific format&lt;/strong>, regardless of individual characters&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Wildcard character&lt;/th>
&lt;th align="right">Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">_&lt;/td>
&lt;td align="right">Exactly one character&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%&lt;/td>
&lt;td align="right">Any number of characters (including 0)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
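&lt;p>For example, a sketch combining both wildcards (second character must be &lt;code>A&lt;/code>, with a &lt;code>W&lt;/code> somewhere after it):&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT last_name
FROM customer
WHERE last_name LIKE '_A%W%';
&lt;/code>&lt;/pre>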
&lt;h4 id="null">NULL&lt;/h4>
&lt;p>Null is used for various cases where a value cannot be supplied, such as:&lt;/p>
&lt;ul>
&lt;li>Not applicable
Such as the employee ID column for a transaction that took place at an ATM&lt;/li>
&lt;li>Value not yet known
Such as when the federal ID is not known at the time a customer row is created&lt;/li>
&lt;li>Value undefined
Such as when an account is created for a product that has not yet been added to the database&lt;/li>
&lt;/ul>
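&lt;p>For example, a sketch of retrieving such rows (assuming sakila&amp;rsquo;s &lt;code>rental&lt;/code> table, whose &lt;code>return_date&lt;/code> stays null until the rental is returned):&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT rental_id
FROM rental
WHERE return_date IS NULL; /* correct */
/* WHERE return_date = NULL would match no rows */
&lt;/code>&lt;/pre>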
&lt;p>Note:&lt;/p>
&lt;ul>
&lt;li>An expression can be null, but it can &lt;strong>never equal&lt;/strong> null. IS NULL/IS NOT NULL.&lt;/li>
&lt;li>Two nulls are &lt;strong>never equal to each other&lt;/strong>.&lt;/li>
&lt;/ul></description></item><item><title>Learning SQL Notes #2: Data Types</title><link>https://siqi-zheng.rbind.io/post/2021-05-26-sql-notes-2/</link><pubDate>Wed, 26 May 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-05-26-sql-notes-2/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#character-data">Character Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#numeric-data">Numeric Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#temporal-data">Temporal Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#bouns-find-current-time">BOUNS: Find Current Time&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="character-data">Character Data&lt;/h3>
&lt;pre>&lt;code class="language-SQL">char(20) /* fixed-length */
varchar(20) /* variable-length */
&lt;/code>&lt;/pre>
&lt;p>No easy way to constrain the length of character in &lt;strong>R&lt;/strong>, but one can try &lt;code>stringr::str_trunc()&lt;/code>.&lt;/p>
&lt;p>Note:&lt;/p>
&lt;ol>
&lt;li>If the data being loaded into a text column exceeds the maximum size for that type, the data will be truncated;&lt;/li>
&lt;li>Trailing spaces &lt;strong>will not&lt;/strong> be removed when data is loaded into the column;&lt;/li>
&lt;li>When using text columns for sorting or grouping, only the first 1,024 bytes are used, although this limit may be increased if necessary.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-SQL">CREATE DATABASE european_sales CHARACTER SET latin1;
&lt;/code>&lt;/pre>
&lt;h3 id="numeric-data">Numeric Data&lt;/h3>
&lt;ol>
&lt;li>Boolean: 0 False, 1 True.&lt;/li>
&lt;li>System-generated primary keys: 1 to $\infty$, integers;
&lt;pre>&lt;code class="language-SQL">mediumint −8,388,608 to 8,388,607
mediumint unsigned 0 to 16,777,215
int −2,147,483,648 to 2,147,483,647
int unsigned 0 to 4,294,967,295
bigint −2^63 to 2^63 - 1
bigint unsigned 0 to 2^64 - 1
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>Item number: positive integers in a range;
&lt;pre>&lt;code class="language-SQL">tinyint −128 to 127
tinyint unsigned 0 to 255
smallint −32,768 to 32,767
smallint unsigned 0 to 65,535
&lt;/code>&lt;/pre>
&lt;p>Unsigned types accept only non-negative values;&lt;/p>
&lt;/li>
&lt;li>High-precision scientific or manufacturing data;
&lt;pre>&lt;code class="language-SQL">float( p , s ) −3.402823466E+38 to −1.175494351E-38 and 1.175494351E-38 to 3.402823466E+38
double( p , s ) −1.7976931348623157E+308 to −2.2250738585072014E-308
and 2.2250738585072014E-308 to 1.7976931348623157E+308
&lt;/code>&lt;/pre>
&lt;p>p and s are optional parameters: the precision (the total number of allowable digits to the left and right of the decimal point combined) and the scale (the number of allowable digits to the right of the decimal point). The number of digits allowed to the left of the decimal point is therefore p − s.&lt;/p>
&lt;/li>
&lt;/ol>
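&lt;p>A sketch of how precision and scale interact (the table and column names are hypothetical):&lt;/p>
&lt;pre>&lt;code class="language-SQL">CREATE TEMPORARY TABLE readings (avg_temp FLOAT(4,1));
INSERT INTO readings VALUES (17.8675);
/* stored as 17.9: at most 4 digits in total, 1 to the right of the decimal point */
&lt;/code>&lt;/pre>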
&lt;h3 id="temporal-data">Temporal Data&lt;/h3>
&lt;ul>
&lt;li>The &lt;strong>future date&lt;/strong> that a particular event is expected to happen, such as shipping a customer’s order
&lt;pre>&lt;code class="language-SQL">date YYYY-MM-DD 1000-01-01 to 9999-12-31
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>The date that a customer’s order &lt;strong>was shipped&lt;/strong>
&lt;pre>&lt;code class="language-SQL">datetime YYYY-MM-DD HH:MI:SS 1000-01-01 00:00:00.000000 to 9999-12-31 23:59:59.999999
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>The &lt;strong>date and time&lt;/strong> that a user &lt;strong>modified&lt;/strong> a particular row in a table
&lt;pre>&lt;code class="language-SQL">timestamp YYYY-MM-DD HH:MI:SS 1970-01-01 00:00:00.000000 to 2038-01-18 22:14:07.999999
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>An employee’s &lt;strong>birth date&lt;/strong>
&lt;pre>&lt;code class="language-SQL">date YYYY-MM-DD 1000-01-01 to 9999-12-31
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>The &lt;strong>year&lt;/strong> corresponding to a row in a yearly_sales fact table in a data warehouse
&lt;pre>&lt;code class="language-SQL">year YYYY 1901-2155
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>The &lt;strong>elapsed time&lt;/strong> needed to complete a wiring harness on an automobile assembly line
&lt;pre>&lt;code class="language-SQL">time HHH:MI:SS −838:59:59.000000 to 838:59:59.000000
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
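&lt;p>To supply a temporal value explicitly, a properly formatted string can be cast, for example:&lt;/p>
&lt;pre>&lt;code class="language-SQL">SELECT CAST('2019-09-17 15:30:00' AS DATETIME);
&lt;/code>&lt;/pre>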
&lt;h3 id="bouns-find-current-time">BOUNS: Find Current Time&lt;/h3>
&lt;p>To find the current data/time:&lt;/p>
&lt;pre>&lt;code class="language-SQL">SELECT now();
/*2019-04-04 20:44:26 Timezone not included*/
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">sys.time()
# &amp;quot;2021-05-25 10:58:06 EDT&amp;quot;, Timezone included
&lt;/code>&lt;/pre>
&lt;p>In Oracle, add &lt;code>FROM dual;&lt;/code> (think of &lt;code>dual&lt;/code> as a &lt;em>dummy table&lt;/em>!)&lt;/p></description></item><item><title>Learning SQL Notes #1</title><link>https://siqi-zheng.rbind.io/post/2021-05-26-sql-notes-1/</link><pubDate>Tue, 25 May 2021 18:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-05-26-sql-notes-1/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#introduction-to-databases">Introduction to Databases&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#more-about-relational-databases">More about Relational Databases&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#find-databases">Find Databases&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#find-a-table">Find a Table&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#create-a-table">Create a Table&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#add-a-row">Add a Row&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#change-a-cell">Change a Cell&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#delete-a-row">Delete a Row&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#table-overview">Table Overview&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#show-tables">Show Tables&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#drop-a-table">Drop a Table&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#export-to-xml">Export to XML&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#table-creation-ch-2">Table Creation (CH. 2)&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#1---design">1 Design&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#2---refinement">2 Refinement&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#3---building-sql-schema-statements">3 Building SQL Schema Statements&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="introduction-to-databases">Introduction to Databases&lt;/h2>
&lt;ul>
&lt;li>SQL was initially created to be the language for generating, manipulating, and retrieving data from relational databases.&lt;/li>
&lt;li>A database is a set of related information.&lt;/li>
&lt;li>&lt;em>Database systems&lt;/em> are computerized data storage and retrieval mechanisms.&lt;/li>
&lt;li>&lt;em>Nonrelational Database Systems&lt;/em>:
&lt;ul>
&lt;li>In a &lt;em>hierarchical&lt;/em> database system, for example, data is represented as one or more tree structures. The hierarchical database system provides tools for locating a particular customer’s tree and then traversing the tree to find the desired accounts and/or transactions. Each node in the tree may have either zero or one parent and zero, one, or many children.&lt;/li>
&lt;li>&lt;em>Network database system&lt;/em> exposes sets of records and sets of links that define relationships between different records.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Data can be represented as sets of &lt;em>tables&lt;/em>. Rather than using pointers to navigate between related entities, redundant data is used to link records in different tables: &lt;em>relational model&lt;/em>.&lt;/li>
&lt;/ul>
&lt;h3 id="more-about-relational-databases">More about Relational Databases&lt;/h3>
&lt;ol>
&lt;li>The number of columns/rows is constrained by &lt;em>physical limits&lt;/em> and by &lt;em>maintainability&lt;/em>;&lt;/li>
&lt;li>&lt;em>Primary key&lt;/em> includes information that &lt;strong>uniquely identifies&lt;/strong> a row in that table;
&lt;ol>
&lt;li>If more than one column, then &lt;em>compound key&lt;/em>;&lt;/li>
&lt;li>If you select a naturally occurring attribute, say, first name, then it is a &lt;em>natural key&lt;/em>;&lt;/li>
&lt;li>If you select an artificial id, then it is a &lt;em>surrogate key&lt;/em>;&lt;/li>
&lt;li>&lt;strong>NEVER be allowed to change!&lt;/strong>&lt;/li>
&lt;li>Possible error:
&lt;pre>&lt;code>ERROR 1062 (23000): Duplicate entry '1' for key 'PRIMARY'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>A table may contain more than one identifier besides the &lt;em>primary key&lt;/em>: &lt;em>foreign keys&lt;/em> connect the entities in different tables;&lt;/li>
&lt;li>Make sure that there is only &lt;strong>one place&lt;/strong> in the database that holds, say, the customer’s name; otherwise, the data might be changed in one place but not another, causing the data in the database to be unreliable. The process of refining a database design to ensure that each independent piece of information is in only &lt;strong>one place&lt;/strong> (except for foreign keys) is known as &lt;em>normalization&lt;/em>. (Think about the concept of &lt;em>Tidy Data&lt;/em> in &lt;strong>R&lt;/strong>!)&lt;/li>
&lt;li>Two-column primary key is also possible depending on the context (CH.2);&lt;/li>
&lt;li>Foreign key constraint limits the id to those exist in another table (CH.2); Possible error:
&lt;pre>&lt;code>ERROR 1452 (23000): Cannot add or update a child row: a foreign key constraint fails ('sakila'.'favorite_food', CONSTRAINT 'fk_fav_food_person_id' FOREIGN KEY
('person_id') REFERENCES 'person' ('person_id'))
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>Ways to generate primary keys:
&lt;ul>
&lt;li>Look at the largest value currently in the table and add one.&lt;/li>
&lt;li>Let the database server provide the value for you.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-SQL">ALTER TABLE table_name MODIFY col_0 SMALLINT UNSIGNED AUTO_INCREMENT;
set foreign_key_checks=0; /*IMPORTANT*/
ALTER TABLE person
MODIFY person_id SMALLINT UNSIGNED AUTO_INCREMENT;
set foreign_key_checks=1; /*IMPORTANT*/
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
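&lt;p>A sketch of declaring such constraints (the &lt;code>favorite_food&lt;/code> and &lt;code>person&lt;/code> tables are the ones from the error message above):&lt;/p>
&lt;pre>&lt;code class="language-SQL">CREATE TABLE favorite_food
(person_id SMALLINT UNSIGNED,
food VARCHAR(20),
CONSTRAINT pk_favorite_food PRIMARY KEY (person_id, food), /*two-column primary key*/
CONSTRAINT fk_fav_food_person_id FOREIGN KEY (person_id)
REFERENCES person (person_id) /*limits person_id to ids that exist in person*/
);
&lt;/code>&lt;/pre>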
&lt;h3 id="find-databases">Find Databases&lt;/h3>
&lt;p>To see the &lt;code>mysql&amp;gt;&lt;/code> prompt:&lt;/p>
&lt;pre>&lt;code>mysql -u root -p
&lt;/code>&lt;/pre>
&lt;p>Then type &lt;code>show databases;&lt;/code> to display all databases;&lt;/p>
&lt;h3 id="find-a-table">Find a Table&lt;/h3>
&lt;p>To select a database, type &lt;code>use database_name;&lt;/code>&lt;/p>
&lt;p>Alternatively, name the database when connecting:&lt;/p>
&lt;pre>&lt;code>mysql -u root -p database_name
&lt;/code>&lt;/pre>
&lt;p>In &lt;strong>R&lt;/strong>, one can find data frames under the global environment.&lt;/p>
&lt;h3 id="create-a-table">Create a Table&lt;/h3>
&lt;pre>&lt;code class="language-SQL">CREATE TABLE table_name /*Create a table with name: ……*/
(col_0 smallint,
col_1 VARCHAR(30),
col_2 timestamp,
CONSTRAINT pk_col_0 PRIMARY KEY (col_0) /*set col_0 as primary key*/
); /*The most basic method to create a database*/
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-R">df &amp;lt;- data.frame()
# x1 = c(7, 3, 2, 9, 0),
# x2 = c(4, 4, 1, 1, 8),
# x0 = c(5, 3, 9, 2, 4)
# Primary key can only be added manually
&lt;/code>&lt;/pre>
&lt;h3 id="add-a-row">Add a Row&lt;/h3>
&lt;pre>&lt;code class="language-SQL">INSERT INTO table_name (col_0, col_1, col_2) /*The table*/
VALUES (27, 'Rdm Name', 'Acme Paper Corporation'); /*The values*/
/*The most basic method to insert a full row into a database*/
&lt;/code>&lt;/pre>
&lt;p>&lt;code>Query OK, 1 row affected&lt;/code> $\Rightarrow$ one row was added to the table&lt;/p>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">new_row &amp;lt;- c(27, 'Rdm Name', 'Acme Paper Corporation')
df &amp;lt;- rbind(df, new_row)
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>You are not required to provide data for every column in the table unless the column cannot be NULL;&lt;/li>
&lt;li>MySQL will convert the &lt;strong>string&lt;/strong> to a &lt;strong>date&lt;/strong> for you as long as the &lt;strong>format is followed&lt;/strong>; otherwise you will see an error like:
&lt;pre>&lt;code>ERROR 1292 (22007): Incorrect date value: 'DEC-21-1980' for column 'birth_date' at row 1
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
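&lt;p>For example (assuming a &lt;code>person&lt;/code> table with a &lt;code>birth_date&lt;/code> column, as in the error above):&lt;/p>
&lt;pre>&lt;code class="language-SQL">INSERT INTO person (birth_date)
VALUES ('1980-12-21'); /*the string 'YYYY-MM-DD' is converted to a date*/
&lt;/code>&lt;/pre>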
&lt;h3 id="change-a-cell">Change a Cell&lt;/h3>
&lt;pre>&lt;code class="language-SQL">UPDATE table_name
/*Fix column*/ /*Insert the values*/
SET name = 'Certificate of Deposit'
WHERE col_2 = 'CD'; /*Fix row, otherwise all will be replaced*/
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">df[df$col_2=='CD', &amp;quot;name&amp;quot;] &amp;lt;- 'Certificate of Deposit'
# Fix column, fix row
&lt;/code>&lt;/pre>
&lt;h3 id="delete-a-row">Delete a Row&lt;/h3>
&lt;pre>&lt;code class="language-SQL">DELETE ...
/*Fix column*/
FROM table_name
WHERE col_2 = 'CD'; /*Fix row, otherwise all will be deleted*/
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">df[df$col_2=='CD', ] &amp;lt;- NULL
&lt;/code>&lt;/pre>
&lt;h3 id="table-overview">Table Overview&lt;/h3>
&lt;pre>&lt;code class="language-SQL">DESC favorite_food;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">str(df)
summary(df)
glimpse(df)
&lt;/code>&lt;/pre>
&lt;p>Describe the table.&lt;/p>
&lt;h3 id="show-tables">Show Tables&lt;/h3>
&lt;pre>&lt;code class="language-SQL">show tables
&lt;/code>&lt;/pre>
&lt;h3 id="drop-a-table">Drop a Table&lt;/h3>
&lt;pre>&lt;code class="language-SQL">drop table xxx
&lt;/code>&lt;/pre>
&lt;h3 id="export-to-xml">Export to XML&lt;/h3>
&lt;p>Type the following in CMD:&lt;/p>
&lt;pre>&lt;code>mysql -u lrngsql -p --xml bank
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>OR&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-SQL">SELECT * FROM table_name
FOR XML AUTO, ELEMENTS /*SQL Server syntax*/
&lt;/code>&lt;/pre>
&lt;p>No easy way to do so in &lt;strong>R&lt;/strong>.&lt;/p>
&lt;h2 id="table-creation-ch-2">Table Creation (CH. 2)&lt;/h2>
&lt;h3 id="1---design">1 Design&lt;/h3>
&lt;p>What info is needed? Make a list.&lt;/p>
&lt;h3 id="2---refinement">2 Refinement&lt;/h3>
&lt;ol>
&lt;li>Compound objects need to be separated into multiple columns, such as names or addresses;&lt;/li>
&lt;li>If a column is a list containing zero, one, or more independent items, we need another table;&lt;/li>
&lt;li>Need primary key column(s) to guarantee uniqueness.&lt;/li>
&lt;/ol>
&lt;h3 id="3---building-sql-schema-statements">3 Building SQL Schema Statements&lt;/h3>
&lt;p>Another type of constraint, called a &lt;strong>check constraint&lt;/strong>, constrains the allowable values for a particular column. A check constraint can be attached to a &lt;strong>column definition&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-SQL">eye_color CHAR(2) CHECK (eye_color IN ('BR','BL','GR'))
&lt;/code>&lt;/pre>
&lt;p>Possible error:&lt;/p>
&lt;pre>&lt;code>ERROR 1265 (01000): Data truncated for column 'eye_color' at row 1
&lt;/code>&lt;/pre>
&lt;p>MySQL does provide another character data type called &lt;code>enum&lt;/code> that merges the check constraint into the data type definition.&lt;/p>
&lt;pre>&lt;code class="language-SQL">eye_color ENUM('BR','BL','GR')
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">Enum &amp;lt;- function(...) {
## capture the argument names without evaluating them
values &amp;lt;- sapply(match.call(expand.dots = TRUE)[-1L], deparse)
stopifnot(identical(unique(values), values))
res &amp;lt;- setNames(seq_along(values), values)
res &amp;lt;- as.environment(as.list(res))
lockEnvironment(res, bindings = TRUE)
res
}
FRUITS &amp;lt;- Enum(APPLE, BANANA, MELON)
&lt;/code>&lt;/pre>
&lt;p>See &lt;a href="https://stackoverflow.com/questions/33838392/enum-like-arguments-in-r">https://stackoverflow.com/questions/33838392/enum-like-arguments-in-r&lt;/a> for further details.&lt;/p>
&lt;p>After processing the create table statement, the MySQL server returns the message &amp;ldquo;Query OK, 0 rows affected,&amp;rdquo; which tells me that the statement had no &lt;strong>syntax errors&lt;/strong>.&lt;/p></description></item><item><title>Learning Stats at UofT #8: Problems in Statistics Application</title><link>https://siqi-zheng.rbind.io/post/2021-03-27-blog-9-2021/</link><pubDate>Sat, 27 Mar 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-03-27-blog-9-2021/</guid><description>&lt;p>This is the eighth post of the series Learning Stats at UofT. In case you did not read my last post, here is the introduction.&lt;/p>
&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time discussing statistics.&lt;/p>
&lt;p>Now it&amp;rsquo;s approaching the end of the semester for me. Looking back on 2020, a lot was going on and everyone had a tough time. It was also a time to develop skills to collaborate virtually and to be compassionate toward other people in workplaces around the world.&lt;/p>
&lt;p>In a data-driven world, we are connected by data, and the study of data, statistics, is essential to our day-to-day lives. However, the abuse of statistics has also created problems for us.&lt;/p>
&lt;h2 id="misspecified-models">Misspecified Models&lt;/h2>
&lt;p>At the early stage of the pandemic, people proposed many models to predict the number of cases around the world. Some even argued that cases would grow exponentially, based on their past experience with how viruses spread. However, this was an unreasonable guess because there was no up-to-date evidence to support it, and it created rumors and pessimistic expectations about our world. In fact, this could have been avoided had the posters been more cautious about what they were saying and its implications. But they were not. Statistics became a tool for spreading rumors, and readers should be more critical of such models. We statisticians have a responsibility to stand up and correct such mistakes.&lt;/p>
&lt;h2 id="data-exploration">Data Exploration&lt;/h2>
&lt;p>New data were released every day by the government. Data analysis should be the job of data analysts. However, many people with no such background also spread their ideas on the internet. It was not that harmful if they happened to be correct. However, some people enjoyed playing around with the data and sharing false conclusions based on it. Hence I see the necessity of a general education in statistics for the public.&lt;/p></description></item><item><title>Learning Stats at UofT #7: Detail-oriented and Communication Skills</title><link>https://siqi-zheng.rbind.io/post/2021-03-20-blog-8-2021/</link><pubDate>Sat, 20 Mar 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-03-20-blog-8-2021/</guid><description>&lt;p>This is the seventh post of the series Learning Stats at UofT. In case you did not read my last post, here is the introduction.&lt;/p>
&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time on the programs offered by DoSS.&lt;/p>
&lt;p>I am a student in the Applied Statistics Specialist, or Method and Application, at UofT. Though there have been some changes in the requirements, the main focus of the two programs is the same. In particular, you will go through some fundamentals in R and (Frequentist) statistics in your first two years, and take upper-year courses on some advanced topics. Compared to the Theory one, you do not need to take as many courses in theory, but you need to choose a focus depending on your interest. The focus seemed less important, but I gave it a lot of thought over my past years, so I would like to share some of those thoughts with you. Note that all of this can be found on the official website of Arts &amp;amp; Science, and I hope this paragraph serves well as an introduction.&lt;/p>
&lt;p>After meeting with a Vic alumnus, I summarized a set of core skills that are important for our future careers. This is the last part of the core skills.&lt;/p>
&lt;h2 id="detail-oriented">Detail-oriented&lt;/h2>
&lt;p>In the workplace, noticing details means you are careful with every piece of your writing and avoid making errors due to carelessness. In real life, the skill refers to the following. Be curious about your surroundings and the environment. Catch sight of the beautiful and show your appreciation. Remark on the unusual and take a note of it. Notice the changing seasons and take a photo of them. Savour the moment, whether you are walking to work, eating lunch or talking to friends. Be aware of the world around you and what you are feeling. You are not a robot, so you should not only work or study. Lastly, reflecting on your experiences will help you appreciate what matters to you (credit to Chad Jankowski).&lt;/p>
&lt;h2 id="communication-skills">Communication Skills&lt;/h2>
&lt;p>Communication is everywhere. You need to communicate verbally or in writing with family, friends, colleagues and neighbours. You apply communication skills at home, work, school or in your local community. Therefore, you should think of your past communications as the cornerstones of your life and invest time in enhancing these skills. Building these connections skillfully will support and enrich you every day (credit to Chad Jankowski).&lt;/p>
&lt;p>In the workplace, communication should be clear and precise. Sometimes you need to be a bit diplomatic when you talk to people; sometimes you need to be bold and speak up about your needs. People should develop the ability to communicate differently with various people in many contexts.&lt;/p></description></item><item><title>Learning Stats at UofT #6: Critical Analysis and Problem Solving</title><link>https://siqi-zheng.rbind.io/post/2021-03-13-blog-7-2021/</link><pubDate>Sat, 13 Mar 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-03-13-blog-7-2021/</guid><description>&lt;p>This is the sixth post of the series Learning Stats at UofT. In case you did not read my last post, here is the introduction.&lt;/p>
&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time on the programs offered by DoSS.&lt;/p>
&lt;p>I am a student in the Applied Statistics Specialist, or Method and Application, at UofT. Though there have been some changes in the requirements, the main focus of the two programs is the same. In particular, you will go through some fundamentals in R and (Frequentist) statistics in your first two years, and take upper-year courses on some advanced topics. Compared to the Theory one, you do not need to take as many courses in theory, but you need to choose a focus depending on your interest. The focus seemed less important, but I gave it a lot of thought over my past years, so I would like to share some of those thoughts with you. Note that all of this can be found on the official website of Arts &amp;amp; Science, and I hope this paragraph serves well as an introduction.&lt;/p>
&lt;p>After meeting with a Vic alumnus, I summarized a set of core skills that are important for our future careers. This is the first part of the core skills.&lt;/p>
&lt;h2 id="critical-analysis">Critical Analysis&lt;/h2>
&lt;p>Critical analysis involves the ability to analyze a situation, to retrieve information from different sources, and to communicate ideas both quantitatively and qualitatively.&lt;/p>
&lt;p>Statistics courses at U of T provide great training in quantitative analysis. In recent years, instructors have also designed assignments that require students to apply their skills to analyzing real-life cases. Nonetheless, this is not enough from my perspective. First, such tasks have to align with the specific course objectives. In particular, the data provided for an assignment are so clean that you don&amp;rsquo;t need to consider any messy situations. Second, professors may not necessarily know what employers are looking for nowadays. Hence it is important to explore the real world by yourself.&lt;/p>
&lt;h2 id="problem-solving">Problem Solving&lt;/h2>
&lt;p>The following materials were adapted from Learning Strategies at UofT (Rahul Bhat).&lt;/p>
&lt;h3 id="background">Background&lt;/h3>
&lt;p>What background information do I need to solve the problem? This should be combined with critical analysis. Specifically, you may want to know what information is missing or ignored.&lt;/p>
&lt;h3 id="rules">Rules&lt;/h3>
&lt;p>What theories, solutions, rules, proofs, or approaches might I use to solve the problem? In quantitative analysis, you will need to use mathematical knowledge, for example, theorems, to solve questions.&lt;/p>
&lt;h3 id="steps">Steps&lt;/h3>
&lt;p>Can I break the problem into steps - those I understand and those I can gather more information for? This way, you can first work through the steps that are fairly easy and save more time for the difficult tasks.&lt;/p>
&lt;h3 id="connection">Connection&lt;/h3>
&lt;p>Is there something I have seen in the past that resembles this problem? Here you practice active retrieval of knowledge, and look for solutions that are applicable in some sense to this question.&lt;/p></description></item><item><title>Learning Stats at UofT #5: A Dialogic Way to Introduce MLE</title><link>https://siqi-zheng.rbind.io/post/2021-03-06-blog-6-2021/</link><pubDate>Sat, 06 Mar 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-03-06-blog-6-2021/</guid><description>&lt;p>This is the fifth post of the series Learning Stats at UofT. In case you did not read my last post, here is the introduction.&lt;/p>
&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time on the programs offered by DoSS.&lt;/p>
&lt;p>I am a student in Applied Statistics Specialist, or Method and Application, at UofT. Though there were some changes in the requirements, the main focus of the two programs is the same. In particular, you will go through some fundamentals in R and (Frequentist) statistics in your first two years, and take upper year courses in some advanced topics. Compared to the Theory one, you do not need to take so many courses in theory, but you need to choose a focus depending on your interest. The focus seemed less important, but I gave a lot of thoughts about it in my past years. So I would like to share some of them with you. Note that all of these can be found on the official website of Arts &amp;amp; Science, and I hope this paragraph serves well as an introduction.&lt;/p>
&lt;h1 id="a-dialogic-way-to-introduce-mle">A Dialogic Way to Introduce MLE&lt;/h1>
&lt;p>Imagine you walk into Starbucks at Robarts Library, and you meet one of your TAs from STA257. Now you may want to say hi to this TA, but you also want this TA to clarify the concept of MLE. If I were the TA, I would explain the concept of MLE in the following way.&lt;/p>
&lt;p>Sure, I can explain the concept of likelihood while we wait in line. In statistics, we often need to estimate the parameter of a model. But how? Well, Maximum Likelihood Estimation (MLE) can help. First of all, we need to know that the likelihood of a candidate value for a parameter θ is the probability (or density) of observing the data we actually have if that value were the true θ. MLE provides a way to find the value θ̂ under which the observed data are most likely.&lt;/p>
&lt;p>Let’s use an example to illustrate this. Suppose you are interested in a model that describes the waiting time of a customer in this restaurant. You can first collect data on individual waiting times at random. Then you may assume that the true population of waiting times follows some classical distribution, so that we only need to estimate the parameter θ of a known distribution. Then we may be able to use MLE here. Does that make sense so far?&lt;/p>
&lt;p>Alright, now here comes the tricky part. In order to use MLE, we need the joint distribution of all the data X, which gives the probability that the observations jointly fall within specific ranges. But wait, we do not actually know it, since we only know the marginal distribution, i.e. the density function of an individual X with an unknown θ. Hence we can make the assumption that all the data are independent, meaning that knowing one waiting time tells us nothing about the next. It may not be true in reality, but it is sufficient for our purpose. Under this assumption, we can multiply all the marginal densities to get the joint density. To find the maximum likelihood estimate, we can take the logarithm (which turns the product into a sum), set the first derivative with respect to θ equal to 0, and solve. We can check the second derivative as well to ensure that θ̂ is a maximum. This θ̂ is our maximum likelihood estimate for the parameter. Did that explanation help?&lt;/p>
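&lt;p>The steps above can also be sketched numerically. Below is a minimal, hypothetical example (written in Python rather than the R used in our courses, with simulated data that are not from any real restaurant): the waiting times are assumed to be Exponential with unknown rate, and the closed form obtained by setting the derivative of the log-likelihood to zero is cross-checked against a brute-force search.&lt;/p>

```python
import math
import random

# Hypothetical illustration: assume the waiting times follow an
# Exponential(rate) distribution, whose log-likelihood for data
# x_1, ..., x_n is n*log(rate) - rate*sum(x).
random.seed(1)
data = [random.expovariate(2.0) for _ in range(1000)]  # true rate = 2.0
n, s = len(data), sum(data)

def log_likelihood(rate):
    # log of the product of the marginal densities (independence assumed)
    return n * math.log(rate) - rate * s

# Setting the derivative n/rate - sum(x) to zero gives the closed form:
mle_closed_form = n / s

# Cross-check with a brute-force search over a grid of candidate rates
grid = [i / 1000 for i in range(1, 10000)]
mle_grid = max(grid, key=log_likelihood)

print(round(mle_closed_form, 2), round(mle_grid, 2))
```

Both answers land close to the true rate of 2.0, and the closed-form and brute-force estimates agree, which is exactly what the derivative argument promises.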
&lt;p>To sum up, Maximum Likelihood Estimation gives an estimate for a parameter in a model given a set of data. In particular, with a known joint probability density function of the data, we can use derivatives to find the parameter value under which the observed data are most likely. Let’s pause for a moment! It is my turn to get the drink!&lt;/p></description></item><item><title>Learning Stats at UofT #4: Model Selection</title><link>https://siqi-zheng.rbind.io/post/2021-02-27-blog-5-2021/</link><pubDate>Sat, 27 Feb 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-02-27-blog-5-2021/</guid><description>&lt;p>This is the fourth post of the series Learning Stats at UofT. In case you did not read my last post, here is the introduction.&lt;/p>
&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time on the programs offered by DoSS.&lt;/p>
&lt;p>I am a student in the Applied Statistics Specialist, or Method and Application, at UofT. Though there have been some changes in the requirements, the main focus of the two programs is the same. In particular, you will go through some fundamentals in R and (Frequentist) statistics in your first two years, and take upper-year courses on some advanced topics. Compared to the Theory one, you do not need to take as many courses in theory, but you need to choose a focus depending on your interest. The focus seemed less important, but I gave it a lot of thought over my past years, so I would like to share some of those thoughts with you. Note that all of this can be found on the official website of Arts &amp;amp; Science, and I hope this paragraph serves well as an introduction.&lt;/p>
&lt;h1 id="model-selection">Model Selection&lt;/h1>
&lt;p>In the third-year course STA302, you will learn about simple and multiple linear regression models. In fact, you will learn more about the assumptions behind the models and possible remedies for improvement. Nonetheless, what you will not learn is whether it is appropriate to apply a model in a particular field.&lt;/p>
&lt;p>In reality, you will find that many models fit the data pretty well, but those models are incorrect. So here comes the question: to what extent can you apply a complex model to your data? Disciplinary knowledge is crucial in this context. It not only provides a justification for the model, but also a way to interpret it.&lt;/p>
&lt;h1 id="does-disciplinary-knowledge-play-a-role-in-model-selection">Does Disciplinary Knowledge Play a Role in Model Selection?&lt;/h1>
&lt;p>In NFS284, you learn about some thresholds for determining whether the consumption of nutrients is adequate. These thresholds, however, depend on the normality assumption. In fact, if you look at the hypothesis testing in academic papers, you will notice that P=0.05 is almost always selected as the significance level and normality is assumed as an approximation. But this requires some justification. You cannot select a P value just because it serves your convenience.&lt;/p>
&lt;p>I believe that the researchers have more knowledge of nutrition science than I do, and they may have very good reasons for their application of statistics. However, it is not very common to see such justifications even in well-written papers. In fact, you cannot tell whether disciplinary knowledge plays a role in model selection.&lt;/p>
&lt;p>One possible reason is that this requires researchers to devote some paragraphs to it, and the page limit of an academic journal may not allow them to do so. To take a step back, though, even if there are restrictions on the length of the article, this justification should not be given up just because of them.&lt;/p></description></item><item><title>Learning Stats at UofT #3: Some Controversies</title><link>https://siqi-zheng.rbind.io/post/2021-02-20-blog-4-2021/</link><pubDate>Sat, 20 Feb 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-02-20-blog-4-2021/</guid><description>&lt;p>This is the third post of the series Learning Stats at UofT. In case you did not read my last post, here is the introduction.&lt;/p>
&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time on the programs offered by DoSS.&lt;/p>
&lt;p>I am a student in the Applied Statistics Specialist, or Method and Application, at UofT. Though there have been some changes in the requirements, the main focus of the two programs is the same. In particular, you will go through some fundamentals in R and (Frequentist) statistics in your first two years, and take upper-year courses on some advanced topics. Compared to the Theory one, you do not need to take as many courses in theory, but you need to choose a focus depending on your interest. The focus seemed less important, but I gave it a lot of thought over my past years, so I would like to share some of those thoughts with you. Note that all of this can be found on the official website of Arts &amp;amp; Science, and I hope this paragraph serves well as an introduction.&lt;/p>
&lt;h2 id="data-science-program">Data Science Program&lt;/h2>
&lt;p>The first controversy is about the data science program. It is controversial because of its high enrollment requirements and the differing views towards data science. The enrollment requirements are the highest among all stats programs offered at UofT. Moreover, it requires you to learn both computer science and statistics, but it doesn&amp;rsquo;t require you to learn much statistical theory. Rather, it asks more for probability theory and the application of statistics. On the other hand, it involves a lot about data structures, but less other computer science knowledge. Therefore some people think that this program sits awkwardly between a pure CS program and a stats program. There is another view: many also think it is better preparation for the workplace because this program offers an internship opportunity.&lt;/p>
&lt;h2 id="the-changes-in-course-content">The Changes in Course Content&lt;/h2>
&lt;p>The statistics department at U of T went through many changes. In particular, many courses changed their instructors every year. Furthermore, the course content evolved with the changing focus of the workplace. The pro was that you could always learn the most up-to-date knowledge in statistics, and instructors also had more flexibility in designing a course. Note that the scope of a course remained the same throughout, but the way knowledge was conveyed might change. For students, however, it was hard to prepare for upcoming courses. Sometimes the course organization would have many small issues when new content was added.&lt;/p></description></item><item><title>Learning Stats at UofT #2: A Guide to Second-year Courses</title><link>https://siqi-zheng.rbind.io/post/2021-02-13-blog-3-2021/</link><pubDate>Sat, 13 Feb 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-02-13-blog-3-2021/</guid><description>&lt;p>This is the second post of the series Learning Stats at UofT. In case you did not read my last post, here is the introduction.&lt;/p>
&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time on the programs offered by DoSS.&lt;/p>
&lt;p>I am a student in the Applied Statistics Specialist, or Method and Application, at UofT. Though there have been some changes in the requirements, the main focus of the two programs is the same. In particular, you will go through some fundamentals in R and (Frequentist) statistics in your first two years, and take upper-year courses on some advanced topics. Compared to the Theory one, you do not need to take as many courses in theory, but you need to choose a focus depending on your interest. The focus seemed less important, but I gave it a lot of thought over my past years, so I would like to share some of those thoughts with you. Note that all of this can be found on the official website of Arts &amp;amp; Science, and I hope this paragraph serves well as an introduction.&lt;/p>
&lt;h2 id="sta237238">STA237/238&lt;/h2>
&lt;p>There are three combinations of courses offered by DoSS. The first combination is STA237 and STA238. This combination primarily focuses on R. It also goes through the fundamentals of statistics. However, it may not be the best introduction to statistical theory because the focus is adjusted every year. The organization of the courses was not very satisfactory last year: students from RC without a strong stats background found them too programming-based, and students from CS found them less interesting because of the lack of in-depth theory.&lt;/p>
&lt;h2 id="sta247248">STA247/248&lt;/h2>
&lt;p>There is another combination, STA247 and STA248. This combination is designed solely for computer science students, and it is a great choice if you want more knowledge of probability, especially because it involves many creative questions about probability and some material that computer science students may need for programming.&lt;/p>
&lt;h2 id="sta257261">STA257/261&lt;/h2>
&lt;p>The last combination, which is the one I took in my second year, is STA257 and STA261. This combination is the so-called hardest one for second-year stats students. Typically, the instructor will introduce a number of distributions, some new concepts about CDFs and PDFs, and some calculations using double integration. This was the tricky part for me at the time. Many students hadn&amp;rsquo;t learned convolution and double integration when they took this course. As a result, many of us needed to spend extra time getting familiar with these topics.&lt;/p>
&lt;p>Another interesting aspect of this course is that it doesn&amp;rsquo;t involve much Bayesian statistics. I would say this is not really a limitation, but it somehow affects how students think about statistics later on. This course introduces many concepts that are thought to be important in the future, particularly order statistics and quantiles. They will play an important role in the third-year courses.&lt;/p></description></item><item><title>Learning Stats at UofT: A Guide to Focuses in Applied Statistics</title><link>https://siqi-zheng.rbind.io/post/2021-02-06-blog-2-2021/</link><pubDate>Sat, 06 Feb 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-02-06-blog-2-2021/</guid><description>&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time on the programs offered by DoSS.&lt;/p>
&lt;p>I am a student in the Applied Statistics Specialist, or Method and Application, at UofT. Though there have been some changes in the requirements, the main focus of the two programs is the same. In particular, you will go through some fundamentals in R and (Frequentist) statistics in your first two years, and take upper-year courses on some advanced topics. Compared to the Theory one, you do not need to take as many courses in theory, but you need to choose a focus depending on your interest. The focus seemed less important, but I gave it a lot of thought over my past years, so I would like to share some of those thoughts with you. Note that all of this can be found on the official website of Arts &amp;amp; Science, and I hope this paragraph serves well as an introduction.&lt;/p>
&lt;h2 id="focus-can-be-changed-but-you-have-to-plan-ahead">Focus can be changed, but you have to plan ahead&lt;/h2>
&lt;p>The selection of a focus really depends on the courses you take in your first year. Most students in MP take ECO101/102 and CSC148/165 in their first year. This course combination of CS and ECO has certain benefits. Specifically, it gives students much flexibility in their second and third years, since it allows them to choose the Data Science Specialist in the Statistics program, CS programs, and Economics programs.&lt;/p>
&lt;p>However, a common solution is not necessarily a good one. As a student who wanted to take on new challenges and learn more about education, I chose to take Education courses at Victoria College. This to some extent limited my choice of programs. In particular, if I wanted to enroll in other programs, I might have needed to start from the beginning. Nonetheless, I met great friends there and discovered that being a teacher in a primary/middle school was not what I really wanted.&lt;/p>
&lt;p>Then I decided to select Astrophysics as my focus in my second year, hoping to explore the broader universe that I had never learnt about before. It was fun to learn, but it was too theoretical, and I soon started to get interested in Finance and Economics. Hence I reached a crossroads again. If I continued with Astrophysics, I believed I could still do well in academia, but I could not imagine what I would do after that. On the other hand, if I chose Economics, I would need to take first-year Economics courses in my second year and catch up with others in my third year. This was exactly the disadvantage of my first-year course selection.&lt;/p>
&lt;p>The key point is that there is a trade-off when you select a focus: it is more common to stick with your first-year courses when choosing a focus, but then you do not have the opportunity to take some other interesting courses at the university.&lt;/p></description></item><item><title>First Blog in 2021 about Teaching Statistics</title><link>https://siqi-zheng.rbind.io/post/2021-01-30-blog-1-2021/</link><pubDate>Sat, 30 Jan 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-01-30-blog-1-2021/</guid><description>&lt;p>A few days ago, a student asked me about the logic behind the simulated sampling distribution. She was curious about how we could use reshuffling labels + random sampling to obtain a sampling distribution. The logic is related to the Skeptic’s Argument and the consequence of it. Here is the theorem and its elegant proof.&lt;/p>
&lt;p>&lt;img src="1.jpg" alt="Theorem from the Skeptic’s Argument">&lt;/p>
&lt;p>I am particularly interested in her question because I learned the intuition behind the code back in my first year at UofT, but I also saw some unbalanced CRDs where the same reshuffling method was used to calculate the sampling distribution. Hence I was confused by the use of the code.&lt;/p>
&lt;p>The symmetry property of an unbalanced CRD turns out to be uncertain in theory, but in many cases we can still see a simulated sampling distribution that is somewhat symmetric around 0.&lt;/p>
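&lt;p>For readers curious about what the reshuffling looks like in code, here is a minimal sketch (in Python rather than the R used in STA130, with made-up numbers, not real course data): under the Skeptic&amp;rsquo;s Argument there is no difference between the groups, so the labels are exchangeable, and repeatedly shuffling them yields a simulated sampling distribution of the difference in means.&lt;/p>

```python
import random
from statistics import mean

# Made-up data for illustration only
treatment = [6.1, 5.8, 7.0, 6.5, 6.9]
control = [5.2, 5.6, 5.9, 5.4, 6.0]
observed = mean(treatment) - mean(control)

random.seed(0)
pooled = treatment + control
diffs = []
for _ in range(5000):
    random.shuffle(pooled)              # reshuffle the group labels
    new_t, new_c = pooled[:5], pooled[5:]
    diffs.append(mean(new_t) - mean(new_c))

# Under the Skeptic's Argument the simulated distribution centres near 0,
# and the p-value is the share of shuffles at least as extreme as observed
p_value = sum(abs(d) >= abs(observed) for d in diffs) / len(diffs)
print(round(mean(diffs), 2), round(p_value, 3))
```

The simulated differences centre near 0 even though nothing forces exact symmetry, which mirrors the point above about unbalanced designs.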
&lt;p>Another question about test statistics appears in one of my students’ writing assignments. In the writing, this student defines the test statistic (the mean) as a random variable after calculating an exact number from the sample. Indeed, a test statistic &lt;strong>can be&lt;/strong> a random variable, but it is not one once a number has already been obtained from the sample. A great answer from an online forum is attached below.&lt;/p>
&lt;p>&lt;img src="test.jpg" alt="Answer from Stack Exchange">&lt;/p>
&lt;p>A test statistic is about the sample, but a parameter is about the population. If one is wondering whether a parameter is a random variable or a fixed value, then one may be very interested in the controversy between Bayesians and Frequentists. They are fundamentally different approaches to knowledge about data and uncertainty, but in many situations they yield the same result mathematically.&lt;/p>
&lt;p>This also reminds me of my past experience with statistics. When I learnt probability theory in middle school, we did not really differentiate between these two approaches. We sometimes claimed that one event was more likely to happen because of its higher probability, and sometimes interpreted the proportion of heads when flipping a coin many times as the long-term frequency.&lt;/p>
&lt;p>To wrap up, I use these two examples to show that there can be complicated theories behind some seemingly simple facts. Though this is really beyond the scope of STA130, I still think students can benefit from thinking about these questions. One will get to know more about statistics when one takes a second-year statistics course.&lt;/p>
&lt;h2 id="references">References&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>
&lt;a href="http://pages.stat.wisc.edu/~wardrop/courses/371chapter3sum15.pdf" target="_blank" rel="noopener">Statistic Course Material&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>
&lt;a href="https://stats.stackexchange.com/questions/85426/is-test-statistic-a-value-or-a-random-variable" target="_blank" rel="noopener">Stack Exchange Question&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Revision Guide</title><link>https://siqi-zheng.rbind.io/post/2020-12-02-review/</link><pubDate>Wed, 02 Dec 2020 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2020-12-02-review/</guid><description>&lt;blockquote>
&lt;p>The revision guide can be downloaded by clicking the button above. Good luck on your exam!&lt;/p>
&lt;/blockquote></description></item><item><title>Week 6 Tutorial (Bootstrapping)</title><link>https://siqi-zheng.rbind.io/post/2020-11-05-sharing-short-notice/</link><pubDate>Thu, 22 Oct 2020 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2020-11-05-sharing-short-notice/</guid><description>&lt;blockquote>
&lt;p>This tutorial was designed to illustrate a sample beamer presentation created from .Rmd file for teaching bootstrap sampling.&lt;/p>
&lt;/blockquote>
&lt;p>Teaching bootstrapping to students who are new to statistics can be difficult, especially when they have been taught hypothesis testing (the z-test) before bootstrapping. In my actual practice, I found it useful to discuss the similarities between these two methods first, and then the differences.&lt;/p>
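&lt;p>To illustrate that similarity concretely, here is a small sketch (in Python with simulated data, not the tutorial&amp;rsquo;s .Rmd code): both approaches describe how the sample mean varies from sample to sample, the z-test via the normal-theory standard error formula and the bootstrap by resampling the observed data with replacement.&lt;/p>

```python
import random
from statistics import mean, stdev

# Simulated sample for illustration only
random.seed(0)
sample = [random.gauss(10, 2) for _ in range(50)]

# z-test style: standard error from the normal-theory formula s / sqrt(n)
se_formula = stdev(sample) / len(sample) ** 0.5

# bootstrap: resample with replacement, recompute the mean each time
boot_means = [
    mean(random.choices(sample, k=len(sample))) for _ in range(2000)
]
se_bootstrap = stdev(boot_means)

print(round(se_formula, 2), round(se_bootstrap, 2))
```

The two standard errors come out very close, which is a handy way to motivate the bootstrap as doing the same job as the formula, just without relying on normal theory.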
&lt;p>An introduction to why we need such a method can always be inspiring and motivating.&lt;/p></description></item></channel></rss>