<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Blog | Siqi Zheng</title><link>https://siqi-zheng.rbind.io/post/</link><atom:link href="https://siqi-zheng.rbind.io/post/index.xml" rel="self" type="application/rss+xml"/><description>Blog</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><image><url>https://siqi-zheng.rbind.io/images/icon_hu1f65844ca26c0df97a9719a407d829c0_98767_512x512_fill_lanczos_center_2.png</url><title>Blog</title><link>https://siqi-zheng.rbind.io/post/</link></image><item><title>Outline for the research article</title><link>https://siqi-zheng.rbind.io/post/2021-11-12-research-reproducibility/</link><pubDate>Fri, 12 Nov 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-11-12-research-reproducibility/</guid><description>&lt;p>Please click the book icon above for the full text!&lt;/p></description></item><item><title>Supplement Solutions for New Questions in Chapter 1.4 to 1.6 in Understanding Analysis Second Edition</title><link>https://siqi-zheng.rbind.io/post/2021-08-10-analysis-sol-1-2/</link><pubDate>Tue, 10 Aug 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-08-10-analysis-sol-1-2/</guid><description>&lt;p>&lt;strong>Note: There may be LaTeX display issues due to blogdown rendering limitations. A complete well-formatted solution can be found by clicking the download icon above.&lt;/strong>&lt;/p>
&lt;p>One may notice that most questions in the second edition are the same as those in the first edition. However, there are still some new or modified questions in the latest edition that remain unanswered.&lt;/p>
&lt;p>Therefore, in the following posts, I am going to present a collection of solutions to these new questions found on the internet and worked out by myself. To be more concise and clear, I also rewrote some of my solutions according to the internet sources (links are attached at the end of each question). The solution to the first edition can be found here: &lt;a href="https://github.com/mikinty/Understanding-Analysis-Abbott-Solutions">https://github.com/mikinty/Understanding-Analysis-Abbott-Solutions&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.5.2.&lt;/strong> Review the proof of Theorem 1.5.6, part (ii) showing that $\Bbb R$ is uncountable, and then find the flaw in the following erroneous proof that $\Bbb Q$ is uncountable:
Assume, for contradiction, that $\Bbb Q$ is countable. Thus we can write $\Bbb Q = {r_1, r_2, r_3, \dots}$ and, as before, construct a nested sequence of closed intervals with $r_n \not \in I_n$. Our construction implies $\cap^\infty_{n=1} I_n = \emptyset$ while NIP implies $\cap^\infty_{n=1} I_n \neq \emptyset$. This contradiction implies $\Bbb Q$ must therefore be uncountable.&lt;/p>
&lt;p>(1) The intersection $\cap^\infty_{n=1} I_n$ need not be empty: the construction only guarantees that no rational $r_n$ lies in it, so it may still contain an irrational number.
(2) In any case, NIP is a property of closed intervals of real numbers; it does not hold for intervals of rationals.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1914901/false-proofs-claiming-that-mathbbq-is-uncountable">https://math.stackexchange.com/questions/1914901/false-proofs-claiming-that-mathbbq-is-uncountable&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.5.4.&lt;/strong> (a) Show $(a, b) \sim R$ for any interval $(a, b)$.
We know from &lt;strong>Example 1.4.9&lt;/strong> that the function $f(x) = x/(x^2 − 1)$ takes the interval $(−1, 1)$ onto $\Bbb R$ in a 1–1 fashion. Then we map $(a,b)$ onto $(-1,1)$ by the bijective linear function $g(x)=2x/(b-a)-(b+a)/(b-a)$, so $f \circ g$ is the required bijection.&lt;/p>
&lt;p>(b) Show that an unbounded interval like $(a,\infty) = {x : x &amp;gt; a}$ has the same cardinality as $\Bbb R$ as well.
We know from &lt;strong>Example 1.4.9&lt;/strong> that the function $f(x) = x/(x^2 − 1)$ takes the interval $(−1, 1)$ onto $\Bbb R$ in a 1–1 fashion. Then we map $(a,\infty)$ onto $(-1,1)$ by the bijection $g(x)=\frac{2(x-a)}{x-a+1}-1$: it is continuous and strictly increasing, with $g(x) \to -1$ as $x \to a$ and $g(x) \to 1$ as $x \to \infty$.&lt;/p>
&lt;p>(c) Using open intervals makes it more convenient to produce the required 1–1, onto functions, but it is not really necessary. Show that $[0, 1) \sim (0, 1)$ by exhibiting a 1–1 onto function between the two sets.
$f:[0,1) \rightarrow (0,1)$ by $f(0)=1/2$, $f(1/n)=1/(n+1)$ for integer $n \geq 2$, and $f(x)=x$ otherwise.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1425492/explicit-bijection-between-0-1-and-0-1">https://math.stackexchange.com/questions/1425492/explicit-bijection-between-0-1-and-0-1&lt;/a>&lt;/p>
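&lt;p>The bijection in (c) is concrete enough to sanity-check numerically. The sketch below is ours, not part of the original solution (the helper name &lt;code>f&lt;/code> is an assumption); it implements the map with exact rational arithmetic and checks it is 1–1 into $(0,1)$ on a sample:&lt;/p>

```python
from fractions import Fraction

def f(x):
    # the map from the solution: 0 -> 1/2, 1/n -> 1/(n+1) for n >= 2,
    # and every other point of [0, 1) is fixed
    if x == 0:
        return Fraction(1, 2)
    if x.numerator == 1 and x.denominator >= 2:
        return Fraction(1, x.denominator + 1)
    return x

sample = [Fraction(0)] + [Fraction(1, n) for n in range(2, 50)] + [Fraction(3, 7), Fraction(9, 10)]
images = [f(x) for x in sample]
assert len(set(images)) == len(images)          # 1-1 on the sample
assert all(y > 0 and 1 > y for y in images)     # image lies inside (0, 1)
```

Finite sampling only illustrates the behaviour, of course; injectivity and surjectivity are what the displayed definition proves.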
&lt;p>&lt;strong>Exercise 1.5.5.&lt;/strong> (a) Why is $A \sim A$ for every set $A$?
The identity map $f(x)=x$ is 1–1 and onto from $A$ to itself, so $A \sim A$.&lt;/p>
&lt;p>(b) Given sets $A$ and $B$, explain why $A \sim B$ is equivalent to asserting $B \sim A$.
If $f: A \rightarrow B$ is 1–1 and onto, then the inverse mapping $f^{-1}: B \rightarrow A$ is also 1–1 and onto, so $A \sim B$ implies $B \sim A$.&lt;/p>
&lt;p>(c) For three sets $A,B,$ and $C$, show that $A \sim B$ and $B \sim C$ implies $A \sim C$. These three properties are what is meant by saying that $\sim$ is an equivalence relation.
If $f: A \rightarrow B$ and $g: B \rightarrow C$ are 1–1 and onto, then the composition $g \circ f: A \rightarrow C$ is 1–1 and onto, so $A \sim C$.&lt;/p>
&lt;p>&lt;strong>Exercise 1.5.6.&lt;/strong> (a) Give an example of a countable collection of disjoint open intervals.
$A_n = (n, n+1)$, $n\in \Bbb N$&lt;/p>
&lt;p>(b) Give an example of an uncountable collection of disjoint open intervals, or argue that no such collection exists.
No such collection exists. Every collection of disjoint open intervals in $\Bbb R$ is countable: by the density of $\Bbb Q$ in $\Bbb R$ we can choose a rational number inside each interval, these rationals are distinct because the intervals are disjoint, and the rationals are countable.&lt;/p>
&lt;p>&lt;strong>Exercise 1.5.7.&lt;/strong> Consider the open interval $(0,1)$, and let $S$ be the set of points in the open unit square; that is, $S = {(x, y) : 0 &amp;lt; x,y &amp;lt; 1}$.&lt;/p>
&lt;p>(a) Find a 1–1 function that maps $(0, 1)$ into, but not necessarily onto, $S$. (This is easy.)
$f(x) = (x,x),x \in (0,1)$&lt;/p>
&lt;p>(b) Use the fact that every real number has a decimal expansion to produce a 1–1 function that maps $S$ into $(0, 1)$. Discuss whether the formulated function is onto. (Keep in mind that any terminating decimal expansion such as $.235$ represents the same real number as $.234999 \dots$)&lt;/p>
&lt;p>For any point with two coordinates $(0.d_1d_2\dots,0.e_1e_2\dots)$, we map it to the real number $(0.d_1e_1d_2e_2\dots)$. We restrict the choice of point in its simplest form so that $(0.2,0.5)$ will be chosen for $0.25$ instead of $(0.2999\dots,0.4999\dots)$, which is equal to $(0.3,0.5)$, corresponding to $0.35$.&lt;/p>
&lt;p>This function (mapping), however, is not onto. Consider $1/11=0.090909\dots$, which could only be produced by the point $(0,0.999\dots)$; but this point cannot be selected, since it is equal to $(0,1)$, which does not lie in the open square $S$. Therefore no point in the unit square maps to $1/11$.&lt;/p>
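&lt;p>The digit-interleaving map is easy to experiment with on terminating expansions. A small illustration of ours (not part of the original solution; digits are passed as strings of equal length for simplicity):&lt;/p>

```python
def interleave(x_digits, y_digits):
    # x_digits, y_digits: the decimal digits of the two coordinates after
    # the point, in "simplest form" as required by the solution
    merged = "".join(d + e for d, e in zip(x_digits, y_digits))
    return "0." + merged

# the worked example from the text: (0.2, 0.5) is sent to 0.25
print(interleave("2", "5"))
print(interleave("123", "456"))
```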
&lt;p>&lt;strong>Exercise 1.5.8.&lt;/strong> Let $B$ be a set of positive real numbers with the property that adding together any finite subset of elements from $B$ always gives a sum of $2$ or less. Show $B$ must be finite or countable.&lt;/p>
&lt;p>For each $n\in \Bbb N$, let $$B_n=\left\{b\in B \,\middle|\, b\geqslant\frac{2}{n}\right\}\subset B.$$&lt;/p>
&lt;p>Of course, $B_n$ can have no more than $n-1$ distinct elements; otherwise, the sum of $n$ distinct elements of $B_n$ would be greater than $2$.&lt;/p>
&lt;p>But $$B=\bigcup_{n\in\Bbb N}B_n.$$ Since $\Bbb N$ is countable and each $B_n$ is finite, $B$ is at most countable, i.e. finite or countable.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/2446630/showing-a-set-is-finite-or-countable">https://math.stackexchange.com/questions/2446630/showing-a-set-is-finite-or-countable&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.5.10.&lt;/strong> (a) Let $C \subseteq [0,1]$ be uncountable. Show there exists $a \in (0,1)$ such that $C \cap [a,1]$ is uncountable.&lt;/p>
&lt;p>Suppose that $C\cap [\tfrac{1}{n}, 1]$ is countable for all $n$. Then $$C\cap [0,1] = C\cap\big(\{0\}\cup \bigcup_{n=1}^\infty [\tfrac{1}{n},1]\big) = (C\cap \{0\}) \cup \bigcup_{n=1}^\infty (C\cap [\tfrac{1}{n}, 1])$$ would be countable too.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1452550/let-c-subseteq-0-1-be-uncountable-show-there-exists-a-in-0-1-such-tha">https://math.stackexchange.com/questions/1452550/let-c-subseteq-0-1-be-uncountable-show-there-exists-a-in-0-1-such-tha&lt;/a>&lt;/p>
&lt;p>(b) Now let $A$ be the set of all $a \in (0, 1)$ such that $C \cap [a,1]$ is uncountable, and set $\alpha = \sup A$. Is $C \cap [\alpha,1]$ an uncountable set?&lt;/p>
&lt;p>We want to show: if $C\subseteq [0,1]$ is uncountable, $A = {a\in (0,1)\mid C\cap[a,1] \text{ is uncountable}}$, and $\alpha = \sup A$, then $C\cap [\alpha,1]$ is countable.&lt;/p>
&lt;p>First, $A$ is nonempty: for $n\in\Bbb N$ let $C_n = C\cap [\frac 1 n, 1]$. Some $C_n$ must be uncountable, otherwise $C= \bigcup_n C_n$ is a countable union of countable sets and therefore countable. So for some $n$, $1/n \in A$.&lt;/p>
&lt;p>Clearly $0 \lt \alpha \le 1$.&lt;/p>
&lt;p>If $\alpha =1$ then of course the claim is true.
If $\alpha \lt 1$, let $(b_n)$ be a decreasing sequence in $(\alpha, 1)$ with $\alpha = \inf_n b_n$. By definition of $A$ and $\alpha$, $C\cap[b_n,1]$ is countable for every $n$: otherwise $b_n\in A$, which would force $b_n \le \alpha$, contradicting $b_n \gt \alpha$. Thus
$$\begin{align}
C\cap (\alpha,1] &amp;amp;= C\cap \bigcup_n [b_n, 1] \\
&amp;amp;= \bigcup_n (C\cap [b_n, 1])
\end{align}$$
is a countable union of countable sets, so it&amp;rsquo;s countable; and $C\cap[\alpha,1]$ contains at most one additional point ($\alpha$ itself), so it is countable as well.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1639608/intersection-of-uncountable-sets">https://math.stackexchange.com/questions/1639608/intersection-of-uncountable-sets&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.5.11 (Schröder–Bernstein Theorem).&lt;/strong> Assume there exists a 1–1 function $f: X \rightarrow Y$ and another 1–1 function $g: Y \rightarrow X$. Then there exists a 1–1, onto function $h: X \rightarrow Y$ and hence $X \sim Y$.&lt;/p>
&lt;p>The strategy is to partition $X$ and $Y$ into components $X = A \cup A'$ and $Y = B \cup B'$ with $A \cap A' = \emptyset$ and $B \cap B' = \emptyset$, in such a way that $f$ maps $A$ onto $B$ and $g$ maps $B'$ onto $A'$.&lt;/p>
&lt;p>(a) Explain how achieving this would lead to a proof that $X \sim Y$.
$f: A \rightarrow B$ is a 1–1, onto function;
$g: B' \rightarrow A'$ is a 1–1, onto function;
Then $h(x)=f(x)$ if $x \in A$ and $h(x)=g^{-1}(x)$ if $x \in A'$ defines a 1–1, onto function from $X$ to $Y$, and hence $X \sim Y$.&lt;/p>
&lt;p>(b) Set $A_1 = X \setminus g(Y)$ (what happens if $A_1 = \emptyset$?) and inductively define a sequence of sets by letting $A_{n+1} = g(f(A_n))$. Show that ${A_n : n \in \Bbb{N}}$ is a pairwise disjoint collection of subsets of $X$, while ${f(A_n) : n \in \Bbb{N} }$ is a similar collection in $Y$.&lt;/p>
&lt;p>For $k \ge 2$, since $A_k = g(f(A_{k-1})) \subseteq g(Y)$, $A_k$ and $A_1$ are disjoint.&lt;/p>
&lt;p>For $2 \le m \lt n$, if there exists $a \in A_m \cap A_n$, then for some $a_{m-1} \in A_{m-1}$ and $a_{n-1} \in A_{n-1}$, $g(f(a_{m-1})) = a = g(f(a_{n-1}))$. Since both $f$ and $g$ are injective, $a_{m-1} = a_{n-1}$. Hence $A_m \cap A_n \ne \emptyset$ implies $A_{m-1} \cap A_{n-1} \ne \emptyset$. By induction, we conclude that $A_1 \cap A_{n-m+1} \ne \emptyset$, which contradicts the previous paragraph. Therefore $A_m$ and $A_n$ are disjoint ($2 \le m \lt n$). Finally, since the $A_n$ are pairwise disjoint and $f$ is injective, the sets $f(A_n)$ are pairwise disjoint as well.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1726578/understanding-a-proof-of-schr%C3%B6der-bernstein-theorem">https://math.stackexchange.com/questions/1726578/understanding-a-proof-of-schr%C3%B6der-bernstein-theorem&lt;/a>&lt;/p>
&lt;p>(c) Let $A = \cup_{n=1}^\infty A_n$ and $B = \cup_{n=1}^\infty f(A_n)$. Show that $f$ maps $A$ onto $B$.
This is immediate: every $b \in B$ satisfies $b = f(a)$ for some $a \in A_n \subseteq A$.&lt;/p>
&lt;p>(d) Let $A' = X\setminus A$ and $B' = Y \setminus B$. Show $g$ maps $B'$ onto $A'$.
Suppose there is an element $a' \in A'$ with $a' \not\in g(B')$. Since $a'$ cannot be in $A_1 = X \setminus g(Y)$, we have $a' = g(b)$ for some $b \in Y$; and since $a' \not\in g(B')$, this $b$ lies in $B$, i.e. $b \in f(A_n)$ for some $n$. Writing $b = f(a)$ with $a \in A_n$, we get $a'=g(f(a))\in A_{n+1} \subseteq A$. But this contradicts $a' \in A' = X \setminus A$.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1726578/understanding-a-proof-of-schr%C3%B6der-bernstein-theorem">https://math.stackexchange.com/questions/1726578/understanding-a-proof-of-schr%C3%B6der-bernstein-theorem&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.6.9.&lt;/strong> Using the various tools and techniques developed in the last two sections (including the exercises from Section 1.5), give a compelling argument showing that $\cal P(\Bbb N) \sim \Bbb R$.&lt;/p>
&lt;p>First note that $\Bbb R$ injects into $\cal P(\Bbb Q)$ by mapping $r$ to ${q\in\Bbb Q\mid q \lt r}$. Since $\Bbb Q$ is countable there is a bijection between $\cal P(\Bbb Q)$ and $\cal P(\Bbb N)$. So $\Bbb R$ injects into $\cal P(\Bbb N)$.&lt;/p>
&lt;p>Then note that we can map $x\in 2^{\Bbb N}$ to the continued fraction defined by the sequence $x$, or to the point of $[0,1]$ defined by $\sum\frac{x(n)}{3^{n+1}}$; the latter map can be shown injective by a somewhat easier proof. By the Schröder–Bernstein Theorem (Exercise 1.5.11), $\cal P(\Bbb N) \sim \Bbb R$.&lt;/p></description></item><item><title>Supplement Solutions for New Questions in Chapter 1.2 to 1.4 in Understanding Analysis Second Edition</title><link>https://siqi-zheng.rbind.io/post/2021-08-05-analysis-sol-1-1/</link><pubDate>Thu, 05 Aug 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-08-05-analysis-sol-1-1/</guid><description>&lt;p>&lt;strong>Note: There may be LaTeX display issues due to blogdown rendering limitations. A complete well-formatted solution can be found by clicking the download icon above.&lt;/strong>&lt;/p>
&lt;p>One may notice that most questions in the second edition are the same as those in the first edition. However, there are still some new or modified questions in the latest edition that remain unanswered.&lt;/p>
&lt;p>Therefore, in the following posts, I am going to present a collection of solutions to these new questions found on the internet and worked out by myself. To be more concise and clear, I also rewrote some of my solutions according to the internet sources (links are attached at the end of each question). The solution to the first edition can be found here: &lt;a href="https://github.com/mikinty/Understanding-Analysis-Abbott-Solutions">https://github.com/mikinty/Understanding-Analysis-Abbott-Solutions&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.2.2.&lt;/strong> Show that there is no rational number $r$ satisfying $2^r=3$.&lt;/p>
&lt;p>Suppose, for contradiction, that such a rational $r$ exists. Since $2^r = 3 \gt 1$, we must have $r \gt 0$, so we can write $r=\frac{a}{b}$ with positive integers $a,b$.&lt;/p>
&lt;p>Then, we get $$2^{\frac{a}{b}}=3$$&lt;/p>
&lt;p>which can be expressed as&lt;/p>
&lt;p>$$2^a=3^b$$&lt;/p>
&lt;p>This is clearly a contradiction because the left side is even and the right side is odd.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1427219/prove-there-is-no-rational-r-satisfying-2r-3">https://math.stackexchange.com/questions/1427219/prove-there-is-no-rational-r-satisfying-2r-3&lt;/a>&lt;/p>
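&lt;p>The parity argument can be corroborated by brute force on a small range of exponents (a sanity check of ours, not a proof):&lt;/p>

```python
# 2^a is always even and 3^b always odd, so they can never coincide;
# verify exhaustively for small exponents
assert all(2 ** a != 3 ** b for a in range(1, 60) for b in range(1, 40))
print("no collision for a under 60, b under 40")
```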
&lt;p>&lt;strong>Exercise 1.2.4.&lt;/strong> Expressing $\Bbb N$ as an infinite union of disjoint infinite subsets.&lt;/p>
&lt;p>Let $A_{i}$ consist of all the numbers of the form $2^{i-1}m$ where $2\nmid m$; that is, $A_i$ consists of all the numbers that have exactly a factor of $2^{i-1}$ in them. So
$$\begin{align}
A_1 &amp;amp;= \{1, 3, 5, 7, 9, 11, \dots\}\\
A_2 &amp;amp;= \{2,\ 6 =2^1\cdot 3,\ 10 = 2^1\cdot 5,\ 14 = 2^1\cdot 7,\ \dots\}\\
A_3 &amp;amp;= \{4 = 2^2,\ 12=2^2\cdot 3,\ 20=2^2\cdot 5,\ \dots\}\\
A_4 &amp;amp;= \{8 = 2^3,\ 24=2^3\cdot 3,\ 40=2^3\cdot 5,\ \dots\}\\
&amp;amp;\ \ \vdots
\end{align}
$$&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/847465/expressing-bbb-n-as-an-infinite-union-of-disjoint-infinite-subsets">https://math.stackexchange.com/questions/847465/expressing-bbb-n-as-an-infinite-union-of-disjoint-infinite-subsets&lt;/a>&lt;/p>
&lt;p>As pointed out in the link above, any prime number works here; for example, with $3$:&lt;/p>
&lt;p>$A_1 = \Bbb N \setminus {x: x = 3b, b \in \Bbb N}$&lt;/p>
&lt;p>$A_2 = {3a,a\in A_1}$&lt;/p>
&lt;p>$A_3 = {3^2a,a\in A_1}$&lt;/p>
&lt;p>$A_4 = {3^3a,a\in A_1}$&lt;/p>
&lt;p>$\vdots$&lt;/p>
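&lt;p>Both constructions assign each natural number to exactly one set. For the powers-of-two version this can be checked mechanically; the sketch below is ours (the helper name &lt;code>index&lt;/code> is an assumption) and confirms the first few sets match the display above:&lt;/p>

```python
from collections import defaultdict

def index(n):
    # n belongs to A_i exactly when 2^(i-1) is the largest power of 2 dividing n
    i = 1
    while n % 2 == 0:
        n //= 2
        i += 1
    return i

buckets = defaultdict(list)
for n in range(1, 41):
    buckets[index(n)].append(n)   # every n lands in exactly one bucket

assert buckets[1][:6] == [1, 3, 5, 7, 9, 11]
assert buckets[2][:4] == [2, 6, 10, 14]
assert buckets[3][:3] == [4, 12, 20]
assert buckets[4][:3] == [8, 24, 40]
```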
&lt;p>&lt;strong>Exercise 1.2.8.&lt;/strong>&lt;/p>
&lt;p>Give an example of each or state that the request is impossible:&lt;/p>
&lt;p>(a) $f : \Bbb N \rightarrow \Bbb N$ that is 1–1 but not onto.
$f(x) = x^2+2$: it is strictly increasing, hence 1–1, but not onto, since $1 \in \Bbb N$ while $f(x)&amp;gt;1$ $\forall x \in \Bbb N$&lt;/p>
&lt;p>(b) $f : \Bbb N \rightarrow \Bbb N$ that is onto but not 1–1.
$f(1)=1$ and $f(x) = x-1$ for $x \geq 2$: every $n \in \Bbb N$ equals $f(n+1)$, so $f$ is onto, but $f(1)=f(2)$ while $1 \neq 2$&lt;/p>
&lt;p>(c) $f : \Bbb N \rightarrow \Bbb Z$ that is 1–1 and onto.
$f(x) = x/2$ for even $x$ and $f(x) = -(x-1)/2$ for odd $x$, which lists $\Bbb Z$ as $0, 1, -1, 2, -2, \dots$&lt;/p>
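&lt;p>The alternating enumeration of $\Bbb Z$ is easy to test directly (our sketch; it is one of many possible bijections):&lt;/p>

```python
def f(n):
    # n in {1, 2, 3, ...}: evens go to 1, 2, 3, ..., odds to 0, -1, -2, ...
    return n // 2 if n % 2 == 0 else -(n - 1) // 2

assert [f(n) for n in range(1, 8)] == [0, 1, -1, 2, -2, 3, -3]
# 1-1 and onto a symmetric block of integers
assert {f(n) for n in range(1, 102)} == set(range(-50, 51))
```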
&lt;p>&lt;strong>Exercise 1.2.10.&lt;/strong> Decide which of the following are true statements. Provide a short justification for those that are valid and a counterexample for those that are not:&lt;/p>
&lt;p>(a) Two real numbers satisfy a &amp;lt; b if and only if a &amp;lt; b + $\epsilon$ for every $\epsilon$ &amp;gt; 0.
FALSE. The forward direction holds, but the converse fails if we take $a=b=5$: $5 \lt 5+\epsilon$ for every $\epsilon \gt 0$, yet $5 \not\lt 5$.&lt;/p>
&lt;p>(b) Two real numbers satisfy a &amp;lt; b if a &amp;lt; b + $\epsilon$ for every $\epsilon$ &amp;gt; 0.
The statement is FALSE, again taking $a=b=5$.&lt;/p>
&lt;p>(c) Two real numbers satisfy a ≤ b if and only if a &amp;lt; b + $\epsilon$ for every $\epsilon$ &amp;gt; 0.
Forward (trivial):
$a \le b \lt b + \epsilon$.
Reverse:
Suppose $a \lt b + \epsilon$, $\forall \epsilon \gt 0$.
Let $\delta = a - b$; then $b + \delta = b + a - b = a$, so $a \not\lt b + \delta$. Hence $\delta$ cannot be positive, so $\delta = a - b \le 0$ and $a \le b$.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1633992/if-true-prove-that-2-real-numbers-satisfy-ab-iff-ab-epsilon-forall-e/1633997">https://math.stackexchange.com/questions/1633992/if-true-prove-that-2-real-numbers-satisfy-ab-iff-ab-epsilon-forall-e/1633997&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.2.12.&lt;/strong> Let $y_1 = 6$, and for each $n\in \Bbb N$ define $y_{n+1} = (2y_n − 6)/3$.&lt;/p>
&lt;p>(a) Use induction to prove that the sequence satisfies $y_n &amp;gt; −6$ for all $n \in \Bbb N$.&lt;/p>
&lt;ul>
&lt;li>Base Case: $y_1 = 6 &amp;gt; -6$&lt;/li>
&lt;li>Inductive case. Assume $y_k&amp;gt;-6$.&lt;/li>
&lt;li>$y_{k+1}=\frac{2y_k}{3}-2&amp;gt;\frac{2\times(-6)}{3}-2=-4-2=-6$&lt;/li>
&lt;li>By induction our original claim is proved.&lt;/li>
&lt;/ul>
&lt;p>(b) Use another induction argument to show the sequence $(y_1, y_2, y_3, \dots)$ is decreasing.&lt;/p>
&lt;ul>
&lt;li>Base Case: $y_2 = 2 &amp;lt; 6 = y_1$&lt;/li>
&lt;li>Inductive case. Assume $y_{k+1}&amp;lt;y_k$.&lt;/li>
&lt;li>$y_{k+2}=\frac{2y_{k+1}}{3}-2
=\frac{2y_{k+1}}{3}+\frac{-6}{3}
&amp;lt;\frac{2y_{k+1}}{3}+\frac{y_{k+1}}{3}
=y_{k+1}$, where the strict inequality uses $-6 \lt y_{k+1}$ from part (a)&lt;/li>
&lt;li>By induction our original claim is proved.&lt;/li>
&lt;/ul>
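&lt;p>Both induction claims are easy to confirm numerically for the first few terms (a sanity check of ours, using exact rational arithmetic):&lt;/p>

```python
from fractions import Fraction

# y1 = 6, y_{n+1} = (2*y_n - 6)/3
ys = [Fraction(6)]
for _ in range(29):
    ys.append((2 * ys[-1] - 6) / 3)

assert all(y > -6 for y in ys)                 # part (a): bounded below by -6
assert all(a > b for a, b in zip(ys, ys[1:]))  # part (b): strictly decreasing
print(float(ys[-1]))  # the terms approach the fixed point -6
```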
&lt;p>&lt;strong>Exercise 1.3.2.&lt;/strong> Give an example of each of the following, or state that the request is impossible.&lt;/p>
&lt;p>(a) A set B with inf B $\geq$ sup B.
$B={1}$&lt;/p>
&lt;p>(b) A finite set that contains its infimum but not its supremum.
Impossible. A nonempty finite set contains its maximum element, which is its supremum (and $\emptyset$ has no infimum or supremum at all).&lt;/p>
&lt;p>(c) A bounded subset of Q that contains its supremum but not its infimum.
$C={1/x|x\in\Bbb N}$ contains its supremum 1 but not its infimum 0.&lt;/p>
&lt;p>&lt;strong>Exercise 1.3.4.&lt;/strong> Let $A_1,A_2,A_3,\dots$ be a collection of nonempty sets, each of which is bounded above.&lt;/p>
&lt;p>(a)Find a formula for $sup(A_1 \cup A_2)$. Extend this to $sup(\cup^n_{k=1}A_k)$.
$sup {sup A_1, sup A_2}$
$sup {sup A_1, sup A_2 \dots sup A_n}$&lt;/p>
&lt;p>(b) Consider $sup(\cup^{\infty}_{k=1}A_k)$. Does the formula in (a) extend to the infinite case?&lt;/p>
&lt;p>No. Consider $A_i = {i}$: then $sup(\cup^n_{k=1}A_k)=n$ for every finite $n$, but $\cup^{\infty}_{k=1}A_k=\Bbb N$ is unbounded above, so $sup(\cup^{\infty}_{k=1}A_k)$ does not exist.&lt;/p>
&lt;p>&lt;strong>Exercise 1.3.6.&lt;/strong> Given sets A and B, define $A+B = {a+b : a \in A$ and $b \in B}$.
Follow these steps to prove that if A and B are nonempty and bounded above then sup(A + B) = supA + supB.&lt;/p>
&lt;p>(a) Let s = sup A and t = sup B. Show s + t is an upper bound for A + B.
Take $a \in A$ and $b \in B$, by definition, $a\leq s$ and $b \leq t$ and $a+b \in A+B$. So $a+b \leq s+t$.&lt;/p>
&lt;p>(b) Now let $u$ be an arbitrary upper bound for A + B, and temporarily fix $a \in A$. Show $t \leq u − a$.
&lt;p>Temporarily fix $a \in A$. For every $b \in B$ we have $a + b \in A+B$, so $a + b \leq u$ and hence $b \leq u - a$. Thus $u - a$ is an upper bound for $B$, and so by definition of $\sup B$, $$\sup B = t \leq u - a.$$&lt;/p>
&lt;p>(c) Finally, show sup(A + B) = s + t.
Taking $u = \sup(A+B)$ in (b) and rearranging gives ${a} \leq \sup(A +B) − \sup B$ for all $a \in A$.&lt;/p>
&lt;p>Hence, $\sup(A +B) − \sup B$ is an upper bound for $A$.&lt;/p>
&lt;p>By the definition of supremum, this means ${\sup A} \leq \sup(A + B) − \sup B$, i.e. $\sup A + \sup B \leq \sup(A + B)$, i.e.&lt;/p>
&lt;p>$$s+t \leq sup(A+B)$$&lt;/p>
&lt;p>Also, by inequality $a+b \leq s+t$ in (a) and the definition of supremum:
$$sup(A+B)\leq s+t$$&lt;/p>
&lt;p>We conclude that
$$sup(A+B)= s+t.$$&lt;/p>
&lt;p>(d) Construct another proof of this same fact using Lemma 1.3.8.&lt;/p>
&lt;p>Let $\epsilon \gt 0.$ Then there exists $a \in A$ and $b \in B$ such that $a \gt \sup A − \frac{\epsilon}{2}$ and $b \gt \sup B − \frac{\epsilon}{2}.$
Then $a + b \in A + B$. We have
$${\sup(A + B)} \geq a + b {\gt \sup A + \sup B - \epsilon} \implies { \sup(A + B) \gt \sup A + \sup B - \epsilon }.$$ Since $\epsilon$ is arbitrary, $\sup(A + B) \geq \sup A + \sup B=s+t$&lt;/p>
&lt;p>For the reverse inequality, take $a \in A$ and $b \in B$; by definition, $a\leq s$ and $b \leq t$, so every element $a+b$ of $A+B$ satisfies $a+b \leq s+t$. By the definition of supremum,
$$sup(A+B)\leq s+t$$&lt;/p>
&lt;p>We conclude that
$$sup(A+B)= s+t.$$&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/4551/how-can-i-prove-supab-sup-a-sup-b-if-ab-ab-mid-a-in-a-b-in-b">https://math.stackexchange.com/questions/4551/how-can-i-prove-supab-sup-a-sup-b-if-ab-ab-mid-a-in-a-b-in-b&lt;/a>&lt;/p>
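&lt;p>For finite sets the supremum is just the maximum, so the identity $\sup(A+B)=\sup A+\sup B$ can be spot-checked (our illustration; it is not a substitute for the proof, which covers arbitrary bounded sets):&lt;/p>

```python
def sumset(A, B):
    # A + B = {a + b : a in A, b in B}
    return {a + b for a in A for b in B}

A, B = {1, 3, 5}, {-2, 0, 4}
assert max(sumset(A, B)) == max(A) + max(B)   # 9 == 5 + 4
```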
&lt;p>&lt;strong>Exercise 1.3.8.&lt;/strong> Compute, without proofs, the suprema and infima (if they exist) of the following sets:&lt;/p>
&lt;p>(a) ${m/n : m, n \in N$ with $m &amp;lt; n}$.
sup: $1$ inf: $0$&lt;/p>
&lt;p>(b) ${(−1)^m/n : m, n \in N}$.
sup: $1$ inf: $-1$&lt;/p>
&lt;p>(c) ${n/(3n+ 1) : n \in N}$.
sup: $\frac{1}{3}$ inf: $\frac{1}{4}$&lt;/p>
&lt;p>(d) ${m/(m+ n) : m, n \in N}$.
sup: 1 inf: 0&lt;/p>
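&lt;p>These answers can be spot-checked by evaluating each set over a large finite range (our quick check; finite sampling only suggests, it does not prove, the stated suprema and infima):&lt;/p>

```python
from fractions import Fraction

# (a) {m/n : m < n}: values stay strictly between 0 and 1
a = [Fraction(m, n) for n in range(2, 80) for m in range(1, n)]
assert min(a) > 0 and 1 > max(a)

# (c) {n/(3n+1)}: increasing, starting at 1/4 and staying below 1/3
c = [Fraction(n, 3 * n + 1) for n in range(1, 200)]
assert min(c) == Fraction(1, 4)
assert Fraction(1, 3) > max(c)
```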
&lt;p>&lt;strong>Exercise 1.3.9.&lt;/strong>&lt;/p>
&lt;p>(a) If supA &amp;lt; supB, show that there exists an element $b \in B$ that is an upper bound for A.
Take $\epsilon=supB-supA$. By Lemma 1.3.8 there exists $b \in B$ with $b&amp;gt;supB-\epsilon=supA$. Then $b \gt supA \geq a$ for every $a \in A$, so $b$ is an upper bound for $A$, as desired.&lt;/p>
&lt;p>(b) Give an example to show that this is not always the case if we only assume supA ≤ supB.
Take $A={0}$ and $B={-1/n,n \in \Bbb N}$&lt;/p>
&lt;p>&lt;strong>Exercise 1.3.10 (Cut Property).&lt;/strong>&lt;/p>
&lt;p>(a) Use the Axiom of Completeness to prove the Cut Property.
Suppose we have the axiom of completeness and assume you have $A$ and $B$ as in the statement of the cut property. Then, as $B$ is nonempty, $A$ has an upper bound. Let $c$ be the least upper bound for $A$.&lt;/p>
&lt;p>For $a\in A$, $a\le c$, because $c$ is an upper bound for $A$;
For $b\in B$, $c\le b$, because $b$ is an upper bound for $A$ and $c$ is the least upper bound.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1616583/use-the-axiom-of-completeness-to-prove-the-cut-property">https://math.stackexchange.com/questions/1616583/use-the-axiom-of-completeness-to-prove-the-cut-property&lt;/a>&lt;/p>
&lt;p>(b) Show that the implication goes the other way.
Suppose we know the Cut Property. Consider a nonempty set $E$ with an upper bound. Then let&lt;/p>
&lt;p>$B={x\in\mathbb{R}: x\geq e \forall e\in E}$
i.e. $B$ is the set of all upper bounds of $E$&lt;/p>
&lt;p>and let $A$ be the complement of $B$.
$A={x\in\mathbb{R}: x\lt e$ for some $e\in E}$&lt;/p>
&lt;p>Since $E$ is non-empty and bounded above, $B$ is nonempty, and so is $A$. The union of $A$ and $B$ is $\mathbb{R}$ by construction. Suppose $a\in A$ and $b\in B$. If $b\le a$, then $e\leq b \leq a$ for all $e\in E$, so $a\in B$: a contradiction, since $A$ and $B$ are disjoint.&lt;/p>
&lt;p>Since $b&amp;gt;a$ for all $a \in A$ and $b \in B$, we know there exists $d$ such that $a \leq d$ and $d \leq b$ by Cut Property. We want to show that $d$ is the supremum for E.&lt;/p>
&lt;p>To show that $d$ is an upper bound of $E$, suppose some $s$ in $E$ exceeds $d$. Since $(s + d)/2$ exceeds $d$, it
belongs to $B$, so by the definition of $B$ it must be an upper bound of $E$, which is impossible since $s &amp;gt; (s + d)/2$. To show that $d$ is a least upper bound of $E$, suppose that some $a &amp;lt; d$ is an upper bound of $E$. But $a$ (being less than $d$) is in $A$, so it can’t be an upper bound of $E$.&lt;/p>
&lt;p>Source: &lt;a href="https://arxiv.org/abs/1204.4483">https://arxiv.org/abs/1204.4483&lt;/a>&lt;/p>
&lt;p>(c) Give a concrete example showing that the Cut Property is not a valid statement when $\Bbb R$ is replaced by $\Bbb Q$.
Hint: find a break point of $\Bbb Q$, e.g. $\sqrt{3},\sqrt{5},\dots$
Working inside $\Bbb Q$, consider $A = (-\infty, 0) \cup {x \ge 0 : x^2 \le 3}$ and $B = {x \ge 0 : x^2 \gt 3}$.
If such a number $c$ existed, we would have $c^2 = 3$. But there is no rational number for which this is true.&lt;/p>
&lt;p>&lt;strong>Exercise 1.3.11.&lt;/strong> Without worrying about formal proofs for the moment, decide if the following statements about suprema and infima are true or false. For any that are false, supply an example where the claim in question does not appear to hold.&lt;/p>
&lt;p>(a) TRUE. Since $A \subset B$, $\sup B$ is an upper bound for $A$. Since $\sup A$ is the least upper bound for $A$ by definition, it must be less than or equal to $\sup B$.&lt;/p>
&lt;p>(b) TRUE. Taking $c=(sup A + inf B)/2$ works for nonempty sets $A$ and $B$.&lt;/p>
&lt;p>(c) FALSE. Consider $A = (-\infty, 3)$ and $B = (3, \infty)$: here $a&amp;lt;3&amp;lt;b$ for all $a \in A$ and $b \in B$, yet $sup A = 3 = inf B$.&lt;/p>
&lt;p>&lt;strong>Exercise 1.4.2.&lt;/strong> Let $A \subseteq \Bbb R$ be nonempty and bounded above, and let $s \in \Bbb R$ have the property that for all $n \in \Bbb N$, s + 1/n is an upper bound for A and s − 1/n is not an upper bound for A. Show s = supA.&lt;/p>
&lt;p>Suppose s is not an upper bound for A. Then $\exists a \in A$ such that $s \lt a$. Take $\delta = a - s$ and $n_0 \in \Bbb N$ to be large enough so that $1/\delta &amp;lt; n_0$ i.e. $1/n_0 &amp;lt; \delta$. By definition, $s+1/n_0$ is an upper bound for $A$, but $s+1/n_0&amp;lt;s+\delta=a\in A$: a contradiction.&lt;/p>
&lt;p>Let $\epsilon&amp;gt;0$. Take $n_1 \in \Bbb N$ to be large enough so that $1/\epsilon &amp;lt; n_1$ i.e. $1/n_1 &amp;lt; \epsilon$. By definition, $\exists a \in A$ such that $ s-\epsilon \lt s-1/n_1 \lt a$. Hence s = sup A.&lt;/p>
&lt;p>&lt;strong>Exercise 1.4.4.&lt;/strong> Let $a \lt b$ be real numbers and consider the set $T=\mathbb{Q}\cap[a,b]$. Show $\sup T=b$&lt;/p>
&lt;p>If $x\in T$, then $x\in [a,b]$, and if $x\in [a,b]$, then $x\leq b$ i.e. $b$ is an upper bound for T.&lt;/p>
&lt;p>To show that $b$ is the least upper bound of $T$, suppose that some $c \lt b$ is an upper bound of $T$. Since the rationals are dense in $\Bbb R$, there exists a rational $t$ with $\max(a,c) \lt t \lt b$. Then $t \in \mathbb{Q}\cap[a,b] = T$, so $t \leq c$ by definition of upper bound, contradicting $c \lt t$.&lt;/p>
&lt;p>&lt;strong>Exercise 1.4.6.&lt;/strong> Which of the following sets are dense in $\Bbb R$? Take $p \in \Bbb Z$ and $q \in \Bbb N$ in every case.&lt;/p>
&lt;p>(a) The set of all rational numbers $p/q$ with $q \leq 10$.
Not dense in $\Bbb R$.
For any distinct $\frac pq$ and $\frac{p'}{q'}$ with $q,q'\le 10$, the difference $$ \frac pq-\frac{p'}{q'}=\frac{pq'-p'q}{qq'}$$ is a fraction with non-zero numerator and denominator $\le 10^2$, hence is $\ge \frac{1}{10^2}$ in absolute value. For example, no element in this set can be found between $1/500$ and $2/500$.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/1638526/how-do-you-show-a-set-is-dense-for-example-is-the-set-of-all-rational-numbers">https://math.stackexchange.com/questions/1638526/how-do-you-show-a-set-is-dense-for-example-is-the-set-of-all-rational-numbers&lt;/a>&lt;/p>
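&lt;p>The $1/10^2$ gap bound can be confirmed by enumerating every fraction in $[0,1]$ with denominator at most $10$ (our check; in fact the smallest gap is $1/90$, attained by Farey neighbours such as $8/9$ and $9/10$):&lt;/p>

```python
from fractions import Fraction

vals = sorted({Fraction(p, q) for q in range(1, 11) for p in range(0, q + 1)})
gaps = [b - a for a, b in zip(vals, vals[1:])]

assert min(gaps) >= Fraction(1, 100)   # the bound used in the solution
assert min(gaps) == Fraction(1, 90)    # the gap between 8/9 and 9/10
```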
&lt;p>(b) The set of all rational numbers $p/q$ with $q$ a power of $2$.
Dense in $\Bbb R$.
Consider two arbitrary real numbers $a,b$ with $a\lt b$.
By the Archimedean Property there exists $n \in \mathbb N$ such that $$0\lt \frac{1}{n} \lt b-a \;\;\text{which implies}\;\; 0\lt \frac{1}{2^{n}}\lt \frac{1}{n}\lt b-a$$
Thus we have $1\lt b2^n-a2^n$.
As the distance between $a2^n$ and $b2^n$ is greater than $1$, there exists an integer $m$ such that $a2^{n}\lt m\lt b2^{n}$,
which implies that $a \lt \frac{m}{2^{n}} \lt b$. Since $a$ and $b$ were arbitrary, the claim is proved.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/3968925/proof-of-dyadic-rational-numbers-are-dense-in-mathbb-r">https://math.stackexchange.com/questions/3968925/proof-of-dyadic-rational-numbers-are-dense-in-mathbb-r&lt;/a>&lt;/p>
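&lt;p>The proof is constructive, so it translates directly into code. A sketch of ours following the same two steps (the helper name &lt;code>dyadic_between&lt;/code> is hypothetical):&lt;/p>

```python
import math
from fractions import Fraction

def dyadic_between(a, b):
    # step 1: choose n so that 1/2^n is smaller than b - a (Archimedean Property)
    n = 0
    while Fraction(1, 2 ** n) >= b - a:
        n += 1
    # step 2: a*2^n and b*2^n are more than 1 apart, so an integer m
    # lies strictly between them
    m = math.floor(a * 2 ** n) + 1
    return Fraction(m, 2 ** n)

x = dyadic_between(Fraction(1, 3), Fraction(2, 5))
assert x > Fraction(1, 3) and Fraction(2, 5) > x
assert x.denominator & (x.denominator - 1) == 0   # denominator is a power of 2
```

For example, between $1/3$ and $2/5$ the routine returns $3/8$.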
&lt;p>(c) The set of all rational numbers $p/q$ with $10|p| \geq q$.&lt;/p>
&lt;p>Not dense in $\Bbb R$.
Any $p/q$ in this set satisfies $|p/q| \geq 1/10$ (and $p=0$ is impossible, since $10\cdot 0 \lt q$), so the open interval $(-1/10,1/10)$ contains no element of the set.
For example, no element in this set can be found between $-1/30$ and $-1/20$.&lt;/p>
&lt;p>Source: &lt;a href="https://www.reddit.com/r/HomeworkHelp/comments/7ruu7u/real_analysis_density_of_subsets_of_q_in_r/">https://www.reddit.com/r/HomeworkHelp/comments/7ruu7u/real_analysis_density_of_subsets_of_q_in_r/&lt;/a>&lt;/p>
&lt;p>&lt;strong>Exercise 1.4.8.&lt;/strong> Give an example of each or state that the request is impossible. When a request is impossible, provide a compelling argument for why this is the case.&lt;/p>
&lt;p>(a) Two sets A and B with $A \cap B = \emptyset$, supA = supB, $supA \not \in A $ and $supB \not \in B$.
$A={x\in (0,1):x\not\in \Bbb Q}$, the irrationals in $(0,1)$
$B={x\in (0,1):x\in \Bbb Q}$, the rationals in $(0,1)$&lt;/p>
&lt;p>(b) A sequence of nested open intervals $J_1 \supseteq J_2 \supseteq J_3 \supseteq \dots $ with $\cap^\infty_{n=1}J_n$ nonempty but containing only a finite number of elements.
$J_n = (5-1/n,5+1/n), n \in \Bbb N$, with $\cap^\infty_{n=1}J_n={5}$, a single point&lt;/p>
&lt;p>(c) A sequence of nested unbounded closed intervals $L_1 \supseteq L_2 \supseteq L_3 \supseteq \dots $ with $\cap^\infty_{n=1}L_n=\emptyset$ (An unbounded closed interval has the form $[a,\infty) = {x \in \Bbb R : x \geq a}$.)
$L_n = [n,\infty), n \in \Bbb N$, with $\cap^\infty_{n=1}L_n=\emptyset$&lt;/p>
&lt;p>(d) A sequence of closed bounded (not necessarily nested) intervals $I_1, I_2, I_3, \dots$ with the property that $\cap^N_{n=1} I_n \neq \emptyset$ for all $N \in \Bbb N$, but $\cap^\infty_{n=1} I_n = \emptyset$.
The answer is negative, because then $\cap^N_{n=1} I_n$ for all $N \in \Bbb N$ is a decreasing sequence of non-empty closed and bounded intervals and therefore its intersection is non-empty.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/2619781/intersection-of-a-sequence-of-closed-intervals">https://math.stackexchange.com/questions/2619781/intersection-of-a-sequence-of-closed-intervals&lt;/a>&lt;/p>
&lt;h2 id="appendix-for-unused-sources">Appendix for unused sources&lt;/h2>
&lt;p>By definition, $d$ is an upper bound for A. So it is an upper bound for $E$, because if there exists $e \in E$ with $d&amp;lt;e$, then $d&amp;lt;\frac{d+e}{2}$. $\frac{d+e}{2}$ cannot be in $B$ (indeed, it&amp;rsquo;s not an upper bound for $E$, because it&amp;rsquo;s less than $e$) so it must be in $A$, but this contradicts that $d$ is an upper bound for $A$.&lt;/p>
&lt;p>Source: &lt;a href="https://math.stackexchange.com/questions/2228772/assume-mathbbr-possesses-the-cut-property-and-let-e-be-a-nonempty-that-is-b">https://math.stackexchange.com/questions/2228772/assume-mathbbr-possesses-the-cut-property-and-let-e-be-a-nonempty-that-is-b&lt;/a>&lt;/p>
&lt;p>If possible, suppose $A$ has a greatest member, say $a'$. Then $a' \in A \Rightarrow a' \not\in B$, so $a'$ is not an upper bound of $E$, and there exists $s \in E$ such that $a' &amp;lt; s$. Since $(a'+s)/2 &amp;gt; a'$ and $a'$ is the greatest member of $A$, we must have $(a'+s)/2 \in B$, so $(a'+s)/2$ is an upper bound of $E$; but $(a'+s)/2 &amp;lt; s$ with $s \in E$, a contradiction. Hence $A$ has no greatest member, and so $B$ has a least member. Therefore the set of upper bounds of a non-empty set $E$ bounded above has a least member, which is the completeness axiom in $\Bbb R$. Hence the theorem is proved.&lt;/p>
&lt;p>Read more: &lt;a href="https://www.emathzone.com/tutorials/real-analysis/dedekind-property.html#ixzz72M0m5FcR">https://www.emathzone.com/tutorials/real-analysis/dedekind-property.html#ixzz72M0m5FcR&lt;/a>&lt;/p></description></item><item><title>Modelling the dynamics of a Chlamydia infection</title><link>https://siqi-zheng.rbind.io/post/2021-08-04-bayes-bio-model-1/</link><pubDate>Wed, 04 Aug 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-08-04-bayes-bio-model-1/</guid><description>&lt;p>The assignment requirements can be found above by clicking the &amp;lsquo;assignment&amp;rsquo; icon.&lt;/p>
&lt;ul>
&lt;li>
&lt;a href="#task-1">Task 1&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#task-2">Task 2&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#a-note-on-selection-of-priors">A note on selection of priors&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#dose-1">Dose 1&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#prior-predictive-check-for-model-on-dose-1-data">Prior Predictive Check for Model on Dose 1 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#posterior-predictive-check-for-model-on-dose-1-data">Posterior Predictive Check for Model on Dose 1 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#leave-one-out-cross-validation-for-dose-1">Leave one out cross validation for dose 1&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#dose-2">Dose 2&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#prior-predictive-check-for-model-on-dose-2-data">Prior Predictive Check for Model on Dose 2 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#posterior-predictive-check-for-model-on-dose-2-data">Posterior Predictive Check for Model on Dose 2 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#leave-one-out-cross-validation-for-dose-2">Leave one out cross validation for dose 2&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#dose-3">Dose 3&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#prior-predictive-check-for-model-on-dose-3-data">Prior Predictive Check for Model on Dose 3 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#posterior-predictive-check-for-model-on-dose-3-data">Posterior Predictive Check for Model on Dose 3 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#leave-one-out-cross-validation-for-dose-3">Leave one out cross validation for dose 3&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#dose-4">Dose 4&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#prior-predictive-check-for-model-on-dose-4-data">Prior Predictive Check for Model on Dose 4 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#posterior-predictive-check-for-model-on-dose-4-data">Posterior Predictive Check for Model on Dose 4 Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#leave-one-out-cross-validation-for-dose-4">Leave one out cross validation for dose 4&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#remarks">Remarks&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#task-3">Task 3&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="task-1">Task 1&lt;/h2>
&lt;p>We have
$$ \frac{dE}{dt} = 0.004 - 2 E(t) - \kappa_1C(t) E(t) $$
$$ \frac{dC}{dt} = P \kappa_2 I(t) - \mu C(t) - \kappa_1 C(t) E(t) $$
$$ \frac{dI}{dt} = \kappa_1 C(t) E(t) - \gamma I(t) - \kappa_2 I(t) $$&lt;/p>
&lt;p>The following code loads the data. Note that doses 1 to 4 correspond to doses 10&lt;sup>1&lt;/sup> to 10&lt;sup>4&lt;/sup>.&lt;/p>
&lt;pre>&lt;code>data = readRDS(&amp;quot;rank_et_al_2003_data.RDS&amp;quot;)
dose1 &amp;lt;- data[1:10,]
dose2 &amp;lt;- data[11:20,]
dose3 &amp;lt;- data[21:30,]
dose4 &amp;lt;- data[31:40,]
dose5 &amp;lt;- data[41:50,]
&lt;/code>&lt;/pre>
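&lt;p>As a quick sanity check (a sketch, assuming the dataset has 50 rows with 10 per dose, as the slicing above implies), we can confirm the slices line up with the dose labels:&lt;/p>
&lt;pre>&lt;code># Hypothetical check: 10 rows per dose, each slice holding a single dose level
stopifnot(nrow(data) == 50, nrow(dose1) == 10)
table(data$dose) # counts of rows per dose level
&lt;/code>&lt;/pre>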
&lt;p>The following code checks whether the model is specified correctly.&lt;/p>
&lt;pre>&lt;code>library(deSolve) # provides ode()
model &amp;lt;- function (t, y, params) {
dy1 &amp;lt;- (40 * 10 ^ (-4)) - 2 * y[1] - params[1] * y[2] * y[1]
dy2 &amp;lt;- params[2] * params[3] * y[3] - params[4] * y[2] - params[1] * y[2] * y[1]
dy3 &amp;lt;- params[1] * y[2] * y[1] - params[5] * y[3] - params[3] * y[3]
list(c(dy1, dy2, dy3))
}
yini &amp;lt;- c(E = 0.96, C = 0.001, I = 0)
params &amp;lt;- c(
kappa1 = 1000,
P = 1000,
kappa2 = 1.3, # 0.4-1.3
mu = 1.2,     # coefficient of the -mu*C(t) term (position 4)
gamma = 1.2
)
out &amp;lt;- ode(y = yini, times = seq(0, 30, 0.1), func = model, parms = params)
df_out &amp;lt;- as.data.frame(out)
# A summary of the numerical solution for C(t)
summary(df_out$C)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001 1.728 1.728 15.164 1.832 248.361
# An overview of the data
head(data)
## t C dose number
## 1 3 0.000 10^1 2
## 2 6 46.829 10^1 2
## 3 9 9.106 10^1 2
## 4 12 18.862 10^1 2
## 5 15 25.366 10^1 2
## 6 18 21.463 10^1 2
&lt;/code>&lt;/pre>
&lt;p>The following code is r2.stan (the Stan model for the differential equations).&lt;/p>
&lt;pre>&lt;code>functions {
vector rhs(real t, vector y,
real P, real kappa1, real kappa2, real gamma, real mu) {
vector[3] dydt;
dydt[1] = (40 * 1e-4) - 2 * y[1] - kappa1 * y[2] * y[1];
dydt[2] = P * kappa2 * y[3] - mu * y[2] - kappa1 * y[2] * y[1];
dydt[3] = kappa1 * y[2] * y[1] - gamma * y[3] - kappa2 * y[3];
return dydt;
}
}
data {
int&amp;lt;lower=0&amp;gt; N;
vector [N] y;
real t[N]; // This must be an array!
// Control
int&amp;lt;lower=0, upper = 1&amp;gt; only_prior;
}
parameters {
real&amp;lt;lower = 0, upper = 2000&amp;gt; P;
real&amp;lt;lower = 0, upper = 2000&amp;gt; kappa1;
real&amp;lt;lower = 0.01, upper = 1.5&amp;gt; kappa2;
real&amp;lt;lower = 0.01, upper = 1.5&amp;gt; gamma;
real&amp;lt;lower = 0.01, upper = 1.5&amp;gt; mu;
real&amp;lt;lower = 0&amp;gt; c;
}
transformed parameters {
// [a, b] makes a row_vector, [a, b]' makes a column vector
// 0 is an int! 0.0 is a real!
vector[N] C; // Outputted
{ // Local computation - isn't saved or outputted!
vector[3] solution[N] = ode_bdf(rhs, [0.96, c, 0.0]', 0, t, P, kappa1, kappa2, gamma, mu);
for(i in 1:N){
C[i] = solution[i,2];
}
}
}
model{
c ~ uniform(0,1);
P ~ uniform(0, 2000);
kappa1 ~ uniform(0, 2000);
kappa2 ~ uniform(0.01, 1.5);
gamma ~ uniform(0.01, 1.5);
mu ~ uniform(0.01, 1.5);
if(only_prior == 0) {
y ~ normal(C, 50);
}
}
generated quantities {
vector&amp;lt;lower=0&amp;gt;[N] y_pred;
vector&amp;lt;lower=0&amp;gt;[N] log_lik;
for (i in 1:N) {
y_pred[i] = abs(normal_rng(C[i], 50)); // abs because the minimum is 0, i.e. we are only interested in
// values larger than zero; sd = 50 because the standard deviation of the dataset is around 50
log_lik[i] = abs(normal_lpdf(y[i] | C[i], 50));
}
}
&lt;/code>&lt;/pre>
&lt;p>When 0.4 &amp;lt; &lt;em>κ&lt;/em>&lt;sub>2&lt;/sub> &amp;lt; 1.4 and 0.01 &amp;lt; &lt;em>C&lt;/em>(0) &amp;lt; 1, the maximum of &lt;em>C&lt;/em>(&lt;em>t&lt;/em>) lies between 100 and 250. We might therefore take the mean of &lt;em>κ&lt;/em>&lt;sub>2&lt;/sub> to be 0.9 and its standard deviation to be 0.3, so that values in the range $[0.3,1.5]$ are highly probable. In practice, a uniform distribution is used instead, because we do not know the exact distribution and it is hard to estimate the standard deviation of each parameter.&lt;/p>
&lt;p>Therefore, we assume weakly informative priors for the other parameters in Stan as well: in particular, uniform distributions that cover the example values. This is largely for two reasons. First, most parameters that require priors do not come with enough information for us to set up a good distribution. Second, since there are five different doses in the dataset, a distribution that can take values over a large range is preferred. As you will see later on, this choice provides a fair estimate of the parameters.&lt;/p>
&lt;p>We assume the parameters on the 10&lt;sup>3&lt;/sup> scale range from 0 to 2000. The lower bound 0 is self-explanatory; the upper bound is taken as 1000 + (1000 - lower bound) = 2000. Similarly, we take 1.5 as the upper bound for the other parameters because of the range of &lt;em>κ&lt;/em>&lt;sub>2&lt;/sub>.&lt;/p>
&lt;p>To estimate the rate of change of C at t = 0, we take the average C on day 3 for each dose and divide it by 3 (days). This assumes that C is increasing over that interval, so that C(0) is not underestimated, which is reasonable given that C increases until the dose takes effect. However, this approach has some limitations, which we address further in task 3. C(0) is therefore an average of the estimates of C(0) from all 5 doses.&lt;/p>
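&lt;p>A minimal sketch of this day-3 estimate (hypothetical helper names; it assumes the column names shown by &lt;code>head(data)&lt;/code> above):&lt;/p>
&lt;pre>&lt;code># Hypothetical sketch: average C on day 3 for each dose, divided by 3 (days)
c0_day3 &amp;lt;- sapply(list(dose1, dose2, dose3, dose4, dose5),
                  function(d) mean(d$C[d$t == 3]) / 3)
mean(c0_day3) # average of the estimates from the 5 doses
&lt;/code>&lt;/pre>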
&lt;pre>&lt;code>mod &amp;lt;- cmdstan_model(&amp;quot;r2.stan&amp;quot;)
&lt;/code>&lt;/pre>
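&lt;p>The fits used below (&lt;code>fit&lt;/code>, &lt;code>fit2&lt;/code>, &amp;hellip;) are produced by calls along the following lines (a sketch; the data list names follow the &lt;code>data&lt;/code> block of r2.stan, and setting &lt;code>only_prior = 1&lt;/code> samples from the prior only):&lt;/p>
&lt;pre>&lt;code># Sketch of a cmdstanr fitting call for dose 1
fit &amp;lt;- mod$sample(
  data = list(N = nrow(dose1), y = dose1$C, t = dose1$t, only_prior = 0),
  chains = 4, parallel_chains = 4
)
&lt;/code>&lt;/pre>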
&lt;h2 id="task-2">Task 2&lt;/h2>
&lt;h3 id="a-note-on-selection-of-priors">A note on selection of priors&lt;/h3>
&lt;p>I attempted different ranges of values for the uniform distributions; however, the posterior predictive check shows that the noise is higher than expected and the model does not fit very well. I also tried other models, including normal distributions and uniform distributions with other parameters, but the results were no better, so the uniform distribution is used. Still, I would suggest that future researchers collect more data (one data point per day) or apply other models designed specifically for this question.&lt;/p>
&lt;h3 id="dose-1">Dose 1&lt;/h3>
&lt;h4 id="prior-predictive-check-for-model-on-dose-1-data">Prior Predictive Check for Model on Dose 1 Data&lt;/h4>
&lt;pre>&lt;code>mcmc_hist(fit$draws(&amp;quot;C&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose1-prior-predictive-check-1.png" alt="">&lt;/p>
&lt;pre>&lt;code>mcmc_hist(fit$draws(&amp;quot;y_pred&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose1-prior-predictive-check-2.png" alt="">&lt;/p>
&lt;h4 id="posterior-predictive-check-for-model-on-dose-1-data">Posterior Predictive Check for Model on Dose 1 Data&lt;/h4>
&lt;pre>&lt;code>yrep = fit$draws() %&amp;gt;% reshape2::melt() %&amp;gt;% filter(str_detect(variable, &amp;quot;y_pred&amp;quot;)) %&amp;gt;%
extract(col = variable, into = &amp;quot;ind&amp;quot;,
regex = &amp;quot;y_pred\\[([0-9]*)\\]&amp;quot;,
convert = TRUE) %&amp;gt;%
pivot_wider(id_cols = c(&amp;quot;chain&amp;quot;,&amp;quot;iteration&amp;quot;),
names_from = &amp;quot;ind&amp;quot;) %&amp;gt;%
select(-c(&amp;quot;chain&amp;quot;, &amp;quot;iteration&amp;quot;)) %&amp;gt;% as.matrix
ppc_stat(dose1$C, yrep, stat = &amp;quot;min&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose1-posterior-predictive-check-1.png" alt="">&lt;/p>
&lt;h4 id="leave-one-out-cross-validation-for-dose-1">Leave one out cross validation for dose 1&lt;/h4>
&lt;pre>&lt;code>## By default, it looks for something called &amp;quot;log_lik&amp;quot;, but you can override this
## with the variables = argument. Eg if you called your log-likelihood &amp;quot;ll&amp;quot;,
## you could run loo1 &amp;lt;- fit$loo(save_psis=TRUE, variable = &amp;quot;ll&amp;quot;)
loo1 &amp;lt;- fit$loo(save_psis=TRUE)
print(loo1)
##
## Computed from 4000 by 10 log-likelihood matrix
##
## Estimate SE
## elpd_loo 49.1 0.4
## p_loo 0.3 0.3
## looic -98.3 0.7
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 6 60.0% 2223
## (0.5, 0.7] (ok) 0 0.0% &amp;lt;NA&amp;gt;
## (0.7, 1] (bad) 0 0.0% &amp;lt;NA&amp;gt;
## (1, Inf) (very bad) 4 40.0% 2420
## See help('pareto-k-diagnostic') for details.
plot(loo1)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="loo1-1.png" alt="">&lt;/p>
&lt;p>All except four of our points are good according to the leave-one-out cross-validation for this model. This is not great, but we may take a look at the estimate of C(0) to determine whether we really identify C(0) fairly.&lt;/p>
&lt;h3 id="dose-2">Dose 2&lt;/h3>
&lt;h4 id="prior-predictive-check-for-model-on-dose-2-data">Prior Predictive Check for Model on Dose 2 Data&lt;/h4>
&lt;pre>&lt;code>mcmc_hist(fit2$draws(&amp;quot;C&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose2-prior-predictive-check-1.png" alt="">&lt;/p>
&lt;pre>&lt;code>mcmc_hist(fit2$draws(&amp;quot;y_pred&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose2-prior-predictive-check-2.png" alt="">&lt;/p>
&lt;h4 id="posterior-predictive-check-for-model-on-dose-2-data">Posterior Predictive Check for Model on Dose 2 Data&lt;/h4>
&lt;pre>&lt;code>yrep2 = fit2$draws() %&amp;gt;% reshape2::melt() %&amp;gt;% filter(str_detect(variable, &amp;quot;y_pred&amp;quot;)) %&amp;gt;%
extract(col = variable, into = &amp;quot;ind&amp;quot;,
regex = &amp;quot;y_pred\\[([0-9]*)\\]&amp;quot;,
convert = TRUE) %&amp;gt;%
pivot_wider(id_cols = c(&amp;quot;chain&amp;quot;,&amp;quot;iteration&amp;quot;),
names_from = &amp;quot;ind&amp;quot;) %&amp;gt;%
select(-c(&amp;quot;chain&amp;quot;, &amp;quot;iteration&amp;quot;)) %&amp;gt;% as.matrix
ppc_stat(dose2$C, yrep2, stat = &amp;quot;min&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose2-posterior-predictive-check-1.png" alt="">&lt;/p>
&lt;h4 id="leave-one-out-cross-validation-for-dose-2">Leave one out cross validation for dose 2&lt;/h4>
&lt;pre>&lt;code>loo2 &amp;lt;- fit2$loo(save_psis=TRUE)
print(loo2)
##
## Computed from 4000 by 10 log-likelihood matrix
##
## Estimate SE
## elpd_loo 49.7 0.7
## p_loo 0.5 0.4
## looic -99.4 1.4
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 6 60.0% 2138
## (0.5, 0.7] (ok) 0 0.0% &amp;lt;NA&amp;gt;
## (0.7, 1] (bad) 0 0.0% &amp;lt;NA&amp;gt;
## (1, Inf) (very bad) 4 40.0% 1931
## See help('pareto-k-diagnostic') for details.
plot(loo2)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="loo2-1.png" alt="">&lt;/p>
&lt;h3 id="dose-3">Dose 3&lt;/h3>
&lt;h4 id="prior-predictive-check-for-model-on-dose-3-data">Prior Predictive Check for Model on Dose 3 Data&lt;/h4>
&lt;pre>&lt;code>mcmc_hist(fit3$draws(&amp;quot;C&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose3-prior-predictive-check-1.png" alt="">&lt;/p>
&lt;pre>&lt;code>mcmc_hist(fit3$draws(&amp;quot;y_pred&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose3-prior-predictive-check-2.png" alt="">&lt;/p>
&lt;h4 id="posterior-predictive-check-for-model-on-dose-3-data">Posterior Predictive Check for Model on Dose 3 Data&lt;/h4>
&lt;pre>&lt;code>yrep3 = fit3$draws() %&amp;gt;% reshape2::melt() %&amp;gt;% filter(str_detect(variable, &amp;quot;y_pred&amp;quot;)) %&amp;gt;%
extract(col = variable, into = &amp;quot;ind&amp;quot;,
regex = &amp;quot;y_pred\\[([0-9]*)\\]&amp;quot;,
convert = TRUE) %&amp;gt;%
pivot_wider(id_cols = c(&amp;quot;chain&amp;quot;,&amp;quot;iteration&amp;quot;),
names_from = &amp;quot;ind&amp;quot;) %&amp;gt;%
select(-c(&amp;quot;chain&amp;quot;, &amp;quot;iteration&amp;quot;)) %&amp;gt;% as.matrix
ppc_stat(dose3$C, yrep3, stat = &amp;quot;min&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose3-posterior-predictive-check-1.png" alt="">&lt;/p>
&lt;h4 id="leave-one-out-cross-validation-for-dose-3">Leave one out cross validation for dose 3&lt;/h4>
&lt;pre>&lt;code>loo3 &amp;lt;- fit3$loo(save_psis=TRUE)
print(loo3)
##
## Computed from 4000 by 10 log-likelihood matrix
##
## Estimate SE
## elpd_loo 51.0 1.3
## p_loo 3.7 2.1
## looic -101.9 2.5
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k &amp;lt; 0.5).
## See help('pareto-k-diagnostic') for details.
plot(loo3)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="loo3-1.png" alt="">&lt;/p>
&lt;h3 id="dose-4">Dose 4&lt;/h3>
&lt;h4 id="prior-predictive-check-for-model-on-dose-4-data">Prior Predictive Check for Model on Dose 4 Data&lt;/h4>
&lt;pre>&lt;code>mcmc_hist(fit4$draws(&amp;quot;C&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose4-prior-predictive-check-1.png" alt="">&lt;/p>
&lt;pre>&lt;code>mcmc_hist(fit4$draws(&amp;quot;y_pred&amp;quot;))
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose4-prior-predictive-check-2.png" alt="">&lt;/p>
&lt;h4 id="posterior-predictive-check-for-model-on-dose-4-data">Posterior Predictive Check for Model on Dose 4 Data&lt;/h4>
&lt;pre>&lt;code>yrep4 = fit4$draws() %&amp;gt;% reshape2::melt() %&amp;gt;% filter(str_detect(variable, &amp;quot;y_pred&amp;quot;)) %&amp;gt;%
extract(col = variable, into = &amp;quot;ind&amp;quot;,
regex = &amp;quot;y_pred\\[([0-9]*)\\]&amp;quot;,
convert = TRUE) %&amp;gt;%
pivot_wider(id_cols = c(&amp;quot;chain&amp;quot;,&amp;quot;iteration&amp;quot;),
names_from = &amp;quot;ind&amp;quot;) %&amp;gt;%
select(-c(&amp;quot;chain&amp;quot;, &amp;quot;iteration&amp;quot;)) %&amp;gt;% as.matrix
ppc_stat(dose4$C, yrep4, stat = &amp;quot;min&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dose4-posterior-predictive-check-1.png" alt="">&lt;/p>
&lt;h4 id="leave-one-out-cross-validation-for-dose-4">Leave one out cross validation for dose 4&lt;/h4>
&lt;pre>&lt;code>loo4 &amp;lt;- fit4$loo(save_psis=TRUE)
print(loo4)
##
## Computed from 4000 by 10 log-likelihood matrix
##
## Estimate SE
## elpd_loo 52.5 2.9
## p_loo 3.3 2.2
## looic -105.0 5.9
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k &amp;lt; 0.5).
## See help('pareto-k-diagnostic') for details.
plot(loo4)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="loo4-1.png" alt="">&lt;/p>
&lt;h3 id="remarks">Remarks&lt;/h3>
&lt;p>From the prior predictive check, we can see the estimated distribution is more heavy-tailed than the actual distribution. In the posterior predictive check, T(y) is the skewness. The model captures the observed statistic to some extent for all 4 doses, but leave-one-out cross-validation shows that the model fits the data for doses 3 and 4 better, as all points from doses 3 and 4 are good.&lt;/p>
&lt;h2 id="task-3">Task 3&lt;/h2>
&lt;pre>&lt;code>vec_c0 &amp;lt;- unlist(c0_estimate)
hist(vec_c0, breaks=20)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="unnamed-chunk-2-1.png" alt="">
The advantage of estimating C(0) this way is that the approach uses all available data and keeps its statistical power, since we are conditioning on all of the data; however, it takes a long time to produce the results. For instance, a laptop with one core will need more than an hour for this task.&lt;/p></description></item><item><title>Mathematics Theorems and Proofs in Applied Multivariate Statistical Analysis (CH.1)</title><link>https://siqi-zheng.rbind.io/post/2021-07-23-amsa-chapter-1/</link><pubDate>Wed, 21 Jul 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-07-23-amsa-chapter-1/</guid><description>&lt;h2 id="details-in-chapter-1-johnson--wichern-2002">Details in Chapter 1 (Johnson &amp;amp; Wichern, 2002)&lt;/h2>
&lt;p>P78 (2-48)&lt;/p>
&lt;p>&lt;strong>Cauchy-Schwarz Inequality&lt;/strong>. Let $\mathbf{b}$ and $\mathbf{d}$ be any two $p\times 1$ vectors. Then
$$
\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}\leq(\mathbf{b}^{\prime} \mathbf{b}){(\mathbf{d}^{\prime} \mathbf{d})}
$$
with equality if and only if $\mathbf{b}=c\mathbf{d}$ (or $c\mathbf{d}=\mathbf{b}$) for some constant c.&lt;/p>
&lt;p>Proof. The inequality is obvious if either $\mathbf{b}=\mathbf{0}$ or $\mathbf{d}=\mathbf{0}$. Excluding this possibility, consider the vector $\mathbf{b}-x \mathbf{d}$, where $x$ is an arbitrary scalar. Since the length of $\mathbf{b}-x \mathbf{d}$ is positive for $\mathbf{b}-x \mathbf{d} \neq \mathbf{0}$, in this case
$$
\begin{aligned}
0&amp;lt;(\mathbf{b}-x \mathbf{d})^{\prime}(\mathbf{b}-x \mathbf{d}) &amp;amp;=\mathbf{b}^{\prime} \mathbf{b}-x \mathbf{d}^{\prime} \mathbf{b}-\mathbf{b}^{\prime}(x \mathbf{d})+x^{2} \mathbf{d}^{\prime} \mathbf{d} \\
&amp;amp;=\mathbf{b}^{\prime} \mathbf{b}-2 x\left(\mathbf{b}^{\prime} \mathbf{d}\right)+x^{2}\left(\mathbf{d}^{\prime} \mathbf{d}\right)
\end{aligned}
$$
The last expression is quadratic in $x .$ If we complete the square by adding and subtracting the scalar $\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2} / \mathbf{d}^{\prime} \mathbf{d}$, we get
$$
\begin{gathered}
0&amp;lt;\mathbf{b}^{\prime} \mathbf{b}-\frac{\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}}{\mathbf{d}^{\prime} \mathbf{d}}+\frac{\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}}{\mathbf{d}^{\prime} \mathbf{d}}-2 x\left(\mathbf{b}^{\prime} \mathbf{d}\right)+x^{2}\left(\mathbf{d}^{\prime} \mathbf{d}\right) \\
=\mathbf{b}^{\prime} \mathbf{b}-\frac{\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}}{\mathbf{d}^{\prime} \mathbf{d}}+\left(\mathbf{d}^{\prime} \mathbf{d}\right)\left(x-\frac{\mathbf{b}^{\prime} \mathbf{d}}{\mathbf{d}^{\prime} \mathbf{d}}\right)^{2}
\end{gathered}
$$
The term in brackets is zero if we choose $x=\mathbf{b}^{\prime} \mathbf{d} / \mathbf{d}^{\prime} \mathbf{d}$, so we conclude that
$$
0&amp;lt;\mathbf{b}^{\prime} \mathbf{b}-\frac{\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}}{\mathbf{d}^{\prime} \mathbf{d}}
$$
or $\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}&amp;lt;\left(\mathbf{b}^{\prime} \mathbf{b}\right)\left(\mathbf{d}^{\prime} \mathbf{d}\right)$ if $\mathbf{b} \neq x \mathbf{d}$ for every scalar $x$.
Note that if $\mathbf{b}=c \mathbf{d}, 0=(\mathbf{b}-c \mathbf{d})^{\prime}(\mathbf{b}-c \mathbf{d})$, and the same argument produces
$\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}=\left(\mathbf{b}^{\prime} \mathbf{b}\right)\left(\mathbf{d}^{\prime} \mathbf{d}\right)$&lt;/p>
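&lt;p>A quick numerical spot-check of the inequality in R (an illustration only, not part of the proof):&lt;/p>
&lt;pre>&lt;code># (b'd)^2 should not exceed (b'b)(d'd)
b &amp;lt;- c(1, 2, 3); d &amp;lt;- c(4, -1, 2)
(sum(b * d))^2 &amp;lt;= sum(b^2) * sum(d^2) # TRUE: 64 &amp;lt;= 294
&lt;/code>&lt;/pre>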
&lt;p>&lt;strong>Extended Cauchy-Schwarz Inequality&lt;/strong>. Let $\mathbf{b}$ and $\mathbf{d}$ be any two $p\times 1$ vectors, and let $\mathbf{B}$ be a positive definite matrix. Then
$$
\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}\leq(\mathbf{b}^{\prime} \mathbf{B} \mathbf{b}){(\mathbf{d}^{\prime} \mathbf{B}^{-1} \mathbf{d})}
$$&lt;/p>
&lt;p>with equality if and only if $\mathbf{b}=c\mathbf{B}^{-1}\mathbf{d}$ or $\mathbf{d}=c\mathbf{B}\mathbf{b}$ for some constant c.&lt;/p>
&lt;p>Proof. The inequality is obvious when $\mathbf{b}=\mathbf{0}$ or $\mathbf{d}=\mathbf{0}$. For cases other than these, consider the square-root matrix $\mathbf{B}^{1 / 2}$ defined in terms of its eigenvalues $\lambda_{i}$ and the normalized eigenvectors $\mathbf{e}_{i}$ as $\mathbf{B}^{1 / 2}=\sum_{i=1}^{p} \sqrt{\lambda_{i}} \mathbf{e}_{i} \mathbf{e}_{i}^{\prime} .$ If we set
$$
\mathbf{B}^{-1 / 2}=\sum_{i=1}^{p} \frac{1}{\sqrt{\lambda_{i}}} \mathbf{e}_{i} \mathbf{e}_{i}^{\prime}
$$
it follows that
$$
\mathbf{b}^{\prime} \mathbf{d}=\mathbf{b}^{\prime} \mathbf{I} \mathbf{d}=\mathbf{b}^{\prime} \mathbf{B}^{1 / 2} \mathbf{B}^{-1 / 2} \mathbf{d}=\left(\mathbf{B}^{1 / 2} \mathbf{b}\right)^{\prime}\left(\mathbf{B}^{-1 / 2} \mathbf{d}\right)
$$
and the proof is completed by applying the Cauchy-Schwarz inequality to the vectors $\left(\mathbf{B}^{1 / 2} \mathbf{b}\right)$ and $\left(\mathbf{B}^{-1 / 2} \mathbf{d}\right)$&lt;/p>
&lt;p>Explicitly, let $\mathbf{u}=\mathbf{B}^{1 / 2} \mathbf{b}$ and $\mathbf{v}=\mathbf{B}^{-1 / 2} \mathbf{d}$; then
$$
\left(\mathbf{b}^{\prime} \mathbf{d}\right)^{2}=\left(\mathbf{u}^{\prime} \mathbf{v}\right)^{2} \leq\left(\mathbf{u}^{\prime} \mathbf{u}\right)\left(\mathbf{v}^{\prime} \mathbf{v}\right)=\left(\mathbf{B}^{1 / 2} \mathbf{b}\right)^{\prime}\left(\mathbf{B}^{1 / 2} \mathbf{b}\right)\left(\mathbf{B}^{-1 / 2} \mathbf{d}\right)^{\prime}\left(\mathbf{B}^{-1 / 2} \mathbf{d}\right)=\left(\mathbf{b}^{\prime} \mathbf{B} \mathbf{b}\right)\left(\mathbf{d}^{\prime} \mathbf{B}^{-1} \mathbf{d}\right)
$$&lt;/p>
&lt;p>The extended Cauchy-Schwarz inequality gives rise to the following maximization result.&lt;/p>
&lt;p>&lt;strong>Maximization Lemma.&lt;/strong> Let $\underset{(p \times p)}{\mathbf{B}}$ be positive definite and $\underset{(p \times 1)}{\mathbf{d}}$ be a given vector. Then, for an arbitrary nonzero vector $\underset{(p \times 1)}{\mathbf{x}}$,
$$
\max _{\mathbf{x} \neq \mathbf{0}} \frac{\left(\mathbf{x}^{\prime} \mathbf{d}\right)^{2}}{\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}}=\mathbf{d}^{\prime} \mathbf{B}^{-1} \mathbf{d}
$$
with the maximum attained when $\underset{(p \times 1)}{\mathbf{x}}=c \mathbf{B}^{-1} \mathbf{d}$ for any constant $c \neq 0$. Proof. By the extended Cauchy-Schwarz inequality, $\left(\mathbf{x}^{\prime} \mathbf{d}\right)^{2} \leq\left(\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}\right)\left(\mathbf{d}^{\prime} \mathbf{B}^{-1} \mathbf{d}\right)$.
Because $\mathbf{x} \neq \mathbf{0}$ and $\mathbf{B}$ is positive definite, $\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}&amp;gt;0$. Dividing both sides of the
inequality by the positive scalar $\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}$ yields the upper bound
$$
\frac{\left(\mathbf{x}^{\prime} \mathbf{d}\right)^{2}}{\boldsymbol{x}^{\prime} \mathbf{B} \mathbf{x}} \leq \mathbf{d}^{\prime} \mathbf{B}^{-1} \mathbf{d}
$$&lt;/p>
&lt;p>Taking the maximum over $\mathbf{x}$ gives Equation $(2-50)$ because the bound is attained for $\mathbf{x}=c \mathbf{B}^{-1} \mathbf{d} .$&lt;/p>
&lt;p>A final maximization result will provide us with an interpretation of eigenvalues.&lt;/p>
&lt;p>&lt;strong>Maximization of Quadratic Forms for Points on the Unit Sphere.&lt;/strong> Let $\mathbf{B}$ be a positive definite matrix with eigenvalues $\lambda_{1} \geq \lambda_{2} \geq \cdots \geq \lambda_{p} \geq 0$ and associated normalized eigenvectors $\mathbf{e}_{\mathbf{1}}, \mathbf{e}_{2}, \ldots, \mathbf{e}_{p}$. Then&lt;/p>
&lt;p>$$
\max_{\mathbf{x} \neq \mathbf{0}} \frac{\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}}{\mathbf{x}^{\prime} \mathbf{x}}=\lambda_{1}\quad \text { (attained when } \mathbf{x}=\mathbf{e}_{1} \text {)}
$$&lt;/p>
&lt;p>$$
\min_{\mathbf{x} \neq \mathbf{0}} \frac{\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}}{\mathbf{x}^{\prime} \mathbf{x}}=\lambda_{p} \quad \text { (attained when } \mathbf{x}=\mathbf{e}_{p} \text {)}
$$&lt;/p>
&lt;p>Moreover,&lt;/p>
&lt;p>$$
\max_{\mathbf{x} \perp \mathbf{e}_{1},\ldots,\mathbf{e}_{k}} \frac{\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}}{\mathbf{x}^{\prime} \mathbf{x}}=\lambda_{k+1} \quad \text { (attained when } \mathbf{x}=\mathbf{e}_{k+1} \text {, } k=1,2,\ldots,p-1 \text {)}
$$&lt;/p>
&lt;p>where the symbol $\perp$ is read &amp;ldquo;is perpendicular to.&amp;rdquo;&lt;/p>
&lt;p>Proof. Let $\underset{( p \times p)}{\mathbf{P}}$ be the orthogonal matrix whose columns are the eigenvectors
$\mathbf{e}_{1}, \mathbf{e}_{2}, \ldots, \mathbf{e}_{p}$ and $\mathbf{\Lambda}$ be the diagonal matrix with eigenvalues $\lambda_{1}, \lambda_{2}, \ldots, \lambda_{p}$ along the
main diagonal. Let $\mathbf{B}^{1 / 2}=\mathbf{P} \mathbf{\Lambda}^{1 / 2} \mathbf{P}^{\prime}$ and $\underset{(p \times 1)}{\mathbf{y}}=\underset{(p \times p)(p \times 1)}{\mathbf{P}^{\prime} \mathbf{x}}$.
Consequently, $\mathbf{x} \neq \mathbf{0}$ implies $\mathbf{y} \neq \mathbf{0}$. Thus,
$$
\begin{aligned}
\frac{\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}}{\mathbf{x}^{\prime} \mathbf{x}} &amp;amp;=\frac{\mathbf{x}^{\prime} \mathbf{B}^{1 / 2} \mathbf{B}^{1 / 2} \mathbf{x}}{\mathbf{x}^{\prime} \underbrace{\mathbf{P P}^{\prime}}_{\mathbf{I} \atop(p \times p)} \mathbf{x}}=\frac{\mathbf{x}^{\prime} \mathbf{P} \mathbf{\Lambda}^{1 / 2} \mathbf{P}^{\prime} \mathbf{P} \mathbf{\Lambda}^{1 / 2} \mathbf{P}^{\prime} \mathbf{x}}{\mathbf{y}^{\prime} \mathbf{y}}=\frac{\mathbf{y}^{\prime} \mathbf{\Lambda} \mathbf{y}}{\mathbf{y}^{\prime} \mathbf{y}} \\
&amp;amp;=\frac{\sum_{i=1}^{p} \lambda_{i} y_{i}^{2}}{\sum_{i=1}^{p} y_{i}^{2}} \leq \lambda_{1} \frac{\sum_{i=1}^{p} y_{i}^{2}}{\sum_{i=1}^{p} y_{i}^{2}}=\lambda_{1}
\end{aligned}
$$&lt;/p>
&lt;p>Setting $\mathbf{x}=\mathbf{e}_{1}$ gives
$$
\mathbf{y}=\mathbf{P}^{\prime} \mathbf{e}_{1}=\left[\begin{array}{c}
1 \\
0 \\
\vdots \\
0
\end{array}\right]
$$
since
$$
\mathbf{e}_{k}^{\prime} \mathbf{e}_{1}= \begin{cases}1, &amp;amp; k=1 \\ 0, &amp;amp; k \neq 1\end{cases}
$$
For this choice of $\mathbf{x}$, we have $\mathbf{y}^{\prime} \mathbf{\Lambda} \mathbf{y} / \mathbf{y}^{\prime} \mathbf{y}=\lambda_{1} / 1=\lambda_{1}$, or
$$
\frac{\mathbf{e}_{1}^{\prime} \mathbf{B e}_{1}}{\mathbf{e}_{1}^{\prime} \mathbf{e}_{1}}=\mathbf{e}_{1}^{\prime} \mathbf{B e}_{1}=\lambda_{1}
$$
A similar argument produces the second part of $(2-51)$. Now, $\mathbf{x}=\mathbf{P y}=y_{1} \mathbf{e}_{1}+y_{2} \mathbf{e}_{2}+\cdots+y_{p} \mathbf{e}_{p}$, so $\mathbf{x} \perp \mathbf{e}_{1}, \ldots, \mathbf{e}_{k}$ implies
$$
0=\mathbf{e}_{i}^{\prime} \mathbf{x}=y_{1} \mathbf{e}_{i}^{\prime} \mathbf{e}_{1}+y_{2} \mathbf{e}_{i}^{\prime} \mathbf{e}_{2}+\cdots+y_{p} \mathbf{e}_{i}^{\prime} \mathbf{e}_{p}=y_{i}, \quad i \leq k
$$
Therefore, for $x$ perpendicular to the first $k$ eigenvectors $\mathbf{e}_{i}$, the left-hand side of the inequality in $(2-53)$ becomes
$$
\frac{\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}}{\mathbf{x}^{\prime} \mathbf{x}}=\frac{\sum_{i=k+1}^{p} \lambda_{i} y_{i}^{2}}{\sum_{i=k+1}^{p} y_{i}^{2}}
$$
Taking $y_{k+1}=1, y_{k+2}=\cdots=y_{p}=0$ gives the asserted maximum.
For a fixed $\mathbf{x}_{0} \neq \mathbf{0}, \mathbf{x}_{0}^{\prime} \mathbf{B} \mathbf{x}_{0} / \mathbf{x}_{0}^{\prime} \mathbf{x}_{0}$ has the same value as $\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}$, where
$\mathbf{x}^{\prime}=\mathbf{x}_{0}^{\prime} / \sqrt{\mathbf{x}_{0}^{\prime} \mathbf{x}_{0}}$ is of unit length. Consequently, Equation (2-51) says that the largest eigenvalue, $\lambda_{1}$, is the maximum value of the quadratic form $\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}$ for all points $\mathbf{x}$ whose distance from the origin is unity. Similarly, $\lambda_{p}$ is the smallest value of the quadratic form for all points $x$ one unit from the origin. The largest and smallest eigenvalues thus represent extreme values of $\mathbf{x}^{\prime} \mathbf{B} \mathbf{x}$ for points on the unit sphere. The &amp;ldquo;intermediate&amp;rdquo; eigenvalues of the $p \times p$ positive definite matrix $B$ also have an interpretation as extreme values when $\mathbf{x}$ is further restricted to be perpendicular to the earlier choices.&lt;/p>
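&lt;p>This interpretation is easy to verify numerically in R (a sketch using a randomly generated positive definite matrix):&lt;/p>
&lt;pre>&lt;code>set.seed(1)
A &amp;lt;- matrix(rnorm(9), 3, 3)
B &amp;lt;- crossprod(A) + diag(3) # A'A + I is positive definite
e &amp;lt;- eigen(B)
x &amp;lt;- e$vectors[, 1]         # normalized leading eigenvector
drop(t(x) %*% B %*% x / crossprod(x)) # equals e$values[1], the largest eigenvalue
&lt;/code>&lt;/pre>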
&lt;h2 id="an-example-of-the-application-of-cauchy-schwarz-inequality-cramér-1946">An Example of the Application of Cauchy-Schwarz Inequality (Cramér, 1946)&lt;/h2>
&lt;p>In statistical problems, large amounts of data are collected to study a phenomenon. With a desire to derive a mathematical model to describe it, we may find, numerically, a function $\widetilde{\phi}$ to approximate a parameter $\phi$. $\widetilde{\phi}$ is called an unbiased estimator of $\phi$ if $E(\widetilde{\phi})=\phi . \quad$ That is
$$
\int_{-\infty}^{\infty} \widetilde{\phi} f_{\theta}(x) d x=\phi(\theta)
$$
Here, $\theta$ and $x$ are independent parameters. Differentiating this with respect to $\theta$ and interchanging integration and differentiation (provided of course that this is permissible) gives:
$$
\int_{-\infty}^{\infty} \widetilde{\phi}(x) \frac{\partial f_{\theta}}{\partial \theta}(x) d x=\phi^{\prime}(\theta)
$$
The rate of change of information is the function
$$
S(x):=\frac{\partial}{\partial \theta} \log f_{\theta}(x)
$$
called the score statistic. Plainly, $S(x)=\frac{1}{f_{\theta}(x)} \frac{\partial f_{\theta}}{\partial \theta}(x)$, so that we can write
$$
\int_{-\infty}^{\infty} \widetilde{\phi}(x) S(x) f_{\theta}(x) d x=\phi^{\prime}(\theta) .
$$
Also, the expectation of $S(x)$ is
$$
E(S(x))=\int_{-\infty}^{\infty} S(x) f_{\theta}(x) d x=\int_{-\infty}^{\infty} \frac{\partial f_{\theta}}{\partial \theta}(x) d x=\frac{\partial}{\partial \theta} \int_{-\infty}^{\infty} f_{\theta}(x) d x=0
$$
since
$$
\int_{-\infty}^{\infty} f_{\theta}(x) d x=1
$$
because the total probability is $1 .$ Thus, (4.1) can be re-written as
$$
\int_{-\infty}^{\infty}(\widetilde{\phi}(x)-\phi(\theta)) S(x) f_{\theta}(x) d x=\phi^{\prime}(\theta) .
$$
Applying the Cauchy-Schwarz inequality, we obtain
$$
\phi^{\prime}(\theta)^{2} \leq\left(\int_{-\infty}^{\infty}(\widetilde{\phi}(x)-\phi(\theta))^{2} f_{\theta}(x) d x\right)\left(\int_{-\infty}^{\infty} S(x)^{2} f_{\theta}(x) d x\right)
$$&lt;/p>
&lt;p>Writing
$$
I(\theta):=\int_{-\infty}^{\infty}\left(\frac{\partial \log f_{\theta}}{\partial \theta}\right)^{2} f_{\theta}(x) d x
$$
(called Fisher information in statistical parlance), we can write our inequality as:&lt;/p>
&lt;p>&lt;strong>Theorem 5&lt;/strong> (The Cramér-Rao inequality). For an unbiased estimator $\widetilde{\phi}$ of $\phi$, we have
$$
\int_{-\infty}^{\infty}(\widetilde{\phi}(x)-\phi(\theta))^{2} f_{\theta}(x) d x \geq \frac{\phi^{\prime}(\theta)^{2}}{I(\theta)} .
$$
Often, this is applied with $\phi(\theta)=\theta$ so that $\phi^{\prime}(\theta)=1$. The inequality then gives a limit on the accuracy of any unbiased estimator of $\theta$. Sometimes it is referred to as the information inequality. It was discovered independently by C. R. Rao [10] and H. Cramér [2] in 1945 and has played a pivotal role in statistical inference. An enlightening survey of the Cramér-Rao inequality was written by K. R. Parthasarathy [9], where the reader can find a discussion of Riemannian metrics for studying population models.&lt;/p>
&lt;p>Regarding Theorem 5, there is considerable interest in estimators that actually achieve the Cramér-Rao lower bound. Such estimators are said to be asymptotically efficient. Under certain regularity conditions, maximum likelihood estimators are asymptotically efficient. In such cases the Fisher information about $\theta$ in the data is equal to the inverse of the variance of the estimator.&lt;/p>
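&lt;p>To make the bound concrete, here is a small numerical illustration (a hypothetical Python sketch, not part of the original derivation): for $X_1,\dots,X_n$ i.i.d. $N(\theta,\sigma^2)$ with $\sigma$ known, the Fisher information is $I(\theta)=n/\sigma^2$, so the Cramér-Rao lower bound is $\sigma^2/n$ — exactly the variance of the sample mean, which therefore attains the bound.&lt;/p>

```python
import numpy as np

# Hypothetical sketch (not from the original text): check the Cramer-Rao
# bound for estimating the mean theta of N(theta, sigma^2) with sigma known.
# Here I(theta) = n / sigma^2, so the bound on the variance of any unbiased
# estimator is sigma^2 / n, which the sample mean attains.
rng = np.random.default_rng(0)
n, sigma, theta, reps = 50, 2.0, 1.0, 20000
samples = rng.normal(theta, sigma, size=(reps, n))
estimates = samples.mean(axis=1)   # unbiased estimator of theta
crlb = sigma**2 / n                # Cramer-Rao lower bound, sigma^2 / n
emp_var = estimates.var()          # empirical variance of the estimator
print(crlb, emp_var)               # the two agree closely
```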
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ol>
&lt;li>Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press,
Princeton.&lt;/li>
&lt;li>Johnson, R. A., Wichern, D. W. (2002). Applied multivariate statistical analysis. Upper Saddle River, NJ: Prentice Hall. ISBN: 0130925535&lt;/li>
&lt;/ol></description></item><item><title>A Comparison between Two Ways of Coding for Bayesian Statistical Modeling in Toronto Rental Price (with brms)</title><link>https://siqi-zheng.rbind.io/post/2021-07-20-bayes-rental-price/</link><pubDate>Tue, 20 Jul 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-07-20-bayes-rental-price/</guid><description>&lt;h2 id="data-preparation">Data Preparation&lt;/h2>
&lt;p>According to research on the distribution of housing prices in Tokyo (Ohnishi, Mizuno, Shimizu &amp;amp; Watanabe, 2011), housing prices follow a lognormal distribution. We would therefore like to examine whether an exponential model works for the mean rental price $\mu_{ij}$ of unit type $j$ in year $i$ in Toronto, with two predictors (year and unit size). Data were adapted from the Canada Mortgage and Housing Corporation (2021). For copyright reasons, the dataset will not be attached on GitHub; the link to the dataset can be found under the Bibliography section. An overview of the data is provided below.&lt;/p>
&lt;pre>&lt;code class="language-r"># Data Wrangling
library(tidyverse)
# Bayes Models
library(brms)
library(tidybayes)
library(bayesplot)
library(loo)
# Prior libraries
library(extraDistr)
library(cmdstanr)
library(posterior)
# html widgets
library(kableExtra)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-r">head(df_bay)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## # A tibble: 6 x 4
## neighbourhood year_temp unit_size rent
## &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt;
## 1 Banbury-Don Mills/York Mills 0 0 bedroom 881
## 2 Banbury-Don Mills/York Mills 0 1 bedroom 1097
## 3 Banbury-Don Mills/York Mills 0 2 bedrooms 1253
## 4 Banbury-Don Mills/York Mills 0 3 bedrooms 1565
## 5 Bathurst Manor 0 0 bedroom 797
## 6 Bathurst Manor 0 1 bedroom 1101
&lt;/code>&lt;/pre>
&lt;p>Note: Year 0 represents year 2016 and year 4 is 2020.&lt;/p>
&lt;pre>&lt;code class="language-r">df_bay_2016 = df_bay %&amp;gt;%
filter(year_temp==1) %&amp;gt;%
select(rent)
hist(df_bay_2016$rent, main='Distribution of Mean Rental Prices in Neighborhoods in Toronto in 2017', xlab='Mean Price')
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dist-1.png" alt="">&lt;!-- -->&lt;/p>
&lt;pre>&lt;code class="language-r">hist(log(df_bay_2016$rent), main='Distribution of Log Mean Rental Prices in Neighborhoods\nin Toronto in 2017', xlab='Log Mean Price')
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="dist-2.png" alt="">&lt;!-- -->&lt;/p>
&lt;h2 id="two-matematically-equivalent-approaches">Two Matematically Equivalent Approaches&lt;/h2>
&lt;p>We consider the following equivalent approaches and test if both models agree with each other.&lt;/p>
&lt;p>We are going to consider the following two approaches (with brms in R):&lt;/p>
&lt;p>Approach 1: $\mu_{ij}\sim \mathrm{Lognormal}((b_{0j}+\beta_0)+(b_{1j}+\beta_1)x_i,\sigma^2)$ (i.e. family=lognormal())&lt;/p>
&lt;p>Approach 2: $\log(\mu_{ij})\sim N((b_{0j}+\beta_0)+(b_{1j}+\beta_1)x_i,\sigma^2)$&lt;/p>
&lt;p>where&lt;/p>
&lt;p>$b_{0j}$: Random intercept due to j type of unit size;&lt;/p>
&lt;p>$\beta_0$: Baseline intercept, which may have no practical meaning;&lt;/p>
&lt;p>$x_i$: Variable year $i$, where year 0 corresponds to 2016 and year 4 to 2020;&lt;/p>
&lt;p>$b_{1j}$: Random slope due to j type of unit size;&lt;/p>
&lt;p>$\beta_1$: Coefficient of variable year;&lt;/p>
&lt;p>$\sigma^2$: Actual variation in rental price.&lt;/p>
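&lt;p>The equivalence of the two approaches comes down to the definition of the lognormal distribution: $Y\sim \mathrm{Lognormal}(\mu,\sigma)$ means precisely that $\log Y\sim N(\mu,\sigma)$. A quick simulation (a hypothetical sketch in Python rather than R; the values 7.0 and 0.25 are illustrative, chosen to be on the scale of log rent) confirms the two views coincide:&lt;/p>

```python
import numpy as np

# Hypothetical sketch (Python rather than R): Y ~ Lognormal(mu, sigma) is
# by definition exp(Z) with Z ~ N(mu, sigma), so fitting rent with
# family = lognormal() (Approach 1) and fitting log(rent) with a Gaussian
# family (Approach 2) target the same likelihood.  mu = 7.0 and
# sigma = 0.25 are illustrative values on the scale of log(rent).
rng = np.random.default_rng(1)
mu, sigma, size = 7.0, 0.25, 100_000
y_lognormal = rng.lognormal(mu, sigma, size)          # Approach 1 view
y_via_normal = np.exp(rng.normal(mu, sigma, size))    # Approach 2 view
print(np.log(y_lognormal).mean(), np.log(y_via_normal).mean())  # both ~ 7.0
print(np.log(y_lognormal).std(), np.log(y_via_normal).std())    # both ~ 0.25
```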
&lt;h2 id="priors">Priors&lt;/h2>
&lt;p>Weakly informative priors are chosen based on our belief in the baseline price of a studio at the intercept, which is around 700 Canadian dollars per month, in Toronto. Moreover, we expect the unit size has small effects on the intercept (normal distribution is chosen for this reason). At the same time, we also want to ensure that we do not miss the possibility of large parameters with Cauchy distribution as priors:&lt;/p>
&lt;p>$\beta_0\sim N(700,100)$&lt;/p>
&lt;p>$b_{0j}\sim N(0,1)$&lt;/p>
&lt;p>$b_{1j}\sim N(0,{\tau_1}^2 )$&lt;/p>
&lt;p>$\beta_1\sim N(0,{\tau_2}^2 )$&lt;/p>
&lt;p>${\tau_1}^2,{\tau_2}^2,\sigma\sim Cauchy(0,1)$&lt;/p>
&lt;p>Correlation of $(b_{0j},b_{1j})\sim \mathrm{LKJCholesky}(1.5)$ (the Cholesky-factored LKJ correlation distribution)&lt;/p>
&lt;p>Note: Prior Predictive Check is not the main focus of this project, so it is omitted to save space for the model comparison below. One should, however, conduct prior predictive check to be more rigorous.&lt;/p>
&lt;h2 id="model-1-estimates">Model 1 Estimates&lt;/h2>
&lt;pre>&lt;code class="language-r">priors &amp;lt;- c(prior(normal(700, 100), class = Intercept),
prior(normal(0, 1), class = b),
prior(cauchy(0, 1), class = sd),
prior(cauchy(0, 1), class = sigma),
prior(lkj_corr_cholesky(1.5), class = cor))
# priors &amp;lt;- c(prior(normal(0,1), class = Intercept),
# prior(cauchy(0,0.5), class = sd))
if (!file.exists(&amp;quot;models/bayes_mod1.rds&amp;quot;)){
mod_1 &amp;lt;- brm(rent ~ (1 + year_temp | unit_size ) + year_temp ,
data = df_bay,
prior = priors,
family=lognormal(),
warmup = 1000, # burn-in
iter = 5000, # number of iterations
chains = 2, # number of MCMC chains
control = list(adapt_delta = 0.95))
saveRDS(mod_1, file= &amp;quot;models/bayes_mod1.rds&amp;quot;)
} else {
mod_1 &amp;lt;- readRDS(&amp;quot;models/bayes_mod1.rds&amp;quot;)
}
fixef(mod_1) %&amp;gt;%
kable(booktabs = T, caption = &amp;quot;Fixed Effects for Model 1&amp;quot;) %&amp;gt;%
kable_styling(latex_options = c(&amp;quot;HOLD_position&amp;quot;, &amp;quot;scale_down&amp;quot;))
&lt;/code>&lt;/pre>
&lt;table class="table" style="margin-left: auto; margin-right: auto;">
&lt;caption>Fixed Effects for Model 1&lt;/caption>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left;"> &lt;/th>
&lt;th style="text-align:right;"> Estimate &lt;/th>
&lt;th style="text-align:right;"> Est.Error &lt;/th>
&lt;th style="text-align:right;"> Q2.5 &lt;/th>
&lt;th style="text-align:right;"> Q97.5 &lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left;"> Intercept &lt;/td>
&lt;td style="text-align:right;"> 7.052757 &lt;/td>
&lt;td style="text-align:right;"> 0.2196256 &lt;/td>
&lt;td style="text-align:right;"> 6.5951566 &lt;/td>
&lt;td style="text-align:right;"> 7.5166879 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> year_temp &lt;/td>
&lt;td style="text-align:right;"> 0.054498 &lt;/td>
&lt;td style="text-align:right;"> 0.0087226 &lt;/td>
&lt;td style="text-align:right;"> 0.0406854 &lt;/td>
&lt;td style="text-align:right;"> 0.0670868 &lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-r">data.frame(ranef(mod_1)) %&amp;gt;%
kable(booktabs = T, caption = &amp;quot;Random Effects for Model 1&amp;quot;) %&amp;gt;%
kable_styling(latex_options = c(&amp;quot;HOLD_position&amp;quot;, &amp;quot;scale_down&amp;quot;))
&lt;/code>&lt;/pre>
&lt;table class="table" style="margin-left: auto; margin-right: auto;">
&lt;caption>Random Effects for Model 1&lt;/caption>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left;"> &lt;/th>
&lt;th style="text-align:right;"> unit_size.Estimate.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Est.Error.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q2.5.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q97.5.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Estimate.year_temp &lt;/th>
&lt;th style="text-align:right;"> unit_size.Est.Error.year_temp &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q2.5.year_temp &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q97.5.year_temp &lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left;"> 0 bedroom &lt;/td>
&lt;td style="text-align:right;"> -0.2945081 &lt;/td>
&lt;td style="text-align:right;"> 0.2198822 &lt;/td>
&lt;td style="text-align:right;"> -0.7599386 &lt;/td>
&lt;td style="text-align:right;"> 0.1641069 &lt;/td>
&lt;td style="text-align:right;"> 0.0028919 &lt;/td>
&lt;td style="text-align:right;"> 0.0091980 &lt;/td>
&lt;td style="text-align:right;"> -0.0093780 &lt;/td>
&lt;td style="text-align:right;"> 0.0213398 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> 1 bedroom &lt;/td>
&lt;td style="text-align:right;"> -0.0838120 &lt;/td>
&lt;td style="text-align:right;"> 0.2196423 &lt;/td>
&lt;td style="text-align:right;"> -0.5421174 &lt;/td>
&lt;td style="text-align:right;"> 0.3796284 &lt;/td>
&lt;td style="text-align:right;"> 0.0012612 &lt;/td>
&lt;td style="text-align:right;"> 0.0086730 &lt;/td>
&lt;td style="text-align:right;"> -0.0118502 &lt;/td>
&lt;td style="text-align:right;"> 0.0167562 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> 2 bedrooms &lt;/td>
&lt;td style="text-align:right;"> 0.0867104 &lt;/td>
&lt;td style="text-align:right;"> 0.2198136 &lt;/td>
&lt;td style="text-align:right;"> -0.3782345 &lt;/td>
&lt;td style="text-align:right;"> 0.5462856 &lt;/td>
&lt;td style="text-align:right;"> -0.0003455 &lt;/td>
&lt;td style="text-align:right;"> 0.0088410 &lt;/td>
&lt;td style="text-align:right;"> -0.0151367 &lt;/td>
&lt;td style="text-align:right;"> 0.0140831 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> 3 bedrooms &lt;/td>
&lt;td style="text-align:right;"> 0.2633750 &lt;/td>
&lt;td style="text-align:right;"> 0.2195523 &lt;/td>
&lt;td style="text-align:right;"> -0.1990381 &lt;/td>
&lt;td style="text-align:right;"> 0.7254240 &lt;/td>
&lt;td style="text-align:right;"> -0.0016846 &lt;/td>
&lt;td style="text-align:right;"> 0.0087438 &lt;/td>
&lt;td style="text-align:right;"> -0.0178807 &lt;/td>
&lt;td style="text-align:right;"> 0.0113221 &lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="model-2-estimates">Model 2 Estimates&lt;/h2>
&lt;pre>&lt;code class="language-r"># priors &amp;lt;- c(prior(normal(0,1), class = Intercept),
# prior(cauchy(0,0.5), class = sd))
if (!file.exists(&amp;quot;models/bayes_mod2.rds&amp;quot;)){
mod_2 &amp;lt;- brm(log(rent) ~ (1 + year_temp | unit_size ) + year_temp ,
data = df_bay,
prior = priors,
warmup = 1000, # burn-in
iter = 5000, # number of iterations
chains = 2, # number of MCMC chains
control = list(adapt_delta = 0.95))
saveRDS(mod_2, file= &amp;quot;models/bayes_mod2.rds&amp;quot;)
} else {
mod_2 &amp;lt;- readRDS(&amp;quot;models/bayes_mod2.rds&amp;quot;)
}
fixef(mod_2) %&amp;gt;%
kable(booktabs = T, caption = &amp;quot;Fixed Effects for Model 2&amp;quot;) %&amp;gt;%
kable_styling(latex_options = c(&amp;quot;HOLD_position&amp;quot;, &amp;quot;scale_down&amp;quot;))
&lt;/code>&lt;/pre>
&lt;table class="table" style="margin-left: auto; margin-right: auto;">
&lt;caption>Fixed Effects for Model 2&lt;/caption>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left;"> &lt;/th>
&lt;th style="text-align:right;"> Estimate &lt;/th>
&lt;th style="text-align:right;"> Est.Error &lt;/th>
&lt;th style="text-align:right;"> Q2.5 &lt;/th>
&lt;th style="text-align:right;"> Q97.5 &lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left;"> Intercept &lt;/td>
&lt;td style="text-align:right;"> 7.0518325 &lt;/td>
&lt;td style="text-align:right;"> 0.2324473 &lt;/td>
&lt;td style="text-align:right;"> 6.5815863 &lt;/td>
&lt;td style="text-align:right;"> 7.5796947 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> year_temp &lt;/td>
&lt;td style="text-align:right;"> 0.0549225 &lt;/td>
&lt;td style="text-align:right;"> 0.0075329 &lt;/td>
&lt;td style="text-align:right;"> 0.0401083 &lt;/td>
&lt;td style="text-align:right;"> 0.0683509 &lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-r">data.frame(ranef(mod_2)) %&amp;gt;%
kable(booktabs = T, caption = &amp;quot;Random Effects for Model 2&amp;quot;) %&amp;gt;%
kable_styling(latex_options = c(&amp;quot;HOLD_position&amp;quot;, &amp;quot;scale_down&amp;quot;))
&lt;/code>&lt;/pre>
&lt;table class="table" style="margin-left: auto; margin-right: auto;">
&lt;caption>Random Effects for Model 2&lt;/caption>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left;"> &lt;/th>
&lt;th style="text-align:right;"> unit_size.Estimate.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Est.Error.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q2.5.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q97.5.Intercept &lt;/th>
&lt;th style="text-align:right;"> unit_size.Estimate.year_temp &lt;/th>
&lt;th style="text-align:right;"> unit_size.Est.Error.year_temp &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q2.5.year_temp &lt;/th>
&lt;th style="text-align:right;"> unit_size.Q97.5.year_temp &lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left;"> 0 bedroom &lt;/td>
&lt;td style="text-align:right;"> -0.2939128 &lt;/td>
&lt;td style="text-align:right;"> 0.2329494 &lt;/td>
&lt;td style="text-align:right;"> -0.8206667 &lt;/td>
&lt;td style="text-align:right;"> 0.1726338 &lt;/td>
&lt;td style="text-align:right;"> 0.0026946 &lt;/td>
&lt;td style="text-align:right;"> 0.0079855 &lt;/td>
&lt;td style="text-align:right;"> -0.0107282 &lt;/td>
&lt;td style="text-align:right;"> 0.0205602 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> 1 bedroom &lt;/td>
&lt;td style="text-align:right;"> -0.0830413 &lt;/td>
&lt;td style="text-align:right;"> 0.2324192 &lt;/td>
&lt;td style="text-align:right;"> -0.6096193 &lt;/td>
&lt;td style="text-align:right;"> 0.3857650 &lt;/td>
&lt;td style="text-align:right;"> 0.0009799 &lt;/td>
&lt;td style="text-align:right;"> 0.0076378 &lt;/td>
&lt;td style="text-align:right;"> -0.0127195 &lt;/td>
&lt;td style="text-align:right;"> 0.0172185 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> 2 bedrooms &lt;/td>
&lt;td style="text-align:right;"> 0.0882536 &lt;/td>
&lt;td style="text-align:right;"> 0.2321475 &lt;/td>
&lt;td style="text-align:right;"> -0.4375356 &lt;/td>
&lt;td style="text-align:right;"> 0.5553330 &lt;/td>
&lt;td style="text-align:right;"> -0.0009177 &lt;/td>
&lt;td style="text-align:right;"> 0.0076263 &lt;/td>
&lt;td style="text-align:right;"> -0.0167288 &lt;/td>
&lt;td style="text-align:right;"> 0.0138163 &lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left;"> 3 bedrooms &lt;/td>
&lt;td style="text-align:right;"> 0.2643750 &lt;/td>
&lt;td style="text-align:right;"> 0.2323713 &lt;/td>
&lt;td style="text-align:right;"> -0.2665589 &lt;/td>
&lt;td style="text-align:right;"> 0.7340761 &lt;/td>
&lt;td style="text-align:right;"> -0.0022045 &lt;/td>
&lt;td style="text-align:right;"> 0.0079285 &lt;/td>
&lt;td style="text-align:right;"> -0.0200957 &lt;/td>
&lt;td style="text-align:right;"> 0.0117391 &lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Key Findings for both models:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Both models yield similar estimates for both fixed effects and random effects.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The rental price increases by around 5.6% each year on average, higher than Canada&amp;rsquo;s inflation rate of 3.6%;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The baseline price for a 3-bedroom apartment is about 73% higher than that of a studio, so a hierarchical model is necessary;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The slope does not vary much for each room type, so a random intercept model may be sufficient for analysis.&lt;/p>
&lt;/li>
&lt;/ul>
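&lt;p>The roughly 5.6% figure above follows from the log-scale slope: a log-linear coefficient $b$ implies a multiplicative change of $e^b$ per year. A quick arithmetic check (a Python sketch, using the fixed-effect estimates reported in the tables above):&lt;/p>

```python
import math

# The year_temp coefficients are slopes on the log scale, so the implied
# annual growth rate is exp(b) - 1.  The values below are the fixed-effect
# estimates reported in the tables above for models 1 and 2.
b_model1, b_model2 = 0.054498, 0.0549225
growth1 = math.exp(b_model1) - 1
growth2 = math.exp(b_model2) - 1
print(f"{growth1:.1%} {growth2:.1%}")  # both about 5.6% per year
```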
&lt;h2 id="posterior-predictive-check-density">Posterior Predictive Check (Density)&lt;/h2>
&lt;pre>&lt;code class="language-r">pp_check(mod_1) + labs(title=&amp;quot;Distribution of observed and replicated rental price for model 1&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Using 10 posterior samples for ppc type 'dens_overlay' by default.
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pp_check_density-1.png" alt="">&lt;!-- -->&lt;/p>
&lt;pre>&lt;code class="language-r">pp_check(mod_2) + labs(title=&amp;quot;Distribution of observed and replicated rental price for model 2&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## Using 10 posterior samples for ppc type 'dens_overlay' by default.
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pp_check_density-2.png" alt="">&lt;!-- -->&lt;/p>
&lt;p>Both models are reasonable from the comparison above.&lt;/p>
&lt;h2 id="posterior-predictive-check-test-statistic">Posterior Predictive Check (Test Statistic)&lt;/h2>
&lt;pre>&lt;code class="language-r">pp_check(mod_1, type = &amp;quot;stat&amp;quot;, stat = 'mean', nsamples = 5000) + labs(title=&amp;quot;Comparison between the distribution of the mean rental price in simulated datasets\nand the mean of the actual data for Model 1&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pp_check_test_stats-1.png" alt="">&lt;!-- -->&lt;/p>
&lt;pre>&lt;code class="language-r">pp_check(mod_2, type = &amp;quot;stat&amp;quot;, stat = 'mean', nsamples = 5000) + labs(title=&amp;quot;Comparison between the distribution of the log mean rental price in simulated datasets\nand the log mean of the actual data for Model 2&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="pp_check_test_stats-2.png" alt="">&lt;!-- -->&lt;/p>
&lt;p>Both models are reasonable from the comparison above.&lt;/p>
&lt;h2 id="leave-one-out-cross-validationloo-cv">Leave-one-out Cross-validation(LOO-CV)&lt;/h2>
&lt;pre>&lt;code class="language-r">loo1b &amp;lt;- loo(mod_1, save_psis = TRUE)
loo2b &amp;lt;- loo(mod_2, save_psis = TRUE)
plot(loo1b, main = &amp;quot;PSIS diagnostic plot for model 1&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="loo-1.png" alt="">&lt;!-- -->&lt;/p>
&lt;pre>&lt;code class="language-r">plot(loo2b, main = &amp;quot;PSIS diagnostic plot for model 2&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="loo-2.png" alt="">&lt;!-- -->&lt;/p>
&lt;p>Pareto $k$ estimates give an indication of how ‘influential’ each point is: the higher the value of $k$, the more influential the point. Points with $k$ over 0.5 are problematic; fortunately, there are no such influential points in either model.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Though the two models differ subtly in their results, RStan produces fair estimates for both. One may want to explore rental prices further with more predictors, and gather more data to validate the model.&lt;/p>
&lt;h2 id="bibliography">Bibliography&lt;/h2>
&lt;p>Canada Mortgage and Housing Corporation. (2021). Toronto — Historical Average Rents by Bedroom Type. https://www03.cmhc-schl.gc.ca/hmip-pimh/en/TableMapChart/Table?TableId=2.2.11&amp;amp;GeographyId=2270&amp;amp;GeographyTypeId=3&amp;amp;DisplayAs=Table&amp;amp;GeograghyName=Toronto&lt;/p>
&lt;li>
&lt;a href="#introduction-to-apache-drill">Introduction to Apache Drill&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#querying-files-using-drill">Querying Files Using Drill&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#querying-mysql-using-drill">Querying MySQL Using Drill&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#querying-mongodb-using-drill">Querying MongoDB Using Drill&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#drill-with-multiple-data-sources">Drill with Multiple Data Sources&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#future-of-sql">Future of SQL&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The data landscape has changed quite a bit over the past decade, and SQL is changing to meet the needs of today’s rapidly evolving environments. Many organizations that had used relational databases exclusively just a few years ago are now also housing data in Hadoop clusters, data lakes, and NoSQL databases. At the same time, companies are struggling to find ways to gain insights from the ever-growing volumes of data, and the fact that this data is now spread across multiple data stores, perhaps both on-site and in the cloud, makes this a daunting task.&lt;/p>
&lt;p>Because SQL is used by millions of people and has been integrated into thousands of applications, it makes sense to leverage SQL to harness this data and make it actionable. Over the past several years, a new breed of tools has emerged to enable SQL access to structured, semi-structured, and unstructured data: tools such as Presto, Apache Drill, and Toad Data Point. This chapter explores one of these tools, Apache Drill, to demonstrate how data in different formats and stored on different servers can be brought together for reporting and analysis.&lt;/p>
&lt;h1 id="introduction-to-apache-drill">Introduction to Apache Drill&lt;/h1>
&lt;p>Compelling features:&lt;/p>
&lt;ul>
&lt;li>Facilitates queries across multiple data formats, including delimited data, JSON, Parquet, and log files&lt;/li>
&lt;li>Connects to relational databases, Hadoop, NoSQL, HBase, and Kafka, as well as specialized data formats such as PCAP, BlockChain, and others&lt;/li>
&lt;li>Allows creation of custom plug-ins to connect to most any other data store&lt;/li>
&lt;li>Requires no up-front schema definitions&lt;/li>
&lt;li>Supports the SQL:2003 standard&lt;/li>
&lt;li>Works with popular business intelligence (BI) tools like Tableau and Apache Superset&lt;/li>
&lt;/ul>
&lt;p>Using Drill, you can connect to any number of data sources and begin querying, without the need to first set up a metadata repository.&lt;/p>
&lt;h1 id="querying-files-using-drill">Querying Files Using Drill&lt;/h1>
&lt;p>Let’s start by using Drill to query data in a file. Drill understands how to read several different file formats, including packet capture (PCAP) files, which are in binary format and contain information about packets traveling over a network. All I have to do when I want to query a PCAP file is to configure Drill’s dfs (distributed filesystem) plug-in to include the path to the directory containing my files, and I’m ready to write queries.&lt;/p>
&lt;p>Drill includes partial support for information_schema, so you can find out high-level information about the data files in your workspace:&lt;/p>
&lt;pre>
SELECT file_name, is_directory, is_file, permission
FROM &lt;b>information_schema.`files`&lt;/b>
WHERE schema_name = 'dfs.data';
SELECT * FROM dfs.data.`attack-trace.pcap`
&lt;b>WHERE 1=2;&lt;/b> # To see the column name
&lt;/pre>
&lt;p>Counts the number of packets sent from each IP address to each destination port:&lt;/p>
&lt;pre>
SELECT src_ip, dst_port,
count(*) AS packet_count
FROM dfs.data.`attack-trace.pcap`
GROUP BY src_ip, dst_port;
&lt;/pre>
&lt;p>Aggregates packet information for each second:&lt;/p>
&lt;pre>
SELECT trunc(extract(second from `timestamp`)) as packet_time,
count(*) AS num_packets,
sum(packet_length) AS tot_volume
FROM dfs.data.`attack-trace.pcap`
GROUP BY trunc(extract(second from `timestamp`));
&lt;/pre>
&lt;p>Put backticks (`) around timestamp because it is a reserved word.&lt;/p>
&lt;p>You can query files stored locally, on your network, in a distributed filesystem, or in the cloud. Drill has built-in support for many file types, but you can also build your own plug-in to allow Drill to query any type of file.&lt;/p>
&lt;h1 id="querying-mysql-using-drill">Querying MySQL Using Drill&lt;/h1>
&lt;p>Why Apache Drill? Because you can write queries using Drill that combine data from different sources, so you might write a query that joins data from MySQL, Hadoop, and comma-delimited files, for example.&lt;/p>
&lt;p>The first step is to choose a database:&lt;/p>
&lt;pre>
apache drill (information_schema)> &lt;b>use mysql.sakila&lt;/b>;
&lt;b>show tables;&lt;/b>
&lt;/pre>
&lt;p>Simple joins, group by, order by, and having clauses work in Drill as well. However, Drill works with many relational databases, not just MySQL, so some features of the language may differ (e.g., data conversion functions). For more information, read
&lt;a href="http://drill.apache.org/docs/sql-reference/" target="_blank" rel="noopener">Drill’s documentation about their SQL implementation&lt;/a>.&lt;/p>
&lt;h1 id="querying-mongodb-using-drill">Querying MongoDB Using Drill&lt;/h1>
&lt;p>After using Drill to query the sample Sakila data in MySQL, the next logical step is to convert the Sakila data to another commonly used format, store it in a nonrelational database, and use Drill to query the data. I decided to convert the data to JSON and store it in MongoDB, which is one of the more popular NoSQL platforms for document storage. Drill includes a plug-in for MongoDB and also understands how to read JSON documents, so it was relatively easy to load the JSON files into Mongo and begin writing queries.&lt;/p>
&lt;p>After the JSON files have been loaded, the Mongo database contains two collections (films and customers), and the data in these collections spans nine different tables from the MySQL Sakila database.&lt;/p>
&lt;p>Group the data by rating and actor:&lt;/p>
&lt;pre>
SELECT g_pg_films.Rating,
g_pg_films.actor_list.`First name` first_name,
g_pg_films.actor_list.`Last name` last_name,
count(*) num_films
FROM
(SELECT f.Rating, flatten(Actors) actor_list
FROM films f
WHERE f.Rating IN ('G','PG')
) g_pg_films
GROUP BY g_pg_films.Rating,
g_pg_films.actor_list.`First name`,
g_pg_films.actor_list.`Last name`
HAVING count(*) > 9;
&lt;/pre>
&lt;p>The query should return all customers who have spent more than $80 to rent films rated either G or PG.&lt;/p>
&lt;pre>
SELECT first_name, last_name,
sum(cast(cust_payments.payment_data.Amount
as decimal(4,2))) tot_payments
FROM
(SELECT cust_data.first_name,
cust_data.last_name,
f.Rating,
flatten(cust_data.rental_data.Payments)
payment_data
FROM films f
INNER JOIN
(SELECT c.`First Name` first_name,
c.`Last Name` last_name, flatten(c.Rentals) rental_data
FROM customers c
) cust_data
ON f._id = cust_data.rental_data.filmID
WHERE f.Rating IN ('G','PG')
) cust_payments
GROUP BY first_name, last_name
HAVING
sum(cast(cust_payments.payment_data.Amount as decimal(4,2))) > 80;
&lt;/pre>
&lt;p>The innermost query, which I named cust_data, flattens the Rentals list so that the cust_payments query can join to the films collection and also flatten the Payments list. The outermost query groups the data by customer name and applies a having clause to filter out customers who spent $80 or less on films rated G or PG.&lt;/p>
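&lt;p>The effect of flatten() in the queries above can be mimicked in a few lines of code (a hypothetical Python sketch, not Drill itself): each element of an embedded list becomes its own row, with the parent document&amp;rsquo;s other fields repeated alongside it.&lt;/p>

```python
# Hypothetical Python sketch of Drill's flatten(): emit one row per element
# of an embedded list, repeating the parent document's other fields.
def flatten(docs, list_field):
    for doc in docs:
        parent = {k: v for k, v in doc.items() if k != list_field}
        for item in doc[list_field]:
            yield {**parent, list_field: item}

films = [{"Title": "ACADEMY DINOSAUR", "Rating": "PG",
          "Actors": [{"First name": "PENELOPE"}, {"First name": "CHRISTIAN"}]}]
rows = list(flatten(films, "Actors"))
print(len(rows))  # 2 rows: one per actor, with Rating repeated on each
```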
&lt;h1 id="drill-with-multiple-data-sources">Drill with Multiple Data Sources&lt;/h1>
&lt;p>As long as Drill is configured to connect to both databases, you just need to describe where to find the data.&lt;/p>
&lt;pre>
&lt;b>FROM mysql.sakila.film f&lt;/b>
&lt;b>FROM mongo.sakila.customers c&lt;/b>
&lt;/pre>
&lt;h1 id="future-of-sql">Future of SQL&lt;/h1>
&lt;p>The future of relational databases is somewhat unclear. It is possible that the big data technologies of the past decade will continue to mature and gain market share. It’s also possible that a new set of technologies will emerge, overtaking Hadoop and NoSQL, and taking additional market share from relational databases. However, most companies still run their core business functions using relational databases, and it should take a long time for this to change.&lt;/p>
&lt;p>The future of SQL seems a bit clearer, however. While the SQL language started out as a mechanism for interacting with data in relational databases, tools like Apache Drill act more like an abstraction layer, facilitating the analysis of data across various database platforms. In this author’s opinion, this trend will continue, and SQL will remain a critical tool for data analysis and reporting for many years.&lt;/p></description></item><item><title>Learning SQL Notes #15: Working with Large Databases</title><link>https://siqi-zheng.rbind.io/post/2021-06-11-sql-notes-15/</link><pubDate>Fri, 11 Jun 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-11-sql-notes-15/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#partitioning">Partitioning&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#partitioning-concepts">Partitioning Concepts&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#table-partitioning">Table Partitioning&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#index-partitioning">Index Partitioning&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#partitioning-methods">Partitioning Methods&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#range-partitioning">Range partitioning&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#list-partitioning">List partitioning&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#hash-partitioning">Hash partitioning&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#composite-partitioning">Composite partitioning&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#partitioning-benefits">Partitioning Benefits&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#clustering">Clustering&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#sharding">Sharding&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#big-data">Big Data&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#hadoop">Hadoop&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#nosql-and-document-databases">NoSQL and Document Databases&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#cloud-computing">Cloud Computing&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#conclusion">Conclusion&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>While relational databases face various challenges as data volumes continue to grow, strategies such as partitioning, clustering, and sharding allow companies to continue utilizing relational databases by spreading data across multiple storage tiers and servers. Other companies have decided to move to big data platforms such as Hadoop in order to handle huge data volumes.&lt;/p>
&lt;h1 id="partitioning">Partitioning&lt;/h1>
&lt;p>The following tasks become more difficult and/or time consuming as a table grows past a few million rows:&lt;/p>
&lt;ul>
&lt;li>Query execution requiring full table scans&lt;/li>
&lt;li>Index creation/rebuild&lt;/li>
&lt;li>Data archival/deletion&lt;/li>
&lt;li>Generation of table/index statistics&lt;/li>
&lt;li>Table relocation (e.g., move to a different tablespace)&lt;/li>
&lt;li>Database backups&lt;/li>
&lt;/ul>
&lt;p>The best way to prevent administrative issues from occurring in the future is to break large tables into pieces, or &lt;em>partitions&lt;/em>, when the table is first created (although tables can be partitioned later, it is easier to do so initially). Administrative tasks can be performed on individual partitions, often in parallel, and some tasks can skip one or more partitions entirely.&lt;/p>
&lt;h2 id="partitioning-concepts">Partitioning Concepts&lt;/h2>
&lt;p>While every partition must have the same schema definition (columns, column types, etc.), there are several administrative features that can differ for each partition:&lt;/p>
&lt;ul>
&lt;li>Partitions may be stored on different tablespaces, which can be on different physical storage tiers.&lt;/li>
&lt;li>Partitions can be compressed using different compression schemes.&lt;/li>
&lt;li>Local indexes (more on this shortly) can be dropped for some partitions.&lt;/li>
&lt;li>Table statistics can be frozen on some partitions, while being periodically refreshed on others.&lt;/li>
&lt;li>Individual partitions can be pinned into memory or stored in the database’s flash storage tier.&lt;/li>
&lt;/ul>
&lt;h2 id="table-partitioning">Table Partitioning&lt;/h2>
&lt;p>The partitioning scheme available in most relational databases is &lt;em>horizontal partitioning&lt;/em>, which assigns entire rows to exactly one partition. Tables may also be partitioned &lt;em>vertically&lt;/em>, which involves assigning sets of columns to different partitions, but this must be done manually. When partitioning a table horizontally, you must choose a &lt;em>partition key&lt;/em>, which is the column whose values are used to assign a row to a particular partition. In most cases, a table’s partition key consists of a single column, and a &lt;em>partitioning function&lt;/em> is applied to this column to determine in which partition each row should reside.&lt;/p>
&lt;h2 id="index-partitioning">Index Partitioning&lt;/h2>
&lt;p>If your partitioned table has indexes, you will get to choose whether a particular index should stay intact, known as a &lt;em>global index&lt;/em>, or be broken into pieces such that each partition has its own index, which is called a &lt;em>local index&lt;/em>. Global indexes span all partitions of the table and are useful for queries that do not specify a value for the partition key.&lt;/p>
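&lt;p>As a sketch of the difference (using Oracle-style syntax, since MySQL builds only local indexes on partitioned tables; the index names are illustrative):&lt;/p>
&lt;pre>&lt;code class="language-sql">-- a LOCAL index is split into one piece per table partition
CREATE INDEX sales_cust_idx ON sales (cust_id) LOCAL;

-- omitting LOCAL yields a single index spanning all partitions,
-- useful for queries that do not filter on the partition key
CREATE INDEX sales_store_idx ON sales (store_id);
&lt;/code>&lt;/pre>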
&lt;h2 id="partitioning-methods">Partitioning Methods&lt;/h2>
&lt;h3 id="range-partitioning">Range partitioning&lt;/h3>
&lt;p>The most common use of range partitioning is to break up tables by date ranges.&lt;/p>
&lt;pre>&lt;code class="language-sql">CREATE TABLE sales
(sale_id INT NOT NULL,
cust_id INT NOT NULL,
store_id INT NOT NULL,
sale_date DATE NOT NULL,
amount DECIMAL(9,2)
)
PARTITION BY RANGE (yearweek(sale_date))
(PARTITION s1 VALUES LESS THAN (202002),
PARTITION s2 VALUES LESS THAN (202003),
PARTITION s3 VALUES LESS THAN (202004),
PARTITION s4 VALUES LESS THAN (202005),
PARTITION s5 VALUES LESS THAN (202006),
PARTITION s999 VALUES LESS THAN (MAXVALUE)
);
&lt;/code>&lt;/pre>
&lt;p>To inspect and modify the partitions:&lt;/p>
&lt;pre>
SELECT partition_name, partition_method, partition_expression
&lt;b>FROM information_schema.partitions &lt;/b>
WHERE table_name = 'sales'
ORDER BY partition_ordinal_position;
ALTER TABLE sales &lt;b>REORGANIZE PARTITION&lt;/b> s999 INTO
(PARTITION s6 VALUES LESS THAN (202007),
PARTITION s7 VALUES LESS THAN (202008),
PARTITION s999 VALUES LESS THAN (MAXVALUE)
);
&lt;/pre>
&lt;h3 id="list-partitioning">List partitioning&lt;/h3>
&lt;pre>&lt;code class="language-sql">PARTITION BY LIST COLUMNS (geo_region_cd)
(PARTITION ASIA VALUES IN ('CHN','JPN','IND'))
ALTER TABLE sales REORGANIZE PARTITION ASIA INTO
(PARTITION ASIA VALUES IN ('CHN','JPN','IND', 'KOR'));
&lt;/code>&lt;/pre>
&lt;h3 id="hash-partitioning">Hash partitioning&lt;/h3>
&lt;p>The server distributes rows evenly across the partitions by applying a &lt;em>hashing function&lt;/em> to the partition key&amp;rsquo;s column value.&lt;/p>
&lt;pre>&lt;code class="language-sql">PARTITION BY HASH (cust_id)
PARTITIONS 4
(PARTITION H1,
PARTITION H2,
PARTITION H3,
PARTITION H4
);
&lt;/code>&lt;/pre>
&lt;h3 id="composite-partitioning">Composite partitioning&lt;/h3>
&lt;p>If you need finer-grained control of how data is allocated to your partitions, you can employ &lt;em>composite partitioning&lt;/em>, which allows you to use two different types of partitioning for the same table. With composite partitioning, the first partitioning method defines the partitions, and the second partitioning method defines the &lt;em>subpartitions&lt;/em>.&lt;/p>
&lt;pre>&lt;code class="language-sql">CREATE TABLE sales
(sale_id INT NOT NULL,
cust_id INT NOT NULL,
store_id INT NOT NULL,
sale_date DATE NOT NULL,
amount DECIMAL(9,2)
)
PARTITION BY RANGE (yearweek(sale_date))
SUBPARTITION BY HASH (cust_id)
(PARTITION s1 VALUES LESS THAN (202002)
(SUBPARTITION s1_h1, SUBPARTITION s1_h2, SUBPARTITION s1_h3, SUBPARTITION s1_h4),
PARTITION s2 VALUES LESS THAN (202003)
(SUBPARTITION s2_h1, SUBPARTITION s2_h2, SUBPARTITION s2_h3, SUBPARTITION s2_h4),
PARTITION s3 VALUES LESS THAN (202004)
(SUBPARTITION s3_h1, SUBPARTITION s3_h2,
SUBPARTITION s3_h3,
SUBPARTITION s3_h4),
PARTITION s4 VALUES LESS THAN (202005)
(SUBPARTITION s4_h1, SUBPARTITION s4_h2, SUBPARTITION s4_h3, SUBPARTITION s4_h4),
PARTITION s5 VALUES LESS THAN (202006)
(SUBPARTITION s5_h1, SUBPARTITION s5_h2, SUBPARTITION s5_h3, SUBPARTITION s5_h4),
PARTITION s999 VALUES LESS THAN (MAXVALUE)
(SUBPARTITION s999_h1, SUBPARTITION s999_h2, SUBPARTITION s999_h3,
SUBPARTITION s999_h4)
);
SELECT *
FROM sales PARTITION (s3);
SELECT *
FROM sales PARTITION (s3_h3);
&lt;/code>&lt;/pre>
&lt;h2 id="partitioning-benefits">Partitioning Benefits&lt;/h2>
&lt;p>One major advantage to partitioning is that a query may need to interact with only a single partition, rather than the entire table.&lt;/p>
&lt;p>If you execute a query that includes a join to a partitioned table and the query includes a condition on the partitioning column, the server can exclude any partitions that do not contain data pertinent to the query. This is known as a &lt;em>partitionwise join&lt;/em>, and it is similar to partition pruning in that only those partitions that contain data needed by the query will be included.&lt;/p>
&lt;p>From an administrative standpoint, one of the main benefits to partitioning is the ability to quickly delete data that is no longer needed.&lt;/p>
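&lt;p>For example, with the range-partitioned &lt;code>sales&lt;/code> table shown earlier, an entire week of obsolete data can be removed as a quick metadata operation instead of millions of individual row deletes (MySQL syntax):&lt;/p>
&lt;pre>&lt;code class="language-sql">-- removes partition s1 and every row stored in it
ALTER TABLE sales DROP PARTITION s1;
&lt;/code>&lt;/pre>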
&lt;p>Another administrative advantage to partitioned tables is the ability to perform updates on multiple partitions simultaneously, which can greatly reduce the time needed to touch every row in a table.&lt;/p>
&lt;h1 id="clustering">Clustering&lt;/h1>
&lt;p>&lt;em>Clustering&lt;/em> allows multiple servers to act as a single database.&lt;/p>
&lt;p>Shared-disk/shared-cache configurations: every server in the cluster has access to all disks, and data cached in one server can be accessed by any other server in the cluster. With this type of architecture, an application server could attach to any one of the database servers in the cluster, with connections automatically failing over to another server in the cluster in case of failure.&lt;/p>
&lt;p>Of the commercial database vendors, Oracle is the leader in this space, with many of the world’s biggest companies using the Oracle Exadata platform to host extremely large databases accessed by thousands of concurrent users. However, even this platform fails to meet the needs of the biggest companies, which led Google, Facebook, Amazon, and other companies to blaze new trails.&lt;/p>
&lt;h1 id="sharding">Sharding&lt;/h1>
&lt;p>&lt;em>Sharding&lt;/em> partitions the data across multiple databases (called &lt;em>shards&lt;/em>), so it is similar to table partitioning but on a larger scale and with far more complexity. If you were to employ this strategy for the social media company, you might decide to implement 100 separate databases, each one hosting the data for approximately 10 million users.&lt;/p>
&lt;ul>
&lt;li>You will need to choose a &lt;em>sharding key&lt;/em>, which is the value used to determine to which database to connect.&lt;/li>
&lt;li>While large tables will be divided into pieces, with individual rows assigned to a single shard, smaller reference tables may need to be replicated to all shards, and a strategy needs to be defined for how reference data can be modified and changes propagated to all shards.&lt;/li>
&lt;li>If individual shards become too large (e.g., the social media company now has two billion users), you will need a plan for adding more shards and redistributing data across the shards.&lt;/li>
&lt;li>When you need to make schema changes, you will need to have a strategy for deploying the changes across all of the shards so that all schemas stay in sync.&lt;/li>
&lt;li>If application logic needs to access data stored in two or more shards, you need to have a strategy for how to query across multiple databases and also how to implement transactions across multiple databases.&lt;/li>
&lt;/ul>
&lt;h1 id="big-data">Big Data&lt;/h1>
&lt;p>One way to define the boundaries of big data is with the “3 Vs”:&lt;/p>
&lt;p>&lt;em>Volume&lt;/em>&lt;/p>
&lt;p>In this context, volume generally means billions or trillions of data points.&lt;/p>
&lt;p>&lt;em>Velocity&lt;/em>&lt;/p>
&lt;p>This is a measure of how quickly data arrives.&lt;/p>
&lt;p>&lt;em>Variety&lt;/em>&lt;/p>
&lt;p>This means that data is not always structured (as in rows and columns in a relational database) but can also be unstructured (e.g., emails, videos, photos, audio files, etc.).&lt;/p>
&lt;p>So, one way to characterize big data is any system designed to handle a huge amount of data of various formats arriving at a rapid pace.&lt;/p>
&lt;h2 id="hadoop">Hadoop&lt;/h2>
&lt;p>Hadoop is best described as an &lt;em>ecosystem&lt;/em>, or a set of technologies and tools that work together. Some of the major components of Hadoop include:&lt;/p>
&lt;p>&lt;em>Hadoop Distributed File System (HDFS)&lt;/em>&lt;/p>
&lt;p>Like the name implies, HDFS enables file management across a large number of servers.&lt;/p>
&lt;p>&lt;em>MapReduce&lt;/em>&lt;/p>
&lt;p>This technology processes large amounts of structured and unstructured data by breaking a task into many small pieces that can be run in parallel across many servers.&lt;/p>
&lt;p>&lt;em>YARN&lt;/em>&lt;/p>
&lt;p>This is a resource manager and job scheduler for HDFS.&lt;/p>
&lt;p>Together, these technologies allow for the storage and processing of files across hundreds or even thousands of servers acting as a single logical system. While Hadoop is widely used, querying the data using MapReduce generally requires a programmer, which has led to the development of several SQL interfaces, including Hive, Impala, and Drill.&lt;/p>
&lt;h2 id="nosql-and-document-databases">NoSQL and Document Databases&lt;/h2>
&lt;p>What happens, however, if the structure of the data isn’t known beforehand or if the structure is known but changes frequently? The answer for many companies is to combine both the data and schema definition into documents using a format such as XML or JSON and then store the documents in a database. By doing so, various types of data can be stored in the same database without the need to make schema modifications, which makes storage easier but puts the burden on query and analytic tools to make sense of the data stored in the documents.&lt;/p>
&lt;p>Document databases are a subset of what are called NoSQL databases, which typically store data using a simple key-value mechanism. For example, using a document database such as MongoDB, you could utilize the customer ID as the key to store a JSON document containing all of the customer’s data, and other users can read the schema stored within the document to make sense of the data stored within.&lt;/p>
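&lt;p>For instance, a hypothetical customer document stored under the customer ID key might bundle data and structure together like this (all field names are illustrative):&lt;/p>
&lt;pre>&lt;code class="language-json">{
  "cust_id": 12345,
  "name": "Mary Smith",
  "addresses": [
    {"type": "home", "city": "Toronto"},
    {"type": "work", "city": "Markham"}
  ],
  "rentals": [
    {"film": "ACADEMY DINOSAUR", "date": "2005-07-15"}
  ]
}
&lt;/code>&lt;/pre>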
&lt;h2 id="cloud-computing">Cloud Computing&lt;/h2>
&lt;p>Prior to the advent of big data, most companies had to build their own data centers to house the database, web, and application servers used across the enterprise. With the advent of cloud computing, you can choose to essentially outsource your data center to platforms such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud. One of the biggest benefits to hosting your services in the cloud is &lt;strong>instant scalability&lt;/strong>, which allows you to quickly dial up or down the amount of computing power needed to run your services. Startups love these platforms because they can start writing code without spending any money up front for servers, storage, networks, or software licenses.&lt;/p>
&lt;p>As far as databases are concerned, a quick look at AWS’s database and analytics offerings yields the following options:&lt;/p>
&lt;ul>
&lt;li>Relational databases (MySQL, Aurora, PostgreSQL, MariaDB, Oracle, and SQL Server)&lt;/li>
&lt;li>In-memory database (ElastiCache)&lt;/li>
&lt;li>Data warehousing database (Redshift)&lt;/li>
&lt;li>NoSQL database (DynamoDB)&lt;/li>
&lt;li>Document database (DocumentDB)&lt;/li>
&lt;li>Graph database (Neptune)&lt;/li>
&lt;li>Time-series database (TimeStream)&lt;/li>
&lt;li>Hadoop (EMR)&lt;/li>
&lt;li>Data lakes (Lake Formation)&lt;/li>
&lt;/ul>
&lt;p>While relational databases dominated the landscape up until the mid-2000s, it’s pretty easy to see that companies are now mixing and matching various platforms and that relational databases may become less popular over time.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Databases are getting larger, but at the same time storage, clustering, and partitioning technologies are becoming more robust. Working with huge amounts of data can be quite challenging, regardless of the technology stack. Whether you use relational databases, big data platforms, or a variety of database servers, SQL is evolving to facilitate data retrieval from various technologies.&lt;/p></description></item><item><title>Learning SQL Notes #14: Analytic Functions</title><link>https://siqi-zheng.rbind.io/post/2021-06-11-sql-notes-14/</link><pubDate>Fri, 11 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-11-sql-notes-14/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#analytic-function-concepts">Analytic Function Concepts&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#data-windows">Data Windows&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#localized-sorting">Localized Sorting&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#ranking">Ranking&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#ranking-functions">Ranking Functions&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#generating-multiple-rankings">Generating Multiple Rankings&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#reporting-functions">Reporting Functions&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#window-frames">Window Frames&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#lag-and-lead">Lag and Lead&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#column-value-concatenation">Column Value Concatenation&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h1 id="analytic-function-concepts">Analytic Function Concepts&lt;/h1>
&lt;h2 id="data-windows">Data Windows&lt;/h2>
&lt;pre>
SELECT quarter(payment_date) quarter,
monthname(payment_date) month_nm,
sum(amount) monthly_sales,
&lt;b>max(sum(amount))
&lt;b>over () max_overall_sales,&lt;/b> /* empty over(): the window is the entire result set, returning the highest monthly total in 2005 */
&lt;b>max(sum(amount))
&lt;b>over (partition by quarter(payment_date)) max_qrtr_sales&lt;/b> /* window partitioned by quarter, returning the highest monthly total within each quarter of 2005 */
FROM payment
WHERE year(payment_date) = 2005
GROUP BY quarter(payment_date), monthname(payment_date);
&lt;/pre>
&lt;p>The analytic functions used to generate these additional columns group rows into two different sets: one set containing all rows in the same quarter and another set containing all of the rows. To accommodate this type of analysis, analytic functions include the ability to group rows into &lt;em>windows&lt;/em>, which effectively partition the data for use by the analytic function without changing the overall result set. Windows are defined using the &lt;code>over&lt;/code> clause combined with an optional &lt;code>partition by&lt;/code> subclause. In the previous query, both analytic functions include an over clause, but the first one is empty, indicating that the window should include the entire result set, whereas the second one specifies that the window should include only rows within the same quarter. Data windows may contain anywhere from a single row to all of the rows in the result set, and different analytic functions can define different data windows.&lt;/p>
&lt;h2 id="localized-sorting">Localized Sorting&lt;/h2>
&lt;pre>
SELECT quarter(payment_date) quarter,
monthname(payment_date) month_nm,
sum(amount) monthly_sales,
&lt;b>rank() over (order by sum(amount) desc)&lt;/b> sales_rank /* this order by controls only the rank() calculation */
FROM payment
WHERE year(payment_date) = 2005
GROUP BY quarter(payment_date), monthname(payment_date)
ORDER BY 1, month(payment_date); /* this order by controls only the presentation order */
&lt;/pre>
&lt;p>Alternatively, you may insert &lt;code>partition by quarter(payment_date)&lt;/code> into the &lt;code>over()&lt;/code> clause above to obtain a rank within each quarter.&lt;/p>
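&lt;p>The per-quarter version of the ranking query would then read as follows (the alias &lt;code>qtr_sales_rank&lt;/code> is illustrative):&lt;/p>
&lt;pre>
SELECT quarter(payment_date) quarter,
 monthname(payment_date) month_nm,
 sum(amount) monthly_sales,
 rank() over (&lt;b>partition by quarter(payment_date)&lt;/b>
     order by sum(amount) desc) qtr_sales_rank
FROM payment
WHERE year(payment_date) = 2005
GROUP BY quarter(payment_date), monthname(payment_date)
ORDER BY 1, month(payment_date);
&lt;/pre>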
&lt;h1 id="ranking">Ranking&lt;/h1>
&lt;h2 id="ranking-functions">Ranking Functions&lt;/h2>
&lt;p>There are multiple ranking functions available in the SQL standard, with each one taking a different approach to how ties are handled:&lt;/p>
&lt;p>&lt;code>row_number&lt;/code>&lt;/p>
&lt;p>Returns a unique number for each row, with rankings arbitrarily assigned in case of a tie&lt;/p>
&lt;p>&lt;code>rank&lt;/code>&lt;/p>
&lt;p>Returns the same ranking in case of a tie, with gaps in the rankings&lt;/p>
&lt;p>&lt;code>dense_rank&lt;/code>&lt;/p>
&lt;p>Returns the same ranking in case of a tie, with no gaps in the rankings&lt;/p>
&lt;pre>
SELECT customer_id, count(*) num_rentals,
row_number() over (order by count(*) desc) row_number_rnk,
rank() over (order by count(*) desc) rank_rnk,
dense_rank() over (order by count(*) desc) dense_rank_rnk
FROM rental
GROUP BY customer_id
ORDER BY 2 desc;
&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">customer_id&lt;/th>
&lt;th>num_rentals&lt;/th>
&lt;th>row_number_rnk&lt;/th>
&lt;th>rank_rnk&lt;/th>
&lt;th align="right">dense_rank_rnk&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">144&lt;/td>
&lt;td>42&lt;/td>
&lt;td>3&lt;/td>
&lt;td>3&lt;/td>
&lt;td align="right">3&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">236&lt;/td>
&lt;td>42&lt;/td>
&lt;td>4&lt;/td>
&lt;td>3&lt;/td>
&lt;td align="right">3&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;b>75&lt;/b>&lt;/td>
&lt;td>&lt;b>41&lt;/b>&lt;/td>
&lt;td>&lt;b>5&lt;/b>&lt;/td>
&lt;td>&lt;b>5&lt;/b>&lt;/td>
&lt;td align="right">&lt;b>4&lt;/b>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>How, then, would you identify the top 10 customers? There are three possible solutions:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Use the row_number function to identify customers ranked from 1 to 10, which results in exactly 10 customers in this example, but in other cases might exclude customers having the same number of rentals as the 10th ranked customer.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the rank function to identify customers ranked 10 or less, which also results in exactly 10 customers.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the dense_rank function to identify customers ranked 10 or less, which yields a list of 37 customers.&lt;/p>
&lt;/li>
&lt;/ul>
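&lt;p>For the second option, the rank-based filter has to live in a subquery, since analytic functions may not appear in the &lt;code>where&lt;/code> clause (a sketch):&lt;/p>
&lt;pre>
SELECT customer_id, num_rentals
FROM
 (SELECT customer_id, count(*) num_rentals,
     rank() over (order by count(*) desc) rank_rnk
  FROM rental
  GROUP BY customer_id
 ) cust_rankings
&lt;b>WHERE rank_rnk &lt;= 10&lt;/b>
ORDER BY num_rentals desc;
&lt;/pre>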
&lt;h2 id="generating-multiple-rankings">Generating Multiple Rankings&lt;/h2>
&lt;pre>
SELECT customer_id,
monthname(rental_date) rental_month,
count(*) num_rentals,
rank() over (&lt;b>partition by monthname(rental_date) &lt;/b>
order by count(*) desc) rank_rnk
FROM rental
GROUP BY customer_id, monthname(rental_date)
ORDER BY 2, 3 desc;
&lt;/pre>
&lt;p>The &lt;code>partition by&lt;/code> subclause causes &lt;code>rank()&lt;/code> to restart at 1 for each month.&lt;/p>
&lt;p>Looking at the results, you can see that the rankings are reset to 1 for each month. In order to generate the desired results for the marketing department (top five customers from each month), you can simply wrap the previous query in a subquery and add a filter condition to exclude any rows with a ranking higher than five:&lt;/p>
&lt;pre>
SELECT customer_id, rental_month, num_rentals, rank_rnk ranking
FROM
(SELECT customer_id,
monthname(rental_date) rental_month, count(*) num_rentals,
rank() over (partition by monthname(rental_date) order by count(*) desc) rank_rnk
FROM rental
GROUP BY customer_id, monthname(rental_date)
) cust_rankings
&lt;b>WHERE rank_rnk &lt;= 5&lt;/b>
ORDER BY rental_month, num_rentals desc, rank_rnk;
&lt;/pre>
&lt;p>Since analytic functions can be used only in the SELECT clause, you will often need to &lt;strong>nest queries&lt;/strong> if you need to do any filtering or grouping based on the results from the analytic function.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Window Function&lt;/th>
&lt;th>Return Type&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>CUME_DIST()&lt;/td>
&lt;td>DOUBLE PRECISION&lt;/td>
&lt;td>The CUME_DIST() window function calculates the relative rank of the current row within a window partition: (number of rows preceding or peer with current row) / (total rows in the window partition)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DENSE_RANK()&lt;/td>
&lt;td>BIGINT&lt;/td>
&lt;td>The DENSE_RANK () window function determines the rank of a value in a group of values based on the ORDER BY expression and the OVER clause. Each value is ranked within its partition. Rows with equal values receive the same rank. There are no gaps in the sequence of ranked values if two or more rows have the same rank.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>NTILE()&lt;/td>
&lt;td>INTEGER&lt;/td>
&lt;td>The NTILE window function divides the rows for each window partition, as equally as possible, into a specified number of ranked groups. The NTILE window function requires the ORDER BY clause in the OVER clause.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>PERCENT_RANK()&lt;/td>
&lt;td>DOUBLE PRECISION&lt;/td>
&lt;td>The PERCENT_RANK () window function calculates the percent rank of the current row using the following formula: (x - 1) / (number of rows in window partition - 1) where x is the rank of the current row.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>RANK()&lt;/td>
&lt;td>BIGINT&lt;/td>
&lt;td>The RANK window function determines the rank of a value in a group of values. The ORDER BY expression in the OVER clause determines the value. Each value is ranked within its partition. Rows with equal values for the ranking criteria receive the same rank. Drill adds the number of tied rows to the tied rank to calculate the next rank and thus the ranks might not be consecutive numbers. For example, if two rows are ranked 1, the next rank is 3. The DENSE_RANK window function differs in that no gaps exist if two or more rows tie.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>ROW_NUMBER()&lt;/td>
&lt;td>BIGINT&lt;/td>
&lt;td>The ROW_NUMBER window function determines the ordinal number of the current row within its partition. The ORDER BY expression in the OVER clause determines the number. Each value is ordered within its partition. Rows with equal values for the ORDER BY expressions receive different row numbers nondeterministically.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h1 id="reporting-functions">Reporting Functions&lt;/h1>
&lt;p>Calculate each payment&amp;rsquo;s monthly total and the grand total:&lt;/p>
&lt;pre>
SELECT monthname(payment_date) payment_month,
amount,
&lt;b>sum(amount) over (partition by monthname(payment_date)) monthly_total,
sum(amount) over () grand_total &lt;/b>
FROM payment
WHERE amount >= 10
ORDER BY 1;
&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">payment_month&lt;/th>
&lt;th>amount&lt;/th>
&lt;th>monthly_total&lt;/th>
&lt;th align="right">grand_total&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">August&lt;/td>
&lt;td>10.99&lt;/td>
&lt;td>521.53&lt;/td>
&lt;td align="right">1262.86&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">August&lt;/td>
&lt;td>11.99&lt;/td>
&lt;td>521.53&lt;/td>
&lt;td align="right">1262.86&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Calculate each month&amp;rsquo;s percentage of the total:&lt;/p>
&lt;pre>
SELECT monthname(payment_date) payment_month,
sum(amount) month_total,
&lt;b>round(sum(amount) / sum(sum(amount)) over () * 100, 2) pct_of_total&lt;/b>
FROM payment
GROUP BY monthname(payment_date);
&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">payment_month&lt;/th>
&lt;th>month_total&lt;/th>
&lt;th align="right">pct_of_total&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">May&lt;/td>
&lt;td>4824.43&lt;/td>
&lt;td align="right">7.16&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">June&lt;/td>
&lt;td>9631.88&lt;/td>
&lt;td align="right">14.29&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">July&lt;/td>
&lt;td>28373.89&lt;/td>
&lt;td align="right">42.09&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">August&lt;/td>
&lt;td>24072.13&lt;/td>
&lt;td align="right">35.71&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">February&lt;/td>
&lt;td>514.18&lt;/td>
&lt;td align="right">0.76&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Quasi-ranking functions:&lt;/p>
&lt;pre>
SELECT monthname(payment_date) payment_month,
sum(amount) month_total,
&lt;b>CASE sum(amount)
WHEN max(sum(amount)) over () THEN 'Highest'
WHEN min(sum(amount)) over () THEN 'Lowest'
ELSE 'Middle'
END descriptor&lt;/b>
FROM payment
GROUP BY monthname(payment_date);
&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">payment_month&lt;/th>
&lt;th>month_total&lt;/th>
&lt;th align="right">descriptor&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">May&lt;/td>
&lt;td>4824.43&lt;/td>
&lt;td align="right">Middle&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">June&lt;/td>
&lt;td>9631.88&lt;/td>
&lt;td align="right">Middle&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">July&lt;/td>
&lt;td>28373.89&lt;/td>
&lt;td align="right">&lt;b>Highest&lt;/b>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">August&lt;/td>
&lt;td>24072.13&lt;/td>
&lt;td align="right">Middle&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">February&lt;/td>
&lt;td>514.18&lt;/td>
&lt;td align="right">&lt;b>Lowest&lt;/b>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="window-frames">Window Frames&lt;/h2>
&lt;pre>
SELECT yearweek(payment_date) payment_week,
sum(amount) week_total,
sum(sum(amount))
&lt;b>over (order by yearweek(payment_date)
rows unbounded preceding)&lt;/b> rolling_sum
FROM payment
GROUP BY yearweek(payment_date)
ORDER BY 1;
&lt;/pre>
&lt;pre>
SELECT yearweek(payment_date) payment_week,
sum(amount) week_total,
avg(sum(amount))
over (order by yearweek(payment_date)
&lt;b>rows between 1 preceding and 1 following&lt;/b>) rolling_3wk_avg
FROM payment
GROUP BY yearweek(payment_date)
ORDER BY 1;
&lt;/pre>
&lt;pre>
SELECT date(payment_date), sum(amount),
avg(sum(amount))
over (order by date(payment_date)
&lt;b>range between interval 3 day preceding and interval 3 day following&lt;/b>) rolling_7day_avg
FROM payment
WHERE payment_date BETWEEN '2005-07-01' AND '2005-09-01'
GROUP BY date(payment_date)
ORDER BY 1;
&lt;/pre>
&lt;h2 id="lag-and-lead">Lag and Lead&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Window Function&lt;/th>
&lt;th>Argument Type&lt;/th>
&lt;th>Return Type&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>LAG()&lt;/td>
&lt;td>Any supported Drill data types&lt;/td>
&lt;td>Same as the expression type&lt;/td>
&lt;td>The LAG() window function returns the value for the row before the current row in a partition. If no row exists, null is returned.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>LEAD()&lt;/td>
&lt;td>Any supported Drill data types&lt;/td>
&lt;td>Same as the expression type&lt;/td>
&lt;td>The LEAD() window function returns the value for the row after the current row in a partition. If no row exists, null is returned.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>FIRST_VALUE&lt;/td>
&lt;td>Any supported Drill data types&lt;/td>
&lt;td>Same as the expression type&lt;/td>
&lt;td>The FIRST_VALUE window function returns the value of the specified expression with respect to the first row in the window frame.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>LAST_VALUE&lt;/td>
&lt;td>Any supported Drill data types&lt;/td>
&lt;td>Same as the expression type&lt;/td>
&lt;td>The LAST_VALUE window function returns the value of the specified expression with respect to the last row in the window frame.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>
SELECT yearweek(payment_date) payment_week,
sum(amount) week_total,
&lt;b>lag(sum(amount), 1)
over (order by yearweek(payment_date)) prev_wk_tot,&lt;/b>
&lt;b>lead(sum(amount), 1)
over (order by yearweek(payment_date)) next_wk_tot&lt;/b>
FROM payment
GROUP BY yearweek(payment_date)
ORDER BY 1;
&lt;/pre>
&lt;pre>
SELECT yearweek(payment_date) payment_week,
sum(amount) week_total,
&lt;b>round((sum(amount) - lag(sum(amount), 1)
over (order by yearweek(payment_date))) / lag(sum(amount), 1)
over (order by yearweek(payment_date)) * 100, 1) pct_diff&lt;/b>
FROM payment
GROUP BY yearweek(payment_date)
ORDER BY 1;
&lt;/pre>
&lt;h2 id="column-value-concatenation">Column Value Concatenation&lt;/h2>
&lt;pre>
SELECT f.title,
&lt;b>group_concat(a.last_name order by a.last_name separator ', ') actors&lt;/b>
FROM actor a
INNER JOIN film_actor fa
ON a.actor_id = fa.actor_id
INNER JOIN film f
ON fa.film_id = f.film_id
GROUP BY f.title
HAVING count(*) = 3;
&lt;/pre></description></item><item><title>Learning SQL Notes #13: Metadata</title><link>https://siqi-zheng.rbind.io/post/2021-06-10-sql-notes-13/</link><pubDate>Thu, 10 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-10-sql-notes-13/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#data-about-data">Data About Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#information_schema">information_schema&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#working-with-metadata">Working with Metadata&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#schema-generation-scripts">Schema Generation Scripts&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#deployment-verification">Deployment Verification&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#dynamic-sql-generation">Dynamic SQL Generation&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>A database server also needs to store information about all of the database objects (tables, views, indexes, etc.) that were created to store this data in a database. This chapter discusses how and where this information, known as &lt;em>metadata&lt;/em>, is stored, how you can access it, and how you can use it to build flexible systems.&lt;/p>
&lt;h1 id="data-about-data">Data About Data&lt;/h1>
&lt;p>Metadata is essentially data about data. Every time you create a database object, the database server needs to record various pieces of information. For example, if you were to create a table with multiple columns, a primary key constraint, three indexes, and a foreign key constraint, the database server would need to store all the following information:&lt;/p>
&lt;ul>
&lt;li>Table name&lt;/li>
&lt;li>Table storage information (tablespace, initial size, etc.)&lt;/li>
&lt;li>Storage engine&lt;/li>
&lt;li>Column names&lt;/li>
&lt;li>Column data types&lt;/li>
&lt;li>Default column values&lt;/li>
&lt;li>not null column constraints&lt;/li>
&lt;li>Primary key columns&lt;/li>
&lt;li>Primary key name&lt;/li>
&lt;li>Name of primary key index&lt;/li>
&lt;li>Index names&lt;/li>
&lt;li>Index types (B-tree, bitmap)&lt;/li>
&lt;li>Indexed columns&lt;/li>
&lt;li>Index column sort order (ascending or descending)&lt;/li>
&lt;li>Index storage information&lt;/li>
&lt;li>Foreign key name&lt;/li>
&lt;li>Foreign key columns&lt;/li>
&lt;li>Associated table/columns for foreign keys&lt;/li>
&lt;/ul>
&lt;p>This data is collectively known as the &lt;em>data dictionary&lt;/em> or &lt;em>system catalog&lt;/em>. The database server needs to store this data persistently, and it needs to be able to quickly retrieve this data in order to verify and execute SQL statements. Additionally, the database server must safeguard this data so that it can be modified only via an appropriate mechanism, such as the &lt;code>alter&lt;/code> table statement.&lt;/p>
&lt;p>Every database server uses a different mechanism to publish metadata, such as:&lt;/p>
&lt;ul>
&lt;li>A set of views, such as Oracle Database’s user_tables and all_constraints views&lt;/li>
&lt;li>A set of system-stored procedures, such as SQL Server’s sp_tables procedure or Oracle Database’s dbms_metadata package&lt;/li>
&lt;li>A special database, such as MySQL’s information_schema database&lt;/li>
&lt;/ul>
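&lt;p>As a self-contained sketch of the same idea — runnable anywhere Python is available — here is how SQLite publishes its data dictionary through the &lt;code>sqlite_master&lt;/code> catalog table (an analog of MySQL's information_schema, used here purely for illustration):&lt;/p>

```python
import sqlite3

# Create a couple of schema objects; the server records them in its catalog.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE category (
           category_id INTEGER PRIMARY KEY,
           name TEXT NOT NULL
       )"""
)
conn.execute("CREATE INDEX idx_category_name ON category (name)")

# The catalog can be queried with ordinary SELECT statements.
rows = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()
for obj_type, obj_name in rows:
    print(obj_type, obj_name)
```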
&lt;h1 id="information_schema">information_schema&lt;/h1>
&lt;p>All of the objects available within the information_schema database (or &lt;em>schema&lt;/em>, in the case of SQL Server) are views. Unlike the describe utility, the views within information_schema can be queried and, thus, used programmatically.&lt;/p>
&lt;table frame="box" rules="all" summary="A reference that lists all INFORMATION_SCHEMA tables.">&lt;col style="width: 22%">&lt;col style="width: 55%">&lt;col style="width: 11%">&lt;col style="width: 11%">&lt;thead>&lt;tr>&lt;th>Table Name&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Introduced&lt;/th>
&lt;th>Deprecated&lt;/th>
&lt;/tr>&lt;/thead>&lt;tbody>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ADMINISTRABLE_ROLE_AUTHORIZATIONS&lt;/code>&lt;/th>&lt;td>Grantable users or roles for current user or role&lt;/td>&lt;td>8.0.19&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">APPLICABLE_ROLES&lt;/code>&lt;/th>&lt;td>Applicable roles for current user&lt;/td>&lt;td>8.0.19&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">CHARACTER_SETS&lt;/code>&lt;/th>&lt;td>Available character sets&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">CHECK_CONSTRAINTS&lt;/code>&lt;/th>&lt;td>Table and column CHECK constraints&lt;/td>&lt;td>8.0.16&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">COLLATION_CHARACTER_SET_APPLICABILITY&lt;/code>&lt;/th>&lt;td>Character set applicable to each collation&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">COLLATIONS&lt;/code>&lt;/th>&lt;td>Collations for each character set&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">COLUMN_PRIVILEGES&lt;/code>&lt;/th>&lt;td>Privileges defined on columns&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">COLUMN_STATISTICS&lt;/code>&lt;/th>&lt;td>Histogram statistics for column values&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">COLUMNS&lt;/code>&lt;/th>&lt;td>Columns in each table&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">COLUMNS_EXTENSIONS&lt;/code>&lt;/th>&lt;td>Column attributes for primary and secondary storage engines&lt;/td>&lt;td>8.0.21&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">CONNECTION_CONTROL_FAILED_LOGIN_ATTEMPTS&lt;/code>&lt;/th>&lt;td>Current number of consecutive failed connection attempts per account&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ENABLED_ROLES&lt;/code>&lt;/th>&lt;td>Roles enabled within current session&lt;/td>&lt;td>8.0.19&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ENGINES&lt;/code>&lt;/th>&lt;td>Storage engine properties&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">EVENTS&lt;/code>&lt;/th>&lt;td>Event Manager events&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">FILES&lt;/code>&lt;/th>&lt;td>Files that store tablespace data&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_BUFFER_PAGE&lt;/code>&lt;/th>&lt;td>Pages in InnoDB buffer pool&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_BUFFER_PAGE_LRU&lt;/code>&lt;/th>&lt;td>LRU ordering of pages in InnoDB buffer pool&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_BUFFER_POOL_STATS&lt;/code>&lt;/th>&lt;td>InnoDB buffer pool statistics&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_CACHED_INDEXES&lt;/code>&lt;/th>&lt;td>Number of index pages cached per index in InnoDB buffer pool&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_CMP&lt;/code>&lt;/th>&lt;td>Status for operations related to compressed InnoDB tables&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_CMP_PER_INDEX&lt;/code>&lt;/th>&lt;td>Status for operations related to compressed InnoDB tables and indexes&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_CMP_PER_INDEX_RESET&lt;/code>&lt;/th>&lt;td>Status for operations related to compressed InnoDB tables and indexes&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_CMP_RESET&lt;/code>&lt;/th>&lt;td>Status for operations related to compressed InnoDB tables&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_CMPMEM&lt;/code>&lt;/th>&lt;td>Status for compressed pages within InnoDB buffer pool&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_CMPMEM_RESET&lt;/code>&lt;/th>&lt;td>Status for compressed pages within InnoDB buffer pool&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_COLUMNS&lt;/code>&lt;/th>&lt;td>Columns in each InnoDB table&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_DATAFILES&lt;/code>&lt;/th>&lt;td>Data file path information for InnoDB file-per-table and general tablespaces&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FIELDS&lt;/code>&lt;/th>&lt;td>Key columns of InnoDB indexes&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FOREIGN&lt;/code>&lt;/th>&lt;td>InnoDB foreign-key metadata&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FOREIGN_COLS&lt;/code>&lt;/th>&lt;td>InnoDB foreign-key column status information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FT_BEING_DELETED&lt;/code>&lt;/th>&lt;td>Snapshot of INNODB_FT_DELETED table&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FT_CONFIG&lt;/code>&lt;/th>&lt;td>Metadata for InnoDB table FULLTEXT index and associated processing&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FT_DEFAULT_STOPWORD&lt;/code>&lt;/th>&lt;td>Default list of stopwords for InnoDB FULLTEXT indexes&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FT_DELETED&lt;/code>&lt;/th>&lt;td>Rows deleted from InnoDB table FULLTEXT index&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FT_INDEX_CACHE&lt;/code>&lt;/th>&lt;td>Token information for newly inserted rows in InnoDB FULLTEXT index&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_FT_INDEX_TABLE&lt;/code>&lt;/th>&lt;td>Inverted index information for processing text searches against InnoDB table FULLTEXT index&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_INDEXES&lt;/code>&lt;/th>&lt;td>InnoDB index metadata&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_METRICS&lt;/code>&lt;/th>&lt;td>InnoDB performance information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_SESSION_TEMP_TABLESPACES&lt;/code>&lt;/th>&lt;td>Session temporary-tablespace metadata&lt;/td>&lt;td>8.0.13&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_TABLES&lt;/code>&lt;/th>&lt;td>InnoDB table metadata&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_TABLESPACES&lt;/code>&lt;/th>&lt;td>InnoDB file-per-table, general, and undo tablespace metadata&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_TABLESPACES_BRIEF&lt;/code>&lt;/th>&lt;td>Brief file-per-table, general, undo, and system tablespace metadata&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_TABLESTATS&lt;/code>&lt;/th>&lt;td>InnoDB table low-level status information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_TEMP_TABLE_INFO&lt;/code>&lt;/th>&lt;td>Information about active user-created InnoDB temporary tables&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_TRX&lt;/code>&lt;/th>&lt;td>Active InnoDB transaction information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">INNODB_VIRTUAL&lt;/code>&lt;/th>&lt;td>InnoDB virtual generated column metadata&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">KEY_COLUMN_USAGE&lt;/code>&lt;/th>&lt;td>Which key columns have constraints&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">KEYWORDS&lt;/code>&lt;/th>&lt;td>MySQL keywords&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">MYSQL_FIREWALL_USERS&lt;/code>&lt;/th>&lt;td>Firewall in-memory data for account profiles&lt;/td>&lt;td>&lt;/td>&lt;td>8.0.26&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">MYSQL_FIREWALL_WHITELIST&lt;/code>&lt;/th>&lt;td>Firewall in-memory data for account profile allowlists&lt;/td>&lt;td>&lt;/td>&lt;td>8.0.26&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ndb_transid_mysql_connection_map&lt;/code>&lt;/th>&lt;td>NDB transaction information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">OPTIMIZER_TRACE&lt;/code>&lt;/th>&lt;td>Information produced by optimizer trace activity&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">PARAMETERS&lt;/code>&lt;/th>&lt;td>Stored routine parameters and stored function return values&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">PARTITIONS&lt;/code>&lt;/th>&lt;td>Table partition information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">PLUGINS&lt;/code>&lt;/th>&lt;td>Plugin information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">PROCESSLIST&lt;/code>&lt;/th>&lt;td>Information about currently executing threads&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">PROFILING&lt;/code>&lt;/th>&lt;td>Statement profiling information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">REFERENTIAL_CONSTRAINTS&lt;/code>&lt;/th>&lt;td>Foreign key information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">RESOURCE_GROUPS&lt;/code>&lt;/th>&lt;td>Resource group information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ROLE_COLUMN_GRANTS&lt;/code>&lt;/th>&lt;td>Column privileges for roles available to or granted by currently enabled roles&lt;/td>&lt;td>8.0.19&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ROLE_ROUTINE_GRANTS&lt;/code>&lt;/th>&lt;td>Routine privileges for roles available to or granted by currently enabled roles&lt;/td>&lt;td>8.0.19&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ROLE_TABLE_GRANTS&lt;/code>&lt;/th>&lt;td>Table privileges for roles available to or granted by currently enabled roles&lt;/td>&lt;td>8.0.19&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ROUTINES&lt;/code>&lt;/th>&lt;td>Stored routine information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">SCHEMA_PRIVILEGES&lt;/code>&lt;/th>&lt;td>Privileges defined on schemas&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">SCHEMATA&lt;/code>&lt;/th>&lt;td>Schema information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">SCHEMATA_EXTENSIONS&lt;/code>&lt;/th>&lt;td>Schema options&lt;/td>&lt;td>8.0.22&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ST_GEOMETRY_COLUMNS&lt;/code>&lt;/th>&lt;td>Columns in each table that store spatial data&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ST_SPATIAL_REFERENCE_SYSTEMS&lt;/code>&lt;/th>&lt;td>Available spatial reference systems&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">ST_UNITS_OF_MEASURE&lt;/code>&lt;/th>&lt;td>Acceptable units for ST_Distance()&lt;/td>&lt;td>8.0.14&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">STATISTICS&lt;/code>&lt;/th>&lt;td>Table index statistics&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TABLE_CONSTRAINTS&lt;/code>&lt;/th>&lt;td>Which tables have constraints&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TABLE_CONSTRAINTS_EXTENSIONS&lt;/code>&lt;/th>&lt;td>Table constraint attributes for primary and secondary storage engines&lt;/td>&lt;td>8.0.21&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TABLE_PRIVILEGES&lt;/code>&lt;/th>&lt;td>Privileges defined on tables&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TABLES&lt;/code>&lt;/th>&lt;td>Table information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TABLES_EXTENSIONS&lt;/code>&lt;/th>&lt;td>Table attributes for primary and secondary storage engines&lt;/td>&lt;td>8.0.21&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TABLESPACES&lt;/code>&lt;/th>&lt;td>Tablespace information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TABLESPACES_EXTENSIONS&lt;/code>&lt;/th>&lt;td>Tablespace attributes for primary storage engines&lt;/td>&lt;td>8.0.21&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TP_THREAD_GROUP_STATE&lt;/code>&lt;/th>&lt;td>Thread pool thread group states&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TP_THREAD_GROUP_STATS&lt;/code>&lt;/th>&lt;td>Thread pool thread group statistics&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TP_THREAD_STATE&lt;/code>&lt;/th>&lt;td>Thread pool thread information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">TRIGGERS&lt;/code>&lt;/th>&lt;td>Trigger information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">USER_ATTRIBUTES&lt;/code>&lt;/th>&lt;td>User comments and attributes&lt;/td>&lt;td>8.0.21&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">USER_PRIVILEGES&lt;/code>&lt;/th>&lt;td>Privileges defined globally per user&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">VIEW_ROUTINE_USAGE&lt;/code>&lt;/th>&lt;td>Stored functions used in views&lt;/td>&lt;td>8.0.13&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">VIEW_TABLE_USAGE&lt;/code>&lt;/th>&lt;td>Tables and views used in views&lt;/td>&lt;td>8.0.13&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;tr>&lt;th scope="row">&lt;code class="literal">VIEWS&lt;/code>&lt;/th>&lt;td>View information&lt;/td>&lt;td>&lt;/td>&lt;td>&lt;/td>&lt;/tr>
&lt;/tbody>&lt;/table>
&lt;h1 id="working-with-metadata">Working with Metadata&lt;/h1>
&lt;h2 id="schema-generation-scripts">Schema Generation Scripts&lt;/h2>
&lt;p>Suppose you need a script that re-creates the tables, indexes, views, and other objects your team has deployed. As an example, the following query builds a template-like SQL script for creating the sakila.category table.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT 'CREATE TABLE category (' create_table_statement
UNION ALL
SELECT cols.txt
FROM
(SELECT concat(' ',column_name, ' ', column_type,
CASE
WHEN is_nullable = 'NO' THEN ' not null' ELSE ''
END, CASE
WHEN extra IS NOT NULL AND extra LIKE 'DEFAULT_GENERATED%' THEN concat(' DEFAULT ',column_default,substr(extra,18)) WHEN extra IS NOT NULL THEN concat(' ', extra)
ELSE '' END, ',') txt
FROM information_schema.columns
WHERE table_schema = 'sakila' AND table_name = 'category'
ORDER BY ordinal_position
) cols
UNION ALL
SELECT concat(' constraint primary key (')
FROM information_schema.table_constraints
WHERE table_schema = 'sakila' AND table_name = 'category'
AND constraint_type = 'PRIMARY KEY'
UNION ALL
SELECT cols.txt
FROM
(SELECT concat(CASE WHEN ordinal_position &amp;gt; 1 THEN ' ,'
ELSE ' ' END, column_name) txt
FROM information_schema.key_column_usage
WHERE table_schema = 'sakila' AND table_name = 'category'
AND constraint_name = 'PRIMARY'
ORDER BY ordinal_position
) cols
UNION ALL
SELECT ' )'
UNION ALL
SELECT ')';
&lt;/code>&lt;/pre>
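&lt;p>The same schema-generation technique can also be sketched procedurally. Here is a minimal, self-contained version that reads column metadata from SQLite's &lt;code>PRAGMA table_info&lt;/code> catalog (standing in for information_schema.columns; the table is a toy stand-in for sakila.category):&lt;/p>

```python
import sqlite3

# Create a toy version of the category table whose DDL we will reconstruct.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE category (
           category_id INTEGER NOT NULL,
           name TEXT NOT NULL,
           last_update TEXT,
           PRIMARY KEY (category_id)
       )"""
)

# Each PRAGMA table_info row: (cid, name, type, notnull, dflt_value, pk)
cols = conn.execute("PRAGMA table_info(category)").fetchall()
lines = [
    "  %s %s%s," % (name, ctype, " not null" if notnull else "")
    for _cid, name, ctype, notnull, _dflt, _pk in cols
]
pk_cols = [name for _cid, name, _t, _nn, _d, pk in cols if pk]
lines.append("  constraint primary key (%s)" % ", ".join(pk_cols))
ddl = "CREATE TABLE category (\n%s\n);" % "\n".join(lines)
print(ddl)
```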
&lt;h2 id="deployment-verification">Deployment Verification&lt;/h2>
&lt;p>After the deployment scripts have been run, it’s a good idea to run a verification script to ensure that the new schema objects are in place with the appropriate columns, indexes, primary keys, and so forth. Here’s a query that returns the number of columns, number of indexes, and number of primary key constraints (0 or 1) for each table in the Sakila schema:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT tbl.table_name,
(SELECT count(*)
FROM information_schema.columns clm
WHERE clm.table_schema = tbl.table_schema
AND clm.table_name = tbl.table_name) num_columns,
(SELECT count(*)
FROM information_schema.statistics sta
WHERE sta.table_schema = tbl.table_schema
AND sta.table_name = tbl.table_name) num_indexes,
(SELECT count(*)
FROM information_schema.table_constraints tc
WHERE tc.table_schema = tbl.table_schema
AND tc.table_name = tbl.table_name
AND tc.constraint_type = 'PRIMARY KEY') num_primary_keys
FROM information_schema.tables tbl
WHERE tbl.table_schema = 'sakila' AND tbl.table_type = 'BASE TABLE'
ORDER BY 1;
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">TABLE_NAME&lt;/th>
&lt;th>num_columns&lt;/th>
&lt;th>num_indexes&lt;/th>
&lt;th align="right">num_primary_keys&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">actor&lt;/td>
&lt;td>4&lt;/td>
&lt;td>2&lt;/td>
&lt;td align="right">1&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
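&lt;p>The verification idea carries over to any catalog. A minimal sketch against SQLite's catalogs (the table and index names are made up for illustration):&lt;/p>

```python
import sqlite3

# Deploy a toy schema, then count columns and indexes per table from the catalog.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE actor (actor_id INTEGER PRIMARY KEY,
                        first_name TEXT, last_name TEXT, last_update TEXT);
    CREATE INDEX idx_actor_last_name ON actor (last_name);
    """
)

report = {}
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
for (tbl,) in tables:
    num_columns = len(conn.execute(f"PRAGMA table_info({tbl})").fetchall())
    num_indexes = len(conn.execute(f"PRAGMA index_list({tbl})").fetchall())
    report[tbl] = (num_columns, num_indexes)
print(report)
```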
&lt;h2 id="dynamic-sql-generation">Dynamic SQL Generation&lt;/h2>
&lt;p>Most relational database servers, including SQL Server, Oracle Database, and MySQL, allow SQL statements to be submitted to the server as strings. Submitting strings to a database engine rather than utilizing its SQL interface is generally known as &lt;em>dynamic SQL execution&lt;/em>.&lt;/p>
&lt;p>&lt;em>Oracle’s PL/SQL language&lt;/em>&lt;/p>
&lt;p>&lt;code>execute immediate&lt;/code>&lt;/p>
&lt;p>&lt;em>SQL Server&lt;/em>&lt;/p>
&lt;p>&lt;code>sp_executesql&lt;/code>&lt;/p>
&lt;p>&lt;em>MySQL&lt;/em>&lt;/p>
&lt;p>&lt;code>prepare, execute, deallocate&lt;/code>&lt;/p>
&lt;pre>&lt;code class="language-sql">SET @qry = 'SELECT customer_id, first_name, last_name FROM customer';
PREPARE dynsql1 FROM @qry;
EXECUTE dynsql1;
DEALLOCATE PREPARE dynsql1;
/*conditions can be specified at runtime*/
SET @qry = 'SELECT customer_id, first_name, last_name FROM customer WHERE customer_id = ?';
PREPARE dynsql2 FROM @qry;
SET @custid = 9;
EXECUTE dynsql2 USING @custid;
SET @custid = 145;
EXECUTE dynsql2 USING @custid;
DEALLOCATE PREPARE dynsql2;
&lt;/code>&lt;/pre>
&lt;p>Or you can do the following:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT concat('SELECT ', concat_ws(',', cols.col1, cols.col2),
       ' FROM customer WHERE customer_id = ?')
INTO @qry
FROM (SELECT
        max(CASE WHEN ordinal_position = 1 THEN column_name
            ELSE NULL END) col1,
        max(CASE WHEN ordinal_position = 2 THEN column_name
            ELSE NULL END) col2
      FROM information_schema.columns
      WHERE table_schema = 'sakila' AND table_name = 'customer'
      GROUP BY table_name
     ) cols;
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-sql">PREPARE dynsql3 FROM @qry;
SET @custid = 45; Query OK, 0 rows affected (0.00 sec)
EXECUTE dynsql3 USING @custid;
DEALLOCATE PREPARE dynsql3;
&lt;/code>&lt;/pre>
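&lt;p>In client code, the PREPARE / EXECUTE ... USING pattern corresponds to parameterized queries. A self-contained sqlite3 sketch (the table and rows are invented for illustration):&lt;/p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(9, "MARGARET", "MOORE"), (145, "LUCILLE", "HOLMES")],
)

# The statement text stays fixed; only the bound value changes between calls,
# just as with EXECUTE dynsql2 USING @custid.
qry = "SELECT customer_id, first_name, last_name FROM customer WHERE customer_id = ?"
for custid in (9, 145):
    print(conn.execute(qry, (custid,)).fetchone())
```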
&lt;p>Note: Generally, it would be better to generate the query using a procedural language that includes looping constructs, such as Java, PL/SQL, Transact-SQL, or MySQL’s Stored Procedure Language.&lt;/p></description></item><item><title>Learning SQL Notes #12: Views</title><link>https://siqi-zheng.rbind.io/post/2021-06-09-sql-notes-12/</link><pubDate>Wed, 09 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-09-sql-notes-12/</guid><description>&lt;p>Well-designed applications generally expose a public interface while keeping implementation details private, thereby enabling future design changes without impacting end users. When designing your database, you can achieve a similar result by keeping your tables private and allowing your users to access data only through a set of &lt;em>views&lt;/em>.&lt;/p>
&lt;h1 id="what-are-views">What Are Views?&lt;/h1>
&lt;pre>&lt;code class="language-sql">CREATE VIEW customer_vw
(customer_id,
first_name,
last_name,
email
)
AS
SELECT customer_id,
first_name,
last_name,
concat(substr(email,1,2), '*****', substr(email, -4)) email
FROM customer;
/*view the View*/
describe customer_vw;
/*group by, having, where, join etc. can also be used*/
&lt;/code>&lt;/pre>
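&lt;p>For a runnable illustration of the masking view, here is the same idea in SQLite (which uses &lt;code>||&lt;/code> in place of concat; the sample row is made up):&lt;/p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, email TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'MARY.SMITH@sakilacustomer.org')")

# SQLite has no concat(); the || operator plays the same role.
conn.execute(
    """CREATE VIEW customer_vw AS
       SELECT customer_id,
              substr(email, 1, 2) || '*****' || substr(email, -4) AS email
       FROM customer"""
)
masked = conn.execute("SELECT email FROM customer_vw").fetchone()[0]
print(masked)  # → MA*****.org
```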
&lt;h1 id="why-use-views">Why Use Views?&lt;/h1>
&lt;ul>
&lt;li>
&lt;p>Data Security&lt;/p>
&lt;p>Oracle Database users have another option for securing both rows and columns of a table: Virtual Private Database (VPD). VPD allows you to attach policies to your tables, after which the server will modify a user’s query as necessary to enforce the policies.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Data Aggregation&lt;/p>
&lt;pre>&lt;code class="language-sql">CREATE VIEW sales_by_film_category AS
SELECT c.name AS category,
SUM(p.amount) AS total_sales
FROM payment AS p
INNER JOIN rental AS r
ON p.rental_id = r.rental_id
INNER JOIN inventory AS i
ON r.inventory_id = i.inventory_id
INNER JOIN film AS f
ON i.film_id = f.film_id
INNER JOIN film_category AS fc
ON f.film_id = fc.film_id
INNER JOIN category AS c
ON fc.category_id = c.category_id
GROUP BY c.name
ORDER BY total_sales DESC;
&lt;/code>&lt;/pre>
&lt;p>You have great flexibility! You can create a film_category_sales table, load it with aggregated data, and modify the sales_by_film_category view definition to retrieve data from this table if this improves the performance significantly.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hiding Complexity&lt;/p>
&lt;p>One of the most common reasons for deploying views is to shield end users from complexity.&lt;/p>
&lt;pre>&lt;code class="language-sql">CREATE VIEW film_stats AS
SELECT f.film_id, f.title, f.description, f.rating,
(SELECT c.name
FROM category c
INNER JOIN film_category fc
ON c.category_id = fc.category_id
WHERE fc.film_id = f.film_id) category_name,
(SELECT count(*)
FROM film_actor fa
WHERE fa.film_id = f.film_id ) num_actors,
(SELECT count(*)
FROM inventory i
WHERE i.film_id = f.film_id ) inventory_cnt,
(SELECT count(*)
FROM inventory i
INNER JOIN rental r
ON i.inventory_id = r.inventory_id
WHERE i.film_id = f.film_id ) num_rentals
FROM film f;
&lt;/code>&lt;/pre>
&lt;p>If someone uses this view but does not reference the category_name, num_actors, inventory_cnt, or num_rentals column, then none of the subqueries will be executed. This approach allows the view to be used for supplying descriptive information from the film table without unnecessarily joining five other tables.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Joining Partitioned Data&lt;/p>
&lt;p>Some database designs break large tables into multiple pieces in order to improve performance. For example, if the payment table became large, the designers may decide to break it into two tables: payment_current, which holds the latest six months of data, and payment_historic, which holds all data up to six months ago. You can make it look like all payment data is stored in a single table.&lt;/p>
&lt;pre>&lt;code class="language-sql">CREATE VIEW payment_all
(payment_id,
customer_id,
staff_id,
rental_id, amount,
payment_date,
last_update
) AS
SELECT payment_id, customer_id, staff_id, rental_id, amount, payment_date, last_update
FROM payment_historic
UNION ALL
SELECT payment_id, customer_id, staff_id, rental_id, amount, payment_date, last_update
FROM payment_current;
&lt;/code>&lt;/pre>
&lt;p>Using a view in this case is a good idea because it allows the designers to change the structure of the underlying data without the need to force all database users to modify their queries.&lt;/p>
&lt;/li>
&lt;/ul>
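&lt;p>The partition-stitching view above can be reproduced end to end in a small, self-contained SQLite sketch (toy schema and made-up rows):&lt;/p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE payment_historic (payment_id INTEGER, amount REAL);
    CREATE TABLE payment_current  (payment_id INTEGER, amount REAL);
    INSERT INTO payment_historic VALUES (1, 2.99), (2, 0.99);
    INSERT INTO payment_current  VALUES (3, 4.99);

    -- UNION ALL stitches the two pieces back into one logical table.
    CREATE VIEW payment_all AS
        SELECT payment_id, amount FROM payment_historic
        UNION ALL
        SELECT payment_id, amount FROM payment_current;
    """
)
rows = conn.execute(
    "SELECT payment_id, amount FROM payment_all ORDER BY payment_id"
).fetchall()
print(rows)  # all three rows, regardless of which table holds them
```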
&lt;h1 id="updatable-views">Updatable Views&lt;/h1>
&lt;p>In the case of MySQL, a view is updatable if the following conditions are met:&lt;/p>
&lt;ul>
&lt;li>No aggregate functions are used (max(), min(), avg(), etc.).&lt;/li>
&lt;li>The view does not employ group by or having clauses.&lt;/li>
&lt;li>No subqueries exist in the select or from clause, and any subqueries in the where clause do not refer to tables in the from clause.&lt;/li>
&lt;li>The view does not utilize union, union all, or distinct.&lt;/li>
&lt;li>The from clause includes at least one table or updatable view.&lt;/li>
&lt;li>The from clause uses only inner joins if there is more than one table or view.&lt;/li>
&lt;/ul>
&lt;h2 id="updating-simple-views">Updating Simple Views&lt;/h2>
&lt;pre>&lt;code class="language-sql">UPDATE customer_vw
SET last_name = 'SMITH-ALLEN'
WHERE customer_id = 1;
&lt;/code>&lt;/pre>
&lt;p>No &lt;code>insert&lt;/code> is allowed for views that contain derived columns, even if the derived columns are not included in the statement, and you cannot modify columns that are derived from an expression.&lt;/p>
&lt;h2 id="updating-complex-views">Updating Complex Views&lt;/h2>
&lt;p>For complex views built on more than one table, you are allowed to modify each of the underlying tables separately, but not within a single statement. In order to insert data through a complex view, you would need to know where each column is sourced from. Since many views are created to hide complexity from end users, this seems to defeat the purpose if the users need explicit knowledge of the view definition.&lt;/p>
&lt;li>
&lt;a href="#indexes">Indexes&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#index-creation">Index Creation&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#unique-indexes">Unique indexes&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#multicolumn-indexes">Multicolumn indexes&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#types-of-indexes">Types of Indexes&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#b-tree-indexes">B-tree indexes&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#bitmap-indexes">Bitmap indexes&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#text-indexes">Text indexes&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#how-indexes-are-used">How Indexes Are Used&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#the-downside-of-indexes">The Downside of Indexes&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#constraints">Constraints&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#constraint-creation">Constraint Creation&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h1 id="indexes">Indexes&lt;/h1>
&lt;p>When a row is inserted, the server simply places the data in the next available location within the file (the server maintains a list of free space for each table).&lt;/p>
&lt;p>To find all customers whose last name begins with Y, the server must visit each row in the customer table and inspect the contents of the last_name column; if the last name begins with Y, then the row is added to the result set. This type of access is known as a &lt;em>table&lt;/em> &lt;em>scan&lt;/em>.&lt;/p>
&lt;p>An index is simply a mechanism for finding a specific item within a resource. A database server uses indexes to locate rows in a table. Indexes are special tables that, unlike normal data tables, &lt;em>are&lt;/em> kept in a specific order. Instead of containing &lt;em>all&lt;/em> of the data about an entity, however, an index contains only the column (or columns) used to locate rows in the data table, along with information describing where the rows are physically located. Therefore, the role of indexes is to facilitate the retrieval of a subset of a table’s rows and columns &lt;em>without&lt;/em> the need to inspect every row in the table.&lt;/p>
&lt;h2 id="index-creation">Index Creation&lt;/h2>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
ALTER TABLE customer
ADD INDEX idx_email (email);
/*OR*/
ALTER TABLE customer
DROP INDEX idx_email;
/*SQL Server*/
CREATE INDEX idx_email
ON customer (email);
SHOW INDEX FROM customer \G;
&lt;/code>&lt;/pre>
&lt;p>Indexes can also be defined when the table is created:&lt;/p>
&lt;pre>
CREATE TABLE customer (
customer_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
...
&lt;b>PRIMARY KEY (customer_id),
KEY idx_fk_store_id (store_id),
KEY idx_fk_address_id (address_id),
KEY idx_last_name (last_name),&lt;/b>
...
&lt;/pre>
&lt;h3 id="unique-indexes">Unique indexes&lt;/h3>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
ALTER TABLE customer
ADD UNIQUE INDEX idx_email (email);
/*SQL Server/Oracle Database*/
CREATE UNIQUE INDEX idx_email
ON customer (email);
&lt;/code>&lt;/pre>
&lt;p>You should not build unique indexes on your primary key column(s), since the server already checks uniqueness for primary key values.&lt;/p>
&lt;h3 id="multicolumn-indexes">Multicolumn indexes&lt;/h3>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
ALTER TABLE customer
ADD INDEX idx_full_name (last_name, first_name);
/*SQL Server/Oracle Database*/
CREATE UNIQUE INDEX idx_email
ON customer (email);
&lt;/code>&lt;/pre>
&lt;h2 id="types-of-indexes">Types of Indexes&lt;/h2>
&lt;h3 id="b-tree-indexes">B-tree indexes&lt;/h3>
&lt;p>All the indexes shown thus far are &lt;em>balanced-tree indexes&lt;/em>, which are more commonly known as &lt;em>B-tree indexes&lt;/em>. MySQL, Oracle Database, and SQL Server all default to B-tree indexing.&lt;/p>
&lt;ul>
&lt;li>B-tree indexes are organized as trees, with one or more levels of branch nodes leading to a single level of leaf nodes.&lt;/li>
&lt;li>To find a row, the server starts at the top branch node (called the root node) and follows links down through the branch nodes to the leaf node containing the value and its row location.&lt;/li>
&lt;li>The server can add or remove branch nodes to redistribute the values more evenly and can even add or remove an entire level of branch nodes.&lt;/li>
&lt;/ul>
&lt;h3 id="bitmap-indexes">Bitmap indexes&lt;/h3>
&lt;p>Consider a column such as customer.active: if it holds only two different values (stored as 1 for active and 0 for inactive) and there are far more active customers, it can be difficult to maintain a balanced B-tree index as the number of customers grows.&lt;/p>
&lt;p>For columns that contain only a small number of values across a large number of rows (known as &lt;em>low-cardinality data&lt;/em>), Oracle Database includes bitmap indexes, which generate a bitmap for each value stored in the column.&lt;/p>
&lt;pre>&lt;code class="language-sql">/*Oracle Database*/
CREATE BITMAP INDEX idx_active ON customer (active);
&lt;/code>&lt;/pre>
&lt;p>Bitmap indexes are commonly used in data warehousing environments, where large amounts of data are generally indexed on columns containing relatively few values (e.g., sales quarters, geographic regions, products, salespeople).&lt;/p>
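&lt;p>The idea behind a bitmap index can be sketched in a few lines of Python: one list of 0/1 flags per distinct value, with a 1 wherever a row holds that value. This is a toy illustration of the concept only, not Oracle Database&amp;rsquo;s implementation.&lt;/p>

```python
# Toy sketch of a bitmap index: one bitmap (list of 0/1 flags) per distinct value.
# Illustration of the idea only, not Oracle Database's implementation.

def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to a bitmap with one bit per row."""
    index = {}
    for i, row in enumerate(rows):
        for bitmap in index.values():   # extend every bitmap to cover this row
            bitmap.append(0)
        value = row[column]
        if value not in index:
            index[value] = [0] * (i + 1)
        index[value][i] = 1
    return index

def rows_matching(index, value):
    """Positions of the rows whose bit is set for `value`."""
    return [i for i, bit in enumerate(index.get(value, [])) if bit]
```

&lt;p>For a column like customer.active, only two bitmaps exist no matter how many rows there are, which is why low-cardinality columns suit this structure.&lt;/p>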
&lt;h3 id="text-indexes">Text indexes&lt;/h3>
&lt;h2 id="how-indexes-are-used">How Indexes Are Used&lt;/h2>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
EXPLAIN
SELECT customer_id, first_name, last_name
FROM customer
WHERE first_name LIKE 'S%' AND last_name LIKE 'P%';
/*SQL Server*/
SET SHOWPLAN_TEXT ON
/*Oracle Database: prefix the query with*/
EXPLAIN PLAN FOR
&lt;/code>&lt;/pre>
&lt;p>For this query, the server can employ any of the following strategies:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Scan all rows in the customer table.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the index on the last_name column to find all customers whose last name starts with P; then visit each row of the customer table to find only rows whose first name starts with S.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the index on the last_name and first_name columns to find all customers whose last name starts with P and whose first name starts with S.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Looking at the query results, the &lt;code>possible_keys&lt;/code> column tells you that the server could decide to use either the &lt;code>idx_last_name&lt;/code> or the &lt;code>idx_full_name&lt;/code> index, and the key column tells you that the &lt;code>idx_full_name&lt;/code> index was chosen. Furthermore, the &lt;code>type&lt;/code> column tells you that a range scan will be utilized, meaning that the database server will be looking for a range of values in the index, rather than expecting to retrieve a single row.&lt;/p>
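&lt;p>The same kind of experiment can be run locally with Python&amp;rsquo;s sqlite3 module, where &lt;code>EXPLAIN QUERY PLAN&lt;/code> plays the role of MySQL&amp;rsquo;s &lt;code>EXPLAIN&lt;/code>. The table and data below are made up for the demo, and an equality search on the leading index column is used because SQLite applies its &lt;code>LIKE&lt;/code> optimization only under extra configuration.&lt;/p>

```python
import sqlite3

# SQLite stand-in for MySQL's EXPLAIN: EXPLAIN QUERY PLAN reports whether a
# query uses an index. Table, index, and rows are made up for this demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, "
             "first_name TEXT, last_name TEXT)")
conn.execute("CREATE INDEX idx_full_name ON customer (last_name, first_name)")
conn.executemany("INSERT INTO customer (first_name, last_name) VALUES (?, ?)",
                 [("Susan", "Park"), ("Sam", "Price"), ("Ann", "Quinn")])

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT first_name FROM customer WHERE last_name = 'Park'"
).fetchall()
# The last (detail) column of a plan row names the index chosen for the search.
```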
&lt;h2 id="the-downside-of-indexes">The Downside of Indexes&lt;/h2>
&lt;p>&lt;strong>Every index is a table&lt;/strong> (a special type of table but still a table). Therefore, every time a row is added to or removed from a table, all indexes on that table must be modified. When a row is updated, any indexes on the column or columns that were affected need to be modified as well. Therefore, the more indexes you have, the more work the server needs to do to keep all schema objects up-to-date, which tends to slow things down.&lt;/p>
&lt;p>Indexes also require &lt;strong>disk space&lt;/strong> as well as some amount of care from your administrators, so the best strategy is to add an index when a clear need arises. If you need an index for only special purposes, such as a monthly maintenance routine, you can always add the index, run the routine, and then drop the index until you need it again. In the case of data warehouses, where indexes are crucial during business hours as users run reports and ad hoc queries but are problematic when data is being loaded into the warehouse overnight, it is a common practice to drop the indexes before data is loaded and then re-create them before the warehouse opens for business.&lt;/p>
&lt;p>In general, you should strive to have neither too many indexes nor too few. If you aren’t sure how many indexes you should have, you can use this strategy as a default:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Make sure all primary key columns are indexed (most servers automatically create unique indexes when you create primary key constraints). For multicolumn primary keys, consider building additional indexes on a subset of the primary key columns or on all the primary key columns but in a different order than the primary key constraint definition.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Build indexes on all columns that are referenced in foreign key constraints. Keep in mind that the server checks to make sure there are no child rows when a parent is deleted, so it must issue a query to search for a particular value in the column. If there’s no index on the column, the entire table must be scanned.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Index any columns that will frequently be used to retrieve data. Most date columns are good candidates, along with short (2- to 50-character) string columns.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h1 id="constraints">Constraints&lt;/h1>
&lt;p>A constraint is simply a restriction placed on one or more columns of a table. There are several different types of constraints, including:&lt;/p>
&lt;p>&lt;em>Primary key constraints&lt;/em>
Identify the column or columns that guarantee uniqueness within a table&lt;/p>
&lt;p>&lt;em>Foreign key constraints&lt;/em>
Restrict one or more columns to contain only values found in another table’s primary key columns (may also restrict the allowable values in other tables if update cascade or delete cascade rules are established)&lt;/p>
&lt;p>&lt;em>Unique constraints&lt;/em>
Restrict one or more columns to contain unique values within a table (primary key constraints are a special type of unique constraint)&lt;/p>
&lt;p>&lt;em>Check constraints&lt;/em>
Restrict the allowable values for a column&lt;/p>
&lt;p>If the server allows you to change a customer’s ID in the customer table without changing the same customer ID in the rental table, then you will end up with rental data that no longer points to valid customer records (known as &lt;em>orphaned rows&lt;/em>). With primary and foreign key constraints in place, however, the server will either raise an error if an attempt is made to modify or delete data that is referenced by other tables or propagate the changes to other tables for you.&lt;/p>
&lt;p>Note: If you want to use foreign key constraints with the MySQL server, you must use the &lt;em>InnoDB&lt;/em> storage engine for your tables.&lt;/p>
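&lt;p>This protection against orphaned rows can be demonstrated with Python&amp;rsquo;s sqlite3 module. SQLite enforces foreign keys only after &lt;code>PRAGMA foreign_keys = ON&lt;/code>, and the tables here are pared-down stand-ins for the customer and rental tables:&lt;/p>

```python
import sqlite3

# SQLite demonstration of foreign key enforcement (analogous to InnoDB's
# behavior). isolation_level=None (autocommit) keeps the PRAGMA effective.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, "
             "customer_id INTEGER REFERENCES customer (customer_id))")
conn.execute("INSERT INTO customer (customer_id) VALUES (1)")
conn.execute("INSERT INTO rental (customer_id) VALUES (1)")

# Deleting a customer that still has rentals would create orphaned rows,
# so the server raises an error instead of performing the delete.
try:
    conn.execute("DELETE FROM customer WHERE customer_id = 1")
    deleted = True
except sqlite3.IntegrityError:
    deleted = False
remaining = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```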
&lt;h3 id="constraint-creation">Constraint Creation&lt;/h3>
&lt;pre>
CREATE TABLE customer (
...
&lt;b>PRIMARY KEY (customer_id), &lt;/b>
KEY idx_fk_store_id (store_id),
KEY idx_fk_address_id (address_id),
KEY idx_last_name (last_name),
&lt;b>CONSTRAINT fk_customer_address FOREIGN KEY (address_id) REFERENCES address (address_id) ON DELETE RESTRICT ON UPDATE CASCADE,
CONSTRAINT fk_customer_store FOREIGN KEY (store_id) REFERENCES store (store_id) ON DELETE RESTRICT ON UPDATE CASCADE&lt;/b>
)ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*For existing tables, you can do:*/
ALTER TABLE customer
&lt;b>ADD CONSTRAINT&lt;/b> fk_customer_address FOREIGN KEY (address_id)
REFERENCES address (address_id) ON DELETE RESTRICT ON UPDATE CASCADE;
ALTER TABLE customer
&lt;b>ADD CONSTRAINT&lt;/b> fk_customer_store FOREIGN KEY (store_id)
REFERENCES store (store_id) ON DELETE RESTRICT ON UPDATE CASCADE;
/*if you want to drop them*/
ALTER TABLE customer
&lt;b>DROP CONSTRAINT&lt;/b> fk_customer_address;
ALTER TABLE customer
&lt;b>DROP CONSTRAINT&lt;/b> fk_customer_store;
&lt;/pre>
&lt;ul>
&lt;li>on delete restrict, which will cause the server to raise an error if a row is deleted in the parent table (address or store) that is referenced in the child table (customer)&lt;/li>
&lt;li>on update cascade, which will cause the server to propagate a change to the primary key value of a parent table (address or store) to the child table (customer)&lt;/li>
&lt;/ul>
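&lt;p>Both behaviors can be observed in a small sqlite3 session. SQLite accepts the same &lt;code>ON DELETE RESTRICT&lt;/code>/&lt;code>ON UPDATE CASCADE&lt;/code> clauses; the address and customer tables are simplified stand-ins:&lt;/p>

```python
import sqlite3

# SQLite sketch of the two foreign key actions described above.
# isolation_level=None (autocommit) keeps the PRAGMA effective.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE address (address_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, "
             "address_id INTEGER REFERENCES address (address_id) "
             "ON DELETE RESTRICT ON UPDATE CASCADE)")
conn.execute("INSERT INTO address (address_id) VALUES (10)")
conn.execute("INSERT INTO customer (customer_id, address_id) VALUES (1, 10)")

# ON UPDATE CASCADE: changing the parent key propagates to the child row.
conn.execute("UPDATE address SET address_id = 20 WHERE address_id = 10")
child = conn.execute("SELECT address_id FROM customer").fetchone()[0]

# ON DELETE RESTRICT: deleting a still-referenced parent raises an error.
try:
    conn.execute("DELETE FROM address WHERE address_id = 20")
    restricted = False
except sqlite3.IntegrityError:
    restricted = True
```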
&lt;table>&lt;thead>
&lt;tr>
&lt;th>Parameter&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>&lt;tbody>
&lt;tr>
&lt;td>&lt;code>ON DELETE NO ACTION&lt;/code>&lt;/td>
&lt;td>&lt;em>Default action.&lt;/em> If there are any existing references to the key being deleted, the transaction will fail at the end of the statement. The key can be updated, depending on the &lt;code>ON UPDATE&lt;/code> action. &lt;br>&lt;br>Alias: &lt;code>ON DELETE RESTRICT&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ON UPDATE NO ACTION&lt;/code>&lt;/td>
&lt;td>&lt;em>Default action.&lt;/em> If there are any existing references to the key being updated, the transaction will fail at the end of the statement. The key can be deleted, depending on the &lt;code>ON DELETE&lt;/code> action. &lt;br>&lt;br>Alias: &lt;code>ON UPDATE RESTRICT&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ON DELETE RESTRICT&lt;/code> / &lt;code>ON UPDATE RESTRICT&lt;/code>&lt;/td>
&lt;td>&lt;code>RESTRICT&lt;/code> and &lt;code>NO ACTION&lt;/code> are currently equivalent until options for deferring constraint checking are added. To set an existing foreign key action to &lt;code>RESTRICT&lt;/code>, the foreign key constraint must be dropped and recreated.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ON DELETE CASCADE&lt;/code> / &lt;code>ON UPDATE CASCADE&lt;/code>&lt;/td>
&lt;td>When a referenced foreign key is deleted or updated, all rows referencing that key are deleted or updated, respectively. If there are other alterations to the row, such as a &lt;code>SET NULL&lt;/code> or &lt;code>SET DEFAULT&lt;/code>, the delete will take precedence. &lt;br>&lt;br>Note that &lt;code>CASCADE&lt;/code> does not list objects it drops or updates, so it should be used cautiously.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ON DELETE SET NULL&lt;/code> / &lt;code>ON UPDATE SET NULL&lt;/code>&lt;/td>
&lt;td>When a referenced foreign key is deleted or updated, respectively, the columns of all rows referencing that key will be set to &lt;code>NULL&lt;/code>. The column must allow &lt;code>NULL&lt;/code> or this update will fail.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ON DELETE SET DEFAULT&lt;/code> / &lt;code>ON UPDATE SET DEFAULT&lt;/code>&lt;/td>
&lt;td>When a referenced foreign key is deleted or updated, the columns of all rows referencing that key are set to the default value for that column. &lt;br/>&lt;br/> If the default value for the column is null, or if no default value is provided and the column does not have a &lt;a href='https://siqi-zheng.rbind.io/docs/v21.1/not-null'>&lt;code>NOT NULL&lt;/code>&lt;/a> constraint, this will have the same effect as &lt;code>ON DELETE SET NULL&lt;/code> or &lt;code>ON UPDATE SET NULL&lt;/code>. The default value must still conform with all other constraints, such as &lt;code>UNIQUE&lt;/code>.&lt;/td>
&lt;/tr>
&lt;/tbody>&lt;/table></description></item><item><title>Learning SQL Notes #10: Transactions</title><link>https://siqi-zheng.rbind.io/post/2021-06-08-sql-notes-10/</link><pubDate>Tue, 08 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-08-sql-notes-10/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#multiuser-databases">Multiuser Databases&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#locking">Locking&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#lock-granularities">Lock Granularities&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#what-is-a-transaction">What Is a Transaction?&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#starting-a-transaction">Starting a Transaction&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#ending-a-transaction">Ending a Transaction&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#transaction-savepoints">Transaction Savepoints&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#choosing-a-storage-engine">Choosing a Storage Engine&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Transactions: Mechanism used to group a set of SQL statements together such that either all or none of the statements succeed.&lt;/p>
&lt;h1 id="multiuser-databases">Multiuser Databases&lt;/h1>
&lt;h2 id="locking">Locking&lt;/h2>
&lt;p>Locks are the mechanism the database server uses to &lt;strong>control simultaneous use&lt;/strong> of data resources. When some portion of the database is locked, any other users wishing to modify (or possibly read) that data must wait until the lock has been released. Most database servers use one of two locking strategies:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Database writers must request and receive from the server a write lock to modify data, and database readers must request and receive from the server a read lock to query data. While multiple users can read data simultaneously, only one write lock is given out at a time for each table (or portion thereof), and read requests are blocked until the write lock is released. $\Rightarrow$ long wait times if there are many concurrent read and write requests. (Microsoft SQL Server/MySQL)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Database writers must request and receive from the server a write lock to modify data, but readers do not need any type of lock to query data. Instead, the server ensures that a reader sees a consistent view of the data (the data seems the same even though other users may be making modifications) from the time her query begins until her query has finished. This approach is known as &lt;em>versioning&lt;/em>. $\Rightarrow$ problematic if there are long-running queries while data is being modified. (Oracle Database/MySQL)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="lock-granularities">Lock Granularities&lt;/h2>
&lt;p>&lt;em>Table locks&lt;/em> $\Rightarrow$ less bookkeeping, longer waiting time
Keep multiple users from modifying data in the same table simultaneously&lt;/p>
&lt;p>&lt;em>Page locks&lt;/em>
Keep multiple users from modifying data on the same page (a page is a segment of memory generally in the range of 2 KB to 16 KB) of a table simultaneously&lt;/p>
&lt;p>&lt;em>Row locks&lt;/em> $\Rightarrow$ More bookkeeping, shorter waiting time
Keep multiple users from modifying the same row in a table simultaneously&lt;/p>
&lt;p>SQL Server will, under certain circumstances, &lt;em>escalate&lt;/em> locks from row to page, and from page to table, whereas Oracle Database will never escalate locks.&lt;/p>
&lt;h1 id="what-is-a-transaction">What Is a Transaction?&lt;/h1>
&lt;p>Problems occur when one of the ideal situations fails:&lt;/p>
&lt;ul>
&lt;li>Database servers do not enjoy 100% uptime&lt;/li>
&lt;li>Users do not always allow programs to finish executing&lt;/li>
&lt;li>Applications do not always complete without encountering fatal errors that halt execution&lt;/li>
&lt;/ul>
&lt;p>&lt;em>Transaction&lt;/em> is a device for grouping together multiple SQL statements such that either all or none of the statements succeed (a property known as atomicity).&lt;/p>
&lt;p>Ex:&lt;/p>
&lt;p>If you attempt to transfer $500 from your savings account to your checking account, you would be a bit upset if the money were successfully withdrawn from your savings account but never made it to your checking account. Whatever the reason for the failure (the server was shut down for maintenance, the request for a page lock on the account table timed out, etc.), you want your $500 back. To protect against this kind of error, the program that handles your transfer request would first &lt;strong>begin a transaction&lt;/strong>, then issue the SQL statements needed to move the money from your savings to your checking account, and, &lt;strong>if everything succeeds&lt;/strong>, end the transaction by issuing the &lt;strong>commit&lt;/strong> command. If something &lt;strong>unexpected&lt;/strong> &lt;strong>happens&lt;/strong>, however, the program would issue a &lt;strong>rollback&lt;/strong> command, which instructs the server to undo all changes made since the transaction began.&lt;/p>
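&lt;p>The transfer scenario can be sketched with Python&amp;rsquo;s sqlite3 module. Passing &lt;code>isolation_level=None&lt;/code> disables the driver&amp;rsquo;s implicit transaction handling so that &lt;code>BEGIN&lt;/code>, &lt;code>COMMIT&lt;/code>, and &lt;code>ROLLBACK&lt;/code> are issued explicitly; account numbers and amounts are made up:&lt;/p>

```python
import sqlite3

# SQLite sketch of the transfer example: the two updates succeed together
# or not at all. isolation_level=None means we manage transactions ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 1000), (2, 0)])

# Attempt 1: something goes wrong after the withdrawal, so we roll back.
conn.execute("BEGIN")
conn.execute("UPDATE account SET balance = balance - 500 WHERE account_id = 1")
conn.execute("ROLLBACK")  # the withdrawal is undone; no money is lost
after_rollback = dict(conn.execute("SELECT account_id, balance FROM account"))

# Attempt 2: both statements succeed, so we commit.
conn.execute("BEGIN")
conn.execute("UPDATE account SET balance = balance - 500 WHERE account_id = 1")
conn.execute("UPDATE account SET balance = balance + 500 WHERE account_id = 2")
conn.execute("COMMIT")
after_commit = dict(conn.execute("SELECT account_id, balance FROM account"))
```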
&lt;h2 id="starting-a-transaction">Starting a Transaction&lt;/h2>
&lt;p>Database servers handle transaction creation in one of two ways:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>An active transaction is always associated with a database session, so there is no need or method to explicitly begin a transaction. When the current transaction ends, the server automatically begins a new transaction for your session. &lt;em>You can undo some changes.&lt;/em> (Oracle Database)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Unless you explicitly begin a transaction, individual SQL statements are automatically committed independently of one another. To begin a transaction, you must first issue a command. (Microsoft SQL Server/MySQL)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>The SQL:2003 standard includes a &lt;code>start transaction&lt;/code> command to be used when you want to explicitly begin a transaction. While MySQL conforms to the standard, SQL Server users must instead issue the command &lt;code>begin transaction&lt;/code>. With both servers, until you explicitly begin a transaction, you are in what is known as &lt;em>autocommit mode&lt;/em>, which means that individual statements are automatically committed by the server.&lt;/p>
&lt;p>A word of advice: shut off autocommit mode each time you log in, and get in the habit of running all of your SQL statements within a transaction.&lt;/p>
&lt;p>Both MySQL and SQL Server allow you to turn off autocommit mode for individual sessions, in which case the servers will act just like Oracle Database regarding transactions. With SQL Server, you issue the following command to disable autocommit mode:&lt;/p>
&lt;p>&lt;code>SET IMPLICIT_TRANSACTIONS ON&lt;/code>&lt;/p>
&lt;p>MySQL allows you to disable autocommit mode via the following:&lt;/p>
&lt;p>&lt;code>SET AUTOCOMMIT=0&lt;/code>&lt;/p>
&lt;p>Once you have left autocommit mode, all SQL commands take place within the scope of a transaction and must be explicitly committed or rolled back.&lt;/p>
&lt;h2 id="ending-a-transaction">Ending a Transaction&lt;/h2>
&lt;p>End with &lt;code>commit&lt;/code> if yes and &lt;code>rollback&lt;/code> if no.&lt;/p>
&lt;p>Some scenarios in practice:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The server shuts down, in which case your transaction will be rolled back automatically when the server is restarted. ✔&lt;/p>
&lt;/li>
&lt;li>
&lt;p>You issue an SQL schema statement, such as alter table, which will cause the current transaction to be committed and a new transaction to be started.&lt;/p>
&lt;ul>
&lt;li>be careful that the statements that comprise a unit of work are not inadvertently broken up into multiple transactions by the server!&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>You issue another start transaction command, which will cause the previous transaction to be committed. ✔&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The server prematurely ends your transaction because the server detects a deadlock and decides that your transaction is the culprit. In this case, the transaction will be rolled back, and you will receive an error message.&lt;/p>
&lt;ul>
&lt;li>Most of the time, the terminated transaction can be restarted and will succeed without encountering another deadlock situation.&lt;br>
&lt;code>Message: Deadlock found when trying to get lock; try restarting transaction&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="transaction-savepoints">Transaction Savepoints&lt;/h2>
&lt;p>You may not want to undo &lt;em>all&lt;/em> of the work that has transpired. For these situations, you can establish one or more &lt;em>savepoints&lt;/em>&lt;/p>
&lt;pre>&lt;code class="language-sql">SAVEPOINT my_savepoint;
&lt;/code>&lt;/pre>
&lt;p>within a transaction and use them to roll back to a particular location within your transaction&lt;/p>
&lt;pre>&lt;code class="language-sql">ROLLBACK TO SAVEPOINT my_savepoint;
&lt;/code>&lt;/pre>
&lt;p>rather than rolling all the way back to the start of the transaction.&lt;/p>
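&lt;p>A runnable sketch, again using Python&amp;rsquo;s sqlite3 module (SQLite supports the same &lt;code>SAVEPOINT&lt;/code> and &lt;code>ROLLBACK TO SAVEPOINT&lt;/code> commands; the product and account tables are simplified stand-ins):&lt;/p>

```python
import sqlite3

# SQLite sketch of a savepoint: part of a transaction is rolled back
# while the rest is committed. isolation_level=None = explicit transactions.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE product (product_cd TEXT, date_retired TEXT)")
conn.execute("CREATE TABLE account (product_cd TEXT, status TEXT)")
conn.execute("INSERT INTO product VALUES ('XYZ', NULL)")
conn.execute("INSERT INTO account VALUES ('XYZ', 'ACTIVE')")

conn.execute("BEGIN")
conn.execute("UPDATE product SET date_retired = '2021-06-08' "
             "WHERE product_cd = 'XYZ'")
conn.execute("SAVEPOINT before_close_accounts")
conn.execute("UPDATE account SET status = 'CLOSED' WHERE product_cd = 'XYZ'")
conn.execute("ROLLBACK TO SAVEPOINT before_close_accounts")  # undo only this update
conn.execute("COMMIT")

retired = conn.execute("SELECT date_retired FROM product").fetchone()[0]
status = conn.execute("SELECT status FROM account").fetchone()[0]
# net effect: the product is retired, but the account stays open
```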
&lt;h3 id="choosing-a-storage-engine">Choosing a Storage Engine&lt;/h3>
&lt;p>When using Oracle Database or Microsoft SQL Server, a single set of code is responsible for low-level database operations, such as retrieving a particular row from a table based on primary key value. The MySQL server, however, has been designed so that multiple storage engines may be utilized to provide low-level database functionality, including resource locking and transaction management. As of version 8.0, MySQL includes the following storage engines:&lt;/p>
&lt;p>&lt;em>MyISAM&lt;/em>
A nontransactional engine employing table locking&lt;/p>
&lt;p>&lt;em>MEMORY&lt;/em>
A nontransactional engine used for in-memory tables&lt;/p>
&lt;p>&lt;em>CSV&lt;/em>
A nontransactional engine that stores data in comma-separated files&lt;/p>
&lt;p>&lt;em>InnoDB&lt;/em>
A transactional engine employing row-level locking&lt;/p>
&lt;p>&lt;em>Merge&lt;/em>
A specialty engine used to make multiple identical &lt;em>MyISAM&lt;/em> tables appear as a single table (a.k.a. table partitioning)&lt;/p>
&lt;p>&lt;em>Archive&lt;/em>
A specialty engine used to store large amounts of unindexed data, mainly for archival purposes&lt;/p>
&lt;p>MySQL is flexible enough to allow you to choose a storage engine on a table-by-table basis.&lt;/p>
&lt;p>You may explicitly specify a storage engine when creating a table, or you can change an existing table to use a different engine.&lt;/p>
&lt;pre>&lt;code class="language-sql">show table status like 'customer' \G;
/*Second row: Engine: InnoDB*/
ALTER TABLE customer ENGINE = INNODB;
&lt;/code>&lt;/pre>
&lt;p>One example is shown below:&lt;/p>
&lt;pre>&lt;code class="language-sql">START TRANSACTION;
UPDATE product
SET date_retired = CURRENT_TIMESTAMP()
WHERE product_cd = 'XYZ';
SAVEPOINT before_close_accounts;
UPDATE account
SET status = 'CLOSED', close_date = CURRENT_TIMESTAMP(), last_activity_date = CURRENT_TIMESTAMP()
WHERE product_cd = 'XYZ';
ROLLBACK TO SAVEPOINT before_close_accounts;
COMMIT;
/*The net effect of this transaction is that the mythical XYZ product is retired but none of the accounts are closed.*/
&lt;/code>&lt;/pre>
&lt;p>When using savepoints, remember the following:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Despite the name, nothing is saved when you create a savepoint. You must eventually issue a commit if you want your transaction to be made permanent.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If you issue a rollback without naming a savepoint, all savepoints within the transaction will be ignored, and the entire transaction will be undone.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>If you are using &lt;em>SQL Server&lt;/em>, you will need to use the proprietary command &lt;code>save transaction&lt;/code> to create a savepoint and &lt;code>rollback transaction&lt;/code> to roll back to a savepoint, with each command being followed by the savepoint name.&lt;/p></description></item><item><title>Learning SQL Notes #9: Conditional Logic</title><link>https://siqi-zheng.rbind.io/post/2021-06-07-sql-notes-9/</link><pubDate>Mon, 07 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-07-sql-notes-9/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#what-is-conditional-logic">What Is Conditional Logic?&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#the-case-expression">The case Expression&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#searched-case-expressions">Searched case Expressions&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#simple-case-expressions-a-less-flexible-ver-of-the-previous-expression">Simple case Expressions (A less flexible ver. of the previous expression)&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#examples-of-case-expressions">Examples of case Expressions&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#result-set-transformations">Result Set Transformations&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#checking-for-existence">Checking for Existence&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#avoid-division-by-zero-errors">(Avoid) Division-by-Zero Errors&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#conditional-updates">Conditional Updates&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#handling-null-values">Handling Null Values&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h1 id="what-is-conditional-logic">What Is Conditional Logic?&lt;/h1>
&lt;p>Conditional logic is simply the ability to take one of several paths during program execution.&lt;/p>
&lt;p>Analogous to if-else in Python and R.&lt;/p>
&lt;PRE>
SELECT first_name, last_name,
&lt;B>CASE&lt;/B>
WHEN active = 1 THEN 'ACTIVE'
ELSE 'INACTIVE'
&lt;B>END&lt;/B> activity_type
FROM customer;
&lt;/PRE>
&lt;h2 id="the-case-expression">The case Expression&lt;/h2>
&lt;ul>
&lt;li>The case expression is part of the SQL standard (SQL92 release) and has been implemented by Oracle Database, SQL Server, MySQL, PostgreSQL, IBM UDB, and others.&lt;/li>
&lt;li>case expressions are built into the SQL grammar and can be included in select, insert, update, and delete statements.&lt;/li>
&lt;/ul>
&lt;h3 id="searched-case-expressions">Searched case Expressions&lt;/h3>
&lt;pre>&lt;code class="language-sql">CASE
WHEN category.name IN ('Children','Family','Sports','Animation')
THEN 'All Ages'
WHEN category.name = 'Horror'
THEN 'Adult'
WHEN category.name IN ('Music','Games')
THEN 'Teens'
ELSE 'Other'
END
&lt;/code>&lt;/pre>
&lt;PRE>
SELECT c.first_name, c.last_name,
CASE
WHEN active = 0 THEN 0
&lt;B>ELSE
(SELECT count(*) FROM rental r
WHERE r.customer_id = c.customer_id)&lt;/B>
END num_rentals /*Create new variables*/
FROM customer c;
&lt;/PRE>
&lt;h3 id="simple-case-expressions-a-less-flexible-ver-of-the-previous-expression">Simple case Expressions (A less flexible ver. of the previous expression)&lt;/h3>
&lt;PRE>
CASE &lt;B>V0&lt;/B>
WHEN V1 THEN E1
WHEN V2 THEN E2 ...
WHEN VN THEN EN
[ELSE ED]
END
&lt;/PRE>
&lt;p>V0 represents a value, and the symbols V1, V2, &amp;hellip;, VN represent values that are to be compared to V0.&lt;/p>
&lt;h2 id="examples-of-case-expressions">Examples of case Expressions&lt;/h2>
&lt;h3 id="result-set-transformations">Result Set Transformations&lt;/h3>
&lt;pre>&lt;code class="language-sql">SELECT monthname(rental_date) rental_month,
count(*) num_rentals
FROM rental
WHERE rental_date BETWEEN '2005-05-01' AND '2005-08-01'
GROUP BY monthname(rental_date);
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">rental_month&lt;/th>
&lt;th align="right">num_rentals&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">May&lt;/td>
&lt;td align="right">1156&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">June&lt;/td>
&lt;td align="right">2311&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">July&lt;/td>
&lt;td align="right">6709&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-sql">SELECT
SUM(CASE WHEN monthname(rental_date) = 'May' THEN 1
ELSE 0 END) May_rentals,
SUM(CASE WHEN monthname(rental_date) = 'June' THEN 1
ELSE 0 END) June_rentals,
SUM(CASE WHEN monthname(rental_date) = 'July' THEN 1
ELSE 0 END) July_rentals
FROM rental
WHERE rental_date BETWEEN '2005-05-01' AND '2005-08-01';
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">May_rentals&lt;/th>
&lt;th>June_rentals&lt;/th>
&lt;th align="right">July_rentals&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">1156&lt;/td>
&lt;td>2311&lt;/td>
&lt;td align="right">6709&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>When the monthname() function returns the desired value for that column, the case expression returns the value 1; otherwise, it returns 0. When summed over all rows, each column returns the number of rentals for that month. Obviously, such transformations are practical for only a small number of values.&lt;/p>
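&lt;p>A runnable version of the pivot, using Python&amp;rsquo;s sqlite3 module. SQLite has no &lt;code>monthname()&lt;/code>, so &lt;code>strftime('%m', ...)&lt;/code> stands in, and the rental dates are made up:&lt;/p>

```python
import sqlite3

# SQLite version of the pivot: one CASE expression per month turns rows
# into columns. strftime('%m', ...) substitutes for MySQL's monthname().
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rental (rental_date TEXT)")
conn.executemany("INSERT INTO rental VALUES (?)",
                 [("2005-05-14",), ("2005-06-02",),
                  ("2005-06-20",), ("2005-07-09",)])

row = conn.execute("""
    SELECT
      SUM(CASE WHEN strftime('%m', rental_date) = '05' THEN 1 ELSE 0 END) May_rentals,
      SUM(CASE WHEN strftime('%m', rental_date) = '06' THEN 1 ELSE 0 END) June_rentals,
      SUM(CASE WHEN strftime('%m', rental_date) = '07' THEN 1 ELSE 0 END) July_rentals
    FROM rental
    WHERE rental_date BETWEEN '2005-05-01' AND '2005-08-01'
""").fetchone()
# one row with one count per month
```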
&lt;h3 id="checking-for-existence">Checking for Existence&lt;/h3>
&lt;p>Sometimes you will want to determine whether a relationship exists between two entities &lt;strong>without regard for the quantity&lt;/strong>.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT a.first_name, a.last_name,
CASE
WHEN EXISTS (SELECT 1 FROM film_actor fa
INNER JOIN film f ON fa.film_id = f.film_id
WHERE fa.actor_id = a.actor_id
AND f.rating = 'G') THEN 'Y'
ELSE 'N'
END g_actor
FROM actor a
WHERE a.last_name LIKE 'S%' OR a.first_name LIKE 'S%';
&lt;/code>&lt;/pre>
&lt;h3 id="avoid-division-by-zero-errors">(Avoid) Division-by-Zero Errors&lt;/h3>
&lt;pre>&lt;code class="language-sql">...
sum(p.amount) /
CASE WHEN count(p.amount) = 0 THEN 1
ELSE count(p.amount)
END avg_payment
...
&lt;/code>&lt;/pre>
&lt;h3 id="conditional-updates">Conditional Updates&lt;/h3>
&lt;pre>&lt;code class="language-sql">UPDATE customer
SET active =
CASE
WHEN 90 &amp;lt;= (SELECT datediff(now(), max(rental_date))
FROM rental r
WHERE r.customer_id = customer.customer_id)
THEN 0
ELSE 1
END
WHERE active = 1;
/*if the number returned by the subquery is 90 or higher, the customer is marked as inactive.*/
&lt;/code>&lt;/pre>
&lt;h3 id="handling-null-values">Handling Null Values&lt;/h3>
&lt;pre>&lt;code class="language-sql">...
CASE
WHEN a.address IS NULL THEN 'Unknown'
ELSE a.address
END address,
...
&lt;/code>&lt;/pre>
&lt;p>Note: For calculations, null values often cause a null result. When performing calculations, case expressions are useful for translating a null value into a number (usually 0 or 1) that will allow the calculation to yield a non-null value.&lt;/p></description></item><item><title>Learning SQL Notes #8: Subqueries</title><link>https://siqi-zheng.rbind.io/post/2021-06-06-sql-notes-8/</link><pubDate>Sun, 06 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-06-sql-notes-8/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#what-is-a-subquery">What Is a Subquery?&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#subquery-types">Subquery Types&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#noncorrelated-subqueries">Noncorrelated Subqueries&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#multiple-row-single-column-subqueries">Multiple-Row, Single-Column Subqueries&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#the-in-and-not-in-operators">The in and not in operators&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#the-all-operator">The all operator&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#the-any-operator-or">The any operator (OR)&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#multicolumn-subqueries">Multicolumn Subqueries&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#correlated-subqueries">Correlated Subqueries&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#the-exists-operator">The exists Operator&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#data-manipulation-using-correlated-subqueries">Data Manipulation Using Correlated Subqueries&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#when-to-use-subqueries">When to Use Subqueries&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#subqueries-as-data-sources">Subqueries as Data Sources&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#data-fabrication">Data fabrication&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#task-oriented-subqueries">Task-oriented subqueries&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#common-table-expressions">Common table expressions&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#subqueries-as-expression-generators">Subqueries as Expression Generators&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#subquery-wrap-up">Subquery Wrap-Up&lt;/a>&lt;/li>
&lt;/ul>
&lt;h1 id="what-is-a-subquery">What Is a Subquery?&lt;/h1>
&lt;p>A &lt;em>subquery&lt;/em> is a query contained within another SQL statement (which I refer to as the containing statement for the rest of this discussion). A subquery is always enclosed within parentheses, and it is usually executed prior to the containing statement. Like any query, a subquery returns a result set that may consist of:&lt;/p>
&lt;ul>
&lt;li>A single row with a single column&lt;/li>
&lt;li>Multiple rows with a single column&lt;/li>
&lt;li>Multiple rows having multiple columns&lt;/li>
&lt;/ul>
&lt;pre>
SELECT customer_id, first_name, last_name
FROM customer
WHERE customer_id = &lt;b>(SELECT MAX(customer_id) FROM customer);&lt;/b>
&lt;/pre>
&lt;h1 id="subquery-types">Subquery Types&lt;/h1>
&lt;h2 id="noncorrelated-subqueries">Noncorrelated Subqueries&lt;/h2>
&lt;h3 id="multiple-row-single-column-subqueries">Multiple-Row, Single-Column Subqueries&lt;/h3>
&lt;h4 id="the-in-and-not-in-operators">The in and not in operators&lt;/h4>
&lt;pre>
SELECT city_id, city
FROM city
WHERE country_id &lt;> &lt;b>(SELECT country_id FROM country WHERE country = 'India');&lt;/b>
&lt;/pre>
&lt;p>Note: A subquery compared with an equality or inequality operator (&lt;code>=&lt;/code>, &lt;code>&amp;lt;&amp;gt;&lt;/code>) in a &lt;code>WHERE&lt;/code> clause must return no more than one row; otherwise the statement fails.&lt;/p>
&lt;p>If the subquery may return multiple rows, you can instead use subqueries such as:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT country_id
FROM country
WHERE country IN ('Canada','Mexico');
&lt;/code>&lt;/pre>
&lt;p>or&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT country_id
FROM country
WHERE country = 'Canada' OR country = 'Mexico';
&lt;/code>&lt;/pre>
&lt;p>in the following ways:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT city_id, city
FROM city
WHERE country_id IN
(SELECT country_id
FROM country
WHERE country IN ('Canada','Mexico'));
&lt;/code>&lt;/pre>
&lt;p>or the opposite:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT city_id, city
FROM city
WHERE country_id NOT IN
(SELECT country_id
FROM country
WHERE country IN ('Canada','Mexico'));
&lt;/code>&lt;/pre>
&lt;h4 id="the-all-operator">The all operator&lt;/h4>
&lt;p>The all operator allows you to make comparisons between a single value and every value in a set:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT first_name, last_name
FROM customer
WHERE customer_id &amp;lt;&amp;gt; ALL
(SELECT customer_id
FROM payment
WHERE amount = 0);
&lt;/code>&lt;/pre>
&lt;p>or the equivalent:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT first_name, last_name
FROM customer
WHERE customer_id NOT IN
(SELECT customer_id
FROM payment
WHERE amount = 0);
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Any attempt to equate a value to null yields unknown, so when using &lt;code>not in&lt;/code> or &lt;code>&amp;lt;&amp;gt; all&lt;/code> to compare a value to a set of values, you must be careful to ensure that the set of values does not contain a null value.&lt;/strong>&lt;/p>
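&lt;p>A minimal sketch of the pitfall (the literal ids here are illustrative): a single null in the set makes &lt;code>not in&lt;/code> return no rows at all.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT first_name, last_name
FROM customer
WHERE customer_id NOT IN (122, 452, NULL);
/*returns an empty result set, because no customer_id
can be proven unequal to null*/
&lt;/code>&lt;/pre>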
&lt;p>The subquery in this example returns the total number of film rentals for each customer in North America, and the containing query returns all customers whose total number of film rentals exceeds that of every North American customer.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT customer_id, count(*)
FROM rental
GROUP BY customer_id
HAVING count(*) &amp;gt; ALL
(SELECT count(*)
FROM rental r
INNER JOIN customer c
ON r.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
INNER JOIN country co
ON ct.country_id = co.country_id
WHERE co.country IN ('United States','Mexico','Canada')
GROUP BY r.customer_id
);
&lt;/code>&lt;/pre>
&lt;h4 id="the-any-operator-or">The any operator (OR)&lt;/h4>
&lt;p>A condition using the any operator evaluates to true as soon as a single comparison is favorable.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT customer_id, sum(amount)
FROM payment
GROUP BY customer_id
HAVING sum(amount) &amp;gt; ANY
(SELECT sum(amount)
FROM payment p
INNER JOIN customer c
ON p.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
INNER JOIN country co
ON ct.country_id = co.country_id
WHERE co.country IN ('Bolivia','Paraguay','Chile')
GROUP BY co.country
);
&lt;/code>&lt;/pre>
&lt;h3 id="multicolumn-subqueries">Multicolumn Subqueries&lt;/h3>
&lt;pre>&lt;code class="language-sql">SELECT actor_id, film_id
FROM film_actor
WHERE (actor_id, film_id) IN
(SELECT a.actor_id, f.film_id
FROM actor a
CROSS JOIN film f
WHERE a.last_name = 'MONROE'
AND f.rating = 'PG');
&lt;/code>&lt;/pre>
&lt;h2 id="correlated-subqueries">Correlated Subqueries&lt;/h2>
&lt;p>A &lt;em>correlated&lt;/em> &lt;em>subquery&lt;/em>, on the other hand, is &lt;em>dependent&lt;/em> on its containing statement, from which it references one or more columns.&lt;/p>
&lt;pre>
SELECT c.first_name, c.last_name
FROM customer c
WHERE 20 =
(SELECT count(*)
FROM rental r
WHERE r.customer_id = &lt;b>c.customer_id&lt;/b>);
/*customers who have rented exactly 20 films*/
&lt;/pre>
&lt;h3 id="the-exists-operator">The exists Operator&lt;/h3>
&lt;p>You use the exists operator when you want to identify that a relationship exists without regard for the quantity.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name
FROM customer c
WHERE EXISTS /*or NOT EXISTS*/
(SELECT r.rental_date, r.customer_id, 'ABCD' str, 2 * 3 / 7 nmbr /*can be replaced by anything*/
FROM rental r
WHERE r.customer_id = c.customer_id
AND date(r.rental_date) &amp;lt; '2005-05-25');
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Since the condition in the containing query only needs to know whether any rows were returned, the actual data the subquery returned is irrelevant.&lt;/strong>&lt;/p>
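&lt;p>Because the returned data does not matter, a common convention is to select a constant in the subquery; here is the query above rewritten in that style:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name
FROM customer c
WHERE EXISTS
(SELECT 1
FROM rental r
WHERE r.customer_id = c.customer_id
AND date(r.rental_date) &amp;lt; '2005-05-25');
&lt;/code>&lt;/pre>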
&lt;h3 id="data-manipulation-using-correlated-subqueries">Data Manipulation Using Correlated Subqueries&lt;/h3>
&lt;pre>&lt;code class="language-sql">UPDATE customer c
SET c.last_update =
(SELECT max(r.rental_date)
FROM rental r
WHERE r.customer_id = c.customer_id);
UPDATE customer c
SET c.last_update =
(SELECT max(r.rental_date) FROM rental r
WHERE r.customer_id = c.customer_id)
WHERE EXISTS
(SELECT 1 FROM rental r
WHERE r.customer_id = c.customer_id);
/*executes only if the condition in the update statement’s where clause evaluates to true (meaning that at least one rental was found for the customer), thus protecting the data in the last_update column from being overwritten with a null.*/
DELETE FROM customer WHERE 365 &amp;lt; ALL
(SELECT datediff(now(), r.rental_date) days_since_last_rental FROM rental r
WHERE r.customer_id = customer.customer_id);
/*removes rows from the customer table where there have been no film rentals in the past year*/
&lt;/code>&lt;/pre>
&lt;h1 id="when-to-use-subqueries">When to Use Subqueries&lt;/h1>
&lt;h2 id="subqueries-as-data-sources">Subqueries as Data Sources&lt;/h2>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name, pymnt.num_rentals, pymnt.tot_payments
FROM customer c
INNER JOIN
(SELECT customer_id, count(*) num_rentals, sum(amount) tot_payments
FROM payment
GROUP BY customer_id ) pymnt /*executed first*/
ON c.customer_id = pymnt.customer_id;
&lt;/code>&lt;/pre>
&lt;h3 id="data-fabrication">Data fabrication&lt;/h3>
&lt;p>First, we fabricate a table of customer payment groups (small/average/heavy) with lower and upper payment limits.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT 'Small Fry' name, 0 low_limit, 74.99 high_limit UNION ALL
SELECT 'Average Joes' name, 75 low_limit, 149.99 high_limit
UNION ALL
SELECT 'Heavy Hitters' name, 150 low_limit, 9999999.99 high_limit;
&lt;/code>&lt;/pre>
&lt;p>Then we join the grouped payment totals against this fabricated table to produce the desired summary.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT pymnt_grps.name, count(*) num_customers
FROM
(SELECT customer_id, count(*) num_rentals, sum(amount) tot_payments
FROM payment
GROUP BY customer_id) pymnt
INNER JOIN (SELECT 'Small Fry' name, 0 low_limit, 74.99 high_limit
UNION ALL
SELECT 'Average Joes' name, 75 low_limit, 149.99 high_limit
UNION ALL
SELECT 'Heavy Hitters' name, 150 low_limit, 9999999.99 high_limit ) pymnt_grps
ON pymnt.tot_payments
BETWEEN pymnt_grps.low_limit AND pymnt_grps.high_limit
GROUP BY pymnt_grps.name;
&lt;/code>&lt;/pre>
&lt;h3 id="task-oriented-subqueries">Task-oriented subqueries&lt;/h3>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name, ct.city,
sum(p.amount) tot_payments, count(*) tot_rentals
FROM payment p
INNER JOIN customer c
ON p.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
GROUP BY c.first_name, c.last_name, ct.city;
&lt;/code>&lt;/pre>
&lt;p>We need the names/cities/addresses for display purposes only, so we can use a subquery to group the payment data first and then join the other tables. A more efficient query for the same task:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name, ct.city, pymnt.tot_payments, pymnt.tot_rentals
FROM (SELECT customer_id, count(*) tot_rentals, sum(amount) tot_payments
FROM payment
GROUP BY customer_id) pymnt
INNER JOIN customer c
ON pymnt.customer_id = c.customer_id
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id;
&lt;/code>&lt;/pre>
&lt;h3 id="common-table-expressions">Common table expressions&lt;/h3>
&lt;pre>&lt;code class="language-sql">WITH actors_s AS
(SELECT actor_id, first_name, last_name
FROM actor
WHERE last_name LIKE 'S%'
) /*can be used in the subsequent queries*/
...
&lt;/code>&lt;/pre>
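&lt;p>A complete sketch of how the CTE above might be referenced by the statement that follows it (the follow-up query is my own illustration):&lt;/p>
&lt;pre>&lt;code class="language-sql">WITH actors_s AS
(SELECT actor_id, first_name, last_name
FROM actor
WHERE last_name LIKE 'S%'
)
SELECT first_name, last_name
FROM actors_s
ORDER BY last_name;
&lt;/code>&lt;/pre>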
&lt;h2 id="subqueries-as-expression-generators">Subqueries as Expression Generators&lt;/h2>
&lt;p>Correlated scalar subqueries: in this example, the customer table is accessed three times (once in each of the three subqueries) rather than just once.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT (SELECT c.first_name
FROM customer c
WHERE c.customer_id = p.customer_id ) first_name, (SELECT c.last_name
FROM customer c
WHERE c.customer_id = p.customer_id ) last_name, (SELECT ct.city
FROM customer c
INNER JOIN address a
ON c.address_id = a.address_id
INNER JOIN city ct
ON a.city_id = ct.city_id
WHERE c.customer_id = p.customer_id
) city,
sum(p.amount) tot_payments, count(*) tot_rentals
FROM payment p
GROUP BY p.customer_id;
&lt;/code>&lt;/pre>
&lt;p>Similarly,&lt;/p>
&lt;pre>&lt;code class="language-sql">INSERT INTO film_actor (actor_id, film_id, last_update) VALUES (
(SELECT actor_id
FROM actor
WHERE first_name = 'JENNIFER' AND last_name = 'DAVIS'), (SELECT film_id FROM film
WHERE title = 'ACE GOLDFINGER'),
now()
);
&lt;/code>&lt;/pre>
&lt;h1 id="subquery-wrap-up">Subquery Wrap-Up&lt;/h1>
&lt;ul>
&lt;li>Return a single column and row, a single column with multiple rows, and multiple columns and rows&lt;/li>
&lt;li>Are independent of the containing statement (noncorrelated subqueries)&lt;/li>
&lt;li>Reference one or more columns from the containing statement (correlated subqueries)&lt;/li>
&lt;li>Are used in conditions that utilize comparison operators as well as the special-purpose operators in, not in, exists, and not exists&lt;/li>
&lt;li>Can be found in select, update, delete, and insert statements&lt;/li>
&lt;li>Generate result sets that can be joined to other tables (or subqueries) in a query&lt;/li>
&lt;li>Can be used to generate values to populate a table or to populate columns in a query’s result set&lt;/li>
&lt;li>Are used in the select, from, where, having, and order by clauses of queries&lt;/li>
&lt;/ul>
&lt;p>Happy learning!&lt;/p>
&lt;p>&lt;img src="2.jpg" alt="">&lt;/p></description></item><item><title>Learning SQL Notes #7: Grouping and Aggregates (CH. 8)</title><link>https://siqi-zheng.rbind.io/post/2021-06-05-sql-notes-7/</link><pubDate>Sat, 05 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-05-sql-notes-7/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#grouping-concepts">Grouping Concepts&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#aggregate-functions">Aggregate Functions&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#generating-groups">Generating Groups&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#single-columnmulticolumn-grouping">Single-Column/Multicolumn Grouping&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#grouping-via-expressions">Grouping via Expressions&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#generating-rollups">Generating Rollups&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#group-filter-conditions">Group Filter Conditions&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="grouping-concepts">Grouping Concepts&lt;/h2>
&lt;pre>&lt;code class="language-sql">SELECT customer_id, count(*)
FROM rental
GROUP BY customer_id
HAVING count(*) &amp;gt;= 40
ORDER BY 2 DESC;
&lt;/code>&lt;/pre>
&lt;p>WARNING:&lt;/p>
&lt;p>&lt;del>WHERE count(*) &amp;gt;= 40&lt;/del> is invalid: conditions on aggregate functions belong in the &lt;code>HAVING&lt;/code> clause, not in &lt;code>WHERE&lt;/code>.&lt;/p>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">library(tidyverse)
rental %&amp;gt;%
group_by(customer_id) %&amp;gt;%
summarize(counts=n()) %&amp;gt;%
filter(counts&amp;gt;=40) %&amp;gt;%
arrange(desc(counts))
&lt;/code>&lt;/pre>
&lt;h2 id="aggregate-functions">Aggregate Functions&lt;/h2>
&lt;p>Some aggregate functions in SQL/R:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">SQL&lt;/th>
&lt;th align="right">R&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">count()&lt;/td>
&lt;td align="right">count()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">sum()&lt;/td>
&lt;td align="right">sum()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">average()&lt;/td>
&lt;td align="right">mean()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">min()&lt;/td>
&lt;td align="right">min()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">max()&lt;/td>
&lt;td align="right">max()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">group_concat()&lt;/td>
&lt;td align="right">paste()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">first()&lt;/td>
&lt;td align="right">[1]&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">last()&lt;/td>
&lt;td align="right">[-1]&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-sql">SELECT COUNT(DISTINCT col1)
FROM string_tbl;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">length(unique(string_tbl$col1))
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>NULLs are ignored by aggregate functions; the exception is &lt;code>count(*)&lt;/code>, which counts rows rather than column values.&lt;/strong>&lt;/p>
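&lt;p>A quick illustration (assuming some rows in &lt;code>string_tbl&lt;/code> have a null &lt;code>col1&lt;/code>):&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT count(*) num_rows, count(col1) num_vals
FROM string_tbl;
/*num_rows counts every row; num_vals skips rows where col1 is null*/
&lt;/code>&lt;/pre>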
&lt;h2 id="generating-groups">Generating Groups&lt;/h2>
&lt;h3 id="single-columnmulticolumn-grouping">Single-Column/Multicolumn Grouping&lt;/h3>
&lt;p>Grouping can be done on 1 or more columns with aggregate functions.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT actor_id, count(*)
FROM film_actor
GROUP BY actor_id;
SELECT fa.actor_id, f.rating, count(*)
FROM film_actor fa
INNER JOIN film f
ON fa.film_id = f.film_id
GROUP BY fa.actor_id, f.rating
ORDER BY 1,2;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes are analogous to the codes in the last section.&lt;/p>
&lt;h3 id="grouping-via-expressions">Grouping via Expressions&lt;/h3>
&lt;pre>&lt;code class="language-sql">SELECT extract(YEAR FROM rental_date) year,
COUNT(*) how_many
FROM rental
GROUP BY extract(YEAR FROM rental_date);
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">library(tidyverse)
rental %&amp;gt;%
mutate(year=year(rental_date)) %&amp;gt;%
group_by(year) %&amp;gt;%
summarize(counts=n())
&lt;/code>&lt;/pre>
&lt;h3 id="generating-rollups">Generating Rollups&lt;/h3>
&lt;p>Find total counts for each distinct actor.&lt;/p>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
SELECT fa.actor_id, f.rating, count(*)
FROM film_actor fa
INNER JOIN film f
ON fa.film_id = f.film_id
GROUP BY fa.actor_id, f.rating WITH ROLLUP
ORDER BY 1,2;
/*Oracle*/
GROUP BY ROLLUP(fa.actor_id, f.rating)
GROUP BY a, ROLLUP(b, c) /*partial rollup: subtotals computed within each value of a*/
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">actor_id&lt;/th>
&lt;th>rating&lt;/th>
&lt;th align="right">count(*)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">NULL&lt;/td>
&lt;td>NULL&lt;/td>
&lt;td align="right">5462&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">1&lt;/td>
&lt;td>NULL&lt;/td>
&lt;td align="right">19&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">1&lt;/td>
&lt;td>G&lt;/td>
&lt;td align="right">4&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">1&lt;/td>
&lt;td>PG&lt;/td>
&lt;td align="right">6&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">1&lt;/td>
&lt;td>PG-13&lt;/td>
&lt;td align="right">1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">1&lt;/td>
&lt;td>R&lt;/td>
&lt;td align="right">3&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">1&lt;/td>
&lt;td>NC-17&lt;/td>
&lt;td align="right">5&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">2&lt;/td>
&lt;td>NULL&lt;/td>
&lt;td align="right">25&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">2&lt;/td>
&lt;td>G&lt;/td>
&lt;td align="right">7&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">library(reshape2)
library(zoo)
m &amp;lt;- melt(df, measure.vars = &amp;quot;sales&amp;quot;)
dout &amp;lt;- dcast(m, year + month + region ~ variable, fun.aggregate = sum, margins = &amp;quot;month&amp;quot;)
dout$month &amp;lt;- na.locf(replace(dout$month, dout$month == &amp;quot;(all)&amp;quot;, NA))
&lt;/code>&lt;/pre>
&lt;p>See here: &lt;a href="https://stackoverflow.com/questions/36169073/how-to-do-group-by-rollup-in-r-like-sql">https://stackoverflow.com/questions/36169073/how-to-do-group-by-rollup-in-r-like-sql&lt;/a>&lt;/p>
&lt;h2 id="group-filter-conditions">Group Filter Conditions&lt;/h2>
&lt;ul>
&lt;li>&lt;code>HAVING&lt;/code> filters with aggregate functions (applied after grouping);&lt;/li>
&lt;li>&lt;code>WHERE&lt;/code> filters with original columns (applied before grouping);&lt;/li>
&lt;/ul>
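&lt;p>Both filter types can appear in one query. This sketch (the rating filter and the cutoff of 9 are illustrative) shows &lt;code>WHERE&lt;/code> acting before grouping and &lt;code>HAVING&lt;/code> after:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT fa.actor_id, f.rating, count(*)
FROM film_actor fa
INNER JOIN film f
ON fa.film_id = f.film_id
WHERE f.rating IN ('G','PG') /*row filter, applied before grouping*/
GROUP BY fa.actor_id, f.rating
HAVING count(*) &amp;gt; 9; /*group filter, applied after grouping*/
&lt;/code>&lt;/pre>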
&lt;p>&lt;img src="2.gif" alt="">&lt;/p></description></item><item><title>Learning SQL Notes #6: Data Generation, Manipulation, and Conversion</title><link>https://siqi-zheng.rbind.io/post/2021-06-04-sql-notes-6/</link><pubDate>Fri, 04 Jun 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-04-sql-notes-6/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#working-with-string-data">Working with String Data&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#string-generation">String Generation&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#including-single-quotes">Including single quotes&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#including-special-characters">Including special characters&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#string-manipulation">String Manipulation&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#string-functions-that-return-numbers">String functions that return numbers&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#working-with-numeric-data">Working with Numeric Data&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#performing-arithmetic-functions--controlling-number-precision--handling-signed-data">Performing Arithmetic Functions &amp;amp; Controlling Number Precision &amp;amp; Handling Signed Data&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#working-with-temporal-data">Working with Temporal Data&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#dealing-with-time-zones">Dealing with Time Zones&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#generating-temporal-data">Generating Temporal Data&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#string-representations-of-temporal-data">String representations of temporal data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#string-to-date-conversions">String-to-date conversions&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#manipulating-temporal-data">Manipulating Temporal Data&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#temporal-functions-that-return-dates">Temporal functions that return dates&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#temporal-functions-that-return-strings">Temporal functions that return strings&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#temporal-functions-that-return-numbers">Temporal functions that return numbers&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#conversion-functions">Conversion Functions&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#appendix-for-codes">Appendix for Codes&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="working-with-string-data">Working with String Data&lt;/h2>
&lt;h3 id="string-generation">String Generation&lt;/h3>
&lt;p>Types:&lt;/p>
&lt;p>char
Holds fixed-length, blank-padded strings.&lt;/p>
&lt;p>varchar
Holds variable-length strings.&lt;/p>
&lt;p>text (MySQL and SQL Server) or clob (Oracle Database)
Holds very large variable-length strings (generally referred to as documents in this context).&lt;/p>
&lt;pre>&lt;code class="language-sql">CREATE TABLE string_tbl
(char_fld CHAR(30),
vchar_fld VARCHAR(30),
text_fld TEXT
);
INSERT INTO string_tbl (char_fld, vchar_fld, text_fld)
VALUES ('This is char data',
'This is varchar data',
'This is text data');
&lt;/code>&lt;/pre>
&lt;p>If you want to have a longer string, you can&lt;/p>
&lt;pre>&lt;code class="language-sql">UPDATE string_tbl
SET vchar_fld = 'This is a piece of extremely long varchar data';
&lt;/code>&lt;/pre>
&lt;p>but then:&lt;/p>
&lt;pre>&lt;code>ERROR 1406 (22001): Data too long for column 'vchar_fld' at row 1
&lt;/code>&lt;/pre>
&lt;p>NOTE: Since MySQL 5.7, the default behavior is “strict” mode, which means that exceptions are thrown when problems arise, whereas in older versions of the server &lt;strong>the string would have been truncated and a warning issued&lt;/strong>.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT @@session.sql_mode;
SET sql_mode='ansi'; /*revert to the older, non-strict behavior*/
SELECT @@session.sql_mode;
&lt;/code>&lt;/pre>
&lt;p>Now the extra characters will be truncated, with a warning issued instead of an error.&lt;/p>
&lt;h4 id="including-single-quotes">Including single quotes&lt;/h4>
&lt;pre>&lt;code class="language-sql">SELECT quote(text_fld)
FROM string_tbl;
&lt;/code>&lt;/pre>
&lt;p>Output:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">QUOTE(text_fld)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&amp;lsquo;This string didn't work, but it does now&amp;rsquo;&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="including-special-characters">Including special characters&lt;/h4>
&lt;p>The SQL Server and MySQL servers include the built-in function &lt;code>char()&lt;/code> so that you can build strings from any character in the extended ASCII character set (codes 0 through 255); Oracle Database users can use the &lt;code>chr()&lt;/code> function.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT CHAR(128,129,130,131,132,133,134,135,136,137);
&lt;/code>&lt;/pre>
&lt;p>Output:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">CHAR(128,129,130,131,132,133,134,135,136,137)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">Çüéâäàåçêë&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">coderange &amp;lt;- c(128,129,130,131,132,133,134,135,136,137)
rawToChar(as.raw(coderange),multiple=TRUE)
&lt;/code>&lt;/pre>
&lt;p>You can also concatenate two strings:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT CONCAT('danke sch', CHAR(148), 'n');
&lt;/code>&lt;/pre>
&lt;p>Output:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">CONCAT(&amp;lsquo;danke sch&amp;rsquo;, CHAR(148), &amp;lsquo;n&amp;rsquo;)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">danke schön&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">paste('danke sch', rawToChar(as.raw(148)), 'n')
paste0()
&lt;/code>&lt;/pre>
&lt;p>See: &lt;a href="https://www.r-bloggers.com/2011/03/ascii-code-table-in-r/">https://www.r-bloggers.com/2011/03/ascii-code-table-in-r/&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Oracle Database/PostgreSQL users can use the concatenation operator (&lt;code>||&lt;/code>) instead of the &lt;code>concat()&lt;/code> function, as in:&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-sql">SELECT 'danke sch' || CHR(148) || 'n' FROM dual;
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>SQL Server does not include a &lt;code>concat()&lt;/code> function, so you will need to use the concatenation operator (+), as in:&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-sql">SELECT 'danke sch' + CHAR(148) + 'n'
&lt;/code>&lt;/pre>
&lt;h3 id="string-manipulation">String Manipulation&lt;/h3>
&lt;h4 id="string-functions-that-return-numbers">String functions that return numbers&lt;/h4>
&lt;p>To find the length of a string:&lt;/p>
&lt;pre>&lt;code class="language-sql">LENGTH()
SELECT LENGTH(char_fld) char_length,
LENGTH(vchar_fld) varchar_length,
LENGTH(text_fld) text_length
FROM string_tbl;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">length()
&lt;/code>&lt;/pre>
&lt;p>To find the index of a character in a string:&lt;/p>
&lt;pre>&lt;code class="language-sql">POSITION()
SELECT POSITION('characters' IN vchar_fld)
FROM string_tbl;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">match('y',x)
which('y' %in% x)
&lt;/code>&lt;/pre>
&lt;p>Note: When working with databases, the &lt;strong>first&lt;/strong> character in a string is at position &lt;strong>1&lt;/strong>. A return value of &lt;strong>0&lt;/strong> from &lt;code>instr()&lt;/code> indicates that the substring &lt;strong>could not be found&lt;/strong>, not that the substring was found at the first position in the string.&lt;/p>
&lt;p>If you want to start your search at something &lt;strong>other than the first character&lt;/strong> of your target string, you will need to use the &lt;code>locate()&lt;/code> function, which is similar to the &lt;code>position()&lt;/code> function except that it allows an optional &lt;strong>third parameter&lt;/strong>, which is used to define the search’s start position. The &lt;code>locate()&lt;/code> function is also proprietary, whereas the &lt;code>position()&lt;/code> function is part of the SQL:2003 standard.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT LOCATE('is', vchar_fld, 5)
FROM string_tbl;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">match('y',x[5:])
which('y' %in% x[5:])
&lt;/code>&lt;/pre>
&lt;p>Oracle Database
&lt;code>instr()&lt;/code>: Mimics the &lt;code>position()&lt;/code> function when provided with two arguments and mimics the &lt;code>locate()&lt;/code> function when provided with three arguments.&lt;/p>
&lt;p>SQL Server
&lt;code>charindex()&lt;/code>: similar to Oracle’s &lt;code>instr()&lt;/code> function.&lt;/p>
&lt;p>&lt;code>strcmp()&lt;/code> (MySQL ONLY) takes two strings as arguments and returns one of the following:&lt;/p>
&lt;ul>
&lt;li>−1 if the first string comes before the second string in sort order&lt;/li>
&lt;li>0 if the strings are identical&lt;/li>
&lt;li>1 if the first string comes after the second string in sort order&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-sql">SELECT vchar_fld
FROM string_tbl
ORDER BY vchar_fld;
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">vchar_fld&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">12345&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">abcd&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">QRSTUV&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">qrstuv&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">xyz&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-sql">SELECT STRCMP('12345','12345') 12345_12345,
STRCMP('abcd','xyz') abcd_xyz,
STRCMP('abcd','QRSTUV') abcd_QRSTUV,
STRCMP('qrstuv','QRSTUV') qrstuv_QRSTUV, /*Case insensitive*/
STRCMP('12345','xyz') 12345_xyz,
STRCMP('xyz','qrstuv') xyz_qrstuv;
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">12345_12345&lt;/th>
&lt;th>abcd_xyz&lt;/th>
&lt;th>abcd_QRSTUV&lt;/th>
&lt;th>qrstuv_QRSTUV&lt;/th>
&lt;th>12345_xyz&lt;/th>
&lt;th align="right">xyz_qrstuv&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">0&lt;/td>
&lt;td>−1&lt;/td>
&lt;td>−1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>−1&lt;/td>
&lt;td align="right">1&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>To add or replace characters in the &lt;em>middle&lt;/em> of a string, use
&lt;code>insert()&lt;/code>, which takes
4 parameters: the original string, the start position, the number of characters to replace (0 for inserting a string), and the replacement string.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT INSERT('goodbye world', 9, 0, 'cruel ') string;
/*goodbye cruel world*/
SELECT INSERT('goodbye world', 1, 7, 'hello') string;
/*hello world*/
SELECT SUBSTRING('goodbye cruel world', 9, 5);
/*cruel*/
&lt;/code>&lt;/pre>
&lt;p>For other SQL,&lt;/p>
&lt;pre>&lt;code class="language-sql">/*Oracle*/
SELECT REPLACE('goodbye world', 'goodbye', 'hello') FROM dual;
/*hello world*/
SELECT substr('goodbye cruel world', 9, 5);
/*cruel*/
/*SQL Server*/
SELECT STUFF('hello world', 1, 5, 'goodbye cruel')
/*goodbye cruel world*/
SELECT SUBSTRING('goodbye cruel world', 9, 5);
/*cruel*/
&lt;/code>&lt;/pre>
&lt;h2 id="working-with-numeric-data">Working with Numeric Data&lt;/h2>
&lt;pre>&lt;code class="language-sql">SELECT (37 * 59) / (78 - (8 * 6));
&lt;/code>&lt;/pre>
&lt;h3 id="performing-arithmetic-functions--controlling-number-precision--handling-signed-data">Performing Arithmetic Functions &amp;amp; Controlling Number Precision &amp;amp; Handling Signed Data&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Function name&lt;/th>
&lt;th align="right">Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">acos( x )&lt;/td>
&lt;td align="right">Calculates the arc cosine of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">asin( x )&lt;/td>
&lt;td align="right">Calculates the arc sine of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">atan( x )&lt;/td>
&lt;td align="right">Calculates the arc tangent of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">cos( x )&lt;/td>
&lt;td align="right">Calculates the cosine of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">sin( x )&lt;/td>
&lt;td align="right">Calculates the sine of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">tan( x )&lt;/td>
&lt;td align="right">Calculates the tangent of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">cot( x )&lt;/td>
&lt;td align="right">Calculates the cotangent of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">exp( x )&lt;/td>
&lt;td align="right">Calculates ex&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">ln( x )&lt;/td>
&lt;td align="right">Calculates the natural log of x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">sqrt( x )&lt;/td>
&lt;td align="right">Calculates the square root of x&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Some useful functions in R and SQL (See Appendix for full results):&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">SQL&lt;/th>
&lt;th align="right">R&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">MOD( x )&lt;/td>
&lt;td align="right">%%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">POW( x )&lt;/td>
&lt;td align="right">^&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">CEIL( x )&lt;/td>
&lt;td align="right">ceiling()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">FLOOR( x )&lt;/td>
&lt;td align="right">floor()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">ROUND( x )&lt;/td>
&lt;td align="right">round()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">TRUNCATE( x )&lt;/td>
&lt;td align="right">trunc()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">SIGN( x )&lt;/td>
&lt;td align="right">sign()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">ABS( x )&lt;/td>
&lt;td align="right">abs()&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="working-with-temporal-data">Working with Temporal Data&lt;/h2>
&lt;h3 id="dealing-with-time-zones">Dealing with Time Zones&lt;/h3>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
SELECT @@global.time_zone, @@session.time_zone;
SET time_zone = 'Europe/Zurich';
/*Oracle Database*/
ALTER SESSION SET TIME_ZONE = 'Europe/Zurich';
&lt;/code>&lt;/pre>
&lt;p>From:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">@@global.time_zone&lt;/th>
&lt;th align="right">@@session.time_zone&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">SYSTEM&lt;/td>
&lt;td align="right">SYSTEM&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>To:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">@@global.time_zone&lt;/th>
&lt;th align="right">@@session.time_zone&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">SYSTEM&lt;/td>
&lt;td align="right">Europe/Zurich&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">Sys.timezone()
Sys.setenv(TZ = &amp;quot;Europe/Zurich&amp;quot;)
&lt;/code>&lt;/pre>
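&lt;p>As a rough Python analogue (a sketch using the standard-library &lt;code>zoneinfo&lt;/code> module, Python 3.9+), the same session-level shift looks like this:&lt;/p>

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# A timestamp stored in UTC (the "SYSTEM" zone in the tables above)
utc_ts = datetime(2021, 6, 5, 12, 0, 0, tzinfo=timezone.utc)

# Re-express it for a session running in Europe/Zurich (UTC+2 in summer)
zurich_ts = utc_ts.astimezone(ZoneInfo("Europe/Zurich"))
print(zurich_ts.hour)  # 14 -- same instant, displayed two hours later
```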
&lt;h3 id="generating-temporal-data">Generating Temporal Data&lt;/h3>
&lt;p>You can generate temporal data via any of the following means:&lt;/p>
&lt;ul>
&lt;li>Copying data from an existing date, datetime, or time column&lt;/li>
&lt;li>Executing a built-in function that returns a date, datetime, or time&lt;/li>
&lt;li>Building a string representation of the temporal data to be evaluated by the server&lt;/li>
&lt;/ul>
&lt;h4 id="string-representations-of-temporal-data">String representations of temporal data&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Component&lt;/th>
&lt;th>Definition&lt;/th>
&lt;th align="right">Range&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">YYYY&lt;/td>
&lt;td>Year, including century&lt;/td>
&lt;td align="right">1000 to 9999&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">MM&lt;/td>
&lt;td>Month&lt;/td>
&lt;td align="right">01 (January) to 12 (December)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">DD&lt;/td>
&lt;td>Day&lt;/td>
&lt;td align="right">01 to 31&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">HH&lt;/td>
&lt;td>Hour&lt;/td>
&lt;td align="right">Range 00 to 23&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">HHH&lt;/td>
&lt;td>Hours (elapsed)&lt;/td>
&lt;td align="right">−838 to 838&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">MI&lt;/td>
&lt;td>Minute&lt;/td>
&lt;td align="right">00 to 59&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">SS&lt;/td>
&lt;td>Second&lt;/td>
&lt;td align="right">00 to 59&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Type&lt;/th>
&lt;th align="right">Default format&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">date&lt;/td>
&lt;td align="right">YYYY-MM-DD&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">datetime&lt;/td>
&lt;td align="right">YYYY-MM-DD HH:MI:SS&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">timestamp&lt;/td>
&lt;td align="right">YYYY-MM-DD HH:MI:SS&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">time&lt;/td>
&lt;td align="right">HHH:MI:SS&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="string-to-date-conversions">String-to-date conversions&lt;/h4>
&lt;ul>
&lt;li>A simple query that returns a datetime value using the &lt;code>cast()&lt;/code> function&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">SQL&lt;/th>
&lt;th align="right">R (lubridate)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">CAST(&amp;lsquo;2019-09-17 15:30:00&amp;rsquo; AS DATETIME)&lt;/td>
&lt;td align="right">as_datetime()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">STR_TO_DATE(&amp;lsquo;September 17, 2019&amp;rsquo;, &amp;lsquo;%M %d, %Y&amp;rsquo;)&lt;/td>
&lt;td align="right">as.Date(&amp;hellip;, format=&amp;hellip;)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">CAST(&amp;lsquo;2019-09-17&amp;rsquo; AS DATE)&lt;/td>
&lt;td align="right">as.Date()&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">CAST(&amp;lsquo;108:17:57&amp;rsquo; AS TIME)&lt;/td>
&lt;td align="right">as.POSIXlt()&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
SELECT STR_TO_DATE('September 17, 2019', '%M %d, %Y');
/*Oracle Database*/
SELECT TO_DATE('2019-09-17', 'YYYY-MM-DD') FROM dual;
/*SQL server*/
SELECT CONVERT(DATETIME, '2019-09-17');
/*Current System Time*/
SELECT CURRENT_DATE(), CURRENT_TIME(), CURRENT_TIMESTAMP();
&lt;/code>&lt;/pre>
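&lt;p>For comparison, Python&amp;rsquo;s &lt;code>strptime()&lt;/code> performs the same string-to-date conversions; note that MySQL&amp;rsquo;s &lt;code>%M&lt;/code> (month name) corresponds to Python&amp;rsquo;s &lt;code>%B&lt;/code>:&lt;/p>

```python
from datetime import datetime

# MySQL: STR_TO_DATE('September 17, 2019', '%M %d, %Y')
d = datetime.strptime("September 17, 2019", "%B %d, %Y").date()
print(d)  # 2019-09-17

# MySQL: CAST('2019-09-17 15:30:00' AS DATETIME)
dt = datetime.strptime("2019-09-17 15:30:00", "%Y-%m-%d %H:%M:%S")
```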
&lt;p>Format components for MySQL (note that R&amp;rsquo;s &lt;code>strptime()&lt;/code> codes differ in places: in R, &lt;code>%M&lt;/code> means minutes and &lt;code>%B&lt;/code> the full month name):&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Format component&lt;/th>
&lt;th align="right">Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">%M&lt;/td>
&lt;td align="right">Month name (January to December)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%m&lt;/td>
&lt;td align="right">Month numeric (01 to 12)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%d&lt;/td>
&lt;td align="right">Day numeric (01 to 31)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%j&lt;/td>
&lt;td align="right">Day of year (001 to 366)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%W&lt;/td>
&lt;td align="right">Weekday name (Sunday to Saturday)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%Y&lt;/td>
&lt;td align="right">Year, four-digit numeric&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%y&lt;/td>
&lt;td align="right">Year, two-digit numeric&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%H&lt;/td>
&lt;td align="right">Hour (00 to 23)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%h&lt;/td>
&lt;td align="right">Hour (01 to 12)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%i&lt;/td>
&lt;td align="right">Minutes (00 to 59)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%s&lt;/td>
&lt;td align="right">Seconds (00 to 59)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%f&lt;/td>
&lt;td align="right">Microseconds (000000 to 999999)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%p&lt;/td>
&lt;td align="right">A.M. or P.M.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="manipulating-temporal-data">Manipulating Temporal Data&lt;/h3>
&lt;p>&lt;strong>Interval types for &lt;code>DATE_ADD()&lt;/code> and &lt;code>EXTRACT()&lt;/code>&lt;/strong>&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Interval name&lt;/th>
&lt;th align="right">Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">second&lt;/td>
&lt;td align="right">Number of seconds&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">minute&lt;/td>
&lt;td align="right">Number of minutes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">hour&lt;/td>
&lt;td align="right">Number of hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">day&lt;/td>
&lt;td align="right">Number of days&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">month&lt;/td>
&lt;td align="right">Number of months&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">year&lt;/td>
&lt;td align="right">Number of years&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">minute_second&lt;/td>
&lt;td align="right">Number of minutes and seconds, separated by “:”&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">hour_second&lt;/td>
&lt;td align="right">Number of hours, minutes, and seconds, separated by “:”&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">year_month&lt;/td>
&lt;td align="right">Number of years and months, separated by “-”&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="temporal-functions-that-return-dates">Temporal functions that return dates&lt;/h4>
&lt;p>The same update can be performed on three different servers:&lt;/p>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
UPDATE employee
SET birth_date = DATE_ADD(birth_date, INTERVAL '9-11' YEAR_MONTH)
WHERE emp_id = 4789;
/*Oracle Database*/
UPDATE employee
SET birth_date = ADD_MONTHS(birth_date, 119)
WHERE emp_id = 4789;
/*SQL server*/
UPDATE employee
SET birth_date = DATEADD(MONTH, 119, birth_date)
WHERE emp_id = 4789
&lt;/code>&lt;/pre>
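&lt;p>Python&amp;rsquo;s standard library has no direct &lt;code>ADD_MONTHS&lt;/code> equivalent, but the 9-year-11-month (119-month) shift above can be sketched with a small helper (&lt;code>add_months&lt;/code> is a hypothetical name, not a library function):&lt;/p>

```python
import calendar
from datetime import date

def add_months(d, months):
    """Shift a date by whole months, clamping the day
    when the target month is shorter."""
    total = d.month - 1 + months
    year = d.year + total // 12
    month = total % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

# INTERVAL '9-11' YEAR_MONTH is 119 months
print(add_months(date(1970, 1, 15), 119))  # 1979-12-15
```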
&lt;h4 id="temporal-functions-that-return-strings">Temporal functions that return strings&lt;/h4>
&lt;p>Some other functions for temporal data:&lt;/p>
&lt;pre>&lt;code class="language-sql">/*MySQL*/
SELECT LAST_DAY('2019-09-17'); /*Extract last day of Sept*/
SELECT DAYNAME('2019-09-18'); /*Wednesday*/
SELECT EXTRACT(YEAR FROM '2019-09-18 22:19:05'); /*2019*/
/*SQL Server*/
SELECT DATEPART(YEAR, GETDATE())
&lt;/code>&lt;/pre>
&lt;h4 id="temporal-functions-that-return-numbers">Temporal functions that return numbers&lt;/h4>
&lt;pre>&lt;code class="language-sql">SELECT DATEDIFF('2019-09-03', '2019-06-21');
/*74*/
SELECT DATEDIFF('2019-09-03 23:59:59', '2019-06-21 00:00:01');
/*74, time has no effect*/
SELECT DATEDIFF('2019-06-21', '2019-09-03');
/*-74*/
/*SQL Server*/
SELECT DATEDIFF(DAY, '2019-06-21', '2019-09-03')
&lt;/code>&lt;/pre>
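&lt;p>In Python, subtracting two &lt;code>date&lt;/code> objects gives the same day counts:&lt;/p>

```python
from datetime import date, datetime

# DATEDIFF('2019-09-03', '2019-06-21')
print((date(2019, 9, 3) - date(2019, 6, 21)).days)  # 74

# MySQL's DATEDIFF ignores the time of day; mimic that by
# comparing only the .date() parts
a = datetime(2019, 9, 3, 23, 59, 59)
b = datetime(2019, 6, 21, 0, 0, 1)
print((a.date() - b.date()).days)  # 74
```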
&lt;h3 id="conversion-functions">Conversion Functions&lt;/h3>
&lt;pre>&lt;code class="language-sql">SELECT CAST('1456328' AS SIGNED INTEGER);
/*1456328*/
SELECT CAST('999ABC111' AS UNSIGNED INTEGER);
/*999 with warnings about truncation*/
&lt;/code>&lt;/pre>
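&lt;p>Python raises an error on &lt;code>int('999ABC111')&lt;/code> rather than truncating; MySQL&amp;rsquo;s forgiving cast can be sketched with a regular expression (&lt;code>lenient_int&lt;/code> is a made-up helper, not a built-in):&lt;/p>

```python
import re

def lenient_int(s):
    """Mimic MySQL's lenient string-to-integer cast:
    read the leading (optionally signed) digits, else 0."""
    m = re.match(r"[+-]?\d+", s)
    return int(m.group()) if m else 0

print(lenient_int("1456328"))    # 1456328
print(lenient_int("999ABC111"))  # 999
```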
&lt;h2 id="appendix-for-codes">Appendix for Codes&lt;/h2>
&lt;pre>&lt;code class="language-sql">SELECT MOD(10,4);
/*2*/
SELECT MOD(20.75,4); /*Real argument*/
/*0.75*/
SELECT POW(2,8);
/*256*/
SELECT CEIL(72.445), FLOOR(72.445);
/*73 72*/
SELECT CEIL(72.000000001), FLOOR(72.999999999);
/*73 72*/
SELECT ROUND(72.49999), ROUND(72.5), ROUND(72.50001);
/*72 73 73*/
SELECT ROUND(72.0909, 1), ROUND(72.0909, 2), ROUND(72.0909, 3);
/*72.1 72.09 72.091*/
SELECT TRUNCATE(72.0909, 1), TRUNCATE(72.0909, 2), TRUNCATE(72.0909, 3);
/*72.0 72.09 72.090*/
/*SQL Server*/
SELECT ROUND(72.0909, 1, 1)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">%%
^
ceiling()
floor()
round()
trunc()
&lt;/code>&lt;/pre>
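&lt;p>The same functions in Python (mostly in the &lt;code>math&lt;/code> module); note that Python&amp;rsquo;s built-in &lt;code>round()&lt;/code> uses banker&amp;rsquo;s rounding, unlike MySQL&amp;rsquo;s &lt;code>ROUND()&lt;/code>:&lt;/p>

```python
import math

print(10 % 4)                                 # 2   (MOD / %%)
print(2 ** 8)                                 # 256 (POW / ^)
print(math.ceil(72.445), math.floor(72.445))  # 73 72
print(math.trunc(72.9))                       # 72  (TRUNCATE / trunc)
print(round(72.0909, 2))                      # 72.09

# Caveat: half-way values round to the nearest even number,
# so SELECT ROUND(72.5) gives 73 in MySQL but:
print(round(72.5))  # 72
```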
&lt;pre>&lt;code class="language-sql">SELECT account_id, SIGN(balance), ABS(balance)
FROM account;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">sign()
abs()
&lt;/code>&lt;/pre>
&lt;p>Hope I can finish this before July. Stay safe.&lt;/p>
&lt;p>&lt;img src="2.gif" alt="">&lt;/p></description></item><item><title>Learning SQL Notes #5: Querying Multiple Tables (CH. 5)</title><link>https://siqi-zheng.rbind.io/post/2021-06-03-sql-notes-5/</link><pubDate>Thu, 03 Jun 2021 20:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-03-sql-notes-5/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#cross-join-cartesian-product">Cross Join (Cartesian Product)&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#inner-joins">Inner Joins&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#joining-three-or-more-tables">Joining Three or More Tables&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#using-subqueries-as-tables">Using Subqueries as Tables&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#using-the-same-table-twice">Using the Same Table Twice&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#self-joins">Self-Joins&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#outer-joins">Outer Joins&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#three-way-outer-joins">Three-Way Outer Joins&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#natural-joins">Natural Joins&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>A join instructs the server to use a column as the &lt;em>transportation&lt;/em> between tables, thus allowing columns from both tables to be included in the query’s result set.&lt;/p>
&lt;h2 id="cross-join-cartesian-product">Cross Join (Cartesian Product)&lt;/h2>
&lt;p>If the query doesn’t specify how the two tables should be joined, the database server generates the &lt;em>Cartesian
product&lt;/em>, which is &lt;strong>every combination&lt;/strong> of rows from the two tables.&lt;/p>
&lt;pre>&lt;code class="language-sql">JOIN b
CROSS JOIN b
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">merge(x = df1, y = df2, by = NULL)
library(data.table)
CJ(a, b)
&lt;/code>&lt;/pre>
&lt;p>A cross join can also be used to fabricate a table of consecutive numbers.&lt;/p>
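&lt;p>A minimal Python sketch of the Cartesian product, using &lt;code>itertools.product&lt;/code>:&lt;/p>

```python
from itertools import product

a = ["x", "y"]
b = [1, 2, 3]

# Every combination of rows from the two "tables"
cross = list(product(a, b))
print(len(cross))  # 6, i.e. 2 * 3 rows
```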
&lt;h2 id="inner-joins">Inner Joins&lt;/h2>
&lt;p>If a value exists for the address_id column in one table but &lt;em>not&lt;/em> the other, then the join fails for the rows containing that value, and those rows are &lt;strong>excluded&lt;/strong> from the result set. Inner join only returns rows that satisfy the &lt;strong>join condition&lt;/strong>.&lt;/p>
&lt;pre>
INNER JOIN b
&lt;b>ON a.id=b.id&lt;/b>
&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">merge(df1, df2, by = &amp;quot;id&amp;quot;)
library(plyr)
join(df1, df2,
type = &amp;quot;inner&amp;quot;)
&lt;/code>&lt;/pre>
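&lt;p>The exclusion behaviour is easy to check with Python&amp;rsquo;s built-in &lt;code>sqlite3&lt;/code> module (the &lt;code>customer&lt;/code>/&lt;code>address&lt;/code> tables here are toy stand-ins):&lt;/p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (address_id INTEGER, name TEXT);
    CREATE TABLE address  (address_id INTEGER, city TEXT);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO address  VALUES (1, 'Zurich');  -- no row for id 2
""")
rows = con.execute(
    "SELECT c.name, a.city FROM customer c "
    "INNER JOIN address a ON c.address_id = a.address_id"
).fetchall()
print(rows)  # [('Ann', 'Zurich')] -- Bob has no match and is excluded
```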
&lt;h2 id="joining-three-or-more-tables">Joining Three or More Tables&lt;/h2>
&lt;p>Join order is not important!&lt;/p>
&lt;p>To force MySQL to join the tables in the order listed:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT STRAIGHT_JOIN COL1
&lt;/code>&lt;/pre>
&lt;h2 id="using-subqueries-as-tables">Using Subqueries as Tables&lt;/h2>
&lt;p>See subquery notes.&lt;/p>
&lt;h2 id="using-the-same-table-twice">Using the Same Table Twice&lt;/h2>
&lt;p>Either one of the actors in the movie:&lt;/p>
&lt;pre>&lt;code class="language-SQL">SELECT f.title
FROM film f
INNER JOIN film_actor fa
ON f.film_id = fa.film_id
INNER JOIN actor a
ON fa.actor_id = a.actor_id
WHERE (a.first_name = 'CATE' AND a.last_name = 'MCQUEEN')
OR (a.first_name = 'CUBA' AND a.last_name = 'BIRCH');
&lt;/code>&lt;/pre>
&lt;p>If we want movies featuring both actors, you cannot simply replace OR with AND, since no single row can match two different actors and the query would return an empty set. Instead, you need to join the film_actor and actor tables twice:&lt;/p>
&lt;pre>&lt;code class="language-SQL">SELECT f.title
FROM film f
/*once: */
INNER JOIN film_actor fa1
ON f.film_id = fa1.film_id
INNER JOIN actor a1
ON fa1.actor_id = a1.actor_id
/*twice: */
INNER JOIN film_actor fa2
ON f.film_id = fa2.film_id
INNER JOIN actor a2
ON fa2.actor_id = a2.actor_id
/*filter condition is applied*/
WHERE (a1.first_name = 'CATE' AND a1.last_name = 'MCQUEEN')
AND (a2.first_name = 'CUBA' AND a2.last_name = 'BIRCH');
&lt;/code>&lt;/pre>
&lt;h2 id="self-joins">Self-Joins&lt;/h2>
&lt;p>Some tables include a self-referencing foreign key, which means that it includes a column that points to the primary key within the same table.&lt;/p>
&lt;p>Imagine that the film table includes the column prequel_film_id, which points to the film’s parent (e.g., the film Fiddler Lost II would use this column to point to the parent film Fiddler Lost).&lt;/p>
&lt;p>Using a self-join, you can write a query that lists every film that has a prequel, along with the prequel’s title:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT f.title, f_prnt.title prequel
FROM film f
INNER JOIN film f_prnt
ON f_prnt.film_id = f.prequel_film_id
WHERE f.prequel_film_id IS NOT NULL;
&lt;/code>&lt;/pre>
&lt;p>A possible outcome:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">title&lt;/th>
&lt;th align="right">prequel&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">FIDDLER LOST II&lt;/td>
&lt;td align="right">FIDDLER LOST&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
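&lt;p>The same self-join can be reproduced with &lt;code>sqlite3&lt;/code> and a two-row toy &lt;code>film&lt;/code> table:&lt;/p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE film (film_id INTEGER, title TEXT, prequel_film_id INTEGER);
    INSERT INTO film VALUES (1, 'FIDDLER LOST', NULL),
                            (2, 'FIDDLER LOST II', 1);
""")
rows = con.execute("""
    SELECT f.title, f_prnt.title AS prequel
    FROM film f
    INNER JOIN film f_prnt ON f_prnt.film_id = f.prequel_film_id
    WHERE f.prequel_film_id IS NOT NULL
""").fetchall()
print(rows)  # [('FIDDLER LOST II', 'FIDDLER LOST')]
```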
&lt;h2 id="outer-joins">Outer Joins&lt;/h2>
&lt;pre>&lt;code class="language-sql">SELECT f.film_id, f.title, count(i.inventory_id) num_copies
FROM film f
LEFT OUTER JOIN inventory i
ON f.film_id = i.film_id
GROUP BY f.film_id, f.title;
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>
&lt;p>A left outer join includes all rows from the table on the left side of the join (film, in this case) and then includes columns from the table on the right side of the join (inventory) only when the join succeeds.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The num_copies column definition was changed from count(*) to count(i.inventory_id), which will count the number of non-null values of the inventory.inventory_id column.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A left outer join B $\equiv$ B right outer join A.&lt;/p>
&lt;/li>
&lt;/ul>
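&lt;p>The zero-copies case can be verified with &lt;code>sqlite3&lt;/code> (toy data; film 2 has no inventory rows, yet still appears with a count of 0):&lt;/p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE film (film_id INTEGER, title TEXT);
    CREATE TABLE inventory (inventory_id INTEGER, film_id INTEGER);
    INSERT INTO film VALUES (1, 'ALONE TRIP'), (2, 'ALICE FANTASIA');
    INSERT INTO inventory VALUES (10, 1), (11, 1);
""")
rows = con.execute("""
    SELECT f.title, count(i.inventory_id) AS num_copies
    FROM film f
    LEFT OUTER JOIN inventory i ON f.film_id = i.film_id
    GROUP BY f.film_id, f.title
    ORDER BY f.film_id
""").fetchall()
print(rows)  # [('ALONE TRIP', 2), ('ALICE FANTASIA', 0)]
```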
&lt;h3 id="three-way-outer-joins">Three-Way Outer Joins&lt;/h3>
&lt;pre>
SELECT f.film_id, f.title, i.inventory_id, r.rental_date
FROM film f LEFT OUTER JOIN inventory i
ON f.film_id = i.film_id
&lt;b>LEFT OUTER JOIN rental r
ON i.inventory_id = r.inventory_id&lt;/b>
WHERE f.film_id BETWEEN 13 AND 15;
&lt;/pre>
&lt;h2 id="natural-joins">Natural Joins&lt;/h2>
&lt;p>A natural join lets the database server determine what the join conditions need to be.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name, date(r.rental_date)
FROM customer c
NATURAL JOIN rental r;
&lt;/code>&lt;/pre>
&lt;p>&lt;code>Empty set (0.04 sec)&lt;/code>&lt;/p>
&lt;p>Because you specified a natural join, the server inspected the table definitions and added the join condition r.customer_id = c.customer_id. This would have worked fine, but in the Sakila schema every table includes a last_update column recording when each row was last modified, so the server also adds the join condition r.last_update = c.last_update, which causes the query to return no data.&lt;/p>
&lt;p>The only way around this issue is to use a subquery to restrict the columns for at least one of the tables:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT cust.first_name, cust.last_name, date(r.rental_date)
FROM
(SELECT customer_id, first_name, last_name
FROM customer
) cust
NATURAL JOIN rental r;
&lt;/code>&lt;/pre></description></item><item><title>Learning SQL Notes #4.5: Regular Expression</title><link>https://siqi-zheng.rbind.io/post/2021-06-02-sql-notes-4-5/</link><pubDate>Wed, 02 Jun 2021 20:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-06-02-sql-notes-4-5/</guid><description>&lt;p>Adapted from &lt;a href="https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference">https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference&lt;/a>&lt;/p>
&lt;h2 id="character-escapes">Character Escapes&lt;/h2>
&lt;p>The backslash character (\) in a regular expression indicates that the character that follows it either is a special character (as shown in the following table), or should be interpreted literally. For more information, see &lt;a href="character-escapes-in-regular-expressions" data-linktype="relative-path">Character Escapes&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Escaped character&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>\a&lt;/code>&lt;/td>
&lt;td>Matches a bell character, \u0007.&lt;/td>
&lt;td>&lt;code>\a&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\u0007&amp;quot;&lt;/code> in &lt;code>&amp;quot;Error!&amp;quot; + '\u0007'&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\b&lt;/code>&lt;/td>
&lt;td>In a character class, matches a backspace, \u0008.&lt;/td>
&lt;td>&lt;code>[\b]{3,}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\b\b\b\b&amp;quot;&lt;/code> in &lt;code>&amp;quot;\b\b\b\b&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\t&lt;/code>&lt;/td>
&lt;td>Matches a tab, \u0009.&lt;/td>
&lt;td>&lt;code>(\w+)\t&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;item1\t&amp;quot;&lt;/code>, &lt;code>&amp;quot;item2\t&amp;quot;&lt;/code> in &lt;code>&amp;quot;item1\titem2\t&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\r&lt;/code>&lt;/td>
&lt;td>Matches a carriage return, \u000D. (&lt;code>\r&lt;/code> is not equivalent to the newline character, &lt;code>\n&lt;/code>.)&lt;/td>
&lt;td>&lt;code>\r\n(\w+)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\r\nThese&amp;quot;&lt;/code> in &lt;code>&amp;quot;\r\nThese are\ntwo lines.&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\v&lt;/code>&lt;/td>
&lt;td>Matches a vertical tab, \u000B.&lt;/td>
&lt;td>&lt;code>[\v]{2,}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\v\v\v&amp;quot;&lt;/code> in &lt;code>&amp;quot;\v\v\v&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\f&lt;/code>&lt;/td>
&lt;td>Matches a form feed, \u000C.&lt;/td>
&lt;td>&lt;code>[\f]{2,}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\f\f\f&amp;quot;&lt;/code> in &lt;code>&amp;quot;\f\f\f&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\n&lt;/code>&lt;/td>
&lt;td>Matches a new line, \u000A.&lt;/td>
&lt;td>&lt;code>\r\n(\w+)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\r\nThese&amp;quot;&lt;/code> in &lt;code>&amp;quot;\r\nThese are\ntwo lines.&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\e&lt;/code>&lt;/td>
&lt;td>Matches an escape, \u001B.&lt;/td>
&lt;td>&lt;code>\e&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\x001B&amp;quot;&lt;/code> in &lt;code>&amp;quot;\x001B&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\&lt;/code> &lt;em>nnn&lt;/em>&lt;/td>
&lt;td>Uses octal representation to specify a character (&lt;em>nnn&lt;/em> consists of two or three digits).&lt;/td>
&lt;td>&lt;code>\w\040\w&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;a b&amp;quot;&lt;/code>, &lt;code>&amp;quot;c d&amp;quot;&lt;/code> in &lt;code>&amp;quot;a bc d&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\x&lt;/code> &lt;em>nn&lt;/em>&lt;/td>
&lt;td>Uses hexadecimal representation to specify a character (&lt;em>nn&lt;/em> consists of exactly two digits).&lt;/td>
&lt;td>&lt;code>\w\x20\w&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;a b&amp;quot;&lt;/code>, &lt;code>&amp;quot;c d&amp;quot;&lt;/code> in &lt;code>&amp;quot;a bc d&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\c&lt;/code> &lt;em>X&lt;/em>&lt;br/>&lt;br/> &lt;code>\c&lt;/code> &lt;em>x&lt;/em>&lt;/td>
&lt;td>Matches the ASCII control character that is specified by &lt;em>X&lt;/em> or &lt;em>x&lt;/em>, where &lt;em>X&lt;/em> or &lt;em>x&lt;/em> is the letter of the control character.&lt;/td>
&lt;td>&lt;code>\cC&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;\x0003&amp;quot;&lt;/code> in &lt;code>&amp;quot;\x0003&amp;quot;&lt;/code> (Ctrl-C)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\u&lt;/code> &lt;em>nnnn&lt;/em>&lt;/td>
&lt;td>Matches a Unicode character by using hexadecimal representation (exactly four digits, as represented by &lt;em>nnnn&lt;/em>).&lt;/td>
&lt;td>&lt;code>\w\u0020\w&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;a b&amp;quot;&lt;/code>, &lt;code>&amp;quot;c d&amp;quot;&lt;/code> in &lt;code>&amp;quot;a bc d&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\&lt;/code>&lt;/td>
&lt;td>When followed by a character that is not recognized as an escaped character in this and other tables in this topic, matches that character. For example, &lt;code>\*&lt;/code> is the same as &lt;code>\x2A&lt;/code>, and &lt;code>\.&lt;/code> is the same as &lt;code>\x2E&lt;/code>. This allows the regular expression engine to disambiguate language elements (such as * or ?) and character literals (represented by &lt;code>\*&lt;/code> or &lt;code>\?&lt;/code>).&lt;/td>
&lt;td>&lt;code>\d+[\+-x\*]\d+&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;2+2&amp;quot;&lt;/code> and &lt;code>&amp;quot;3*9&amp;quot;&lt;/code> in &lt;code>&amp;quot;(2+2) * 3*9&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="character-classes">Character Classes&lt;/h2>
&lt;p>A character class matches any one of a set of characters. Character classes include the language elements listed in the following table. For more information, see &lt;a href="character-classes-in-regular-expressions" data-linktype="relative-path">Character Classes&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Character class&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>[&lt;/code> &lt;em>character_group&lt;/em> &lt;code>]&lt;/code>&lt;/td>
&lt;td>Matches any single character in &lt;em>character_group&lt;/em>. By default, the match is case-sensitive.&lt;/td>
&lt;td>&lt;code>[ae]&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;a&amp;quot;&lt;/code> in &lt;code>&amp;quot;gray&amp;quot;&lt;/code>&lt;br/>&lt;br/> &lt;code>&amp;quot;a&amp;quot;&lt;/code>, &lt;code>&amp;quot;e&amp;quot;&lt;/code> in &lt;code>&amp;quot;lane&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>[^&lt;/code> &lt;em>character_group&lt;/em> &lt;code>]&lt;/code>&lt;/td>
&lt;td>Negation: Matches any single character that is not in &lt;em>character_group&lt;/em>. By default, characters in &lt;em>character_group&lt;/em> are case-sensitive.&lt;/td>
&lt;td>&lt;code>[^aei]&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;r&amp;quot;&lt;/code>, &lt;code>&amp;quot;g&amp;quot;&lt;/code>, &lt;code>&amp;quot;n&amp;quot;&lt;/code> in &lt;code>&amp;quot;reign&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>[&lt;/code> &lt;em>first&lt;/em> &lt;code>-&lt;/code> &lt;em>last&lt;/em> &lt;code>]&lt;/code>&lt;/td>
&lt;td>Character range: Matches any single character in the range from &lt;em>first&lt;/em> to &lt;em>last&lt;/em>.&lt;/td>
&lt;td>&lt;code>[A-Z]&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;A&amp;quot;&lt;/code>, &lt;code>&amp;quot;B&amp;quot;&lt;/code> in &lt;code>&amp;quot;AB123&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>.&lt;/code>&lt;/td>
&lt;td>Wildcard: Matches any single character except \n.&lt;br/>&lt;br/> To match a literal period character (. or &lt;code>\u002E&lt;/code>), you must precede it with the escape character (&lt;code>\.&lt;/code>).&lt;/td>
&lt;td>&lt;code>a.e&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ave&amp;quot;&lt;/code> in &lt;code>&amp;quot;nave&amp;quot;&lt;/code>&lt;br/>&lt;br/> &lt;code>&amp;quot;ate&amp;quot;&lt;/code> in &lt;code>&amp;quot;water&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\p{&lt;/code> &lt;em>name&lt;/em> &lt;code>}&lt;/code>&lt;/td>
&lt;td>Matches any single character in the Unicode general category or named block specified by &lt;em>name&lt;/em>.&lt;/td>
&lt;td>&lt;code>\p{Lu}&lt;/code>&lt;br/>&lt;br/> &lt;code>\p{IsCyrillic}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;C&amp;quot;&lt;/code>, &lt;code>&amp;quot;L&amp;quot;&lt;/code> in &lt;code>&amp;quot;City Lights&amp;quot;&lt;/code>&lt;br/>&lt;br/> &lt;code>&amp;quot;Д&amp;quot;&lt;/code>, &lt;code>&amp;quot;Ж&amp;quot;&lt;/code> in &lt;code>&amp;quot;ДЖem&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\P{&lt;/code> &lt;em>name&lt;/em> &lt;code>}&lt;/code>&lt;/td>
&lt;td>Matches any single character that is not in the Unicode general category or named block specified by &lt;em>name&lt;/em>.&lt;/td>
&lt;td>&lt;code>\P{Lu}&lt;/code>&lt;br/>&lt;br/> &lt;code>\P{IsCyrillic}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;i&amp;quot;&lt;/code>, &lt;code>&amp;quot;t&amp;quot;&lt;/code>, &lt;code>&amp;quot;y&amp;quot;&lt;/code> in &lt;code>&amp;quot;City&amp;quot;&lt;/code>&lt;br/>&lt;br/> &lt;code>&amp;quot;e&amp;quot;&lt;/code>, &lt;code>&amp;quot;m&amp;quot;&lt;/code> in &lt;code>&amp;quot;ДЖem&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\w&lt;/code>&lt;/td>
&lt;td>Matches any word character.&lt;/td>
&lt;td>&lt;code>\w&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;I&amp;quot;&lt;/code>, &lt;code>&amp;quot;D&amp;quot;&lt;/code>, &lt;code>&amp;quot;A&amp;quot;&lt;/code>, &lt;code>&amp;quot;1&amp;quot;&lt;/code>, &lt;code>&amp;quot;3&amp;quot;&lt;/code> in &lt;code>&amp;quot;ID A1.3&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\W&lt;/code>&lt;/td>
&lt;td>Matches any non-word character.&lt;/td>
&lt;td>&lt;code>\W&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot; &amp;quot;&lt;/code>, &lt;code>&amp;quot;.&amp;quot;&lt;/code> in &lt;code>&amp;quot;ID A1.3&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\s&lt;/code>&lt;/td>
&lt;td>Matches any white-space character.&lt;/td>
&lt;td>&lt;code>\w\s&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;D &amp;quot;&lt;/code> in &lt;code>&amp;quot;ID A1.3&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\S&lt;/code>&lt;/td>
&lt;td>Matches any non-white-space character.&lt;/td>
&lt;td>&lt;code>\s\S&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot; _&amp;quot;&lt;/code> in &lt;code>&amp;quot;int __ctr&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\d&lt;/code>&lt;/td>
&lt;td>Matches any decimal digit.&lt;/td>
&lt;td>&lt;code>\d&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;4&amp;quot;&lt;/code> in &lt;code>&amp;quot;4 = IV&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\D&lt;/code>&lt;/td>
&lt;td>Matches any character other than a decimal digit.&lt;/td>
&lt;td>&lt;code>\D&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot; &amp;quot;&lt;/code>, &lt;code>&amp;quot;=&amp;quot;&lt;/code>, &lt;code>&amp;quot; &amp;quot;&lt;/code>, &lt;code>&amp;quot;I&amp;quot;&lt;/code>, &lt;code>&amp;quot;V&amp;quot;&lt;/code> in &lt;code>&amp;quot;4 = IV&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
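&lt;p>The portable classes (&lt;code>\w&lt;/code>, &lt;code>\d&lt;/code>, negated sets) behave the same way in Python&amp;rsquo;s &lt;code>re&lt;/code> module, though the .NET-only &lt;code>\p{...}&lt;/code> blocks do not exist there:&lt;/p>

```python
import re

print(re.findall(r"\w", "ID A1.3"))    # ['I', 'D', 'A', '1', '3']
print(re.findall(r"\d", "4 = IV"))     # ['4']
print(re.findall(r"[^aei]", "reign"))  # ['r', 'g', 'n']
```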
&lt;h2 id="anchors">Anchors&lt;/h2>
&lt;p>Anchors, or atomic zero-width assertions, cause a match to succeed or fail depending on the current position in the string, but they do not cause the engine to advance through the string or consume characters. The metacharacters listed in the following table are anchors. For more information, see &lt;a href="anchors-in-regular-expressions" data-linktype="relative-path">Anchors&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Assertion&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>^&lt;/code>&lt;/td>
&lt;td>By default, the match must start at the beginning of the string; in multiline mode, it must start at the beginning of the line.&lt;/td>
&lt;td>&lt;code>^\d{3}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;901&amp;quot;&lt;/code> in &lt;code>&amp;quot;901-333-&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>$&lt;/code>&lt;/td>
&lt;td>By default, the match must occur at the end of the string or before &lt;code>\n&lt;/code> at the end of the string; in multiline mode, it must occur before the end of the line or before &lt;code>\n&lt;/code> at the end of the line.&lt;/td>
&lt;td>&lt;code>-\d{3}$&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;-333&amp;quot;&lt;/code> in &lt;code>&amp;quot;-901-333&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\A&lt;/code>&lt;/td>
&lt;td>The match must occur at the start of the string.&lt;/td>
&lt;td>&lt;code>\A\d{3}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;901&amp;quot;&lt;/code> in &lt;code>&amp;quot;901-333-&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\Z&lt;/code>&lt;/td>
&lt;td>The match must occur at the end of the string or before &lt;code>\n&lt;/code> at the end of the string.&lt;/td>
&lt;td>&lt;code>-\d{3}\Z&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;-333&amp;quot;&lt;/code> in &lt;code>&amp;quot;-901-333&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\z&lt;/code>&lt;/td>
&lt;td>The match must occur at the end of the string.&lt;/td>
&lt;td>&lt;code>-\d{3}\z&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;-333&amp;quot;&lt;/code> in &lt;code>&amp;quot;-901-333&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\G&lt;/code>&lt;/td>
&lt;td>The match must occur at the point where the previous match ended.&lt;/td>
&lt;td>&lt;code>\G\(\d\)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;(1)&amp;quot;&lt;/code>, &lt;code>&amp;quot;(3)&amp;quot;&lt;/code>, &lt;code>&amp;quot;(5)&amp;quot;&lt;/code> in &lt;code>&amp;quot;(1)(3)(5)[7](9)&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\b&lt;/code>&lt;/td>
&lt;td>The match must occur on a boundary between a &lt;code>\w&lt;/code> (alphanumeric) and a &lt;code>\W&lt;/code> (nonalphanumeric) character.&lt;/td>
&lt;td>&lt;code>\b\w+\s\w+\b&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;them theme&amp;quot;&lt;/code>, &lt;code>&amp;quot;them them&amp;quot;&lt;/code> in &lt;code>&amp;quot;them theme them them&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\B&lt;/code>&lt;/td>
&lt;td>The match must not occur on a &lt;code>\b&lt;/code> boundary.&lt;/td>
&lt;td>&lt;code>\Bend\w*\b&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ends&amp;quot;&lt;/code>, &lt;code>&amp;quot;ender&amp;quot;&lt;/code> in &lt;code>&amp;quot;end sends endure lender&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="grouping-constructs">Grouping Constructs&lt;/h2>
&lt;p>Grouping constructs delineate subexpressions of a regular expression and typically capture substrings of an input string. Grouping constructs include the language elements listed in the following table. For more information, see &lt;a href="grouping-constructs-in-regular-expressions" data-linktype="relative-path">Grouping Constructs&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Grouping construct&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>(&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Captures the matched subexpression and assigns it a one-based ordinal number.&lt;/td>
&lt;td>&lt;code>(\w)\1&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ee&amp;quot;&lt;/code> in &lt;code>&amp;quot;deep&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?&amp;lt;&lt;/code> &lt;em>name&lt;/em> &lt;code>&amp;gt;&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;br/> or &lt;br/>&lt;code>(?'&lt;/code> &lt;em>name&lt;/em> &lt;code>'&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Captures the matched subexpression into a named group.&lt;/td>
&lt;td>&lt;code>(?&amp;lt;double&amp;gt;\w)\k&amp;lt;double&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ee&amp;quot;&lt;/code> in &lt;code>&amp;quot;deep&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?&amp;lt;&lt;/code> &lt;em>name1&lt;/em> &lt;code>-&lt;/code> &lt;em>name2&lt;/em> &lt;code>&amp;gt;&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code> &lt;br/> or &lt;br/> &lt;code>(?'&lt;/code> &lt;em>name1&lt;/em> &lt;code>-&lt;/code> &lt;em>name2&lt;/em> &lt;code>'&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Defines a balancing group definition. For more information, see the &amp;quot;Balancing Group Definition&amp;quot; section in &lt;a href="grouping-constructs-in-regular-expressions" data-linktype="relative-path">Grouping Constructs&lt;/a>.&lt;/td>
&lt;td>&lt;code>(((?'Open'\()[^\(\)]*)+((?'Close-Open'\))[^\(\)]*)+)*(?(Open)(?!))$&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;((1-3)*(3-1))&amp;quot;&lt;/code> in &lt;code>&amp;quot;3+2^((1-3)*(3-1))&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?:&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Defines a noncapturing group.&lt;/td>
&lt;td>&lt;code>Write(?:Line)?&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;WriteLine&amp;quot;&lt;/code> in &lt;code>&amp;quot;Console.WriteLine()&amp;quot;&lt;/code>&lt;br/>&lt;br/> &lt;code>&amp;quot;Write&amp;quot;&lt;/code> in &lt;code>&amp;quot;Console.Write(value)&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?imnsx-imnsx:&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Applies or disables the specified options within &lt;em>subexpression&lt;/em>. For more information, see &lt;a href="regular-expression-options" data-linktype="relative-path">Regular Expression Options&lt;/a>.&lt;/td>
&lt;td>&lt;code>A\d{2}(?i:\w+)\b&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;A12xl&amp;quot;&lt;/code>, &lt;code>&amp;quot;A12XL&amp;quot;&lt;/code> in &lt;code>&amp;quot;A12xl A12XL a12xl&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?=&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Zero-width positive lookahead assertion.&lt;/td>
&lt;td>&lt;code>\b\w+\b(?=.+and.+)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;cats&amp;quot;&lt;/code>, &lt;code>&amp;quot;dogs&amp;quot;&lt;/code>&lt;br/>in&lt;br/>&lt;code>&amp;quot;cats, dogs and some mice.&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?!&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Zero-width negative lookahead assertion.&lt;/td>
&lt;td>&lt;code>\b\w+\b(?!.+and.+)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;and&amp;quot;&lt;/code>, &lt;code>&amp;quot;some&amp;quot;&lt;/code>, &lt;code>&amp;quot;mice&amp;quot;&lt;/code>&lt;br/>in&lt;br/>&lt;code>&amp;quot;cats, dogs and some mice.&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?&amp;lt;=&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Zero-width positive lookbehind assertion.&lt;/td>
&lt;td>&lt;code>\b\w+\b(?&amp;lt;=.+and.+)&lt;/code>&lt;br/>&lt;br/>———————————&lt;br/>&lt;br/>&lt;code>\b\w+\b(?&amp;lt;=.+and.*)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;some&amp;quot;&lt;/code>, &lt;code>&amp;quot;mice&amp;quot;&lt;/code>&lt;br/>in&lt;br/>&lt;code>&amp;quot;cats, dogs and some mice.&amp;quot;&lt;/code>&lt;br/>————————————&lt;br/>&lt;code>&amp;quot;and&amp;quot;&lt;/code>, &lt;code>&amp;quot;some&amp;quot;&lt;/code>, &lt;code>&amp;quot;mice&amp;quot;&lt;/code>&lt;br/>in&lt;br/>&lt;code>&amp;quot;cats, dogs and some mice.&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?&amp;lt;!&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Zero-width negative lookbehind assertion.&lt;/td>
&lt;td>&lt;code>\b\w+\b(?&amp;lt;!.+and.+)&lt;/code>&lt;br/>&lt;br/>———————————&lt;br/>&lt;br/>&lt;code>\b\w+\b(?&amp;lt;!.+and.*)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;cats&amp;quot;&lt;/code>, &lt;code>&amp;quot;dogs&amp;quot;&lt;/code>, &lt;code>&amp;quot;and&amp;quot;&lt;/code>&lt;br/>in&lt;br/>&lt;code>&amp;quot;cats, dogs and some mice.&amp;quot;&lt;/code>&lt;br/>————————————&lt;br/>&lt;code>&amp;quot;cats&amp;quot;&lt;/code>, &lt;code>&amp;quot;dogs&amp;quot;&lt;/code>&lt;br/>in&lt;br/>&lt;code>&amp;quot;cats, dogs and some mice.&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?&amp;gt;&lt;/code> &lt;em>subexpression&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Atomic group.&lt;/td>
&lt;td>&lt;code>(?&amp;gt;a|ab)c&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ac&amp;quot;&lt;/code> in &lt;code>&amp;quot;ac&amp;quot;&lt;/code>&lt;br/>&lt;br/>&lt;em>nothing&lt;/em> in &lt;code>&amp;quot;abc&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
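&lt;p>The table above uses .NET syntax. For readers following along in another engine, Python's built-in &lt;code>re&lt;/code> module supports the same ideas but spells named groups &lt;code>(?P&amp;lt;name&amp;gt;...)&lt;/code> and named backreferences &lt;code>(?P=name)&lt;/code>, and has no balancing groups. A minimal sketch:&lt;/p>

```python
import re

# Numbered capture plus backreference: (\w)\1 finds a doubled character.
doubled = re.search(r"(\w)\1", "deep").group()

# Named capture: Python's (?P<double>\w)(?P=double) plays the role of
# .NET's (?<double>\w)\k<double>.
m = re.search(r"(?P<double>\w)(?P=double)", "deep")

# Noncapturing group: Write(?:Line)? matches with or without the suffix.
with_suffix = re.search(r"Write(?:Line)?", "Console.WriteLine()").group()
without = re.search(r"Write(?:Line)?", "Console.Write(value)").group()

assert doubled == "ee"
assert m.group("double") == "e"
assert (with_suffix, without) == ("WriteLine", "Write")
```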
&lt;h3 id="lookarounds-at-a-glance">Lookarounds at a glance&lt;/h3>
&lt;p>When the regular expression engine reaches a &lt;strong>lookaround expression&lt;/strong>, it takes the substring running from the current position to the start (lookbehind) or end (lookahead) of the original string, and then runs
&lt;a href="https://siqi-zheng.rbind.io/en-us/dotnet/api/system.text.regularexpressions.regex.ismatch" data-linktype="absolute-path">Regex.IsMatch&lt;/a> on that substring using the lookaround pattern. Whether the assertion as a whole succeeds then depends on whether it is a positive or negative lookaround.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Lookaround&lt;/th>
&lt;th>Name&lt;/th>
&lt;th>Function&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>(?=check)&lt;/code>&lt;/td>
&lt;td>Positive Lookahead&lt;/td>
&lt;td>Asserts that what immediately follows the current position in the string is &amp;quot;check&amp;quot;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?&amp;lt;=check)&lt;/code>&lt;/td>
&lt;td>Positive Lookbehind&lt;/td>
&lt;td>Asserts that what immediately precedes the current position in the string is &amp;quot;check&amp;quot;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?!check)&lt;/code>&lt;/td>
&lt;td>Negative Lookahead&lt;/td>
&lt;td>Asserts that what immediately follows the current position in the string is not &amp;quot;check&amp;quot;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?&amp;lt;!check)&lt;/code>&lt;/td>
&lt;td>Negative Lookbehind&lt;/td>
&lt;td>Asserts that what immediately precedes the current position in the string is not &amp;quot;check&amp;quot;&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
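&lt;p>Lookaheads translate directly to most engines. The two lookahead rows above can be reproduced with Python's &lt;code>re&lt;/code> (note that Python's lookbehind, unlike .NET's, only accepts fixed-width patterns, so the variable-width lookbehind examples in the table will not port as written):&lt;/p>

```python
import re

s = "cats, dogs and some mice."

# Positive lookahead: words that still have "and" somewhere after them.
followed_by_and = re.findall(r"\b\w+\b(?=.+and.+)", s)

# Negative lookahead: words with no "and" after them.
not_followed = re.findall(r"\b\w+\b(?!.+and.+)", s)

assert followed_by_and == ["cats", "dogs"]
assert not_followed == ["and", "some", "mice"]
```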
&lt;p>Once an &lt;strong>atomic group&lt;/strong> has matched, the engine will not backtrack into it to try a different match, even if the remainder of the pattern fails as a result. This can significantly improve performance when quantifiers occur within the atomic group or in the remainder of the pattern.&lt;/p>
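&lt;p>Engines without native atomic groups can emulate them with a lookahead plus a backreference, because a lookaround, once satisfied, is itself never re-entered on backtracking. A sketch in Python's &lt;code>re&lt;/code> (which only gained native &lt;code>(?&amp;gt;...)&lt;/code> in version 3.11):&lt;/p>

```python
import re

# (?=(a|ab))\1c emulates the atomic group (?>a|ab)c: the lookahead
# commits to its first successful alternative ("a") and the engine
# never backtracks into it.
atomic = r"(?=(a|ab))\1c"
ac = re.search(atomic, "ac")     # matches "ac"
abc = re.search(atomic, "abc")   # no match: "a" is committed, "bc" != "c"

# A plain group backtracks, retries "ab", and so matches "abc".
plain = re.search(r"(a|ab)c", "abc")

assert ac.group() == "ac"
assert abc is None
assert plain.group() == "abc"
```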
&lt;h2 id="quantifiers">Quantifiers&lt;/h2>
&lt;p>A quantifier specifies how many instances of the previous element (which can be a character, a group, or a character class) must be present in the input string for a match to occur. Quantifiers include the language elements listed in the following table. For more information, see &lt;a href="quantifiers-in-regular-expressions" data-linktype="relative-path">Quantifiers&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Quantifier&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>*&lt;/code>&lt;/td>
&lt;td>Matches the previous element zero or more times.&lt;/td>
&lt;td>&lt;code>\d*\.\d&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;.0&amp;quot;&lt;/code>, &lt;code>&amp;quot;19.9&amp;quot;&lt;/code>, &lt;code>&amp;quot;219.9&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>+&lt;/code>&lt;/td>
&lt;td>Matches the previous element one or more times.&lt;/td>
&lt;td>&lt;code>&amp;quot;be+&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;bee&amp;quot;&lt;/code> in &lt;code>&amp;quot;been&amp;quot;&lt;/code>, &lt;code>&amp;quot;be&amp;quot;&lt;/code> in &lt;code>&amp;quot;bent&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>?&lt;/code>&lt;/td>
&lt;td>Matches the previous element zero or one time.&lt;/td>
&lt;td>&lt;code>&amp;quot;rai?n&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ran&amp;quot;&lt;/code>, &lt;code>&amp;quot;rain&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>{&lt;/code> &lt;em>n&lt;/em> &lt;code>}&lt;/code>&lt;/td>
&lt;td>Matches the previous element exactly &lt;em>n&lt;/em> times.&lt;/td>
&lt;td>&lt;code>&amp;quot;,\d{3}&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;,043&amp;quot;&lt;/code> in &lt;code>&amp;quot;1,043.6&amp;quot;&lt;/code>, &lt;code>&amp;quot;,876&amp;quot;&lt;/code>, &lt;code>&amp;quot;,543&amp;quot;&lt;/code>, and &lt;code>&amp;quot;,210&amp;quot;&lt;/code> in &lt;code>&amp;quot;9,876,543,210&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>{&lt;/code> &lt;em>n&lt;/em> &lt;code>,}&lt;/code>&lt;/td>
&lt;td>Matches the previous element at least &lt;em>n&lt;/em> times.&lt;/td>
&lt;td>&lt;code>&amp;quot;\d{2,}&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;166&amp;quot;&lt;/code>, &lt;code>&amp;quot;29&amp;quot;&lt;/code>, &lt;code>&amp;quot;1930&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>{&lt;/code> &lt;em>n&lt;/em> &lt;code>,&lt;/code> &lt;em>m&lt;/em> &lt;code>}&lt;/code>&lt;/td>
&lt;td>Matches the previous element at least &lt;em>n&lt;/em> times, but no more than &lt;em>m&lt;/em> times.&lt;/td>
&lt;td>&lt;code>&amp;quot;\d{3,5}&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;166&amp;quot;&lt;/code>, &lt;code>&amp;quot;17668&amp;quot;&lt;/code>&lt;br/>&lt;br/> &lt;code>&amp;quot;19302&amp;quot;&lt;/code> in &lt;code>&amp;quot;193024&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>*?&lt;/code>&lt;/td>
&lt;td>Matches the previous element zero or more times, but as few times as possible.&lt;/td>
&lt;td>&lt;code>\d*?\.\d&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;.0&amp;quot;&lt;/code>, &lt;code>&amp;quot;19.9&amp;quot;&lt;/code>, &lt;code>&amp;quot;219.9&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>+?&lt;/code>&lt;/td>
&lt;td>Matches the previous element one or more times, but as few times as possible.&lt;/td>
&lt;td>&lt;code>&amp;quot;be+?&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;be&amp;quot;&lt;/code> in &lt;code>&amp;quot;been&amp;quot;&lt;/code>, &lt;code>&amp;quot;be&amp;quot;&lt;/code> in &lt;code>&amp;quot;bent&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>??&lt;/code>&lt;/td>
&lt;td>Matches the previous element zero or one time, but as few times as possible.&lt;/td>
&lt;td>&lt;code>&amp;quot;rai??n&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ran&amp;quot;&lt;/code>, &lt;code>&amp;quot;rain&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>{&lt;/code> &lt;em>n&lt;/em> &lt;code>}?&lt;/code>&lt;/td>
&lt;td>Matches the previous element exactly &lt;em>n&lt;/em> times.&lt;/td>
&lt;td>&lt;code>&amp;quot;,\d{3}?&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;,043&amp;quot;&lt;/code> in &lt;code>&amp;quot;1,043.6&amp;quot;&lt;/code>, &lt;code>&amp;quot;,876&amp;quot;&lt;/code>, &lt;code>&amp;quot;,543&amp;quot;&lt;/code>, and &lt;code>&amp;quot;,210&amp;quot;&lt;/code> in &lt;code>&amp;quot;9,876,543,210&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>{&lt;/code> &lt;em>n&lt;/em> &lt;code>,}?&lt;/code>&lt;/td>
&lt;td>Matches the previous element at least &lt;em>n&lt;/em> times, but as few times as possible.&lt;/td>
&lt;td>&lt;code>&amp;quot;\d{2,}?&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;166&amp;quot;&lt;/code>, &lt;code>&amp;quot;29&amp;quot;&lt;/code>, &lt;code>&amp;quot;1930&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>{&lt;/code> &lt;em>n&lt;/em> &lt;code>,&lt;/code> &lt;em>m&lt;/em> &lt;code>}?&lt;/code>&lt;/td>
&lt;td>Matches the previous element between &lt;em>n&lt;/em> and &lt;em>m&lt;/em> times, but as few times as possible.&lt;/td>
&lt;td>&lt;code>&amp;quot;\d{3,5}?&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;166&amp;quot;&lt;/code>, &lt;code>&amp;quot;17668&amp;quot;&lt;/code>&lt;br/>&lt;br/> &lt;code>&amp;quot;193&amp;quot;&lt;/code>, &lt;code>&amp;quot;024&amp;quot;&lt;/code> in &lt;code>&amp;quot;193024&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
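&lt;p>The greedy/lazy distinction is easiest to see on a string with repeated delimiters. A small illustration of my own (not from the table above) using Python's &lt;code>re&lt;/code>:&lt;/p>

```python
import re

s = '"x" "y"'

# Greedy .+ grabs as much as possible, spanning both quoted fields.
greedy = re.search(r'".+"', s).group()

# Lazy .+? stops at the first closing quote.
lazy = re.search(r'".+?"', s).group()

# The lazy form is what you want when extracting repeated fields.
fields = re.findall(r'".+?"', s)

assert greedy == '"x" "y"'
assert lazy == '"x"'
assert fields == ['"x"', '"y"']
```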
&lt;h2 id="backreference-constructs">Backreference Constructs&lt;/h2>
&lt;p>A backreference allows a previously matched subexpression to be identified subsequently in the same regular expression. The following table lists the backreference constructs supported by regular expressions in .NET. For more information, see &lt;a href="backreference-constructs-in-regular-expressions" data-linktype="relative-path">Backreference Constructs&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Backreference construct&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>\&lt;/code> &lt;em>number&lt;/em>&lt;/td>
&lt;td>Backreference. Matches the value of a numbered subexpression.&lt;/td>
&lt;td>&lt;code>(\w)\1&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ee&amp;quot;&lt;/code> in &lt;code>&amp;quot;seek&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>\k&amp;lt;&lt;/code> &lt;em>name&lt;/em> &lt;code>&amp;gt;&lt;/code>&lt;/td>
&lt;td>Named backreference. Matches the value of a named expression.&lt;/td>
&lt;td>&lt;code>(?&amp;lt;char&amp;gt;\w)\k&amp;lt;char&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;ee&amp;quot;&lt;/code> in &lt;code>&amp;quot;seek&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
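&lt;p>Backreferences work the same way in most engines; only the named syntax varies (Python writes .NET's &lt;code>\k&amp;lt;name&amp;gt;&lt;/code> as &lt;code>(?P=name)&lt;/code>). A quick sketch, including the doubled-word idiom built from &lt;code>\b&lt;/code> and &lt;code>\1&lt;/code>:&lt;/p>

```python
import re

# \1 matches whatever group 1 just captured.
pair = re.search(r"(\w)\1", "seek").group()

# Named form: Python's (?P=char) instead of .NET's \k<char>.
named = re.search(r"(?P<char>\w)(?P=char)", "seek").group()

# Classic use: find an immediately repeated word.
dup = re.search(r"\b(\w+)\s+\1\b", "them theme them them").group()

assert pair == "ee"
assert named == "ee"
assert dup == "them them"
```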
&lt;h2 id="alternation-constructs">Alternation Constructs&lt;/h2>
&lt;p>Alternation constructs modify a regular expression to enable either/or matching. These constructs include the language elements listed in the following table. For more information, see &lt;a href="alternation-constructs-in-regular-expressions" data-linktype="relative-path">Alternation Constructs&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Alternation construct&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>|&lt;/code>&lt;/td>
&lt;td>Matches any one element separated by the vertical bar (&lt;code>|&lt;/code>) character.&lt;/td>
&lt;td>&lt;code>th(e|is|at)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;the&amp;quot;&lt;/code>, &lt;code>&amp;quot;this&amp;quot;&lt;/code> in &lt;code>&amp;quot;this is the day.&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?(&lt;/code> &lt;em>expression&lt;/em> &lt;code>)&lt;/code> &lt;em>yes&lt;/em> &lt;code>|&lt;/code> &lt;em>no&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Matches &lt;em>yes&lt;/em> if the regular expression pattern designated by &lt;em>expression&lt;/em> matches; otherwise, matches the optional &lt;em>no&lt;/em> part. &lt;em>expression&lt;/em> is interpreted as a zero-width assertion.&lt;/td>
&lt;td>&lt;code>(?(A)A\d{2}\b|\b\d{3}\b)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;A10&amp;quot;&lt;/code>, &lt;code>&amp;quot;910&amp;quot;&lt;/code> in &lt;code>&amp;quot;A10 C103 910&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?(&lt;/code> &lt;em>name&lt;/em> &lt;code>)&lt;/code> &lt;em>yes&lt;/em> &lt;code>|&lt;/code> &lt;em>no&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Matches &lt;em>yes&lt;/em> if &lt;em>name&lt;/em>, a named or numbered capturing group, has a match; otherwise, matches the optional &lt;em>no&lt;/em>.&lt;/td>
&lt;td>&lt;code>(?&amp;lt;quoted&amp;gt;&amp;quot;)?(?(quoted).+?&amp;quot;|\S+\s)&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;Dogs.jpg &amp;quot;&lt;/code>, &lt;code>&amp;quot;\&amp;quot;Yiska playing.jpg\&amp;quot;&amp;quot;&lt;/code> in &lt;code>&amp;quot;Dogs.jpg \&amp;quot;Yiska playing.jpg\&amp;quot;&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
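&lt;p>Conditional matching with &lt;code>(?(&lt;/code>&lt;em>name&lt;/em>&lt;code>)&lt;/code>&lt;em>yes&lt;/em>&lt;code>|&lt;/code>&lt;em>no&lt;/em>&lt;code>)&lt;/code> also exists in Python's &lt;code>re&lt;/code>. A small sketch (the balanced-parentheses pattern is my own illustration, not from the table):&lt;/p>

```python
import re

# Plain alternation with a noncapturing group.
hits = re.findall(r"th(?:e|is|at)", "this is the day.")

# Conditional: the closing ")" is required only if group 1 matched "(".
pat = r"(\()?\d+(?(1)\))"
balanced = re.fullmatch(pat, "(123)")
bare = re.fullmatch(pat, "123")
broken = re.fullmatch(pat, "(123")

assert hits == ["this", "the"]
assert balanced is not None and bare is not None
assert broken is None
```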
&lt;h2 id="substitutions">Substitutions&lt;/h2>
&lt;p>Substitutions are regular expression language elements that are supported in replacement patterns. For more information, see &lt;a href="substitutions-in-regular-expressions" data-linktype="relative-path">Substitutions&lt;/a>. The metacharacters listed in the following table are atomic zero-width assertions.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Character&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Replacement pattern&lt;/th>
&lt;th>Input string&lt;/th>
&lt;th>Result string&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>$&lt;/code> &lt;em>number&lt;/em>&lt;/td>
&lt;td>Substitutes the substring matched by group &lt;em>number&lt;/em>.&lt;/td>
&lt;td>&lt;code>\b(\w+)(\s)(\w+)\b&lt;/code>&lt;/td>
&lt;td>&lt;code>$3$2$1&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;one two&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;two one&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>${&lt;/code> &lt;em>name&lt;/em> &lt;code>}&lt;/code>&lt;/td>
&lt;td>Substitutes the substring matched by the named group &lt;em>name&lt;/em>.&lt;/td>
&lt;td>&lt;code>\b(?&amp;lt;word1&amp;gt;\w+)(\s)(?&amp;lt;word2&amp;gt;\w+)\b&lt;/code>&lt;/td>
&lt;td>&lt;code>${word2} ${word1}&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;one two&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;two one&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>$$&lt;/code>&lt;/td>
&lt;td>Substitutes a literal &amp;quot;$&amp;quot;.&lt;/td>
&lt;td>&lt;code>\b(\d+)\s?USD&lt;/code>&lt;/td>
&lt;td>&lt;code>$$$1&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;103 USD&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;$103&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>$&amp;amp;&lt;/code>&lt;/td>
&lt;td>Substitutes a copy of the whole match.&lt;/td>
&lt;td>&lt;code>\$?\d*\.?\d+&lt;/code>&lt;/td>
&lt;td>&lt;code>**$&amp;amp;**&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;$1.30&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;**$1.30**&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>$`&lt;/code>&lt;/td>
&lt;td>Substitutes all the text of the input string before the match.&lt;/td>
&lt;td>&lt;code>B+&lt;/code>&lt;/td>
&lt;td>&lt;code>$`&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AABBCC&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AAAACC&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>$'&lt;/code>&lt;/td>
&lt;td>Substitutes all the text of the input string after the match.&lt;/td>
&lt;td>&lt;code>B+&lt;/code>&lt;/td>
&lt;td>&lt;code>$'&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AABBCC&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AACCCC&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>$+&lt;/code>&lt;/td>
&lt;td>Substitutes the last group that was captured.&lt;/td>
&lt;td>&lt;code>B+(C+)&lt;/code>&lt;/td>
&lt;td>&lt;code>$+&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AABBCCDD&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AACCDD&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>$_&lt;/code>&lt;/td>
&lt;td>Substitutes the entire input string.&lt;/td>
&lt;td>&lt;code>B+&lt;/code>&lt;/td>
&lt;td>&lt;code>$_&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AABBCC&amp;quot;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;AAAABBCCCC&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
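&lt;p>Replacement syntax is where engines diverge most: Python's &lt;code>re.sub&lt;/code> uses &lt;code>\1&lt;/code> and &lt;code>\g&amp;lt;name&amp;gt;&lt;/code> where .NET uses &lt;code>$1&lt;/code> and &lt;code>${name}&lt;/code>, and &lt;code>\g&amp;lt;0&amp;gt;&lt;/code> in place of &lt;code>$&amp;amp;&lt;/code>. Reproducing the first rows of the table:&lt;/p>

```python
import re

# $3$2$1 in .NET becomes \3\2\1 in Python.
swapped = re.sub(r"\b(\w+)(\s)(\w+)\b", r"\3\2\1", "one two")

# ${word2} ${word1} becomes \g<word2> \g<word1>.
named = re.sub(r"\b(?P<word1>\w+)(\s)(?P<word2>\w+)\b",
               r"\g<word2> \g<word1>", "one two")

# $& (the whole match) becomes \g<0>.
wrapped = re.sub(r"\$?\d*\.?\d+", r"**\g<0>**", "$1.30")

assert swapped == "two one"
assert named == "two one"
assert wrapped == "**$1.30**"
```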
&lt;h2 id="regular-expression-options">Regular Expression Options&lt;/h2>
&lt;p>You can specify options that control how the regular expression engine interprets a regular expression pattern. Many of these options can be specified either inline (in the regular expression pattern) or as one or more &lt;a href="https://siqi-zheng.rbind.io/en-us/dotnet/api/system.text.regularexpressions.regexoptions" data-linktype="absolute-path">RegexOptions&lt;/a> constants. This quick reference lists only inline options. For more information about inline and &lt;a href="https://siqi-zheng.rbind.io/en-us/dotnet/api/system.text.regularexpressions.regexoptions" data-linktype="absolute-path">RegexOptions&lt;/a> options, see the article &lt;a href="regular-expression-options" data-linktype="relative-path">Regular Expression Options&lt;/a>.&lt;/p>
&lt;p>You can specify an inline option in two ways:&lt;/p>
&lt;ul>
&lt;li>By using the &lt;a href="miscellaneous-constructs-in-regular-expressions" data-linktype="relative-path">miscellaneous construct&lt;/a> &lt;code>(?imnsx-imnsx)&lt;/code>, where a minus sign (-) before an option or set of options turns those options off. For example, &lt;code>(?i-mn)&lt;/code> turns case-insensitive matching (&lt;code>i&lt;/code>) on, turns multiline mode (&lt;code>m&lt;/code>) off, and turns unnamed group captures (&lt;code>n&lt;/code>) off. The option applies to the regular expression pattern from the point at which the option is defined, and is effective either to the end of the pattern or to the point where another construct reverses the option.&lt;/li>
&lt;li>By using the &lt;a href="grouping-constructs-in-regular-expressions" data-linktype="relative-path">grouping construct&lt;/a>&lt;code>(?imnsx-imnsx:&lt;/code>&lt;em>subexpression&lt;/em>&lt;code>)&lt;/code>, which defines options for the specified group only.&lt;/li>
&lt;/ul>
&lt;p>The .NET regular expression engine supports the following inline options:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Option&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Pattern&lt;/th>
&lt;th>Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>i&lt;/code>&lt;/td>
&lt;td>Use case-insensitive matching.&lt;/td>
&lt;td>&lt;code>\b(?i)a(?-i)a\w+\b&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;aardvark&amp;quot;&lt;/code>, &lt;code>&amp;quot;aaaAuto&amp;quot;&lt;/code> in &lt;code>&amp;quot;aardvark AAAuto aaaAuto Adam breakfast&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>m&lt;/code>&lt;/td>
&lt;td>Use multiline mode. &lt;code>^&lt;/code> and &lt;code>$&lt;/code> match the beginning and end of a line, instead of the beginning and end of a string.&lt;/td>
&lt;td>For an example, see the &amp;quot;Multiline Mode&amp;quot; section in &lt;a href="regular-expression-options" data-linktype="relative-path">Regular Expression Options&lt;/a>.&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>n&lt;/code>&lt;/td>
&lt;td>Do not capture unnamed groups.&lt;/td>
&lt;td>For an example, see the &amp;quot;Explicit Captures Only&amp;quot; section in &lt;a href="regular-expression-options" data-linktype="relative-path">Regular Expression Options&lt;/a>.&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>s&lt;/code>&lt;/td>
&lt;td>Use single-line mode.&lt;/td>
&lt;td>For an example, see the &amp;quot;Single-line Mode&amp;quot; section in &lt;a href="regular-expression-options" data-linktype="relative-path">Regular Expression Options&lt;/a>.&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>x&lt;/code>&lt;/td>
&lt;td>Ignore unescaped white space in the regular expression pattern.&lt;/td>
&lt;td>&lt;code>\b(?x) \d+ \s \w+&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;quot;1 aardvark&amp;quot;&lt;/code>, &lt;code>&amp;quot;2 cats&amp;quot;&lt;/code> in &lt;code>&amp;quot;1 aardvark 2 cats IV centurions&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
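&lt;p>Inline options carry over to Python's &lt;code>re&lt;/code> almost unchanged, with one caveat: Python requires global flags such as &lt;code>(?i)&lt;/code> to appear at the very start of the pattern (an error from 3.11 on), so mid-pattern toggling must use the scoped &lt;code>(?i:...)&lt;/code> form. A sketch:&lt;/p>

```python
import re

# Global flag at the start of the pattern.
ci = re.findall(r"(?i)\bcat\b", "Cat cat CAT")

# Scoped flag: only the suffix is case-insensitive.
scoped = re.search(r"A\d{2}(?i:xl)\b", "A12XL")
strict = re.search(r"A\d{2}xl\b", "A12XL")

# Verbose mode ignores unescaped whitespace in the pattern.
verbose = re.fullmatch(r"(?x) \d+ \s \w+", "1 aardvark")

assert ci == ["Cat", "cat", "CAT"]
assert scoped.group() == "A12XL" and strict is None
assert verbose is not None
```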
&lt;h2 id="miscellaneous-constructs">Miscellaneous Constructs&lt;/h2>
&lt;p>Miscellaneous constructs either modify a regular expression pattern or provide information about it. The following table lists the miscellaneous constructs supported by .NET. For more information, see &lt;a href="miscellaneous-constructs-in-regular-expressions" data-linktype="relative-path">Miscellaneous Constructs&lt;/a>.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Construct&lt;/th>
&lt;th>Definition&lt;/th>
&lt;th>Example&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>(?imnsx-imnsx)&lt;/code>&lt;/td>
&lt;td>Sets or disables options such as case insensitivity in the middle of a pattern. For more information, see &lt;a href="regular-expression-options" data-linktype="relative-path">Regular Expression Options&lt;/a>.&lt;/td>
&lt;td>&lt;code>\bA(?i)b\w+\b&lt;/code> matches &lt;code>&amp;quot;ABA&amp;quot;&lt;/code>, &lt;code>&amp;quot;Able&amp;quot;&lt;/code> in &lt;code>&amp;quot;ABA Able Act&amp;quot;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>(?#&lt;/code> &lt;em>comment&lt;/em> &lt;code>)&lt;/code>&lt;/td>
&lt;td>Inline comment. The comment ends at the first closing parenthesis.&lt;/td>
&lt;td>&lt;code>\bA(?#Matches words starting with A)\w+\b&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>#&lt;/code> [to end of line]&lt;/td>
&lt;td>X-mode comment. The comment starts at an unescaped &lt;code>#&lt;/code> and continues to the end of the line.&lt;/td>
&lt;td>&lt;code>(?x)\bA\w+\b#Matches words starting with A&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
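&lt;p>Both comment styles also work in Python's &lt;code>re&lt;/code>, the second via &lt;code>re.VERBOSE&lt;/code>:&lt;/p>

```python
import re

# (?#...) inline comment is ignored by the engine.
first = re.search(r"\bA(?#words starting with A)\w+\b", "ABA Able Act")

# In verbose mode, # starts a comment running to the end of the line.
pat = re.compile(r"""
    \bA\w+\b   # words starting with A
""", re.VERBOSE)
all_a = pat.findall("ABA Able Act")

assert first.group() == "ABA"
assert all_a == ["ABA", "Able", "Act"]
```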
&lt;h2 id="see-also">See also&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://download.microsoft.com/download/D/2/4/D240EBF6-A9BA-4E4F-A63F-AEB6DA0B921C/Regular%20expressions%20quick%20reference.docx" data-linktype="external">Regular Expressions - Quick Reference (download in Word format)&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://download.microsoft.com/download/D/2/4/D240EBF6-A9BA-4E4F-A63F-AEB6DA0B921C/Regular%20expressions%20quick%20reference.pdf" data-linktype="external">Regular Expressions - Quick Reference (download in PDF format)&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Learning SQL Notes #4: Query Primer (CH. 7)</title><link>https://siqi-zheng.rbind.io/post/2021-05-27-sql-notes-4/</link><pubDate>Thu, 27 May 2021 20:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-05-27-sql-notes-4/</guid><description>&lt;h1 id="working-with-sets">Working with Sets&lt;/h1>
&lt;ul>
&lt;li>
&lt;a href="#working-with-sets">Working with Sets&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#set-theory-in-practice">Set Theory in Practice&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#set-operators">Set Operators&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#the-union-operator">The UNION Operator&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#the-intersect-operator-not-for-mysql">The INTERSECT Operator (Not for MySQL!)&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#the-except-operator-not-for-mysql">The EXCEPT Operator (Not for MySQL!)&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#set-operation-rules">Set Operation Rules&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#sorting-compound-query-results">Sorting Compound Query Results&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#sort">Sort&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#order">Order&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="set-theory-in-practice">Set Theory in Practice&lt;/h2>
&lt;ul>
&lt;li>Both data sets must have the &lt;strong>same number of columns&lt;/strong>.&lt;/li>
&lt;li>The &lt;strong>data&lt;/strong> &lt;strong>types&lt;/strong> of each column across the two data sets must be the &lt;strong>same&lt;/strong> (or the server must be able to convert one to the other).&lt;/li>
&lt;/ul>
&lt;h2 id="set-operators">Set Operators&lt;/h2>
&lt;h3 id="the-union-operator">The UNION Operator&lt;/h3>
&lt;p>The &lt;code>union&lt;/code> and &lt;code>union all&lt;/code> operators allow you to combine multiple data sets. The difference between the two is that &lt;code>union&lt;/code> sorts the combined set and &lt;em>removes duplicates&lt;/em>, whereas &lt;code>union all&lt;/code> does not.&lt;/p>
&lt;p>&lt;img src="union_all.png" alt="">
&lt;a href="https://www.sqlshack.com/sql-union-vs-union-all-in-sql-server/">https://www.sqlshack.com/sql-union-vs-union-all-in-sql-server/&lt;/a>&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
UNION ALL
SELECT a.first_name, a.last_name
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%';
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">first_name&lt;/th>
&lt;th align="right">last_name&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">JENNIFER&lt;/td>
&lt;td align="right">DAVIS&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JENNIFER&lt;/td>
&lt;td align="right">DAVIS&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JUDY&lt;/td>
&lt;td align="right">DEAN&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JODIE&lt;/td>
&lt;td align="right">DEGENERES&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JULIANNE&lt;/td>
&lt;td align="right">DENCH&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Equivalent &lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">library(dplyr)
union_all(df1,df2)
&lt;/code>&lt;/pre>
&lt;p>By contrast, &lt;code>UNION&lt;/code> removes the duplicate Jennifer Davis row:&lt;/p>
&lt;p>&lt;img src="uinon.png" alt="">
&lt;a href="https://www.sqlshack.com/sql-union-vs-union-all-in-sql-server/">https://www.sqlshack.com/sql-union-vs-union-all-in-sql-server/&lt;/a>&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
UNION
SELECT a.first_name, a.last_name
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%';
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">first_name&lt;/th>
&lt;th align="right">last_name&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">JENNIFER&lt;/td>
&lt;td align="right">DAVIS&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JUDY&lt;/td>
&lt;td align="right">DEAN&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JODIE&lt;/td>
&lt;td align="right">DEGENERES&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JULIANNE&lt;/td>
&lt;td align="right">DENCH&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Equivalent &lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">library(dplyr)
union(df1,df2)
&lt;/code>&lt;/pre>
&lt;h3 id="the-intersect-operator-not-for-mysql">The INTERSECT Operator (Not for MySQL!)&lt;/h3>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
INTERSECT
SELECT a.first_name, a.last_name
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%';
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">first_name&lt;/th>
&lt;th align="right">last_name&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">JENNIFER&lt;/td>
&lt;td align="right">DAVIS&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Equivalent &lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">library(dplyr)
intersect(df1,df2)
&lt;/code>&lt;/pre>
&lt;h3 id="the-except-operator-not-for-mysql">The EXCEPT Operator (Not for MySQL!)&lt;/h3>
&lt;pre>&lt;code class="language-sql">SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
EXCEPT
SELECT a.first_name, a.last_name
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%';
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">first_name&lt;/th>
&lt;th align="right">last_name&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">JUDY&lt;/td>
&lt;td align="right">DEAN&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JODIE&lt;/td>
&lt;td align="right">DEGENERES&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">JULIANNE&lt;/td>
&lt;td align="right">DENCH&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Equivalent &lt;strong>R&lt;/strong> code:&lt;/p>
&lt;pre>&lt;code class="language-r">library(dplyr)
setdiff(df1,df2)
&lt;/code>&lt;/pre>
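&lt;p>Since MySQL historically lacks &lt;code>INTERSECT&lt;/code> and &lt;code>EXCEPT&lt;/code> (they were only added in MySQL 8.0.31), a convenient way to experiment with the queries above is SQLite through Python's built-in &lt;code>sqlite3&lt;/code> module, which supports both. A toy version of the customer/actor example (made-up rows, not the real Sakila data):&lt;/p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE customer (first_name TEXT, last_name TEXT);
    CREATE TABLE actor    (first_name TEXT, last_name TEXT);
    INSERT INTO customer VALUES ('JENNIFER','DAVIS'), ('JUDY','DEAN');
    INSERT INTO actor    VALUES ('JENNIFER','DAVIS'), ('JODIE','DEGENERES');
""")

# Rows present in both tables.
both = cur.execute(
    "SELECT * FROM customer INTERSECT SELECT * FROM actor").fetchall()

# Rows in customer but not in actor.
only_cust = cur.execute(
    "SELECT * FROM customer EXCEPT SELECT * FROM actor").fetchall()

assert both == [('JENNIFER', 'DAVIS')]
assert only_cust == [('JUDY', 'DEAN')]
```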
&lt;p>&lt;em>Set A&lt;/em>&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">actor_id&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">10&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">11&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">12&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">10&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">10&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;em>Set B&lt;/em>&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">actor_id&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">10&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">10&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The operation &lt;code>A except B&lt;/code> yields the following:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">actor_id&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">11&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">12&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The operation &lt;code>A except all B&lt;/code> yields the following:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">actor_id&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">10&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">11&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="center">12&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The difference between the two operations is that &lt;code>except&lt;/code> removes &lt;em>all&lt;/em> occurrences of duplicate data from set A, whereas &lt;code>except all&lt;/code> removes only one occurrence of duplicate data from set A &lt;em>for every occurrence&lt;/em> in set B.&lt;/p>
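&lt;p>In databases that support both operators (e.g., PostgreSQL; MySQL supports neither), the two variants could be written as follows. The table names &lt;code>set_a&lt;/code> and &lt;code>set_b&lt;/code> are hypothetical stand-ins for sets A and B above:&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT actor_id FROM set_a
EXCEPT              /* removes every 10: returns 11, 12 */
SELECT actor_id FROM set_b;

SELECT actor_id FROM set_a
EXCEPT ALL          /* removes one 10 per 10 in B: returns 10, 11, 12 */
SELECT actor_id FROM set_b;
&lt;/code>&lt;/pre>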
&lt;h2 id="set-operation-rules">Set Operation Rules&lt;/h2>
&lt;p>The following sections outline some rules that you must follow when working with compound queries.&lt;/p>
&lt;h3 id="sorting-compound-query-results">Sorting Compound Query Results&lt;/h3>
&lt;h4 id="sort">Sort&lt;/h4>
&lt;pre>&lt;code class="language-sql">SELECT a.first_name fname, a.last_name lname /*aliases can be helpful*/
FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%' UNION ALL
SELECT c.first_name, c.last_name
FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%' ORDER BY lname, fname;
&lt;/code>&lt;/pre>
&lt;h4 id="order">Order&lt;/h4>
&lt;p>In general, compound queries containing three or more queries are evaluated in order from top to bottom, with two exceptions:&lt;/p>
&lt;ul>
&lt;li>The ANSI SQL specification calls for the intersect operator to have precedence over the other set operators.&lt;/li>
&lt;li>You may dictate the order in which queries are combined by enclosing multiple queries in parentheses.&lt;/li>
&lt;/ul>
&lt;p>NOT FOR MySQL:&lt;/p>
&lt;p>You can also wrap adjoining queries in parentheses to override the default top-to-bottom processing of compound queries.&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT a.first_name, a.last_name FROM actor a
WHERE a.first_name LIKE 'J%' AND a.last_name LIKE 'D%' UNION (SELECT a.first_name, a.last_name FROM actor a
WHERE a.first_name LIKE 'M%' AND a.last_name LIKE 'T%' UNION ALL
SELECT c.first_name, c.last_name FROM customer c
WHERE c.first_name LIKE 'J%' AND c.last_name LIKE 'D%'
)
&lt;/code>&lt;/pre></description></item><item><title>Learning SQL Notes #3: Query Primer (CH. 3)</title><link>https://siqi-zheng.rbind.io/post/2021-05-26-sql-notes-3/</link><pubDate>Wed, 26 May 2021 20:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-05-26-sql-notes-3/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#query-mechanics">Query Mechanics&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#query-clauses">Query Clauses&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#select">SELECT&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#from">FROM&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#table-links">Table Links&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#table-aliases">Table Aliases&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#group-by-and-having-ch-8">GROUP BY and HAVING (CH. 8)&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#order-by">ORDER BY&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#filtering">Filtering&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#where">WHERE&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#or-operator">OR operator&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#and-operator">AND operator&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#not-operator">NOT operator&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#expressions">Expressions&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#null">NULL&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Complete sometime this summer:&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Finish Join Notes;&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Finish GROUP BY Notes;&lt;/li>
&lt;/ul>
&lt;h2 id="query-mechanics">Query Mechanics&lt;/h2>
&lt;ul>
&lt;li>Do you have permission to execute the statement?&lt;/li>
&lt;li>Do you have permission to access the desired data?&lt;/li>
&lt;li>Is your statement syntax correct?&lt;/li>
&lt;/ul>
&lt;h2 id="query-clauses">Query Clauses&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Clause name&lt;/th>
&lt;th align="right">Purpose&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">select&lt;/td>
&lt;td align="right">Determines which columns to include in the query’s result set&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">from&lt;/td>
&lt;td align="right">Identifies the tables from which to retrieve data and how the tables should be joined&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">where&lt;/td>
&lt;td align="right">Filters out unwanted data&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">group by&lt;/td>
&lt;td align="right">Used to group rows together by common column values&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">having&lt;/td>
&lt;td align="right">Filters out unwanted groups&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">order by&lt;/td>
&lt;td align="right">the rows of the final result set by one or more columns&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="select">SELECT&lt;/h3>
&lt;ul>
&lt;li>Literals, such as numbers or strings&lt;/li>
&lt;li>Expressions, such as transaction.amount * −1&lt;/li>
&lt;li>Built-in function calls, such as ROUND(transaction.amount, 2)&lt;/li>
&lt;li>User-defined function calls&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-SQL">SELECT version(), user(), database();
&lt;/code>&lt;/pre>
&lt;p>Results:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">version()&lt;/th>
&lt;th align="center">user()&lt;/th>
&lt;th align="right">database()&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">8.0.15&lt;/td>
&lt;td align="center">root@localhost&lt;/td>
&lt;td align="right">sakila&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-SQL">SELECT row1 AS r1;/*Column Aliases*/
SELECT DISTINCT row1 /*Removing Duplicates-should know beforehand whether duplicates are possible*/
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">unique()
&lt;/code>&lt;/pre>
&lt;h3 id="from">FROM&lt;/h3>
&lt;ul>
&lt;li>Permanent tables (i.e., created using the create table statement)&lt;/li>
&lt;li>Derived tables (i.e., rows returned by a subquery and held in memory)
&lt;pre>&lt;code class="language-sql">SELECT *
FROM
(SELECT first_name, last_name, email
FROM customer
WHERE first_name = 'JESSIE'
) AS cust;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>Temporary tables (i.e., volatile data held in memory): any data inserted into a temporary table will disappear when your database session ends
&lt;pre>&lt;code class="language-sql">CREATE TEMPORARY TABLE actors_j
(actor_id smallint(5),
first_name varchar(45),
last_name varchar(45)
);
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>Virtual tables (i.e., created using the create view statement): When you issue a query against a view, your query is &lt;strong>merged&lt;/strong> with the view definition to create a final query to be executed.
&lt;pre>&lt;code class="language-SQL">CREATE VIEW cust_vw AS
SELECT customer_id, first_name, last_name, active
FROM customer;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h4 id="table-links">Table Links&lt;/h4>
&lt;p>See JOIN in the next note.&lt;/p>
&lt;h4 id="table-aliases">Table Aliases&lt;/h4>
&lt;pre>&lt;code class="language-SQL">FROM customer AS c;
&lt;/code>&lt;/pre>
&lt;h3 id="group-by-and-having-ch-8">GROUP BY and HAVING (CH. 8)&lt;/h3>
&lt;p>&lt;input disabled="" type="checkbox"> Haven&amp;rsquo;t done this yet; see CH. 8.&lt;/p>
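&lt;p>Until then, a minimal sketch of the pattern (assuming sakila&amp;rsquo;s &lt;code>rental&lt;/code> table):&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT customer_id, count(*)
FROM rental
GROUP BY customer_id           /*one group per customer*/
HAVING count(*) &amp;gt;= 40;      /*keep only frequent renters*/
&lt;/code>&lt;/pre>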
&lt;h3 id="order-by">ORDER BY&lt;/h3>
&lt;ol>
&lt;li>
&lt;pre>&lt;code class="language-sql">ORDER BY col1, col2, etc;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">df[order(col1),]
require(tidyverse)
df %&amp;gt;%
arrange(col1)
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;pre>&lt;code class="language-sql">ORDER BY col1;
ORDER BY col1 desc;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">df[order(-col1),]
require(tidyverse)
df %&amp;gt;%
arrange(desc(col1))
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;pre>&lt;code class="language-sql">SELECT col1, col2, col3;
FROM table1
ORDER BY 3; /*equivalent to ORDER BY col3*/
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
&lt;h2 id="filtering">Filtering&lt;/h2>
&lt;h3 id="where">WHERE&lt;/h3>
&lt;pre>&lt;code class="language-SQL">(...) AND (...)
(...) OR (...)
&lt;/code>&lt;/pre>
&lt;p>See &lt;strong>operators&lt;/strong> and &lt;strong>expressions&lt;/strong> for details.&lt;/p>
&lt;h4 id="or-operator">OR operator&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Intermediate result&lt;/th>
&lt;th align="right">Final result&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">WHERE true OR true&lt;/td>
&lt;td align="right">true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE true OR false&lt;/td>
&lt;td align="right">true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE false OR true&lt;/td>
&lt;td align="right">true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE false OR false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="and-operator">AND operator&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Intermediate result&lt;/th>
&lt;th align="right">Final result&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">WHERE (true OR true) AND true&lt;/td>
&lt;td align="right">true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE (true OR false) AND true&lt;/td>
&lt;td align="right">true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE (false OR true) AND true&lt;/td>
&lt;td align="right">true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE (false OR false) AND true&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE (true OR true) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE (true OR false) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE (false OR true) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE (false OR false) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="not-operator">NOT operator&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Intermediate result&lt;/th>
&lt;th align="right">Final result&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">WHERE NOT (true OR true) AND true&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE NOT (true OR false) AND true&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE NOT (false OR true) AND true&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE NOT (false OR false) AND true&lt;/td>
&lt;td align="right">true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE NOT (true OR true) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE NOT (true OR false) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE NOT (false OR true) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">WHERE NOT (false OR false) AND false&lt;/td>
&lt;td align="right">false&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="expressions">Expressions&lt;/h4>
&lt;p>An expression can be any of the following:&lt;/p>
&lt;ul>
&lt;li>A number&lt;/li>
&lt;li>A column in a table or view&lt;/li>
&lt;li>A string literal, such as &amp;lsquo;Maple Street&amp;rsquo;&lt;/li>
&lt;li>A built-in function, such as concat(&amp;lsquo;Learning&amp;rsquo;, &amp;lsquo; &amp;rsquo;, &amp;lsquo;SQL&amp;rsquo;)&lt;/li>
&lt;li>A subquery&lt;/li>
&lt;li>A list of expressions, such as (&amp;lsquo;Boston&amp;rsquo;, &amp;lsquo;New York&amp;rsquo;, &amp;lsquo;Chicago&amp;rsquo;)&lt;/li>
&lt;/ul>
&lt;p>Operators:&lt;/p>
&lt;ul>
&lt;li>Comparison operators, such as =, !=, &amp;lt;, &amp;lt;=, &amp;gt;, &amp;gt;=, &amp;lt;&amp;gt;, like, in, between, is null, exists&lt;/li>
&lt;li>Arithmetic operators, such as +, −, *, /, DIV (integer division) and (% or MOD) for modulus&lt;/li>
&lt;/ul>
&lt;p>Note:&lt;/p>
&lt;ol>
&lt;li>= can be used for date/string/number;&lt;/li>
&lt;li>&amp;lsquo;between and&amp;rsquo; can be used for date/string/number;&lt;/li>
&lt;li>&amp;lsquo;between and&amp;rsquo; is inclusive;&lt;/li>
&lt;li>col1 (not) in (&amp;lsquo;A&amp;rsquo;,&amp;lsquo;B&amp;rsquo;)/subqueries;&lt;/li>
&lt;li>built-in function: left(name, 1) in (&amp;lsquo;A&amp;rsquo;,&amp;lsquo;B&amp;rsquo;);&lt;/li>
&lt;li>wildcards/regular expressions:
&lt;ul>
&lt;li>Strings beginning/ending with a certain &lt;strong>character&lt;/strong>&lt;/li>
&lt;li>Strings beginning/ending with a &lt;strong>substring&lt;/strong>&lt;/li>
&lt;li>Strings containing a certain &lt;strong>character&lt;/strong> &lt;strong>anywhere&lt;/strong> within the string&lt;/li>
&lt;li>Strings containing a &lt;strong>substring anywhere&lt;/strong> within the string&lt;/li>
&lt;li>Strings with a &lt;strong>specific format&lt;/strong>, regardless of individual characters&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Wildcard character&lt;/th>
&lt;th align="right">Matches&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">_&lt;/td>
&lt;td align="right">Exactly one character&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">%&lt;/td>
&lt;td align="right">Any number of characters (including 0)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
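&lt;p>For example, a sketch combining both wildcards (second character must be &lt;code>A&lt;/code>, with a &lt;code>W&lt;/code> somewhere after it):&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT last_name
FROM customer
WHERE last_name LIKE '_A%W%';
&lt;/code>&lt;/pre>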
&lt;h4 id="null">NULL&lt;/h4>
&lt;p>Null is used for various cases where a value cannot be supplied, such as:&lt;/p>
&lt;ul>
&lt;li>Not applicable
Such as the employee ID column for a transaction that took place at an ATM&lt;/li>
&lt;li>Value not yet known
Such as when the federal ID is not known at the time a customer row is created&lt;/li>
&lt;li>Value undefined
Such as when an account is created for a product that has not yet been added to the database&lt;/li>
&lt;/ul>
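&lt;p>For example, a sketch of retrieving such rows (assuming sakila&amp;rsquo;s &lt;code>rental&lt;/code> table, whose &lt;code>return_date&lt;/code> stays null until the rental is returned):&lt;/p>
&lt;pre>&lt;code class="language-sql">SELECT rental_id
FROM rental
WHERE return_date IS NULL; /* correct */
/* WHERE return_date = NULL would match no rows */
&lt;/code>&lt;/pre>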
&lt;p>Note:&lt;/p>
&lt;ul>
&lt;li>An expression can be null, but it can &lt;strong>never equal&lt;/strong> null. IS NULL/IS NOT NULL.&lt;/li>
&lt;li>Two nulls are &lt;strong>never equal to each other&lt;/strong>.&lt;/li>
&lt;/ul></description></item><item><title>Learning SQL Notes #2: Data Types</title><link>https://siqi-zheng.rbind.io/post/2021-05-26-sql-notes-2/</link><pubDate>Wed, 26 May 2021 01:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-05-26-sql-notes-2/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#character-data">Character Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#numeric-data">Numeric Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#temporal-data">Temporal Data&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#bouns-find-current-time">BOUNS: Find Current Time&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="character-data">Character Data&lt;/h3>
&lt;pre>&lt;code class="language-SQL">char(20) /* fixed-length */
varchar(20) /* variable-length */
&lt;/code>&lt;/pre>
&lt;p>No easy way to constrain the length of character in &lt;strong>R&lt;/strong>, but one can try &lt;code>stringr::str_trunc()&lt;/code>.&lt;/p>
&lt;p>Note:&lt;/p>
&lt;ol>
&lt;li>If the data being loaded into a text column exceeds the maximum size for that type, the data will be truncated;&lt;/li>
&lt;li>Trailing spaces &lt;strong>will not&lt;/strong> be removed when data is loaded into the column;&lt;/li>
&lt;li>When using text columns for sorting or grouping, only the first 1,024 bytes are used, although this limit may be increased if necessary.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-SQL">CREATE DATABASE european_sales CHARACTER SET latin1;
&lt;/code>&lt;/pre>
&lt;h3 id="numeric-data">Numeric Data&lt;/h3>
&lt;ol>
&lt;li>Boolean: 0 False, 1 True.&lt;/li>
&lt;li>System-generated primary keys: 1 to $\infty$, integers;
&lt;pre>&lt;code class="language-SQL">mediumint −8,388,608 to 8,388,607
mediumint unsigned 0 to 16,777,215
int −2,147,483,648 to 2,147,483,647
int unsigned 0 to 4,294,967,295
bigint −2^63 to 2^63 - 1
bigint unsigned 0 to 2^64 - 1
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>Item number: positive integers in a range;
&lt;pre>&lt;code class="language-SQL">tinyint −128 to 127
tinyint unsigned 0 to 255
smallint −32,768 to 32,767
smallint unsigned 0 to 65,535
&lt;/code>&lt;/pre>
&lt;p>Unsigned types accept only non-negative values;&lt;/p>
&lt;/li>
&lt;li>High-precision scientific or manufacturing data;
&lt;pre>&lt;code class="language-SQL">float( p , s ) −3.402823466E+38 to −1.175494351E-38 and 1.175494351E-38 to 3.402823466E+38
double( p , s ) −1.7976931348623157E+308 to −2.2250738585072014E-308
and 2.2250738585072014E-308 to 1.7976931348623157E+308
&lt;/code>&lt;/pre>
&lt;p>p and s are optional parameters: the precision (the total number of allowable digits to the left and right of the decimal point combined) and the scale (the number of allowable digits to the right of the decimal point). The number of digits allowed to the left of the decimal point is therefore p − s.&lt;/p>
&lt;/li>
&lt;/ol>
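&lt;p>A sketch of how precision and scale interact (the table and column names are hypothetical):&lt;/p>
&lt;pre>&lt;code class="language-SQL">CREATE TEMPORARY TABLE readings (avg_temp FLOAT(4,1));
INSERT INTO readings VALUES (17.8675);
/* stored as 17.9: at most 4 digits in total, 1 to the right of the decimal point */
&lt;/code>&lt;/pre>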
&lt;h3 id="temporal-data">Temporal Data&lt;/h3>
&lt;ul>
&lt;li>The &lt;strong>future date&lt;/strong> that a particular event is expected to happen, such as shipping a customer’s order
&lt;pre>&lt;code class="language-SQL">date YYYY-MM-DD 1000-01-01 to 9999-12-31
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>The date that a customer’s order &lt;strong>was shipped&lt;/strong>
&lt;pre>&lt;code class="language-SQL">datetime YYYY-MM-DD HH:MI:SS 1000-01-01 00:00:00.000000 to 9999-12-31 23:59:59.999999
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>The &lt;strong>date and time&lt;/strong> that a user &lt;strong>modified&lt;/strong> a particular row in a table
&lt;pre>&lt;code class="language-SQL">timestamp YYYY-MM-DD HH:MI:SS 1970-01-01 00:00:00.000000 to 2038-01-18 22:14:07.999999
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>An employee’s &lt;strong>birth date&lt;/strong>
&lt;pre>&lt;code class="language-SQL">date YYYY-MM-DD 1000-01-01 to 9999-12-31
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>The &lt;strong>year&lt;/strong> corresponding to a row in a yearly_sales fact table in a data warehouse
&lt;pre>&lt;code class="language-SQL">year YYYY 1901-2155
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>The &lt;strong>elapsed time&lt;/strong> needed to complete a wiring harness on an automobile assembly line
&lt;pre>&lt;code class="language-SQL">time HHH:MI:SS −838:59:59.000000 to 838:59:59.000000
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
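&lt;p>To supply a temporal value explicitly, a properly formatted string can be cast, for example:&lt;/p>
&lt;pre>&lt;code class="language-SQL">SELECT CAST('2019-09-17 15:30:00' AS DATETIME);
&lt;/code>&lt;/pre>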
&lt;h3 id="bouns-find-current-time">BOUNS: Find Current Time&lt;/h3>
&lt;p>To find the current data/time:&lt;/p>
&lt;pre>&lt;code class="language-SQL">SELECT now();
/*2019-04-04 20:44:26 Timezone not included*/
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">sys.time()
# &amp;quot;2021-05-25 10:58:06 EDT&amp;quot;, Timezone included
&lt;/code>&lt;/pre>
&lt;p>In Oracle, add &lt;code>FROM dual;&lt;/code> (think of &lt;code>dual&lt;/code> as a &lt;em>dummy table&lt;/em>!)&lt;/p></description></item><item><title>Learning SQL Notes #1</title><link>https://siqi-zheng.rbind.io/post/2021-05-26-sql-notes-1/</link><pubDate>Tue, 25 May 2021 18:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-05-26-sql-notes-1/</guid><description>&lt;ul>
&lt;li>
&lt;a href="#introduction-to-databases">Introduction to Databases&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#more-about-relational-databases">More about Relational Databases&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#find-databases">Find Databases&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#find-a-table">Find a Table&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#create-a-table">Create a Table&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#add-a-row">Add a Row&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#change-a-cell">Change a Cell&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#delete-a-row">Delete a Row&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#table-overview">Table Overview&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#show-tables">Show Tables&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#drop-a-table">Drop a Table&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#export-to-xml">Export to XML&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;a href="#table-creation-ch-2">Table Creation (CH. 2)&lt;/a>
&lt;ul>
&lt;li>
&lt;a href="#1---design">1 Design&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#2---refinement">2 Refinement&lt;/a>&lt;/li>
&lt;li>
&lt;a href="#3---building-sql-schema-statements">3 Building SQL Schema Statements&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="introduction-to-databases">Introduction to Databases&lt;/h2>
&lt;ul>
&lt;li>SQL was initially created to be the language for generating, manipulating, and retrieving data from relational databases.&lt;/li>
&lt;li>A database is a set of related information.&lt;/li>
&lt;li>&lt;em>Database systems&lt;/em> are computerized data storage and retrieval mechanisms.&lt;/li>
&lt;li>&lt;em>Nonrelational Database Systems&lt;/em>:
&lt;ul>
&lt;li>In a &lt;em>hierarchical&lt;/em> database system, for example, data is represented as one or more tree structures. The hierarchical database system provides tools for locating a particular customer’s tree and then traversing the tree to find the desired accounts and/or transactions. Each node in the tree may have either zero or one parent and zero, one, or many children.&lt;/li>
&lt;li>&lt;em>Network database system&lt;/em> exposes sets of records and sets of links that define relationships between different records.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Data can be represented as sets of &lt;em>tables&lt;/em>. Rather than using pointers to navigate between related entities, redundant data is used to link records in different tables: &lt;em>relational model&lt;/em>.&lt;/li>
&lt;/ul>
&lt;h3 id="more-about-relational-databases">More about Relational Databases&lt;/h3>
&lt;ol>
&lt;li>The number of columns/rows is constrained by &lt;em>physical limits&lt;/em> and by &lt;em>maintainability&lt;/em>;&lt;/li>
&lt;li>&lt;em>Primary key&lt;/em> includes information that &lt;strong>uniquely identifies&lt;/strong> a row in that table;
&lt;ol>
&lt;li>If more than one column, then &lt;em>compound key&lt;/em>;&lt;/li>
&lt;li>If you select a naturally occurring attribute, say, first name, then it is a &lt;em>natural key&lt;/em>;&lt;/li>
&lt;li>If you select an artificial id, then it is a &lt;em>surrogate key&lt;/em>;&lt;/li>
&lt;li>&lt;strong>NEVER be allowed to change!&lt;/strong>&lt;/li>
&lt;li>Possible error:
&lt;pre>&lt;code>ERROR 1062 (23000): Duplicate entry '1' for key 'PRIMARY'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>A table may contain more than one identifier besides the &lt;em>primary key&lt;/em>: &lt;em>foreign keys&lt;/em> connect the entities in different tables;&lt;/li>
&lt;li>Make sure that there is only &lt;strong>one place&lt;/strong> in the database that holds, say, the customer’s name; otherwise, the data might be changed in one place but not another, causing the data in the database to be unreliable. The process of refining a database design to ensure that each independent piece of information is in only &lt;strong>one place&lt;/strong> (except for foreign keys) is known as &lt;em>normalization&lt;/em>. (Think about the concept of &lt;em>Tidy Data&lt;/em> in &lt;strong>R&lt;/strong>!)&lt;/li>
&lt;li>Two-column primary key is also possible depending on the context (CH.2);&lt;/li>
&lt;li>Foreign key constraint limits the id to those exist in another table (CH.2); Possible error:
&lt;pre>&lt;code>ERROR 1452 (23000): Cannot add or update a child row: a foreign key constraint fails ('sakila'.'favorite_food', CONSTRAINT 'fk_fav_food_person_id' FOREIGN KEY
('person_id') REFERENCES 'person' ('person_id'))
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>Ways to generate primary keys:
&lt;ul>
&lt;li>Look at the largest value currently in the table and add one.&lt;/li>
&lt;li>Let the database server provide the value for you.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-SQL">ALTER TABLE table_name MODIFY col_0 SMALLINT UNSIGNED AUTO_INCREMENT;
set foreign_key_checks=0; /*IMPORTANT*/
ALTER TABLE person
MODIFY person_id SMALLINT UNSIGNED AUTO_INCREMENT;
set foreign_key_checks=1; /*IMPORTANT*/
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
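&lt;p>A sketch of declaring such constraints (the &lt;code>favorite_food&lt;/code> and &lt;code>person&lt;/code> tables are the ones from the error message above):&lt;/p>
&lt;pre>&lt;code class="language-SQL">CREATE TABLE favorite_food
(person_id SMALLINT UNSIGNED,
food VARCHAR(20),
CONSTRAINT pk_favorite_food PRIMARY KEY (person_id, food), /*two-column primary key*/
CONSTRAINT fk_fav_food_person_id FOREIGN KEY (person_id)
REFERENCES person (person_id) /*limits person_id to ids that exist in person*/
);
&lt;/code>&lt;/pre>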
&lt;h3 id="find-databases">Find Databases&lt;/h3>
&lt;p>To see the &lt;code>mysql&amp;gt;&lt;/code> prompt:&lt;/p>
&lt;pre>&lt;code>mysql -u root -p
&lt;/code>&lt;/pre>
&lt;p>Then type &lt;code>show databases;&lt;/code> to display all databases;&lt;/p>
&lt;h3 id="find-a-table">Find a Table&lt;/h3>
&lt;p>To select a database, type &lt;code>use database_name;&lt;/code>&lt;/p>
&lt;p>Alternatively, name the database when connecting:&lt;/p>
&lt;pre>&lt;code>mysql -u root -p database_name
&lt;/code>&lt;/pre>
&lt;p>In &lt;strong>R&lt;/strong>, one can find data frames under the global environment.&lt;/p>
&lt;h3 id="create-a-table">Create a Table&lt;/h3>
&lt;pre>&lt;code class="language-SQL">CREATE TABLE table_name /*Create a table with name: ……*/
(col_0 smallint,
col_1 VARCHAR(30),
col_2 timestamp,
CONSTRAINT pk_col_0 PRIMARY KEY (col_0) /*set col_0 as primary key*/
); /*The most basic method to create a database*/
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-R">df &amp;lt;- data.frame()
# x1 = c(7, 3, 2, 9, 0),
# x2 = c(4, 4, 1, 1, 8),
# x0 = c(5, 3, 9, 2, 4)
# Primary key can only be added manually
&lt;/code>&lt;/pre>
&lt;h3 id="add-a-row">Add a Row&lt;/h3>
&lt;pre>&lt;code class="language-SQL">INSERT INTO table_name (col_0, col_1, col_2) /*The table*/
VALUES (27, 'Rdm Name', 'Acme Paper Corporation'); /*The values*/
/*The most basic method to insert a full row into a database*/
&lt;/code>&lt;/pre>
&lt;p>&lt;code>Query OK, 1 row affected&lt;/code> $\Rightarrow$ one row was added to the table&lt;/p>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">new_row &amp;lt;- c(27, 'Rdm Name', 'Acme Paper Corporation')
df &amp;lt;- rbind(df, new_row)
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>You are not required to provide data for every column in the table unless the column cannot be NULL;&lt;/li>
&lt;li>MySQL will convert the &lt;strong>string&lt;/strong> to a &lt;strong>date&lt;/strong> for you as long as the &lt;strong>format is followed&lt;/strong>; otherwise you will see an error like:
&lt;pre>&lt;code>ERROR 1292 (22007): Incorrect date value: 'DEC-21-1980' for column 'birth_date' at row 1
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
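&lt;p>For example (assuming a &lt;code>person&lt;/code> table with a &lt;code>birth_date&lt;/code> column, as in the error above):&lt;/p>
&lt;pre>&lt;code class="language-SQL">INSERT INTO person (birth_date)
VALUES ('1980-12-21'); /*the string 'YYYY-MM-DD' is converted to a date*/
&lt;/code>&lt;/pre>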
&lt;h3 id="change-a-cell">Change a Cell&lt;/h3>
&lt;pre>&lt;code class="language-SQL">UPDATE table_name
/*Fix column*/ /*Insert the values*/
SET name = 'Certificate of Deposit'
WHERE col_2 = 'CD'; /*Fix row, otherwise all will be replaced*/
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">df[df$col_2=='CD', &amp;quot;name&amp;quot;] &amp;lt;- 'Certificate of Deposit'
# Fix column, fix row
&lt;/code>&lt;/pre>
&lt;h3 id="delete-a-row">Delete a Row&lt;/h3>
&lt;pre>&lt;code class="language-SQL">DELETE ...
/*Fix column*/
FROM table_name
WHERE col_2 = 'CD'; /*Fix row, otherwise all will be deleted*/
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">df[df$col_2=='CD', ] &amp;lt;- NULL
&lt;/code>&lt;/pre>
&lt;h3 id="table-overview">Table Overview&lt;/h3>
&lt;pre>&lt;code class="language-SQL">DESC favorite_food;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">str(df)
summary(df)
glimpse(df)
&lt;/code>&lt;/pre>
&lt;p>Describe the table.&lt;/p>
&lt;h3 id="show-tables">Show Tables&lt;/h3>
&lt;pre>&lt;code class="language-SQL">show tables
&lt;/code>&lt;/pre>
&lt;h3 id="drop-a-table">Drop a Table&lt;/h3>
&lt;pre>&lt;code class="language-SQL">drop table xxx
&lt;/code>&lt;/pre>
&lt;h3 id="export-to-xml">Export to XML&lt;/h3>
&lt;p>Type the following in CMD:&lt;/p>
&lt;pre>&lt;code>mysql -u lrngsql -p --xml bank
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>OR&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-SQL">SELECT * FROM table_name
FOR XML AUTO, ELEMENTS /*SQL Server syntax*/
&lt;/code>&lt;/pre>
&lt;p>No easy way to do so in &lt;strong>R&lt;/strong>.&lt;/p>
&lt;h2 id="table-creation-ch-2">Table Creation (CH. 2)&lt;/h2>
&lt;h3 id="1---design">1 Design&lt;/h3>
&lt;p>What info is needed? Make a list.&lt;/p>
&lt;h3 id="2---refinement">2 Refinement&lt;/h3>
&lt;ol>
&lt;li>Compound objects need to be separated into multiple columns, such as names or addresses;&lt;/li>
&lt;li>If a column is a list containing zero, one, or more independent items, we need another table;&lt;/li>
&lt;li>Need primary key column(s) to guarantee uniqueness.&lt;/li>
&lt;/ol>
&lt;h3 id="3---building-sql-schema-statements">3 Building SQL Schema Statements&lt;/h3>
&lt;p>Another type of constraint, called a &lt;strong>check constraint&lt;/strong>, constrains the allowable values for a particular column. A check constraint can be attached to a &lt;strong>column definition&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-SQL">eye_color CHAR(2) CHECK (eye_color IN ('BR','BL','GR'))
&lt;/code>&lt;/pre>
&lt;p>Possible error:&lt;/p>
&lt;pre>&lt;code>ERROR 1265 (01000): Data truncated for column 'eye_color' at row 1
&lt;/code>&lt;/pre>
&lt;p>MySQL does provide another character data type called &lt;code>enum&lt;/code> that merges the check constraint into the data type definition.&lt;/p>
&lt;pre>&lt;code class="language-SQL">eye_color ENUM('BR','BL','GR')
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>R&lt;/strong> codes:&lt;/p>
&lt;pre>&lt;code class="language-r">Enum &amp;lt;- function(...) {
## capture the argument names without evaluating them
values &amp;lt;- sapply(match.call(expand.dots = TRUE)[-1L], deparse)
stopifnot(identical(unique(values), values))
res &amp;lt;- setNames(seq_along(values), values)
res &amp;lt;- as.environment(as.list(res))
lockEnvironment(res, bindings = TRUE)
res
}
FRUITS &amp;lt;- Enum(APPLE, BANANA, MELON)
&lt;/code>&lt;/pre>
&lt;p>See &lt;a href="https://stackoverflow.com/questions/33838392/enum-like-arguments-in-r">https://stackoverflow.com/questions/33838392/enum-like-arguments-in-r&lt;/a> for further details.&lt;/p>
&lt;p>After processing the create table statement, the MySQL server returns the message &amp;ldquo;Query OK, 0 rows affected,&amp;rdquo; which tells me that the statement had no &lt;strong>syntax errors&lt;/strong>.&lt;/p></description></item><item><title>Learning Stats at UofT #8: Problems in Statistics Application</title><link>https://siqi-zheng.rbind.io/post/2021-03-27-blog-9-2021/</link><pubDate>Sat, 27 Mar 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-03-27-blog-9-2021/</guid><description>&lt;p>This is the eighth post of the series Learning Stats at UofT. In case you did not read my last post, here is the introduction.&lt;/p>
&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time discussing statistics.&lt;/p>
&lt;p>Now it&amp;rsquo;s approaching the end of the semester for me. Looking back on 2020, a lot was going on and everyone had a tough time. It was also a time to develop skills to collaborate virtually and to be compassionate toward other people in workplaces around the world.&lt;/p>
&lt;p>In a data-driven world, we are connected by data, and the study of data, statistics, is essential to our day-to-day lives. However, the abuse of statistics has also created problems for us.&lt;/p>
&lt;h2 id="misspecified-models">Misspecified Models&lt;/h2>
&lt;p>At the early stage of the pandemic, people proposed many models to predict the number of cases around the world. Some even argued that cases would grow exponentially, based on their past experience with how viruses spread. However, this was an unreasonable guess because there was no up-to-date evidence to support it, and it created rumors and pessimistic expectations about our world. In fact, this could have been avoided had the posters been more cautious about what they were saying and its implications. But they were not. Statistics became a tool for spreading rumors, and readers should be more critical of such models. We statisticians have a responsibility to stand up and correct such mistakes.&lt;/p>
&lt;h2 id="data-exploration">Data Exploration&lt;/h2>
&lt;p>New data were released every day by the government. Data analysis should be the job of data analysts. However, many people with no such background also spread their ideas on the internet. It was not that harmful if they happened to be correct. However, some people enjoyed playing around with the data and sharing false conclusions based on it. Hence I see the necessity of a general education in statistics for the public.&lt;/p></description></item><item><title>Learning Stats at UofT #7: Detail-oriented and Communication Skills</title><link>https://siqi-zheng.rbind.io/post/2021-03-20-blog-8-2021/</link><pubDate>Sat, 20 Mar 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-03-20-blog-8-2021/</guid><description>&lt;p>This is the seventh post of the series Learning Stats at UofT. In case you did not read my last post, here is the introduction.&lt;/p>
&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time on the programs offered by DoSS.&lt;/p>
&lt;p>I am a student in the Applied Statistics Specialist, or Method and Application, at UofT. Though there have been some changes in the requirements, the main focus of the two programs is the same. In particular, you will go through some fundamentals in R and (Frequentist) statistics in your first two years, and take upper-year courses on some advanced topics. Compared to the Theory one, you do not need to take as many courses in theory, but you need to choose a focus depending on your interest. The focus seemed less important, but I gave it a lot of thought over my past years, so I would like to share some of those thoughts with you. Note that all of this can be found on the official website of Arts &amp;amp; Science, and I hope this paragraph serves well as an introduction.&lt;/p>
&lt;p>After meeting with a Vic alumnus, I summarized a set of core skills that are important for our future careers. This is the last part of the core skills.&lt;/p>
&lt;h2 id="detail-oriented">Detail-oriented&lt;/h2>
&lt;p>In the workplace, noticing details means you are careful with every piece of your writing and avoid making errors due to carelessness. In real life, the skill refers to the following. Be curious about your surroundings and the environment. Catch sight of the beautiful and show your appreciation. Remark on the unusual and take a note of it. Notice the changing seasons and take a photo of them. Savour the moment, whether you are walking to work, eating lunch or talking to friends. Be aware of the world around you and what you are feeling. You are not a robot, so you should not only work or study. Lastly, reflecting on your experiences will help you appreciate what matters to you (credit to Chad Jankowski).&lt;/p>
&lt;h2 id="communication-skills">Communication Skills&lt;/h2>
&lt;p>Communication is everywhere. You need to communicate verbally or in writing with family, friends, colleagues and neighbours. You apply communication skills at home, work, school or in your local community. Therefore, you should think of your past communications as the cornerstones of your life and invest time in enhancing these skills. Building these connections skillfully will support and enrich you every day (credit to Chad Jankowski).&lt;/p>
&lt;p>In the workplace, communication should be clear and precise. Sometimes you need to be a bit diplomatic when you talk to people; sometimes you need to be bold and speak up about your needs. People should develop the ability to communicate differently with various people in many contexts.&lt;/p></description></item><item><title>Learning Stats at UofT #6: Critical Analysis and Problem Solving</title><link>https://siqi-zheng.rbind.io/post/2021-03-13-blog-7-2021/</link><pubDate>Sat, 13 Mar 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-03-13-blog-7-2021/</guid><description>&lt;p>This is the sixth post of the series Learning Stats at UofT. In case you did not read my last post, here is the introduction.&lt;/p>
&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time on the programs offered by DoSS.&lt;/p>
&lt;p>I am a student in the Applied Statistics Specialist, or Method and Application, at UofT. Though there have been some changes in the requirements, the main focus of the two programs is the same. In particular, you will go through some fundamentals in R and (Frequentist) statistics in your first two years, and take upper-year courses on some advanced topics. Compared to the Theory one, you do not need to take as many courses in theory, but you need to choose a focus depending on your interest. The focus seemed less important, but I gave it a lot of thought over my past years, so I would like to share some of those thoughts with you. Note that all of this can be found on the official website of Arts &amp;amp; Science, and I hope this paragraph serves well as an introduction.&lt;/p>
&lt;p>After meeting with a Vic alumnus, I summarized a set of core skills that are important for our future careers. This is the first part of the core skills.&lt;/p>
&lt;h2 id="critical-analysis">Critical Analysis&lt;/h2>
&lt;p>Critical analysis involves the ability to analyze a situation, to retrieve information from different sources, and to communicate ideas both quantitatively and qualitatively.&lt;/p>
&lt;p>Statistics courses at U of T provide great training in quantitative analysis. In recent years, instructors have also designed assignments that require students to apply their skills to analyzing real-life cases. Nonetheless, this is not enough from my perspective. First, such tasks have to align with the specific course objectives. In particular, the data provided for an assignment are so clean that you don&amp;rsquo;t need to consider any messy situations. Second, professors may not necessarily know what employers are looking for nowadays. Hence it is important to explore the real world by yourself.&lt;/p>
&lt;h2 id="problem-solving">Problem Solving&lt;/h2>
&lt;p>The following materials were adapted from Learning Strategies at UofT (Rahul Bhat).&lt;/p>
&lt;h3 id="background">Background&lt;/h3>
&lt;p>What background information do I need to solve the problem? This should be combined with critical analysis. Specifically, you may want to know what information is missing or ignored.&lt;/p>
&lt;h3 id="rules">Rules&lt;/h3>
&lt;p>What theories, solutions, rules, proofs, or approaches might I use to solve the problem? In quantitative analysis, you will need to use mathematical knowledge, for example, theorems, to solve questions.&lt;/p>
&lt;h3 id="steps">Steps&lt;/h3>
&lt;p>Can I break the problem into steps - those I understand and those I can gather more information for? This way, you can first work through the steps that are fairly easy and save more time for the difficult tasks.&lt;/p>
&lt;h3 id="connection">Connection&lt;/h3>
&lt;p>Is there something I have seen in the past that resembles this problem? Here you practice active retrieval of knowledge, and look for solutions that are applicable in some sense to this question.&lt;/p></description></item><item><title>Learning Stats at UofT #5: A Dialogic Way to Introduce MLE</title><link>https://siqi-zheng.rbind.io/post/2021-03-06-blog-6-2021/</link><pubDate>Sat, 06 Mar 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-03-06-blog-6-2021/</guid><description>&lt;p>This is the fifth post of the series Learning Stats at UofT. In case you did not read my last post, here is the introduction.&lt;/p>
&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time on the programs offered by DoSS.&lt;/p>
&lt;p>I am a student in Applied Statistics Specialist, or Method and Application, at UofT. Though there were some changes in the requirements, the main focus of the two programs is the same. In particular, you will go through some fundamentals in R and (Frequentist) statistics in your first two years, and take upper year courses in some advanced topics. Compared to the Theory one, you do not need to take so many courses in theory, but you need to choose a focus depending on your interest. The focus seemed less important, but I gave a lot of thoughts about it in my past years. So I would like to share some of them with you. Note that all of these can be found on the official website of Arts &amp;amp; Science, and I hope this paragraph serves well as an introduction.&lt;/p>
&lt;h1 id="a-dialogic-way-to-introduce-mle">A Dialogic Way to Introduce MLE&lt;/h1>
&lt;p>Imagine you walk into Starbucks at Robarts Library, and you meet one of your TAs from STA257. Now you may want to say hi to this TA, but you also want this TA to clarify the concept of MLE. If I were the TA, I would explain the concept of MLE in the following way.&lt;/p>
&lt;p>Sure, I can explain the concept of likelihood while we wait in line. In statistics, we often need to estimate the parameter of a model. But how? Well, Maximum Likelihood Estimation (MLE) can help. First of all, we need to know that the likelihood of a candidate value for a parameter θ is the probability (or density) of observing the data we actually have if that value were the true θ. MLE provides a way to find the value θ̂ under which the observed data are most likely.&lt;/p>
&lt;p>Let’s use an example to illustrate this. Suppose you are interested in a model that describes the waiting time of a customer in this restaurant. You can first collect data on individual waiting times at random. Then you may assume that the true population of waiting times follows some classical distribution, so that we only need to estimate the parameter θ of a known distribution. Then we may be able to use MLE here. Does that make sense so far?&lt;/p>
&lt;p>Alright, now here comes the tricky part. In order to use MLE, we need the joint distribution of all the data X, which gives the probability that the observations jointly fall within specific ranges. But wait, we do not actually know it, since we only know the marginal distribution, i.e. the density function of an individual X with an unknown θ. Hence we can make the assumption that all the data are independent, meaning that knowing one waiting time tells us nothing about the next. It may not be true in reality, but it is sufficient for our purpose. Under this assumption, we can multiply all the marginal densities to get the joint density. To find the maximum likelihood estimate, we can take the logarithm (which turns the product into a sum), set the first derivative with respect to θ equal to 0, and solve. We can check the second derivative as well to ensure that θ̂ is a maximum. This θ̂ is our maximum likelihood estimate for the parameter. Did that explanation help?&lt;/p>
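&lt;p>The steps above can also be sketched numerically. Below is a minimal, hypothetical example (written in Python rather than the R used in our courses, with simulated data that are not from any real restaurant): the waiting times are assumed to be Exponential with unknown rate, and the closed form obtained by setting the derivative of the log-likelihood to zero is cross-checked against a brute-force search.&lt;/p>

```python
import math
import random

# Hypothetical illustration: assume the waiting times follow an
# Exponential(rate) distribution, whose log-likelihood for data
# x_1, ..., x_n is n*log(rate) - rate*sum(x).
random.seed(1)
data = [random.expovariate(2.0) for _ in range(1000)]  # true rate = 2.0
n, s = len(data), sum(data)

def log_likelihood(rate):
    # log of the product of the marginal densities (independence assumed)
    return n * math.log(rate) - rate * s

# Setting the derivative n/rate - sum(x) to zero gives the closed form:
mle_closed_form = n / s

# Cross-check with a brute-force search over a grid of candidate rates
grid = [i / 1000 for i in range(1, 10000)]
mle_grid = max(grid, key=log_likelihood)

print(round(mle_closed_form, 2), round(mle_grid, 2))
```

Both answers land close to the true rate of 2.0, and the closed-form and brute-force estimates agree, which is exactly what the derivative argument promises.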
&lt;p>To sum up, Maximum Likelihood Estimation gives an estimate for a parameter in a model given a set of data. In particular, with a known joint probability density function of the data, we can use derivatives to find the parameter value under which the observed data are most likely. Let’s pause for a moment! It is my turn to get the drink!&lt;/p></description></item><item><title>Learning Stats at UofT #4: Model Selection</title><link>https://siqi-zheng.rbind.io/post/2021-02-27-blog-5-2021/</link><pubDate>Sat, 27 Feb 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-02-27-blog-5-2021/</guid><description>&lt;p>This is the fourth post of the series Learning Stats at UofT. In case you did not read my last post, here is the introduction.&lt;/p>
&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time on the programs offered by DoSS.&lt;/p>
&lt;p>I am a student in the Applied Statistics Specialist, or Method and Application, at UofT. Though there have been some changes in the requirements, the main focus of the two programs is the same. In particular, you will go through some fundamentals in R and (Frequentist) statistics in your first two years, and take upper-year courses on some advanced topics. Compared to the Theory one, you do not need to take as many courses in theory, but you need to choose a focus depending on your interest. The focus seemed less important, but I gave it a lot of thought over my past years, so I would like to share some of those thoughts with you. Note that all of this can be found on the official website of Arts &amp;amp; Science, and I hope this paragraph serves well as an introduction.&lt;/p>
&lt;h1 id="model-selection">Model Selection&lt;/h1>
&lt;p>In the third-year course STA302, you will learn about simple and multiple linear regression models. In fact, you will learn more about the assumptions behind the models and possible remedies for improvement. Nonetheless, what you will not learn is whether it is appropriate to apply a model in a particular field.&lt;/p>
&lt;p>In reality, you will find that many models fit the data pretty well, but those models are incorrect. So here comes the question: to what extent can you apply a complex model to your data? Disciplinary knowledge is crucial in this context. It not only provides a justification for the model, but also a way to interpret it.&lt;/p>
&lt;h1 id="does-disciplinary-knowledge-play-a-role-in-model-selection">Does Disciplinary Knowledge Play a Role in Model Selection?&lt;/h1>
&lt;p>In NFS284, you learn about some thresholds for determining whether the consumption of nutrients is adequate. These thresholds, however, depend on the normality assumption. In fact, if you look at the hypothesis testing in academic papers, you will notice that P=0.05 is almost always selected as the significance level and normality is assumed as an approximation. But this requires some justification. You cannot select a P value just because it serves your convenience.&lt;/p>
&lt;p>I believe that the researchers have more knowledge of nutrition science than I do, and they may have very good reasons for their application of statistics. However, it is not very common to see such justifications even in well-written papers. In fact, you cannot tell whether disciplinary knowledge plays a role in model selection.&lt;/p>
&lt;p>One possible reason is that this requires researchers to devote some paragraphs to it, and the page limit of an academic journal may not allow them to do so. To take a step back, though, even if there are restrictions on the length of the article, this justification should not be given up just because of them.&lt;/p></description></item><item><title>Learning Stats at UofT #3: Some Controversies</title><link>https://siqi-zheng.rbind.io/post/2021-02-20-blog-4-2021/</link><pubDate>Sat, 20 Feb 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-02-20-blog-4-2021/</guid><description>&lt;p>This is the third post of the series Learning Stats at UofT. In case you did not read my last post, here is the introduction.&lt;/p>
&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time on the programs offered by DoSS.&lt;/p>
&lt;p>I am a student in the Applied Statistics Specialist, or Method and Application, at UofT. Though there have been some changes in the requirements, the main focus of the two programs is the same. In particular, you will go through some fundamentals in R and (Frequentist) statistics in your first two years, and take upper-year courses on some advanced topics. Compared to the Theory one, you do not need to take as many courses in theory, but you need to choose a focus depending on your interest. The focus seemed less important, but I gave it a lot of thought over my past years, so I would like to share some of those thoughts with you. Note that all of this can be found on the official website of Arts &amp;amp; Science, and I hope this paragraph serves well as an introduction.&lt;/p>
&lt;h2 id="data-science-program">Data Science Program&lt;/h2>
&lt;p>The first controversy is about the data science program. It is controversial because of its high enrollment requirements and the differing views towards data science. The enrollment requirements are the highest among all stats programs offered at UofT. Moreover, it requires you to learn both computer science and statistics, but it doesn&amp;rsquo;t require you to learn much statistical theory. Rather, it asks more for probability theory and the application of statistics. On the other hand, it involves a lot about data structures, but less other computer science knowledge. Therefore some people think that this program sits awkwardly between a pure CS program and a stats program. There is another view: many also think it is better preparation for the workplace because this program offers an internship opportunity.&lt;/p>
&lt;h2 id="the-changes-in-course-content">The Changes in Course Content&lt;/h2>
&lt;p>The statistics department at U of T went through many changes. In particular, many courses changed their instructors every year. Furthermore, the course content evolved with the changing focus of the workplace. The pro was that you could always learn the most up-to-date knowledge in statistics, and instructors also had more flexibility in designing a course. Note that the scope of a course remained the same throughout, but the way knowledge was conveyed might change. For students, however, it was hard to prepare for upcoming courses. Sometimes the course organization would have many small issues when new content was added.&lt;/p></description></item><item><title>Learning Stats at UofT #2: A Guide to Second-year Courses</title><link>https://siqi-zheng.rbind.io/post/2021-02-13-blog-3-2021/</link><pubDate>Sat, 13 Feb 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-02-13-blog-3-2021/</guid><description>&lt;p>This is the second post of the series Learning Stats at UofT. In case you did not read my last post, here is the introduction.&lt;/p>
&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time on the programs offered by DoSS.&lt;/p>
&lt;p>I am a student in the Applied Statistics Specialist, or Method and Application, at UofT. Though there have been some changes in the requirements, the main focus of the two programs is the same. In particular, you will go through some fundamentals in R and (Frequentist) statistics in your first two years, and take upper-year courses on some advanced topics. Compared to the Theory one, you do not need to take as many courses in theory, but you need to choose a focus depending on your interest. The focus seemed less important, but I gave it a lot of thought over my past years, so I would like to share some of those thoughts with you. Note that all of this can be found on the official website of Arts &amp;amp; Science, and I hope this paragraph serves well as an introduction.&lt;/p>
&lt;h2 id="sta237238">STA237/238&lt;/h2>
&lt;p>There are three combinations of courses offered by DoSS. The first combination is STA237 and STA238. This combination primarily focuses on R. It also goes through the fundamentals of statistics. However, it may not be the best introduction to statistical theory because the focus is adjusted every year. The organization of the courses was not very satisfactory last year: students from RC without a strong stats background found them too programming-based, and students from CS found them less interesting because of the lack of in-depth theory.&lt;/p>
&lt;h2 id="sta247248">STA247/248&lt;/h2>
&lt;p>There is another combination, STA247 and STA248. This combination is designed solely for computer science students, and it is a great choice if you want more knowledge of probability, especially because it involves many creative questions about probability and some material that computer science students may need for programming.&lt;/p>
&lt;h2 id="sta257261">STA257/261&lt;/h2>
&lt;p>The last combination, which is the one I took in my second year, is STA257 and STA261. This combination is the so-called hardest one for second-year stats students. Typically, the instructor will introduce a number of distributions, some new concepts about CDFs and PDFs, and some calculations using double integration. This was the tricky part for me at the time. Many students hadn&amp;rsquo;t learned convolution and double integration when they took this course. As a result, many of us needed to spend extra time getting familiar with these topics.&lt;/p>
&lt;p>Another interesting aspect of this course is that it doesn&amp;rsquo;t involve much Bayesian statistics. I would say this is not really a limitation, but it somehow affects how students think about statistics later on. This course introduces many concepts that are thought to be important in the future, particularly order statistics and quantiles. They will play an important role in the third-year courses.&lt;/p></description></item><item><title>Learning Stats at UofT: A Guide to Focuses in Applied Statistics</title><link>https://siqi-zheng.rbind.io/post/2021-02-06-blog-2-2021/</link><pubDate>Sat, 06 Feb 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-02-06-blog-2-2021/</guid><description>&lt;p>The fundamental statistics courses at UofT are normally unchanged, at least from my experience in the past three years. Still, I think it is worth devoting some blogs to this topic. Before starting the introduction to courses, I would like to spend some time on the programs offered by DoSS.&lt;/p>
&lt;p>I am a student in the Applied Statistics Specialist, or Method and Application, at UofT. Though there have been some changes in the requirements, the main focus of the two programs is the same. In particular, you will go through some fundamentals in R and (Frequentist) statistics in your first two years, and take upper-year courses on some advanced topics. Compared to the Theory one, you do not need to take as many courses in theory, but you need to choose a focus depending on your interest. The focus seemed less important, but I gave it a lot of thought over my past years, so I would like to share some of those thoughts with you. Note that all of this can be found on the official website of Arts &amp;amp; Science, and I hope this paragraph serves well as an introduction.&lt;/p>
&lt;h2 id="focus-can-be-changed-but-you-have-to-plan-ahead">Focus can be changed, but you have to plan ahead&lt;/h2>
&lt;p>The selection of a focus really depends on the courses you take in your first year. Most students in MP take ECO101/102 and CSC148/165 in their first year. This course combination of CS and ECO has certain benefits. Specifically, it gives students much flexibility in their second and third years, since it allows them to choose the Data Science Specialist in the Statistics program, CS programs, and Economics programs.&lt;/p>
&lt;p>However, a common solution is not necessarily a good one. As a student who wanted to take on new challenges and learn more about education, I chose to take Education courses at Victoria College. This to some extent limited my choice of programs. In particular, if I wanted to enroll in other programs, I might have needed to start from the beginning. Nonetheless, I met great friends there and discovered that being a teacher in a primary/middle school was not what I really wanted.&lt;/p>
&lt;p>Then I decided to select Astrophysics as my focus in my second year, hoping to explore the broader universe that I had never learnt about before. It was fun to learn, but it was too theoretical, and I soon started to get interested in Finance and Economics. Hence I reached a crossroads again. If I continued with Astrophysics, I believed I could still do well in academia, but I could not imagine what I would do after that. On the other hand, if I chose Economics, I would need to take first-year Economics courses in my second year and catch up with others in my third year. This was exactly the disadvantage of my first-year course selection.&lt;/p>
&lt;p>The key point is that there is a trade-off when you select a focus: it is more common to stick with your first-year courses when choosing a focus, but then you do not have the opportunity to take some other interesting courses at the university.&lt;/p></description></item><item><title>First Blog in 2021 about Teaching Statistics</title><link>https://siqi-zheng.rbind.io/post/2021-01-30-blog-1-2021/</link><pubDate>Sat, 30 Jan 2021 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2021-01-30-blog-1-2021/</guid><description>&lt;p>A few days ago, a student asked me about the logic behind the simulated sampling distribution. She was curious about how we could use reshuffling labels + random sampling to obtain a sampling distribution. The logic is related to the Skeptic’s Argument and the consequence of it. Here is the theorem and its elegant proof.&lt;/p>
&lt;p>&lt;img src="1.jpg" alt="Theorem from the Skeptic’s Argument">&lt;/p>
&lt;p>I am particularly interested in her question because I learned the intuition behind the code back in my first year at UofT, but I also saw some unbalanced CRDs where the same reshuffling method was used to calculate the sampling distribution. Hence I was confused by the use of the code.&lt;/p>
&lt;p>The symmetry property of an unbalanced CRD turns out to be uncertain in theory, but in many cases we can still see a simulated sampling distribution that is somewhat symmetric around 0.&lt;/p>
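&lt;p>For readers curious about what the reshuffling looks like in code, here is a minimal sketch (in Python rather than the R used in STA130, with made-up numbers, not real course data): under the Skeptic&amp;rsquo;s Argument there is no difference between the groups, so the labels are exchangeable, and repeatedly shuffling them yields a simulated sampling distribution of the difference in means.&lt;/p>

```python
import random
from statistics import mean

# Made-up data for illustration only
treatment = [6.1, 5.8, 7.0, 6.5, 6.9]
control = [5.2, 5.6, 5.9, 5.4, 6.0]
observed = mean(treatment) - mean(control)

random.seed(0)
pooled = treatment + control
diffs = []
for _ in range(5000):
    random.shuffle(pooled)              # reshuffle the group labels
    new_t, new_c = pooled[:5], pooled[5:]
    diffs.append(mean(new_t) - mean(new_c))

# Under the Skeptic's Argument the simulated distribution centres near 0,
# and the p-value is the share of shuffles at least as extreme as observed
p_value = sum(abs(d) >= abs(observed) for d in diffs) / len(diffs)
print(round(mean(diffs), 2), round(p_value, 3))
```

The simulated differences centre near 0 even though nothing forces exact symmetry, which mirrors the point above about unbalanced designs.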
&lt;p>Another question about test statistics appears in one of my students’ writing assignments. In the writing, this student defines the test statistic (the mean) as a random variable after calculating an exact number from the sample. Indeed, a test statistic &lt;strong>can be&lt;/strong> a random variable, but it is not one once a number has already been obtained from the sample. A great answer from an online forum is attached below.&lt;/p>
&lt;p>&lt;img src="test.jpg" alt="Answer from Stack Exchange">&lt;/p>
&lt;p>A test statistic is about the sample, but a parameter is about the population. If one is wondering whether a parameter is a random variable or a fixed value, then one may be very interested in the controversy between Bayesians and Frequentists. They are fundamentally different approaches to knowledge about data and uncertainty, but in many situations they yield the same result mathematically.&lt;/p>
&lt;p>This also reminds me of my past experience with statistics. When I learnt probability theory in middle school, we did not really differentiate between these two approaches. We sometimes claimed that one event was more likely to happen because of its higher probability, and sometimes interpreted the proportion of heads when flipping a coin many times as the long-term frequency.&lt;/p>
&lt;p>To wrap up, I use these two examples to show that there can be complicated theories behind some seemingly simple facts. Though this is really beyond the scope of STA130, I still think students can benefit from thinking about these questions. One will get to know more about statistics when one takes a second-year statistics course.&lt;/p>
&lt;h2 id="references">References&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>
&lt;a href="http://pages.stat.wisc.edu/~wardrop/courses/371chapter3sum15.pdf" target="_blank" rel="noopener">Statistic Course Material&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>
&lt;a href="https://stats.stackexchange.com/questions/85426/is-test-statistic-a-value-or-a-random-variable" target="_blank" rel="noopener">Stack Exchange Question&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Revision Guide</title><link>https://siqi-zheng.rbind.io/post/2020-12-02-review/</link><pubDate>Wed, 02 Dec 2020 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2020-12-02-review/</guid><description>&lt;blockquote>
&lt;p>The revision guide can be downloaded by clicking the button above. Good luck on your exam!&lt;/p>
&lt;/blockquote></description></item><item><title>Week 6 Tutorial (Bootstrapping)</title><link>https://siqi-zheng.rbind.io/post/2020-11-05-sharing-short-notice/</link><pubDate>Thu, 22 Oct 2020 09:00:00 +0000</pubDate><guid>https://siqi-zheng.rbind.io/post/2020-11-05-sharing-short-notice/</guid><description>&lt;blockquote>
&lt;p>This tutorial was designed to illustrate a sample beamer presentation created from .Rmd file for teaching bootstrap sampling.&lt;/p>
&lt;/blockquote>
&lt;p>Teaching bootstrapping to students who are new to statistics can be difficult, especially when they have been taught hypothesis testing (the z-test) before bootstrapping. In my actual practice, I found it useful to discuss the similarities between these two methods first, and then the differences.&lt;/p>
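&lt;p>To illustrate that similarity concretely, here is a small sketch (in Python with simulated data, not the tutorial&amp;rsquo;s .Rmd code): both approaches describe how the sample mean varies from sample to sample, the z-test via the normal-theory standard error formula and the bootstrap by resampling the observed data with replacement.&lt;/p>

```python
import random
from statistics import mean, stdev

# Simulated sample for illustration only
random.seed(0)
sample = [random.gauss(10, 2) for _ in range(50)]

# z-test style: standard error from the normal-theory formula s / sqrt(n)
se_formula = stdev(sample) / len(sample) ** 0.5

# bootstrap: resample with replacement, recompute the mean each time
boot_means = [
    mean(random.choices(sample, k=len(sample))) for _ in range(2000)
]
se_bootstrap = stdev(boot_means)

print(round(se_formula, 2), round(se_bootstrap, 2))
```

The two standard errors come out very close, which is a handy way to motivate the bootstrap as doing the same job as the formula, just without relying on normal theory.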
&lt;p>An introduction to why we need such a method can always be inspiring and motivating.&lt;/p></description></item></channel></rss>