class: center, middle, inverse, title-slide

# Foundation for Inference Part 1
## DATA 606 - Statistics & Probability for Data Analytics
### Jason Bryer, Ph.D. and Angela Lui, Ph.D.
### September 29, 2021

---
# One Minute Paper Results

.pull-left[
**What was the most important thing you learned during this class?**
<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]
.pull-right[
**What important question remains unanswered for you?**
<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

---
class: font140
# Homework Presentations

* 4.17 Nick Oliver
* 4.19 William Aiken
* 4.25 Tyler Baker

---
class: center, middle, inverse
# Crash Course in Calculus

---
class: font90
# Crash Course in Calculus

There are three major concepts in calculus that will be helpful to understand:

**Limits** - the value that a function (or sequence) approaches as the input (or index) approaches some value.

**Derivatives** - the slope of the line tangent at any given point on a function.

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

**Integrals** - the area under the curve.
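The three ideas above can be sketched numerically in base R. This is a minimal illustration only; the function `f2` is an ad hoc example and is not part of the course code.

```r
# Limits and derivatives: the difference quotient of f2(x) = x^2 at x = 1
# approaches the slope of the tangent line (the derivative, 2) as h -> 0.
f2 <- function(x) { x^2 }
h <- 10^-(1:8)
slopes <- (f2(1 + h) - f2(1)) / h
slopes  # gets closer and closer to 2 as h shrinks

# Integrals: the area under f2 between 0 and 1 (exactly 1/3).
area <- integrate(f2, 0, 1)
area
```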
<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---
background-image: url(data:image/png;base64,#images/derivative_1.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_2.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_3.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_4.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_5.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_6.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_7.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_8.jpg)
background-size: contain
# Derivatives

<div
class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_9.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
# Function for Normal Distribution

`$$f\left( x|\mu ,\sigma \right) =\frac { 1 }{ \sigma \sqrt { 2\pi } } { e }^{ -\frac { { \left( x-\mu \right) }^{ 2 } }{ { 2\sigma }^{ 2 } } }$$`

```r
f <- function(x, mean = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - mean) / sigma)^2)
}
```

```r
min <- 0; max <- 2
ggplot() +
  stat_function(fun = f) +
  xlim(c(-4, 4)) +
  geom_vline(xintercept = c(min, max), color = 'blue', linetype = 2) +
  xlab('x')
```

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---
# Riemann Sums

One strategy for finding the area under a curve between two values is to draw a series of rectangles. Given *n* rectangles spanning 0 to 2, we know that the width of each is `\(\frac{2 - 0}{n}\)` and the height is `\(f(x)\)` evaluated within each interval. Here is an example with 3 rectangles.
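The rectangle construction can be sketched in R. This is an illustrative helper only: the `riemann` function below is not from the course code, and it uses left endpoints, which may differ from the rule used in the figures. Since `f` with its default arguments is the standard normal density, `dnorm` is used here.

```r
# Left-endpoint Riemann sum: approximate the area under f
# between a and b with n rectangles of equal width.
riemann <- function(f, a, b, n) {
  width <- (b - a) / n
  x <- a + width * (0:(n - 1))  # left endpoint of each rectangle
  sum(width * f(x))
}
riemann(dnorm, 0, 2, n = 3)
riemann(dnorm, 0, 2, n = 300)  # approaches 0.4772499 as n grows
```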
<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---
# Riemann Sums (10 rectangles)

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---
# Riemann Sums (30 rectangles)

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---
# Riemann Sums (300 rectangles)

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---
# `\(n\rightarrow \infty\)`

As *n* approaches infinity, we get the *exact* value for the area under the curve. This notion of letting a value get increasingly close to infinity, zero, or any other value is called the **limit**. The area under a function is called the **integral**.

```r
integrate(f, 0, 2)
```

```
## 0.4772499 with absolute error < 5.3e-15
```

```r
DATA606::shiny_demo('calculus')
```

---
# Normal Distribution

```r
normal_plot(cv = c(0, 2))
```

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

```r
pnorm(2) - pnorm(0)
```

```
## [1] 0.4772499
```

---
# R's built-in functions for working with distributions

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

.font70[See https://github.com/jbryer/DATA606Fall2021/blob/master/R/distributions.R]

---
class: center, middle, inverse
# Foundation for Inference

---
# Population Distribution (Uniform)

```r
n <- 1e5
pop <- runif(n, 0, 1)
mean(pop)
```

```
## [1] 0.5009061
```

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---
# Random Sample (n=10)

```r
samp1 <- sample(pop, size=10)
mean(samp1)
```

```
## [1] 0.5042593
```

```r
hist(samp1)
```

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

---
# Random Sample (n=30)

```r
samp2 <- sample(pop, size=30)
mean(samp2)
```

```
## [1] 0.4856277
```

```r
hist(samp2)
```

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---
# Lots of Random Samples

```r
M <- 1000
samples <- numeric(length=M)
for(i in seq_len(M)) {
  samples[i] <- mean(sample(pop, size=30))
}
head(samples, n=8)
```

```
## [1] 0.3625186 0.4650553 0.4908226 0.5003795 0.5566081 0.4773403 0.4808297 0.4353974
```

---
# Sampling Distribution

```r
hist(samples)
```

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />

---
# Central Limit Theorem (CLT)

Let `\(X_1\)`, `\(X_2\)`, ..., `\(X_n\)` be independent, identically distributed random variables with mean `\(\mu\)` and variance `\(\sigma^2\)`, both finite. Then for any constant `\(z\)`,

$$ \lim _{ n\rightarrow \infty } P\left( \frac { \bar { X } -\mu }{ \sigma /\sqrt { n } } \le z \right) =\Phi \left( z \right) $$

where `\(\Phi\)` is the cumulative distribution function (cdf) of the standard normal distribution.

---
# In other words...

The distribution of the sample mean is well approximated by a normal model:

$$ \bar { x } \sim N\left( mean=\mu ,SE=\frac { \sigma }{ \sqrt { n } } \right) $$

where SE represents the **standard error**, which is defined as the standard deviation of the sampling distribution. In most cases `\(\sigma\)` is not known, so use `\(s\)`.
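---
# Checking the CLT by Simulation

A quick simulation sketch (not part of the original deck; the seed is arbitrary, and `pop` is regenerated here to match the uniform population from earlier) shows the standard deviation of the sampling distribution tracking `\(\sigma / \sqrt{n}\)`:

```r
set.seed(2112)                  # arbitrary seed, for reproducibility
pop <- runif(1e5, 0, 1)         # uniform population, as earlier
samples <- replicate(1000, mean(sample(pop, size = 30)))
sd(samples)                     # empirical standard error
sd(pop) / sqrt(30)              # CLT prediction: sigma / sqrt(n)
```

Both values should be close to `\(\frac{1/\sqrt{12}}{\sqrt{30}} \approx 0.053\)`, since the standard deviation of a uniform(0, 1) population is `\(1/\sqrt{12}\)`.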
---
# CLT Shiny App

```r
library(DATA606)
shiny_demo('sampdist')
shiny_demo('CLT_mean')
```

---
# Standard Error

```r
samp2 <- sample(pop, size=30)
mean(samp2)
```

```
## [1] 0.5151478
```

```r
(samp2.se <- sd(samp2) / sqrt(length(samp2)))
```

```
## [1] 0.05238345
```

---
# Confidence Interval

The confidence interval is then `\(\bar{x} \pm CV \times SE\)` where CV is the critical value. For a 95% confidence interval, the critical value is ~1.96 since, for the standard normal distribution (`\(\mu = 0\)`, `\(\sigma = 1\)`),

`$$\int _{ -1.96 }^{ 1.96 }{ \frac { 1 }{ \sigma \sqrt { 2\pi } } { e }^{ -\frac { { \left( x-\mu \right) }^{ 2 } }{ 2{ \sigma }^{ 2 } } } dx } \approx 0.95$$`

```r
qnorm(0.025) # Remember we need to consider the two tails, 2.5% to the left, 2.5% to the right.
```

```
## [1] -1.959964
```

```r
(samp2.ci <- c(mean(samp2) - 1.96 * samp2.se,
               mean(samp2) + 1.96 * samp2.se))
```

```
## [1] 0.4124763 0.6178194
```

---
# Confidence Intervals (cont.)

We are 95% confident that the true population mean is between 0.4124763 and 0.6178194. That is, if we were to take 100 random samples and construct a confidence interval from each, we would expect about 95 of those intervals to contain the true population mean.

```r
ci <- data.frame(mean=numeric(), min=numeric(), max=numeric())
for(i in seq_len(100)) {
  samp <- sample(pop, size=30)
  se <- sd(samp) / sqrt(length(samp))
  ci[i,] <- c(mean(samp), mean(samp) - 1.96 * se, mean(samp) + 1.96 * se)
}
ci$sample <- 1:nrow(ci)
ci$sig <- ci$min < 0.5 & ci$max > 0.5
```

---
# Confidence Intervals

```r
ggplot(ci, aes(x=min, xend=max, y=sample, yend=sample, color=sig)) +
  geom_vline(xintercept=0.5) +
  geom_segment() +
  xlab('CI') + ylab('') +
  scale_color_manual(values=c('TRUE'='grey', 'FALSE'='red'))
```

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />

---
class: left, font140
# One Minute Paper

Complete the one minute paper: https://forms.gle/ENFqTnDB5fJDw3kx9

1. What was the most important thing you learned during this class?
2. 
What important question remains unanswered for you?