class: center, middle, inverse, title-slide

# Foundation for Inference Part 1
## DATA 606 - Statistics & Probability for Data Analytics
### Jason Bryer, Ph.D. and Angela Lui, Ph.D.
### September 29, 2021

---
# One Minute Paper Results

.pull-left[
**What was the most important thing you learned during this class?**
<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]
.pull-right[
**What important question remains unanswered for you?**
<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

---
class: font140
# Homework Presentations

* 4.17 Nick Oliver
* 4.19 William Aiken
* 4.25 Tyler Baker

---
class: center, middle, inverse
# Crash Course in Calculus

---
class: font90
# Crash Course in Calculus

There are three major concepts in calculus that will be helpful to understand:

**Limits** - the value that a function (or sequence) approaches as the input (or index) approaches some value.

**Derivatives** - the slope of the line tangent at any given point on a function.

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

**Integrals** - the area under the curve.
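The three ideas above can be sketched numerically in base R. This is a minimal illustration only; the function `f2` is an ad hoc example and is not part of the course code.

```r
# Limits and derivatives: the difference quotient of f2(x) = x^2 at x = 1
# approaches the slope of the tangent line (the derivative, 2) as h -> 0.
f2 <- function(x) { x^2 }
h <- 10^-(1:8)
slopes <- (f2(1 + h) - f2(1)) / h
slopes  # gets closer and closer to 2 as h shrinks

# Integrals: the area under f2 between 0 and 1 (exactly 1/3).
area <- integrate(f2, 0, 1)
area
```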
<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---
background-image: url(data:image/png;base64,#images/derivative_1.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_2.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_3.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_4.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_5.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_6.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_7.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_8.jpg)
background-size: contain
# Derivatives

<div
class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
background-image: url(data:image/png;base64,#images/derivative_9.jpg)
background-size: contain
# Derivatives

<div class="my-footer"><span> Source: <a href='https://github.com/allisonhorst/stats-illustrations'>@allison_horst</a> </span></div>

---
# Function for Normal Distribution

`$$f\left( x|\mu ,\sigma \right) =\frac { 1 }{ \sigma \sqrt { 2\pi } } { e }^{ -\frac { { \left( x-\mu \right) }^{ 2 } }{ { 2\sigma }^{ 2 } } }$$`

```r
f <- function(x, mean = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-1/2 * ((x - mean) / sigma)^2)
}
```

```r
min <- 0; max <- 2
ggplot() +
  stat_function(fun = f) +
  xlim(c(-4, 4)) +
  geom_vline(xintercept = c(min, max), color = 'blue', linetype = 2) +
  xlab('x')
```

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---
# Riemann Sums

One strategy for finding the area under a curve between two values is to draw a series of rectangles. Given *n* rectangles spanning 0 to 2, we know that the width of each is `\(\frac{2 - 0}{n}\)` and the height is `\(f(x)\)` evaluated within each interval. Here is an example with 3 rectangles.
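The rectangle construction can be sketched in R. This is an illustrative helper only: the `riemann` function below is not from the course code, and it uses left endpoints, which may differ from the rule used in the figures. Since `f` with its default arguments is the standard normal density, `dnorm` is used here.

```r
# Left-endpoint Riemann sum: approximate the area under f
# between a and b with n rectangles of equal width.
riemann <- function(f, a, b, n) {
  width <- (b - a) / n
  x <- a + width * (0:(n - 1))  # left endpoint of each rectangle
  sum(width * f(x))
}
riemann(dnorm, 0, 2, n = 3)
riemann(dnorm, 0, 2, n = 300)  # approaches 0.4772499 as n grows
```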
<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---
# Riemann Sums (10 rectangles)

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---
# Riemann Sums (30 rectangles)

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---
# Riemann Sums (300 rectangles)

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---
# `\(n\rightarrow \infty\)`

As *n* approaches infinity, we get the *exact* value for the area under the curve. This notion of letting a value get increasingly close to infinity, zero, or any other value is called the **limit**. The area under a function is called the **integral**.

```r
integrate(f, 0, 2)
```

```
## 0.4772499 with absolute error < 5.3e-15
```

```r
DATA606::shiny_demo('calculus')
```

---
# Normal Distribution

```r
normal_plot(cv = c(0, 2))
```

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

```r
pnorm(2) - pnorm(0)
```

```
## [1] 0.4772499
```

---
# R's built-in functions for working with distributions

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

.font70[See https://github.com/jbryer/DATA606Fall2021/blob/master/R/distributions.R]

---
class: center, middle, inverse
# Foundation for Inference

---
# Population Distribution (Uniform)

```r
n <- 1e5
pop <- runif(n, 0, 1)
mean(pop)
```

```
## [1] 0.5009061
```

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---
# Random Sample (n=10)

```r
samp1 <- sample(pop, size=10)
mean(samp1)
```

```
## [1] 0.5042593
```

```r
hist(samp1)
```

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

---
# Random Sample (n=30)

```r
samp2 <- sample(pop, size=30)
mean(samp2)
```

```
## [1] 0.4856277
```

```r
hist(samp2)
```

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---
# Lots of Random Samples

```r
M <- 1000
samples <- numeric(length=M)
for(i in seq_len(M)) {
  samples[i] <- mean(sample(pop, size=30))
}
head(samples, n=8)
```

```
## [1] 0.3625186 0.4650553 0.4908226 0.5003795 0.5566081 0.4773403 0.4808297 0.4353974
```

---
# Sampling Distribution

```r
hist(samples)
```

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />

---
# Central Limit Theorem (CLT)

Let `\(X_1\)`, `\(X_2\)`, ..., `\(X_n\)` be independent, identically distributed random variables with mean `\(\mu\)` and variance `\(\sigma^2\)`, both finite. Then for any constant `\(z\)`,

$$ \lim _{ n\rightarrow \infty } P\left( \frac { \bar { X } -\mu }{ \sigma /\sqrt { n } } \le z \right) =\Phi \left( z \right) $$

where `\(\Phi\)` is the cumulative distribution function (cdf) of the standard normal distribution.

---
# In other words...

The distribution of the sample mean is well approximated by a normal model:

$$ \bar { x } \sim N\left( mean=\mu ,SE=\frac { \sigma }{ \sqrt { n } } \right) $$

where SE represents the **standard error**, which is defined as the standard deviation of the sampling distribution. In most cases `\(\sigma\)` is not known, so use `\(s\)`.
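---
# Checking the CLT by Simulation

A quick simulation sketch (not part of the original deck; the seed is arbitrary, and `pop` is regenerated here to match the uniform population from earlier) shows the standard deviation of the sampling distribution tracking `\(\sigma / \sqrt{n}\)`:

```r
set.seed(2112)                  # arbitrary seed, for reproducibility
pop <- runif(1e5, 0, 1)         # uniform population, as earlier
samples <- replicate(1000, mean(sample(pop, size = 30)))
sd(samples)                     # empirical standard error
sd(pop) / sqrt(30)              # CLT prediction: sigma / sqrt(n)
```

Both values should be close to `\(\frac{1/\sqrt{12}}{\sqrt{30}} \approx 0.053\)`, since the standard deviation of a uniform(0, 1) population is `\(1/\sqrt{12}\)`.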
---
# CLT Shiny App

```r
library(DATA606)
shiny_demo('sampdist')
shiny_demo('CLT_mean')
```

---
# Standard Error

```r
samp2 <- sample(pop, size=30)
mean(samp2)
```

```
## [1] 0.5151478
```

```r
(samp2.se <- sd(samp2) / sqrt(length(samp2)))
```

```
## [1] 0.05238345
```

---
# Confidence Interval

The confidence interval is then `\(\bar{x} \pm CV \times SE\)` where CV is the critical value. For a 95% confidence interval, the critical value is ~1.96 since, for the standard normal distribution (`\(\mu = 0\)`, `\(\sigma = 1\)`),

`$$\int _{ -1.96 }^{ 1.96 }{ \frac { 1 }{ \sigma \sqrt { 2\pi } } { e }^{ -\frac { { \left( x-\mu \right) }^{ 2 } }{ 2{ \sigma }^{ 2 } } } dx } \approx 0.95$$`

```r
qnorm(0.025) # Remember we need to consider the two tails, 2.5% to the left, 2.5% to the right.
```

```
## [1] -1.959964
```

```r
(samp2.ci <- c(mean(samp2) - 1.96 * samp2.se,
               mean(samp2) + 1.96 * samp2.se))
```

```
## [1] 0.4124763 0.6178194
```

---
# Confidence Intervals (cont.)

We are 95% confident that the true population mean is between 0.4124763 and 0.6178194. That is, if we were to take 100 random samples and construct a confidence interval from each, we would expect about 95 of those intervals to contain the true population mean.

```r
ci <- data.frame(mean=numeric(), min=numeric(), max=numeric())
for(i in seq_len(100)) {
  samp <- sample(pop, size=30)
  se <- sd(samp) / sqrt(length(samp))
  ci[i,] <- c(mean(samp), mean(samp) - 1.96 * se, mean(samp) + 1.96 * se)
}
ci$sample <- 1:nrow(ci)
ci$sig <- ci$min < 0.5 & ci$max > 0.5
```

---
# Confidence Intervals

```r
ggplot(ci, aes(x=min, xend=max, y=sample, yend=sample, color=sig)) +
  geom_vline(xintercept=0.5) +
  geom_segment() +
  xlab('CI') + ylab('') +
  scale_color_manual(values=c('TRUE'='grey', 'FALSE'='red'))
```

<img src="data:image/png;base64,#05-Foundation_for_Inference_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />

---
class: left, font140
# One Minute Paper

Complete the one minute paper: https://forms.gle/ENFqTnDB5fJDw3kx9

1. What was the most important thing you learned during this class?
2. 
What important question remains unanswered for you?