class: center, middle, inverse, title-slide

# Summarizing Data
## DATA 606 - Statistics & Probability for Data Analytics
### Jason Bryer, Ph.D. and Angela Lui, Ph.D.
### September 8, 2021

---

# Agenda

.pull-left[.font130[
* Questions
* Homework Presentations
* Data wrangling
* Data types
* Descriptive statistics
* Data visualization
* Grammar of graphics
* Types of graphics
]]

.pull-right[
<img src='images/data_wrangler.png' alt='Data Wrangler' width='100%' />
.right[.font60[
Image source: [@allison_horst](https://twitter.com/allison_horst)
]]
]

---

# One Minute Paper Results

.pull-left[
**What was the most important thing you learned during this class?**
<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]

.pull-right[
**What important question remains unanswered for you?**
<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

---

# Announcements

There is an error in the homework 2 Rmarkdown file. The `OIdata` package is no longer available on CRAN. All the data for the textbook has been moved to the `openintro` package. However, in that conversion some of the datasets have been renamed. The `heartTr` data frame has been renamed to `heart_transplant`. You have two options:

1. If you already started the homework, search and replace `heartTr` with `heart_transplant`.
2. If not, the Rmd file has been updated on Github: https://github.com/jbryer/DATA606Fall2021/blob/master/Homework/Homework2.Rmd

___________

I made a few minor updates to the `DATA606` package.
You can update by reinstalling:

```r
remotes::install_github('jbryer/DATA606')
```

---

# Homework Presentations

* 1.13 Lisa Szydziak
* 1.37 Esteban Aramayo
* 1.43 Mauricio Claudio

---

# Workflow

.center[
<img src='images/data-science-wrangle.png' alt = 'Data Science Workflow' width='1000' />
]

.font80[Source: [Wickham & Grolemund, 2017](https://r4ds.had.co.nz)]

---

# Tidy Data

.center[
<img src='images/tidydata_1.jpg' height='500' />
]

See Wickham (2014) [Tidy data](https://vita.had.co.nz/papers/tidy-data.html).

---

# Types of Data

.pull-left[
* Numerical (quantitative)
    * Continuous
    * Discrete
]

.pull-right[
* Categorical (qualitative)
    * Regular categorical
    * Ordinal
]

.center[
<img src='images/continuous_discrete.png' height='400' />
]

---

# Data Types in R

<img src="images/DataTypesConceptModel.png" width="1000" style="display: block; margin: auto;" />

---

# Data Types / Descriptives / Visualizations

Data Type | Descriptive Stats | Visualization
-------------|-----------------------------------------------|-------------------|
Continuous | mean, median, mode, standard deviation, IQR | histogram, density, box plot
Discrete | contingency table, proportional table, median | bar plot
Categorical | contingency table, proportional table | bar plot
Ordinal | contingency table, proportional table, median | bar plot
Two quantitative | correlation | scatter plot
Two qualitative | contingency table, chi-squared | mosaic plot, bar plot
Quantitative & Qualitative | grouped summaries, ANOVA, t-test | box plot

---

# Robust Statistics

Median and IQR are more robust to skewness and outliers than mean and SD.
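As a rough illustration of this point, the sketch below compares the four statistics on a small made-up vector of prices before and after adding a single extreme value. The values and the helper `center_spread()` are hypothetical and are not taken from `legosets`.

```r
# Hypothetical prices (made-up values for illustration only)
prices <- c(10, 12, 15, 18, 20)
prices_outlier <- c(prices, 800)  # same data plus one extreme value

# Hypothetical helper: center and spread, computed both ways
center_spread <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), IQR = IQR(x))
}

# One row per version of the data, one column per statistic
comparison <- rbind(without_outlier = center_spread(prices),
                    with_outlier = center_spread(prices_outlier))
comparison
```

With these made-up values, the mean and SD change dramatically once the outlier is added, while the median and IQR shift only slightly.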
Therefore,

* for skewed distributions it is often more helpful to use median and IQR to describe the center and spread
* for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread

---

# About `legosets` <img src="images/hex/brickset.png" class="title-hex">

To install the `brickset` package:

```r
remotes::install_github('jbryer/brickset')
```

To load the `legosets` dataset:

```r
data('legosets', package = 'brickset')
```

The `legosets` data has 16355 observations of 34 variables.

.code70[
```r
names(legosets)
```

```
##  [1] "setID"                 "name"                  "year"                  "theme"
##  [5] "themeGroup"            "subtheme"              "category"              "released"
##  [9] "pieces"                "minifigs"              "bricksetURL"           "rating"
## [13] "reviewCount"           "packagingType"         "availability"          "agerange_min"
## [17] "US_retailPrice"        "US_dateFirstAvailable" "US_dateLastAvailable"  "UK_retailPrice"
## [21] "UK_dateFirstAvailable" "UK_dateLastAvailable"  "CA_retailPrice"        "CA_dateFirstAvailable"
## [25] "CA_dateLastAvailable"  "DE_retailPrice"        "DE_dateFirstAvailable" "DE_dateLastAvailable"
## [29] "height"                "width"                 "depth"                 "weight"
## [33] "thumbnailURL"          "imageURL"
```
]

---

# Structure (`str`) <img src="images/hex/brickset.png" class="title-hex">

.code50[
```r
str(legosets)
```

```
## 'data.frame':	16355 obs. of  34 variables:
##  $ setID                : int  7693 7695 7697 7698 25534 7418 7419 6020 22704 7421 ...
##  $ name                 : chr  "Small house set" "Medium house set" "Medium house set" "Large house set" ...
##  $ year                 : int  1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
##  $ theme                : chr  "Minitalia" "Minitalia" "Minitalia" "Minitalia" ...
##  $ themeGroup           : chr  "Vintage" "Vintage" "Vintage" "Vintage" ...
##  $ subtheme             : chr  NA NA NA NA ...
##  $ category             : chr  "Normal" "Normal" "Normal" "Normal" ...
##  $ released             : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ pieces               : int  67 109 158 233 NA 1 1 60 65 NA ...
##  $ minifigs             : int  NA NA NA NA NA NA NA NA NA NA ...
## $ bricksetURL : chr "https://brickset.com/sets/1-8" "https://brickset.com/sets/2-8" "https://brickset.com/sets/3-6" "https://brickset.com/sets/4-4" ... ## $ rating : num 0 0 0 0 0 0 0 0 0 0 ... ## $ reviewCount : int 0 0 1 0 0 0 0 1 0 0 ... ## $ packagingType : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ... ## $ availability : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ... ## $ agerange_min : int NA NA NA NA NA NA NA NA NA NA ... ## $ US_retailPrice : num NA NA NA NA NA 1.99 NA NA 4.99 NA ... ## $ US_dateFirstAvailable: Date, format: NA NA NA NA ... ## $ US_dateLastAvailable : Date, format: NA NA NA NA ... ## $ UK_retailPrice : num NA NA NA NA NA NA NA NA NA NA ... ## $ UK_dateFirstAvailable: Date, format: NA NA NA NA ... ## $ UK_dateLastAvailable : Date, format: NA NA NA NA ... ## $ CA_retailPrice : num NA NA NA NA NA NA NA NA NA NA ... ## $ CA_dateFirstAvailable: Date, format: NA NA NA NA ... ## $ CA_dateLastAvailable : Date, format: NA NA NA NA ... ## $ DE_retailPrice : num NA NA NA NA NA NA NA NA NA NA ... ## $ DE_dateFirstAvailable: Date, format: NA NA NA NA ... ## $ DE_dateLastAvailable : Date, format: NA NA NA NA ... ## $ height : num NA NA NA NA NA ... ## $ width : num NA NA NA NA NA ... ## $ depth : num NA NA NA NA NA NA NA NA 5.08 NA ... ## $ weight : num NA NA NA NA NA NA NA NA NA NA ... ## $ thumbnailURL : chr "https://images.brickset.com/sets/small/1-8.jpg" "https://images.brickset.com/sets/small/2-8.jpg" "https://images.brickset.com/sets/small/3-6.jpg" "https://images.brickset.com/sets/small/4-4.jpg" ... ## $ imageURL : chr "https://images.brickset.com/sets/images/1-8.jpg" "https://images.brickset.com/sets/images/2-8.jpg" "https://images.brickset.com/sets/images/3-6.jpg" "https://images.brickset.com/sets/images/4-4.jpg" ... 
```
]

---

# RStudio Environment tab can help <img src="images/hex/rstudio.png" class="title-hex">

<img src="images/legosets_rstudio_environment.png" width="500" style="display: block; margin: auto;" />

---
class: hide-logo

# Table View

.font60[
]

---

# Data Wrangling Cheat Sheet <img src="images/hex/dplyr.png" class="title-hex">

.center[
<a href='https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf' target='_new'><img src='images/data-transformation.png' width='700' /></a>
]

---

# Tidyverse vs Base R <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/pipe.png" class="title-hex">

.center[
<a href='images/R_Syntax_Comparison.jpeg' target='_new'><img src="images/R_Syntax_Comparison.jpeg" width='700' /></a>
]

---

# Pipes `%>%` <img src="images/hex/magrittr.png" class="title-hex">

<img src='images/magrittr_pipe.jpg' align='right' width='275' />

.font90[
The pipe operator (`%>%`) introduced with the `magrittr` R package allows for the chaining of R operations. It takes the output from the left-hand side and passes it as the first parameter to the function on the right-hand side.

In base R, to get the output of a proportional table, you need to first call `table` then `prop.table`.
]

.pull-left[
You can do this in two steps:

```r
tab_out <- table(legosets$category)
prop.table(tab_out)
```

Or as nested function calls.
```r
prop.table(table(legosets$category))
```
]

.pull-right[
Using the pipe (`%>%`) operator, we can chain these calls in what is arguably a more readable format:

```r
table(legosets$category) %>% prop.table()
```
]

<hr />

```
## 
##        Book  Collection    Extended        Gear      Normal       Other      Random 
## 0.028798533 0.032100275 0.025191073 0.143564659 0.713420972 0.054050749 0.002873739
```

---

# Filter <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/dplyr_filter_sm.png' width='800' />
]

---

# Logical Operators

* `!a` - TRUE if a is FALSE
* `a == b` - TRUE if a and b are equal
* `a != b` - TRUE if a and b are not equal
* `a > b` - TRUE if a is larger than b, but not equal
* `a >= b` - TRUE if a is larger than or equal to b
* `a < b` - TRUE if a is smaller than b, but not equal
* `a <= b` - TRUE if a is smaller than or equal to b
* `a %in% b` - TRUE if a is in b, where b is a vector

```r
which( letters %in% c('a','e','i','o','u') )
```

```
## [1]  1  5  9 15 21
```

* `a | b` - TRUE if a *or* b is TRUE
* `a & b` - TRUE if a *and* b are TRUE
* `isTRUE(a)` - TRUE if a is TRUE

---

# Filter <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

```r
mylego <- legosets %>% filter(themeGroup == 'Educational' & year > 2015)
```

### Base R

```r
mylego <- legosets[legosets$themeGroup == 'Educational' & legosets$year > 2015,]
```

<hr />

```r
nrow(mylego)
```

```
## [1] 61
```

---

# Select <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

```r
mylego <- mylego %>% select(setID, pieces, theme, availability, US_retailPrice, minifigs)
```

### Base R

```r
mylego <- mylego[,c('setID', 'pieces', 'theme', 'availability', 'US_retailPrice', 'minifigs')]
```

<hr />

```r
head(mylego, n = 4)
```

```
##   setID pieces     theme    availability US_retailPrice minifigs
## 1 26803    103 Education {Not specified}             NA        6
## 2 
26689 142 Education {Not specified} NA 4 ## 3 26804 98 Education {Not specified} NA 6 ## 4 26277 188 Education Educational 78.95 NA ``` --- # Relocate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex"> .center[ <img src='images/dplyr_relocate.png' width='800' /> ] --- # Relocate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex"> ### `dplyr` ```r mylego %>% relocate(where(is.numeric), .after = where(is.character)) %>% head(n = 3) ``` ``` ## theme availability setID pieces US_retailPrice minifigs ## 1 Education {Not specified} 26803 103 NA 6 ## 2 Education {Not specified} 26689 142 NA 4 ## 3 Education {Not specified} 26804 98 NA 6 ``` ### Base R ```r mylego2 <- mylego[,c('theme', 'availability', 'setID', 'pieces', 'US_retailPrice', 'minifigs')] head(mylego2, n = 3) ``` ``` ## theme availability setID pieces US_retailPrice minifigs ## 1 Education {Not specified} 26803 103 NA 6 ## 2 Education {Not specified} 26689 142 NA 4 ## 3 Education {Not specified} 26804 98 NA 6 ``` --- # Rename <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex"> .center[ <img src='images/rename_sm.jpg' width='1000' /> ] --- # Rename <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex"> ### `dplyr` ```r mylego %>% dplyr::rename(USD = US_retailPrice) %>% head(n = 3) ``` ``` ## setID pieces theme availability USD minifigs ## 1 26803 103 Education {Not specified} NA 6 ## 2 26689 142 Education {Not specified} NA 4 ## 3 26804 98 Education {Not specified} NA 6 ``` ### Base R ```r names(mylego2)[5] <- 'USD' head(mylego2, n = 3) ``` ``` ## theme availability setID pieces USD minifigs ## 1 Education {Not specified} 26803 103 NA 6 ## 2 Education {Not specified} 26689 142 NA 4 ## 3 Education {Not specified} 26804 98 NA 6 ``` --- # Mutate <img src="images/hex/tidyverse.png" class="title-hex"><img 
src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/dplyr_mutate.png' width='700' />
]

---

# Mutate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

```r
mylego %>% filter(!is.na(pieces) & !is.na(US_retailPrice)) %>%
    mutate(Price_per_piece = US_retailPrice / pieces) %>%
    head(n = 3)
```

```
##   setID pieces     theme availability US_retailPrice minifigs Price_per_piece
## 1 26277    188 Education  Educational          78.95       NA       0.4199468
## 2 25949    280 Education  Educational         224.95       NA       0.8033929
## 3 25954      1 Education  Educational          14.95       NA      14.9500000
```

### Base R

```r
# Keep rows where both the price and the piece count are present
mylego2 <- mylego[!is.na(mylego$US_retailPrice) & !is.na(mylego$pieces),]
# Price per piece = price divided by number of pieces
mylego2$Price_per_piece <- mylego2$US_retailPrice / mylego2$pieces
head(mylego2, n = 3)
```

---

# Group By and Summarize <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.code80[
```r
legosets %>% group_by(themeGroup) %>%
    summarize(mean_price = mean(US_retailPrice, na.rm = TRUE),
              sd_price = sd(US_retailPrice, na.rm = TRUE),
              median_price = median(US_retailPrice, na.rm = TRUE),
              n = n(),
              missing = sum(is.na(US_retailPrice)))
```

```
## # A tibble: 15 × 6
##    themeGroup       mean_price sd_price median_price     n missing
##    <chr>                 <dbl>    <dbl>        <dbl> <int>   <int>
##  1 Action/Adventure       31.3     29.9         20.0  1280     462
##  2 Basic                  13.1     12.8         7.99   843     473
##  3 Constraction           15.1     14.0         9.99   501     125
##  4 Educational            89.0     107. 
59.7 452 294
##  5 Girls                  23.4     22.6         15.0   677     225
##  6 Historical             25.5     27.7         15.0   473     125
##  7 Junior                 18.6     13.2         17.8   228      93
##  8 Licensed               42.9     58.3         25.0  2060     467
##  9 Miscellaneous          14.3     20.8         6.99  4925    2117
## 10 Model making           52.8     65.1         30.0   582     166
## 11 Modern day             31.2     33.7         20.0  1723     763
## 12 Pre-school             23.8     19.4         20.0  1487     699
## 13 Racing                 24.8     30.2           10   270      59
## 14 Technical              60.8     68.1         40.0   550     137
## 15 Vintage                9.71     9.56         7.50   304     264
```
]

---

# Describe and Describe By

```r
library(psych)
describe(legosets$US_retailPrice)
```

```
##    vars    n  mean sd median trimmed   mad min    max  range skew kurtosis   se
## X1    1 9886 28.52 42  14.99   20.14 14.83   0 799.99 799.99 5.62    58.91 0.42
```

```r
describeBy(legosets$US_retailPrice, group = legosets$availability, mat = TRUE, skew = FALSE)
```

```
##      item                group1 vars    n      mean        sd   min    max  range         se
## X11     1       {Not specified}    1 3197  24.24484 36.282072  0.60 789.99 789.39  0.6416833
## X12     2           Educational    1    9 140.95000 86.358265 14.95 244.95 230.00 28.7860885
## X13     3        LEGO exclusive    1 1066  28.79797 70.954538  0.00 799.99 799.99  2.1732094
## X14     4    LEGOLAND exclusive    1    7  12.70429  6.447591  4.99  19.99  15.00  2.4369603
## X15     5              Not sold    1    1  12.99000        NA 12.99  12.99   0.00         NA
## X16     6           Promotional    1  167   9.19485 23.667555  0.00 249.99 249.99  1.8314504
## X17     7 Promotional (Airline)    1   11  15.79455  6.614819  5.00  28.00  23.00  1.9944429
## X18     8                Retail    1 4824  29.82030 33.270049  1.95 399.99 398.04  0.4790158
## X19     9      Retail - limited    1  600  44.64837 57.391438  0.40 379.99 379.59  2.3429956
## X110   10               Unknown    1    4   2.24750  1.253671  1.00   3.99   2.99  0.6268356
```

---
class: middle

# Grammar of Graphics

.center[
<img src="images/ggplot2_masterpiece.png" height="550" />
]

---

# Data Visualizations with ggplot2 <img src="images/hex/ggplot2.png" class="title-hex">

* `ggplot2` is an R package that provides an alternative framework based upon Wilkinson’s (2005) Grammar of Graphics.
* `ggplot2` is, in general, more flexible for creating "prettier" and complex plots.
* Works by creating layers of different types of objects/geometries (i.e. bars, points, lines, polygons, etc.) `ggplot2` has at least three ways of creating plots: 1. `qplot` 2. `ggplot(...) + geom_XXX(...) + ...` 3. `ggplot(...) + layer(...)` * We will focus only on the second. --- # Parts of a `ggplot2` Statement <img src="images/hex/ggplot2.png" class="title-hex"> * Data `ggplot(myDataFrame, aes(x=x, y=y))` * Layers `geom_point()`, `geom_histogram()` * Facets `facet_wrap(~ cut)`, `facet_grid(~ cut)` * Scales `scale_y_log10()` * Other options `ggtitle('my title')`, `ylim(c(0, 10000))`, `xlab('x-axis label')` --- # Lots of geoms <img src="images/hex/ggplot2.png" class="title-hex"> ```r ls('package:ggplot2')[grep('^geom_', ls('package:ggplot2'))] ``` ``` ## [1] "geom_abline" "geom_area" "geom_bar" "geom_bin_2d" ## [5] "geom_bin2d" "geom_blank" "geom_boxplot" "geom_col" ## [9] "geom_contour" "geom_contour_filled" "geom_count" "geom_crossbar" ## [13] "geom_curve" "geom_density" "geom_density_2d" "geom_density_2d_filled" ## [17] "geom_density2d" "geom_density2d_filled" "geom_dotplot" "geom_errorbar" ## [21] "geom_errorbarh" "geom_freqpoly" "geom_function" "geom_hex" ## [25] "geom_histogram" "geom_hline" "geom_jitter" "geom_label" ## [29] "geom_line" "geom_linerange" "geom_map" "geom_path" ## [33] "geom_point" "geom_pointrange" "geom_polygon" "geom_qq" ## [37] "geom_qq_line" "geom_quantile" "geom_raster" "geom_rect" ## [41] "geom_ribbon" "geom_rug" "geom_segment" "geom_sf" ## [45] "geom_sf_label" "geom_sf_text" "geom_smooth" "geom_spoke" ## [49] "geom_step" "geom_text" "geom_tile" "geom_violin" ## [53] "geom_vline" ``` --- # Data Visualization Cheat Sheet <img src="images/hex/ggplot2.png" class="title-hex"> .center[ <a href='https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf'><img src='images/data-visualization-2.1.png' width='700' /></a> ] --- # Scatterplot <img src="images/hex/ggplot2.png" class="title-hex"> ```r ggplot(legosets, 
aes(x=pieces, y=US_retailPrice)) + geom_point() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" /> --- # Scatterplot (cont.) <img src="images/hex/ggplot2.png" class="title-hex"> ```r ggplot(legosets, aes(x=pieces, y=US_retailPrice, color=availability)) + geom_point() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" /> --- # Scatterplot (cont.) <img src="images/hex/ggplot2.png" class="title-hex"> ```r ggplot(legosets, aes(x=pieces, y=US_retailPrice, size=minifigs, color=availability)) + geom_point() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" /> --- # Scatterplot (cont.) <img src="images/hex/ggplot2.png" class="title-hex"> ```r ggplot(legosets, aes(x=pieces, y=US_retailPrice, size=minifigs)) + geom_point() + facet_wrap(~ availability) ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-36-1.png" style="display: block; margin: auto;" /> --- # Boxplots <img src="images/hex/ggplot2.png" class="title-hex"> ```r ggplot(legosets, aes(x='Lego', y=US_retailPrice)) + geom_boxplot() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-37-1.png" style="display: block; margin: auto;" /> --- # Boxplots (cont.) <img src="images/hex/ggplot2.png" class="title-hex"> ```r ggplot(legosets, aes(x=availability, y=US_retailPrice)) + geom_boxplot() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" /> --- # Boxplot (cont.) 
<img src="images/hex/ggplot2.png" class="title-hex"> ```r ggplot(legosets, aes(x=availability, y=US_retailPrice)) + geom_boxplot() + coord_flip() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" /> --- # Histograms <img src="images/hex/ggplot2.png" class="title-hex"> ```r ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" /> --- # Histograms (cont.)<img src="images/hex/ggplot2.png" class="title-hex"> ```r ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram() + scale_x_log10() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" /> --- # Histograms (cont.) <img src="images/hex/ggplot2.png" class="title-hex"> ```r ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram() + facet_wrap(~ availability) ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-42-1.png" style="display: block; margin: auto;" /> --- # Density Plots <img src="images/hex/ggplot2.png" class="title-hex"> ```r ggplot(legosets, aes(x = US_retailPrice, color = availability)) + geom_density() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-43-1.png" style="display: block; margin: auto;" /> --- # `ggplot2` aesthetics <img src="images/hex/ggplot2.png" class="title-hex"> .center[ <a href='images/ggplot_aesthetics_cheatsheet.png' target='_new'> <img src='images/ggplot_aesthetics_cheatsheet.png' height='550' /></a> ] --- # Likert Scales <img src="images/hex/likert.png" class="title-hex"> Likert scales are a type of questionnaire where respondents are asked to rate items on scales usually ranging from four to seven levels (e.g. strongly disagree to strongly agree). 
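Responses of this kind are ordinal, so one common way to hold them in base R is an ordered factor. The sketch below uses made-up responses on a hypothetical four-level scale (not the `pisaitems` data) to show the contingency table, proportional table, and median that suit ordinal data.

```r
# Hypothetical four-level scale and made-up responses (illustration only)
likert_levels <- c("Strongly disagree", "Disagree", "Agree", "Strongly agree")
responses <- factor(
  c("Agree", "Disagree", "Strongly agree", "Agree", "Strongly disagree"),
  levels = likert_levels,
  ordered = TRUE
)

# Contingency and proportional tables follow the level order
response_counts <- table(responses)
response_counts
prop.table(response_counts)

# Median response, found through the underlying integer codes
# (with an odd number of responses the median code is a whole number)
median_response <- likert_levels[median(as.integer(responses))]
median_response
```

The `likert` package wraps this kind of summary for whole sets of survey items, as in the following example: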
```r library(likert) library(reshape) data(pisaitems) items24 <- pisaitems[,substr(names(pisaitems), 1,5) == 'ST24Q'] items24 <- rename(items24, c( ST24Q01="I read only if I have to.", ST24Q02="Reading is one of my favorite hobbies.", ST24Q03="I like talking about books with other people.", ST24Q04="I find it hard to finish books.", ST24Q05="I feel happy if I receive a book as a present.", ST24Q06="For me, reading is a waste of time.", ST24Q07="I enjoy going to a bookstore or a library.", ST24Q08="I read only to get information that I need.", ST24Q09="I cannot sit still and read for more than a few minutes.", ST24Q10="I like to express my opinions about books I have read.", ST24Q11="I like to exchange books with my friends.")) ``` --- # `likert` R Package <img src="images/hex/likert.png" class="title-hex"> ```r l24 <- likert(items24) summary(l24) ``` ``` ## Item low neutral high mean sd ## 10 I like to express my opinions about books I have read. 41.07516 0 58.92484 2.604913 0.9009968 ## 5 I feel happy if I receive a book as a present. 46.93475 0 53.06525 2.466751 0.9446590 ## 8 I read only to get information that I need. 50.39874 0 49.60126 2.484616 0.9089688 ## 7 I enjoy going to a bookstore or a library. 51.21231 0 48.78769 2.428508 0.9164136 ## 3 I like talking about books with other people. 54.99129 0 45.00871 2.328049 0.9090326 ## 11 I like to exchange books with my friends. 55.54115 0 44.45885 2.343193 0.9609234 ## 2 Reading is one of my favorite hobbies. 56.64470 0 43.35530 2.344530 0.9277495 ## 1 I read only if I have to. 58.72868 0 41.27132 2.291811 0.9369023 ## 4 I find it hard to finish books. 65.35125 0 34.64875 2.178299 0.8991628 ## 9 I cannot sit still and read for more than a few minutes. 76.24524 0 23.75476 1.974736 0.8793028 ## 6 For me, reading is a waste of time. 
82.88729 0 17.11271 1.810093 0.8611554
```

---

# `likert` Plots <img src="images/hex/likert.png" class="title-hex">

```r
plot(l24)
```

<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-46-1.png" style="display: block; margin: auto;" />

---

# `likert` Plots <img src="images/hex/likert.png" class="title-hex">

```r
plot(l24, type='heat')
```

<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-47-1.png" style="display: block; margin: auto;" />

---

# `likert` Plots <img src="images/hex/likert.png" class="title-hex">

```r
plot(l24, type='density')
```

<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-48-1.png" style="display: block; margin: auto;" />

---
class: font90

# Dual Scales <img src="images/hex/shiny.png" class="title-hex">

Some problems<sup>1</sup>:

* The designer has to make choices about scales, and this can have a big impact on the viewer
* "Cross-over points" where one series crosses another are results of the design choices, not intrinsic to the data, and viewers (particularly unsophisticated viewers) may not realize this
* They make it easier to lazily associate correlation with causation, not taking into account autocorrelation and other time-series issues
* Because of the issues above, in malicious hands they make it possible to deliberately mislead

This example looks at the relationship between the NZ dollar exchange rate and the trade weighted index.

```r
DATA606::shiny_demo('DualScales', package='DATA606')
```

My advice:

* Avoid using them. You can usually do better with other plot types.
* When necessary (or compelled) to use them, rescale (using z-scores, which we'll discuss in a few weeks)

.font50[
<sup>1</sup> http://blog.revolutionanalytics.com/2016/08/dual-axis-time-series.html
<sup>2</sup> http://ellisp.github.io/blog/2016/08/18/dualaxes
]

---

# Pie Charts

There is only one pie chart in *OpenIntro Statistics* (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48).
Consider the following three pie charts that represent the preference of five different colors. Is there a difference between the three pie charts? This is probably difficult to answer.

<center><img src='images/Pie.png' width='500'></center>

---

# Pie Charts

There is only one pie chart in *OpenIntro Statistics* (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts that represent the preference of five different colors. Is there a difference between the three pie charts? This is probably difficult to answer.

<center><img src='images/Pie.png' width='500'></center>

<center><img src='images/Bar.png' width='500'></center>

Source: [https://en.wikipedia.org/wiki/Pie_chart](https://en.wikipedia.org/wiki/Pie_chart).

---
class: middle

# Just say NO to pie charts!

.font150[
"There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart"]

.right[.font130[John Tukey]]

---

# Additional Resources

For data wrangling:

* `dplyr` website: https://dplyr.tidyverse.org
* R for Data Science book: https://r4ds.had.co.nz/wrangle-intro.html
* Wrangling penguins tutorial: https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome
* Data transformation cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf

For data visualization:

* `ggplot2` website: https://ggplot2.tidyverse.org
* R for Data Science book: https://r4ds.had.co.nz/data-visualisation.html
* R Graphics Cookbook: https://r-graphics.org
* Data visualization cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf

---
class: left

# One Minute Paper

.font140[
Complete the one minute paper: https://forms.gle/ENFqTnDB5fJDw3kx9

1. What was the most important thing you learned during this class?
2. What important question remains unanswered for you?
]