class: center, middle, inverse, title-slide

# Tidyverse and ggplot
### Niels Richard Hansen
### February 6, 2020

---
layout: true
background-image: url(KUlogo.pdf)
background-size: cover
background-position: left

---
## Tidyverse packages

```r
library("tidyverse")
```

```
── Attaching packages ──────────────────────────────────────────────── tidyverse 1.3.0 ──
```

```
✓ ggplot2 3.2.1     ✓ purrr   0.3.3
✓ tibble  2.1.3     ✓ dplyr   0.8.3
✓ tidyr   1.0.2     ✓ stringr 1.4.0
✓ readr   1.3.1     ✓ forcats 0.4.0
```

```
── Conflicts ─────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
```

The tidyverse package loads a number of R packages that have been developed to support a cleaner, more efficient and more coherent framework for data handling and data transformation in R.

---
## Core tidyverse packages
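
The startup message on the previous slide lists the eight packages that `library("tidyverse")` attaches in tidyverse 1.3.0. A minimal sketch for checking this in your own session; the vector name `core` is only illustrative, and the package list is copied from that startup message rather than queried from the package.

```r
## Load tidyverse quietly; the startup message is suppressed here
suppressPackageStartupMessages(library("tidyverse"))

## Packages listed as attached in the tidyverse 1.3.0 startup message
core <- c("ggplot2", "purrr", "tibble", "dplyr",
          "tidyr", "stringr", "readr", "forcats")

## search() lists attached packages as "package:<name>"
attached <- paste0("package:", core) %in% search()
setNames(attached, core)
```

Newer tidyverse versions may attach additional packages, so the list above should be read as a snapshot of version 1.3.0.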
---
## Reading data

```r
## Using the readr package
traffic <- read_csv("./Data/traffic_small.csv")
```

```
Parsed with column specification:
cols(
  `machine name` = col_character(),
  `user id` = col_double(),
  size = col_double(),
  time = col_double(),
  date = col_date(format = ""),
  month = col_character()
)
```

This data set is from a [classical 1995 data set on Boston University WWW data transfers](Data/traffic_small.csv). The `size` column contains sizes of downloaded documents in bytes and the `time` column contains the time in seconds.

---
## Data

```r
## This is a tibble
traffic
```

```
# A tibble: 16,414 x 6
  `machine name` `user id`  size  time date       month
  <chr>              <dbl> <dbl> <dbl> <date>     <chr>
1 cs18              146579  2464 0.493 1995-04-24 1995-04
2 cs18              995988  7745 0.326 1995-04-24 1995-04
3 cs18              317649  6727 0.314 1995-04-24 1995-04
4 cs18              748501 13049 0.583 1995-04-24 1995-04
5 cs18              955815   356 0.259 1995-04-24 1995-04
6 cs18              596819 15063 0.336 1995-04-24 1995-04
# … with 1.641e+04 more rows
```

A tibble is a data table that behaves in most respects like a data frame.

---
## Exercise

Download the data set `ballon.txt` from Absalon. Read the data set into R as

* a data frame using a base R function
* a tibble using the readr package

In both cases, use the RStudio "Import Dataset" functionality to construct the code you need to load the data set.

After solving this exercise, you should have a data frame or a tibble in the workspace called `ballon`.

---
## ggplot

If you want to use ggplot, you need to load the package ggplot2 (or tidyverse).

```r
library(ggplot2)
```

--

You can make simple scatterplots quickly using `qplot` (quick plot).

```r
qplot(rnorm(100), rnorm(100))  ## Uhhh, grey theme :-(
```

<img src="20-02-06-Tidy_and_Vis_files/figure-html/unnamed-chunk-3-1.png" width="700" style="display: block; margin: auto;" />

---
## ggplot

I always use a different theme.

```r
theme_set(theme_bw())
```

--

Newbie ggplot users are spotted by their use of the default grey theme!
```r
qplot(rnorm(100), rnorm(100))  ## Ahh, black-and-white theme :-)
```

<img src="20-02-06-Tidy_and_Vis_files/figure-html/unnamed-chunk-5-1.png" width="700" style="display: block; margin: auto;" />

---
## A simple bar chart

```r
p <- ggplot(data = traffic, aes(x = `machine name`)) +
  geom_bar()
## `geom_bar` tabulates the data and constructs the 'count' variable
p
```

<img src="20-02-06-Tidy_and_Vis_files/figure-html/simple_bar-1.png" width="900" style="display: block; margin: auto;" />

---
## Flipping the bar chart

```r
p + coord_flip()
```

<img src="20-02-06-Tidy_and_Vis_files/figure-html/simple_bar_fix-1.png" width="900" style="display: block; margin: auto;" />

---
## A simple histogram

```r
ggplot(data = traffic, aes(x = log10(time))) +
  geom_histogram()
```

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="20-02-06-Tidy_and_Vis_files/figure-html/simple_hist-1.png" width="900" style="display: block; margin: auto;" />

---
## Some manual control

```r
ggplot(data = traffic, aes(x = time)) +
  geom_histogram(bins = 20) +
  scale_x_log10()
```

<img src="20-02-06-Tidy_and_Vis_files/figure-html/simple_hist:again-1.png" width="900" style="display: block; margin: auto;" />

---
## A density plot

```r
ggplot(data = traffic, aes(x = time, fill = `machine name`)) +
  geom_density() +
  scale_x_log10()
```

<img src="20-02-06-Tidy_and_Vis_files/figure-html/unnamed-chunk-6-1.png" width="900" style="display: block; margin: auto;" />

---
## Controlling fill color and transparency

```r
ggplot(data = traffic, aes(x = time, fill = `machine name`)) +
  geom_density(alpha = 0.5) +
  scale_x_log10() +
  scale_fill_manual(values = c("red", "blue", "yellow"))
```

<img src="20-02-06-Tidy_and_Vis_files/figure-html/unnamed-chunk-7-1.png" width="900" style="display: block; margin: auto;" />

---
## Stacking densities

```r
ggplot(data = traffic, aes(x = time, fill = `machine name`)) +
  geom_density(position = "stack") +
  scale_x_log10() +
  scale_fill_manual(values = c("red", "blue", "yellow"))
```

<img src="20-02-06-Tidy_and_Vis_files/figure-html/unnamed-chunk-8-1.png" width="900" style="display: block; margin: auto;" />

---
## Exercise

Construct a histogram and/or a density plot of the variable `Tid` in the ballon data set. Use different colors in the plot for the different colors of balloons. Experiment with stacking and/or transparency.

Can you match the colors used in your plot to the colors of the balloons?