class: center, middle, inverse, title-slide # Data og Tidyverse ### Niels Richard Hansen ### 8. september, 2022 --- ## Tidyverse pakker ```r library("tidyverse") # Must be installed! ``` ``` ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ── ``` ``` ## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4 ## ✓ tibble 3.1.4 ✓ dplyr 1.0.7 ## ✓ tidyr 1.1.3 ✓ stringr 1.4.0 ## ✓ readr 2.0.1 ✓ forcats 0.5.1 ``` ``` ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() ``` Tidyverse pakken loader et antal R pakker, som udgør en sammenhængende og effektiv ramme til håndtering, transformering og visualisering af data. --- ## Kernepakker
Vi vil i dag fokusere på pakkerne `dplyr`, `tidyr` og `readr`. Der vil komme en separat video om visualisering og `ggplot2`. --- ## Datatabeller Vi organiserer data i tabeller. Generiske termer: **data frame** eller **datatabel**
```r library(nycflights13) # Must be installed! class(flights) # This is a tibble ``` ``` ## [1] "tbl_df" "tbl" "data.frame" ``` --- ## flights data ```r flights ``` ``` ## # A tibble: 336,776 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>, ## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, ## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ## Kategorisering af variable
``` ## # A tibble: 336,776 × 5 ## carrier distance dep_delay tailnum flight ## <chr> <dbl> <dbl> <chr> <int> ## 1 UA 1400 2 N14228 1545 ## 2 UA 1416 4 N24211 1714 ## 3 AA 1089 2 N619AA 1141 ## # … with 336,773 more rows ``` * `carrier` er en kategorisk variabel, datatypen er `character` * `distance` er en numerisk variabel, datatypen er `double` * `dep_delay` er en numerisk variabel, datatypen er `double` (kunne være `integer`) * `tailnum` er en **kategorisk** variabel, datatypen er `character` * `flight` (number) er en **kategorisk** variabel, datatypen er `integer`! --- ## Filtrering ```r # Alaska Airlines filter(flights, carrier == "AS") ``` ``` ## # A tibble: 714 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 724 725 -1 1020 1030 ## 2 2013 1 1 1808 1815 -7 2111 2130 ## 3 2013 1 2 722 725 -3 949 1030 ## # … with 711 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>, ## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` ```r flights ``` ``` ## # A tibble: 336,776 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## # … with 336,773 more rows, and 11 more variables: arr_delay <dbl>, ## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, ## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ## Filtrering Forskellige betingelser kan kombineres i logiske udtryk. ```r filter(flights, (carrier == "AS" | carrier == "AA") & tailnum == "N200AA") ``` ``` ## # A tibble: 34 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 4 NA 1430 NA NA 1735 ## 2 2013 1 21 913 920 -7 1245 1245 ## 3 2013 10 7 1233 1230 3 1430 1405 ## 4 2013 10 22 1711 1715 -4 2013 2015 ## 5 2013 10 31 1705 1715 -10 1841 1905 ## 6 2013 11 14 927 929 -2 1220 1234 ## 7 2013 11 16 1220 1225 -5 1513 1510 ## 8 2013 12 31 1718 1720 -2 2027 2030 ## 9 2013 2 19 1316 1300 16 1439 1440 ## 10 2013 3 5 1900 1910 -10 2146 2215 ## # … with 24 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>, ## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` Flyvninger, hvor fly N200AA har fløjet for enten Alaska Airlines eller American Airlines. --- ## Filtrering og variabelselektion ```r Alaska_flights <- filter(flights, carrier == "AS") Alaska_flights ``` ``` ## # A tibble: 714 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 724 725 -1 1020 1030 ## 2 2013 1 1 1808 1815 -7 2111 2130 ## 3 2013 1 2 722 725 -3 949 1030 ## 4 2013 1 2 1818 1815 3 2131 2130 ## 5 2013 1 3 724 725 -1 1012 1030 ## 6 2013 1 3 1817 1815 2 2121 2130 ## 7 2013 1 4 725 725 0 1031 1030 ## 8 2013 1 4 1808 1815 -7 2101 2130 ## 9 2013 1 5 725 725 0 1011 1030 ## 10 2013 1 5 1803 1815 -12 2118 2130 ## # … with 704 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>, ## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` ```r Alaska_flights <- select( Alaska_flights, c("carrier", "distance", "dep_delay", "tailnum", "flight") ) ``` --- ## Filtrering og variabelselektion ```r Alaska_flights ``` ``` ## # A tibble: 714 × 5 ## carrier distance dep_delay tailnum flight ## <chr> <dbl> <dbl> <chr> <int> ## 1 AS 2402 -1 N594AS 11 ## 2 AS 2402 -7 N553AS 7 ## 3 AS 2402 -3 N592AS 11 ## # … with 711 more rows ``` Det er muligt at udvælge variable ved hjælp af en række selektionskriterier, f.eks. ```r select(flights, dep_time:dep_delay) ``` ``` ## # A tibble: 336,776 × 3 ## dep_time sched_dep_time dep_delay ## <int> <int> <dbl> ## 1 517 515 2 ## 2 533 529 4 ## 3 542 540 2 ## # … with 336,773 more rows ``` --- ## Quiz Hvad er resultatet af følgende filtrering? ```r filter(flights, carrier == "AS" & carrier == "AA") ``` Hvad er dimensionerne af tabellen efter følgende selektion af variable? ```r select(flights, year:day) ``` --- ## Løsning ```r filter(flights, carrier == "AS" & carrier == "AA") ``` ```r select(flights, year:day) ``` --- ## Pipe Vi kan kæde flere operationer sammen med **nestede funktionskald** ```r select( filter(flights, (carrier == "AS" | carrier == "AA") & tailnum == "N200AA"), year:day ) ``` ``` ## # A tibble: 34 × 3 ## year month day ## <int> <int> <int> ## 1 2013 1 4 ## 2 2013 1 21 ## 3 2013 10 7 ## 4 2013 10 22 ## 5 2013 10 31 ## 6 2013 11 14 ## 7 2013 11 16 ## 8 2013 12 31 ## 9 2013 2 19 ## 10 2013 3 5 ## # … with 24 more rows ``` --- ## Pipe Pipe operatoren `%>%` er syntaktisk sukker, der gør nestede funktionskald læsbare. ```r flights %>% filter((carrier == "AS" | carrier == "AA") & tailnum == "N200AA") %>% select(year:day) ``` ``` ## # A tibble: 34 × 3 ## year month day ## <int> <int> <int> ## 1 2013 1 4 ## 2 2013 1 21 ## 3 2013 10 7 ## 4 2013 10 22 ## 5 2013 10 31 ## 6 2013 11 14 ## 7 2013 11 16 ## 8 2013 12 31 ## 9 2013 2 19 ## 10 2013 3 5 ## # … with 24 more rows ``` --- ## Aktiv vs passiv undervisning I et eksperiment blev 157 fysikstuderende delt i to grupper på et kursus i statik. I gruppen `active` inkluderede undervisningsformen aktivt de studerende. I gruppen `passive` inkluderede undervisningsformen ikke de studerende aktivt. De studerende tog efterfølgende en test og evaluerede underviseren ved at svare på, i hvor høj grad de var enige i udsagnet *The instructor was effective at teaching* Testscoren er numerisk i intervallet `\([0,1]\)` og evalueringen antager de fem værdier `Strongly disagree` `Disagree` `Neither agree nor disagree` `Agree` `Strongly agree` --- ## Aktiv vs passiv undervisning Data er i tabellen `statics` ```r statics ``` ``` ## # A tibble: 157 × 3 ## Format Score Evaluation ## <chr> <dbl> <chr> ## 1 active 0.78 Strongly agree ## 2 active 0.47 Strongly agree ## 3 active 0.65 Neither agree nor disagree ## 4 active 0.62 Agree ## 5 active 0.73 Agree ## 6 active 0.9 Agree ## 7 active 0.78 Neither agree nor disagree ## 8 active 0.71 Disagree ## 9 active 0.73 Strongly agree ## 10 active 0.68 Strongly agree ## # … with 147 more rows ``` Spørgsmål 1: Betyder undervisningsformen noget for, hvordan de studerende evaluerer underviseren? --- ## Tabulering ```r summarize(statics, count = n()) ``` ``` ## # A tibble: 1 × 1 ## count ## <int> ## 1 157 ``` ```r statics %>% group_by(Evaluation) %>% summarize(count = n()) ``` ``` ## # A tibble: 5 × 2 ## Evaluation count ## <chr> <int> ## 1 Agree 38 ## 2 Disagree 10 ## 3 Neither agree nor disagree 19 ## 4 Strongly agree 85 ## 5 Strongly disagree 5 ``` Stof til eftertanke: Hvordan er det lige at værdierne er ordnede? (MathDive Exercise 1.2) --- ## Tabulering for de to undervisningsformer .small[ ```r statics %>% * filter(Format == "active") %>% group_by(Evaluation) %>% summarize(count = n()) ``` ``` ## # A tibble: 5 × 2 ## Evaluation count ## <chr> <int> ## 1 Agree 22 ## 2 Disagree 9 ## 3 Neither agree nor disagree 12 ## 4 Strongly agree 35 ## 5 Strongly disagree 2 ``` ] .small[ ```r statics %>% * filter(Format == "passive") %>% group_by(Evaluation) %>% summarize(count = n()) ``` ``` ## # A tibble: 5 × 2 ## Evaluation count ## <chr> <int> ## 1 Agree 16 ## 2 Disagree 1 ## 3 Neither agree nor disagree 7 ## 4 Strongly agree 50 ## 5 Strongly disagree 3 ``` ] Kan vi ikke gøre det nemmere? --- ## Krydstabulering ```r statics %>% group_by(Format, Evaluation) %>% summarize(count = n()) ``` ``` ## # A tibble: 10 × 3 ## # Groups: Format [2] ## Format Evaluation count ## <chr> <chr> <int> ## 1 active Agree 22 ## 2 active Disagree 9 ## 3 active Neither agree nor disagree 12 ## 4 active Strongly agree 35 ## 5 active Strongly disagree 2 ## 6 passive Agree 16 ## 7 passive Disagree 1 ## 8 passive Neither agree nor disagree 7 ## 9 passive Strongly agree 50 ## 10 passive Strongly disagree 3 ``` --- # Pause <img src="coffee.jpeg" width="400" style="display: block; margin: auto;" /> --- ## Opsummering * `filter()` udvælger **rækker** (observationer), som opfylder en betingelse. Betingelser formuleres som logiske udtryk i termer af variable i data. * `select()` udvælger **søjler** (variable) i henhold til et `tidy-select` udtryk. Et udtryk kan være variabelnavne, men også meget andet. * `%>%` (pipe) kombinerer to eller flere operationer. Ækvivalent med nestede funktionskald, men lettere at læse. * `summarize(count = n())` tæller. I kombination med `group_by()` indenfor hver kombination af værdier for en eller flere variable. * `read_csv()` indlæser data fra en tekstfil med "comma separated variables". --- ## Indlæsning af data ```r statics <- read_csv( "../Data/statics.csv", col_types = cols( Evaluation = col_factor( levels = c( "Strongly disagree", "Disagree", "Neither agree nor disagree", "Agree", "Strongly agree" ) ) ) ) ``` --- ## Evaluation som faktor ```r statics ``` ``` ## # A tibble: 157 × 3 ## Format Score Evaluation ## <chr> <dbl> <fct> ## 1 active 0.78 Strongly agree ## 2 active 0.47 Strongly agree ## 3 active 0.65 Neither agree nor disagree ## 4 active 0.62 Agree ## 5 active 0.73 Agree ## 6 active 0.9 Agree ## 7 active 0.78 Neither agree nor disagree ## 8 active 0.71 Disagree ## 9 active 0.73 Strongly agree ## 10 active 0.68 Strongly agree ## # … with 147 more rows ``` ```r statics$Evaluation[1:4] ``` ``` ## [1] Strongly agree Strongly agree ## [3] Neither agree nor disagree Agree ## 5 Levels: Strongly disagree Disagree Neither agree nor disagree ... Strongly agree ``` --- ## Krydstabulering igen ```r long_tab <- statics %>% group_by(Format, Evaluation) %>% summarize(count = n()) ``` ```r long_tab ``` ``` ## # A tibble: 10 × 3 ## # Groups: Format [2] ## Format Evaluation count ## <chr> <fct> <int> ## 1 active Strongly disagree 2 ## 2 active Disagree 9 ## 3 active Neither agree nor disagree 12 ## 4 active Agree 22 ## 5 active Strongly agree 35 ## 6 passive Strongly disagree 3 ## 7 passive Disagree 1 ## 8 passive Neither agree nor disagree 7 ## 9 passive Agree 16 ## 10 passive Strongly agree 50 ``` --- ## Pivoting ```r long_tab %>% pivot_wider(names_from = Format, values_from = count) ``` ``` ## # A tibble: 5 × 3 ## Evaluation active passive ## <fct> <int> <int> ## 1 Strongly disagree 2 3 ## 2 Disagree 9 1 ## 3 Neither agree nor disagree 12 7 ## 4 Agree 22 16 ## 5 Strongly agree 35 50 ``` En "bred" tabel er ikke tidy! Søjler er ikke variable, men værdier for en variabel. "Lange" tabeller er tidy. --- ## Søjlediagrammer (bar plots) ```r long_tab %>% filter(Format == "active") %>% ggplot(aes(x = Evaluation, y = count)) + geom_col() ``` <img src="08022022_Tirsdag_files/figure-html/unnamed-chunk-39-1.png" width="500" style="display: block; margin: auto;" /> --- ## Søjlediagrammer (bar plots) ```r *statics %>% filter(Format == "active") %>% ggplot(aes(x = Evaluation)) + * geom_bar() ``` <img src="08022022_Tirsdag_files/figure-html/unnamed-chunk-40-1.png" width="500" style="display: block; margin: auto;" /> --- ## Quiz Hvad sker der, hvis vi ikke filtrerer? ```r statics %>% *# filter(Format == "active") %>% ggplot(aes(x = Evaluation)) + geom_bar() ``` Og hvad viser følgende figur? ```r statics %>% ggplot(aes(x = Evaluation, fill = Format)) + geom_bar() ``` --- ## Løsning ```r statics %>% *# filter(Format == "active") %>% ggplot(aes(x = Evaluation)) + geom_bar() ``` --- ## Løsning ```r statics %>% ggplot(aes(x = Evaluation, fill = Format)) + geom_bar() ``` --- ## Vejrdata ```r weather # In nycflights13 package ``` ``` ## # A tibble: 26,115 × 15 ## origin year month day hour temp dewp humid wind_dir wind_speed ## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 ## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 ## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 ## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 ## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 ## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 ## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0 ## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4 ## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0 ## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8 ## # … with 26,105 more rows, and 5 more variables: wind_gust <dbl>, precip <dbl>, ## # pressure <dbl>, visib <dbl>, time_hour <dttm> ``` ```r temperature <- select(weather, c("month", "temp")) ``` --- ## Summarize ```r temperature %>% summarize( mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE) ) ``` ``` ## # A tibble: 1 × 2 ## mean std_dev ## <dbl> <dbl> ## 1 55.3 17.8 ``` ```r month_avg_temp <- temperature %>% * group_by(month) %>% summarize( mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE) ) ``` --- ## Summarize ```r month_avg_temp ``` ``` ## # A tibble: 12 × 3 ## month mean std_dev ## <int> <dbl> <dbl> ## 1 1 35.6 10.2 ## 2 2 34.3 6.98 ## 3 3 39.9 6.25 ## 4 4 51.7 8.79 ## 5 5 61.8 9.68 ## 6 6 72.2 7.55 ## 7 7 80.1 7.12 ## 8 8 74.5 5.19 ## 9 9 67.4 8.47 ## 10 10 60.1 8.85 ## 11 11 45.0 10.4 ## 12 12 38.4 9.98 ``` --- ## Join ```r temperature_joined <- temperature %>% left_join(month_avg_temp) ``` ``` ## Joining, by = "month" ``` ```r temperature_joined ``` ``` ## # A tibble: 26,115 × 4 ## month temp mean std_dev ## <int> <dbl> <dbl> <dbl> ## 1 1 39.0 35.6 10.2 ## 2 1 39.0 35.6 10.2 ## 3 1 39.0 35.6 10.2 ## 4 1 39.9 35.6 10.2 ## 5 1 39.0 35.6 10.2 ## 6 1 37.9 35.6 10.2 ## 7 1 39.0 35.6 10.2 ## 8 1 39.9 35.6 10.2 ## 9 1 39.9 35.6 10.2 ## 10 1 41 35.6 10.2 ## # … with 26,105 more rows ``` --- ## Standardize ```r # No grouping needed here temperature_joined %>% mutate(temp_stand = (temp - mean) / std_dev) %>% select(c("month", "temp", "temp_stand")) ``` ``` ## # A tibble: 26,115 × 3 ## month temp temp_stand ## <int> <dbl> <dbl> ## 1 1 39.0 0.331 ## 2 1 39.0 0.331 ## 3 1 39.0 0.331 ## 4 1 39.9 0.419 ## 5 1 39.0 0.331 ## 6 1 37.9 0.225 ## 7 1 39.0 0.331 ## 8 1 39.9 0.419 ## 9 1 39.9 0.419 ## 10 1 41 0.525 ## # … with 26,105 more rows ``` --- ## Standardize ```r temperature %>% * group_by(month) %>% mutate(temp_stand = (temp - mean(temp, na.rm = TRUE)) / sd(temp, na.rm = TRUE)) ``` ``` ## # A tibble: 26,115 × 3 ## # Groups: month [12] ## month temp temp_stand ## <int> <dbl> <dbl> ## 1 1 39.0 0.331 ## 2 1 39.0 0.331 ## 3 1 39.0 0.331 ## 4 1 39.9 0.419 ## 5 1 39.0 0.331 ## 6 1 37.9 0.225 ## 7 1 39.0 0.331 ## 8 1 39.9 0.419 ## 9 1 39.9 0.419 ## 10 1 41 0.525 ## # … with 26,105 more rows ``` --- ## Standardize ```r temperature %>% group_by(month) %>% mutate(temp_stand = scale(temp) %>% as.numeric()) ``` ``` ## # A tibble: 26,115 × 3 ## # Groups: month [12] ## month temp temp_stand ## <int> <dbl> <dbl> ## 1 1 39.0 0.331 ## 2 1 39.0 0.331 ## 3 1 39.0 0.331 ## 4 1 39.9 0.419 ## 5 1 39.0 0.331 ## 6 1 37.9 0.225 ## 7 1 39.0 0.331 ## 8 1 39.9 0.419 ## 9 1 39.9 0.419 ## 10 1 41 0.525 ## # … with 26,105 more rows ```