The tidyverse is a collection of R packages which, among other things, facilitate data handling and data transformation in R. See https://www.tidyverse.org/ for details.
We must install and load the R package tidyverse before we have access to the functions.
Install package: One option is to go via the Tools menu: Tools → Install packages → write tidyverse in the field called Packages. This only has to be done once.
Load package: Use the library command below (preferred), or go to the Packages menu in the bottom right window, find tidyverse in the list, and click it. This has to be done in every R-session where you use the package.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We can see that loading the tidyverse package here results in some technical messages from R. You do not have to worry about this (even if some of the text is coloured red).
Import data, inspect data
Data from Excel files can be imported via the Import Dataset facility. You may get the message that the package readxl should be installed. If so, then install it as explained for tidyverse above.
Find Import Data in the upper right window in RStudio, and choose From Excel in the dropdown menu.
A new window opens. Browse for the relevant Excel file; then a preview of the dataset is shown. Check that it looks OK, and click Import.
Three things happened: Three lines of code was generated (and executed) in the Console, a new dataset now appears in the Environment window, and the dataset is shown in the top left window. Check again that it looks OK.
Copy the first two lines of code into your R script (or into an R chunk in your Markdown document), but delete line starting with View and write instead the name of the dataset, here downloads. Then the first 10 lines of the data set are printed.
R has stored the data in a so-called tibble, a type of data frame. Rows are referred to as observations or data lines, columns as variables. The data rows appear in the order as in the Excel file.
Notice the line with data types below the variable names. to character variables, to data-and-time variables, (double) refer to numerical variables. It is often useful to convert variables of type to factor variables, as illustrated later in the document.
It is often useful to get a summary of each of the variables in the dataset:
summary(downloads)
machineName userID size time
Length:147035 Min. : 0 Min. : 0 Min. : 0.000
Class :character 1st Qu.:256435 1st Qu.: 0 1st Qu.: 0.000
Mode :character Median :507567 Median : 0 Median : 0.000
Mean :504923 Mean : 4154 Mean : 0.954
3rd Qu.:753920 3rd Qu.: 0 3rd Qu.: 0.000
Max. :999989 Max. :14518894 Max. :1878.076
date month
Min. :1994-11-22 00:00:00 Length:147035
1st Qu.:1995-02-07 00:00:00 Class :character
Median :1995-02-23 00:00:00 Mode :character
Mean :1995-03-04 22:17:15
3rd Qu.:1995-04-12 00:00:00
Max. :1995-05-18 00:00:00
At this point, the output is mainly useful for the numerical variables (and date variables). We will later see how categorical variables can be converted to so-called factors, and that the output is also useful for such variables.
Digression: If data are saved in a csv file (comma separated values), possibly generated via an Excel sheet, then data can be read with the read.csv function. For example, if the data file is called mydata.csv and values are separated with commas, then the command
mydata <-read.csv("mydata.csv", sep=",")
creates a data frame in R with the data. The data frame is not a tibble and some of the commands below would not work for such a data frame. Alternatively, you can also import csv files via the From Text item in the Import Dataset menu.
About the data
The dataset is from Boston University and is about www data transfers from November 1994 to May 1995.
It has 147,035 data lines and 6 variables
We will use 3 of the variables: machineName is the name of the server from which the file was downloaded, size is the download size in bytes, and time is the download time in seconds.
Extracting variables, simple summary statistics
Variables can be extracted with the $-syntax, and we can use squared brackets to show only the first 40, say, values.
Summary statistics like mean, standard deviation, median are easily computed for a vector.
Examples of R functions for computing summary statistics: length, mean, median, sd, var, sum, quantile, min, max, IQR.
length(time_vector)
[1] 147035
mean(time_vector)
[1] 0.9539674
sd(time_vector)
[1] 14.22557
median(time_vector)
[1] 0
min(time_vector)
[1] 0
Notice that more than half the observations have time equal to zero (median is zero).
With the summary function, many of the above summary statistics (but not the standard deviation) are computed in one go:
summary(time_vector)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 0.954 0.000 1878.076
Filter data (selecting rows): filter
The filter function is used to make sub-datasets where only certain datalines (rows) are maintained. It is described with logical expressions which datalines should be kept in the dataset.
Say that we only want observations with download time larger than 1000 seconds; there happens to be eight such observations:
# Note: Here we use the filter() function from the tidyverse-package.# If the tidyverse-package was not loaded, then we would use# the filter() function from the stats-package. And actually# get an error message.filter(downloads, time >1000)
Notice that this result is assigned to downloads2. It has 36,708 data lines. The original data called downloads still exists with 147,035 data lines.
Filtering requires logical predicates. These are expressions in terms of columns, which evaluate to either TRUE or FALSE for each row. Logical expressions can be combined with logical operations.
Comparisons: ==, !=, <, >, <=, >=, %in%, is.na
Logical operations: ! (not), | (or), & (and). A comma can be used instead of ´&´
Here comes two sub-datasets:
# Rows from kermit, and with size greater than 200000 bytes are kept.filter(downloads2, machineName =="kermit", size >200000)
Sometimes, datasets has many variables of which only some are relevant for the analysis. Variables can be selected or skipped with the select function.
# Without the date variableselect(downloads2, -date)
We then make a new categorical variable, speedCat, which is “Slow” is speed < 150 and “Fast” otherwise:
# New variable with if-else constructiondownloads3 <-mutate(downloads3, speedCat =ifelse(speed <150, "Slow", "Fast"))downloads3
# A tibble: 36,708 × 6
machineName size time speed logSize speedCat
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 cs18 2464 0.493 4998. 3.39 Fast
2 cs18 7745 0.326 23786. 3.89 Fast
3 cs18 6727 0.314 21444. 3.83 Fast
4 cs18 13049 0.583 22400. 4.12 Fast
5 cs18 356 0.259 1373. 2.55 Fast
6 cs18 15063 0.336 44897. 4.18 Fast
7 cs18 2548 0.285 8945. 3.41 Fast
8 cs18 1932 0.286 6763. 3.29 Fast
9 cs18 7294 0.397 18368. 3.86 Fast
10 cs18 4470 3.41 1311. 3.65 Fast
# ℹ 36,698 more rows
Categorical variables are coded as “character variables” when data are imported, but it is often preferable to code them as so-called factors instead. We can use the function factor in combination with mutate to obtain this. Here, we overwrite the old version of machineName with the new factor version. Notice how the output from summary now shown the number of observations for each machine.
# Change machineName to factordownloads3 <-mutate(downloads3, machineName=factor(machineName))summary(downloads3)
machineName size time speed
cs18 : 3814 Min. : 3 Min. :4.369e-02 Min. : 0.3
kermit : 9094 1st Qu.: 953 1st Qu.:4.426e-01 1st Qu.: 1060.5
piglet :11200 Median : 2188 Median :7.314e-01 Median : 2902.8
pluto : 5253 Mean : 16638 Mean :3.820e+00 Mean : 9990.4
tweetie: 7347 3rd Qu.: 6500 3rd Qu.:1.893e+00 3rd Qu.: 7126.4
Max. :14518894 Max. :1.878e+03 Max. :3228695.3
logSize speedCat
Min. :0.4771 Length:36708
1st Qu.:2.9791 Class :character
Median :3.3400 Mode :character
Mean :3.4475
3rd Qu.:3.8129
Max. :7.1619
Counting, tabulation of categorical variables: count
The count function is useful for counting datalines, posssibly according to certain criteria or for the different levels of categorical values.
# Total number of observations in the current datasetcount(downloads3)
# A tibble: 1 × 1
n
<int>
1 36708
# Number of observations from each machinecount(downloads3, machineName)
# Number of observations which have/have not size larger than 5000count(downloads3, size>5000)
# A tibble: 2 × 2
`size > 5000` n
<lgl> <int>
1 FALSE 25865
2 TRUE 10843
# Number of observations for each combination of machine name and the *slow* variable.count(downloads3, machineName, speedCat)
# A tibble: 10 × 3
machineName speedCat n
<fct> <chr> <int>
1 cs18 Fast 3662
2 cs18 Slow 152
3 kermit Fast 8717
4 kermit Slow 377
5 piglet Fast 10734
6 piglet Slow 466
7 pluto Fast 4963
8 pluto Slow 290
9 tweetie Fast 6983
10 tweetie Slow 364
Sort data: arrange
The arrange function can be used to sort the data according to one or more columns.
Let us sort the data according to download time (ascending order). The first lines of the sorted data set is printed on-screen, but the dataset downloads3 has not been changed.
arrange(downloads3, time)
# A tibble: 36,708 × 6
machineName size time speed logSize speedCat
<fct> <dbl> <dbl> <dbl> <dbl> <chr>
1 pluto 1806 0.0437 41335. 3.26 Fast
2 tweetie 7747 0.0462 167644. 3.89 Fast
3 piglet 353 0.0525 6729. 2.55 Fast
4 piglet 18110 0.0535 338359. 4.26 Fast
5 kermit 454 0.0550 8248. 2.66 Fast
6 tweetie 452 0.0551 8208. 2.66 Fast
7 tweetie 452 0.0554 8155. 2.66 Fast
8 tweetie 452 0.0564 8010. 2.66 Fast
9 tweetie 452 0.0566 7991. 2.66 Fast
10 kermit 452 0.0602 7514. 2.66 Fast
# ℹ 36,698 more rows
Two different examples:
# According to download size in descending orderarrange(downloads3, desc(time))
# A tibble: 36,708 × 6
machineName size time speed logSize speedCat
<fct> <dbl> <dbl> <dbl> <dbl> <chr>
1 pluto 8674562 1878. 4619. 6.94 Fast
2 cs18 4720220 1749. 2700. 6.67 Fast
3 kermit 14518894 1380. 10522. 7.16 Fast
4 cs18 2573336 1335. 1928. 6.41 Fast
5 cs18 4055821 1275. 3180. 6.61 Fast
6 tweetie 245003 1214. 202. 5.39 Fast
7 cs18 2743516 1151. 2383. 6.44 Fast
8 kermit 1079731 1129. 956. 6.03 Fast
9 cs18 6360764 863. 7374. 6.80 Fast
10 cs18 1679580 755. 2226. 6.23 Fast
# ℹ 36,698 more rows
# After machine name and then according to download size in descending orderarrange(downloads3, machineName, desc(time))
# A tibble: 36,708 × 6
machineName size time speed logSize speedCat
<fct> <dbl> <dbl> <dbl> <dbl> <chr>
1 cs18 4720220 1749. 2700. 6.67 Fast
2 cs18 2573336 1335. 1928. 6.41 Fast
3 cs18 4055821 1275. 3180. 6.61 Fast
4 cs18 2743516 1151. 2383. 6.44 Fast
5 cs18 6360764 863. 7374. 6.80 Fast
6 cs18 1679580 755. 2226. 6.23 Fast
7 cs18 1736924 744. 2335. 6.24 Fast
8 cs18 1683412 695. 2423. 6.23 Fast
9 cs18 1679580 684. 2457. 6.23 Fast
10 cs18 1273543 633. 2011. 6.11 Fast
# ℹ 36,698 more rows
Group data: group_by
We can group the dataset by one or more categorical variables with group_by. The dataset is not changed as such, but - as we will see - grouping can be useful for computation of summary statistics and graphics.
Here we group after machine name (first) and after machine name and the categorical speed variable (second). The only way we can see it at this point is in the second line in the output (# Groups:):
# Group according to machinegroup_by(downloads3, machineName)
# A tibble: 36,708 × 6
# Groups: machineName [5]
machineName size time speed logSize speedCat
<fct> <dbl> <dbl> <dbl> <dbl> <chr>
1 cs18 2464 0.493 4998. 3.39 Fast
2 cs18 7745 0.326 23786. 3.89 Fast
3 cs18 6727 0.314 21444. 3.83 Fast
4 cs18 13049 0.583 22400. 4.12 Fast
5 cs18 356 0.259 1373. 2.55 Fast
6 cs18 15063 0.336 44897. 4.18 Fast
7 cs18 2548 0.285 8945. 3.41 Fast
8 cs18 1932 0.286 6763. 3.29 Fast
9 cs18 7294 0.397 18368. 3.86 Fast
10 cs18 4470 3.41 1311. 3.65 Fast
# ℹ 36,698 more rows
# Group according to machine and slowgroup_by(downloads3, machineName, speedCat)
# A tibble: 36,708 × 6
# Groups: machineName, speedCat [10]
machineName size time speed logSize speedCat
<fct> <dbl> <dbl> <dbl> <dbl> <chr>
1 cs18 2464 0.493 4998. 3.39 Fast
2 cs18 7745 0.326 23786. 3.89 Fast
3 cs18 6727 0.314 21444. 3.83 Fast
4 cs18 13049 0.583 22400. 4.12 Fast
5 cs18 356 0.259 1373. 2.55 Fast
6 cs18 15063 0.336 44897. 4.18 Fast
7 cs18 2548 0.285 8945. 3.41 Fast
8 cs18 1932 0.286 6763. 3.29 Fast
9 cs18 7294 0.397 18368. 3.86 Fast
10 cs18 4470 3.41 1311. 3.65 Fast
# ℹ 36,698 more rows
Summary statistics, revisited: summarize
Recall how we could compute summary statistics for a single variable in a dataset, e.g.
mean(downloads3$size)
[1] 16638.36
max(downloads3$size)
[1] 14518894
With summarize we can compute summary statistics for a variable for each level of a grouping variable or for each combination of several grouping variables.
First, a bunch of summaries for the size variable for each machine name, where we give explicit names for the new variables:
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by machineName and speedCat.
ℹ Output is grouped by machineName.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(machineName, speedCat))` for per-operation grouping
(`?dplyr::dplyr_by`) instead.
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by machineName and speedCat.
ℹ Output is grouped by machineName.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(machineName, speedCat))` for per-operation grouping
(`?dplyr::dplyr_by`) instead.
The datasets with summaries can be saved as datasets themselves, for example to be used as the basis for certain graphs.
Merge datasets with different columns: join
Consider the situation where data from the same “individuals” are spread in different datasets, and we want to combine the data into one dataset. This is possible with the “join” functions which are available with the tidyverse package. There are several variants; here we just show how to use the one called full_join in a tiny example. Say we want to merge the dataset consisting of the first five datalines from downloads3 with another dataset which has six observations. They are linked together via a variable called id and we want a dataset which merges the two datasets, keeping observations with the same id together.
# Subset of the downloads3 datadownloads4 <- downloads3[1:5,]# Make an id variabledownloads4$id <-1:5downloads4
# A tibble: 5 × 7
machineName size time speed logSize speedCat id
<fct> <dbl> <dbl> <dbl> <dbl> <chr> <int>
1 cs18 2464 0.493 4998. 3.39 Fast 1
2 cs18 7745 0.326 23786. 3.89 Fast 2
3 cs18 6727 0.314 21444. 3.83 Fast 3
4 cs18 13049 0.583 22400. 4.12 Fast 4
5 cs18 356 0.259 1373. 2.55 Fast 5
# The "extra" dataset. It has two variables: id and xextraData <-tibble(id=6:1, x =c(16,15,14,13,12,11))extraData
# A tibble: 6 × 2
id x
<int> <dbl>
1 6 16
2 5 15
3 4 14
4 3 13
5 2 12
6 1 11
# Mergingfull_join(downloads4, extraData)
Joining with `by = join_by(id)`
# A tibble: 6 × 8
machineName size time speed logSize speedCat id x
<fct> <dbl> <dbl> <dbl> <dbl> <chr> <int> <dbl>
1 cs18 2464 0.493 4998. 3.39 Fast 1 11
2 cs18 7745 0.326 23786. 3.89 Fast 2 12
3 cs18 6727 0.314 21444. 3.83 Fast 3 13
4 cs18 13049 0.583 22400. 4.12 Fast 4 14
5 cs18 356 0.259 1373. 2.55 Fast 5 15
6 <NA> NA NA NA NA <NA> 6 16
The merged dataset has six datalines, but the last line has NA’s for all variables except ID and x because no observation with id=6 exists in downloads4. One can use different join functions to leave out this dataline.
Stack datasets with the same variables: bind_rows
Consider another situation where we have two datasets with the same variables, and want to stack them together. The function bind_rows is useful for that. Here, for illustration, we first make two small subsets of the data, and then combine them.
# Make small datasets and combine them part1 <- downloads3[1:3,]part2 <- downloads3[10001:10003,]part1
# A tibble: 3 × 6
machineName size time speed logSize speedCat
<fct> <dbl> <dbl> <dbl> <dbl> <chr>
1 cs18 2464 0.493 4998. 3.39 Fast
2 cs18 7745 0.326 23786. 3.89 Fast
3 cs18 6727 0.314 21444. 3.83 Fast
part2
# A tibble: 3 × 6
machineName size time speed logSize speedCat
<fct> <dbl> <dbl> <dbl> <dbl> <chr>
1 kermit 2747 0.541 5074. 3.44 Fast
2 kermit 2150 0.404 5327. 3.33 Fast
3 kermit 7998 0.748 10688. 3.90 Fast
bind_rows(part1, part2)
# A tibble: 6 × 6
machineName size time speed logSize speedCat
<fct> <dbl> <dbl> <dbl> <dbl> <chr>
1 cs18 2464 0.493 4998. 3.39 Fast
2 cs18 7745 0.326 23786. 3.89 Fast
3 cs18 6727 0.314 21444. 3.83 Fast
4 kermit 2747 0.541 5074. 3.44 Fast
5 kermit 2150 0.404 5327. 3.33 Fast
6 kermit 7998 0.748 10688. 3.90 Fast
The pipe operator: %>%
Two or more function calls can be evaluated sequentially using the so-called pipe operator, %>%. Nesting of function calls becomes more readable, and intermediate assignments are avoided.
Let us try it to do a bunch of things in one go, starting with the original dataset. Notice also the extended use of summarize.
downloads %>%filter(size>0) %>%# Subset of datagroup_by(machineName) %>%# Grouping summarize(across(c("time","size"),list(avg=mean, sd=sd))) %>%# Compute all means and sd'sarrange(size_avg) # Sort after mean