5  Intro to the tidyverse

R mixes object-oriented and functional ideas, and it is not particularly user-friendly. The {tidyverse}, a suite of data management, visualization, and modeling packages, aims to make R friendlier by leaning on a consistent, functional style. As one of the project’s major developers, Hadley Wickham, put it: “R is not a language driven by the purity of its philosophy; R is a language designed to get shit done.” This may have become somewhat evident in the indexing section, where I talked about [], [[]], and $. In general, languages like Python have only one main way to index an object.

The tidyverse is a collection of packages that share the same underlying philosophy (which we will get to). Generally, I load the tidyverse like this.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
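
The conflicts note above means that, once the tidyverse is loaded, a bare filter() refers to the dplyr version. As a minimal sketch (the vector x is made up for illustration), prefixing a function with its package name and :: makes the choice explicit:

```r
library(dplyr)

# dplyr's filter(): keep the rows of a data frame that pass a test
tall <- dplyr::filter(starwars, height > 200)

# stats' filter(): a linear filter (here a 3-point moving average) for a series
x <- c(1, 2, 3, 4, 5)  # made-up numbers
smoothed <- stats::filter(x, rep(1 / 3, 3))
```

You will rarely need stats::filter() in this guide, but the same prefix trick works for any masked function.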

5.1 dplyr

“Happy families are all alike; every unhappy family is unhappy in its own way.” - Leo Tolstoy.

Data is often talked about in the same way. All clean datasets are alike in structure and dignity. Every messy dataset is messy in its own special way.

dplyr is the tidyverse way to manipulate data and an all around great package. But you need to get the basic functions right. Cleaning data is what you will spend the most time on.

We will use the palmerpenguins and starwars datasets to demonstrate how to use the verbs in dplyr.

data("starwars")
library(palmerpenguins)

5.1.1 select

Select is the most intuitive. Select picks out the columns you want or, in some cases, do not want. If we only want the name of the character and the place they are from, we feed the names of those columns into select. If you are following along and copying and pasting the code in the guide to see what is going on, you may notice that your output differs slightly. That is because behind the scenes I am using select to cut down on the output.

select(starwars, name, homeworld)
name homeworld
Luke Skywalker Tatooine
C-3PO Tatooine
R2-D2 Naboo
Darth Vader Tatooine
Leia Organa Alderaan

If we wanted all the columns except for name and homeworld we would do this.

select(starwars, -name, -homeworld)
height mass hair_color skin_color eye_color starships
172 77 blond fair blue X-wing , Imperial shuttle
167 75 NA gold yellow
96 32 NA white, blue red
202 136 none white yellow TIE Advanced x1
150 49 brown light brown

We can also feed it a range of columns using :

select(starwars, name:hair_color)
name height mass hair_color
Luke Skywalker 172 77 blond
C-3PO 167 75 NA
R2-D2 96 32 NA
Darth Vader 202 136 none
Leia Organa 150 49 brown
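
The three forms of select above can be checked on a table small enough to read by eye. This is only a sketch: the pets data frame and its columns are made up for illustration.

```r
library(dplyr)

# A tiny made-up table so the result is easy to verify
pets <- data.frame(
  name    = c("Rex", "Tom"),
  species = c("dog", "cat"),
  age     = c(3, 5),
  weight  = c(20, 4)
)

names(select(pets, name, age))     # keep two columns: name, age
names(select(pets, -name, -age))   # drop two columns: species, weight remain
names(select(pets, name:age))      # keep a range: name, species, age
```

The range form depends on the order of the columns in the data, so it is worth checking names() before relying on it.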

5.1.2 filter

Filter is how we subset by rows. To do this we need to tell filter what rows we want! This feels like it should be intuitive, but we have to use some concepts that are likely new to you. Let's say we want characters that are from a particular world in starwars. To do that we do this.

filter(starwars, homeworld == "Naboo") 
name homeworld
R2-D2 Naboo
Palpatine Naboo
Jar Jar Binks Naboo
Roos Tarpals Naboo
Rugor Nass Naboo

What we are doing here is creating a dataset with only things from Naboo. We need to tell R what rows we want by setting up some tests. Each time a row meets the condition, R is going to grab it. As a reminder, these are the kinds of tests you can do.

Test        Meaning                     Test        Meaning
x < y       Less than                   x %in% y    In set
x > y       Greater than                is.na(x)    Is missing
x == y      Equal to                    !is.na(x)   Is not missing
x <= y      Less than or equal to
!x          Not
x >= y      Greater than or equal to
x != y      Not equal to
x | y       Or
x & y       And
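
To see what these tests actually return, it can help to run them on a plain vector. A small sketch with a made-up vector x that includes one missing value:

```r
x <- c(1, 5, NA, 10)   # made-up vector with one missing value

x > 4             # FALSE  TRUE    NA  TRUE
x == 5            # FALSE  TRUE    NA FALSE
x %in% c(1, 10)   #  TRUE FALSE FALSE  TRUE
is.na(x)          # FALSE FALSE  TRUE FALSE
x > 4 & x < 8     # FALSE  TRUE    NA FALSE
x < 2 | x > 8     #  TRUE FALSE    NA  TRUE
```

Each test returns one TRUE, FALSE, or NA per element. filter keeps only the rows where the result is TRUE.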

So let's return to our Naboo example. If we wanted to reuse this dataset later, we would assign it to an object using <-, so let's assign it to an object named naboo.

naboo <- filter(starwars, homeworld == "Naboo") 

We can also combine multiple tests in filter. Let's say we wanted all the characters that are from Naboo and are human.

 filter(starwars, homeworld == "Naboo" & species == "Human")
name species homeworld
Palpatine Human Naboo
Gregar Typho Human Naboo
Cordé Human Naboo
Dormé Human Naboo
Padmé Amidala Human Naboo

Now we have all the characters that are from Naboo and are human! Filter treats tests separated by commas as an and test, so these two calls produce the same result.

filter(starwars, homeworld == "Naboo" & species == "Human")
name homeworld species
Palpatine Naboo Human
Gregar Typho Naboo Human
Cordé Naboo Human
Dormé Naboo Human
Padmé Amidala Naboo Human

filter(starwars, homeworld == "Naboo", species == "Human")
name homeworld species
Palpatine Naboo Human
Gregar Typho Naboo Human
Cordé Naboo Human
Dormé Naboo Human
Padmé Amidala Naboo Human
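
Rather than comparing the two printouts by eye, we can ask R whether the results match. A minimal sketch; the object names with_and and with_comma are just for illustration:

```r
library(dplyr)

with_and   <- filter(starwars, homeworld == "Naboo" & species == "Human")
with_comma <- filter(starwars, homeworld == "Naboo", species == "Human")

# Same characters, in the same order, from both versions
identical(with_and$name, with_comma$name)
```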

If we wanted characters from two different homeworlds we would do an or test using | (the key above enter/return)

filter(starwars, homeworld == "Naboo" | homeworld == "Tatooine")
name homeworld
Luke Skywalker Tatooine
C-3PO Tatooine
R2-D2 Naboo
Darth Vader Tatooine
Owen Lars Tatooine

The reason we use an or test is that one character can’t have two homeworlds! Remember, computers are dumb. As long as the code can run, it will do it. So if we use an and test, this is what it returns.

filter(starwars, homeworld == "Naboo" & homeworld == "Tatooine")
# A tibble: 0 × 14
# ℹ 14 variables: name <chr>, height <int>, mass <dbl>, hair_color <chr>,
#   skin_color <chr>, eye_color <chr>, birth_year <dbl>, sex <chr>,
#   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

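When working with things in R, stray commas and spelling will lead to headaches. Here are two examples that do not throw an error. The first creates a data frame with zero observations: Naboo exists as a value of homeworld, but naboo does not. The second has a stray trailing comma, which dplyr verbs such as filter happen to tolerate but which many other R functions, such as c(), treat as an error.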

 naboo <- filter(starwars, homeworld == "naboo")

naboo <- filter(starwars, homeworld == "Naboo",)
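
One way to catch a spelling problem like this is to count the rows that come back and to look at the values that actually exist in the column. A small sketch using base R's nrow() and unique(), which are not otherwise covered in this section:

```r
library(dplyr)

nrow(filter(starwars, homeworld == "naboo"))   # 0: no homeworld is spelled this way
nrow(filter(starwars, homeworld == "Naboo"))   # more than 0

# Listing the distinct values is a cheap way to check spelling first
unique(starwars$homeworld)
```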

Filter also works if you want particular values of something. Let’s say we want penguins with longer flippers or characters less than a certain height. We can do that with filter.

filter(starwars, height < mean(height, na.rm = TRUE))
name height species
Luke Skywalker 172 Human
C-3PO 167 Droid
R2-D2 96 Droid
Leia Organa 150 Human
Beru Whitesun lars 165 Human

filter(penguins, flipper_length_mm > 2)
species flipper_length_mm
Adelie 181
Adelie 186
Adelie 195
Adelie 193
Adelie 190

Sometimes we want a dataset that does not have any missing values in a particular column. In this case all we do is add ! in front of the is.na function.

filter(starwars, !is.na(height))
name height mass hair_color skin_color eye_color
Luke Skywalker 172 77 blond fair blue
C-3PO 167 75 NA gold yellow
R2-D2 96 32 NA white, blue red
Darth Vader 202 136 none white yellow
Leia Organa 150 49 brown light brown

We would do something similar if we wanted characters that are not human!

filter(starwars, species != "Human")
name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
C-3PO 167 75 NA gold yellow 112 none masculine Tatooine Droid The Empire Strikes Back, Attack of the Clones , The Phantom Menace , Revenge of the Sith , Return of the Jedi , A New Hope
R2-D2 96 32 NA white, blue red 33 none masculine Naboo Droid The Empire Strikes Back, Attack of the Clones , The Phantom Menace , Revenge of the Sith , Return of the Jedi , A New Hope , The Force Awakens
R5-D4 97 32 NA white, red red NA none masculine Tatooine Droid A New Hope
Chewbacca 228 112 brown unknown blue 200 male masculine Kashyyyk Wookiee The Empire Strikes Back, Revenge of the Sith , Return of the Jedi , A New Hope , The Force Awakens AT-ST Millennium Falcon, Imperial shuttle
Greedo 173 74 NA green black 44 male masculine Rodia Rodian A New Hope
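
One thing to keep in mind with a test like this: a comparison against a missing value returns NA rather than TRUE, and filter drops those rows. The sketch below (object names are made up for illustration) shows how to keep characters with a missing species as well, if that is what you want:

```r
library(dplyr)

NA != "Human"   # NA, not TRUE

not_human       <- filter(starwars, species != "Human")
not_human_or_na <- filter(starwars, species != "Human" | is.na(species))

nrow(not_human)
nrow(not_human_or_na)   # larger by the number of rows with a missing species
```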

One last operator that I will show you is the %in% operator. This comes in really handy for lots of things. Intuitively, we can think about it as a bunch of thing == "value" | thing == "value" tests glued together. Let's say we wanted all the characters from Tatooine, Naboo, and Coruscant. With what we know, we would do something like this.

filter(starwars, homeworld == "Naboo" | homeworld == "Coruscant" | homeworld == "Tatooine")

While this is not the most typing in the world, it can quickly get to be a lot if there are a bunch of mutually exclusive values we need to subset our data by. This is where the %in% operator comes in to save us! We can rewrite this series of or tests like this.

filter(starwars, homeworld %in% c("Naboo", "Coruscant", "Tatooine"))
name homeworld
Luke Skywalker Tatooine
Anakin Skywalker Tatooine
Finis Valorum Coruscant
Shmi Skywalker Tatooine
Adi Gallia Coruscant
Jocasta Nu Coruscant
Padmé Amidala Naboo

I have done some additional filtering behind the scenes to show you that it worked so your dataset likely looks a little different.

Notice how the test works. The first thing is the name of the variable and the second is the set of homeworlds we want. What R is doing is taking the value of the homeworld variable for each row and seeing if it matches anything on the right-hand side of %in%. Kind of like this:

5 %in% 1:10
[1] TRUE
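
The same check works element by element on a whole vector, which is what happens inside filter. A small sketch with a made-up homeworld vector, including one missing value to show where %in% and a chain of == tests differ:

```r
homeworld <- c("Naboo", "Alderaan", NA, "Tatooine")   # made-up vector
wanted    <- c("Naboo", "Coruscant", "Tatooine")

homeworld %in% wanted
# [1]  TRUE FALSE FALSE  TRUE

# The chained == version gives NA, rather than FALSE, for the missing value
homeworld == "Naboo" | homeworld == "Coruscant" | homeworld == "Tatooine"
# [1]  TRUE FALSE    NA  TRUE
```

Inside filter both versions keep the same rows, because filter only keeps rows where the test is TRUE.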

If we wanted everything outside of this subset, we could negate the test with ! like this.

filter(starwars, !homeworld %in% c("Naboo", "Coruscant", "Tatooine"))
name homeworld
Leia Organa Alderaan
Obi-Wan Kenobi Stewjon
Wilhuff Tarkin Eriadu
Chewbacca Kashyyyk
Han Solo Corellia

5.1.3 mutate

mutate is how we create new columns in our dataset. There are tons of things that we may need to do in order to create variables, so let's start with what is known as an indicator variable. Let's say we want to create a variable that indicates whether a character is human or not. The first thing we need to do is name the column; let's call it human. We then need to tell R what values the human variable takes. To do that we can use the ifelse function, which works a bit like the tests in filter.

ifelse has a few components

ifelse(test, what it does if true, what it does if false)

So in our case, if the species column has a value of “Human” then it returns TRUE; otherwise it returns FALSE.

mutate(starwars, human = ifelse(species == "Human", TRUE, FALSE))
name species
Luke Skywalker Human
C-3PO Droid
R2-D2 Droid
Darth Vader Human
Leia Organa Human
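
ifelse checks the test once per element, so it can be tried on a plain vector before using it inside mutate. A minimal sketch with a made-up species vector; note what happens to the missing value:

```r
species <- c("Human", "Droid", NA)   # made-up vector

ifelse(species == "Human", TRUE, FALSE)
# [1]  TRUE FALSE    NA

# The two outcomes can be any values, not only TRUE and FALSE
ifelse(species == "Human", "person", "not a person")
# "person", "not a person", NA
```

When the test itself is NA, ifelse returns NA rather than picking either outcome.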

ifelse is not limited to TRUE or FALSE. You can put almost any values in there; it is just easier to work with TRUE and FALSE. Let's see a somewhat silly example with the palmerpenguins dataset.

mutate(penguins, big_peng = ifelse(body_mass_g > mean(body_mass_g, na.rm = TRUE), "Chonky penguin", "Not a Chonky penguin"))
body_mass_g big_peng
3750 Not a Chonky penguin
3800 Not a Chonky penguin
3250 Not a Chonky penguin
NA NA
3450 Not a Chonky penguin

R gives us a lot of flexibility to create all kinds of variables. So let’s make a column in our dataset where we see how old a character is in dog years with a description, so other people know what is going on.

mutate(starwars, dog_years = paste(name, "is", birth_year * 7, "in dog years"))
name dog_years
Luke Skywalker Luke Skywalker is 133 in dog years
C-3PO C-3PO is 784 in dog years
R2-D2 R2-D2 is 231 in dog years
Darth Vader Darth Vader is 293.3 in dog years
Leia Organa Leia Organa is 133 in dog years

mutate is order aware. So if you want to do something with that new variable, you can do that in the same mutate call. If you want to do multiple things in mutate, that is also easy.

mutate(starwars, heightsqr = height^2,
                 height_square_root = sqrt(heightsqr),
                 human = ifelse(species == "Human", TRUE, FALSE))
species height heightsqr height_square_root human
Human 172 29584 172 TRUE
Droid 167 27889 167 FALSE
Droid 96 9216 96 FALSE
Human 202 40804 202 TRUE
Human 150 22500 150 TRUE
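
To make the "order aware" point concrete, here is a sketch on a tiny made-up data frame (shapes, side, area, and side_back are illustrative names). The second column is built from the first within the same mutate call:

```r
library(dplyr)

shapes <- data.frame(side = c(2, 3, 4))   # made-up data

result <- mutate(shapes,
                 area      = side^2,       # created first
                 side_back = sqrt(area))   # can already use area

result
```

Swapping the two lines would fail, because area would not exist yet when side_back is computed.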

There are tons of different kinds of operations you can do with mutate! Each type of variable has different things you can and cannot do to it. If you want a more complete breakdown, I suggest that you look at chapters 13 to 18 of R4DS (R for Data Science). For now we will set that aside.

5.1.4 What if we want to do more than one thing at once?

Generally, data cleaning consists of multiple steps. Sometimes we need to subset our data only to include the columns we care about and make a new variable. We could use what is known as a nested function call like this.

select(mutate(starwars, human = ifelse(species == "Human", TRUE, FALSE)), name, human)
name human
Luke Skywalker TRUE
C-3PO FALSE
R2-D2 FALSE
Darth Vader TRUE
Leia Organa TRUE

Or we could create an intermediate object named starwars_human_add and then select the columns we want from there. Both of these solutions are annoying and unintuitive. Instead, we use |>, technically called a pipe, to combine multiple steps in our data-cleaning pipeline. You should read it as “and then”. Behind the scenes, I have been using |> all over the place.

The easiest way to think about the pipe is this example from Andrew Heiss, which uses your morning routine.

me |>
  wake_up(time = "8.00am") |>
  get_out_of_bed(side = "correct") |>
  get_dressed(pants = TRUE, shirt = TRUE) |>
  leave_house(car = TRUE, bike = FALSE, MARTA = FALSE) |>
  am_late(traffic = TRUE)
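
The morning routine is pretend code, so here is the same "and then" reading with real functions. A minimal sketch that needs R 4.1 or later for |>; the vector x is made up:

```r
x <- c(1, 4, 9)   # made-up vector

# Nested: read from the inside out
sum(sqrt(x))
# [1] 6

# Piped: take x, and then take square roots, and then sum them
x |> sqrt() |> sum()
# [1] 6
```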

This works because of the shared logic of the tidyverse:

filter(.data = penguins, species == "Gentoo")
# A tibble: 124 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           46.1          13.2               211        4500
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           48.7          14.1               210        4450
 4 Gentoo  Biscoe           50            15.2               218        5700
 5 Gentoo  Biscoe           47.6          14.5               215        5400
 6 Gentoo  Biscoe           46.5          13.5               210        4550
 7 Gentoo  Biscoe           45.4          14.6               211        4800
 8 Gentoo  Biscoe           46.7          15.3               219        5200
 9 Gentoo  Biscoe           43.3          13.4               209        4400
10 Gentoo  Biscoe           46.8          15.4               215        5150
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>

select(.data = penguins, species:bill_length_mm)
# A tibble: 344 × 3
   species island    bill_length_mm
   <fct>   <fct>              <dbl>
 1 Adelie  Torgersen           39.1
 2 Adelie  Torgersen           39.5
 3 Adelie  Torgersen           40.3
 4 Adelie  Torgersen           NA  
 5 Adelie  Torgersen           36.7
 6 Adelie  Torgersen           39.3
 7 Adelie  Torgersen           38.9
 8 Adelie  Torgersen           39.2
 9 Adelie  Torgersen           34.1
10 Adelie  Torgersen           42  
# ℹ 334 more rows

mutate(.data = penguins, bill_length_mm_sq = bill_length_mm^2)
# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_length_mm_sq <dbl>

Note how each verb has the .data argument in the first position. The pipe takes what is on the left-hand side and passes it as the first argument to the function on the right-hand side, so starwars |> mutate(...) passes starwars as the first argument of mutate. This lets us chain together multiple operations. In my opinion, a chain of pipes is easier to decipher than a large nested function call: compare the nested version shown first below with the cleaner, easier-to-read piped version that follows it.

filter(
  mutate(penguins,
         female = ifelse(sex == "female", TRUE, FALSE)),
  species == "Adelie"
)
# A tibble: 152 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 142 more rows
# ℹ 3 more variables: sex <fct>, year <int>, female <lgl>

penguins |>
  filter(species == "Adelie") |>
  mutate(female = ifelse(sex == "female", TRUE, FALSE))
# A tibble: 152 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 142 more rows
# ℹ 3 more variables: sex <fct>, year <int>, female <lgl>

The tidyverse has its own pipe that looks like %>%. This also works well. I use the native pipe, |>, which is included in R 4.1 and later. The keyboard shortcut for both is ctrl + shift + m on Windows and cmd + shift + m on Mac. To have the shortcut insert the base R pipe in RStudio, go to Tools -> Global Options -> Code and then check “Use native pipe operator”.
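
For the kind of code in this guide, the two pipes can be used interchangeably. A quick sketch that runs the same steps with each and compares the results (the object names are made up for illustration):

```r
library(dplyr)   # loading dplyr also makes %>% available

with_magrittr <- starwars %>% filter(species == "Droid") %>% select(name, height)
with_native   <- starwars |> filter(species == "Droid") |> select(name, height)

# Same names and heights from both pipes
identical(with_magrittr$name, with_native$name)
identical(with_magrittr$height, with_native$height)
```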

5.1.5 group_by and summarize

We tend to use these two commands together because they are pretty well matched. group_by groups the data by a column or columns in our dataset, and summarize then collapses each group into a single row. Think of this as collapsing the data to your unit of analysis, like country. If you have panel data where we observe multiple countries over multiple years, we can use group_by to look at country and year. summarize lets you pass a whole host of functions to get descriptive measures of the data.

Both the British English and American English spellings of summarize work the same. The main author and maintainer of dplyr and lots of other stuff in the tidyverse is from New Zealand, so R for Data Science uses British spellings.

Imagine that we want to know the average height for each species. Using what you know about dplyr, you might write code like this with pipes.

starwars |>
group_by(species) |>
summarise(mean(height))
species mean(height)
Aleena 79
Besalisk 198
Cerean 198
Chagrian 196
Clawdite 168
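
One caveat with the code above: mean() returns NA for any species that has a character with a missing height. A sketch of the same summary with na.rm = TRUE, as used in the filter example earlier, and with a friendlier column name (mean_height and height_by_species are names chosen here for illustration):

```r
library(dplyr)

height_by_species <- starwars |>
  group_by(species) |>
  summarise(mean_height = mean(height, na.rm = TRUE))

height_by_species
```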

Sometimes we want to count how many characters belong to each species. There are many ways to do this in R. One of the most common you will see is this.

starwars |>
group_by(species) |>
summarise(n()) |>
arrange(desc(`n()`))  # just sorts things from highest to lowest
species n()
Human 35
Droid 6
NA 4
Gungan 3
Kaminoan 2
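
Since there are many ways to count, here is one alternative for comparison: dplyr's count() does the grouping, counting, and sorting in a single step. A sketch that checks both approaches give the same counts (long_way and short_way are illustrative names):

```r
library(dplyr)

long_way <- starwars |>
  group_by(species) |>
  summarise(n = n()) |>
  arrange(desc(n))

short_way <- starwars |>
  count(species, sort = TRUE)   # sort = TRUE puts the largest counts first

identical(long_way$n, short_way$n)
```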

We may also want to know the number of distinct species that live on each homeworld. One nice thing about lots of the tidyverse functions is that we can assign new names to things within the function. When you name something in summarize, you are just making a new variable as we do in mutate.

starwars |>
group_by(homeworld) |>
summarise( distinct_species = n_distinct(species)) |>
arrange(desc(distinct_species))
homeworld distinct_species
Naboo 4
NA 4
Coruscant 2
Kamino 2
Tatooine 2
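
Note that n_distinct() counts a missing value as its own distinct value, which affects totals like the ones above. A small sketch on a made-up vector showing the default and the na.rm = TRUE option:

```r
library(dplyr)

species <- c("Human", "Droid", NA, "Human")   # made-up vector

n_distinct(species)                 # 3: Human, Droid, and NA
n_distinct(species, na.rm = TRUE)   # 2: Human and Droid
```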