As an object-oriented programming (OOP) language, R is not particularly user-friendly, so {tidyverse}, a suite of data management, visualization, and modeling packages, aims to turn R into a more user-friendly functional programming language. As one of the project’s major developers, Hadley Wickham, recently put it, “R is not a language driven by the purity of its philosophy; R is a language designed to get shit done.” This may have become somewhat evident in the indexing section, where I talked about [], [[]], and $. Languages like Python generally offer fewer ways to index an object.
The tidyverse is actually a collection of packages that share the same underlying philosophy (which we will get to). Generally I will load in the tidyverse like this.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
5.1 dplyr
“Happy families are all alike; every unhappy family is unhappy in its own way.” - Leo Tolstoy.
Data is often talked about in the same way. All clean datasets are alike in structure and dignity. Every messy dataset is messy in its own special way.
dplyr is the tidyverse way to manipulate data and an all-around great package. But you need to get the basic functions right, because cleaning data is what you will spend the most time on.
We will use the palmerpenguins and starwars datasets to demonstrate how to use the verbs in dplyr.
data("starwars")
library(palmerpenguins)
5.1.1 select
Select is the most intuitive. Select picks out the columns you want or, in some cases, do not want. If we only want the name of the character and the place they are from, we feed the names of those columns into select. If you are following along and copying and pasting the code in the guide to see what is going on, you may notice that your output differs slightly. That is because behind the scenes I am using select to cut down on the output.
select(starwars, name, homeworld)
| name | homeworld |
| --- | --- |
| Luke Skywalker | Tatooine |
| C-3PO | Tatooine |
| R2-D2 | Naboo |
| Darth Vader | Tatooine |
| Leia Organa | Alderaan |
If we wanted all the columns except for name and homeworld we would do this.
select(starwars, -name, -homeworld)
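| height | mass | hair_color | skin_color | eye_color | starships |
| --- | --- | --- | --- | --- | --- |
| 172 | 77 | blond | fair | blue | X-wing, Imperial shuttle |
| 167 | 75 | NA | gold | yellow | |
| 96 | 32 | NA | white, blue | red | |
| 202 | 136 | none | white | yellow | TIE Advanced x1 |
| 150 | 49 | brown | light | brown | |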
We can also feed it a range of columns using :
select(starwars, name:hair_color)
| name | height | mass | hair_color |
| --- | --- | --- | --- |
| Luke Skywalker | 172 | 77 | blond |
| C-3PO | 167 | 75 | NA |
| R2-D2 | 96 | 32 | NA |
| Darth Vader | 202 | 136 | none |
| Leia Organa | 150 | 49 | brown |
5.1.2 filter
Filter is how we subset by rows. To do this we need to tell filter what rows we want! This feels like it should be intuitive, but we have to use some concepts that are likely new to you. Let’s say we want characters that are from a particular world in starwars. To do that we do this.
filter(starwars, homeworld == "Naboo")
| name | homeworld |
| --- | --- |
| R2-D2 | Naboo |
| Palpatine | Naboo |
| Jar Jar Binks | Naboo |
| Roos Tarpals | Naboo |
| Rugor Nass | Naboo |
What we are doing here is creating a dataset with only the characters from Naboo. We tell R what rows we want by setting up some tests. Each time a row meets the condition, R grabs it. As a reminder, these are the kinds of tests you can do.
| Test | Meaning |
| --- | --- |
| x < y | Less than |
| x > y | Greater than |
| x == y | Equal to |
| x <= y | Less than or equal to |
| x >= y | Greater than or equal to |
| x != y | Not equal to |
| x %in% y | In set |
| is.na(x) | Is missing |
| !is.na(x) | Is not missing |
| !x | Not |
| x \| y | Or |
| x & y | And |
So let’s return to our Naboo example. If we wanted to reuse this dataset later, we would assign it to an object using <-, so let’s assign it to an object named naboo.
naboo <- filter(starwars, homeworld == "Naboo")
We can also combine multiple tests in filter. Let’s say we wanted all the characters that are from Naboo and are human.
filter(starwars, homeworld == "Naboo" & species == "Human")
| name | species | homeworld |
| --- | --- | --- |
| Palpatine | Human | Naboo |
| Gregar Typho | Human | Naboo |
| Cordé | Human | Naboo |
| Dormé | Human | Naboo |
| Padmé Amidala | Human | Naboo |
Now we have all the characters that are from Naboo and are human! filter treats conditions separated by a comma as an and test, so these two calls produce the same result.
filter(starwars, homeworld == "Naboo" & species == "Human")
filter(starwars, homeworld == "Naboo", species == "Human")
| name | homeworld | species |
| --- | --- | --- |
| Palpatine | Naboo | Human |
| Gregar Typho | Naboo | Human |
| Cordé | Naboo | Human |
| Dormé | Naboo | Human |
| Padmé Amidala | Naboo | Human |
If we wanted characters from two different homeworlds we would do an or test using | (the key above enter/return)
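As a minimal sketch of such an or test, here is what it could look like. The two homeworlds (Naboo and Tatooine) and the object name naboo_or_tatooine are my own picks for illustration.

```r
library(dplyr)

# Keep every row where homeworld is Naboo OR Tatooine.
# The two homeworlds are chosen purely for illustration.
naboo_or_tatooine <- filter(
  starwars,
  homeworld == "Naboo" | homeworld == "Tatooine"
)

# Only the two requested homeworlds should remain
unique(naboo_or_tatooine$homeworld)
```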
The reason we would use an or test is because one character can’t have two homeworlds! Remember, computers are dumb. As long as the code can run, it will do it. So if we use an and test, this is what it returns
# A tibble: 0 × 14
# ℹ 14 variables: name <chr>, height <int>, mass <dbl>, hair_color <chr>,
# skin_color <chr>, eye_color <chr>, birth_year <dbl>, sex <chr>,
# gender <chr>, homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
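An empty tibble like the one above is what an and test across two homeworlds gives back. Here is a sketch of such a call, assuming the two homeworlds are Naboo and Tatooine (the original call is not shown, so the specific values are a guess).

```r
library(dplyr)

# No single row can have both homeworlds at once,
# so this and test runs without error but matches nothing.
naboo_and_tatooine <- filter(
  starwars,
  homeworld == "Naboo" & homeworld == "Tatooine"
)

nrow(naboo_and_tatooine) # 0 rows
```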
When working with things in R stray commas and spelling will lead to headaches. Here is an example that would not throw an error but will create a dataframe with zero observations. Naboo exists as a value of homeworld, but naboo does not.
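A small sketch of that kind of silent mistake, using a lowercase naboo as described above. The object names misspelled and correct are just for illustration.

```r
library(dplyr)

# String comparison is case sensitive: "naboo" never equals "Naboo",
# so this runs fine but keeps zero rows.
misspelled <- filter(starwars, homeworld == "naboo")
nrow(misspelled) # 0 rows

# The correctly capitalized value keeps the Naboo characters.
correct <- filter(starwars, homeworld == "Naboo")
nrow(correct)
```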
Filter also works if you want particular values of something. Let’s say we want penguins with longer flippers or characters less than a certain height. We can do that with filter.
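Here is a rough sketch of both ideas. The cutoffs (200 mm for flippers and 100 cm for height) are arbitrary numbers I picked for illustration, not values from the text.

```r
library(dplyr)
library(palmerpenguins)

# Penguins with flippers longer than 200 mm (arbitrary cutoff)
long_flippers <- filter(penguins, flipper_length_mm > 200)

# Characters shorter than 100 cm (arbitrary cutoff)
short_characters <- filter(starwars, height < 100)

# filter() drops rows where the test is NA, so rows with a
# missing flipper length or height do not make it through.
anyNA(long_flippers$flipper_length_mm)
anyNA(short_characters$height)
```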
Sometimes we want a dataset that does not have any missing values in it for a particular column. In this case all we do is just add ! in front of the is.na function.
filter(starwars, !is.na(height))
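| name | height | mass | hair_color | skin_color | eye_color |
| --- | --- | --- | --- | --- | --- |
| Luke Skywalker | 172 | 77 | blond | fair | blue |
| C-3PO | 167 | 75 | NA | gold | yellow |
| R2-D2 | 96 | 32 | NA | white, blue | red |
| Darth Vader | 202 | 136 | none | white | yellow |
| Leia Organa | 150 | 49 | brown | light | brown |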
We would do something similar if we wanted characters that are not human!
filter(starwars, species != "Human")
| name | height | mass | hair_color | skin_color | eye_color | birth_year | sex | gender | homeworld | species | films | vehicles | starships |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C-3PO | 167 | 75 | NA | gold | yellow | 112 | none | masculine | Tatooine | Droid | The Empire Strikes Back, Attack of the Clones, The Phantom Menace, Revenge of the Sith, Return of the Jedi, A New Hope | | |
| R2-D2 | 96 | 32 | NA | white, blue | red | 33 | none | masculine | Naboo | Droid | The Empire Strikes Back, Attack of the Clones, The Phantom Menace, Revenge of the Sith, Return of the Jedi, A New Hope, The Force Awakens | | |
| R5-D4 | 97 | 32 | NA | white, red | red | NA | none | masculine | Tatooine | Droid | A New Hope | | |
| Chewbacca | 228 | 112 | brown | unknown | blue | 200 | male | masculine | Kashyyyk | Wookiee | The Empire Strikes Back, Revenge of the Sith, Return of the Jedi, A New Hope, The Force Awakens | AT-ST | Millennium Falcon, Imperial shuttle |
| Greedo | 173 | 74 | NA | green | black | 44 | male | masculine | Rodia | Rodian | A New Hope | | |
One last operator that I will show you is the %in% operator. This comes in really handy for lots of things. Intuitively we can think about it as a bunch of thing == "value" | thing == "value" tests glued together. Let’s say we wanted all the characters from Tatooine, Naboo, and Coruscant. With what we know we would do something like this.
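A minimal sketch of that chain of or tests (the object name three_worlds_or is just for illustration):

```r
library(dplyr)

# One == test per homeworld, glued together with |
three_worlds_or <- filter(
  starwars,
  homeworld == "Tatooine" | homeworld == "Naboo" | homeworld == "Coruscant"
)

# Only the three requested homeworlds should remain
unique(three_worlds_or$homeworld)
```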
While this is not the most typing in the world, it can quickly get to be a lot if there are a bunch of mutually exclusive values we need to subset our data by. This is where the %in% operator comes in to save us! We can rewrite this series of or tests like this.
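Here is a sketch of the %in% version, next to the longer or version so the two can be compared. The object names are hypothetical.

```r
library(dplyr)

# The three homeworlds go into a single vector on the right of %in%
three_worlds_in <- filter(
  starwars,
  homeworld %in% c("Tatooine", "Naboo", "Coruscant")
)

# The same subset written as a chain of or tests
three_worlds_or <- filter(
  starwars,
  homeworld == "Tatooine" | homeworld == "Naboo" | homeworld == "Coruscant"
)

# Check that both approaches keep the same characters
identical(three_worlds_in$name, three_worlds_or$name)
```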
I have done some additional filtering behind the scenes to show you that it worked so your dataset likely looks a little different.
Notice how the test works. The first thing is the name of the variable and the second is the set of homeworlds we want. What R is doing is taking the value of the homeworld variable for each row and seeing if it matches anything on the right-hand side of %in%. Kind of like this.
5 %in% 1:10
[1] TRUE
If we wanted all things outside of this subset we can do this
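One way to sketch this is to put ! in front of the whole %in% test. I am assuming the same three homeworlds as before; the object names are made up for illustration.

```r
library(dplyr)

worlds <- c("Tatooine", "Naboo", "Coruscant")

# ! flips the result of the %in% test, keeping everything NOT in the set
other_worlds <- filter(starwars, !homeworld %in% worlds)

# Heads up: NA %in% worlds is FALSE, so rows with a missing homeworld
# are kept by the negated test.
sum(is.na(other_worlds$homeworld))
```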
5.1.3 mutate
mutate is how we create new columns in our dataset. There are tons of things that we may need to do in order to create variables, so let’s start with making what is known as an indicator variable. Let’s say we want to create a variable that indicates whether a character is human or not. The first thing we need to do is name the column. Let’s call this column human. We then need to tell R what values the human variable takes. To do this we can use the ifelse function, which kind of works like filter.
ifelse has a few components
ifelse(test, what it does if true, what it does if false)
So in our case, if the species column has a value of “Human” then it returns TRUE; otherwise it returns FALSE.
mutate(starwars, human = ifelse(species == "Human", TRUE, FALSE))
| name | species |
| --- | --- |
| Luke Skywalker | Human |
| C-3PO | Droid |
| R2-D2 | Droid |
| Darth Vader | Human |
| Leia Organa | Human |
ifelse is not limited to TRUE or FALSE. You can put really anything in there; it is just easier if you stick with TRUE and FALSE. Let’s see a somewhat silly example with the palmerpenguins dataset.
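The original example is not shown here, so this is a stand-in sketch of my own: it labels penguins by body mass. The size column, the labels, and the 4000 g cutoff are all made up for illustration.

```r
library(dplyr)
library(palmerpenguins)

# ifelse() can return strings instead of TRUE/FALSE.
# The 4000 g cutoff and the labels are arbitrary choices.
penguins_labeled <- mutate(
  penguins,
  size = ifelse(body_mass_g > 4000, "big penguin", "small penguin")
)

# Penguins with a missing body mass get NA rather than a label
count(penguins_labeled, size)
```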
R gives us a lot of flexibility to create all kinds of variables. So let’s make a column in our dataset where we see how old a character is in dog years with a description, so other people know what is going on.
mutate(starwars, dog_years = paste(name, "is", birth_year * 7, "in dog years"))
| name | dog_years |
| --- | --- |
| Luke Skywalker | Luke Skywalker is 133 in dog years |
| C-3PO | C-3PO is 784 in dog years |
| R2-D2 | R2-D2 is 231 in dog years |
| Darth Vader | Darth Vader is 293.3 in dog years |
| Leia Organa | Leia Organa is 133 in dog years |
mutate is order aware. So if you want to do something with that new variable, you can do that in the same mutate call. If you want to do multiple things in mutate, that is also easy.
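A short sketch of what order aware means in practice. The column names height_m and bmi are hypothetical; I am assuming height is recorded in centimeters and mass in kilograms, as the starwars documentation describes.

```r
library(dplyr)

starwars_bmi <- mutate(
  starwars,
  height_m = height / 100,  # created first: height in meters
  bmi = mass / height_m^2   # uses height_m from the line above
)

select(starwars_bmi, name, height_m, bmi)
```

Swapping the two lines would fail, because bmi would be asking for a height_m column that does not exist yet.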
There are tons of different kinds of operations you can do with mutate! Each type of variable has different things you can and cannot do to it! If you want a more complete breakdown, I suggest that you look at R4DS chapters 13 through 18. For now we will set that aside.
5.1.4 What if we want to do more than one thing at once?
Generally, data cleaning consists of multiple steps. Sometimes we need to subset our data only to include the columns we care about and make a new variable. We could use what is known as a nested function call like this.
select(mutate(starwars, human = ifelse(species == "Human", TRUE, FALSE)), name, human)
| name | human |
| --- | --- |
| Luke Skywalker | TRUE |
| C-3PO | FALSE |
| R2-D2 | FALSE |
| Darth Vader | TRUE |
| Leia Organa | TRUE |
Or we could create an intermediate object named starwars_human_add and then select the columns we want from there. However, both of these solutions are annoying and unintuitive. Instead, we use |>, technically called a pipe, to combine multiple steps in our data-cleaning pipeline. You should read it as “and then”. The easiest way to think of the pipe when working through stuff is this example from Andrew Heiss, using your morning routine.
Behind the scenes, I have been using |> all over the place.
me |>
  wake_up(time = "8.00am") |>
  get_out_of_bed(side = "correct") |>
  get_dressed(pants = TRUE, shirt = TRUE) |>
  leave_house(car = TRUE, bike = FALSE, MARTA = FALSE) |>
  am_late(traffic = TRUE)
This works because of the shared logic of the tidyverse.
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_length_mm_sq <dbl>
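The call that produced the output above is not shown. Judging by the extra bill_length_mm_sq column, a piped mutate along these lines would give a tibble with the same shape. Squaring bill_length_mm is my guess based on the column name, not something the text confirms.

```r
library(dplyr)
library(palmerpenguins)

# penguins is passed as the first (.data) argument of mutate().
# Squaring the bill length is an assumption based on the column name.
penguins_sq <- penguins |>
  mutate(bill_length_mm_sq = bill_length_mm^2)

dim(penguins_sq) # 344 rows, 9 columns
```

The native pipe needs R 4.1 or later.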
Note how each verb has the .data argument in the first position. The pipe takes what’s on the left-hand side and passes it as the first argument to the function on the right-hand side, so starwars |> mutate(...) passes starwars as the first argument to mutate. This lets us chain together multiple operations. In my opinion, a pipeline is cleaner and easier to read than a large nested function call like this one.
filter(mutate(penguins, female = ifelse(sex == "female", TRUE, FALSE)), species == "Adelie")
# A tibble: 152 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 142 more rows
# ℹ 3 more variables: sex <fct>, year <int>, female <lgl>
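For comparison, here is a sketch of the same two steps written with the pipe, read as "take penguins, and then mutate, and then filter". The object names adelie_piped and adelie_nested are just for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Piped version: each step reads top to bottom
adelie_piped <- penguins |>
  mutate(female = ifelse(sex == "female", TRUE, FALSE)) |>
  filter(species == "Adelie")

# Nested version: read from the inside out
adelie_nested <- filter(
  mutate(penguins, female = ifelse(sex == "female", TRUE, FALSE)),
  species == "Adelie"
)

# Both spellings describe the same computation
identical(adelie_piped, adelie_nested)
```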
# A tibble: 152 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 142 more rows
# ℹ 3 more variables: sex <fct>, year <int>, female <lgl>
The tidyverse has its own pipe that looks like %>%. This also works well. I use the native pipe, |>, included in R since version 4.1. In RStudio, the keyboard shortcut for both is Ctrl + Shift + M on Windows and Cmd + Shift + M on Mac. To have the base R pipe appear, go to Tools -> Global Options -> Code and check “Use native pipe operator”.
5.1.5 group_by and summarize
We tend to use these two commands together because they are pretty well matched. group_by groups the rows of our dataset by a column or columns, and summarize then collapses each group to a single row. Think of this as collapsing your data into your unit of analysis, like country. If you have panel data where we observe multiple countries over multiple years, we can use group_by to look at country and year. summarize lets you pass a whole host of functions to get descriptive measures of the data.
Both the British English (summarise) and American English (summarize) spellings work the same. The main author and maintainer of dplyr and lots of other stuff in the tidyverse is from New Zealand, so R for Data Science uses British spellings.
Imagine that we want to know the average height for each species. Using what you know about dplyr, you might write code like this with pipes.
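A minimal sketch of that pipeline. The column name mean_height is my own choice, and na.rm = TRUE is an assumption I added so that characters with a missing height do not turn a whole species average into NA.

```r
library(dplyr)

avg_height <- starwars |>
  group_by(species) |>                                # one group per species
  summarize(mean_height = mean(height, na.rm = TRUE)) # one row per group

avg_height
```

Characters with a missing species end up in their own NA group.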
Sometimes we want to count how many characters we have of each species. There are many ways to do this in R. One of the most common you will see is this.
starwars |>
  group_by(species) |>
  summarise(n()) |>
  arrange(desc(`n()`)) # just sorts things from highest to lowest
| species | n() |
| --- | --- |
| Human | 35 |
| Droid | 6 |
| NA | 4 |
| Gungan | 3 |
| Kaminoan | 2 |
We may also want to know the distinct number of species that live on each homeworld. One nice thing about lots of the tidyverse functions is that we can assign new names to stuff within the function. When you are naming stuff in summarize, you are just making a new variable as we do in mutate.
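As a sketch, n_distinct from dplyr counts unique values, and we can name the result right inside summarize. The column name n_species is hypothetical.

```r
library(dplyr)

species_per_world <- starwars |>
  group_by(homeworld) |>
  summarize(n_species = n_distinct(species)) |> # named just like in mutate
  arrange(desc(n_species))                      # highest counts first

species_per_world
```

By default n_distinct counts a missing species as its own value; passing na.rm = TRUE would leave those out.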