Since I am going on the non-academic job market, it is high time I learned SQL. I have tried lots of great resources, but I find it hard to move back and forth between my notes and the material: SQL and dplyr are just familiar enough to each other to trip me up, and lots of tutorials send you off to various online editors. This blog post will serve as my notes and hopefully as a resource for people other than myself. The general idea is that I am just going to work through R4DS and the various dplyr verbs, then move on to some more advanced SQL like window functions and whatnot.
Setup
For the majority of this post I am using the palmerpenguins dataset, not because you really need SQL for a dataset this small, but because copying over the nyc-taxi dataset is incredibly annoying for blogging purposes.
library(DBI)
library(arrow)
Attaching package: 'arrow'
The following object is masked from 'package:utils':
timestamp
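The setup is roughly this: an in-memory SQLite database with the penguins copied over as a table called pengs (the connection object con and the lazy table pengs are what everything below refers to; the exact calls here are a sketch of how I set it up).

library(dplyr)
library(dbplyr)
library(palmerpenguins)

# an in-memory SQLite database to play in
con = dbConnect(RSQLite::SQLite(), ":memory:")

# copy the penguins over and keep a lazy reference we can pipe dplyr verbs into
pengs = copy_to(con, penguins, name = "pengs")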
We are going to go back and forth between dbplyr and raw SQL to query the dataset. What impressed me throughout this process was how seamlessly dbplyr works with dplyr verbs. With the exception of some string functions, it can work as a drop-in replacement for writing SQL by hand. What really helped throughout this process was writing out my dplyr queries and then inspecting them with show_query.
pengs |> select(species) |> show_query()
<SQL>
SELECT `species`
FROM `pengs`
This gives us the SQL translation. Obviously this is a pretty simple query, but as things get more and more complex it is going to be helpful. For the most part show_query outputs the right query, but it can be a little difficult to debug because of the limitations of printing things in the R console.
Dplyr
Select
One convention in SQL which I don't really get, but is a thing, is that keywords are written in all caps. Luckily for us the SQL and dplyr versions are pretty much the same; one is just shouty. If we wanted all the columns, as we might when importing the dataset for the first time, we would just do SELECT * FROM pengs. There is not really a perfect equivalent in R except for maybe head, and even then it is not a one-to-one match.
R
head(pengs)
# Source: SQL [6 x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <chr>, year <int>
SQL
tbl(con, sql("SELECT * FROM pengs"))
# Source: SQL [?? x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 2 more variables: sex <chr>, year <int>
For one or multiple variables we use a very similar syntax. Where SQL and R differ is where we put the object we are querying from: in R we use the pipe so the data is the first argument of select, while in SQL we put the table name after the columns we are selecting, like this.
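For example, roughly:

# dplyr: the data comes first and gets piped into select()
pengs |> select(species, island)

# SQL: the columns come first, then the table we are selecting FROM
tbl(con, sql("SELECT species, island FROM pengs"))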
So what if we want to keep all but one column? In R we would do this:
pengs |> select(-species) |> head(n = 2)
# Source: SQL [2 x 7]
# Database: sqlite 3.46.0 [:memory:]
island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<chr> <dbl> <dbl> <int> <int> <chr> <int>
1 Torger… 39.1 18.7 181 3750 male 2007
2 Torger… 39.5 17.4 186 3800 fema… 2007
Unfortunately that is not really a thing in some flavors of SQL. Other flavors let you use EXCEPT, but as the results from show_query suggest, here we actually need to list every column we want to keep. The same goes if we wanted to use starts_with.
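A quick sketch of what I mean: asking for the translation of select(-species) just spells out every remaining column.

# no "all columns except" shorthand here, so dbplyr enumerates the
# seven columns that survive select(-species)
pengs |>
  select(-species) |>
  show_query()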
Filter
The first major difference syntactically between dplyr and SQL is with filter statements, aka WHERE statements in SQL. So let's say we want only the Adelie penguins.
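In dplyr that is a one-liner, roughly (this is what produces the output below):

# keep only the Adelie penguins
pengs |> filter(species == 'Adelie')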
# Source: SQL [?? x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 2 more variables: sex <chr>, year <int>
Becomes.
tbl(con, sql("SELECT * FROM pengs WHERE species = 'Adelie'"))
# Source: SQL [?? x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 2 more variables: sex <chr>, year <int>
Some flavors of SQL make you end statements with a ';'.
As dplyr users will notice, the equality condition uses = instead of ==. This is going to come up a lot. The same goes for negation operators.
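A sketch of the negation case: dplyr keeps !=, while SQL's not-equal operator is <> (SQLite also accepts !=).

# dplyr: everything except the Adelie penguins
pengs |> filter(species != 'Adelie')

# SQL: <> is the not-equal operator
tbl(con, sql("SELECT * FROM pengs WHERE species <> 'Adelie'"))

Combining conditions works the same way: dplyr's | becomes OR, so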
pengs |> filter(species == 'Chinstrap' | species == 'Adelie')
# Source: SQL [?? x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 2 more variables: sex <chr>, year <int>
becomes
tbl(con, sql("SELECT * from pengs WHERE species = 'Adelie' OR species = 'Chinstrap'"))
# Source: SQL [?? x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 2 more variables: sex <chr>, year <int>
You could easily sub in AND, but it feels a bit excessive to spell out every possible combination this way. One thing that I do all the time instead is use sets to subset my data, as sketched below.
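Roughly, that is %in% on the dplyr side and IN on the SQL side:

# dplyr: keep rows whose species is in the set
pengs |> filter(species %in% c('Chinstrap', 'Adelie'))

# SQL: the IN operator does the same job
tbl(con, sql("SELECT * FROM pengs WHERE species IN ('Chinstrap', 'Adelie')"))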
Things get less straightforward when we want to filter on an aggregate, like keeping penguins below the average body mass. Aggregation functions are something WHERE can't handle, because that's not its job. So if we did
tbl(con, "SELECT * from pengs WHERE body_mass_g < AVG(body_mass_g)")
Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
ℹ Using SQL: SELECT * FROM `SELECT * from pengs WHERE body_mass_g <
AVG(body_mass_g)` AS `q01` WHERE (0 = 1)
Caused by error:
! no such table: SELECT * from pengs WHERE body_mass_g < AVG(body_mass_g)
We get an error. If we want to use an aggregation function in a filter we have to change how we do it.
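Here is roughly the dplyr pipeline I was going for; show_query gives the translation below.

# filter to penguins lighter than average and ask dbplyr for the SQL
pengs |>
  filter(body_mass_g < mean(body_mass_g, na.rm = TRUE)) |>
  show_query()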
<SQL>
SELECT
`species`,
`island`,
`bill_length_mm`,
`bill_depth_mm`,
`flipper_length_mm`,
`body_mass_g`,
`sex`,
`year`
FROM (
SELECT `pengs`.*, AVG(`body_mass_g`) OVER () AS `col01`
FROM `pengs`
) AS `q01`
WHERE (`body_mass_g` < `col01`)
What is this OVER thing? OVER turns an aggregate into a window function. There is a more technical way to explain this, but heuristically: when we pass AVG directly to WHERE, the average has not been computed yet, so there is nothing for each row to be compared to. OVER () computes the average across the whole table and attaches it to every row, which gives WHERE something to compare against.
So if we want to filter to penguins that are less than the average body mass, we first compute the average as its own column in a subquery and then write the less-than comparison against that column, like this
tbl(con, sql("SELECT * FROM( SELECT pengs .*, AVG(body_mass_g) OVER () AS avg FROM pengs) WHERE (body_mass_g < avg)"))
It is a little clunky, but the tl;dr is that we basically have two SELECT ... FROM statements: the inner one builds the window column and the outer one filters on it. So if we wanted all penguins between the minimum and the average we could do
tbl(con, sql("SELECT * FROM( SELECT pengs .*, AVG(body_mass_g) OVER() AS avg, MIN(body_mass_g) OVER() AS min FROM pengs) WHERE body_mass_g BETWEEN min AND avg"))
As you will notice in all our examples, we have lots and lots of missing values, and dealing with them is one of the most common tasks in any data science work. Let's say we can safely ignore the missing values. In R we have a lot of options, whether we use filter or drop_na from tidyr. In SQL, missing values are represented by NULL, so we filter them out in the WHERE clause.
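On the dplyr side this is roughly:

# drop rows where sex is missing
pengs |> filter(!is.na(sex))

In raw SQL that becomes an IS NULL check: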
tbl(con, sql("SELECt * FROM pengs WHERE NOT sex IS NULL"))
The AS keyword is kind of the workhorse for the next few sections. The naming convention differs a little bit: instead of new_name = old_name we write SELECT old_name AS new_name.
tbl(con, sql("SELECT species AS kinds_of_penguins FROM pengs"))
As with lots of things, we need to be able to create our own variables. In R we do this with mutate:
pengs |> mutate(sqr_body_mass = body_mass_g^2)
# Source: SQL [?? x 9]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 3 more variables: sex <chr>, year <int>, sqr_body_mass <dbl>
In SQL the equivalent is SELECT transformation AS new_var_name: when we need a column that is not already in the dataset, we basically define it in the SELECT before we do anything else.
tbl(con, sql("SELECT pengs .*, POWER(body_mass_g,2) AS sqr_body_mass FROM pengs"))
# Source: SQL [?? x 9]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 3 more variables: sex <chr>, year <int>, sqr_body_mass <dbl>
So if we wanted to make a ratio of bill depth to bill length we would do
tbl(con, sql("SELECT pengs .*, bill_depth_mm/bill_length_mm AS ratio FROM pengs"))
# Source: SQL [?? x 9]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 3 more variables: sex <chr>, year <int>, ratio <dbl>
A very important thing we do all the time is generate indicator variables for treatment status, gender, etc. Oddly enough, if we peek at the output of show_query we see a familiar face!
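The mutate was roughly this (if_else is one of several ways to build the indicator); show_query prints the translation below.

# an indicator for male penguins; dbplyr turns if_else() into CASE WHEN
pengs |>
  mutate(male = if_else(sex == 'Male', 1, 0)) |>
  show_query()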
<SQL>
SELECT
`pengs`.*,
CASE WHEN (`sex` = 'Male') THEN 1.0 WHEN NOT (`sex` = 'Male') THEN 0.0 END AS `male`
FROM `pengs`
So to make an indicator variable we would just do
tbl(con, sql("SELECT pengs.*, CASE WHEN (sex = 'male') THEN 1.0 WHEN not (sex = 'male') THEN 0.0 END AS male FROM pengs"))
# Source: SQL [?? x 9]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 3 more variables: sex <chr>, year <int>, male <dbl>
Let’s combine our window functions with our friend case_when
tbl(con, sql("SELECT * FROM(SELECT pengs .*, AVG(body_mass_g) AS avg, MIN(body_mass_g) AS min, MAX(body_mass_g) AS max, CASE WHEN (body_mass_g = min) THEN 'This penguins is small' WHEN (body_mass_g = avg) THEN 'This is an average sized penguin' WHEN (body_mass_g = max) THEN 'this is a really big penguin' END AS note FROM pengs)"))
I will spare you the long error message, but needless to say this was wrong. If we translate what I was trying to do into dplyr we get this:
pengs |>
  mutate(note = case_when(
    body_mass_g == min(body_mass_g) ~ 'This is a small peng',
    body_mass_g == mean(body_mass_g) ~ 'Average sized peng',
    body_mass_g == max(body_mass_g) ~ 'Big sized peng',
    .default = 'Penguin is some size'
  )) |>
  show_query()
Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.
<SQL>
SELECT
`pengs`.*,
CASE
WHEN (`body_mass_g` = MIN(`body_mass_g`) OVER `win1`) THEN 'This is a small peng'
WHEN (`body_mass_g` = AVG(`body_mass_g`) OVER `win1`) THEN 'Average sized peng'
WHEN (`body_mass_g` = MAX(`body_mass_g`) OVER `win1`) THEN 'Big sized peng'
ELSE 'Penguin is some size'
END AS `note`
FROM `pengs`
WINDOW `win1` AS ()
So it looks like we need to declare the window ourselves: name it with WINDOW win1 AS () at the end of the query and compute the aggregates OVER it.
check =tbl(con, sql("SELECT pengs .*, CASE WHEN (body_mass_g >= MIN(body_mass_g) OVER win1) THEN 'this is a small penguin' WHEN (body_mass_g = AVG(body_mass_g) OVER win1) THEN 'this is an average sized penguin' WHEN (body_mass_g = MAX(body_mass_g) OVER win1) THEN 'this is a big penguin' ELSE 'This penguin is not big, small or average' END AS note FROM pengs WINDOW win1 AS ()")) |>collect()
Let's look at this a little closer to make sure it worked. We would probably want to make this a little more robust, so let's go ahead and define ranges.
tbl(con, sql("SELECT pengs .*, CASE WHEN (body_mass_g >= MIN(body_mass_g) OR body_mass_g < AVG(body_mass_g) OVER win1) THEN 'this is a small penguin' WHEN (body_mass_g >= AVG(body_mass_g) OR body_mass_g < MAX(body_mass_G) OVER win1) THEN 'this is an average sized penguin' WHEN (body_mass_g >= MAX(body_mass_g) OVER win1) THEN 'this is a big penguin' ELSE 'This penguin is not big, small or average' END AS note FROM pengs WINDOW win1 AS ()"))
# Source: SQL [1 x 9]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Chinstrap Dream 46.9 16.6 192 2700
# ℹ 3 more variables: sex <chr>, year <int>, note <chr>
Group by and summarize
As established earlier we can use SQL to summarize like this.
tbl(con, sql('SELECT AVG(bill_depth_mm) AS avg FROM pengs'))
But the practical utility of a single grand summary is somewhat limited; often we want group-specific summaries. Oddly enough I expected this to be a window-function thing, but the grouping is instead handled by a GROUP BY at the end of the query. I guess this makes sense if we are dealing with big data.
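In dplyr this is the usual group_by plus summarize; roughly this is what produces the trailing GROUP BY in the generated query:

# group-wise average body mass; the grouping shows up as GROUP BY at the end
pengs |>
  group_by(species) |>
  summarize(avg_body_mass = mean(body_mass_g, na.rm = TRUE)) |>
  show_query()

Written by hand, it is just: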
tbl(con, sql("SELECT species, AVG(body_mass_g) AS avg_body_mass FROM pengs GROUP BY species"))
So if we wanted a count of each species we would do something along these lines
tbl(con, sql("SELECT species, COUNT(species) AS total FROM pengs GROUP BY species"))
# Source: SQL [3 x 2]
# Database: sqlite 3.46.0 [:memory:]
species total
<chr> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
For multiple grouping variables we list them after GROUP BY, the same way we would list them in group_by.
tbl(con, sql("SELECT species, sex, COUNT(species) AS total FROM pengs GROUP BY species, sex"))
# Source: SQL [8 x 3]
# Database: sqlite 3.46.0 [:memory:]
species sex total
<chr> <chr> <int>
1 Adelie <NA> 6
2 Adelie female 73
3 Adelie male 73
4 Chinstrap female 34
5 Chinstrap male 34
6 Gentoo <NA> 5
7 Gentoo female 58
8 Gentoo male 61
The same would go for multiple summary functions
tbl(con, sql("SELECT species, COUNT(species) AS total, AVG(bill_depth_mm) AS avg_bill_depth, MEDIAN(bill_depth_mm) AS median_bill_depth FROM pengs GROUP BY sex"))
In the real world it is rare that all our data lives in one place. Companies keep information in lots of different tables, partly for security (if credit card numbers sat alongside everything else needed to make a purchase, a malicious actor would only have to break into one database) and partly for cost, since replacing entire data tables can get expensive. So instead of rewriting whole tables, it is more efficient to simply insert rows or join tables together as needed.
Appending Rows
To kind of mimic this we are just going to slice the penguins data roughly in half. While not entirely realistic, the general process will be similar enough.
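Roughly like this (the split point is arbitrary; pengs_top, pengs_bottom, and the connection con2 are what the queries below use):

# split the penguins roughly in half and copy each half into a fresh
# in-memory database so we have two tables to stack back together
con2 = dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con2, slice_head(penguins, n = 172), name = "pengs_top")
copy_to(con2, slice_tail(penguins, n = 172), name = "pengs_bottom")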
For whatever reason show_query is not working with this, so we are going to have to consult the interwebs. The SQL equivalent of bind_rows is UNION ALL.
tbl(con2, sql("SELECT * FROM pengs_top UNION ALL SELECT * FROM pengs_bottom"))
# Source: SQL [?? x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 2 more variables: sex <chr>, year <int>
One of the key things in this query is ALL, which is somewhat new to me. Basically, ALL tells SQL that we don't care about duplicates, so it adds the rows regardless. If we wanted to exclude duplicates we would drop ALL and use a plain UNION, like this
Code
tbl(con2, sql("SELECt * FROM pengs_top UNION SELECT * FROM pengs_top")) |>collect() |>nrow()
[1] 172
Code
tbl(con2,sql("SELECT * FROM pengs_top") ) |>collect() |>nrow()
[1] 172
Joins
Luckily for us the join syntax in dplyr is pretty directly taken from SQL, so let's create some dummy data to join.
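Here is the dummy data: a fake state-year panel (national_data) and a smaller table of libraries and schools (national_libraries), copied into another in-memory source I am calling con4. The values are made up; this is roughly how it is set up.

# fake state-year panel plus a smaller table to join onto it
national_data = tibble::tribble(
  ~state, ~year, ~unemployment, ~inflation, ~population,
  "GA",   2018,  5.0,           2.0,        100,
  "GA",   2019,  5.3,           1.8,        200,
  "GA",   2020,  5.2,           2.5,        300,
  "NC",   2018,  6.1,           1.8,        350,
  "NC",   2019,  5.9,           1.6,        375,
  "NC",   2020,  5.3,           1.8,        400,
  "CO",   2018,  4.7,           2.7,        200,
  "CO",   2019,  4.4,           2.6,        300,
  "CO",   2020,  5.1,           2.5,        400
)

national_libraries = tibble::tribble(
  ~state, ~year, ~libraries, ~schools,
  "NC",   2018,  200,        610,
  "NC",   2019,  210,        590,
  "NC",   2020,  220,        530,
  "CO",   2018,  230,        470,
  "CO",   2019,  240,        440,
  "CO",   2020,  270,        510
)

con4 = src_memdb()
national_data_db = copy_to(con4, national_data, overwrite = TRUE)
national_libraries_db = copy_to(con4, national_libraries, overwrite = TRUE)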
So we have some fake national-level data that we would like to join onto the libraries table. We could let the join decide the keys for us, but that will probably create some headaches later on. To avoid this, we spell out our keys and expose the underlying logic with show_query.
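The dplyr call is roughly this, using the lazy tables created above:

# spell out both keys so nothing is guessed for us
national_data_db |>
  left_join(national_libraries_db, join_by(state, year)) |>
  show_query()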
<SQL>
SELECT `national_data`.*, `libraries`, `schools`
FROM `national_data`
LEFT JOIN `national_libraries`
ON (
`national_data`.`state` = `national_libraries`.`state` AND
`national_data`.`year` = `national_libraries`.`year`
)
You will notice that join_by is shorthand for equality joins. What changes in raw SQL is that instead of left_key == right_key we have to spell out which column comes from which table using table.column notation.
db_con = con4$con
query = "SELECT * FROM national_data LEFT JOIN national_libraries ON (national_data.state = national_libraries.state AND national_data.year = national_libraries.year)"
dbGetQuery(db_con, sql(query))
state year unemployment inflation population state year libraries schools
1 GA 2018 5.0 2.0 100 <NA> NA NA NA
2 GA 2019 5.3 1.8 200 <NA> NA NA NA
3 GA 2020 5.2 2.5 300 <NA> NA NA NA
4 NC 2018 6.1 1.8 350 NC 2018 200 610
5 NC 2019 5.9 1.6 375 NC 2019 210 590
6 NC 2020 5.3 1.8 400 NC 2020 220 530
7 CO 2018 4.7 2.7 200 CO 2018 230 470
8 CO 2019 4.4 2.6 300 CO 2019 240 440
9 CO 2020 5.1 2.5 400 CO 2020 270 510
For whatever reason SQLite gets a little grumpy with this join syntax when it goes through tbl, so here I am using dbGetQuery directly.
If we wanted to do various other joins like inner and anti joins we would do a similar thing.
query ="SELECT * FROM national_data INNER JOIN national_libraries ON( national_data.state = national_libraries.state AND national_data.year = national_libraries.year )"dbGetQuery(db_con, sql(query))
state year unemployment inflation population state year libraries schools
1 CO 2018 4.7 2.7 200 CO 2018 230 470
2 CO 2019 4.4 2.6 300 CO 2019 240 440
3 CO 2020 5.1 2.5 400 CO 2020 270 510
4 NC 2018 6.1 1.8 350 NC 2018 200 610
5 NC 2019 5.9 1.6 375 NC 2019 210 590
6 NC 2020 5.3 1.8 400 NC 2020 220 530
Inequality joins
Confession: I have never really understood how inequality joins work in regular dplyr, but I am sure at some point I am going to need them, and now, when the stakes are so low, is a good time to learn. So let's just take the data from the dplyr 1.1.0 announcement, since we know what the output should be.
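Here are the toy tables from that announcement, copied into the same in-memory source as before (roughly; I reassign the names to the lazy tables so the calls below work as written):

# companies and transactions from the dplyr 1.1.0 post: Posit shares
# RStudio's id but only from 2022 onward
companies = tibble::tibble(
  id    = c("A", "B", "B"),
  since = c(1973, 2009, 2022),
  name  = c("Patagonia", "RStudio", "Posit")
)

transactions = tibble::tibble(
  company = c("A", "A", "B", "B"),
  year    = c(2019, 2020, 2021, 2023),
  revenue = c(50, 4, 10, 12)
)

# overwrite the local tibbles with lazy references to the copied tables
companies = copy_to(con4, companies, overwrite = TRUE)
transactions = copy_to(con4, transactions, overwrite = TRUE)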
The main idea of an inequality join is that we join by a key (in this case company) but only keep records from a certain date onward. The blog post kind of equates it with a filter/WHERE that happens during the join phase. So we would see something like this
transactions |> inner_join(companies, join_by(company == id, year >= since))
# Source: SQL [5 x 5]
# Database: sqlite 3.46.0 [:memory:]
company year revenue since name
<chr> <dbl> <dbl> <dbl> <chr>
1 A 2019 50 1973 Patagonia
2 A 2020 4 1973 Patagonia
3 B 2021 10 2009 RStudio
4 B 2023 12 2009 RStudio
5 B 2023 12 2022 Posit
Instead of two equality conditions we use effectively the same syntax, just swapping one = for >=.
query =" SELECT * FROM transactions INNER JOIN companies ON( transactions.company = companies.id AND transactions.year >= companies.since )"dbGetQuery(db_con, sql(query))
company year revenue id since name
1 A 2019 50 A 1973 Patagonia
2 A 2020 4 A 1973 Patagonia
3 B 2021 10 B 2009 RStudio
4 B 2023 12 B 2009 RStudio
5 B 2023 12 B 2022 Posit
Tidyr
This section is really about pivots
Pivots
In tidyverse parlance we use pivots to change the "shape" of the data. If you are unfamiliar with this idea, consider the religion and income data below. You will notice that we have a column for each income bracket, which is sometimes called "wide" data. This may be useful for some questions, but generally if we want to plot or model things it is easier if the data are "long".
Code
con5 = src_memdb()
relig = copy_to(con5, relig_income, overwrite = TRUE)
head(relig_income, n = 2)
To make our data "long" we use pivot_longer, and to make data "wide" we use pivot_wider. Each has its own quirks, but the general idea is that we have to tell these functions where to put the old names / where to get the new names, and where to put the old values / where to get the new values. So if we wanted to make our data longer we would do something like this.
long = relig_income |>
  pivot_longer(-religion, names_to = 'income_bracket', values_to = 'income')
head(long, n = 2)
# A tibble: 2 × 3
religion income_bracket income
<chr> <chr> <dbl>
1 Agnostic <$10k 27
2 Agnostic $10-20k 34
If we wanted to make this wide again all we would do is reverse this with pivot_wider
wide = long |> pivot_wider(names_from = income_bracket, values_from = income)
There is a ton of additional functionality that will not be covered here, like dealing with columns that do not uniquely identify rows.
To get a sense of how to do this in SQL, let's consult our old friend show_query.
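dbplyr also translates the tidyr pivots for lazy tables, so roughly this is what I pointed show_query at:

# ask dbplyr how it would pivot the database table
relig |>
  pivot_longer(-religion, names_to = 'income_bracket', values_to = 'income') |>
  show_query()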
We are not going to actually show the result because it is quite the query. The summary of what is happening is that SQLite doesn't have a direct equivalent of pivot_longer. Basically, you build one narrow SELECT per column and keep stacking them on top of each other with UNION ALL until you have a long table. In other flavors of SQL this process is a lot more humane, with explicit PIVOT and UNPIVOT statements, but I am not in one of those flavors. To spare myself a bit I am just going to do a few of the columns.
tbl(con5, sql(" SELECT religion, '<$10k' AS income_bracket, '<$10k' AS income FROM relig_income UNION ALL SELECT religion, '$10-20k' AS income_bracket, '$10-20k' AS income FROM relig_income UNION ALL SELECT religion, '$20-30k' AS income_bracket, '$20-30k' AS income FROM relig_income"))
Okay, again this is a little unwieldy to show. Basically what happens going the other way is that we build a big CASE WHEN per bracket, reuse the same stacking trick to get a long table, and then group the data back down to one row per religion. So let's go ahead and copy and paste some of this.
Code
query ="SELECT religion, MAX(CASE WHEN (income_bracket = '<$10k') THEN income END) AS '<$10K', MAX(CASE WHEN (income_bracket = '$10-20k') THEN income END) AS '$10-20k', MAX(CASE WHEN (income_bracket = '$20-30k') THEN income END) AS '$20-30k'FROM ( SELECT religion, '<$10k' AS income_bracket, '<$10k' AS income FROM relig_income UNION ALL SELECT religion, '$10-20k' AS income_bracket, '$10-20k' AS income FROM relig_income UNION ALL SELECT religion, '$20-30k' AS income_bracket, '$20-30k' AS income FROM relig_income) AS wide_religionGROUP BY religion"tbl(con5, sql(query))
One thing that you come across from time to time in R and Python data wrangling is list columns. These happen for a variety of reasons and are pretty harmless to deal with in R.
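The long table below comes from unnesting the films list column in dplyr::starwars, roughly:

# films is a list column; unnest_longer() gives one row per character-film pair
starwars |>
  select(name, films) |>
  unnest_longer(films)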
# A tibble: 173 × 2
name films
<chr> <chr>
1 Luke Skywalker A New Hope
2 Luke Skywalker The Empire Strikes Back
3 Luke Skywalker Return of the Jedi
4 Luke Skywalker Revenge of the Sith
5 Luke Skywalker The Force Awakens
6 C-3PO A New Hope
7 C-3PO The Empire Strikes Back
8 C-3PO Return of the Jedi
9 C-3PO The Phantom Menace
10 C-3PO Attack of the Clones
# ℹ 163 more rows
However, per this Stack Overflow answer and the linked question, list columns are not really a thing in SQL, or at least are really not advised. Even when you try to copy the starwars dataset to a database you get an error.
Misc
Ranking
There are lots of different ways to rank things in R. If we just want to return the rows with the largest or smallest values we can do
pengs |> slice_max(bill_length_mm, n = 3)
# Source: SQL [3 x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 59.6 17 230 6050
2 Chinstrap Dream 58 17.8 181 3700
3 Gentoo Biscoe 55.9 17 228 5600
# ℹ 2 more variables: sex <chr>, year <int>
Like the dplyr join functions, the dense_rank and min_rank functions actually take inspiration from SQL. Where the two functions differ is in how they handle ties. In the example below, both id 2 and id 3 get assigned the same rank; where they differ is that dense_rank assigns id 4 a rank of 3 while min_rank assigns id 4 a rank of 4.
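Here is a toy table that matches that description (the values are placeholders; the important bit is the tie between ids 2 and 3), along with the mutate whose translation is shown below:

# a tiny table with a tie so dense_rank() and min_rank() can disagree
example = tibble::tibble(id = 1:4, col1 = c(10, 20, 20, 30))

con7 = src_memdb()
example = copy_to(con7, example, overwrite = TRUE)

example |>
  mutate(rank_one = dense_rank(col1)) |>
  show_query()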
<SQL>
SELECT
`example`.*,
CASE
WHEN (NOT((`col1` IS NULL))) THEN DENSE_RANK() OVER (PARTITION BY (CASE WHEN ((`col1` IS NULL)) THEN 1 ELSE 0 END) ORDER BY `col1`)
END AS `rank_one`
FROM `example`
This is deceptively a bit more complex, so let's break it down.
tbl(con7, sql("SELECTexample .*, CASE WHEN (NOT((col1 is NULL))) THEN DENSE_RANK() OVER (PARTITION BY (CASE WHEN ((col1 is NULL)) THEN 1 ELSE 0 END) ORDER BY col1)END AS rank_oneFROM example"))
So basically the PARTITION BY bit divides the data into groups before we rank them, and the CASE WHEN handles missing values by pushing the NULLs into their own partition. Then the window function applies dense rank over these partitions. This was a somewhat silly example, so let's do something a bit more realistic. Let's say we actually want to rank the penguins by average bill length and then return the penguins in the top 3.
tbl(con, sql(" SELECT ranked_pengs .*, CASE WHEN (NOT((avg_bill_length is NULL))) THEN DENSE_RANK() OVER (PARTITION BY (CASE WHEN ((avg_bill_length is NULL)) THEN 1 ELSE 0 END) ORDER BY avg_bill_length) END AS rank FROM( SELECT pengs .*, AVG(bill_length_mm) OVER () AS avg_bill_length FROM pengs) AS ranked_pengs LIMIT 3 "))
# Source: SQL [3 x 10]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
# ℹ 4 more variables: sex <chr>, year <int>, avg_bill_length <dbl>, rank <int>
We could also do this by group by inserting a GROUP BY statement before the LIMIT bit
tbl(con, sql(" SELECT ranked_pengs .*, CASE WHEN (NOT((avg_bill_length is NULL))) THEN DENSE_RANK() OVER (PARTITION BY (CASE WHEN ((avg_bill_length is NULL)) THEN 1 ELSE 0 END) ORDER BY avg_bill_length) END AS rank FROM( SELECT pengs .*, AVG(bill_length_mm) OVER () AS avg_bill_length FROM pengs) AS ranked_pengs GROUP BY species LIMIT 3 "))
# Source: SQL [3 x 10]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Chinstrap Dream 46.5 17.9 192 3500
3 Gentoo Biscoe 46.1 13.2 211 4500
# ℹ 4 more variables: sex <chr>, year <int>, avg_bill_length <dbl>, rank <int>
Distinct Values
Duplicates are a fact of life, but depending on your question or what information you are trying to show, repeated records may not be desirable. In dplyr we handle these with distinct; in SQL, kind of like with mutate, we have to let the SELECT clause handle it. If we want one row per species while keeping every column, without having to list every column in our dataset, we can do something like this
tbl(con, sql("SELECT * FROM( SELECT pengs .*, ROW_NUMBER() OVER (PARTITION BY species ORDER BY species) AS id FROM PENGS) AS small_pengs WHERE id = 1"))
# Source: SQL [3 x 9]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Chinstrap Dream 46.5 17.9 192 3500
3 Gentoo Biscoe 46.1 13.2 211 4500
# ℹ 3 more variables: sex <chr>, year <int>, id <int>
However, if we have a slightly less complex need, we can just feed DISTINCT multiple columns
tbl(con, sql("SELECT DISTINCT species, island FROM pengs"))
# Source: SQL [5 x 2]
# Database: sqlite 3.46.0 [:memory:]
species island
<chr> <chr>
1 Adelie Torgersen
2 Adelie Biscoe
3 Adelie Dream
4 Gentoo Biscoe
5 Chinstrap Dream
The End…for now
I am sure this will end up growing as I think of more things I do in R that I need to be able to do in SQL.