Since I am going on the non-academic job market, it is high time I learned SQL. I have tried lots of great resources, but I find it hard to move back and forth between my notes and the material: SQL and dplyr are just familiar enough to each other to trip me up, and lots of tutorials send you off to various online editors. This blog post will serve as my notes and hopefully as a resource for people other than myself. The general idea is that I am just going to work through R4DS and the various dplyr verbs, then move on to some more advanced SQL like window functions and whatnot.
Setup
For the majority of this post I am using the palmerpenguins dataset, not because you really need SQL for a dataset this small, but because copying over the nyc-taxi dataset is incredibly annoying for blogging purposes.
library(DBI)
library(arrow)
Attaching package: 'arrow'
The following object is masked from 'package:utils':
timestamp
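The setup is roughly this: an in-memory SQLite database with the penguins copied over as a table called pengs (the connection object con and the lazy table pengs are what everything below refers to; the exact calls here are a sketch of how I set it up).

library(dplyr)
library(dbplyr)
library(palmerpenguins)

# an in-memory SQLite database to play in
con = dbConnect(RSQLite::SQLite(), ":memory:")

# copy the penguins over and keep a lazy reference we can pipe dplyr verbs into
pengs = copy_to(con, penguins, name = "pengs")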
We are going to go back and forth between dbplyr and raw SQL to query the dataset. What impressed me throughout this process was how seamlessly dbplyr works with dplyr verbs. With the exception of some string functions, it can work as a drop-in replacement for writing SQL by hand. What really helped throughout this process was writing out my dplyr queries and then inspecting them with show_query.
pengs |> select(species) |> show_query()
<SQL>
SELECT `species`
FROM `pengs`
This gives us the SQL translation. Obviously this is a pretty simple query, but as things get more and more complex it is going to be helpful. For the most part show_query outputs the right query, but it can be a little difficult to debug because of the limitations of printing things in the R console.
Dplyr
Select
One convention in SQL which I don't really get, but is a thing, is that keywords are written in all caps. Luckily for us the SQL and dplyr versions are pretty much the same; one is just shouty. If we wanted all the columns, as we might when importing the dataset for the first time, we would just do SELECT * FROM pengs. There is not really a perfect equivalent in R except for maybe head, and even then it is not a one-to-one match.
R
head(pengs)
# Source: SQL [6 x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <chr>, year <int>
SQL
tbl(con, sql("SELECT * FROM pengs"))
# Source: SQL [?? x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 2 more variables: sex <chr>, year <int>
For one or multiple variables we use a very similar syntax. Where SQL and R differ is where we put the object we are querying from: in R we use the pipe so the data is the first argument of select, while in SQL we put the table name after the columns we are selecting, like this.
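For example, roughly:

# dplyr: the data comes first and gets piped into select()
pengs |> select(species, island)

# SQL: the columns come first, then the table we are selecting FROM
tbl(con, sql("SELECT species, island FROM pengs"))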
So what if we want to keep all but one column? In R we would do this:
pengs |> select(-species) |> head(n = 2)
# Source: SQL [2 x 7]
# Database: sqlite 3.46.0 [:memory:]
island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<chr> <dbl> <dbl> <int> <int> <chr> <int>
1 Torger… 39.1 18.7 181 3750 male 2007
2 Torger… 39.5 17.4 186 3800 fema… 2007
Unfortunately that is not really a thing in some flavors of SQL. Other flavors let you use EXCEPT, but as the results from show_query suggest, here we actually need to list every column we want to keep. The same goes if we wanted to use starts_with.
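A quick sketch of what I mean: asking for the translation of select(-species) just spells out every remaining column.

# no "all columns except" shorthand here, so dbplyr enumerates the
# seven columns that survive select(-species)
pengs |>
  select(-species) |>
  show_query()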
Filter
The first major difference syntactically between dplyr and SQL is with filter statements, aka WHERE statements in SQL. So let's say we want only the Adelie penguins.
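In dplyr that is a one-liner, roughly (this is what produces the output below):

# keep only the Adelie penguins
pengs |> filter(species == 'Adelie')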
# Source: SQL [?? x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 2 more variables: sex <chr>, year <int>
Becomes.
tbl(con, sql("SELECT * FROM pengs WHERE species = 'Adelie'"))
# Source: SQL [?? x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 2 more variables: sex <chr>, year <int>
Some flavors of SQL make you end statements with a ';'.
As dplyr users will notice, the equality condition uses = instead of ==. This is going to come up a lot. The same goes for negation operators.
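A sketch of the negation case: dplyr keeps !=, while SQL's not-equal operator is <> (SQLite also accepts !=).

# dplyr: everything except the Adelie penguins
pengs |> filter(species != 'Adelie')

# SQL: <> is the not-equal operator
tbl(con, sql("SELECT * FROM pengs WHERE species <> 'Adelie'"))

Combining conditions works the same way: dplyr's | becomes OR, so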
pengs |> filter(species == 'Chinstrap' | species == 'Adelie')
# Source: SQL [?? x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 2 more variables: sex <chr>, year <int>
becomes
tbl(con, sql("SELECT * from pengs WHERE species = 'Adelie' OR species = 'Chinstrap'"))
# Source: SQL [?? x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 2 more variables: sex <chr>, year <int>
You could easily sub in AND, but it feels a bit excessive to spell out every possible combination this way. One thing that I do all the time instead is use sets to subset my data, as sketched below.
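Roughly, that is %in% on the dplyr side and IN on the SQL side:

# dplyr: keep rows whose species is in the set
pengs |> filter(species %in% c('Chinstrap', 'Adelie'))

# SQL: the IN operator does the same job
tbl(con, sql("SELECT * FROM pengs WHERE species IN ('Chinstrap', 'Adelie')"))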
Things get less straightforward when we want to filter on an aggregate, like keeping penguins below the average body mass. Aggregation functions are something WHERE can't handle, because that's not its job. So if we did
tbl(con, "SELECT * from pengs WHERE body_mass_g < AVG(body_mass_g)")
Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
ℹ Using SQL: SELECT * FROM `SELECT * from pengs WHERE body_mass_g <
AVG(body_mass_g)` AS `q01` WHERE (0 = 1)
Caused by error:
! no such table: SELECT * from pengs WHERE body_mass_g < AVG(body_mass_g)
We get an error. If we want to use an aggregation function in a filter we have to change how we do it.
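Here is roughly the dplyr pipeline I was going for; show_query gives the translation below.

# filter to penguins lighter than average and ask dbplyr for the SQL
pengs |>
  filter(body_mass_g < mean(body_mass_g, na.rm = TRUE)) |>
  show_query()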
<SQL>
SELECT
`species`,
`island`,
`bill_length_mm`,
`bill_depth_mm`,
`flipper_length_mm`,
`body_mass_g`,
`sex`,
`year`
FROM (
SELECT `pengs`.*, AVG(`body_mass_g`) OVER () AS `col01`
FROM `pengs`
) AS `q01`
WHERE (`body_mass_g` < `col01`)
What is this OVER thing? OVER turns an aggregate into a window function. There is a more technical way to explain this, but heuristically: when we pass AVG directly to WHERE, the average has not been computed yet, so there is nothing for each row to be compared to. OVER () computes the average across the whole table and attaches it to every row, which gives WHERE something to compare against.
So if we want to filter to penguins that are less than the average body mass, we first compute the average as its own column in a subquery and then write the less-than comparison against that column, like this
tbl(con, sql("SELECT * FROM( SELECT pengs .*, AVG(body_mass_g) OVER () AS avg FROM pengs) WHERE (body_mass_g < avg)"))
It is a little clunky, but the tl;dr is that we basically have two SELECT ... FROM statements: the inner one builds the window column and the outer one filters on it. So if we wanted all penguins between the minimum and the average we could do
tbl(con, sql("SELECT * FROM( SELECT pengs .*, AVG(body_mass_g) OVER() AS avg, MIN(body_mass_g) OVER() AS min FROM pengs) WHERE body_mass_g BETWEEN min AND avg"))
As you will notice in all our examples, we have lots and lots of missing values, and dealing with them is one of the most common tasks in any data science work. Let's say we can safely ignore the missing values. In R we have a lot of options, whether we use filter or drop_na from tidyr. In SQL, missing values are represented by NULL, so we filter them out in the WHERE clause.
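On the dplyr side this is roughly:

# drop rows where sex is missing
pengs |> filter(!is.na(sex))

In raw SQL that becomes an IS NULL check: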
tbl(con, sql("SELECt * FROM pengs WHERE NOT sex IS NULL"))
The AS keyword is kind of the workhorse for the next few sections. The naming convention differs a little bit: instead of new_name = old_name we write SELECT old_name AS new_name.
tbl(con, sql("SELECT species AS kinds_of_penguins FROM pengs"))
As with lots of things, we need to be able to create our own variables. In R we do this with mutate:
pengs |> mutate(sqr_body_mass = body_mass_g^2)
# Source: SQL [?? x 9]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 3 more variables: sex <chr>, year <int>, sqr_body_mass <dbl>
In SQL the equivalent is SELECT transformation AS new_var_name: when we need a column that is not already in the dataset, we basically define it in the SELECT before we do anything else.
tbl(con, sql("SELECT pengs .*, POWER(body_mass_g,2) AS sqr_body_mass FROM pengs"))
# Source: SQL [?? x 9]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 3 more variables: sex <chr>, year <int>, sqr_body_mass <dbl>
So if we wanted to make a ratio of bill depth to bill length we would do
tbl(con, sql("SELECT pengs .*, bill_depth_mm/bill_length_mm AS ratio FROM pengs"))
# Source: SQL [?? x 9]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 3 more variables: sex <chr>, year <int>, ratio <dbl>
A very important thing we do all the time is generate indicator variables for treatment status, gender, etc. Oddly enough, if we peek at the output of show_query we see a familiar face!
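The mutate was roughly this (if_else is one of several ways to build the indicator); show_query prints the translation below.

# an indicator for male penguins; dbplyr turns if_else() into CASE WHEN
pengs |>
  mutate(male = if_else(sex == 'Male', 1, 0)) |>
  show_query()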
<SQL>
SELECT
`pengs`.*,
CASE WHEN (`sex` = 'Male') THEN 1.0 WHEN NOT (`sex` = 'Male') THEN 0.0 END AS `male`
FROM `pengs`
So to make an indicator variable we would just do
tbl(con, sql("SELECT pengs.*, CASE WHEN (sex = 'male') THEN 1.0 WHEN not (sex = 'male') THEN 0.0 END AS male FROM pengs"))
# Source: SQL [?? x 9]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 3 more variables: sex <chr>, year <int>, male <dbl>
Let’s combine our window functions with our friend case_when
tbl(con, sql("SELECT * FROM(SELECT pengs .*, AVG(body_mass_g) AS avg, MIN(body_mass_g) AS min, MAX(body_mass_g) AS max, CASE WHEN (body_mass_g = min) THEN 'This penguins is small' WHEN (body_mass_g = avg) THEN 'This is an average sized penguin' WHEN (body_mass_g = max) THEN 'this is a really big penguin' END AS note FROM pengs)"))
I will spare you the long error message, but needless to say this was wrong. If we translate what I was trying to do into dplyr we get this:
pengs |>
  mutate(note = case_when(
    body_mass_g == min(body_mass_g) ~ 'This is a small peng',
    body_mass_g == mean(body_mass_g) ~ 'Average sized peng',
    body_mass_g == max(body_mass_g) ~ 'Big sized peng',
    .default = 'Penguin is some size'
  )) |>
  show_query()
Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.
<SQL>
SELECT
`pengs`.*,
CASE
WHEN (`body_mass_g` = MIN(`body_mass_g`) OVER `win1`) THEN 'This is a small peng'
WHEN (`body_mass_g` = AVG(`body_mass_g`) OVER `win1`) THEN 'Average sized peng'
WHEN (`body_mass_g` = MAX(`body_mass_g`) OVER `win1`) THEN 'Big sized peng'
ELSE 'Penguin is some size'
END AS `note`
FROM `pengs`
WINDOW `win1` AS ()
So it looks like we need to declare the window ourselves: name it with WINDOW win1 AS () at the end of the query and compute the aggregates OVER it.
check =tbl(con, sql("SELECT pengs .*, CASE WHEN (body_mass_g >= MIN(body_mass_g) OVER win1) THEN 'this is a small penguin' WHEN (body_mass_g = AVG(body_mass_g) OVER win1) THEN 'this is an average sized penguin' WHEN (body_mass_g = MAX(body_mass_g) OVER win1) THEN 'this is a big penguin' ELSE 'This penguin is not big, small or average' END AS note FROM pengs WINDOW win1 AS ()")) |>collect()
Let's look at this a little closer to make sure it worked. We would probably want to make this a little more robust, so let's go ahead and define ranges.
tbl(con, sql("SELECT pengs .*, CASE WHEN (body_mass_g >= MIN(body_mass_g) OR body_mass_g < AVG(body_mass_g) OVER win1) THEN 'this is a small penguin' WHEN (body_mass_g >= AVG(body_mass_g) OR body_mass_g < MAX(body_mass_G) OVER win1) THEN 'this is an average sized penguin' WHEN (body_mass_g >= MAX(body_mass_g) OVER win1) THEN 'this is a big penguin' ELSE 'This penguin is not big, small or average' END AS note FROM pengs WINDOW win1 AS ()"))
# Source: SQL [1 x 9]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Chinstrap Dream 46.9 16.6 192 2700
# ℹ 3 more variables: sex <chr>, year <int>, note <chr>
Group by and summarize
As established earlier we can use SQL to summarize like this.
tbl(con, sql('SELECT AVG(bill_depth_mm) AS avg FROM pengs'))
But the practical utility of a single grand summary is somewhat limited; often we want group-specific summaries. Oddly enough I expected this to be a window-function thing, but the grouping is instead handled by a GROUP BY at the end of the query. I guess this makes sense if we are dealing with big data.
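In dplyr this is the usual group_by plus summarize; roughly this is what produces the trailing GROUP BY in the generated query:

# group-wise average body mass; the grouping shows up as GROUP BY at the end
pengs |>
  group_by(species) |>
  summarize(avg_body_mass = mean(body_mass_g, na.rm = TRUE)) |>
  show_query()

Written by hand, it is just: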
tbl(con, sql("SELECT species, AVG(body_mass_g) AS avg_body_mass FROM pengs GROUP BY species"))
So if we wanted a count of each species we would do something along these lines
tbl(con, sql("SELECT species, COUNT(species) AS total FROM pengs GROUP BY species"))
# Source: SQL [3 x 2]
# Database: sqlite 3.46.0 [:memory:]
species total
<chr> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
For multiple grouping variables we list them after GROUP BY, the same way we would list them in group_by.
tbl(con, sql("SELECT species, sex, COUNT(species) AS total FROM pengs GROUP BY species, sex"))
# Source: SQL [8 x 3]
# Database: sqlite 3.46.0 [:memory:]
species sex total
<chr> <chr> <int>
1 Adelie <NA> 6
2 Adelie female 73
3 Adelie male 73
4 Chinstrap female 34
5 Chinstrap male 34
6 Gentoo <NA> 5
7 Gentoo female 58
8 Gentoo male 61
The same would go for multiple summary functions
tbl(con, sql("SELECT species, COUNT(species) AS total, AVG(bill_depth_mm) AS avg_bill_depth, MEDIAN(bill_depth_mm) AS median_bill_depth FROM pengs GROUP BY sex"))
In the real world it is rare that all our data lives in one place. Companies keep information in lots of different tables, partly for security (if credit card numbers sat alongside everything else needed to make a purchase, a malicious actor would only have to break into one database) and partly for cost, since replacing entire data tables can get expensive. So instead of rewriting whole tables, it is more efficient to simply insert rows or join tables together as needed.
Appending Rows
To kind of mimic this we are just going to slice the penguins data roughly in half. While not entirely realistic, the general process will be similar enough.
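Roughly like this (the split point is arbitrary; pengs_top, pengs_bottom, and the connection con2 are what the queries below use):

# split the penguins roughly in half and copy each half into a fresh
# in-memory database so we have two tables to stack back together
con2 = dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con2, slice_head(penguins, n = 172), name = "pengs_top")
copy_to(con2, slice_tail(penguins, n = 172), name = "pengs_bottom")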
For whatever reason show_query is not working with this, so we are going to have to consult the interwebs. The SQL equivalent of bind_rows is UNION ALL.
tbl(con2, sql("SELECT * FROM pengs_top UNION ALL SELECT * FROM pengs_bottom"))
# Source: SQL [?? x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ more rows
# ℹ 2 more variables: sex <chr>, year <int>
One of the key things in this query is ALL, which is somewhat new to me. Basically, ALL tells SQL that we don't care about duplicates, so it adds the rows regardless. If we wanted to exclude duplicates we would drop ALL and use a plain UNION, like this
Code
tbl(con2, sql("SELECt * FROM pengs_top UNION SELECT * FROM pengs_top")) |>collect() |>nrow()
[1] 172
Code
tbl(con2,sql("SELECT * FROM pengs_top") ) |>collect() |>nrow()
[1] 172
Joins
Luckily for us the join syntax in dplyr is pretty directly taken from SQL, so let's create some dummy data to join.
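Here is the dummy data: a fake state-year panel (national_data) and a smaller table of libraries and schools (national_libraries), copied into another in-memory source I am calling con4. The values are made up; this is roughly how it is set up.

# fake state-year panel plus a smaller table to join onto it
national_data = tibble::tribble(
  ~state, ~year, ~unemployment, ~inflation, ~population,
  "GA",   2018,  5.0,           2.0,        100,
  "GA",   2019,  5.3,           1.8,        200,
  "GA",   2020,  5.2,           2.5,        300,
  "NC",   2018,  6.1,           1.8,        350,
  "NC",   2019,  5.9,           1.6,        375,
  "NC",   2020,  5.3,           1.8,        400,
  "CO",   2018,  4.7,           2.7,        200,
  "CO",   2019,  4.4,           2.6,        300,
  "CO",   2020,  5.1,           2.5,        400
)

national_libraries = tibble::tribble(
  ~state, ~year, ~libraries, ~schools,
  "NC",   2018,  200,        610,
  "NC",   2019,  210,        590,
  "NC",   2020,  220,        530,
  "CO",   2018,  230,        470,
  "CO",   2019,  240,        440,
  "CO",   2020,  270,        510
)

con4 = src_memdb()
national_data_db = copy_to(con4, national_data, overwrite = TRUE)
national_libraries_db = copy_to(con4, national_libraries, overwrite = TRUE)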
So we have some fake national-level data that we would like to join onto the libraries table. We could let the join decide the keys for us, but that will probably create some headaches later on. To avoid this, we spell out our keys and expose the underlying logic with show_query.
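The dplyr call is roughly this, using the lazy tables created above:

# spell out both keys so nothing is guessed for us
national_data_db |>
  left_join(national_libraries_db, join_by(state, year)) |>
  show_query()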
<SQL>
SELECT `national_data`.*, `libraries`, `schools`
FROM `national_data`
LEFT JOIN `national_libraries`
ON (
`national_data`.`state` = `national_libraries`.`state` AND
`national_data`.`year` = `national_libraries`.`year`
)
You will notice that join_by is shorthand for equality joins. What changes in raw SQL is that instead of left_key == right_key we have to spell out which column comes from which table using table.column notation.
db_con = con4$con
query = "SELECT * FROM national_data LEFT JOIN national_libraries ON (national_data.state = national_libraries.state AND national_data.year = national_libraries.year)"
dbGetQuery(db_con, sql(query))
state year unemployment inflation population state year libraries schools
1 GA 2018 5.0 2.0 100 <NA> NA NA NA
2 GA 2019 5.3 1.8 200 <NA> NA NA NA
3 GA 2020 5.2 2.5 300 <NA> NA NA NA
4 NC 2018 6.1 1.8 350 NC 2018 200 610
5 NC 2019 5.9 1.6 375 NC 2019 210 590
6 NC 2020 5.3 1.8 400 NC 2020 220 530
7 CO 2018 4.7 2.7 200 CO 2018 230 470
8 CO 2019 4.4 2.6 300 CO 2019 240 440
9 CO 2020 5.1 2.5 400 CO 2020 270 510
For whatever reason SQLite gets a little grumpy with this join syntax when it goes through tbl, so here I am using dbGetQuery directly.
If we wanted to do various other joins like inner and anti joins we would do a similar thing.
query ="SELECT * FROM national_data INNER JOIN national_libraries ON( national_data.state = national_libraries.state AND national_data.year = national_libraries.year )"dbGetQuery(db_con, sql(query))
state year unemployment inflation population state year libraries schools
1 CO 2018 4.7 2.7 200 CO 2018 230 470
2 CO 2019 4.4 2.6 300 CO 2019 240 440
3 CO 2020 5.1 2.5 400 CO 2020 270 510
4 NC 2018 6.1 1.8 350 NC 2018 200 610
5 NC 2019 5.9 1.6 375 NC 2019 210 590
6 NC 2020 5.3 1.8 400 NC 2020 220 530
Inequality joins
Confession: I have never really understood how inequality joins work in regular dplyr, but I am sure at some point I am going to need them, and now, when the stakes are so low, is a good time to learn. So let's just take the data from the dplyr 1.1.0 announcement, since we know what the output should be.
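Here are the toy tables from that announcement, copied into the same in-memory source as before (roughly; I reassign the names to the lazy tables so the calls below work as written):

# companies and transactions from the dplyr 1.1.0 post: Posit shares
# RStudio's id but only from 2022 onward
companies = tibble::tibble(
  id    = c("A", "B", "B"),
  since = c(1973, 2009, 2022),
  name  = c("Patagonia", "RStudio", "Posit")
)

transactions = tibble::tibble(
  company = c("A", "A", "B", "B"),
  year    = c(2019, 2020, 2021, 2023),
  revenue = c(50, 4, 10, 12)
)

# overwrite the local tibbles with lazy references to the copied tables
companies = copy_to(con4, companies, overwrite = TRUE)
transactions = copy_to(con4, transactions, overwrite = TRUE)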
The main idea of an inequality join is that we join by a key (in this case company) but only keep records from a certain date onward. The blog post kind of equates it with a filter/WHERE that happens during the join phase. So we would see something like this
transactions |> inner_join(companies, join_by(company == id, year >= since))
# Source: SQL [5 x 5]
# Database: sqlite 3.46.0 [:memory:]
company year revenue since name
<chr> <dbl> <dbl> <dbl> <chr>
1 A 2019 50 1973 Patagonia
2 A 2020 4 1973 Patagonia
3 B 2021 10 2009 RStudio
4 B 2023 12 2009 RStudio
5 B 2023 12 2022 Posit
Instead of two equality conditions we use effectively the same syntax, just swapping one = for >=.
query =" SELECT * FROM transactions INNER JOIN companies ON( transactions.company = companies.id AND transactions.year >= companies.since )"dbGetQuery(db_con, sql(query))
company year revenue id since name
1 A 2019 50 A 1973 Patagonia
2 A 2020 4 A 1973 Patagonia
3 B 2021 10 B 2009 RStudio
4 B 2023 12 B 2009 RStudio
5 B 2023 12 B 2022 Posit
Tidyr
This section is really about pivots
Pivots
In tidyverse parlance we use pivots to change the "shape" of the data. If you are unfamiliar with this idea, consider the religion and income data below. You will notice that we have a column for each income bracket, which is sometimes called "wide" data. This may be useful for some questions, but generally if we want to plot or model things it is easier if the data are "long".
Code
con5 = src_memdb()
relig = copy_to(con5, relig_income, overwrite = TRUE)
head(relig_income, n = 2)
To make our data "long" we use pivot_longer, and to make data "wide" we use pivot_wider. Each has its own quirks, but the general idea is that we have to tell these functions where to put the old names / where to get the new names, and where to put the old values / where to get the new values. So if we wanted to make our data longer we would do something like this.
long = relig_income |>
  pivot_longer(-religion, names_to = 'income_bracket', values_to = 'income')
head(long, n = 2)
# A tibble: 2 × 3
religion income_bracket income
<chr> <chr> <dbl>
1 Agnostic <$10k 27
2 Agnostic $10-20k 34
If we wanted to make this wide again all we would do is reverse this with pivot_wider
wide = long |> pivot_wider(names_from = income_bracket, values_from = income)
There is a ton of additional functionality that will not be covered here, like dealing with columns that do not uniquely identify rows.
To get a sense of how to do this in SQL, let's consult our old friend show_query.
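dbplyr also translates the tidyr pivots for lazy tables, so roughly this is what I pointed show_query at:

# ask dbplyr how it would pivot the database table
relig |>
  pivot_longer(-religion, names_to = 'income_bracket', values_to = 'income') |>
  show_query()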
We are not going to actually show the result because it is quite the query. The summary of what is happening is that SQLite doesn't have a direct equivalent of pivot_longer. Basically, you build one narrow SELECT per column and keep stacking them on top of each other with UNION ALL until you have a long table. In other flavors of SQL this process is a lot more humane, with explicit PIVOT and UNPIVOT statements, but I am not in one of those flavors. To spare myself a bit I am just going to do a few of the columns.
tbl(con5, sql(" SELECT religion, '<$10k' AS income_bracket, '<$10k' AS income FROM relig_income UNION ALL SELECT religion, '$10-20k' AS income_bracket, '$10-20k' AS income FROM relig_income UNION ALL SELECT religion, '$20-30k' AS income_bracket, '$20-30k' AS income FROM relig_income"))
Okay, again this is a little unwieldy to show. Basically what happens going the other way is that we build a big CASE WHEN per bracket, reuse the same stacking trick to get a long table, and then group the data back down to one row per religion. So let's go ahead and copy and paste some of this.
Code
query ="SELECT religion, MAX(CASE WHEN (income_bracket = '<$10k') THEN income END) AS '<$10K', MAX(CASE WHEN (income_bracket = '$10-20k') THEN income END) AS '$10-20k', MAX(CASE WHEN (income_bracket = '$20-30k') THEN income END) AS '$20-30k'FROM ( SELECT religion, '<$10k' AS income_bracket, '<$10k' AS income FROM relig_income UNION ALL SELECT religion, '$10-20k' AS income_bracket, '$10-20k' AS income FROM relig_income UNION ALL SELECT religion, '$20-30k' AS income_bracket, '$20-30k' AS income FROM relig_income) AS wide_religionGROUP BY religion"tbl(con5, sql(query))
One thing that you come across from time to time in R and Python data wrangling is list columns. These happen for a variety of reasons and are pretty harmless to deal with in R.
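The long table below comes from unnesting the films list column in dplyr::starwars, roughly:

# films is a list column; unnest_longer() gives one row per character-film pair
starwars |>
  select(name, films) |>
  unnest_longer(films)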
# A tibble: 173 × 2
name films
<chr> <chr>
1 Luke Skywalker A New Hope
2 Luke Skywalker The Empire Strikes Back
3 Luke Skywalker Return of the Jedi
4 Luke Skywalker Revenge of the Sith
5 Luke Skywalker The Force Awakens
6 C-3PO A New Hope
7 C-3PO The Empire Strikes Back
8 C-3PO Return of the Jedi
9 C-3PO The Phantom Menace
10 C-3PO Attack of the Clones
# ℹ 163 more rows
However, per this Stack Overflow answer and the linked question, list columns are not really a thing in SQL, or at least are really not advised. Even when you try to copy the starwars dataset to a database you get an error.
Misc
Ranking
There are lots of different ways to rank things in R. If we just want to return the rows with the largest or smallest values we can do
pengs |> slice_max(bill_length_mm, n = 3)
# Source: SQL [3 x 8]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 59.6 17 230 6050
2 Chinstrap Dream 58 17.8 181 3700
3 Gentoo Biscoe 55.9 17 228 5600
# ℹ 2 more variables: sex <chr>, year <int>
Like the dplyr join functions, the dense_rank and min_rank functions actually take inspiration from SQL. Where the two functions differ is in how they handle ties. In the example below, both id 2 and id 3 get assigned the same rank; where they differ is that dense_rank assigns id 4 a rank of 3 while min_rank assigns id 4 a rank of 4.
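Here is a toy table that matches that description (the values are placeholders; the important bit is the tie between ids 2 and 3), along with the mutate whose translation is shown below:

# a tiny table with a tie so dense_rank() and min_rank() can disagree
example = tibble::tibble(id = 1:4, col1 = c(10, 20, 20, 30))

con7 = src_memdb()
example = copy_to(con7, example, overwrite = TRUE)

example |>
  mutate(rank_one = dense_rank(col1)) |>
  show_query()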
<SQL>
SELECT
`example`.*,
CASE
WHEN (NOT((`col1` IS NULL))) THEN DENSE_RANK() OVER (PARTITION BY (CASE WHEN ((`col1` IS NULL)) THEN 1 ELSE 0 END) ORDER BY `col1`)
END AS `rank_one`
FROM `example`
This is deceptively a bit more complex, so let's break it down.
tbl(con7, sql("SELECTexample .*, CASE WHEN (NOT((col1 is NULL))) THEN DENSE_RANK() OVER (PARTITION BY (CASE WHEN ((col1 is NULL)) THEN 1 ELSE 0 END) ORDER BY col1)END AS rank_oneFROM example"))
So basically the PARTITION BY bit divides the data into groups before we rank them, and the CASE WHEN handles missing values by pushing the NULLs into their own partition. Then the window function applies dense rank over these partitions. This was a somewhat silly example, so let's do something a bit more realistic. Let's say we actually want to rank the penguins by average bill length and then return the penguins in the top 3.
tbl(con, sql(" SELECT ranked_pengs .*, CASE WHEN (NOT((avg_bill_length is NULL))) THEN DENSE_RANK() OVER (PARTITION BY (CASE WHEN ((avg_bill_length is NULL)) THEN 1 ELSE 0 END) ORDER BY avg_bill_length) END AS rank FROM( SELECT pengs .*, AVG(bill_length_mm) OVER () AS avg_bill_length FROM pengs) AS ranked_pengs LIMIT 3 "))
# Source: SQL [3 x 10]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
# ℹ 4 more variables: sex <chr>, year <int>, avg_bill_length <dbl>, rank <int>
We could also do this by group by inserting a GROUP BY statement before the LIMIT bit
tbl(con, sql(" SELECT ranked_pengs .*, CASE WHEN (NOT((avg_bill_length is NULL))) THEN DENSE_RANK() OVER (PARTITION BY (CASE WHEN ((avg_bill_length is NULL)) THEN 1 ELSE 0 END) ORDER BY avg_bill_length) END AS rank FROM( SELECT pengs .*, AVG(bill_length_mm) OVER () AS avg_bill_length FROM pengs) AS ranked_pengs GROUP BY species LIMIT 3 "))
# Source: SQL [3 x 10]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Chinstrap Dream 46.5 17.9 192 3500
3 Gentoo Biscoe 46.1 13.2 211 4500
# ℹ 4 more variables: sex <chr>, year <int>, avg_bill_length <dbl>, rank <int>
Distinct Values
Duplicates are a fact of life, but depending on your question or what information you are trying to show, repeated records may not be desirable. In dplyr we handle these with distinct; in SQL, kind of like with mutate, we have to let the SELECT clause handle it. If we want one row per species while keeping every column, without having to list every column in our dataset, we can do something like this
tbl(con, sql("SELECT * FROM( SELECT pengs .*, ROW_NUMBER() OVER (PARTITION BY species ORDER BY species) AS id FROM PENGS) AS small_pengs WHERE id = 1"))
# Source: SQL [3 x 9]
# Database: sqlite 3.46.0 [:memory:]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Chinstrap Dream 46.5 17.9 192 3500
3 Gentoo Biscoe 46.1 13.2 211 4500
# ℹ 3 more variables: sex <chr>, year <int>, id <int>
However, if we have a slightly less complex need, we can just feed DISTINCT multiple columns
tbl(con, sql("SELECT DISTINCT species, island FROM pengs"))
# Source: SQL [5 x 2]
# Database: sqlite 3.46.0 [:memory:]
species island
<chr> <chr>
1 Adelie Torgersen
2 Adelie Biscoe
3 Adelie Dream
4 Gentoo Biscoe
5 Chinstrap Dream
The End…for now
I am sure this will end up growing as I think of more things I do in R that I need to be able to do in SQL.