Getting Started in R

Josh Allen

Department of Political Science at Georgia State University

1/20/23

Research Data Services

Our Team

Get Ready Badges

How To Get the Badges

The Workshop

Why Use R?

Why R and RStudio?(cont)

R? RStudio? What's the Difference?

  • R is a statistical programming language
  • RStudio is a convenient interface for R (an Integrated Development Environment, IDE)
  • At its simplest:
    • R is like a car’s engine
    • RStudio is like a car’s dashboard

Setting Your Working Directory

  • Your working directory is where all your files live

  • You may know where your files are…

  • But R does not

  • If you want to use any data that does not come with a package you are going to need to tell R where it lives

Cats and Boxes

  • You can put a box inside a box.

  • You can put a cat inside a box

  • You can put a cat inside a box inside of a box

  • You cannot put a box inside a cat

  • You cannot put a cat inside a cat

Setting Your Working Directory(cont)

Seeing What Working Directory You are Using

getwd() ## The working directory where all the materials for the workshops live
[1] "/Users/josh/Dropbox/Research-Data-Services-Workshops/research-data-services-r-workshops/slides"

Setting Your Working Directory

setwd("your/working/directory/here/") ## sets the working directory on Mac
setwd("your\\working\\directory\\here") ## sets the working directory on Windows; backslashes must be doubled (forward slashes also work)
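A minimal sketch of checking, changing, and restoring the working directory. It uses tempdir(), the scratch folder R creates for each session, as a stand-in for your own folder; the object names and the file name wd_example.txt are made up for illustration.

```r
# Remember where we started so we can come back
old_wd <- getwd()

# tempdir() stands in here for "your/working/directory/here"
setwd(tempdir())

# A relative path is resolved against the working directory,
# so this file lands in the folder we just switched to
writeLines("hello", "wd_example.txt")
found_file <- file.exists("wd_example.txt")

# file.path() builds paths with forward slashes, which work on
# Windows as well as on Mac and Linux
example_path <- file.path("data", "penguins.csv")

# Clean up and go back to the original working directory
file.remove("wd_example.txt")
setwd(old_wd)
```

If found_file is TRUE, R was looking in the folder you expected.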

How To Make Your Life Easier

source: Jenny Bryan

How To Make Your Life Easier

Working Directory for My Laptop

"/Users/josh/Dropbox/Research-Data-Services-Workshops/research-data-services-r-workshops/slides" 

Working Directory of My Office Computer

"/Volumes/6TB Raid 10/Dropbox/Research-Data-Services-Workshops/research-data-services-r-workshops/slides"

R Projects

Objects

  • Everything is an object

  • Everything has a name

  • You do stuff with functions

  • Packages (i.e. libraries) are homes to pre-written functions.

    • You can also write your own functions and in some cases should.

Installing and Loading Packages

  • Console or Script: install.packages("package-i-need-to-install")
    • To install multiple packages at once, use install.packages(c("Packages", "I", "don't", "have"))
  • RStudio: Click the “Packages” tab in the bottom-right window pane. Then click “Install” and search for the packages you need.

Install and load(cont.)

Once the packages are installed, we need to load them into our R session with the library() function

# We talk to ourselves using #
library(Package) 
library(I)
library(JustInstalled)

Notice too that you don’t need quotes around the package names any more.

  • R now recognises these packages as defined objects with given names

  • Everything in R is an object and everything has a name
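One common pattern, sketched here under assumptions, is to install a package only if it is missing and then load it. The packages "stats" and "utils" ship with R and stand in for whatever you actually need; needed_pkgs, is_installed, and missing_pkgs are illustrative names.

```r
# Packages this script needs; swap in your own package names
needed_pkgs <- c("stats", "utils")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
is_installed <- vapply(needed_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed_pkgs[!is_installed]

# Install only what is missing (package names go in quotes here)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Load a package for this session (no quotes needed)
library(stats)
```

Installing happens once per computer; loading with library() happens once per R session.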

R Some Basics

Basic Maths

  • R is equipped with lots of mathematical operations
2+2 ## addition
[1] 4
4-2 ## subtraction
[1] 2
600*100 ##multiplication
[1] 60000
100/10 ##division
[1] 10
10*10/(3^4*2)-2 ## order of operations (PEMDAS)
[1] -1.382716
log(100)
[1] 4.60517
sqrt(100)
[1] 10

Basic Maths

R is also equipped with modulo operations (integer division and remainders), matrix algebra, etc

100 %/% 60 # How many whole hours in 100 minutes?
[1] 1
100 %% 60 # How many minutes are left over?
[1] 40
m <- matrix(1:8, nrow=2) # Don't worry about the <- for now 
n <- matrix(8:15, nrow=4) # this is just me creating matrices 
mat <- matrix(1:15, ncol = 5)
m %*% n # Matrix multiplication
     [,1] [,2]
[1,]  162  226
[2,]  200  280
t(mat) # transpose a matrix
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
[5,]   13   14   15

Logical Statements & Booleans

Test      Meaning                     Test        Meaning
x < y     Less than                   x %in% y    In set
x > y     Greater than                is.na(x)    Is missing
x == y    Equal to                    !is.na(x)   Is not missing
x <= y    Less than or equal to
x >= y    Greater than or equal to
x != y    Not equal to
x | y     Or
x & y     And

Booleans and Logicals in Action

1>2 
[1] FALSE
1<2
[1] TRUE
1 == 2
[1] FALSE
1 < 2 | 3 > 4 ## only one test needs to be true to return TRUE
[1] TRUE
1 < 2 & 3 > 4 ## both tests must be true to return TRUE
[1] FALSE

Logicals, Booleans, and Precedence

  • R, like most other programming languages, will evaluate our logical operators (==, >, etc.) before our booleans (|, &, etc.).
1 > 0.5 & 2
[1] TRUE
  • What’s happening here is that R is evaluating two separate “logical” statements:

  • 1 > 0.5, which is obviously TRUE.

  • 2, which is TRUE(!) because R is “helpfully” converting it with as.logical(2).

  • It is way safer to make explicit what you are doing.

  • If your code is doing something weird, it might just be because of precedence issues

1 > 0.5 & 1 > 2
[1] FALSE
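A short sketch of the same pitfall with a variable. The names x, sloppy, and explicit are made up for illustration.

```r
as.logical(2)   # any non-zero number is treated as TRUE
as.logical(0)   # zero is treated as FALSE

x <- 5

# Reads like "is x equal to 1 or 2?" but R evaluates (x == 1) | 2,
# and the bare 2 counts as TRUE
sloppy <- x == 1 | 2

# Spelling out both comparisons gives the answer we actually wanted
explicit <- x == 1 | x == 2

# Parentheses are optional here, but they make the intent obvious
in_range <- (x > 1) & (x < 10)
```

Here sloppy is TRUE even though x is neither 1 nor 2, while explicit is FALSE.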

Other Useful Tricks

Value matching using %in%

To see whether an object is contained within (i.e. matches one of) a list of items, use %in%.

4 %in% 1:10
[1] TRUE
4 %in% 5:10
[1] FALSE
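%in% also works on whole vectors, checking each element in turn. A small sketch with made-up species names:

```r
known_species <- c("Adelie", "Chinstrap", "Gentoo")
to_check <- c("Adelie", "Emperor", "Gentoo")

# One TRUE or FALSE per element of to_check
matches <- to_check %in% known_species

# Which entries were not recognised?
unknown <- to_check[!matches]

# TRUE counts as 1, so sum() counts the matches
n_matches <- sum(matches)
```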

Cool Now What?

  • While this is boring, it opens up lots of possibilities

  • We may need to set up a group of tests to do something to data.

  • We may need all this math stuff to create new variables

  • However, we need to assign things to objects to reuse them later in functions.

    • Including datasets

Everything is an Object

Assignment

  • The most popular assignment operator in R is <- which is just < followed by -
    • read aloud as “gets”
a <- 2 + 2

a * 2
[1] 8
h <- "harry potter" # note that text needs to be wrapped in quotes 
  • You can also use ->, but this is far less common and makes me uncomfortable
a^2 -> b

Assignment(cont)

  • Using = as an assignment operator also works and is the one I tend to use
    • Note: = is also used to pass arguments to functions
b = b * 2

d = b/3
  • Tbh this is a matter of taste really.
    • R added = in the early 2000s to make it easier for people coming from other object oriented programming languages
  • Just keep it consistent... or become ungovernable and use all three in one script.
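One place where = and <- really do differ is inside a function call. This is a sketch with a hypothetical helper, double_it(); the other names are also made up.

```r
# A hypothetical helper with one argument called `value`
double_it <- function(value) value * 2

# Inside a call, = only names the argument; no object `value` is created
result_equals <- double_it(value = 4)
created_by_equals <- exists("value", inherits = FALSE)

# Inside a call, <- assigns first and then passes the result along,
# so an object called `value` now exists in the workspace
result_arrow <- double_it(value <- 4)
created_by_arrow <- exists("value", inherits = FALSE)
```

Both calls return 8, but only the second leaves an object named value behind.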

Working with Objects

e = c(1,3,5,6,67,7) # creates a vector

length(e) ## How many things are in there? 
[1] 6
sum(e)/length(e) # Hand calculate mean
[1] 14.83333
mean(e) # Why make our life hard when there is a built in function?
[1] 14.83333
e = data.frame(x = 1:22,
               y = 20:41)
mean(y)
Error in mean(y): object 'y' not found

Global Environment(cont)

Error in mean(y): object 'y' not found
  • Gives us a hint about what went wrong

Fixing Our Issue

  • To do this we need to index e to get to y
mean(e$y)
[1] 30.5
  • R will look for named objects in the environment

  • If the interpreter can’t find y or any other object it will give up because it does not think it exists

  • You need to tell the interpreter what to look for inside of the object
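There are a few equivalent ways to tell R to look inside e for y. A quick sketch using the same data frame:

```r
e <- data.frame(x = 1:22,
                y = 20:41)

mean_dollar  <- mean(e$y)        # by name with $
mean_bracket <- mean(e[["y"]])   # by name with [[ ]]
mean_with    <- with(e, mean(y)) # with() lets R look for y inside e
```

All three return 30.5; which one to use is mostly a matter of taste.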

What are Objects?

  • Objects are what we work with in R
 [1] "is.array"                "is.atomic"              
 [3] "is.call"                 "is.character"           
 [5] "is.complex"              "is.data.frame"          
 [7] "is.double"               "is.element"             
 [9] "is.environment"          "is.expression"          
[11] "is.factor"               "is.finite"              
[13] "is.function"             "is.infinite"            
[15] "is.integer"              "is.language"            
[17] "is.list"                 "is.loaded"              
[19] "is.logical"              "is.matrix"              
[21] "is.na"                   "is.na.data.frame"       
[23] "is.na.numeric_version"   "is.na.POSIXlt"          
[25] "is.na<-"                 "is.na<-.default"        
[27] "is.na<-.factor"          "is.na<-.numeric_version"
[29] "is.name"                 "is.nan"                 
[31] "is.null"                 "is.numeric"             
[33] "is.numeric_version"      "is.numeric.Date"        
[35] "is.numeric.difftime"     "is.numeric.POSIXt"      
[37] "is.object"               "is.ordered"             
[39] "is.package_version"      "is.pairlist"            
[41] "is.primitive"            "is.qr"                  
[43] "is.R"                    "is.raw"                 
[45] "is.recursive"            "is.single"              
[47] "is.symbol"               "is.table"               
[49] "is.unsorted"             "is.vector"              
[51] "isa"                     "isatty"                 
[53] "isBaseNamespace"         "isdebugged"             
[55] "isFALSE"                 "isIncomplete"           
[57] "isNamespace"             "isNamespaceLoaded"      
[59] "isOpen"                  "isRestart"              
[61] "isS4"                    "isSeekable"             
[63] "isSymmetric"             "isSymmetric.matrix"     
[65] "isTRUE"                 

Vectors

  • Come in two flavors

  • Atomic: all the stuff must be the same type

  • Lists: stuff can be different types

my_vec <- c(1:10)
is.vector(my_vec)
[1] TRUE
my_list <- list(a = c(1:4), b = "Hello World", c = data.frame(x = 1:10, y = 1:10))
is.vector(my_list)
[1] TRUE

Atomic Vectors

  • Come in a variety of flavors

  • Numeric: Can contain whole numbers or decimals

  • Logicals: Can only take two values TRUE or FALSE

  • Factors: Can only contain predefined values. Used to store categorical data

    • Ordered factors are a special kind of factor where the order of the levels matters.
  • Characters: Holds character strings

    • Before R 4.0.0, base R functions such as data.frame() and read.csv() converted characters to factors by default. That is bad because it will choose the levels for you
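A small sketch of these flavors in action. All object names and values are invented for illustration.

```r
numbers <- c(1, 2.5, 3)
flags   <- c(TRUE, FALSE, TRUE)
words   <- c("penguin", "island")

# Setting levels ourselves means we choose the order, not R
sizes <- factor(c("small", "large", "small"), levels = c("small", "large"))

typeof(numbers)  # "double"
typeof(flags)    # "logical"
typeof(words)    # "character"
class(sizes)     # "factor"
levels(sizes)    # "small" "large"

# Atomic vectors hold one type only, so mixing types forces a conversion
mixed <- c(1, "a", TRUE)
typeof(mixed)    # "character"
```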

Lists

  • Lists are everywhere in R
data_frame <- data.frame(a = rnorm(3),
                         b = rnorm(3))
typeof(data_frame)
[1] "list"
dats_wrong <- data.frame(a = 1:3,
                         b = 1:4)
Error in data.frame(a = 1:3, b = 1:4): arguments imply differing number of rows: 3, 4
example_mod <- lm(body_mass_g ~ bill_depth_mm, data = penguins)
typeof(example_mod)
[1] "list"
length(example_mod$residuals);length(example_mod$coefficients)
[1] 342
[1] 2
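To poke around inside a model object yourself, here is a sketch that uses the built-in mtcars data as a stand-in for penguins, so it runs without loading anything. car_mod is an illustrative name.

```r
# Fit a simple model on a dataset that ships with R
car_mod <- lm(mpg ~ wt, data = mtcars)

# The model is a list, so the usual list tools apply
typeof(car_mod)
names(car_mod)            # what is stored inside

# Pull out pieces by name with $
car_mod$coefficients
n_resid <- length(car_mod$residuals)
n_coef  <- length(car_mod$coefficients)
```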

A Quick Aside on Naming Stuff

Things we can never name stuff

We can’t use any of these because they are reserved words in R

if 
else 
while 
function 
for
TRUE 
FALSE 
NULL 
Inf 
NaN 
NA 

A Quick Aside on Naming Stuff(cont)

Semi-reserved words

For simple things like assigning c = 4 and then doing d = c(1,2,3,4), R will be able to distinguish between the object c that holds the value 4 and the function c() that concatenates values, which is way more important in R.

However, it is generally a good idea, unless you know what you are doing, to avoid giving objects the same names as functions in R because R will get confused.

my_cool_fun <- function(x) {
  x <- x * 5
  return(x)
}

datas <- c(1:10)

my_cool_fun(datas)
 [1]  5 10 15 20 25 30 35 40 45 50
my_cool_fun[1]
Error in my_cool_fun[1]: object of type 'closure' is not subsettable
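A sketch of both situations: a name clash R can sort out, and one it cannot. The replacement sum() below is deliberately silly and purely illustrative.

```r
# A value named c does not break the function c():
# when a name is called like a function, R looks for a function
c <- 4
d <- c(1, 2, 3, c)
rm(c)

# Overwriting a function with another function is a different story
sum <- function(...) "not a number anymore"
masked_result <- sum(1, 2)

# Removing our version makes R find the built-in sum() again
rm(sum)
restored_result <- sum(1, 2)
```

Here d ends up as 1 2 3 4, masked_result is a character string, and restored_result is 3 again.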

How and What to Name Objects

The best practice is to use concise, descriptive names

When loading in data, I typically use raw_my_dataset_name, and after all of my data cleaning I use clean_my_dataset_name

  • Object names must start with a letter, but can contain letters, numbers, _, or .
    • snake_case_like_this_is_what_I_use
    • somePeopleUseCamelCase
    • some_People.are_Do_not.like_Convention

The Data We are Working With

artwork by @allison_horst

Importing Data

  • You have the option of pointing and clicking via Import Dataset

  • I would recommend importing data via code

    • You don’t have to remember what you named the object originally
    • Saves future you time
  • This is a common error you will get

penguins = read.csv("peguins.csv")
Error in file(file, "rt"): cannot open the connection
penguins = read.csv("penguins.csv")
Error in file(file, "rt"): cannot open the connection
  • This happens most often when
    • the file name is spelled wrong
    • the file is in a subdirectory or your working directory is not set correctly
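Before calling read.csv() it can help to ask R whether it can see the file at all. This sketch writes a tiny made-up CSV to tempdir() so it is self-contained; csv_path and example_penguins are illustrative names and the two rows are invented.

```r
# Create a small example file in R's scratch folder
csv_path <- file.path(tempdir(), "example_penguins.csv")
write.csv(data.frame(species = c("Adelie", "Gentoo"),
                     body_mass_g = c(3750, 5000)),
          csv_path, row.names = FALSE)

# Can R see the file? FALSE means a typo or the wrong folder
file_found <- file.exists(csv_path)

# List the CSV files R can see in that folder
csv_files <- list.files(tempdir(), pattern = "\\.csv$")

# Read the file and assign it so we can reuse it
example_penguins <- read.csv(csv_path)
```

With your own data, file.exists("penguins.csv") and list.files() will tell you quickly whether the name or the working directory is the problem.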

Your Turn

  • Create a vector in R named my_vec with “Game of Thrones” in it.

  • Create a vector in R named my_second_vec with 1:100 in it

  • Read in the data included on the website using read.csv

    • What happens when you do not assign the dataset?
    • If you are on a Windows machine right click on the zip file and then click extract all
  • Assign the penguins dataset to an object named penguins

  • Use View, head, and tail to inspect the dataset

  • Using install.packages() install ggplot2

Our Data

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007

Indexing []

  • We can use column position to index objects.

  • For a data frame like this, there are two slots inside the brackets: rows and columns.

  • object_name[row number, column number]

  • We can also subset our data by column position using : or c(column 1, column 2)

penguins[1,1]
species
Adelie
penguins[1,1:2]
species island
Adelie Torgersen
penguins[1,c(1,4)]
species bill_depth_mm
Adelie 18.7

Negative Indexing

  • We can also exclude various elements using - and/or tests that I showed you earlier
penguins[,-1]
island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Torgersen 39.1 18.7 181 3750 male 2007
Torgersen 39.5 17.4 186 3800 female 2007
Torgersen 40.3 18.0 195 3250 female 2007
Torgersen NA NA NA NA NA 2007
Torgersen 36.7 19.3 193 3450 female 2007
Torgersen 39.3 20.6 190 3650 male 2007

Negative Indexing(cont)

  • We can combine - with : or c() to exclude several columns at once
penguins[,-(1:4)]
flipper_length_mm body_mass_g sex year
181 3750 male 2007
186 3800 female 2007
195 3250 female 2007
NA NA NA 2007
193 3450 female 2007
190 3650 male 2007
penguins[,-c(2,3,5,8)]
species bill_depth_mm body_mass_g sex
Adelie 18.7 3750 male
Adelie 17.4 3800 female
Adelie 18.0 3250 female
Adelie NA NA NA
Adelie 19.3 3450 female
Adelie 20.6 3650 male

Indexing [] (cont)

  • We can also do the same thing with lists.

  • We can tell R which element of a list we want using a combo of [] and [[]]

my_list = list(a = 100:110, b = "Learning R was the best of times and the worst of times",
               c = data.frame(x = 1:3, y = 4:6))
my_list[[1]][2] ## get the first item in the list and the second element of that item
[1] 101
my_list[2]
$b
[1] "Learning R was the best of times and the worst of times"
my_list[[3]][[1]]
[1] 1 2 3

[] vs [[]]
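A minimal sketch of the difference, using a list like the one above with a shortened string: [] gives back a smaller list, while [[]] gives back the thing stored inside. The names single and double are illustrative.

```r
my_list <- list(a = 100:110,
                b = "some text",
                c = data.frame(x = 1:3, y = 4:6))

single <- my_list[1]     # still a list, now with one element
double <- my_list[[1]]   # the integer vector itself

class(single)   # "list"
class(double)   # "integer"

# Math works on the extracted vector
total <- sum(double)
```

In the cats-and-boxes picture, [] hands you a box and [[]] hands you what is inside it.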

Subsetting By Tests

penguins[penguins["sex"] == "female", c("species", "sex")]
species sex
Adelie female
Adelie female
NA NA
Adelie female
Adelie female
NA NA
NA NA
NA NA
NA NA
Adelie female
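The NA rows in the output above appear because comparing a missing value with "female" gives NA rather than FALSE, and R returns a row of NAs for each NA in the test. A sketch with a made-up three-row data frame, toy, showing two ways to drop them:

```r
toy <- data.frame(species = c("Adelie", "Adelie", "Gentoo"),
                  sex     = c("female", NA, "male"))

# The test itself contains an NA
toy$sex == "female"

# Subsetting with it keeps the match plus one all-NA row
with_na <- toy[toy$sex == "female", ]

# which() keeps only the positions where the test is TRUE
using_which <- toy[which(toy$sex == "female"), ]

# Or rule out missing values explicitly
using_is_na <- toy[!is.na(toy$sex) & toy$sex == "female", ]
```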

$ Indexing

A really useful way of indexing in R is referencing stuff by name rather than position. The way we do this is through the $

my_list$a
 [1] 100 101 102 103 104 105 106 107 108 109 110
my_list$b
[1] "Learning R was the best of times and the worst of times"
my_list$c
  x y
1 1 4
2 2 5
3 3 6

Indexing(cont)

my_list[[3]][[2]] ## these are just returning the same thing 
[1] 4 5 6
my_list$c$y
[1] 4 5 6

$ in action

This will just subset things

penguins[penguins$species == "Gentoo", c("species", "island", "bill_length_mm")] 
species island bill_length_mm
Gentoo Biscoe 46.1
Gentoo Biscoe 50.0
Gentoo Biscoe 48.7
Gentoo Biscoe 50.0
Gentoo Biscoe 47.6
Gentoo Biscoe 46.5
Gentoo Biscoe 45.4
Gentoo Biscoe 46.7
Gentoo Biscoe 43.3
Gentoo Biscoe 46.8

$ in action(cont)

summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 
mean(penguins$bill_depth_mm)
[1] NA

Uh oh, what happened?

Finding Help

  • Asking for help in R is easy. The most common ways are help(thingineedhelpwith) and ?thingineedhelpwith
?mean
  • ?thingineedhelpwith is probably the most common because it requires less typing.

Fixing our issue

mean(penguins$bill_depth_mm, na.rm =TRUE)
[1] 17.15117
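The same pattern on a tiny made-up vector, bill_depth, including a quick way to count the missing values before deciding what to do about them:

```r
bill_depth <- c(18.7, 17.4, NA, 19.3)

# One missing value is enough to make the mean missing
plain_mean <- mean(bill_depth)

# How many values are missing?
n_missing <- sum(is.na(bill_depth))

# na.rm = TRUE drops the missing values before calculating
clean_mean <- mean(bill_depth, na.rm = TRUE)

# Many other summary functions take the same argument
clean_max <- max(bill_depth, na.rm = TRUE)
```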
  • The quality of documentation fluctuates wildly because R is an open source language

  • If in doubt

Your Turn

  • Find the minimum value of bill_length_mm

  • Find the maximum value of body_mass_g

  • Subset the penguins data any way you want using column position or $

  • Assign each of them to an object

  • Create a vector from 1:10, then index that vector using [] to return 2 and 4

Some additional useful stuff

  • Sometimes we want summary statistics per group

    • What kind of penguins live where
    • Are there any interesting patterns by group, etc.
  • Fortunately R comes with some handy functions to use

  • table counts each factor level

  • tapply will let you group stuff by a factor and get some useful balance statistics

Table

table(penguins$sex)

female   male 
   165    168 
table(penguins$sex, useNA = "ifany")

female   male   <NA> 
   165    168     11 

tapply and calculating descriptive statistics by groups

tapply(penguins$species,penguins$sex, table, useNA = "ifany")
$female

   Adelie Chinstrap    Gentoo 
       73        34        58 

$male

   Adelie Chinstrap    Gentoo 
       73        34        61 
tapply(penguins$bill_depth_mm, penguins$species, mean, na.rm = TRUE)
   Adelie Chinstrap    Gentoo 
 18.34636  18.42059  14.98211 
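Another base R option for grouped summaries is aggregate(), which returns a data frame instead of a named vector. This sketch uses the built-in iris data as a stand-in for penguins so that it runs on its own; the object names are illustrative.

```r
# Mean sepal length per species, two ways
means_tapply <- tapply(iris$Sepal.Length, iris$Species, mean)

# The formula reads as "Sepal.Length grouped by Species"
means_aggregate <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

means_tapply      # a named vector
means_aggregate   # a data frame with one row per species
```

The data frame from aggregate() is often easier to pass along to later steps.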

Plotting

plot(penguins$bill_length_mm,
     penguins$body_mass_g,
     xlab = "Bill Length (mm)",
     ylab = "Body Mass (g)")

Plotting(cont)

hist(penguins$bill_length_mm,
     xlim = c(30, 60))

Making New Things

  • To foreshadow our next workshop: often we need to do things with our data
    • Like deal with all those pesky missing values
    • Create new variables
    • subset our data(kind of like we have been doing)
    • recode our variables
  • To add new variables we can use what we know
penguins$range_body_mass = max(penguins$body_mass_g, na.rm = TRUE) - min(penguins$body_mass_g, na.rm = TRUE)

penguins$chinstrap[penguins$species == "Adelie" | penguins$species == "Gentoo"] <- "Not Chinstrap"

penguins$chinstrap[penguins$species == "Chinstrap"] <- "Chinstrap"

penguins[,c("species", "range_body_mass", "chinstrap")]
# A tibble: 344 × 3
   species range_body_mass chinstrap    
   <fct>             <int> <chr>        
 1 Adelie             3600 Not Chinstrap
 2 Adelie             3600 Not Chinstrap
 3 Adelie             3600 Not Chinstrap
 4 Adelie             3600 Not Chinstrap
 5 Adelie             3600 Not Chinstrap
 6 Adelie             3600 Not Chinstrap
 7 Adelie             3600 Not Chinstrap
 8 Adelie             3600 Not Chinstrap
 9 Adelie             3600 Not Chinstrap
10 Adelie             3600 Not Chinstrap
# … with 334 more rows
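The two-step recode above can also be written in one line with ifelse(). A sketch on a made-up three-row data frame, toy:

```r
toy <- data.frame(species = c("Adelie", "Chinstrap", "Gentoo"))

# ifelse(test, value if TRUE, value if FALSE), applied row by row
toy$chinstrap <- ifelse(toy$species == "Chinstrap",
                        "Chinstrap",
                        "Not Chinstrap")

# Check the recode against the original column
table(toy$species, toy$chinstrap)
```

Cross-tabulating the old and new columns is a cheap way to confirm the recode did what you meant.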

Cleaning up after yourself

  • rm(objectname) will remove the objects you created

  • rm(list=ls()) will remove all the objects you created

  • You can sometimes unload packages with detach(package:packageyouwanttoremove)

    • This can be iffy for a variety of reasons
    • Some packages automatically load another package or depend on another.
  • However, restarting your R session is generally best practice because it will do both: clear your objects and unload your packages
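A quick sketch of removing a single object and checking that it is gone. scratch_value is a made-up name.

```r
scratch_value <- 42

# ls() lists the objects you have created; exists() checks for one by name
before_rm <- exists("scratch_value")

rm(scratch_value)

after_rm <- exists("scratch_value")
```

Here before_rm is TRUE and after_rm is FALSE.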

Getting Good at R

Tell Us How We Did

https://gsu.qualtrics.com/jfe/form/SV_9nucJR3soZ9lkqO