Department of Political Science at Georgia State University
10/7/22
Research Data Services
Our Team
Get Ready Badges
How To Get the Badges
Packages You Will Need
install.packages("pacman")pacman::p_load("sandwich", # standard error fixes"lmtest", # diagnostic tests"marginaleffects", # getting marginal effects"emmeans", # getting marginal effect"modelsummary", # making tables "broom", # pull diagnostic statistics"caret", # lots of machine learning models"tidyverse", # ggplot and friends "fixest", # fixed effects "kableExtra", # extenstions for making latex tables"plm", # panel data models "ggfortify", install =TRUE)set.seed(1994) # some of the models we will # data we are usingpenguins = palmerpenguins::penguins
Exploratory Data Analysis in R
Describing Variables
This depends on what kind of variable it is i.e. continuous, categorical etc
It also depends on what story you need to tell
Is this confounder a big deal?
Do we see anticipation of treatment?
Are there any outliers?
etc?
Remember R is just a toolbox.
First Cut
summary(penguins)
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
studyName Sample Number Species Region
Length:344 Min. : 1.00 Length:344 Length:344
Class :character 1st Qu.: 29.00 Class :character Class :character
Mode :character Median : 58.00 Mode :character Mode :character
Mean : 63.15
3rd Qu.: 95.25
Max. :152.00
Island Stage Individual ID Clutch Completion
Length:344 Length:344 Length:344 Length:344
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm)
Min. :2007-11-09 Min. :32.10 Min. :13.10 Min. :172.0
1st Qu.:2007-11-28 1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0
Median :2008-11-09 Median :44.45 Median :17.30 Median :197.0
Mean :2008-11-27 Mean :43.92 Mean :17.15 Mean :200.9
3rd Qu.:2009-11-16 3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0
Max. :2009-12-01 Max. :59.60 Max. :21.50 Max. :231.0
NA's :2 NA's :2 NA's :2
Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
Min. :2700 Length:344 Min. : 7.632 Min. :-27.02
1st Qu.:3550 Class :character 1st Qu.: 8.300 1st Qu.:-26.32
Median :4050 Mode :character Median : 8.652 Median :-25.83
Mean :4202 Mean : 8.733 Mean :-25.69
3rd Qu.:4750 3rd Qu.: 9.172 3rd Qu.:-25.06
Max. :6300 Max. :10.025 Max. :-23.79
NA's :2 NA's :14 NA's :13
Comments
Length:344
Class :character
Mode :character
penguins_t_test =na.omit(penguins) # can also use na.action in t.testpenguins_t_test$gentoo =ifelse(penguins_t_test$species =="Gentoo", TRUE, FALSE)t.test(penguins_t_test$body_mass_g, penguins_t_test$bill_length_mm,var.equal =TRUE)# this is just to show you # some of the options
Two Sample t-test
data: penguins_t_test$body_mass_g and penguins_t_test$bill_length_mm
t = 94.344, df = 664, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4076.420 4249.709
sample estimates:
mean of x mean of y
4207.05706 43.99279
Welch Two Sample t-test
data: flipper_length_mm by gentoo
t = -32.47, df = 263.2, p-value < 2.2e-16
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
-26.84983 -23.77964
sample estimates:
mean in group FALSE mean in group TRUE
191.9206 217.2353
Correlations
Base
## this lets you do it by class penguins[, sapply(penguins, is.numeric)] |>na.omit() |>cor() # alternatively you could just specify various corr options
Call:
lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm +
I(bill_depth_mm^2) + species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-830.17 -200.27 -20.66 213.11 1049.73
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3215.8543 504.6759 -6.372 6.14e-10 ***
bill_length_mm 43.3825 7.1601 6.059 3.67e-09 ***
flipper_length_mm 20.9168 3.1119 6.721 7.73e-11 ***
I(bill_depth_mm^2) 3.7284 0.5316 7.014 1.28e-11 ***
speciesChinstrap -535.4478 82.0922 -6.523 2.54e-10 ***
speciesGentoo 847.6827 136.1247 6.227 1.42e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 318.1 on 336 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.8449, Adjusted R-squared: 0.8426
F-statistic: 366.2 on 5 and 336 DF, p-value: < 2.2e-16
Your Turn
Estimate a model that looks something like this
lm(flipper_length_mm ~ body_mass_g + other_vars)
Add any other variables
what happens if you do this in the equation
lm(flipper_length_mm ~ body_mass_g +log(somevar))
Bonus: change the factor levels of species so Gentoo is the reference group
hint: use fct_relevel or relevel
05:00
Getting Diagnostic Statistics
Base R
peng_adjust =lm(body_mass_g ~ bill_length_mm + flipper_length_mm + species,data = peng_sans_miss_base)## remember you need to open the train car door or it will return a listy thingpeng_sans_miss_base$.fitted_vals_brack = peng_adjust[[5]] peng_sans_miss_base$.fitted_vals_dollar = peng_adjust$fitted.valuespeng_sans_miss_base$.predicted_vals =predict(peng_adjust, interval ="prediction")peng_sans_miss_base$.residuals_vals = peng_adjust$residualspeng_sans_miss_base$.studentized_resids =rstudent(peng_adjust)peng_sans_miss_base$.cooks_distance =cooks.distance(peng_adjust)
The constitutive terms will appear automatically with *
pengs_interact_stand =lm(body_mass_g ~ bill_length_mm * sex + species + flipper_length_mm,data = peng_sans_miss_tidy)pengs_interact_col =lm(body_mass_g ~ bill_length_mm * sex + sex + species + bill_length_mm + flipper_length_mm ,data = peng_sans_miss_tidy)summary(pengs_interact_stand)
Call:
lm(formula = body_mass_g ~ bill_length_mm * sex + species + flipper_length_mm,
data = peng_sans_miss_tidy)
Residuals:
Min 1Q Median 3Q Max
-722.25 -189.89 -5.58 188.97 897.79
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1279.184 601.073 -2.128 0.034073 *
bill_length_mm 28.695 7.980 3.596 0.000374 ***
sexmale 1004.946 279.166 3.600 0.000368 ***
speciesChinstrap -302.268 81.331 -3.716 0.000238 ***
speciesGentoo 670.469 95.795 6.999 1.47e-11 ***
flipper_length_mm 19.052 2.954 6.449 4.06e-10 ***
bill_length_mm:sexmale -12.525 6.403 -1.956 0.051324 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 290.7 on 326 degrees of freedom
Multiple R-squared: 0.872, Adjusted R-squared: 0.8697
F-statistic: 370.2 on 6 and 326 DF, p-value: < 2.2e-16
Getting Marginal Effects for Interactions
library(marginaleffects)pengs_interact_effect =lm(body_mass_g ~ bill_length_mm + flipper_length_mm * sex + species, data = peng_sans_miss_tidy)plot_comparisons( pengs_interact_effect,condition ="flipper_length_mm",variables ="sex") +labs(x ="Bill Length(mm)", y ="Unit Level Conditional Estimates") +theme_minimal()
Interflex
library(interflex)peng_interflex =as.data.frame(peng_sans_miss_tidy)interflex(Y ="body_mass_g",D ="sex",X ="flipper_length_mm",Z =c("bill_length_mm", "species"),estimator ="kernel",data = peng_interflex,theme.bw =TRUE,parallel =FALSE) # recommend that you set this to true
Tables with multiple models
modelsummary(list(peng_naive, peng_adjust, pengs_interact_stand),stars =TRUE,gof_omit =".*", ## omitting all goodness of fit for spacecoef_map =c("bill_length_mm"="Bill Length(mm)","flipper_length_mm"="Flipper Length(mm)","bill_length_mm:sexmale"="Bill Length(mm) x Male Penguin"),vcov ="robust",note ="Robust Standard errors in Parenthsis")
(1)
(2)
(3)
Bill Length(mm)
87.415***
60.117***
28.695***
(6.898)
(6.519)
(6.817)
Flipper Length(mm)
27.544***
19.052***
(3.095)
(2.910)
Bill Length(mm) x Male Penguin
−12.525*
(6.207)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Robust Standard errors in Parenthsis
Fancy Tables
Often times we have multiple dv’s
A popular strategy in Pols and Econ is to make a table with two panels
Where outcome 1 is in one panel and outcome two is in the bottom panel
gm <-c("r.squared", "nobs", "rmse")panels <-list(list(lm(body_mass_g ~ bill_length_mm + species, data = peng_sans_miss_tidy),lm(body_mass_g ~ bill_length_mm, data = peng_sans_miss_tidy) ),list(lm(flipper_length_mm ~ species + bill_length_mm, data = peng_sans_miss_tidy),lm(flipper_length_mm ~ species + bill_length_mm + bill_depth_mm, data = peng_sans_miss_tidy) ))modelsummary( panels,shape ="rbind",gof_map = gm)
Output
(1)
(2)
Panel A
(Intercept)
200.453
388.845
(271.646)
(289.817)
bill_length_mm
90.298
86.792
(6.951)
(6.538)
speciesChinstrap
-876.942
(88.744)
speciesGentoo
596.702
(76.429)
R2
0.785
0.347
RMSE
372.93
649.48
Panel B
(Intercept)
147.563
127.187
(4.223)
(5.250)
speciesChinstrap
-5.247
-1.519
(1.380)
(1.450)
speciesGentoo
17.552
27.394
(1.188)
(1.987)
bill_length_mm
1.096
0.709
(0.108)
(0.121)
bill_depth_mm
1.929
(0.320)
R2
0.828
0.845
RMSE
5.80
5.50
Num.Obs.
333
333
Fixed Effects
Mega popular estimation strategy in Econ and Political Science
library(plm)library(vdemdata)vdem_feols =feols(v2x_polyarchy ~ v2x_gender + v2x_corr | year + country_id, data = vdem)vdem_stand =lm(v2x_polyarchy ~ v2x_gender + v2x_corr +factor(year) +factor(country_id), data = vdem)vdem_plm =plm(v2x_polyarchy ~ v2x_gender + v2x_corr +factor(year), data = vdem,index =c("country_id"),model ="within")
Performance
Vdem has 202 country ids and 233 years
Lets time them to see
# A tibble: 3 × 2
expr `Mean Time in Seconds`
<fct> <dbl>
1 vdem_feols 0.00182
2 vdem_stand 0.479
3 vdem_plm 0.367
Your Turn
Create and indicator variable for the chinstrap penguin species
Add an interaction term with bill depth in a regression
create a table with modelsummary with all your models
05:00
Generating Predictions
set.seed(1994)penguins_prediction =mutate(penguins, id =row_number()) |>drop_na()peng_train = penguins_prediction |>sample_frac(0.7) # sample 70% of the dataset peng_test =anti_join(penguins_prediction, peng_train, by ="id")peng_test$data_set ="Test"# add an indicator for what data set we havepenguins_training_model =lm(body_mass_g ~ bill_length_mm, data = peng_train)penguins_train_plot =augment(penguins_training_model, interval ="prediction") |>select(.fitted, .lower, .upper, body_mass_g, bill_length_mm) |>mutate(data_set ="train")# alterntively this works# predict(penguins_train_plot, newdata = peng_test)peng_predict_test =augment(penguins_training_model,interval ="prediction",newdata = peng_test) |>select(.fitted, .lower, .upper, body_mass_g, bill_length_mm, data_set) |>bind_rows(penguins_train_plot)