Department of Political Science at Georgia State University

10/7/22

Research Data Services

Our Team

Get Ready Badges

How To Get the Badges

Packages You Will Need

install.packages("pacman")pacman::p_load("sandwich", # standard error fixes"lmtest", # diagnostic tests"marginaleffects", # getting marginal effects"emmeans", # getting marginal effect"modelsummary", # making tables "broom", # pull diagnostic statistics"caret", # lots of machine learning models"tidyverse", # ggplot and friends "fixest", # fixed effects "kableExtra", # extenstions for making latex tables"plm", # panel data models "ggfortify", install =TRUE)set.seed(1994) # some of the models we will # data we are usingpenguins = palmerpenguins::penguins

Exploratory Data Analysis in R

Describing Variables

This depends on what kind of variable it is i.e. continuous, categorical etc

It also depends on what story you need to tell

Is this confounder a big deal?

Do we see anticipation of treatment?

Are there any outliers?

etc?

Remember R is just a toolbox.

First Cut

summary(penguins)

species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2

studyName Sample Number Species Region
Length:344 Min. : 1.00 Length:344 Length:344
Class :character 1st Qu.: 29.00 Class :character Class :character
Mode :character Median : 58.00 Mode :character Mode :character
Mean : 63.15
3rd Qu.: 95.25
Max. :152.00
Island Stage Individual ID Clutch Completion
Length:344 Length:344 Length:344 Length:344
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm)
Min. :2007-11-09 Min. :32.10 Min. :13.10 Min. :172.0
1st Qu.:2007-11-28 1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0
Median :2008-11-09 Median :44.45 Median :17.30 Median :197.0
Mean :2008-11-27 Mean :43.92 Mean :17.15 Mean :200.9
3rd Qu.:2009-11-16 3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0
Max. :2009-12-01 Max. :59.60 Max. :21.50 Max. :231.0
NA's :2 NA's :2 NA's :2
Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
Min. :2700 Length:344 Min. : 7.632 Min. :-27.02
1st Qu.:3550 Class :character 1st Qu.: 8.300 1st Qu.:-26.32
Median :4050 Mode :character Median : 8.652 Median :-25.83
Mean :4202 Mean : 8.733 Mean :-25.69
3rd Qu.:4750 3rd Qu.: 9.172 3rd Qu.:-25.06
Max. :6300 Max. :10.025 Max. :-23.79
NA's :2 NA's :14 NA's :13
Comments
Length:344
Class :character
Mode :character

penguins_t_test =na.omit(penguins) # can also use na.action in t.testpenguins_t_test$gentoo =ifelse(penguins_t_test$species =="Gentoo", TRUE, FALSE)t.test(penguins_t_test$body_mass_g, penguins_t_test$bill_length_mm,var.equal =TRUE)# this is just to show you # some of the options

Two Sample t-test
data: penguins_t_test$body_mass_g and penguins_t_test$bill_length_mm
t = 94.344, df = 664, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4076.420 4249.709
sample estimates:
mean of x mean of y
4207.05706 43.99279

Welch Two Sample t-test
data: flipper_length_mm by gentoo
t = -32.47, df = 263.2, p-value < 2.2e-16
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
-26.84983 -23.77964
sample estimates:
mean in group FALSE mean in group TRUE
191.9206 217.2353

Correlations

Base

## this lets you do it by class penguins[, sapply(penguins, is.numeric)] |>na.omit() |>cor() # alternatively you could just specify various corr options

Call:
lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm +
I(bill_depth_mm^2) + species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-830.17 -200.27 -20.66 213.11 1049.73
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3215.8543 504.6759 -6.372 6.14e-10 ***
bill_length_mm 43.3825 7.1601 6.059 3.67e-09 ***
flipper_length_mm 20.9168 3.1119 6.721 7.73e-11 ***
I(bill_depth_mm^2) 3.7284 0.5316 7.014 1.28e-11 ***
speciesChinstrap -535.4478 82.0922 -6.523 2.54e-10 ***
speciesGentoo 847.6827 136.1247 6.227 1.42e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 318.1 on 336 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.8449, Adjusted R-squared: 0.8426
F-statistic: 366.2 on 5 and 336 DF, p-value: < 2.2e-16

Your Turn

Estimate a model that looks something like this

lm(flipper_length_mm ~ body_mass_g + other_vars)

Add any other variables

what happens if you do this in the equation

lm(flipper_length_mm ~ body_mass_g +log(somevar))

Bonus: change the factor levels of species so Gentoo is the reference group

hint: use fct_relevel or relevel

05:00

Getting Diagnostic Statistics

Base R

peng_adjust =lm(body_mass_g ~ bill_length_mm + flipper_length_mm + species,data = peng_sans_miss_base)## remember you need to open the train car door or it will return a listy thingpeng_sans_miss_base$.fitted_vals_brack = peng_adjust[[5]] peng_sans_miss_base$.fitted_vals_dollar = peng_adjust$fitted.valuespeng_sans_miss_base$.predicted_vals =predict(peng_adjust, interval ="prediction")peng_sans_miss_base$.residuals_vals = peng_adjust$residualspeng_sans_miss_base$.studentized_resids =rstudent(peng_adjust)peng_sans_miss_base$.cooks_distance =cooks.distance(peng_adjust)

The constitutive terms will appear automatically with *

pengs_interact_stand =lm(body_mass_g ~ bill_length_mm * sex + species + flipper_length_mm,data = peng_sans_miss_tidy)pengs_interact_col =lm(body_mass_g ~ bill_length_mm * sex + sex + species + bill_length_mm + flipper_length_mm ,data = peng_sans_miss_tidy)summary(pengs_interact_stand)

Call:
lm(formula = body_mass_g ~ bill_length_mm * sex + species + flipper_length_mm,
data = peng_sans_miss_tidy)
Residuals:
Min 1Q Median 3Q Max
-722.25 -189.89 -5.58 188.97 897.79
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1279.184 601.073 -2.128 0.034073 *
bill_length_mm 28.695 7.980 3.596 0.000374 ***
sexmale 1004.946 279.166 3.600 0.000368 ***
speciesChinstrap -302.268 81.331 -3.716 0.000238 ***
speciesGentoo 670.469 95.795 6.999 1.47e-11 ***
flipper_length_mm 19.052 2.954 6.449 4.06e-10 ***
bill_length_mm:sexmale -12.525 6.403 -1.956 0.051324 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 290.7 on 326 degrees of freedom
Multiple R-squared: 0.872, Adjusted R-squared: 0.8697
F-statistic: 370.2 on 6 and 326 DF, p-value: < 2.2e-16

Getting Marginal Effects for Interactions

library(marginaleffects)pengs_interact_effect =lm(body_mass_g ~ bill_length_mm + flipper_length_mm * sex + species, data = peng_sans_miss_tidy)plot_comparisons( pengs_interact_effect,condition ="flipper_length_mm",variables ="sex") +labs(x ="Bill Length(mm)", y ="Unit Level Conditional Estimates") +theme_minimal()

Interflex

library(interflex)peng_interflex =as.data.frame(peng_sans_miss_tidy)interflex(Y ="body_mass_g",D ="sex",X ="flipper_length_mm",Z =c("bill_length_mm", "species"),estimator ="kernel",data = peng_interflex,theme.bw =TRUE,parallel =FALSE) # recommend that you set this to true

Tables with multiple models

modelsummary(list(peng_naive, peng_adjust, pengs_interact_stand),stars =TRUE,gof_omit =".*", ## omitting all goodness of fit for spacecoef_map =c("bill_length_mm"="Bill Length(mm)","flipper_length_mm"="Flipper Length(mm)","bill_length_mm:sexmale"="Bill Length(mm) x Male Penguin"),vcov ="robust",note ="Robust Standard errors in Parenthsis")

(1)

(2)

(3)

Bill Length(mm)

87.415***

60.117***

28.695***

(6.898)

(6.519)

(6.817)

Flipper Length(mm)

27.544***

19.052***

(3.095)

(2.910)

Bill Length(mm) x Male Penguin

−12.525*

(6.207)

^{} + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

^{} Robust Standard errors in Parenthsis

Fancy Tables

Often times we have multiple dv’s

A popular strategy in Pols and Econ is to make a table with two panels

Where outcome 1 is in one panel and outcome two is in the bottom panel

gm <-c("r.squared", "nobs", "rmse")panels <-list(list(lm(body_mass_g ~ bill_length_mm + species, data = peng_sans_miss_tidy),lm(body_mass_g ~ bill_length_mm, data = peng_sans_miss_tidy) ),list(lm(flipper_length_mm ~ species + bill_length_mm, data = peng_sans_miss_tidy),lm(flipper_length_mm ~ species + bill_length_mm + bill_depth_mm, data = peng_sans_miss_tidy) ))modelsummary( panels,shape ="rbind",gof_map = gm)

Output

(1)

(2)

Panel A

(Intercept)

200.453

388.845

(271.646)

(289.817)

bill_length_mm

90.298

86.792

(6.951)

(6.538)

speciesChinstrap

-876.942

(88.744)

speciesGentoo

596.702

(76.429)

R2

0.785

0.347

RMSE

372.93

649.48

Panel B

(Intercept)

147.563

127.187

(4.223)

(5.250)

speciesChinstrap

-5.247

-1.519

(1.380)

(1.450)

speciesGentoo

17.552

27.394

(1.188)

(1.987)

bill_length_mm

1.096

0.709

(0.108)

(0.121)

bill_depth_mm

1.929

(0.320)

R2

0.828

0.845

RMSE

5.80

5.50

Num.Obs.

333

333

Fixed Effects

Mega popular estimation strategy in Econ and Political Science

library(plm)library(vdemdata)vdem_feols =feols(v2x_polyarchy ~ v2x_gender + v2x_corr | year + country_id, data = vdem)vdem_stand =lm(v2x_polyarchy ~ v2x_gender + v2x_corr +factor(year) +factor(country_id), data = vdem)vdem_plm =plm(v2x_polyarchy ~ v2x_gender + v2x_corr +factor(year), data = vdem,index =c("country_id"),model ="within")

Performance

Vdem has 202 country ids and 233 years

Lets time them to see

# A tibble: 3 × 2
expr `Mean Time in Seconds`
<fct> <dbl>
1 vdem_feols 0.00182
2 vdem_stand 0.479
3 vdem_plm 0.367

Your Turn

Create and indicator variable for the chinstrap penguin species

Add an interaction term with bill depth in a regression

create a table with modelsummary with all your models

05:00

Generating Predictions

set.seed(1994)penguins_prediction =mutate(penguins, id =row_number()) |>drop_na()peng_train = penguins_prediction |>sample_frac(0.7) # sample 70% of the dataset peng_test =anti_join(penguins_prediction, peng_train, by ="id")peng_test$data_set ="Test"# add an indicator for what data set we havepenguins_training_model =lm(body_mass_g ~ bill_length_mm, data = peng_train)penguins_train_plot =augment(penguins_training_model, interval ="prediction") |>select(.fitted, .lower, .upper, body_mass_g, bill_length_mm) |>mutate(data_set ="train")# alterntively this works# predict(penguins_train_plot, newdata = peng_test)peng_predict_test =augment(penguins_training_model,interval ="prediction",newdata = peng_test) |>select(.fitted, .lower, .upper, body_mass_g, bill_length_mm, data_set) |>bind_rows(penguins_train_plot)