Regression Tutorial - Parish Cancer Rates Controlling for Educational Attainment

This mini-tutorial provides an example of controlling for potential confounders using regression analysis. In this case, we’ll add parish-level smoking rates to our Cancer Alley data and generate estimates of parish cancer mortality rates that control for differences in smoking rates.

Load Cancer Alley Data

First, we’ll need to load the data frames we created in previous Cancer Alley tutorials. If you’ve saved these data frames to an image, you can load that image as follows

library(tidyverse)
library(knitr)
load("~/Dropbox/Documents/Teaching Materials/Health Policy/GitHub Site/hpam7660-sp24//assignments/tutorial_8.RData")

Otherwise, you’ll need to re-run your code that creates the parish_rates data frame. Once you have that data frame loaded in your environment, you can do some data cleaning as follows:

Join Smoking Rate Data

parish_rates <- parish_rates %>%
  filter(black == 0) %>%
  subset(, select = c("cntyrsd", "cancer_parish", "year", "total_rate_adj", "total_weight", "population"))

Here, I’m modifying the parish_rates data so that it only includes the columns that I want to use for this mini-tutorial and so that it restricts the sample to parish cancer mortality rates for white people. The reason that I’m restricting the sample to white people is that I don’t have smoking rates by race and ethnicity (if this were a real research project, that’s something I’d want to obtain). And since whites have higher smoking rates and comprise a greater share of the population in Louisiana than other races and ethnicities, it makes sense to restrict this example to the confounding effects of smoking on cancer mortality for whites.

Next, we’ll read in the smoking rate data:

la_smoke <- 
  read_csv("https://www.dropbox.com/scl/fi/fh5gokmw3so6xfc5rthbl/la_smoke.csv?rlkey=f0vy11mvqcd37htsz8ubovtb8&dl=1")

And then join that data to our parish_rates data using the key variables of cntyrsd (called fips in the smoking rate data) and year:

la_smoke_rates <- parish_rates %>%
  inner_join(la_smoke, by = c("cntyrsd" = "fips", "year"))

Regress Parish Cancer Rates on Parish Smoking Rates

Now we’ll run a regression that estimates the relationship between the share of people who smoke in a parish in a given year and the total cancer mortality rate in that year (there are some other terms included in the regression that aren’t that important for the purposes of this tutorial.)

model <- lm(total_rate_adj ~ smoke_perct + as.factor(year) + as.factor(cntyrsd), data = la_smoke_rates)

The lm() command tells R that I want to run a linear regression model. You can visualize this as drawing a straight line (i.e., linear) through a plot of points where smoking rate is on the x-axis and cancer mortality rate is on the y-axis.

Next, I want to predict parish-level cancer mortality rates holding smoking rates constant (i.e., controlling for smoking rates):

la_smoke_rates$total_rate_hat <- predict(model, newdata = la_smoke_rates)

This code creates a new variable called total_rate_hat that represents cancer mortality rates holding smoking rates constant.

Create Population Weighted Measures

Next, I want to weight this rate by population just like we did for the cancer mortality rates we created previously:

la_smoke_rates$smoke_weight <- (la_smoke_rates$total_rate_hat) * (la_smoke_rates$population)
la_smoke_final <- la_smoke_rates %>%
  group_by(cntyrsd, year) %>%
  summarize(
    total_rate_adj_wt = sum((total_weight) / sum(population), na.rm = TRUE),
    smoke_rate_adj_wt = sum((smoke_weight) / sum(population), na.rm = TRUE)
  )

That’s it. We’re done. Now we can compare the parish-level cancer mortality rates to the cancer mortality rates that control for the share of the population that smokes to see the difference. Here’s an example for one parish:

Output Results

la_smoke_filtered <- subset(la_smoke_final, cntyrsd == 1)
kable(la_smoke_filtered)
cntyrsd year total_rate_adj_wt smoke_rate_adj_wt
1 2010 233.1905 208.5036
1 2011 239.2326 205.4408
1 2012 212.9739 205.9210
1 2013 203.9621 202.7800
1 2014 171.4920 201.0846
1 2015 199.0707 192.4747
1 2016 175.8754 199.2939
1 2017 202.7927 192.2017
1 2018 182.7003 191.0731
1 2019 163.6795 186.1962