Summarizing Data Tutorial - Louisiana Mortality File

Refining our Analyic Sample

In the previous tutorial, we calculated crude cancer mortality rates across Louisiana parishes from 2005 through 2019. Now we’ll refine our method for calculating those rates by adjusting for different age distributions across parishes and over time.

It’s important to age-adjust our mortality rates because different parishes have different age profiles and that could bias our estimated rates. For example, suppose that parish A has a relatively high proportion of older people and parish B has a relatively high proportion of younger people. Crude cancer mortality rates would likely show higher rates among parish A than parish B. However, this may not actually be the case once we account for the fact that the older population in parish A is more prone to cancer.

Age-adjustment is really a way to determine how cancer mortality rates would differ between parish A and parish B if both parishes had the same population age distribution.

Let’s start by opening the .Rproj file in your hpam7660_Cancer_Alley folder. Then open the Markdown document you used in the previous tutorial to create the analytic sample. We can continue working off of this same Markdown document as we refine our sample in this tutorial.

Age-Adjusted Cancer Mortality Rates

To calculate age-adjusted cancer mortality rates, we need to generate cancer death counts by parish and year for each age group in our data. Start by reading in the raw Louisiana cancer mortality file (you may be able to skip this step if the la_mort data frame is still loaded in your project environment).

library(readr)
la_mort <- 
  read_csv("https://www.dropbox.com/scl/fi/fzsnhfd3lq80v2o3sag6c/la_mort.csv?rlkey=h1vyjm2b8ppgejgsg3e8evm7i&dl=1")

We now need to repeat a couple of steps that we’ve seen before. First, we’ll create an indicator for whether a parish is in Cancer Alley:

la_mort$cancer_parish <- ifelse(la_mort$cntyrsd %in% c(5, 33, 47, 51, 71, 89, 93, 95, 121), 1, 0)

Then we’ll create an indicator for a cancer death:

la_mort$cancer39 <- ifelse(la_mort$ucr39 %in% c(5:15), 1, 0)

Next, we need to aggregate to the parish level like we did before, but with one important difference - we need to create age-specific counts of parish-level cancer deaths. To do so, we need to assign people in the mortality file to these age groups using the age variable. But remember, the age variable in the mortality file uses some strange values (refer to the codebook you created for specifics). First we’ll need to replace values of the age variable with age in years. Use the following code to do so:

library(dplyr)
la_mort_age <- la_mort %>%
  filter(age != 9999)
la_mort_age$age <- ifelse(la_mort_age$age < 2000, la_mort_age$age - 1000, 0)

There are a few people for whom age in the la_mort file is missing and coded 9999. First, we’re dropping those people from the data by creating a new data frame called la_mort_age where anyone with an age equal to 9999 is excluded. The next line tells R to replace the age variable with values equal to the current value minus 1000 if the current value is less than 2000. Remember that a leading digit of 1 in this variable indicates that age is recorded in years and the remaining three digits are age in years. So by subtracting 1000, we’re left with age in years for those whose age was recorded in years (confusing I know!). Everyone with a leading digit between 2 and 9 died before they turned 1 year old and so their age is measured in months, days, etc. Here we’re telling R to change all those values to 0 since we only want age measured in years.

Now that we have the ages for all individuals in our data represented as age in years, we can categorize ages based on the categories available in the denominator file, which are as follows:

0 to 4
5 to 9
10 to 14
15 to 19
20 to 24
25 to 29
30 to 34
35 to 39
40 to 44
45 to 49
50 to 54
55 to 59
60 to 64
65 to 69
70 to 74
75 to 79
80 to 84
85 and above

Here’s the R code that will allow us to do that:

age_breaks <- c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, Inf)
age_labels <- c("0_4", "5_9", "10_14", "15_19", "20_24", "25_29", "30_34", "35_39", 
                "40_44", "45_49", "50_54", "55_59", "60_64", "65_69", "70_74", 
                "75_79", "80_84", "85+")
la_mort_age$agegrp <- as.character(cut(la_mort_age$age, breaks = age_breaks, labels = age_labels, right = FALSE))

The first line in this code is telling R to define a vector called age_breaks where the “break” corresponds to the lower bound for each age group (e.g., 0 for 0 to 4, 5 for 5 to 9, etc.). The second line is defining values for each age category. Note that I’ve defined these values so that they correspond exactly to the values in the denominator file (la_pop). Finally, the third line is creating a new variable in the la_mort_age data frame called agegrp using the cut function that categorizes the numerical age variable into discrete intervals. Here, the discrete intervals are defined by the values in the age_breaks vector and labeled using the values in the age_labels vector. The right = FALSE option tells R to exclude the next break age from the current age interval (e.g., exclude 5 from the 0-5 interval, exclude 10 from the 5-10 interval, and so on). Also, note that I called this new variable agegrp and created it as a character variable so that it would match the name and variable type of the agegrp variable in the la_pop data. This is because we’ll use agegrp as a key variable when we’re joining the two data frames.

Aggregating to the Parish Level

Last time, when we aggregated to the parish level, we used the following code:

parish_count <- la_mort %>%
  group_by(cntyrsd, cancer_parish, year) %>%
  summarize(cancer39 = sum(cancer39, na.rm = TRUE))

If you remember, this code created a data frame that included the overall count of cancer deaths by parish and year, but did not differentiate by age. Let’s modify this code so that we aggregate to the parish-age-year level instead of just the parish-year level. To do so, we’ll just need to add agegrp to our group_by function:

parish_count_age <- la_mort_age %>%
  group_by(cntyrsd, cancer_parish, agegrp, year) %>%
  summarize(cancer39 = sum(cancer39, na.rm = TRUE))

After running this, you should see that the parish_count_age data frame contains counts of cancer deaths for each age category we defined above at the parish-year level.

Merging Population Data

Now it’s time to join the parish_count_age data frame that we just created with the la_pop data frame that contains our population denominators. First, load the la_pop data if it’s not already in your project environment:

la_pop <- 
  read_csv("https://www.dropbox.com/scl/fi/650k1obpczky6bwa19ex6/la_county_pop.csv?rlkey=0aokd9m76q7mxwus97uslsx7g&dl=1")

Remember that last time, we changed the key variable names in the the cancer count data to match the names in the population data (e.g., cntyrsd became county). This time, we’ll run the join with the different key variable names so you can see what that looks like.

Here’s the code we used last time to do the join:

la_joined <- parish_count %>%
  inner_join(la_pop, by = c("county", "year"))

We’ll modify that code this time to account for the different key variable names (cntyrsd and county) and because we now have a new key variable to include: agegrp.

la_joined <- parish_count_age %>%
  inner_join(la_pop, by = c("cntyrsd" = "county", "year", "agegrp"))

This code tells R that we want to create a new data frame called la_joined that is formed using an inner_join on the parish_count_age and la_pop data frames. They key variables for R to join on are county (which is called cntyrsd in the parish_count_age data frame and called county in the la_pop data frame), year, and age group.

If you take a look at the new la_joined data frame, you’ll see that we have cancer death counts and population counts for each age in each parish-year. We also have population counts by age-race-ethnicity, which we’ll use later on.

Calculating Age-Adjusted Cancer Mortality Rates

The process of age-adjusting parish cancer mortality rates consists of standardizing age distributions so that when comparing cancer mortality across parish or over time, we’re comparing mortality rates for people who are the “same age” (in a statistical sense). Standardizing age distributions involves choosing a reference population and calculating weighted mortality rates based on that population. Let’s walk through how this is done.

STANDARD POPULATION

First we need to choose a standard population age distribution. Most commonly, people use the age distribution in the U.S. at the time of the 2000 census.

Source: https://seer.cancer.gov/stdpopulations/stdpop.19ages.html
Age 2000 U.S. Standard (in millions)
0 to 4 69,135
5 to 9 72,533
10 to 14 73,032
15 to 19 72,169
20 to 24 66,478
25 to 29 64,529
30 to 34 71,044
35 to 39 80,762
40 to 44 81,851
45 to 49 72,118
50 to 54 62,716
55 to 59 48,454
60 to 64 38,793
65 to 69 34,264
70 to 74 31,773
75 to 79 26,999
80 to 84 17,842
85+ 15,508

Here’s a .csv file of the standard population age distribution. Read it into R and join it with your la_joined data frame.

stnrd_pop <- 
  read_csv("https://www.dropbox.com/scl/fi/xzd2o5lza237so6vamqwb/stnrd_pop.csv?rlkey=zp90au2tuq6eptvi1yiyjfzua&dl=1")
la_joined_stnrd <- la_joined %>%
  inner_join(stnrd_pop, by = "agegrp")

Note that this code uses an inner_join command on the key variable agegrp. Both the la_joined and stnrd_pop data frames contain age ranges and so this join statement will add the standard population column in the table above to our la_joined data frame.

STANDARD POPULATION WEIGHTS

Now that we have the standard population age distribution, we need to use this information to create population weights. The idea here is that we calculate the share of the total population represented by each age group. We’ll then multiply this share by the age-specific cancer mortality rates we calculated above.

Run the following code to calculate the population weights:

la_joined_stnrd$stnrd_pop_weight <- (la_joined_stnrd$stnrd_pop) / (sum(stnrd_pop$stnrd_pop))

This code creates a new variable called stnrd_pop_weight that is calculated as the age-specific standard population divided by the total standard population. In other words, the value of this stnrd_pop_weight variable is each age group’s share of the total population.

APPLYING STANDARD POPULAION WEIGHTS

Now we want to multiply the age-specific cancer mortality rates (that we actually haven’t created yet in this tutorial) by this new stnrd_pop_weight variable. In the last tutorial, we calculated cancer mortality rates by dividing the total count of cancer deaths by the total population in each parish and then multiplied by 100,000 so that rates were per 100,000 population. We’ll do the same thing here, but this time rates will be age-specific and we’ll multiply the rates by the stnrd_pop_weight variable.

la_joined_stnrd$cancer_rate_adj <- ((la_joined_stnrd$cancer39) / (la_joined_stnrd$tot_pop / 100000)) * la_joined_stnrd$stnrd_pop_weight

This code tells R to create a new variable called cancer_rate_adj that is equal to the count of cancer deaths cancer39 divided by the total population tot_pop (which has been divided by 100,000) and multiplied by the stnrd_pop_weight variable. However, the big difference between what we’re doing this time and what we did in the last tutorial is that the rows in our la_joined_stnrd data frame represent age-specific cancer death counts and populations and not overall counts at the parish level.

AGGREGATE OVER AGE

Our goal here is to calculate a single age-adjusted cancer mortality rate for each parish in each year. Now that we’ve calculated cancer mortality rates for each age group in each parish-year, we just need to aggregate over those age groups to get the age-adjusted parish level cancer mortality rate. We can do that as follows:

parish_rates <- la_joined_stnrd %>%
  group_by(cntyrsd, cancer_parish, year) %>%
  summarize(cancer_rate_adj = sum(cancer_rate_adj, na.rm = TRUE), cancer39 = sum(cancer39), tot_pop = 
              sum(tot_pop))
parish_rates$cancer_rate_crude <- (parish_rates$cancer39) / (parish_rates$tot_pop / 100000)

This code creates a new data frame called parish_rates that contains aggregated cancer morality rates (cancer_rate_adj) by parish-year. I’ve also included columns for the total count of cancer deaths in that parish-year and the total population in that parish-year. We’ll use those columns shortly. Finally, notice that the last line in the code chunk above creates a variable called cancer_rate_crude. This is just the non-age-adjusted parish cancer mortality rate that we calculated in the last tutorial.

This step may be particularly confusing, so if there’s one way to wrap your head around what’s happening here, it’s this: values in the la_joined_stnrd data frame rows were measured at the age-parish-year level, while values in the parish_rates data frame are measured at the parish-year level.

Let’s take a look at differences between crude and adjusted cancer rates. We can recreate the table of 2019 cancer rates that we made at the end of the last tutorial. Here was the code we used:

parish_cancer_2019 <- subset(la_joined_all, year == 2019)
library(knitr)
kable(parish_cancer_2019[, c("county", "cancer_rate_total")])

Let’s modify this code to include the adjusted rate and change the names to reflect our new data frame names:

parish_cancer_2019 <- subset(parish_rates, year == 2019)
kable(parish_cancer_2019[, c("cntyrsd", "cancer_rate_crude", "cancer_rate_adj")])

Weighting by Parish Population

You’ll notice there are some big differences between some of the crude and age-adjusted cancer mortality rates. This tends to happen when a parish’s population (and hence the number of cancer deaths in that parish is small). We’ll want to account for this before we compare cancer mortality rates for Cancer Alley and non-Cancer Alley parishes. We can do this by weighting our aggregated (Cancer Alley vs. non-Cancer Alley) means by parish populations. In other words, when comparing a single “Cancer Alley mortality rate” to a single “Non-Cancer Alley mortality rate”, these rates are comprised of an average of rates for each parish in each grouping. But we’d like the more populous parishes in those groupings to contribute more to the overall average mortality rate rather than weighting each parish equally.

We can create population weights (different from standard population weights) by first multiplying our age-adjusted parish cancer mortality rates by the total parish population, adding up each of those values for all the Cancer Alley parishes and the non-Cancer Alley parishes separately, and then dividing the sum of those values by the total population in Cancer Alley and non-Cancer Alley parishes (kind of confusing, I know). Here’s the code to do that:

parish_rates$pop_weight <- (parish_rates$cancer_rate_adj) * (parish_rates$tot_pop)
cancer_alley_rates <- parish_rates %>%
  group_by(cancer_parish, year) %>%
  summarize(cancer_rate_adj_wt = sum(pop_weight) / sum(tot_pop))

The first line of the code creates a new variable in the parish_rates data frame called pop_weight that is equal to the product of each parish’s age-adjusted cancer mortality rate (cancer_rate_adj) and that parish’s total population (tot_pop). The rest of the code sums up these values over Cancer Alley and non-Cancer Alley parishes and divides the weighted mortality rates by the total population in each group.

Comparing Age-Adjusted Cancer Mortality Rates

Now that we have our age-adjusted and population-weighted parish cancer mortality rates, we can create a table to compare annual cancer mortality rates for Cancer Alley and non-Cancer Alley parishes. Just like we did before, let’s use the kable command to create the table:

kable(cancer_alley_rates)

This is ok, but it would be much easier to read this table if we had a column of cancer mortality rates for Cancer Alley parishes and a separate column of cancer mortality rates for non-Cancer Alley Parishes. To get there, we can split the cancer_alley_rates data frame by the cancer_parish indicator and then join the two data frames back together:

cancer_alley <- 
  subset(cancer_alley_rates, cancer_parish == 1, select = c(cancer_rate_adj_wt, year)) %>%
  rename(cancer_alley_rate = cancer_rate_adj_wt)
no_cancer_alley <- 
  subset(cancer_alley_rates, cancer_parish == 0, select = c(cancer_rate_adj_wt, year)) %>%
  rename(no_cancer_alley_rate = cancer_rate_adj_wt)
cancer_alley_table <- cancer_alley %>%
  inner_join(no_cancer_alley, by = "year")
cancer_alley_table <- cancer_alley_table[,c("year", "cancer_alley_rate", "no_cancer_alley_rate")]
kable(cancer_alley_table)

This looks much better and provides a nice way to summarize age-adjusted, population weighted, cancer mortality rates for Cancer Alley and non-Cancer Alley parishes. In the next tutorial, we’ll refine our definition of Cancer Alley parishes and run some subgroup analyses that will allow us to examine cancer mortality rates by cancer type and by race and ethnicity.

Be sure to save your Markdown file and push the file to your hpam7660_Cancer_Alley GitHub repo.