library(readr)
<-
la_mort read_csv("https://www.dropbox.com/scl/fi/fzsnhfd3lq80v2o3sag6c/la_mort.csv?rlkey=h1vyjm2b8ppgejgsg3e8evm7i&dl=1")
Summarizing Data Tutorial - Louisiana Mortality File
Refining our Analyic Sample
In the previous tutorial, we calculated crude cancer mortality rates across Louisiana parishes from 2005 through 2019. Now we’ll refine our method for calculating those rates by adjusting for different age distributions across parishes and over time.
It’s important to age-adjust our mortality rates because different parishes have different age profiles and that could bias our estimated rates. For example, suppose that parish A has a relatively high proportion of older people and parish B has a relatively high proportion of younger people. Crude cancer mortality rates would likely show higher rates among parish A than parish B. However, this may not actually be the case once we account for the fact that the older population in parish A is more prone to cancer.
Age-adjustment is really a way to determine how cancer mortality rates would differ between parish A and parish B if both parishes had the same population age distribution.
Let’s start by opening the .Rproj file in your hpam7660_Cancer_Alley
folder. Then open the Markdown document you used in the previous tutorial to create the analytic sample. We can continue working off of this same Markdown document as we refine our sample in this tutorial.
Age-Adjusted Cancer Mortality Rates
To calculate age-adjusted cancer mortality rates, we need to generate cancer death counts by parish and year for each age group in our data. Start by reading in the raw Louisiana cancer mortality file (you may be able to skip this step if the la_mort
data frame is still loaded in your project environment).
We now need to repeat a couple of steps that we’ve seen before. First, we’ll create an indicator for whether a parish is in Cancer Alley:
$cancer_parish <- ifelse(la_mort$cntyrsd %in% c(5, 33, 47, 51, 71, 89, 93, 95, 121), 1, 0) la_mort
Then we’ll create an indicator for a cancer death:
$cancer39 <- ifelse(la_mort$ucr39 %in% c(5:15), 1, 0) la_mort
Next, we need to aggregate to the parish level like we did before, but with one important difference - we need to create age-specific counts of parish-level cancer deaths. To do so, we need to assign people in the mortality file to these age groups using the age
variable. But remember, the age
variable in the mortality file uses some strange values (refer to the codebook you created for specifics). First we’ll need to replace values of the age
variable with age in years. Use the following code to do so:
library(dplyr)
<- la_mort %>%
la_mort_age filter(age != 9999)
$age <- ifelse(la_mort_age$age < 2000, la_mort_age$age - 1000, 0) la_mort_age
There are a few people for whom age in the la_mort
file is missing and coded 9999. First, we’re dropping those people from the data by creating a new data frame called la_mort_age
where anyone with an age equal to 9999 is excluded. The next line tells R to replace the age
variable with values equal to the current value minus 1000 if the current value is less than 2000. Remember that a leading digit of 1 in this variable indicates that age is recorded in years and the remaining three digits are age in years. So by subtracting 1000, we’re left with age in years for those whose age was recorded in years (confusing I know!). Everyone with a leading digit between 2 and 9 died before they turned 1 year old and so their age is measured in months, days, etc. Here we’re telling R to change all those values to 0 since we only want age measured in years.
Now that we have the ages for all individuals in our data represented as age in years, we can categorize ages based on the categories available in the denominator file, which are as follows:
0 to 4
5 to 9
10 to 14
15 to 19
20 to 24
25 to 29
30 to 34
35 to 39
40 to 44
45 to 49
50 to 54
55 to 59
60 to 64
65 to 69
70 to 74
75 to 79
80 to 84
85 and above
Here’s the R code that will allow us to do that:
<- c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, Inf)
age_breaks <- c("0_4", "5_9", "10_14", "15_19", "20_24", "25_29", "30_34", "35_39",
age_labels "40_44", "45_49", "50_54", "55_59", "60_64", "65_69", "70_74",
"75_79", "80_84", "85+")
$agegrp <- as.character(cut(la_mort_age$age, breaks = age_breaks, labels = age_labels, right = FALSE)) la_mort_age
The first line in this code is telling R to define a vector called age_breaks
where the “break” corresponds to the lower bound for each age group (e.g., 0 for 0 to 4, 5 for 5 to 9, etc.). The second line is defining values for each age category. Note that I’ve defined these values so that they correspond exactly to the values in the denominator file (la_pop
). Finally, the third line is creating a new variable in the la_mort_age
data frame called agegrp
using the cut
function that categorizes the numerical age
variable into discrete intervals. Here, the discrete intervals are defined by the values in the age_breaks
vector and labeled using the values in the age_labels
vector. The right = FALSE
option tells R to exclude the next break age from the current age interval (e.g., exclude 5 from the 0-5 interval, exclude 10 from the 5-10 interval, and so on). Also, note that I called this new variable agegrp
and created it as a character variable so that it would match the name and variable type of the agegrp
variable in the la_pop
data. This is because we’ll use agegrp
as a key variable when we’re joining the two data frames.
Aggregating to the Parish Level
Last time, when we aggregated to the parish level, we used the following code:
<- la_mort %>%
parish_count group_by(cntyrsd, cancer_parish, year) %>%
summarize(cancer39 = sum(cancer39, na.rm = TRUE))
If you remember, this code created a data frame that included the overall count of cancer deaths by parish and year, but did not differentiate by age. Let’s modify this code so that we aggregate to the parish-age-year level instead of just the parish-year level. To do so, we’ll just need to add agegrp
to our group_by
function:
<- la_mort_age %>%
parish_count_age group_by(cntyrsd, cancer_parish, agegrp, year) %>%
summarize(cancer39 = sum(cancer39, na.rm = TRUE))
After running this, you should see that the parish_count_age
data frame contains counts of cancer deaths for each age category we defined above at the parish-year level.
Merging Population Data
Now it’s time to join the parish_count_age
data frame that we just created with the la_pop
data frame that contains our population denominators. First, load the la_pop
data if it’s not already in your project environment:
<-
la_pop read_csv("https://www.dropbox.com/scl/fi/650k1obpczky6bwa19ex6/la_county_pop.csv?rlkey=0aokd9m76q7mxwus97uslsx7g&dl=1")
Remember that last time, we changed the key variable names in the the cancer count data to match the names in the population data (e.g., cntyrsd
became county
). This time, we’ll run the join with the different key variable names so you can see what that looks like.
Here’s the code we used last time to do the join:
<- parish_count %>%
la_joined inner_join(la_pop, by = c("county", "year"))
We’ll modify that code this time to account for the different key variable names (cntyrsd
and county
) and because we now have a new key variable to include: agegrp
.
<- parish_count_age %>%
la_joined inner_join(la_pop, by = c("cntyrsd" = "county", "year", "agegrp"))
This code tells R that we want to create a new data frame called la_joined
that is formed using an inner_join
on the parish_count_age
and la_pop
data frames. They key variables for R to join on are county (which is called cntyrsd
in the parish_count_age
data frame and called county
in the la_pop
data frame), year, and age group.
If you take a look at the new la_joined
data frame, you’ll see that we have cancer death counts and population counts for each age in each parish-year. We also have population counts by age-race-ethnicity, which we’ll use later on.
Calculating Age-Adjusted Cancer Mortality Rates
The process of age-adjusting parish cancer mortality rates consists of standardizing age distributions so that when comparing cancer mortality across parish or over time, we’re comparing mortality rates for people who are the “same age” (in a statistical sense). Standardizing age distributions involves choosing a reference population and calculating weighted mortality rates based on that population. Let’s walk through how this is done.
STANDARD POPULATION
First we need to choose a standard population age distribution. Most commonly, people use the age distribution in the U.S. at the time of the 2000 census.
Age | 2000 U.S. Standard (in millions) |
---|---|
0 to 4 | 69,135 |
5 to 9 | 72,533 |
10 to 14 | 73,032 |
15 to 19 | 72,169 |
20 to 24 | 66,478 |
25 to 29 | 64,529 |
30 to 34 | 71,044 |
35 to 39 | 80,762 |
40 to 44 | 81,851 |
45 to 49 | 72,118 |
50 to 54 | 62,716 |
55 to 59 | 48,454 |
60 to 64 | 38,793 |
65 to 69 | 34,264 |
70 to 74 | 31,773 |
75 to 79 | 26,999 |
80 to 84 | 17,842 |
85+ | 15,508 |
Here’s a .csv file of the standard population age distribution. Read it into R and join it with your la_joined
data frame.
<-
stnrd_pop read_csv("https://www.dropbox.com/scl/fi/xzd2o5lza237so6vamqwb/stnrd_pop.csv?rlkey=zp90au2tuq6eptvi1yiyjfzua&dl=1")
<- la_joined %>%
la_joined_stnrd inner_join(stnrd_pop, by = "agegrp")
Note that this code uses an inner_join
command on the key variable agegrp
. Both the la_joined
and stnrd_pop
data frames contain age ranges and so this join statement will add the standard population column in the table above to our la_joined
data frame.
STANDARD POPULATION WEIGHTS
Now that we have the standard population age distribution, we need to use this information to create population weights. The idea here is that we calculate the share of the total population represented by each age group. We’ll then multiply this share by the age-specific cancer mortality rates we calculated above.
Run the following code to calculate the population weights:
$stnrd_pop_weight <- (la_joined_stnrd$stnrd_pop) / (sum(stnrd_pop$stnrd_pop)) la_joined_stnrd
This code creates a new variable called stnrd_pop_weight
that is calculated as the age-specific standard population divided by the total standard population. In other words, the value of this stnrd_pop_weight
variable is each age group’s share of the total population.
APPLYING STANDARD POPULAION WEIGHTS
Now we want to multiply the age-specific cancer mortality rates (that we actually haven’t created yet in this tutorial) by this new stnrd_pop_weight
variable. In the last tutorial, we calculated cancer mortality rates by dividing the total count of cancer deaths by the total population in each parish and then multiplied by 100,000 so that rates were per 100,000 population. We’ll do the same thing here, but this time rates will be age-specific and we’ll multiply the rates by the stnrd_pop_weight
variable.
$cancer_rate_adj <- ((la_joined_stnrd$cancer39) / (la_joined_stnrd$tot_pop / 100000)) * la_joined_stnrd$stnrd_pop_weight la_joined_stnrd
This code tells R to create a new variable called cancer_rate_adj
that is equal to the count of cancer deaths cancer39
divided by the total population tot_pop
(which has been divided by 100,000) and multiplied by the stnrd_pop_weight
variable. However, the big difference between what we’re doing this time and what we did in the last tutorial is that the rows in our la_joined_stnrd
data frame represent age-specific cancer death counts and populations and not overall counts at the parish level.
AGGREGATE OVER AGE
Our goal here is to calculate a single age-adjusted cancer mortality rate for each parish in each year. Now that we’ve calculated cancer mortality rates for each age group in each parish-year, we just need to aggregate over those age groups to get the age-adjusted parish level cancer mortality rate. We can do that as follows:
<- la_joined_stnrd %>%
parish_rates group_by(cntyrsd, cancer_parish, year) %>%
summarize(cancer_rate_adj = sum(cancer_rate_adj, na.rm = TRUE), cancer39 = sum(cancer39), tot_pop =
sum(tot_pop))
$cancer_rate_crude <- (parish_rates$cancer39) / (parish_rates$tot_pop / 100000) parish_rates
This code creates a new data frame called parish_rates
that contains aggregated cancer morality rates (cancer_rate_adj
) by parish-year. I’ve also included columns for the total count of cancer deaths in that parish-year and the total population in that parish-year. We’ll use those columns shortly. Finally, notice that the last line in the code chunk above creates a variable called cancer_rate_crude
. This is just the non-age-adjusted parish cancer mortality rate that we calculated in the last tutorial.
This step may be particularly confusing, so if there’s one way to wrap your head around what’s happening here, it’s this: values in the la_joined_stnrd
data frame rows were measured at the age-parish-year level, while values in the parish_rates
data frame are measured at the parish-year level.
Let’s take a look at differences between crude and adjusted cancer rates. We can recreate the table of 2019 cancer rates that we made at the end of the last tutorial. Here was the code we used:
<- subset(la_joined_all, year == 2019)
parish_cancer_2019 library(knitr)
kable(parish_cancer_2019[, c("county", "cancer_rate_total")])
Let’s modify this code to include the adjusted rate and change the names to reflect our new data frame names:
<- subset(parish_rates, year == 2019)
parish_cancer_2019 kable(parish_cancer_2019[, c("cntyrsd", "cancer_rate_crude", "cancer_rate_adj")])
Weighting by Parish Population
You’ll notice there are some big differences between some of the crude and age-adjusted cancer mortality rates. This tends to happen when a parish’s population (and hence the number of cancer deaths in that parish is small). We’ll want to account for this before we compare cancer mortality rates for Cancer Alley and non-Cancer Alley parishes. We can do this by weighting our aggregated (Cancer Alley vs. non-Cancer Alley) means by parish populations. In other words, when comparing a single “Cancer Alley mortality rate” to a single “Non-Cancer Alley mortality rate”, these rates are comprised of an average of rates for each parish in each grouping. But we’d like the more populous parishes in those groupings to contribute more to the overall average mortality rate rather than weighting each parish equally.
We can create population weights (different from standard population weights) by first multiplying our age-adjusted parish cancer mortality rates by the total parish population, adding up each of those values for all the Cancer Alley parishes and the non-Cancer Alley parishes separately, and then dividing the sum of those values by the total population in Cancer Alley and non-Cancer Alley parishes (kind of confusing, I know). Here’s the code to do that:
$pop_weight <- (parish_rates$cancer_rate_adj) * (parish_rates$tot_pop)
parish_rates<- parish_rates %>%
cancer_alley_rates group_by(cancer_parish, year) %>%
summarize(cancer_rate_adj_wt = sum(pop_weight) / sum(tot_pop))
The first line of the code creates a new variable in the parish_rates
data frame called pop_weight
that is equal to the product of each parish’s age-adjusted cancer mortality rate (cancer_rate_adj
) and that parish’s total population (tot_pop
). The rest of the code sums up these values over Cancer Alley and non-Cancer Alley parishes and divides the weighted mortality rates by the total population in each group.
Comparing Age-Adjusted Cancer Mortality Rates
Now that we have our age-adjusted and population-weighted parish cancer mortality rates, we can create a table to compare annual cancer mortality rates for Cancer Alley and non-Cancer Alley parishes. Just like we did before, let’s use the kable
command to create the table:
kable(cancer_alley_rates)
This is ok, but it would be much easier to read this table if we had a column of cancer mortality rates for Cancer Alley parishes and a separate column of cancer mortality rates for non-Cancer Alley Parishes. To get there, we can split the cancer_alley_rates
data frame by the cancer_parish
indicator and then join the two data frames back together:
<-
cancer_alley subset(cancer_alley_rates, cancer_parish == 1, select = c(cancer_rate_adj_wt, year)) %>%
rename(cancer_alley_rate = cancer_rate_adj_wt)
<-
no_cancer_alley subset(cancer_alley_rates, cancer_parish == 0, select = c(cancer_rate_adj_wt, year)) %>%
rename(no_cancer_alley_rate = cancer_rate_adj_wt)
<- cancer_alley %>%
cancer_alley_table inner_join(no_cancer_alley, by = "year")
<- cancer_alley_table[,c("year", "cancer_alley_rate", "no_cancer_alley_rate")]
cancer_alley_table kable(cancer_alley_table)
This looks much better and provides a nice way to summarize age-adjusted, population weighted, cancer mortality rates for Cancer Alley and non-Cancer Alley parishes. In the next tutorial, we’ll refine our definition of Cancer Alley parishes and run some subgroup analyses that will allow us to examine cancer mortality rates by cancer type and by race and ethnicity.
Be sure to save your Markdown file and push the file to your hpam7660_Cancer_Alley
GitHub repo.