Data Lab 3 - Descriptive Comparisons: Medicaid and Health

In this tutorial, we’ll analyze the 2023 Behavioral Risk Factor Surveillance System (BRFSS) data to explore Medicaid coverage and its relationship with self-reported health. By summarizing Medicaid coverage and comparing health outcomes across Medicaid recipients and the uninsured, we’ll lay the foundation for more advanced analysis in future tutorials.

Step 1: Create a New R Project & R Script File

See the instructions from Data Lab 2 to create a new R Project and R Script File. You can call this new project/folder “BRFSS Medicaid Health” or whatever you’d like to call it (as long as you remember the name!). You should type all of the code for this Data Lab in your R Script file and save that file when you’re finished. That way, if you need to use that code again (and you will), you’ll have it saved and won’t have to retype everything.

Step 2: Downloading BRFSS Data for 2023

Start by downloading the 2023 BRFSS data from the CDC BRFSS Annual Survey Data page. Follow these steps:

Navigate to the 2016-2023 Annual Survey Data Section and click on the link that says “2023 Annual Survey Data”.
Download the codebook by clicking on “2023 BRFSS Codebook” link. We’ll reference this later when we’re looking at values for specific variables in the data. (To align with the priorities of the new Trump administration, the CDC has removed some files from their website. Last I checked, the 2023 BRFSS data file was still available, but the codebook was not. If you are still seeing a dead link when clicking on the codebook, you can download it here.)
Scroll down the page to the “Data Files” section and download the file called “2023 BRFSS Data (SAS Transport Format with the .xpt extension). Save the file to your project folder. This will download as a zip file, so you’ll need to unzip the file once you’ve downloaded it. This is a pretty large file when unzipped (1.22GB), so keep that in mind when downloading.

Step 3: Importing the Data into R

Now that you’ve saved the .xpt file, let’s load it into R using the read_xpt command from the haven package. You’ll probably need to install this package before loading the library. Be patient as you load this file, it’s a large data set and will take a few minutes to load. Once it’s loaded, you should see the data frame in your Global Environment window.

install.packages("haven")
library(haven)
library(dplyr)
brfss_data <- read_xpt("path_to_file/LLCP2023.XPT")

Note that you’ll need to replace “path_to_file” in the code chunk above with the actual path to the .xpt file that you saved on your computer.

Special note for Windows users: Windows uses backslashes (\) instead of forward slashes (/) for file paths, but R doesn’t like backslashes. If you’re a Windows user, you can try and replace the backslashes with forward slashes. In some cases that will work, but if it doesn’t, you can use two backslashes instead of one to appease R. So, for example:

brfss_data <- read_xpt("C:\\Users\\kevin\\data\\LLCP2023.XPT")

Step 4: Subsetting the Data

Take a look at the BRFSS data tibble by typing “brfss_data” into the Console command line. You’ll see that this dataset contains 433,323 rows (these are individual-level observations) and 350 columns (each column is a separate variable).

Now, we won’t need all 350 variables for our work in this Data Lab, so let’s select only the variables we’ll actually need. This will reduce the size of the dataset we’re working with and allow things to run faster.

brfss_smaller <- select(brfss_data, PRIMINS1, GENHLTH, `_AGE80`, SEXVAR, INCOME3, EDUCA)

This tells R that we want to create a dataset called “brfss_smaller” that includes ONLY the columns (variables) PRIMINS1, GENHLTH, _AGE80, SEXVAR, INCOME3, and EDUCA. You’ll see that the variable _AGE80 is enclosed in backticks “`”. This is because that column name starts with an underscore and R doesn’t like that.

Step 5: Cleaning the Data

Let’s go through the codebook and take a look at the values for each of these variables. Starting with the “PRIMINS1” variable, you’ll see that this is where we can identify a respondent’s primary source of health insurance coverage. After locating this variable in the codebook, let’s take a look at the values of the “PRIMINS1” variable in the data.

table(brfss_smaller$PRIMINS1, useNA = "ifany")

You should see that, like the codebook indicates, the values of the “PRIMINS1” variable in the data range from 1 to 10, include values of 77, 88, and 99, and that there are 5 missing values for this variable in the data. For our purposes today, we’ll focus on those with Medicaid coverage and those who are uninsured. A value of 5 in the PRIMINS1 column indicates that the respondent has Medicaid coverage and there appear to be 28,729 individuals in the data with Medicaid. A value of 88 indicates that the respondent has no insurance coverage of any type and there appear to be 22,703 individuals in the data who are uninsured.

Next, let’s take a look at the values of the “GENHLTH” variable. This variable is a measure of self-reported health where people rank their health on a scale ranging from “excellent” to “poor”. Also notice that there are some individuals who didn’t know how they’d rate their health (value = 7), some that refused to answer the question (value = 9), and a few who were not asked the question or having missing responses (value = BLANK or NA in R terms). We’ll want to get rid of these people before doing any comparisons of health status.

Next is “_AGE80”. This is simply a respondent’s age in years with all ages above 80 top coded at 80. This won’t matter much for us, but note that if you wanted to examine outcomes for specific ages above 80, you wouldn’t be able to do it in this data. As far as we’re concerned, when cleaning the data, we’ll want to drop those who are 65 and older because they will likely have Medicare (possibly in addition to Medicaid).

Next we have “SEXVAR”, which is an indicator of sex. You can see that this variable is coded as either a 1 or a 2 depending on a respondent’s sex. It can be tricky to remember which value corresponds to female and which to male, so we’ll create a new variable called “FEMALE” based on “SEXVAR” that will equal 1 if the respondent is female and will equal 0 if they are male. That’s much easier to remember.

Next is “INCOME3”. This reports annual household income in various groupings. Again, there are some people who respond that they don’t know their annual household income (value = 77), some people who refuse to respond to this question (value = 88), and some people for whom a response was not recorded (value = BLANK or NA).

Finally, we have “EDUCA”, which records the highest grade of school that a respondent completed. Here again we have some people who refused to respond (value = 9) and some with missing responses (value = BLANK or NA).

To clean the data, we’ll do the following:

Keep only those with Medicaid insurance coverage or those who are uninsured (you’ll see why shortly).
Keep only those who are under the age of 65.
Create a new variable called “FEMALE” based on “SEXVAR” that will equal 1 if the respondent is female and will equal 0 if they are male.
Drop any observations (i.e., rows) where a respondent has BLANK values for any of the variables in the data.

The following code will generate this clean dataset (don’t forget the backticks around the _AGE80 variable):

brfss_clean <- brfss_smaller %>%
  filter(PRIMINS1 %in% c(5, 88),
         `_AGE80` < 65) %>%
  mutate(FEMALE = ifelse(SEXVAR == 2, 1, 0)) %>%
  na.omit()

Here we’re creating a new dataset called brfss_clean that:

Starts with our brfss_smaller dataset: brfss_clean <- brfss_smaller
Keeps only those with Medicaid coverage or the uninsured and only those who are under age 65: filter(PRIMINS1 %in% c(5, 88), _AGE80 <65) where the c() function creates a vector or list of values.
Creates the new variable FEMALE: mutate(FEMALE = ifelse(SEXVAR == 2, 1, 0)) where this reads as “create a variable called FEMALE that equals 1 when SEXVAR is equal to 2 and equals 0 otherwise.”
Drops any rows (i.e., observations) that have missing values for any of the variables: na.omit()

One other thing to note about this code is that it uses the R pipe operator (%>%). This allows us to combine multiple commands in R. It’s not all that important that you understand how the pipe operator works in this Data Lab becuase you’ll be introduced to it again in an upcoming Problem Set.

Let’s make sure this new dataset brfss_clean looks the way it ought to. First, lets take a look at the values of the “PRIMINS1” variable and make sure they look the way they’re supposed to.

table(brfss_clean$PRIMINS1, useNA = "ifany")

You should only see two values 5 and 88 when you run this command. There should be 24,401 respondents with a PRIMINS1 value of 5 and 21,014 respondents with a PRIMINS1 value of 88.

Next, let’s check our age range.

summary(brfss_clean$`_AGE80`)

Here we’re using the summary command instead of the table command because summary provides a quick way to gauge a range of values. You can see that the minimum value for the “_AGE80” variable is 18 and the maximum value is 64, so all good there.

Next, let’s check the values of the “FEMALE” variable.

table(brfss_clean$FEMALE)

You should see that the data include 24,944 respondents whose sex is female and 20,471 whose sex is male.

So it looks like the data are cleaned properly. Let’s move on to generating some descriptive comparisons of the relationship between Medicaid coverage and health.

Step 6: Comparing Self-Reported Health between those with Medicaid Coverage and the Uninsured

Let’s start to get a sense of the unadjusted relationship between Medicaid coverage and health by comparing self-reported health status for those with Medicaid and those who are uninsured.

table(brfss_clean$GENHLTH, brfss_clean$PRIMINS1)

Remember that a value of 5 for “PRIMINS1” indicates that the respondent has Medicaid coverage and a value of 88 indicates they are uninsured. It’s also helpful to remember that self-reported health is ranked on a descending scale from 1 (excellent) to 5 (poor).

This table command shows us the counts of respondents in each category of self-reported health by insurance status. This actually isn’t all that helpful because it’s hard to make comparisons when we’re just looking at counts. Let’s convert these numbers to proportions instead.

prop.table(table(brfss_clean$GENHLTH, brfss_clean$PRIMINS1), margin = 2)

This looks a little messy. Let’s clean it up by converting these from proportions to percentages and only keeping two decimal points.

round(prop.table(table(brfss_clean$GENHLTH, brfss_clean$PRIMINS1), margin = 2) * 100, 2)

This is much better. Here we can see that 11.18% of Medicaid respondents report their health as excellent compared to 15.78% of the uninsured. Similarly, 9.81% of Medicaid respondents report their health as poor compared to only 4.69% of the uninsured. Hmmm. This seems a little strange. It appears that the uninsured may actually be in better health (at least according to this measure of self-reported health than those with Medicaid coverage).

Let’s try another comparison. Suppose that our self-reported health scale is ordinal in the sense that moving from 1 to 2 reflects the same change in health status as moving from 2 to 3 and 3 to 4 and so on. In that case, we could calculate the mean score of “GENHLTH” and compare those means across our two groups. We can do that as follows:

brfss_clean %>% 
  group_by(PRIMINS1) %>% 
  summarize(mean_health=mean(GENHLTH, na.rm=TRUE))

Here we’re using the group_by command along with the summarize command to calculate the means of “GENHLTH” by insurance coverage. We’re also adding the na.rm=TRUE option which will drop any missing values before calculating the means. This shouldn’t matter in our case because we’ve already dropped missing values from the dataset, but it’s good practice to get in the habit of using it anyway.

Ok, after running this code, you should see that the mean value of “GENHLTH” for those with Medicaid coverage is 3.02 and the mean value for those who are uninsured is 2.76. Since a higher score represents worse self-reported health, we’re still seeing that the uninsured tend to be healthier than those with Medicaid coverage!

But wait! We forgot that the variable “GENHLTH” contains values of 7 for people who don’t know their health status and 9 for people who refused to answer the question. We don’t want these values to count towards the means we’re calculating. Let’s fix that.

brfss_clean %>%
  group_by(PRIMINS1) %>%
  summarize(mean_health = mean(GENHLTH[!GENHLTH %in% c(7, 9)], na.rm = TRUE))

Here we’ve added the condition that “GENHLTH” values of 7 and 9 should not be counted when we’re calculating the mean of “GENHLTH” for each group. As you can see, this makes little difference to our mean values, but it’s important for accuracy.

Summary and Key Takeaways

In this Data Lab, we downloaded data from the 2023 BRFSS Survey and compared differences in self-reported health for those who had insurance coverage through the Medicaid program and those who were uninsured. We found that people who were covered by Medicaid tended to be in worse health than those with no insurance coverage at all.

While we might be tempted to draw conclusions about the value of Medicaid coverage from this analysis, we should be very hesitant to do so. As we’ll see in the upcoming class discussions and our future data work, there are likely to be very good explanations for this finding that don’t involve Medicaid coverage worsening health.

In our next Data Lab, we’ll work on creating additional descriptive statistics and refining our anlaytical dataset for future analyses. Once we’ve done that, we can start improving the rigor of the methods we’re using to compare the health of those with Medicaid coverage and those who are uninsured.

Be sure to save your R Script file before exiting RStudio!