Data Lab 3 - Importing and Describing Data

In this Data Lab, we’ll load the Family Connects New Orleans data that we’ll be using throughout the semester. By describing and summarizing the claims data, we’ll lay the foundation for more advanced analysis in future Data Labs.

Step 1: Create a New R Markdown File

See the instructions from Data Lab 2 to create a new R Markdown document. You can call this new project/folder “Family Connects” or whatever you’d like to call it (as long as you remember the name!). You should type all of the code for this Data Lab in your R Markdown file and save that file when you’re finished. That way, if you need to use that code again (and you will), you’ll have it saved and won’t have to retype everything

Step 2: Download the FCNO Data

Start by downloading a csv file of the FCNO data using this link FCNO Data.

Step 3: Importing the Data into R

Now that you’ve saved the .csv file, let’s load it into R using the read.csv command. The read.csv command is a standard R command, so we don’t need to load a specific library to use it. Once the data file is loaded, you should see the data frame in your Global Environment window.

Note that you will need to change the word PATH below to the actual path to the saved .csv file.

fcno_data = read.csv("PATH/fcno_data.csv", header = TRUE)

The header = TRUE part of this command is telling R that our data has column names in the first row. If you’re loading data without column names, then you would replace header = TRUE to header = FALSE.

Special note for Windows users: Windows uses backslashes (\) instead of forward slashes (/) for file paths, but R doesn’t like backslashes. If you’re a Windows user, you can try and replace the backslashes with forward slashes. In some cases that will work, but if it doesn’t, you can use two backslashes instead of one to appease R. So, for example:

fcno_data <- read.csv("C:\\Users\\kevin\\data\\fcno_data.csv", header=TRUE)

Step 5: Evaluating and Cleaning the Data

Now that you’ve loaded the data, let’s take a second to discuss what these data entail before doing any manipulation or data cleaning. Each row in this data is a medical claim. A medical claim is a request for payment submitted by a clinician to an insurer. So, each row in this data file represents a medical service that someone has received. More specifically, these are Medicaid claims, meaning that clinicians are submitting claims to the Medicaid program for treating patients with Medicaid coverage.

Let’s preview the data file using the head command in R:

head(fcno_data)

Note that the first column is labeled “patient_id”. That data in that column is just what it sounds like - a number that identifies an individual patient. Because most people receive multiple medical services over the course of their time in Medicaid, most patient_ids will be repeated several times in the data.

The second column is called “days_from_delivery”. Using actual service dates would divulge protected health information (and I’m not trying to get sued here). So, instead of using the actual service date, the “days_from_delivery” column tells us the number of days before or after a women gave birth that she received this service.

Let’s use the `range`` command to take a look at the range of values for the “days_from_delivery” variable:

range(fcno_data$days_from_delivery)

You should see that the values of “days_from_delivery” range from -280 to 365. This means that we’re observing medical services received up to 9 months (280 days) prior to delivery and up to 1 year post-delivery.

Next, we have four columns called “procedure_code”, “revenue_code”, “dx10_diag_code_1”, and “dx10_diag_code_2”. We’re not going to go into the data stored in these columns right now, but before moving on, try the following. Use the range command to view the values of the “revenue_code” column. You should see something like “NA NA”. This is because the “revenue_code” column contains missing values. When you view the range of a column with missing values, R is going to return “NA NA” unless you tell it to ignore rows with missing values. We can do that as follows:

range(fcno_data$revenue_code, na.rm = TRUE)

When you run this command, you should see that the revenue codes range from 1 to 999.

Moving on, we have a column called “allowed_amt”. This is the price that Medicaid actually paid for the service. Note that this is NOT the price that the provider charges for the service. Medicaid doesn’t pay charge prices (almost nobody does). Medicaid has a fee schedule of administratively established prices and pays those prices for health care services. Whenever you see the phrase “allowed amount”, you know that’s the amount the insurer actually paid and not the list price.

Next we have a column called “age”. Again, pretty straightforward. Use the range command to view the age values in the data.

Another sometimes useful R command is the table command. In addition to showing us the range of values for a variable, table also shows us how many observations are in each value. Try the following:

table(fcno_data$age)

You can see that most of the women giving birth in the data are between the ages of 22 and 31, with counts trailing off as we get to younger and older ages.

Finally, we have the “fcno” column. This is an indicator for whether the mother participated in the New Orleans Family Connects program. Use the table command here to see how many women participated and how many did not.

You should see that 1,617,900 observations have an FCNO value of 0 and 84,489 have a value of 1. But remember that each row in the data represents a claim and not an individual. So these numbers don’t really tell us the number of individual participants and non-participants. Geting that information, take a bit more work:

library(dplyr)

fcno_participant <- fcno_data %>%
  distinct(patient_id, fcno) %>%
  count(fcno) %>%
  print()

Note that we need to first load the dplyr library since the distinct() and count() functions and the pipe operator are all part of the dplyr package.

Next, we’re subsetting the data into a new data set called “fcno_participant” that includes only distinct patient IDs. In other words, we’re going from a data set that contains multiple observations per person to a data set that contains one observation per person. That way we’re not double counting people when caluclating the numbers of FCNO participants and non-participants.

Step 6: Comparing Spending for FCNO Participants and Non-Participants

We’re going to spend much more time throughout the semester comparing outcomes for those who participated in Family Connects New Orleans and those who didn’t, but first, let’s take a minute and look at prenatal spending for participants and non-participants.

Since each row is a claim level, we’re going to want to sum spending over all claims in the prenatal period for each patent. We can do that by using the filter and group_by commands as follows:

prenatal_spend <- fcno_data %>%
  filter(days_from_delivery < 0) %>%
  group_by(patient_id) %>%
  summarize(
    sum_spend = sum(allowed_amt, na.rm = TRUE),
    fcno = max(fcno, na.rm = TRUE)
  )

summary(prenatal_spend$sum_spend[prenatal_spend$fcno == 1])
summary(prenatal_spend$sum_spend[prenatal_spend$fcno == 0])

Question (type the answer to this question in your Markdown document)

Compare prenatal spending for FCNO participants and non-participants. How might this information be helpful if our goal was to evaluate the effect of FCNO participation on postnatal spending?

Summary and Key Takeaways

In this Data Lab, we loaded data on the medical service use of women who recently gave birth and either participated in the Family Connects Program or did not participate. We saw that we have data on the medical service use of these women up to 9 months prior to their delivery and for up to 1 year after their delivery.

In our next Data Lab, we’ll learn more about the Family Connects New Orleans Program and dig deeper into the data as we work toward developing a data analysis proposal.

Now upload your PDF document to Canvas using this link and you’re all done.