R Programming Cheat Sheet

You will need to install and load the following packages in RStudio before attempting some of the techniques in this tutorial.

library(tidyverse)
install.packages("nycflights13")

The following package(s) will be installed:
- nycflights13 [1.0.2]
These packages will be installed into "~/Dropbox/Documents/Teaching Materials/Health Policy/GitHub Site/hpam7660-sp24/renv/library/R-4.3/x86_64-apple-darwin20".

# Installing packages --------------------------------------------------------
- Installing nycflights13 ...                   OK [linked from cache]
Successfully installed 1 package in 17 milliseconds.

library(nycflights13)
install.packages("gapminder")

The following package(s) will be installed:
- gapminder [1.0.0]
These packages will be installed into "~/Dropbox/Documents/Teaching Materials/Health Policy/GitHub Site/hpam7660-sp24/renv/library/R-4.3/x86_64-apple-darwin20".

# Installing packages --------------------------------------------------------
- Installing gapminder ...                      OK [linked from cache]
Successfully installed 1 package in 12 milliseconds.

library(gapminder)

R Basics

Creating a vector

You can create a vector using the c function:

## Any R code that begins with the # character is a comment
## Comments are ignored by R

my_numbers <- c(4, 8, 15, 16, 23, 42) # Anything after # is also a
# comment
my_numbers

[1]  4  8 15 16 23 42
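
As a small sketch of what you can do with a vector once it exists, the code below reuses the same my_numbers values. The names first_number, first_three, doubled, and total are made up for this illustration.

```r
my_numbers <- c(4, 8, 15, 16, 23, 42)

# Square brackets pull out elements by position (R counts from 1)
first_number <- my_numbers[1]
first_three <- my_numbers[1:3]

# Arithmetic is applied to every element of the vector at once
doubled <- my_numbers * 2

# Summary functions collapse the vector to a single value
total <- sum(my_numbers)
```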

Installing and loading a package

You can install a package with the install.packages function, passing the name of the package to be installed as a string (that is, in quotes):

install.packages("ggplot2")

You can load a package into the R environment by calling library() with the name of the package without quotes. Each library() call should load only one package.

library(ggplot2)

Calling functions from specific packages

We can also use the mypackage:: prefix to call a function from a package without loading the package first:

knitr::kable(head(mtcars))

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225	105	2.76	3.460	20.22	1	0	3	1

Data Visualization

Scatter plot

You can produce a scatter plot by mapping variables to the x and y aesthetics and adding the geom_point() function.

ggplot(data = midwest,
       mapping = aes(x = popdensity,
                     y = percbelowpoverty)) +
  geom_point()

Smoothed curves

You can add a smoothed curve that summarizes the relationship between two variables with the geom_smooth() function. By default, for data sets with fewer than 1,000 observations, it uses a loess smoother to estimate the conditional mean of the y-axis variable as a function of the x-axis variable.

ggplot(data = midwest,
       mapping = aes(x = popdensity,
                     y = percbelowpoverty)) +
  geom_point() + geom_smooth()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Adding a regression line

geom_smooth() can also add a regression line if you set the argument method = "lm". You can turn off the shaded region around the line with se = FALSE.

ggplot(data = midwest,
       mapping = aes(x = popdensity,
                     y = percbelowpoverty)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)

`geom_smooth()` using formula = 'y ~ x'

Changing the scale of the axes

If we want the x-axis to use a log scale to stretch out the data, we can use scale_x_log10():

ggplot(data = midwest,
       mapping = aes(x = popdensity,
                     y = percbelowpoverty)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10()

`geom_smooth()` using formula = 'y ~ x'

Adding informative labels to a plot

Use labs() to add informative labels to the plot:

ggplot(data = midwest,
       mapping = aes(x = popdensity,
                     y = percbelowpoverty)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  scale_x_log10() +
  labs(x = "Population Density",
       y = "Percent of County Below Poverty Line",
       title = "Poverty and Population Density",
       subtitle = "Among Counties in the Midwest",
       caption = "Source: US Census, 2000")

`geom_smooth()` using formula = 'y ~ x'

Mapping aesthetics to variables

If you would like to map an aesthetic to a variable for all geoms in the plot, you can put it in the aes call in the ggplot() function:

ggplot(data = midwest,
       mapping = aes(x = popdensity,
                     y = percbelowpoverty,
                     color = state,
                     fill = state)) +
  geom_point() +
  geom_smooth() +
  scale_x_log10()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Mapping aesthetics for a single geom

You can also map aesthetics for a specific geom using the mapping argument to that function:

ggplot(data = midwest,
       mapping = aes(x = popdensity,
                     y = percbelowpoverty)) +
  geom_point(mapping = aes(color = state)) +
  geom_smooth(color = "black") +
  scale_x_log10()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Setting the aesthetics for all observations

If you would like to set the color or size or shape of a geom for all data points (that is, not mapped to any variables), be sure to set these outside of aes():

ggplot(data = midwest,
       mapping = aes(x = popdensity,
                     y = percbelowpoverty)) +
  geom_point(color = "purple") +
  geom_smooth() +
  scale_x_log10()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
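
The difference between setting and mapping an aesthetic is easy to get wrong, so here is a minimal sketch of both versions side by side. The object names p_set and p_mapped are made up for this illustration.

```r
library(ggplot2)

# Set: color is outside aes(), so every point is drawn in purple
p_set <- ggplot(data = midwest,
                mapping = aes(x = popdensity,
                              y = percbelowpoverty)) +
  geom_point(color = "purple")

# Mapped by mistake: inside aes(), "purple" is treated as a variable
# with a single value, so ggplot2 picks a color from its default
# palette and adds a legend instead of drawing purple points
p_mapped <- ggplot(data = midwest,
                   mapping = aes(x = popdensity,
                                 y = percbelowpoverty)) +
  geom_point(mapping = aes(color = "purple"))
```

Printing p_set and p_mapped one after the other shows the difference.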

Histograms

ggplot(data = midwest,
       mapping = aes(x = percbelowpoverty)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
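
The message above suggests choosing a bin width yourself. A minimal sketch, assuming bars that are 2 percentage points wide suit this variable (the value 2 is only an illustration, and p_hist is a made-up name):

```r
library(ggplot2)

# binwidth is measured in the units of the x variable,
# so each bar here covers 2 percentage points
p_hist <- ggplot(data = midwest,
                 mapping = aes(x = percbelowpoverty)) +
  geom_histogram(binwidth = 2)
```

Setting binwidth (or bins) explicitly also stops the message from appearing.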

Data Wrangling

Subsetting a data frame

Use the filter() function from the dplyr package to subset a data frame. In this example, you’ll use the nycflights13 data and filter by United Airlines flights.

library(nycflights13)

flights |> filter(carrier == "UA")

# A tibble: 58,665 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      554            558        -4      740            728
 4  2013     1     1      558            600        -2      924            917
 5  2013     1     1      558            600        -2      923            937
 6  2013     1     1      559            600        -1      854            902
 7  2013     1     1      607            607         0      858            915
 8  2013     1     1      611            600        11      945            931
 9  2013     1     1      623            627        -4      933            932
10  2013     1     1      628            630        -2     1016            947
# ℹ 58,655 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

You can filter based on multiple conditions to subset to the rows that meet all conditions:

flights |> filter(carrier == "UA", origin == "JFK")

# A tibble: 4,534 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      558            600        -2      924            917
 2  2013     1     1      611            600        11      945            931
 3  2013     1     1      803            800         3     1132           1144
 4  2013     1     1      829            830        -1     1152           1200
 5  2013     1     1     1112           1100        12     1440           1438
 6  2013     1     1     1127           1130        -3     1504           1448
 7  2013     1     1     1422           1425        -3     1748           1759
 8  2013     1     1     1522           1530        -8     1858           1855
 9  2013     1     1     1726           1729        -3     2042           2100
10  2013     1     1     1750           1750         0     2109           2115
# ℹ 4,524 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

You can use the | operator to match one of two conditions (“OR” rather than “AND”):

flights |> filter(carrier == "UA" | carrier == "AA")

# A tibble: 91,394 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      554            558        -4      740            728
 5  2013     1     1      558            600        -2      753            745
 6  2013     1     1      558            600        -2      924            917
 7  2013     1     1      558            600        -2      923            937
 8  2013     1     1      559            600        -1      941            910
 9  2013     1     1      559            600        -1      854            902
10  2013     1     1      606            610        -4      858            910
# ℹ 91,384 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

To test if a variable is one of several possible values, you can also use the %in% operator:

flights |> filter(carrier %in% c("UA", "AA"))

# A tibble: 91,394 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      554            558        -4      740            728
 5  2013     1     1      558            600        -2      753            745
 6  2013     1     1      558            600        -2      924            917
 7  2013     1     1      558            600        -2      923            937
 8  2013     1     1      559            600        -1      941            910
 9  2013     1     1      559            600        -1      854            902
10  2013     1     1      606            610        -4      858            910
# ℹ 91,384 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
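
To see how these conditions relate, here is a small sketch on a made-up table. toy_flights and its rows are invented for illustration and are not the nycflights13 data.

```r
library(dplyr)

# Made-up rows for illustration; not the nycflights13 data
toy_flights <- tibble(
  carrier = c("UA", "AA", "DL", "UA", "B6"),
  origin  = c("JFK", "JFK", "LGA", "EWR", "JFK")
)

# OR written out by hand
with_or <- toy_flights |> filter(carrier == "UA" | carrier == "AA")

# The same rows selected with %in%
with_in <- toy_flights |> filter(carrier %in% c("UA", "AA"))

# Negate with ! to keep every carrier except these two
others <- toy_flights |> filter(!(carrier %in% c("UA", "AA")))
```

The %in% form tends to be easier to read once the list of values grows past two.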

If you want to subset to a set of specific row numbers, you can use the slice function:

## subset to the first 5 rows
flights |> slice(1:5)

# A tibble: 5 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Here the 1:5 syntax tells R to produce a vector that starts at 1 and ends at 5, incrementing by 1:

1:5

[1] 1 2 3 4 5
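
A few related ways to build sequences in base R, as a short sketch. The object names are made up for this illustration.

```r
# seq_len(n) gives the same result as 1:n when n is at least 1
one_to_five <- seq_len(5)

# seq() lets you choose the step size
odd_numbers <- seq(1, 9, by = 2)

# The colon operator also counts down
countdown <- 5:1
```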

Filtering to the largest/smallest values of a variable

To subset to the rows that have the largest or smallest values of a given variable, use the slice_max and slice_min functions. For the largest values, use slice_max and use the n argument to specify how many rows you want. Rows that tie on the variable are all kept by default, so the result can have more than n rows:

flights |> slice_max(dep_time, n = 5)

# A tibble: 29 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    10    30     2400           2359         1      327            337
 2  2013    11    27     2400           2359         1      515            445
 3  2013    12     5     2400           2359         1      427            440
 4  2013    12     9     2400           2359         1      432            440
 5  2013    12     9     2400           2250        70       59           2356
 6  2013    12    13     2400           2359         1      432            440
 7  2013    12    19     2400           2359         1      434            440
 8  2013    12    29     2400           1700       420      302           2025
 9  2013     2     7     2400           2359         1      432            436
10  2013     2     7     2400           2359         1      443            444
# ℹ 19 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

To get lowest values, use slice_min:

flights |> slice_min(dep_time, n = 5)

# A tibble: 25 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1    13        1           2249        72      108           2357
 2  2013     1    31        1           2100       181      124           2225
 3  2013    11    13        1           2359         2      442            440
 4  2013    12    16        1           2359         2      447            437
 5  2013    12    20        1           2359         2      430            440
 6  2013    12    26        1           2359         2      437            440
 7  2013    12    30        1           2359         2      441            437
 8  2013     2    11        1           2100       181      111           2225
 9  2013     2    24        1           2245        76      121           2354
10  2013     3     8        1           2355         6      431            440
# ℹ 15 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
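
Both outputs above contain more than 5 rows because many flights share the same dep_time. If you want exactly n rows, one option is the with_ties argument. A minimal sketch on made-up data (toy_flights is invented for illustration):

```r
library(dplyr)

# Made-up rows for illustration; three flights tie at 2400
toy_flights <- tibble(
  flight   = c(101, 102, 103, 104),
  dep_time = c(2400, 2400, 2400, 1830)
)

# Default: all three tied rows are returned, even though n = 2
with_ties <- toy_flights |> slice_max(dep_time, n = 2)

# with_ties = FALSE returns exactly n rows
exactly_two <- toy_flights |> slice_max(dep_time, n = 2, with_ties = FALSE)
```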

Sorting rows by a variable

You can sort the rows of a data set using the arrange() function. By default, this will sort the rows from smallest to largest.

flights |> arrange(dep_time)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1    13        1           2249        72      108           2357
 2  2013     1    31        1           2100       181      124           2225
 3  2013    11    13        1           2359         2      442            440
 4  2013    12    16        1           2359         2      447            437
 5  2013    12    20        1           2359         2      430            440
 6  2013    12    26        1           2359         2      437            440
 7  2013    12    30        1           2359         2      441            437
 8  2013     2    11        1           2100       181      111           2225
 9  2013     2    24        1           2245        76      121           2354
10  2013     3     8        1           2355         6      431            440
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

If you would like to sort the rows from largest to smallest (descending order), you can wrap the variable name with desc():

flights |> arrange(desc(dep_time))

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    10    30     2400           2359         1      327            337
 2  2013    11    27     2400           2359         1      515            445
 3  2013    12     5     2400           2359         1      427            440
 4  2013    12     9     2400           2359         1      432            440
 5  2013    12     9     2400           2250        70       59           2356
 6  2013    12    13     2400           2359         1      432            440
 7  2013    12    19     2400           2359         1      434            440
 8  2013    12    29     2400           1700       420      302           2025
 9  2013     2     7     2400           2359         1      432            436
10  2013     2     7     2400           2359         1      443            444
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
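
arrange() also accepts several variables, using later ones to break ties in earlier ones. A short sketch on made-up data (toy_flights is invented for illustration):

```r
library(dplyr)

# Made-up rows for illustration; not the nycflights13 data
toy_flights <- tibble(
  carrier   = c("UA", "AA", "UA", "AA"),
  dep_delay = c(5, 30, 12, -2)
)

# Sort by carrier (A to Z), then by longest delay first within each carrier
sorted <- toy_flights |> arrange(carrier, desc(dep_delay))
```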

Selecting/subsetting the columns

You can subset the data to only certain columns using the select() command:

flights |> select(dep_time, arr_time, dest)

# A tibble: 336,776 × 3
   dep_time arr_time dest 
      <int>    <int> <chr>
 1      517      830 IAH  
 2      533      850 IAH  
 3      542      923 MIA  
 4      544     1004 BQN  
 5      554      812 ATL  
 6      554      740 ORD  
 7      555      913 FLL  
 8      557      709 IAD  
 9      557      838 MCO  
10      558      753 ORD  
# ℹ 336,766 more rows

If you want to select a range of columns from, say, dep_time to arr_delay, you can use the : operator:

flights |> select(dep_time:arr_delay)

# A tibble: 336,776 × 6
   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
      <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1      517            515         2      830            819        11
 2      533            529         4      850            830        20
 3      542            540         2      923            850        33
 4      544            545        -1     1004           1022       -18
 5      554            600        -6      812            837       -25
 6      554            558        -4      740            728        12
 7      555            600        -5      913            854        19
 8      557            600        -3      709            723       -14
 9      557            600        -3      838            846        -8
10      558            600        -2      753            745         8
# ℹ 336,766 more rows

You can remove a variable from the data set by using the minus sign - in front of it:

flights |> select(-year)

# A tibble: 336,776 × 18
   month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1     1     1      517            515         2      830            819
 2     1     1      533            529         4      850            830
 3     1     1      542            540         2      923            850
 4     1     1      544            545        -1     1004           1022
 5     1     1      554            600        -6      812            837
 6     1     1      554            558        -4      740            728
 7     1     1      555            600        -5      913            854
 8     1     1      557            600        -3      709            723
 9     1     1      557            600        -3      838            846
10     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

You can also drop several variables using the c() function or the (a:b) syntax:

flights |> select(-c(year, month, day))

# A tibble: 336,776 × 16
   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
 1      517            515         2      830            819        11 UA     
 2      533            529         4      850            830        20 UA     
 3      542            540         2      923            850        33 AA     
 4      544            545        -1     1004           1022       -18 B6     
 5      554            600        -6      812            837       -25 DL     
 6      554            558        -4      740            728        12 UA     
 7      555            600        -5      913            854        19 B6     
 8      557            600        -3      709            723       -14 EV     
 9      557            600        -3      838            846        -8 B6     
10      558            600        -2      753            745         8 AA     
# ℹ 336,766 more rows
# ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

flights |> select(-(year:day))

# A tibble: 336,776 × 16
   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
 1      517            515         2      830            819        11 UA     
 2      533            529         4      850            830        20 UA     
 3      542            540         2      923            850        33 AA     
 4      544            545        -1     1004           1022       -18 B6     
 5      554            600        -6      812            837       -25 DL     
 6      554            558        -4      740            728        12 UA     
 7      555            600        -5      913            854        19 B6     
 8      557            600        -3      709            723       -14 EV     
 9      557            600        -3      838            846        -8 B6     
10      558            600        -2      753            745         8 AA     
# ℹ 336,766 more rows
# ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

You can also select columns based on matching patterns in the names with functions like starts_with() or ends_with():

flights |> select(ends_with("delay"))

# A tibble: 336,776 × 2
   dep_delay arr_delay
       <dbl>     <dbl>
 1         2        11
 2         4        20
 3         2        33
 4        -1       -18
 5        -6       -25
 6        -4        12
 7        -5        19
 8        -3       -14
 9        -3        -8
10        -2         8
# ℹ 336,766 more rows

This code finds all variables with column names that end with the string “delay”. See the help page for select() for more information on different ways to select.
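
A few more selection helpers, sketched on a one-row made-up table so the results are easy to check. toy_flights is invented for illustration.

```r
library(dplyr)

# Made-up single row with a few of the flights column names
toy_flights <- tibble(
  dep_time = 517, dep_delay = 2, arr_time = 830, arr_delay = 11, dest = "IAH"
)

# Columns whose names start with "arr"
arrivals <- toy_flights |> select(starts_with("arr"))

# Columns whose names contain "time" anywhere
times <- toy_flights |> select(contains("time"))

# Helpers can be combined with plain column names
dep_or_dest <- toy_flights |> select(starts_with("dep"), dest)
```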

Renaming a variable

You can rename a variable using the function rename(new_name = old_name):

flights |> rename(flight_number = flight)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight_number <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Creating new variables

You can create new variables that are functions of old variables using the mutate() function:

flights |>
  mutate(flight_length = arr_time - dep_time) |>
  select(arr_time, dep_time, flight_length)

# A tibble: 336,776 × 3
   arr_time dep_time flight_length
      <int>    <int>         <int>
 1      830      517           313
 2      850      533           317
 3      923      542           381
 4     1004      544           460
 5      812      554           258
 6      740      554           186
 7      913      555           358
 8      709      557           152
 9      838      557           281
10      753      558           195
# ℹ 336,766 more rows
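
One caution about the example above: dep_time and arr_time are stored as clock times in HHMM form (517 means 5:17), so their difference is not a number of minutes. A rough sketch of one way to convert to minutes since midnight, using integer division (%/%) and the remainder (%%). The toy_flights table and the new column names are made up for illustration, and the sketch ignores time zones and flights that land after midnight.

```r
library(dplyr)

# Made-up rows using the first two time pairs shown above
toy_flights <- tibble(dep_time = c(517, 533), arr_time = c(830, 850))

toy_flights <- toy_flights |>
  mutate(
    # hours * 60 + minutes
    dep_minutes = (dep_time %/% 100) * 60 + dep_time %% 100,
    arr_minutes = (arr_time %/% 100) * 60 + arr_time %% 100,
    elapsed_minutes = arr_minutes - dep_minutes
  )
```

For the first row this gives 193 minutes rather than 313.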

Creating new variables based on yes/no conditions

If you want to create a new variable that can take on two values based on a logical condition, you should use the if_else() function inside of mutate(). For instance, if we want to label each flight as delayed or on time based on the arr_delay variable, we could do:

flights |>
  mutate(late = if_else(arr_delay > 0,
                        "Flight Delayed",
                        "Flight On Time")) |>
  select(arr_delay, late)

# A tibble: 336,776 × 2
   arr_delay late          
       <dbl> <chr>         
 1        11 Flight Delayed
 2        20 Flight Delayed
 3        33 Flight Delayed
 4       -18 Flight On Time
 5       -25 Flight On Time
 6        12 Flight Delayed
 7        19 Flight Delayed
 8       -14 Flight On Time
 9        -8 Flight On Time
10         8 Flight Delayed
# ℹ 336,766 more rows
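
When you need more than two categories, dplyr's case_when() is a common alternative to nesting if_else() calls. A minimal sketch on made-up data; the 15-minute cutoff, the labels, and the toy_flights table are assumptions chosen for illustration.

```r
library(dplyr)

# Made-up delays, including one missing value
toy_flights <- tibble(arr_delay = c(45, 8, -12, 0, NA))

labeled <- toy_flights |>
  mutate(status = case_when(
    arr_delay > 15 ~ "Delayed",
    arr_delay > 0  ~ "Slightly delayed",
    arr_delay <= 0 ~ "On time"
    # rows where arr_delay is NA match no condition and stay NA
  ))
```

Conditions are checked in order, so each row gets the label of the first condition it satisfies.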

Summarizing a variable

You can calculate summaries of variables in the data set using the summarize() function.

flights |>
  summarize(
    avg_dep_time = mean(dep_time),
    sd_dep_time = sd(dep_time),
    median_dep_time = median(dep_time)
  )

# A tibble: 1 × 3
  avg_dep_time sd_dep_time median_dep_time
         <dbl>       <dbl>           <int>
1           NA          NA              NA
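
The NA results above appear because dep_time contains missing values, and mean(), sd(), and median() return NA when any input is missing. A minimal sketch of the usual fix, na.rm = TRUE, on made-up data (toy_flights is invented for illustration):

```r
library(dplyr)

# Made-up departure times with one missing value
toy_flights <- tibble(dep_time = c(517, 533, NA, 544))

# Without na.rm, a single NA makes the whole summary NA
with_na <- toy_flights |>
  summarize(avg_dep_time = mean(dep_time))

# na.rm = TRUE drops missing values before computing the summary
without_na <- toy_flights |>
  summarize(
    avg_dep_time = mean(dep_time, na.rm = TRUE),
    n_missing = sum(is.na(dep_time))
  )
```

Reporting the number of missing values alongside the summary makes it clear how many rows were left out.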

Summarizing variables by groups of rows

By default, summarize() calculates the summaries of variables for all rows in the data frame. You can also calculate these summaries within groups of rows defined by another variable in the data frame using the group_by() function before summarizing.

flights |>
  group_by(carrier) |>
  summarize(
    avg_dep_time = mean(dep_time),
    sd_dep_time = sd(dep_time),
    median_dep_time = median(dep_time)
  )

# A tibble: 16 × 4
   carrier avg_dep_time sd_dep_time median_dep_time
   <chr>          <dbl>       <dbl>           <dbl>
 1 9E               NA         NA                NA
 2 AA               NA         NA                NA
 3 AS               NA         NA                NA
 4 B6               NA         NA                NA
 5 DL               NA         NA                NA
 6 EV               NA         NA                NA
 7 F9               NA         NA                NA
 8 FL               NA         NA                NA
 9 HA              949.        53.6             954
10 MQ               NA         NA                NA
11 OO               NA         NA                NA
12 UA               NA         NA                NA
13 US               NA         NA                NA
14 VX               NA         NA                NA
15 WN               NA         NA                NA
16 YV               NA         NA                NA

Here, group_by() splits the original data into one group of rows per carrier, and summarize() applies the summary functions within each group and then combines the results into one tibble.

Summarizing by multiple variables

You can group by multiple variables and summarize() will create groups based on every combination of each variable:

flights |>
  group_by(carrier, month) |>
  summarize(
    avg_dep_time = mean(dep_time)
  )

`summarise()` has grouped output by 'carrier'. You can override using the
`.groups` argument.

# A tibble: 185 × 3
# Groups:   carrier [16]
   carrier month avg_dep_time
   <chr>   <int>        <dbl>
 1 9E          1           NA
 2 9E          2           NA
 3 9E          3           NA
 4 9E          4           NA
 5 9E          5           NA
 6 9E          6           NA
 7 9E          7           NA
 8 9E          8           NA
 9 9E          9           NA
10 9E         10           NA
# ℹ 175 more rows

You’ll notice that summarize() prints a message to let us know that the resulting tibble is still grouped by carrier. By default, summarize() drops the last grouping variable you provided in group_by() (month in this case). This isn’t an error message; it’s just letting us know some helpful information. If you want to silence the message, state the grouping you want for the result with the .groups argument:

flights |>
  group_by(carrier, month) |>
  summarize(
    avg_dep_time = mean(dep_time),
    .groups = "drop_last"
  )

# A tibble: 185 × 3
# Groups:   carrier [16]
   carrier month avg_dep_time
   <chr>   <int>        <dbl>
 1 9E          1           NA
 2 9E          2           NA
 3 9E          3           NA
 4 9E          4           NA
 5 9E          5           NA
 6 9E          6           NA
 7 9E          7           NA
 8 9E          8           NA
 9 9E          9           NA
10 9E         10           NA
# ℹ 175 more rows
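
If you would rather get back a tibble with no grouping at all, .groups = "drop" is another accepted value. A short sketch on made-up data (toy_flights is invented for illustration):

```r
library(dplyr)

# Made-up rows for illustration; not the nycflights13 data
toy_flights <- tibble(
  carrier  = c("UA", "UA", "AA", "AA"),
  month    = c(1, 2, 1, 1),
  dep_time = c(517, 533, 542, 558)
)

# "drop" removes every grouping variable from the result
ungrouped <- toy_flights |>
  group_by(carrier, month) |>
  summarize(avg_dep_time = mean(dep_time), .groups = "drop")
```

An ungrouped result avoids surprises when later steps such as mutate() or slice() would otherwise run within each carrier.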

Summarizing across many variables

If you want to apply the same summary to multiple variables, you can use the across(vars, fun) function, where vars is a vector of variable names (specified like with select()) and fun is a summary function to apply to those variables.

flights |>
  group_by(carrier, month) |>
  summarize(
    across(c(dep_time, dep_delay), mean)
  )

`summarise()` has grouped output by 'carrier'. You can override using the
`.groups` argument.

# A tibble: 185 × 4
# Groups:   carrier [16]
   carrier month dep_time dep_delay
   <chr>   <int>    <dbl>     <dbl>
 1 9E          1       NA        NA
 2 9E          2       NA        NA
 3 9E          3       NA        NA
 4 9E          4       NA        NA
 5 9E          5       NA        NA
 6 9E          6       NA        NA
 7 9E          7       NA        NA
 8 9E          8       NA        NA
 9 9E          9       NA        NA
10 9E         10       NA        NA
# ℹ 175 more rows

As with select(), you can use the : operator to select a range of variables:

flights |>
  group_by(carrier, month) |>
  summarize(
    across(dep_time:arr_delay, mean)
  )

`summarise()` has grouped output by 'carrier'. You can override using the
`.groups` argument.

# A tibble: 185 × 8
# Groups:   carrier [16]
   carrier month dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <chr>   <int>    <dbl>          <dbl>     <dbl>    <dbl>          <dbl>
 1 9E          1       NA          1485.        NA       NA          1676.
 2 9E          2       NA          1471.        NA       NA          1661.
 3 9E          3       NA          1472.        NA       NA          1664.
 4 9E          4       NA          1502.        NA       NA          1697.
 5 9E          5       NA          1509.        NA       NA          1712.
 6 9E          6       NA          1512.        NA       NA          1718.
 7 9E          7       NA          1493.        NA       NA          1702.
 8 9E          8       NA          1497.        NA       NA          1706.
 9 9E          9       NA          1458.        NA       NA          1658.
10 9E         10       NA          1432.        NA       NA          1632.
# ℹ 175 more rows
# ℹ 1 more variable: arr_delay <dbl>

Table of counts of a categorical variable

There are two ways to produce a table of counts of each category of a variable. The first is to use group_by() and summarize() along with the summary function n(), which returns the number of rows in each group (that is, each combination of the grouping variables):

flights |>
  group_by(carrier) |>
  summarize(n = n())

# A tibble: 16 × 2
   carrier     n
   <chr>   <int>
 1 9E      18460
 2 AA      32729
 3 AS        714
 4 B6      54635
 5 DL      48110
 6 EV      54173
 7 F9        685
 8 FL       3260
 9 HA        342
10 MQ      26397
11 OO         32
12 UA      58665
13 US      20536
14 VX       5162
15 WN      12275
16 YV        601

A simpler way to achieve the same outcome is to use the count() function, which combines these two steps:

flights |> count(carrier)

# A tibble: 16 × 2
   carrier     n
   <chr>   <int>
 1 9E      18460
 2 AA      32729
 3 AS        714
 4 B6      54635
 5 DL      48110
 6 EV      54173
 7 F9        685
 8 FL       3260
 9 HA        342
10 MQ      26397
11 OO         32
12 UA      58665
13 US      20536
14 VX       5162
15 WN      12275
16 YV        601
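
If you also want the largest groups first, count() has a sort argument. The sketch below uses a small made-up vector of carrier codes rather than the flights data so that the result is easy to check by hand.

```r
library(dplyr)

# Toy data (hypothetical values): six flights across three carriers
toy <- tibble(carrier = c("UA", "AA", "UA", "DL", "UA", "AA"))

# sort = TRUE orders the result from the largest count to the smallest
by_size <- toy |> count(carrier, sort = TRUE)

by_size
```

The rows come back as UA (3), AA (2), and DL (1). Without sort = TRUE they would follow the order of the carrier values instead.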

Producing nicely formatted tables with `kable()`

You can take any tibble in R and convert it into a more readable table by passing it to knitr::kable(). In our homework, we will generally save the tibble as an object and then pass it to this function.

month_summary <- flights |>
  group_by(month) |>
  summarize(
    avg_arr_delay = mean(arr_delay),
    sd_arr_delay = sd(arr_delay)
  )

knitr::kable(month_summary)

month	avg_arr_delay	sd_arr_delay
1	NA	NA
2	NA	NA
3	NA	NA
4	NA	NA
5	NA	NA
6	NA	NA
7	NA	NA
8	NA	NA
9	NA	NA
10	NA	NA
11	NA	NA
12	NA	NA

You can add informative column names to the table using the col.names argument.

knitr::kable(
  month_summary,
  col.names = c("Month", "Average Delay", "SD of Delay")
)

Month	Average Delay	SD of Delay
1	NA	NA
2	NA	NA
3	NA	NA
4	NA	NA
5	NA	NA
6	NA	NA
7	NA	NA
8	NA	NA
9	NA	NA
10	NA	NA
11	NA	NA
12	NA	NA

Finally, we can round the numbers in the table using the digits argument. This tells kable() how many decimal places to round numeric columns to.

knitr::kable(
  month_summary,
  col.names = c("Month", "Average Delay", "SD of Delay"),
  digits = 3
)

Month	Average Delay	SD of Delay
1	NA	NA
2	NA	NA
3	NA	NA
4	NA	NA
5	NA	NA
6	NA	NA
7	NA	NA
8	NA	NA
9	NA	NA
10	NA	NA
11	NA	NA
12	NA	NA
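
Because every value in month_summary is NA, the effect of digits is not visible above. As a small sketch on a made-up number (not from the flights data), the following shows what digits does. To my understanding, kable() hands digits to round(), so it controls decimal places.

```r
library(knitr)

# Hypothetical value chosen so that decimal places and significant
# digits would give visibly different results
toy <- data.frame(avg_delay = 1234.5678)

# digits = 2 rounds to two decimal places, so the cell shows 1234.57
out <- kable(toy, digits = 2)

out
```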

Barplots of counts

You can visualize counts of a variable using a barplot:

flights |>
  ggplot(mapping = aes(x = carrier)) +
  geom_bar()

Barplots of other summaries

We can use barplots to visualize other grouped summaries like means, but we need to use geom_col() instead of geom_bar() and specify the variable that sets the height of the bars. Here we also filter the data so that only flights with arr_delay greater than zero are considered (this filter also drops rows where arr_delay is missing).

flights |>
  filter(arr_delay > 0) |>
  group_by(carrier) |>
  summarize(
    avg_delay = mean(arr_delay)
  ) |>
  ggplot(mapping = aes(x = carrier, y = avg_delay)) +
  geom_col()
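
One way to see the relationship between the two geoms: geom_bar() counts rows for you, while geom_col() plots a height you have already computed. The sketch below, on a made-up tibble, builds the same chart both ways and uses ggplot2's layer_data() to compare the bar heights.

```r
library(dplyr)
library(ggplot2)

# Toy data (hypothetical values): six flights across three carriers
toy <- tibble(carrier = c("UA", "AA", "UA", "DL", "UA", "AA"))

# geom_bar() computes the counts itself
p_bar <- ggplot(toy, aes(x = carrier)) +
  geom_bar()

# geom_col() needs the counts supplied as the y aesthetic
p_col <- toy |>
  count(carrier) |>
  ggplot(aes(x = carrier, y = n)) +
  geom_col()

# Bar heights in x-axis order (AA, DL, UA) for each plot
heights_bar <- as.numeric(layer_data(p_bar)$y)
heights_col <- as.numeric(layer_data(p_col)$y)
```

Both plots have bars of height 2, 1, and 3.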

Reordering/sorting barplot axes

Often we want to sort the bars by the variable of interest so we can quickly compare them. We can use the fct_reorder(group_var, ordering_var) function to do this, where group_var is the grouping variable shown on the axis and ordering_var is the variable used to sort the groups.

flights |>
  filter(arr_delay > 0) |>
  group_by(carrier) |>
  summarize(
    avg_delay = mean(arr_delay)
  ) |>
  ggplot(mapping = aes(x = fct_reorder(carrier, avg_delay),
                       y = avg_delay)) +
  geom_col()
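
To see what fct_reorder() does on its own, outside of a plot, here is a minimal sketch with made-up carrier codes and delays. It returns a factor whose levels are ordered by the second argument, and ggplot uses that level order for the axis.

```r
library(forcats)

# Hypothetical values for illustration
carrier   <- c("AA", "DL", "UA")
avg_delay <- c(30, 10, 20)

# Levels sorted from smallest to largest avg_delay
ascending <- levels(fct_reorder(carrier, avg_delay))

# .desc = TRUE reverses the order, largest first
descending <- levels(fct_reorder(carrier, avg_delay, .desc = TRUE))

ascending   # "DL" "UA" "AA"
descending  # "AA" "UA" "DL"
```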

Coloring barplots by another variable

You can color the bars by another variable using the fill aesthetic:

flights |>
  filter(arr_delay > 0) |>
  group_by(carrier) |>
  summarize(
    avg_delay = mean(arr_delay)
  ) |>
  slice_max(avg_delay, n = 10) |>
  ggplot(mapping = aes(y = fct_reorder(carrier, avg_delay),
                       x = avg_delay)) +
  geom_col(mapping = aes(fill = carrier))

Creating logical vectors

You can create logical variables in your tibbles using mutate:

flights |>
  mutate(
    late = arr_delay > 0,
    fall = month == 9 | month == 10 | month == 11,
    .keep = "used"
  )

# A tibble: 336,776 × 4
   month arr_delay late  fall 
   <int>     <dbl> <lgl> <lgl>
 1     1        11 TRUE  FALSE
 2     1        20 TRUE  FALSE
 3     1        33 TRUE  FALSE
 4     1       -18 FALSE FALSE
 5     1       -25 FALSE FALSE
 6     1        12 TRUE  FALSE
 7     1        19 TRUE  FALSE
 8     1       -14 FALSE FALSE
 9     1        -8 FALSE FALSE
10     1         8 TRUE  FALSE
# ℹ 336,766 more rows

The .keep = "used" argument here tells mutate to only return the variables created and any variables used to create them. We’re using it here for display purposes.
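
Two details are worth a quick sketch on made-up vectors (base R only, no packages needed). First, a chain of == comparisons joined by | can be written more compactly with %in%. Second, a comparison involving a missing value gives NA rather than TRUE or FALSE.

```r
# Hypothetical values; NA marks a missing arrival delay
month     <- c(1, 9, 10, 12)
arr_delay <- c(5, -3, NA, 40)

# Two ways to flag fall months; they agree when month has no NAs
fall_or <- month == 9 | month == 10 | month == 11
fall_in <- month %in% c(9, 10, 11)
same_result <- identical(fall_or, fall_in)  # TRUE

# A comparison with NA yields NA
late <- arr_delay > 0
late  # TRUE FALSE NA TRUE
```

One difference to keep in mind: if month itself were missing, == would give NA while %in% would give FALSE.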

You can filter based on these logical variables. In particular, if we want to subset to rows where both late and fall were TRUE we could do the following filter:

flights |>
  mutate(
    late = arr_delay > 0,
    fall = month == 9 | month == 10 | month == 11,
    .keep = "used"
  ) |>
  filter(late & fall)

# A tibble: 26,307 × 4
   month arr_delay late  fall 
   <int>     <dbl> <lgl> <lgl>
 1    10        11 TRUE  TRUE 
 2    10        12 TRUE  TRUE 
 3    10         4 TRUE  TRUE 
 4    10        16 TRUE  TRUE 
 5    10         7 TRUE  TRUE 
 6    10         4 TRUE  TRUE 
 7    10         6 TRUE  TRUE 
 8    10         1 TRUE  TRUE 
 9    10         9 TRUE  TRUE 
10    10        83 TRUE  TRUE 
# ℹ 26,297 more rows

Using `!` to negate logicals

Any time you place an exclamation point in front of a logical, it turns any TRUE into a FALSE and vice versa. For instance, if we wanted on-time flights in the fall, we could use:

flights |>
  mutate(
    late = arr_delay > 0,
    fall = month == 9 | month == 10 | month == 11,
    .keep = "used"
  ) |>
  filter(!late & fall)

# A tibble: 56,292 × 4
   month arr_delay late  fall 
   <int>     <dbl> <lgl> <lgl>
 1    10       -34 FALSE TRUE 
 2    10       -22 FALSE TRUE 
 3    10       -46 FALSE TRUE 
 4    10       -26 FALSE TRUE 
 5    10       -16 FALSE TRUE 
 6    10       -20 FALSE TRUE 
 7    10       -23 FALSE TRUE 
 8    10       -12 FALSE TRUE 
 9    10       -10 FALSE TRUE 
10    10        -3 FALSE TRUE 
# ℹ 56,282 more rows

Or, if we wanted to subset to every combination except late flights in the fall, we could negate the AND statement using parentheses:

flights |>
  mutate(
    late = arr_delay > 0,
    fall = month == 9 | month == 10 | month == 11,
    .keep = "used"
  ) |>
  filter(!(late & fall))

# A tibble: 309,337 × 4
   month arr_delay late  fall 
   <int>     <dbl> <lgl> <lgl>
 1     1        11 TRUE  FALSE
 2     1        20 TRUE  FALSE
 3     1        33 TRUE  FALSE
 4     1       -18 FALSE FALSE
 5     1       -25 FALSE FALSE
 6     1        12 TRUE  FALSE
 7     1        19 TRUE  FALSE
 8     1       -14 FALSE FALSE
 9     1        -8 FALSE FALSE
10     1         8 TRUE  FALSE
# ℹ 309,327 more rows
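
The parentheses matter because ! binds more tightly than &. The sketch below, on made-up logical vectors covering all four combinations, compares the two forms and shows that negating an AND is the same as an OR of the negations.

```r
# All four combinations of late and fall (hypothetical values)
late <- c(TRUE, TRUE, FALSE, FALSE)
fall <- c(TRUE, FALSE, TRUE, FALSE)

# Negating the whole AND: everything except late fall flights
negated_and <- !(late & fall)    # FALSE TRUE TRUE TRUE

# Equivalent form: not late OR not fall
either_negated <- !late | !fall  # FALSE TRUE TRUE TRUE

# Without parentheses, only late is negated
without_parens <- !late & fall   # FALSE FALSE TRUE FALSE
```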

Negation is often used in combination with %in% to achieve a “not in” logical:

flights |>
  filter(!(carrier %in% c("AA", "UA")))

# A tibble: 245,382 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      544            545        -1     1004           1022
 2  2013     1     1      554            600        -6      812            837
 3  2013     1     1      555            600        -5      913            854
 4  2013     1     1      557            600        -3      709            723
 5  2013     1     1      557            600        -3      838            846
 6  2013     1     1      558            600        -2      849            851
 7  2013     1     1      558            600        -2      853            856
 8  2013     1     1      559            559         0      702            706
 9  2013     1     1      600            600         0      851            858
10  2013     1     1      600            600         0      837            825
# ℹ 245,372 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
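
The same "not in" pattern works on a plain vector, which makes it easy to check by hand. A minimal sketch with made-up carrier codes:

```r
# Hypothetical carrier codes
carrier <- c("AA", "B6", "UA", "DL")

# TRUE for every carrier that is NOT in the excluded set
keep <- !(carrier %in% c("AA", "UA"))

remaining <- carrier[keep]
remaining  # "B6" "DL"
```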

Grouped summaries with `any()` and `all()`

Once you group a tibble, you can summarize logicals within groups using two functions. any() returns TRUE if a logical is TRUE for any row in a group and FALSE otherwise. all() returns TRUE when the logical inside it is TRUE for all rows in a group and FALSE otherwise.

flights |>
  group_by(carrier) |>
  summarize(
    any_late = any(arr_delay > 0),
    never_late = all(arr_delay <= 0)
  )

# A tibble: 16 × 3
   carrier any_late never_late
   <chr>   <lgl>    <lgl>     
 1 9E      TRUE     FALSE     
 2 AA      TRUE     FALSE     
 3 AS      TRUE     FALSE     
 4 B6      TRUE     FALSE     
 5 DL      TRUE     FALSE     
 6 EV      TRUE     FALSE     
 7 F9      TRUE     FALSE     
 8 FL      TRUE     FALSE     
 9 HA      TRUE     FALSE     
10 MQ      TRUE     FALSE     
11 OO      TRUE     FALSE     
12 UA      TRUE     FALSE     
13 US      TRUE     FALSE     
14 VX      TRUE     FALSE     
15 WN      TRUE     FALSE     
16 YV      TRUE     FALSE
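
Since arr_delay contains missing values, it helps to know how any() and all() treat NA. The sketch below uses made-up logical vectors (base R only). The result is NA only when the missing value could change the answer, and na.rm = TRUE drops missing values first.

```r
# One TRUE is enough for any(), whatever the NA might have been
any_with_true  <- any(c(NA, TRUE))   # TRUE
# With no TRUE present, the NA leaves the answer unknown
any_with_false <- any(c(NA, FALSE))  # NA

# One FALSE is enough to make all() FALSE
all_with_false <- all(c(NA, FALSE))  # FALSE
# With no FALSE present, the NA leaves the answer unknown
all_with_true  <- all(c(NA, TRUE))   # NA

# na.rm = TRUE removes missing values before evaluating
any_na_rm <- any(c(NA, FALSE), na.rm = TRUE)  # FALSE
all_na_rm <- all(c(NA, TRUE), na.rm = TRUE)   # TRUE
```

This is consistent with the table above: each carrier has at least one late flight and at least one on-time flight, so neither column is NA.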