This lesson is being piloted (Beta version)

Exploring with summary statistics

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • How do I get means, medians, IQRs and other summary statistics on my data?

  • How do I get summary statistics for different groups in my data?

Objectives
  • Learn the summarise and group_by functions for efficient summary statistics.

Summary statistics

What is the average delay on the departure of these flights?

Most of the functions we use for these operations comes from the library dplyr, which is part of the tidyverse package.

The function to get summary statistics (average/mean, median, standard deviation etc) from dataframes is summarise. It works like this, remember to add na.rm = T to handle missing values:

flightdata %>% 
  summarise(avg_dep_delay = mean(dep_delay, na.rm = T),
            med_dep_delay = median(dep_delay, na.rm = T),
            sd_dep_delay = sd(dep_delay, na.rm = T),
            max_dep_delay = max(dep_delay, na.rm = T),
            min_dep_delay = min(dep_delay, na.rm  =T),
            iqr = IQR(dep_delay, na.rm = T))
# A tibble: 1 × 6
  avg_dep_delay med_dep_delay sd_dep_delay max_dep_delay min_dep_delay   iqr
          <dbl>         <dbl>        <dbl>         <dbl>         <dbl> <dbl>
1          12.6            -2         40.2          1301           -43    16

IQR

Lining up all values of departure delay in order from the lowest to the highest, we can split up the observations in quartiles, each containing 25% of the observations. The Inter Quartile Range, tells us the range in which the middle 50% of the observations are. It is a measure of the spread of the observations around the median.

Which airline is most delayed?

Adding group_by() we can get the summary statistics for all airlines:

flightdata %>% 
  group_by(carrier) %>% 
  summarise(avg_dep_delay = mean(dep_delay, na.rm = T),
            med_dep_delay = median(dep_delay, na.rm = T),
            sd_dep_delay = sd(dep_delay, na.rm = T),
            max_dep_delay = max(dep_delay, na.rm = T),
            min_dep_delay = min(dep_delay, na.rm  =T),
            iqr = IQR(dep_delay, na.rm = T))
# A tibble: 16 × 7
   carrier avg_dep_delay med_dep_delay sd_dep_delay max_dep_delay min_dep_delay
   <chr>           <dbl>         <dbl>        <dbl>         <dbl>         <dbl>
 1 9E              16.7           -2           45.9           747           -24
 2 AA               8.59          -3           37.4          1014           -24
 3 AS               5.80          -3           31.4           225           -21
 4 B6              13.0           -1           38.5           502           -43
 5 DL               9.26          -2           39.7           960           -33
 6 EV              20.0           -1           46.6           548           -32
 7 F9              20.2            0.5         58.4           853           -27
 8 FL              18.7            1           52.7           602           -22
 9 HA               4.90          -4           74.1          1301           -16
10 MQ              10.6           -3           39.2          1137           -26
11 OO              12.6           -6           43.1           154           -14
12 UA              12.1            0           35.7           483           -20
13 US               3.78          -4           28.1           500           -19
14 VX              12.9            0           44.8           653           -20
15 WN              17.7            1           43.3           471           -13
16 YV              19.0           -2           49.2           387           -16
# ℹ 1 more variable: iqr <dbl>

A second step would be to sort the carriers by the average departure delay. The arrange() function does this:

flightdata %>% 
  group_by(carrier) %>% 
  summarise(avg_dep_delay = mean(dep_delay, na.rm = T),
            med_dep_delay = median(dep_delay, na.rm = T),
            sd_dep_delay = sd(dep_delay, na.rm = T),
            max_dep_delay = max(dep_delay, na.rm = T),
            min_dep_delay = min(dep_delay, na.rm  =T),
            iqr = IQR(dep_delay, na.rm = T)) %>% 
  arrange(avg_dep_delay)
# A tibble: 16 × 7
   carrier avg_dep_delay med_dep_delay sd_dep_delay max_dep_delay min_dep_delay
   <chr>           <dbl>         <dbl>        <dbl>         <dbl>         <dbl>
 1 US               3.78          -4           28.1           500           -19
 2 HA               4.90          -4           74.1          1301           -16
 3 AS               5.80          -3           31.4           225           -21
 4 AA               8.59          -3           37.4          1014           -24
 5 DL               9.26          -2           39.7           960           -33
 6 MQ              10.6           -3           39.2          1137           -26
 7 UA              12.1            0           35.7           483           -20
 8 OO              12.6           -6           43.1           154           -14
 9 VX              12.9            0           44.8           653           -20
10 B6              13.0           -1           38.5           502           -43
11 9E              16.7           -2           45.9           747           -24
12 WN              17.7            1           43.3           471           -13
13 FL              18.7            1           52.7           602           -22
14 YV              19.0           -2           49.2           387           -16
15 EV              20.0           -1           46.6           548           -32
16 F9              20.2            0.5         58.4           853           -27
# ℹ 1 more variable: iqr <dbl>

The carrier “US” does best. What carrier is that actually?

Before doing that, let us save the result in an object:

summary_delays <- flightdata %>% 
  group_by(carrier) %>% 
  summarise(avg_dep_delay = mean(dep_delay, na.rm = T),
            med_dep_delay = median(dep_delay, na.rm = T),
            sd_dep_delay = sd(dep_delay, na.rm = T),
            max_dep_delay = max(dep_delay, na.rm = T),
            min_dep_delay = min(dep_delay, na.rm  =T),
            iqr = IQR(dep_delay, na.rm = T)) %>% 
  arrange(avg_dep_delay)

Next up - how to figure out the name of the carrier “US”.

Key Points

  • Summary statistics tells us about the distribution of data