Joining data

Overview

Teaching: 20 min
Exercises: 10 min

Questions

How do I import data from other sheets in a spreadsheet?

How do I enrich tables with additional data?

What is a join?

Objectives

Learn how to import data from sheet number 2 (and 3 etc) in spreadsheets

Learn how to join data frames

What is the actual name of the best airline?

It is nice to be able to identify the airline that is most on time. But we have carrier codes, not the actual names of them.

How do we get that?

Reading another sheet in a spreadsheet

Our spreadsheet contains more that one sheet. The second sheet contains the names and carrier codes of the relevant airlines.

We can read sheet number 2 from the Excelfile, by giving read_excel() a second argument, specifying the sheet we want:

carriers <- read_excel("data/flightdata.xlsx", sheet = 2)

carriers

# A tibble: 16 × 2
   carrier name                       
   <chr>   <chr>                      
9E      Endeavor Air Inc.          
AA      American Airlines Inc.     
AS      Alaska Airlines Inc.       
B6      JetBlue Airways            
DL      Delta Air Lines Inc.       
EV      ExpressJet Airlines Inc.   
F9      Frontier Airlines Inc.     
FL      AirTran Airways Corporation
HA      Hawaiian Airlines Inc.     
MQ      Envoy Air                  
OO      SkyWest Airlines Inc.      
UA      United Air Lines Inc.      
US      US Airways Inc.            
VX      Virgin America             
WN      Southwest Airlines Co.     
YV      Mesa Airlines Inc.         

We now have a second data frame, containing the names of the carriers. And we have a data frame containing the average delays

summary_delays

# A tibble: 16 × 7
   carrier avg_dep_delay med_dep_delay sd_dep_delay max_dep_delay min_dep_delay
   <chr>           <dbl>         <dbl>        <dbl>         <dbl>         <dbl>
US               3.78          -4           28.1           500           -19
HA               4.90          -4           74.1          1301           -16
AS               5.80          -3           31.4           225           -21
AA               8.59          -3           37.4          1014           -24
DL               9.26          -2           39.7           960           -33
MQ              10.6           -3           39.2          1137           -26
UA              12.1            0           35.7           483           -20
OO              12.6           -6           43.1           154           -14
VX              12.9            0           44.8           653           -20
B6              13.0           -1           38.5           502           -43
9E              16.7           -2           45.9           747           -24
WN              17.7            1           43.3           471           -13
FL              18.7            1           52.7           602           -22
YV              19.0           -2           49.2           387           -16
EV              20.0           -1           46.6           548           -32
F9              20.2            0.5         58.4           853           -27
# ℹ 1 more variable: iqr <dbl>

what we would like is something like this:

# A tibble: 16 × 8
   carrier name           avg_dep_delay med_dep_delay sd_dep_delay max_dep_delay
   <chr>   <chr>                  <dbl>         <dbl>        <dbl>         <dbl>
US      US Airways In…          3.78          -4           28.1           500
HA      Hawaiian Airl…          4.90          -4           74.1          1301
AS      Alaska Airlin…          5.80          -3           31.4           225
AA      American Airl…          8.59          -3           37.4          1014
DL      Delta Air Lin…          9.26          -2           39.7           960
MQ      Envoy Air              10.6           -3           39.2          1137
UA      United Air Li…         12.1            0           35.7           483
OO      SkyWest Airli…         12.6           -6           43.1           154
VX      Virgin America         12.9            0           44.8           653
B6      JetBlue Airwa…         13.0           -1           38.5           502
9E      Endeavor Air …         16.7           -2           45.9           747
WN      Southwest Air…         17.7            1           43.3           471
FL      AirTran Airwa…         18.7            1           52.7           602
YV      Mesa Airlines…         19.0           -2           49.2           387
EV      ExpressJet Ai…         20.0           -1           46.6           548
F9      Frontier Airl…         20.2            0.5         58.4           853
# ℹ 2 more variables: min_dep_delay <dbl>, iqr <dbl>

What we want to do is joining the two dataframes, so we enrich the original dataframe containing summary statistics on departure delays, with the name matching the carrier code.

Several different types of joins exist. The one we need here is left_join()

left_join returns all rows in the left dataframe, enriched with data from the rigth .

With a left_join() we join the two dataframes. All rows in the left dataframe are returned, enriched with the matching values from the columns in the right dataframe. Rows in the left dataframe, that does not have a matching row in the right dataframe will get NA-values.

Let us do that!

summary_delays %>% 
  left_join(carriers) %>% 
  relocate(name, .after = carrier) 

Joining with `by = join_by(carrier)`

# A tibble: 16 × 8
   carrier name           avg_dep_delay med_dep_delay sd_dep_delay max_dep_delay
   <chr>   <chr>                  <dbl>         <dbl>        <dbl>         <dbl>
US      US Airways In…          3.78          -4           28.1           500
HA      Hawaiian Airl…          4.90          -4           74.1          1301
AS      Alaska Airlin…          5.80          -3           31.4           225
AA      American Airl…          8.59          -3           37.4          1014
DL      Delta Air Lin…          9.26          -2           39.7           960
MQ      Envoy Air              10.6           -3           39.2          1137
UA      United Air Li…         12.1            0           35.7           483
OO      SkyWest Airli…         12.6           -6           43.1           154
VX      Virgin America         12.9            0           44.8           653
B6      JetBlue Airwa…         13.0           -1           38.5           502
9E      Endeavor Air …         16.7           -2           45.9           747
WN      Southwest Air…         17.7            1           43.3           471
FL      AirTran Airwa…         18.7            1           52.7           602
YV      Mesa Airlines…         19.0           -2           49.2           387
EV      ExpressJet Ai…         20.0           -1           46.6           548
F9      Frontier Airl…         20.2            0.5         58.4           853
# ℹ 2 more variables: min_dep_delay <dbl>, iqr <dbl>

The relocate function is used to change the order of the columns.

There are other join functions

left_join is an example of a mutating join. Like the mutate function left_join introduces new variables, new columns, in our dataframe.

Mutating joins

The closes cousin of the left_join funtion is the right_join.

The left_join returns all rows on the left hand side of the join, augmented with values from the dataframe on the right hand side of the join, based on matching values.

The right_join function returns all rows on the right hand side of the join. Depending on the flow of our code, we choose the variant that best suit our purpose.

inner_join keeps only the rows on the left hand side of the join, that have matching values in the dataframe on the right hand side of the join. Whereas left_join and right_join keeps all rows, and add NA values where there is no match, inner_join only returns rows that have a match.

full_join on the other hand keeps all rows from both dataframes being joined. Any observations that does not have a match in the other dataframe, will have NA values added.

Filtering joins

These joins will filter rows from the left hand dataframe, based on the presence or absence of matches in the right hand dataframe.

semi_join returns all rows that have a match on the right hand side of the join. But not those that do not have a match.

anti_join returns all rows, that do not have a match on the rigth hand side of the join.

Can we plot the average delays with carrier names?

Yes we can!

We take the summary_delays dataframe, containing the average delays, and join it with the data on carrier names. We do not have to specify the column on which we are joining, since there is only one shared column name. We then pipe that result to ggplot, where we specify that we would like the delay to be on the x-axis, and the name of the carrier on the y-axis:

summary_delays %>% 
  left_join(carriers) %>% 
  ggplot(mapping = aes(x = avg_dep_delay, y = name)) +
    geom_col()

Joining with `by = join_by(carrier)`

plot of chunk unnamed-chunk-7

Key Points

Data is often organized in separate tables, joining them can enrich the data we are analysing

previous episode

R EDA

next episode