Before we Start


Figure 1

Example of a working directory structure

Getting to know the data
Taking a look at the data


Figure 1

A histogram divides the numeric values of the departure delay into “buckets” of a fixed width. It then counts the number of observations in each bucket and plots a column whose height matches that count.
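As a minimal sketch of the bucketing described above, the code below draws a histogram with ggplot2 and then repeats the counting by hand. The data frame flights_toy and its simulated values are made up for illustration, and the column name dep_delay is an assumption about how the departure delay is stored.

```r
library(ggplot2)

# Simulated stand-in for the flight data (hypothetical values, not real delays)
set.seed(1)
flights_toy <- data.frame(dep_delay = round(rexp(500, rate = 1 / 20)) - 5)

# binwidth sets the fixed bucket width (10 minutes here);
# boundary and closed make each bucket cover [0, 10), [10, 20), ...
p <- ggplot(flights_toy, aes(x = dep_delay)) +
  geom_histogram(binwidth = 10, boundary = 0, closed = "left")
p

# The same idea by hand: assign each value to the lower edge of its bucket,
# then count the observations per bucket
bucket <- floor(flights_toy$dep_delay / 10) * 10
counts <- table(bucket)
counts
```

Each entry in counts corresponds to the height of one column in the plot; changing binwidth changes how coarse or fine the buckets are.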


Figure 2

We pipe the data to sample_frac to look at a random 0.5% of the rows. The result is piped to the ggplot function, where we map the departure delay to the X-axis and the arrival delay to the Y-axis.
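The pipeline could look roughly like the sketch below. The data frame flights_toy is simulated for illustration, the column names dep_delay and arr_delay are assumptions, and geom_point is our guess at the layer used to draw the points.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the flight data (hypothetical values)
set.seed(42)
flights_toy <- tibble(
  dep_delay = rexp(2000, rate = 1 / 15),
  arr_delay = dep_delay + rnorm(2000, sd = 10)
)

# Keep a random 0.5% of the rows: 10 of the 2000 rows here
flights_sample <- flights_toy %>%
  sample_frac(0.005)

# Pipe the sample into ggplot and map the two delays to the axes
p <- flights_sample %>%
  ggplot(aes(x = dep_delay, y = arr_delay)) +
  geom_point()
p
```

Sampling first keeps the plot readable and fast to draw when the full table is large. In recent versions of dplyr, sample_frac is superseded by slice_sample(prop = 0.005), which does the same job.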


Figure 3

This gives us a good first indication of how the different variables vary together. This type of plot is called a correlogram because it shows the correlations between all the selected variables.
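The numbers behind a correlogram are the pairwise correlation coefficients. As a small sketch, base R's cor function computes them for a set of numeric columns; the data frame and its column names below are simulated assumptions, not the lesson's data.

```r
# Simulated stand-in for a few numeric flight variables (hypothetical values)
set.seed(3)
distance  <- runif(300, min = 100, max = 2500)
air_time  <- 20 + 0.12 * distance + rnorm(300, sd = 10)
dep_delay <- rexp(300, rate = 1 / 15)
toy <- data.frame(distance, air_time, dep_delay)

# Pairwise Pearson correlations between all selected variables
cor_matrix <- cor(toy)
round(cor_matrix, 2)
```

The matrix is symmetric with ones on the diagonal, since every variable correlates perfectly with itself. A correlogram presents the same information graphically.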


Figure 4


Exploring with summary statistics


Joining data


Figure 1

left_join returns all rows in the left dataframe, enriched with data from the right.
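A small sketch of this behaviour, using two made-up tables. The table names, the key column carrier, and the carrier code "ZZ" are hypothetical and only serve to show what happens to unmatched rows.

```r
library(dplyr)

# Left table: one row per flight (toy data)
flights_toy <- tibble(
  flight  = c(1, 2, 3),
  carrier = c("UA", "AA", "ZZ")
)

# Right table: lookup of carrier codes to names (toy data)
airlines_toy <- tibble(
  carrier = c("UA", "AA", "DL"),
  name    = c("United Air Lines", "American Airlines", "Delta Air Lines")
)

# Every row of flights_toy is kept; name is filled in where carrier matches
joined <- left_join(flights_toy, airlines_toy, by = "carrier")
joined
```

The flight with carrier "ZZ" has no match in the right table, so it is kept with NA in the name column. The "DL" row exists only in the right table and therefore does not appear in the result.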


Figure 2


Boxplots and linear regressions


Figure 1


Figure 2

The number of flights United Air Lines has to Hawaii is too low to be visible here. But we do get a more nuanced view of the distribution of airtime for the individual airlines than we do with boxplots.


Figure 3

This looks more or less linear. We can add a linear regression line to the plot using the function geom_smooth(method = "lm"), where method = "lm" specifies that a straight line should be fitted to the data.
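As a sketch of how the regression line is added, the code below plots simulated data and overlays the fitted line. The variables distance and air_time are assumptions chosen for illustration; the values are simulated, not taken from the lesson's data.

```r
library(ggplot2)

# Simulated, roughly linear relationship (hypothetical values)
set.seed(7)
toy <- data.frame(distance = runif(200, min = 100, max = 2500))
toy$air_time <- 20 + 0.12 * toy$distance + rnorm(200, sd = 10)

# Scatter plot with a fitted straight line on top;
# the shaded band around the line is the confidence interval
p <- ggplot(toy, aes(x = distance, y = air_time)) +
  geom_point() +
  geom_smooth(method = "lm")
p
```

Adding se = FALSE to geom_smooth hides the confidence band if only the line itself is wanted.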


Figure 4

So, what is the actual linear model of this data?
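One way to get at the model itself is base R's lm function, which fits the same kind of straight line and returns its coefficients. This is a minimal sketch on simulated data; the variables distance and air_time are assumptions for illustration.

```r
# Simulated, roughly linear relationship (hypothetical values)
set.seed(7)
toy <- data.frame(distance = runif(200, min = 100, max = 2500))
toy$air_time <- 20 + 0.12 * toy$distance + rnorm(200, sd = 10)

# Fit air_time = intercept + slope * distance
model <- lm(air_time ~ distance, data = toy)

# The intercept and slope of the fitted line
coef(model)

# Standard errors, p-values and R-squared
summary(model)
```

The slope estimates how much air_time changes for each additional unit of distance, and the intercept is the predicted air_time at a distance of zero.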


What is the next step?
What should I do next?