This lesson is being piloted (Beta version)

Different types of plots

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • What other types of plots can we make?

  • How can we control the order of stuff in plots?

Objectives
  • Learn how to make histograms, barcharts, boxplots and violinplots

A collection of different types of plots

Scatterplots are very useful, but we often need other types of plots. In this part of the course, we are going to look at some of the more common types.

Histograms

A histogram splits all observations of a variable into a number of “bins” and counts how many observations fall in each bin. We then plot one column per bin, with a height equal to the number of observations in that bin.

Note that here we use the pipe to get the diamonds data into ggplot(), instead of passing it as the first argument. Both methods can be used. If we need to manipulate the data before plotting, the pipe is a common way to get the modified data into ggplot().

diamonds %>% 
  ggplot(mapping = aes(carat)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Plot: histogram of carat with the default 30 bins]

Note that we get a message from geom_histogram that the number of bins is set to 30 by default. 30 bins will almost never be the correct number of bins, and we should choose a better value ourselves.

diamonds %>% 
  ggplot(aes(carat)) +
  geom_histogram(bins = 25)
[Plot: histogram of carat with 25 bins]

What number of bins should I choose? There are some general rules for this (some can be found at https://kubdatalab.github.io/forklaringer/12-histogrammer/index.html; beware, the page is in Danish). In general, we recommend that you experiment with different numbers of bins to find the one that best shows your data.
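
As a rough starting point for that experimentation, the sketch below computes two common rules of thumb with base R. It assumes the tidyverse is loaded; the names n_sturges, fd_width and p_fd are our own and only used for illustration. Treat the results as suggestions, not as the correct answer.

```r
library(tidyverse)

# Sturges' rule: ceiling(log2(n) + 1) bins, where n is the number of observations
n_sturges <- nclass.Sturges(diamonds$carat)

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
fd_width <- 2 * IQR(diamonds$carat) / nrow(diamonds)^(1/3)

# geom_histogram() accepts either a number of bins or a bin width
p_sturges <- diamonds %>% 
  ggplot(aes(carat)) +
  geom_histogram(bins = n_sturges)

p_fd <- diamonds %>% 
  ggplot(aes(carat)) +
  geom_histogram(binwidth = fd_width)
```

Printing p_sturges and p_fd side by side shows how much the impression of the data depends on the choice of bins.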

Note that we left out the name of the mapping argument in the call to ggplot(). The first argument of ggplot() is always data, and we get that via the pipe. The second argument is always mapping, so we can pass aes() by position without naming it.
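
To make this concrete, here is a small sketch of three ways of writing the same plot. It assumes the tidyverse is loaded; the object names p_pipe, p_named and p_positional are only for illustration.

```r
library(tidyverse)

# Data via the pipe, mapping passed by position
p_pipe <- diamonds %>% 
  ggplot(aes(carat)) +
  geom_histogram(bins = 25)

# Everything spelled out with argument names
p_named <- ggplot(data = diamonds, mapping = aes(x = carat)) +
  geom_histogram(bins = 25)

# Both data and mapping passed by position
p_positional <- ggplot(diamonds, aes(carat)) +
  geom_histogram(bins = 25)
```

All three objects describe the same histogram; which form to use is a matter of taste and readability.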

In the following we are sometimes going to specify the mapping argument. There are two reasons for that. One: We have forgotten to be consistent. Two: In some cases it is useful to remind ourselves that we are actually mapping data to something.

Barcharts

Not to be confused with histograms, barcharts count the number of observations in different groups. Where the scale in histograms is continuous, and split into bins, the scale in barcharts is discrete.

Here we map the color variable to the x-axis in the barchart. geom_bar counts the number of observations itself - we do not need to provide a count:

diamonds %>% 
  ggplot(aes(color)) +
  geom_bar()
[Plot: barchart of the number of diamonds per color]
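
If we have already counted the observations ourselves, one option is geom_col(), which plots the values we give it instead of counting. The sketch below assumes the tidyverse is loaded; color_counts, p_col and p_bar are names we made up for the example.

```r
library(tidyverse)

# Count the number of diamonds per color ourselves ...
color_counts <- diamonds %>% 
  count(color)

# ... and plot the precomputed counts (column n) with geom_col()
p_col <- color_counts %>% 
  ggplot(aes(x = color, y = n)) +
  geom_col()

# geom_bar() does the counting for us
p_bar <- diamonds %>% 
  ggplot(aes(color)) +
  geom_bar()
```

The two plots should show the same column heights; geom_bar() simply saves us the counting step.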

A small excursion

Why are the columns in the barchart above in that order?

One might guess that they are simply in alphabetical order.

Not so! Color is a categorical variable. A diamond has either the color “D” (which is the best color) or one of the other colors (like “J”, which is the worst).

There are no “D.E” colors; the colors do not lie on a continuous range.

In R, this kind of variable is stored as a “factor”. The data in a factor can take one of several values, called levels, and the order of these levels is what controls the order in the plot.

The order can be either arbitrary, or there can exist an implicit order in the data, like with the color of the diamonds, where D is the best color, and J is the worst. These types of ordered categorical data are called ordinal data.

They look like this:

diamonds %>% 
  select(cut, color, clarity) %>% 
  str()
tibble [53,940 × 3] (S3: tbl_df/tbl/data.frame)
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...

Note that even though the color “D” is better than “E”, the levels of the color factor indicate that “D<E”. The “<” only describes the order of the levels, not which color is better.
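
We can also inspect the levels directly with base R. This is a small sketch, assuming the tidyverse (and with it the diamonds data) is loaded; the variable names are only for illustration.

```r
library(tidyverse)

# The levels of the factor, in the order R stores them
color_levels <- levels(diamonds$color)
color_levels
# "D" "E" "F" "G" "H" "I" "J"

# Is the factor ordered, that is, ordinal data?
color_is_ordered <- is.ordered(diamonds$color)

# Behind the scenes each value is stored as the position of its level
first_codes <- head(as.integer(diamonds$color), 5)
first_codes
# 2 2 2 6 7
```

The integer codes match the numbers shown by str() above: the first diamond has color “E”, which is level number 2.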

All this just to say: We can control the order of columns in the plot by controlling the order of the levels of the categorical variable we are plotting:

diamonds %>% 
  mutate(color = fct_rev(color)) %>%  
  ggplot(aes(color)) +
  geom_bar()
[Plot: barchart of the number of diamonds per color, with the colors in reversed order]

fct_rev is a function that reverses the order of the levels of a factor. It comes from the forcats package, which makes it easier to work with categorical data.
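
forcats has other functions for reordering levels. As a sketch, and assuming the tidyverse is loaded, fct_infreq() orders the levels by how often they occur, and fct_relevel() moves levels we name to the front. The object names below are made up for the example.

```r
library(tidyverse)

# Order the colors by frequency, most common first
by_frequency <- diamonds %>% 
  mutate(color = fct_infreq(color))

p_frequency <- by_frequency %>% 
  ggplot(aes(color)) +
  geom_bar()

# Move one specific level, here "J", to the front
j_first <- diamonds %>% 
  mutate(color = fct_relevel(color, "J"))

levels(j_first$color)
```

Ordering by frequency makes the columns in p_frequency decrease from left to right, which can make a barchart easier to read when the levels have no meaningful order of their own.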

Boxplots

Boxplots are suitable for visualising the distribution of data. We can make a boxplot of a single variable in the data - or we can make several boxplots in one plot:

diamonds %>% 
  ggplot(aes(x = carat, y = cut)) +
  geom_boxplot()
[Plot: boxplots of carat, one for each cut]

Here the variable we are making boxplots of, carat, is on the x-axis, and we get one boxplot per cut along the y-axis.

What is a boxplot?

Boxplots are useful for showing different distributions. The fat line in the middle of the box is the median, the two ends of the box are the first and third quartiles, and the two whiskers (or lines) on either side of the box show the minimum and maximum values - excluding outliers, defined for this purpose as values that lie more than 1.5 times the interquartile range from the box.
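
To see where these numbers come from, we can compute them by hand for a single group. The sketch below does this for the carat of the “Ideal” cut diamonds, assuming the tidyverse is loaded. All variable names are our own, and we use the default quantile() method of base R.

```r
library(tidyverse)

# Carat values for one group
ideal <- diamonds %>% 
  filter(cut == "Ideal") %>% 
  pull(carat)

# First quartile, median and third quartile: the box and the fat line
q <- unname(quantile(ideal, c(0.25, 0.5, 0.75)))
iqr <- q[3] - q[1]

# Values further than 1.5 * IQR from the box count as outliers
lower_limit <- q[1] - 1.5 * iqr
upper_limit <- q[3] + 1.5 * iqr
outliers <- ideal[ideal < lower_limit | ideal > upper_limit]

# The whiskers end at the most extreme values that are not outliers
whisker_low  <- min(ideal[ideal >= lower_limit])
whisker_high <- max(ideal[ideal <= upper_limit])
```

Note that the whiskers stop at actual observations, so they will usually be shorter than the 1.5 times interquartile range limits themselves.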

Violinplots

Boxplots are not necessarily the best option for showing distributions. A good alternative could be violinplots. They show a density plot - basically a histogram with infinitely many bins - for each group, plotted symmetrically around an axis:

[Plot: example of violinplots, one for each group]
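
To get a feeling for what a density plot is, it can help to compare it with a histogram of the same data. The sketch below does that for the “Ideal” cut diamonds only, assuming the tidyverse is loaded; the object names and the choice of 50 bins are ours.

```r
library(tidyverse)

ideal_diamonds <- diamonds %>% 
  filter(cut == "Ideal")

# A histogram with a fixed number of bins
p_hist <- ideal_diamonds %>% 
  ggplot(aes(carat)) +
  geom_histogram(bins = 50)

# A density plot: a smoothed curve instead of columns
p_density <- ideal_diamonds %>% 
  ggplot(aes(carat)) +
  geom_density()
```

A violinplot takes a curve like the one in p_density, mirrors it around an axis, and draws one such shape per group.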

Exercise

The geom_ for making violinplots is geom_violin. Look at the help for geom_violin and make a violinplot with carat on the x-axis and cut on the y-axis.

Solution

diamonds %>% 
  ggplot(aes(x = carat, y = cut)) +
  geom_violin()

And many more

ggplot2 comes with a multitude of different plot types. A complete list would be very long, and going through it would take up all the time for this course. Take a look at The R Graph Gallery or at Graphs in R (NB a work in progress), where we will collect weird and wonderful plots, along with when to use them, when not to use them, and how to make them.

ggplot2 is written as an extensible package, meaning that developers can create packages making plots that are not included in ggplot2, or introduce more advanced functionality around plots. Two of the more interesting extensions are:

ggforce extends ggplot2 with specialised plot types.

gganimate makes it easyish to make animated plots using ggplot2.

Key Points

  • The order of the levels in categorical data, aka factors, controls the order of the groups in plots

  • ggplot2 makes it easy to make many different types of plots

  • ggplot2 has many useful extensions