Testing for normality

Last updated on 2024-10-22 | Edit this page

Overview

Questions

  • How do we determine if a dataset might be normally distributed?

Objectives

  • Explain how to use markdown with the new lesson template

R

library(tidyverse)
library(palmerpenguins)

How to test if our data is normally distributed?


Common question - is my data normally distributed?

What does it mean that it is normally distributed? It means that the distribution of our data has the same properties as the normal distribution.

Let us get some data that we can test:

R

normal_test_data <- penguins %>% 
  filter(species == "Chinstrap") %>% 
  select(bill_depth_mm)

Mean and median

One of the properties of the normal distribution is that the mean and median of the data is equal. Let us look at the penguins:

R

summary(normal_test_data)

OUTPUT

 bill_depth_mm
 Min.   :16.40
 1st Qu.:17.50
 Median :18.45
 Mean   :18.42
 3rd Qu.:19.40
 Max.   :20.80  

This is actually pretty close! But equality between median and mean is a neccesary, not a sufficient condition.

What next. A histogram of the data should look normal. Let us take a closer look at bill_depth_mm where mean and median are closest:

R

normal_test_data %>% 
  ggplot(aes(x=bill_depth_mm)) +
  geom_histogram()

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This does not really look like a normal distribution. The fact that mean and median are almost identical was not a sufficient criterium for normalcy.

The shape of a histogram is heavily dependent on the bins we choose. A better way of visualizing often density-plots:

R

normal_test_data %>% 
  ggplot(aes(x=bill_depth_mm)) +
  geom_density()

We can think of this as a histogram with infinitely small bins.

This does look more normal - but it would be nice to be able to quantize the degree of normalcy.

Percentiels and QQ-plots as a test


The properties of the normal distribution informs us that 50% of the observations in the data are smaller than the mean (and conversely 50% are larger). We alson know that 50% of the observations should be in the interquartile range. And what we also know is that 2.5% of the observations (the 2.5 percentile) are smaller than the mean minus 1.96 times the standard deviation.

Actually we can calculate exactly which percentile a given observation is in. And we we can calculate which value a given percentile should have.

Comparing those gives us an indication of how well the data conforms to a normal distribution.

Rather than doing that by hand, we can get R to do it for us in a nice graphical way:

R

normal_test_data %>% 
  ggplot(aes(sample = bill_depth_mm)) +
  geom_qq() +
  geom_qq_line()

The geom_qq function calculate and plots which percentile an observation is in.

Rather than being given percentiles, we are given the value that the percentile corresponds to if we calculate it as number of standard deviations from the mean.

This results in plots that are more comparable.

geom_qq_line plots the line corresponding til the values the percentiles should have, if the data was normally distributed.

A common theme is that the midle of the data falls relatively close to the line, and that there are deviations from the line at both ends. In this case the deviations are largish, indicating that the data is not normally distributed.

Callout

We have two tails in the qq-plot, a left and a right. And they can be either above or below the qq-line.

That results in four different cases, that informs us about what is wrong with the data - in regards to how it deviates from normalcy.

Left tail Right tail Name What
Above Below Leptokurtic Heavier tails - ie. more extreme values. Higher kurtosis
Below Above Platykurtic Lighter tails - ie. fewer extreme values. Lower kurtosis
Above Above Right skewed A tail that stretches to the higher values - the extreme values are larger.
Below Below Left skewed A tail that stretches to the lower values - the extreme values are smaller.

Numerical measures rather than graphical


With experience the qq-plots can be used to determine if the data is normally distributed - the points are exactly on the line. But only rarely the points matches exactly - even if the data is normally distributed enough. And how much can the tails deviate from the line before the data is not normally distributed enough?

The deviations can be described numerically using two parameters:

Kurtosis and skewness.

Base-R do not have functions for this, but the package e1071 does.

R

library(e1071)

R

skewness(normal_test_data$bill_depth_mm, na.rm = TRUE)

OUTPUT

[1] 0.006578141

kurtosis

R

kurtosis(normal_test_data$bill_depth_mm, na.rm = TRUE)

OUTPUT

[1] -0.960087

As a rule of thumb, skewness should be within +/-0.5 if the data is normally distributed. Values between +/-0.5 and +/- 1 indicates a moderate skewness, where data can still be approximately normally distributed. Values larger that +/-1 indicates a significant skewness, and the data is probably not normally distributed.

For the kurtosis matters are a bit more complicated as the kurtosis for a normal distribution is 3. We therefore look at excess kurtosis, where i subtract 3 from the calculated kurtosis. An value of +/-1 exces kurtosis indicates that the data has a ‘tailedness’ close to the normal distribution. Values between +/-1 and +/-2 indicates a moderate deviation from the normal distribution, but the data can still be approximately normally distributed. Values larger than +/-2 is in general taken as an indication that the data is not normally distributed.

The absolute value of the excess kurtosis is larger than 2, indicating that the data is not normally distributed.

More direct tests


The question of wether or not data is normally distributed is important in many contexts, and it should come as no surprise that a multitude of tests has been devised for testing exactly that.

Shapiro-Wilk

The Shapiro-Wilk test is especially suited for small sample sizes (<50, some claim it works well up to <2000).

It is a measure of the linear correlation between data and the normally distributed quantiles, what we see in the qqplot.

The null-hypothesis is that data is normally distributed, and the Shapiro-Wilk test returns a p-value reporting the risk of being wrong if we reject the null-hypothesis.

R

shapiro.test(normal_test_data$bill_depth_mm) 

OUTPUT


	Shapiro-Wilk normality test

data:  normal_test_data$bill_depth_mm
W = 0.97274, p-value = 0.1418

The p-value in this case is 0.1418 - and we do not have enough evidense to reject the null-hypothesis. The data is probably normally distributed.

Kolmogorov-Smirnov

The KS-test allows us to test if the data is distributed as a lot of different distributions, not only the normal distribution. Because of this, we need to specify the specific distribution we are testing for, in this case a normal distribution with specific values for mean and standard deviation.

Therefore we need to calculate those:

R

mean <- mean(normal_test_data$bill_depth_mm, na.rm = TRUE)
sd <- sd(normal_test_data$bill_depth_mm, na.rm = TRUE)
ks.test(normal_test_data$bill_depth_mm, "pnorm", mean = mean, sd = sd)

WARNING

Warning in ks.test.default(normal_test_data$bill_depth_mm, "pnorm", mean =
mean, : ties should not be present for the one-sample Kolmogorov-Smirnov test

OUTPUT


	Asymptotic one-sample Kolmogorov-Smirnov test

data:  normal_test_data$bill_depth_mm
D = 0.073463, p-value = 0.8565
alternative hypothesis: two-sided

In this test the null-hypothesis is also that data is normally distributed. The p-values is very high, and therefore we cannot reject the null-hypothesis. Again, this is not the same as the data actually being normally distributed.

This test assumes that there are no repeated values in the data, as that can affect the precision of the test. The p-value is still very high, and we will conclude that we cannot rule out that the data is not normally distributed.

Note that the KS-test assumes that we actually know the true mean and standard deviation. Here we calculate those values based on the sample, which is problematic.

Liliefors test

This is a variation on the KS-test, that is designed specifically for testing for normality. It does not require us to know the true mean and standard deviation for the population.

This test is also not available in base-R, but can be found in the nortest package:

R

library(nortest)
lillie.test(normal_test_data$bill_depth_mm)

OUTPUT


	Lilliefors (Kolmogorov-Smirnov) normality test

data:  normal_test_data$bill_depth_mm
D = 0.073463, p-value = 0.483

Again the p-value is so high, that we cannot reject the null-hypothesis.

Anderson-Darling test

This test is more sensitive for deviations in the tails.

It is not available in base-R, but can be found in the nortest package.

R

ad.test(normal_test_data$bill_depth_mm)

OUTPUT


	Anderson-Darling normality test

data:  normal_test_data$bill_depth_mm
A = 0.44788, p-value = 0.2714

In this case the null-hypothesis is also that data is normally distributed, and the p-value indicates that we cannot reject the null-hypothesis.

And is it actually normally distributed?


Probably not. Except for the excess kurtosis all the tests we have done indicate that the depth of the beaks of chinstrap penguins can be normally distributed. Or rather, that we cannot reject the null-hypothesis that they are normally distributed.

But the fact that we cannot reject this hypothesis is not the same as concluding that the data actually is normally distributed.

Based on the excess kurtosis and the qq-plot, it would be reasonable to conclude that it is not.

Key Points

  • Use .md files for episodes when you want static content
  • Use .Rmd files for episodes when you need to generate output
  • Run sandpaper::check_lesson() to identify any issues with your lesson
  • Run sandpaper::build_lesson() to preview your lesson locally