Descriptive Statistics#

Learning Objectives

Questions:

  • How can we calculate and analyse descriptive statistics using Pandas?

Objectives:

  • Use built-in methods for descriptive statistics.

  • Use the split-apply-combine pattern to perform descriptive statistics on groupings of data.

  • Use different count methods on groupings of data.

Import Pandas#

# This line imports the pandas library and aliases it as 'pd'.

import pandas as pd

Creating a DataFrame from a CSV file#

The pd.read_csv() function is used to read the data from the file ‘data/titanic.csv’.

# Load the Titanic dataset from a CSV file into a DataFrame named 'titanic'.

titanic = pd.read_csv('data/titanic.csv')
# Display the DataFrame 'titanic'.

titanic
Hide code cell output
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns


Aggregating statistics#

Aggregating statistics from a single coloumn in a DataFrame

Pandas offers built-in methods that allow you to quickly generate summary statistics for your data. These methods make it easy to analyse and summarise data with just a few lines of code.

For example, we can use .mean() to find the average age of the Titanic passengers:

titanic["Age"].mean()
Hide code cell output
np.float64(29.69911764705882)

Different statistics are available and can be applied to columns with numerical data. Operations in general exclude missing data and operate across rows by default.

Aggregating statistics from a multiple coloumns in a DataFrame

What is the median age and ticket fare price of the Titanic passengers?

titanic[["Age", "Fare"]].median()
Hide code cell output
Age     28.0000
Fare    14.4542
dtype: float64

The statistic applied to multiple columns of a DataFrame (the selection of two columns returns a DataFrame, see the Subsets tutorial) is calculated for each numeric column.

Multiple descriptive statistics can easily be calculated for one or more columns at the same time. For this we can use the describe() method.

titanic[["Age", "Fare"]].describe()
Hide code cell output
Age Fare
count 714.000000 891.000000
mean 29.699118 32.204208
std 14.526497 49.693429
min 0.420000 0.000000
25% 20.125000 7.910400
50% 28.000000 14.454200
75% 38.000000 31.000000
max 80.000000 512.329200

Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the DataFrame.agg() method:

titanic.agg(
    {
        "Age": ["min", "max", "median", "skew"],
        "Fare": ["min", "max", "median", "mean"],
    }
)
Hide code cell output
Age Fare
min 0.420000 0.000000
max 80.000000 512.329200
median 28.000000 14.454200
skew 0.389108 NaN
mean NaN 32.204208

Aggregating statistics grouped by category#

Aggregating statistics grouped by category

Aggregating statistics according to the grouped by category is helpful when you want to e.g. compare the avarage age for different groups of passenger. Take a look at the example below:

What is the average age for male versus female Titanic passengers?

titanic[["Sex", "Age"]].groupby("Sex").mean()
Hide code cell output
Age
Sex
female 27.915709
male 30.726645

As our interest is the average age for each sex, a subselection on these two columns is made first: titanic[["Sex", "Age"]].
Next, the groupby() method is applied on the ‘Sex’ column to make a group per category. The average age for each sex is calculated and returned.

Calculating a given statistic (e.g. mean() age) for each category in a column (e.g. male/female in the ‘Sex’ column) is a common pattern. The groupby() method is used to support this type of operations. This fits in the more general split-apply-combine pattern:

  • Split the data into groups.

  • Apply a function to each group independently.

  • Combine the results into a data structure.

The apply and combine steps are typically done together in Pandas.

In the previous example, we explicitly selected the 2 columns first. If not, the mean() method is applied to each column containing numerical columns by passing numeric_only=True:

titanic.groupby("Sex").mean(numeric_only=True)
Hide code cell output
PassengerId Survived Pclass Age SibSp Parch Fare
Sex
female 431.028662 0.742038 2.159236 27.915709 0.694268 0.649682 44.479818
male 454.147314 0.188908 2.389948 30.726645 0.429809 0.235702 25.523893

It does not make much sense to get the average value of the ‘Pclass’. If we are only interested in the average age for each sex, the selection of columns (rectangular brackets [] as usual) is supported on the grouped data as well:

titanic.groupby("Sex")["Age"].mean()
Hide code cell output
Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

The split-apply-combine pattern

What is the mean ticket fare price for each of the sex and cabin class combinations?

titanic.groupby(["Sex", "Pclass"])["Fare"].mean()
Hide code cell output
Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64

Grouping can be done by multiple columns at the same time. Provide the column names as a list to the groupby() method.

Count number of records by category#

Counting number of records by category

What is the number of passengers in each of the cabin classes?

titanic["Pclass"].value_counts()
Hide code cell output
Pclass
3    491
1    216
2    184
Name: count, dtype: int64

The value_counts() method counts the number of records for each category in a column.

The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records within each group:

titanic.groupby("Pclass")["Pclass"].count()
Hide code cell output
Pclass
1    216
2    184
3    491
Name: Pclass, dtype: int64

Note

Both size() and count() can be used in combination with groupby(). Whereas size includes NaN values and just provides the number of rows (size of the table), count() excludes the missing values. In the value_counts() method, use the dropna parameter to include or exclude the NaN values.

Key points#

  • Descriptive statistics can be calculated on entire columns or rows.

  • groupby() provides the power of the split-apply-combine pattern.

  • value_counts() is a convenient shortcut to count the number of entries in each category of a variable.