{
"cells": [
{
"cell_type": "markdown",
"id": "a9fc7c09-6376-4bf8-abd2-7596ac805d62",
"metadata": {},
"source": [
"# Descriptive Statistics"
]
},
{
"cell_type": "markdown",
"id": "8f4c0f2a-e065-494f-acfe-f343801f9a69",
"metadata": {},
"source": [
"```{admonition} Learning Objectives\n",
"\n",
"Questions:\n",
"* How can we calculate and analyse descriptive statistics using Pandas?\n",
"\n",
"Objectives:\n",
"* Use built-in methods for descriptive statistics.\n",
"* Use the split-apply-combine pattern to perform descriptive statistics on groupings of data.\n",
"* Use different count methods on groupings of data."
]
},
{
"cell_type": "markdown",
"id": "9ffbfee4-8515-4ccf-9b92-77cc164cb850",
"metadata": {},
"source": [
"## Import Pandas"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "641047cc-98b3-42c1-87ee-4333d820e7fd",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [],
"source": [
"# This line imports the pandas library and aliases it as 'pd'.\n",
"\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "c5ed751e-ec4b-487d-8150-4a0d5bc42275",
"metadata": {},
"source": [
"___"
]
},
{
"cell_type": "markdown",
"id": "60a03b66-78bb-4fe9-a1a2-3b1fb625ba6a",
"metadata": {},
"source": [
"## Creating a `DataFrame` from a CSV file"
]
},
{
"cell_type": "markdown",
"id": "a37391f9-2438-4d1e-8d55-2c46d0fc6b71",
"metadata": {},
"source": [
"The `pd.read_csv()` function is used to read the data from the file 'data/titanic.csv'."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "18196144-f324-4fc3-b575-7857c1138dbf",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-output"
]
},
"outputs": [],
"source": [
"# Load the Titanic dataset from a CSV file into a DataFrame named 'titanic'.\n",
"\n",
"titanic = pd.read_csv('data/titanic.csv')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "fc424032-fbf3-4cc2-b9b8-1a0ce3f6031b",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PassengerId | \n",
" Survived | \n",
" Pclass | \n",
" Name | \n",
" Sex | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Ticket | \n",
" Fare | \n",
" Cabin | \n",
" Embarked | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 3 | \n",
" Braund, Mr. Owen Harris | \n",
" male | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" A/5 21171 | \n",
" 7.2500 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" Cumings, Mrs. John Bradley (Florence Briggs Th... | \n",
" female | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" PC 17599 | \n",
" 71.2833 | \n",
" C85 | \n",
" C | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 1 | \n",
" 3 | \n",
" Heikkinen, Miss. Laina | \n",
" female | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" STON/O2. 3101282 | \n",
" 7.9250 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 1 | \n",
" 1 | \n",
" Futrelle, Mrs. Jacques Heath (Lily May Peel) | \n",
" female | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 113803 | \n",
" 53.1000 | \n",
" C123 | \n",
" S | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 0 | \n",
" 3 | \n",
" Allen, Mr. William Henry | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 373450 | \n",
" 8.0500 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 886 | \n",
" 887 | \n",
" 0 | \n",
" 2 | \n",
" Montvila, Rev. Juozas | \n",
" male | \n",
" 27.0 | \n",
" 0 | \n",
" 0 | \n",
" 211536 | \n",
" 13.0000 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 887 | \n",
" 888 | \n",
" 1 | \n",
" 1 | \n",
" Graham, Miss. Margaret Edith | \n",
" female | \n",
" 19.0 | \n",
" 0 | \n",
" 0 | \n",
" 112053 | \n",
" 30.0000 | \n",
" B42 | \n",
" S | \n",
"
\n",
" \n",
" 888 | \n",
" 889 | \n",
" 0 | \n",
" 3 | \n",
" Johnston, Miss. Catherine Helen \"Carrie\" | \n",
" female | \n",
" NaN | \n",
" 1 | \n",
" 2 | \n",
" W./C. 6607 | \n",
" 23.4500 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 889 | \n",
" 890 | \n",
" 1 | \n",
" 1 | \n",
" Behr, Mr. Karl Howell | \n",
" male | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" 111369 | \n",
" 30.0000 | \n",
" C148 | \n",
" C | \n",
"
\n",
" \n",
" 890 | \n",
" 891 | \n",
" 0 | \n",
" 3 | \n",
" Dooley, Mr. Patrick | \n",
" male | \n",
" 32.0 | \n",
" 0 | \n",
" 0 | \n",
" 370376 | \n",
" 7.7500 | \n",
" NaN | \n",
" Q | \n",
"
\n",
" \n",
"
\n",
"
891 rows × 12 columns
\n",
"
"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
".. ... ... ... \n",
"886 887 0 2 \n",
"887 888 1 1 \n",
"888 889 0 3 \n",
"889 890 1 1 \n",
"890 891 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
".. ... ... ... ... \n",
"886 Montvila, Rev. Juozas male 27.0 0 \n",
"887 Graham, Miss. Margaret Edith female 19.0 0 \n",
"888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n",
"889 Behr, Mr. Karl Howell male 26.0 0 \n",
"890 Dooley, Mr. Patrick male 32.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S \n",
".. ... ... ... ... ... \n",
"886 0 211536 13.0000 NaN S \n",
"887 0 112053 30.0000 B42 S \n",
"888 2 W./C. 6607 23.4500 NaN S \n",
"889 0 111369 30.0000 C148 C \n",
"890 0 370376 7.7500 NaN Q \n",
"\n",
"[891 rows x 12 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display the DataFrame 'titanic'.\n",
"\n",
"titanic"
]
},
{
"cell_type": "markdown",
"id": "305b5f59-68b8-4a8b-88a2-28dc370510b8",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"___"
]
},
{
"cell_type": "markdown",
"id": "d6200516-7871-4985-a3d5-634d623d4e60",
"metadata": {},
"source": [
"## Aggregating statistics\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"id": "6b9be4d7-6a34-4d95-8a37-8c45ca2e4bc0",
"metadata": {},
"source": [
"Pandas offers built-in methods that allow you to quickly generate summary statistics for your data. These methods make it easy to analyse and summarise data with just a few lines of code.\n",
"\n",
"For example, we can use `.mean()` to find the average age of the Titanic passengers:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "9705ecac-f8d8-4e08-abbd-10499b3ee8fa",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(29.69911764705882)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic[\"Age\"].mean()"
]
},
{
"cell_type": "markdown",
"id": "59c03aca-a6b2-44ce-adda-7827162f0d4b",
"metadata": {},
"source": [
"Different statistics are available and can be applied to columns with numerical data. Operations in general exclude missing data and operate across rows by default."
]
},
{
"cell_type": "markdown",
"id": "78f45a26-4d3c-423f-8c05-e0d775220222",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "5cf95abb-6db5-4531-9575-fd706c39ecda",
"metadata": {},
"source": [
"What is the median age and ticket fare price of the Titanic passengers?"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "c51ac34b-37d1-4fc1-9c1e-6447cafa7aff",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/plain": [
"Age 28.0000\n",
"Fare 14.4542\n",
"dtype: float64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic[[\"Age\", \"Fare\"]].median()"
]
},
{
"cell_type": "markdown",
"id": "a6f2ce57-6107-4759-abc1-077b33f71fe2",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"The statistic applied to multiple columns of a `DataFrame` (the selection of two columns returns a `DataFrame`, see the [Subsets tutorial](03_pandas_subsets)) is calculated for each numeric column."
]
},
{
"cell_type": "markdown",
"id": "18813351-2fb9-4201-b69a-9fdd61adef72",
"metadata": {},
"source": [
"Multiple descriptive statistics can easily be calculated for one or more columns at the same time. For this we can use the `describe()` method."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "5cc3d857-9507-4cd4-9d13-dd45b439af40",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Age | \n",
" Fare | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 714.000000 | \n",
" 891.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 29.699118 | \n",
" 32.204208 | \n",
"
\n",
" \n",
" std | \n",
" 14.526497 | \n",
" 49.693429 | \n",
"
\n",
" \n",
" min | \n",
" 0.420000 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 20.125000 | \n",
" 7.910400 | \n",
"
\n",
" \n",
" 50% | \n",
" 28.000000 | \n",
" 14.454200 | \n",
"
\n",
" \n",
" 75% | \n",
" 38.000000 | \n",
" 31.000000 | \n",
"
\n",
" \n",
" max | \n",
" 80.000000 | \n",
" 512.329200 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Age Fare\n",
"count 714.000000 891.000000\n",
"mean 29.699118 32.204208\n",
"std 14.526497 49.693429\n",
"min 0.420000 0.000000\n",
"25% 20.125000 7.910400\n",
"50% 28.000000 14.454200\n",
"75% 38.000000 31.000000\n",
"max 80.000000 512.329200"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic[[\"Age\", \"Fare\"]].describe()"
]
},
{
"cell_type": "markdown",
"id": "fb47036c-23b8-4da5-99ed-e180e1dfd159",
"metadata": {},
"source": [
"Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the `DataFrame.agg()` method:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "b6f7f96a-3502-4e40-b37b-7b0db6775665",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Age | \n",
" Fare | \n",
"
\n",
" \n",
" \n",
" \n",
" min | \n",
" 0.420000 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" max | \n",
" 80.000000 | \n",
" 512.329200 | \n",
"
\n",
" \n",
" median | \n",
" 28.000000 | \n",
" 14.454200 | \n",
"
\n",
" \n",
" skew | \n",
" 0.389108 | \n",
" NaN | \n",
"
\n",
" \n",
" mean | \n",
" NaN | \n",
" 32.204208 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Age Fare\n",
"min 0.420000 0.000000\n",
"max 80.000000 512.329200\n",
"median 28.000000 14.454200\n",
"skew 0.389108 NaN\n",
"mean NaN 32.204208"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.agg(\n",
" {\n",
" \"Age\": [\"min\", \"max\", \"median\", \"skew\"],\n",
" \"Fare\": [\"min\", \"max\", \"median\", \"mean\"],\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"id": "99f41c0b-c1b4-48ba-8de4-6455c158e707",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"```{admonition} Descriptive statistics\n",
":class: seealso dropdown\n",
"\n",
"Details about descriptive statistics can be found in the official Pandas documentation on [Descriptive statistics](https://pandas.pydata.org/docs/user_guide/basics.html#basics-stats).\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "f2934055-47ba-4b82-a888-f5991eba1079",
"metadata": {},
"source": [
"___"
]
},
{
"cell_type": "markdown",
"id": "0a53e5e8-f18b-48d7-a152-213dc3b1f959",
"metadata": {},
"source": [
"## Aggregating statistics grouped by category\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"id": "1c336bd6-e53b-4b3d-adbe-8e119c26f9cb",
"metadata": {},
"source": [
"Aggregating statistics according to the grouped by category is helpful when you want to e.g. compare the avarage age for different groups of passenger. Take a look at the example below:\n",
"\n",
"What is the average age for male versus female Titanic passengers?"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "8dc2bb4f-52df-4ae0-900d-b16253cb1900",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Age | \n",
"
\n",
" \n",
" Sex | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" female | \n",
" 27.915709 | \n",
"
\n",
" \n",
" male | \n",
" 30.726645 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Age\n",
"Sex \n",
"female 27.915709\n",
"male 30.726645"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic[[\"Sex\", \"Age\"]].groupby(\"Sex\").mean()"
]
},
{
"cell_type": "markdown",
"id": "b50118ae-5aae-4845-9b60-ce7db7fc75d9",
"metadata": {},
"source": [
"As our interest is the average age for each sex, a subselection on these two columns is made first: `titanic[[\"Sex\", \"Age\"]]`.\\\n",
"Next, the `groupby()` method is applied on the 'Sex' column to make a group per category. The average age for each sex is calculated and returned."
]
},
{
"cell_type": "markdown",
"id": "a10e043d-e43f-4fdf-95bd-f53d64594aa6",
"metadata": {},
"source": [
"Calculating a given statistic (e.g. `mean()` age) *for each category in a column* (e.g. male/female in the 'Sex' column) is a common pattern. The `groupby()` method is used to support this type of operations. This fits in the more general **split-apply-combine** pattern:\n",
"\n",
"* **Split** the data into groups.\n",
"* **Apply** a function to each group independently.\n",
"* **Combine** the results into a data structure.\n",
"\n",
"The apply and combine steps are typically done together in Pandas."
]
},
{
"cell_type": "markdown",
"id": "777efc81-4d5d-4ec6-9dec-29e6990c0047",
"metadata": {},
"source": [
"In the previous example, we explicitly selected the 2 columns first. If not, the `mean()` method is applied to each column containing numerical columns by passing `numeric_only=True`:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "8800c4f7-8ed5-4fac-b3dc-ee8feff70323",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PassengerId | \n",
" Survived | \n",
" Pclass | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Fare | \n",
"
\n",
" \n",
" Sex | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" female | \n",
" 431.028662 | \n",
" 0.742038 | \n",
" 2.159236 | \n",
" 27.915709 | \n",
" 0.694268 | \n",
" 0.649682 | \n",
" 44.479818 | \n",
"
\n",
" \n",
" male | \n",
" 454.147314 | \n",
" 0.188908 | \n",
" 2.389948 | \n",
" 30.726645 | \n",
" 0.429809 | \n",
" 0.235702 | \n",
" 25.523893 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" PassengerId Survived Pclass Age SibSp Parch \\\n",
"Sex \n",
"female 431.028662 0.742038 2.159236 27.915709 0.694268 0.649682 \n",
"male 454.147314 0.188908 2.389948 30.726645 0.429809 0.235702 \n",
"\n",
" Fare \n",
"Sex \n",
"female 44.479818 \n",
"male 25.523893 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.groupby(\"Sex\").mean(numeric_only=True)"
]
},
{
"cell_type": "markdown",
"id": "2973eb05-f3c5-49cd-9840-1d2d05971445",
"metadata": {},
"source": [
"It does not make much sense to get the average value of the 'Pclass'. If we are only interested in the average age for each sex, the selection of columns (rectangular brackets `[]` as usual) is supported on the grouped data as well:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "e2405337-1719-4950-8f59-1c5f389d008b",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/plain": [
"Sex\n",
"female 27.915709\n",
"male 30.726645\n",
"Name: Age, dtype: float64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.groupby(\"Sex\")[\"Age\"].mean()"
]
},
{
"cell_type": "markdown",
"id": "6e65a34d-9ec1-42ad-865c-607518520afa",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "18a23998-b0ea-47b2-aaa1-37163124a14b",
"metadata": {},
"source": [
"````{admonition} Categorical data\n",
":class: seealso dropdown\n",
"\n",
"The `Pclass` column in the Titanic dataset contains numerical values, but these numbers actually represent three distinct categories: 1st class, 2nd class, and 3rd class. Since these numbers correspond to categories rather than actual quantities, calculating statistics like the mean or sum on them doesn't make much sense. For instance, averaging these numbers would yield a meaningless result because class rankings are ordinal and not inherently numerical.\n",
"\n",
"Pandas provides a `Categorical` data type specifically designed to handle such categorical data. This is useful because it enables more efficient storage and manipulation of categorical variables. It also allows you to apply specific operations that make sense for categories, such as ordering or grouping, without misinterpreting the data as continuous numerical values.\n",
"\n",
"To convert the `Pclass` column from numerical to categorical data in Pandas, you can use the following code:\n",
"\n",
"```python\n",
"titanic['Pclass'] = titanic['Pclass'].astype('category')\n",
"```\n",
"\n",
"This will treat `Pclass` as a categorical variable, allowing Pandas to handle it appropriately for analysis. For example, you can easily perform operations like counting the number of passengers in each class or plotting the distribution of passenger classes, with results that correctly reflect the categorical nature of the data.\n",
"\n",
"More information is provided in the official Pandas documentation on [Categorical data](https://pandas.pydata.org/docs/user_guide/categorical.html).\n",
"````"
]
},
{
"cell_type": "markdown",
"id": "03e18c29-943c-4bc4-ab7e-f312126fe3b2",
"metadata": {},
"source": [
"What is the mean ticket fare price for each of the sex and cabin class combinations?"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "88b0c068-e8b9-43c6-8e9b-eceda26954fb",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/plain": [
"Sex Pclass\n",
"female 1 106.125798\n",
" 2 21.970121\n",
" 3 16.118810\n",
"male 1 67.226127\n",
" 2 19.741782\n",
" 3 12.661633\n",
"Name: Fare, dtype: float64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.groupby([\"Sex\", \"Pclass\"])[\"Fare\"].mean()"
]
},
{
"cell_type": "markdown",
"id": "fab1d6d6-a73f-48b5-a495-099d5be83fa8",
"metadata": {},
"source": [
"Grouping can be done by multiple columns at the same time. Provide the column names as a list to the `groupby()` method."
]
},
{
"cell_type": "markdown",
"id": "090e2641-d22d-4746-ab41-98f7f745ef63",
"metadata": {},
"source": [
"```{admonition} Groupby operations\n",
":class: seealso dropdown\n",
"\n",
"A full description on the split-apply-combine approach is provided in the official Pandas documentation on [Groupby operations](https://pandas.pydata.org/docs/user_guide/groupby.html)."
]
},
{
"cell_type": "markdown",
"id": "25b13998-9c7b-4faf-a173-f036fa6a5605",
"metadata": {},
"source": [
"## Count number of records by category\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"id": "9c83b1bb-0ce1-4642-afe2-c18dafb1606f",
"metadata": {},
"source": [
"What is the number of passengers in each of the cabin classes?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "d90d447c-1222-47cd-99ca-b9988ab6af32",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/plain": [
"Pclass\n",
"3 491\n",
"1 216\n",
"2 184\n",
"Name: count, dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic[\"Pclass\"].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "d662c6e7-775a-401a-9c81-6e47376f5fc5",
"metadata": {},
"source": [
"The `value_counts()` method counts the number of records for each category in a column."
]
},
{
"cell_type": "markdown",
"id": "a6c8471e-b4b7-4200-8588-385209de4902",
"metadata": {},
"source": [
"The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records within each group:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "87660d64-9a90-494f-8f80-f54bb89cab33",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/plain": [
"Pclass\n",
"1 216\n",
"2 184\n",
"3 491\n",
"Name: Pclass, dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.groupby(\"Pclass\")[\"Pclass\"].count()"
]
},
{
"cell_type": "markdown",
"id": "0ace2489-7767-4c29-91ab-a597332e1fc7",
"metadata": {},
"source": [
"```{note}\n",
"\n",
"Both `size()` and `count()` can be used in combination with `groupby()`. Whereas `size` includes `NaN` values and just provides the number of rows (size of the table), `count()` excludes the missing values. In the `value_counts()` method, use the `dropna` parameter to include or exclude the `NaN` values."
]
},
{
"cell_type": "markdown",
"id": "11614db4-3dec-4a57-8ac5-81abea69d36b",
"metadata": {},
"source": [
"## Key points\n",
"\n",
"* Descriptive statistics can be calculated on entire columns or rows.\n",
"* `groupby()` provides the power of the **split-apply-combine** pattern.\n",
"* `value_counts()` is a convenient shortcut to count the number of entries in each category of a variable."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc46cc17-e319-4902-9ce0-5dcd2b0fe3a3",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}