{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a9fc7c09-6376-4bf8-abd2-7596ac805d62",
   "metadata": {},
   "source": [
    "# Descriptive Statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f4c0f2a-e065-494f-acfe-f343801f9a69",
   "metadata": {},
   "source": [
    "```{admonition} Learning Objectives\n",
    "\n",
    "Questions:\n",
    "* How can we calculate and analyse descriptive statistics using Pandas?\n",
    "\n",
    "Objectives:\n",
    "* Use built-in methods for descriptive statistics.\n",
    "* Use the split-apply-combine pattern to perform descriptive statistics on groupings of data.\n",
    "* Use different count methods on groupings of data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ffbfee4-8515-4ccf-9b92-77cc164cb850",
   "metadata": {},
   "source": [
    "## Import Pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "641047cc-98b3-42c1-87ee-4333d820e7fd",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [],
   "source": [
    "# This line imports the pandas library and aliases it as 'pd'.\n",
    "\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5ed751e-ec4b-487d-8150-4a0d5bc42275",
   "metadata": {},
   "source": [
    "___"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60a03b66-78bb-4fe9-a1a2-3b1fb625ba6a",
   "metadata": {},
   "source": [
    "## Creating a `DataFrame` from a CSV file"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a37391f9-2438-4d1e-8d55-2c46d0fc6b71",
   "metadata": {},
   "source": [
    "The `pd.read_csv()` function is used to read the data from the file 'data/titanic.csv'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "18196144-f324-4fc3-b575-7857c1138dbf",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [],
   "source": [
    "# Load the Titanic dataset from a CSV file into a DataFrame named 'titanic'.\n",
    "\n",
    "titanic = pd.read_csv('data/titanic.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "fc424032-fbf3-4cc2-b9b8-1a0ce3f6031b",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Braund, Mr. Owen Harris</td>\n",
       "      <td>male</td>\n",
       "      <td>22.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>A/5 21171</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
       "      <td>female</td>\n",
       "      <td>38.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>PC 17599</td>\n",
       "      <td>71.2833</td>\n",
       "      <td>C85</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Heikkinen, Miss. Laina</td>\n",
       "      <td>female</td>\n",
       "      <td>26.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>STON/O2. 3101282</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
       "      <td>female</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>113803</td>\n",
       "      <td>53.1000</td>\n",
       "      <td>C123</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Allen, Mr. William Henry</td>\n",
       "      <td>male</td>\n",
       "      <td>35.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>373450</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>886</th>\n",
       "      <td>887</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>Montvila, Rev. Juozas</td>\n",
       "      <td>male</td>\n",
       "      <td>27.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>211536</td>\n",
       "      <td>13.0000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>887</th>\n",
       "      <td>888</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Graham, Miss. Margaret Edith</td>\n",
       "      <td>female</td>\n",
       "      <td>19.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>112053</td>\n",
       "      <td>30.0000</td>\n",
       "      <td>B42</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>888</th>\n",
       "      <td>889</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
       "      <td>female</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>W./C. 6607</td>\n",
       "      <td>23.4500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>889</th>\n",
       "      <td>890</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Behr, Mr. Karl Howell</td>\n",
       "      <td>male</td>\n",
       "      <td>26.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>111369</td>\n",
       "      <td>30.0000</td>\n",
       "      <td>C148</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>890</th>\n",
       "      <td>891</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Dooley, Mr. Patrick</td>\n",
       "      <td>male</td>\n",
       "      <td>32.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>370376</td>\n",
       "      <td>7.7500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Q</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>891 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     PassengerId  Survived  Pclass  \\\n",
       "0              1         0       3   \n",
       "1              2         1       1   \n",
       "2              3         1       3   \n",
       "3              4         1       1   \n",
       "4              5         0       3   \n",
       "..           ...       ...     ...   \n",
       "886          887         0       2   \n",
       "887          888         1       1   \n",
       "888          889         0       3   \n",
       "889          890         1       1   \n",
       "890          891         0       3   \n",
       "\n",
       "                                                  Name     Sex   Age  SibSp  \\\n",
       "0                              Braund, Mr. Owen Harris    male  22.0      1   \n",
       "1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   \n",
       "2                               Heikkinen, Miss. Laina  female  26.0      0   \n",
       "3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   \n",
       "4                             Allen, Mr. William Henry    male  35.0      0   \n",
       "..                                                 ...     ...   ...    ...   \n",
       "886                              Montvila, Rev. Juozas    male  27.0      0   \n",
       "887                       Graham, Miss. Margaret Edith  female  19.0      0   \n",
       "888           Johnston, Miss. Catherine Helen \"Carrie\"  female   NaN      1   \n",
       "889                              Behr, Mr. Karl Howell    male  26.0      0   \n",
       "890                                Dooley, Mr. Patrick    male  32.0      0   \n",
       "\n",
       "     Parch            Ticket     Fare Cabin Embarked  \n",
       "0        0         A/5 21171   7.2500   NaN        S  \n",
       "1        0          PC 17599  71.2833   C85        C  \n",
       "2        0  STON/O2. 3101282   7.9250   NaN        S  \n",
       "3        0            113803  53.1000  C123        S  \n",
       "4        0            373450   8.0500   NaN        S  \n",
       "..     ...               ...      ...   ...      ...  \n",
       "886      0            211536  13.0000   NaN        S  \n",
       "887      0            112053  30.0000   B42        S  \n",
       "888      2        W./C. 6607  23.4500   NaN        S  \n",
       "889      0            111369  30.0000  C148        C  \n",
       "890      0            370376   7.7500   NaN        Q  \n",
       "\n",
       "[891 rows x 12 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display the DataFrame 'titanic'.\n",
    "\n",
    "titanic"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "305b5f59-68b8-4a8b-88a2-28dc370510b8",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "___"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6200516-7871-4985-a3d5-634d623d4e60",
   "metadata": {},
   "source": [
    "## Aggregating statistics\n",
    "\n",
    "![Aggregating statistics from a single coloumn in a DataFrame](images/04_aggregate.svg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b9be4d7-6a34-4d95-8a37-8c45ca2e4bc0",
   "metadata": {},
   "source": [
    "Pandas offers built-in methods that allow you to quickly generate summary statistics for your data. These methods make it easy to analyse and summarise data with just a few lines of code.\n",
    "\n",
    "For example, we can use `.mean()` to find the average age of the Titanic passengers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "9705ecac-f8d8-4e08-abbd-10499b3ee8fa",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "np.float64(29.69911764705882)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "titanic[\"Age\"].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59c03aca-a6b2-44ce-adda-7827162f0d4b",
   "metadata": {},
   "source": [
    "Different statistics are available and can be applied to columns with numerical data. Operations in general exclude missing data and operate across rows by default."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78f45a26-4d3c-423f-8c05-e0d775220222",
   "metadata": {},
   "source": [
    "![Aggregating statistics from a multiple coloumns in a DataFrame](images/04_reduction.svg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5cf95abb-6db5-4531-9575-fd706c39ecda",
   "metadata": {},
   "source": [
    "What is the median age and ticket fare price of the Titanic passengers?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "c51ac34b-37d1-4fc1-9c1e-6447cafa7aff",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Age     28.0000\n",
       "Fare    14.4542\n",
       "dtype: float64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "titanic[[\"Age\", \"Fare\"]].median()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6f2ce57-6107-4759-abc1-077b33f71fe2",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "The statistic applied to multiple columns of a `DataFrame` (the selection of two columns returns a `DataFrame`, see the [Subsets tutorial](03_pandas_subsets)) is calculated for each numeric column."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18813351-2fb9-4201-b69a-9fdd61adef72",
   "metadata": {},
   "source": [
    "Multiple descriptive statistics can easily be calculated for one or more columns at the same time. For this we can use the `describe()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "5cc3d857-9507-4cd4-9d13-dd45b439af40",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "      <th>Fare</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>714.000000</td>\n",
       "      <td>891.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>29.699118</td>\n",
       "      <td>32.204208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>14.526497</td>\n",
       "      <td>49.693429</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.420000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>20.125000</td>\n",
       "      <td>7.910400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>28.000000</td>\n",
       "      <td>14.454200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>38.000000</td>\n",
       "      <td>31.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>80.000000</td>\n",
       "      <td>512.329200</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Age        Fare\n",
       "count  714.000000  891.000000\n",
       "mean    29.699118   32.204208\n",
       "std     14.526497   49.693429\n",
       "min      0.420000    0.000000\n",
       "25%     20.125000    7.910400\n",
       "50%     28.000000   14.454200\n",
       "75%     38.000000   31.000000\n",
       "max     80.000000  512.329200"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "titanic[[\"Age\", \"Fare\"]].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb47036c-23b8-4da5-99ed-e180e1dfd159",
   "metadata": {},
   "source": [
    "Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the `DataFrame.agg()` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "b6f7f96a-3502-4e40-b37b-7b0db6775665",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "      <th>Fare</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.420000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>80.000000</td>\n",
       "      <td>512.329200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>median</th>\n",
       "      <td>28.000000</td>\n",
       "      <td>14.454200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>skew</th>\n",
       "      <td>0.389108</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>NaN</td>\n",
       "      <td>32.204208</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Age        Fare\n",
       "min      0.420000    0.000000\n",
       "max     80.000000  512.329200\n",
       "median  28.000000   14.454200\n",
       "skew     0.389108         NaN\n",
       "mean          NaN   32.204208"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "titanic.agg(\n",
    "    {\n",
    "        \"Age\": [\"min\", \"max\", \"median\", \"skew\"],\n",
    "        \"Fare\": [\"min\", \"max\", \"median\", \"mean\"],\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99f41c0b-c1b4-48ba-8de4-6455c158e707",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "```{admonition} Descriptive statistics\n",
    ":class: seealso dropdown\n",
    "\n",
    "Details about descriptive statistics can be found in the official Pandas documentation on [Descriptive statistics](https://pandas.pydata.org/docs/user_guide/basics.html#basics-stats).\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2934055-47ba-4b82-a888-f5991eba1079",
   "metadata": {},
   "source": [
    "___"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a53e5e8-f18b-48d7-a152-213dc3b1f959",
   "metadata": {},
   "source": [
    "## Aggregating statistics grouped by category\n",
    "\n",
    "![Aggregating statistics grouped by category](images/04_groupby.svg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c336bd6-e53b-4b3d-adbe-8e119c26f9cb",
   "metadata": {},
   "source": [
    "Aggregating statistics according to the grouped by category is helpful when you want to e.g. compare the avarage age for different groups of passenger. Take a look at the example below:\n",
    "\n",
    "What is the average age for male versus female Titanic passengers?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "8dc2bb4f-52df-4ae0-900d-b16253cb1900",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sex</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>female</th>\n",
       "      <td>27.915709</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>male</th>\n",
       "      <td>30.726645</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Age\n",
       "Sex              \n",
       "female  27.915709\n",
       "male    30.726645"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "titanic[[\"Sex\", \"Age\"]].groupby(\"Sex\").mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b50118ae-5aae-4845-9b60-ce7db7fc75d9",
   "metadata": {},
   "source": [
    "As our interest is the average age for each sex, a subselection on these two columns is made first: `titanic[[\"Sex\", \"Age\"]]`.\\\n",
    "Next, the `groupby()` method is applied on the 'Sex' column to make a group per category. The average age for each sex is calculated and returned."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a10e043d-e43f-4fdf-95bd-f53d64594aa6",
   "metadata": {},
   "source": [
    "Calculating a given statistic (e.g. `mean()` age) *for each category in a column* (e.g. male/female in the 'Sex' column) is a common pattern. The `groupby()` method is used to support this type of operations. This fits in the more general **split-apply-combine** pattern:\n",
    "\n",
    "* **Split** the data into groups.\n",
    "* **Apply** a function to each group independently.\n",
    "* **Combine** the results into a data structure.\n",
    "\n",
    "The apply and combine steps are typically done together in Pandas."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "777efc81-4d5d-4ec6-9dec-29e6990c0047",
   "metadata": {},
   "source": [
    "In the previous example, we explicitly selected the 2 columns first. If not, the `mean()` method is applied to each column containing numerical columns by passing `numeric_only=True`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "8800c4f7-8ed5-4fac-b3dc-ee8feff70323",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Fare</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sex</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>female</th>\n",
       "      <td>431.028662</td>\n",
       "      <td>0.742038</td>\n",
       "      <td>2.159236</td>\n",
       "      <td>27.915709</td>\n",
       "      <td>0.694268</td>\n",
       "      <td>0.649682</td>\n",
       "      <td>44.479818</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>male</th>\n",
       "      <td>454.147314</td>\n",
       "      <td>0.188908</td>\n",
       "      <td>2.389948</td>\n",
       "      <td>30.726645</td>\n",
       "      <td>0.429809</td>\n",
       "      <td>0.235702</td>\n",
       "      <td>25.523893</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        PassengerId  Survived    Pclass        Age     SibSp     Parch  \\\n",
       "Sex                                                                      \n",
       "female   431.028662  0.742038  2.159236  27.915709  0.694268  0.649682   \n",
       "male     454.147314  0.188908  2.389948  30.726645  0.429809  0.235702   \n",
       "\n",
       "             Fare  \n",
       "Sex                \n",
       "female  44.479818  \n",
       "male    25.523893  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "titanic.groupby(\"Sex\").mean(numeric_only=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2973eb05-f3c5-49cd-9840-1d2d05971445",
   "metadata": {},
   "source": [
    "It does not make much sense to get the average value of the 'Pclass'. If we are only interested in the average age for each sex, the selection of columns (rectangular brackets `[]` as usual) is supported on the grouped data as well:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "e2405337-1719-4950-8f59-1c5f389d008b",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Sex\n",
       "female    27.915709\n",
       "male      30.726645\n",
       "Name: Age, dtype: float64"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "titanic.groupby(\"Sex\")[\"Age\"].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e65a34d-9ec1-42ad-865c-607518520afa",
   "metadata": {},
   "source": [
    "![The split-apply-combine pattern](images/04_groupby_select_detail.svg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18a23998-b0ea-47b2-aaa1-37163124a14b",
   "metadata": {},
   "source": [
    "````{admonition} Categorical data\n",
    ":class: seealso dropdown\n",
    "\n",
    "The `Pclass` column in the Titanic dataset contains numerical values, but these numbers actually represent three distinct categories: 1st class, 2nd class, and 3rd class. Since these numbers correspond to categories rather than actual quantities, calculating statistics like the mean or sum on them doesn't make much sense. For instance, averaging these numbers would yield a meaningless result because class rankings are ordinal and not inherently numerical.\n",
    "\n",
    "Pandas provides a `Categorical` data type specifically designed to handle such categorical data. This is useful because it enables more efficient storage and manipulation of categorical variables. It also allows you to apply specific operations that make sense for categories, such as ordering or grouping, without misinterpreting the data as continuous numerical values.\n",
    "\n",
    "To convert the `Pclass` column from numerical to categorical data in Pandas, you can use the following code:\n",
    "\n",
    "```python\n",
    "titanic['Pclass'] = titanic['Pclass'].astype('category')\n",
    "```\n",
    "\n",
    "This will treat `Pclass` as a categorical variable, allowing Pandas to handle it appropriately for analysis. For example, you can easily perform operations like counting the number of passengers in each class or plotting the distribution of passenger classes, with results that correctly reflect the categorical nature of the data.\n",
    "\n",
    "More information is provided in the official Pandas documentation on [Categorical data](https://pandas.pydata.org/docs/user_guide/categorical.html).\n",
    "````"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03e18c29-943c-4bc4-ab7e-f312126fe3b2",
   "metadata": {},
   "source": [
    "What is the mean ticket fare price for each of the sex and cabin class combinations?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "88b0c068-e8b9-43c6-8e9b-eceda26954fb",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Sex     Pclass\n",
       "female  1         106.125798\n",
       "        2          21.970121\n",
       "        3          16.118810\n",
       "male    1          67.226127\n",
       "        2          19.741782\n",
       "        3          12.661633\n",
       "Name: Fare, dtype: float64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "titanic.groupby([\"Sex\", \"Pclass\"])[\"Fare\"].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fab1d6d6-a73f-48b5-a495-099d5be83fa8",
   "metadata": {},
   "source": [
    "Grouping can be done by multiple columns at the same time. Provide the column names as a list to the `groupby()` method."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "090e2641-d22d-4746-ab41-98f7f745ef63",
   "metadata": {},
   "source": [
    "```{admonition} Groupby operations\n",
    ":class: seealso dropdown\n",
    "\n",
    "A full description on the split-apply-combine approach is provided in the official Pandas documentation on [Groupby operations](https://pandas.pydata.org/docs/user_guide/groupby.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25b13998-9c7b-4faf-a173-f036fa6a5605",
   "metadata": {},
   "source": [
    "## Count number of records by category\n",
    "\n",
    "![Counting number of records by category](images/04_valuecounts.svg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c83b1bb-0ce1-4642-afe2-c18dafb1606f",
   "metadata": {},
   "source": [
    "What is the number of passengers in each of the cabin classes?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "d90d447c-1222-47cd-99ca-b9988ab6af32",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pclass\n",
       "3    491\n",
       "1    216\n",
       "2    184\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "titanic[\"Pclass\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d662c6e7-775a-401a-9c81-6e47376f5fc5",
   "metadata": {},
   "source": [
    "The `value_counts()` method counts the number of records for each category in a column."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6c8471e-b4b7-4200-8588-385209de4902",
   "metadata": {},
   "source": [
    "The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records within each group:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "87660d64-9a90-494f-8f80-f54bb89cab33",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pclass\n",
       "1    216\n",
       "2    184\n",
       "3    491\n",
       "Name: Pclass, dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "titanic.groupby(\"Pclass\")[\"Pclass\"].count()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ace2489-7767-4c29-91ab-a597332e1fc7",
   "metadata": {},
   "source": [
    "```{note}\n",
    "\n",
    "Both `size()` and `count()` can be used in combination with `groupby()`. Whereas `size` includes `NaN` values and just provides the number of rows (size of the table), `count()` excludes the missing values. In the `value_counts()` method, use the `dropna` parameter to include or exclude the `NaN` values."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11614db4-3dec-4a57-8ac5-81abea69d36b",
   "metadata": {},
   "source": [
    "## Key points\n",
    "\n",
    "* Descriptive statistics can be calculated on entire columns or rows.\n",
    "* `groupby()` provides the power of the **split-apply-combine** pattern.\n",
    "* `value_counts()` is a convenient shortcut to count the number of entries in each category of a variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc46cc17-e319-4902-9ce0-5dcd2b0fe3a3",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}