Read and Write Tabular Data#
Learning Objectives
Questions:
How can we read, inspect, and export tabular data using Pandas?
Objectives:
Load tabular data from a spreadsheet into a
DataFrame
.Inspect the structure and contents of your dataset.
Identify missing data within your dataset.
Apply filters to extract specific subsets of data based on conditions.
Save and export your dataset to various file formats.
Download the Titanic data#
For this page and the ones that follow, we will be working with the Titanic dataset. The full dataset can be found on Kaggle, but we will only be using the training data, which we have made available here.
To follow the exercises and use the code exactly as written below, you will need to download the data and place it in a folder called ‘data’ in the same directory as your Jupyter notebook. In other words, the folder where your notebook is located should contain a subfolder named ‘data’, which holds the file ‘titanic.csv’.
Import Pandas#
# This line imports the pandas library and aliases it as 'pd'.
import pandas as pd
Current working directory#
The %pwd
command (print working directory) is known as a “magic command” in Jupyter Notebook.
You can read more about magic commands here.
It is used to display the current working directory (cwd
) within the notebook environment.
This is the folder where the Jupyter Notebook is running.
Understanding the current working directory is important for file operations like reading and writing files.
%pwd
Creating a DataFrame
from a CSV file#
CSV (Comma Separated Values) is a simple file format used to store tabular data, such as a spreadsheet or database.
See also
For more information on working with CSV files in Python, see GeeksforGeeks.
The pd.read_csv()
function is used to read the data from the file ‘data/titanic.csv’.
The file is located in the ‘data’ directory, relative to the current working directory.
I.e., in the folder where your notebook is located, you should have a folder named ‘data’ containing the file ‘titanic.csv’.
The resulting DataFrame
‘titanic’ contains the dataset, ready for analysis and manipulation.
# Load the Titanic dataset from a CSV file into a DataFrame named 'titanic'.
titanic = pd.read_csv('data/titanic.csv')
# Display the DataFrame 'titanic'.
titanic
Show code cell output
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
Note, even though we only see the first and last five rows, we actually read the whole DataFrame
into the kernel’s memory.
Memory usage
The pressure on memory usage can be alleviated by using the head()
method described below.
However, this will only be an issue with very large datasets, so do not worry too much about it for now.
You can find out how much memory a DataFrame
uses by using the memory_usage()
method:
titanic.memory_usage(deep=True).sum()
Getting information about the DataFrame
#
Besides displaying the the first and last rows of your DataFrame
there are a few other attributes which can be useful for getting to know your data.
shape
#
The shape
attribute can be used to determine the dimensions of the DataFrame
‘titanic’.
It returns a tuple representing the number of rows and columns (rows, columns).
titanic.shape
Show code cell output
(891, 12)
head()
#
The head()
method displays the first five rows of the ‘titanic’ DataFrame
.
This provides a quick overview of the dataset, showing the column names and the first rows of data.
titanic.head()
Show code cell output
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
tail()
#
Using the tail()
method and specifying 3 as argument will display the last three rows of the ‘titanic’ DataFrame
.
This allows us to inspect the end of the dataset, showing the last rows of data.
titanic.tail(3)
Show code cell output
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |
A note on head() and tail()
By default, Pandas limits the display of rows in a DataFrame
to a maximum of 60 rows.
Calling a number higher than 60 will display the first five rows followed by an ellipsis (...
), indicating that rows are being skipped, and then display the last five rows of the selection.
To display all rows without truncation, you can set the max_rows
option to None
using pd.set_option()
:
pd.set_option('display.max_rows', None)
Likewise, Pandas limits the display of columns in a DataFrame
to a maximum of 20 columns.
To display all columns without truncation, you can set the max_columns
option to None
using pd.set_option()
:
pd.set_option('display.max_columns', None)
dtypes
#
The dtypes
attribute can be used to view the data types of each column in the ‘titanic’ DataFrame
.
This command provides information about the data type of each column, such as integer, float, or object (string). The data type in each colum is useful to know especially if you are working with data that you did not create yourself. Moreover, this is important for performing statistics and calculations later on, as this can only be done on integers and floats.
titanic.dtypes
Show code cell output
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
info()
#
The info()
method is used to display a concise summary of the ‘titanic’ DataFrame
.
This command provides essential information about the DataFrame
, including the number of non-null values in each column, the data type of each column, and memory usage.
titanic.info()
Show code cell output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
notna()
and isna()
#
The notna()
conditional function returns a True
for each row where the values are not a null value (i.e., not NaN
).
Use the notna()
method to create a boolean DataFrame
indicating whether each element in ‘titanic’ is not null.
This command returns a DataFrame
of the same shape as ‘titanic’, where True
indicates a non-null value, and False
indicates a null value.
titanic.notna()
Show code cell output
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | True | True | True | True | True | True | True | True | True | False | True |
1 | True | True | True | True | True | True | True | True | True | True | True | True |
2 | True | True | True | True | True | True | True | True | True | True | False | True |
3 | True | True | True | True | True | True | True | True | True | True | True | True |
4 | True | True | True | True | True | True | True | True | True | True | False | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | True | True | True | True | True | True | True | True | True | True | False | True |
887 | True | True | True | True | True | True | True | True | True | True | True | True |
888 | True | True | True | True | True | False | True | True | True | True | False | True |
889 | True | True | True | True | True | True | True | True | True | True | True | True |
890 | True | True | True | True | True | True | True | True | True | True | False | True |
891 rows × 12 columns
Use the notna()
method followed by sum()
to count the non-null values in each column of the ‘titanic’ DataFrame
.
This command calculates the sum of True
values (non-null) along each column axis, providing a count of non-null values for each column.
titanic.notna().sum()
Show code cell output
PassengerId 891
Survived 891
Pclass 891
Name 891
Sex 891
Age 714
SibSp 891
Parch 891
Ticket 891
Fare 891
Cabin 204
Embarked 889
dtype: int64
The isna()
conditional function returns a True
for each row where the values are a null value (i.e., NaN
).
Use the isna()
method to create a boolean DataFrame
indicating whether each element in ‘titanic’ is null.
This command returns a DataFrame
of the same shape as ‘titanic’, where True
indicates a null value, and False
indicates a non-null value.
titanic.isna()
Show code cell output
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | True | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | True | False |
3 | False | False | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | True | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | False | False | False | False | False | False | False | False | False | False | True | False |
887 | False | False | False | False | False | False | False | False | False | False | False | False |
888 | False | False | False | False | False | True | False | False | False | False | True | False |
889 | False | False | False | False | False | False | False | False | False | False | False | False |
890 | False | False | False | False | False | False | False | False | False | False | True | False |
891 rows × 12 columns
Use the isna()
method followed by sum()
to count the null values in each column of the ‘titanic’ DataFrame
.
This command calculates the sum of True
values (null) along each column axis, providing a count of null values for each column.
titanic.isna().sum()
Show code cell output
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Filtering on column value#
We can filter a DataFrame
to select rows where a specific column contains a specific value.
This command returns a subset of the DataFrame
containing only the rows where the ‘PassengerId’ column has the value of 666:
titanic[titanic['PassengerId'] == 666]
Show code cell output
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
665 | 666 | 0 | 2 | Hickman, Mr. Lewis | male | 32.0 | 2 | 0 | S.O.C. 14879 | 73.5 | NaN | S |
Exercise#
Extract the two passengers whose point of departure (‘Embarked’) is unknown (NaN
)
Solution
titanic[titanic['Embarked'].isna()]
This solution uses boolean indexing to filter the DataFrame
‘titanic’, extracting only the rows where the ‘Embarked’ column contains missing values (NaN
).
This approach allows us to identify the passengers whose point of departure is unknown.
Reading and writing Excel files#
The to_excel()
method is used to save the DataFrame
to an Excel file.
The argument index=False
specifies that we do not want to include the row index in the Excel file.
# Write the DataFrame 'titanic' to an Excel file named 'titanic.xlsx'.
titanic.to_excel('titanic.xlsx', index=False)
We can read data from an Excel file into a DataFrame
using the read_excel()
function.
pd.read_excel('titanic.xlsx')
Show code cell output
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
Note, that we are only reading the data, not storing it in a variable.
To do so, we could write:
new_df = pd.read_excel('titanic.xlsx')
Key points#
Reading data into Pandas from many different file formats or data sources is supported by
read_*
functions.The
head()
,tail()
, andinfo()
methods are convenient for a first check.The
notna()
andisna()
methods are useful for finding and isolating missing data.Exporting data out of pandas is provided by different
to_*
methods.