Read and Write Tabular Data

Read and Write Tabular Data#

How to read and write tabular data

Learning Objectives

Questions:

How can we read, inspect, and export tabular data using Pandas?

Objectives:

Load tabular data from a spreadsheet into a DataFrame.
Inspect the structure and contents of your dataset.
Identify missing data within your dataset.
Apply filters to extract specific subsets of data based on conditions.
Save and export your dataset to various file formats.

Download the Titanic data#

For this page and the ones that follow, we will be working with the Titanic dataset. The full dataset can be found on Kaggle, but we will only be using the training data, which we have made available here.

To follow the exercises and use the code exactly as written below, you will need to download the data and place it in a folder called ‘data’ in the same directory as your Jupyter notebook. In other words, the folder where your notebook is located should contain a subfolder named ‘data’, which holds the file ‘titanic.csv’.

Import Pandas#

# This line imports the pandas library and aliases it as 'pd'.

import pandas as pd

Current working directory#

The %pwd command (print working directory) is known as a “magic command” in Jupyter Notebook.
You can read more about magic commands here.

It is used to display the current working directory (cwd) within the notebook environment.

This is the folder where the Jupyter Notebook is running.
Understanding the current working directory is important for file operations like reading and writing files.

%pwd

Creating a `DataFrame` from a CSV file#

CSV (Comma Separated Values) is a simple file format used to store tabular data, such as a spreadsheet or database.

Getting information about the `DataFrame`#

Besides displaying the the first and last rows of your DataFramethere are a few other attributes which can be useful for getting to know your data.

`shape`#

The shape attribute can be used to determine the dimensions of the DataFrame ‘titanic’.
It returns a tuple representing the number of rows and columns (rows, columns).

titanic.shape

`head()`#

The head() method displays the first five rows of the ‘titanic’ DataFrame.
This provides a quick overview of the dataset, showing the column names and the first rows of data.

titanic.head()

Show code cell output Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

`tail()`#

Using the tail() method and specifying 3 as argument will display the last three rows of the ‘titanic’ DataFrame.
This allows us to inspect the end of the dataset, showing the last rows of data.

titanic.tail(3)

Show code cell output Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.45	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.00	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.75	NaN	Q

A note on head() and tail()

By default, Pandas limits the display of rows in a DataFrame to a maximum of 60 rows.

Calling a number higher than 60 will display the first five rows followed by an ellipsis (...), indicating that rows are being skipped, and then display the last five rows of the selection.

To display all rows without truncation, you can set the max_rows option to None using pd.set_option():

pd.set_option('display.max_rows', None)

Likewise, Pandas limits the display of columns in a DataFrame to a maximum of 20 columns.
To display all columns without truncation, you can set the max_columns option to None using pd.set_option():

pd.set_option('display.max_columns', None)

`dtypes`#

The dtypes attribute can be used to view the data types of each column in the ‘titanic’ DataFrame.
This command provides information about the data type of each column, such as integer, float, or object (string). The data type in each colum is useful to know especially if you are working with data that you did not create yourself. Moreover, this is important for performing statistics and calculations later on, as this can only be done on integers and floats.

titanic.dtypes

`info()`#

The info() method is used to display a concise summary of the ‘titanic’ DataFrame.
This command provides essential information about the DataFrame, including the number of non-null values in each column, the data type of each column, and memory usage.

titanic.info()

`notna()` and `isna()`#

The notna() conditional function returns a True for each row where the values are not a null value (i.e., not NaN).

Use the notna() method to create a boolean DataFrame indicating whether each element in ‘titanic’ is not null.
This command returns a DataFrame of the same shape as ‘titanic’, where True indicates a non-null value, and False indicates a null value.

titanic.notna()

Show code cell output Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	True	True	True	True	True	True	True	True	True	True	False	True
1	True	True	True	True	True	True	True	True	True	True	True	True
2	True	True	True	True	True	True	True	True	True	True	False	True
3	True	True	True	True	True	True	True	True	True	True	True	True
4	True	True	True	True	True	True	True	True	True	True	False	True
...	...	...	...	...	...	...	...	...	...	...	...	...
886	True	True	True	True	True	True	True	True	True	True	False	True
887	True	True	True	True	True	True	True	True	True	True	True	True
888	True	True	True	True	True	False	True	True	True	True	False	True
889	True	True	True	True	True	True	True	True	True	True	True	True
890	True	True	True	True	True	True	True	True	True	True	False	True

891 rows × 12 columns

Use the notna() method followed by sum() to count the non-null values in each column of the ‘titanic’ DataFrame.

This command calculates the sum of True values (non-null) along each column axis, providing a count of non-null values for each column.

titanic.notna().sum()

The isna() conditional function returns a True for each row where the values are a null value (i.e., NaN).

Use the isna() method to create a boolean DataFrame indicating whether each element in ‘titanic’ is null.
This command returns a DataFrame of the same shape as ‘titanic’, where True indicates a null value, and False indicates a non-null value.

titanic.isna()

Show code cell output Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	False	False	False	False	False	False	False	False	False	False	True	False
1	False	False	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False	True	False
3	False	False	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False	True	False
...	...	...	...	...	...	...	...	...	...	...	...	...
886	False	False	False	False	False	False	False	False	False	False	True	False
887	False	False	False	False	False	False	False	False	False	False	False	False
888	False	False	False	False	False	True	False	False	False	False	True	False
889	False	False	False	False	False	False	False	False	False	False	False	False
890	False	False	False	False	False	False	False	False	False	False	True	False

891 rows × 12 columns

Use the isna() method followed by sum() to count the null values in each column of the ‘titanic’ DataFrame.
This command calculates the sum of True values (null) along each column axis, providing a count of null values for each column.

titanic.isna().sum()

Filtering on column value#

We can filter a DataFrame to select rows where a specific column contains a specific value.

This command returns a subset of the DataFrame containing only the rows where the ‘PassengerId’ column has the value of 666:

titanic[titanic['PassengerId'] == 666]

Show code cell output Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
665	666	0	2	Hickman, Mr. Lewis	male	32.0	2	0	S.O.C. 14879	73.5	NaN	S

Exercise#

Extract the two passengers whose point of departure (‘Embarked’) is unknown (NaN)

Solution

titanic[titanic['Embarked'].isna()]

This solution uses boolean indexing to filter the DataFrame ‘titanic’, extracting only the rows where the ‘Embarked’ column contains missing values (NaN).
This approach allows us to identify the passengers whose point of departure is unknown.

Reading and writing Excel files#

The to_excel() method is used to save the DataFrame to an Excel file.

The argument index=False specifies that we do not want to include the row index in the Excel file.

# Write the DataFrame 'titanic' to an Excel file named 'titanic.xlsx'.

titanic.to_excel('titanic.xlsx', index=False)

We can read data from an Excel file into a DataFrame using the read_excel() function.

pd.read_excel('titanic.xlsx')

Show code cell output Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns

Note, that we are only reading the data, not storing it in a variable.
To do so, we could write:

new_df = pd.read_excel('titanic.xlsx')

Key points#

Reading data into Pandas from many different file formats or data sources is supported by read_* functions.
The head(), tail(), and info() methods are convenient for a first check.
The notna() and isna() methods are useful for finding and isolating missing data.
Exporting data out of pandas is provided by different to_* methods.

Read and Write Tabular Data

Contents

Read and Write Tabular Data#

Download the Titanic data#

Import Pandas#

Current working directory#

Creating a DataFrame from a CSV file#

Getting information about the DataFrame#

shape#

head()#

tail()#

dtypes#

info()#

notna() and isna()#

Filtering on column value#

Exercise#

Reading and writing Excel files#

Key points#

Creating a `DataFrame` from a CSV file#

Getting information about the `DataFrame`#

`shape`#

`head()`#

`tail()`#

`dtypes`#

`info()`#

`notna()` and `isna()`#