Read and Write Tabular Data#

How to read and write tabular data

Learning Objectives

Questions:

  • How can we read, inspect, and export tabular data using Pandas?

Objectives:

  • Load tabular data from a spreadsheet into a DataFrame.

  • Inspect the structure and contents of your dataset.

  • Identify missing data within your dataset.

  • Apply filters to extract specific subsets of data based on conditions.

  • Save and export your dataset to various file formats.

Download the Titanic data#

For this page and the ones that follow, we will be working with the Titanic dataset. The full dataset can be found on Kaggle, but we will only be using the training data, which we have made available here.

To follow the exercises and use the code exactly as written below, you will need to download the data and place it in a folder called ‘data’ in the same directory as your Jupyter notebook. In other words, the folder where your notebook is located should contain a subfolder named ‘data’, which holds the file ‘titanic.csv’.


Import Pandas#

# This line imports the pandas library and aliases it as 'pd'.

import pandas as pd

Current working directory#

The %pwd command (print working directory) is known as a “magic command” in Jupyter Notebook.
You can read more about magic commands here.

It is used to display the current working directory (cwd) within the notebook environment.

This is the folder where the Jupyter Notebook is running.
Understanding the current working directory is important for file operations like reading and writing files.

%pwd

Creating a DataFrame from a CSV file#

CSV (Comma Separated Values) is a simple file format used to store tabular data, such as a spreadsheet or database.

See also

For more information on working with CSV files in Python, see GeeksforGeeks.

The pd.read_csv() function is used to read the data from the file ‘data/titanic.csv’.

The file is located in the ‘data’ directory, relative to the current working directory.
I.e., in the folder where your notebook is located, you should have a folder named ‘data’ containing the file ‘titanic.csv’.

The resulting DataFrame ‘titanic’ contains the dataset, ready for analysis and manipulation.

# Load the Titanic dataset from a CSV file into a DataFrame named 'titanic'.

titanic = pd.read_csv('data/titanic.csv')
# Display the DataFrame 'titanic'.

titanic
Hide code cell output
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

Note, even though we only see the first and last five rows, we actually read the whole DataFrame into the kernel’s memory.


Getting information about the DataFrame#

Besides displaying the the first and last rows of your DataFramethere are a few other attributes which can be useful for getting to know your data.

shape#

The shape attribute can be used to determine the dimensions of the DataFrame ‘titanic’.
It returns a tuple representing the number of rows and columns (rows, columns).

titanic.shape
Hide code cell output
(891, 12)

tail()#

Using the tail() method and specifying 3 as argument will display the last three rows of the ‘titanic’ DataFrame.
This allows us to inspect the end of the dataset, showing the last rows of data.

titanic.tail(3)
Hide code cell output
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q

dtypes#

The dtypes attribute can be used to view the data types of each column in the ‘titanic’ DataFrame.
This command provides information about the data type of each column, such as integer, float, or object (string). The data type in each colum is useful to know especially if you are working with data that you did not create yourself. Moreover, this is important for performing statistics and calculations later on, as this can only be done on integers and floats.

titanic.dtypes
Hide code cell output
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

info()#

The info() method is used to display a concise summary of the ‘titanic’ DataFrame.
This command provides essential information about the DataFrame, including the number of non-null values in each column, the data type of each column, and memory usage.

titanic.info()
Hide code cell output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

notna() and isna()#

The notna() conditional function returns a True for each row where the values are not a null value (i.e., not NaN).

Use the notna() method to create a boolean DataFrame indicating whether each element in ‘titanic’ is not null.
This command returns a DataFrame of the same shape as ‘titanic’, where True indicates a non-null value, and False indicates a null value.

titanic.notna()
Hide code cell output
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 True True True True True True True True True True False True
1 True True True True True True True True True True True True
2 True True True True True True True True True True False True
3 True True True True True True True True True True True True
4 True True True True True True True True True True False True
... ... ... ... ... ... ... ... ... ... ... ... ...
886 True True True True True True True True True True False True
887 True True True True True True True True True True True True
888 True True True True True False True True True True False True
889 True True True True True True True True True True True True
890 True True True True True True True True True True False True

891 rows × 12 columns

Use the notna() method followed by sum() to count the non-null values in each column of the ‘titanic’ DataFrame.

This command calculates the sum of True values (non-null) along each column axis, providing a count of non-null values for each column.

titanic.notna().sum()
Hide code cell output
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

The isna() conditional function returns a True for each row where the values are a null value (i.e., NaN).

Use the isna() method to create a boolean DataFrame indicating whether each element in ‘titanic’ is null.
This command returns a DataFrame of the same shape as ‘titanic’, where True indicates a null value, and False indicates a non-null value.

titanic.isna()
Hide code cell output
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 False False False False False False False False False False True False
1 False False False False False False False False False False False False
2 False False False False False False False False False False True False
3 False False False False False False False False False False False False
4 False False False False False False False False False False True False
... ... ... ... ... ... ... ... ... ... ... ... ...
886 False False False False False False False False False False True False
887 False False False False False False False False False False False False
888 False False False False False True False False False False True False
889 False False False False False False False False False False False False
890 False False False False False False False False False False True False

891 rows × 12 columns

Use the isna() method followed by sum() to count the null values in each column of the ‘titanic’ DataFrame.
This command calculates the sum of True values (null) along each column axis, providing a count of null values for each column.

titanic.isna().sum()
Hide code cell output
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Filtering on column value#

We can filter a DataFrame to select rows where a specific column contains a specific value.

This command returns a subset of the DataFrame containing only the rows where the ‘PassengerId’ column has the value of 666:

titanic[titanic['PassengerId'] == 666]
Hide code cell output
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
665 666 0 2 Hickman, Mr. Lewis male 32.0 2 0 S.O.C. 14879 73.5 NaN S

Exercise#

Extract the two passengers whose point of departure (‘Embarked’) is unknown (NaN)


Reading and writing Excel files#

The to_excel() method is used to save the DataFrame to an Excel file.

The argument index=False specifies that we do not want to include the row index in the Excel file.

# Write the DataFrame 'titanic' to an Excel file named 'titanic.xlsx'.

titanic.to_excel('titanic.xlsx', index=False)

We can read data from an Excel file into a DataFrame using the read_excel() function.

pd.read_excel('titanic.xlsx')
Hide code cell output
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

Note, that we are only reading the data, not storing it in a variable.
To do so, we could write:

new_df = pd.read_excel('titanic.xlsx')

Key points#

  • Reading data into Pandas from many different file formats or data sources is supported by read_* functions.

  • The head(), tail(), and info() methods are convenient for a first check.

  • The notna() and isna() methods are useful for finding and isolating missing data.

  • Exporting data out of pandas is provided by different to_* methods.