Pandas DataFrame#

Learning Objectives

Questions:

  • What is a Pandas DataFrame?

  • How is a DataFrame used to manage and manipulate data?

Objectives:

  • Understand the structure and purpose of a DataFrame.

  • Create a DataFrame from lists.

  • Perform basic DataFrame manipulations, such as renaming columns and transposing data.

  • Understand the structure and purpose of a Pandas Series.

  • Append Series as columns to an existing DataFrame and create new columns based on existing data.

Import Pandas#

# This line imports the pandas library and aliases it as 'pd'.

import pandas as pd

Pandas data table representation#

Representation of a Pandas DataFrame

A DataFrame is a two-dimensional data structure in the Pandas library, designed to hold and manage data efficiently. It organizes data into labeled columns and rows, allowing you to store various types of data, such as numbers, text, and categorical values, all within the same structure.

Creating our first DataFrame#

We start by creating three lists of equal length (i.e., containing the same number of elements).
These lists will be used as columns for a DataFrame, with each list representing a column and each element within the list representing a row in that column.

# Create three lists named 'name', 'age', and 'sex'.

name = ["Braund", "Allen", "Bonnel"]
age = [22, 35, 58]
sex = ["male", "male", "female"]

We use the lists name, age, and sex to fill in the columns.
Each list corresponds to a column in the DataFrame.
Name, Age, and Sex are the titles of these columns.

# Create a DataFrame named 'df' based on three lists.

df = pd.DataFrame({'Name': name, 'Age': age, 'Sex': sex})
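
The lists need to be of equal length because each list fills one column and every column needs a value for each row. As a small sketch of what goes wrong otherwise, the hypothetical list age_short below is missing one value, and pandas refuses to build the DataFrame:

```python
import pandas as pd

name = ["Braund", "Allen", "Bonnel"]
# Hypothetical list for illustration only: one element is missing.
age_short = [22, 35]

try:
    pd.DataFrame({'Name': name, 'Age': age_short})
    error_raised = False
except ValueError as error:
    # pandas raises a ValueError when the columns differ in length.
    error_raised = True
    print(f"Could not create the DataFrame: {error}")
```

Checking the lengths with len() before creating the DataFrame is a simple way to avoid this error.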

DataFrames and dictionaries#

A DataFrame can be created from a dictionary, as we did above. The keys in the dictionary become the column names, while the values, which are lists or arrays, form the columns’ data.

See also

For more information on dictionaries in Python, see GeeksforGeeks.
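
To make the connection to dictionaries explicit, the sketch below first stores the dictionary in a variable and then passes it to pd.DataFrame(). The names passengers and df_from_dict are chosen for this illustration only. The column names of the result match the keys of the dictionary:

```python
import pandas as pd

# 'passengers' is a placeholder name used only in this illustration.
passengers = {
    'Name': ["Braund", "Allen", "Bonnel"],
    'Age': [22, 35, 58],
    'Sex': ["male", "male", "female"],
}

df_from_dict = pd.DataFrame(passengers)

# The dictionary keys and the column names are the same, in the same order.
print(list(passengers.keys()))
print(list(df_from_dict.columns))
```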

# Display the DataFrame 'df'.

df
     Name  Age     Sex
0  Braund   22    male
1   Allen   35    male
2  Bonnel   58  female

In spreadsheet software, the table representation of our data would look very similar:

Table representation in spreadsheet software

# Check the type of the 'df' object using the 'type()' function.

type(df)
pandas.core.frame.DataFrame

Attributes#

We can use the shape attribute to determine the dimensions of the DataFrame.
It returns a tuple representing the number of rows and columns (rows, columns). This is especially helpful if you have not created the DataFrame yourself.

df.shape
(3, 3)

And we can use the dtypes attribute to view the data type of each column in the DataFrame.
It reports whether a column holds integers, floats, or objects (for example, strings). This is useful to know when you start working more in-depth with your data.

df.dtypes
Name    object
Age      int64
Sex     object
dtype: object

When asking for the shape or dtypes, no parentheses () are used. Both are attributes of DataFrame and Series. (Series will be explained later.)

Attributes of a DataFrame or Series do not need ().

Attributes represent a characteristic of a DataFrame/Series, whereas methods (which require parentheses ()) do something with the DataFrame/Series.
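
The difference can be sketched side by side. The example below rebuilds the DataFrame under the name df_example so that it runs on its own, and it uses head(), a pandas method that is not otherwise part of this lesson, purely as an example of a method:

```python
import pandas as pd

df_example = pd.DataFrame({
    'Name': ["Braund", "Allen", "Bonnel"],
    'Age': [22, 35, 58],
    'Sex': ["male", "male", "female"],
})

# Attribute: no parentheses, it reports a characteristic of the DataFrame.
n_rows, n_columns = df_example.shape
print(n_rows, n_columns)

# Method: parentheses are required, because it does something.
# head(2) returns the first two rows as a new DataFrame.
first_two = df_example.head(2)
print(first_two)

# Without parentheses the method is not called;
# the expression only refers to the method itself.
print(callable(df_example.head))
```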


Transposing a DataFrame#

The transpose() method swaps the DataFrame’s rows and columns and returns the result as a new DataFrame, which we store in ‘df_transposed’.
Transposing is useful for reshaping data, making it easier to compare rows or apply certain operations that are typically column-based.

# Transpose the DataFrame 'df' using the 'transpose()' method.

df_transposed = df.transpose()
# Display the DataFrame 'df_transposed'.

df_transposed
           0      1       2
Name  Braund  Allen  Bonnel
Age       22     35      58
Sex     male   male  female
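
Because our DataFrame has three rows and three columns, its shape does not change when it is transposed. As a small sketch, the example below uses only the first two passengers so that the swap becomes visible in the shape. It also uses .T, which pandas provides as a shorthand for transpose(). The variable names are chosen for this illustration only.

```python
import pandas as pd

# Only two passengers, so the DataFrame has 2 rows and 3 columns.
df_small = pd.DataFrame({
    'Name': ["Braund", "Allen"],
    'Age': [22, 35],
    'Sex': ["male", "male"],
})

# .T is a shorthand for .transpose().
df_small_transposed = df_small.T

print(df_small.shape)             # (2, 3)
print(df_small_transposed.shape)  # (3, 2)

# The former column names are now the row labels.
print(list(df_small_transposed.index))

# Each transposed column mixes text and numbers,
# so pandas stores it with the general 'object' data type.
print(df_small_transposed.dtypes)
```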

Renaming columns#

We can rename the columns of our DataFrame after creation.
This is done by assigning a new list of column names to df.columns.
The new column names are Names, Age, and Sex, in that order.

# Rename the columns of the DataFrame 'df'.

df.columns = ['Names', 'Age', 'Sex']

The rename() method below is useful for renaming one or more selected columns without reassigning the entire set of column names:

# Rename the 'Age' column to 'Ages' in the DataFrame 'df'.

df = df.rename(columns={'Age': 'Ages'})
# Our DataFrame now looks like this:

df
    Names  Ages     Sex
0  Braund    22    male
1   Allen    35    male
2  Bonnel    58  female
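
Note that the code above assigns the result of rename() back to df. A minimal sketch of why this is needed, using a separate DataFrame called df_example (a name chosen for this illustration): by default, rename() returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

df_example = pd.DataFrame({
    'Names': ["Braund", "Allen", "Bonnel"],
    'Age': [22, 35, 58],
    'Sex': ["male", "male", "female"],
})

# rename() returns a new DataFrame with the renamed column.
renamed = df_example.rename(columns={'Age': 'Ages'})

# The original DataFrame still has the old column name.
print(list(df_example.columns))  # ['Names', 'Age', 'Sex']
print(list(renamed.columns))     # ['Names', 'Ages', 'Sex']
```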

Series#

DataFrame Series

Each column in a DataFrame is a Series. When we access a column in a DataFrame, this actually returns a Series object containing all the data in that column.

df['Ages']
0    22
1    35
2    58
Name: Ages, dtype: int64
# Check the type of the 'Ages' column in 'df' using the 'type()' function.

type(df['Ages'])
pandas.core.series.Series
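
The difference between a Series and a DataFrame also shows in how a column is selected. As a brief sketch (with a freshly built DataFrame named df_example so that the example runs on its own): single square brackets return a Series, while double square brackets, that is, a list of column names, return a DataFrame with one column.

```python
import pandas as pd

df_example = pd.DataFrame({
    'Names': ["Braund", "Allen", "Bonnel"],
    'Ages': [22, 35, 58],
    'Sex': ["male", "male", "female"],
})

# Single brackets: one column as a Series (one-dimensional).
ages_series = df_example['Ages']
print(type(ages_series))
print(ages_series.shape)  # (3,)

# Double brackets: a DataFrame with a single column (two-dimensional).
ages_frame = df_example[['Ages']]
print(type(ages_frame))
print(ages_frame.shape)   # (3, 1)

# The Series carries the column name as its name.
print(ages_series.name)   # Ages
```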

Creating our own Series#

We can create and name a Series in the following way.
The name parameter assigns the name ‘Fare’ to the Series.

# Create a pandas Series named 'fare' with specified values.

fare = pd.Series([7.2500, 71.2833, 7.9250], name='Fare')

Displaying the Series shows the values along with their index positions, the name of the Series, and its data type:

# Display the 'fare' Series.

fare
0     7.2500
1    71.2833
2     7.9250
Name: Fare, dtype: float64
# Check the data type of 'fare' using the 'type()' function.

type(fare)
pandas.core.series.Series

Appending Series#

We can add a Series as a new column to a DataFrame, extending it horizontally.
Here, the name of the ‘fare’ Series (‘Fare’) becomes the column name in the updated DataFrame.

# Concatenate the 'fare' Series to the 'df' DataFrame along the columns (axis=1).

df = pd.concat([df, fare], axis=1)

The axis parameter in pandas.concat() determines whether you are combining data along rows or columns:

axis=0 (Default): Concatenation along Rows
axis=1: Concatenation along Columns

(When concatenating along axis=1, Pandas aligns on the index by default. If indices do not match, you may see NaN values where data is missing.)
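
The effect of mismatched indices can be sketched with a Series whose index is shifted by one. The names df_example and fare_shifted and the shifted index values are made up for this illustration:

```python
import pandas as pd

df_example = pd.DataFrame({
    'Names': ["Braund", "Allen", "Bonnel"],
    'Ages': [22, 35, 58],
})

# Hypothetical Series with index labels 1, 2, 3 instead of 0, 1, 2.
fare_shifted = pd.Series([7.2500, 71.2833, 7.9250], index=[1, 2, 3], name='Fare')

combined = pd.concat([df_example, fare_shifted], axis=1)

# Row 0 has no matching fare and row 3 has no matching name or age,
# so pandas fills these positions with NaN.
print(combined)
```

The result has four rows (index 0 to 3) instead of three, because pandas keeps every index label that occurs in either input.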

# Display the updated DataFrame 'df'.

df
    Names  Ages     Sex     Fare
0  Braund    22    male   7.2500
1   Allen    35    male  71.2833
2  Bonnel    58  female   7.9250

Creating a new column based on existing data#

We can also create a new column based on the data in an existing column.

Here, we create a new column ‘Age_in_3_years’ in the DataFrame ‘df’.
This column is calculated by adding 3 to each value in the ‘Ages’ column.

df['Age_in_3_years'] = df['Ages'] + 3
# Display the updated DataFrame 'df'.

df
    Names  Ages     Sex     Fare  Age_in_3_years
0  Braund    22    male   7.2500              25
1   Allen    35    male  71.2833              38
2  Bonnel    58  female   7.9250              61
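
The same pattern works for other element-wise operations. As a sketch, the comparison below creates a column of True/False values. The DataFrame df_example, the column name ‘Is_over_30’, and the threshold of 30 are made up for this illustration:

```python
import pandas as pd

df_example = pd.DataFrame({
    'Names': ["Braund", "Allen", "Bonnel"],
    'Ages': [22, 35, 58],
})

# The comparison is applied to every value in the 'Ages' column,
# producing one True/False value per row.
df_example['Is_over_30'] = df_example['Ages'] > 30

print(df_example)
```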

Exercise#

Create a new column called ‘Fare_in_DKK’ based on the column ‘Fare’.

We assume the old fare prices to be in GBP and the exchange rate to be £1 = 8.7 DKK.


Key points#

  • Import the library with import pandas as pd.

  • A table of data is stored as a pandas DataFrame.

  • The shape and dtypes attributes are convenient for a first check.

  • Each column in a DataFrame is a Series.

  • We can append Series as columns to an existing DataFrame.