Pandas DataFrame

Pandas `DataFrame`#

Learning Objectives

Questions:

What is a Pandas DataFrame?
How is a DataFrame used to manage and manipulate data?

Objectives:

Understand the structure and purpose of a DataFrame.
Create a DataFrame from lists.
Perform basic DataFrame manipulations, such as renaming columns and transposing data.
Understand the structure and purpose of a Pandas Series.
Append Series as columns to an existing DataFrame and create new columns based on existing data.

Import Pandas#

# This line imports the pandas library and aliases it as 'pd'.

import pandas as pd

Why pandas as pd?

Aliasing pandas as pd is a widely adopted convention that simplifies the syntax for accessing its functionalities.
After this statement, you can use pd to access all the functionalities provided by the pandas library.

Pandas data table representation#

Representation of a Pandas DataFrame

A DataFrame is a two-dimensional data structure in the Pandas library, designed to hold and manage data efficiently. It organizes data into labeled columns and rows, allowing you to store various types of data, such as numbers, text, and categorical values, all within the same structure.

Creating our first `DataFrame`#

We start by creating three lists of equal length (i.e., containing the same number of elements).
These lists will be used as columns for a DataFrame, with each list representing a column and each element within the list representing a row in that column.

# Create three lists named 'name', 'age', and 'sex'.

name = ["Braund", "Allen", "Bonnel"]
age = [22, 35, 58]
sex = ["male", "male", "female"]

We use lists name, age, and sex to fill in the columns.
Each list corresponds to a column in the DataFrame.
Name, Age, and Sex are the titles of these columns.

# Create a DataFrame named 'df' based on three lists.

df = pd.DataFrame({'Name': name, 'Age': age, 'Sex': sex})

`DataFrames` and `dictionaries`#

Creating a DataFrame in pandas is similar to creating a dictionary. The keys in the dictionary become the column names, while the values, which are lists or arrays, form the columns’ data.
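To make the link between dictionaries and DataFrames concrete, here is a minimal sketch. The names `data`, `df_from_dict`, and `column_names` are only illustrative and are not used elsewhere in this lesson.

```python
import pandas as pd

# Illustrative dictionary: keys become column names, values become column data.
data = {
    "Name": ["Braund", "Allen", "Bonnel"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
}

df_from_dict = pd.DataFrame(data)

# The column labels are the dictionary keys, in insertion order.
column_names = list(df_from_dict.columns)
print(column_names)

# The lists must all have the same length; otherwise pandas raises a ValueError.
try:
    pd.DataFrame({"Name": ["Braund", "Allen"], "Age": [22]})
except ValueError as err:
    print("Unequal lengths:", err)
```

Building the dictionary first and passing it to `pd.DataFrame()` gives the same result as writing the dictionary inline, as we did above.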

Attributes#

We can use the shape attribute to determine the dimensions of the DataFrame.
It returns a tuple of the form (rows, columns). This is especially helpful if you did not create the DataFrame yourself.

df.shape

We can also use the dtypes attribute to view the data type of each column in the DataFrame.
It reports whether a column holds, for example, integers, floats, or objects (the type pandas commonly uses for text). This is useful to know when you start working more in-depth with your data.

df.dtypes

When asking for the shape or dtypes, no parentheses () are used. Both are attributes of DataFrame and Series. (Series will be explained later.)

Attributes of a DataFrame or Series do not need ().

Attributes represent a characteristic of a DataFrame/Series, whereas methods (which require parentheses ()) do something with the DataFrame/Series.
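The difference between attributes and methods can be sketched as follows. The small DataFrame `df_demo` is a stand-in created only for this example; `head()` and `mean()` are used here as two examples of methods.

```python
import pandas as pd

# Stand-in DataFrame for this example only.
df_demo = pd.DataFrame({"Name": ["Braund", "Allen", "Bonnel"], "Age": [22, 35, 58]})

# Attributes: no parentheses, they describe the object.
n_rows, n_cols = df_demo.shape
age_dtype = df_demo["Age"].dtype

# Methods: parentheses required, they compute or return something.
first_two = df_demo.head(2)        # first two rows as a new DataFrame
mean_age = df_demo["Age"].mean()   # average of the 'Age' column

print(n_rows, n_cols)
print(age_dtype)
print(mean_age)
```

Writing `df_demo.head` without parentheses does not raise an error; it simply returns the method itself rather than calling it, which is a common source of confusion.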

Transposing a `DataFrame`#

The transpose() method returns a new DataFrame in which the rows and columns of the original are swapped. Here we store the result in ‘df_transposed’; ‘df’ itself is unchanged.
Transposing is useful for reshaping data, making it easier to compare rows or apply certain operations that are typically column-based.

# Transpose the DataFrame 'df' using the 'transpose()' method.

df_transposed = df.transpose()

# Display the DataFrame 'df_transposed'.

df_transposed


	0	1	2
Name	Braund	Allen	Bonnel
Age	22	35	58
Sex	male	male	female
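The swap is easier to see when the numbers of rows and columns differ. The sketch below uses a stand-in DataFrame, `df_demo`, with three rows and two columns, and the `.T` attribute, which is shorthand for `transpose()`.

```python
import pandas as pd

# Stand-in DataFrame with 3 rows and 2 columns.
df_demo = pd.DataFrame({"Name": ["Braund", "Allen", "Bonnel"], "Age": [22, 35, 58]})

# '.T' is an attribute that gives the same result as calling 'transpose()'.
df_t = df_demo.T

print(df_demo.shape)  # rows and columns before transposing
print(df_t.shape)     # rows and columns after transposing

# The old column names are now the row labels, and the old row labels
# (0, 1, 2) are now the column names.
print(list(df_t.index))
print(list(df_t.columns))

# Transposing twice restores the original layout.
df_back = df_t.T
```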

Renaming columns#

We can rename the columns of our DataFrame after creation.
This is done by assigning a new list of column names to df.columns.
The new column names are Names, Age, and Sex, in that order.

# Rename the columns of the DataFrame 'df'.

df.columns = ['Names', 'Age', 'Sex']

The rename() method below is useful for renaming one or more selected columns without having to list the entire set of column names:

# Rename the 'Age' column to 'Ages' in the DataFrame 'df'.

df = df.rename(columns={'Age': 'Ages'})

# Our DataFrame now looks like this:

df


	Names	Ages	Sex
0	Braund	22	male
1	Allen	35	male
2	Bonnel	58	female
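Note that rename() returns a new DataFrame, which is why the result is assigned back to ‘df’ above. A minimal sketch with a stand-in DataFrame `df_demo` follows; the new names ‘Gender’ and ‘Height_cm’ are made up for illustration.

```python
import pandas as pd

# Stand-in DataFrame for this example only.
df_demo = pd.DataFrame(
    {"Names": ["Braund", "Allen", "Bonnel"], "Age": [22, 35, 58], "Sex": ["male", "male", "female"]}
)

# Several columns can be renamed at once by adding more key-value pairs.
renamed = df_demo.rename(columns={"Age": "Ages", "Sex": "Gender"})

# The original DataFrame keeps its column names unless we reassign it.
print(list(df_demo.columns))
print(list(renamed.columns))

# By default, keys that do not match an existing column are silently ignored.
unchanged = df_demo.rename(columns={"Height": "Height_cm"})
print(list(unchanged.columns))
```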

`Series`#

DataFrame Series

Each column in a DataFrame is a Series. When we access a column in a DataFrame, this actually returns a Series object containing all the data in that column.

df['Ages']

# Check the type of the 'Ages' column in 'df' using the 'type()' function.

type(df['Ages'])
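A Series has attributes of its own, just like a DataFrame. The sketch below, using a stand-in DataFrame `df_demo`, also shows that selecting with double square brackets returns a one-column DataFrame rather than a Series.

```python
import pandas as pd

# Stand-in DataFrame for this example only.
df_demo = pd.DataFrame({"Names": ["Braund", "Allen", "Bonnel"], "Ages": [22, 35, 58]})

# Single brackets with one column name return a Series.
ages = df_demo["Ages"]
print(type(ages).__name__)
print(ages.name)   # the Series carries the column name
print(ages.shape)  # one-dimensional: (number of rows,)

# Double brackets (a list of column names) return a DataFrame.
ages_frame = df_demo[["Ages"]]
print(type(ages_frame).__name__)
print(ages_frame.shape)
```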

Creating our own `Series`#

We can create and name a Series in the following way.
The name parameter assigns the name ‘Fare’ to the Series.

# Create a pandas Series named 'fare' with specified values.

fare = pd.Series([7.2500, 71.2833, 7.9250], name='Fare')

This outputs the values along with their index positions and the name of the Series:

# Display the 'fare' Series.

fare

# Check the data type of 'fare' using the 'type()' function.

type(fare)

Appending `Series`#

We can add a Series as a new column to a DataFrame, extending it horizontally.
Here, the name of the ‘fare’ Series (‘Fare’) becomes the column name in the updated DataFrame.

# Concatenate the 'fare' Series to the 'df' DataFrame along the columns (axis=1).

df = pd.concat([df, fare], axis=1)

The axis parameter in pandas.concat() determines whether you are combining data along rows or columns:

axis=0 (Default): Concatenation along Rows
axis=1: Concatenation along Columns

(When concatenating along axis=1, Pandas aligns on the index by default. If indices do not match, you may see NaN values where data is missing.)
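The index alignment mentioned above can be illustrated with a deliberately mismatched index. In this sketch, `df_demo` and `fare_shifted` are stand-ins created only for the example; `fare_shifted` is labelled 1, 2, 3 instead of 0, 1, 2.

```python
import pandas as pd

# Stand-in DataFrame with the default index 0, 1, 2.
df_demo = pd.DataFrame({"Names": ["Braund", "Allen", "Bonnel"]})

# Series whose index is shifted by one on purpose.
fare_shifted = pd.Series([7.2500, 71.2833, 7.9250], index=[1, 2, 3], name="Fare")

# Rows are matched by index label, not by position, so labels 0 and 3
# each end up with one missing value (NaN).
combined = pd.concat([df_demo, fare_shifted], axis=1)
print(combined)

# One possible fix, if the values are known to be in the right order:
# reset the Series index so that it matches the DataFrame's index again.
aligned = pd.concat([df_demo, fare_shifted.reset_index(drop=True)], axis=1)
print(aligned)
```

Resetting the index is only appropriate when the order of the values is known to match the rows of the DataFrame; otherwise the fares would be attached to the wrong passengers.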

# Display the updated DataFrame 'df'.

df


	Names	Ages	Sex	Fare
0	Braund	22	male	7.2500
1	Allen	35	male	71.2833
2	Bonnel	58	female	7.9250

Creating a new column based on existing data#

We can also create a new column based on the data in an existing column.

Here, we create a new column ‘Age_in_3_years’ in the DataFrame ‘df’.
This column is calculated by adding 3 to each value in the ‘Ages’ column.

df['Age_in_3_years'] = df['Ages'] + 3

# Display the updated DataFrame 'df'.

df


	Names	Ages	Sex	Fare	Age_in_3_years
0	Braund	22	male	7.2500	25
1	Allen	35	male	71.2833	38
2	Bonnel	58	female	7.9250	61
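The addition above is applied to every row at once, without writing a loop. The sketch below compares it with an explicit Python loop and shows that comparisons work the same way. The DataFrame `df_demo` and the column name ‘Over_30’ are made up for illustration.

```python
import pandas as pd

# Stand-in DataFrame for this example only.
df_demo = pd.DataFrame({"Names": ["Braund", "Allen", "Bonnel"], "Ages": [22, 35, 58]})

# Element-wise arithmetic: 3 is added to each value in 'Ages'.
df_demo["Age_in_3_years"] = df_demo["Ages"] + 3

# The same calculation written as an explicit loop over the values.
looped = [age + 3 for age in df_demo["Ages"]]

# Comparisons are element-wise too and produce a column of True/False values.
df_demo["Over_30"] = df_demo["Ages"] > 30

print(df_demo)
```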

Exercise#

Create a new column called ‘Fare_in_DKK’ based on the column ‘Fare’.

We assume the old fare prices to be in GBP and the exchange rate to be £1 = 8.7 DKK.

Solution

df['Fare_in_DKK'] = df['Fare'] * 8.7

This solution creates a new column in the DataFrame named ‘Fare_in_DKK’, which contains the fare prices converted from GBP to DKK using the given exchange rate.
Each fare value in GBP is multiplied by the exchange rate to obtain the corresponding fare value in DKK.

Key points#

Import the library with import pandas as pd.
A table of data is stored as a pandas DataFrame.
The shape and dtypes attributes are convenient for a first check.
Each column in a DataFrame is a Series.
We can append Series as columns to an existing DataFrame.
