Pandas DataFrame
Learning Objectives
Questions:
What is a Pandas DataFrame?
How is a DataFrame used to manage and manipulate data?
Objectives:
Understand the structure and purpose of a DataFrame.
Create a DataFrame from lists.
Perform basic DataFrame manipulations, such as renaming columns and transposing data.
Understand the structure and purpose of a Pandas Series.
Append Series as columns to an existing DataFrame and create new columns based on existing data.
Import Pandas
# This line imports the pandas library and aliases it as 'pd'.
import pandas as pd
Why pandas as pd?
Aliasing pandas as pd is a widely adopted convention that shortens every call to the library.
After this statement, you can use pd to access everything the pandas library provides.
Pandas data table representation
A DataFrame is a two-dimensional data structure in the Pandas library, designed to hold and manage data efficiently. It organizes data into labeled columns and rows, allowing you to store various types of data, such as numbers, text, and categorical values, all within the same structure.
Creating our first DataFrame
We start by creating three lists of equal length (i.e., containing the same number of elements).
These lists will be used as columns for a DataFrame, with each list representing a column and each element within the list representing a row in that column.
# Create three lists named 'name', 'age', and 'sex'.
name = ["Braund", "Allen", "Bonnel"]
age = [22, 35, 58]
sex = ["male", "male", "female"]
We use the lists name, age, and sex to fill the columns; each list becomes one column of the DataFrame.
Name, Age, and Sex are the titles of these columns.
# Create a DataFrame named 'df' based on three lists.
df = pd.DataFrame({'Name': name, 'Age': age, 'Sex': sex})
DataFrames and dictionaries
Creating a DataFrame in pandas is similar to creating a dictionary; in the code above, we in fact pass a dictionary to pd.DataFrame(). The keys in the dictionary become the column names, while the values, which are lists or arrays, form the columns’ data.
See also
For more information on dictionaries in Python, see GeeksforGeeks.
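The relationship between dictionary keys and column names can be seen in a small standalone sketch. The data and the names `data`, `fruit`, `column_names`, and `n_rows` are hypothetical and only serve as an illustration; they are not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical example data, made up for this illustration.
data = {
    "Fruit": ["apple", "pear", "plum"],
    "Count": [3, 5, 2],
}

fruit = pd.DataFrame(data)

# The dictionary keys become the column names, in insertion order.
column_names = list(fruit.columns)

# Each list becomes one column; its length sets the number of rows.
n_rows = len(fruit)

print(column_names, n_rows)
```

Because each list becomes a column, all lists in the dictionary need to have the same length.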
# Display the DataFrame 'df'.
df
| | Name | Age | Sex |
|---|---|---|---|
| 0 | Braund | 22 | male |
| 1 | Allen | 35 | male |
| 2 | Bonnel | 58 | female |
In spreadsheet software, the table representation of our data would look very similar.
# Check the type of the 'df' object using the 'type()' function.
type(df)
pandas.core.frame.DataFrame
Attributes
We can use the shape attribute to determine the dimensions of the DataFrame.
It returns a tuple of the form (rows, columns). This is especially helpful if you have not created the DataFrame yourself.
df.shape
(3, 3)
We can also use the dtypes attribute to view the data type of each column in the DataFrame, such as integer, float, or object (string).
This is useful knowledge to have when you start working more in-depth with your data.
df.dtypes
Name object
Age int64
Sex object
dtype: object
When asking for the shape or dtypes, no parentheses () are used. Both are attributes of DataFrame and Series. (Series will be explained later.)
Attributes of a DataFrame or Series do not need ().
Attributes represent a characteristic of a DataFrame/Series, whereas methods (which require parentheses ()) do something with the DataFrame/Series.
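The difference between attributes and methods can be illustrated side by side. This is a minimal sketch on a hypothetical `df_demo` DataFrame; `head()` and `mean()` are standard pandas methods that are not otherwise used in this lesson and are only included here to contrast with `shape` and `dtypes`.

```python
import pandas as pd

# Hypothetical demo DataFrame, separate from the lesson's 'df'.
df_demo = pd.DataFrame({"Name": ["Braund", "Allen", "Bonnel"], "Age": [22, 35, 58]})

# Attributes: no parentheses, they describe the object.
dims = df_demo.shape        # tuple of (rows, columns)
col_types = df_demo.dtypes  # a Series with one data type per column

# Methods: parentheses required, they compute or do something.
first_rows = df_demo.head(2)      # returns the first two rows
mean_age = df_demo["Age"].mean()  # computes the average of a column

print(dims)
print(first_rows)
```

Leaving the parentheses off a method (for example `df_demo.head`) does not run it; Python then just refers to the method itself.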
Transposing a DataFrame
The transpose() method swaps the DataFrame’s rows and columns. Here, we store the result in ‘df_transposed’.
Transposing is useful for reshaping data, for example when an operation works column by column and you want to apply it to what are currently rows.
# Transpose the DataFrame 'df' using the 'transpose()' method.
df_transposed = df.transpose()
# Display the DataFrame 'df_transposed'.
df_transposed
| | 0 | 1 | 2 |
|---|---|---|---|
| Name | Braund | Allen | Bonnel |
| Age | 22 | 35 | 58 |
| Sex | male | male | female |
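One side effect of transposing is worth seeing in code: when the original columns hold different kinds of data, each transposed column mixes them. The sketch below uses a hypothetical two-column DataFrame named `small`; `.T` is the built-in shorthand for `transpose()`.

```python
import pandas as pd

# Hypothetical DataFrame with one text column and one integer column.
small = pd.DataFrame({"Name": ["Braund", "Allen", "Bonnel"], "Age": [22, 35, 58]})

flipped = small.T  # .T is shorthand for .transpose()

# The shape is swapped: 3 rows x 2 columns becomes 2 rows x 3 columns.
original_shape = small.shape
flipped_shape = flipped.shape

# Every transposed column now holds both a name and an age,
# so pandas falls back to the generic 'object' data type.
flipped_dtypes = flipped.dtypes

print(original_shape, flipped_shape)
print(flipped_dtypes)
```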
Renaming columns
We can rename the columns of our DataFrame after creation.
This is done by assigning a new list of column names to df.columns. The list must contain exactly one name per column.
The new column names are Names, Age, and Sex, in that order.
# Rename the columns of the DataFrame 'df'.
df.columns = ['Names', 'Age', 'Sex']
The rename() method below is useful for renaming one or more columns selectively, without replacing the entire set of column names:
# Rename the 'Age' column to 'Ages' in the DataFrame 'df'.
df = df.rename(columns={'Age': 'Ages'})
# Our DataFrame now looks like this:
df
| | Names | Ages | Sex |
|---|---|---|---|
| 0 | Braund | 22 | male |
| 1 | Allen | 35 | male |
| 2 | Bonnel | 58 | female |
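The two renaming approaches behave differently in ways that are easy to miss. The following sketch, using hypothetical DataFrames `people` and `broken`, shows that rename() returns a new DataFrame rather than changing the original, and that assigning a list of the wrong length to `.columns` raises an error.

```python
import pandas as pd

# Hypothetical two-column DataFrame for illustration.
people = pd.DataFrame({"Name": ["Braund", "Allen"], "Age": [22, 35]})

# rename() returns a new DataFrame; 'people' itself stays unchanged
# unless the result is assigned back to it.
renamed = people.rename(columns={"Age": "Ages"})

# Assigning to .columns needs exactly one name per column;
# a list of the wrong length raises a ValueError.
broken = people.copy()
try:
    broken.columns = ["Name"]  # only one name for two columns
    length_error = False
except ValueError:
    length_error = True

print(list(people.columns), list(renamed.columns), length_error)
```

This is why the lesson writes `df = df.rename(...)`: without the assignment, the renamed result would be discarded.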
Series
Each column in a DataFrame is a Series. When we access a column in a DataFrame, this actually returns a Series object containing all the data in that column.
df['Ages']
0 22
1 35
2 58
Name: Ages, dtype: int64
# Check the type of the 'Ages' column in 'df' using the 'type()' function.
type(df['Ages'])
pandas.core.series.Series
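A related detail, sketched here on a hypothetical `people` DataFrame: selecting with a single column name returns a Series, while selecting with a list of column names (double brackets) returns a DataFrame, even if the list has only one entry.

```python
import pandas as pd

# Hypothetical DataFrame mirroring two of the lesson's columns.
people = pd.DataFrame({"Names": ["Braund", "Allen", "Bonnel"], "Ages": [22, 35, 58]})

ages_series = people["Ages"]   # single brackets: a Series
ages_frame = people[["Ages"]]  # a list of column names: a DataFrame

# A Series carries the column label as its name
# and keeps the row index of the DataFrame it came from.
series_name = ages_series.name

print(type(ages_series).__name__, type(ages_frame).__name__, series_name)
```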
Creating our own Series
We can create and name a Series in the following way.
The name parameter assigns the name ‘Fare’ to the Series.
# Create a pandas Series named 'fare' with specified values.
fare = pd.Series([7.2500, 71.2833, 7.9250], name='Fare')
Displaying the Series shows the values along with their index positions, the name of the Series, and its data type:
# Display the 'fare' Series.
fare
0 7.2500
1 71.2833
2 7.9250
Name: Fare, dtype: float64
# Check the data type of 'fare' using the 'type()' function.
type(fare)
pandas.core.series.Series
Appending Series
We can add a Series as a new column to a DataFrame, extending it horizontally.
Here, the name of the ‘fare’ Series (‘Fare’) becomes the column name in the updated DataFrame.
# Concatenate the 'fare' Series to the 'df' DataFrame along the columns (axis=1).
df = pd.concat([df, fare], axis=1)
The axis parameter in pandas.concat() determines whether you are combining data along rows or columns:
axis=0 (default): concatenation along rows
axis=1: concatenation along columns
(When concatenating along axis=1, Pandas aligns on the index by default. If indices do not match, you may see NaN values where data is missing.)
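The index alignment mentioned above can be made concrete with a small sketch. The names `left` and `shifted_fare` are hypothetical; the Series is deliberately given an index (1, 2, 3) that only partly overlaps with the DataFrame's index (0, 1, 2).

```python
import pandas as pd

# Hypothetical DataFrame with the default index 0, 1, 2.
left = pd.DataFrame({"Names": ["Braund", "Allen", "Bonnel"]})

# Hypothetical Series whose index only partly overlaps with 'left'.
shifted_fare = pd.Series([7.25, 71.2833, 7.925], index=[1, 2, 3], name="Fare")

combined = pd.concat([left, shifted_fare], axis=1)

# Rows are matched by index label, not by position:
# label 0 has no fare and label 3 has no name, so both are missing.
missing_fare_at_0 = pd.isna(combined.loc[0, "Fare"])
missing_name_at_3 = pd.isna(combined.loc[3, "Names"])

print(combined)
```

In the lesson itself, `df` and `fare` both use the default index 0, 1, 2, so every row lines up and no missing values appear.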
# Display the updated DataFrame 'df'.
df
| | Names | Ages | Sex | Fare |
|---|---|---|---|---|
| 0 | Braund | 22 | male | 7.2500 |
| 1 | Allen | 35 | male | 71.2833 |
| 2 | Bonnel | 58 | female | 7.9250 |
Creating a new column based on existing data
We can also create a new column based on the data in an existing column.
Here, we create a new column ‘Age_in_3_years’ in the DataFrame ‘df’.
This column is calculated by adding 3 to each value in the ‘Ages’ column.
df['Age_in_3_years'] = df['Ages'] + 3
# Display the updated DataFrame 'df'.
df
| | Names | Ages | Sex | Fare | Age_in_3_years |
|---|---|---|---|---|---|
| 0 | Braund | 22 | male | 7.2500 | 25 |
| 1 | Allen | 35 | male | 71.2833 | 38 |
| 2 | Bonnel | 58 | female | 7.9250 | 61 |
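The same pattern extends beyond adding a constant. The sketch below works on a hypothetical `trips` DataFrame; the column names `Fare_per_year_of_age` and `Is_adult`, and the threshold of 18, are made up for illustration only.

```python
import pandas as pd

# Hypothetical DataFrame reusing the lesson's ages and fares.
trips = pd.DataFrame({"Ages": [22, 35, 58], "Fare": [7.25, 71.2833, 7.925]})

# Arithmetic with a single number is applied to every row.
trips["Age_in_3_years"] = trips["Ages"] + 3

# Arithmetic between two columns is carried out row by row.
trips["Fare_per_year_of_age"] = trips["Fare"] / trips["Ages"]

# A comparison produces a column of True/False values.
trips["Is_adult"] = trips["Ages"] >= 18

print(trips)
```

No explicit loop is needed in any of these cases; pandas applies the operation to the whole column at once.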
Exercise
Create a new column called ‘Fare_in_DKK’ based on the column ‘Fare’.
We assume the old fare prices to be in GBP and the exchange rate to be £1 = 8.7 DKK.
Solution
df['Fare_in_DKK'] = df['Fare'] * 8.7
This solution creates a new column in the DataFrame named ‘Fare_in_DKK’, which contains the fare prices converted from GBP to DKK using the given exchange rate.
Each fare value in GBP is multiplied by the exchange rate to obtain the corresponding fare value in DKK.
Key points
Import the library with import pandas as pd.
A table of data is stored as a pandas DataFrame.
The shape and dtypes attributes are convenient for a first check.
Each column in a DataFrame is a Series.
We can append Series as columns to an existing DataFrame.