Subsets#
Learning Objectives
Questions:
How can we select and filter specific columns and rows in a
DataFrame
?
Objectives:
Select specific columns from a
DataFrame
using square bracket notation ([]
).Use the
loc
property for label-based indexing in aDataFrame
.Use the
iloc
property for position-based indexing in aDataFrame
.Filter rows based on the presence or absence of missing values.
Understand the differences and complementary uses of
.loc[]
and.iloc[]
.
Import Pandas#
# This line imports the pandas library and aliases it as 'pd'.
import pandas as pd
Creating a DataFrame
from a CSV file#
The pd.read_csv()
function is used to read the data from the file ‘data/titanic.csv’.
# Load the Titanic dataset from a CSV file into a DataFrame named 'titanic'.
titanic = pd.read_csv('data/titanic.csv')
# Display the DataFrame 'titanic'.
titanic
Show code cell output
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
Selecting specific columns#
Selecting specific columns is useful when you want to work with specific subsets of the dataset e.g. you might want to perform statistical calculations based on only the colum Age
or later on when you want to access and manipulate your data in more complex ways.
Access the ‘Age’ column from the DataFrame
‘titanic’ to return a Series
object containing all the data in the ‘Age’ column:
titanic['Age']
Show code cell output
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
...
886 27.0
887 19.0
888 NaN
889 26.0
890 32.0
Name: Age, Length: 891, dtype: float64
# Check the type of the 'Age' column in 'titanic' using the 'type()' function.
type(titanic['Age'])
Show code cell output
pandas.core.series.Series
Use the shape
attribute to determine the dimensions of the Series
.
It returns a tuple representing the number of rows and columns (rows, columns).
titanic['Age'].shape
Show code cell output
(891,)
Calling multiple Series
#
Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame
using double square brackets:
# Select the columns 'Age' and 'Sex' from the 'titanic' DataFrame.
titanic[['Age', 'Sex']]
Show code cell output
Age | Sex | |
---|---|---|
0 | 22.0 | male |
1 | 38.0 | female |
2 | 26.0 | female |
3 | 35.0 | female |
4 | 35.0 | male |
... | ... | ... |
886 | 27.0 | male |
887 | 19.0 | female |
888 | NaN | female |
889 | 26.0 | male |
890 | 32.0 | male |
891 rows × 2 columns
Note
The inner square brackets define a Python list
with column names, whereas the outer square brackets are used to select the data from a pandas DataFrame
as seen in the previous example.
Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame
using double square brackets, and then apply the type()
function to the resulting DataFrame
subset:
# Use the 'type()' function to determine the data type of the DataFrame subset.
type(titanic[['Age', 'Sex']])
Show code cell output
pandas.core.frame.DataFrame
Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame
using double square brackets, and then apply the shape
attribute to the resulting DataFrame
subset:
# Use the 'shape' attribute to determine the dimensions of the DataFrame subset.
titanic[['Age', 'Sex']].shape
Show code cell output
(891, 2)
Introducing .loc[]
and .iloc[]
#
Now that you’ve learned how to select specific columns from a DataFrame
, it’s time to explore more advanced and flexible ways to access and manipulate your data. As you start filtering rows based on conditions or need to select both rows and columns simultaneously, using .loc[]
and .iloc[]
will give you greater control and clarity in your code.
These powerful tools allow you to:
Safely and explicitly select rows and columns by labels with
.loc[]
.Access rows and columns by their integer positions using
.iloc[]
.Avoid potential issues that can arise with more complex operations, ensuring your data manipulations are clear, consistent, and reliable.
With these methods, you’ll be better equipped to handle more complex data tasks efficiently.
What exactly are loc and iloc?
loc
and iloc
in Pandas are technically properties that return a specialized indexer object. These indexer objects allow for label-based (loc
) and integer-based (iloc
) indexing. Although they behave similarly to methods in that you use them to perform actions (i.e., selecting or slicing data), they are properties because they return an indexing object rather than directly executing a function.
When you access loc
or iloc
on a DataFrame
, what you are getting is not a simple value but an object that has its own methods and behaviors, specifically designed to handle indexing operations. This design allows for the concise and powerful data selection syntax that Pandas is known for.
Avoid ambiguous operations
Some of the operations we’ll be covering can technically be performed without using .loc[]
or .iloc[]
. For instance, you might see data selections or modifications done with simple square brackets like df[df['column'] > value]
. While this can work for basic tasks, it can lead to issues in more complex scenarios, such as unexpected behavior, ambiguity, or performance inefficiencies.
Why is this a bad idea?
Ambiguity: It’s not always clear if you’re modifying a copy of your data or the original
DataFrame
, which can lead to unintended side effects.Potential errors: Using square brackets for complex selections might trigger warnings like
SettingWithCopyWarning
, indicating that your code might not be doing what you think it is.Lack of flexibility: As your data operations become more complex, relying on square brackets can limit your ability to write clear and maintainable code.
For these reasons, we recommend using .loc[]
and .iloc[]
as they provide a more reliable and explicit way to handle data selection and modification in Pandas.
If you’re interested in learning more about the potential pitfalls of using square brackets and the benefits of .loc[]
and .iloc[]
, you can read more in the Pandas documentation on indexing and selecting data.
Filtering rows based on conditional expressions#
With .loc[]
and .iloc[]
, you have more control and flexibility when filtering rows and selecting columns in a DataFrame
.
Let’s start by exploring how you can use these tools to filter rows based on conditions.
Filtering specific rows with .loc[]
#
To filter rows based on a condition, you can place the condition inside the .loc[]
selector.
Here’s how you can filter rows where the Age
column is greater than 35:
# Filter rows in the 'titanic' DataFrame where the 'Age' column is greater than 35.
titanic.loc[titanic['Age'] > 35]
Show code cell output
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S |
15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
865 | 866 | 1 | 2 | Bystrom, Mrs. (Karolina) | female | 42.0 | 0 | 0 | 236852 | 13.0000 | NaN | S |
871 | 872 | 1 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 1 | 1 | 11751 | 52.5542 | D35 | S |
873 | 874 | 0 | 3 | Vander Cruyssen, Mr. Victor | male | 47.0 | 0 | 0 | 345765 | 9.0000 | NaN | S |
879 | 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
217 rows × 12 columns
This command filters the rows where the condition titanic['Age'] > 35
is True
, and returns the entire DataFrame
for those rows.
If you only want to select specific columns (e.g., Name
and Pclass
) for those filtered rows, you can add the column names after the comma:
# Filter rows where 'Age' is greater than 35 and select only the 'Name' and 'Pclass' columns.
titanic.loc[titanic['Age'] > 35, ['Name', 'Pclass']]
Show code cell output
Name | Pclass | |
---|---|---|
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 |
6 | McCarthy, Mr. Timothy J | 1 |
11 | Bonnell, Miss. Elizabeth | 1 |
13 | Andersson, Mr. Anders Johan | 3 |
15 | Hewlett, Mrs. (Mary D Kingcome) | 2 |
... | ... | ... |
865 | Bystrom, Mrs. (Karolina) | 2 |
871 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | 1 |
873 | Vander Cruyssen, Mr. Victor | 3 |
879 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | 1 |
885 | Rice, Mrs. William (Margaret Norton) | 3 |
217 rows × 2 columns
When using loc/iloc
The part before the comma is the rows you want to select.
The part after the comma is the columns you want to select.
Using .iloc[]
for positional indexing#
The .iloc[]
method works similarly to .loc[]
, but it selects rows and columns based on their integer positions rather than labels.
For example, to select the first 5 rows and the first 2 columns, you can use:
# Select the first 5 rows and the first 2 columns using .iloc[].
titanic.iloc[:5, :2]
Show code cell output
PassengerId | Survived | |
---|---|---|
0 | 1 | 0 |
1 | 2 | 1 |
2 | 3 | 1 |
3 | 4 | 1 |
4 | 5 | 0 |
Inclusive vs. Exclusive Slicing#
When slicing with .loc[]
, the slicing is inclusive of the start and end labels. For example:
# Select rows 0 through 3 (inclusive) and columns 'Name' and 'Pclass'.
titanic.loc[0:3, ['Name', 'Pclass']]
Show code cell output
Name | Pclass | |
---|---|---|
0 | Braund, Mr. Owen Harris | 3 |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 |
2 | Heikkinen, Miss. Laina | 3 |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 |
This command returns rows 0, 1, 2, and 3, including both the start and end indices.
When slicing with .iloc[]
, the slicing is exclusive of the end position:
# Select rows 0 through 2 (exclusive of 3) and the first two columns using .iloc[].
titanic.iloc[0:3, 0:2]
Show code cell output
PassengerId | Survived | |
---|---|---|
0 | 1 | 0 |
1 | 2 | 1 |
2 | 3 | 1 |
Using .loc[]
and .iloc[]
for complex conditions#
You can use both .loc[]
and .iloc[]
to filter rows based on more complex conditions, combining multiple criteria with logical operators.
For instance, to filter rows where Pclass
is either 2 or 3 using .loc[]
:
# Filter rows where 'Pclass' is either 2 or 3 using .loc[].
titanic.loc[titanic['Pclass'].isin([2, 3])]
Show code cell output
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S |
885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
675 rows × 12 columns
Similarly, if you want to perform position-based filtering using .iloc[]
, you would first need to identify the positions of the relevant rows and columns. However, .iloc[]
is typically less useful for condition-based filtering because it requires knowledge of the exact row and column positions rather than labels.
You can also combine multiple conditions using logical operators like &
(AND) and |
(OR):
# Filter rows where ('Pclass' is either 2 or 3) AND ('Age' is greater than 35),
# and then select the 'Name' and 'Pclass' columns.
titanic.loc[((titanic['Pclass'] == 2) | (titanic['Pclass'] == 3)) & (titanic['Age'] > 35), ['Name', 'Pclass']]
Show code cell output
Name | Pclass | |
---|---|---|
13 | Andersson, Mr. Anders Johan | 3 |
15 | Hewlett, Mrs. (Mary D Kingcome) | 2 |
25 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... | 3 |
33 | Wheadon, Mr. Edward H | 2 |
40 | Ahlin, Mrs. Johan (Johanna Persdotter Larsson) | 3 |
... | ... | ... |
854 | Carter, Mrs. Ernest Courtenay (Lilian Hughes) | 2 |
860 | Hansen, Mr. Claus Peter | 3 |
865 | Bystrom, Mrs. (Karolina) | 2 |
873 | Vander Cruyssen, Mr. Victor | 3 |
885 | Rice, Mrs. William (Margaret Norton) | 3 |
113 rows × 2 columns
Handling missing values with .loc[]
and .iloc[]
#
Both .loc[]
and .iloc[]
allow you to filter rows based on the presence or absence of missing values.
(However, .loc[]
is generally more convenient when working with labeled data, while .iloc[]
requires knowing the exact positions.)
For example, to filter rows where the Embarked
column is not null using .loc[]
:
# Filter rows where 'Embarked' is not null using .loc[].
titanic.loc[titanic['Embarked'].notna()]
Show code cell output
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
889 rows × 12 columns
You can then compare the number of extracted rows with the original DataFrame to identify how many rows have missing data:
# Display information about the DataFrame to compare row counts.
titanic.info()
Show code cell output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Comparing the two number of rows (889 and 891) tells us, that two rows are missing embarkation data.
As we saw in the exercise on the previous page on reading and writing tabular data, the two rows missing embarkation data can be found like this:
titanic[titanic['Embarked'].isna()]
Show code cell output
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
Selecting rows and columns#
The true power of .loc[]
becomes evident when you want to filter rows and simultaneously select specific columns. For example:
# Filter rows where 'Age' is greater than 35 and select only the 'Name' and 'Pclass' columns.
titanic.loc[titanic['Age'] > 35, ['Name', 'Pclass']]
Show code cell output
Name | Pclass | |
---|---|---|
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 |
6 | McCarthy, Mr. Timothy J | 1 |
11 | Bonnell, Miss. Elizabeth | 1 |
13 | Andersson, Mr. Anders Johan | 3 |
15 | Hewlett, Mrs. (Mary D Kingcome) | 2 |
... | ... | ... |
865 | Bystrom, Mrs. (Karolina) | 2 |
871 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | 1 |
873 | Vander Cruyssen, Mr. Victor | 3 |
879 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | 1 |
885 | Rice, Mrs. William (Margaret Norton) | 3 |
217 rows × 2 columns
Combining loc
and iloc
#
We can use iloc
to select:
Rows from index position 9 to 25 (exclusive).
Columns from index position 2 to 6 (exclusive).
Then further filter these rows using loc
based on the condition where:
‘Age’ is greater than 35.
titanic.iloc[9:25, 2:6].loc[titanic['Age'] > 35]
Show code cell output
Pclass | Name | Sex | Age | |
---|---|---|---|---|
11 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 |
13 | 3 | Andersson, Mr. Anders Johan | male | 39.0 |
15 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 |
Key points#
Access a single column or multiple columns from a
DataFrame
using square bracket notation ([]
).Use label-based indexing (
.loc[]
) and position-based indexing (.iloc[]
) to select specific rows and columns.Slicing with
.loc[]
is inclusive, while slicing with.iloc[]
is exclusive.Filter rows based on conditions and apply additional conditions or select specific columns within the same operation.