Subsets

Subsets#

Learning Objectives

Questions:

How can we select and filter specific columns and rows in a DataFrame?

Objectives:

Select specific columns from a DataFrame using square bracket notation ([]).
Use the loc property for label-based indexing in a DataFrame.
Use the iloc property for position-based indexing in a DataFrame.
Filter rows based on the presence or absence of missing values.
Understand the differences and complementary uses of .loc[] and .iloc[].

Import Pandas#

# This line imports the pandas library and aliases it as 'pd'.

import pandas as pd

Creating a `DataFrame` from a CSV file#

The pd.read_csv() function is used to read the data from the file ‘data/titanic.csv’.

# Load the Titanic dataset from a CSV file into a DataFrame named 'titanic'.

titanic = pd.read_csv('data/titanic.csv')

# Display the DataFrame 'titanic'.

titanic

Show code cell output Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns

Selecting specific columns#

Selecting specific columns from a DataFrame

Selecting specific columns is useful when you want to work with specific subsets of the dataset e.g. you might want to perform statistical calculations based on only the colum Age or later on when you want to access and manipulate your data in more complex ways.

Access the ‘Age’ column from the DataFrame ‘titanic’ to return a Series object containing all the data in the ‘Age’ column:

titanic['Age']

# Check the type of the 'Age' column in 'titanic' using the 'type()' function.

type(titanic['Age'])

Use the shape attribute to determine the dimensions of the Series.
It returns a tuple representing the number of rows and columns (rows, columns).

titanic['Age'].shape

Calling multiple `Series`#

Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame using double square brackets:

# Select the columns 'Age' and 'Sex' from the 'titanic' DataFrame.

titanic[['Age', 'Sex']]

Show code cell output Hide code cell output

	Age	Sex
0	22.0	male
1	38.0	female
2	26.0	female
3	35.0	female
4	35.0	male
...	...	...
886	27.0	male
887	19.0	female
888	NaN	female
889	26.0	male
890	32.0	male

891 rows × 2 columns

Note

The inner square brackets define a Python list with column names, whereas the outer square brackets are used to select the data from a pandas DataFrame as seen in the previous example.

Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame using double square brackets, and then apply the type() function to the resulting DataFrame subset:

# Use the 'type()' function to determine the data type of the DataFrame subset.

type(titanic[['Age', 'Sex']])

Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame using double square brackets, and then apply the shape attribute to the resulting DataFrame subset:

# Use the 'shape' attribute to determine the dimensions of the DataFrame subset.

titanic[['Age', 'Sex']].shape

Introducing `.loc[]` and `.iloc[]`#

Now that you’ve learned how to select specific columns from a DataFrame, it’s time to explore more advanced and flexible ways to access and manipulate your data. As you start filtering rows based on conditions or need to select both rows and columns simultaneously, using .loc[] and .iloc[] will give you greater control and clarity in your code.

These powerful tools allow you to:

Safely and explicitly select rows and columns by labels with .loc[].
Access rows and columns by their integer positions using .iloc[].
Avoid potential issues that can arise with more complex operations, ensuring your data manipulations are clear, consistent, and reliable.

With these methods, you’ll be better equipped to handle more complex data tasks efficiently.

What exactly are loc and iloc?

loc and iloc in Pandas are technically properties that return a specialized indexer object. These indexer objects allow for label-based (loc) and integer-based (iloc) indexing. Although they behave similarly to methods in that you use them to perform actions (i.e., selecting or slicing data), they are properties because they return an indexing object rather than directly executing a function.

When you access loc or iloc on a DataFrame, what you are getting is not a simple value but an object that has its own methods and behaviors, specifically designed to handle indexing operations. This design allows for the concise and powerful data selection syntax that Pandas is known for.

Avoid ambiguous operations

Some of the operations we’ll be covering can technically be performed without using .loc[] or .iloc[]. For instance, you might see data selections or modifications done with simple square brackets like df[df['column'] > value]. While this can work for basic tasks, it can lead to issues in more complex scenarios, such as unexpected behavior, ambiguity, or performance inefficiencies.

Why is this a bad idea?

Ambiguity: It’s not always clear if you’re modifying a copy of your data or the original DataFrame, which can lead to unintended side effects.
Potential errors: Using square brackets for complex selections might trigger warnings like SettingWithCopyWarning, indicating that your code might not be doing what you think it is.
Lack of flexibility: As your data operations become more complex, relying on square brackets can limit your ability to write clear and maintainable code.

For these reasons, we recommend using .loc[] and .iloc[] as they provide a more reliable and explicit way to handle data selection and modification in Pandas.

If you’re interested in learning more about the potential pitfalls of using square brackets and the benefits of .loc[] and .iloc[], you can read more in the Pandas documentation on indexing and selecting data.

Filtering rows based on conditional expressions#

Filtering specific rows from a DataFrame

With .loc[] and .iloc[], you have more control and flexibility when filtering rows and selecting columns in a DataFrame.

Let’s start by exploring how you can use these tools to filter rows based on conditions.

Filtering specific rows with `.loc[]`#

To filter rows based on a condition, you can place the condition inside the .loc[] selector.

Here’s how you can filter rows where the Age column is greater than 35:

# Filter rows in the 'titanic' DataFrame where the 'Age' column is greater than 35.

titanic.loc[titanic['Age'] > 35]

Show code cell output Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S
13	14	0	3	Andersson, Mr. Anders Johan	male	39.0	1	5	347082	31.2750	NaN	S
15	16	1	2	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	0	0	248706	16.0000	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
865	866	1	2	Bystrom, Mrs. (Karolina)	female	42.0	0	0	236852	13.0000	NaN	S
871	872	1	1	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	1	1	11751	52.5542	D35	S
873	874	0	3	Vander Cruyssen, Mr. Victor	male	47.0	0	0	345765	9.0000	NaN	S
879	880	1	1	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	0	1	11767	83.1583	C50	C
885	886	0	3	Rice, Mrs. William (Margaret Norton)	female	39.0	0	5	382652	29.1250	NaN	Q

217 rows × 12 columns

This command filters the rows where the condition titanic['Age'] > 35 is True, and returns the entire DataFrame for those rows.

If you only want to select specific columns (e.g., Name and Pclass) for those filtered rows, you can add the column names after the comma:

# Filter rows where 'Age' is greater than 35 and select only the 'Name' and 'Pclass' columns.

titanic.loc[titanic['Age'] > 35, ['Name', 'Pclass']]

Show code cell output Hide code cell output

	Name	Pclass
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1
6	McCarthy, Mr. Timothy J	1
11	Bonnell, Miss. Elizabeth	1
13	Andersson, Mr. Anders Johan	3
15	Hewlett, Mrs. (Mary D Kingcome)	2
...	...	...
865	Bystrom, Mrs. (Karolina)	2
871	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	1
873	Vander Cruyssen, Mr. Victor	3
879	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	1
885	Rice, Mrs. William (Margaret Norton)	3

217 rows × 2 columns

When using loc/iloc

The part before the comma is the rows you want to select.

The part after the comma is the columns you want to select.

Using `.iloc[]` for positional indexing#

The .iloc[] method works similarly to .loc[], but it selects rows and columns based on their integer positions rather than labels.

For example, to select the first 5 rows and the first 2 columns, you can use:

# Select the first 5 rows and the first 2 columns using .iloc[].

titanic.iloc[:5, :2]

Show code cell output Hide code cell output

	PassengerId	Survived
0	1	0
1	2	1
2	3	1
3	4	1
4	5	0

Inclusive vs. Exclusive Slicing#

When slicing with .loc[], the slicing is inclusive of the start and end labels. For example:

# Select rows 0 through 3 (inclusive) and columns 'Name' and 'Pclass'.

titanic.loc[0:3, ['Name', 'Pclass']]

Show code cell output Hide code cell output

	Name	Pclass
0	Braund, Mr. Owen Harris	3
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1
2	Heikkinen, Miss. Laina	3
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	1

This command returns rows 0, 1, 2, and 3, including both the start and end indices.

When slicing with .iloc[], the slicing is exclusive of the end position:

# Select rows 0 through 2 (exclusive of 3) and the first two columns using .iloc[].

titanic.iloc[0:3, 0:2]

Show code cell output Hide code cell output

	PassengerId	Survived
0	1	0
1	2	1
2	3	1

Using `.loc[]` and `.iloc[]` for complex conditions#

You can use both .loc[] and .iloc[] to filter rows based on more complex conditions, combining multiple criteria with logical operators.

For instance, to filter rows where Pclass is either 2 or 3 using .loc[]:

# Filter rows where 'Pclass' is either 2 or 3 using .loc[].

titanic.loc[titanic['Pclass'].isin([2, 3])]

Show code cell output Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
884	885	0	3	Sutehall, Mr. Henry Jr	male	25.0	0	0	SOTON/OQ 392076	7.0500	NaN	S
885	886	0	3	Rice, Mrs. William (Margaret Norton)	female	39.0	0	5	382652	29.1250	NaN	Q
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

675 rows × 12 columns

Similarly, if you want to perform position-based filtering using .iloc[], you would first need to identify the positions of the relevant rows and columns. However, .iloc[] is typically less useful for condition-based filtering because it requires knowledge of the exact row and column positions rather than labels.

You can also combine multiple conditions using logical operators like & (AND) and | (OR):

# Filter rows where ('Pclass' is either 2 or 3) AND ('Age' is greater than 35),
# and then select the 'Name' and 'Pclass' columns.

titanic.loc[((titanic['Pclass'] == 2) | (titanic['Pclass'] == 3)) & (titanic['Age'] > 35), ['Name', 'Pclass']]

Show code cell output Hide code cell output

	Name	Pclass
13	Andersson, Mr. Anders Johan	3
15	Hewlett, Mrs. (Mary D Kingcome)	2
25	Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...	3
33	Wheadon, Mr. Edward H	2
40	Ahlin, Mrs. Johan (Johanna Persdotter Larsson)	3
...	...	...
854	Carter, Mrs. Ernest Courtenay (Lilian Hughes)	2
860	Hansen, Mr. Claus Peter	3
865	Bystrom, Mrs. (Karolina)	2
873	Vander Cruyssen, Mr. Victor	3
885	Rice, Mrs. William (Margaret Norton)	3

113 rows × 2 columns

Handling missing values with `.loc[]` and `.iloc[]`#

Both .loc[] and .iloc[] allow you to filter rows based on the presence or absence of missing values.
(However, .loc[] is generally more convenient when working with labeled data, while .iloc[] requires knowing the exact positions.)

For example, to filter rows where the Embarked column is not null using .loc[]:

# Filter rows where 'Embarked' is not null using .loc[].

titanic.loc[titanic['Embarked'].notna()]

Show code cell output Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

889 rows × 12 columns

You can then compare the number of extracted rows with the original DataFrame to identify how many rows have missing data:

# Display information about the DataFrame to compare row counts.

titanic.info()

Comparing the two number of rows (889 and 891) tells us, that two rows are missing embarkation data.

As we saw in the exercise on the previous page on reading and writing tabular data, the two rows missing embarkation data can be found like this:

titanic[titanic['Embarked'].isna()]

Show code cell output Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
61	62	1	1	Icard, Miss. Amelie	female	38.0	0	0	113572	80.0	B28	NaN
829	830	1	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62.0	0	0	113572	80.0	B28	NaN

Selecting rows and columns#

Selecting specific rows and columns from a DataFrame

The true power of .loc[] becomes evident when you want to filter rows and simultaneously select specific columns. For example:

# Filter rows where 'Age' is greater than 35 and select only the 'Name' and 'Pclass' columns.

titanic.loc[titanic['Age'] > 35, ['Name', 'Pclass']]

Show code cell output Hide code cell output

	Name	Pclass
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1
6	McCarthy, Mr. Timothy J	1
11	Bonnell, Miss. Elizabeth	1
13	Andersson, Mr. Anders Johan	3
15	Hewlett, Mrs. (Mary D Kingcome)	2
...	...	...
865	Bystrom, Mrs. (Karolina)	2
871	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	1
873	Vander Cruyssen, Mr. Victor	3
879	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	1
885	Rice, Mrs. William (Margaret Norton)	3

217 rows × 2 columns

Combining `loc` and `iloc`#

We can use iloc to select:

Rows from index position 9 to 25 (exclusive).
Columns from index position 2 to 6 (exclusive).

Then further filter these rows using loc based on the condition where:

‘Age’ is greater than 35.

titanic.iloc[9:25, 2:6].loc[titanic['Age'] > 35]

Show code cell output Hide code cell output

	Pclass	Name	Sex	Age
11	1	Bonnell, Miss. Elizabeth	female	58.0
13	3	Andersson, Mr. Anders Johan	male	39.0
15	2	Hewlett, Mrs. (Mary D Kingcome)	female	55.0

Key points#

Access a single column or multiple columns from a DataFrame using square bracket notation ([]).
Use label-based indexing (.loc[]) and position-based indexing (.iloc[]) to select specific rows and columns.
Slicing with .loc[] is inclusive, while slicing with .iloc[] is exclusive.
Filter rows based on conditions and apply additional conditions or select specific columns within the same operation.

Subsets

Contents

Subsets#

Import Pandas#

Creating a DataFrame from a CSV file#

Selecting specific columns#

Calling multiple Series#

Introducing .loc[] and .iloc[]#

Filtering rows based on conditional expressions#

Filtering specific rows with .loc[]#

Using .iloc[] for positional indexing#

Inclusive vs. Exclusive Slicing#

Using .loc[] and .iloc[] for complex conditions#

Handling missing values with .loc[] and .iloc[]#

Selecting rows and columns#

Combining loc and iloc#

Key points#

Creating a `DataFrame` from a CSV file#

Calling multiple `Series`#

Introducing `.loc[]` and `.iloc[]`#

Filtering specific rows with `.loc[]`#

Using `.iloc[]` for positional indexing#

Using `.loc[]` and `.iloc[]` for complex conditions#

Handling missing values with `.loc[]` and `.iloc[]`#

Combining `loc` and `iloc`#