Subsets#

Learning Objectives

Questions:

  • How can we select and filter specific columns and rows in a DataFrame?

Objectives:

  • Select specific columns from a DataFrame using square bracket notation ([]).

  • Use the loc property for label-based indexing in a DataFrame.

  • Use the iloc property for position-based indexing in a DataFrame.

  • Filter rows based on the presence or absence of missing values.

  • Understand the differences and complementary uses of .loc[] and .iloc[].

Import Pandas#

# This line imports the pandas library and aliases it as 'pd'.

import pandas as pd

Creating a DataFrame from a CSV file#

The pd.read_csv() function is used to read the data from the file ‘data/titanic.csv’.

# Load the Titanic dataset from a CSV file into a DataFrame named 'titanic'.

titanic = pd.read_csv('data/titanic.csv')
# Display the DataFrame 'titanic'.

titanic
Hide code cell output
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns


Selecting specific columns#

Selecting specific columns from a DataFrame

Selecting specific columns is useful when you want to work with specific subsets of the dataset e.g. you might want to perform statistical calculations based on only the colum Age or later on when you want to access and manipulate your data in more complex ways.

Access the ‘Age’ column from the DataFrame ‘titanic’ to return a Series object containing all the data in the ‘Age’ column:

titanic['Age']
Hide code cell output
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
# Check the type of the 'Age' column in 'titanic' using the 'type()' function.

type(titanic['Age'])
Hide code cell output
pandas.core.series.Series

Use the shape attribute to determine the dimensions of the Series.
It returns a tuple representing the number of rows and columns (rows, columns).

titanic['Age'].shape
Hide code cell output
(891,)

Calling multiple Series#

Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame using double square brackets:

# Select the columns 'Age' and 'Sex' from the 'titanic' DataFrame.

titanic[['Age', 'Sex']]
Hide code cell output
Age Sex
0 22.0 male
1 38.0 female
2 26.0 female
3 35.0 female
4 35.0 male
... ... ...
886 27.0 male
887 19.0 female
888 NaN female
889 26.0 male
890 32.0 male

891 rows × 2 columns

Note

The inner square brackets define a Python list with column names, whereas the outer square brackets are used to select the data from a pandas DataFrame as seen in the previous example.

Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame using double square brackets, and then apply the type() function to the resulting DataFrame subset:

# Use the 'type()' function to determine the data type of the DataFrame subset.

type(titanic[['Age', 'Sex']])
Hide code cell output
pandas.core.frame.DataFrame

Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame using double square brackets, and then apply the shape attribute to the resulting DataFrame subset:

# Use the 'shape' attribute to determine the dimensions of the DataFrame subset.

titanic[['Age', 'Sex']].shape
Hide code cell output
(891, 2)

Introducing .loc[] and .iloc[]#

Now that you’ve learned how to select specific columns from a DataFrame, it’s time to explore more advanced and flexible ways to access and manipulate your data. As you start filtering rows based on conditions or need to select both rows and columns simultaneously, using .loc[] and .iloc[] will give you greater control and clarity in your code.

These powerful tools allow you to:

  • Safely and explicitly select rows and columns by labels with .loc[].

  • Access rows and columns by their integer positions using .iloc[].

  • Avoid potential issues that can arise with more complex operations, ensuring your data manipulations are clear, consistent, and reliable.

With these methods, you’ll be better equipped to handle more complex data tasks efficiently.


Filtering rows based on conditional expressions#

Filtering specific rows from a DataFrame

With .loc[] and .iloc[], you have more control and flexibility when filtering rows and selecting columns in a DataFrame.

Let’s start by exploring how you can use these tools to filter rows based on conditions.

Filtering specific rows with .loc[]#

To filter rows based on a condition, you can place the condition inside the .loc[] selector.

Here’s how you can filter rows where the Age column is greater than 35:

# Filter rows in the 'titanic' DataFrame where the 'Age' column is greater than 35.

titanic.loc[titanic['Age'] > 35]
Hide code cell output
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
873 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q

217 rows × 12 columns

This command filters the rows where the condition titanic['Age'] > 35 is True, and returns the entire DataFrame for those rows.

If you only want to select specific columns (e.g., Name and Pclass) for those filtered rows, you can add the column names after the comma:

# Filter rows where 'Age' is greater than 35 and select only the 'Name' and 'Pclass' columns.

titanic.loc[titanic['Age'] > 35, ['Name', 'Pclass']]
Hide code cell output
Name Pclass
1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1
6 McCarthy, Mr. Timothy J 1
11 Bonnell, Miss. Elizabeth 1
13 Andersson, Mr. Anders Johan 3
15 Hewlett, Mrs. (Mary D Kingcome) 2
... ... ...
865 Bystrom, Mrs. (Karolina) 2
871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) 1
873 Vander Cruyssen, Mr. Victor 3
879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) 1
885 Rice, Mrs. William (Margaret Norton) 3

217 rows × 2 columns

When using loc/iloc

The part before the comma is the rows you want to select.

The part after the comma is the columns you want to select.

Using .iloc[] for positional indexing#

The .iloc[] method works similarly to .loc[], but it selects rows and columns based on their integer positions rather than labels.

For example, to select the first 5 rows and the first 2 columns, you can use:

# Select the first 5 rows and the first 2 columns using .iloc[].

titanic.iloc[:5, :2]
Hide code cell output
PassengerId Survived
0 1 0
1 2 1
2 3 1
3 4 1
4 5 0

Inclusive vs. Exclusive Slicing#

When slicing with .loc[], the slicing is inclusive of the start and end labels. For example:

# Select rows 0 through 3 (inclusive) and columns 'Name' and 'Pclass'.

titanic.loc[0:3, ['Name', 'Pclass']]
Hide code cell output
Name Pclass
0 Braund, Mr. Owen Harris 3
1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1
2 Heikkinen, Miss. Laina 3
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1

This command returns rows 0, 1, 2, and 3, including both the start and end indices.

When slicing with .iloc[], the slicing is exclusive of the end position:

# Select rows 0 through 2 (exclusive of 3) and the first two columns using .iloc[].

titanic.iloc[0:3, 0:2]
Hide code cell output
PassengerId Survived
0 1 0
1 2 1
2 3 1

Using .loc[] and .iloc[] for complex conditions#

You can use both .loc[] and .iloc[] to filter rows based on more complex conditions, combining multiple criteria with logical operators.

For instance, to filter rows where Pclass is either 2 or 3 using .loc[]:

# Filter rows where 'Pclass' is either 2 or 3 using .loc[].

titanic.loc[titanic['Pclass'].isin([2, 3])]
Hide code cell output
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

675 rows × 12 columns

Similarly, if you want to perform position-based filtering using .iloc[], you would first need to identify the positions of the relevant rows and columns. However, .iloc[] is typically less useful for condition-based filtering because it requires knowledge of the exact row and column positions rather than labels.

You can also combine multiple conditions using logical operators like & (AND) and | (OR):

# Filter rows where ('Pclass' is either 2 or 3) AND ('Age' is greater than 35),
# and then select the 'Name' and 'Pclass' columns.

titanic.loc[((titanic['Pclass'] == 2) | (titanic['Pclass'] == 3)) & (titanic['Age'] > 35), ['Name', 'Pclass']]
Hide code cell output
Name Pclass
13 Andersson, Mr. Anders Johan 3
15 Hewlett, Mrs. (Mary D Kingcome) 2
25 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... 3
33 Wheadon, Mr. Edward H 2
40 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) 3
... ... ...
854 Carter, Mrs. Ernest Courtenay (Lilian Hughes) 2
860 Hansen, Mr. Claus Peter 3
865 Bystrom, Mrs. (Karolina) 2
873 Vander Cruyssen, Mr. Victor 3
885 Rice, Mrs. William (Margaret Norton) 3

113 rows × 2 columns

Handling missing values with .loc[] and .iloc[]#

Both .loc[] and .iloc[] allow you to filter rows based on the presence or absence of missing values.
(However, .loc[] is generally more convenient when working with labeled data, while .iloc[] requires knowing the exact positions.)

For example, to filter rows where the Embarked column is not null using .loc[]:

# Filter rows where 'Embarked' is not null using .loc[].

titanic.loc[titanic['Embarked'].notna()]
Hide code cell output
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

889 rows × 12 columns

You can then compare the number of extracted rows with the original DataFrame to identify how many rows have missing data:

# Display information about the DataFrame to compare row counts.

titanic.info()
Hide code cell output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Comparing the two number of rows (889 and 891) tells us, that two rows are missing embarkation data.

As we saw in the exercise on the previous page on reading and writing tabular data, the two rows missing embarkation data can be found like this:

titanic[titanic['Embarked'].isna()]
Hide code cell output
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 NaN

Selecting rows and columns#

Selecting specific rows and columns from a DataFrame

The true power of .loc[] becomes evident when you want to filter rows and simultaneously select specific columns. For example:

# Filter rows where 'Age' is greater than 35 and select only the 'Name' and 'Pclass' columns.

titanic.loc[titanic['Age'] > 35, ['Name', 'Pclass']]
Hide code cell output
Name Pclass
1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1
6 McCarthy, Mr. Timothy J 1
11 Bonnell, Miss. Elizabeth 1
13 Andersson, Mr. Anders Johan 3
15 Hewlett, Mrs. (Mary D Kingcome) 2
... ... ...
865 Bystrom, Mrs. (Karolina) 2
871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) 1
873 Vander Cruyssen, Mr. Victor 3
879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) 1
885 Rice, Mrs. William (Margaret Norton) 3

217 rows × 2 columns

Combining loc and iloc#

We can use iloc to select:

  • Rows from index position 9 to 25 (exclusive).

  • Columns from index position 2 to 6 (exclusive).

Then further filter these rows using loc based on the condition where:

  • ‘Age’ is greater than 35.

titanic.iloc[9:25, 2:6].loc[titanic['Age'] > 35]
Hide code cell output
Pclass Name Sex Age
11 1 Bonnell, Miss. Elizabeth female 58.0
13 3 Andersson, Mr. Anders Johan male 39.0
15 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0

Key points#

  • Access a single column or multiple columns from a DataFrame using square bracket notation ([]).

  • Use label-based indexing (.loc[]) and position-based indexing (.iloc[]) to select specific rows and columns.

  • Slicing with .loc[] is inclusive, while slicing with .iloc[] is exclusive.

  • Filter rows based on conditions and apply additional conditions or select specific columns within the same operation.