Harvest text from the web#
The purpose of this notebook is to learn how to scrape data from the web in order to obtain text data for digital text analysis.
To achieve this, we start with a very brief introduction to HTML and continue with the libraries BeautifulSoup and Requests.
Introduction to HTML#
Read your first HTML file, and let us inspect it.
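If you are following along and do not yet have the file, here is a minimal sketch that creates html1.html with the same content as the snippet printed below. You can create html2.html and html3.html the same way, using the snippets shown later in this notebook.
# Create a minimal html1.html matching the snippet printed below
sample_html = """<div>
<p>This is a paragraph</p>
</div>"""

with open('html1.html', 'w', encoding='utf-8') as file:
    file.write(sample_html)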
# Read your html file
html_file_path = 'html1.html'
with open(html_file_path, 'r', encoding='utf-8') as file:
    html_content = file.read()

# Print the HTML content
print(html_content)
<div>
<p>This is a paragraph</p>
</div>
The first line is <div>. It is an HTML tag.
< is the opening of the tag. The word div is the type of tag, and it defines a section in an HTML document. The second line begins with a <p> tag, which defines a paragraph. The <p> tag is nested, which is a way of saying that it is embedded inside the <div> tag.
Attributes#
Read another HTML file and look at the first line. Inside the div tag we have added an attribute called class. The attribute has “content” as its value. Besides class, id is another common attribute.
# Read your html file
html_file_path = 'html2.html'
with open(html_file_path, 'r', encoding='utf-8') as file:
    html_content = file.read()

# Print the HTML content
print(html_content)
<div class="content">
<p>This is a paragraph</p>
</div>
Working with HTML in Python#
When people work with HTML in Python, they often use the libraries BeautifulSoup and Selenium together with regular expressions, or a combination of the three.
This notebook sticks with BeautifulSoup alone.
To install BeautifulSoup you can use either pip install beautifulsoup4 or conda install anaconda::beautifulsoup4.
When the library is installed, import it by writing:
from bs4 import BeautifulSoup as bs
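A quick way to confirm that the import works is to parse a small string directly; a minimal sketch:
# Parse a tiny HTML string and extract the paragraph text
soup = bs('<p>Hello, world</p>', 'html.parser')
print(soup.find('p').text)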
Now you can read an HTML file and work with the HTML in Python.
# Read your html file
html_file_path = 'html3.html'
with open(html_file_path, 'r', encoding='utf-8') as file:
    html_content = file.read()

# Print the HTML content
print(html_content)
<html>
<body>
<div class="content">
<h1>Title</h1>
<p>My first paragraph</p>
</div>
<div class="other">
<h2>My first headline</h2>
<p>My second paragraph</p>
</div>
<div class="content">
<h2>My second headline</h2>
<p>My third paragraph</p>
</div>
</body>
</html>
When working with BeautifulSoup, the convention is to store the parsed HTML in a variable called soup.
soup = bs(html_content, 'html.parser')
print(soup.prettify())
<html>
<body>
<div class="content">
<h1>
Title
</h1>
<p>
My first paragraph
</p>
</div>
<div class="other">
<h2>
My first headline
</h2>
<p>
My second paragraph
</p>
</div>
<div class="content">
<h2>
My second headline
</h2>
<p>
My third paragraph
</p>
</div>
</body>
</html>
Find and find_all#
.find returns the first matching tag.
.find_all returns a list of all matching tags.
soup.find('div')
<div class="content">
<h1>Title</h1>
<p>My first paragraph</p>
</div>
soup.find_all('div')
[<div class="content">
<h1>Title</h1>
<p>My first paragraph</p>
</div>,
<div class="other">
<h2>My first headline</h2>
<p>My second paragraph</p>
</div>,
<div class="content">
<h2>My second headline</h2>
<p>My third paragraph</p>
</div>]
When you have a list, you can access each element in the list using its index number. [0] is the first element.
soup.find_all('div')[0]
<div class="content">
<h1>Title</h1>
<p>My first paragraph</p>
</div>
How do we access the p tag in the first element?
We do it like this:
first_item = soup.find_all('div')[0]
first_item.find('p')
<p>My first paragraph</p>
By adding .text we get the actual text string rather than the tag.
first_item = soup.find_all('div')[0]
first_item.find('p').text
'My first paragraph'
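As an aside, .get_text() does the same and accepts options such as strip=True, which trims surrounding whitespace; a small sketch using the soup from above:
first_item = soup.find_all('div')[0]
print(first_item.find('p').get_text(strip=True))  # 'My first paragraph'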
When you have a list, you can also loop through it using a for loop. Note that .text on a div returns the text of everything nested inside it, which is why both headlines and paragraphs are printed below.
for i in soup.find_all('div'):
    print(i.text)
Title
My first paragraph
My first headline
My second paragraph
My second headline
My third paragraph
If you are only interested in the div tags that have a particular attribute, you can add an argument to .find or .find_all. The argument takes the form of a dictionary: the key is the attribute name and the value is the attribute value you want to match.
soup.find_all('div', {'class': 'other'})
[<div class="other">
<h2>My first headline</h2>
<p>My second paragraph</p>
</div>]
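As an aside, BeautifulSoup also accepts the class_ keyword argument (with a trailing underscore, because class is a reserved word in Python) as an alternative to the dictionary. This returns the same result as above:
soup.find_all('div', class_='other')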
all_content = soup.find_all('div', {'class': 'content'})
all_content
[<div class="content">
<h1>Title</h1>
<p>My first paragraph</p>
</div>,
<div class="content">
<h2>My second headline</h2>
<p>My third paragraph</p>
</div>]
for i in all_content:
    p_tag = i.find('p')
    print(p_tag.text)
My first paragraph
My third paragraph
You can also search for several tag types at once by passing a list to .find_all:
all_text_tags = soup.find_all(['h1', 'h2', 'p'])
for i in all_text_tags:
    print(i.text)
Title
My first paragraph
My first headline
My second paragraph
My second headline
My third paragraph
Scrape a webpage#
When scraping data from the web, you should behave properly. Here are some rules of thumb that you can take with you.
Take only the data you need, and consider saving the data you harvest instead of harvesting the same data many times.
Be careful not to scrape material that you are not allowed to use.
Do not try to get what you cannot access.
Slow down! Avoid sending too many requests in a short period. Use a timer to add a pause between requests, as in the sketch below.
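A minimal sketch of such a pause using time.sleep, with a hypothetical list of pages (it uses the Requests library introduced below):
import time
import requests

# Hypothetical list of pages to visit
urls = ['https://httpbin.org/headers', 'https://httpbin.org/ip']

for url in urls:
    response = requests.get(url)
    print(response.status_code)
    time.sleep(2)  # wait two seconds before sending the next request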
Go to https://httpbin.org/headers to see the user-agent information that you can add to your script.
To scrape webpages using BeautifulSoup you need to add another library: the Requests library.
# Import libraries
import requests
from bs4 import BeautifulSoup as bs
Read the information that you send to the webpage when making a request by inspecting the response from https://httpbin.org/headers.
url = 'https://httpbin.org/headers'
response = requests.get(url)
response.text
'{\n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate, br", \n "Host": "httpbin.org", \n "User-Agent": "python-requests/2.32.3", \n "X-Amzn-Trace-Id": "Root=1-677e4c71-08f63e9365a4e70c2baf1a56"\n }\n}\n'
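Since httpbin returns JSON, you can also parse the response with .json() instead of reading the raw text; a small sketch that picks out the User-Agent value:
data = response.json()
print(data['headers']['User-Agent'])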
Customize your header#
When running the code below, you can see that the User-Agent reveals that you are using python-requests.
Some guides to web scraping will encourage you to change the value of the User-Agent, because some websites block requests whose headers contain python-requests.
If you wish to change the value of User-Agent, open https://httpbin.org/headers in your default browser to get the data to add to your customized header.
I would encourage you to add your name and email to your header. In this way web managers will know that the request is not from an evil-minded person, and they have a chance to reach out to you.
url = 'https://httpbin.org/headers'
response = requests.get(url)
response.text
'{\n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate, br", \n "Host": "httpbin.org", \n "User-Agent": "python-requests/2.32.3", \n "X-Amzn-Trace-Id": "Root=1-677e4c72-2965d87815aa04c40c63e835"\n }\n}\n'
To customize your header you can do this:
url = 'https://httpbin.org/headers'
headers={"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0"}
response = requests.get(url, headers=headers)
response.text
'{\n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate, br", \n "Host": "httpbin.org", \n "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0", \n "X-Amzn-Trace-Id": "Root=1-677e4c72-4514c8bd341f350e34ff4c20"\n }\n}\n'
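To follow the advice above about adding your name and email, one option is the standard From header; the contact details below are placeholders:
url = 'https://httpbin.org/headers'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0",
    "From": "your.name@example.com",  # placeholder: replace with your own email
}
response = requests.get(url, headers=headers)
print(response.text)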
Scrape a wiki page#
In the script below we:
Store the url of the Wikipedia page as a string in a variable called url.
Customize the header. It has to be built as a dictionary.
Use the requests.get method and add the url and the header as arguments.
Parse the HTML with BeautifulSoup and store it in a variable called soup.
url = 'https://en.wikipedia.org/wiki/2019%E2%80%932020_Hong_Kong_protests'
headers={"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0"}
response = requests.get(url, headers=headers)
soup = bs(response.text, 'html.parser')
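Before parsing, it can be a good idea to check that the request actually succeeded. A minimal sketch using raise_for_status(), which raises an HTTPError for 4xx and 5xx responses:
response = requests.get(url, headers=headers)
response.raise_for_status()  # stops the script here if the request failed
soup = bs(response.text, 'html.parser')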
Parse / extract data from the HTML#
When we parse data from the HTML, we have to use the browser's inspect tool. I use Firefox, and when I right-click on the Wikipedia page I can choose a tool called Inspect from the popup menu. Take a look at this video for more information on how to use Inspect: Python for Digital Humanities - 21: Beautiful Soup
The title is in the ‘h1’ (headline) tag that has an attribute called ‘id’ with the value ‘firstHeading’.
We can use .find and add the dictionary as an argument. Then we store the returned tag in a variable (title_tag). In the second line we add .text to the variable to extract the text string from the tag. We send the text string to the variable title and print it.
title_tag = soup.find('h1', {'id': "firstHeading"})
title = title_tag.text
title
'2019–2020 Hong Kong protests'
The rest of the content is embedded in h2 and p tags.
Let us extract the text from the h1, h2 and p tags and store the text data in a txt file.
all_text_tags = soup.find_all(['h1', 'h2', 'p'])
all_text = [i.text for i in all_text_tags]
all_text = ' '.join(all_text)
# Store the text as a txt file
with open(title + '.txt', 'w', encoding='utf-8') as f:
    f.write(all_text)
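To check the result, you can read the file back and print the first part of the text; a small sketch:
with open(title + '.txt', 'r', encoding='utf-8') as f:
    print(f.read()[:200])  # print the first 200 characters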
Now you are done!
You can clean the text before you start analysing it. To do so, take a look at the “Text preprocessing pipeline”.
Read more:#
Dr. W.J.B. Mattingly Introduction to Python for Humanists
Dr. W.J.B. Mattingly Python Tutorials for Digital Humanities
John Watson Rooney Request Headers for Web Scraping