Harvest People's Daily from June 1989
The purpose of this notebook is to show how to scrape text from the web using Python and the BeautifulSoup library.
# Import libraries
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import time
import random
On 'https://www.laoziliao.net/rmrb/' you can find old editions of People's Daily.
In this notebook I show how to harvest text from June 1989.
Start by retrieving the links located on the page 'https://www.laoziliao.net/rmrb/1989-06'.
The links sit in a div box with the id 'month_box'.
# url to papers from June 1989
url = 'https://www.laoziliao.net/rmrb/1989-06'
# download html
headers={"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0"}
response = requests.get(url, headers=headers)
soup = bs(response.text, 'html.parser')
# get the div tag whose 'id' attribute has the value 'month_box'
month_box = soup.find('div', {'id' : 'month_box'})
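To get a feel for what the box contains, we can print the first part of its HTML (just for inspection):
# peek at the start of the month box to see how the day links are laid out
print(month_box.prettify()[:500])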
When we dive into the HTML, we can see that we need to find all the 'a' tags in the month box. These contain the links that take us to the pages with the articles for each day of the month.
We can see that the dates are stated in the URLs; see for example '1989-06-01' in the first link.
To extract the links we must loop over the list and apply .get('href') to each item.
a_tags = month_box.find_all('a')
# loop through the list of 'a' tags and get the hyperlink reference for each day
links = [i.get('href') for i in a_tags]
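A quick look at the first few links confirms the date pattern in the URLs (just for inspection):
# the hrefs should contain the dates, e.g. '1989-06-01'
print(links[:3])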
Each link goes to a subpage, and the HTML of the pages must be downloaded and inspected.
def get_soup(url):
    # download html
    headers = {'name': 'add name', 'e-mail': 'add email'}
    response = requests.get(url, headers=headers)
    soup = bs(response.text, 'html.parser')
    # pause between requests to be polite to the server
    time.sleep(random.randint(0, 3))
    return soup
soups_from_each_day = [get_soup(link) for link in links]
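Since June has 30 days, a quick check that we got one parsed page per day (just an inspection):
# June 1989 has 30 days, so we expect 30 soups
print(len(soups_from_each_day))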
A close reading of the HTML shows that the pages are set up so that the links to the different articles are located in the div box '<div class="card mt-2">'.
After this, we end up with a list of lists, which we flatten into one list.
The 'a' tags contain both the links to the articles and the article titles. We store the links in one long list and the titles in another, and collect both lists in a dataframe.
def get_links_to_articles(another_soup):
    # find all the article cards on the page for one day
    card_mt2 = another_soup.find_all('div', {'class' : 'card mt-2'})
    def get_a_tags(i):
        return i.find_all('a')
    listoflists = [get_a_tags(i) for i in card_mt2]
    # flatten the list of lists into one list of 'a' tags
    a_tag_list = [i for y in listoflists for i in y]
    return a_tag_list
lists_of_a_tags_linking_to_individual_articles = [get_links_to_articles(another_soup) for another_soup in soups_from_each_day]
flatten_lists_of_a_tags_linking_to_individual_articles = [i for y in lists_of_a_tags_linking_to_individual_articles for i in y]
links_to_individual_articles = [i.get('href') for i in flatten_lists_of_a_tags_linking_to_individual_articles]
titles_of_individual_articles = [i.text for i in flatten_lists_of_a_tags_linking_to_individual_articles]
df = pd.DataFrame({'links': links_to_individual_articles,
'titles': titles_of_individual_articles})
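A quick check of how many articles we found (just an inspection):
# number of article links and titles collected
print(len(df))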
1760 articles from June 1989 have been found, and each article must now be downloaded. A close reading of the links shows that the fragment after the '#' in each URL identifies the individual article on the day's page, and we use this below to pick out the right card.
The download then takes quite a long time, because 1760 requests must be made (plus the random pause between them).
When the articles have been retrieved, they are added to our dataframe.
raw_texts = []
for i in links_to_individual_articles:
    # identify the article id in the link (the fragment after '#')
    get_article_id = i.split('#')[-1]
    just_another_soup = get_soup(i)
    card_mt2 = just_another_soup.find_all('div', {'class' : 'card mt-2'})
    for j in card_mt2:
        # identify the card_mt2 element that holds the article id
        if get_article_id in str(j):
            raw_texts.append(j.text)
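# sanity check: we assume each article id matches exactly one card,
# so the number of texts should equal the number of links
assert len(raw_texts) == len(links_to_individual_articles)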
# add the texts to the dataframe
df['text'] = raw_texts
We save this dataframe both as .pkl and as .tsv.
df.to_pickle("rmrb_june_1989.pkl")
df.to_csv('rmrb_june_1989.tsv', sep='\t', index=False, header=True)
Later we can import the data again like this.
import pandas as pd
df1 = pd.read_table('rmrb_june_1989.tsv', delimiter='\t')
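A quick check that the round trip worked (just an inspection; the exact output depends on the scraped data):
# confirm that the reloaded dataframe has the expected columns and row count
print(df1.shape)
print(df1.columns.tolist())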