Collocation and digitized books#
In this notebook we focus on the concept of collocation. A collocation is a combination of two or more words that frequently appear together, e.g., bear responsibility or weighty arguments. Source: “kollokation” in ordnet.dk
Different approaches are used to find words that collocate in texts. In the NLTK library the collocation tool is built around n-grams and measures of association. Source: NLTK, Documentation, Collocations. The collocation script in this notebook is different. It works by defining a “window size” around a keyword, for example 10 words before and after the keyword. The script then counts the words within the window, excluding the keyword itself. The input to the script is a list of words, and I suggest that you send in a word list without stopwords, as this will often give the most useful results. A small NLTK sketch is shown below for comparison.
The results can be useful for getting an understanding of the context and semantics around the selected keywords.
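For comparison, a minimal sketch of NLTK’s n-gram based approach could look like the code below. It is illustrative only and not used in the rest of the notebook; it assumes that nltk is installed and that tokens is a list of words, for example the stopword-free word list created later in this notebook.
# Illustrative only: NLTK's n-gram / association-measure approach to collocations
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)   # tokens: a list of words
finder.apply_freq_filter(3)                           # ignore bigrams occurring fewer than 3 times
print(finder.nbest(bigram_measures.pmi, 10))          # 10 bigrams with the highest PMI score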
We will use collocations to look into an old book for travellers in the Nordic periphery of Europe.
The book was digitized by The Royal Danish Library in 2015, and it is available at this URL: https://www.kb.dk/e-mat/dod/130014244515_bw.pdf
Download the book#
#! pip install PyPDF2
import requests
from io import BytesIO
from PyPDF2 import PdfReader
# URL to the OCR-scanned PDF version of the text
url = "https://www.kb.dk/e-mat/dod/130014244515_bw.pdf"
# Download the pdf file
response = requests.get(url)
response.raise_for_status() # Check if the request was successful
# Open the pdf file in memory
pdf_file = BytesIO(response.content)
# Create a PDF reader object
reader = PdfReader(pdf_file)
# Extract the text from each page, skipping the first six pages (front matter)
text_content = []
for page in reader.pages[6:]:
    text_content.append(page.extract_text())
# Join all the text content into a single string
full_text = "\n".join(text_content)
# Print the first characters of the extracted text
print(full_text[0:10])
5 .
I.’ V
Preprocess the text#
The next step is to send the text through a scrubbing pipeline.
import re
def clean(text):
    # fix the hyphen-newline issue, i.e. words split across line breaks
    text = text.replace('-\n', '')
    # match a variety of punctuation and special characters
    # the backslash \ and the pipe symbol | play important roles, for example in \?
    # it is a good idea to look up what \ and | do in regular expressions
    text = re.sub(r'\.|«|,|:|;|!|\?|\(|\)|\||\+|\'|\"|‘|’|“|”|\'|\’|…|\-|_|–|—|\$|&|\*|>|<|\/|\[|\]', ' ', text)
    # remove numbers and words containing numbers
    text = re.sub(r'\b\w*\d\w*\b', '', text)
    # remove words with a length of 2 or less
    text = re.sub(r'\b\w{1,2}\b', '', text)
    # collapse sequences of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # convert to lowercase
    text = text.lower()
    # return the cleaned text
    return text
clean_full_text = clean(full_text)
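To get a feel for what the pipeline does, you can run clean() on a small made-up sample. The sentence below is hypothetical and not a quote from the book.
# Hypothetical sample sentence illustrating the scrubbing steps
sample = "The Sound, be-\ntween Elsinore and Helsingborg, is 3 miles wide!"
print(clean(sample))
# expected output (roughly): the sound between elsinore and helsingborg miles wide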
Remove stopwords#
# remove stopwords
import urllib.request
#import an English stopword list
url = "https://sciencedata.dk/shared/5dad7af4b0c257e968111dd0ce19eb99?download"
en_stop_words = urllib.request.urlopen(url).read().decode().split()
# Add additional stopwords using Python's list extend() method
en_stop_words.extend(['■', '%'])
# text data in
text = clean_full_text
# Change text to wordlist
tokens = text.split()
tokens_wo_stopwords = [i for i in tokens if i.lower() not in en_stop_words]
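A quick way to check the effect of the filter is to compare the number of tokens before and after stopword removal; the exact numbers will depend on the stopword list used.
# Sanity check: token counts before and after stopword removal
print(len(tokens), len(tokens_wo_stopwords))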
Collocation#
Add or replace the keywords with your own words.
from collections import Counter
in_data_list = tokens_wo_stopwords
keywords = ['belt', 'sound']
keyword_proximity_counts = {keyword: Counter() for keyword in keywords}
window_size = 10
for i, token in enumerate(in_data_list):
    if token in keywords:
        # Define the window around the keyword
        start = max(0, i - window_size)
        end = min(len(in_data_list), i + window_size + 1)
        # Count terms in the window, excluding the keyword itself
        for j in range(start, end):
            if j != i:
                keyword_proximity_counts[token][in_data_list[j]] += 1
# Filter out terms that occur fewer than 3 times near a keyword (keep count >= 3)
filtered_keyword_proximity_counts = {
    keyword: Counter({term: count for term, count in counts.items() if count >= 3})
    for keyword, counts in keyword_proximity_counts.items()
}
filtered_keyword_proximity_counts
{'belt': Counter({'great': 6, 'miles': 3, 'description': 3, 'danes': 3}),
'sound': Counter({'place': 4,
'ice': 4,
'marble': 4,
'helsingborg': 3,
'great': 3})}
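If you prefer a ranked view, the most_common() method on Counter lists the strongest collocates for each keyword.
# Optional: print the most frequent co-occurring terms per keyword
for keyword, counts in filtered_keyword_proximity_counts.items():
    print(keyword, counts.most_common(5))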
Explanation of the collocation algorithm#
Input Data:
- in_data_list: a list of tokens (words) from which stopwords have been removed.
- keywords: a list of the keywords (['belt', 'sound']) that you are interested in analyzing within in_data_list.
Data Structures:
- keyword_proximity_counts: a dictionary where each keyword is associated with a Counter object. This Counter keeps track of how often other terms appear near the keyword within a specified window.
Parameters:
- window_size: set to 10, meaning that the script considers a window of 10 tokens before and after each occurrence of a keyword in in_data_list.
Main Loop:
- The script iterates over each token in in_data_list using enumerate to get both the index (i) and the token itself.
- If the token is one of the specified keywords, the script defines a “window” around this keyword:
  - start: the beginning of the window, calculated as max(0, i - window_size). This ensures the window doesn’t start before the list begins.
  - end: the end of the window, calculated as min(len(in_data_list), i + window_size + 1). This ensures the window doesn’t extend beyond the list.
- Within this window, the script counts the occurrence of each term, excluding the keyword itself. This is done using another loop over the indices from start to end. If the current index j is not equal to i (the index of the keyword), the term at in_data_list[j] is counted in the Counter for that keyword.
Filtering:
- After populating keyword_proximity_counts, the script filters out terms that appear fewer than a specified number of times (in this case, 3 times) near each keyword.
- This is done using a dictionary comprehension that creates a new dictionary, filtered_keyword_proximity_counts. For each keyword, it creates a new Counter that only includes terms with a count of 3 or more.
Output:
- filtered_keyword_proximity_counts: the final result, a dictionary where each keyword is associated with a Counter of terms that frequently appear near it, filtered to only include those with a count of at least 3.
NB: Explanation created using ChatGPT-4 omni.
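To see the windowing logic in isolation, here is a toy example with made-up tokens and a window_size reduced to 2. It is purely illustrative and not part of the analysis above.
from collections import Counter
# Made-up token list and a single keyword; window_size reduced to 2 for readability
toy_tokens = ['ice', 'covers', 'the', 'sound', 'near', 'helsingborg', 'castle']
toy_counts = {'sound': Counter()}
window_size = 2
for i, token in enumerate(toy_tokens):
    if token in toy_counts:
        start = max(0, i - window_size)
        end = min(len(toy_tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                toy_counts[token][toy_tokens[j]] += 1
print(toy_counts)
# expected: {'sound': Counter({'covers': 1, 'the': 1, 'near': 1, 'helsingborg': 1})}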