Collocation and digitized books#

In this notebook we focus on the concept of collocation. A collocation is a combination of two or more words that frequently appear together, e.g., bear responsibility or weighty arguments. Source: “kollokation” in ordnet.dk

Different approaches are used to find words that collocate in texts. In the NLTK library, the collocation tools are built around n-grams and measures of association. Source: NLTK, Documentation, Collocations. The collocation script in this notebook works differently. It defines a “window” around a keyword, for example 10 words before and after the keyword, and then counts the words inside the window, excluding the keyword itself. The input to the script is a list of words, and I suggest sending in a word list without stopwords, as this usually gives the most useful results.
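
For comparison, below is a minimal sketch of NLTK's n-gram based approach, run on a made-up token list. It assumes the nltk package is installed; NLTK is not used anywhere else in this notebook.

# A minimal sketch of NLTK's n-gram based collocation approach (for comparison only)
# Assumes the nltk package is installed; the token list is made up for illustration
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

toy_tokens = "the great belt the little belt the sound and the great belt".split()
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(toy_tokens)
print(finder.nbest(bigram_measures.pmi, 3))  # the 3 bigrams with the highest PMI score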

The results can be useful for understanding the context and semantics around the selected keywords.

We will use collocations to look into this more than 200-year-old book for travellers in the Nordic periphery of Europe.

Macdonald, James. Travels through Denmark and Part of Sweden during the Winter and Spring of the Year 1809 : Containing Authentic Particulars of the Domestic Condition of Those Countries, the Opinions of the Inhabitants, and the State of Agriculture. 2015.

The book was digitized by The Royal Danish Library in 2015, and it is available at the URL: https://www.kb.dk/e-mat/dod/130014244515_bw.pdf

Download the book#

#! pip install PyPDF2
import requests
from io import BytesIO
from PyPDF2 import PdfReader

# URL to the OCR-scanned PDF version of the text
url = "https://www.kb.dk/e-mat/dod/130014244515_bw.pdf"

# Download the pdf file
response = requests.get(url)
response.raise_for_status()  # Check if the request was successful

# Open the pdf file in memory
pdf_file = BytesIO(response.content)

# Create a PDF reader object
reader = PdfReader(pdf_file)

# Extract text from each page, skipping the first six pages (front matter)
text_content = []
for page in reader.pages[6:]:
    text_content.append(page.extract_text())

# Join all the text content into a single string
full_text = "\n".join(text_content)

# Print the first 10 characters of the extracted text
print(full_text[0:10])
5 .
I.’ V 

Preprocess the text#

The next step is to send the text through a scrubbing pipeline.

import re
def clean(text): 

    # solving a hyphen-newline issue. 
    text = text.replace('-\n', '')
    
    # match a variety of punctuation and special characters
    # the backslash \ and the pipe symbol | play important roles, for example in \?
    # it is a good idea to look up what \ and | do in regular expressions
    text = re.sub(r'\.|«|,|:|;|!|\?|\(|\)|\||\+|\'|\"|‘|’|“|”|\'|\’|…|\-|_|–|—|\$|&|\*|>|<|\/|\[|\]', ' ', text)

    # Regex pattern to match numbers and words containing numbers
    text = re.sub(r'\b\w*\d\w*\b', '', text)
  
    # Remove words with length 2 or less
    text = re.sub(r'\b\w{1,2}\b', '', text)
    
    # collapse sequences of whitespace into a single space
    text = re.sub(r'\s+', ' ', text) 

    # convert the text to lowercase
    text = text.lower()

    # return the text
    return text
    

clean_full_text = clean(full_text)
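
To see what the scrubbing does, the function can be tried on a short made-up string (the example string is illustrative, not taken from the book):

# Try the clean() function on a small, made-up OCR-like string
sample = "The Sound, be-\ntween Elsinore & Helsingborg; 12 miles!"
print(clean(sample))
# prints something like: the sound between elsinore helsingborg miles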

Remove stopwords#

# remove stopwords
import urllib.request

#import an English stopword list
url = "https://sciencedata.dk/shared/5dad7af4b0c257e968111dd0ce19eb99?download"
en_stop_words = urllib.request.urlopen(url).read().decode().split()
# Add additional stopwords using Python's list extend() method
en_stop_words.extend(['■', '%'])

# text data in
text = clean_full_text

# Change text to wordlist
tokens = text.split()
tokens_wo_stopwords = [i for i in tokens if i.lower() not in en_stop_words]
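
As a quick sanity check, the number of tokens before and after the stopword filtering can be compared (the exact counts depend on the stopword list used):

# Compare token counts before and after stopword removal
print(len(tokens), 'tokens before stopword removal')
print(len(tokens_wo_stopwords), 'tokens after stopword removal')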

Collocation#

Add or replace the keywords with your own words.

from collections import Counter
in_data_list = tokens_wo_stopwords

keywords = ['belt', 'sound']


keyword_proximity_counts = {keyword: Counter() for keyword in keywords}
window_size = 10

for i, token in enumerate(in_data_list):
    if token in keywords:
        # Define the window around the keyword
        start = max(0, i - window_size)
        end = min(len(in_data_list), i + window_size + 1)
        # Count terms in the window, excluding the keyword itself
        for j in range(start, end):
            if j != i:
                keyword_proximity_counts[token][in_data_list[j]] += 1

# Keep only terms that appear at least 3 times near a keyword (count >= 3)
filtered_keyword_proximity_counts = {
    keyword: Counter({term: count for term, count in counts.items() if count >= 3})
    for keyword, counts in keyword_proximity_counts.items()
}

filtered_keyword_proximity_counts
{'belt': Counter({'great': 6, 'miles': 3, 'description': 3, 'danes': 3}),
 'sound': Counter({'place': 4,
          'ice': 4,
          'marble': 4,
          'helsingborg': 3,
          'great': 3})}
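
Because the values are Counter objects, the collocates can also be ranked with Counter's most_common() method, for example:

# Rank the collocates for each keyword by how often they occur in the window
for keyword, counts in filtered_keyword_proximity_counts.items():
    print(keyword, counts.most_common(5))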

Explanation of the collocation algorithm#

  1. Input Data:

    • in_data_list: This is a list of tokens (words) from which stopwords have been removed.

    • keywords: A list of specific keywords (['belt', 'sound']) that you are interested in analyzing within the in_data_list.

  2. Data Structures:

    • keyword_proximity_counts: A dictionary where each keyword is associated with a Counter object. This Counter will keep track of how often other terms appear near the keyword within a specified window.

  3. Parameters:

    • window_size: This is set to 10, meaning that the script will consider a window of 10 tokens before and after each occurrence of a keyword in the in_data_list.

  4. Main Loop:

    • The script iterates over each token in in_data_list using enumerate to get both the index (i) and the token itself.

    • If the token is one of the specified keywords, the script defines a “window” around this keyword:

      • start: The beginning of the window, calculated as max(0, i - window_size). This ensures the window doesn’t start before the list begins.

      • end: The end of the window, calculated as min(len(in_data_list), i + window_size + 1). This ensures the window doesn’t extend beyond the list.

    • Within this window, the script counts the occurrence of each term, excluding the keyword itself. This is done using another loop over the indices from start to end. If the current index j is not equal to i (the index of the keyword), the term at in_data_list[j] is counted in the Counter for that keyword.

  5. Filtering:

    • After populating keyword_proximity_counts, the script filters out terms that appear less than a specified number of times (in this case, 3 times) near each keyword.

    • This is done using a dictionary comprehension that creates a new dictionary, filtered_keyword_proximity_counts. For each keyword, it creates a new Counter that only includes terms with a count of 3 or more.

  6. Output:

    • filtered_keyword_proximity_counts: This is the final result, a dictionary where each keyword is associated with a Counter of terms that frequently appear near it, filtered to only include those with a count of at least 3. A toy run of the windowing logic is shown below the list.
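
To make the windowing concrete, here is a toy run of the same logic on a made-up token list with an illustrative window size of 2:

from collections import Counter

# Made-up token list and a small window size, purely for illustration
toy_tokens = ['ice', 'covered', 'the', 'sound', 'near', 'helsingborg', 'this', 'winter']
toy_counts = {'sound': Counter()}
window_size = 2  # two tokens before and after the keyword

for i, token in enumerate(toy_tokens):
    if token in toy_counts:
        start = max(0, i - window_size)
        end = min(len(toy_tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                toy_counts[token][toy_tokens[j]] += 1

print(toy_counts)
# {'sound': Counter({'covered': 1, 'the': 1, 'near': 1, 'helsingborg': 1})}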

NB: The numbered explanation above was created using ChatGPT-4o.