Identify textual changes#

As example data we use the first chapter of two different editions of the novel Frankenstein.

Metadata first book. Author: Shelley, Mary Wollstonecraft, 1797-1851 Title: Frankenstein; Or, The Modern Prometheus Original Publication: United Kingdom: Lackington, Hughes, Harding, Mavor, & Jones, 1818.

Metadata second book. Author: Shelley, Mary Wollstonecraft, 1797-1851 Title: Frankenstein; Or, The Modern Prometheus Original Publication: United Kingdom : H. Colburn and R. Bentley, 1831.

After extracting the first chapter of each book, we try two different approaches to text comparison.

For the first approach we use difflib, a module from the Python standard library that computes differences between sequences and is often used to compare different versions of code.

For the second approach we use the library fuzzywuzzy, which uses Levenshtein distance to calculate the differences between sequences.
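To make the idea of Levenshtein distance concrete: it counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The following is a minimal pure-Python sketch for illustration only. The function name `levenshtein` is our own and is not part of fuzzywuzzy, and this is not necessarily how fuzzywuzzy computes its scores internally.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b (illustrative helper)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein('kitten', 'sitting'))  # 3
print(levenshtein('honour', 'honor'))    # 1
```

A small distance relative to the length of the strings means the two texts are close to each other.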

Download the two novels from Gutenberg.org#

Download the two novels from Gutenberg.org and extract the first chapter of each book.

import urllib.request

# Get the 1818 edition
url1 = 'https://gutenberg.org/cache/epub/41445/pg41445.txt'
raw_text1 = urllib.request.urlopen(url1).read().decode()
# Get Chapter 1
chap1_1 = raw_text1[raw_text1.find('CHAPTER I'):raw_text1.find('CHAPTER II')]

# Get the 1831 edition
url2 = 'https://gutenberg.org/cache/epub/42324/pg42324.txt'
raw_text2 = urllib.request.urlopen(url2).read().decode()
# Get Chapter 1
chap1_2 = raw_text2[raw_text2.find('CHAPTER I'):raw_text2.find('CHAPTER II')]
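The slicing above relies on `str.find`, which returns -1 when a marker is missing, so a changed heading would silently produce a wrong slice. As a hedged sketch, the extraction could be wrapped in a small helper that fails loudly instead. The name `extract_chapter` is hypothetical, and the sample string stands in for the downloaded text so that the example runs without network access.

```python
def extract_chapter(text, start_marker, end_marker):
    """Return the text from start_marker up to (not including) end_marker."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker {start_marker!r} not found")
    # Search for the end marker only after the start marker
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        raise ValueError(f"end marker {end_marker!r} not found")
    return text[start:end]

# Stand-in for a downloaded book (assumption, not the real Gutenberg text)
sample = "PREFACE. CHAPTER I First chapter text. CHAPTER II Second chapter text."
print(extract_chapter(sample, 'CHAPTER I', 'CHAPTER II'))
```

Like the original slicing, this sketch takes the first occurrence of each marker, so a table of contents that lists the chapter headings would still need separate handling.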

Preprocess the text to get homogeneous data#

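import re

def preprocess_text(text):
    # Replace line breaks with spaces so sentences are not split across lines
    text = text.replace('\r', ' ').replace('\n', ' ')
    # Remove all punctuation except full stops, which are used later to split sentences
    text = re.sub(r'[^\w\s.]', '', text)
    # Collapse runs of whitespace into a single space
    return re.sub(r'\s+', ' ', text)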

edition1818 = preprocess_text(chap1_1)
edition1831 = preprocess_text(chap1_2)
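To see what the preprocessing does, it can help to run it on a short string. The sketch below repeats the function so that it is self-contained; the sample sentence is made up for illustration.

```python
import re

def preprocess_text(text):
    text = text.replace('\r', ' ').replace('\n', ' ')
    text = re.sub(r'[^\w\s.]', '', text)
    return re.sub(r'\s+', ' ', text)

sample = "in a merchant's house;\r\nhe hoped."
print(preprocess_text(sample))  # in a merchants house he hoped.
```

Note that apostrophes, commas and semicolons are removed along with the line breaks, which is why phrases such as "merchants house" appear without an apostrophe in the comparison output.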

Using difflib for text comparison#

Use the information in this table to understand the output.

| Code | Meaning |
|------|---------|
| '- ' | line unique to sequence 1 |
| '+ ' | line unique to sequence 2 |
| '  ' | line common to both sequences |
| '? ' | line not present in either input sequence |

Lines beginning with ‘?’ attempt to guide the eye to intraline differences, and were not present in either input sequence. These lines can be confusing if the sequences contain whitespace characters, such as spaces, tabs or line breaks.

Source: https://docs.python.org/3/library/difflib.html
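The codes in the table can be tried out on a tiny input before running the comparison on the full chapter. This is a minimal sketch with made-up sentence lists; `repr` is used only to make the two-character prefix of each line visible.

```python
import difflib

# Two short lists of "sentences" (illustrative data, not taken from the novels)
old = ['He was respected by all', 'He grieved also for the loss']
new = ['He was respected by all', 'He bitterly deplored the false pride']

diff_lines = list(difflib.Differ().compare(old, new))
for line in diff_lines:
    print(repr(line))
```

The shared sentence is printed with a two-space prefix, while the sentence that exists only in `old` starts with `'- '` and the one that exists only in `new` starts with `'+ '`.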

# Using difflib for text comparison
import difflib

def compare_texts(text1, text2):
    # Split the texts into sentences (the preprocessing kept the full stops for this)
    lines1 = text1.split('. ')
    lines2 = text2.split('. ')

    # Create a Differ object
    differ = difflib.Differ()

    # Compare the texts
    diff = differ.compare(lines1, lines2)

    # Print the differences
    for line in diff:
        print(line)

# Compare the first 2000 characters of the first chapter of each edition
compare_texts(edition1818[0:2000], edition1831[0:2000])
  CHAPTER I
  I am by birth a Genevese and my family is one of the most distinguished of that republic
  My ancestors had been for many years counsellors and syndics and my father had filled several public situations with honour and reputation
  He was respected by all who knew him for his integrity and indefatigable attention to public business
- He passed his younger days perpetually occupied by the affairs of his country and it was not until the decline of life that he thought of marrying and bestowing on the state sons who might carry his virtues and his name down to posterity
+ He passed his younger days perpetually occupied by the affairs of his country a variety of circumstances had prevented his marrying early nor was it until the decline of life that he became a husband and the father of a family
  As the circumstances of his marriage illustrate his character I cannot refrain from relating them
  One of his most intimate friends was a merchant who from a flourishing state fell through numerous mischances into poverty
  This man whose name was Beaufort was of a proud and unbending disposition and could not bear to live in poverty and oblivion in the same country where he had formerly been distinguished for his rank and magnificence
  Having paid his debts therefore in the most honourable manner he retreated with his daughter to the town of Lucerne where he lived unknown and in wretchedness
  My father loved Beaufort with the truest friendship and was deeply grieved by his retreat in these unfortunate circumstances
- He grieved also for the loss of his society and resolved to seek him out and endeavour to persuade him to begin the world again through his credit and assistance
+ He bitterly deplored the false pride which led his friend to a conduct so little worthy of the affection that united them
+ He lost no time in endeavouring to seek him out with the hope of persuading him to begin the world again through his credit and assistance
  Beaufort had taken effectual measures to conceal himself and it was ten months before my father discovered his abode
  Overjoyed at this discovery he hastened to the house which was situated in a mean street near the Reuss
  But when he entered misery and despair alone welcomed him
- Beaufort had saved but a very small sum of money from the wreck of his fortunes but it was sufficient to provide him with sustenance for some months and in the mean time he hoped to procure some respectable employment in a merchants house
?                                                                                                                                                              ---------------------------------------------------------------------------------

+ Beaufort had saved but a very small sum of money from the wreck of his fortunes but it was sufficient to provide him with sustenance for some months and in t
- The in
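The last lines of the output above are cut off mid-sentence because the slice `[0:2000]` counts characters, not sentences. One possible alternative, shown here as a sketch under our own assumptions, is to limit the number of sentences and to report only the changed ones using `difflib.SequenceMatcher.get_opcodes()`. The function name `changed_sentences` and the parameter `max_sentences` are hypothetical.

```python
import difflib

def changed_sentences(text1, text2, max_sentences=None):
    """Return (tag, old_sentences, new_sentences) for every non-equal block."""
    # Limit by number of sentences so that no sentence is cut in half
    sentences1 = text1.split('. ')[:max_sentences]
    sentences2 = text2.split('. ')[:max_sentences]
    matcher = difflib.SequenceMatcher(None, sentences1, sentences2, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != 'equal':  # tag is 'replace', 'delete' or 'insert'
            changes.append((tag, sentences1[i1:i2], sentences2[j1:j2]))
    return changes

# Made-up miniature texts for illustration
old = 'He was respected. He grieved also. Beaufort hid'
new = 'He was respected. He bitterly deplored it. He lost no time. Beaufort hid'
for change in changed_sentences(old, new):
    print(change)
```

With the preprocessed chapters this could be called as `changed_sentences(edition1818, edition1831, max_sentences=20)`.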

Using fuzzywuzzy to compute the similarity of two texts#

# ! pip install fuzzywuzzy
# ! pip install python-Levenshtein
# Using fuzzywuzzy for Sentence Similarity

from fuzzywuzzy import fuzz

def compare_sentences(sentence1, sentence2):
    # Calculate the similarity ratio (an integer from 0 to 100)
    ratio = fuzz.ratio(sentence1, sentence2)
    return ratio

similarity = compare_sentences(edition1818, edition1831)
print(f"Similarity: {similarity}%")
Similarity: 54%
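A single score for the whole chapter does not show where the two editions differ. As a hedged sketch, the same idea can be applied sentence by sentence: each sentence of the first text is paired with its most similar sentence in the second text. To keep the example runnable without extra packages it uses `difflib.SequenceMatcher.ratio()` as a stand-in for `fuzz.ratio`, so the scores may differ from those of fuzzywuzzy. The function name `sentence_similarities` is hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity of two strings as an integer from 0 to 100 (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def sentence_similarities(text1, text2):
    """Pair each sentence of text1 with its best match in text2, least similar first."""
    sentences1 = text1.split('. ')
    sentences2 = text2.split('. ')
    scores = []
    for s1 in sentences1:
        best = max(sentences2, key=lambda s2: similarity(s1, s2))
        scores.append((similarity(s1, best), s1, best))
    # Sort by score so the most heavily changed sentences come first
    return sorted(scores, key=lambda item: item[0])

# Made-up miniature texts for illustration
scores = sentence_similarities('A cat sat. A dog ran', 'A cat sat. A dog runs')
for score, s1, s2 in scores:
    print(score, '|', s1, '|', s2)
```

Comparing every sentence with every other sentence is quadratic in the number of sentences, which is acceptable for a single chapter but may be slow for a whole book.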