A text preprocessing pipeline
The goal is to turn raw text into uniform data that is ready for analysis.
To get there we need some text to work with. Here we use the novel Frankenstein, which is available on Gutenberg.org.
# Import libraries
import urllib.request
# Get the 1818 edition of Frankenstein
url = 'https://gutenberg.org/cache/epub/41445/pg41445.txt'
raw_text = urllib.request.urlopen(url).read().decode()
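# Keep only the novel: from the preface up to the Project Gutenberg end marker
text_start = raw_text.find('PREFACE.')
text_end = raw_text.find('*** END OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN; OR, THE MODERN PROMETHEUS ***')
text = raw_text[text_start:text_end].strip()  # Slice out the novel and trim surrounding whitespace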
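One thing to watch with this slicing approach: `str.find` returns -1 when a marker is missing, and slicing with -1 silently produces the wrong text instead of raising an error. The sketch below wraps the same idea in a small helper that fails loudly. The name `slice_between` and the short sample string are made up for illustration and are not part of the original pipeline.

```python
def slice_between(raw, start_marker, end_marker):
    """Return the text from start_marker up to (not including) end_marker."""
    start = raw.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    # Search for the end marker only after the start marker
    end = raw.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return raw[start:end].strip()


# Hypothetical stand-in for the downloaded file
sample = "header\r\nPREFACE.\r\nSome text.\r\n*** END ***\r\nlicense"
novel = slice_between(sample, "PREFACE.", "*** END ***")
print(repr(novel))
```

With the real download, the call would take `raw_text` and the two marker strings used above.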
# Identify noise in the text
text[:3000]
'PREFACE.\r\n\r\n\r\nThe event on which this fiction is founded has been supposed, by Dr.\r\nDarwin, and some of the physiological writers of Germany, as not of\r\nimpossible occurrence. I shall not be supposed as according the remotest\r\ndegree of serious faith to such an imagination; yet, in assuming it as\r\nthe basis of a work of fancy, I have not considered myself as merely\r\nweaving a series of supernatural terrors. The event on which the\r\ninterest of the story depends is exempt from the disadvantages of a mere\r\ntale of spectres or enchantment. It was recommended by the novelty of\r\nthe situations which it developes; and, however impossible as a physical\r\nfact, affords a point of view to the imagination for the delineating of\r\nhuman passions more comprehensive and commanding than any which the\r\nordinary relations of existing events can yield.\r\n\r\nI have thus endeavoured to preserve the truth of the elementary\r\nprinciples of human nature, while I have not scrupled to innovate\r\nupon their combinations. The _Iliad_, the tragic poetry of\r\nGreece,—Shakespeare, in the _Tempest_ and _Midsummer Night’s\r\nDream_,—and most especially Milton, in _Paradise Lost_, conform to this\r\nrule; and the most humble novelist, who seeks to confer or receive\r\namusement from his labours, may, without presumption, apply to prose\r\nfiction a licence, or rather a rule, from the adoption of which so many\r\nexquisite combinations of human feeling have resulted in the highest\r\nspecimens of poetry.\r\n\r\nThe circumstance on which my story rests was suggested in casual\r\nconversation. It was commenced, partly as a source of amusement, and\r\npartly as an expedient for exercising any untried resources of mind.\r\nOther motives were mingled with these, as the work proceeded. 
I am by no\r\nmeans indifferent to the manner in which whatever moral tendencies exist\r\nin the sentiments or characters it contains shall affect the reader; yet\r\nmy chief concern in this respect has been limited to the avoiding of the\r\nenervating effects of the novels of the present day, and to the\r\nexhibitions of the amiableness of domestic affection, and the excellence\r\nof universal virtue. The opinions which naturally spring from the\r\ncharacter and situation of the hero are by no means to be conceived as\r\nexisting always in my own conviction; nor is any inference justly to be\r\ndrawn from the following pages as prejudicing any philosophical doctrine\r\nof whatever kind.\r\n\r\nIt is a subject also of additional interest to the author, that this\r\nstory was begun in the majestic region where the scene is principally\r\nlaid, and in society which cannot cease to be regretted. I passed the\r\nsummer of 1816 in the environs of Geneva. The season was cold and rainy,\r\nand in the evenings we crowded around a blazing wood fire, and\r\noccasionally amused ourselves with some German stories of ghosts, which\r\nhappened to fall into our hands. These tales excited in us a playful\r\ndesire of imitation. Two other friends (a tale from the pen of one of\r\nwhom wou'
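Reading the raw string by eye works, but counting the characters can make the noise easier to spot. The following is a minimal sketch, assuming that "noise" means any character that is neither a letter, a digit, nor whitespace. The helper name `count_noise` is hypothetical, and the sample is a short excerpt from the output above.

```python
from collections import Counter


def count_noise(text):
    """Count characters that are neither letters, digits, nor whitespace."""
    return Counter(ch for ch in text if not ch.isalnum() and not ch.isspace())


# Short excerpt from the preface, including the line break
sample = "The _Iliad_, the tragic poetry of\r\nGreece,—Shakespeare, in the _Tempest_"
noise = count_noise(sample)
print(noise.most_common())
```

Running `count_noise(text)` on the full novel would give a list of candidate characters to include in the cleaning pattern.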
Cleaning text data
import re
def clean(text):
    # Replace punctuation and special characters with a space.
    # In a regex, the pipe | separates alternatives and the backslash \ escapes a
    # character that would otherwise have a special meaning (for example, \? matches a literal ?).
    # It is a good idea to look up what \ and | do before reading the pattern.
    text = re.sub(r'\.|,|:|;|!|\?|\(|\)|\||\+|\'|\"|‘|’|“|”|\'|\’|…|\-|_|–|—|\$|&|\*|>|<|\/|\[|\]', ' ', text)
    # Remove numbers and words that contain digits
    text = re.sub(r'\b\w*\d\w*\b', '', text)
    # Remove words of one or two characters
    text = re.sub(r'\b\w{1,2}\b', '', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # Convert to lowercase
    text = text.lower()
    return text
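To see what the backslash and the pipe do in the pattern, it can help to try them on tiny strings. The sketch below is only an illustration and the variable names are made up. It also shows a character class, which is an alternative way to write a list of single-character alternatives and is not what the function above uses.

```python
import re

# Unescaped, "." matches any character except a newline
any_char = re.sub(r'.', ' ', 'a.b')
# Escaped, "\." matches only a literal full stop
literal_dot = re.sub(r'\.', ' ', 'a.b')

# "|" separates alternatives: a full stop OR a comma OR a question mark
with_pipes = re.sub(r'\.|,|\?', ' ', 'Well, why? Because.')
# A character class matches the same single characters
with_class = re.sub(r'[.,?]', ' ', 'Well, why? Because.')

print(repr(any_char))
print(repr(literal_dot))
print(repr(with_pipes))
print(with_pipes == with_class)
```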
clean(text)[:3000]
'preface the event which this fiction founded has been supposed darwin and some the physiological writers germany not impossible occurrence shall not supposed according the remotest degree serious faith such imagination yet assuming the basis work fancy have not considered myself merely weaving series supernatural terrors the event which the interest the story depends exempt from the disadvantages mere tale spectres enchantment was recommended the novelty the situations which developes and however impossible physical fact affords point view the imagination for the delineating human passions more comprehensive and commanding than any which the ordinary relations existing events can yield have thus endeavoured preserve the truth the elementary principles human nature while have not scrupled innovate upon their combinations the iliad the tragic poetry greece shakespeare the tempest and midsummer night dream and most especially milton paradise lost conform this rule and the most humble novelist who seeks confer receive amusement from his labours may without presumption apply prose fiction licence rather rule from the adoption which many exquisite combinations human feeling have resulted the highest specimens poetry the circumstance which story rests was suggested casual conversation was commenced partly source amusement and partly expedient for exercising any untried resources mind other motives were mingled with these the work proceeded means indifferent the manner which whatever moral tendencies exist the sentiments characters contains shall affect the reader yet chief concern this respect has been limited the avoiding the enervating effects the novels the present day and the exhibitions the amiableness domestic affection and the excellence universal virtue the opinions which naturally spring from the character and situation the hero are means conceived existing always own conviction nor any inference justly drawn from the following pages prejudicing any 
philosophical doctrine whatever kind subject also additional interest the author that this story was begun the majestic region where the scene principally laid and society which cannot cease regretted passed the summer the environs geneva the season was cold and rainy and the evenings crowded around blazing wood fire and occasionally amused ourselves with some german stories ghosts which happened fall into our hands these tales excited playful desire imitation two other friends tale from the pen one whom would far more acceptable the public than any thing can ever hope produce and myself agreed write each story founded some supernatural occurrence the weather however suddenly became serene and two friends left journey among the alps and lost the magnificent scenes which they present all memory their ghostly visions the following tale the only one which has been completed frankenstein the modern prometheus letter mrs saville england petersburgh dec you will rejoice hear that disaster has accompanied '
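The output shows some side effects of the cleaning steps: the year 1816 is gone, "Night’s" has become "night", and short words such as "I", "of" and "in" have disappeared. The sketch below traces the same kinds of steps on one made-up sentence. It uses a shortened punctuation pattern for readability, so it is an approximation of `clean` and not a copy of it.

```python
import re

sample = "I passed the summer of 1816 in Geneva. Midsummer Night’s Dream!"

# Step 1 (simplified): replace a few punctuation marks with spaces
step1 = re.sub(r"[.!’]", " ", sample)
# Step 2: remove tokens that contain a digit
step2 = re.sub(r"\b\w*\d\w*\b", "", step1)
# Step 3: remove words of one or two characters (this also drops the stray "s")
step3 = re.sub(r"\b\w{1,2}\b", "", step2)
# Step 4: collapse whitespace, then lowercase
step4 = re.sub(r"\s+", " ", step3).lower()

for label, value in [("punctuation", step1), ("digits", step2),
                     ("short words", step3), ("final", step4)]:
    print(f"{label:12} {value!r}")
```

The final string keeps a leading and a trailing space, because removed words leave their surrounding whitespace behind. Calling `.strip()` on the result would be one way to remove them if that matters for later analysis.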