Content from Introduction



Overview

Questions

  • What is text mining?
  • What are stop words?
  • What is tokenisation?
  • What is tidytext?

Objectives

  • Explain what text mining is
  • Explain what stop words are
  • Explain what tokenisation is
  • Explain what tidytext is

What is text mining?


Text mining is the process of extracting meaningful information and knowledge from text. Text mining tools and methods allow the user to analyse large bodies of texts and to visualise the results.

By applying text mining principles to a body of text, you can gain insights that would otherwise be impossible to detect with the naked eye.

Before carrying out your analysis, the text must be transformed so that it can be read by a machine.

Stopwords


Text often contains words that hold no particular meaning on their own, for example "the", "and" and "of". These are called stop words and can be found throughout the text. Since stop words rarely contribute to the understanding of the text, it is a good idea to remove them before analysing the text.

Example of removing stopwords
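
As a minimal sketch of the idea, here is one way to remove stopwords from a vector of words using base R. Both the sentence and the stopword list are made up for illustration; later in the lesson we will do this properly with tidytext.

R

# A made-up sentence split into words, and a tiny illustrative stopword list
words <- c("the", "cat", "sat", "on", "the", "mat")
stopwords <- c("the", "on", "a")

# Keep only the words that are not in the stopword list
words[!words %in% stopwords]  # "cat" "sat" "mat"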

Tidytext and tokenisation


In the following, we will make the text machine-readable by applying the tidy text principles.

Tidy text

The tidy text principles were developed by Silge and Robinson (Text Mining with R, https://www.tidytextmining.com) and apply the principles of tidy data to text.

The tidy data framework principles are:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

Applying these principles to text data leads to a format that is easily manipulated, visualised and analysed using standard data science tools.

Tidy text represents the text by breaking it down into smaller parts such as sentences, words or letters. This process is called tokenisation.

Tokenisation at word level is largely language independent, as long as the language separates words with spaces.

Here is an example of tokenisation at word-level.

Example of tokenisation
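
As a minimal sketch, assuming the tidytext and tidyverse packages introduced in the next episode, word-level tokenisation looks like this. The sentence is made up for illustration.

R

library(tidyverse)
library(tidytext)

# A one-row table holding a made-up sentence
example <- tibble(line = 1, text = "The quick brown fox jumps")

# Split the text column into one word per row (lowercased by default)
example %>% 
  unnest_tokens(word, text)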

Key Points

  • Know what text mining is
  • Know what stop words are
  • Know what tokenisation is
  • Know what tidy text means

Content from Loading data



Overview

Questions

  • Which packages are needed?
  • How is the dataset loaded?
  • How is a dataset inspected?

Objectives

  • Knowledge of the relevant packages
  • Ability to load the dataset
  • Ability to inspect the dataset

Getting started


When performing text analysis in R, the built-in functions are not sufficient on their own. It is therefore necessary to install some additional packages. In this course we will be using the packages tidyverse, tidytext and tm.

R

install.packages("tidyverse")
install.packages("tidytext")
install.packages("tm")

library(tidyverse)
library(tidytext)
library(tm)

Documentation for each package

If you would like to know more about the individual packages, see their documentation:

  • tidyverse: https://www.tidyverse.org
  • tidytext: https://juliasilge.github.io/tidytext/
  • tm: https://cran.r-project.org/package=tm

Getting data


Begin by downloading the dataset called articles.csv and place it in the data/ folder. You can do this directly from R by copying the command below and running it in the console.

R

download.file("https://raw.githubusercontent.com/KUBDatalab/R-textmining_new/main/episodes/data/articles.csv", "data/articles.csv", mode = "wb")

After downloading the data, you need to load it into R's memory using the read_csv() function.

R

articles <- read_csv("data/articles.csv", na = c("NA", "NULL", ""))

Data description


The dataset contains newspaper articles from the Guardian newspaper. The harvested articles were published on the first inauguration day of each of the two presidents. Inclusion criteria were that an article contained the name of the relevant president, the word "inauguration", and a publication date matching the inauguration date.

The original dataset contained many variables that are irrelevant within the scope of this course. The following variables were kept:

  • id - a unique number identifying each article
  • president - the president mentioned in the article
  • text - the full text from the article
  • web_publication_date - the date of publication
  • pillar_name - the section in the newspaper

Taking a quick look at the data

The tidyverse-package has a number of functions that allow you to inspect the dataset. Below, you can see some of these functions and what they do.

R

head(articles)

OUTPUT

# A tibble: 6 × 5
     id president text                          web_publication_date pillar_name
  <dbl> <chr>     <chr>                         <dttm>               <chr>
1     1 obama     "Obama inauguration: We will… 2009-01-20 19:16:38  News
2     2 obama     "Obama from outer space Whet… 2009-01-20 22:00:00  Opinion
3     3 obama     "Obama inauguration: today's… 2009-01-20 10:17:27  News
4     4 obama     "Obama inauguration: Countdo… 2009-01-19 23:01:00  News
5     5 obama     "Inaugural address of Presid… 2009-01-20 16:07:44  News
6     6 obama     "Liveblogging the inaugurati… 2009-01-20 13:56:40  News       

R

tail(articles)

OUTPUT

# A tibble: 6 × 5
     id president text                          web_publication_date pillar_name
  <dbl> <chr>     <chr>                         <dttm>               <chr>
1   132 trump     Buy, George? World's largest… 2017-01-20 15:53:41  News
2   133 trump     Gove’s ‘snowflake’ tweet is … 2017-01-20 12:44:10  Opinion
3   134 trump     Monet, Renoir and a £44.2m M… 2017-01-20 04:00:22  News
4   135 trump     El Chapo is not a Robin Hood… 2017-01-20 17:09:54  News
5   136 trump     They call it fun, but the di… 2017-01-20 16:19:50  Opinion
6   137 trump     Totes annoying: words that s… 2017-01-20 12:00:06  News       

R

glimpse(articles)

OUTPUT

Rows: 137
Columns: 5
$ id                   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ president            <chr> "obama", "obama", "obama", "obama", "obama", "oba…
$ text                 <chr> "Obama inauguration: We will remake America, vows…
$ web_publication_date <dttm> 2009-01-20 19:16:38, 2009-01-20 22:00:00, 2009-0…
$ pillar_name          <chr> "News", "Opinion", "News", "News", "News", "News"…

R

names(articles)

OUTPUT

[1] "id"                   "president"            "text"
[4] "web_publication_date" "pillar_name"         

R

dim(articles)

OUTPUT

[1] 137   5

Key Points

  • Packages must be installed and loaded
  • The dataset needs to be loaded
  • The dataset can be inspected by means of different functions

Content from Tokenisation and stopwords



Overview

Questions

  • How is text prepared for analysis?

Objectives

  • Ability to tokenise a text
  • Ability to remove stopwords from text

Tokenisation


Since we are working with text mining, we focus on the text column, because this column contains the text of the articles in question.

To tokenise a column, we use the function unnest_tokens() from the tidytext-package. The function takes two arguments. The first one, word, names the new column that will hold the tokens and, because words are the default token, tells the function to split the text into individual words. The second argument, text, defines the column that we want to tokenise.

R

articles_tidy <- articles %>% 
  unnest_tokens(word, text)

Tokenisation

The result of the tokenisation is 118,269 rows. The reason is that the text-column has been replaced by a new column named word, which contains every word found in all of the articles, one word per row. The information from the remaining columns is kept. This makes it possible to determine which article each word belongs to.
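
You can confirm the number of rows yourself:

R

nrow(articles_tidy)  # 118269, matching the figure quoted above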

Stopwords


The next step is to remove stopwords. We have chosen to use the stopword list from the package tidytext. The list contains 1,149 words that are considered stopwords. Other lists are available, and they differ in terms of how many words they contain.

R

data(stop_words)
stop_words

OUTPUT

# A tibble: 1,149 × 2
   word        lexicon
   <chr>       <chr>
 1 a           SMART
 2 a's         SMART
 3 able        SMART
 4 about       SMART
 5 above       SMART
 6 according   SMART
 7 accordingly SMART
 8 across      SMART
 9 actually    SMART
10 after       SMART
# ℹ 1,139 more rows
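
To see which lists (lexicons) make up stop_words, and how many words each contributes, count by the lexicon column:

R

# stop_words combines three lexicons of different sizes
stop_words %>% 
  count(lexicon)  # SMART (571), onix (404) and snowball (174)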

Adding and removing stopwords

You may find yourself in need of adding words to, or removing words from, the stopwords list.

Here is how you add stopwords to and remove them from a predefined list.

First, create a tibble with the words you wish to add to the stop words list.

R

new_stop_words <- tibble(
  word = c("cat", "dog"),
  lexicon = "my_stopwords"
)

Then make a new stopwords tibble based on the original one, but with the new words added.

R

updated_stop_words <- stop_words %>%
  bind_rows(new_stop_words)

Run the following code to see that the added lexicon my_stopwords contains two words.

R

updated_stop_words %>% 
  count(lexicon)

OUTPUT

# A tibble: 4 × 2
  lexicon          n
  <chr>        <int>
1 SMART          571
2 my_stopwords     2
3 onix           404
4 snowball       174

Next, to remove words, create a vector with the word(s) you wish to remove from the stopwords list.

R

words_to_remove <- c("cat", "dog")

Then remove the rows containing the unwanted words.

R

updated_stop_words <- stop_words %>%
  filter(!word %in% words_to_remove)

Run the following code to see that the added lexicon my_stopwords no longer exists.

R

updated_stop_words %>% 
  count(lexicon)

OUTPUT

# A tibble: 3 × 2
  lexicon      n
  <chr>    <int>
1 SMART      571
2 onix       404
3 snowball   174

In order to remove stopwords from articles_tidy, we have to use the anti_join-function.

R

articles_anti_join <- articles_tidy %>% 
  anti_join(stop_words, by = "word")

The anti_join-function removes the stopwords from the original dataset. This is illustrated in the figure below. The only part left after anti-joining is the dark grey area to the left.

These words are saved as the object articles_anti_join.

Anti-join examples

Join and anti_join

There are multiple join-functions in R.
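
As a minimal sketch of the difference, using two made-up tibbles: anti_join() keeps the rows of the left-hand table that have no match in the right-hand table, whereas inner_join() keeps only the rows that do match. We will meet inner_join() again in the sentiment analysis episode.

R

# Made-up data to illustrate the two joins (tidyverse is already loaded)
tokens <- tibble(word = c("cat", "sat", "on", "mat"))
unwanted <- tibble(word = c("on", "the"))

tokens %>% anti_join(unwanted, by = "word")   # keeps cat, sat, mat
tokens %>% inner_join(unwanted, by = "word")  # keeps only on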

Key Points

  • Know how to prepare text for analysis

Content from Word frequency analysis



Overview

Questions

  • How is a frequency analysis conducted?

Objectives

  • Learn how to find frequent words
  • Learn how to analyse and visualise it

Frequency analysis


A word frequency analysis is relatively simple: it measures how often each word occurs in a text.

R

articles_anti_join %>% 
  count(word, sort = TRUE)

OUTPUT

# A tibble: 12,328 × 2
   word             n
   <chr>        <int>
 1 obama          513
 2 trump          479
 3 president      450
 4 people         337
 5 inauguration   249
 6 america        237
 7 world          212
 8 american       201
 9 time           189
10 day            188
# ℹ 12,318 more rows

The previous code chunk resulted in a list containing the most frequent words. The words are from articles about both presidents, and they are sorted based on frequency with the highest number on top.

A closer look at the list may reveal that some words are irrelevant. Given that the articles in the dataset are about the two presidents’ respective inaugurations, we consider the words below irrelevant for our analysis. Therefore, we make a new dataset without these words.

R

articles_filtered <- articles_anti_join %>%
  filter(!word %in% c("trump", "trump’s", "obama", "obama's", "inauguration", "president"))

articles_filtered %>% 
  count(word, sort = TRUE)

OUTPUT

# A tibble: 12,322 × 2
   word           n
   <chr>      <int>
 1 people       337
 2 america      237
 3 world        212
 4 american     201
 5 time         189
 6 day          188
 7 bush         186
 8 speech       183
 9 white        180
10 washington   150
# ℹ 12,312 more rows

The words deemed irrelevant are no longer on the list above.

Instead of a general list it may be more interesting to focus on the most frequent words belonging to articles about the two presidents respectively.

R

articles_filtered %>%
  count(president, word, sort = TRUE)

OUTPUT

# A tibble: 15,989 × 3
   president word         n
   <chr>     <chr>    <int>
 1 obama     bush       174
 2 obama     people     170
 3 trump     people     167
 4 obama     america    123
 5 obama     speech     121
 6 obama     world      120
 7 obama     time       119
 8 obama     american   116
 9 trump     america    114
10 trump     it’s       108
# ℹ 15,979 more rows

Keeping an overview of the words associated with each president can be a bit tricky. For instance, the word "people" is associated with both presidents. This is easy to see, as the two entries are right next to each other. The two occurrences of the word "america", however, are further apart, although this word is also associated with both presidents. A visualisation may solve this problem.

R

articles_filtered %>%
  count(president, word, sort = TRUE) %>% 
  group_by(president) %>%
  slice(1:10) %>% 
  ggplot(mapping = aes(x = n, y = word, colour = president, shape = president)) +
  geom_point() 

The plot above shows the top-ten words associated with Obama and Trump respectively. If a word features on both presidents’ top-ten lists, it only occurs once in the plot. This is why the plot doesn’t contain 20 words in total.

Another interesting aspect is a side-by-side comparison of how often each word is used in articles about each president. Here we use pivot_wider() to spread the counts into one column per president.

R

articles_filtered %>%
  count(president, word, sort = TRUE) %>% 
  pivot_wider(
    names_from = president,
    values_from = n
  )

OUTPUT

# A tibble: 12,322 × 3
   word     obama trump
   <chr>    <int> <int>
 1 bush       174    12
 2 people     170   167
 3 america    123   114
 4 speech     121    62
 5 world      120    92
 6 time       119    70
 7 american   116    85
 8 it’s        NA   108
 9 day        106    82
10 donald       1   106
# ℹ 12,312 more rows

R

articles_filtered %>%
  group_by(president) %>% 
  count(word, sort = TRUE) %>% 
  top_n(10) %>% 
  ungroup() %>% 
  mutate(word = reorder_within(word, n, president)) %>% 
  ggplot(aes(n, word, fill = president)) +
  geom_col() +
  facet_wrap(~president, scales = "free") +
  scale_y_reordered() + 
  labs(x = "word occurrences")

OUTPUT

Selecting by n

The analyses just made can easily be adjusted. For instance, if we want to look at the words by pillar_name instead of by president, we simply replace president with pillar_name in the code.

R

articles_filtered %>%
  count(pillar_name, word, sort = TRUE) %>% 
  group_by(pillar_name) %>%
  slice(1:10) %>% 
  ggplot(mapping = aes(x = n, y = word, colour = pillar_name, shape = pillar_name)) +
  geom_point() 

Key Points

  • Making a frequency analysis
  • Visualising the results

Content from Sentiment analysis



Overview

Questions

  • How is sentiment analysis conducted?

Objectives

  • Learn about different lexicons
  • Learn how to add sentiment to words
  • Analyse and visualise the sentiments in a text


Sentiment analysis


Sentiment refers to the emotion or tone in a text. It is typically categorised as positive, negative or neutral. Sentiment is often used to analyse opinions, attitudes or emotions in written content. In this case the written content is newspaper articles.

Sentiment analysis is a method used to identify and classify emotions in textual data. This is often done using word lists (lexicons). The goal is to determine whether a given text has a positive, negative or neutral tone.

From the previous section we have a dataset containing the words of the articles with stopwords removed. To do a sentiment analysis, we can use a so-called lexicon: a list of words with a sentiment assigned to each word. A simple form records whether each word is positive or negative.

There are multiple sentiment lexicons. For a start we will be using the bing lexicon, which categorises words as either positive or negative.

R

get_sentiments("bing")

OUTPUT

# A tibble: 6,786 × 2
   word        sentiment
   <chr>       <chr>
 1 2-faces     negative
 2 abnormal    negative
 3 abolish     negative
 4 abominable  negative
 5 abominably  negative
 6 abominate   negative
 7 abomination negative
 8 abort       negative
 9 aborted     negative
10 aborts      negative
# ℹ 6,776 more rows

In order to use the bing lexicon, we first save it as an object.

R

bing <- get_sentiments("bing")

We now need to attach a sentiment to each of the words from our articles. We do this by performing an inner_join.

R

articles_bing <- articles_filtered %>% 
  inner_join(bing)

OUTPUT

Joining with `by = join_by(word)`

R

articles_bing

OUTPUT

# A tibble: 6,159 × 6
      id president web_publication_date pillar_name word          sentiment
   <dbl> <chr>     <dttm>               <chr>       <chr>         <chr>
 1     1 obama     2009-01-20 19:16:38  News        promises      positive
 2     1 obama     2009-01-20 19:16:38  News        promise       positive
 3     1 obama     2009-01-20 19:16:38  News        dust          negative
 4     1 obama     2009-01-20 19:16:38  News        cold          negative
 5     1 obama     2009-01-20 19:16:38  News        dawn          positive
 6     1 obama     2009-01-20 19:16:38  News        celebrate     positive
 7     1 obama     2009-01-20 19:16:38  News        inspirational positive
 8     1 obama     2009-01-20 19:16:38  News        failed        negative
 9     1 obama     2009-01-20 19:16:38  News        resound       positive
10     1 obama     2009-01-20 19:16:38  News        attacks       negative
# ℹ 6,149 more rows

In R, inner_join() is commonly used to combine datasets based on a shared column, in this case the word column. Here, inner_join() matches the words in our text dataset, articles_filtered, with the words in the bing sentiment lexicon to determine whether they are positive or negative.

When we have the combined dataset, we can begin the sentiment analysis. A good start is to count the number of positive and negative words used in the articles, per president.

R

articles_bing %>% 
  group_by(president) %>% 
  summarise(positive = sum(sentiment == "positive"),
            negative = sum(sentiment == "negative"),
            difference = positive - negative) 

OUTPUT

# A tibble: 2 × 4
  president positive negative difference
  <chr>        <int>    <int>      <int>
1 obama         1499     1800       -301
2 trump         1160     1700       -540

This shows that more negative than positive words are associated with both presidents. It also shows that the articles about Trump have the largest surplus of negative words, with a difference of -540 against Obama's -301.
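
The raw counts partly reflect how much text there is about each president. As a sketch of one way to compare on a more equal footing, you could compute the share of sentiment-bearing words that are positive:

R

articles_bing %>% 
  group_by(president) %>% 
  summarise(positive_share = mean(sentiment == "positive"))
# roughly 0.45 for obama and 0.41 for trump, given the counts above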

Another interesting thing to look at would be the ten most frequent positive and negative words used in the articles.

R

articles_bing %>% 
  count(word, sentiment, sort = TRUE) %>% 
  ungroup() %>% 
  group_by(sentiment) %>% 
  slice_max(n, n = 10) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(mapping = aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y")

Here we can see the most frequent positive and negative words used in the articles.

With bing we only look at sentiment in a binary fashion: a word is either positive or negative. The AFINN lexicon instead assigns each word an integer score from -5 (very negative) to +5 (very positive), so a similar analysis looks different. AFINN is accessed through the textdata package, which we install first.

R

install.packages("textdata")


R

library(textdata)

R

afinn <- get_sentiments("afinn")

OUTPUT

Do you want to download:
 Name: AFINN-111
 URL: http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010
 License: Open Database License (ODbL) v1.0
 Size: 78 KB (cleaned 59 KB)
 Download mechanism: https 

The first time you run get_sentiments("afinn"), the textdata package asks for permission to download the AFINN-111 lexicon, as shown above. Answer 1 (Yes) in the console; the lexicon only needs to be downloaded once.

R

articles_afinn <- articles_filtered %>% 
  inner_join(afinn) 

With the AFINN scores attached to the words, we can sum the scores per president:

R

articles_afinn %>% 
  group_by(president) %>% 
  summarise(sentiment = sum(value))

We can also look at how the summed scores are distributed across the AFINN scale, per president:

R

articles_afinn %>% 
  group_by(president, value) %>% 
  summarise(sentiment = sum(value)) %>% 
  ungroup() %>% 
  ggplot(mapping = aes(x = value, y = sentiment, fill = president)) +
  geom_col(position = "dodge")

Finally, we can plot the words that contribute most at each sentiment value:

R

articles_afinn %>% 
  count(president, word, value, sort = TRUE) %>% 
  ungroup() %>% 
  group_by(president, value) %>% 
  slice_max(n, n = 3) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(mapping = aes(n, word, fill = president)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~value, scales = "free_y") +
  labs(x = "Contribution to sentiment", 
       y = NULL)


Key Points

  • There are different lexicons
  • It is possible to add sentiments to words
  • It is possible to visualise the sentiments