Cohen's Kappa
Last updated on 2025-07-01
Estimated time: 12 minutes
Overview
Questions
- How can agreement on classification be quantified?
- How is Cohen's \(\kappa\) calculated?
Objectives
- Learn to calculate Cohen's \(\kappa\)
- Become familiar with the use cases for Cohen's \(\kappa\)
- Know the limitations of Cohen's \(\kappa\)
What do we use it for?
We use Cohen's Kappa (or \(\kappa\)) to assess whether two sets of categorisations are done reliably.
Imagine a set of images. We ask two people to answer a simple question: is there a cat in the image? To what degree do the two agree on the presence or absence of cats?
That seems simple enough. However, if they both say there is a cat, this might indicate that they are both good at identifying cats and therefore agree. But they might just as well be very bad at recognizing cats, and only agree by random chance.
Where is it used?
Inter-Rater Reliability. That is the example from earlier. To what degree do two “raters” agree on their rating or classification of something? This is useful in subjects and research where subjective assessments are important, such as psychology, medicine and languages.
Intra-Rater Reliability. Here we test how consistent a single rater is. Are my ratings from a week ago consistent with the ratings I make today?
Comparison of diagnostic tests. It can be used to compare the results of two different diagnostic tests for some medical condition. Does the cheaper, less invasive test give the same results as the more expensive, invasive test?
Natural language processing (NLP). When we categorise textual data, e.g. evaluating the sentiment of a text or identifying named entities - how well do two methods or persons agree?
Educational research. Will two teachers give the same grade to the same papers turned in by students?
In the humanities we are concerned with intersubjective transferability - based on the available data, are the subjective evaluations of two different persons in agreement?
Limitations
Cohen's \(\kappa\) can be used directly for two nominal (no ordering of the classes) categorical variables. The main limitation is that Cohen's \(\kappa\) only applies to exactly two raters.
How do we calculate it?
The library vcd contains a function Kappa() that does the heavy lifting. It needs a so-called confusion matrix. Let us begin by constructing one.
R
library(tidyverse)
OUTPUT
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
R
diagnoses <- tribble(~rater1, ~rater2,
"dep", "dep",
"dep", "dep",
"dep", "dep",
"dep", "dep",
"dep", "dep",
"dep", "dep",
"dep", "dep",
"dep", "dis",
"dep", "sch",
"dep", "sch",
"dep", "neu",
"dep", "neu",
"dep", "neu",
"dis", "dis",
"dis", "dis",
"dis", "dis",
"dis", "dis",
"dis", "dis",
"dis", "dis",
"dis", "dis",
"dis", "dis",
"dis", "sch",
"dis", "sch",
"sch", "sch",
"sch", "sch",
"neu", "neu",
"oth", "oth",
"oth", "oth",
"oth", "oth",
"oth", "oth"
)
The object diagnoses now contains 30 observations.
Imagine two psychiatrists both evaluating 30 patients. They diagnose one
of these disorders: Depression (dep), Schizophrenia (sch), Dissociative
disorder (dis), Neurosis (neu) and Other (oth). To what degree do they
agree?
R
head(diagnoses)
OUTPUT
# A tibble: 6 × 2
rater1 rater2
<chr> <chr>
1 dep dep
2 dep dep
3 dep dep
4 dep dep
5 dep dep
6 dep dep
R
conf_table <- table(diagnoses)
conf_table
OUTPUT
rater2
rater1 dep dis neu oth sch
dep 7 1 3 0 2
dis 0 8 0 0 2
neu 0 0 1 0 0
oth 0 0 0 4 0
sch 0 0 0 0 2
This is the confusion matrix. Seven patients are diagnosed with depression by both raters (psychiatrists). But looking at schizophrenia, we see that two patients are diagnosed with schizophrenia by rater2, but with depression by rater1.
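Before we hand the table to Kappa(), it can be instructive to calculate the raw observed agreement ourselves: the diagonal of the confusion matrix divided by the total number of cases. A minimal sketch using the conf_table object from above:
R
# Observed agreement: cases where both raters give the same diagnosis,
# divided by the total number of cases
sum(diag(conf_table)) / sum(conf_table)
# 22 of 30 cases agree, i.e. about 0.73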
This is the table we need to provide to the Kappa() function:
R
library(vcd)
OUTPUT
Loading required package: grid
R
res.k <- Kappa(conf_table)
res.k
OUTPUT
value ASE z Pr(>|z|)
Unweighted 0.6507 0.0999 6.513 7.372e-11
Weighted 0.5588 0.1282 4.359 1.304e-05
For some use cases it does not really matter what kind of disagreement the two raters have. For others it does: the raters should of course agree, but classifying a Golden Retriever as a Labrador Retriever is not as serious a disagreement as mistaking it for a Chihuahua.
The Unweighted \(\kappa\) is calculated with the assumption that the different disagreements are of equal seriousness.
The Weighted \(\kappa\) assumes that some disagreements are worse than others.
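If the categories have a meaningful order, Kappa() lets us choose how disagreements are weighted through its weights argument. A small sketch of the syntax (for the purely nominal diagnoses above the weighting is not really meaningful, so treat this as an illustration only):
R
# "Equal-Spacing" (the default) uses linear weights;
# "Fleiss-Cohen" uses quadratic weights, punishing large
# disagreements between ordered categories harder
Kappa(conf_table, weights = "Fleiss-Cohen")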
How confident are we in \(\kappa\)?
The function confint() will give us the 95% confidence interval of the calculated \(\kappa\):
R
confint(res.k)
OUTPUT
Kappa lwr upr
Unweighted 0.4548491 0.8464610
Weighted 0.3075771 0.8100699
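These limits correspond to the normal approximation \(\kappa \pm z_{\alpha/2} \cdot ASE\), using the ASE column from the Kappa() output (more on this formula below). A quick check for the unweighted \(\kappa\), using the values printed above:
R
# 95% interval from the printed unweighted value and its ASE
0.6507 + c(-1, 1) * qnorm(0.975) * 0.0999
# roughly 0.455 to 0.847, matching confint()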
Interpretation
Cohen's \(\kappa\) takes into account the possibility that the two raters agree by chance. This makes it a more robust measure of agreement than simply looking at the percent agreement. It returns a value between -1 and 1, with 1 indicating perfect agreement, 0 no agreement beyond random chance, and negative values less agreement than we would expect by chance.
So. Is an unweighted \(\kappa\) of 0.6507 good or bad?
That depends on what we are actually classifying. Classifying psychiatric disorders is obviously more important than classifying dog breeds.
Fleiss et al. (2003) argue that the rules of thumb in this table are good indications:
| Value of \(\kappa\) | Level of agreement |
|---|---|
| > 0.75 | Excellent agreement beyond chance |
| 0.40 to 0.75 | Fair to good agreement beyond chance |
| < 0.40 | Poor agreement beyond chance |
Whereas McHugh (2012) presents this table:
| Value of \(\kappa\) | Level of agreement | % of data that are reliable |
|---|---|---|
| 0 - 0.20 | None | 0 - 4% |
| 0.21 - 0.39 | Minimal | 4 - 15% |
| 0.40 - 0.59 | Weak | 15 - 35% |
| 0.60 - 0.79 | Moderate | 35 - 63% |
| 0.80 - 0.90 | Strong | 64 - 81% |
| Above 0.90 | Almost Perfect | 82 - 100% |
The estimates of the % of data that are reliable depend on the distribution of the classifications, and the values in this table should be taken with a grain of salt.
Assumptions
Cohen's \(\kappa\) has the following requirements:
- Two raters (or coders). No more, no less.
- Nominal scale – there is no order to the classifications.
- Same set of items – both raters rate the exact same cases.
- Independence of ratings – the two ratings must not influence each other.
- Mutually exclusive categories – any given case can only get one classification.
The math
We define \(\kappa\) as:
\[\kappa = \frac{P_0 - P_e}{1-P_e}\] where \(P_0\) is the degree of agreement we observe, and \(P_e\) is the degree of agreement we would expect by random chance.
The null hypothesis declares that any agreement is random, and \(\kappa\) is 0. If \(\kappa\) is different from 0, the agreement is not random.
If we are looking at two raters, in this example doctors, determining whether a patient is sick (a simple yes/no question), we will get results like these:
R
## Doctor2
## Doctor1 Yes No Total
## Yes a b R1
## No c d R2
## Total C1 C2 N
\(P_0\), the observed proportion of agreement, is now easy to calculate. It is the sum of "a" and "d", divided by the number of patients:
\[P_0 = \frac{a + d}{N}\]
The random agreement \(P_e\) takes a bit more math:
- Doctor 1 answers “yes” in \(\frac{R_1}{N}\) cases.
- Doctor 2 answers “yes” in \(\frac{C_1}{N}\) cases.
Based on this, the probability that they both, randomly and by chance, answer "yes" is:
\[\frac{R_1}{N} \cdot \frac{C_1}{N}\]
And for the random agreement of “no”:
- Doctor 1 answers "no" in \(\frac{R_2}{N}\) cases.
- Doctor 2 answers "no" in \(\frac{C_2}{N}\) cases.
And the probability that they both, again randomly and by chance, answer "no" is:
\[\frac{R_2}{N} \cdot \frac{C_2}{N}\]
We then calculate the total probability that the two raters agree by chance by adding the two ways they can do that:
\[P_e = \frac{R_1}{N} \cdot \frac{C_1}{N} + \frac{R_2}{N} \cdot \frac{C_2}{N}\]
Now that we have both \(P_0\) and \(P_e\), we can calculate \(\kappa\), and confidence intervals as \(\kappa \pm Z_{\alpha/2} \cdot SE_\kappa\), based on the Z-distribution (if normality can be assumed, otherwise the t-distribution).
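To make this concrete, here is a minimal sketch that calculates the unweighted \(\kappa\) by hand from the confusion matrix we built earlier, with \(P_e\) generalised to more than two categories by summing the chance agreement over all of them:
R
N  <- sum(conf_table)                  # number of cases
P0 <- sum(diag(conf_table)) / N        # observed agreement

# Chance agreement: for each category, the product of the two raters'
# marginal proportions, summed over all categories
Pe <- sum((rowSums(conf_table) / N) * (colSums(conf_table) / N))

(P0 - Pe) / (1 - Pe)                   # approximately 0.65, as Kappa() reported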
If you have ever looked at how the math in a \(\chi^2\) test works, this might look familiar. That's because it is.
With two raters and two categories, the \(\kappa\) statistic is equivalent to the \(\chi^2\) statistic. If you want to delve deeper into that, you can look up this paper:
Feingold, Marcia (1992). “The Equivalence of Cohen’s Kappa and Pearson’s Chi-Square Statistics in the 2 × 2 Table.” Educational and Psychological Measurement 52(1): 57-61. https://hdl.handle.net/2027.42/67443
Alternatives
Cohen's \(\kappa\) only works (directly) for nominal values and two raters. The option of weighting the data gives us some possibilities for using it on ordinal values, but more direct alternatives exist.
- Weighted \(\kappa\) is used for ordinal values. We use the function kappa2() from the irr package (see the usage sketch below).
- Light's \(\kappa\) is used for studying inter-rater agreement between more than two raters. We use the function kappam.light() from the irr package.
- Fleiss' \(\kappa\) is also used when there are more than two raters, but it does not require the same raters for each subject.
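As a rough sketch of how the irr functions are called (assuming the irr package is installed; kappa2() takes a two-column data frame of ratings such as diagnoses, while kappam.light() and kappam.fleiss() expect one column per rater):
R
library(irr)

# Cohen's kappa via irr; weight = "equal" or "squared" gives weighted
# versions for ordinal ratings
kappa2(diagnoses, weight = "unweighted")

# With three or more raters (a data frame with one column per rater,
# here a hypothetical object called ratings):
# kappam.light(ratings)    # Light's kappa: same raters rate every subject
# kappam.fleiss(ratings)   # Fleiss' kappa: raters may differ between subjects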
Key Points
- With two raters, Cohen's \(\kappa\) can be used as a measure of inter-rater agreement on categorical, nominal classes.
- Other methods exist for ordinal classes and for more than two raters.