Table One
Last updated on 2024-10-22
Estimated time: 12 minutes
Overview
Questions
- How do you make a Table One?
Objectives
- Explain what a Table One is
- Know how to make a Table One and adjust key parameters
What is a “Table One”?
Primarily used in medical and epidemiological research, a Table One is typically the first table in any publication using data.
It presents the baseline characteristics of the participants in a study, and provides a concise overview of the relevant demographic and clinical variables.
It typically compares different groups (for example male vs. female, or treatment vs. control) to highlight similarities and differences.
It can look like this:
| | control | | case | | Overall | |
---|---|---|---|---|---|---|
| | no (N=298) | yes (N=48) | no (N=135) | yes (N=29) | no (N=433) | yes (N=77) |
Age (years) | ||||||
Mean (SD) | 61.3 (4.75) | 58.9 (5.68) | 61.5 (4.85) | 58.1 (5.32) | 61.4 (4.78) | 58.6 (5.53) |
Median [Min, Max] | 62.0 [46.0, 69.0] | 59.0 [46.0, 68.0] | 62.0 [45.0, 69.0] | 58.0 [49.0, 68.0] | 62.0 [45.0, 69.0] | 58.0 [46.0, 68.0] |
estradol (pg/mL) | ||||||
Mean (SD) | 8.05 (5.29) | 8.73 (8.84) | 10.5 (9.72) | 10.6 (13.7) | 8.81 (7.06) | 9.44 (10.9) |
Median [Min, Max] | 6.00 [2.00, 46.0] | 6.50 [2.00, 57.0] | 8.00 [3.00, 85.0] | 6.00 [3.00, 76.0] | 7.00 [2.00, 85.0] | 6.00 [2.00, 76.0] |
estrone (pg/mL) | ||||||
Mean (SD) | 28.7 (15.0) | 26.8 (12.0) | 32.3 (15.7) | 27.7 (13.2) | 29.8 (15.3) | 27.1 (12.3) |
Median [Min, Max] | 25.0 [10.0, 131] | 23.0 [13.0, 65.0] | 29.0 [11.0, 119] | 24.0 [12.0, 59.0] | 26.0 [10.0, 131] | 23.0 [12.0, 65.0] |
Missing | 58 (19.5%) | 15 (31.3%) | 30 (22.2%) | 11 (37.9%) | 88 (20.3%) | 26 (33.8%) |
testost | ||||||
Mean (SD) | 25.3 (13.2) | 22.2 (10.7) | 27.6 (16.1) | 28.2 (15.6) | 26.0 (14.2) | 24.4 (13.0) |
Median [Min, Max] | 23.0 [4.00, 111] | 21.5 [8.00, 63.0] | 25.0 [6.00, 144] | 24.0 [10.0, 69.0] | 23.0 [4.00, 144] | 22.0 [8.00, 69.0] |
Missing | 6 (2.0%) | 2 (4.2%) | 3 (2.2%) | 1 (3.4%) | 9 (2.1%) | 3 (3.9%) |
prolactn | ||||||
Mean (SD) | 9.60 (5.10) | 13.7 (12.3) | 10.8 (6.79) | 9.57 (3.29) | 9.99 (5.70) | 12.2 (10.1) |
Median [Min, Max] | 8.16 [1.96, 37.3] | 8.81 [3.87, 55.8] | 9.30 [2.66, 59.9] | 8.88 [4.49, 17.6] | 8.64 [1.96, 59.9] | 8.84 [3.87, 55.8] |
Missing | 14 (4.7%) | 0 (0%) | 6 (4.4%) | 1 (3.4%) | 20 (4.6%) | 1 (1.3%) |
Please note that the automatic styling of this site makes the Table One look less polished than it would in a publication.
We have 510 participants in a study, split into control and case groups, and further subdivided into two groups based on postmenopausal hormone use. The table describes the distribution of age and of the concentrations of estradiol, estrone, testosterone and prolactin in a blood sample.
A number of packages that make it easy to create a Table One exist. Here we look at the package table1.
The specific way of doing it depends on the data available. If we do not have data on the weight of the participants, we are not able to describe the distribution of their weight.
Let us begin by looking at the data. We first load the two packages tidyverse and table1. We then read in the data from the csv-file “BLOOD.csv”, which we have downloaded from this link.
R
library(tidyverse)
library(table1)
blood <- read_csv("data/BLOOD.csv")
head(blood)
OUTPUT
# A tibble: 6 × 9
ID matchid case curpmh ageblood estradol estrone testost prolactn
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100013 164594 0 1 46 57 65 25 11.1
2 100241 107261 0 0 65 11 26 999 2.8
3 100696 110294 0 1 66 3 999 8 38
4 101266 101266 1 0 57 4 18 6 8.9
5 101600 101600 1 0 66 6 18 25 6.9
6 102228 155717 0 1 57 10 999 31 13.9
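The data set has 510 rows. It is a case-control study, where ID represents one individual, and matchid gives us the link between cases and controls. The variable case tells us whether the individual is a case (1) or a control (0), and curpmh whether the individual currently uses postmenopausal hormones (1) or not (0). Ageblood is the age of the individual at the time the blood sample was drawn, and we then have the levels of four different hormones.
The data contains missing values, coded as 999.0 for estrone and testost, and as 99.99 for prolactn.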
Let us fix that:
R
blood <- blood %>%
  mutate(estrone = na_if(estrone, 999.0)) %>%
  mutate(testost = na_if(testost, 999.0)) %>%
  mutate(prolactn = na_if(prolactn, 99.99))
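To see what this recoding does, here is a minimal sketch on a made-up vector. The values are hypothetical and not taken from BLOOD.csv, and the sketch uses base R indexing, which for this purpose should give the same result as na_if():

```r
# Hypothetical hormone values where 999 is the code for "missing"
x <- c(65, 26, 999, 18, 999)

# Replace the sentinel value with NA (a base R counterpart to na_if(x, 999))
x[x == 999] <- NA

# Count the missing values and compute a mean that ignores them
n_missing <- sum(is.na(x))
mean_observed <- mean(x, na.rm = TRUE)

print(n_missing)
print(mean_observed)
```

Without the recoding, the 999 values would be treated as real measurements and would inflate the mean.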
We then ensure that categorical values are stored as categorical values, and adjust the labels of those categorical values:
R
blood <- blood %>%
  mutate(case = factor(case, labels = c("control", "case"))) %>%
  mutate(curpmh = factor(curpmh, labels = c("no", "yes")))
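It can be useful to check how factor() pairs the numeric codes with the labels. The following is a small sketch on a made-up vector of 0/1 codes (hypothetical values, not the real data):

```r
# Hypothetical 0/1 codes like those in the case column
case_raw <- c(0, 0, 1, 0, 1)

# factor() sorts the unique values (0, 1) and assigns the labels in that order,
# so 0 becomes "control" and 1 becomes "case"
case_fct <- factor(case_raw, labels = c("control", "case"))

print(levels(case_fct))
print(table(case_fct))
```

Because the labels are assigned in the sorted order of the codes, swapping the two labels would silently swap cases and controls.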
And now we can make our Table One like this:
R
table1(~ageblood + estradol + estrone + testost + prolactn|case + curpmh, data = blood)
| | control | | case | | Overall | |
---|---|---|---|---|---|---|
| | no (N=298) | yes (N=48) | no (N=135) | yes (N=29) | no (N=433) | yes (N=77) |
ageblood | ||||||
Mean (SD) | 61.3 (4.75) | 58.9 (5.68) | 61.5 (4.85) | 58.1 (5.32) | 61.4 (4.78) | 58.6 (5.53) |
Median [Min, Max] | 62.0 [46.0, 69.0] | 59.0 [46.0, 68.0] | 62.0 [45.0, 69.0] | 58.0 [49.0, 68.0] | 62.0 [45.0, 69.0] | 58.0 [46.0, 68.0] |
estradol | ||||||
Mean (SD) | 8.05 (5.29) | 8.73 (8.84) | 10.5 (9.72) | 10.6 (13.7) | 8.81 (7.06) | 9.44 (10.9) |
Median [Min, Max] | 6.00 [2.00, 46.0] | 6.50 [2.00, 57.0] | 8.00 [3.00, 85.0] | 6.00 [3.00, 76.0] | 7.00 [2.00, 85.0] | 6.00 [2.00, 76.0] |
estrone | ||||||
Mean (SD) | 28.7 (15.0) | 26.8 (12.0) | 32.3 (15.7) | 27.7 (13.2) | 29.8 (15.3) | 27.1 (12.3) |
Median [Min, Max] | 25.0 [10.0, 131] | 23.0 [13.0, 65.0] | 29.0 [11.0, 119] | 24.0 [12.0, 59.0] | 26.0 [10.0, 131] | 23.0 [12.0, 65.0] |
Missing | 58 (19.5%) | 15 (31.3%) | 30 (22.2%) | 11 (37.9%) | 88 (20.3%) | 26 (33.8%) |
testost | ||||||
Mean (SD) | 25.3 (13.2) | 22.2 (10.7) | 27.6 (16.1) | 28.2 (15.6) | 26.0 (14.2) | 24.4 (13.0) |
Median [Min, Max] | 23.0 [4.00, 111] | 21.5 [8.00, 63.0] | 25.0 [6.00, 144] | 24.0 [10.0, 69.0] | 23.0 [4.00, 144] | 22.0 [8.00, 69.0] |
Missing | 6 (2.0%) | 2 (4.2%) | 3 (2.2%) | 1 (3.4%) | 9 (2.1%) | 3 (3.9%) |
prolactn | ||||||
Mean (SD) | 9.60 (5.10) | 13.7 (12.3) | 10.8 (6.79) | 9.57 (3.29) | 9.99 (5.70) | 12.2 (10.1) |
Median [Min, Max] | 8.16 [1.96, 37.3] | 8.81 [3.87, 55.8] | 9.30 [2.66, 59.9] | 8.88 [4.49, 17.6] | 8.64 [1.96, 59.9] | 8.84 [3.87, 55.8] |
Missing | 14 (4.7%) | 0 (0%) | 6 (4.4%) | 1 (3.4%) | 20 (4.6%) | 1 (1.3%) |
It is a good idea, and increases readability, to add labels and units to the variables. The table1 package provides functions for that:
R
label(blood$curpmh) <- "current_pmh"
label(blood$case) <- "case_control"
label(blood$ageblood) <- "Age"
units(blood$ageblood) <- "years"
units(blood$estradol) <- "pg/mL"
units(blood$estrone) <- "pg/mL"
Which looks a bit nicer:
R
table1(~ageblood + estradol + estrone + testost + prolactn|case + curpmh, data = blood)
| | control | | case | | Overall | |
---|---|---|---|---|---|---|
| | no (N=298) | yes (N=48) | no (N=135) | yes (N=29) | no (N=433) | yes (N=77) |
Age (years) | ||||||
Mean (SD) | 61.3 (4.75) | 58.9 (5.68) | 61.5 (4.85) | 58.1 (5.32) | 61.4 (4.78) | 58.6 (5.53) |
Median [Min, Max] | 62.0 [46.0, 69.0] | 59.0 [46.0, 68.0] | 62.0 [45.0, 69.0] | 58.0 [49.0, 68.0] | 62.0 [45.0, 69.0] | 58.0 [46.0, 68.0] |
estradol (pg/mL) | ||||||
Mean (SD) | 8.05 (5.29) | 8.73 (8.84) | 10.5 (9.72) | 10.6 (13.7) | 8.81 (7.06) | 9.44 (10.9) |
Median [Min, Max] | 6.00 [2.00, 46.0] | 6.50 [2.00, 57.0] | 8.00 [3.00, 85.0] | 6.00 [3.00, 76.0] | 7.00 [2.00, 85.0] | 6.00 [2.00, 76.0] |
estrone (pg/mL) | ||||||
Mean (SD) | 28.7 (15.0) | 26.8 (12.0) | 32.3 (15.7) | 27.7 (13.2) | 29.8 (15.3) | 27.1 (12.3) |
Median [Min, Max] | 25.0 [10.0, 131] | 23.0 [13.0, 65.0] | 29.0 [11.0, 119] | 24.0 [12.0, 59.0] | 26.0 [10.0, 131] | 23.0 [12.0, 65.0] |
Missing | 58 (19.5%) | 15 (31.3%) | 30 (22.2%) | 11 (37.9%) | 88 (20.3%) | 26 (33.8%) |
testost | ||||||
Mean (SD) | 25.3 (13.2) | 22.2 (10.7) | 27.6 (16.1) | 28.2 (15.6) | 26.0 (14.2) | 24.4 (13.0) |
Median [Min, Max] | 23.0 [4.00, 111] | 21.5 [8.00, 63.0] | 25.0 [6.00, 144] | 24.0 [10.0, 69.0] | 23.0 [4.00, 144] | 22.0 [8.00, 69.0] |
Missing | 6 (2.0%) | 2 (4.2%) | 3 (2.2%) | 1 (3.4%) | 9 (2.1%) | 3 (3.9%) |
prolactn | ||||||
Mean (SD) | 9.60 (5.10) | 13.7 (12.3) | 10.8 (6.79) | 9.57 (3.29) | 9.99 (5.70) | 12.2 (10.1) |
Median [Min, Max] | 8.16 [1.96, 37.3] | 8.81 [3.87, 55.8] | 9.30 [2.66, 59.9] | 8.88 [4.49, 17.6] | 8.64 [1.96, 59.9] | 8.84 [3.87, 55.8] |
Missing | 14 (4.7%) | 0 (0%) | 6 (4.4%) | 1 (3.4%) | 20 (4.6%) | 1 (1.3%) |
Structuring the data
Most things in R are simple to do (but rarely simple to understand) when the data has the correct structure.
If we follow the general rules of thumb for tidy data, we are off to a good start. This is the structure of the data set we are working with here - after we have made some modifications as described above.
R
head(blood)
OUTPUT
# A tibble: 6 × 9
ID matchid case curpmh ageblood estradol estrone testost prolactn
<dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100013 164594 control yes 46 57 65 25 11.1
2 100241 107261 control no 65 11 26 NA 2.8
3 100696 110294 control yes 66 3 NA 8 38
4 101266 101266 case no 57 4 18 6 8.9
5 101600 101600 case no 66 6 18 25 6.9
6 102228 155717 control yes 57 10 NA 31 13.9
The important thing to note is that when we stratify the summary statistics by some variable, this variable has to be a categorical variable. The variables we want summary statistics for also have to have the correct type. If the values are categorical, the column in the data frame has to actually be categorical (a factor). If they are numeric, the data type has to be numeric.
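One way to check the types before building the table is sketched below. The data frame here is a made-up stand-in with the same kinds of columns as blood, not the real data:

```r
# Toy data frame with the same kinds of columns as blood (values are made up)
toy <- data.frame(
  case     = factor(c("control", "case", "control")),
  ageblood = c(46, 57, 66)
)

# One logical value per column: is it a factor, and is it numeric?
is_categorical <- sapply(toy, is.factor)
is_number      <- sapply(toy, is.numeric)

print(is_categorical)
print(is_number)
```

A column that shows up as neither a factor nor numeric (for example a character column) is a hint that it should be converted before it is used in the table.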
More advanced stuff
We might want to be able to precisely control the summary statistics presented in the table.
We can do that by specifying input to the arguments render.continuous and render.categorical, which control how continuous and categorical data, respectively, are shown in the table.
The simple way of doing that is by using abbreviated function names:
R
table1(~ageblood + estradol + estrone + testost + prolactn|case + curpmh, data = blood,
       render.continuous=c(.="Mean (SD%)", .="Median [Min, Max]",
                           "Geom. mean (Geo. SD%)"="GMEAN (GSD%)"))
| | control | | case | | Overall | |
---|---|---|---|---|---|---|
| | no (N=298) | yes (N=48) | no (N=135) | yes (N=29) | no (N=433) | yes (N=77) |
Age (years) | ||||||
Mean (SD%) | 61.3 (4.75%) | 58.9 (5.68%) | 61.5 (4.85%) | 58.1 (5.32%) | 61.4 (4.78%) | 58.6 (5.53%) |
Median [Min, Max] | 62.0 [46.0, 69.0] | 59.0 [46.0, 68.0] | 62.0 [45.0, 69.0] | 58.0 [49.0, 68.0] | 62.0 [45.0, 69.0] | 58.0 [46.0, 68.0] |
Geom. mean (Geo. SD%) | 61.1 (1.08%) | 58.7 (1.10%) | 61.3 (1.08%) | 57.9 (1.10%) | 61.2 (1.08%) | 58.4 (1.10%) |
estradol (pg/mL) | ||||||
Mean (SD%) | 8.05 (5.29%) | 8.73 (8.84%) | 10.5 (9.72%) | 10.6 (13.7%) | 8.81 (7.06%) | 9.44 (10.9%) |
Median [Min, Max] | 6.00 [2.00, 46.0] | 6.50 [2.00, 57.0] | 8.00 [3.00, 85.0] | 6.00 [3.00, 76.0] | 7.00 [2.00, 85.0] | 6.00 [2.00, 76.0] |
Geom. mean (Geo. SD%) | 6.92 (1.69%) | 6.81 (1.90%) | 8.53 (1.78%) | 7.63 (2.03%) | 7.39 (1.74%) | 7.11 (1.94%) |
estrone (pg/mL) | ||||||
Mean (SD%) | 28.7 (15.0%) | 26.8 (12.0%) | 32.3 (15.7%) | 27.7 (13.2%) | 29.8 (15.3%) | 27.1 (12.3%) |
Median [Min, Max] | 25.0 [10.0, 131] | 23.0 [13.0, 65.0] | 29.0 [11.0, 119] | 24.0 [12.0, 59.0] | 26.0 [10.0, 131] | 23.0 [12.0, 65.0] |
Geom. mean (Geo. SD%) | 25.9 (1.56%) | 24.6 (1.50%) | 29.4 (1.54%) | 25.0 (1.59%) | 26.9 (1.56%) | 24.7 (1.53%) |
Missing | 58 (19.5%) | 15 (31.3%) | 30 (22.2%) | 11 (37.9%) | 88 (20.3%) | 26 (33.8%) |
testost | ||||||
Mean (SD%) | 25.3 (13.2%) | 22.2 (10.7%) | 27.6 (16.1%) | 28.2 (15.6%) | 26.0 (14.2%) | 24.4 (13.0%) |
Median [Min, Max] | 23.0 [4.00, 111] | 21.5 [8.00, 63.0] | 25.0 [6.00, 144] | 24.0 [10.0, 69.0] | 23.0 [4.00, 144] | 22.0 [8.00, 69.0] |
Geom. mean (Geo. SD%) | 22.4 (1.65%) | 20.0 (1.58%) | 24.6 (1.60%) | 24.6 (1.69%) | 23.1 (1.64%) | 21.6 (1.63%) |
Missing | 6 (2.0%) | 2 (4.2%) | 3 (2.2%) | 1 (3.4%) | 9 (2.1%) | 3 (3.9%) |
prolactn | ||||||
Mean (SD%) | 9.60 (5.10%) | 13.7 (12.3%) | 10.8 (6.79%) | 9.57 (3.29%) | 9.99 (5.70%) | 12.2 (10.1%) |
Median [Min, Max] | 8.16 [1.96, 37.3] | 8.81 [3.87, 55.8] | 9.30 [2.66, 59.9] | 8.88 [4.49, 17.6] | 8.64 [1.96, 59.9] | 8.84 [3.87, 55.8] |
Geom. mean (Geo. SD%) | 8.59 (1.58%) | 10.7 (1.89%) | 9.63 (1.58%) | 9.05 (1.41%) | 8.90 (1.59%) | 10.1 (1.73%) |
Missing | 14 (4.7%) | 0 (0%) | 6 (4.4%) | 1 (3.4%) | 20 (4.6%) | 1 (1.3%) |
table1 recognizes the following summary statistics: N, NMISS, MEAN, SD, CV, GMEAN, GSD, GCV, MEDIAN, MIN, MAX, IQR, Q1, Q2, Q3, T1, T2, FREQ, PCT
Details can be found in the help for the function stats.default().
Note that they are case-insensitive, and we can write Median or mediAn instead of median.
Also note that we write .="Mean (SD%)". Mean and SD are recognized as the statistics calculated by mean() and sd(), and the dot in front tells table1 to use the text itself, “Mean (SD%)”, as the label.
If we want to specify a different label, we write it as the name of the element: "Geom. mean (Geo. SD%)"="GMEAN (GSD%)"
Change the labels
We have two unusual values in this table: the geometric mean and the geometric standard deviation. Change the code to write out “Geom.” and “Geo.” as “Geometric”.
R
table1(~ageblood + estradol + estrone + testost + prolactn|case + curpmh, data = blood,
       render.continuous=c(.="Mean (SD%)", .="Median [Min, Max]",
                           "Geometric mean (Geometric SD%)"="GMEAN (GSD%)"))
The geometric mean of two numbers is the square root of the product of the two numbers. If we have three numbers, we take the cube root of the product. In general:
\[\left( \prod_{i=1}^{n} x_i \right)^{\frac{1}{n}}\]
The geometric standard deviation is defined by: \[ \exp\left(\sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log x_i - \frac{1}{n} \sum_{j=1}^{n} \log x_j \right)^2}\right)\]
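The two formulas can be tried out by hand in base R. This is a minimal sketch on two made-up numbers, following the formulas exactly as written above (dividing by n):

```r
# Two numbers whose geometric mean is easy to verify by hand
x <- c(2, 8)

# Geometric mean: the n-th root of the product
geo_mean <- prod(x)^(1 / length(x))

# Geometric SD following the formula above (dividing by n, not n - 1)
log_x  <- log(x)
geo_sd <- exp(sqrt(mean((log_x - mean(log_x))^2)))

print(geo_mean)  # 4, the square root of 2 * 8
print(geo_sd)
```

Software often divides by n - 1 rather than n when calculating the standard deviation of the logarithms, so the value reported by a package may differ somewhat from the one calculated here.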
Very advanced stuff
If we want to specify the summary statistics very precisely, we have to define a function ourselves:
R
my_summary <- function(x){
  c("","Median" = sprintf("%.3f", median(x, na.rm = TRUE)),
    "Variance" = sprintf("%.1f", var(x, na.rm=TRUE)))
}
table1(~ageblood + estradol + estrone + testost + prolactn|case + curpmh, data = blood,
       render.continuous = my_summary)
| | control | | case | | Overall | |
---|---|---|---|---|---|---|
| | no (N=298) | yes (N=48) | no (N=135) | yes (N=29) | no (N=433) | yes (N=77) |
Age (years) | ||||||
Median | 62.000 | 59.000 | 62.000 | 58.000 | 62.000 | 58.000 |
Variance | 22.6 | 32.3 | 23.5 | 28.3 | 22.8 | 30.6 |
estradol (pg/mL) | ||||||
Median | 6.000 | 6.500 | 8.000 | 6.000 | 7.000 | 6.000 |
Variance | 28.0 | 78.1 | 94.5 | 186.8 | 49.8 | 118.0 |
estrone (pg/mL) | ||||||
Median | 25.000 | 23.000 | 29.000 | 24.000 | 26.000 | 23.000 |
Variance | 223.8 | 143.8 | 246.4 | 174.4 | 232.8 | 151.5 |
Missing | 58 (19.5%) | 15 (31.3%) | 30 (22.2%) | 11 (37.9%) | 88 (20.3%) | 26 (33.8%) |
testost | ||||||
Median | 23.000 | 21.500 | 25.000 | 24.000 | 23.000 | 22.000 |
Variance | 173.6 | 115.0 | 257.7 | 241.9 | 200.4 | 169.0 |
Missing | 6 (2.0%) | 2 (4.2%) | 3 (2.2%) | 1 (3.4%) | 9 (2.1%) | 3 (3.9%) |
prolactn | ||||||
Median | 8.155 | 8.805 | 9.300 | 8.880 | 8.640 | 8.835 |
Variance | 26.1 | 151.3 | 46.1 | 10.8 | 32.5 | 102.8 |
Missing | 14 (4.7%) | 0 (0%) | 6 (4.4%) | 1 (3.4%) | 20 (4.6%) | 1 (1.3%) |
We do not need to use the sprintf() function, but it is a very neat way of combining text with numeric variables because it allows us to format them directly.
Summary statistics for categorical data can be adjusted similarly, by specifying render.categorical.
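To understand what a function like my_summary hands over to table1, it can help to call it directly on a small vector. The sketch below repeats the function so that it runs on its own, and the input values are made up:

```r
# Same function as above, repeated so the example runs on its own
my_summary <- function(x){
  c("",
    "Median"   = sprintf("%.3f", median(x, na.rm = TRUE)),
    "Variance" = sprintf("%.1f", var(x, na.rm = TRUE)))
}

# Hypothetical measurements with one missing value
result <- my_summary(c(1, 2, 3, 4, NA))

print(result)
```

The result is a named character vector: the names become the row labels in the table, and the values are the already formatted text shown in the cells.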
What does %.3f actually do?
Can you guess what the formatting in sprintf does?
Try to change “%.3f” in the function to “%.2f”.
R
my_summary <- function(x){
  c("","Median" = sprintf("%.3f", median(x, na.rm = TRUE)),
    "Variance" = sprintf("%.1f", var(x, na.rm=TRUE)))
}
sprintf uses a somewhat arcane way of specifying how numbers should be formatted when we combine them with text. The “%”-sign specifies that “this is where we place the number in the text”. “.3f” specifies that we are treating the number as a floating point number (which is just a fancy way of saying that it is a decimal number), and that we would like three digits after the decimal point.
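A few direct calls show the effect of the format string. The numbers used here are arbitrary examples:

```r
# A hypothetical value to format
value <- 3.14159

three_digits <- sprintf("%.3f", value)  # three digits after the decimal point
two_digits   <- sprintf("%.2f", value)  # two digits after the decimal point

# The placeholder can sit anywhere inside a longer piece of text
sentence <- sprintf("Median: %.1f years", 62)

print(three_digits)
print(two_digits)
print(sentence)
```

Note that sprintf() returns text, not numbers, which is what we want when the values are going into a table.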
What's up with that blank line?
Note that in the function, we define a vector as output, with three elements:
R
my_summary <- function(x){
  c("",
    "Median" = sprintf("%.3f", median(x, na.rm = TRUE)),
    "Variance" = sprintf("%.1f", var(x, na.rm=TRUE)))
}
Calculating and formatting the median and the variance is pretty straightforward.
But the first element is an empty string. What's up with that?
Try to remove the empty string from the function, and use it in a Table One as previously shown:
R
my_summary <- function(x){
  c("Median" = sprintf("%.3f", median(x, na.rm = TRUE)),
    "Variance" = sprintf("%.1f", var(x, na.rm=TRUE)))
}
table1(~ageblood + estradol + estrone + testost + prolactn|case + curpmh, data = blood,
       render.continuous = my_summary)
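The line beginning with “Median” does not show up. Instead, the median value is shown next to the variable names, for example on the “Age (years)” line. The first element of the vector is placed on the same line as the variable name, which is why we start the vector with an empty string.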
Primarily of use if there are medical students on the course
Key Points
- A Table One provides a compact description of the data we are working with
- With a little bit of work we can control the content of the table.