Table One

Last updated on 2024-10-22

Estimated time: 12 minutes

Overview

Questions

  • How do you make a Table One?

Objectives

  • Explain what a Table One is
  • Know how to make a Table One and adjust key parameters

What is a “Table One”?


Primarily used in medical and epidemiological research, a Table One is typically the first table in any publication using data.

It presents the baseline characteristics of the participants in a study, and provides a concise overview of the relevant demographic and clinical variables.

It typically compares different groups (for example male vs. female, or treatment vs. control) to highlight similarities and differences.

It can look like this:

                      control                               case                                  Overall
                      no (N=298)         yes (N=48)         no (N=135)         yes (N=29)         no (N=433)         yes (N=77)
Age (years)
  Mean (SD)           61.3 (4.75)        58.9 (5.68)        61.5 (4.85)        58.1 (5.32)        61.4 (4.78)        58.6 (5.53)
  Median [Min, Max]   62.0 [46.0, 69.0]  59.0 [46.0, 68.0]  62.0 [45.0, 69.0]  58.0 [49.0, 68.0]  62.0 [45.0, 69.0]  58.0 [46.0, 68.0]
estradol (pg/mL)
  Mean (SD)           8.05 (5.29)        8.73 (8.84)        10.5 (9.72)        10.6 (13.7)        8.81 (7.06)        9.44 (10.9)
  Median [Min, Max]   6.00 [2.00, 46.0]  6.50 [2.00, 57.0]  8.00 [3.00, 85.0]  6.00 [3.00, 76.0]  7.00 [2.00, 85.0]  6.00 [2.00, 76.0]
estrone (pg/mL)
  Mean (SD)           28.7 (15.0)        26.8 (12.0)        32.3 (15.7)        27.7 (13.2)        29.8 (15.3)        27.1 (12.3)
  Median [Min, Max]   25.0 [10.0, 131]   23.0 [13.0, 65.0]  29.0 [11.0, 119]   24.0 [12.0, 59.0]  26.0 [10.0, 131]   23.0 [12.0, 65.0]
  Missing             58 (19.5%)         15 (31.3%)         30 (22.2%)         11 (37.9%)         88 (20.3%)         26 (33.8%)
testost
  Mean (SD)           25.3 (13.2)        22.2 (10.7)        27.6 (16.1)        28.2 (15.6)        26.0 (14.2)        24.4 (13.0)
  Median [Min, Max]   23.0 [4.00, 111]   21.5 [8.00, 63.0]  25.0 [6.00, 144]   24.0 [10.0, 69.0]  23.0 [4.00, 144]   22.0 [8.00, 69.0]
  Missing             6 (2.0%)           2 (4.2%)           3 (2.2%)           1 (3.4%)           9 (2.1%)           3 (3.9%)
prolactn
  Mean (SD)           9.60 (5.10)        13.7 (12.3)        10.8 (6.79)        9.57 (3.29)        9.99 (5.70)        12.2 (10.1)
  Median [Min, Max]   8.16 [1.96, 37.3]  8.81 [3.87, 55.8]  9.30 [2.66, 59.9]  8.88 [4.49, 17.6]  8.64 [1.96, 59.9]  8.84 [3.87, 55.8]
  Missing             14 (4.7%)          0 (0%)             6 (4.4%)           1 (3.4%)           20 (4.6%)          1 (1.3%)

Please note that the automatic styling of this site results in a Table One that is not very nice looking.

Here we have 510 participants in a study, split into control and case groups, and further subdivided into two groups based on current postmenopausal hormone use. The table describes the distribution of age and of the concentrations of estradiol, estrone, testosterone and prolactin in a blood sample.

A number of packages exist that make it easy to make a Table One. Here we look at the package table1.

The specific way of doing it depends on the data available. If we do not have data on the weight of the participants, we are not able to describe the distribution of their weight.

Let us begin by looking at the data. We begin by loading the two packages tidyverse and table1. We then read in the data from the csv-file “BLOOD.csv”, which we have downloaded from this link.

R

library(tidyverse)
library(table1)
blood <- read_csv("data/BLOOD.csv")
head(blood)

OUTPUT

# A tibble: 6 × 9
      ID matchid  case curpmh ageblood estradol estrone testost prolactn
   <dbl>   <dbl> <dbl>  <dbl>    <dbl>    <dbl>   <dbl>   <dbl>    <dbl>
1 100013  164594     0      1       46       57      65      25     11.1
2 100241  107261     0      0       65       11      26     999      2.8
3 100696  110294     0      1       66        3     999       8     38
4 101266  101266     1      0       57        4      18       6      8.9
5 101600  101600     1      0       66        6      18      25      6.9
6 102228  155717     0      1       57       10     999      31     13.9

The data has 510 rows. It is a case-control study, where ID identifies one individual and matchid links cases and controls. ageblood is the age of the individual at the time the blood sample was drawn, and the remaining columns hold the levels of four different hormones.

The data contains missing values, coded as 999.0 for estrone and testost, and as 99.99 for prolactn.

Let us fix that:

R

blood <- blood %>% 
  mutate(estrone = na_if(estrone, 999.0)) %>% 
  mutate(testost = na_if(testost, 999.0)) %>% 
  mutate(prolactn = na_if(prolactn, 99.99)) 
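
Before and after recoding, it can be reassuring to count how often the missing-value codes occur. The following is a minimal sketch on a small made-up data frame (toy and its values are illustrative assumptions, not taken from BLOOD.csv); it also shows that the three recodings can be written in a single mutate() call using across().

```r
library(dplyr)

# Made-up values for illustration; 999 and 99.99 play the role of the
# missing-value codes described above.
toy <- data.frame(
  estrone  = c(26, 999, 18),
  testost  = c(999, 8, 6),
  prolactn = c(2.8, 99.99, 8.9)
)

# How often does each code occur before recoding?
n_coded <- c(
  estrone  = sum(toy$estrone == 999),
  testost  = sum(toy$testost == 999),
  prolactn = sum(toy$prolactn == 99.99)
)

# Recode in a single mutate(); across() applies na_if() to both columns
# that share the 999 code.
toy_clean <- toy %>%
  mutate(
    across(c(estrone, testost), ~ na_if(.x, 999)),
    prolactn = na_if(prolactn, 99.99)
  )

# After recoding, the number of NA values should match the counts above.
n_missing <- colSums(is.na(toy_clean))
print(n_coded)
print(n_missing)
```

If the two printed vectors agree, every coded value has been turned into NA and nothing else was touched.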

We then ensure that categorical values are stored as categorical values, and adjust the labels of those categorical values:

R

blood <- blood %>% 
  mutate(case = factor(case, labels = c("control", "case"))) %>% 
  mutate(curpmh = factor(curpmh, labels = c("no", "yes")))
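
When only labels are given, factor() assigns them to the sorted unique values found in the data, so the mapping depends on which codes are present. A slightly more defensive variant states the levels explicitly. This is a sketch on made-up codes (case_codes is a hypothetical vector, not a column from the data set):

```r
# Hypothetical 0/1 codes like those in the case column.
case_codes <- c(0, 1, 1, 0, 1)

# Stating the levels explicitly ties each code to its label, regardless of
# which codes happen to occur in the data.
case_factor <- factor(case_codes,
                      levels = c(0, 1),
                      labels = c("control", "case"))

# Cross-tabulate codes against labels to check the mapping.
print(table(case_codes, case_factor))
```

The cross-tabulation should have non-zero counts only where 0 meets "control" and 1 meets "case".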

And now we can make our Table One like this:

R

table1(~ageblood + estradol + estrone + testost + prolactn|case + curpmh, data = blood)

                      control                               case                                  Overall
                      no (N=298)         yes (N=48)         no (N=135)         yes (N=29)         no (N=433)         yes (N=77)
ageblood
  Mean (SD)           61.3 (4.75)        58.9 (5.68)        61.5 (4.85)        58.1 (5.32)        61.4 (4.78)        58.6 (5.53)
  Median [Min, Max]   62.0 [46.0, 69.0]  59.0 [46.0, 68.0]  62.0 [45.0, 69.0]  58.0 [49.0, 68.0]  62.0 [45.0, 69.0]  58.0 [46.0, 68.0]
estradol
  Mean (SD)           8.05 (5.29)        8.73 (8.84)        10.5 (9.72)        10.6 (13.7)        8.81 (7.06)        9.44 (10.9)
  Median [Min, Max]   6.00 [2.00, 46.0]  6.50 [2.00, 57.0]  8.00 [3.00, 85.0]  6.00 [3.00, 76.0]  7.00 [2.00, 85.0]  6.00 [2.00, 76.0]
estrone
  Mean (SD)           28.7 (15.0)        26.8 (12.0)        32.3 (15.7)        27.7 (13.2)        29.8 (15.3)        27.1 (12.3)
  Median [Min, Max]   25.0 [10.0, 131]   23.0 [13.0, 65.0]  29.0 [11.0, 119]   24.0 [12.0, 59.0]  26.0 [10.0, 131]   23.0 [12.0, 65.0]
  Missing             58 (19.5%)         15 (31.3%)         30 (22.2%)         11 (37.9%)         88 (20.3%)         26 (33.8%)
testost
  Mean (SD)           25.3 (13.2)        22.2 (10.7)        27.6 (16.1)        28.2 (15.6)        26.0 (14.2)        24.4 (13.0)
  Median [Min, Max]   23.0 [4.00, 111]   21.5 [8.00, 63.0]  25.0 [6.00, 144]   24.0 [10.0, 69.0]  23.0 [4.00, 144]   22.0 [8.00, 69.0]
  Missing             6 (2.0%)           2 (4.2%)           3 (2.2%)           1 (3.4%)           9 (2.1%)           3 (3.9%)
prolactn
  Mean (SD)           9.60 (5.10)        13.7 (12.3)        10.8 (6.79)        9.57 (3.29)        9.99 (5.70)        12.2 (10.1)
  Median [Min, Max]   8.16 [1.96, 37.3]  8.81 [3.87, 55.8]  9.30 [2.66, 59.9]  8.88 [4.49, 17.6]  8.64 [1.96, 59.9]  8.84 [3.87, 55.8]
  Missing             14 (4.7%)          0 (0%)             6 (4.4%)           1 (3.4%)           20 (4.6%)          1 (1.3%)

It is a good idea, and increases readability, to add labels and units to the variables. The table1 package provides functions for that:

R

label(blood$curpmh) <- "current_pmh"
label(blood$case) <- "case_control"
label(blood$ageblood) <- "Age"
units(blood$ageblood) <- "years"
units(blood$estradol) <- "pg/mL"
units(blood$estrone) <- "pg/mL"

Which looks a bit nicer:

R

table1(~ageblood + estradol + estrone + testost + prolactn|case + curpmh, data = blood)

                      control                               case                                  Overall
                      no (N=298)         yes (N=48)         no (N=135)         yes (N=29)         no (N=433)         yes (N=77)
Age (years)
  Mean (SD)           61.3 (4.75)        58.9 (5.68)        61.5 (4.85)        58.1 (5.32)        61.4 (4.78)        58.6 (5.53)
  Median [Min, Max]   62.0 [46.0, 69.0]  59.0 [46.0, 68.0]  62.0 [45.0, 69.0]  58.0 [49.0, 68.0]  62.0 [45.0, 69.0]  58.0 [46.0, 68.0]
estradol (pg/mL)
  Mean (SD)           8.05 (5.29)        8.73 (8.84)        10.5 (9.72)        10.6 (13.7)        8.81 (7.06)        9.44 (10.9)
  Median [Min, Max]   6.00 [2.00, 46.0]  6.50 [2.00, 57.0]  8.00 [3.00, 85.0]  6.00 [3.00, 76.0]  7.00 [2.00, 85.0]  6.00 [2.00, 76.0]
estrone (pg/mL)
  Mean (SD)           28.7 (15.0)        26.8 (12.0)        32.3 (15.7)        27.7 (13.2)        29.8 (15.3)        27.1 (12.3)
  Median [Min, Max]   25.0 [10.0, 131]   23.0 [13.0, 65.0]  29.0 [11.0, 119]   24.0 [12.0, 59.0]  26.0 [10.0, 131]   23.0 [12.0, 65.0]
  Missing             58 (19.5%)         15 (31.3%)         30 (22.2%)         11 (37.9%)         88 (20.3%)         26 (33.8%)
testost
  Mean (SD)           25.3 (13.2)        22.2 (10.7)        27.6 (16.1)        28.2 (15.6)        26.0 (14.2)        24.4 (13.0)
  Median [Min, Max]   23.0 [4.00, 111]   21.5 [8.00, 63.0]  25.0 [6.00, 144]   24.0 [10.0, 69.0]  23.0 [4.00, 144]   22.0 [8.00, 69.0]
  Missing             6 (2.0%)           2 (4.2%)           3 (2.2%)           1 (3.4%)           9 (2.1%)           3 (3.9%)
prolactn
  Mean (SD)           9.60 (5.10)        13.7 (12.3)        10.8 (6.79)        9.57 (3.29)        9.99 (5.70)        12.2 (10.1)
  Median [Min, Max]   8.16 [1.96, 37.3]  8.81 [3.87, 55.8]  9.30 [2.66, 59.9]  8.88 [4.49, 17.6]  8.64 [1.96, 59.9]  8.84 [3.87, 55.8]
  Missing             14 (4.7%)          0 (0%)             6 (4.4%)           1 (3.4%)           20 (4.6%)          1 (1.3%)

Structuring the data


Most things in R are simple to do (but rarely simple to understand) when the data has the correct structure.

If we follow the general rules of thumb for tidy data, we are off to a good start. This is the structure of the data set we are working with here - after we have made some modifications as described above.

R

head(blood)

OUTPUT

# A tibble: 6 × 9
      ID matchid case    curpmh ageblood estradol estrone testost prolactn
   <dbl>   <dbl> <fct>   <fct>     <dbl>    <dbl>   <dbl>   <dbl>    <dbl>
1 100013  164594 control yes          46       57      65      25     11.1
2 100241  107261 control no           65       11      26      NA      2.8
3 100696  110294 control yes          66        3      NA       8     38
4 101266  101266 case    no           57        4      18       6      8.9
5 101600  101600 case    no           66        6      18      25      6.9
6 102228  155717 control yes          57       10      NA      31     13.9

The important thing to note is that when we stratify the summary statistics by some variable, this variable has to be a categorical variable. The variables we want summary statistics for also have to have the correct type. If the values are categorical, the column in the dataframe has to actually be categorical (a factor). If they are numeric, the data type has to be numeric.
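
A quick way to check this is to look at the class of every column before building the table. The sketch below uses a small made-up data frame (toy is an illustrative assumption that mimics the structure of blood), in which one stratification variable has not yet been converted to a factor:

```r
# Toy data frame mimicking the structure of blood (values are made up).
toy <- data.frame(
  case     = factor(c("control", "case", "case")),
  curpmh   = c(0, 1, 0),      # still numeric: not yet a categorical variable
  ageblood = c(46, 65, 57)
)

# Class of every column before the fix.
classes_before <- sapply(toy, class)
print(classes_before)

# Convert the stratification variable to a factor, then check again.
toy$curpmh <- factor(toy$curpmh, levels = c(0, 1), labels = c("no", "yes"))
classes_after <- sapply(toy, class)
print(classes_after)
```

Columns used for stratification should show up as factor, and columns to be summarised as continuous data should show up as numeric.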

More advanced stuff


We might want to be able to precisely control the summary statistics presented in the table.

We can do that by passing values to the arguments render.continuous and render.categorical, which control how continuous and categorical data, respectively, are shown in the table.

The simple way of doing that is by using abbreviated function names:


R

table1(~ageblood + estradol + estrone + testost + prolactn|case + curpmh, data = blood,
        render.continuous=c(.="Mean (SD%)", .="Median [Min, Max]",
                           "Geom. mean (Geo. SD%)"="GMEAN (GSD%)"))

                          control                               case                                  Overall
                          no (N=298)         yes (N=48)         no (N=135)         yes (N=29)         no (N=433)         yes (N=77)
Age (years)
  Mean (SD%)              61.3 (4.75%)       58.9 (5.68%)       61.5 (4.85%)       58.1 (5.32%)       61.4 (4.78%)       58.6 (5.53%)
  Median [Min, Max]       62.0 [46.0, 69.0]  59.0 [46.0, 68.0]  62.0 [45.0, 69.0]  58.0 [49.0, 68.0]  62.0 [45.0, 69.0]  58.0 [46.0, 68.0]
  Geom. mean (Geo. SD%)   61.1 (1.08%)       58.7 (1.10%)       61.3 (1.08%)       57.9 (1.10%)       61.2 (1.08%)       58.4 (1.10%)
estradol (pg/mL)
  Mean (SD%)              8.05 (5.29%)       8.73 (8.84%)       10.5 (9.72%)       10.6 (13.7%)       8.81 (7.06%)       9.44 (10.9%)
  Median [Min, Max]       6.00 [2.00, 46.0]  6.50 [2.00, 57.0]  8.00 [3.00, 85.0]  6.00 [3.00, 76.0]  7.00 [2.00, 85.0]  6.00 [2.00, 76.0]
  Geom. mean (Geo. SD%)   6.92 (1.69%)       6.81 (1.90%)       8.53 (1.78%)       7.63 (2.03%)       7.39 (1.74%)       7.11 (1.94%)
estrone (pg/mL)
  Mean (SD%)              28.7 (15.0%)       26.8 (12.0%)       32.3 (15.7%)       27.7 (13.2%)       29.8 (15.3%)       27.1 (12.3%)
  Median [Min, Max]       25.0 [10.0, 131]   23.0 [13.0, 65.0]  29.0 [11.0, 119]   24.0 [12.0, 59.0]  26.0 [10.0, 131]   23.0 [12.0, 65.0]
  Geom. mean (Geo. SD%)   25.9 (1.56%)       24.6 (1.50%)       29.4 (1.54%)       25.0 (1.59%)       26.9 (1.56%)       24.7 (1.53%)
  Missing                 58 (19.5%)         15 (31.3%)         30 (22.2%)         11 (37.9%)         88 (20.3%)         26 (33.8%)
testost
  Mean (SD%)              25.3 (13.2%)       22.2 (10.7%)       27.6 (16.1%)       28.2 (15.6%)       26.0 (14.2%)       24.4 (13.0%)
  Median [Min, Max]       23.0 [4.00, 111]   21.5 [8.00, 63.0]  25.0 [6.00, 144]   24.0 [10.0, 69.0]  23.0 [4.00, 144]   22.0 [8.00, 69.0]
  Geom. mean (Geo. SD%)   22.4 (1.65%)       20.0 (1.58%)       24.6 (1.60%)       24.6 (1.69%)       23.1 (1.64%)       21.6 (1.63%)
  Missing                 6 (2.0%)           2 (4.2%)           3 (2.2%)           1 (3.4%)           9 (2.1%)           3 (3.9%)
prolactn
  Mean (SD%)              9.60 (5.10%)       13.7 (12.3%)       10.8 (6.79%)       9.57 (3.29%)       9.99 (5.70%)       12.2 (10.1%)
  Median [Min, Max]       8.16 [1.96, 37.3]  8.81 [3.87, 55.8]  9.30 [2.66, 59.9]  8.88 [4.49, 17.6]  8.64 [1.96, 59.9]  8.84 [3.87, 55.8]
  Geom. mean (Geo. SD%)   8.59 (1.58%)       10.7 (1.89%)       9.63 (1.58%)       9.05 (1.41%)       8.90 (1.59%)       10.1 (1.73%)
  Missing                 14 (4.7%)          0 (0%)             6 (4.4%)           1 (3.4%)           20 (4.6%)          1 (1.3%)

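table1 recognizes, among others, the following summary statistics: N, NMISS, MEAN, SD, CV, GMEAN, GSD, GCV, MEDIAN, MIN, MAX, IQR, Q1, Q2, Q3, T1, T2, FREQ, PCT

Details can be found in the help for the function stats.default()

Note that the names are case-insensitive, so we can write Median or mediAn instead of median.

Also note that we write .="Mean (SD%)". The names Mean and SD are recognized and replaced by the mean and the standard deviation, and the dot tells table1 to use the code itself, “Mean (SD%)”, as the label. All other characters, including the % sign, are copied to the table unchanged. That is why the standard deviations above are shown with a trailing % even though they are not percentages; leave the % out of the code to avoid that.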

If we want to specify the label, we can write "Geom. mean (Geo. SD%)"="GMEAN (GSD%)"
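
To see which values lie behind the abbreviated names, we can call stats.default() directly on a vector. This is a small sketch on made-up numbers (x is an illustrative assumption); the exact set of names returned may differ between versions of table1.

```r
library(table1)

# Made-up values, only for illustration; one value is missing.
x <- c(2, 4, 4, 6, NA)

# stats.default() returns a named list of summary statistics.
s <- stats.default(x)

# The names of the list are the names that can be used in the abbreviated code.
print(names(s))

# Individual statistics can be picked out by name.
print(s$MEAN)
print(s$NMISS)
```

Any name printed by names(s) can be combined with free text in render.continuous, as in the example above.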

Change the labels

We have two unusual values in this table: geometric mean and geometric standard deviation. Change the code so that the label spells out “Geom.” and “Geo.” as “Geometric”.

R

table1(~ageblood + estradol + estrone + testost + prolactn|case + curpmh, data = blood,
        render.continuous=c(.="Mean (SD%)", .="Median [Min, Max]",
                           "Geometric mean (Geometric SD%)"="GMEAN (GSD%)"))

The geometric mean of two numbers is the square root of the product of the two numbers. If we have three numbers, we take the cube root of the product. In general:

\[\left( \prod_{i=1}^{n} x_i \right)^{\frac{1}{n}}\]

The geometric standard deviation is defined by: \[ \exp\left(\sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log x_i - \frac{1}{n} \sum_{j=1}^{n} \log x_j \right)^2}\right)\]
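
The two formulas can be written out directly in base R. This is a minimal sketch that follows the formulas exactly as stated above, dividing by n; geo_mean and geo_sd are names made up for this illustration, and both assume strictly positive values. Note that R's built-in sd() divides by n - 1, so a calculation based on sd(log(x)) gives slightly different numbers.

```r
# Geometric mean: the n-th root of the product, computed on the log scale
# to avoid overflow when multiplying many numbers.
geo_mean <- function(x) {
  x <- x[!is.na(x)]
  exp(mean(log(x)))
}

# Geometric standard deviation, following the formula above (divides by n).
geo_sd <- function(x) {
  x <- x[!is.na(x)]
  log_x <- log(x)
  exp(sqrt(mean((log_x - mean(log_x))^2)))
}

gm <- geo_mean(c(2, 8))   # square root of 2 * 8
gs <- geo_sd(c(2, 8))
print(gm)
print(gs)
```

For the two numbers 2 and 8 the geometric mean is 4 (the square root of 16), and the geometric standard deviation according to the formula above is 2.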

Very advanced stuff


If we want to specify the summary statistics very precisely, we have to define a function ourselves:

R

my_summary <- function(x){
  c("",
    "Median" = sprintf("%.3f", median(x, na.rm = TRUE)),
    "Variance" = sprintf("%.1f", var(x, na.rm = TRUE)))
}
table1(~ageblood + estradol + estrone + testost + prolactn|case + curpmh, data = blood,
       render.continuous = my_summary)

            control                   case                      Overall
            no (N=298)   yes (N=48)   no (N=135)   yes (N=29)   no (N=433)   yes (N=77)
Age (years)
  Median    62.000       59.000       62.000       58.000       62.000       58.000
  Variance  22.6         32.3         23.5         28.3         22.8         30.6
estradol (pg/mL)
  Median    6.000        6.500        8.000        6.000        7.000        6.000
  Variance  28.0         78.1         94.5         186.8        49.8         118.0
estrone (pg/mL)
  Median    25.000       23.000       29.000       24.000       26.000       23.000
  Variance  223.8        143.8        246.4        174.4        232.8        151.5
  Missing   58 (19.5%)   15 (31.3%)   30 (22.2%)   11 (37.9%)   88 (20.3%)   26 (33.8%)
testost
  Median    23.000       21.500       25.000       24.000       23.000       22.000
  Variance  173.6        115.0        257.7        241.9        200.4        169.0
  Missing   6 (2.0%)     2 (4.2%)     3 (2.2%)     1 (3.4%)     9 (2.1%)     3 (3.9%)
prolactn
  Median    8.155        8.805        9.300        8.880        8.640        8.835
  Variance  26.1         151.3        46.1         10.8         32.5         102.8
  Missing   14 (4.7%)    0 (0%)       6 (4.4%)     1 (3.4%)     20 (4.6%)    1 (1.3%)

We do not need to use the sprintf() function, but it is a very neat way of combining text with numeric variables because it allows us to format them directly.

Summary statistics for categorical data can be adjusted similarly, by specifying render.categorical.
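
As a rough sketch of what that could look like, the function below follows the same pattern as my_summary. It assumes that stats.default() returns, for each level of a categorical variable, a list containing FREQ and PCT; my_cat_summary and the toy data frame are names and values made up for this illustration.

```r
library(table1)

# Custom renderer for categorical data: count and percentage with one decimal.
# The leading empty string plays the same role as in my_summary.
my_cat_summary <- function(x) {
  c("",
    sapply(stats.default(x), function(y) {
      sprintf("%d (%.1f%%)", y$FREQ, y$PCT)
    }))
}

# Made-up data: hormone use as the summarised variable, case as the strata.
toy <- data.frame(
  case   = factor(c("control", "control", "case", "case")),
  curpmh = factor(c("no", "no", "yes", "no"))
)

# Apply the renderer directly to see what it returns for one column.
res <- my_cat_summary(toy$curpmh)
print(res)

# Use it in a table; print tab (or knit the document) to display it.
tab <- table1(~ curpmh | case, data = toy, render.categorical = my_cat_summary)
```

In sprintf(), a literal percent sign has to be written as %%.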

What does %.3f actually do?

Can you guess what the formatting in sprintf() does?

Try to change “%.3f” in the function to “%.2f”.

R

my_summary <- function(x){
  c("",
    "Median" = sprintf("%.3f", median(x, na.rm = TRUE)),
    "Variance" = sprintf("%.1f", var(x, na.rm = TRUE)))
}

sprintf uses a somewhat arcane way of specifying how numbers should be formatted when we combine them with text. The “%”-sign marks the place in the text where the number is inserted. “.3f” specifies that we are treating the number as a floating point number (which is just a fancy way of saying that it is a decimal number), and that we would like three digits after the decimal point.
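
A few calls on a single number show the effect of the format string. The value 8.5 is made up for the illustration.

```r
x <- 8.5

three_digits <- sprintf("%.3f", x)                 # "8.500"
two_digits   <- sprintf("%.2f", x)                 # "8.50"
with_text    <- sprintf("Median: %.1f years", 62)  # "Median: 62.0 years"

print(three_digits)
print(two_digits)
print(with_text)
```

The result is always a character string, which is why the numbers keep their trailing zeros in the table.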

What's up with that blank line?

Note that in the function, we define a vector as output, with three elements:

R

my_summary <- function(x){
  c("",
    "Median" = sprintf("%.3f", median(x, na.rm = TRUE)),
    "Variance" = sprintf("%.1f", var(x, na.rm = TRUE)))
}

Calculating and formatting the median and the variance is pretty straightforward.

But the first element is an empty string. What's up with that?

Try to remove the empty string from the function, and use it in a Table One as previously shown:

R

my_summary <- function(x){
  c("Median" = sprintf("%.3f", median(x, na.rm = TRUE)),
    "Variance" = sprintf("%.1f", var(x, na.rm = TRUE)))
}
table1(~ageblood + estradol + estrone + testost + prolactn|case + curpmh, data = blood,
       render.continuous = my_summary)

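The line beginning with “Median” no longer shows up. Instead, the median value is shown on the same line as the variable name, for example next to “Age (years)”. table1 places the first element of the returned vector on the line with the variable name, so the empty string is what keeps that line blank.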
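
Looking at the names of the two vectors makes the difference visible. This sketch uses fixed, made-up strings instead of calling my_summary, and assumes, as the output of the exercise suggests, that table1 uses the first element for the line holding the variable name and the remaining names as row labels.

```r
# With the empty string: the first element has no name and no content,
# so the line with the variable name stays blank.
with_blank <- c("", "Median" = "62.000", "Variance" = "22.6")
print(names(with_blank))

# Without it: "Median" becomes the first element, so its value ends up
# on the line with the variable name and only "Variance" gets its own row.
without_blank <- c("Median" = "62.000", "Variance" = "22.6")
print(names(without_blank))
```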

Primarily of use if there are medical students on the course

Key Points

  • A Table One provides a compact description of the data we are working with
  • With a little bit of work we can control the content of the table.