Table One

Last updated on 2025-07-01

Estimated time: 12 minutes

Overview

Questions

How do you make a Table One?

Objectives

Explain what a Table One is
Know how to make a Table One and adjust key parameters

What is a “Table One”?

Primarily used in medical and epidemiological research, a Table One is typically the first table in any publication using data.

It presents the baseline characteristics of the participants in a study, and provides a concise overview of the relevant demographic and clinical variables.

It typically compares different groups (for example male vs. female, or treatment vs. control) to highlight similarities and differences.

It can look like this:

	control		case		Overall
	no (N=298)	yes (N=48)	no (N=135)	yes (N=29)	no (N=433)	yes (N=77)
Age (years)
Mean (SD)	61.3 (4.75)	58.9 (5.68)	61.5 (4.85)	58.1 (5.32)	61.4 (4.78)	58.6 (5.53)
Median [Min, Max]	62.0 [46.0, 69.0]	59.0 [46.0, 68.0]	62.0 [45.0, 69.0]	58.0 [49.0, 68.0]	62.0 [45.0, 69.0]	58.0 [46.0, 68.0]
testost
Mean (SD)	25.3 (13.2)	22.2 (10.7)	27.6 (16.1)	28.2 (15.6)	26.0 (14.2)	24.4 (13.0)
Median [Min, Max]	23.0 [4.00, 111]	21.5 [8.00, 63.0]	25.0 [6.00, 144]	24.0 [10.0, 69.0]	23.0 [4.00, 144]	22.0 [8.00, 69.0]
Missing	6 (2.0%)	2 (4.2%)	3 (2.2%)	1 (3.4%)	9 (2.1%)	3 (3.9%)
prolactn
Mean (SD)	9.60 (5.10)	13.7 (12.3)	10.8 (6.79)	9.57 (3.29)	9.99 (5.70)	12.2 (10.1)
Median [Min, Max]	8.16 [1.96, 37.3]	8.81 [3.87, 55.8]	9.30 [2.66, 59.9]	8.88 [4.49, 17.6]	8.64 [1.96, 59.9]	8.84 [3.87, 55.8]
Missing	14 (4.7%)	0 (0%)	6 (4.4%)	1 (3.4%)	20 (4.6%)	1 (1.3%)

Please note that the automatic styling of this site results in a Table One that does not look very nice.

We have 510 participants in a study, split into control and case groups, and further subdivided into two groups based on current postmenopausal hormone (PMH) use. The table describes the distribution of age and of the concentrations of testosterone and prolactin in a blood sample.

How do we make that?

Structuring the data

Most things in R are simple to do (but rarely simple to understand) when the data has the correct structure.

If we follow the general rules of thumb for tidy data, we are off to a good start. This is the structure of the data set we are working with here - after we have made some modifications regarding labels, levels and units.

R

head(blood)

OUTPUT

# A tibble: 6 × 8
      ID matchid case    curpmh ageblood estradol testost prolactn
   <dbl>   <dbl> <fct>   <fct>     <dbl>    <dbl>   <dbl>    <dbl>
1 100013  164594 control yes          46       57      25     11.1
2 100241  107261 control no           65       11      NA      2.8
3 100696  110294 control yes          66        3       8     38
4 101266  101266 case    no           57        4       6      8.9
5 101600  101600 case    no           66        6      25      6.9
6 102228  155717 control yes          57       10      31     13.9

The important thing to note is that when we stratify the summary statistics by some variable, this variable has to be categorical. The variables we want to summarise also have to have the correct type. If the values are categorical, the column in the dataframe has to actually be categorical (a factor). If they are numeric, the data type has to be numeric.
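
A quick way to check this is to look at the class of each column before building the table. The sketch below uses a small made-up data frame (`toy` and its column names are invented for illustration, it is not the lesson data) in which one grouping variable is stored as numbers and one as a factor:

```r
# Toy data frame, invented for illustration (not the BLOOD data)
toy <- data.frame(
  group_code = c(0, 1, 1, 0),                                   # categorical, but stored as numbers
  group      = factor(c("control", "case", "case", "control")), # stored as a factor
  age        = c(46, 65, 66, 57)                                # genuinely numeric
)

# The class of every column in one go
col_types <- sapply(toy, class)
print(col_types)

# A numerically coded grouping variable is not a factor yet
is.factor(toy$group_code)
is.factor(toy$group)
```

A column like `group_code` would need to be converted with `factor()` before it is used for stratification.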

And having the data - how do we actually do it?

A number of packages exist that make it easy to make a Table One. Here we look at the package table1.

The specific way of doing it depends on the data available. If we do not have data on the weight of the participants, we are not able to describe the distribution of their weight.

Let us begin by looking at the data. We begin by loading the two packages tidyverse and table1. We then read in the data from the csv-file “BLOOD.csv”, which we have downloaded from this link.

R

library(tidyverse)
library(table1)
blood <- read_csv("data/BLOOD.csv")
head(blood)

OUTPUT

# A tibble: 6 × 9
      ID matchid  case curpmh ageblood estradol estrone testost prolactn
   <dbl>   <dbl> <dbl>  <dbl>    <dbl>    <dbl>   <dbl>   <dbl>    <dbl>
1 100013  164594     0      1       46       57      65      25     11.1
2 100241  107261     0      0       65       11      26     999      2.8
3 100696  110294     0      1       66        3     999       8     38
4 101266  101266     1      0       57        4      18       6      8.9
5 101600  101600     1      0       66        6      18      25      6.9
6 102228  155717     0      1       57       10     999      31     13.9

The data has 510 rows. It is a case-control study, where ID represents one individual, and matchid gives us the link between cases and controls. The variable ageblood is the age of the individual at the time the blood sample was drawn, and we then have the levels of four different hormones.

The data contains missing values, coded as 999.0 for estrone and testost, and as 99.99 for prolactn.

Let us fix that:

R

blood <- blood %>% 
  mutate(estrone = na_if(estrone, 999.0)) %>% 
  mutate(testost = na_if(testost, 999.0)) %>% 
  mutate(prolactn = na_if(prolactn, 99.99))
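
To see why this recoding matters, here is a minimal sketch on a made-up data frame (`toy` is invented for illustration). It uses base R indexing as a stand-in for `na_if()`, and shows how a sentinel value such as 999 distorts the mean if it is left in the data:

```r
# Toy data, invented for illustration; 999 and 99.99 play the role of "missing"
toy <- data.frame(
  testost  = c(25, 999, 8),
  prolactn = c(11.1, 2.8, 99.99)
)

# Mean with the sentinel value still in place: badly inflated
mean_before <- mean(toy$testost)

# Base R equivalent of na_if(): replace the sentinel values with NA
toy$testost[toy$testost == 999] <- NA
toy$prolactn[toy$prolactn == 99.99] <- NA

# Mean of the observed values only
mean_after <- mean(toy$testost, na.rm = TRUE)

# Number of missing values per column
n_missing <- colSums(is.na(toy))
print(c(before = mean_before, after = mean_after))
print(n_missing)
```

Counting the missing values per column after recoding is a cheap sanity check before the table is built.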

We then ensure that categorical values are stored as categorical values, and adjust the labels of those categorical values:

R

blood <- blood %>% 
  mutate(case = factor(case, labels = c("control", "case"))) %>% 
  mutate(curpmh = factor(curpmh, labels = c("no", "yes")))
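
When only `labels` is given, `factor()` assigns the labels to the sorted unique values, so 0 becomes the first label and 1 the second. The sketch below, on a made-up vector (`codes` is invented for illustration), shows that mapping and a slightly more defensive variant that spells out `levels` as well:

```r
# Made-up 0/1 codes, for illustration only
codes <- c(0, 1, 1, 0)

# Labels are matched to the sorted unique values: 0 -> "control", 1 -> "case"
f1 <- factor(codes, labels = c("control", "case"))

# Stating the levels explicitly makes the mapping visible in the code
f2 <- factor(codes, levels = c(0, 1), labels = c("control", "case"))

# Cross-tabulate the original codes against the new labels to verify the mapping
print(table(codes, f2))
```

Being explicit about `levels` guards against surprises if a code is absent from the data or an unexpected value sneaks in.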

And now we can make our Table One like this. Note that, of the hormones, we only include testosterone and prolactin, in order to get a more manageable table:

R

table1(~ageblood + testost + prolactn|case + curpmh, data = blood)

	control		case		Overall
	no (N=298)	yes (N=48)	no (N=135)	yes (N=29)	no (N=433)	yes (N=77)
ageblood
Mean (SD)	61.3 (4.75)	58.9 (5.68)	61.5 (4.85)	58.1 (5.32)	61.4 (4.78)	58.6 (5.53)
Median [Min, Max]	62.0 [46.0, 69.0]	59.0 [46.0, 68.0]	62.0 [45.0, 69.0]	58.0 [49.0, 68.0]	62.0 [45.0, 69.0]	58.0 [46.0, 68.0]
testost
Mean (SD)	25.3 (13.2)	22.2 (10.7)	27.6 (16.1)	28.2 (15.6)	26.0 (14.2)	24.4 (13.0)
Median [Min, Max]	23.0 [4.00, 111]	21.5 [8.00, 63.0]	25.0 [6.00, 144]	24.0 [10.0, 69.0]	23.0 [4.00, 144]	22.0 [8.00, 69.0]
Missing	6 (2.0%)	2 (4.2%)	3 (2.2%)	1 (3.4%)	9 (2.1%)	3 (3.9%)
prolactn
Mean (SD)	9.60 (5.10)	13.7 (12.3)	10.8 (6.79)	9.57 (3.29)	9.99 (5.70)	12.2 (10.1)
Median [Min, Max]	8.16 [1.96, 37.3]	8.81 [3.87, 55.8]	9.30 [2.66, 59.9]	8.88 [4.49, 17.6]	8.64 [1.96, 59.9]	8.84 [3.87, 55.8]
Missing	14 (4.7%)	0 (0%)	6 (4.4%)	1 (3.4%)	20 (4.6%)	1 (1.3%)

It is a good idea, and increases readability, to add labels and units to the variables. The table1 package provides functions for that:

R

label(blood$curpmh) <- "current_pmh"
label(blood$case) <- "case_control"
label(blood$ageblood) <- "Age"
units(blood$ageblood) <- "years"

This will add labels to the table, and allows us to give the variables more meaningful names and units without changing the data itself. This looks nicer, and is easier to read:

R

table1(~ageblood + testost + prolactn|case + curpmh, data = blood)

	control		case		Overall
	no (N=298)	yes (N=48)	no (N=135)	yes (N=29)	no (N=433)	yes (N=77)
Age (years)
Mean (SD)	61.3 (4.75)	58.9 (5.68)	61.5 (4.85)	58.1 (5.32)	61.4 (4.78)	58.6 (5.53)
Median [Min, Max]	62.0 [46.0, 69.0]	59.0 [46.0, 68.0]	62.0 [45.0, 69.0]	58.0 [49.0, 68.0]	62.0 [45.0, 69.0]	58.0 [46.0, 68.0]
testost
Mean (SD)	25.3 (13.2)	22.2 (10.7)	27.6 (16.1)	28.2 (15.6)	26.0 (14.2)	24.4 (13.0)
Median [Min, Max]	23.0 [4.00, 111]	21.5 [8.00, 63.0]	25.0 [6.00, 144]	24.0 [10.0, 69.0]	23.0 [4.00, 144]	22.0 [8.00, 69.0]
Missing	6 (2.0%)	2 (4.2%)	3 (2.2%)	1 (3.4%)	9 (2.1%)	3 (3.9%)
prolactn
Mean (SD)	9.60 (5.10)	13.7 (12.3)	10.8 (6.79)	9.57 (3.29)	9.99 (5.70)	12.2 (10.1)
Median [Min, Max]	8.16 [1.96, 37.3]	8.81 [3.87, 55.8]	9.30 [2.66, 59.9]	8.88 [4.49, 17.6]	8.64 [1.96, 59.9]	8.84 [3.87, 55.8]
Missing	14 (4.7%)	0 (0%)	6 (4.4%)	1 (3.4%)	20 (4.6%)	1 (1.3%)

More advanced stuff

We might want to be able to precisely control the summary statistics presented in the table.

We can do that by specifying input to the arguments render.continuous and render.categorical, which control how continuous and categorical data, respectively, are shown in the table.

The simple way of doing that is by using abbreviated function names. Of the hormones, we only include testosterone and prolactin in the table to save space:

R

table1(~ageblood + testost + prolactn|case + curpmh, data = blood,
        render.continuous=c(.="Mean (SD%)", .="Median [Min, Max]",
                           "Geom. mean (Geo. SD%)"="GMEAN (GSD%)"))

	control		case		Overall
	no (N=298)	yes (N=48)	no (N=135)	yes (N=29)	no (N=433)	yes (N=77)
Age (years)
Mean (SD%)	61.3 (4.75%)	58.9 (5.68%)	61.5 (4.85%)	58.1 (5.32%)	61.4 (4.78%)	58.6 (5.53%)
Median [Min, Max]	62.0 [46.0, 69.0]	59.0 [46.0, 68.0]	62.0 [45.0, 69.0]	58.0 [49.0, 68.0]	62.0 [45.0, 69.0]	58.0 [46.0, 68.0]
Geom. mean (Geo. SD%)	61.1 (1.08%)	58.7 (1.10%)	61.3 (1.08%)	57.9 (1.10%)	61.2 (1.08%)	58.4 (1.10%)
testost
Mean (SD%)	25.3 (13.2%)	22.2 (10.7%)	27.6 (16.1%)	28.2 (15.6%)	26.0 (14.2%)	24.4 (13.0%)
Median [Min, Max]	23.0 [4.00, 111]	21.5 [8.00, 63.0]	25.0 [6.00, 144]	24.0 [10.0, 69.0]	23.0 [4.00, 144]	22.0 [8.00, 69.0]
Geom. mean (Geo. SD%)	22.4 (1.65%)	20.0 (1.58%)	24.6 (1.60%)	24.6 (1.69%)	23.1 (1.64%)	21.6 (1.63%)
Missing	6 (2.0%)	2 (4.2%)	3 (2.2%)	1 (3.4%)	9 (2.1%)	3 (3.9%)
prolactn
Mean (SD%)	9.60 (5.10%)	13.7 (12.3%)	10.8 (6.79%)	9.57 (3.29%)	9.99 (5.70%)	12.2 (10.1%)
Median [Min, Max]	8.16 [1.96, 37.3]	8.81 [3.87, 55.8]	9.30 [2.66, 59.9]	8.88 [4.49, 17.6]	8.64 [1.96, 59.9]	8.84 [3.87, 55.8]
Geom. mean (Geo. SD%)	8.59 (1.58%)	10.7 (1.89%)	9.63 (1.58%)	9.05 (1.41%)	8.90 (1.59%)	10.1 (1.73%)
Missing	14 (4.7%)	0 (0%)	6 (4.4%)	1 (3.4%)	20 (4.6%)	1 (1.3%)

table1 recognizes the following summary statistics: N, NMISS, MEAN, SD, CV, GMEAN, GCV, MEDIAN, MIN, MAX, IQR, Q1, Q2, Q3, T1, T2, FREQ, PCT. The example above also uses GSD, the geometric standard deviation.

Details can be found in the help to the function stats.default()

Note that they are case-insensitive, and we can write Median or mediAn instead of median.

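Also note that we write .="Mean (SD%)". The abbreviation is recognized as a request for the mean and the standard deviation, and the dot in place of a name means that the label shown is the abbreviation itself, “Mean (SD%)”. The text around the keywords, including the “%”, is copied to the table as it is.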

If we want to specify the label, we can write "Geom. mean (Geo. SD%)"="GMEAN (GSD%)"

Change the labels

We have two unusual values in this table: the geometric mean and the geometric standard deviation. Change the code to write out “Geom.” and “Geo.” as “Geometric”.

Show me the solution

R

table1(~ageblood + testost + prolactn |case + curpmh, data = blood,
        render.continuous=c(.="Mean (SD%)", .="Median [Min, Max]",
                           "Geometric mean (Geometric SD%)"="GMEAN (GSD%)"))

The geometric mean of two numbers is the square root of the product of the two numbers. If we have three numbers, we take the cube root of the product. In general:

\[\left( \prod_{i=1}^{n} x_i \right)^{\frac{1}{n}}\]

The geometric standard deviation is defined by: \[ \exp\left(\sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log x_i - \frac{1}{n} \sum_{j=1}^{n} \log x_j \right)^2}\right)\]
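
The two formulas translate almost directly into R. The following is a minimal sketch, assuming strictly positive values without missing data; `geo_mean()` and `geo_sd()` are helper names made up for this illustration. It follows the formulas above literally, dividing by n, so a package that divides by n - 1 inside the square root would give slightly different numbers:

```r
# Geometric mean: the n-th root of the product, computed on the log scale
geo_mean <- function(x) {
  exp(mean(log(x)))
}

# Geometric standard deviation, following the formula above (divides by n)
geo_sd <- function(x) {
  lx <- log(x)
  exp(sqrt(mean((lx - mean(lx))^2)))
}

x <- c(2, 8)
gm_logs    <- geo_mean(x)    # via logarithms
gm_product <- sqrt(prod(x))  # square root of the product of two numbers
print(c(gm_logs, gm_product))
print(geo_sd(x))
```

Working on the log scale avoids overflow when many values are multiplied together, which is why the product is not computed directly in `geo_mean()`.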

Very advanced stuff

If we want to specify the summary statistics very precisely, we have to define a function ourselves:

R

my_summary <- function(x){
  c("",
    "Median" = sprintf("%.3f", median(x, na.rm = TRUE)),
    "Variance" = sprintf("%.1f", var(x, na.rm = TRUE)))
}
table1(~ageblood + testost + prolactn|case + curpmh, data = blood,
       render.continuous = my_summary)

	control		case		Overall
	no (N=298)	yes (N=48)	no (N=135)	yes (N=29)	no (N=433)	yes (N=77)
Age (years)
Median	62.000	59.000	62.000	58.000	62.000	58.000
Variance	22.6	32.3	23.5	28.3	22.8	30.6
testost
Median	23.000	21.500	25.000	24.000	23.000	22.000
Variance	173.6	115.0	257.7	241.9	200.4	169.0
Missing	6 (2.0%)	2 (4.2%)	3 (2.2%)	1 (3.4%)	9 (2.1%)	3 (3.9%)
prolactn
Median	8.155	8.805	9.300	8.880	8.640	8.835
Variance	26.1	151.3	46.1	10.8	32.5	102.8
Missing	14 (4.7%)	0 (0%)	6 (4.4%)	1 (3.4%)	20 (4.6%)	1 (1.3%)

We do not need to use the sprintf() function, but it is a very neat way of combining text with numeric variables because it allows us to format them directly.

Summary statistics for categorical data can be adjusted similarly, by specifying render.categorical.
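
As a sketch of what such a function could look like, the example below builds one line per category with the count and the percentage. The name `my_cat_summary` is made up for this illustration, and the percentages are computed among the non-missing values only, which is a choice made here and may differ from what table1 does by default:

```r
# Hypothetical render function for categorical variables.
# Like my_summary(), it starts with an empty string for the variable's own row.
my_cat_summary <- function(x) {
  counts <- table(x)                 # table() drops NA values by default
  pct <- 100 * counts / sum(counts)  # percentage among non-missing values
  lines <- sprintf("%d (%.1f%%)", as.integer(counts), as.numeric(pct))
  c("", setNames(lines, names(counts)))
}

# Try it on a small made-up factor
res <- my_cat_summary(factor(c("a", "a", "b", NA)))
print(res)
```

It would be passed to table1() as render.categorical = my_cat_summary, in the same way as my_summary is passed to render.continuous.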

What does %.3f actually do?

Can you guess what the formatting in sprintf() does?

Try to change “%.3f” in the function to “%.2f”.

Show me the solution

R

my_summary <- function(x){
  c("",
    "Median" = sprintf("%.2f", median(x, na.rm = TRUE)),
    "Variance" = sprintf("%.1f", var(x, na.rm = TRUE)))
}

sprintf() uses a somewhat arcane way of specifying how numbers should be formatted when we combine them with text. The “%” sign marks the place in the text where the number is inserted. “.3f” specifies that we are treating the number as a floating point number (which is just a fancy way of saying that it is a decimal number), and that we would like three digits after the decimal point.
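
A few calls with different format strings make the pattern easier to see. The values are arbitrary and chosen only for illustration:

```r
three_digits <- sprintf("%.3f", pi)    # three digits after the decimal point
two_digits   <- sprintf("%.2f", pi)    # two digits after the decimal point
no_digits    <- sprintf("%.0f", 62.4)  # no digits after the decimal point

# The number can be placed anywhere in a piece of text
with_text <- sprintf("Median age: %.1f years", 62)

# A literal percent sign is written as %%
with_percent <- sprintf("%.1f%%", 4.2)

print(c(three_digits, two_digits, no_digits, with_text, with_percent))
```

Note that the result is always a character string, which is what a render function for table1() is expected to return.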

What's up with that blank line?

Note that in the function, we define a vector as output, with three elements:

R

my_summary <- function(x){
  c("",
    "Median" = sprintf("%.3f", median(x, na.rm = TRUE)),
    "Variance" = sprintf("%.1f", var(x, na.rm = TRUE)))
}

Calculating and formatting the median and the variance is pretty straightforward.

But the first element is an empty string. What's up with that?

Show me the solution

Try to remove the empty string from the function, and use it in a Table One as previously shown:

R

my_summary <- function(x){
  c("Median" = sprintf("%.3f", median(x, na.rm = TRUE)),
    "Variance" = sprintf("%.1f", var(x, na.rm = TRUE)))
}
table1(~ageblood + testost + prolactn|case + curpmh, data = blood,
       render.continuous = my_summary)

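The line beginning with “Median” does not show up. Instead, the median value is shown next to the variable names, on the “Age (years)”, “testost” and “prolactn” lines. The first element of the vector is placed on the row that holds the variable name, so we use an empty string there to keep that row blank.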

Instructor Note

Primarily of use if there are medical students on the course

Key Points

A Table One provides a compact description of the data we are working with.
With a little bit of work we can control the content of the table.