What about danstat?
Overview
Teaching: 30 min
Exercises: 15 minQuestions
An easier way to access Statistics Denmark
Objectives
Understand what an API do
Connect to Statistics Denmark, and extract data
Create a list of lists to control the variables to be extracted
Please note: These pages are autogenerated. Some of the API-calls may fail during that process. We are figuring out what to do about it, but please excuse us for any red errors on the pages for the time being.
What is an API?
An API is an Application Programming Interface. It is a way of making applications, in our case an R-script, able to communicate with another application, here the Statistics Denmark databases.
Talking about APIs, we talk about several different things. It can be quite confusing, but dont worry!
What we want to be able to do, is to let our own application, our R-script, send a command to a remote application, the databases of Statistics Denmark, in order to retrieve specific data.
An API defines the different commands we can send, and how the data that we get back, is formatted.
Often APIs will require a user account with a login and a password. Statistics Denmark does not.
The standard way to send a command, or a request, to an API is to use the GET (and POST) functions at the core of the internet.
In a certain sense this is what we do when we access a website. We go to www.dr.dk/sporten and get a result, the current webpage at the front of the sports section of Danmarks Radio.
If we instead ask www.dr.dk to return the result of our request for www.dr.dk/nyheder/politik, we will get the current webpage with news on politics.
This is what we do when we access an API. But instead of using our browser, we use the method our browser uses (GET), tells that method that we would like some specified information, and get a result that is not a webpage, but rather a set of data. Hopefully organised in a way that is easy to read.
Writing our own GET-requests to communicate with an API is not simple. Thankfully kind people have written libraries, some in R, that makes accessing specific APIs easier. The one we are going to use here is called “danstat”
The danstat package/library
Before doing anything else, it is useful to take a look at the result:
# A tibble: 6 × 4
IELAND KØN TID INDHOLD
<chr> <chr> <chr> <dbl>
1 Denmark Men 2008Q1 2465810
2 Denmark Men 2008Q2 2466036
3 Denmark Men 2008Q3 2467712
4 Denmark Men 2008Q4 2469977
5 Denmark Men 2009Q1 2470457
6 Denmark Men 2009Q2 2470287
This is from the table “folk1c” from Statistics Denmark.
We get some variables, IELAND, KØN, and TID. And then the content of the table, INDHOLD. Ie the number of men, living in denmark i the first quarter of 2008 in the first line.
How do we get that table?
All tables from Statistics Denmark are organised in a hierarcical tree of subjects.
Let us begin there.
Before using the library, we need to install it:
install.packages("danstat")
Some installations of R may have problems installing it. In that case, try this:
install.packages("remotes")
library(remotes)
remotes:install_github("cran/danstat")
After installation, we load the library using the library function. And then we can access the functions included in the library:
The get_subjects() function sends a request to the Statistics Denmark API, asking for a list of the subjects. The information is returned to our script, and the get_subjects() function presents us with a dataframe containing the information.
library(danstat)
subjects <- get_subjects()
subjects
id description active hasSubjects subjects
1 1 People TRUE TRUE NULL
2 2 Labour and income TRUE TRUE NULL
3 3 Economy TRUE TRUE NULL
4 4 Social conditions TRUE TRUE NULL
5 5 Education and research TRUE TRUE NULL
6 6 Business TRUE TRUE NULL
7 7 Transport TRUE TRUE NULL
8 8 Culture and leisure TRUE TRUE NULL
9 9 Environment and energy TRUE TRUE NULL
10 19 Other TRUE TRUE NULL
We get the 13 major subjects from Statistics Denmark. Each of them have sub-subjects.
If we want to take a closer look at the subdivisions of a given subject, we use the get_subjects() function again, this time specifying which subject we are interested in:
Let us try to get the sub-subjects from the subject 1 - containing information about populations and elections:
sub_subjects <- get_subjects(subjects = 1)
sub_subjects
id description active hasSubjects
1 1 People TRUE TRUE
subjects
1 3401, 3407, 3410, 3415, 3412, 3411, 3428, 3409, Population, Households, families and children, Migration, Housing, Health, Democracy, National church, Names, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
The result is a bit complicated. The column “subjects” in the resulting dataframe contains another dataframe. We access it like we normally would access a column in a dataframe:
sub_subjects$subjects
[[1]]
id description active hasSubjects subjects
1 3401 Population TRUE TRUE NULL
2 3407 Households, families and children TRUE TRUE NULL
3 3410 Migration TRUE TRUE NULL
4 3415 Housing TRUE TRUE NULL
5 3412 Health TRUE TRUE NULL
6 3411 Democracy TRUE TRUE NULL
7 3428 National church TRUE TRUE NULL
8 3409 Names TRUE TRUE NULL
Those sub-subjects have their own subjects! Lets get to the bottom of this, and use 2401, Population and population projections as an example:
sub_sub_subjects <- get_subjects("3401")
sub_sub_subjects$subjects
[[1]]
id description active hasSubjects subjects
1 20021 Population figures TRUE FALSE NULL
2 20024 Immigrants and their descendants TRUE FALSE NULL
3 20022 Population projections TRUE FALSE NULL
4 20019 Adoptions FALSE FALSE NULL
5 20017 Births TRUE FALSE NULL
6 20018 Fertility TRUE FALSE NULL
7 20014 Deaths TRUE FALSE NULL
8 20015 Life expectancy TRUE FALSE NULL
Now we are at the bottom. We can see in the column “hasSubjects” that there are no sub_sub_sub_subjects.
The hierarchy is: 1 Population and elections | 3401 Population | 20021 Population figures
The final sub_sub_subject contains a number of tables, that actually contains the data we are looking for.
get_subjects is able to retrieve all the sub, sub-sub and sub-sub-sub-jects in one go. The result is a bit confusing and difficult to navigate.
Remember that the initial result was a dataframe containing another dataframe. If we go all the way to the bottom, we will get a dataframe, containing several dataframes, each of those containing several dataframes.
We recommend that you do not try it, but this is how it is done:
lots_of_subjects <- get_subjects(1, recursive = T, include_tables = T)
The “recursive = T” parameter means that get_subjects will retrieve the subjects of the subjects, and then the subjects of those subjects.
Which datatables exists?
But we ended up with a sub_sub_subject,
20021 Population figures
How do we find out which tables exists in this subject?
The get_tables() function returns a dataframe with information about the tables available for a given subject.
tables <- get_tables(subjects="20021")
tables
id text
1 FOLK1A Population at the first day of the quarter
2 FOLK1AM Population at the first day of the month
3 FOLK3 Population 1. January
4 FOLK3FOD Population 1. January
5 BEF5 Population 1. January
6 FT Population figures from the censuses
7 BY1 Population 1. January
8 BY2 Population 1. January
9 BY3 Population 1. January
10 KM1 Population at the first day of the quarter
11 SOGN1 Population 1. January
12 SOGN10 Population 1. January
13 BEF4 Population 1. January
14 BEF5F People born in Faroe Islands and living in Denmark 1. January
15 BEF5G People born in Greenland and living in Denmark 1. January
16 BEV22 Summary vital statistics (provisional data)
17 BEV107 Summary vital statistics
18 KMSTA003 Summary vital statistics
19 GALDER Average age
20 KMGALDER Average age
21 HISB3 Summary vital statistics
unit updated firstPeriod latestPeriod active
1 Number 2022-02-11T08:00:00 2008Q1 2022Q1 TRUE
2 Number 2022-03-07T08:00:00 2021M10 2022M02 TRUE
3 Number 2022-02-11T08:00:00 2008 2022 TRUE
4 Number 2022-03-18T08:00:00 2008 2022 TRUE
5 Number 2022-02-11T08:00:00 1990 2022 TRUE
6 Number 2022-02-11T08:00:00 1769 2022 TRUE
7 Number 2021-04-29T08:00:00 2010 2021 TRUE
8 Number 2021-04-29T08:00:00 2010 2021 TRUE
9 - 2021-04-29T08:00:00 2017 2021 TRUE
10 Number 2022-02-17T08:00:00 2007Q1 2022Q1 TRUE
11 Number 2022-02-17T08:00:00 2010 2022 TRUE
12 Number 2021-09-22T08:00:00 1925 2021 TRUE
13 Number 2021-03-31T08:00:00 1901 2021 TRUE
14 Number 2022-02-11T08:00:00 2008 2022 TRUE
15 Number 2022-02-11T08:00:00 2008 2022 TRUE
16 Number 2022-02-11T08:00:00 2007Q2 2021Q4 TRUE
17 Number 2022-02-11T08:00:00 2006 2021 TRUE
18 Number 2022-02-17T08:00:00 2015 2021 TRUE
19 Average 2022-02-11T08:00:00 2005 2022 TRUE
20 Average 2022-02-17T08:00:00 2007 2022 TRUE
21 Number 2022-02-11T08:00:00 1901 2022 TRUE
variables
1 region, sex, age, marital status, time
2 region, sex, age, time
3 day of birth, birth month, year of birth, time
4 day of birth, birth month, country of birth, time
5 sex, age, country of birth, time
6 national part, time
7 urban and rural areas, age, sex, time
8 municipality, city size, age, sex, time
9 urban and rural areas, population, area and population density, time
10 parish, member of the National Church, time
11 parish, sex, age, time
12 parish, time
13 islands, time
14 sex, age, parents place of birth, time
15 sex, age, parents place of birth, time
16 region, type of movement, sex, time
17 region, type of movement, sex, time
18 parish, movements, time
19 municipality, sex, time
20 parish, sex, time
21 type of movement, time
We get at lot of information here. The id identifies the table, text gives a description of the table that humans can understand. When the table was last updated and the first and last period that the table contains data for.
In the variables column, we get information on what kind of data is stored in the table.
Before we pull out the data, we need to know which variables are available in the table. We do this with this function:
metadata <- get_table_metadata("FOLK1A", variables_only = T)
metadata
id text elimination time map
1 OMRÅDE region TRUE FALSE denmark_municipality_07
2 KØN sex TRUE FALSE <NA>
3 ALDER age TRUE FALSE <NA>
4 CIVILSTAND marital status TRUE FALSE <NA>
5 Tid time FALSE TRUE <NA>
values
1 000, 084, 101, 147, 155, 185, 165, 151, 153, 157, 159, 161, 163, 167, 169, 183, 173, 175, 187, 201, 240, 210, 250, 190, 270, 260, 217, 219, 223, 230, 400, 411, 085, 253, 259, 350, 265, 269, 320, 376, 316, 326, 360, 370, 306, 329, 330, 340, 336, 390, 083, 420, 430, 440, 482, 410, 480, 450, 461, 479, 492, 530, 561, 563, 607, 510, 621, 540, 550, 573, 575, 630, 580, 082, 710, 766, 615, 707, 727, 730, 741, 740, 746, 706, 751, 657, 661, 756, 665, 760, 779, 671, 791, 081, 810, 813, 860, 849, 825, 846, 773, 840, 787, 820, 851, All Denmark, Region Hovedstaden, Copenhagen, Frederiksberg, Dragør, Tårnby, Albertslund, Ballerup, Brøndby, Gentofte, Gladsaxe, Glostrup, Herlev, Hvidovre, Høje-Taastrup, Ishøj, Lyngby-Taarbæk, Rødovre, Vallensbæk, Allerød, Egedal, Fredensborg, Frederikssund, Furesø, Gribskov, Halsnæs, Helsingør, Hillerød, Hørsholm, Rudersdal, Bornholm, Christiansø, Region Sjælland, Greve, Køge, Lejre, Roskilde, Solrød, Faxe, Guldborgsund, Holbæk, Kalundborg, Lolland, Næstved, Odsherred, Ringsted, Slagelse, Sorø, Stevns, Vordingborg, Region Syddanmark, Assens, Faaborg-Midtfyn, Kerteminde, Langeland, Middelfart, Nordfyns, Nyborg, Odense, Svendborg, Ærø, Billund, Esbjerg, Fanø, Fredericia, Haderslev, Kolding, Sønderborg, Tønder, Varde, Vejen, Vejle, Aabenraa, Region Midtjylland, Favrskov, Hedensted, Horsens, Norddjurs, Odder, Randers, Samsø, Silkeborg, Skanderborg, Syddjurs, Aarhus, Herning, Holstebro, Ikast-Brande, Lemvig, Ringkøbing-Skjern, Skive, Struer, Viborg, Region Nordjylland, Brønderslev, Frederikshavn, Hjørring, Jammerbugt, Læsø, Mariagerfjord, Morsø, Rebild, Thisted, Vesthimmerlands, Aalborg
2 TOT, 1, 2, Total, Men, Women
3 IALT, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, Total, 0 years, 1 year, 2 years, 3 years, 4 years, 5 years, 6 years, 7 years, 8 years, 9 years, 10 years, 11 years, 12 years, 13 years, 14 years, 15 years, 16 years, 17 years, 18 years, 19 years, 20 years, 21 years, 22 years, 23 years, 24 years, 25 years, 26 years, 27 years, 28 years, 29 years, 30 years, 31 years, 32 years, 33 years, 34 years, 35 years, 36 years, 37 years, 38 years, 39 years, 40 years, 41 years, 42 years, 43 years, 44 years, 45 years, 46 years, 47 years, 48 years, 49 years, 50 years, 51 years, 52 years, 53 years, 54 years, 55 years, 56 years, 57 years, 58 years, 59 years, 60 years, 61 years, 62 years, 63 years, 64 years, 65 years, 66 years, 67 years, 68 years, 69 years, 70 years, 71 years, 72 years, 73 years, 74 years, 75 years, 76 years, 77 years, 78 years, 79 years, 80 years, 81 years, 82 years, 83 years, 84 years, 85 years, 86 years, 87 years, 88 years, 89 years, 90 years, 91 years, 92 years, 93 years, 94 years, 95 years, 96 years, 97 years, 98 years, 99 years, 100 years, 101 years, 102 years, 103 years, 104 years, 105 years, 106 years, 107 years, 108 years, 109 years, 110 years, 111 years, 112 years, 113 years, 114 years, 115 years, 116 years, 117 years, 118 years, 119 years, 120 years, 121 years, 122 years, 123 years, 124 years, 125 years
4 TOT, U, G, E, F, Total, Never married, Married/separated, Widowed, Divorced
5 2008K1, 2008K2, 2008K3, 2008K4, 2009K1, 2009K2, 2009K3, 2009K4, 2010K1, 2010K2, 2010K3, 2010K4, 2011K1, 2011K2, 2011K3, 2011K4, 2012K1, 2012K2, 2012K3, 2012K4, 2013K1, 2013K2, 2013K3, 2013K4, 2014K1, 2014K2, 2014K3, 2014K4, 2015K1, 2015K2, 2015K3, 2015K4, 2016K1, 2016K2, 2016K3, 2016K4, 2017K1, 2017K2, 2017K3, 2017K4, 2018K1, 2018K2, 2018K3, 2018K4, 2019K1, 2019K2, 2019K3, 2019K4, 2020K1, 2020K2, 2020K3, 2020K4, 2021K1, 2021K2, 2021K3, 2021K4, 2022K1, 2008Q1, 2008Q2, 2008Q3, 2008Q4, 2009Q1, 2009Q2, 2009Q3, 2009Q4, 2010Q1, 2010Q2, 2010Q3, 2010Q4, 2011Q1, 2011Q2, 2011Q3, 2011Q4, 2012Q1, 2012Q2, 2012Q3, 2012Q4, 2013Q1, 2013Q2, 2013Q3, 2013Q4, 2014Q1, 2014Q2, 2014Q3, 2014Q4, 2015Q1, 2015Q2, 2015Q3, 2015Q4, 2016Q1, 2016Q2, 2016Q3, 2016Q4, 2017Q1, 2017Q2, 2017Q3, 2017Q4, 2018Q1, 2018Q2, 2018Q3, 2018Q4, 2019Q1, 2019Q2, 2019Q3, 2019Q4, 2020Q1, 2020Q2, 2020Q3, 2020Q4, 2021Q1, 2021Q2, 2021Q3, 2021Q4, 2022Q1
There is a lot of other metadata in the tables, including the phone number to the staffmember at Statistics Denmark that is responsible for maintaining the table. We are only interested in the variables, which is why we add the parameter “variables_only = T”.
What kind of values can the individual datapoints take?
metadata %>% slice(4) %>% pull(values)
[[1]]
id text
1 TOT Total
2 U Never married
3 G Married/separated
4 E Widowed
5 F Divorced
We use the slice function from tidyverse to pull out the fourth row of the dataframe, and the pull-function to pull out the values in the values column.
The same trick can be done for the other fields in the table:
metadata %>% slice(1) %>% pull(values) %>% .[[1]] %>% head
id text
1 000 All Denmark
2 084 Region Hovedstaden
3 101 Copenhagen
4 147 Frederiksberg
5 155 Dragør
6 185 Tårnby
Here we see the individual municipalities in Denmark.
Now we are almost ready to pull out the actual data!
But first!
Which variables do we want?
We need to specify which variables we want in our answer. Do we want the total population for all municipalities in Denmark? Or just a few? Do we want the total population, or do we want it broken down by sex.
These variables, and the values of them, need to be specified when we pull the data from Statistics Denmark.
We also need to provide that information in a specific way.
If we want data for all municipalites, we want to pull the variable “OMRÅDE” from the list of variables.
Therefore we need to give the function an argument containing both the information that we want the population data broken down by “OMRÅDE”, and that we want all values of “OMRÅDE”.
Vectors are characterized by only being able to contain one type of data.
When we need to have structures that can contain more than one type of data, we can use the list structure.
Lists allows us to have values, with names (sometime descriptive).
Lists can even contain lists.
And that is what we need here. Let us make our first list:
list(code = "OMRÅDE", values = NA)
$code
[1] "OMRÅDE"
$values
[1] NA
This list have to components. One called “code”, and one called “values”. Code have the content “OMRÅDE”, specifying that we want the variable in the data from Statistics Denmark calld “OMRÅDE”.
“values” has the content “NA”. We use “NA”, when we want to specify that we want all the “OMRÅDE”. If we only wanted a specific municipality, we could instead specify it instead of writing “NA”.
Let us assume that we also want to break down the data based on marriage status.
That information is stored in the variable “CIVILSTAND”.
And above, we saw that we had the following values in that variable:
metadata %>% slice(4) %>% pull(values)
[[1]]
id text
1 TOT Total
2 U Never married
3 G Married/separated
4 E Widowed
5 F Divorced
A value for the total population is probably not that interesting, if we pull all the individual values for “Never married” etc.
We can now make another list:
list(code = "CIVILSTAND", values = c("U", "G", "E", "F"))
$code
[1] "CIVILSTAND"
$values
[1] "U" "G" "E" "F"
Here the “values” part is a vector containing the values we want to pull out for that variable.
It might be interesting to take a look at how the population changes over time.
In that case we need to pull out data from the “Tid” variable.
That would look like this:
list(code = "Tid", values = NA)
$code
[1] "Tid"
$values
[1] NA
If we want to pull data broken down by all three variables, we need to provide a list, containing three lists.
We do that using this code:
variables <- list(list(code = "OMRÅDE", values = NA),
list(code = "CIVILSTAND", values = c("U", "G", "E", "F")),
list(code = "Tid", values = NA)
)
variables
[[1]]
[[1]]$code
[1] "OMRÅDE"
[[1]]$values
[1] NA
[[2]]
[[2]]$code
[1] "CIVILSTAND"
[[2]]$values
[1] "U" "G" "E" "F"
[[3]]
[[3]]$code
[1] "Tid"
[[3]]$values
[1] NA
And now, finally, we are ready to get the data!
data <- get_data(table_id = "FOLK1A", variables = variables)
Rows: 23940 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (3): OMRÅDE, CIVILSTAND, TID
dbl (1): INDHOLD
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
It takes a short moment. But now we have a dataframe containing the data we requested:
head(data)
# A tibble: 6 × 4
OMRÅDE CIVILSTAND TID INDHOLD
<chr> <chr> <chr> <dbl>
1 All Denmark Never married 2008Q1 2552700
2 All Denmark Never married 2008Q2 2563134
3 All Denmark Never married 2008Q3 2564705
4 All Denmark Never married 2008Q4 2568255
5 All Denmark Never married 2009Q1 2575185
6 All Denmark Never married 2009Q2 2584993
This procedure will work for all the tables from Statistics Denmark!
The data is nicely formatted and ready to use. Almost.
Before we do anything else, let us save the data.
write_csv2(data, "../data/SD_data.csv")
Key Points
R Markdown is a useful language for creating reproducible documents combining text and executable R-code.