Time

Overview

Teaching: 50 min
Exercises: 30 min
Questions
  • How can I select specific rows and/or columns from a dataframe?

Objectives
  • Describe the purpose of an R package and the dplyr and tidyr packages.

  • Select certain columns in a dataframe with the dplyr function select.

A relatively short session on time.

“People assume that time is a strict progression from cause to effect, but actually from a non-linear, non-subjective viewpoint, it’s more like a big ball of wibbly-wobbly, timey-wimey stuff.”

Time is not easy to deal with. It is actually really complicated. Here is a rant on how complicated it is…

https://www.youtube.com/watch?v=-5wpm-gesOY

Why?

We just pulled data out giving us the danish population, broken down by marriage status and geographical area. And time.

If the data is not still in memory, we can read it in:

data <- read_csv2("../data/SD_data.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 23100 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (3): OMRÅDE, CIVILSTAND, TID
dbl (1): INDHOLD

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
# A tibble: 6 × 4
  OMRÅDE      CIVILSTAND    TID    INDHOLD
  <chr>       <chr>         <chr>    <dbl>
1 All Denmark Never married 2008Q1 2552700
2 All Denmark Never married 2008Q2 2563134
3 All Denmark Never married 2008Q3 2564705
4 All Denmark Never married 2008Q4 2568255
5 All Denmark Never married 2009Q1 2575185
6 All Denmark Never married 2009Q2 2584993

Note that the datatype for “TID” is , meaning character. Those are simply text, not a time. And if we want to plot this, as a function of time, the "TID" variable needs to be converted into something R can understand as time.

A general tool

lubridate is a package written to make working with dates and times easy(er).

It may need to be installed first.

install.packages("lubridate")

After that, we can load it:

library(lubridate)

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union

Lubridate converts a lot of different ways of writing dates to a consistent date-time format.

The most important functions we need to know, are:

And variations of these, especially ymd.

ymd(“2021-09-21”) converts the date 2020-09-21 to a date-format that R can understand:

ymd("2021-09-21")
[1] "2021-09-21"

Sometimes we have dates formatted as “21-09-2021”. That is day, month and year in that order.

That can be converted to at standard date-format with the function dmy():

dmy("21-09-2021")
[1] "2021-09-21"

We might even have dates formatted as “2021 21 4”, (year, day month), the function ydm() can handle that.

ydm("2021 21 4")
[1] "2021-04-21"

Time is handled in a similar way, but time is usually not written as creatively as dates:

hm("14:05")
[1] "14H 5M 0S"
hms("14.05.21")
[1] "14H 5M 21S"

Dates and times can be combined, as in: “2021-04-21 14:05:12”:

ymd_hms("2021-04-21 14:05:12")
[1] "2021-04-21 14:05:12 UTC"

Those were the nice dates…

Not so nice date formats - a more specific tool

Statistics Denmark returns a lot of data-series by quarter, or month. And we need to convert it to something we can work with. Without necessarily understanding all the details.

The library tsibble provides functions that can convert “2020Q1”, the first quarter of 2020, into something R can understand as time-value:

We might need to install it first:

install.packages("tsibble")

And then load it:

library(tsibble)

Attaching package: 'tsibble'
The following object is masked from 'package:lubridate':

    interval
The following objects are masked from 'package:base':

    intersect, setdiff, union

This is a vector containg the 8 quarters of the years 2019 and 2020.

quarters <- c("2019Q1", "2019Q2", "2019Q3", "2019Q4", "2020Q1", "2020Q2", "2020Q3", "2020Q4")
class(quarters)
[1] "character"

It is a character vector, ie strings. If we want to analyse any data associated with these specific quarters, we need to convert them to something R is able to recognize as time.

yearquarter(quarters)
<yearquarter[8]>
[1] "2019 Q1" "2019 Q2" "2019 Q3" "2019 Q4" "2020 Q1" "2020 Q2" "2020 Q3"
[8] "2020 Q4"
# Year starts on: January

We are not going to go into further details on the challenges of working with time-series. The generic lubridate functions and yearquarter() will be enough for our purposes.

Let us finish by converting the “TID” column in our data, to a time-format.

data <- data %>% 
  mutate(TID = yearquarter(TID))

We mutate the column “TID” into the result of running yearquarter() on the column “TID”. And now we have a data frame that we can do interesting things with.

Now might be a good time to save the data in its new version:

write_csv2(data, "../data/SD_data.csv")

Note that we are using write_csv2() here. We do not have decimalpoints in this data, but other data might have.

Key Points

  • Use pivot_longer() to go from wide to long format.