Using a relational database with R

Last updated on 2024-01-30 | Edit this page

Introduction


A common problem with R in that all operations are conducted in-memory. Here, “memory” means a computer drive called Random Access Memory (RAM for short) which is phsically installed on your computer. RAM allows for data to be read in nearly the same amount of time regardless of its physical location inside the memory. This is in stark contrast to the speed data could be read from CDs, DVDs, or other storage media, where the speed of transfer depended on how quickly the drive could rotate or the arm could move. Unfortunately, your computer has a limited amount of RAM, so the amount of data you can work with is limited by the available memory. So far, we have used small datasets that can easily fit into your computer’s memory. But what about datasets that are too large for your computer to handle as a whole?

In this case, it is helpful to organze the data into a database stored outside of R before creating a connection to the database itself. This connection will essentially remove the limitation of memory because SQL queries can be sent directly from R to the database and return to R only the results that you have identified as being neccessary for your analysis.

Once we have made the connection to the database, much of what we do will look familiar because the code we will be using is very similar to what we saw in the SQL lesson and earlier episodes of this R lesson.

In this lesson, we will be connecting to an SQLite database, which allows us to send strings containing SQL statements directly from R to the database and recieve the results. In addition, we will be connecting to the database in such a way that we can use ‘dplyr’ functions to operate directly on the database tables.

Prelminaries


First, install and load the neccessary packages. You can install the RSQLite package with

R

install.packages("RSQLite")

Load the packages with

R

library(RSQLite)
library(dplyr)

OUTPUT


Attaching package: 'dplyr'

OUTPUT

The following objects are masked from 'package:stats':

    filter, lag

OUTPUT

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Next, create a variable that contains the location of the SQLite database we are going to use. Here, we are assuming that it is in the current working directory.

R

dbfile <- "data/SN7577.sqlite"

Connecting to an SQLite Database


Connect to the SQLite database specified by dbfile, above, using the dbConnect function.

R

mydb <- dbConnect(dbDriver("SQLite"), dbfile)

Here, mydb represents the connection to the database. It will be specified every time we need to access the database.

Now that we have a connection, we can get a list of the tables in the database.

R

dbListTables(mydb)

OUTPUT

character(0)

Our objective here is to bring data from the database into R by sending a query to the database and then asking for the results of that query.

R

# Assign the results of a SQL query to an SQLiteResult object
results <- dbSendQuery(mydb, "SELECT * FROM Question1")

ERROR

Error: no such table: Question1

R

# Return results from a custom object to a dataframe
data <- fetch(results)

ERROR

Error in h(simpleError(msg, call)): error in evaluating the argument 'res' in selecting a method for function 'fetch': object 'results' not found

data is a standard R dataframe that can be explored and manipulated.

R

# Return column names
names(data)

OUTPUT

NULL

R

# Return description of dataframe structure
str(data)

OUTPUT

function (..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"), 
    envir = .GlobalEnv, overwrite = TRUE)  

R

# Return the second column
data[,2]

ERROR

Error in data[, 2]: object of type 'closure' is not subsettable

R

# Return the value of the second column, fourth row
data[4,2]

ERROR

Error in data[4, 2]: object of type 'closure' is not subsettable

R

# Return the second column where the value of the column 'key' is greater than 7
data[data$key > 7,2]

ERROR

Error in data$key: object of type 'closure' is not subsettable

Once you have retrieved the data you should close the connection.

R

dbClearResult(results)

ERROR

Error in h(simpleError(msg, call)): error in evaluating the argument 'res' in selecting a method for function 'dbClearResult': object 'results' not found

In addition to sending simple queries we can send complex one like a join. You may want to set this up in a concateneted string first for readability.

R

SQL_query <- paste("SELECT q.value,",
                   "count(*) as how_many",
                   "FROM SN7577 s",
                   "JOIN Question1  q",
                   "ON q.key = s.Q1",
                   "GROUP BY  s.Q1")

results <- dbSendQuery(mydb, SQL_query)

ERROR

Error: no such table: SN7577

R

data <- fetch(results)

ERROR

Error in h(simpleError(msg, call)): error in evaluating the argument 'res' in selecting a method for function 'fetch': object 'results' not found

R

data

OUTPUT

function (..., list = character(), package = NULL, lib.loc = NULL, 
    verbose = getOption("verbose"), envir = .GlobalEnv, overwrite = TRUE) 
{
    fileExt <- function(x) {
        db <- grepl("\\.[^.]+\\.(gz|bz2|xz)$", x)
        ans <- sub(".*\\.", "", x)
        ans[db] <- sub(".*\\.([^.]+\\.)(gz|bz2|xz)$", "\\1\\2", 
            x[db])
        ans
    }
    my_read_table <- function(...) {
        lcc <- Sys.getlocale("LC_COLLATE")
        on.exit(Sys.setlocale("LC_COLLATE", lcc))
        Sys.setlocale("LC_COLLATE", "C")
        read.table(...)
    }
    stopifnot(is.character(list))
    names <- c(as.character(substitute(list(...))[-1L]), list)
    if (!is.null(package)) {
        if (!is.character(package)) 
            stop("'package' must be a character vector or NULL")
    }
    paths <- find.package(package, lib.loc, verbose = verbose)
    if (is.null(lib.loc)) 
        paths <- c(path.package(package, TRUE), if (!length(package)) getwd(), 
            paths)
    paths <- unique(normalizePath(paths[file.exists(paths)]))
    paths <- paths[dir.exists(file.path(paths, "data"))]
    dataExts <- tools:::.make_file_exts("data")
    if (length(names) == 0L) {
        db <- matrix(character(), nrow = 0L, ncol = 4L)
        for (path in paths) {
            entries <- NULL
            packageName <- if (file_test("-f", file.path(path, 
                "DESCRIPTION"))) 
                basename(path)
            else "."
            if (file_test("-f", INDEX <- file.path(path, "Meta", 
                "data.rds"))) {
                entries <- readRDS(INDEX)
            }
            else {
                dataDir <- file.path(path, "data")
                entries <- tools::list_files_with_type(dataDir, 
                  "data")
                if (length(entries)) {
                  entries <- unique(tools::file_path_sans_ext(basename(entries)))
                  entries <- cbind(entries, "")
                }
            }
            if (NROW(entries)) {
                if (is.matrix(entries) && ncol(entries) == 2L) 
                  db <- rbind(db, cbind(packageName, dirname(path), 
                    entries))
                else warning(gettextf("data index for package %s is invalid and will be ignored", 
                  sQuote(packageName)), domain = NA, call. = FALSE)
            }
        }
        colnames(db) <- c("Package", "LibPath", "Item", "Title")
        footer <- if (missing(package)) 
            paste0("Use ", sQuote(paste("data(package =", ".packages(all.available = TRUE))")), 
                "\n", "to list the data sets in all *available* packages.")
        else NULL
        y <- list(title = "Data sets", header = NULL, results = db, 
            footer = footer)
        class(y) <- "packageIQR"
        return(y)
    }
    paths <- file.path(paths, "data")
    for (name in names) {
        found <- FALSE
        for (p in paths) {
            tmp_env <- if (overwrite) 
                envir
            else new.env()
            if (file_test("-f", file.path(p, "Rdata.rds"))) {
                rds <- readRDS(file.path(p, "Rdata.rds"))
                if (name %in% names(rds)) {
                  found <- TRUE
                  if (verbose) 
                    message(sprintf("name=%s:\t found in Rdata.rds", 
                      name), domain = NA)
                  thispkg <- sub(".*/([^/]*)/data$", "\\1", p)
                  thispkg <- sub("_.*$", "", thispkg)
                  thispkg <- paste0("package:", thispkg)
                  objs <- rds[[name]]
                  lazyLoad(file.path(p, "Rdata"), envir = tmp_env, 
                    filter = function(x) x %in% objs)
                  break
                }
                else if (verbose) 
                  message(sprintf("name=%s:\t NOT found in names() of Rdata.rds, i.e.,\n\t%s\n", 
                    name, paste(names(rds), collapse = ",")), 
                    domain = NA)
            }
            if (file_test("-f", file.path(p, "Rdata.zip"))) {
                warning("zipped data found for package ", sQuote(basename(dirname(p))), 
                  ".\nThat is defunct, so please re-install the package.", 
                  domain = NA)
                if (file_test("-f", fp <- file.path(p, "filelist"))) 
                  files <- file.path(p, scan(fp, what = "", quiet = TRUE))
                else {
                  warning(gettextf("file 'filelist' is missing for directory %s", 
                    sQuote(p)), domain = NA)
                  next
                }
            }
            else {
                files <- list.files(p, full.names = TRUE)
            }
            files <- files[grep(name, files, fixed = TRUE)]
            if (length(files) > 1L) {
                o <- match(fileExt(files), dataExts, nomatch = 100L)
                paths0 <- dirname(files)
                paths0 <- factor(paths0, levels = unique(paths0))
                files <- files[order(paths0, o)]
            }
            if (length(files)) {
                for (file in files) {
                  if (verbose) 
                    message("name=", name, ":\t file= ...", .Platform$file.sep, 
                      basename(file), "::\t", appendLF = FALSE, 
                      domain = NA)
                  ext <- fileExt(file)
                  if (basename(file) != paste0(name, ".", ext)) 
                    found <- FALSE
                  else {
                    found <- TRUE
                    zfile <- file
                    zipname <- file.path(dirname(file), "Rdata.zip")
                    if (file.exists(zipname)) {
                      Rdatadir <- tempfile("Rdata")
                      dir.create(Rdatadir, showWarnings = FALSE)
                      topic <- basename(file)
                      rc <- .External(C_unzip, zipname, topic, 
                        Rdatadir, FALSE, TRUE, FALSE, FALSE)
                      if (rc == 0L) 
                        zfile <- file.path(Rdatadir, topic)
                    }
                    if (zfile != file) 
                      on.exit(unlink(zfile))
                    switch(ext, R = , r = {
                      library("utils")
                      sys.source(zfile, chdir = TRUE, envir = tmp_env)
                    }, RData = , rdata = , rda = load(zfile, 
                      envir = tmp_env), TXT = , txt = , tab = , 
                      tab.gz = , tab.bz2 = , tab.xz = , txt.gz = , 
                      txt.bz2 = , txt.xz = assign(name, my_read_table(zfile, 
                        header = TRUE, as.is = FALSE), envir = tmp_env), 
                      CSV = , csv = , csv.gz = , csv.bz2 = , 
                      csv.xz = assign(name, my_read_table(zfile, 
                        header = TRUE, sep = ";", as.is = FALSE), 
                        envir = tmp_env), found <- FALSE)
                  }
                  if (found) 
                    break
                }
                if (verbose) 
                  message(if (!found) 
                    "*NOT* ", "found", domain = NA)
            }
            if (found) 
                break
        }
        if (!found) {
            warning(gettextf("data set %s not found", sQuote(name)), 
                domain = NA)
        }
        else if (!overwrite) {
            for (o in ls(envir = tmp_env, all.names = TRUE)) {
                if (exists(o, envir = envir, inherits = FALSE)) 
                  warning(gettextf("an object named %s already exists and will not be overwritten", 
                    sQuote(o)))
                else assign(o, get(o, envir = tmp_env, inherits = FALSE), 
                  envir = envir)
            }
            rm(tmp_env)
        }
    }
    invisible(names)
}
<bytecode: 0x55dd65eabca0>
<environment: namespace:utils>

R

dbClearResult(results)

ERROR

Error in h(simpleError(msg, call)): error in evaluating the argument 'res' in selecting a method for function 'dbClearResult': object 'results' not found

Exercise

What happens if you send invalid SQL syntax?

Solution

An error message is returned from SQLite. Notice that R is just the conduit; it cannot check the SQL syntax.

{: .solution} {: .challenge}

We can also create a new database and add tables to it. Let’s base this new dataframe on the Question1 table that can be found in our existing database.

R

# First, use a SQL query to extract the Question1 table from the existing database
results = dbSendQuery(mydb, "SELECT * from Question1")

ERROR

Error: no such table: Question1

R

# Then, store it as a dataframe
Q1 <- fetch(results)

ERROR

Error in h(simpleError(msg, call)): error in evaluating the argument 'res' in selecting a method for function 'fetch': object 'results' not found

Now, we can create the new database and add data to it, either from an external file or a local dataframe.

R

dbfile_new = "data/a_newdb.sqlite"
mydb_new = dbConnect(dbDriver("SQLite"), dbfile_new)

dbWriteTable(conn = mydb_new , name = "SN7577", value = "data/SN7577.csv",
             row.names = FALSE, header = TRUE)

WARNING

Warning in file(file, "rt"): cannot open file 'data/SN7577.csv': No such file
or directory

ERROR

Error in file(file, "rt"): cannot open the connection

R

dbWriteTable(conn = mydb_new , name = "Q1", value = Q1,
             row.names = FALSE)

ERROR

Error in h(simpleError(msg, call)): error in evaluating the argument 'value' in selecting a method for function 'dbWriteTable': object 'Q1' not found

R

dbListTables(mydb_new)

OUTPUT

character(0)

Connecting to a Database for dplyr Use


When we want to use dplyr functions to operate directly on the database tables, a different connection method is used.

R

mydb_dplyr <- src_sqlite(path="data/SN7577.sqlite")

WARNING

Warning: `src_sqlite()` was deprecated in dplyr 1.0.0.
ℹ Please use `tbl()` directly with a database connection
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.

as is the method for running queries. However using the ‘tbl’ function we still need to provide avalid SQL string. (?)

R

tbl(mydb_dplyr, sql("SELECT count(*) from SN7577"))

ERROR

Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
Caused by error:
! no such table: SN7577

The real advantage of using dplyr is that once we have stored the table as an object (here, SN7577_d), we can use dplyr functions instead of SQL statements.

R

# Store the table as an object
SN7577_d <- tbl(mydb_dplyr, sql("SELECT * FROM SN7577"))

ERROR

Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
Caused by error:
! no such table: SN7577

R

# Explore the object
head(SN7577_d, n = 10)

ERROR

Error in eval(expr, envir, enclos): object 'SN7577_d' not found

R

nrow(SN7577_d)

ERROR

Error in eval(expr, envir, enclos): object 'SN7577_d' not found

R

# Apply dplyr functions to the object
SN7577_d %>%
  filter(numage > 60) %>%
  select(sex, age, numage) %>%
  group_by(sex, age) %>%
  summarize(avg_age = mean(numage))

ERROR

Error in eval(expr, envir, enclos): object 'SN7577_d' not found

Notice that on the nrow command we get NA rather than a count of rows. Thisis because dplyr doesn’t hold the full table even after the ‘Select * …’

If you need the row count you can use

R

SN7577_d %>%
  tally()

ERROR

Error in eval(expr, envir, enclos): object 'SN7577_d' not found

Exercise

Store the SN7577 table as an object for dplyr use.

Write a query using dplyr functions that will return the average age (numage) by sex for all records where the response for Q2 is missing (missing values are indicated by a value of -1).

Solution

R

SN7577_d <- tbl(mydb_dplyr, sql("SELECT * FROM SN7577"))

ERROR

Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
Caused by error:
! no such table: SN7577

R

SN7577_d %>%
  filter(Q2 == -1)   %>%
  group_by(sex)   %>%
  summarize(avg_age = mean(numage))

ERROR

Error in eval(expr, envir, enclos): object 'SN7577_d' not found

{: .solution} {: .challenge}

{% include links.md %}