Introduction
Recently, I have been building Shiny apps for work. The app that I am currently working on is an interface to a database that stores information about laboratory samples being collected. In addition to building the Shiny app for my coworkers to interact with the database, I was also tasked with creating the database itself. I had never built a SQL database from scratch, but luckily the odbc and DBI packages make it fairly straightforward.
Let’s start by loading in the packages that we will need.
library(DBI)
library(odbc)
library(RSQLite)
library(tidyverse)
library(magrittr)
library(dbplyr)
Connecting to the Database
The dbConnect function from the DBI package allows us to create a SQLite database directly from R. With this method, SQLite databases are saved as files in the current working directory. As described in the RSQLite package vignette, if you simply want a temporary database, you can create either an on-disk database or an in-memory database with this same method. For this example, we will create a new SQLite in-memory database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
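If you instead want a database that persists between sessions, point dbConnect at a file path; SQLite creates the file on the first connect if it does not already exist. A quick sketch (the file name "samples.sqlite" is just a made-up example):
# A file-backed database instead of an in-memory one
con_file <- dbConnect(RSQLite::SQLite(), "samples.sqlite")
dbDisconnect(con_file)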
Currently, our database is empty, as can be seen if we use the dbListTables function.
dbListTables(con)
## character(0)
Writing Tables to the Database
To add data to this database we will use the dbWriteTable function. First, let’s load in two common datasets, mtcars and diamonds.
data("mtcars")
data("diamonds")
mtcars %<>%
rownames_to_column()
Now that we have these two datasets loaded into the session, let’s write them into the database.
dbWriteTable(con, "cars", mtcars)
dbWriteTable(con, "diamonds", diamonds)
dbListTables(con)
## [1] "cars" "diamonds"
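As an aside, dbWriteTable also has overwrite and append arguments (both FALSE by default) for replacing or extending a table that already exists, rather than raising an error. A sketch, not run here:
# Replace the existing "cars" table wholesale
dbWriteTable(con, "cars", mtcars, overwrite = TRUE)
# Or add rows onto the end of it instead
dbWriteTable(con, "cars", mtcars, append = TRUE)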
Querying the Database
There are several ways that we can query the tables in this database. We can read in an entire table using the dbReadTable function.
dbReadTable(con, "cars") %>%
head(10)
## rowname mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Alternatively, we can write out a full SQL query using the dbGetQuery function.
dbGetQuery(con, "select * from cars") %>%
head(10)
## rowname mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
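Note that both snippets above pull the entire table into R before head trims it to ten rows. With dbGetQuery we can push that limit into the SQL itself, so only ten rows ever leave the database:
# Let SQLite return only the first ten rows
dbGetQuery(con, "select * from cars limit 10")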
We can use the microbenchmark package to determine which of these methods is faster. We will measure the time for the diamonds data set, as it has nearly 54,000 observations, as opposed to the 32 in mtcars.
microbenchmark::microbenchmark(
read_table = dbReadTable(con, "diamonds"),
query = dbGetQuery(con, "select * from diamonds")
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## read_table 91.36283 94.55963 115.4623 98.51866 150.1218 175.2671 100
## query 90.61233 93.50576 111.0308 98.11973 113.0769 181.2567 100
Judging by the medians, the two methods perform essentially identically here; if anything, dbGetQuery comes out marginally ahead. The real benefit of dbGetQuery, however, is the ability to write much more complex SQL queries. For example, if we only want a subset of the data, we can ask for exactly that.
query <- paste("select carat, cut, clarity, color, price from diamonds",
"where carat > 1 and cut = 'Ideal'")
dbGetQuery(con, query) %>%
as_tibble()
## # A tibble: 5,662 x 5
## carat cut clarity color price
## <dbl> <chr> <chr> <chr> <int>
## 1 1.01 Ideal I1 I 2844
## 2 1.02 Ideal SI2 H 2856
## 3 1.02 Ideal I1 I 2872
## 4 1.02 Ideal SI2 J 2879
## 5 1.01 Ideal I1 I 2896
## 6 1.02 Ideal I1 I 2925
## 7 1.14 Ideal SI1 J 3045
## 8 1.02 Ideal SI2 H 3142
## 9 1.06 Ideal SI2 I 3146
## 10 1.02 Ideal VS2 I 3148
## # ... with 5,652 more rows
This particular query returned just over 10% of the total data, with 5,662 rows matching the conditions. This capability is essential when dealing with databases that house very large amounts of data, where querying full tables would be infeasible in most situations.
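One caution: pasting values directly into a query string gets fragile (and unsafe) once those values come from user input, as in a Shiny app. DBI supports bound parameters for this; a sketch of the same subset using ? placeholders:
# The same subset with bound parameters instead of pasted values
dbGetQuery(con,
           "select carat, cut, clarity, color, price from diamonds
            where carat > ? and cut = ?",
           params = list(1, "Ideal"))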
In addition to supporting more complex SQL queries, the dbplyr package allows R users to avoid writing queries at all. It lets you create a reference to a SQL table and interact with it using typical dplyr verbs. We can recreate the query above using this method. First, we use the tbl function to create a reference to the diamonds table in the database; then we can use that reference with all of our favorite dplyr verbs.
diamonds_tbl <- tbl(con, "diamonds")
diamonds_tbl %>%
select(carat, cut, clarity, color, price) %>%
filter(carat > 1,
cut == "Ideal") %>%
collect() %>%
as_tibble()
## # A tibble: 5,662 x 5
## carat cut clarity color price
## <dbl> <chr> <chr> <chr> <int>
## 1 1.01 Ideal I1 I 2844
## 2 1.02 Ideal SI2 H 2856
## 3 1.02 Ideal I1 I 2872
## 4 1.02 Ideal SI2 J 2879
## 5 1.01 Ideal I1 I 2896
## 6 1.02 Ideal I1 I 2925
## 7 1.14 Ideal SI1 J 3045
## 8 1.02 Ideal SI2 H 3142
## 9 1.06 Ideal SI2 I 3146
## 10 1.02 Ideal VS2 I 3148
## # ... with 5,652 more rows
The collect verb is important if you want the full query result brought into your R session. The dbplyr package uses lazy evaluation: it builds up the query and only pulls a preview of the result into your session until you explicitly collect it.
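If you are curious about the SQL that dbplyr writes for you, the show_query function prints the translated query without executing it:
# Inspect the SQL generated from the dplyr pipeline
diamonds_tbl %>%
  select(carat, cut, clarity, color, price) %>%
  filter(carat > 1,
         cut == "Ideal") %>%
  show_query()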
Let’s take a look at how these two methods compare using the microbenchmark package.
microbenchmark::microbenchmark(
db_query = dbGetQuery(con, query),
dbplyr = diamonds_tbl %>%
select(carat, cut, clarity, color, price) %>%
filter(carat > 1,
cut == "Ideal") %>%
collect()
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## db_query 10.10302 10.30996 10.63014 10.44201 10.60457 15.27123 100
## dbplyr 51.23319 52.08525 54.66543 53.53378 54.32673 87.69649 100
As we can see, the dbplyr method, while very familiar and potentially easier if you have no experience writing SQL queries, takes roughly five times as long as the straight SQL query, presumably due to the overhead of translating the pipeline into SQL.
Modifying Tables in Place
While there are a lot of blog posts and some great package vignettes about setting up tables and querying SQL databases, there is not much (that I have seen) about modifying tables in place. There are a few options when you want to modify a table in a SQL database. The first is to query the entire table, make your desired changes using your preferred R tools, and then overwrite the table in the database. However, this approach is not practical if you have a large amount of data in your table.
The method that I have found to be fairly straightforward uses the dbSendQuery function. While knowing this function is important, the more important part is knowing what SQL commands to include in your query. The post on the Win-Vector Blog about using PostgreSQL in R shows how you can drop entire tables from your database, and the RSQLite vignette by Hadley Wickham shows how to delete rows that meet certain conditions. However, if you want to modify a table in your database, the SQL commands needed are “update”, “set”, and “where”. You can see below how we can use these commands.
update_query <- paste("update cars",
"set mpg = 20",
"where cyl = 6")
dbSendQuery(con, update_query)
## <SQLiteResult>
## SQL update cars set mpg = 20 where cyl = 6
## ROWS Fetched: 0 [complete]
## Changed: 7
We can see that 7 rows were changed in the database. Let’s query the database and see how the table looks now.
dbGetQuery(con, "select * from cars") %>%
as_tibble()
## # A tibble: 32 x 12
## rowname mpg cyl disp hp drat wt qsec vs am
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 20.0 6 160.0 110 3.90 2.620 16.46 0 1
## 2 Mazda RX4 Wag 20.0 6 160.0 110 3.90 2.875 17.02 0 1
## 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1
## 4 Hornet 4 Drive 20.0 6 258.0 110 3.08 3.215 19.44 1 0
## 5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0
## 6 Valiant 20.0 6 225.0 105 2.76 3.460 20.22 1 0
## 7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0
## 8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0
## 9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0
## 10 Merc 280 20.0 6 167.6 123 3.92 3.440 18.30 1 0
## # ... with 22 more rows, and 2 more variables: gear <dbl>, carb <dbl>
It looks like all of the rows where cyl = 6 have had their mpg changed to 20. While this is a somewhat trivial example (you would most likely not want to overwrite values in a dataset like this), it can be an incredibly useful technique if you are maintaining a database from R.
You can modify additional columns by adding more assignments to the “set” clause, and add more conditions with “and” in the “where” clause. For example, we can edit the diamonds table below.
update_query <- paste("update diamonds",
"set cut = 'new Ideal',",
"color = 'Z'",
"where cut = 'Ideal' and",
"color = 'E'")
dbSendQuery(con, update_query)
## <SQLiteResult>
## SQL update diamonds set cut = 'new Ideal', color = 'Z' where cut = 'Ideal' and color = 'E'
## ROWS Fetched: 0 [complete]
## Changed: 3903
We can see that this changed 3903 rows in the diamonds dataset.
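One housekeeping note: dbSendQuery returns a result object that should be released with dbClearResult once you are done with it (the examples above leave theirs open). DBI also provides dbExecute, which runs a statement, cleans up after itself, and returns the number of rows changed. A sketch of both patterns:
# Release the pending result explicitly...
res <- dbSendQuery(con, update_query)
dbClearResult(res)
# ...or let dbExecute run, clean up, and report rows changed
dbExecute(con, update_query)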
Before we finish, it is important to remember to disconnect from the in-memory database using the dbDisconnect function.
dbDisconnect(con)
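In a longer script or a Shiny app, it can also help to tie the disconnect to the enclosing function so it happens even if something errors partway through. A minimal sketch (the function name query_diamonds is hypothetical):
# on.exit() guarantees the connection closes, error or not
query_diamonds <- function(db_path) {
  con <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(con))
  dbGetQuery(con, "select count(*) as n from diamonds")
}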
Conclusion
The DBI, odbc, and dbplyr packages offer an incredible number of tools for interacting with SQL databases of all kinds. While you can certainly navigate most SQL query problems with only the functions provided in these packages, you can greatly increase your capabilities by learning some basic SQL commands and how to use them in conjunction with the R functions these packages provide.