
Filtering 100s of megabytes to a few gigabytes of CSV data: RSQLite, queryBuildR, DT and Shiny.

Dealing with CSV files on the order of hundreds of megabytes to gigabytes is often difficult in R (and even worse in Excel). For example, loading a 1 GB file in R using read.table is likely to take a few minutes (and will likely crash Excel).

In this blog post, we describe an integrated solution for filtering such files with close to real-time results, using RSQLite, queryBuildR, DT and Shiny. The proposed solution is the one we used for building the open TED browsing interface, which allows users to interactively explore a 3.2 GB file containing 1.5 million award notices of European public procurements.

The data are made available in CSV format thanks to the open TED initiative. Let us download, for example, the CSV for contract awards in 2012 (894 MB) and load it in R.

system.time(data <- read.table("ted-contracts-2012.csv", header = TRUE, sep = ","))
user  system elapsed 
213.017   5.070 220.288

This takes about 4 minutes to load on a ‘good’ laptop (16 GB RAM, SSD hard drive). The data set has 432,230 rows and 141 columns.

RSQLite

SQLite is a very effective way to store and explore such data files. The data file (e.g., in CSV format) is stored in a local SQLite file, which you can then query using SQL syntax. In most cases, you actually only need a subset of the data, and SQL comes in quite handy for retrieving exactly the subset you care about. The RSQLite R package makes this straightforward. Let us store this file in a local ‘data.db’ SQLite file.

library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), "data.db")
system.time(dbWriteTable(con, "data", data))
user  system elapsed 
8.339   3.326  12.103

Writing the data to the ‘data.db’ SQLite file takes about 12 seconds.
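As an aside, dbWriteTable can also take a file path as its value argument, importing the CSV straight into SQLite without first building an R data frame. A minimal sketch on a small generated file (the toy file and table names are ours):

```r
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy CSV standing in for the TED file
write.csv(data.frame(a = 1:3, b = c("x", "y", "z")),
          "small.csv", row.names = FALSE)

# Passing a file path imports the CSV directly into the table
dbWriteTable(con, "small", "small.csv")
dbGetQuery(con, "SELECT COUNT(*) AS n FROM small")
```

This avoids keeping two full copies of the data (the R data frame and the SQLite table) in play during import.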

Now, let us use SQL to retrieve the first five entries.

system.time(subset <- dbGetQuery(con, "select * from data limit 5"))
user  system elapsed 
0.001   0.000   0.003

This takes only 3 milliseconds. There are 141 columns, so we do not show the result, but this illustrates how fast SQLite is at retrieving data from such a data set.

If we want to retrieve all entries for which the column ‘contract_operator_town’ is ‘Zaragoza’:

system.time(subset <- dbGetQuery(con, "select * from data where contract_operator_town='Zaragoza'"))
user  system elapsed 
0.280   0.433   0.894 

There are 64 records (out of 432,230) in the subset, and it took less than one second to retrieve them. Note that there is no way read.table could efficiently retrieve this subset of records. Finally, let us look at the time for retrieving the whole data set:

system.time(data <- dbGetQuery(con, "select * from data"))
user  system elapsed 
19.520   2.550  23.022 

All data are retrieved in 23 seconds, i.e., about ten times faster than with read.table. In our Web interface, we limit the results to the first 1,000 rows (which takes less than one second) for the sake of interactivity.
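When the same column is filtered over and over, as with ‘contract_operator_town’ above, adding an index speeds lookups up further. A minimal sketch on an in-memory toy table (the toy data are ours; on the real file, the CREATE INDEX statement is the same):

```r
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "data",
             data.frame(id = 1:5,
                        contract_operator_town = c("Zaragoza", "Paris",
                                                   "Zaragoza", "Berlin",
                                                   "Lyon")))

# The index replaces the full-table scan with a B-tree lookup
dbExecute(con, "CREATE INDEX idx_town ON data (contract_operator_town)")

subset <- dbGetQuery(con,
  "SELECT * FROM data WHERE contract_operator_town = 'Zaragoza'")
nrow(subset)
```

The index costs some disk space and import time, so it is worth adding only for the handful of columns that users actually filter on.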

queryBuildR

jQuery QueryBuilder is a JavaScript module that provides a user-friendly way to create filters and export them as SQL. We made the module available in R, see queryBuildR. Install it with:

library(devtools)
devtools::install_github('Yannael/queryBuildR')
library(queryBuildR)

The query widget is created using the queryBuildR function, which takes the rules and filters as parameters (see the documentation for jQuery QueryBuilder). The function getFiltersFromTable may be used to get a set of default filters.

data[is.na(data)] <- ""
filters <- getFiltersFromTable(data)
queryBuildR("", filters)

DT

DT is an R interface to the DataTables library, and provides a convenient way to display data in a tabular format.

library(DT)
datatable(data[1:1000, c(2, 5, 36)])

Tables with millions of rows can be handled smoothly using server-side processing.
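In a Shiny app, server-side processing is what renderDT uses by default (server = TRUE): only the rows of the current page are sent to the browser. A minimal sketch with a toy data frame of our own:

```r
library(shiny)
library(DT)

# Toy large table standing in for the real data
df <- data.frame(x = rnorm(1e5), y = rnorm(1e5))

ui <- fluidPage(DTOutput("table"))

server <- function(input, output) {
  # server = TRUE (the default): paging, sorting and filtering happen
  # in R, and only the visible page is shipped to the client
  output$table <- renderDT(df, server = TRUE)
}

# shinyApp(ui, server)  # run interactively
```

With server = FALSE the whole data frame would be serialized into the page, which is exactly what breaks down at millions of rows.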

Connecting RSQLite, queryBuildR and DT with Shiny

Both queryBuildR and DT expose Shiny bindings, allowing them to be integrated into a Web interface. We created a Shiny site illustrating how all components can be bound together, using the iris data set as a toy example. The source code is available here on GitHub. Note that the whole code for the user interface, server, and data preprocessing is less than one hundred lines. To use your own data set, simply edit the createDB function in global.R.
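The glue logic boils down to taking the SQL emitted by the query-builder widget, running it against the SQLite file, and handing the result to DT. A sketch of the server side (the input name queryBuilderSQL and the runQuery helper are our own illustrations, not the demo's exact code):

```r
library(shiny)
library(DT)
library(RSQLite)

# Hypothetical helper: fetch at most the first 1000 rows matching a
# WHERE clause produced by the query-builder widget
runQuery <- function(con, whereClause) {
  sql <- "SELECT * FROM data"
  if (!is.null(whereClause) && nchar(whereClause) > 0)
    sql <- paste(sql, "WHERE", whereClause)
  dbGetQuery(con, paste(sql, "LIMIT 1000"))
}

server <- function(input, output) {
  con <- dbConnect(RSQLite::SQLite(), "data.db")
  output$table <- renderDT({
    # input$queryBuilderSQL is an assumed input carrying the SQL
    # generated by the queryBuildR widget
    runQuery(con, input$queryBuilderSQL)
  })
}
```

The LIMIT 1000 mirrors the interactivity constraint discussed above: every filter change triggers a query, so each one must stay well under a second.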

Below is a screenshot of the Shiny demo, where the selected data are the setosa species with sepal width greater than or equal to 4. The interface also displays the SQL query generated by queryBuildR (here “select * from datatable where species = ‘setosa’ AND sepal_width >= 4”).

The code for the open TED browsing interface is available here. The data set used for that interface can be generated by downloading TED CSV files, and running the preprocessing.R script.
