Extending etl

The etl package provides a framework for working with medium data. These data are typically:

  • downloadable via publicly-accessible URLs
  • larger than what would comfortably fit in a laptop’s memory (e.g. several gigabytes)
  • ideally stored in a SQL-based relational database management system

The etl package provides a series of S3 generics and convenience functions that facilitate consistent development of etl-style packages. Current examples include:

  • macleish: Weather and spatial data from the MacLeish Field Station in Whately, MA
  • nyctaxi: Trip data from the New York City Taxi and Limousine Commission
  • airlines: On-time flight arrival data from the Bureau of Transportation Statistics
  • citibike: Municipal bike-sharing system in New York City
  • nyc311: Phone calls to New York City’s feedback hotline
  • fec: Campaign contribution data from the Federal Election Commission
  • imdb: Mirror of the Internet Movie Database

Please see these packages for examples of how to develop an etl-dependent package. This vignette documents how these extensions can be crafted.

The basics

Suppose that you want to write the etl-dependent package foo. In order for foo to work, it should implement at least one of these methods (and sometimes all three), sketched below:

  1. etl_extract.etl_foo() - downloads data from the Internet
  2. etl_transform.etl_foo() - transforms the downloaded data into CSVs suitable for import into SQL
  3. etl_load.etl_foo() - imports the data into a SQL database
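
To make this concrete, here is a minimal sketch of what foo might provide. The method names follow the convention above, but everything else (the URL, the file names, the cleaning step) is purely hypothetical:

etl_extract.etl_foo <- function(obj, ...) {
  # download the raw data into the object's raw directory
  raw <- file.path(attr(obj, "raw_dir"), "foo.csv")
  download.file("https://example.com/foo.csv", destfile = raw)
  invisible(obj)
}

etl_transform.etl_foo <- function(obj, ...) {
  # tidy the raw file and write a clean CSV to the load directory
  foo <- read.csv(file.path(attr(obj, "raw_dir"), "foo.csv"))
  foo <- foo[!is.na(foo$id), ]   # hypothetical cleaning step
  write.csv(foo, file.path(attr(obj, "load_dir"), "foo.csv"), row.names = FALSE)
  invisible(obj)
}

etl_load.etl_foo <- function(obj, ...) {
  # import the clean CSV into the database connection
  foo <- read.csv(file.path(attr(obj, "load_dir"), "foo.csv"))
  DBI::dbWriteTable(obj$con, "foo", foo, overwrite = TRUE)
  invisible(obj)
}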

That’s basically it. The rest of the machinery should be taken care of for you. In particular, etl_extract(), etl_transform(), and etl_load() are all generic functions that have sensible default methods:

  1. etl_extract.default() - pulls all of the data.frames available through the data() function for foo, and writes them as CSVs
  2. etl_transform.default() - simply copies all of the CSVs to the load directory
  3. etl_load.default() - imports all of those CSVs into a SQL database

Note that you don’t have to write a foo method for the etl() instantiation function itself. You do, however, have to have the foo package installed in order for etl() to work.

library(etl)
foo <- etl("foo")
## Error in etl.default("foo"): Please make sure that the 'foo' package is installed

To see the default methods in action, pick a package with some data and import it.

ggplots <- etl("ggplot2") %>%
  etl_update()
## No database was specified so I created one for you at:
## /tmp/RtmpztTdHN/file181f3859b447.sqlite3
## Loading 11 file(s) into the database...
src_tbls(ggplots)
##  [1] "diamonds"       "economics"      "economics_long" "faithfuld"     
##  [5] "luv_colours"    "midwest"        "mpg"            "msleep"        
##  [9] "presidential"   "seals"          "txhousing"

The details

Main etl methods

Each of the three main etl methods must take an etl_foo object as its first argument, and should (invisibly) return an etl_foo object. These methods are pipeable and predictable, but not pure, since they by design have side effects (e.g., downloading files). Your major task in writing the foo package will be to write these functions. How you write them is entirely up to you, and the particular implementation will of course depend on the purpose of foo.

All three of the main etl methods should take the same set of arguments. Most commonly, these define the span of time for the files that you want to extract, transform, or load. For example, in the airlines package, these functions take optional year and month arguments.
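
If foo’s data were organized by month, its signatures might look like this (a hypothetical sketch; the argument names and defaults are illustrative):

etl_extract.etl_foo <- function(obj, years = 2024, months = 1:12, ...) {
  # fetch only the files corresponding to the requested years and months
  invisible(obj)
}
# etl_transform.etl_foo() and etl_load.etl_foo() would take the same arguments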

We illustrate with cities, whose methods unfortunately take only obj and the catch-all ... argument. Also, etl_cities uses etl_load.default(), so there is no etl:::etl_load.etl_cities() method.

etl:::etl_extract.etl_cities %>% args()
## function (obj, ...) 
## NULL
etl:::etl_transform.etl_cities %>% args()
## function (obj, ...) 
## NULL

Other etl methods

There are four additional functions in the etl toolchain:

  1. etl_init() - initialize the database
  2. etl_cleanup() - delete unnecessary files
  3. etl_update() - run etl_extract(), etl_transform(), and etl_load() in succession with the same arguments
  4. etl_create() - run etl_init(), etl_update(), and etl_cleanup() in succession

These functions can generally be used without modification and thus are not commonly extended by foo.
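
For example, since etl_create() chains these steps together, the following two pipelines are equivalent ways of building the cities database from scratch (messages omitted):

cities <- etl("cities") %>%
  etl_create()

# ...which is shorthand for:
cities <- etl("cities") %>%
  etl_init() %>%
  etl_update() %>%
  etl_cleanup()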

The etl_init() function will initialize the SQL database.

If you want to contribute your own hard-coded SQL initialization script, it must be placed in inst/sql/. The etl_init() function will look there, and find files whose file extensions match the database type. For example, scripts written for MySQL should have the .mysql file extension, while scripts written for PostgreSQL should have the .postgresql file extension.
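
For example, a foo package that supports both backends might ship two scripts; the file names below are illustrative, and etl_init() chooses between them based on the class of the database connection:

# hypothetical layout inside the foo source package:
#   inst/sql/init.mysql       - run when the connection is to MySQL
#   inst/sql/init.postgresql  - run when the connection is to PostgreSQL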

If no such file exists, all of the tables and views in the database will be deleted, and new table schemas will be created on-the-fly by dplyr.

etl_foo object attributes

Every etl_foo object has a directory where it can store files and a DBIConnection where it can write to a database. By default, these come from tempdir() and RSQLite::SQLite(), but the user can alternatively specify other locations.
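
For example, to keep the files and the database somewhere permanent, you might supply both explicitly. This is a sketch: the paths are illustrative, and we assume that the db argument accepts a DBI connection.

library(DBI)
# a persistent SQLite file instead of the tempdir()/RSQLite defaults
db <- dbConnect(RSQLite::SQLite(), path.expand("~/data/cities.sqlite3"))
cities <- etl("cities", db = db, dir = path.expand("~/data/cities"))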

cities <- etl("cities")
## No database was specified so I created one for you at:
## /tmp/RtmpztTdHN/file181f6a67c6dd.sqlite3
str(cities)
## List of 2
##  $ con  :Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
##   .. ..@ ptr                :<externalptr> 
##   .. ..@ dbname             : chr "/tmp/RtmpztTdHN/file181f6a67c6dd.sqlite3"
##   .. ..@ loadable.extensions: logi TRUE
##   .. ..@ flags              : int 70
##   .. ..@ vfs                : chr ""
##   .. ..@ ref                :<environment: 0x55f4c31ea740> 
##   .. ..@ bigint             : chr "integer64"
##   .. ..@ extended_types     : logi FALSE
##  $ disco: NULL
##  - attr(*, "class")= chr [1:6] "etl_cities" "etl" "src_SQLiteConnection" "src_dbi" ...
##  - attr(*, "pkg")= chr "etl"
##  - attr(*, "dir")= chr "/tmp/RtmpztTdHN"
##  - attr(*, "raw_dir")= chr "/tmp/RtmpztTdHN/raw"
##  - attr(*, "load_dir")= chr "/tmp/RtmpztTdHN/load"

Note that an etl_foo object is also a src_dbi object and a src_sql object. Please see the dbplyr vignette for more information about these database connections.
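
In practice, this means that dplyr verbs work on it directly. For example, once the ggplots database from above has been built (output omitted):

library(dplyr)
ggplots %>%
  tbl("mpg") %>%
  group_by(class) %>%
  summarize(n = n())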

References

citation("etl")
## To cite package 'etl' in publications use:
## 
##   Baumer B (2019). "A Grammar for Reproducible and Painless
##   Extract-Transform-Load Operations on Medium Data." _Journal of
##   Computational and Graphical Statistics_, *28*(2), 256-264.
##   doi:10.1080/10618600.2018.1512867
##   <https://doi.org/10.1080/10618600.2018.1512867>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data},
##     author = {Benjamin S. Baumer},
##     journal = {Journal of Computational and Graphical Statistics},
##     year = {2019},
##     volume = {28},
##     number = {2},
##     pages = {256--264},
##     doi = {10.1080/10618600.2018.1512867},
##   }
citation("dplyr")
## To cite package 'dplyr' in publications use:
## 
##   Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
##   Grammar of Data Manipulation_. R package version 1.1.4,
##   https://github.com/tidyverse/dplyr, <https://dplyr.tidyverse.org>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {dplyr: A Grammar of Data Manipulation},
##     author = {Hadley Wickham and Romain François and Lionel Henry and Kirill Müller and Davis Vaughan},
##     year = {2023},
##     note = {R package version 1.1.4, https://github.com/tidyverse/dplyr},
##     url = {https://dplyr.tidyverse.org},
##   }
citation("dbplyr")
## To cite package 'dbplyr' in publications use:
## 
##   Wickham H, Girlich M, Ruiz E (2024). _dbplyr: A 'dplyr' Back End for
##   Databases_. R package version 2.5.0,
##   https://github.com/tidyverse/dbplyr, <https://dbplyr.tidyverse.org/>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {dbplyr: A 'dplyr' Back End for Databases},
##     author = {Hadley Wickham and Maximilian Girlich and Edgar Ruiz},
##     year = {2024},
##     note = {R package version 2.5.0, https://github.com/tidyverse/dbplyr},
##     url = {https://dbplyr.tidyverse.org/},
##   }