etl
is an R package
to facilitate Extract -
Transform - Load (ETL) operations for medium data.
The end result is generally a populated SQL database, but the user
interaction takes place solely within R.
etl
Instantiate an etl
object using a string that determines
the class of the resulting object, and the package that provides access
to that data. The trivial mtcars
database is built into
etl
.
## No database was specified so I created one for you at:
## /tmp/RtmpztTdHN/file181f20cf7c42.sqlite3
## [1] "etl_mtcars" "etl" "src_SQLiteConnection"
## [4] "src_dbi" "src_sql" "src"
Pay careful attention to where the SQLite database is stored. The
default location is a temporary directory, but you will want to move
this to a more secure location if you want this storage to be
persistent. See file.copy()
for examples on how to move a
file.
etl
works with a local or remote database to store your
data. Every etl
object extends a
dplyr::src_dbi
object. If, as in the example above, you do
not specify a SQL source, a local RSQLite
database will be
created for you. However, you can also specify any source that inherits
from dplyr::src_dbi
.
Note: If you want to use a database other than a local RSQLite, you must create the
mtcars
database and have permission to write to it first!
# For PostgreSQL
library(RPostgreSQL)
db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")
# Alternatively, for MySQL
library(RMySQL)
db <- src_mysql(dbname = "mtcars", user = "r-user", password = "mypass", host = "localhost")
cars <- etl("mtcars", db)
At the heart of etl
are three functions:
etl_extract()
, etl_transform()
, and
etl_load()
.
The first step is to acquire data from an online source.
## Extracting raw data...
This creates a local store of raw data.
These data may need to be transformed from their raw form to files suitable for importing into SQL (usually CSVs).
Populate the SQL database with the transformed data.
## Loading 12 file(s) into the database...
To populate the whole database from scratch, use
etl_create
.
You can also update an existing database without re-initializing, but watch out for primary key collisions.
Under the hood, there are three functions that
etl_update
chains together:
## function (obj, ...)
## {
## obj <- obj %>% etl_extract(...) %>% etl_transform(...) %>%
## etl_load(...)
## invisible(obj)
## }
## <bytecode: 0x55f4c8b6d200>
## <environment: namespace:etl>
etl_create
is simply a call to etl_update
that forces the SQL database to be written from scratch.
## function (obj, ...)
## {
## obj <- obj %>% etl_init(...) %>% etl_update(...) %>% etl_cleanup(...)
## invisible(obj)
## }
## <bytecode: 0x55f4cacc6670>
## <environment: namespace:etl>
Now that your database is populated, you can work with it as a
src
data table just like any other dplyr
source.
## Warning: Missing values are always removed in SQL aggregation functions.
## Use `na.rm = TRUE` to silence this warning
## This warning is displayed once every 8 hours.
## # Source: SQL [3 x 3]
## # Database: sqlite 3.46.0 [/tmp/RtmpztTdHN/file181f20cf7c42.sqlite3]
## cyl N mean_mpg
## <int> <int> <dbl>
## 1 4 11 26.7
## 2 6 7 19.7
## 3 8 14 15.1
etl
Suppose you want to create your own ETL package called
pkgname
. All you have to do is write a package that
requires etl
, and then you have to write one S3
methods:
You may also wish to write
All of these functions must take and return an object of class
etl_pkgname
that inherits from etl
. Please see
the “Extending etl” vignette for more
information.