---
title: "Using etl"
author: "Ben Baumer"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using etl}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

`etl` is an R package to facilitate [Extract - Transform - Load (ETL)](https://en.wikipedia.org/wiki/Extract,_transform,_load) operations for **medium data**. The end result is generally a populated SQL database, but the user interaction takes place solely within R.

## Using `etl`

Instantiate an `etl` object using a string that determines the class of the resulting object, and the package that provides access to that data. The trivial `mtcars` database is built into `etl`.

```{r, warning=FALSE}
library(etl)
cars <- etl("mtcars")
class(cars)
```

Pay careful attention to where the SQLite database is stored. The default location is a temporary directory, but you will want to move this to a more secure location if you want this storage to be persistent. See `file.copy()` for examples on how to move a file. 

### Connect to a local or remote database

`etl` works with a local or remote database to store your data. Every `etl` object extends a `dplyr::src_dbi` object. If, as in the example above, you do not specify a SQL source, a local `RSQLite` database will be created for you. However, you can also specify any source that inherits from `dplyr::src_dbi`.

> Note: If you want to use a database other than a local RSQLite, you must create the `mtcars` database and have permission to write to it first!

```{r, eval=FALSE}
# For PostgreSQL
library(RPostgreSQL)
db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")

# Alternatively, for MySQL
library(RMySQL)
db <- src_mysql(dbname = "mtcars", user = "r-user", password = "mypass", host = "localhost")
cars <- etl("mtcars", db)
```

At the heart of `etl` are three functions: `etl_extract()`, `etl_transform()`, and `etl_load()`. 

### Extract

The first step is to acquire data from an online source. 

```{r}
cars %>%
  etl_extract()
```

This creates a local store of raw data. 

### Transform

These data may need to be transformed from their raw form to files suitable for importing into SQL (usually CSVs).

```{r}
cars %>%
  etl_transform()
```

### Load

Populate the SQL database with the transformed data. 

```{r}
cars %>%
  etl_load()
```

### Do it all at once

To populate the whole database from scratch, use `etl_create`. 

```{r, eval=FALSE}
cars %>%
  etl_create()
```

You can also update an existing database without re-initializing, but watch out for primary key collisions.

```{r, eval=FALSE}
cars %>%
  etl_update()
```

### Step-by-step

Under the hood, there are three functions that `etl_update` chains together:

```{r}
getS3method("etl_update", "default")
```

`etl_create` is simply a call to `etl_update` that forces the SQL database to be written from scratch. 

```{r}
getS3method("etl_create", "default")
```


### Do Your Analysis

Now that your database is populated, you can work with it as a `src` data table just like any other `dplyr` source. 
```{r}
cars %>%
  tbl("mtcars") %>%
  group_by(cyl) %>%
  summarise(N = n(), mean_mpg = mean(mpg))
```

## Extending `etl`

### Create your own ETL packages

Suppose you want to create your own ETL package called `pkgname`. All you have to do is write a package that requires `etl`, and then you have to write **one S3 methods**:

```{r, eval=FALSE}
etl_extract.etl_pkgname()
```

You may also wish to write

```{r, eval=FALSE}
etl_transform.etl_pkgname()
etl_load.etl_pkgname()
```

All of these functions must take and return an object of class `etl_pkgname` that inherits from `etl`. Please see the "[Extending etl](extending_etl.html)" vignette for more information. 

### Use other ETL packages

Packages that use the `etl` framework are available on CRAN and/or GitHub:

```{r}
tools::dependsOnPkgs("etl")
```