cvmartin
11/8/2017 - 10:52 PM

Set up cronjobs

Including improved function to add Cronjobs

---
title: "Set up automatic tasks"
author: "Carlos Varela Martinn"
date: "August 21, 2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Objective: easily set up a droplet running a Docker container with RStudio Server and a number of useful automated tasks.

Run the document to install all the relevant dependencies and set up the following tasks:

- Scrape the web for APX prices (datetime as unique key, repeated a few times around midday)
- Scrape the web for imbalance prices (datetime as unique key, every hour)
- Update the weather forecast for the last six days plus tomorrow (erasing the previous table, every 20 min)
- Gather data from Smappees (unique_id as unique key, every hour)

All of this is accomplished through MongoDB and AWS S3.
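The task list marks `datetime` as the unique key for the scraped prices, which suggests each run should skip rows that are already stored. The ETL scripts are not shown here, so the following is only a minimal base R sketch of that idea; `keep_new_rows`, the column names, and the sample data are hypothetical, and in practice the stored keys would come from the database.

```r
# Hypothetical helper: drop rows whose key is duplicated or already stored.
keep_new_rows <- function(scraped, stored_keys, key = "datetime") {
  # Remove duplicates within the freshly scraped batch
  scraped <- scraped[!duplicated(scraped[[key]]), , drop = FALSE]
  # Remove rows whose key is already present in the store
  scraped[!(scraped[[key]] %in% stored_keys), , drop = FALSE]
}

# Made-up sample data, for illustration only
scraped <- data.frame(
  datetime = c("2017-08-21 11:00", "2017-08-21 12:00", "2017-08-21 12:00"),
  price = c(31.2, 33.5, 33.5),
  stringsAsFactors = FALSE
)
stored_keys <- "2017-08-21 11:00"

new_rows <- keep_new_rows(scraped, stored_keys)
print(new_rows)
```

Only the rows returned by the helper would then be inserted, so rerunning a scraper several times around midday does not create duplicates.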


# Set up with AWS

Creating the droplet from Windows turned out not to be possible (at least for me), because the SSH keys do not work well there.

Instead, I deploy from Amazon Web Services:

1. Have an RStudio instance set up and running on Amazon.
2. Start a session in DigitalOcean and update the SSH key if necessary (check whether you can connect with `analogsea::droplets()`).
3. Run the code below (note that the size of the droplet can be edited):

```{r deploy r docker, eval=FALSE, include=FALSE}
# install.packages(c("analogsea", "plumber"))

library(analogsea)
library(plumber)

# Generates an RStudio session with root privileges
# This image comes with the tidyverse installed, which is very convenient
# Heavier rocker images come with even more installed, like TeX.
# https://hub.docker.com/u/rocker/
# https://hub.docker.com/r/rocker/verse/
docklet_rstudio_root <-
  function (droplet,
            port = "8787",
            img = "rocker/verse", # The heaviest image, with TeX and publishing packages
            dir = "",
            ssh_user = "root")
  {
    droplet <- as.droplet(droplet)
    docklet_pull(droplet, img, ssh_user)
    docklet_run(droplet,
                cmd = sprintf(" -d -p %s:8787 -e ROOT=TRUE %s", port, img))
    print(sprintf("Done; Rstudio is listening in port %s", port))
  }

# Run new droplet:
dropletest <- docklet_create(size = getOption("do_size", "512mb"), region = "ams2")
Sys.sleep(30) # The port 22 has to be open first, apparently
docklet_rstudio_root(dropletest)


```
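The fixed `Sys.sleep(30)` above is a guess at how long the droplet needs before port 22 accepts connections. As an alternative, here is a minimal base R sketch that polls the port instead of sleeping for a fixed time; `wait_for_port` and its default values are hypothetical and not part of `analogsea`.

```r
# Hypothetical helper: poll a TCP port until it accepts a connection
# or the timeout (in seconds) expires. Returns TRUE or FALSE.
wait_for_port <- function(host, port = 22, timeout = 120, interval = 5) {
  deadline <- Sys.time() + timeout
  repeat {
    # socketConnection() warns and then errors when the port is closed
    con <- suppressWarnings(tryCatch(
      socketConnection(host, port = port, open = "r+",
                       blocking = TRUE, timeout = 3),
      error = function(e) NULL
    ))
    if (!is.null(con)) {
      close(con)
      return(TRUE)
    }
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}
```

It would be called with the droplet's public IP in place of the fixed sleep, and the Docker setup would only proceed when it returns `TRUE`.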

Once the process has finished, RStudio Server will be available on port 8787 of the droplet's public IP.

# Work in DigitalOcean droplet

First, update the password. To do so, open the shell and run `sudo passwd rstudio`.

Add files: upload the zip file containing the R scripts and the JavaScript files.

To check the packages installed already:
```{r, eval=FALSE, include=FALSE}
installed.packages()[,1]
```

Install dependencies for packages and cron:
```{r, eval=FALSE, include=FALSE}
system("sudo apt-get update")
system("sudo apt-get install -y apt-utils")
system("sudo apt-get install -y libsasl2-dev")
system("sudo apt-get install -y libxml2-dev")
system("sudo apt-get install -y bzip2")
system("sudo apt-get install -y cron")
system("sudo apt-get install -y nano") # useful to access cron files
```
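`system()` returns the exit status of the command, which the calls above discard, so a failed install can go unnoticed. Below is a minimal sketch of a stricter wrapper, assuming a Unix-like shell; `apt_install_command` and `run_checked` are hypothetical names introduced for illustration.

```r
# Hypothetical helper: build one apt-get command for several packages.
apt_install_command <- function(pkgs) {
  paste("sudo apt-get install -y", paste(shQuote(pkgs), collapse = " "))
}

# Hypothetical helper: run a shell command and stop on a non-zero exit status.
run_checked <- function(cmd) {
  status <- system(cmd)
  if (status != 0) {
    stop("Command failed with status ", status, ": ", cmd)
  }
  invisible(status)
}

# Show the command that would be run (nothing is installed here)
cat(apt_install_command(c("libsasl2-dev", "libxml2-dev", "cron")), "\n")
```

Passing the resulting string to `run_checked()` would then abort the setup as soon as one install fails.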

Install packages
```{r, eval=FALSE, include=FALSE}
install.packages("selectr") # need to update it to use web scraping
install.packages("mongolite")
install.packages("jsonlite")
install.packages("aws.s3")
install.packages("webshot")
install.packages("rvest")
install.packages("htmlTable")
install.packages("htmltab")
install.packages("httr")
install.packages("randomForest")
install.packages("e1071")
install.packages("rpart")
install.packages("glmnet")
install.packages("gbm")
install.packages("xts")
install.packages("dygraphs")
install.packages("caTools")
install.packages("mailR")
install.packages("cronR")


```
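Since the image already ships with many packages, it may be worth installing only what is missing. A minimal sketch, assuming the same package list as above; `missing_pkgs` is a hypothetical helper.

```r
# Packages required by the scripts (same list as above)
needed <- c("selectr", "mongolite", "jsonlite", "aws.s3", "webshot", "rvest",
            "htmlTable", "htmltab", "httr", "randomForest", "e1071", "rpart",
            "glmnet", "gbm", "xts", "dygraphs", "caTools", "mailR", "cronR")

# Hypothetical helper: return the packages that are not installed yet.
missing_pkgs <- function(needed, installed = rownames(installed.packages())) {
  setdiff(needed, installed)
}

# To install only the missing ones, one could then run:
# install.packages(missing_pkgs(needed))
```

Note that this skips packages that are installed but outdated, so a package that needs updating (such as `selectr` above) would still have to be reinstalled explicitly.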

Install phantomjs and other utilities
```{r, eval=FALSE, include=FALSE}
webshot::install_phantomjs()

system("sudo apt-get install python-pip libxslt1-dev -y")
system("sudo pip install premailer")
```
This is the easiest way, apparently. The binaries are stored in the `bin` folder.
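Before relying on the binary, it may help to confirm that it can actually be found. The sketch below uses base R only; `find_binary` is a hypothetical helper, and `~/bin` as the install location is an assumption that should be adjusted to wherever the binaries ended up.

```r
# Hypothetical helper: look for an executable on the PATH first,
# then in a list of extra directories. Returns NA if nothing is found.
find_binary <- function(name, extra_dirs = "~/bin") {
  on_path <- unname(Sys.which(name))
  if (nzchar(on_path)) return(on_path)
  candidates <- file.path(path.expand(extra_dirs), name)
  found <- candidates[file.exists(candidates)]
  if (length(found) > 0) found[1] else NA_character_
}

phantom_path <- find_binary("phantomjs")
if (is.na(phantom_path)) {
  message("phantomjs not found; scripts that rely on it would fail")
} else {
  message("phantomjs found at ", phantom_path)
}
```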

Now the scripts should be able to run directly.

Set up cron jobs:
For this task, I use a modified version of the `cron_add` function in the `cronR` package. The original function (in the version used here) does not accept the traditional cron schedule syntax, which I think is a big mistake. It is not possible, for instance, to run scripts every five minutes.
The improved function looks like this:

```{r}
library(cronR)

cron_add_improved <- function (command, frequency = "0-59 * * * *", id, tags = "", description = "", dry_run = FALSE, 
                               user = "") 
{
  crontab <- tryCatch(parse_crontab(user = user), error = function(e) {
    return(character())
  })
  call <- match.call()
  digested <- FALSE
  if (missing(id)) {
    digested <- TRUE
    id <- digest(call)
  }
  if (length(crontab) && length(crontab$cronR)) {
    if (id %in% sapply(crontab$cronR, "[[", "id")) {
      if (digested) {
        warning("This id was auto-generated by 'digest'; it is likely that ", 
                "you attempted to submit an identical job.")
      }
      stop("Can't add this job: a job with id '", id, "' already exists.")
    }
  }
  call_str <- paste(collapse = "", gsub(" +$", "", capture.output(call)))
  job <- list(frequency = NULL, command = NULL)
  job[["frequency"]] <- frequency
  job[["command"]] <- command
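  # is.null(job) on the list itself is always FALSE, so check each element;
  # assigning NULL via [[<- drops the element, hence the length check
  if (length(job) != 2 || any(vapply(job, is.null, logical(1)))) 
    stop("NULL commands in 'job'! Job is: ", paste(job, collapse = " ", 
                                                   sep = " "))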
  description <- unlist(strsplit(wrap(description), "\n"))
  if (length(description) > 1) {
    description[2:length(description)] <- paste0("##   ", 
                                                 description[2:length(description)])
  }
  description <- paste(description, collapse = "\n")
  header <- paste(sep = "\n", collapse = "\n", "## cronR job", 
                  paste0("## id:   ", id), paste0("## tags: ", paste(tags, 
                                                                     collapse = ", ")), paste0("## desc: ", description))
  job_str <- paste(sep = "\n", collapse = "\n", header, paste(job, 
                                                                collapse = " ", sep = " "))
  message("Adding cronjob:\n", "---------------\n\n", job_str)
  if (!dry_run) {
    old_crontab <- suppressWarnings(system("crontab -l", 
                                           intern = TRUE, ignore.stderr = TRUE))
    old_crontab[old_crontab == " "] <- ""
    if (length(old_crontab)) {
      new_crontab <- paste(sep = "\n", paste(old_crontab, 
                                              collapse = "\n"), paste0(job_str, "\n"))
    }
    else {
      new_crontab <- paste0(job_str, "\n")
    }
    tempfile <- tempfile()
    on.exit(unlink(tempfile))
    cat(new_crontab, "\n", file = tempfile)
    system(paste("crontab", tempfile))
  }
  return(invisible(job))
}
environment(cron_add_improved) <- asNamespace('cronR')
```
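Because `cron_add_improved` writes `frequency` into the crontab without checking it, a typo only shows up when cron rejects or misreads the entry. A minimal sketch of a sanity check follows; `is_cron_schedule` is a hypothetical helper that only accepts the numeric five-field form used in this document, so names such as `MON` or shortcuts such as `@hourly` are rejected, and value ranges are not verified.

```r
# Hypothetical check: exactly five whitespace-separated fields,
# each made only of digits, "*", ",", "-" and "/".
is_cron_schedule <- function(frequency) {
  if (!is.character(frequency) || length(frequency) != 1) return(FALSE)
  fields <- strsplit(trimws(frequency), "[[:space:]]+")[[1]]
  length(fields) == 5 && all(grepl("^[0-9*,/-]+$", fields))
}

is_cron_schedule("*/5 * * * *")
is_cron_schedule("hourly")
```

Calling `cron_add_improved()` with `dry_run = TRUE` is another way to inspect an entry, since it prints the job without installing it.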

Start cron and schedule the scripts:
```{r, eval=FALSE, include=FALSE}
cron_add_improved(command = cron_rscript("imbalance_etl.R"), 
                  frequency = '30 * * * *', 
                  id = 'imbalance', 
                  description = "imbalance prices from Tennet website, hourly at minute 30")

cron_add_improved(command = cron_rscript("smappee_etl.R"), 
                  frequency = '2 * * * *', 
                  id = 'smappee', 
                  description = "smappee data from each household, hourly, two minutes past the hour")

cron_add_improved(command = cron_rscript("weather_etl.R"), 
                  frequency = '*/20 * * * *', 
                  id = 'weather',
                  description = "relevant weather data in Borneokade, last 6 days and tomorrow")

cron_add_improved(command = cron_rscript("apx_etl.R"), 
                  frequency = '0,10 11,12 * * *', 
                  id = 'apx', 
                  description = "APX prices from the official site, through phantomjs")

cron_add_improved(command = cron_rscript("spanbroek_etl.R"), 
                  frequency = '*/5 * * * *', 
                  id = 'spanbroek', 
                  description = "data from spanbroek installation retrieved through kropman")


system("sudo cron start") # Initialize cron!

cron_ls()

```
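To double-check which minutes and hours the schedules above select, a single cron field can be expanded into the values it matches. This is a minimal sketch covering only the forms used in this document (`*`, lists, ranges, and steps); `expand_cron_field` is a hypothetical helper, not part of `cronR`.

```r
# Hypothetical helper: expand one cron field into the integer values it matches.
# lo and hi are the bounds of the field (0-59 for minutes, 0-23 for hours).
expand_cron_field <- function(field, lo, hi) {
  parts <- strsplit(field, ",", fixed = TRUE)[[1]]
  values <- unlist(lapply(parts, function(part) {
    step <- 1L
    if (grepl("/", part, fixed = TRUE)) {
      # "range/step" form, e.g. "*/20"
      pieces <- strsplit(part, "/", fixed = TRUE)[[1]]
      part <- pieces[1]
      step <- as.integer(pieces[2])
    }
    if (part == "*") {
      bounds <- c(lo, hi)
    } else if (grepl("-", part, fixed = TRUE)) {
      bounds <- as.integer(strsplit(part, "-", fixed = TRUE)[[1]])
    } else {
      bounds <- rep(as.integer(part), 2)
    }
    seq(bounds[1], bounds[2], by = step)
  }))
  sort(unique(values))
}

expand_cron_field("30", 0L, 59L)     # minute field of the imbalance job
expand_cron_field("*/20", 0L, 59L)   # minute field of the weather job
expand_cron_field("0,10", 0L, 59L)   # minute field of the apx job
expand_cron_field("11,12", 0L, 23L)  # hour field of the apx job
```

The minute field `30` selects a single minute per hour, whereas `*/20` selects minutes 0, 20 and 40, so the first runs hourly and the second every 20 minutes.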