---
title: "Set up automatic tasks"
author: "Carlos Varela Martinn"
date: "August 21, 2017"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Objective: easily set up a DigitalOcean droplet running a Docker container with RStudio Server, plus a number of useful automated tasks.
Run this document to install all the relevant dependencies and set up the following tasks:
- Scrape the web for APX prices (unique key: datetime; run a few times around midday)
- Scrape the web for imbalance prices (unique key: datetime; run every hour)
- Update the weather forecast for the last six days plus tomorrow (the previous table is erased; run every 20 minutes)
- Gather data from Smappees (unique key: unique_id; run every hour)
All of this is accomplished through MongoDB and AWS S3.
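The "datetime as unique" notes above can be read as: only rows whose key is not stored yet get inserted. A minimal base R sketch of that idea follows, assuming the stored keys have already been fetched into a vector. `new_rows_only` and the sample values are hypothetical and not part of the original scripts.

```r
# Hypothetical helper: keep only the scraped rows whose key is not stored yet.
new_rows_only <- function(scraped, stored_keys, key = "datetime") {
  scraped[!(scraped[[key]] %in% stored_keys), , drop = FALSE]
}

# Made-up sample data, for illustration only
stored_keys <- c("2017-08-21 10:00", "2017-08-21 11:00")
scraped <- data.frame(
  datetime = c("2017-08-21 11:00", "2017-08-21 12:00"),
  price = c(31.5, 29.8),
  stringsAsFactors = FALSE
)

to_insert <- new_rows_only(scraped, stored_keys)
print(to_insert) # only the 12:00 row remains
```

Filtering before the insert keeps the job safe to re-run: a second run over the same scraped page adds nothing.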
# Set up with AWS
I could not create the droplet from Windows, because the SSH keys did not work well there.
Instead, I deploy from Amazon Web Services:
1. Have an RStudio instance set up and running on Amazon
2. Log in to DigitalOcean and update the SSH key if necessary (check that you can connect with `analogsea::droplets()`)
3. Run the code below (note that the size of the droplet can be edited):
```{r deploy r docker, eval=FALSE, include=FALSE}
# install.packages(c("analogsea", "plumber"))
library(analogsea)
library(plumber)
# Starts an RStudio session with root privileges
# The rocker images used here come with the tidyverse installed, which is very convenient
# Heavier rocker images come with even more installed, such as TeX.
# https://hub.docker.com/u/rocker/
# https://hub.docker.com/r/rocker/verse/
docklet_rstudio_root <-
function (droplet,
port = "8787",
img = "rocker/verse", # The heaviest image, with TeX and publishing packages
dir = "",
ssh_user = "root")
{
droplet <- as.droplet(droplet)
docklet_pull(droplet, img, ssh_user)
docklet_run(droplet,
cmd = sprintf(" -d -p %s:8787 -e ROOT=TRUE %s", port, img))
print(sprintf("Done; RStudio is listening on port %s", port))
}
# Run new droplet:
dropletest <- docklet_create(size = getOption("do_size", "512mb"), region = "ams2")
Sys.sleep(30) # The port 22 has to be open first, apparently
docklet_rstudio_root(dropletest)
```
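The fixed `Sys.sleep(30)` above could be replaced by polling until the SSH port answers. This is a rough base R sketch, not something the original setup does. `wait_until` and `port_is_open` are hypothetical helpers, and the address in the usage comment is a placeholder.

```r
# Hypothetical helper: call `check()` until it returns TRUE or tries run out.
wait_until <- function(check, max_tries = 30, pause = 2) {
  for (i in seq_len(max_tries)) {
    if (isTRUE(check())) return(TRUE)
    Sys.sleep(pause)
  }
  FALSE
}

# Hypothetical probe: TRUE if a TCP connection to host:port can be opened.
port_is_open <- function(host, port = 22) {
  tryCatch({
    con <- suppressWarnings(
      socketConnection(host, port, open = "r+", blocking = TRUE, timeout = 2)
    )
    close(con)
    TRUE
  }, error = function(e) FALSE)
}

# Usage with a placeholder address (not a real droplet):
# wait_until(function() port_is_open("203.0.113.10", 22))
```

Passing the probe in as a function keeps the retry loop independent of the network, so it can be reused for other readiness checks.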
Once the process has finished, RStudio Server will be available on port 8787 of the droplet's public IP.
# Work in DigitalOcean droplet
First, update the password. To do so, open the shell and run `sudo passwd rstudio`.
Add files: upload the zip file containing the R scripts and the JavaScript files.
To check which packages are already installed:
```{r, eval=FALSE, include=FALSE}
installed.packages()[,1]
```
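To see only what is still missing instead of scanning the full list, the installed names can be compared against the packages this setup needs. In this sketch `missing_packages` is a hypothetical helper, and `required` holds a subset of the packages installed later in this document.

```r
# A subset of the packages installed later in this document
required <- c("selectr", "mongolite", "jsonlite", "aws.s3", "webshot",
              "rvest", "httr", "cronR")

# Hypothetical helper: which of `required` are absent from `installed`?
missing_packages <- function(required,
                             installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Still missing: ", paste(to_install, collapse = ", "))
}
```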
Install dependencies for packages and cron:
```{r, eval=FALSE, include=FALSE}
system("sudo apt-get update")
system("sudo apt-get install -y apt-utils")
system("sudo apt-get install -y libsasl2-dev")
system("sudo apt-get install -y libxml2-dev")
system("sudo apt-get install -y bzip2")
system("sudo apt-get install -y cron")
system("sudo apt-get install -y nano") # useful to access cron files
```
Install packages
```{r, eval=FALSE, include=FALSE}
install.packages("selectr") # needs to be updated for web scraping to work
install.packages("mongolite")
install.packages("jsonlite")
install.packages("aws.s3")
install.packages("webshot")
install.packages("rvest")
install.packages("htmlTable")
install.packages("htmltab")
install.packages("httr")
install.packages("randomForest")
install.packages("e1071")
install.packages("rpart")
install.packages("glmnet")
install.packages("gbm")
install.packages("xts")
install.packages("dygraphs")
install.packages("caTools")
install.packages("mailR")
install.packages("cronR")
```
Install phantomjs and other utilities
```{r, eval=FALSE, include=FALSE}
webshot::install_phantomjs()
system("sudo apt-get install python-pip libxslt1-dev -y")
system("sudo pip install premailer")
```
This is apparently the easiest way. The binaries are stored in the `bin` folder.
Now the scripts should be able to run directly.
Set up cron jobs:
For this task, I use a modified version of the `cron_add` function in the `cronR` package. The original function does not accept the traditional cron syntax, which I think is a big mistake: it is not possible, for instance, to run scripts every five minutes.
The improved function takes the schedule as a standard five-field cron expression (minute, hour, day of month, month, day of week) and looks like this:
```{r}
library(cronR)
cron_add_improved <- function (command, frequency = "0-59 * * * *", id, tags = "", description = "", dry_run = FALSE,
user = "")
{
crontab <- tryCatch(parse_crontab(user = user), error = function(e) {
return(character())
})
call <- match.call()
digested <- FALSE
if (missing(id)) {
digested <- TRUE
id <- digest(call)
}
if (length(crontab) && length(crontab$cronR)) {
if (id %in% sapply(crontab$cronR, "[[", "id")) {
if (digested) {
warning("This id was auto-generated by 'digest'; it is likely that ",
"you attempted to submit an identical job.")
}
stop("Can't add this job: a job with id '", id, "' already exists.")
}
}
call_str <- paste(collapse = "", gsub(" +$", "", capture.output(call)))
job <- list(frequency = NULL, command = NULL)
job[["frequency"]] <- frequency
job[["command"]] <- command
if (is.null(frequency) || is.null(command)) # is.null(job) is always FALSE for a list
stop("NULL commands in 'job!' Job is: ", paste(job, collapse = " ",
sep = " "))
description <- unlist(strsplit(wrap(description), "\n"))
if (length(description) > 1) {
description[2:length(description)] <- paste0("## ",
description[2:length(description)])
}
description <- paste(description, collapse = "\n")
header <- paste(sep = "\n", collapse = "\n", "## cronR job",
paste0("## id: ", id), paste0("## tags: ", paste(tags,
collapse = ", ")), paste0("## desc: ", description))
job_str <- paste(sep = "\n", collapse = "\n", header, paste(job,
collapse = " ", sep = " "))
message("Adding cronjob:\n", "---------------\n\n", job_str)
if (!dry_run) {
old_crontab <- suppressWarnings(system("crontab -l",
intern = TRUE, ignore.stderr = TRUE))
old_crontab[old_crontab == " "] <- ""
if (length(old_crontab)) {
new_crontab <- paste(sep = "\n", paste(old_crontab,
collapse = "\n"), paste0(job_str, "\n"))
}
else {
new_crontab <- paste0(job_str, "\n")
}
tempfile <- tempfile()
on.exit(unlink(tempfile))
cat(new_crontab, "\n", file = tempfile)
system(paste("crontab", tempfile))
}
return(invisible(job))
}
environment(cron_add_improved) <- asNamespace('cronR')
```
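Because `cron_add_improved` passes `frequency` straight into the crontab, a typo in the schedule goes unnoticed until the job fails to run. A small sanity check before adding a job might look like the sketch below. `parse_cron_fields` is a hypothetical helper, not part of `cronR`, and it only handles the numeric syntax (no `MON` or `@hourly` style names).

```r
# Hypothetical helper: split a cron schedule into its five named fields,
# stopping early if the field count or the characters look wrong.
parse_cron_fields <- function(frequency) {
  fields <- strsplit(trimws(frequency), "[[:space:]]+")[[1]]
  if (length(fields) != 5) {
    stop("Expected 5 cron fields but got ", length(fields), ": '", frequency, "'")
  }
  bad <- !grepl("^[0-9*/,-]+$", fields)
  if (any(bad)) {
    stop("Unsupported characters in field(s): ", paste(fields[bad], collapse = " "))
  }
  setNames(fields, c("minute", "hour", "day_of_month", "month", "day_of_week"))
}

parse_cron_fields("*/5 * * * *")      # every five minutes
parse_cron_fields("0,10 11,12 * * *") # 11:00, 11:10, 12:00 and 12:10
```

The check is purely syntactic: it does not verify that values fall in valid ranges, so `99 * * * *` would still pass.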
Register the scripts as cron jobs and start cron:
```{r, eval=FALSE, include=FALSE}
cron_add_improved(command = cron_rscript("imbalance_etl.R"),
frequency = '30 * * * *',
id = 'imbalance',
description = "imbalance prices from the TenneT website, hourly at minute 30")
cron_add_improved(command = cron_rscript("smappee_etl.R"),
frequency = '2 * * * *',
id = 'smappee',
description = "smappee data from each household, hourly, shortly after the hour")
cron_add_improved(command = cron_rscript("weather_etl.R"),
frequency = '*/20 * * * *',
id = 'weather',
description = "relevant weather data in Borneokade, last 6 days and tomorrow")
cron_add_improved(command = cron_rscript("apx_etl.R"),
frequency = '0,10 11,12 * * *',
id = 'apx',
description = "APX prices from the official site, through phantomjs")
cron_add_improved(command = cron_rscript("spanbroek_etl.R"),
frequency = '*/5 * * * *',
id = 'spanbroek',
description = "data from spanbroek installation retrieved through kropman")
system("sudo cron start") # Initialize cron!
cron_ls()
```