# Google Analytics in R

http://brocktibert.com/emnerdery/2013/11/07/use-r-to-analyze-google-analytics/
I have long wanted to post about the things I am doing with `R` and Google Analytics for my division, so I was very excited to see that tonight's `#emchat` is on web analytics. This post will be more on the technical end, but I want to highlight what's possible beyond the basic point-and-click interface in Google Analytics.
## Why do this post?
Google, as well as other vendors, provides APIs to our web data. Basically, this means we can write queries against the API and get the data back in a usable form, just like we can with any other database on campus. By passing different combinations of parameters (dimensions, metrics, filters), we can ask Google to return pretty detailed information about our web traffic. Pretty sweet, right?
In full disclosure, this won't be the most detailed post, but hopefully it shows you just enough of what's possible when we pull data straight from the API.
One last thing before I dive in. This post assumes that your school is on Google Analytics; I will be using an `R` package written specifically for GA. If you don't use GA, no worries, the `R` community is amazing. Check out [RSiteCatalyst](https://github.com/randyzwitch/RSiteCatalyst) if you use Adobe SiteCatalyst.
## Setup
First, nearly all of my posts will talk about `R` and `Python`. If you are on a Mac, Python is already installed. Just fire up a terminal and type `python`. Voila, you are now ready to code!
`R`, however, is not pre-installed. There are millions of tutorials on how to get up and running. Maybe at some point I will provide a more comprehensive overview of how to turn your computer into a lean, mean, data-crunching machine (for free!), but for the time being, here is my `R` setup.
1. Install R [here](http://cran.us.r-project.org/)
2. Download [RStudio](http://www.rstudio.com/) - an amazing IDE that really eases the process of learning R
## OK, enough blabbing.
In this brief post, I want to highlight the `R` package [rga](https://github.com/skardhamar/rga), which provides a super simple way to connect to your Google Analytics account and to pull data down from the API.
Before you can begin, you will need to install a few packages if you don't already have them. The following code installs the packages you need for this post.
```{r setup, eval=FALSE, echo=T, comment=NA}
install.packages("devtools")
library(devtools)
install_github("rga", "skardhamar")
install.packages("ggplot2")
install.packages("forecast")
```
Now let's load the packages that we will use to crunch our data.
```{r eval=F}
library(rga)
library(ggplot2)
library(forecast)
```
The next code segment is necessary if you are on Windows. In full disclosure, this is not an ideal workaround, but secure connections (i.e., the Analytics API) require a certificate, and Windows doesn't handle this well out of the box.
```{r eval=FALSE, echo=TRUE, comment=NA}
options(RCurlOptions = list(verbose = FALSE,
                            capath = system.file("CurlSSL", "cacert.pem",
                                                 package = "RCurl"),
                            ssl.verifypeer = FALSE))
```
Let's connect to Google.
```{r eval=FALSE, echo=TRUE}
rga.open(instance="ga")
```
This will fire up a browser tab with Google's authorization screen.

Click Accept. The next screen will show you a very long string of characters. This is the token that lets us connect to GA. Copy this string, as we need to paste it back into the `R` console.

Simply paste the string at the prompt.

That's it! I know that might seem like way too many steps, but authenticating with APIs around the web can be a real PITA.
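One quality-of-life tip: `rga` can also cache the authenticated instance to disk so you don't have to repeat the browser dance every session. A minimal sketch, assuming the `where` argument works as described in the package README:
```{r eval=F}
# persist the token to a file; future sessions reuse it automatically
rga.open(instance = "ga", where = "~/ga.rga")
```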
## Finally, the analysis
Take a look at the profiles that you have access to. Pretty cool huh?
```{r eval=F}
ga$getProfiles()
```
Keep note of the profile you want to query. The value you need is found in the `id` column.
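If you would rather grab the id programmatically than eyeball it, here is a minimal sketch, assuming (per the above) that `getProfiles()` returns a data frame with an `id` column:
```{r eval=F}
profiles = ga$getProfiles()
head(profiles)     # browse the profiles you can query
profiles$id[1]     # e.g., take the first profile's id
```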
The next part will appear pretty technical, but Google has created a [Query Explorer](http://ga-dev-tools.appspot.com/explorer/) to help navigate the API. I suggest you go play around with the tool. Above all else, it will help you find the values for the dimensions and metrics that you want to include in your queries.
Below I am going to set some values that will be passed into a function that will query GA.
```{r eval=F}
ids = 'ga:XXXXXXXXXXX'
start.date = "2013-01-01"
end.date = "2013-10-31"
DIM = 'ga:date'
MET = 'ga:pageviews,ga:uniquePageviews,ga:entrances,ga:exits,ga:bounces,ga:timeOnPage'
```
The value for `ids` is my GA profile id. `start.date` and `end.date` define the timeframe for our query. `DIM` defines the dimensions, which are simply the rows of our query. The columns, or metrics in Google-speak, are defined in the variable `MET`.
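To give you a flavor of other combinations, here are a couple of alternative dimension and metric strings. I am recalling these names from the GA API reference, so verify them in the Query Explorer before relying on them:
```{r eval=F}
# daily traffic broken out by source and medium
DIM.source = 'ga:date,ga:source,ga:medium'
# pageviews for every page, split by country
DIM.geo = 'ga:pagePath,ga:country'
MET.basic = 'ga:pageviews,ga:visits'
```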
Now let's get the data!
```{r eval=F}
stats = ga$getData(ids,
                   start.date,
                   end.date,
                   walk = TRUE,
                   batch = TRUE,
                   metrics = MET,
                   dimensions = DIM,
                   sort = "",
                   filters = "",
                   segment = "")
```
Above, all I am doing is saving the results of our API call to a variable called `stats`. I will only touch on this here, but `stats` is a `data.frame` in `R`. At the end of the day, the structure of a `data.frame` is nearly identical to a dataset you might have in Excel.
I should comment on the data that was just retrieved. I am grabbing daily data for our website from January 1, 2013 to October 31, 2013. I am telling Google to give us daily stats for pageviews, unique pageviews, entrances, exits, bounces, and time on page, which is returned in seconds.
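Before plotting, it never hurts to sanity-check what came back. The `head()` call also appears in the full script at the end of this post; `str()` and `summary()` are standard base `R` inspection tools:
```{r eval=F}
head(stats)               # first few rows: one row per day
str(stats)                # column types for the date and the six metrics
summary(stats$pageviews)  # quick distribution check on pageviews
```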
Below is a (time series) plot of my daily pageviews, along with the `ggplot2` code that generates it. It's pretty easy to see that there are patterns in the data.
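This snippet is lifted from the full script at the end of the post:
```{r eval=F}
g = ggplot(stats, aes(x = date, y = pageviews)) + geom_line()
g + ggtitle("Daily Pageviews January 2013 through October 2013")
```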

I am sure that the group will talk about this tonight, but this type of chart is nothing special and can easily be found in GA.
However, the benefit of R is that we can now create a forecast from this data. We can't do this in Google Analytics.
```{r eval=F}
fit = auto.arima(stats$pageviews)
plot(forecast(fit), axes=F)
```

The chart above leverages a built-in function of the `forecast` package called `auto.arima`. It uses an algorithm that attempts to fit the best ARIMA model to our daily pageviews. The forecast is in blue, and the shaded bands show the confidence intervals for these estimates.
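If you want the numbers behind the chart rather than just the picture, here is a minimal sketch; the explicit 30-day horizon (roughly November) is my choice, since `forecast()` will pick a default horizon if you omit `h`:
```{r eval=F}
fc = forecast(fit, h = 30)  # forecast the next 30 days
head(fc$mean)   # point forecasts of daily pageviews
head(fc$lower)  # lower bounds of the 80% and 95% intervals
```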
## Who cares?
Why should we care about forecasting? Let's say your team is about to launch a redesign of some, maybe all, of the pages on your site. You might want to forecast the pageviews for your site over the next month. When the data starts pouring in, you can compare the actual pageview stats to the predicted values in order to see how well your new site is performing. If you are constantly beating estimates, you ***probably*** can conclude that the redesign worked!
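As a hedged sketch of that comparison, where `actual_nov` is a hypothetical placeholder for the observed November pageviews you would pull from GA once the month closes:
```{r eval=F}
fc = forecast(fit, h = 30)
# hypothetical placeholder: in practice this comes from a fresh GA query
actual_nov = as.numeric(fc$mean) * 1.05
pct_diff = (actual_nov - as.numeric(fc$mean)) / as.numeric(fc$mean) * 100
summary(pct_diff)  # consistently positive means you beat the forecast
```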
By leveraging `R`, we were able to pull down data from GA and not only replicate the reports found online, but also add value by forecasting what we think performance should be.
### Next Steps
Stepping back, here are some other reasons why you might want to consider using the API.
1. Depending on how much traffic your site gets, Google might return sampled data. Normally this isn't the end of the world, as the story of your data won't change. However, we usually can get around it with the `walk` and `batch` parameters shown above: `walk` splits our request into a series of daily queries, and `batch` collects the results in pages. Google has a limit of 10,000 rows per request, and `batch` intelligently handles the paging for us if our query returns more than 10k rows.
2. You can answer some really deep questions that are impossible to answer using the online reporting, at least easily. For example, you can pull down the stats for each page on your website by geography. If you wanted to, you could rank the pages by key geographic markets and evaluate how the top pages perform across these markets. Is the rank order the same? Are some markets more likely to view your Financial Aid page? Are other markets more interested in campus life?
3. By leveraging the API, you can store the data you get in a database. I love [Tableau](http://www.tableausoftware.com/). It's a huge part of my workflow. To build my monthly web reports, I query the data using R, save my queries into MySQL, connect Tableau to the MySQL database, and just refresh my report pages (see the sketch after this list). That's it! A pretty detailed report takes me all of 15 minutes to complete.
4. Speaking of API requests, you should refer to Google's documentation. More than likely, you won't hit the daily API limit, but it's something to keep in mind.
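To make item 3 concrete, here is a minimal sketch of pushing the `stats` data frame into MySQL. I am assuming the `RMySQL` package (any `DBI`-compatible driver would work), and the credentials, database, and table name are placeholders:
```{r eval=F}
library(RMySQL)
# placeholder credentials: swap in your own server details
con = dbConnect(MySQL(), user = "me", password = "secret",
                dbname = "webstats", host = "localhost")
# append this pull to a table that Tableau can point at
dbWriteTable(con, "ga_daily", stats, append = TRUE, row.names = FALSE)
dbDisconnect(con)
```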
## My Crazy Idea
I have long hoped that the `#emchat` community would unite and start to share comparable web data on a monthly basis. The lack of benchmark data makes it extremely difficult to assess website performance. In a perfect world, we would even go as far as tracking things like info request conversion rates. Think about it. What if you could use benchmark data to show that your institution's lead form is underperforming your peers'? Collecting and storing this data doesn't have to be difficult. It can be scripted with `R`!
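As a rough sketch of how simple the sharing piece could be; the monthly roll-up and the file name are just illustrations, and I am assuming `rga` parsed `date` into a proper Date column, as it did for the plot above:
```{r eval=F}
# roll the daily stats up to months and write a file anyone could share
stats$month = format(as.Date(stats$date), "%Y-%m")
monthly = aggregate(pageviews ~ month, data = stats, FUN = sum)
write.csv(monthly, "my-school-monthly-benchmarks.csv", row.names = FALSE)
```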
Hopefully you found this post valuable. As always, please let me know what you think. Happy data nerdery!
## The full script

```{r full-script, eval=FALSE}
###############################################################################
## How we can use R to connect to Google Analytics
###############################################################################

## Setup - make sure you have the rga package installed
# install.packages("devtools")
# library(devtools)
# install_github("rga", "skardhamar")
# install.packages("ggplot2")
# install.packages("forecast")

## Load the packages
library(rga)
library(ggplot2)
library(forecast)

## run the code below if you are on Windows
## not ideal, but it works
options(RCurlOptions = list(verbose = FALSE,
                            capath = system.file("CurlSSL", "cacert.pem",
                                                 package = "RCurl"),
                            ssl.verifypeer = FALSE))

## let's connect to Google Analytics
rga.open(instance = "ga")

## what sort of commands do we have in the rga package?
rga

## list our profiles -- we need the profile ID
ga$getProfiles()
## you want the value in the id column

## let's set up some basic values for the query
## use http://ga-dev-tools.appspot.com/explorer/ to explore query strings
ids = 'ga:34840136'
start.date = "2013-01-01"
end.date = "2013-10-31"
DIM = 'ga:date'
MET = 'ga:pageviews,ga:uniquePageviews,ga:entrances,ga:exits,ga:bounces,ga:timeOnPage'

## get the data
stats = ga$getData(ids,
                   start.date,
                   end.date,
                   walk = TRUE,
                   batch = TRUE,
                   metrics = MET,
                   dimensions = DIM,
                   sort = "",
                   filters = "",
                   segment = "")

## look at the data
head(stats)

## let's plot the pageviews over time
g = ggplot(stats, aes(x = date, y = pageviews)) + geom_line()
g + ggtitle("Daily Pageviews January 2013 through October 2013")

## let's go one step further - forecast daily pageviews for November
fit = auto.arima(stats$pageviews)
plot(forecast(fit), axes = F)
```