Info Request Study in Google Analytics in R
## About the post
Just like in the [previous entry](http://brocktibert.com/emnerdery/2013/11/07/use-r-to-analyze-google-analytics/), we will be using `R` to access our school's Google Analytics data through their API.
In this post, I want to highlight how we can figure out *when* a visitor to our website completes a goal on our site. In my case, I am interested in learning more about how, and when, prospective students (and/or parents) complete our information request form.
This could be any goal on your site, but our recruit pool data tend to confirm that self-initiated actions are strong predictors of interest. This is why I tend to emphasize these actions over "soft-interest" conversions like a simple click-through on a random email.
Before we begin, I assume that you are relatively familiar with Google Analytics, what data are available, and that you have goals set up for your website. In my case, we told Google that one of our "goals" was the completion page of the web request form.
I won't talk about why goals are *massively* awesome things to have set up in GA, but if this concept is new to you, check out [this link](http://www.hongkiat.com/blog/google-analytics-goals-funnels-tips/) for an overview.
## Setup
In the context of `R`, I am going to make one assumption. If you have been playing around with the `rga` package, you probably have figured out that it's really helpful to save our connection object for later sessions. This prevents us from having to authenticate each time we want data.
For help on the package, look [here](https://github.com/skardhamar/rga).
After firing up `R`, let's set up our environment and reconnect to the API for our undergraduate account. Below, I am using the `where` argument to reference the `uga.rga` file in my current directory. This file contains my saved credentials.
```{r comment=NA, message=FALSE}
## load the R package we use to access Google Analytics
library(rga)
## not ideal, but a setting that we need to apply if using Windows
options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL",
"cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
## the token for GA
rga.open(instance="ga", where="uga.rga")
```
## Who Converts? New or Returning Visitors?
Now that we are connected to the API, we can start to have some fun.
Before going too crazy, let's answer the basic question of *who*. Simply, of the people that convert, are they New or Returning visitors? We are going to count the visits by New and Returning visitors from January through November 2013.
```{r echo=TRUE, eval=FALSE, comment=NA}
start.date = "2013-01-01"
end.date = "2013-11-30"
DIM = 'ga:visitorType'
MET = 'ga:visits'
## get the data
type = ga$getData('ga:XXXXXXXX',
start.date,
end.date,
walk = TRUE,
metrics = MET,
dimensions = DIM,
sort = "",
filters = "",
segment = "dynamic::ga:goal1Completions>=1",
start = 1,
max = 10000)
```
```{r echo=FALSE, eval=TRUE, comment=NA, cache=TRUE, results='hide', message=FALSE}
start.date = "2013-01-01"
end.date = "2013-11-30"
DIM = 'ga:visitorType'
MET = 'ga:visits'
## get the data
type = ga$getData('ga:34840136',
start.date,
end.date,
walk = TRUE,
metrics = MET,
dimensions = DIM,
sort = "",
filters = "",
segment = "dynamic::ga:goal1Completions>=1",
start = 1,
max = 10000)
## just in case, remove missing data
type = type[complete.cases(type), ]
```
One thing real quick. I want to point out how we can define segments "on-the-fly" in the API. If you use the web reporting tool for GA, you can define **Advanced Segments**. These segments allow you to put your traffic into buckets. While you can access these using the API as well, we can also generate them programmatically by using `dynamic::`. This feature is pretty helpful in my opinion.
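If you find yourself writing several of these segment strings, a tiny helper can keep them consistent. The sketch below is only an illustration: `goal_segment()` is a hypothetical function of my own, not part of the `rga` package, and it simply pastes together the same `dynamic::ga:goalXCompletions>=N` string used in the query above.

```r
## hypothetical helper (not part of rga): build a dynamic segment string
## for visits with at least `min_completions` completions of goal `goal_id`
goal_segment = function(goal_id, min_completions = 1) {
  sprintf("dynamic::ga:goal%dCompletions>=%d",
          as.integer(goal_id),
          as.integer(min_completions))
}

## the segment used in this post
seg1 = goal_segment(1)
seg1

## the same idea for a second goal, assuming one is defined in your profile
seg2 = goal_segment(2)
seg2
```

The resulting string can be passed to the `segment` argument exactly as the hard-coded version is.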
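Also, we were able to avoid sampled data by using the `walk` argument above. Because `walk` splits the request into one query per day, the results come back with a row per day for each visitor type, which means that we now have to aggregate the data by `visitorType`.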
```{r comment=NA}
type_summary = aggregate(visits ~ visitorType, data=type, FUN="sum")
type_summary$pct = type_summary$visits / sum(type_summary$visits)
type_summary
```
After printing out the data, we can see that about 61% of our information request form conversions were from New Visitors between January and November 2013.
## How long is the conversion cycle?
Now let's dig a bit deeper and try to answer the question of *when* they convert. In this case, I am defining *when* as the number of visits it takes for someone to complete the form. These data will be pulled into a data frame called `basic`.
```{r echo=TRUE, eval=FALSE, comment=NA, results='hide'}
## use http://ga-dev-tools.appspot.com/explorer/ to explore query strings
start.date = "2013-01-01"
end.date = "2013-11-30"
DIM = 'ga:date,ga:visitCount'
MET = 'ga:visits'
## get the data
basic = ga$getData('ga:XXXXXXXX',
start.date,
end.date,
walk = TRUE,
metrics = MET,
dimensions = DIM,
sort = "",
filters = "",
segment = "dynamic::ga:goal1Completions>=1",
start = 1,
max = 10000)
```
```{r echo=FALSE, comment=NA, cache=TRUE, results='hide', message=FALSE}
start.date = "2013-01-01"
end.date = "2013-11-30"
DIM = 'ga:date,ga:visitCount'
MET = 'ga:visits'
## get the data
basic = ga$getData('ga:34840136',
start.date,
end.date,
walk = TRUE,
metrics = MET,
dimensions = DIM,
sort = "",
filters = "",
segment = "dynamic::ga:goal1Completions>=1",
start = 1,
max = 10000)
## just in case, remove missing data
basic = basic[complete.cases(basic), ]
## convert visitCount to a number
basic$visitCount = as.numeric(basic$visitCount)
```
First, we should take a peek at what we pulled down to ensure that our dataset looks as expected.
```{r comment=NA}
class(basic); dim(basic); head(basic);
```
At a very high level, how many visits does it take to convert a suspect?
```{r comment=NA}
round(mean(basic$visitCount), 2)
```
We see that our info request conversions typically take between 5 and 6 visits.
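One caveat on that number: each row of `basic` is a date and `visitCount` combination with its own `visits` total, so a plain `mean()` treats every row equally no matter how many visits it represents. The sketch below uses a small made-up data frame (the `toy` values are invented purely for illustration, not taken from our account) to show how a visit-weighted mean and median can differ from the row-level mean.

```r
## made-up data: 8 conversions on visit 1, 1 on visit 2, 1 on visit 10
toy = data.frame(visitCount = c(1, 2, 10),
                 visits     = c(8, 1, 1))

## row-level mean: every row counts once
plain_mean = mean(toy$visitCount)

## visit-weighted mean: every visit counts once
weighted_mean = weighted.mean(toy$visitCount, w = toy$visits)

## visit-weighted median: expand each row by its visit total
weighted_median = median(rep(toy$visitCount, times = toy$visits))

round(c(plain = plain_mean, weighted = weighted_mean, median = weighted_median), 2)
```

The same two calls could be pointed at `basic$visitCount` and `basic$visits` if you want a per-visit view of the real data.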
But wait, didn't we just point out that 61% of our conversions were from New Visitors? Because averages are easily influenced by extreme values, we should visualize the distribution.
```{r comment=NA}
hist(basic$visitCount,
main="Distribution of Visits required to Convert",
xlab="# Visits",
col="red",
breaks=100)
```
Now things are starting to make sense. We have some very large values. Let's standardize the data and remove these outliers.
```{r comment=NA}
## copy our data
basic2 = basic
## create a new variable that is the standardized value
## (scale() returns a one-column matrix, so convert it to a plain numeric vector)
basic2$z = as.numeric(scale(basic2$visitCount))
## keep only scaled values +/- 3 (in reality, only "+" values exist)
basic2 = subset(basic2, z >= -3 & z <= 3)
## re-plot the distribution
hist(basic2$visitCount,
main="Distribution of Visits required to Convert",
xlab="# Visits",
col="red",
breaks=100)
```
After removing very large values, our distribution starts to take shape. The chart confirms that the large majority are new visitors, but we can see that there are a decent number of conversions that happen well after the first visit.
To me, these are the lurkers that we should attempt to learn more about in the future.
Now, I am curious as to how many visits it takes after the first visit. Below, I am going to group (or bin) the data.
```{r comment=NA, message=FALSE}
## cut our data into bands. (0,1] = 1 visit, (1,2] = 2 visits, (7,14] = 8-14 visits
basic2 = transform(basic2, bins = cut(visitCount, breaks= c(0:7, 14, 21, 100)))
## put our data into a summary table using the plyr package
library(plyr)
visit_summary = ddply(basic2, .(bins), summarise, visits = sum(visits))
visit_summary = transform(visit_summary, pct_total = round(visits / sum(visits), 3))
visit_summary
```
We can see that the large majority of visitors will go on to request information within the first 3 visits to our site. I know that this is a stretch, but to me this suggests that we only have about 3 chances to influence lurkers, or those that are window shopping our institution.
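As a side note on the binning step, it can help to see how `cut()` labels a few sample values before trusting the summary table. This is a minimal sketch with invented visit counts (`example_counts` is not real data); it also shows that anything above the last break comes back as `NA`, and that using `Inf` as the final break is one way to keep those values, if that is what you want.

```r
## same breaks as above
breaks = c(0:7, 14, 21, 100)

## invented visit counts, including one beyond the last break
example_counts = c(1, 2, 8, 14, 15, 150)

## intervals are right-closed by default, so 14 lands in (7,14]
bins = cut(example_counts, breaks = breaks)
data.frame(visitCount = example_counts, bin = bins)

## alternative: an open-ended top bin so large counts are not dropped
bins_open = cut(example_counts, breaks = c(0:7, 14, 21, Inf))
data.frame(visitCount = example_counts, bin = bins_open)
```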
Just because I can't help myself, one last cut of the data. I am going to manually classify our data into New/Returning visitors and explore if the Month impacts *who* converts.
```{r comment=NA, message=FALSE, out.width="100px"}
## month() comes from the lubridate package
library(lubridate)
## extract the month from our date variable (which is stored as a date)
basic2 = transform(basic2, month = month(date, label=TRUE))
## manually classify visits as New/Returning
basic2 = transform(basic2, visit_type = ifelse(visitCount == 1, "New", "Returning"))
## summarize the data before we plot it
basic2_summ = ddply(basic2, .(month, visit_type), summarise, visits = sum(visits))
## plot the distributions for each month using the ggplot2 plotting library
library(ggplot2)
ggplot(basic2_summ, aes(x=month, y=visits, fill=factor(visit_type))) +
geom_bar(position="fill", stat="identity")
```
Visually, I am not sure there is a strong pattern in our data. However, there might be some evidence to suggest that our conversions increasingly come from New Visits during the fall months; senior year if you are looking at this at the undergraduate level.
## Summary
Above, I ran through some quick code to determine the number of visits it takes before a suspect will request more information from our institution. In addition, we were able to figure out if our conversions are coming from New or Returning visitors.
Stepping back, you could have used the web reporting interface to answer a few of the questions above, but where is the fun in that?
All kidding aside, this is only a fraction of what we could have done. For example, we could have isolated conversions with a `visitCount > 1` and then studied how the traffic came to our site. In addition, we could also explore whether we have longer conversion cycles based on visitor geography, or even evaluate the conversion impact of mobile devices.
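To make the first of those ideas a little more concrete, here is a rough sketch of the summary step. It assumes you have re-run the query with a traffic dimension such as `ga:medium` added to `DIM`; the `toy` data frame below is invented so the code runs on its own, and the column names are my guess at what the result would look like.

```r
## invented data standing in for a query that also includes ga:medium
toy = data.frame(
  medium     = c("organic", "organic", "email", "referral", "email"),
  visitCount = c(1, 3, 2, 5, 1),
  visits     = c(10, 4, 3, 2, 6)
)

## keep only conversions that happened after the first visit
returning = subset(toy, visitCount > 1)

## total the visits by medium and compute each medium's share
by_medium = aggregate(visits ~ medium, data = returning, FUN = sum)
by_medium$pct = by_medium$visits / sum(by_medium$visits)
by_medium
```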