daniel-s
7/24/2015 - 10:18 AM

R Cheatsheet.

# by(data, factorlist, function)
by(pf$friend_count, pf$gender, summary)
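A minimal sketch of what by() does, using a small made-up data frame (toy and its values are assumptions, not the pf data set). tapply() is shown next to it because it performs the same split-and-apply step but returns a plain named vector, which is often easier to reuse.

```r
# Toy data standing in for pf (values are made up for illustration)
toy <- data.frame(
  gender = factor(c("female", "male", "female", "male", "female")),
  friend_count = c(10, 20, 30, 60, 50)
)

# by() splits the first argument by the factor and applies the function to each group
means_by_gender <- by(toy$friend_count, toy$gender, mean)
print(means_by_gender)

# tapply() does the same split-apply step and returns a named vector
means_vec <- tapply(toy$friend_count, toy$gender, mean)
print(means_vec)
```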

# Getting logical: flag users with at least one mobile like (1/0), then compute the percentage flagged
pf$mobile_check_in <- NA
pf$mobile_check_in <- ifelse(pf$mobile_likes > 0, 1, 0)
percent_mobile <- sum(pf$mobile_check_in)/length(pf$mobile_check_in) * 100
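The same percentage can be computed from a logical vector directly, since mean() of TRUE/FALSE values is the share of TRUEs. This is only a sketch on made-up values (mobile_likes here is a hypothetical stand-in for pf$mobile_likes); it also shows that sum()/length() returns NA as soon as one value is missing, while mean(..., na.rm = TRUE) skips the missing entries.

```r
# Hypothetical values standing in for pf$mobile_likes
mobile_likes <- c(0, 3, NA, 7, 0)

# Logical flag: TRUE when the user has at least one mobile like
has_mobile <- mobile_likes > 0

# sum()/length() propagates the NA
naive_percent <- sum(ifelse(mobile_likes > 0, 1, 0)) / length(mobile_likes) * 100

# mean() of a logical vector is the proportion of TRUEs; na.rm drops missing values
percent_mobile <- mean(has_mobile, na.rm = TRUE) * 100
print(percent_mobile)
```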

# Take a sample and analyze it
set.seed(4231)
sample.ids <- sample(levels(yo$id), 16)
# Draws 16 of the household ids in yo$id (levels() requires yo$id to be a factor)

library(ggplot2) # needed for ggplot(); assumes the package is installed
ggplot(aes(x = time, y = price),
       data = subset(yo, id %in% sample.ids)) +
  facet_wrap(~id) +
  geom_line() +
  geom_point(aes(size = all.purchases), pch = 1)
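The sampling and filtering step on its own, without the plot, as a base R sketch. The data frame toy and its columns are made up for illustration; only the pattern (sample the factor levels, then keep rows whose id is %in% the sample) comes from the snippet above.

```r
set.seed(4231)

# Toy data: 4 households with 3 observations each (values are made up)
toy <- data.frame(
  id = factor(rep(c("a", "b", "c", "d"), each = 3)),
  time = rep(1:3, times = 4),
  price = c(60, 62, 65, 58, 59, 61, 70, 68, 69, 55, 57, 56)
)

# Pick 2 of the 4 ids at random (without replacement)
sample_ids <- sample(levels(toy$id), 2)

# Keep every row that belongs to one of the sampled ids
toy_sample <- subset(toy, id %in% sample_ids)

# Drop the unused factor levels so table() shows no empty groups
toy_sample <- droplevels(toy_sample)
table(toy_sample$id)
```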
  
# Scatterplot Matrix
install.packages('GGally')
library(GGally)

set.seed(1836) # Fix the seed so the 1000-row sample drawn below is reproducible
pf_subset <- pf[ , c(2:15)]
names(pf_subset)
ggpairs(pf_subset[sample.int(nrow(pf_subset), 1000), ])
ggpairs(pf_subset[sample.int(nrow(pf_subset), 1000), ], axisLabels = 'internal')
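How the row sampling inside the ggpairs() call works, shown on a small made-up data frame (toy is an assumption, not the pf data). sample.int(n, size) draws size distinct row numbers from 1..n, and indexing with them keeps whole rows together.

```r
set.seed(1836)

# Toy data with a known relation between the columns
toy <- data.frame(x = 1:100, y = (1:100)^2)

# Draw 10 distinct row numbers out of nrow(toy)
row_ids <- sample.int(nrow(toy), 10)

# Index rows with the sampled numbers; all columns are kept
toy_rows <- toy[row_ids, ]
print(toy_rows)
```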

Useful R libraries

  • ggplot2 Visualization library
  • magrittr Provides the pipe operator %>% (RStudio shortcut: Cmd+Shift+M)
  • tidyr & dplyr Data wrangling with R
  • pander Render R objects into Pandoc's markdown
  • ggthemes Themes for ggplot2 library
  • gridExtra Arrange several plots on one page with grid.arrange(p1, p2, ..., ncol = 1)
  • scales Implement scales in a way that is graphics system agnostic

To install a new package and use it:

install.packages('name_of_the_package', dependencies = TRUE)
library(name_of_the_package)
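A possible way to avoid reinstalling a package on every run: check first whether it is available. use_package() is a hypothetical helper written for this note, not a function from any library; the example loads stats, which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then attach it
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
  }
  # character.only = TRUE lets library() take the name from a variable
  library(pkg, character.only = TRUE)
}

use_package("stats")
```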

General

  • getwd() Get Working Directory
  • setwd('~/Downloads') Set Working Directory
  • ls() List variables in the environment
  • dir() List files and directories in the working directory (alias of list.files())
  • list.files() List files and directories in the working directory
  • rm('variable1') Remove variable1 from the environment
  • rm(list = ls()) Remove all variables from the environment
  • identical(data1, data2) Test whether two objects are exactly equal
  • colnames(data) Get column names (also names(df) on data frames)
  • rownames(data) Get row names
  • data(name_dataset) Load data set into Environment
  • Execute script from terminal: Rscript my_script.R

Load Data

  • read.csv('file.csv') Read from CSV to data.frame
  • read.csv('file.tsv', sep = '\t') Read from TSV to data.frame
  • alumni <- read.csv(path_alumni, na.strings = c('-'), colClasses = c('character', 'character', 'numeric', 'numeric'))
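A small sketch of what na.strings and colClasses do in the alumni example. The CSV content and the name alumni_toy are made up; read.csv(text = ...) is used so the example runs without a file on disk.

```r
# Made-up CSV content; "-" marks a missing value
csv_text <- "name,city,age,score
Ana,Lima,31,7.5
Ben,-,28,-
"

alumni_toy <- read.csv(
  text = csv_text,
  na.strings = c("-"),                  # treat "-" as NA in every column
  colClasses = c("character", "character", "numeric", "numeric")
)

# Column types follow colClasses; the "-" cells are NA
str(alumni_toy)
```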

Data Frames

  • subset(df, <condition>) Example: subset(statesInfo, state.region == 1)
  • df[ROWS, COLUMNS]
    • Example: statesInfo[statesInfo$state.region == 1, ]
    • Example2: statesInfo[statesInfo$state.region == 1 & statesInfo$population > 3000, ]
  • nrow(df) Number of rows
  • ncol(df) Number of columns
  • by(data, factorlist, function) Ex: by(pf$friend_count, pf$gender, summary)
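The two subsetting styles above side by side, on a made-up data frame (states_toy and its values are assumptions, not the statesInfo data). One difference worth knowing: when the condition evaluates to NA, subset() drops the row, while bracket indexing returns a row of NAs.

```r
# Toy data standing in for statesInfo (values are made up)
states_toy <- data.frame(
  name = c("A", "B", "C", "D"),
  state.region = c(1, 1, 2, 1),
  population = c(2500, 4100, 5200, 3900)
)

# subset(): column names can be used directly in the condition
via_subset <- subset(states_toy, state.region == 1 & population > 3000)

# df[ROWS, COLUMNS]: the condition must reference the data frame explicitly
via_bracket <- states_toy[states_toy$state.region == 1 &
                            states_toy$population > 3000, ]

print(via_subset)
print(via_bracket)
```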

Data Overview

  • str(data) Structure of the data
  • summary(data) Summary of the data
  • head(data) First rows (6 by default)
  • tail(data) Last rows (6 by default)
  • For factor variables (categoricals)
    • table(variable) Count of observations per level
    • levels(variable) List the levels
    • reddit$age.range <- ordered(reddit$age.range, levels = c('Under 18', '18-24', '25-34', '35-44', '45-54', '55-64', '65 or Above'))
    • reddit$income.range <- factor(reddit$income.range, levels = c("Under $20,000", "$20,000 - $29,999", "$30,000 - $39,999", "$40,000 - $49,999", "$50,000 - $69,999", "$70,000 - $99,999", "$100,000 - $149,999", "$150,000 or more"), ordered = TRUE)
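What the ordered factor buys you, sketched on a short made-up vector (ages and age_levels are assumptions, not the reddit data). Once the levels have an explicit order, table() reports them in that order and comparisons such as < work.

```r
# Made-up age ranges, deliberately out of order
ages <- c("25-34", "Under 18", "18-24", "25-34")
age_levels <- c("Under 18", "18-24", "25-34")

# ordered = TRUE makes the level order meaningful (same effect as ordered())
age_f <- factor(ages, levels = age_levels, ordered = TRUE)

levels(age_f)   # levels in the order given, not alphabetical
table(age_f)    # counts follow the level order

# Comparisons use the level order
younger <- age_f < "25-34"
oldest <- max(age_f)
```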

Update packages

update.packages(ask=FALSE, checkBuilt = TRUE)

Load R script from GitHub gists

library(devtools)
source_gist("524eade46135f6348140", filename = "ggplot_smooth_func.R")