12/12/2015 - 6:50 AM

Basic statistics in R

# Here I present some R code for basic statistics.
#1 sapply - Statistics for every variable in a data frame
# E.g., you want to calculate the mean of every variable in your data frame.
# We use the pre-loaded iris data set, which includes 4 numeric variables and 1 factor variable ("Species").

  sapply(iris, mean)
  # You will get NA for "Species" along with a warning, because "Species" is a categorical variable (a factor) and has no mean. To avoid this, drop that column:
  sapply(iris[,-5], mean)
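# The index -5 works only because "Species" happens to be the fifth column. A minimal sketch that picks the numeric columns by type instead (the name "num_cols" is my own, for illustration only):

```r
# Flag the numeric columns of iris, then average only those
num_cols <- sapply(iris, is.numeric)
sapply(iris[, num_cols], mean)
```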
#2 tapply, aggregate - Statistics for 1 variable across one or more grouping factors
# In this case, you want to calculate the mean of Sepal.Length for each Species. It is quite similar to the "bys Species: su Sepal.Length, detail" command in Stata.

  tapply(iris$Sepal.Length, iris$Species, mean)
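# Stata's "detail" option reports more than the mean. As a partial analogue (a sketch, not an exact equivalent), you can pass "summary" instead of "mean" to get the minimum, quartiles, mean and maximum for each group. The name "by_species" is my own:

```r
# One summary() result per level of Species, returned as a list-like array
by_species <- tapply(iris$Sepal.Length, iris$Species, summary)
by_species
```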
# Sometimes you need more than 1 factor, such as when you want to calculate the median of returns ("ret") for every firm ("gvkey") in every year ("year").
# In this case, pass the factors in a "list":

  tapply(data$ret, list(data$gvkey, data$year), median)
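  # Note that "data" above stands for your own data frame, so that line will not run as-is. As a runnable sketch of the same pattern, here is the median of breaks by wool and tension in the pre-loaded warpbreaks data set (the name "med" is my own):

```r
# Rows are the levels of wool, columns are the levels of tension
med <- tapply(warpbreaks$breaks, list(warpbreaks$wool, warpbreaks$tension), median)
med
```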
  # The result here looks like a cross-tab. If you want a data frame with one row per group instead, use "aggregate":
  aggregate(Sepal.Width~Species, iris, mean)
  aggregate(breaks~wool+tension, warpbreaks, median) # warpbreaks is a pre-loaded data set; breaks is a numeric variable, wool and tension are categorical variables
# Finally, sometimes you want to repeat this analysis for more than 1 numeric variable. In that case, use "cbind":
  aggregate(cbind(Sepal.Width, Sepal.Length)~Species, iris, mean)
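# If you want every remaining column summarised, the formula shorthand "." saves typing: on the left-hand side it stands for all columns not named on the right. A minimal sketch (the name "agg" is my own):

```r
# Mean of all four numeric columns of iris, one row per Species
agg <- aggregate(. ~ Species, iris, mean)
agg
```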
#3 Summary statistics
#3.1 Categorical variables
# The first thing you should think of is the frequency table:
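# For example, a one-way frequency table of Species in iris, as a minimal sketch (the variable and the name "freq" are my own choices); "prop.table" turns the counts into proportions:

```r
# Counts per level of Species
freq <- table(iris$Species)
freq
# Same table as shares of the total
prop.table(freq)
```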

# If you want a table for 2 categorical variables (a cross-tab):
  table(warpbreaks$wool, warpbreaks$tension)
#Similarly, you can create contingency tables for three or more variables:
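# As a sketch, here is a three-way table from the pre-loaded mtcars data set (my choice of example; cyl, gear and am are treated as categorical here). "ftable" flattens the result into a single printed table:

```r
# Cylinders x gears x transmission type
tab3 <- table(mtcars$cyl, mtcars$gear, mtcars$am)
tab3
# Flattened view of the same counts
ftable(tab3)
```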