Basic statistics in R
#Here I present some code related to ggplot2 from Hadley.
#1 sapply - Statistics for every variable in a data frame
# E.g., you want to calculate the mean of every variable in your data frame.
# Using the pre-loaded iris data set, which includes 4 numeric variables and 1 factor variable ("Species")
sapply(iris, mean)
#You will get a warning, and NA for "Species", because it is a categorical variable (a factor) and has no mean. To avoid this, drop that column:
sapply(iris[,-5], mean)
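#A more general sketch (not part of the original recipe): instead of dropping column 5 by position,
#select the numeric columns by type, which also works when factor columns sit elsewhere.
#"num_cols" is just an illustrative name.
num_cols <- sapply(iris, is.numeric)   # logical vector, TRUE for each numeric column
sapply(iris[, num_cols], mean)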
#2 tapply, aggregate - Statistics for 1 variable across one or more grouping factors
#In this case, you want to calculate the mean of Sepal.Length for each Species. It is quite similar to the "bys Species: su Sepal.Length, detail" command in Stata
tapply(iris$Sepal.Length, iris$Species, mean)
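#To get closer to Stata's "detail" output, one option is to pass "summary" instead of "mean".
#tapply then returns a list with one summary (min, quartiles, mean, max) per species.
#"sepal_summary" is just an illustrative name.
sepal_summary <- tapply(iris$Sepal.Length, iris$Species, summary)
sepal_summary[["setosa"]]   # summary for a single species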
#Sometimes you need more than 1 factor, such as when you want to calculate the median of returns ("ret") for every firm ("gvkey") in every year ("year").
#In this case, wrap the factors in the "list" function ("data" here stands for your own data frame):
tapply(data$ret, list(data$gvkey, data$year), median)
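#A minimal made-up example of the same call. The "firms" data frame and all of its values are
#invented purely for illustration; only the column names follow the text above.
firms <- data.frame(
  gvkey = c(1001, 1001, 1001, 1002, 1002, 1002),
  year  = c(2019, 2019, 2020, 2019, 2020, 2020),
  ret   = c(0.10, 0.30, -0.05, 0.02, 0.04, 0.08)
)
firm_medians <- tapply(firms$ret, list(firms$gvkey, firms$year), median)
firm_medians   # a matrix: one row per firm, one column per year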
#The result here is laid out like a cross-tab. If you would rather have a data frame with one row per group, use "aggregate":
aggregate(Sepal.Width~Species, iris, mean)
aggregate(breaks~wool+tension, warpbreaks, median) #warpbreaks is a pre-loaded data set; breaks is a numeric variable, wool and tension are categorical variables
#Finally, sometimes you want to repeat this analysis for more than 1 numeric variable. In that case, use "cbind":
aggregate(cbind(Sepal.Width, Sepal.Length)~Species, iris, mean)
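#If you want all of the remaining variables at once, a shorter sketch is the "." shorthand on the
#left-hand side of the formula, which stands for every column not used on the right-hand side.
#"species_means" is just an illustrative name.
species_means <- aggregate(. ~ Species, iris, mean)
species_means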
#3 Summary statistics
#3.1 Categorical variables
# The first thing you should think of is the frequency table:
table(iris$Species)
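# Proportions are sometimes easier to read than raw counts. One way to get them is to wrap the
# frequency table in prop.table() ("species_share" is just an illustrative name):
species_share <- prop.table(table(iris$Species))
species_share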
# If you want a table for 2 categorical variables (a cross-tab):
table(warpbreaks$wool, warpbreaks$tension)
#Similarly, you can create contingency tables for three or more variables: