diengiau
12/12/2015 - 6:50 AM

## Basic statistics in R


``````
# Here I present some code related to ggplot2 from Hadley.
``````
``````
# 1. sapply: statistics for every variable in a data frame
# E.g., you want to calculate the mean of every variable in your data frame.
# This uses the pre-loaded iris data set, which has 4 numeric variables and 1 factor variable ("Species").

sapply(iris, mean)
# This returns NA for "Species" with a warning, because "Species" is a categorical variable (a factor) and its mean cannot be computed. To avoid this, drop that column:
sapply(iris[, -5], mean)
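
# A variation on the same idea, as a sketch: select the numeric columns by type
# instead of by position, so the call still works if the column order changes.
# (num_cols and numeric_means are names made up for this example.)
num_cols <- sapply(iris, is.numeric)   # logical vector, TRUE for numeric columns
numeric_means <- sapply(iris[, num_cols], mean)
numeric_means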

# 2. tapply, aggregate: statistics for one variable across one or more grouping factors
# In this case, you want to calculate the mean of Sepal.Length for each Species. It is quite similar to the "bys Species: su Sepal.Length, detail" command in Stata.

tapply(iris$Sepal.Length, iris$Species, mean)

# Sometimes you need more than one factor, such as when you want to calculate the median of returns ("ret") for every firm ("gvkey") in every year ("year").
# In this case, pass the factors in a "list":

tapply(data$ret, list(data$gvkey, data$year), median)
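
# "data" above is not a built-in data set. A minimal made-up example to try the call
# (the name "toy" and all gvkey, year and ret values are invented for illustration):
toy <- data.frame(
  gvkey = c(1001, 1001, 1001, 1002, 1002, 1002),
  year  = c(2014, 2014, 2015, 2014, 2015, 2015),
  ret   = c(0.10, 0.30, -0.05, 0.02, 0.04, 0.08)
)
toy_median <- tapply(toy$ret, list(toy$gvkey, toy$year), median)
toy_median   # a matrix with one row per gvkey and one column per year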
# The result here looks like a cross-tab. If you want a different layout, with one row per group, use "aggregate":

aggregate(Sepal.Width~Species, iris, mean)
aggregate(breaks~wool+tension, warpbreaks, median) # warpbreaks is a pre-loaded data set; breaks is a numeric variable, wool and tension are categorical variables

# Finally, sometimes you want to repeat this analysis for more than one numeric variable. In that case, use "cbind":
aggregate(cbind(Sepal.Width, Sepal.Length)~Species, iris, mean)

# 3. Summary statistics
# 3.1 Categorical variables
# The first thing to look at is the frequency table:

table(iris$Species)

# If you want a table for 2 categorical variables (a cross-tab):
table(warpbreaks$wool, warpbreaks$tension)
# Similarly, you can create contingency tables for three or more variables:

``````
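
For three or more categorical variables, one possible sketch is below. It assumes the pre-loaded mtcars data set, which is not used elsewhere in this snippet, and treats cyl, gear and am as categories. The names tab3 and xtab3 are made up for the example; ftable() and xtabs() are base R functions.

```r
# Three-way contingency table: counts of cars by cylinders, gears and transmission.
# mtcars is an assumption here; any data frame with 3+ categorical columns works.
tab3 <- table(mtcars$cyl, mtcars$gear, mtcars$am)
dim(tab3)        # one dimension per variable

# ftable() flattens the array into a single two-dimensional layout.
ftable(tab3)

# xtabs() builds the same counts from a formula and keeps the variable names.
xtab3 <- xtabs(~ cyl + gear + am, data = mtcars)
ftable(xtab3)
```

The plain table() call returns a three-dimensional array that prints as one slice per level of the last variable, so ftable() is usually easier to read on screen.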