nievergeltlab of Simulations
7/11/2017 - 9:19 PM

Evaluate the efficiency of meta analysis with small study sizes and rare variants

###Simulate whether a meta-analysis of many small samples yields valid test statistics

nsim=100
ncases=5000
ncontrols=ncases
set.seed(18) #call the function; 'set.seed=18' only created a variable and left the RNG unseeded
nsubs=ncases+ncontrols
case_af <- .1
control_af <- .07
setsize=50

 library(metafor)
 
#Sample from the same population such that cases are enumerated in batches of 50
#until we get 10000 cases. Assuming 5% of the population is cases, we need a population of 200k people. Maybe this should just be done in plink? See http://cnsgenomics.com/software/gcta/Simu.html

#Save the meta analysis test stat in the first column, the whole data regression analysis in the second
 results <- matrix(ncol=2,nrow=nsim)
 
 for (sim in 1:nsim)
 {
 
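 #Rows alternate case/control, so every block of setsize rows is half cases and half controls (relies on ncases == ncontrols)
 genotype <- rbinom(nsubs,2,prob=rep(c(case_af,control_af),ncases))
 
 #V1 is case status (1 = case, 0 = control), in the same alternating order as genotype
 dat <- data.frame(V1=rep(c(1,0),ncases),genotype=genotype)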
 

 #Split into arbitrary sets, do association analysis within each set
 pres <- ncases/setsize
 stat <- matrix(nrow=pres*2,ncol=2)
 
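 for (i in 1:(2*pres))
 {

  ds <- dat[((i-1)*setsize + 1):(i*setsize),]
  coefs <- summary(glm(V1~ genotype,family="binomial",data=ds))$coefficients
  #If genotype is constant within a set, glm drops its coefficient and row 2 does not exist.
  #Such a set carries no information: record beta 0 with infinite SE, i.e. zero inverse-variance weight
  if (nrow(coefs) >= 2) stat[i,] <- coefs[2,1:2] else stat[i,] <- c(0,Inf)
 }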
 #Fixed-effect (inverse-variance weighted) meta-analysis across the sets
 weight <- 1/stat[,2]^2
 meta_beta <- sum(weight*stat[,1])/sum(weight)
 sem <- sqrt(1/sum(weight))

 zscore <- meta_beta/sem

 #results[sim,1] <- rma(yi=stat[,1],sei=stat[,2],method="FE")$zval
 results[sim,1] <- zscore
 results[sim,2] <- summary(glm(V1~genotype,data=dat,family="binomial"))$coefficients[2,3]
}

 #Test whether or not the analyses produce different test stats
median(results[,1]/results[,2])
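
#A possible follow-up, sketched under an assumption the original does not state:
#if "efficiency" is read as the ratio of squared test statistics (meta vs whole-data),
#the z-score ratio can be turned into a relative efficiency.
#rel_efficiency() is a hypothetical helper name. It only assumes a two-column matrix
#laid out like results (meta-analysis z in column 1, whole-data z in column 2).
rel_efficiency <- function(res)
{
 #Drop simulations where either statistic is missing or the denominator is zero
 ok <- is.finite(res[,1]) & is.finite(res[,2]) & res[,2] != 0
 ratio <- res[ok,1]/res[ok,2]
 c(median_z_ratio=median(ratio), median_rel_eff=median(ratio^2), n_used=sum(ok))
}
#Usage: rel_efficiency(results)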


 #When the AF is common, having 50 samples per study results in roughly a 5% loss in efficiency. This is a worst-case scenario, as most studies have more than 50 samples
 #However, at low MAF, inestimable logistic estimates become common, e.g. with AF of 10% and 7% in cases and controls respectively, 3% of results had huge SEs
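
 #The share of small studies with unusable logistic estimates can be tallied directly.
 #A minimal sketch, not part of the original analysis: count_unstable() is a hypothetical
 #helper, and the SE cutoff of 10 used to call an estimate "huge" is an arbitrary assumption.
 #set_n must be even so that each set is half cases and half controls.
count_unstable <- function(n_sets=200, set_n=50, af_case=.1, af_control=.07, se_cutoff=10)
{
 flagged <- logical(n_sets)
 status <- rep(c(1,0), set_n/2)
 for (i in seq_len(n_sets))
 {
  geno <- rbinom(set_n, 2, prob=rep(c(af_case,af_control), set_n/2))
  #Separation warnings are expected here, so they are silenced
  coefs <- summary(suppressWarnings(glm(status ~ geno, family="binomial")))$coefficients
  #Flag sets where the genotype coefficient was dropped (constant genotype) or its SE exceeds the cutoff
  flagged[i] <- nrow(coefs) < 2 || coefs[2,2] > se_cutoff
 }
 c(n_flagged=sum(flagged), prop_flagged=mean(flagged))
}
 #Usage: set.seed(18); count_unstable()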
 

#Conclusion: for common variation the efficiency loss is small. At lower MAFs, efficiency drops off exponentially because sparse cell counts lead to mis-estimation
#The realistic loss of efficiency could be an interpolated number between the losses at n=500 and n=50
#With AF 10%, approx 1/25 of results will have at least one study with a mis-estimated parameter
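
#The inverse-variance z score computed by hand in the simulation loop can be wrapped in a
#function and checked in isolation. A sketch only: fe_zscore() is a hypothetical helper
#and the betas/SEs below are made-up illustration values, not simulation output.
fe_zscore <- function(beta, se)
{
 w <- 1/se^2                    #inverse-variance weights
 est <- sum(w*beta)/sum(w)      #pooled fixed-effect estimate
 est/sqrt(1/sum(w))             #estimate divided by its standard error
}
toy_beta <- c(0.40, 0.25, 0.55)
toy_se <- c(0.20, 0.30, 0.25)
fe_zscore(toy_beta, toy_se)
#If metafor is installed, this should agree with its fixed-effect model:
#rma(yi=toy_beta, sei=toy_se, method="FE")$zval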