Central Limit Theorem

A sample of n random variables (X1, X2, ..., Xn) is drawn from a population. Assume the random variables are independent and identically distributed (i.i.d.). If the population is not normally distributed, a common rule of thumb is to use a sample size of at least 30 (n ≥ 30) so that the normal approximation is reasonable. The sample average and sum are defined as follows.

x̄ = (X1 + X2 + ... + Xn) / n
T = X1 + X2 + ... + Xn

The Central Limit Theorem states that the sampling distribution of the average (or sum) of a large number of i.i.d. observations is approximately normal, regardless of the shape of the population distribution, provided the population has a finite variance.
Say the population distribution has mean μ and standard deviation σ. Then, for large n, approximately,

x̄ ~ N(μ, σ/√n)
T ~ N(nμ, σ√n)

Here the second parameter of N is the standard deviation, not the variance.
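
The mean and standard deviation of the average and the sum follow from linearity of expectation and independence. The short derivation below is a sketch, assuming the X_i are i.i.d. with mean μ and finite variance σ². It only justifies the two parameters; the normal shape is what the CLT itself supplies.

```latex
\begin{aligned}
E[T] &= \sum_{i=1}^{n} E[X_i] = n\mu, &
\operatorname{Var}(T) &= \sum_{i=1}^{n} \operatorname{Var}(X_i) = n\sigma^2
  \;\Rightarrow\; \operatorname{SD}(T) = \sigma\sqrt{n}, \\
E[\bar{x}] &= \frac{1}{n}\,E[T] = \mu, &
\operatorname{Var}(\bar{x}) &= \frac{1}{n^2}\operatorname{Var}(T) = \frac{\sigma^2}{n}
  \;\Rightarrow\; \operatorname{SD}(\bar{x}) = \frac{\sigma}{\sqrt{n}}.
\end{aligned}
```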

This concept is best visualized. Let’s begin with a population that has a normal distribution.

Population: Normal

pop_norm <- rnorm(10000, mean=10, sd=1) 
hist(pop_norm, main = "Normal Distribution with mu=10", border="darkorange2")

The population mean is μ = 10 and the standard deviation is σ = 1. According to the CLT, if we take samples of size n = 100, the sampling distribution of the average should be x̄ ~ N(μ, σ/√n) = N(10, 0.1), and the sampling distribution of the sum should be T ~ N(nμ, σ√n) = N(1000, 10), where the second parameter is the standard deviation.
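
The arithmetic behind these parameters can be checked with a small helper. This is only a sketch: `clt_params` is a hypothetical function name introduced here for illustration, not something from base R, and it assumes i.i.d. draws with known mean and standard deviation.

```r
# Hypothetical helper (an assumption, not part of base R): theoretical
# parameters for the average and the sum of n i.i.d. draws with mean mu
# and standard deviation sigma.
clt_params <- function(mu, sigma, n) {
  list(
    mean_avg = mu,               # mean of the sample average
    sd_avg   = sigma / sqrt(n),  # SD of the sample average
    mean_sum = n * mu,           # mean of the sample total
    sd_sum   = sigma * sqrt(n)   # SD of the sample total
  )
}

# Normal population used in this section: mu = 10, sigma = 1, n = 100
p <- clt_params(mu = 10, sigma = 1, n = 100)
str(p)
```

The same helper could be reused for any other population once its mean and standard deviation are known.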

n_sam_vec <- numeric(10000)               #preallocate the vector for the sampling distribution
for (i in 1:10000) {                      #10000 simulations
  n_sam_vec[i] <- mean(rnorm(100,10,1))   #take the AVERAGE of a sample of 100 r.v.
}
hist(n_sam_vec, freq=FALSE,               #graph the sampling distribution
     col="orange",
     main="Histogram of Sample Means")

mean(n_sam_vec)                 #mean should be approx. 10 by CLT

## [1] 10.00217

sd(n_sam_vec)                   #SD should be approx. 0.1 by CLT

## [1] 0.100377
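
Beyond the mean and SD, the shape of the simulated distribution can be compared with the normal curve that the CLT predicts. The sketch below is a variant of the simulation above, under a few assumptions of its own: it uses `replicate()` instead of the loop, fixes an arbitrary seed so the run is repeatable, and compares a handful of empirical quantiles with `qnorm()`. The names `sim_means`, `probs` and `cmp` are illustrative.

```r
set.seed(1)                                   # arbitrary seed, for repeatability only

# 10000 simulated averages of 100 draws from N(10, 1)
sim_means <- replicate(10000, mean(rnorm(100, 10, 1)))

# Empirical quantiles next to the quantiles of N(10, 0.1)
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
cmp <- data.frame(
  prob        = probs,
  empirical   = as.numeric(quantile(sim_means, probs)),
  theoretical = qnorm(probs, mean = 10, sd = 0.1)
)
print(cmp)

# Histogram with the theoretical density drawn on top
hist(sim_means, freq = FALSE, main = "Histogram of Sample Means")
curve(dnorm(x, mean = 10, sd = 0.1), add = TRUE, col = "orange")
```

If the two quantile columns agree closely, the simulated averages match the predicted normal distribution in shape as well as in location and spread.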

n_sum_vec <- numeric(10000)               #preallocate the vector for the sampling distribution
for (i in 1:10000) {                      #10000 simulations
  n_sum_vec[i] <- sum(rnorm(100,10,1))    #take the TOTAL of a sample of 100 r.v.
}
hist(n_sum_vec, freq=FALSE,
     main="Histogram of Sample Totals")
line_fit <- seq(950,1050,by=0.001)        #grid for the theoretical normal curve
lines(line_fit, dnorm(line_fit,1000,10), col="orange")

mean(n_sum_vec)                 #mean should be approx. 1000 by CLT

## [1] 999.9817

sd(n_sum_vec)                   #SD should be approx. 10 by CLT

## [1] 9.995104

Population: Exponential

exp_seq <- seq(0,5,0.001)  #sequence from 0 to 5 by .001
#Gamma with shape=1 and rate=2 is the same density as Exponential with rate λ = 2
plot(exp_seq, dgamma(exp_seq,1,2), col="steelblue2", main="Exponential Distribution with λ = 2")

The population mean is μ = 1/λ = 0.5 and the standard deviation is σ = 1/λ = 0.5. According to the CLT, if we take samples of size n = 100, the sampling distribution of the average should be approximately x̄ ~ N(μ, σ/√n) = N(0.5, 0.05), and the sampling distribution of the sum should be approximately T ~ N(nμ, σ√n) = N(50, 5).

e_sam_vec <- numeric(10000)               #preallocate the vector for the sampling distribution
for (i in 1:10000) {                      #10000 simulations
  e_sam_vec[i] <- mean(rgamma(100,1,2))   #take the AVERAGE of a sample of 100 r.v.
}
hist(e_sam_vec, freq=FALSE,               #graph the sampling distribution
     col="steelblue2",
     main="Histogram of Sample Means")

mean(e_sam_vec)                 #mean should be approx. .5 by CLT

## [1] 0.5002097

sd(e_sam_vec)                   #SD should be approx. .05 by CLT

## [1] 0.05096455
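
The rule of thumb about sample size can be explored with the same exponential population. The average of n Exponential draws follows a Gamma distribution whose skewness is 2/√n, so the skew should shrink toward 0 (the value for a normal distribution) as n grows. The sketch below is illustrative only: `skewness()` is a hypothetical hand-rolled moment estimator (base R has no built-in skewness function), and the seed and the sample sizes are arbitrary choices.

```r
set.seed(2)                                   # arbitrary seed, for repeatability only

# Hypothetical helper: moment-based sample skewness
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

sizes <- c(2, 10, 30, 100)                    # sample sizes to compare

# For each n, simulate 10000 sample averages and measure their skewness
sim_skew <- sapply(sizes, function(n) {
  skewness(replicate(10000, mean(rexp(n, rate = 2))))
})

data.frame(n = sizes,
           simulated   = round(sim_skew, 3),
           theoretical = round(2 / sqrt(sizes), 3))
```

Small samples from a skewed population still give a visibly skewed sampling distribution, which is why a larger n is recommended when the population is not normal.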

e_sum_vec <- numeric(10000)               #preallocate the vector for the sampling distribution
for (i in 1:10000) {                      #10000 simulations
  e_sum_vec[i] <- sum(rgamma(100,1,2))    #take the TOTAL of a sample of 100 r.v.
}
hist(e_sum_vec, freq=FALSE,
     main="Histogram of Sample Totals")
line_fit <- seq(30,75,by=0.001)           #grid for the theoretical normal curve
lines(line_fit, dnorm(line_fit,50,5), col="blue")

mean(e_sum_vec)                 #mean should be approx. 50 by CLT

## [1] 50.08411

sd(e_sum_vec)                   #SD should be approx. 5 by CLT

## [1] 5.036488

Population: Uniform

pop_unif <- runif(10000, min=0, max=6) 
hist(pop_unif, main = "Uniform Distribution with a=0, b=6", border="darkgreen")

The population mean is μ = (a+b)/2 = 3 and the standard deviation is σ = √((b−a)^2 / 12) = 1.732. According to the CLT, if we take samples of size n = 100, the sampling distribution of the average should be approximately x̄ ~ N(μ, σ/√n) = N(3, 0.1732), and the sampling distribution of the sum should be approximately T ~ N(nμ, σ√n) = N(300, 17.32).
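
The numbers for the uniform case can be reproduced directly from the formulas for a Uniform(a, b) population. This is a minimal sketch of the arithmetic only; the variable names `a`, `b`, `n`, `mu` and `sigma` are chosen here for illustration.

```r
a <- 0; b <- 6; n <- 100

mu    <- (a + b) / 2               # population mean of Uniform(a, b)
sigma <- sqrt((b - a)^2 / 12)      # population SD of Uniform(a, b)

# Theoretical parameters of the sampling distributions for n = 100
c(mean_avg = mu,
  sd_avg   = sigma / sqrt(n),
  mean_sum = n * mu,
  sd_sum   = sigma * sqrt(n))
```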

u_sam_vec <- numeric(10000)               #preallocate the vector for the sampling distribution
for (i in 1:10000) {                      #10000 simulations
  u_sam_vec[i] <- mean(runif(100,0,6))    #take the AVERAGE of a sample of 100 r.v.
}
hist(u_sam_vec, freq=FALSE,               #graph the sampling distribution
     col="green",
     main="Histogram of Sample Means")

mean(u_sam_vec)                 #mean should be approx. 3 by CLT

## [1] 2.999125

sd(u_sam_vec)                   #SD should be approx. 0.1732 by CLT

## [1] 0.1754574

u_sum_vec <- numeric(10000)               #preallocate the vector for the sampling distribution
for (i in 1:10000) {                      #10000 simulations
  u_sum_vec[i] <- sum(runif(100,0,6))     #take the TOTAL of a sample of 100 r.v.
}
hist(u_sum_vec, freq=FALSE,               #graph the sampling distribution
     main="Histogram of Sample Totals")
line_fit <- seq(220,400,by=0.001)         #grid for the theoretical normal curve
lines(line_fit, dnorm(line_fit,300,17.32), col="green")

mean(u_sum_vec)                 #mean should be approx. 300 by CLT

## [1] 299.8411

sd(u_sum_vec)                   #SD should be approx. 17.32 by CLT

## [1] 17.17556