## Central Limit Theorem

A sample of n random variables (X1, X2, ..., Xn) is taken from a population, where the Xi are independent and identically distributed (i.i.d.). If the population is not normally distributed, a common rule of thumb is to use a sample size of at least 30 (n ≥ 30). The sample average and sum are as follows.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad T = \sum_{i=1}^{n} X_i$$

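The Central Limit Theorem states that the sampling distribution of the average (or sum) of a large number of i.i.d. observations is approximately normal, regardless of the shape of the original population distribution (provided its variance is finite).
Say the population distribution has mean μ and standard deviation σ. Then, writing the second parameter of N as the standard deviation,

$$\bar{x} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right), \qquad T \sim N\!\left(n\mu, \sigma\sqrt{n}\right)$$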
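
The mean and standard deviation of the sampling distribution follow from linearity of expectation and from independence. A short derivation, writing $\bar{x}$ for the sample average and $T$ for the sample sum, and assuming σ is finite:

$$
\begin{aligned}
E[\bar{x}] &= \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \mu, &
\operatorname{Var}(\bar{x}) &= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i) = \frac{\sigma^2}{n}, &
\operatorname{SD}(\bar{x}) &= \frac{\sigma}{\sqrt{n}} \\
E[T] &= \sum_{i=1}^{n} E[X_i] = n\mu, &
\operatorname{Var}(T) &= \sum_{i=1}^{n}\operatorname{Var}(X_i) = n\sigma^2, &
\operatorname{SD}(T) &= \sigma\sqrt{n}
\end{aligned}
$$

The variance step uses independence: the variance of a sum of independent variables is the sum of their variances.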

This concept is best visualized. Let’s begin with a population that has a normal distribution.

# Population: Normal

pop_norm <- rnorm(10000, mean=10, sd=1)
hist(pop_norm, main = "Normal Distribution with mu=10", border="darkorange2")

The population mean is μ = 10, and the standard deviation is σ = 1. According to the CLT, if we take samples of size n = 100, then the sampling distribution of the average should be approximately $\bar{x} \sim N(\mu, \sigma/\sqrt{n}) = N(10, 0.1)$. The sampling distribution of the sum should be approximately $T \sim N(n\mu, \sigma\sqrt{n}) = N(1000, 10)$.
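
The arithmetic above can be wrapped in a small function. This is only a sketch: `clt_params` is a hypothetical helper name, not something from the original code, and it returns the standard deviation (not the variance) as the spread parameter.

```r
# Hypothetical helper (not part of the original code): theoretical CLT
# parameters for the sample average and the sample sum.
clt_params <- function(mu, sigma, n) {
  list(
    mean_avg = mu,                 # E[x-bar] = mu
    sd_avg   = sigma / sqrt(n),    # SD[x-bar] = sigma / sqrt(n)
    mean_sum = n * mu,             # E[T] = n * mu
    sd_sum   = sigma * sqrt(n)     # SD[T] = sigma * sqrt(n)
  )
}

p <- clt_params(mu = 10, sigma = 1, n = 100)
str(p)   # mean_avg 10, sd_avg 0.1, mean_sum 1000, sd_sum 10
```

The same call with a different `mu`, `sigma`, and `n` gives the targets for any other population with known mean and standard deviation.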

n_sam_vec <- c()                      # create empty vector for the sampling distribution
for (i in 1:10000) {                  # 10000 simulations
  n_mean <- mean(rnorm(100, 10, 1))   # take the AVERAGE of a sample of 100 r.v.
  n_sam_vec <- c(n_sam_vec, n_mean)   # add this to the sampling distribution vector
}
hist(n_sam_vec, freq = FALSE,         # graph the sampling distribution
     col = "orange",
     main = "Histogram of Sample Means")

mean(n_sam_vec)                 #mean should be approx. 10 by CLT

## [1] 10.00217

sd(n_sam_vec)                   #SD should be approx. 0.1 by CLT

## [1] 0.100377

n_sum_vec <- c()
for (i in 1:10000) {
  n_sum <- sum(rnorm(100, 10, 1))     # take the TOTAL of a sample of 100 r.v.
  n_sum_vec <- c(n_sum_vec, n_sum)
}
hist(n_sum_vec, freq = FALSE,
     main = "Histogram of Sample Totals")
line_fit <- seq(950, 1050, by = 0.001)
lines(line_fit, dnorm(line_fit, 1000, 10), col = "orange")

mean(n_sum_vec)                 #mean should be approx. 1000 by CLT

## [1] 999.9817

sd(n_sum_vec)                   #SD should be approx. 10 by CLT

## [1] 9.995104
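
The two loops above grow a vector one element at a time. A possible alternative, sketched here, uses `replicate()` and a reusable function. `simulate_stat` and its arguments are hypothetical names introduced for illustration, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical helper: draw `reps` samples of size `n` using the sampler
# `rdist` and return the chosen statistic (mean or sum) of each sample.
simulate_stat <- function(rdist, n, reps = 10000, stat = mean) {
  replicate(reps, stat(rdist(n)))
}

draw_norm <- function(n) rnorm(n, mean = 10, sd = 1)

norm_means <- simulate_stat(draw_norm, n = 100)
norm_sums  <- simulate_stat(draw_norm, n = 100, stat = sum)

c(mean = mean(norm_means), sd = sd(norm_means))  # expect roughly 10 and 0.1
c(mean = mean(norm_sums),  sd = sd(norm_sums))   # expect roughly 1000 and 10
```

Passing the sampler as a function means the same helper could be reused for other populations by swapping `draw_norm` for another generator.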

# Population: Exponential

exp_seq <-seq(0,5,0.001)  #sequence from 0 to 5 by .001
plot(exp_seq, dgamma(exp_seq,1,2), col="steelblue2", main="Exponential Distribution with λ = 2") 

The population mean is μ = 1/λ = 0.5, and the standard deviation is σ = 1/λ = 0.5. According to the CLT, if we take samples of size n = 100, then the sampling distribution of the average should be approximately $\bar{x} \sim N(\mu, \sigma/\sqrt{n}) = N(0.5, 0.05)$. The sampling distribution of the sum should be approximately $T \sim N(n\mu, \sigma\sqrt{n}) = N(50, 5)$.
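
The code in this section uses `dgamma(x, 1, 2)` and `rgamma(n, 1, 2)` for the exponential population. That works because a Gamma distribution with shape 1 and rate λ is the Exponential distribution with rate λ. A quick check of that equivalence (the grid `x` is just an illustrative choice):

```r
x <- seq(0, 5, by = 0.001)

# Gamma with shape 1 and rate 2 has the same density as Exponential with rate 2
same_density <- isTRUE(all.equal(dgamma(x, shape = 1, rate = 2),
                                 dexp(x, rate = 2)))
same_density

lambda <- 2
c(mean = 1 / lambda, sd = 1 / lambda)   # both equal 0.5
```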

e_sam_vec <- c()                      # create empty vector for the sampling distribution
for (i in 1:10000) {                  # 10000 simulations
  s_mean <- mean(rgamma(100, 1, 2))   # take the AVERAGE of a sample of 100 r.v.
  e_sam_vec <- c(e_sam_vec, s_mean)   # add this to the sampling distribution vector
}
hist(e_sam_vec, freq = FALSE,         # graph the sampling distribution
     col = "steelblue2",
     main = "Histogram of Sample Means")

mean(e_sam_vec)                 #mean should be approx. .5 by CLT

## [1] 0.5002097

sd(e_sam_vec)                   #SD should be approx. .05 by CLT

## [1] 0.05096455

e_sum_vec <- c()
for (i in 1:10000) {
  e_sum <- sum(rgamma(100, 1, 2))     # take the TOTAL of a sample of 100 r.v.
  e_sum_vec <- c(e_sum_vec, e_sum)
}
hist(e_sum_vec, freq = FALSE,
     main = "Histogram of Sample Totals")
line_fit <- seq(30, 75, by = 0.001)
lines(line_fit, dnorm(line_fit, 50, 5), col = "blue")

mean(e_sum_vec)                 #mean should be approx. 50 by CLT

## [1] 50.08411

sd(e_sum_vec)                   #SD should be approx. 5 by CLT

## [1] 5.036488
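
Because the exponential population is strongly skewed, it is a useful case for seeing why the sample size matters. The sketch below compares the skewness of the sample means for a small and a large n. It uses `rexp(n, rate = 2)`, which draws from the same distribution as the `rgamma(n, 1, 2)` calls above. The `skewness` function is a simple moment estimator written for this example, and the seed and sample sizes are arbitrary choices.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Sample skewness (third standardized moment); a simple moment estimator
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Means of 10000 exponential samples for a small and a large sample size
means_n5   <- replicate(10000, mean(rexp(5,   rate = 2)))
means_n100 <- replicate(10000, mean(rexp(100, rate = 2)))

skew_n5   <- skewness(means_n5)     # theory: 2 / sqrt(5), about 0.89
skew_n100 <- skewness(means_n100)   # theory: 2 / sqrt(100) = 0.2
c(n5 = skew_n5, n100 = skew_n100)
```

A normal distribution has skewness 0, so a value that shrinks toward 0 as n grows is consistent with the sampling distribution becoming more normal.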

# Population: Uniform

pop_unif <- runif(10000, min=0, max=6)
hist(pop_unif, main = "Uniform Distribution with a=0, b=6", border="darkgreen")

The population mean is $\mu = (a+b)/2 = 3$ and the standard deviation is $\sigma = \sqrt{(b-a)^2/12} \approx 1.732$. According to the CLT, if we take samples of size n = 100, then the sampling distribution of the average should be approximately $\bar{x} \sim N(\mu, \sigma/\sqrt{n}) = N(3, 0.1732)$. The sampling distribution of the sum should be approximately $T \sim N(n\mu, \sigma\sqrt{n}) = N(300, 17.32)$.
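
The numbers in the paragraph above can be reproduced directly from the uniform mean and standard deviation formulas. A minimal sketch (the variable names are illustrative):

```r
a <- 0; b <- 6; n <- 100

mu    <- (a + b) / 2            # population mean: 3
sigma <- (b - a) / sqrt(12)     # population SD: about 1.732

c(mean_avg = mu,     sd_avg = sigma / sqrt(n))   # about 3 and 0.1732
c(mean_sum = n * mu, sd_sum = sigma * sqrt(n))   # about 300 and 17.32
```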

u_sam_vec <- c()                      # create empty vector for the sampling distribution
for (i in 1:10000) {                  # 10000 simulations
  u_mean <- mean(runif(100, 0, 6))    # take the AVERAGE of a sample of 100 r.v.
  u_sam_vec <- c(u_sam_vec, u_mean)   # add this to the sampling distribution vector
}
hist(u_sam_vec, freq = FALSE,         # graph the sampling distribution
     col = "green",
     main = "Histogram of Sample Means")

mean(u_sam_vec)                 #mean should be approx. 3 by CLT

## [1] 2.999125

sd(u_sam_vec)                   #SD should be approx. 0.1732 by CLT

## [1] 0.1754574

u_sum_vec <- c()                      # create empty vector for the sampling distribution
for (i in 1:10000) {                  # 10000 simulations
  u_sum <- sum(runif(100, 0, 6))      # take the TOTAL of a sample of 100 r.v.
  u_sum_vec <- c(u_sum_vec, u_sum)    # add this to the sampling distribution vector
}
hist(u_sum_vec, freq = FALSE,         # graph the sampling distribution
     main = "Histogram of Sample Totals")
line_fit <- seq(220, 400, by = 0.001)
lines(line_fit, dnorm(line_fit, 300, 17.32), col = "green")

mean(u_sum_vec)                 #mean should be approx. 300 by CLT

## [1] 299.8411

sd(u_sum_vec)                   #SD should be approx. 17.32 by CLT

## [1] 17.17556
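
Matching the mean and SD does not by itself show that the shape is normal. One possible extra check, sketched below for the uniform sums, is to standardize the simulated totals with the theoretical parameters and see whether about 95% fall within ±1.96, alongside a normal Q-Q plot. The seed is arbitrary and the variable names are illustrative.

```r
set.seed(7)  # arbitrary seed for reproducibility

sums <- replicate(10000, sum(runif(100, min = 0, max = 6)))

# Standardize with the theoretical CLT parameters for the sum:
# mean = 100 * 3 = 300, SD = sqrt(3) * sqrt(100), since sigma = 6 / sqrt(12) = sqrt(3)
z <- (sums - 300) / (sqrt(3) * 10)

# Under normality about 95% of z should fall within +/- 1.96
coverage <- mean(abs(z) <= 1.96)
coverage

qqnorm(z); qqline(z)   # points close to the line suggest approximate normality
```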