## The Chi-Square Distribution

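Recall that the Gamma distribution has several special cases depending on its parameters α and β. Using the dgamma function in R, we can graph a few cases. dgamma takes a vector of x values, a shape (α), and a rate. The scale β used in this text is the reciprocal of the rate (β = 1/rate), so the two coincide when both equal 1, as in the plot below.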

exp_seq <- seq(0, 7, .001) #sequence from 0 to 7 by .001
plot(exp_seq, dgamma(exp_seq, 1,1), col="red", main="Gamma Density Distribution",
xlab="x", ylab="f(x)", cex=0.02)
lines(exp_seq, dgamma(exp_seq, 2, 1), col="orange")
lines(exp_seq, dgamma(exp_seq, 3, 1), col="yellow")
lines(exp_seq, dgamma(exp_seq, 4, 1), col="green")
lines(exp_seq, dgamma(exp_seq, 5, 1), col="blue")
legend(4.5, 1, legend=c("α=1, β=1", "α=2, β=1", "α=3, β=1", "α=4, β=1", "α=5, β=1"),
col=c("red", "orange", "yellow", "green", "blue"),
lty=1,
cex=0.8)

The Exponential is a special case of the Gamma distribution with Γ(α=1, β=1/λ).
The Chi-square (χ²) with 1 degree of freedom is also a special case of the Gamma distribution, with Γ(α=½, β=2). To see this, let Z be a standard normal random variable, Z ~ N(0,1). If you square Z, then Z² is a chi-square variable with 1 degree of freedom, denoted here χ²₁.
Take a look at what happens to a standard normal sample when its values are squared.

# z is standard normal N(0,1)

z <- rnorm(n = 10000, mean = 0, sd = 1)
hist(z)

# Creating a chi-square distribution by squaring the values
x <- z^2
hist(x, breaks = 5)   # hist() has no "bins" argument; "breaks" sets the suggested number of cells

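Notice that the shape of the chi-square histogram resembles a gamma density. This can be proved using the definition above and the Distribution Function Technique. If Z ~ N(0,1) and X = Z², then X ~ χ²₁. For x > 0, the cdf of X is

$$F_X(x) = P(Z^2 \le x) = P(-\sqrt{x} \le Z \le \sqrt{x}) = 2\Phi(\sqrt{x}) - 1,$$

where Φ is the standard normal cdf and φ is its pdf.

Take the derivative of the cdf to find the pdf.

$$f_X(x) = \frac{d}{dx} F_X(x) = 2\varphi(\sqrt{x}) \cdot \frac{1}{2\sqrt{x}} = x^{-1/2}\,\varphi(\sqrt{x})$$

Since Z follows a standard normal distribution:

$$f_X(x) = \frac{1}{\sqrt{2\pi}}\, x^{-1/2} e^{-x/2}, \quad x > 0$$

This simplifies to:

$$f_X(x) = \frac{1}{\Gamma(1/2)\, 2^{1/2}}\, x^{1/2 - 1} e^{-x/2}, \quad x > 0,$$

using Γ(½) = √π.

This is the pdf for Γ(α=½, β=2) and it is called a chi-square with 1 degree of freedom.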
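
The special cases named in this section can be checked numerically. The sketch below is only an illustration: it compares the standard closed form of the χ²₁ density, x^(-1/2) e^(-x/2) / √(2π), with R's dchisq and with dgamma called with shape ½ and scale 2, and it also compares dexp with Γ(α=1, β=1/λ). The grid of x values and the choice λ = 2 are assumptions made for the example.

```r
# Grid of positive x values (the chi-square density with 1 df is unbounded at 0)
x <- seq(0.05, 7, by = 0.05)

# Standard closed form of the chi-square(1) density
f_closed <- x^(-1/2) * exp(-x / 2) / sqrt(2 * pi)

# The same density through R's built-in functions
f_chisq <- dchisq(x, df = 1)
f_gamma <- dgamma(x, shape = 1/2, scale = 2)   # scale = beta, i.e. rate = 1/beta

# Largest absolute disagreement on the grid
max_diff_chisq <- max(abs(f_closed - f_chisq))
max_diff_gamma <- max(abs(f_closed - f_gamma))

# Exponential special case: Gamma(alpha = 1, beta = 1/lambda); lambda = 2 is illustrative
lambda <- 2
max_diff_exp <- max(abs(dexp(x, rate = lambda) -
                        dgamma(x, shape = 1, scale = 1 / lambda)))

print(c(chisq = max_diff_chisq, gamma = max_diff_gamma, exp = max_diff_exp))
```

All three differences should sit at the level of floating-point rounding, which is consistent with the three densities being the same function.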

## The t Distribution

A t distribution is created using the ratio of a standard normal and the square root of an independent chi-square divided by its degrees of freedom. If

Z ~ N(0,1)
U ~ χ²ₙ

and Z and U are independent, then

$$T = \frac{Z}{\sqrt{U/n}}$$

follows a t distribution with n degrees of freedom.
The following graph shows how several t distributions compare to the standard normal curve.

curve(dnorm(x), -4.5, 4.5, col = "red")
curve(dt(x, df = 1), add = TRUE)
curve(dt(x, df = 5), add = TRUE)
curve(dt(x, df = 15), add = TRUE)

The pdf of the t distribution with n degrees of freedom is quite complicated, but its expected value is 0 (for n > 1) and its variance is n/(n−2) (for n > 2). As the degrees of freedom increase, the t distribution tends toward the standard normal (notice that the variance approaches 1 as n → ∞). This can be seen in the graph as well. The tails of the t distribution are heavier than those of N(0,1), but they approach the normal tails as the degrees of freedom increase.
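
The construction and the variance claim can be explored with a short simulation. This is a minimal sketch, not a proof: it builds t values as a standard normal divided by the square root of an independent chi-square over its degrees of freedom, then compares the simulated variance with n/(n−2). The seed, the choice of 10 degrees of freedom, and the number of simulations are assumptions made for illustration.

```r
set.seed(1)            # arbitrary seed, only for reproducibility
n_df  <- 10            # degrees of freedom (illustrative choice)
n_sim <- 100000        # number of simulated values

z <- rnorm(n_sim)                 # Z ~ N(0, 1)
u <- rchisq(n_sim, df = n_df)     # U ~ chi-square(n_df), drawn independently of Z
t_sim <- z / sqrt(u / n_df)       # ratio that defines the t distribution

var_sim    <- var(t_sim)          # simulated variance
var_theory <- n_df / (n_df - 2)   # n / (n - 2), valid for n > 2

# Largest gap between the t density and the standard normal density on a grid
grid <- seq(-4.5, 4.5, by = 0.01)
max_gap <- sapply(c(1, 5, 15, 100),
                  function(k) max(abs(dt(grid, df = k) - dnorm(grid))))

print(c(simulated = var_sim, theoretical = var_theory))
print(max_gap)                    # one value per df in c(1, 5, 15, 100)
```

The second printout gives the largest density gap for 1, 5, 15 and 100 degrees of freedom, which should shrink as the degrees of freedom grow.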

## The F Distribution

Let U ~ χ²ₙ and V ~ χ²ₘ. If U and V are independent, then the ratio

$$F = \frac{U/n}{V/m}$$

follows an F distribution with n and m degrees of freedom, denoted Fₙ,ₘ.

f_seq <- seq(0, 7, .001)
f_dist <- df(f_seq, df1 = 3, df2 = 4)
plot(f_seq, f_dist, type = "l", xlab = "x", ylab = "f(x)",   # plot against x, not the vector index
main = "F-Distribution w/ df1=3 and df2=4")
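
The ratio definition can also be checked by simulation. The sketch below is an illustration under assumed settings (the seed, the number of simulations, and the probability levels are arbitrary): it forms the ratio of two independent chi-squares, each divided by its degrees of freedom, and compares the empirical cdf with the theoretical F quantiles from qf.

```r
set.seed(2)            # arbitrary seed, only for reproducibility
n_sim <- 100000        # number of simulated ratios
df1 <- 3               # numerator degrees of freedom (matches the plot)
df2 <- 4               # denominator degrees of freedom (matches the plot)

u <- rchisq(n_sim, df = df1)      # U ~ chi-square(df1)
v <- rchisq(n_sim, df = df2)      # V ~ chi-square(df2), independent of U
f_sim <- (u / df1) / (v / df2)    # ratio that defines the F distribution

# Theoretical quantiles at a few probability levels
probs <- c(0.25, 0.5, 0.75, 0.9)
q_theory <- qf(probs, df1, df2)

# Share of simulated values at or below each theoretical quantile
emp_cdf <- sapply(q_theory, function(q) mean(f_sim <= q))

print(rbind(target = probs, simulated = emp_cdf))
```

If the construction is right, each simulated share should land close to its target probability.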

## Central Limit Theorem

A sample of n random variables (X₁, X₂, ..., Xₙ) is taken from the population, with each r.v. independent and identically distributed. If the population is not normally distributed, a common rule of thumb is that the sample size should be at least 30 (n ≥ 30). The sample average and sum are as follows.

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad T = \sum_{i=1}^{n} X_i$$

The Central Limit Theorem states that the sampling distribution of the average (or sum) of a large sample is approximately normal regardless of the original population distribution, provided the population variance is finite.
Say the population distribution has mean μ and standard deviation σ. Then, writing the second parameter as a standard deviation,

$$\bar{X} \approx N\left(\mu, \frac{\sigma}{\sqrt{n}}\right), \qquad T \approx N\left(n\mu, \sigma\sqrt{n}\right)$$

This concept is best visualized. Let’s begin with a population that has a normal distribution.

# Population: Normal

pop_norm <- rnorm(10000, mean=10, sd=1)
hist(pop_norm, main = "Normal Distribution with mu=10", border="darkorange2")

The population mean is μ = 10, and the standard deviation is σ = 1. According to the CLT, if we take samples of size 100, then the sampling distribution of the average should be x̄ ~ N(μ, σ/√n) = N(10, 0.1), where the second parameter is the standard deviation. The sampling distribution of the sum should be T ~ N(nμ, σ√n) = N(1000, 10).

n_sam_vec <- c()               #create empty vector for the sampling distribution
for (i in 1:10000){               #10000 simulations
n_mean<-mean(rnorm(100,10,1))     #take the AVERAGE of sample of 100 r.v.
n_sam_vec<-c(n_sam_vec,n_mean)}   #add this to the sampling distribution vector
hist(n_sam_vec,freq=F,            #graph the sampling distribution
col="orange",
main="Histogram of Sample Means")        

mean(n_sam_vec)                 #mean should be approx. 10 by CLT

## [1] 10.00217

sd(n_sam_vec)                   #SD should be approx. 0.1 by CLT

## [1] 0.100377

n_sum_vec <- c()
for (i in 1:10000){
n_sum<-sum(rnorm(100,10,1))      #take the TOTAL of sample of 100 r.v.
n_sum_vec<-c(n_sum_vec,n_sum)}
hist(n_sum_vec,freq=F,
main="Histogram of Sample Totals")
line_fit<-seq(950,1050,by=0.001)
lines(line_fit,dnorm(line_fit,1000,10),col="orange")

mean(n_sum_vec)                 #mean should be approx. 1000 by CLT

## [1] 999.9817

sd(n_sum_vec)                   #SD should be approx. 10 by CLT

## [1] 9.995104

# Population: Exponential

exp_seq <-seq(0,5,0.001)  #sequence from 0 to 5 by .001
plot(exp_seq, dgamma(exp_seq,1,2), col="steelblue2", main="Exponential Distribution with λ = 2") 

The population mean is μ = 1/λ = 0.5, and the standard deviation is σ = 1/λ = 0.5. According to the CLT, if we take samples of size 100, then the sampling distribution of the average should be approximately x̄ ~ N(μ, σ/√n) = N(0.5, 0.05), where the second parameter is the standard deviation. The sampling distribution of the sum should be approximately T ~ N(nμ, σ√n) = N(50, 5).

e_sam_vec <- c()               #create empty vector for the sampling distribution
for (i in 1:10000){               #10000 simulations
s_mean<-mean(rgamma(100,1,2))     #take the AVERAGE of sample of 100 r.v.
e_sam_vec<-c(e_sam_vec,s_mean)}   #add this to the sampling distribution vector
hist(e_sam_vec,freq=F,            #graph the sampling distribution
col="steelblue2",
main="Histogram of Sample Means") 

mean(e_sam_vec)                 #mean should be approx. .5 by CLT

## [1] 0.5002097

sd(e_sam_vec)                   #SD should be approx. .05 by CLT

## [1] 0.05096455

e_sum_vec <- c()
for (i in 1:10000){
e_sum<-sum(rgamma(100,1,2))      #take the TOTAL of sample of 100 r.v.
e_sum_vec<-c(e_sum_vec,e_sum)}
hist(e_sum_vec,freq=F,
main="Histogram of Sample Totals")
line_fit<-seq(30,75,by=0.001)
lines(line_fit,dnorm(line_fit,50,5),col="blue")

mean(e_sum_vec)                 #mean should be approx. 50 by CLT

## [1] 50.08411

sd(e_sum_vec)                   #SD should be approx. 5 by CLT

## [1] 5.036488

# Population: Uniform

pop_unif <- runif(10000, min=0, max=6)
hist(pop_unif, main = "Uniform Distribution with a=0, b=6", border="darkgreen")

The population mean is μ = (a+b)/2 = 3 and the standard deviation is σ = √((b−a)²/12) = 1.732. According to the CLT, if we take samples of size 100, then the sampling distribution of the average should be approximately x̄ ~ N(μ, σ/√n) = N(3, 0.1732), where the second parameter is the standard deviation. The sampling distribution of the sum should be approximately T ~ N(nμ, σ√n) = N(300, 17.32).

u_sam_vec <- c()                #create empty vector for the sampling distribution
for (i in 1:10000){                 #10000 simulations
u_mean<-mean(runif(100,0,6))      #take the AVERAGE of sample of 100 r.v.
u_sam_vec<-c(u_sam_vec,u_mean)}   #add this to the sampling distribution vector
hist(u_sam_vec,freq=F,              #graph the sampling distribution
col="green",
main="Histogram of Sample Means")

mean(u_sam_vec)                 #mean should be approx. 3 by CLT

## [1] 2.999125

sd(u_sam_vec)                   #SD should be approx. 0.1732 by CLT

## [1] 0.1754574

u_sum_vec <- c()                #create empty vector for the sampling distribution
for (i in 1:10000){                 #10000 simulations
u_sum<-sum(runif(100,0,6))       #take the TOTAL of sample of 100 r.v.
u_sum_vec<-c(u_sum_vec,u_sum)}   #add this to the sampling distribution vector
hist(u_sum_vec,freq=F,              #graph the sampling distribution
main="Histogram of Sample Totals")
line_fit<-seq(220,400,by=0.001)
lines(line_fit,dnorm(line_fit,300,17.32),col="green")

mean(u_sum_vec)                 #mean should be approx. 300 by CLT

## [1] 299.8411

sd(u_sum_vec)                   #SD should be approx. 17.32 by CLT

## [1] 17.17556
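
The three experiments above follow the same pattern, so they can be condensed into two small helpers. This is a sketch under stated assumptions, not a replacement for the code above: clt_params and clt_simulate are hypothetical names introduced here, the seed is arbitrary, and replicate() is used instead of growing a vector inside a loop. The exponential population is drawn with rexp(rate = 2), which is the same distribution as rgamma with shape 1 and rate 2.

```r
# Theoretical CLT parameters for the sample mean and the sample total
# (hypothetical helper; second parameter of each normal is a standard deviation)
clt_params <- function(mu, sigma, n) {
  c(mean_xbar = mu, sd_xbar = sigma / sqrt(n),
    mean_total = n * mu, sd_total = sigma * sqrt(n))
}

# Draw `reps` samples of size n from the generator `rdist`
# and summarise the simulated sample means and totals (hypothetical helper)
clt_simulate <- function(rdist, n, reps = 10000) {
  samples <- replicate(reps, rdist(n))   # n x reps matrix, one sample per column
  xbar  <- colMeans(samples)
  total <- colSums(samples)
  c(mean_xbar = mean(xbar), sd_xbar = sd(xbar),
    mean_total = mean(total), sd_total = sd(total))
}

set.seed(3)   # arbitrary seed, only for reproducibility
n <- 100      # sample size used throughout this section

results <- rbind(
  normal_theory = clt_params(10, 1, n),
  normal_sim    = clt_simulate(function(k) rnorm(k, mean = 10, sd = 1), n),
  exp_theory    = clt_params(1 / 2, 1 / 2, n),
  exp_sim       = clt_simulate(function(k) rexp(k, rate = 2), n),
  unif_theory   = clt_params(3, sqrt((6 - 0)^2 / 12), n),
  unif_sim      = clt_simulate(function(k) runif(k, min = 0, max = 6), n)
)
print(round(results, 4))
```

Each simulated row should sit close to the theoretical row above it. Passing the generator as a function keeps the simulation code identical across populations, so only the theoretical μ and σ change.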