Chi-Square, t and F Distributions

The Chi-Square Distribution

Recall that the Gamma distribution has several special cases depending on its parameters α and β. Using the dgamma function in R, we can graph a few cases. Called positionally, dgamma takes a vector of x values, a shape (α) and a rate (1/β, where β is the scale).

exp_seq <- seq(0, 7, .001) #sequence from 0 to 7 by .001
plot(exp_seq, dgamma(exp_seq, 1,1), col="red", main="Gamma Density Distribution",
     xlab="x", ylab="f(x)", cex=0.02)
lines(exp_seq, dgamma(exp_seq, 2, 1), col="orange")
lines(exp_seq, dgamma(exp_seq, 3, 1), col="yellow")
lines(exp_seq, dgamma(exp_seq, 4, 1), col="green")
lines(exp_seq, dgamma(exp_seq, 5, 1), col="blue")
legend(4.5, 1, legend=c("α=1, β=1", "α=2, β=1", "α=3, β=1", "α=4, β=1", "α=5, β=1"),
 col=c("red", "orange", "yellow", "green", "blue"), 
 lty=1, 
 cex=0.8)

The Exponential is a special case of the Gamma distribution with Γ(α=1, β=1/λ).
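As a quick numerical check of this special case, the sketch below compares R's dexp with dgamma at shape 1. The rate lambda = 2 and the variable names are arbitrary choices for illustration, not values from the text above.

```r
lambda <- 2                      # illustrative rate (an assumption)
x <- seq(0.1, 5, by = 0.1)

# Gamma with shape alpha = 1 and scale beta = 1/lambda
gamma_dens <- dgamma(x, shape = 1, scale = 1 / lambda)
# Exponential with rate lambda
exp_dens   <- dexp(x, rate = lambda)

# Largest pointwise difference; should be zero up to floating-point error
max_diff <- max(abs(gamma_dens - exp_dens))
max_diff
```

Naming the scale argument explicitly avoids confusion with dgamma's positional third argument, which is the rate.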
The Chi-square (χ²) distribution with 1 degree of freedom is also a special case of the Gamma distribution, namely Γ(α=½, β=2). To see this, let Z be a standard normal random variable, Z ~ N(0,1). If you square Z, then Z² is a chi-square variable with 1 degree of freedom, written here as χ²₁.
Take a look at what happens to a standard normal sample when its values are squared.

# z is standard normal N(0,1)

z <- rnorm(n = 10000, mean = 0, sd = 1)
hist(z)


# Creating a chi-square distribution by squaring the values
x <- z^2
hist(x, breaks = 5)   # hist() takes breaks (a suggested number of bins), not bins


Notice that the shape of this histogram resembles a gamma density. This can be proved using the definition above and the Distribution Function Technique: if Z ~ N(0,1) and X = Z², then X ~ χ²₁.

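For x > 0, start from the cdf of X:

\[ F_X(x) = P(Z^2 \le x) = P(-\sqrt{x} \le Z \le \sqrt{x}) = F_Z(\sqrt{x}) - F_Z(-\sqrt{x}) \]

Take the derivative of the cdf to find the pdf.

\[ f_X(x) = \frac{d}{dx} F_X(x) = \frac{1}{2\sqrt{x}} \left[ f_Z(\sqrt{x}) + f_Z(-\sqrt{x}) \right] \]

Since Z follows a standard normal distribution:

\[ f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \]

This simplifies to:

\[ f_X(x) = \frac{1}{\sqrt{2\pi x}} e^{-x/2} = \frac{1}{\Gamma(1/2)\, 2^{1/2}} x^{1/2 - 1} e^{-x/2}, \quad x > 0 \]

Using Γ(½) = √π, this is the pdf for Γ(α=½, β=2), and it is called a chi-square with 1 degree of freedom.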
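The result can also be checked numerically. The pdf of Γ(α=½, β=2) is f(x) = e^(−x/2) / √(2πx) for x > 0; the sketch below evaluates that formula by hand and compares it with R's dgamma and dchisq. The grid and variable names are illustrative assumptions.

```r
x <- seq(0.05, 8, by = 0.05)

# Hand-written pdf: f(x) = exp(-x/2) / sqrt(2*pi*x)
derived_pdf <- exp(-x / 2) / sqrt(2 * pi * x)

# Built-in densities: Gamma(shape = 1/2, scale = 2) and chi-square with 1 df
gamma_pdf <- dgamma(x, shape = 1/2, scale = 2)
chisq_pdf <- dchisq(x, df = 1)

# Both differences should be zero up to floating-point error
diff_gamma <- max(abs(derived_pdf - gamma_pdf))
diff_chisq <- max(abs(derived_pdf - chisq_pdf))
c(diff_gamma, diff_chisq)
```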

The t Distribution

A t distribution is created using the ratio of a standard normal and the square root of a chi-square divided by its degrees of freedom.

Let Z ~ N(0,1) and U ~ χ²ₙ be independent. Then

\[ T = \frac{Z}{\sqrt{U/n}} \]

follows a t distribution with n degrees of freedom.
The following graph shows how several t distributions compare to the standard normal curve.

curve(dnorm(x), -4.5, 4.5, col = "red")
curve(dt(x, df = 1), add = TRUE)
curve(dt(x, df = 5), add = TRUE)
curve(dt(x, df = 15), add = TRUE)

The pdf of the t distribution with n degrees of freedom is quite complicated, but its expected value is 0 (for n > 1) and its variance is n/(n−2) (for n > 2). As the degrees of freedom increase, the t distribution tends toward the standard normal (notice the variance approaches 1 as n → ∞). This can be seen in the graph as well: the tails of the t distribution are heavier than those of N(0,1), but they approach the normal tails as the degrees of freedom increase.
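Both claims can be explored with a small simulation. The sketch below builds t values from the ratio definition, compares their sample variance with n/(n−2), and measures how far the t density is from the standard normal density for several degrees of freedom. The seed, n = 5, the number of draws and the names t_sim and gaps are all assumptions made for illustration.

```r
set.seed(1)                       # arbitrary seed for reproducibility
n <- 5                            # illustrative degrees of freedom
z <- rnorm(100000)                # Z ~ N(0,1)
u <- rchisq(100000, df = n)       # U ~ chi-square with n df, drawn independently of z
t_sim <- z / sqrt(u / n)          # T = Z / sqrt(U/n)

var(t_sim)                        # sample variance of the simulated t values
n / (n - 2)                       # theoretical variance, 5/3

# Largest gap between the t density and the N(0,1) density on a grid,
# for increasing degrees of freedom
grid <- seq(-4, 4, by = 0.01)
gaps <- sapply(c(1, 5, 15, 100),
               function(k) max(abs(dt(grid, df = k) - dnorm(grid))))
gaps
```

The values in gaps should shrink as the degrees of freedom grow, which is the numerical counterpart of the curves closing in on the normal curve.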

The F Distribution

Let U ~ χ²ₙ and V ~ χ²ₘ. If U and V are independent, then the ratio

\[ F = \frac{U/n}{V/m} \]

follows an F distribution with n and m degrees of freedom, written Fₙ,ₘ.

f_seq <- seq(0, 7, .001) 
f_dist <- df(f_seq, df1 = 3, df2 = 4)
plot(f_seq, f_dist, type = "l", xlab = "x", ylab = "f(x)",   #plot the density against x, not the index
     main = "F-Distribution w/ df1=3 and df2=4")
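The ratio definition can be checked by simulation as well. This sketch draws two independent chi-square samples, forms the scaled ratio, and compares its quantiles with qf. The seed, the number of draws and the names f_sim and probs are assumptions for illustration; the degrees of freedom match the plot above.

```r
set.seed(2)                       # arbitrary seed
n <- 3; m <- 4                    # degrees of freedom used in the plot
u <- rchisq(100000, df = n)       # U ~ chi-square with n df
v <- rchisq(100000, df = m)       # V ~ chi-square with m df, independent of u
f_sim <- (u / n) / (v / m)        # ratio of scaled chi-squares

probs <- c(0.25, 0.5, 0.75, 0.9)
rbind(simulated   = quantile(f_sim, probs),
      theoretical = qf(probs, df1 = n, df2 = m))
```

The two rows should be close, with the largest discrepancy in the upper quantile where the F distribution's long right tail makes the estimate noisier.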


Central Limit Theorem

A sample of n random variables (X1, X2, ..., Xn) is taken from the population, with each r.v. independent and identically distributed. If the population is not normally distributed, a common rule of thumb is to use a sample size of at least 30 (n ≥ 30). The sample average and sum are as follows.

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad T = \sum_{i=1}^{n} X_i \]

The Central Limit Theorem states that the sampling distribution of the average (or sum) of a large number of i.i.d. random variables is approximately normal, regardless of the shape of the original population distribution, provided the population variance is finite.
Say the population distribution has mean μ and standard deviation σ. Then,

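\[ \bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right), \qquad T \sim N\left(n\mu, \sqrt{n}\,\sigma\right) \]

approximately for large n, where the second parameter is the standard deviation.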
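The mean and standard deviation of the average and the sum follow from linearity of expectation and from independence of the X_i. A short derivation, written with the same symbols (μ, σ, n, x̄, T), might look like this:

```latex
\begin{align*}
E[\bar{x}] &= \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \mu,
  & \operatorname{Var}(\bar{x}) &= \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{\sigma^2}{n},
  & \operatorname{SD}(\bar{x}) &= \frac{\sigma}{\sqrt{n}} \\
E[T] &= \sum_{i=1}^{n} E[X_i] = n\mu,
  & \operatorname{Var}(T) &= \sum_{i=1}^{n} \operatorname{Var}(X_i) = n\sigma^2,
  & \operatorname{SD}(T) &= \sqrt{n}\,\sigma
\end{align*}
```

These identities hold for any sample size; the CLT adds that the shape of both distributions becomes approximately normal as n grows.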

This concept is best visualized. Let’s begin with a population that has a normal distribution.

Population: Normal

pop_norm <- rnorm(10000, mean=10, sd=1) 
hist(pop_norm, main = "Normal Distribution with mu=10", border="darkorange2")


The population mean is μ = 10 and the standard deviation is σ = 1. According to the CLT, if we take samples of size 100, then the sampling distribution of the average should be approximately x̄ ~ N(μ, σ/√n) = N(10, 0.1). The sampling distribution of the sum should be approximately T ~ N(nμ, √n·σ) = N(1000, 10).
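The arithmetic behind these parameters is easy to wrap in a small function. The helper below, clt_params, is hypothetical (it is not part of base R or of the code in this post); it simply returns the theoretical mean and standard deviation of the sample average and the sample sum.

```r
# Hypothetical helper: theoretical CLT parameters for the average and the sum
clt_params <- function(mu, sigma, n) {
  c(mean_avg = mu,                 # mean of the sample average
    sd_avg   = sigma / sqrt(n),    # SD of the sample average
    mean_sum = n * mu,             # mean of the sample sum
    sd_sum   = sqrt(n) * sigma)    # SD of the sample sum
}

# Normal population with mu = 10, sigma = 1 and samples of size 100
clt_params(mu = 10, sigma = 1, n = 100)
```

For this population the function returns 10 and 0.1 for the average and 1000 and 10 for the sum, which are the values the simulations that follow should approach.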

n_sam_vec <- c()               #create empty vector for the sampling distribution 
for (i in 1:10000){               #10000 simulations
n_mean<-mean(rnorm(100,10,1))     #take the AVERAGE of sample of 100 r.v.
n_sam_vec<-c(n_sam_vec,n_mean)}   #add this to the sampling distribution vector
hist(n_sam_vec,freq=F,            #graph the sampling distribution
     col="orange",
     main="Histogram of Sample Means")        


mean(n_sam_vec)                 #mean should be approx. 10 by CLT

## [1] 10.00217

sd(n_sam_vec)                   #SD should be approx. 0.1 by CLT

## [1] 0.100377
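The loop above grows the vector one element at a time. A more compact sketch of the same simulation uses replicate, which runs an expression a fixed number of times and collects the results. The name n_sam_vec2 and the seed are assumptions added here so the result does not overwrite the vector built above.

```r
set.seed(3)                                   # arbitrary seed for reproducibility

# 10000 simulations, each the AVERAGE of a sample of 100 N(10, 1) draws
n_sam_vec2 <- replicate(10000, mean(rnorm(100, mean = 10, sd = 1)))

mean(n_sam_vec2)                              # should be close to 10
sd(n_sam_vec2)                                # should be close to 0.1
```

Besides being shorter, this avoids repeatedly copying the growing vector, which matters more as the number of simulations increases.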

n_sum_vec <- c()                 
for (i in 1:10000){           
n_sum<-sum(rnorm(100,10,1))      #take the TOTAL of sample of 100 r.v.
n_sum_vec<-c(n_sum_vec,n_sum)}       
hist(n_sum_vec,freq=F,
     main="Histogram of Sample Totals")
line_fit<-seq(950,1050,by=0.001) 
lines(line_fit,dnorm(line_fit,1000,10),col="orange")


mean(n_sum_vec)                 #mean should be approx. 1000 by CLT

## [1] 999.9817

sd(n_sum_vec)                   #SD should be approx. 10 by CLT

## [1] 9.995104

Population: Exponential

exp_seq <-seq(0,5,0.001)  #sequence from 0 to 5 by .001 
plot(exp_seq, dgamma(exp_seq,1,2), col="steelblue2", main="Exponential Distribution with λ = 2") 


The population mean is μ = 1/λ = 0.5 and the standard deviation is σ = 1/λ = 0.5. According to the CLT, if we take samples of size 100, then the sampling distribution of the average should be approximately x̄ ~ N(μ, σ/√n) = N(0.5, 0.05). The sampling distribution of the sum should be approximately T ~ N(nμ, √n·σ) = N(50, 5).

e_sam_vec <- c()               #create empty vector for the sampling distribution 
for (i in 1:10000){               #10000 simulations
s_mean<-mean(rgamma(100,1,2))     #take the AVERAGE of sample of 100 r.v.
e_sam_vec<-c(e_sam_vec,s_mean)}   #add this to the sampling distribution vector
hist(e_sam_vec,freq=F,            #graph the sampling distribution
     col="steelblue2",
     main="Histogram of Sample Means") 


mean(e_sam_vec)                 #mean should be approx. .5 by CLT

## [1] 0.5002097

sd(e_sam_vec)                   #SD should be approx. .05 by CLT

## [1] 0.05096455

e_sum_vec <- c()                 
for (i in 1:10000){           
e_sum<-sum(rgamma(100,1,2))      #take the TOTAL of sample of 100 r.v.
e_sum_vec<-c(e_sum_vec,e_sum)}       
hist(e_sum_vec,freq=F,
     main="Histogram of Sample Totals")
line_fit<-seq(30,75,by=0.001) 
lines(line_fit,dnorm(line_fit,50,5),col="blue")


mean(e_sum_vec)                 #mean should be approx. 50 by CLT

## [1] 50.08411

sd(e_sum_vec)                   #SD should be approx. 5 by CLT

## [1] 5.036488
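Because the exponential population is strongly right-skewed, it is a convenient place to see why sample size matters. The sketch below compares the skewness of sample means for n = 5 and n = 100. It uses rexp, which is equivalent to rgamma with shape 1; the skew function is a hand-written sample skewness, and the seed and variable names are assumptions for illustration.

```r
set.seed(4)                                   # arbitrary seed

# Hand-written sample skewness, to avoid depending on extra packages
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Sampling distributions of the mean for a small and a large sample size
means_n5   <- replicate(10000, mean(rexp(5,   rate = 2)))
means_n100 <- replicate(10000, mean(rexp(100, rate = 2)))

skew(means_n5)      # theoretical skewness of the mean is 2/sqrt(5), about 0.89
skew(means_n100)    # theoretical skewness of the mean is 2/sqrt(100) = 0.2
```

A normal distribution has skewness 0, so the smaller value for n = 100 indicates that the sampling distribution is much closer to normal than it is for n = 5.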

Population: Uniform

pop_unif <- runif(10000, min=0, max=6) 
hist(pop_unif, main = "Uniform Distribution with a=0, b=6", border="darkgreen")


The population mean is μ = (a+b)/2 = 3 and the standard deviation is σ = √((b−a)²/12) = 1.732. According to the CLT, if we take samples of size 100, then the sampling distribution of the average should be approximately x̄ ~ N(μ, σ/√n) = N(3, 0.1732). The sampling distribution of the sum should be approximately T ~ N(nμ, √n·σ) = N(300, 17.32).
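The uniform mean and standard deviation formulas can be verified by numerical integration against the density. This is a minimal sketch using R's integrate and dunif; the variable names are illustrative assumptions.

```r
a <- 0; b <- 6

# Closed-form mean and SD of Uniform(a, b)
mu_formula    <- (a + b) / 2
sigma_formula <- sqrt((b - a)^2 / 12)

# The same quantities by integrating against the uniform density
mu_num  <- integrate(function(x) x * dunif(x, a, b), a, b)$value
var_num <- integrate(function(x) (x - mu_num)^2 * dunif(x, a, b), a, b)$value

c(mu_formula, mu_num)                 # both should be 3
c(sigma_formula, sqrt(var_num))       # both should be about 1.732
```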

u_sam_vec <- c()                #create empty vector for the sampling distribution
for (i in 1:10000){                 #10000 simulations
  u_mean<-mean(runif(100,0,6))      #take the AVERAGE of sample of 100 r.v.
  u_sam_vec<-c(u_sam_vec,u_mean)}   #add this to the sampling distribution vector
hist(u_sam_vec,freq=F,              #graph the sampling distribution
     col="green",
     main="Histogram of Sample Means")


mean(u_sam_vec)                 #mean should be approx. 3 by CLT

## [1] 2.999125

sd(u_sam_vec)                   #SD should be approx. 0.1732 by CLT

## [1] 0.1754574

u_sum_vec <- c()                #create empty vector for the sampling distribution
for (i in 1:10000){                 #10000 simulations
  u_sum<-sum(runif(100,0,6))        #take the TOTAL of sample of 100 r.v.
  u_sum_vec<-c(u_sum_vec,u_sum)}    #add this to the sampling distribution vector
hist(u_sum_vec,freq=F,              #graph the sampling distribution
     main="Histogram of Sample Totals")   
line_fit<-seq(220,400,by=0.001) 
lines(line_fit,dnorm(line_fit,300,17.32),col="green")

mean(u_sum_vec)                 #mean should be approx. 300 by CLT

## [1] 299.8411

sd(u_sum_vec)                   #SD should be approx. 17.32 by CLT

## [1] 17.17556