24: Estimators

Author

Derek Sollberger

Published

March 21, 2023

Today: Estimators

Goal: Explore generalization from samples to populations

Objectives: Show whether each of the following is a biased or an unbiased estimator:

  • sample mean $\bar{x}$
  • sample variance $s^2$
  • sample standard deviation $s$

Demographics Example

From our Demographics Survey data of Math 32 students, suppose that the following is a sample of observations of heights (in inches):

$\{x_{1,1}=72, \; x_{1,2}=61, \; x_{1,3}=60, \; x_{1,4}=75, \; x_{1,5}=69\}$

  • then $t_1 = 67.4$ inches is the sample mean.

Suppose that the following is another sample of heights:

$\{x_{2,1}=66, \; x_{2,2}=78, \; x_{2,3}=78, \; x_{2,4}=77, \; x_{2,5}=64\}$

  • then $t_2 = 72.6$ inches is the sample mean.

Suppose that the following is another sample of heights:

$\{x_{3,1}=61, \; x_{3,2}=59, \; x_{3,3}=70, \; x_{3,4}=61, \; x_{3,5}=65\}$

  • then $t_3 = 63.2$ inches is the sample mean.

Observe: the sample mean (usually) changes with each new set of observations

  • Can we calculate the average height of UC Merced students?
  • How can we calculate the average height of UC Merced students?

Thought: what if we take a mean of the sample means?
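
As a quick check in R (a small sketch using the three samples listed above):

# heights (in inches) from the three samples above
sample1 <- c(72, 61, 60, 75, 69)
sample2 <- c(66, 78, 78, 77, 64)
sample3 <- c(61, 59, 70, 61, 65)

# the three sample means: 67.4, 72.6, 63.2
t1 <- mean(sample1)
t2 <- mean(sample2)
t3 <- mean(sample3)

# mean of the sample means (about 67.73 inches)
mean(c(t1, t2, t3))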

Estimators

Let $T$ be a random variable computed from a sample by some calculation $f$, so that $T = f(X_1, X_2, X_3, \ldots)$

If we are trying to estimate a population parameter $\theta$, we say that $T$ is an unbiased estimator of $\theta$ if $E[T] = \theta$

Today, we will look at situations where f is calculating the

  • mean
  • variance
  • standard deviation
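
Each of the simulations below follows the same pattern: draw many samples from $U(0,1)$, compute the statistic $T = f(X_1, \ldots, X_n)$ on each sample, and compare the average of those values to the known parameter. Here is a minimal sketch of that pattern (the helper name estimate_expected_value is mine, not part of the course code):

# a minimal sketch (hypothetical helper): approximate E[T] for a statistic
# T = f(X1, ..., Xn) by averaging f over many simulated U(0,1) samples
estimate_expected_value <- function(f, n = 25, N = 1337){
  mean(replicate(N, f(runif(n, 0, 1))))
}

# example: the sample mean should land near the population mean of 1/2
# estimate_expected_value(mean)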

Mean

We will run simulations with $X \sim U(0,1)$ because we know what the answers should be. The population mean is

$$\mu = \frac{a+b}{2} = \frac{1}{2}$$

N <- 1337 # number of iterations
n <- 25   # sample size

# pre-allocate vector of space for observations
obs <- rep(NA, N)

# run simulation
for(i in 1:N){
  these_numbers <- runif(n, 0, 1) # sample n numbers from U(0,1)
  obs[i] <- mean(these_numbers) #record average
}

# mean of observations
mean_of_obs <- mean(obs)

# make data frame
df <- data.frame(obs)

# visualization
df |>
  ggplot(aes(x = obs)) +
  geom_density(color = "black", linewidth = 2) +
  geom_vline(xintercept = 1/2, color = "red", linewidth = 3) +
  labs(title = "Simulation Sample Mean",
       subtitle = paste("black: sample distribution\nred: true population mean\nmean of sample means: ", round(mean_of_obs, 4)),
       caption = "Math 32") +
  theme_minimal()

Loosely speaking, since the sampling distribution “lines up” with the population mean, we say that the sample mean is an unbiased estimator of the population mean.

$$E[\bar{X}_n] = E\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right) = \frac{1}{n}E(X_1 + X_2 + \cdots + X_n) = \frac{1}{n}\left(E[X_1] + E[X_2] + \cdots + E[X_n]\right) = \frac{1}{n}(\mu + \mu + \cdots + \mu) = \frac{1}{n}(n\mu)$$

Therefore $E[\bar{X}_n] = \mu$

Population Variance

We will run simulations with $X \sim U(0,1)$ because we know what the answers should be. The population variance is

$$\sigma^2 = \frac{(b-a)^2}{12} = \frac{1}{12}$$

We will explore what happens if we apply the population variance formula

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

to samples.

# user-defined function
pop_var <- function(x){
  N <- sum(!is.na(x))         # population size (count of non-missing values)
  mu <- mean(x, na.rm = TRUE) # population mean
  
  # return population variance (note the use of "N" in the denominator)
  sum( (x - mu)^2, na.rm = TRUE ) / N
}

N <- 1337 # number of iterations
n <- 25   # sample size

# pre-allocate vector of space for observations
obs <- rep(NA, N)

# run simulation
for(i in 1:N){
  these_numbers <- runif(n, 0, 1) # sample n numbers from U(0,1)
  obs[i] <- pop_var(these_numbers) #record population variance
}

# mean of observations
mean_of_obs <- mean(obs)

# make data frame
df <- data.frame(obs)

# visualization
df |>
  ggplot(aes(x = obs)) +
  geom_density(color = "black", linewidth = 2) +
  geom_vline(xintercept = 1/12, color = "red", linewidth = 3) +
  labs(title = "Simulation of Population Variances",
       subtitle = paste("black: sample distribution\nred: true population variance\nmean of population variances: ", round(mean_of_obs, 4)),
       caption = "Math 32") +
  theme_minimal()

Loosely speaking, since the sampling distribution tends to underestimate the population variance, we say that the population variance formula (with $N$ in the denominator) applied to a sample is a biased estimator of the population variance.
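
As a quick sanity check of how large the bias is, we can rescale the average of the simulated values by $\frac{n}{n-1}$, which foreshadows Bessel's correction below (using mean_of_obs and n from the simulation above):

# the average of the pop_var() values sits below the true value of 1/12;
# rescaling by n/(n-1) should bring it much closer (Bessel's correction)
mean_of_obs
mean_of_obs * n / (n - 1)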

Bessel’s Correction

Can we rescale the process for computing variance so that the operation is an unbiased estimator for the population variance?

Let Xi be a set of n i.i.d. random variables from the same distribution with the same population variance σ2. By independence, there is zero covariance.

We will compute the value of $k$ so that

$$E\left[k \cdot \frac{\sum_{i=1}^{n}(X_i - \bar{X}_n)^2}{n}\right] = \sigma^2$$

Lemma: $\text{Var}(X_i - \bar{X}_n) = \frac{n-1}{n}\sigma^2$
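
Filling in the algebra between the lemma and the result: since $E[X_i - \bar{X}_n] = 0$, each squared deviation has expected value equal to its variance, so

$$E\left[(X_i - \bar{X}_n)^2\right] = \text{Var}(X_i - \bar{X}_n) = \frac{n-1}{n}\sigma^2$$

$$E\left[k \cdot \frac{\sum_{i=1}^{n}(X_i - \bar{X}_n)^2}{n}\right] = k \cdot \frac{1}{n} \cdot n \cdot \frac{n-1}{n}\sigma^2 = k \cdot \frac{n-1}{n}\sigma^2$$

Setting this equal to $\sigma^2$ gives $k = \frac{n}{n-1}$, so that $k \cdot \frac{1}{n} = \frac{1}{n-1}$.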

We have derived the formula for the sample variance

$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$$

That is, the “$n-1$” (Bessel’s correction) is in place so that the sample variance $s^2$ is an unbiased estimator of the population variance $\sigma^2$.

Sample Variance

We will run simulations with $X \sim U(0,1)$ because we know what the answers should be. The population variance is

$$\sigma^2 = \frac{(b-a)^2}{12} = \frac{1}{12}$$

We will explore what happens if we apply the sample variance formula

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

to samples.

# user-defined function
samp_var <- function(x){
  n  <- sum(!is.na(x))          #sample size (count of non-missing values)
  xbar <- mean(x, na.rm = TRUE) #sample mean
  
  # return sample variance (note the use of "n-1" in the denominator)
  sum( (x - xbar)^2, na.rm = TRUE ) / (n-1)
}

N <- 1337 # number of iterations
n <- 25   # sample size

# pre-allocate vector of space for observations
obs <- rep(NA, N)

# run simulation
for(i in 1:N){
  these_numbers <- runif(n, 0, 1) # sample n numbers from U(0,1)
  obs[i] <- samp_var(these_numbers) #record sample variance
}

# mean of observations
mean_of_obs <- mean(obs)

# make data frame
df <- data.frame(obs)

# visualization
df |>
  ggplot(aes(x = obs)) +
  geom_density(color = "black", linewidth = 2) +
  geom_vline(xintercept = 1/12, color = "red", linewidth = 3) +
  labs(title = "Simulation of Sample Variances",
       subtitle = paste("black: sample distribution\nred: true population variance\nmean of sample variances: ", round(mean_of_obs, 4)),
       caption = "Math 32") +
  theme_minimal()

Loosely speaking, since the sampling distribution “lines up” with the population variance, we say that the sample variance (with $n-1$) is an unbiased estimator of the population variance.
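
Note that samp_var() matches base R’s var(), which already divides by $n-1$; a quick check on an arbitrary sample:

# base R's var() also uses the n-1 denominator, so the difference should be
# zero (up to floating-point error)
x <- runif(25, 0, 1)
samp_var(x) - var(x)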

Sample Standard Deviation

We will run simulations with $X \sim U(0,1)$ because we know what the answers should be. The population standard deviation is

$$\sigma = \sqrt{\frac{(b-a)^2}{12}} = \sqrt{\frac{1}{12}}$$

We will explore what happens if we apply the sample standard deviation formula

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

to samples.

# user-defined function
samp_var <- function(x){
  n  <- sum(!is.na(x))          #sample size (count of non-missing values)
  xbar <- mean(x, na.rm = TRUE) #sample mean
  
  # return sample variance (note the use of "n-1" in the denominator)
  sum( (x - xbar)^2, na.rm = TRUE ) / (n-1)
}

N <- 1337 # number of iterations
n <- 25   # sample size

# pre-allocate vector of space for observations
obs <- rep(NA, N)

# run simulation
for(i in 1:N){
  these_numbers <- runif(n, 0, 1) # sample n numbers from U(0,1)
  obs[i] <- sqrt(samp_var(these_numbers)) #record sample standard deviation
}

# mean of observations
mean_of_obs <- mean(obs)

# make data frame
df <- data.frame(obs)

# visualization
df |>
  ggplot(aes(x = obs)) +
  geom_density(color = "black", linewidth = 2) +
  geom_vline(xintercept = sqrt(1/12), color = "red", linewidth = 3) +
  labs(title = "Simulation of Sample Standard Deviations",
       subtitle = paste("black: sample distribution\nred: true population standard deviation\nmean of sample standard deviations: ", round(mean_of_obs, 4)),
       caption = "Math 32") +
  theme_minimal()

Let $X_i$ be a set of $n$ i.i.d. random variables from the same distribution with the same population standard deviation $\sigma$. To avoid trivial situations, assume non-zero variance, so $\sigma \neq 0$.

If $S_n = \sqrt{\dfrac{\sum_{i=1}^{n}(X_i - \bar{X}_n)^2}{n-1}}$ were an unbiased estimator, then we would have $E[S_n] = \sigma$.

However, by Jensen’s Inequality, since $g(x) = x^2$ is a strictly convex function and $S_n$ is not constant,

$$\sigma^2 = E[S_n^2] > (E[S_n])^2$$

and it follows that $E[S_n] < \sigma$. Due to this underestimation, the sample standard deviation $s$ is a biased estimator of the population standard deviation $\sigma$.

 

However, in practice, the discrepancy is usually so small that it is ignored.
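
To quantify “small” here, we can compare the average of the simulated sample standard deviations to $\sigma = \sqrt{1/12} \approx 0.2887$ (using mean_of_obs from the simulation above); for $n = 25$ the relative discrepancy is typically on the order of one percent:

# relative discrepancy between the average simulated s and the true sigma
sigma <- sqrt(1/12)
(sigma - mean_of_obs) / sigma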

Looking Ahead

  • due Fri., Mar. 24:
    • LHW8
  • no lecture on Mar. 24, Apr. 3
  • Exam 2 will be on Mon., Apr. 10