Math 32 - 24: Estimators

Today: Estimators

Goal: Explore generalization from samples to populations

Objectives: Show the biased or unbiased estimation via

sample mean $\bar{x}$
sample variance $s^{2}$
sample standard deviation $s$

Demographics Example

From our Demographics Survey data of Math 32 students, suppose that the following is a sample of observations of heights (in inches):

${x_{11} = 72, x_{12} = 61, x_{13} = 60, x_{14} = 75, x_{15} = 69}$

then $t_{1} = 67.4$ inches is the sample mean.

Suppose that the following is another sample of heights:

${x_{21} = 66, x_{22} = 78, x_{23} = 78, x_{24} = 77, x_{25} = 64}$

then $t_{2} = 72.6$ inches is the sample mean.

Suppose that the following is another sample of heights:

${x_{31} = 61, x_{32} = 59, x_{33} = 70, x_{34} = 61, x_{35} = 65}$

then $t_{3} = 63.2$ inches is the sample mean.

Sampling Itself is Probabilistic

Observe: the sample mean (usually) changes upon a new set of observations

Can we calculate the average height of UC Merced students?
How can we calculate the average height of UC Merced students?

Thought: what if we take a mean of the sample means?

Estimators

Let $T$ be a random variable and $f$ be some calculation $T = f (x_{1}, x_{2}, x_{3}, . . .)$

If we are trying to estimate a population parameter $θ$ , we say that $T$ is an unbiased estimator of $θ$ if $E [T] = θ$

Today, we will look at situations where $f$ is calculating the

mean
variance
standard deviation

Mean

We will run simulations with $X \sim U (0, 1)$ because we know what the answers should be. The population mean is

$μ = \frac{a + b}{2} = \frac{1}{2}$

N <- 1337 # number of iterations
n <- 25   # sample size

# pre-allocate vector of space for observations
obs <- rep(NA, N)

# run simulation
for(i in 1:N){
  these_numbers <- runif(n, 0, 1) # sample n numbers from U(0,1)
  obs[i] <- mean(these_numbers) #record average
}

# mean of observations
mean_of_obs <- mean(obs)

# make data frame
df <- data.frame(obs)

# visualization
df |>
  ggplot(aes(x = obs)) +
  geom_density(color = "black", size = 2) +
  geom_vline(xintercept = 1/2, color = "red", size = 3) +
  labs(title = "Simulation Sample Mean",
       subtitle = paste("black: sample distribution\nred: true population mean\nmean of sample means: ", round(mean_of_obs, 4)),
       caption = "Math 32") +
  theme_minimal()

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Loosely speaking, since the sampling distribution “lines up” with the population mean, we say that the sample median is an unbiased estimator of the population mean.

\begin{array}{rcl} E [{\bar{X}}_{n}] & = & E (\frac{X_{1} + X_{2} + . . . + X_{n}}{n}) \\ = & \frac{1}{n} E (X_{1} + X_{2} + . . . + X_{n}) \\ = & \frac{1}{n} (E [X_{1}] + E [X_{2}] + . . . + E [X_{n}]) \\ = & \frac{1}{n} (μ + μ + . . . + μ) \\ = & \frac{1}{n} (n μ) \end{array}

Therefore $E [{\bar{X}}_{n}] = μ$

Population Variance

We will run simulations with $X \sim U (0, 1)$ because we know what the answers should be. The population variance is

$σ^{2} = \frac{(b - a)^{2}}{12} = \frac{1}{12}$

We will explore what happens if we apply the population variance formula

$σ^{2} = \frac{1}{N} \sum_{i = 1}^{N} (x_{i} - μ)^{2}$

to samples.

# user-defined function
pop_var <- function(x){
  N <- length(!is.na(x)) #population size
  mu <- mean(x, na.rm = TRUE) #population mean
  
  # return population mean (note use of "N")
  sum( (x - mu)^2 ) / N
}

N <- 1337 # number of iterations
n <- 25   # sample size

# pre-allocate vector of space for observations
obs <- rep(NA, N)

# run simulation
for(i in 1:N){
  these_numbers <- runif(n, 0, 1) # sample n numbers from U(0,1)
  obs[i] <- pop_var(these_numbers) #record population variance
}

# mean of observations
mean_of_obs <- mean(obs)

# make data frame
df <- data.frame(obs)

# visualization
df |>
  ggplot(aes(x = obs)) +
  geom_density(color = "black", size = 2) +
  geom_vline(xintercept = 1/12, color = "red", size = 3) +
  labs(title = "Simulation of Population Variances",
       subtitle = paste("black: sample distribution\nred: true population variance\nmean of population variances: ", round(mean_of_obs, 4)),
       caption = "Math 32") +
  theme_minimal()

Loosely speaking, since the sampling distribution tends to underestimate the population variance, we say that the population variance (with $N$ ) is a biased estimator of the population variance.

Bessel’s Correction

Can we rescale the process for computing variance so that the operation is an unbiased estimator for the population variance?

Let $X_{i}$ be a set of $n$ i.i.d. random variables from the same distribution with the same population variance $σ^{2}$ . By independence, there is zero covariance.

We will compute the value of $k$ so that

$E [k \cdot \frac{\sum_{i = 1}^{n} (X_{i} - {\bar{X}}_{n})^{2}}{n}] = σ^{2}$

Lemma: $Var (X_{i} - {\bar{X}}_{n}) = \frac{n - 1}{n} \cdot σ^{2}$

Bessel’s Correction

We have derived the formula for the sample variance

$S_{n}^{2} = \frac{1}{n - 1} \sum_{i = 1}^{n} (X_{i} - {\bar{X}}_{n})^{2}$

That is, the “ $n - 1$ ” (Bessel’s correction) is in place so that the sample variance $s^{2}$ is an unbiased estimator of the population variance $σ^{2}$

Sample Variance

We will run simulations with $X \sim U (0, 1)$ because we know what the answers should be. The population variance is

$σ^{2} = \frac{(b - a)^{2}}{12} = \frac{1}{12}$

We will explore what happens if we apply the sample variance formula

$s^{2} = \frac{1}{n - 1} \sum_{i = 1}^{n} (x_{i} - μ)^{2}$

to samples.

# user-defined function
samp_var <- function(x){
  n  <- length(!is.na(x)) #sample size
  xbar <- mean(x, na.rm = TRUE) #sample mean
  
  # return population mean (note use of "n-1")
  sum( (x - xbar)^2 ) / (n-1)
}

N <- 1337 # number of iterations
n <- 25   # sample size

# pre-allocate vector of space for observations
obs <- rep(NA, N)

# run simulation
for(i in 1:N){
  these_numbers <- runif(n, 0, 1) # sample n numbers from U(0,1)
  obs[i] <- samp_var(these_numbers) #record sample variance
}

# mean of observations
mean_of_obs <- mean(obs)

# make data frame
df <- data.frame(obs)

# visualization
df |>
  ggplot(aes(x = obs)) +
  geom_density(color = "black", size = 2) +
  geom_vline(xintercept = 1/12, color = "red", size = 3) +
  labs(title = "Simulation of Sample Variances",
       subtitle = paste("black: sample distribution\nred: true population variance\nmean of sample variances: ", round(mean_of_obs, 4)),
       caption = "Math 32") +
  theme_minimal()

Loosely speaking, since the sampling distribution “lines up” with the population variance, we say that the sample variance (with $n - 1$ ) is an unbiased estimator of the population variance.

Sample Standard Deviation

We will run simulations with $X \sim U (0, 1)$ because we know what the answers should be. The population standard deviation is

$σ = \sqrt{\frac{(b - a)^{2}}{12}} = \sqrt{\frac{1}{12}}$

We will explore what happens if we apply the sample variance formula

$s = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^{n} (x_{i} - μ)^{2}}$

to samples.

# user-defined function
samp_var <- function(x){
  n  <- length(!is.na(x)) #sample size
  xbar <- mean(x, na.rm = TRUE) #sample mean
  
  # return population mean (note use of "n-1")
  sum( (x - xbar)^2 ) / (n-1)
}

N <- 1337 # number of iterations
n <- 25   # sample size

# pre-allocate vector of space for observations
obs <- rep(NA, N)

# run simulation
for(i in 1:N){
  these_numbers <- runif(n, 0, 1) # sample n numbers from U(0,1)
  obs[i] <- sqrt(samp_var(these_numbers)) #record sample standard deviation
}

# mean of observations
mean_of_obs <- mean(obs)

# make data frame
df <- data.frame(obs)

# visualization
df |>
  ggplot(aes(x = obs)) +
  geom_density(color = "black", size = 2) +
  geom_vline(xintercept = sqrt(1/12), color = "red", size = 3) +
  labs(title = "Simulation of Sample Variances",
       subtitle = paste("black: sample distribution\nred: true population variance\nmean of sample standard deviations: ", round(mean_of_obs, 4)),
       caption = "Math 32") +
  theme_minimal()

Let $X_{i}$ be a set of $n$ i.i.d. random variables from the same distribution with the same population standard deviation $σ$ . To avoid trivial situations, assume non-zero variance, so $σ \neq 0$ .

If $s = \sqrt{\frac{\sum_{i = 1}^{n} (X_{i} - {\bar{X}}_{n})^{2}}{n - 1}}$ was an unbiased estimator, then $E [s] = σ$

However, by Jensen’s Inequality, since $g (x) = x^{2}$ is a convex function,

$σ^{2} = E [S_{n}^{2}] > {(E [S_{n}])}^{2}$

and it follows that $E [S_{n}] < σ$ . Due to the underestimation, sample standard deviation $s$ is a biased estimator of population standard deviation $σ$ .

However, in practice, the discrepancy is usually so small that it is ignored.

Looking Ahead

due Fri., Mar. 24:
- LHW8
no lecture on Mar. 24, Apr. 3
Exam 2 will be on Mon., Apr. 10