PRESTIGE ED
N7: Point Estimation: MOM and MLE
Node N7 — Section 1

Why This Concept Exists

Point estimation is the bridge between probability theory and statistical practice. While probability theory asks "given a parameter, what is the distribution of the data?", estimation asks the inverse: "given the data, what is the best guess for the parameter?"

This node covers the two dominant methods for constructing estimators at the PTS2 level:

  • Method of Moments (MOM): The older, simpler method. Equate population moments to sample moments and solve. Fast, intuitive, but often suboptimal.
  • Maximum Likelihood Estimation (MLE): The gold standard. Find the parameter value that makes the observed data most probable. Optimal in large samples, but requires calculus and careful reasoning.
Leverage: MLE appears in roughly 60-70% of PTS2 exams. You can expect an MLE derivation question worth 8-12 marks in any paper. The MOM method is less frequently tested standalone but appears as a comparison point (MLE vs MOM efficiency). Both methods have predictable derivation patterns that can be learned and rehearsed.

Examiners love MLE because the derivation is standardised, the answer is unique, and checking the work is straightforward. They also love to test edge cases (discrete parameters, boundary solutions, non-differentiable likelihoods) that separate students who understand the concept from those who memorise the recipe.


Node N7 — Section 2

Prerequisites

Before engaging with this node, you should be comfortable with:

  • Population moments: \(E[X^k]\) for \(k = 1, 2, \ldots\). The first moment is the mean, the second central moment is the variance.
  • Sample moments: \(\bar{X} = \frac{1}{n}\sum X_i\) (first sample moment), \(m_2 = \frac{1}{n}\sum X_i^2\) (second sample raw moment). Know the difference between raw and central moments.
  • Differentiation: Finding maxima of functions by setting derivatives to zero, second derivative test, chain rule, product rule.
  • Logarithms: \(\ln(a^b) = b\ln(a)\), \(\ln(ab) = \ln(a) + \ln(b)\). The log-likelihood uses these properties to convert products into sums.
  • Product notation: The likelihood is typically \(\prod_{i=1}^{n} f(x_i;\theta)\), so you need to be comfortable with manipulating products.
  • Basic discrete distributions: Binomial, Poisson, Geometric — their PMFs, means, and variances.
  • Basic continuous distributions: Uniform, Exponential, Normal — their PDFs, means, and variances.
Mental model: MOM is "match the averages". MLE is "make the data as probable as possible". Both aim to produce a good estimator, but they define "good" differently.

Node N7 — Section 3

Core Exposition

3.1 What Is an Estimator?

An estimator is a function of the sample data used to guess an unknown population parameter \(\theta\). Formally: if \(X_1, \ldots, X_n\) is a random sample, an estimator is \(\hat{\theta} = g(X_1, \ldots, X_n)\).

Properties we desire in an estimator:

Unbiasedness: \(E[\hat{\theta}] = \theta\) for all values of \(\theta\).
Consistency: \(\hat{\theta} \xrightarrow{P} \theta\) as \(n \to \infty\).
Efficiency: Among unbiased estimators, the one with the smallest variance is most efficient.
Mean Squared Error: \(\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2\).

3.2 Method of Moments (MOM)

The method of moments is conceptually simple:

Algorithm:
1. Express the first \(k\) population moments \(E[X^j]\) as functions of the unknown parameter(s).
2. Compute the corresponding sample moments \(m_j = \frac{1}{n}\sum X_i^j\).
3. Set population moments equal to sample moments: \(E[X^j] = m_j\) for \(j = 1, \ldots, k\).
4. Solve the resulting system of equations for the parameter(s).
Why it works: By the Law of Large Numbers, sample moments converge to population moments as \(n \to \infty\). So equating them is a natural way to produce a consistent estimator.
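The Law of Large Numbers argument can be seen numerically. A minimal sketch (the rate \(\lambda = 2\), seed, and sample sizes are illustrative choices, not from the source):

```python
import random

# Sketch: sample moments of an Exponential(rate = 2) sample approach the
# population moments E[X] = 1/lambda = 0.5 and E[X^2] = 2/lambda^2 = 0.5
# as n grows (Law of Large Numbers).
random.seed(1)

def sample_moments(n, rate=2.0):
    xs = [random.expovariate(rate) for _ in range(n)]
    m1 = sum(xs) / n                  # first sample moment (the sample mean)
    m2 = sum(x * x for x in xs) / n   # second raw sample moment
    return m1, m2

for n in (100, 10_000, 1_000_000):
    m1, m2 = sample_moments(n)
    print(f"n={n:>9}: m1={m1:.3f}, m2={m2:.3f}")
```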

3.3 Maximum Likelihood Estimation (MLE)

The MLE finds the parameter value that maximises the probability (or density) of the observed data:

Algorithm:
1. Write the likelihood function: \(L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)\) (continuous case) or \(\prod_{i=1}^{n} P(X = x_i; \theta)\) (discrete case).
2. Take the log-likelihood: \(\ell(\theta) = \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i; \theta)\).
3. Differentiate: \(\dfrac{d\ell}{d\theta}\), set equal to zero, solve for \(\theta\).
4. Verify it's a maximum: Check \(\dfrac{d^2\ell}{d\theta^2} < 0\) at the solution.
5. Check boundary cases: if the MLE from calculus falls outside the parameter space, or if the likelihood is monotone, the MLE may be at a boundary.
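When step 3 has no closed form, the maximisation can be done numerically. A sketch using a crude grid search on simulated Exponential(1.5) data, where the closed-form answer \(1/\bar{X}\) is available for comparison (all parameter values are illustrative):

```python
import math
import random

# Sketch: maximise the exponential log-likelihood
#   l(lambda) = n*ln(lambda) - lambda*sum(x)
# by grid search, and compare with the closed-form MLE 1/xbar.
random.seed(2)
xs = [random.expovariate(1.5) for _ in range(500)]
n, s = len(xs), sum(xs)

def loglik(lam):
    return n * math.log(lam) - lam * s

grid = [0.01 * k for k in range(1, 1000)]   # lambda in (0, 10), step 0.01
lam_numeric = max(grid, key=loglik)         # step 3, done numerically
lam_closed = n / s                          # the closed form 1/xbar

print(f"grid search: {lam_numeric:.2f}, closed form: {lam_closed:.2f}")
```

A proper optimiser would replace the grid in practice; the point is that the numeric and analytic answers agree.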

3.4 Properties of MLE (Large Sample Results)

Consistency: \(\hat{\theta}_{\text{MLE}} \xrightarrow{P} \theta\) under regularity conditions.
Asymptotic Normality: \(\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta) \xrightarrow{d} N\!\left(0, \dfrac{1}{I(\theta)}\right)\).
Asymptotic Efficiency: MLE asymptotically achieves the Cramér-Rao lower bound.
Invariance: If \(\hat{\theta}\) is the MLE of \(\theta\), then \(g(\hat{\theta})\) is the MLE of \(g(\theta)\) for any function \(g\).

where the Fisher information is:

\(I(\theta) = -E\!\left[\dfrac{d^2\ell}{d\theta^2}\right] = \text{Var}\!\left[\dfrac{d\ell}{d\theta}\right]\)

3.5 The Score and Fisher Information

The score function is the derivative of the log-likelihood:

\(U(\theta) = \dfrac{d\ell}{d\theta}\)
The MLE solves \(U(\hat{\theta}) = 0\).

The expected Fisher information: \(I(\theta) = -E\!\left[\dfrac{d^2\ell}{d\theta^2}\right]\).
For an i.i.d. sample: \(I_n(\theta) = n \cdot I_1(\theta)\) where \(I_1\) is the information from one observation.
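The identity \(I(\theta) = \text{Var}[U(\theta)]\) can be checked by simulation. A sketch for \(\text{Exp}(\lambda)\), where the score at the true parameter is \(U(\lambda) = n/\lambda - \sum X_i\) and \(I_n(\lambda) = n/\lambda^2\) (the values \(\lambda = 2\), \(n = 50\), and the seed are illustrative):

```python
import random

# Sketch: the variance of the score U(lambda) = n/lambda - sum(x_i),
# evaluated at the true lambda, should match I_n(lambda) = n / lambda^2.
random.seed(3)
lam, n, reps = 2.0, 50, 20_000

scores = []
for _ in range(reps):
    s = sum(random.expovariate(lam) for _ in range(n))
    scores.append(n / lam - s)

mean = sum(scores) / reps
var = sum((u - mean) ** 2 for u in scores) / reps
print(f"simulated Var[U] = {var:.2f}, theory n/lambda^2 = {n / lam**2}")
```

The simulated mean of the score is also near zero, reflecting \(E[U(\theta)] = 0\) at the true parameter.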

3.6 Comparing MOM and MLE

  • MOM advantages: fast, no calculus needed, closed form even for complex distributions.
  • MOM disadvantages: can be biased, inefficient, sometimes outside the parameter space.
  • MLE advantages: consistent, asymptotically normal and efficient, invariant under transformation.
  • MLE disadvantages: requires differentiation, may not have a closed form, sensitive to boundary issues.

Node N7 — Section 4

Worked Examples

Example 1: MOM and MLE for Exponential Distribution

Let \(X_1, \ldots, X_n\) be i.i.d. from the exponential distribution with PDF \(f(x; \lambda) = \lambda e^{-\lambda x}\) for \(x > 0\). Find both MOM and MLE estimators of \(\lambda\).

MOM: Match the first moment. Population mean: \(E[X] = 1/\lambda\).
Sample mean: \(\bar{X} = \frac{1}{n}\sum X_i\).
Equate: \(1/\lambda = \bar{X}\).
Therefore: \(\hat{\lambda}_{\text{MOM}} = \dfrac{1}{\bar{X}}\).

Bias check: Since \(\bar{X} \sim \text{Gamma}(n, n\lambda)\), we have \(E[1/\bar{X}] = \frac{n}{n-1}\cdot\lambda\), so \(\hat{\lambda}_{\text{MOM}}\) is biased (overestimates), but asymptotically unbiased as \(n \to \infty\).
MLE: Write the likelihood \(L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda X_i} = \lambda^n \exp\!\left(-\lambda\sum X_i\right)\)
Log-likelihood and derivative \(\ell(\lambda) = n\ln\lambda - \lambda\sum_{i=1}^{n} X_i = n\ln\lambda - n\lambda \bar{X}\)
\(\dfrac{d\ell}{d\lambda} = \dfrac{n}{\lambda} - n\bar{X} = 0\)
Therefore: \(\hat{\lambda}_{\text{MLE}} = \dfrac{1}{\bar{X}}\).

Second derivative test: \(\dfrac{d^2\ell}{d\lambda^2} = -\dfrac{n}{\lambda^2} < 0\) for all \(\lambda > 0\). \(\checkmark\) Maximum confirmed.
Interesting result: For the exponential distribution, MOM and MLE give the same estimator: \(\hat{\lambda} = 1/\bar{X}\). However, this estimator is biased: \(E[\hat{\lambda}] = \frac{n}{n-1}\cdot\lambda\). An unbiased version would be \(\tilde{\lambda} = \frac{n-1}{n\bar{X}}\).
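The bias factor \(n/(n-1)\) is easy to see in simulation. A sketch with \(\lambda = 1\) and \(n = 5\) (illustrative values), where \(E[1/\bar{X}]\) should be about \(5/4 = 1.25\):

```python
import random

# Sketch: average 1/xbar over many small exponential samples; the
# result should be near n/(n-1) * lambda = 1.25, not lambda = 1.
random.seed(4)
lam, n, reps = 1.0, 5, 200_000

total = 0.0
for _ in range(reps):
    xbar = sum(random.expovariate(lam) for _ in range(n)) / n
    total += 1.0 / xbar

est_mean = total / reps
print(f"E[1/xbar] approx {est_mean:.3f} (theory {n / (n - 1) * lam})")
```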

Example 2: MLE for Normal Distribution — Unknown Mean and Variance

Let \(X_1, \ldots, X_n\) be i.i.d. \(N(\mu, \sigma^2)\) where both \(\mu\) and \(\sigma^2\) are unknown. Find the MLEs of \(\mu\) and \(\sigma^2\).

Step 1: Likelihood function \(L(\mu, \sigma^2) = \prod_{i=1}^{n} \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\dfrac{(X_i - \mu)^2}{2\sigma^2}\right)\)
\(= (2\pi\sigma^2)^{-n/2} \exp\!\left(-\dfrac{1}{2\sigma^2}\sum(X_i - \mu)^2\right)\)
Step 2: Log-likelihood \(\ell(\mu, \sigma^2) = -\dfrac{n}{2}\ln(2\pi) - \dfrac{n}{2}\ln(\sigma^2) - \dfrac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i - \mu)^2\)
Step 3: Differentiate with respect to μ \(\dfrac{\partial\ell}{\partial\mu} = \dfrac{1}{\sigma^2}\sum_{i=1}^{n}(X_i - \mu) = 0\)
\(\displaystyle \sum X_i - n\mu = 0 \quad \Rightarrow \quad \hat{\mu}_{\text{MLE}} = \bar{X}\)   \(\checkmark\)
Step 4: Differentiate with respect to σ² Treat \(\sigma^2\) as a single parameter:
\(\dfrac{\partial\ell}{\partial\sigma^2} = -\dfrac{n}{2\sigma^2} + \dfrac{1}{2(\sigma^2)^2}\sum(X_i - \mu)^2 = 0\)
\(\dfrac{n}{2\sigma^2} = \dfrac{1}{2(\sigma^2)^2}\sum(X_i - \mu)^2\)
\(\hat{\sigma}^2_{\text{MLE}} = \dfrac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu})^2 = \dfrac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\)
Critical exam point: The MLE of \(\sigma^2\) uses denominator \(n\), not \(n-1\). This means it is biased: \(E[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2\). The unbiased version uses \(n-1\) (Bessel's correction). Examiners often ask: "Is the MLE unbiased?" The answer is no for the variance.
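The bias factor \((n-1)/n\) shows up clearly in simulation. A minimal sketch (the values \(\sigma = 2\), \(n = 10\), and the seed are illustrative choices, not from the source):

```python
import random

# Sketch: average the MLE variance estimate (denominator n) over many
# normal samples; with sigma^2 = 4 and n = 10 the average should be
# near (n-1)/n * sigma^2 = 3.6, not 4.
random.seed(5)
mu, sigma, n, reps = 0.0, 2.0, 10, 100_000

total = 0.0
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    total += sum((x - xbar) ** 2 for x in xs) / n   # denominator n, not n-1

avg = total / reps
print(f"E[sigma2_mle] approx {avg:.2f} (theory 3.6)")
```

Replacing the denominator with \(n-1\) in the same simulation would bring the average back to \(\sigma^2 = 4\), which is exactly Bessel's correction.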

Example 3: MLE for a Uniform Distribution — The Boundary Case

Let \(X_1, \ldots, X_n\) be i.i.d. from \(U(0, \theta)\). Find the MLE of \(\theta\).

The Likelihood \(f(x; \theta) = \dfrac{1}{\theta}\) for \(0 \leq x \leq \theta\), and 0 otherwise.
\(L(\theta) = \prod_{i=1}^{n} \dfrac{1}{\theta} \cdot \mathbb{I}(0 \leq X_i \leq \theta) = \dfrac{1}{\theta^n} \cdot \mathbb{I}(\theta \geq \max X_i)\)
where \(\mathbb{I}(\cdot)\) is the indicator function. The likelihood is zero if \(\theta\) is less than any observed value.
Why Differentiation Fails On the region \(\theta \geq \max X_i\), \(\ln L = -n\ln\theta\), so \(\dfrac{d\ln L}{d\theta} = -\dfrac{n}{\theta}\). Setting this to zero gives no solution!
The log-likelihood is strictly decreasing in \(\theta\), but \(\theta\) cannot go to infinity — it must be at least as large as the largest observation.
The Solution Since \(L(\theta) = 1/\theta^n\) is a strictly decreasing function, it is maximised when \(\theta\) is as small as possible. The constraint is \(\theta \geq \max\{X_1, \ldots, X_n\}\).

Therefore: \(\hat{\theta}_{\text{MLE}} = \max\{X_1, \ldots, X_n\} = X_{(n)}\).

This is an order statistic! (connecting to N5.)
Key lesson: When the derivative method fails, examine the shape of the likelihood and its domain of definition. The support of the distribution (which depends on \(\theta\)) creates the constraint. This is the single most important MLE technique beyond the standard calculus approach.
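The boundary argument can be sketched numerically (the true \(\theta = 3\), sample size, and seed below are illustrative): the likelihood is zero below \(\max X_i\) and strictly decreasing above it, so scanning upward from the boundary never improves on it.

```python
import random

# Sketch: for U(0, theta) data, L(t) = t^(-n) for t >= max(x) and 0
# below it, so the maximiser is the boundary value max(x) itself.
random.seed(6)
theta, n = 3.0, 40
xs = [random.uniform(0, theta) for _ in range(n)]
x_max = max(xs)

def likelihood(t):
    return t ** (-n) if t >= x_max else 0.0

candidates = [x_max + 0.05 * k for k in range(60)]   # scan upward from max(x)
best = max(candidates, key=likelihood)

print(f"MLE candidate = {best:.4f}, max(x) = {x_max:.4f}")
```

Note that \(X_{(n)}\) always sits slightly below the true \(\theta\), which is why this MLE is biased downward.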

Example 4: MOM for a Two-Parameter Distribution

Let \(X_1, \ldots, X_n\) be i.i.d. from the Gamma distribution with shape \(\alpha\) and rate \(\beta\), with \(E[X] = \alpha/\beta\) and \(\text{Var}(X) = \alpha/\beta^2\). Find the MOM estimators of \(\alpha\) and \(\beta\).

Set up the system
First moment: \(\dfrac{\alpha}{\beta} = \bar{X}\)    (equation 1)
Second moment (via variance): \(\dfrac{\alpha}{\beta^2} = \dfrac{1}{n}\sum X_i^2 - \bar{X}^2 = S_n^2\)    (equation 2)

From equation 1: \(\alpha = \beta\bar{X}\).
Substitute into equation 2: \(\dfrac{\beta\bar{X}}{\beta^2} = \dfrac{\bar{X}}{\beta} = S_n^2\).
Therefore: \(\hat{\beta}_{\text{MOM}} = \dfrac{\bar{X}}{S_n^2}\), and \(\hat{\alpha}_{\text{MOM}} = \hat{\beta}_{\text{MOM}}\bar{X} = \dfrac{\bar{X}^2}{S_n^2}\).
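These formulas can be sanity-checked on simulated Gamma data. A sketch (shape \(\alpha = 3\), rate \(\beta = 2\), and the seed are illustrative; note that Python's `random.gammavariate` takes a *scale* argument, so scale \(= 1/\beta\)):

```python
import random

# Sketch: recover alpha and beta from simulated Gamma(shape=3, rate=2)
# data using the MOM formulas beta_hat = xbar/S2, alpha_hat = xbar^2/S2.
random.seed(7)
alpha, rate, n = 3.0, 2.0, 200_000
xs = [random.gammavariate(alpha, 1.0 / rate) for _ in range(n)]

xbar = sum(xs) / n
s2 = sum(x * x for x in xs) / n - xbar ** 2   # (1/n) sum x^2 - xbar^2

beta_hat = xbar / s2
alpha_hat = xbar ** 2 / s2
print(f"alpha_hat = {alpha_hat:.2f}, beta_hat = {beta_hat:.2f}")
```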

Node N7 — Section 5

Pattern Recognition & Examiner Traps

Trap 1: Forgetting the second derivative test Some exam questions specifically require you to verify that your critical point is a maximum, not a minimum or inflection point. Always check \(\dfrac{d^2\ell}{d\theta^2} < 0\) for MLE unless the question says "you do not need to verify".
Trap 2: Ignoring the parameter space in MLE The Uniform distribution example (N7) is the canonical case, but the principle applies generally. If the parameter appears in the support of the distribution (e.g., \(U(\theta_1, \theta_2)\), \(U(0, \theta)\)), differentiation may not work. Always check: is \(\theta\) in the domain of the PDF?
WRONG Always differentiate the log-likelihood and set to zero. This works for all MLE problems.
RIGHT First check if \(\theta\) appears in the support. If yes, the MLE may be at a boundary (like \(X_{(n)}\)). If no, proceed with differentiation.
Trap 3: Using the wrong denominator for sample variance in MOM When MOM asks you to match the second central moment to the population variance, the sample analogue is \(\frac{1}{n}\sum(X_i - \bar{X})^2\), not \(\frac{1}{n-1}\sum(X_i - \bar{X})^2\). The \(n-1\) version is for unbiasedness; MOM is about matching moments, not about unbiasedness.
Trap 4: Confusing MLE invariance with invariance of bias If \(\hat{\theta}\) is the MLE of \(\theta\), then \(g(\hat{\theta})\) is the MLE of \(g(\theta)\). This does NOT mean that if \(\hat{\theta}\) is unbiased for \(\theta\), then \(g(\hat{\theta})\) is unbiased for \(g(\theta)\). In fact, by Jensen's inequality, \(g(\hat{\theta})\) is usually biased even if \(\hat{\theta}\) is unbiased (for nonlinear \(g\)).
Examiner patterns to recognise:
  • "Show that the MLE of θ is ..." — follow the standard MLE recipe. Write L, take log, differentiate, solve.
  • "Find the MOM estimator of θ" — equate population moments to sample moments and solve for the parameter.
  • "Is the MLE biased?" — compute \(E[\hat{\theta}]\) and check whether \(E[\hat{\theta}] = \theta\). If not, compute the bias: \(E[\hat{\theta}] - \theta\).
  • "Find the MLE of \(g(\theta)\)" — find the MLE of \(\theta\) first, then apply invariance: \(g(\hat{\theta})\).
  • "Compare MOM and MLE" — are the estimators the same? If different, which has smaller variance?

Node N7 — Section 6

Connections

How N7 connects to the rest of PTS2:
  • ← N6 (Sampling Distributions): The properties of estimators (bias, variance, efficiency) require knowledge of sampling distributions. The MLE for normal parameters relies on the fact that \(\bar{X} \sim N(\mu, \sigma^2/n)\) and \((n-1)S^2/\sigma^2 \sim \chi^2(n-1)\).
  • → N8 (Confidence Intervals): MLEs are used to construct confidence intervals via the pivotal quantity approach. The asymptotic normality of MLE provides another CI method.
  • → N10-N12 (Hypothesis Testing): The likelihood ratio test — the most powerful general purpose test — uses the MLEs under both the null and alternative hypotheses.

Node N7 — Section 7

Summary Table

Method | Principle | Steps | Properties | When to Use
MOM | Match population to sample moments | Set \(E[X^k] = m_k\), solve | Consistent, simple, can be biased/inefficient | Quick estimates, complex distributions
MLE | Maximise likelihood of observed data | Write \(L(\theta)\), log, differentiate, solve | Consistent, asymptotically normal & efficient, may be biased in small samples | Default method, exams, inference
Uniform MLE | Boundary solution | Not differentiable; use the support constraint
MLE Recipe
1. Write \(L(\theta) = \prod f(x_i;\theta)\).
2. Take \(\ln\).
3. Differentiate w.r.t. \(\theta\).
4. Set to 0 and solve.
5. Check second derivative < 0.
Bias ≠ Variance An estimator can be unbiased but have high variance, or biased but have low variance. MSE = Variance + Bias² tells the full story.
Invariance is Powerful MLE of \(g(\theta)\) = \(g(\)MLE of \(\theta\)). This saves you having to re-derive everything for transformed parameters.
Watch the Boundary If \(\theta\) appears in the support, differentiation may fail. Check if the likelihood is monotone and look for the boundary solution.

Node N7 — Section 8

Self-Assessment

Test your understanding before moving to N8:

Can you do all of these?
  • Derive the MLE for \(\lambda\) in a Poisson(\(\lambda\)) distribution. [Answer: \(\hat{\lambda} = \bar{X}\).]
  • Derive the MLE for \(p\) in a Geometric(\(p\)) distribution. [Answer: \(\hat{p} = 1/\bar{X}\).]
  • Find the MOM estimator for \(\theta\) in the distribution \(f(x) = (\theta+1)x^\theta\) on \([0,1]\).
  • Show that for the normal distribution, \(\bar{X}\) and \(S^2\) are unbiased for \(\mu\) and \(\sigma^2\) respectively.
  • Explain why the MLE of \(\sigma^2\) for a normal distribution is biased, but the MLE of \(\mu\) is unbiased.
  • Given the MLE \(\hat{\lambda}\), find the MLE of \(e^{\lambda}\) using invariance.
Practice Problems
  • Let \(X_1, \ldots, X_n\) be i.i.d. from a Pareto distribution: \(f(x;\theta) = \theta/x^{\theta+1}\) for \(x > 1\). Find the MLE of \(\theta\). [Answer: \(\hat{\theta}_{\text{MLE}} = n/\sum \ln X_i\).]
  • For a Binomial(\(n, p\)) sample, find the MOM estimator of \(p\). [Answer: \(\hat{p} = \bar{X}/n\).]
  • True or false: If \(\hat{\theta}\) is the MLE of \(\theta\), then \(\hat{\theta}/2\) is the MLE of \(\theta/2\). [Answer: True, by invariance.]
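The Pareto answer above can be verified by simulation: since \(F(x) = 1 - x^{-\theta}\) for \(x > 1\), inverse-CDF sampling gives \(X = U^{-1/\theta}\). A sketch with \(\theta = 2.5\) (an illustrative value):

```python
import math
import random

# Sketch: simulate Pareto(theta) data via X = U^(-1/theta) and check
# the stated MLE n / sum(ln x_i) against the true theta.
random.seed(8)
theta, n = 2.5, 100_000
xs = [random.random() ** (-1.0 / theta) for _ in range(n)]

theta_hat = n / sum(math.log(x) for x in xs)
print(f"theta_hat = {theta_hat:.2f} (true theta = {theta})")
```

This works because \(\ln X \sim \text{Exp}(\theta)\), mirroring the Beta-to-exponential trick in the HLQ below.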

High-Leverage Questions

HLQ: Exam-Style Question with Worked Solution

[12 marks] MLE / MOM comparison, multi-part.

A random sample \(X_1, \ldots, X_n\) is drawn from the distribution with PDF:

\(f(x;\theta) = \theta x^{\theta-1}\)   for \(0 < x < 1\),   \(\theta > 0\).

(a) Find the Method of Moments estimator of \(\theta\). (3 marks)

(b) Find the Maximum Likelihood Estimator of \(\theta\). (4 marks)

(c) Is the MLE unbiased? Justify your answer. (3 marks)

(d) Find the MLE of \(\theta^2\). (2 marks)


Part (a): MOM Estimator
Compute the first population moment:
\(E[X] = \displaystyle\int_0^1 x \cdot \theta x^{\theta-1}\,dx = \theta\int_0^1 x^{\theta}\,dx = \theta \cdot \dfrac{1}{\theta + 1} = \dfrac{\theta}{\theta + 1}\)

Set equal to the sample mean: \(\dfrac{\theta}{\theta + 1} = \bar{X}\).

Solve for \(\theta\):
\(\theta = \bar{X}(\theta + 1) = \bar{X}\theta + \bar{X}\).
\(\theta(1 - \bar{X}) = \bar{X}\).
\(\hat{\theta}_{\text{MOM}} = \dfrac{\bar{X}}{1 - \bar{X}}\).
Part (b): MLE \(L(\theta) = \prod_{i=1}^{n} \theta X_i^{\theta-1} = \theta^n \left(\prod_{i=1}^{n} X_i\right)^{\theta-1}\).

\(\ell(\theta) = n\ln\theta + (\theta - 1)\sum_{i=1}^{n} \ln X_i\).

\(\dfrac{d\ell}{d\theta} = \dfrac{n}{\theta} + \sum_{i=1}^{n} \ln X_i = 0\).

\(\hat{\theta}_{\text{MLE}} = \dfrac{-n}{\sum_{i=1}^{n} \ln X_i}\).

Note: Since \(0 < X_i < 1\), we have \(\ln X_i < 0\), so \(\sum \ln X_i < 0\) and the MLE is positive. \(\checkmark\)

Check second derivative: \(\dfrac{d^2\ell}{d\theta^2} = -\dfrac{n}{\theta^2} < 0\) for all \(\theta > 0\). \(\checkmark\) Maximum confirmed.
Part (c): Bias of the MLE
Let \(Y_i = -\ln X_i\). Since \(X \sim \text{Beta}(\theta, 1)\) with CDF \(F_X(x) = x^\theta\), we have:
\(P(Y_i \leq y) = P(-\ln X_i \leq y) = P(X_i \geq e^{-y}) = 1 - F_X(e^{-y}) = 1 - (e^{-y})^\theta = 1 - e^{-\theta y}\).

So \(Y_i \sim \text{Exp}(\theta)\), and \(\sum Y_i \sim \text{Gamma}(n, \theta)\).

\(\hat{\theta}_{\text{MLE}} = \dfrac{n}{\sum Y_i}\).
\(E\!\left[\dfrac{1}{\sum Y_i}\right] = \dfrac{\theta}{n-1}\) (using the known result for the reciprocal of a Gamma variable).

So \(E[\hat{\theta}_{\text{MLE}}] = n \cdot \dfrac{\theta}{n-1} = \dfrac{n}{n-1} \cdot \theta \neq \theta\).

The MLE is biased, but asymptotically unbiased since \(\frac{n}{n-1} \to 1\) as \(n \to \infty\).
Part (d): MLE of \(\theta^2\)
By the invariance property of MLEs, the MLE of \(\theta^2\) is \((\hat{\theta}_{\text{MLE}})^2 = \left(\dfrac{-n}{\sum \ln X_i}\right)^2 = \dfrac{n^2}{\left(\sum \ln X_i\right)^2}\).
Summary of answers: (a) \(\hat{\theta}_{\text{MOM}} = \dfrac{\bar{X}}{1 - \bar{X}}\). (b) \(\hat{\theta}_{\text{MLE}} = \dfrac{-n}{\sum_{i=1}^{n} \ln X_i}\). (c) Biased: \(E[\hat{\theta}_{\text{MLE}}] = \frac{n}{n-1}\theta\). (d) MLE of \(\theta^2\) = \(\left(\dfrac{-n}{\sum \ln X_i}\right)^2\). \(\checkmark\)
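Both estimators can be sanity-checked on simulated data: since \(F(x) = x^\theta\) on \((0,1)\), inverse-CDF sampling gives \(X = U^{1/\theta}\). A sketch with \(\theta = 2\) (an illustrative value, not part of the question):

```python
import math
import random

# Sketch: simulate from f(x; theta) = theta * x^(theta-1) on (0, 1)
# via X = U^(1/theta), then compute the part (a) and (b) estimators.
random.seed(9)
theta, n = 2.0, 100_000
xs = [random.random() ** (1.0 / theta) for _ in range(n)]

xbar = sum(xs) / n
theta_mom = xbar / (1.0 - xbar)                  # part (a)
theta_mle = -n / sum(math.log(x) for x in xs)    # part (b)
print(f"MOM: {theta_mom:.2f}, MLE: {theta_mle:.2f}")
```

At this sample size both estimators land close to the true value; part (c)'s bias factor \(n/(n-1)\) is negligible for large \(n\), as expected.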