Why This Concept Exists
Confidence intervals (N8, N9) tell us a range of plausible values for a parameter. But sometimes the question is more specific: "Is the parameter equal to a particular value?" For example: "Does this batch of pills contain exactly 500mg of active ingredient?" "Has the mean wait time actually decreased?" "Is this coin fair?"
Hypothesis testing provides a formal decision-making framework for answering such questions with a quantified error rate. Unlike a CI, which always produces an interval, a hypothesis test yields a binary decision: reject or do not reject the null hypothesis. This decision is made while controlling the probability of a false positive (Type I error) at a pre-specified level \(\alpha\).
The framework is built around a six-step protocol that must be followed rigidly in exam answers. Deviating from the protocol is one of the most common reasons students lose marks — not because their mathematics is wrong, but because they fail to communicate their reasoning in the expected structure.
Prerequisites
Before engaging with this node, you must be comfortable with:
- Standard normal and t-distributions (N6): You must be able to find critical values \(z_\alpha\) and \(t_{\nu, \alpha}\) from tables. You must understand how tail probabilities relate to these values.
- Sampling distributions (N6): \(\bar{X} \sim N(\mu, \sigma^2/n)\) for normal data; \(T = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t(n-1)\) when \(\sigma\) is unknown.
- Binomial and its normal approximation: For proportion tests, you need \(\hat{p} \approx N\big(p, p(1-p)/n\big)\) for large \(n\).
- Confidence intervals (N8-N9): The duality between CIs and hypothesis tests means that understanding one immediately gives you half of the other. If \(\mu_0\) is outside the two-sided 95% CI, you reject \(H_0: \mu = \mu_0\) in favour of \(H_a: \mu \neq \mu_0\) at \(\alpha = 0.05\).
- Algebraic manipulation: Rearranging inequalities and computing standardized test statistics.
Core Exposition
3.1 The Six-Step Protocol
Every hypothesis test in PTS2 should follow this exact structure. Examiners award marks step by step, so writing out all six steps is the safest approach.
Step 1: State the hypotheses
Write \(H_0: \theta = \theta_0\) and one of \(H_a: \theta > \theta_0\), \(H_a: \theta < \theta_0\), or \(H_a: \theta \neq \theta_0\). Always define \(\theta\) in words (e.g., "where \(\mu\) is the population mean weight in grams").
Step 2: Choose the significance level \(\alpha\)
Commonly given in the question (5%, 1%, 10%). This is the maximum acceptable probability of a Type I error (rejecting \(H_0\) when it is true).
Step 3: Specify the test statistic and its distribution under \(H_0\)
Identify whether you need a z-statistic (known \(\sigma\)), a t-statistic (unknown \(\sigma\)), or a proportion z-statistic. Write down the exact formula and distribution.
Step 4: Determine the rejection region
For \(H_a: \theta > \theta_0\): reject if the test statistic exceeds the upper-tail critical value.
For \(H_a: \theta < \theta_0\): reject if the test statistic is below the lower-tail critical value.
For \(H_a: \theta \neq \theta_0\): reject if the test statistic falls in either tail, with \(\alpha/2\) allocated to each tail.
Step 5: Compute the observed value
Substitute the sample data into the test statistic formula.
Step 6: State the decision and conclusion
Decision: "Reject \(H_0\)" or "Do not reject \(H_0\)." Never write "Accept \(H_0\)."
Conclusion: A sentence in the context of the problem.
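Step 4 can be sketched in code for the z-based tests. This is a minimal illustration, assuming Python's standard-library `statistics.NormalDist` in place of printed tables; the function name `z_rejection_region` and the labels `"greater"`, `"less"`, and `"two-sided"` are hypothetical choices made for this sketch, not part of the course material.

```python
from statistics import NormalDist


def z_rejection_region(alpha, alternative):
    """Return (critical value, predicate) for a z-based test.

    The predicate takes an observed z statistic and returns True
    when it falls in the rejection region.
    """
    std_normal = NormalDist()  # standard normal N(0, 1)
    if alternative == "greater":      # H_a: theta > theta_0
        crit = std_normal.inv_cdf(1 - alpha)
        return crit, (lambda z: z > crit)
    if alternative == "less":         # H_a: theta < theta_0
        crit = std_normal.inv_cdf(1 - alpha)
        return crit, (lambda z: z < -crit)
    if alternative == "two-sided":    # H_a: theta != theta_0, alpha/2 per tail
        crit = std_normal.inv_cdf(1 - alpha / 2)
        return crit, (lambda z: abs(z) > crit)
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")


crit, in_region = z_rejection_region(0.05, "greater")
print(round(crit, 3))  # 1.645
```

The critical value is computed once in Step 4, before any data are substituted, which mirrors the order of the protocol.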
3.2 The z-Test (Known Variance)
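When the population variance is known and the data are normal (or \(n\) is large), the test statistic is:
\[Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) \quad \text{under } H_0\]
The rejection region and p-value depend on the alternative: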
| Alternative | Rejection Region | p-value |
|---|---|---|
| \(H_a: \mu > \mu_0\) | \(Z > z_\alpha\) | \(P(Z > z_{\text{obs}})\) |
| \(H_a: \mu < \mu_0\) | \(Z < -z_\alpha\) | \(P(Z < z_{\text{obs}})\) |
| \(H_a: \mu \neq \mu_0\) | \(|Z| > z_{\alpha/2}\) | \(2P(Z > |z_{\text{obs}}|)\) |
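The statistic and the p-value column of the table can be checked numerically. The sketch below assumes the standard-library `statistics.NormalDist` for \(\Phi\); the function name `one_sample_z_test` and its argument names are hypothetical. The data are those of the first self-assessment question (\(\bar{x} = 27\), \(\mu_0 = 25\), \(\sigma = 4\), \(n = 36\)).

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """Return (z_obs, p_value) for a one-sample z-test with known sigma."""
    std_normal = NormalDist()
    z_obs = (xbar - mu0) / (sigma / sqrt(n))
    if alternative == "greater":
        p_value = 1 - std_normal.cdf(z_obs)             # P(Z > z_obs)
    elif alternative == "less":
        p_value = std_normal.cdf(z_obs)                 # P(Z < z_obs)
    elif alternative == "two-sided":
        p_value = 2 * (1 - std_normal.cdf(abs(z_obs)))  # 2 P(Z > |z_obs|)
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z_obs, p_value


z_obs, p_value = one_sample_z_test(27, 25, 4, 36, alternative="greater")
print(round(z_obs, 2), round(p_value, 4))  # 3.0 0.0013
```

Rejecting when the p-value is below \(\alpha\) gives the same decision as comparing \(z_{\text{obs}}\) with the critical value.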
3.3 The t-Test (Unknown Variance)
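When \(\sigma\) is estimated by \(S\) and the data are normal, the test statistic is:
\[T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t(n-1) \quad \text{under } H_0\]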
All three cases (right-tailed, left-tailed, two-tailed) follow the same rejection-region logic as the z-test, but with t-critical values from the t-table instead of z-critical values.
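As a rough sketch of the t-test arithmetic: Python's standard library has no t-distribution, so the critical value is assumed to be read from the t-table and passed in by hand. The function names `one_sample_t_statistic` and `t_decision` are hypothetical. The data are those of the second self-assessment question (\(\bar{x} = 96.3\), \(s = 7.2\), \(n = 16\), \(t_{15, 0.005} = 2.947\)).

```python
from math import sqrt


def one_sample_t_statistic(xbar, mu0, s, n):
    """Return (t_obs, degrees of freedom) for a one-sample t-test."""
    t_obs = (xbar - mu0) / (s / sqrt(n))
    return t_obs, n - 1


def t_decision(t_obs, t_crit, alternative):
    """Return True if H_0 is rejected.

    t_crit is the positive table value: t_{nu, alpha} for a one-tailed
    test, t_{nu, alpha/2} for a two-tailed test.
    """
    if alternative == "greater":
        return t_obs > t_crit
    if alternative == "less":
        return t_obs < -t_crit
    if alternative == "two-sided":
        return abs(t_obs) > t_crit
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")


t_obs, df = one_sample_t_statistic(96.3, 100, 7.2, 16)
reject = t_decision(t_obs, 2.947, "two-sided")
print(round(t_obs, 3), df, reject)  # -2.056 15 False
```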
3.4 The Proportion Test
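For testing \(H_0: p = p_0\) with large \(n\), the test statistic is:
\[Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \approx N(0, 1) \quad \text{under } H_0\]
Note that the standard error uses \(p_0\), not \(\hat{p}\), because the distribution is evaluated under \(H_0\).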
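The proportion statistic \(Z = (\hat{p} - p_0)/\sqrt{p_0(1-p_0)/n}\) can be sketched as follows. The function name `one_sample_proportion_z` is hypothetical, and the numbers are taken from part (d) of the exam-style question later in this node (12 defectives in 200, \(p_0 = 0.04\), right-tailed at 5%).

```python
from math import sqrt
from statistics import NormalDist


def one_sample_proportion_z(successes, n, p0):
    """Return (p_hat, z_obs) for a large-sample test of H_0: p = p0."""
    p_hat = successes / n
    se_under_h0 = sqrt(p0 * (1 - p0) / n)  # standard error uses p0, not p_hat
    return p_hat, (p_hat - p0) / se_under_h0


p_hat, z_obs = one_sample_proportion_z(12, 200, 0.04)
z_crit = NormalDist().inv_cdf(0.95)        # upper 5% point, about 1.645
reject = z_obs > z_crit                    # right-tailed alternative p > p0
print(p_hat, round(z_obs, 3), reject)      # 0.06 1.443 False
```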
3.5 p-Values
The p-value is the probability, under \(H_0\), of observing a test statistic as extreme as or more extreme than what was observed. It tells you how surprising your data are if \(H_0\) were true.
Right-tailed: p-value \(= P(Z > z_{\text{obs}})\).
Left-tailed: p-value \(= P(Z < z_{\text{obs}})\).
Two-tailed: p-value \(= 2P(Z > |z_{\text{obs}}|)\).
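To make the definition concrete, a right-tailed p-value can be approximated by simulation: generate many sample means under \(H_0\) and count how often they are at least as large as the one observed. This is only an illustrative sketch; the function name, the number of simulations, and the seed are arbitrary assumptions. The data are those of the tablet problem in Example 1 (\(\bar{x} = 500.73\), \(\mu_0 = 500\), \(\sigma = 2.0\), \(n = 30\)).

```python
import random
from math import sqrt


def simulated_right_tail_p_value(xbar_obs, mu0, sigma, n, n_sims=100_000, seed=0):
    """Estimate P(sample mean >= xbar_obs) when H_0: mu = mu0 is true."""
    rng = random.Random(seed)
    se = sigma / sqrt(n)  # under H_0 the sample mean is N(mu0, sigma^2 / n)
    hits = sum(1 for _ in range(n_sims) if rng.gauss(mu0, se) >= xbar_obs)
    return hits / n_sims


p_sim = simulated_right_tail_p_value(500.73, 500, 2.0, 30)
print(p_sim)  # close to the table-based value of about 0.023
```

The simulated proportion approximates \(P(Z > z_{\text{obs}})\); the exact value comes from the normal table.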
3.6 Type I and Type II Errors
Type I error (\(\alpha\)): Reject \(H_0\) when \(H_0\) is actually true. (False positive.)
Type II error (\(\beta\)): Do not reject \(H_0\) when \(H_a\) is actually true. (False negative.)
Power (\(1 - \beta\)): Correctly reject \(H_0\) when \(H_a\) is true. (Covered in depth in N11.)
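The claim that a test controls the Type I error rate at \(\alpha\) can be checked by simulation: draw many samples from a population where \(H_0\) is true and record how often a right-tailed z-test rejects. The population values (\(\mu_0 = 50\), \(\sigma = 10\), \(n = 25\)), the number of trials, and the function name are hypothetical choices for this sketch.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def type_i_error_rate(mu0=50.0, sigma=10.0, n=25, alpha=0.05, n_trials=20_000, seed=1):
    """Fraction of samples drawn under H_0 for which a right-tailed z-test rejects."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(n_trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]  # H_0 is true here
        z_obs = (fmean(sample) - mu0) / (sigma / sqrt(n))
        if z_obs > z_crit:
            rejections += 1  # every rejection is a false positive
    return rejections / n_trials


rate = type_i_error_rate()
print(rate)  # should be close to alpha = 0.05
```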
Worked Examples
Example 1: One-Sample z-Test (Right-Tailed)
A pharmaceutical company claims the mean active ingredient in a tablet is 500 mg. A regulatory agency suspects the actual mean is higher than 500 mg. From known manufacturing data, \(\sigma = 2.0\) mg. A random sample of \(n = 30\) tablets has mean \(\bar{x} = 500.73\) mg.
At the 5% significance level, test whether the mean is greater than 500 mg.
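Step 1: \(H_0: \mu = 500\) vs \(H_a: \mu > 500\), where \(\mu\) is the true mean active ingredient in mg.
Step 2: \(\alpha = 0.05\).
Step 3: \(Z = \dfrac{\bar{X} - 500}{2.0/\sqrt{30}} \sim N(0, 1)\) under \(H_0\).
Step 4: From tables: \(z_{0.05} = 1.645\). Reject \(H_0\) if \(Z > 1.645\).
Step 5:
\[z_{\text{obs}} = \frac{500.73 - 500}{2.0/\sqrt{30}} = \frac{0.73}{0.3651} = 1.999\]
Step 6: Since \(1.999 > 1.645\), reject \(H_0\). There is sufficient evidence at the 5% level to conclude that the mean active ingredient exceeds 500 mg.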
p-value: \(P(Z > 1.999) = 1 - \Phi(1.999) \approx 1 - 0.9772 = 0.0228\).
Since 0.0228 < 0.05, we reject \(H_0\). (Same conclusion.)
Example 2: One-Sample t-Test (Two-Tailed)
A bolt manufacturer claims the mean diameter is 10.00 mm. A quality inspector takes a random sample of \(n = 9\) bolts and finds \(\bar{x} = 10.23\) mm and \(s = 0.48\) mm. Assume normality.
At the 1% significance level, test whether the mean diameter differs from 10.00 mm.
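Step 1: \(H_0: \mu = 10.00\) vs \(H_a: \mu \neq 10.00\), where \(\mu\) is the true mean bolt diameter in mm.
Step 2: \(\alpha = 0.01\).
Step 3: \(T = \dfrac{\bar{X} - 10.00}{S/\sqrt{9}} \sim t(8)\) under \(H_0\).
Step 4: From tables: \(t_{8, 0.005} = 3.355\). Reject \(H_0\) if \(|T| > 3.355\).
Step 5:
\[t_{\text{obs}} = \frac{10.23 - 10.00}{0.48/\sqrt{9}} = \frac{0.23}{0.16} = 1.438\]
Step 6: Since \(|1.438| < 3.355\), do not reject \(H_0\). There is insufficient evidence at the 1% level to conclude that the mean diameter differs from 10.00 mm. The observed deviation of 0.23 mm is consistent with random sampling variation.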
Example 3: One-Sample Proportion Test
A political pollster claims that 40% of voters support a candidate. A researcher samples 200 voters and finds 68 supporters.
At the 5% level, test whether the true proportion differs from 40%.
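Step 1: \(H_0: p = 0.40\) vs \(H_a: p \neq 0.40\), where \(p\) is the true proportion of voters who support the candidate.
Step 2: \(\alpha = 0.05\).
Step 3: \(Z = \dfrac{\hat{p} - 0.40}{\sqrt{0.40 \times 0.60/200}} \approx N(0, 1)\) under \(H_0\).
Step 4: Reject \(H_0\) if \(|Z| > z_{0.025} = 1.96\).
Step 5: \(\hat{p} = 68/200 = 0.34\). SE under \(H_0\): \(\sqrt{0.40 \times 0.60 / 200} = \sqrt{0.0012} = 0.03464\).
\[z_{\text{obs}} = \frac{0.34 - 0.40}{0.03464} = \frac{-0.06}{0.03464} = -1.732\]
Step 6: Since \(|-1.732| = 1.732 < 1.96\), do not reject \(H_0\). There is insufficient evidence at the 5% level to conclude that the proportion differs from 40%. The observed 34% could plausibly arise from random sampling when the true proportion is 40%.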
p-value: \(2P(Z > 1.732) \approx 2 \times 0.0416 = 0.0832 > 0.05\). Consistent.
Pattern Recognition & Examiner Traps
- "Test whether..." (no direction specified) — almost always a two-tailed test with \(H_a: \neq\).
- "Test whether the mean has decreased / improved / increased" — one-tailed test. Read the direction carefully.
- When the question says "using a 5% significance level" — \(\alpha = 0.05\) is given. When it says "test at the 1% level" — \(\alpha = 0.01\).
- If given a p-value and asked to interpret: compare directly with \(\alpha\). If p < \(\alpha\), reject.
- "Write a conclusion in the context of the problem" — must include the words "insufficient evidence" or "sufficient evidence" and reference the parameter and context.
Connections
- ← N6 (Sampling Distributions): Every test statistic is a standardized sampling distribution. The z-test exists because \(\bar{X}\) is normal; the t-test exists because \((\bar{X}-\mu)/(S/\sqrt{n})\) follows a t-distribution.
- ← N8-N9 (CIs): The inversion principle: a two-sided test at level \(\alpha\) rejects \(H_0: \theta = \theta_0\) if and only if \(\theta_0\) falls OUTSIDE the \((1-\alpha)\) CI. This duality means you can often check your hypothesis tests against your CIs (and vice versa).
- → N11 (Power): Power analysis extends the six-step framework by quantifying the probability of correctly rejecting \(H_0\) under a specific alternative. N10 gives you the mechanism; N11 tells you how well it works.
- → N12 (Two-Sample Tests): N12 extends the six-step protocol to two-sample settings. The structure (hypotheses, test stat, rejection region, compute, decide, conclude) is identical — only the formulas change.
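The CI inversion principle noted above can be verified numerically for the known-variance case. In this sketch the function name and the data (\(\bar{x} = 53.5\), \(\sigma = 7.0\), \(n = 36\)) are illustrative assumptions; the loop checks that "reject" and "\(\mu_0\) outside the CI" always agree.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test_and_ci(xbar, mu0, sigma, n, alpha=0.05):
    """Return (ci, reject, outside) for a two-sided z-test and the matching CI."""
    z_half = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    se = sigma / sqrt(n)
    ci = (xbar - z_half * se, xbar + z_half * se)  # (1 - alpha) CI for mu
    reject = abs((xbar - mu0) / se) > z_half       # two-sided test decision
    outside = mu0 < ci[0] or mu0 > ci[1]           # is mu0 outside the CI?
    return ci, reject, outside


for mu0 in (50.0, 51.3, 53.0, 55.7, 56.0):
    ci, reject, outside = two_sided_z_test_and_ci(53.5, mu0, 7.0, 36)
    print(mu0, reject, outside)  # the two booleans agree on every row
```

With these numbers the interval is roughly (51.21, 55.79), so \(\mu_0 = 50\) and \(\mu_0 = 56\) are rejected while the values inside the interval are not.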
Summary Table
| Test | \(H_0\) | Test Statistic | Distribution |
|---|---|---|---|
| z-test (mean, \(\sigma\) known) | \(\mu = \mu_0\) | \(\dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\) | \(N(0, 1)\) |
| t-test (mean, \(\sigma\) unknown) | \(\mu = \mu_0\) | \(\dfrac{\bar{X} - \mu_0}{S/\sqrt{n}}\) | \(t(n-1)\) |
| Proportion test | \(p = p_0\) | \(\dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}\) | \(\approx N(0, 1)\) |
Self-Assessment
Test your understanding before moving to N11:
- Test \(H_0: \mu = 25\) vs \(H_a: \mu > 25\) with \(\bar{x} = 27\), \(\sigma = 4\), \(n = 36\), at \(\alpha = 0.05\). [Answer: \(z = 3.0 > 1.645\), reject \(H_0\), p-value = 0.0013.]
- Test \(H_0: \mu = 100\) vs \(H_a: \mu \neq 100\) with \(\bar{x} = 96.3\), \(s = 7.2\), \(n = 16\), at \(\alpha = 0.01\). [Answer: \(t = -2.056\), \(|t| < 2.947\), do not reject.]
- Test \(H_0: p = 0.30\) vs \(H_a: p < 0.30\) with \(X = 42\) successes in \(n = 200\), at \(\alpha = 0.05\). [Answer: \(\hat{p} = 0.21\), \(z = -0.09/0.0324 = -2.78 < -1.645\), reject \(H_0\).]
- Explain in words what a Type II error means in the context of a medical screening test for a disease. [Answer: The test fails to detect the disease in a patient who actually has it — a false negative.]
- For \(H_0: \mu = 50\) vs \(H_a: \mu \neq 50\), suppose the 95% CI for \(\mu\) is [51.2, 55.8]. What is the decision at \(\alpha = 0.05\)? [Answer: Reject \(H_0\), since 50 is outside the CI.]
- Identify which hypothesis test is appropriate: (a) \(n = 50\), \(\sigma = 3\) known, testing \(\mu\). (b) \(n = 10\), s from data, testing \(\mu\). (c) \(n = 300\), testing a proportion. [Answer: (a) z-test, (b) t-test, (c) proportion test.]
HLQ: Exam-Style Question with Worked Solution
A factory produces ball bearings with a target mean weight of 5.00 grams. The production process has a known standard deviation of \(\sigma = 0.12\) grams. A quality control engineer takes a random sample of 40 bearings.
(a) In a particular week, the sample mean was \(\bar{x} = 5.045\) grams. Test, at the 5% level, whether the mean weight has increased. (5 marks)
(b) Calculate the p-value for this test and interpret it. (3 marks)
(c) Suppose instead that \(\sigma\) was unknown and the sample standard deviation was found to be \(s = 0.14\) grams. How would the test procedure change? Carry out the test at the 5% level. (5 marks)
(d) A separate test of the proportion of defective bearings found 12 defectives in a sample of 200, testing \(H_0: p = 0.04\) vs \(H_a: p > 0.04\) at the 5% level. Carry out the test. (3 marks)
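(a) Step 1: \(H_0: \mu = 5.00\) vs \(H_a: \mu > 5.00\), where \(\mu\) is the true mean bearing weight in grams.
Step 2: \(\alpha = 0.05\).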
Step 3: \(Z = \dfrac{\bar{X} - 5.00}{0.12/\sqrt{40}} \sim N(0, 1)\) under \(H_0\).
Step 4: Reject \(H_0\) if \(Z > z_{0.05} = 1.645\).
Step 5: \(z_{\text{obs}} = \dfrac{5.045 - 5.00}{0.12/\sqrt{40}} = \dfrac{0.045}{0.01897} = 2.372\).
Step 6: Since \(2.372 > 1.645\), reject \(H_0\). There is sufficient evidence at the 5% level that the mean weight has increased.
(b) p-value \(= P(Z > 2.372) = 1 - \Phi(2.372) \approx 1 - 0.9911 = 0.0089\).
Interpretation: If the true mean were really 5.00 grams, there would be only about a 0.89% chance of observing a sample mean of 5.045 grams or larger in a sample of 40 bearings. This is very unlikely, providing strong evidence against \(H_0\).
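(c) Because \(\sigma\) is now unknown and estimated by \(s\), the z-statistic is replaced by a t-statistic with \(n - 1 = 39\) degrees of freedom.
Step 1: \(H_0: \mu = 5.00\) vs \(H_a: \mu > 5.00\), where \(\mu\) is the true mean bearing weight in grams.
Step 2: \(\alpha = 0.05\).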
Step 3: \(T = \dfrac{\bar{X} - 5.00}{S/\sqrt{40}} \sim t(39)\) under \(H_0\).
Step 4: Reject \(H_0\) if \(T > t_{39, 0.05} \approx 1.685\).
Step 5: \(t_{\text{obs}} = \dfrac{5.045 - 5.00}{0.14/\sqrt{40}} = \dfrac{0.045}{0.02214} = 2.033\).
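Step 6: Since \(2.033 > 1.685\), still reject \(H_0\). There is sufficient evidence at the 5% level that the mean weight has increased.
Change observed: The critical value is larger (1.685 vs 1.645) because the t-distribution has heavier tails than the standard normal, and the observed statistic is smaller (2.033 vs 2.372) because \(s = 0.14 > 0.12 = \sigma\). The conclusion is the same, but the evidence against \(H_0\) is weaker.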
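(d) Step 1: \(H_0: p = 0.04\) vs \(H_a: p > 0.04\), where \(p\) is the true proportion of defective bearings.
Step 2: \(\alpha = 0.05\).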
Step 3: \(Z = \dfrac{\hat{p} - 0.04}{\sqrt{0.04(0.96)/200}} \approx N(0, 1)\).
Step 4: Reject if \(Z > 1.645\).
Step 5: \(\hat{p} = 12/200 = 0.06\). \(SE = \sqrt{0.04 \times 0.96/200} = \sqrt{0.000192} = 0.01386\).
\(z_{\text{obs}} = \dfrac{0.06 - 0.04}{0.01386} = 1.443\).
Step 6: Since \(1.443 < 1.645\), do not reject \(H_0\). Insufficient evidence at the 5% level that the defect rate exceeds 4%.
Summary of answers:
(a) Reject \(H_0\) (\(z = 2.372 > 1.645\)). There is sufficient evidence that the mean weight has increased.
(b) p-value = 0.0089. Strong evidence against \(H_0\).
(c) Using the t-test with \(s = 0.14\): still reject (\(t = 2.033 > 1.685\)), but with weaker evidence.
(d) Do not reject \(H_0\) (\(z = 1.443 < 1.645\)). There is insufficient evidence that the defect rate exceeds 4%.