Why This Concept Exists
Two-sample hypothesis tests are the capstone of the inference chain in PTS2. They sit at the intersection of sampling distributions (N6), estimation (N7), confidence intervals (N8–N9), and power analysis (N11). When an examiner asks whether two populations differ, you reach for the protocols established in this node.
The two-sample z-test for \(\mu_1 - \mu_2\) and the proportion tests (pooled vs. separate) appear across every exam paper and are frequently combined with power calculations, making this the single node that exercises the widest range of your inferential toolkit.
The key intellectual move is recognising that a difference between two samples has its own sampling distribution. The test statistic always has the same structure, \(\dfrac{\text{statistic} - \text{hypothesised value}}{\text{standard error}}\), but the standard error changes depending on whether the variances are known, unknown with large samples, or unknown with small samples (t-distribution, covered in later courses).
Prerequisites
Before engaging with this node, you should be comfortable with:
- One-sample hypothesis testing (N10): The six-step protocol, p-values, one-tailed vs. two-tailed tests, and interpreting results in context.
- Sampling distributions (N6): The distribution of \(\bar{X}_1 - \bar{X}_2\) and why it is normal (or approximately normal).
- Properties of the normal distribution (N6): Being able to compute and look up \(\Phi\) values, critical values \(z_{\alpha}\), and tail areas.
- Variance and standard error algebra (N6, N9): \(\text{Var}(\bar{X}_1 - \bar{X}_2) = \text{Var}(\bar{X}_1) + \text{Var}(\bar{X}_2)\) under independence.
- Binomial distribution and normal approximation to proportions (N10): Understanding \(\hat{p} = X/n\) as an estimator and when the normal approximation \(\hat{p} \approx N\!\left(p, \frac{p(1-p)}{n}\right)\) is valid.
- Power and Type II error (N11): Computing \(\beta = P(\text{fail to reject }H_0 \mid H_1\text{ is true})\) and the relationship between power, effect size, and sample size.
Two-Sample z-Test for \(\mu_1 - \mu_2\)
3.1 Setup and Assumptions
We have two independent samples:
Sample 1: \(X_{11}, X_{21}, \ldots, X_{n_1 1} \stackrel{iid}{\sim} N(\mu_1, \sigma_1^2)\)
Sample 2: \(X_{12}, X_{22}, \ldots, X_{n_2 2} \stackrel{iid}{\sim} N(\mu_2, \sigma_2^2)\)
Known variances are assumed: \(\sigma_1^2\) and \(\sigma_2^2\) are given from prior information. This is the z-test setting.
3.2 Hypotheses
We test the difference \(\mu_1 - \mu_2\) against some target value \(\Delta_0\):
Two-tailed: \(H_0: \mu_1 - \mu_2 = \Delta_0\) vs. \(H_1: \mu_1 - \mu_2 \neq \Delta_0\)
One-tailed (right): \(H_0: \mu_1 - \mu_2 = \Delta_0\) vs. \(H_1: \mu_1 - \mu_2 > \Delta_0\)
One-tailed (left): \(H_0: \mu_1 - \mu_2 = \Delta_0\) vs. \(H_1: \mu_1 - \mu_2 < \Delta_0\)
Most commonly, \(\Delta_0 = 0\) (test whether the means are equal). When the problem asks whether one mean is greater than the other by a specific amount, \(\Delta_0 \neq 0\).
3.3 The Six-Step Protocol
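Step 1: State \(H_0\) and \(H_1\) in terms of \(\mu_1 - \mu_2\) and \(\Delta_0\).
Step 2: Write down the test statistic, \(Z = \dfrac{(\bar{X}_1 - \bar{X}_2) - \Delta_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}\), which is \(N(0, 1)\) under \(H_0\).
Step 3: Fix \(\alpha\) and the critical value (for a right-tailed test, reject if \(z_{\text{obs}} > z_\alpha\)).
Step 4: Compute the observed statistic \(z_{\text{obs}}\).
Step 5: Decide whether \(z_{\text{obs}}\) falls in the rejection region.
Step 6: State the conclusion in the context of the problem.
p-value approach: Compute \(p = 1 - \Phi(z_{\text{obs}})\) (right-tailed). If \(p < \alpha\), reject \(H_0\).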
3.4 The Sampling Distribution of the Difference
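The foundation of the test is the sampling distribution:
\[\bar{X}_1 - \bar{X}_2 \sim N\!\left(\mu_1 - \mu_2,\; \dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}\right)\]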
This follows because \(\text{Var}(\bar{X}_1 - \bar{X}_2) = \text{Var}(\bar{X}_1) + \text{Var}(\bar{X}_2)\) by independence, and \(\text{Var}(\bar{X}_i) = \sigma_i^2/n_i\).
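The test built on this sampling distribution can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `two_sample_z_test`, its parameters, and the example numbers are assumptions introduced here, and only the standard library is used.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_test(xbar1, xbar2, var1, var2, n1, n2, delta0=0.0, tail="right"):
    """Two-sample z-test for mu1 - mu2 with KNOWN variances var1 and var2."""
    # Standard error of Xbar1 - Xbar2 (variances add under independence).
    se = sqrt(var1 / n1 + var2 / n2)
    # (statistic - hypothesised value) / standard error
    z_obs = ((xbar1 - xbar2) - delta0) / se
    phi = NormalDist().cdf(z_obs)
    if tail == "right":
        p_value = 1.0 - phi
    elif tail == "left":
        p_value = phi
    elif tail == "two":
        p_value = 2.0 * min(phi, 1.0 - phi)
    else:
        raise ValueError("tail must be 'right', 'left' or 'two'")
    return z_obs, p_value


# Hypothetical data (not from these notes): SE = sqrt(4/9 + 5/9) = 1, so z = 2.
z_obs, p_value = two_sample_z_test(10.0, 8.0, 4.0, 5.0, 9, 9)
print(f"z = {z_obs:.3f}, p = {p_value:.4f}")
```

Passing `delta0` explicitly keeps the same function usable for both "are the means equal" and "does the difference exceed a stated amount" questions.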
Two-Proportion Tests: Pooled vs. Separate
4.1 The Problem Setup
We have two independent binomial samples:
Population 1: \(X_1 \sim \text{Bin}(n_1, p_1)\) with sample proportion \(\hat{p}_1 = X_1/n_1\)
Population 2: \(X_2 \sim \text{Bin}(n_2, p_2)\) with sample proportion \(\hat{p}_2 = X_2/n_2\)
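We test \(H_0: p_1 - p_2 = \Delta_0\). Under the normal approximation (valid when \(n_i p_i\) and \(n_i(1-p_i)\) are both at least about 10):
\[\hat{p}_1 - \hat{p}_2 \approx N\!\left(p_1 - p_2,\; \dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}\right)\]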
4.2 The Critical Decision: \(\Delta_0 = 0\) vs. \(\Delta_0 \neq 0\)
This is the most important structural choice in two-proportion testing. It determines whether you pool or use separate variance estimates.
Case \(\Delta_0 = 0\) (pooling): under \(H_0\), the two populations share the same proportion, \(p_1 = p_2 = p\).
We pool the data to get a single best estimate of \(p\):
\[\hat{p} = \frac{x_1 + x_2}{n_1 + n_2} \quad \text{(pooled proportion)}\]
The test statistic is:
\[Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}\]
This uses the common variance estimate pooled from both samples.
Case \(\Delta_0 \neq 0\) (no pooling): the null hypothesis does not imply \(p_1 = p_2\), so pooling is invalid.
Use separate variance estimates:
\[Z = \frac{(\hat{p}_1 - \hat{p}_2) - \Delta_0}{\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}\]
4.3 Decision Table
| Condition | Variance Method | Standard Error |
|---|---|---|
| \(\Delta_0 = 0\) (pooling OK) | Pooled: \(\hat{p} = \dfrac{x_1+x_2}{n_1+n_2}\) | \(\sqrt{\hat{p}(1\!-\!\hat{p})\!\left(\dfrac{1}{n_1}\!+\!\dfrac{1}{n_2}\right)}\) |
| \(\Delta_0 \neq 0\) (no pooling) | Separate: use \(\hat{p}_1, \hat{p}_2\) | \(\sqrt{\dfrac{\hat{p}_1(1\!-\!\hat{p}_1)}{n_1}\!+\!\dfrac{\hat{p}_2(1\!-\!\hat{p}_2)}{n_2}}\) |
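The pooled versus separate decision in the table can be expressed as a single branch on \(\Delta_0\). The sketch below is illustrative only: `two_proportion_z` and the counts 60/100 and 40/100 are hypothetical, and the test is assumed to be right-tailed.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2, delta0=0.0):
    """Right-tailed z-test for p1 - p2; pools only when delta0 == 0."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    if delta0 == 0:
        # H0 implies p1 = p2, so estimate the common p from both samples.
        p_pool = (x1 + x2) / (n1 + n2)
        se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    else:
        # H0 does not imply p1 = p2, so each sample keeps its own variance estimate.
        se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z_obs = ((p1_hat - p2_hat) - delta0) / se
    p_value = 1.0 - NormalDist().cdf(z_obs)
    return z_obs, p_value


# Hypothetical counts (not from these notes): 60 of 100 versus 40 of 100.
z_pooled, p_pooled = two_proportion_z(60, 100, 40, 100)                 # Delta_0 = 0
z_separate, p_separate = two_proportion_z(60, 100, 40, 100, delta0=0.10)  # Delta_0 = 0.10
print(f"pooled z = {z_pooled:.3f}, separate z = {z_separate:.3f}")
```

The same data give different z-values in the two calls because both the hypothesised difference and the standard error change.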
Worked Example: Stock Data — Unilever vs. Galapagos
Example: Testing \(p_U - p_G > 0.10\)
An investment firm holds shares in two pharmaceuticals-related companies. Over the past 200 trading days, Unilever shares rose in value on 126 days (\(x_U = 126, n_U = 200\)). Over the same period, Galapagos shares rose on 102 days (\(x_G = 102, n_G = 180\)).
A portfolio manager claims that the proportion of profitable days for Unilever exceeds that of Galapagos by more than 10 percentage points. Test this claim at the 5% significance level.
\(H_0: p_U - p_G = 0.10\)
\(H_1: p_U - p_G > 0.10\)
This is a right-tailed test. Note that \(\Delta_0 = 0.10 \neq 0\), so we use separate variances (no pooling).
\[Z = \frac{(\hat{p}_U - \hat{p}_G) - 0.10}{\sqrt{\dfrac{\hat{p}_U(1-\hat{p}_U)}{n_U} + \dfrac{\hat{p}_G(1-\hat{p}_G)}{n_G}}}\]
Reject \(H_0\) if \(z_{\text{obs}} > 1.645\).
\(\hat{p}_U = 126/200 = 0.630\)
\(\hat{p}_G = 102/180 = 0.567\)
\(\hat{p}_U - \hat{p}_G = 0.630 - 0.567 = 0.063\) (observed difference)
Hypothesised difference: \(\Delta_0 = 0.10\)
Standard error:
\[\text{SE} = \sqrt{\frac{0.630(0.370)}{200} + \frac{0.567(0.433)}{180}} = \sqrt{0.001165 + 0.001364} = \sqrt{0.002529} = 0.0503\]
\[z_{\text{obs}} = \frac{0.063 - 0.10}{0.0503} = \frac{-0.037}{0.0503} = -0.736\]
p-value \(= 1 - \Phi(-0.736) = \Phi(0.736) \approx 0.769\).
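Since \(p = 0.769 > 0.05\) (equivalently, \(z_{\text{obs}} = -0.736 < 1.645\)), we do not reject \(H_0\). There is insufficient evidence at the 5% level that the proportion of rising days for Unilever exceeds that of Galapagos by more than 10 percentage points.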
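As a cross-check on the hand calculation, the same numbers can be run without intermediate rounding. The variable names below are illustrative. Because the proportions are not rounded to three decimals, the script gives \(z \approx -0.73\) and \(p \approx 0.77\), which agree with the values above up to rounding.

```python
from math import sqrt
from statistics import NormalDist

# Data from the worked example.
x_u, n_u = 126, 200   # Unilever: days the share price rose, days observed
x_g, n_g = 102, 180   # Galapagos
delta0 = 0.10         # hypothesised difference under H0
alpha = 0.05

p_u, p_g = x_u / n_u, x_g / n_g

# Delta_0 != 0, so use separate variance estimates (no pooling).
se = sqrt(p_u * (1 - p_u) / n_u + p_g * (1 - p_g) / n_g)
z_obs = ((p_u - p_g) - delta0) / se

std_normal = NormalDist()
p_value = 1.0 - std_normal.cdf(z_obs)          # right-tailed p-value
z_crit = std_normal.inv_cdf(1 - alpha)         # about 1.645
reject = z_obs > z_crit

print(f"z = {z_obs:.3f}, p = {p_value:.3f}, reject H0: {reject}")
```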
Worked Example: OreToGo vs. Ore.com — Power Calculation
Example: Computing Power for a Two-Sample Comparison
OreToGo and Ore.com are two mining analytics platforms. The average daily processing speed (in megabits per minute) is modelled as normally distributed:
OreToGo: \(X_1 \sim N(\mu_1, \sigma_1^2)\) with \(\sigma_1^2 = 36\)
Ore.com: \(X_2 \sim N(\mu_2, \sigma_2^2)\) with \(\sigma_2^2 = 25\)
A sample of \(n_1 = 50\) observations from OreToGo yields \(\bar{x}_1 = 158\), and \(n_2 = 45\) from Ore.com yields \(\bar{x}_2 = 152\).
(a) Test \(H_0: \mu_1 = \mu_2\) vs. \(H_1: \mu_1 > \mu_2\) at \(\alpha = 0.05\).
(b) Compute the power of this test if the true difference is \(\mu_1 - \mu_2 = 5\).
(c) What sample size \(n_2 = n_1 = n\) would be needed to achieve power \(\geq 0.90\) when \(\mu_1 - \mu_2 = 5\)?
(a) Test statistic:
\[Z = \frac{(\bar{X}_1 - \bar{X}_2) - 0}{\sqrt{\dfrac{36}{50} + \dfrac{25}{45}}} = \frac{158 - 152}{\sqrt{0.7200 + 0.5556}} = \frac{6}{\sqrt{1.2756}} = \frac{6}{1.1294} = 5.313\]
\(\alpha = 0.05\), right-tailed: critical value \(z_{0.05} = 1.645\).
Since \(5.313 > 1.645\), we reject \(H_0\).
p-value \(= 1 - \Phi(5.313) < 10^{-6}\), far below \(\alpha = 0.05\).
(b) The test rejects \(H_0\) when \(Z > 1.645\). Translating back to the difference \(\bar{X}_1 - \bar{X}_2\):
\[\text{Reject when } \bar{X}_1 - \bar{X}_2 > 0 + 1.645 \times 1.1294 = 1.858\]
Under the alternative (\(\mu_1 - \mu_2 = 5\)):
\(\bar{X}_1 - \bar{X}_2 \sim N(5, 1.2756)\), i.e., \(SE = 1.1294\).
\[\text{Power} = P\!\left(Z > \frac{1.858 - 5}{1.1294}\right) = P\!\left(Z > -2.782\right) = \Phi(2.782) \approx 0.9973\]
Power ≈ 99.73%. This test has very high power to detect a difference of 5 units.
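The two-stage power calculation (critical value on the original scale, then a probability under the alternative) can be sketched as follows. The helper name `power_right_tailed` and its parameters are assumptions made for illustration, and the function covers only the right-tailed, known-variance case used here.

```python
from math import sqrt
from statistics import NormalDist


def power_right_tailed(delta_true, var1, var2, n1, n2, delta0=0.0, alpha=0.05):
    """Power of a right-tailed two-sample z-test when mu1 - mu2 = delta_true."""
    std_normal = NormalDist()
    se = sqrt(var1 / n1 + var2 / n2)
    # Step 1: rejection threshold for Xbar1 - Xbar2 on the original scale.
    critical_diff = delta0 + std_normal.inv_cdf(1 - alpha) * se
    # Step 2: probability of exceeding that threshold under the alternative.
    return 1.0 - std_normal.cdf((critical_diff - delta_true) / se)


# Numbers from the worked example: variances 36 and 25, n1 = 50, n2 = 45, true difference 5.
power = power_right_tailed(5, 36, 25, 50, 45)
print(f"power = {power:.4f}")
```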
(c) For a right-tailed test with \(n_1 = n_2 = n\):
\[\frac{\delta}{\sqrt{\dfrac{\sigma_1^2}{n} + \dfrac{\sigma_2^2}{n}}} \geq z_\alpha + z_\beta\]
where \(\delta = \mu_1 - \mu_2 = 5\), \(z_\alpha = z_{0.05} = 1.645\), \(z_\beta = z_{0.10} = 1.282\).
\[\frac{5}{\sqrt{\dfrac{36 + 25}{n}}} \geq 1.645 + 1.282 = 2.927\]
\[\frac{5}{\sqrt{61/n}} \geq 2.927\]
\[\frac{5}{2.927} \geq \sqrt{\frac{61}{n}}\]
\[1.708 \geq \sqrt{\frac{61}{n}}\]
\[1.708^2 \geq \frac{61}{n}\]
\[2.918 \geq \frac{61}{n}\]
\[n \geq \frac{61}{2.918} = 20.91\]
So we need \(\mathbf{n \geq 21}\) per group. The current design (\(n_1 = 50, n_2 = 45\)) is heavily overpowered for detecting \(\delta = 5\).
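Rearranging the inequality gives \(n \geq (z_\alpha + z_\beta)^2(\sigma_1^2 + \sigma_2^2)/\delta^2\), which is easy to evaluate directly. The following is a minimal sketch under the same assumptions (right-tailed test, equal group sizes, known variances, \(\Delta_0 = 0\)); the function name `min_n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def min_n_per_group(delta, var1, var2, alpha=0.05, power=0.90):
    """Smallest equal group size n meeting the target power (right-tailed z-test)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)   # about 1.645 for alpha = 0.05
    z_beta = std_normal.inv_cdf(power)        # about 1.282 for power = 0.90
    # Solve delta / sqrt((var1 + var2) / n) >= z_alpha + z_beta for n, then round up.
    return ceil((z_alpha + z_beta) ** 2 * (var1 + var2) / delta ** 2)


# Numbers from the worked example: delta = 5, variances 36 and 25.
n_required = min_n_per_group(5, 36, 25)
print(n_required)
```

Rounding up with `ceil` matters: rounding 20.9 to the nearest integer happens to give the same answer here, but in general rounding down would leave the design just short of the target power.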
Pattern Recognition & Examiner Traps
Before using the z-test for proportions, check that each of the following is at least about 10:
\(n_1 \hat{p}_1\), \(n_1 (1-\hat{p}_1)\), \(n_2 \hat{p}_2\), \(n_2 (1-\hat{p}_2)\)
If any are small, the z-test approximation is unreliable and should not be used.
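This check is mechanical, so it is worth automating when practising. The sketch below assumes a threshold of 10, in line with the rule of thumb quoted earlier in these notes; the function name `normal_approx_ok` is hypothetical. Note that \(n_i \hat{p}_i\) is simply the success count \(x_i\), and \(n_i(1-\hat{p}_i)\) is the failure count.

```python
def normal_approx_ok(x1, n1, x2, n2, threshold=10):
    """Check the four success/failure counts used to justify the normal approximation."""
    counts = {
        "n1 * p1_hat": x1,             # successes in sample 1
        "n1 * (1 - p1_hat)": n1 - x1,  # failures in sample 1
        "n2 * p2_hat": x2,             # successes in sample 2
        "n2 * (1 - p2_hat)": n2 - x2,  # failures in sample 2
    }
    return all(count >= threshold for count in counts.values()), counts


# Stock example from these notes: 126 of 200 and 102 of 180.
ok, counts = normal_approx_ok(126, 200, 102, 180)
print(ok, counts)
```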
- "Test whether the proportions are equal" — this is \(\Delta_0 = 0\), use pooling.
- "Test whether the difference exceeds 5%" — this is \(\Delta_0 = 0.05 \neq 0\), use separate variances.
- "Show that there is no significant difference" — the examiner expects you to not reject \(H_0\). If your calculation does reject, check your arithmetic.
- "Compute the power" — always find the critical value on the original scale first, then compute the probability on that scale under the alternative.
- "Find the minimum sample size" — set up the power equation \(\dfrac{\delta}{SE} \geq z_\alpha + z_\beta\) and solve for \(n\).
Self-Assessment
Test your understanding of this node by working through these before moving to the end-of-course revision.
- Execute the six-step protocol for a two-sample z-test with known variances.
- Determine when to pool and when to use separate variances in a two-proportion test.
- Correctly frame \(H_0\) and \(H_1\) for a claim involving a non-zero difference.
- Compute the observed z-statistic and p-value for a two-proportion test.
- Calculate power for a two-sample test given a specific alternative value.
- Determine minimum sample sizes needed to achieve target power.
- Check the normality conditions for the binomial approximation before using the z-test for proportions.
- Two machines produce bolts. Machine A: \(\bar{x}_1 = 49.85, n_1 = 30, \sigma_1 = 1.2\). Machine B: \(\bar{x}_2 = 50.12, n_2 = 35, \sigma_2 = 1.0\). Test \(H_0: \mu_1 = \mu_2\) vs. \(H_1: \mu_1 \neq \mu_2\) at \(\alpha = 0.01\).
- In a survey, 320 of 500 voters in City A support a policy, and 280 of 450 in City B support it. Test whether the support proportions differ at the 5% level. (Hint: \(\Delta_0 = 0\), use pooling.)
- For the bolt problem above, compute the power of the test if the true difference is \(\mu_1 - \mu_2 = -0.5\).
- A study claims that conversion rate B exceeds conversion rate A by at least 3 percentage points. In a trial: \(x_A = 45\), \(n_A = 500\); \(x_B = 60\), \(n_B = 400\). Test the claim. (Hint: \(\Delta_0 = 0.03 \neq 0\), use separate variances.)
- For a two-sample comparison with \(\sigma_1^2 = 16\), \(\sigma_2^2 = 9\), and \(n_1 = n_2 = n\), find the minimum \(n\) to achieve power \(\geq 0.95\) when \(\mu_1 - \mu_2 = 3\) and \(\alpha = 0.01\).
HLQ: Exam-Style Question with Worked Solution
A quality inspector collects two independent samples of product weights from Production Line A and B:
Line A: \(n_A = 60\), \(\bar{x}_A = 248.3\)g, known \(\sigma_A^2 = 16\)
Line B: \(n_B = 55\), \(\bar{x}_B = 245.1\)g, known \(\sigma_B^2 = 20\)
(a) Test at the 5% level whether the mean weight from Line A exceeds that from Line B by more than 2 grams. (6 marks)
(b) In a separate quality check, 48 out of 60 units from Line A pass inspection, and 38 out of 55 from Line B pass. Test at the 1% level whether the pass rate on Line A exceeds that on Line B by more than 10 percentage points. (5 marks)
(c) If the true difference in means is \(\mu_A - \mu_B = 4\), compute the power of the test in part (a). (3 marks)
(a) Step 1: \(H_0: \mu_A - \mu_B = 2\) vs. \(H_1: \mu_A - \mu_B > 2\). Right-tailed.
Step 2: Test statistic:
\[Z = \frac{(\bar{X}_A - \bar{X}_B) - 2}{\sqrt{\dfrac{16}{60} + \dfrac{20}{55}}}\]
Step 3: \(\alpha = 0.05\). Critical value: \(z_{0.05} = 1.645\).
Step 4: Observed statistic:
\(\bar{x}_A - \bar{x}_B = 248.3 - 245.1 = 3.2\)
\(\text{SE} = \sqrt{\dfrac{16}{60} + \dfrac{20}{55}} = \sqrt{0.2667 + 0.3636} = \sqrt{0.6303} = 0.7939\)
\(z_{\text{obs}} = \dfrac{3.2 - 2.0}{0.7939} = \dfrac{1.2}{0.7939} = 1.511\)
Step 5: \(1.511 < 1.645\). Not in the rejection region.
p-value \(= 1 - \Phi(1.511) \approx 0.0655\).
Step 6: There is insufficient evidence at the 5% level to conclude that Line A's mean weight exceeds Line B's by more than 2 grams.
(b) Step 1: \(H_0: p_A - p_B = 0.10\) vs. \(H_1: p_A - p_B > 0.10\). Right-tailed.
Step 2: \(\Delta_0 = 0.10 \neq 0\), so use separate variances:
\[Z = \frac{(\hat{p}_A - \hat{p}_B) - 0.10}{\sqrt{\dfrac{\hat{p}_A(1-\hat{p}_A)}{60} + \dfrac{\hat{p}_B(1-\hat{p}_B)}{55}}}\]
Step 3: \(\alpha = 0.01\). Critical value: \(z_{0.01} = 2.326\).
Step 4: \(\hat{p}_A = 48/60 = 0.800\) and \(\hat{p}_B = 38/55 = 0.691\), so \(\hat{p}_A - \hat{p}_B = 0.800 - 0.691 = 0.109\).
\(\text{SE} = \sqrt{\dfrac{0.800(0.200)}{60} + \dfrac{0.691(0.309)}{55}} = \sqrt{0.002667 + 0.003882} = \sqrt{0.006549} = 0.0809\)
\(z_{\text{obs}} = \dfrac{0.109 - 0.10}{0.0809} = \dfrac{0.009}{0.0809} = 0.111\)
Step 5: \(0.111 < 2.326\). Not in the rejection region. p-value \(\approx 0.456\).
Step 6: There is insufficient evidence at the 1% level to conclude that the pass rate on Line A exceeds that on Line B by more than 10 percentage points.
(c) The test in part (a) rejects \(H_0\) when \(Z > 1.645\). On the original scale:
\(\bar{X}_A - \bar{X}_B > 2 + 1.645 \times 0.7939 = 2 + 1.306 = 3.306\)
Under the alternative (\(\mu_A - \mu_B = 4\)):
\(\bar{X}_A - \bar{X}_B \sim N(4, 0.6303)\), i.e., \(SE = 0.7939\):
\[\text{Power} = P\!\left(Z > \frac{3.306 - 4}{0.7939}\right) = P\!\left(Z > -0.874\right) = \Phi(0.874) \approx 0.809\]
Power ≈ 80.9%.
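As a numerical cross-check of parts (a) to (c), the three calculations can be reproduced with unrounded intermediate values. This is a sketch with illustrative variable names; small differences in the third decimal place relative to the hand working are due to rounding.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

# Part (a): means, known variances 16 and 20, Delta_0 = 2, right-tailed.
se_means = sqrt(16 / 60 + 20 / 55)
z_a = ((248.3 - 245.1) - 2) / se_means
p_a = 1 - std_normal.cdf(z_a)

# Part (b): proportions, Delta_0 = 0.10, so separate variance estimates.
p_hat_a, p_hat_b = 48 / 60, 38 / 55
se_props = sqrt(p_hat_a * (1 - p_hat_a) / 60 + p_hat_b * (1 - p_hat_b) / 55)
z_b = ((p_hat_a - p_hat_b) - 0.10) / se_props
p_b = 1 - std_normal.cdf(z_b)

# Part (c): power of the part (a) test when the true difference is 4.
critical_diff = 2 + std_normal.inv_cdf(0.95) * se_means
power_c = 1 - std_normal.cdf((critical_diff - 4) / se_means)

print(f"(a) z = {z_a:.3f}, p = {p_a:.4f}")
print(f"(b) z = {z_b:.3f}, p = {p_b:.3f}")
print(f"(c) power = {power_c:.3f}")
```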