Why This Concept Exists
In N10, you learned to set up and execute hypothesis tests. But a crucial question was left unanswered: how good is the test? A test that always fails to reject \(H_0\) (even when \(H_0\) is false) is useless. The power of a test quantifies its ability to detect a true effect when it exists.
Power analysis answers practical questions that appear in both exam problems and real-world research design:
- "If the true mean is actually 19.5, what is the probability our test will detect that it differs from 19?"
- "How large a sample do we need to have an 80% chance of detecting an effect of size 0.5?"
- "Why did we fail to reject \(H_0\) — was there really no effect, or was our test simply too weak to detect it?"
Prerequisites
Before engaging with this node, you must be comfortable with:
- Hypothesis testing framework (N10): All six steps, the concept of rejection regions, and the distinction between Type I (\(\alpha\)) and Type II (\(\beta\)) errors.
- Standard normal probabilities: You must be able to compute \(\Phi(z)\) and \(1 - \Phi(z)\) efficiently. Power calculations reduce to evaluating normal CDF values.
- Two distributions in play: Understanding that hypothesis testing involves TWO sampling distributions — one under \(H_0\) and one under \(H_a\). The rejection region is defined under \(H_0\), but power is computed under \(H_a\).
- Non-centrality parameter: The standardized distance \(\delta = \dfrac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma}\) measures how far the alternative is from the null, in units of standard error.
Core Exposition
3.1 Definition of Power
The power of a test is the probability that it correctly rejects \(H_0\) when the alternative is true:
\[ \text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_a \text{ is true}), \]
where \(\beta = P(\text{Type II error}) = P(\text{do not reject } H_0 \mid H_a \text{ is true})\).
Power is always computed at a specific alternative value \(\mu = \mu_1\). Unlike \(\alpha\), which is fixed by the test design, power depends on how far \(\mu_1\) is from \(\mu_0\), the sample size, and the variance.
3.2 Power for a One-Sided z-Test (Right-Tailed)
Consider testing \(H_0: \mu = \mu_0\) vs \(H_a: \mu > \mu_0\) with known \(\sigma\). The rejection region is \(\bar{X} > \mu_0 + z_\alpha \dfrac{\sigma}{\sqrt{n}}\). Evaluating the probability of this region under \(\mu = \mu_1 > \mu_0\) gives
\[ \text{Power} = P\!\left(\bar{X} > \mu_0 + z_\alpha \tfrac{\sigma}{\sqrt{n}} \,\Big|\, \mu = \mu_1\right) = \Phi(\delta - z_\alpha), \qquad \delta = \frac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma}. \]
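The right-tailed power formula \(\Phi(\delta - z_\alpha)\) can be cross-checked numerically. This is a minimal sketch (the helper name `power_right` is chosen here, not part of the text), using the standard library's `statistics.NormalDist` for \(\Phi\) and \(\Phi^{-1}\):

```python
from statistics import NormalDist

N = NormalDist()  # standard normal: N.cdf is Phi, N.inv_cdf is Phi^{-1}

def power_right(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the right-tailed z-test H0: mu = mu0 vs Ha: mu > mu0,
    with known sigma, evaluated at the alternative mean mu1 > mu0."""
    delta = (n ** 0.5) * (mu1 - mu0) / sigma   # non-centrality parameter
    z_alpha = N.inv_cdf(1 - alpha)             # upper critical value, ~1.645 at alpha = 0.05
    return N.cdf(delta - z_alpha)              # Phi(delta - z_alpha)

# Example 1's setting below: mu0 = 500, mu1 = 501, sigma = 2, n = 30
print(round(power_right(500, 501, 2.0, 30), 3))  # ~0.863 (tables give 0.862 via Phi(1.09))
```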
3.3 Power for a One-Sided z-Test (Left-Tailed)
For testing \(H_0: \mu = \mu_0\) vs \(H_a: \mu < \mu_0\) at level \(\alpha\), with \(\mu_1 < \mu_0\), the rejection region is \(\bar{X} < \mu_0 - z_\alpha \dfrac{\sigma}{\sqrt{n}}\), and the mirror-image argument gives
\[ \text{Power} = \Phi(|\delta| - z_\alpha), \qquad \delta = \frac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma} < 0. \]
3.4 Power for a Two-Sided z-Test
For testing \(H_0: \mu = \mu_0\) vs \(H_a: \mu \neq \mu_0\) at level \(\alpha\):
\[ \text{Power} = \Phi(\delta - z_{\alpha/2}) + \Phi(-\delta - z_{\alpha/2}) \approx \Phi(|\delta| - z_{\alpha/2}), \]
where the approximation holds because the term for the rejection tail on the far side of \(\mu_1\) is usually negligible.
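The two-sided power \(\Phi(\delta - z_{\alpha/2}) + \Phi(-\delta - z_{\alpha/2})\) can be sketched the same way (again a hypothetical helper, not from the text); it evaluates both rejection tails under the alternative:

```python
from statistics import NormalDist

N = NormalDist()

def power_two_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Exact power of the two-sided z-test H0: mu = mu0 vs Ha: mu != mu0:
    probability of landing in either rejection tail when the true mean is mu1."""
    delta = (n ** 0.5) * (mu1 - mu0) / sigma
    z_half = N.inv_cdf(1 - alpha / 2)                     # z_{alpha/2}, ~1.96 at alpha = 0.05
    return N.cdf(delta - z_half) + N.cdf(-delta - z_half)

# Example 4's setting below: mu0 = 50, mu1 = 53, sigma = 8, n = 36
print(round(power_two_sided(50, 53, 8.0, 36), 3))  # ~0.614; the far-tail term is negligible
```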
3.5 Factors Affecting Power
There are two ways to increase the power of a test (for a fixed alternative):
1. Increase the sample size \(n\):
\(\delta = \sqrt{n}(\mu_1 - \mu_0)/\sigma\) grows with \(n\), increasing power.
Cost: More data collection is expensive and time-consuming.
2. Increase the significance level \(\alpha\):
Larger \(\alpha\) means a smaller critical value, so the rejection region is larger.
Cost: Higher Type I error rate (more false positives).
The relationship between \(\alpha\) and \(\beta\) is a trade-off: for a fixed sample size, decreasing \(\alpha\) (being more conservative about false positives) necessarily increases \(\beta\) (more false negatives), and vice versa.
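Both levers can be checked numerically. This sketch (using the \(\mu_0 = 19\), \(\mu_1 = 19.5\), \(\sigma = 2\) scenario that appears in Example 2 below; helper name chosen here) tabulates right-tailed power as \(n\) and \(\alpha\) vary:

```python
from statistics import NormalDist

N = NormalDist()

def power_right(mu0, mu1, sigma, n, alpha):
    """Power of the right-tailed z-test at the alternative mean mu1."""
    delta = (n ** 0.5) * (mu1 - mu0) / sigma
    return N.cdf(delta - N.inv_cdf(1 - alpha))

# Lever 1: power grows with n (fixed alpha = 0.05)
for n in (30, 60, 120):
    print(f"n = {n:3d}: power = {power_right(19, 19.5, 2, n, 0.05):.3f}")

# Lever 2: power grows with alpha (fixed n = 30), at the cost of more false positives
for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}: power = {power_right(19, 19.5, 2, 30, alpha):.3f}")
```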
3.6 Finding the Required Sample Size
To achieve a desired power \(1 - \beta\) for detecting a difference \(\mu_1 - \mu_0\) with a one-sided test, we need \(|\delta| \geq z_\alpha + z_\beta\), which gives
\[ n \geq \left(\frac{\sigma(z_\alpha + z_\beta)}{|\mu_1 - \mu_0|}\right)^2. \]
For a two-sided test, replace \(z_\alpha\) with \(z_{\alpha/2}\).
Always round up to the nearest integer.
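The sample-size rule \(n \geq \bigl(\sigma(z_\alpha + z_\beta)/|\mu_1 - \mu_0|\bigr)^2\) can be sketched as a small helper (the name `required_n` is chosen here); it uses the identity \(z_\beta = \Phi^{-1}(\text{power})\):

```python
from math import ceil
from statistics import NormalDist

N = NormalDist()

def required_n(mu0, mu1, sigma, alpha=0.05, power=0.80, two_sided=False):
    """Smallest sample size giving the target power for a z-test
    that must detect the alternative mean mu1 against mu0."""
    z_a = N.inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_b = N.inv_cdf(power)                          # z_beta, since beta = 1 - power
    n_exact = (sigma * (z_a + z_b) / abs(mu1 - mu0)) ** 2
    return ceil(n_exact)                            # always round up

# Example 3's setting below: detect 102 vs 100, sigma = 5, 90% power at alpha = 0.05
print(required_n(100, 102, 5, alpha=0.05, power=0.90))  # 54
```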
Worked Examples
Example 1: Power of a Right-Tailed z-Test
Recall the testing scenario from N10 Example 1: test \(H_0: \mu = 500\) vs \(H_a: \mu > 500\) with \(\sigma = 2.0\), \(n = 30\), at \(\alpha = 0.05\).
What is the power of this test if the true mean is actually \(\mu_1 = 501\)?
\(\delta = \dfrac{\sqrt{30}(501 - 500)}{2.0} = \dfrac{5.477}{2} = 2.739\), so Power = \(\Phi(\delta - z_\alpha) = \Phi(2.739 - 1.645) = \Phi(1.094)\).
From tables: \(\Phi(1.09) \approx 0.8621\).
Power \(\approx 86.2\%\). If the true mean is 501 mg, this test has an 86.2% chance of correctly rejecting \(H_0\).
Example 2: Low Power Scenario
Consider testing \(H_0: \mu = 19.0\) vs \(H_a: \mu > 19.0\) with \(\sigma = 2.0\), \(n = 30\), at \(\alpha = 0.05\).
What is the power if the true mean is only \(\mu_1 = 19.5\)?
\(\delta = \dfrac{\sqrt{30}(19.5 - 19.0)}{2.0} = 1.369\), so Power = \(\Phi(1.369 - 1.645) = \Phi(-0.276)\).
From tables: \(\Phi(-0.276) = 1 - \Phi(0.276) \approx 1 - 0.6087 = 0.3913\).
Power \(\approx 39.1\%\). Even though the true mean is 0.5 units above the null, the test has only a 39% chance of detecting this difference. This is a low-power test.
Why is the power low? The effect size (0.5) is small relative to the standard error (\(2/\sqrt{30} \approx 0.365\)). The difference is only about 1.37 standard errors, which is not large enough to reliably cross the critical value of 1.645.
Example 3: Required Sample Size for Target Power
An engineer wants to test \(H_0: \mu = 100\) vs \(H_a: \mu > 100\) with \(\sigma = 5\). They want to be able to detect a shift to \(\mu_1 = 102\) with at least 90% power, using \(\alpha = 0.05\).
What minimum sample size is required?
\(\alpha = 0.05\): \(z_{0.05} = 1.645\).
Target power = 0.90, so \(\beta = 0.10\): \(z_{0.10} = 1.282\).
\(\mu_1 - \mu_0 = 2\), \(\sigma = 5\).
\(n \geq \left(\dfrac{5(1.645 + 1.282)}{2}\right)^2 = \left(\dfrac{5 \times 2.927}{2}\right)^2 = \left(7.318\right)^2 = 53.55\).
Minimum sample size: \(n \geq 54\).
For comparison, with only \(n = 30\) the power here would be \(\Phi(\sqrt{30} \times 2/5 - 1.645) = \Phi(0.546) \approx 0.71\), well short of the 90% target.
Example 4: Power for a Two-Sided Test
Test \(H_0: \mu = 50\) vs \(H_a: \mu \neq 50\) with \(\sigma = 8\), \(n = 36\), at \(\alpha = 0.05\).
What is the power if the true mean is \(\mu_1 = 53\)?
\(z_{0.025} = 1.96\); \(\delta = \dfrac{\sqrt{36}(53 - 50)}{8} = \dfrac{6 \times 3}{8} = 2.25\).
Full formula: Power = \(\Phi(2.25 - 1.96) + \Phi(-1.96 - 2.25)\) = \(\Phi(0.29) + \Phi(-4.21)\) = \(0.6141 + 0.0000 = 0.6141\).
Approximation (one term): Power \(\approx \Phi(0.29) = 0.6141\).
Power \(\approx 61.4\%\). The second term (\(\Phi(-4.21)\)) is essentially zero because the left-tail probability under the \(H_a\) distribution is negligible when \(\mu_1 > \mu_0\).
Pattern Recognition & Examiner Traps
- "Calculate the power of this test when \(\mu = \mu_1\)" — compute \(\delta\), then apply \(\Phi(|\delta| - z_\alpha)\) for one-sided or \(\Phi(|\delta| - z_{\alpha/2})\) for two-sided.
- "Find the minimum sample size to achieve 80% power" — use \(n = \bigl(\sigma(z_\alpha + z_{0.20})/\Delta\bigr)^2\) for one-sided or \(z_{\alpha/2}\) for two-sided. Round up.
- "Comment on the power" — if power < 50%, the test is weak and a non-rejection of \(H_0\) is not informative. If power > 80%, the test is reliable.
- Examiners frequently test the N10 example scenario (\(n = 30\), \(\mu_0 = 19\), \(\sigma = 2\)) and ask for power at \(\mu_1 = 19.5\). This gives \(\delta = 1.369\) and power \(\approx 0.391\).
Connections
- ← N10 (Hypothesis Testing): N11 extends the N10 framework by evaluating how well the test performs. The same test statistic, rejection region, and critical values from N10 are used to compute power.
- ← N6 (Sampling Distributions): The power formula depends on \(\bar{X} \sim N(\mu, \sigma^2/n)\), a result from N6. The non-centrality parameter \(\delta\) is the effect size scaled by the standard error.
- → N12 (Two-Sample Tests): Power calculations for two-sample tests follow the same logic, but with a two-sample \(\delta = \dfrac{\mu_1 - \mu_2 - \Delta_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}\). Understanding one-sample power makes two-sample power straightforward.
Summary Table
| Scenario | Power Formula | Required n |
|---|---|---|
| One-sided (\(\mu_1 > \mu_0\)) | \(\Phi(\delta - z_\alpha)\) | \(\left(\dfrac{\sigma(z_\alpha + z_\beta)}{\mu_1-\mu_0}\right)^2\) |
| One-sided (\(\mu_1 < \mu_0\)) | \(\Phi(|\delta| - z_\alpha)\) | Same as above |
| Two-sided | \(\approx \Phi(|\delta| - z_{\alpha/2})\) | \(\left(\dfrac{\sigma(z_{\alpha/2} + z_\beta)}{\mu_1-\mu_0}\right)^2\) |
where \(\delta = \dfrac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma}\)
Self-Assessment
Test your understanding before moving to N12:
- For a right-tailed z-test with \(n = 30\), \(\sigma = 2\), \(\mu_0 = 19\), \(\alpha = 0.05\), compute the power when \(\mu_1 = 20\). [Answer: \(\delta = \sqrt{30}(1)/2 = 2.739\), power = \(\Phi(2.739 - 1.645) = \Phi(1.094) \approx 0.863\).]
- For the same test, compute the power when \(\mu_1 = 19.3\). [Answer: \(\delta = 0.8215\), power = \(\Phi(0.8215 - 1.645) = \Phi(-0.824) \approx 0.205\).]
- Test \(H_0: \mu = 100\) vs \(H_a: \mu > 100\), \(\sigma = 10\), \(\alpha = 0.05\). Find the minimum sample size to achieve 80% power when \(\mu_1 = 104\). [Answer: \(n \geq \left(\frac{10(1.645 + 0.842)}{4}\right)^2 = 38.65\), so \(n \geq 39\).]
- Explain: why might a non-rejection of \(H_0\) not be "good news"? [Answer: If the test has low power, a non-rejection could simply mean the test was too weak to detect the effect. You need to check the power at the minimum important effect size before concluding no effect exists.]
- For a two-sided test with \(\alpha = 0.01\), \(\sigma = 6\), \(n = 25\), compute the power when \(\mu_1 = \mu_0 + 2.5\). [Answer: \(\delta = 5 \times 2.5/6 = 2.083\), power \(\approx \Phi(2.083 - 2.576) = \Phi(-0.493) \approx 0.311\).]
- Describe the relationship between sample size and power. Sketch a graph of power vs \(n\) for a fixed effect size. [Answer: Power increases with \(n\), but with diminishing returns. The curve starts steep and flattens as power approaches 100%.]
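Two of the numerical answers above can be cross-checked with a short script (helper names chosen here, not from the text):

```python
from statistics import NormalDist

N = NormalDist()

def power_right(mu0, mu1, sigma, n, alpha=0.05):
    """Right-tailed z-test power at the alternative mean mu1."""
    delta = (n ** 0.5) * (mu1 - mu0) / sigma
    return N.cdf(delta - N.inv_cdf(1 - alpha))

def power_two_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Two-sided z-test power, summing both rejection tails."""
    delta = (n ** 0.5) * (mu1 - mu0) / sigma
    z_half = N.inv_cdf(1 - alpha / 2)
    return N.cdf(delta - z_half) + N.cdf(-delta - z_half)

print(round(power_right(19, 19.3, 2, 30), 3))          # second question: ~0.205
print(round(power_two_sided(0, 2.5, 6, 25, 0.01), 3))  # two-sided question: ~0.311
```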
HLQ: Exam-Style Question with Worked Solution
A food manufacturer labels cereal boxes as containing 500 grams. A consumer agency suspects the actual mean is less than 500 grams. From historical data, the standard deviation of box weights is known to be \(\sigma = 4.0\) grams. The agency plans to test \(H_0: \mu = 500\) vs \(H_a: \mu < 500\) at the 5% significance level.
(a) With a sample of \(n = 25\) boxes, compute the power of this test if the true mean is actually \(\mu_1 = 498\) grams. (4 marks)
(b) Compute the power if the true mean is \(\mu_1 = 497\) grams. Comment on how it compares to part (a). (3 marks)
(c) What minimum sample size is needed to achieve at least 90% power to detect a true mean of 498 grams? (3 marks)
(d) The agency ultimately collected 25 boxes and found \(\bar{x} = 498.1\) grams. Carry out the full hypothesis test. Given your answer to part (a), how should the result be interpreted? (2 marks)
(a) \(\delta = \dfrac{\sqrt{25}(498 - 500)}{4.0} = \dfrac{5 \times (-2)}{4} = -2.500\).
For a left-tailed test: Power = \(\Phi(|\delta| - z_\alpha) = \Phi(2.500 - 1.645) = \Phi(0.855)\).
From tables: \(\Phi(0.855) \approx 0.8038\).
Power \(\approx 80.4\%\). If the true mean is 2 grams below the label (a potentially meaningful discrepancy), the test with \(n = 25\) has about an 80% chance of detecting it: reasonably high, but short of the 90% target in part (c).
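The part (a) power can be double-checked directly from the critical value, without the \(\delta\) shortcut (a minimal sketch; variable names chosen here):

```python
from statistics import NormalDist

N = NormalDist()

# Left-tailed z-test: H0: mu = 500 vs Ha: mu < 500, sigma = 4, n = 25, alpha = 0.05.
mu0, mu1, sigma, n, alpha = 500.0, 498.0, 4.0, 25, 0.05
se = sigma / n ** 0.5                        # standard error = 0.8
x_crit = mu0 - N.inv_cdf(1 - alpha) * se     # reject when xbar < x_crit (~498.684)
power = N.cdf((x_crit - mu1) / se)           # P(Xbar < x_crit | mu = mu1 = 498)
print(round(x_crit, 3), round(power, 3))     # ~498.684, ~0.804
```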
(b) \(\delta = \dfrac{\sqrt{25}(497 - 500)}{4.0} = -3.750\), so Power = \(\Phi(|\delta| - z_\alpha) = \Phi(3.750 - 1.645) = \Phi(2.105)\).
Equivalently, via the critical value \(\bar{x}_c = 500 - 1.645 \times 4/5 = 498.684\): Power = \(P(\bar{X} < 498.684 \mid \mu = 497) = P\!\left(Z < \dfrac{498.684 - 497}{0.8}\right) = P(Z < 2.105)\).
From tables: \(\Phi(2.105) \approx 0.9824\).
Power \(\approx 98.2\%\). When the true mean is 3 grams below, the test is almost certain to detect it. This is higher than the 80.4% power when \(\mu_1 = 498\), because the effect is larger.
(c) For a left-tailed test, we need \(\mu_0 - z_\alpha \dfrac{\sigma}{\sqrt{n}} > \mu_1 + z_\beta \dfrac{\sigma}{\sqrt{n}}\), which gives:
\[ n \geq \left(\frac{\sigma(z_\alpha + z_\beta)}{|\mu_1 - \mu_0|}\right)^2 = \left(\frac{4(1.645 + 1.282)}{2}\right)^2 = \left(\frac{4 \times 2.927}{2}\right)^2 = (5.854)^2 = 34.27\] Minimum sample size: \(n \geq 35\).
With \(n = 25\), the power is about 80.4%, below the desired 90%.
(d) Step 1: \(H_0: \mu = 500\) vs \(H_a: \mu < 500\).
Step 2: \(\alpha = 0.05\).
Step 3: \(Z = \dfrac{\bar{X} - 500}{4/\sqrt{25}} \sim N(0, 1)\) under \(H_0\).
Step 4: Reject if \(Z < -1.645\).
Step 5: \(z_{\text{obs}} = \dfrac{498.1 - 500}{4/5} = \dfrac{-1.9}{0.8} = -2.375\).
Step 6: Since \(-2.375 < -1.645\), reject \(H_0\). There is sufficient evidence at the 5% level that the mean is less than 500 grams.
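Steps 5 and 6 can be reproduced numerically (a sketch; the p-value line is extra information not required by the question):

```python
from statistics import NormalDist

N = NormalDist()

# Part (d): observed statistic and decision for the left-tailed test at alpha = 0.05.
xbar, mu0, sigma, n = 498.1, 500.0, 4.0, 25
z_obs = (xbar - mu0) / (sigma / n ** 0.5)    # (498.1 - 500) / 0.8 = -2.375
z_crit = -N.inv_cdf(1 - 0.05)                # lower critical value, ~ -1.645
reject = z_obs < z_crit                      # True: reject H0
p_value = N.cdf(z_obs)                       # left-tail p-value, ~0.0088
print(round(z_obs, 3), reject, round(p_value, 4))
```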
Interpretation in light of part (a): The test rejected \(H_0\). Since the power at \(\mu_1 = 498\) is about 80%, a rejection is exactly what we would expect if the true mean really is near 498, and the observed \(\bar{x} = 498.1\) is consistent with a shortfall of roughly that size. The rejection and the power calculation tell a coherent story.
(a) Power at \(\mu_1 = 498\): \(\Phi(0.855) \approx 0.804\) (80.4%). Below the 90% target.
(b) Power at \(\mu_1 = 497\): \(\Phi(2.105) \approx 0.982\) (98.2%). Much higher because the effect is larger.
(c) Minimum sample size for 90% power at \(\mu_1 = 498\): \(n = 35\).
(d) Reject \(H_0\) (\(z = -2.375 < -1.645\)), consistent with the roughly 80% power at \(\mu_1 = 498\).