Why This Concept Exists
In N10, you learned to set up and execute hypothesis tests. But a crucial question was left unanswered: how good is the test? A test that always fails to reject \(H_0\) (even when \(H_0\) is false) is useless. The power of a test quantifies its ability to detect a true effect when it exists.
Power analysis answers practical questions that appear in both exam problems and real-world research design:
- "If the true mean is actually 19.5, what is the probability our test will detect that it differs from 19?"
- "How large a sample do we need to have an 80% chance of detecting an effect of size 0.5?"
- "Why did we fail to reject \(H_0\) — was there really no effect, or was our test simply too weak to detect it?"
Prerequisites
Before engaging with this node, you must be comfortable with:
- Hypothesis testing framework (N10): All six steps, the concept of rejection regions, and the distinction between Type I (\(\alpha\)) and Type II (\(\beta\)) errors.
- Standard normal probabilities: You must be able to compute \(\Phi(z)\) and \(1 - \Phi(z)\) efficiently. Power calculations reduce to evaluating normal CDF values.
- Two distributions in play: Understanding that hypothesis testing involves TWO sampling distributions — one under \(H_0\) and one under \(H_a\). The rejection region is defined under \(H_0\), but power is computed under \(H_a\).
- Non-centrality parameter: The standardized distance \(\delta = \dfrac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma}\) measures how far the alternative is from the null, in units of standard error.
Core Exposition
3.1 Definition of Power
The power of a test is the probability that it correctly rejects \(H_0\) when the alternative is true:
\[ \text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_a \text{ is true}), \]
where \(\beta = P(\text{Type II error}) = P(\text{do not reject } H_0 \mid H_a \text{ is true})\).
Power is always computed at a specific alternative value \(\mu = \mu_1\). Unlike \(\alpha\), which is fixed by the test design, power depends on how far \(\mu_1\) is from \(\mu_0\), the sample size, and the variance.
3.2 Power for a One-Sided z-Test (Right-Tailed)
Consider testing \(H_0: \mu = \mu_0\) vs \(H_a: \mu > \mu_0\) with known \(\sigma\). The rejection region is \(\bar{X} > \mu_0 + z_\alpha \dfrac{\sigma}{\sqrt{n}}\). Evaluating the probability of this region under \(\mu = \mu_1 > \mu_0\) gives
\[ \text{Power} = P\!\left(\bar{X} > \mu_0 + z_\alpha \tfrac{\sigma}{\sqrt{n}} \,\Big|\, \mu = \mu_1\right) = \Phi(\delta - z_\alpha), \qquad \delta = \frac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma}. \]
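The right-tailed power formula \(\Phi(\delta - z_\alpha)\) can be cross-checked numerically. This is a minimal sketch (the helper name `power_right` is chosen here, not part of the text), using the standard library's `statistics.NormalDist` for \(\Phi\) and \(\Phi^{-1}\):

```python
from statistics import NormalDist

N = NormalDist()  # standard normal: N.cdf is Phi, N.inv_cdf is Phi^{-1}

def power_right(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the right-tailed z-test H0: mu = mu0 vs Ha: mu > mu0,
    with known sigma, evaluated at the alternative mean mu1 > mu0."""
    delta = (n ** 0.5) * (mu1 - mu0) / sigma   # non-centrality parameter
    z_alpha = N.inv_cdf(1 - alpha)             # upper critical value, ~1.645 at alpha = 0.05
    return N.cdf(delta - z_alpha)              # Phi(delta - z_alpha)

# Example 1's setting below: mu0 = 500, mu1 = 501, sigma = 2, n = 30
print(round(power_right(500, 501, 2.0, 30), 3))  # ~0.863 (tables give 0.862 via Phi(1.09))
```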
3.3 Power for a One-Sided z-Test (Left-Tailed)
For testing \(H_0: \mu = \mu_0\) vs \(H_a: \mu < \mu_0\) at level \(\alpha\), with \(\mu_1 < \mu_0\), the rejection region is \(\bar{X} < \mu_0 - z_\alpha \dfrac{\sigma}{\sqrt{n}}\), and the mirror-image argument gives
\[ \text{Power} = \Phi(|\delta| - z_\alpha), \qquad \delta = \frac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma} < 0. \]
3.4 Power for a Two-Sided z-Test
For testing \(H_0: \mu = \mu_0\) vs \(H_a: \mu \neq \mu_0\) at level \(\alpha\):
\[ \text{Power} = \Phi(\delta - z_{\alpha/2}) + \Phi(-\delta - z_{\alpha/2}) \approx \Phi(|\delta| - z_{\alpha/2}), \]
where the approximation holds because the term for the rejection tail on the far side of \(\mu_1\) is usually negligible.
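The two-sided power \(\Phi(\delta - z_{\alpha/2}) + \Phi(-\delta - z_{\alpha/2})\) can be sketched the same way (again a hypothetical helper, not from the text); it evaluates both rejection tails under the alternative:

```python
from statistics import NormalDist

N = NormalDist()

def power_two_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Exact power of the two-sided z-test H0: mu = mu0 vs Ha: mu != mu0:
    probability of landing in either rejection tail when the true mean is mu1."""
    delta = (n ** 0.5) * (mu1 - mu0) / sigma
    z_half = N.inv_cdf(1 - alpha / 2)                     # z_{alpha/2}, ~1.96 at alpha = 0.05
    return N.cdf(delta - z_half) + N.cdf(-delta - z_half)

# Example 4's setting below: mu0 = 50, mu1 = 53, sigma = 8, n = 36
print(round(power_two_sided(50, 53, 8.0, 36), 3))  # ~0.614; the far-tail term is negligible
```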
3.5 Factors Affecting Power
There are two ways to increase the power of a test (for a fixed alternative):
1. Increase the sample size \(n\):
\(\delta = \sqrt{n}(\mu_1 - \mu_0)/\sigma\) grows with \(n\), increasing power.
Cost: More data collection is expensive and time-consuming.
2. Increase the significance level \(\alpha\):
Larger \(\alpha\) means a smaller critical value, so the rejection region is larger.
Cost: Higher Type I error rate (more false positives).
The relationship between \(\alpha\) and \(\beta\) is a trade-off: for a fixed sample size, decreasing \(\alpha\) (being more conservative about false positives) necessarily increases \(\beta\) (more false negatives), and vice versa.
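Both levers can be checked numerically. This sketch (using the \(\mu_0 = 19\), \(\mu_1 = 19.5\), \(\sigma = 2\) scenario that appears in Example 2 below; helper name chosen here) tabulates right-tailed power as \(n\) and \(\alpha\) vary:

```python
from statistics import NormalDist

N = NormalDist()

def power_right(mu0, mu1, sigma, n, alpha):
    """Power of the right-tailed z-test at the alternative mean mu1."""
    delta = (n ** 0.5) * (mu1 - mu0) / sigma
    return N.cdf(delta - N.inv_cdf(1 - alpha))

# Lever 1: power grows with n (fixed alpha = 0.05)
for n in (30, 60, 120):
    print(f"n = {n:3d}: power = {power_right(19, 19.5, 2, n, 0.05):.3f}")

# Lever 2: power grows with alpha (fixed n = 30), at the cost of more false positives
for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}: power = {power_right(19, 19.5, 2, 30, alpha):.3f}")
```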
3.6 Finding the Required Sample Size
To achieve a desired power \(1 - \beta\) for detecting a difference \(\mu_1 - \mu_0\) with a one-sided test, we need \(|\delta| \geq z_\alpha + z_\beta\), which gives
\[ n \geq \left(\frac{\sigma(z_\alpha + z_\beta)}{|\mu_1 - \mu_0|}\right)^2. \]
For a two-sided test, replace \(z_\alpha\) with \(z_{\alpha/2}\).
Always round up to the nearest integer.
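The sample-size rule \(n \geq \bigl(\sigma(z_\alpha + z_\beta)/|\mu_1 - \mu_0|\bigr)^2\) can be sketched as a small helper (the name `required_n` is chosen here); it uses the identity \(z_\beta = \Phi^{-1}(\text{power})\):

```python
from math import ceil
from statistics import NormalDist

N = NormalDist()

def required_n(mu0, mu1, sigma, alpha=0.05, power=0.80, two_sided=False):
    """Smallest sample size giving the target power for a z-test
    that must detect the alternative mean mu1 against mu0."""
    z_a = N.inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_b = N.inv_cdf(power)                          # z_beta, since beta = 1 - power
    n_exact = (sigma * (z_a + z_b) / abs(mu1 - mu0)) ** 2
    return ceil(n_exact)                            # always round up

# Example 3's setting below: detect 102 vs 100, sigma = 5, 90% power at alpha = 0.05
print(required_n(100, 102, 5, alpha=0.05, power=0.90))  # 54
```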
Worked Examples
Example 1: Power of a Right-Tailed z-Test
Recall the testing scenario from N10 Example 1: test \(H_0: \mu = 500\) vs \(H_a: \mu > 500\) with \(\sigma = 2.0\), \(n = 30\), at \(\alpha = 0.05\).
What is the power of this test if the true mean is actually \(\mu_1 = 501\)?
\(\delta = \dfrac{\sqrt{30}(501 - 500)}{2.0} = \dfrac{5.477}{2} = 2.739\), so Power = \(\Phi(\delta - z_\alpha) = \Phi(2.739 - 1.645) = \Phi(1.094)\).
From tables: \(\Phi(1.09) \approx 0.8621\).
Power \(\approx 86.2\%\). If the true mean is 501 mg, this test has an 86.2% chance of correctly rejecting \(H_0\).
Example 2: Low Power Scenario
Consider testing \(H_0: \mu = 19.0\) vs \(H_a: \mu > 19.0\) with \(\sigma = 2.0\), \(n = 30\), at \(\alpha = 0.05\).
What is the power if the true mean is only \(\mu_1 = 19.5\)?
\(\delta = \dfrac{\sqrt{30}(19.5 - 19.0)}{2.0} = 1.369\), so Power = \(\Phi(1.369 - 1.645) = \Phi(-0.276)\).
From tables: \(\Phi(-0.276) = 1 - \Phi(0.276) \approx 1 - 0.6087 = 0.3913\).
Power \(\approx 39.1\%\). Even though the true mean is 0.5 units above the null, the test has only a 39% chance of detecting this difference. This is a low-power test.
Why is the power low? The effect size (0.5) is small relative to the standard error (\(2/\sqrt{30} \approx 0.365\)). The difference is only about 1.37 standard errors, which is not large enough to reliably cross the critical value of 1.645.
Example 3: Required Sample Size for Target Power
An engineer wants to test \(H_0: \mu = 100\) vs \(H_a: \mu > 100\) with \(\sigma = 5\). They want to be able to detect a shift to \(\mu_1 = 102\) with at least 90% power, using \(\alpha = 0.05\).
What minimum sample size is required?
\(\alpha = 0.05\): \(z_{0.05} = 1.645\).
Target power = 0.90, so \(\beta = 0.10\): \(z_{0.10} = 1.282\).
\(\mu_1 - \mu_0 = 2\), \(\sigma = 5\).
\(n \geq \left(\dfrac{5(1.645 + 1.282)}{2}\right)^2 = \left(\dfrac{5 \times 2.927}{2}\right)^2 = \left(7.318\right)^2 = 53.55\).
Minimum sample size: \(n \geq 54\).
For comparison, with only \(n = 30\) the power here would be \(\Phi(\sqrt{30} \times 2/5 - 1.645) = \Phi(0.546) \approx 0.71\), well short of the 90% target.
Example 4: Power for a Two-Sided Test
Test \(H_0: \mu = 50\) vs \(H_a: \mu \neq 50\) with \(\sigma = 8\), \(n = 36\), at \(\alpha = 0.05\).
What is the power if the true mean is \(\mu_1 = 53\)?
\(z_{0.025} = 1.96\); \(\delta = \dfrac{\sqrt{36}(53 - 50)}{8} = \dfrac{6 \times 3}{8} = 2.25\).
Full formula: Power = \(\Phi(2.25 - 1.96) + \Phi(-1.96 - 2.25)\) = \(\Phi(0.29) + \Phi(-4.21)\) = \(0.6141 + 0.0000 = 0.6141\).
Approximation (one term): Power \(\approx \Phi(0.29) = 0.6141\).
Power \(\approx 61.4\%\). The second term (\(\Phi(-4.21)\)) is essentially zero because the left-tail probability under the \(H_a\) distribution is negligible when \(\mu_1 > \mu_0\).
Pattern Recognition & Examiner Traps
- "Calculate the power of this test when \(\mu = \mu_1\)" — compute \(\delta\), then apply \(\Phi(|\delta| - z_\alpha)\) for one-sided or \(\Phi(|\delta| - z_{\alpha/2})\) for two-sided.
- "Find the minimum sample size to achieve 80% power" — use \(n = \bigl(\sigma(z_\alpha + z_{0.20})/\Delta\bigr)^2\) for one-sided or \(z_{\alpha/2}\) for two-sided. Round up.
- "Comment on the power" — if power < 50%, the test is weak and a non-rejection of \(H_0\) is not informative. If power > 80%, the test is reliable.
- Examiners frequently test the N10 example scenario (\(n = 30\), \(\mu_0 = 19\), \(\sigma = 2\)) and ask for power at \(\mu_1 = 19.5\). This gives \(\delta = 1.369\) and power \(\approx 0.391\).
Connections
- ← N10 (Hypothesis Testing): N11 extends the N10 framework by evaluating how well the test performs. The same test statistic, rejection region, and critical values from N10 are used to compute power.
- ← N6 (Sampling Distributions): The power formula depends on \(\bar{X} \sim N(\mu, \sigma^2/n)\), a result from N6. The non-centrality parameter \(\delta\) is the effect size scaled by the standard error.
- → N12 (Two-Sample Tests): Power calculations for two-sample tests follow the same logic, but with a two-sample \(\delta = \dfrac{\mu_1 - \mu_2 - \Delta_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}\). Understanding one-sample power makes two-sample power straightforward.
Summary Table
| Scenario | Power Formula | Required n |
|---|---|---|
| One-sided (\(\mu_1 > \mu_0\)) | \(\Phi(\delta - z_\alpha)\) | \(\left(\dfrac{\sigma(z_\alpha + z_\beta)}{\mu_1-\mu_0}\right)^2\) |
| One-sided (\(\mu_1 < \mu_0\)) | \(\Phi(|\delta| - z_\alpha)\) | Same as above |
| Two-sided | \(\approx \Phi(|\delta| - z_{\alpha/2})\) | \(\left(\dfrac{\sigma(z_{\alpha/2} + z_\beta)}{\mu_1-\mu_0}\right)^2\) |
where \(\delta = \dfrac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma}\)
Self-Assessment
Test your understanding before moving to N12:
- For a right-tailed z-test with \(n = 30\), \(\sigma = 2\), \(\mu_0 = 19\), \(\alpha = 0.05\), compute the power when \(\mu_1 = 20\). [Answer: \(\delta = \sqrt{30}(1)/2 = 2.739\), power = \(\Phi(2.739 - 1.645) = \Phi(1.094) \approx 0.863\).]
- For the same test, compute the power when \(\mu_1 = 19.3\). [Answer: \(\delta = 0.8215\), power = \(\Phi(0.8215 - 1.645) = \Phi(-0.824) \approx 0.205\).]
- Test \(H_0: \mu = 100\) vs \(H_a: \mu > 100\), \(\sigma = 10\), \(\alpha = 0.05\). Find the minimum sample size to achieve 80% power when \(\mu_1 = 104\). [Answer: \(n \geq \left(\frac{10(1.645 + 0.842)}{4}\right)^2 = 38.65\), so \(n \geq 39\).]
- Explain: why might a non-rejection of \(H_0\) not be "good news"? [Answer: If the test has low power, a non-rejection could simply mean the test was too weak to detect the effect. You need to check the power at the minimum important effect size before concluding no effect exists.]
- For a two-sided test with \(\alpha = 0.01\), \(\sigma = 6\), \(n = 25\), compute the power when \(\mu_1 = \mu_0 + 2.5\). [Answer: \(\delta = 5 \times 2.5/6 = 2.083\), power \(\approx \Phi(2.083 - 2.576) = \Phi(-0.493) \approx 0.311\).]
- Describe the relationship between sample size and power. Sketch a graph of power vs \(n\) for a fixed effect size. [Answer: Power increases with \(n\), but with diminishing returns. The curve starts steep and flattens as power approaches 100%.]
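Two of the numerical answers above can be cross-checked with a short script (helper names chosen here, not from the text):

```python
from statistics import NormalDist

N = NormalDist()

def power_right(mu0, mu1, sigma, n, alpha=0.05):
    """Right-tailed z-test power at the alternative mean mu1."""
    delta = (n ** 0.5) * (mu1 - mu0) / sigma
    return N.cdf(delta - N.inv_cdf(1 - alpha))

def power_two_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Two-sided z-test power, summing both rejection tails."""
    delta = (n ** 0.5) * (mu1 - mu0) / sigma
    z_half = N.inv_cdf(1 - alpha / 2)
    return N.cdf(delta - z_half) + N.cdf(-delta - z_half)

print(round(power_right(19, 19.3, 2, 30), 3))          # second question: ~0.205
print(round(power_two_sided(0, 2.5, 6, 25, 0.01), 3))  # two-sided question: ~0.311
```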
HLQ: Exam-Style Question with Worked Solution
A food manufacturer labels cereal boxes as containing 500 grams. A consumer agency suspects the actual mean is less than 500 grams. From historical data, the standard deviation of box weights is known to be \(\sigma = 4.0\) grams. The agency plans to test \(H_0: \mu = 500\) vs \(H_a: \mu < 500\) at the 5% significance level.
(a) With a sample of \(n = 25\) boxes, compute the power of this test if the true mean is actually \(\mu_1 = 498\) grams. (4 marks)
(b) Compute the power if the true mean is \(\mu_1 = 497\) grams. Comment on how it compares to part (a). (3 marks)
(c) What minimum sample size is needed to achieve at least 90% power to detect a true mean of 498 grams? (3 marks)
(d) The agency ultimately collected 25 boxes and found \(\bar{x} = 498.1\) grams. Carry out the full hypothesis test. Given your answer to part (a), how should the result be interpreted? (2 marks)
(a) \(\delta = \dfrac{\sqrt{25}(498 - 500)}{4.0} = \dfrac{5 \times (-2)}{4} = -2.500\).
For a left-tailed test: Power = \(\Phi(|\delta| - z_\alpha) = \Phi(2.500 - 1.645) = \Phi(0.855)\).
From tables: \(\Phi(0.855) \approx 0.8038\).
Power \(\approx 80.4\%\). If the true mean is 2 grams below the label (a potentially meaningful discrepancy), the test with \(n = 25\) has about an 80% chance of detecting it: reasonably high, but short of the 90% target in part (c).
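The part (a) power can be double-checked directly from the critical value, without the \(\delta\) shortcut (a minimal sketch; variable names chosen here):

```python
from statistics import NormalDist

N = NormalDist()

# Left-tailed z-test: H0: mu = 500 vs Ha: mu < 500, sigma = 4, n = 25, alpha = 0.05.
mu0, mu1, sigma, n, alpha = 500.0, 498.0, 4.0, 25, 0.05
se = sigma / n ** 0.5                        # standard error = 0.8
x_crit = mu0 - N.inv_cdf(1 - alpha) * se     # reject when xbar < x_crit (~498.684)
power = N.cdf((x_crit - mu1) / se)           # P(Xbar < x_crit | mu = mu1 = 498)
print(round(x_crit, 3), round(power, 3))     # ~498.684, ~0.804
```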
(b) \(\delta = \dfrac{\sqrt{25}(497 - 500)}{4.0} = -3.750\), so Power = \(\Phi(|\delta| - z_\alpha) = \Phi(3.750 - 1.645) = \Phi(2.105)\).
Equivalently, via the critical value \(\bar{x}_c = 500 - 1.645 \times 4/5 = 498.684\): Power = \(P(\bar{X} < 498.684 \mid \mu = 497) = P\!\left(Z < \dfrac{498.684 - 497}{0.8}\right) = P(Z < 2.105)\).
From tables: \(\Phi(2.105) \approx 0.9824\).
Power \(\approx 98.2\%\). When the true mean is 3 grams below, the test is almost certain to detect it. This is higher than the 80.4% power when \(\mu_1 = 498\), because the effect is larger.
(c) For a left-tailed test, we need \(\mu_0 - z_\alpha \dfrac{\sigma}{\sqrt{n}} > \mu_1 + z_\beta \dfrac{\sigma}{\sqrt{n}}\), which gives:
\[ n \geq \left(\frac{\sigma(z_\alpha + z_\beta)}{|\mu_1 - \mu_0|}\right)^2 = \left(\frac{4(1.645 + 1.282)}{2}\right)^2 = \left(\frac{4 \times 2.927}{2}\right)^2 = (5.854)^2 = 34.27\] Minimum sample size: \(n \geq 35\).
With \(n = 25\), the power is about 80.4%, below the desired 90%.
(d) Step 1: \(H_0: \mu = 500\) vs \(H_a: \mu < 500\).
Step 2: \(\alpha = 0.05\).
Step 3: \(Z = \dfrac{\bar{X} - 500}{4/\sqrt{25}} \sim N(0, 1)\) under \(H_0\).
Step 4: Reject if \(Z < -1.645\).
Step 5: \(z_{\text{obs}} = \dfrac{498.1 - 500}{4/5} = \dfrac{-1.9}{0.8} = -2.375\).
Step 6: Since \(-2.375 < -1.645\), reject \(H_0\). There is sufficient evidence at the 5% level that the mean is less than 500 grams.
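Steps 5 and 6 can be reproduced numerically (a sketch; the p-value line is extra information not required by the question):

```python
from statistics import NormalDist

N = NormalDist()

# Part (d): observed statistic and decision for the left-tailed test at alpha = 0.05.
xbar, mu0, sigma, n = 498.1, 500.0, 4.0, 25
z_obs = (xbar - mu0) / (sigma / n ** 0.5)    # (498.1 - 500) / 0.8 = -2.375
z_crit = -N.inv_cdf(1 - 0.05)                # lower critical value, ~ -1.645
reject = z_obs < z_crit                      # True: reject H0
p_value = N.cdf(z_obs)                       # left-tail p-value, ~0.0088
print(round(z_obs, 3), reject, round(p_value, 4))
```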
Interpretation in light of part (a): The test rejected \(H_0\). Since the power at \(\mu_1 = 498\) is about 80%, a rejection is exactly what we would expect if the true mean really is near 498, and the observed \(\bar{x} = 498.1\) is consistent with a shortfall of roughly that size. The rejection and the power calculation tell a coherent story.
(a) Power at \(\mu_1 = 498\): \(\Phi(0.855) \approx 0.804\) (80.4%). Below the 90% target.
(b) Power at \(\mu_1 = 497\): \(\Phi(2.105) \approx 0.982\) (98.2%). Much higher because the effect is larger.
(c) Minimum sample size for 90% power at \(\mu_1 = 498\): \(n = 35\).
(d) Reject \(H_0\) (\(z = -2.375 < -1.645\)), consistent with the roughly 80% power at \(\mu_1 = 498\).