Type I and Type II Errors, Statistical Power, and p-Hacking

Every statistical test in finance carries the risk of two types of errors. A Type I error means acting on a signal that isn’t real — like hiring a manager whose track record was driven by luck. A Type II error means missing a signal that is real — like dismissing a profitable strategy because your test lacked the power to detect it. Understanding these errors, and how statistical power affects them, is essential for making sound investment decisions.

What Are Type I and Type II Errors?

When you conduct a hypothesis test, you start with a null hypothesis (H0) and decide whether to reject it based on your data. Two things can go wrong:

Key Concept

Type I Error (False Positive): Rejecting a true null hypothesis. You conclude there’s an effect when there isn’t one.

Type II Error (False Negative): Failing to reject a false null hypothesis. You miss a real effect that actually exists.

Type I vs. Type II Errors: Decision Matrix

|                   | H0 is True (No Effect)           | H0 is False (Effect Exists)        |
|-------------------|----------------------------------|------------------------------------|
| Reject H0         | Type I Error (α): False Positive | Correct Decision (Power = 1 – β)   |
| Fail to Reject H0 | Correct Decision (1 – α)         | Type II Error (β): False Negative  |

The probabilities of these errors have specific names:

  • Alpha (α) — the probability of a Type I error, also called the significance level. If you set α = 0.05, you accept a 5% chance of rejecting H0 when it’s actually true.
  • Beta (β) — the probability of a Type II error. If β = 0.20, there’s a 20% chance you’ll fail to detect a real effect.

In finance, both errors have real costs. A Type I error might lead you to pay active management fees for a manager with no actual skill. A Type II error might cause you to ignore a factor premium that could have enhanced your portfolio returns.
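
These error rates are easy to see in simulation. Below is a minimal sketch (assuming numpy and scipy are available; the monthly volatility and sample size are hypothetical, not from the article) showing that a test run at α = 0.05 on a manager with zero true alpha flags a "significant" result about 5% of the time:

```python
# A minimal sketch: with a true null (zero mean excess returns), a test at
# alpha = 0.05 rejects about 5% of the time -- the Type I error rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_trials, n_obs, alpha = 10_000, 36, 0.05   # hypothetical: 36 monthly returns

false_positives = 0
for _ in range(n_trials):
    returns = rng.normal(loc=0.0, scale=0.04, size=n_obs)  # true alpha = 0
    _, p_value = stats.ttest_1samp(returns, popmean=0.0)
    false_positives += p_value < alpha

print(f"Rejection rate under a true null: {false_positives / n_trials:.3f}")
# Expected output: approximately 0.05
```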

The Trade-Off Between Alpha and Beta

Alpha and beta are inversely related: if you make your test more stringent (lower α), you reduce false positives but increase false negatives. You cannot reduce both simultaneously without collecting more data or studying a larger effect.

| Decision                                                  | Effect on Type I Error (α) | Effect on Type II Error (β) |
|-----------------------------------------------------------|----------------------------|-----------------------------|
| Lower significance level (e.g., α = 0.01 instead of 0.05) | Decreases                  | Increases                   |
| Raise significance level (e.g., α = 0.10 instead of 0.05) | Increases                  | Decreases                   |
| Increase sample size                                      | No direct effect           | Decreases                   |
| Larger true effect size                                   | No direct effect           | Decreases                   |

This trade-off forces you to think carefully about which error is more costly in your specific context. In drug trials, avoiding Type I errors (approving an ineffective drug) is usually the priority. In investment research, Type II errors (missing profitable strategies) can be just as damaging as false positives.
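
The table above can be put on a numeric footing. The sketch below uses the normal approximation for a one-sided test of a mean; the 2% premium, 15% volatility, and 60 observations are hypothetical values chosen for illustration:

```python
# A sketch of the alpha-beta trade-off for a one-sided test of a mean,
# using the normal approximation. Effect size and n are hypothetical.
import numpy as np
from scipy import stats

effect, sigma, n = 0.02, 0.15, 60       # hypothetical premium, vol, sample size
se = sigma / np.sqrt(n)                 # standard error of the mean

for alpha in (0.01, 0.05, 0.10):
    z_crit = stats.norm.ppf(1 - alpha)  # one-sided critical value
    power = 1 - stats.norm.cdf(z_crit - effect / se)
    print(f"alpha = {alpha:.2f} -> beta = {1 - power:.3f}, power = {power:.3f}")
# Lowering alpha raises beta; raising alpha lowers beta.
```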

Statistical Power Explained

Statistical power is the probability of correctly rejecting a false null hypothesis — in other words, the probability of detecting a real effect when one exists. Importantly, power is not a fixed property of a test; it depends on the specific effect size you’re trying to detect, your sample size, the variance in your data, and your chosen significance level.

Power Formula
Power = 1 – β
Where β is the probability of a Type II error

If your test has 80% power, there’s an 80% chance you’ll detect a real effect and a 20% chance you’ll miss it (Type II error). Four factors determine statistical power:

  • Sample size (n) — more data reduces sampling variability, making real effects easier to detect
  • Effect size — larger true effects are easier to distinguish from noise
  • Significance level (α) — a higher α threshold makes it easier to reject H0
  • Population variance — lower variance in the data makes effects more visible

Pro Tip

A common rule of thumb is to design studies with at least 80% power. This means you accept a 20% chance of missing a real effect. In practice, many investment backtests and manager evaluations have far lower power, leading to unreliable conclusions.
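
To see why the 80% convention is so demanding in finance, here is a sketch of an a priori power calculation using statsmodels (assumed to be installed); treating an information ratio of 0.4 as an annual Cohen's d is a simplifying assumption for illustration:

```python
# A sketch of an a priori power calculation with statsmodels.
from statsmodels.stats.power import TTestPower

analysis = TTestPower()

# Observations needed to detect an effect size of 0.4 (here read as an
# information ratio with annual data) at 80% power, alpha = 0.05, two-sided:
n_required = analysis.solve_power(effect_size=0.4, power=0.80, alpha=0.05,
                                  alternative='two-sided')
print(f"Observations required: {n_required:.0f}")   # roughly 50
```

Roughly 50 annual observations, i.e., around five decades of data, which is why short manager track records are so hard to evaluate reliably.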

Type I and Type II Errors in Investment Research

Let’s see how these concepts play out in real investment scenarios.

Example 1: Manager Selection False Positive

You’re evaluating whether a hedge fund manager has genuine skill.

  • Null hypothesis: Manager alpha = 0 (no skill, returns are luck)
  • Data: 36 months of returns, t-statistic = 2.1, p-value = 0.043
  • Decision: At α = 0.05, you reject H0 and allocate $10 million

The problem: With only 3 years of data and typical return volatility, the test has roughly 10-15% power to detect a manager with a true information ratio of 0.4. The vast majority of genuinely skilled managers would not produce a significant t-statistic in such a short window. If skilled managers are rare (say, only 20% of candidates), the false discovery rate is much higher than 5%.

Cost of Type I error: You pay 1% in annual fees ($100,000/year) for a manager who may simply have been lucky. The “statistically significant” result doesn’t mean skill is probable.
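
The example's false discovery rate can be made explicit with Bayes' rule. A minimal sketch, using the article's illustrative figures (a 20% base rate of skill, roughly 12% power, α = 0.05):

```python
# A sketch of the false discovery rate among "significant" managers,
# combining the test's error rates with a base rate of true skill.
alpha = 0.05          # Type I error rate
power = 0.12          # probability the test flags a truly skilled manager
p_skill = 0.20        # prior probability a candidate manager is skilled

p_sig = power * p_skill + alpha * (1 - p_skill)   # P(significant result)
fdr = alpha * (1 - p_skill) / p_sig               # P(no skill | significant)

print(f"P(significant) = {p_sig:.3f}")
print(f"False discovery rate = {fdr:.1%}")        # roughly 60%+
```

In other words, under these assumptions more than half of the managers who clear the 5% significance bar are lucky rather than skilled.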

Example 2: Missing a Real Factor Premium

You’re testing whether a value factor premium exists in a market segment.

  • True premium: 2% annually (modest but economically meaningful)
  • Data: 5 years of annual returns with 15% annual volatility
  • Annualized Sharpe ratio of the premium: 2% / 15% ≈ 0.13
  • Power: Only about 6% at α = 0.05 with n = 5 annual observations

Result: You fail to reject H0 and conclude “value is dead.”

Cost of Type II error: You miss a strategy that could have added 2% annually to portfolio returns. With such low power, failing to reject the null is almost guaranteed — the test was never capable of detecting a real but modest effect.
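
The ~6% power figure can be checked with statsmodels (assumed installed); equating the premium's annualized Sharpe ratio with Cohen's d is a simplification:

```python
# A sketch reproducing the ~6% power figure: a one-sample t-test of a 2%
# premium against 15% volatility with only five annual observations.
from statsmodels.stats.power import TTestPower

effect_size = 0.02 / 0.15          # Cohen's d ~ annualized Sharpe ~ 0.13
power = TTestPower().power(effect_size=effect_size, nobs=5, alpha=0.05,
                           alternative='two-sided')
print(f"Power with n = 5: {power:.1%}")   # about 6%
```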

Statistical Significance vs. Economic Significance

A statistically significant result is not always economically meaningful, and vice versa. These are distinct concepts that both matter for investment decisions.

Statistically Significant

  • P-value below threshold (e.g., p < 0.05)
  • Effect unlikely due to chance alone
  • Says nothing about whether the effect size is large enough to matter
  • Achievable with large samples and tiny effects

Economically Significant

  • Effect size large enough to matter
  • Exceeds transaction costs and implementation frictions
  • Justifies the operational complexity
  • Can exist even if not statistically significant

Important

A strategy with p = 0.001 and expected excess return of 0.1% annually is statistically significant but economically worthless — transaction costs would consume the entire premium. Conversely, a strategy with p = 0.08 and expected excess return of 5% annually may be economically compelling despite failing a strict significance test. Always evaluate both dimensions before allocating capital.
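
A quick way to keep both dimensions in view is to net assumed implementation costs against the gross premium. A back-of-the-envelope sketch with hypothetical cost and turnover figures:

```python
# A sketch of economic significance: expected excess return net of assumed
# trading costs. All inputs are hypothetical.
def net_premium(gross_annual, cost_per_trade, turnover_per_year):
    """Expected annual excess return after trading costs."""
    return gross_annual - cost_per_trade * turnover_per_year

# Statistically significant but economically worthless:
print(f"{net_premium(0.001, 0.0005, 4):+.2%}")   # 0.1% gross -> -0.10% net

# Marginally significant but economically compelling:
print(f"{net_premium(0.05, 0.0005, 4):+.2%}")    # 5% gross -> +4.80% net
```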

p-Hacking and the Multiple Comparisons Problem

P-hacking occurs when researchers test many hypotheses, try different specifications, or subset the data until they find a “significant” result — then report only that finding. This practice dramatically inflates false positive rates.

The Multiple Testing Problem

If you test 20 independent hypotheses at α = 0.05, and all null hypotheses are actually true (the “global null”), the probability of at least one false positive is:

1 – (0.95)^20 ≈ 64%

On average, you’ll find one “significant” result purely by chance — even though no true effects exist.
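
The same calculation for other numbers of tests, as a short sketch:

```python
# A sketch of the family-wise false positive rate under the global null:
# the chance of at least one "significant" result in k independent tests.
alpha = 0.05
for k in (1, 5, 20, 100):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:>3} tests: P(at least one false positive) = {p_any:.1%}")
# 20 tests -> ~64%, matching the calculation above; 100 tests -> ~99%.
```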

Common forms of p-hacking in finance:

  • Data dredging — testing hundreds of factors until something “works”
  • Specification search — trying different time periods, benchmarks, or control variables
  • Selective reporting — publishing only strategies that showed significance
  • Optional stopping — adding data until the result crosses the significance threshold, then stopping

Corrections for multiple testing:

  • Bonferroni correction — divide α by the number of tests (α/n). Simple but conservative.
  • False Discovery Rate (FDR) — controls the expected proportion of false positives among rejected hypotheses. Less conservative than Bonferroni.
  • Out-of-sample validation — test promising strategies on held-out data before drawing conclusions.

Research by Harvey, Liu, and Zhu (2016) suggests that the standard t-statistic threshold of 2.0 is far too low for factor research. Given the hundreds of factors tested across academic finance, a t-statistic of 3.0 or higher may be more appropriate.
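
Here is a minimal sketch of the first two corrections using statsmodels' multipletests (assumed installed); the p-values below are made up for illustration:

```python
# A sketch of the Bonferroni and Benjamini-Hochberg (FDR) corrections.
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.012, 0.030, 0.044, 0.260, 0.510]

for method, label in [("bonferroni", "Bonferroni"), ("fdr_bh", "FDR (BH)")]:
    reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{label:>10}: reject = {list(reject)}")
# Bonferroni keeps only the strongest result; BH typically keeps more.
```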

Common Mistakes

Even experienced analysts fall into these traps when interpreting statistical tests:

  1. Confusing statistical significance with practical importance — A highly significant result with a tiny effect size may not be worth trading. Always examine effect magnitude alongside p-values.
  2. Ignoring power when designing studies — Running an underpowered backtest (e.g., 5 years of data for a subtle effect) almost guarantees you’ll miss real strategies. Calculate required sample size before testing.
  3. Testing many strategies without adjusting for multiple comparisons — If you test 100 strategies and report the best five, you’re likely reporting noise. Apply Bonferroni or FDR corrections.
  4. Treating p = 0.049 vs. p = 0.051 as fundamentally different — These are essentially the same result. The 0.05 threshold is arbitrary, not a scientific boundary.
  5. Assuming non-significant means “no effect” — Absence of evidence is not evidence of absence. A non-significant result in a low-power study tells you almost nothing about whether an effect exists.

Use our Hypothesis Testing Calculator to compute test statistics and p-values for your own analyses.

Limitations of Error and Power Analysis

While understanding Type I/II errors and power is essential, these frameworks have their own constraints:

Key Limitation

Power calculations require you to specify the effect size you want to detect — but in finance, the true effect size is usually unknown. If you assume a larger effect than actually exists, your calculated power will be overstated; the sketch after the list below puts numbers on this.

  • Conventional thresholds are arbitrary — The α = 0.05 and power = 80% conventions have no deep scientific basis. They’re social agreements, not optimal decision rules.
  • Multiple testing corrections can be conservative — Bonferroni correction controls the family-wise error rate regardless of whether tests are independent or correlated. When you test many hypotheses, especially correlated ones, it may be too strict, leading to missed discoveries.
  • Frequentist framework limitations — P-values don’t tell you the probability that H0 is true. They tell you the probability of seeing your data if H0 were true — a subtle but important distinction.
  • Base rates matter — If skilled managers are rare (low prior probability), even a “significant” test result may be more likely to reflect luck than skill. Bayesian approaches can incorporate prior probabilities explicitly.
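
As flagged in the Key Limitation above, here is a sketch (statsmodels assumed installed; effect sizes hypothetical) of how quickly power degrades when the assumed effect size is too optimistic:

```python
# A sketch of power's sensitivity to the assumed effect size: a study sized
# for 80% power at d = 0.5 is badly underpowered if the true effect is half that.
from statsmodels.stats.power import TTestPower

analysis = TTestPower()
n = analysis.solve_power(effect_size=0.5, power=0.80, alpha=0.05)  # ~34 obs

for d in (0.5, 0.35, 0.25):
    achieved = analysis.power(effect_size=d, nobs=n, alpha=0.05)
    print(f"true d = {d:.2f}: actual power = {achieved:.0%}")
# d = 0.5 -> 80% by construction; d = 0.25 -> well under 40%.
```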

For more on the foundations of statistical inference, see our articles on confidence intervals and the central limit theorem.

Frequently Asked Questions

What is the difference between a Type I error and a Type II error?

A Type I error (false positive) occurs when you reject a true null hypothesis — you conclude an effect exists when it doesn’t. A Type II error (false negative) occurs when you fail to reject a false null hypothesis — you miss a real effect. In finance, a Type I error might mean investing with a lucky manager who has no skill, while a Type II error might mean ignoring a profitable strategy because your test couldn’t detect it.

What is statistical power and why does it matter?

Statistical power is the probability that your test will correctly detect a real effect when one exists. Power = 1 – β, where β is the Type II error rate. A power of 80% means there’s an 80% chance of detecting a true effect and a 20% chance of missing it. Low-power studies are unreliable because they frequently fail to detect real effects, leading to false conclusions that “no effect exists.”

How can I avoid p-hacking in my own research?

Pre-register your hypotheses and analysis plan before looking at the data. Limit the number of tests you run and apply corrections like Bonferroni or false discovery rate (FDR) when testing multiple hypotheses. Use out-of-sample validation — hold back a portion of your data to test promising strategies. Report all tests you conducted, not just the significant ones. Be skeptical of results that required extensive data mining to achieve significance.

Why is 80% the conventional target for statistical power?

The 80% power convention (proposed by Jacob Cohen) reflects a pragmatic trade-off. At 80% power, you have a 4:1 ratio of correctly detecting real effects versus missing them. Going higher requires substantially more data — achieving 90% power might require 30% more observations than 80% power. However, 80% is not a scientific law. In high-stakes decisions, you might want 90% or 95% power; for exploratory research, lower power may be acceptable.

Can a test have both a low significance level and high power?

Yes, but it requires either a large sample size or a large effect size. With more data, you can set a stringent significance level (low α) while still maintaining high power to detect real effects. Similarly, if the true effect is large, it’s easier to detect even with strict criteria. The challenge in finance is that effect sizes are often small (e.g., modest factor premiums) and sample sizes are limited (only so many years of data exist), making it difficult to achieve both low α and high power simultaneously.

Disclaimer

This article is for educational and informational purposes only and does not constitute investment advice. Statistical significance does not guarantee future performance, and the examples provided are illustrative only. Always conduct thorough due diligence and consult a qualified financial advisor before making investment decisions.