Hypothesis Testing in Regression: t-Tests, F-Tests & Confidence Intervals
After estimating a regression model, the next critical question is whether the estimated coefficients reflect real relationships in the population or could have arisen by chance. Hypothesis testing in regression provides the formal statistical framework for answering this question. This guide covers the t-test for individual coefficients, p-values and their interpretation, confidence intervals, the F-test for joint significance, and the crucial distinction between statistical and economic significance.
Sampling Distributions of OLS Estimators
Hypothesis testing in regression asks whether the patterns we estimate from sample data reflect real population relationships or could have arisen by sampling variability alone. The answer depends on understanding how OLS estimators behave across hypothetical repeated samples.
Under the classical linear model assumptions (MLR.1 through MLR.6), the OLS estimator β̂j follows a normal distribution centered on the true population parameter βj. The spread of this distribution is measured by the standard error — a smaller standard error means β̂j is estimated more precisely.
The standard error of β̂j measures how much the estimated coefficient would vary across repeated samples drawn from the same population. A small standard error means the estimate is precise; a large standard error means substantial uncertainty remains about the true value of βj.
The six classical assumptions — linearity (MLR.1), random sampling (MLR.2), no perfect collinearity (MLR.3), zero conditional mean of errors (MLR.4), homoskedasticity (MLR.5), and normality of errors (MLR.6) — guarantee that the sampling distributions of OLS estimators are exactly normal. In practice, the normality assumption (MLR.6) is often questionable. Fortunately, the Central Limit Theorem ensures that with sufficiently large samples, the OLS estimators are approximately normally distributed even without assuming normal errors. This property, known as asymptotic normality, means the inference procedures described below remain valid for large samples under the weaker assumptions MLR.1 through MLR.5.
For a detailed treatment of how OLS estimators are derived, see our guides on simple linear regression and multiple regression analysis.
The t-Test for a Single Coefficient
The t-test evaluates whether an individual regression coefficient differs from a hypothesized value. In most applications, the null hypothesis is that the coefficient equals zero — meaning the corresponding variable has no partial effect on the dependent variable, holding other factors fixed.
t = (β̂j − aj) / se(β̂j)

Where:
- β̂j — the OLS estimate of the coefficient on variable xj
- aj — the hypothesized value under H0 (typically zero)
- se(β̂j) — the standard error of β̂j
Under the null hypothesis and the classical assumptions, the t-statistic follows a t-distribution with n − k − 1 degrees of freedom, where n is the sample size and k is the number of independent variables. The degrees of freedom reflect the amount of information available after estimating k + 1 parameters (k slopes plus the intercept). With more degrees of freedom, the t-distribution approaches the standard normal, and critical values become smaller — making it easier to reject the null hypothesis.
| Degrees of Freedom | 5% Critical Value (Two-Sided) | 1% Critical Value (Two-Sided) |
|---|---|---|
| 10 | 2.228 | 3.169 |
| 30 | 2.042 | 2.750 |
| 60 | 2.000 | 2.660 |
| 120 | 1.980 | 2.617 |
| ∞ (Normal) | 1.960 | 2.576 |
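The critical values in the table can be reproduced directly; a minimal sketch, assuming SciPy is installed:

```python
from scipy import stats

# Two-sided critical value: alpha/2 probability in each tail of the t-distribution
def t_critical(alpha, df):
    return stats.t.ppf(1 - alpha / 2, df)

for df in [10, 30, 60, 120]:
    print(df, round(t_critical(0.05, df), 3), round(t_critical(0.01, df), 3))

# As df grows, the t critical value converges to the standard normal value
print(round(stats.norm.ppf(0.975), 3))  # ≈ 1.960
```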
One-Sided vs. Two-Sided Tests
A two-sided test (H1: βj ≠ 0) rejects the null when the absolute value of the t-statistic exceeds the critical value. This is the default when you have no prior expectation about the sign of the effect.
A one-sided test (e.g., H1: βj > 0) rejects only when the t-statistic exceeds the critical value in the predicted direction. One-sided tests are appropriate when theory strongly predicts the sign — for example, testing whether a fund’s alpha is positive.
Most statistical software reports t-statistics assuming H0: βj = 0. To test a different hypothesized value (such as βj = 1), you must compute the t-statistic manually using t = (β̂j − 1) / se(β̂j).
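A minimal sketch of that manual calculation (the coefficient estimate and standard error below are hypothetical):

```python
from scipy import stats

beta_hat = 1.12   # hypothetical coefficient estimate
se = 0.05         # hypothetical standard error
df = 58           # degrees of freedom, n - k - 1

# Test H0: beta = 1 instead of the software default H0: beta = 0
t_stat = (beta_hat - 1) / se
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-sided p-value
print(round(t_stat, 2), round(p_value, 4))
```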
p-Values in Regression: What They Mean and What They Do Not
The p-value is the probability of observing a test statistic as extreme as — or more extreme than — the one actually computed, assuming the null hypothesis is true. For a two-sided test:

p-value = 2 · P(T > |t|), where T follows the tn−k−1 distribution
The p-value tells you the smallest significance level at which you would reject the null hypothesis. If p = 0.023, you reject at the 5% level but not at the 1% level. The smaller the p-value, the stronger the evidence against H0.
A p-value does NOT tell you the probability that the null hypothesis is true. It tells you how likely the observed data would be if the null were true. These are fundamentally different statements. Conflating them is one of the most common errors in applied regression analysis.
The conventional threshold of p < 0.05 is widely used but arbitrary. It was never intended as a rigid decision rule. Researchers who test many specifications and report only the significant results — a practice known as p-hacking or data mining — can produce spurious findings that do not replicate. Always interpret p-values in context, alongside effect sizes and confidence intervals.
Confidence Intervals in Regression
A confidence interval provides a range of plausible values for the true population coefficient, given the data. It conveys more information than a hypothesis test alone because it shows both the direction and the precision of the estimate.
β̂j ± c · se(β̂j)

Where c is the critical value from the tn−k−1 distribution at the chosen confidence level. For a 95% confidence interval with large degrees of freedom, c ≈ 1.96.
A 95% confidence interval means that if you repeated the sampling procedure many times, approximately 95% of the resulting intervals would contain the true parameter βj. This is not the same as saying there is a 95% probability that βj lies in any one particular interval.
Confidence intervals and hypothesis tests are directly linked: if the hypothesized value aj falls outside the 95% confidence interval, the t-test rejects H0: βj = aj at the 5% significance level. If aj falls inside the interval, you fail to reject.
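This duality can be verified numerically; the estimate and standard error below are hypothetical:

```python
from scipy import stats

beta_hat, se, df = 0.85, 0.10, 116   # hypothetical estimate and standard error
c = stats.t.ppf(0.975, df)           # 95% two-sided critical value

lo, hi = beta_hat - c * se, beta_hat + c * se

# H0: beta = a is rejected at the 5% level exactly when a falls outside [lo, hi]
for a in [0.0, 1.0]:
    reject = abs((beta_hat - a) / se) > c
    outside = not (lo <= a <= hi)
    print(a, reject, outside, reject == outside)
```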
Testing a Mutual Fund’s Alpha: t-Test in Practice
Suppose you regress Fidelity Contrafund’s (FCNTX) monthly excess returns on the market excess return using 60 months of data to test whether the fund generates risk-adjusted performance that is significantly different from zero — that is, whether its Jensen’s alpha differs from zero.
| Parameter | Value |
|---|---|
| Estimated alpha (α̂) | 0.42% per month |
| Standard error of α̂ | 0.18% |
| Null hypothesis | H0: α = 0 (no abnormal return) |
| Sample size (n) | 60 months |
| Degrees of freedom | n − k − 1 = 60 − 1 − 1 = 58 |
Step 1: Compute the t-statistic:
t = 0.42 / 0.18 = 2.33
Step 2: Compare to the critical value. For a two-sided test at the 5% level with 58 degrees of freedom, c ≈ 2.00.
Step 3: Since |2.33| > 2.00, reject H0. The p-value is approximately 0.023, confirming rejection at the 5% level.
Step 4: Construct the 95% confidence interval:
CI = 0.42 ± 2.00 × 0.18 = [0.06%, 0.78%] per month
Contrafund’s estimated alpha is statistically significant at the 5% level. The confidence interval suggests the true monthly alpha lies between 0.06% and 0.78%, which translates to roughly 0.7% to 9.4% annualized — a meaningful range for evaluating whether the fund’s active management adds value.
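The four steps can be replicated with SciPy, using the figures from the table above:

```python
from scipy import stats

alpha_hat, se, df = 0.42, 0.18, 58  # monthly alpha estimate and standard error

t_stat = alpha_hat / se                        # Step 1: t-statistic
c = stats.t.ppf(0.975, df)                     # Step 2: 5% two-sided critical value
p_value = 2 * stats.t.sf(abs(t_stat), df)      # Step 3: two-sided p-value
ci = (alpha_hat - c * se, alpha_hat + c * se)  # Step 4: 95% confidence interval

print(round(t_stat, 2), round(c, 2), round(p_value, 3),
      (round(ci[0], 2), round(ci[1], 2)))
```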
The F-Test for Multiple Restrictions
While the t-test evaluates one coefficient at a time, many research questions require testing whether a group of variables is jointly significant. Individual t-tests cannot answer this question because they each test a different null hypothesis. Multicollinearity among the tested variables can make every individual t-test insignificant while the group as a whole has substantial explanatory power.
The F-test compares a restricted model (with the restrictions imposed) to an unrestricted model (without restrictions) and asks whether the restrictions cause a statistically significant loss in explanatory power.
F = [(SSRr − SSRur) / q] / [SSRur / (n − k − 1)]

or, equivalently, in the R2 form:

F = [(R2ur − R2r) / q] / [(1 − R2ur) / (n − k − 1)]

Where:
- SSRr, SSRur — sum of squared residuals from the restricted and unrestricted models
- R2ur, R2r — R-squared values from the unrestricted and restricted models
- q — number of restrictions (variables being tested)
- n − k − 1 — degrees of freedom in the unrestricted model
Under the null hypothesis, the F-statistic follows an F-distribution with q and n − k − 1 degrees of freedom. Reject H0 when F exceeds the critical value.
The R2 form is a convenient shortcut, but it applies only to exclusion restrictions in nested models — that is, when the restricted model is obtained by dropping variables from the unrestricted model. For more general linear restrictions, use the SSR form.
A special case of the F-test is the overall significance test, which tests H0: β1 = β2 = … = βk = 0 — that is, whether all slope coefficients are jointly zero (the intercept is not included in this null). This test asks whether the regression model as a whole explains any variation in the dependent variable. Most software reports this F-statistic automatically.
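As a quick illustration with hypothetical figures (R2 = 0.40, k = 3, n = 100), the overall F-statistic and its p-value are:

```python
from scipy import stats

r2, k, n = 0.40, 3, 100   # hypothetical regression output
df = n - k - 1

# Overall significance: the restricted model has no slopes, so its R-squared is zero
F = (r2 / k) / ((1 - r2) / df)
p_value = stats.f.sf(F, k, df)
print(round(F, 2), p_value)
```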
Joint Significance of Factor Loadings: F-Test in Practice
Suppose you estimate a Fama-French three-factor model for the Invesco QQQ Trust (QQQ), a technology-heavy ETF tracking the Nasdaq-100, using 120 monthly observations. The three-factor model regresses QQQ’s excess returns on the market factor (MKT), the size factor (SMB), and the value factor (HML). You want to test whether SMB and HML jointly contribute explanatory power beyond the market factor alone.
| Model | R2 | Variables |
|---|---|---|
| Unrestricted (3-factor) | 0.88 | MKT, SMB, HML |
| Restricted (market-only) | 0.82 | MKT |
Null hypothesis: H0: βSMB = βHML = 0 (SMB and HML have no joint effect)
Number of restrictions: q = 2
Degrees of freedom: n − k − 1 = 120 − 3 − 1 = 116
F-statistic (R2 form):
F = [(0.88 − 0.82) / 2] / [(1 − 0.88) / 116] = 0.03 / 0.001034 = 29.01
The critical value of F2,116 at the 5% level is approximately 3.07. Since 29.01 far exceeds 3.07, we reject H0. The size and value factors are jointly significant — they contribute meaningful explanatory power beyond the market factor alone.
Note that individual t-tests on SMB and HML might tell a different story if the two factors are correlated. The F-test properly accounts for the joint contribution of both variables.
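A short sketch reproducing the calculation, assuming SciPy:

```python
from scipy import stats

r2_ur, r2_r = 0.88, 0.82   # R-squared values from the table above
q, df = 2, 116             # number of restrictions; unrestricted df

# R-squared form of the F-statistic (valid because the models are nested)
F = ((r2_ur - r2_r) / q) / ((1 - r2_ur) / df)

f_crit = stats.f.ppf(0.95, q, df)   # 5% critical value of F(2, 116)
p_value = stats.f.sf(F, q, df)
print(round(F, 2), round(f_crit, 2), p_value < 0.05)
```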
Economic Significance vs. Statistical Significance
A regression coefficient can be statistically significant yet economically trivial — or economically meaningful yet statistically insignificant. Responsible regression analysis always considers both dimensions.
Statistical Significance
- Tests whether the coefficient differs from zero
- Driven by sample size, standard error, and significance level
- Large samples can make tiny effects significant
- Captured by t-statistics and p-values
- Best for: determining whether an effect exists
Economic Significance
- Asks whether the coefficient is large enough to matter
- Requires domain knowledge and judgment
- Unaffected by sample size
- Assessed through effect sizes and confidence intervals
- Best for: deciding whether to act on the finding
Consider two results from a factor model estimated on 10,000 daily observations:
| Coefficient | Estimate | p-Value | Statistically Significant? | Economically Significant? |
|---|---|---|---|---|
| SMB loading | 0.02 | < 0.001 | Yes | No — adds ~2 bps/month, negligible after costs |
| Market beta (H0: β = 1) | 1.45 | 0.06 | No (at 5% level) | Yes — 45% above-market sensitivity matters for risk |
The large sample size drives the tiny SMB loading to statistical significance, while the imprecisely estimated but large market beta just misses the conventional threshold. A portfolio manager would care far more about the market beta result despite its higher p-value.
Always evaluate the magnitude of a coefficient alongside its p-value. Ask: “Is this effect large enough to change an investment decision?” If the answer is no, statistical significance alone does not make the finding practically important.
Limitations and Assumptions
The t-tests, F-tests, and confidence intervals described above are valid only when the underlying assumptions hold. Violations can lead to misleading inference.
Regression inference relies on the classical assumptions MLR.1 through MLR.6 for exact results, or MLR.1 through MLR.5 with large samples for asymptotically valid results. If errors are heteroskedastic (non-constant variance), the usual standard errors are invalid and hypothesis tests can be severely distorted.
When heteroskedasticity is present, the solution is to use robust standard errors (also called heteroskedasticity-robust or White standard errors), which remain valid regardless of the error variance structure. For a detailed treatment, see our guide on heteroskedasticity in regression.
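A minimal sketch of HC0 (White) robust standard errors built from the sandwich formula, assuming only NumPy; the simulated data are illustrative:

```python
import numpy as np

def robust_se(X, resid):
    """HC0 (White) robust standard errors via the sandwich estimator:
    (X'X)^{-1} [X' diag(e_i^2) X] (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = X.T @ (X * (resid ** 2)[:, None])
    cov = XtX_inv @ meat @ XtX_inv
    return np.sqrt(np.diag(cov))

# Illustrative data with heteroskedastic errors (variance grows with |x|)
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1 + 0.5 * x + rng.normal(size=n) * (1 + np.abs(x))

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
print(robust_se(X, resid))
```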
If errors are non-normal and the sample size is small, the t and F distributions are only approximate. In such cases, interpret borderline results with caution. With large samples, the Central Limit Theorem ensures that inference remains approximately valid even without the normality assumption.
How to Evaluate Regression Hypothesis Tests
Follow this systematic workflow when conducting or evaluating hypothesis tests in regression:
- State the hypotheses: Define H0 and H1 clearly. For individual coefficients, this is typically H0: βj = 0. For joint tests, specify which coefficients are being tested together.
- Choose the significance level (α): Select α before examining results — typically 5% or 1%. This prevents post-hoc rationalization of borderline results.
- Estimate the model: Run OLS on the multiple regression specification. For F-tests, estimate both the restricted and unrestricted models.
- Compute the test statistic: Calculate the t-statistic or F-statistic from the regression output.
- Compare to the critical value or compute the p-value: Reject H0 if the test statistic exceeds the critical value, or equivalently, if p < α.
- Interpret in economic context: Assess whether the coefficient’s magnitude is large enough to be practically meaningful. Consider confidence intervals alongside p-values. See our guide on regression functional forms for how different model specifications affect coefficient interpretation.
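The workflow above can be sketched end to end on simulated data using only NumPy and SciPy (the variable names and data-generating process are illustrative):

```python
import numpy as np
from scipy import stats

# Steps 1-2: hypotheses H0: beta_j = 0 for each slope; alpha = 0.05 chosen up front
rng = np.random.default_rng(0)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + 2 regressors
beta_true = np.array([1.0, 0.5, 0.0])                       # second slope is truly zero
y = X @ beta_true + rng.normal(size=n)

# Step 3: estimate by OLS, beta_hat = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Step 4: t-statistics from the residual variance
resid = y - X @ beta_hat
df = n - k - 1
sigma2 = resid @ resid / df
se = np.sqrt(sigma2 * np.diag(XtX_inv))
t_stats = beta_hat / se

# Step 5: two-sided p-values and 95% confidence intervals
p_vals = 2 * stats.t.sf(np.abs(t_stats), df)
c = stats.t.ppf(0.975, df)
ci = np.column_stack([beta_hat - c * se, beta_hat + c * se])

# Step 6: inspect magnitudes alongside p-values before drawing conclusions
print(np.round(beta_hat, 3), np.round(p_vals, 4))
```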
Common Mistakes
1. Confusing “fail to reject” with “accept the null.” Failing to reject H0: βj = 0 does not prove the coefficient is zero. It means the data lack sufficient evidence to conclude otherwise. The coefficient may be nonzero but the sample too small or too noisy to detect it. Always say “fail to reject,” never “accept.”
2. Ignoring economic significance when p < 0.05. A statistically significant coefficient can be economically trivial. With 10,000 daily observations, a market beta of 0.003 might have p < 0.01, but 0.3% sensitivity to the market is meaningless for any investment decision. Always examine the coefficient’s magnitude and practical implications.
3. Using individual t-tests to evaluate joint significance. Individual t-tests each evaluate whether a single coefficient is zero. They do not test whether a group of coefficients is jointly zero. Multicollinearity can make every individual t-test insignificant while the joint F-test strongly rejects. For joint hypotheses, always use the F-test.
4. Reporting p-values without confidence intervals. A p-value indicates only whether to reject or fail to reject — it reveals nothing about the magnitude or precision of the estimate. Confidence intervals show the range of plausible values for the true coefficient, which is far more informative for decision-making.
5. Over-relying on the 5% significance threshold. The 0.05 cutoff is a convention, not a law of nature. A result with p = 0.06 is not fundamentally different from p = 0.04. Report exact p-values so readers can apply their own judgment, and interpret borderline results cautiously rather than mechanically.
Disclaimer
This article is for educational and informational purposes only and does not constitute investment advice. The numerical examples are illustrative and based on hypothetical regression output. Statistical significance does not imply economic significance or guarantee future results. Always conduct your own analysis and consult a qualified financial advisor before making investment decisions.