After estimating a regression model, the next critical question is whether the estimated coefficients reflect real relationships in the population or could have arisen by chance. Hypothesis testing in regression provides the formal statistical framework for answering this question. This guide covers the t-test for individual coefficients, p-values and their interpretation, confidence intervals, the F-test for joint significance, and the crucial distinction between statistical and economic significance.

Sampling Distributions of OLS Estimators

Hypothesis testing in regression asks whether the patterns we estimate from sample data reflect real population relationships or could have arisen by sampling variability alone. The answer depends on understanding how OLS estimators behave across hypothetical repeated samples.

Under the classical linear model assumptions (MLR.1 through MLR.6), the OLS estimator β̂j follows a normal distribution centered on the true population parameter βj. The spread of this distribution is measured by the standard error — a smaller standard error means β̂j is estimated more precisely.

Key Concept

The standard error of β̂j measures how much the estimated coefficient would vary across repeated samples drawn from the same population. A small standard error means the estimate is precise; a large standard error means substantial uncertainty remains about the true value of βj.

The six classical assumptions — linearity (MLR.1), random sampling (MLR.2), no perfect collinearity (MLR.3), zero conditional mean of errors (MLR.4), homoskedasticity (MLR.5), and normality of errors (MLR.6) — guarantee that the sampling distributions of OLS estimators are exactly normal. In practice, the normality assumption (MLR.6) is often questionable. Fortunately, the Central Limit Theorem ensures that with sufficiently large samples, the OLS estimators are approximately normally distributed even without assuming normal errors. This property, known as asymptotic normality, means the inference procedures described below remain valid for large samples under the weaker assumptions MLR.1 through MLR.5.

For a detailed treatment of how OLS estimators are derived, see our guides on simple linear regression and multiple regression analysis.

The t-Test for a Single Coefficient

The t-test evaluates whether an individual regression coefficient differs from a hypothesized value. In most applications, the null hypothesis is that the coefficient equals zero — meaning the corresponding variable has no partial effect on the dependent variable, holding other factors fixed.

t-Statistic
t = (β̂j − aj) / se(β̂j)
The estimated coefficient minus the hypothesized value, divided by its standard error

Where:

  • β̂j — the OLS estimate of the coefficient on variable xj
  • aj — the hypothesized value under H0 (typically zero)
  • se(β̂j) — the standard error of β̂j

Under the null hypothesis and the classical assumptions, the t-statistic follows a t-distribution with n − k − 1 degrees of freedom, where n is the sample size and k is the number of independent variables. The degrees of freedom reflect the amount of information available after estimating k + 1 parameters (k slopes plus the intercept). With more degrees of freedom, the t-distribution approaches the standard normal, and critical values become smaller — making it easier to reject the null hypothesis.

Degrees of Freedom | 5% Critical Value (Two-Sided) | 1% Critical Value (Two-Sided)
10                 | 2.228                         | 3.169
30                 | 2.042                         | 2.750
60                 | 2.000                         | 2.660
120                | 1.980                         | 2.617
∞ (Normal)         | 1.960                         | 2.576
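The critical values above can be reproduced directly from the t-distribution. The sketch below uses SciPy; for a two-sided test at level α, the critical value is the 1 − α/2 quantile of the t-distribution with the appropriate degrees of freedom.

```python
# Reproducing the two-sided t critical values in the table above.
from scipy import stats

def two_sided_critical_value(alpha: float, df: float) -> float:
    """Critical value c such that P(|T| > c) = alpha when T ~ t_df."""
    return stats.t.ppf(1 - alpha / 2, df)

for df in (10, 30, 60, 120):
    c5 = two_sided_critical_value(0.05, df)
    c1 = two_sided_critical_value(0.01, df)
    print(f"df={df:>3}: 5% -> {c5:.3f}, 1% -> {c1:.3f}")

# The normal limit as degrees of freedom grow without bound
print(f"normal: 5% -> {stats.norm.ppf(0.975):.3f}, 1% -> {stats.norm.ppf(0.995):.3f}")
```

Note how the 5% critical value shrinks from 2.228 at 10 degrees of freedom toward the normal value of 1.960, matching the table.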

One-Sided vs. Two-Sided Tests

A two-sided test (H1: βj ≠ 0) rejects the null when the absolute value of the t-statistic exceeds the critical value. This is the default when you have no prior expectation about the sign of the effect.

A one-sided test (e.g., H1: βj > 0) rejects only when the t-statistic exceeds the critical value in the predicted direction. One-sided tests are appropriate when theory strongly predicts the sign — for example, testing whether a fund’s alpha is positive.

Pro Tip

Most statistical software reports t-statistics assuming H0: βj = 0. To test a different hypothesized value (such as βj = 1), you must compute the t-statistic manually using t = (β̂j − 1) / se(β̂j).
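The manual computation is a one-liner. The sketch below tests a hypothetical coefficient against H0: βj = 1 rather than zero; the estimate, standard error, and degrees of freedom are illustrative values, not output from any real regression.

```python
# Sketch: testing H0: beta_j = 1 by hand (hypothetical numbers).
from scipy import stats

beta_hat = 1.08   # hypothetical estimated coefficient
se = 0.05         # hypothetical standard error
hypothesized = 1.0
df = 58           # n - k - 1 for a hypothetical sample

t_stat = (beta_hat - hypothesized) / se      # (1.08 - 1.00) / 0.05 = 1.6
p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-sided p-value
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

Here the coefficient is far from zero but not statistically distinguishable from one, which is exactly the case the software's default output would miss.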

p-Values in Regression: What They Mean and What They Do Not

The p-value is the probability of observing a test statistic as extreme as — or more extreme than — the one actually computed, assuming the null hypothesis is true. For a two-sided test:

p-Value (Two-Sided)
p = P(|T| > |tobs|)
The probability of obtaining a t-statistic at least as large in absolute value as the observed value, under H0

The p-value tells you the smallest significance level at which you would reject the null hypothesis. If p = 0.023, you reject at the 5% level but not at the 1% level. The smaller the p-value, the stronger the evidence against H0.
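Converting an observed t-statistic into a two-sided p-value uses the survival function of the t-distribution. The observed statistic and degrees of freedom below are hypothetical.

```python
# Sketch: two-sided p-value from an observed t-statistic.
from scipy import stats

t_obs = 2.33   # hypothetical observed t-statistic
df = 58

p = 2 * stats.t.sf(abs(t_obs), df)   # P(|T| > |t_obs|) under H0
reject_5 = p < 0.05
reject_1 = p < 0.01
print(f"p = {p:.3f}; reject at 5%? {reject_5}; reject at 1%? {reject_1}")
```

This reproduces the pattern described above: a p-value near 0.023 rejects at the 5% level but not at the 1% level.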

Common Misconception

A p-value does NOT tell you the probability that the null hypothesis is true. It tells you how likely the observed data would be if the null were true. These are fundamentally different statements. Conflating them is one of the most common errors in applied regression analysis.

The conventional threshold of p < 0.05 is widely used but arbitrary. It was never intended as a rigid decision rule. Researchers who test many specifications and report only the significant results — a practice known as p-hacking or data mining — can produce spurious findings that do not replicate. Always interpret p-values in context, alongside effect sizes and confidence intervals.

Confidence Intervals in Regression

A confidence interval provides a range of plausible values for the true population coefficient, given the data. It conveys more information than a hypothesis test alone because it shows both the direction and the precision of the estimate.

Confidence Interval
CI = β̂j ± c × se(β̂j)
The estimated coefficient plus or minus the critical value times its standard error

Where c is the critical value from the tn−k−1 distribution at the chosen confidence level. For a 95% confidence interval with large degrees of freedom, c ≈ 1.96.

A 95% confidence interval means that if you repeated the sampling procedure many times, approximately 95% of the resulting intervals would contain the true parameter βj. This is not the same as saying there is a 95% probability that βj lies in any one particular interval.

Confidence intervals and hypothesis tests are directly linked: if the hypothesized value aj falls outside the 95% confidence interval, the t-test rejects H0: βj = aj at the 5% significance level. If aj falls inside the interval, you fail to reject.
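The duality between the interval and the test can be checked numerically. The sketch below builds a 95% interval from a hypothetical estimate and verifies that hypothesized values outside the interval are exactly the ones the t-test rejects at the 5% level.

```python
# Sketch of the CI / t-test duality (hypothetical estimate and SE).
from scipy import stats

beta_hat, se, df = 0.5, 0.2, 40   # hypothetical estimate, SE, degrees of freedom
c = stats.t.ppf(0.975, df)        # two-sided 5% critical value

lower, upper = beta_hat - c * se, beta_hat + c * se

def rejects(a_j: float) -> bool:
    """Two-sided t-test of H0: beta_j = a_j at the 5% level."""
    return abs((beta_hat - a_j) / se) > c

# Values just outside the interval are rejected; values inside are not.
assert rejects(lower - 0.01) and rejects(upper + 0.01)
assert not rejects(beta_hat)
print(f"95% CI = [{lower:.3f}, {upper:.3f}]")
```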

Testing a Mutual Fund’s Alpha: t-Test in Practice

Suppose you regress Fidelity Contrafund’s (FCNTX) monthly excess returns on the market excess return using 60 months of data to test whether the fund generates risk-adjusted performance that is significantly different from zero — that is, whether its Jensen’s alpha differs from zero.

t-Test Example: Fidelity Contrafund (FCNTX) Alpha
Parameter            | Value
Estimated alpha (α̂)  | 0.42% per month
Standard error of α̂  | 0.18%
Null hypothesis      | H0: α = 0 (no abnormal return)
Sample size (n)      | 60 months
Degrees of freedom   | n − k − 1 = 60 − 1 − 1 = 58

Step 1: Compute the t-statistic:

t = 0.42 / 0.18 = 2.33

Step 2: Compare to the critical value. For a two-sided test at the 5% level with 58 degrees of freedom, c ≈ 2.00.

Step 3: Since |2.33| > 2.00, reject H0. The p-value is approximately 0.023, confirming rejection at the 5% level.

Step 4: Construct the 95% confidence interval:

CI = 0.42 ± 2.00 × 0.18 = [0.06%, 0.78%] per month

Contrafund’s estimated alpha is statistically significant at the 5% level. The confidence interval suggests the true monthly alpha lies between 0.06% and 0.78%, which translates to roughly 0.7% to 9.4% annualized — a meaningful range for evaluating whether the fund’s active management adds value.
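The four steps above can be reproduced in a few lines. The figures are the article's illustrative values, not live fund data.

```python
# Reproducing the Contrafund alpha example (illustrative figures).
from scipy import stats

alpha_hat = 0.42   # estimated alpha, % per month
se = 0.18          # standard error, % per month
df = 58            # 60 - 1 - 1

t_stat = alpha_hat / se                       # Step 1: t-statistic under H0: alpha = 0
c = stats.t.ppf(0.975, df)                    # Step 2: 5% two-sided critical value
p_value = 2 * stats.t.sf(abs(t_stat), df)     # Step 3: two-sided p-value
ci = (alpha_hat - c * se, alpha_hat + c * se) # Step 4: 95% confidence interval

print(f"t = {t_stat:.2f}, critical value = {c:.2f}, p = {p_value:.3f}")
print(f"95% CI = [{ci[0]:.2f}%, {ci[1]:.2f}%] per month")
```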

The F-Test for Multiple Restrictions

While the t-test evaluates one coefficient at a time, many research questions require testing whether a group of variables is jointly significant. Individual t-tests cannot answer this question because they each test a different null hypothesis. Multicollinearity among the tested variables can make every individual t-test insignificant while the group as a whole has substantial explanatory power.

The F-test compares a restricted model (with the restrictions imposed) to an unrestricted model (without restrictions) and asks whether the restrictions cause a statistically significant loss in explanatory power.

F-Statistic (General Form)
F = [(SSRr − SSRur) / q] / [SSRur / (n − k − 1)]
The increase in sum of squared residuals from imposing q restrictions, scaled by the unrestricted error variance
F-Statistic (R2 Form — Exclusion Restrictions Only)
F = [(R2ur − R2r) / q] / [(1 − R2ur) / (n − k − 1)]
The gain in R-squared from including the tested variables, relative to the unexplained variation

Where:

  • SSRr, SSRur — sum of squared residuals from the restricted and unrestricted models
  • R2ur, R2r — R-squared values from the unrestricted and restricted models
  • q — number of restrictions (variables being tested)
  • n − k − 1 — degrees of freedom in the unrestricted model

Under the null hypothesis, the F-statistic follows an F-distribution with q and n − k − 1 degrees of freedom. Reject H0 when F exceeds the critical value.

Pro Tip

The R2 form is a convenient shortcut, but it applies only to exclusion restrictions in nested models — that is, when the restricted model is obtained by dropping variables from the unrestricted model. For more general linear restrictions, use the SSR form.

A special case of the F-test is the overall significance test, which tests H0: β1 = β2 = … = βk = 0 — that is, whether all slope coefficients are jointly zero (the intercept is not included in this null). This test asks whether the regression model as a whole explains any variation in the dependent variable. Most software reports this F-statistic automatically.
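Because the restricted model in the overall significance test is intercept-only (so its R-squared is zero), the R2 form collapses to F = (R2/k) / [(1 − R2)/(n − k − 1)]. The sketch below evaluates it for hypothetical values.

```python
# Sketch: overall significance F-test as a special case of the R^2 form,
# with the restricted model's R^2 equal to zero (hypothetical values).
from scipy import stats

r2 = 0.40        # hypothetical R^2 of the full regression
n, k = 100, 3    # hypothetical sample size and number of regressors

F = (r2 / k) / ((1 - r2) / (n - k - 1))
p_value = stats.f.sf(F, k, n - k - 1)
print(f"F({k}, {n - k - 1}) = {F:.2f}, p = {p_value:.2e}")
```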

Joint Significance of Factor Loadings: F-Test in Practice

Suppose you estimate a Fama-French three-factor model for the Invesco QQQ Trust (QQQ), a technology-heavy ETF tracking the Nasdaq-100, using 120 monthly observations. The three-factor model regresses QQQ’s excess returns on the market factor (MKT), the size factor (SMB), and the value factor (HML). You want to test whether SMB and HML jointly contribute explanatory power beyond the market factor alone.

F-Test Example: Invesco QQQ (QQQ) Factor Loadings
Model                    | R2   | Variables
Unrestricted (3-factor)  | 0.88 | MKT, SMB, HML
Restricted (market-only) | 0.82 | MKT

Null hypothesis: H0: βSMB = βHML = 0 (SMB and HML have no joint effect)

Number of restrictions: q = 2

Degrees of freedom: n − k − 1 = 120 − 3 − 1 = 116

F-statistic (R2 form):

F = [(0.88 − 0.82) / 2] / [(1 − 0.88) / 116] = 0.03 / 0.0010345 ≈ 29.0

The critical value of F2,116 at the 5% level is approximately 3.07. Since 29.0 far exceeds 3.07, we reject H0. The size and value factors are jointly significant — they contribute meaningful explanatory power beyond the market factor alone.

Note that individual t-tests on SMB and HML might tell a different story if the two factors are correlated. The F-test properly accounts for the joint contribution of both variables.
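The F-statistic and its critical value can be verified directly from the R-squared figures in the table above.

```python
# Reproducing the QQQ joint-significance F-test (illustrative R^2 values).
from scipy import stats

r2_ur, r2_r = 0.88, 0.82   # unrestricted and restricted R^2
n, k, q = 120, 3, 2        # sample size, regressors, restrictions
df = n - k - 1             # 116

F = ((r2_ur - r2_r) / q) / ((1 - r2_ur) / df)   # R^2 form
crit = stats.f.ppf(0.95, q, df)                 # 5% critical value of F(2, 116)
print(f"F = {F:.1f}, critical value = {crit:.2f}, reject H0: {F > crit}")
```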

Economic Significance vs. Statistical Significance

A regression coefficient can be statistically significant yet economically trivial — or economically meaningful yet statistically insignificant. Responsible regression analysis always considers both dimensions.

Statistical Significance

  • Tests whether the coefficient differs from zero
  • Driven by sample size, standard error, and significance level
  • Large samples can make tiny effects significant
  • Captured by t-statistics and p-values
  • Best for: determining whether an effect exists

Economic Significance

  • Asks whether the coefficient is large enough to matter
  • Requires domain knowledge and judgment
  • Unaffected by sample size
  • Assessed through effect sizes and confidence intervals
  • Best for: deciding whether to act on the finding

Consider two results from a factor model estimated on 10,000 daily observations:

Statistical vs. Economic Significance
Coefficient | Estimate | p-Value | Statistically Significant? | Economically Significant?
SMB loading | 0.02     | < 0.001 | Yes                        | No — adds ~2 bps/month, negligible after costs
Market beta | 1.45     | 0.06    | No (at 5% level)           | Yes — 45% above-market sensitivity matters for risk

The large sample size drives the tiny SMB loading to statistical significance, while the imprecisely estimated but large market beta just misses the conventional threshold. A portfolio manager would care far more about the market beta result despite its higher p-value.

Always evaluate the magnitude of a coefficient alongside its p-value. Ask: “Is this effect large enough to change an investment decision?” If the answer is no, statistical significance alone does not make the finding practically important.

Limitations and Assumptions

The t-tests, F-tests, and confidence intervals described above are valid only when the underlying assumptions hold. Violations can lead to misleading inference.

Key Assumptions

Regression inference relies on the classical assumptions MLR.1 through MLR.6 for exact results, or MLR.1 through MLR.5 with large samples for asymptotically valid results. If errors are heteroskedastic (non-constant variance), the usual standard errors are invalid and hypothesis tests can be severely distorted.

When heteroskedasticity is present, the solution is to use robust standard errors (also called heteroskedasticity-robust or White standard errors), which remain valid regardless of the error variance structure. For a detailed treatment, see our guide on heteroskedasticity in regression.
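The mechanics of robust standard errors can be illustrated without any regression library. The sketch below simulates data with errors whose variance grows with the regressor (a hypothetical setup), then computes both the usual OLS standard errors and the White (HC0) sandwich estimator by hand.

```python
# Sketch: White (HC0) robust standard errors vs. the usual OLS formula,
# on simulated heteroskedastic data (hypothetical data-generating process).
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
u = rng.normal(size=n) * (1 + np.abs(x))   # error variance grows with |x|
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Usual OLS standard errors (assume constant error variance)
sigma2 = resid @ resid / (n - 2)
se_ols = np.sqrt(np.diag(sigma2 * XtX_inv))

# White (HC0) sandwich estimator: (X'X)^-1 X' diag(u^2) X (X'X)^-1
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("OLS SEs:   ", se_ols)
print("Robust SEs:", se_robust)
```

With this data-generating process the conventional formula understates the slope's uncertainty, so the robust standard error comes out larger — exactly the distortion the text warns about.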

If errors are non-normal and the sample size is small, the t and F distributions are only approximate. In such cases, interpret borderline results with caution. With large samples, the Central Limit Theorem ensures that inference remains approximately valid even without the normality assumption.

How to Evaluate Regression Hypothesis Tests

Follow this systematic workflow when conducting or evaluating hypothesis tests in regression:

  1. State the hypotheses: Define H0 and H1 clearly. For individual coefficients, this is typically H0: βj = 0. For joint tests, specify which coefficients are being tested together.
  2. Choose the significance level (α): Select α before examining results — typically 5% or 1%. This prevents post-hoc rationalization of borderline results.
  3. Estimate the model: Run OLS on the multiple regression specification. For F-tests, estimate both the restricted and unrestricted models.
  4. Compute the test statistic: Calculate the t-statistic or F-statistic from the regression output.
  5. Compare to the critical value or compute the p-value: Reject H0 if the test statistic exceeds the critical value, or equivalently, if p < α.
  6. Interpret in economic context: Assess whether the coefficient’s magnitude is large enough to be practically meaningful. Consider confidence intervals alongside p-values. See our guide on regression functional forms for how different model specifications affect coefficient interpretation.

Common Mistakes

1. Confusing “fail to reject” with “accept the null.” Failing to reject H0: βj = 0 does not prove the coefficient is zero. It means the data lack sufficient evidence to conclude otherwise. The coefficient may be nonzero but the sample too small or too noisy to detect it. Always say “fail to reject,” never “accept.”

2. Ignoring economic significance when p < 0.05. A statistically significant coefficient can be economically trivial. With 10,000 daily observations, a market beta of 0.003 might have p < 0.01, but 0.3% sensitivity to the market is meaningless for any investment decision. Always examine the coefficient’s magnitude and practical implications.

3. Using individual t-tests to evaluate joint significance. Individual t-tests each evaluate whether a single coefficient is zero. They do not test whether a group of coefficients is jointly zero. Multicollinearity can make every individual t-test insignificant while the joint F-test strongly rejects. For joint hypotheses, always use the F-test.

4. Reporting p-values without confidence intervals. A p-value indicates only whether to reject or fail to reject — it reveals nothing about the magnitude or precision of the estimate. Confidence intervals show the range of plausible values for the true coefficient, which is far more informative for decision-making.

5. Over-relying on the 5% significance threshold. The 0.05 cutoff is a convention, not a law of nature. A result with p = 0.06 is not fundamentally different from p = 0.04. Report exact p-values so readers can apply their own judgment, and interpret borderline results cautiously rather than mechanically.

Frequently Asked Questions

What is the difference between the t-test and the F-test in regression?

The t-test evaluates whether a single regression coefficient is statistically different from a hypothesized value (usually zero). The F-test evaluates whether a group of coefficients is jointly significant — that is, whether they collectively contribute explanatory power to the model. Use the t-test for individual coefficient hypotheses and the F-test for joint restrictions. When testing a single coefficient, the F-test with one restriction is mathematically equivalent to the squared t-test: F = t2.
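The F = t2 equivalence for a single restriction reflects a distributional identity: if T follows a t-distribution with df degrees of freedom, then T2 follows F(1, df). The check below confirms the critical values line up.

```python
# Checking the F = t^2 equivalence for a single restriction:
# the squared two-sided t critical value equals the F(1, df) critical value.
from scipy import stats

df = 58
t_crit = stats.t.ppf(0.975, df)     # two-sided 5% t critical value
f_crit = stats.f.ppf(0.95, 1, df)   # 5% F critical value with q = 1

print(f"t_crit^2 = {t_crit**2:.4f}, f_crit = {f_crit:.4f}")
```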

What does a p-value of 0.05 mean in regression?

A p-value of 0.05 means there is a 5% probability of observing a test statistic as extreme as (or more extreme than) the one calculated, assuming the null hypothesis is true. It does not mean there is a 5% chance the coefficient is zero. It represents the smallest significance level at which you would reject H0. At the conventional 5% threshold, a p-value of exactly 0.05 sits right at the boundary of rejection — you would reject at the 5% level but not at any stricter level.

Can a coefficient be statistically significant but economically unimportant?

Yes, and this is common with large sample sizes. Statistical significance depends on both the coefficient’s magnitude and the standard error. With enough observations, even tiny effects become statistically distinguishable from zero. For example, a factor loading of 0.01 might achieve p < 0.001 with 10,000 daily observations, but a 1% sensitivity to the factor is negligible for portfolio decisions. Always evaluate both the p-value and the coefficient’s practical magnitude when interpreting regression results.

How does sample size affect the t-test?

With small samples, the t-distribution has heavier tails than the standard normal distribution, meaning critical values are larger and it is harder to reject the null hypothesis. This appropriately reflects the greater uncertainty in small-sample estimates. As the sample size grows, the t-distribution converges to the standard normal. The exact t-test requires the normality assumption (MLR.6), but with larger samples, the Central Limit Theorem ensures the t-test remains approximately valid even when errors are not normally distributed.

How do I compute an F-statistic from R-squared values?

This shortcut applies to exclusion restrictions in nested models — when the restricted model is obtained by dropping variables from the unrestricted model. Estimate both models and compute F = [(R2ur − R2r) / q] / [(1 − R2ur) / (n − k − 1)], where q is the number of variables dropped, n is the sample size, and k is the number of regressors in the unrestricted model. Compare the result to the Fq, n−k−1 critical value at your chosen significance level. Both R-squared values must come from regressions on the same sample.

Why report confidence intervals in addition to p-values?

Confidence intervals convey more information than p-values alone. While a p-value only indicates whether to reject or fail to reject at a given significance level, a confidence interval shows the range of plausible values for the true coefficient. This helps readers assess both the direction and magnitude of the effect. For example, knowing that a fund’s monthly alpha lies in [0.06%, 0.78%] is far more informative for investment decisions than knowing p = 0.023. Confidence intervals also make it easier to distinguish between economically meaningful and trivial effects.

Disclaimer

This article is for educational and informational purposes only and does not constitute investment advice. The numerical examples are illustrative and based on hypothetical regression output. Statistical significance does not imply economic significance or guarantee future results. Always conduct your own analysis and consult a qualified financial advisor before making investment decisions.