Omitted Variable Bias, Multicollinearity & Common Regression Pitfalls
When a regression model leaves out a variable that both affects the outcome and correlates with an included regressor, the estimated coefficients absorb the omitted variable’s effect and become systematically biased. This problem — omitted variable bias (OVB) — is the most common threat to causal inference in applied finance research. A single-factor CAPM regression that omits size and value factors, a profitability model that ignores leverage, or a bond yield regression that excludes credit risk all produce misleading estimates for the same reason. This article covers the OVB formula and how to sign the direction of bias, then addresses related pitfalls — multicollinearity, measurement error, and functional form misspecification — before comparing the main solutions available to researchers working with multiple regression models.
What Is Omitted Variable Bias?
Suppose the true population model is Y = β0 + β1X1 + β2X2 + u, but the researcher omits X2 and estimates a simple regression of Y on X1 alone. If X2 is correlated with X1, the OLS estimate of β1 will be biased because it picks up part of the effect of X2. This violates the zero conditional mean assumption — E(u | X1) ≠ 0 — because the omitted variable is absorbed into the error term and correlated with the included regressor.
OVB arises if and only if both conditions hold simultaneously:
- The omitted variable affects Y — β2 ≠ 0 (X2 belongs in the true model)
- The omitted variable correlates with the included regressor — Corr(X1, X2) ≠ 0
If either condition fails, the OLS estimate of β1 remains unbiased. A variable can be correlated with X1 without causing bias (if it does not affect Y), and a variable can affect Y without causing bias (if it is uncorrelated with X1).
The OVB formula makes the mechanism concrete. Let δ̃1 denote the slope from the auxiliary regression of X2 on X1. The short-regression estimator then satisfies β̃1 = β1 + β2δ̃1: the true effect plus a bias term equal to the omitted variable's effect on Y (β2) multiplied by its sample relationship with the included regressor (δ̃1). Adding X2 to the regression eliminates this bias, which is why multiple regression is the primary tool for controlling confounders in observational finance research.
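The decomposition can be verified numerically. In the minimal sketch below, the true coefficients (β1 = 0.5, β2 = 2.0) and the auxiliary slope (δ̃1 ≈ 0.6) are hypothetical values chosen for illustration; the short-regression slope lands almost exactly on β1 + β2δ̃1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True model: Y = 1 + 0.5*X1 + 2.0*X2 + u, with X2 correlated with X1
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)          # auxiliary slope delta_1 ~ 0.6
y = 1.0 + 0.5 * x1 + 2.0 * x2 + rng.normal(size=n)

# Short regression: omit X2 and regress Y on X1 alone
slope_short = np.polyfit(x1, y, 1)[0]

# delta_1: slope from regressing the omitted X2 on the included X1
delta1 = np.polyfit(x1, x2, 1)[0]

# OVB formula: beta_tilde_1 = beta_1 + beta_2 * delta_1
predicted = 0.5 + 2.0 * delta1
print(round(slope_short, 2), round(predicted, 2))   # both ~ 1.70
```

The short-regression slope of roughly 1.7 is more than triple the true β1 of 0.5, and the OVB formula predicts the inflated value almost exactly.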
Direction of Bias
The OVB formula allows researchers to determine whether the biased estimate is too high (upward bias) or too low (downward bias) by examining the signs of β2 and δ̃1:
| Sign of β2 (Omitted Variable’s Effect on Y) | Sign of δ̃1 (Correlation of Omitted and Included X) | Direction of Bias on β̃1 |
|---|---|---|
| Positive (+) | Positive (+) | Upward (overestimate) |
| Positive (+) | Negative (−) | Downward (underestimate) |
| Negative (−) | Positive (+) | Downward (underestimate) |
| Negative (−) | Negative (−) | Upward (overestimate) |
The rule is straightforward: when β2 and δ̃1 share the same sign, the bias is upward; when they have opposite signs, the bias is downward.
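The sign rule in the table reduces to one multiplication, sketched here as a small helper (the function name is illustrative, not from any library):

```python
def bias_direction(sign_beta2: int, sign_delta1: int) -> str:
    """Direction of OVB: upward if beta_2 and delta_1 share a sign,
    downward if their signs differ (sign inputs are +1 or -1)."""
    return "upward" if sign_beta2 * sign_delta1 > 0 else "downward"

print(bias_direction(+1, +1))   # upward
print(bias_direction(+1, -1))   # downward
print(bias_direction(-1, -1))   # upward
```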
Consider estimating JPMorgan Chase’s (JPM) market sensitivity using the single-factor CAPM:
RJPM − Rf = α + βMKT(Rm − Rf) + u
This model omits the Fama-French size (SMB) and value (HML) factors. Using monthly returns from 2015–2024, JPM — a classic large-cap value stock — has a positive HML loading (βHML > 0, because bank stocks tend to co-move with the value factor in this sample). Over the same period, market excess returns are positively correlated with HML (δ̃1 > 0). Both signs are positive in this sample, so the single-factor market beta is biased upward.
In practice, adding SMB and HML to create the three-factor model reduces JPM’s estimated market beta. The single-factor βMKT of approximately 1.28 drops to roughly 1.20 once size and value exposures are controlled for. The difference of 0.08 is the omitted variable bias absorbed by the single-factor estimate. The direction and magnitude of this bias are sample-specific — different stocks, sectors, and time periods can produce different signs on βHML and δ̃1.
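The mechanics can be reproduced on simulated factor data. This is a sketch, not real JPM or Fama-French data: the factor volatilities, the 0.3 MKT–HML relationship, and the assumed loadings (1.20 on MKT, 0.40 on HML) are hypothetical values chosen to mirror the scenario above:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 120  # 120 months of simulated returns (illustrative, not real data)

# Simulated monthly factor excess returns, in percent
mkt = rng.normal(0.8, 4.0, n)
hml = 0.3 * mkt + rng.normal(0.0, 2.5, n)   # HML positively correlated with MKT
smb = rng.normal(0.0, 2.0, n)               # SMB uncorrelated with MKT

# Hypothetical "true" three-factor exposures for a large-cap value stock
r = 1.20 * mkt + 0.40 * hml + 0.05 * smb + rng.normal(0.0, 3.0, n)

# Single-factor beta (omits SMB and HML): biased upward
beta_1f = np.polyfit(mkt, r, 1)[0]

# Three-factor beta via multiple OLS
X = np.column_stack([np.ones(n), mkt, smb, hml])
beta_3f = np.linalg.lstsq(X, r, rcond=None)[0][1]

print(round(beta_1f, 2), round(beta_3f, 2))  # beta_1f exceeds beta_3f
```

Because βHML > 0 and the sample relationship between MKT and HML is positive, the single-factor beta comes out above the three-factor beta, matching the sign analysis in the text.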
Multicollinearity
While omitted variable bias concerns variables excluded from the model, multicollinearity concerns high correlation among variables included in the model. When two or more regressors are strongly correlated, OLS has difficulty separating their individual effects, inflating the standard errors of the affected coefficients.
Crucially, multicollinearity does not bias coefficient estimates. OLS remains unbiased and consistent; the problem is imprecision. The overall model can still be highly significant (large F-statistic) even when the individual coefficients are insignificant (small t-statistics), because the regressors jointly explain variation in Y but their individual contributions cannot be cleanly separated.
For example, regressing ROA on log(total assets), log(revenue), and debt-to-equity for the 30 Dow Jones Industrial Average components might yield a VIF of 7.3 on log(total assets) and 6.8 on log(revenue), because firms like Apple, Microsoft, and UnitedHealth Group that rank high on one measure tend to rank high on the other. The individual coefficients on assets and revenue become imprecise, but the model’s joint explanatory power remains intact.
Multicollinearity inflates standard errors but does not bias coefficients. If the overall F-test is significant but individual t-tests are not, multicollinearity is a likely culprit. Before dropping a collinear variable, consider whether you need to isolate individual effects or whether joint significance is sufficient for your research question — dropping a relevant variable to “fix” multicollinearity can introduce omitted variable bias.
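The diagnostic itself is simple: VIF for regressor j is 1/(1 − R²j), where R²j comes from regressing Xj on the other regressors. A minimal numpy sketch on simulated data (the `vif` helper and the correlation structure are illustrative, roughly mimicking two size measures plus an unrelated leverage variable):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of regressor matrix X:
    VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing X_j on the rest."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.normal(size=500)                 # e.g. log(total assets)
b = a + 0.4 * rng.normal(size=500)       # strongly correlated, e.g. log(revenue)
c = rng.normal(size=500)                 # independent, e.g. debt-to-equity
X = np.column_stack([a, b, c])
print(vif(X).round(2))   # high VIF on the first two columns, near 1 on the third
```

With this correlation structure the first two columns produce VIFs around 7, in line with the hypothetical figures above, while the independent regressor stays near the no-collinearity benchmark of 1.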
Measurement Error
Even when all relevant variables are included in the model, measurement error — the difference between the recorded value and the true value of a variable — can distort OLS estimates. The consequences depend critically on whether the error is in the dependent variable or an independent variable.
Measurement error in Y (dependent variable): If Y is measured with random noise — Yobserved = Ytrue + e, where e is uncorrelated with the regressors — the OLS coefficient estimates remain unbiased. The noise simply increases the error variance, producing larger standard errors and less precise estimates but no systematic distortion.
Measurement error in X (independent variable): This is far more damaging. Under the classical errors-in-variables model, where the observed regressor is Xobserved = Xtrue + e, the OLS slope converges in probability to β1 · σ²X / (σ²X + σ²e), which is strictly smaller in magnitude than β1. This attenuation bias systematically pulls the estimated coefficient toward zero: the noisier the regressor, the closer the estimate sits to zero.
In finance, credit ratings from agencies like Moody’s and S&P are ordinal categories (Aaa, Aa1, Aa2, …) that researchers often convert to numerical scores for regression analysis. These scores are an imperfect, discrete proxy for the continuous latent variable “true credit risk.” The classical attenuation formula above requires three conditions: the measurement error has mean zero, is uncorrelated with the true value of X, and is uncorrelated with the structural error — normality is not required. When researchers map ordinal credit ratings to numerical scores, these classical conditions do not hold cleanly because the mapping is discrete and somewhat arbitrary. The result is a bias whose direction and magnitude depend on the specific numerical encoding chosen, though in many practical settings the coefficient is still attenuated toward zero relative to the true effect.
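Attenuation is easy to demonstrate under the classical conditions. In the sketch below the true β is 1.0 and the measurement-error variance is half the signal variance (both hypothetical choices), so the attenuation factor σ²X / (σ²X + σ²e) is 1/1.5 ≈ 0.67:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
beta = 1.0

x_true = rng.normal(size=n)                  # latent variable, e.g. true credit risk
y = beta * x_true + rng.normal(size=n)

# Observe X with classical measurement error: mean zero, independent of x_true
sigma_e2 = 0.5
x_obs = x_true + rng.normal(scale=np.sqrt(sigma_e2), size=n)

slope = np.polyfit(x_obs, y, 1)[0]
attenuation = 1.0 / (1.0 + sigma_e2)         # sigma_x^2 / (sigma_x^2 + sigma_e^2)
print(round(slope, 2), round(beta * attenuation, 2))   # both ~ 0.67
```

The estimated slope of roughly 0.67 matches β times the attenuation factor, not the true β of 1.0.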
Proxy Variables
When the true variable of interest is unobservable — such as management quality, financial distress risk, or investor sentiment — a proxy variable can serve as a stand-in. A valid proxy must be correlated with the unobservable and should not independently belong in the regression conditional on the unobservable variable.
A well-known example comes from the Fama-French asset pricing literature: the book-to-market ratio (B/M) is widely used as a proxy for financial distress risk, which is not directly observable. Firms with high B/M ratios — such as struggling retailers or overleveraged industrial companies — tend to carry greater distress risk, and the HML factor captures the return premium associated with this risk. Including B/M in a cross-sectional regression of stock returns absorbs part of the effect that unobservable distress risk would otherwise impose on the other coefficients, reducing OVB.
Proxy variables are a practical compromise: they reduce OVB when the true variable cannot be measured, but the bias reduction depends on how closely the proxy tracks the unobservable. When no adequate proxy exists, instrumental variables offer an alternative approach to addressing endogeneity from omitted variables.
The RESET Test for Functional Form
A model can also be misspecified through incorrect functional form — for example, fitting a linear relationship when the true relationship is quadratic, or omitting an interaction term. The Ramsey RESET (Regression Equation Specification Error Test) is a widely used diagnostic for detecting such problems.
The RESET test adds powers of the fitted values (typically ŷ2 and ŷ3) to the original regression and tests their joint significance using an F-test. If the added terms are jointly significant, the null hypothesis of correct functional form is rejected — the linear specification is inadequate. The test detects many types of misspecification (missing quadratics, omitted interactions, incorrect log transformations) but does not tell you which specific terms are missing.
A significant RESET result is a warning sign, not a prescription. The researcher must then consider alternative specifications — adding quadratic terms, using log transformations, or including interaction effects. For a detailed treatment of these functional form choices, see Regression Functional Forms.
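The test mechanics can be sketched directly with numpy, without a stats package: fit the restricted linear model, augment it with ŷ² and ŷ³, and form the standard F-statistic from the two sums of squared residuals. The simulated quadratic data-generating process here is a hypothetical example:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(size=n)   # true model is quadratic

def ssr(X, y):
    """Sum of squared OLS residuals for design matrix X."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return (resid ** 2).sum()

# Restricted model: linear in x only
Xr = np.column_stack([np.ones(n), x])
yhat = Xr @ np.linalg.lstsq(Xr, y, rcond=None)[0]
ssr_r = ssr(Xr, y)

# Unrestricted model: add powers of the fitted values (the RESET terms)
Xu = np.column_stack([Xr, yhat**2, yhat**3])
ssr_u = ssr(Xu, y)

q, k_u = 2, Xu.shape[1]                       # 2 restrictions tested
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k_u))
print(round(F, 1))  # far above the ~3.0 critical value: reject correct form
```

Because ŷ is linear in x, the ŷ² term carries the omitted x² variation, so the F-statistic is large and the null of correct functional form is rejected. Note the test flags the problem without indicating that a quadratic term, specifically, is the fix.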
Solutions to Omitted Variable Bias
No single solution works in all settings. The appropriate strategy depends on data availability, the nature of the omitted variable, and the assumptions the researcher is willing to maintain.
Add Control Variables
- Include the omitted variable directly in the regression
- Requires: data on the omitted variable; variable must be exogenous
- Risk: “bad controls” — adding a variable that is itself caused by the treatment introduces new bias
- Use dummy variables to control for categorical confounders
Proxy Variables
- Substitute an observable variable that correlates with the unobservable
- Requires: proxy tracks the unobservable closely
- Limitation: rarely eliminates bias entirely; proxy quality is difficult to verify
- Best when the omitted variable is conceptually clear but not directly measurable
Instrumental Variables
- Find a variable Z correlated with X but uncorrelated with the error term
- Requires: relevance (Z predicts X) and exogeneity (Z does not directly affect Y)
- Limitation: valid instruments are rare; weak instruments create new problems
- See Instrumental Variables & 2SLS for the full methodology
Panel Data Fixed Effects
- Use within-entity variation to control for all time-invariant unobservables
- Requires: panel data (repeated observations on the same entities over time)
- Limitation: only removes time-invariant OVB; time-varying omitted variables remain problematic
- See Panel Data Analysis for fixed and random effects estimation
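The fixed-effects logic above can be sketched on a simulated panel. Here the time-invariant firm effect is the omitted variable, deliberately built to correlate with the regressor (all parameters are hypothetical): pooled OLS is badly biased, while demeaning each firm's observations recovers the true β:

```python
import numpy as np

rng = np.random.default_rng(5)
n_firms, n_years = 200, 10
beta = 0.5

# Time-invariant firm effect, correlated with the regressor (the omitted variable)
alpha = rng.normal(size=n_firms)
x = alpha[:, None] + rng.normal(size=(n_firms, n_years))
y = beta * x + 2.0 * alpha[:, None] + rng.normal(size=(n_firms, n_years))

# Pooled OLS ignores alpha and suffers OVB
b_pooled = np.polyfit(x.ravel(), y.ravel(), 1)[0]

# Within (fixed-effects) estimator: demean x and y by firm
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
b_fe = (xd.ravel() @ yd.ravel()) / (xd.ravel() @ xd.ravel())

print(round(b_pooled, 2), round(b_fe, 2))   # pooled ~ 1.5, fixed effects ~ 0.5
```

Demeaning wipes out anything constant within a firm, which is exactly why fixed effects cannot help with omitted variables that vary over time.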
Randomization
- Random assignment of X breaks the correlation between X and all omitted variables
- Requires: ability to conduct or access data from a randomized experiment
- Limitation: rare in finance; feasible mainly in fintech A/B testing or regulatory lotteries
- The gold standard for causal inference but seldom available in financial markets research
Common Mistakes
1. Thinking more controls always reduce bias (“bad controls”). Adding a variable that is itself an outcome of the treatment variable can introduce new bias rather than reducing OVB. For example, in a regression of firm stock returns on R&D spending, controlling for patent counts is problematic because patents are an outcome of R&D, not a confounder. A valid control must be a common cause of both the treatment and the outcome, not a mediator or consequence.
2. Confusing multicollinearity with omitted variable bias. Multicollinearity inflates standard errors but does not bias coefficient estimates. OVB biases the coefficients themselves. Dropping a collinear variable to “fix” multicollinearity can actually introduce OVB if that variable belongs in the true model — you are trading precision for bias, which is almost always a bad trade.
3. Assuming low VIF means no omitted variable bias. VIF measures correlation among included regressors. A model can have VIF values near 1.0 for every included variable while suffering severe OVB from an excluded variable. VIF diagnoses multicollinearity, not omitted variables — they are fundamentally different problems.
4. Using OVB as a blanket criticism without signing the bias. Claiming “the estimate might be biased due to omitted variables” without specifying the likely direction and approximate magnitude provides no actionable information. Always attempt to sign the bias using the formula: identify the probable omitted variable, determine the signs of β2 and δ̃1, and state whether the bias is upward or downward.
Limitations
Omitted variable bias can never be fully ruled out in observational data. The bias formula assumes you know which variable is omitted and its relationship to the included regressors — information that is rarely available in practice. Even when controls, proxies, or instruments are used, residual bias from additional unmeasured confounders may remain.
The direction-of-bias analysis is only as reliable as the researcher’s knowledge of the omitted variable’s true effect on Y and its correlation with the included regressors. In complex financial systems with many interrelated variables, signing the bias requires strong economic reasoning that may itself be debatable.
Each solution to OVB introduces its own assumptions. Instrumental variables require valid instruments that may be difficult to defend. Panel fixed effects remove only time-invariant confounders, leaving time-varying OVB unaddressed. Proxy variables depend on unmeasurable proxy quality. No single method provides a universal guarantee of unbiased estimation.
Disclaimer
This article is for educational and informational purposes only and does not constitute investment advice. The examples and regression results cited are illustrative and based on real market scenarios with approximate figures. Actual empirical results depend on sample selection, time period, and model specification. Content is based on Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach, 8th Edition, Cengage, 2025, Chapters 3, 6, and 9. Always conduct your own research and consult a qualified financial advisor before making investment decisions.