Omitted Variable Bias, Multicollinearity & Common Regression Pitfalls

When a regression model leaves out a variable that both affects the outcome and correlates with an included regressor, the estimated coefficients absorb the omitted variable’s effect and become systematically biased. This problem — omitted variable bias (OVB) — is the most common threat to causal inference in applied finance research. A single-factor CAPM regression that omits size and value factors, a profitability model that ignores leverage, or a bond yield regression that excludes credit risk all produce misleading estimates for the same reason. This article covers the OVB formula and how to sign the direction of bias, then addresses related pitfalls — multicollinearity, measurement error, and functional form misspecification — before comparing the main solutions available to researchers working with multiple regression models.

What Is Omitted Variable Bias?

Suppose the true population model is Y = β0 + β1X1 + β2X2 + u, but the researcher omits X2 and estimates a simple regression of Y on X1 alone. If X2 is correlated with X1, the OLS estimate of β1 will be biased because it picks up part of the effect of X2. This violates the zero conditional mean assumption — E(u | X1) ≠ 0 — because the omitted variable is absorbed into the error term and correlated with the included regressor.

Two Conditions for Omitted Variable Bias

OVB arises if and only if both conditions hold simultaneously:

  1. The omitted variable affects Y — β2 ≠ 0 (X2 belongs in the true model)
  2. The omitted variable correlates with the included regressor — Corr(X1, X2) ≠ 0

If either condition fails, the OLS estimate of β1 remains unbiased. A variable can be correlated with X1 without causing bias (if it does not affect Y), and a variable can affect Y without causing bias (if it is uncorrelated with X1).

Omitted Variable Bias Formula
Bias(β̃1) = β2 × δ̃1
where β̃1 is the OLS estimator from the short regression (omitting X2), β2 is the true effect of the omitted variable on Y, and δ̃1 is the slope from regressing X2 on X1. The expected value of β̃1 equals the true β1 plus this bias term.

The formula makes the mechanism concrete: the short-regression estimator β̃1 equals the true effect β1 plus a bias term that depends on how strongly X2 affects Y and how strongly X2 is related to X1. Adding X2 to the regression eliminates this bias — which is why multiple regression is the primary tool for controlling confounders in observational finance research.
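The bias formula can be verified directly in a short simulation. The sketch below uses hypothetical parameter values (β1 = 0.5, β2 = 0.8, and an X2-on-X1 slope near 0.6), not estimates from real data, and checks that the short-regression slope lands at β1 + β2 × δ̃1:

```python
import numpy as np

# Simulate the OVB formula: the short-regression slope equals
# beta1 + beta2 * delta1, where delta1 is the slope from regressing X2 on X1.
# All parameter values are illustrative, not estimates from real data.
rng = np.random.default_rng(0)
n = 100_000

beta0, beta1, beta2 = 1.0, 0.5, 0.8     # true model: Y = b0 + b1*X1 + b2*X2 + u
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)      # X2 correlated with X1
y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)

X_design = np.column_stack([np.ones(n), x1])
b_short = np.linalg.lstsq(X_design, y, rcond=None)[0][1]   # short regression: omit X2
delta1 = np.linalg.lstsq(X_design, x2, rcond=None)[0][1]   # auxiliary: X2 on X1

predicted = beta1 + beta2 * delta1      # true beta1 plus the bias term
print(f"short-regression slope: {b_short:.3f}")
print(f"beta1 + beta2*delta1:   {predicted:.3f}")
```

With both β2 and δ̃1 positive here, the short-regression slope overshoots the true β1 = 0.5 by roughly β2 × δ̃1 ≈ 0.48, matching the upward-bias row of the sign table.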

Direction of Bias

The OVB formula allows researchers to determine whether the biased estimate is too high (upward bias) or too low (downward bias) by examining the signs of β2 and δ̃1:

Sign of β2 (omitted variable's effect on Y)   Sign of δ̃1 (relation of X2 to X1)   Direction of bias on β̃1
Positive (+)                                  Positive (+)                         Upward (overestimate)
Positive (+)                                  Negative (−)                         Downward (underestimate)
Negative (−)                                  Positive (+)                         Downward (underestimate)
Negative (−)                                  Negative (−)                         Upward (overestimate)

The rule is straightforward: when β2 and δ̃1 share the same sign, the bias is upward; when they have opposite signs, the bias is downward.

JPMorgan Chase: CAPM Beta and Omitted Factor Bias

Consider estimating JPMorgan Chase’s (JPM) market sensitivity using the single-factor CAPM:

RJPM − Rf = α + βMKT(Rm − Rf) + u

This model omits the Fama-French size (SMB) and value (HML) factors. Using monthly returns from 2015–2024, JPM — a classic large-cap value stock — has a positive HML loading (βHML > 0, because bank stocks tend to co-move with the value factor in this sample). Over the same period, market excess returns are positively correlated with HML (δ̃1 > 0). Both signs are positive in this sample, so the single-factor market beta is biased upward.

In practice, adding SMB and HML to create the three-factor model reduces JPM’s estimated market beta. The single-factor βMKT of approximately 1.28 drops to roughly 1.20 once size and value exposures are controlled for. The difference of 0.08 is the omitted variable bias absorbed by the single-factor estimate. The direction and magnitude of this bias are sample-specific — different stocks, sectors, and time periods can produce different signs on βHML and δ̃1.
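The single-factor versus multi-factor comparison can be sketched with simulated factor returns. The numbers below are synthetic, not real Fama-French or JPM data; the setup simply reproduces the sign pattern described above (positive HML loading, positive correlation between market and HML):

```python
import numpy as np

# Synthetic illustration of omitted-factor bias in a CAPM beta.
# Factor returns are simulated; only the sign pattern mirrors the JPM example.
rng = np.random.default_rng(1)
n = 120                                      # 120 "months"

mkt = rng.normal(0.01, 0.04, n)              # market excess return
hml = 0.3 * mkt + rng.normal(0, 0.02, n)     # HML positively correlated with MKT
true_b_mkt, true_b_hml = 1.20, 0.40          # hypothetical true loadings
r_excess = true_b_mkt * mkt + true_b_hml * hml + rng.normal(0, 0.02, n)

def ols_slopes(y, *regressors):
    """OLS slope coefficients (intercept dropped)."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

b_single = ols_slopes(r_excess, mkt)[0]       # omits HML: biased upward
b_multi = ols_slopes(r_excess, mkt, hml)[0]   # controls for HML
print(f"single-factor beta: {b_single:.2f}, multi-factor beta: {b_multi:.2f}")
```

Because the omitted loading and the factor correlation are both positive, the single-factor beta exceeds the controlled estimate, just as in the JPM example.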

Multicollinearity

While omitted variable bias concerns variables excluded from the model, multicollinearity concerns high correlation among variables included in the model. When two or more regressors are strongly correlated, OLS has difficulty separating their individual effects, inflating the standard errors of the affected coefficients.

Crucially, multicollinearity does not bias coefficient estimates. OLS remains unbiased and consistent; the problem is imprecision. The overall model can still be highly significant (large F-statistic) even when the individual coefficients are statistically insignificant (small t-statistics), because the regressors jointly explain variation in Y but their individual contributions cannot be cleanly separated.

Variance Inflation Factor (VIF)
VIFj = 1 / (1 − R²j)
where R²j is the R-squared from regressing Xj on all other independent variables. VIF = 1 means no collinearity. Common heuristics suggest VIF > 5 warrants further investigation and VIF > 10 may indicate serious multicollinearity, but these cutoffs are not formal statistical tests — whether multicollinearity is a practical problem depends on the research question and the precision required.

For example, regressing ROA on log(total assets), log(revenue), and debt-to-equity for the 30 Dow Jones Industrial Average components might yield a VIF of 7.3 on log(total assets) and 6.8 on log(revenue), because firms like Apple, Microsoft, and UnitedHealth Group that rank high on one measure tend to rank high on the other. The individual coefficients on assets and revenue become imprecise, but the model’s joint explanatory power remains intact.
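Computing VIF requires nothing beyond a sequence of auxiliary regressions. The sketch below uses simulated data whose collinearity pattern mimics the assets/revenue example; the figures are illustrative, not real Dow Jones financials:

```python
import numpy as np

# Compute VIF_j = 1 / (1 - R^2_j) by regressing each column on the others.
# Two columns are deliberately collinear; the data are simulated.
rng = np.random.default_rng(2)
n = 30

log_assets = rng.normal(12, 1, n)
log_revenue = 0.9 * log_assets + rng.normal(0, 0.3, n)   # strongly collinear
debt_equity = rng.normal(1.5, 0.5, n)                    # roughly independent
X = np.column_stack([log_assets, log_revenue, debt_equity])

def vif(X):
    """Return the variance inflation factor for each column of X."""
    out = []
    for j in range(X.shape[1]):
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - fitted
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return out

for name, v in zip(["log_assets", "log_revenue", "debt_equity"], vif(X)):
    print(f"{name}: VIF = {v:.1f}")
```

In this simulation the two size measures show inflated VIFs while the independent leverage variable stays near 1, mirroring the Dow example.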

Pro Tip

Multicollinearity inflates standard errors but does not bias coefficients. If the overall F-test is significant but individual t-tests are not, multicollinearity is a likely culprit. Before dropping a collinear variable, consider whether you need to isolate individual effects or whether joint significance is sufficient for your research question — dropping a relevant variable to “fix” multicollinearity can introduce omitted variable bias.

Measurement Error

Even when all relevant variables are included in the model, measurement error — the difference between the recorded value and the true value of a variable — can distort OLS estimates. The consequences depend critically on whether the error is in the dependent variable or an independent variable.

Measurement error in Y (dependent variable): If Y is measured with random noise — Yobserved = Ytrue + e, where e is uncorrelated with the regressors — the OLS coefficient estimates remain unbiased. The noise simply increases the error variance, producing larger standard errors and less precise estimates but no systematic distortion.

Measurement error in X (independent variable): This is far more damaging. Under the classical errors-in-variables model, measurement error in X produces attenuation bias — the estimated coefficient is systematically pulled toward zero:

Attenuation Bias (Classical Errors-in-Variables)
plim(β̂1) = β1 × σ²X / (σ²X + σ²e)
The ratio σ²X / (σ²X + σ²e) is always between 0 and 1, so the estimated coefficient is biased toward zero. The noisier the measurement (larger σ²e), the greater the attenuation.
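A quick simulation confirms the attenuation factor. With equal signal and noise variances (σ²X = σ²e = 1, values chosen purely for illustration), the formula predicts the slope is cut in half:

```python
import numpy as np

# Classical errors-in-variables: regressing Y on a noisy X pulls the slope
# toward zero by the factor var(X) / (var(X) + var(e)). Simulated data.
rng = np.random.default_rng(3)
n = 200_000

beta1 = 1.0
sigma_x, sigma_e = 1.0, 1.0                  # equal variances -> factor of 0.5
x_true = rng.normal(0, sigma_x, n)
x_obs = x_true + rng.normal(0, sigma_e, n)   # measurement error in X
y = beta1 * x_true + rng.normal(0, 0.5, n)

slope = np.polyfit(x_obs, y, 1)[0]
attenuation = sigma_x**2 / (sigma_x**2 + sigma_e**2)
print(f"estimated slope: {slope:.3f} (theory: {beta1 * attenuation:.3f})")
```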

In finance, credit ratings from agencies like Moody’s and S&P are ordinal categories (Aaa, Aa1, Aa2, …) that researchers often convert to numerical scores for regression analysis. These scores are an imperfect, discrete proxy for the continuous latent variable “true credit risk.” The classical attenuation formula above requires three conditions: the measurement error has mean zero, is uncorrelated with the true value of X, and is uncorrelated with the structural error — normality is not required. When researchers map ordinal credit ratings to numerical scores, these classical conditions do not hold cleanly because the mapping is discrete and somewhat arbitrary. The result is a bias whose direction and magnitude depend on the specific numerical encoding chosen, though in many practical settings the coefficient is still attenuated toward zero relative to the true effect.

Proxy Variables

When the true variable of interest is unobservable — such as management quality, financial distress risk, or investor sentiment — a proxy variable can serve as a stand-in. A valid proxy must be correlated with the unobservable and should not independently belong in the regression conditional on the unobservable variable.

A well-known example comes from the Fama-French asset pricing literature: the book-to-market ratio (B/M) is widely used as a proxy for financial distress risk, which is not directly observable. Firms with high B/M ratios — such as struggling retailers or overleveraged industrial companies — tend to carry greater distress risk, and the HML factor captures the return premium associated with this risk. Including B/M in a cross-sectional regression of stock returns absorbs part of the effect that unobservable distress risk would otherwise impose on the other coefficients, reducing OVB.

Proxy variables are a practical compromise: they reduce OVB when the true variable cannot be measured, but the bias reduction depends on how closely the proxy tracks the unobservable. When no adequate proxy exists, instrumental variables offer an alternative approach to addressing endogeneity from omitted variables.

The RESET Test for Functional Form

A model can also be misspecified through incorrect functional form — for example, fitting a linear relationship when the true relationship is quadratic, or omitting an interaction term. The Ramsey RESET (Regression Equation Specification Error Test) is a widely used diagnostic for detecting such problems.

Ramsey RESET Test

The RESET test adds powers of the fitted values (typically ŷ² and ŷ³) to the original regression and tests their joint significance using an F-test. If the added terms are jointly significant, the null hypothesis of correct functional form is rejected — the linear specification is inadequate. The test detects many types of misspecification (missing quadratics, omitted interactions, incorrect log transformations) but does not tell you which specific terms are missing.
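The test can be implemented by hand in a few lines. This is a minimal sketch, not a production diagnostic: the data are simulated with a quadratic term the linear model misses, so the test should reject (statsmodels also ships a ready-made version as `statsmodels.stats.diagnostic.linear_reset`):

```python
import numpy as np
from scipy import stats

# Hand-rolled RESET test: augment the regression with yhat^2 and yhat^3
# and F-test their joint significance. Simulated quadratic data.
rng = np.random.default_rng(4)
n = 500

x = rng.normal(size=n)
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 0.5, n)   # true model is quadratic

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

X_lin = np.column_stack([np.ones(n), x])
yhat = X_lin @ np.linalg.lstsq(X_lin, y, rcond=None)[0]
X_aug = np.column_stack([X_lin, yhat**2, yhat**3])

q = 2                                         # restrictions tested: yhat^2, yhat^3
ssr_r, ssr_u = ssr(X_lin, y), ssr(X_aug, y)
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - X_aug.shape[1]))
p = stats.f.sf(F, q, n - X_aug.shape[1])
print(f"RESET F = {F:.2f}, p = {p:.4f}")      # small p -> linear form rejected
```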

A significant RESET result is a warning sign, not a prescription. The researcher must then consider alternative specifications — adding quadratic terms, using log transformations, or including interaction effects. For a detailed treatment of these functional form choices, see Regression Functional Forms.

Solutions to Omitted Variable Bias

No single solution works in all settings. The appropriate strategy depends on data availability, the nature of the omitted variable, and the assumptions the researcher is willing to maintain.

Add Control Variables

  • Include the omitted variable directly in the regression
  • Requires: data on the omitted variable; variable must be exogenous
  • Risk: “bad controls” — adding a variable that is itself caused by the treatment introduces new bias
  • Use dummy variables to control for categorical confounders

Proxy Variables

  • Substitute an observable variable that correlates with the unobservable
  • Requires: proxy tracks the unobservable closely
  • Limitation: rarely eliminates bias entirely; proxy quality is difficult to verify
  • Best when the omitted variable is conceptually clear but not directly measurable

Instrumental Variables

  • Find a variable Z correlated with X but uncorrelated with the error term
  • Requires: relevance (Z predicts X) and exogeneity (Z does not directly affect Y)
  • Limitation: valid instruments are rare; weak instruments create new problems
  • See Instrumental Variables & 2SLS for the full methodology

Panel Data Fixed Effects

  • Use within-entity variation to control for all time-invariant unobservables
  • Requires: panel data (repeated observations on the same entities over time)
  • Limitation: only removes time-invariant OVB; time-varying omitted variables remain problematic
  • See Panel Data Analysis for fixed and random effects estimation
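The fixed-effects logic can be demonstrated with the within transformation: demeaning each entity's observations wipes out any time-invariant confounder. The simulation below uses a hypothetical unobserved "quality" variable that biases pooled OLS but not the demeaned estimator:

```python
import numpy as np

# Within (fixed-effects) transformation: demeaning by entity removes any
# time-invariant omitted variable. "quality" is an unobserved confounder
# that biases pooled OLS. Simulated data with hypothetical parameters.
rng = np.random.default_rng(5)
firms, periods = 200, 10
beta1 = 0.5                                        # true effect of x on y

quality = rng.normal(0, 1, firms)                  # unobserved, time-invariant
q_long = np.repeat(quality, periods)               # one value per firm-period
x = q_long + rng.normal(0, 1, firms * periods)     # x correlated with quality
y = beta1 * x + 2.0 * q_long + rng.normal(0, 1, firms * periods)

ids = np.repeat(np.arange(firms), periods)

def demean_by(group, v):
    """Subtract each group's mean from its observations."""
    means = np.bincount(group, weights=v) / np.bincount(group)
    return v - means[group]

pooled = np.polyfit(x, y, 1)[0]                    # biased upward by quality
within = np.polyfit(demean_by(ids, x), demean_by(ids, y), 1)[0]
print(f"pooled OLS slope: {pooled:.2f}, fixed-effects slope: {within:.2f}")
```

The pooled slope absorbs the quality confounder, while the within estimator recovers a value near the true β1; a time-varying confounder would survive the demeaning and bias both.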

Randomization

  • Random assignment of X breaks the correlation between X and all omitted variables
  • Requires: ability to conduct or access data from a randomized experiment
  • Limitation: rare in finance; feasible mainly in fintech A/B testing or regulatory lotteries
  • The gold standard for causal inference but seldom available in financial markets research

Common Mistakes

1. Thinking more controls always reduce bias (“bad controls”). Adding a variable that is itself an outcome of the treatment variable can introduce new bias rather than reducing OVB. For example, in a regression of firm stock returns on R&D spending, controlling for patent counts is problematic because patents are an outcome of R&D, not a confounder. A valid control must be a common cause of both the treatment and the outcome, not a mediator or consequence.
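The bad-control mistake is easy to see in a simulation. The R&D → patents → returns chain below is stylized with hypothetical coefficients: conditioning on the mediator (patents) blocks part of R&D's effect, so the "controlled" estimate understates the total effect:

```python
import numpy as np

# "Bad control" demo: conditioning on a mediator (an outcome of the
# treatment) changes what the coefficient measures. Simulated data;
# the R&D -> patents -> returns structure mirrors the example above.
rng = np.random.default_rng(6)
n = 50_000

rd = rng.normal(size=n)                        # treatment: R&D spending
patents = 0.8 * rd + rng.normal(size=n)        # mediator caused by R&D
returns = 0.5 * rd + 0.4 * patents + rng.normal(size=n)
# total effect of R&D = 0.5 (direct) + 0.4 * 0.8 (via patents) = 0.82

def slope_of_first(y, *regressors):
    """OLS coefficient on the first regressor, intercept included."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

total = slope_of_first(returns, rd)                # recovers the total effect
blocked = slope_of_first(returns, rd, patents)     # blocks the mediated path
print(f"without patents: {total:.2f}, with patents: {blocked:.2f}")
```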

2. Confusing multicollinearity with omitted variable bias. Multicollinearity inflates standard errors but does not bias coefficient estimates. OVB biases the coefficients themselves. Dropping a collinear variable to “fix” multicollinearity can actually introduce OVB if that variable belongs in the true model — you are trading precision for bias, which is almost always a bad trade.

3. Assuming low VIF means no omitted variable bias. VIF measures correlation among included regressors. A model can have VIF values near 1.0 for every included variable while suffering severe OVB from an excluded variable. VIF diagnoses multicollinearity, not omitted variables — they are fundamentally different problems.

4. Using OVB as a blanket criticism without signing the bias. Claiming “the estimate might be biased due to omitted variables” without specifying the likely direction and approximate magnitude provides no actionable information. Always attempt to sign the bias using the formula: identify the probable omitted variable, determine the signs of β2 and δ̃1, and state whether the bias is upward or downward.

Limitations

Important Caveat

Omitted variable bias can never be fully ruled out in observational data. The bias formula assumes you know which variable is omitted and its relationship to the included regressors — information that is rarely available in practice. Even when controls, proxies, or instruments are used, residual bias from additional unmeasured confounders may remain.

The direction-of-bias analysis is only as reliable as the researcher’s knowledge of the omitted variable’s true effect on Y and its correlation with the included regressors. In complex financial systems with many interrelated variables, signing the bias requires strong economic reasoning that may itself be debatable.

Each solution to OVB introduces its own assumptions. Instrumental variables require valid instruments that may be difficult to defend. Panel fixed effects remove only time-invariant confounders, leaving time-varying OVB unaddressed. Proxy variables depend on unmeasurable proxy quality. No single method provides a universal guarantee of unbiased estimation.

Frequently Asked Questions

What is omitted variable bias?

Omitted variable bias occurs when a regression leaves out a variable that both affects the outcome and is correlated with an included explanatory variable. The result is that the estimated coefficient on the included variable absorbs part of the omitted variable’s effect, making the estimate systematically too high or too low. For example, estimating a stock’s market sensitivity using only a single-factor CAPM model omits size and value risk factors, biasing the market beta estimate. The bias disappears only when the omitted variable is added to the model or addressed through other techniques like instrumental variables or panel data fixed effects.

How do you determine the direction of omitted variable bias?

Use the bias formula: bias = β2 × δ̃1. First, determine the sign of β2 — does the omitted variable positively or negatively affect Y? Second, determine the sign of δ̃1 — is the omitted variable positively or negatively correlated with the included X? If both signs are the same (both positive or both negative), the bias is upward and the coefficient overestimates the true effect. If the signs are opposite, the bias is downward. This signing exercise requires economic reasoning about the likely relationships among the variables.

How is omitted variable bias different from multicollinearity?

Omitted variable bias and multicollinearity are fundamentally different problems. OVB occurs when a relevant variable is excluded from the model, causing coefficient estimates to be systematically biased. Multicollinearity occurs when included variables are highly correlated, inflating standard errors but not biasing the estimates. The solutions differ accordingly: OVB requires adding controls, using proxies, or employing instrumental variables; multicollinearity may be addressed by increasing sample size, dropping truly redundant variables, or simply accepting wider confidence intervals. Importantly, dropping a collinear variable to reduce multicollinearity can introduce OVB if that variable belongs in the true model.

What is the variance inflation factor (VIF)?

The variance inflation factor measures how much the variance of an OLS coefficient is inflated due to multicollinearity among the regressors. It is calculated as VIFj = 1 / (1 − R²j), where R²j is the R-squared from regressing Xj on all other independent variables. A VIF of 1 indicates no collinearity. Common heuristics suggest values above 5 warrant further investigation and values above 10 may indicate serious multicollinearity, though these cutoffs are not formal statistical tests. Remember that VIF diagnoses correlation among included variables — it says nothing about omitted variable bias from excluded variables.

Can omitted variable bias be eliminated?

In observational data, OVB can never be guaranteed to be zero because there may always be unmeasured confounders that the researcher is unaware of. However, it can be substantially reduced through several strategies: adding relevant control variables to the regression, using proxy variables for unobservables, employing instrumental variables to address endogeneity, or using panel data fixed effects to control for time-invariant unobservables. Only true randomization — where the treatment is randomly assigned — breaks the correlation between the treatment and all potential omitted variables, providing unbiased estimates. In finance, such randomization is rare outside of fintech A/B testing or natural experiments created by regulatory changes.

Disclaimer

This article is for educational and informational purposes only and does not constitute investment advice. The examples and regression results cited are illustrative and based on real market scenarios with approximate figures. Actual empirical results depend on sample selection, time period, and model specification. Content is based on Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach, 8th Edition, Cengage, 2025, Chapters 3, 6, and 9. Always conduct your own research and consult a qualified financial advisor before making investment decisions.