Omitted Variable Bias, Multicollinearity & Common Regression Pitfalls

When a regression model leaves out a variable that both affects the outcome and correlates with an included regressor, the estimated coefficients absorb the omitted variable’s effect and become systematically biased. This problem — omitted variable bias (OVB) — is the most common threat to causal inference in applied finance research. A single-factor CAPM regression that omits size and value factors, a profitability model that ignores leverage, or a bond yield regression that excludes credit risk all produce misleading estimates for the same reason. This article covers the OVB formula and how to sign the direction of bias, then addresses related pitfalls — multicollinearity, measurement error, and functional form misspecification — before comparing the main solutions available to researchers working with multiple regression models.

What Is Omitted Variable Bias?

Suppose the true population model is Y = β0 + β1X1 + β2X2 + u, but the researcher omits X2 and estimates a simple regression of Y on X1 alone. If X2 is correlated with X1, the OLS estimate of β1 will be biased because it picks up part of the effect of X2. This violates the zero conditional mean assumption — E(u | X1) ≠ 0 — because the omitted variable is absorbed into the error term and correlated with the included regressor.

Two Conditions for Omitted Variable Bias

OVB arises if and only if both conditions hold simultaneously:

  1. The omitted variable affects Y — β2 ≠ 0 (X2 belongs in the true model)
  2. The omitted variable correlates with the included regressor — Corr(X1, X2) ≠ 0

If either condition fails, the OLS estimate of β1 remains unbiased. A variable can be correlated with X1 without causing bias (if it does not affect Y), and a variable can affect Y without causing bias (if it is uncorrelated with X1).

Omitted Variable Bias Formula
Bias(β̃1) = β2 × δ̃1
where β̃1 is the OLS estimator from the short regression (omitting X2), β2 is the true effect of the omitted variable on Y, and δ̃1 is the slope from regressing X2 on X1. The expected value of β̃1 equals the true β1 plus this bias term.

The formula makes the mechanism concrete: the short-regression estimator β̃1 equals the true effect β1 plus a bias term that depends on how strongly X2 affects Y and how strongly X2 is related to X1. Adding X2 to the regression eliminates this bias — which is why multiple regression is the primary tool for controlling confounders in observational finance research.
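The bias formula can be verified directly in a short simulation. The sketch below uses hypothetical parameter values (β1 = 0.5, β2 = 0.8, and an X2-on-X1 slope near 0.6), not estimates from real data, and checks that the short-regression slope lands at β1 + β2 × δ̃1:

```python
import numpy as np

# Simulate the OVB formula: the short-regression slope equals
# beta1 + beta2 * delta1, where delta1 is the slope from regressing X2 on X1.
# All parameter values are illustrative, not estimates from real data.
rng = np.random.default_rng(0)
n = 100_000

beta0, beta1, beta2 = 1.0, 0.5, 0.8     # true model: Y = b0 + b1*X1 + b2*X2 + u
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)      # X2 correlated with X1
y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)

X_design = np.column_stack([np.ones(n), x1])
b_short = np.linalg.lstsq(X_design, y, rcond=None)[0][1]   # short regression: omit X2
delta1 = np.linalg.lstsq(X_design, x2, rcond=None)[0][1]   # auxiliary: X2 on X1

predicted = beta1 + beta2 * delta1      # true beta1 plus the bias term
print(f"short-regression slope: {b_short:.3f}")
print(f"beta1 + beta2*delta1:   {predicted:.3f}")
```

With both β2 and δ̃1 positive here, the short-regression slope overshoots the true β1 = 0.5 by roughly β2 × δ̃1 ≈ 0.48, matching the upward-bias row of the sign table.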

Direction of Bias

The OVB formula allows researchers to determine whether the biased estimate is too high (upward bias) or too low (downward bias) by examining the signs of β2 and δ̃1:

Sign of β2 (omitted variable's effect on Y)   Sign of δ̃1 (relation of X2 to X1)   Direction of bias on β̃1
Positive (+)                                  Positive (+)                         Upward (overestimate)
Positive (+)                                  Negative (−)                         Downward (underestimate)
Negative (−)                                  Positive (+)                         Downward (underestimate)
Negative (−)                                  Negative (−)                         Upward (overestimate)

The rule is straightforward: when β2 and δ̃1 share the same sign, the bias is upward; when they have opposite signs, the bias is downward.

JPMorgan Chase: CAPM Beta and Omitted Factor Bias

Consider estimating JPMorgan Chase’s (JPM) market sensitivity using the single-factor CAPM:

RJPM − Rf = α + βMKT(Rm − Rf) + u

This model omits the Fama-French size (SMB) and value (HML) factors. Using monthly returns from 2015–2024, JPM — a classic large-cap value stock — has a positive HML loading (βHML > 0, because bank stocks tend to co-move with the value factor in this sample). Over the same period, market excess returns are positively correlated with HML (δ̃1 > 0). Both signs are positive in this sample, so the single-factor market beta is biased upward.

In practice, adding SMB and HML to create the three-factor model reduces JPM’s estimated market beta. The single-factor βMKT of approximately 1.28 drops to roughly 1.20 once size and value exposures are controlled for. The difference of 0.08 is the omitted variable bias absorbed by the single-factor estimate. The direction and magnitude of this bias are sample-specific — different stocks, sectors, and time periods can produce different signs on βHML and δ̃1.
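The single-factor versus multi-factor comparison can be sketched with simulated factor returns. The numbers below are synthetic, not real Fama-French or JPM data; the setup simply reproduces the sign pattern described above (positive HML loading, positive correlation between market and HML):

```python
import numpy as np

# Synthetic illustration of omitted-factor bias in a CAPM beta.
# Factor returns are simulated; only the sign pattern mirrors the JPM example.
rng = np.random.default_rng(1)
n = 120                                      # 120 "months"

mkt = rng.normal(0.01, 0.04, n)              # market excess return
hml = 0.3 * mkt + rng.normal(0, 0.02, n)     # HML positively correlated with MKT
true_b_mkt, true_b_hml = 1.20, 0.40          # hypothetical true loadings
r_excess = true_b_mkt * mkt + true_b_hml * hml + rng.normal(0, 0.02, n)

def ols_slopes(y, *regressors):
    """OLS slope coefficients (intercept dropped)."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

b_single = ols_slopes(r_excess, mkt)[0]       # omits HML: biased upward
b_multi = ols_slopes(r_excess, mkt, hml)[0]   # controls for HML
print(f"single-factor beta: {b_single:.2f}, multi-factor beta: {b_multi:.2f}")
```

Because the omitted loading and the factor correlation are both positive, the single-factor beta exceeds the controlled estimate, just as in the JPM example.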

Multicollinearity

While omitted variable bias concerns variables excluded from the model, multicollinearity concerns high correlation among variables included in the model. When two or more regressors are strongly correlated, OLS has difficulty separating their individual effects, inflating the standard errors of the affected coefficients.

Crucially, multicollinearity does not bias coefficient estimates. OLS remains unbiased and consistent; the problem is imprecision. The overall model can still be highly significant (large F-statistic) even when the individual coefficients are statistically insignificant (small t-statistics), because the regressors jointly explain variation in Y but their individual contributions cannot be cleanly separated.

Variance Inflation Factor (VIF)
VIFj = 1 / (1 − R²j)
where R²j is the R-squared from regressing Xj on all other independent variables. VIF = 1 means no collinearity. Common heuristics suggest VIF > 5 warrants further investigation and VIF > 10 may indicate serious multicollinearity, but these cutoffs are not formal statistical tests — whether multicollinearity is a practical problem depends on the research question and the precision required.

For example, regressing ROA on log(total assets), log(revenue), and debt-to-equity for the 30 Dow Jones Industrial Average components might yield a VIF of 7.3 on log(total assets) and 6.8 on log(revenue), because firms like Apple, Microsoft, and UnitedHealth Group that rank high on one measure tend to rank high on the other. The individual coefficients on assets and revenue become imprecise, but the model’s joint explanatory power remains intact.
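Computing VIF requires nothing beyond a sequence of auxiliary regressions. The sketch below uses simulated data whose collinearity pattern mimics the assets/revenue example; the figures are illustrative, not real Dow Jones financials:

```python
import numpy as np

# Compute VIF_j = 1 / (1 - R^2_j) by regressing each column on the others.
# Two columns are deliberately collinear; the data are simulated.
rng = np.random.default_rng(2)
n = 30

log_assets = rng.normal(12, 1, n)
log_revenue = 0.9 * log_assets + rng.normal(0, 0.3, n)   # strongly collinear
debt_equity = rng.normal(1.5, 0.5, n)                    # roughly independent
X = np.column_stack([log_assets, log_revenue, debt_equity])

def vif(X):
    """Return the variance inflation factor for each column of X."""
    out = []
    for j in range(X.shape[1]):
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - fitted
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return out

for name, v in zip(["log_assets", "log_revenue", "debt_equity"], vif(X)):
    print(f"{name}: VIF = {v:.1f}")
```

In this simulation the two size measures show inflated VIFs while the independent leverage variable stays near 1, mirroring the Dow example.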

Pro Tip

Multicollinearity inflates standard errors but does not bias coefficients. If the overall F-test is significant but individual t-tests are not, multicollinearity is a likely culprit. Before dropping a collinear variable, consider whether you need to isolate individual effects or whether joint significance is sufficient for your research question — dropping a relevant variable to “fix” multicollinearity can introduce omitted variable bias.

Measurement Error

Even when all relevant variables are included in the model, measurement error — the difference between the recorded value and the true value of a variable — can distort OLS estimates. The consequences depend critically on whether the error is in the dependent variable or an independent variable.

Measurement error in Y (dependent variable): If Y is measured with random noise — Yobserved = Ytrue + e, where e is uncorrelated with the regressors — the OLS coefficient estimates remain unbiased. The noise simply increases the error variance, producing larger standard errors and less precise estimates but no systematic distortion.

Measurement error in X (independent variable): This is far more damaging. Under the classical errors-in-variables model, measurement error in X produces attenuation bias — the estimated coefficient is systematically pulled toward zero:

Attenuation Bias (Classical Errors-in-Variables)
plim(β̂1) = β1 × σ²X / (σ²X + σ²e)
The ratio σ²X / (σ²X + σ²e) is always between 0 and 1, so the estimated coefficient is biased toward zero. The noisier the measurement (larger σ²e), the greater the attenuation.
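A quick simulation confirms the attenuation factor. With equal signal and noise variances (σ²X = σ²e = 1, values chosen purely for illustration), the formula predicts the slope is cut in half:

```python
import numpy as np

# Classical errors-in-variables: regressing Y on a noisy X pulls the slope
# toward zero by the factor var(X) / (var(X) + var(e)). Simulated data.
rng = np.random.default_rng(3)
n = 200_000

beta1 = 1.0
sigma_x, sigma_e = 1.0, 1.0                  # equal variances -> factor of 0.5
x_true = rng.normal(0, sigma_x, n)
x_obs = x_true + rng.normal(0, sigma_e, n)   # measurement error in X
y = beta1 * x_true + rng.normal(0, 0.5, n)

slope = np.polyfit(x_obs, y, 1)[0]
attenuation = sigma_x**2 / (sigma_x**2 + sigma_e**2)
print(f"estimated slope: {slope:.3f} (theory: {beta1 * attenuation:.3f})")
```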

In finance, credit ratings from agencies like Moody’s and S&P are ordinal categories (Aaa, Aa1, Aa2, …) that researchers often convert to numerical scores for regression analysis. These scores are an imperfect, discrete proxy for the continuous latent variable “true credit risk.” The classical attenuation formula above requires three conditions: the measurement error has mean zero, is uncorrelated with the true value of X, and is uncorrelated with the structural error — normality is not required. When researchers map ordinal credit ratings to numerical scores, these classical conditions do not hold cleanly because the mapping is discrete and somewhat arbitrary. The result is a bias whose direction and magnitude depend on the specific numerical encoding chosen, though in many practical settings the coefficient is still attenuated toward zero relative to the true effect.

Proxy Variables

When the true variable of interest is unobservable — such as management quality, financial distress risk, or investor sentiment — a proxy variable can serve as a stand-in. A valid proxy must be correlated with the unobservable and should not independently belong in the regression conditional on the unobservable variable.

A well-known example comes from the Fama-French asset pricing literature: the book-to-market ratio (B/M) is widely used as a proxy for financial distress risk, which is not directly observable. Firms with high B/M ratios — such as struggling retailers or overleveraged industrial companies — tend to carry greater distress risk, and the HML factor captures the return premium associated with this risk. Including B/M in a cross-sectional regression of stock returns absorbs part of the effect that unobservable distress risk would otherwise impose on the other coefficients, reducing OVB.

Proxy variables are a practical compromise: they reduce OVB when the true variable cannot be measured, but the bias reduction depends on how closely the proxy tracks the unobservable. When no adequate proxy exists, instrumental variables offer an alternative approach to addressing endogeneity from omitted variables.

The RESET Test for Functional Form

A model can also be misspecified through incorrect functional form — for example, fitting a linear relationship when the true relationship is quadratic, or omitting an interaction term. The Ramsey RESET (Regression Equation Specification Error Test) is a widely used diagnostic for detecting such problems.

Ramsey RESET Test

The RESET test adds powers of the fitted values (typically ŷ² and ŷ³) to the original regression and tests their joint significance using an F-test. If the added terms are jointly significant, the null hypothesis of correct functional form is rejected — the linear specification is inadequate. The test detects many types of misspecification (missing quadratics, omitted interactions, incorrect log transformations) but does not tell you which specific terms are missing.
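The test can be implemented by hand in a few lines. This is a minimal sketch, not a production diagnostic: the data are simulated with a quadratic term the linear model misses, so the test should reject (statsmodels also ships a ready-made version as `statsmodels.stats.diagnostic.linear_reset`):

```python
import numpy as np
from scipy import stats

# Hand-rolled RESET test: augment the regression with yhat^2 and yhat^3
# and F-test their joint significance. Simulated quadratic data.
rng = np.random.default_rng(4)
n = 500

x = rng.normal(size=n)
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 0.5, n)   # true model is quadratic

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

X_lin = np.column_stack([np.ones(n), x])
yhat = X_lin @ np.linalg.lstsq(X_lin, y, rcond=None)[0]
X_aug = np.column_stack([X_lin, yhat**2, yhat**3])

q = 2                                         # restrictions tested: yhat^2, yhat^3
ssr_r, ssr_u = ssr(X_lin, y), ssr(X_aug, y)
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - X_aug.shape[1]))
p = stats.f.sf(F, q, n - X_aug.shape[1])
print(f"RESET F = {F:.2f}, p = {p:.4f}")      # small p -> linear form rejected
```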

A significant RESET result is a warning sign, not a prescription. The researcher must then consider alternative specifications — adding quadratic terms, using log transformations, or including interaction effects. For a detailed treatment of these functional form choices, see Regression Functional Forms.

Solutions to Omitted Variable Bias

No single solution works in all settings. The appropriate strategy depends on data availability, the nature of the omitted variable, and the assumptions the researcher is willing to maintain.

Add Control Variables

  • Include the omitted variable directly in the regression
  • Requires: data on the omitted variable; variable must be exogenous
  • Risk: “bad controls” — adding a variable that is itself caused by the treatment introduces new bias
  • Use dummy variables to control for categorical confounders

Proxy Variables

  • Substitute an observable variable that correlates with the unobservable
  • Requires: proxy tracks the unobservable closely
  • Limitation: rarely eliminates bias entirely; proxy quality is difficult to verify
  • Best when the omitted variable is conceptually clear but not directly measurable

Instrumental Variables

  • Find a variable Z correlated with X but uncorrelated with the error term
  • Requires: relevance (Z predicts X) and exogeneity (Z does not directly affect Y)
  • Limitation: valid instruments are rare; weak instruments create new problems
  • See Instrumental Variables & 2SLS for the full methodology

Panel Data Fixed Effects

  • Use within-entity variation to control for all time-invariant unobservables
  • Requires: panel data (repeated observations on the same entities over time)
  • Limitation: only removes time-invariant OVB; time-varying omitted variables remain problematic
  • See Panel Data Analysis for fixed and random effects estimation
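The fixed-effects logic can be demonstrated with the within transformation: demeaning each entity's observations wipes out any time-invariant confounder. The simulation below uses a hypothetical unobserved "quality" variable that biases pooled OLS but not the demeaned estimator:

```python
import numpy as np

# Within (fixed-effects) transformation: demeaning by entity removes any
# time-invariant omitted variable. "quality" is an unobserved confounder
# that biases pooled OLS. Simulated data with hypothetical parameters.
rng = np.random.default_rng(5)
firms, periods = 200, 10
beta1 = 0.5                                        # true effect of x on y

quality = rng.normal(0, 1, firms)                  # unobserved, time-invariant
q_long = np.repeat(quality, periods)               # one value per firm-period
x = q_long + rng.normal(0, 1, firms * periods)     # x correlated with quality
y = beta1 * x + 2.0 * q_long + rng.normal(0, 1, firms * periods)

ids = np.repeat(np.arange(firms), periods)

def demean_by(group, v):
    """Subtract each group's mean from its observations."""
    means = np.bincount(group, weights=v) / np.bincount(group)
    return v - means[group]

pooled = np.polyfit(x, y, 1)[0]                    # biased upward by quality
within = np.polyfit(demean_by(ids, x), demean_by(ids, y), 1)[0]
print(f"pooled OLS slope: {pooled:.2f}, fixed-effects slope: {within:.2f}")
```

The pooled slope absorbs the quality confounder, while the within estimator recovers a value near the true β1; a time-varying confounder would survive the demeaning and bias both.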

Randomization

  • Random assignment of X breaks the correlation between X and all omitted variables
  • Requires: ability to conduct or access data from a randomized experiment
  • Limitation: rare in finance; feasible mainly in fintech A/B testing or regulatory lotteries
  • The gold standard for causal inference but seldom available in financial markets research

Common Mistakes

1. Thinking more controls always reduce bias (“bad controls”). Adding a variable that is itself an outcome of the treatment variable can introduce new bias rather than reducing OVB. For example, in a regression of firm stock returns on R&D spending, controlling for patent counts is problematic because patents are an outcome of R&D, not a confounder. A valid control must be a common cause of both the treatment and the outcome, not a mediator or consequence.
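The bad-control mistake is easy to see in a simulation. The R&D → patents → returns chain below is stylized with hypothetical coefficients: conditioning on the mediator (patents) blocks part of R&D's effect, so the "controlled" estimate understates the total effect:

```python
import numpy as np

# "Bad control" demo: conditioning on a mediator (an outcome of the
# treatment) changes what the coefficient measures. Simulated data;
# the R&D -> patents -> returns structure mirrors the example above.
rng = np.random.default_rng(6)
n = 50_000

rd = rng.normal(size=n)                        # treatment: R&D spending
patents = 0.8 * rd + rng.normal(size=n)        # mediator caused by R&D
returns = 0.5 * rd + 0.4 * patents + rng.normal(size=n)
# total effect of R&D = 0.5 (direct) + 0.4 * 0.8 (via patents) = 0.82

def slope_of_first(y, *regressors):
    """OLS coefficient on the first regressor, intercept included."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

total = slope_of_first(returns, rd)                # recovers the total effect
blocked = slope_of_first(returns, rd, patents)     # blocks the mediated path
print(f"without patents: {total:.2f}, with patents: {blocked:.2f}")
```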

2. Confusing multicollinearity with omitted variable bias. Multicollinearity inflates standard errors but does not bias coefficient estimates. OVB biases the coefficients themselves. Dropping a collinear variable to “fix” multicollinearity can actually introduce OVB if that variable belongs in the true model — you are trading precision for bias, which is almost always a bad trade.

3. Assuming low VIF means no omitted variable bias. VIF measures correlation among included regressors. A model can have VIF values near 1.0 for every included variable while suffering severe OVB from an excluded variable. VIF diagnoses multicollinearity, not omitted variables — they are fundamentally different problems.

4. Using OVB as a blanket criticism without signing the bias. Claiming “the estimate might be biased due to omitted variables” without specifying the likely direction and approximate magnitude provides no actionable information. Always attempt to sign the bias using the formula: identify the probable omitted variable, determine the signs of β2 and δ̃1, and state whether the bias is upward or downward.

Limitations

Important Caveat

Omitted variable bias can never be fully ruled out in observational data. The bias formula assumes you know which variable is omitted and its relationship to the included regressors — information that is rarely available in practice. Even when controls, proxies, or instruments are used, residual bias from additional unmeasured confounders may remain.

The direction-of-bias analysis is only as reliable as the researcher’s knowledge of the omitted variable’s true effect on Y and its correlation with the included regressors. In complex financial systems with many interrelated variables, signing the bias requires strong economic reasoning that may itself be debatable.

Each solution to OVB introduces its own assumptions. Instrumental variables require valid instruments that may be difficult to defend. Panel fixed effects remove only time-invariant confounders, leaving time-varying OVB unaddressed. Proxy variables depend on unmeasurable proxy quality. No single method provides a universal guarantee of unbiased estimation.

Frequently Asked Questions

What is omitted variable bias?

Omitted variable bias occurs when a regression leaves out a variable that both affects the outcome and is correlated with an included explanatory variable. The result is that the estimated coefficient on the included variable absorbs part of the omitted variable’s effect, making the estimate systematically too high or too low. For example, estimating a stock’s market sensitivity using only a single-factor CAPM model omits size and value risk factors, biasing the market beta estimate. The bias disappears only when the omitted variable is added to the model or addressed through other techniques like instrumental variables or panel data fixed effects.

How do you determine the direction of omitted variable bias?

Use the bias formula: bias = β2 × δ̃1. First, determine the sign of β2 — does the omitted variable positively or negatively affect Y? Second, determine the sign of δ̃1 — is the omitted variable positively or negatively correlated with the included X? If both signs are the same (both positive or both negative), the bias is upward and the coefficient overestimates the true effect. If the signs are opposite, the bias is downward. This signing exercise requires economic reasoning about the likely relationships among the variables.

How is omitted variable bias different from multicollinearity?

Omitted variable bias and multicollinearity are fundamentally different problems. OVB occurs when a relevant variable is excluded from the model, causing coefficient estimates to be systematically biased. Multicollinearity occurs when included variables are highly correlated, inflating standard errors but not biasing the estimates. The solutions differ accordingly: OVB requires adding controls, using proxies, or employing instrumental variables; multicollinearity may be addressed by increasing sample size, dropping truly redundant variables, or simply accepting wider confidence intervals. Importantly, dropping a collinear variable to reduce multicollinearity can introduce OVB if that variable belongs in the true model.

What is the variance inflation factor (VIF)?

The variance inflation factor measures how much the variance of an OLS coefficient is inflated due to multicollinearity among the regressors. It is calculated as VIFj = 1 / (1 − R²j), where R²j is the R-squared from regressing Xj on all other independent variables. A VIF of 1 indicates no collinearity. Common heuristics suggest values above 5 warrant further investigation and values above 10 may indicate serious multicollinearity, though these cutoffs are not formal statistical tests. Remember that VIF diagnoses correlation among included variables — it says nothing about omitted variable bias from excluded variables.

Can omitted variable bias be eliminated?

In observational data, OVB can never be guaranteed to be zero because there may always be unmeasured confounders that the researcher is unaware of. However, it can be substantially reduced through several strategies: adding relevant control variables to the regression, using proxy variables for unobservables, employing instrumental variables to address endogeneity, or using panel data fixed effects to control for time-invariant unobservables. Only true randomization — where the treatment is randomly assigned — breaks the correlation between the treatment and all potential omitted variables, providing unbiased estimates. In finance, such randomization is rare outside of fintech A/B testing or natural experiments created by regulatory changes.

Disclaimer

This article is for educational and informational purposes only and does not constitute investment advice. The examples and regression results cited are illustrative and based on real market scenarios with approximate figures. Actual empirical results depend on sample selection, time period, and model specification. Content is based on Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach, 8th Edition, Cengage, 2025, Chapters 3, 6, and 9. Always conduct your own research and consult a qualified financial advisor before making investment decisions.