Multiple Regression Analysis: Estimation, Interpretation & R-Squared
Multiple regression analysis is the workhorse of empirical finance and econometrics. When a single variable cannot adequately explain stock returns, firm profitability, or bond yields, multiple regression allows analysts to include several explanatory variables simultaneously and isolate each one’s effect while holding the others constant. This guide covers the population model, OLS estimation with multiple regressors, the ceteris paribus interpretation, R-squared vs adjusted R-squared, and the Gauss-Markov assumptions that make OLS the best linear unbiased estimator.
What Is Multiple Regression?
Multiple regression extends simple linear regression by including two or more independent variables in a single model. Instead of asking “how does the market affect this stock’s returns?”, multiple regression asks “how does the market affect returns after controlling for firm size and value characteristics?”
Each slope coefficient βj measures the partial effect of Xj on Y, holding all other independent variables constant. This “all else equal” interpretation — called ceteris paribus — is the primary reason multiple regression is more useful than running several simple regressions separately.
The error term u captures all factors that affect Y but are not included as regressors. If any omitted factor is correlated with an included X variable, the OLS estimates will be biased — a problem known as omitted variable bias.
OLS Estimation with Multiple Regressors
Ordinary Least Squares (OLS) estimates the coefficients by minimizing the sum of squared residuals across all observations, choosing all k + 1 coefficients simultaneously. The resulting fitted equation is:

Ŷ = β̂0 + β̂1X1 + β̂2X2 + … + β̂kXk
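As a minimal sketch of how this estimation works in practice, the snippet below fits an OLS regression on synthetic data with NumPy's least-squares solver. The variable names and the "true" coefficients (1.0, 2.0, −0.5) are illustrative assumptions, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Assumed true model for illustration: y = 1.0 + 2.0*x1 - 0.5*x2 + u
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix: a column of ones for the intercept plus the k regressors,
# so all k + 1 coefficients are chosen simultaneously
X = np.column_stack([np.ones(n), x1, x2])

# lstsq minimizes the sum of squared residuals ||y - X @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -0.5]
```

With 500 observations and modest error variance, the estimates land close to the assumed true values, which is the unbiasedness property at work.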
The Partialling Out Interpretation
A powerful way to understand multiple regression: the OLS estimate of β1 is identical to the slope from a simple regression of Y on the part of X1 that is uncorrelated with all other regressors. In other words, OLS first “partials out” the influence of X2 through Xk from both Y and X1, then estimates the remaining relationship. This is why multiple regression coefficients typically differ from simple regression coefficients — they strip away confounding influences.
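This partialling-out result (the Frisch-Waugh-Lovell theorem) can be checked numerically. The data below are a synthetic sketch with assumed coefficients; the point is only that the two slopes coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)          # x1 is correlated with x2
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
# (1) Full multiple regression of y on [1, x1, x2]
b_full = ols(np.column_stack([ones, x1, x2]), y)

# (2) Partial out x2 from x1, then regress y on the residualized x1
g = ols(np.column_stack([ones, x2]), x1)
x1_tilde = x1 - np.column_stack([ones, x2]) @ g
b_partial = ols(x1_tilde.reshape(-1, 1), y)[0]

print(b_full[1], b_partial)  # identical up to floating-point error
```

The residualized x1 is, by construction, uncorrelated with the intercept and x2, so the simple-regression slope on it reproduces the multiple-regression coefficient exactly.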
Precision of the OLS Estimator
The variance of each OLS slope estimator is Var(β̂j) = σ² / [SSTj(1 − Rj²)], where SSTj is the total sample variation in xj and Rj² is the R-squared from regressing xj on all the other independent variables. This formula shows that precision depends on three factors:
Three forces increase estimator variance (reduce precision): (1) a larger error variance σ², meaning more noise in the data; (2) less variation in xj (smaller SSTj), giving OLS less information to work with; and (3) higher Rj², meaning xj is more collinear with the other regressors. The term 1 / (1 − Rj²) is called the variance inflation factor (VIF) — when Rj² approaches 1, the VIF and standard errors explode.
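The VIF calculation can be illustrated on synthetic collinear data. The 0.95 coefficient below is an arbitrary assumption chosen to make x2 nearly collinear with x1:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)   # x2 almost collinear with x1

def r_squared(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - np.sum(resid ** 2) / sst

# Rj^2: regress x2 on the other regressors (here just an intercept and x1)
r2_j = r_squared(np.column_stack([np.ones(n), x1]), x2)
vif = 1.0 / (1.0 - r2_j)
print(round(r2_j, 3), round(vif, 1))  # high Rj^2 -> VIF explodes
```

Here Rj² is close to 0.99, so the VIF is large and any coefficient on x2 would carry a badly inflated standard error.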
Interpreting Multiple Regression Coefficients
Each OLS coefficient represents the expected change in Y for a one-unit increase in that variable, holding all other independent variables fixed. Consider a three-factor model of stock excess returns:
| Coefficient | Estimate | Interpretation |
|---|---|---|
| β̂1 (Market) | 1.20 | A 1 percentage point increase in market excess return is associated with a 1.20 pp increase in the stock’s excess return, holding size and value factors constant |
| β̂2 (SMB) | -0.25 | A 1 pp increase in the size factor is associated with a 0.25 pp decrease in excess return, holding market and value factors constant |
| β̂3 (HML) | -0.48 | A 1 pp increase in the value factor is associated with a 0.48 pp decrease, holding market and size factors constant |
Coefficients from a multiple regression will generally differ from those obtained by running separate simple regressions of Y on each X individually. The multiple regression coefficients are the ones that carry a ceteris paribus interpretation, because they control for the influence of the other included variables. This is precisely the partialling out effect at work.
R-Squared vs Adjusted R-Squared
Two closely related measures quantify how well a multiple regression model fits the data. Understanding the distinction is critical for model selection.
R-Squared (R²)
- Measures the fraction of variance in Y explained by the model
- Formula: R² = 1 − SSR / SST
- Always increases (or stays the same) when a regressor is added
- Range: 0 to 1
- Best for: evaluating a single model’s explanatory power
Adjusted R-Squared (R̄²)
- Penalizes for adding regressors that do not improve the model
- Formula: R̄² = 1 − [SSR/(n−k−1)] / [SST/(n−1)]
- Can decrease if a new variable does not sufficiently reduce SSR
- Range: can be negative (rare in practice)
- Best for: comparing models with different numbers of regressors
Why does R² always increase? Adding a variable can only reduce (or leave unchanged) the sum of squared residuals — OLS will set the new coefficient to zero if the variable has no explanatory power. Since SST is fixed, R² = 1 − SSR/SST cannot fall. Adjusted R² corrects for this because when a new variable is added, the residual degrees of freedom (n − k − 1) decrease by one. If the new variable does not reduce SSR sufficiently, the ratio SSR/(n − k − 1) actually increases, causing adjusted R² to fall.
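These mechanics can be checked directly: adding a pure-noise regressor to a synthetic model never lowers R², while adjusted R² may fall. The coefficients below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x1 = rng.normal(size=n)
y = 1.0 + 0.8 * x1 + rng.normal(size=n)
noise = rng.normal(size=n)                  # irrelevant regressor

def fit_stats(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    k = X.shape[1] - 1                      # number of slope coefficients
    n_obs = len(y)
    r2 = 1 - ssr / sst
    adj = 1 - (ssr / (n_obs - k - 1)) / (sst / (n_obs - 1))
    return r2, adj

ones = np.ones(n)
r2_a, adj_a = fit_stats(np.column_stack([ones, x1]), y)
r2_b, adj_b = fit_stats(np.column_stack([ones, x1, noise]), y)

print(r2_b >= r2_a)   # always True: adding a regressor cannot raise SSR
print(adj_b - adj_a)  # can be negative when the new variable adds little
```

Whether adjusted R² actually falls in any one draw depends on how much the noise regressor happens to reduce SSR, but R² can never decrease.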
Never compare R-squared values between models that use different dependent variables (e.g., Y vs log(Y)). R-squared measures the proportion of variance explained for a specific dependent variable. Models with different left-hand-side variables have different SST values, making R-squared comparisons meaningless.
The Gauss-Markov Assumptions (MLR.1–MLR.5)
The theoretical foundation of OLS rests on five assumptions. When all five hold, OLS has a remarkable property: it is the most precise linear unbiased estimator available.
| Assumption | Name | Statement | What It Means in Practice |
|---|---|---|---|
| MLR.1 | Linear in Parameters | Y = β0 + β1X1 + … + βkXk + u | The model is linear in the β coefficients. The X variables can be nonlinear transformations (log, squared) as long as the parameters enter linearly. |
| MLR.2 | Random Sampling | The data are a random sample from the population | Each observation is independently drawn. Violated with time series data, clustered samples, or selection on the dependent variable. |
| MLR.3 | No Perfect Multicollinearity | No independent variable is an exact linear function of the others | Regressors can be correlated but not perfectly so. Including all category dummies plus an intercept (the dummy variable trap) violates this assumption. |
| MLR.4 | Zero Conditional Mean | E(u \| X1, …, Xk) = 0 | Unobserved factors are unrelated on average to all included regressors. This is the critical assumption for unbiasedness. Violated when relevant variables are omitted. |
| MLR.5 | Homoskedasticity | Var(u \| X1, …, Xk) = σ² | The error variance is constant across all values of X. Violated when volatility depends on firm size, return magnitude, or other observable characteristics. |
Under assumptions MLR.1 through MLR.4 alone, OLS estimators are unbiased — on average, they hit the true parameter values. Adding MLR.5 (homoskedasticity) delivers an even stronger result:
Under assumptions MLR.1 through MLR.5, OLS is BLUE — the Best Linear Unbiased Estimator. No other estimator that is both linear in Y and unbiased has smaller variance than OLS. In practical terms, OLS gives you the most precise coefficient estimates possible among all linear unbiased methods.
Violations of MLR.4 (omitted variable bias) and MLR.5 (heteroskedasticity) are common in financial data. For exact finite-sample inference (t-tests and confidence intervals in small samples), a sixth assumption — normality of the error term — is added; in large samples, the central limit theorem provides approximate validity without normality. See Hypothesis Testing in Regression for details.
Finance Example: Estimating Factor Exposures
Multiple regression is the engine behind factor models in finance. The Fama-French three-factor model regresses a stock’s excess return on three factors: the market risk premium (Mkt-RF), a size factor (SMB), and a value factor (HML).
Regression equation: RAAPL − Rf = α + βM(RM − Rf) + βS(SMB) + βV(HML) + ε
| Variable | Coefficient | Std. Error | Interpretation |
|---|---|---|---|
| Intercept (α) | 0.35% / month | 0.28% | Monthly excess return unexplained by the three factors |
| Mkt-RF (βM) | 1.20 | 0.07 | Apple amplifies market moves by about 20% |
| SMB (βS) | −0.25 | 0.10 | Behaves like a large-cap stock (negative size exposure) |
| HML (βV) | −0.48 | 0.12 | Behaves like a growth stock (negative value exposure) |
Three-factor model: R² = 0.58, Adjusted R² = 0.56 (n = 60)
Single-factor model (market only): R² = 0.42, βM = 1.28
Adding the size and value factors improved the R² from 0.42 to 0.58 — the three factors explain 58% of Apple’s monthly return variation compared to just 42% with the market alone. Notice that the market beta dropped from 1.28 to 1.20. The simple regression overstated market sensitivity because it was partially capturing size and value effects — a textbook demonstration of the partialling out interpretation.
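The beta shift described above can be reproduced on simulated data. Everything below is synthetic and illustrative, not actual AAPL or Fama-French factor returns; the factor loadings are borrowed from the table above and the factor correlations are assumed:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 600                                   # long simulated sample for clean estimates
mkt = rng.normal(0.6, 4.0, size=n)        # market excess return, pct/month
smb = 0.5 * mkt + rng.normal(0, 2.0, size=n)    # assumed correlated with market
hml = -0.4 * mkt + rng.normal(0, 2.0, size=n)
r = 0.35 + 1.20 * mkt - 0.25 * smb - 0.48 * hml + rng.normal(0, 3.0, size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
b_single = ols(np.column_stack([ones, mkt]), r)           # market only
b_three = ols(np.column_stack([ones, mkt, smb, hml]), r)  # three factors

# The single-factor beta absorbs part of the correlated SMB/HML effects;
# the three-factor beta recovers the assumed true market loading of 1.20
print(b_single[1], b_three[1])
```

With these assumed factor correlations, the single-factor beta drifts away from 1.20 while the three-factor regression recovers it, mirroring the 1.28-versus-1.20 pattern in the text.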
For a contrasting profile, consider a regional bank stock. Banks typically exhibit positive SMB and HML loadings, reflecting smaller capitalization and value-oriented characteristics — the opposite of Apple’s negative factor exposures. The sign pattern alone reveals fundamental differences in how these stocks respond to size and value risk. Use our OLS Regression Calculator to estimate factor models for any stock.
Omitted Variable Bias Preview
What happens when a relevant variable is left out of the regression? The OLS estimates of the included variables become biased. Specifically, if the true model includes X2 but you omit it:
E(β̂1) = β1 + β2 × δ1
where δ1 is the population slope from the regression of the omitted variable X2 on the included variable X1. Two conditions must both hold for bias to occur: (1) the omitted variable affects Y (β2 ≠ 0), and (2) the omitted variable is correlated with an included regressor (δ1 ≠ 0). If either condition fails, there is no bias.
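The bias formula can be verified on simulated data. This sketch assumes β1 = 2, β2 = 3, and a built-in correlation between x1 and x2:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000                          # large sample, so estimates ~ expectations
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)   # omitted variable correlated with x1
beta1, beta2 = 2.0, 3.0
y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

def slope(x, target):
    # simple-regression slope (with intercept) of target on x
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, target, rcond=None)[0][1]

b1_short = slope(x1, y)              # regression that omits x2: biased
delta1 = slope(x1, x2)               # slope of omitted x2 on included x1
print(b1_short, beta1 + beta2 * delta1)   # the two nearly match
```

Here δ1 ≈ 0.4, so the short regression converges to roughly 2 + 3 × 0.4 = 3.2 rather than the true β1 = 2, exactly as the formula predicts.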
The Apple example above is consistent with this prediction. The single-factor regression estimated βM = 1.28, but adding SMB and HML reduced it to 1.20. This coefficient shift is consistent with the direction predicted by omitted variable bias — because market returns are correlated with SMB and HML, the single-factor estimate was partly capturing size and value effects that belong to the omitted factors.
Omitted variable bias is the single most common threat to the validity of regression estimates. However, adding more variables helps only if those variables genuinely affect Y and are correlated with the included regressors. Adding irrelevant variables does not create bias but wastes degrees of freedom and reduces precision.
For the full treatment of omitted variable bias — including the direction of bias, proxy variables, and solutions — see our dedicated article on Omitted Variable Bias.
How to Calculate Multiple Regression
Estimating a multiple regression model follows a systematic process grounded in economic theory:
- Define the dependent variable — What outcome are you trying to explain? (e.g., monthly excess stock returns, quarterly firm profitability)
- Select independent variables based on theory — Each regressor should have an economic rationale for inclusion. Avoid adding variables simply because they are available.
- Collect sufficient data — A common rule of thumb is at least 10 observations per independent variable. In finance, 60 monthly observations is standard for factor model estimation.
- Estimate using OLS — OLS is available in any statistical or econometric software. The method minimizes the sum of squared residuals simultaneously across all coefficients.
- Examine R² and adjusted R² — Use R² for overall fit and adjusted R² to compare models with different numbers of regressors.
- Check coefficient signs and magnitudes — Do the estimates align with economic intuition? Unexpected signs may indicate omitted variables or multicollinearity.
- Test for assumption violations — Conduct hypothesis tests on individual coefficients (t-tests) and joint significance (F-tests). Check for heteroskedasticity and multicollinearity.
Common Mistakes
Even experienced analysts make these errors when working with multiple regression:
1. Kitchen Sink Regression — Throwing every available variable into the model without economic justification. R-squared will always increase, but adjusted R-squared may fall, standard errors can inflate due to multicollinearity, and the model becomes harder to interpret. Each variable should have a theoretical reason for inclusion.
2. Confusing Statistical Significance with Practical Significance — A coefficient can be statistically significant (small p-value) but economically trivial. Always examine the magnitude of the estimated effect, not just whether it is different from zero. A market beta of 0.002 with a p-value of 0.01 is statistically significant but economically meaningless. See Hypothesis Testing in Regression for more on significance testing.
3. Ignoring Multicollinearity — When independent variables are highly correlated, individual coefficient estimates become imprecise (large standard errors) even though the overall model may fit well. Multicollinearity does not bias OLS, but it makes isolating individual effects difficult. Monitor the variance inflation factor (VIF) — values above 10 warrant concern.
4. Comparing R-Squared Across Different Dependent Variables — An R² of 0.60 from a regression of stock returns cannot be compared to an R² of 0.85 from a regression of log returns. The dependent variable must be identical for R-squared comparisons to be valid.
Limitations of Multiple Regression
While multiple regression is the most widely used tool in empirical finance, it has important boundaries:
Multiple regression estimates associations, not causal effects, by default. Unbiased causal interpretation requires that the zero conditional mean assumption (MLR.4) holds — meaning no omitted variables correlated with the included regressors affect Y. In observational financial data, this is difficult to guarantee.
1. Linearity in Parameters — The model assumes a linear relationship between Y and the β parameters. If the true relationship is nonlinear, functional form transformations (logs, quadratics, interactions) are needed. See Regression Functional Forms for techniques.
2. Sensitivity to Outliers — OLS minimizes squared residuals, which means extreme observations receive disproportionate weight. A single unusual month (e.g., the March 2020 COVID crash) can substantially shift coefficient estimates.
3. Large Sample Requirements — With many regressors relative to observations, estimates become imprecise and adjusted R-squared can be low or even negative. Financial datasets often have limited time series length, constraining the number of variables that can be reliably estimated.
4. R-Squared Can Be Misleading — A high R-squared does not mean the model is correctly specified or that the coefficients are unbiased. Omitted variable bias can exist even when R-squared is high. Conversely, a low R-squared does not invalidate the model — individual coefficients can still be unbiased and economically meaningful.
Disclaimer
This article is for educational and informational purposes only and does not constitute investment advice. Regression results and factor loadings cited are approximate and may differ based on the data source, time period, and methodology used. Always conduct your own analysis and consult a qualified financial advisor before making investment decisions. Reference: Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach, 8th Edition, Cengage, 2025.