The Overfitting Trap That Wasn't
Every quantitative finance textbook teaches the same lesson: keep your models simple. Add too many parameters and your model memorizes noise instead of learning signal. The bias-variance tradeoff, drilled into every statistics student, says that beyond some optimal point, additional complexity hurts out-of-sample performance. For decades, this principle guided how practitioners built return prediction models — trim the variable list, penalize large coefficients, prefer parsimony over power.
Then something strange happened. In machine learning research outside of finance, practitioners discovered that absurdly large models — with millions or billions of parameters, far exceeding the number of training observations — generalized better than their smaller counterparts. GPT-style language models, deep image classifiers, and protein folding networks all defied the classical tradeoff. The phenomenon was dubbed "benign overfitting," and it upended the theoretical foundations of statistical learning.
Bryan Kelly, Semyon Malamud, and Kangying Zhou brought this insight to asset pricing. Their 2024 paper in the Journal of Finance, "The Virtue of Complexity in Return Prediction," provides both a theoretical framework and comprehensive empirical evidence that overparameterized models outperform parsimonious ones at predicting stock returns (Kelly, Malamud & Zhou, 2024). The implications for how we build and evaluate quantitative strategies are profound.
Why Complexity Helps: The Theory
The paper's theoretical contribution resolves a puzzle that has haunted quantitative finance: if stock returns are hard to predict (low R-squared), why would adding more parameters to a model improve predictions rather than degrading them?
The answer lies in the structure of the signal environment. Stock returns are influenced by hundreds of characteristics — size, value, momentum, profitability, investment, liquidity, volatility, accruals, and many more. Each individual predictor carries a tiny amount of information. The signal is real but dispersed across many dimensions, each contributing a small increment of predictive power.
In this setting, a parsimonious model faces a dilemma. If it selects a small subset of predictors (as LASSO or stepwise regression does), it discards the weak signals in the excluded variables. If it includes all predictors with equal weight, the noise from irrelevant variables overwhelms the weak signals. Either way, the model underperforms.
An overparameterized model resolves this dilemma through a mechanism the authors call "implicit shrinkage." When a model has more parameters than observations, there are infinitely many parameter vectors that fit the training data perfectly. The minimum-norm solution — the one that gradient descent naturally finds — spreads the weights across all parameters, effectively performing a form of ridge regularization without any explicit penalty term. This implicit shrinkage prevents any single predictor from dominating and allows the model to aggregate weak signals across all available dimensions.
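A minimal numpy sketch can make this mechanism concrete (the dimensions and signal strengths below are illustrative, not the paper's specification). With more parameters than observations, the pseudoinverse returns the minimum-norm interpolating solution, which coincides with ridge regression in the limit of a vanishing penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 500                             # far more parameters than observations
X = rng.standard_normal((n, p))
beta_true = 0.05 * rng.standard_normal(p)  # many weak signals
y = X @ beta_true + rng.standard_normal(n)

# Of the infinitely many coefficient vectors that fit the training data
# exactly, the pseudoinverse returns the minimum-norm one -- the same
# solution gradient descent reaches when started from zero.
beta_min_norm = np.linalg.pinv(X) @ y
assert np.allclose(X @ beta_min_norm, y)   # perfect in-sample fit

# The minimum-norm solution equals ridge regression as the penalty -> 0+,
# which is the sense in which overparameterization "implicitly shrinks".
lam = 1e-6
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
assert np.allclose(beta_min_norm, beta_ridge, atol=1e-5)
```

Because the weight is spread thinly across all 500 predictors, no single coefficient dominates, which is exactly the aggregation-of-weak-signals behavior the theory describes.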
The mathematical result is striking: as the number of parameters grows relative to the number of observations (the overparameterization ratio), out-of-sample prediction error first falls, then rises as the model approaches the interpolation threshold (the classical overfitting zone), then falls again beyond it. This is the "double descent" curve that has been documented in deep learning. Kelly, Malamud, and Zhou prove it applies to return prediction under realistic conditions.
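The shape is easy to reproduce on synthetic data. The sketch below (an illustrative toy setup, not the paper's model) fits minimum-norm least squares on random ReLU features of varying width and tracks out-of-sample error across the interpolation threshold at p ≈ n:

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, d = 100, 2000, 30

# Latent nonlinear signal observed under heavy noise, loosely mimicking
# a low-R-squared prediction problem (purely illustrative).
def signal(Z):
    return np.sin(Z[:, 0]) + 0.5 * Z[:, 1] * Z[:, 2]

Z_tr = rng.standard_normal((n_train, d))
Z_te = rng.standard_normal((n_test, d))
y_tr = signal(Z_tr) + rng.standard_normal(n_train)
y_te = signal(Z_te)

def oos_error(p):
    """Min-norm least squares on p random ReLU features; returns test MSE."""
    W = rng.standard_normal((d, p)) / np.sqrt(d)
    X_tr = np.maximum(Z_tr @ W, 0.0)
    X_te = np.maximum(Z_te @ W, 0.0)
    beta = np.linalg.pinv(X_tr) @ y_tr
    return np.mean((X_te @ beta - y_te) ** 2)

errs = {p: oos_error(p) for p in (10, 50, 100, 400, 2000)}
# Test error typically spikes near the interpolation threshold
# (p = n_train = 100) and descends again as p grows far beyond n.
```

On runs like this, the error at p >> n falls below the error at the interpolation threshold, which is the second descent of the curve.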
The Empirical Horse Race
The paper tests this theory comprehensively. Using the same dataset as Gu, Kelly, and Xiu (2020) — monthly returns for the entire CRSP universe from 1957 to 2021, with more than 900 predictor variables built from firm characteristics — the authors systematically vary model complexity and measure out-of-sample performance.
The results align precisely with the theory. Models with fewer parameters than observations (the underparameterized regime) show the expected pattern: performance improves up to a point, then overfitting kicks in. But once complexity crosses the interpolation threshold — where the model has enough parameters to perfectly fit the training data — performance begins improving again. The most complex models, with tens of thousands of parameters, produce the best out-of-sample R-squared values.
The economic magnitudes are substantial. A long-short portfolio sorted on neural network return predictions generates monthly alphas that increase monotonically with model complexity. The most complex models produce annualized Sharpe ratios exceeding 2.0, significantly outperforming the parsimonious alternatives favored by traditional econometrics.
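For reference, here is how a figure like that is typically computed: a generic monthly decile long-short backtest with an annualized Sharpe ratio. The function, the equal-weighted decile construction, and the synthetic data are illustrative assumptions, not the paper's exact portfolio methodology.

```python
import numpy as np

def long_short_sharpe(pred, realized):
    """Each month, go long the top decile of predicted returns and short
    the bottom decile (equal-weighted); annualize the monthly Sharpe
    ratio by sqrt(12). `pred` and `realized` are (months, stocks) arrays."""
    spreads = []
    for p_t, r_t in zip(pred, realized):
        order = np.argsort(p_t)
        k = max(1, len(p_t) // 10)
        spreads.append(r_t[order[-k:]].mean() - r_t[order[:k]].mean())
    spreads = np.asarray(spreads)
    return np.sqrt(12.0) * spreads.mean() / spreads.std(ddof=1)

# Toy data: forecasts weakly correlated with noisy realized returns.
rng = np.random.default_rng(2)
T, N = 120, 500                                     # 10 years, 500 stocks
mu = 0.01 * rng.standard_normal((T, N))             # true expected returns
realized = mu + 0.10 * rng.standard_normal((T, N))  # signal buried in noise
pred = mu + 0.02 * rng.standard_normal((T, N))      # imperfect forecasts
sharpe = long_short_sharpe(pred, realized)          # positive on this toy data
```

Even with a cross-sectional R-squared of well under one percent per stock, the sort aggregates the weak forecasts into an economically meaningful spread, which is the same arithmetic behind the paper's portfolio results.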
What the Complex Models See
The paper goes beyond the horse race to investigate what complex models capture that simple ones miss. The analysis reveals three key sources of additional predictive power.
First, nonlinear interactions between predictors. Simple models treat each characteristic independently — momentum is momentum regardless of firm size or market conditions. Complex models discover that momentum's predictive power varies dramatically with volatility, liquidity, and the business cycle. These conditional relationships are invisible to linear models but carry substantial return-predictive content.
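A toy illustration of such a conditional effect (the data-generating process below is invented for illustration): when momentum's effect flips sign with volatility, a purely linear model captures only part of the signal, while adding the interaction term recovers the rest.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
momentum = rng.standard_normal(n)
volatility = rng.standard_normal(n)
# Invented DGP: momentum predicts returns, but its strength (and sign)
# depends on volatility -- a conditional effect, buried in noise.
ret = 0.05 * momentum * (1.0 - volatility) + 0.5 * rng.standard_normal(n)

def in_sample_r2(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 - np.var(y - X @ beta) / np.var(y)

ones = np.ones(n)
X_linear = np.column_stack([ones, momentum, volatility])
X_interact = np.column_stack([ones, momentum, volatility,
                              momentum * volatility])

r2_lin = in_sample_r2(X_linear, ret)
r2_int = in_sample_r2(X_interact, ret)
# The interaction term roughly doubles the (small) explained variance.
```

Both R-squared values are tiny, as they are in real return data, but the gap between them is the kind of conditional predictive content that only models with interaction capacity can reach.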
Second, time-varying factor exposures. The relationship between firm characteristics and expected returns changes across market regimes. Value works differently in recessions than expansions. Low-volatility stocks behave differently in rising versus falling interest rate environments. Complex models with sufficient capacity can learn these regime-dependent relationships from the data.
Third, tail behavior and extreme events. Complex models better capture the nonlinear dynamics around market stress periods. The paper documents that overparameterized models are particularly effective at predicting returns during high-volatility periods, precisely when accurate forecasts are most valuable, and that they reduce risk exposure ahead of market downturns more reliably than the parsimonious alternatives in the study.
The Double Descent Curve
The paper's most visually striking result is the double descent curve for out-of-sample R-squared plotted against model complexity. The curve shows:
| Complexity Region | Behavior | Performance |
|---|---|---|
| Underparameterized (p < n) | Classical bias-variance tradeoff | Moderate, peaks then declines |
| Interpolation threshold (p ≈ n) | Model perfectly fits training data | Worst performance (overfitting peak) |
| Overparameterized (p >> n) | Benign overfitting, implicit shrinkage | Best performance, improves with complexity |
This pattern, with a second descent beyond the interpolation peak, explains why practitioners who stopped adding complexity at the classical overfitting point were leaving predictive power on the table. The key insight is that you must push through the overfitting zone to reach the benign overfitting regime on the other side.
Connection to Gu, Kelly, and Xiu (2020)
This paper is a natural sequel to the landmark ML horse race study. Where Gu, Kelly, and Xiu (2020) demonstrated empirically that neural networks outperform linear models at return prediction, the present paper explains why. The earlier study showed the what; this study provides the theoretical mechanism.
The connection also resolves a tension in the earlier work. Gu, Kelly, and Xiu found that the three-layer neural network (NN3) outperformed the five-layer network (NN5), which seemed to suggest diminishing returns to depth. Kelly, Malamud, and Zhou reinterpret this finding: the relevant measure of complexity is not depth alone but the total number of parameters. When complexity is measured correctly — as the overparameterization ratio — more is consistently better.
This also connects to the ongoing debate about the "factor zoo." With over 400 published anomalies, many researchers have argued for aggressive pruning — reduce the predictor set to a handful of robust factors. The virtue of complexity result pushes back: rather than selecting a few strong predictors and discarding the rest, it may be better to include everything and let the model's implicit regularization sort out the weights. The weak signals in the discarded variables contain genuine predictive information that aggregates into meaningful economic value.
Limitations and Caveats
The paper's conclusions come with important qualifications that practitioners should weigh carefully.
Transaction costs are the most significant practical concern. The complex models generate alpha primarily in small, illiquid stocks where trading costs are highest. After realistic cost adjustments, the advantage of the most complex models narrows — though it does not disappear. For institutional investors managing large portfolios, the net-of-cost benefit depends critically on execution quality and portfolio turnover constraints.
The theoretical framework assumes a specific signal structure: many weak predictors with independent noise. If the true signal is concentrated in a few strong predictors (as it might be in some alternative asset classes), the virtue of complexity may not hold. The paper demonstrates the result for US equities, where the dispersed-signal assumption is well supported, but generalization to other markets requires further validation.
Model interpretability remains a challenge. An overparameterized neural network that outperforms in out-of-sample tests is difficult to explain to investors, risk managers, and regulators. The paper provides theoretical justification for why complex models work, but it does not resolve the practical tension between predictive power and interpretability.
Finally, the result says nothing about whether these patterns will persist. If the alpha from complex models is driven by behavioral biases or institutional frictions, it may diminish as more capital pursues similar strategies. If it reflects genuine risk compensation for bearing complexity risk, it may persist but with significant drawdowns during model-unfriendly regimes.
Implications for Model Building
The practical takeaway is nuanced. The paper does not argue that complexity is always better — it argues that the classical bias-variance tradeoff, which says complexity is always harmful beyond some point, is wrong for the specific structure of cross-sectional return prediction.
For practitioners building ML-based equity strategies, the implications are:
Do not rely on variable selection as your primary regularization strategy. Including more predictors, even weak ones, can improve out-of-sample performance if the model has sufficient capacity.
Use implicit regularization through overparameterization (early stopping, minimum-norm solutions) rather than explicit regularization that forces sparsity (LASSO, subset selection). The former preserves weak signals; the latter discards them.
Evaluate model performance across the full complexity spectrum, not just at the parsimonious end. The optimal model may be far more complex than traditional practices suggest.
Always validate with strict out-of-sample testing. Benign overfitting is a theoretical property that holds under specific conditions; it is not a license to skip validation.
The paper marks a significant shift in how quantitative finance thinks about model complexity. For decades, simplicity was treated as a virtue in its own right. Kelly, Malamud, and Zhou demonstrate that in the specific context of return prediction — where signals are weak, dispersed, and numerous — complexity is the virtue.
This article is for educational purposes only and does not constitute financial advice. Past performance does not guarantee future results.
This analysis was synthesized from Kelly, Malamud & Zhou (2024), The Journal of Finance, by the QD Research Engine, Quant Decoded's automated research platform, and reviewed by our editorial team for accuracy.
References
- Kelly, B., Malamud, S., & Zhou, K. (2024). The Virtue of Complexity in Return Prediction. The Journal of Finance, 79(1), 459-503. https://doi.org/10.1111/jofi.13298
- Gu, S., Kelly, B., & Xiu, D. (2020). Empirical Asset Pricing via Machine Learning. The Review of Financial Studies, 33(5), 2223-2273. https://doi.org/10.1093/rfs/hhaa009
- Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off. Proceedings of the National Academy of Sciences, 116(32), 15849-15854. https://doi.org/10.1073/pnas.1903070116