QD Research Engine · AI-Synthesised

Machine Learning in Asset Pricing: What Actually Works

2026-03-11 · 14 min

A landmark study tests every major ML method on 30,000+ US stocks over 60 years. Neural networks dominate, achieving Sharpe ratios above 1.8 by capturing nonlinear factor interactions that linear models miss entirely.

Machine Learning · Asset Pricing · Random Forests · Neural Networks · Factor Zoo
Source: Gu, Kelly & Xiu (2020), Review of Financial Studies

Practical Application for Retail Investors

While retail investors cannot replicate neural network-based trading strategies, the findings explain why simple factor ETFs often underperform their backtests. The real return premium comes from dynamic, conditional factor exposures that adjust to market conditions. When evaluating quantitative funds or smart beta products, look for strategies that adapt to changing market regimes rather than applying fixed rules.

Editor’s Note

As AI reshapes both equity markets and the tools used to analyze them, Gu, Kelly, and Xiu's paper stands as the definitive benchmark for machine learning in asset pricing. With factor crowding a growing concern and linear factor models increasingly questioned, understanding when and why nonlinear methods outperform is essential for any investor evaluating quantitative strategies.

30,000 Stocks, 900 Predictors, One Question

Between 1957 and 2016, a neural network trained on macroeconomic indicators and firm characteristics generated an annualized out-of-sample R-squared of 0.40% for individual US stock returns. That number sounds small until you realize that, in a universe of 30,000 stocks, even a tiny edge in return prediction translates to economically significant portfolio gains. The long-short portfolio formed on these neural network predictions earned a Sharpe ratio above 1.8, more than double what the best linear models could achieve over the same period.
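
To make the portfolio arithmetic concrete, here is a minimal Python sketch (on simulated data, not the paper's dataset) of how a prediction-sorted long-short portfolio is formed each month and its annualized Sharpe ratio computed. The signal strength, universe size, and decile scheme are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def long_short_sharpe(preds, rets, n_deciles=10):
    # For each month, go long the equal-weighted top decile of stocks by
    # predicted return and short the bottom decile, then annualize the
    # mean/volatility ratio of the resulting monthly return series.
    monthly = []
    for p, r in zip(preds, rets):
        lo, hi = np.quantile(p, [1 / n_deciles, 1 - 1 / n_deciles])
        monthly.append(r[p >= hi].mean() - r[p <= lo].mean())
    monthly = np.asarray(monthly)
    return monthly.mean() / monthly.std() * np.sqrt(12)

# Toy data: a deliberately weak signal buried in noise, as in return prediction.
n_months, n_stocks = 240, 2000
signal = rng.standard_normal((n_months, n_stocks))
rets = 0.03 * signal + rng.standard_normal((n_months, n_stocks))
sharpe = long_short_sharpe(signal, rets)
```

Even a cross-sectional correlation of a few percent between predictions and returns, diversified across thousands of stocks and hundreds of months, produces a large Sharpe ratio, which is why a 0.40% out-of-sample R-squared can be economically meaningful.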

These are the headline findings of what has become one of the most cited papers in modern empirical finance: "Empirical Asset Pricing via Machine Learning" by Shihao Gu, Bryan Kelly, and Dacheng Xiu, published in the Review of Financial Studies in 2020 (Gu, Kelly & Xiu, 2020). The paper is a systematic horse race between every major machine learning method applied to the fundamental problem of finance, predicting stock returns, and its conclusions challenge both the efficient-markets camp and the factor zoo skeptics.

The Problem: Too Many Factors, Not Enough Signal

Asset pricing has a data problem. Over the past three decades, researchers have proposed hundreds of variables that supposedly predict stock returns. Book-to-market, momentum, profitability, investment, accruals, share issuance, idiosyncratic volatility: the list now exceeds 400 published anomalies. Harvey, Liu, and Zhu famously documented this explosion in their 2016 paper, arguing that most of these "discoveries" are statistical noise amplified by data mining (Harvey, Liu & Zhu, 2016).

The traditional approach to this problem is linear. Pick a handful of factors, run a regression, check the t-statistics. The Fama-French five-factor model uses five variables. Even the most ambitious linear models rarely use more than a few dozen. The reason is simple: linear regression cannot handle hundreds of correlated predictors without severe overfitting. Adding more variables to a linear model eventually makes predictions worse, not better.
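
A small simulation makes the overfitting mechanism concrete: with only three genuinely predictive variables, an OLS fit on 150 predictors and 200 observations produces a deeply negative out-of-sample R-squared (measured here against a naive zero forecast). All sizes and coefficients below are toy assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Only the first 3 of 150 predictors carry signal.
n_train, n_test, n_pred = 200, 1000, 150
beta = np.zeros(n_pred)
beta[:3] = 0.5
X_train = rng.standard_normal((n_train, n_pred))
X_test = rng.standard_normal((n_test, n_pred))
y_train = X_train @ beta + rng.normal(scale=2.0, size=n_train)
y_test = X_test @ beta + rng.normal(scale=2.0, size=n_test)

def oos_r2(k):
    # Fit OLS on the first k predictors; score out of sample against
    # a naive forecast of zero for every observation.
    coef, *_ = np.linalg.lstsq(X_train[:, :k], y_train, rcond=None)
    resid = y_test - X_test[:, :k] @ coef
    return 1.0 - (resid @ resid) / (y_test @ y_test)

r2_small, r2_big = oos_r2(3), oos_r2(n_pred)  # positive vs. strongly negative
```

With 150 parameters estimated from 200 observations, the in-sample fit is nearly perfect, but estimation noise swamps the signal out of sample, exactly the failure mode that regularization is designed to prevent.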

Machine learning changes this calculus. Methods like random forests, gradient-boosted trees, and neural networks are specifically designed to extract signal from high-dimensional, noisy data. They can capture nonlinear relationships and interactions between variables that linear models miss entirely. The question Gu, Kelly, and Xiu asked was whether these methods, applied to the full universe of proposed stock predictors, actually improve return forecasts.

The Horse Race

The paper tests a comprehensive battery of methods, all trained on the same data and evaluated under identical out-of-sample conditions. The methods range from traditional econometric approaches to state-of-the-art machine learning:

| Method | Out-of-Sample R² | Annualized Sharpe (L/S) |
| --- | --- | --- |
| OLS (all predictors) | -1.01% | 0.60 |
| OLS (3 predictors) | 0.16% | 0.89 |
| Elastic Net | 0.21% | 1.12 |
| Random Forest | 0.23% | 1.35 |
| Gradient-Boosted Trees | 0.34% | 1.51 |
| Neural Network (NN3) | 0.40% | 1.80 |
| Neural Network (NN5) | 0.36% | 1.71 |

Several patterns emerge from these results.

First, OLS with all predictors is a disaster. The negative out-of-sample R-squared means the forecasts are worse than the paper's naive benchmark, a forecast of zero excess return for every stock (the authors argue the historical mean return is itself too noisy to be a useful forecast). This confirms the standard intuition that linear models overfit in high dimensions.

Second, regularization helps enormously. Elastic net, which is linear regression with penalty terms that shrink coefficients and select variables, turns a negative R-squared into a positive one. But the improvement plateaus quickly because elastic net is still fundamentally linear.
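
The following sketch uses scikit-learn's ElasticNet on simulated correlated predictors; the penalty strength (alpha) and L1/L2 mix (l1_ratio) are illustrative choices, not the paper's tuned hyperparameters.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)

# 100 predictors in 5 highly correlated blocks; only two columns carry signal.
n, p = 500, 100
blocks = rng.standard_normal((n, 5))
X = np.repeat(blocks, 20, axis=1) + 0.5 * rng.standard_normal((n, p))
y = X[:, 0] - X[:, 20] + rng.normal(scale=3.0, size=n)

# l1_ratio blends the lasso penalty (variable selection) with the
# ridge penalty (coefficient shrinkage).
model = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
n_selected = int((model.coef_ != 0).sum())  # fewer than 100 survive
```

The lasso component zeroes out a chunk of the irrelevant columns while the ridge component spreads shrunken weight across correlated predictors, which is exactly why elastic net turns a negative out-of-sample R-squared into a positive one.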

Third, tree-based methods outperform linear methods. Random forests and gradient-boosted trees capture nonlinear relationships between predictors and returns, pushing R-squared higher and Sharpe ratios above 1.3.
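
A toy demonstration of why trees capture interactions: the target below is a pure product of two predictors (think momentum times volatility), which contains zero linear signal. Data and hyperparameters are illustrative, with scikit-learn standing in for the paper's implementations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

n = 4000
X = rng.standard_normal((n, 2))
y = X[:, 0] * X[:, 1] + 0.5 * rng.standard_normal(n)  # pure interaction
X_tr, X_te, y_tr, y_te = X[:3000], X[3000:], y[:3000], y[3000:]

linear = LinearRegression().fit(X_tr, y_tr)
boosted = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                    random_state=0).fit(X_tr, y_tr)

r2_linear = linear.score(X_te, y_te)    # near zero: no linear signal exists
r2_boosted = boosted.score(X_te, y_te)  # substantially positive
```

Depth-3 trees can approximate the product surface by partitioning the two-variable space, while the best linear fit is flat by construction.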

Fourth, neural networks win. The three-layer neural network (NN3) achieves the highest out-of-sample R-squared and the highest Sharpe ratio. The five-layer network (NN5) is slightly worse, suggesting diminishing returns to depth in this application.
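
The paper's NN3 is a feed-forward network with a geometric pyramid of hidden layers (32, 16, and 8 units). The sketch below mimics that architecture with scikit-learn's MLPRegressor on simulated data; apart from the layer sizes, everything here (target, penalty, optimizer settings) is an illustrative assumption, and the paper's actual training pipeline is more elaborate, with batch normalization, early stopping on a validation set, and ensembling over random seeds.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)

# Nonlinear target with several irrelevant inputs mixed in.
n = 4000
X = rng.standard_normal((n, 10))
y = np.tanh(X[:, 0] * X[:, 1]) + 0.5 * X[:, 2] + 0.3 * rng.standard_normal(n)
X_tr, X_te, y_tr, y_te = X[:3000], X[3000:], y[:3000], y[3000:]

# Pyramid architecture matching the NN3 layer sizes described in the paper.
nn3 = MLPRegressor(hidden_layer_sizes=(32, 16, 8), activation="relu",
                   alpha=1e-3, max_iter=1000, random_state=0)
nn3.fit(X_tr, y_tr)
r2 = nn3.score(X_te, y_te)  # out-of-sample R^2 on the held-out 1000 rows
```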

What the Neural Network Finds

The paper's most illuminating contribution is not just the horse race results but the analysis of what the winning models actually learn. Using a technique called variable importance analysis, the authors decompose each model's predictions to identify which inputs drive the forecasts.
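
Variable importance can be computed several ways; the paper's measure is the drop in R-squared when a predictor is zeroed out, and a close, widely used cousin is permutation importance, sketched here by hand on a toy model (data and model are illustrative assumptions, not the paper's).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)

n = 3000
X = rng.standard_normal((n, 5))
y = 2.0 * X[:, 0] + X[:, 1] * X[:, 2] + 0.5 * rng.standard_normal(n)
model = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                  random_state=0).fit(X, y)

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    # Average drop in R^2 when each column is shuffled in turn,
    # which breaks that column's link to the target.
    rng = np.random.default_rng(seed)
    base = model.score(X, y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            drops[j] += (base - model.score(Xp, y)) / n_repeats
    return drops

imp = permutation_importance(model, X, y)  # column 0 dominates
```

Note that the interaction pair (columns 1 and 2) shows up with meaningful importance even though neither has a linear effect on its own, which is the property that makes this diagnostic useful for nonlinear models.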

The dominant predictor across all nonlinear models is momentum, but not simple 12-month momentum. The neural network identifies complex interactions between short-term reversal (1-month returns), medium-term momentum (2-12 months), and long-term reversal (13-60 months) that vary with market conditions. In high-volatility environments, short-term reversal dominates. In calm markets, medium-term momentum takes over.

The second most important category is liquidity and trading activity. Variables like share turnover, bid-ask spread, and dollar trading volume interact with size and momentum in ways that linear models cannot capture. Small, illiquid stocks with strong momentum behave differently from large, liquid stocks with the same momentum signal.

The third key finding is the importance of macroeconomic interactions. The neural network learns that the predictive power of firm characteristics changes with the business cycle. Value stocks (high book-to-market) predict returns more strongly during recessions, while momentum works better during expansions. These time-varying relationships are invisible to standard linear models that estimate fixed coefficients.

The Taming of the Factor Zoo

A companion paper by Feng, Giglio, and Xiu provides additional theoretical grounding for why machine learning works in this setting (Feng, Giglio & Xiu, 2020). Their framework addresses a fundamental question: with 400+ proposed factors, how do you determine which ones genuinely capture risk and which are noise?

The traditional approach of testing factors one at a time against existing models is statistically flawed because it ignores the multiple testing problem. If you test 400 variables, roughly 20 will appear significant at the 5% level by pure chance.
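
That "roughly 20 by chance" arithmetic is easy to check by simulation: regress returns on 400 pure-noise factors one at a time and count how many clear the conventional |t| > 1.96 bar. All sample sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

# 600 months of returns with no true signal, 400 noise "factors".
n, k = 600, 400
F = rng.standard_normal((n, k))
r = rng.standard_normal(n)

# Univariate regression t-statistic for each candidate factor.
Fc = F - F.mean(axis=0)
rc = r - r.mean()
sxx = (Fc ** 2).sum(axis=0)
beta = Fc.T @ rc / sxx
resid_var = np.array([np.var(rc - b * f, ddof=2) for b, f in zip(beta, Fc.T)])
t_stats = beta / np.sqrt(resid_var / sxx)

n_false = int((np.abs(t_stats) > 1.96).sum())  # close to 5% of 400
```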

Feng, Giglio, and Xiu propose a double-selection procedure that uses machine learning (specifically LASSO) to simultaneously select the factors that matter while controlling for the others. Applied to 150+ published factors, they find that the vast majority are redundant. Only a handful of factors survive after properly accounting for multiple testing and correlations between factors. The factors that survive (market, size, value, momentum, profitability, and a small number of others) align closely with what the neural network in Gu, Kelly, and Xiu identifies as important.
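
A deliberately simplified toy version of the double-selection logic (the actual Feng-Giglio-Xiu procedure is more involved, and every name and number below is illustrative): a candidate factor that looks priced on its own is exposed as redundant once LASSO-selected controls enter the regression.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(6)

# 60 control factors; only the first two truly drive returns.
# The candidate g is a noisy copy of factor 0, i.e. redundant.
n, k = 600, 60
F = rng.standard_normal((n, k))
r = 0.8 * F[:, 0] + 0.5 * F[:, 1] + rng.standard_normal(n)
g = F[:, 0] + 0.5 * rng.standard_normal(n)

# Naive test: regress returns on the candidate alone -> it looks priced.
coef_naive = LinearRegression().fit(g.reshape(-1, 1), r).coef_[0]

# Double selection: lasso of returns on the controls, lasso of the
# candidate on the controls, then OLS with the union of selections.
s1 = Lasso(alpha=0.05).fit(F, r).coef_ != 0
s2 = Lasso(alpha=0.05).fit(F, g).coef_ != 0
controls = F[:, s1 | s2]
coef_g = LinearRegression().fit(np.column_stack([g, controls]), r).coef_[0]
# coef_naive is large; coef_g collapses toward zero
```

The second lasso step is what distinguishes double selection from a single variable screen: it guards against omitting a control that is weakly related to returns but strongly related to the candidate.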

This convergence is reassuring. The neural network is not discovering some exotic, uninterpretable signal. It is finding that the well-known factors interact in nonlinear ways that linear models miss.

What This Means in Practice

The practical implications differ substantially depending on who you are.

For institutional investors and hedge funds, the paper validates the shift toward machine learning in quantitative strategies. The out-of-sample gains are large enough to survive transaction costs for portfolios that can trade efficiently. Several systematic hedge funds now use neural network-based return prediction as a core signal, though the specific architectures and training procedures go well beyond what the paper describes.

For retail investors, the implications are more nuanced. You cannot replicate these strategies at home. The paper uses monthly rebalancing across 30,000 stocks, which requires institutional-scale execution infrastructure. The long-short portfolio also requires short selling, which is costly and sometimes impossible for retail accounts.

However, the findings have indirect implications for how retail investors think about factor investing. If the true return-generating process is nonlinear (if momentum works differently in volatile versus calm markets, if value depends on the business cycle), then simple, static factor exposures will capture only a fraction of the available premium. This helps explain why factor ETFs, which apply fixed rules to single characteristics, often underperform relative to their backtests. The real premium comes from dynamic, conditional factor exposures that machine learning methods can capture but fixed rules cannot.

Limitations and Open Questions

The paper's strengths are also its limitations. The 60-year sample period (1957-2016) covers multiple market regimes, which is good for generalization. But the most recent decade, characterized by near-zero interest rates, unprecedented central bank intervention, and the rise of passive investing, may represent a structural break. Models trained on 1957-2016 data may not perform as well in the post-pandemic environment.

Overfitting remains a concern despite the careful out-of-sample design. The authors use a rolling window approach where models are retrained periodically, but the choice of hyperparameters (network depth, regularization strength, learning rate) still involves some look-ahead bias. Kelly, Malamud, and Zhou (2024) address this concern in a subsequent paper, providing theoretical justification for why complex models can genuinely outperform in high-dimensional settings rather than merely overfitting (Kelly, Malamud & Zhou, 2024).
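
The retraining scheme can be sketched as a generator of (train, validation, test) index windows; the specific window lengths below are illustrative assumptions, not the paper's exact sample splits (which expand the training set annually and refit hyperparameters on the validation block).

```python
import numpy as np

def rolling_windows(n_months, train0=120, val=24, step=12):
    # Expanding training window, fixed-length validation window for
    # hyperparameter tuning, and the following `step` months held out
    # as the out-of-sample test period; advance one year at a time.
    t = train0
    while t + val + step <= n_months:
        yield (np.arange(0, t),
               np.arange(t, t + val),
               np.arange(t + val, t + val + step))
        t += step

splits = list(rolling_windows(240))  # 20 years of monthly data
```

The key discipline is that the test months always come strictly after the training and validation months, so no future information leaks into model fitting; look-ahead bias can still creep in through hyperparameter choices made with knowledge of the full sample.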

Transaction costs are acknowledged but not fully incorporated. The paper reports Sharpe ratios gross of costs, and the most profitable strategies involve heavy trading in small, illiquid stocks where implementation costs are highest. After realistic transaction cost adjustments, the advantage of neural networks over simpler methods narrows, though it does not disappear.

Finally, the paper focuses exclusively on US equities. Whether the same patterns hold in international markets, fixed income, or other asset classes remains an active research question. Early evidence from international studies is promising but not definitive.

The Bigger Picture

Gu, Kelly, and Xiu's paper marks a turning point in empirical asset pricing. It demonstrates that the choice of statistical method (linear versus nonlinear, simple versus complex) matters as much as the choice of predictors. For decades, asset pricing research focused on discovering new variables while using the same linear regression toolkit. This paper shows that the toolkit itself was the bottleneck.

The implications extend beyond return prediction. If stock returns are genuinely driven by nonlinear factor interactions, then our standard factor models (the three-factor, five-factor, and six-factor models that dominate academic and practitioner finance) are fundamentally misspecified. They capture the first-order effects but miss the higher-order interactions that machine learning methods exploit.

This does not mean factor models are useless. They remain valuable as conceptual frameworks and risk attribution tools. But as forecasting tools, they leave significant predictive power on the table. The gap between linear and nonlinear methods is the empirical evidence that markets are more complex than our standard models assume.

For anyone investing in quantitative strategies, whether through hedge funds, smart beta ETFs, or their own systematic approaches, this paper's central message is clear: the methods matter as much as the data, and the simplest model is not always the best model.

This article is for educational purposes only and does not constitute financial advice. Past performance does not guarantee future results.

This analysis was synthesised from Gu, Kelly & Xiu (2020), Review of Financial Studies by the QD Research Engine (Quant Decoded’s automated research platform) and reviewed by our editorial team for accuracy. Learn more about our methodology.

References

Feng, G., Giglio, S., & Xiu, D. (2020). "Taming the Factor Zoo: A Test of New Factors." Journal of Finance.

Gu, S., Kelly, B., & Xiu, D. (2020). "Empirical Asset Pricing via Machine Learning." Review of Financial Studies.

Harvey, C. R., Liu, Y., & Zhu, H. (2016). "… and the Cross-Section of Expected Returns." Review of Financial Studies.

Kelly, B., Malamud, S., & Zhou, K. (2024). "The Virtue of Complexity in Return Prediction." Journal of Finance.