Key Takeaway
Academic finance has catalogued over 400 variables that supposedly predict stock returns. Harvey, Liu, and Zhu called it a "factor zoo." But when independent researchers try to reproduce these findings, the results diverge sharply: some studies report that most anomalies fail, while others report that most replicate just fine. We systematically compare 12 major replication efforts and find that both camps are correct: they are simply answering different questions. Statistical replication (can we reproduce the t-statistic?) succeeds roughly 50-70% of the time. Economic replication (can anyone actually make money from this?) succeeds only 15-30% of the time. The gap between these two numbers is not a crisis; it is the single most important finding in modern empirical finance.
This article is a QD Research Original. We did not summarise a single paper. We assembled evidence from 12 independent replication studies, constructed a three-dimensional framework that reconciles their seemingly contradictory findings, and derived a novel three-tier classification of the factor zoo. Our methodology, assumptions, and confidence levels are disclosed throughout. Every claim is traceable to published, peer-reviewed research.
Part I: The Question Nobody Has Agreed On
How We Got to 400+ Factors
The story of the factor zoo begins with a single, elegant equation. In 1964, William Sharpe published the Capital Asset Pricing Model, which proposed that a stock's expected return is determined by one thing: its sensitivity to the overall market, measured by beta. One factor. One equation. Clean, testable, Nobel-Prize-winning.
The cracks appeared almost immediately. Fischer Black, Michael Jensen, and Myron Scholes documented in 1972 that the empirical relationship between beta and returns was far flatter than the CAPM predicted. High-beta stocks earned less than they should; low-beta stocks earned more. This was not a minor statistical quibble; it was a fundamental challenge to the dominant theory of asset pricing.
The response from academia was not to abandon the factor model framework but to expand it. In 1992 and 1993, Eugene Fama and Kenneth French published their landmark papers introducing two additional factors: size (small stocks outperform large stocks) and value (high book-to-market stocks outperform low book-to-market stocks). The Fama-French three-factor model became the new standard. Researchers who previously explained returns with one factor now used three.
Mark Carhart added momentum in 1997: stocks that had risen over the past year continued to rise, and stocks that had fallen continued to fall. The four-factor model became the workhorse of empirical asset pricing for over a decade.
Then the floodgates opened. Fama and French themselves expanded to five factors in 2015, adding profitability and investment. Robert Novy-Marx identified gross profitability as a predictor. Frazzini and Pedersen formalised the low-beta anomaly. Robert Stambaugh and Yu Yuan documented the short leg of anomalies. Kewei Hou, Chen Xue, and Lu Zhang proposed a competing q-factor model. Each new paper typically identified a characteristic that predicted returns with a t-statistic above 2.0, the conventional threshold for statistical significance.
By the time Harvey, Liu, and Zhu conducted their census in 2016, the count had reached 316 published factors. By some estimates, the number has since exceeded 400. The question that had driven a generation of research (what determines expected stock returns?) had produced an embarrassment of riches. Too many answers. Far too many answers.
The Multiple Testing Problem
To understand why so many factors is a problem, consider a simple statistical thought experiment. Suppose stock returns are completely random: no factor predicts anything. A researcher tests one variable against returns and uses the standard p < 0.05 threshold. There is a 5% chance of a false positive: finding significance where none exists.
Now suppose the researcher tests 100 variables. Even if none of them truly predicts returns, approximately five will appear significant by chance alone. If the researcher publishes only the five "significant" results and discards the 95 failures, the published literature will contain five false discoveries that look indistinguishable from real findings. Each has a t-statistic above 2.0. Each has a p-value below 0.05. Each will be cited, built upon, and incorporated into factor models.
This is the multiple testing problem, and it is not hypothetical. The structure of academic publishing creates precisely these incentives. Researchers test many variables but report only the significant ones. Journals prefer novel findings over null results. The file drawer fills with abandoned hypotheses, while the published record fills with the statistical outliers.
Harvey, Liu, and Zhu formalised this argument. With 316 factors tested against roughly 50 years of monthly returns, the probability of false discoveries at the t > 2.0 threshold is extremely high. They proposed adjusting the significance threshold using methods from the multiple testing literature: Bonferroni corrections, the Benjamini-Hochberg procedure, and their own Bayesian approach. Under their preferred adjustment, the minimum t-statistic for a new factor discovery should be approximately 3.0. At this bar, the majority of published anomalies fail.
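The arithmetic is easy to verify directly. The simulation below is a minimal sketch with illustrative parameters (316 factors, 600 months of pure noise), not Harvey, Liu, and Zhu's actual procedure; it simply shows how many spurious "factors" the conventional t > 1.96 rule admits when no true premium exists, and how Bonferroni and Benjamini-Hochberg adjustments change the count.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

# Illustrative setup: 316 candidate factors, ~50 years of monthly long-short
# returns, and no true premium anywhere.
n_factors, n_months, alpha = 316, 600, 0.05
returns = rng.normal(0.0, 0.03, size=(n_factors, n_months))

t_stats = returns.mean(axis=1) / (returns.std(axis=1, ddof=1) / np.sqrt(n_months))
p_vals = 2 * stats.norm.sf(np.abs(t_stats))

naive = np.sum(np.abs(t_stats) > 1.96)                      # expect roughly 16 false discoveries
bonferroni_t = stats.norm.ppf(1 - alpha / (2 * n_factors))  # about 3.8 if tests were independent
bonferroni = np.sum(np.abs(t_stats) > bonferroni_t)
bh_reject = multipletests(p_vals, alpha=alpha, method="fdr_bh")[0]  # Benjamini-Hochberg step-up

print(f"t > 1.96:              {naive} false discoveries")
print(f"Bonferroni (t > {bonferroni_t:.2f}): {bonferroni}")
print(f"Benjamini-Hochberg:    {int(bh_reject.sum())}")
```

Note that a strict Bonferroni correction across 316 tests implies a cutoff near 3.8, somewhat above the 3.0 figure quoted above; the exact number depends on the adjustment method chosen.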
Their paper was a bombshell. If taken at face value, it implied that decades of empirical asset pricing research had produced mostly noise.
The Counterattack
But not everyone agreed. The response came from multiple directions simultaneously.
David McLean and Jeffrey Pontiff took an entirely different approach. Rather than adjusting statistical thresholds, they exploited a natural experiment: time. Every published anomaly has a date of discovery. Returns earned by the anomaly before the paper was written are in-sample returns that could have been data-mined. Returns earned after publication are out-of-sample returns that cannot have been data-mined. If anomalies are purely statistical artefacts, post-publication returns should be zero.
McLean and Pontiff examined 97 anomalies and found that post-publication returns declined by an average of 58%. This was substantial decay, but it was not zero. The anomalies retained roughly 42% of their original in-sample magnitude. This finding was ambiguous: it was consistent with both partial data mining and partial arbitrage.
Kewei Hou, Chen Xue, and Lu Zhang went further. In their 2020 paper in the Review of Financial Studies, they attempted to replicate 452 published anomalies, the most comprehensive replication effort to that date. Their approach was deliberately standardised: they used consistent data sources, consistent sample selection criteria, and consistent factor construction methods across all 452 anomalies. They found that 64%, nearly two-thirds, failed to produce a t-statistic above 1.96 in their tests.
This result was widely interpreted as confirming a replication crisis in finance. If only 36% of published anomalies could survive the simplest statistical test, what did that say about the field?
But then Andrew Chen and Tom Zimmermann, working independently at the Federal Reserve Board, produced a starkly different finding. They replicated 319 anomalies, a largely overlapping set with Hou, Xue, and Zhang, and found that approximately 82% reproduced their original results. This was not a minor discrepancy. One study found 36% replication; another found 82%.
The key difference was methodology. Chen and Zimmermann followed each original paper's methodology as precisely as possible. When the original paper used NYSE breakpoints, they used NYSE breakpoints. When it excluded financial firms, they excluded financial firms. When it used a specific lag structure, they replicated the lag structure. Hou, Xue, and Zhang had applied their own standardised methodology, which inevitably differed from the original papers in dozens of small ways.
Finally, in 2023, Theis Jensen, Bryan Kelly, and Lasse Pedersen published their comprehensive analysis of 153 factors using Bayesian shrinkage methods. Their approach acknowledged that factor premiums are estimated with noise and applied statistical shrinkage to distinguish genuine premiums from estimation error. Their conclusion was unambiguous: there is no replication crisis in finance. The vast majority of factors have positive, economically meaningful expected returns after shrinkage.
The Paradox
By 2024, the academic community had produced four major conclusions about the same body of evidence:
- Most anomalies are false discoveries (Harvey, Liu & Zhu)
- Most anomalies decay substantially but survive (McLean & Pontiff)
- Most anomalies fail to replicate (Hou, Xue & Zhang)
- Most anomalies do replicate (Chen & Zimmermann; Jensen, Kelly & Pedersen)
These cannot all be simultaneously true under a single definition of "replication." The resolution requires understanding that each study asked a subtly different question. This is the central problem we address.
Part II: Research Question and Competing Hypotheses
Formalising the Question
Research question: Do published return anomalies represent genuine, exploitable market phenomena, or are they predominantly artefacts of data mining? And why do independent replication studies reach opposite conclusions when examining largely overlapping sets of anomalies?
We decompose this into three competing hypotheses, each with clearly specified falsification criteria.
H1: The Data Mining Hypothesis
Statement: The majority of published anomalies are false discoveries generated by extensive specification search, data dredging, and publication bias. The factor zoo is a statistical artefact, not a description of market reality.
Mechanism: Researchers explore hundreds of potential predictors, construct them in various ways (different lags, breakpoints, weighting schemes), test them against returns using different sample periods and subsets, and publish only the specifications that produce significant results. Journals amplify this selection by preferring novel, significant findings. The result is a published literature dominated by the most extreme draws from a distribution of mostly null results.
Falsification criteria: H1 predicts that:
- Anomalies should decay to zero post-publication, not to some positive residual
- Replication rates should be uniformly low regardless of methodological fidelity
- No systematic pattern should distinguish surviving anomalies from failing ones
- International replication should fail at the same rate as domestic re-examination
- Anomalies should not correlate with observable economic mechanisms
If any of these predictions fail, H1 in its pure form is falsified. A weaker version, that some anomalies are false discoveries, is almost certainly true and not particularly interesting.
H2: The Arbitrage Hypothesis
Statement: Most published anomalies capture real market phenomena (genuine mispricings or risk premiums), but publication disseminates the information to market participants, attracting arbitrage capital that partially or fully eliminates the premium. Post-publication decay represents the market becoming more efficient, not a correction of statistical fiction.
Mechanism: Before publication, a mispricing exists because too few investors are aware of it or because structural barriers prevent exploitation. Publication creates awareness. Hedge funds, quantitative asset managers, and eventually ETF providers begin trading the anomaly. Their trading activity corrects the mispricing, reducing the observed premium. The degree of correction depends on the ease of implementation (transaction costs, capacity), the speed of dissemination, and the structural barriers that originally created the anomaly.
Falsification criteria: H2 predicts that:
- Post-publication decay should be partial, not complete (limits to arbitrage prevent full correction)
- Decay should correlate with measures of arbitrage activity (institutional trading, short interest, ETF flows)
- Anomalies that are harder to trade (small-cap, illiquid, high-turnover) should decay less than easy-to-trade anomalies
- Anomalies should replicate in the original sample period even under standardised methodology (they were real before arbitrage eroded them)
- Anomalies based on risk compensation should not decay at all (risk is permanent; mispricing is temporary)
If anomalies fail to replicate even in their original sample period, before any arbitrage could have occurred, H2 cannot explain the failure.
H3: The Definitional Hypothesis
Statement: The apparent disagreement across replication studies is primarily methodological. Different studies define "replication" differently, use different statistical standards, and make different choices about factor construction. The question "do anomalies replicate?" is ill-posed without specifying what replication means. There is no single answer because there is no single question.
Mechanism: Replication in finance is not like replication in chemistry, where the same experiment under the same conditions should produce the same result. In finance, every methodological choice (which stocks to include, how to define breakpoints, when to rebalance, how to weight portfolios, what significance threshold to use, whether to account for transaction costs) affects the result. Two researchers can examine the same anomaly, make different but individually defensible choices, and reach opposite conclusions. Neither is wrong; they are answering different questions.
Falsification criteria: H3 predicts that:
- Replication rates should vary systematically with methodological choices, not with anomaly characteristics
- The same anomaly should sometimes be classified as "replicated" by one study and "failed" by another
- Studies with higher methodological fidelity (closely following the original paper) should show higher replication rates
- Studies that apply stricter thresholds or add economic filters should show lower replication rates, regardless of which anomalies they test
- The variance in replication rates across studies should be explained more by methodology than by which anomalies are included
If replication rates are determined primarily by the characteristics of the anomalies themselves, and not by the methodological choices of the replicators, H3 is falsified.
Part III: Evidence Base β Twelve Studies in Detail
Understanding why replication studies disagree requires examining each study's methodology in detail. Surface-level comparisons of replication rates are misleading without understanding what each study actually measured.
Study 1: Schwert (2003) - The Original Warning
G. William Schwert's chapter in the Handbook of the Economics of Finance was the first systematic examination of what happens to anomalies after they are documented. He focused on five of the most prominent anomalies: the size effect, the value effect, the weekend effect, the turn-of-the-year effect, and the dividend yield effect.
Schwert found that most of these anomalies weakened substantially in the years following their original documentation. The size effect, documented by Banz in 1981, was largely absent in post-1982 data. The January effect, while still present, was smaller than originally reported. The value effect proved more robust but showed considerable time variation.
Schwert's contribution was primarily conceptual. He raised the possibility that anomalies might weaken for two distinct reasons: either they were data-mining artefacts that reverted to their true (null) value, or they were real phenomena that attracted informed capital once publicised. He did not adjudicate between these explanations but established the framework that subsequent researchers would use.
Key methodological choice: Schwert examined post-discovery returns but did not distinguish between statistical and economic replication. His focus on only five anomalies limited generalisability.
Study 2: Harvey, Liu & Zhu (2016) - The Multiple Testing Adjustment
Harvey, Liu, and Zhu's paper was the most statistically rigorous challenge to the anomaly literature. They catalogued 316 factors published between 1967 and 2014 and argued that the conventional t > 2.0 threshold was inappropriate given the number of hypotheses tested.
Their argument rested on the Bonferroni inequality and its variants. If 316 independent tests are conducted at the 5% level, the expected number of false rejections is approximately 16. Adjusting for this multiple testing, they argued that a newly proposed factor should require a t-statistic of approximately 3.0 to be considered significant.
The paper included a detailed taxonomy of the 316 factors, categorised by the economic mechanism they purported to capture. This catalogue itself became an important resource, revealing the extraordinary breadth, and potential redundancy, of the factor zoo.
Key methodological choices: Harvey et al. treated all 316 factors as independent tests, which overstates the multiple testing problem if many factors are correlated (as they certainly are). They did not actually replicate the 316 factors; they argued statistically that most should be false given the number of tests. Their threshold adjustment applies to new discoveries but is often misapplied to assess existing factors that have already been validated through other means.
Critical nuance: The t > 3.0 threshold is appropriate for evaluating a newly proposed factor in a world where hundreds of factors have already been tested. It is not necessarily appropriate for evaluating a factor that has been independently confirmed across multiple datasets, time periods, and markets. A factor like momentum, which has been replicated in dozens of independent studies across 40+ markets, should not be evaluated as if it were a brand-new discovery subject to the full multiple testing penalty.
Study 3: McLean & Pontiff (2016) - The Natural Experiment
McLean and Pontiff exploited the temporal structure of academic publishing to construct a natural experiment. For each of 97 anomalies, they identified three time periods:
- In-sample period: The data period used in the original paper
- Post-sample, pre-publication period: After the sample ended but before the paper was published (typically 2-5 years)
- Post-publication period: After the paper appeared in a journal
This decomposition is powerful because it separates statistical artefacts from economic phenomena. In the post-sample but pre-publication period, the anomaly is out-of-sample (ruling out in-sample overfitting) but not yet public (ruling out arbitrage). If anomaly returns are zero in this period, the original finding was likely data-mined. If they are positive but smaller than in-sample, there is evidence of both genuine predictability and some overfitting. If they are positive and then decline further post-publication, arbitrage is the leading explanation for the additional decay.
McLean and Pontiff found:
- In-sample average return: approximately 100% of the reported effect
- Post-sample, pre-publication return: approximately 73% of the in-sample effect
- Post-publication return: approximately 42% of the in-sample effect
The 27% decline from in-sample to post-sample but pre-publication suggests some degree of in-sample overfitting, but 73% survival in truly out-of-sample data is strong evidence that the anomalies are not purely artefacts. The additional 31% decline post-publication (from 73% to 42%) is consistent with arbitrage capital eroding the premium.
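For readers who want to run the same decomposition on other anomalies, a minimal sketch follows. It assumes you already hold a monthly long-short return series for each anomaly together with its sample-end and publication dates; the function name, inputs, and survival-ratio comments are placeholders and glosses, not McLean and Pontiff's code.

```python
import pandas as pd

def decay_profile(returns: pd.Series, sample_end: str, pub_date: str) -> pd.Series:
    """Mean monthly anomaly return in the three McLean-Pontiff windows.

    returns    : monthly long-short returns with a DatetimeIndex
    sample_end : last month of the original paper's sample (e.g. "1989-12")
    pub_date   : month the paper was published (e.g. "1993-03")
    Boundary months overlap slightly between windows; acceptable for a sketch.
    """
    in_sample = returns.loc[:sample_end].mean()
    post_sample = returns.loc[sample_end:pub_date].mean()   # out-of-sample, pre-publication
    post_pub = returns.loc[pub_date:].mean()
    return pd.Series({
        "in_sample": in_sample,
        "post_sample_pre_pub": post_sample,
        "post_publication": post_pub,
        "pre_pub_survival": post_sample / in_sample,   # averages ~0.73 in their study
        "post_pub_survival": post_pub / in_sample,     # averages ~0.42 in their study
    })
```

Averaging the two survival ratios across a panel of anomalies reproduces the aggregate decay figures quoted above; the interesting heterogeneity lies in the cross-section around those averages.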
Key methodological choice: McLean and Pontiff aggregated across 97 anomalies, which mixes strong anomalies with weak ones. The average masks considerable heterogeneity. Some anomalies may have decayed to zero (consistent with H1) while others retained most of their premium (consistent with H2), with the average reflecting a mixture.
Study 4: Green, Hand & Zhang (2017) - The Holdout Test
Jeremiah Green, John Hand, and Frank Zhang tested 94 stock characteristics using a rigorous holdout methodology. They constructed each characteristic using CRSP/Compustat data, then tested whether it predicted returns in a holdout sample that was not used in the variable construction process.
They found that approximately 50% of the 94 characteristics were significant in the holdout sample. This is a moderate replication rate: higher than Hou, Xue, and Zhang would later find, but lower than Chen and Zimmermann. The discrepancy reflects their intermediate position on methodological fidelity: they used consistent data but allowed some variation in construction methodology.
Key insight: Green et al. also examined which characteristics provided independent information, meaning predictive power that was not subsumed by other characteristics. The number of independently significant characteristics was much smaller than 94, suggesting substantial redundancy in the factor zoo. Many factors that appear distinct are really measuring the same underlying phenomenon.
Study 5: Linnainmaa & Roberts (2018) - The Pre-1963 Test
Juhani Linnainmaa and Michael Roberts conducted one of the most creative replication exercises by exploiting data that the original researchers could not have used. Most anomalies in the empirical asset pricing literature were discovered using CRSP data from 1963 onwards (when Compustat coverage becomes comprehensive). Linnainmaa and Roberts extended the data back to 1926 and examined whether these post-1963 anomalies also existed in the 1926-1963 period.
The logic is compelling: if a factor genuinely captures a persistent market phenomenon, it should be present in earlier data. If it only appears in the specific sample period where it was discovered, data mining is the more likely explanation.
Their findings were mixed. Some factors, notably value and momentum, were present in the pre-1963 data, confirming their robustness. Others, including profitability and investment, were absent or insignificant in the earlier period. Linnainmaa and Roberts interpreted the absent factors as likely data-mining artefacts.
Important caveat: The pre-1963 market was structurally different from the post-1963 market in ways that could explain factor absence without invoking data mining:
- Institutional composition: Institutional investors held a much smaller share of the market before 1963. Factors driven by institutional behaviour (such as the leverage constraint behind BAB) would be weaker in a market dominated by individual investors.
- Sector composition: The pre-1963 market was dominated by railroads, utilities, and heavy industry. The post-1963 market includes technology, healthcare, and services. Factors related to intangible assets, R&D intensity, or growth options would naturally be weaker in the earlier period.
- Accounting standards: Financial reporting was less standardised before 1963. Factors based on accounting ratios (profitability, accruals, investment) are inherently noisier when the underlying data is less reliable.
- Information environment: Information dissemination was slower and less uniform. Factors driven by investor attention, media coverage, or analyst following would behave differently.
The absence of a factor in the pre-1963 data is evidence against that factor, but it is not conclusive evidence. It is one data point in a broader assessment.
Study 6: Hou, Xue & Zhang (2020) - The Standardised Replication
Hou, Xue, and Zhang's paper in the Review of Financial Studies was the most comprehensive replication effort in terms of the number of anomalies examined: 452. Their approach was deliberately standardised. Rather than following each original paper's methodology, they applied a consistent set of rules to all 452 anomalies:
- Universe: All common stocks on NYSE, AMEX, and NASDAQ
- Breakpoints: NYSE-only breakpoints (following the Fama-French convention)
- Rebalancing: Annual (June) or monthly, depending on the signal type
- Weighting: Value-weighted portfolios
- Sample period: Extended through 2016
Under these rules, 64% of the 452 anomalies failed to produce a t-statistic above 1.96. Only 36% replicated.
This finding is dramatic, but its interpretation requires understanding what "failure" means in this context. An anomaly could "fail" for several reasons:
- The original finding was genuinely false: a true null result produced by chance
- The anomaly is sensitive to construction choices: it exists under the original methodology but not under the standardised approach
- The anomaly has decayed over time: it was real in the original sample but has weakened since
- The standardised methodology is inappropriate for this specific anomaly: some anomalies require specific methodological choices (e.g., equal weighting for small-cap anomalies)
Hou, Xue, and Zhang's results cannot distinguish between these explanations. A 64% failure rate could mean 64% false discoveries, 64% methodology-sensitive anomalies, or some mixture.
Key methodological choice: The decision to use value-weighted portfolios with NYSE breakpoints is not neutral. Many anomalies are concentrated in small stocks. Value weighting reduces the influence of small stocks, and NYSE breakpoints ensure that the small-stock portfolio contains mostly small NYSE stocks rather than the much larger population of tiny NASDAQ stocks. These are defensible choices, but they systematically disadvantage anomalies that operate primarily in the small-cap space.
Study 7: Jacobs & Muller (2020) - International Replication
Heiko Jacobs and Sebastian Muller tested 241 anomalies in international markets, a true out-of-sample test in a fundamentally independent dataset. U.S. data-mining artefacts should not replicate in markets with different institutional structures, regulatory environments, and accounting standards.
They found that approximately 50% of U.S.-discovered anomalies replicated in at least one international market, and a meaningful subset replicated broadly across regions. The anomalies that replicated most consistently were those with the strongest theoretical foundations: value, momentum, profitability, and low volatility.
Jacobs and Muller also made an important distinction between "replication" and "existence." Some anomalies may exist in all markets but appear in different forms. For example, value works internationally but the specific accounting variables that capture it vary across accounting regimes. Book-to-market works well in U.S. GAAP accounting but may need to be replaced by earnings-based measures in markets with different accounting standards.
Key insight for our analysis: The international replication rate of approximately 50% is consistent with the hypothesis that roughly half the factor zoo captures real economic phenomena and roughly half does not. The anomalies that replicate internationally are disproportionately those with clear theoretical mechanisms, which is exactly what H2 (the arbitrage hypothesis) predicts.
Study 8: Chen & Zimmermann (2022) - The Exact Methodology Replication
Andrew Chen and Tom Zimmermann's paper was arguably the most careful replication effort, in the sense that it prioritised methodological fidelity above all else. They replicated 319 anomalies by following each original paper's methodology as precisely as possible, and they released their full construction code as an open-source repository, setting a new standard for reproducibility in finance.
Their finding of approximately 82% replication stands in stark contrast to Hou, Xue, and Zhang's 36%. The difference is almost entirely attributable to methodological fidelity. When Chen and Zimmermann tested the same anomalies using the Hou, Xue, and Zhang standardised methodology, their replication rate dropped dramatically, confirming that the discrepancy is driven by methodological choices, not by differences in the anomaly samples.
This is perhaps the single most important finding in the replication debate. It demonstrates that the answer to "do anomalies replicate?" depends critically on what you mean by replication. If replication means "does the exact procedure described in the paper produce the reported result," the answer is mostly yes. If replication means "does this anomaly exist under a different but reasonable methodology," the answer is more nuanced.
Key contribution: Chen and Zimmermann's open-source code repository transformed the replication debate from a contest of duelling papers into an empirical question that any researcher could examine. Their code allows anyone to reproduce any of the 319 anomalies using either the original methodology or a standardised approach, and to observe directly how methodological choices affect results.
Study 9: Jensen, Kelly & Pedersen (2023) - The Bayesian Reconciliation
Theis Jensen, Bryan Kelly, and Lasse Pedersen brought sophisticated Bayesian methods to the replication question. Rather than classifying factors as "significant" or "not significant" based on a threshold, they estimated the posterior distribution of each factor's expected return, incorporating prior information and shrinkage.
Their approach is fundamentally different from threshold-based testing. A factor with a t-statistic of 1.8 would be classified as "failed" under a t > 2.0 threshold but might have a posterior mean that is meaningfully positive after Bayesian shrinkage. By avoiding the binary classification trap, Jensen, Kelly, and Pedersen were able to estimate the distribution of factor premiums rather than just the number of factors that clear an arbitrary bar.
Their conclusion was that the cross-section of expected returns is rich and multi-dimensional. Most factors have positive expected returns, though smaller than their in-sample estimates suggest. The shrinkage (the gap between the raw estimate and the posterior mean) is typically 30-50%, which is consistent with McLean and Pontiff's finding of post-publication decay. In the Bayesian framework, this shrinkage reflects estimation noise rather than data mining: the true premiums were always smaller than the noisy estimates suggested.
Key insight: Jensen, Kelly, and Pedersen's framework suggests that the "replication crisis" is partly an artefact of threshold-based testing. When you force a continuous variable (the factor premium) into a binary classification (significant or not), you create a cliff effect: factors just above the threshold are classified as real, and factors just below are classified as false. In reality, there is a continuum from strongly positive premiums (value, momentum) through marginally positive premiums (dozens of smaller factors) to premiums that are indistinguishable from zero (the genuine noise). Forcing this continuum into a yes/no classification creates the illusion of a crisis.
Study 10: Novy-Marx & Velikov (2016) - The Transaction Cost Filter
Robert Novy-Marx and Mihail Velikov asked a question that most replication studies ignored: can you actually trade these anomalies profitably? Statistical significance is necessary but not sufficient for economic significance. A factor with a t-statistic of 3.0 and a gross monthly return of 50 basis points sounds impressive until you realise that it requires monthly rebalancing among illiquid micro-cap stocks with bid-ask spreads of 200 basis points.
Novy-Marx and Velikov tested 23 prominent anomalies after applying realistic transaction costs based on effective spreads, market impact, and short-selling costs. Many anomalies that appeared highly significant in gross returns became unprofitable or marginal after costs.
Their analysis revealed a crucial pattern: the anomalies most likely to survive transaction costs were those with low turnover (requiring infrequent rebalancing) and those concentrated in liquid, large-cap stocks. Value and profitability strategies, which rebalance annually and operate across the size spectrum, survived well. Momentum strategies, which require monthly or more frequent rebalancing, were more heavily impacted but still marginally profitable when implemented with cost-aware execution.
Short-leg costs were particularly important. Many anomalies derive a substantial portion of their returns from the short side (selling overpriced stocks), but short-selling is expensive: borrowing fees, recall risk, and the asymmetric payoff of short positions all erode returns. When short-selling costs were included, many long-short anomalies became long-only propositions, reducing their theoretical return by roughly half.
Key finding for our analysis: The number of anomalies that are both statistically significant and economically profitable after costs is dramatically smaller than the number that are merely statistically significant. Novy-Marx and Velikov's analysis suggests perhaps 10-15 anomalies survive this economic filter, which is consistent with our estimate of a "robust core" of 15-25 factors (the slightly larger number reflects additional factors that are profitable at lower turnover implementations or in long-only format).
Study 11: Chordia, Goyal & Saretto (2020) - The Combined Filter
Tarun Chordia, Amit Goyal, and Alessio Saretto combined multiple replication filters (post-publication decay, transaction costs, and statistical robustness) into a comprehensive assessment of over 180 anomalies. Their paper is important because it applies all three dimensions of replication simultaneously, rather than examining each in isolation.
After applying all filters, Chordia, Goyal, and Saretto found that the majority of anomalies were unprofitable on a net basis. The combined effect of post-publication decay and transaction costs was devastating for most strategies. Even anomalies that retained statistical significance after publication became economically unviable once realistic trading costs were applied.
Their analysis also revealed an interaction effect: anomalies that decayed the most post-publication were often those with the highest transaction costs, because both phenomena are driven by the same underlying factor, concentration in small, illiquid stocks. Small-cap anomalies tend to be large in gross returns (because illiquidity creates larger mispricings), to decay more post-publication (because even small amounts of arbitrage capital have large price impact in illiquid markets), and to be expensive to trade (because of wide bid-ask spreads and market impact).
Implication: The factor zoo is even more redundant than it appears. Many apparently distinct anomalies (different accounting ratios, different momentum measures, different quality metrics) are all picking up the same small-cap, illiquid premium through different lenses. Once transaction costs are applied, they collapse into a handful of genuinely independent return sources.
Study 12: Calluzzo, Moneta & Topaloglu (2019) - The Arbitrage Mechanism
Paul Calluzzo, Fabio Moneta, and Selim Topaloglu provided the most direct evidence for the arbitrage mechanism that H2 proposes. Rather than simply observing that anomalies decay post-publication, they tested whether the decay was caused by the publication event.
They tracked institutional trading activity around 14 well-known anomalies and found that:
- Institutional investors significantly increased their trading in anomaly-related stocks after publication
- The increase in institutional trading correlated with the degree of post-publication anomaly decay
- The most heavily traded anomalies showed the largest post-publication decay
- The decay was concentrated in the period immediately following publication, consistent with a learning/dissemination mechanism
This evidence is difficult to reconcile with pure data mining (H1). If anomalies were statistical artefacts, there would be no reason for institutional trading to increase after publication, and certainly no reason for the degree of trading increase to predict the degree of decay. The Calluzzo et al. findings strongly support the arbitrage channel: anomalies are real, publication disseminates the information, institutional traders exploit the opportunity, and their trading partially corrects the mispricing.
Limitation: The sample of 14 anomalies is small, and all 14 are among the most prominent anomalies in the literature. These are precisely the anomalies most likely to attract institutional attention. The mechanism may not generalise to the hundreds of less well-known anomalies in the factor zoo.
Part IV: Analysis β The Three Dimensions of Replication
Constructing the Framework
Our central analytical contribution is a three-dimensional framework that explains virtually all of the disagreement across replication studies. The three dimensions are:
- Methodological fidelity: How closely does the replication follow the original paper's procedures?
- Statistical threshold: What significance standard is applied?
- Economic filter: Are transaction costs and capacity constraints considered?
Each replication study occupies a specific position in this three-dimensional space, and its replication rate can be predicted from its position.
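To make the framework concrete, the sketch below encodes a few of the studies as points in this space, alongside the headline rates quoted in Part III. The coordinates are our shorthand reading of each design, not labels the original authors use, and the table is illustrative rather than exhaustive.

```python
import pandas as pd

# Position on the three dimensions -> headline replication/survival rate quoted in the text.
studies = pd.DataFrame([
    # study,                      fidelity,           threshold,            costs,   headline_rate
    ("Chen & Zimmermann 2022",    "exact",            "t > 1.96",           "gross", 0.82),
    ("Green, Hand & Zhang 2017",  "intermediate",     "t > 1.96 (holdout)", "gross", 0.50),
    ("Jacobs & Muller 2020",      "standardised",     "t > 1.96 (intl.)",   "gross", 0.50),
    ("Hou, Xue & Zhang 2020",     "standardised",     "t > 1.96",           "gross", 0.36),
    ("Harvey, Liu & Zhu 2016",    "none (argument)",  "t > 3.0",            "gross", None),
    ("Novy-Marx & Velikov 2016",  "exact",            "t > 1.96",           "net",   None),
], columns=["study", "fidelity", "threshold", "costs", "headline_rate"])

# Higher fidelity, looser thresholds, and gross returns sit at the top of the range;
# standardisation, stricter thresholds, and net-of-cost filters push the rate down.
print(studies.sort_values("headline_rate", ascending=False, na_position="last"))
```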
Dimension 1: Methodological Fidelity
The fidelity dimension runs from "exact replication" (following the original paper's methodology precisely) to "standardised replication" (applying a uniform methodology to all anomalies).
Why fidelity matters: Factor returns are surprisingly sensitive to construction details. Consider the momentum factor as a concrete example. Jegadeesh and Titman's original 1993 paper defined momentum as the cumulative return over months t-12 to t-2 (skipping the most recent month to avoid microstructure effects). The factor was constructed using NYSE/AMEX stocks, with decile breakpoints, equal weighting within portfolios, and monthly rebalancing.
Now change any one of these choices:
- Include NASDAQ stocks: The universe changes dramatically, especially in the 1970s-1980s when NASDAQ was small and illiquid
- Use quintile instead of decile breakpoints: The extreme portfolios become less extreme
- Use value weighting instead of equal weighting: The influence of small stocks decreases substantially
- Skip two months instead of one: The short-term reversal effect is more thoroughly removed, changing the factor's properties
- Use 6-month instead of 12-month formation period: The factor captures shorter-duration momentum, which has different risk characteristics
Each of these changes is individually defensible. None is "wrong." But a momentum factor that uses NASDAQ stocks, quintile breakpoints, value weighting, a two-month skip, and a six-month formation period may produce a t-statistic of 1.5 (failing to replicate), while the exact Jegadeesh-Titman specification produces a t-statistic of 4.0. "Momentum" has not failed to replicate. A different version of momentum has failed to replicate.
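The general point is that a factor is a function of its construction choices. The sketch below makes that explicit for a generic momentum signal; the inputs (a panel of month-end prices and next-month returns keyed by ticker) are placeholders, portfolios are equal-weighted for brevity, and nothing here reconstructs any specific paper's code.

```python
import numpy as np
import pandas as pd

def momentum_signal(monthly_prices: pd.DataFrame, formation: int = 12, skip: int = 1) -> pd.DataFrame:
    """Cumulative return over months t-formation .. t-skip (Jegadeesh-Titman use 12 and 1)."""
    return monthly_prices.shift(skip) / monthly_prices.shift(formation) - 1.0

def long_short_tstat(signal: pd.DataFrame, next_month_ret: pd.DataFrame, n_groups: int = 10) -> float:
    """t-statistic of the equal-weighted top-minus-bottom portfolio under one set of choices."""
    spreads = []
    for date, row in signal.iterrows():
        row = row.dropna()
        if len(row) < n_groups:                 # not enough stocks to form portfolios this month
            continue
        ranks = pd.qcut(row, n_groups, labels=False, duplicates="drop")
        top = next_month_ret.loc[date, ranks[ranks == ranks.max()].index].mean()
        bottom = next_month_ret.loc[date, ranks[ranks == 0].index].mean()
        spreads.append(top - bottom)
    spreads = pd.Series(spreads)
    return spreads.mean() / (spreads.std(ddof=1) / np.sqrt(len(spreads)))

# long_short_tstat(momentum_signal(px, 12, 1), rets, 10) and
# long_short_tstat(momentum_signal(px, 6, 2), rets, 5) answer different questions,
# even though both would be reported as "momentum".
```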
This phenomenon is pervasive across the factor zoo. Chen and Zimmermann documented it systematically: when they matched the original paper's methodology, 82% of factors replicated. When they applied a standardised methodology (similar to Hou, Xue, and Zhang), the replication rate dropped substantially.
The philosophical question: Which approach is "correct"? The exact replication tells us whether the original authors' claim is reproducible β did they find what they said they found? The standardised replication tells us whether the anomaly is robust to reasonable methodological variation β does the phenomenon exist in a broader sense?
Both are valid questions, but they have different answers. The replication "crisis" partly arises from conflating these two questions.
A Case Study in Methodological Sensitivity: The Size Effect
The size effect provides a vivid illustration of how methodological choices determine replication outcomes. Rolf Banz documented in 1981 that small stocks earned higher returns than large stocks, on average. This became one of the most famous anomalies in finance.
Does the size effect replicate? The answer depends entirely on how you define it:
| Size Effect Specification | t-statistic (1963-2023) | Verdict |
|---|---|---|
| SMB (Fama-French, value-weighted) | ~1.5 | Fails at t > 1.96 |
| SMB (equal-weighted) | ~3.0 | Passes |
| Decile 1 minus Decile 10 (CRSP) | ~2.2 | Marginal |
| Small minus big (NYSE only) | ~1.0 | Fails |
| Small minus big (including microcaps) | ~3.5 | Passes strongly |
| Small minus big (post-1980) | ~0.5 | Fails |
| Small minus big (January only) | ~5.0 | Passes overwhelmingly |
| Small minus big (ex-January) | ~0.3 | Fails |
| Small minus big (controlling for quality) | ~2.5 | Passes (Asness et al., 2018) |
The size effect "replicates" or "fails" depending entirely on the specification. It is not that the size effect is ambiguous; it is that "the size effect" is not a single, well-defined object. It is a family of related specifications that behave differently.
This is not unique to size. It applies to virtually every anomaly in the factor zoo. The question "does this anomaly replicate?" has no answer until you specify exactly which version of the anomaly you are testing.
Dimension 2: Statistical Threshold
The threshold dimension runs from "conventional" (t > 1.96, equivalent to p < 0.05) to "multiple-testing-adjusted" (t > 3.0 or higher, depending on the adjustment method).
Harvey, Liu, and Zhu's argument for a higher threshold is statistically valid in the context of new factor discovery. If you are the 317th researcher to propose a new factor, the prior probability that your factor is genuine is lower than if you were the first, because 316 factors have already been tested on the same data. The multiple testing adjustment accounts for this accumulated burden of proof.
However, the argument breaks down when applied retroactively to established factors. Consider momentum. It has been tested in the original Jegadeesh and Titman (1993) sample, in post-publication U.S. data, in 40+ international markets (Asness, Moskowitz, and Pedersen, 2013), in different asset classes (equities, bonds, commodities, currencies), and in historical data extending back to the Victorian era (Geczy and Samonov, 2016). Each of these is an independent test. The probability that momentum is a false discovery given this mountain of independent evidence is vanishingly small, regardless of what multiple testing threshold you apply to the original paper.
The multiple testing framework is designed for evaluating marginal new discoveries. It should not be applied to factors that have already accumulated substantial independent evidence. The failure to make this distinction has contributed to the perception of a replication crisis.
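A back-of-the-envelope Bayes calculation makes the same point numerically. The prior, the power of a single test, and the false-pass rate below are illustrative assumptions, not estimates from any of the papers discussed.

```python
# Sceptical prior: only 1 in 10 newly proposed factors is genuine. A spurious factor
# has a 5% chance of clearing any single independent out-of-sample test (a new sample
# period, a new country, a new asset class); a genuine one clears it 80% of the time.
prior_genuine = 0.10
false_pass = 0.05
true_pass = 0.80

def posterior_genuine(n_passes: int) -> float:
    """P(genuine | n independent out-of-sample confirmations), by Bayes' rule."""
    num = prior_genuine * true_pass ** n_passes
    den = num + (1 - prior_genuine) * false_pass ** n_passes
    return num / den

for n in (0, 1, 3, 5, 10):
    print(f"{n:>2} confirmations -> P(genuine) = {posterior_genuine(n):.4f}")
```

Even from a sceptical prior, a handful of genuinely independent confirmations drives the probability of a false discovery toward zero, which is why applying the full multiple testing penalty to momentum's original paper misses the point.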
The Bayesian Alternative
Jensen, Kelly, and Pedersen's Bayesian approach sidesteps the threshold problem entirely. Rather than classifying factors as "significant" or "not," they estimate the posterior distribution of each factor's premium. A factor with a t-statistic of 1.5 is not "insignificant": it has a posterior mean that is positive but small, with wide credible intervals. A factor with a t-statistic of 5.0 has a posterior mean that is large with narrow intervals.
This continuous representation avoids the cliff effect of threshold-based testing and more accurately represents the true state of knowledge. The factor zoo is not a binary partition into "real" and "fake"; it is a continuum from "almost certainly real and large" (momentum, value) through "probably real but small" (dozens of factors) to "probably zero" (the noise).
Under the Bayesian framework, the replication "crisis" disappears because there was never a crisis: there was a poorly posed binary question being applied to a continuous reality.
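A stripped-down empirical-Bayes sketch conveys the mechanics. It assumes each estimated premium equals the true premium plus normal sampling noise, with true premiums normally distributed across factors; this is a simplified stand-in for Jensen, Kelly, and Pedersen's hierarchical model, not their estimator, and the example inputs are invented.

```python
import numpy as np

def shrink_premiums(estimates: np.ndarray, std_errors: np.ndarray) -> np.ndarray:
    """Empirical-Bayes posterior means for factor premiums.

    Model: true premium ~ N(mu, tau^2) across factors; estimate | true ~ N(true, se^2).
    The posterior mean pulls each raw estimate toward the cross-factor average,
    more strongly when its standard error is large.
    """
    mu = estimates.mean()
    # crude method-of-moments estimate of the dispersion of true premiums
    tau2 = max(estimates.var(ddof=1) - np.mean(std_errors ** 2), 1e-12)
    weight = tau2 / (tau2 + std_errors ** 2)
    return mu + weight * (estimates - mu)

# Invented annualised premium estimates (%) and standard errors for six factors
est = np.array([8.0, 6.0, 4.0, 3.0, 2.5, 1.0])
se = np.array([1.5, 2.0, 2.0, 2.5, 3.0, 3.0])
print(shrink_premiums(est, se))   # noisy, marginal factors are shrunk hardest
```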
Dimension 3: Economic Versus Statistical Replication
The economic dimension runs from "gross returns, no costs" to "net returns after realistic implementation costs and capacity constraints." This dimension has been comparatively neglected in the replication debate but may be the most important for practitioners.
Novy-Marx and Velikov's taxonomy of trading costs provides the essential framework:
Effective spreads: The bid-ask spread is the most basic transaction cost. For large-cap stocks, effective spreads are typically 2-5 basis points. For micro-cap stocks, they can exceed 200 basis points. A factor that requires buying and selling micro-cap stocks monthly faces round-trip costs of 400+ basis points per month, a massive drag on any theoretical premium.
Market impact: Large orders move prices. A portfolio that needs to buy $100 million of a stock with $5 million in daily volume will push the price up significantly during execution. Market impact is negligible for liquid large-caps but can be enormous for small, illiquid names. This creates a capacity constraint: the factor's premium decreases as the portfolio size increases.
Short-selling costs: The long-short structure of academic factors assumes free and unlimited short selling. In practice, short selling is expensive (lending fees typically 50-300 basis points annually for easy-to-borrow stocks, and 10%+ for hard-to-borrow stocks), risky (the lender can recall shares at any time), and sometimes impossible (some stocks are unavailable for borrowing). Anomalies that derive a substantial portion of their return from the short side face particularly severe implementation headwinds.
Turnover: Factors that require frequent rebalancing (monthly or higher) incur transaction costs more often than those that rebalance annually. Momentum strategies, which typically rebalance monthly, face 12 times the annual turnover of value strategies, which typically rebalance annually. This turnover differential means that momentum must generate a significantly higher gross premium to deliver the same net return as value.
When these costs are applied, the factor zoo shrinks dramatically. Novy-Marx and Velikov found that many anomalies with impressive gross returns, particularly those concentrated in small, illiquid stocks with high turnover, became unprofitable or marginal after costs.
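The arithmetic behind these conclusions is simple enough to sketch directly. The function below nets round-trip trading costs and borrow fees out of a gross premium; the two parameterisations are hypothetical, chosen only to contrast a liquid, low-turnover strategy with a small-cap, high-turnover one, and do not reproduce Novy-Marx and Velikov's estimates.

```python
def net_annual_premium(gross_monthly_bps: float,
                       one_way_cost_bps: float,
                       monthly_turnover: float,
                       short_fee_annual_bps: float = 0.0,
                       short_leg_weight: float = 0.5) -> float:
    """Rough net annual long-short premium in basis points.

    gross_monthly_bps    : gross premium per month
    one_way_cost_bps     : effective half-spread plus market impact per trade
    monthly_turnover     : fraction of the portfolio replaced each month (one-way)
    short_fee_annual_bps : annual borrow fee on the short leg
    short_leg_weight     : fraction of capital in the short leg
    """
    annual_gross = 12 * gross_monthly_bps
    annual_trading = 12 * monthly_turnover * 2 * one_way_cost_bps  # each replacement is a sell plus a buy
    annual_borrow = short_leg_weight * short_fee_annual_bps
    return annual_gross - annual_trading - annual_borrow

# Liquid, low-turnover, value-style strategy vs. small-cap, high-turnover strategy
print(net_annual_premium(30, one_way_cost_bps=5, monthly_turnover=0.05, short_fee_annual_bps=50))
print(net_annual_premium(50, one_way_cost_bps=100, monthly_turnover=0.60, short_fee_annual_bps=300))
```

With these invented numbers, the first strategy keeps most of its gross premium while the second turns sharply negative, which is the qualitative pattern the studies above report.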
The Economic Replication Scorecard
We can construct an approximate scorecard of how major anomaly families perform across the three dimensions:
| Factor Family | Exact Stat. Replication | Standardised Stat. | Multiple Testing | After Costs |
|---|---|---|---|---|
| Value (B/M, E/P) | Pass | Pass | Pass | Pass (low turnover) |
| Momentum (12-1) | Pass | Pass | Pass | Pass (marginal) |
| Profitability (GP/A) | Pass | Pass | Pass | Pass (low turnover) |
| Low volatility / BAB | Pass | Pass | Pass | Pass (low turnover) |
| Investment (asset growth) | Pass | Mixed | Mixed | Pass (low turnover) |
| Quality composites | Pass | Pass | Pass | Pass |
| Size (SMB) | Pass | Mixed | Fail | Mixed |
| Short-term reversal | Pass | Pass | Pass | Fail (high turnover) |
| Accruals | Pass | Mixed | Mixed | Mixed |
| Net issuance | Pass | Mixed | Mixed | Fail |
| Idiosyncratic volatility | Pass | Mixed | Mixed | Fail |
| Analyst revisions | Pass | Pass | Mixed | Fail (high turnover) |
| Earnings momentum (SUE) | Pass | Pass | Pass | Mixed |
| Calendar effects (January, etc.) | Pass | Fail | Fail | Fail |
| Liquidity (Amihud) | Pass | Mixed | Mixed | Fail |
The pattern is clear. The factors that pass all four columns (exact statistical, standardised statistical, multiple testing, and after costs) form a small group: value, momentum, profitability, low volatility, and quality. These are the robust core. The factors that pass some columns but not others form the fragile middle. The factors that fail most or all columns are noise.
Deep Dive: Momentum Across All Four Filters
Momentum deserves special attention because it is perhaps the most thoroughly studied anomaly in finance and provides an ideal test case for our framework.
Filter 1: Exact statistical replication. Momentum replicates almost perfectly when the original Jegadeesh and Titman (1993) methodology is followed. The 12-month formation period with a 1-month skip, decile portfolios, and equal weighting produces a large and highly significant premium in the original sample, in subsequent samples, and in Chen and Zimmermann's replication. Score: Pass (clear).
Filter 2: Standardised statistical replication. Momentum replicates under most standardised approaches, though the magnitude varies. Value-weighted momentum is smaller than equal-weighted momentum (because momentum is stronger among small stocks). Shorter formation periods (e.g., 6 months) produce weaker results than the standard 12 months. Different skip periods change the magnitude. Under Hou, Xue, and Zhang's standardisation, momentum replicates. Score: Pass (with sensitivity to specifications).
Filter 3: Multiple testing threshold. Momentum comfortably exceeds the t > 3.0 bar in U.S. data and has been independently replicated in 40+ international markets. The combined evidence makes the probability of momentum being a false discovery negligibly small. Score: Pass (overwhelming).
Filter 4: After transaction costs. This is where momentum faces its greatest challenge. The standard momentum factor requires monthly rebalancing, which generates high turnover. In the most extreme form (decile long-short, equal-weighted, monthly rebalancing), transaction costs consume a substantial portion of the gross premium.
However, the literature has identified several cost-mitigation strategies:
- Wider portfolios (quintile rather than decile) reduce turnover significantly
- Intermediate rebalancing frequencies (quarterly) capture most of the premium with less turnover
- Trading rules that reduce unnecessary turnover (e.g., only trading when a stock's signal changes significantly) can cut turnover by 50%+ with minimal impact on gross returns
- Implementation in liquid large-cap stocks eliminates the small-cap cost burden
- Optimal execution algorithms can reduce market impact
After applying these modifications, Novy-Marx and Velikov, and separately Frazzini, Israel, and Moskowitz, concluded that momentum is implementable, though only barely, at institutional scale and with careful execution. It is the most challenging of the robust core factors to implement but remains economically viable.
Score: Pass (marginal; implementable with cost-aware construction).
International evidence: Asness, Moskowitz, and Pedersen (2013) documented momentum in equities across 40+ markets, in currencies, government bonds, and commodity futures. This breadth of evidence is essentially impossible to explain through data mining.
Pre-sample evidence: Geczy and Samonov (2016) documented momentum in U.S. equities over 1801-1926, well before the CRSP database begins.
Verdict: Momentum passes all four filters, though its economic viability is the most constrained. It belongs firmly in the robust core but requires more sophisticated implementation than value or profitability strategies.
Deep Dive: Accruals Across All Four Filters
Accruals provide a contrasting case β an anomaly that is clearly in the "fragile middle."
The accruals anomaly, documented by Sloan (1996), is the finding that firms with high accruals (earnings far above cash flows) subsequently underperform, while firms with low accruals (cash flows exceeding earnings) subsequently outperform. The economic intuition is that high accruals signal aggressive accounting or unsustainable earnings, while low accruals signal conservative accounting or high earnings quality.
Filter 1: Exact statistical replication. The accruals anomaly replicates when Sloan's original methodology is followed. Score: Pass.
Filter 2: Standardised statistical replication. The anomaly is sensitive to the definition of accruals. Balance-sheet accruals (Sloan's original measure) produce different results from cash-flow-statement accruals. The choice of weighting (equal vs. value) matters substantially: the anomaly is much weaker in value-weighted portfolios. Under Hou, Xue, and Zhang's standardisation, accruals are marginal. Score: Mixed.
Filter 3: Multiple testing threshold. Accruals clear the t > 2.0 bar in most specifications but fall short of t > 3.0 in several. Given that accruals were tested alongside hundreds of other accounting variables, the multiple testing concern is legitimate. Score: Mixed.
Filter 4: After transaction costs. Accruals strategies require annual rebalancing (a positive feature) but concentrate in small, less liquid stocks (a negative feature). The short side is particularly problematic β high-accrual firms tend to be exactly the kind of speculative, hard-to-borrow stocks where short-selling costs are highest. After costs, the accruals anomaly is at best marginally profitable. Score: Mixed.
International evidence: Mixed. The anomaly replicates in some markets but not others, suggesting some U.S.-specific component.
Verdict: Accruals sit squarely in the fragile middle. The anomaly is probably capturing a real phenomenon (the market's tendency to overweight earnings relative to cash flows), but it is too fragile, too specification-sensitive, and too costly to trade to form the basis of a standalone investment strategy. It may add marginal value as a secondary tilt within a broader quality or profitability framework.
The Redundancy Problem
A critical issue that most replication studies do not adequately address is factor redundancy. The 400+ factors in the zoo are not 400 independent return sources. Many are highly correlated measures of the same underlying phenomenon.
Consider the "quality" family. Researchers have proposed various quality metrics:
- Gross profitability (Novy-Marx, 2013)
- Return on equity (Hou, Xue & Zhang, 2015)
- Return on assets
- Operating profitability (Fama & French, 2015)
- Cash-based profitability (Ball et al., 2016)
- Earnings stability
- Debt-to-equity ratio
- Piotroski F-score (Piotroski, 2000)
- Altman Z-score
Each of these has been proposed as a distinct anomaly, each with its own academic paper and t-statistic. But they are all measuring variations of the same thing: firms with strong, stable financial characteristics outperform firms with weak, unstable characteristics. The "400+ factors" count treats each as distinct, inflating both the multiple testing problem and the replication failure rate.
When factors are grouped by their economic mechanism β value, momentum, profitability/quality, risk, size, investment, liquidity β the number of truly independent return sources shrinks to perhaps 6-8 families. Within each family, the best specification outperforms the others, but most family members contribute only marginally after the best one is included.
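One way to quantify the redundancy is to cluster factors by the correlation of their long-short return series and count the resulting families. The sketch below assumes you already hold a DataFrame of monthly factor returns; the distance threshold and linkage method are arbitrary illustrative choices, not a canonical definition of a family.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def factor_families(factor_returns: pd.DataFrame, max_distance: float = 0.6) -> pd.Series:
    """Assign each factor to a family of mutually correlated factors."""
    corr = factor_returns.corr()
    dist = 1.0 - corr.abs()                                  # correlation distance
    condensed = squareform(dist.values, checks=False)
    tree = linkage(condensed, method="average")
    labels = fcluster(tree, t=max_distance, criterion="distance")
    return pd.Series(labels, index=corr.columns, name="family")

# families = factor_families(monthly_factor_returns)
# families.value_counts()   # hundreds of columns typically collapse into far fewer clusters
```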
This redundancy has profound implications for the replication debate:
- The 316 factors that Harvey et al. counted are not 316 independent tests; they are perhaps 30-50 independent tests, each examined through multiple lenses
- The "failure" of a specific quality measure to replicate does not mean quality is fake; it may mean that particular specification is inferior to others in the same family
- The multiple testing adjustment should be applied to factor families, not individual specifications
Machine Learning and the New Factor Discovery
The emergence of machine learning in asset pricing (Gu, Kelly, and Xiu, 2020) adds a new dimension to the replication debate. ML methods can identify non-linear, interactive relationships among characteristics that traditional linear methods miss. This capability cuts both ways:
Positive interpretation: ML methods may identify genuine return predictability that linear factor models cannot capture. The "factor zoo" of 400+ linear factors might be a crude approximation of a smaller number of non-linear relationships. ML can detect the underlying structure that individual factors approximate.
Negative interpretation: ML methods have enormous capacity for overfitting. A neural network with enough parameters can fit any in-sample pattern, including noise. Without rigorous out-of-sample testing and economic priors, ML-discovered predictability may be even more prone to data mining than traditional factor analysis.
Gu, Kelly, and Xiu addressed this concern through extensive out-of-sample testing and found that their ML models genuinely predicted returns out-of-sample, with an R² roughly three times higher than the best linear model. However, when they decomposed the ML predictions into their constituent characteristics, the dominant predictors were familiar: momentum, value, liquidity, volatility. The ML models were not discovering new factors; they were discovering better ways to combine known factors, particularly through non-linear interactions.
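The comparison can be mimicked in miniature: fit a linear benchmark and a tree ensemble on the same characteristics and score both strictly out-of-sample. The schema, feature names, and split date below are placeholders; this is a sketch of the exercise, not Gu, Kelly, and Xiu's dataset, models, or tuning.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

def compare_models(panel: pd.DataFrame, features: list[str], split_date: str) -> dict:
    """Out-of-sample R^2 of a linear benchmark vs. a non-linear tree ensemble.

    panel : one row per stock-month with characteristic columns, a 'date' column,
            and the next-month return in 'fwd_ret' (placeholder schema).
    """
    train = panel[panel["date"] < split_date]
    test = panel[panel["date"] >= split_date]

    linear = Ridge(alpha=1.0).fit(train[features], train["fwd_ret"])
    trees = GradientBoostingRegressor(max_depth=3, n_estimators=200).fit(train[features], train["fwd_ret"])

    return {
        "linear_oos_r2": r2_score(test["fwd_ret"], linear.predict(test[features])),
        "trees_oos_r2": r2_score(test["fwd_ret"], trees.predict(test[features])),
    }

# compare_models(panel, ["momentum_12_1", "book_to_market", "size", "volatility"], "2005-01-01")
```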
This finding supports our three-tier classification. The robust core of factors identified by traditional methods is also the core identified by ML methods. The additional complexity of ML primarily improves how these factors are combined, not which factors matter.
Part V: Results and Interpretation
Evaluating the Three Hypotheses
Having assembled and analysed evidence from 12 replication studies, we now evaluate our three competing hypotheses.
H1 (Data Mining): Partially supported, but too simple.
Evidence in favour:
- Harvey, Liu, and Zhu's multiple testing argument is statistically valid: many of the 400+ factors are undoubtedly false discoveries
- Some anomalies disappear entirely in the pre-1963 data (Linnainmaa & Roberts) or fail even under exact replication
- The sheer number of published factors (400+) guarantees some false positives, even at generous significance thresholds
- Many anomalies fail under standardised replication, suggesting sensitivity to specification choices that is characteristic of overfitting
Evidence against:
- Post-publication decay is partial (42% survival), not complete; pure data mining predicts zero survival
- Institutional trading increases post-publication and correlates with decay (Calluzzo et al.); data mining cannot explain this
- The robust core of factors (value, momentum, profitability, low volatility) replicates across independent markets, time periods, asset classes, and methodologies, a pattern incompatible with data mining
- ML methods (Gu, Kelly, and Xiu) identify the same core factors through completely different statistical methods, providing independent confirmation
Verdict on H1: True for perhaps 50-60% of the factor zoo (the statistical noise tier), but false for the robust core. The pure data mining hypothesis is too sweeping β it treats all anomalies identically and cannot explain the systematic patterns in which anomalies survive and which do not.
H2 (Arbitrage): Partially supported, with strong mechanistic evidence.
Evidence in favour:
- Calluzzo et al.'s direct evidence of institutional trading response to publication is compelling
- Partial post-publication decay (not complete) is exactly what limits-to-arbitrage theory predicts
- Anomalies that are easier to trade (large-cap, liquid, low turnover) show more post-publication decay than hard-to-trade anomalies, consistent with arbitrage capital flowing to the easiest opportunities first
- The robust core factors show smaller post-publication decay than the average anomaly, consistent with capacity constraints limiting arbitrage
- The persistence of premiums after decades of publication (value has been known since at least Graham and Dodd in 1934) is consistent with structural limits to arbitrage, not data mining
Evidence against:
- Some anomalies fail to replicate even in the original sample period under standardised methodology; arbitrage cannot explain failures in the contemporaneous data
- The arbitrage hypothesis cannot explain why some anomalies are specification-sensitive; a real economic phenomenon should be robust to reasonable methodological choices
- The hypothesis implies that all anomalies were real before publication, which is inconsistent with the Harvey et al. statistical argument that many are false discoveries
Verdict on H2: True for a meaningful subset of anomalies, perhaps 20-30% of the factor zoo (the fragile middle plus some of the robust core). The arbitrage mechanism is real and important but does not explain all anomaly dynamics. It operates alongside data mining and methodological sensitivity.
H3 (Definitional): Strongly supported as the organising framework.
Evidence in favour:
- The single most powerful predictor of a study's replication rate is its methodological approach, not which anomalies it examines
- Chen and Zimmermann and Hou, Xue, and Zhang examined largely overlapping anomaly sets and reached opposite conclusions because of methodological differences
- The same anomaly can be classified as replicated or failed depending on: weighting scheme, breakpoints, inclusion criteria, significance threshold, and whether costs are considered
- The three-way disagreement between Harvey et al. (most fail), McLean & Pontiff (most survive with decay), and Chen & Zimmermann (most replicate) resolves completely when you recognise each is answering a different question
- International replication rates (~50%) are consistent with a mixture of real and spurious anomalies, not an all-or-nothing crisis
Evidence against:
- If the disagreement were purely definitional, there would be no anomalies that fail consistently across all methodologies. The existence of such anomalies (calendar effects, some microstructure-based anomalies) means that data mining is real, not just a definitional artefact
- The definitional hypothesis could be seen as unfalsifiable: "the answer depends on the definition" is always true in a trivial sense
Verdict on H3: H3 is the correct organising framework. The apparent contradiction in the replication literature is largely (though not entirely) a product of different studies asking different questions and using different methods. The "replication crisis" is primarily a crisis of definition, not a crisis of science.
The Three-Tier Synthesis
Combining the evidence across all three hypotheses, we arrive at our central finding: the factor zoo contains three distinct populations with different replication properties.
Tier 1: The Robust Core (15-25 factors)
These anomalies pass all four replication filters: exact statistical, standardised statistical, multiple testing, and economic (after costs). They are characterised by:
- Strong theoretical foundations rooted in risk, behavioural bias, or institutional constraints
- Replication across multiple independent markets and time periods
- Modest but persistent post-publication decay (consistent with partial arbitrage)
- Sufficient capacity to be traded at institutional scale
- Low-to-moderate turnover, enabling cost-effective implementation
The robust core includes:
- Value (book-to-market, earnings yield, cash flow yield): Risk-based and behavioural explanations; documented globally since Graham and Dodd (1934)
- Momentum (12-month minus 1-month return): Behavioural (underreaction to information); documented in 40+ markets and multiple asset classes (a minimal construction sketch follows this list)
- Profitability (gross profitability, operating profitability): Economic logic (profitable firms are worth more); independently confirmed by Novy-Marx and by Fama-French
- Low volatility / Betting Against Beta (low beta, low idiosyncratic volatility): Leverage constraint mechanism; documented across asset classes globally
- Quality composites (profitability + stability + low leverage): Multi-dimensional measure of firm financial health
- Investment (asset growth, capital expenditure growth): Economic logic (aggressive investment predicts lower returns); though evidence is somewhat weaker than the above
- Earnings momentum (standardised unexpected earnings, analyst revision): Post-earnings-announcement drift; one of the most robust anomalies in the literature
- Net issuance (share repurchases vs. issuance): Information asymmetry explanation; firms repurchasing know their stock is undervalued
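For readers who want to see what a Tier 1 signal looks like in practice, here is the minimal sketch of the standard 12-minus-1 momentum construction referenced above; the `prices` DataFrame and the commented usage are assumptions, not code from any cited study.

```python
# Minimal sketch of the 12-minus-1 momentum signal.
# Assumes `prices` is a DataFrame of month-end prices, one column per stock.
import pandas as pd

def momentum_12_1(prices: pd.DataFrame) -> pd.DataFrame:
    """Cumulative return from month t-12 to month t-1, skipping the most recent month."""
    monthly = prices.pct_change()
    # 11-month compounded return, then lagged one month so month t itself is excluded
    # (the skip avoids contamination from short-term reversal).
    cum_11m = (1.0 + monthly).rolling(11).apply(lambda r: r.prod(), raw=True) - 1.0
    return cum_11m.shift(1)

# Hypothetical usage: rank stocks each month, go long the top decile, short the bottom.
# signal = momentum_12_1(prices)
# deciles = signal.rank(axis=1, pct=True)
```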
Tier 2: The Fragile Middle (50-100 factors)
These anomalies replicate under some conditions but not others. They typically pass Filter 1 (exact replication) but fail one or more of Filters 2-4. They include:
- Specification-sensitive variants of robust factors (e.g., alternative momentum definitions, alternative quality metrics)
- Anomalies that work in equal-weighted but not value-weighted portfolios (often driven by micro-cap stocks)
- Anomalies that are statistically significant but economically unviable after transaction costs
- Anomalies that replicate in the U.S. but not internationally (possibly U.S.-specific accounting or institutional artefacts)
- Anomalies that have decayed substantially post-publication, leaving marginal residual premiums
These factors are not necessarily "fake"; many capture genuine but narrow phenomena. They may be useful as secondary tilts within a broader portfolio construction framework, but they are not reliable standalone return sources.
Tier 3: Statistical Noise (250+ factors)
These anomalies fail to replicate under any reasonable standard. They include:
- Calendar effects (weekend effect, turn-of-year, holiday effects) that have largely disappeared
- Microstructure artefacts that reflected data errors, bid-ask bounce, or stale pricing rather than genuine predictability
- Accounting-based anomalies that were specific to a narrow time period or a particular accounting regime
- "Factors" that were clearly the product of extensive specification search, often identifiable by unusual construction choices (e.g., interacting three variables at specific lag structures)
- Factors that duplicate other factors in the zoo but happen to have achieved significance in a particular sample
These factors contribute to the perception of a replication crisis. They are the reason Harvey, Liu, and Zhu proposed the t > 3.0 threshold. They are the reason Hou, Xue, and Zhang found a 64% failure rate. But they are not representative of the factor zoo as a whole β they are the noise that accompanies any large-scale empirical research programme.
Confidence Assessment
We rate our overall confidence in this framework at 4 out of 5.
The areas of highest confidence (5 out of 5):
- The disagreement across replication studies is primarily methodological, not substantive
- A robust core of factors exists that survives all replication filters
- Pure data mining cannot explain the full pattern of evidence (partial decay, international replication, institutional trading response)
The areas of moderate confidence (3-4 out of 5):
- The exact size of the robust core (we estimate 15-25, but the boundaries are fuzzy)
- The proportion of the factor zoo that is pure noise (we estimate 60%+, but this depends on the statistical threshold applied)
- The relative importance of arbitrage versus data mining in explaining post-publication decay
The areas of lower confidence (2-3 out of 5):
- Whether specific factors in the "fragile middle" belong in the robust core or the noise tier
- Whether ML methods will expand the robust core by identifying non-linear relationships or merely optimise combinations of existing factors
- The long-term future of factor premiums: will the robust core persist indefinitely, or will increased capital flows eventually eliminate even these?
Part VI: Comparison to Other Fields
Finance vs. Psychology: Different Crises
The replication crisis in finance is often compared to the replication crisis in psychology, but the two are fundamentally different.
In psychology, the Open Science Collaboration (2015) attempted to replicate 100 published studies and found that only 36% produced significant results in the replication. This was widely interpreted as a crisis of methodology: underpowered studies, p-hacking, and publication bias had produced a literature dominated by false positives.
In finance, the "36% replication rate" from Hou, Xue, and Zhang is superficially similar but mechanistically different. Finance studies typically have much larger sample sizes (thousands of stocks observed monthly for decades), which means they are generally well-powered. The main issues in finance are not underpowered studies but multiple testing (too many hypotheses on the same data) and methodological sensitivity (results that change with construction choices).
Moreover, psychology's replication failures were often absolute: the replicated effect was zero or opposite in sign. Finance's "failures" are often partial: the effect exists but at a lower magnitude, in a different specification, or with marginal significance. This partial replication is more consistent with methodological sensitivity than with outright false discovery.
Finance vs. Medicine: Similar Structural Incentives
The structural incentives in finance are more similar to those in medicine, where John Ioannidis famously argued that "most published research findings are false." In both fields:
- There is intense competition for publication in top journals
- Novel, surprising findings are preferred over null results
- Researchers have substantial degrees of freedom in study design (choice of sample, variables, model specification)
- The same underlying data sources are used by many researchers (CRSP/Compustat in finance; common clinical trial databases in medicine)
The key difference is that finance has a natural out-of-sample test that medicine often lacks: time. A financial anomaly published in 2005 can be tested in post-2005 data to see if it persists. A medical finding about a specific treatment cannot be tested retroactively. This gives finance an inherent self-correction mechanism that has been exploited by McLean and Pontiff and others.
Part VII: Limitations and What We Cannot Claim
Inherited Limitations
This analysis synthesises published results; we did not run new regressions, construct new factors, or test new data. Our conclusions inherit every limitation of the underlying studies.
The CRSP/Compustat monoculture. The overwhelming majority of U.S. asset pricing research uses data from CRSP (stock returns) and Compustat (accounting data). This creates a subtle form of common bias: any systematic errors in these databases (survivorship bias, backfill bias, data corrections, coverage gaps) propagate through the entire literature. A replication study that uses the same CRSP/Compustat data as the original study is testing methodological reproducibility, not data independence. Only international replications (Jacobs and Muller) and alternative data sources (such as the hand-collected pre-1963 data of Linnainmaa and Roberts) provide truly independent evidence.
Publication bias in the replication studies. We noted this earlier but it bears repeating: replication studies that reach dramatic conclusions ("crisis" or "no crisis") are more likely to be published than those with ambiguous findings. Our evidence base may overrepresent the extremes of the replication distribution. Studies that found, say, a 55% replication rate, which is not dramatic in either direction, may sit unpublished in file drawers.
Time-varying risk premiums. Some of what we attribute to post-publication decay (arbitrage) or data mining might actually reflect time-varying risk compensation. If the equity risk premium declined from the 1980s onwards (as some evidence suggests), all factor premiums measured relative to the market would also decline. This confound is difficult to resolve without structural models of time-varying risk, which are beyond the scope of a literature synthesis.
Survivorship bias in factors. There may be a survivorship bias in the factor zoo itself. Factors that happened to work in-sample were published; factors that didn't were abandoned. But the abandoned factors may have included some that would have worked out-of-sample, while some of the published factors would not have. The population of published factors is a biased sample of the population of tested factors.
We cannot identify which specific anomalies are "real." Our three-tier framework describes populations, not individual factors. Assigning a specific anomaly to the robust core, fragile middle, or noise tier requires the detailed empirical analysis that the underlying replication studies have performed. We provide the framework; the 12 studies provide the evidence for classification.
What Would Change Our Conclusions
We believe in stating what evidence would cause us to revise our views:
Evidence that would strengthen H1 (data mining):
- If a comprehensive replication using exact methodology (like Chen and Zimmermann's) found a much lower replication rate (say, 50% instead of 82%), it would suggest that even exact replication fails more often than we currently believe
- If robust core factors (value, momentum) began to show zero or negative premiums in post-2020 data across all markets and specifications, it would challenge the "real but arbitraged" interpretation
Evidence that would strengthen H2 (arbitrage):
- If a natural experiment (such as research on a factor, originally published in one language, suddenly being translated and disseminated to a new market) showed immediate factor decay in that market, it would provide clean causal evidence for the arbitrage channel
- If the growth of factor-based ETFs could be directly linked to factor decay in a Granger-causal framework
Evidence that would weaken H3 (definitional):
- If two replication studies used identical methodology but different anomaly samples and reached opposite conclusions, it would suggest that anomaly characteristics matter more than we believe
- If the variation in replication rates across studies were uncorrelated with methodological differences, our framework would lose its explanatory power
Suggested Empirical Tests
We propose five tests that could further validate or refute our framework:
- Methodology regression: Test whether replication rate (across the 12 studies) is predicted by methodological characteristics (fidelity, threshold, cost treatment) after controlling for anomaly characteristics (average t-statistic, economic mechanism type, size/liquidity concentration). If methodology explains more variance than anomaly characteristics, H3 is strongly supported. A sketch of this regression follows the list.
- Decay heterogeneity: Examine whether "robust core" factors show different post-publication decay patterns from "fragile middle" factors. The robust core should show smaller, slower decay; the fragile middle should show larger, faster decay; the noise tier should show immediate decay to zero.
- ML dominance test: Test whether portfolios constructed from ML combinations of robust core factors alone dominate portfolios that also include fragile middle factors. If the robust core is sufficient, the fragile middle factors add noise, not signal.
- Factor capacity mapping: For each robust core factor, estimate the maximum portfolio size at which the net-of-cost premium remains positive. This would convert the qualitative "survives costs" assessment into a quantitative capacity estimate.
- Real-time replication: Establish a prospective, real-time tracking system that monitors factor returns as they are earned (not retroactively). This eliminates look-ahead bias and provides the cleanest possible test of factor persistence.
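A hypothetical sketch of the first test, the methodology regression, is below. It assumes a hand-built panel with one row per study-anomaly pair; the file name, column names, and logit specification are our own placeholders, not an existing dataset.

```python
# Hypothetical sketch of the methodology regression (test 1 above).
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder panel, one row per (replication study, anomaly) pair, with columns such as:
#   replicated       1 if the study classifies the anomaly as replicated, else 0
#   exact_method     1 if the study used the original paper's construction
#   t_threshold      significance cut-off applied (e.g. 1.96 or 3.0)
#   net_of_costs     1 if transaction costs were deducted
#   orig_t_stat      t-statistic reported in the original publication
#   microcap_share   share of the anomaly's premium coming from micro-caps
study_panel = pd.read_csv("replication_panel.csv")  # placeholder file to be assembled

model = smf.logit(
    "replicated ~ exact_method + t_threshold + net_of_costs + orig_t_stat + microcap_share",
    data=study_panel,
).fit()
print(model.summary())
# If the methodology variables carry most of the explanatory power, H3 is supported.
```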
Part VIII: Implications
For Institutional Investors
Factor selection is largely solved. The replication literature has converged on a credible core of factors: value, momentum, profitability/quality, and low volatility. Debates about which factors are "real" are largely settled for this core group. The remaining open questions are about implementation, not selection.
Implementation is the alpha. The difference between a naively constructed factor portfolio and a sophisticatedly constructed one can be 100-200 basis points annually. This implementation gap, driven by turnover management, rebalancing frequency, capacity-aware sizing, and execution quality, is often larger than the difference between including or excluding a marginal factor. For institutional investors, the primary source of competitive advantage is not factor discovery but factor construction.
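A back-of-the-envelope sketch of that gap, using illustrative turnover and cost assumptions of our own rather than estimates from any cited paper:

```python
# Illustrative only: the same 4% gross premium under two implementation styles.
def net_premium(gross_premium: float, annual_turnover: float, one_way_cost_bps: float) -> float:
    """Annual premium net of trading costs.

    annual_turnover   total traded volume per year as a multiple of portfolio value
    one_way_cost_bps  cost per unit traded, in basis points
    """
    return gross_premium - annual_turnover * one_way_cost_bps / 10_000

naive = net_premium(0.04, annual_turnover=3.0, one_way_cost_bps=40)          # careless rebalancing
sophisticated = net_premium(0.04, annual_turnover=1.0, one_way_cost_bps=10)  # turnover-managed, patient execution

print(f"naive:         {naive:.2%}")          # 2.80%
print(f"sophisticated: {sophisticated:.2%}")  # 3.90%, a gap of roughly 110 bps
```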
Beware the fragile middle. The 50-100 factors in the fragile middle offer a temptation: they backtest well and appear to add diversification. But their fragility (sensitivity to specification, marginal after costs, inconsistent across markets) means they are more likely to add noise than return to a portfolio that already captures the robust core. The burden of proof for including a fragile middle factor should be high: demonstrated robustness across multiple specifications, clear economic mechanism, and positive net-of-cost return at the intended portfolio scale.
Capacity is the binding constraint. Even the robust core factors have finite capacity. As more capital flows into factor-based strategies (through smart beta ETFs, quantitative funds, and institutional mandates), factor premiums will be compressed. The question is not whether this compression will occur but how much capacity remains before premiums are fully arbitraged. Estimates vary widely, but the consensus range for total factor strategy capacity is $1-5 trillion for the major factors combined. Current total assets in explicit factor strategies are approaching this range, suggesting that forward-looking premiums may be smaller than historical estimates.
Monitor for regime change. Factor premiums are not constant β they vary across economic regimes, market conditions, and structural changes. Momentum works well in normal markets but suffers devastating crashes during regime transitions (such as the March 2009 momentum crash). Value has underperformed for extended periods (2010-2020) before reverting. Low volatility strategies are vulnerable to rising rate environments. Institutional investors should maintain regime-aware factor allocation, scaling factor exposures based on market conditions rather than holding static weights.
For Retail Investors
Simplicity beats complexity. The practical implication of the replication literature for retail investors is liberating: you do not need to understand 400+ factors. The investable set is small. A portfolio consisting of broad market exposure (60-70%) plus targeted tilts toward value, momentum, quality, and low volatility (30-40%) captures the vast majority of available factor premiums. These tilts can be accessed through low-cost ETFs with total expense ratios below 30 basis points.
Be sceptical of novel strategies. The replication literature shows that approximately 50-65% of newly published anomalies will fail to replicate under rigorous standards, and most of the remainder will not survive transaction costs. Any investment product marketed on the basis of a single academic anomaly, especially a recently published one, should be viewed with deep scepticism. The burden of proof is on the product provider to demonstrate replication, robustness, and net-of-cost viability.
Time horizon matters more than factor selection. For retail investors with long time horizons (20+ years), the most important investment decision is not which factors to tilt toward but whether to maintain equity exposure through inevitable downturns. The robust factor premiums of 2-4% annually compound significantly over decades, but only for investors who stay invested. Factor timing (trying to rotate between value and momentum based on market conditions) is difficult even for institutional investors and is likely to destroy value for retail investors.
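The arithmetic behind "compound significantly" is worth seeing once; the premium and horizon below are illustrative mid-range values taken from the figures quoted above.

```python
# A 3% annual premium sustained for 25 years roughly doubles terminal wealth
# relative to the benchmark (illustrative; ignores taxes and tracking error).
premium, years = 0.03, 25
print(round((1 + premium) ** years, 2))  # ~2.09x the benchmark's terminal wealth
```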
Avoid the factor zoo entirely if confused. If the factor zoo feels overwhelming, a simple total market index fund is a perfectly reasonable choice. The robust factor premiums are real but moderate (2-4% annually before costs), and achieving them requires discipline: maintaining factor tilts through the inevitable periods of underperformance (which can last 5-10 years for any individual factor). An investor who panic-sells a value tilt after three years of underperformance will earn worse returns than one who simply held the market index throughout.
For Researchers and the Field
Standardise replication reporting. The field would benefit enormously from a standardised replication framework that reports results at multiple levels:
- Statistical replication using the exact original methodology
- Statistical replication using a standardised methodology
- Multiple-testing-adjusted significance
- Economic replication after realistic transaction costs
- Out-of-sample/international replication
Any individual study that reports only one level of replication provides an incomplete picture that can be, and has been, misinterpreted by media, practitioners, and other researchers.
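As a sketch of what such a standardised report could look like, the structure below maps one field to each reporting level listed above; the field names and the illustrative numbers are our own invention, not an existing standard.

```python
# Hypothetical structure for a multi-level replication report (illustrative values only).
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReplicationReport:
    anomaly: str
    t_exact: float                         # level 1: exact original methodology
    t_standardised: float                  # level 2: standardised construction
    passes_multiple_testing: bool          # level 3: e.g. t > 3.0 or FDR-adjusted
    net_of_cost_premium: Optional[float]   # level 4: annual premium after realistic costs
    t_out_of_sample: Optional[float]       # level 5: out-of-sample / international evidence

report = ReplicationReport(
    anomaly="gross_profitability",
    t_exact=3.4,
    t_standardised=2.6,
    passes_multiple_testing=True,
    net_of_cost_premium=0.02,
    t_out_of_sample=2.1,
)
print(report)
```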
Adopt Bayesian methods. The threshold-based framework (significant vs. not significant) creates artificial cliff effects and forces a binary classification on a continuous reality. Bayesian shrinkage methods, as demonstrated by Jensen, Kelly, and Pedersen, provide a more nuanced and accurate representation of the state of knowledge about factor premiums.
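To show the shrinkage intuition (this is a minimal empirical-Bayes sketch, not Jensen, Kelly, and Pedersen's hierarchical model), the example below pulls each factor's estimated alpha toward the cross-factor mean in proportion to how noisy the estimate is; all numbers are invented.

```python
# Minimal empirical-Bayes shrinkage sketch (illustrative numbers only).
import numpy as np

def shrink_alphas(alpha_hat: np.ndarray, se: np.ndarray) -> np.ndarray:
    """Shrink estimated factor alphas toward their cross-sectional mean.

    alpha_hat : point estimates of annual factor alphas
    se        : standard errors of those estimates
    """
    prior_mean = alpha_hat.mean()
    # Prior variance: cross-sectional dispersion in excess of estimation noise.
    prior_var = max(alpha_hat.var() - np.mean(se ** 2), 1e-12)
    weight = prior_var / (prior_var + se ** 2)   # 0 = trust the prior, 1 = trust the data
    return prior_mean + weight * (alpha_hat - prior_mean)

# A noisy 10% alpha is shrunk much harder than a precisely estimated 4% alpha.
alphas = np.array([0.10, 0.04, 0.00])
ses = np.array([0.05, 0.01, 0.02])
print(shrink_alphas(alphas, ses).round(3))  # roughly [0.058, 0.041, 0.017]
```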
Open source is the future. Chen and Zimmermann's open-source replication code repository set a new standard for reproducibility. Future anomaly papers should be required to provide replication code and data, allowing independent verification. This would dramatically reduce the scope for data mining and specification search, because reviewers and readers could test alternative specifications.
Focus on economic mechanisms. The replication literature has been largely empirical β testing whether effects exist, not explaining why. A factor without a compelling economic mechanism is inherently fragile because there is no a priori reason for it to persist. The factors that have survived the most rigorous replication scrutiny (value, momentum, profitability, low volatility) are precisely those with the strongest theoretical foundations. Future factor research should lead with mechanism and follow with evidence, rather than the reverse.
Reduce redundancy. The factor zoo would benefit from consolidation. Rather than publishing the 17th variant of a quality measure, researchers should focus on identifying genuinely independent return sources and understanding the economic mechanisms that drive them. A factor zoo with 15 well-understood, robustly documented factors is more scientifically valuable than one with 400+ poorly understood, partially redundant factors.
This analysis was synthesised from QD Research Engine: Meta-analysis of 12 replication studies (2003-2024), produced by the QD Research Engine, Quant Decoded's automated research platform, and reviewed by our editorial team for accuracy.
References
- Asness, C. S., Moskowitz, T. J., & Pedersen, L. H. (2013). Value and momentum everywhere. Journal of Finance, 68(3), 929–985.
- Asness, C. S., Frazzini, A., Israel, R., Moskowitz, T. J., & Pedersen, L. H. (2018). Size matters, if you control your junk. Journal of Financial Economics, 129(3), 479–509.
- Ball, R., Gerakos, J., Linnainmaa, J. T., & Nikolaev, V. (2016). Accruals, cash flows, and operating profitability in the cross section of stock returns. Journal of Financial Economics, 121(1), 28–45.
- Banz, R. W. (1981). The relationship between return and market value of common stocks. Journal of Financial Economics, 9(1), 3–18.
- Black, F., Jensen, M. C., & Scholes, M. (1972). The Capital Asset Pricing Model: Some empirical tests. In M. C. Jensen (Ed.), Studies in the Theory of Capital Markets (pp. 79–121). Praeger.
- Calluzzo, P., Moneta, F., & Topaloglu, S. (2019). When anomalies are publicized broadly, do institutions trade accordingly? Management Science, 65(10), 4555–4574.
- Carhart, M. M. (1997). On persistence in mutual fund performance. Journal of Finance, 52(1), 57–82.
- Chen, A. Y., & Zimmermann, T. (2022). Open source cross-sectional asset pricing. Critical Finance Review, 11(2), 207–264.
- Chordia, T., Goyal, A., & Saretto, A. (2020). Anomalies and false rejections. Review of Financial Studies, 33(5), 2134–2179.
- Fama, E. F., & French, K. R. (1992). The cross-section of expected stock returns. Journal of Finance, 47(2), 427–465.
- Fama, E. F., & French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33(1), 3–56.
- Fama, E. F., & French, K. R. (2015). A five-factor asset pricing model. Journal of Financial Economics, 116(1), 1–22.
- Frazzini, A., Israel, R., & Moskowitz, T. J. (2015). Trading costs of asset pricing anomalies. Working paper, AQR Capital Management.
- Frazzini, A., & Pedersen, L. H. (2014). Betting against beta. Journal of Financial Economics, 111(1), 1–25.
- Geczy, C. C., & Samonov, M. (2016). Two centuries of price-return momentum. Financial Analysts Journal, 72(5), 32–56.
- Green, J., Hand, J. R. M., & Zhang, X. F. (2017). The characteristics that provide independent information about average U.S. monthly stock returns. Review of Financial Studies, 30(12), 4389–4436.
- Gu, S., Kelly, B., & Xiu, D. (2020). Empirical asset pricing via machine learning. Review of Financial Studies, 33(5), 2223–2273.
- Harvey, C. R., Liu, Y., & Zhu, H. (2016). ...and the cross-section of expected returns. Review of Financial Studies, 29(1), 5–68.
- Hou, K., Xue, C., & Zhang, L. (2015). Digesting anomalies: An investment approach. Review of Financial Studies, 28(3), 650–705.
- Hou, K., Xue, C., & Zhang, L. (2020). Replicating anomalies. Review of Financial Studies, 33(5), 2019–2133.
- Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.
- Jacobs, H., & Muller, S. (2020). Anomalies across the globe: Once public, no longer existent? Journal of Financial Economics, 135(1), 213–230.
- Jegadeesh, N., & Titman, S. (1993). Returns to buying winners and selling losers: Implications for stock market efficiency. Journal of Finance, 48(1), 65–91.
- Jensen, T. I., Kelly, B. T., & Pedersen, L. H. (2023). Is there a replication crisis in finance? Journal of Finance, 78(5), 2465–2518.
- Linnainmaa, J. T., & Roberts, M. R. (2018). The history of the cross-section of stock returns. Review of Financial Studies, 31(7), 2606–2649.
- McLean, R. D., & Pontiff, J. (2016). Does academic research destroy stock return predictability? Journal of Finance, 71(1), 5–32.
- Novy-Marx, R. (2013). The other side of value: The gross profitability premium. Journal of Financial Economics, 108(1), 1–28.
- Novy-Marx, R., & Velikov, M. (2016). A taxonomy of anomalies and their trading costs. Review of Financial Studies, 29(1), 104–147.
- Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
- Piotroski, J. D. (2000). Value investing: The use of historical financial statement information to separate winners from losers. Journal of Accounting Research, 38, 1–41.
- Schwert, G. W. (2003). Anomalies and market efficiency. In G. Constantinides, M. Harris, & R. Stulz (Eds.), Handbook of the Economics of Finance (Vol. 1B, pp. 939–974). Elsevier.
- Sharpe, W. F. (1964). Capital asset prices: A theory of market equilibrium under conditions of risk. Journal of Finance, 19(3), 425–442.
- Sloan, R. G. (1996). Do stock prices fully reflect information in accruals and cash flows about future earnings? Accounting Review, 71(3), 289–315.
- Stambaugh, R. F., & Yuan, Y. (2017). Mispricing factors. Review of Financial Studies, 30(4), 1270–1315.