ML Models Need Better Training Data: The GenAI Solution

Our understanding of financial markets is inherently constrained by historical experience: a single realized timeline among countless possibilities that could have unfolded. Each market cycle, geopolitical event, or policy decision represents just one manifestation of potential outcomes.

This limitation becomes particularly acute when training machine learning (ML) models, which can inadvertently learn from historical artifacts rather than underlying market dynamics. As complex ML models become more prevalent in investment management, their tendency to overfit to specific historical conditions poses a growing risk to investment outcomes.

Generative AI-based synthetic data (GenAI synthetic data) is emerging as a potential solution to this challenge. While GenAI has attracted attention mainly for natural language processing, its ability to generate sophisticated synthetic data may prove even more valuable for quantitative investment processes. By creating data that effectively represents “parallel timelines,” this approach can be designed and engineered to provide richer training datasets that preserve important market relationships while exploring counterfactual scenarios.

The Challenge: Moving Beyond Single Timeline Training

Traditional quantitative models face an inherent limitation: they learn from a single historical sequence of events that led to the present conditions. This creates what we call “empirical bias.” The challenge becomes more pronounced with complex machine learning models, whose capacity to learn intricate patterns makes them particularly vulnerable to overfitting on limited historical data. An alternative approach is to consider counterfactual scenarios: those that might have unfolded if certain, perhaps arbitrary, events, decisions, or shocks had played out differently.

To illustrate these concepts, consider active international equity portfolios benchmarked to the MSCI EAFE. Figure 1 shows the performance characteristics of several portfolios (upside capture, downside capture, and overall relative returns) over the five years ending 31 January 2025.
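For readers unfamiliar with capture ratios, the following is a minimal sketch of one common convention: the compounded portfolio return divided by the compounded benchmark return, measured only over periods in which the benchmark rose (upside) or fell (downside). The function name and the return series are hypothetical and chosen for illustration; they are not the data behind Figure 1, and other conventions (for example, annualized or arithmetic-mean versions) exist.

```python
from math import prod

def capture_ratio(portfolio, benchmark, upside=True):
    """Compounded portfolio return over compounded benchmark return,
    using only periods where the benchmark rose (upside) or fell (downside)."""
    pairs = [
        (p, b)
        for p, b in zip(portfolio, benchmark)
        if (b > 0 if upside else b < 0)
    ]
    if not pairs:
        raise ValueError("no qualifying benchmark periods")
    port_growth = prod(1 + p for p, _ in pairs) - 1
    bench_growth = prod(1 + b for _, b in pairs) - 1
    return port_growth / bench_growth

# Hypothetical periodic returns, for illustration only
benchmark = [0.02, -0.01, 0.03, -0.02]
portfolio = [0.03, -0.005, 0.04, -0.03]

upside = capture_ratio(portfolio, benchmark, upside=True)
downside = capture_ratio(portfolio, benchmark, upside=False)
print(f"upside capture: {upside:.2f}, downside capture: {downside:.2f}")
```

Under this convention, an upside capture above 1 means the portfolio gained more than the benchmark in rising periods, while a downside capture above 1 means it lost more than the benchmark in falling periods.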

Figure 1: Empirical data. EAFE-benchmarked portfolios, five-year performance characteristics to 31 January 2025.

This empirical dataset represents just a small sample of possible portfolios, and an even smaller sample of the outcomes that could have occurred had events unfolded differently. Traditional approaches to expanding this dataset have significant limitations.

Figure 2. Instance-based approaches: K-nearest neighbors (left), SMOTE (right).

Traditional Synthetic Data: Understanding the Limitations

Traditional methods of synthetic data generation try to address data limitations, but often fall short of capturing the complex dynamics of financial markets. Using our EAFE portfolio example, we can examine how different approaches perform:

Instance-based methods such as K-NN and SMOTE extend existing data patterns through local sampling, but remain fundamentally constrained by observed data relationships. They cannot generate scenarios much beyond their training examples, which limits their utility for understanding potential future market conditions.
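To make this constraint concrete, here is a minimal sketch of the interpolation step at the core of SMOTE-style sampling, written with the Python standard library only. The function name, parameters, and the observed points are hypothetical; this is not the code behind Figure 2, and production implementations differ in detail.

```python
import random

def smote_like_samples(points, k=3, n_new=200, seed=0):
    """Create synthetic points by interpolating between an observed point
    and one of its k nearest observed neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        # Rank the other observed points by squared Euclidean distance to `base`
        others = [p for p in points if p is not base]
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        mate = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, mate)))
    return synthetic

# Hypothetical (upside capture, downside capture) pairs, for illustration only
observed = [(1.05, 0.95), (0.98, 0.90), (1.10, 1.02), (0.92, 0.88), (1.01, 0.97)]
synthetic = smote_like_samples(observed)

xs = [x for x, _ in observed]
ys = [y for _, y in observed]
print("observed x range:", min(xs), max(xs))
print("synthetic x range:", min(x for x, _ in synthetic), max(x for x, _ in synthetic))
```

Because every synthetic point lies on a segment between two observed points, the generated data cannot leave the range spanned by the observations, which is the limitation described above.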

Figure 3: More flexible approaches generally improve results but struggle to capture complex market relationships: GMM (left), KDE (right).

Traditional synthetic data generation approaches, whether instance-based methods or density estimation, face fundamental limitations. Although these approaches can extend patterns incrementally, they cannot generate realistic market scenarios that preserve complex inter-relationships while exploring genuinely different market conditions. This limitation becomes particularly clear when we examine density estimation approaches.

Density estimation approaches such as GMM and KDE offer more flexibility in extending data patterns, but still struggle to capture the complex, interconnected dynamics of financial markets. These methods particularly falter during regime changes, when historical relationships may evolve.
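As a rough illustration of why density estimation stays anchored to history, the sketch below samples from a Gaussian kernel density estimate using only the Python standard library: pick an observed point at random, then add Gaussian noise to each coordinate. The function names, the bandwidth, and the data are hypothetical choices for illustration, not the settings used for Figure 3.

```python
import random

def kde_samples(points, bandwidth=0.02, n_new=200, seed=0):
    """Sample from a Gaussian KDE: choose an observed point at random,
    then add independent Gaussian noise to each coordinate."""
    rng = random.Random(seed)
    return [
        tuple(x + rng.gauss(0.0, bandwidth) for x in rng.choice(points))
        for _ in range(n_new)
    ]

def gap_to_nearest(sample, points):
    """Chebyshev (max-coordinate) distance from a sample to the closest observed point."""
    return min(max(abs(s - p) for s, p in zip(sample, pt)) for pt in points)

# Hypothetical (upside capture, downside capture) pairs, for illustration only
observed = [(1.05, 0.95), (0.98, 0.90), (1.10, 1.02), (0.92, 0.88), (1.01, 0.97)]
synthetic = kde_samples(observed)

widest_gap = max(gap_to_nearest(s, observed) for s in synthetic)
print(f"largest gap between a synthetic point and its nearest observation: {widest_gap:.3f}")
```

Every sample is a noisy copy of a historical observation, so the synthetic data can reach slightly beyond the observed range but still clusters around what has already happened. Nothing in the procedure models how relationships might shift under a different regime.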

GenAI Synthetic Data: More Powerful Training

Recent research at City St George's and the University of Warwick, presented at the NYU ACM International Conference on AI in Finance (ICAIF), shows how GenAI can potentially better approximate the underlying data generating function of markets. Through neural network architectures, this approach aims to learn conditional distributions while preserving persistent market relationships.

The Research and Policy Center (RPC) will soon publish a report that defines synthetic data and outlines generative AI approaches that can be used to create it. The report will highlight best methods for evaluating the quality of synthetic data and will draw on existing academic literature to highlight potential use cases.

Figure 4: Illustration of GenAI synthetic data expanding the space of realistic possible outcomes while maintaining key relationships.

This approach to synthetic data generation can be expanded to offer several possible benefits:

  • Expanded training sets: Realistic augmentation of limited financial datasets
  • Scenario exploration: Generation of plausible market conditions while maintaining persistent relationships
  • Tail event analysis: Creation of varied but realistic stress scenarios

As illustrated in Figure 4, GenAI synthetic data approaches aim to expand the space of possible portfolio performance characteristics while respecting fundamental market relationships and realistic bounds. This provides a richer training environment for machine learning models, potentially reducing their vulnerability to historical artifacts and improving their ability to generalize across market conditions.
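One simple way to check whether a synthetic dataset respects relationships and bounds is sketched below: compare the correlation between upside and downside capture in the empirical and synthetic data, and count synthetic points that fall outside a chosen plausibility range. The function names, the bounds, and both datasets are hypothetical; this is an illustrative sanity check under those assumptions, not the evaluation method of the research cited above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def fidelity_report(empirical, synthetic, bounds):
    """Compare the correlation structure of the two datasets and count
    synthetic points outside the stated plausibility bounds."""
    emp_corr = pearson(*zip(*empirical))
    syn_corr = pearson(*zip(*synthetic))
    (lo_x, hi_x), (lo_y, hi_y) = bounds
    out_of_bounds = sum(
        not (lo_x <= x <= hi_x and lo_y <= y <= hi_y) for x, y in synthetic
    )
    return {"corr_gap": abs(emp_corr - syn_corr), "out_of_bounds": out_of_bounds}

# Hypothetical (upside capture, downside capture) pairs, for illustration only
empirical = [(1.05, 0.95), (0.98, 0.90), (1.10, 1.02), (0.92, 0.88), (1.01, 0.97)]
synthetic = [(1.20, 1.08), (0.85, 0.80), (1.00, 0.93), (1.12, 1.03), (0.95, 0.86)]

report = fidelity_report(empirical, synthetic, bounds=((0.5, 1.5), (0.5, 1.5)))
print(report)
```

A small correlation gap and no out-of-bounds points suggest the synthetic data widens the range of outcomes without breaking the relationship seen in the empirical sample. A real evaluation would examine far more than a single pairwise correlation.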

Implementation in Security Selection

For equity selection models, which are particularly susceptible to learning spurious historical patterns, GenAI synthetic data offers three potential benefits:

  1. Reduced overfitting: By training on varied market conditions, models may better distinguish between persistent signals and temporary artifacts.
  2. Enhanced tail risk management: More diverse scenarios in training data could improve model robustness during market stress.
  3. Better generalization: Expanded training data that maintains realistic market relationships may help models adapt to changing conditions.

Implementing effective GenAI synthetic data generation presents its own technical challenges, potentially exceeding the complexity of the investment models themselves. However, our research suggests that successfully addressing these challenges could significantly improve risk-adjusted returns through more robust model training.

The GenAI Path to Better Model Training

GenAI synthetic data has the potential to provide more powerful, forward-looking insights for investment and risk models. Through neural network-based architectures, it aims to better approximate the market's data generating function, potentially enabling a more accurate representation of future market conditions while preserving persistent inter-relationships.

Although this could benefit most investment and risk models, a key reason it represents such an important innovation right now is the increasing adoption of machine learning in investment management and the related risk of overfitting. GenAI synthetic data can generate plausible market scenarios that preserve complex relationships while exploring different conditions. This technology offers a path to more robust investment models.

However, even the most advanced synthetic data cannot compensate for naive machine learning implementations. There is no safe fix for excessive complexity, opaque models, or weak investment rationales.


The Research and Policy Center will host a webinar tomorrow, March 18, featuring Marcos Lopez de Prado, a world-renowned expert in financial machine learning and quantitative research.
