In today’s data-powered investment environment, data quality, availability, and uniqueness can make or break a strategy. Yet investment professionals regularly run into limitations: historical datasets cannot capture emerging risks, alternative data is often incomplete or prohibitively expensive, and open-source models and datasets skew toward major markets and English-language materials.
As firms seek more efficient and forward-looking tools, synthetic data, especially when derived from generative AI (GenAI), is emerging as a strategic asset, offering new ways to simulate market landscapes, train machine learning models, and backtest investment strategies. This post explores how GenAI-generated synthetic data is reshaping investment workflows, from simulating asset correlations to enhancing sentiment models, and what practitioners need to know to evaluate its usefulness and limitations.
What exactly is synthetic data, how is it generated by GenAI models, and why is it increasingly relevant to investment use cases?
Consider two common challenges. A portfolio manager constrained by historical data when optimizing performance across different market regimes cannot account for “what-if” scenarios that have not yet occurred. Similarly, a data scientist monitoring sentiment in German-language news for small-cap stocks may find that most available datasets are in English and focus on large-cap companies, limiting both coverage and relevance. In both cases, synthetic data offers a practical solution.
What sets GenAI-generated synthetic data apart, and why it matters now
Synthetic data refers to artificially generated datasets that replicate the statistical properties of real-world data. The concept is not new: techniques like Monte Carlo simulation and bootstrapping have long supported financial analysis. What has changed is how such data can be generated.
GenAI refers to a class of deep learning models capable of generating high-fidelity synthetic data across modalities such as text, tabular, image, and time-series. Unlike traditional methods, GenAI models learn directly from the data, removing the need for rigid assumptions about the underlying data-generating process. This capability opens up powerful use cases in investment management, especially in areas where real data is scarce, complex, incomplete, or constrained by cost, language, or regulation.
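To ground the contrast, here is a minimal sketch of one of the traditional techniques mentioned above, a simple bootstrap of historical returns. The return series and parameters are toy assumptions, not figures from this article:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "historical" daily returns; in practice these would come from market data.
real_returns = rng.normal(loc=0.0004, scale=0.012, size=1000)

# Classic bootstrap: resample historical returns with replacement to build
# alternative return paths. Note the limitation the article points to:
# every resampled observation already exists in the historical record.
n_paths, horizon = 500, 252
paths = rng.choice(real_returns, size=(n_paths, horizon), replace=True)

cumulative = (1 + paths).cumprod(axis=1)[:, -1] - 1
print(f"Bootstrapped 1y return: mean={cumulative.mean():.2%}, "
      f"5th pct={np.percentile(cumulative, 5):.2%}")
```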
Common GenAI models
There are several types of GenAI models. Variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion-based models, and large language models (LLMs) are the most common. Each is built on neural network architectures, although they differ in size and complexity. These methods have already demonstrated the ability to enhance data-centric workflows within the industry. For example, VAEs have been used to create synthetic volatility surfaces to improve options trading (Bergeron et al. 2021). GANs have proved useful for portfolio optimization and risk management (Zhu, Mariani, and Lee 2020). Diffusion-based models have been used to simulate asset correlation matrices under various market regimes (Kubiak et al. 2024). And LLMs have proved useful for market simulation (Li et al. 2024).
Table 1. Approaches to synthetic data generation.

| Method | Data types produced | Example applications | Generative? |
|---|---|---|---|
| Monte Carlo | Time-series | Portfolio optimization, risk management | No |
| Copula-based methods | Time-series | Credit risk analysis, asset correlation modeling | No |
| Volatility models | Time-series | Volatility forecasting, simulation | No |
| Bootstrapping | Time-series, tabular, text | Confidence intervals, stress-testing | No |
| Variational autoencoder | Tabular, time-series, audio, images | Volatility surface simulation | Yes |
| Generative adversarial network | Tabular, time-series, audio, images | Portfolio optimization, risk management, model training | Yes |
| Diffusion model | Tabular, time-series, audio, images | Correlation modeling, portfolio optimization | Yes |
| Large language model | Text, tabular, images, audio | Sentiment analysis, market simulation | Yes |
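To make one row of Table 1 concrete, the following is a deliberately minimal sketch of a GAN trained on toy daily returns. It uses PyTorch, and the data, architecture, and hyperparameters are illustrative assumptions rather than any setup from this article:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "real" data: daily returns for 5 assets (in practice, market data).
real = 0.01 * torch.randn(2048, 5)

# Tiny generator and discriminator; real applications use richer architectures.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 5))
D = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    batch = real[torch.randint(0, len(real), (128,))]
    fake = G(torch.randn(128, 16))

    # Discriminator: label real samples 1, generated samples 0.
    loss_d = bce(D(batch), torch.ones(128, 1)) + \
             bce(D(fake.detach()), torch.zeros(128, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: try to make the discriminator output 1 on generated samples.
    loss_g = bce(D(fake), torch.ones(128, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# Sample synthetic returns and compare their moments with the real data.
synthetic = G(torch.randn(1000, 16)).detach()
print(synthetic.mean(dim=0), synthetic.std(dim=0))
```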
Synthetic data quality evaluation
Synthetic data should be realistic, matching the statistical properties of the real data. Current evaluation methods fall into two categories: qualitative and quantitative.
Qualitative approaches involve visually comparing the real and synthetic datasets. Examples include comparing plotted distributions, variable pairs, time-series paths, and scatterplots between corresponding matrices. For example, a GAN trained to simulate asset returns for assessing price risk should successfully reproduce the heavy tails of return distributions. A diffusion model trained to generate synthetic correlation matrices under various market regimes should adequately capture asset co-movements.
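A minimal sketch of this kind of visual check is below, using simulated stand-ins for the real and synthetic return samples (the distributions chosen are assumptions for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
real = rng.standard_t(df=3, size=5000) * 0.01   # heavy-tailed "real" returns
synthetic = rng.normal(0.0, 0.012, size=5000)   # candidate synthetic returns

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Overlaid histograms: do the bodies of the distributions match?
axes[0].hist(real, bins=100, alpha=0.5, density=True, label="real")
axes[0].hist(synthetic, bins=100, alpha=0.5, density=True, label="synthetic")
axes[0].set_title("Return distributions"); axes[0].legend()

# Quantile-quantile plot: deviations in the corners expose missing heavy tails.
q = np.linspace(0.001, 0.999, 200)
axes[1].plot(np.quantile(real, q), np.quantile(synthetic, q), ".")
axes[1].axline((0, 0), slope=1, color="gray", linestyle="--")
axes[1].set_title("Q-Q: real vs. synthetic")
plt.tight_layout(); plt.show()
```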
Quantitative approaches include statistical tests for comparing distributions, such as the Kolmogorov-Smirnov test, the population stability index, and Jensen-Shannon divergence. These tests output statistics that indicate how similar two distributions are. For example, the Kolmogorov-Smirnov test outputs a p-value; a value below 0.05 suggests the two distributions are significantly different. Unlike visualization, this provides a more rigorous measure of the similarity between two distributions.
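Here is a sketch of all three metrics on the same toy samples; the PSI cutoff noted in the comments is a common industry rule of thumb, not a threshold stated in this article:

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
real = rng.standard_t(df=3, size=5000) * 0.01   # heavy-tailed stand-in
synthetic = rng.normal(0.0, 0.012, size=5000)   # candidate synthetic sample

# Kolmogorov-Smirnov two-sample test: p < 0.05 suggests the samples come
# from different distributions.
ks = stats.ks_2samp(real, synthetic)
print(f"KS statistic={ks.statistic:.4f}, p-value={ks.pvalue:.4g}")

# Shared bins so the two histograms are directly comparable.
bins = np.histogram_bin_edges(np.concatenate([real, synthetic]), bins=50)
p = np.histogram(real, bins=bins)[0].astype(float) + 1e-6   # avoid log(0)
q = np.histogram(synthetic, bins=bins)[0].astype(float) + 1e-6
p /= p.sum()
q /= q.sum()

# Population stability index; PSI > 0.25 is often read as a significant shift.
psi = np.sum((p - q) * np.log(p / q))
print(f"PSI={psi:.4f}")

# SciPy returns the Jensen-Shannon *distance*; square it for the divergence.
print(f"JS divergence={jensenshannon(p, q) ** 2:.4f}")
```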
Another approach is “train-on-synthetic, test-on-real” (TSTR), where a model is trained on synthetic data and tested on real data. Its performance can then be compared with that of a model trained and tested on real data. If the synthetic data successfully replicates the properties of the real data, the two models should perform similarly.
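A minimal TSTR sketch follows, with a hypothetical tabular dataset and a simple classifier standing in for whatever model your workflow actually uses:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_data(n):  # hypothetical labeled tabular data
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

X_real, y_real = make_data(2000)   # stand-in for the real dataset
X_syn, y_syn = make_data(2000)     # stand-in for the synthetic dataset

X_train, X_test, y_train, y_test = train_test_split(
    X_real, y_real, test_size=0.3, random_state=0)

# Baseline: train on real, test on real.
baseline = LogisticRegression().fit(X_train, y_train)

# TSTR: train on synthetic, test on the same held-out real data.
tstr = LogisticRegression().fit(X_syn, y_syn)

print("real-on-real F1:", f1_score(y_test, baseline.predict(X_test)))
print("TSTR F1:        ", f1_score(y_test, tstr.predict(X_test)))
# If the synthetic data captures the real data's structure, the scores converge.
```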
In action: GenAI-enhanced financial sentiment analysis with synthetic data
To put this into practice, I fine-tuned a small open-source LLM, Qwen 3-0.6B, for financial sentiment analysis using a public dataset of finance-related headlines and social media content known as FiQA-SA.[1] The dataset has 822 training examples, with most sentences classified as “positive” or “negative” sentiment.
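A condensed sketch of this kind of fine-tuning run is below. The checkpoint ID, dataset column names, label mapping, split names, and hyperparameters are my assumptions for illustration; the exact code lives in the repository linked at the end of this post:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Dataset/checkpoint IDs, column names, and splits are assumed here; check
# the dataset card and the linked repository for the actual schema.
dataset = load_dataset("TheFinAI/fiqa-sentiment-classification")
label2id = {"negative": 0, "neutral": 1, "positive": 2}

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-0.6B", num_labels=len(label2id))
model.config.pad_token_id = tokenizer.pad_token_id

def preprocess(batch):
    enc = tokenizer(batch["sentence"], truncation=True, max_length=128)
    enc["labels"] = [label2id[s] for s in batch["sentiment"]]
    return enc

tokenized = dataset.map(preprocess, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qwen-fiqa-sa", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["valid"],  # split name is an assumption
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```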
I then used GPT-4o to generate 800 synthetic training examples. The synthetic dataset generated by GPT-4o was more diverse than the original training data, covering more companies and sentiment expressions (Figure 1). Increasing the diversity of the training data gives the LLM more examples from which to learn to identify sentiment in text, potentially improving model performance on unseen data.
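The generation step can be as simple as the sketch below; the prompt shown is illustrative, not the exact one used for this experiment:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt; the exact prompt used in the study is not shown here.
prompt = (
    "Generate 10 short, realistic financial news headlines about varied "
    "companies and sectors. Label each as positive, negative, or neutral. "
    'Return one JSON object per line: {"sentence": ..., "label": ...}'
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,  # higher temperature encourages more diverse examples
)
print(response.choices[0].message.content)
```

In practice, you would parse, deduplicate, and spot-check the generated examples before adding them to the training set.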
Figure 1. Distribution of sentiment classes for the real (left), augmented (middle), and synthetic (right) training datasets.
Table 2. Example sentences from the real and synthetic training datasets.

| Sentence | Class | Source |
|---|---|---|
| Slip in Weir leads FTSE down from high. | Negative | Real |
| AstraZeneca wins FDA approval for major new lung cancer pill. | Positive | Real |
| Shell and BG shareholders to vote on deal in late January. | Neutral | Real |
| Tesla’s quarterly report shows a 15% increase in vehicle deliveries. | Positive | Synthetic |
| PepsiCo is holding a press conference to address a recent product recall. | Neutral | Synthetic |
| The CEO of Home Depot unexpectedly steps down amid internal disputes. | Negative | Synthetic |
After fine-tuning a second model on a combination of real and synthetic data using the same training process, the weighted F1-score on the validation dataset increased by nearly 10 percentage points (Table 3), with a final F1-score of 82.37% on the test dataset.
Table 3. Model performance on the FiQA-SA validation dataset.

| Model | Weighted F1-score |
|---|---|
| Model 1 (real) | 75.29% |
| Model 2 (real + synthetic) | 85.17% |
I found that increasing the ratio of synthetic data too far had a negative effect. For optimal results, there is a Goldilocks zone between too much and too little synthetic data.
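A sketch of how you might search for that zone is below. The data generator and ratios are purely illustrative, and with toy data the exact rise-then-fall pattern reported above will vary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def make_data(n, noise=0.5):  # hypothetical stand-in for labeled examples
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=noise, size=n) > 0).astype(int)
    return X, y

X_real, y_real = make_data(800)
X_syn, y_syn = make_data(2400, noise=0.8)  # synthetic: similar but noisier

X_tr, X_te, y_tr, y_te = train_test_split(X_real, y_real, test_size=0.3,
                                          random_state=1)

# Sweep the synthetic-to-real ratio and track held-out F1 to locate the zone.
for ratio in [0.0, 0.5, 1.0, 2.0, 3.0]:
    k = int(ratio * len(X_tr))
    X_aug = np.vstack([X_tr, X_syn[:k]])
    y_aug = np.concatenate([y_tr, y_syn[:k]])
    model = LogisticRegression().fit(X_aug, y_aug)
    print(f"synthetic:real = {ratio:.1f} -> F1 = "
          f"{f1_score(y_te, model.predict(X_te)):.3f}")
```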
Not a silver bullet, but a valuable tool
Synthetic data is not a replacement for real data, but it is worth experimenting with. Choose a method, evaluate the quality of the synthetic data, and conduct A/B tests in a sandbox environment where you compare workflows with different proportions of synthetic data. You may be surprised by the results.
You can find all the code and datasets in the RPC Labs GitHub repository and dive deeper into the LLM case study in the Research and Policy Center report “Synthetic Data in Investment Management.”
[1] The dataset is available for download here: https://huggingface.co/datasets/TheFinAI/fiqa-sentiment-classification