Are AI-driven investment platforms classified as high-risk under the EU AI Act?

In most cases, no. Annex III of the AI Act names only two financial use cases as high-risk: creditworthiness assessment of natural persons, and risk assessment and pricing in life and health insurance. Portfolio analytics, robo-advice and AI-augmented research are governed primarily by MiFID II, ESMA guidelines and the AI Act's general transparency obligations rather than the high-risk regime.

What is the Probability of Backtest Overfitting and why does it matter for AI platforms?

PBO, defined by Bailey, Borwein, López de Prado and Zhu in 2014, estimates the probability that a strategy ranked best in-sample will underperform the median out-of-sample, using combinatorially symmetric cross-validation. It matters acutely for AI platforms because LLMs can propose thousands of strategy variants per day, multiplying the selection-bias problem that PBO was designed to deflate; a PBO above roughly 0.5 means the selection process is worse than random.

Where should a large language model sit in the architecture of a compliant investment platform?

At the explanation and hypothesis-generation layer, never at the scoring layer. Deterministic statistics — Deflated Sharpe, PBO, regime-conditional metrics — must own the evaluation of any strategy, because only they account for the number of trials run. The LLM's job is to propose candidates for that deterministic engine to score, and to explain the results to users in natural language, with the AI nature of the interaction disclosed under Article 50 of the AI Act.

The EU AI Act and AI Investment Platforms: What Operators Need to Know

The EU AI Act entered into force on 1 August 2024 and applies in stages between February 2025 and August 2027. For anyone building or operating an AI investment platform in the EU, the first job is to read the text instead of the headlines: most retail trading and portfolio analytics do not sit in the Act's high-risk regime at all. They sit under MiFID II, with the AI Act layering specific transparency and general-purpose AI obligations on top. Conflating the two regimes is the single most common compliance error I see, and it produces both over-engineering and false comfort.

This post walks through what the AI Act actually says about financial use cases, what investment platforms are still obliged to do under MiFID II suitability and ESMA guidelines, and how a platform like ARIA Analyst structures its stack so that each layer — deterministic math, machine learning, and large language models — is doing the job it is qualified to do. If you want background on the underlying metrics, our glossary defines Deflated Sharpe, PBO and regime detection in operator-friendly language.

What the EU AI Act actually classifies as high-risk in finance

The high-risk list lives in Annex III of the AI Act. For the financial sector, only two use cases are explicitly named, both in point 5: AI systems used to evaluate the creditworthiness of natural persons or establish their credit score (point 5(b)), and AI systems used for risk assessment and pricing in life and health insurance (point 5(c)). That is the whole financial perimeter of Annex III. Algorithmic trading, robo-advice, portfolio optimisation, news scoring, factor models and AI augmentation of equity research are not listed there. There is no threshold in the Act about a tool moving asset values by a given percentage in a given window. That idea is not in the law, and operators who design controls around an imaginary threshold will fail audits on the real obligations they missed.

Investment platforms still face heavy regulation: MiFID II suitability and appropriateness, the ESMA Guidelines on certain aspects of the MiFID II suitability requirements (updated in 2023 to cover sustainability), the 2024 ESMA Public Statement on the use of artificial intelligence in the provision of retail investment services, product governance under MiFID II Article 16, and — where applicable — the DORA framework for ICT risk. None of that disappears because the AI Act exists. The AI Act adds, it does not replace.

What the AI Act does add for investment platforms

Three pieces of the AI Act bite on investment platforms even when the Annex III high-risk list does not. First, Article 50 transparency: if your platform interacts with users in natural language (a chat assistant, an explanation engine, a Q&A about a portfolio), users must be informed that they are interacting with an AI system, unless this is obvious from the context. Second, the general-purpose AI obligations in Chapter V apply to the providers of the foundation models you integrate rather than to you, but they shape your vendor diligence: you need to know the model's documentation, training data summary, and limitations. Third, the prohibitions in Article 5 (manipulative or subliminal techniques, exploitation of vulnerabilities) apply to everyone, and a sales funnel that uses AI to push leveraged products to retail users on the basis of inferred emotional state is exactly the kind of practice they target.

Operators should also track the AI Act's interaction with the GDPR for any profiling of users, with the Digital Operational Resilience Act for incident reporting, and with national supervisors. In Spain, CNMV's 2023 communication on AI in investment services is the most concrete domestic guidance available today.

The metrics layer: what "transparency" really means for a quantitative platform

Where the AI Act demands transparency, regulators do not actually want a thousand-page model card. They want defensible numbers. For a quantitative investment platform, transparency means being able to show that the performance figures you publish have been corrected for the biases that make backtests look better than reality. Two metrics matter here and are routinely misreported in marketing copy.

Deflated Sharpe Ratio (DSR): introduced by Bailey and López de Prado in 2014, the DSR adjusts a strategy's observed Sharpe ratio for the number of trials run, the higher moments of the return distribution, and the variance of Sharpes across the trials. It produces the probability that the true Sharpe is positive given how much you searched. A raw Sharpe of 1.5 on a strategy selected from 2,000 backtests can fall short of the Sharpe that luck alone would be expected to produce across that many trials, which leaves a DSR too low to support any claim of skill.
Probability of Backtest Overfitting (PBO): defined by Bailey, Borwein, López de Prado and Zhu in 2014, PBO uses combinatorially symmetric cross-validation to estimate the probability that the strategy ranked best in-sample will underperform the median out-of-sample. A PBO above roughly 0.5 means your selection process is worse than random.
Regime classification: a hidden Markov or change-point model identifies distinct market states (low-vol trending, high-vol mean-reverting, crisis, etc.). It matters because a single full-sample Sharpe is an average of incompatible behaviours, and calibrating on the wrong regime is the second most common source of out-of-sample failure after selection bias itself.
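To make the deflation concrete, here is a minimal sketch of the published DSR formula using only Python's standard library. It is an illustration under stated assumptions, not ARIA Analyst's implementation: the Sharpe ratio and the variance of Sharpes across trials are expressed per period (daily here, not annualised), kurtosis is the raw value (3 for a normal distribution), and the function names and example inputs are hypothetical.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def expected_max_sharpe(n_trials: int, sharpe_variance: float) -> float:
    """Sharpe that luck alone is expected to produce as the best of n_trials."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    z_a = STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    z_b = STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * (
        (1.0 - EULER_GAMMA) * z_a + EULER_GAMMA * z_b
    )


def deflated_sharpe_ratio(
    sharpe: float,           # observed per-period Sharpe of the selected strategy
    n_obs: int,              # number of return observations
    n_trials: int,           # number of strategy variants that were backtested
    sharpe_variance: float,  # variance of per-period Sharpes across the trials
    skew: float = 0.0,
    kurtosis: float = 3.0,
) -> float:
    """Probability that the true Sharpe beats the luck benchmark."""
    benchmark = expected_max_sharpe(n_trials, sharpe_variance)
    spread = math.sqrt(
        1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe ** 2
    )
    z = (sharpe - benchmark) * math.sqrt(n_obs - 1) / spread
    return STD_NORMAL.cdf(z)


# Illustrative inputs (assumptions): annualised Sharpe 1.5 over five years of
# daily data, and an annualised dispersion of 0.5 in Sharpes across trials.
daily_sharpe = 1.5 / math.sqrt(252)
trial_variance = (0.5 / math.sqrt(252)) ** 2
dsr_few = deflated_sharpe_ratio(daily_sharpe, 1260, 2, trial_variance)
dsr_many = deflated_sharpe_ratio(daily_sharpe, 1260, 2000, trial_variance)
```

With these assumed inputs the same observed Sharpe comfortably clears the benchmark after two trials, while after 2,000 trials the DSR drops below 0.5. The only thing that changed is how hard the search was.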

Why LLM-driven investment ideas make DSR and PBO non-optional

The deepest reason DSR and PBO matter for AI investment platforms is selection bias. Both metrics were designed for a world where a human researcher tests dozens or hundreds of strategy variants. An LLM-driven idea engine can propose thousands of strategy variants per day, and a naïve pipeline that backtests each one and surfaces the top performer will produce spectacular in-sample Sharpes that are pure noise. The more capable the language model, the worse the problem gets, because the search space expands faster than the data does.
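The selection effect described above can be measured with combinatorially symmetric cross-validation. The sketch below is a simplified, standard-library version of that procedure: it splits the return history into blocks, treats every half of the blocks as in-sample in turn, picks the in-sample winner, and records where that winner ranks out-of-sample. The function names, the block count of 8, the use of the Sharpe ratio as the performance measure, and the simulated return series are all assumptions made for illustration.

```python
import itertools
import math
import random
import statistics


def _sharpe(returns):
    spread = statistics.pstdev(returns)
    return statistics.fmean(returns) / spread if spread > 0 else 0.0


def probability_of_backtest_overfitting(returns_by_strategy, n_blocks=8):
    """Share of splits where the in-sample winner lands at or below the
    out-of-sample median. Ties at the median count as overfit (an assumption)."""
    if n_blocks < 2 or n_blocks % 2:
        raise ValueError("n_blocks must be a positive even number")
    n_strategies = len(returns_by_strategy)
    block_len = len(returns_by_strategy[0]) // n_blocks  # leftover rows dropped
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]

    logits = []
    for in_sample in itertools.combinations(range(n_blocks), n_blocks // 2):
        is_idx = [t for b in in_sample for t in blocks[b]]
        oos_idx = [t for b in range(n_blocks) if b not in in_sample
                   for t in blocks[b]]
        is_perf = [_sharpe([s[t] for t in is_idx]) for s in returns_by_strategy]
        oos_perf = [_sharpe([s[t] for t in oos_idx]) for s in returns_by_strategy]

        best = max(range(n_strategies), key=is_perf.__getitem__)
        # Out-of-sample rank of the in-sample winner: 1 = worst, N = best.
        rank = sum(p <= oos_perf[best] for p in oos_perf)
        omega = rank / (n_strategies + 1)
        logits.append(math.log(omega / (1.0 - omega)))

    return sum(value <= 0 for value in logits) / len(logits)


# Simulated data: 40 zero-skill strategies, plus one with a large true edge.
rng = random.Random(7)
noise = [[rng.gauss(0.0, 0.01) for _ in range(400)] for _ in range(40)]
skilled = [rng.gauss(0.01, 0.01) for _ in range(400)]

pbo_noise = probability_of_backtest_overfitting(noise)
pbo_with_skill = probability_of_backtest_overfitting(noise + [skilled])
```

In the second call the strategy with a genuine edge wins both in-sample and out-of-sample on every split, so the estimate is zero. In the first call every "winner" is a product of the search itself, and PBO reports how often that winner fails to hold up.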

A defensible AI investment platform therefore cannot use LLMs at the metrics layer at all. The LLM is allowed to generate hypotheses and explain results; it is not allowed to score them. Scoring has to happen with deterministic statistics that count the number of trials and penalise accordingly. This is the single most important design choice in the entire space, and it is the dividing line between platforms that survive a regulator's scrutiny and platforms that do not.

How ARIA Analyst applies this

ARIA Analyst is built as an explicitly stacked architecture so that each layer does only what its mathematics permits. The metrics layer is fully deterministic: a 5-agent scoring core computes Deflated Sharpe, PBO via combinatorially symmetric cross-validation, regime-conditional returns, drawdown geometry and turnover-adjusted alpha on every strategy variant. No language model touches those numbers. The machine learning layer handles calibration and regime classification: hidden Markov models for state identification, isotonic regression for probability calibration, gradient-boosted ensembles for short-horizon nowcasting. The LLM layer sits on top of both and is restricted to two jobs: generating candidate hypotheses for the deterministic layer to evaluate, and explaining the deterministic layer's output to the user in natural language.
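One way to picture that separation is as three functions with narrow signatures, where the language model can only produce text and the scorer can only produce numbers. The sketch below is hypothetical and is not ARIA Analyst's code: the type names, the stub functions, the disclosure wording and the stub scores are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Candidate:
    strategy_id: str
    description: str  # free text from the hypothesis layer


@dataclass(frozen=True)
class Scorecard:
    strategy_id: str
    deflated_sharpe: float
    n_trials: int      # how many candidates this search scored
    produced_by: str   # provenance tag for the audit trail


Proposer = Callable[[int], List[Candidate]]   # LLM layer: ideas only
Scorer = Callable[[Candidate, int], float]    # deterministic layer: numbers only
Explainer = Callable[[Scorecard], str]        # LLM layer: words only

AI_DISCLOSURE = "You are interacting with an AI system."  # hypothetical wording


def run_pipeline(
    propose: Proposer, score: Scorer, explain: Explainer, n_ideas: int
) -> Tuple[Scorecard, str]:
    candidates = propose(n_ideas)
    n_trials = len(candidates)  # the scorer is always told how wide the search was
    cards = [
        Scorecard(c.strategy_id, score(c, n_trials), n_trials, "deterministic_scoring")
        for c in candidates
    ]
    best = max(cards, key=lambda card: card.deflated_sharpe)
    return best, f"{AI_DISCLOSURE} {explain(best)}"


# Stubs standing in for the real layers (all values are made up).
STUB_SCORES = {"mom-12m": 0.62, "rev-5d": 0.18}


def stub_propose(n_ideas: int) -> List[Candidate]:
    ideas = [
        Candidate("mom-12m", "12-month momentum"),
        Candidate("rev-5d", "5-day reversal"),
    ]
    return ideas[:n_ideas]


def stub_score(candidate: Candidate, n_trials: int) -> float:
    # A real scorer would deflate by n_trials; the stub only looks up a number.
    return STUB_SCORES[candidate.strategy_id]


def stub_explain(card: Scorecard) -> str:
    return (
        f"{card.strategy_id} ranked first with a deflated Sharpe of "
        f"{card.deflated_sharpe:.2f} across {card.n_trials} trials."
    )


best_card, message = run_pipeline(stub_propose, stub_score, stub_explain, n_ideas=2)
```

The design point is that the explainer receives a finished scorecard and returns a string. Nothing in its signature lets it change a number, and every scorecard carries the trial count and a provenance tag that an auditor can follow.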

That stacking is what makes ARIA's AI augmentation legible under both the AI Act and MiFID II. When a user asks why a holding was downgraded, the chain of reasoning is fully reconstructable: the deterministic agents produced specific numbers, the ML layer assigned a regime probability, and the LLM rendered the explanation. Article 50 of the AI Act is satisfied by surfacing that the assistant is an AI system; MiFID II suitability is satisfied because the recommendation rests on auditable quantitative criteria, not on a chat model's preferences. You can read more about this layered approach in our methodology page, and our blog index covers each layer in depth.

What to verify in your own platform

Three checks separate platforms that will pass a 2026 audit from platforms that will not. First, open any document that reports a Sharpe ratio. If the words "deflated", "trials" or "selection bias" do not appear, you are looking at a number that has not been corrected for how hard you searched. Second, find the regime model. If performance is reported as a single line rather than conditioned on at least two market states, the calibration is implicitly assuming the future will look like the average of the past. Third, trace any user-facing recommendation back to its source. If the chain ends at a language model output rather than a deterministic scoring step, the platform is using the LLM at the wrong layer.
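The three checks can be roughed out as simple predicates. This is a hedged sketch rather than an audit tool: keyword matching is a crude first filter, and the function names, the layer labels and the sample inputs are hypothetical.

```python
from typing import Mapping, Sequence

SELECTION_BIAS_TERMS = ("deflated", "trials", "selection bias")


def sharpe_looks_uncorrected(report_text: str) -> bool:
    """Check 1: a Sharpe is reported but no correction vocabulary appears."""
    text = report_text.lower()
    return "sharpe" in text and not any(
        term in text for term in SELECTION_BIAS_TERMS
    )


def lacks_regime_conditioning(performance_by_regime: Mapping[str, float]) -> bool:
    """Check 2: performance is not broken out over at least two market states."""
    return len(performance_by_regime) < 2


def lacks_deterministic_source(trace_back: Sequence[str]) -> bool:
    """Check 3: tracing a recommendation back does not end at deterministic scoring.

    trace_back lists layer labels from the user-facing output back to the source.
    An empty trace is treated as a failure because nothing can be reconstructed.
    """
    return not trace_back or trace_back[-1] != "deterministic_scoring"


# Sample inputs, invented for illustration.
raw_report = "Strategy X delivered a Sharpe ratio of 1.8 in the backtest."
corrected_report = "Deflated Sharpe is reported after counting all trials."

flags = {
    "raw_report": sharpe_looks_uncorrected(raw_report),
    "corrected_report": sharpe_looks_uncorrected(corrected_report),
    "single_line_performance": lacks_regime_conditioning({"all_periods": 1.1}),
    "two_regimes": lacks_regime_conditioning({"low_vol": 1.4, "crisis": -0.3}),
    "llm_only_chain": lacks_deterministic_source(["llm_explanation"]),
    "full_chain": lacks_deterministic_source(
        ["llm_explanation", "regime_model", "deterministic_scoring"]
    ),
}
```

A flag set to True marks something to investigate, not a finding. A report can use the right vocabulary and still apply the correction badly, so the predicates narrow down where a reviewer should look first.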

None of these checks are exotic. They are exactly what a competent compliance officer would ask, and they are exactly what the AI Act's transparency provisions and MiFID II's suitability provisions are designed to make visible. The platforms that get this right will look boring on the marketing page and unimpeachable in the audit. That is the trade-off worth making.

Conclusion

The EU AI Act is narrower for investment platforms than the press coverage suggests, and the real obligations are split between Article 50 transparency, the Article 5 prohibitions, vendor diligence on general-purpose AI providers, and the unchanged weight of MiFID II. What unifies all of them is a demand that AI-driven investment products be explainable in terms a regulator can audit. The technical answer is a stacked architecture: deterministic math at the metrics layer, machine learning at the calibration layer, language models at the explanation layer, and never the other way around. ARIA Analyst is built on exactly that separation, and you can see how it shows up in product on our pricing page.
