How AI Scores Stocks: A Look Inside Deterministic Multi-Agent Analysis
A technical walkthrough of how deterministic multi-agent systems score stocks: feature extraction, weight derivation, confidence calibration, and why reproducibility matters more than novelty.
Ask ten finance Twitter accounts how AI scores a stock and you will get ten variations of the same answer: "it reads the 10-K and tells you what it thinks." That answer is not wrong, but it describes language-model summarization, not scoring. Scoring is the act of mapping a high-dimensional feature vector (fundamentals, technicals, macro context, sentiment, risk) into a single number that is comparable across tickers and stable across time. Language models, on their own, are bad at this. They are creative, fluent, and irreproducible. A good scoring system is the opposite: boring, mechanical, and reproducible to the last decimal.
This article walks through the architecture ARIA Analyst uses to score equities, and, more importantly, why each design decision was made. If you are evaluating any AI-driven investment platform, the questions below are the ones you should be asking.
The two camps: LLM-as-analyst vs. deterministic scoring
Modern AI investment tools fall into two broad camps. The first treats the LLM as a junior analyst: you feed it earnings transcripts and SEC filings, and it writes a memo. ChatGPT, Claude, and most "AI investment research" wrappers operate this way. The second camp treats the LLM as a peripheral component: a narrator at the end of a deterministic pipeline. Bloomberg-style quant models, Morningstar Quantitative Rating, and ARIA Analyst sit here.
The distinction matters because of one property: reproducibility. If you score AAPL on a Tuesday morning and the model returns 72, you need that same model to return 72 on Tuesday afternoon, assuming inputs have not changed. LLMs do not give you that guarantee. Temperature, sampling, and token-level randomness mean the same prompt can produce different outputs. For an "is this a buy" question, that is fatal. You cannot backtest a strategy whose decision rule changes between runs.
Deterministic scoring solves this by treating the score as a pure function of measurable features. Given the same fundamentals, technicals, macro state, and price history, the score is identical. There is no temperature parameter to fiddle with. There is no "creative" interpretation. The score is what the function says it is.
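To make "pure function" concrete, here is a minimal Python sketch. The function names, feature values, and weights are hypothetical placeholders for illustration, not ARIA's actual code or numbers. The point is only that the score depends on nothing but its arguments, so you can fingerprint the inputs and verify that two runs match.

```python
import hashlib
import json


def composite_score(features: dict, weights: dict) -> float:
    """Pure function: the result depends only on the arguments passed in."""
    total_weight = sum(weights.values())
    raw = sum(weights[name] * features[name] for name in sorted(weights))
    return round(raw / total_weight, 4)


def input_fingerprint(features: dict, weights: dict) -> str:
    """Hash the inputs so a stored score can be tied to exactly what produced it."""
    payload = json.dumps({"features": features, "weights": weights}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical sub-scores (0-100) and weights, for illustration only.
features = {"fundamental": 78.0, "technical": 64.0, "macro": 55.0,
            "sentiment": 70.0, "risk": 81.0}
weights = {"fundamental": 0.30, "technical": 0.20, "macro": 0.15,
           "sentiment": 0.15, "risk": 0.20}

morning = composite_score(features, weights)
afternoon = composite_score(features, weights)
same_inputs = input_fingerprint(features, weights) == input_fingerprint(dict(features), dict(weights))
print(morning, afternoon, same_inputs)
```

If the fingerprint is unchanged and the score differs, something in the pipeline is not deterministic, and that is a bug rather than a feature.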
What features actually matter
Before you can score, you have to decide what to measure. There is an enormous literature on which features predict returns, most of it inconclusive. The honest answer is that the predictive signal in any single feature is small (R² in the low single-digit percentages for most equity factors over rolling windows), and what actually works is combining many weak signals into a single robust composite.
ARIA uses five families of features, each handled by a dedicated agent:
- Fundamental: valuation (P/E, EV/EBITDA, P/B), profitability (ROE, ROIC, gross margin trends), growth (revenue and EPS CAGR), and leverage (net debt/EBITDA, interest coverage). These are the slowest-moving features and the ones most prone to look-ahead bias if you are not careful about reporting lags.
- Technical: momentum across 1M/3M/6M/12M windows, mean reversion signals (RSI extremes, Bollinger band position), volume-weighted price patterns, and volatility regime. Pure technicals are noisy on their own but combine well with fundamentals to time entries.
- Macro: yield curve shape, dollar strength, sector rotation regime, real rates, and credit spreads. Macro features are particularly important for cyclicals: selling a homebuilder when real rates are rising is a different bet than selling one when rates are falling.
- Sentiment: news flow tone (FinBERT-style classification on a 7-day window), insider transactions, short interest changes, and options skew. Sentiment alone is a contrarian indicator at extremes; in the middle, it is mostly noise.
- Risk: idiosyncratic volatility, beta to factors (size, value, momentum, quality), drawdown history, and tail risk metrics. Risk features rarely move the score on their own; their job is to flag positions where a high composite score should be discounted because the asset is fragile.
Each agent emits a sub-score on a 0–100 scale and a confidence value. The confidence is not subjective; it is derived from the dispersion of the underlying features (high dispersion → low confidence) and from data freshness (stale data → low confidence). This is the first place where a lot of consumer AI tools cut corners: they output a number with no honest uncertainty around it.
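One way to turn "dispersion and freshness" into a number is sketched below. The exact formula is an assumption made for illustration: the article does not specify how ARIA combines the two, so the 50-point dispersion scale, the 30-day half-life, and the names `AgentOutput` and `run_agent` are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class AgentOutput:
    name: str
    sub_score: float   # 0-100
    confidence: float  # 0-1


def run_agent(name, feature_scores, data_age_days, half_life_days=30.0):
    """Average per-feature scores (each 0-100) and derive a mechanical confidence."""
    sub_score = mean(feature_scores)
    # 50 is the largest population std-dev possible for values in [0, 100].
    dispersion = min(pstdev(feature_scores) / 50.0, 1.0)
    # Assumed exponential decay: confidence halves every `half_life_days`.
    freshness = 0.5 ** (data_age_days / half_life_days)
    confidence = (1.0 - dispersion) * freshness
    return AgentOutput(name, round(sub_score, 2), round(confidence, 3))


agree_fresh = run_agent("fundamental", [70, 72, 68, 71], data_age_days=0)
disagree_fresh = run_agent("fundamental", [95, 40, 88, 58], data_age_days=0)
agree_stale = run_agent("fundamental", [70, 72, 68, 71], data_age_days=90)

for output in (agree_fresh, disagree_fresh, agree_stale):
    print(output)
```

All three calls produce the same sub-score, but the confidence drops when the underlying features disagree and drops again when the data is three months old. That difference is exactly what a bare point estimate hides.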
How weights are derived
The naive approach is to give each agent an equal vote: five agents, 20% each. This is what most retail products do, and it is wrong for one obvious reason: the predictive power of fundamentals over a 12-month horizon is not the same as the predictive power of technicals over the same horizon. Treating them as equivalent throws away information.
ARIA derives weights empirically using rolling cross-validation. Concretely, for each of nine regime × horizon bundles (low/medium/high volatility regimes × 1-month, 3-month, 12-month horizons), we run a constrained linear regression of forward returns on agent sub-scores. The coefficients are clipped to a sensible range (no single agent can carry more than 35% of the weight) to prevent overfitting, and the regression is refit monthly using walk-forward windows, never the full sample. The output is a 5-vector of weights per regime × horizon combination.
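The following Python sketch shows one way such a fit could look. It is not ARIA's implementation: the solver (projected gradient descent), the non-negativity and sum-to-one constraints, the window sizes, and the synthetic data are all assumptions chosen to keep the example self-contained. Only the 35% cap and the walk-forward refit come from the description above.

```python
import random

MAX_WEIGHT = 0.35  # cap from the text: no agent carries more than 35%


def project_to_capped_simplex(v, cap=MAX_WEIGHT):
    """Euclidean projection onto {w : 0 <= w_i <= cap, sum(w) = 1} via bisection."""
    lo, hi = min(v) - cap, max(v)
    for _ in range(100):
        tau = (lo + hi) / 2.0
        total = sum(min(max(x - tau, 0.0), cap) for x in v)
        if total > 1.0:
            lo = tau
        else:
            hi = tau
    tau = (lo + hi) / 2.0
    return [min(max(x - tau, 0.0), cap) for x in v]


def fit_weights(X, y, steps=500, lr=0.2):
    """Constrained least squares of forward returns on sub-scores."""
    n, k = len(X), len(X[0])
    w = [1.0 / k] * k
    for _ in range(steps):
        residuals = [sum(wj * xj for wj, xj in zip(w, row)) - yi
                     for row, yi in zip(X, y)]
        grad = [2.0 * sum(r * row[j] for r, row in zip(residuals, X)) / n
                for j in range(k)]
        w = project_to_capped_simplex([wj - lr * gj for wj, gj in zip(w, grad)])
    return w


def walk_forward_weights(X, y, train_size, step):
    """Refit on a trailing window only, never on the full sample."""
    fits = []
    for end in range(train_size, len(X) + 1, step):
        window = slice(end - train_size, end)
        fits.append((end, fit_weights(X[window], y[window])))
    return fits


# Synthetic data (assumption): five sub-scores scaled to [0, 1], and forward
# returns driven mostly by the first agent plus Gaussian noise.
rng = random.Random(0)
true_w = [0.50, 0.20, 0.15, 0.10, 0.05]
X = [[rng.random() for _ in range(5)] for _ in range(240)]
y = [sum(t * x for t, x in zip(true_w, row)) + rng.gauss(0.0, 0.05) for row in X]

fits = walk_forward_weights(X, y, train_size=120, step=60)
for end, w in fits:
    print(end, [round(wj, 3) for wj in w])
```

In this toy setup the first agent "deserves" 50% of the weight, and the cap holds it at 35% while the remainder is spread across the other four. That is the intended behavior: the cap trades a little in-sample fit for protection against one agent dominating on a lucky window.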
Why bother with regime conditioning? Because the same feature has different meaning in different markets. Momentum was a fantastic signal in 2017 and a terrible one in March 2020. Value paid in 2022 and underperformed in 2023. A scoring system that uses a single static set of weights is implicitly assuming the market is stationary, which it is not. Work through Marcos López de Prado's Advances in Financial Machine Learning if you want the rigorous version of this argument; the short version is that any model trained on a single market regime will be unstable out of sample.
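As a sketch of what regime conditioning means mechanically, the Python below looks up a weight vector by (regime, horizon) before computing the composite. Every number here is a made-up placeholder: the volatility thresholds, the nine weight vectors, and the function names are assumptions for illustration, not ARIA's fitted values.

```python
AGENTS = ("fundamental", "technical", "macro", "sentiment", "risk")

# Illustrative weights only: one 5-vector per regime x horizon bundle.
# Each row sums to 1 and respects the 35% cap described earlier.
WEIGHTS = {
    ("low", "1M"):     (0.15, 0.35, 0.15, 0.20, 0.15),
    ("low", "3M"):     (0.25, 0.30, 0.15, 0.15, 0.15),
    ("low", "12M"):    (0.35, 0.15, 0.20, 0.10, 0.20),
    ("medium", "1M"):  (0.15, 0.30, 0.20, 0.20, 0.15),
    ("medium", "3M"):  (0.25, 0.25, 0.20, 0.15, 0.15),
    ("medium", "12M"): (0.35, 0.10, 0.25, 0.10, 0.20),
    ("high", "1M"):    (0.10, 0.20, 0.25, 0.20, 0.25),
    ("high", "3M"):    (0.20, 0.15, 0.25, 0.15, 0.25),
    ("high", "12M"):   (0.30, 0.10, 0.25, 0.10, 0.25),
}


def classify_regime(annualized_vol: float) -> str:
    """Hypothetical thresholds on annualized realized volatility."""
    if annualized_vol < 0.15:
        return "low"
    if annualized_vol < 0.30:
        return "medium"
    return "high"


def regime_score(sub_scores: dict, annualized_vol: float, horizon: str) -> float:
    """Weighted composite using the weight vector for the current bundle."""
    weights = WEIGHTS[(classify_regime(annualized_vol), horizon)]
    return sum(w * sub_scores[agent] for w, agent in zip(weights, AGENTS))


sub_scores = {"fundamental": 80.0, "technical": 40.0, "macro": 60.0,
              "sentiment": 50.0, "risk": 70.0}
calm_short = regime_score(sub_scores, annualized_vol=0.10, horizon="1M")
stressed_long = regime_score(sub_scores, annualized_vol=0.45, horizon="12M")
print(round(calm_short, 2), round(stressed_long, 2))
```

The same five sub-scores produce different composites depending on the bundle, and each composite is still a pure function of its inputs. Regime conditioning adds a lookup, not randomness.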
Confidence calibration
A scoring model that says "AAPL is an 82" is almost useless without a measure of how confident the model is in that 82. The same number with 95% confidence and with 40% confidence implies completely different position sizes. This is where most consumer AI tools fall down hardest: they output point estimates with no honest uncertainty.
ARIA produces confidence in two stages. First, each agent emits its own confidence based on data quality and feature dispersion. Second, the composite confidence is shrunk toward the ML ensemble's out-of-sample probability calibration. Concretely, the ML stage uses isotonic regression on a held-out validation set to map raw model scores to empirical hit rates. If the model says "73% probability of outperforming the index over 3 months," that number is calibrated against the actual historical hit rate at that score level. We covered the mechanics of isotonic calibration in a separate article; see "Isotonic Calibration: Turning Raw Model Scores into Trustworthy Probabilities" linked below.
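For readers who want to see the mechanics, here is a small pure-Python sketch of isotonic calibration using the pool-adjacent-violators algorithm, followed by a simple shrinkage step. The ten validation points are invented, the calibrated curve is a step function (library implementations such as scikit-learn's typically interpolate), and the linear `shrink` rule with `strength=0.5` is an assumption, since the article does not say how the shrinkage is done.

```python
from bisect import bisect_right


def fit_isotonic(scores, outcomes):
    """Pool Adjacent Violators: returns block start scores and nondecreasing hit rates."""
    blocks = []  # each block: [sum_of_outcomes, count, lowest_score_in_block]
    for score, outcome in sorted(zip(scores, outcomes)):
        blocks.append([float(outcome), 1, score])
        # Merge backwards while the previous block's mean is not below the last one's.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   >= blocks[-1][0] / blocks[-1][1]):
            total, count, _ = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    starts = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return starts, values


def calibrate(score, starts, values):
    """Map a raw score to the empirical hit rate of the block it falls in."""
    index = max(bisect_right(starts, score) - 1, 0)
    return values[index]


def shrink(agent_confidence, calibrated_probability, strength=0.5):
    """Assumed linear shrinkage of composite confidence toward the calibrated value."""
    return (1.0 - strength) * agent_confidence + strength * calibrated_probability


# Invented held-out set: raw model score and whether the stock outperformed (1) or not (0).
val_scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
val_outcomes = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

starts, values = fit_isotonic(val_scores, val_outcomes)
calibrated_73 = calibrate(73, starts, values)
final_confidence = shrink(0.90, calibrated_73)
print(starts, [round(v, 3) for v in values])
print(round(calibrated_73, 3), round(final_confidence, 3))
```

On this toy data a raw score of 73 lands in a block whose historical hit rate is two in three, so an agent-level confidence of 0.90 is pulled down toward that figure. The raw outcomes are noisy, but the fitted hit rates never decrease as the score rises, which is the property isotonic regression guarantees.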
The downstream effect of honest calibration is that position sizing becomes a meaningful exercise rather than a bluff. The Kelly criterion, for example, is only useful when the probability inputs are calibrated. If your model claims 70% confidence but actually hits 55% of the time, full Kelly will blow up your account. Reliable calibration is what makes the rest of the machinery work.
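The arithmetic behind that claim is easy to check. The sketch below assumes a stylized even-money bet (win or lose the full stake), which is a simplification: real equity positions are not binary bets, so treat the numbers as an illustration of the mechanism rather than a sizing rule.

```python
import math


def kelly_fraction(p: float, b: float = 1.0) -> float:
    """Kelly fraction for win probability p and payoff odds b (win b per 1 staked)."""
    return max(p - (1.0 - p) / b, 0.0)


def expected_log_growth(f: float, p_true: float, b: float = 1.0) -> float:
    """Expected log growth per bet when staking fraction f at true win probability p_true."""
    return p_true * math.log(1.0 + b * f) + (1.0 - p_true) * math.log(1.0 - f)


claimed, actual = 0.70, 0.55

f_claimed = kelly_fraction(claimed)  # sized on the model's claimed confidence
f_honest = kelly_fraction(actual)    # sized on the real hit rate

growth_overconfident = expected_log_growth(f_claimed, actual)
growth_honest = expected_log_growth(f_honest, actual)

print(f"stake on claimed 70%: {f_claimed:.2f}, log growth per bet: {growth_overconfident:+.4f}")
print(f"stake on actual 55%:  {f_honest:.2f}, log growth per bet: {growth_honest:+.4f}")
```

Sizing on the claimed 70% means staking about 40% of capital per bet. At a true hit rate of 55% the expected log growth of that stake is negative, so capital shrinks over repeated bets even though the underlying edge is positive. Sizing on the calibrated 55% stakes about 10% and grows.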
Where the LLM comes in
After the deterministic pipeline produces a score and a confidence, the system runs a Bull vs. Bear debate using two LLM instances. The bull is given the full feature dump and prompted to make the strongest case for a long position. The bear is given the same data and prompted to make the strongest case against. A third "judge" LLM reads both arguments and rates each on a 0–10 scale across categories (factual accuracy, argument strength, novelty of insight).
The crucial design decision is that the debate does not change the score. It is appended as a narrative layer. The score is what the score is; the debate exists to surface considerations a numeric model cannot articulate ("the new product launch is delayed and management has been evasive on the timeline") and to help users understand what the score is reacting to. We wrote about why this separation matters in "Bull vs Bear: How AI Debate Improves Investment Decisions."
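A minimal sketch of that separation, assuming immutable result objects and stand-in callables where the LLM calls would go. The class names, the `build_report` function, and the stub outputs are hypothetical; no real LLM API is used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreResult:
    ticker: str
    score: float
    confidence: float


@dataclass(frozen=True)
class Debate:
    bull_case: str
    bear_case: str
    judge_ratings: dict  # side -> {category: rating on a 0-10 scale}


@dataclass(frozen=True)
class Report:
    result: ScoreResult
    debate: Debate


def build_report(result, features, bull, bear, judge):
    """Attach the debate as narrative. The score object is passed through untouched."""
    bull_case = bull(features)
    bear_case = bear(features)
    ratings = judge(bull_case, bear_case)
    return Report(result=result, debate=Debate(bull_case, bear_case, ratings))


# Stand-ins for the three LLM calls (assumptions, for illustration only).
def stub_bull(features):
    return f"Long case built from {len(features)} features."


def stub_bear(features):
    return f"Short case built from {len(features)} features."


def stub_judge(bull_case, bear_case):
    categories = ("factual accuracy", "argument strength", "novelty of insight")
    return {"bull": {c: 7 for c in categories}, "bear": {c: 6 for c in categories}}


result = ScoreResult(ticker="AAPL", score=72.0, confidence=0.81)
features = {"pe_ratio": 28.4, "momentum_3m": 0.06, "real_rate": 0.019}
report = build_report(result, features, stub_bull, stub_bear, stub_judge)
print(report.result.score, report.debate.judge_ratings["bull"])
```

Because `build_report` has no code path that writes to the score, and the dataclass is frozen, the narrative layer cannot leak non-determinism back into the number.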
Common objections, answered briefly
Is this not just a factor model with extra steps?
In a sense, yes, and that is a feature, not a bug. Factor models work. The "extra steps" are regime conditioning, walk-forward weight derivation, honest confidence calibration, and a debate layer for qualitative context. Each step adds value where vanilla factor models fall short.
How do you avoid overfitting?
Three defenses. (1) Walk-forward backtesting with purged k-fold cross-validation; see "Walk-Forward Backtesting: The Gold Standard for Strategy Validation." (2) Deflated Sharpe ratio to penalize strategies that look good only because we tried many. (3) Probability of Backtest Overfitting (PBO) calculated for every published strategy. If PBO is above 50%, the strategy is not deployed.
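Two of these defenses are simple enough to sketch in a few lines of Python. The split generator below is a simplified walk-forward scheme with a purge gap, not the full purged k-fold procedure, and the fold count, purge length, and function names are assumptions. The PBO value itself is taken as given; computing it is outside the scope of this sketch.

```python
def purged_walk_forward_splits(n_samples: int, n_folds: int, purge: int):
    """Yield (train, test) index lists where training always precedes the test fold.

    `purge` observations immediately before each test fold are dropped from
    training so that labels overlapping the test period cannot leak in.
    """
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        test_start = k * fold_size
        test_end = n_samples if k == n_folds else test_start + fold_size
        train_end = max(test_start - purge, 0)
        yield list(range(0, train_end)), list(range(test_start, test_end))


def deployable(pbo: float, threshold: float = 0.5) -> bool:
    """Deployment gate from the text: a strategy with PBO above 50% is not deployed."""
    return pbo <= threshold


splits = list(purged_walk_forward_splits(n_samples=100, n_folds=4, purge=5))
for train, test in splits:
    print(f"train 0..{train[-1]}  test {test[0]}..{test[-1]}")

print(deployable(0.30), deployable(0.62))
```

Each training window ends five observations before its test fold begins, and no test observation is ever used to fit the model that is evaluated on it.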
Why not just use ChatGPT?
ChatGPT is great for narrative summarization and bad at scoring. It is non-deterministic, it cannot be backtested, and it has no concept of feature weighting or regime conditioning. Use it for what it is good at (reading transcripts, summarizing filings) and use a deterministic scorer for the score.
Conclusion
A good scoring system is mechanical, reproducible, and honest about uncertainty. The number it produces is one input among many in your investment decision, not a verdict. If you are evaluating any AI investment tool, ask three questions: Is the score reproducible? Is the confidence calibrated? Has the strategy been walk-forward backtested with honest overfit penalties? If the answer to any of those is no, you are looking at a language model wearing a financial costume.
Want to see deterministic scoring in action on your portfolio? Create a free ARIA account: three analyses a day, no card required. Or browse the plans if you are ready for unlimited scoring and walk-forward backtesting.
Frequently asked questions
Is ARIA Analyst better than Bloomberg Terminal for stock analysis?
Bloomberg Terminal is a market data and news platform; ARIA is a scoring and decision-support platform. Bloomberg gives you the raw feed (and at $24,000 a year, it should). ARIA takes raw feeds and turns them into a single reproducible score with calibrated confidence, walk-forward backtesting, and Monte Carlo simulation. They are complements, not competitors, though for most retail and emerging-pro users, ARIA covers 90% of what they were paying Bloomberg for at a fraction of the cost.
How is ARIA different from Seeking Alpha or Motley Fool?
Seeking Alpha and Motley Fool are content platforms: human analysts write articles and you read them. The "quant rating" they show is a simple multi-factor model. ARIA is a system, not a publication: thirteen deterministic agents, an ML ensemble with isotonic calibration, walk-forward backtesting, and Monte Carlo simulation. The output is a number plus a debate, not an article.
Can I use ChatGPT to analyze stocks instead?
For narrative tasks (summarizing earnings calls, explaining a 10-K, brainstorming risk factors), ChatGPT is excellent. For scoring, it is not. LLMs are non-deterministic by design, which means the same question can produce different answers across runs. You cannot backtest a non-deterministic decision rule, and you cannot size positions against a confidence number the model invented on the spot. Use ChatGPT for what it is good at and use a deterministic scorer for the score.