Deterministic vs LLM Stock Scoring: Why Reproducibility Matters
A rigorous comparison of deterministic scoring systems and LLM-based "AI stock analysis." Reproducibility is not a nice-to-have; it is the foundation of every backtest, every confidence calibration, and every position-sizing decision.
Every backtest, every confidence interval, every position-sizing decision in quantitative finance rests on a single property: reproducibility. Given the same inputs, the model produces the same output. Drop reproducibility and the rest of the apparatus falls down. Backtests become meaningless because the historical decision rule cannot be replayed. Confidence intervals become unfalsifiable. Position sizing becomes a guess.
This is why the rise of LLM-based "AI stock analysis" should make any serious investor uncomfortable. The problem is not that LLMs lack value (they are useful) but that they are being marketed as scoring systems, which they are not.
What "reproducible" actually means in finance
Reproducibility has a precise technical meaning in quantitative finance. A model is reproducible if, given the same input features, it produces the same output to within numerical floating-point precision. This is stricter than "consistent" or "stable." A model that returns 71 on Monday and 73 on Wednesday with the same features is not reproducible, even if 71 and 73 lead to the same investment recommendation.
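The definition above can be turned into a mechanical check. The following is a minimal sketch, not a standard library routine: `is_reproducible`, `linear_score`, and `make_drifting_scorer` are hypothetical names, and the weights are made up for illustration.

```python
import math
from typing import Callable, Sequence


def is_reproducible(score_fn: Callable[[Sequence[float]], float],
                    features: Sequence[float],
                    runs: int = 5,
                    abs_tol: float = 1e-9) -> bool:
    """Return True if repeated calls on identical features agree within abs_tol."""
    baseline = score_fn(features)
    return all(
        math.isclose(score_fn(features), baseline, rel_tol=0.0, abs_tol=abs_tol)
        for _ in range(runs - 1)
    )


def linear_score(features: Sequence[float]) -> float:
    """A fixed numerical recipe: same features in, same score out."""
    weights = (0.5, 0.3, 0.2)  # hypothetical weights
    return sum(w * x for w, x in zip(weights, features))


def make_drifting_scorer() -> Callable[[Sequence[float]], float]:
    """A scorer whose output shifts by 2 points on alternate calls."""
    calls = {"n": 0}

    def score(features: Sequence[float]) -> float:
        calls["n"] += 1
        return linear_score(features) + 2.0 * (calls["n"] % 2)

    return score


features = (80.0, 60.0, 70.0)
print(is_reproducible(linear_score, features))
print(is_reproducible(make_drifting_scorer(), features))
```

The drifting scorer fails the check even though its two outputs would probably lead to the same recommendation, which is exactly the distinction between "stable" and "reproducible."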
Why so strict? Because the small differences compound. A scoring system that drifts by 2 points between runs will rank tickers differently across runs. The "top 10" buy list on Monday will not be identical to the "top 10" on Wednesday. If you are running a portfolio that holds the top 10, you will turn over positions purely because of model noise, generating real transaction costs to chase imaginary signal.
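A small simulation makes the turnover effect concrete. This is a sketch under stated assumptions: the 100-ticker universe, the half-point spacing between true scores, and the uniform noise of up to 2 points are all invented for illustration, and `top_n` and `noisy_rerun` are hypothetical helper names.

```python
import random


def top_n(scores: dict, n: int = 10) -> set:
    """Tickers with the n highest scores; ticker name breaks ties deterministically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return {ticker for ticker, _ in ranked[:n]}


def noisy_rerun(scores: dict, noise: float, rng: random.Random) -> dict:
    """Simulate one model run that drifts by up to +/- noise points per ticker."""
    return {t: s + rng.uniform(-noise, noise) for t, s in scores.items()}


rng = random.Random(42)  # seeded so the simulation itself is repeatable

# Hypothetical universe: 100 tickers with closely spaced underlying scores.
true_scores = {f"T{i:03d}": 50 + 0.5 * i for i in range(100)}

monday = top_n(noisy_rerun(true_scores, 2.0, rng))
wednesday = top_n(noisy_rerun(true_scores, 2.0, rng))

# Names that were in Monday's list but dropped out on Wednesday.
turnover = len(monday - wednesday)
print(f"positions replaced purely by noise: {turnover} of 10")
```

Nothing about the underlying scores changed between the two runs, so every replaced position is a trade driven by noise alone.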
Why LLMs are non-deterministic by design
LLMs generate text token by token. At each step, the model produces a probability distribution over the next token, and a sampling procedure picks one. The standard sampling parameters (temperature, top-p, top-k) control how much randomness that procedure injects, and their defaults are chosen so that the model produces varied, creative output. Even at temperature 0, most production LLM APIs are not bitwise reproducible across runs, because GPU kernels and request batching can change the order of floating-point operations.
You can mitigate this somewhat. You can pin temperature to 0, fix the seed, force greedy decoding, and request the same model version. In practice, what you get is approximately reproducible most of the time, but not strictly reproducible, and not reproducible across model updates. When a provider such as OpenAI updates the model served under a given name, your "stock score" changes overnight without any change in your inputs.
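The sampling step can be illustrated with a toy example. This sketch models only temperature sampling over a made-up next-token distribution; the `logits` values and the `sample_token` helper are hypothetical, and it does not reproduce the GPU-level non-determinism that affects real APIs even at temperature 0.

```python
import math
import random


def sample_token(logits: dict, temperature: float, rng: random.Random) -> str:
    """Pick the next token: greedy at temperature 0, softmax sampling otherwise."""
    if temperature == 0:
        return max(logits, key=logits.get)  # always the highest-logit token
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Hypothetical logits for the token that follows "My score for this stock is".
logits = {"71": 2.0, "73": 1.8, "68": 1.5, "80": 0.5}

rng = random.Random(0)
greedy = {sample_token(logits, 0, rng) for _ in range(200)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(200)}

print("temperature 0:", sorted(greedy))
print("temperature 1:", sorted(sampled))
```

In this toy setting greedy decoding always returns the same token, while sampling at temperature 1 spreads the "score" across several candidates whose logits are close.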
This is fine for many applications. It is not fine for scoring.
What an LLM-as-scorer actually does
When you ask ChatGPT "What is your score for NVDA out of 100?", roughly the following happens. The model has been trained on millions of words about NVDA, financial analysis frameworks, scoring methodologies, and prompt-following. When you ask for a score, it generates a plausible-sounding number based on linguistic patterns ("companies with high growth tend to score high, NVDA has high growth, therefore high score"). The number is not derived from any feature vector. It is generated as a token sequence.
Run the same query five times and you will typically get several different numbers, often spread over a range of about 10 points. Ask the same question phrased differently and the range widens. Ask in a different language and it widens further. None of this is a bug; it is exactly how generative text models work. The bug is calling the output a "score" rather than a "narrative."
The cascading consequences for backtesting
Suppose you want to backtest a strategy that buys the top 10 stocks by AI score, holds for one month, and rebalances monthly. To do this honestly, you need to be able to replay history: for each historical date, what scores would the model have produced given the information available at that date? With a deterministic scorer, you replay the feature pipeline at that date and run the model. With an LLM, you cannot do this: the model itself has changed, the API has changed, and even if neither had changed, the output is non-deterministic.
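Replaying the feature pipeline at a historical date can be sketched as follows. Everything here is illustrative: the `Observation` record, the `available_on` field, the two feature names, and the 0.6/0.4 recipe are assumptions made for the example, not a description of any particular production system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    ticker: str
    feature: str
    value: float
    available_on: date  # the date this value became public


def features_as_of(observations, ticker: str, as_of: date) -> dict:
    """Latest value of each feature using only data available on or before as_of."""
    latest = {}
    for obs in sorted(observations, key=lambda o: o.available_on):
        if obs.ticker == ticker and obs.available_on <= as_of:
            latest[obs.feature] = obs.value  # later observations overwrite earlier ones
    return latest


def score(features: dict) -> float:
    """Hypothetical fixed recipe; no randomness and no external calls."""
    return 0.6 * features["quality"] + 0.4 * features["momentum"]


history = [
    Observation("XYZ", "quality", 70.0, date(2018, 1, 15)),
    Observation("XYZ", "momentum", 50.0, date(2018, 1, 15)),
    Observation("XYZ", "quality", 90.0, date(2018, 4, 15)),  # not yet known in March
]

march = score(features_as_of(history, "XYZ", date(2018, 3, 1)))
may = score(features_as_of(history, "XYZ", date(2018, 5, 1)))
print(march, may)
```

The March replay ignores the April observation, so the historical score depends only on what was known at the time, and rerunning it yields the same number.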
You could try to approximate this by running the LLM many times on the historical features and averaging. This is expensive and still gives you a noisy estimate. It also has a subtle problem: the LLM's training data includes the future. GPT-4's training data extends well past 2020, so it contains years of later market history. When you ask it about a 2018 stock pick, its "score" can leak information from 2019 onward. This is an especially insidious form of look-ahead bias in financial ML, and it cannot be removed without retraining the model from scratch on a time-restricted corpus.
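The cost of averaging can be quantified under a simplifying assumption that is not in the original argument: each call returns the same underlying value plus independent zero-mean noise with standard deviation $\sigma$. Writing $s_i$ for the score from run $i$ and $\bar{s}_n$ for the average of $n$ runs:

```latex
\bar{s}_n = \frac{1}{n}\sum_{i=1}^{n} s_i,
\qquad
\operatorname{SD}\!\left(\bar{s}_n\right) = \frac{\sigma}{\sqrt{n}}
```

Under that assumption, halving the noise takes four times as many calls, and cutting it tenfold takes a hundred times as many. Averaging also reduces variance only; it does nothing about the look-ahead bias, which shifts every run in the same direction.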
How ARIA isolates LLMs from the score
ARIA uses LLMs heavily, but never for scoring. The architecture isolates the LLM from the numeric pipeline. Specifically:
- Feature extraction is deterministic. Fundamentals come from filings via parsing, not summarization. Technicals come from price arrays via numerical computation. Macro features come from FRED and other public time-series APIs. None of this passes through an LLM.
- Sub-scoring is deterministic. Each agent computes a 0-100 score using a fixed numerical recipe. There is no language-model call in this path.
- Weighting is deterministic. Weights come from rolling cross-validation on historical returns. They are stored as a matrix of floats. They are looked up, not generated.
- The composite score is a weighted sum. It is plain floating-point arithmetic evaluated in a fixed order, so it is bitwise reproducible.
- LLMs are used for: (1) generating the bull and bear narratives that explain the score; (2) summarizing news flow into sentiment categories; (3) explaining methodology to users in plain language. None of these affect the numeric score.
The result is that the score is reproducible to floating-point precision while the user still gets the qualitative benefits of LLM-generated explanations and debate. The LLM contributes to the user experience; the score contributes to the investment decision. They live in different parts of the system.
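The numeric path described in the list above might be sketched like this. It is an illustration only, not ARIA's actual code: the agent names, the weight values, and the `composite_score` and `fingerprint` helpers are all hypothetical.

```python
import hashlib
import struct

# Hypothetical weights; in the architecture described above they would be
# looked up from a stored matrix rather than hard-coded.
WEIGHTS = {"fundamentals": 0.40, "technicals": 0.35, "macro": 0.25}


def composite_score(sub_scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of 0-100 sub-scores, summed in sorted key order.

    Fixing the summation order matters because floating-point addition
    is not associative.
    """
    return sum(weights[name] * sub_scores[name] for name in sorted(weights))


def fingerprint(score: float) -> str:
    """Hash of the exact 64-bit representation, for bit-level comparison across runs."""
    return hashlib.sha256(struct.pack(">d", score)).hexdigest()


run_a = composite_score({"fundamentals": 85.0, "technicals": 70.0, "macro": 60.0})
run_b = composite_score({"macro": 60.0, "technicals": 70.0, "fundamentals": 85.0})

print(run_a, fingerprint(run_a) == fingerprint(run_b))
```

Comparing fingerprints rather than rounded values is one way to assert bitwise equality between runs; no language-model call appears anywhere in this path.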
What LLMs are genuinely good for in finance
This is not an anti-LLM article. LLMs are excellent at several finance tasks that deterministic systems do poorly:
- Reading and summarizing earnings call transcripts. A 20,000-word transcript becomes a 200-word summary in seconds, with key themes and analyst questions highlighted.
- Translating numeric outputs into plain language. "Score 78, confidence 82%, debate-margin +2.1, ML-prob 71%" is meaningful to a quant; "Strong buy with high confidence, bullish case driven by improving fundamentals and momentum, with debate slightly favoring the long side" is meaningful to a normal human.
- Generating bull-case and bear-case narratives. Given a feature dump, the LLM can articulate the strongest arguments for and against a position. This is genuinely useful for stress-testing your own thesis.
- Classifying news flow. FinBERT-style models classify headlines as positive, negative, or neutral with high accuracy. This becomes a feature in the deterministic pipeline.
- Answering follow-up questions. "Why did the score drop from 76 to 68 this week?" is a natural-language query that maps to a structured query over the feature dump. LLMs are perfect for this.
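The structured half of the follow-up question in the last item can be sketched as a per-component attribution of the score change, which an LLM could then put into words. The weights, agent names, and sub-score values below are hypothetical, chosen so the composite moves from 76 to 68 as in the example question.

```python
# Hypothetical agent weights and sub-scores, for illustration only.
WEIGHTS = {"fundamentals": 0.40, "technicals": 0.35, "macro": 0.25}

last_week = {"fundamentals": 80.0, "technicals": 80.0, "macro": 64.0}  # composite 76
this_week = {"fundamentals": 80.0, "technicals": 60.0, "macro": 60.0}  # composite 68


def contribution_deltas(weights: dict, before: dict, after: dict) -> dict:
    """Change in each component's weighted contribution to the composite score."""
    return {name: weights[name] * (after[name] - before[name])
            for name in sorted(weights)}


deltas = contribution_deltas(WEIGHTS, last_week, this_week)
largest_driver = min(deltas, key=deltas.get)  # most negative contribution

for name, delta in deltas.items():
    print(f"{name:>12}: {delta:+.2f}")
print("largest driver of the drop:", largest_driver)
```

Because the composite is a weighted sum, the per-component deltas add up to the total change, so the explanation handed to the LLM is exact rather than inferred.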
A practical test you can run
Want to evaluate whether an AI investment tool actually scores or merely chats? Run this test: pick a stock, get a score, wait five minutes, get the score again. Then ask for the score in a different language. Then ask phrased as "rate this from 1 to 100" vs "give this a quality rating." A deterministic scorer will produce the same number every time. A chat-as-scorer will not.
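The first test can be automated with a small harness. This is a sketch: `get_score` stands in for whatever call returns a number from the tool under test, and the two stand-in tools are invented to show both outcomes. The five-minute wait is left out; in practice you would space the calls apart.

```python
from typing import Callable


def passes_reproducibility_test(get_score: Callable[[str, str], float],
                                ticker: str,
                                prompts: list,
                                repeats: int = 3) -> bool:
    """True only if every phrasing and every repeat yields the identical number."""
    observed = {get_score(ticker, prompt) for prompt in prompts for _ in range(repeats)}
    return len(observed) == 1


def deterministic_tool(ticker: str, prompt: str) -> float:
    """Stand-in for a real scorer: the wording of the request is irrelevant."""
    return 72.0


def chat_like_tool(ticker: str, prompt: str) -> float:
    """Stand-in for a chat-as-scorer: the number depends on how you ask."""
    return 70.0 + len(prompt) % 7


prompts = ["rate this from 1 to 100", "give this a quality rating"]

print(passes_reproducibility_test(deterministic_tool, "NVDA", prompts))
print(passes_reproducibility_test(chat_like_tool, "NVDA", prompts))
```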
A second test: ask for the model's historical score on the same stock as of a specific past date. A deterministic scorer can replay history. An LLM cannot; it can only generate a plausible-sounding number based on what it learned in training, which is contaminated by the future.
Conclusion
Reproducibility is not a luxury feature. It is the precondition for honest backtesting, honest confidence calibration, and honest position sizing. LLMs are powerful tools, but they belong in the narrative layer, not the scoring layer. Any AI investment tool that lets the LLM drive the score is selling you a chat interface dressed as a model.
See how ARIA separates the deterministic scoring layer from the LLM narrative layer in our methodology page, or create a free account and run your own reproducibility test.
Frequently asked questions
Can you make ChatGPT deterministic for stock scoring?
You can make it approximately deterministic by pinning temperature to 0, fixing the seed, and using greedy decoding. But you cannot make it fully reproducible across model versions (which change without notice), and you cannot remove the training-data look-ahead bias: the model has seen the future of any historical date, so its "historical scores" are not really historical. For toy or educational use, pinned-temperature LLM scoring is fine. For real investment decisions, it is not.
Why is reproducibility important for backtesting?
A backtest is a simulation of what would have happened if your strategy had been deployed historically. For the backtest to be meaningful, you have to be able to replay the strategy's decisions using only the information that was available at each historical date. If your scoring model is non-deterministic, you cannot replay decisions: every replay produces different scores, and therefore different trades. The backtest result becomes noise rather than signal.
Is ARIA Analyst free from LLM contamination in its scores?
Yes. ARIA scores are computed by deterministic agents on deterministic features. LLMs are used only for the narrative layer (bull-vs-bear debate, news summarization, plain-language explanations) and do not feed back into the numeric score. The methodology page documents the full pipeline, and the score is reproducible to floating-point precision across runs.