The 2-minute test that filters AI analysis platforms
Most AI analysis tools sold to retail are LLM wrappers, not models. Two consecutive queries on the same ticker reveal which is which. Educational research, not investment advice.
ARIA Analyst is a research and analysis tool. This article is educational and does not constitute investment advice. Always verify and form your own decisions.
There is a 2-minute test that filters most AI analysis platforms before you pay a single dollar. It does not require a finance background. It does not require code. It works on any tool that returns a numeric score, regardless of the marketing language around it.
Ask the platform for an analysis of a ticker. Write down the numeric score it reports. Wait 5 minutes. Open a new tab. Ask for the same analysis on the same ticker, with the same prompt. Compare the two scores. That is the entire test.
If the scores differ by more than a trivial amount in the absence of underlying data movement, the platform is not running a deterministic scoring model. It is running a language model with non-zero temperature and asking it for a number. The output is a sample from a distribution, not the evaluation of a function. The label "AI" is accurate in a marketing sense and misleading in every sense that matters for using the number.
Why reproducibility is the right filter
A real quantitative model has a fixed functional form. Given the same inputs, it produces the same output. That property is what makes the output usable for downstream tasks: you can backtest a strategy that depends on it, you can audit the decisions it triggers, you can regression-test it after a code change. Without that property the output is generated opinion. Useful for exploration, useless for systematic decisions.
The reason this matters in practice is comparability. To use any score as input to research (comparing across tickers, tracking how it evolves through time, cross-referencing with other signals), the score has to mean the same thing across runs. A non-deterministic score cannot even be compared with itself, which collapses its informational value to almost nothing. You can read the prose around it, but you cannot use the number to compare or track.
What the test does not prove
Two scores can differ for legitimate reasons. If the underlying price has moved between queries during market hours, the technical features change and a deterministic scorer should produce a slightly different number, proportional to how impactful the move was. The test is not "any difference means a wrapper". It is "differences that do not correspond to data changes mean a wrapper". A 15 to 25 point swing on a ticker that moved 0.1% in the interval is non-determinism, not responsiveness.
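The decision rule in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration, not any platform's actual logic: the function name `classify_rerun`, the `points_per_pct` sensitivity (how many score points a 1 percent price move could plausibly explain) and the `noise_floor` are hypothetical parameters you would choose yourself.

```python
def classify_rerun(score_a: float, score_b: float, price_move_pct: float,
                   points_per_pct: float = 2.0, noise_floor: float = 1.0) -> str:
    """Label a pair of scores taken a few minutes apart.

    points_per_pct and noise_floor are illustrative assumptions, not
    values published by any platform.
    """
    diff = abs(score_a - score_b)
    # Largest score change the observed price move could plausibly explain.
    explainable = noise_floor + points_per_pct * abs(price_move_pct)
    return "consistent" if diff <= explainable else "unexplained"


# Identical scores on a quiet ticker.
print(classify_rerun(72, 72, price_move_pct=0.1))
# A 19-point swing on a 0.1% move.
print(classify_rerun(72, 91, price_move_pct=0.1))
# A 3-point shift after a 2% move.
print(classify_rerun(72, 75, price_move_pct=2.0))
```

The point of the sketch is that the comparison is always between the score difference and the data difference, never the score difference alone.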
A second clarification: passing the test does not make a platform good. It just makes it a model. The model can still be wrong, the weights can still be poorly chosen, the calibration can still be off. The test rules out one specific failure mode (sampling instead of computing) and leaves all the other failure modes intact.
Why most retail AI tools fail this test
A language model is the cheapest way to ship a product that looks like investment analysis. The pipeline is simple: take a ticker, build a prompt that contains the financials and a recent news summary, ask the LLM for a score and a verdict. Ship.
The problem is that LLMs are creative by design. Temperature is non-zero in default API configurations. Different sampling seeds produce different completions. The model that gave you 72 last time is sampling from a distribution that includes 65 and 81, and you will hit those values eventually. The platform usually hides this by reporting an integer (so a 72.3 and a 71.8 both display as 72) and by adding a small averaging layer over multiple completions (which reduces but does not eliminate variance). The test still surfaces the problem because the noise is real.
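A small simulation shows why rounding and averaging reduce the noise without removing it. It is a sketch under stated assumptions: each LLM completion is modeled as a Gaussian draw around 72 with a standard deviation of 5 points, which is a stand-in chosen for illustration and not a measurement of any real model. The function names are hypothetical.

```python
import random
import statistics


def sample_llm_score(rng: random.Random, mean: float = 72.0, sd: float = 5.0) -> float:
    """Stand-in for one LLM completion: a noisy draw around a central value."""
    return rng.gauss(mean, sd)


def displayed_score(rng: random.Random, n_completions: int) -> int:
    """Average several completions, then round to an integer for display."""
    draws = [sample_llm_score(rng) for _ in range(n_completions)]
    return round(statistics.fmean(draws))


def spread(n_completions: int, queries: int = 2000, seed: int = 0) -> float:
    """Standard deviation of the displayed score across repeated queries."""
    rng = random.Random(seed)
    shown = [displayed_score(rng, n_completions) for _ in range(queries)]
    return statistics.pstdev(shown)


single = spread(n_completions=1)
averaged = spread(n_completions=5)
print(f"spread with 1 completion:  {single:.2f}")
print(f"spread with 5 completions: {averaged:.2f}")
```

Averaging five completions shrinks the spread but leaves it well above zero, so two queries a few minutes apart can still show different integers.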
The pricing pressure pushes in the wrong direction. A wrapper costs the operator little more than the LLM API call behind each query, typically a few cents in tokens. A deterministic scoring pipeline requires building and maintaining feature engineering, data infrastructure, and model calibration, with engineering cost amortized over the user base. The wrapper wins on margin every time. The cost shows up as a worse product.
What a real three-layer system looks like
The framework that makes the test pass is to separate the system into three layers and use each for what it does well.
Layer one is the score. It must be deterministic: a function of measurable features with fixed weights. In ARIA, this is five agents (macro, fundamental, technical, sentiment, risk) with weights fixed per asset class. The macro score is a function of yield curve, VIX, breadth, and credit spreads. The fundamental score is a function of ROE, revenue growth, margin trends, valuation ratios, quality factors. The technical score uses momentum, mean-reversion, vol, and trend indicators. Each agent produces a 0-100 score with explicit formulas. The composite is a weighted blend with public weights. Same inputs, same number, always.
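A deterministic composite of this kind can be sketched as a pure function. The weights below are hypothetical placeholders for a single asset class, and `composite_score` is an illustrative name; neither is taken from ARIA's published methodology. The agent scores are assumed to be already computed on the 0-100 scale.

```python
from types import MappingProxyType

# Hypothetical weights for one asset class; the real values are not given here.
# MappingProxyType makes the mapping read-only, so the weights stay constant.
EQUITY_WEIGHTS = MappingProxyType({
    "macro": 0.20,
    "fundamental": 0.30,
    "technical": 0.20,
    "sentiment": 0.15,
    "risk": 0.15,
})


def composite_score(agent_scores: dict[str, float], weights=EQUITY_WEIGHTS) -> float:
    """Weighted blend of 0-100 agent scores. No randomness, no hidden state."""
    if set(agent_scores) != set(weights):
        raise ValueError("agent scores must match the weight keys exactly")
    for name, value in agent_scores.items():
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} score {value} is outside 0-100")
    total_weight = sum(weights.values())
    # Iterate in the fixed order of `weights` so the sum is reproducible.
    return sum(weights[name] * agent_scores[name] for name in weights) / total_weight


scores = {"macro": 60, "fundamental": 80, "technical": 70, "sentiment": 55, "risk": 65}
first = composite_score(scores)
second = composite_score(scores)
assert first == second  # same inputs, same number
print(round(first, 2))
```

Because the function reads nothing but its arguments and a read-only weight table, a second call with the same inputs cannot return a different number.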
Layer two is calibration. The raw score (0-100) gets mapped to a probability of outperformance over a stated horizon (21 days, by default). The mapping uses isotonic regression on the trailing 24 months of out-of-fold predictions. The result is that when the system says 70 percent, roughly 70 percent of such cases have historically outperformed. Calibration is monitored weekly. If any probability bucket drifts more than 5 percentage points from its observed frequency, sizing is disabled until recalibration finishes.
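The monitoring step can be sketched as a bucketed comparison between predicted probability and observed frequency. This covers only the drift check, not the isotonic fit itself. The 10-point bucket width and the function names are assumptions made for illustration; the 5-point threshold is the one stated above.

```python
from collections import defaultdict


def bucket_drift(predictions: list[float], outcomes: list[int],
                 n_buckets: int = 10) -> dict[float, float]:
    """Per-bucket gap, in percentage points, between the observed frequency
    of outperformance and the mean predicted probability."""
    if len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must have equal length")
    grouped = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        idx = min(int(p * n_buckets), n_buckets - 1)  # p == 1.0 joins the top bucket
        grouped[idx].append((p, y))
    drift = {}
    for idx, pairs in sorted(grouped.items()):
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(y for _, y in pairs) / len(pairs)
        drift[round(idx / n_buckets, 2)] = 100 * (observed - mean_pred)
    return drift


def sizing_enabled(drift: dict[float, float], max_drift_points: float = 5.0) -> bool:
    """False when any bucket has drifted past the threshold."""
    return all(abs(gap) <= max_drift_points for gap in drift.values())


# Ten predictions of 0.72: seven outperformed in one sample, five in another.
healthy = bucket_drift([0.72] * 10, [1] * 7 + [0] * 3)
drifted = bucket_drift([0.72] * 10, [1] * 5 + [0] * 5)
print(sizing_enabled(healthy), sizing_enabled(drifted))
```

In practice each bucket would need far more than ten observations before its observed frequency is worth comparing against anything; the tiny sample here only keeps the arithmetic easy to follow.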
Layer three is narrative. This is where LLMs earn their seat. A Bull versus Bear debate runs across five rounds with an Arbiter. Deep-search agents pull intelligence from filings, news, peer comparisons, insider transactions, and ESG sources. A final synthesis writes the thesis. The narrative varies between queries because the LLM writes with stylistic variability. The score does not move.
What to do with the test results on your current platform
If your current AI analysis tool fails the test, the output is best read as exploratory prose rather than reference material. The narrative may still be informative. The score should not be referenced as a fixed data point in your own research notes, because it is not fixed.
Whether you use any analysis tool, free or paid, is a separate question that depends on your research workflow and how much value you place on speed of information access. This article does not recommend any specific course of action. It explains a property (reproducibility) that distinguishes one type of tool from another.
The summary
The 2-query test is the simplest filter against the most common failure mode in AI-assisted analysis tools. It does not prove correctness. It does prove reproducibility, which is a necessary precondition for the score to be useful as reference information. If a tool fails it, you have learned something specific about what the tool is and is not. If it passes, you have ruled out one specific failure mode and can move on to evaluating the other things that matter.
ARIA Analyst is an analysis tool designed to pass the test by construction. The methodology is public. The asset-class weights are fixed. The calibration layer publishes its reliability diagrams. ARIA does not provide investment advice. It produces information; what users do with that information is their own decision.
Frequently asked questions
How much can the score legitimately change between two queries 5 minutes apart?
During market hours, the bound is essentially zero unless the price moved enough to shift technical indicators. A 0.1 percent intraday move on a large-cap stock will not change the score. A 2 percent move can shift it by a few points if the technical dimension is heavily weighted. After hours, the bound is essentially zero (no underlying data change). A swing of 10 points or more without a corresponding data move is a strong indicator of LLM-based generation.
Does the test work for crypto, forex, commodities?
Yes, with the same logic. Crypto trades 24/7, so price-based features do update continuously, but the magnitude of change in 5 minutes should still be small enough that the score does not swing dramatically. Forex and commodity platforms can be tested similarly. The principle is invariant: same inputs, same outputs, always.
What if the platform refuses to give me a numeric score and only gives prose?
That is a different kind of warning. A purely prose-based recommendation is uncomparable across runs and cannot drive sizing. It is the LLM-as-analyst pattern, which has real uses but is not a scoring system. Treat it as exploratory writing, not as a signal to act on.
How does ARIA handle the test internally?
The scoring layer is a pure function. The agents are stateless, the weights are constants per asset class, the blending formula is deterministic. The only sources of legitimate score variation between queries are changes in the inputs (price ticks, news updates, fundamentals revisions). The LLM layer that writes narrative does have variability by design, but it does not touch the score. You can verify this end-to-end by running two consecutive analyses on the same ticker through the platform.
Is this just a critique of LLMs in finance?
No. LLMs are excellent tools when used at the right layer. The argument is that LLMs do not belong in the scoring layer because non-determinism in scoring breaks downstream tasks. They belong in the narrative, reasoning, and intelligence-extraction layers, where their strengths (fluency, summarization, multi-source synthesis) are what the job requires. A system that uses an LLM at every layer is a chatbot. A system that uses an LLM only where it is the right tool is an investment platform.
Ready to put this into practice?
ARIA Analyst applies these methods on any stock, crypto, forex, commodity, or fund. Three free analyses per day on the free tier.