Is multi-agent debate just two ChatGPTs arguing?

It can be, but a well-engineered debate is structured. Each agent gets a specific role prompt with constraints (cite specific data, address the strongest counterargument, etc.). A third "judge" agent rates both on multiple dimensions. The output is a margin and confidence pair, not a free-form transcript. The difference between "two ChatGPTs arguing" and a structured debate is the difference between a comment section and a peer-reviewed exchange.

How long does a multi-agent debate take to run?

In ARIA, a full bull-bear-judge cycle takes roughly 8-15 seconds depending on context length. This is too slow for intraday signals but fine for swing trading and longer-horizon decisions. The cost in API tokens is roughly $0.02-$0.05 per debate at current Anthropic/OpenAI pricing, which is cheap enough to include in every analysis.

Can I see the full debate transcript?

Yes. Every ARIA analysis shows the full bull argument, the full bear argument, and the judge's verdict with reasoning. We do not summarize the debate or hide the chain of reasoning; the transparency is the point. If you disagree with the judge's verdict, you can read why and form your own view.

Bull vs Bear: How AI Debate Improves Investment Decisions

One LLM is a junior analyst with a particular set of biases. Two LLMs arguing against each other, with a third judging, is a much closer approximation of a thoughtful investment committee. This is the intuition behind multi-agent debate in AI-assisted finance, and the empirical evidence, including a 2025 Stanford paper on debate-improved reasoning in financial analysis, is that it works.

But "works" is doing a lot of work in that sentence. Naive debate setups produce theater, not insight. The bull always wins, the bear always wins, or both regress to a wishy-washy consensus. This article explains what a correctly engineered debate looks like, how the output gets quantified, and how to detect when the debate has gone off the rails.

The intuition: steel-manning

Steel-manning is the discipline of articulating the strongest possible version of an argument you do not agree with. It is the opposite of strawmanning. In investing, steel-manning matters because the most expensive losses come from positions where the investor only considered the bullish case. A bull case that survives the strongest possible bear attack is a much sturdier thesis than a bull case that was never challenged.

Humans are bad at steel-manning the case against their own positions. We are wired to see weaknesses in views we disagree with and strengths in views we hold. This is well documented in behavioral finance: confirmation bias is one of the most robust findings in the field.

A multi-agent system can be set up to steel-man mechanically. One agent is given the feature dump and prompted: "Make the strongest possible case for going long this stock. Cite specific data. Acknowledge the strongest counterargument and explain why it is wrong." A second agent is given the same data and the symmetric prompt for the short side. A third agent, the judge, reads both and rates each on factual accuracy, argument strength, and novelty of insight.
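The three-role setup can be sketched in a few lines. This is a minimal illustration, not ARIA's implementation: the `llm` callable stands in for whatever model client you use, and the prompt templates, `Debate` class, and `run_debate` function are hypothetical names. Only the wording of the long-side prompt comes from the text above.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical templates. The bull wording follows the article; the bear
# prompt is the symmetric version for the short side.
BULL_PROMPT = (
    "Make the strongest possible case for going long this stock. "
    "Cite specific data. Acknowledge the strongest counterargument "
    "and explain why it is wrong.\n\nData:\n{features}"
)
BEAR_PROMPT = (
    "Make the strongest possible case for going short this stock. "
    "Cite specific data. Acknowledge the strongest counterargument "
    "and explain why it is wrong.\n\nData:\n{features}"
)
JUDGE_PROMPT = (
    "Rate each argument on factual accuracy, argument strength, "
    "and novelty of insight.\n\nBULL:\n{bull}\n\nBEAR:\n{bear}"
)


@dataclass
class Debate:
    bull: str      # full bull argument
    bear: str      # full bear argument
    verdict: str   # raw judge output, to be parsed downstream


def run_debate(features: str, llm: Callable[[str], str]) -> Debate:
    """Run one bull-bear-judge cycle over the same feature dump."""
    bull = llm(BULL_PROMPT.format(features=features))
    bear = llm(BEAR_PROMPT.format(features=features))
    # The judge sees both arguments but not the role prompts.
    verdict = llm(JUDGE_PROMPT.format(bull=bull, bear=bear))
    return Debate(bull=bull, bear=bear, verdict=verdict)
```

Passing the model as a plain callable keeps the sketch independent of any vendor SDK and makes the cycle easy to test with a stub.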

Why continuous scoring beats binary winners

The most common mistake in debate setups is to declare a winner. "The bull won, score = buy. The bear won, score = sell." This is wrong for two reasons. First, it discards all the information about how close the debate was. A debate where the bull narrowly edged out the bear is a very different signal from one where the bull crushed the bear. Second, declaring a winner introduces a hard threshold that creates discontinuities in your downstream pipeline: small changes in inputs can flip the score from buy to sell, which is bad for both stability and interpretability.

ARIA scores the debate continuously along two axes: margin and confidence. Margin is the judge's score for the bull minus the judge's score for the bear, on a -10 to +10 scale. Confidence is the judge's self-rated confidence in the verdict, on a 0-100% scale. A debate with margin +6 and confidence 85% is a strong bullish signal. A debate with margin +1 and confidence 40% is a weak signal that should barely move the composite score. The numbers compose cleanly with the deterministic scoring stack.
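To make the contrast concrete, here is a small sketch of both scoring styles. The margin definition (bull score minus bear score, each assumed to be on a 0-10 scale) follows the text. The `Verdict` class, `binary_call`, and the `debate_signal` formula (margin scaled to [-1, 1] and shrunk by confidence) are illustrative assumptions, not ARIA's actual combination rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    bull_score: float   # judge's score for the bull, assumed 0 to 10
    bear_score: float   # judge's score for the bear, assumed 0 to 10
    confidence: float   # judge's self-rated confidence, 0.0 to 1.0

    def __post_init__(self):
        for score in (self.bull_score, self.bear_score):
            if not 0.0 <= score <= 10.0:
                raise ValueError("scores must lie in [0, 10]")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")

    @property
    def margin(self) -> float:
        """Bull score minus bear score, on a -10 to +10 scale."""
        return self.bull_score - self.bear_score


def binary_call(v: Verdict) -> str:
    """The naive approach: a hard threshold at margin 0."""
    return "buy" if v.margin > 0 else "sell"


def debate_signal(v: Verdict) -> float:
    """Assumed continuous signal in [-1, 1]: scaled margin times confidence."""
    return (v.margin / 10.0) * v.confidence


strong = Verdict(bull_score=8.0, bear_score=2.0, confidence=0.85)  # margin +6
weak = Verdict(bull_score=5.5, bear_score=4.5, confidence=0.40)    # margin +1
print(debate_signal(strong), debate_signal(weak))  # about 0.51 and 0.04
```

Under the binary rule, a judge score moving from 5.1 versus 5.0 to 5.0 versus 5.1 flips the call from buy to sell. Under the continuous signal, the same move shifts the output by a hundredth or so, which is the stability argument in miniature.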

Detecting consensus failure

A subtler failure mode is consensus failure: both agents agree, but the consensus is wrong. This happens when the underlying feature dump has a systematic blind spot. For example, fundamentals look great because the most recent quarter has not been reported yet, and the agents do not have access to leading indicators that would suggest a miss. The bull says "buy on fundamentals," the bear has nothing strong to push back on, and the verdict is a confident long. Then the quarter misses and the stock drops 15%.

You cannot fully prevent consensus failure, but you can detect it. ARIA computes three diagnostics:

Argument diversity: how lexically and semantically different are the bull and bear arguments? If both arguments cite the same three facts in the same order, the debate has not actually surfaced disagreement.
Confidence-feature mismatch: the deterministic feature stack and the LLM debate independently produce confidence numbers. If the deterministic confidence is low (high feature dispersion, stale data) but the debate confidence is high, something is off: the LLMs are confident about thin data.
Historical agreement frequency: across the last 1,000 analyses, what fraction of debates ended with the bull winning? If it is significantly above or below 50% for a given asset class, the prompt or judge has a structural bias and needs to be rebalanced.

When any of these diagnostics flag, the debate output is down-weighted in the composite score and a "consensus warning" surfaces in the UI.
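The three diagnostics can be approximated with very little code. The sketch below is one possible reading, with every threshold and function name a hypothetical choice. Argument diversity is reduced to token-level Jaccard overlap, which captures only the lexical half of the check; a semantic comparison would need embeddings. The bias check is a two-sided normal-approximation test against 50%, on the assumption that a judge favoring the bear is as much a structural bias as one favoring the bull.

```python
import math
import re


def jaccard_overlap(a: str, b: str) -> float:
    """Share of distinct word tokens that the two arguments have in common."""
    tokens_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    union = tokens_a | tokens_b
    if not union:
        return 1.0  # two empty arguments are trivially identical
    return len(tokens_a & tokens_b) / len(union)


def low_diversity(bull: str, bear: str, max_overlap: float = 0.6) -> bool:
    """Flag debates whose two sides mostly reuse the same vocabulary."""
    return jaccard_overlap(bull, bear) > max_overlap


def confidence_mismatch(det_conf: float, debate_conf: float,
                        max_gap: float = 0.3) -> bool:
    """Flag when the debate is far more confident than the feature stack."""
    return debate_conf - det_conf > max_gap


def bull_win_bias(bull_wins: int, total: int, z_crit: float = 1.96) -> bool:
    """Two-sided test of the historical bull win rate against 50%."""
    if total == 0:
        return False
    rate = bull_wins / total
    z = (rate - 0.5) / math.sqrt(0.25 / total)
    return abs(z) > z_crit


def reliability(*flags: bool, penalty: float = 0.5) -> float:
    """Multiplier for the debate's weight: reduced if any diagnostic fires."""
    return penalty if any(flags) else 1.0
```

With 1,000 analyses, a bull win rate of 52% stays within noise under this test, while 60% (or 40%) is flagged. The `penalty` of 0.5 is arbitrary; the point is only that a flagged debate contributes less to the composite.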

The Stanford evidence

A 2025 paper from Stanford's Computer Science department, "Adversarial Debate Improves LLM Reasoning in Financial Analysis Tasks", provides systematic evidence that multi-agent debate beats single-agent reasoning on a battery of financial analysis benchmarks. The headline number: across a 500-question benchmark covering valuation, financial statement analysis, and macroeconomic reasoning, single-agent GPT-4o scored 71% accuracy; a three-agent debate (bull, bear, judge) using the same base model scored 84%. The 13-percentage-point gain is consistent across task categories and across base models (the effect is preserved when GPT-4o is swapped for Claude or Gemini).

The paper attributes the improvement to two mechanisms. The first is calibration: the debate makes the model articulate the strongest counterargument, which forces a more honest assessment of which side is stronger. The second is information surfacing: in single-agent mode, the model often skips relevant data because it is "obvious" within its reasoning. The debate format forces both sides to cite specific data, which surfaces information that would otherwise be discarded.

The practical implication is that debate is not a marketing gimmick. It is a measurable accuracy improvement, and the improvement is large enough to be material for investment decisions.

What debate is bad at

Debate is not a universal solvent. It has clear failure modes:

Highly technical valuation questions. A debate about whether AAPL's P/E is justified at 28x will be less precise than a deterministic discounted cash flow model.
Time-pressured decisions. Debates take seconds to minutes to run. For intraday signals, they are too slow.
Heavily macro-driven moves. When the entire market is selling off because of a Fed surprise, no amount of debate at the single-stock level will help.
Tail-risk events. Debate operates on the median case. Black swans live in the tails, which is where Monte Carlo and VaR are stronger. See our Monte Carlo guide and VaR explainer.

How the debate feeds the composite score

In ARIA, the debate margin and confidence enter the composite score as a separate "narrative" feature with a relatively small weight, typically 5-10% of the total. The bulk of the score still comes from the deterministic feature agents. The debate exists to surface considerations a numeric model cannot articulate, and to provide a sanity check: if the deterministic score is 80 but the debate margin is -4, there is tension that the user should understand before acting on the signal.

This is consistent with the design philosophy described in "Deterministic vs LLM Stock Scoring." The LLMs contribute to the narrative and explanation; the score remains primarily numeric.
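As a rough illustration of how a small narrative weight behaves, the sketch below blends a deterministic score with the debate output. The 0-100 scale for the deterministic score, the mapping of margin and confidence onto that scale, the default weight of 7.5%, and the `tension` rule are all assumptions made for the example; the article specifies only the 5-10% weight range.

```python
def narrative_score(margin: float, confidence: float) -> float:
    """Map margin (-10..+10) and confidence (0..1) onto 0-100, neutral at 50."""
    return 50.0 + 50.0 * (margin / 10.0) * confidence


def composite_score(det_score: float, margin: float, confidence: float,
                    narrative_weight: float = 0.075,
                    reliability: float = 1.0) -> float:
    """Weighted blend; reliability < 1 down-weights a flagged debate."""
    w = narrative_weight * reliability
    return (1.0 - w) * det_score + w * narrative_score(margin, confidence)


def tension(det_score: float, margin: float, min_margin: float = 3.0) -> bool:
    """Flag when a decisive debate points against the deterministic score."""
    det_bullish = det_score > 50.0
    debate_bullish = margin > 0.0
    return abs(margin) >= min_margin and det_bullish != debate_bullish


# The example from the text: deterministic score 80, debate margin -4.
score = composite_score(det_score=80.0, margin=-4.0, confidence=0.7)
print(round(score, 1), tension(80.0, -4.0))  # about 76.7, with tension flagged
```

The blended score barely moves, which is the intended behavior at this weight. The useful output is the tension flag, which tells the user to read the bear argument before acting.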

Conclusion

Multi-agent debate, when engineered correctly, is one of the more effective uses of LLMs in finance. It steel-mans both sides, surfaces information that single-agent reasoning misses, and produces a quantifiable margin-and-confidence output that composes with deterministic scoring. The 2025 Stanford evidence is encouraging, but the engineering details matter: naive debate setups produce theater.

See the bull-vs-bear debate live on any stock in ARIA. The free tier includes three full analyses per day with debate, or you can upgrade to Pro for unlimited analyses.
