NLP for News Sentiment in Stock Trading: What Actually Works
A practical guide to NLP-based news sentiment for stock trading: FinBERT and finance-specific models, source weighting, decay functions, and the empirical evidence on sentiment alpha.
News sentiment is one of the most over-hyped and under-engineered signals in retail quant finance. The over-hyped part: vendors sell sentiment indices that promise to predict the market based on headlines. The under-engineered part: most of those indices run generic NLP models on a noisy corpus and produce signals that barely outperform a coin flip. Done correctly, news sentiment adds a small but persistent signal to a stock-prediction model, typically a 0.05-0.15 improvement in Sharpe ratio when added to a well-built fundamentals-and-technicals baseline. Done wrong, it adds pure noise.
This article explains what "done correctly" means in practice: which NLP models to use, how to source news, how to weight by source quality, how to decay over time, and what the realistic alpha looks like.
Why generic NLP fails in finance
Sentiment models trained on generic text corpora (Twitter, product reviews, news) miss the finance-specific vocabulary that matters. "The company beat earnings" is positive in finance; in generic sentiment it might be neutral. "Guidance lowered" is negative in finance; generic models often miss it. "Cyclical headwinds expected to abate" is mildly positive; a generic model probably misses it entirely.
The empirical evidence is clear: finance-specific NLP models outperform generic models by 15-30% in F1 score on financial sentiment classification tasks. FinBERT (Araci 2019; Yang et al. 2020) was among the first widely adopted finance-specific BERT variants; FinGPT and several Llama-based variants have followed. The choice of model matters less than the choice of corpus: in practice, nearly any finance-specific model beats a generic model on finance tasks.
ARIA Analyst uses FinBERT for headline classification (positive / negative / neutral) and a custom-tuned variant for full-article entity-linked sentiment. The cost differential vs. generic models is modest (a few hundred MB of model weights, ~50ms per inference); the accuracy differential is large.
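As a rough illustration of headline classification, the sketch below turns three class probabilities into a single signed score. The `ProsusAI/finbert` checkpoint, the helper names, and the positive-minus-negative scoring rule are assumptions made for this example, not a description of ARIA Analyst's pipeline.

```python
from typing import Dict, List


def signed_score(probs: Dict[str, float]) -> float:
    """Collapse class probabilities into one score in [-1, 1].

    Assumed convention: P(positive) - P(negative). Neutral probability
    mass pulls the score toward zero.
    """
    return probs.get("positive", 0.0) - probs.get("negative", 0.0)


def classify_headlines(headlines: List[str],
                       model_name: str = "ProsusAI/finbert") -> List[float]:
    """Score headlines with a FinBERT checkpoint from the Hugging Face Hub.

    Needs the `transformers` package and a model download. The checkpoint
    name is an assumption; any finance-tuned classifier with
    positive/negative/neutral labels could be swapped in.
    """
    from transformers import pipeline  # lazy import: heavy dependency

    # top_k=None asks for the score of every label (recent transformers versions).
    clf = pipeline("text-classification", model=model_name, top_k=None)
    results = clf(headlines)  # one list of {"label", "score"} dicts per headline
    return [
        signed_score({r["label"].lower(): r["score"] for r in row})
        for row in results
    ]
```

Calling `classify_headlines(["Company beats earnings, raises guidance"])` returns one score per headline; a signed scalar is convenient because it can be multiplied directly by source and time-decay weights.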
Source weighting
Not all news sources are equally informative. A Bloomberg or Reuters headline is more likely to reflect institutional positioning than a SeekingAlpha contributor post. The latter is more likely to reflect retail sentiment, which is a useful signal but a different one.
ARIA Analyst weights news sources by historical predictive value, computed via rolling regression of forward returns on past sentiment from each source. The top tier (Bloomberg, Reuters, Dow Jones, Wall Street Journal) gets the highest weights. The secondary tier (Financial Times, CNBC, MarketWatch) gets medium weights. The tertiary tier (SeekingAlpha, BizJournals, blog aggregators) gets low weights. Pure retail noise (Reddit, generic blog spam) is excluded entirely.
Source weighting matters because the signal-to-noise ratio differs across sources by an order of magnitude. A Bloomberg story has a higher prior probability of moving the stock; a SeekingAlpha post may be informative, but it is much noisier on average.
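One way to turn "historical predictive value" into numbers is to regress forward returns on sentiment per source and rescale the slopes. The sketch below does this for a single trailing window; the function names, the zero floor on negative slopes, and the normalization to the best source are illustrative assumptions, and a rolling version would simply recompute the weights as the window moves.

```python
from typing import Dict, List, Tuple


def ols_slope(x: List[float], y: List[float]) -> float:
    """Slope of a one-variable least-squares fit of y on x."""
    n = len(x)
    if n < 2:
        return 0.0
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0:
        return 0.0
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cov_xy / var_x


def source_weights(
    history: Dict[str, List[Tuple[float, float]]]
) -> Dict[str, float]:
    """Map each source to a weight in [0, 1].

    history[source] holds (sentiment, forward_return) pairs from one
    trailing window. Assumed rule: negative slopes are floored at zero,
    then all slopes are divided by the largest so the best source gets 1.0.
    """
    slopes = {}
    for source, pairs in history.items():
        sentiments = [s for s, _ in pairs]
        returns = [r for _, r in pairs]
        slopes[source] = max(0.0, ols_slope(sentiments, returns))
    top = max(slopes.values(), default=0.0)
    if top == 0:
        return {source: 0.0 for source in slopes}
    return {source: slope / top for source, slope in slopes.items()}
```

A source whose sentiment has no positive relationship with forward returns in the window ends up with weight zero, which matches the idea of excluding pure noise rather than down-weighting it.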
Time decay
News matters most when it is fresh. A negative earnings story is highly predictive for the next 1-2 trading days; by day 5 the market has digested it; by day 30 the residual predictive power is near zero. ARIA Analyst applies an exponential decay with a 3-trading-day half-life: a story 3 trading days old gets half the weight of a story published today, and a 6-day-old story gets a quarter.
The half-life is tuned empirically on out-of-sample data. Faster decay (1-day half-life) under-weights persistent themes; slower decay (10-day half-life) over-weights stale news. Three trading days is the empirical sweet spot for equity headlines.
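The half-life rule is a one-line formula, weight = 0.5 ^ (age / half-life). A minimal sketch, with a hypothetical function name and the 3-trading-day half-life as the default:

```python
def decay_weight(age_trading_days: float, half_life: float = 3.0) -> float:
    """Exponential decay: the weight halves every `half_life` trading days."""
    if age_trading_days < 0:
        raise ValueError("age must be non-negative")
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return 0.5 ** (age_trading_days / half_life)
```

With the default half-life, a 30-day-old story gets a weight of 0.5 ^ 10, roughly 0.001, which is why such stories contribute almost nothing to the aggregate.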
Aggregation strategies
Once you have per-headline sentiment scores with source weights and time decay, you need to aggregate to a per-stock sentiment feature. ARIA Analyst uses two parallel aggregations:
Volume-weighted sentiment: the weighted average of all sentiment scores for the stock over the past 7 trading days, weighted by source and time decay. This is the "average sentiment" signal, useful but slow.
Extreme sentiment indicator: the count of strongly-positive minus strongly-negative articles in the past 3 days. This captures sentiment shocks (a major news event) better than the average.
Both features are included in the ML ensemble. The volume-weighted version contributes steady signal; the extreme version contributes event-driven signal. They are correlated but not identical, and the gradient boosting model uses both.
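The two aggregations can be sketched as follows. The `ScoredArticle` type, the function names, and the 0.6 threshold for "strongly" positive or negative are assumptions for illustration; the 7-day and 3-day windows and the 3-day half-life follow the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredArticle:
    score: float          # signed sentiment in [-1, 1]
    source_weight: float  # weight of the publishing source
    age_days: float       # age in trading days


def decay(age_days: float, half_life: float = 3.0) -> float:
    """Exponential time decay with the given half-life."""
    return 0.5 ** (age_days / half_life)


def weighted_sentiment(articles: List[ScoredArticle],
                       window_days: float = 7.0,
                       half_life: float = 3.0) -> float:
    """Average score over the window, weighted by source and time decay."""
    numerator = 0.0
    denominator = 0.0
    for a in articles:
        if a.age_days > window_days:
            continue  # outside the lookback window
        w = a.source_weight * decay(a.age_days, half_life)
        numerator += w * a.score
        denominator += w
    return numerator / denominator if denominator > 0 else 0.0


def extreme_indicator(articles: List[ScoredArticle],
                      window_days: float = 3.0,
                      threshold: float = 0.6) -> int:
    """Count of strongly positive minus strongly negative recent articles.

    The threshold is a hypothetical cutoff, not a value from the article.
    """
    recent = [a for a in articles if a.age_days <= window_days]
    positives = sum(1 for a in recent if a.score >= threshold)
    negatives = sum(1 for a in recent if a.score <= -threshold)
    return positives - negatives
```

Keeping the two features as separate functions mirrors how they are used: the average moves slowly, while the count jumps when a cluster of strong articles arrives.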
What the alpha actually looks like
Realistic numbers for news sentiment alpha, on top of a well-built fundamentals-and-technicals baseline:
- Daily prediction horizon: 0.05-0.10 Sharpe contribution. Most predictive immediately after major news; decays within days.
- Weekly horizon: 0.03-0.07 Sharpe contribution. Less direct effect because sentiment is mostly priced in within days.
- Monthly horizon: near zero direct alpha. By month's end, the headline-level news effect is fully absorbed.
- Cross-sectional event studies: stocks with sentiment-shock features in the top 10% outperform the bottom 10% by 30-50 bps over 5 trading days post-event, after costs.
The alpha is real but small. Anyone promising 0.5+ Sharpe from sentiment alone is overstating; anyone dismissing sentiment as worthless is also wrong. It is a useful complement to other features, not a stand-alone strategy.
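For readers who want to check the cross-sectional figure on their own data, a top-minus-bottom decile spread can be computed as below. This is a simplified sketch: the function name and input layout are hypothetical, and the result is gross of trading costs, unlike the after-cost numbers quoted above.

```python
from typing import List, Tuple


def decile_spread_bps(observations: List[Tuple[float, float]]) -> float:
    """Top-decile minus bottom-decile mean forward return, in basis points.

    observations: one (feature_value, forward_return) pair per stock,
    with returns expressed as decimals (0.001 = 10 bps). Gross of costs.
    """
    if len(observations) < 10:
        raise ValueError("need at least 10 observations to form deciles")
    ranked = sorted(observations, key=lambda pair: pair[0])
    k = len(ranked) // 10  # number of stocks per decile
    bottom = [ret for _, ret in ranked[:k]]
    top = [ret for _, ret in ranked[-k:]]
    return (sum(top) / k - sum(bottom) / k) * 10_000
```

Here `forward_return` would be the 5-trading-day return measured after the event date, so the feature value never sees the return it is being compared against.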
Common pitfalls
- Using generic sentiment models. The finance-specific variants are clearly better and the infrastructure cost is small.
- Weighting all sources equally. Bloomberg and SeekingAlpha are not the same signal.
- Ignoring time decay. A 30-day-old story has near-zero predictive power.
- Including unfiltered retail-sourced noise. Reddit, Twitter, and generic blog spam add more noise than signal at scale, outside a narrow set of heavily retail-followed names.
- Trading on sentiment alone. The signal is small; it works as a feature, not as a stand-alone strategy.
- Using forward-looking labels. Some sentiment datasets are labeled based on what subsequently happened to the stock price, which is look-ahead by definition. Use only contemporaneous labels.
Special case: earnings call transcripts
A higher-value variant of news sentiment is earnings-call-transcript NLP. The transcripts are richer than headlines and contain forward-looking statements from management. Sentiment models applied to transcripts can detect tone shifts ("we are seeing some pressure" vs. "we expect strong demand"), uncertainty markers (hedge words, qualifications), and analyst-question dynamics (whether analysts are pressing on weak spots).
Transcript NLP is generally more alpha-generative than headline NLP, with a 0.10-0.20 Sharpe contribution in some published studies. The trade-off is that transcripts are released quarterly, so the feature only updates four times per year. For systematic strategies, transcript NLP is a useful complement to headline NLP but cannot be a primary signal.
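Uncertainty markers are the easiest of these transcript features to prototype. The sketch below measures the share of hedge words in a passage; the word list and function name are hypothetical, and a production system would more likely use a curated finance lexicon or a trained classifier.

```python
import re
from typing import FrozenSet

# Hypothetical, deliberately small hedge-word list for illustration only.
HEDGE_TERMS: FrozenSet[str] = frozenset({
    "may", "might", "could", "possibly", "potentially",
    "approximately", "somewhat", "uncertain", "uncertainty",
})


def hedge_ratio(text: str, hedge_terms: FrozenSet[str] = HEDGE_TERMS) -> float:
    """Fraction of word tokens that appear in the hedge-word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in hedge_terms)
    return hits / len(tokens)
```

Comparing the ratio for the current call against the same company's previous calls is one way to express a tone shift as a number, since the absolute level varies a lot between management teams.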
Conclusion
NLP for news sentiment is a small but persistent alpha source for equity strategies, with realistic Sharpe contributions in the 0.05-0.15 range when implemented correctly. Implementation correctness means finance-specific models, source weighting by historical predictive value, exponential time decay with a 3-trading-day half-life, and both volume-weighted and extreme-event aggregations. Done correctly, it is a useful feature; done wrong, it is noise.
ARIA Analyst uses FinBERT-classified news sentiment as a feature in its ML ensemble.
Frequently asked questions
Can I use ChatGPT for news sentiment?
Yes, but with caveats. ChatGPT can produce reasonable sentiment scores on financial headlines, but it is much more expensive per inference than FinBERT (cents vs. fractions of a cent), non-deterministic across runs, and has training-data look-ahead concerns for any historical analysis. For interactive use on recent headlines, ChatGPT is fine; for systematic strategies over historical data, use a finance-specific model with a frozen training cutoff (FinBERT trained through some date D, applied only to articles after D).
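The frozen-cutoff rule amounts to a date filter applied before any backtest. A minimal sketch, with a hypothetical `Article` type and function name:

```python
from datetime import date
from typing import List, NamedTuple


class Article(NamedTuple):
    published: date
    headline: str


def usable_for_backtest(articles: List[Article],
                        model_cutoff: date) -> List[Article]:
    """Keep only articles published strictly after the model's training cutoff D.

    Articles on or before the cutoff may overlap with the training data,
    so scoring them would leak information into the backtest.
    """
    return [a for a in articles if a.published > model_cutoff]
```

The strict inequality is a conservative choice: an article published on the cutoff date itself is dropped, since it may or may not have been in the training corpus.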
How do I get news data for backtesting?
Several options. NewsAPI provides headline-level data from major sources with a few-month historical archive (~$50/month). RavenPack and AlexandriaTech provide industrial-strength historical news data with sentiment scores (institutional pricing). For free / academic use, Stooq has a limited news archive; some research datasets (NYT API, Reuters open archives) provide longer history. ARIA Analyst uses NewsAPI plus a custom RSS aggregator for live use and RavenPack-licensed data for backtest validation.
Does Twitter/X sentiment work for stocks?
Limited evidence either way. Aggregated retail social media sentiment can predict short-horizon (intraday to multi-day) moves in liquid, retail-followed stocks (TSLA, AMC, GME, large-cap tech). It works less well in B2B and small-cap stocks where retail sentiment is sparse. The signal is noisier than mainstream news sentiment and harder to filter for relevance. ARIA Analyst includes Twitter sentiment as a feature for the meme-stock cohort and excludes it elsewhere.