Market Signal Extraction
Lead analyst · Jan–May 2024
⚡ IT sector signal: ~13% backtested return
Problem
Investment teams read through enormous volumes of financial news, but turning that reading into something systematically useful is hard. The challenge isn't gathering the data. It's building signals you can actually validate against real market returns rather than just assuming they're informative.
What I Built
- 01 We started by comparing traditional topic models (LDA and BERTopic) against GPT-3.5 with structured prompts via LangChain. The goal was to validate the methodology rather than assume the newer approach would win.
- 02 GPT-3.5 outperformed FinBERT on ambiguous and forward-looking financial language. FinBERT is solid for clear-cut cases, but it struggled with the nuanced phrasing that shows up in real analyst reports.
- 03 Once we had article-level sentiment, we rolled it up to sector scores across all 11 S&P 500 sectors and ranked them by signal strength for factor construction.
- 04 Alphalens gave us a rigorous way to test whether the signals predicted forward returns, not just whether they correlated with past prices.
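The roll-up from article-level sentiment to ranked sector scores can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the record layout, the use of a plain mean, the ranking by signed score, and the names `sector_scores` and `rank_sectors` are all assumptions, and the numbers are made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical article-level records: (date, sector, sentiment in [-1, 1]).
articles = [
    ("2024-01-02", "Information Technology", 0.6),
    ("2024-01-02", "Information Technology", 0.2),
    ("2024-01-02", "Energy", -0.4),
    ("2024-01-02", "Energy", 0.0),
    ("2024-01-02", "Financials", 0.1),
]


def sector_scores(records):
    """Average article sentiment per (date, sector) pair."""
    buckets = defaultdict(list)
    for date, sector, sentiment in records:
        buckets[(date, sector)].append(sentiment)
    return {key: mean(values) for key, values in buckets.items()}


def rank_sectors(scores, date):
    """Return the sectors seen on one date, highest score first."""
    day = {sector: s for (d, sector), s in scores.items() if d == date}
    return sorted(day, key=day.get, reverse=True)


scores = sector_scores(articles)
ranking = rank_sectors(scores, "2024-01-02")
```

A plain mean is the simplest choice here. Weighting by article length, source, or recency would be a natural variation, but nothing above implies the project did that.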
~68K S&P 500 financial news articles → LDA + BERTopic topic modeling (baseline comparison) → GPT-3.5 + LangChain prompt engineering (primary) → FinBERT sentiment classification → Sector-level signal aggregation (11 sectors) → Alphalens factor validation against real market returns
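To make the final validation step concrete, here is a rough sketch of the kind of statistic Alphalens reports: the information coefficient, a Spearman rank correlation between a signal and the returns that follow it. This is a standard-library stand-in written for illustration, not the Alphalens API. The helper names and the five sample values are hypothetical.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = shared
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def rank_ic(signal, forward_returns):
    """Spearman rank correlation between a signal and later returns."""
    return pearson(ranks(signal), ranks(forward_returns))


# Made-up sector scores and next-period returns for five sectors.
signal = [0.40, 0.10, -0.20, 0.05, -0.35]
fwd_ret = [0.021, 0.004, -0.010, 0.008, -0.015]
ic = rank_ic(signal, fwd_ret)
```

The point of pairing each score with the return that comes after it is to test prediction rather than coincidence. An IC that stays positive across many dates is evidence of a usable signal, while a single date like this one says very little.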
Results
The IT sector signal produced about 13% in backtested returns, with a clear relationship between topic-level sentiment and what the sector did afterward. AllianceBernstein's portfolio team received the full factor analysis, showing that LLM-extracted signals can generate statistically meaningful alpha when built and validated carefully.
Stack
Python · LDA · BERTopic · GPT-3.5 API · LangChain · FinBERT · Alphalens · Pandas