The Problem: Information Overload in Investment Research
As a quantamental investor managing a multi-asset portfolio, I constantly synthesize information from disparate sources: proprietary research notes, SEC filings, earnings transcripts, real-time market news, and sector analysis. The challenge isn't finding information—it's finding the right information at the right time and reasoning about it coherently.
Traditional search fails here. Keyword matching misses semantic intent. Manual research doesn't scale. I needed a system that could:
- Store and retrieve my proprietary research semantically
- Augment with real-time web intelligence
- Reason across both sources to generate actionable insights
The Solution: A Financial Research Agent (Modular Stack)
The core idea is stable (tool-augmented retrieval + grounded synthesis), but the stack evolves as providers change. In the current codebase, the agent is a tool-calling loop with context management, backed by multiple stores (SQL + graph + vectors + cache) and a continuous evaluation suite that prevents silent regressions.
| Component | Implementation | Purpose |
|---|---|---|
| LLM runtime | LiteLLM (OpenAI-compatible tool calling) | Reasoning + tool orchestration |
| Embeddings | Voyage AI finance embeddings (e.g., voyage-finance-2) | Semantic retrieval over filings and extracted report chunks |
| Vector store | Qdrant (vector search + metadata filters) | Fast retrieval with company/report/section/period filtering |
| Web intelligence | Exa (domain-filtered news / research / filings) | Fresh context: news, SEC filings, research |
| Knowledge graph | Neo4j (supply-chain graph) | Structured dependency queries and impact analysis |
| Relational store | Postgres (portfolio / positions) | Portfolio state and risk inputs |
| Cache | Redis (tool result caching) | Lower latency and more stable unit economics |
Evaluation Is the Product
In investment workflows, a “good demo” is not success. A system is only useful if it is consistently correct, properly sourced, and stable under distribution shift. I treated evaluation as a first-class feature: every change to prompts, tools, retrieval, or model routing must move measurable metrics in the right direction.
| What I evaluate | How it fails in practice | What I measure |
|---|---|---|
| Tool selection | Calls the wrong tools, skips required tools, or loops | Expected tools vs actual tool calls (overlap/Jaccard) |
| Entity grounding | Drifts to the wrong ticker/supplier, mixes entities | Expected entities present; prohibited entities absent |
| Numeric accuracy | Wrong numbers, unit mistakes, extraction bugs | Expected values within tolerance (robust numeric parsing) |
| Hallucinations | Fabricated facts and confident nonsense | Must-not-contain constraints + factual checks |
| Difficulty tiers | Easy QA looks fine; strategy fails under pressure | Easy/Medium/Hard thresholds (factual → reasoning → strategy) |
| Latency + cost | Slow responses or unsustainable spend | Response time + tokens per turn |
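To make the first row concrete, here is a minimal sketch of scoring tool selection as a Jaccard overlap between expected and actual tool calls. The function name and example values are illustrative, not the repo's exact evaluation API.

def tool_selection_score(expected_tools: list[str], actual_tool_calls: list[str]) -> float:
    """Jaccard overlap between the tools a case expects and the tools the agent actually called."""
    expected, actual = set(expected_tools), set(actual_tool_calls)
    if not expected and not actual:
        return 1.0
    return len(expected & actual) / len(expected | actual)

# Example: the agent called the news tool but skipped report retrieval.
score = tool_selection_score(
    expected_tools=["search_financial_news", "search_reports"],
    actual_tool_calls=["search_financial_news"],
)
print(f"tool selection: {score:.2f}")  # 0.50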
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Agent Layer │
│ Router / Planner / Executor + Context Manager (tool calls) │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────────┐
│ Tool Layer │
│ Portfolio Tools | Supply Chain Tools | Report Tools | Exa Search │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────────┐
│ Storage Layer │
│ Postgres (portfolio) | Neo4j (graph) | Qdrant (vectors) │
│ Redis (cache) │
└─────────────────────────────────────────────────────────────────┘
Why This Design?
Vendor names are less important than the design constraints: retrieval quality, provenance/citations, latency/cost, and operational reliability. I pick providers that are strong on financial text, offer predictable latency, and can be evaluated and monitored.
Agent loop (LiteLLM + tool calling)
The repo implements a loop-based agent: the model proposes tool calls, tools execute, results go back into context, and the agent synthesizes a final answer. Using LiteLLM keeps the interface OpenAI-compatible so the underlying model can change without rewriting the loop.
from agent.main import FinancialResearchAgent

agent = FinancialResearchAgent(model="gpt-4o-mini")  # any OpenAI-compatible model via LiteLLM
response = await agent.chat("Give me a morning briefing on my portfolio")
print(response.content)
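Under the hood, the loop follows the standard OpenAI-compatible tool-calling pattern. Below is a minimal sketch of such a loop with LiteLLM; the tool stub, schema, and dispatch table are illustrative, not the repo's actual tool registry.

import json

import litellm

async def get_portfolio_positions() -> dict:
    # Stand-in tool; the real implementation queries Postgres via the portfolio tools.
    return {"NVDA": {"shares": 1200}, "TSM": {"shares": 800}}

TOOLS = {"get_portfolio_positions": get_portfolio_positions}
TOOL_SCHEMAS = [{
    "type": "function",
    "function": {
        "name": "get_portfolio_positions",
        "description": "Return current portfolio positions",
        "parameters": {"type": "object", "properties": {}},
    },
}]

async def run_agent(question: str, model: str = "gpt-4o-mini", max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        response = await litellm.acompletion(model=model, messages=messages, tools=TOOL_SCHEMAS)
        msg = response.choices[0].message
        if not msg.tool_calls:                  # no tool calls -> final answer
            return msg.content
        messages.append(msg)                    # keep the assistant's tool-call turn in context
        for call in msg.tool_calls:             # execute each requested tool and feed results back
            result = await TOOLS[call.function.name](**json.loads(call.function.arguments or "{}"))
            messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
    return "Stopped after max_turns without a final answer."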
Vector retrieval (Voyage finance embeddings + Qdrant filters)
Report chunks are embedded with a finance-specific model (query/document modes) and stored in Qdrant with indexed payload fields
like company, report_type, section, and period. This makes retrieval precise and testable.
from storage.vector_store import VectorStore

vs = VectorStore()
await vs.init_collections()

results = await vs.search_reports(
    query="Blackwell demand commentary",
    companies=["NVDA"],
    report_types=["earnings_call"],
    limit=5,
)
for r in results:
    print(r.score, r.chunk.section, r.chunk.period)
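The VectorStore wrapper hides the provider calls, but the underlying pattern is: embed the query with the finance model in query mode, then search Qdrant with payload filters. A rough sketch using the voyageai and qdrant-client SDKs directly; the collection name and payload field names are assumptions based on the wrapper above.

import voyageai
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

vo = voyageai.Client()                      # reads VOYAGE_API_KEY from the environment
qdrant = QdrantClient(url="http://localhost:6333")

# Queries are embedded in "query" mode; documents use input_type="document" at ingest time.
query_vector = vo.embed(
    ["Blackwell demand commentary"], model="voyage-finance-2", input_type="query"
).embeddings[0]

# Filter on indexed payload fields so retrieval stays scoped to the right company and report type.
hits = qdrant.search(
    collection_name="report_chunks",        # assumed collection name
    query_vector=query_vector,
    query_filter=Filter(must=[
        FieldCondition(key="company", match=MatchValue(value="NVDA")),
        FieldCondition(key="report_type", match=MatchValue(value="earnings_call")),
    ]),
    limit=5,
)
for hit in hits:
    print(hit.score, hit.payload.get("section"), hit.payload.get("period"))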
Freshness (Exa web search with domain filters)
Web search is the “recency layer.” Exa is configured with domain allowlists for high-quality sources (news, research, SEC), plus utilities like full-article content fetching when deeper reading is needed.
from tools.web_search import ExaTools

exa = ExaTools()
news = await exa.search_financial_news("NVDA export restrictions", days_back=7, num_results=5)
print(news)
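Under the wrapper, this maps to Exa's search-and-contents call with a domain allowlist and a published-date window. A sketch with the exa_py SDK; the specific domains here are illustrative, not the repo's configured allowlist.

import os
from datetime import datetime, timedelta, timezone

from exa_py import Exa

exa = Exa(api_key=os.environ["EXA_API_KEY"])
since = (datetime.now(timezone.utc) - timedelta(days=7)).strftime("%Y-%m-%d")

results = exa.search_and_contents(
    "NVDA export restrictions",
    num_results=5,
    include_domains=["reuters.com", "bloomberg.com", "sec.gov"],  # illustrative allowlist
    start_published_date=since,
    text=True,                      # fetch full article text, not just links
)
for r in results.results:
    print(r.published_date, r.title, r.url)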
Structured reasoning (Neo4j supply-chain graph + portfolio state)
The differentiator vs. pure RAG is structured context: supply-chain relationships live in Neo4j and portfolio state lives in Postgres. This enables dependency queries (“who supplies whom”) and impact analysis (“what breaks if CoWoS is constrained”) that are hard to do with text alone.
# In practice, the agent decides which tools to call (reports vs web vs graph vs portfolio)
# based on the query and the available context.
from agent.main import FinancialResearchAgent
agent = FinancialResearchAgent(model="gpt-4o-mini")
response = await agent.chat("What are the key risks for AI chip stocks right now?")
print(response.content)
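On the graph side, the supply-chain tools reduce to Cypher over the Neo4j driver. A sketch of a dependency query, assuming a simple schema of (:Company {ticker}) nodes connected by [:SUPPLIES {component}] relationships; the real schema may differ.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def suppliers_of(ticker: str) -> list[dict]:
    """Who supplies this company, and with what component?"""
    query = (
        "MATCH (s:Company)-[r:SUPPLIES]->(c:Company {ticker: $ticker}) "
        "RETURN s.ticker AS supplier, r.component AS component"
    )
    with driver.session() as session:
        return [record.data() for record in session.run(query, ticker=ticker)]

print(suppliers_of("NVDA"))  # e.g. [{"supplier": "TSM", "component": "CoWoS packaging"}, ...]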
Implementation Details
Project Structure
financial-deep-research/
├── agent/            # Agent core loop + context manager
├── tools/            # Portfolio / supply chain / reports / Exa web search
├── storage/          # Postgres / Neo4j / Qdrant / Redis
├── ingestion/        # Report processing + graph building
├── evaluation/       # Datasets + metrics + runner
├── scripts/          # init_databases.py, run_evaluation.py
├── frontend/         # Optional UI
└── requirements.txt
Configuration
# .env
OPENAI_API_KEY=your_key
VOYAGE_API_KEY=your_key
EXA_API_KEY=your_key
QDRANT_URL=http://localhost:6333
NEO4J_URI=bolt://localhost:7687
POSTGRES_DSN=postgresql://user:pass@localhost:5432/db
REDIS_URL=redis://localhost:6379/0
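One way to load these values in code is python-dotenv plus os.getenv. The variable names match the .env above, but this loader is only a sketch, not the repo's config module.

import os

from dotenv import load_dotenv

load_dotenv()  # read .env from the project root into the process environment

QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
NEO4J_URI = os.getenv("NEO4J_URI", "bolt://localhost:7687")
POSTGRES_DSN = os.environ["POSTGRES_DSN"]   # fail fast if the database DSN is missing
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")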
CLI Usage
# Start agent (CLI loop)
python -m agent.main

# Initialize databases / collections
python scripts/init_databases.py

# Run evaluation suite (difficulty + category filters)
python scripts/run_evaluation.py --difficulty easy
python scripts/run_evaluation.py --difficulty hard --category investment_strategy --max-cases 5
What This System Enables
Everything below maps directly to modules in the repo (tools + storage + evaluation):
- Portfolio Q&A: query positions, exposure, and risk from Postgres-backed portfolio tools (see the sketch after this list)
- Supply-chain analysis: retrieve dependencies and run impact queries from the Neo4j knowledge graph
- Report-grounded answers: semantic search over filings/transcripts via Voyage embeddings + Qdrant filters
- Freshness: pull recent news/filings/research via Exa with domain filters
- Repeatable iteration: run the evaluation suite to catch regressions in tool selection, numeric accuracy, and hallucinations
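As a concrete example of the portfolio Q&A item, the Postgres-backed tools boil down to parameterized queries over the portfolio tables. A sketch with asyncpg; the positions table and its columns are assumptions, not the repo's actual schema.

import asyncio
import os

import asyncpg

async def largest_positions(limit: int = 5) -> list[dict]:
    """Return the top positions by market value from an assumed `positions` table."""
    conn = await asyncpg.connect(os.environ["POSTGRES_DSN"])
    try:
        rows = await conn.fetch(
            "SELECT ticker, quantity, market_value FROM positions "
            "ORDER BY market_value DESC LIMIT $1",
            limit,
        )
        return [dict(r) for r in rows]
    finally:
        await conn.close()

print(asyncio.run(largest_positions()))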
How I Evaluate It (Practical Checklist)
The most important thing I learned: tool-augmented RAG systems fail quietly. A correct-sounding answer can be based on the wrong tool output or a mis-grounded number. My evaluation loop is built around explicit test cases with expected tools/entities/values and difficulty tiers.
1) Build a gold set of questions
I maintain a benchmark suite of recurring investor workflows (supply-chain dependency, portfolio risk, report extraction, and strategic decisions). Each case specifies what “correct behavior” means: expected tools, expected entities, expected numbers within tolerance, and what must not appear.
# eval/questions.json (illustrative)
[
{
"id": "nvda_risks_2024q4",
"question": "What are the near-term risks for NVDA given current valuation and export controls?",
"expected_tools": ["search_financial_news", "search_reports"],
"expected_entities": ["NVDA"],
"must_not_contain": ["fabricated numbers"]
}
]
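The runner then scores each agent transcript against these cases. A stripped-down sketch of the entity-grounding and must-not-contain checks; field names follow the JSON above, and the real runner in evaluation/ does more (tool selection, numeric tolerance, difficulty tiers).

def check_case(case: dict, answer: str) -> dict:
    """Grounding checks for one eval case: expected entities present, prohibited strings absent."""
    text = answer.lower()
    return {
        "id": case["id"],
        "entities_grounded": all(e.lower() in text for e in case.get("expected_entities", [])),
        "hallucination_free": not any(p.lower() in text for p in case.get("must_not_contain", [])),
    }

case = {
    "id": "nvda_risks_2024q4",
    "expected_entities": ["NVDA"],
    "must_not_contain": ["fabricated numbers"],
}
print(check_case(case, "NVDA faces export-control and valuation risk per recent filings and news."))
# {'id': 'nvda_risks_2024q4', 'entities_grounded': True, 'hallucination_free': True}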
2) Retrieval-first metrics
Before judging the answer, I evaluate behavior: did the agent call the right tools and anchor to the right entities and numbers? If tool selection is wrong, “better prompting” only hides the failure mode.
- Tool selection score: expected tools vs actual tool calls.
- Entity grounding: expected entities present; prohibited entities absent.
- Numeric correctness: extracted values within tolerance (and no “10-K → 10” parsing bugs).
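The numeric check deserves care: naive parsing happily turns "10-K" into the number 10. A sketch of a tolerance check that strips filing identifiers before extracting values; the patterns and threshold are illustrative.

import math
import re

FILING_IDS = re.compile(r"\b10-[KQ]\b|\b8-K\b")            # filing names that look numeric

def extract_numbers(text: str) -> list[float]:
    cleaned = FILING_IDS.sub(" ", text)                     # avoid the "10-K -> 10" parsing bug
    return [float(m.replace(",", "")) for m in re.findall(r"-?\d[\d,]*\.?\d*", cleaned)]

def value_within_tolerance(expected: float, answer: str, rel_tol: float = 0.02) -> bool:
    return any(math.isclose(v, expected, rel_tol=rel_tol) for v in extract_numbers(answer))

print(value_within_tolerance(26.3, "Gross margin was 26.3% in the quarter, per the 10-Q"))  # True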
3) Source-grounded answer scoring
For answers, I bias toward grounded outputs. When the agent is missing data, it should ask for missing context or avoid hard claims rather than hallucinating.
- Factual checks: expected facts present, prohibited claims absent.
- Hallucination detection: enforce must-not-contain constraints for known failure modes.
- Reasoning/strategy signals: for medium/hard cases, verify it provides analysis and actionable recommendations.
4) Latency and token economics
For daily usage, cost and responsiveness are product features. I track latency percentiles and tokens-per-answer, and I use routing (cheaper models for summarization, stronger models for synthesis) to hit budgets without collapsing quality.
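Routing is mostly a lookup from task type to model, with everything behind the same LiteLLM call so the loop never changes. A sketch; the model names and task buckets are illustrative, not the repo's routing table.

import litellm

# Illustrative routing table: cheap model for summarization, stronger model for synthesis.
MODEL_ROUTES = {
    "summarize": "gpt-4o-mini",
    "synthesize": "gpt-4o",
}

async def routed_completion(task_type: str, messages: list[dict]) -> str:
    model = MODEL_ROUTES.get(task_type, "gpt-4o-mini")
    response = await litellm.acompletion(model=model, messages=messages)
    usage = response.usage                      # token accounting feeds the cost metrics
    print(f"{model}: {usage.prompt_tokens} prompt + {usage.completion_tokens} completion tokens")
    return response.choices[0].message.content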
5) Monitoring for drift
Markets change and sources change. I log query types, retrieval distributions (doc types, dates), and “answer confidence” proxies (abstentions, citation density) to detect drift early and refresh the benchmark set.
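Concretely, this is one log record per query plus periodic aggregation. A minimal sketch of the kind of counters involved; the field names and proxies are illustrative.

from collections import Counter
from datetime import datetime, timezone

query_log: list[dict] = []

def log_query(query_type: str, retrieved_doc_types: list[str], cited_sources: int, abstained: bool) -> None:
    query_log.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "query_type": query_type,
        "doc_types": retrieved_doc_types,
        "cited_sources": cited_sources,
        "abstained": abstained,
    })

def drift_snapshot() -> dict:
    """Aggregate proxies that tend to move when sources or agent behavior drift."""
    doc_types = Counter(dt for rec in query_log for dt in rec["doc_types"])
    n = len(query_log) or 1
    return {
        "doc_type_distribution": dict(doc_types),
        "abstention_rate": sum(r["abstained"] for r in query_log) / n,
        "avg_citations": sum(r["cited_sources"] for r in query_log) / n,
    }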
In practice, I run this continuously during iteration:
python scripts/run_evaluation.py --difficulty easy
python scripts/run_evaluation.py --difficulty hard --category investment_strategy --max-cases 5
Try It Yourself
The complete implementation is available at: github.com/Yvette-0508/financial-deep-research (private)
git clone https://github.com/Yvette-0508/financial-deep-research
cd financial-deep-research
pip install -r requirements.txt

# Add your API keys to .env
python scripts/init_databases.py
python -m agent.main