Building a Financial Research Agent: Graph + RAG + Continuous Evaluation

The Problem: Information Overload in Investment Research

As a quantamental investor managing a multi-asset portfolio, I constantly synthesize information from disparate sources: proprietary research notes, SEC filings, earnings transcripts, real-time market news, and sector analysis. The challenge isn't finding information—it's finding the right information at the right time and reasoning about it coherently.

Traditional search fails here. Keyword matching misses semantic intent. Manual research doesn't scale. I needed a system that could:

  • Store and retrieve my proprietary research semantically
  • Augment with real-time web intelligence
  • Reason across both sources to generate actionable insights

The Solution: A Financial Research Agent (Modular Stack)

The core idea is stable (tool-augmented retrieval + grounded synthesis), but the stack evolves as providers change. In the current codebase, the agent is a tool-calling loop with context management, backed by multiple stores (SQL + graph + vectors + cache) and a continuous evaluation suite that prevents silent regressions.

| Component | Implementation | Purpose |
| --- | --- | --- |
| LLM runtime | LiteLLM (OpenAI-compatible tool calling) | Reasoning + tool orchestration |
| Embeddings | Voyage AI finance embeddings (e.g., voyage-finance-2) | Semantic retrieval over filings and extracted report chunks |
| Vector store | Qdrant (vector search + metadata filters) | Fast retrieval with company/report/section/period filtering |
| Web intelligence | Exa (domain-filtered news / research / filings) | Fresh context: news, SEC filings, research |
| Knowledge graph | Neo4j (supply-chain graph) | Structured dependency queries and impact analysis |
| Relational store | Postgres (portfolio / positions) | Portfolio state and risk inputs |
| Cache | Redis (tool result caching) | Lower latency and more stable unit economics |

Evaluation Is the Product

In investment workflows, a “good demo” is not success. A system is only useful if it is consistently correct, properly sourced, and stable under distribution shift. I treated evaluation as a first-class feature: every change to prompts, tools, retrieval, or model routing must move measurable metrics in the right direction.

| What I evaluate | How it fails in practice | What I measure |
| --- | --- | --- |
| Tool selection | Calls the wrong tools, skips required tools, or loops | Expected tools vs actual tool calls (overlap/Jaccard) |
| Entity grounding | Drifts to the wrong ticker/supplier, mixes entities | Expected entities present; prohibited entities absent |
| Numeric accuracy | Wrong numbers, unit mistakes, extraction bugs | Expected values within tolerance (robust numeric parsing) |
| Hallucinations | Fabricated facts and confident nonsense | Must-not-contain constraints + factual checks |
| Difficulty tiers | Easy QA looks fine; strategy fails under pressure | Easy/Medium/Hard thresholds (factual → reasoning → strategy) |
| Latency + cost | Slow responses or unsustainable spend | Response time + tokens per turn |
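
As a concrete example of the first row, tool selection can be scored as a Jaccard overlap between expected and observed tool calls. A minimal sketch (function and field names are illustrative, not the repo's evaluation API):

# Illustrative sketch: tool selection scored as Jaccard overlap.
def tool_selection_score(expected_tools: list[str], actual_tool_calls: list[str]) -> float:
    """Jaccard overlap between the tools a case expects and the tools the agent actually called."""
    expected, actual = set(expected_tools), set(actual_tool_calls)
    if not expected and not actual:
        return 1.0  # nothing expected, nothing called: trivially correct
    return len(expected & actual) / len(expected | actual)

# The agent called web search but skipped the report store: partial credit, not a pass.
print(tool_selection_score(["search_financial_news", "search_reports"], ["search_financial_news"]))  # 0.5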

Architecture

┌────────────────────────────────────────────────────────────────────┐
│                            Agent Layer                             │
│     Router / Planner / Executor + Context Manager (tool calls)     │
└────────────────────────────────────────────────────────────────────┘
                                  │
┌────────────────────────────────────────────────────────────────────┐
│                             Tool Layer                             │
│  Portfolio Tools | Supply Chain Tools | Report Tools | Exa Search  │
└────────────────────────────────────────────────────────────────────┘
                                  │
┌────────────────────────────────────────────────────────────────────┐
│                           Storage Layer                            │
│      Postgres (portfolio) | Neo4j (graph) | Qdrant (vectors)       │
│                           Redis (cache)                            │
└────────────────────────────────────────────────────────────────────┘

Why This Design?

Vendor names are less important than the design constraints: retrieval quality, provenance/citations, latency/cost, and operational reliability. I pick providers that are strong on financial text, offer predictable latency, and can be evaluated and monitored.

Agent loop (LiteLLM + tool calling)

The repo implements a loop-based agent: the model proposes tool calls, tools execute, results go back into context, and the agent synthesizes a final answer. Using LiteLLM keeps the interface OpenAI-compatible so the underlying model can change without rewriting the loop.

from agent.main import FinancialResearchAgent

agent = FinancialResearchAgent(model="gpt-4o-mini")  # any OpenAI-compatible model via LiteLLM
response = await agent.chat("Give me a morning briefing on my portfolio")
print(response.content)
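
A minimal sketch of what such a loop looks like with LiteLLM tool calling; the tool schemas, registry, and loop shape here are assumptions, not the repo's exact code:

# Minimal tool-calling loop sketch (assumed structure, not the repo's actual agent code).
# tool_schemas: OpenAI-style function schemas; tool_registry: name -> async callable.
import json
import litellm

async def run_agent(messages, tool_schemas, tool_registry, model="gpt-4o-mini", max_turns=8):
    for _ in range(max_turns):
        resp = await litellm.acompletion(
            model=model, messages=messages, tools=tool_schemas, tool_choice="auto"
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:  # no tools requested: this is the final answer
            return msg.content
        # Echo the assistant turn (with its tool calls) back into context.
        messages.append({
            "role": "assistant",
            "content": msg.content or "",
            "tool_calls": [
                {"id": tc.id, "type": "function",
                 "function": {"name": tc.function.name, "arguments": tc.function.arguments}}
                for tc in msg.tool_calls
            ],
        })
        # Execute each requested tool and return its result as a tool message.
        for tc in msg.tool_calls:
            result = await tool_registry[tc.function.name](**json.loads(tc.function.arguments or "{}"))
            messages.append({"role": "tool", "tool_call_id": tc.id,
                             "content": json.dumps(result, default=str)})
    return "Stopped after max_turns without a final answer."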

Vector retrieval (Voyage finance embeddings + Qdrant filters)

Report chunks are embedded with a finance-specific model (query/document modes) and stored in Qdrant with indexed payload fields like company, report_type, section, and period. This makes retrieval precise and testable.

from storage.vector_store import VectorStore

vs = VectorStore()
await vs.init_collections()
results = await vs.search_reports(
    query="Blackwell demand commentary",
    companies=["NVDA"],
    report_types=["earnings_call"],
    limit=5,
)
for r in results:
    print(r.score, r.chunk.section, r.chunk.period)
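
Under the hood, a call like this roughly amounts to a query-mode Voyage embedding followed by a filtered Qdrant search. A sketch of that path, assuming a collection name and payload schema for illustration:

# Sketch of the embed-then-filter retrieval path (collection and payload field
# names are assumptions, not the repo's exact schema).
import voyageai
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchAny, MatchValue

voyage = voyageai.Client()                      # reads VOYAGE_API_KEY
qdrant = QdrantClient(url="http://localhost:6333")

# Queries and documents use different input modes with the same finance model.
query_vec = voyage.embed(
    ["Blackwell demand commentary"], model="voyage-finance-2", input_type="query"
).embeddings[0]

hits = qdrant.search(
    collection_name="report_chunks",
    query_vector=query_vec,
    query_filter=Filter(must=[
        FieldCondition(key="company", match=MatchAny(any=["NVDA"])),
        FieldCondition(key="report_type", match=MatchValue(value="earnings_call")),
    ]),
    limit=5,
)
for hit in hits:
    print(hit.score, hit.payload.get("section"), hit.payload.get("period"))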

Freshness (Exa web search with domain filters)

Web search is the “recency layer.” Exa is configured with domain allowlists for high-quality sources (news, research, SEC), plus utilities like full-article content fetching when deeper reading is needed.

from tools.web_search import ExaTools

exa = ExaTools()
news = await exa.search_financial_news("NVDA export restrictions", days_back=7, num_results=5)
print(news)
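
Behind a helper like search_financial_news sits a fairly direct Exa query: a domain allowlist, a recency window, and optional full-text fetching. A sketch with the exa_py client; the allowlist shown is illustrative, not the repo's actual configuration:

# Sketch of the underlying Exa query (domain list and date window are illustrative).
import os
from datetime import datetime, timedelta, timezone
from exa_py import Exa

exa_client = Exa(api_key=os.environ["EXA_API_KEY"])
start = (datetime.now(timezone.utc) - timedelta(days=7)).strftime("%Y-%m-%d")

results = exa_client.search_and_contents(
    "NVDA export restrictions",
    num_results=5,
    include_domains=["reuters.com", "bloomberg.com", "wsj.com", "sec.gov"],
    start_published_date=start,
    text=True,  # fetch full article text for deeper reading
)
for r in results.results:
    print(r.published_date, r.title, r.url)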

Structured reasoning (Neo4j supply-chain graph + portfolio state)

The differentiator vs. pure RAG is structured context: supply-chain relationships live in Neo4j and portfolio state lives in Postgres. This enables dependency queries (“who supplies who”) and impact analysis (“what breaks if CoWoS is constrained”) that are hard to do with text alone.

# In practice, the agent decides which tools to call (reports vs web vs graph vs portfolio)
# based on the query and the available context.
from agent.main import FinancialResearchAgent

agent = FinancialResearchAgent(model="gpt-4o-mini")
response = await agent.chat("What are the key risks for AI chip stocks right now?")
print(response.content)
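
When the agent reaches for the graph, the supply-chain tools reduce to parameterized Cypher. A sketch of a direct-dependency query with the official neo4j driver; node labels, relationship types, and properties are assumptions about the schema:

# Sketch of a supplier lookup against the supply-chain graph (schema is assumed).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def suppliers_of(ticker: str) -> list[dict]:
    """Companies that supply the given company, plus the component they provide."""
    query = """
    MATCH (s:Company)-[r:SUPPLIES]->(c:Company {ticker: $ticker})
    RETURN s.ticker AS supplier, r.component AS component
    """
    with driver.session() as session:
        return [dict(record) for record in session.run(query, ticker=ticker)]

print(suppliers_of("NVDA"))  # e.g. TSMC (CoWoS packaging), SK hynix (HBM), ...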

Implementation Details

Project Structure

financial-deep-research/
├── agent/                # Agent core loop + context manager
├── tools/                # Portfolio / supply chain / reports / Exa web search
├── storage/              # Postgres / Neo4j / Qdrant / Redis
├── ingestion/            # Report processing + graph building
├── evaluation/           # Datasets + metrics + runner
├── scripts/              # init_databases.py, run_evaluation.py
├── frontend/             # Optional UI
└── requirements.txt

Configuration

# .env
OPENAI_API_KEY=your_key
VOYAGE_API_KEY=your_key
EXA_API_KEY=your_key
QDRANT_URL=http://localhost:6333
NEO4J_URI=bolt://localhost:7687
POSTGRES_DSN=postgresql://user:pass@localhost:5432/db
REDIS_URL=redis://localhost:6379/0

CLI Usage

# Start agent (CLI loop)
python -m agent.main

# Initialize databases / collections
python scripts/init_databases.py

# Run evaluation suite (difficulty + category filters)
python scripts/run_evaluation.py --difficulty easy
python scripts/run_evaluation.py --difficulty hard --category investment_strategy --max-cases 5

What This System Enables

Everything below maps directly to modules in the repo (tools + storage + evaluation):

  • Portfolio Q&A: query positions, exposure, and risk from Postgres-backed portfolio tools
  • Supply-chain analysis: retrieve dependencies and run impact queries from the Neo4j knowledge graph
  • Report-grounded answers: semantic search over filings/transcripts via Voyage embeddings + Qdrant filters
  • Freshness: pull recent news/filings/research via Exa with domain filters
  • Repeatable iteration: run the evaluation suite to catch regressions in tool selection, numeric accuracy, and hallucinations

How I Evaluate It (Practical Checklist)

The most important thing I learned: tool-augmented RAG systems fail quietly. A correct-sounding answer can be based on the wrong tool output or a mis-grounded number. My evaluation loop is built around explicit test cases with expected tools/entities/values and difficulty tiers.

1) Build a gold set of questions

I maintain a benchmark suite of recurring investor workflows (supply-chain dependency, portfolio risk, report extraction, and strategic decisions). Each case specifies what “correct behavior” means: expected tools, expected entities, expected numbers within tolerance, and what must not appear.

# eval/questions.json (illustrative)
[
  {
    "id": "nvda_risks_2024q4",
    "question": "What are the near-term risks for NVDA given current valuation and export controls?",
    "expected_tools": ["search_financial_news", "search_reports"],
    "expected_entities": ["NVDA"],
    "must_not_contain": ["fabricated numbers"]
  }
]
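
The runner consumes cases like these as plain records. A minimal loader sketch; the dataclass and field names mirror the illustrative JSON above, and the repo's actual dataset format may differ:

# Sketch of loading the gold set for an evaluation run (format assumed from the JSON above).
import json
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    id: str
    question: str
    expected_tools: list = field(default_factory=list)
    expected_entities: list = field(default_factory=list)
    must_not_contain: list = field(default_factory=list)

def load_cases(path: str = "eval/questions.json") -> list[EvalCase]:
    with open(path) as f:
        return [EvalCase(**case) for case in json.load(f)]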

2) Retrieval-first metrics

Before judging the answer, I evaluate behavior: did the agent call the right tools and anchor to the right entities and numbers? If tool selection is wrong, “better prompting” only hides the failure mode.

  • Tool selection score: expected tools vs actual tool calls.
  • Entity grounding: expected entities present; prohibited entities absent.
  • Numeric correctness: extracted values within tolerance (and no “10-K → 10” parsing bugs); a minimal check is sketched below.
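
Numeric checks deserve extra care because naive parsing misfires on financial text. A minimal sketch of a tolerant check that ignores form names like 10-K (the regex and the 2% tolerance are illustrative defaults):

# Sketch of a tolerant numeric check that skips "10-K"/"10-Q" so they don't parse as 10.
import re

def numeric_match(answer: str, expected: float, rel_tol: float = 0.02) -> bool:
    """True if any number in the answer is within rel_tol of the expected value."""
    cleaned = re.sub(r"\b10-[KQ]\b", "", answer)
    numbers = [float(n.replace(",", "")) for n in re.findall(r"-?\d[\d,]*\.?\d*", cleaned)]
    return any(abs(n - expected) <= rel_tol * abs(expected) for n in numbers)

print(numeric_match("Data center revenue was $30.8B, up 112% YoY (per the 10-K).", 30.8))  # True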

3) Source-grounded answer scoring

For answers, I bias toward grounded outputs. When the agent is missing data, it should ask for missing context or avoid hard claims rather than hallucinating.

  • Factual checks: expected facts present, prohibited claims absent.
  • Hallucination detection: enforce must-not-contain constraints for known failure modes (a minimal check is sketched after this list).
  • Reasoning/strategy signals: for medium/hard cases, verify it provides analysis and actionable recommendations.
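
The factual and hallucination checks can stay simple substring constraints, which keeps them auditable. A minimal sketch; field names follow the illustrative eval JSON above:

# Sketch of grounded-answer checks: required facts present, prohibited claims absent.
def grounding_checks(answer: str, expected_facts: list[str], must_not_contain: list[str]) -> dict:
    text = answer.lower()
    return {
        "facts_missing": [f for f in expected_facts if f.lower() not in text],
        "violations": [p for p in must_not_contain if p.lower() in text],
    }

report = grounding_checks(
    answer="Near-term NVDA risks include export controls and customer concentration.",
    expected_facts=["export controls"],
    must_not_contain=["guaranteed returns"],
)
print(report)  # {'facts_missing': [], 'violations': []}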

4) Latency and token economics

For daily usage, cost and responsiveness are product features. I track latency percentiles and tokens-per-answer, and I use routing (cheaper models for summarization, stronger models for synthesis) to hit budgets without collapsing quality.
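
A sketch of what that routing can look like with LiteLLM, tracking usage per call; the model names and task split are illustrative, not the repo's actual routing policy:

# Sketch of cost-aware model routing (task split and model names are illustrative).
import litellm

MODEL_BY_TASK = {
    "summarize": "gpt-4o-mini",  # cheap, high-volume condensation of tool output
    "synthesize": "gpt-4o",      # stronger model for the final investment answer
}

async def route_completion(task: str, messages: list[dict]) -> str:
    resp = await litellm.acompletion(model=MODEL_BY_TASK[task], messages=messages)
    usage = resp.usage
    # Track tokens per turn so cost regressions show up alongside quality metrics.
    print(f"{task}: {usage.prompt_tokens} prompt + {usage.completion_tokens} completion tokens")
    return resp.choices[0].message.content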

5) Monitoring for drift

Markets change and sources change. I log query types, retrieval distributions (doc types, dates), and “answer confidence” proxies (abstentions, citation density) to detect drift early and refresh the benchmark set.
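
One lightweight way to capture this is a JSON-lines log per query that can be aggregated into distributions later; field names and the log path here are illustrative:

# Sketch of a per-query drift log (field names and log path are illustrative).
import json
import time

def log_query(query_type: str, doc_types: list[str], doc_dates: list[str],
              abstained: bool, num_citations: int,
              path: str = "logs/retrieval.jsonl") -> None:
    record = {
        "ts": time.time(),
        "query_type": query_type,
        "doc_types": doc_types,          # shifts here flag stale or missing sources
        "doc_dates": doc_dates,
        "abstained": abstained,          # rising abstentions suggest coverage gaps
        "num_citations": num_citations,  # falling citation density suggests drift
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")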

In practice, I run this continuously during iteration:

python scripts/run_evaluation.py --difficulty easy
python scripts/run_evaluation.py --difficulty hard --category investment_strategy --max-cases 5

Try It Yourself

The complete implementation lives at github.com/Yvette-0508/financial-deep-research (currently a private repository).

git clone https://github.com/Yvette-0508/financial-deep-research
cd financial-deep-research
pip install -r requirements.txt
# Add your API keys to .env
python scripts/init_databases.py
python -m agent.main