Elia Alghazal

SHARP-RAG

Self-correcting agentic RAG pipeline for multi-hop question answering

THE PROBLEM

Naive RAG retrieves documents once and generates an answer regardless of whether the retrieved context actually supports it. On multi-hop questions, where the answer requires chaining facts across multiple documents, this single-pass approach fails structurally. SHARP-RAG addresses this with a self-correcting loop: a Critic agent evaluates evidence quality and, when it is insufficient, triggers targeted re-retrieval before any answer is generated.

SYSTEM DESIGN
[Pipeline diagram: HotpotQA Question → Planner → Retriever → Critic → Synthesizer → Answer. Other labels in the diagram: ChromaDB, re-retrieve.]
ENGINEERING DECISIONS
LangGraph instead of function chaining
The critique loop needs to route back to the Retriever when evidence is rated insufficient. A standard function chain cannot return to an earlier step without recursion, and it loses shared state between calls. LangGraph's cyclic graph with a shared AgentState object solved both problems at once. When I needed to add or remove the critique loop during development, it was a single edge change, not a refactor.
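The routing idea can be sketched without the library. The following is a plain-Python illustration of a cyclic graph over one shared state object, not LangGraph's API and not the project's code. The state fields, the toy "two documents is enough" rule, and the `max_retries` cap are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple, TypedDict


class AgentState(TypedDict):
    question: str
    documents: List[str]
    verdict: str
    retries: int


def retriever(state: AgentState) -> AgentState:
    # Stand-in retrieval: each pass appends one placeholder document.
    state["documents"].append(f"doc-{len(state['documents']) + 1}")
    return state


def critic(state: AgentState) -> AgentState:
    # Toy rule (assumption): evidence is sufficient once two documents exist.
    enough = len(state["documents"]) >= 2
    state["verdict"] = "sufficient" if enough else "insufficient"
    if not enough:
        state["retries"] += 1
    return state


def synthesizer(state: AgentState) -> AgentState:
    # A real synthesizer would generate the answer here.
    return state


def route_after_critic(state: AgentState, max_retries: int = 2) -> str:
    # Conditional edge: loop back to the retriever while evidence is insufficient.
    if state["verdict"] == "insufficient" and state["retries"] <= max_retries:
        return "retriever"
    return "synthesizer"


NODES: Dict[str, Callable[[AgentState], AgentState]] = {
    "retriever": retriever,
    "critic": critic,
    "synthesizer": synthesizer,
}
EDGES: Dict[str, Callable[[AgentState], str]] = {
    "retriever": lambda s: "critic",
    "critic": route_after_critic,
    "synthesizer": lambda s: "END",
}


def run(question: str, edges: Dict[str, Callable[[AgentState], str]] = EDGES) -> Tuple[AgentState, List[str]]:
    state: AgentState = {"question": question, "documents": [], "verdict": "", "retries": 0}
    node, trace = "retriever", []
    while node != "END":
        trace.append(node)
        state = NODES[node](state)  # every node reads and writes the same state
        node = edges[node](state)
    return state, trace
```

In this sketch, removing the critique loop means replacing only the `"critic"` entry in `EDGES` with an edge that always returns `"synthesizer"`; no node function changes.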
Structured JSON for the Critic's verdict, not free text
The Critic could have returned natural language. "The evidence seems insufficient" works for a human but breaks everything downstream that needs to act on it. A structured JSON object with fixed keys (verdict, confidence, missing information, and suggested queries) made the output machine-readable without free-text parsing, let the Retriever use suggested queries directly on the next pass, and made critique behavior loggable and debuggable across runs. The paper's core finding about calibration would have been invisible without structured output to analyze.
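A minimal sketch of consuming such a verdict is shown here. The four fields come from the description above, but the exact key spellings (`missing_information`, `suggested_queries`), the list types, and the two allowed verdict values are assumptions, as is the sample payload.

```python
import json
from dataclasses import dataclass
from typing import List

# Key names are assumed spellings of the four fields described above.
REQUIRED_KEYS = {"verdict", "confidence", "missing_information", "suggested_queries"}
ALLOWED_VERDICTS = ("sufficient", "insufficient")  # assumed label set


@dataclass
class CriticVerdict:
    verdict: str
    confidence: float
    missing_information: List[str]
    suggested_queries: List[str]


def parse_verdict(raw: str) -> CriticVerdict:
    """Decode the Critic's JSON output and fail loudly on malformed verdicts."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("Critic output must be a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Critic output is missing keys: {sorted(missing)}")
    if data["verdict"] not in ALLOWED_VERDICTS:
        raise ValueError(f"Unknown verdict: {data['verdict']!r}")
    return CriticVerdict(
        verdict=data["verdict"],
        confidence=float(data["confidence"]),
        missing_information=list(data["missing_information"]),
        suggested_queries=list(data["suggested_queries"]),
    )


# Hypothetical Critic output, for illustration only.
SAMPLE = json.dumps({
    "verdict": "insufficient",
    "confidence": 0.4,
    "missing_information": ["nationality of the director"],
    "suggested_queries": ["What nationality is the director?"],
})
```

With a typed object like this, the next retrieval pass can iterate over `parse_verdict(SAMPLE).suggested_queries` directly, and each verdict can be logged as one structured record per run.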
Adaptive top-k across retries instead of fixed k
Retrying with the same k value on the same query retrieves overlapping documents from the same dense neighborhood. Expanding k on each retry, from 3 to 5 to 7, forces the retriever to reach further into the index. The tradeoff is a noisier context on later passes, which turned out to be significant when the 8B Critic over-triggered retries. That tradeoff is precisely what the paper analyzes.
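The schedule itself is small enough to sketch. The values 3, 5, 7 come from the text above; clamping at the last value for any further retries is an assumption, and `retrieve` here slices a pre-ranked list as a stand-in for a real vector search.

```python
from typing import List, Sequence

K_SCHEDULE = (3, 5, 7)  # k for the first pass, first retry, second retry


def top_k_for_attempt(attempt: int, schedule: Sequence[int] = K_SCHEDULE) -> int:
    """Return k for a zero-based attempt index, clamping at the last value (assumption)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    return schedule[min(attempt, len(schedule) - 1)]


def retrieve(ranked_ids: Sequence[str], attempt: int) -> List[str]:
    # Stand-in for a vector search: take the top-k of an already ranked list.
    return list(ranked_ids[: top_k_for_attempt(attempt)])
```

Because each pass returns a superset of the previous one, every retry contributes only the two newly admitted, lower-ranked documents, which is where the extra noise on later passes comes from. With ChromaDB, the chosen k would presumably be passed as `n_results` to `collection.query`.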
A separate Planner agent instead of direct retrieval
Multi-hop questions chain facts across documents. A question like "What nationality is the director of the film that won the Palme d'Or in 1994?" needs at least two queries: who directed the film, and where that person is from. Retrieving on the full original question returns documents about the film, not the director's background. The Planner decomposes the question into ordered sub-queries before any retrieval happens, which lifted first-pass recall on bridge questions regardless of what the Critic does downstream.
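As a rough illustration of plan-then-retrieve, the sketch below hard-codes the decomposition that an LLM-backed Planner would produce and runs retrieval once per sub-query. The function names, the fake index, and the document ids are hypothetical; only the two-step decomposition follows the example above.

```python
from typing import Callable, Dict, List


def plan(question: str) -> List[str]:
    # Hard-coded stand-in for the Planner's LLM call (assumption).
    return [
        "Who directed the film that won the Palme d'Or in 1994?",
        "What nationality is that director?",
    ]


def retrieve_for_plan(sub_queries: List[str], retrieve: Callable[[str], List[str]]) -> List[str]:
    """Retrieve per sub-query, keeping plan order and dropping duplicate documents."""
    seen, context = set(), []
    for query in sub_queries:
        for doc_id in retrieve(query):
            if doc_id not in seen:
                seen.add(doc_id)
                context.append(doc_id)
    return context


# Hypothetical index: maps each sub-query to the document ids a retriever might return.
FAKE_INDEX: Dict[str, List[str]] = {
    "Who directed the film that won the Palme d'Or in 1994?": ["film_article"],
    "What nationality is that director?": ["film_article", "director_bio"],
}

QUESTION = "What nationality is the director of the film that won the Palme d'Or in 1994?"
context = retrieve_for_plan(plan(QUESTION), lambda q: FAKE_INDEX.get(q, []))
```

In this toy setup the second sub-query is what surfaces `director_bio`, the document that a single retrieval on the full question would tend to miss.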
OUTCOMES
  1. Preprint published on Zenodo, June 2026
  2. SHARP-RAG v2 achieves 15.0% EM and 15.8% F1 on HotpotQA
  3. Central finding: critique model calibration, not architecture, is the determining performance variable
  4. Identified over-triggering as a measurable failure mode with a clear path to resolution via the 70B production model
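For readers unfamiliar with the two metrics, here is a sketch of exact match (EM) and token-level F1 using the SQuAD-style answer normalization that is commonly applied to HotpotQA answers. Whether the preprint's evaluation used exactly this normalization is an assumption; this is not the project's evaluation script.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    # 1.0 only when the normalized strings are identical.
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    # Harmonic mean of token precision and recall over the shared tokens.
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions a prediction of "Quentin Tarantino" against a gold answer of "Tarantino" scores 0 on EM but 2/3 on F1, which is why F1 is normally at least as high as EM when both are averaged over a dataset.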
STACK

Python · LangGraph · Groq API (Llama-3.3-70B) · ChromaDB · HuggingFace sentence-transformers · HotpotQA · Rich