Building a Retrieval-Augmented Generation system for Indian tax law is harder than general-purpose RAG. Legal text is dense, cross-referenced, and changes with every Finance Bill. Hallucinating a section number is not just wrong — it can lead a CA to cite a non-existent provision in a legal filing.
Here is how we built TaxMarg's RAG pipeline, and how we got from 45.6% to 74.5% accuracy on our 60-query benchmark.
The 15-Step Pipeline
Every query to TaxMarg goes through 15 steps, numbered 0 through 14:
Step 0: Query Routing
Before doing any retrieval, we classify the query into three buckets:
- off_topic — Reject immediately (saves API cost)
- tax_general — Answer directly without retrieval (simple factual questions)
- tax_retrieval — Full RAG pipeline
This classification uses Gemini Flash Lite, which is approximately 10x cheaper than the generation model.
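A minimal sketch of this routing step. The `classify` callable is a hypothetical wrapper around the Flash Lite call (not shown in the post); the prompt text and fallback behavior are assumptions, not TaxMarg's actual implementation:

```python
from enum import Enum


class Route(str, Enum):
    OFF_TOPIC = "off_topic"
    TAX_GENERAL = "tax_general"
    TAX_RETRIEVAL = "tax_retrieval"


ROUTER_PROMPT = (
    "Classify the user query into exactly one of: "
    "off_topic, tax_general, tax_retrieval.\n\nQuery: {query}\nLabel:"
)


def route_query(query: str, classify) -> Route:
    """Ask the cheap classifier for a label; dispatch on the result.

    `classify` stands in for the Gemini Flash Lite call.
    """
    label = classify(ROUTER_PROMPT.format(query=query)).strip().lower()
    try:
        return Route(label)
    except ValueError:
        # Unrecognised model output: safest to run the full RAG pipeline.
        return Route.TAX_RETRIEVAL
```

Defaulting unknown labels to `tax_retrieval` trades a little cost for safety: a misrouted query still gets grounded answers rather than an unretrieved guess.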
Steps 1-6: Preprocessing
- Normalize the query text
- Step-back prompting to generate an abstract version of the question
- Filter extraction to identify specific acts, sections, and years mentioned
- Multi-query expansion to generate 3-4 reformulated queries
- HyDE (Hypothetical Document Embedding) to generate a hypothetical answer for better semantic matching
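The HyDE step above can be sketched as follows. `generate` and `embed` are hypothetical wrappers around the LLM and embedding calls, and the prompt wording is an assumption; the idea is simply to embed a plausible answer instead of the raw question, since answers sit closer to the documents in embedding space:

```python
def hyde_embedding(query: str, generate, embed):
    """HyDE: embed a hypothetical answer rather than the raw query.

    `generate` stands in for an LLM call, `embed` for the embedding model.
    """
    hypothetical = generate(
        "Write a short, plausible answer to this Indian tax-law question, "
        "in the style of a statute commentary:\n" + query
    )
    # The hypothetical answer, not the query, is what gets matched
    # against document vectors.
    return embed(hypothetical)
```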
Steps 7-10: Retrieval
- Embed using OpenAI text-embedding-3-large (3072 dimensions)
- Hybrid search combining dense vectors with BM25 sparse vectors, fused with Reciprocal Rank Fusion
- Cross-encoder reranking of the top candidates
- Knowledge graph expansion to pull in related provisions
Steps 11-14: Generation
- Token-budgeted context assembly — we never exceed the context window
- Generation with Claude Sonnet 4.6 (8,192 max tokens)
- Citation validation — every doc_id in the response is checked against retrieved documents
- Cache the result for identical future queries
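The citation-validation step can be sketched like this. The `[doc_id: ...]` citation format is an assumption (the post does not specify how citations are marked up), but the check itself is exactly as described: any cited ID must appear among the retrieved documents.

```python
import re


def validate_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return doc_ids cited in the answer that were NOT retrieved.

    Assumes citations appear as [doc_id: XYZ] markers; the real
    markup format is an assumption.
    """
    cited = re.findall(r"\[doc_id:\s*([^\]]+)\]", answer)
    # Any ID the model cites without having seen it is a hallucination.
    return [d.strip() for d in cited if d.strip() not in retrieved_ids]
```

A non-empty return value can trigger a regeneration or strip the offending citation before the answer is shown.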
Hybrid Search: Why BM25 Still Matters
Dense vector search is great for semantic similarity, but tax law has specific terminology that requires exact matching. A query about "Section 148A" needs to match documents containing exactly "148A", not just semantically similar reassessment provisions.
We run dense and sparse searches in parallel using asyncio.gather, then fuse results with Reciprocal Rank Fusion. This consistently outperforms either search method alone.
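A sketch of that fusion step. `dense_search` and `sparse_search` are hypothetical async wrappers around the two retrievers; the RRF scoring itself is the standard formulation, with the conventional k = 60 constant as an assumption:

```python
import asyncio


def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked doc-id lists: score(d) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


async def hybrid_search(query, dense_search, sparse_search):
    """Run dense and BM25 searches concurrently, then fuse with RRF."""
    dense, sparse = await asyncio.gather(dense_search(query), sparse_search(query))
    return reciprocal_rank_fusion([dense, sparse])
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem of calibrating cosine similarities against BM25 scores.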
The Knowledge Graph
Our knowledge graph has 28,327 nodes and 81,293 edges with 17 relation types. Node IDs are aligned with Qdrant document IDs, so we can expand retrieval results with related provisions. For example, querying about TDS on salary automatically pulls in the relevant exemption sections, computation rules, and CBDT circulars.
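The expansion step can be sketched with a toy adjacency-list graph. This class and its cap parameter are illustrative stand-ins for the real graph store, whose schema the post does not describe:

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy adjacency-list graph keyed by doc_id (aligned with the vector store)."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_edge(self, src: str, dst: str, relation: str):
        self.edges[src].add((dst, relation))

    def expand(self, doc_ids, max_extra: int = 10):
        """Append 1-hop neighbours of the retrieved docs, capped at max_extra."""
        extra = []
        for doc_id in doc_ids:
            for neighbour, _relation in sorted(self.edges[doc_id]):
                if neighbour not in doc_ids and neighbour not in extra:
                    extra.append(neighbour)
                if len(extra) >= max_extra:
                    return list(doc_ids) + extra
        return list(doc_ids) + extra
```

Because node IDs match document IDs, the expanded list can be fed straight back into context assembly without a second lookup table.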
Context Overflow Fallback
When the assembled context exceeds Claude Sonnet's limit, we automatically fall back to Gemini 2.5 Pro (1M context window). This handles the edge case of queries that match many provisions — rare but critical for complex multi-act questions.
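A minimal sketch of the fallback, assuming injected callables for token counting and for the two model calls; the 200,000-token limit is an assumption about the primary model's window, not a figure from the post:

```python
def generate_with_fallback(messages, count_tokens, call_primary, call_fallback,
                           primary_limit: int = 200_000):
    """Use the long-context model only when the context overflows the primary.

    `count_tokens`, `call_primary`, and `call_fallback` stand in for the
    tokenizer and the two model clients.
    """
    if count_tokens(messages) > primary_limit:
        # Rare path: many matched provisions pushed us past the window.
        return call_fallback(messages)
    return call_primary(messages)
```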
Accuracy Measurement
We maintain a 60-query test set with expert-verified answers, scored by an LLM-as-judge pipeline:
| Metric | Baseline | Current |
|---|---|---|
| Accuracy | 45.6% | 74.5% |
| Citation precision | 62% | 91% |
| Hallucination rate | 18% | 2.3% |
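The judging loop can be sketched as below. The prompt wording and binary CORRECT/INCORRECT protocol are assumptions about the judge pipeline; `generate` is a hypothetical wrapper around the judge model:

```python
JUDGE_PROMPT = (
    "You are grading an answer against an expert-verified reference.\n"
    "Question: {question}\nReference: {reference}\nCandidate: {candidate}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def judge(question: str, reference: str, candidate: str, generate) -> bool:
    """One binary LLM-as-judge call; `generate` wraps the judge model."""
    verdict = generate(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate,
    )).strip().upper()
    return verdict == "CORRECT"


def accuracy(test_set, answer_fn, generate) -> float:
    """Score (question, reference) pairs; returns the fraction judged correct."""
    correct = sum(judge(q, ref, answer_fn(q), generate) for q, ref in test_set)
    return correct / len(test_set)
```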
The biggest accuracy gains came from:
1. Metadata enrichment (effective_year and topic tags) — +8%
2. Cross-encoder reranking — +6%
3. Multi-query expansion — +5%
4. HyDE — +4%
What is Next
We are working on case law integration (ITAT, High Court, and Supreme Court decisions) as a fourth retrieval layer, which we expect to push accuracy above 85%.