Building a Retrieval-Augmented Generation system for Indian tax law is harder than general-purpose RAG. Legal text is dense, cross-referenced, and changes with every Finance Bill. Hallucinating a section number is not just wrong — it can lead a CA to cite a non-existent provision in a legal filing.
Here is how we built TaxMarg's RAG pipeline, and how we got from 45.6% to 74.5% accuracy on our 60-query benchmark.
The 15-Step Pipeline
Every query to TaxMarg goes through 15 steps, numbered 0 through 14:
Step 0: Query Routing
Before doing any retrieval, we classify the query into three buckets:
- off_topic — Reject immediately (saves API cost)
- tax_general — Answer directly without retrieval (simple factual questions)
- tax_retrieval — Full RAG pipeline
This classification uses Gemini Flash Lite, which is approximately 10x cheaper than the generation model.
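A minimal sketch of this routing step. The `classify` callable is a hypothetical wrapper around the Flash Lite call (not shown in the post); the prompt text and fallback behavior are assumptions, not TaxMarg's actual implementation:

```python
from enum import Enum


class Route(str, Enum):
    OFF_TOPIC = "off_topic"
    TAX_GENERAL = "tax_general"
    TAX_RETRIEVAL = "tax_retrieval"


ROUTER_PROMPT = (
    "Classify the user query into exactly one of: "
    "off_topic, tax_general, tax_retrieval.\n\nQuery: {query}\nLabel:"
)


def route_query(query: str, classify) -> Route:
    """Ask the cheap classifier for a label; dispatch on the result.

    `classify` stands in for the Gemini Flash Lite call.
    """
    label = classify(ROUTER_PROMPT.format(query=query)).strip().lower()
    try:
        return Route(label)
    except ValueError:
        # Unrecognised model output: safest to run the full RAG pipeline.
        return Route.TAX_RETRIEVAL
```

Defaulting unknown labels to `tax_retrieval` trades a little cost for safety: a misrouted query still gets grounded answers rather than an unretrieved guess.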
Steps 1-6: Preprocessing
- Normalize the query text
- Step-back prompting to generate an abstract version of the question
- Filter extraction to identify specific acts, sections, and years mentioned
- Multi-query expansion to generate 3-4 reformulated queries
- HyDE (Hypothetical Document Embedding) to generate a hypothetical answer for better semantic matching
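The HyDE step above can be sketched as follows. `generate` and `embed` are hypothetical wrappers around the LLM and embedding calls, and the prompt wording is an assumption; the idea is simply to embed a plausible answer instead of the raw question, since answers sit closer to the documents in embedding space:

```python
def hyde_embedding(query: str, generate, embed):
    """HyDE: embed a hypothetical answer rather than the raw query.

    `generate` stands in for an LLM call, `embed` for the embedding model.
    """
    hypothetical = generate(
        "Write a short, plausible answer to this Indian tax-law question, "
        "in the style of a statute commentary:\n" + query
    )
    # The hypothetical answer, not the query, is what gets matched
    # against document vectors.
    return embed(hypothetical)
```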
Steps 7-10: Retrieval
- Embed using OpenAI text-embedding-3-large (3072 dimensions)
- Hybrid search combining dense vectors with BM25 sparse vectors, fused with Reciprocal Rank Fusion
- Cross-encoder reranking of the top candidates
- Knowledge graph expansion to pull in related provisions
Steps 11-14: Generation
- Token-budgeted context assembly — we never exceed the context window
- Generation with Claude Sonnet 4.6 (8,192 max tokens)
- Citation validation — every doc_id in the response is checked against retrieved documents
- Cache the result for identical future queries
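The citation-validation step can be sketched like this. The `[doc_id: ...]` citation format is an assumption (the post does not specify how citations are marked up), but the check itself is exactly as described: any cited ID must appear among the retrieved documents.

```python
import re


def validate_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return doc_ids cited in the answer that were NOT retrieved.

    Assumes citations appear as [doc_id: XYZ] markers; the real
    markup format is an assumption.
    """
    cited = re.findall(r"\[doc_id:\s*([^\]]+)\]", answer)
    # Any ID the model cites without having seen it is a hallucination.
    return [d.strip() for d in cited if d.strip() not in retrieved_ids]
```

A non-empty return value can trigger a regeneration or strip the offending citation before the answer is shown.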
Hybrid Search: Why BM25 Still Matters
Dense vector search is great for semantic similarity, but tax law has specific terminology that requires exact matching. A query about "Section 148A" needs to match documents containing exactly "148A", not just semantically similar reassessment provisions.
We run dense and sparse searches in parallel using asyncio.gather, then fuse results with Reciprocal Rank Fusion. This consistently outperforms either search method alone.
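A sketch of that fusion step. `dense_search` and `sparse_search` are hypothetical async wrappers around the two retrievers; the RRF scoring itself is the standard formulation, with the conventional k = 60 constant as an assumption:

```python
import asyncio


def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked doc-id lists: score(d) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


async def hybrid_search(query, dense_search, sparse_search):
    """Run dense and BM25 searches concurrently, then fuse with RRF."""
    dense, sparse = await asyncio.gather(dense_search(query), sparse_search(query))
    return reciprocal_rank_fusion([dense, sparse])
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem of calibrating cosine similarities against BM25 scores.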
The Knowledge Graph
Our knowledge graph has 28,327 nodes and 81,293 edges with 17 relation types. Node IDs are aligned with Qdrant document IDs, so we can expand retrieval results with related provisions. For example, querying about TDS on salary automatically pulls in the relevant exemption sections, computation rules, and CBDT circulars.
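The expansion step can be sketched with a toy adjacency-list graph. This class and its cap parameter are illustrative stand-ins for the real graph store, whose schema the post does not describe:

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy adjacency-list graph keyed by doc_id (aligned with the vector store)."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_edge(self, src: str, dst: str, relation: str):
        self.edges[src].add((dst, relation))

    def expand(self, doc_ids, max_extra: int = 10):
        """Append 1-hop neighbours of the retrieved docs, capped at max_extra."""
        extra = []
        for doc_id in doc_ids:
            for neighbour, _relation in sorted(self.edges[doc_id]):
                if neighbour not in doc_ids and neighbour not in extra:
                    extra.append(neighbour)
                if len(extra) >= max_extra:
                    return list(doc_ids) + extra
        return list(doc_ids) + extra
```

Because node IDs match document IDs, the expanded list can be fed straight back into context assembly without a second lookup table.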
Context Overflow Fallback
When the assembled context exceeds Claude Sonnet's limit, we automatically fall back to Gemini 2.5 Pro (1M context window). This handles the edge case of queries that match many provisions — rare but critical for complex multi-act questions.
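A minimal sketch of the fallback, assuming injected callables for token counting and for the two model calls; the 200,000-token limit is an assumption about the primary model's window, not a figure from the post:

```python
def generate_with_fallback(messages, count_tokens, call_primary, call_fallback,
                           primary_limit: int = 200_000):
    """Use the long-context model only when the context overflows the primary.

    `count_tokens`, `call_primary`, and `call_fallback` stand in for the
    tokenizer and the two model clients.
    """
    if count_tokens(messages) > primary_limit:
        # Rare path: many matched provisions pushed us past the window.
        return call_fallback(messages)
    return call_primary(messages)
```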
Accuracy Measurement
We maintain a 60-query test set with expert-verified answers, scored by an LLM-as-judge pipeline:
| Metric | Baseline | Current |
|---|---|---|
| Accuracy | 45.6% | 74.5% |
| Citation precision | 62% | 91% |
| Hallucination rate | 18% | 2.3% |
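The judging loop can be sketched as below. The prompt wording and binary CORRECT/INCORRECT protocol are assumptions about the judge pipeline; `generate` is a hypothetical wrapper around the judge model:

```python
JUDGE_PROMPT = (
    "You are grading an answer against an expert-verified reference.\n"
    "Question: {question}\nReference: {reference}\nCandidate: {candidate}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def judge(question: str, reference: str, candidate: str, generate) -> bool:
    """One binary LLM-as-judge call; `generate` wraps the judge model."""
    verdict = generate(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate,
    )).strip().upper()
    return verdict == "CORRECT"


def accuracy(test_set, answer_fn, generate) -> float:
    """Score (question, reference) pairs; returns the fraction judged correct."""
    correct = sum(judge(q, ref, answer_fn(q), generate) for q, ref in test_set)
    return correct / len(test_set)
```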
The biggest accuracy gains came from:
1. Metadata enrichment (effective_year and topic tags) — +8%
2. Cross-encoder reranking — +6%
3. Multi-query expansion — +5%
4. HyDE — +4%
What is Next
We are working on case law integration (ITAT, High Court, and Supreme Court decisions) as a fourth retrieval layer, which we expect to push accuracy above 85%.