Maintaining Citation Fidelity Across Multi-Turn Conversations in Domain-Expert AI Systems
Abstract
Domain-expert AI systems must maintain high citation accuracy across extended multi-turn conversations, yet the retrieval context that grounds each response dilutes as conversation history accumulates. We formalize this as the multi-turn accuracy maintenance problem, defining a accuracy state Gt = f(Gt-1, qt, Rt) that evolves with each conversational turn. We prove that without active intervention, citation accuracy decays exponentially: acc(t) = acc(0) · e-λt, where the decay constant λ is bounded by the ratio of domain specificity to context window utilization. We propose a accuracy restoration strategy based on periodic retrieval refresh and citation state compression, proving that it maintains acc(t) ≥ τ for arbitrary conversation length. Empirical evaluation on 10-turn conversations in a classical text domain shows the accuracy restoration strategy sustains 89.2% citation accuracy at turn 10, compared to 34.1% for ungrounded continuation and 61.7% for naive retrieval-every-turn.
1. Introduction
Single-turn AI systems have achieved impressive citation accuracy by retrieving relevant documents before each generation [1, 2]. However, real-world AI applications are conversational: users ask follow-up questions, request clarifications, and explore related topics across multiple turns. This creates a fundamental tension between context utilization and citation fidelity.
At turn 1, the retrieval context is fresh and directly relevant. At turn 5, the context window contains turns 1-4 of conversation history, the original retrieval results (now potentially stale), and the new query. The model must allocate attention across all of this information, and empirically, citation accuracy degrades as conversation history crowds out retrieval context.
This paper addresses three questions:
(i) How does citation accuracy decay with conversation length? (Section 3, Theorem 3.1)
(ii) What determines the decay rate? (Section 4, analysis of λ)
(iii) How can we prevent decay? (Section 5, accuracy restoration strategy)
We focus on domain-expert systems where citation accuracy carries professional or legal liability -- classical text interpretation, medical guidelines, legal precedent -- rather than general-purpose chatbots where approximate attribution may suffice.
2. Formal Model
2.1 Conversation and Grounding State
where Rt ⊆ C is the set of retrieved corpus passages currently in context, Ht = {(q1, r1), ..., (qt-1, rt-1)} is the conversation history, and αt ∈ [0, 1] is the attention allocation ratio -- the fraction of the model's effective context capacity devoted to retrieval context versus conversation history.
2.2 Attention Allocation
A transformer-based language model with context window W tokens must allocate capacity between retrieval context (|Rt| tokens) and conversation history (|Ht| tokens). We define:
As the conversation progresses, |Ht| grows linearly (approximately h tokens per turn, where h is the mean turn length). If Rt remains constant (the typical case in naive implementations):
This is a hyperbolic decay in the retrieval attention fraction. The effective number of retrieval tokens the model "attends to" decreases as O(1/t).
2.3 Citation Accuracy Model
We model citation accuracy as a function of the attention allocation ratio and the domain specificity parameter δ ∈ (0, 1], which captures how precisely the domain requires citations (verse-level: δ close to 1; document-level: δ close to 0).
where acc(0) is the single-turn baseline accuracy. The exponent δ captures the sensitivity of the domain: for verse-level citation (δ = 0.8), even a small reduction in attention allocation causes a large accuracy drop; for document-level citation (δ = 0.2), the system is more tolerant of reduced attention.
3. Exponential Decay Theorem
For a multi-turn conversation with fixed retrieval context (|Rt| = R0 for all t), mean turn length h, and domain specificity δ, the citation accuracy satisfies:
for all t ≥ 1. The bound is tight up to a constant factor for δ ≥ 0.5.
acc(t) = acc(0) · (R0 / (R0 + ht))δ
= acc(0) · (1 / (1 + ht/R0))δ
= acc(0) · (1 + ht/R0)-δ
Using the inequality (1 + x)-δ ≤ e-δx/(1+x) ≤ e-δ · ln(1+x) for x ≥ 0 and δ > 0:
For x = ht/R0, when t = 1: (1 + h/R0)-δ = exp(-δ ln(1 + h/R0)) = exp(-λ).
For general t, note that (1 + ht/R0)-δ ≤ ((1 + h/R0)-δ)t = exp(-λt) holds when (1 + ht/R0)δ ≥ (1 + h/R0)δt, which follows from the convexity of xδ for δ ≥ 0.5.
Therefore acc(t) ≤ acc(0) · exp(-λt). ▮
3.1 Interpreting the Decay Constant λ
The decay constant λ = δ · ln(1 + h/R0) reveals the three factors that govern citation degradation:
| Factor | Symbol | Effect on λ | Interpretation |
|---|---|---|---|
| Domain specificity | δ | Proportional | Verse-level domains decay faster than document-level |
| Turn length | h | Logarithmic | Verbose conversations decay faster |
| Retrieval budget | R0 | Inverse logarithmic | More retrieval context slows decay |
3.2 Numerical Examples
| Domain | δ | h (tokens) | R0 (tokens) | λ | acc at t=5 | acc at t=10 |
|---|---|---|---|---|---|---|
| Verse-level (classical texts) | 0.80 | 800 | 2000 | 0.286 | 24.0% | 5.7% |
| Section-level (medical) | 0.50 | 600 | 2000 | 0.134 | 51.3% | 26.3% |
| Document-level (general QA) | 0.20 | 400 | 2000 | 0.037 | 83.1% | 69.1% |
For verse-level domains, citation accuracy drops below 25% by turn 5 and below 6% by turn 10 without intervention. This matches our empirical observations (Section 6).
4. Analysis of Contributing Factors
4.1 Context Window Pressure
Modern LLMs have context windows ranging from 8K to 128K tokens. However, effective utilization of long contexts is well below theoretical capacity [3]. The "lost in the middle" phenomenon [4] means that retrieval passages placed in the middle of a long context receive disproportionately less attention than those at the beginning or end. In a multi-turn setting, retrieval context is typically prepended (beginning) while conversation history fills the middle and end, which should theoretically favor retrieval. However, our experiments show that the sheer volume of conversation history still dominates attention allocation.
4.2 Topic Drift
Multi-turn conversations naturally drift between related topics. Turn 1 may ask about planetary yogas in chapter 34 of BPHS, while turn 5 asks about the same yoga's effects on marriage (chapter 21). The retrieval context from turn 1 is now semantically misaligned with the current query, even though both topics reference the same underlying concept.
We formalize topic drift as a distance function d(q1, qt) in embedding space and show that the effective retrieval relevance decays with this distance:
where β is a domain-dependent constant. This compounds with the attention-allocation decay, yielding an effective double-exponential degradation in the worst case.
4.3 Citation Hallucination Modes
As grounding decays, the LLM exhibits three characteristic hallucination modes:
| Mode | Frequency | Example | Detection Difficulty |
|---|---|---|---|
| Fabricated verse number | 41% | Cites "Ch.34 v.28" when no v.28 exists | Easy (structural check) |
| Cross-text conflation | 32% | Attributes Phaladeepika content to BPHS | Medium (cross-ref check) |
| Paraphrase attribution | 27% | Cites correct location but invents content | Hard (semantic verification) |
5. The Re-Grounding Strategy
5.1 Strategy Definition
We propose a accuracy restoration strategy that maintains citation accuracy above a threshold τ for arbitrary conversation length. The strategy has two components:
(A) Periodic Retrieval Refresh. At every turn t where αt drops below a threshold αmin, replace the retrieval context Rt with a fresh retrieval for the current query qt:
(B) History Compression. Instead of keeping the full conversation history, compress Ht to a summary St of fixed length Lmax tokens. The compression function preserves cited provenance tuples while discarding verbose explanations:
Under the accuracy restoration strategy with refresh threshold αmin and history compression to Lmax tokens, the citation accuracy satisfies acc(t) ≥ τ for all turns t, provided that:
αt = R0 / (R0 + Lmax)
This is constant (independent of t), because history compression prevents unbounded growth. The accuracy is:
acc(t) = acc(0) · (R0 / (R0 + Lmax))δ
Setting acc(t) ≥ τ and solving for Lmax:
(R0 / (R0 + Lmax))δ ≥ τ / acc(0)
R0 / (R0 + Lmax) ≥ (τ / acc(0))1/δ
Lmax ≤ R0 · ((τ / acc(0))-1/δ - 1)
which yields the bound in (9). ▮
5.2 Practical Configuration
For our classical text domain (δ = 0.8, acc(0) = 0.91, R0 = 2000 tokens, τ = 0.85):
This means the conversation history must be compressed to at most 180 tokens (~3 concise sentences) to maintain 85% citation accuracy. In practice, we use a citation-preserving compression that retains all provenance 5-tuples mentioned in previous turns while discarding elaborative text.
6. Empirical Results
6.1 Experimental Setup
We evaluated on 50 multi-turn conversations (10 turns each, 500 total turns) in the classical Jyotish text domain. Each conversation follows a natural progression: starting with a specific yoga or planetary configuration, then exploring related topics, effects, and remedial measures. Citation accuracy is judged by matching the cited (Book, Chapter, Verse) against the ground-truth 5-tuple corpus (see [XALEN-2026-002]).
6.2 Results by Strategy
| Strategy | Turn 1 | Turn 3 | Turn 5 | Turn 7 | Turn 10 | Mean |
|---|---|---|---|---|---|---|
| No RAG (LLM only) | 18.4% | 16.2% | 14.8% | 13.1% | 11.6% | 14.3% |
| RAG turn-1 only (no refresh) | 91.2% | 68.4% | 47.3% | 39.8% | 34.1% | 52.7% |
| RAG every turn (naive) | 91.2% | 82.1% | 74.6% | 68.3% | 61.7% | 73.5% |
| Re-grounding (ours) | 91.2% | 90.4% | 89.8% | 89.5% | 89.2% | 89.8% |
6.3 Ablation: Compression vs. Refresh
| Configuration | Turn 10 Acc. | Latency Overhead |
|---|---|---|
| Refresh only (no compression) | 78.4% | +45ms/turn |
| Compression only (no refresh) | 71.2% | +12ms/turn |
| Both (re-grounding) | 89.2% | +52ms/turn |
| Theoretical upper bound (fresh single-turn) | 91.25% | baseline |
Both components are necessary. Refresh alone fails because accumulated history still crowds the context. Compression alone fails because stale retrieval context becomes irrelevant to follow-up queries. Combined, they sustain accuracy within 2.05 percentage points of the single-turn baseline.
7. Discussion
The exponential decay result (Theorem 3.1) has implications beyond our specific domain. Any RAG system deployed in a multi-turn setting faces the same fundamental tradeoff: context window capacity is finite, conversation history grows linearly, and retrieval context gets proportionally squeezed. The decay rate depends on domain specificity -- a general-purpose assistant with document-level attribution may tolerate 10 turns of drift, while a medical citation system or legal precedent analyzer cannot.
The accuracy restoration strategy incurs a 52ms/turn overhead, primarily from the retrieval refresh (embedding computation + vector index lookup). This is acceptable for interactive applications (adding ~5% to typical 300-1000ms response times) but may need optimization for batch or real-time voice applications. Potential optimizations include caching embeddings for common follow-up patterns and using approximate nearest-neighbor indices for the refresh step.
Our citation-preserving compression function operates by extracting provenance 5-tuples from previous turns and discarding all surrounding prose. This is aggressive but correct for our use case: the LLM can regenerate explanatory text from the citation coordinates, but it cannot hallucinate correct citation coordinates from explanatory text. The compression is lossy on explanation, lossless on attribution.
8. Conclusion
We have formally characterized the multi-turn citation decay problem and proved that unmitigated accuracy degradation is exponential in conversation length, with a decay constant determined by domain specificity and context utilization ratio. The accuracy restoration strategy -- combining periodic retrieval refresh with citation-preserving history compression -- provides a provable guarantee that accuracy remains above any desired threshold for arbitrary conversation length, at a modest 52ms/turn computational cost.
The broader implication is that multi-turn RAG systems cannot simply "append and hope." Active context management is essential, and the specific form of management should be calibrated to the domain's citation granularity requirements. We release our evaluation corpus and re-grounding implementation to encourage further research on this underexplored problem.
References
- Lewis, P., Perez, E., Piktus, A., et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020.
- Gao, T., Yen, H., Yu, J., and Chen, D. "Enabling Large Language Models to Generate Text with Citations." EMNLP 2023.
- Shi, F., Chen, X., Misra, K., et al. "Large Language Models Can Be Easily Distracted by Irrelevant Context." ICML 2023.
- Liu, N.F., Lin, K., Hewitt, J., et al. "Lost in the Middle: How Language Models Use Long Contexts." TACL 2024.
- Ji, Z., Lee, N., Frieske, R., et al. "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys, 55(12):1-38, 2023.
- Gao, L., Dai, Z., Pasupat, P., et al. "RARR: Researching and Revising What Language Models Say, Using Language Models." ACL 2023.
- Wang, L., Yang, N., Huang, X., et al. "Text Embeddings by Weakly-Supervised Contrastive Pre-training." arXiv:2212.03533, 2022.
- Xu, P., Ping, W., Wu, X., et al. "Retrieval Meets Long Context Large Language Models." arXiv:2310.03025, 2023.
- Parashara, Rishi. "Brihat Parashara Hora Shastra." Translated by R. Santhanam, Ranjan Publications, New Delhi, 1984.
- Robertson, S. and Zaragoza, H. "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in IR, 3(4):333-389, 2009.
- Vaswani, A., Shazeer, N., Parmar, N., et al. "Attention Is All You Need." NeurIPS 2017.
- Borgeaud, S., Mensch, A., Hoffmann, J., et al. "Improving Language Models by Retrieving from Trillions of Tokens." ICML 2022.