Allo-Self-RAG: Fuzzy aggregation of internal and external critique signals for improved Self-RAG evaluation

Document Type : Research Paper

Authors

1 Department of Computer Engineering, Shahid Bahonar University of Kerman, Kerman, Iran

2 Visiting Researcher at Institute for Applied Computer Science (InfAI), Nature-Inspired Machine Intelligence, Dresden, Germany

10.22111/ijfs.2026.9936

Abstract

Retrieval-Augmented Generation (RAG) systems play a crucial role in grounding Large Language Models (LLMs) with
external knowledge. However, existing architectures such as Self-RAG employ static linear aggregation of internal
critique tokens, which requires manual tuning and inadequately models the non-linear interactions underlying retrieval
and generation. Moreover, exclusive reliance on self-critique can introduce confirmation bias and hallucinations. To
overcome these limitations, this work introduces Allo-Self-RAG, a neuro-symbolic framework that integrates fuzzy logic
with RAG. From a dual-process-inspired perspective, Allo-Self-RAG is framed as a structured enhancement over the
heuristic Self-RAG baseline: the standard Self-RAG pipeline is closer to System-1-like post-retrieval behavior, whereas
Allo-Self-RAG introduces a more System-2-like evaluation layer through structured signal aggregation. A Fuzzy Inference
System (FIS) adaptively fuses internal self-critique tokens with external allo-critique signals from an independent
reranker, replacing static linear aggregation with rule-guided score integration and a rule-based revision mechanism
for conflicting evidence. When this evaluation stage detects ambiguity or conflicting evidence among top-ranked
candidates, the framework automatically invokes a synthesis step to reconcile contradictions and produce a more reliable
consensus answer. Simulated Annealing (SA) is employed to optimize fuzzy membership functions automatically using
a small calibration dataset, eliminating manual parameter tuning. Extensive experimental evaluation demonstrates
that Allo-Self-RAG consistently outperforms the Self-RAG baseline, achieving 56.61% accuracy on PopQA (+1.45%
improvement), 66.98% on ARC-Challenge (+1.03% improvement), and 67.51% on PubHealth (+1.01% improvement),
showing reliable gains across retrieval-augmented question answering benchmarks.
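
The fusion and tuning steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the rule base, the membership shapes, or the objective, so the zero-order Sugeno form, the linear ramp membership `high`, the rule outputs in `RULES`, the squared-error `loss`, the cooling schedule, and the `CALIBRATION` triples are all assumptions made for illustration. The sketch fuses one self-critique score and one allo-critique (reranker) score, both assumed to lie in [0, 1], reports a conflict degree that could trigger a synthesis step, and tunes the two membership midpoints with simulated annealing.

```python
import math
import random

def high(x, mid, width=0.25):
    """Degree to which score x in [0, 1] is 'high': a linear ramp centred on mid (assumed shape)."""
    return min(1.0, max(0.0, (x - (mid - width)) / (2.0 * width)))

# Assumed rule outputs: (self-critique is high?, allo-critique is high?) -> crisp consequent.
RULES = {(True, True): 1.0, (True, False): 0.4, (False, True): 0.6, (False, False): 0.0}

def fuse(self_score, allo_score, params):
    """Zero-order Sugeno fusion; returns (fused score, conflict degree)."""
    mid_self, mid_allo = params
    s_hi = high(self_score, mid_self)
    a_hi = high(allo_score, mid_allo)
    num = den = conflict = 0.0
    for (s_is_hi, a_is_hi), out in RULES.items():
        # Firing strength: fuzzy AND (min) of the two antecedents.
        w = min(s_hi if s_is_hi else 1.0 - s_hi,
                a_hi if a_is_hi else 1.0 - a_hi)
        num += w * out
        den += w
        if s_is_hi != a_is_hi:
            # The two critics disagree; track the strongest disagreeing rule.
            conflict = max(conflict, w)
    return num / den, conflict

def loss(params, data):
    """Mean squared error between fused scores and 0/1 correctness labels."""
    return sum((fuse(s, a, params)[0] - y) ** 2 for s, a, y in data) / len(data)

def anneal(data, start=(0.5, 0.5), steps=500, t0=0.05, seed=0):
    """Simulated annealing over the two membership midpoints."""
    rng = random.Random(seed)
    cur, cur_loss = start, loss(start, data)
    best, best_loss = cur, cur_loss
    for k in range(steps):
        temp = t0 * (1.0 - k / steps) + 1e-9  # linear cooling (assumed schedule)
        cand = tuple(min(0.9, max(0.1, p + rng.gauss(0.0, 0.05))) for p in cur)
        cand_loss = loss(cand, data)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if cand_loss <= cur_loss or rng.random() < math.exp((cur_loss - cand_loss) / temp):
            cur, cur_loss = cand, cand_loss
            if cur_loss < best_loss:
                best, best_loss = cur, cur_loss
    return best, best_loss

# Hypothetical calibration triples: (self score, allo score, answer was correct).
CALIBRATION = [(0.9, 0.8, 1), (0.8, 0.2, 0), (0.3, 0.9, 1),
               (0.2, 0.1, 0), (0.7, 0.6, 1), (0.6, 0.3, 0)]

if __name__ == "__main__":
    tuned, tuned_loss = anneal(CALIBRATION)
    print("tuned midpoints:", tuned, "loss:", tuned_loss)
    print("fused score and conflict for (0.9, 0.1):", fuse(0.9, 0.1, tuned))
```

In this sketch a high conflict degree marks a candidate on which the generator's self-critique and the external reranker disagree, which is the kind of case the abstract routes to a synthesis step. The threshold for doing so is left open here because the abstract does not state one.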

Keywords

