FAULT TOLERANCE AND RELIABILITY IN KUBERNETES-ORCHESTRATED MULTI-AGENT SYSTEMS: UNIVERSITY SCHEDULING CASE STUDY
DOI: https://doi.org/10.54309/IJICT.2025.21.1.013
Keywords: machine learning, MAS, MAS Optimization, fault detection, MAS maintenance, cloud-native deployment
Abstract
Multi-Agent Systems (MAS) play a key role in distributed computing and in environments that require autonomous coordination, such as robotics, cloud computing, and traffic management. However, ensuring fault tolerance and reliability in MAS remains a significant challenge, particularly in large-scale deployments. This study investigates how Kubernetes-based orchestration affects the fault tolerance of MAS, evaluating mechanisms such as automated scaling, redundancy strategies, and self-healing. Experimental results show that Kubernetes enhances MAS resilience by reducing failure frequency and shortening Mean Time to Recovery (MTTR). The study also identifies trade-offs between performance and resource consumption: redundancy and auto-scaling improve system robustness but introduce computational overhead. Affinity-based scheduling and selective redundancy were found to balance efficiency and reliability effectively. These findings inform real-world MAS deployments, particularly the tuning of Kubernetes configurations to achieve fault tolerance without excessive resource use. Future research should address AI-driven scaling, hybrid cloud-edge execution, and improved fault detection to further enhance MAS reliability and efficiency in dynamic environments.
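The abstract cites Mean Time to Recovery and automated scaling without stating how either is computed. The sketch below is a hedged illustration, not the authors' measurement code. It assumes MTTR is the mean of (recovery time minus failure time) over recorded incidents, and it follows the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with its default 10% tolerance band. The function names `mean_time_to_recovery` and `hpa_desired_replicas` are hypothetical, and the study's actual experimental configuration may differ.

```python
import math
from statistics import mean


def mean_time_to_recovery(incidents):
    """Assumed MTTR definition: average downtime over (failure_time, recovery_time) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    return mean(recovered - failed for failed, recovered in incidents)


def hpa_desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Replica count per the documented HPA rule ceil(current * metric / target).

    If the metric ratio is within the tolerance band around 1.0 (0.1 by default),
    the HPA skips scaling. Min/max replica bounds and stabilization windows
    are not modeled in this sketch.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Two hypothetical agent-pod failures, recovered after 30 s and 90 s.
print(mean_time_to_recovery([(0, 30), (100, 190)]))
# Three agent replicas at 90% CPU against a 60% target scale out.
print(hpa_desired_replicas(3, 90, 60))
```

The tolerance band is one reason auto-scaling carries the overhead trade-off described above: widening it suppresses small, costly scaling actions at the price of slower reaction to load changes.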
License
Copyright (c) 2025 INTERNATIONAL JOURNAL OF INFORMATION AND COMMUNICATION TECHNOLOGIES

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
https://creativecommons.org/licenses/by-nc-nd/3.0/deed.en