SEMANTIC COMPLETENESS IN KAZAKH-LANGUAGE EXTRACTIVE QA THROUGH ONTOLOGY AND RETRIEVAL MECHANISMS
DOI: https://doi.org/10.54309/IJICT.2026.25.1.005

Abstract
This study explores extractive question answering for the low-resource Kazakh language by combining ontology-based semantic enrichment with retrieval augmentation. We design a complete data preparation pipeline, including PDF text extraction, cleaning, chunking, Sentence-BERT vectorization, and FAISS indexing. Using GPT-4, we generate and manually validate a final dataset of 350 QA pairs. Four models are evaluated: mBERT-QA, XLM-RoBERTa-QA, XLM-RoBERTa-QA with ontology injection, and a hybrid Retrieval + XLM-RoBERTa-QA + Ontology system. Evaluation across EM, F1, BERTScore-F1, ROUGE-L, and SemSim metrics shows that the hybrid models substantially outperform the baselines. The best configuration achieves an F1 score of 52.6%, surpassing mBERT-QA by 21 percentage points. The results demonstrate that ontology-infused context and dense retrieval significantly improve answer span extraction, reducing noise and enhancing semantic alignment. The proposed approach provides an effective foundation for developing high-accuracy educational QA systems in the Kazakh language.
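The dense-retrieval step of the pipeline described above (Sentence-BERT vectorization followed by FAISS indexing) can be sketched in a few lines. This is a minimal stand-in, not the authors' code: the fixed toy vectors below take the place of real Sentence-BERT embeddings, and the exhaustive cosine-similarity search mirrors what a FAISS `IndexFlatIP` over L2-normalized vectors computes.

```python
import numpy as np

# Toy stand-in for Sentence-BERT chunk embeddings. In the paper's pipeline
# these would be produced by a multilingual Sentence-BERT model over the
# cleaned, chunked Kazakh text; fixed vectors keep the retrieval step clear.
chunk_embeddings = np.array([
    [0.9, 0.1, 0.0],   # chunk 0
    [0.1, 0.8, 0.1],   # chunk 1
    [0.0, 0.2, 0.9],   # chunk 2
], dtype=np.float32)

def top_k(query_vec, embeddings, k=2):
    """Return indices and scores of the k chunks most similar to the query.

    Cosine similarity via L2-normalized dot products -- equivalent to an
    exhaustive FAISS inner-product search on normalized vectors.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = normalize(embeddings) @ normalize(query_vec)
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# A query embedding pointing roughly in the direction of chunk 0.
query = np.array([1.0, 0.0, 0.1], dtype=np.float32)
ids, scores = top_k(query, chunk_embeddings, k=2)
# ids[0] == 0: chunk 0 is retrieved first and would be passed,
# together with ontology-injected context, to the XLM-RoBERTa-QA reader.
```

The retrieved chunks form the context window for the extractive reader; in the hybrid configuration, ontology terms are injected into this context before span extraction.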
Copyright (c) 2026 INTERNATIONAL JOURNAL OF INFORMATION AND COMMUNICATION TECHNOLOGIES

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
https://creativecommons.org/licenses/by-nc-nd/3.0/deed.en