INTERNATIONAL JOURNAL OF INFORMATION AND COMMUNICATION TECHNOLOGIES

DETECTING DUPLICATES IN KAZAKH TEXTS: A COMPARISON OF TF-IDF, WORD AND SENTENCE EMBEDDINGS

Authors

  • A.O. Tleubayeva Astana IT University
  • S.V. Biloshchytska Astana IT University
  • O. Kuchanskyi Astana IT University
  • A.A. Mukhatayev Astana IT University
  • A.B. Nugumanova Astana IT University

DOI:

https://doi.org/10.54309/IJICT.2025.24.4.020

Keywords:

duplicate detection, Kazakh language, TF-IDF, word embeddings, sentence embeddings, semantic similarity, BM25, dense retrieval, hybrid reranking, low-resource NLP

Abstract

This paper compares TF-IDF, word embeddings, and multilingual sentence embeddings for automatic duplicate detection in Kazakh texts. Experiments use the KazakhTextDuplicates dataset, which labels exact, paraphrase, contextual, and partial duplicates. All models were evaluated in a unified setup with standardized preprocessing, L2-normalized vectors, and validation-based threshold tuning. The Word2Vec model with TF-IDF weighting achieved the highest performance (F1 = 0.996; ROC-AUC = 0.9999; PR-AUC = 0.9999). TF-IDF over 1–3-grams remained competitive for exact and partial overlaps (PR-AUC = 0.932; ROC-AUC = 0.775), while FastText provided the best recall (R ≈ 0.99) at moderate precision. Among multilingual models, BGE-m3 and Snowflake Arctic achieved the best PR-AUC (≈0.614). In retrieval, a pipeline of BM25 followed by dense re-ranking produced a small but consistent improvement over dense-only search (Recall@10: +0.04–0.12 pp; nDCG@10: +0.10–0.13 pp), which supports combining lexical and semantic features for duplicate detection in morphologically rich, low-resource languages such as Kazakh.
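
The evaluation setup named in the abstract (TF-IDF over 1–3-grams, L2-normalized vectors, cosine scoring, validation-based threshold tuning) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation. Whitespace tokenization, word-level n-grams, the smoothed IDF formula, F1 as the tuning criterion, the English placeholder sentences, and all function names are assumptions made only for illustration.

```python
import math
from collections import Counter

def ngrams(text, n_max=3):
    """Lowercased word n-grams for n = 1..n_max (whitespace tokens: an assumption)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def fit_idf(docs):
    """Smoothed inverse document frequency for every n-gram seen in docs."""
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc)))
    n_docs = len(docs)
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf_vector(text, idf):
    """Sparse TF-IDF vector scaled to unit L2 norm; unseen n-grams are dropped."""
    tf = Counter(t for t in ngrams(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(u, v):
    """Dot product of sparse vectors; equals cosine similarity at unit norm."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def f1_at(scores, labels, thr):
    """F1 when pairs scoring at or above thr are predicted as duplicates."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def tune_threshold(scores, labels):
    """Choose the observed score that maximizes F1 on the validation pairs."""
    return max(sorted(set(scores)), key=lambda t: f1_at(scores, labels, t))

# Hypothetical validation pairs (placeholders, not KazakhTextDuplicates data).
val_pairs = [
    ("the city opened a new library", "the city opened a new library today", 1),
    ("prices rose sharply in march", "prices rose sharply in march", 1),
    ("the team won the final match", "rain is expected on friday", 0),
    ("students passed the exam", "the museum closes at six", 0),
]
idf = fit_idf([t for a, b, _ in val_pairs for t in (a, b)])
scores = [cosine(tfidf_vector(a, idf), tfidf_vector(b, idf)) for a, b, _ in val_pairs]
labels = [y for _, _, y in val_pairs]
threshold = tune_threshold(scores, labels)
```

In a setup like the one described, the threshold chosen on validation pairs would then be applied unchanged to held-out test pairs, so that reported F1 is not tuned on the test data.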

Author Biographies

A.O. Tleubayeva, Astana IT University

Arailym O. Tleubayeva ― PhD student, senior lecturer at the School of Artificial Intelligence and Data Science, Astana IT University LLP.

S.V. Biloshchytska, Astana IT University

Svitlana V. Biloshchytska ― Doctor of Technical Sciences, Professor at the School of Artificial Intelligence and Data Science, Astana IT University LLP.

O. Kuchanskyi, Astana IT University

Oleksandr Kuchanskyi ― Doctor of Technical Sciences, Professor at the School of Artificial Intelligence and Data Science, Astana IT University.

A.A. Mukhatayev, Astana IT University

Aidos A. Mukhatayev, Candidate of Pedagogical Sciences, Professor at the School of General Education Disciplines, Astana IT University LLP.

A.B. Nugumanova, Astana IT University

Dina S. Kantayeva ― PhD student, manager at the Strategy and Corporate Management Unit, Astana IT University LLP.

Published

2025-11-15

How to Cite

Tleubayeva, A., Biloshchytska, S., Kuchanskyi, O., Mukhatayev, A., & Kantayeva, D. (2025). DETECTING DUPLICATES IN KAZAKH TEXTS: A COMPARISON OF TF-IDF, WORD AND SENTENCE EMBEDDINGS. INTERNATIONAL JOURNAL OF INFORMATION AND COMMUNICATION TECHNOLOGIES, 6(4), 333–350. https://doi.org/10.54309/IJICT.2025.24.4.020