INTELLIGENT CLUSTERING METHODS FOR PROCESSING AND ANALYZING SHORT TEXTS
DOI: https://doi.org/10.54309/IJICT.2025.22.2.002
Abstract
This study explores short text clustering using Bidirectional Encoder Representations from Transformers (BERT), Term Frequency-Inverse Document Frequency (TF-IDF), and a novel hybrid technique that combines Latent Dirichlet Allocation, BERT, and an Autoencoder (LDA+BERT+AE). The research begins with the theoretical foundations of each method, highlighting their advantages and limitations. BERT is evaluated for its ability to capture word dependencies within text, whereas TF-IDF is recognized for its efficiency in determining term significance. The experimental section systematically compares the effectiveness of these methods in clustering short texts, with particular emphasis on the hybrid LDA+BERT+AE approach. An analysis of the LDA-BERT model's training and validation loss across 200 epochs shows that the loss starts above 1.2, falls rapidly to approximately 0.8 within the first 25 epochs, and eventually stabilizes around 0.4. The close agreement between the training and validation curves indicates that the model learns effectively and generalizes well, with minimal overfitting. The findings show that the LDA+BERT+AE hybrid method significantly improves text clustering performance compared to standalone approaches. Based on these results, recommendations are provided for selecting and combining clustering techniques for various short text types and natural language processing (NLP) tasks. The study also explores practical applications of these methods in industrial and academic environments, where precise text processing and categorization are essential. The research concludes by underscoring the importance of an integrated approach to short text analysis, which facilitates deeper semantic comprehension and more effective information extraction.
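As an illustration of how TF-IDF determines term significance in short texts, the following is a minimal sketch in plain Python. It is not the implementation used in the study: the function names `tfidf_vectors` and `cosine`, the whitespace tokenization, the unsmoothed `log(N / df)` weighting, and the sample texts are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * idf.

    Assumptions for this sketch: lowercase whitespace tokenization,
    tf as relative frequency, idf as log(N / df) without smoothing.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical short texts, chosen only to illustrate the idea.
docs = [
    "cheap flights to paris",
    "paris flights discount",
    "python list sort error",
]
vecs = tfidf_vectors(docs)
```

In this sketch, the two travel-related texts share weighted terms and therefore have a higher cosine similarity than either has with the programming text, which is the kind of signal a clustering algorithm can group on. A term that occurs in every document receives a weight of zero, reflecting that it carries no discriminating information.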
License
Copyright (c) 2025 INTERNATIONAL JOURNAL OF INFORMATION AND COMMUNICATION TECHNOLOGIES

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
https://creativecommons.org/licenses/by-nc-nd/3.0/deed.en