TF-IDF Algorithm
Term frequency-inverse document frequency weighting. Words that occur in few documents get a higher weight, so distinctive terms have more influence on clustering.
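A minimal sketch of the weighting, for illustration only. The source does not say which TF-IDF variant is used, so the length-normalized term frequency, the natural-log IDF, and the name `tf_idf` are all assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Assumed variant: tf = count / document length, idf = ln(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["buy", "red", "shoes"],
    ["buy", "blue", "shoes"],
    ["buy", "laptop", "online"],
]
weights = tf_idf(docs)
```

In this toy corpus "buy" appears in every document, so its weight is zero, while "red" appears in only one and outweighs "shoes".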
N-grams & Jaccard
Bigrams, trigrams, and the Jaccard coefficient for measuring phrase similarity.
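As a rough illustration, the Jaccard coefficient of two phrases can be computed over their word n-gram sets. The helper names `ngrams` and `jaccard` are hypothetical, and using word-level rather than character-level n-grams is an assumption.

```python
def ngrams(tokens, n):
    """Set of word n-grams in a token list; empty if the list is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)

p1 = "buy cheap running shoes".split()
p2 = "buy running shoes online".split()
score = jaccard(ngrams(p1, 2), ngrams(p2, 2))
```

The two phrases share one bigram out of five distinct ones, so `score` is 0.2.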
Levenshtein Distance
Edit distance for detecting typos and spelling variations.
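For reference, the classic dynamic-programming form of the edit distance looks roughly like this. It is a generic sketch, not necessarily how the tool implements it, and the function name is hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]

distance = levenshtein("adress", "address")
```

A small distance relative to word length, such as the single missing letter here, is a typical signal that two tokens are spelling variants.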
Hierarchical Clustering
Agglomerative algorithm with average linkage: each step merges the two clusters with the smallest mean pairwise distance.
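A naive sketch of agglomerative clustering with average linkage, under stated assumptions: the stopping rule (a distance `threshold`) and the function name `average_linkage` are hypothetical, since the source does not describe when merging stops.

```python
def average_linkage(points, distance, threshold):
    """Merge clusters bottom-up until the closest pair is farther than threshold."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point

    def linkage(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        total = sum(distance(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break  # assumed stopping rule: no pair is close enough to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

points = [1.0, 1.2, 5.0, 5.1, 9.0]
clusters = average_linkage(points, lambda x, y: abs(x - y), threshold=1.0)
```

The result holds index lists into `points`; here the two tight pairs merge and the outlier stays alone. This version recomputes every pairwise linkage on each pass, which is fine for a demonstration but slow for large inputs.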
Semantic Analysis
Word co-occurrence matrix for detecting semantic relationships between terms.
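One simple way to build such a matrix is to count how often two terms appear in the same phrase. The sketch below assumes phrase-level co-occurrence (no sliding window) and stores the matrix sparsely as a `Counter` keyed by sorted term pairs; both choices, and the name `cooccurrence`, are illustrative.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count, for each unordered term pair, the documents containing both terms."""
    counts = Counter()
    for doc in docs:
        # Sort the unique terms so each pair has one canonical key.
        for a, b in combinations(sorted(set(doc)), 2):
            counts[(a, b)] += 1
    return counts

docs = [
    ["running", "shoes", "nike"],
    ["running", "shoes", "adidas"],
    ["laptop", "bag"],
]
matrix = cooccurrence(docs)
```

Pairs with high counts, such as ("running", "shoes") here, are candidates for a semantic relationship; pairs that never co-occur simply read as zero.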
Cosine Similarity
Cosine similarity between TF-IDF vectors for comparing a phrase against cluster centroids.
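A small sketch of the comparison, assuming vectors are stored as sparse `{term: weight}` dicts and a centroid is the component-wise mean of its members. The helper names and the unit weights in the example are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    return {term: w / len(vectors) for term, w in total.items()}

cluster = [{"red": 1.0, "shoes": 1.0}, {"blue": 1.0, "shoes": 1.0}]
center = centroid(cluster)
sim_shoes = cosine({"shoes": 1.0}, center)
sim_laptop = cosine({"laptop": 1.0}, center)
```

A new phrase would be assigned to the cluster whose centroid gives the highest similarity; here the "shoes" vector scores well above the unrelated "laptop" vector, which scores zero.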
Multilingual Stemming
50+ stemming rules for Ukrainian and English with morphology support.
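Rule-based stemming of this kind usually amounts to stripping known suffixes. The sketch below is only a toy: the suffix lists are hypothetical examples, not the tool's actual rules, and real Ukrainian morphology needs far more than four endings.

```python
# Hypothetical suffix rules for illustration; the real rule set is not shown in the source.
SUFFIXES = {
    "en": ["ing", "es", "s"],
    "uk": ["ами", "ах", "и", "а"],
}

def stem(word, lang, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(SUFFIXES[lang], key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

stems = [stem("книга", "uk"), stem("книги", "uk"), stem("книгами", "uk")]
```

All three inflected forms of the Ukrainian word for "book" reduce to the same stem, which is what lets them land in the same cluster. The `min_stem` guard is an assumed safeguard against stripping short words down to nothing.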
Stop Words
Automatic removal of 100+ function words in both languages for cleaner analysis.
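Stop-word removal is typically a set lookup per token. In this sketch the word lists are a tiny hypothetical sample, not the tool's 100+ entries, and the function name is illustrative.

```python
# Small illustrative samples; the full lists are not given in the source.
STOP_WORDS = {
    "en": {"a", "an", "the", "for", "in", "of", "and", "to"},
    "uk": {"і", "в", "на", "для", "та", "з"},
}

def remove_stop_words(tokens, lang):
    """Drop tokens whose lowercase form is in the stop list for lang."""
    stops = STOP_WORDS[lang]
    return [t for t in tokens if t.lower() not in stops]

english = remove_stop_words("shoes for the gym".split(), "en")
ukrainian = remove_stop_words("кросівки для бігу".split(), "uk")
```

Only the content-bearing words remain, so later steps such as TF-IDF weighting are not diluted by function words.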