Investigating Evasive Techniques in SMS Spam Filtering: A Comparative Analysis of Machine Learning Models
DOI:
https://doi.org/10.46647/rdems0205030
Keywords:
SMS Spam Detection, Malicious URL Detection, Machine Learning, Natural Language Processing, BERT, Multilingual Text Classification, Random Forest, XGBoost, Feature Extraction, Text Preprocessing, Cybersecurity, Phishing Detection, Ensemble Learning, Data Mining, Classification Algorithms
Abstract
This project, titled “Investigating Evasive Techniques in SMS Spam Filtering: A Comparative Analysis of Machine Learning Models”, addresses the classification of SMS messages as spam or ham (legitimate), with a focus on the evasive techniques attackers use to bypass traditional filters. Spam messages increasingly contain obfuscated text, altered words, and multilingual content, which makes them difficult to detect with conventional rule-based systems. To overcome these challenges, the proposed system combines Natural Language Processing (NLP) techniques with machine learning models: BERT-based sentence embeddings capture semantic meaning, and Random Forest classifiers perform the text classification. SMS data is cleaned and normalized, then converted into vector representations using multilingual BERT embeddings, which provide contextual understanding of messages in languages such as English and Hindi. These embeddings are used to train classification models that distinguish spam from non-spam messages. The models are compared using accuracy, precision, recall, and F1-score. The system also covers malicious URL detection, using handcrafted lexical features and an XGBoost classifier, which makes it a more comprehensive security solution. Experimental results demonstrate that deep semantic embeddings combined with ensemble learning methods significantly improve detection accuracy, especially against disguised and evolving spam patterns.
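As an illustration of what handcrafted lexical features for URL detection might look like, the following is a minimal sketch. The abstract does not list the features actually used, so every feature name here, the `SUSPICIOUS_WORDS` list, and the `lexical_features` helper are assumptions made for illustration only.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list; the paper does not specify its feature set.
SUSPICIOUS_WORDS = ("login", "verify", "free", "win", "account", "update")


def lexical_features(url: str) -> dict:
    """Return simple character-level features of a URL (illustrative only)."""
    # urlparse only fills in the host when a scheme is present,
    # so prepend one for scheme-less links such as "bit.ly/abc".
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "has_at_symbol": int("@" in url),
        # A raw IPv4 address in place of a domain name is a common phishing signal.
        "has_ip_host": int(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None),
        "uses_https": int(parsed.scheme == "https"),
        "num_suspicious_words": sum(word in lowered for word in SUSPICIOUS_WORDS),
    }


if __name__ == "__main__":
    print(lexical_features("http://192.168.0.1/login-verify"))
```

A dictionary like this could be flattened into a numeric vector and passed to a gradient-boosted classifier such as `xgboost.XGBClassifier`; the sketch stops at feature extraction so that it runs with the standard library alone.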
Overall, this work provides a robust, scalable, and intelligent approach for SMS spam filtering, contributing to enhanced mobile communication security and demonstrating the effectiveness of combining transformer-based embeddings with traditional machine learning algorithms for real-world cybersecurity applications.
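The evaluation metrics named in the abstract (accuracy, precision, recall, and F1-score) follow standard definitions based on true/false positives and negatives. The sketch below computes them for a binary spam/ham task; the `binary_metrics` helper and the toy labels are hypothetical and are not results from the paper.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    tn = sum(t != positive and p != positive for t, p in pairs)  # ham passed through

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for illustration only.
    y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
    y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]
    print(binary_metrics(y_true, y_pred))
```

In spam filtering, precision reflects how often a flagged message is truly spam, while recall reflects how much spam is caught. Reporting both alongside F1 is useful because evasive messages tend to lower recall without affecting accuracy much when ham dominates the dataset.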