Abstract

The data used in many machine learning projects drift over time. This dataset drift causes model performance to deteriorate, which can have devastating consequences for critical real-world applications that rely on model accuracy and robustness, such as autonomous vehicles, fraud detection, and medical diagnosis. A model's resilience to dataset drift determines the degree to which, and the ways in which, it suffers in production.

The impact of data drift depends on the selected algorithms as well as the application domain. Domains involving graphics and pattern recognition, such as traffic-toll systems and image classification, are less likely to suffer from data drift because the training data is static and is expected to have the same distribution seen in production. In contrast, daily business processes that rely on natural language understanding for text classification are driven by user behavior, which changes over time.

We study the stability of four popular high-performing machine learning algorithms on the text-classification task of email spam classification. In this work, we compare the resilience to dataset drift of the Random Forest (RF) algorithm, a Convolutional Neural Network (CNN), a Long Short-Term Memory (LSTM) network, and the pre-trained transformer architecture Bidirectional Encoder Representations from Transformers (BERT) developed by Google. To simulate various amounts of data drift, we create four hybrid datasets from the SpamAssassin benchmark dataset, combined with various percentages of emails from the Enron benchmark dataset.
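The abstract does not specify how the hybrid sets were assembled; one minimal way to sketch the idea, with a hypothetical `make_hybrid_dataset` helper and toy corpora standing in for SpamAssassin and Enron, is to replace a given fraction of the base corpus with emails drawn from the drift-source corpus:

```python
import random

def make_hybrid_dataset(base_emails, drift_emails, drift_fraction, seed=0):
    """Build a hybrid corpus in which `drift_fraction` of the examples
    come from the drift-source corpus and the rest from the base corpus.

    base_emails / drift_emails: lists of (text, label) pairs
    drift_fraction: fraction of the hybrid set drawn from drift_emails
    """
    rng = random.Random(seed)
    n_total = len(base_emails)
    n_drift = int(round(n_total * drift_fraction))
    kept = rng.sample(base_emails, n_total - n_drift)
    injected = rng.sample(drift_emails, min(n_drift, len(drift_emails)))
    hybrid = kept + injected
    rng.shuffle(hybrid)
    return hybrid

# Toy stand-ins for the two benchmark corpora (1 = spam, 0 = ham)
base = [(f"base email {i}", i % 2) for i in range(100)]
drift = [(f"drift email {i}", i % 2) for i in range(100)]

# A hybrid set representing 60% data change
hybrid_60 = make_hybrid_dataset(base, drift, 0.60)
```

Evaluating a model trained on the base corpus against hybrid sets built at increasing `drift_fraction` values (e.g. 0.60, 0.90) would then expose how quickly its accuracy degrades under drift.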

Our study found that for the specific natural language understanding task of text classification in spam filters, RF, CNN, and LSTM were less likely to suffer from data drift. These three models suffered only a 1% accuracy loss, compared to an 11% loss for BERT, on the dataset representing 60% data change. On the dataset representing 90% data change, the BERT model suffered a 16% loss, compared to a 7% loss for RF and LSTM and a 6% loss for CNN. Overall, we found the RF, CNN, and LSTM algorithms were less likely to be impacted by dataset drift when used for spam filter applications.

Details

Title
Sensitivity of Machine Learning Algorithms to Dataset Drift for the Natural Language Processing Application of Spam Filters
Author
Fields, Tonya
Publication Year
2021
Publisher
ProQuest Dissertations Publishing
ISBN
9798379584610
Source Type
Dissertation or Thesis
Language of Publication
English
ProQuest Document ID
2820269606
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.