Abstract

The data used in many machine learning projects drift over time. This dataset drift causes model performance to deteriorate, which can have devastating consequences for critical real-world applications that rely on model accuracy and robustness, such as autonomous vehicles, fraud detection, and medical diagnosis. A model's resilience to dataset drift determines the degree to which, and the ways in which, it suffers in production.

The impact of data drift depends on the selected algorithms as well as the application domain. Domains involving graphics and pattern recognition, such as traffic-toll systems and image classification, are less likely to suffer from data drift because the training data is static and is expected to have the same distribution seen in production. In contrast, daily business processes that rely on natural language understanding for text classification are driven by user behavior, which changes over time.

We study the stability of four popular high-performing machine learning algorithms on the text-classification task of email spam classification. In this work, we compare the resilience to dataset drift of the Random Forest (RF) algorithm, a Convolutional Neural Network (CNN), a Long Short-Term Memory (LSTM) network, and the pre-trained transformer architecture Bidirectional Encoder Representations from Transformers (BERT) developed by Google. To simulate various amounts of data drift, we create four hybrid datasets from the SpamAssassin benchmark dataset, combined with various percentages of emails from the Enron benchmark dataset.
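The abstract does not specify how the hybrid sets were assembled; one minimal way to sketch the idea, with a hypothetical `make_hybrid_dataset` helper and toy corpora standing in for SpamAssassin and Enron, is to replace a given fraction of the base corpus with emails drawn from the drift-source corpus:

```python
import random

def make_hybrid_dataset(base_emails, drift_emails, drift_fraction, seed=0):
    """Build a hybrid corpus in which `drift_fraction` of the examples
    come from the drift-source corpus and the rest from the base corpus.

    base_emails / drift_emails: lists of (text, label) pairs
    drift_fraction: fraction of the hybrid set drawn from drift_emails
    """
    rng = random.Random(seed)
    n_total = len(base_emails)
    n_drift = int(round(n_total * drift_fraction))
    kept = rng.sample(base_emails, n_total - n_drift)
    injected = rng.sample(drift_emails, min(n_drift, len(drift_emails)))
    hybrid = kept + injected
    rng.shuffle(hybrid)
    return hybrid

# Toy stand-ins for the two benchmark corpora (1 = spam, 0 = ham)
base = [(f"base email {i}", i % 2) for i in range(100)]
drift = [(f"drift email {i}", i % 2) for i in range(100)]

# A hybrid set representing 60% data change
hybrid_60 = make_hybrid_dataset(base, drift, 0.60)
```

Evaluating a model trained on the base corpus against hybrid sets built at increasing `drift_fraction` values (e.g. 0.60, 0.90) would then expose how quickly its accuracy degrades under drift.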

Our study found that for the specific natural language understanding task of text classification in spam filters, RF, CNN, and LSTM were less likely to suffer from data drift. These three models suffered only a 1% accuracy loss, compared to an 11% loss for BERT, on the dataset representing 60% data change. On the dataset representing 90% data change, the BERT model suffered a 16% loss, compared to a 7% loss for RF and LSTM and a 6% loss for CNN. Overall, we found the RF, CNN, and LSTM algorithms were less likely to be impacted by dataset drift when used for spam filter applications.

Details

Title
Sensitivity of Machine Learning Algorithms to Dataset Drift for the Natural Language Processing Application of Spam Filters
Author
Fields, Tonya
Publication Year
2021
Publisher
ProQuest Dissertations Publishing
ISBN
9798379584610
Source Type
Dissertation or Thesis
Language of Publication
English
ProQuest Document ID
2820269606
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.