Abstract
The data used in many machine learning projects drift over time. Dataset drift causes model performance to deteriorate, which can have devastating consequences for critical real-world applications that rely on model accuracy and robustness, such as autonomous vehicles, fraud detection, and medical diagnosis. A model's resilience to dataset drift determines the degree to which, and the ways in which, it suffers in production.
The impact of data drift depends on both the chosen algorithm and the application domain. Domains involving graphics and pattern recognition, such as traffic-toll systems and image classification, are less likely to suffer from data drift because the training data are static and are expected to follow the same distribution seen in production. On the other hand, daily business processes that rely on natural language understanding for text classification involve user behavior and are therefore subject to change over time.
We study the stability of four popular high-performing machine learning algorithms on the text-classification task of email spam classification. In this work, we compare the resilience to dataset drift of the Random Forest (RF) algorithm, a Convolutional Neural Network (CNN), a Long Short-Term Memory (LSTM) network, and the pre-trained transformer architecture Bidirectional Encoder Representations from Transformers (BERT) developed by Google. To simulate various amounts of data drift, we create four hybrid datasets from the SpamAssassin benchmark dataset combined with various percentages of emails from the Enron benchmark dataset.
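For illustration, the following is a minimal Python sketch of one way such hybrid evaluation sets could be constructed; the function name, loaders, and fixed seed are assumptions for this sketch, not the authors' released code.

```python
# A minimal sketch (not the authors' code) of building a drifted test set:
# start from the SpamAssassin corpus and replace a given fraction of emails
# with Enron emails. Assumes both corpora are already loaded as lists and
# that the Enron list is large enough to sample from.
import random

def make_drifted_set(spamassassin_emails, enron_emails, drift_fraction, seed=42):
    """Return a hybrid evaluation set in which `drift_fraction` of the
    emails come from Enron and the remainder from SpamAssassin."""
    rng = random.Random(seed)
    n_total = len(spamassassin_emails)
    n_enron = int(drift_fraction * n_total)   # e.g. 0.6 -> 60% data change
    n_sa = n_total - n_enron
    hybrid = (rng.sample(spamassassin_emails, n_sa)
              + rng.sample(enron_emails, n_enron))
    rng.shuffle(hybrid)
    return hybrid

# e.g., datasets representing 60% and 90% data change:
# drifted_60 = make_drifted_set(sa_emails, enron_emails, 0.60)
# drifted_90 = make_drifted_set(sa_emails, enron_emails, 0.90)
```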
Our study found that for this specific natural language understanding task of text classification for use in spam filters, the RF, CNN, and LSTM models were less susceptible to data drift. These three models suffered only a 1% loss, compared to an 11% loss for BERT, on a dataset representing 60% data change. On a dataset representing 90% data change, the BERT model suffered a 16% loss, compared to a 7% loss for the RF and LSTM and a 6% loss for the CNN. Overall, we found that the RF, CNN, and LSTM algorithms were less likely to be impacted by dataset drift when used for spam filter applications.




