The objective of this thesis is to create a framework for detecting fraudulent vehicle insurance claims using binary regression models (logistic, probit, and complementary log-log), machine learning binary classifiers (decision tree, random forest, and naïve Bayes), and optimization-based machine learning techniques (k-nearest neighbor, gradient boosting, support vector machine, and artificial neural network). The study uses a dataset of 16,100 observations with predictor variables such as gender, marital status, age, whether a police report was filed, whether there were witnesses to the accident, and the number of cars involved in the accident. The response variable indicates whether fraud was detected; approximately 30% of the observations are affirmative cases. The dataset is divided into a training set (80%) and a testing set (20%). Each model is fitted on the training set and then used to predict the probability of fraud for every row of the testing set. Model performance is evaluated using several criteria. The programming language R (version 4.3.2) is used throughout the thesis because it supports all of the aforementioned techniques.
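As an illustration of the workflow described above, the following is a minimal sketch in base R of the 80/20 split and of fitting the three binary regression models, followed by predicting fraud probabilities on the testing set. It is not the thesis code: the data are simulated, and the data frame `claims`, the column names (`age`, `police_report`, `witness`, `fraud`), the sample size, the coefficients, and the seed are all hypothetical choices made only so that the example runs on its own.

```r
set.seed(1)  # arbitrary seed, used only to make this sketch reproducible

# Simulated stand-in data (hypothetical; NOT the thesis dataset)
n <- 1000
claims <- data.frame(
  age           = round(runif(n, 18, 80)),
  police_report = rbinom(n, 1, 0.3),
  witness       = rbinom(n, 1, 0.2)
)

# Simulated binary response with a minority of affirmative (fraud) cases;
# the coefficients below are arbitrary illustration values
eta <- -0.4 - 0.01 * (claims$age - 50) -
  0.8 * claims$police_report - 0.5 * claims$witness
claims$fraud <- rbinom(n, 1, plogis(eta))

# 80% training / 20% testing split by random row indices
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- claims[train_idx, ]
test  <- claims[-train_idx, ]

# Fit one binomial GLM per link: logistic, probit, complementary log-log
links <- c("logit", "probit", "cloglog")
fits <- lapply(links, function(l) {
  glm(fraud ~ age + police_report + witness,
      data = train, family = binomial(link = l))
})
names(fits) <- links

# Predicted fraud probability for each testing row, one column per model
probs <- sapply(fits, predict, newdata = test, type = "response")
head(probs)
```

Each column of `probs` holds one model's predicted fraud probabilities for the testing rows, so the same matrix can be passed to whichever evaluation criteria are chosen. The other classifiers named above would follow the same fit-then-predict pattern using their respective R packages.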