Scientific researchers constitute the core strength of innovation within an organization, and their turnover can significantly affect the enterprise, including the risk of trade secret disclosure, setbacks in research and development, and stalled business progress. To address these issues, this paper proposes a novel prediction method named PAD-SA (Prediction of Academic Departure using ADASYN-Stacking Algorithm), which combines the ADASYN (Adaptive Synthetic) sampling algorithm with the Stacking ensemble algorithm. PAD-SA can predict the probability of scientific researchers' departure, thereby helping enterprises anticipate the turnover intentions of their research staff. The dataset for this study comprises feature information collected from 1100 scientific researchers. The paper addresses the dataset imbalance issue with the adaptive oversampling algorithm ADASYN, which effectively mitigates model prediction bias due to uneven sample distribution. In performance comparisons, PAD-SA outperformed the best model in the benchmark group, with an ROC value 3.7% higher than that of the optimal benchmark, and exceeded the average performance of the comparative models by 11.9% and 9.3%, respectively.
Article Highlights
Through visualization techniques, the relationship between dataset features and employee turnover rates is revealed, laying the foundation for data preprocessing and model construction.
The ADASYN sampling technique is employed to address the imbalance in the original dataset, effectively reducing the prediction bias of the model.
By integrating the Stacking algorithm, an efficient prediction model for the turnover of researchers is successfully constructed, yielding significant results.
Introduction
With the continuous development of the social economy and the increasing prosperity of the job market, enterprises inevitably encounter the challenge of employee turnover [1]. In particular, the loss of scientific research talent presents a significant impediment to the development of the company. This phenomenon not only diminishes the company’s technological innovation capabilities but also indirectly increases the economic and temporal costs associated with human resource management. To effectively address this challenge, this study collected personal information and characteristic data from 1100 scientific researchers, encompassing both those who have departed and those currently employed. The prediction accuracy and efficiency of traditional machine learning methods are generally moderate. Based on the ADASYN-Stacking algorithm, this study innovatively proposed the PAD-SA algorithm, which is capable of predicting the intention of researchers to leave, thereby mitigating the potential negative impact of researcher turnover on the company. Through comparative performance analysis of the dataset, the PAD-SA algorithm proposed in this paper demonstrates advantages in terms of prediction accuracy and efficiency compared to other methods, providing robust decision support for enterprises in the area of human resource management.
The main contributions of this paper can be summarized into three aspects:
Through visualization techniques, the relationship between dataset features and employee turnover rates is revealed, establishing the foundation for data preprocessing and model construction.
The ADASYN sampling technique is employed to address the imbalance in the original dataset, effectively mitigating the prediction bias of the model.
By integrating the Stacking algorithm, an efficient prediction model for the turnover of researchers is successfully constructed, yielding significant results.
Related work
In recent years, the quest to predict employee turnover has garnered significant attention from researchers, leading to the development of diverse algorithmic models. Meddeb [2] pioneered this trend by leveraging machine learning and system dynamics to forecast turnover rates. Lazzari [3] further explored the phenomenon, utilizing Logistic Regression and LightGBM (Light Gradient Boosting Machine) ranking to analyze a pan-European survey on employee turnover intentions. Shuai [4] employed the random-effects bagged tree approach to identify the most salient predictors of turnover intention. Nurtjahjono [5] investigated the role of job satisfaction, utilizing PLS-SEM (Partial Least Squares Structural Equation Modeling) to study its impact on turnover intentions. Muhamad [6] used an ensemble method based on stacking, random forests, and AdaBoost to predict employee turnover in company human resource applications; however, the model requires further optimization of the dataset and data model due to the limited employee information features. Nazli [7] used classification models to predict the probability of employee turnover, but the model may be susceptible to overfitting. Ravi [8] proposed a scheme based on RNN (Recurrent Neural Network) to predict the turnover and performance of bank employees, but the scheme is limited in its applicability as it is only targeted at bank employees. Devadarshan [9] proposed a predictive model for employee turnover based on AlexNet and VGG16 and analyzed the characteristic information of employees, but the accuracy and efficiency of the model are moderate.
The Stacking algorithm [10], an integral component of ensemble learning, combines multiple models to enhance machine learning performance, often surpassing the accuracy of a single model. This approach has been highly recommended in prestigious competitions like Netflix, KDD, and Kaggle due to its ability to yield superior predictions. In real-world applications of machine learning algorithms, unbalanced data presents a common challenge in areas like financial risk control [11], fraud detection [12], advertising recommendations [13], and medical diagnostics [14]. Typically, the ratio of positive to negative samples in unbalanced datasets can vary significantly, as exemplified by the Santander transaction prediction and IEEE-CIS fraud detection datasets in the Kaggle competition.
When trained on unbalanced data, models tend to favor labels from more numerous categories, thereby compromising their practical effectiveness. As illustrated in Fig. 1, this imbalance can significantly skew predictions. To address this issue, various techniques have been developed, including under-sampling and over-sampling methods. Among the oversampling techniques, Chawla [15] proposed SMOTE (Synthetic Minority Over-sampling Technique), but it does not consider the category information of neighboring samples, often leading to sample overlap and resulting in poor classification performance. Hui [16] introduced an enhanced version of SMOTE called Borderline SMOTE, which addresses some of the challenges faced by traditional SMOTE when handling imbalanced datasets. This improved algorithm still encounters difficulties, including potential inaccuracies in defining boundary samples, susceptibility to overfitting, and significant computational demands. In response to these limitations, He [17] developed the ADASYN algorithm, which can improve the classifier's performance on the minority class while maintaining high classification accuracy and low overfitting risk.
Fig. 1 [Images not available. See PDF.]
Long tail (unbalanced) distribution of sample data
Methodology
This section initially presents the sampling algorithm ADASYN, subsequently examines the Stacking algorithm, and finally proposes the PAD-SA algorithm, which predicts the probability of scientific researchers’ intention to leave based on ADASYN and Stacking algorithm.
ADASYN adaptive sampling algorithm
ADASYN, or Adaptive Synthetic Sampling, presents an innovative method that assigns different weights to various minority samples, leading to the creation of customized amounts of synthetic samples. The process is detailed as follows:
Step 1: Determine the number of synthetic samples to be generated using the following formula:
G = (m_l − m_s) × β   (1)
Here, G represents the number of synthetic samples to be generated, m_l denotes the number of samples in the majority class, and m_s signifies the number of samples in the minority class. The parameter β is a factor ranging from 0 to 1; when β is set to 1, the sample ratio after ADASYN synthesis approximates a 1:1 balance between the majority and minority classes.
Step 2: Calculate the proportion of the majority class within the K nearest neighbors for each minority class sample as follows:
r_i = Δ_i / K   (2)
Here, Δ_i represents the number of majority class samples among the K nearest neighbors of the i-th minority class sample, i = 1, 2, 3, …, m_s, where m_s is the total number of minority class samples.
Step 3: Normalize the ri values using the following formula to obtain standardized weights:
r̂_i = r_i / Σ_{j=1}^{m_s} r_j   (3)
Step 4: Determine the number of new synthetic samples to create for each minority class sample based on its standardized weight using the following formula:
g_i = r̂_i × G   (4)
Step 5: Generate the synthetic samples for each minority class sample according to the number determined in Step 4, using the SMOTE algorithm. The formulaic representation of the synthetic sample generation process is as follows:
s_i = x_i + (x_{zi} − x_i) × λ   (5)
Here, s_i is the synthetic sample generated for the i-th minority class sample, x_i is the i-th minority class sample, x_{zi} is a randomly selected sample from the K nearest minority-class neighbors of x_i, and λ is a random number in the range [0, 1].
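The five steps above can be sketched directly in Python. The following is a minimal illustration using NumPy and scikit-learn's NearestNeighbors, not the paper's implementation: it assumes numeric features, performs no tie or edge handling (for instance, it would fail if no minority sample had a majority-class neighbor), and in practice the ADASYN class from the imbalanced-learn library would be used instead.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_sample(X_min, X_maj, beta=1.0, K=5, rng=None):
    """Minimal ADASYN sketch following Steps 1-5 (no tie/edge handling)."""
    rng = np.random.default_rng(rng)
    G = int((len(X_maj) - len(X_min)) * beta)           # Step 1: total synthetics
    X_all = np.vstack([X_min, X_maj])
    labels = np.array([1] * len(X_min) + [0] * len(X_maj))
    # Step 2: majority share among each minority sample's K nearest neighbours
    _, idx = NearestNeighbors(n_neighbors=K + 1).fit(X_all).kneighbors(X_min)
    r = np.array([(labels[row[1:]] == 0).sum() / K for row in idx])
    r_hat = r / r.sum()                                 # Step 3: normalise weights
    g = np.round(r_hat * G).astype(int)                 # Step 4: per-sample counts
    # Step 5: SMOTE-style interpolation towards random minority neighbours
    _, min_idx = (NearestNeighbors(n_neighbors=min(K + 1, len(X_min)))
                  .fit(X_min).kneighbors(X_min))
    synth = []
    for i, g_i in enumerate(g):
        for _ in range(g_i):
            z = X_min[rng.choice(min_idx[i][1:])]       # x_zi: random minority neighbour
            lam = rng.random()                          # lambda in [0, 1)
            synth.append(X_min[i] + lam * (z - X_min[i]))
    return np.array(synth)
```

With beta=1, the function generates enough synthetic minority samples to roughly equalize the two classes, concentrating them on minority samples whose neighborhoods contain more majority points.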
Stacking algorithm
Stacking [18] is an ensemble learning technique that improves prediction accuracy and robustness by integrating multiple different types of models. Unlike Bagging [19] and Boosting [20], Stacking aims to merge the distinct predictive strengths of various models for complementary effects. This method is particularly effective for the research staff turnover database in this experiment, which features numerous variables but limited samples. By integrating different models, Stacking can mitigate overfitting risks and enhance the generalizability of predictions.
In this paper, we have selected SVM (Support Vector Machine), Random Forest, and LightGBM as the base classification models for Stacking. SVM is a classifier based on the maximum margin principle, excelling in handling high-dimensional data, especially when the number of features far exceeds the number of samples. It separates data points of different categories by finding the optimal hyperplane. Random Forest, a tree-based ensemble learning technique, builds multiple decision trees by randomly selecting features and samples, then aggregates results through voting. It efficiently handles numerous features, resists outliers, and evaluates feature importance. LightGBM is an efficient gradient boosting decision tree algorithm that uses a histogram-based fast training method, particularly suitable for handling large-scale data and categorical features, and offers high-precision predictions across various problems. SVM is suitable for linearly separable data, Random Forest can capture nonlinear relationships and complex interactions, while LightGBM is renowned for its fast training and powerful handling of categorical features. These models demonstrate consistent performance across various datasets and typically yield high accuracy rates. By combining these complementary approaches, we seek to improve overall predictive performance.
In this study, for the dataset division in the Stacking method, we adopted fivefold cross-validation. This approach represents a widely accepted and methodologically sound choice that achieves an optimal balance between computational efficiency, model generalization capacity, and evaluation reliability. This method is particularly suitable for the medium to small-scale dataset of researcher turnover involved in this study, yielding relatively robust model evaluation outcomes.
PAD-SA algorithm
This study presents a PAD-SA algorithm based on the ADASYN algorithm and Stacking algorithm, utilizing a hierarchical framework that integrates multiple models to enhance the performance of predicting the departure of scientific researchers. The core of this framework comprises two distinct layers. In the first layer, a series of base learners process the original training data. The second layer, subsequently, utilizes the outputs of these first-layer learners as input, incorporating this information into the training set for further optimization. This process leads to the construction of a comprehensive stacking model, which leverages the strengths of multiple learners to improve prediction accuracy. Figures 2 and 3 provide a visual representation of the specific operational flow of this framework.
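The two-layer flow described above can be made explicit with out-of-fold predictions: the first layer's predicted probabilities become the second layer's training features. The sketch below is illustrative (synthetic data, placeholder base models) and omits the ADASYN resampling step, which in PAD-SA would be applied to the training split before fitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, weights=[0.84],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# First layer: out-of-fold predicted probabilities from each base learner,
# so the meta-learner never sees predictions made on a learner's own training folds.
bases = [SVC(probability=True, random_state=0),
         RandomForestClassifier(random_state=0)]
Z_tr = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in bases
])

# Second layer: meta-learner trained on the first layer's outputs
meta = LogisticRegression().fit(Z_tr, y_tr)

# At prediction time, base learners are refit on the full training split
Z_te = np.column_stack([m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
                        for m in bases])
acc = meta.score(Z_te, y_te)
```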
Fig. 2 [Images not available. See PDF.]
PAD-SA framework operation process
Fig. 3 [Images not available. See PDF.]
The Stacking integration algorithm construction process
Experiments
This section commences with an introduction to the experimental dataset and the application of visualization techniques to analyze its characteristics. Subsequently, it proceeds with the preprocessing of the dataset. Following this, the performance evaluation metrics utilized in the experiments are proposed.
Dataset analysis and processing
The dataset used in this study is an internally-constructed dataset, derived from the human resource information of 1100 scientific researchers. It includes 178 samples of researchers who have left the company and 922 samples of researchers who have not. The dataset encompasses 30 data feature columns and 1 label column. The label column represents the turnover status, with 0 indicating retention and 1 indicating departure. The data feature columns include 21 continuous variables and 7 discrete variables. Table 1 presents the discrete variables in the data feature columns along with their corresponding meanings.
Table 1. Description of discrete variables
Characteristic variable | Instructions |
|---|---|
Employee gender | 0 = male, 1 = female |
Marital status of employees | 1 = married, 2 = divorced |
Whether to work overtime | 0 = overtime, 1 = no overtime |
Business travel frequency | 0 = no travel, 1 = occasional travel, 2 = frequent travel |
Employee's department | 0 = Sales Department, 1 = R&D Department, 2 = Human Resources Department |
Area of expertise of employees | 0 = Life sciences, 1 = Medical, 2 = Marketing, 3 = Technical degree, 4 = Human Resources, 5 = Other |
Employee job role | 0 = Sales executive, 1 = Researcher, 2 = Technician, 3 = Manufacturing Director, 4 = Medical representative, 5 = Department Manager, 6 = Sales representative, 7 = Research Director, 8 = Human Resources |
Due to the significant scale variations among features in the original data, the coefficient ratios within the cost function can be disproportionately large, requiring an increased number of iterations to converge to an optimal solution. To address this issue, we employ data standardization to normalize the feature values into a specific range, thereby reducing the disparity between coefficients in the cost function. This approach not only accelerates the iterative speed of the algorithm but also mitigates the impact of data dimensionality. Consequently, this study adopts the Z-score method as the preferred standardization technique.
z = (x − μ) / σ   (6)
Here, z denotes the standardized data point, x signifies the original data point, μ is the mean of the original dataset, and σ corresponds to the standard deviation of the original data.
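A minimal sketch of the Z-score transform above (equivalent to scikit-learn's StandardScaler with default settings); the age values are illustrative.

```python
import numpy as np

def z_score(X):
    """Column-wise Z-score standardization: z = (x - mu) / sigma."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Illustrative ages; after the transform each column has mean 0 and std 1
ages = np.array([[22.0], [35.0], [48.0], [61.0]])
z = z_score(ages)
```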
Through data visualization, it was discovered that age serves as a critical determinant in employee turnover. As employees advance in age, they demonstrate increased stability and a diminished likelihood of vacating their positions, suggesting an inverse relationship between age and turnover rates. A comprehensive analysis of the baseline data indicates that individuals below 24 years of age and above 58 years of age exhibit substantially higher turnover rates. Conversely, employees within the 24–58 age range manifest a more moderate and consistent turnover tendency, as illustrated in Fig. 4.
Fig. 4 [Images not available. See PDF.]
Relationship between attrition and age
Furthermore, we conducted an analysis using age and salary income as samples and generated a Kernel Density Estimation (KDE) plot. This visualization reveals that enterprise personnel exhibit the highest turnover probability when their monthly salary falls within the range of 0–7000, as depicted in Fig. 5. This discovery has significantly contributed to subsequent algorithm development and offered valuable insights into understanding the driving factors behind employee turnover.
Fig. 5 [Images not available. See PDF.]
Relationship between turnover and salary
Following a comprehensive statistical analysis of the dataset's features, we identified issues such as missing values, outliers, duplicates, and irrelevant variables. In the experiment, missing data were filled with the column median values, extreme values were capped at the 95th percentile, and duplicate records were merged. Furthermore, due to the constraints of the experimental samples within the specific domain, we encountered a significant data imbalance issue: only 16% of employees resigned, yielding a negative-to-positive sample ratio of roughly 5:1, as depicted in Fig. 6.
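The cleaning steps described above (duplicate merging, median imputation, 95th-percentile capping) can be sketched with pandas. The frame and column names below are hypothetical, not the study's actual features.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the paper's cleaning steps
df = pd.DataFrame({
    "age": [25, 30, np.nan, 41, 41, 300],        # one missing value, one outlier
    "monthly_income": [3000, 5200, 4100, 8000, 8000, 99000],
})
df = df.drop_duplicates()                        # merge duplicate rows
df = df.fillna(df.median(numeric_only=True))     # fill gaps with column medians
caps = df.quantile(0.95)                         # cap extremes at the 95th percentile
df = df.clip(upper=caps, axis=1)
```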
Fig. 6 [Images not available. See PDF.]
Balance of positive and negative data samples
The significant disparity in the number of positive and negative data samples, along with the unbalanced nature of the samples, poses challenges in the form of sample defects in unbalanced datasets. These issues tend to bias classification learning towards the majority class, resulting in models with high accuracy but limited generalization ability. Specifically, the AUC or F1 score may not meet practical requirements. To address these issues, this paper employs the ADASYN adaptive sampling algorithm to mitigate the imbalance between positive and negative samples.
Model evaluation index
In this research, the metrics for experimentation include Accuracy, F1-score, and AUC (Area Under the Curve) value. Initially, we will introduce the fundamental confusion matrix and Accuracy, subsequently, discuss Precision, Recall, and F1-score, and finally, cover ROC (Receiver Operating Characteristic) and AUC.
The confusion matrix is a widely used tool in machine learning for evaluating the performance of classification models [21], as depicted in Table 2. In this context, a positive example signifies an employee who has left the company, while a negative example represents one who has not. True Positive (TP) indicates the number of employees who left and were correctly predicted to have left. False Negative (FN) indicates the number of employees who left but were incorrectly predicted to have not left. False Positive (FP) indicates the number of employees who did not leave but were incorrectly predicted to have left. True Negative (TN) indicates the number of employees who did not leave and were correctly predicted to have not left.
Table 2. Confusion matrix
Real situation | Predicted positive example | Predicted counter example |
|---|---|---|
Positive example | TP | FN |
Counter example | FP | TN |
Accuracy is a fundamental metric in machine learning for measuring model performance, representing the proportion of correctly predicted samples relative to the total number of samples. The formula for calculating accuracy is as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)   (7)
Precision and Recall are key metrics utilized in evaluating the performance of classification models. Precision measures the proportion of correctly predicted departed samples among all samples classified as departed by the model. Specifically, it elucidates the accuracy of the model's departure predictions. Recall, conversely, measures the proportion of actual departed samples accurately identified by the model. It indicates the model's efficacy in detecting true departures within the dataset.
Precision = TP / (TP + FP)   (8)
Recall = TP / (TP + FN)   (9)
The F1 Score is a comprehensive metric in machine learning used to measure a model’s precision and recall. It represents the harmonic mean of precision and recall, expressed by the following formula:
F1 = 2 × Precision × Recall / (Precision + Recall)   (10)
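The four metrics (Accuracy, Precision, Recall, F1) can be computed directly from confusion-matrix counts; the counts below are illustrative, not the paper's results.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, Precision, Recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)          # correctness of predicted departures
    recall = tp / (tp + fn)             # coverage of actual departures
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts for a turnover classifier (not the paper's results)
acc, p, r, f1 = classification_metrics(tp=30, fn=14, fp=10, tn=221)
```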
ROC curves and AUC are established evaluation metrics for assessing model performance. The ROC curve plots the True Positive Rate (TPR) on the vertical axis against the False Positive Rate (FPR) on the horizontal axis. In contrast to the ROC curve, which provides a visual representation of model performance, the AUC value offers a more concise and quantitative assessment. A higher AUC value indicates superior model performance, as it signifies an enhanced ability to discriminate between true positives and false positives. TPR and FPR are defined as follows:
TPR = TP / (TP + FN)   (11)
FPR = FP / (FP + TN)   (12)
AUC is derived by calculating the area under the ROC curve. In instances where a limited number of samples are available, the formula for computing AUC is as presented below:
AUC = (Σ_{i∈positive} rank_i − M(M + 1)/2) / (M × N)   (13)
Here, M represents the number of positive samples, N represents the number of negative samples, and rank_i represents the ranking position of the i-th positive sample when all samples are sorted in ascending order of predicted score.
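A sketch of the rank-based AUC computation above; it assumes no tied scores, in which case it agrees with sklearn.metrics.roc_auc_score. The scores and labels are illustrative.

```python
import numpy as np

def auc_by_rank(scores, labels):
    """AUC via the rank formula: sum the positive samples' ascending ranks,
    subtract M(M+1)/2, and divide by M*N (M positives, N negatives)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = scores.argsort()
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)    # rank 1 = lowest score
    M = (labels == 1).sum()
    N = (labels == 0).sum()
    return (ranks[labels == 1].sum() - M * (M + 1) / 2) / (M * N)

scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 0, 1, 1, 0]
auc = auc_by_rank(scores, labels)   # 4 of 6 positive-negative pairs ordered correctly
```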
Results and evaluation
The model proposed in this paper was implemented using the Python scikit-learn library. The PAD-SA predictive model was compared against other machine learning algorithms, including SVM, Random Forest, and LightGBM, on the custom dataset, yielding superior results. The PAD-SA algorithm proposed in this study demonstrates improved performance compared to previous models across various metrics, such as Accuracy, F1-score, and Area Under the Curve (AUC) value, as detailed in Table 3. Furthermore, Fig. 7 presents the ROC curves and corresponding AUC values, providing additional evidence of the efficacy of the PAD-SA algorithm developed in this research.
Table 3. Comparison of model evaluation index values
Model | Accuracy (%) | F1 value (%) | AUC value (%) |
|---|---|---|---|
SVM | 89.0 | 49.4 | 69.1 |
RandomForest | 94.4 | 85.0 | 84.6 |
XGBoost | 95.3 | 87.0 | 83.4 |
PAD-SA | 96.5 | 88.4 | 86.3 |
Fig. 7 [Images not available. See PDF.]
ROC curve and AUC area corresponding to different models
In comparison to the other models, the PAD-SA algorithm exhibits substantial enhancements in Accuracy, F1-score, and AUC value. Compared to the optimal model within the benchmarks, PAD-SA achieved an ROC value increase of 3.7%, and it outperformed the average performance of the comparison models by 11.9% and 9.3%, respectively. With regard to accuracy, PAD-SA demonstrates a 1.3% improvement over the optimal benchmark model; in terms of F1-score, it exceeds the optimal benchmark model by 1.5%; and concerning AUC value, it displays a 1.7% advantage. These findings suggest that the PAD-SA algorithm proposed in this study possesses a significant advantage on the self-constructed dataset of scientific researchers' turnover, effectively predicting the probability of researchers departing from their positions.
Conclusion
The PAD-SA algorithm proposed in this paper has demonstrated superior performance on a self-constructed dataset comprising 1100 samples, outperforming the other models. Its ROC value increased by 3.7% over the optimal benchmark model, and it surpassed the average performance of the comparative models by 11.9% and 9.3%, respectively. This research effectively aids enterprises in forecasting the departure intentions of scientific researchers in advance, allowing companies to implement suitable measures to mitigate the losses associated with researcher turnover.
However, the research is not without its limitations. Notably, the characteristic information of scientific researchers, including their departure status, is kept confidential within companies, resulting in a smaller dataset size and potential biases in data distribution.
Author contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Jing Yang, Ru Liu and Qiyuan Feng. The first draft of the manuscript was written by Tianyi Zhang and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Funding
No funding was received to assist with the preparation of this manuscript.
Data availability
Data will be made available on request.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Hom, PW et al. One hundred years of employee turnover theory and research. J Appl Psychol; 2017; 102,
2. Meddeb E, The Human Resource Management challenge of predicting employee turnover using machine learning and system dynamics. BIR Workshops;2021.
3. Lazzari, M; Alvarez, JM; Ruggieri, S. Predicting and explaining employee turnover intention. Int J Data Sci Anal; 2022; 14,
4. Yuan, S; Kroon, B; Kramer, A. Building prediction models with grouped data: a case study on the prediction of turnover intention. Hum Resour Manag J; 2024; 34,
5. Nurtjahjono, GE et al. Predicting turnover intention through employee satisfaction and organizational commitment in Local Banks in East Java. Profit J Adminsitrasi Bisnis; 2023; 17,
6. Fadel, M; Kanasfi, K; Arifin, Z; Triyono, G. Application of ensemble method for employee turnover predictions in financial services company. J Tek Inform (JUTIF); 2024; 5,
7. Ozakca NS, Bulus A, Cetin A, Artificial intelligence based employee attrition analysis and prediction. In: 2024 6th International conference on computing and informatics (ICCI), New Cairo—Cairo, Egypt, 2024, pp. 512–517, https://doi.org/10.1109/ICCI61671.2024.10485157
8. Bommisetti RK, Kodumagulla RP, Srinivas M, Nalkurti A, Nandurkar PA, Hoang SD, Deep learning based predictive analytics for employees turnover with performance prediction in banking sector. In: 2024 International conference on inventive computation technologies (ICICT), Lalitpur, Nepal, 2024, pp. 933–938, https://doi.org/10.1109/ICICT60155.2024.10544552.
9. Sardar DK, Chourasiya S, Vijyalakshmi V, Employee turnover prediction by machine learning techniques. In: 2023 International conference on circuit power and computing technologies (ICCPCT), Kollam, India, 2023, pp. 265–272. https://doi.org/10.1109/ICCPCT58313.2023.10244896
10. Qiang, Li; Liang, Z. Employee turnover prediction analysis and research based on stacking algorithm. J Chongqing Technol Bus Univ Natl Sci Edn; 2019; 36,
11. Zhang T, Li J, Credit risk control algorithm based on stacking ensemble learning. In: 2021 IEEE international conference on power electronics, computer applications (ICPECA). IEEE; 2021.
12. Veigas, KC; Regulagadda, DS; Kokatnoor, SA. Optimized stacking ensemble (OSE) for credit card fraud detection using synthetic minority oversampling model. Indian J Sci Technol; 2021; 14,
13. Mu, W et al. Wine recommendation algorithm based on partitioning and stacking integration strategy for Chinese wine consumers. Italian J Food Sci; 2022; 34,
14. Dehkordi SK, Sajedi H, A prescription-based automatic medical diagnosis system using a stacking method. In: 2017 IEEE 15th international symposium on intelligent systems and informatics (SISY). IEEE;2017.
15. Chawla, NV; Bowyer, KW; Hall, LO et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res; 2002; 16,
16. Han, H; Wang, WY; Mao, BH. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Lect Notes Comput Sci; 2005; [DOI: https://dx.doi.org/10.1007/11538059_91]
17. He, H; Bai, Y; Garcia, EA et al. ADASYN: adaptive synthetic sampling approach for imbalanced learning. IEEE; 2008; [DOI: https://dx.doi.org/10.1109/IJCNN.2008.4633969]
18. Bin, Z; Bin, L. Estimation of surface PM (25) concentration based on Stacking. Environ Eng; 2020; 38,
19. Kulina, SHN; Gocheva-Ileva, SG; Yaneva, PE. CART ensemble and bagging algorithm for estimating of factors influencing the furniture market. Telematique; 2023; 22,
20. Ajit, P. Prediction of employee turnover in organizations using machine learning algorithms. Algorithms; 2016; 4,
21. Zhihua, Z. Machine learning; 2016; Beijing, Tsinghua University Press:
Copyright Springer Nature B.V. May 2025