Scorecards are widely used in decision-making due to their transparency, simplicity, and interpretability. A scorecard is a predictive model that assigns points to features based on their contribution to an outcome. The sum of these points produces a total score that can be used to make a classification or estimate the probability of a given event.
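As a minimal sketch of the idea, the snippet below sums the points of a small scorecard and maps the total to a probability. The feature names, bin edges, point values, intercept, and the logistic link are all hypothetical choices made for illustration; none of them come from the thesis.

```python
import math

# Hypothetical scorecard: each feature maps half-open bins [low, high) to points.
SCORECARD = {
    "age": [((0, 40), 0), ((40, 60), 2), ((60, math.inf), 4)],
    "bmi": [((0, 25), 0), ((25, math.inf), 3)],
}
INTERCEPT = -5  # hypothetical offset applied before the logistic link


def total_score(record):
    """Sum the points of the bin that each feature value falls into."""
    score = 0
    for feature, bins in SCORECARD.items():
        value = record[feature]
        for (low, high), points in bins:
            if low <= value < high:
                score += points
                break
    return score


def risk(score, intercept=INTERCEPT):
    """Map a total score to a probability with a logistic link (an assumption)."""
    return 1.0 / (1.0 + math.exp(-(intercept + score)))


example = {"age": 65, "bmi": 27}
print(total_score(example), risk(total_score(example)))
```

Because every feature contributes a fixed number of points per bin, the model can be read and applied by hand, which is the property the abstract refers to as transparency.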
Traditional approaches to scorecard construction rely on a two-step process: first, continuous features are discretised and encoded; then, a predictive model estimates weights (points) for all features. Current state-of-the-art methods focus solely on weight estimation and assume that feature discretisation has already been carried out as a pre-processing step. This leaves a gap in the development of algorithms that combine both steps into a unified process. To address this gap, this thesis introduces Infinitesimal Bins, a novel discretisation algorithm designed to approximate scorecard construction as a one-step algorithm.
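The traditional two-step process can be sketched as follows. This is an illustration only: it uses quantile binning, One-Hot Encoding, and a plain logistic regression fitted by gradient descent as stand-ins, on made-up toy data. It does not implement Infinitesimal Bins, and every function name here is hypothetical.

```python
import math


def quantile_edges(values, n_bins):
    """Step 1a: interior cut points splitting the values into roughly equal-sized bins."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]


def one_hot(value, edges):
    """Step 1b: indicator vector marking the bin that contains the value."""
    index = sum(value >= edge for edge in edges)
    return [1.0 if i == index else 0.0 for i in range(len(edges) + 1)]


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Step 2: batch gradient descent on the logistic loss (no intercept, no penalty)."""
    weights = [0.0] * len(rows[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for row, y in zip(rows, labels):
            z = sum(w * x for w, x in zip(weights, row))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, x in enumerate(row):
                grad[j] += (p - y) * x
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Toy data, invented for this example.
ages = [22, 25, 31, 38, 44, 52, 58, 63, 70, 77]
labels = [0, 1, 0, 1, 0, 1, 0, 1, 1, 1]

edges = quantile_edges(ages, 3)              # step 1: discretise
rows = [one_hot(a, edges) for a in ages]     # step 1: encode
weights = fit_logistic(rows, labels)         # step 2: estimate one weight per bin
print(edges, weights)
```

The bin edges are fixed before the weights are estimated, so the second step cannot revise a poor discretisation. That separation is the gap a one-step approach aims to close.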
Additionally, recent literature emphasises optimisation-based approaches over machine learning-based ones for scorecard design, particularly for binary scorecards, with RiskSLIM being the most notable method. However, its extension to ordinal classification remains underexplored. This work addresses this issue by adapting RiskSLIM for ordinal data through a Data Replication framework.
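A rough sketch of how a data replication scheme can reduce an ordinal problem to a single binary one is given below. It assumes the common formulation in which each sample with K ordered classes is copied K-1 times, each copy is tagged with an extra indicator for the threshold it represents, and its binary label states whether the true class lies above that threshold. The function names and the constant `h` are hypothetical, and the binary learner is left abstract rather than being RiskSLIM itself.

```python
def replicate(x, y, n_classes, h=1.0):
    """Expand one ordinal sample (x, y), with y in 0..n_classes-1, into
    n_classes-1 binary samples, one per threshold between adjacent classes."""
    replicas = []
    for q in range(n_classes - 1):
        # Extra features identify the threshold; the first replica gets all zeros.
        extension = [0.0] * (n_classes - 2)
        if q > 0:
            extension[q - 1] = h
        replicas.append((list(x) + extension, int(y > q)))
    return replicas


def predict_ordinal(x, binary_predict, n_classes, h=1.0):
    """Predicted class = number of thresholds the binary classifier says x exceeds."""
    replicas = replicate(x, 0, n_classes, h)  # the label passed here is ignored
    return sum(binary_predict(features) for features, _ in replicas)


# Usage with a hand-written binary rule standing in for a trained classifier.
rule = lambda f: int(f[0] - 0.4 * f[1] > 0.3)
print(replicate([0.5], 1, 3))
print(predict_ordinal([0.5], rule, 3))
```

Since all replicas share the weights on the original features, the resulting ordinal model keeps a single set of points per feature and only the decision thresholds differ between classes.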
Special emphasis is placed on healthcare applications, where ordinal outcomes are more common and the interpretability and transparency of models are crucial for clinical usage. In particular, this thesis investigates the prediction of aesthetic results after breast cancer conservative treatment.
The experimental evaluation shows that Infinitesimal Bins, compared to the baseline discretiser, tends to produce larger models. Its granularity can be beneficial for small datasets, but often leads to excessively complex models on larger datasets, indicating the need for refinement through bin merging or pruning. In terms of encoding methods, Differential Coding consistently outperforms One-Hot Encoding by reducing overfitting, improving sparsity, and achieving higher accuracy. Among classifiers, in binary tasks, the sparsity-inducing Generic Generalised Linear Estimator (skglm) model achieves the best balance between compactness and predictive performance. For ordinal tasks, however, no single classifier proves consistently superior; the choice of classifier should therefore depend on the desired trade-off between compactness and predictive performance. Exploratory analyses further reveal that manual feature selection and ensemble strategies enhance interpretability and performance. Additionally, the extension of RiskSLIM to ordinal classification demonstrates the feasibility of optimisation-based approaches in this setting.
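To illustrate the difference between the two encodings, the sketch below assumes that Differential Coding refers to a cumulative (thermometer-style) code, in which one indicator is switched on for every bin boundary a value has crossed, so that each weight represents the change in points between adjacent bins. That reading, the function names, and the example weights are assumptions for illustration, not details confirmed by the abstract.

```python
def one_hot_code(bin_index, n_bins):
    """One indicator per bin; exactly one of them is active."""
    return [1 if i == bin_index else 0 for i in range(n_bins)]


def differential_code(bin_index, n_bins):
    """One indicator per boundary; active for every boundary below the value's bin."""
    return [1 if i < bin_index else 0 for i in range(n_bins - 1)]


def points_from_differences(deltas):
    """Recover per-bin points from weights learned on the differential code."""
    points = [0]
    for delta in deltas:
        points.append(points[-1] + delta)
    return points


print(one_hot_code(2, 4))
print(differential_code(2, 4))
print(points_from_differences([2, 0, 3]))
```

Under this reading, a weight of zero means two adjacent bins receive the same points and can be merged, which is one plausible way a sparsity-inducing classifier could yield more compact scorecards with this encoding.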
Details
Physiology;
Machine learning;
Integer programming;
Accuracy;
Datasets;
Regression analysis;
Artificial intelligence;
Classification;
Support vector machines;
Feature selection;
Linear programming;
Methods;
Algorithms;
Data replication;
Breast cancer;
Home ownership;
Business metrics;
Computer science;
Oncology