1. Introduction
Electronic Health Records (EHRs) have become indispensable in modern clinical healthcare, offering a rich source of data that chronicles a patient’s medical history. In recent years, deep learning-based prediction models have gained significant attention for their ability to leverage EHR data, which, when represented as temporal sequences of clinical visits, can inform and enhance healthcare decision-making [1]. Such applications range from disease diagnosis and mortality prediction to patient sub-typing and personalized treatment planning [2,3,4].
Working with EHR data presents challenges due to its inherent limitations in clinical practice. Factors such as rare disease occurrences, expensive examinations, and safety considerations result in data scarcity [5]; for instance, missing or inconsistent follow-up visits disrupt the continuity of patient health trajectories, creating substantial gaps in longitudinal records [2]. A patient with diabetes may have regular check-ups for three months and then miss several appointments, resulting in incomplete monitoring of critical indicators like HbA1c levels [6]. These discontinuities create fractured representations of patient status, making it difficult to model disease progression accurately. Consequently, task-specific models trained on such incomplete datasets often suffer from overfitting to observed patterns while failing to capture true health trajectories, compromising both performance and robustness in real-world applications.
Existing work on EHR data sparsity can be categorized into direct data space methods and indirect representation space methods, which together follow four key approaches. Direct methods include attention-based models like RETAIN [2], which identify influential visits through neural attention but focus only on available data points without addressing missing temporal contexts, and interpretability-focused approaches [3,7], which enhance transparency through multichannel feature extraction and importance recalibration yet struggle with irregularly sampled features. Indirect methods include data scarcity solutions such as multi-task learning [5] and SAFARI [6], which apply a correlational sparsity prior but require extensive expert annotation, and advanced representation learning techniques [8], which employ self-supervised learning but often capture patterns unrelated to the target prediction task.
Despite recent advancements, contemporary methods face the following two major limitations: (1) insufficient modeling of long-term health trajectories, particularly in contexts where follow-up visits occur at irregular intervals; and (2) inadequate ability to selectively identify features that are most relevant to specific prediction tasks. For example, consider a diabetes patient who misses several follow-up appointments; current models may not only struggle to relate the patient’s initially stable status to later complications due to gaps in the temporal sequence, but they may also fail to effectively filter out irrelevant features (such as unrelated laboratory values or demographic attributes) while emphasizing clinically significant variables that contribute to the risk of disease progression. As a result, critical trajectory patterns and key predictors may be overlooked, impairing model performance for individualized risk assessment.
Intuitively, the way clinicians approach patient care offers valuable inspiration for modeling clinical pathways. When faced with a patient who has missed follow-ups, physicians naturally try to fill in the gaps by considering the patient’s historical trends and likely progression based on similar cases they have encountered. They focus on the most predictive indicators while filtering out irrelevant information. Similarly, an effective predictive modeling approach should leverage the predictive power of historical visit patterns to anticipate likely future states, enabling a more continuous representation of the patient’s health trajectory despite missing observations. This perspective aligns with how language models predict the next word in a sentence based on context, suggesting that viewing clinical pathways as “health sentences” could yield more coherent patient representations [9].
Given these insights, we are confronted with the following pressing challenge: How can we effectively model the long-term clinical pathway to capture future-predictive health representations, despite missing or inconsistent follow-ups, while filtering out redundant information that might compromise target prediction?
To address this, we introduce
Central to our approach is the view of a patient’s clinical pathway as analogous to a sentence in natural language processing, with each visit considered a word containing lab tests and events [9]. Unlike traditional methods requiring additional labeling or complex feature engineering [5], our auxiliary sub-network evaluates the reliability of each feature by analyzing historical visit patterns and temporal dependencies among clinical variables. By examining how well certain features predict future health states, our model learns which aspects of the patient record are most informative for clinical outcomes, even when observations are irregular or missing.
Our neuron-level filtering mechanism, based on layer decorrelation, encourages diversity among hidden units while adaptively selecting target-predictive features. High-ranking neurons that capture essential health progression patterns are preserved, while low-ranking neurons that model noise or redundant information are filtered out. This approach mimics how physicians focus on key indicators while disregarding less relevant measurements [6,10].
We argue that jointly learning future-predictive health representations offers valuable insights for prognosis, beyond independently predicting target labels. For example, when predicting diabetes risk, incorporating blood glucose level predictions as an auxiliary task enhances the primary task by capturing the underlying progression pattern. Unlike previous multi-task approaches with parameter hard-sharing [5],
Our primary contributions are outlined as follows: Methodologically, we propose Experimentally, evaluations on CDSL, MIMIC-III, and MIMIC-IV datasets demonstrate
2. Related Work
In the field of EHR data analysis, data sparsity caused by irregular sampling poses significant challenges for building efficient prediction models [11,12]. Previous approaches primarily fall into the following two categories: methods that directly address sparsity in the data space and methods that indirectly resolve sparsity in the representation space.
2.1. Direct Data Space Methods
Direct data space methods aim to operate directly on raw EHR data through feature extraction or data imputation to address missing value problems. Traditional approaches rely on imputation techniques [13] or statistical methods [14], which assume patient visits are independent and features are missing randomly, ignoring the fact that missingness in EHR data often contains important clinical information. More sophisticated architectures like RETAIN [2] use two-level attention to identify important visits and variables but neglect the fine-grained importance of feature changes. While RETAIN [2] identifies influential visits and variables, its primary focus on available data points makes it challenging to infer progression along an implicit clinical pathway, especially when significant temporal gaps arise from irregular follow-ups. The attention mechanism may not fully capture the underlying clinical logic connecting temporally distant but pathologically related events if the intervening pathway segments are unobserved or poorly represented due to sampling irregularities. Recent models [3,15,16] enhance representation through time-aware distributions, multi-scale feature extraction, and adaptive feature importance recalibration, but they still face fundamental challenges with highly sparse data—extracting sufficient information from limited observations to construct complete representations.
The core limitation of direct methods is their reliance solely on limited data from individual patients, ignoring valuable information in cross-patient patterns, making it difficult to construct complete representations in highly sparse environments. Furthermore, even state-of-the-art feature extraction techniques struggle with high proportions of missing values, especially when the missing patterns themselves contain clinical information [17].
2.2. Indirect Representation Space Methods
Indirect representation space methods address sparsity problems in the representation space by utilizing auxiliary information or similar patient data, rather than directly filling in missing values in the original data [5].
One direction leverages patient similarities. GRASP [4] finds similar patients to enhance representation learning, while other approaches [10] compensate for missing modality information using auxiliary data from similar patients. However, these methods underestimate the impact of missing features when measuring patient similarity, leading to inaccurate similarity assessments. More importantly, they focus on how to extract information from similar patients, rather than how to effectively integrate this information.
Some methods enhance representations through multi-task learning [5,17]. Current approaches often employ parameter hard-sharing strategies for multi-task models, failing to explicitly consider what to share and how much to share it, which can lead to negative transfer that interferes with target prediction optimization.
Some recent works [6,8] learn self-supervised representations for irregular time series through time-sensitive contrastive learning and data reconstruction, but they fail to fully utilize the implicit clinical pathway information in EHR data, which is crucial for capturing dynamic changes in patients’ health status. For instance, advanced sequential models like Mamba [9], while effective at capturing long-range dependencies, are not inherently designed to decode the implicit clinical pathway logic from sequences of medical codes and measurements, particularly when these sequences are characterized by irregular and unpredictable time intervals. Their state-space mechanisms might summarize long histories but may not explicitly model the ‘expected next phase’ of a clinical pathway or differentiate signals critical to pathway progression from general temporal patterns, especially when faced with high data irregularity.
Overall, a significant gap persists: existing methods, whether direct or indirect, often lack dedicated mechanisms to holistically interpret implicit clinical pathways from irregularly sampled EHR time series. They may focus too narrowly on observed points, fail to adequately model the inherent pathway logic, or lack adaptive filtering robust enough for severe irregularity and missingness. This underscores the need for novel frameworks that can explicitly learn from pathway progression cues and selectively utilize features, even amidst such data imperfections.
In this paper, we propose
3. Preliminary
In this section, we begin with a motivating example, then describe the structure of EHR data, and formulate the problem of clinical healthcare prediction.
3.1. A Motivating Example
We consider the health status prediction of patients with chronic or critical conditions as the motivating example. Many individuals worldwide suffer from such conditions, facing significant health risks that require ongoing treatment and periodic hospital visits for various tests (e.g., blood tests and vital sign monitoring). The accurate prediction of patient health risks based on medical records collected during these visits is crucial in supporting recovery and preventing adverse outcomes.
3.2. Problem Formulation
EHR data are routinely collected from patient observations in hospitals through clinical visits, encompassing discrete time-series data (e.g., medication and diagnosis) and continuous multivariate data (e.g., vital signs and laboratory measurements). We assume a patient visits the clinic t times, generating time-ordered EHR records denoted as . Each EHR record contains features, such as lab test results or clinical observations, as illustrated in Figure 1. The prediction problem in this paper is formulated as follows: given t historical EHR records of a patient, i.e., , how can we predict the patient’s healthcare status y, which represents the probability of encountering a specific risk (e.g., in-hospital mortality or readmission). The next section will detail the proposed model. Table 1 lists the notations used in our model.
4. Method
For the healthcare prediction task, the model is trained to utilize the recorded clinical visits () to learn the representation of health status and predict the probability of suffering from the specific target outcome in the future (e.g., the risk of in-hospital mortality or readmission). In this paper, we propose a general model, We explicitly model the clinical status pathway by training a Gated Recurrent Unit (GRU)-based auxiliary sub-network (i.e., the left GRU) to predict the lab tests and clinical events recorded in a future visit (). The hidden representation of the sequence () is encoded to be a good predictor of future status, and it is provided as extra clinical features for the supervised clinical prediction task. This helps the model to depict the health status from a long-term perspective. A task-specific GRU is applied to extract the other part of the health status representation. The model merges the task-specific representation and the auxiliary representation to perform the target prediction. We encourage the diversity among hidden units based on layer decorrelation to help the useful units stand out (i.e., denoted as red circles). A neuron-level gate is designed to filter out the units that are useless to the target prediction (i.e., denoted as blue squares on the left side, and blue triangles on the right side) and reduce the redundancy of the model.
Figure 2 Framework of PathCare.
4.1. Auxiliary Task for Clinical Pathway Modeling
For the sake of causality, the model is not allowed to consult the lab tests and clinical events recorded in future visits and take them as input (i.e., clinical features), while, in reality, they are closely related to the target outcome in the future. It is beneficial to jointly model the clinical pathway in the future instead of independently predicting the target label. However, it may introduce redundancy and disturb the optimization of the original network if the model is made to predict the future visit and target outcome by directly performing multi-task learning in the same model. Inspired by [18] in natural language processing, which adds the context meaning of words to the sequence tagging model by pre-training an extra language model, we construct a separate auxiliary network to predict lab tests and clinical events in the future. The sub-network takes the patient’s EHR records ⋯ as the input and embeds the time sequence with a GRU [19],
(1)
where ⋯ represents the learned representations for each clinical visit, and is the hidden dimension. These representations are called future-specific representations, and they contain information on the future status of the patients since they are used to predict future events in the model. Then, those representations are fed into a feedforward layer to predict the lab test values of the next clinical visit,
(2)
We do not need extra labels but use the time sequence itself to supervise the training. We calculate the mean-squared loss for every predicted visit except the last predicted one, whose ground-truth value is not recorded in the data.
(3)
The extracted hidden embeddings for the sequence will be used as additional clinical features in the supervised target prediction model. In particular, we concatenate the embedding () with the output from the GRU layers in the target prediction network (). However, is trained to predict the future clinical visit and thus contains a lot of target-irrelevant information, which may introduce undesired bias that can degrade the performance of target prediction. To exploit the beneficial information for target prediction, we force neurons to represent different information and filter out the target-irrelevant ones based on the neuron-level filtering gate in the next subsection.
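As a minimal sketch of the self-supervised objective described above, the snippet below computes a mean-squared error between each predicted next visit and the visit that actually follows, skipping the final prediction because its ground truth is not recorded. The function name `next_visit_mse`, the list-of-lists layout, and the choice to average over both time steps and features are illustrative assumptions rather than the authors' implementation; in the model, the predictions would come from the auxiliary GRU and its feedforward layer.

```python
def next_visit_mse(visits, predictions):
    """Mean-squared error between predicted and observed next visits.

    visits:      list of T visit vectors (each a list of F floats).
    predictions: list of T vectors; predictions[t] is the guess for
                 visits[t + 1]. (Hypothetical layout for illustration.)
    """
    if len(visits) != len(predictions):
        raise ValueError("expected one prediction per visit")
    if len(visits) < 2:
        return 0.0  # a single visit has no observed successor to compare with

    total, count = 0.0, 0
    # The last prediction targets a visit that was never recorded,
    # so the loop stops one step before the end of the sequence.
    for t in range(len(visits) - 1):
        target = visits[t + 1]
        for pred_value, true_value in zip(predictions[t], target):
            total += (pred_value - true_value) ** 2
            count += 1
    return total / count


# Toy example: three visits with two lab values each.
visits = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
predictions = [[2.0, 3.0], [3.0, 5.0], [9.0, 9.0]]  # the last entry is ignored
loss = next_visit_mse(visits, predictions)
```

Because the targets are simply the sequence shifted by one visit, no additional labels are needed, which matches the description of using the time sequence itself as supervision.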
4.2. Neuron-Level Filtering Gate
The task-specific module takes the patient’s raw EHR records ⋯ as the input and embeds the time sequence with a task-specific GRU,
(4)
where ⋯ represents the task-specific representations for each clinical visit, and is the hidden dimension. These representations are called task-specific representations, and they contain task-specific information of the patients. The future-specific representations and task-specific representations are then projected into a new latent space,
(5)
where and are the projection matrices for the two different representations. We construct a combined system, using as additional advanced clinical features. A simple method is to concatenate with . To effectively exploit information that is beneficial for the target prediction in auxiliary embeddings and reduce the redundancy, we intend to use a neuron-level filtering gate, which adaptively models the demand degree of clinical pathway information for patients in different conditions and filters out useless neurons.
Firstly, we promote diversity among hidden neurons (i.e., differentiation of information stored inside each neuron) to prepare for neuron-level filtering. This makes it easy to identify the target-irrelevant part. Based on [20], which uses a regularizer to reduce overfitting and increase generalization, we explicitly encourage non-redundant representations by reducing the correlation between activations in and , respectively. Here, for simplicity, we use to denote them as a general case. The covariances between all pairs of activations i and j of form a matrix ,
(6)
where B is the batch size and is the i-th activation of at the b-th case in the batch. is the sample mean of activation i over the batch. The diagonal of is then subtracted from the matrix norm to build the decorrelation loss term as follows:
(7)
where is the Frobenius norm, and the operator extracts the main diagonal of a matrix into a vector. The decorrelation loss will co-operate with the target prediction loss to decompose the target-relevant part from the irrelevant part, which helps to ease the difficulty in mining useful information.
Secondly, we propose a filtering gate to model the demand degree of future visit prediction for patients in diverse conditions. If the gate believes that jointly evaluating the future visit is beneficial to the target prediction for a patient, more hidden units in auxiliary embeddings will be included in the final health status representation. Otherwise, the gate will tend to select the hidden units in task-specific embeddings. Only the neurons selected by the gate are used to perform the target prediction. Concretely, the filtering gate is learned based on the latest record of the patient,
(8)
where is the projection matrix for the record in the last time step. Inspired by [21], which proposes a novel inductive bias for recurrent neural networks to separately allocate hidden state neurons with long and short-term information, we use a similar cumax function to operate on the learned and obtain the valve for and .
(9)
where cumsum denotes the cumulative sum. The output of vector cumax can be seen as the expectation of a binary gate [21]. This binary gate splits the cell state into two segments as follows: the 0-segment and the 1-segment. Thus, and act as complementary gate vectors, where element-wise, ensuring they operate on distinct neural pathways. We obtain a new state representation by using the learned gate to adaptively extract information from the future/task-specific embeddings,
(10)
where ⊙ represents element-wise multiplication. The patient’s state representation is fed into a feedforward layer to predict the final task,
(11)
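The two ingredients of this subsection, the decorrelation term and the cumax-based gate, can be sketched in plain Python under some loudly stated assumptions. The 1/2 factor in the decorrelation loss, the assignment of the cumax mask to the future-specific embedding with its complement on the task-specific embedding, and the merging of the two gated embeddings by summation are all assumptions made for illustration. The function names are hypothetical, and the actual model would operate on tensors rather than Python lists.

```python
import math


def decorrelation_loss(batch):
    """Half the sum of squared off-diagonal covariances between activations.

    batch: list of B activation vectors (each a list of D floats).
    The 1/2 factor is an assumption; the text only states that the
    diagonal is subtracted from the matrix norm.
    """
    n, dim = len(batch), len(batch[0])
    means = [sum(row[i] for row in batch) / n for i in range(dim)]
    loss = 0.0
    for i in range(dim):
        for j in range(dim):
            if i == j:
                continue  # variances on the diagonal are not penalized
            cov = sum((row[i] - means[i]) * (row[j] - means[j]) for row in batch) / n
            loss += cov ** 2
    return 0.5 * loss


def cumax(logits):
    """Cumulative sum of a softmax: a non-decreasing vector ending at 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    running, out = 0.0, []
    for e in exps:
        running += e / total
        out.append(running)
    return out


def merge_with_gate(future_repr, task_repr, gate_logits):
    """Blend two embeddings with complementary masks that sum to 1 per unit.

    Assumption: the cumax mask weights the future-specific units and its
    complement weights the task-specific units, merged by addition.
    """
    mask_future = cumax(gate_logits)
    return [m * f + (1.0 - m) * t
            for m, f, t in zip(mask_future, future_repr, task_repr)]


# Toy inputs: two perfectly correlated units versus two uncorrelated units.
loss_correlated = decorrelation_loss([[1.0, 2.0], [3.0, 6.0]])
loss_independent = decorrelation_loss(
    [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
mask = cumax([0.0, 0.0, 0.0, 0.0])
merged = merge_with_gate([1.0] * 4, [0.0] * 4, [0.0] * 4)
```

In this sketch, correlated units incur a positive penalty while uncorrelated units incur none, and the cumax mask increases monotonically along the hidden dimension, so units at one end draw mostly on one embedding and units at the other end mostly on the other.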
The corresponding target prediction loss and the final optimization loss are, respectively, defined as
(12)
(13)
where is a hyper-parameter. The synergistic operation of these components (specifically, the initial learning of future-specific and task-specific representations in separate GRUs; the promotion of internal diversity within these projected representations (, ) via the loss; and, critically, the adaptive neuron-level filtering gate) plays a crucial role in mitigating the risk of negative transfer. By dynamically selecting and weighting neuronal information from both the future-predictive and task-specific embeddings based on the current input instance (),
5. Experimental Setups
5.1. Datasets
We utilize the CDSL (COVID Data Save Lives), MIMIC-III, and MIMIC-IV datasets for benchmarking, with detailed statistics presented in Table 2 and Table 3.
CDSL dataset [22]. This dataset contains anonymized records of 4255 COVID-19 patients from Spain’s HM Hospitales [22]. The CDSL dataset is only used for mortality prediction as readmission information is unavailable.
MIMIC-III dataset (version 1.4) [11]. MIMIC-III contains de-identified health data from over 40,000 ICU patients (2001–2012), including demographics, vital signs, laboratory tests, and medications.
MIMIC-IV dataset (version 2.2) [12]. This updated version collected from 2008 to 2019 includes 24,610 samples after preprocessing, with approximately 29.5% positive samples for both mortality and readmission.
For all datasets, we apply the same preprocessing, outlined as follows: (1) forward filling missing values with patients’ most recent records; (2) standardizing features to zero mean and unit variance; (3) imputing missing features with dataset-level averages when all records are missing. We strictly maintain causality throughout preprocessing to ensure test data integrity.
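The three preprocessing steps can be illustrated with a small, self-contained sketch. The helper names `forward_fill` and `standardize` are hypothetical. Using the dataset-level average for values missing before a feature's first observation, and computing the mean and standard deviation on the training split only, are assumptions made here to keep the example causal; the text does not spell out these details.

```python
def forward_fill(series, fallback):
    """Fill gaps with the most recent observed value.

    series:   list of floats, with None marking a missing measurement.
    fallback: dataset-level average, used when no earlier value exists
              (including the case where the feature is never observed).
    """
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last if last is not None else fallback)
    return filled


def standardize(values, mean, std):
    """Scale values to zero mean and unit variance using given statistics."""
    if std <= 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mean) / std for v in values]


# Toy example: one lab value over five visits, with assumed training statistics.
raw = [None, 7.0, None, 9.0, None]
filled = forward_fill(raw, fallback=5.0)
scaled = standardize(filled, mean=7.0, std=2.0)
never_observed = forward_fill([None, None], fallback=5.0)
```

Forward filling only ever looks backwards in time, so a value at one visit never depends on later visits of the same patient.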
For the MIMIC-III dataset, the split of 80% training, 15% validation, and 5% testing was chosen to strictly align with the cohort selection and data splitting approach used in the benchmark study by Harutyunyan et al. [5]. This adherence ensures that our performance evaluation on MIMIC-III is directly comparable to established prior work. The specific data splits for CDSL (70% training, 10% validation, and 20% testing) and MIMIC-IV (70% training, 10% validation, and 20% testing) represent a standard and robust partitioning for machine learning model development and evaluation. For mortality prediction, we use the first 48 h of ICU data; for readmission prediction, we use all data before discharge.
5.2. Evaluation Metrics
We assess model performance using AUPRC, AUROC, and min(+P, Se). For our imbalanced datasets, AUPRC provides more informative assessment than AUROC, as it emphasizes positive class performance [23]. The min(+P, Se) metric, the minimum of precision (+P) and sensitivity (Se, or recall), ensures there is a balance between accurate positive predictions and capturing true positives. This is vital in healthcare, where both minimizing false alarms and detecting true cases matter.
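To make the min(+P, Se) metric concrete, the sketch below evaluates precision and sensitivity at every candidate threshold and reports the best achievable minimum of the two. Taking the maximum over thresholds is a common convention and is an assumption here, since the text does not state how the operating point is chosen; the function name is hypothetical.

```python
def min_precision_sensitivity(labels, scores):
    """Best value of min(precision, sensitivity) over all score thresholds.

    labels: list of 0/1 ground-truth outcomes.
    scores: list of predicted probabilities, same length as labels.
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("at least one positive label is required")

    best = 0.0
    for threshold in sorted(set(scores)):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        precision = tp / (tp + fp)  # tp + fp >= 1: the threshold is an observed score
        sensitivity = tp / positives
        best = max(best, min(precision, sensitivity))
    return best


# Toy example with two positives among five patients.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
result = min_precision_sensitivity(labels, scores)
```

A model scores well on this metric only if some threshold yields both few false alarms and few missed cases, which is the balance the metric is meant to capture.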
5.3. Prediction Tasks
We conduct experiments on the following two clinically relevant prediction tasks.
Mortality prediction. This task predicts whether a patient will die during hospitalization. For the MIMIC datasets, we use the first 48 h of ICU data to predict in-hospital mortality (12% positive cases). For the CDSL dataset, we predict COVID-19 mortality (12.69% positive cases), which is particularly challenging due to the disease’s novel nature and varied impact across populations.
Readmission prediction. The 30-day readmission prediction task aims to predict hospital readmission within 30 days after discharge. Available only for MIMIC datasets, this task uses all pre-discharge data, with approximately 16% positive samples in MIMIC-III and 15.5% in MIMIC-IV.
Both tasks have significant clinical value for identifying high-risk patients, optimizing resource allocation, and enabling early interventions.
5.4. Baseline Models
We compare
5.4.1. EHR-Specific Models
RETAIN [2] utilizes a two-level neural attention mechanism to detect influential visits and significant clinical variables within those visits. It processes EHR data in reverse time order, mimicking physician practice by giving higher attention to recent clinical visits.
SAFARI [6] learns compact patient health representations by imposing a correlational sparsity prior on the correlations of medical feature pairs. It solves a bi-level optimization problem involving high-level inter-group correlations and lower-level intra-group correlations, using a Laplacian kernel and graph neural networks.
AdaCare [16] captures the long- and short-term variations of biomarkers as clinical features to represent health status across multiple time scales. It models correlations between clinical features to enhance those that strongly indicate health status, maintaining high prediction accuracy while providing interpretability.
GRASP [4] enhances patient representation learning by leveraging knowledge from similar patients. It defines similarities between patients for different clinical tasks, finds similar patients with useful information, and learns cohort representation to extract valuable knowledge.
5.4.2. General Deep Learning Models
RNN is a standard recurrent neural network model applied to sequential medical data, serving as a fundamental baseline.
is a basic GRU model with an addition-based attention mechanism, serving as a strong baseline for healthcare prediction tasks.
Mamba [9] is a linear-time sequence modeling architecture based on selective state spaces. It allows the model to selectively propagate or forget information along the sequence length dimension, depending on the current token, making it suitable for processing long sequences of medical data.
5.4.3. Ablation Models
To evaluate our model components, we include the following two ablation variants. The first removes the future clinical pathway context module, focusing solely on current information for prediction. The second directly concatenates auxiliary and task-specific embeddings without the neuron-level filtering gate, maintaining potential redundancy between embeddings.
5.5. Implementation Details
Hardware and software. All models are trained on a single NVIDIA RTX 3090 GPU with CUDA 11.8 and 64 GB system memory. We implement our method using Python 3.11.4, PyTorch 2.0.1 [24], PyTorch Lightning 2.0.5 [25], and pyehr [26,27].
Training and hyperparameters. We use AdamW optimizer [28] with a batch size of 256 patients for all models. Training runs for a maximum of 100 epochs with early stopping after 10 epochs without AUPRC improvement on the validation set. We employ a learning rate of 0.001 with linear warmup (first 5 epochs) followed by cosine decay.
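The schedule and stopping rule described above can be sketched as two small functions. Only the peak learning rate (0.001), the 5 warmup epochs, the 100-epoch maximum, and the patience of 10 come from the text; updating the rate once per epoch, starting the warmup at one fifth of the peak, and decaying to zero at the final epoch are assumptions, as are the function names.

```python
import math


def learning_rate(epoch, base_lr=1e-3, warmup_epochs=5, max_epochs=100):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, max_epochs - warmup_epochs)
    progress = min(1.0, progress)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def should_stop(val_auprc_history, patience=10):
    """True once the best validation AUPRC is at least `patience` epochs old."""
    if not val_auprc_history:
        return False
    # max() returns the first index of the best score, so ties do not
    # count as an improvement.
    best_epoch = max(range(len(val_auprc_history)),
                     key=val_auprc_history.__getitem__)
    return len(val_auprc_history) - 1 - best_epoch >= patience


# Toy example: the rate over a full run, and a run that plateaus after epoch 0.
schedule = [learning_rate(e) for e in range(100)]
stalled = should_stop([0.50] + [0.40] * 10)
still_waiting = should_stop([0.50] + [0.40] * 9)
```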
To ensure reproducibility, we fix the random seed to 42 for all experiments and employ bootstrapping on the test sets for robust performance evaluation.
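A bootstrap evaluation of this kind might look like the following sketch, which resamples the test set with replacement and reports the mean and standard deviation of a metric across resamples. The number of resamples, the use of accuracy at a 0.5 threshold as the example metric, and the function names are assumptions for illustration; the text only states that bootstrapping is applied to the test sets and that the seed is fixed to 42.

```python
import random


def bootstrap_metric(labels, scores, metric, n_resamples=100, seed=42):
    """Mean and standard deviation of `metric` over bootstrap resamples."""
    rng = random.Random(seed)  # local generator keeps the result reproducible
    n = len(labels)
    values = []
    for _ in range(n_resamples):
        # Draw n indices with replacement from the test set.
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([labels[i] for i in idx],
                             [scores[i] for i in idx]))
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5


def accuracy_at_half(labels, scores):
    """Fraction of correct predictions when thresholding scores at 0.5."""
    correct = sum(1 for y, s in zip(labels, scores) if (s >= 0.5) == (y == 1))
    return correct / len(labels)


# Toy test set of six patients.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.1, 0.6, 0.8]
mean_acc, std_acc = bootstrap_metric(labels, scores, accuracy_at_half)
```

Metrics that are undefined on some resamples (for example, AUPRC when a resample contains no positives) would need extra handling that this sketch omits.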
6. Experimental Results and Analysis
We evaluate
6.1. Experimental Results
Table 4 and Table 5 present performance comparisons between
For readmission prediction (Table 5), performance advantages become more pronounced.
Comparative analysis reveals that
6.2. Ablation Study
To understand component contributions, we compare
The variant without the future clinical pathway module performs worse than the full model, with notable AUPRC decreases in mortality prediction (1.55% for CDSL and 6.01% for MIMIC-III) and readmission prediction (3.88% for MIMIC-III and 4.98% for MIMIC-IV). This confirms that modeling future clinical pathways provides valuable information for health status assessment and prediction.
Similarly, the variant without the neuron-level filtering gate shows performance drops across datasets, with AUPRC decreases of 2.00%, 3.37%, and 2.33% for mortality prediction on CDSL, MIMIC-III, and MIMIC-IV, respectively, and 3.95% for MIMIC-III readmission. These results demonstrate the efficacy of the neuron-level filtering gate in reducing redundancy and selecting relevant information from auxiliary embeddings.
These findings confirm that both future clinical pathway modeling and neuron-level filtering mechanisms are essential components of
6.3. Observations and Analysis
To further understand how PathCare adaptively selects information from future-predictive embeddings versus task-specific embeddings, we examine the learned gate values across different patient groups. Figure 3 illustrates the distribution of gate values for survival and non-survival groups across the CDSL and MIMIC-III datasets.
Interestingly, patients with adverse outcomes (non-survival) consistently exhibit higher average gate values compared to those with favorable outcomes (survival) across both datasets (CDSL: 0.54 vs. 0.51; MIMIC-III: 0.41 vs. 0.38). This indicates that PathCare adaptively allocates more representation capacity to future-predictive features for high-risk patients, while relying more on task-specific features for low-risk patients.
For patients at higher risk, the model appears to emphasize capturing recent deterioration patterns and near-term trajectory shifts that signal potential complications, which is consistent with their higher gate values. Conversely, for patients with better prognoses, the model draws more on task-specific embeddings, potentially reflecting the greater stability and predictability of their clinical trajectories. This adaptive mechanism aligns with clinical practice, where physicians pay closer attention to immediate risk signals in critically ill patients while considering broader health indicators for stable patients.
This adaptive reliance on future-predictive features, controlled by the gate mechanism, as evidenced by the distinct distributions in Figure 3, highlights how
7. Conclusions
In this paper,
Conceptualization, D.S.; Methodology, D.S. and X.L.; Software, D.S. and L.G.; Validation, C.Z.; Investigation, D.S.; Resources, L.M., L.W. and W.T.; Data curation, D.S. and K.Y.; Writing—original draft, D.S.; Writing—review & editing, D.S.; Visualization, D.S.; Supervision, L.M. and L.W.; Project administration, L.M. and W.T.; Funding acquisition, L.M. and W.T. All authors have read and agreed to the published version of the manuscript.
The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.
The authors declare no conflicts of interest.
Figure 1 Time-series data of EHR. Patients may visit the hospital multiple times. The health status is depicted by various clinical features, including numerical variables (e.g., lab tests) and categorical variables (e.g., diagnosis codes).
Figure 3 Distribution of gate values for survival vs. non-survival groups across datasets. Higher values indicate greater reliance on future-predictive features.
Table 1 Notations for PathCare.
Notation | Definition |
---|---|
| Ground truth of prediction target at t-th visit |
| Prediction result at t-th visit |
| Multivariate visit record at t-th visit |
| Prediction result of the status of the next visit |
| Health status embedding learned to predict the future visit |
| Health status embedding learned to predict the primary target |
| Projection of future-specific embedding |
| Projection of task-specific embedding |
| Neuron-level gate for modeling the demand degree of a future visit |
| Mask for selecting units from |
| Mask for selecting units from |
| Combined health status representation for the target prediction |
B | Batch size |
| Covariances between all pairs of activations i and j in the layer |
| The i-th activation of |
| The sample mean of activation i over the batch |
| Projection matrix for future representations |
| Projection matrix for task-specific representations |
| Projection matrix for the filtering gate |
| Hyperparameter for the decorrelation loss term |
| Loss for predicting the next visit |
| Decorrelation loss term |
| Loss for the primary prediction task |
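The batch-level entries in the notation table (batch size B, the per-activation sample mean, and the covariances between all pairs of activations) match the ingredients of the decorrelation penalty of Cogswell et al. [20]. The following is a minimal sketch, assuming the decorrelation loss term follows that formulation, i.e., half the sum of squared off-diagonal covariances. The function name `decov_loss` and the variable names are illustrative and do not come from the released source code.

```python
def decov_loss(batch):
    """Decorrelation penalty in the style of Cogswell et al. [20] (assumed form).

    batch: list of B activation vectors, each a list of floats of equal length.
    Returns 0.5 * sum over i != j of C[i][j] ** 2, where C is the batch
    covariance of the activations.
    """
    B = len(batch)      # batch size
    d = len(batch[0])   # number of activations in the layer

    # Sample mean of each activation i over the batch.
    mu = [sum(row[i] for row in batch) / B for i in range(d)]

    loss = 0.0
    for i in range(d):
        for j in range(d):
            if i == j:
                continue  # variances (diagonal entries) are not penalised
            # Covariance between activations i and j over the batch.
            c_ij = sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in batch) / B
            loss += c_ij ** 2
    return 0.5 * loss


# Two perfectly correlated activations are penalised; uncorrelated ones are not.
correlated = [[1.0, 1.0], [-1.0, -1.0]]
uncorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
print(decov_loss(correlated), decov_loss(uncorrelated))
```

Under the same assumption, the overall training objective would combine the primary-task loss, the next-visit loss, and this term scaled by the decorrelation hyperparameter listed in the table; the exact weighting is not recoverable from this section.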
Statistics of the experimented datasets after preprocessing. The # Samples column shows the number of samples and their percentage of the entire dataset, indicating data splits (train, val, and test). The #
Dataset | Split | # Samples | # | # |
---|---|---|---|---|
CDSL | Train | 2978 (69.99%) | 378 (12.69%) | - |
CDSL | Val | 426 (10.01%) | 54 (12.68%) | - |
CDSL | Test | 851 (20.00%) | 108 (12.69%) | - |
MIMIC-III | Train | 16,094 (80.00%) | 4996 (31.04%) | 4996 (31.04%) |
MIMIC-III | Val | 3018 (15.00%) | 934 (30.95%) | 934 (30.95%) |
MIMIC-III | Test | 1006 (5.00%) | 312 (31.01%) | 312 (31.01%) |
MIMIC-IV | Train | 17,227 (70.00%) | 5095 (29.58%) | 5095 (29.58%) |
MIMIC-IV | Val | 2461 (10.00%) | 724 (29.42%) | 724 (29.42%) |
MIMIC-IV | Test | 4922 (20.00%) | 1450 (29.46%) | 1450 (29.46%) |
Detailed statistics of the CDSL dataset.
Statistic | Total | Survived | Deceased |
---|---|---|---|
Number of patients | 4255 | 3715 (87.31%) | 540 (12.69%) |
Number of records | 123,044 | 108,142 (87.89%) | 14,902 (12.11%) |
Records per patient | 24.0 [15, 39] | 25.0 [15, 39] | 22.5 [11, 37] |
Age | 67.2 [56.0, 80.0] | 65.1 [54.0, 77.0] | 81.6 [75.0, 89.0] |
Age > Mean (67 years) | 2228 (52.36%) | 1748 (47.05%) | 480 (88.89%) |
Age ≤ Mean (67 years) | 2027 (47.64%) | 1967 (52.95%) | 60 (11.11%) |
Male | 2515 (59.11%) | 2173 (58.49%) | 342 (63.33%) |
Female | 1740 (40.89%) | 1542 (41.51%) | 198 (36.67%) |
Number of features | 99 | | |
Length of stay (days) | 6.4 [4.0, 11.0] | 6.1 [4.0, 11.0] | 6.0 [3.0, 10.0] |
Performance comparison of different methods on mortality prediction tasks across the CDSL, MIMIC-III, and MIMIC-IV datasets.
Methods | CDSL Mortality: AUPRC (↑) | CDSL Mortality: AUROC (↑) | CDSL Mortality: min(+P, Se) (↑) | MIMIC-III Mortality: AUPRC (↑) | MIMIC-III Mortality: AUROC (↑) | MIMIC-III Mortality: min(+P, Se) (↑) | MIMIC-IV Mortality: AUPRC (↑) | MIMIC-IV Mortality: AUROC (↑) | MIMIC-IV Mortality: min(+P, Se) (↑) |
---|---|---|---|---|---|---|---|---|---|
RETAIN | 77.23 ± 4.13 | 93.67 ± 1.56 | 68.96 ± 4.14 | 45.60 ± 4.48 | 83.89 ± 2.12 | 28.74 ± 4.27 | 51.53 ± 1.40 | 86.61 ± 0.55 | 33.09 ± 1.77 |
RNN | 83.03 ± 2.97 | 95.55 ± 0.82 | 69.02 ± 5.06 | 47.63 ± 5.86 | 84.03 ± 1.93 | 28.73 ± 4.97 | 51.92 ± 1.67 | 85.09 ± 0.71 | 30.92 ± 1.42 |
SAFARI | 76.70 ± 4.02 | 94.42 ± 1.27 | 63.96 ± 3.99 | 48.32 ± 3.08 | 84.57 ± 0.94 | 25.31 ± 1.94 | 49.25 ± 1.88 | 85.21 ± 0.83 | 32.22 ± 1.73 |
AdaCare | 82.10 ± 4.02 | 94.78 ± 1.19 | 72.57 ± 3.60 | 51.19 ± 2.90 | 83.72 ± 0.86 | 25.76 ± 2.44 | 51.18 ± 1.53 | 83.79 ± 0.68 | 34.33 ± 1.60 |
GRASP | 83.60 ± 3.10 | 95.05 ± 0.96 | 71.12 ± 4.50 | 45.64 ± 6.29 | 83.24 ± 1.90 | 30.22 ± 5.81 | 52.63 ± 1.38 | 86.23 ± 0.61 | 30.69 ± 1.33 |
| 80.57 ± 3.83 | 95.31 ± 1.13 | 66.38 ± 3.51 | 52.24 ± 2.62 | 85.49 ± 0.72 | 24.69 ± 2.48 | 54.61 ± 1.37 | 86.16 ± 0.73 | 42.92 ± 1.68 |
Mamba | 79.21 ± 3.73 | 92.28 ± 2.06 | 66.69 ± 4.73 | 51.33 ± 3.12 | 85.33 ± 0.89 | 26.05 ± 2.08 | 51.66 ± 1.32 | 84.29 ± 0.72 | 32.35 ± 1.95 |
| 82.56 ± 3.03 | 95.74 ± 0.93 | 72.48 ± 3.48 | 47.50 ± 4.83 | 83.99 ± 1.62 | 46.40 ± 0.42 | 51.67 ± 4.54 | 85.19 ± 1.70 | 52.47 ± 3.67 |
| 82.11 ± 3.73 | 95.24 ± 1.13 | 74.72 ± 3.69 | 50.14 ± 4.68 | 85.36 ± 1.65 | 50.80 ± 0.38 | 51.86 ± 4.09 | 84.13 ± 1.73 | 51.76 ± 3.39 |
| 84.11 ± 3.18 | 96.08 ± 0.97 | 76.55 ± 3.68 | 53.51 ± 4.40 | 85.63 ± 1.62 | 52.62 ± 0.16 | 54.19 ± 1.86 | 85.91 ± 0.65 | 52.62 ± 1.56 |
Note: Bold values indicate the best performance in each column.
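The tables report min(+P, Se) without defining it in this section. The sketch below assumes the definition commonly used in clinical prediction benchmarks: +P is precision, Se is sensitivity (recall), and the reported score is the largest value of min(+P, Se) over all decision thresholds. The function name and the toy inputs are illustrative; the tabulated values appear to be this quantity expressed as a percentage.

```python
def min_precision_sensitivity(y_true, y_score):
    """Return the maximum over thresholds of min(precision, sensitivity).

    y_true: list of 0/1 labels; y_score: list of predicted scores.
    Assumed definition; not taken from the authors' released code.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # Visit samples from the highest score to the lowest.
    order = sorted(range(len(y_true)), key=lambda k: y_score[k], reverse=True)

    best, tp, fp, idx = 0.0, 0, 0, 0
    n = len(order)
    while idx < n:
        threshold = y_score[order[idx]]
        # Samples tied at this score are all predicted positive together.
        while idx < n and y_score[order[idx]] == threshold:
            if y_true[order[idx]] == 1:
                tp += 1
            else:
                fp += 1
            idx += 1
        precision = tp / (tp + fp)
        sensitivity = tp / total_pos
        best = max(best, min(precision, sensitivity))
    return best


# Toy example: one negative is ranked above one positive.
print(min_precision_sensitivity([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

Because the score takes the minimum of precision and sensitivity, it stays low unless a single threshold achieves both, which makes it a stricter summary than AUROC on imbalanced outcomes such as mortality.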
Performance comparison of different methods on 30-day readmission prediction tasks across the MIMIC-III and MIMIC-IV datasets.
Methods | MIMIC-III Readmission: AUPRC (↑) | MIMIC-III Readmission: AUROC (↑) | MIMIC-III Readmission: min(+P, Se) (↑) | MIMIC-IV Readmission: AUPRC (↑) | MIMIC-IV Readmission: AUROC (↑) | MIMIC-IV Readmission: min(+P, Se) (↑) |
---|---|---|---|---|---|---|
RETAIN | 48.98 ± 1.94 | 77.50 ± 1.14 | 25.02 ± 1.40 | 46.71 ± 1.77 | 77.53 ± 0.97 | 35.15 ± 1.76 |
RNN | 45.77 ± 2.13 | 74.34 ± 0.87 | 28.68 ± 1.89 | 48.72 ± 1.35 | 76.05 ± 0.81 | 27.12 ± 1.53 |
SAFARI | 46.65 ± 2.47 | 77.11 ± 1.30 | 25.25 ± 1.84 | 45.49 ± 1.82 | 76.70 ± 0.95 | 30.69 ± 1.64 |
AdaCare | 47.19 ± 2.40 | 76.97 ± 1.10 | 24.36 ± 1.70 | 46.87 ± 1.28 | 76.07 ± 0.84 | 26.40 ± 1.63 |
GRASP | 48.36 ± 2.09 | 76.70 ± 0.93 | 18.29 ± 1.50 | 50.23 ± 1.50 | 78.47 ± 0.88 | 29.19 ± 1.37 |
| 50.24 ± 2.08 | 78.36 ± 1.16 | 25.12 ± 1.43 | 50.97 ± 1.31 | 78.46 ± 0.86 | 33.80 ± 1.77 |
Mamba | 45.98 ± 2.20 | 76.38 ± 1.06 | 24.50 ± 1.52 | 48.04 ± 1.42 | 76.87 ± 0.86 | 27.45 ± 1.63 |
| 47.13 ± 2.05 | 76.76 ± 0.92 | 47.17 ± 1.72 | 46.54 ± 1.61 | 76.98 ± 0.85 | 47.39 ± 1.41 |
| 47.06 ± 2.11 | 75.43 ± 0.99 | 46.42 ± 1.77 | 50.42 ± 1.61 | 76.85 ± 0.88 | 48.61 ± 1.50 |
| 51.01 ± 1.90 | 78.64 ± 0.88 | 50.44 ± 1.72 | 51.52 ± 1.61 | 78.41 ± 0.92 | 50.30 ± 1.40 |
Note: Bold values indicate the best performance in each column.
Appendix A. Online Resources
We have released the source code at
The MIMIC-III dataset is provided at
The MIMIC-IV dataset is provided at
The HM Hospitals COVID-19 Collaborator is provided at
1. Chen, J.; Zhang, A. Hgmf: Heterogeneous graph-based fusion for multimodal data with incompleteness. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Virtual Event, 6–10 July 2020; pp. 1295-1305.
2. Choi, E.; Bahadori, M.T.; Sun, J.; Kulas, J.; Schuetz, A.; Stewart, W. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. Proceedings of the Advances in Neural Information Processing Systems 29 (NIPS 2016); Barcelona, Spain, 5–10 December 2016; Volume 29.
3. Ma, L.; Zhang, C.; Gao, J.; Jiao, X.; Yu, Z.; Zhu, Y.; Wang, T.; Ma, X.; Wang, Y.; Tang, W.
4. Zhang, C.; Gao, X.; Ma, L.; Wang, Y.; Wang, J.; Tang, W. GRASP: Generic framework for health status representation learning based on incorporating knowledge from similar patients. Proceedings of the AAAI Conference on Artificial Intelligence; Virtual, 2–9 February 2021; Volume 35, pp. 715-723.
5. Harutyunyan, H.; Khachatrian, H.; Kale, D.C.; Galstyan, A. Multitask learning and benchmarking with clinical time series data. arXiv; 2017; arXiv: 1703.07771. [DOI: https://dx.doi.org/10.1038/s41597-019-0103-9] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31209213]
6. Ma, X.; Wang, Y.; Chu, X.; Ma, L.; Tang, W.; Zhao, J.; Yuan, Y.; Wang, G. Patient health representation learning via correlational sparse prior of medical features. IEEE Trans. Knowl. Data Eng.; 2022; 35, pp. 11769-11783. [DOI: https://dx.doi.org/10.1109/TKDE.2022.3230454]
7. Zhang, Z.; Zhang, Q.; Jiao, Y.; Lu, L.; Ma, L.; Liu, A.; Liu, X.; Zhao, J.; Xue, Y.; Wei, B.
8. Chowdhury, R.R.; Li, J.; Zhang, X.; Hong, D.; Gupta, R.K.; Shang, J. Primenet: Pre-training for irregular multivariate time series. Proceedings of the AAAI Conference on Artificial Intelligence; Washington DC, USA, 7–14 February 2023; Volume 37, pp. 7184-7192.
9. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv; 2023; arXiv: 2312.00752
10. Zhang, C.; Chu, X.; Ma, L.; Zhu, Y.; Wang, Y.; Wang, J.; Zhao, J. M3care: Learning with missing modalities in multimodal healthcare data. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; Washington, DC, USA, 14–18 August 2022; pp. 2418-2428.
11. Johnson, A.E.; Pollard, T.J.; Shen, L.; Li-wei, H.L.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L.A.; Mark, R.G. MIMIC-III, a freely accessible critical care database. Sci. Data; 2016; 3, 160035. [DOI: https://dx.doi.org/10.1038/sdata.2016.35] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27219127]
12. Johnson, A.E.; Bulgarelli, L.; Shen, L.; Gayles, A.; Shammout, A.; Horng, S.; Pollard, T.J.; Hao, S.; Moody, B.; Gow, B.
13. Van Buuren, S.; Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. J. Stat. Softw.; 2011; 45, pp. 1-67. [DOI: https://dx.doi.org/10.18637/jss.v045.i03]
14. Choi, E.; Xiao, C.; Stewart, W.; Sun, J. Mime: Multilevel medical embedding of electronic health records for predictive healthcare. Proceedings of the Advances in Neural Information Processing Systems; Montreal, QC, Canada, 3–8 December 2018; pp. 4547-4557.
15. Ma, L.; Zhang, C.; Wang, Y.; Ruan, W.; Wang, J.; Tang, W.; Ma, X.; Gao, X.; Gao, J. Concare: Personalized clinical feature embedding via capturing the healthcare context. Proceedings of the AAAI conference on artificial intelligence; New York, NY, USA, 7–12 February 2020; Volume 34, pp. 833-840.
16. Ma, L.; Gao, J.; Wang, Y.; Zhang, C.; Wang, J.; Ruan, W.; Tang, W.; Gao, X.; Ma, X. Adacare: Explainable clinical health status representation learning via scale-adaptive feature extraction and recalibration. Proceedings of the AAAI Conference on Artificial Intelligence; New York, NY, USA, 7–12 February 2020; Volume 34, pp. 825-832.
17. Che, Z.; Purushotham, S.; Cho, K.; Sontag, D.; Liu, Y. Recurrent neural networks for multivariate time series with missing values. Sci. Rep.; 2018; 8, 6085. [DOI: https://dx.doi.org/10.1038/s41598-018-24271-9] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29666385]
18. Peters, M.E.; Ammar, W.; Bhagavatula, C.; Power, R. Semi-supervised sequence tagging with bidirectional language models. arXiv; 2017; arXiv: 1705.00108
19. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv; 2014; arXiv: 1412.3555
20. Cogswell, M.; Ahmed, F.; Girshick, R.; Zitnick, L.; Batra, D. Reducing overfitting in deep networks by decorrelating representations. arXiv; 2015; arXiv: 1511.06068
21. Shen, Y.; Tan, S.; Sordoni, A.; Courville, A. Ordered neurons: Integrating tree structures into recurrent neural networks. arXiv; 2018; arXiv: 1810.09536
22. HM Hospitales. Covid Data Save Lives. 2020; Available online: https://www.hmhospitales.com/prensa/notas-de-prensa/comunicado-covid-data-save-lives (accessed on 5 June 2024).
23. Keilwagen, J.; Grosse, I.; Grau, J. Area under precision-recall curves for weighted and unweighted data. PLoS ONE; 2014; 9, e92209. [DOI: https://dx.doi.org/10.1371/journal.pone.0092209] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24651729]
24. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.
25. Falcon, W.A. PyTorch Lightning; GitHub: San Francisco, CA, USA, 2019.
26. Gao, J.; Zhu, Y.; Wang, W.; Wang, Z.; Dong, G.; Tang, W.; Wang, H.; Wang, Y.; Harrison, E.M.; Ma, L. A comprehensive benchmark for COVID-19 predictive modeling using electronic health records in intensive care. Patterns; 2024; 5, 100951. [DOI: https://dx.doi.org/10.1016/j.patter.2024.100951] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38645764]
27. Zhu, Y.; Wang, W.; Gao, J.; Ma, L. PyEHR: A Predictive Modeling Toolkit for Electronic Health Records. 2023; Available online: https://github.com/yhzhu99/pyehr (accessed on 22 November 2024).
28. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv; 2014; arXiv: 1412.6980
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Electronic Health Records (EHRs) offer valuable insights for healthcare prediction. Existing methods approach EHR analysis through direct imputation techniques in data space or representation learning in feature space. However, these approaches face the following two critical limitations: first, they struggle to model long-term clinical pathways due to their focus on isolated time points rather than continuous health trajectories; second, they lack mechanisms to effectively distinguish between clinically relevant and redundant features when observations are irregular. To address these challenges, we introduce
1 Peking University Third Hospital, Beijing 100191, China; [email protected], National Engineering Research Center for Software Engineering, Peking University, Beijing 100871, China; [email protected] (L.G.); [email protected] (C.Z.); [email protected] (K.Y.); [email protected] (X.L.); [email protected] (L.M.), Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing 100871, China
2 National Engineering Research Center for Software Engineering, Peking University, Beijing 100871, China; [email protected] (L.G.); [email protected] (C.Z.); [email protected] (K.Y.); [email protected] (X.L.); [email protected] (L.M.), Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing 100871, China
3 Affiliated Xuzhou Municipal Hospital of Xuzhou Medical University, Xuzhou 221002, China
4 Peking University Third Hospital, Beijing 100191, China; [email protected]