Artificial intelligence (AI) has been integrated into modern clinical decision support systems and has demonstrated expert-level performance across various domains, assisting with tasks such as disease diagnosis and prognosis, drug discovery, and personalized treatment. As such applications have proliferated over the past decades, however, the effectiveness of AI tools has remained contingent on the quantity and quality of the training data fed into the models. In contrast to general domains such as natural image recognition and object detection, where large, well-curated datasets like ImageNet and COCO are available, the healthcare domain is plagued by a mixture of data issues, including data scarcity, imbalance, and lack of diversity. Blindly applying ML tools developed for general domains to healthcare without accounting for these limitations can lead to serious consequences, such as misdiagnosis, delayed treatment, or even patient harm.
This thesis presents a set of contributions that address these data challenges and support more stable and safer AI deployment, focusing on four directions (each illustrated with a brief sketch below):

- Tackling Training on Small Data. Given the limited availability of annotated medical data and the data-intensive nature of AI models, we develop a transfer learning technique that mitigates the domain discrepancy between natural and medical images, thereby enabling more effective learning from small medical datasets.

- Enabling Learning from Distributed Private Data. Observing that healthcare data are naturally distributed, we investigate federated learning on medical images and clinical texts and confirm its effectiveness in learning from distributed data holders without data sharing.

- Improving Learning from Biased Data. Acknowledging that real-world medical data often exhibit various forms of bias, we propose a novel, exact continuous reformulation for direct metric optimization, which offers more precise control over target metrics and facilitates learning toward unbiased metrics.

- Safeguarding AI Model Predictions. Recognizing that AI is not flawless, we permit prediction slackness by allowing prediction abstention (i.e., rejection) based on a designed confidence score under real-world perturbations.

Altogether, these contributions seek to advance AI integration in healthcare by ensuring that the deployed models are both safe and reliable.
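To make the first direction concrete, the following is a minimal sketch of transfer learning from natural to medical images: fine-tuning an ImageNet-pretrained backbone on a small medical dataset. It is not the thesis's specific technique; the two-class head and the learning rates are illustrative assumptions.

```python
# Illustrative sketch only: fine-tune an ImageNet-pretrained ResNet-18
# on a small medical dataset (assumed binary diagnosis task).
import torch
import torch.nn as nn
from torchvision import models

# Start from ImageNet weights to compensate for scarce medical labels.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Replace the 1000-way ImageNet head with a task-specific head.
model.fc = nn.Linear(model.fc.in_features, 2)

# A common small-data recipe: smaller learning rate for the pretrained
# backbone than for the freshly initialized head.
optimizer = torch.optim.Adam([
    {"params": [p for n, p in model.named_parameters() if not n.startswith("fc")],
     "lr": 1e-5},
    {"params": model.fc.parameters(), "lr": 1e-3},
])
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```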
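For the second direction, the sketch below shows generic federated averaging (FedAvg, McMahan et al., 2017): each client trains on its own private data, and only model weights are shared with the server. The thesis's federated setup may differ; `client.loader` and `client.num_examples` are placeholder attributes.

```python
# Minimal FedAvg sketch: learning from distributed data holders
# without sharing raw data. Client objects are hypothetical.
import copy
import torch
import torch.nn.functional as F

def federated_round(global_model, clients, local_steps=1):
    """One communication round: local training on each client,
    then a weighted average of client weights on the server."""
    client_states, client_sizes = [], []
    for client in clients:
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=0.01)
        for _ in range(local_steps):
            for x, y in client.loader:  # raw data never leaves the client
                opt.zero_grad()
                F.cross_entropy(local(x), y).backward()
                opt.step()
        client_states.append(local.state_dict())
        client_sizes.append(client.num_examples)

    # Server: average parameters, weighted by local dataset size.
    total = sum(client_sizes)
    avg = {k: sum(s[k].float() * (n / total)
                  for s, n in zip(client_states, client_sizes))
           for k in client_states[0]}
    global_model.load_state_dict(avg)
    return global_model
```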
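The thesis's exact continuous reformulation for direct metric optimization is not reproduced here. As a simpler stand-in for the third direction, the sketch below conveys the general idea of optimizing a target metric directly: replace hard 0/1 predictions with probabilities so that a discrete metric (here F1) becomes a differentiable surrogate.

```python
# Illustrative "soft F1" surrogate, not the thesis's exact reformulation:
# a continuous relaxation of F1 that gradient descent can optimize.
import torch

def soft_f1_loss(probs, targets, eps=1e-8):
    """probs: predicted positive-class probabilities in [0, 1];
    targets: binary ground-truth labels (0 or 1)."""
    tp = (probs * targets).sum()          # soft true positives
    fp = (probs * (1 - targets)).sum()    # soft false positives
    fn = ((1 - probs) * targets).sum()    # soft false negatives
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1 - soft_f1  # minimizing the loss maximizes soft F1
```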
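Finally, for the fourth direction, here is a generic selective-prediction sketch: the model abstains whenever its confidence falls below a threshold. The thesis designs its own confidence score; maximum softmax probability and the 0.9 threshold below are stand-in assumptions.

```python
# Prediction with abstention (rejection): low-confidence inputs are
# deferred rather than risked. Confidence score here is a placeholder.
import torch
import torch.nn.functional as F

ABSTAIN = -1  # sentinel label meaning "defer to a human expert"

def predict_with_rejection(model, x, threshold=0.9):
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1)
    confidence, labels = probs.max(dim=-1)
    # Abstain on predictions whose confidence is under the threshold.
    labels[confidence < threshold] = ABSTAIN
    return labels, confidence
```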