Standard data selection relies on opaque metrics such as loss, offering little insight into the specific knowledge a model acquires. This thesis proposes interpretable concept-based data selection, a framework that treats high-level semantic concepts as the primary units of curation. We first demonstrate that characterizing the conceptual composition of data is essential for robust generalization: selection that fails to explicitly capture the concepts present in the data introduces unintended artifacts into the curated set. To operationalize this, we introduce a gradient-based methodology that quantifies the influence of specific concepts at the instance level. Finally, we apply this framework to Continual Learning, showing that selecting rehearsal data based on "threatened" concepts significantly mitigates catastrophic forgetting relative to random baselines. This work establishes interpretability not merely as an analytic lens, but as a rigorous, actionable tool for efficient model training.
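
To make the gradient-based, instance-level influence idea concrete, the following is a minimal sketch, not the thesis's actual estimator. It assumes a concept is represented by a small probe set of exemplar inputs, and scores a training example by the cosine alignment between its loss gradient and the probe set's mean gradient; the names `flat_grad` and `concept_influence` and the probe-set construction are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flat_grad(model: nn.Module, loss: torch.Tensor) -> torch.Tensor:
    """Flatten the gradient of a scalar loss w.r.t. all model parameters."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def concept_influence(model, loss_fn, example, concept_probe_set):
    """Cosine alignment between one example's loss gradient and the mean
    gradient over a probe set exemplifying the concept (an assumed proxy:
    higher alignment = the example reinforces the concept)."""
    x, y = example
    g_example = flat_grad(model, loss_fn(model(x), y))
    probe_grads = [flat_grad(model, loss_fn(model(xp), yp))
                   for xp, yp in concept_probe_set]
    g_concept = torch.stack(probe_grads).mean(dim=0)
    return F.cosine_similarity(g_example, g_concept, dim=0)

# Toy usage with an illustrative linear model.
model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
example = (torch.randn(1, 10), torch.tensor([0]))
probe = [(torch.randn(1, 10), torch.tensor([0])) for _ in range(8)]
score = concept_influence(model, loss_fn, example, probe)
```

A gradient dot-product of this kind is one common way to read "influence" at the instance level; the thesis may use a different estimator or aggregation.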
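The "threatened"-concept rehearsal selection can likewise be sketched under a loud assumption: that each stored example carries a concept label and each concept a scalar threat score (e.g., its accuracy drop on held-out probes after training on the new task). The greedy top-k rule below is one plausible instantiation, not necessarily the thesis's selection rule.

```python
def select_rehearsal(examples, concept_of, threat, budget):
    """Fill a rehearsal buffer of size `budget`, preferring examples whose
    concepts are most threatened by the incoming task.

    examples:   list of example ids
    concept_of: mapping id -> concept label (assumed available)
    threat:     mapping concept -> float threat score (assumed available)
    """
    ranked = sorted(examples,
                    key=lambda ex: threat.get(concept_of[ex], 0.0),
                    reverse=True)
    return ranked[:budget]

# Illustrative call with made-up concepts and threat scores.
examples = ["a", "b", "c", "d"]
concept_of = {"a": "stripes", "b": "wheels", "c": "stripes", "d": "fur"}
threat = {"stripes": 0.9, "wheels": 0.1, "fur": 0.4}
print(select_rehearsal(examples, concept_of, threat, budget=2))  # ['a', 'c']
```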