replying to C. Dens et al. Nature Machine Intelligence https://doi.org/10.1038/s42256-023-00727-0 (2023)
In their Matters Arising, Dens et al. highlighted the negative data sampling issue in T-cell epitope specificity prediction1. Negative sampling is of general importance in biological data modelling, since biological experiments tend to record positive results and ignore negative ones. We appreciate the efforts of Meysman and colleagues to raise this point for T-cell epitope specificity modelling, as the choice of negative data sampling strategy is known to influence prediction results2,3. A negative data sampling strategy should therefore be selected carefully, a point we noted and emphasized in our original work on PanPep4. There are two commonly used negative sampling strategies: reshuffling positive pairs (the first strategy) and randomly drawing from background repertoires (the second strategy). We selected the second for PanPep, as explained in the original paper4. In reply to Dens et al.1, we would like to further clarify and comprehensively discuss our negative sampling strategy in peptide–TCR binding prediction from the perspective of positive and unlabelled data learning (PU learning), using causal language. We advocate more attention to this issue in peptide–TCR binding prediction.
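The two strategies named above can be sketched as follows. This is a minimal illustration only, not the PanPep implementation: the function names, the pair representation and the placeholder identifiers (`PEP_A`, `TCR_1`, `BG_TCR_1` and so on) are all assumptions introduced here. In both cases the sampled pairs are presumed rather than experimentally verified non-binders.

```python
import random

def shuffled_negatives(positive_pairs, n, seed=0):
    """Strategy 1 (sketch): re-pair peptides with TCRs taken from other positive pairs."""
    rng = random.Random(seed)
    known = set(positive_pairs)
    peptides = sorted({p for p, _ in positive_pairs})
    tcrs = sorted({t for _, t in positive_pairs})
    # Every mismatched peptide-TCR combination that is not a known positive.
    candidates = [(p, t) for p in peptides for t in tcrs if (p, t) not in known]
    return rng.sample(candidates, min(n, len(candidates)))

def background_negatives(positive_pairs, background_tcrs, n, seed=0):
    """Strategy 2 (sketch): pair peptides with TCRs drawn from a background repertoire."""
    rng = random.Random(seed)
    known = set(positive_pairs)
    peptides = sorted({p for p, _ in positive_pairs})
    pool = sorted(set(background_tcrs))
    candidates = [(p, t) for p in peptides for t in pool if (p, t) not in known]
    return rng.sample(candidates, min(n, len(candidates)))

# Hypothetical placeholder identifiers, not real sequences.
positives = [("PEP_A", "TCR_1"), ("PEP_A", "TCR_2"), ("PEP_B", "TCR_3")]
background = ["BG_TCR_1", "BG_TCR_2", "BG_TCR_3", "BG_TCR_4"]

neg_shuffled = shuffled_negatives(positives, n=3)
neg_background = background_negatives(positives, background, n=3)
```

Under the first strategy every negative TCR also appears in some positive pair, whereas under the second the negative TCRs come from a different source than the positive ones.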
Confounding factors analysis in data construction from a causal perspective
We acknowledge the concern raised by Dens et al. that using background repertoires as the negative sampling dataset may inadvertently introduce a confounding variable stemming from the TCR sources. To elucidate this issue, we employed causal language and represented the scenario graphically as a three-element causal graph comprising TCR, binding label and TCR source5 (Fig. 1a). The concern of Dens et al. lies in the potential confounder between TCR and the binding label, represented by two additional arrows originating from the TCR source, one pointing towards TCR and the other towards the binding label. Their concern can thus be interpreted as potential ‘shortcut learning’ via a backdoor path opened by the source element. This implies that model learning might deviate and instead pick up invalid knowledge along the path from the TCR source to the binding label.
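The shortcut described above can be illustrated with a toy simulation. It is a sketch of the concern as stated, under strong assumptions, and makes no claim about how PanPep or any real dataset behaves: the `source_feature` function, the source names `"assay"` and `"background"`, and the Gaussian parameters are all hypothetical. Each TCR carries one feature that depends only on its source and has no relation to binding.

```python
import random

def source_feature(source, rng):
    """Hypothetical source-specific signal that carries no binding information."""
    mean = 1.0 if source == "background" else -1.0
    return rng.gauss(mean, 0.5)

def shortcut_accuracy(samples):
    """Accuracy of a 'shortcut' rule: predict non-binding (label 0) when the feature is > 0."""
    correct = sum((feature > 0) == (label == 0) for feature, label in samples)
    return correct / len(samples)

rng = random.Random(0)
n = 1000

# Background sampling: positives from the assay source, negatives from the
# background source, so the source determines both the feature and the label.
confounded = (
    [(source_feature("assay", rng), 1) for _ in range(n)]
    + [(source_feature("background", rng), 0) for _ in range(n)]
)

# Reshuffling: positives and negatives share the same TCR source.
unconfounded = (
    [(source_feature("assay", rng), 1) for _ in range(n)]
    + [(source_feature("assay", rng), 0) for _ in range(n)]
)

acc_confounded = shortcut_accuracy(confounded)
acc_unconfounded = shortcut_accuracy(unconfounded)
print(f"source-only rule, background negatives: {acc_confounded:.2f}")
print(f"source-only rule, reshuffled negatives: {acc_unconfounded:.2f}")
```

In this toy setting the source-only rule scores well above chance when the label and the source coincide, and close to chance when both classes share a source. This is the backdoor path from TCR source to binding label in miniature.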
Fig. 1 [Images not available. See PDF.]
The confounder illustration from a causal...