Abstract
This dissertation consists of three research topics.
In the first part, we present deep fiducial inference and an approximate fiducial computation (AFC) algorithm. Since the mid-2000s, there has been a resurgence of interest in modern modifications of fiducial inference. To date, the main computational tool for extracting a generalized fiducial distribution has been Markov chain Monte Carlo (MCMC). We propose an alternative way of computing a generalized fiducial distribution that can be used in complex situations. In particular, to overcome the difficulty that arises when the unnormalized fiducial density (needed for MCMC) is intractable, we design a fiducial autoencoder (FAE). The fitted FAE is used to generate generalized fiducial samples of the unknown parameters. To increase accuracy, we then apply the AFC algorithm, which rejects samples that do not replicate the observed data well enough when plugged into the decoder. Our numerical experiments show the effectiveness of our FAE-based inverse solution and the excellent coverage performance of the AFC-corrected FAE solution.
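The rejection step described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function name `afc_filter`, the toy decoder, the Euclidean distance, and the tolerance value are all assumptions chosen for illustration. In practice the decoder would be the fitted FAE decoder.

```python
import math

def afc_filter(candidates, decoder, observed, tolerance):
    """Keep candidate parameter values whose decoded data fall within
    `tolerance` (Euclidean distance, an assumed choice) of the observed data."""
    kept = []
    for theta in candidates:
        replicated = decoder(theta)  # data implied by this candidate
        if math.dist(replicated, observed) <= tolerance:
            kept.append(theta)
    return kept

def toy_decoder(theta):
    # Hypothetical stand-in for a fitted decoder: maps a scalar
    # parameter to a two-dimensional "data" vector.
    return [theta, theta]

observed = [1.0, 1.0]
candidates = [0.0, 0.9, 1.0, 2.0]  # e.g. draws produced by the fitted FAE
accepted = afc_filter(candidates, toy_decoder, observed, tolerance=0.5)
print(accepted)
```

Here only the candidates 0.9 and 1.0 reproduce the observed data closely enough to be retained. A smaller tolerance yields a closer match to the observed data at the cost of rejecting more samples.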
In the second part, we present SMNN, a supervised mutual nearest neighbor method for batch effect correction in single-cell RNA-sequencing (scRNA-seq) data. Batch effect correction is recognized as indispensable when integrating scRNA-seq data from multiple batches. State-of-the-art methods ignore single-cell cluster label information, but such information can improve the effectiveness of batch effect correction, particularly under realistic scenarios where biological differences are not orthogonal to batch effects. To address this issue, SMNN corrects batch effects in scRNA-seq data via supervised mutual nearest neighbor detection. Our extensive evaluations on simulated and real datasets show that SMNN provides improved merging within the corresponding cell types across batches, leading to reduced differentiation across batches compared with alternative methods including MNN, Seurat v3, and LIGER. Furthermore, SMNN retains more cell-type-specific features, as partially manifested by the greater biological relevance of differentially expressed genes identified between cell types after SMNN correction, with precision improving by up to 841.0%.
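To make the idea of supervised mutual nearest neighbor detection concrete, the following sketch restricts the neighbor search to cells that share a cluster label. It is one plausible reading of the description above, not the SMNN package's code: the function names, the toy coordinates, and the choice of `k` are assumptions.

```python
import math

def k_nearest(point, others, k):
    """Positions of the k points in `others` closest to `point`."""
    order = sorted(range(len(others)), key=lambda j: math.dist(point, others[j]))
    return set(order[:k])

def supervised_mnn(batch1, labels1, batch2, labels2, k=1):
    """Return (i, j) index pairs of mutual nearest neighbors across two
    batches, searching only among cells with the same cell-type label."""
    pairs = []
    for label in sorted(set(labels1) & set(labels2)):
        idx1 = [i for i, lab in enumerate(labels1) if lab == label]
        idx2 = [j for j, lab in enumerate(labels2) if lab == label]
        cells1 = [batch1[i] for i in idx1]
        cells2 = [batch2[j] for j in idx2]
        for a, i in enumerate(idx1):
            for b in k_nearest(cells1[a], cells2, k):
                # Mutual: cell a must also be among cell b's nearest neighbors.
                if a in k_nearest(cells2[b], cells1, k):
                    pairs.append((i, idx2[b]))
    return sorted(pairs)

# Toy example (hypothetical): a batch shift places a type-B cell of
# batch 2 right next to a type-A cell of batch 1.
batch1, labels1 = [[0.0, 0.0], [1.0, 0.0]], ["A", "B"]
batch2, labels2 = [[0.2, 0.0], [3.0, 0.0]], ["B", "A"]
pairs = supervised_mnn(batch1, labels1, batch2, labels2, k=1)
print(pairs)
```

In this toy case an unsupervised search would pair cell 0 of batch 1 (type A) with cell 0 of batch 2 (type B) because they are closest in expression space, whereas the label-restricted search pairs each cell with its same-type counterpart.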
In the third part, we present an ensemble imputation framework for DNA methylation across different platforms. DNA methylation at CpG dinucleotides is a biological process by which methyl groups are added to the DNA molecule, and it is one of the most extensively studied epigenetic marks. With technological advancements, geneticists can profile DNA methylation with multiple reliable approaches. However, profiling platforms can differ substantially in the density of the CpGs they assess and in their measurements, which hinders joint analysis across platforms. In this project, we focus on the two most commonly used commercial methylation platforms from Illumina, aiming to impute from the HumanMethylation450 (HM450) BeadChip to the ~850K CpG sites on the HumanMethylationEPIC (HM850) BeadChip. We present CUE, the CpG imputation Ensemble, which combines multiple classical statistical and modern machine learning methods. Our results highlight CUE as a valuable tool for imputing from HM450 to HM850.