Abstract
Information retrieval (IR) is essential for large-scale natural language processing (NLP) tasks such as open-domain question answering and automatic fact-checking. Traditional IR models, such as TF-IDF and BM25, rely on lexical matching and often fail to capture semantic meaning, limiting their effectiveness. In contrast, recent machine learning-based IR methods use pretrained language models to encode the semantic meaning of both queries and documents, yielding more accurate ranking of relevant results even when a query and a document share few exact terms.
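To make this contrast concrete, the following is a minimal sketch of the dual-encoder dense-retrieval pattern described above, using an off-the-shelf sentence encoder from the sentence-transformers library; the model name and example texts are illustrative assumptions, not components of the systems presented in this dissertation.

```python
# Minimal sketch of dual-encoder dense retrieval (illustrative only).
# Assumes the sentence-transformers package and a generic pretrained encoder;
# neither is part of the systems described in this dissertation.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf model

documents = [
    "The Eiffel Tower is located in Paris, France.",
    "BM25 ranks documents by weighted term overlap with the query.",
    "Dense retrievers embed queries and documents into a shared vector space.",
]
query = "Which city is the Eiffel Tower in?"

# Encode query and documents into dense vectors, then rank by cosine similarity.
doc_vecs = encoder.encode(documents, normalize_embeddings=True)
query_vec = encoder.encode([query], normalize_embeddings=True)[0]
scores = doc_vecs @ query_vec  # cosine similarity, since vectors are normalized

for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```

Unlike a lexical matcher, which would score the first document low because the query shares almost no terms with it, the dense encoder places semantically related texts near each other in vector space, so the relevant document ranks first.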
This dissertation presents several contributions to advancing machine learning-based IR for improving large-scale NLP tasks. First, we introduce MythQA, a multi-answer open-domain question answering system for detecting check-worthy claims directly from large-scale information sources such as Twitter. To support this study, we construct TweetMythQA, a benchmark with tailored evaluation metrics and 5.3K annotated tweets, each labeled as "Supporting", "Refuting", or "Neutral".
Further, we propose M3, a multi-hop dense retrieval system that integrates contrastive learning and multi-task learning to enhance text retrieval. By addressing limitations of contrastive learning in dense retrieval, M3 achieves state-of-the-art performance on FEVER, an open-domain fact verification benchmark.
Lastly, we explore multi-modal retrieval-augmented question answering (MRAQA) by developing RAMQA, a framework that combines learning-to-rank methods with generative large language models (LLMs) for multi-modal question answering. Using LLaVA and LLaMA models, RAMQA outperforms strong baselines on the WebQA and MultiModalQA benchmarks, demonstrating its effectiveness in multi-modal retrieval tasks.
In summary, this dissertation introduces novel benchmarks, advanced retrieval models, and robust frameworks to improve large-scale NLP tasks, enhancing open-domain claim detection, automatic fact verification, and retrieval-augmented multi-modal question answering.





