Abstract

Information retrieval (IR) is essential for large-scale natural language processing (NLP) tasks such as open-domain question answering and automatic fact-checking. Traditional IR models, such as TF-IDF and BM25, rely on lexical matching and often fail to capture semantic meaning, which limits their effectiveness when a query and a relevant document express the same idea in different words. In contrast, recent machine learning-based IR methods use pretrained language models to encode the semantic meaning of both queries and documents, which enables better ranking of relevant results.
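The limitation of lexical matching noted above can be illustrated with a minimal sketch of Okapi BM25 scoring. This is not code from the dissertation: the function name `bm25_scores`, the whitespace tokenization, the two toy documents, and the parameter values `k1=1.5` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic Okapi BM25.

    Illustrative only: whitespace tokenization, no stemming or stop words.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no exact word overlap, so no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus: both documents describe the same event.
docs = [
    "the physician prescribed medication for the patient",
    "the doctor gave the patient some medicine",
]
scores = bm25_scores("doctor medicine", docs)
```

In this sketch the first document receives a score of zero because it shares no exact terms with the query, even though it is a paraphrase of the second. A dense retriever that compares learned embeddings of the query and documents is intended to close this kind of gap.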

This dissertation presents several contributions to advancing machine learning-based IR for improving large-scale NLP tasks. First, we introduce MythQA, a multi-answer open-domain question answering system for detecting check-worthy claims directly from large-scale information sources such as Twitter. To support this study, we construct TweetMythQA, a benchmark with specific evaluation metrics and 5.3K annotated tweets, each classified as "Supporting", "Refuting", or "Neutral".

Further, we propose M3, a multi-hop dense retrieval system that integrates contrastive and multi-task learning to enhance text retrieval. M3 demonstrates state-of-the-art performance on FEVER, an open-domain fact verification benchmark, and addresses limitations of contrastive learning in dense retrieval.

Lastly, we explore multi-modal retrieval-augmented question answering (MRAQA) by developing RAMQA, a framework that combines learning-to-rank methods with generative LLMs for multi-modal question answering. Using LLaVA and LLaMA models, RAMQA outperforms strong baselines on the WebQA and MultiModalQA benchmarks, highlighting its effectiveness in multi-modal retrieval tasks.

In summary, this dissertation introduces novel benchmarks, advanced retrieval models, and robust frameworks to improve large-scale NLP tasks, enhancing open-domain claim detection, automatic fact verification, and retrieval-augmented multi-modal question answering.

Details

Title: Machine Learning-Based Information Retrieval for Large-Scale Natural Language Processing
Author: Bai, Yang
Publication year: 2024
Publisher: ProQuest Dissertations & Theses
ISBN: 9798293894192
Source type: Dissertation or Thesis
Language of publication: English
ProQuest document ID: 3256561392
Copyright: Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.