Abstract
Information retrieval (IR) is essential for large-scale natural language processing (NLP) tasks such as open-domain question answering and automatic fact-checking. Traditional IR models, such as TF-IDF and BM25, rely on lexical matching and often fail to capture semantic meaning, limiting their effectiveness. In contrast, recent machine learning-based IR methods use pretrained language models to encode the semantic meaning of both queries and documents, yielding more accurate ranking of relevant results even when a query and a document share few exact terms.
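To make this contrast concrete, the following is a minimal sketch of the dual-encoder dense-retrieval pattern described above, using an off-the-shelf sentence encoder from the sentence-transformers library; the model name and example texts are illustrative assumptions, not components of the systems presented in this dissertation.

```python
# Minimal sketch of dual-encoder dense retrieval (illustrative only).
# Assumes the sentence-transformers package and a generic pretrained encoder;
# neither is part of the systems described in this dissertation.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf model

documents = [
    "The Eiffel Tower is located in Paris, France.",
    "BM25 ranks documents by weighted term overlap with the query.",
    "Dense retrievers embed queries and documents into a shared vector space.",
]
query = "Which city is the Eiffel Tower in?"

# Encode query and documents into dense vectors, then rank by cosine similarity.
doc_vecs = encoder.encode(documents, normalize_embeddings=True)
query_vec = encoder.encode([query], normalize_embeddings=True)[0]
scores = doc_vecs @ query_vec  # cosine similarity, since vectors are normalized

for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```

Unlike a lexical matcher, which would score the first document low because the query shares almost no terms with it, the dense encoder places semantically related texts near each other in vector space, so the relevant document ranks first.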
This dissertation presents several contributions to advancing machine learning-based IR for improving large-scale NLP tasks. First, we introduce MythQA, a multi-answer open-domain question answering system for detecting check-worthy claims directly from large-scale information sources such as Twitter. To support this study, we construct TweetMythQA, a benchmark with tailored evaluation metrics and 5.3K annotated tweets, each labeled as "Supporting", "Refuting", or "Neutral".
Further, we propose M3, a multi-hop dense retrieval system that integrates contrastive learning and multi-task learning to enhance text retrieval. By addressing limitations of contrastive learning in dense retrieval, M3 achieves state-of-the-art performance on FEVER, an open-domain fact verification benchmark.
Lastly, we explore multi-modal retrieval-augmented question answering (MRAQA) by developing RAMQA, a framework that combines learning-to-rank methods with generative large language models (LLMs) for multi-modal question answering. Using LLaVA and LLaMA models, RAMQA outperforms strong baselines on the WebQA and MultiModalQA benchmarks, demonstrating its effectiveness in multi-modal retrieval tasks.
In summary, this dissertation introduces novel benchmarks, advanced retrieval models, and robust frameworks to improve large-scale NLP tasks, enhancing open-domain claim detection, automatic fact verification, and retrieval-augmented multi-modal question answering.





