Machine Learning-Based Information Retrieval for Large-Scale Natural Language Processing

Bai, Yang. University of Florida ProQuest Dissertations & Theses, 2024. 31633889.

Abstract (summary)

Information retrieval (IR) is essential for large-scale natural language processing (NLP) tasks such as open-domain question answering and automatic fact-checking. Traditional IR models, such as TF-IDF and BM25, rely on lexical matching and often fail to capture semantic meaning, which limits their effectiveness. In contrast, recent machine learning-based IR methods use pretrained language models to encode the semantic meaning of both queries and documents. Because they match on meaning rather than on surface terms, these models rank relevant results significantly better.
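The lexical-matching limitation described above can be illustrated with a minimal sketch. It is not taken from the dissertation: the function name `bm25_scores`, the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the two toy documents are all assumptions chosen for illustration, and the scoring follows one common variant of the Okapi BM25 formula.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with an Okapi BM25 variant.

    Hypothetical helper for illustration; tokens are lowercase whitespace splits.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no exact token overlap, so no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up): the first document answers the query in different words.
docs = [
    "the physician prescribed medication for the patient",
    "the doctor went on holiday last week",
]
scores = bm25_scores("doctor gave medicine", docs)
print(scores)
```

In this toy case the first document, which paraphrases the query, shares no tokens with it and scores zero, while the unrelated second document scores above zero because it contains the word "doctor". A dense retriever that compares learned embeddings is intended to close exactly this gap.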

This dissertation presents several contributions to advancing machine learning-based IR for improving large-scale NLP tasks. First, we introduce MythQA, a multi-answer open-domain question answering system for detecting check-worthy claims directly from large-scale information sources like Twitter. To support this study, we construct TweetMythQA, a benchmark with specific evaluation metrics and 5.3K annotated tweets, classified as "Supporting", "Refuting", or "Neutral".

Further, we propose M3, a multi-hop dense retrieval system that integrates contrastive and multi-task learning to enhance text retrieval. M3 achieves state-of-the-art performance on FEVER, an open-domain fact verification benchmark, and addresses limitations of contrastive learning in dense retrieval.

Lastly, we explore multi-modal retrieval-augmented question answering (MRAQA) by developing RAMQA, a framework that combines learning-to-rank methods with generative LLMs for multi-modal question answering. Using LLaVA and LLaMA models, RAMQA outperforms strong baselines on the WebQA and MultiModalQA benchmarks, highlighting its effectiveness in multi-modal retrieval tasks.

In summary, this dissertation introduces novel benchmarks, advanced retrieval models, and robust frameworks to improve large-scale NLP tasks, enhancing open-domain claim detection, automatic fact verification, and retrieval-augmented multi-modal question answering.

Indexing (details)


Subject
Computer science; Artificial intelligence; Information science
Classification
0984: Computer science
0723: Information science
0800: Artificial intelligence
Identifier / keyword
Fact checking; Machine learning; Multi-modal language model; Natural language processing
Title
Machine Learning-Based Information Retrieval for Large-Scale Natural Language Processing
Author
Bai, Yang
Number of pages
106
Publication year
2024
Degree date
2024
School code
0070
Source
DAI-A 87/3(E), Dissertation Abstracts International
ISBN
9798293894192
Advisor
Wang, Zhe; Grant, Christan
Committee member
Chen, Shigang; Bindschaedler, Vincent; Shin, Jieun
University/institution
University of Florida
Department
Computer and Information Science and Engineering
University location
United States -- Florida
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
31633889
ProQuest document ID
3256561392
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Document URL
https://www.proquest.com/docview/3256561392