
Abstract

The advent of transformer-based models has revolutionized natural language processing, bringing remarkable improvements in tasks like automatic speech recognition (ASR). Inspired by these advancements, this thesis explores the optimization of a transformer-based ASR model to improve transcription accuracy in educational settings, particularly for lecture content. The goal of this research is to provide real-time, high-accuracy captions that enhance accessibility for all students, while offering a cost-effective solution for educators.

To assess the potential of domain-specific fine-tuning, Whisper-small underwent two phases of fine-tuning. In the first phase, it was fine-tuned on two carefully selected, publicly available datasets: SpeechColab’s GigaSpeech-XS and the AMI Meeting Corpus. In the second phase, the fine-tuned model was further optimized on a self-curated dataset of roughly 10 hours of live lecture recordings that I collected and assembled. Finally, a real-time captioning assistant application was developed that uses the fine-tuned model to transcribe speech in real time with live editing capabilities.

The optimized Whisper-small model was evaluated against Whisper’s pretrained small, medium, and large (version 2) counterparts. The evaluation was performed on a clean, unseen dataset that I prepared. The fine-tuned model achieved a lower word error rate (WER) of 4.53%, compared to 5.51% and 5.78% for Whisper-Medium and Whisper-Large-V2, respectively. These results demonstrate that fine-tuning a transformer-based ASR model on domain-specific data can significantly enhance its performance in a targeted context, such as live lecture transcription.
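
For readers unfamiliar with the metric, the sketch below shows the standard definition of WER: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. It is a minimal illustration only, not the evaluation code used in the thesis. The function names `word_error_rate` and `relative_reduction` are hypothetical, and the sketch assumes whitespace tokenization with no text normalization, which the abstract does not specify.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are split on whitespace and compared exactly;
    real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
    # WER figures taken from the abstract above.
    print(relative_reduction(5.51, 4.53))  # versus Whisper-Medium
    print(relative_reduction(5.78, 4.53))  # versus Whisper-Large-V2
```

Applied to the figures reported above, the relative reduction in WER works out to about 17.8% against Whisper-Medium and about 21.6% against Whisper-Large-V2.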

The findings of this experiment highlight the promise of transformer-based models for improving educational accessibility. By also building an application tailored to live lecture settings, this research contributes to the development of adaptable, low-cost technologies that support inclusive learning environments. The success of this experiment lays the groundwork for future advances in speech recognition that aim to make education more accessible for everyone.

Details

1010268
Business indexing term
Title
Domain-Specific Customization for Improving Speech to Text
Number of pages
101
Publication year
2025
Degree date
2025
School code
0465
Source
MAI 86/10(E), Masters Abstracts International
ISBN
9798310380806
Committee member
Geigel, Joe; Orfan, Jansen
University/institution
Rochester Institute of Technology
Department
Computer Science
University location
United States -- New York
Degree
M.S.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
31934961
ProQuest document ID
3190171649
Document URL
https://www.proquest.com/dissertations-theses/domain-specific-customization-improving-speech/docview/3190171649/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic