Abstract

This dissertation addresses the challenge of learning latent structures in audio from multimodal data, enabling applications across speech and audio recognition and generation while simplifying pipelines and advancing state-of-the-art performance.

The first half of the dissertation explores visually grounded speech (VGS), a learning setup where speech is paired with a semantically related image, similar to human language acquisition. We first introduce FaST-VGS, an end-to-end model for speech-image retrieval that matches the accuracy of a cascaded approach reliant on automatic speech recognition (ASR). By integrating unimodal and multimodal self-supervised learning, we then propose FaST-VGS+, which learns semantically rich speech representations, achieving state-of-the-art results on downstream semantic tasks. Finally, we propose and analyze a visually grounded self-supervised model, VG-HuBERT, for unsupervised discovery of word- and syllable-level units.

The second half of the dissertation focuses on generation tasks. For all presented works, we adopt a minimalist design philosophy, aiming to learn latent structures in audio for generation by jointly modeling audio with other modalities.

We first develop a video-conditioned sound generation model that implicitly disentangles foreground and background sounds via multimodal audio-visual learning, attaining state-of-the-art action-to-sound generation performance.

The next two chapters advance the field of speech generation. We propose VoiceCraft, a neural codec language model (NCLM) that unifies text-to-speech (TTS) and speech editing and has emerged as a widely adopted tool in the synthesis community. Next, by analyzing common pitfalls of existing NCLM-based TTS models, we propose VoiceStar, a robust NCLM-based TTS model capable of controlling output duration and extrapolating to unseen output lengths, achieving state-of-the-art performance on both short-form and long-form TTS benchmarks. Both VoiceCraft and VoiceStar are end-to-end models that implicitly disentangle speaker identity and lexical content, achieving voice cloning TTS without relying on external hand-engineered modules.

In conclusion, we outline future directions for unified audio generation, emphasizing seamless integration of multimodal information and controllability.

Details

Business indexing term
1010268
Title
Learning Latent Structures in Audio from Multimodal Data
Number of pages
211
Publication year
2025
Degree date
2025
School code
0227
Source
DAI-A 87/6(E), Dissertation Abstracts International
ISBN
9798270229337
Committee member
Mooney, Raymond J.; Grauman, Kristen; Mohamed, Abdel-rahman
University/institution
The University of Texas at Austin
Department
Computer Science
University location
United States -- Texas
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32458851
ProQuest document ID
3283962475
Document URL
https://www.proquest.com/dissertations-theses/learning-latent-structures-audio-multimodal-data/docview/3283962475/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic