Abstract

The Audio Question Answering (AQA) task includes audio event classification, audio captioning, and open-ended reasoning. Recently, AQA has garnered attention due to the advent of Large Audio Language Models (LALMs). Current literature focuses on constructing LALMs by integrating audio encoders with text-only Large Language Models (LLMs) through a projection module. While LALMs excel in general audio understanding, they are limited in temporal reasoning, which may hinder their commercial applications and on-device deployment. This paper addresses these limitations in audio temporal reasoning. First, we introduce a data augmentation technique for generating reliable audio temporal questions and answers using an LLM. Second, we further fine-tune an existing baseline using a curriculum learning strategy so that it specializes in temporal reasoning without compromising performance on fine-tuned tasks. We demonstrate the performance of our model using state-of-the-art LALMs on public audio benchmark datasets. Third, we implement our AQA model locally on-device and investigate its CPU inference for edge applications.

Details

1009240
Title
Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models
Publication title
arXiv.org; Ithaca
Publication year
2024
Publication date
Dec 13, 2024
Section
Computer Science; Electrical Engineering and Systems Science
Publisher
Cornell University Library, arXiv.org
Source
arXiv.org
Place of publication
Ithaca
Country of publication
United States
University/institution
Cornell University Library arXiv.org
e-ISSN
2331-8422
Source type
Working Paper
Language of publication
English
Document type
Working Paper
Publication history
Online publication date
2024-12-16
Milestone dates
2024-09-10 (Submission v1); 2024-09-13 (Submission v2); 2024-12-13 (Submission v3)
First posting date
16 Dec 2024
ProQuest document ID
3103019342
Document URL
https://www.proquest.com/working-papers/enhancing-temporal-understanding-audio-question/docview/3103019342/se-2?accountid=208611
Copyright
© 2024. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2024-12-17
Database
ProQuest One Academic