
Abstract

Vector quantization (VQ) is a classic signal processing technique that models the probability density function of a distribution using a set of representative vectors called the codebook, such that each codebook vector represents a subset of the distribution's samples. Deep neural networks (DNNs) are a branch of machine learning that has gained popularity in recent decades because they can solve complicated optimization problems. Since VQ provides an abstract, high-level discrete representation of a distribution, it is widely used as a tool in many DNN-based applications, such as image generation, speech recognition, text-to-speech synthesis, and speech and video coding. Given VQ's broad utilization in DNN-based applications, even a small improvement in VQ can yield a large performance boost across applications that handle diverse data types such as speech, image, video, and text.
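The core VQ operation described above can be sketched as a nearest-neighbor lookup in the codebook; this is a minimal illustration with a randomly initialized codebook (the sizes and the Euclidean metric are assumptions for the example, not specifics of the thesis):

```python
import numpy as np

def vector_quantize(x, codebook):
    """Map each input vector to its nearest codebook vector (Euclidean distance)."""
    # Pairwise distances: (num_inputs, codebook_size)
    d = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)            # index of the nearest codebook vector
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # 16 representative vectors in 4-D
x = rng.normal(size=(8, 4))           # 8 input samples
xq, idx = vector_quantize(x, codebook)
```

Each input is thus replaced by one of 16 representatives, so it can be transmitted or stored as a 4-bit index.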

This thesis mainly focuses on improving various VQ methods within deep learning frameworks. We propose using vector quantization instead of scalar quantization in a speech coding framework. The experiments show that the decoded speech has a higher perceptual quality because VQ considers the correlation between different dimensions of the spectral envelopes. As another contribution, we propose a new solution to the gradient collapse problem called noise substitution in vector quantization (NSVQ), in which we model VQ as the addition of a noise vector to the input. Experiments show that NSVQ gives faster convergence, more accurate gradients, and fewer hyperparameters to tune than two state-of-the-art solutions, i.e., the straight-through estimator and exponential moving average. We further demonstrate that NSVQ can also optimize other variants of VQ that use multiple codebooks, e.g., product VQ, residual VQ, and additive VQ. Experimental results on speech coding, image compression, and approximate nearest neighbor search tasks show that VQ variants optimized by NSVQ perform comparably to the baselines.
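The NSVQ idea above — modeling VQ as the addition of a noise vector to the input — can be sketched as follows: during training, the hard (non-differentiable) quantization error is replaced by a random noise vector scaled to the same magnitude, so the simulated output stays differentiable with respect to the input. This is a minimal NumPy sketch based on that description; the exact scaling and noise distribution here are illustrative assumptions:

```python
import numpy as np

def nsvq_forward(x, codebook, rng):
    """Training-time NSVQ sketch: substitute the quantization error with
    a random vector of equal norm, simulating quantization noise while
    keeping the mapping differentiable w.r.t. the input x."""
    d = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)
    q = codebook[d.argmin(axis=1)]                        # hard quantization
    err_norm = np.linalg.norm(x - q, axis=1, keepdims=True)
    noise = rng.standard_normal(x.shape)                  # random direction
    noise /= np.linalg.norm(noise, axis=1, keepdims=True)
    return x + err_norm * noise                           # simulated output

rng = np.random.default_rng(1)
codebook = rng.normal(size=(8, 3))
x = rng.normal(size=(5, 3))
x_sim = nsvq_forward(x, codebook, rng)
```

At inference time the hard quantization `q` would be used directly; the noise substitution is only needed while training, to let gradients flow through the quantizer.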

By incorporating space-filling curves into VQ, we introduce a novel quantization technique called space-filling vector quantization (SFVQ), which quantizes the input on a continuous piecewise linear curve. Because of the inherent order in the SFVQ codebook, adjacent codebook vectors refer to similar content. We use this property to interpret the underlying phonetic structure of the latent space of a voice conversion model. Moreover, we use SFVQ to interpret the intermediate latent spaces of the StyleGAN2 and BigGAN image generative models. SFVQ gives good control over the generation process: we find the mapping between the latent space and generative factors (e.g., gender, age), and we discover interpretable directions for changing an image's attributes (e.g., smile, pose). In another work, we use SFVQ to cluster speaker embeddings to enhance speaker privacy in DNN-based speech processing tools.
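The quantization step of SFVQ described above — quantizing the input on a continuous piecewise linear curve through the ordered codebook — can be sketched as projecting the input onto each segment between consecutive codebook vectors and keeping the closest projection. This is a hypothetical illustration of that geometric idea, not the thesis's implementation:

```python
import numpy as np

def sfvq_quantize(x, codebook):
    """Project x onto the piecewise linear curve connecting consecutive
    (ordered) codebook vectors and return the nearest point on the curve."""
    best, best_d = None, np.inf
    for a, b in zip(codebook[:-1], codebook[1:]):
        ab = b - a
        # Parameter of the orthogonal projection, clamped to the segment
        t = np.clip(np.dot(x - a, ab) / np.dot(ab, ab), 0.0, 1.0)
        p = a + t * ab
        d = np.linalg.norm(x - p)
        if d < best_d:
            best, best_d = p, d
    return best

# Tiny 2-D example: a curve through three ordered codebook points
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
q = sfvq_quantize(np.array([0.5, -1.0]), codebook)
```

Because the output lies on a continuous curve rather than at isolated codebook points, neighboring positions along the curve correspond to similar content, which is what makes the codebook order interpretable.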

Details

Title
Vector Quantization in Deep Neural Networks for Speech and Image Processing
Number of pages
86
Publication year
2025
Degree date
2025
School code
2690
Source
DAI-B 87/5(E), Dissertation Abstracts International
Committee member
Virtanen, Tuomas; Williams, Jennifer
University/institution
Aalto Yliopisto (Finland)
University location
Finland
Degree
Dr.sci.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32414523
ProQuest document ID
3286898942
Document URL
https://www.proquest.com/dissertations-theses/vector-quantization-deep-neural-networks-speech/docview/3286898942/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic