
Abstract

Vector quantization (VQ) is a classic signal processing technique that models the probability density function of a distribution using a set of representative vectors called the codebook, such that each codebook vector represents a subset of the distribution's samples. Deep neural networks (DNNs) are a branch of machine learning that has gained popularity in recent decades because they can solve complicated optimization problems. Since VQ provides an abstract, high-level discrete representation of a distribution, it is widely used as a tool in many DNN-based applications, such as image generation, speech recognition, text-to-speech synthesis, and speech and video coding. Given VQ's broad utilization in DNN-based applications, even a small improvement in VQ can yield a large performance boost across applications that handle diverse data types such as speech, image, video, and text.
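The core VQ operation described above can be sketched as a nearest-neighbor lookup in the codebook; this is a minimal illustration with a randomly initialized codebook (the sizes and the Euclidean metric are assumptions for the example, not specifics of the thesis):

```python
import numpy as np

def vector_quantize(x, codebook):
    """Map each input vector to its nearest codebook vector (Euclidean distance)."""
    # Pairwise distances: (num_inputs, codebook_size)
    d = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)            # index of the nearest codebook vector
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # 16 representative vectors in 4-D
x = rng.normal(size=(8, 4))           # 8 input samples
xq, idx = vector_quantize(x, codebook)
```

Each input is thus replaced by one of 16 representatives, so it can be transmitted or stored as a 4-bit index.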

This thesis mainly focuses on improving various VQ methods within deep learning frameworks. We propose using vector quantization instead of scalar quantization in a speech coding framework. The experiments show that the decoded speech has a higher perceptual quality because VQ considers the correlation between different dimensions of the spectral envelopes. As another contribution, we propose a new solution to the gradient collapse problem called noise substitution in vector quantization (NSVQ), in which we model VQ as the addition of a noise vector to the input. Experiments show that NSVQ gives faster convergence, more accurate gradients, and fewer hyperparameters to tune than two state-of-the-art solutions, i.e., the straight-through estimator and exponential moving average. We further demonstrate that NSVQ can also optimize other variants of VQ that use multiple codebooks, e.g., product VQ, residual VQ, and additive VQ. Experimental results on speech coding, image compression, and approximate nearest neighbor search tasks show that VQ variants optimized by NSVQ perform comparably to the baselines.
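The NSVQ idea above — modeling VQ as the addition of a noise vector to the input — can be sketched as follows: during training, the hard (non-differentiable) quantization error is replaced by a random noise vector scaled to the same magnitude, so the simulated output stays differentiable with respect to the input. This is a minimal NumPy sketch based on that description; the exact scaling and noise distribution here are illustrative assumptions:

```python
import numpy as np

def nsvq_forward(x, codebook, rng):
    """Training-time NSVQ sketch: substitute the quantization error with
    a random vector of equal norm, simulating quantization noise while
    keeping the mapping differentiable w.r.t. the input x."""
    d = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)
    q = codebook[d.argmin(axis=1)]                        # hard quantization
    err_norm = np.linalg.norm(x - q, axis=1, keepdims=True)
    noise = rng.standard_normal(x.shape)                  # random direction
    noise /= np.linalg.norm(noise, axis=1, keepdims=True)
    return x + err_norm * noise                           # simulated output

rng = np.random.default_rng(1)
codebook = rng.normal(size=(8, 3))
x = rng.normal(size=(5, 3))
x_sim = nsvq_forward(x, codebook, rng)
```

At inference time the hard quantization `q` would be used directly; the noise substitution is only needed while training, to let gradients flow through the quantizer.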

By incorporating space-filling curves into VQ, we introduce a novel quantization technique called space-filling vector quantization (SFVQ), which quantizes the input on a continuous piecewise linear curve. Because of the inherent order in the SFVQ codebook, adjacent codebook vectors refer to similar content. We use this property to interpret the underlying phonetic structure of the latent space of a voice conversion model. Moreover, we use SFVQ to interpret the intermediate latent spaces of the StyleGAN2 and BigGAN image generative models. SFVQ gives good control over the generation process: we find the mapping between the latent space and generative factors (e.g., gender, age), and we discover interpretable directions for changing an image's attributes (e.g., smile, pose). In another work, we use SFVQ to cluster speaker embeddings to enhance speaker privacy in DNN-based speech processing tools.
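The quantization step of SFVQ described above — quantizing the input on a continuous piecewise linear curve through the ordered codebook — can be sketched as projecting the input onto each segment between consecutive codebook vectors and keeping the closest projection. This is a hypothetical illustration of that geometric idea, not the thesis's implementation:

```python
import numpy as np

def sfvq_quantize(x, codebook):
    """Project x onto the piecewise linear curve connecting consecutive
    (ordered) codebook vectors and return the nearest point on the curve."""
    best, best_d = None, np.inf
    for a, b in zip(codebook[:-1], codebook[1:]):
        ab = b - a
        # Parameter of the orthogonal projection, clamped to the segment
        t = np.clip(np.dot(x - a, ab) / np.dot(ab, ab), 0.0, 1.0)
        p = a + t * ab
        d = np.linalg.norm(x - p)
        if d < best_d:
            best, best_d = p, d
    return best

# Tiny 2-D example: a curve through three ordered codebook points
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
q = sfvq_quantize(np.array([0.5, -1.0]), codebook)
```

Because the output lies on a continuous curve rather than at isolated codebook points, neighboring positions along the curve correspond to similar content, which is what makes the codebook order interpretable.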

Details

Title
Vector Quantization in Deep Neural Networks for Speech and Image Processing
Number of pages
86
Publication year
2025
Degree date
2025
School code
2690
Source
DAI-B 87/5(E), Dissertation Abstracts International
Committee member
Virtanen, Tuomas; Williams, Jennifer
University/institution
Aalto Yliopisto (Finland)
University location
Finland
Degree
Dr.sci.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32414523
ProQuest document ID
3286898942
Document URL
https://www.proquest.com/dissertations-theses/vector-quantization-deep-neural-networks-speech/docview/3286898942/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic