Generative De-Quantization for Neural Speech Codec via Latent Diffusion

Paper

Title: “Generative De-Quantization for Neural Speech Codec via Latent Diffusion”
Authors: Haici Yang, Inseon Jang, and Minje Kim
(Submitted to ICASSP 2024; under review)

Comparison of Different Coding Systems

Example #1
Original
EnCodec 1.5kbps
EnCodec 3kbps
DAC 3kbps
LaDiffCodec 1.5kbps (proposed)
LaDiffCodec 3kbps (proposed)
Example #2
Original
EnCodec 1.5kbps
EnCodec 3kbps
DAC 3kbps
LaDiffCodec 1.5kbps (proposed)
LaDiffCodec 3kbps (proposed)

Ablation of the Midway Infilling Algorithm

The following samples were synthesized with the proposed midway-infilling algorithm at different interpolation rates \lambda. The larger \lambda is, the more the condition branch (i.e., the quantized code) participates in the sampling procedure, which improves the "correctness" of the synthesized examples at the cost of reduced "naturalness." All midway-infilling samples use 100 sampling steps. For comparison, we also provide samples generated with standard DDPM sampling (baseline) using 1000 steps.
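To make the role of \lambda concrete, below is a minimal sketch of the interpolation step, assuming a diffusers-style scheduler API (set_timesteps / add_noise / step) and a denoiser called as model(x, t). All names here (midway_infilling, cond_latent, lam) are illustrative, not the paper's actual API; the exact schedule used in the paper may differ, so please refer to the paper and the repository linked below.

```python
import torch

@torch.no_grad()
def midway_infilling(model, scheduler, cond_latent, lam, num_steps=100):
    """Sketch of midway infilling: at each reverse-diffusion step, softly
    blend the current sample with a re-noised copy of the condition branch
    (the latent decoded from the quantized code) at rate `lam` in [0, 1].
    Larger `lam` -> more weight on the quantized code ("correctness");
    smaller `lam` -> more weight on the free sample ("naturalness")."""
    scheduler.set_timesteps(num_steps)
    x = torch.randn_like(cond_latent)  # start from pure noise
    for t in scheduler.timesteps:
        # Condition branch: re-noise the quantized-code latent to level t,
        # so both branches sit at the same noise level before blending.
        noise = torch.randn_like(cond_latent)
        x_cond = scheduler.add_noise(cond_latent, noise, t)
        # Midway interpolation controlled by lambda.
        x = (1.0 - lam) * x + lam * x_cond
        # Denoise the blended sample one step.
        eps = model(x, t)  # assumed signature: predicted noise
        x = scheduler.step(eps, t, x).prev_sample
    return x
```

With lam=0 this reduces to unconditional sampling, and with lam=1 the trajectory is pinned to the re-noised quantized-code latent at every step, matching the trade-off described above.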

Example #1 at 1.0 kbps

"Sufficient to serve with five or six mackerel."
(In the baseline result and some other infilling results, the word "serve" is mispronounced.)

Original
DDPM (baseline); \lambda=0; 1000 steps
Midway Infilling; \lambda=0.1; 100 steps
Midway Infilling; \lambda=0.2; 100 steps
Midway Infilling; \lambda=0.3; 100 steps
Midway Infilling; \lambda=0.4; 100 steps
Midway Infilling; \lambda=0.5; 100 steps
Midway Infilling; \lambda=0.6; 100 steps
Midway Infilling; \lambda=0.7; 100 steps
Midway Infilling; \lambda=0.8; 100 steps
Midway Infilling; \lambda=0.9; 100 steps
Midway Infilling; \lambda=1.0; 100 steps

Source Code

https://github.com/haiciyang/LaDiffCodec