Genhancer: High-Fidelity Speech Enhancement via Generative Modeling on Discrete Codec Tokens

Haici Yang, Jiaqi Su, Minje Kim, and Zeyu Jin

Interspeech 2024 PDF

Audio samples

Closer comparison with other models

This section provides a closer comparison among different models. The samples are collected from the compared models' demo pages and cover a wide range of in-the-wild enhancement scenarios.

Where available, we present the enhanced samples provided on each model's own webpage. All Genhancer samples are from the MaskGIT+Whisper model. Unless noted otherwise, Miipher samples are generated with the speaker condition and without the text condition.

Genhancer runs enhancement at 44.1kHz; Miipher [1] at 24kHz. All other models included below operate at 16kHz. HiFi-GAN-2 [2] has a built-in bandwidth extension module and can thus output 16kHz, 22.05kHz, or 44.1kHz. For fair comparison, all samples in this section are provided at or downsampled to 16kHz unless otherwise noted.
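As a concrete illustration of this preparation step, the following is a minimal sketch of downsampling a sample to 16kHz before comparison; the file names are placeholders and the librosa/soundfile calls are one possible tool choice, not necessarily the exact pipeline used here.

    # Minimal sketch: bring an enhanced sample to 16 kHz for side-by-side listening.
    # File names are placeholders; librosa/soundfile are one possible tool choice.
    import librosa
    import soundfile as sf

    TARGET_SR = 16000

    def to_16k(in_path, out_path):
        # sr=None keeps the file's native rate (e.g., 44.1 kHz for Genhancer, 24 kHz for Miipher).
        audio, native_sr = librosa.load(in_path, sr=None, mono=True)
        if native_sr != TARGET_SR:
            audio = librosa.resample(audio, orig_sr=native_sr, target_sr=TARGET_SR)
        sf.write(out_path, audio, TARGET_SR)

    to_16k("genhancer_sample_44k.wav", "genhancer_sample_16k.wav")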

Sample 1 - Reverb + noise (Miipher demo)
Without a transcription prompt, Genhancer recovers the content well, comparable to Miipher with text, and clearly outperforms Miipher without text.

Clean (24kHz)


Degraded (24kHz)

Genhancer (24kHz)
Miipher w/o text (24kHz, Original)
Miipher w/ text (24kHz, Original)

HiFi-GAN-2 (24kHz)

   Transcript: Attempting to appease dictatorial regimes with custom beanie babies is a tried and true strategy.
Sample 2 - Reverb + noise (Miipher demo)

Clean (24kHz)

Degraded (24kHz)

Genhancer (24kHz)
Genhancer (44.1kHz)
Miipher w/o text (24kHz, Original)

HiFi-GAN-2 (24kHz)

   Transcript: The Albanian Minister sidestepped the Italian Military Mission, and appointed himself Chief of Staff.
Sample 3 - Reverb + noise (Low-latency SE [3] demo)
This is a challenging example where the speech is barely intelligible towards the end of the utterance. All the generative methods display a certain degree of hallucination. Our method stays closer to the original speaker's voice and speech content than the others, but it can still produce robotic voice artifacts when uncertainty is high.

Clean

Degraded

Genhancer
Low-latency SE (Original)

Miipher

HiFi-GAN-2
Sample 4 - Noise + distortion (Low-latency SE [3] demo)

Degraded

Genhancer
Low-latency SE (Original)

Miipher

StoRm [4]

HiFi-GAN-2
Sample 5 - Reverb + noise (StoRm demo)
Clean
Degraded
Genhancer
Miipher
StoRm (Original)
HiFi-GAN-2
Sample 6 - Strong reverb (StoRm demo)
This sample shows that Genhancer is robust to strong reverberation. We can also observe that, as a non-generative model, HiFi-GAN-2 is more susceptible to the adverse audio effects in the input, although it preserves the content to a fairly good extent. In contrast, the generative models all provide a much cleaner acoustic environment.
Clean
Degraded
Genhancer
Miipher
StoRm (Original)
HiFi-GAN-2
Sample 7 - Other language (UNIVERSE [5] demo)
An example in a different language that was not included in the training set.
Degraded
Genhancer
UNIVERSE (Original)
Miipher
StoRm
HiFi-GAN-2
Sample 8 - Out-of-distribution spoken style (UNIVERSE demo)
The speaker in this sample expresses strong emotion through varying pitch and tone. In contrast to Miipher, Genhancer preserves the emotion, which we attribute to the codec tokens providing more complete acoustic information.
Degraded
Genhancer
Miipher
UNIVERSE (Original)
StoRm
HiFi-GAN-2
Sample 9 - Multi-speaker (UNIVERSE demo)
Although trained with language modeling, our method transitions smoothly between the two speakers and preserves their identities, thanks to the look-around capability of MaskGIT and the additional local-context branch (see the decoding sketch after this sample's clips).
Clean
Degraded
Genhancer
Miipher
UNIVERSE (Original)
HiFi-GAN-2
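To make the look-around claim above concrete, here is a minimal, hypothetical sketch of MaskGIT-style confidence-based iterative decoding over codec tokens. The predict_logits interface, mask index, step count, and cosine schedule are assumptions for illustration; conditioning on the noisy branch and SSL features is omitted, and this is not the exact Genhancer implementation.

    # Hypothetical sketch of MaskGIT-style iterative decoding over codec tokens.
    # predict_logits is an assumed interface: tokens (seq_len,) -> logits (seq_len, vocab).
    import math
    import torch

    MASK_ID = 1024      # assumed index of the [MASK] token
    NUM_STEPS = 8       # number of refinement iterations

    def maskgit_decode(predict_logits, seq_len):
        tokens = torch.full((seq_len,), MASK_ID, dtype=torch.long)
        for step in range(NUM_STEPS):
            # Bidirectional ("look-around") prediction: every position sees both past and future context.
            probs = predict_logits(tokens).softmax(dim=-1)
            conf, cand = probs.max(dim=-1)                   # per-position confidence and best candidate
            still_masked = tokens == MASK_ID
            # Cosine schedule: the fraction of positions kept masked shrinks to zero.
            n_keep_masked = int(math.cos(math.pi / 2 * (step + 1) / NUM_STEPS) * seq_len)
            # Commit predictions at masked positions, then re-mask the least confident ones.
            conf = conf.masked_fill(~still_masked, float("inf"))  # committed positions stay committed
            tokens = torch.where(still_masked, cand, tokens)
            if n_keep_masked > 0:
                tokens[conf.argsort()[:n_keep_masked]] = MASK_ID
        return tokens

Because every refinement step attends over the whole utterance rather than decoding strictly left to right, tokens on either side of the speaker change can inform each other, which is the behavior referred to above as look-around.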
Sample 10 - Reverb + noise (DEVO [6] demo)
Genhancer produces the correct content with stable pitch, prosody, and speaker identity.
Clean
Degraded
Genhancer
Miipher
DEVO (Original)
HiFi-GAN-2
Sample 11 - Reverb + noise (DEVO demo)
Generative models in general better resist the impulsive bird-chirping noise. Genhancer produces the cleanest acoustic background.
Clean
Degraded
Genhancer
Miipher
DEVO (Original)
HiFi-GAN-2
Sample 12 - Reverb + noise (SELM [7] demo)
Clean
Degraded
Genhancer
Miipher
SELM (Original)
HiFi-GAN-2

Comparison among variants of Genhancer

We observed that:
Sample 1 (Real-world speech content)

Degraded

AR+Whisper

MaskGIT+Whisper
MaskGIT+W2v-BERT2.0

MaskGIT+WavLM

FFDiscrete+Whisper
Sample 2 (Real-world speech content)

Degraded

AR+Whisper

MaskGIT+WavLM

MaskGIT+Whisper
MaskGIT+W2v-BERT2.0

FFDiscrete+Whisper
Sample 3 (DEMO)

Degraded

AR+Whisper

MaskGIT+WavLM

MaskGIT+Whisper
MaskGIT+W2v-BERT2.0

FFDiscrete+Whisper

   Transcript: Access my vimeo service to play music from Bernhard Fleischmann.
Sample 4 (DEMO)
Clean
Degraded
AR+WavLM
MaskGIT+Whisper
MaskGIT+WavLM
FFDiscrete+Whisper
Sample 5 (DAPS Real)
Clean
Degraded
AR+Whisper
MaskGIT+Whisper
MaskGIT+WavLM
FFDiscrete+Whisper
Sample 6 (DAPS Real)
Clean
Degraded
AR+Whisper
MaskGIT+Whisper
MaskGIT+WavLM
FFDiscrete+Whisper

More samples

For reference, we also provide the first 10 gender-balanced samples of each evaluation set, for five variants of Genhancer (MaskGIT + Whisper, MaskGIT + WavLM, MaskGIT + W2v-BERT2.0, AR + Whisper, and FFDiscrete + Whisper), in an anonymous repo here.

Supplement

Network design

We use DF-Conformer [8] as the backbone, with two main components (a hypothetical sketch follows the list):

- Feedforward noisy branch

- Token Generator
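Below is a hypothetical skeleton of how these two components could be wired together around the backbone. The module names, layer sizes, single-codebook simplification, and the plain Transformer encoder standing in for DF-Conformer [8] are all assumptions for illustration, not the actual Genhancer code.

    # Hypothetical skeleton of the two-branch design; names and sizes are assumptions.
    import torch
    import torch.nn as nn

    class TokenGeneratorSketch(nn.Module):
        def __init__(self, vocab_size=1025, ssl_dim=1024, noisy_dim=1024,
                     d_model=512, n_layers=6, n_heads=8):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, d_model)   # partially masked clean codec tokens (one codebook for simplicity)
            self.ssl_proj = nn.Linear(ssl_dim, d_model)          # SSL features from the noisy input (e.g., Whisper)
            self.noisy_proj = nn.Linear(noisy_dim, d_model)      # feedforward noisy branch: frame-level local acoustic context
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, n_layers)  # stand-in for the DF-Conformer backbone
            self.head = nn.Linear(d_model, vocab_size)               # logits over codec codebook entries

        def forward(self, tokens, ssl_feats, noisy_feats):
            # All inputs are assumed to be aligned to the same token frame rate.
            x = self.token_emb(tokens) + self.ssl_proj(ssl_feats) + self.noisy_proj(noisy_feats)
            return self.head(self.backbone(x))    # (batch, frames, vocab_size)

In the MaskGIT variant, training would mask a random subset of clean-token positions and predict them from the remaining context plus the two conditioning branches; the AR and FFDiscrete variants would swap in a causal decoder or a purely feedforward predictor over the same conditioning.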

Training setups

Reference

[1] Koizumi, Yuma, et al. "Miipher: A robust speech restoration model integrating self-supervised speech and text representations." 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2023.
[2] Su, Jiaqi, Zeyu Jin, and Adam Finkelstein. "HiFi-GAN-2: Studio-quality speech enhancement via generative adversarial networks conditioned on acoustic features." 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2021.
[3] Xue, Huaying, Xiulian Peng, and Yan Lu. "Low-latency Speech Enhancement via Speech Token Generation." arXiv preprint arXiv:2310.08981 (2023).
[4] Lemercier, Jean-Marie, et al. "StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation." IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023): 2724-2737.
[5] Serrà, Joan, et al. "Universal speech enhancement with score-based diffusion." arXiv preprint arXiv:2206.03065 (2022).
[6] Irvin, Bryce, et al. "Self-supervised learning for speech enhancement through synthesis." ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
[7] Wang, Ziqian, et al. "SELM: Speech Enhancement Using Discrete Tokens and Language Models." arXiv preprint arXiv:2312.09747 (2023).
[8] Koizumi, Yuma, et al. "DF-Conformer: Integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement." 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2021.
[9] Kumar, Rithesh, et al. "High-fidelity audio compression with improved RVQGAN." Advances in Neural Information Processing Systems 36 (2024).