This section provides a closer comparison among different models. The samples are collected from the comparison models' demo pages and cover a wide range of in-the-wild enhancement scenarios.
Where available, we present the enhanced samples provided on each model's own webpage. All Genhancer samples are from the MaskGIT+Whisper model. Unless noted otherwise, Miipher samples are generated with the speaker condition and without the text condition.
Genhancer runs enhancement at 44.1kHz; Miipher [1] at 24kHz. All other models included below operate at 16kHz. HiFi-GAN-2 [2] has a built-in bandwidth extension module and can thus provide 16kHz, 22.05kHz, or 44.1kHz output. For fair comparison, all samples in this section are prepared at or downsampled to 16kHz unless otherwise noted.
Sample 1 - Reverb + noise (Miipher demo)
Without a transcription prompt, Genhancer recovers the content well, comparable to Miipher with text, and clearly outperforms Miipher without text.
Clean (24kHz)
Degraded (24kHz)
Genhancer (24kHz)
Miipher w/o text (24kHz, Original)
Miipher w/ text (24kHz, Original)
HiFi-GAN-2 (24kHz)
   Transcript: Attempting to appease dictatorial regimes with custom beanie babies is a tried and true strategy.
   Transcript: The Albanian Minister sidestepped the Italian Military Mission, and appointed himself Chief of Staff.
Sample 3 - Reverb + noise (Low-latency SE [3] demo)
This is a challenging example where the speech is barely intelligible towards the end of the utterance. All the generative methods display a certain degree of hallucination. Our method sticks closer to the original speaker's voice and speech content than the others, but it can still produce robotic voice artifacts when uncertainty is high.
Sample 6 - Strong reverb (StoRM [4] demo)
This sample shows that Genhancer is robust to strong reverberation. Meanwhile, we can observe that, as a non-generative model, HiFi-GAN-2 is more susceptible to adverse audio effects in the input, although it preserves the content to a fairly good extent. In contrast, all the generative models provide a much cleaner acoustic environment.
Clean
Degraded
Genhancer
Miipher
StoRM (Original)
HiFi-GAN-2
Sample 7 - Other language (UNIVERSE [5] demo)
An example in a different language that was not included in the training set.
Degraded
Genhancer
UNIVERSE (Original)
Miipher
StoRM
HiFi-GAN-2
Sample 8 - Out-of-distribution spoken style (UNIVERSE demo)
The speaker in this sample expresses strong emotion through varying pitch and tone. In contrast to Miipher, Genhancer is capable of preserving the emotion, which we attribute to the codec tokens providing more complete acoustic information.
Degraded
Genhancer
Miipher
UNIVERSE (Original)
StoRM
HiFi-GAN-2
Sample 9 - Multi-speaker (UNIVERSE demo)
Although trained with language modeling, our method transitions smoothly between the two speakers and preserves their identities, thanks to the look-around capability of MaskGIT and the additional local-context branch.
Clean
Degraded
Genhancer
Miipher
UNIVERSE (Original)
HiFi-GAN-2
Sample 10 - Reverb + noise (DEVO [6] demo)
Genhancer produces correct content and stable pitch, prosody, and speaker identity.
Clean
Degraded
Genhancer
Miipher
DEVO (Original)
HiFi-GAN-2
Sample 11 - Reverb + noise (DEVO demo)
Generative models in general better resist the impulsive bird-chirping noise. Genhancer produces the cleanest acoustic background.
- MaskGIT is in general more accurate than AR in speech content and speaker characteristics, while the AR model can achieve higher fidelity.
- There is no consistent ranking among the different pre-trained speech features, but Whisper displays more stable results across the board. WavLM can sometimes achieve more accurate content reconstruction, as in Sample 3.
- The parallel token prediction method (FFDiscrete) performs comparably well in a majority of cases, showing that the use of tokens already prevents mode collapse and benefits enhancement quality. However, FFDiscrete leads to distortion when the acoustic conditions get challenging (e.g., Sample 3), whereas MaskGIT achieves better consistency with its iterative mode selection, as sketched below.
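To make the last point concrete, below is a minimal sketch of MaskGIT-style iterative decoding (hypothetical shapes and helper names such as logits_fn; not the exact Genhancer implementation). Unlike one-shot parallel prediction, each step commits only the most confident token predictions and re-predicts the remaining masked positions with the committed tokens as context:

```python
# Minimal MaskGIT-style decoding sketch over a [T, Q] token grid
# (T codec frames, Q codebooks, S codes per codebook). Illustrative only.
import math
import torch

MASK = -1  # sentinel id for still-masked positions

def maskgit_decode(logits_fn, T: int, Q: int, steps: int = 8) -> torch.Tensor:
    """logits_fn maps a partially masked [T, Q] grid to [T, Q, S] logits."""
    tokens = torch.full((T, Q), MASK, dtype=torch.long)
    for step in range(steps):
        probs = logits_fn(tokens).softmax(dim=-1)           # [T, Q, S]
        conf, pred = probs.max(dim=-1)                      # per-position confidence
        conf = conf.masked_fill(tokens != MASK, math.inf)   # committed tokens stay
        # Cosine schedule: number of positions left masked after this step.
        n_masked = int(T * Q * math.cos(math.pi / 2 * (step + 1) / steps))
        # Commit the most confident predictions; the rest stay masked.
        order = conf.flatten().argsort(descending=True)
        commit = torch.zeros(T * Q, dtype=torch.bool)
        commit[order[: T * Q - n_masked]] = True
        flat = tokens.flatten()
        update = commit & (flat == MASK)                    # never overwrite
        flat[update] = pred.flatten()[update]
        tokens = flat.view(T, Q)
    return tokens
```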
Sample 1 (Real-world speech content)
Degraded
AR+Whisper
MaskGIT+Whisper
MaskGIT+W2v-BERT2.0
MaskGIT+WavLM
FFDiscrete+Whisper
Sample 2 (Real-world speech content)
Degraded
AR+Whisper
MaskGIT+WavLM
MaskGIT+Whisper
MaskGIT+W2v-BERT2.0
FFDiscrete+Whisper
Sample 3 (DEMO)
Degraded
AR+Whisper
MaskGIT+WavLM
MaskGIT+Whisper
MaskGIT+W2v-BERT2.0
FFDiscrete+Whisper
   Transcript: Access my vimeo service to play music from Bernhard Fleischmann.
Sample 4 (DEMO)
Clean
Degraded
AR+WavLM
MaskGIT+Whisper
MaskGIT+WavLM
FFDiscrete+Whisper
Sample 5 (DAPS Real)
Clean
Degraded
AR+Whisper
MaskGIT+Whisper
MaskGIT+WavLM
FFDiscrete+Whisper
Sample 6 (DAPS Real)
Clean
Degraded
AR+Whisper
MaskGIT+Whisper
MaskGIT+WavLM
FFDiscrete+Whisper
More samples
For reference, we also provide the first 10 gender-balanced samples of each evaluation set, for five variants of Genhancer (MaskGIT + Whisper, MaskGIT + WavLM, MaskGIT + W2v-BERT2.0, AR + Whisper, and FFDiscrete + Whisper), in an anonymous repo here.
Supplement
Network design
We use DF-Conformer [8] as the backbone.
- Feedforward noisy branch
- Token Generator
- Q = 9: number of codebooks in DAC [9]
- D = 8: code dimension of DAC
- S = 4096: codebook size of DAC
- C' = 256, C = 512
- P = 1024: dimension of the pretrained features
Causal convolutions are used for the auto-regressive generative pattern; non-causal convolutions for the rest.
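For concreteness, the dimensions above imply the tensor shapes sketched below. This is a hypothetical walk-through with illustrative names; the roles we assign to C' and C are assumptions, not taken from the codebase:

```python
# Hypothetical shape walk-through of the Token Generator dimensions.
from dataclasses import dataclass

@dataclass
class TokenGeneratorDims:
    Q: int = 9          # number of codebooks in DAC (RVQ depth)
    D: int = 8          # code dimension of each DAC codebook entry
    S: int = 4096       # codebook size (distinct codes per codebook)
    C_prime: int = 256  # C' (assumed: per-codebook embedding width)
    C: int = 512        # (assumed: backbone model width)
    P: int = 1024       # dimension of the pretrained features

dims = TokenGeneratorDims()
T = 100  # arbitrary number of codec frames for illustration

# A T-frame utterance corresponds to Q parallel discrete token streams:
#   tokens           -> [T, dims.Q], integer ids in [0, dims.S)
#   quantized codes  -> [T, dims.Q, dims.D] after DAC codebook lookup
#   pretrained feats -> [T', dims.P] from Whisper / WavLM / W2v-BERT2.0
```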
Training setups
- Batch size = 80; sample length = 8 seconds
- AdamW optimizer with betas = (0.9, 0.95) and weight decay = 0.01
- The learning rate increases linearly to 1e-4 over the first 1k steps, then decays to 1e-5 following a cosine schedule over 300k steps.
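A minimal sketch of this learning-rate schedule (assuming the 300k steps include the 1k warmup; the exact bookkeeping may differ):

```python
# Linear warmup to 1e-4 over 1k steps, then cosine decay to 1e-5 by 300k steps.
import math

def learning_rate(step: int, peak: float = 1e-4, floor: float = 1e-5,
                  warmup: int = 1_000, total: int = 300_000) -> float:
    if step < warmup:
        return peak * step / warmup                   # linear warmup
    t = min((step - warmup) / (total - warmup), 1.0)  # decay progress in [0, 1]
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * t))

# learning_rate(0) == 0.0; learning_rate(1_000) == 1e-4; learning_rate(300_000) == 1e-5
```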
Reference
[1] Koizumi, Yuma, et al. "Miipher: A robust speech restoration model integrating self-supervised speech and text representations." 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2023.
[2] Su, Jiaqi, Zeyu Jin, and Adam Finkelstein. "HiFi-GAN-2: Studio-quality speech enhancement via generative adversarial networks conditioned on acoustic features." 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2021.
[3] Xue, Huaying, Xiulian Peng, and Yan Lu. "Low-latency Speech Enhancement via Speech Token Generation." arXiv preprint arXiv:2310.08981 (2023).
[4] Lemercier, Jean-Marie, et al. "StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation." IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023): 2724-2737.
[5] Serrà, Joan, et al. "Universal speech enhancement with score-based diffusion." arXiv preprint arXiv:2206.03065 (2022).
[6] Irvin, Bryce, et al. "Self-supervised learning for speech enhancement through synthesis." ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
[7] Wang, Ziqian, et al. "SELM: Speech Enhancement Using Discrete Tokens and Language Models." arXiv preprint arXiv:2312.09747 (2023).
[8] Koizumi, Yuma, et al. "DF-Conformer: Integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement." 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2021.
[9] Kumar, Rithesh, et al. "High-fidelity audio compression with improved RVQGAN." Advances in Neural Information Processing Systems 36 (2024).