Self-Supervised Learning for Personalized Speech Enhancement

Companion page for the paper “Self-Supervised Learning from Contrastive Mixtures for Personalized Speech Enhancement,” submitted to ICASSP 2021.

Source code

https://github.com/IU-SAIGE/contrastive_mixtures

Sound examples

Audio Sample #1

Input mixture and baseline models without finetuning
Input mixture
Small generalist without finetuning (64×2, -3.22 dB)
Large generalist without finetuning (256×2, -2.47 dB)
Small models (64×2) finetuned on 3 s of clean speech:

Premixture SNR = 5 dB for PseudoSE and CM

Finetuned generalist (2.73 dB)
PseudoSE + finetuning (7.47 dB)
CM + finetuning (8.25 dB)
Large models (256×2) finetuned on 3 s of clean speech:

Premixture SNR = 10 dB for PseudoSE and CM

Finetuned generalist (4.50 dB)
PseudoSE + finetuning (9.04 dB)
CM + finetuning (9.45 dB)
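The premixture SNR quoted above sets how loudly noise is blended into the training speech before the self-supervised models see it. As an illustration of how a mixture at a target SNR is commonly constructed, here is a minimal NumPy sketch; the function name and scaling scheme are my own and are not taken from the paper's code.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db` (in dB),
    then add it to `speech`. Illustrative sketch, not the paper's exact recipe."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose a gain so that 10*log10(p_speech / (gain**2 * p_noise)) == snr_db.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

For a 5 dB premixture, `mix_at_snr(clean_speech, noise, 5.0)` yields a noisy signal whose speech power is 5 dB above the scaled noise power.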

Audio Sample #2

Input mixture and baseline models without finetuning
Input mixture
Small generalist without finetuning (64×2, 1.15 dB)
Large generalist without finetuning (256×2, 0.07 dB)
Small models (64×2) finetuned on 3 s of clean speech:

Premixture SNR = 10 dB for PseudoSE and CM

Finetuned generalist (2.27 dB)
PseudoSE + finetuning (3.73 dB)
CM + finetuning (1.41 dB)
Large models (256×2) finetuned on 3 s of clean speech:

Premixture SNR = 10 dB for PseudoSE and CM

Finetuned generalist (2.17 dB)
PseudoSE + finetuning (3.38 dB)
CM + finetuning (5.41 dB)
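Each example above is annotated with a dB score comparing the enhanced output against the clean reference. A standard way to compute such a score is the signal-to-noise ratio in decibels, sketched below; this is an illustrative definition, and the paper may report a variant (e.g. a scale-invariant one).

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR of `estimate` relative to the clean `reference`, in dB.
    The residual (estimate - reference) is treated as noise."""
    reference = np.asarray(reference, dtype=np.float64)
    residual = np.asarray(estimate, dtype=np.float64) - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(residual ** 2))
```

For instance, an estimate whose residual power is 1% of the reference power scores 20 dB.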