Our dream is to develop a universal speech enhancement system that can deal with all the different kinds of corruption and variation a speech signal can go through: speaker-specific characteristics, reverberation, interfering noise, cross talk, band-pass filtering, clipping, etc. Such a hypothetical universal system might have to be a gigantic Deep Neural Network with a complicated network structure. But what if it is easier to train a smaller and simpler system (e.g., a shallower and narrower DNN) that specializes in removing only one kind of artifact? What if a few such systems are already available, and the best result among them is comparable to what we would expect from the hypothetical universal system? If there is a method that can choose the best model out of the candidate specialized systems, we can combine them to build the universal speech enhancement system.

This approach has several benefits under certain conditions. First, we can reuse already-trained DNNs as our modules instead of training the all-purpose gigantic DNN from scratch. If the participating modules are DNNs with simpler network topologies that are easier to train, we can build the desired universal speech enhancement system more easily and quickly. Second, the proposed approach is more scalable. If the system needs to handle a newly observed type of corruption, we can quickly train a module specialized for the new training examples and add it to the pool of candidates.

The main goal of this work is to devise a moderator, whose job is to choose the best module for an unseen test mixture. Although this job could potentially be done by a classifier, I took another path. The main reason is that learning such a classifier is not easy, as it needs to be trained on all the training sets collected so far. Moreover, a discriminative classifier does not scale well to new training sets. Instead, I trained an AutoEncoder (AE) on clean speech spectra only. Since an AE's training objective is to minimize the error between its input and its output, the AE trained on clean speech should reconstruct a clean input speech signal with little error, while an input that deviates from clean speech tends to produce a larger error. I use this AE reconstruction error as a measure of the quality of the module-specific intermediate speech enhancement results. The basic idea is this: (a) I feed the intermediate modular outputs to the speech AE and check their AE reconstruction errors; (b) I choose the output with the smallest AE error, as it is the one most similar to clean speech (according to what the AE has learned clean speech looks like).
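
To make the selection procedure concrete, here is a minimal sketch in PyTorch. The layer sizes, class names, and toy data are my own illustrative assumptions, not the exact architecture from the paper: a small fully connected autoencoder stands in for the clean-speech AE, and the moderator runs each module's output through it and picks the one with the smallest mean squared reconstruction error.

# Minimal sketch of AE-based module selection (illustrative, not the
# paper's exact setup). Layer sizes and names are assumptions.
import torch
import torch.nn as nn

class SpeechAE(nn.Module):
    """A small fully connected autoencoder meant to be trained on clean speech spectra."""
    def __init__(self, n_freq=513, n_hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_freq, n_hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_freq), nn.ReLU())

    def forward(self, x):
        return self.decoder(self.encoder(x))

def select_best_module(ae, module_outputs):
    """Pick the enhancement result whose AE reconstruction error is smallest.

    module_outputs: list of (n_frames, n_freq) magnitude spectrogram tensors,
    one per specialized enhancement module.
    """
    errors = []
    with torch.no_grad():
        for s_hat in module_outputs:
            recon = ae(s_hat)
            errors.append(torch.mean((recon - s_hat) ** 2).item())
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors

if __name__ == "__main__":
    ae = SpeechAE()  # assume this AE has already been trained on clean spectra
    # Dummy stand-ins for the intermediate outputs of three specialized modules
    outputs = [torch.rand(100, 513) for _ in range(3)]
    best_idx, errs = select_best_module(ae, outputs)
    print("moderator picks module", best_idx, "with errors", errs)

In practice, every specialized DNN first runs on the same noisy input, and the clean-speech AE then acts as the moderator over their outputs as above.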



The performance of this model selection mechanism is great: the AE is very successful at selecting the best module, even though it deals with test signals without any ground-truth targets. I call this Collaborative Deep Learning for speech enhancement, since the moderator can harmonize any speech enhancement DNN, as long as its job is to predict cleaned-up speech.



Reference

Minje Kim, "Collaborative Deep Learning for Speech Enhancement: A Run-Time Model Selection Method Using Autoencoders," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, LA, March 5-9, 2017.