This approach is beneficial in several respects. First, we can reuse already-trained DNNs as our modules instead of training one all-purpose, gigantic DNN from scratch. If the participating modules are DNNs with simpler network topologies that are easier to train, we can build the desired universal speech enhancement system more easily and quickly. Second, the proposed approach is more scalable. If the system needs to handle a newly observed type of corruption, we can quickly train a module specialized for the new training examples and add it to the pool of candidates.
The main goal of this work is to devise a moderator whose job is to choose the best module for an unseen test mixture. Although a classifier could potentially do this job, I took another path. The main reason is that learning such a classifier is not easy, as it needs to be trained on all the training sets collected so far. Moreover, a discriminative classifier does not scale well to new training sets, since it would have to be retrained each time one is added. Instead, I trained an AutoEncoder (AE) using clean speech spectra only. Because an AE's training goal is to minimize the error between its input and its output, an AE trained on clean speech should reproduce a clean speech input with little error. I use this AE reconstruction error as a measure of the quality of all the module-specific intermediate speech enhancement results. The basic idea is this: (a) I feed the intermediate modular outputs to the speech AE and check their AE reconstruction errors; (b) I choose the output that yields the lowest AE error, as it should be the most similar to clean speech (according to the training objective of the AE).
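The selection rule in (a) and (b) can be sketched in a few lines. This is only an illustration under stated assumptions, not the trained system: a linear projection onto a random low-dimensional subspace stands in for the clean-speech AE, the two modules are hypothetical toy functions, and mean squared error is assumed as the reconstruction error measure. The names `reconstruction_error` and `select_module` are introduced here for illustration.

```python
import numpy as np


def reconstruction_error(ae, spectra):
    """Mean squared error between the spectra and their AE reconstruction."""
    recon = ae(spectra)
    return float(np.mean((spectra - recon) ** 2))


def select_module(modules, ae, mixture):
    """Run every module on the mixture; keep the output the AE reconstructs best."""
    outputs = [module(mixture) for module in modules]          # step (a): module outputs
    errors = [reconstruction_error(ae, out) for out in outputs]  # step (a): AE errors
    best = int(np.argmin(errors))                              # step (b): lowest error
    return best, outputs[best], errors


# --- Toy stand-ins (assumptions, not the real AE or DNN modules) ---
rng = np.random.default_rng(0)

# "Clean speech" lives in a 2-D subspace of an 8-D spectral frame.
basis, _ = np.linalg.qr(rng.normal(size=(8, 2)))   # orthonormal columns, shape (8, 2)


def toy_ae(frames):
    """Stand-in AE: projects each frame (row) onto the clean-speech subspace."""
    return frames @ basis @ basis.T


clean = rng.normal(size=(5, 2)) @ basis.T          # 5 clean frames, shape (5, 8)
noise = rng.normal(size=(5, 8))
mixture = clean + noise


def module_mismatched(x):
    """Hypothetical module that does not suit this noise; leaves the input as is."""
    return x


def module_matched(x):
    """Hypothetical module that suits this noise; removes it (oracle, toy only)."""
    return x - noise


best, chosen, errors = select_module([module_mismatched, module_matched], toy_ae, mixture)
print("selected module index:", best)
```

Note that the selection never touches the clean target: only the module outputs and the AE are needed at test time, which is what lets the moderator operate on mixtures without ground truth.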
This model selection mechanism performs well: the AE is very successful at selecting the best module, even though it deals with test signals without any ground-truth targets. I call this approach Collaborative Deep Learning for speech enhancement, as it can harmonize any speech enhancement DNN, as long as its job is to predict cleaned-up speech.