We have been interested in using user-created recordings for audio enhancement, as in [Kim and Smaragdis 2013, Kim and Smaragdis 2016]. These days nearly everyone carries a mobile device and records everything, and we know that using multiple recordings helps audio enhancement. Although this sounds like a natural extension, it is not that straightforward, due to the heterogeneity of the "sensor array" in question. If we think of this set of user devices as a sensor array, it is really an "ad-hoc" sensor array: the devices are loosely connected and often very different from each other. For example, they are not synchronized. Furthermore, each recording can suffer from its own artifacts and interference that do not appear in the other recordings.

In this project we focus on another collaborative algorithm, this time to solve the dereverberation problem. In real-world recording environments it is usually impossible to capture the dry source; the recordings are almost always a reverberated version of it. Reverberation happens when walls reflect the sound source and create many delayed copies of it, which eventually add up at the microphone (think of your speech in a bathroom or a cave). It is a serious artifact that degrades speech recognition performance. We can think of reverberation as a filtering process: the dry speech source is convolved with a filter shaped like an impulse train, where the temporal position and height of each impulse define the delay and strength of a particular delayed arrival. In this project we make use of multiple recordings of the same speech for better dereverberation. Since the recordings are captured at different locations in the room, their reverberation filters differ from each other, while the algorithm has to find a common source shared across all the channel-specific dereverberation models.
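To make the filtering view concrete, here is a minimal Python sketch that simulates a reverberant recording as the convolution of a dry signal with a sparse impulse-train room response. This is a toy illustration only: make_impulse_train_rir and all of its constants are made up for the example and are not our experimental setup.

    import numpy as np
    from scipy.signal import fftconvolve

    def make_impulse_train_rir(fs, t60=0.6, n_reflections=50, seed=0):
        """Toy room impulse response: a sparse train of delayed, decaying impulses."""
        rng = np.random.default_rng(seed)
        length = int(t60 * fs)
        rir = np.zeros(length)
        rir[0] = 1.0                                     # direct path
        delays = rng.integers(1, length, n_reflections)  # random arrival times
        # Exponential decay so reflections die out (roughly -60 dB) by T60.
        rir[delays] = rng.uniform(0.2, 0.9, n_reflections) * np.exp(-6.9 * delays / length)
        return rir

    fs = 16000
    dry = np.random.randn(fs * 2)  # stand-in for a 2-second dry speech signal
    reverberant = fftconvolve(dry, make_impulse_train_rir(fs), mode="full")

Each microphone in the room would apply its own such filter to the same dry source, which is exactly the structure the model below exploits.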
To this end, we extend the Nonnegative Tensor Factorization (NTF)-based multi-channel dereverberation model proposed by [Mirsamadi et al. 2014], in which channel-specific speech dereverberation tasks are linked by jointly estimating the speech source across channels. Because an ordinary NTF model has a natural ambiguity for this problem, i.e., part of the reverberation filter can be absorbed into the source estimate, we further regularize it with an additional Nonnegative Matrix Factorization (NMF) speech prior, along with sparsity and total variation terms, to inject prior knowledge about dry speech into the source reconstruction.
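For concreteness, the overall objective can be sketched as follows. This is an illustrative formulation, not the exact equation from the paper: the symbols and the weights \(\lambda_1, \lambda_2\) are our shorthand, and the paper may attach the regularizers differently.

    \min_{H, W, A \geq 0} \; \sum_{c} D_{\mathrm{KL}}\!\left( X_c \,\middle\|\, \hat{X}_c \right)
        + \lambda_1 \|S\|_1
        + \lambda_2 \sum_{f,t} \left| S(f, t+1) - S(f, t) \right|,
    \qquad \hat{X}_c(f,t) = \sum_{\tau} H_c(f,\tau)\, S(f, t-\tau),
    \qquad S = W A .

Here \(X_c\) is the magnitude spectrogram of channel \(c\), \(H_c\) its nonnegative reverberation filter, \(S\) the shared source spectrogram, and \(W\) a dictionary of dry-speech spectral bases with activations \(A\). Jointly estimating the single \(S\) is what ties the channels together, while the NMF prior and the regularizers discourage the filters from leaking into the source.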

[Figure: The proposed NTF model with regularization]
We ran the experiment on three different room types with reverberation decay times (T60) of 0.6, 1.2, and 1.6 seconds, respectively. Speech recordings with longer decay times are challenging for applications such as automatic speech recognition. The input to the algorithm consists of monophonic recordings from four sensors placed in different parts of the room; the output is a single dry-signal estimate. This page presents example audio results for one source/sensor configuration per room, along with the corresponding average STOI and SNR values. We show the results of three dereverberation models: the baseline NTF model [Mirsamadi et al. 2014], adapted to an objective function using KL divergence instead of Euclidean distance; the NTF model extended with an NMF speech prior; and our proposed model, NTF with an NMF speech prior and regularization.
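As a reference for how such numbers can be computed, here is a small evaluation sketch. It assumes the pystoi package for STOI, and the snr_db helper is a plain signal-to-noise ratio against the dry reference, which may differ from the exact SNR variant reported in the paper.

    import numpy as np
    from pystoi import stoi  # assumes the pystoi package (pip install pystoi)

    def snr_db(reference, estimate):
        """Plain SNR in dB between a dry reference and a dereverberated estimate."""
        noise = reference - estimate
        return 10 * np.log10(np.sum(reference**2) / np.sum(noise**2))

    fs = 16000
    # `dry` and `estimate` would be time-aligned waveforms of equal length.
    dry = np.random.randn(fs * 2)
    estimate = dry + 0.1 * np.random.randn(fs * 2)  # stand-in for a model output
    print("STOI:", stoi(dry, estimate, fs, extended=False))
    print("SNR (dB):", snr_db(dry, estimate))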
 
  • Input reverberant audio signals (four channels)
    T60=0.6 sec.
    T60=1.2 sec.
    T60=1.6 sec.
  • Reconstructions
    T60=0.6 sec.
    • Baseline (KL-div NTF)
    • NTF with speech prior only
    • NTF with speech prior, sparsity, and total variation
    T60=1.2 sec.
    • Baseline (KL-div NTF)
    • NTF with speech prior only
    • NTF with speech prior, sparsity, and total variation
    T60=1.6 sec.
    • Baseline (KL-div NTF)
    • NTF with speech prior only
    • NTF with speech prior, sparsity, and total variation
 

Reference

For more details, check out our recent paper: Sanna Wager and Minje Kim (2018), "Collaborative speech dereverberation: regularized tensor factorization for crowdsourced multi-channel recordings," under review. [pdf]