As in crowdsourcing, this project aims to improve the quality of recordings of audio scenes (music concerts, talks, lectures, etc.) by separating only the sources of interest from multiple low-quality user-created recordings. This can be seen as a challenging microphone array setting in which the channels are not synchronized, are each degraded in a different way, and have different sampling rates.
We achieve the separation with an extended probabilistic topic model that shares some topics (sources) across the recordings. Put another way, we perform the usual matrix factorization for each recording, but constrain some of the sources to be the same (with different global weights) across the simultaneous factorizations of all recordings.
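One way to write down the shared-source idea in matrix-factorization notation is sketched below. The symbols are assumptions chosen for illustration, not necessarily the notation of the papers: $X^{(l)}$ is the nonnegative magnitude spectrogram of recording $l$, $W_s$ and $H_s$ hold the spectral bases and activations of the shared sources, $g^{(l)}$ holds the per-recording global weights, and $W^{(l)}$, $H^{(l)}$ are recording-specific factors that absorb interference and artifacts.

```latex
X^{(l)} \;\approx\; W_s \,\operatorname{diag}\!\left(g^{(l)}\right) H_s \;+\; W^{(l)} H^{(l)},
\qquad l = 1, \dots, L,
\qquad W_s,\, H_s,\, g^{(l)},\, W^{(l)},\, H^{(l)} \ge 0
```

In this sketch only $g^{(l)}$, $W^{(l)}$, and $H^{(l)}$ vary with $l$. The first term is the part that all $L$ factorizations are forced to agree on, up to the global weights.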
On some synthetic concert recordings, this gave better separation than the oracle matrix factorization results obtained with ideal bases pre-learned from the ground-truth clean recording. We plan to accelerate the algorithm so that it can cope with very large audio datasets.
Some audio clips:
- Input#1: low-pass filtered recording (8kHz) with a speech interference (wav)
- Input#2: high-pass filtered recording (500Hz) with another speech interference (wav)
- Input#3: low-pass filtered (11.5kHz) and high-pass filtered (500Hz) recording with clipping artifacts (wav)
- Enhanced audio using PLCS plus both priors (wav)
In the second phase of this project, we extended our experiments to a setting with many more microphones (1000 sensors in the scene, the blue dots in the picture above). The goal is now to extract the dominant source (the filled diamond in the middle of the picture) from those 1000 recordings. Done manually, this would require someone to listen to all 1000 recordings and pick the best one based on his or her own perceptual quality assessment, which is tedious, difficult, and expensive. The outcome of such a selection can range from the worst recording to the best one. Note that in the worst recording, someone else's voice was recorded loudly instead of the dominant source of interest.
Our idea for getting around this computational complexity issue is to focus on the nearest neighbors at every EM iteration rather than on the whole dataset. In the figure above, we can see that topic modeling finds similar convex hulls (the green polytope wrapping the data points) whether or not the non-neighboring data samples are included in the process. In fact, we believe that focusing on the neighbors of the current topics (the corners of the polytope) gives not only a speed-up but also better results, because otherwise the M-step spends a lot of time extracting a small contribution from the non-neighboring observations.
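The neighbor-restricted update can be illustrated with a deliberately simplified, single-topic stand-in. With only one topic, the EM update reduces to a normalized sum of the observations, so restricting it to neighbors means averaging just the K nearest recordings at each iteration. The function name `neighborhood_topic`, the use of one spectrum per recording, and the direction of the cross entropy are assumptions made for this sketch, not the project's actual implementation.

```python
import math


def normalize(vec):
    """Scale a nonnegative vector so that it sums to one."""
    total = sum(vec)
    return [v / total for v in vec]


def cross_entropy(p, q, eps=1e-12):
    """Cross entropy H(p, q) between two normalized spectra."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))


def neighborhood_topic(recordings, init, k, n_iter=10):
    """Estimate one topic using only its k nearest recordings per iteration.

    `recordings` is a list of normalized spectra (hypothetical toy input),
    `init` is any nonnegative starting spectrum.
    """
    topic = normalize(init)
    for _ in range(n_iter):
        # Select the k recordings closest to the current topic estimate.
        nearest = sorted(recordings, key=lambda r: cross_entropy(topic, r))[:k]
        # Single-topic update: normalized sum of the selected spectra only.
        topic = normalize([sum(column) for column in zip(*nearest)])
    return topic
```

Starting from the average of all recordings, the estimate drifts toward the cluster of mutually similar recordings and stops being pulled by the outliers, which is the behavior the paragraph above argues for. The neighbor selection here is still exhaustive.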
The nearest neighbor search itself can be time-consuming, though. If we search exhaustively using a proper distance metric, such as the cross entropy between the normalized magnitude spectrograms of the recordings and that of the reconstructed source, the overhead of the search diminishes the speed-up. Instead, we can base the search on the Hamming distance between hash codes of those spectrograms, which is cheap to compute with bitwise operations. If we first find 3K pseudo-neighbors with respect to Hamming distance, and then perform the exhaustive search only on those 3K candidates rather than on the whole set, we can construct the K-neighbor set quickly.
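The two-stage search described above can be sketched as follows. The text does not say which hash function is used, so random sign projections serve as a stand-in here. The helper names (`make_hasher`, `two_stage_neighbors`), the use of one normalized spectrum per recording, and the direction of the cross entropy are likewise assumptions made for illustration.

```python
import heapq
import math
import random


def normalize(vec):
    """Scale a nonnegative vector so that it sums to one."""
    total = sum(vec)
    return [v / total for v in vec]


def cross_entropy(p, q, eps=1e-12):
    """Cross entropy H(p, q) between two normalized spectra."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))


def make_hasher(dim, n_bits, seed=0):
    """Return a function that maps a vector to an n_bits-bit integer code.

    Stand-in hash (an assumption): each bit is the sign of a random
    projection of the mean-centered vector.
    """
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def hash_code(vec):
        mean = sum(vec) / len(vec)
        code = 0
        for plane in planes:
            dot = sum(w * (v - mean) for w, v in zip(plane, vec))
            code = (code << 1) | (1 if dot >= 0 else 0)
        return code

    return hash_code


def hamming(a, b):
    """Hamming distance between two integer codes: XOR, then count set bits."""
    return bin(a ^ b).count("1")


def two_stage_neighbors(source, recordings, codes, hash_code, k):
    """Return indices of the k recordings nearest to `source`.

    Stage 1 shortlists 3k candidates by Hamming distance on precomputed
    codes; stage 2 ranks only that shortlist by cross entropy.
    """
    n = len(recordings)
    source_code = hash_code(source)
    shortlist = heapq.nsmallest(
        min(3 * k, n), range(n), key=lambda i: hamming(source_code, codes[i])
    )
    return heapq.nsmallest(
        k, shortlist, key=lambda i: cross_entropy(source, recordings[i])
    )
```

The codes of the recordings are computed once and reused, so each search costs one hash of the current source estimate, N cheap bitwise comparisons, and only 3K evaluations of the expensive distance.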
Check out how they sound!
- Naive median filtering of magnitude spectrograms (wav)
- The results from the proposed method (wav)
- The worst recording (wav)
- "Collaborative Audio Enhancement: Crowdsourced Audio Recording" (2014 NIPS Workshop on Crowdsourcing and Machine Learning)
- "Efficient Neighborhood-Based Topic Modeling for Collaborative Audio Enhancement on Massive Crowdsourced Recordings," ICASSP 2016