In this project, I want to do come up with a deflation method for NMF, where we learn the NMF basis vectors one by one in the order of importance. See my paper for more details:
Minje Kim and Paris Smaragdis (2014), "Efficient Model Selection for Speech Enhancement Using a Deflation Method for Nonnegative Matrix Factorization," in Proceedings of the IEEE Global Conference on Signal and Information Processing (Global SIP), Atlanta, GA, December 3-5, 2014.

Nonnegative Matrix Factorization (NMF) is a useful tool in source separation / speech enhancement applications, particularly if we have one and the only one recording of all sources mixed up. One of the big challenges in NMF-based source separation systems is though, that we have to start from a fixed model complexity, or the number of basis vectors. This limitation can cause some problems from time to time when it comes to the sources without a priori information.

For example, say that we analyze a music signal of three notes, and let's assume that we don't know the number of notes in there in advance. If we run the good old NMF multiplicative update rules with the three assumed bases, we learn this result: So far so good. What if we made a wrong guess and learned the model with 2 bases? We get this: What about only 1 basis vector? The thing is, NMF tries its best to approximate the input matrix with a product of smaller factor matrices, W and H, and during the procedure, it doesn't care if the approximation exceeds the input values. In the above example with only one basis, we see that the learned basis vector is actually an average of the spectrogram along the time axis, and the reconstruction matrix generates some spurious harmonics peaks here and there, which make the reconstruction bigger than the input at some TF bins.

What I'm trying to say is that the NMF results from the rank-3 assumption (I mean three basis vectors to learn) are not the sum of three different rank-1 NMF runs, each of which is with only one basis vector. That's why we need a good guess about the number of components in the first place. Otherwise, we cannot reuse the NMF results from wrong guesses. What a waste of time.

In this project, I want to come up with a deflation method for NMF, where we learn the NMF basis vectors one by one in the order of importance. For example in the 3-notes example, we first learn a basis for the third and the longest note, and then the second longest one, and finally the first and the shortest one: Note that the residual at each round is fed to the input to the next round, as if it was a new nonnegative input spectrogram. Another interesting example can be found in feature extraction from images of human faces. In the figures above, we see that the five input pictures are from a same guy who likes to wear sunglasses. If we apply the deflation NMF method to the input matrix of five pictures (we vectorize each image into a long column vector to make up the input matrix of five columns), we see that in the first learned component we recover some holistic face image of this guy. Since he likes to wear sunglasses, and the sunglasses are represented with black pixels in the images, we see that the first component is with those black pixels by default. Having the second component as the second most important thing to cover the difference in the facial expressions, the third component finally learns the bare eyes that are represented as the residual in the first image after subtracting the most significant components. Note that this kind of order of components is not available from the ordinary NMF runs.

We extend this concept to the semi-supervised speech enhancement case where we don't know the characteristics about the noise. To learn the noise bases, a common technique is to learn them from the test signals, with a magically guessed number of components (that's actually the only data we have about the noise). With the proposed deflation method however, we can start from a small number of noise bases, and if we're not satisfied, we can keep adding more bases until the system models the test signal with a small enough error. This is of course a more complicated procedure than an oracle NMF model with the best chosen number of components, but is less complicated than the worst case scenario where we have to search for the best number of components through different runs of the standard NMF. Another thing is that the deflation model can be even better than the globally optimal oracle NMF results, because there is no guarantee that the globally optimal number of noise components might be the best locally, when it comes to the non-stationary noise. For example, if the interfering train noise stops at some point, there's no point learning noise bases at that time. The deflation NMF model can handle this situation better.

The deflation method somehow corresponds to the eigendecomposition-based techniques, such as PCA, where we naturally know the order of importance using eigenvalues. It is a greedy method that extracts the most important component first, while leaving a NONNEGATIVE residual, so that the system can pursue another component if it wishes. During the optimization, therefore, we assume an additional term to capture the nonnegative residual part along with the ordinary factor matrices we learn, something like D(X|WH+R), where D is any error function we use.

The explicit use of the nonnegative residual term R prevents the approximation done by WH from exceeding X. This is the concept called under-approximation. The deflation method for NMF has an interesting connection to the Hierarchical Dirichlet Process, where a topic model is learned with potentially infinite number of components, also with the order of importance. However, the deflation method for NMF has its own merit, such as the freedom of choosing any beta-divergence, on top of the beta=1 case that corresponds to the probabilistic topic models. Also, the deflation method is more flexible in choosing the model complexity by allowing the system to stop adding components at any point, while sacrificing the approximation performance due to its greedy nature.