There are many ways to compress a neural network so that it runs with lower time and space complexity. One is to encode the weights with significantly fewer quantization levels, as in bitwise neural networks [Kim and Smaragdis 2015]. But quantization is not the only approach. In this project, we took a more fundamental, audio-centric approach to the problem. The main idea is to tweak the cost function of a DNN (or, in fact, of any machine learning model) so that the network focuses on the dimensions that are perceptually meaningful (i.e., audible) and does not work as hard on the ones that are not. For example, suppose a neural network produces an \(F\)-dimensional cleaned-up speech spectrum \(\hat{M}_{:,t}\in\mathbb{R}^{F\times 1}\) (or an ideal masking vector) for frame \(t\). A typical sum-of-squared-error cost function would then define the error between the prediction and the ground truth \(M_{:,t}\) as \(\frac{1}{2}\sum_f (M_{f,t}-\hat{M}_{f,t})^2\). Under this cost, every frequency subband is equally important to the network. But is that a reasonable assumption?
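As a point of reference, the unweighted cost for a single frame can be sketched in plain Python. This is only an illustration: the function name `sse_cost` and the 4-bin toy vectors are made up for this example and are not taken from the project code.

```python
def sse_cost(target, estimate):
    """Unweighted sum of squared errors for one frame:
    0.5 * sum_f (M[f] - M_hat[f]) ** 2
    """
    if len(target) != len(estimate):
        raise ValueError("target and estimate must have the same number of bins")
    return 0.5 * sum((m - m_hat) ** 2 for m, m_hat in zip(target, estimate))


# Hypothetical 4-bin frame (F = 4); the values are arbitrary.
M_t = [1.0, 0.5, 0.0, 0.25]
M_hat_t = [0.5, 0.5, 0.5, 0.25]

# Bins 0 and 2 each contribute 0.5 * 0.25, regardless of which bin they are.
cost = sse_cost(M_t, M_hat_t)
```

An error of 0.5 costs the same in bin 0 as in bin 2: nothing in this cost tells the network which bins matter to a listener.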

We think the cost function can be defined better. Psychoacoustics says that not every frequency subband is equally important. One example is the simultaneous masking effect, in which a peak in the spectrum masks nearby, softer sound components. From the neural network's perspective, this raises an important question: why bother reducing an error that is inaudible anyway?

In the figure, the blue curve is the actual energy of the signal over frequency. The simultaneous masking effect determines the global masking threshold (green dotted line), which acts as a masker: any sound component that falls under this curve is inaudible. So, in this figure, the blue shaded area is the only audible part, because there the signal's energy is higher than the threshold. By calculating the energy ratio between the actual power spectral density of the signal and the global masking threshold, we obtain a weight vector \(H_{:,t}\in\mathbb{R}^{F\times 1}\) that tells the network not to focus too much on the frequency bins where the weights are low: \(\frac{1}{2}\sum_f H_{f,t} (M_{f,t}-\hat{M}_{f,t})^2\).

Comparison of power spectral density of an input signal against its perceptual weight matrix (H)
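The weighted cost can be sketched in a few lines of Python. The text above only says that the weights come from the energy ratio between the power spectral density and the global masking threshold; the specific form used here, \(\log_{10}(1 + P_f/G_f)\) on linear power, is an assumption chosen for illustration, as are the function names and the toy dB values. Computing the masking threshold itself requires a psychoacoustic model and is not shown.

```python
import math


def perceptual_weights(psd_db, threshold_db):
    """Illustrative weights from the PSD-to-threshold energy ratio.

    ASSUMPTION: H[f] = log10(1 + P[f] / G[f]) with P and G in linear power.
    The exact formula is not given in the text above.
    """
    weights = []
    for p_db, g_db in zip(psd_db, threshold_db):
        ratio = 10.0 ** ((p_db - g_db) / 10.0)  # dB difference -> power ratio
        weights.append(math.log10(1.0 + ratio))
    return weights


def weighted_sse_cost(target, estimate, weights):
    """0.5 * sum_f H[f] * (M[f] - M_hat[f]) ** 2 for one frame."""
    return 0.5 * sum(
        h * (m - m_hat) ** 2 for h, m, m_hat in zip(weights, target, estimate)
    )


# Hypothetical 2-bin frame: bin 0 is 30 dB above the threshold (audible),
# bin 1 is 20 dB below it (masked).
psd_db = [60.0, 20.0]
threshold_db = [30.0, 40.0]
H_t = perceptual_weights(psd_db, threshold_db)

M_t = [1.0, 1.0]
audible_only = weighted_sse_cost(M_t, [0.0, 1.0], H_t)  # error only in bin 0
masked_only = weighted_sse_cost(M_t, [1.0, 0.0], H_t)   # error only in bin 1
```

With these toy numbers, the same unit error is penalized far more in the audible bin than in the masked one, which is the behavior the weighted cost is meant to produce.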

This relaxes the optimization problem, because during training the network is allowed to make more error wherever \(H_{f,t}\) is small. We use this observation to reduce the structural complexity of DNNs. For example, in the speech denoising scenario we can reduce the number of layers and hidden units and still produce audio output of perceptually similar quality to that of an ordinary DNN. Note that the network does make more objective error. Our argument is that the increased error won't be audible.
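One way to see the relaxation is through the gradient. Differentiating \(\frac{1}{2}\sum_f H_{f,t} (M_{f,t}-\hat{M}_{f,t})^2\) with respect to \(\hat{M}_{f,t}\) gives \(H_{f,t}(\hat{M}_{f,t}-M_{f,t})\), so the training signal in each bin is scaled by its weight. The sketch below uses a made-up function name and made-up weights purely to illustrate this.

```python
def weighted_sse_grad(target, estimate, weights):
    """Gradient of 0.5 * sum_f H[f] * (M[f] - M_hat[f]) ** 2
    with respect to each M_hat[f], i.e. H[f] * (M_hat[f] - M[f]).
    """
    return [h * (m_hat - m) for h, m, m_hat in zip(weights, target, estimate)]


# Hypothetical frame with the same raw error in both bins,
# but a large weight in bin 0 and a tiny one in bin 1.
M_t = [1.0, 1.0]
M_hat_t = [0.0, 0.0]
H_t = [2.0, 0.01]

grad = weighted_sse_grad(M_t, M_hat_t, H_t)
```

The bin with the small weight contributes a gradient 200 times smaller for the same error, so the optimizer spends little capacity on it.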

As expected, the Signal-to-Distortion metric doesn't differentiate between the models with and without the psychoacoustic weights. However, the perceptual source separation metric shows a clear advantage for the psychoacoustic weights, because the overall perceptual score of the weighted model doesn't decrease compared to the ordinary model with no weighting.

### Reference

Check out our recent paper about this project: Kai Zhen, Aswin Sivaraman, Jongmo Sung, and Minje Kim, "On Psychoacoustically Weighted Cost Functions Towards Resource-Efficient Deep Neural Networks for Speech Denoising," (under review).

- Minje Kim and Paris Smaragdis, "Bitwise Neural Networks," *International Conference on Machine Learning (ICML) Workshop on Resource-Efficient Machine Learning*, Lille, France, Jul. 6-11, 2015.