### Motivation: Why Binaries?

What if you want to do real-time face detection or recognition using a deep learning system running on a pair of glasses? What if you want to listen to what's going on around you by letting your cellphone record all the audio signals and having a neural network analyze them 24/7? What if you want your smart watch to be actually smart and do some intelligent jobs for you by recognizing events from all its sensory signals using a lightweight AI running on it? Overall, my question is: can small devices afford a deep neural network under all those constraints, such as limited battery, memory, and clock speed? If not, what can we do to make it happen?

Bitwise Neural Networks (BNN) are an extremely compact, yet powerful, kind of neural network. We are all accustomed to the complexity of training a deep neural network, and willing to supply expensive GPU cards for faster backpropagation on big training data. However, the technology we need for efficient forwardprop on a very small device is different from that. A BNN is defined entirely by binary numbers. I mean, all binaries: its input signals, output signals, hidden unit outputs, weights, and biases. It's different from a discretization or quantization technique that tries to represent a real-valued number with a set of binaries, as in floating-point or fixed-point encodings. Instead, in a BNN a real-valued number is replaced with a single bit.

Why binaries? First, binaries are cheap to store. If you compare a 64-bit floating-point representation, or a very efficient 7-bit fixed-point encoding, with a single bit, it's apparent that the single-bit representation uses 1/64 or 1/7 of the space the other representations occupy. Second, binaries are cheap to compute. When it comes to bipolar binaries (-1 and +1), a multiplication can be done with a single gate, XNOR. Addition corresponds to counting ones. Activation functions? We use the sign() function, which in a BNN is nothing but a comparison-based thresholding. Therefore, the gate logic is extremely simple, and the computation is really fast without having to use a lot of power.
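To make the XNOR-and-count idea concrete, here is a minimal sketch of a single bipolar neuron in Python. The function names (`pack_bipolar`, `binary_neuron`), the bit encoding (+1 stored as 1, -1 stored as 0), and the convention that sign(0) is +1 are my assumptions for illustration, not something fixed by the text above.

```python
def pack_bipolar(values):
    """Pack a list of -1/+1 values into one integer (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_neuron(x_bits, w_bits, n, bias=0):
    """Compute sign(sum_i x_i * w_i + bias) for n bipolar inputs.

    Assumption: sign(0) is taken to be +1.
    """
    mask = (1 << n) - 1
    agree = ~(x_bits ^ w_bits) & mask       # XNOR: bit is 1 where x_i * w_i == +1
    ones = bin(agree).count("1")            # counting ones replaces the additions
    pre_activation = 2 * ones - n + bias    # (#agreements) - (#disagreements) + bias
    return 1 if pre_activation >= 0 else -1  # sign() is just a comparison


x = [1, -1, 1, 1, -1]
w = [1, 1, 1, -1, -1]
y = binary_neuron(pack_bipolar(x), pack_bipolar(w), len(x))
```

The bipolar products here are 1, -1, 1, -1, 1, so the sum is 1 and `y` is +1, which matches what an ordinary dot product followed by sign() would give.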

*A forwardprop in a real-valued network versus in a BNN*

*Complexities of a floating-point network*

*Complexities of a BNN*

In theory, there exists a BNN that learns all the Boolean associations between the input and output pairs if we're allowed to use many hidden units. For instance, as for the famous XOR problem,

the binary weights are just one of the possible real-valued solutions. The performance of a bitwise network is not different from a comprehensive one, when it comes to an XOR problem.
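As a hedged illustration, here is one bipolar network that solves XOR using only -1 and +1 for every weight and bias. The particular values below (`W1`, `B1`, `W2`, `B2`) are my own choice of one valid solution and are not necessarily the ones the author had in mind; sign(0) is assumed to be +1, although no tie actually occurs here.

```python
def sign(v):
    """Bipolar thresholding; the tie-break sign(0) = +1 is an assumption."""
    return 1 if v >= 0 else -1


# Hypothetical parameters, all in {-1, +1}.
W1 = [[+1, -1],   # hidden unit 1 fires only for (x1, x2) = (+1, -1)
      [-1, +1]]   # hidden unit 2 fires only for (x1, x2) = (-1, +1)
B1 = [-1, -1]
W2 = [+1, +1]     # the output unit acts as an OR over the two hidden units
B2 = +1


def xor_bnn(x1, x2):
    """Forwardprop through the 2-2-1 bitwise network on bipolar inputs."""
    hidden = [sign(w[0] * x1 + w[1] * x2 + b) for w, b in zip(W1, B1)]
    return sign(W2[0] * hidden[0] + W2[1] * hidden[1] + B2)


truth_table = {(x1, x2): xor_bnn(x1, x2) for x1 in (-1, 1) for x2 in (-1, 1)}
```

With -1 read as false and +1 as true, `truth_table` gives +1 exactly when the two inputs differ.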

Of course, BNNs eventually need more hyperplanes to do the same complicated job, which can increase the network size. For instance, in the figure below, we need more than one binary-valued hyperplane for this linearly separable problem. But please note that a bigger BNN isn't always more complicated than a smaller real-valued network, because the per-node cost is a lot cheaper in a BNN.

### Does It Work Well?

That's all good to know, but then I can expect a reasonable question. Does it work well? Well, I'd say it all depends on how we train the network. But I don't think the learning algorithm needs to be really complicated. Long story short, I succeeded in training this kind of extreme network by just doing backprop twice. In the first round, we learn a set of network parameters whose dynamic range is bounded between -1 and +1. Then, in the second round, we initialize the network parameters from the first-round results and run the usual alternating forwardprop / backprop procedure, but we binarize all the parameters during forwardprop so that the network is aware of the harsh goal. There is some other usual stuff, such as regularization, dropout, and so on, but that's the basic story.

Binarizing the input signals is challenging, too. One can come up with a hashing technique that preserves the locality or semantics of the original data, so that the hash codes serve as a binary feature set. I've tried Winner-Take-All hashing for this, and it works reasonably well. But a usual population-based quantization also does the job: if we scatter each resulting bit string across the input nodes, so that a node takes a single bit rather than a set of bits representing a real number, the bits serve as the binary input just fine.
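The quantize-then-scatter idea for inputs can be sketched as follows. This is only an illustration under my own assumptions: I use plain uniform quantization over a known range `[lo, hi]` as a stand-in for whatever quantizer is actually used, and the function name `scatter_bits` and its parameters are hypothetical.

```python
def scatter_bits(values, n_bits=2, lo=0.0, hi=1.0):
    """Quantize each real value to an n_bits integer, then give every bit its own input node.

    Assumptions: uniform quantization over [lo, hi], most significant bit first,
    and bipolar inputs (bit 0 -> -1, bit 1 -> +1).
    """
    levels = 2 ** n_bits
    inputs = []
    for v in values:
        scaled = (min(max(v, lo), hi) - lo) / (hi - lo)   # clip, then map to [0, 1]
        code = min(int(scaled * levels), levels - 1)      # integer code in 0 .. levels - 1
        for shift in range(n_bits - 1, -1, -1):
            bit = (code >> shift) & 1
            inputs.append(2 * bit - 1)                    # one bipolar value per input node
    return inputs


binary_inputs = scatter_bits([0.0, 0.3, 0.6, 1.0], n_bits=2)
```

Four real values become eight bipolar inputs, so each input node of the network sees exactly one bit instead of a multi-bit number.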

So, once again, does it work? Yes, it does for some experiments that I tried. Actually, it works surprisingly well for MNIST (hand-written digit) classification, and works reasonably well for phoneme classification as well.
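For readers who want to see the two-round recipe from this section in code, here is a deliberately tiny sketch on a single unit. It simplifies heavily, and all of the following are my assumptions rather than the actual method: weights are bounded by plain clipping, the unit keeps a tanh output during training, there is no bias, and in the second round the gradient computed with the binarized weights is applied to the stored real-valued weights. The toy target and the helper names are hypothetical.

```python
import math
from itertools import product


def sign(v):
    return 1.0 if v >= 0 else -1.0


def clip(v):
    return max(-1.0, min(1.0, v))


def train(data, weights, binarize, epochs=50, lr=0.1):
    """SGD on one tanh unit; stored weights stay real-valued and are clipped to [-1, +1]."""
    w = list(weights)
    for _ in range(epochs):
        for x, target in data:
            # Round 2 only: forwardprop sees the binarized copy of the weights.
            used = [sign(wi) for wi in w] if binarize else w
            out = math.tanh(sum(ui * xi for ui, xi in zip(used, x)))
            grad_pre = (out - target) * (1.0 - out * out)  # squared-error gradient at the pre-activation
            # Assumption: the update is applied to the stored real-valued weights.
            w = [clip(wi - lr * grad_pre * xi) for wi, xi in zip(w, x)]
    return w


# Toy bipolar problem that a single bitwise unit can represent exactly.
data = [(x, sign(x[0] - x[1] + x[2])) for x in product([-1.0, 1.0], repeat=3)]

round1 = train(data, [0.0, 0.0, 0.0], binarize=False)  # round 1: bounded real-valued weights
round2 = train(data, round1, binarize=True)            # round 2: binarized forwardprop
binary_weights = [sign(wi) for wi in round2]           # what the deployed unit would store
```

On this toy problem the final binarized weights classify all eight input patterns correctly; a real network would add hidden layers, biases, binarized activations, and the usual regularization.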

### BNN for Speech Denoising

As deep learning becomes popular in audio research as well, I went ahead and built a bitwise denoising autoencoder, which takes binarized magnitude spectra of a noisy signal as its input and produces their cleaned-up versions. For example, for a given magnitude Fourier coefficient, we can quantize it into a 2-bit integer, and then spread those two bits over two input nodes rather than feeding the integer to a single node. Therefore, a spectrum with N coefficients is handled with 2N input units in this BNN.

As for the output, it's straightforward to train the network to learn the Ideal Binary Masks, which are a very natural binary representation that indicates whether a spectrum element belongs to the desired source or not. I trained this network on speech from 10 random TIMIT speakers mixed with 4 different non-stationary noise types. Then, the network was tested on another set of 5 random speakers with one different type of noise. The network has 2 hidden layers, each of which has 2048 hidden units. The results from this fully binary bitwise denoising autoencoder give very decent separation quality (around 10dB in SDR, 25dB in SIR, and 10dB in SAR), and they sound like this: