Have you ever wished you were a good singer? Some people believe it's a natural ability that one can never acquire through practice (like my wife, who's a natural-born good singer and looks down on me in that regard). I disagree with her, but I admit that I'm not a good singer and have failed to improve my singing my entire life so far. So instead, I decided to get some help from AI to improve my singing.


My group SAIGE recently developed a deep learning system called "Deep Autotuner," which takes an out-of-tune singing voice as its input and spits out an estimated in-tune version. How does it know how much a sung melody is out of tune? Well, since we humans can catch it even in songs we have never heard before, I suspect that the judgment is based on a comparison between the main melody and the accompaniment. If the singing voice (or any other instrument) strays from the harmony, our brains recognize the mismatch as dissonance. Therefore, our deep learning system is trained to map an out-of-tune singing voice signal and its accompaniment signal to the in-tune version, i.e., to estimate the amount of pitch shift needed for the correction.
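To make "the amount of pitch shift" a bit more concrete: a frame-wise pitch deviation can be expressed in semitones by comparing the sung fundamental frequency against a reference pitch, as in the tiny sketch below (the frequency values are made up purely for illustration).

```python
import numpy as np

# A frame-wise reference melody and a slightly sharp sung version (Hz; made-up values).
f0_reference = np.array([220.0, 220.0, 246.9, 261.6])
f0_sung      = np.array([224.0, 226.0, 250.0, 268.0])

# Pitch deviation in semitones: 12 * log2(sung / reference).
deviation = 12.0 * np.log2(f0_sung / f0_reference)

# The correction an autotuner should apply is the opposite of the deviation.
print(np.round(-deviation, 2))   # [-0.31 -0.47 -0.22 -0.42]
```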


We were very excited in the beginning, because it's a cool idea--a fully data-driven approach to autotuning, a true AUTOtuning system! However, it turned out to be difficult to achieve good performance due to the lack of data: we basically need a very good singing-voice performance as the target of the network, as well as an out-of-tune version of it to simulate a user's singing. Luckily for us, Smule, Inc. became interested in this problem and kindly provided us with their data and advice during Sanna's internship there.
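For the curious, here is a rough sketch of how such a detuned input/target pair could be generated with librosa; the file name, the single constant shift per clip, and the uniform sampling of the detuning amount are illustrative assumptions, not the exact corruption procedure behind our training set.

```python
import numpy as np
import librosa

# Load an in-tune vocal take (illustrative file name).
vocal, sr = librosa.load("vocal_in_tune.wav", sr=22050, mono=True)

# Draw a random detuning amount, up to one semitone in either direction.
# (A single constant shift per clip is a simplification; real out-of-tune
# singing drifts over time.)
shift = np.random.uniform(-1.0, 1.0)

# The detuned copy plays the role of the "user's singing" fed to the network ...
vocal_detuned = librosa.effects.pitch_shift(vocal, sr=sr, n_steps=shift)

# ... and the correction the network should learn is simply the opposite shift.
target_correction = -shift
```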


Smule's Intonation dataset includes 4,702 quality singing-voice tracks by 3,556 unique singers, accompanied by 474 unique arrangements. Using these as targets, we created an artificially corrupted version of each track, detuning the original singing voice by up to one semitone, to serve as the simulated input to the system. We adopted a CNN+GRU network architecture operating on constant-Q transform (CQT) features.
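To give a more concrete picture of that architecture, below is a minimal PyTorch sketch of a CNN+GRU regressor that takes the CQTs of the (detuned) vocal and the accompaniment as a two-channel input and predicts a per-frame pitch shift. The layer sizes, the pooling, and the regression head are placeholders for illustration, not the exact network or training setup from the paper.

```python
import torch
import torch.nn as nn

class AutotunerSketch(nn.Module):
    """Toy CNN+GRU regressor: two-channel CQT in, per-frame pitch-shift estimate out."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        # Two input channels: CQT of the (detuned) vocal and CQT of the accompaniment.
        self.cnn = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),      # collapse the frequency axis
        )
        self.gru = nn.GRU(input_size=32, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)          # pitch shift (in semitones) per frame

    def forward(self, cqt_pair):
        # cqt_pair: (batch, 2, n_bins, n_frames), e.g. from librosa.cqt on each signal
        feats = self.cnn(cqt_pair)                # (batch, 32, 1, n_frames)
        feats = feats.squeeze(2).transpose(1, 2)  # (batch, n_frames, 32)
        out, _ = self.gru(feats)                  # (batch, n_frames, hidden)
        return self.head(out).squeeze(-1)         # (batch, n_frames)

model = AutotunerSketch()
dummy = torch.randn(1, 2, 84, 200)                # one clip: 84 CQT bins, 200 frames
print(model(dummy).shape)                         # torch.Size([1, 200])
```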


The figure below shows the pitch contours over frames. Note how the green curve, the deep-autotuned pitch correction, stays closer to the original pitch contour (black line) than the out-of-tune singing (red) does.
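If you want to produce a similar plot from your own recordings, here is a rough sketch using librosa's pYIN pitch tracker and matplotlib; the file names are placeholders, and pYIN is just one convenient monophonic pitch tracker, not necessarily the one behind the figure above.

```python
import librosa
import matplotlib.pyplot as plt

def f0_contour(path, sr=22050):
    """Estimate a monophonic pitch contour (Hz per frame) with pYIN."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C6"), sr=sr)
    return f0

# Placeholder file names for three versions of the same performance.
versions = [("target.wav", "target (in tune)", "black"),
            ("input_detuned.wav", "input (out of tune)", "red"),
            ("autotuned.wav", "deep autotuned", "green")]

for path, label, color in versions:
    plt.plot(f0_contour(path), label=label, color=color)

plt.xlabel("frame")
plt.ylabel("F0 (Hz)")
plt.legend()
plt.show()
```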


Please check out the audio demos below. They're still a bit robotic and noisy, mainly due to the phase synthesis part, but you can hear that the deep autotuner is up and running!


    Example 1

    • Input out-of-tune singing
    • Deep autotuned singing
    • Target singing (ground-truth)

    Example 2

    • Input out-of-tune singing
    • Deep autotuned singing
    • Target singing (ground-truth)

    Example 3

    • Input out-of-tune singing
    • Deep autotuned singing
    • Target singing (ground-truth)

References

Check out our paper about this:
Sanna Wager, George Tzanetakis, Cheng-i Wang, Lijiang Guo, Aswin Sivaraman, and Minje Kim, "Deep autotuner: A data-driven approach to natural-sounding pitch correction for singing voice in karaoke performances," [arXiv:1902.00956v1]

The Intonation dataset:
Sanna Wager, George Tzanetakis, Stefan Sullivan, Cheng-i Wang, John Shimmin, Minje Kim, and Perry Cook, "Intonation: A Dataset of Quality Vocal Performances Refined by Spectral Clustering on Pitch Congruence," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, UK, May 12-17, 2019. [pdf]