Neural Speech and Audio Coding: A Research Thread
Over the past several years, my group has been exploring a common question from multiple angles: how should we design neural speech and audio codecs that are both perceptually meaningful and computationally efficient? Neural audio coding aims to compress audio into compact representations that can be faithfully reconstructed, but doing this well requires rethinking loss functions, model structure, quantization, and even the role of source structure in the signal. Below is a curated set of related projects that together form this research thread.
ACoM: Audio Coding for Machines (WASPAA 2025)
ACoM (Audio Coding for Machines) broadens the scope beyond human listening. Instead of optimizing codecs solely for perceptual quality, this work asks what happens when the downstream consumer is another machine learning model. The project explores representations that remain compact yet maximally useful for machine perception tasks, reflecting the emerging shift from “audio for humans” to audio as a tokenized modality for AI systems. For more information, please visit this project page.
From Hallucination to Articulation: Language Model-Based Losses for Neural Speech Coding (ICASSP 2026)
Ultra-low-bitrate neural speech codecs have become surprisingly good at producing clean and natural audio—but when pushed too far, they can make a subtle but serious mistake: saying the wrong thing. This paper focuses on that failure mode, which we call phoneme hallucination, where the decoded speech sounds plausible yet deviates linguistically from the original. To address this, we introduce language-model-driven losses that inject explicit linguistic awareness into codec training. By using pretrained ASR and text–speech representation models as semantic “judges,” the codec learns not just to sound good, but to preserve what was actually said. Experiments show that these losses significantly improve semantic faithfulness at extremely low bitrates, without sacrificing perceptual quality or changing the runtime codec architecture. For more information, please visit this project page.
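As a rough illustration of how a semantic "judge" can be folded into codec training, the sketch below combines a waveform reconstruction term with a cosine-distance penalty between embeddings of the reference and decoded signals. The `toy_embed` function is a pure stand-in for the pretrained ASR and text–speech models used in the paper, and `semantic_codec_loss` and `alpha` are hypothetical names, not the paper's actual formulation:

```python
import numpy as np

def semantic_codec_loss(x, x_hat, embed, alpha=1.0):
    """Toy combined loss: waveform MSE plus a semantic term penalizing
    drift between embeddings of the reference and decoded signals
    (a stand-in for an ASR/LM-based 'judge')."""
    recon = np.mean((x - x_hat) ** 2)
    e_ref, e_dec = embed(x), embed(x_hat)
    cos = np.dot(e_ref, e_dec) / (np.linalg.norm(e_ref) * np.linalg.norm(e_dec) + 1e-8)
    semantic = 1.0 - cos  # ~0 when the embeddings align
    return recon + alpha * semantic

def toy_embed(x, frame=64):
    """Placeholder 'semantic' embedding: time-pooled magnitude spectrum.
    A real system would use pretrained ASR features instead."""
    frames = x[: len(x) // frame * frame].reshape(-1, frame)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
loss_same = semantic_codec_loss(x, x, toy_embed)
loss_noisy = semantic_codec_loss(x, x + 0.1 * rng.standard_normal(1024), toy_embed)
```

The point of the second term is that two signals can have similar waveform error yet very different semantic content; the embedding distance catches the latter.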
LaDiffCodec: Generative De-Quantization via Latent Diffusion (ICASSP 2024)
LaDiffCodec explores a different bottleneck in neural codecs: quantization and dimension reduction. Instead of treating quantization noise as something to merely suppress, this work uses latent diffusion modeling to generatively reconstruct the lost fine details after coarse quantization. The key idea is to move beyond deterministic decoding and allow a generative prior to restore perceptually important structure. This direction connects neural audio coding with modern generative modeling and opens the door to high-fidelity reconstruction at very low bitrates. For more information, please visit this project page.
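To make the de-quantization idea concrete, here is a schematic of running a DDPM-style reverse chain starting from a coarsely quantized latent. The noise predictor below is a dummy placeholder (the actual system uses a trained latent diffusion model), and the names, schedule, and step count are all illustrative, not LaDiffCodec's configuration:

```python
import numpy as np

def quantize(z, step=0.5):
    """Coarse uniform quantization: the lossy bottleneck."""
    return np.round(z / step) * step

def reverse_step(z_t, t, denoise, betas):
    """One DDPM-style reverse step (schematic): nudge the current latent
    toward the denoiser's estimate of the clean latent."""
    beta = betas[t]
    alpha = 1.0 - beta
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    eps_hat = denoise(z_t, t)  # placeholder noise predictor
    mean = (z_t - beta / np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha)
    if t == 0:
        return mean
    return mean + np.sqrt(beta) * np.random.default_rng(t).standard_normal(z_t.shape)

# Generative de-quantization: start the reverse chain from the quantized latent
rng = np.random.default_rng(0)
z = rng.standard_normal(16)
z_q = quantize(z)
betas = np.linspace(1e-4, 0.02, 10)
dummy_denoise = lambda x, t: np.zeros_like(x)  # stand-in for the trained model
z_ref = z_q
for t in reversed(range(10)):
    z_ref = reverse_step(z_ref, t, dummy_denoise, betas)
```

The design point is that the decoder no longer has to deterministically invert a lossy map; the generative prior fills in detail that the quantizer discarded.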
Personalized Neural Speech Codec (ICASSP 2024)
Personalized Neural Speech Codec (PNSC) explores the idea of a speech codec that “knows” your voice—so it can spend its bits and model capacity more efficiently. Instead of trying to generalize to everyone, the system groups speakers into a small number of clusters, predicts which group a test utterance belongs to, and then selects the most suitable group-specific modules so the bitstream better preserves the speaker’s traits at very low bitrates. For the codec backbone, the project builds on LPCNet and focuses personalization where it matters most: the GRU-based decoder. The key result is that personalization can improve perceived quality at the same bitrate, and can even enable meaningful decoder compression (e.g., reducing GRU size) with performance that remains statistically on par with a larger baseline. For more information, please visit this project page.
Psychoacoustic Loss Functions for Neural Audio Coding (SPL 2020)
This line of work starts from a simple but fundamental observation: training neural audio codecs with generic losses (e.g., MSE or feature loss) does not align well with human perception. In PAM-NAC, we incorporate classical psychoacoustic modeling directly into the learning objective so that the network focuses on perceptually audible errors rather than uniformly minimizing signal differences. The resulting framework bridges traditional codec wisdom and modern deep learning, enabling lightweight neural codecs that better match human listening criteria while maintaining real-time feasibility. For more information, please visit this project page.
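The core idea, reweighting spectral error by a masking threshold so that inaudible errors are cheap and audible ones are expensive, can be sketched as follows. The threshold is assumed to be precomputed by a psychoacoustic model, and the function name and weighting are illustrative simplifications of the paper's objective:

```python
import numpy as np

def perceptual_weighted_loss(S_ref, S_hat, mask_threshold):
    """Weight per-frequency squared spectral error by the inverse of a
    (precomputed) masking threshold: errors buried under strong masking
    cost little, while errors in weakly masked bins cost a lot."""
    w = 1.0 / (mask_threshold + 1e-8)
    return np.mean(w * (S_ref - S_hat) ** 2)

# Toy example: identical absolute error, very different perceptual cost
S_ref = np.ones(4)
S_hat = S_ref + 0.1
loose_mask = np.full(4, 10.0)  # strong masking: error mostly inaudible
tight_mask = np.full(4, 0.1)   # weak masking: error clearly audible
```

The same 0.1 spectral error is penalized roughly 100x more under the tight mask, which is exactly the behavior a plain MSE loss cannot express.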
HARP-Net (WASPAA 2021 and ICASSP 2023)
HARP-Net (Hyper-Autoencoded Reconstruction Propagation) is a neural audio coding architecture designed to improve scalability across bitrates by fixing a core bottleneck in autoencoder codecs: once you quantize the bottleneck, information flow between encoder and decoder becomes lossy, and the decoder has to do too much guesswork. HARP-Net borrows the reconstruction-strength intuition of U-Net skip connections, but replaces U-Net’s bitrate-expensive identity shortcuts with compressed skip paths: small “skip autoencoders” that themselves quantize and code the intermediate encoder feature maps, then deliver them to the corresponding decoder layers. The final bitstream is the concatenation of the main bottleneck code and these layer-wise skip codes, which gives a clean knob for scalable coding (use more skip codes when you can afford more bits) while keeping each skip path bitrate-efficient. In listening tests on music, this “hyper-autoencoded” skip design improves perceptual quality over parameter-matched vanilla autoencoder baselines at similar bitrates, and the reported MUSHRA results show HARP-Net’s advantage clearly. The ICASSP follow-up extends this idea toward native multi-band coding, where the core band does most of the reconstruction and a low-bitrate high-band code plus bandwidth-extension-style decoding helps recover high frequencies—explicitly drawing an analogy to spectral band replication (SBR) in traditional codecs. For more information, please visit this project page.
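A minimal sketch of the bitstream layout described above: one coarse main bottleneck code plus quantized codes of intermediate encoder features. All names and quantization steps are illustrative; the real skip paths are learned skip autoencoders, not plain uniform quantizers:

```python
import numpy as np

def quantize(x, step):
    """Uniform quantization stand-in for a learned skip autoencoder."""
    return np.round(x / step) * step

def harp_like_encode(feats, bottleneck, skip_steps):
    """Schematic HARP-Net-style bitstream: a main bottleneck code plus
    layer-wise 'skip codes' of intermediate encoder feature maps.
    Including more skip codes spends more bits but gives the decoder
    more to work with."""
    main_code = quantize(bottleneck, step=0.5)
    skip_codes = [quantize(f, step=s) for f, s in zip(feats, skip_steps)]
    return main_code, skip_codes  # concatenated into the final bitstream

rng = np.random.default_rng(0)
bottleneck = rng.standard_normal(8)
feats = [rng.standard_normal(16), rng.standard_normal(32)]  # two encoder layers
main, skips = harp_like_encode(feats, bottleneck, skip_steps=[0.25, 0.25])
```

Dropping entries from `skips` is the scalability knob: fewer skip codes means fewer bits and a coarser reconstruction, without retraining a separate codec per bitrate.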
SANAC: Source-Aware Neural Audio Coding (ICASSP 2021)
SANAC (Source-Aware Neural Audio Coding) pushes neural codecs toward structure awareness. Conventional codecs treat mixtures monolithically, but many real signals contain multiple sources whose perceptual importance differs. SANAC explicitly integrates source separation and coding in the latent space so the system can allocate bits where they matter most. This source-aware perspective improves reconstruction of noisy speech mixtures and highlights the importance of semantic structure inside neural compression pipelines. For more information, please visit this project page.
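A toy illustration of source-aware bit allocation, assuming the mixture has already been split into speech and noise latents: the perceptually important speech latent gets finer quantization. The function names and bit depths are hypothetical, not SANAC's actual allocation scheme:

```python
import numpy as np

def quantize_bits(z, bits):
    """Uniform quantization of a latent to 2**bits levels on [-1, 1]."""
    levels = 2 ** bits
    zc = np.clip(z, -1.0, 1.0)
    return np.round((zc + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1

def source_aware_code(z_speech, z_noise, bits_speech=6, bits_noise=2):
    """Schematic source-aware allocation: spend more bits on the
    speech latent than on the less important noise latent."""
    return quantize_bits(z_speech, bits_speech), quantize_bits(z_noise, bits_noise)

rng = np.random.default_rng(1)
zs, zn = rng.uniform(-1, 1, 64), rng.uniform(-1, 1, 64)
qs, qn = source_aware_code(zs, zn)
err_speech = np.mean((zs - qs) ** 2)
err_noise = np.mean((zn - qn) ** 2)
```

Under a fixed total budget, shifting bits from the noise latent to the speech latent trades distortion where listeners care least for fidelity where they care most.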