Audio Coding for Machines

Machine-Learned Latent Features Are Codes for That Machine!

When we think about compressing sound, we usually imagine MP3s or AACs, or, these days, neural codecs that push the compression ratio even higher. These codecs are designed so that music and speech still sound great to humans. But what if the listener isn’t human at all? What if it’s a speech recognizer, a sound classifier, or a language model that never needs to “hear” in the human sense?

That’s exactly the question this paper asks.

Our new work, “Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine,” presented at WASPAA 2025, is based on this concept, called “audio coding for machines (ACoM).” In the paper, we explore how to compress audio not for people but for machines, i.e., ML models trained to do specific tasks, so perceptual quality is no longer an important property to preserve in the coded audio signals.

The Background: Data Transmission in the Distributed Computing Environment

In many real-world speech and sound applications, the model that makes sense of audio isn’t running entirely on one machine. Instead, part of it lives on a device (like your phone, a smart speaker, or a tiny sensor), and the rest runs in the cloud, where more powerful computers handle the heavy lifting. To make that possible, we need a way to send data efficiently between the two. That’s where data transmission comes in.

Conventional methods.

The first figure shows the conventional setup: a codec compresses the raw waveform on the device, and the decoded audio is fed to a downstream model in the cloud (for example, an automatic speech recognizer or sound classifier). But this approach is inefficient: the system wastes bandwidth sending details that humans care about but the model doesn’t actually need, and it wastes computation, both on codec encoding that duplicates what the first few layers of the cloud model do anyway and on decoding audio that will immediately be reprocessed into features.
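To make the inefficiency concrete, here is a minimal, purely illustrative sketch of that conventional pipeline in PyTorch. The TinyCodec and TinyCloudModel classes below are made-up stand-ins, not the codecs or models used in the paper.

```python
# Illustrative sketch of the conventional device-to-cloud pipeline (not the paper's code).
import torch
import torch.nn as nn

class TinyCodec(nn.Module):
    """Stand-in for a perceptual codec: waveform -> compact latent -> waveform."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Conv1d(1, dim, kernel_size=320, stride=160)
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=320, stride=160)

class TinyCloudModel(nn.Module):
    """Stand-in downstream model (e.g., a sound classifier) with its own feature frontend."""
    def __init__(self, dim=64, n_classes=10):
        super().__init__()
        self.frontend = nn.Conv1d(1, dim, kernel_size=320, stride=160)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, wav):
        feats = self.frontend(wav).mean(dim=-1)   # re-extracts features from the decoded audio
        return self.head(feats)

wav = torch.randn(1, 1, 16000)                    # one second of 16 kHz audio
codec, cloud_model = TinyCodec(), TinyCloudModel()

latent = codec.encoder(wav)                       # device: compress for human-quality playback
decoded_wav = codec.decoder(latent)               # cloud: reconstruct audio nobody will listen to
logits = cloud_model(decoded_wav)                 # cloud: immediately re-encode it into task features
```

Notice that the codec’s encoder and the cloud model’s frontend do essentially the same kind of work on the same signal; that redundancy is exactly what ACoM removes.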

From Audio for Humans to Audio for Machines

Traditional codecs, as well as popular neural audio codecs like SoundStream, EnCodec, or DAC, are amazing at reconstructing high-quality audio waveforms. They’re the engines behind many generative models in speech and music, turning continuous signals into discrete “tokens” that large models can process.

However, these codecs are still optimized for perceptual quality. They work hard to preserve every subtle tone and nuance that a person might notice. That’s great for streaming and communication, but not necessarily for automatic speech recognition (ASR) or audio classification (AC). Machines don’t need the subtle timbre difference or the warmth of a singer’s voice if those details don’t help the task.

So instead of compressing what humans hear, we designed a codec that compresses the signal more aggressively, retaining only what machines need.

The Idea: Use the Model’s Own Features as the Codec

The proposed coding scenario.

The second figure shows the smarter way: our method, based on the ACoM concept. Instead of running the entire encoding-decoding pipeline of a dedicated codec, the cloud model offloads some of its early layers to the device. That way, the device runs only those early layers and extracts compact feature tokens directly from them. These tokens already contain all the information the cloud model needs for its task, so we can skip the entire codec stage. The result: less data to transmit, faster inference, and a design that’s much more aligned with how modern distributed AI systems actually work.
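Here is a minimal sketch of what that split might look like, again using a toy PyTorch model of our own invention rather than the paper’s actual architecture: the first few layers run on the device, and only their output travels to the cloud.

```python
# Illustrative split of a pretrained downstream model (toy architecture, not the paper's).
import torch
import torch.nn as nn

class TinyDownstreamModel(nn.Module):
    def __init__(self, dim=64, n_classes=10):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=320, stride=160), nn.ReLU(),   # early layers
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),    # later layers
        )
        self.head = nn.Linear(dim, n_classes)

model = TinyDownstreamModel()          # assume this is already trained for the task
split_at = 2                           # how many modules stay on the device (a design choice)

device_part = model.layers[:split_at]  # runs on the phone/sensor: waveform -> task features
cloud_part = model.layers[split_at:]   # runs in the cloud: features -> task output

wav = torch.randn(1, 1, 16000)
device_features = device_part(wav)     # this (after quantization) is what gets transmitted
logits = model.head(cloud_part(device_features).mean(dim=-1))
```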

Here is the recipe: imagine you already have a trained deep-learning model for speech or sound. Somewhere in the middle of that network, there is a feature representation, learned by an internal layer, that captures exactly the information the model needs to perform its job. Our method simply repurposes that layer.

We insert a residual vector quantizer (RVQ) right there, turning the model’s continuous features into compact discrete tokens. These tokens can then be transmitted to the cloud or another system that runs the rest of the model. No separate audio encoder or decoder is needed.
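As a concrete illustration, here is a bare-bones residual vector quantizer in PyTorch. The codebooks below are random and their count and size are hypothetical; in practice the quantizer would be trained so that the rest of the model still performs well on the dequantized features.

```python
# Bare-bones residual vector quantization (RVQ) sketch with random codebooks.
# Codebook count/size and frame counts are hypothetical, chosen only for illustration.
import torch

def rvq_encode(features, codebooks):
    """features: (frames, dim); codebooks: list of (codebook_size, dim) tensors.
    Returns integer token indices with shape (num_codebooks, frames)."""
    residual = features
    indices = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)     # distance to every codeword
        idx = dists.argmin(dim=-1)            # pick the nearest codeword per frame
        indices.append(idx)
        residual = residual - cb[idx]         # the next stage quantizes what is left over
    return torch.stack(indices)

def rvq_decode(indices, codebooks):
    """Rebuilds approximate continuous features by summing the selected codewords."""
    return sum(cb[idx] for cb, idx in zip(codebooks, indices))

dim, frames = 64, 99
codebooks = [torch.randn(1024, dim) for _ in range(2)]   # 2 stages x 1024 codewords
feats = torch.randn(frames, dim)                         # device-side features for ~1 s of audio
tokens = rvq_encode(feats, codebooks)                    # these small integers are the bitstream
feats_hat = rvq_decode(tokens, codebooks)                # the cloud resumes the model from here
```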

This way:

  • The early part of the model acts like the encoder.
  • The quantization step makes it transmittable.
  • The later part of the model acts like the decoder, but only for machine tasks — not for reconstructing sound.

This approach removes redundant processing steps and can drastically reduce the bitrate, often to below 200 bits per second, far less than what a perceptual codec needs!
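For intuition, here is a rough back-of-the-envelope calculation with hypothetical numbers, not the exact configuration from the paper: the bitrate of the token stream is simply codebooks × bits per codebook × frames per second.

```python
import math

# Hypothetical token-stream parameters, for illustration only.
num_codebooks = 2        # RVQ stages actually transmitted
codebook_size = 1024     # 2**10 entries -> 10 bits per stage
frame_rate_hz = 10       # quantized feature frames sent per second

bitrate_bps = num_codebooks * math.log2(codebook_size) * frame_rate_hz
print(bitrate_bps)       # 200.0 bits per second
```

Transmitting fewer RVQ stages or lowering the feature frame rate scales the bitrate down linearly, which is how rates in the low hundreds of bits per second become reachable.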

Why It Matters

This split-model setup fits naturally with modern AI pipelines where part of the computation happens on a device (like a phone or sensor) and the rest in the cloud.

Our codec helps with three big goals:

  1. Lower bitrate: Faster, cheaper data transmission.
  2. Lighter on devices: No need for a heavy neural codec.
  3. Task-oriented: Every bit of data serves the model’s purpose.

It’s a win for edge AI applications, from voice assistants to IoT sound sensors, that need to send data efficiently without sacrificing accuracy.

Experiments: Speech and Sound, Both Covered

We tested our approach on two popular benchmarks — speech recognition (ASR) using LibriSpeech, and audio classification (AC) using UrbanSound8K — to see whether our “coding for machines” idea could really shrink data without hurting accuracy.

ASR experiments

In the ASR results, the conventional setup using a separate codec (blue bars) struggled when the input was noisy and required higher bitrates to maintain accuracy. In contrast, our ACoM-based models (green and red bars) achieved nearly the same word error rates as the original full model while transmitting data at under 500 bits per second, more than a tenfold reduction in bitrate. Even the ultra-compressed version (purple bars, around 130–150 bps) maintained reasonable recognition accuracy with far lower on-device computation.

Audio classification experiments

The AC results show a similar trend. When we replaced the traditional codec with ACoM, the models reached or even slightly improved classification accuracy (around 80%) while using just a fraction of the bitrate. At the same time, on-device computational cost dropped by more than 99% compared to conventional codec pipelines.

In short, ACoM keeps the accuracy of the system, but compresses away unnecessary data, enabling ultra-efficient, edge-to-cloud AI that communicates in the language of machines, not humans.

Takeaway: Machines Don’t Need to Hear Like Humans

This study reinforces a simple but powerful insight: When designing for machines, we don’t need to mimic human perception. We can instead optimize for what machines actually need to understand and act on. By embedding quantization directly into the model, we can make AI systems more efficient, adaptable, and scalable, especially in scenarios where bandwidth or compute is limited.

Read More

Anastasia Kuznetsova, Inseon Jang, Wootaek Lim, and Minje Kim
“Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine,”
in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
Tahoe City, CA, Oct. 12-15, 2025.
[pdf, code]

The Project Is Open-sourced

GitHub repo: https://github.com/ana-kuznetsova/speechbrain/tree/develop