From Hallucination to Articulation: Language Model-Driven Losses for Neural Speech Coding

Motivation

Ultra-low-bitrate neural speech codecs have made remarkable progress in recent years. Modern systems can operate at a few hundred bits per second while maintaining surprisingly good perceptual quality. However, as bitrates are pushed further down, a new failure mode becomes increasingly visible: phoneme hallucination (PH).

Unlike classical coding artifacts (quantization noise, muffling, etc.), phoneme hallucinations are semantic errors. The decoded speech may sound clean and natural, yet contain the wrong phoneme or word. This happens because generative decoders attempt to produce plausible speech even when the compressed representation lacks sufficient linguistic information.  

This paper asks a simple but important question:

Can we explicitly guide neural codecs with language knowledge so that they hallucinate less and articulate more faithfully?

Acoustic vs. Semantic Codecs — and the Gap Between Them

Neural speech codecs are often categorized into two families. Acoustic codecs focus on short-term waveform fidelity using reconstruction losses, while semantic codecs aim to preserve linguistic content using self-supervised speech representations, e.g., HuBERT or WavLM. Semantic codecs have achieved impressive bitrate reductions, yet even strong systems such as TAAE, FocalCodec, and SemantiCodec still exhibit phoneme hallucinations at very low rates (e.g., below 0.4 kbps). This suggests that current semantic objectives alone are not sufficient: the codec may preserve some high-level information, but the decoder is still free to generate linguistically plausible, yet incorrect, speech.

Key Idea: Language-Model-Driven Losses

The central proposal of this work is to introduce language-model (LM) losses into codec training. Instead of relying only on acoustic reconstruction or feature matching, we explicitly measure whether the decoded speech is linguistically consistent with the input. Crucially, these LM losses:

  • Require no architectural changes
  • Add no inference overhead
  • Can be applied to any speech-generating codec

The idea is to treat pretrained language-aware models as semantic judges during training.

Understanding Phoneme Hallucination

Phoneme hallucination typically emerges when aggressive compression removes information necessary to represent phonetic detail. The paper highlights several causes, such as excessive temporal downsampling, insufficient codebook capacity, and overly strong generative priors. When this happens, the decoder does not output noise—it outputs a different but plausible phoneme. This makes PH particularly dangerous because traditional acoustic metrics may not catch it.  Figure 1 of the paper visually demonstrates this: the decoded spectrogram at 187.5 bps clearly deviates linguistically despite sounding clean.  

A phoneme hallucination example.

Two Families of LM Losses

The paper proposes two complementary formulations.

  • ASR-Based Loss (Transcript-Free): The first approach leverages a pretrained ASR model (Whisper-tiny in the experiments). The clean input speech is first transcribed by the ASR model; the decoded speech is then forced to predict the same token sequence, yielding a cross-entropy loss in the subword space. Importantly, this method requires no ground-truth transcripts, making it widely applicable. Intuitively, the ASR model acts as a linguistic consistency checker (see the first sketch after this list).
  • Timed-Text Regularizer (TTR): The second approach uses (automatically or manually) aligned text when transcripts are available. Here, the decoded speech is processed by an audio-based language model (WavLM), while the reference text is processed by BERT. The system then minimizes the distance between the two embedding sequences, including both token-level cosine similarity and pairwise relational structure. This encourages the decoded speech to occupy the same semantic space as the text (see the second sketch after this list).
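
To make the first formulation concrete, here is a minimal PyTorch sketch of the transcript-free ASR loss, assuming the HuggingFace whisper-tiny checkpoint. The log-mel frontend below is a differentiable approximation of Whisper's official (numpy-based) feature extractor, which would otherwise break the gradient path back to the codec; batching details and loss weighting are omitted. This is an illustration of the idea, not the paper's implementation.

```python
import torch
import torchaudio
from transformers import WhisperForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
asr = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny").to(device)
asr.eval()
for p in asr.parameters():
    p.requires_grad_(False)  # the ASR model is a frozen judge, not a trainee

# Differentiable stand-in for Whisper's log-mel frontend (16 kHz, 80 bins);
# a faithful implementation must match Whisper's mel filterbank exactly.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80).to(device)

def log_mel(wave):
    # Pad/trim to Whisper's 30 s context (480000 samples -> 3000 frames).
    wave = torch.nn.functional.pad(wave, (0, max(0, 480000 - wave.shape[-1])))[..., :480000]
    m = mel(wave)[..., :-1]  # drop the trailing frame to get 3000
    m = torch.clamp(m, min=1e-10).log10()
    m = torch.maximum(m, m.amax(dim=(-2, -1), keepdim=True) - 8.0)
    return (m + 4.0) / 4.0  # Whisper-style normalization

def asr_consistency_loss(clean_wave, decoded_wave):
    """clean_wave, decoded_wave: (batch, samples) at 16 kHz on `device`."""
    # 1) Pseudo-label: transcribe the *clean* input (no ground-truth text needed).
    with torch.no_grad():
        tokens = asr.generate(input_features=log_mel(clean_wave))
    labels = tokens[:, 1:]  # drop the decoder start token; task tokens remain
    # 2) Force the *decoded* speech to predict the same token sequence;
    #    HF returns the token-level cross-entropy when `labels` is given.
    return asr(input_features=log_mel(decoded_wave), labels=labels).loss
```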
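Likewise, here is a minimal sketch of the TTR loss's two terms, assuming HuggingFace WavLM and BERT base checkpoints (which conveniently share a 768-dimensional hidden size). The frame-to-token alignment is crudely approximated here by average pooling; the paper instead uses the timing information of the aligned text, and may project the two embedding spaces before comparison.

```python
import torch
import torch.nn.functional as F
from transformers import WavLMModel, BertModel, BertTokenizer

wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base").eval()
bert = BertModel.from_pretrained("bert-base-uncased").eval()
tok = BertTokenizer.from_pretrained("bert-base-uncased")
for m in (wavlm, bert):
    for p in m.parameters():
        p.requires_grad_(False)  # both judges stay frozen

def ttr_loss(decoded_wave, transcript):
    """decoded_wave: (1, samples) at 16 kHz; transcript: reference string."""
    # Speech side: frame-level embeddings of the *decoded* audio (grads flow).
    speech = wavlm(decoded_wave).last_hidden_state           # (1, T_s, 768)
    # Text side: token-level embeddings of the reference transcript.
    ids = tok(transcript, return_tensors="pt")
    with torch.no_grad():
        text = bert(**ids).last_hidden_state                 # (1, T_t, 768)
    # Hypothetical alignment: pool speech frames down to the token count
    # (the paper uses word timings from the timed text instead).
    speech = F.adaptive_avg_pool1d(speech.transpose(1, 2), text.shape[1]).transpose(1, 2)
    # Term 1: token-level cosine similarity between aligned pairs.
    cos = 1 - F.cosine_similarity(speech, text, dim=-1).mean()
    # Term 2: match the pairwise relational structure (token-token
    # similarity matrices) of the two sequences.
    def gram(x):
        x = F.normalize(x, dim=-1)
        return x @ x.transpose(1, 2)
    rel = F.mse_loss(gram(speech), gram(text))
    return cos + rel
```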

Reference Codec Setup

To test the idea cleanly, the paper builds a controlled semantic codec. The design reflects common modern practice: HuBERT features for the semantic representation, a pitch branch for acoustic detail, vector-quantization bottlenecks, and a HiFi-GAN vocoder. The system operates at extremely low bitrates, 187.5 bps and 212.5 bps, in a single-talker setup, as in personalized neural speech codecs. This creates a deliberately challenging regime where phoneme hallucinations are likely to appear.
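
For intuition about how such bitrates arise: a VQ codec's rate is the token rate times the bits per token, summed over streams. The decomposition below is purely hypothetical (the paper does not report this exact configuration); it only illustrates the arithmetic behind a figure like 187.5 bps.

```python
import math

def stream_bps(frame_rate_hz, codebook_size, num_codebooks=1):
    """Bits per second contributed by one quantized stream."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)

# Hypothetical example: a 25 Hz semantic stream with a 64-entry codebook
# plus a 12.5 Hz pitch stream with an 8-entry codebook.
semantic = stream_bps(frame_rate_hz=25, codebook_size=64)   # 150.0 bps
pitch    = stream_bps(frame_rate_hz=12.5, codebook_size=8)  # 37.5 bps
print(semantic + pitch)                                     # 187.5 bps total
```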

Objective Results

The proposed loss functions improve performance on all objective metrics. For example, at 187.5 bps:

  • ASR loss: 1.45% WER (Whisper)
  • Semantic distillation: 3.33% WER
  • Stage-2 baseline: 3.04% WER

This is a substantial improvement in linguistic accuracy without sacrificing perceptual quality.

Subjective Evaluation

The listening tests tell an even clearer story. For semantic MOS (right figure), LM-loss models significantly improve semantic compliance, and statistical testing confirms the ordering: the ASR loss performs best, followed by TTR, and both outperform semantic distillation. From a listener’s perspective, the decoded speech simply matches the intended words more reliably. Interestingly, the MUSHRA-style similarity scores (left plot) show that LM losses and semantic distillation reach similar overall quality, while both clearly beat the stage-2 baseline. This is an important result: the LM losses improve semantics without hurting perceptual quality.

Sound Examples

Input Speech Utterance (uncompressed): “We were then in the midst of the great banking crisis.”
Baseline codec’s output with hallucinations: “We were then at the midst of the great banking prisis.”
Result from the proposed TTR loss-trained codec.
Result from the proposed ASR loss-trained codec.

Why LM Losses Work

The paper offers an insightful interpretation. Traditional semantic distillation mainly shapes the encoder representation. In contrast, LM losses are end-to-end, meaning they also influence the decoder’s generative behavior. This has two effects. First, the codec learns to encode richer semantic cues. Second, the decoder is “tamed” to generate linguistically plausible outputs. In other words, the model is not just compressing speech; it is being trained to respect language structure during generation.
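
Here is a minimal sketch of how such a loss slots into training as an auxiliary, end-to-end term, reusing the asr_consistency_loss function sketched earlier; codec, recon_loss, and lam are hypothetical stand-ins for the model, its usual reconstruction objective, and the loss weight. Because the LM loss is computed on the decoder's output waveform, its gradient reaches the decoder, not only the encoder.

```python
lam = 1.0  # hypothetical weight on the LM term

def training_step(codec, recon_loss, clean_wave, optimizer):
    """One optimizer step with an LM loss as an auxiliary, end-to-end term."""
    decoded = codec(clean_wave)  # encode -> quantize -> decode
    loss = (recon_loss(decoded, clean_wave)
            + lam * asr_consistency_loss(clean_wave, decoded))
    optimizer.zero_grad()
    loss.backward()              # gradient reaches encoder *and* decoder
    optimizer.step()
    return loss.item()
```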

Bigger Picture

This work highlights an emerging theme in neural speech coding: As bitrates drop, the bottleneck is no longer acoustic fidelity — it is semantic faithfulness. Language-model-driven losses provide a lightweight yet powerful way to inject linguistic awareness into ultra-low-bitrate codecs. Because the method is architecture-agnostic and inference-free, it can potentially benefit a wide range of existing neural codecs. Looking forward, this line of work suggests that the next generation of speech codecs will likely be trained not only with signal losses, but also with explicit language-level supervision, especially in the extreme compression regime.

For More Information, Source Code, Checkpoints

Jayeon Yi and Minje Kim, “From Hallucination to Articulation: Language Model-Driven Losses for Ultra Low-Bitrate Neural Speech Coding,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 4-8, 2026.
[pdf] [GitHub]