TGIF: Talker Group-Informed Familiarization of Target Speaker Extraction

Overview

In everyday life, our devices run many speech/audio applications that can benefit from target speaker extraction (TSE): the ability to pull a target voice out of a noisy mixture of sounds.

However, today’s TSE models are designed as generalists. They aim to work for any target speaker buried in any mixture signal. That means they must be big, powerful, and trained on thousands of hours of speech data from countless people. But what if your device only ever needs to recognize a handful of voices, say, the members of your family? Why carry the burden of generalizing to everyone in the world when you only need to get familiar with a few?

That’s the question behind our new work, “TGIF: Talker Group-Informed Familiarization of Target Speaker Extraction,” presented at WASPAA 2025.

Sometimes the voices in a mixture are confusingly similar; these are the cases where a generalist TSE system tends to fail.

From Generalists to Specialists

We propose a new concept called Talker Group-Informed Familiarization (TGIF), a way for voice models to specialize in a small group of people (a “talker group”), instead of trying to serve everyone equally. Here’s the motivation:

  • A generalist model can handle any voice, but it’s large and resource-hungry.
  • A specialist model, trained for a small number of talkers, can be compact and efficient: perfect for on-device use.
    • In addition, it can learn to discriminate better between subtly different talkers, e.g., sisters in the same family, by focusing on their differences rather than on the typical generalization effort (see the figure above).
  • The problem? We usually don’t have clean recordings of each family member’s voice to train such a model.

TGIF solves this by borrowing a trick from the broader machine learning world: knowledge distillation. In our framework, a large “teacher” model is trained as a generalist on a big dataset to help a smaller “student” model learn to specialize for a particular talker group.

The teacher provides pseudo-clean signals (its best guesses of the target voice), and the student learns from them to adapt to the unique speech patterns, accents, and acoustic quirks of that group.

Left: Generalist pretraining; Right: Specialist adaptation using group-specific data and teacher guidance.

How It Works

The figure captures the two-stage training process:

Left (Generalist Pretraining): Both the teacher and student start by learning from a large, generic dataset that includes a wide range of speakers and environments. This stage teaches them the broad principles of speech separation, i.e., how to pull one voice out of a crowd. Here, the only difference between the teacher and the student is capacity: the student is too small to generalize well to arbitrary talker groups, so it needs further adaptation.

Right (Specialist Adaptation): Once deployed to a specific environment (say, your home), the student model fine-tunes itself on that talker group’s recordings, even without access to their clean voice tracks. The teacher generates pseudo targets, and the student adapts to them. Over time, the student becomes more “familiar” with the voices it hears most often — hence familiarization.
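The adaptation stage can be sketched in a few lines. Note that everything below is a toy illustration, not the paper’s actual training code: each “model” is reduced to a single scalar gain (standing in for SpEx+ as the teacher and TD-SpeakerBeam as the student), and the student fits the frozen teacher’s pseudo-clean output by gradient descent on an MSE loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real networks: each "model" is just a
# scalar gain applied to the mixture. Real TSE models are far richer.
def apply_model(w, mixture):
    return w * mixture

# Simulated in-home recording: target + interference, with no clean
# reference signal available to the student.
target = rng.standard_normal(1000)
interference = rng.standard_normal(1000)
mixture = target + 0.5 * interference

w_teacher = 0.8  # pretrained generalist teacher (kept frozen)
w_student = 0.1  # small student to be adapted in the field

# Specialist adaptation: the teacher's output serves as the pseudo-clean
# target; the student minimizes MSE against it via gradient descent.
pseudo_clean = apply_model(w_teacher, mixture)
lr = 0.1
for _ in range(100):
    est = apply_model(w_student, mixture)
    grad = 2 * np.mean((est - pseudo_clean) * mixture)  # d(MSE)/dw
    w_student -= lr * grad

print(round(w_student, 3))  # converges to the teacher's gain, 0.8
```

In the real system, the student never sees clean speech during adaptation; it only ever chases the teacher’s best guess, exactly as this toy loop does.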

Why Familiarization Helps

Imagine training a speech system to recognize anyone’s voice versus one that only needs to separate voices within your family. The latter’s job is simpler and can be done by a smaller model. Mathematically speaking, TGIF reduces the problem space from all possible combinations of speakers to just the few combinations that matter. For example, in a group of 5 family members, there are far fewer potential mixtures than in a dataset of 10,000 random speakers. As a result, the specialist model can learn faster, perform better, and still run efficiently on edge devices.
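The shrinkage of the problem space is easy to quantify. Counting just two-speaker mixtures with Python’s `math.comb` (the group sizes are illustrative):

```python
import math

# Number of distinct 2-speaker pairings a model must disentangle.
family = math.comb(5, 2)        # a 5-member talker group
generic = math.comb(10_000, 2)  # a generic 10,000-speaker corpus

print(family)   # 10
print(generic)  # 49995000
```

Ten pairings versus roughly fifty million: the specialist can spend its limited capacity on the handful of confusions that actually occur.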

Experiments and Evaluation

We tested TGIF on the TSE task using a new dataset we created for this purpose, simulating many small, family-style talker groups in noisy home environments.

  • The teacher model (SpEx+) served as a large, high-performing reference.
  • The student model (TD-SpeakerBeam) was much smaller — tested in two lightweight versions with hidden dimensions of 128 and 256.

Our experiments compared four configurations:

  • Teacher \mathcal{T}: Large, high-capacity generalist
  • Student Generalist \mathcal{S}: Small generalist baseline
  • KD Specialist \mathcal{S}^{\text{KD}}: Student fine-tuned via teacher guidance (TGIF adaptation)
  • Oracle Specialist \mathcal{S}^{\text{KD-Oracle}}: Ideal adaptation using clean speech references (which don’t exist in practice)

Results: When Familiarity Wins

The graph below summarizes the improvements over the generalist baselines.

Mean SI-SDR improvement vs. number of interfering speakers for different models.

The results show a clear trend:

  • The teacher still leads overall, but the TGIF specialists close the gap significantly, often coming within 1–1.5 dB of the teacher’s performance.
  • In harder cases (4–5 overlapping speakers), the specialists even surpass the teacher, because they only need to discriminate among a few familiar voices.
  • Compared to their own generalist baselines, the TGIF students gained up to 3 dB in SI-SDR, demonstrating that familiarization offers real, measurable benefits.
  • The gains are largest when the input mixtures are noisy or heavily overlapped — exactly where generalist models tend to struggle.
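For readers unfamiliar with the metric: SI-SDR (scale-invariant signal-to-distortion ratio) measures, in dB, how close an extracted signal is to the reference, ignoring overall scaling. A minimal NumPy sketch of the standard definition (not the paper’s evaluation code):

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-Invariant SDR in dB (higher is better)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    error = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps)
                         / (np.dot(error, error) + eps))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)           # 1 second at 16 kHz
noisy = clean + 0.3 * rng.standard_normal(16000)

print(si_sdr(clean, 0.5 * clean) > 50)  # rescaling doesn't hurt the score
print(si_sdr(clean, noisy))             # finite dB score for a noisy estimate
```

The “SI-SDR improvement” reported in the plots is simply the SI-SDR of the model’s output minus the SI-SDR of the unprocessed mixture.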

In short, familiar models outperform universal ones when the environment is limited and predictable, just like people who understand each other better over time.

Read More

  • Tsun-An Hsieh and Minje Kim
“TGIF: Talker Group-Informed Familiarization of Target Speaker Extraction,”
    in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
    Tahoe City, CA, Oct. 12-15, 2025.
    [pdf, code]
  • This project extends the personalized speech enhancement concept and the papers my team has published so far. More general introductions can be found here.

Source Code

https://github.com/aleXiehta/TGIF-WASPAA

※ The material discussed here is based upon work supported by the National Science Foundation under Award #: 2512987. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.