Imagine being in a crowded room where several people are speaking at the same time. Yet somehow, your brain can focus on a single voice you care about and filter out the rest. Replicating this ability in machines is known as Target Speaker Extraction (TSE). It allows a system to isolate one person’s speech from a mixture of voices and background noise, given a short example (i.e., enrollment) of that person’s voice.
This capability is crucial for many modern technologies:
- Voice assistants operating in noisy homes,
- Hearing aids helping users focus on a conversation partner,
- Speech recognition systems handling overlapping speakers,
- Communication systems where clarity matters.
In our recent work, “Adaptive Deterministic Flow Matching for Target Speaker Extraction (AD-FlowTSE)”, we explore how generative AI techniques can make this process both more accurate and more efficient.
From Predictive Models to Generative Models
Most traditional TSE systems are predictive models. They directly map a noisy mixture of audio to a clean output waveform or a mask that filters the signal. While effective, these systems sometimes introduce artifacts and may struggle when encountering unfamiliar speakers or noisy environments.
Recently, researchers have begun exploring generative models, such as diffusion models and flow matching, for speech tasks. Instead of predicting the clean signal in one shot, these models gradually transform a noisy input into clean speech through a sequence of steps.
Generative approaches often produce more natural-sounding outputs. However, they come with a drawback: many inference steps are typically required, which increases computational cost. Diffusion models, for example, often require dozens of iterative updates before producing a clean signal.
A Key Insight: Where Does the Mixture Sit?
Consider how a speech mixture is formed. If a target speaker’s signal is s_1 and everything else (other speakers, noise, etc.) forms the background b, a mixture can be approximated as
x = \tau s_1 + (1-\tau) b
Here, \tau represents the mixing ratio, i.e., essentially how much of the target speech is present in the mixture.
- If \tau is large, the target speaker dominates the signal.
- If \tau is small, the mixture is mostly background noise and interference.
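To make the mixture model concrete, here is a minimal NumPy sketch. The helper name `mix`, the sine wave standing in for speech, and the white noise standing in for the background are all illustrative assumptions, not part of the AD-FlowTSE code.

```python
import numpy as np

def mix(s1, b, tau):
    """Form x = tau * s1 + (1 - tau) * b for a mixing ratio tau in [0, 1]."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return tau * s1 + (1.0 - tau) * b

rng = np.random.default_rng(0)
sr = 16000                               # assumed sample rate for the toy signals
t = np.arange(sr) / sr
s1 = np.sin(2 * np.pi * 220 * t)         # toy stand-in for the target speech
b = 0.5 * rng.standard_normal(sr)        # toy stand-in for the background

x_easy = mix(s1, b, tau=0.9)             # target dominates
x_hard = mix(s1, b, tau=0.1)             # background dominates

# The distance from the mixture to the target shrinks as tau grows,
# since x - s1 = (1 - tau) * (b - s1).
dist_easy = np.linalg.norm(x_easy - s1)
dist_hard = np.linalg.norm(x_hard - s1)
```

In this toy setup the "easy" mixture is nine times closer to the target than the "hard" one, which is the intuition behind starting the extraction from a point that depends on \tau.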
Most existing generative TSE methods ignore this structure. They assume every input requires the same transformation path, regardless of whether the signal is already fairly clean or extremely noisy. This leads to inefficient inference.
Our key idea is simple but powerful:
Instead of always starting from the same point, we estimate the mixing ratio and start the extraction process closer to where the mixture already lies.

This insight allows the model to adapt the number of inference steps depending on how difficult the extraction task is. Moreover, if the trajectory is linear, we may be able to do the inference within a single step!
Adaptive Deterministic Flow Matching (AD-FlowTSE)
Our proposed system builds on a specific generative modeling technique, flow matching. In this framework, a neural network learns a vector field that gradually transforms one distribution into another. In conventional approaches, the transformation is learned between the mixture distribution and the clean speech distribution. We take a different perspective: we learn a deterministic flow between the background audio and the target speech signal. The mixing ratio \tau then naturally defines where the observed mixture lies along this path.
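One way to write this down, assuming the background-to-speech path is the straight line implied by the mixture model above (the symbol x_t and the time variable t are notation introduced for this sketch, not necessarily the paper's):

```latex
\begin{aligned}
x_t &= t\, s_1 + (1 - t)\, b, \qquad t \in [0, 1], \qquad x_0 = b, \quad x_1 = s_1, \\
\frac{d x_t}{d t} &= s_1 - b, \\
x_\tau &= \tau s_1 + (1 - \tau)\, b = x, \\
x_1 &= x_\tau + (1 - \tau)(s_1 - b) = s_1 .
\end{aligned}
```

Under this assumption the velocity is constant along the path, so a single step of size 1 - \tau from the mixture would land exactly on the target if the velocity were known exactly. In practice the velocity comes from a learned network, so the result is only approximate.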
At inference time, the system:
- Estimates the mixing ratio \tau from the mixture and the target speaker’s enrollment sample.
- Initializes the generative process at that point along the background-to-speech trajectory.
- Runs only the necessary (small) number of steps to reach the clean speech signal.
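The steps above can be sketched as a short Euler integration. This is a minimal illustration under stated assumptions: `extract` and `oracle_velocity` are hypothetical names, the mixing ratio is assumed to be already estimated, and the trained network (which would also be conditioned on the enrollment) is replaced by the ideal constant velocity s_1 - b of a straight-line path.

```python
import numpy as np

def extract(x, tau_hat, velocity_fn, num_steps=1):
    """Euler-integrate from t = tau_hat to t = 1, starting at the mixture x."""
    ts = np.linspace(tau_hat, 1.0, num_steps + 1)
    x_t = x.copy()
    for t0, t1 in zip(ts[:-1], ts[1:]):
        # Move along the predicted velocity for the length of this sub-interval.
        x_t = x_t + (t1 - t0) * velocity_fn(x_t, t0)
    return x_t

rng = np.random.default_rng(0)
s1 = rng.standard_normal(1000)           # toy target signal
b = rng.standard_normal(1000)            # toy background signal
tau = 0.3
x = tau * s1 + (1 - tau) * b             # observed mixture

# Assumption: an oracle velocity stands in for the trained vector field.
def oracle_velocity(x_t, t):
    return s1 - b

s1_hat_one = extract(x, tau, oracle_velocity, num_steps=1)
s1_hat_five = extract(x, tau, oracle_velocity, num_steps=5)
```

With the oracle velocity, one step and five steps both recover the toy target up to floating-point error. With a learned vector field the two would generally differ, which is why the number of steps remains a tunable choice.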
In other words, easy cases require very little computation, while harder cases automatically receive more refinement steps. This leads to an adaptive inference strategy that aligns computation with the difficulty of the input.
Why This Matters
The adaptive formulation provides two major benefits.
Efficiency. Because the system starts closer to the solution, it often requires only a single inference step to achieve high-quality extraction. This dramatically reduces computational cost compared to traditional diffusion approaches.
Accuracy. By aligning the generative trajectory with the actual structure of the mixture (background vs. target speech), the model better preserves the identity of the target speaker and produces cleaner outputs.
Experimental Results
We evaluated AD-FlowTSE on the Libri2Mix dataset, a widely used benchmark for multi-speaker speech separation and extraction. Across several evaluation metrics, including speech quality (PESQ), intelligibility (ESTOI), and signal distortion (SI-SDR), our method consistently outperformed previous generative approaches. For example, AD-FlowTSE achieved significantly stronger SI-SDR scores, indicating better signal reconstruction quality compared to earlier flow-matching and diffusion-based methods. The method also improved speaker similarity, meaning the extracted voice more closely matched the intended speaker’s identity.
Another interesting observation emerged from analyzing the number of inference steps. The best performance often occurred with just one or five steps, and increasing the number of steps actually degraded results slightly. This confirms the central idea of the paper: once the system starts from a good initialization point (based on the estimated mixing ratio), only minimal refinement is needed.
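For readers unfamiliar with SI-SDR, the metric mentioned above can be sketched in a few lines of NumPy. This follows the commonly used definition (project the estimate onto the reference, then compare the projected energy with the residual energy). The function name `si_sdr`, the mean removal, and the small `eps` are choices made for this sketch, and it is not the evaluation script used in the paper.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference that best explains the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(residual, residual) + eps))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)         # toy reference signal
noise = rng.standard_normal(16000)
good = ref + 0.1 * noise                 # mild distortion
bad = ref + 1.0 * noise                  # heavy distortion

score_good = si_sdr(good, ref)
score_bad = si_sdr(bad, ref)
score_rescaled = si_sdr(2.0 * good, ref) # rescaling should not change the score
```

The mildly distorted signal scores higher than the heavily distorted one, and rescaling the estimate leaves the score essentially unchanged, which is what "scale-invariant" refers to.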
When Estimation Matters
A critical component of the system is the mixing ratio predictor, which estimates how much of the target speaker is present in the mixture. Our experiments show that accurate estimation is important. When the mixing ratio was replaced with random values, performance dropped significantly. Conversely, using the true mixing ratio (an oracle scenario) produced only slightly better results than the estimated version, suggesting the learned predictor is already quite accurate.
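A toy calculation gives some intuition for why the estimate matters. The sketch below assumes a straight-line path and an oracle velocity s_1 - b in place of the trained network, so it does not reproduce the experiments above. The names `one_step` and `rel_error` are hypothetical. Under these assumptions, a one-step extraction with a wrong ratio \hat{\tau} misses the target by exactly (\tau - \hat{\tau})(s_1 - b).

```python
import numpy as np

rng = np.random.default_rng(0)
s1 = rng.standard_normal(1000)           # toy target signal
b = rng.standard_normal(1000)            # toy background signal
tau = 0.6                                # true mixing ratio
x = tau * s1 + (1 - tau) * b

def one_step(x, tau_hat):
    # Assumption: oracle velocity (s1 - b) instead of a learned vector field.
    return x + (1.0 - tau_hat) * (s1 - b)

def rel_error(tau_hat):
    return np.linalg.norm(one_step(x, tau_hat) - s1) / np.linalg.norm(s1)

# True ratio, a slightly wrong ratio, and a badly wrong ratio.
errors = {tau_hat: rel_error(tau_hat) for tau_hat in (0.6, 0.55, 0.2)}
```

In this toy the error grows linearly with the gap between the assumed and true ratio, which is at least consistent with a small estimation error costing little and a random ratio costing a lot.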
A Step Toward Efficient Generative Speech Processing
Generative models are increasingly powerful tools for speech processing, but their computational cost has limited practical deployment. AD-FlowTSE demonstrates that re-thinking the structure of the generative process can dramatically improve efficiency without sacrificing quality. By aligning the generative trajectory with the physical composition of the mixture, the system becomes both faster and more accurate. Ultimately, this approach opens new possibilities for real-time speech extraction on resource-constrained devices, from mobile phones to smart assistants.
Because we deliberately made the process deterministic, rather than starting the sampling process from Gaussian noise, we observe an interesting behavior: the denoising process does not leverage stochasticity, so the TSE result stays objectively close to the original speech instead of being a newly synthesized clean signal. In other words, the model does not benefit from a strong prior (i.e., the distribution of clean speech); instead, it tries to estimate the original speech faithfully.
Sound Examples
Example #1
Example #2
Learn More
Tsun-An Hsieh and Minje Kim, “Adaptive Deterministic Flow Matching for Target Speaker Extraction,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 4-8, 2026. [pdf, demo, GitHub]
※ The material discussed here is based upon work supported by the National Science Foundation under Award #: 2512987. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
※ Part of this blog article was written in collaboration with ChatGPT.