Schrödinger Bridge Consistency Trajectory Models for Speech Enhancement

Shuichiro Nishigori1, Koichi Saito2, Naoki Murata3, Masato Hirano1, Shusuke Takahashi1, and Yuki Mitsufuji1, 2

1 Sony Group Corporation, Tokyo, Japan; 2 Sony AI, New York, USA; 3 Sony AI, Tokyo, Japan

Abstract

Speech enhancement (SE) using diffusion models is a promising technology for improving speech quality in noisy speech data. Recently, the Schrödinger bridge (SB) has been applied to diffusion-based SE to resolve the mismatch between the endpoint of the forward process and the starting point of the reverse process. However, SB inference remains slow, owing to the large number of function evaluations (NFE) needed to achieve high-quality results. Consistency models, which accelerate inference through consistency training via distillation from pretrained models, were proposed in the field of image generation, but they struggle to improve generation quality as the number of steps increases. Consistency trajectory models (CTMs) overcome this limitation, achieving faster inference while preserving a favorable trade-off between quality and speed; in particular, SoundCTM applies CTM techniques to sound generation. Thus, the SB resolves the endpoint mismatch and CTMs accelerate inference while maintaining the quality-speed trade-off, but, to the best of our knowledge, no existing method resolves both issues simultaneously. Hence, in this study, we propose Schrödinger bridge consistency trajectory models (SBCTM), which apply CTM techniques to the SB for SE. In addition, we introduce a novel auxiliary loss, incorporating a perceptual loss, into the original CTM training framework. As a result, SBCTM achieves an approximately 16x improvement in real-time factor compared with the conventional SB for SE. Furthermore, its favorable quality-speed trade-off enables time-efficient inference by limiting multi-step refinement to cases in which one-step inference is insufficient. SBCTM thus enables more practical and efficient speech enhancement, providing fast one-step inference and a flexible mechanism for further quality improvement. Our code, pretrained models, and audio samples are available here.
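The flexible inference mechanism described above can be summarized in a few lines. Below is a minimal sketch, assuming a hypothetical sbctm.jump(x, t, s) interface for the CTM-style jump from time t to time s along the learned bridge trajectory; the names (sbctm.jump, T, quality_ok) are illustrative rather than the released API, and the multi-step path shown is one simple deterministic variant.

```python
import torch

@torch.no_grad()
def enhance(sbctm, noisy, T=1.0, refine_steps=16, quality_ok=None):
    # One-step inference (NFE=1): jump from the noisy endpoint (t=T)
    # of the Schrodinger bridge straight to the clean endpoint (s=0).
    enhanced = sbctm.jump(noisy, t=T, s=0.0)

    # Optional multi-step refinement, used only when the one-step
    # output is judged insufficient by some quality check.
    if quality_ok is not None and not quality_ok(enhanced):
        x = noisy
        times = torch.linspace(T, 0.0, refine_steps + 1)
        for t, s in zip(times[:-1], times[1:]):
            x = sbctm.jump(x, t=t.item(), s=s.item())
        enhanced = x
    return enhanced
```

In practice, quality_ok could be any lightweight check (e.g., a non-intrusive quality estimator), so the extra NFE is spent only on utterances where one-step inference falls short.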


Audio Samples of Speech Enhancement

The following models, all trained on the VoiceBank-DEMAND dataset [1] (downsampled to 16 kHz), are compared on the VoiceBank-DEMAND test set: SB-PESQ [2], SBCTM [3], and SEMamba [4]. A minimal evaluation sketch is provided after the table.
To perceive the differences between the audio samples accurately, we strongly recommend listening with headphones.


| Gender (Noise type) | Clean | Noisy | SB-PESQ (NFE=1) | SB-PESQ (NFE=16) | SBCTM (NFE=1) | SBCTM (NFE=16) | SEMamba |
|---|---|---|---|---|---|---|---|
| Female (Crowd) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Male (Outdoor) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Female (Music) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Male (Typing) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Female (Typing) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Female (Outdoor) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Male (Crowd) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| Male (Music) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
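For reference, here is a minimal sketch of how such a comparison could be reproduced, assuming the VoiceBank-DEMAND test set on disk; the file paths are illustrative, enhance stands in for any of the models above, and wideband PESQ is computed with the pesq package.

```python
import torchaudio
from pesq import pesq

def load_16k(path):
    # Load a waveform and resample to 16 kHz if necessary.
    wav, sr = torchaudio.load(path)  # wav: (channels, samples)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    return wav

clean = load_16k("clean_testset_wav/p232_001.wav")
noisy = load_16k("noisy_testset_wav/p232_001.wav")

# Placeholder: replace with a model's output, e.g. enhance(model, noisy).
enhanced = noisy

# Wideband PESQ between the clean reference and the enhanced estimate.
score = pesq(16000, clean[0].numpy(), enhanced[0].numpy(), "wb")
print(f"PESQ (wb): {score:.2f}")
```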

References

[1] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi. Investigating RNN-based speech enhancement methods for noise-robust text-to-speech. ISCA Speech Synthesis Workshop, 2016.
[2] J. Richter, D. de Oliveira, and T. Gerkmann. Investigating training objectives for generative speech enhancement. arXiv preprint arXiv:2409.10753, 2024.
[3] S. Nishigori, K. Saito, N. Murata, M. Hirano, S. Takahashi, and Y. Mitsufuji. Schrödinger bridge consistency trajectory models for speech enhancement. IEEE WASPAA, 2025.
[4] R. Chao, W.-H. Cheng, M. La Quatra, S. M. Siniscalchi, C.-H. H. Yang, S.-W. Fu, and Y. Tsao. An investigation of incorporating Mamba for speech enhancement. IEEE Spoken Language Technology Workshop, 2024.