EchoAudio

Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps

Anonymous

Abstract. Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation and limiting their deployment for text-to-audio generation. In this work, we introduce EchoAudio, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. Unlike prior approaches that remove noise iteratively, EchoAudio integrates Consistency Models (CMs) into the generation process, enabling rapid inference through a mapping from any point at any time step to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sampling iterations, we propose Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This shortens the time schedule from thousands to dozens of steps while maintaining sample quality, thereby achieving fast convergence and high-quality generation. Furthermore, to improve the performance of the transformer-based architecture, we integrate the techniques pioneered by LLaMA into the transformer backbone. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-audio generation and text-to-music synthesis tasks demonstrate that EchoAudio needs only 2 iterations to synthesize high-fidelity audio, while maintaining sample quality competitive with state-of-the-art models that use hundreds of steps. EchoAudio samples 333x faster than real time on a single NVIDIA 4090Ti GPU, making generative models practically applicable to text-to-audio deployment. Our extensive preliminary analysis shows that each design choice in EchoAudio is effective.
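To make the idea of "a mapping from any point at any time step to the trajectory's initial point" concrete, the following is a minimal toy sketch and not the EchoAudio model. It assumes a 1-D Gaussian data distribution and a variance-exploding noising process x_t = x_0 + t·z, for which the probability-flow ODE can be solved in closed form, so the consistency function is exact rather than learned. The names `consistency_fn`, `ode_point`, and `sample`, and the constants `MU` and `S`, are hypothetical and chosen only for illustration.

```python
import math
import random

# Toy data distribution N(MU, S^2). Assumption for illustration only.
MU, S = 2.0, 0.5


def consistency_fn(x, t):
    """Map a point (x, t) on a probability-flow ODE trajectory to its t = 0 origin.

    For Gaussian data and x_t = x_0 + t * z, trajectories satisfy
    (x_t - MU) proportional to sqrt(S^2 + t^2), which gives this closed form.
    """
    return MU + (x - MU) * S / math.sqrt(S * S + t * t)


def ode_point(x0, t):
    """Point at time t on the trajectory that passes through x0 at time 0."""
    return MU + (x0 - MU) * math.sqrt(S * S + t * t) / S


def sample(timesteps, rng):
    """Multistep consistency sampling: denoise, re-noise to a smaller t, denoise again."""
    t_max = timesteps[0]
    x0 = consistency_fn(rng.gauss(0.0, t_max), t_max)  # first jump from pure noise
    for t in timesteps[1:]:
        x_t = x0 + t * rng.gauss(0.0, 1.0)  # re-noise the current estimate
        x0 = consistency_fn(x_t, t)         # jump back to the origin
    return x0


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample([80.0, 1.0], rng) for _ in range(2000)]  # 2 iterations each
    print(sum(draws) / len(draws))
```

In this toy setting every point on one trajectory maps to the same origin, which is the self-consistency property that a distilled model is trained to approximate. Each call to `sample` uses two function evaluations, mirroring the two-iteration setting described in the abstract.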

EchoAudio Overview

An illustration of EchoAudio. EchoAudio proposes Guided Consistency Distillation with a k-step ODE solver. c is the text embedding and 𝜔 is the classifier-free guidance scale.
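The two symbols in the caption can be illustrated with a short sketch. It assumes the widely used form of classifier-free guidance, which extrapolates from the unconditional noise prediction toward the text-conditional one by a scale 𝜔, and it reads the k-step solver as connecting time steps that are k apart instead of adjacent ones. Both are assumptions about the general technique; the exact parameterization used by EchoAudio may differ. The function names `guided_noise` and `skipped_pairs` are hypothetical.

```python
def guided_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance on two noise predictions.

    w = 1 returns the conditional prediction unchanged;
    w > 1 pushes further away from the unconditional prediction.
    """
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def skipped_pairs(num_steps, k):
    """Pairs (t_from, t_to) that a k-step solver would connect, from num_steps down to 0."""
    return [(t, max(t - k, 0)) for t in range(num_steps, 0, -k)]


if __name__ == "__main__":
    # Toy two-dimensional predictions, not real model outputs.
    print(guided_noise([1.0, 2.0], [0.0, 1.0], 3.0))  # [3.0, 4.0]
    # A 1000-step schedule visited 20 steps at a time has 50 solver jumps.
    print(len(skipped_pairs(1000, 20)))  # 50
```

Under this reading, increasing k shortens the effective schedule (1000 steps with k = 20 gives 50 jumps), which matches the "thousands to dozens" reduction described in the abstract.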

Text-to-Audio Generation

Text Prompts	Ground-truth	EchoAudio	Teacher	Make-an-Audio 2	AudioLDM2	Tango	AudioLDM	Make-an-Audio	AudioGen

Text-to-Music Generation

Text Prompts	Ground-truth	EchoAudio	Teacher	AudioLDM2	MusicLDM	MusicGen	Riffusion

Preliminary Analyses - Classifier-free Guidance

Text Prompts	Ground-truth	Scale 1	Scale 3	Scale 5	Scale 7	Scale 9

Preliminary Analyses - Multi-step ODE Solver

Text Prompts	Ground-truth	k = 1	k = 5	k = 10	k = 20	k = 50