F5 TTS

F5 TTS (A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching) is an advanced text-to-speech model with batch processing support.

This online demo supports multiple TTS models including F5-TTS and E2 TTS (Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS).

F5 TTS Features

Discover what makes F5 TTS a powerful and versatile text-to-speech solution

Flow Matching Technology

Uses flow matching to generate speech that is fluent, natural-sounding, and faithful to the input text.

Multilingual Support

Currently supports speech synthesis in English and Chinese.

Reference Audio

Upload reference audio to clone voices and speaking styles for personalized text-to-speech.

Batch Processing

Batch processing support for generating multiple audio clips efficiently.
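The page does not describe how batching works internally. As a rough illustration only, one common way to prepare long input for batch generation is to split it into sentence-aligned chunks under a character limit. The function name `chunk_text` and the `max_chars` default below are assumptions made for this sketch, not part of F5 TTS.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into sentence-aligned chunks of at most max_chars characters.

    Hypothetical helper for illustration. A single sentence longer than
    max_chars is kept whole, so a chunk can exceed the limit in that case.
    """
    # Split after sentence-ending punctuation (ASCII and full-width Chinese).
    # This is a simplification: it also splits inside decimals such as "3.5".
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    sentences = [p.strip() for p in parts if p.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Sentences are rejoined with a single space.
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be submitted as one generation request, with the resulting clips concatenated in order.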

Multiple Models

Choose among F5-TTS, E2 TTS, and other models, depending on your needs.

Automatic Transcription

If you do not provide reference text, the reference audio is transcribed automatically with Whisper.

About F5 TTS

F5 TTS is a text-to-speech model that uses flow matching to produce natural-sounding speech.

The model supports English and Chinese and can clone a voice from reference audio. For best results, keep reference clips short (under 12 seconds) and make sure the audio has finished uploading before you generate.

F5 TTS is offered here alongside E2 TTS (Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS), a related model that takes a different approach to high-quality speech synthesis.

Learn more about F5 TTS on GitHub

Frequently Asked Questions

Common questions about F5 TTS

What is F5 TTS?

F5 TTS (A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching) is an advanced text-to-speech model that creates natural-sounding speech using flow matching technology.

What languages does F5 TTS support?

F5 TTS currently supports English and Chinese.

How do I use reference audio?

Upload a reference audio file (preferably WAV or MP3) containing the voice you want to clone. For best results, keep the clip short (under 12 seconds) and make sure the upload has finished before you generate.
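If you want to check a clip against the 12-second guideline before uploading, the sketch below measures the length of a WAV file with Python's standard `wave` module. The helper names and the `MAX_REF_SECONDS` constant are assumptions made for this example. It handles uncompressed PCM WAV only, so MP3 files would need a different tool.

```python
import io
import wave

MAX_REF_SECONDS = 12.0  # guideline quoted in the text above


def wav_duration_seconds(source) -> float:
    """Return the duration of a PCM WAV given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_short_enough(source, limit: float = MAX_REF_SECONDS) -> bool:
    """True if the clip is strictly shorter than the limit."""
    return wav_duration_seconds(source) < limit


def make_silent_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit silent WAV, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf
```

With a real file you would call `is_short_enough("my_reference.wav")`, where the path is a placeholder for your own clip.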

What's the difference between F5-TTS and E2-TTS?

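Both are non-autoregressive, zero-shot models based on flow matching, so neither requires training on a specific voice. They differ mainly in architecture: F5-TTS uses a Diffusion Transformer with ConvNeXt V2 blocks, while E2 TTS (Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS) uses a flat U-Net style Transformer.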

Do I need to provide reference text?

No. If you leave it blank, the reference audio is transcribed automatically with Whisper. For best results, though, providing accurate reference text yourself is recommended.
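The fallback described above can be sketched as a small function: use the supplied text when there is any, otherwise transcribe the audio. This is an illustration, not the demo's actual code. The `transcribe` argument is a hypothetical stand-in for whatever Whisper wrapper you use, and `resolve_reference_text` is a name invented for this example.

```python
from typing import Callable, Optional


def resolve_reference_text(
    ref_text: Optional[str],
    audio_path: str,
    transcribe: Callable[[str], str],
) -> str:
    """Return the user's reference text, or a transcription if none was given.

    `transcribe` is a hypothetical callable that takes an audio path and
    returns its transcript, for example a thin wrapper around Whisper.
    """
    if ref_text and ref_text.strip():
        # Manually supplied text takes priority over automatic transcription.
        return ref_text.strip()
    return transcribe(audio_path).strip()


def fake_transcribe(path: str) -> str:
    """Stub transcriber so the example runs without a speech model."""
    return " transcript of " + path + " "
```

For instance, `resolve_reference_text("", "clip.wav", fake_transcribe)` falls back to the stub, while passing non-empty text returns that text unchanged apart from trimmed whitespace.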

How can I get the best results from F5 TTS?

For optimal results, use high-quality reference audio, keep clips short, ensure the audio is fully uploaded before generating, and provide accurate reference text when possible.