Skip to content

Speech Synthesis 🎧

A rich library of voices and customization options provide a personalized dubbing experience to meet creative needs. Real-time preview ensures precise creation.

Debug Mode

After uploading the video, click Execute to start. At this point, Debug Mode will interrupt subsequent execution.

log
2025-04-18 04:55:14.045 | INFO 12672 trans_.py:43 - language: zh-CN
2025-04-18 04:55:14.045 | WARNING 12672 trans_.py:50 - skipping translated
2025-04-18 04:55:14.046 | INFO 12672 tts_.py:423 - Loaded transcription data from: webapp/temp/test/test_001.json
2025-04-18 04:55:14.050 | WARNING 12672 tts_.py:452 - delete tts file : webapp/temp/test/tts/00_27c51c.wav
...
2025-04-18 04:55:14.067 | WARNING 12672 tts_.py:452 - delete tts file : webapp/temp/test/tts/08_9ebe29.wav
2025-04-18 04:55:17.559 | INFO 12672 cbaudio.py:331 - Generated trans audio file: webapp/temp/test/test_001.wav 30.571

View Content

By clicking the bottom-right icon , you can view the currently recognized content.

Manual Editing

Modify the text values of different attributes and click the top-right icon to save the changes.

Provider Selection

ModelFeaturesSupported LanguagesRecommendation Index
Edge-TTSExtremely fast, suitable for low-performance devices, fast synthesis speed100+🔥🔥🔥🔥🔥
E2F5-TTSHigh quality, suitable for mid-to-high-performance devices, supports only Chinese and English2🔥🔥🔥🔥
Coqui-TTSHigh quality, suitable for mid-to-high-performance devices, supports minor languages17+🔥🔥🔥🔥
CosyVoice2High quality, suitable for high-performance devices, supports custom instructions4 (Chinese, English, Japanese, Korean)🔥🔥🔥🔥🔥

Notes

How to Choose❓

CosyVoice2, E2F5-TTS, and Coqui-TTS require high-performance devices, and synthesis efficiency is low on low-performance devices. It is recommended for general users to use Edge-TTS, which has faster synthesis speed. You can use it to adjust parameters for other modules first and then replace the dubbing with others. For users with high-performance devices or those looking to optimize dubbing quality, Edge-TTS is also recommended. Additionally, you can use Google Colab for remote deployment to improve processing speed.

Parameters

Different models use different parameters. This section will be supplemented gradually.

Edge-TTS

To be added.

E2F5-TTS

The default sampling step is 8, ranging from [4,64], affecting inference accuracy and speed. More steps result in slower speed but potentially better audio quality.

Coqui-TTS

Lists three potentially useful models:

  • tts_models/multilingual/multi-dataset/xtts_v2 supports 17 languages.
  • voice_conversion_models/multilingual/multi-dataset/openvoice_v2 supports minor languages and voice conversion.
  • voice_conversion_models/multilingual/multi-dataset/glow-tts supports monotonic alignment (does not support Chinese).

CosyVoice2

Users can customize instructions to make synthesized voices more personalized. More examples. Currently supports the following three types:

  • Say this sentence in Sichuan dialect.
  • Speak in Sichuan dialect.
  • Sichuan dialect.

Configuration Options

Gender

Switching gender changes the voice tone, helping users quickly find the ideal voice.

Voice

The dubbing options provided vary depending on the model and video. Dubbing is divided into three types: Built-in, Video, and User. Users can customize voice tones or record their own voices.

Tip

Refer to the 《Voice Data》 section for user customization.

Speech Rate

Speech rate is an important parameter for synthesized speech. The choice of speech rate has a significant impact on the effect in different languages and scenarios.

log
2025-04-18 07:08:59.473 | WARNING 11760 cbaudio.py:284 - idx_03.wav 06000-07757, 1.757s, speed up 1.220.
2025-04-18 07:09:00.092 | WARNING 11760 cbaudio.py:284 - idx_08.wav 16199-19406, 3.207s, speed up 1.454.
2025-04-18 07:09:00.441 | WARNING 11760 cbaudio.py:284 - idx_12.wav 24561-26832, 2.271s, speed up 1.235.
2025-04-18 07:09:00.926 | WARNING 11760 cbaudio.py:284 - idx_16.wav 33059-35588, 2.529s, speed up 1.277.

Note

During speech synthesis, the speech rate value will be printed. This value should be close to 1 and should not exceed 1.2 to avoid pitch distortion. If unavoidable, manual adjustments can be made.

Preview

Users can input custom text for speech synthesis preview to confirm the dubbing effect. Switch between Basic -> Language to change the preview text language. Preview records will be saved to the Dubbing List.

Dubbing List

Displays the currently available dubbing voice information, supporting preview, deletion, and reordering operations.

Operation Instructions:

  • Click Preview to immediately play the voice clip.
  • Click Delete to remove the voice.
  • Use Move Up / Move Down to control the voice order.

Note

Example: When a user uploads test.mp4 and marks it as 001:

  1. For single-person dubbing, the Speaker defaults to 0.
  2. For multi-person dubbing, the Speaker matches the spk in the test_001.json data.