Speech Synthesis 🎧
A rich library of voices and customization options provide a personalized dubbing experience to meet creative needs. Real-time preview ensures precise creation.
Debug Mode
After uploading the video, click Execute
to start. At this point, Debug Mode
will interrupt subsequent execution.
2025-04-18 04:55:14.045 | INFO 12672 trans_.py:43 - language: zh-CN
2025-04-18 04:55:14.045 | WARNING 12672 trans_.py:50 - skipping translated
2025-04-18 04:55:14.046 | INFO 12672 tts_.py:423 - Loaded transcription data from: webapp/temp/test/test_001.json
2025-04-18 04:55:14.050 | WARNING 12672 tts_.py:452 - delete tts file : webapp/temp/test/tts/00_27c51c.wav
...
2025-04-18 04:55:14.067 | WARNING 12672 tts_.py:452 - delete tts file : webapp/temp/test/tts/08_9ebe29.wav
2025-04-18 04:55:17.559 | INFO 12672 cbaudio.py:331 - Generated trans audio file: webapp/temp/test/test_001.wav 30.571
View Content
By clicking the bottom-right icon , you can view the currently recognized content.
Manual Editing
Modify the text values of different attributes
and click the top-right icon to save the changes.
Provider Selection
Model | Features | Supported Languages | Recommendation Index |
---|---|---|---|
Edge-TTS | Extremely fast , suitable for low-performance devices, fast synthesis speed | 100+ | 🔥🔥🔥🔥🔥 |
E2F5-TTS | High quality , suitable for mid-to-high-performance devices, supports only Chinese and English | 2 | 🔥🔥🔥🔥 |
Coqui-TTS | High quality , suitable for mid-to-high-performance devices, supports minor languages | 17+ | 🔥🔥🔥🔥 |
CosyVoice2 | High quality , suitable for high-performance devices, supports custom instructions | 4 (Chinese, English, Japanese, Korean) | 🔥🔥🔥🔥🔥 |
Notes
How to Choose❓
CosyVoice2
, E2F5-TTS
, and Coqui-TTS
require high-performance devices, and synthesis efficiency is low on low-performance devices. It is recommended for general users to use Edge-TTS
, which has faster synthesis speed. You can use it to adjust parameters for other modules first and then replace the dubbing with others. For users with high-performance devices or those looking to optimize dubbing quality, Edge-TTS
is also recommended. Additionally, you can use Google Colab
for remote deployment to improve processing speed.
Parameters
Different models use different parameters. This section will be supplemented gradually.
Edge-TTS
To be added.
E2F5-TTS
The default sampling step is 8
, ranging from [4,64]
, affecting inference accuracy and speed. More steps result in slower speed but potentially better audio quality.
Coqui-TTS
Lists three potentially useful models:
tts_models/multilingual/multi-dataset/xtts_v2
supports 17 languages.voice_conversion_models/multilingual/multi-dataset/openvoice_v2
supports minor languages and voice conversion.voice_conversion_models/multilingual/multi-dataset/glow-tts
supports monotonic alignment (does not support Chinese).
CosyVoice2
Users can customize instructions to make synthesized voices more personalized. More examples. Currently supports the following three types:
- Say this sentence in Sichuan dialect.
- Speak in Sichuan dialect.
- Sichuan dialect.
Configuration Options
Gender
Switching gender changes the voice tone, helping users quickly find the ideal voice.
Voice
The dubbing options provided vary depending on the model and video. Dubbing is divided into three types: Built-in
, Video
, and User
. Users can customize voice tones or record their own voices.
Tip
Refer to the 《Voice Data》 section for user customization.
Speech Rate
Speech rate is an important parameter for synthesized speech. The choice of speech rate has a significant impact on the effect in different languages and scenarios.
2025-04-18 07:08:59.473 | WARNING 11760 cbaudio.py:284 - idx_03.wav 06000-07757, 1.757s, speed up 1.220.
2025-04-18 07:09:00.092 | WARNING 11760 cbaudio.py:284 - idx_08.wav 16199-19406, 3.207s, speed up 1.454.
2025-04-18 07:09:00.441 | WARNING 11760 cbaudio.py:284 - idx_12.wav 24561-26832, 2.271s, speed up 1.235.
2025-04-18 07:09:00.926 | WARNING 11760 cbaudio.py:284 - idx_16.wav 33059-35588, 2.529s, speed up 1.277.
Note
During speech synthesis, the speech rate value will be printed. This value should be close to 1
and should not exceed 1.2
to avoid pitch distortion. If unavoidable, manual adjustments can be made.
Preview
Users can input custom text for speech synthesis preview to confirm the dubbing effect. Switch between Basic
-> Language
to change the preview text language. Preview records will be saved to the Dubbing List
.
Dubbing List
Displays the currently available dubbing voice information, supporting preview, deletion, and reordering operations.
Operation Instructions:
- Click
Preview
to immediately play the voice clip. - Click
Delete
to remove the voice. - Use
Move Up
/Move Down
to control the voice order.
Note
Example: When a user uploads test.mp4
and marks it as 001
:
- For single-person dubbing, the
Speaker
defaults to0
. - For multi-person dubbing, the
Speaker
matches thespk
in thetest_001.json
data.