
Model Training 🧠


Fine-tuning and inference based on the GPT-SoVITS base model.

Use Cases

If a 3–6 second voice sample is not sufficient for your cloning needs, you can fine-tune the model to improve similarity and realism.

Application Preview

[Image: gradio_app_gmt]

Fine-tuning Training

Prepare the dataset carefully: a good dataset is the foundation of a high-quality model.

1. Training Data Preparation

[Image: gradio_app_gmt]

Format

JSON file content format

webapp/temp/test2/test2_zh.json
json
[
  ...
  {
    "idx": 4,
    "spk": 0,
    "lang": "en",
    "start": 9619,
    "end": 12145,
    "duration": 2526,
    "text": "曾经有一份真诚的爱情放在我面前，",
    "text_trans": "I was offered true love,",
    "speed": 1.0
  }
  ...
]
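
As a rough illustration of the format above, the following Python sketch loads a segment array and checks that each entry's `duration` equals `end - start` (as it does in the sample: 12145 - 9619 = 2526). The `Segment` dataclass and the `load_segments` helper are hypothetical names, not part of the application, and the millisecond unit is an assumption based on the `ms` values in the logs. Fields not listed in the sample are ignored.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Segment:
    idx: int
    spk: int
    lang: str
    start: int      # assumed to be milliseconds
    end: int        # assumed to be milliseconds
    duration: int   # assumed to be milliseconds
    text: str
    text_trans: str
    speed: float


def load_segments(raw: str) -> list[Segment]:
    """Parse a JSON array of segment objects and sanity-check each entry."""
    known = {f.name for f in fields(Segment)}
    segments = [
        Segment(**{k: v for k, v in item.items() if k in known})
        for item in json.loads(raw)
    ]
    for seg in segments:
        if seg.duration != seg.end - seg.start:
            raise ValueError(f"idx={seg.idx}: duration does not match end - start")
    return segments


sample = """[{"idx": 4, "spk": 0, "lang": "en", "start": 9619, "end": 12145,
  "duration": 2526, "text": "曾经有一份真诚的爱情放在我面前，",
  "text_trans": "I was offered true love,", "speed": 1.0}]"""
segments = load_segments(sample)
print(segments[0].idx, segments[0].duration)
```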

Input

webapp/temp/test2/test2_xx.json

webapp/temp/test2/stems/test2_instrumental.wav

Note

  • Go to the homepage and upload an audio or video file
  • Open Subtitles debug
  • Select a Provider and set its related parameters
  • Enable Track Separation, Voice Print Alignment, Emotion Recognition, and Speaker Count as needed
  • Click Execute at the bottom

When execution completes, test2_xx.json and test2_instrumental.wav are available in the output.

About the test2_xx.json file

  • If you do not use Track Separation, you can upload separately processed audio instead (see below)
  • If you do not use Voice Print Alignment, you can align the srt file manually with the third-party tool Subtitle Edit, then select the srt provider and upload it
  • If you do not use Emotion Recognition, the emotion defaults to Neutral
  • If you do not use Speaker Count, the speaker defaults to 0, meaning a single speaker
  • Recommendations
    • For long audio/video, select the SRT or CapCut option to greatly reduce recognition time
    • For Chinese audio/video, select the FunAsr option
    • For English audio/video, select the FasterWhisper option

About the test2_instrumental.wav file

  • If the separation quality of test2_instrumental.wav is poor, you can enable the uvr app to choose from more track-separation models
  • The default is the MDXC-architecture model model_bs_roformer_ep_368_sdr_12.9628.ckpt, which has a vocals SDR of 12.1
  • Check the Console and select a model whose stems include a vocals track (a higher SDR is better); run the separation, then download and save the result to use as the audio input for Training Data Preparation
log
2025-09-21 15:42:36.214 | INFO  14404 scheduler.py:56 - current time: 2025-09-21 15:42:36.214860
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Model Filename                                               Arch  Output Stems (SDR)  Friendly Name
---------------------------------------------------------------------------------------------------------------------------------------------------------------
MDX23C-8KFFT-InstVoc_HQ.ckpt                              MDXC  instrumental (15.8), vocals (10.6)   MDX23C Model: MDX23C-InstVoc HQ
MDX23C-8KFFT-InstVoc_HQ_2.ckpt                            MDXC  instrumental (15.9), vocals (10.5)   MDX23C Model VIP: MDX23C-InstVoc HQ 2
MDX23C_D1581.ckpt                                         MDXC  instrumental (15.5), vocals (10.0)   MDX23C Model VIP: MDX23C_D1581
mel_band_roformer_karaoke_aufr33_viperx_sdr_10.1956.ckpt  MDXC  instrumental (14.7), vocals* (8.4)   Roformer Model: Mel-Roformer-Karaoke-Aufr33-Viperx
mel_band_roformer_kim_ft_unwa.ckpt                        MDXC  other, vocals* (12.4)                Roformer Model: MelBand Roformer Kim | FT by unwa
melband_roformer_big_beta4.ckpt                           MDXC  other, vocals* (12.5)                Roformer Model: MelBand Roformer Kim | Big Beta 4 FT by unwa
melband_roformer_big_beta5e.ckpt                          MDXC  other, vocals* (12.4)                Roformer Model: MelBand Roformer Kim | Big Beta 5e FT by unwa
melband_roformer_inst_v1.ckpt                             MDXC  instrumental* (15.9), vocals (9.8)   Roformer Model: MelBand Roformer Kim | Inst V1 by Unwa
melband_roformer_inst_v1e.ckpt                            MDXC  instrumental* (15.8), vocals (9.6)   Roformer Model: MelBand Roformer Kim | Inst V1 (E) by Unwa
melband_roformer_inst_v2.ckpt                             MDXC  instrumental* (16.1), vocals (10.3)  Roformer Model: MelBand Roformer Kim | Inst V2 by Unwa
melband_roformer_instvoc_duality_v1.ckpt                  MDXC  instrumental (16.1), vocals (11.0)   Roformer Model: MelBand Roformer Kim | InstVoc Duality V1 by Unwa
melband_roformer_instvox_duality_v2.ckpt                  MDXC  instrumental (16.1), vocals (11.0)   Roformer Model: MelBand Roformer Kim | InstVoc Duality V2 by Unwa
MelBandRoformerBigSYHFTV1.ckpt                            MDXC  other, vocals* (12.3)                Roformer Model: MelBand Roformer Kim | Big SYHFT V1 by SYH99999
MelBandRoformerSYHFT.ckpt                                 MDXC  other, vocals* (8.0)                 Roformer Model: MelBand Roformer Kim | SYHFT by SYH99999
MelBandRoformerSYHFTV2.5.ckpt                             MDXC  other, vocals* (8.5)                 Roformer Model: MelBand Roformer Kim | SYHFT V2.5 by SYH99999
MelBandRoformerSYHFTV2.ckpt                               MDXC  other, vocals* (8.6)                 Roformer Model: MelBand Roformer Kim | SYHFT V2 by SYH99999
MelBandRoformerSYHFTV3Epsilon.ckpt                        MDXC  other, vocals* (9.5)                 Roformer Model: MelBand Roformer Kim | SYHFT V3 by SYH99999
model_bs_roformer_ep_317_sdr_12.9755.ckpt                 MDXC  instrumental (16.5), vocals* (11.8)  Roformer Model: BS-Roformer-Viperx-1297
model_bs_roformer_ep_368_sdr_12.9628.ckpt                 MDXC  instrumental (16.3), vocals* (12.1)  Roformer Model: BS-Roformer-Viperx-1296
model_mel_band_roformer_ep_3005_sdr_11.4360.ckpt          MDXC  instrumental (15.1), vocals* (10.5)  Roformer Model: Mel-Roformer-Viperx-1143
vocals_mel_band_roformer.ckpt                             MDXC  other, vocals* (12.6)                Roformer Model: MelBand Roformer | Vocals by Kimberley Jensen
...
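
To make the "higher vocals SDR is better" advice concrete, here is a minimal Python sketch that scans rows of the console table above and returns the model with the highest vocals SDR. The `best_vocals_model` helper and the regular expression are illustrative assumptions about the row layout shown in the log; they are not part of the uvr app.

```python
import re

# Matches "vocals (10.6)" as well as "vocals* (12.1)" in the Output Stems column.
VOCALS_SDR = re.compile(r"vocals\*?\s*\((\d+(?:\.\d+)?)\)")


def best_vocals_model(table_rows: list[str]) -> tuple[str, float]:
    """Return (model filename, vocals SDR) for the row with the highest vocals SDR."""
    best = None
    for row in table_rows:
        match = VOCALS_SDR.search(row)
        if not match:
            continue  # header, separator, or a model without a vocals score
        filename = row.split()[0]
        sdr = float(match.group(1))
        if best is None or sdr > best[1]:
            best = (filename, sdr)
    if best is None:
        raise ValueError("no row lists a vocals SDR")
    return best


rows = [
    "MDX23C-8KFFT-InstVoc_HQ.ckpt                              MDXC  instrumental (15.8), vocals (10.6)   MDX23C Model: MDX23C-InstVoc HQ",
    "model_bs_roformer_ep_368_sdr_12.9628.ckpt                 MDXC  instrumental (16.3), vocals* (12.1)  Roformer Model: BS-Roformer-Viperx-1296",
    "vocals_mel_band_roformer.ckpt                             MDXC  other, vocals* (12.6)                Roformer Model: MelBand Roformer | Vocals by Kimberley Jensen",
]
best_name, best_sdr = best_vocals_model(rows)
print(best_name, best_sdr)
```

Among the three rows used here, the sketch picks vocals_mel_band_roformer.ckpt, whose vocals SDR of 12.6 is higher than the default model's 12.1.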

Output

webapp/training/sovits/test/test.list

Logs

log
2025-09-21 07:54:23.510 | INFO  137210011969088 <frozen src.gradio.pages.gmt_>:262 - 跳过 idx=0 (时长 860ms < 最小值 1000ms)
2025-09-21 07:54:23.510 | INFO  137210011969088 <frozen src.gradio.pages.gmt_>:262 - 跳过 idx=1 (时长 975ms < 最小值 1000ms)
2025-09-21 07:54:23.602 | INFO  137210011969088 <frozen src.gradio.tools.toast>:29 - 总片段: 16, 保留: 14, 保存至: webapp/training/sovits/test/test.list (duration=3s)
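
The log shows segments shorter than a 1000 ms minimum being skipped (16 segments in total, 14 kept). A minimal sketch of that kind of filter might look like the following; `filter_segments` is a hypothetical helper, and the application's real implementation may differ.

```python
MIN_DURATION_MS = 1000  # minimum shown in the log above


def filter_segments(segments: list[dict], min_ms: int = MIN_DURATION_MS) -> list[dict]:
    """Keep only segments whose duration is at least min_ms milliseconds."""
    kept = []
    for seg in segments:
        if seg["duration"] < min_ms:
            print(f"skip idx={seg['idx']} (duration {seg['duration']}ms < minimum {min_ms}ms)")
            continue
        kept.append(seg)
    return kept


# idx 0 and 1 use the durations from the log; idx 4 uses the duration from the JSON sample.
demo = [
    {"idx": 0, "duration": 860},
    {"idx": 1, "duration": 975},
    {"idx": 4, "duration": 2526},
]
kept = filter_segments(demo)
print(f"total: {len(demo)}, kept: {len(kept)}")
```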

2. Dataset Formatting

[Image: gradio_app_gmt]

Format

List file content format

path | spk | lang | text

Path Format

f"{root}/{name}/speaker/spk_{spk}/{emotion.value}_{emotion.name.lower()}/【{emotion.value}_{emotion.name.lower()}_{idx}】{text}"
  |      ↓                  ↓     ↓               ↓                        ↓               ↓                      ↓      ↓
  |      test               0     难过             sad                     难过            sad                     4     曾经有一份真诚的爱情放在我面前
  ↓
  webapp/training/sovits
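
The path template above can be sketched in Python as follows. The `Emotion` enum here is a hypothetical stand-in (the member name supplies the English label and the value the Chinese one), and the `.wav` suffix is taken from the sample list entries rather than from the template itself.

```python
from enum import Enum


class Emotion(Enum):
    # Hypothetical enum: only the "sad" member from the example is defined.
    SAD = "难过"


def build_wav_path(root: str, name: str, spk: int, emotion: Emotion, idx: int, text: str) -> str:
    """Assemble the audio path following the template shown above."""
    tag = f"{emotion.value}_{emotion.name.lower()}"
    return f"{root}/{name}/speaker/spk_{spk}/{tag}/【{tag}_{idx}】{text}.wav"


path = build_wav_path(
    root="webapp/training/sovits",
    name="test",
    spk=0,
    emotion=Emotion.SAD,
    idx=4,
    text="曾经有一份真诚的爱情放在我面前，",
)
print(path)
```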

Input

webapp/training/sovits/test/test.list
text
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_4】曾经有一份真诚的爱情放在我面前，.wav|0|zh|曾经有一份真诚的爱情放在我面前，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_5】我没有珍惜，.wav|0|zh|我没有珍惜，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_6】等我失去的时候，.wav|0|zh|等我失去的时候，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_7】我才后悔莫及。.wav|0|zh|我才后悔莫及。
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_8】人世间最痛苦的事莫过于此，.wav|0|zh|人世间最痛苦的事莫过于此，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_9】你的剑在我的咽喉上割下去吧，.wav|0|zh|你的剑在我的咽喉上割下去吧，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_10】不用再犹豫了。.wav|0|zh|不用再犹豫了。
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_11】如果上天能够给我一个再来一次的机会，.wav|0|zh|如果上天能够给我一个再来一次的机会，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_12】我会对那个女孩子说三个字，.wav|0|zh|我会对那个女孩子说三个字，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_13】我爱你，.wav|0|zh|我爱你，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_14】你不非要在这份爱上加个期限，.wav|0|zh|你不非要在这份爱上加个期限，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_15】我希望是一万年那。.wav|0|zh|我希望是一万年那。
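
For readers who want to inspect or validate a list file, here is a small parsing sketch for the `path|spk|lang|text` layout. `ListEntry` and `parse_list_line` are hypothetical names; limiting the split to three separators is a defensive choice so that a `|` inside the text would not break parsing.

```python
from typing import NamedTuple


class ListEntry(NamedTuple):
    path: str
    spk: int
    lang: str
    text: str


def parse_list_line(line: str) -> ListEntry:
    """Split one path|spk|lang|text line into its four fields."""
    path, spk, lang, text = line.rstrip("\n").split("|", 3)
    return ListEntry(path, int(spk), lang, text)


entry = parse_list_line(
    "webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_13】我爱你，.wav|0|zh|我爱你，"
)
print(entry.spk, entry.lang, entry.text)
```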

Output

xxx
【难过_sad_4】曾经有一份真诚的爱情放在我面前，.wav	c eng2 j ing1 y ou3 y i2 f en4 zh en1 ch eng2 d e5 AA ai4 q ing2 f ang4 z ai4 w o3 m ian4 q ian2 ,	[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]	曾经有一份真诚的爱情放在我面前,
【难过_sad_5】我没有珍惜，.wav	w o3 m ei2 y ou3 zh en1 x i1 ,	[2, 2, 2, 2, 2, 1]	我没有珍惜,
【难过_sad_6】等我失去的时候，.wav	d eng2 w o3 sh ir1 q v4 d e5 sh ir2 h ou5 ,	[2, 2, 2, 2, 2, 2, 2, 1]	等我失去的时候,
【难过_sad_7】我才后悔莫及。.wav	w o3 c ai2 h ou4 h ui3 m o4 j i2 .	[2, 2, 2, 2, 2, 2, 1]	我才后悔莫及.
【难过_sad_8】人世间最痛苦的事莫过于此，.wav	r en2 sh ir4 j ian1 z ui4 t ong4 k u3 d e5 sh ir4 m o4 g uo5 y v2 c i03 ,	[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]	人世间最痛苦的事莫过于此,
【难过_sad_9】你的剑在我的咽喉上割下去吧，.wav	n i3 d e5 j ian4 z ai4 w o3 d e5 y En1 h ou2 sh ang4 g e1 x ia4 q v5 b a5 ,	[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]	你的剑在我的咽喉上割下去吧,
【难过_sad_10】不用再犹豫了。.wav	b u2 y ong4 z ai4 y ou2 y v4 l e5 .	[2, 2, 2, 2, 2, 2, 1]	不用再犹豫了.
【难过_sad_11】如果上天能够给我一个再来一次的机会，.wav	r u2 g uo3 sh ang4 t ian1 n eng2 g ou4 g ei2 w o3 y i2 g e5 z ai4 l ai2 y i2 c i04 d e5 j i1 h ui4 ,	[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]	如果上天能够给我一个再来一次的机会,
【难过_sad_12】我会对那个女孩子说三个字，.wav	w o3 h ui4 d ui4 n a4 g e5 n v3 h ai2 z i05 sh uo1 s an1 g e5 z i04 ,	[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]	我会对那个女孩子说三个字,
【难过_sad_13】我爱你，.wav	w o3 AA ai4 n i3 ,	[2, 2, 2, 1]	我爱你,
【难过_sad_14】你不非要在这份爱上加个期限，.wav	n i3 b u4 f ei1 y ao4 z ai4 zh e4 f en4 AA ai4 sh ang4 j ia1 g e5 q i1 x ian4 ,	[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]	你不非要在这份爱上加个期限,
【难过_sad_15】我希望是一万年那。.wav	w o3 x i1 w ang4 sh ir4 y i2 w an4 n ian2 n a4 .	[2, 2, 2, 2, 2, 2, 2, 2, 1]	我希望是一万年那.
xxx
【难过_sad_4】曾经有一份真诚的爱情放在我面前，.wav	181 200 185 1005 651 651 752 696 208 334 200 382 99 612 1001 338 334 679 334 1001 837 160 1005 185 185 420 1001 844 433 200 232 625 341 758 758 200 1005 200 344 420 1005 1005 837 96 763 438 435 312 121 297 134 377 637 688 369 70 877 828 129 928 320 902 159 62 297 341 396 861 354 946 956 403 936 869 568 176 948 417 439 380 595 419 150 705 407 425 142 67 611 984 766 651 822 723 912 449 737 940 148 811 525 938 424 921 161 45 771 80
【难过_sad_5】我没有珍惜，.wav	930 263 774 1001 273 754 273 263 23 420 185 1001 1001 334 338 334 578 263 565 656 797 593 593 593 1002 16 451 382 582 239 612 593 804 535 914 681 660 996 684 901 636 413 714 686 425 590 831 946 288 1003 705 300 963 544 612 232
【难过_sad_6】等我失去的时候，.wav	103 451 334 451 208 1005 200 200 263 417 417 263 273 953 797 417 432 365 565 647 232 844 576 428 576 334 263 425 321 941 951 18 577 519 145 438 911 600 416 299 134 1003 603 906 200 176 691 703 1003 588 422 663 134 434 909 651 774
【难过_sad_7】我才后悔莫及。.wav	1005 365 797 382 263 1005 1005 338 221 205 576 221 774 565 200 647 328 282 1005 3 221 328 200 76 332 107 544 547 198 764 715 764 42 233 449 625 256 79 846 717 1001 862 631 660 873 86 485 773 705 196 1003 671 637 529
【难过_sad_8】人世间最痛苦的事莫过于此，.wav	857 876 797 797 334 420 663 663 263 451 263 1001 774 338 283 1001 1001 205 134 544 278 406 340 224 817 718 135 168 247 243 57 995 995 27 663 576 273 638 200 1005 578 437 323 232 582 282 282 282 774 1001 221 821 417 844 49 691 854 997 807 166 438 438 246 129 179 124 1 491 299 221 976 491 690 136 213 339 237 1018 119 152 805 185 560 245 805 500 590 617 686 166 514 1003 1003 325 689 935 1001 16 99 263 797 263 334 451 208 277 27 474 56 370 233 830 318 431 713 655 21 120 632 632 548 577 150 197 197 178 396 651 401 696
【难过_sad_9】你的剑在我的咽喉上割下去吧，.wav	185 565 679 277 797 656 612 1005 1005 578 283 221 199 221 199 221 199 199 1005 221 22 420 774 774 334 205 876 876 200 1005 647 282 1001 578 985 647 334 1005 656 334 334 365 185 299 591 456 871 529 76 943 300 911 742 416 377 443 471 889 889 917 949 917 240 817 205 232 90 885 671 322 188 757 941 466 1002 393 877 434 443 157 201 983 420 248 294 873 325 513 381 617 530 530 640 860 204 24 1 209 452 922 545 80 762 533 787 239 134 764 936 16 365
【难过_sad_10】不用再犹豫了。.wav	338 910 382 565 578 804 1001 1005 774 656 99 656 651 1005 205 283 273 982 328 496 941 655 335 614 370 825 577 645 659 595 811 179 488 47 757 1015 632 148 148 548 627 173 617 649 477 953 338
【难过_sad_11】如果上天能够给我一个再来一次的机会，.wav	428 647 647 338 774 936 936 936 774 581 858 997 581 263 581 647 221 10 420 1001 656 200 334 334 876 221 858 576 221 428 428 764 535 1001 185 638 997 997 936 420 420 1001 221 1001 232 696 760 774 541 451 200 625 23 200 420 656 656 1001 953 581 365 953 930 1005 208 663 612 2 663 752 248 857 674 979 215 244 188 920 4 650 332 661 790 27 33 769 669 660 590 778 577 119 1015 530 340 20 331 608 424 71 813 994 637 232 430 905 13 150 131 794 813 564 472 201 828 525 906 76 763 196 992 825 564 999 921 763 685 523 120 894 185 248 663 969 731 748 930
【难过_sad_12】我会对那个女孩子说三个字，.wav	844 263 844 656 14 23 23 420 208 953 257 451 208 647 576 474 804 647 578 99 184 738 729 63 184 243 750 663 656 797 797 365 365 790 71 168 134 764 281 1011 490 728 36 669 341 759 422 367 129 294 935 393 854 184 312 747 595 956 364 925 986 666 166 437 421 178 814 150 336 664 734 362 526 519 422 760 171 197 396 568 961
【难过_sad_13】我爱你，.wav	799 185 663 612 623 582 1005 582 1005 200 22 656 582 23 23 722 23 1005 1005 774 581 647 647 263 774 715 718 517 993 517 764 437 10 200 593 612 221 903 257 366 394 226 226 186 821 936 417 156 936 534 581 263 263 185 283 221 1005 365 263 647 221 876 258 221 774 200 985 221 837 3 631 987 893 47 923 623 156 202 733 605 139 404 738 535 221 179
【难过_sad_14】你不非要在这份爱上加个期限，.wav	634 504 504 59 752 256 804 578 334 804 451 474 578 876 797 876 451 451 283 774 282 221 411 578 221 876 535 282 341 1001 565 647 754 221 581 582 774 936 821 428 534 997 997 576 501 834 997 181 556 556 834 821 534 534 534 958 958 997 534 534 534 534 847 731 312 821 437 117 442 422 581 925 212 127 132 802 483 510 563 533 773 455 322 500 645 483 1017 688 846 173 982 547 595 172 514 10 276 257 773 745 453 924 564 991 740 994 539 161 705 300 559 518 994 544 745 963 688 906 627 418
【难过_sad_15】我希望是一万年那。.wav	641 844 930 930 641 593 930 338 944 184 406 406 406 406 718 406 406 406 406 406 995 57 57 581 625 581 625 844 625 593 625 541 582 696 837 565 200 200 844 221 752 752 581 232 876 232 221 754 754 227 625 754 629 456 22 844 997 844 844 203 404 203 844 641 930 641 404 641 181 404 404 203 930 638 181 997 638 638 232 752 930 181 918 997 844 638 997 844 844 581 844 844 844 565 232 565 185 893 886 187 161 705 567 637 284 535 275 206 508 534 729 475 963 745 25 445 552 576 576 961 428 936 576 961 918 997 581 581 956 470 665 282 779 563 649 184 203 526 805 961 569 253 949 127 127 205 576 961 876 997 876 961 918 560 420 625 334 754 432 334 334 263 844 582 263 930 581 52 567 738 221 576 581 997 534 997 774 576 753 128 709 709
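
Judging from the samples, each line of the first output block appears to hold four tab-separated fields: file name, phoneme sequence, a per-character phoneme count list, and the normalized text. That reading is an assumption drawn from the data, not a documented specification. Under that assumption, a quick consistency check can be sketched as follows (`parse_text_line` is a hypothetical helper):

```python
import ast


def parse_text_line(line: str) -> tuple[str, list[str], list[int], str]:
    """Split one tab-separated line into name, phonemes, per-character counts, text."""
    name, phones, word2ph, norm_text = line.rstrip("\n").split("\t")
    return name, phones.split(" "), ast.literal_eval(word2ph), norm_text


line = "【难过_sad_5】我没有珍惜，.wav\tw o3 m ei2 y ou3 zh en1 x i1 ,\t[2, 2, 2, 2, 2, 1]\t我没有珍惜,"
name, phones, counts, norm_text = parse_text_line(line)

# The counts should add up to the number of phonemes,
# with one count per character of the normalized text.
print(len(phones), sum(counts), len(counts), len(norm_text))
```

In the sample line, five characters contribute two phonemes each and the trailing comma contributes one, which matches the eleven phoneme tokens.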

Logs โ€‹

log
2025-09-21 06:38:47.451 | INFO  137210011969088 <frozen src.gradio.tools.toast>:29 - ๆ–‡ๆœฌๅˆ†่ฏไธŽ็‰นๅพๆๅ– (duration=3s)
"/usr/bin/python3" -s src/support/third_party/gptsovits/GPT_SoVITS/prepare_datasets/1-get-text.py
ใ€้šพ่ฟ‡_sad_4ใ€‘ๆ›พ็ปๆœ‰ไธ€ไปฝ็œŸ่ฏš็š„็ˆฑๆƒ…ๆ”พๅœจๆˆ‘้ขๅ‰๏ผŒ.wav
ใ€้šพ่ฟ‡_sad_5ใ€‘ๆˆ‘ๆฒกๆœ‰็ๆƒœ๏ผŒ.wav
ใ€้šพ่ฟ‡_sad_6ใ€‘็ญ‰ๆˆ‘ๅคฑๅŽป็š„ๆ—ถๅ€™๏ผŒ.wav
ใ€้šพ่ฟ‡_sad_7ใ€‘ๆˆ‘ๆ‰ๅŽๆ‚”่ŽซๅŠใ€‚.wav
ใ€้šพ่ฟ‡_sad_8ใ€‘ไบบไธ–้—ดๆœ€็—›่‹ฆ็š„ไบ‹่Žซ่ฟ‡ไบŽๆญค๏ผŒ.wav
ใ€้šพ่ฟ‡_sad_9ใ€‘ไฝ ็š„ๅ‰‘ๅœจๆˆ‘็š„ๅ’ฝๅ–‰ไธŠๅ‰ฒไธ‹ๅŽปๅง๏ผŒ.wav
ใ€้šพ่ฟ‡_sad_10ใ€‘ไธ็”จๅ†็Šน่ฑซไบ†ใ€‚.wav
ใ€้šพ่ฟ‡_sad_11ใ€‘ๅฆ‚ๆžœไธŠๅคฉ่ƒฝๅคŸ็ป™ๆˆ‘ไธ€ไธชๅ†ๆฅไธ€ๆฌก็š„ๆœบไผš๏ผŒ.wav
ใ€้šพ่ฟ‡_sad_12ใ€‘ๆˆ‘ไผšๅฏน้‚ฃไธชๅฅณๅญฉๅญ่ฏดไธ‰ไธชๅญ—๏ผŒ.wav
ใ€้šพ่ฟ‡_sad_13ใ€‘ๆˆ‘็ˆฑไฝ ๏ผŒ.wav
ใ€้šพ่ฟ‡_sad_14ใ€‘ไฝ ไธ้ž่ฆๅœจ่ฟ™ไปฝ็ˆฑไธŠๅŠ ไธชๆœŸ้™๏ผŒ.wav
ใ€้šพ่ฟ‡_sad_15ใ€‘ๆˆ‘ๅธŒๆœ›ๆ˜ฏไธ€ไธ‡ๅนด้‚ฃใ€‚.wav
2025-09-21 06:39:06.815 | INFO  137210011969088 <frozen src.gradio.tools.toast>:29 - ่ฏญ้Ÿณ่‡ช็›‘็ฃ็‰นๅพๆๅ– (duration=3s)
"/usr/bin/python3" -s src/support/third_party/gptsovits/GPT_SoVITS/prepare_datasets/2-get-hubert-wav32k.py
"/usr/bin/python3" -s src/support/third_party/gptsovits/GPT_SoVITS/prepare_datasets/2-get-sv.py
2025-09-21 06:39:26.681 | INFO  137210011969088 <frozen src.gradio.tools.toast>:29 - ่ฏญไน‰Tokenๆๅ– (duration=3s)
"/usr/bin/python3" -s src/support/third_party/gptsovits/GPT_SoVITS/prepare_datasets/3-get-semantic.py
2025-09-21 06:39:36.800 | INFO  137210011969088 <frozen src.gradio.tools.toast>:29 - ๆ•ฐๆฎ้›†ๆ ผๅผๅŒ–ๆˆๅŠŸ (duration=3s)

3. Fine-tuning Training

[Image: gradio_app_gmt]

Input

webapp/training/sovits/test/test.list

Output

Model training version = v2Pro (the version name is substituted into the weight directory names below)

webapp/training/sovits/weights/GPT_weights_{Model training version} -> webapp/training/sovits/weights/GPT_weights_v2Pro

webapp/training/sovits/weights/SoVITS_weights_{Model training version} -> webapp/training/sovits/weights/SoVITS_weights_v2Pro

Logs

GPT
log
2025-09-21 06:21:59.296 | INFO  136692680197696 <frozen src.gradio.tools.toast>:29 - GPT 模型训练开始 (duration=3s)
"/usr/bin/python3" -s src/support/third_party/gptsovits/GPT_SoVITS/s1_train.py --config_file "webapp/training/sovits/temp/tmp_s1.yaml"
2025-09-21 06:22:04.108 | INFO  136691218114112 <frozen src.router.scheduler>:56 - current time: 2025-09-21 06:22:04.108651
<All keys matched successfully>
ckpt_path: None
semantic_data_len: 14
phoneme_data_len: 14
        item_name                                     semantic_audio
1      【难过_sad_4】曾经有一份真诚的爱情放在我面前，.wav  181 200 185 1005 651 651 752 696 208 334 200 3...
2      【难过_sad_5】我没有珍惜，.wav  930 263 774 1001 273 754 273 263 23 420 185 10...
3      【难过_sad_6】等我失去的时候，.wav  103 451 334 451 208 1005 200 200 263 417 417 2...
4      【难过_sad_7】我才后悔莫及。.wav  1005 365 797 382 263 1005 1005 338 221 205 576...
5      【难过_sad_8】人世间最痛苦的事莫过于此，.wav  857 876 797 797 334 420 663 663 263 451 263 10...
6      【难过_sad_9】你的剑在我的咽喉上割下去吧，.wav  185 565 679 277 797 656 612 1005 1005 578 283 ...
7      【难过_sad_10】不用再犹豫了。.wav  338 910 382 565 578 804 1001 1005 774 656 99 6...
8      【难过_sad_11】如果上天能够给我一个再来一次的机会，.wav  428 647 647 338 774 936 936 936 774 581 858 99...
9      【难过_sad_12】我会对那个女孩子说三个字，.wav  844 263 844 656 14 23 23 420 208 953 257 451 2...
10     【难过_sad_13】我爱你，.wav  799 185 663 612 623 582 1005 582 1005 200 22 6...
11     【难过_sad_14】你不非要在这份爱上加个期限，.wav  634 504 504 59 752 256 804 578 334 804 451 474...
12     【难过_sad_15】我希望是一万年那。.wav  641 844 930 930 641 593 930 338 944 184 406 40...
...
dataset.__len__(): 96
Epoch 14: 100%|█| 7/7 [00:04<00:00,  1.45it/s, v_num=0, total_loss_step=1.77e+3,
2025-09-21 06:22:41.355 | INFO  136692680197696 <frozen src.gradio.tools.toast>:29 - GPT 模型训练结束 (duration=3s)
SoVITS
log
2025-09-21 06:25:39.005 | INFO  136692680197696 <frozen src.gradio.tools.toast>:29 - SoVITS 模型训练开始 (duration=3s)
"/usr/bin/python3" -s src/support/third_party/gptsovits/GPT_SoVITS/s2_train.py --config "webapp/training/sovits/temp/tmp_s2.json"
phoneme_data_len: 14
wav_data_len: 98
100%|████████████████████████████████████████| 98/98 [00:00<00:00, 88358.08it/s]
skipped_phone:  0 , skipped_dur:  0
total left:  98
2025-09-21 06:26:04.112 | INFO  136691218114112 <frozen src.router.scheduler>:56 - current time: 2025-09-21 06:26:04.112067
loaded pretrained src/support/third_party/gptsovits/GPT_SoVITS/pretrained_models/v2Pro/s2Gv2ProPlus.pth <All keys matched successfully>
loaded pretrained src/support/third_party/gptsovits/GPT_SoVITS/pretrained_models/v2Pro/s2Gv2ProPlus.pth <All keys matched successfully>
loaded pretrained src/support/third_party/gptsovits/GPT_SoVITS/pretrained_models/v2Pro/s2Dv2ProPlus.pth <All keys matched successfully>
start training from epoch 1
 12%|█████▋                                       | 1/8 [00:55<06:30, 55.73s/it]
100%|█████████████████████████████████████████████| 8/8 [01:02<00:00,  7.77s/it]
100%|█████████████████████████████████████████████| 8/8 [01:02<00:00,  7.77s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00,  1.66s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00,  1.66s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00,  1.64s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00,  1.64s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00,  1.66s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00,  1.66s/it]
100%|█████████████████████████████████████████████| 8/8 [00:10<00:00,  1.26s/it]
100%|█████████████████████████████████████████████| 8/8 [00:18<00:00,  2.30s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00,  1.65s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00,  1.65s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00,  1.67s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00,  1.67s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00,  1.67s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00,  1.67s/it]
...
training done
2025-09-21 06:28:55.882 | INFO  136692680197696 <frozen src.gradio.tools.toast>:29 - SoVITS 模型训练成功 (duration=3s)

Inference Usage

How to Use Fine-tuned Models?

1. Configuration Instructions

webapp/data/sovits.json
json
{
  "v1": [],
  "v2": [],
  "v2Pro": [
    {
      "name": "至尊宝",
      "gender": "Male",
      "locale": "zh-CN",
      "model": {                // Fine-tuning model directory (required, absolute path)
        "默认": {
          "gpt": "至尊宝-e15.ckpt",
          "vits": "至尊宝_e4_s32.pth"
        }
      },
      "ref": {                  // Main reference audio (corresponds to voice library ID)
        "默认": 10000000
      },
      "aux": [                  // Auxiliary reference audio list (optional, absolute path)
        "aux_ref_audio_path1.wav",
        "aux_ref_audio_path2.wav",
        "aux_ref_audio_path3.wav"
      ]
    }
  ],
  "v2ProPlus": [],
  "v3": [],
  "v4": []
}
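
The sketch below shows one way the `gpt` and `vits` entries could be turned into full weight paths. It assumes that bare file names are resolved against the version-specific weight directories produced by training (which is what the paths in the inference log suggest) and that absolute paths are used unchanged. `resolve_weights` and `WEIGHTS_ROOT` are hypothetical names, not the application's API, and the inline `//` comments from the annotated example are left out because standard JSON parsers reject them.

```python
import json
from pathlib import PurePosixPath

WEIGHTS_ROOT = PurePosixPath("webapp/training/sovits/weights")


def resolve_weights(version: str, model_entry: dict[str, str]) -> dict[str, str]:
    """Map the gpt/vits file names of one speaking style to full paths (assumed layout)."""
    dirs = {
        "gpt": WEIGHTS_ROOT / f"GPT_weights_{version}",
        "vits": WEIGHTS_ROOT / f"SoVITS_weights_{version}",
    }
    resolved = {}
    for kind, filename in model_entry.items():
        path = PurePosixPath(filename)
        # Absolute paths are kept as-is; bare names are joined to the version directory.
        resolved[kind] = str(path if path.is_absolute() else dirs[kind] / path)
    return resolved


config = json.loads("""
{"v2Pro": [{"name": "至尊宝",
            "model": {"默认": {"gpt": "至尊宝-e15.ckpt", "vits": "至尊宝_e4_s32.pth"}},
            "ref": {"默认": 10000000}}]}
""")
voice = config["v2Pro"][0]
paths = resolve_weights("v2Pro", voice["model"]["默认"])
print(paths["gpt"])
print(paths["vits"])
```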

How to control emotions?

There are currently two methods, with different configurations and effects; choose the one that fits your needs. Taking "Happy" as an example:

  • Generic: the character's training data includes, but is not limited to, "Happy". This is the "Default" model.
  • Custom: the character's training data contains only "Happy".
  • Unspecified: if no emotion is specified, the default is used.
webapp/data/sovits.json
json
{
    "name": "xxx",
    "gender": "Male",
    "locale": "zh-CN",
    "model": {
        "Default": {
            "gpt": "xxx-e15.ckpt",
            "vits": "xxx_e4_s32.pth"
        },
        "Happy": {
            "gpt": "xxx_happy-e15.ckpt",
            "vits": "xxx_happy_e4_s32.pth"
        },
        "Sad": {
            "gpt": "xxx_sad-e15.ckpt",
            "vits": "xxx_sad_e4_s32.pth"
        }
    },
    "ref": {
        "Default": 10000000,
        "Happy": 10000001,
        "Sad": 10000002,
        "Angry": 10000003
    }
}
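
One plausible reading of the configuration above is a per-emotion lookup with a fallback to "Default". The sketch below illustrates that idea only; `pick_style` is a hypothetical helper, and the fallback rule for an emotion that has a reference audio but no dedicated model (such as "Angry" here) is an assumption rather than documented behavior.

```python
from typing import Optional


def pick_style(voice: dict, emotion: Optional[str], default_key: str = "Default"):
    """Return (model entry, reference ID) for an emotion, falling back to the default."""
    models, refs = voice["model"], voice["ref"]
    model_key = emotion if emotion in models else default_key
    ref_key = emotion if emotion in refs else default_key
    return models[model_key], refs[ref_key]


voice = {
    "name": "xxx",
    "model": {
        "Default": {"gpt": "xxx-e15.ckpt", "vits": "xxx_e4_s32.pth"},
        "Happy": {"gpt": "xxx_happy-e15.ckpt", "vits": "xxx_happy_e4_s32.pth"},
        "Sad": {"gpt": "xxx_sad-e15.ckpt", "vits": "xxx_sad_e4_s32.pth"},
    },
    "ref": {"Default": 10000000, "Happy": 10000001, "Sad": 10000002, "Angry": 10000003},
}

for emotion in ("Happy", "Angry", None):
    model, ref = pick_style(voice, emotion)
    print(emotion, model["gpt"], ref)
```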

2. Page Selection

If no model is configured, the speaking style list has no "Default" option and the voice list has no "xxx" option. In that case, selecting any other voice uses the base model for voice cloning during inference.

[Image: gradio_app_gmt]

3. Logs

log
2025-09-17 14:34:31.336 | INFO  14272 gptsovits_tts.py:72 - First load, please wait...
2025-09-17 14:34:31.338 | INFO  14272 gptsovits_tts.py:78 - Loading Tts model v2Pro_至尊宝_默认 on device cpu
2025-09-17 14:34:31.348 | INFO  14272 TTS.py:589 - Loading Text2Semantic weights from D:/pretrained_models/s1v3.ckpt
2025-09-17 14:34:32.920 | INFO  14272 TTS.py:651 - loading vocoder
2025-09-17 14:34:32.987 | INFO  14272 TTS.py:556 - Loading VITS weights from D:/pretrained_models/v2Pro/s2Gv2Pro.pth. <All keys matched successfully>
2025-09-17 14:34:33.006 | INFO  14272 TTS.py:479 - Loading BERT weights from D:/pretrained_models/chinese-roberta-wwm-ext-large
2025-09-17 14:34:33.478 | INFO  14272 TTS.py:471 - Loading CNHuBERT weights from D:/pretrained_models/chinese-hubert-base
<!-- Custom model will print the following -->
2025-09-17 14:41:05.508 | INFO  2880  TTS.py:589 - Loading Text2Semantic weights from webapp/training/sovits/weights/GPT_weights_v2Pro/至尊宝-e15.ckpt
2025-09-17 14:41:06.913 | INFO  2880  TTS.py:560 - Loading VITS pretrained weights from webapp/training/sovits/weights/SoVITS_weights_v2Pro/至尊宝_e4_s32.pth. <All keys matched successfully>
2025-09-17 14:41:07.097 | INFO  2880  TTS.py:571 - Loading LoRA weights from webapp/training/sovits/weights/SoVITS_weights_v2Pro/至尊宝_e4_s32.pth. _IncompatibleKeys(missing_keys=['enc_p.ssl_proj.weight', 'enc_p.ssl_proj.bias', ....])
2025-09-17 14:34:33.648 | INFO  14272 cache.py:67 - {'v2Pro': {'last_used': '2025-09-17 14:34:33', 'usage': 1},'v2Pro_至尊宝_默认': {'last_used': '2025-09-17 14:36:25', 'usage': 1}}
2025-09-17 14:34:33.649 | INFO  14272 TTS.py:197 - Set seed to 99578076
2025-09-17 14:34:33.652 | INFO  14272 TTS.py:1046 - Parallel Inference Mode Enabled
2025-09-17 14:34:34.676 | INFO  14272 TTS.py:1118 - Actual Input Reference Text:
2025-09-17 14:34:35.826 | INFO  14272 TextPreprocessor.py:61 - ############ Segment Text ############
2025-09-17 14:34:35.827 | INFO  14272 TextPreprocessor.py:84 - Actual Input Target Text:
2025-09-17 14:34:35.827 | INFO  14272 TextPreprocessor.py:85 - CreatorBox，为创作者而生，提升创作效率，释放创作潜力.
2025-09-17 14:34:35.828 | INFO  14272 TextPreprocessor.py:114 - Actual Input Target Text (after sentence segmentation):
2025-09-17 14:34:35.829 | INFO  14272 TextPreprocessor.py:115 - ['CreatorBox，', '为创作者而生，', '提升创作效率，', '释放创作潜力.']
2025-09-17 14:34:35.829 | INFO  14272 TextPreprocessor.py:65 - ############ Extract Text BERT Features ############
<!-- Parallel processing, omitted -->
2025-09-17 14:34:40.471 | INFO  14272 TTS.py:1187 - ############ Inference ############
2025-09-17 14:34:40.471 | INFO  14272 TTS.py:1209 - Processed text from the frontend (per sentence):
2025-09-17 14:34:40.471 | INFO  14272 TTS.py:1217 - ############ Predict Semantic Token ############
  2%|████                                                                                                                                                                                                               | 29/1500 [00:00<00:19, 75.04it/s]T2S Decoding EOS [141 -> 176]
  2%|████▊                                                                                                                                                                                                              | 34/1500 [00:00<00:21, 68.26it/s]
<!-- ..... -->
2025-09-17 14:35:50.402 | INFO  14272 TTS.py:1258 - ############ Synthesize Audio ############
2025-09-17 14:35:50.402 | INFO  14272 TTS.py:1305 - Parallel Synthesis in Progress...
2025-09-17 14:36:18.686 | INFO  14272 TTS.py:1342 - 2.173       4.645   2.194   96.021
2025-09-17 14:36:18.688 | INFO  14272 gptsovits_tts.py:195 - speech len 6.96, rtf 15.091852826633673
2025-09-17 14:36:18.864 | INFO  14272 response.py:52 - {"path":"webapp/tts/sovits_zh-CN_至尊宝_1.00_1.00_1.00_32_0.wav","duration":6.96,"seed":99578076}
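
The `rtf` value near the end of the log is a real-time factor: processing time divided by the length of the generated audio, so a value above 1 means synthesis took longer than the audio lasts (this run was on CPU). The sketch below treats the four numbers logged just before it as per-stage timings in seconds; that reading is an assumption, although their sum divided by the 6.96 s speech length comes out close to the logged value. `real_time_factor` is a hypothetical helper.

```python
def real_time_factor(stage_seconds: list[float], audio_seconds: float) -> float:
    """Return total processing time divided by the length of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return sum(stage_seconds) / audio_seconds


# Values copied from the log above; reading them as per-stage seconds is an assumption.
stages = [2.173, 4.645, 2.194, 96.021]
rtf = real_time_factor(stages, 6.96)
print(f"rtf is roughly {rtf:.2f}")
```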