Model Training 🧠
Fine-tuning and inference based on the GPTSoVITS base model.
Use Cases
If a 3 to 6 second voice sample is not sufficient for your cloning needs, you can train the model to improve similarity and realism.
Application Preview

Fine-tuning Training
1. Training Data Preparation

Format
JSON file content format
[
  ...
  {
    "idx": 4,
    "spk": 0,
    "lang": "en",
    "start": 9619,
    "end": 12145,
    "duration": 2526,
    "text": "曾经有一份真诚的爱情放在我面前，",
    "text_trans": "I was offered true love,",
    "speed": 1.0
  }
  ...
]

Input
webapp/temp/test2/test2_xx.json
webapp/temp/test2/stems/test2_instrumental.wav
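Before using the segment file for training, it can help to sanity-check it. The sketch below is only an illustration: the key set is taken from the sample record, the rule that `duration` equals `end - start` (in milliseconds) holds for the sample but is not stated by the tool, and the names `check_segments` and `load_segments` are hypothetical.

```python
import json

# Keys copied from the sample record; treating all of them as mandatory is an assumption.
REQUIRED_KEYS = {"idx", "spk", "lang", "start", "end", "duration", "text"}


def load_segments(path):
    """Read the JSON array of segment records from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def check_segments(segments):
    """Return a list of human-readable problems found in the records."""
    problems = []
    for seg in segments:
        missing = REQUIRED_KEYS - seg.keys()
        if missing:
            problems.append(f"idx={seg.get('idx')}: missing {sorted(missing)}")
            continue
        # Assumption: times are milliseconds and duration = end - start.
        if seg["end"] - seg["start"] != seg["duration"]:
            problems.append(f"idx={seg['idx']}: duration does not match end - start")
    return problems


sample = [{
    "idx": 4, "spk": 0, "lang": "en",
    "start": 9619, "end": 12145, "duration": 2526,
    "text": "曾经有一份真诚的爱情放在我面前，",
    "text_trans": "I was offered true love,",
    "speed": 1.0,
}]
print(check_segments(sample))
```

For a real run, `check_segments(load_segments("webapp/temp/test2/test2_xx.json"))` would apply the same checks to the file listed above.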
Note
- Go to the homepage and upload an audio or video file.
- Open Subtitles debug.
- Select Provider and the related parameters.
- Enable Track Separation, Voice Print Alignment, Emotion Recognition, and Speaker Count as needed.
- Click Execute at the bottom.
After execution completes, test2_xx.json and test2_instrumental.wav are available in the output.
About the test2_xx.json file
- Without Track Separation, you can upload processed audio separately (see below).
- Without Voice Print Alignment, you can align srt files manually with the third-party tool Subtitle Edit, then select the srt provider and upload them.
- Without Emotion Recognition, the emotion defaults to Neutral.
- Without Speaker Count, the speaker defaults to 0, meaning all segments are treated as the same person.
- Recommendations
  - For long audio/video, select the SRT or CapCut option to greatly reduce recognition time.
  - For Chinese audio/video, select the FunAsr option.
  - For English audio/video, select the FasterWhisper option.
About the test2_instrumental.wav file
- If the voice separation quality of test2_instrumental.wav is poor, you can enable the uvr app to choose from more models for track separation.
- By default, the MDXC architecture model model_bs_roformer_ep_368_sdr_12.9628.ckpt is used, with a vocals SDR value of 12.1.
- Check the Console and select a model whose stems contain a vocals track (a higher SDR value is better). Run the separation, then download and save the result; it can be used as the audio input for Training Data Preparation.
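The selection rule in the last bullet (prefer the model with the highest vocals SDR) can be sketched in code. This is only an illustration: the helper name `best_vocals_model` is hypothetical, and it assumes rows formatted like the Console listing in this section, with the filename first and the vocals score written as `vocals (N)` or `vocals* (N)`.

```python
import re

# Matches "vocals (10.6)" or "vocals* (12.1)" and captures the SDR number.
VOCALS_SDR = re.compile(r"vocals\*?\s*\((\d+(?:\.\d+)?)\)")


def best_vocals_model(rows):
    """Return (sdr, filename) for the row with the highest vocals SDR, or None."""
    best = None
    for row in rows:
        match = VOCALS_SDR.search(row)
        if match is None:
            continue  # row has no vocals stem with a score
        candidate = (float(match.group(1)), row.split()[0])
        if best is None or candidate > best:
            best = candidate
    return best


rows = [
    "MDX23C-8KFFT-InstVoc_HQ.ckpt MDXC instrumental (15.8), vocals (10.6) MDX23C Model: MDX23C-InstVoc HQ",
    "model_bs_roformer_ep_368_sdr_12.9628.ckpt MDXC instrumental (16.3), vocals* (12.1) Roformer Model: BS-Roformer-Viperx-1296",
    "vocals_mel_band_roformer.ckpt MDXC other, vocals* (12.6) Roformer Model: MelBand Roformer | Vocals by Kimberley Jensen",
]
print(best_vocals_model(rows))
```

SDR is only one signal; listening to the separated vocals is still the final check.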
2025-09-21 15:42:36.214 | INFO 14404 scheduler.py:56 - current time: 2025-09-21 15:42:36.214860
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Model Filename                                            Arch  Output Stems (SDR)                   Friendly Name
---------------------------------------------------------------------------------------------------------------------------------------------------------------
MDX23C-8KFFT-InstVoc_HQ.ckpt                              MDXC  instrumental (15.8), vocals (10.6)   MDX23C Model: MDX23C-InstVoc HQ
MDX23C-8KFFT-InstVoc_HQ_2.ckpt                            MDXC  instrumental (15.9), vocals (10.5)   MDX23C Model VIP: MDX23C-InstVoc HQ 2
MDX23C_D1581.ckpt                                         MDXC  instrumental (15.5), vocals (10.0)   MDX23C Model VIP: MDX23C_D1581
mel_band_roformer_karaoke_aufr33_viperx_sdr_10.1956.ckpt  MDXC  instrumental (14.7), vocals* (8.4)   Roformer Model: Mel-Roformer-Karaoke-Aufr33-Viperx
mel_band_roformer_kim_ft_unwa.ckpt                        MDXC  other, vocals* (12.4)                Roformer Model: MelBand Roformer Kim | FT by unwa
melband_roformer_big_beta4.ckpt                           MDXC  other, vocals* (12.5)                Roformer Model: MelBand Roformer Kim | Big Beta 4 FT by unwa
melband_roformer_big_beta5e.ckpt                          MDXC  other, vocals* (12.4)                Roformer Model: MelBand Roformer Kim | Big Beta 5e FT by unwa
melband_roformer_inst_v1.ckpt                             MDXC  instrumental* (15.9), vocals (9.8)   Roformer Model: MelBand Roformer Kim | Inst V1 by Unwa
melband_roformer_inst_v1e.ckpt                            MDXC  instrumental* (15.8), vocals (9.6)   Roformer Model: MelBand Roformer Kim | Inst V1 (E) by Unwa
melband_roformer_inst_v2.ckpt                             MDXC  instrumental* (16.1), vocals (10.3)  Roformer Model: MelBand Roformer Kim | Inst V2 by Unwa
melband_roformer_instvoc_duality_v1.ckpt                  MDXC  instrumental (16.1), vocals (11.0)   Roformer Model: MelBand Roformer Kim | InstVoc Duality V1 by Unwa
melband_roformer_instvox_duality_v2.ckpt                  MDXC  instrumental (16.1), vocals (11.0)   Roformer Model: MelBand Roformer Kim | InstVoc Duality V2 by Unwa
MelBandRoformerBigSYHFTV1.ckpt                            MDXC  other, vocals* (12.3)                Roformer Model: MelBand Roformer Kim | Big SYHFT V1 by SYH99999
MelBandRoformerSYHFT.ckpt                                 MDXC  other, vocals* (8.0)                 Roformer Model: MelBand Roformer Kim | SYHFT by SYH99999
MelBandRoformerSYHFTV2.5.ckpt                             MDXC  other, vocals* (8.5)                 Roformer Model: MelBand Roformer Kim | SYHFT V2.5 by SYH99999
MelBandRoformerSYHFTV2.ckpt                               MDXC  other, vocals* (8.6)                 Roformer Model: MelBand Roformer Kim | SYHFT V2 by SYH99999
MelBandRoformerSYHFTV3Epsilon.ckpt                        MDXC  other, vocals* (9.5)                 Roformer Model: MelBand Roformer Kim | SYHFT V3 by SYH99999
model_bs_roformer_ep_317_sdr_12.9755.ckpt                 MDXC  instrumental (16.5), vocals* (11.8)  Roformer Model: BS-Roformer-Viperx-1297
model_bs_roformer_ep_368_sdr_12.9628.ckpt                 MDXC  instrumental (16.3), vocals* (12.1)  Roformer Model: BS-Roformer-Viperx-1296
model_mel_band_roformer_ep_3005_sdr_11.4360.ckpt          MDXC  instrumental (15.1), vocals* (10.5)  Roformer Model: Mel-Roformer-Viperx-1143
vocals_mel_band_roformer.ckpt                             MDXC  other, vocals* (12.6)                Roformer Model: MelBand Roformer | Vocals by Kimberley Jensen
...

Output
webapp/training/sovits/test/test.list

Logs
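The log lines in this section show that segments shorter than 1000 ms are skipped before the list file is written. A minimal sketch of such a filter follows; the function name `filter_segments` is hypothetical, and treating exactly 1000 ms as kept is an assumption based on the `<` comparison in the log messages.

```python
MIN_DURATION_MS = 1000  # threshold reported in the log messages


def filter_segments(segments, min_ms=MIN_DURATION_MS):
    """Split segments into (kept, skipped) by their duration in milliseconds."""
    kept, skipped = [], []
    for seg in segments:
        # A segment is skipped when duration < min_ms, mirroring the log wording.
        (kept if seg["duration"] >= min_ms else skipped).append(seg)
    return kept, skipped


segments = [
    {"idx": 0, "duration": 860},
    {"idx": 1, "duration": 975},
    {"idx": 4, "duration": 2526},
]
kept, skipped = filter_segments(segments)
print("kept:", [s["idx"] for s in kept], "skipped:", [s["idx"] for s in skipped])
```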
2025-09-21 07:54:23.510 | INFO 137210011969088 <frozen src.gradio.pages.gmt_>:262 - 跳过 idx=0 (时长 860ms < 最小值 1000ms)
2025-09-21 07:54:23.510 | INFO 137210011969088 <frozen src.gradio.pages.gmt_>:262 - 跳过 idx=1 (时长 975ms < 最小值 1000ms)
2025-09-21 07:54:23.602 | INFO 137210011969088 <frozen src.gradio.tools.toast>:29 - 总片段: 16, 保留: 14, 保存至: webapp/training/sovits/test/test.list (duration=3s)

2. Dataset Formatting

Format
list file content format
path | spk | lang | text

Path Format
f"{root}/{name}/speaker/spk_{spk}/{emotion.value}_{emotion.name.lower()}/【{emotion.value}_{emotion.name.lower()}_{idx}】{text}"
Example values for the placeholders, in order:
root = webapp/training/sovits
name = test
spk = 0
emotion.value = 难过
emotion.name.lower() = sad
idx = 4
text = 曾经有一份真诚的爱情放在我面前

Input
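Each input line combines the path format with the `path|spk|lang|text` layout. The sketch below shows one way such a line could be assembled and parsed back. It is an illustration under assumptions: the `Emotion` enum is hypothetical (only a "sad" member with the value 难过 is visible in the examples), the `.wav` suffix is taken from the example lines rather than from the format string, and both function names are made up for this example.

```python
from enum import Enum


class Emotion(Enum):
    # Hypothetical enum; only this member can be inferred from the examples.
    SAD = "难过"


def build_list_line(root, name, spk, emotion, idx, text, lang):
    """Assemble one 'path|spk|lang|text' line following the path format."""
    label = f"{emotion.value}_{emotion.name.lower()}"
    path = f"{root}/{name}/speaker/spk_{spk}/{label}/【{label}_{idx}】{text}.wav"
    return f"{path}|{spk}|{lang}|{text}"


def parse_list_line(line):
    """Split a list line into its four fields; text may itself contain '|'."""
    path, spk, lang, text = line.rstrip("\n").split("|", 3)
    return {"path": path, "spk": int(spk), "lang": lang, "text": text}


line = build_list_line(
    "webapp/training/sovits", "test", 0, Emotion.SAD, 5, "我没有珍惜，", "zh"
)
print(line)
print(parse_list_line(line))
```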
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_4】曾经有一份真诚的爱情放在我面前，.wav|0|zh|曾经有一份真诚的爱情放在我面前，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_5】我没有珍惜，.wav|0|zh|我没有珍惜，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_6】等我失去的时候，.wav|0|zh|等我失去的时候，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_7】我才后悔莫及。.wav|0|zh|我才后悔莫及。
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_8】人世间最痛苦的事莫过于此，.wav|0|zh|人世间最痛苦的事莫过于此，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_9】你的剑在我的咽喉上割下去吧，.wav|0|zh|你的剑在我的咽喉上割下去吧，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_10】不用再犹豫了。.wav|0|zh|不用再犹豫了。
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_11】如果上天能够给我一个再来一次的机会，.wav|0|zh|如果上天能够给我一个再来一次的机会，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_12】我会对那个女孩子说三个字，.wav|0|zh|我会对那个女孩子说三个字，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_13】我爱你，.wav|0|zh|我爱你，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_14】你不非要在这份爱上加个期限，.wav|0|zh|你不非要在这份爱上加个期限，
webapp/training/sovits/test/speaker/spk_0/难过_sad/【难过_sad_15】我希望是一万年那。.wav|0|zh|我希望是一万年那。

Output
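Each line of the text output in this section appears to hold four fields: the audio file name, the phoneme sequence, a list of per-character phoneme counts, and the normalized text. The sketch below checks that the counts add up to the number of phonemes. It assumes the fields are separated by whitespace as displayed here (the real file may use tabs), and the names `LINE_PATTERN` and `phoneme_counts_match` are hypothetical.

```python
import ast
import re

# name.wav <phonemes> [counts] <normalized text>
LINE_PATTERN = re.compile(
    r"^(?P<name>.+?\.wav)\s+(?P<phones>.+?)\s+(?P<counts>\[[\d,\s]+\])\s+(?P<text>.+)$"
)


def phoneme_counts_match(line):
    """Return True when the per-character counts sum to the phoneme total."""
    match = LINE_PATTERN.match(line)
    if match is None:
        raise ValueError("line does not look like a text output record")
    phones = match.group("phones").split()
    counts = ast.literal_eval(match.group("counts"))  # safe parse of "[2, 2, 1]"
    return sum(counts) == len(phones)


record = (
    "【难过_sad_5】我没有珍惜，.wav w o3 m ei2 y ou3 zh en1 x i1 , "
    "[2, 2, 2, 2, 2, 1] 我没有珍惜,"
)
print(phoneme_counts_match(record))
```

In the records shown here every Chinese character maps to two phonemes (initial and final) and each punctuation mark to one, which is why the count list ends in 1.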
【难过_sad_4】曾经有一份真诚的爱情放在我面前，.wav c eng2 j ing1 y ou3 y i2 f en4 zh en1 ch eng2 d e5 AA ai4 q ing2 f ang4 z ai4 w o3 m ian4 q ian2 , [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1] 曾经有一份真诚的爱情放在我面前,
【难过_sad_5】我没有珍惜，.wav w o3 m ei2 y ou3 zh en1 x i1 , [2, 2, 2, 2, 2, 1] 我没有珍惜,
【难过_sad_6】等我失去的时候，.wav d eng2 w o3 sh ir1 q v4 d e5 sh ir2 h ou5 , [2, 2, 2, 2, 2, 2, 2, 1] 等我失去的时候,
【难过_sad_7】我才后悔莫及。.wav w o3 c ai2 h ou4 h ui3 m o4 j i2 . [2, 2, 2, 2, 2, 2, 1] 我才后悔莫及.
【难过_sad_8】人世间最痛苦的事莫过于此，.wav r en2 sh ir4 j ian1 z ui4 t ong4 k u3 d e5 sh ir4 m o4 g uo5 y v2 c i03 , [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1] 人世间最痛苦的事莫过于此,
【难过_sad_9】你的剑在我的咽喉上割下去吧，.wav n i3 d e5 j ian4 z ai4 w o3 d e5 y En1 h ou2 sh ang4 g e1 x ia4 q v5 b a5 , [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1] 你的剑在我的咽喉上割下去吧,
【难过_sad_10】不用再犹豫了。.wav b u2 y ong4 z ai4 y ou2 y v4 l e5 . [2, 2, 2, 2, 2, 2, 1] 不用再犹豫了.
【难过_sad_11】如果上天能够给我一个再来一次的机会，.wav r u2 g uo3 sh ang4 t ian1 n eng2 g ou4 g ei2 w o3 y i2 g e5 z ai4 l ai2 y i2 c i04 d e5 j i1 h ui4 , [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1] 如果上天能够给我一个再来一次的机会,
【难过_sad_12】我会对那个女孩子说三个字，.wav w o3 h ui4 d ui4 n a4 g e5 n v3 h ai2 z i05 sh uo1 s an1 g e5 z i04 , [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1] 我会对那个女孩子说三个字,
【难过_sad_13】我爱你，.wav w o3 AA ai4 n i3 , [2, 2, 2, 1] 我爱你,
【难过_sad_14】你不非要在这份爱上加个期限，.wav n i3 b u4 f ei1 y ao4 z ai4 zh e4 f en4 AA ai4 sh ang4 j ia1 g e5 q i1 x ian4 , [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1] 你不非要在这份爱上加个期限,
【难过_sad_15】我希望是一万年那。.wav w o3 x i1 w ang4 sh ir4 y i2 w an4 n ian2 n a4 . [2, 2, 2, 2, 2, 2, 2, 2, 1] 我希望是一万年那.

【难过_sad_4】曾经有一份真诚的爱情放在我面前，.wav 181 200 185 1005 651 651 752 696 208 334 200 382 99 612 1001 338 334 679 334 1001 837 160 1005 185 185 420 1001 844 433 200 232 625 341 758 758 200 1005 200 344 420 1005 1005 837 96 763 438 435 312 121 297 134 377 637 688 369 70 877 828 129 928 320 902 159 62 297 341 396 861 354 946 956 403 936 869 568 176 948 417 439 380 595 419 150 705 407 425 142 67 611 984 766 651 822 723 912 449 737 940 148 811 525 938 424 921 161 45 771 80
【难过_sad_5】我没有珍惜，.wav 930 263 774 1001 273 754 273 263 23 420 185 1001 1001 334 338 334 578 263 565 656 797 593 593 593 1002 16 451 382 582 239 612 593 804 535 914 681 660 996 684 901 636 413 714 686 425 590 831 946 288 1003 705 300 963 544 612 232
【难过_sad_6】等我失去的时候，.wav 103 451 334 451 208 1005 200 200 263 417 417 263 273 953 797 417 432 365 565 647 232 844 576 428 576 334 263 425 321 941 951 18 577 519 145 438 911 600 416 299 134 1003 603 906 200 176 691 703 1003 588 422 663 134 434 909 651 774
【难过_sad_7】我才后悔莫及。.wav 1005 365 797 382 263 1005 1005 338 221 205 576 221 774 565 200 647 328 282 1005 3 221 328 200 76 332 107 544 547 198 764 715 764 42 233 449 625 256 79 846 717 1001 862 631 660 873 86 485 773 705 196 1003 671 637 529
【难过_sad_8】人世间最痛苦的事莫过于此，.wav 857 876 797 797 334 420 663 663 263 451 263 1001 774 338 283 1001 1001 205 134 544 278 406 340 224 817 718 135 168 247 243 57 995 995 27 663 576 273 638 200 1005 578 437 323 232 582 282 282 282 774 1001 221 821 417 844 49 691 854 997 807 166 438 438 246 129 179 124 1 491 299 221 976 491 690 136 213 339 237 1018 119 152 805 185 560 245 805 500 590 617 686 166 514 1003 1003 325 689 935 1001 16 99 263 797 263 334 451 208 277 27 474 56 370 233 830 318 431 713 655 21 120 632 632 548 577 150 197 197 178 396 651 401 696
【难过_sad_9】你的剑在我的咽喉上割下去吧，.wav 185 565 679 277 797 656 612 1005 1005 578 283 221 199 221 199 221 199 199 1005 221 22 420 774 774 334 205 876 876 200 1005 647 282 1001 578 985 647 334 1005 656 334 334 365 185 299 591 456 871 529 76 943 300 911 742 416 377 443 471 889 889 917 949 917 240 817 205 232 90 885 671 322 188 757 941 466 1002 393 877 434 443 157 201 983 420 248 294 873 325 513 381 617 530 530 640 860 204 24 1 209 452 922 545 80 762 533 787 239 134 764 936 16 365
【难过_sad_10】不用再犹豫了。.wav 338 910 382 565 578 804 1001 1005 774 656 99 656 651 1005 205 283 273 982 328 496 941 655 335 614 370 825 577 645 659 595 811 179 488 47 757 1015 632 148 148 548 627 173 617 649 477 953 338
【难过_sad_11】如果上天能够给我一个再来一次的机会，.wav 428 647 647 338 774 936 936 936 774 581 858 997 581 263 581 647 221 10 420 1001 656 200 334 334 876 221 858 576 221 428 428 764 535 1001 185 638 997 997 936 420 420 1001 221 1001 232 696 760 774 541 451 200 625 23 200 420 656 656 1001 953 581 365 953 930 1005 208 663 612 2 663 752 248 857 674 979 215 244 188 920 4 650 332 661 790 27 33 769 669 660 590 778 577 119 1015 530 340 20 331 608 424 71 813 994 637 232 430 905 13 150 131 794 813 564 472 201 828 525 906 76 763 196 992 825 564 999 921 763 685 523 120 894 185 248 663 969 731 748 930
【难过_sad_12】我会对那个女孩子说三个字，.wav 844 263 844 656 14 23 23 420 208 953 257 451 208 647 576 474 804 647 578 99 184 738 729 63 184 243 750 663 656 797 797 365 365 790 71 168 134 764 281 1011 490 728 36 669 341 759 422 367 129 294 935 393 854 184 312 747 595 956 364 925 986 666 166 437 421 178 814 150 336 664 734 362 526 519 422 760 171 197 396 568 961
【难过_sad_13】我爱你，.wav 799 185 663 612 623 582 1005 582 1005 200 22 656 582 23 23 722 23 1005 1005 774 581 647 647 263 774 715 718 517 993 517 764 437 10 200 593 612 221 903 257 366 394 226 226 186 821 936 417 156 936 534 581 263 263 185 283 221 1005 365 263 647 221 876 258 221 774 200 985 221 837 3 631 987 893 47 923 623 156 202 733 605 139 404 738 535 221 179
【难过_sad_14】你不非要在这份爱上加个期限，.wav 634 504 504 59 752 256 804 578 334 804 451 474 578 876 797 876 451 451 283 774 282 221 411 578 221 876 535 282 341 1001 565 647 754 221 581 582 774 936 821 428 534 997 997 576 501 834 997 181 556 556 834 821 534 534 534 958 958 997 534 534 534 534 847 731 312 821 437 117 442 422 581 925 212 127 132 802 483 510 563 533 773 455 322 500 645 483 1017 688 846 173 982 547 595 172 514 10 276 257 773 745 453 924 564 991 740 994 539 161 705 300 559 518 994 544 745 963 688 906 627 418
【难过_sad_15】我希望是一万年那。.wav 641 844 930 930 641 593 930 338 944 184 406 406 406 406 718 406 406 406 406 406 995 57 57 581 625 581 625 844 625 593 625 541 582 696 837 565 200 200 844 221 752 752 581 232 876 232 221 754 754 227 625 754 629 456 22 844 997 844 844 203 404 203 844 641 930 641 404 641 181 404 404 203 930 638 181 997 638 638 232 752 930 181 918 997 844 638 997 844 844 581 844 844 844 565 232 565 185 893 886 187 161 705 567 637 284 535 275 206 508 534 729 475 963 745 25 445 552 576 576 961 428 936 576 961 918 997 581 581 956 470 665 282 779 563 649 184 203 526 805 961 569 253 949 127 127 205 576 961 876 997 876 961 918 560 420 625 334 754 432 334 334 263 844 582 263 930 581 52 567 738 221 576 581 997 534 997 774 576 753 128 709 709

Logs
2025-09-21 06:38:47.451 | INFO 137210011969088 <frozen src.gradio.tools.toast>:29 - ๆๆฌๅ่ฏไธ็นๅพๆๅ (duration=3s)
"/usr/bin/python3" -s src/support/third_party/gptsovits/GPT_SoVITS/prepare_datasets/1-get-text.py
ใ้พ่ฟ_sad_4ใๆพ็ปๆไธไปฝ็่ฏ็็ฑๆ
ๆพๅจๆ้ขๅ๏ผ.wav
ใ้พ่ฟ_sad_5ใๆๆฒกๆ็ๆ๏ผ.wav
ใ้พ่ฟ_sad_6ใ็ญๆๅคฑๅป็ๆถๅ๏ผ.wav
ใ้พ่ฟ_sad_7ใๆๆๅๆ่ซๅใ.wav
ใ้พ่ฟ_sad_8ใไบบไธ้ดๆ็่ฆ็ไบ่ซ่ฟไบๆญค๏ผ.wav
ใ้พ่ฟ_sad_9ใไฝ ็ๅๅจๆ็ๅฝๅไธๅฒไธๅปๅง๏ผ.wav
ใ้พ่ฟ_sad_10ใไธ็จๅ็น่ฑซไบใ.wav
ใ้พ่ฟ_sad_11ใๅฆๆไธๅคฉ่ฝๅค็ปๆไธไธชๅๆฅไธๆฌก็ๆบไผ๏ผ.wav
ใ้พ่ฟ_sad_12ใๆไผๅฏน้ฃไธชๅฅณๅญฉๅญ่ฏดไธไธชๅญ๏ผ.wav
ใ้พ่ฟ_sad_13ใๆ็ฑไฝ ๏ผ.wav
ใ้พ่ฟ_sad_14ใไฝ ไธ้่ฆๅจ่ฟไปฝ็ฑไธๅ ไธชๆ้๏ผ.wav
ใ้พ่ฟ_sad_15ใๆๅธๆๆฏไธไธๅนด้ฃใ.wav
2025-09-21 06:39:06.815 | INFO 137210011969088 <frozen src.gradio.tools.toast>:29 - ่ฏญ้ณ่ช็็ฃ็นๅพๆๅ (duration=3s)
"/usr/bin/python3" -s src/support/third_party/gptsovits/GPT_SoVITS/prepare_datasets/2-get-hubert-wav32k.py
"/usr/bin/python3" -s src/support/third_party/gptsovits/GPT_SoVITS/prepare_datasets/2-get-sv.py
2025-09-21 06:39:26.681 | INFO 137210011969088 <frozen src.gradio.tools.toast>:29 - ่ฏญไนTokenๆๅ (duration=3s)
"/usr/bin/python3" -s src/support/third_party/gptsovits/GPT_SoVITS/prepare_datasets/3-get-semantic.py
2025-09-21 06:39:36.800 | INFO 137210011969088 <frozen src.gradio.tools.toast>:29 - ๆฐๆฎ้ๆ ผๅผๅๆๅ (duration=3s)3. Fine-tuning Training โ

Input
webapp/training/sovits/test/test.list
Output
Model training version = v2Pro
webapp/training/sovits/weights/GPT_weights_{Model training version} -> webapp/training/sovits/weights/GPT_weights_v2Pro
webapp/training/sovits/weights/SoVITS_weights_{Model training version} -> webapp/training/sovits/weights/SoVITS_weights_v2Pro
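The mapping from the model training version to the two weight directories can be written as a small helper. This is a sketch that only mirrors the two path templates above; the function name `weight_dirs` and the constant `WEIGHTS_ROOT` are hypothetical and not taken from the application code.

```python
from pathlib import PurePosixPath

# Root taken from the output paths shown above.
WEIGHTS_ROOT = PurePosixPath("webapp/training/sovits/weights")


def weight_dirs(version):
    """Return the GPT and SoVITS weight directories for a training version."""
    return {
        "gpt": WEIGHTS_ROOT / f"GPT_weights_{version}",
        "sovits": WEIGHTS_ROOT / f"SoVITS_weights_{version}",
    }


dirs = weight_dirs("v2Pro")
print(dirs["gpt"])
print(dirs["sovits"])
```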
Logs
GPT
2025-09-21 06:21:59.296 | INFO 136692680197696 <frozen src.gradio.tools.toast>:29 - GPT 模型训练开始 (duration=3s)
"/usr/bin/python3" -s src/support/third_party/gptsovits/GPT_SoVITS/s1_train.py --config_file "webapp/training/sovits/temp/tmp_s1.yaml"
2025-09-21 06:22:04.108 | INFO 136691218114112 <frozen src.router.scheduler>:56 - current time: 2025-09-21 06:22:04.108651
<All keys matched successfully>
ckpt_path: None
semantic_data_len: 14
phoneme_data_len: 14
item_name semantic_audio
1 【难过_sad_4】曾经有一份真诚的爱情放在我面前，.wav 181 200 185 1005 651 651 752 696 208 334 200 3...
2 【难过_sad_5】我没有珍惜，.wav 930 263 774 1001 273 754 273 263 23 420 185 10...
3 【难过_sad_6】等我失去的时候，.wav 103 451 334 451 208 1005 200 200 263 417 417 2...
4 【难过_sad_7】我才后悔莫及。.wav 1005 365 797 382 263 1005 1005 338 221 205 576...
5 【难过_sad_8】人世间最痛苦的事莫过于此，.wav 857 876 797 797 334 420 663 663 263 451 263 10...
6 【难过_sad_9】你的剑在我的咽喉上割下去吧，.wav 185 565 679 277 797 656 612 1005 1005 578 283 ...
7 【难过_sad_10】不用再犹豫了。.wav 338 910 382 565 578 804 1001 1005 774 656 99 6...
8 【难过_sad_11】如果上天能够给我一个再来一次的机会，.wav 428 647 647 338 774 936 936 936 774 581 858 99...
9 【难过_sad_12】我会对那个女孩子说三个字，.wav 844 263 844 656 14 23 23 420 208 953 257 451 2...
10 【难过_sad_13】我爱你，.wav 799 185 663 612 623 582 1005 582 1005 200 22 6...
11 【难过_sad_14】你不非要在这份爱上加个期限，.wav 634 504 504 59 752 256 804 578 334 804 451 474...
12 【难过_sad_15】我希望是一万年那。.wav 641 844 930 930 641 593 930 338 944 184 406 40...
...
dataset.__len__(): 96
Epoch 14: 100%|█| 7/7 [00:04<00:00, 1.45it/s, v_num=0, total_loss_step=1.77e+3,
2025-09-21 06:22:41.355 | INFO 136692680197696 <frozen src.gradio.tools.toast>:29 - GPT 模型训练结束 (duration=3s)

SoVITS
2025-09-21 06:25:39.005 | INFO 136692680197696 <frozen src.gradio.tools.toast>:29 - SoVITS 模型训练开始 (duration=3s)
"/usr/bin/python3" -s src/support/third_party/gptsovits/GPT_SoVITS/s2_train.py --config "webapp/training/sovits/temp/tmp_s2.json"
phoneme_data_len: 14
wav_data_len: 98
100%|████████████████████████████████████████| 98/98 [00:00<00:00, 88358.08it/s]
skipped_phone: 0 , skipped_dur: 0
total left: 98
2025-09-21 06:26:04.112 | INFO 136691218114112 <frozen src.router.scheduler>:56 - current time: 2025-09-21 06:26:04.112067
loaded pretrained src/support/third_party/gptsovits/GPT_SoVITS/pretrained_models/v2Pro/s2Gv2ProPlus.pth <All keys matched successfully>
loaded pretrained src/support/third_party/gptsovits/GPT_SoVITS/pretrained_models/v2Pro/s2Gv2ProPlus.pth <All keys matched successfully>
loaded pretrained src/support/third_party/gptsovits/GPT_SoVITS/pretrained_models/v2Pro/s2Dv2ProPlus.pth <All keys matched successfully>
start training from epoch 1
 12%|█████▋                                       | 1/8 [00:55<06:30, 55.73s/it]
100%|█████████████████████████████████████████████| 8/8 [01:02<00:00, 7.77s/it]
100%|█████████████████████████████████████████████| 8/8 [01:02<00:00, 7.77s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00, 1.66s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00, 1.66s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00, 1.64s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00, 1.64s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00, 1.66s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00, 1.66s/it]
100%|█████████████████████████████████████████████| 8/8 [00:10<00:00, 1.26s/it]
100%|█████████████████████████████████████████████| 8/8 [00:18<00:00, 2.30s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00, 1.65s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00, 1.65s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00, 1.67s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00, 1.67s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00, 1.67s/it]
100%|█████████████████████████████████████████████| 8/8 [00:13<00:00, 1.67s/it]
...
training done
2025-09-21 06:28:55.882 | INFO 136692680197696 <frozen src.gradio.tools.toast>:29 - SoVITS 模型训练成功 (duration=3s)

Inference Usage
How to Use Fine-tuned Models?
1. Configuration Instructions
{
  "v1": [],
  "v2": [],
  "v2Pro": [
    {
      "name": "至尊宝",
      "gender": "Male",
      "locale": "zh-CN",
      "model": { // Fine-tuned model directory (required, absolute path)
        "默认": {
          "gpt": "至尊宝-e15.ckpt",
          "vits": "至尊宝_e4_s32.pth"
        }
      },
      "ref": { // Main reference audio (corresponds to voice library ID)
        "默认": 10000000
      },
      "aux": [ // Auxiliary reference audio list (optional, absolute path)
        "aux_ref_audio_path1.wav",
        "aux_ref_audio_path2.wav",
        "aux_ref_audio_path3.wav"
      ]
    }
  ],
  "v2ProPlus": [],
  "v3": [],
  "v4": []
}

How to control emotions?
There are currently two methods, with different configurations and effects; choose the one that fits your needs. If no emotion is specified, the default is used. Taking "Happy" as an example:
- Generic: the character's training data includes, but is not limited to, "Happy" samples; this is the "Default" model.
- Custom: the character's training data contains only "Happy" samples.
- Unspecified: the default is used.
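One way to read these three cases is as a lookup with a fallback to the default entry. The sketch below illustrates that reading only; the function name `resolve_style` is hypothetical, and the fallback behaviour (use "Default" when an emotion has no dedicated model or reference audio) is an assumption about the application, not something the configuration itself guarantees.

```python
def resolve_style(voice, emotion=None, default_key="Default"):
    """Pick the model files and reference audio ID for a requested emotion."""
    models = voice["model"]
    refs = voice["ref"]
    # Custom: a dedicated model exists for the emotion. Otherwise fall back.
    model_key = emotion if emotion in models else default_key
    # The reference audio can be emotion-specific even with the generic model.
    ref_key = emotion if emotion in refs else default_key
    return models[model_key], refs[ref_key]


voice = {
    "name": "xxx",
    "model": {
        "Default": {"gpt": "xxx-e15.ckpt", "vits": "xxx_e4_s32.pth"},
        "Happy": {"gpt": "xxx_happy-e15.ckpt", "vits": "xxx_happy_e4_s32.pth"},
    },
    "ref": {"Default": 10000000, "Happy": 10000001, "Angry": 10000003},
}

print(resolve_style(voice, "Happy"))  # custom model with its own reference
print(resolve_style(voice, "Angry"))  # generic model with an emotion reference
print(resolve_style(voice))           # unspecified: default model and reference
```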
{
  "name": "xxx",
  "gender": "Male",
  "locale": "zh-CN",
  "model": {
    "Default": {
      "gpt": "xxx-e15.ckpt",
      "vits": "xxx_e4_s32.pth"
    },
    "Happy": {
      "gpt": "xxx_happy-e15.ckpt",
      "vits": "xxx_happy_e4_s32.pth"
    },
    "Sad": {
      "gpt": "xxx_sad-e15.ckpt",
      "vits": "xxx_sad_e4_s32.pth"
    }
  },
  "ref": {
    "Default": 10000000,
    "Happy": 10000001,
    "Sad": 10000002,
    "Angry": 10000003
  }
}

2. Page Selection
If no model is configured, the speaking style list has no "Default" option and the voice list has no "xxx" option. In that case, selecting any other voice uses the base model for voice cloning during inference.
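The relationship between the configuration and the options shown on the page can be sketched as follows. This assumes the top-level layout from the Configuration Instructions (one list of voices per model version); the function name `voice_options` is hypothetical and the real page logic may differ.

```python
def voice_options(config, version):
    """Map each configured voice name to its available speaking styles."""
    return {
        voice["name"]: list(voice.get("model", {}))
        for voice in config.get(version, [])
    }


config = {
    "v1": [],
    "v2Pro": [
        {
            "name": "xxx",
            "model": {
                "Default": {"gpt": "xxx-e15.ckpt", "vits": "xxx_e4_s32.pth"},
                "Happy": {"gpt": "xxx_happy-e15.ckpt", "vits": "xxx_happy_e4_s32.pth"},
            },
            "ref": {"Default": 10000000, "Happy": 10000001},
        }
    ],
}

print(voice_options(config, "v2Pro"))  # fine-tuned voice with its styles
print(voice_options(config, "v1"))     # empty: only the base model is available
```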

3. Logs
2025-09-17 14:34:31.336 | INFO 14272 gptsovits_tts.py:72 - First load, please wait...
2025-09-17 14:34:31.338 | INFO 14272 gptsovits_tts.py:78 - Loading Tts model v2Pro_至尊宝_默认 on device cpu
2025-09-17 14:34:31.348 | INFO 14272 TTS.py:589 - Loading Text2Semantic weights from D:/pretrained_models/s1v3.ckpt
2025-09-17 14:34:32.920 | INFO 14272 TTS.py:651 - loading vocoder
2025-09-17 14:34:32.987 | INFO 14272 TTS.py:556 - Loading VITS weights from D:/pretrained_models/v2Pro/s2Gv2Pro.pth. <All keys matched successfully>
2025-09-17 14:34:33.006 | INFO 14272 TTS.py:479 - Loading BERT weights from D:/pretrained_models/chinese-roberta-wwm-ext-large
2025-09-17 14:34:33.478 | INFO 14272 TTS.py:471 - Loading CNHuBERT weights from D:/pretrained_models/chinese-hubert-base
<!-- Custom model will print the following -->
2025-09-17 14:41:05.508 | INFO 2880 TTS.py:589 - Loading Text2Semantic weights from webapp/training/sovits/weights/GPT_weights_v2Pro/至尊宝-e15.ckpt
2025-09-17 14:41:06.913 | INFO 2880 TTS.py:560 - Loading VITS pretrained weights from webapp/training/sovits/weights/SoVITS_weights_v2Pro/至尊宝_e4_s32.pth. <All keys matched successfully>
2025-09-17 14:41:07.097 | INFO 2880 TTS.py:571 - Loading LoRA weights from webapp/training/sovits/weights/SoVITS_weights_v2Pro/至尊宝_e4_s32.pth. _IncompatibleKeys(missing_keys=['enc_p.ssl_proj.weight', 'enc_p.ssl_proj.bias', ....])
2025-09-17 14:34:33.648 | INFO 14272 cache.py:67 - {'v2Pro': {'last_used': '2025-09-17 14:34:33', 'usage': 1},'v2Pro_至尊宝_默认': {'last_used': '2025-09-17 14:36:25', 'usage': 1}}
2025-09-17 14:34:33.649 | INFO 14272 TTS.py:197 - Set seed to 99578076
2025-09-17 14:34:33.652 | INFO 14272 TTS.py:1046 - Parallel Inference Mode Enabled
2025-09-17 14:34:34.676 | INFO 14272 TTS.py:1118 - Actual Input Reference Text:
2025-09-17 14:34:35.826 | INFO 14272 TextPreprocessor.py:61 - ############ Segment Text ############
2025-09-17 14:34:35.827 | INFO 14272 TextPreprocessor.py:84 - Actual Input Target Text:
2025-09-17 14:34:35.827 | INFO 14272 TextPreprocessor.py:85 - CreatorBox，为创作者而生，提升创作效率，释放创作潜力.
2025-09-17 14:34:35.828 | INFO 14272 TextPreprocessor.py:114 - Actual Input Target Text (after sentence segmentation):
2025-09-17 14:34:35.829 | INFO 14272 TextPreprocessor.py:115 - ['CreatorBox，', '为创作者而生，', '提升创作效率，', '释放创作潜力.']
2025-09-17 14:34:35.829 | INFO 14272 TextPreprocessor.py:65 - ############ Extract Text BERT Features ############
<!-- Parallel processing, omitted -->
2025-09-17 14:34:40.471 | INFO 14272 TTS.py:1187 - ############ Inference ############
2025-09-17 14:34:40.471 | INFO 14272 TTS.py:1209 - Processed text from the frontend (per sentence):
2025-09-17 14:34:40.471 | INFO 14272 TTS.py:1217 - ############ Predict Semantic Token ############
2%|████ | 29/1500 [00:00<00:19, 75.04it/s]T2S Decoding EOS [141 -> 176]
2%|█████ | 34/1500 [00:00<00:21, 68.26it/s]
<!-- ..... -->
2025-09-17 14:35:50.402 | INFO 14272 TTS.py:1258 - ############ Synthesize Audio ############
2025-09-17 14:35:50.402 | INFO 14272 TTS.py:1305 - Parallel Synthesis in Progress...
2025-09-17 14:36:18.686 | INFO 14272 TTS.py:1342 - 2.173 4.645 2.194 96.021
2025-09-17 14:36:18.688 | INFO 14272 gptsovits_tts.py:195 - speech len 6.96, rtf 15.091852826633673
2025-09-17 14:36:18.864 | INFO 14272 response.py:52 - {"path":"webapp/tts/sovits_zh-CN_至尊宝_1.00_1.00_1.00_32_0.wav","duration":6.96,"seed":99578076}
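The last log lines report `speech len 6.96` and an `rtf` of about 15.09. Assuming `rtf` is the usual real-time factor (processing time divided by audio duration; the log itself does not define it), the two values can be related as in the sketch below. Both function names are hypothetical.

```python
def real_time_factor(processing_seconds, speech_seconds):
    """Processing time divided by the duration of the generated audio."""
    return processing_seconds / speech_seconds


def processing_time(speech_seconds, rtf):
    """Invert the real-time factor to recover the processing time."""
    return speech_seconds * rtf


# Values copied from the log lines above.
elapsed = processing_time(6.96, 15.091852826633673)
print(f"about {elapsed:.1f} s of processing for 6.96 s of speech")
```

Under that assumption, an rtf above 1 means synthesis is slower than playback, which is consistent with the model being loaded on the cpu device in this log.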