Model Training 🧠

Fine-tuning and inference on top of the GPT-SoVITS base model.

Use cases

If a 3-6 second voice sample does not meet your cloning needs, you can train a model to improve voice similarity and realism.

App preview

[Screenshot: gradio_app_gmt]

Fine-tuning

Prepare your dataset carefully; a good dataset is the foundation of a good model.

1. Prepare the training data

[Screenshot: gradio_app_gmt]

2. Format the dataset

[Screenshot: gradio_app_gmt]

3. Fine-tune the model

[Screenshot: gradio_app_gmt]

Inference

How do I use a fine-tuned model?

1. Configuration

webapp/data/sovits.json

json

{
  "v1": [],
  "v2": [],
  "v2Pro": [],
  "v2ProPlus": [],
  "v3": [],
  "v4": [
    {
      "name": "原神博士",
      "gender": "Male",
      "locale": "zh-CN",
      "model": {                // Fine-tuned model directory (required, absolute path)
        "默认": {
          "gpt": "原神博士-e10.ckpt",
          "vits": "原神博士_e10_s140_l32.pth"
        }
      },
      "ref": {                  // Primary reference audio (maps to a voice library id)
        "默认": 10000000,
        "高兴": 10000001,
        "难过": 10000002,
        "生气": 10000003
      },
      "aux": [                  // Auxiliary reference audio list (optional, absolute paths)
        "aux_ref_audio_path1.wav",
        "aux_ref_audio_path2.wav",
        "aux_ref_audio_path3.wav"
      ]
    }
  ]
}
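
To check which fine-tuned voices a config exposes, a short script can walk the file. This is a minimal sketch, assuming the real `sovits.json` is plain JSON (the `//` comments above are annotations only) and follows the layout shown. `list_voices` is a hypothetical helper for illustration, not part of the application.

```python
import json


def list_voices(config):
    """Return (version, voice name, model style keys) for every configured voice."""
    voices = []
    for version, entries in config.items():
        for entry in entries:
            styles = list(entry.get("model", {}).keys())
            voices.append((version, entry["name"], styles))
    return voices


# Inline sample mirroring the layout above. In practice you would read
# webapp/data/sovits.json with json.load instead (assumed to be plain JSON).
sample = json.loads("""
{
  "v3": [],
  "v4": [
    {
      "name": "原神博士",
      "model": {
        "默认": {"gpt": "原神博士-e10.ckpt", "vits": "原神博士_e10_s140_l32.pth"}
      },
      "ref": {"默认": 10000000}
    }
  ]
}
""")

for version, name, styles in list_voices(sample):
    print(version, name, styles)
```

Versions with an empty list contribute no voices, which is consistent with a voice only appearing under the version it was trained for.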

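How do I control emotion?

There are currently two approaches. They are configured differently and produce different results, so choose the one that fits your needs. The style keys used in the config are 默认 (default), 高兴 (happy), 难过 (sad), and 生气 (angry). Taking 高兴 as an example:

General: the character's training data includes, but is not limited to, 高兴 samples
Dedicated: the character's training data contains only 高兴 samples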

webapp/data/sovits.json

json

{
    "name": "原神博士",
    "gender": "Male",
    "locale": "zh-CN",
    "model": {
        "默认": {
            "gpt": "原神博士-e10.ckpt",
            "vits": "原神博士_e10_s140_l32.pth"
        },
        "高兴": {
            "gpt": "原神博士_happy-e10.ckpt",
            "vits": "原神博士_happy_e10_s140_l32.pth"
        }
    },
    "ref": {
        "默认": 10000000,
        "高兴": 10000001,
        "难过": 10000002,
        "生气": 10000003
    }
}
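
One plausible reading of this config: a style that has its own entry under `model` uses that dedicated model, while a style listed only under `ref` reuses the 默认 model with a style-specific reference audio. The sketch below illustrates that reading only. The fallback rule and the `resolve_style` helper are assumptions, not the application's confirmed behavior.

```python
def resolve_style(entry, style, default="默认"):
    """Pick model weights and a reference audio id for a speaking style.

    Assumed rule (not confirmed by the application): use the dedicated
    model when one is configured for the style, otherwise fall back to
    the default model. The reference id is resolved the same way.
    """
    models = entry["model"]
    refs = entry["ref"]
    model = models.get(style, models[default])
    ref_id = refs.get(style, refs[default])
    return model, ref_id


# Same values as the config above.
entry = {
    "name": "原神博士",
    "model": {
        "默认": {"gpt": "原神博士-e10.ckpt", "vits": "原神博士_e10_s140_l32.pth"},
        "高兴": {"gpt": "原神博士_happy-e10.ckpt", "vits": "原神博士_happy_e10_s140_l32.pth"},
    },
    "ref": {"默认": 10000000, "高兴": 10000001, "难过": 10000002, "生气": 10000003},
}

for style in ("高兴", "难过"):
    model, ref_id = resolve_style(entry, style)
    print(style, model["gpt"], ref_id)
```

Under this assumption, 高兴 resolves to the dedicated weights, whereas 难过 keeps the default weights and changes only the reference audio.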

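2. Page selection

If no model is configured, the speaking-style list has no 默认 (default) option and the voice list has no 原神博士 option. If you select another voice and continue, inference uses the base model for voice cloning.

[Screenshot: gradio_app_gmt]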

3. Runtime log

log

2025-09-17 14:34:31.336 | INFO  14272 gptsovits_tts.py:72 - First load, please wait...
2025-09-17 14:34:31.338 | INFO  14272 gptsovits_tts.py:78 - Loading Tts model v4 on device cpu
2025-09-17 14:34:31.348 | INFO  14272 TTS.py:589 - Loading Text2Semantic weights from D:/pretrained_models/s1v3.ckpt
2025-09-17 14:34:32.920 | INFO  14272 TTS.py:651 - loading vocoder
2025-09-17 14:34:32.987 | INFO  14272 TTS.py:556 - Loading VITS weights from D:/pretrained_models/gsv-v4-pretrained/s2Gv4.pth. <All keys matched successfully>
2025-09-17 14:34:33.006 | INFO  14272 TTS.py:479 - Loading BERT weights from D:/pretrained_models/chinese-roberta-wwm-ext-large
2025-09-17 14:34:33.478 | INFO  14272 TTS.py:471 - Loading CNHuBERT weights from D:/pretrained_models/chinese-hubert-base
<!-- A custom model prints the following -->
2025-09-17 14:41:05.508 | INFO  2880  TTS.py:589 - Loading Text2Semantic weights from D:/models/creatorbox/gmt/v4/原神博士/原神博士-e10.ckpt
2025-09-17 14:41:06.913 | INFO  2880  TTS.py:560 - Loading VITS pretrained weights from D:/models/creatorbox/gmt/v4/原神博士/原神博士_e10_s140_l32.pth. <All keys matched successfully>
2025-09-17 14:41:07.097 | INFO  2880  TTS.py:571 - Loading LoRA weights from D:/models/creatorbox/gmt/v4/原神博士/原神博士_e10_s140_l32.pth. _IncompatibleKeys(missing_keys=['enc_p.ssl_proj.weight', 'enc_p.ssl_proj.bias', ....])
2025-09-17 14:34:33.648 | INFO  14272 cache.py:67 - {'v4': {'last_used': '2025-09-17 14:34:33', 'usage': 1},'v4-原神博士-默认': {'last_used': '2025-09-17 14:36:25', 'usage': 1}}
2025-09-17 14:34:33.649 | INFO  14272 TTS.py:197 - Set seed to 99578076
2025-09-17 14:34:33.652 | INFO  14272 TTS.py:1046 - Parallel Inference Mode Enabled
2025-09-17 14:34:33.652 | INFO  14272 TTS.py:1064 - When parallel inference mode is enabled, SoVITS V3/4 models do not support bucket processing; bucket processing has been automatically disabled.
2025-09-17 14:34:34.676 | INFO  14272 TTS.py:1118 - Actual Input Reference Text:
2025-09-17 14:34:35.826 | INFO  14272 TextPreprocessor.py:61 - ############ Segment Text ############
2025-09-17 14:34:35.827 | INFO  14272 TextPreprocessor.py:84 - Actual Input Target Text:
2025-09-17 14:34:35.827 | INFO  14272 TextPreprocessor.py:85 - CreatorBox，为创作者而生，提升创作效率，释放创作潜力.
2025-09-17 14:34:35.828 | INFO  14272 TextPreprocessor.py:114 - Actual Input Target Text (after sentence segmentation):
2025-09-17 14:34:35.829 | INFO  14272 TextPreprocessor.py:115 - ['CreatorBox，', '为创作者而生，', '提升创作效率，', '释放创作潜力.']
2025-09-17 14:34:35.829 | INFO  14272 TextPreprocessor.py:65 - ############ Extract Text BERT Features ############
<!-- Parallel processing, omitted -->
2025-09-17 14:34:40.471 | INFO  14272 TTS.py:1187 - ############ Inference ############
2025-09-17 14:34:40.471 | INFO  14272 TTS.py:1209 - Processed text from the frontend (per sentence):
2025-09-17 14:34:40.471 | INFO  14272 TTS.py:1217 - ############ Predict Semantic Token ############
  2%|████                                                                                                                                                                                                               | 29/1500 [00:00<00:19, 75.04it/s]T2S Decoding EOS [141 -> 176]
  2%|████▊                                                                                                                                                                                                              | 34/1500 [00:00<00:21, 68.26it/s] 
<!-- ..... -->
2025-09-17 14:35:50.402 | INFO  14272 TTS.py:1258 - ############ Synthesize Audio ############
2025-09-17 14:35:50.402 | INFO  14272 TTS.py:1305 - Parallel Synthesis in Progress...
2025-09-17 14:36:18.686 | INFO  14272 TTS.py:1342 - 2.173       4.645   2.194   96.021
2025-09-17 14:36:18.688 | INFO  14272 gptsovits_tts.py:195 - speech len 6.96, rtf 15.091852826633673
2025-09-17 14:36:18.864 | INFO  14272 response.py:52 - {"path":"webapp/tts/sovits_zh-CN_原神博士_1.00_1.00_1.00_32_0.wav","duration":6.96,"seed":99578076}
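
The `speech len ..., rtf ...` line summarizes performance. The sketch below pulls both numbers out of such a line, assuming `rtf` is the real-time factor (processing time divided by audio duration). That reading is an inference from this log, where the four timings on the preceding line sum to roughly 105 s for 6.96 s of audio. `parse_rtf` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Summary line copied from the log above.
LINE = (
    "2025-09-17 14:36:18.688 | INFO  14272 gptsovits_tts.py:195 - "
    "speech len 6.96, rtf 15.091852826633673"
)

PATTERN = re.compile(r"speech len (?P<length>[\d.]+), rtf (?P<rtf>[\d.]+)")


def parse_rtf(line: str) -> Optional[Tuple[float, float]]:
    """Return (audio seconds, rtf) from a summary log line, or None if absent."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return float(match["length"]), float(match["rtf"])


speech_len, rtf = parse_rtf(LINE)
# Assumption: rtf = processing time / audio duration, so multiply to recover time.
elapsed = speech_len * rtf
print(f"{speech_len:.2f} s of audio took about {elapsed:.0f} s to generate")
```

Under that assumption, an `rtf` above 1 means generation is slower than playback, which is plausible here because the log shows the model loaded on `cpu`.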
