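Speech Recognition 🎤

Accurately transcribes the speech of the speakers in a video. The settings can be adjusted for different devices and scenarios to ensure high-quality text output.

Debug Mode

After uploading a video, click Execute to start. If Debug Mode is enabled, execution stops after this step and the later steps do not run.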

log
2025-04-10 03:17:01.784 | INFO 8212 response.py:28 - {"task_id":"10b6a0826a6b4db280e5ff4dc00dcfbc"}
2025-04-10 03:17:01.786 | INFO 8212 cbutils.py:310 - File already exists. webapp/temp/test/test.mp4
2025-04-10 03:17:01.789 | INFO 8212 cbaudio.py:59 - Audio extracted and saved to: webapp/temp/test/test.wav duration 30.570666666666668s
2025-04-10 03:17:01.790 | INFO 8212 spleeter_.py:73 - Audio separate file already exists. (webapp/temp/test/stems/test_vocals.wav , webapp/temp/test/stems/test_vocals_bg.wav)
2025-04-10 03:17:05.936 | INFO 8212 whisper_.py:74 - Loading Whisper model base on device cpu
[2025-04-10 03:17:06.482] [ctranslate2] [thread 8212] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
2025-04-10 03:17:06.524 | INFO 8212 transcribe.py:839 - Processing audio with duration 00:30.571
2025-04-10 03:17:06.863 | INFO 8212 transcribe.py:906 - Detected language 'en' with probability 1.00
2025-04-10 03:17:10.076 | INFO 8212 whisper_.py:276 - 00 0 [0.26s -> 0.80s] 02  Hi, everyone.
2025-04-10 03:17:10.077 | INFO 8212 whisper_.py:276 - 01 0 [1.13s -> 3.70s] 08 You probably haven't come across these incredible products.
2025-04-10 03:17:10.077 | INFO 8212 whisper_.py:276 - 02 0 [4.24s -> 6.42s] 10  The majority of people aren't even aware of their existence.
2025-04-10 03:17:10.078 | INFO 8212 whisper_.py:276 - 03 0 [7.14s -> 10.46s] 12  Well, today I'm going to show you six amazing egg cooking gadgets.
2025-04-10 03:17:10.078 | INFO 8212 whisper_.py:276 - 04 0 [11.04s -> 14.38s] 12  This rolling egg organizer for eggs is really good for organizing eggs.
2025-04-10 03:17:10.079 | INFO 8212 whisper_.py:276 - 05 0 [15.04s -> 17.82s] 11  A narrow one will not waste the space of the refrigerator.
2025-04-10 03:17:10.079 | INFO 8212 whisper_.py:276 - 06 0 [18.52s -> 19.17s] 05  One can put 15 eggs.
2025-04-10 03:17:10.079 | INFO 8212 whisper_.py:276 - 07 0 [20.80s -> 23.34s] 10  The quality and workmanship of this shelf is very good,
2025-04-10 03:17:10.080 | INFO 8212 whisper_.py:276 - 08 0 [23.80s -> 25.28s] 07  placed on the side of the refrigerator.
2025-04-10 03:17:10.080 | INFO 8212 whisper_.py:276 - 09 0 [26.08s -> 28.52s] 13  Every time you pick up the top of the eggs will roll down,
2025-04-10 03:17:10.081 | INFO 8212 whisper_.py:276 - 10 0 [28.80s -> 29.88s] 02  especially convenient.
2025-04-10 03:17:10.081 | INFO 8212 whisper_.py:276 - 11 0 [30.34s -> 30.50s] 01  This...
2025-04-10 03:17:10.306 | INFO 8212 whisper_.py:280 - Original transcription: 
  Hi, everyone. You probably haven't come across these incredible products.  The majority of people aren't even aware of their existence.  Well, today I'm going to show you six amazing egg cooking gadgets.  This rolling egg organizer for eggs is really good for organizing eggs.  A narrow one will not waste the space of the refrigerator.  One can put 15 eggs.  The quality and workmanship of this shelf is very good,  placed on the side of the refrigerator.  Every time you pick up the top of the eggs will roll down,  especially convenient.  This...
2025-04-10 03:17:10.310 | INFO 8212 whisper_.py:285 - Transcription data complete and saved to: webapp/temp/test/test_001.json
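
To work with the segment lines in the log above programmatically, a small parser can help. The sketch below is not part of the tool: the field layout (segment index, spk, start and end time, a number that appears to be the word count, then the text) is inferred from the sample log, and the names `Segment` and `parse_segment_line` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout of a segment line, inferred from the sample log:
#   ... - <index> <spk> [<start>s -> <end>s] <word count> <text>
SEGMENT_RE = re.compile(
    r"-\s+(?P<index>\d+)\s+(?P<spk>\d+)\s+"
    r"\[(?P<start>\d+(?:\.\d+)?)s\s*->\s*(?P<end>\d+(?:\.\d+)?)s\]\s+"
    r"(?P<words>\d+)\s+(?P<text>.*)$"
)


class Segment(NamedTuple):
    index: int    # running segment number
    spk: int      # speaker id (0 when speaker recognition is off)
    start: float  # start time in seconds
    end: float    # end time in seconds
    words: int    # appears to be the number of words in the segment
    text: str     # recognized text


def parse_segment_line(line: str) -> Optional[Segment]:
    """Return a Segment for a segment log line, or None for any other line."""
    match = SEGMENT_RE.search(line)
    if match is None:
        return None
    return Segment(
        index=int(match["index"]),
        spk=int(match["spk"]),
        start=float(match["start"]),
        end=float(match["end"]),
        words=int(match["words"]),
        text=match["text"].strip(),
    )
```

Lines that do not describe a segment, such as the model loading message, return None and can simply be skipped.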

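View Content

Click the icon in the lower-right corner to view the current transcription.

Manual Editing

You can edit the text value of each property, then click the icon in the upper-right corner to save your changes.

Model Selection

Model Overview

  • Models are listed from smallest to largest; recognition accuracy generally increases with model size.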
json
[
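  "tiny.en",    // tiny English-only model, suited to low-resource environments
  "tiny",       // general-purpose tiny model, multilingual
  "base.en",    // base English-only model, suited to everyday speech recognition
  "base",       // general-purpose base model, multilingual
  "small.en",   // small English-only model, higher accuracy
  "small",      // small multilingual model, suited to more complex speech tasks
  "medium.en",  // medium English-only model, for scenarios that need higher accuracy
  "medium",     // medium multilingual model, suited to high-quality speech recognition
  "large-v1",   // large model v1, high accuracy
  "large-v2",   // large model v2, improved speed and accuracy
  "large-v3",   // large model v3, handles complex speech data
  "large",      // general-purpose large model, suited to most speech tasks
  "distil-small.en",    // distilled small English-only model, low compute requirements
  "distil-medium.en",   // distilled medium English-only model, low resource requirements
  "distil-large-v2",    // distilled large model v2, low compute requirements
  "distil-large-v3",    // distilled large model v3, keeps high accuracy with lower resource usage
  "large-v3-turbo",     // optimized large model, faster
  "turbo"               // efficient model, low latency and high throughput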
]

TIP

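  • For Chinese videos, choose at least the medium model
  • The .en and distil models can only be used for English videos
  • If a model download fails with an error, see the FAQ chapter
  • Compare the recognition results across models, taking your device performance and video quality into account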
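
The tips above can be expressed as a small filter over the model list. This is only an illustrative sketch: `models_for_language` and `is_english_only` are hypothetical helpers, not part of the tool, and the "at least medium" rule relies on the order of the list shown above.

```python
from typing import List

# Model names copied from the list above, in the same order.
MODELS = [
    "tiny.en", "tiny", "base.en", "base", "small.en", "small",
    "medium.en", "medium", "large-v1", "large-v2", "large-v3", "large",
    "distil-small.en", "distil-medium.en", "distil-large-v2",
    "distil-large-v3", "large-v3-turbo", "turbo",
]


def is_english_only(model: str) -> bool:
    """Per the tips: .en and distil models are for English videos only."""
    return model.endswith(".en") or model.startswith("distil-")


def models_for_language(language: str) -> List[str]:
    """Return the models that the tips allow for a language code.

    Assumption: "en" means English and "zh" means Chinese.
    """
    if language == "en":
        return list(MODELS)
    candidates = [m for m in MODELS if not is_english_only(m)]
    if language == "zh":
        # Tip: for Chinese videos, choose at least the medium model.
        candidates = candidates[candidates.index("medium"):]
    return candidates
```

For example, `models_for_language("zh")` starts at `medium` and leaves out every `.en` and `distil` model.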

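Configuration Options

Accuracy

Controls how deeply the search explores during generation. Larger values usually produce more accurate text.

Randomness

Controls the randomness of text generation. A lower temperature is more accurate; a higher temperature is more creative.

Coherence

Controls the length of text processed at a time, which affects generation efficiency and contextual coherence.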
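
As a rough illustration of how these three options might reach the recognizer, the sketch below turns them into keyword arguments. The mapping is an assumption, not something this page confirms: the log output suggests a faster-whisper backend, whose `transcribe` method accepts `beam_size`, `temperature`, and `chunk_length`, so Accuracy, Randomness, and Coherence are assumed to correspond to those. The function name and default values are hypothetical.

```python
from typing import Any, Dict


def build_transcribe_options(
    accuracy: int = 5,        # assumed to map to beam_size
    randomness: float = 0.0,  # assumed to map to temperature
    coherence: int = 30,      # assumed to map to chunk_length (seconds)
) -> Dict[str, Any]:
    """Validate the three UI options and return transcription keyword arguments."""
    if accuracy < 1:
        raise ValueError("accuracy must be at least 1")
    if not 0.0 <= randomness <= 1.0:
        raise ValueError("randomness must be between 0.0 and 1.0")
    if coherence < 1:
        raise ValueError("coherence must be at least 1")
    return {
        "beam_size": accuracy,
        "temperature": randomness,
        "chunk_length": coherence,
    }
```

If the backend is indeed faster-whisper, the result could be passed along as `model.transcribe(audio_path, **build_transcribe_options(accuracy=8))`.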

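Speakers

Enable this option to identify multiple speakers in a video.

Speaker Count

  • Use when the number of speakers is known. Default: 1

Speaker Range

  • Use when the number of speakers is unknown. Default: 16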
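
The difference between the two settings can be sketched as follows. This assumes a pyannote.audio style diarization pipeline, which accepts either `num_speakers` or a `min_speakers`/`max_speakers` pair; this page does not confirm how the tool passes these values, and `build_speaker_options` is a hypothetical helper that uses `None` to mean "number of speakers unknown".

```python
from typing import Dict, Optional


def build_speaker_options(
    enabled: bool,
    speaker_count: Optional[int] = None,  # known number of speakers, if any
    max_speakers: int = 16,               # upper bound when the count is unknown
) -> Dict[str, int]:
    """Return diarization keyword arguments for the chosen speaker settings."""
    if not enabled:
        # Speaker recognition is off: no diarization arguments at all.
        return {}
    if speaker_count is not None:
        if speaker_count < 1:
            raise ValueError("speaker_count must be at least 1")
        # The exact number is known, so pin it.
        return {"num_speakers": speaker_count}
    if max_speakers < 1:
        raise ValueError("max_speakers must be at least 1")
    # The number is unknown: let the pipeline search within a range.
    return {"min_speakers": 1, "max_speakers": max_speakers}
```

Pinning the count removes the speaker-counting guesswork, while the range lets the pipeline decide how many distinct voices it hears.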

Log Output

Once the options are configured correctly, run the task again. The log will show output like the following:

log
2025-04-10 03:59:34.929 | INFO 2576 whisper_.py:101 - Loading Speaker model on device cpu
segmentation         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
speaker_counting     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
embeddings           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:09
discrete_diarization ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

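By default spk=0. With speaker recognition enabled, the spk value in each segment changes to reflect the detected speaker.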

TIP

This step uses the HUGGINGFACEHUB_API_TOKEN configured as an environment variable.
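
Before enabling speaker recognition, it can be useful to check that the variable is actually set. A minimal sketch, assuming the token is read from the process environment; `get_hf_token` is a hypothetical helper, not a function of the tool:

```python
import os
from typing import Mapping

TOKEN_VAR = "HUGGINGFACEHUB_API_TOKEN"


def get_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the Hugging Face token from the environment, or raise if it is missing."""
    token = env.get(TOKEN_VAR, "").strip()
    if not token:
        raise RuntimeError(
            f"{TOKEN_VAR} is not set; speaker recognition needs it to load the speaker model"
        )
    return token
```

Passing a plain dictionary as `env` makes the check easy to try out without touching the real environment.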