Speech Recognition 🎤

Accurately recognizes speech in videos. The configuration can be adjusted to suit different devices and scenarios, ensuring high-quality text generation.

Debug Mode

After uploading the video, click Execute to start. With Debug Mode enabled, execution stops after this step and the subsequent steps do not run.

log
2025-04-10 03:17:01.784 | INFO 8212 response.py:28 - {"task_id":"10b6a0826a6b4db280e5ff4dc00dcfbc"}
2025-04-10 03:17:01.786 | INFO 8212 cbutils.py:310 - File already exists. webapp/temp/test/test.mp4
2025-04-10 03:17:01.789 | INFO 8212 cbaudio.py:59 - Audio extracted and saved to: webapp/temp/test/test.wav duration 30.570666666666668s
2025-04-10 03:17:01.790 | INFO 8212 spleeter_.py:73 - Audio separate file already exists. (webapp/temp/test/stems/test_vocals.wav , webapp/temp/test/stems/test_vocals_bg.wav)
2025-04-10 03:17:05.936 | INFO 8212 whisper_.py:74 - Loading Whisper model base on device cpu
[2025-04-10 03:17:06.482] [ctranslate2] [thread 8212] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
2025-04-10 03:17:06.524 | INFO 8212 transcribe.py:839 - Processing audio with duration 00:30.571
2025-04-10 03:17:06.863 | INFO 8212 transcribe.py:906 - Detected language 'en' with probability 1.00
2025-04-10 03:17:10.076 | INFO 8212 whisper_.py:276 - 00 0 [0.26s -> 0.80s] 02  Hi, everyone.
2025-04-10 03:17:10.077 | INFO 8212 whisper_.py:276 - 01 0 [1.13s -> 3.70s] 08 You probably haven't come across these incredible products.
2025-04-10 03:17:10.077 | INFO 8212 whisper_.py:276 - 02 0 [4.24s -> 6.42s] 10  The majority of people aren't even aware of their existence.
2025-04-10 03:17:10.078 | INFO 8212 whisper_.py:276 - 03 0 [7.14s -> 10.46s] 12  Well, today I'm going to show you six amazing egg cooking gadgets.
2025-04-10 03:17:10.078 | INFO 8212 whisper_.py:276 - 04 0 [11.04s -> 14.38s] 12  This rolling egg organizer for eggs is really good for organizing eggs.
2025-04-10 03:17:10.079 | INFO 8212 whisper_.py:276 - 05 0 [15.04s -> 17.82s] 11  A narrow one will not waste the space of the refrigerator.
2025-04-10 03:17:10.079 | INFO 8212 whisper_.py:276 - 06 0 [18.52s -> 19.17s] 05  One can put 15 eggs.
2025-04-10 03:17:10.079 | INFO 8212 whisper_.py:276 - 07 0 [20.80s -> 23.34s] 10  The quality and workmanship of this shelf is very good,
2025-04-10 03:17:10.080 | INFO 8212 whisper_.py:276 - 08 0 [23.80s -> 25.28s] 07  placed on the side of the refrigerator.
2025-04-10 03:17:10.080 | INFO 8212 whisper_.py:276 - 09 0 [26.08s -> 28.52s] 13  Every time you pick up the top of the eggs will roll down,
2025-04-10 03:17:10.081 | INFO 8212 whisper_.py:276 - 10 0 [28.80s -> 29.88s] 02  especially convenient.
2025-04-10 03:17:10.081 | INFO 8212 whisper_.py:276 - 11 0 [30.34s -> 30.50s] 01  This...
2025-04-10 03:17:10.306 | INFO 8212 whisper_.py:280 - Original transcription: 
  Hi, everyone. You probably haven't come across these incredible products.  The majority of people aren't even aware of their existence.  Well, today I'm going to show you six amazing egg cooking gadgets.  This rolling egg organizer for eggs is really good for organizing eggs.  A narrow one will not waste the space of the refrigerator.  One can put 15 eggs.  The quality and workmanship of this shelf is very good,  placed on the side of the refrigerator.  Every time you pick up the top of the eggs will roll down,  especially convenient.  This...
2025-04-10 03:17:10.310 | INFO 8212 whisper_.py:285 - Transcription data complete and saved to: webapp/temp/test/test_001.json
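
The segment lines in the log above follow a regular layout, so they can be turned back into structured data when debugging. The following is a minimal sketch, not part of the tool itself. It assumes the first number is the segment index, the second is the speaker id (spk), and the number after the time range is a word count; these field meanings are inferred from the sample log, and `parse_segments` is a hypothetical helper name.

```python
import re
from dataclasses import dataclass

# Matches the segment part of a log line such as:
#   "... whisper_.py:276 - 01 0 [1.13s -> 3.70s] 08 You probably ..."
# Field meanings (index, spk, count) are assumptions inferred from the sample log.
SEGMENT_RE = re.compile(
    r"- (?P<index>\d+) (?P<spk>\d+) "
    r"\[(?P<start>\d+(?:\.\d+)?)s -> (?P<end>\d+(?:\.\d+)?)s\] "
    r"(?P<count>\d+)\s+(?P<text>.*)$"
)


@dataclass
class Segment:
    index: int    # position of the segment in the transcription
    spk: int      # speaker id (0 when speaker recognition is off)
    start: float  # start time in seconds
    end: float    # end time in seconds
    count: int    # number printed before the text (assumed word count)
    text: str     # recognized text


def parse_segments(log_text: str) -> list[Segment]:
    """Extract every segment line from a log dump, skipping all other lines."""
    segments = []
    for line in log_text.splitlines():
        match = SEGMENT_RE.search(line)
        if match is None:
            continue  # not a segment line (model loading, warnings, ...)
        segments.append(
            Segment(
                index=int(match["index"]),
                spk=int(match["spk"]),
                start=float(match["start"]),
                end=float(match["end"]),
                count=int(match["count"]),
                text=match["text"].strip(),
            )
        )
    return segments
```

Passing the full log text to `parse_segments` returns one `Segment` per recognized line, which makes it easy to check timings or to spot very short segments such as the final "This...".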

View Content

Click the icon in the top-right corner to view the currently recognized content.

Manual Modification

Edit the text value of any attribute, then click the icon in the top-right corner to save the changes.

Provider Selection

Description

json
[
  "Subtitle File",      // Reads text from an existing subtitle file
  "Jianying Draft",     // Reads recognized content from a Jianying draft
  "Faster-Whisper"      // Transcribes the audio with a Faster-Whisper model
]

Configuration

Subtitle File

  • Local Path: C:\Users\home\Desktop\test.srt

Note

This is the default option. When a subtitle file is provided, recognition uses it first. This is very useful for long texts because it significantly reduces recognition time.

Jianying Draft

  • Jianying Draft: test_en

Note

Reads the recognized content from a Jianying draft instead of transcribing the audio, which significantly reduces recognition time.

Faster-Whisper

  • Larger models provide higher recognition accuracy but need more memory and run more slowly.

json
[
  "tiny.en",    // Small English model, suitable for low-resource environments
  "tiny",       // General small model, supports multiple languages
  "base.en",    // Medium-small English model, suitable for general speech recognition
  "base",       // General medium-small model, supports multiple languages
  "small.en",   // Small English model, higher accuracy
  "small",      // Multilingual small model, suitable for more complex speech tasks
  "medium.en",  // Medium English model, suitable for high-accuracy scenarios
  "medium",     // Multilingual medium model, suitable for high-quality speech recognition
  "large-v1",   // Large model v1, high accuracy
  "large-v2",   // Large model v2, optimized for speed and accuracy
  "large-v3",   // Large model v3, handles complex speech data
  "large",      // General large model, suitable for most speech tasks
  "distil-small.en",    // Distilled version, small English model, low computational requirements
  "distil-medium.en",   // Distilled version, medium English model, low resource requirements
  "distil-large-v2",    // Distilled version, large model v2, low computational requirements
  "distil-large-v3",    // Distilled version, large model v3, retains high accuracy, reduces resource consumption
  "large-v3-turbo",     // Optimized large model, faster speed
  "turbo"               // Efficient model, low latency and high throughput
]

Note

  • For Chinese videos, choose at least the medium model.
  • .en and distil- models can only be used for English videos.
  • If the model download fails, refer to the FAQ section.
  • Compare recognition results across models and pick the one that suits your device performance and video quality.
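
The first two notes above can be expressed as a small pre-flight check. This is only an illustrative sketch: `check_model`, `SIZE_RANK`, and the size ranking are hypothetical and not part of the tool. It assumes language codes such as "en" and "zh", matching the `Detected language 'en'` line in the log.

```python
# Relative size of each model family. The ranking is an assumption made for
# this sketch so that "at least medium" can be checked.
SIZE_RANK = {"tiny": 0, "base": 1, "small": 2, "medium": 3, "large": 4, "turbo": 4}


def is_english_only(model: str) -> bool:
    """.en and distil- models can only be used for English videos."""
    return model.endswith(".en") or model.startswith("distil-")


def size_rank(model: str) -> int:
    """Return the size rank of a model name such as 'distil-large-v3'."""
    name = model.removeprefix("distil-")
    for size, rank in SIZE_RANK.items():
        if name.startswith(size):
            return rank
    raise ValueError(f"unknown model name: {model}")


def check_model(model: str, language: str) -> list[str]:
    """Return a list of warnings for a model/language combination."""
    warnings = []
    if is_english_only(model) and language != "en":
        warnings.append(f"{model} only supports English videos")
    if language == "zh" and size_rank(model) < SIZE_RANK["medium"]:
        warnings.append("for Chinese videos, choose at least the medium model")
    return warnings
```

For example, `check_model("tiny.en", "zh")` returns two warnings, while `check_model("medium", "zh")` returns none.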

Configuration Options

Accuracy

Controls how widely the decoder explores candidate text during generation; larger values typically produce more accurate text but take longer.

Randomness

Controls the randomness (temperature) of text generation: a low temperature gives more accurate, deterministic text, while a high temperature gives more varied output.

Coherence

Controls the text length processed in each pass, which affects both generation efficiency and how coherent the text stays across context.
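
To make the three options more concrete, here is a rough sketch of how they could be passed to Faster-Whisper. The mapping is an assumption, not confirmed by this documentation: Accuracy is treated as `beam_size`, Randomness as `temperature`, and Coherence as `chunk_length` (in seconds). The helper names and default values are hypothetical as well.

```python
def build_transcribe_options(
    accuracy: int = 5,        # assumed to map to beam_size
    randomness: float = 0.0,  # assumed to map to temperature
    coherence: int = 30,      # assumed to map to chunk_length, in seconds
) -> dict:
    """Validate the three settings and return keyword arguments for transcribe()."""
    if accuracy < 1:
        raise ValueError("accuracy must be at least 1")
    if not 0.0 <= randomness <= 1.0:
        raise ValueError("randomness must be between 0.0 and 1.0")
    if coherence < 1:
        raise ValueError("coherence must be at least 1")
    return {
        "beam_size": accuracy,
        "temperature": randomness,
        "chunk_length": coherence,
    }


def transcribe(audio_path: str, model_size: str = "base", **options):
    """Run Faster-Whisper on CPU with the given options (requires faster-whisper)."""
    # Imported lazily so the option builder works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, info = model.transcribe(audio_path, **options)
    # segments is a generator; consuming it performs the actual transcription.
    return [(s.start, s.end, s.text) for s in segments], info.language
```

A call such as `transcribe("test.wav", **build_transcribe_options(accuracy=3))` would then run the base model with a narrower search. The file name is a placeholder.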

Audio Track Separation

Separates the audio into tracks such as vocals, accompaniment, drums, and bass. See "Ultimate Vocal Separation UVR" for examples of the results.

Speaker

Enable to recognize multiple speakers in the video.

Quantity

  • When enabled, a fixed number of speakers is used instead of a range.

Range

The default range is 1 to 10. Together with the Quantity setting, it determines whether a fixed number of speakers or a range is used.

  • If the range is 1-6 (minimum value 1), no speaker recognition is performed (spk is always 0).
  • If the range is 2-6 and Quantity is enabled, the number of speakers is 2 (spk values 0 and 1).
  • If the range is 2-6 and Quantity is disabled, the number of speakers is 2 to 6 (spk values 0 to 5).
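
The three cases above can be summarized in a few lines of code. This is a sketch of the rules as described here, not the tool's actual implementation. The function names and the keys `num_speakers`, `min_speakers`, and `max_speakers` are hypothetical, chosen to resemble common speaker diarization options.

```python
from typing import Optional


def speaker_options(range_min: int, range_max: int, fixed_quantity: bool) -> Optional[dict]:
    """Translate the Range and Quantity settings into diarization options.

    Returns None when speaker recognition should be skipped.
    """
    if range_min < 1 or range_max < range_min:
        raise ValueError("range must satisfy 1 <= min <= max")
    if range_min == 1:
        # A minimum of 1 disables speaker recognition; every segment keeps spk 0.
        return None
    if fixed_quantity:
        # Quantity enabled: use the minimum of the range as the exact count.
        return {"num_speakers": range_min}
    # Quantity disabled: let the model pick a count within the range.
    return {"min_speakers": range_min, "max_speakers": range_max}


def spk_labels(range_min: int, range_max: int, fixed_quantity: bool) -> list[int]:
    """List the spk values that can appear in the output for these settings."""
    options = speaker_options(range_min, range_max, fixed_quantity)
    if options is None:
        return [0]
    upper = options.get("num_speakers", options.get("max_speakers"))
    return list(range(upper))
```

With these rules, `spk_labels(2, 6, True)` gives `[0, 1]` and `spk_labels(2, 6, False)` gives `[0, 1, 2, 3, 4, 5]`, matching the list above.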

Log Output

After configuring the speaker options correctly, execute again; the speaker recognition progress is printed in the log.

log
2025-04-10 03:59:34.929 | INFO 2576 whisper_.py:101 - Loading Speaker model on device cpu
segmentation         โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 100% 0:00:00
speaker_counting     โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 100% 0:00:00
embeddings           โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 100% 0:00:09
discrete_diarization โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 100% 0:00:00

By default spk is 0; once speaker recognition is enabled, the spk value of each segment changes to reflect the recognized speaker.

Note