Speech Recognition 🎤

Accurately recognizes the speech of speakers in videos. Configurations can be adjusted flexibly to suit different devices and scenarios, ensuring high-quality text generation.

Debug Mode

After uploading the video, click Execute to start. With Debug Mode enabled, execution stops after this step, so subsequent steps do not run.

log

2025-04-10 03:17:01.784 | INFO 8212 response.py:28 - {"task_id":"10b6a0826a6b4db280e5ff4dc00dcfbc"}
2025-04-10 03:17:01.786 | INFO 8212 cbutils.py:310 - File already exists. webapp/temp/test/test.mp4
2025-04-10 03:17:01.789 | INFO 8212 cbaudio.py:59 - Audio extracted and saved to: webapp/temp/test/test.wav duration 30.570666666666668s
2025-04-10 03:17:01.790 | INFO 8212 spleeter_.py:73 - Audio separate file already exists. (webapp/temp/test/stems/test_vocals.wav , webapp/temp/test/stems/test_vocals_bg.wav)
2025-04-10 03:17:05.936 | INFO 8212 whisper_.py:74 - Loading Whisper model base on device cpu
[2025-04-10 03:17:06.482] [ctranslate2] [thread 8212] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
2025-04-10 03:17:06.524 | INFO 8212 transcribe.py:839 - Processing audio with duration 00:30.571
2025-04-10 03:17:06.863 | INFO 8212 transcribe.py:906 - Detected language 'en' with probability 1.00
2025-04-10 03:17:10.076 | INFO 8212 whisper_.py:276 - 00 0 [0.26s -> 0.80s] 02  Hi, everyone.
2025-04-10 03:17:10.077 | INFO 8212 whisper_.py:276 - 01 0 [1.13s -> 3.70s] 08 You probably haven't come across these incredible products.
2025-04-10 03:17:10.077 | INFO 8212 whisper_.py:276 - 02 0 [4.24s -> 6.42s] 10  The majority of people aren't even aware of their existence.
2025-04-10 03:17:10.078 | INFO 8212 whisper_.py:276 - 03 0 [7.14s -> 10.46s] 12  Well, today I'm going to show you six amazing egg cooking gadgets.
2025-04-10 03:17:10.078 | INFO 8212 whisper_.py:276 - 04 0 [11.04s -> 14.38s] 12  This rolling egg organizer for eggs is really good for organizing eggs.
2025-04-10 03:17:10.079 | INFO 8212 whisper_.py:276 - 05 0 [15.04s -> 17.82s] 11  A narrow one will not waste the space of the refrigerator.
2025-04-10 03:17:10.079 | INFO 8212 whisper_.py:276 - 06 0 [18.52s -> 19.17s] 05  One can put 15 eggs.
2025-04-10 03:17:10.079 | INFO 8212 whisper_.py:276 - 07 0 [20.80s -> 23.34s] 10  The quality and workmanship of this shelf is very good,
2025-04-10 03:17:10.080 | INFO 8212 whisper_.py:276 - 08 0 [23.80s -> 25.28s] 07  placed on the side of the refrigerator.
2025-04-10 03:17:10.080 | INFO 8212 whisper_.py:276 - 09 0 [26.08s -> 28.52s] 13  Every time you pick up the top of the eggs will roll down,
2025-04-10 03:17:10.081 | INFO 8212 whisper_.py:276 - 10 0 [28.80s -> 29.88s] 02  especially convenient.
2025-04-10 03:17:10.081 | INFO 8212 whisper_.py:276 - 11 0 [30.34s -> 30.50s] 01  This...
2025-04-10 03:17:10.306 | INFO 8212 whisper_.py:280 - Original transcription: 
  Hi, everyone. You probably haven't come across these incredible products.  The majority of people aren't even aware of their existence.  Well, today I'm going to show you six amazing egg cooking gadgets.  This rolling egg organizer for eggs is really good for organizing eggs.  A narrow one will not waste the space of the refrigerator.  One can put 15 eggs.  The quality and workmanship of this shelf is very good,  placed on the side of the refrigerator.  Every time you pick up the top of the eggs will roll down,  especially convenient.  This...
2025-04-10 03:17:10.310 | INFO 8212 whisper_.py:285 - Transcription data complete and saved to: webapp/temp/test/test_001.json
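The segment lines in the log above follow a regular pattern, so they can be pulled out of a log file for quick inspection. The following is a minimal sketch, not part of the tool itself. The field names are assumptions inferred from the sample: the first number looks like a segment index, the second like the speaker id (spk), and the number after the time range appears to match the word count of the text.

```python
import re
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    index: int    # assumed: running segment number
    spk: int      # assumed: speaker id (0 by default)
    start: float  # start time in seconds
    end: float    # end time in seconds
    count: int    # assumed: number of words in the segment
    text: str     # recognized text


# Matches the part after " - " in lines such as:
#   ... whisper_.py:276 - 00 0 [0.26s -> 0.80s] 02  Hi, everyone.
SEGMENT_RE = re.compile(
    r"-\s+(?P<index>\d+)\s+(?P<spk>\d+)\s+"
    r"\[(?P<start>\d+(?:\.\d+)?)s -> (?P<end>\d+(?:\.\d+)?)s\]\s+"
    r"(?P<count>\d+)\s+(?P<text>.*)$"
)


def parse_segment_line(line: str) -> Optional[Segment]:
    """Return a Segment for a segment log line, or None for any other line."""
    match = SEGMENT_RE.search(line)
    if match is None:
        return None
    return Segment(
        index=int(match["index"]),
        spk=int(match["spk"]),
        start=float(match["start"]),
        end=float(match["end"]),
        count=int(match["count"]),
        text=match["text"].strip(),
    )
```

Lines that are not segment lines, such as the model loading or language detection messages, return None and can simply be skipped.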

View Content

Click the icon in the bottom-right corner to view the currently recognized content.

Manual Editing

Edit the text values of the attributes, then click the bottom-right icon to save the changes.

Model Selection

Model Description

Larger models generally provide higher recognition accuracy, at the cost of more memory and slower processing.

json

[
  "tiny.en",    // Small English model, suitable for low-resource environments
  "tiny",       // General small model, supports multiple languages
  "base.en",    // Medium-small English model, suitable for general speech recognition
  "base",       // General medium-small model, supports multiple languages
  "small.en",   // Small English model with higher accuracy
  "small",      // Multilingual small model, suitable for more complex speech tasks
  "medium.en",  // Medium English model, suitable for high-accuracy scenarios
  "medium",     // Multilingual medium model, suitable for high-quality speech recognition
  "large-v1",   // Large model v1, high accuracy
  "large-v2",   // Large model v2, optimized for speed and accuracy
  "large-v3",   // Large model v3, handles complex speech data
  "large",      // General large model, suitable for most speech tasks
  "distil-small.en",    // Distilled version, small English model, low computational demand
  "distil-medium.en",   // Distilled version, medium English model, low resource demand
  "distil-large-v2",    // Distilled version, large model v2, low computational demand
  "distil-large-v3",    // Distilled version, large model v3, retains high accuracy with reduced resource consumption
  "large-v3-turbo",     // Optimized large model, faster speed
  "turbo"               // Efficient model, low latency and high throughput
]

TIP

For Chinese videos, it is recommended to select at least the medium model.
Models with .en or distil- in their names are only for English videos.
If the model download fails with an error, refer to the "Common Issues" section in Help.
Choose a model based on your device performance, video quality, and recognition results.
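The naming rule in the tip above can be expressed as a small filter. This is only an illustrative sketch: the helper names `is_english_only` and `candidate_models` are hypothetical and are not part of the tool.

```python
from typing import Iterable, List


def is_english_only(model_name: str) -> bool:
    """Apply the tip's rule: '.en' models and 'distil-' models are English-only."""
    return model_name.endswith(".en") or model_name.startswith("distil-")


def candidate_models(models: Iterable[str], language: str) -> List[str]:
    """Keep only the models usable for the given language code (e.g. 'en', 'zh')."""
    if language == "en":
        return list(models)
    return [name for name in models if not is_english_only(name)]
```

For a Chinese video, for example, `candidate_models(["tiny.en", "medium", "distil-large-v3", "large-v3"], "zh")` leaves only `medium` and `large-v3`.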

Configuration Options

Accuracy

Controls how deeply candidate outputs are explored during generation. Higher values usually produce more accurate text but take longer.

Randomness

Controls the randomness (temperature) of text generation. Lower values give more deterministic and usually more accurate text; higher values give more varied, creative output.

Coherence

Controls the length of text processed at a time, affecting generation efficiency and contextual coherence.
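For readers who want to reproduce similar settings in a script, the three options above might be passed to faster-whisper roughly as follows. This is a sketch under stated assumptions: the log output suggests the tool runs on faster-whisper, but the mapping of Accuracy to `beam_size`, Randomness to `temperature`, and Coherence to `chunk_length` is a guess, and the default values shown are illustrative, not the tool's defaults.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DecodeOptions:
    accuracy: int = 5        # assumed to map to beam_size
    randomness: float = 0.0  # assumed to map to temperature
    coherence: int = 30      # assumed to map to chunk_length (seconds)

    def to_kwargs(self) -> dict:
        """Validate the values and build keyword arguments for transcribe()."""
        if self.accuracy < 1:
            raise ValueError("accuracy must be at least 1")
        if not 0.0 <= self.randomness <= 1.0:
            raise ValueError("randomness must be between 0.0 and 1.0")
        if self.coherence < 1:
            raise ValueError("coherence must be at least 1")
        return {
            "beam_size": self.accuracy,
            "temperature": self.randomness,
            "chunk_length": self.coherence,
        }


def transcribe(
    audio_path: str,
    model_name: str = "base",
    options: Optional[DecodeOptions] = None,
) -> List[Tuple[float, float, str]]:
    """Transcribe a file and return (start, end, text) tuples."""
    # Imported lazily so the options class can be used without the package.
    from faster_whisper import WhisperModel

    opts = options or DecodeOptions()
    model = WhisperModel(model_name, device="cpu")
    segments, _info = model.transcribe(audio_path, **opts.to_kwargs())
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

Keeping validation in `to_kwargs` means a bad value is reported before the model is loaded, which matters because loading larger models can take a while.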

Speaker

Enable this option to recognize multiple speakers in the video.

Quantity

Used when the number of speakers is known. Default value is 1. If set to 1, multi-speaker recognition is disabled.

Range

Used when the number of speakers is unknown. Default range is 1 to 6.
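The relationship between Speaker, Quantity, and Range can be sketched as a small helper. The step names in the speaker log (segmentation, speaker_counting, embeddings, discrete_diarization) suggest a pyannote.audio diarization pipeline, whose call accepts `num_speakers`, `min_speakers`, and `max_speakers`; that the tool uses it this way is an assumption, and `speaker_kwargs` is a hypothetical name. The precedence shown (a known quantity wins over the range) is also a guess.

```python
from typing import Optional, Tuple


def speaker_kwargs(
    enabled: bool,
    quantity: Optional[int] = None,
    speaker_range: Tuple[int, int] = (1, 6),
) -> Optional[dict]:
    """Build diarization arguments, or return None when diarization is off.

    quantity: known number of speakers (None when unknown).
    speaker_range: (min, max) used when the number of speakers is unknown.
    """
    if not enabled:
        return None
    if quantity is not None:
        if quantity < 1:
            raise ValueError("quantity must be at least 1")
        if quantity == 1:
            # A single speaker means multi-speaker recognition is disabled.
            return None
        return {"num_speakers": quantity}
    low, high = speaker_range
    if low < 1 or high < low:
        raise ValueError("speaker_range must satisfy 1 <= min <= max")
    return {"min_speakers": low, "max_speakers": high}
```

When the result is not None, it could be unpacked into the pipeline call, for example `pipeline(audio_file, **kwargs)`.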

Noise Reduction

Noise reduction applied to speaker audio. The range is 0 to 5. A value of 0 disables noise reduction, and each higher value applies stronger reduction.

Log Output

After configuring the options correctly, re-execute. The log displays the speaker recognition output.

log

2025-04-10 03:59:34.929 | INFO 2576 whisper_.py:101 - Loading Speaker model on device cpu
segmentation         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
speaker_counting     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
embeddings           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:09
discrete_diarization ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

By default, spk=0. With speaker recognition enabled, the spk value changes according to the recognized speaker.

TIP

This uses the HUGGINGFACEHUB_API_TOKEN configured in Environment Variables.
This model requires authentication. See Authorization.
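A missing token usually shows up only as a download or authentication error, so it can help to check the environment variable up front. The helper below is a minimal sketch; `get_hf_token` is a hypothetical name and not part of the tool.

```python
import os
from typing import Mapping, Optional


def get_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Hugging Face token from the environment, or raise if absent."""
    source = os.environ if env is None else env
    token = source.get("HUGGINGFACEHUB_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HUGGINGFACEHUB_API_TOKEN is not set; the speaker model "
            "requires an authenticated Hugging Face account."
        )
    return token
```

Passing a mapping explicitly, as in `get_hf_token({"HUGGINGFACEHUB_API_TOKEN": "..."})`, makes the check easy to exercise without touching the real environment.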
