# Whisper-WebUI
A Gradio-based browser interface for Whisper. You can use it as an easy subtitle generator!
## Notebook
If you wish to try this on Colab, you can do it here!
## Features
- Select the Whisper implementation you want to use between:
  - openai/whisper
  - SYSTRAN/faster-whisper (used by default)
  - Vaibhavs10/insanely-fast-whisper
- Generate subtitles from various sources, including:
  - Files
  - YouTube
  - Microphone
- Currently supported subtitle formats:
  - SRT
  - WebVTT
  - txt (plain text without timestamps)
- Speech-to-text translation
  - From other languages to English (this is Whisper's end-to-end speech-to-text translation feature)
- Text-to-text translation
  - Translate subtitle files using Facebook NLLB models
  - Translate subtitle files using the DeepL API
- Pre-processing audio input with Silero VAD
- Post-processing with speaker diarization using the pyannote model
  - To download the pyannote model, you need a Hugging Face token and must manually accept the terms on the pyannote model pages; a minimal login sketch follows this list.
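As a hedged sketch (not taken from this README), one common way to make a Hugging Face token available locally is the `huggingface-cli login` command from the `huggingface_hub` package, or the `HF_TOKEN` environment variable:

```bash
# Log in once; the token is cached locally and picked up by libraries
# that download models through huggingface_hub (a read token is enough).
huggingface-cli login

# Alternatively, export the token for the current shell session only.
export HF_TOKEN="hf_..."   # placeholder: replace with your own token
```

Whether Whisper-WebUI reads `HF_TOKEN` directly is an assumption here; the cached login is the safer bet.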
## Installation and Running
### Prerequisite
To run this WebUI, you need git, Python (version 3.8 ~ 3.10), and FFmpeg.
If you are not using an NVIDIA GPU, or you are using a CUDA version other than 12.1, edit `requirements.txt` to match your environment.
Please follow the links below to install the necessary software:
- git: https://git-scm.com/downloads
- Python: https://www.python.org/downloads/ (if your Python version is too new, torch will not install properly)
- FFmpeg: https://ffmpeg.org/download.html
- CUDA: https://developer.nvidia.com/cuda-downloads
After installing FFmpeg, make sure to add the `FFmpeg/bin` folder to your system PATH!
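As a quick sanity check (a minimal sketch, not part of the original instructions), you can confirm that each prerequisite is installed and on your PATH:

```bash
# Each command should print a version string; an error means the tool
# is missing or not on the PATH.
git --version
python --version   # should report 3.8 ~ 3.10
ffmpeg -version
nvcc --version     # optional: only relevant if you installed the CUDA toolkit
```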
### Automatic Installation
1. Download the `Whisper-WebUI.zip` file corresponding to your OS from the v1.0.0 release and extract its contents.
2. Run `install.bat` or `install.sh` to install dependencies. (This will create a `venv` directory and install dependencies there.)
3. Start the WebUI with `start-webui.bat` or `start-webui.sh`.
4. To update the WebUI, run `update.bat` or `update.sh`.

You can also run the project with command line arguments if you like; see the wiki for a guide to the arguments. A sketch of the whole flow on Linux follows.
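A minimal walk-through of the steps above on Linux or macOS (the extracted directory name is an assumption; it depends on the release archive):

```bash
# Extract the release archive and enter the project directory.
unzip Whisper-WebUI.zip && cd Whisper-WebUI

# Create the venv and install dependencies into it.
./install.sh

# Launch the Gradio interface (served at http://localhost:7860 by default).
./start-webui.sh
```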
### Running with Docker
1. Git clone the repository: `git clone https://github.com/jhj0517/Whisper-WebUI.git`
2. Build the image (the image is about 7 GB): `docker compose build`
3. Run the container: `docker compose up`
4. Connect to the WebUI with your browser at http://localhost:7860

If needed, update `docker-compose.yaml` to match your environment; a sketch of a common edit follows.
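For example, GPU passthrough is a common thing to adjust. This is a hedged sketch using Docker Compose's standard NVIDIA device reservation syntax; the `app` service name is an assumption, so match it to the repository's actual `docker-compose.yaml`:

```yaml
# Hypothetical excerpt: grants the container access to one NVIDIA GPU.
services:
  app:                      # assumed service name; check the real file
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```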
## VRAM Usage
This project integrates faster-whisper by default for better VRAM usage and transcription speed.
According to faster-whisper, the efficiency of the optimized Whisper model is as follows:
| Implementation | Precision | Beam size | Time  | Max. GPU memory | Max. CPU memory |
|----------------|-----------|-----------|-------|-----------------|-----------------|
| openai/whisper | fp16      | 5         | 4m30s | 11325MB         | 9439MB          |
| faster-whisper | fp16      | 5         | 54s   | 4755MB          | 3244MB          |
If you want to use an implementation other than faster-whisper, pass the `--whisper_type` argument with the repository name.
Read the wiki for more info about CLI arguments; a hedged example follows.
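For instance (a sketch only: `app.py` as the entry point is a guess based on typical Gradio projects, and the exact accepted values for `--whisper_type` should be checked against the wiki):

```bash
# Assumed invocation: switch the backend from the default faster-whisper
# to the original openai/whisper implementation.
python app.py --whisper_type whisper
```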
## Available models
This is Whisper's original VRAM usage table for its models.
| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|--------|------------|--------------------|--------------------|---------------|----------------|
| tiny   | 39 M       | tiny.en            | tiny               | ~1 GB         | ~32x           |
| base   | 74 M       | base.en            | base               | ~1 GB         | ~16x           |
| small  | 244 M      | small.en           | small              | ~2 GB         | ~6x            |
| medium | 769 M      | medium.en          | medium             | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | large              | ~10 GB        | 1x             |
The `.en` models are for English only, and the cool thing is that you can use the `Translate to English` option with the multilingual models!
## TODO 🗓
- Add DeepL API translation
- Add NLLB model translation
- Integrate with faster-whisper
- Integrate with insanely-fast-whisper
- Integrate with whisperX (only the speaker diarization part)
- Add background music separation pre-processing with MVSEP-MDX23
- Add FastAPI script
- Support real-time transcription for the microphone