Speech to Text
Transcribe live microphone audio or uploaded files in your browser. Live transcription uses the Web Speech API; file transcription uses Whisper Tiny.
About Speech to Text
Speech to Text offers two transcription modes. Live Mic mode uses your browser's built-in Web Speech API for real-time transcription from your microphone — it works immediately with no downloads. File mode uses OpenAI's Whisper Tiny model via Transformers.js, running entirely in your browser. The Whisper model (~39MB) is downloaded once and cached; your audio files are never uploaded to any server.
Why use Speech to Text
- Live mode works instantly with no downloads, using the Web Speech API for real-time dictation
- File mode uses Whisper AI with no server upload
- Supports 7+ languages in live mode
- Whisper model cached after first use
- Transcript can be copied with one click
How to use Speech to Text
- Choose 'Live Mic' for real-time transcription or 'Audio File' for uploaded files
- For live mic: select a language, click 'Start Listening', and grant microphone permission when prompted
- Speak into your microphone and watch text appear in real time
- For file upload: select an audio file and click 'Transcribe'
- Copy the transcript with the Copy button
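Behind the live-mic steps above, a page typically receives a growing list of recognition results and has to separate finalized text from text that is still being revised. The following is a minimal sketch, assuming result objects shaped like the Web Speech API's `SpeechRecognitionResultList` (each result carries an `isFinal` flag and ranked alternatives with a `transcript`). The interfaces and the `assembleTranscript` helper are illustrative names declared locally, not this tool's actual code.

```typescript
// Minimal shapes mirroring Web Speech API result objects, declared locally
// so the sketch compiles without DOM typings.
interface RecognitionAlternative {
  transcript: string;
}

interface RecognitionResult {
  isFinal: boolean;
  readonly length: number;
  [index: number]: RecognitionAlternative;
}

interface RecognitionResultList {
  readonly length: number;
  [index: number]: RecognitionResult;
}

interface TranscriptState {
  finalText: string;   // text the recognizer has committed to
  interimText: string; // text that may still change
}

// Hypothetical helper: splits a result list into committed and in-progress text.
function assembleTranscript(results: RecognitionResultList): TranscriptState {
  let finalText = "";
  let interimText = "";
  for (let i = 0; i < results.length; i++) {
    const result = results[i];
    if (!result) continue;
    const top = result[0]; // alternative 0 is the top-ranked hypothesis
    if (!top) continue;
    if (result.isFinal) {
      finalText += top.transcript;
    } else {
      interimText += top.transcript;
    }
  }
  return { finalText, interimText };
}
```

In a browser, a function like this would be called from the recognizer's `result` event handler with `event.results`, after setting `lang`, `continuous = true`, and `interimResults = true` on the recognizer.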
When to use Speech to Text
- Transcribing meeting recordings, lectures, or interviews after the fact in file mode
- Dictating notes hands-free
- Converting voicemail or audio memos to text
- Creating captions for audio content
- Accessibility use cases requiring spoken-to-written conversion
Examples
Meeting recording
Input: MP3 of a 20-minute team meeting, English, recorded on a laptop microphone
Output: Plain-text transcript with punctuation and speaker pauses indicated, ready for action-item extraction.
Live dictation
Input: Speaker dictating notes via headset microphone in Chrome
Output: Text streams into the textarea in real time as the speaker talks, so a 5-minute monologue is captured in about 5 minutes.
Multilingual voicemail
Input: M4A voicemail in mixed Spanish and English, 90 seconds
Output: Whisper auto-detects the dominant language and produces a transcript; passages in the secondary language may be transcribed less accurately, so review them against the audio.
Tips
- Use Live Mic mode in a quiet room with a good headset microphone — Web Speech API accuracy is heavily dependent on input quality.
- For recorded audio, file mode (Whisper Tiny) typically produces better punctuation and capitalisation than the live API.
- The first file-mode run on a given browser downloads ~39 MB; trigger it on Wi-Fi if you're on a mobile data plan.
- Whisper processes audio in 30-second chunks, so wait times grow roughly in proportion to file length. For very long files (>30 min), split the audio into shorter clips, which are also easier to review manually.
- If live mode stops mid-sentence, click 'Start Listening' again. Some browsers time out the speech recognition session after about 60 seconds of silence.
- For multilingual audio, just upload to file mode — Whisper detects the language automatically without you specifying it.
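The manual restart described in the tips above could also be automated. The following is a minimal sketch, assuming a recognizer object that fires an `end` event when the browser closes the session, as `SpeechRecognition` does. `RecognizerLike` and `keepListening` are hypothetical names declared here for illustration, not part of any library or of this tool.

```typescript
// Minimal subset of the recognizer surface used by this sketch,
// declared locally so it compiles without DOM typings.
interface RecognizerLike {
  addEventListener(type: string, listener: () => void): void;
  start(): void;
  stop(): void;
}

// Hypothetical helper: restarts the session whenever the browser ends it,
// until the caller explicitly stops listening.
function keepListening(recognizer: RecognizerLike): { stop: () => void } {
  let wanted = true; // true while the user still wants transcription

  recognizer.addEventListener("end", () => {
    // The browser ended the session (for example after a silence timeout).
    if (wanted) {
      recognizer.start();
    }
  });

  recognizer.start();

  return {
    stop: () => {
      wanted = false; // prevents the "end" handler from restarting
      recognizer.stop();
    },
  };
}
```

A fuller version would also listen for `error` events and stop retrying when the microphone permission is denied; otherwise the restart could loop indefinitely.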
Frequently Asked Questions
Does my audio get uploaded to a server?
In file mode, no. Whisper runs entirely in your browser via WebAssembly, so your audio files are never transmitted. Live mode uses your browser's built-in speech API, which on Chrome may use Google's cloud services depending on your browser settings.
What is Whisper?
Whisper is OpenAI's automatic speech recognition model. The Tiny variant (~39MB) runs in-browser via Transformers.js for offline transcription.
Why doesn't live mode work in Firefox?
Firefox doesn't support the speech recognition part of the Web Speech API by default. Use the File Upload tab instead, which works in all modern browsers.
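One way a page can decide whether to offer live mode is to check for the recognition constructor before showing that option. The sketch below is illustrative only and is not this tool's code: `SpeechScope`, `hasLiveRecognition`, and `availableModes` are hypothetical names, and the check for the prefixed `webkitSpeechRecognition` reflects how Chromium-based browsers and Safari expose the constructor.

```typescript
// The parts of the global scope that matter for speech recognition.
interface SpeechScope {
  SpeechRecognition?: unknown;
  webkitSpeechRecognition?: unknown;
}

type TranscriptionMode = "live" | "file";

// True when either the standard or the webkit-prefixed constructor exists.
function hasLiveRecognition(scope: SpeechScope): boolean {
  return (
    typeof scope.SpeechRecognition === "function" ||
    typeof scope.webkitSpeechRecognition === "function"
  );
}

// File mode is always offered; live mode only when the browser supports it.
function availableModes(scope: SpeechScope): TranscriptionMode[] {
  return hasLiveRecognition(scope) ? ["live", "file"] : ["file"];
}
```

In a browser, the global object would be passed as the scope, so a browser without speech recognition would be offered file mode only.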
How large is the Whisper model?
Whisper Tiny is approximately 39MB. It's downloaded once and cached by the browser, so subsequent uses skip the download.
What languages does file mode support?
Whisper supports 99 languages with automatic language detection. Live mode supports the languages shown in the dropdown (7 options).
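Can I transcribe long recordings?
For live mode, there's no fixed length limit, although some browsers end the session after a stretch of silence and you may need to start listening again. For file mode, very long files (>30 min) may take several minutes to process and could strain browser memory.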
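As a rough illustration of why long files take longer, the 30-second chunking mentioned in the tips can be turned into simple arithmetic. This is a sketch under stated assumptions: it ignores any overlap between chunks, and `secondsPerChunk` is a hypothetical figure you would have to measure on your own device, not a number published for this tool.

```typescript
const CHUNK_SECONDS = 30; // chunk length stated in the Tips section

// Number of chunks needed to cover a recording, ignoring chunk overlap.
function countChunks(
  durationSeconds: number,
  chunkSeconds: number = CHUNK_SECONDS
): number {
  if (durationSeconds <= 0 || chunkSeconds <= 0) return 0;
  return Math.ceil(durationSeconds / chunkSeconds);
}

// Hypothetical estimate: total time grows linearly with the chunk count.
function estimateProcessingSeconds(
  durationSeconds: number,
  secondsPerChunk: number
): number {
  return countChunks(durationSeconds) * secondsPerChunk;
}
```

For example, a 20-minute recording (1,200 seconds) covers 40 chunks, and a 90-second voicemail covers 3.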
How accurate is the transcription?
Live mode accuracy depends on your browser and microphone. File mode (Whisper Tiny) is generally accurate for clear speech in supported languages, but as the smallest Whisper variant it can struggle with noisy or heavily accented audio.
Glossary
- Whisper
- An open-source automatic speech recognition system from OpenAI, trained on 680,000 hours of multilingual web audio and capable of recognising 99 languages.
- Whisper Tiny
- The smallest Whisper variant (39 MB, 39M parameters); fast enough to run in-browser via WebAssembly while still producing solid transcripts for clean speech.
- Web Speech API
- A browser API exposed by Chromium-based browsers (and Safari) that performs real-time speech recognition using the OS or vendor speech engine, often with cloud assist.
- Transformers.js
- A Hugging Face JavaScript library that runs Hugging Face transformer models (including Whisper) in the browser via ONNX Runtime Web.
- ONNX Runtime Web
- A WebAssembly build of Microsoft's ONNX Runtime that executes neural network models in the browser without a server.
- Language detection
- The automatic identification of the spoken language in an audio clip; Whisper performs this in its initial token prediction step.
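- IndexedDB
- A persistent client-side database built into browsers. Persistent browser storage of this kind is what lets the Whisper model download only once per browser profile (by default, Transformers.js caches model files through the browser's Cache API).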