UtilityKit

500+ fast, free tools. Most run in your browser only; Image & PDF tools upload files to the backend when you run them.

Speech to Text

Transcribe live microphone audio or uploaded files in your browser. Uses the Web Speech API for live audio and Whisper Tiny for files.

About Speech to Text

Speech to Text offers two transcription modes. Live Mic mode uses your browser's built-in Web Speech API for real-time transcription from your microphone — it works immediately with no downloads. File mode uses OpenAI's Whisper Tiny model via Transformers.js, running entirely in your browser. The Whisper model (~39MB) is downloaded once and cached; your audio files are never uploaded to any server.
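As a rough illustration of how live mode can turn Web Speech API results into a running transcript, here is a minimal sketch. The helper name `applyResultEvent` and the `...Like` types are hypothetical and only mirror the fields of the real result event that the sketch reads; how this tool actually wires up its page is not stated here.

```typescript
// Structural stand-ins for the parts of a Web Speech API result event
// that this sketch reads (hypothetical names, not the tool's own code).
interface AlternativeLike { transcript: string }
interface ResultLike { isFinal: boolean; 0: AlternativeLike }
interface ResultEventLike { resultIndex: number; results: ArrayLike<ResultLike> }

interface TranscriptState { finalText: string; interimText: string }

// Fold one result event into the transcript: final results are appended
// permanently, interim results are replaced on every event.
function applyResultEvent(state: TranscriptState, event: ResultEventLike): TranscriptState {
  let finalText = state.finalText;
  let interimText = "";
  // Only results from resultIndex onwards changed in this event.
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    const transcript = result[0].transcript;
    if (result.isFinal) {
      finalText += transcript;
    } else {
      interimText += transcript;
    }
  }
  return { finalText, interimText };
}

// In a browser, the wiring might look like this (not executed here):
//   const Ctor = window.SpeechRecognition ?? window.webkitSpeechRecognition;
//   const recognition = new Ctor();
//   recognition.lang = "en-US";
//   recognition.continuous = true;
//   recognition.interimResults = true;
//   recognition.onresult = (e) => { state = applyResultEvent(state, e); };
//   recognition.start();
```

Keeping interim text separate from final text is what lets words appear immediately and then settle once the recognizer commits to them.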

Why use Speech to Text

Live mode works instantly with no downloads via the Web Speech API for real-time dictation.
File mode uses Whisper AI with no server upload.
Supports 7+ languages in live mode.
Whisper model cached after first use.
Transcript can be copied with one click.

How to use Speech to Text

Choose 'Live Mic' for real-time transcription or 'Audio File' for uploaded files.
For live mic: select a language, click 'Start Listening', and grant microphone permission when prompted.
Speak into your microphone and watch text appear in real time.
For file upload: select an audio file and click 'Transcribe'.
Copy the transcript with the Copy button.

When to use Speech to Text

Transcribing meeting recordings, lectures, or interviews after the fact in file mode.
Dictating notes hands-free.
Converting voicemail or audio memos to text.
Creating captions for audio content.
Accessibility use cases requiring spoken-to-written conversion.

Examples

Meeting recording

Input: MP3 of a 20-minute team meeting, English, recorded on a laptop microphone

Output: Plain-text transcript with punctuation and speaker pauses indicated, ready for action-item extraction.

Live dictation

Input: Speaker dictating notes via headset microphone in Chrome

Output: Real-time text streaming into the textarea as the speaker talks, captured in roughly 5 minutes for a 5-minute monologue.

Multilingual voicemail

Input: M4A voicemail in mixed Spanish and English, 90 seconds

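Output: Whisper auto-detects the dominant language and transcribes the clip. Passages in the secondary language may be transcribed less accurately or rendered in the dominant language, so review mixed-language sections by hand.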

Tips

Use Live Mic mode in a quiet room with a good headset microphone — Web Speech API accuracy is heavily dependent on input quality.
For recorded audio, file mode (Whisper Tiny) produces dramatically better punctuation and capitalisation than the live API.
The first file-mode run on a given browser downloads ~39 MB; trigger it on Wi-Fi if you're on a mobile data plan.
Whisper processes audio in 30-second chunks, so processing time grows roughly in proportion to file length. For very long files (>30 min), consider splitting the recording into shorter clips and transcribing them one at a time.
If live mode stops mid-sentence, click 'Start Listening' again; some browsers time out the speech recognition session after about 60 seconds of silence.
For multilingual audio, just upload to file mode — Whisper detects the language automatically without you specifying it.
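
To make the 30-second chunking concrete, the sketch below counts the chunks for a given duration and splits decoded samples into fixed windows. It assumes 16 kHz mono samples (the input rate Whisper models expect) and uses simple non-overlapping windows. Real pipelines typically overlap neighbouring chunks slightly, and the function names here are hypothetical.

```typescript
const SAMPLE_RATE_HZ = 16000; // Whisper models expect 16 kHz mono input
const CHUNK_SECONDS = 30;     // window length mentioned in the tips above

// Number of 30-second windows needed to cover a recording.
function countChunks(durationSeconds: number, chunkSeconds: number = CHUNK_SECONDS): number {
  if (durationSeconds <= 0) return 0;
  return Math.ceil(durationSeconds / chunkSeconds);
}

// Split decoded samples into consecutive, non-overlapping windows.
// subarray() creates views, so no audio data is copied.
function splitIntoChunks(
  samples: Float32Array,
  sampleRate: number = SAMPLE_RATE_HZ,
  chunkSeconds: number = CHUNK_SECONDS,
): Float32Array[] {
  const chunkLength = sampleRate * chunkSeconds;
  const chunks: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += chunkLength) {
    const end = Math.min(start + chunkLength, samples.length);
    chunks.push(samples.subarray(start, end));
  }
  return chunks;
}
```

For example, a 20-minute meeting (1,200 seconds) works out to 40 chunks, which is why wait time scales with file length.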

Frequently Asked Questions

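Does my audio get uploaded to a server?

In file mode, no: Whisper runs entirely in your browser via WebAssembly, so audio files never leave your device. In live mode, it depends on the browser. The Web Speech API hands microphone audio to the browser's or OS's speech engine, and some browsers (Chrome, for example) send it to a vendor speech service for recognition.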

What is Whisper?

Whisper is OpenAI's automatic speech recognition model. The Tiny variant (~39MB) runs in-browser via Transformers.js for offline transcription.

Why doesn't live mode work in Firefox?

Firefox doesn't enable the speech recognition part of the Web Speech API by default. Use the 'Audio File' tab instead, which works in all modern browsers.
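
A page can check for speech recognition support before offering live mode. The following is a minimal sketch of that check: `SpeechRecognition` and the prefixed `webkitSpeechRecognition` are the real global names, while the helper functions and the fallback logic are assumptions for illustration, not this tool's actual code.

```typescript
type RecognitionCtor = new () => unknown;

// Look up the recognition constructor on a global-like object.
// Chromium-based browsers and Safari expose the webkit-prefixed name.
function getSpeechRecognitionCtor(scope: Record<string, unknown>): RecognitionCtor | null {
  const candidate = scope["SpeechRecognition"] ?? scope["webkitSpeechRecognition"];
  return typeof candidate === "function" ? (candidate as RecognitionCtor) : null;
}

// Hypothetical helper: fall back to file mode when live recognition is missing.
function pickDefaultMode(scope: Record<string, unknown>): "live" | "file" {
  return getSpeechRecognitionCtor(scope) !== null ? "live" : "file";
}

// In a browser this could be called as:
//   pickDefaultMode(globalThis as unknown as Record<string, unknown>)
```

Passing the global object in as a parameter keeps the check easy to exercise outside a browser.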

How large is the Whisper model?

Whisper Tiny is approximately 39MB. It's downloaded once and cached by the browser, so subsequent uses skip the download.

What languages does file mode support?

Whisper supports 99 languages with automatic language detection. Live mode supports the languages shown in the dropdown (7 options).

Can I transcribe long recordings?

Live mode has no fixed length limit, although the browser may end a session after a long silence. For file mode, very long files (>30 min) may take several minutes to process and could strain browser memory.
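
To see why long files can strain memory, it helps to estimate the size of the decoded audio. The sketch below assumes samples are held as 32-bit floats, which is how browsers represent decoded audio, and that Whisper receives 16 kHz mono input. The function name is hypothetical, and the tool's real memory use also includes the model itself.

```typescript
// Bytes needed to hold decoded audio as 32-bit float samples.
function decodedAudioBytes(
  durationSeconds: number,
  sampleRate: number = 16000, // Whisper's expected input rate
  channels: number = 1,       // mono
): number {
  return durationSeconds * sampleRate * channels * Float32Array.BYTES_PER_ELEMENT;
}

// 30 minutes at 16 kHz mono:      1800 * 16000 * 1 * 4 = 115,200,000 bytes (~115 MB)
// 30 minutes at 48 kHz stereo:    1800 * 48000 * 2 * 4 = 691,200,000 bytes (~691 MB)
```

The second figure matters if the browser first decodes the file at a higher sample rate before it is resampled to 16 kHz mono.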

How accurate is the transcription?

Live mode accuracy depends on your browser and microphone. File mode (Whisper) is very accurate for clear speech in supported languages.

Glossary

Whisper: An open-source automatic speech recognition system from OpenAI, trained on 680,000 hours of multilingual web audio and capable of recognising 99 languages.
Whisper Tiny: The smallest Whisper variant (39 MB, 39M parameters); fast enough to run in-browser via WebAssembly while still producing solid transcripts for clean speech.
Web Speech API: A browser API exposed by Chromium-based browsers (and Safari) that performs real-time speech recognition using the OS or vendor speech engine, often with cloud assist.
Transformers.js: A Hugging Face JavaScript library that runs Hugging Face transformer models (including Whisper) in the browser via ONNX Runtime Web.
ONNX Runtime Web: A WebAssembly build of Microsoft's ONNX Runtime that executes neural network models in the browser without a server.
Language detection: The automatic identification of the spoken language in an audio clip; Whisper performs this in its initial token prediction step.
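IndexedDB: A persistent client-side database that web apps can use to store large assets such as model files. Note that Transformers.js caches downloaded models through the browser's Cache API by default; either way, the Whisper model only downloads once per browser profile.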
