How BPM Detection Algorithms Find the Beat

Drop a song into a BPM detector and a few seconds later it spits out a number — 128, 174, 92. It feels like magic, but it isn't. The computer doesn't actually hear the music; it sees a long list of amplitude samples, usually 44,100 of them per second, and somehow has to work out which of them line up with the kick drum. The gap between "raw waveform" and "the tempo is 128 BPM" is bridged by a small stack of signal-processing tricks.

This post walks through that stack from the bottom up: what's in the audio file, how onsets get detected, how autocorrelation finds repetition, why every algorithm gets octave errors, and where neural beat trackers fit on top. To follow along with a real file, drop one into the BPM Detector while you read.

flowchart LR
  A[Raw waveform<br/>44.1kHz samples] --> B[STFT<br/>spectrogram]
  B --> C[Onset detection<br/>spectral flux]
  C --> D[Onset envelope<br/>1D function]
  D --> E[Autocorrelation<br/>find period]
  E --> F[Tempo prior<br/>60-200 BPM]
  F --> G[BPM estimate]
  G --> H[Beat tracking<br/>dynamic programming]
  H --> I[Per-beat timestamps]

What's Actually in an Audio File

Strip away the container format (MP3, WAV, OGG) and what you have is a sequence of numbers. At 44.1 kHz mono, that's 44,100 floating-point samples per second, each one the air pressure at a single instant. A two-minute song is 5.3 million samples. Stereo doubles that.
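
If you want to poke at those numbers yourself, a minimal sketch in Python looks like this. The soundfile package and the "song.wav" filename are assumptions for illustration, not part of any particular tool.

# minimal sketch: load a file into a mono float array
# assumes the `soundfile` package is installed and "song.wav" exists
import soundfile as sf

samples, sample_rate = sf.read("song.wav")   # float samples, e.g. 44,100 per second
if samples.ndim == 2:                        # stereo -> average the two channels to mono
    samples = samples.mean(axis=1)
print(f"{len(samples)} samples at {sample_rate} Hz")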

These numbers carry no labels. There's no metadata flag that says "kick drum at sample 22,050." The song's tempo is encoded only as a pattern of repeated energy bursts spread across that long array.

The first useful transformation is from the time domain to the frequency domain. The Short-Time Fourier Transform (STFT) chops the signal into overlapping windows of, say, 2,048 samples each, and runs an FFT on every window. The output is a spectrogram: a 2D matrix where one axis is time, the other is frequency, and each cell holds the energy at that point. This is the substrate most BPM algorithms actually work on, not the raw waveform.
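
A bare-bones version of that transform, continuing from the samples loaded above; the 2,048-sample window and 512-sample hop are just illustrative defaults, not anything canonical.

# minimal STFT: overlapping windows, FFT per window, keep the magnitudes
import numpy as np

def stft_magnitude(samples, window_size=2048, hop=512):
    window = np.hanning(window_size)
    frames = []
    for start in range(0, len(samples) - window_size, hop):
        frame = samples[start : start + window_size] * window
        frames.append(np.abs(np.fft.rfft(frame)))   # magnitude spectrum of one window
    return np.array(frames)                         # shape: (num_frames, window_size // 2 + 1)

spectrogram = stft_magnitude(samples)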

Onset Detection: Finding the Hits

Tempo lives in the attacks — the moments when something hits hard, usually a drum. So before you can find the tempo, you need to know where the hits are. That's onset detection.

The simplest approach is an energy envelope: at each frame of the spectrogram, sum the energy across some frequency band, then look for moments when that sum jumps sharply.

# rough energy envelope
# assumes `samples` (mono floats) and `sample_rate` (e.g. 44100) from above
frame_size = 1024
hop = 512

envelope = []
for t in range(0, len(samples) - frame_size, hop):
    frame = samples[t : t + frame_size]
    energy = sum(s * s for s in frame)
    envelope.append(energy)

# onsets = points where the envelope rises faster than a threshold
# (crude choice: a jump bigger than the average frame energy)
threshold = sum(envelope) / len(envelope)
onsets = []
for i in range(1, len(envelope)):
    delta = envelope[i] - envelope[i - 1]
    if delta > threshold:
        onsets.append(i * hop / sample_rate)  # onset time in seconds

That works for a clean drum loop and falls apart on real music. Bass guitar, vocal sustains, and reverb all add energy that isn't an onset. Worse, windowing smears a single drum hit across several frames, so the energy rises gradually and a fixed threshold often triggers more than once.

Better detectors work in the frequency domain. Spectral flux measures how much the spectrum changes between consecutive frames, summing only the positive changes. A snare drum spike shows up as a wide-band increase across many frequency bins simultaneously, which spectral flux captures cleanly. Sustained notes barely register because their spectrum doesn't change much frame to frame.
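
A sketch of spectral flux over the spectrogram from the STFT example above (variable names carried over from that sketch):

# spectral flux: sum of positive frame-to-frame changes in the magnitude spectrum
import numpy as np

def spectral_flux(spectrogram):
    diff = np.diff(spectrogram, axis=0)    # change in each frequency bin between frames
    diff = np.maximum(diff, 0.0)           # keep only increases (new energy)
    return diff.sum(axis=1)                # one flux value per frame

onset_env = spectral_flux(spectrogram)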

Even better: complex-domain onset detection also looks at phase changes, not just magnitude. A new note typically resets the phase of its harmonics, which is a strong onset cue even when the loudness barely changes. Spectral flux over a mel spectrogram is what librosa's onset.onset_detect builds on by default; complex-domain detectors show up in libraries like madmom.

The output of this stage is an onset detection function — a 1D signal, one value per spectrogram frame, where peaks correspond to candidate beats.

Figure: onset detection function for a 120 BPM track (one beat = 0.5 s). Peaks line up with kick drum hits; the peak period of 0.5 s corresponds to 120 BPM.

Autocorrelation: Finding the Repetition

You now have roughly 10,000 onset-strength values for a two-minute song (one per frame, at about 86 frames per second), most of them small, some of them large. The next question is: at what interval do the big peaks repeat?

Autocorrelation is the standard tool. It compares a signal to a delayed copy of itself, sliding the copy along and measuring how well it lines up at every possible lag. Where the signal is genuinely periodic, you get a strong peak in the autocorrelation at the lag corresponding to the period.

# autocorrelation of an onset detection function
def autocorrelate(signal, max_lag):
    n = len(signal)
    result = []
    for lag in range(max_lag):
        s = 0
        for i in range(n - lag):
            s += signal[i] * signal[i + lag]
        result.append(s)
    return result

# convert lag to BPM
# sample_rate of onset function = audio_rate / hop_size
# e.g. 44100 / 512 ≈ 86 frames/sec
def lag_to_bpm(lag_frames, frames_per_second):
    seconds_per_beat = lag_frames / frames_per_second
    return 60.0 / seconds_per_beat

Run autocorrelation on the onset detection function, restrict the lag range to plausible tempos (say, 60-200 BPM), and look for the highest peak. The lag of that peak, converted back to seconds, is your beat period. Divide 60 by it and you have BPM.
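
Putting that together with the sketches above (onset_env, autocorrelate, and lag_to_bpm are the names used earlier), the peak picking looks roughly like this:

# restrict the lag search to 60-200 BPM and take the strongest peak
frames_per_second = 44100 / 512                         # ~86 onset frames per second
acf = autocorrelate(onset_env, max_lag=int(frames_per_second) + 1)

min_lag = int(frames_per_second * 60 / 200)             # shortest plausible beat period (200 BPM)
max_lag = int(frames_per_second * 60 / 60)              # longest plausible beat period (60 BPM)
best_lag = max(range(min_lag, max_lag), key=lambda lag: acf[lag])
bpm = lag_to_bpm(best_lag, frames_per_second)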

In practice nobody implements the O(n²) version above. Real code uses the Wiener-Khinchin theorem: the autocorrelation of a signal is the inverse FFT of its power spectrum. That brings the cost down to O(n log n).
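
A sketch of that FFT route with NumPy; the zero-padding keeps the circular FFT from wrapping the signal onto itself:

# autocorrelation via the Wiener-Khinchin theorem: inverse FFT of the power spectrum
import numpy as np

def autocorrelate_fft(signal):
    n = len(signal)
    size = 1 << (2 * n - 1).bit_length()        # pad to a power of two above 2n - 1
    spectrum = np.fft.rfft(signal, size)
    power = spectrum * np.conj(spectrum)        # power spectrum
    return np.fft.irfft(power)[:n]              # first n lags of the autocorrelation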

Figure: autocorrelation of the onset envelope. Peaks appear at the true period (0.5 s), at the half-beat lag (0.25 s, the source of octave errors), and at the 2x harmonic (1.0 s).

The Octave Problem (and Why Detectors Disagree)

Here's the catch that bites every tempo detector: autocorrelation can't tell the difference between a beat and half a beat. If a song has a strong pulse every 0.5 seconds, the autocorrelation will show peaks at 0.5s, 1.0s, 1.5s, 2.0s — every integer multiple. They might be smaller, but they're there.

Worse, on songs with a strong off-beat (think disco hi-hats), the peak at half the true period is sometimes larger than the peak at the true period. This is the octave error: your detector confidently reports 174 BPM for an 87 BPM song, or 70 BPM for a 140 BPM dance track.

There's no clean mathematical fix because there's no objectively correct answer. Is "Smells Like Teen Spirit" 116 or 232 BPM? Depends on whether you count snare hits or hi-hats. Detectors handle this with tempo priors — a probability distribution over likely tempos that biases the algorithm toward, say, 100-140 BPM where most music actually lives. A strong peak at 87 BPM will beat a stronger peak at 174 BPM if the prior says 174 is implausibly fast.
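
A common shape for such a prior is a log-Gaussian centered on a typical tempo, similar in spirit to what librosa uses; the 120 BPM center and one-octave width below are illustrative choices, not any library's guaranteed defaults.

# log-Gaussian tempo prior: weight each candidate BPM before picking the winner
import numpy as np

def tempo_prior(bpm, center=120.0, spread_octaves=1.0):
    return np.exp(-0.5 * (np.log2(bpm / center) / spread_octaves) ** 2)

# score = autocorrelation peak height * prior weight
# 174 BPM gets a prior of ~0.87, 87 BPM gets ~0.90, so a near-tie breaks toward 87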

The other defense is multi-pass: pick a tempo candidate, then snap a grid of beats onto the song using dynamic programming and check how well the grid lines up with the actual onsets. If a 174 BPM grid leaves half its beats on silence, the detector halves it.
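
A rough version of that check, reusing onset_env and frames_per_second from earlier: snap an evenly spaced grid at every possible phase and keep the best alignment score.

# score a candidate tempo by how much onset energy its best beat grid lands on
import numpy as np

def grid_score(onset_env, bpm, frames_per_second):
    period = 60.0 / bpm * frames_per_second          # beat period in frames
    best = 0.0
    for phase in range(int(period)):                 # try every starting offset
        beats = np.arange(phase, len(onset_env), period).astype(int)
        best = max(best, float(onset_env[beats].mean()))
    return best

# if grid_score(onset_env, 174, frames_per_second) is much worse than at 87, halve the estimate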

Beat Tracking vs Tempo Estimation

Two related but different problems often get confused:

  • Tempo estimation answers "what's the BPM?" — a single number for the whole song.
  • Beat tracking answers "where exactly does each beat fall?" — a list of timestamps.

Tempo estimation is autocorrelation plus a prior. Beat tracking is harder because it has to commit to a phase: not just "beats are 0.5s apart" but "beats fall at 0.21s, 0.71s, 1.21s..." Phase matters for anything downstream that needs to do something on the beat — auto-cutting a video to music, syncing visual effects, generating drum fills.

The standard beat tracking algorithm (Ellis 2007, the foundation of librosa's beat tracker) uses dynamic programming to find a sequence of beat times that maximizes a score combining (a) closeness to the onset detection function and (b) regularity at the estimated tempo. It underpins a huge fraction of music-tech tools shipping today.
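
The core recursion is compact enough to sketch. This is written in the spirit of that paper rather than lifted from librosa; the tightness value and the predecessor window are illustrative, and onset_env plus a beat period in frames (best_lag from earlier) are assumed inputs.

# dynamic-programming beat tracking: each frame's score = its onset strength
# plus the best previous beat's score, minus a penalty for irregular spacing
import numpy as np

def track_beats(onset_env, period, tightness=100.0):
    n = len(onset_env)
    score = onset_env.astype(float).copy()
    backlink = np.full(n, -1)
    for t in range(n):
        lo, hi = t - int(2 * period), t - int(period / 2)
        if hi <= 0:
            continue                                   # too early to have a predecessor
        prev = np.arange(max(lo, 0), hi)
        penalty = -tightness * np.log((t - prev) / period) ** 2
        best = int(np.argmax(score[prev] + penalty))
        score[t] += score[prev][best] + penalty[best]
        backlink[t] = prev[best]
    beats = [int(np.argmax(score))]                    # backtrace from the best-scoring frame
    while backlink[beats[-1]] >= 0:
        beats.append(int(backlink[beats[-1]]))
    return np.array(beats[::-1])                       # beat positions in frames

# beat_times = track_beats(onset_env, best_lag) / frames_per_second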

Where Neural Networks Come In

Classic algorithms — onset detection plus autocorrelation plus DP beat tracking — top out around 75-85% accuracy on diverse benchmarks like the GTZAN dataset. They struggle with rubato, swung rhythms, classical music with no drums, and anything where the beat is implied rather than struck.

Modern systems (madmom, BeatNet, Beat Transformer) replace the hand-engineered onset detection function with a neural network. A CNN reads the spectrogram and outputs a beat-likelihood curve directly; an RNN or HMM on top does the temporal smoothing.
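
With madmom, for instance, the whole pipeline collapses to a couple of calls. The class names below match recent madmom releases, but treat the exact API as something to check against your installed version.

# neural beat tracking with madmom: an RNN produces a beat-activation curve,
# then a DBN decodes it into per-beat timestamps
from madmom.features.beats import RNNBeatProcessor, DBNBeatTrackingProcessor

activations = RNNBeatProcessor()("song.wav")               # beat likelihood per frame
beats = DBNBeatTrackingProcessor(fps=100)(activations)     # beat times in seconds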

Two changes drive the accuracy improvement. The network learns onset detectors specialized to musical events rather than generic energy bursts — it can ignore reverb tails and pick up implied beats in piano pieces. And the temporal model learns realistic tempo distributions and groove patterns from training data instead of relying on a hand-tuned prior.

State-of-the-art trackers now hit 90%+ on standard benchmarks. The trade-off is model size (tens of MBs) and inference cost — fine for offline analysis, expensive for real-time browser use. Browser-based BPM tools mostly still use classical methods because they fit in a few KB of JavaScript and run in milliseconds.

Doing It in the Browser

If you want to play with this yourself, the Web Audio API exposes everything you need. AudioContext.decodeAudioData() turns an MP3 or WAV into a Float32Array of samples; AnalyserNode gives you frequency-domain data with a real FFT under the hood; from there it's a few hundred lines to onset detection plus autocorrelation.

A practical browser pipeline:

  1. Decode the file with decodeAudioData.
  2. Downmix to mono and downsample to 11.025 kHz — tempo detection doesn't need full bandwidth, and lower sample rates make autocorrelation 4× faster.
  3. Compute a low-resolution spectrogram (say, 512-sample windows, 256-sample hops).
  4. Build an onset detection function via spectral flux.
  5. Autocorrelate, restrict to 60-200 BPM, pick the highest peak after applying a tempo prior.
  6. Optionally run beat tracking to get phase, not just period.

For the heavier lifting — extracting audio from a video first, or trimming the file to a known section before analysis — pair the BPM Detector with Video to MP3 or Audio Trim & MP3 Cutter. For workflows where you want lyric or vocal timing alongside the beat grid, Speech to Text (Whisper) gets you transcripts you can align manually.

What to Expect From a BPM Detector

A few practical takeaways if you're using one of these tools rather than building one:

  • Trust whole numbers more than fractions. "128 BPM" is an algorithm reporting a confident integer; "127.4 BPM" usually means the detector hedged. Real recordings drift by a few percent anyway.
  • Test on a known track first. Pick a song you know is 120 BPM, run it through the detector. If it returns 60 or 240, the tool is octave-erroring on your kind of music — try a different one or halve/double manually.
  • Short clips are harder than long ones. Autocorrelation needs a few bars of audio to lock on. A 10-second snippet will be noisier than a 60-second one.
  • Live recordings drift. A drummer rushing or pulling will give you an average BPM that hides what's actually happening. Beat tracking (per-beat times) tells the real story.

If you want to dig deeper, Wikipedia's Beat detection article covers the algorithmic taxonomy, the librosa documentation is the canonical reference for working code, and the ISMIR proceedings are where most of the modern beat tracking research is published.

The number a BPM detector gives you is the end of a long pipeline: samples become a spectrogram, the spectrogram becomes onsets, the onsets become an autocorrelation curve, the curve becomes a tempo guess, the guess gets sanity-checked against a prior. Every stage can fail in interesting ways. Knowing where they fail makes the difference between trusting the output and second-guessing it.

FAQ

Why does the same song give two different BPM values on different apps?

Almost always the octave error: one detector locked onto the snare and reported 116 BPM, another locked onto the hi-hat at double the rate and reported 232 BPM. Both are mathematically defensible — there's no objectively correct answer when the music has multiple periodicities. Apps with stronger tempo priors bias toward 100–140 BPM where most popular music sits, so they're more likely to halve aggressive readings.

How short can a clip be before BPM detection becomes unreliable?

Below about 8 seconds, autocorrelation doesn't have enough repetitions to lock on confidently. A 60-second clip gives you 120 beats at 120 BPM, which is plenty. A 10-second clip gives you 20, which is borderline. Below 5 seconds you're often guessing — try to feed the detector at least one full bar of obvious downbeats.

Why does my song's tempo seem to drift across the track?

Real recordings drift by 1–3% naturally — drummers rush, producers nudge tempos manually, live recordings breathe. Modern beat trackers (madmom, BeatNet) handle this by emitting per-beat timestamps rather than a single BPM number; the average tells you the macro tempo while the per-beat positions reveal the microtiming. If your tool only spits out one number, it's averaging away that information.

Is neural network BPM detection always better than classical algorithms?

For diverse content yes — neural beat trackers hit 90%+ on benchmark datasets versus 75–85% for autocorrelation-based methods. The catch is cost: a CNN-based tracker like Beat Transformer is tens of MB and takes seconds to run, while a classical detector fits in 5 KB of JavaScript and finishes in milliseconds. For a browser tool processing one song at a time, classical methods still hit the right speed/quality trade-off in 2026.

Why does BPM detection fail on classical music or jazz?

Both genres often have implied beats rather than struck ones — a piano sonata has no kick drum, and swing rhythms place beats on the off-beat in ways autocorrelation doesn't expect. Classical autocorrelation looks for the strongest periodic energy burst; if there isn't one, the algorithm grabs whatever's most regular, which is often wrong. Neural trackers trained on diverse music handle this much better because they learn what "implied beat" looks like.

Can I detect BPM directly from a video file in the browser?

Not without first extracting the audio track. The Web Audio API only operates on decoded audio buffers, so you need to either run the BPM detector on a separate audio file or extract the audio in-browser using ffmpeg.wasm. The cleanest pipeline is video → MP3 (browser-side) → BPM detection — both stages run client-side without uploads.

What's the lowest sample rate I can use without hurting accuracy?

For tempo detection specifically, 11.025 kHz is plenty — tempo lives in low-frequency energy bursts well below the Nyquist limit at that rate. Downsampling from 44.1 kHz to 11.025 kHz makes autocorrelation roughly 4× faster with no measurable accuracy loss. The same trick doesn't apply to pitch detection or timbre analysis, where higher frequencies actually matter.

Why does my detector report 87 BPM when the song is clearly 174?

This is the classic octave error in reverse — the detector picked the bar-level periodicity (every 2 beats at 174 BPM = one cycle every 0.69s, which corresponds to 87 BPM) instead of the beat-level. The fix is either a stronger tempo prior centered around 120 BPM or a multi-pass algorithm that snaps a beat grid onto candidate tempos and picks the one that best aligns with onsets. For manual correction, just double the reported value.