MP4 and H.264 Explained: How Modern Video Compression Works

Open any random video file from the last 15 years and there's a roughly 90% chance it's an MP4 with H.264 video inside. Phones record it. YouTube transcodes to it. Browsers play it without a plugin. Editors export it by default. That dominance isn't an accident — it's what happens when a "good enough" combo gets ratified by hardware decoders, baked into chips, and locked in by network effects. AV1 is technically better. H.265 is technically better. Neither has dethroned H.264 yet, and the reasons are worth understanding if you ever convert, trim, or compress video.

The thing most people miss is that "MP4" and "H.264" aren't the same kind of thing at all. One is a container. The other is a codec. Confusing them is like confusing a manila folder with the documents inside it.

MP4 Is a Container, Not a Format

MP4 (technically MPEG-4 Part 14, file extension .mp4) is a wrapper. It defines how to package multiple tracks — video, audio, subtitles, chapter markers, metadata — into a single file with timing information so a player can keep them in sync. It does not specify how the video pixels or audio samples are actually compressed. That's the codec's job.

A typical MP4 file looks roughly like this internally:

moov (metadata)
  trak: video, codec=avc1 (H.264), 1920x1080, 30fps
  trak: audio, codec=mp4a (AAC), 48kHz stereo
mdat (the actual compressed video + audio bytes)

The same structure as a Mermaid diagram:

flowchart TB
  ftyp["ftyp<br/>file type<br/>(must be first)"]
  moov["moov<br/>metadata index"]
  mvhd["mvhd<br/>duration, timescale"]
  vtrak["trak (video)<br/>avc1 / H.264"]
  atrak["trak (audio)<br/>mp4a / AAC"]
  vstbl["stbl<br/>sample tables"]
  astbl["stbl<br/>sample tables"]
  mdat["mdat<br/>compressed payload"]
  ftyp --> moov
  moov --> mvhd
  moov --> vtrak
  moov --> atrak
  vtrak --> vstbl
  atrak --> astbl
  moov -.points to.-> mdat

The moov atom is an index. The mdat atom holds the real payload. The same MP4 container can hold H.264 video + AAC audio, or H.265 + AAC, or AV1 + Opus. The container doesn't care, but the player does — if your browser doesn't support the codec inside, the file won't play even though the wrapper says .mp4.

This is why renaming video.mkv to video.mp4 does nothing useful. The container is wrong, the codec inside may also be unsupported, and most players will refuse the file. To actually change formats you need a video converter or a tool like FFmpeg that re-muxes (and possibly re-encodes) the streams.

Why H.264 Took Over

H.264, also called AVC (Advanced Video Coding), was finalized in 2003 by a joint group from ITU-T and ISO/IEC. By 2010 it was everywhere because of three converging forces:

  1. Hardware decoders. Apple put an H.264 decoder in the iPhone in 2007. Virtually every smartphone, smart TV, and graphics card built since has one. Hardware-accelerated decoding means a phone can play 1080p at 30fps without melting the battery.
  2. License clarity. The MPEG-LA patent pool charges royalties, but for end users and most distributors the costs are predictable and capped. That mattered for browser vendors deciding what to ship.
  3. Quality at the bitrate. H.264 needed roughly half the bitrate of its predecessor MPEG-2 for the same visual quality. That meant DVD-quality streams over broadband connections that previously couldn't handle them.

Once hardware decode is on a billion devices, replacing the codec is a 15-year project. H.265 (HEVC) shipped in 2013 and offers roughly 50% better compression, but its licensing is split across multiple patent pools and the terms scared off Mozilla and (initially) Google. AV1 launched in 2018 with no royalties — but encoding it is so much slower that real-time use cases (video calls, live streams, editing previews) still mostly fall back to H.264. See the H.264 Wikipedia article for the full standardization timeline.

How Video Compression Actually Works

Raw 1080p30 video runs about 1.5 gigabits per second. No phone, network, or storage budget could sustain that for any meaningful length of time. Codecs squeeze it down by 100-200x, relying on three big tricks.
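The arithmetic is quick to check. A sketch (the 1.5 Gbps figure assumes 8-bit RGB; encoders actually ingest 4:2:0 YUV at half that rate, which is where the 100-200x ratio comes from):

```python
# Raw bitrate = width x height x frames/sec x bits/pixel
w, h, fps = 1920, 1080, 30

rgb = w * h * fps * 24          # 8-bit RGB, no chroma subsampling
yuv = w * h * fps * 12          # 4:2:0 YUV, what encoders actually ingest
print(f"{rgb / 1e9:.2f} Gbps")  # 1.49 Gbps
print(f"{yuv / 1e9:.2f} Gbps")  # 0.75 Gbps

# Against a typical 5 Mbps H.264 1080p stream:
print(round(yuv / 5e6), "x compression")  # ~149x
```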

Spatial compression (within a single frame) works like JPEG: divide the frame into blocks, transform them with the discrete cosine transform, throw away high-frequency detail the eye won't miss, then entropy-code what's left.

Temporal compression (between frames) is the bigger win. Most of frame N+1 looks identical to frame N. Why send all those pixels twice? Instead, send only what changed — and even then, send "this block of pixels moved 12 pixels right" rather than the pixels themselves. This is called motion compensation.

Entropy coding (CABAC in H.264's High profile) squeezes the final bitstream by assigning shorter codes to more common patterns. It's where another ~10-20% of the savings comes from.

flowchart LR
  A[Raw frames<br/>1.5 Gbps] --> B[Block split<br/>16x16 macroblocks]
  B --> C[Motion estimation<br/>find moving blocks]
  C --> D[DCT transform<br/>frequency domain]
  D --> E[Quantization<br/>discard high freq]
  E --> F[Entropy coding<br/>CABAC / CAVLC]
  F --> G[Compressed stream<br/>~5 Mbps]
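The transform-and-quantize stage can be sketched in a few lines. This uses a floating-point 8-point DCT for illustration — H.264 actually uses a small integer transform, and the quantization steps below are invented, not the standard's tables:

```python
import math

N = 8
def scale(k):
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

def dct(x):
    """Orthonormal 8-point DCT-II: pixel row -> frequency coefficients."""
    return [scale(k) * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                           for n in range(N)) for k in range(N)]

def idct(X):
    """Inverse transform: coefficients -> pixel row."""
    return [sum(scale(k) * X[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for k in range(N)) for n in range(N)]

row = [52, 55, 61, 66, 70, 61, 64, 73]   # one row of luma samples
coeffs = dct(row)

# Quantize: coarser steps at higher frequencies, so fine detail rounds to zero.
steps = [4, 8, 12, 16, 24, 32, 48, 64]   # invented step sizes for illustration
quantized = [round(c / q) * q for c, q in zip(coeffs, steps)]

restored = idct(quantized)
# restored stays close to row; the error hides in high-frequency detail.
```

Most of the zeroed-out coefficients then cost almost nothing to entropy-code, which is where the size win comes from.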

The result is a stream of three different frame types working together.

I-Frames, P-Frames, and B-Frames

This is the part that confuses people the first time, but it's worth knowing because it explains why scrubbing a video sometimes feels sticky and why "fast" cuts in video editors aren't really cuts.

  • I-frame (Intra-coded). A complete picture, compressed only spatially. Think of it as a JPEG. You can decode it without reference to any other frame. Also called a keyframe.
  • P-frame (Predicted). Stores only the difference from the previous I- or P-frame. Tiny compared to an I-frame.
  • B-frame (Bi-directionally predicted). Stores differences from both a previous and a future frame. Usually the smallest of the three.

A typical encoder produces a pattern like:

I B B P B B P B B P B B I B B P ...

Every 1-10 seconds you get an I-frame. Between them, P- and B-frames carry the motion deltas. This is why seeking in a video isn't free — when you click the timeline at 0:47, the player has to find the most recent I-frame before that point and decode forward. If keyframes are sparse (every 10 seconds), seeking feels chunky.
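The seek logic can be sketched in a few lines (illustrative only — a real player reads keyframe positions from the stss table rather than assuming fixed spacing):

```python
# Simulate seeking in a stream with an I-frame every 2 seconds at 30 fps.
FPS = 30
GOP = 60  # frames between I-frames (2 s)

def seek(target_seconds):
    """Return (keyframe_index, frames_to_decode) to reach the target frame."""
    target = int(target_seconds * FPS)
    keyframe = (target // GOP) * GOP   # most recent I-frame at or before target
    return keyframe, target - keyframe

print(seek(47.0))  # → (1380, 30): decode 30 frames forward from the I-frame
```

Double the GOP to 10 seconds and a worst-case seek means decoding ~300 frames before a single pixel appears — that's the "sticky" feeling.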

It also explains why frame-accurate editing tools sometimes have to re-encode a clip rather than just snip the bytes. You can't cut between two B-frames cleanly — the cut would land in the middle of a prediction chain. Tools like Video Trimmer handle this by either snapping cuts to the nearest keyframe (fast, lossless, less precise) or re-encoding the affected GOP (group of pictures) for frame-perfect cuts.

If you want to dig into this, the ITU-T H.264 specification is the authoritative source — though "spec" undersells it; it's an 800-page reference.

Profiles, Levels, and Why Your Phone Sometimes Refuses a File

H.264 has profiles that gate which features the encoder is allowed to use. Common ones:

  • Baseline — no B-frames, no CABAC. Used by older video-calling systems because decoding is simpler.
  • Main — adds B-frames and CABAC. Standard for broadcast.
  • High — adds 8x8 transforms and other tricks. The default for modern recordings and most streaming.

Levels cap the maximum resolution, bitrate, and frame rate of the stream. Level 4.0 = 1080p30 at up to 25 Mbps. Level 5.1 = 4K30. Level 5.2 = 4K60.

When you hit "play" and a device says "unsupported format," what's almost always happening is: the container is fine, the codec is fine, but the profile or level exceeds what the device's hardware decoder is wired to handle. An older smart TV with an H.264 decoder maxed out at Level 4.0 will choke on a 4K60 stream even though it's "the same codec."

You can inspect this with ffprobe:

ffprobe -v error -show_streams input.mp4
# Look for: codec_name=h264, profile=High, level=42

Level 42 there means 4.2 — encoders write the level as a single integer where 4.2 → 42.
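Decoding that integer and comparing it against a device cap is trivial — a sketch, where the Level 4.0 limit stands in for a hypothetical older TV:

```python
def decode_level(level_int):
    """ffprobe reports the H.264 level as an integer: 42 means 4.2."""
    return level_int / 10

# Hypothetical device cap: an older TV whose decoder tops out at Level 4.0.
DEVICE_MAX_LEVEL = 4.0

for level in (40, 42, 51):
    verdict = "plays" if decode_level(level) <= DEVICE_MAX_LEVEL else "rejected"
    print(f"level {decode_level(level)}: {verdict}")
```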

The MP4 Atom Tree

Inside the container, MP4 organizes everything into nested atoms (also called boxes). Each atom has a 4-byte type code and a size. A simplified view:

ftyp  (file type identifier — must be first)
moov  (movie metadata)
  mvhd  (movie header: duration, timescale)
  trak  (one per track)
    tkhd  (track header)
    mdia
      minf
        stbl  (sample table — the index)
          stsd  (sample description, codec params)
          stts  (time-to-sample mapping)
          stss  (sync sample table — keyframe positions!)
          stco  (chunk offsets)
mdat  (the actual compressed payload)

When a player seeks to 0:47, it reads stts to find which sample contains that time, then stss to find the nearest preceding keyframe, then stco to find the byte offset in mdat. Without those tables, seeking would be impossible — you'd have to decode from the start every time.
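The top of that tree is simple enough to parse by hand: every box starts with a 4-byte big-endian size and a 4-byte type code. A minimal sketch over a synthetic file (real files nest boxes and use 64-bit sizes when the size field is 1, which this ignores):

```python
import struct

def top_level_atoms(data):
    """Yield (type, size) for each top-level box in an MP4 byte string."""
    offset = 0
    while offset + 8 <= len(data):
        size, kind = struct.unpack_from(">I4s", data, offset)
        yield kind.decode("ascii"), size
        offset += size  # skip the whole box, payload included

def atom(kind, payload):
    """Build one box: 4-byte size (header included), 4-byte type, payload."""
    return struct.pack(">I4s", 8 + len(payload), kind) + payload

# A toy file: ftyp, moov, mdat with dummy payloads.
toy = atom(b"ftyp", b"isom") + atom(b"moov", b"\x00" * 16) + atom(b"mdat", b"\xff" * 32)
print(list(top_level_atoms(toy)))
# → [('ftyp', 12), ('moov', 24), ('mdat', 40)]
```

Note the parser never looks inside a box to skip it — sizes alone are enough to walk the file, which is exactly what lets tools relocate moov without touching mdat.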

This is also why streaming MP4 historically required moov to come before mdat. If the index is at the end of the file (the FFmpeg default), a browser can't start playing until the whole file downloads. The fix is a flag — -movflags +faststart in FFmpeg — that rewrites the file so moov ships first. The MDN container reference covers this and the equivalents for WebM and other formats.

Practical Conversion: When to Re-Encode and When Not To

The most expensive operation you can do to video is re-encoding — fully decoding to raw frames and then re-compressing them. It's slow (1-10x real-time on a fast laptop) and it loses quality each generation.

The cheap operations are:

  • Re-muxing — extracting streams from one container and writing them to another. Stream copies happen at disk speed.
  • Bitstream filtering — small surgical edits to the compressed stream (like fixing aspect ratio metadata).

A good rule of thumb: if the output codec, profile, and resolution all match the input, you can probably stream-copy. If any of those change, you have to re-encode.

# Stream copy — fast, lossless
ffmpeg -i input.mov -c copy output.mp4

# Re-encode to a different codec — slow, lossy
ffmpeg -i input.mp4 -c:v libx265 -crf 28 output.mp4

The FFmpeg documentation has the full reference if you want to go deep. Most of our video tools try to stream-copy when they can and only re-encode when the output requires it.

Practical Takeaway and Tools That Show This in Action

If you remember nothing else: the file extension .mp4 tells you almost nothing. The codec inside is what determines whether it plays, how big it is, and whether it can be edited cheaply. When something doesn't work, run ffprobe on it, look at the codec, profile, and level, and 90% of the time the answer is right there. H.264 in High profile at Level 4.0 plays everywhere; almost anything else is going to surprise someone, somewhere.

The split between container and codec gets less abstract when you actually inspect a file. A few practical things to try:

  • Use video frame extractor to pull individual frames — every frame you can extract directly is sitting on or near an I-frame in the underlying stream.
  • A video to GIF converter has to fully decode H.264 to raw frames before re-encoding to GIF's weaker palette-based format. The size jump from MP4 to GIF for the same clip is a direct demonstration of how much modern codecs save.
  • Video to MP3 is pure track extraction — pull the AAC audio track out of the MP4 and re-encode (or re-mux) to MP3. The video track gets discarded entirely.
  • Video speed changer has to rewrite the sample timing in the stts atom while keeping the underlying compressed stream valid.

For deeper backend work — like understanding why a specific file behaves oddly across devices — pair these with the JSON Formatter when reading ffprobe -of json output. For adjacent reading on related infrastructure, REST API Design Best Practices covers similar tradeoffs in API land.

FAQ

Why does my MP4 play on Chrome but not on iPhone?

Almost always an H.264 profile or pixel format issue. iOS Safari requires yuv420p pixel format and even-numbered dimensions, and only supports H.264 up to High profile Level 4.2 (1080p60). MKV → MP4 conversions sometimes leave wrong pixel formats; the fix is ffmpeg -i input.mp4 -pix_fmt yuv420p -vf "scale=trunc(iw/2)*2:trunc(ih/2)*2" output.mp4. If the file plays on macOS Safari but fails on iOS, check the level — older iPhones cap at 4.0.

Should I switch to H.265 or AV1 in 2026?

For new content distribution, AV1 is increasingly viable — Chrome, Firefox, Edge all support it, and Safari added support in 17.4 (2024). The catch is encoding cost: AV1 is 5–10× slower than H.264. For pre-encoded content (videos delivered repeatedly), AV1 is worth the encode cost. For real-time use cases (video calls, livestreams, editing previews), H.264 still wins because it's what's hardware-accelerated everywhere.

What's the difference between MP4 and MKV?

MP4 (MPEG-4 Part 14) is the standard streaming-friendly container — supported natively by every browser and most devices, with patent-pool licensing. MKV (Matroska) is open-source and supports more codecs, more subtitle formats, and more flexible chapter metadata. MKV is favored by tech-enthusiast content (open-source rips, lossless audio); MP4 dominates the public web.

Why does scrubbing through a video sometimes feel sticky?

Because the player has to find the most recent I-frame before your seek point, then decode forward to your target frame. If keyframes are sparse (every 10 seconds), seeking can require decoding several seconds of video. Faster seeking comes from smaller GOP sizes (more I-frames, but also bigger files). YouTube and Netflix tune for fast seeking with relatively short GOPs; archive copies often have longer GOPs to save space.

Can I cut a video without re-encoding?

Yes, but only at I-frame boundaries. Tools like FFmpeg with -c copy do "stream copy" cuts that snap to the nearest keyframe — fast and lossless, but the cut might be a second or two off your target. For frame-accurate cuts, the affected GOP must be re-encoded; most editors do this transparently. The cost: re-encoded segments lose some quality, and the operation takes 10–60× longer than stream copy.

What's `+faststart` and why does it matter for streaming?

By default, FFmpeg writes the MP4 index (moov atom) at the end of the file, which means a browser can't start playing until the entire file downloads. The -movflags +faststart flag rewrites the file with moov at the front, so playback can begin while the rest is still streaming. Always use it for files served on the web; it's non-negotiable for autoplay video and progressive download.

How do I figure out a file's actual codec without playing it?

Use ffprobe input.mp4 (part of FFmpeg) to dump the codec, profile, level, resolution, framerate, and bitrate. The output tells you codec_name=h264, profile=High, level=42 (where 42 = 4.2). For a JSON-formatted version that's easier to parse programmatically, add -of json. MediaInfo is a more user-friendly alternative if you prefer a GUI.
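Pulling the relevant fields out of the JSON form takes a few lines. A sketch — the field names match ffprobe's -of json output, but the values in the sample blob are invented for illustration:

```python
import json

# Example shape of: ffprobe -v error -show_streams -of json input.mp4
# (values invented for illustration)
sample = """
{"streams": [
  {"codec_type": "video", "codec_name": "h264",
   "profile": "High", "level": 42, "width": 1920, "height": 1080},
  {"codec_type": "audio", "codec_name": "aac", "sample_rate": "48000"}
]}
"""

info = json.loads(sample)
video = next(s for s in info["streams"] if s["codec_type"] == "video")
print(video["codec_name"], video["profile"], video["level"] / 10)
# → h264 High 4.2
```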

What's the difference between bitrate and quality?

Bitrate (bits per second) is how much data the encoder allocates to the video. Higher bitrate generally means better quality, but it depends on the codec and the content — H.264 at 5 Mbps looks similar to H.265 at 2.5 Mbps for the same scene. Modern encoders use CRF (Constant Rate Factor) instead of fixed bitrate: you specify a quality target and the encoder allocates bits where they're needed. CRF 23 is a good default for H.264, CRF 28 for H.265.
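Bitrate translates directly to file size, which makes for a quick sanity check (assuming constant bitrate and ignoring container overhead):

```python
def size_mb(bitrate_mbps, seconds):
    """Approximate file size in MB for a constant-bitrate stream."""
    return bitrate_mbps * 1e6 * seconds / 8 / 1e6  # bits -> bytes -> MB

# A 10-minute clip at typical H.264 vs H.265 bitrates for similar quality:
print(size_mb(5.0, 600))  # H.264 @ 5 Mbps   → 375.0 MB
print(size_mb(2.5, 600))  # H.265 @ 2.5 Mbps → 187.5 MB
```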