Take a 256×256 thumbnail of a face, blow it up to 1024×1024 in Photoshop with bicubic interpolation, and you get a soft, blurry blob — recognizable, maybe, but the pores, hair strands, and eyelashes are gone. Run the same thumbnail through Real-ESRGAN, and the eyelashes come back. Sharp ones. Plausible ones. Ones that weren't actually in the source image. That is the whole story of AI upscaling in one paragraph: traditional algorithms blend, AI models invent. The interesting question is when each one is the right answer.
The Bicubic Floor
Classical upscaling algorithms — nearest neighbor, bilinear, bicubic, Lanczos — all work the same way at a conceptual level: the output pixel grid is denser than the input grid, so for each new pixel you need a value that doesn't exist yet. You compute it by sampling neighbors.
Bicubic interpolation samples a 4×4 neighborhood around the target location and fits a cubic polynomial through those values:
new_pixel = Σ_{i=-1..2} Σ_{j=-1..2} source(x+i, y+j) × kernel(i) × kernel(j)
kernel(t) ≈ smooth cubic curve, peak at t=0
The math is closed-form, deterministic, and fast. It is also fundamentally limited. The algorithm has no concept of "this is an eye" or "this region is hair." It only sees a grid of numbers and produces a smoothed average. Edges that were sharp at low resolution stay smeared at high resolution because the kernel doesn't know they were edges. There is no way to recover detail that wasn't in the source — bicubic is a band-limited filter operating on a band-limited signal.
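To make the kernel concrete, here's a minimal 1D sketch using the standard cubic convolution kernel (Keys, a = -0.5, the common default); the 2D version above just applies this along both axes. Function names are illustrative:

```javascript
// Cubic convolution kernel (a = -0.5): peaks at 1 when t = 0,
// falls to 0 at |t| >= 2. This is the "smooth cubic curve" above.
function cubicKernel(t, a = -0.5) {
  const x = Math.abs(t);
  if (x < 1) return (a + 2) * x ** 3 - (a + 3) * x ** 2 + 1;
  if (x < 2) return a * x ** 3 - 5 * a * x ** 2 + 8 * a * x - 4 * a;
  return 0;
}

// Interpolate a value at fractional position `pos` in a 1D signal
// by weighting the 4 nearest samples with the kernel.
function bicubic1D(signal, pos) {
  const base = Math.floor(pos);
  let sum = 0;
  for (let i = -1; i <= 2; i++) {
    const idx = Math.min(Math.max(base + i, 0), signal.length - 1); // clamp edges
    sum += signal[idx] * cubicKernel(pos - (base + i));
  }
  return sum;
}

console.log(bicubic1D([0, 10, 20, 30], 1.5)); // 15: midway on a linear ramp
```

The four kernel weights always sum to 1, so the result is a weighted average of neighbors, which is exactly why bicubic can never produce detail sharper than the source.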
For a deeper comparison of these classical methods, see How Image Resizing Algorithms Work, which walks through nearest neighbor, bilinear, bicubic, and Lanczos with visual examples.
What Super-Resolution Actually Means
The academic term for AI upscaling is single-image super-resolution (SISR). The goal is to take a low-resolution input and produce a high-resolution output that matches what a real high-res photo would look like — not just a smoothed version.
Mathematically, you're solving an inverse problem. A low-res image is the result of:
LR = downsample(HR) + noise + compression_artifacts
There are infinitely many high-res images that could have produced the same low-res input. Bicubic picks the smoothest one. Super-resolution models pick the most plausible one based on what high-res images usually look like. That distinction — most plausible vs. mathematically smoothest — is where AI starts winning.
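A toy illustration of that ill-posedness, assuming simple 2× average-pooling as the downsample:

```javascript
// 2x downsample by averaging adjacent pixel pairs.
const downsample = (px) =>
  Array.from({ length: px.length / 2 }, (_, i) => (px[2 * i] + px[2 * i + 1]) / 2);

// Two very different high-res signals: a gentle texture and a hard edge.
const gentle = [10, 20, 10, 20];
const edge   = [0, 30, 0, 30];

console.log(downsample(gentle)); // [15, 15]
console.log(downsample(edge));   // [15, 15] -- identical low-res result
```

Given `[15, 15]`, bicubic will always reconstruct something flat and smooth; a super-resolution model instead guesses which of the many possible originals is most likely.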
How a CNN Learns to Upscale
The first deep-learning approach that beat bicubic was SRCNN (2014). It is structurally tiny — three convolutional layers — but the training procedure is what mattered:
- Take a million high-resolution images.
- Downsample each one to create a (LR, HR) pair.
- Feed LR into the network, compare output to HR, compute loss, backpropagate.
- Repeat for weeks on a GPU.
After training, the network has learned a mapping from low-res patches to high-res patches based on patterns in real photos. When you give it a blurry 32×32 patch that looks like a textured cheek, it produces a 128×128 patch that looks like skin pores — because in training, that's what real cheeks look like at 4× resolution.
The key insight: the network is not retrieving training images. It learned a function — given any new low-res patch, produce a plausible high-res one. The output is novel, but it's drawn from the distribution of real images.
Pixel-loss CNNs like SRCNN and EDSR use mean squared error (MSE) between the predicted HR and the ground truth. MSE is mathematically clean but has a known weakness: it averages plausible options. If a low-res patch could be either "sharp grass" or "sharp gravel," MSE training pushes the network toward something in between: soft, blurry texture. Better than bicubic, but still soft.
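You can see the averaging effect numerically. Suppose a low-res patch is equally likely to come from two high-contrast textures (toy values, assumed for illustration); the output that minimizes expected MSE is their flat midpoint, not either real texture:

```javascript
// Two equally plausible high-res explanations of one low-res patch.
const grass  = [0.9, 0.1, 0.9, 0.1]; // high-contrast texture A
const gravel = [0.1, 0.9, 0.1, 0.9]; // high-contrast texture B

const mse = (a, b) => a.reduce((s, v, i) => s + (v - b[i]) ** 2, 0) / a.length;

// Expected MSE when the truth is grass or gravel with probability 0.5 each.
const expectedMse = (guess) => 0.5 * mse(guess, grass) + 0.5 * mse(guess, gravel);

const average = grass.map((v, i) => (v + gravel[i]) / 2); // [0.5, 0.5, 0.5, 0.5]

console.log(expectedMse(grass));   // 0.32 -- committing to either texture is penalized
console.log(expectedMse(average)); // 0.16 -- flat gray wins under MSE
```

Under this loss, the "optimal" prediction is literally a flat gray patch, which is why pixel-loss models look soft.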
Why GANs Changed the Game
ESRGAN (Enhanced Super-Resolution GAN, 2018) replaced pixel loss with a generative adversarial setup. Two networks train simultaneously:
- Generator: takes LR, produces HR.
- Discriminator: tries to tell generator output from real high-res photos.
generator_loss = pixel_loss + perceptual_loss + adversarial_loss
The generator now has three objectives: be close to ground truth (pixel loss), match high-level features (perceptual loss, computed in a pretrained VGG network's feature space), and fool the discriminator (adversarial loss). That last term is the magic.
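The combination is typically a weighted sum. A minimal sketch; the weights below are assumptions in the spirit of the ESRGAN paper (perceptual loss dominant, small pixel and adversarial terms), not values from any specific implementation:

```javascript
// Illustrative weighted combination of the three generator objectives.
// ETA and LAMBDA are assumed example weights, not canonical constants.
function generatorLoss({ pixelLoss, perceptualLoss, adversarialLoss }) {
  const ETA = 1e-2;    // weight on pixel loss
  const LAMBDA = 5e-3; // weight on adversarial loss
  return perceptualLoss + ETA * pixelLoss + LAMBDA * adversarialLoss;
}

// The adversarial term is numerically tiny but decisive: it is the only
// term that punishes "safe" blurry averages.
console.log(generatorLoss({ pixelLoss: 0.4, perceptualLoss: 1.2, adversarialLoss: 0.7 }));
```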
flowchart LR
LR[Low-res input] --> G[Generator<br/>encoder-decoder CNN]
G --> SR[Super-res output]
HR[Real high-res] --> D{Discriminator<br/>real or fake?}
SR --> D
D -- "real" --> RR[Generator wins]
D -- "fake" --> RF[Generator loses]
RF -.adversarial loss.-> G
HR -.pixel + perceptual loss.-> G
To fool the discriminator, the generator can no longer produce safe blurry averages — those look obviously fake. It has to commit to sharp, specific, plausible details.
The result is texture that looks real even when it isn't faithful to the original. Skin gets pores, fabric gets weave, leaves get veins. In blind comparisons, viewers often fail to tell ESRGAN output from real photos. That is also exactly its danger, which we'll get to.
Real-ESRGAN: From Lab to Real World
The original ESRGAN was trained on clean LR/HR pairs where the only degradation was bicubic downsampling. In practice, real low-res images have JPEG artifacts, sensor noise, motion blur, and sharpening halos baked in. Feed an old JPEG to ESRGAN and it amplifies the compression blocks into hallucinated, fractal-like texture.
Real-ESRGAN (2021) fixed this by training on synthetically degraded inputs:
LR = degrade(HR, [
random_blur,
random_resize,
random_noise,
jpeg_compression,
second_pass_jpeg_compression
])
The network learns to be robust against the kinds of artifacts that exist in actual user-uploaded images, not just clean academic test sets. This is the model behind most consumer-facing AI upscalers today, including the one inside UtilityKit's AI image upscaler.
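A pure-JavaScript sketch of the same idea on a 1D grayscale signal. Each stage is a crude stand-in for the real image-space operation (blur kernel, resampler, noise injection, JPEG encoder), and every function here is illustrative rather than Real-ESRGAN's actual pipeline:

```javascript
// Toy degradation pipeline mirroring the shape of Real-ESRGAN's recipe.
const blur = (px) =>
  px.map((_, i) => {
    const l = px[Math.max(i - 1, 0)], r = px[Math.min(i + 1, px.length - 1)];
    return (l + 2 * px[i] + r) / 4; // 3-tap blur, stand-in for random_blur
  });

const downsample2x = (px) => // stand-in for random_resize
  Array.from({ length: px.length / 2 }, (_, i) => (px[2 * i] + px[2 * i + 1]) / 2);

const addNoise = (px, rng) => px.map((v) => v + (rng() - 0.5) * 8); // random_noise

const quantize = (px, step = 16) => // crude stand-in for jpeg_compression
  px.map((v) => Math.round(v / step) * step);

function degrade(hr, rng = Math.random) {
  let x = blur(hr);
  x = downsample2x(x);
  x = addNoise(x, rng);
  x = quantize(x);        // first "JPEG" pass
  return quantize(x, 32); // second, harsher pass
}

const hr = [0, 0, 0, 255, 255, 255, 0, 0];
console.log(degrade(hr, () => 0.5)); // deterministic rng so the demo is repeatable
```

The training pair is then `(degrade(hr), hr)`: the network sees the mangled version and must reproduce the clean one, which is what makes it robust to real-world inputs.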
flowchart TB
IN[LR image 256x256] --> CONV1[Conv 3x3]
CONV1 --> RRDB[23 x RRDB blocks<br/>residual-in-residual<br/>dense feature extraction]
RRDB --> SKIP[+ residual skip from CONV1]
SKIP --> UP1[Pixel shuffle 2x]
UP1 --> UP2[Pixel shuffle 2x]
UP2 --> CONV2[Conv 3x3 -> RGB]
CONV2 --> OUT[HR image 1024x1024]
What About Diffusion Models?
Around 2022, diffusion-based super-resolution started outperforming GANs on quality benchmarks. Models like Stable Diffusion Upscaler and SUPIR treat upscaling as a conditional denoising task: start from noise, gradually refine toward a high-res image that's consistent with the low-res input. Diffusion upscalers produce more coherent global structure than GANs (better text legibility, fewer weird hallucinations on complex scenes) but cost 10–100× more compute per image. For a 1024×1024 output, ESRGAN takes ~200 ms on a consumer GPU; SUPIR takes 30+ seconds. For a free web tool, GAN-based models still hit the right speed/quality trade-off in 2026.
Where AI Upscaling Fails
Knowing the failure modes is the difference between a useful tool and a footgun:
- Text and faces. GAN models hallucinate. A blurry "8" becomes a confident, sharp "B." A blurry face becomes a confident, sharp face — but possibly with the wrong number of teeth, asymmetric eyes, or a slightly different person. Never use AI upscaling on legal documents, medical images, identity photos, or anything where pixel fidelity is evidence.
- Heavily compressed sources. A 50 KB JPEG of a 4K scene is missing too much information. Real-ESRGAN does its best, but the model is filling in what it thinks should be there, not what was. The output looks sharp but the details are confabulated.
- Anime and line art. Models trained on photographs produce mush on illustrated content. Use anime-specific variants like Real-ESRGAN-anime or waifu2x — they're trained on the right distribution.
- Repeating patterns. Brick walls, fabric weaves, and chain-link fences sometimes produce visible tiling artifacts at patch boundaries. Higher-quality models with overlapping inference windows mostly fix this.
- Bit-perfect reproduction. AI upscaling is one-way. You cannot downscale the output and get back the original. If you need that property, use traditional algorithms.
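The tiling failure above comes from patch-based inference, and the standard fix is overlapping windows with feathered blending. A minimal 1D sketch; an identity "model" stands in for the upscaler so the round trip is checkable, and all names are illustrative:

```javascript
// Blend overlapping tiles with feathered (triangular) weights so that
// contributions across overlaps sum smoothly and seams disappear.
function tiledProcess(signal, tileSize, overlap, model) {
  const out = new Array(signal.length).fill(0);
  const weight = new Array(signal.length).fill(0);
  const step = tileSize - overlap;
  for (let start = 0; start < signal.length; start += step) {
    const processed = model(signal.slice(start, start + tileSize));
    processed.forEach((v, i) => {
      // Feather: ramp up over the leading edge, down over the trailing edge.
      const w = Math.min(i + 1, processed.length - i, overlap + 1);
      out[start + i] += v * w;
      weight[start + i] += w;
    });
    if (start + tileSize >= signal.length) break;
  }
  return out.map((v, i) => v / weight[i]); // normalize by accumulated weight
}

const signal = Array.from({ length: 12 }, (_, i) => i * i);
const identity = (tile) => tile.slice(); // stand-in for the per-tile model
console.log(tiledProcess(signal, 6, 2, identity)); // reconstructs the input exactly
```

With a real model in place of `identity`, the feathered overlap averages each tile's disagreement with its neighbor instead of exposing it as a hard seam.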
When to Use Which
A practical decision matrix:
| Source | Goal | Algorithm |
|---|---|---|
| Clean photo, 2× | Slight enlargement, preserve fidelity | Lanczos |
| Old photo, 4× | Visible sharpening, OK with invented detail | Real-ESRGAN |
| Screenshot of text | Legibility | Lanczos or specialized text upscaler — never GAN |
| Anime / illustration | Sharp lines | waifu2x or Real-ESRGAN-anime |
| Surveillance / forensic | Evidence value | None — upscaling fabricates detail |
| Print at 300 DPI | Sharp output | Real-ESRGAN, then downsize to target |
| Thumbnail for web | Small visual upgrade | Lanczos is fine and faster |
A common workflow: run Real-ESRGAN at 4× to get maximum detail, then downsize with Lanczos to your actual target. The 4× pass produces sharper detail than going directly to 2×, and the Lanczos downsample softens any GAN artifacts.
A Quick Code Walkthrough
If you want to run upscaling locally in Node, the easiest path is the upscaler npm package or a Python service via ONNX runtime. Here's the conceptual shape with sharp for traditional and a Real-ESRGAN binary for AI:
import sharp from 'sharp';
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const run = promisify(execFile);

// Traditional bicubic via sharp
await sharp('input.jpg')
  .resize(2048, 2048, { kernel: sharp.kernel.cubic })
  .toFile('output-bicubic.jpg');

// AI (calling the Real-ESRGAN ncnn binary); awaiting the promisified
// execFile surfaces errors instead of silently dropping them
await run('realesrgan-ncnn-vulkan', [
  '-i', 'input.jpg',
  '-o', 'output-ai.png',
  '-n', 'realesrgan-x4plus', // model name
  '-s', '4',                 // scale factor
]);
Sharp gives you Lanczos, cubic, and several other classical kernels with one flag. The Real-ESRGAN ncnn binary runs on CPU, GPU (Vulkan), or Apple Silicon — practical for self-hosted setups. If managing inference servers isn't appealing, the hosted upscaler handles it for you.
In a real image pipeline you'll often combine upscaling with other steps: resize for thumbnails, convert formats for delivery, compress before publishing, or remove backgrounds for product shots — most of them benefit from AI upscaling as a preprocessing step when source resolution is poor.
The Practical Takeaway
AI upscaling isn't magic and it isn't fraud. It's a learned prior over what high-res images look like, applied to low-res inputs. When the source content is in the model's training distribution (real photos, ordinary scenes, common objects) and you don't need bit-level fidelity, the output is dramatically better than any classical algorithm. When the source is out-of-distribution (text, medical, forensic, exotic textures) or you need the output to be a faithful reconstruction rather than a plausible one, classical interpolation — bicubic, Lanczos — is the safer choice.
The mental model that works: bicubic answers "what's the smoothest pixel value here?" while ESRGAN answers "what would this look like if it had been photographed at higher resolution in the first place?" Different questions, different answers, different correct uses.
For more on the surrounding image stack, see How Image Compression Works and Image Formats Explained. For background on how these algorithms came to be standardized, the Wikipedia entry on image scaling is a good lay-of-the-land overview, and the Topaz Labs Learn site has practical vendor-side guidance on choosing models for specific photo types.
FAQ
Is AI upscaling safe to use on legal documents or evidence photos?
No — never. GAN-based upscalers like Real-ESRGAN invent plausible detail rather than recover it, which means a blurry "8" can confidently become a sharp "B" with no warning to the user. A 2020 paper on upsampling face photos showed models reliably hallucinating specific facial features that weren't in the source. For anything where pixel fidelity is evidence, stick to Lanczos or bicubic.
What's the difference between Real-ESRGAN and Stable Diffusion Upscaler?
Real-ESRGAN is a GAN trained specifically for super-resolution; it runs in ~200ms per image and produces sharp, photorealistic detail. Stable Diffusion Upscaler is a diffusion model adapted for the task; it produces more globally coherent results (better text legibility, fewer weird hallucinations on complex scenes) but takes 10–100× longer per image. For a free web tool, GAN-based models still hit the right speed/quality trade-off in 2026.
Why does AI upscaling fail so badly on text?
Models trained on photographs learn what skin, leaves, and bricks look like at high resolution — but text glyphs have specific shapes that don't generalize the way textures do. When the source is too blurry to disambiguate "rn" from "m" or "0" from "O," the model picks the most photo-plausible answer rather than the most text-correct one. Specialized text-aware upscalers exist but are far less common than generic photo upscalers.
Can I downscale Real-ESRGAN output back to the original and get a match?
No — AI upscaling is a lossy, one-way operation. The model invents detail that wasn't in the source, so downsampling the 4× output produces a 1× image that's similar to but different from the original. If you need round-trip fidelity (e.g., for archival or forensic use), classical algorithms are the only option.
Should I upscale before or after compression?
Always upscale first, then compress. JPEG and WebP compression artifacts confuse the upscaler — Real-ESRGAN trained on degraded inputs handles light JPEG noise, but heavy compression (quality 50 or below) produces output that's amplified garbage rather than recovered detail. The clean workflow is: source → upscale → final compression for delivery.
What model should I use for anime or illustrated content?
Real-ESRGAN-anime or waifu2x — both trained on illustration datasets rather than photographs. Generic photo upscalers produce mushy, smeared output on line art because they expect texture where the source has hard edges. The anime-specific variants preserve sharp lines and flat color regions, which is the opposite of what photo models do.
Why does upscaling produce visible tile boundaries on large images?
Most upscalers process images in patches (typically 256×256 or 512×512 input) and stitch them back together. If the patches don't overlap, you get visible seams at boundaries. Modern implementations use overlapping windows with feathered blending, which mostly fixes this. If your tool shows tiles, look for an "overlap" or "tile pad" parameter.
Is 4× upscaling really better than running 2× upscaling twice?
Usually yes — a single 4× pass is trained end-to-end and produces sharper detail than chaining two 2× passes. The double-2× workflow accumulates errors at each stage and tends to over-smooth fine textures. The exception is when you need an unusual scale factor like 3× — chaining 2× then downscaling to 3× often beats running a model that wasn't trained for that ratio directly.