A coworker drops a Slack message: "Quick one — got a regex that pulls phone numbers from a giant text dump?" You half-remember the right syntax for word boundaries, your brain stalls on whether \d includes Unicode digits in JavaScript, and you end up Googling the same MDN page for the seventh time this month. This page is the cheatsheet you keep meaning to bookmark — every commonly-needed pattern, with the gotchas that bite in 2026.
The regex practical guide walks you through the concepts; this is the reference you skim when you already know the concept and just need the syntax.
Why Regex Still Matters
LLMs can write regex for you, but you still need to read the result and tell whether it's correct. Regex shows up in find-and-replace, log filtering, form validation, build pipelines, and database queries — and the cost of a wrong regex is silent data corruption. A pattern that "looks right" on three test cases might miss a class of inputs entirely.
Regex is also the lowest-common-denominator pattern language across tools: grep, sed, awk, every text editor, every programming language, most database engines. Once you know the core syntax, you have a portable skill that pays back forever. The Regex Tester is the fastest way to iterate without context-switching to a REPL.
Character Classes and Quantifiers
The atoms of regex.
. any character except newline (use /s flag for "dotall" mode)
\d digit [0-9]
\D non-digit
\w word character [A-Za-z0-9_]
\W non-word character
\s whitespace (space, tab, newline, etc.)
\S non-whitespace
[abc] any of a, b, or c
[^abc] none of a, b, or c (negated class)
[a-z] any lowercase letter (range)
[A-Za-z0-9_-] custom class: alphanumeric, underscore, hyphen
Quantifiers control how many times the preceding atom matches:
? 0 or 1 (optional)
* 0 or more (greedy)
+ 1 or more (greedy)
{n} exactly n times
{n,} n or more times
{n,m} between n and m times
The *?, +?, ??, {n,m}? lazy variants match as few characters as possible. Greedy <.+> on <a><b> matches <a><b> (the whole thing); lazy <.+?> matches <a> (smallest match).
greedy: <.+> matches <a><b>
lazy: <.+?> matches <a>
Use lazy quantifiers when you want the shortest possible match — almost always the right choice when scanning HTML, XML, or anything with delimiters.
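The greedy/lazy difference is easy to check in a console. The snippet below is a minimal sketch using the same `<a><b>` input; the variable names are illustrative only.

```javascript
// Greedy vs lazy on the same input.
const html = '<a><b>';

// Greedy: .+ grabs as much as it can, then backs off to the last ">".
const greedy = html.match(/<.+>/)[0]; // "<a><b>"

// Lazy: .+? stops at the first ">" that lets the match succeed.
const lazy = html.match(/<.+?>/)[0]; // "<a>"

// With the g flag, the lazy pattern yields each tag separately.
const tags = html.match(/<.+?>/g); // ["<a>", "<b>"]

// A negated class gives the same result without relying on laziness.
const tagsNegated = html.match(/<[^>]+>/g); // ["<a>", "<b>"]

console.log(greedy, lazy, tags, tagsNegated);
```

The negated class `[^>]+` is often worth considering alongside a lazy dot: it states the delimiter explicitly and gives the engine less to backtrack over.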
Anchors and Word Boundaries
Anchors don't consume characters; they assert position.
^ start of line (or string, depending on /m flag)
$ end of line (or string)
\b word boundary (between \w and non-\w)
\B non-word boundary
\A start of string (PCRE, Python; not in JavaScript)
\Z end of string (PCRE also allows one trailing newline after it; Python's \Z is the absolute end; not in JavaScript)
\b is the most useful and most misunderstood. \bcat\b matches cat in the cat sat but not in category or concatenate. Word boundaries find positions where \w meets \W (or string boundary).
\bcat\b matches "cat" but not "category"
\bcat matches "cat" and "cathedral"
cat\b matches "cat" and "tomcat"
In multi-line mode (/m flag in JS, re.MULTILINE in Python), ^ and $ match at every line break, not just at string start/end. Without the flag, they only match at the absolute start/end of the input.
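A short sketch of both behaviors in JavaScript, assuming a made-up log string and sentence; it shows what the m flag changes and how \b behaves on the cat examples above.

```javascript
const log = 'error: disk full\nwarn: retry\nerror: timeout';

// Without m, ^ matches only at the very start of the input.
const withoutM = log.match(/^error/g); // ["error"]

// With m, ^ also matches right after each line break.
const withM = log.match(/^error/gm); // ["error", "error"]

const sentence = 'the cat sat near the category concatenate tomcat';

// Whole word only.
const wholeWord = sentence.match(/\bcat\b/g); // ["cat"]

// Words that start with "cat".
const startsWithCat = sentence.match(/\bcat\w*/g); // ["cat", "category"]

// Words that end with "cat".
const endsWithCat = sentence.match(/\w*cat\b/g); // ["cat", "tomcat"]

console.log(withoutM, withM, wholeWord, startsWithCat, endsWithCat);
```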
Groups: Capturing, Non-Capturing, Named
Groups bundle parts of a pattern for quantifiers, alternation, and back-references.
(abc) capturing group — accessed as $1 or \1
(?:abc) non-capturing group — same matching, no capture
(?<name>abc) named capturing group — accessed as groups.name
\1, $1 backreference to first capturing group
Capturing groups slow regex down slightly (the engine tracks the matched text). If you only need grouping for alternation or quantification, use (?:...):
// Capturing: slots 1 and 2 hold values you'll never use
/(https?):\/\/(www\.)?([^/]+)/
// Non-capturing where you don't need the result
/(?:https?):\/\/(?:www\.)?([^/]+)/ // only the domain is captured, as group 1
Named groups make complex regexes self-documenting:
const m = '2026-05-08'.match(/^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$/);
console.log(m.groups.year); // "2026"
Named groups are supported in JavaScript (ES2018+), Python 3.x (spelled (?P<name>...)), .NET, and PCRE: basically everywhere except really old engines.
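Named groups are also usable in replacement strings and as backreferences. The following is a small JavaScript sketch; `parseIsoDate` and the sample strings are hypothetical, made up for illustration.

```javascript
const isoDate = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

// $<name> in a replacement string refers to a named group.
const usStyle = '2026-05-08'.replace(isoDate, '$<month>/$<day>/$<year>'); // "05/08/2026"

// Destructure the groups; match() returns null when nothing matches.
function parseIsoDate(text) {
  const m = text.match(isoDate);
  if (!m) return null;
  const { year, month, day } = m.groups;
  return { year: Number(year), month: Number(month), day: Number(day) };
}

// \k<name> is the named form of a backreference: the closing quote
// must be the same character as the opening one.
const quoted = /(?<quote>['"])(?<body>.*?)\k<quote>/;
const q = `say "it's fine" now`.match(quoted);

console.log(usStyle, parseIsoDate('2026-05-08'), q.groups.body);
```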
Lookarounds
Lookarounds are zero-width assertions: they check what's around the current position without consuming characters.
(?=abc) positive lookahead: must be followed by abc
(?!abc) negative lookahead: must NOT be followed by abc
(?<=abc) positive lookbehind: must be preceded by abc
(?<!abc) negative lookbehind: must NOT be preceded by abc
Useful for "match X but only when Y comes after/before":
\d+(?=USD) digits followed by "USD" (the match contains only the digits)
(?<=\$)\d+ digits preceded by "$" (the match contains only the digits)
(?<!error_)\d+ digits NOT preceded by "error_"
Lookbehind support varies. JavaScript got it in ES2018. PCRE has it. Python's re requires fixed-width lookbehinds (no (?<=a+)); Python's third-party regex module supports variable-width. The MDN regex lookbehind reference lists current browser support.
When you can't use lookbehind, capture the prefix and discard it in code:
// Without lookbehind: match the "$" prefix, capture only the digits
const text = 'Total: $45';
const m = text.match(/\$(\d+)/);
const amount = m ? m[1] : null; // "45", or null when nothing matches
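To see the three lookaround patterns above on real input, here is a hedged JavaScript sketch against an invented string. It also surfaces a common surprise: a negative lookbehind is checked at every position, so the engine can start the match one digit later.

```javascript
const priceText = 'Total: $45, shipping 12USD, error_404, code 7';

// Digits followed by "USD"; the suffix is not part of the match.
const beforeUsd = priceText.match(/\d+(?=USD)/g); // ["12"]

// Digits preceded by "$"; the prefix is not part of the match.
const afterDollar = priceText.match(/(?<=\$)\d+/g); // ["45"]

// Gotcha: "404" is rejected at its first digit, but the engine then
// retries at the next digit, where "error_" is no longer directly behind.
const naive = priceText.match(/(?<!error_)\d+/g); // ["45", "12", "04", "7"]

// One possible fix in JavaScript, which allows variable-width lookbehind:
// also reject positions preceded by "error_" plus any digits.
const fixed = priceText.match(/(?<!error_\d*)\d+/g); // ["45", "12", "7"]

console.log(beforeUsd, afterDollar, naive, fixed);
```

The fixed pattern relies on variable-width lookbehind, so it would not compile in Python's built-in re.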
Common Patterns: Email, URL, IP, Date, Phone
The "I just need a regex for X" patterns. Each has a strict and a pragmatic version.
Email (pragmatic, handles 99% of real addresses):
^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$
A fully standards-compliant pattern is impractical: the best-known attempt (written against RFC 822, the ancestor of RFC 5322) runs to over 6,000 characters, and the grammar accepts things like "Abc\@def"@example.com. Don't use a strict pattern. Use the pragmatic one for client-side hint validation, then send a confirmation email; that's the only real validation.
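A quick way to see what the pragmatic pattern accepts and rejects is to run it over a handful of addresses. This is a sketch with invented sample addresses, not a validation recipe.

```javascript
const emailPattern = /^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$/;

const emailSamples = [
  'user@example.com',                 // accepted
  'first.last+tag@sub.example.co.uk', // accepted
  'no-at-sign.example.com',           // rejected: no @
  'user@localhost',                   // rejected: no dot in the domain
  '"Abc\\@def"@example.com',          // rejected, although the RFC grammar allows it
  'user@example..com',                // accepted, although the domain is malformed
];

const emailResults = emailSamples.map((s) => emailPattern.test(s));
console.log(emailResults); // [true, true, false, false, false, true]
```

The last sample is the reason the pattern is only a hint: it lets consecutive dots through, so the confirmation email remains the real check.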
URL (HTTP/HTTPS):
^https?:\/\/[^\s/$.?#].[^\s]*$
For full RFC 3986 compliance use a parser, not regex. The pattern above catches obvious typos but won't reject technically-valid weird inputs.
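In JavaScript, the built-in URL constructor is one parser you could use. The sketch below assumes a hypothetical helper named `isHttpUrl`. Note that the constructor follows the WHATWG URL Standard rather than RFC 3986 to the letter, so treat it as a practical check, not a compliance test.

```javascript
const urlPattern = /^https?:\/\/[^\s/$.?#].[^\s]*$/;

// Hypothetical helper: parse first, then restrict the scheme.
function isHttpUrl(text) {
  let parsed;
  try {
    parsed = new URL(text); // throws TypeError on unparseable input
  } catch {
    return false;
  }
  return parsed.protocol === 'http:' || parsed.protocol === 'https:';
}

console.log(urlPattern.test('https://example.com/path?q=1')); // true
console.log(isHttpUrl('https://example.com/path?q=1'));       // true
console.log(isHttpUrl('ftp://example.com'));                  // false
console.log(isHttpUrl('not a url'));                          // false

// The two disagree on a single-character host: the pattern needs
// at least two characters after "//", the parser does not.
console.log(urlPattern.test('http://a')); // false
console.log(isHttpUrl('http://a'));       // true
```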
IPv4:
^(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})$
This validates each octet is 0-255. The naive \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} allows 999.999.999.999.
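The difference between the two patterns shows up as soon as an octet exceeds 255. A minimal sketch follows; the naive pattern is anchored here so the comparison is like for like.

```javascript
const ipv4 = /^(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})$/;
const naiveIpv4 = /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/;

console.log(ipv4.test('192.168.1.1'));          // true
console.log(ipv4.test('255.255.255.255'));      // true
console.log(ipv4.test('256.1.1.1'));            // false
console.log(ipv4.test('1.2.3'));                // false: only three octets
console.log(ipv4.test('999.999.999.999'));      // false
console.log(naiveIpv4.test('999.999.999.999')); // true: the naive pattern is fooled

// Leading zeros are accepted by the range-checked pattern too.
console.log(ipv4.test('01.002.3.4')); // true
```

Whether leading zeros should pass depends on the consumer; some tools read 010 as octal, so it may be worth rejecting them separately.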
Date (YYYY-MM-DD, ISO format):
^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$
Doesn't validate February 30 — for real date validation, parse and check. Regex is the wrong tool for "is this a valid calendar date."
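One way to "parse and check" in JavaScript is to let the regex handle the shape and a Date round-trip handle the calendar. The helper name `isRealDate` is an assumption for this sketch.

```javascript
const isoDatePattern = /^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$/;

// Hypothetical helper: shape check by regex, calendar check by round-trip.
function isRealDate(text) {
  const m = isoDatePattern.exec(text);
  if (!m) return false;
  const [year, month, day] = m.slice(1).map(Number);

  // Date rolls impossible days over (Feb 30 becomes early March),
  // so a real date is one that survives the round-trip unchanged.
  const d = new Date(0);
  d.setUTCFullYear(year, month - 1, day);
  return (
    d.getUTCFullYear() === year &&
    d.getUTCMonth() === month - 1 &&
    d.getUTCDate() === day
  );
}

console.log(isRealDate('2026-05-08')); // true
console.log(isRealDate('2026-02-30')); // false: passes the regex, fails the calendar
console.log(isRealDate('2024-02-29')); // true: leap year
console.log(isRealDate('2025-02-29')); // false
console.log(isRealDate('2026-13-01')); // false: rejected by the regex
```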
US phone (loose):
^\+?1?[-.\s]?\(?[2-9]\d{2}\)?[-.\s]?\d{3}[-.\s]?\d{4}$
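The anchored pattern validates a single field. For pulling numbers out of a larger text, a variant without ^ and $ is needed. The sketch below is one possible adaptation, not the pattern above verbatim: the country code and its separator are grouped so a stray leading space is not swallowed, and digit lookarounds keep long ID numbers from matching. The sample text is invented.

```javascript
const phoneInText =
  /(?<!\d)(?:\+?1[-.\s]?)?\(?[2-9]\d{2}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?!\d)/g;

const dump =
  'Call (415) 555-2671 or +1 212.555.0187; order 1234567890123 is not a phone.';

// match() with the g flag returns every match, or null if there are none.
const phones = dump.match(phoneInText) ?? [];

console.log(phones); // ["(415) 555-2671", "+1 212.555.0187"]
```

The 13-digit order number is skipped because (?!\d) refuses a match that stops in the middle of a digit run, and (?<!\d) refuses one that starts there.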
For URL slugs from arbitrary text, the Slug Generator handles the corner cases (Unicode normalization, repeated separators). For cleaning text before regex matching, the Text Cleaner is faster than writing throwaway patterns. For inferring a regex from example strings, try the Regex Generator from Samples.
Performance Traps: Catastrophic Backtracking
The most expensive regex bugs cause "catastrophic backtracking" — patterns that take exponential time on inputs slightly longer than the test case.
The classic example:
^(a+)+$
On input "aaaaaaaaaaaaaaaaaaaaa!" (21 a's followed by !), a backtracking engine tries on the order of 2^21 (about two million) paths before failing. On 30 a's plus !, it's about a billion. This is how regex denial-of-service (ReDoS) works.
The fix is to avoid nested quantifiers on overlapping character classes. (a+)+ and (a*)* are red flags. So is (a|aa)+ — alternation where one branch is a prefix of the other.
In 2007, Russ Cox wrote "Regular Expression Matching Can Be Simple and Fast", required reading if you want to understand why automaton-based engines (Go's regexp, Rust's regex) guarantee match time linear in the input length, while backtracking engines in the PCRE style are exponential in the worst case. The price of that guarantee is that Go and Rust drop backreferences and lookarounds. JavaScript engines mostly use the backtracking approach for compatibility.
To audit a pattern for backtracking issues: feed long inputs of repeated characters to the Regex Tester and check whether match time scales linearly with length. If doubling the input length more than doubles the time, you have a backtracking problem.
Defense: use possessive quantifiers (a++ in PCRE) or atomic groups ((?>a+)) when supported. In JavaScript, refactor the pattern to make alternatives unambiguous. Sometimes splitting a complex regex into two simpler ones is faster than trying to make one fast.
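As a rough illustration of the refactoring advice, the sketch below compares the classic vulnerable pattern with two alternatives. The lookahead-plus-backreference form is a commonly used emulation of an atomic group in JavaScript, not a built-in feature; `timeMs` and the other names are illustrative, and timings will vary by machine.

```javascript
// Vulnerable: nested quantifier over the same character.
const vulnerable = /^(a+)+$/;

// Same language, no nesting: fails in linear time.
const safe = /^a+$/;

// Emulated atomic group: the lookahead captures greedily and the
// backreference consumes exactly that text. A lookahead is not
// re-entered on backtracking, so the captured run is never re-split.
const atomicLike = /^(?:(?=(a+))\1)+$/;

function timeMs(re, s) {
  const start = Date.now();
  re.test(s);
  return Date.now() - start;
}

// Keep inputs short for the vulnerable pattern: work roughly
// doubles with every extra character.
for (const n of [14, 16, 18, 20]) {
  console.log(n, timeMs(vulnerable, 'a'.repeat(n) + '!'), 'ms');
}

// The safe variants handle a much longer failing input without trouble.
const longInput = 'a'.repeat(5000) + '!';
console.log(safe.test(longInput), atomicLike.test(longInput)); // false false
console.log(safe.test('aaaa'), atomicLike.test('aaaa'));       // true true
```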
Browser vs PCRE Flavor Differences
Regex flavors differ in subtle ways. The same pattern can behave differently in Python, JavaScript, PCRE, .NET, Go, and Java.
| Feature | JavaScript | PCRE | Python |
|---|---|---|---|
| \d matches Unicode digits | no (always [0-9]; use \p{Nd} with /u) | only in UCP mode | yes for str patterns (re.ASCII disables) |
| Lookbehind | ES2018+ | yes | fixed-width only (built-in) |
| Named groups | (?<name>...) | (?<name>...) or (?P<name>...) | (?P<name>...) |
| Possessive quantifiers a++ | no | yes | 3.11+ (or third-party regex) |
| Atomic groups (?>...) | no | yes | 3.11+ (or third-party regex) |
| Recursion (?R) | no | yes | with third-party regex |
| Inline flags (?i) | scoped (?i:...) only, in ES2025+ engines | yes | yes |
JavaScript flags worth knowing:
/g global — find all matches, not just first
/i case-insensitive
/m multi-line — ^ and $ match at line breaks
/s dotall — . matches newline
/u unicode: enables \p{...}, \u{...}, and matching by code point (\d and \w stay ASCII)
/y sticky — match starting exactly at lastIndex
/d has indices — match results include start/end positions
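The /u flag is the one most JavaScript devs forget. Without it, \p{...} property escapes and \u{...} code point escapes don't work, and characters outside the Basic Multilingual Plane (most emoji) are matched as two separate code units. It does not change \d, which matches only [0-9] with or without the flag. Modern code should default to /u.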
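What the /u flag does and does not change can be checked directly. This is a small sketch with sample strings chosen for illustration (Arabic-Indic digits, one emoji, a Cyrillic word).

```javascript
const arabicIndic = '\u0663\u0664\u0665'; // the digits 3, 4, 5 in Arabic-Indic script
const emoji = '\u{1F600}';                // one code point, two UTF-16 code units

// \d is [0-9] with or without the flag.
const plainDigit = /\d/.test(arabicIndic);       // false
const unicodeDigit = /\d/u.test(arabicIndic);    // false
const propertyDigit = /\p{Nd}/u.test(arabicIndic); // true

// Without /u, \p{Nd} is not a property escape; it matches the literal text "p{Nd}".
const literalP = /\p{Nd}/.test('p{Nd}'); // true

// The dot matches a code unit without /u and a whole code point with it.
const dotPlain = /^.$/.test(emoji);    // false
const dotUnicode = /^.$/u.test(emoji); // true

// Script properties need /u as well.
const cyrillic = /^\p{Script=Cyrillic}+$/u.test('Привет'); // true

console.log(plainDigit, unicodeDigit, propertyDigit, literalP, dotPlain, dotUnicode, cyrillic);
```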
For migrating regex between flavors: simple patterns (character classes, quantifiers, anchors, basic groups) are portable. Anything with lookbehind, atomic groups, recursion, or backreferences past \9 needs flavor-specific testing. Build patterns incrementally — start with the simplest match against one example, expand to handle edge cases (empty strings, Unicode, multi-line input), and always keep a "negative" test set of strings that should NOT match. The MDN regex reference is the canonical browser-side guide; for batch replacements with capture groups, the Find Replace tool is faster than scripting. When a regex grows past 80 characters, split it into named groups or multiple smaller regexes — unreadable regex is unmaintainable regex.
FAQ
What's the difference between greedy and lazy quantifiers?
Greedy quantifiers (*, +, ?, {n,m}) match as many characters as possible while still allowing the overall regex to match. Lazy quantifiers (*?, +?, ??, {n,m}?) match as few characters as possible. On <a><b>, the regex <.+> (greedy) matches the entire string; <.+?> (lazy) matches just <a>. Use lazy when scanning text with delimiters.
When should I use a non-capturing group?
Whenever you only need grouping (for alternation (?:cat|dog) or quantification (?:abc)+) but don't need to extract the matched text. Capturing groups have a small performance cost and clutter the result array with values you don't need. Default to non-capturing (?:...) and switch to capturing only when you'll use the captured value.
Can I use regex to parse HTML?
For surface-level extraction (find all <a href> URLs in a clean document), regex works. For real parsing of arbitrary HTML, no — use a proper parser like cheerio (Node), BeautifulSoup (Python), or browser DOM APIs. HTML's grammar isn't regular; nested tags and SGML quirks defeat any sane regex. The famous Stack Overflow rant is meme-worthy but technically correct.
Why does my regex match more (or less) than I expect?
Almost always one of: (1) greedy vs lazy quantifier confusion (use +? for shortest match), (2) . not matching newlines without /s flag, (3) ^ and $ only matching string boundaries without /m flag, (4) you forgot to anchor with ^...$ and the regex is matching a substring, (5) word boundary \b defining "word" differently than you expected (only \w characters count). The Regex Tester with sample inputs is the fastest debug.
Should I use regex or a parser library?
Regex for searching, simple validation, and one-shot text manipulation. Parser library for structured data (JSON, HTML, CSV, code, dates). Rule of thumb: if the format has nested structures, escaping rules, or grammar, use a parser. The temptation to "just regex it" leads to brittle code that breaks on edge cases six months later.
What's a lookbehind and when do I need one?
A lookbehind asserts that something appears immediately before the current position without including it in the match. (?<=\$)\d+ matches digits that come after a $, but the $ itself isn't part of the match. Useful when you want the matched text to exclude the prefix or suffix you're checking for.
How do I match Unicode characters?
In JavaScript, use the /u flag and \p{...} Unicode property escapes: \p{Letter} for any letter, \p{Script=Cyrillic} for Cyrillic. \w and \d stay ASCII-only in JavaScript even with /u. In Python 3, \w, \d, and \s are Unicode-aware by default for str patterns, but the built-in re has no \p{...}; the third-party regex module adds it. In PCRE, enable Unicode with (*UCP) or appropriate compile flags; without that, \w and \d match only ASCII characters.
How do I avoid catastrophic backtracking in my regex?
Avoid nested quantifiers over overlapping character classes ((a+)+, (a*)*, (a|aa)+). Inside a repeated group, make sure no alternation lets the same text be matched in more than one way. Use atomic groups (?>...) or possessive quantifiers a++ if your engine supports them. Test with long inputs of repeated characters: if doubling the length more than doubles the match time, the regex has super-linear (possibly exponential) worst-case behavior.