How UTF-8 and Unicode Work — Text Encoding for Developers

Before Unicode, every region of the world solved text encoding independently. Japan had Shift-JIS. Russia had KOI8-R. Western Europe had ISO 8859-1. Receive a file encoded in one standard and open it in software expecting another, and you'd get garbage characters — a phenomenon known as mojibake. Unicode was designed to end this. UTF-8 is how we got there.

ASCII: A Good Start That Wasn't Enough

ASCII (American Standard Code for Information Interchange) was defined in 1963. It assigns numbers 0–127 to the characters needed for English text: uppercase and lowercase letters, digits, punctuation, and control characters. One byte can hold values 0–255, so ASCII only uses the lower half.

That works fine as long as you're writing in English. The moment you need accented characters (é, ñ, ü), Cyrillic, Arabic, Chinese, or anything outside basic Latin, you're out of range. ASCII has no character for them.

Various 8-bit encodings tried to extend ASCII by using bytes 128–255 for regional characters. ISO 8859-1 (Latin-1) covered Western European languages. ISO 8859-5 covered Cyrillic. Each used those upper 128 bytes differently. A document labeled "ISO 8859-1" and a document labeled "ISO 8859-5" both use byte value 0xE9, but it represents é in one and щ in the other.

This is why encoding metadata matters. Without knowing which encoding a file uses, you're guessing.

Unicode: The Universal Character Set

Unicode doesn't define how bytes are stored. It defines a code point — a unique number — for every character in every writing system on Earth, plus symbols, emoji, and a lot more.

The Unicode standard, maintained by the Unicode Consortium, currently assigns code points to over 150,000 characters across more than 160 scripts. Code points are written as U+ followed by a hexadecimal number. The letter A is U+0041. The euro sign € is U+20AC. The grinning face emoji is U+1F600.

Code points range from U+0000 to U+10FFFF — that's just over 1.1 million possible positions, of which approximately 155,000 are currently assigned (leaving room for future characters).

Unicode solves the identity problem. Every character has exactly one code point, globally unique, stable once assigned. But it doesn't say how to store those numbers in bytes — that's what the encoding schemes handle.

UTF-8: Variable-Width Encoding

UTF-8 is the dominant encoding for Unicode text on the web and in most modern systems. The clever part is how it maps code points to bytes.

UTF-8 uses 1 to 4 bytes per character, determined by the code point's range:

Code point range	Bytes used	Bit pattern
U+0000 – U+007F	1	`0xxxxxxx`
U+0080 – U+07FF	2	`110xxxxx 10xxxxxx`
U+0800 – U+FFFF	3	`1110xxxx 10xxxxxx 10xxxxxx`
U+10000 – U+10FFFF	4	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`

The lead byte tells you how many bytes follow: a byte starting with 0 is a single-byte character. A byte starting with 110 means two bytes total. 1110 means three. 11110 means four. Continuation bytes always start with 10.

'A'  → U+0041 → 0x41         (1 byte: 01000001)
'é'  → U+00E9 → 0xC3 0xA9    (2 bytes)
'€'  → U+20AC → 0xE2 0x82 0xAC  (3 bytes)
'😀' → U+1F600 → 0xF0 0x9F 0x98 0x80  (4 bytes)

You can verify any character's UTF-8 encoding manually: find the code point in the table, fill in the x bits, and convert to hex. The Base64 Encoder and URL Encoder both work at the byte level and will show you the percent-encoded form, which directly reveals each byte.

Why UTF-8 Won

UTF-8 has a property that made adoption practical: it's ASCII-compatible. Any valid ASCII text is also valid UTF-8 — the byte sequences are identical. A file that only uses bytes 0–127 is both ASCII and UTF-8 simultaneously.

This meant UTF-8 could be adopted incrementally. Existing ASCII software didn't need to change to handle UTF-8 files that only contained ASCII characters. As more characters were needed, they used the multi-byte sequences, which older software would just pass through without mangling.

UTF-8 is also self-synchronizing. If you start reading a byte stream in the middle, the lead byte structure tells you where the next character boundary is. You never need to scan backward. This makes error recovery and stream processing simpler.

For English text, UTF-8 is the same size as ASCII. For most other Latin-script languages, it's 1–2 bytes per character. Chinese, Japanese, and Korean characters typically take 3 bytes each. Emoji take 4 bytes.

UTF-16 and When It's Used

UTF-16 uses 2 bytes for most characters and 4 bytes (a surrogate pair) for code points above U+FFFF.

Historically, UTF-16 was adopted by Windows and Java when the plan was that 2 bytes would be enough for all Unicode characters — what was then called "Unicode" was UCS-2, a fixed-width 2-byte encoding. When the Unicode range was extended beyond 65,536 code points, UCS-2 had to be updated to UTF-16 with surrogate pairs, adding complexity.

JavaScript strings are internally UTF-16. str.length in JavaScript counts UTF-16 code units, not characters. Emoji and supplementary characters take 2 code units (a surrogate pair), so '😀'.length === 2 even though it's one character. The newer String.prototype.codePointAt() and iteration with for...of handle surrogate pairs correctly.

Windows APIs, the .NET runtime, and the Java runtime all use UTF-16 internally. Files exchanged between systems generally use UTF-8, but in-memory string representations vary.

The BOM Problem

A Byte Order Mark (BOM) is the code point U+FEFF placed at the very beginning of a text file. For UTF-16, it's necessary — the byte order (big-endian vs little-endian) affects how the 2-byte sequences are read, and the BOM tells the reader which order to expect.

For UTF-8, a BOM is meaningless (UTF-8 has no byte order issue) but some Windows tools write it anyway. This causes problems: a UTF-8 file with a BOM has three invisible bytes (EF BB BF) at the start. Programs that don't expect a BOM — particularly on Linux and macOS — can misinterpret these bytes or include them in the first line.

The classic symptom: a CSV file opens correctly in Excel (which writes BOM-marked UTF-8) but a Python script produces a UnicodeDecodeError or weird first-row content when read on Linux. The fix is to write UTF-8 without BOM, or explicitly handle the BOM when reading.

# Read UTF-8 with or without BOM
with open('file.csv', encoding='utf-8-sig') as f:  # 'utf-8-sig' strips BOM
    content = f.read()

# Write UTF-8 without BOM
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(content)

Mojibake: When Encodings Collide

Mojibake (文字化け) is the Japanese term for the garbled text that appears when content encoded in one system is interpreted as another. The word literally means "transformed characters."

A common modern example: a MySQL database column using latin1 (ISO 8859-1) storing UTF-8 bytes. The character é is stored as bytes 0xC3 0xA9 (UTF-8), but MySQL reads them as two latin1 characters: Ã and ©. The text looks correct when written but appears as Ã© when read.

The fix is almost always to declare encodings explicitly at every layer: the database column, the database connection charset, the HTTP response headers (Content-Type: text/html; charset=utf-8), and the HTML meta tag (<meta charset="UTF-8">). Inconsistency anywhere in the chain causes mojibake.

Practical Rules for Today

Use UTF-8 everywhere unless you have a specific reason not to. Most modern systems default to it, but being explicit beats relying on defaults:

Set <meta charset="UTF-8"> in every HTML document
Specify charset=utf-8 in HTTP Content-Type headers
Store text in databases as utf8mb4 (not utf8, which in MySQL only supports 3-byte sequences and can't store emoji)
Open and write files with an explicit encoding rather than relying on system defaults

For more on how encoding interacts with encryption and hashing — three concepts that work at the byte level and are often confused — see Encoding vs Encryption vs Hashing. For how number representations connect to byte values, Number Systems Explained covers the hex-to-byte connection that makes percent-encoding and UTF-8 byte sequences readable.