How To Normalize Unicode Text
Why normalization exists
Unicode lets the same visible text be encoded multiple ways. The letter “é” can be one code point (U+00E9) or two (U+0065 + U+0301), and both render identically. When you compare two strings, index them in a database, use them as cache keys, or run a regex across them, these equivalent-but-different encodings silently diverge. Unicode normalization forces a canonical form so two “equal” strings actually compare equal. This guide covers the four normalization forms (NFC, NFD, NFKC, NFKD), when to use each, the security implications of homoglyph attacks, and the database and search-index patterns that depend on consistent normalization.
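A minimal demonstration of the problem, using Python's standard-library unicodedata module:

    import unicodedata

    precomposed = "caf\u00e9"    # "café" with é as one code point (U+00E9)
    combining = "cafe\u0301"     # "café" as e (U+0065) + combining acute (U+0301)

    print(precomposed == combining)  # False: same rendering, different code points
    print(unicodedata.normalize("NFC", precomposed)
          == unicodedata.normalize("NFC", combining))  # True after normalization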
The four forms
Unicode encodes accented characters in two ways: precomposed (a single code point, kept for compatibility with legacy character sets) and combining sequences (a base letter plus a combining mark). Both render identically, and neither is “wrong,” but comparing them requires normalization.
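You can see the two encodings directly by listing each string's code points; a quick sketch with unicodedata.name:

    import unicodedata

    for s in ("\u00e9", "e\u0301"):
        print([f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s])
    # ['U+00E9 LATIN SMALL LETTER E WITH ACUTE']
    # ['U+0065 LATIN SMALL LETTER E', 'U+0301 COMBINING ACUTE ACCENT']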
NFC: the default for storage and comparison
NFC composes combining sequences back into precomposed code points, producing the shortest and most common form. Most text on the web is already NFC. Compare in NFC for “are these the same user-perceived string” tests.
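A small helper for such tests; nfc_equal is an illustrative name, not a standard function:

    import unicodedata

    def nfc_equal(a: str, b: str) -> bool:
        # Same user-perceived text, regardless of how each string was encoded.
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    print(nfc_equal("caf\u00e9", "cafe\u0301"))  # True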
NFD: when you want to strip accents
NFD splits precomposed characters into a base letter plus combining marks, so each accent becomes a separate code point you can filter out. This is the backbone of slug generation and accent-insensitive search.
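A sketch of slug generation built on NFD; slugify is a hypothetical helper, and the a–z/0–9 whitelist is a simplifying assumption for Latin-script text:

    import re
    import unicodedata

    def slugify(text: str) -> str:
        # Decompose, then drop combining marks (the accents).
        decomposed = unicodedata.normalize("NFD", text)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Lowercase and collapse runs of non-alphanumerics into single hyphens.
        return re.sub(r"[^a-z0-9]+", "-", stripped.lower()).strip("-")

    print(slugify("Crème Brûlée!"))  # creme-brulee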
NFKC: lossy but useful
NFKC collapses compatibility variants to their “plain” form: the ligature “ﬁ” becomes “fi”, fullwidth “Ａ” becomes “A”, and superscript “²” becomes “2”. The mapping is one-way, which is what makes it lossy.
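A quick Python sketch of those mappings:

    import unicodedata

    for s in ("\ufb01le", "\uff21\uff22\uff23", "x\u00b2", "\u2460"):
        print(repr(s), "->", repr(unicodedata.normalize("NFKC", s)))
    # 'ﬁle' -> 'file'
    # 'ＡＢＣ' -> 'ABC'
    # 'x²' -> 'x2'   (note the loss: the exponent is now a plain digit)
    # '①' -> '1'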
NFKD: search-index form
NFKD is the aggressive “one true form” for search: it applies compatibility mappings and canonical decomposition in one pass. You can then strip combining marks for fully accent-insensitive indexing.
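A sketch of a search-index key built this way; index_key is an illustrative name:

    import unicodedata

    def index_key(text: str) -> str:
        # NFKD folds compatibility variants and decomposes accents in one pass.
        decomposed = unicodedata.normalize("NFKD", text)
        # Drop combining marks, then casefold for case-insensitive matching.
        return "".join(c for c in decomposed
                       if not unicodedata.combining(c)).casefold()

    print(index_key("Ｃａｆé"))  # cafe
    print(index_key("ﬁancée"))   # fiancee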
When normalizations disagree
Copy text through Windows, macOS, Linux, and web apps, and its normalization form can silently change. macOS famously used a decomposed (NFD-style) form for file names on its HFS+ filesystem, so file names copied to other systems shift form. Always normalize at boundaries: on input, on storage, on output.
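A minimal boundary sketch; clean_input is a hypothetical name for whatever your input layer calls:

    import unicodedata

    def clean_input(raw: str) -> str:
        # Normalize once, at the boundary, so every layer downstream
        # (database, cache keys, regexes) sees exactly one form.
        return unicodedata.normalize("NFC", raw)

    # e.g. a file name arriving from a macOS client in decomposed form:
    stored = clean_input("re\u0301sume\u0301.txt")
    print(stored == "r\u00e9sum\u00e9.txt")  # True once both sides are NFC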