How To Normalize Unicode Text
Why normalization exists
Unicode lets the same visible text be encoded multiple ways. The letter “é” can be one code point (U+00E9) or two (U+0065 + U+0301), and both render identically. When you compare two strings, index them in a database, use them as cache keys, or run a regex across them, these equivalent-but-different encodings silently diverge. Unicode normalization forces a canonical form so two “equal” strings actually compare equal. This guide covers the four normalization forms (NFC, NFD, NFKC, NFKD), when to use each, the security implications of homoglyph attacks, and the database and search-index patterns that depend on consistent normalization.
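A minimal demonstration of the problem, using Python's standard-library unicodedata module:

    import unicodedata

    precomposed = "caf\u00e9"    # "café" with é as one code point (U+00E9)
    combining = "cafe\u0301"     # "café" as e (U+0065) + combining acute (U+0301)

    print(precomposed == combining)  # False: same rendering, different code points
    print(unicodedata.normalize("NFC", precomposed)
          == unicodedata.normalize("NFC", combining))  # True after normalization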
The four forms
Unicode encodes accented characters in two ways: precomposed (a single code point, kept for compatibility with legacy character sets) and combining sequences (a base letter plus a combining mark). Both render identically, and neither is “wrong,” but comparing them requires normalization.
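You can see the two encodings directly by listing each string's code points; a quick sketch with unicodedata.name:

    import unicodedata

    for s in ("\u00e9", "e\u0301"):
        print([f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s])
    # ['U+00E9 LATIN SMALL LETTER E WITH ACUTE']
    # ['U+0065 LATIN SMALL LETTER E', 'U+0301 COMBINING ACUTE ACCENT']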
NFC: the default for storage and comparison
NFC composes combining sequences back into precomposed code points, producing the shortest and most common form. Most text on the web is already NFC. Compare in NFC for “are these the same user-perceived string” tests.
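A small helper for such tests; nfc_equal is an illustrative name, not a standard function:

    import unicodedata

    def nfc_equal(a: str, b: str) -> bool:
        # Same user-perceived text, regardless of how each string was encoded.
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    print(nfc_equal("caf\u00e9", "cafe\u0301"))  # True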
NFD: when you want to strip accents
NFD splits precomposed characters into a base letter plus combining marks, so each accent becomes a separate code point you can filter out. This is the backbone of slug generation and accent-insensitive search.
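A sketch of slug generation built on NFD; slugify is a hypothetical helper, and the a–z/0–9 whitelist is a simplifying assumption for Latin-script text:

    import re
    import unicodedata

    def slugify(text: str) -> str:
        # Decompose, then drop combining marks (the accents).
        decomposed = unicodedata.normalize("NFD", text)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Lowercase and collapse runs of non-alphanumerics into single hyphens.
        return re.sub(r"[^a-z0-9]+", "-", stripped.lower()).strip("-")

    print(slugify("Crème Brûlée!"))  # creme-brulee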
NFKC: lossy but useful
NFKC collapses compatibility variants to their “plain” form: the ligature “ﬁ” becomes “fi”, fullwidth “Ａ” becomes “A”, and superscript “²” becomes “2”. The mapping is one-way, which is what makes it lossy.
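A quick Python sketch of those mappings:

    import unicodedata

    for s in ("\ufb01le", "\uff21\uff22\uff23", "x\u00b2", "\u2460"):
        print(repr(s), "->", repr(unicodedata.normalize("NFKC", s)))
    # 'ﬁle' -> 'file'
    # 'ＡＢＣ' -> 'ABC'
    # 'x²' -> 'x2'   (note the loss: the exponent is now a plain digit)
    # '①' -> '1'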
NFKD: search-index form
NFKD is the aggressive “one true form” for search: it applies compatibility mappings and canonical decomposition in one pass. You can then strip combining marks for fully accent-insensitive indexing.
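A sketch of a search-index key built this way; index_key is an illustrative name:

    import unicodedata

    def index_key(text: str) -> str:
        # NFKD folds compatibility variants and decomposes accents in one pass.
        decomposed = unicodedata.normalize("NFKD", text)
        # Drop combining marks, then casefold for case-insensitive matching.
        return "".join(c for c in decomposed
                       if not unicodedata.combining(c)).casefold()

    print(index_key("Ｃａｆé"))  # cafe
    print(index_key("ﬁancée"))   # fiancee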
When normalizations disagree
Copy text through Windows, macOS, Linux, and web apps, and its normalization form can silently change. macOS famously used a decomposed (NFD-style) form for file names on its HFS+ filesystem, so file names copied to other systems shift form. Always normalize at boundaries: on input, on storage, on output.
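A minimal boundary sketch; clean_input is a hypothetical name for whatever your input layer calls:

    import unicodedata

    def clean_input(raw: str) -> str:
        # Normalize once, at the boundary, so every layer downstream
        # (database, cache keys, regexes) sees exactly one form.
        return unicodedata.normalize("NFC", raw)

    # e.g. a file name arriving from a macOS client in decomposed form:
    stored = clean_input("re\u0301sume\u0301.txt")
    print(stored == "r\u00e9sum\u00e9.txt")  # True once both sides are NFC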