How to detect invisible characters

📖 This guide was prepared by the ToolPazar team. All of our tools are free and ad-free.

The usual suspects

Some of the most frustrating bugs in text processing are caused by characters you literally cannot see. Zero-width spaces, non-breaking spaces, byte-order marks, zero-width joiners, and the exotic tag characters used in Unicode hide in pasted text, survive regex cleanup, and silently break string matching, search indexes, and CSV parsing. A password field rejects your input; a regex doesn’t match what you know is there; a file has a mysterious first character. This guide covers the most common invisible characters, how they sneak into your workflow, and the detection and stripping patterns that actually work.

How they get in

Paste workflows are the main source: text copied from another application brings its invisible characters along with it.

Detecting with a hex dump

The most reliable inspection is to view the bytes in hex. In text that looks like plain ASCII, any byte outside the ASCII range is suspect.
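As a rough illustration of that inspection, the sketch below dumps the UTF-8 bytes of a string and marks rows containing bytes at or above 0x80. The helper name `hex_dump` and the row width are assumptions made for this example, not part of any particular tool.

```python
def hex_dump(text, width=8):
    """Return a hex dump of the UTF-8 bytes of `text`, marking non-ASCII rows."""
    data = text.encode("utf-8")
    pad = width * 3 - 1  # column width of a full row of hex pairs
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Printable ASCII is shown as-is; every other byte becomes a dot.
        ascii_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        flag = "  <-- non-ASCII" if any(b >= 0x80 for b in chunk) else ""
        rows.append(f"{offset:04x}  {hex_part:<{pad}}  {ascii_part}{flag}")
    return "\n".join(rows)


# A zero-width space (U+200B) hidden inside an ordinary-looking word.
print(hex_dump("user\u200bname"))
```

In UTF-8, U+200B appears as the byte sequence `e2 80 8b`, U+00A0 as `c2 a0`, and a leading BOM as `ef bb bf`, so those are the sequences to look for in the dump.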

Regex detection

Match anything that renders with zero or ambiguous width. The pattern should cover the Unicode ranges explicitly listed as “default-ignorable” plus the known invisible-space characters. Extend it with the Tags block (U+E0000–U+E007F) if you’re paranoid about hidden-message attacks.
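A detection pattern along those lines might look like the sketch below. The exact character set is an assumption chosen for illustration (it is not the full default-ignorable list), and `find_invisible` is a hypothetical helper name.

```python
import re

# Assumed set: NBSP, soft hyphen, combining grapheme joiner, Arabic letter mark,
# Mongolian vowel separator, U+2000-U+200F (spaces, zero-width, direction marks),
# U+2028-U+202F (separators, bidi controls, narrow NBSP), U+205F-U+206F,
# ideographic space, and the BOM.
_BASE = r"\u00a0\u00ad\u034f\u061c\u180e\u2000-\u200f\u2028-\u202f\u205f-\u206f\u3000\ufeff"
_TAGS = r"\U000e0000-\U000e007f"  # optional Tags block extension

INVISIBLE = re.compile(f"[{_BASE}]")
INVISIBLE_WITH_TAGS = re.compile(f"[{_BASE}{_TAGS}]")


def find_invisible(text, include_tags=False):
    """Return (index, code point) pairs for every suspicious character."""
    pattern = INVISIBLE_WITH_TAGS if include_tags else INVISIBLE
    return [(m.start(), f"U+{ord(m.group()):04X}") for m in pattern.finditer(text)]


print(find_invisible("a\u200bb\u00a0c"))
```

Returning positions rather than a plain yes/no makes it easier to show the user where the problem is.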

Zero-width characters

U+200B–U+200D and U+FEFF take zero rendering width. They’re functionally invisible, but they still take part in string comparison, search matching, and length counts.

Strip them aggressively during input normalization.
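A minimal sketch of that stripping step, assuming the target set is exactly U+200B–U+200D plus U+FEFF; the name `strip_zero_width` is made up for this example.

```python
import re

# Zero-width space, non-joiner, joiner, and zero-width no-break space (BOM).
ZERO_WIDTH = re.compile(r"[\u200b-\u200d\ufeff]")


def strip_zero_width(text):
    """Remove zero-width characters so visually equal strings compare equal."""
    return ZERO_WIDTH.sub("", text)


pasted = "pass\u200bword"
print(pasted == "password")                    # False: the hidden character counts
print(strip_zero_width(pasted) == "password")  # True after stripping
```

One caveat on the design: U+200D also joins emoji sequences and U+200C is meaningful in some scripts, so this blanket removal suits identifiers, passwords, and search keys better than free-form display text.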

The BOM problem

U+FEFF at the start of a file is a byte-order mark, which some tools write to signal the file’s Unicode encoding. A reader that doesn’t expect it treats it as content, which is where the mysterious first character comes from.
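The effect is easy to reproduce with a small CSV payload. The sketch below decodes the same bytes twice: once with Python's plain `utf-8` codec, which keeps the BOM as a character, and once with the standard `utf-8-sig` codec, which drops a leading BOM if one is present. The sample data and the `read_header` helper are invented for the example.

```python
import csv
import io

# A UTF-8 file as some tools write it: BOM bytes first, then the content.
raw = b"\xef\xbb\xbfname,email\nAda,ada@example.com\n"


def read_header(data, encoding):
    """Decode `data` and return the first CSV row."""
    reader = csv.reader(io.StringIO(data.decode(encoding)))
    return next(reader)


print(read_header(raw, "utf-8"))      # first column name starts with U+FEFF
print(read_header(raw, "utf-8-sig"))  # BOM removed, header is clean
```

With the plain codec, a lookup for a column called `name` fails because the actual key is `\ufeffname`. The `utf-8-sig` codec is also safe on files that have no BOM.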

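Non-breaking space variants

NBSP (U+00A0) is the most common impostor. It looks identical to a regular space but is a different code point, so it breaks exact string matching and splitting on the space character.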

Other space variants to watch: U+2007 (figure space), U+2008 (punctuation space), U+202F (narrow no-break space), U+3000 (ideographic space). Normalize them all to a regular space.
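One way to do that normalization is a translation table, sketched here under the assumption that only NBSP and the four variants listed above need mapping. `normalize_spaces` is a hypothetical name.

```python
# NBSP, figure space, punctuation space, narrow no-break space, ideographic space.
SPACE_VARIANTS = "\u00a0\u2007\u2008\u202f\u3000"
TO_SPACE = str.maketrans({ch: " " for ch in SPACE_VARIANTS})


def normalize_spaces(text):
    """Replace each space look-alike with a regular U+0020 space."""
    return text.translate(TO_SPACE)


price = "10\u00a0000\u202fkg"
print(price.split(" "))                    # one element: no real spaces to split on
print(normalize_spaces(price).split(" "))  # three elements after normalization
```

A one-to-one mapping keeps string length and character offsets unchanged, which helps if error messages need to point back at positions in the original input.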

Tag characters: the hidden-message vector

Unicode’s U+E0020–U+E007F range, part of the Tags block, mirrors ASCII but is default-ignorable. You can encode an entire message in “invisible” tag characters and append it to normal text. It survives most regex cleanup, most UI display, and most copy-paste. The technique is used in watermarking and in some attack scenarios. Strip these characters unless you have a specific reason to keep them.
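To make the mechanism concrete, the sketch below shifts printable ASCII into the tag range by adding 0xE0000, recovers it by subtracting the same offset, and strips the whole Tags block. The function names `hide`, `reveal`, and `strip_tag_chars` are assumptions for this illustration.

```python
import re

TAG_BLOCK = re.compile(r"[\U000e0000-\U000e007f]")


def hide(message):
    """Map printable ASCII to the matching tag characters."""
    return "".join(chr(0xE0000 + ord(c)) for c in message if 0x20 <= ord(c) <= 0x7E)


def reveal(text):
    """Recover any ASCII message carried by tag characters in `text`."""
    return "".join(chr(ord(c) - 0xE0000) for c in text if 0xE0020 <= ord(c) <= 0xE007E)


def strip_tag_chars(text):
    """Remove every character in the Tags block."""
    return TAG_BLOCK.sub("", text)


carrier = "Hello" + hide("secret")
print(len(carrier))              # longer than it looks
print(reveal(carrier))           # the hidden payload
print(strip_tag_chars(carrier))  # only the visible text remains
```

Running `reveal` on suspicious input before stripping it can be useful for logging what was hidden.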

Detection UI patterns

When a user complains “the form says my input is invalid but it looks fine,” show a character-by-character diagnostic.
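Such a diagnostic could be built on Python's standard `unicodedata` module, as in the sketch below. The `describe` helper and the choice of which general categories count as suspicious (format, control, and separator characters other than the regular space) are assumptions for this example.

```python
import unicodedata

# Format, control, and space/line/paragraph separator categories.
SUSPECT_CATEGORIES = {"Cf", "Cc", "Zs", "Zl", "Zp"}


def describe(text):
    """Return (index, code point, Unicode name, suspicious?) for each character."""
    rows = []
    for index, ch in enumerate(text):
        name = unicodedata.name(ch, "<no name>")
        suspect = ch != " " and unicodedata.category(ch) in SUSPECT_CATEGORIES
        rows.append((index, f"U+{ord(ch):04X}", name, suspect))
    return rows


for index, code, name, suspect in describe("jane\u00a0doe\u200b"):
    marker = "  <-- check this" if suspect else ""
    print(f"{index:>3}  {code}  {name}{marker}")
```

Showing the official character name next to the position usually makes the problem self-explanatory, even to non-technical users.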

Prevention at input boundaries

The fix is at ingest, not at query. Apply the cleanup to every user-text input as it arrives, before it is stored or indexed.
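An ingest-time cleanup could be sketched as follows. It is a minimal illustration, assuming the policy is to remove zero-width, BOM, and tag characters and to map the space variants discussed in this guide to a regular space. `clean_input` and `handle_signup` are hypothetical names, not an existing API.

```python
import re

# Characters removed outright: zero-width set, BOM, and the Tags block.
STRIP = re.compile(r"[\u200b-\u200d\ufeff\U000e0000-\U000e007f]")
# Space look-alikes mapped to a regular space.
SPACES = re.compile(r"[\u00a0\u2007\u2008\u202f\u3000]")


def clean_input(text):
    """Normalize one user-supplied string before it is stored or indexed."""
    text = STRIP.sub("", text)
    return SPACES.sub(" ", text)


def handle_signup(form):
    """Hypothetical form handler: clean every field at the boundary."""
    return {field: clean_input(value) for field, value in form.items()}


print(handle_signup({"name": "\ufeffjane\u200b\u00a0doe", "city": "Ankara"}))
```

Cleaning once at the boundary means stored values, search indexes, and later comparisons all see the same normalized text.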
