How To Remove Duplicate Lines

📖 This guide was prepared by the ToolPazar team. All of our tools are free and ad-free.

Exact vs normalized dedup

Dedup looks like the simplest text operation in the world: remove lines that appear more than once. In reality “duplicate” is a spectrum. Is leading whitespace significant? Does case matter? Should the first occurrence win, or the last? Do trailing spaces make two lines different or the same? And what about a file that’s 10 GB and won’t fit in memory? The right answer depends entirely on what you’re cleaning — email lists, log files, source code, shopping lists — and picking the wrong one can silently discard data you needed. This guide walks through every dedup decision and the patterns that handle each.

Case-insensitive dedup

Exact dedup compares bytes. Normalized dedup compares after a transformation — lowercase, trim, collapse whitespace, etc. Real-world lists almost always need some normalization, because real-world sources have inconsistent formatting.

Trimmed comparison

Common for emails, usernames, domains. Build a key by lowercasing, keep the original for output:
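A minimal Python sketch of that idea, assuming the whole list fits in memory. The function name `dedup_case_insensitive` and the sample addresses are illustrative only:

```python
def dedup_case_insensitive(lines):
    """Keep the first occurrence of each line, ignoring case when comparing."""
    seen = set()
    result = []
    for line in lines:
        key = line.lower()  # comparison key only; the original text is what we emit
        if key not in seen:
            seen.add(key)
            result.append(line)
    return result


emails = ["Alice@Example.com", "bob@example.com", "alice@example.com"]
print(dedup_case_insensitive(emails))
```

For text outside plain ASCII, `str.casefold()` may be a better key than `str.lower()`, since it applies more aggressive Unicode case folding.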

Preserve-first vs preserve-last

Leading and trailing whitespace silently differentiates identical content. Trim for the comparison, keep whichever version you prefer for output:
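One way to sketch this in Python. The function name `dedup_trimmed` and the `emit_trimmed` flag are hypothetical; the flag simply chooses between emitting the original line and the trimmed one:

```python
def dedup_trimmed(lines, emit_trimmed=False):
    """Dedup on the whitespace-trimmed form of each line (first occurrence wins)."""
    seen = set()
    result = []
    for line in lines:
        key = line.strip()  # leading/trailing whitespace is ignored for comparison
        if key not in seen:
            seen.add(key)
            # Emit either the cleaned key or the line exactly as it arrived.
            result.append(key if emit_trimmed else line)
    return result


items = ["apple", "  apple ", "pear\t"]
print(dedup_trimmed(items))
print(dedup_trimmed(items, emit_trimmed=True))
```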

Unique vs all-duplicates

For really aggressive matching, also collapse internal whitespace:
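A short sketch of the aggressive variant, assuming that any run of spaces or tabs should count as a single space. The helper names are illustrative:

```python
def whitespace_key(line):
    """Trim the line and collapse every internal whitespace run to one space."""
    # str.split() with no arguments splits on any whitespace and drops empties.
    return " ".join(line.split())


def dedup_collapsed(lines):
    """Dedup on the whitespace-collapsed key, keeping the first original line."""
    seen = set()
    result = []
    for line in lines:
        key = whitespace_key(line)
        if key not in seen:
            seen.add(key)
            result.append(line)
    return result


print(dedup_collapsed(["a  b", " a b ", "a\tb", "ab"]))
```

Note that `"ab"` survives: collapsing whitespace never removes it entirely, so `"a b"` and `"ab"` remain different keys.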

Unix: sort | uniq

When two lines match, which copy do you keep? Default is preserve-first: walk the list, skip anything you’ve seen. Preserve-last requires a second pass:
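Both policies can be sketched as follows, assuming exact comparison and an in-memory list. The function names are illustrative. The preserve-last version makes the two passes explicit: one to record where each line last appears, one to emit only those positions:

```python
def preserve_first(lines):
    """Single pass: emit a line only the first time it is seen."""
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


def preserve_last(lines):
    """Two passes: record each line's last index, then emit only those positions."""
    last_index = {}
    for i, line in enumerate(lines):  # pass 1: later occurrences overwrite earlier ones
        last_index[line] = i
    # pass 2: keep a line only at the position of its final occurrence
    return [line for i, line in enumerate(lines) if last_index[line] == i]


data = ["a", "b", "a", "c", "b"]
print(preserve_first(data))
print(preserve_last(data))
```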

Preserving order with awk

Preserve-first is right for logs (earliest record matters). Preserve-last is right for change feeds (last state wins).

Hash-based keys for long lines

Three possible outputs for a deduplication job:
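A common way to split those outputs, assumed here rather than taken from the guide, is: one copy of every distinct line, only the lines that occur exactly once, and one copy of each line that occurs more than once. The sketch below counts occurrences once and derives all three; the function name and dictionary keys are illustrative:

```python
from collections import Counter


def dedup_outputs(lines):
    """Return three views of the input based on how often each line occurs."""
    counts = Counter(lines)  # preserves first-seen order of keys (Python 3.7+)
    return {
        "distinct": list(counts),                              # one copy of everything
        "unique_only": [l for l, n in counts.items() if n == 1],     # never repeated
        "duplicated_only": [l for l, n in counts.items() if n > 1],  # repeated at least once
    }


out = dedup_outputs(["a", "b", "a", "c", "b", "d"])
for name, values in out.items():
    print(name, values)
```

The same `Counter` also gives the occurrence count per line, which is useful when a count column is wanted next to each distinct line.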

Dedup with count column

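For very long lines, the seen-set can store a fixed-size hash of each line instead of the line itself. Accidental SHA-1 collisions on human text are vanishingly rare; for adversarial input, use SHA-256.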
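A minimal sketch of hash-based keys using Python's `hashlib`, assuming UTF-8 text and preserve-first semantics. It uses SHA-256 throughout, so the set holds 32 bytes per distinct line regardless of line length. The function name is illustrative:

```python
import hashlib


def dedup_by_hash(lines):
    """Dedup using a SHA-256 digest of each line as the membership key."""
    seen = set()
    result = []
    for line in lines:
        # Store the 32-byte digest, not the (possibly very long) line.
        digest = hashlib.sha256(line.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            result.append(line)
    return result


long_a = "x" * 100_000
long_b = "y" * 100_000
print(len(dedup_by_hash([long_a, long_b, long_a])))
```

In a streaming setting the lines would be written out as they pass the check rather than collected in `result`, so only the digests stay in memory.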

CSV dedup by key column

For tabular data, “duplicate” usually means “same value in the key column,” not full-row match. Use a CSV-aware tool:
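As one possible illustration, Python's standard `csv` module handles quoted fields correctly, which naive splitting on commas does not. The sketch assumes a header row, preserve-first semantics, and exact matching on the key; the function name and the `email` column are hypothetical:

```python
import csv
import io


def dedup_csv_by_key(text, key_column):
    """Keep the first row for each distinct value in key_column."""
    reader = csv.DictReader(io.StringIO(text))
    fieldnames = reader.fieldnames or []  # reads the header row
    seen = set()
    rows = []
    for row in reader:
        key = row[key_column]
        if key not in seen:
            seen.add(key)
            rows.append(row)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


sample = (
    "name,email\n"
    '"Smith, Ann",ann@example.com\n'
    "Ann S.,ann@example.com\n"
    "Bob,bob@example.com\n"
)
print(dedup_csv_by_key(sample, "email"))
```

The key can be normalized first (for example lowercased and trimmed) when the column holds values such as email addresses.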

Common mistakes

Run the numbers