How To Count Word Frequency
Tokenization: the first hard choice
How you cut text into words determines every count downstream. Naive whitespace split:
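A minimal sketch of that naive split (plain `str.split`; the sample sentence is only for illustration):

```python
text = "The quick brown fox jumps over the lazy dog. The dog sleeps."

# str.split() with no argument splits on runs of whitespace.
tokens = text.split()

print(tokens)
# 'dog.' and 'sleeps.' keep their periods, and 'The' is distinct from 'the',
# so the same word gets counted under several spellings.
```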
Case folding
Punctuation is attached. Better: split on non-word characters, then lowercase:
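A sketch of that improvement using Python's `re` module; the character class `[\w']+` keeps apostrophes so contractions survive:

```python
import re

text = "Don't panic. The state-of-the-art tokenizer isn't this simple, is it?"

# Lowercase first, then keep runs of word characters and apostrophes;
# commas, periods, and hyphens act as separators.
tokens = re.findall(r"[\w']+", text.lower())
print(tokens)

# Widen the class to r"[\w'-]+" to keep hyphenated compounds together:
hyphen_tokens = re.findall(r"[\w'-]+", text.lower())
print("state-of-the-art" in hyphen_tokens)   # True
```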
This keeps contractions (“don’t”) but strips commas and periods. Add hyphens to the class if you want “state-of-the-art” as one token.
Stop words
The most frequent words in English text are “the, of, and, a, to, in, is, you, that, it, he, was, for, on, are” — rarely interesting. Standard stop-word lists strip them so the remaining counts reflect content.
Customize the list for your domain. SEO stop-word lists usually keep more terms than research-corpus lists.
Stemming vs lemmatization
Both collapse word variants to a single form:
Counting
Trivial with a map:
N-grams: beyond single words
Single-word counts miss phrases. “San Francisco” carries information that “san” + “francisco” separately doesn’t. Bigrams (2-word) and trigrams (3-word) capture this:
Bigram stop-word filtering is trickier — “of the” is noise but “state of the art” is signal. Strip bigrams where both tokens are stop words, keep the rest.
TF-IDF: frequency in context
High TF-IDF = characteristic of the document. Great for tagging, topic extraction, and finding the “gist” words.
SEO application: keyword density
Keyword density = (count of keyword / total words) × 100. Old SEO target was 1–3%. Modern consensus: natural language beats forced density. Use frequency counting as a diagnostic, not a quota.
Style checking
Frequency counts reveal habitual tics: “really,” “just,” “very,” “that” overused as filler. Run your draft through a frequency pass and the top 30 content words show your patterns.
Research and corpus analysis
For larger corpora:
Hapax legomena and Zipf’s law
Common mistakes
Run the numbers
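A sketch of the filter, with a deliberately tiny stop list; real lists (NLTK, spaCy, SEO tools) run to hundreds of entries:

```python
import re

# Illustrative stop list only -- swap in one suited to your domain.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "it", "was", "on"}

text = "The cat sat on the mat, and the mat was warm."
tokens = re.findall(r"[\w']+", text.lower())

# Keep only content words.
content = [t for t in tokens if t not in STOP_WORDS]
print(content)   # ['cat', 'sat', 'mat', 'mat', 'warm']
```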
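As a toy illustration only: a crude suffix stripper, nothing like a real Porter stemmer. True lemmatization needs a dictionary and part-of-speech information, typically via NLTK or spaCy:

```python
def toy_stem(word: str) -> str:
    """Crude suffix stripping -- a toy, not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Require a stem of at least 3 letters so "is" doesn't become "i".
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("runs", "running", "jumped"):
    print(w, "->", toy_stem(w))
```

Note the artifact: "running" stems to "runn". Stemmers trade precision for speed; lemmatizers map "ran" to "run" correctly but need dictionary lookups.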
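In Python the map is a `dict`, or more directly `collections.Counter`; a sketch:

```python
import re
from collections import Counter

text = "the cat and the dog and the bird"
tokens = re.findall(r"[\w']+", text.lower())

counts = Counter(tokens)          # a hash map from token -> count
print(counts.most_common(2))      # [('the', 3), ('and', 2)]
```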
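A sliding-window sketch; joining with spaces is just one convention for representing an n-gram:

```python
def ngrams(tokens, n):
    # Slide a window of width n over the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "flights to san francisco bay area".split()

print(ngrams(tokens, 2))
# ['flights to', 'to san', 'san francisco', 'francisco bay', 'bay area']
print(ngrams(tokens, 3))
```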
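That rule is a one-line predicate; the stop list here is illustrative:

```python
STOP_WORDS = {"the", "of", "a", "to", "in", "and", "is"}

def keep_bigram(bigram: str) -> bool:
    # Drop a bigram only when BOTH tokens are stop words ("of the");
    # mixed bigrams ("state of", "the art") survive.
    w1, w2 = bigram.split()
    return not (w1 in STOP_WORDS and w2 in STOP_WORDS)

bigrams = ["of the", "state of", "the art", "san francisco"]
print([b for b in bigrams if keep_bigram(b)])
# ['state of', 'the art', 'san francisco']
```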
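A from-scratch sketch of one common TF-IDF variant (tf as in-document proportion, idf as log of inverse document frequency; libraries such as scikit-learn apply extra smoothing):

```python
import math

# Toy corpus of pre-tokenized documents.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs and rain".split(),
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)                 # frequency within this doc
    df = sum(1 for d in docs if term in d)          # docs containing the term
    idf = math.log(len(docs) / df)                  # rarer across corpus -> larger
    return tf * idf

print(tf_idf("cat", docs[0], docs))   # high: "cat" is rare across the corpus
print(tf_idf("the", docs[0], docs))   # lower: "the" appears in most docs
```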
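As a quick sketch of both ideas: hapax legomena are the words that occur exactly once, and Zipf's law predicts frequency falling roughly as 1/rank. Both drop out of the same counter:

```python
from collections import Counter

tokens = "the cat and the dog and the bird saw the cat".split()
counts = Counter(tokens)

# Hapax legomena: words with count exactly 1.
hapaxes = [w for w, c in counts.items() if c == 1]
print(hapaxes)   # ['dog', 'bird', 'saw']

# Zipf check: print rank, word, frequency; real corpora show count ~ 1/rank.
for rank, (word, count) in enumerate(counts.most_common(), start=1):
    print(rank, word, count)
```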
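The formula above, directly in code; the sample text is tiny, so the density comes out absurdly high, which is itself the lesson:

```python
import re

def keyword_density(text: str, keyword: str) -> float:
    # (count of keyword / total words) * 100
    tokens = re.findall(r"[\w']+", text.lower())
    return 100 * tokens.count(keyword.lower()) / len(tokens)

text = "Best coffee grinder for espresso. A burr grinder beats a blade grinder."
print(round(keyword_density(text, "grinder"), 1))   # 25.0
```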
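A sketch of that frequency pass; the stop list and draft text are placeholders:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "to", "of", "and", "in", "is", "i", "it", "this"}

draft = """I really just think this draft is really very rough and
I just need to really tighten it."""

tokens = re.findall(r"[\w']+", draft.lower())
content = [t for t in tokens if t not in STOP_WORDS]

# The top content words expose filler habits ("really", "just").
print(Counter(content).most_common(5))
```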
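One memory-bounded sketch: stream the corpus line by line and update a `Counter` incrementally, so memory tracks vocabulary size rather than corpus size (`io.StringIO` stands in here for a real file handle):

```python
import io
import re
from collections import Counter

# Stand-in for open("corpus.txt"): one thousand repeated lines plus one more.
corpus = io.StringIO("the cat sat\n" * 1000 + "a dog ran\n")

counts = Counter()
for line in corpus:                       # never loads the whole corpus at once
    counts.update(re.findall(r"[\w']+", line.lower()))

print(counts["cat"], counts["dog"])       # 1000 1
```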