Clustering
MyNews uses SimHash fingerprinting to deduplicate articles and detect recurring topics.
SimHash Deduplication
SimHash generates a 64-bit fingerprint from article text. Articles whose fingerprints are within a configurable Hamming distance of each other are grouped into clusters.
Article text ▶ Compute SimHash ▶ Compare Hamming distance ▶ Distance < threshold?
- Yes ▶ Same cluster
- No ▶ New cluster
How SimHash Works
- Tokenize the article text into words
- Hash each word to a 64-bit value
- For each bit position, sum +1 (if bit is 1) or -1 (if bit is 0) across all word hashes
- The final fingerprint has bit N set to 1 if the sum at position N is positive, and 0 otherwise
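The steps above can be sketched in Python as follows. This is a minimal illustration, not the MyNews implementation: the function name `simhash`, the regex tokenizer, and the use of BLAKE2b as the per-word hash are all assumptions made for the example.

```python
import hashlib
import re

FINGERPRINT_BITS = 64


def simhash(text: str) -> int:
    """Return a 64-bit SimHash fingerprint of `text` (illustrative sketch)."""
    counts = [0] * FINGERPRINT_BITS

    # Step 1: tokenize into words (assumed tokenizer: lowercase + \w+ runs).
    for word in re.findall(r"\w+", text.lower()):
        # Step 2: hash each word to a 64-bit value (assumed hash: BLAKE2b, 8 bytes).
        digest = hashlib.blake2b(word.encode("utf-8"), digest_size=8).digest()
        word_hash = int.from_bytes(digest, "big")

        # Step 3: per bit position, add +1 if the bit is 1, else -1.
        for position in range(FINGERPRINT_BITS):
            counts[position] += 1 if (word_hash >> position) & 1 else -1

    # Step 4: bit N of the fingerprint is 1 only if the sum at N is positive.
    fingerprint = 0
    for position, total in enumerate(counts):
        if total > 0:
            fingerprint |= 1 << position
    return fingerprint
```

Because the per-bit sums do not depend on word order, two texts with the same words in a different order produce the same fingerprint in this sketch.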
Two articles are considered duplicates if the Hamming distance between their SimHash fingerprints, that is, the number of bit positions in which they differ, is below the configured threshold (roughly 3 bits).
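As a rough sketch of the comparison step, the Hamming distance is the number of 1 bits in the XOR of the two fingerprints. The greedy assignment below is one plausible way to turn that check into clusters; the names `hamming_distance` and `assign_clusters`, the default threshold of 3, and the choice to compare against each cluster's first fingerprint are assumptions, not confirmed MyNews behavior.

```python
from typing import List


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def assign_clusters(fingerprints: List[int], threshold: int = 3) -> List[int]:
    """Greedily label each fingerprint with a cluster id (illustrative only).

    A fingerprint joins the first cluster whose representative is closer
    than `threshold`; otherwise it starts a new cluster.
    """
    representatives: List[int] = []  # first fingerprint seen in each cluster
    labels: List[int] = []
    for fingerprint in fingerprints:
        for cluster_id, representative in enumerate(representatives):
            if hamming_distance(fingerprint, representative) < threshold:
                labels.append(cluster_id)  # Distance < threshold: same cluster
                break
        else:
            representatives.append(fingerprint)  # no match: new cluster
            labels.append(len(representatives) - 1)
    return labels
```

For example, `assign_clusters([0b0000, 0b0011, 0b11110000, 0b0111])` puts the first two fingerprints together (distance 2) and gives the other two their own clusters, since each is 3 or more bits away from every existing representative.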
Recurring Topic Detection
After clustering, the recurrence detector compares today's clusters against articles from the last 3 days. Topics that appear across multiple days are flagged as is_recurring = true.
Recurring topics are placed at the end of published bulletins.
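The sketch below illustrates one way the flagging and ordering could work. The source does not say how today's clusters are matched against earlier articles, so matching by fingerprint distance is an assumption here, as are the `Cluster` type, the `flag_recurring` function, and the threshold of 3. Only the `is_recurring` flag and the end-of-bulletin placement come from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    title: str
    fingerprint: int
    is_recurring: bool = False


def flag_recurring(
    todays_clusters: List[Cluster],
    recent_fingerprints: List[int],  # fingerprints of articles from the last 3 days
    threshold: int = 3,
) -> List[Cluster]:
    """Flag clusters seen on earlier days and move them to the end."""
    for cluster in todays_clusters:
        cluster.is_recurring = any(
            bin(cluster.fingerprint ^ earlier).count("1") < threshold
            for earlier in recent_fingerprints
        )
    # sorted() is stable and False sorts before True, so fresh topics keep
    # their relative order and recurring topics land at the end.
    return sorted(todays_clusters, key=lambda cluster: cluster.is_recurring)
```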
Running
This runs both deduplication and recurrence detection.