Pipeline Architecture
Overview
- FETCH: Load feeds from config → Async HTTP requests → ETag / Last-Modified caching → Parse RSS / Atom XML → Store entries in SQLite
- NORMALIZE: Load unprocessed entries → Readability extraction → HTML → Markdown → Create Article records
- SUMMARIZE: Load unsummarized articles → Build LLM prompt → Call provider via ruby_llm → Store summary on article
- CLUSTER: Compute SimHash fingerprints → Group by Hamming distance → Assign cluster IDs → Detect recurring topics
- PUBLISH: Group articles by theme → Build bulletin content → Write Markdown + HTML files → Push to FreshRSS via Fever API
Stage Details
Fetch
- Uses `async` and `async-http` for concurrent requests (default: 10 concurrent)
- Respects `ETag` and `Last-Modified` headers to skip unchanged feeds
- Custom handlers for sites needing special parsing (Hacker News, Mastodon)
- Optional Tor SOCKS5 proxy support
- Stores raw HTML entries in the `entries` table
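
The ETag / Last-Modified caching step above amounts to an HTTP conditional GET. The following is a minimal sketch of that idea using only the Ruby standard library, not the project's actual fetch code: `FeedCache`, `conditional_headers`, and `unchanged?` are hypothetical names introduced for illustration.

```ruby
require "time"

# Hypothetical cache record: validators saved from a feed's previous response.
FeedCache = Struct.new(:etag, :last_modified, keyword_init: true)

# Build conditional-GET request headers from whichever validators were cached.
def conditional_headers(cache)
  headers = {}
  headers["If-None-Match"] = cache.etag if cache.etag
  headers["If-Modified-Since"] = cache.last_modified.httpdate if cache.last_modified
  headers
end

# A 304 Not Modified status means the feed is unchanged, so parsing can be skipped.
def unchanged?(status)
  status == 304
end
```

A fetcher would merge these headers into each request and, on a 304 response, move on to the next feed without touching the parser or the database.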
Normalize
- Extracts full-text content using the Readability algorithm via `nokogiri`
- Converts clean HTML to Markdown using `reverse_markdown`
- Creates `Article` records linked back to the source `Entry`
Summarize
- Uses the `ruby_llm` gem for unified access to OpenAI, Anthropic, and Gemini
- Applies an Economist-style editor system prompt
- Generates concise summaries stored on each `Article`
Cluster
- Computes 64-bit SimHash fingerprints from article text
- Groups articles by Hamming distance threshold
- Assigns `cluster_id` to duplicate groups
- Detects recurring topics by comparing against articles from the last 3 days
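
To make the fingerprinting and grouping steps concrete, here is a rough pure-Ruby sketch of 64-bit SimHash with Hamming-distance grouping. It is an illustration under stated assumptions, not the project's implementation: the word tokenizer, the MD5-based token hash, the greedy grouping strategy, and the threshold of 3 bits are all choices made for this example.

```ruby
require "digest"

# 64-bit hash of one token: the first 8 bytes of its MD5 digest (an assumption).
def token_hash(token)
  Digest::MD5.digest(token).unpack1("Q>")
end

# SimHash: every token votes +1 or -1 on each bit position;
# a positive total sets that bit in the fingerprint.
def simhash(text)
  weights = Array.new(64, 0)
  text.downcase.scan(/\w+/).each do |token|
    h = token_hash(token)
    64.times { |i| weights[i] += h[i] == 1 ? 1 : -1 }
  end
  weights.each_with_index.sum { |w, i| w.positive? ? (1 << i) : 0 }
end

# Number of bit positions in which two fingerprints differ.
def hamming_distance(a, b)
  (a ^ b).to_s(2).count("1")
end

# Greedy grouping: each fingerprint joins the first cluster whose
# representative is within the threshold, otherwise it starts a new cluster.
def cluster_ids(fingerprints, threshold: 3)
  representatives = []
  fingerprints.map do |fp|
    id = representatives.index { |rep| hamming_distance(rep, fp) <= threshold }
    next id if id

    representatives << fp
    representatives.size - 1
  end
end
```

Near-duplicate texts share most tokens, so most bit votes agree and their fingerprints differ in only a few bits. Under the same assumptions, recurring-topic detection could reuse `hamming_distance` against fingerprints of articles from the previous days.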
Publish
- Groups articles into themed bulletins based on `bulletins.yml` configuration
- Recurring topics placed at the end of each bulletin
- Writes Markdown and HTML output files
- Pushes bulletins to FreshRSS via the Fever API
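
The grouping and ordering rule above can be sketched in a few lines. This example assumes a hypothetical `Article` struct with `theme` and `recurring` fields; in the real pipeline the themes come from `bulletins.yml`, whose schema is not shown here.

```ruby
# Hypothetical article record used only for this illustration.
Article = Struct.new(:title, :theme, :recurring, keyword_init: true)

# Group articles by theme, then move recurring topics to the end of each
# bulletin while keeping the original order within each half.
def build_bulletins(articles)
  articles.group_by(&:theme).transform_values do |group|
    fresh, recurring = group.partition { |article| !article.recurring }
    fresh + recurring
  end
end
```

The result is a hash keyed by theme, which a renderer could then walk to produce the Markdown and HTML files.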
Design Decisions
| Decision | Rationale |
|---|---|
| `ruby_llm` gem | Single gem for all LLM providers; no custom adapter code |
| Sequel over ActiveRecord | Lighter weight, better FTS5 support, no Rails dependency |
| `async` gem over threads | Structured concurrency via Fibers for 200+ HTTP requests |
| Pure Ruby SimHash | Avoids native dependencies; 64-bit fingerprint is fast enough for hundreds of articles |
| Discrete CLI commands | Each pipeline stage independently runnable or chained via pipeline |
| Fever API for FreshRSS | Simple HTTP POST, no special client needed |