Pipeline Architecture

Overview

The pipeline runs five stages in order:

FETCH
  • Load feeds from config
  • Async HTTP requests
  • ETag / Last-Modified caching
  • Parse RSS / Atom XML
  • Store entries in SQLite

NORMALIZE
  • Load unprocessed entries
  • Readability extraction
  • HTML → Markdown
  • Create Article records

SUMMARIZE
  • Load unsummarized articles
  • Build LLM prompt
  • Call provider via ruby_llm
  • Store summary on article

CLUSTER
  • Compute SimHash fingerprints
  • Group by Hamming distance
  • Assign cluster IDs
  • Detect recurring topics

PUBLISH
  • Group articles by theme
  • Build bulletin content
  • Write Markdown + HTML files
  • Push to FreshRSS via Fever API

Stage Details

Fetch

  • Uses async and async-http for concurrent requests (default: 10 concurrent)
  • Respects ETag and Last-Modified headers to skip unchanged feeds
  • Provides custom handlers for sites that need special parsing (Hacker News, Mastodon)
  • Supports an optional Tor SOCKS5 proxy
  • Stores raw HTML entries in the entries table
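
The ETag / Last-Modified handling above can be sketched as a conditional GET. This is a minimal illustration, not the project's code: it uses the standard-library Net::HTTP instead of async-http to stay dependency-free, and the names FeedCache, conditional_headers, and fetch_feed are hypothetical.

```ruby
require "net/http"
require "uri"

# Hypothetical cache record: the validators saved from the previous fetch.
FeedCache = Struct.new(:etag, :last_modified, keyword_init: true)

# Build conditional request headers from whatever validators are cached.
def conditional_headers(cache)
  headers = {}
  headers["If-None-Match"] = cache.etag if cache&.etag
  headers["If-Modified-Since"] = cache.last_modified if cache&.last_modified
  headers
end

# Fetch a feed. Returns nil when the server answers 304 Not Modified,
# otherwise the body plus the new validators to store for next time.
def fetch_feed(url, cache = nil)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri, conditional_headers(cache))
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  return nil if response.is_a?(Net::HTTPNotModified)

  {
    body: response.body,
    cache: FeedCache.new(etag: response["ETag"], last_modified: response["Last-Modified"])
  }
end
```

A nil result means the feed is unchanged, so parsing and storage can be skipped for that feed.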

Normalize

  • Extracts full-text content using the Readability algorithm via nokogiri
  • Converts clean HTML to Markdown using reverse_markdown
  • Creates Article records linked back to the source Entry

Summarize

  • Uses ruby_llm gem for unified access to OpenAI, Anthropic, and Gemini
  • Applies an Economist-style editor system prompt
  • Generates concise summaries stored on each Article

Cluster

  • Computes 64-bit SimHash fingerprints from article text
  • Groups articles whose fingerprints fall within a Hamming distance threshold
  • Assigns cluster_id to duplicate groups
  • Detects recurring topics by comparing against articles from the last 3 days
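
The fingerprinting and grouping steps above can be sketched in pure Ruby. This is an illustration under stated assumptions, not the project's implementation: tokenizing on word characters, hashing tokens with MD5, the greedy grouping strategy, and the threshold of 3 bits are all choices made for this example, and the method names are hypothetical.

```ruby
require "digest"

# Hash a token to a 64-bit integer (first 8 bytes of its MD5 digest).
# MD5 is an assumption; any well-mixed 64-bit hash would do.
def token_hash(token)
  Digest::MD5.digest(token).unpack1("Q>")
end

# 64-bit SimHash: every token votes +1 or -1 on each bit position,
# and the fingerprint keeps the bits with a positive total.
def simhash(text)
  weights = Array.new(64, 0)
  text.downcase.scan(/\w+/).each do |token|
    h = token_hash(token)
    64.times { |i| weights[i] += h[i] == 1 ? 1 : -1 }
  end
  weights.each_with_index.sum { |w, i| w.positive? ? (1 << i) : 0 }
end

# Number of bit positions in which two fingerprints differ.
def hamming_distance(a, b)
  (a ^ b).to_s(2).count("1")
end

# Greedy grouping: each fingerprint joins the first cluster whose
# representative is within the threshold, otherwise it starts a new one.
# Returns one cluster index per input fingerprint.
def cluster(fingerprints, threshold: 3)
  representatives = []
  fingerprints.map do |fp|
    id = representatives.index { |rep| hamming_distance(rep, fp) <= threshold }
    next id if id

    representatives << fp
    representatives.size - 1
  end
end
```

Because the votes are summed per bit, the fingerprint ignores word order, and near-duplicate texts tend to differ in only a few bits. Recurring-topic detection could reuse hamming_distance against fingerprints stored from earlier days.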

Publish

  • Groups articles into themed bulletins based on bulletins.yml configuration
  • Places recurring topics at the end of each bulletin
  • Writes Markdown and HTML output files
  • Pushes bulletins to FreshRSS via the Fever API
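
The grouping and ordering rules above can be sketched as follows. The schema of bulletins.yml is not shown in this document, so the YAML layout here (a bulletin name mapped to a list of categories) is purely hypothetical, as are the Article fields and the build_bulletins name.

```ruby
require "yaml"

# Hypothetical bulletins.yml content; the real schema may differ.
BULLETINS_YAML = <<~YAML
  tech:
    categories: [programming, security]
  world:
    categories: [politics, economy]
YAML

# Hypothetical article shape: only the fields this sketch needs.
Article = Struct.new(:title, :category, :recurring, keyword_init: true)

# Select each bulletin's articles by category, then move recurring
# topics to the end while preserving the original order within each part.
def build_bulletins(articles, config)
  config.to_h do |name, settings|
    selected = articles.select { |a| settings["categories"].include?(a.category) }
    fresh, recurring = selected.partition { |a| !a.recurring }
    [name, fresh + recurring]
  end
end

config = YAML.safe_load(BULLETINS_YAML)
articles = [
  Article.new(title: "Old CVE resurfaces", category: "security", recurring: true),
  Article.new(title: "New compiler release", category: "programming", recurring: false)
]
bulletins = build_bulletins(articles, config)
puts bulletins["tech"].map(&:title)
```

In this example the recurring security story is listed after the new compiler story, even though it appears first in the input.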

Design Decisions

Decision                  | Rationale
--------------------------|----------------------------------------------------------
ruby_llm gem              | Single gem for all LLM providers; no custom adapter code
Sequel over ActiveRecord  | Lighter weight, better FTS5 support, no Rails dependency
async gem over threads    | Structured concurrency via Fibers for 200+ HTTP requests
Pure Ruby SimHash         | Avoids native dependencies; 64-bit fingerprint is fast enough for hundreds of articles
Discrete CLI commands     | Each pipeline stage independently runnable or chained via pipeline
Fever API for FreshRSS    | Simple HTTP POST, no special client needed