ADR-005: RAG-Based Retrieval with Hybrid Search¶
Status: Accepted (Updated for Client-Side Embeddings)
Date: 2025-10-25 (Updated: 2025-10-27)
Decision Makers: Dewayne VanHoozer, Claude (Anthropic)
Architecture Update (October 2025)
Following the reversal of ADR-011, query embeddings are now generated client-side in Ruby using EmbeddingService before being passed to SQL for vector similarity search. This provides a reliable, cross-platform solution.
Quick Summary¶
HTM implements RAG-based retrieval with three search strategies: vector search (semantic), full-text search (keywords), and hybrid search (combined). All strategies include temporal filtering to leverage TimescaleDB's time-series optimization.
Why: Different queries benefit from different approaches. Semantic search handles concepts, full-text handles precise terms, and hybrid provides the best balance for most use cases.
Impact: Flexible retrieval with excellent recall and precision. Client-side embedding generation provides reliable, debuggable operation across all platforms.
Context¶
Traditional memory systems for LLMs face challenges in retrieving relevant information:
- Keyword-only search: Misses semantic relationships ("car" vs "automobile")
- Vector-only search: May miss exact keyword matches ("PostgreSQL 17.2" vs "database")
- No temporal context: Doesn't leverage time-based relevance
- Scalability: Simple linear scans don't scale to thousands of memories
Requirements¶
HTM needs intelligent retrieval that balances:
- Semantic understanding (what does the query mean?)
- Keyword precision (exact term matching)
- Temporal relevance (recent vs historical context)
- Performance (fast retrieval from large datasets)
Alternative Approaches¶
- Pure vector search: Semantic only, no keyword precision
- Pure full-text search: Keywords only, no semantic understanding
- Hybrid search: Combine vector + full-text + temporal filtering
- LLM-as-retriever: Use LLM to generate retrieval queries (slow, expensive)
Decision¶
We will implement RAG-based retrieval with three search strategies: vector, full-text, and hybrid, all with temporal filtering.
Search Strategies¶
1. Vector Search (:vector)
- Generate embedding for query
- Compute cosine similarity with stored embeddings
- Temporal filtering on timeframe
- Best for: Semantic queries, conceptual relationships
2. Full-Text Search (:fulltext)
- PostgreSQL to_tsvector and plainto_tsquery
- ts_rank scoring for relevance
- Temporal filtering on timeframe
- Best for: Exact keywords, technical terms, proper nouns
3. Hybrid Search (:hybrid) - Recommended Default
- Full-text pre-filter to get candidates (top 100)
- Vector reranking of candidates for semantic relevance
- Temporal filtering on timeframe
- Best for: Balanced retrieval with precision + recall
Rationale¶
Why RAG-Based Retrieval?¶
Temporal filtering is foundational:
- "What did we discuss last week?" - time is the primary filter
- Recent context often more relevant than old context
- TimescaleDB optimized for time-range queries
Semantic search handles synonyms:
- User says "database", finds memories about "PostgreSQL"
- "Bug fix" matches "resolved issue"
- Captures conceptual relationships
Full-text handles precision:
- "PostgreSQL 17.2" needs exact version match
- Technical terminology like "pgvector", "HNSW"
- Proper nouns like robot names, project names
Hybrid combines strengths:
- Pre-filter with keywords reduces vector search space
- Vector reranking improves relevance of keyword matches
- Avoids false positives from pure vector search
- Avoids missing results from pure keyword search
Implementation Details¶
Client-Side Embedding Generation
Query embeddings are generated client-side in Ruby via EmbeddingService before being passed to SQL for vector similarity search.
Vector Search¶
def search(timeframe:, query:, limit:, embedding_service:)
  # Generate query embedding client-side
  query_embedding = embedding_service.embed(query)

  # Pad to 2000 dimensions if needed
  query_embedding += Array.new(2000 - query_embedding.length, 0.0) if query_embedding.length < 2000

  # Convert to PostgreSQL vector format
  embedding_str = "[#{query_embedding.join(',')}]"

  # Vector search in database
  conn.exec_params(<<~SQL, [embedding_str, timeframe.begin, timeframe.end, limit])
    SELECT id, content, speaker, type, category, importance, created_at, robot_id, token_count,
           1 - (embedding <=> $1::vector) as similarity
    FROM nodes
    WHERE created_at BETWEEN $2 AND $3
      AND embedding IS NOT NULL
    ORDER BY embedding <=> $1::vector
    LIMIT $4
  SQL
end
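For reference, a usage sketch showing how the returned rows expose the similarity score. This assumes conn is an open PG::Connection; the timeframe range, query text, and EmbeddingService constructor are illustrative, not HTM API guarantees.
# Hypothetical call site -- values are illustrative
results = search(
  timeframe: (Time.now - 7 * 24 * 3600)..Time.now,
  query: "database performance",
  limit: 10,
  embedding_service: EmbeddingService.new
)
results.each do |row|
  puts "#{row['similarity']} | #{row['content'][0, 60]}"
end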
Full-Text Search¶
def search_fulltext(timeframe:, query:, limit:)
  # No embedding needed for full-text search
  conn.exec_params(<<~SQL, [query, timeframe.begin, timeframe.end, limit])
    SELECT *, ts_rank(to_tsvector('english', content), plainto_tsquery('english', $1)) as rank
    FROM nodes
    WHERE created_at BETWEEN $2 AND $3
      AND to_tsvector('english', content) @@ plainto_tsquery('english', $1)
    ORDER BY rank DESC
    LIMIT $4
  SQL
end
Hybrid Search¶
def search_hybrid(timeframe:, query:, limit:, embedding_service:, prefilter_limit: 100)
  # Generate query embedding client-side
  query_embedding = embedding_service.embed(query)
  query_embedding += Array.new(2000 - query_embedding.length, 0.0) if query_embedding.length < 2000
  embedding_str = "[#{query_embedding.join(',')}]"

  # Combine full-text pre-filter with vector reranking
  conn.exec_params(<<~SQL, [embedding_str, timeframe.begin, timeframe.end, query, prefilter_limit, limit])
    WITH candidates AS (
      SELECT id, content, speaker, type, category, importance, created_at, robot_id, token_count, embedding
      FROM nodes
      WHERE created_at BETWEEN $2 AND $3
        AND to_tsvector('english', content) @@ plainto_tsquery('english', $4)
        AND embedding IS NOT NULL
      ORDER BY ts_rank(to_tsvector('english', content), plainto_tsquery('english', $4)) DESC
      LIMIT $5 -- Pre-filter to top-ranked candidates
    )
    SELECT id, content, speaker, type, category, importance, created_at, robot_id, token_count,
           1 - (embedding <=> $1::vector) as similarity
    FROM candidates
    ORDER BY embedding <=> $1::vector
    LIMIT $6 -- Final top results
  SQL
end
User API¶
# Use hybrid search (recommended)
memories = htm.recall(
  timeframe: "last week",
  topic: "PostgreSQL performance",
  limit: 20,
  strategy: :hybrid # default recommended
)

# Use pure vector search
memories = htm.recall(
  timeframe: "last month",
  topic: "database design philosophy",
  strategy: :vector # best for conceptual queries
)

# Use pure full-text search
memories = htm.recall(
  timeframe: "yesterday",
  topic: "PostgreSQL 17.2 upgrade",
  strategy: :fulltext # best for exact keywords
)
Consequences¶
Positive¶
- Flexible retrieval: Choose strategy based on query type
- Temporal context: Time-range filtering built into all strategies
- Semantic understanding: Vector search captures relationships
- Keyword precision: Full-text search handles exact matches
- Balanced hybrid: Best of both worlds with pre-filter optimization
- Scalable: HNSW indexing on vectors, GIN indexing on tsvectors
- Transparent scoring: Return similarity/rank scores for debugging
Negative¶
- Complexity: Three strategies to understand and choose from
- Embedding latency: Vector/hybrid require embedding generation
- Storage overhead: Both embeddings and full-text indexes
- English-only: Full-text optimized for English language
- Tuning required: Hybrid prefilter_limit may need adjustment
Neutral¶
- Strategy selection: User must choose appropriate strategy
- Timeframe parsing: Natural language time parsing adds complexity
- Embedding consistency: Different embedding models produce different results
Use Cases¶
Use Case 1: Semantic Concept Retrieval¶
# Query: "What architectural decisions have we made?"
# Best strategy: :vector (semantic concept matching)
memories = htm.recall(
  timeframe: "last month",
  topic: "architectural decisions design choices",
  strategy: :vector
)
# Finds: "We decided to use PostgreSQL", "Chose two-tier memory model", etc.
# Matches conceptually even without exact keywords
Use Case 2: Exact Technical Term¶
# Query: "Find all mentions of PostgreSQL 17.2"
# Best strategy: :fulltext (exact version number)
memories = htm.recall(
  timeframe: "this week",
  topic: "PostgreSQL 17.2",
  strategy: :fulltext
)
# Finds: Exact "PostgreSQL 17.2" mentions
# Avoids false matches to "PostgreSQL 16" or generic "database"
Use Case 3: Balanced Query¶
# Query: "What did we discuss about database performance?"
# Best strategy: :hybrid (keyword + semantic)
memories = htm.recall(
  timeframe: "last week",
  topic: "database performance optimization",
  strategy: :hybrid
)
# Pre-filters: Documents containing "database", "performance", "optimization"
# Reranks: By semantic similarity to full query
# Result: Best balance of precision + recall
Use Case 4: Conversation Timeline¶
# Get chronological conversation about a topic
timeline = htm.conversation_timeline("HTM design", limit: 50)
# Returns memories sorted by created_at
# Useful for replaying decision evolution over time
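For context, a minimal sketch of what conversation_timeline could look like internally, based only on the description above (full-text match on topic, chronological order); this is an assumption, not the actual HTM source.
# Hypothetical implementation sketch -- actual HTM internals may differ
def conversation_timeline(topic, limit: 50)
  conn.exec_params(<<~SQL, [topic, limit])
    SELECT id, content, speaker, created_at
    FROM nodes
    WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $1)
    ORDER BY created_at ASC
    LIMIT $2
  SQL
end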
Performance Characteristics¶
Client-Side Embedding Generation
Embeddings are generated client-side before SQL queries. Latency includes HTTP call to Ollama/OpenAI for embedding generation.
Vector Search¶
- Latency: ~30-50ms for client-side embedding + index lookup
- Index: HNSW (Hierarchical Navigable Small World)
- Scalability: approximately O(log n) with HNSW (sublinear in dataset size)
- Best case: Conceptual queries, semantic relationships
- Breakdown: ~20-30ms embedding generation, ~10-20ms vector search
Full-Text Search¶
- Latency: ~5-20ms (no embedding generation)
- Index: GIN (Generalized Inverted Index) on tsvector
- Scalability: O(log n) with GIN index
- Best case: Exact keywords, technical terms
- Benefit: Fastest option when embeddings not needed
Hybrid Search¶
- Latency: Full-text pre-filter + client-side embedding + vector reranking
- Total: ~35-70ms
- Optimization: Pre-filter reduces vector search space
- Best case: Large datasets where full-text can narrow candidates
- Breakdown: ~20-30ms embedding, ~5-10ms full-text, ~10-30ms vector reranking
Temporal Filtering¶
- Optimization: TimescaleDB hypertable partitioning by time
- Index: B-tree on created_at column
- Benefit: Prunes partitions outside timeframe, faster scans
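For reference, the supporting indexes described in this section could be created roughly as follows, using standard TimescaleDB, pgvector, and PostgreSQL DDL. This is a sketch; index names are illustrative and HTM's actual migrations may differ.
# Sketch of supporting DDL (index names are illustrative)
conn.exec(<<~SQL)
  -- Hypertable partitioned on created_at, so timeframe filters prune partitions
  SELECT create_hypertable('nodes', 'created_at', if_not_exists => TRUE);

  -- HNSW index for approximate nearest-neighbor search over embeddings
  CREATE INDEX IF NOT EXISTS nodes_embedding_idx
    ON nodes USING hnsw (embedding vector_cosine_ops);

  -- GIN index over the tsvector expression used by full-text queries
  CREATE INDEX IF NOT EXISTS nodes_content_fts_idx
    ON nodes USING gin (to_tsvector('english', content));
SQL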
Design Decisions¶
Decision: Three Strategies Instead of One¶
Rationale: Different queries benefit from different approaches. Give users flexibility.
Alternative: Single hybrid strategy for all queries
Rejected: Forces hybrid approach even when pure vector or full-text is better
Decision: Temporal Filtering is Mandatory¶
Rationale: HTM is time-oriented. All retrieval should consider temporal context.
Alternative: Optional timeframe parameter
Rejected: Easy to forget, defeats TimescaleDB optimization benefits
Decision: Hybrid Pre-filter Limit = 100¶
Rationale: Balances recall (enough candidates) with performance (vector search cost)
Alternative: Dynamic limit based on result count
Deferred: Can optimize later based on real-world usage patterns
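Because search_hybrid already exposes prefilter_limit as a keyword argument, widening the candidate pool is a call-site change (tf, q, and svc are illustrative placeholders):
# Widen the pre-filter pool for recall-sensitive queries (values illustrative)
search_hybrid(timeframe: tf, query: q, limit: 20, embedding_service: svc, prefilter_limit: 250)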
Decision: Return Similarity/Rank Scores¶
Rationale: Enables debugging, threshold filtering, and understanding retrieval quality
Alternative: Just return nodes without scores
Rejected: Lose valuable signal for debugging and optimization
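As an illustration, the returned scores make threshold filtering a one-liner. The 0.7 cutoff and hash-style row access are assumptions, not HTM API guarantees.
# Hypothetical: keep only high-confidence matches using the returned similarity score
memories = htm.recall(timeframe: "last week", topic: "PostgreSQL", strategy: :hybrid)
confident = memories.select { |m| m['similarity'].to_f >= 0.7 }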
Risks and Mitigations¶
Risk: Wrong Strategy Selection¶
User chooses vector for exact keyword query (poor results)
Likelihood: Medium (requires understanding differences)
Impact: Medium (degraded retrieval quality)
Mitigation:
- Default to hybrid for balanced results
- Document use cases clearly
- Provide examples in API docs
- Consider auto-detection in future
Risk: Embedding Latency¶
Vector/hybrid slow due to embedding generation
Likelihood: High (embedding is I/O bound)
Impact: Medium (100-500ms for Ollama)
Mitigation:
- Cache embeddings for common queries (future)
- Use fast local embedding models served by Ollama
- Provide fallback to full-text if embedding fails
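A minimal sketch of the full-text fallback, assuming embedding failures surface as raised exceptions; the method name and error handling are illustrative.
# Hypothetical fallback: degrade gracefully to full-text when embedding fails
def search_with_fallback(timeframe:, query:, limit:, embedding_service:)
  search(timeframe: timeframe, query: query, limit: limit, embedding_service: embedding_service)
rescue StandardError => e
  warn "Embedding generation failed (#{e.message}); falling back to full-text search"
  search_fulltext(timeframe: timeframe, query: query, limit: limit)
end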
Risk: Language Limitation¶
Full-text search optimized for English only
Likelihood: Low (single-user, likely English)
Impact: High (non-English users)
Mitigation:
- Document English assumption
- Support language parameter in future
- Vector search language-agnostic (works for all languages)
Risk: Pre-filter Misses Results¶
Hybrid pre-filter (100) misses relevant candidates
Likelihood: Low (100 is generous)
Impact: Medium (reduced recall)
Mitigation:
- Make prefilter_limit configurable
- Monitor recall metrics in practice
- Adjust default if needed
Future Enhancements¶
Query Auto-Detection¶
# Automatically choose strategy based on query
htm.recall_smart(timeframe: "last week", topic: "PostgreSQL 17.2")
# Detects version number → uses :fulltext
htm.recall_smart(timeframe: "last month", topic: "architectural philosophy")
# Detects conceptual query → uses :vector
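One possible heuristic behind recall_smart, purely illustrative; HTM does not implement this today and the regexes are assumptions.
# Hypothetical strategy detection -- the regexes are illustrative heuristics
def detect_strategy(topic)
  return :fulltext if topic.match?(/\d+(\.\d+)+/)       # version numbers like "17.2"
  return :fulltext if topic.match?(/[A-Z][a-z]+[A-Z]/)  # CamelCase terms like "TimescaleDB"
  return :vector   if topic.split.length >= 4           # longer conceptual phrases
  :hybrid                                               # balanced default
end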
Re-ranking Strategies¶
# Custom re-ranking based on multiple signals
memories = htm.recall(
  timeframe: "last week",
  topic: "PostgreSQL",
  strategy: :hybrid,
  rerank: [:similarity, :importance, :recency] # Multi-factor scoring
)
Query Expansion¶
# LLM-powered query expansion
original = "database"
expanded = ["database", "PostgreSQL", "TimescaleDB", "SQL", "storage"]
memories = htm.recall(
  timeframe: "last month",
  topic: expanded,
  strategy: :fulltext
)
Caching Layer¶
# Cache embedding generation for common queries
@embedding_cache = {}

def cached_embedding(query)
  @embedding_cache[query] ||= embedding_service.embed(query)
end
Alternatives Comparison¶
| Approach | Pros | Cons | Decision |
|---|---|---|---|
| Hybrid Search | Balanced precision + recall | Strategy selection | ACCEPTED |
| Pure Vector Only | Simplest API, semantic | Misses exact matches, slower | Rejected |
| Pure Full-Text Only | Fast, no embeddings | No semantic understanding | Rejected |
| LLM-as-Retriever | Most flexible queries | Too slow, expensive | Rejected |
| Elasticsearch | Dedicated search engine | Additional infrastructure | Rejected |
References¶
- RAG (Retrieval-Augmented Generation)
- pgvector Documentation
- PostgreSQL Full-Text Search
- HNSW Algorithm
- Hybrid Search Best Practices
- ADR-001: PostgreSQL Storage
- ADR-003: Ollama Embeddings - Superseded by ADR-011
- ADR-011: Database-Side Embedding Generation with pgai - Superseded (returned to client-side)
- Search Strategies Guide
Review Notes¶
AI Engineer: Hybrid search is the right approach for RAG systems. Pre-filter optimization is smart.
Database Architect: TimescaleDB + pgvector + full-text is well-architected. Consider query plan analysis for optimization.
Performance Specialist: HNSW and GIN indexes will scale. Monitor embedding latency in production.
Systems Architect: Three strategies provide good flexibility. Document decision matrix clearly for users.
Ruby Expert: Clean API design. Consider strategy as default parameter: recall(..., strategy: :hybrid)