# Architecture Overview
HTM (Hierarchical Temporal Memory) implements a sophisticated two-tier memory system designed specifically for LLM-based applications ("robots"). This architecture enables robots to maintain long-term context across sessions while managing token budgets efficiently.
## System Overview
HTM provides intelligent memory management through five core components that work together to deliver persistent, searchable, and context-aware memory for AI agents.
## Core Components
### HTM (Main Interface)
The HTM class is the primary interface for memory operations. It coordinates between working memory, long-term memory, and embedding services to provide a unified API.
Key Responsibilities:
- Initialize and coordinate all memory subsystems
- Manage robot identification and registration
- Generate embeddings for new memories
- Orchestrate recall operations with RAG-based retrieval
- Assemble context for LLM consumption
- Track memory statistics and robot activity
Related ADRs: ADR-002, ADR-008
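A minimal usage sketch ties these responsibilities together. The method names add_node, recall, and forget appear throughout this page; the constructor arguments and keyword parameters shown here are illustrative assumptions, so check the API Reference for exact signatures.

```ruby
require "htm"

# Constructor arguments are illustrative assumptions; see the API Reference.
htm = HTM.new(robot_name: "support_bot")

# Persist a memory to long-term storage and cache it in working memory.
node_id = htm.add_node("user_preference", "User prefers concise answers")

# RAG-based recall: semantic topic match plus a timeframe filter.
memories = htm.recall(timeframe: "last week", topic: "user preferences")
memories.each { |m| puts m }
```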
### Working Memory
Token-limited, in-memory storage for active conversation context. Working memory acts as a fast cache for recently accessed or highly important memories that the LLM needs immediate access to.
Characteristics:
- Capacity: Token-limited (default: 128,000 tokens)
- Storage: Ruby Hash (in-memory)
- Eviction: Hybrid importance + recency (LRU-based)
- Lifetime: Process lifetime
- Access Time: O(1) hash lookups
Related ADRs: ADR-002, ADR-007
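The sketch below illustrates the idea behind these characteristics: a plain Hash-backed store with a token budget and hybrid importance + recency eviction. It is a simplified stand-in, not HTM's actual WorkingMemory class, and every name in it is illustrative.

```ruby
# Illustrative stand-in for a token-limited working memory; not HTM's implementation.
class TinyWorkingMemory
  Entry = Struct.new(:value, :tokens, :importance, :last_access)

  def initialize(token_limit: 128_000)
    @token_limit = token_limit
    @entries     = {}   # plain Ruby Hash => O(1) lookups
    @used_tokens = 0
  end

  def add(key, value, tokens, importance: 0.5)
    evict_until_fits(tokens)
    @entries[key] = Entry.new(value, tokens, importance, Time.now)
    @used_tokens += tokens
  end

  def fetch(key)
    entry = @entries[key]
    return nil unless entry

    entry.last_access = Time.now   # LRU-style recency update
    entry.value
  end

  private

  # Drop the lowest-value entries first: low importance, then least recently used.
  # In HTM, eviction only removes the working-memory copy; the node stays
  # recallable from long-term storage.
  def evict_until_fits(incoming_tokens)
    while @used_tokens + incoming_tokens > @token_limit && !@entries.empty?
      key, entry = @entries.min_by { |_, e| [e.importance, e.last_access] }
      @entries.delete(key)
      @used_tokens -= entry.tokens
    end
  end
end
```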
### Long-Term Memory
Durable PostgreSQL storage for permanent knowledge retention. Every memory is persisted here and remains until it is explicitly deleted.
Characteristics:
- Capacity: Effectively unlimited
- Storage: PostgreSQL with TimescaleDB extension
- Retention: Permanent (explicit deletion only)
- Access Pattern: RAG-based retrieval (semantic + temporal)
- Lifetime: Forever
Related ADRs: ADR-001, ADR-005
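For orientation, the snippet below creates a simplified nodes table with a pgvector column and converts it into a TimescaleDB hypertable using the pg gem. The schema (column names, vector dimension, connection details) is a hypothetical sketch, not HTM's real migrations.

```ruby
require "pg"

conn = PG.connect(dbname: "htm_development")  # connection details are assumptions

# Hypothetical, simplified schema; HTM's real migrations define the actual columns.
conn.exec(<<~SQL)
  CREATE TABLE IF NOT EXISTS nodes (
    id         BIGSERIAL,
    robot_id   TEXT        NOT NULL,
    key        TEXT        NOT NULL,
    value      TEXT        NOT NULL,
    embedding  vector(768),          -- pgvector column; dimension depends on the model
    importance REAL        DEFAULT 0.5,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (id, created_at)     -- hypertables need the time column in unique keys
  );
  SELECT create_hypertable('nodes', 'created_at', if_not_exists => TRUE);
SQL
```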
### Embedding Service
Generates vector embeddings for semantic search and manages token counting for memory management.
Supported Providers:
- Ollama (default): Local embedding models (gpt-oss, nomic-embed-text, mxbai-embed-large)
- OpenAI: text-embedding-3-small, text-embedding-3-large
- Cohere: embed-english-v3.0, embed-multilingual-v3.0
- Local: Transformers.js for browser/edge deployment
Related ADRs: ADR-003
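Configuration might look like the following. The embed and count_tokens calls match the interaction flow shown later on this page; the class name, constructor options, and model choice are assumptions.

```ruby
# The class name, constructor options, and model are assumptions;
# embed and count_tokens are the operations HTM calls during add_node.
service = HTM::EmbeddingService.new(
  provider: :ollama,              # or :openai, :cohere, :local
  model:    "nomic-embed-text"
)

vector = service.embed("PostgreSQL 16 is required for HTM")
tokens = service.count_tokens("PostgreSQL 16 is required for HTM")

puts "#{vector.length}-dimensional embedding, #{tokens} tokens"
```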
### Database
PostgreSQL 16+ with extensions for time-series optimization, vector similarity search, and full-text search.
Key Extensions:
- TimescaleDB: Hypertable partitioning, compression policies, time-range optimization
- pgvector: Vector similarity search with HNSW indexing
- pg_trgm: Trigram-based fuzzy text matching
Related ADRs: ADR-001
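For context, enabling these extensions and building matching indexes could look roughly like this, continuing the hypothetical nodes table sketched above; the index definitions are illustrative, and HTM's migrations define the real ones.

```ruby
require "pg"

conn = PG.connect(dbname: "htm_development")  # connection details are assumptions

# Extensions used by HTM: TimescaleDB, pgvector, pg_trgm.
%w[timescaledb vector pg_trgm].each do |ext|
  conn.exec("CREATE EXTENSION IF NOT EXISTS #{ext}")
end

# Hypothetical indexes on the simplified nodes table sketched earlier.
conn.exec(<<~SQL)
  -- HNSW index for approximate nearest-neighbour vector search
  CREATE INDEX IF NOT EXISTS nodes_embedding_idx
    ON nodes USING hnsw (embedding vector_cosine_ops);

  -- Trigram index for fuzzy text matching
  CREATE INDEX IF NOT EXISTS nodes_value_trgm_idx
    ON nodes USING gin (value gin_trgm_ops);
SQL
```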
## Component Interaction Flow
### Adding a Memory
```mermaid
sequenceDiagram
    participant User
    participant HTM
    participant EmbeddingService
    participant LongTermMemory
    participant WorkingMemory
    participant Database

    User->>HTM: add_node(key, value, ...)
    HTM->>EmbeddingService: embed(value)
    EmbeddingService-->>HTM: embedding vector
    HTM->>EmbeddingService: count_tokens(value)
    EmbeddingService-->>HTM: token_count
    HTM->>LongTermMemory: add(key, value, embedding, ...)
    LongTermMemory->>Database: INSERT INTO nodes
    Database-->>LongTermMemory: node_id
    LongTermMemory-->>HTM: node_id
    HTM->>WorkingMemory: add(key, value, token_count, ...)
    Note over WorkingMemory: Evict if needed
    WorkingMemory-->>HTM: success
    HTM-->>User: node_id
```
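Expressed as simplified Ruby, the same flow looks roughly like the method below. The collaborator and method names mirror the diagram; this is a paraphrase, not the actual source.

```ruby
# Paraphrase of the add flow above as it might appear inside the HTM class;
# collaborator and method names follow the diagram, not the actual source.
class SketchHTM
  def initialize(embedding_service, long_term_memory, working_memory)
    @embedding_service = embedding_service
    @long_term_memory  = long_term_memory
    @working_memory    = working_memory
  end

  def add_node(key, value, importance: 0.5)
    embedding   = @embedding_service.embed(value)        # vector for semantic search
    token_count = @embedding_service.count_tokens(value) # working-memory budget accounting

    # Durable first: every node is persisted to long-term memory.
    node_id = @long_term_memory.add(key, value, embedding, importance: importance)

    # Then cache it in working memory, evicting low-value entries if needed.
    @working_memory.add(key, value, token_count, importance: importance)

    node_id
  end
end
```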
### Recalling Memories
```mermaid
sequenceDiagram
    participant User
    participant HTM
    participant LongTermMemory
    participant EmbeddingService
    participant Database
    participant WorkingMemory

    User->>HTM: recall(timeframe, topic, ...)
    HTM->>EmbeddingService: embed(topic)
    EmbeddingService-->>HTM: query_embedding
    HTM->>LongTermMemory: search(timeframe, embedding, ...)
    LongTermMemory->>Database: SELECT with vector similarity
    Database-->>LongTermMemory: matching nodes
    LongTermMemory-->>HTM: recalled_memories
    loop For each recalled memory
        HTM->>WorkingMemory: add(memory)
        Note over WorkingMemory: Evict old memories if needed
    end
    HTM-->>User: recalled_memories
```
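A matching sketch of the recall flow, again paraphrased from the diagram and continuing the SketchHTM class above; the search options (limit, keyword arguments) are assumptions.

```ruby
# Companion to the SketchHTM class above: the recall flow in the same style.
class SketchHTM
  def recall(timeframe:, topic:, limit: 20)
    query_embedding = @embedding_service.embed(topic)

    recalled = @long_term_memory.search(
      timeframe: timeframe,        # temporal filter
      embedding: query_embedding,  # semantic filter
      limit:     limit             # illustrative option
    )

    # Promote results into working memory so they are available for context
    # assembly; old or low-importance entries may be evicted to make room.
    recalled.each do |memory|
      @working_memory.add(memory.key, memory.value, memory.token_count,
                          importance: memory.importance)
    end

    recalled
  end
end
```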
## Key Architectural Principles
### 1. Never Forget (Unless Told)
HTM implements a "never forget" philosophy. Eviction from working memory moves data to long-term storage; it does not delete it. Only explicit forget(key, confirm: :confirmed) operations delete data.
Design Principle
Memory eviction is about managing working memory tokens, not data deletion. All evicted memories remain searchable and recallable from long-term storage.
Related ADRs: ADR-009
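Concretely, the only destructive operation is the confirmed forget call; eviction and process restarts never remove long-term data. A small illustration, reusing the htm object from the earlier usage sketch (the behaviour of an unconfirmed call is an assumption here):

```ruby
# Eviction never deletes; only a confirmed forget does.
htm.forget("stale_api_key", confirm: :confirmed)  # permanently removed from long-term memory

# Without confirmation the call should not delete anything; whether it raises
# or is a no-op is an assumption here rather than documented behaviour.
htm.forget("stale_api_key")
```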
### 2. Two-Tier Memory Hierarchy
Working memory provides fast O(1) access to recent/important context, while long-term memory provides unlimited durable storage with RAG-based retrieval.
Performance Benefit
This architecture balances the competing needs of fast access (working memory) and unlimited retention (long-term memory).
Related ADRs: ADR-002
### 3. Hive Mind Architecture
All robots share a global long-term memory database, enabling cross-robot learning and context continuity. Each robot maintains its own working memory for process isolation.
Multi-Robot Collaboration
Knowledge gained by one robot benefits all robots. Users never need to repeat information across sessions or robots.
Related ADRs: ADR-004
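As an illustration (constructor arguments assumed, as in the earlier sketch), two robot processes pointed at the same database share everything they learn:

```ruby
# Two independent robot processes, one shared long-term memory
# (constructor arguments assumed, as in the earlier sketch).
research_bot = HTM.new(robot_name: "research_bot")
support_bot  = HTM.new(robot_name: "support_bot")

# research_bot learns something...
research_bot.add_node("release_policy", "Deploys are frozen on Fridays")

# ...and support_bot can recall it later, because long-term memory is global.
support_bot.recall(timeframe: "this month", topic: "deployment policy")
```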
### 4. RAG-Based Retrieval
HTM uses Retrieval-Augmented Generation patterns with hybrid search strategies combining semantic similarity (vector search) and temporal relevance (time-range filtering).
Search Strategies
- Vector: Pure semantic similarity
- Full-text: Keyword-based search
- Hybrid: Combines both with RRF scoring
Related ADRs: ADR-005
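Reciprocal Rank Fusion itself is a small calculation. The sketch below shows the general idea of merging a vector-ranked list with a full-text-ranked list; the constant k = 60 is the conventional default, not necessarily what HTM uses.

```ruby
# Generic Reciprocal Rank Fusion: merge two ranked lists of node ids.
# k = 60 is the commonly used constant; HTM's exact scoring may differ.
def rrf_merge(vector_ranked, fulltext_ranked, k: 60)
  scores = Hash.new(0.0)

  [vector_ranked, fulltext_ranked].each do |ranking|
    ranking.each_with_index do |node_id, rank|
      scores[node_id] += 1.0 / (k + rank + 1)
    end
  end

  scores.sort_by { |_, score| -score }.map(&:first)
end

rrf_merge([:a, :b, :c], [:c, :a, :d])  # => [:a, :c, :b, :d]
```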
### 5. Importance-Weighted Eviction
Working memory eviction prioritizes low-importance older memories first, preserving critical context even if it's old.
Token Budget Management
Eviction is inevitable with finite token limits. The hybrid importance + recency strategy ensures the most valuable memories stay in working memory.
Related ADRs: ADR-007
## Memory Lifecycle
```mermaid
stateDiagram-v2
    [*] --> Created: add_node()
    Created --> InWorkingMemory: Add to WM
    Created --> InLongTermMemory: Persist to LTM
    InWorkingMemory --> Evicted: Token limit reached
    Evicted --> InLongTermMemory: Mark as evicted
    InLongTermMemory --> Recalled: recall()
    Recalled --> InWorkingMemory: Add back to WM
    InWorkingMemory --> [*]: Process ends
    InLongTermMemory --> Forgotten: forget(confirm: :confirmed)
    Forgotten --> [*]: Permanently deleted

    note right of InWorkingMemory
        Fast O(1) access
        Token-limited
        Process-local
    end note

    note right of InLongTermMemory
        Durable PostgreSQL
        Unlimited capacity
        RAG retrieval
    end note
```
## Architecture Documents
Explore detailed architecture documentation:
- Detailed Architecture - Deep dive into system architecture, data flows, and performance characteristics
- Two-Tier Memory System - Working memory and long-term memory design, eviction strategies, and context assembly
- Hive Mind Architecture - Multi-robot shared memory, robot identification, and cross-robot knowledge sharing
## Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Language | Ruby 3.2+ | Core implementation |
| Database | PostgreSQL 16+ | Relational storage |
| Time-Series | TimescaleDB | Hypertable partitioning, compression |
| Vector Search | pgvector | Semantic similarity (HNSW) |
| Full-Text | pg_trgm | Fuzzy text matching |
| Embeddings | Ollama/OpenAI | Vector generation |
| Connection Pool | connection_pool gem | Database connection management |
| Testing | Minitest | Test framework |
## Performance Characteristics
### Working Memory
- Add: O(1) amortized (eviction is O(n log n) when needed)
- Retrieve: O(1) hash lookup
- Context Assembly: O(n log n) for sorting, O(k) for selecting
- Typical Size: 50-200 nodes (~128K tokens)
### Long-Term Memory
- Add: O(log n) with PostgreSQL indexes
- Vector Search: O(log n) with HNSW (approximate)
- Full-Text Search: O(log n) with GIN indexes
- Hybrid Search: O(log n) + merge
- Typical Size: Thousands to millions of nodes
### Overall System
- Memory Addition: < 100ms (including embedding generation)
- Recall Operation: < 200ms (typical hybrid search)
- Context Assembly: < 10ms (working memory sort)
- Eviction: < 10ms (rare, only when working memory full)
## Scalability Considerations
### Vertical Scaling
- Working Memory: Limited by process RAM (~1-2GB for 128K tokens)
- Database: PostgreSQL scales to TBs with proper indexing
- Embeddings: Local models (Ollama) bounded by GPU/CPU
### Horizontal Scaling
- Multiple Robots: Each robot process has independent working memory
- Database: Single shared PostgreSQL instance (can add replicas)
- Read Replicas: For query scaling (future consideration)
- Sharding: By robot_id or timeframe (future consideration)
Scaling Strategy
Start with a single PostgreSQL instance. Add read replicas when query load increases. Consider partitioning by robot_id for multi-tenant scenarios.
## Related Documentation
- Installation Guide - Set up PostgreSQL, TimescaleDB, and dependencies
- Quick Start - Get started with HTM in 5 minutes
- API Reference - Complete API documentation
- Architecture Decision Records - Detailed decision history
## Architecture Reviews
All architecture decisions are documented in ADRs and reviewed by domain experts:
- Systems Architect: Overall system design and scalability
- Database Architect: PostgreSQL schema and query optimization
- AI Engineer: Embedding strategies and RAG implementation
- Performance Specialist: Latency and throughput analysis
- Ruby Expert: Idiomatic Ruby patterns and best practices
- Security Specialist: Data privacy and access control
See Architecture Decision Records for complete review notes.