ADR-002: Two-Tier Memory Architecture

Status: Accepted

Date: 2025-10-25

Decision Makers: Dewayne VanHoozer, Claude (Anthropic)


Quick Summary

HTM implements a two-tier memory architecture with token-limited working memory (hot tier) and unlimited long-term memory (cold tier), managing LLM context windows while preserving all historical data through RAG-based retrieval.

Why: LLMs have limited context windows but need awareness across long conversations. Two tiers provide fast access to recent context while maintaining complete history.

Impact: Efficient token budget management with never-forget guarantees, at the cost of coordination between two storage layers.


Context

LLM-based applications face a fundamental challenge: LLMs have limited context windows (typically 128K-200K tokens) but need to maintain awareness across long conversations and sessions spanning days, weeks, or months.

Requirements

  • Persist memories across sessions (durable storage)
  • Provide fast access to recent/relevant context
  • Manage token budgets efficiently
  • Never lose data accidentally
  • Support contextual recall from the past

Alternative Approaches

  1. Database-only: Store everything in PostgreSQL, load on demand
  2. Memory-only: Keep everything in RAM, serialize on shutdown
  3. Two-tier: Combine fast working memory with durable long-term storage
  4. External service: Use a managed memory service

Decision

We will implement a two-tier memory architecture with:

  • Working Memory: Token-limited, in-memory active context
  • Long-term Memory: Durable PostgreSQL storage

Rationale

Working Memory (Hot Tier)

Characteristics:

  • Purpose: Immediate context for the LLM
  • Storage: In-memory Ruby data structures
  • Capacity: Token-limited (default 128K tokens)
  • Eviction: LRU-based when full
  • Access pattern: Frequent reads, moderate writes
  • Lifetime: Process lifetime

Benefits:

  • O(1) hash lookups for fast context access
  • Token budget control prevents context overflow
  • Explicit eviction policy with transparent behavior

Long-term Memory (Cold Tier)

Characteristics:

  • Purpose: Permanent knowledge base
  • Storage: PostgreSQL with TimescaleDB
  • Capacity: Effectively unlimited
  • Retention: Permanent (explicit deletion only)
  • Access pattern: RAG-based retrieval
  • Lifetime: Forever

Benefits:

  • Data is never lost; survives restarts
  • Semantic search over historical context
  • Time-series queries for temporal context

Data Flow

Add Memory:
  User Input → Working Memory → Long-term Memory
               (immediate)      (persisted)

Recall Memory:
  Query → Long-term Memory (RAG search) → Working Memory
          (semantic + temporal)            (evict if needed)

Eviction:
  Working Memory (full) → Evict LRU → Long-term Memory (already there)
                                       (mark as evicted, not deleted)
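
The eviction flow in code form, at the coordination layer (a sketch: the return value of evict_to_make_space and the hand-off wiring are assumptions):

# Eviction hand-off between tiers: the data is already persisted,
# so only the in_working_memory flag changes in long-term storage.
evicted_keys = @working_memory.evict_to_make_space(needed_tokens)
@long_term_memory.mark_evicted(evicted_keys)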

Implementation Details

Working Memory

class WorkingMemory
  attr_reader :max_tokens, :token_count

  def initialize(max_tokens: 128_000)
    @nodes = {}           # key => {value, token_count, importance, timestamp}
    @max_tokens = max_tokens
    @token_count = 0
    @access_order = []    # Track access for LRU
  end

  def add(key, value, token_count:, importance: 1.0)
    # Replace any existing entry first so its tokens are not double-counted
    if (existing = @nodes.delete(key))
      @token_count -= existing[:token_count]
      @access_order.delete(key)
    end

    evict_to_make_space(token_count) if needs_eviction?(token_count)
    @nodes[key] = {
      value: value,
      token_count: token_count,
      importance: importance,
      added_at: Time.now,
      last_accessed: Time.now
    }
    @token_count += token_count
    @access_order << key
  end

  def needs_eviction?(incoming_tokens)
    @token_count + incoming_tokens > @max_tokens
  end

  def evict_to_make_space(needed_tokens)
    # LRU eviction based on last access + importance
    # See ADR-007 for detailed eviction strategy
  end

  def assemble_context(strategy: :balanced, max_tokens: nil)
    # Sort by strategy and assemble within budget
    # See ADR-006 for context assembly strategies
  end
end
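
A quick illustration of the intended behavior (token counts invented for the example):

wm = WorkingMemory.new(max_tokens: 1_000)

wm.add("user_pref", "Prefers terse answers", token_count: 12, importance: 2.0)
wm.add("project",   "Working on the HTM gem", token_count: 9)

wm.token_count  # => 21
# Any add that would push token_count past max_tokens triggers
# evict_to_make_space before the new node is stored.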

Long-term Memory

class LongTermMemory
  def add(key:, value:, embedding:, robot_id:, importance: 1.0, type: nil)
    # Insert into PostgreSQL with vector embedding.
    # Note: pgvector expects the embedding serialized as text, e.g. "[0.1,0.2,...]".
    @db.exec_params(<<~SQL, [key, value, embedding, robot_id, importance, type])
      INSERT INTO nodes (key, value, embedding, robot_id, importance, type, created_at)
      VALUES ($1, $2, $3, $4, $5, $6, CURRENT_TIMESTAMP)
      RETURNING id
    SQL
  end

  def search(timeframe:, query:, embedding_service:, limit:, strategy: :hybrid)
    # RAG-based retrieval: temporal + semantic
    # See ADR-005 for retrieval strategies
  end

  def mark_evicted(keys)
    # Flip the in_working_memory flag; rows are never deleted on eviction.
    # The pg gem does not auto-encode Ruby arrays, so build a PG array literal.
    encoded_keys = PG::TextEncoder::Array.new.encode(keys)
    @db.exec_params(<<~SQL, [encoded_keys])
      UPDATE nodes
      SET in_working_memory = FALSE
      WHERE key = ANY($1::text[])
    SQL
  end
end
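
For orientation, one plausible shape of the nodes table implied by the queries above. Column names come from the INSERT and UPDATE statements; the types, embedding dimension, and index choices are assumptions (the HNSW and GIN indexes echo the Performance Characteristics section below):

NODES_SCHEMA = <<~SQL
  CREATE TABLE nodes (
    id                BIGSERIAL PRIMARY KEY,
    key               TEXT NOT NULL,
    value             TEXT NOT NULL,
    embedding         VECTOR(1536),        -- pgvector; dimension is an assumption
    robot_id          UUID NOT NULL,
    importance        REAL NOT NULL DEFAULT 1.0,
    type              TEXT,
    in_working_memory BOOLEAN NOT NULL DEFAULT TRUE,
    created_at        TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
  );

  -- Approximate nearest-neighbor search over embeddings
  CREATE INDEX ON nodes USING hnsw (embedding vector_cosine_ops);

  -- Full-text search over values
  CREATE INDEX ON nodes USING gin (to_tsvector('english', value));
SQL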

Coordination (HTM Class)

class HTM
  def initialize(robot_name:, robot_id: nil, max_tokens: 128_000, ...)
    @working_memory = WorkingMemory.new(max_tokens: max_tokens)
    @long_term_memory = LongTermMemory.new(db_config)
    @embedding_service = EmbeddingService.new(...)
    @robot_id = robot_id || SecureRandom.uuid
    @robot_name = robot_name
  end

  def add_node(key, value, importance: 1.0, type: nil)
    # 1. Generate embedding
    embedding = @embedding_service.embed(value)

    # 2. Store in long-term memory
    @long_term_memory.add(
      key: key,
      value: value,
      embedding: embedding,
      robot_id: @robot_id,
      importance: importance,
      type: type
    )

    # 3. Add to working memory (evict if needed)
    token_count = estimate_tokens(value)
    @working_memory.add(key, value,
                        token_count: token_count,
                        importance: importance)
  end

  def recall(timeframe:, topic:, limit: 10, strategy: :hybrid)
    # 1. Search long-term memory (RAG)
    results = @long_term_memory.search(
      timeframe: timeframe,
      query: topic,
      embedding_service: @embedding_service,
      limit: limit,
      strategy: strategy
    )

    # 2. Add results to working memory (evict if needed)
    results.each do |node|
      @working_memory.add(node[:key], node[:value],
                          token_count: node[:token_count],
                          importance: node[:importance])
    end

    # 3. Return nodes
    results
  end
end
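
estimate_tokens is referenced in add_node but not shown. A minimal sketch using the tiktoken_ruby gem (the gem choice and the characters-per-token fallback are assumptions; see the token-count risk below):

require "tiktoken_ruby"

def estimate_tokens(text)
  # cl100k_base approximates the tokenizers of most current chat models
  @encoder ||= Tiktoken.get_encoding("cl100k_base")
  @encoder.encode(text).length
rescue StandardError
  # Crude fallback: roughly 4 characters per token for English prose
  (text.length / 4.0).ceil
end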

Consequences

Positive

  • Fast context access through O(1) working memory lookups
  • Durable storage: data is never lost and survives restarts
  • Token budget control with automatic management
  • Explicit eviction policy provides transparent behavior
  • RAG-based semantic search over historical context
  • Never-delete philosophy: eviction moves data, never removes it
  • Process isolation: each robot instance has an independent working memory

Negative

  • Complexity of coordinating two storage layers
  • Memory overhead from working memory consuming RAM
  • Synchronization challenges keeping both tiers consistent
  • Eviction overhead when moving data between tiers

Neutral

  • Token counting requires accurate estimation
  • Strategy tuning for eviction and assembly needs calibration
  • Per-process state means working memory not shared across processes

Eviction Strategies

LRU-based (Implemented)

def eviction_score(node)
  recency = Time.now - node[:last_accessed]  # seconds since last access
  importance = node[:importance]

  # Lower score = evict first
  importance / (recency + 1.0)
end
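
A worked example with invented values shows that high importance does not protect a node that has sat idle long enough:

node_a = { importance: 2.0, last_accessed: Time.now - 60 }  # idle for a minute
node_b = { importance: 1.0, last_accessed: Time.now - 1 }   # just touched

eviction_score(node_a)  # => ~0.033 (2.0 / 61.0), evicted first
eviction_score(node_b)  # => ~0.5   (1.0 / 2.0)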

See ADR-007: Working Memory Eviction Strategy for detailed eviction algorithm.

Future Strategies

  • Importance-only: Keep most important nodes
  • Recency-only: Pure LRU cache
  • Frequency-based: Track access counts
  • Category-based: Pin certain types (facts, preferences)
  • Smart eviction: ML-based prediction of future access

Context Assembly Strategies

Recent (:recent)

Sort by created_at DESC, newest first

Important (:important)

Sort by importance DESC, most important first

Balanced (:balanced)

score = importance * (1.0 / (1 + age_in_hours))

See ADR-006: Context Assembly Strategies for detailed assembly algorithms.
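
A minimal sketch of how :balanced could rank nodes before filling the budget (the helper name is hypothetical; ADR-006 defines the actual algorithm):

def balanced_order(nodes)
  nodes.sort_by do |node|
    age_in_hours = (Time.now - node[:added_at]) / 3600.0
    score = node[:importance] * (1.0 / (1 + age_in_hours))
    -score  # sort_by is ascending, so negate to put the highest scores first
  end
end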


Design Principles

Never Forget (Unless Told)

  • Eviction moves data, never deletes
  • Only forget(confirm: :confirmed) deletes
  • Long-term memory is append-only (updates rare)

See ADR-009: Never-Forget Philosophy for deletion policies.
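
A sketch of the assumed API surface, where deletion requires the explicit confirmation:

htm.forget("stale_fact")                       # refused without confirmation (assumed behavior)
htm.forget("stale_fact", confirm: :confirmed)  # hard-deletes the node from long-term memory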

Token Budget Management

  • Token counting happens at add time
  • Working memory enforces hard token limit
  • Context assembly respects token budget
  • Safety margin (10%) for token estimation errors, as sketched below
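
For example, the margin could be folded into the eviction check, along these lines (the constant and this variant of needs_eviction? are assumptions):

SAFETY_MARGIN = 0.10  # reserve 10% of the budget for token-estimation error

def needs_eviction?(incoming_tokens)
  effective_budget = (@max_tokens * (1.0 - SAFETY_MARGIN)).floor
  @token_count + incoming_tokens > effective_budget
end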

Transparent Behavior

  • Log all evictions
  • Track in_working_memory flag
  • Operations log for audit trail
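
A hypothetical shape for the eviction log entry (the logger and message format are assumptions):

def log_eviction(keys, freed_tokens)
  @logger.info("evicted #{keys.size} nodes, freed #{freed_tokens} tokens, " \
               "#{@token_count} tokens now in working memory")
end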

Risks and Mitigations

Risk: Token Count Inaccuracy

Tiktoken's approximation can differ from the LLM's actual token count.

Likelihood: Medium (different tokenizers)
Impact: Medium (context overflow)
Mitigation: Add safety margin (10%), use LLM-specific counters

Risk: Eviction Thrashing

Constant eviction/recall cycles.

Likelihood: Low (with proper sizing)
Impact: Medium (performance degradation)
Mitigation: Larger working memory, smarter eviction, caching

Risk: Working Memory Growth

Memory leaks or unbounded growth.

Likelihood: Low (token budget enforced)
Impact: High (OOM crashes)
Mitigation: Hard limits, monitoring, alerts

Risk: Stale Working Memory

Working memory doesn't reflect long-term updates.

Likelihood: Low (single-writer pattern)
Impact: Low (eventual consistency OK)
Mitigation: Refresh on recall, invalidation on update


Performance Characteristics

Working Memory

  • Add: O(1) amortized (eviction is O(n) when needed)
  • Retrieve: O(1) hash lookup
  • Eviction: O(n log n) for sorting, O(k) for removing k nodes
  • Context assembly: O(n log n) for sorting, O(k) for selecting

Long-term Memory

  • Add: O(log n) PostgreSQL insert with indexes
  • Vector search: O(log n) with HNSW index (approximate)
  • Full-text search: O(log n) with GIN index
  • Hybrid search: O(log n) for both, then merge

Future Enhancements

  1. Shared working memory: Redis-backed for multi-process
  2. Lazy loading: Load nodes on first access
  3. Pre-fetching: Anticipate needed context
  4. Compression: Compress old working memory nodes
  5. Tiered eviction: Multiple working memory levels
  6. Smart assembly: ML-driven context selection

References

  • ADR-005: Retrieval Strategies
  • ADR-006: Context Assembly Strategies
  • ADR-007: Working Memory Eviction Strategy
  • ADR-009: Never-Forget Philosophy


Review Notes

Systems Architect: Clean separation of concerns. Consider shared cache for horizontal scaling.

Performance Specialist: Good balance of speed and durability. Monitor eviction frequency.

AI Engineer: Token budget management is critical. Add safety margins for token count variance.

Ruby Expert: Consider using Concurrent::Map for thread-safe working memory in future.