Database Schema Documentation¶

This document provides a comprehensive reference for HTM's PostgreSQL database schema, including query patterns, optimization strategies, and best practices.

Schema Overview¶

HTM uses PostgreSQL 17 with pgvector and pg_trgm extensions to provide:

Vector similarity search via pgvector for semantic memory retrieval
Full-text search with PostgreSQL's built-in tsvector capabilities
Fuzzy matching using pg_trgm for flexible text search
Many-to-many relationships for flexible tagging and categorization

Required Extensions¶

HTM requires these PostgreSQL extensions:

CREATE EXTENSION IF NOT EXISTS pg_trgm WITH SCHEMA public;
CREATE EXTENSION IF NOT EXISTS vector WITH SCHEMA public;

Entity-Relationship Diagram¶

Here's the complete database structure (auto-generated by tbls):

HTM Entity-Relationship Diagram

Table Reference¶

For detailed table definitions, columns, indexes, and constraints, see the auto-generated documentation:

Core Tables¶

Table	Description	Details
robots	Registry of all LLM robots using the HTM system	Stores robot metadata and activity tracking
nodes	Core memory storage for conversation messages and context	Vector embeddings, full-text search, deduplication
tags	Unique hierarchical tag names for categorization	Colon-separated namespaces (e.g., `ai:llm:embeddings`)
file_sources	Source file metadata for loaded documents	Path, mtime, frontmatter, sync tracking

Join Tables¶

Table	Description	Details
robot_nodes	Links robots to nodes (many-to-many)	Enables "hive mind" shared memory; includes `working_memory` boolean for per-robot working memory state
node_tags	Links nodes to tags (many-to-many)	Flexible multi-tag categorization

System Tables¶

Table	Description	Details
schema_migrations	ActiveRecord migration tracking	Tracks applied migrations

For the complete schema overview including all stored procedures and functions, see the Database Tables Overview.

Key Concepts¶

Content Deduplication¶

Content deduplication is enforced via SHA-256 hashing in the nodes table:

When remember() is called, a SHA-256 hash of the content is computed
If a node with the same content_hash exists, the existing node is reused
A new robot_nodes association is created (or updated if it already exists)
This ensures identical memories are stored once but can be "remembered" by multiple robots

JSONB Metadata¶

The nodes table includes a metadata JSONB column for flexible key-value storage:

Column	Type	Default	Description
`metadata`	jsonb	`{}`	Arbitrary key-value data

Features: - Stores any valid JSON data (strings, numbers, booleans, arrays, objects) - GIN index (idx_nodes_metadata) for efficient containment queries - Queried using PostgreSQL's @> containment operator

Query examples:

-- Find nodes with specific metadata
SELECT * FROM nodes WHERE metadata @> '{"priority": "high"}'::jsonb;

-- Find nodes with nested metadata
SELECT * FROM nodes WHERE metadata @> '{"user": {"role": "admin"}}'::jsonb;

-- Find nodes with multiple conditions
SELECT * FROM nodes WHERE metadata @> '{"environment": "production", "version": 2}'::jsonb;

Ruby usage:

# Store with metadata
htm.remember("API config", metadata: { environment: "production", version: 2 })

# Recall filtering by metadata
htm.recall("config", metadata: { environment: "production" })

Hierarchical Tags¶

Tags use colon-separated hierarchies for organization: - programming:ruby:gems - Programming > Ruby > Gems - database:postgresql:extensions - Database > PostgreSQL > Extensions - ai:llm:embeddings - AI > LLM > Embeddings

Query by prefix to find all related tags:

SELECT * FROM tags WHERE name LIKE 'database:%';  -- All database-related tags
SELECT * FROM tags WHERE name LIKE 'ai:llm:%';    -- All LLM-related tags

File Source Tracking¶

The file_sources table tracks loaded documents for re-sync support:

Column	Type	Description
`id`	bigint	Primary key
`file_path`	text	Absolute path to the source file
`file_hash`	varchar(64)	SHA-256 hash of file contents
`mtime`	timestamptz	File modification time for change detection
`file_size`	integer	File size in bytes
`frontmatter`	jsonb	Parsed YAML frontmatter metadata
`last_synced_at`	timestamptz	When file was last synced
`created_at`	timestamptz	When source was first loaded
`updated_at`	timestamptz	When source was last updated

Nodes loaded from files have: - source_id - Foreign key to file_sources (nullable, ON DELETE SET NULL) - chunk_position - Integer position within the file (0-indexed)

Query nodes from a file:

SELECT n.*
FROM nodes n
JOIN file_sources fs ON n.source_id = fs.id
WHERE fs.file_path = '/path/to/file.md'
ORDER BY n.chunk_position;

Remember Tracking¶

The robot_nodes table tracks per-robot remember metadata:

first_remembered_at - When this robot first encountered this content
last_remembered_at - Updated each time the robot tries to remember the same content
remember_count - Incremented each time (useful for identifying frequently reinforced memories)

This allows querying for: - Recently reinforced memories: ORDER BY last_remembered_at DESC - Frequently remembered content: ORDER BY remember_count DESC - New vs old memories: Compare first_remembered_at across robots

Common Query Patterns¶

Finding Nodes for a Robot¶

SELECT n.*
FROM nodes n
JOIN robot_nodes rn ON n.id = rn.node_id
WHERE rn.robot_id = $1
ORDER BY rn.last_remembered_at DESC;

SELECT r.*
FROM robots r
JOIN robot_nodes rn ON r.id = rn.robot_id
WHERE rn.node_id = $1
ORDER BY rn.first_remembered_at;

Finding Frequently Remembered Content¶

SELECT n.*, rn.remember_count, rn.first_remembered_at, rn.last_remembered_at
FROM nodes n
JOIN robot_nodes rn ON n.id = rn.node_id
WHERE rn.robot_id = $1
ORDER BY rn.remember_count DESC
LIMIT 10;

Finding Tags for a Node¶

SELECT t.name
FROM tags t
JOIN node_tags nt ON t.id = nt.tag_id
WHERE nt.node_id = $1
ORDER BY t.name;

Finding Nodes with a Specific Tag¶

SELECT n.*
FROM nodes n
JOIN node_tags nt ON n.id = nt.node_id
JOIN tags t ON nt.tag_id = t.id
WHERE t.name = 'database:postgresql'
ORDER BY n.created_at DESC;

Finding Nodes with Hierarchical Tag Prefix¶

SELECT n.*
FROM nodes n
JOIN node_tags nt ON n.id = nt.node_id
JOIN tags t ON nt.tag_id = t.id
WHERE t.name LIKE 'ai:llm:%'
ORDER BY n.created_at DESC;

SELECT
    t1.name AS topic1,
    t2.name AS topic2,
    COUNT(DISTINCT nt1.node_id) AS shared_nodes
FROM tags t1
JOIN node_tags nt1 ON t1.id = nt1.tag_id
JOIN node_tags nt2 ON nt1.node_id = nt2.node_id
JOIN tags t2 ON nt2.tag_id = t2.id
WHERE t1.name < t2.name
GROUP BY t1.name, t2.name
HAVING COUNT(DISTINCT nt1.node_id) >= 2
ORDER BY shared_nodes DESC;

Vector Similarity Search with Tag Filter¶

SELECT n.*, n.embedding <=> $1::vector AS distance
FROM nodes n
JOIN node_tags nt ON n.id = nt.node_id
JOIN tags t ON nt.tag_id = t.id
WHERE t.name = 'programming:ruby'
  AND n.embedding IS NOT NULL
ORDER BY distance
LIMIT 10;

Full-Text Search with Tag Filter¶

SELECT n.*, ts_rank(to_tsvector('english', n.content), query) AS rank
FROM nodes n
JOIN node_tags nt ON n.id = nt.node_id
JOIN tags t ON nt.tag_id = t.id,
     to_tsquery('english', 'database & optimization') query
WHERE to_tsvector('english', n.content) @@ query
  AND t.name LIKE 'database:%'
ORDER BY rank DESC
LIMIT 20;

Finding Content Shared by Multiple Robots¶

SELECT n.*, COUNT(DISTINCT rn.robot_id) AS robot_count
FROM nodes n
JOIN robot_nodes rn ON n.id = rn.node_id
GROUP BY n.id
HAVING COUNT(DISTINCT rn.robot_id) > 1
ORDER BY robot_count DESC;

Database Optimization¶

Vector Search Performance¶

The idx_nodes_embedding index uses HNSW (Hierarchical Navigable Small World) algorithm for fast approximate nearest neighbor search:

m=16: Number of bi-directional links per node (higher = better recall, more memory)
ef_construction=64: Size of dynamic candidate list during index construction (higher = better quality, slower build)

For queries, you can adjust ef_search (defaults to 40):

SET hnsw.ef_search = 100;  -- Better recall, slower queries

Full-Text Search Performance¶

The idx_nodes_content_gin index enables fast full-text search using PostgreSQL's tsvector:

-- Query optimization with explicit tsvector
SELECT * FROM nodes
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'memory & retrieval');

Fuzzy Matching Performance¶

The idx_nodes_content_trgm index enables similarity search and pattern matching:

-- Similarity search
SELECT * FROM nodes
WHERE content % 'semantic retreval';  -- Handles typos

-- Pattern matching
SELECT * FROM nodes
WHERE content ILIKE '%memry%';  -- Uses trigram index

Index Maintenance¶

Monitor and maintain indexes for optimal performance:

-- Check index usage
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan DESC;

-- Reindex if needed
REINDEX INDEX CONCURRENTLY idx_nodes_embedding;
REINDEX INDEX CONCURRENTLY idx_nodes_content_gin;

Schema Migration¶

The schema is managed through ActiveRecord migrations located in db/migrate/:

20250101000001_create_robots.rb - Creates robots table
20250101000002_create_nodes.rb - Creates nodes table with all indexes
20250101000005_create_tags.rb - Creates tags and nodes_tags tables
20251128000002_create_file_sources.rb - Creates file_sources table for document tracking
20251128000003_add_source_to_nodes.rb - Adds source_id and chunk_position to nodes

To apply migrations:

bundle exec rake htm:db:migrate

To generate the current schema dump:

bundle exec rake htm:db:schema:dump

The canonical schema is maintained in db/schema.sql.

Database Extensions¶

pgvector¶

Provides vector similarity search capabilities:

-- Install extension
CREATE EXTENSION IF NOT EXISTS vector WITH SCHEMA public;

-- Vector operations
SELECT embedding <=> $1::vector AS cosine_distance FROM nodes;  -- Cosine distance
SELECT embedding <-> $1::vector AS l2_distance FROM nodes;      -- L2 distance
SELECT embedding <#> $1::vector AS inner_product FROM nodes;    -- Inner product

pg_trgm¶

Provides trigram-based fuzzy text matching:

-- Install extension
CREATE EXTENSION IF NOT EXISTS pg_trgm WITH SCHEMA public;

-- Trigram operations
SELECT content % 'search term' FROM nodes;           -- Similarity operator
SELECT similarity(content, 'search term') FROM nodes; -- Similarity score
SELECT content ILIKE '%pattern%' FROM nodes;          -- Pattern matching (uses trigram index)

Best Practices¶

Tagging Strategy¶

Use hierarchical namespaces: category:subcategory:detail
Be consistent with naming: Use lowercase, singular nouns
Limit depth: 2-3 levels is optimal (e.g., ai:llm:embeddings)
Avoid redundancy: Don't duplicate information already in node fields

Node Management¶

Set appropriate importance: Use 0.0-1.0 scale for priority-based retrieval
Update last_accessed: Touch timestamp when retrieving for LRU eviction
Manage token_count: Update when content changes for working memory budget
Use appropriate types: fact, context, code, preference, decision, question

Search Strategy¶

Vector search: Best for semantic similarity ("concepts like X")
Full-text search: Best for keyword matching ("documents containing Y")
Fuzzy search: Best for typo tolerance and pattern matching
Hybrid search: Combine vector + full-text with weighted scores

Performance Tuning¶

Monitor index usage: Use pg_stat_user_indexes
Vacuum regularly: Especially after bulk deletes
Adjust HNSW parameters: Balance recall vs speed based on dataset size
Use connection pooling: Managed by HTM::LongTermMemory