Database Schema¶
Ragdoll implements a sophisticated PostgreSQL database schema designed for high-performance vector similarity search and multi-modal content management. The schema uses polymorphic associations, Single Table Inheritance (STI), and advanced PostgreSQL features including pgvector for embedding storage.
Polymorphic Multi-Modal Database Design¶
The database architecture is built around three core principles:
- Multi-Modal Content Support: Single schema handles text, images, audio, and mixed content
- Polymorphic Embeddings: Vector embeddings can be associated with any content type
- Performance Optimization: Advanced indexing strategies for both vector and full-text search
Schema Architecture Overview¶
erDiagram
DOCUMENTS {
bigint id PK
string location UK
string title
text summary
text keywords
string document_type
string status
json metadata
timestamp file_modified_at
timestamp created_at
timestamp updated_at
}
CONTENTS {
bigint id PK
string type
bigint document_id FK
string embedding_model
text content
text data
json metadata
float duration
integer sample_rate
timestamp created_at
timestamp updated_at
}
EMBEDDINGS {
bigint id PK
string embeddable_type
bigint embeddable_id
text content
vector embedding_vector
integer chunk_index
integer usage_count
datetime returned_at
json metadata
timestamp created_at
timestamp updated_at
}
DOCUMENTS ||--o{ CONTENTS : has_many
CONTENTS ||--o{ EMBEDDINGS : polymorphic
DOCUMENTS ||--o{ EMBEDDINGS : polymorphic
Key Design Features¶
Single Table Inheritance (STI) for Content¶
# Content types inherit from base Content model
class Content < ActiveRecord::Base
# Base functionality for all content types
end
class TextContent < Content
# Text-specific methods and validations
end
class ImageContent < Content
# Image-specific methods and validations
end
class AudioContent < Content
# Audio-specific methods and validations
end
Polymorphic Embedding Architecture¶
# Embeddings can belong to any content type or document
belongs_to :embeddable, polymorphic: true
# Examples:
embedding.embeddable = text_content
embedding.embeddable = image_content
embedding.embeddable = document
Dual Metadata System¶
- Document.metadata: LLM-generated semantic metadata (classification, topics, etc.)
- Content.metadata: Technical file metadata (encoding, dimensions, etc.)
- Embedding.metadata: Vector processing metadata (positions, chunks, etc.)
Core Tables¶
Ragdoll uses four primary tables for data storage:
Documents Table (ragdoll_documents
)¶
The central table for document management and LLM-generated metadata storage.
CREATE TABLE ragdoll_documents (
id BIGINT PRIMARY KEY,
location VARCHAR NOT NULL, -- File path, URL, or identifier
title VARCHAR NOT NULL, -- Human-readable title
summary TEXT NOT NULL DEFAULT '', -- LLM-generated summary
keywords TEXT NOT NULL DEFAULT '', -- LLM-generated keywords
document_type VARCHAR NOT NULL DEFAULT 'text', -- Document format
status VARCHAR NOT NULL DEFAULT 'pending', -- Processing status
metadata JSON DEFAULT '{}', -- LLM-generated structured metadata
file_modified_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
created_at TIMESTAMP NOT NULL,
updated_at TIMESTAMP NOT NULL
);
Column Details¶
Column | Type | Purpose | Example |
---|---|---|---|
location |
VARCHAR | Unique document identifier | /path/to/doc.pdf , https://example.com/file |
title |
VARCHAR | Display name for search results | "Machine Learning Research Paper" |
summary |
TEXT | LLM-generated content summary | "This paper explores neural networks..." |
keywords |
TEXT | Comma-separated keywords | "machine learning, AI, neural networks" |
document_type |
VARCHAR | Content format type | pdf , text , image , mixed |
status |
VARCHAR | Processing state | pending , processing , completed , failed |
metadata |
JSON | Structured LLM metadata | {"classification": "research", "topics": [...]} |
file_modified_at |
TIMESTAMP | Source file modification time | Used for change detection |
Relationships¶
# Document model relationships
has_many :contents, dependent: :destroy
has_many :text_contents, -> { where(type: 'TextContent') }
has_many :image_contents, -> { where(type: 'ImageContent') }
has_many :audio_contents, -> { where(type: 'AudioContent') }
has_many :embeddings, as: :embeddable, dependent: :destroy
Status Values¶
pending
: Document added but not yet processedprocessing
: Currently being analyzed by LLM servicescompleted
: All processing finished successfullyfailed
: Processing encountered errorsoutdated
: Source file modified since last processing
Contents Table (ragdoll_contents
)¶
Single Table Inheritance (STI) table storing all content types with type-specific fields.
CREATE TABLE ragdoll_contents (
id BIGINT PRIMARY KEY,
type VARCHAR NOT NULL, -- STI discriminator
document_id BIGINT NOT NULL REFERENCES ragdoll_documents(id),
embedding_model VARCHAR NOT NULL, -- Model for embedding generation
content TEXT, -- Text content or description
data TEXT, -- Raw file data (base64, etc.)
metadata JSON DEFAULT '{}', -- Technical file metadata
duration FLOAT, -- Audio duration (seconds)
sample_rate INTEGER, -- Audio sample rate (Hz)
created_at TIMESTAMP NOT NULL,
updated_at TIMESTAMP NOT NULL
);
Content Types (STI)¶
TextContent (type = 'TextContent'
)
- content
: The actual text content
- metadata
: {"encoding": "UTF-8", "line_count": 150, "chunk_size": 1000}
ImageContent (type = 'ImageContent'
)
- content
: LLM-generated image description
- data
: Base64 encoded image data or file path
- metadata
: {"width": 1920, "height": 1080, "format": "PNG", "file_size": 2048576}
AudioContent (type = 'AudioContent'
)
- content
: Transcribed text or audio description
- data
: Audio file path or encoded data
- duration
: Audio length in seconds
- sample_rate
: Audio quality (e.g., 44100 Hz)
- metadata
: {"format": "MP3", "bitrate": 320, "channels": 2}
Technical Metadata Examples¶
// TextContent metadata
{
"encoding": "UTF-8",
"line_count": 245,
"word_count": 1850,
"chunk_size": 1000,
"overlap": 200,
"processing_time_ms": 150
}
// ImageContent metadata
{
"width": 1920,
"height": 1080,
"format": "PNG",
"file_size": 2048576,
"color_depth": 24,
"has_transparency": false,
"exif_data": {...}
}
// AudioContent metadata
{
"format": "MP3",
"bitrate": 320,
"channels": 2,
"codec": "LAME",
"file_size": 5242880,
"transcription_confidence": 0.95
}
Embeddings Table (ragdoll_embeddings
)¶
Polymorphic table storing vector embeddings with usage tracking and similarity search optimization.
CREATE TABLE ragdoll_embeddings (
id BIGINT PRIMARY KEY,
embeddable_type VARCHAR NOT NULL, -- Polymorphic type
embeddable_id BIGINT NOT NULL, -- Polymorphic ID
content TEXT NOT NULL DEFAULT '', -- Original text that was embedded
embedding_vector VECTOR(1536) NOT NULL, -- pgvector embedding
chunk_index INTEGER NOT NULL, -- Chunk ordering
usage_count INTEGER DEFAULT 0, -- Search usage tracking
returned_at TIMESTAMP, -- Last usage timestamp
metadata JSON DEFAULT '{}', -- Processing metadata
created_at TIMESTAMP NOT NULL,
updated_at TIMESTAMP NOT NULL
);
Vector Storage Details¶
Embedding Vector (embedding_vector
)
- Uses pgvector's VECTOR(1536)
type for OpenAI embeddings
- Supports other dimensions: 768 (sentence-transformers), 4096 (large models)
- Stored as compressed binary data for performance
Polymorphic Associations
# Can belong to any content type or document
belongs_to :embeddable, polymorphic: true
# Examples:
embedding.embeddable_type = 'TextContent'
embedding.embeddable_type = 'Document'
embedding.embeddable_type = 'ImageContent'
Usage Tracking
- usage_count
: Incremented each time embedding appears in search results
- returned_at
: Updated when embedding is returned in search
- Used for analytics and caching strategies
Chunk Management
- chunk_index
: Orders chunks within a document (0, 1, 2, ...)
- content
: Original text chunk that was embedded
- metadata
: Processing info like start/end positions
Embedding Metadata Examples¶
{
"start_position": 1250,
"end_position": 2250,
"chunk_size": 1000,
"overlap": 200,
"embedding_model": "openai/text-embedding-3-small",
"embedding_dimensions": 1536,
"generation_time_ms": 45,
"token_count": 180
}
Search Tracking Tables¶
Ragdoll includes comprehensive search tracking capabilities with two additional tables for analytics and similarity analysis.
Searches Table (ragdoll_searches
)¶
Stores all search queries with vector embeddings for similarity analysis and performance tracking.
CREATE TABLE ragdoll_searches (
id BIGINT PRIMARY KEY,
query TEXT NOT NULL, -- Original search query
query_embedding VECTOR(1536) NOT NULL, -- Query vector for similarity
search_type VARCHAR DEFAULT 'semantic', -- semantic, hybrid, fulltext
results_count INTEGER DEFAULT 0, -- Number of results returned
max_similarity_score FLOAT, -- Highest similarity score
min_similarity_score FLOAT, -- Lowest similarity score
avg_similarity_score FLOAT, -- Average similarity score
search_filters JSON DEFAULT '{}', -- Applied filters
search_options JSON DEFAULT '{}', -- Search options used
execution_time_ms INTEGER, -- Query execution time
session_id VARCHAR, -- User session identifier
user_id VARCHAR, -- User identifier
created_at TIMESTAMP NOT NULL,
updated_at TIMESTAMP NOT NULL
);
Key Features: - Stores query embeddings for finding similar searches - Tracks performance metrics (execution time, result counts) - Records search configuration (filters, options, weights) - Links to user sessions for behavior analysis
Search Results Table (ragdoll_search_results
)¶
Junction table linking searches to returned embeddings with engagement tracking.
CREATE TABLE ragdoll_search_results (
id BIGINT PRIMARY KEY,
search_id BIGINT NOT NULL REFERENCES ragdoll_searches(id) ON DELETE CASCADE,
embedding_id BIGINT NOT NULL REFERENCES ragdoll_embeddings(id) ON DELETE CASCADE,
similarity_score FLOAT NOT NULL, -- Similarity score for this result
result_rank INTEGER NOT NULL, -- Position in result list (1-based)
clicked BOOLEAN DEFAULT FALSE, -- User engagement tracking
clicked_at TIMESTAMP, -- When result was clicked
created_at TIMESTAMP NOT NULL,
updated_at TIMESTAMP NOT NULL,
UNIQUE(search_id, result_rank) -- One rank per search
);
Key Features: - Tracks which embeddings were returned for each search - Records similarity scores and ranking positions - Monitors user engagement (click-through tracking) - Automatic cascade deletion when searches or embeddings are removed
Search Tracking Indexes¶
-- Vector similarity search for finding similar queries
CREATE INDEX idx_searches_query_embedding
ON ragdoll_searches
USING ivfflat (query_embedding vector_cosine_ops)
WITH (lists = 100);
-- Performance analysis indexes
CREATE INDEX idx_searches_execution_time
ON ragdoll_searches(execution_time_ms DESC);
CREATE INDEX idx_searches_created_at
ON ragdoll_searches(created_at DESC);
CREATE INDEX idx_searches_session
ON ragdoll_searches(session_id);
CREATE INDEX idx_searches_user
ON ragdoll_searches(user_id);
-- Search results indexes
CREATE INDEX idx_search_results_search
ON ragdoll_search_results(search_id);
CREATE INDEX idx_search_results_embedding_score
ON ragdoll_search_results(embedding_id, similarity_score DESC);
CREATE INDEX idx_search_results_clicked
ON ragdoll_search_results(clicked, clicked_at DESC);
### PostgreSQL Extensions Table
Ragdoll requires several PostgreSQL extensions enabled in the database:
```sql
-- Vector similarity search (REQUIRED)
CREATE EXTENSION IF NOT EXISTS vector;
-- Text processing extensions
CREATE EXTENSION IF NOT EXISTS unaccent; -- Remove accents
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- Trigram fuzzy search
-- UUID support
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
Database Requirements¶
Ragdoll has strict database requirements for optimal performance and functionality.
PostgreSQL Version Requirements¶
Minimum Requirements: - PostgreSQL 12+ (for JSON operators and improved indexing) - pgvector 0.4.0+ (for vector similarity search) - Recommended: PostgreSQL 14+ with pgvector 0.5.0+
Version Compatibility Matrix:
PostgreSQL | pgvector | Status | Notes |
---|---|---|---|
12.x | 0.4.0+ | ✅ Supported | Minimum required version |
13.x | 0.4.0+ | ✅ Supported | Good performance |
14.x | 0.5.0+ | ✅ Recommended | Improved vector performance |
15.x | 0.5.0+ | ✅ Recommended | Best JSON performance |
16.x | 0.5.0+ | ✅ Latest | Latest features |
Required Extensions¶
pgvector Extension (CRITICAL)¶
-- Enable vector similarity search
CREATE EXTENSION IF NOT EXISTS vector;
-- Verify installation
SELECT * FROM pg_extension WHERE extname = 'vector';
Installation Methods:
# Ubuntu/Debian
sudo apt-get install postgresql-14-pgvector
# macOS with Homebrew
brew install pgvector
# Docker
docker run -d --name ragdoll-postgres \
-e POSTGRES_PASSWORD=password \
-p 5432:5432 \
pgvector/pgvector:pg14
Text Processing Extensions¶
-- Remove accents for better text search
CREATE EXTENSION IF NOT EXISTS unaccent;
-- Trigram similarity for fuzzy matching
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- UUID generation
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
Database Configuration¶
Memory Settings¶
# postgresql.conf optimizations
shared_buffers = 256MB # 25% of RAM for small instances
effective_cache_size = 1GB # Available cache memory
work_mem = 64MB # Per-operation memory
maintenance_work_mem = 256MB # Index maintenance
# Vector-specific settings
max_connections = 100 # Adjust based on needs
random_page_cost = 1.1 # SSD optimization
Connection Settings¶
# database.yml configuration
production:
adapter: postgresql
database: ragdoll_production
username: ragdoll
password: <%= ENV['DATABASE_PASSWORD'] %>
host: localhost
port: 5432
pool: 25 # Connection pool size
timeout: 5000 # Connection timeout (ms)
prepared_statements: true # Performance optimization
advisory_locks: true # Enable advisory locking
Performance Optimization¶
Vector Search Optimization¶
-- Adjust IVFFlat parameters for your dataset size
SET ivfflat.probes = 10; -- Search accuracy vs speed
-- Monitor vector index performance
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM ragdoll_embeddings
ORDER BY embedding_vector <=> '[0.1,0.2,...]'
LIMIT 10;
JSON Indexing¶
-- Create expression indexes for frequently queried JSON fields
CREATE INDEX idx_metadata_classification
ON ragdoll_documents USING btree ((metadata->>'classification'));
CREATE INDEX idx_metadata_topics
ON ragdoll_documents USING gin ((metadata->'topics'));
Full-Text Search Optimization¶
-- Update full-text search statistics
ANALYZE ragdoll_documents;
ANALYZE ragdoll_contents;
-- Monitor full-text query performance
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM ragdoll_documents
WHERE to_tsvector('english', title || ' ' || summary)
@@ plainto_tsquery('english', 'machine learning');
Indexing Strategy¶
Ragdoll uses a comprehensive indexing strategy optimized for vector similarity search, full-text search, and metadata filtering.
Vector Similarity Indexes¶
IVFFlat Index for Embeddings¶
-- Primary vector similarity index (cosine distance)
CREATE INDEX index_ragdoll_embeddings_on_embedding_vector_cosine
ON ragdoll_embeddings
USING ivfflat (embedding_vector vector_cosine_ops)
WITH (lists = 100);
-- Alternative: L2 distance index
CREATE INDEX index_ragdoll_embeddings_on_embedding_vector_l2
ON ragdoll_embeddings
USING ivfflat (embedding_vector vector_l2_ops)
WITH (lists = 100);
IVFFlat Configuration¶
Lists Parameter Sizing:
- Small datasets (< 100K vectors): lists = 100
- Medium datasets (100K - 1M vectors): lists = 1000
- Large datasets (> 1M vectors): lists = sqrt(rows)
Query-Time Parameters:
-- Adjust search accuracy vs speed
SET ivfflat.probes = 10; -- Default: good balance
SET ivfflat.probes = 1; -- Fastest, less accurate
SET ivfflat.probes = 50; -- Most accurate, slower
Vector Index Maintenance¶
-- Rebuild vector indexes after significant data changes
REINDEX INDEX index_ragdoll_embeddings_on_embedding_vector_cosine;
-- Monitor index usage
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE indexname LIKE '%embedding_vector%';
Full-Text Search Indexes¶
Document Full-Text Index¶
-- Comprehensive full-text search across multiple fields
CREATE INDEX index_ragdoll_documents_on_fulltext_search
ON ragdoll_documents
USING gin (
to_tsvector('english',
COALESCE(title, '') || ' ' ||
COALESCE(metadata->>'summary', '') || ' ' ||
COALESCE(metadata->>'keywords', '') || ' ' ||
COALESCE(metadata->>'description', '')
)
);
Content Full-Text Index¶
-- Content-specific full-text search
CREATE INDEX index_ragdoll_contents_on_fulltext_search
ON ragdoll_contents
USING gin (to_tsvector('english', COALESCE(content, '')));
Full-Text Search Usage¶
-- Example full-text queries
SELECT * FROM ragdoll_documents
WHERE to_tsvector('english', title || ' ' || (metadata->>'summary'))
@@ plainto_tsquery('english', 'machine learning neural networks');
-- Ranked full-text search
SELECT *, ts_rank(to_tsvector('english', title), query) as rank
FROM ragdoll_documents, plainto_tsquery('english', 'AI research') query
WHERE to_tsvector('english', title) @@ query
ORDER BY rank DESC;
JSON Metadata Indexes¶
Classification and Type Indexes¶
-- Document type filtering
CREATE INDEX index_ragdoll_documents_on_metadata_type
ON ragdoll_documents ((metadata->>'document_type'));
-- Content classification filtering
CREATE INDEX index_ragdoll_documents_on_metadata_classification
ON ragdoll_documents ((metadata->>'classification'));
-- Language filtering
CREATE INDEX index_ragdoll_documents_on_metadata_language
ON ragdoll_documents ((metadata->>'language'));
Composite JSON Indexes¶
-- Multi-field filtering
CREATE INDEX index_ragdoll_documents_on_type_and_classification
ON ragdoll_documents (
(metadata->>'document_type'),
(metadata->>'classification')
);
-- Date-based filtering with type
CREATE INDEX index_ragdoll_documents_on_type_and_date
ON ragdoll_documents (
(metadata->>'document_type'),
created_at
);
Array and Complex JSON Indexes¶
-- GIN index for array fields (topics, keywords)
CREATE INDEX index_ragdoll_documents_on_metadata_topics
ON ragdoll_documents USING gin ((metadata->'topics'));
CREATE INDEX index_ragdoll_documents_on_metadata_keywords_array
ON ragdoll_documents USING gin ((metadata->'keywords'));
-- JSONB path indexes for nested objects
CREATE INDEX index_ragdoll_documents_on_author_info
ON ragdoll_documents USING gin ((metadata->'author_info'));
Performance-Optimized Indexes¶
Primary Key and Foreign Key Indexes¶
-- Documents table indexes
CREATE UNIQUE INDEX index_ragdoll_documents_on_location
ON ragdoll_documents (location); -- Unique document lookup
CREATE INDEX index_ragdoll_documents_on_status
ON ragdoll_documents (status); -- Processing status filtering
CREATE INDEX index_ragdoll_documents_on_document_type
ON ragdoll_documents (document_type); -- Document type filtering
-- Contents table indexes
CREATE INDEX index_ragdoll_contents_on_document_id
ON ragdoll_contents (document_id); -- Foreign key performance
CREATE INDEX index_ragdoll_contents_on_type
ON ragdoll_contents (type); -- STI type filtering
CREATE INDEX index_ragdoll_contents_on_embedding_model
ON ragdoll_contents (embedding_model); -- Model-based filtering
-- Embeddings table indexes
CREATE INDEX index_ragdoll_embeddings_on_embeddable
ON ragdoll_embeddings (embeddable_type, embeddable_id); -- Polymorphic lookup
CREATE INDEX index_ragdoll_embeddings_on_usage_tracking
ON ragdoll_embeddings (returned_at DESC, usage_count DESC); -- Analytics
Composite Performance Indexes¶
-- Multi-column indexes for common query patterns
CREATE INDEX index_ragdoll_documents_on_type_and_status
ON ragdoll_documents (document_type, status);
CREATE INDEX index_ragdoll_contents_on_document_and_type
ON ragdoll_contents (document_id, type);
CREATE INDEX index_ragdoll_embeddings_on_type_and_usage
ON ragdoll_embeddings (embeddable_type, usage_count DESC);
Time-Based Indexes¶
-- Chronological sorting and filtering
CREATE INDEX index_ragdoll_documents_on_created_at
ON ragdoll_documents (created_at DESC);
CREATE INDEX index_ragdoll_documents_on_file_modified_at
ON ragdoll_documents (file_modified_at DESC);
CREATE INDEX index_ragdoll_embeddings_on_returned_at
ON ragdoll_embeddings (returned_at DESC)
WHERE returned_at IS NOT NULL;
Index Maintenance and Monitoring¶
Index Usage Statistics¶
-- Monitor index performance
SELECT
schemaname,
tablename,
indexname,
idx_scan as scans,
idx_tup_read as tuples_read,
idx_tup_fetch as tuples_fetched,
idx_scan::float / NULLIF(seq_scan + idx_scan, 0) as index_usage_ratio
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
AND tablename LIKE 'ragdoll_%'
ORDER BY idx_scan DESC;
Index Size Monitoring¶
-- Check index sizes
SELECT
indexname,
tablename,
pg_size_pretty(pg_relation_size(indexname::regclass)) as size
FROM pg_indexes
WHERE schemaname = 'public'
AND tablename LIKE 'ragdoll_%'
ORDER BY pg_relation_size(indexname::regclass) DESC;
Index Maintenance Commands¶
-- Rebuild indexes after major data changes
REINDEX TABLE ragdoll_documents;
REINDEX TABLE ragdoll_contents;
REINDEX TABLE ragdoll_embeddings;
-- Update table statistics
ANALYZE ragdoll_documents;
ANALYZE ragdoll_contents;
ANALYZE ragdoll_embeddings;
-- Vacuum for space reclamation
VACUUM ANALYZE ragdoll_documents;
VACUUM ANALYZE ragdoll_contents;
VACUUM ANALYZE ragdoll_embeddings;
Database Migrations¶
Ragdoll includes a comprehensive migration system with automatic setup, versioning, and backward compatibility management.
Migration Files Overview¶
The migration system consists of four primary migration files:
- 001_enable_postgresql_extensions.rb - PostgreSQL extensions setup
- 004_create_ragdoll_documents.rb - Documents table and indexes
- 005_create_ragdoll_embeddings.rb - Embeddings table with vector support
- 006_create_ragdoll_contents.rb - Contents table with STI architecture
Auto-Migration Feature¶
Automatic Database Setup¶
# Ragdoll automatically runs migrations on first use
Ragdoll::Core.configure do |config|
config.database_config = {
adapter: 'postgresql',
database: 'ragdoll_production',
username: 'ragdoll',
password: ENV['DATABASE_PASSWORD'],
host: 'localhost',
port: 5432,
auto_migrate: true # Enables automatic migrations
}
end
# First client creation triggers migration
client = Ragdoll::Core.client # Automatically sets up database
Migration Trigger Points¶
# Auto-migration triggers
class DatabaseManager
def self.ensure_database_ready!
return if @database_initialized
if Ragdoll.config.database_config[:auto_migrate]
run_pending_migrations!
verify_extensions!
@database_initialized = true
end
end
private
def self.run_pending_migrations!
ActiveRecord::Migration.verbose = true
ActiveRecord::MigrationContext.new(migrations_path).migrate
end
end
Manual Migration Process¶
Running Migrations Manually¶
# Using ActiveRecord migration commands
bundle exec rake db:migrate
# Run specific migration
bundle exec rake db:migrate:up VERSION=20240115000001
# Rollback migrations
bundle exec rake db:migrate:down VERSION=20240115000001
# Check migration status
bundle exec rake db:migrate:status
Custom Migration Runner¶
# Manual migration execution
class MigrationRunner
def self.run_all_migrations
migration_context = ActiveRecord::MigrationContext.new(
Rails.root.join('db', 'migrate')
)
migration_context.migrate
end
def self.rollback_to_version(version)
migration_context = ActiveRecord::MigrationContext.new(
Rails.root.join('db', 'migrate')
)
migration_context.migrate(version)
end
def self.migration_status
migration_context = ActiveRecord::MigrationContext.new(
Rails.root.join('db', 'migrate')
)
migration_context.migrations_status
end
end
Schema Versioning¶
Migration Version Management¶
# Migration file naming convention
# YYYYMMDDHHMMSS_migration_name.rb
# 001_enable_postgresql_extensions.rb
class EnablePostgresqlExtensions < ActiveRecord::Migration[7.0]
def up
execute "CREATE EXTENSION IF NOT EXISTS vector"
execute "CREATE EXTENSION IF NOT EXISTS unaccent"
execute "CREATE EXTENSION IF NOT EXISTS pg_trgm"
execute "CREATE EXTENSION IF NOT EXISTS \"uuid-ossp\""
end
def down
# Extensions are rarely dropped to avoid data loss
# execute "DROP EXTENSION IF EXISTS vector CASCADE"
end
end
Schema Version Tracking¶
-- ActiveRecord schema_migrations table
CREATE TABLE schema_migrations (
version VARCHAR PRIMARY KEY
);
-- Check current schema version
SELECT version FROM schema_migrations ORDER BY version DESC;
-- Example output:
-- 20240115000006 (006_create_ragdoll_contents)
-- 20240115000005 (005_create_ragdoll_embeddings)
-- 20240115000004 (004_create_ragdoll_documents)
-- 20240115000001 (001_enable_postgresql_extensions)
Version Compatibility Matrix¶
Schema Version | Ragdoll Version | PostgreSQL | pgvector | Features |
---|---|---|---|---|
001-004 | 1.0.0 | 12+ | 0.4.0+ | Basic functionality |
005 | 1.1.0 | 13+ | 0.4.0+ | Vector embeddings |
006 | 1.2.0 | 13+ | 0.5.0+ | Multi-modal content |
007+ | 2.0.0+ | 14+ | 0.5.0+ | Advanced features |
Backward Compatibility¶
Migration Safety Features¶
# Safe migration practices in Ragdoll
class SafeMigrationBase < ActiveRecord::Migration[7.0]
# Disable DDL transactions for extension creation
disable_ddl_transaction!
def safe_add_column(table, column, type, options = {})
# Add column with default to avoid locks on large tables
add_column table, column, type, options.merge(default: options[:default])
# Remove default after column is added (if temporary)
if options[:temporary_default]
change_column_default table, column, nil
end
end
def safe_add_index(table, columns, options = {})
# Use concurrent index creation for large tables
options[:algorithm] = :concurrently
add_index table, columns, options
end
end
Data Migration Strategies¶
# Data-safe migration example
class AddNewMetadataFields < ActiveRecord::Migration[7.0]
def up
# Add new columns with defaults
add_column :ragdoll_documents, :new_field, :string, default: ''
# Migrate existing data in batches
Document.find_in_batches(batch_size: 1000) do |batch|
batch.each do |document|
document.update_column(:new_field, migrate_old_data(document))
end
end
end
def down
remove_column :ragdoll_documents, :new_field
end
private
def migrate_old_data(document)
# Safe data transformation logic
document.old_field&.transform_data || ''
end
end
Rollback Safety¶
# Reversible migration patterns
class ReversibleMigration < ActiveRecord::Migration[7.0]
def change
# Automatically reversible operations
create_table :new_table do |t|
t.string :name, null: false
t.timestamps
end
add_index :new_table, :name
end
# For complex operations requiring explicit rollback
def up
# Forward migration
execute "CREATE MATERIALIZED VIEW complex_view AS ..."
end
def down
# Explicit rollback
execute "DROP MATERIALIZED VIEW IF EXISTS complex_view"
end
end
Migration Best Practices¶
Development Workflow¶
# 1. Create new migration
rails generate migration AddFeatureToDocuments feature:string
# 2. Edit migration file with safe practices
# 3. Test migration on development data
bundle exec rake db:migrate
# 4. Test rollback
bundle exec rake db:rollback
# 5. Re-run migration
bundle exec rake db:migrate
# 6. Commit migration file
git add db/migrate/*
git commit -m "Add feature column to documents"
Production Deployment¶
# Production migration checklist:
# 1. Backup database
pg_dump ragdoll_production > backup_$(date +%Y%m%d_%H%M%S).sql
# 2. Run migration with monitoring
time bundle exec rake db:migrate
# 3. Verify data integrity
bundle exec rake db:migrate:status
# 4. Test application functionality
curl -f http://localhost:3000/health
# 5. Monitor performance
psql -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
Emergency Rollback Procedures¶
# If migration causes issues:
# 1. Stop application
sudo systemctl stop ragdoll
# 2. Rollback migration
bundle exec rake db:rollback
# 3. Restore from backup if needed
psql ragdoll_production < backup_20240115_120000.sql
# 4. Restart application
sudo systemctl start ragdoll
# 5. Investigate and fix migration
Database Maintenance¶
Regular Maintenance Tasks¶
Vacuum and Analyze¶
-- Weekly maintenance routine
VACUUM ANALYZE ragdoll_documents;
VACUUM ANALYZE ragdoll_contents;
VACUUM ANALYZE ragdoll_embeddings;
-- Full vacuum (monthly, during low usage)
VACUUM FULL ragdoll_embeddings; -- Reclaim space from deleted vectors
Index Maintenance¶
-- Rebuild vector indexes after significant changes
REINDEX INDEX index_ragdoll_embeddings_on_embedding_vector_cosine;
-- Update statistics for query planner
ANALYZE ragdoll_embeddings;
Performance Monitoring¶
-- Monitor slow queries
SELECT query, mean_exec_time, calls, total_exec_time
FROM pg_stat_statements
WHERE query LIKE '%ragdoll_%'
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Check index usage
SELECT indexname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE tablename LIKE 'ragdoll_%'
ORDER BY idx_scan DESC;
Backup and Recovery¶
Backup Strategy¶
# Full database backup
pg_dump -Fc ragdoll_production > ragdoll_backup_$(date +%Y%m%d).dump
# Schema-only backup
pg_dump -s ragdoll_production > ragdoll_schema_$(date +%Y%m%d).sql
# Data-only backup
pg_dump -a ragdoll_production > ragdoll_data_$(date +%Y%m%d).sql
Recovery Procedures¶
# Restore full database
pg_restore -d ragdoll_production ragdoll_backup_20240115.dump
# Restore specific table
pg_restore -d ragdoll_production -t ragdoll_embeddings ragdoll_backup_20240115.dump
This document is part of the Ragdoll documentation suite. For immediate help, see the Quick Start Guide or API Reference.