Models Reference¶
Ragdoll's models are built on ActiveRecord. Single Table Inheritance (STI) stores multi-modal content in one shared table, and polymorphic associations let embeddings attach to any content type.
Detailed API Documentation
For complete class and method documentation, see the Ruby API Documentation (RDoc), which provides a detailed technical reference for all Ragdoll models and their methods.
ActiveRecord Models and Relationships¶
The model architecture provides:
- Single Table Inheritance (STI): Content models (TextContent, ImageContent, AudioContent) share a single table
- Polymorphic Associations: Embeddings can belong to any content type through polymorphic relationships
- PostgreSQL Optimizations: Native JSON columns, full-text search indexes, and pgvector integration
- Rich Metadata Support: Flexible metadata storage with validation and type-specific schemas
- Usage Analytics: Built-in tracking for search optimization and performance monitoring
- Comprehensive Validations: Data integrity through extensive validation rules and callbacks
Core Models¶
Document Model¶
Class: Ragdoll::Document
Table: ragdoll_documents
Primary Attributes:
# Core document identification
id :bigint # Primary key
location :string # Source location (file path, URL, identifier)
title :string # Human-readable document title
document_type :string # Document format type (text, image, audio, pdf, etc.)
status :string # Processing status (pending, processing, processed, error)
file_modified_at :datetime # Source file modification timestamp
# Metadata storage
metadata :json # LLM-generated structured metadata
file_metadata :json # File properties and processing metadata
# Timestamps
created_at :datetime
updated_at :datetime
Multi-Modal Content Associations:
# STI-based content relationships
has_many :contents, class_name: "Ragdoll::Content", dependent: :destroy
has_many :text_contents, class_name: "Ragdoll::TextContent"
has_many :image_contents, class_name: "Ragdoll::ImageContent"
has_many :audio_contents, class_name: "Ragdoll::AudioContent"
# Embedding relationships through content
has_many :text_embeddings, through: :text_contents, source: :embeddings
has_many :image_embeddings, through: :image_contents, source: :embeddings
has_many :audio_embeddings, through: :audio_contents, source: :embeddings
Key Instance Methods:
# Content access
document.content # Returns combined content from all content types
document.content = "new content" # Creates appropriate content model
# Multi-modal detection
document.multi_modal? # True if document has multiple content types
document.content_types # Array of content types: ['text', 'image', 'audio']
document.primary_content_type # Primary content type for the document
# Statistics
document.total_word_count # Sum of words across all text content
document.total_character_count # Sum of characters across all text content
document.total_embedding_count # Total embeddings across all content types
document.embeddings_by_type # Hash: { text: 10, image: 5, audio: 2 }
# Processing
document.processed? # True if status == 'processed'
document.process_content! # Generate embeddings and metadata
document.generate_metadata! # Generate LLM-based metadata
Search and Query Methods:
# PostgreSQL full-text search
Document.search_content("machine learning")
# Keywords search (array overlap - finds documents with any matching keywords)
Document.search_by_keywords(['machine', 'learning', 'ai'])
# Returns documents with keywords_match_count attribute
# Keywords search (array contains - finds documents with ALL keywords)
Document.search_by_keywords_all(['python', 'programming'])
# Returns documents with total_keywords_count attribute
# Faceted search with metadata filters
Document.faceted_search(
query: "AI research",
keywords: ["neural networks"],
classification: "academic_paper",
tags: ["machine-learning"]
)
# Hybrid search combining semantic and text search
Document.hybrid_search(
"deep learning applications",
query_embedding: embedding_vector,
semantic_weight: 0.7,
text_weight: 0.3
)
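How `semantic_weight` and `text_weight` are blended is not spelled out above. The following is a minimal sketch, assuming both scores are already normalized to the 0..1 range and combined as a weighted sum. `hybrid_score` is a hypothetical helper for illustration, not part of Ragdoll's API.

```ruby
# Hypothetical helper: blend a semantic similarity score and a full-text
# rank score into one value. Assumes both inputs are already in 0.0..1.0.
def hybrid_score(semantic_score, text_score, semantic_weight: 0.7, text_weight: 0.3)
  total = semantic_weight + text_weight
  raise ArgumentError, "weights must sum to a positive number" unless total.positive?

  # Divide by the weight total so the result stays in 0.0..1.0 even when
  # the weights do not add up to exactly 1.
  (semantic_weight * semantic_score + text_weight * text_score) / total
end

# Made-up scores for two documents, purely for illustration.
results = [
  { title: "Doc A", semantic: 0.9, text: 0.2 },
  { title: "Doc B", semantic: 0.6, text: 0.8 }
]

ranked = results
  .map { |r| r.merge(score: hybrid_score(r[:semantic], r[:text])) }
  .sort_by { |r| -r[:score] }

ranked.each { |r| puts format("%-6s %.2f", r[:title], r[:score]) }
```

With the default 0.7/0.3 weights, a document that is a strong semantic match can outrank one that is a stronger keyword match, which is the trade-off the two weights control.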
Content Models (STI Architecture)¶
Base Class: Ragdoll::Content
Table: ragdoll_contents (shared by all content types)
STI Classes:
- Ragdoll::TextContent
- Ragdoll::ImageContent
- Ragdoll::AudioContent
Shared Attributes:
id :bigint # Primary key
type :string # STI discriminator (Ragdoll::TextContent, Ragdoll::ImageContent, etc.)
document_id :bigint # Foreign key to document
embedding_model :string # Model used for embeddings
content :text # Content text (text, description, transcript)
data :text # Raw file data or metadata
metadata :json # Content-specific metadata
duration :float # Audio duration (audio content only)
sample_rate :integer # Audio sample rate (audio content only)
created_at :datetime
updated_at :datetime
Polymorphic Relationships:
# Each content model belongs to a document
belongs_to :document, class_name: "Ragdoll::Document"
# Each content model can have many embeddings
has_many :embeddings, class_name: "Ragdoll::Embedding", as: :embeddable
TextContent Model¶
Specific Validations:
Text-Specific Methods:
# Content analysis
text_content.word_count # Number of words in content
text_content.character_count # Number of characters in content
text_content.line_count # Number of lines (from metadata)
# Chunking configuration
text_content.chunk_size # Tokens per chunk (default: 1000)
text_content.chunk_size = 1500 # Set custom chunk size
text_content.overlap # Token overlap (default: 200)
text_content.overlap = 300 # Set custom overlap
# Content processing
text_content.chunks # Array of content chunks with positions
text_content.generate_embeddings! # Generate embeddings for all chunks
Text Processing Example:
text_content = document.text_contents.create!(
content: "Large document text...",
embedding_model: "text-embedding-3-large",
metadata: {
encoding: "UTF-8",
line_count: 150,
chunk_size: 1000,
overlap: 200
}
)
# Generate embeddings automatically
text_content.generate_embeddings!
# Access generated chunks and embeddings
text_content.chunks.each do |chunk|
puts "Chunk #{chunk[:chunk_index]}: #{chunk[:content][0..50]}..."
end
text_content.embeddings.each do |embedding|
puts "Embedding #{embedding.chunk_index}: #{embedding.embedding_vector.length} dimensions"
end
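The interaction between `chunk_size` and `overlap` can be illustrated with a sliding window. This is a sketch only: it approximates tokens with whitespace-separated words, and `chunk_text` and the `:start_token` key are hypothetical names, so Ragdoll's actual chunker may behave differently.

```ruby
# Hypothetical sketch of overlapping chunking. Tokens are approximated by
# whitespace-separated words; a real tokenizer would count differently.
def chunk_text(text, chunk_size: 1000, overlap: 200)
  raise ArgumentError, "chunk_size must be greater than 0" unless chunk_size.positive?
  raise ArgumentError, "overlap must be in 0...chunk_size" unless (0...chunk_size).cover?(overlap)

  tokens = text.split
  step = chunk_size - overlap # how far the window advances each time
  chunks = []
  start = 0

  while start < tokens.length
    window = tokens[start, chunk_size]
    chunks << { chunk_index: chunks.length, start_token: start, content: window.join(" ") }
    break if start + chunk_size >= tokens.length # this window reached the end

    start += step
  end

  chunks
end

words = (1..10).map { |i| "w#{i}" }.join(" ")
chunk_text(words, chunk_size: 4, overlap: 1).each do |chunk|
  puts "Chunk #{chunk[:chunk_index]}: #{chunk[:content]}"
end
```

Each chunk repeats the last `overlap` tokens of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.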
ImageContent Model¶
Image-Specific Attributes:
# content field stores AI-generated description
# data field stores image binary data or file reference
# metadata stores image properties
Image-Specific Methods:
# Image properties (from metadata)
image_content.width # Image width in pixels
image_content.height # Image height in pixels
image_content.file_size # File size in bytes
image_content.format # Image format (jpg, png, etc.)
# AI-generated content
image_content.description # AI-generated description (stored in content field)
image_content.objects_detected # Detected objects (from metadata)
image_content.scene_type # Scene classification (from metadata)
AudioContent Model (Planned)¶
Audio-Specific Attributes:
duration :float # Audio duration in seconds
sample_rate :integer # Sample rate in Hz
# content field stores transcript
# data field stores audio binary data
# metadata stores audio properties and timestamps
Audio-Specific Methods (Planned):
# Audio properties
audio_content.duration_formatted # "5:42" format
audio_content.bitrate # Audio bitrate (from metadata)
audio_content.channels # Number of audio channels
# Transcript and timestamps
audio_content.transcript # Full transcript (stored in content field)
audio_content.timestamps # Word-level timestamps (from metadata)
audio_content.speakers # Speaker identification (from metadata)
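Since the audio methods are still planned, the "5:42" format can only be sketched. The example below assumes `duration` holds seconds as a float; `format_duration` is a hypothetical stand-in for `duration_formatted`.

```ruby
# Hypothetical formatter: turns a duration in seconds into "m:ss",
# or "h:mm:ss" once the duration reaches one hour.
def format_duration(seconds)
  total = seconds.round
  hours, remainder = total.divmod(3600)
  minutes, secs = remainder.divmod(60)

  if hours.positive?
    format("%d:%02d:%02d", hours, minutes, secs)
  else
    format("%d:%02d", minutes, secs)
  end
end

puts format_duration(342.0) # prints 5:42
puts format_duration(3725)  # prints 1:02:05
```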
Embedding Model¶
Class: Ragdoll::Embedding
Table: ragdoll_embeddings
Attributes:
id :bigint # Primary key
embeddable_type :string # Polymorphic type (Content class name)
embeddable_id :bigint # Polymorphic ID (Content record ID)
chunk_index :integer # Order within content
content :text # Original text that was embedded
embedding_vector :vector(1536) # pgvector column (configurable dimensions)
usage_count :integer # Number of times used in searches
returned_at :datetime # Last usage timestamp
created_at :datetime
updated_at :datetime
Polymorphic Association:
# Belongs to any content type through polymorphic association
belongs_to :embeddable, polymorphic: true
# Can belong to TextContent, ImageContent, or AudioContent
embedding.embeddable # Returns the associated content object
embedding.embeddable_type # "Ragdoll::TextContent"
embedding.embeddable_id # Content record ID
Vector Search Methods:
# pgvector similarity search
Embedding.search_similar(
query_embedding,
limit: 20,
threshold: 0.7,
filters: {
embeddable_type: "Ragdoll::TextContent",
document_type: "pdf",
embedding_model: "text-embedding-3-large"
}
)
# Usage analytics
embedding.mark_as_used! # Increment usage_count and update returned_at
embedding.usage_score # Calculated usage score for ranking
embedding.embedding_dimensions # Number of vector dimensions
Search Result Format:
[
{
embedding_id: "123",
embeddable_id: "456",
embeddable_type: "Ragdoll::TextContent",
document_id: "789",
document_title: "AI Research Paper",
document_location: "/path/to/document.pdf",
content: "Machine learning algorithms...",
similarity: 0.85,
distance: 0.15,
chunk_index: 5,
embedding_dimensions: 1536,
embedding_model: "text-embedding-3-large",
usage_count: 12,
returned_at: "2025-01-15T10:30:00Z",
combined_score: 0.92
}
]
Model Relationships¶
Ragdoll uses a sophisticated relationship structure optimized for multi-modal content:
Primary Relationships¶
erDiagram
Document ||--o{ TextContent : "has many"
Document ||--o{ ImageContent : "has many"
Document ||--o{ AudioContent : "has many"
TextContent ||--o{ Embedding : "has many (polymorphic)"
ImageContent ||--o{ Embedding : "has many (polymorphic)"
AudioContent ||--o{ Embedding : "has many (polymorphic)"
Document {
bigint id PK
string location
string title
string document_type
string status
json metadata
json file_metadata
datetime file_modified_at
}
TextContent {
bigint id PK
string type "'TextContent'"
bigint document_id FK
string embedding_model
text content
text data
json metadata
}
ImageContent {
bigint id PK
string type "'ImageContent'"
bigint document_id FK
string embedding_model
text content "AI description"
text data "Image data"
json metadata
}
AudioContent {
bigint id PK
string type "'AudioContent'"
bigint document_id FK
string embedding_model
text content "Transcript"
text data "Audio data"
json metadata
float duration
integer sample_rate
}
Embedding {
bigint id PK
string embeddable_type FK
bigint embeddable_id FK
integer chunk_index
text content
vector embedding_vector
integer usage_count
datetime returned_at
}
Association Details¶
Document Associations:
class Document < ActiveRecord::Base
# Content associations (STI)
has_many :contents, class_name: "Content", dependent: :destroy
has_many :text_contents, -> { where(type: "Ragdoll::TextContent") }
has_many :image_contents, -> { where(type: "Ragdoll::ImageContent") }
has_many :audio_contents, -> { where(type: "Ragdoll::AudioContent") }
# Embedding associations through content
has_many :text_embeddings, through: :text_contents, source: :embeddings
has_many :image_embeddings, through: :image_contents, source: :embeddings
has_many :audio_embeddings, through: :audio_contents, source: :embeddings
# Access all embeddings across content types
def all_embeddings(content_type: nil)
if content_type
case content_type.to_s
when 'text' then text_embeddings
when 'image' then image_embeddings
when 'audio' then audio_embeddings
else Embedding.none # Unknown type: return an empty relation instead of nil
end
else
Embedding.where(
embeddable_type: 'Ragdoll::Content',
embeddable_id: contents.pluck(:id)
)
end
end
end
Content Associations (STI Base):
class Content < ActiveRecord::Base
# Parent document relationship
belongs_to :document, class_name: "Document", foreign_key: "document_id"
# Polymorphic embedding relationship
has_many :embeddings, as: :embeddable, dependent: :destroy
# STI subclasses: TextContent, ImageContent, AudioContent
end
Embedding Associations (Polymorphic):
class Embedding < ActiveRecord::Base
# Polymorphic association - can belong to any content type
belongs_to :embeddable, polymorphic: true
# Access parent document through content
def document
embeddable&.document
end
# Scopes for different content types
scope :text_embeddings, -> { where(embeddable_type: "Ragdoll::TextContent") }
scope :image_embeddings, -> { where(embeddable_type: "Ragdoll::ImageContent") }
scope :audio_embeddings, -> { where(embeddable_type: "Ragdoll::AudioContent") }
end
Database Constraints and Foreign Keys¶
Foreign Key Constraints:
-- Document to Content relationship
ALTER TABLE ragdoll_contents
ADD CONSTRAINT fk_contents_document
FOREIGN KEY (document_id) REFERENCES ragdoll_documents(id)
ON DELETE CASCADE;
-- Polymorphic embedding relationships (enforced by application)
-- Note: PostgreSQL doesn't support polymorphic foreign key constraints
-- These are enforced through ActiveRecord validations and callbacks
Unique Constraints:
-- Ensure unique document locations
ALTER TABLE ragdoll_documents
ADD CONSTRAINT unique_document_location UNIQUE (location);
-- Ensure unique chunk indexes per content
ALTER TABLE ragdoll_embeddings
ADD CONSTRAINT unique_chunk_per_content
UNIQUE (embeddable_type, embeddable_id, chunk_index);
Check Constraints:
-- Ensure valid document types
ALTER TABLE ragdoll_documents
ADD CONSTRAINT valid_document_type
CHECK (document_type IN ('text', 'image', 'audio', 'pdf', 'docx', 'html', 'markdown', 'mixed'));
-- Ensure valid processing status
ALTER TABLE ragdoll_documents
ADD CONSTRAINT valid_status
CHECK (status IN ('pending', 'processing', 'processed', 'error'));
-- Ensure valid content types for STI
ALTER TABLE ragdoll_contents
ADD CONSTRAINT valid_content_type
CHECK (type IN ('Ragdoll::TextContent',
'Ragdoll::ImageContent',
'Ragdoll::AudioContent'));
Index Strategy¶
Performance Indexes:
-- Document indexes
CREATE INDEX idx_documents_status ON ragdoll_documents(status);
CREATE INDEX idx_documents_type ON ragdoll_documents(document_type);
CREATE INDEX idx_documents_created_at ON ragdoll_documents(created_at);
-- Content indexes (STI table)
CREATE INDEX idx_contents_type ON ragdoll_contents(type);
CREATE INDEX idx_contents_document_id ON ragdoll_contents(document_id);
CREATE INDEX idx_contents_embedding_model ON ragdoll_contents(embedding_model);
-- Embedding indexes
CREATE INDEX idx_embeddings_embeddable ON ragdoll_embeddings(embeddable_type, embeddable_id);
CREATE INDEX idx_embeddings_usage_count ON ragdoll_embeddings(usage_count);
CREATE INDEX idx_embeddings_returned_at ON ragdoll_embeddings(returned_at);
-- pgvector similarity search index
CREATE INDEX idx_embeddings_vector_cosine ON ragdoll_embeddings
USING ivfflat (embedding_vector vector_cosine_ops) WITH (lists = 100);
Full-Text Search Indexes:
-- Document full-text search
CREATE INDEX idx_documents_fulltext ON ragdoll_documents
USING gin(to_tsvector('english',
title || ' ' ||
COALESCE(metadata->>'summary', '') || ' ' ||
COALESCE(metadata->>'keywords', '') || ' ' ||
COALESCE(metadata->>'description', '')
));
-- Content full-text search
CREATE INDEX idx_contents_fulltext ON ragdoll_contents
USING gin(to_tsvector('english', COALESCE(content, '')));
Instance Methods¶
Document Methods¶
Content Retrieval Methods¶
# Dynamic content access based on primary content type
document.content
# Returns combined content from all content types
# For text: concatenated text from all text_contents
# For image: concatenated descriptions from all image_contents
# For audio: concatenated transcripts from all audio_contents
# Content type detection
document.content_types # => ['text', 'image']
document.primary_content_type # => 'text'
document.multi_modal? # => true (if multiple content types)
# Content statistics
document.total_word_count # Sum across all text content
document.total_character_count # Sum across all text content
document.total_embedding_count # Sum across all content types
document.embeddings_by_type # => { text: 15, image: 3, audio: 0 }
# Content access by type
document.text_contents.each { |tc| puts tc.content }
document.image_contents.each { |ic| puts ic.content } # AI descriptions
document.audio_contents.each { |ac| puts ac.content } # Transcripts
Metadata Accessors¶
# LLM-generated metadata (stored in metadata JSON column)
document.metadata # Full metadata hash
document.description # metadata['description']
document.description = "New desc" # Updates metadata hash
document.classification # metadata['classification']
document.classification = "technical"
document.tags # metadata['tags'] (array)
document.tags = ['ai', 'research']
# Metadata utility methods
document.has_summary? # Check if summary exists
document.has_keywords? # Check if keywords exist
document.keywords_array # Parse keywords into array
document.add_keyword('machine-learning')
document.remove_keyword('outdated')
# File metadata (stored in file_metadata JSON column)
document.file_metadata # File processing metadata
document.total_file_size # Sum of all content file sizes
document.primary_file_type # Document's primary file type
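The keyword helpers can be pictured as simple operations on the metadata hash. The sketch below works on a plain Ruby hash and assumes keywords may be stored either as an array or as a comma-separated string. The free functions are hypothetical illustrations, not Ragdoll's implementation.

```ruby
# Hypothetical helpers over a plain metadata hash. Assumes keywords may be
# stored either as an array or as a comma-separated string.
def keywords_array(metadata)
  raw = metadata["keywords"]
  list = raw.is_a?(Array) ? raw : raw.to_s.split(",")
  # Normalize: trim whitespace, lowercase, drop blanks and duplicates.
  list.map { |k| k.to_s.strip.downcase }.reject(&:empty?).uniq
end

# Returns a new hash with the keyword added (no duplicates).
def add_keyword(metadata, keyword)
  metadata.merge("keywords" => (keywords_array(metadata) + [keyword.strip.downcase]).uniq)
end

# Returns a new hash with the keyword removed, if present.
def remove_keyword(metadata, keyword)
  metadata.merge("keywords" => keywords_array(metadata) - [keyword.strip.downcase])
end

metadata = { "keywords" => "AI, research , ai" }
metadata = add_keyword(metadata, "machine-learning")
metadata = remove_keyword(metadata, "research")
p keywords_array(metadata)
```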
Processing Status Methods¶
# Status checking
document.processed? # status == 'processed'
document.status # 'pending', 'processing', 'processed', 'error'
# Content processing
document.process_content! # Full processing pipeline:
# 1. Generate embeddings for all content
# 2. Generate LLM metadata
# 3. Update status to 'processed'
document.generate_embeddings_for_all_content! # Generate embeddings only
document.generate_metadata! # Generate LLM metadata only
# Processing validation
document.has_files? # Check if content has associated files
document.has_pending_content? # Check for content awaiting processing
File Handling Methods¶
# File association (through content models)
document.has_files? # Any content has file data
document.total_file_size # Sum of all file sizes
document.primary_file_type # Main file type
# File metadata access
document.file_modified_at # Source file modification time
document.location # Source file path or URL
# Content creation from files
document.content = "new text" # Creates TextContent automatically
# For images/audio, use specific content models:
document.image_contents.create!(data: image_data, embedding_model: 'clip')
Content Methods¶
Embedding Generation¶
# Base Content methods (inherited by all content types)
content.generate_embeddings! # Generate embeddings for this content
content.should_generate_embeddings? # Check if embeddings needed
content.content_for_embedding # Text to use for embedding (overrideable)
# TextContent specific
text_content.generate_embeddings! # Chunks text and generates embeddings
text_content.chunks # Array of content chunks with metadata
text_content.chunk_size # Tokens per chunk
text_content.overlap # Token overlap between chunks
# Embedding management
content.embeddings.count # Number of embeddings
content.embedding_count # Alias for embeddings.count
content.embeddings.destroy_all # Remove all embeddings
Content Validation¶
# Base validations (all content types)
content.valid? # ActiveRecord validation
content.errors.full_messages # Validation error messages
# Content-specific validations
text_content.content.present? # TextContent requires content
image_content.data.present? # ImageContent requires data
# Custom validation methods
content.validate_embedding_model # Ensure model is supported
content.validate_content_size # Check content size limits
Processing Callbacks¶
# Automatic processing callbacks
# after_create: Generate embeddings if content is ready
# after_update: Regenerate embeddings if content changed
# before_destroy: Clean up associated embeddings
# Manual callback triggering
content.run_callbacks(:create) # Trigger create callbacks
content.run_callbacks(:update) # Trigger update callbacks
# Callback status checking
content.embeddings_generated? # Check if embeddings exist
content.metadata['embeddings_generated_at'] # Generation timestamp
Embedding Methods¶
Similarity Search¶
# Instance-level similarity (compare with other embeddings)
embedding.similarity_to(other_embedding) # Cosine similarity score
embedding.distance_to(other_embedding) # Distance (1 - similarity)
# Class-level similarity search
Embedding.search_similar(
query_embedding,
limit: 20,
threshold: 0.7,
filters: {
embeddable_type: 'Ragdoll::TextContent',
document_type: 'pdf'
}
)
# Specialized search methods
embedding.find_similar(limit: 10) # Find similar embeddings
embedding.find_related_in_document(limit: 5) # Similar chunks in same document
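The relationship between the similarity score and the distance can be checked in plain Ruby. This is a standalone sketch of cosine similarity on arrays; in Ragdoll the comparison presumably runs inside PostgreSQL via pgvector, and the function names here are illustrative.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  # Cosine similarity is undefined for zero vectors; treat it as no similarity.
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Cosine distance as described above: 1 - similarity.
def cosine_distance(a, b)
  1.0 - cosine_similarity(a, b)
end

a = [1.0, 0.0, 1.0]
b = [1.0, 1.0, 0.0]
puts cosine_similarity(a, b).round(4)
puts cosine_distance(a, b).round(4)
```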
Usage Tracking¶
# Usage analytics
embedding.mark_as_used! # Increment usage_count, update returned_at
embedding.usage_count # Number of times used in searches
embedding.returned_at # Last usage timestamp
embedding.last_used_days_ago # Days since last use
# Usage scoring
embedding.usage_score # Calculated usage score for ranking
embedding.frequency_score # Frequency-based component
embedding.recency_score # Recency-based component
# Usage statistics
embedding.is_popular? # usage_count > threshold
embedding.is_recent? # used within recent timeframe
embedding.is_trending? # increasing usage pattern
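The formulas behind `usage_score`, `frequency_score`, and `recency_score` are not given here. One plausible shape, shown purely as an assumption, combines a logarithmic frequency term with an exponentially decaying recency term. The 30-day half-life and the multiplication of the two components are illustrative choices, not Ragdoll's documented behavior.

```ruby
# Hypothetical scoring: none of these formulas come from Ragdoll itself.
SECONDS_PER_DAY = 86_400.0

# Frequency component: grows with usage_count but flattens out (log scale).
def frequency_score(usage_count)
  Math.log(1 + usage_count)
end

# Recency component: 1.0 when just used, halving every half_life_days.
def recency_score(returned_at, now: Time.now, half_life_days: 30.0)
  return 0.0 if returned_at.nil? # never returned by a search

  days_ago = [(now - returned_at) / SECONDS_PER_DAY, 0.0].max
  0.5**(days_ago / half_life_days)
end

# Combined score: frequently used and recently used embeddings rank highest.
def usage_score(usage_count, returned_at, now: Time.now)
  frequency_score(usage_count) * recency_score(returned_at, now: now)
end

now = Time.utc(2025, 1, 15, 10, 30)
puts usage_score(12, now - 30 * 86_400, now: now).round(3)
puts usage_score(12, nil, now: now)
```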
Analytics Methods¶
# Embedding metadata
embedding.embedding_dimensions # Vector dimensionality
embedding.embedding_model # Model used (via content relationship)
embedding.chunk_index # Position within content
# Content access
embedding.embeddable # Associated content object
embedding.document # Parent document (through content)
embedding.content_preview(length: 100) # Truncated content preview
# Search result formatting
embedding.to_search_result(similarity: 0.85)
# Returns formatted hash for search APIs
# Performance metrics
embedding.vector_magnitude # Vector magnitude (for normalization)
embedding.vector_norm # L2 norm of the vector
embedding.vector_sparsity # Percentage of zero values
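The vector metrics above are standard quantities that can be computed directly on an array of floats. The sketch below is a plain-Ruby illustration; the free functions, including `l2_normalize`, are assumed names and not Ragdoll methods.

```ruby
# L2 norm (magnitude) of a vector.
def vector_norm(vector)
  Math.sqrt(vector.sum { |v| v * v })
end

# Percentage of components that are exactly zero.
def vector_sparsity(vector)
  return 0.0 if vector.empty?

  100.0 * vector.count(&:zero?) / vector.length
end

# Scale a vector to unit length; a zero vector is returned unchanged.
def l2_normalize(vector)
  norm = vector_norm(vector)
  norm.zero? ? vector.dup : vector.map { |v| v / norm }
end

v = [3.0, 0.0, 4.0, 0.0]
puts vector_norm(v)     # 5.0
puts vector_sparsity(v) # 50.0
p l2_normalize(v)       # [0.6, 0.0, 0.8, 0.0]
```

After normalization every vector has norm 1, so the dot product of two normalized vectors equals their cosine similarity.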
Class Methods¶
Document Class Methods¶
Scopes and Query Methods¶
# Status-based scopes
Document.processed # WHERE status = 'processed'
Document.pending # WHERE status = 'pending'
Document.processing # WHERE status = 'processing'
Document.with_errors # WHERE status = 'error'
# Content-based scopes
Document.by_type('pdf') # WHERE document_type = 'pdf'
Document.multi_modal # Documents with multiple content types
Document.text_only # Documents with only text content
Document.with_content # Documents that have content models
Document.without_content # Documents missing content models
# Time-based scopes
Document.recent # ORDER BY created_at DESC
Document.created_since(1.week.ago) # WHERE created_at > ?
Document.modified_since(1.day.ago) # WHERE file_modified_at > ?
# Advanced queries
Document.with_embeddings_count # Includes embedding count
Document.by_content_length(min: 1000) # Filter by content length
Document.by_file_size(max: 10.megabytes) # Filter by file size
Search and Filtering¶
# PostgreSQL full-text search
Document.search_content(
"machine learning algorithms",
limit: 20
)
# Faceted search with metadata filters
Document.faceted_search(
query: "neural networks",
keywords: ["deep learning", "AI"],
classification: "research_paper",
tags: ["computer-science"],
limit: 50
)
# Hybrid search (semantic + full-text)
Document.hybrid_search(
"artificial intelligence applications",
query_embedding: embedding_vector,
semantic_weight: 0.7,
text_weight: 0.3,
limit: 25
)
# Metadata-based filtering
Document.with_classification('technical_manual')
Document.with_keywords(['api', 'documentation'])
Document.with_tags(['development', 'guide'])
Document.by_metadata_field('complexity', 'advanced')
Statistics and Analytics¶
# Comprehensive statistics
Document.stats
# Returns:
# {
# total_documents: 1250,
# by_status: { processed: 1100, pending: 50, processing: 75, error: 25 },
# by_type: { pdf: 600, docx: 300, text: 200, image: 100, mixed: 50 },
# multi_modal_documents: 75,
# total_text_contents: 1000,
# total_image_contents: 125,
# total_audio_contents: 25,
# total_embeddings: { text: 15000, image: 500, audio: 100 },
# storage_type: "activerecord_polymorphic"
# }
# Usage analytics
Document.popular(limit: 10) # Most searched documents
Document.trending(timeframe: 1.week) # Recently popular documents
Document.usage_summary(period: 1.month) # Usage statistics
# Content analysis
Document.average_word_count # Average words per document
Document.total_storage_size # Total storage used
Document.embedding_coverage # Percentage with embeddings
# Performance metrics
Document.processing_time_stats # Processing time statistics
Document.error_rate(period: 1.day) # Error rate percentage
Document.throughput_stats # Documents processed per hour
Batch Operations¶
# Batch processing
Document.process_pending! # Process all pending documents
Document.regenerate_embeddings!(model: 'text-embedding-3-large')
Document.bulk_update_metadata(classification: 'archived')
# Batch import
Document.import_from_directory(
'/path/to/documents',
file_patterns: ['*.pdf', '*.docx'],
recursive: true,
batch_size: 100
)
# Batch cleanup
Document.cleanup_orphaned_content! # Remove content without documents
Document.remove_old_embeddings!(older_than: 6.months)
Document.vacuum_unused_storage! # Cleanup unused file storage
Content Class Methods¶
Content-Type Specific Queries¶
# Base Content class methods
Content.by_type('TextContent') # Filter by STI type
Content.with_embeddings # Content that has embeddings
Content.without_embeddings # Content missing embeddings
Content.by_embedding_model('text-embedding-3-large')
# TextContent specific
TextContent.by_word_count(min: 500, max: 5000)
TextContent.by_character_count(min: 2000)
TextContent.with_long_content # Content over threshold
TextContent.recently_processed # Recently generated embeddings
# ImageContent specific
ImageContent.by_dimensions(min_width: 800, min_height: 600)
ImageContent.by_file_size(max: 5.megabytes)
ImageContent.with_descriptions # Has AI-generated descriptions
ImageContent.by_format(['jpg', 'png'])
# AudioContent specific (planned)
AudioContent.by_duration(min: 30.seconds, max: 10.minutes)
AudioContent.by_sample_rate(44100)
AudioContent.with_transcripts # Has speech-to-text transcripts
Content Statistics¶
# TextContent statistics
TextContent.stats
# Returns:
# {
# total_text_contents: 1000,
# by_model: { 'text-embedding-3-large': 600, 'text-embedding-3-small': 400 },
# total_embeddings: 15000,
# average_word_count: 1250,
# average_chunk_size: 1000
# }
# Processing statistics
Content.processing_stats # Embedding generation statistics
Content.model_usage_stats # Usage by embedding model
Content.error_rate_by_type # Error rates by content type
Embedding Class Methods¶
Advanced Search Methods¶
# Vector similarity search with filters
Embedding.search_similar(
query_embedding,
limit: 20,
threshold: 0.75,
filters: {
embeddable_type: 'Ragdoll::TextContent',
embedding_model: 'text-embedding-3-large',
document_type: 'pdf',
created_after: 1.month.ago
}
)
# Batch similarity search
Embedding.batch_search_similar(
[embedding1, embedding2, embedding3],
limit: 10,
aggregate_results: true
)
# Specialized search methods
Embedding.find_duplicates(threshold: 0.95) # Near-duplicate detection
Embedding.find_outliers(threshold: 0.3) # Low-similarity outliers
Embedding.cluster_similar(max_clusters: 10) # K-means clustering
Usage Analytics¶
# Usage tracking
Embedding.most_used(limit: 100) # Highest usage_count
Embedding.recently_used(since: 1.hour.ago)
Embedding.trending(period: 1.day) # Increasing usage pattern
Embedding.popular_content_types # Usage by content type
# Performance analytics
Embedding.search_performance_stats # Search timing statistics
Embedding.model_performance_comparison # Compare model effectiveness
Embedding.quality_metrics # Embedding quality assessment
# Cache optimization
Embedding.precompute_popular! # Cache popular embeddings
Embedding.optimize_indexes! # Rebuild vector indexes
Batch Operations¶
# Batch embedding operations
Embedding.regenerate_for_model!(
old_model: 'text-embedding-ada-002',
new_model: 'text-embedding-3-large'
)
Embedding.update_usage_analytics! # Recalculate usage scores
Embedding.cleanup_orphaned! # Remove embeddings without content
Embedding.normalize_vectors! # L2 normalize all vectors
# Database maintenance
Embedding.rebuild_vector_indexes! # Rebuild pgvector indexes
Embedding.vacuum_embeddings_table! # PostgreSQL VACUUM operation
Embedding.analyze_vector_distribution! # Update query planner statistics
Model Validations¶
Ragdoll implements comprehensive validation rules to ensure data integrity:
Document Model Validations¶
Required Fields¶
class Document < ActiveRecord::Base
validates :location, presence: true
validates :title, presence: true
validates :document_type, presence: true
validates :status, presence: true
validates :file_modified_at, presence: true
end
Format Validations¶
# Document type validation
validates :document_type,
inclusion: {
in: %w[text image audio pdf docx html markdown mixed],
message: "must be a valid document type"
}
# Status validation
validates :status,
inclusion: {
in: %w[pending processing processed error],
message: "must be a valid processing status"
}
# Location format validation
validates :location, format: {
with: /\A(https?:\/\/|\/).*\z/,
message: "must be a valid URL or absolute file path"
}
# Metadata JSON validation
validate :validate_metadata_structure
private
def validate_metadata_structure
return unless metadata.present?
# Validate metadata against document type schema
schema_errors = MetadataSchemas.validate_metadata(document_type, metadata)
schema_errors.each { |error| errors.add(:metadata, error) }
end
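To see what the location format validation accepts, the same regular expression can be run against a few sample strings. The sample locations below are made up for illustration.

```ruby
# Same pattern as the location format validation above.
LOCATION_FORMAT = /\A(https?:\/\/|\/).*\z/

samples = [
  "https://example.com/paper.pdf", # HTTP(S) URL
  "/var/docs/report.docx",         # absolute file path
  "relative/path.txt",             # relative path
  "ftp://example.com/file.txt"     # non-HTTP scheme
]

samples.each do |location|
  verdict = location.match?(LOCATION_FORMAT) ? "accepted" : "rejected"
  puts "#{verdict}: #{location}"
end
```

Only the first two samples match. The pattern checks the prefix alone, so any string starting with `/` passes regardless of what follows; stricter checks are left to the custom validators.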
Custom Validators¶
# Custom location validator
validate :validate_location_accessibility
def validate_location_accessibility
return unless location.present?
# For file paths, check if file exists and is readable
if location.start_with?('/')
unless File.exist?(location) && File.readable?(location)
errors.add(:location, "file does not exist or is not readable")
end
end
# For URLs, validate format more strictly
if location.start_with?('http')
begin
uri = URI.parse(location)
unless uri.is_a?(URI::HTTP) || uri.is_a?(URI::HTTPS)
errors.add(:location, "must be a valid HTTP or HTTPS URL")
end
rescue URI::InvalidURIError
errors.add(:location, "is not a valid URL")
end
end
end
# File size validation
validate :validate_reasonable_file_size
def validate_reasonable_file_size
if location.present? && File.exist?(location)
file_size = File.size(location)
max_size = 100.megabytes # Configurable limit
if file_size > max_size
errors.add(:location, "file size (#{file_size} bytes) exceeds maximum (#{max_size} bytes)")
end
end
end
Content Model Validations¶
Base Content Validations¶
class Content < ActiveRecord::Base
validates :type, presence: true
validates :embedding_model, presence: true
validates :document_id, presence: true
# Ensure valid STI type
validates :type, inclusion: {
in: %w[
Ragdoll::TextContent
Ragdoll::ImageContent
Ragdoll::AudioContent
],
message: "must be a valid content type"
}
# Validate embedding model exists
validate :validate_embedding_model_exists
private
def validate_embedding_model_exists
return unless embedding_model.present?
valid_models = Ragdoll.config.embedding_config.keys.map(&:to_s)
unless valid_models.include?(embedding_model)
errors.add(:embedding_model, "'#{embedding_model}' is not a configured embedding model")
end
end
end
TextContent Specific Validations¶
class TextContent < Content
validates :content, presence: true
validates :content, length: {
minimum: 10,
maximum: 1_000_000, # 1,000,000-character limit (about 1 MB of ASCII text)
message: "must be between 10 and 1,000,000 characters"
}
# Validate chunk configuration
validate :validate_chunk_configuration
private
def validate_chunk_configuration
chunk_size_val = chunk_size
overlap_val = overlap
if chunk_size_val <= 0
errors.add(:chunk_size, "must be greater than 0")
end
if overlap_val < 0
errors.add(:overlap, "cannot be negative")
end
if overlap_val >= chunk_size_val
errors.add(:overlap, "must be less than chunk_size")
end
end
end
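The chunk configuration rules can be exercised without a database. The sketch below restates the three checks as a plain function. `chunk_configuration_errors` is a hypothetical name, and the returned hash only mimics the shape of ActiveRecord's `errors`.

```ruby
# Plain-Ruby version of the chunk configuration checks; returns a hash of
# error messages keyed by field, empty when the configuration is valid.
def chunk_configuration_errors(chunk_size:, overlap:)
  errors = Hash.new { |hash, key| hash[key] = [] }

  errors[:chunk_size] << "must be greater than 0" if chunk_size <= 0
  errors[:overlap] << "cannot be negative" if overlap.negative?
  errors[:overlap] << "must be less than chunk_size" if overlap >= chunk_size

  errors
end

p chunk_configuration_errors(chunk_size: 1000, overlap: 200) # valid defaults
p chunk_configuration_errors(chunk_size: 100, overlap: 100)  # overlap too large
```

The last rule matters most in practice: an overlap equal to or larger than the chunk size would keep a sliding window from ever advancing.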
ImageContent Specific Validations¶
class ImageContent < Content
validates :data, presence: true
# Validate image metadata
validate :validate_image_metadata
private
def validate_image_metadata
return unless metadata.present?
# Validate dimensions if present
if metadata['width'] && metadata['height']
width = metadata['width'].to_i
height = metadata['height'].to_i
if width <= 0 || height <= 0
errors.add(:metadata, "image dimensions must be positive integers")
end
# Reasonable size limits
if width > 50000 || height > 50000
errors.add(:metadata, "image dimensions are unreasonably large")
end
end
# Validate file format
if metadata['file_type']
valid_formats = %w[jpg jpeg png gif bmp webp svg ico tiff tif]
unless valid_formats.include?(metadata['file_type'].downcase)
errors.add(:metadata, "unsupported image format: #{metadata['file_type']}")
end
end
end
end
Embedding Model Validations¶
Vector and Content Validations¶
class Embedding < ActiveRecord::Base
validates :embeddable_id, presence: true
validates :embeddable_type, presence: true
validates :chunk_index, presence: true
validates :content, presence: true
validates :embedding_vector, presence: true
# Unique chunk index per content
validates :chunk_index, uniqueness: {
scope: [:embeddable_id, :embeddable_type],
message: "must be unique within the same content"
}
# Vector dimension validation
validate :validate_embedding_dimensions
# Content length validation
validates :content, length: {
minimum: 1,
maximum: 10000, # Reasonable chunk size limit
message: "must be between 1 and 10,000 characters"
}
# Usage count validation
validates :usage_count, numericality: {
greater_than_or_equal_to: 0,
message: "cannot be negative"
}
private
def validate_embedding_dimensions
return unless embedding_vector.present?
# Get expected dimensions for the model
expected_dimensions = get_expected_dimensions
actual_dimensions = embedding_vector.length
if actual_dimensions != expected_dimensions
errors.add(
:embedding_vector,
"has #{actual_dimensions} dimensions, expected #{expected_dimensions}"
)
end
# Validate vector values
if embedding_vector.any? { |val| !val.is_a?(Numeric) }
errors.add(:embedding_vector, "must contain only numeric values")
end
# Check for NaN or infinite values (only Float defines nan?, so guard the type first)
if embedding_vector.any? { |val| val.is_a?(Float) && (val.nan? || val.infinite?) }
errors.add(:embedding_vector, "cannot contain NaN or infinite values")
end
end
def get_expected_dimensions
model_name = embeddable&.embedding_model
return 1536 unless model_name # Default OpenAI dimension
# Look up dimensions from configuration
config = Ragdoll.config.embedding_config
config.dig(model_name.to_sym, :dimensions) || 1536
end
end
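The vector checks above can be sketched as a standalone function. This is a minimal illustration under the assumption that the vector is a plain Ruby array; `embedding_vector_errors` is a hypothetical name and the expected dimension is passed in rather than looked up from `Ragdoll.config`.

```ruby
# Hypothetical helper, not Ragdoll API: checks dimension count, element
# types, and finiteness of an embedding vector given as a Ruby array.
def embedding_vector_errors(vector, expected_dimensions)
  errors = []

  if vector.length != expected_dimensions
    errors << "has #{vector.length} dimensions, expected #{expected_dimensions}"
  end

  if vector.any? { |val| !val.is_a?(Numeric) }
    errors << "must contain only numeric values"
  elsif vector.any? { |val| val.is_a?(Float) && !val.finite? }
    # Float#finite? is false for NaN and for positive or negative infinity
    errors << "cannot contain NaN or infinite values"
  end

  errors
end

embedding_vector_errors([0.1, 0.2, 0.3], 3)   # => []
embedding_vector_errors([0.1, Float::NAN], 2) # => ["cannot contain NaN or infinite values"]
```

The finiteness check runs only when every element is numeric, so a stray string or nil produces a single clear message instead of an exception.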
Error Handling¶
Validation Error Processing¶
# Custom error handling for validation failures
class ValidationErrorHandler
def self.handle_document_errors(document)
return { success: true } if document.valid?
{
success: false,
errors: {
validation_errors: document.errors.full_messages,
field_errors: document.errors.messages,
error_count: document.errors.count
}
}
end
def self.handle_content_errors(content)
return { success: true } if content.valid?
{
success: false,
content_type: content.class.name,
errors: {
validation_errors: content.errors.full_messages,
field_errors: content.errors.messages,
suggested_fixes: generate_fix_suggestions(content.errors)
}
}
end
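# `private` does not apply to `def self.` methods, so mark the helper explicitly
private_class_method def self.generate_fix_suggestions(errors)
suggestions = []
# errors.messages maps each attribute to its messages; on Rails 6.1+,
# errors.each yields error objects rather than [field, messages] pairs
errors.messages.each do |field, messages|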
case field
when :content
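# Match both the Rails default message and the custom length message used by TextContent
if messages.any? { |m| m.include?('too short') || m.include?('must be between') }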
suggestions << "Ensure content has at least 10 characters"
end
when :embedding_model
suggestions << "Use a configured embedding model: #{available_models.join(', ')}"
when :chunk_size
suggestions << "Set chunk_size to a positive integer (recommended: 1000)"
end
end
suggestions
end
private_class_method def self.available_models
Ragdoll.config.embedding_config.keys.map(&:to_s)
end
end
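A caller might turn the result hashes above into readable output along these lines. This is a sketch only: `summarize_validation_result` is a hypothetical helper, and the sample hash is hand-written to match the shape returned by `handle_content_errors`, not captured from a real run.

```ruby
# Hypothetical consumer, not Ragdoll API: formats a result hash shaped like
# the ones returned by ValidationErrorHandler into a multi-line summary.
def summarize_validation_result(result)
  return "valid" if result[:success]

  errors = result.fetch(:errors, {})
  lines = errors.fetch(:validation_errors, []).map { |message| "error: #{message}" }
  lines += errors.fetch(:suggested_fixes, []).map { |fix| "hint: #{fix}" }
  lines.join("\n")
end

# Hand-written sample in the shape produced by handle_content_errors
sample_result = {
  success: false,
  content_type: "Ragdoll::TextContent",
  errors: {
    validation_errors: ["Chunk size must be greater than 0"],
    field_errors: { chunk_size: ["must be greater than 0"] },
    suggested_fixes: ["Set chunk_size to a positive integer (recommended: 1000)"]
  }
}

puts summarize_validation_result(sample_result)
```

Using `fetch` with defaults keeps the formatter usable for document results as well, since those carry no `suggested_fixes` key.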
Validation Callbacks¶
# Before validation callbacks for cleanup
class Document < ActiveRecord::Base
before_validation :normalize_location
before_validation :set_default_file_modified_at
before_validation :sanitize_metadata
private
def normalize_location
return unless location.present?
# Convert relative paths to absolute paths
if location.start_with?('./')
self.location = File.expand_path(location)
end
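# Normalize URL protocols: upgrade http to https and lowercase only the
# scheme and host, because paths and query strings are case-sensitive
if location.match?(/\Ahttps?:\/\//i)
self.location = location.sub(/\Ahttps?:\/\/[^\/?#]*/i) { |origin| origin.downcase.sub(/\Ahttp:/, 'https:') }
end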
end
def sanitize_metadata
return unless metadata.present?
# Remove nil values and empty strings
self.metadata = metadata.reject { |k, v| v.nil? || v == '' }
# Ensure arrays are actually arrays
['tags', 'keywords'].each do |field|
if metadata[field].is_a?(String)
metadata[field] = metadata[field].split(',').map(&:strip)
end
end
end
# Assumed default: the callback is registered above but not defined in this excerpt
def set_default_file_modified_at
self.file_modified_at ||= Time.current
end
end
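The metadata cleanup performed by the callback above can be illustrated with a pure function. This is a minimal sketch, assuming metadata arrives as a hash with string keys; `sanitize_metadata_hash` is a hypothetical name used to avoid confusion with the model's private method.

```ruby
# Hypothetical helper, not Ragdoll API: drops nil and empty-string values,
# then splits comma-separated "tags" and "keywords" strings into arrays.
# Returns a new hash and leaves the input untouched.
def sanitize_metadata_hash(metadata)
  cleaned = metadata.reject { |_key, value| value.nil? || value == "" }

  %w[tags keywords].each do |field|
    next unless cleaned[field].is_a?(String)

    cleaned[field] = cleaned[field].split(",").map(&:strip)
  end

  cleaned
end

raw = { "title" => "Report", "summary" => "", "tags" => "ruby, rag ,search", "keywords" => nil }
sanitize_metadata_hash(raw)
# => {"title"=>"Report", "tags"=>["ruby", "rag", "search"]}
```

Returning a new hash rather than mutating the argument keeps the function safe to call on values that are still referenced elsewhere.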
This document is part of the Ragdoll documentation suite. For immediate help, see the Quick Start Guide or API Reference.