Models Reference

Ragdoll's ActiveRecord models use Single Table Inheritance (STI) to store multi-modal content in one table and polymorphic associations to attach embeddings to any content type.

ActiveRecord Models and Relationships

The model architecture provides:

  • Single Table Inheritance (STI): Content models (TextContent, ImageContent, AudioContent) share a single table
  • Polymorphic Associations: Embeddings can belong to any content type through polymorphic relationships
  • PostgreSQL Optimizations: Native JSON columns, full-text search indexes, and pgvector integration
  • Rich Metadata Support: Flexible metadata storage with validation and type-specific schemas
  • Usage Analytics: Built-in tracking for search optimization and performance monitoring
  • Comprehensive Validations: Data integrity through extensive validation rules and callbacks

Core Models

Document Model

Class: Ragdoll::Document

Table: ragdoll_documents

Primary Attributes:

# Core document identification
id                :bigint           # Primary key
location          :string           # Source location (file path, URL, identifier)
title             :string           # Human-readable document title
document_type     :string           # Document format type (text, image, audio, pdf, etc.)
status            :string           # Processing status (pending, processing, processed, error)
file_modified_at  :datetime         # Source file modification timestamp

# Metadata storage
metadata          :json             # LLM-generated structured metadata
file_metadata     :json             # File properties and processing metadata

# Timestamps
created_at        :datetime
updated_at        :datetime

Multi-Modal Content Associations:

# STI-based content relationships
has_many :contents, class_name: "Ragdoll::Content", dependent: :destroy
has_many :text_contents, class_name: "Ragdoll::TextContent"
has_many :image_contents, class_name: "Ragdoll::ImageContent"
has_many :audio_contents, class_name: "Ragdoll::AudioContent"

# Embedding relationships through content
has_many :text_embeddings, through: :text_contents, source: :embeddings
has_many :image_embeddings, through: :image_contents, source: :embeddings
has_many :audio_embeddings, through: :audio_contents, source: :embeddings

Key Instance Methods:

# Content access
document.content                    # Returns combined content from all content types
document.content = "new content"    # Creates appropriate content model

# Multi-modal detection
document.multi_modal?               # True if document has multiple content types
document.content_types              # Array of content types: ['text', 'image', 'audio']
document.primary_content_type       # Primary content type for the document

# Statistics
document.total_word_count           # Sum of words across all text content
document.total_character_count      # Sum of characters across all text content
document.total_embedding_count      # Total embeddings across all content types
document.embeddings_by_type         # Hash: { text: 10, image: 5, audio: 2 }

# Processing
document.processed?                 # True if status == 'processed'
document.process_content!           # Generate embeddings and metadata
document.generate_metadata!         # Generate LLM-based metadata

Search and Query Methods:

# PostgreSQL full-text search
Document.search_content("machine learning")

# Keyword search (array overlap: finds documents with ANY matching keyword)
Document.search_by_keywords(['machine', 'learning', 'ai'])
# Returns documents with keywords_match_count attribute

# Keyword search (array contains: finds documents with ALL keywords)
Document.search_by_keywords_all(['python', 'programming'])
# Returns documents with total_keywords_count attribute

# Faceted search with metadata filters
Document.faceted_search(
  query: "AI research",
  keywords: ["neural networks"],
  classification: "academic_paper",
  tags: ["machine-learning"]
)

# Hybrid search combining semantic and text search
Document.hybrid_search(
  "deep learning applications",
  query_embedding: embedding_vector,
  semantic_weight: 0.7,
  text_weight: 0.3
)
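
The difference between the "any" and "all" keyword searches above can be illustrated in plain Ruby, without a database. This is a minimal sketch that assumes each document carries an array of keyword strings; the `DOCS` data and the helper names `match_any` and `match_all` are hypothetical and are not part of Ragdoll's API.

```ruby
# Hypothetical in-memory stand-in for documents and their keyword arrays.
DOCS = {
  "intro_to_ml.pdf"  => %w[machine learning ai],
  "python_guide.pdf" => %w[python programming],
  "ml_in_python.pdf" => %w[python machine learning programming]
}.freeze

# "Any" semantics (array overlap): keep documents that share at least one
# keyword with the query and report how many keywords matched.
def match_any(docs, keywords)
  docs.each_with_object({}) do |(name, doc_keywords), result|
    count = (doc_keywords & keywords).size
    result[name] = count if count.positive?
  end
end

# "All" semantics (array contains): keep documents that include every keyword.
def match_all(docs, keywords)
  docs.select { |_name, doc_keywords| (keywords - doc_keywords).empty? }.keys
end

ANY_MATCHES = match_any(DOCS, %w[machine learning ai])
ALL_MATCHES = match_all(DOCS, %w[python programming])
```

Here `ANY_MATCHES` maps each matching document to its match count, which plays the role of the `keywords_match_count` attribute mentioned above.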

Content Models (STI Architecture)

Base Class: Ragdoll::Content

Table: ragdoll_contents (shared by all content types)

STI Classes:

  • Ragdoll::TextContent
  • Ragdoll::ImageContent
  • Ragdoll::AudioContent

Shared Attributes:

id              :bigint           # Primary key
type            :string           # STI discriminator (Ragdoll::TextContent, Ragdoll::ImageContent, etc.)
document_id     :bigint           # Foreign key to document
embedding_model :string           # Model used for embeddings
content         :text             # Content text (text, description, transcript)
data            :text             # Raw file data or metadata
metadata        :json             # Content-specific metadata
duration        :float            # Audio duration (audio content only)
sample_rate     :integer          # Audio sample rate (audio content only)
created_at      :datetime
updated_at      :datetime

Polymorphic Relationships:

# Each content model belongs to a document
belongs_to :document, class_name: "Ragdoll::Document"

# Each content model can have many embeddings
has_many :embeddings, class_name: "Ragdoll::Embedding", as: :embeddable

TextContent Model

Specific Validations:

validates :content, presence: true  # Text content is required

Text-Specific Methods:

# Content analysis
text_content.word_count             # Number of words in content
text_content.character_count        # Number of characters in content
text_content.line_count             # Number of lines (from metadata)

# Chunking configuration
text_content.chunk_size             # Tokens per chunk (default: 1000)
text_content.chunk_size = 1500      # Set custom chunk size
text_content.overlap                # Token overlap (default: 200)
text_content.overlap = 300          # Set custom overlap

# Content processing
text_content.chunks                 # Array of content chunks with positions
text_content.generate_embeddings!   # Generate embeddings for all chunks

Text Processing Example:

text_content = document.text_contents.create!(
  content: "Large document text...",
  embedding_model: "text-embedding-3-large",
  metadata: {
    encoding: "UTF-8",
    line_count: 150,
    chunk_size: 1000,
    overlap: 200
  }
)

# Generate embeddings automatically
text_content.generate_embeddings!

# Access generated chunks and embeddings
text_content.chunks.each do |chunk|
  puts "Chunk #{chunk[:chunk_index]}: #{chunk[:content][0..50]}..."
end

text_content.embeddings.each do |embedding|
  puts "Embedding #{embedding.chunk_index}: #{embedding.embedding_vector.length} dimensions"
end
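
To make the relationship between `chunk_size` and `overlap` concrete, the following is a minimal sketch of sliding-window chunking. It assumes whitespace-separated words as a stand-in for tokens; the `chunk_tokens` helper is hypothetical, and Ragdoll's own chunker may tokenize and split differently.

```ruby
# Split a list of tokens into windows of `chunk_size` tokens. Each window
# starts `chunk_size - overlap` tokens after the previous one, so adjacent
# chunks share `overlap` tokens.
def chunk_tokens(tokens, chunk_size:, overlap:)
  raise ArgumentError, "overlap must be smaller than chunk_size" unless overlap < chunk_size

  step = chunk_size - overlap
  chunks = []
  start = 0
  index = 0
  while start < tokens.length
    chunks << { chunk_index: index, start: start, content: tokens[start, chunk_size].join(" ") }
    break if start + chunk_size >= tokens.length # this window reached the end

    start += step
    index += 1
  end
  chunks
end

WORDS = (1..10).map { |n| "w#{n}" }
CHUNKS = chunk_tokens(WORDS, chunk_size: 4, overlap: 1)
```

With ten words, a chunk size of 4, and an overlap of 1, this yields three chunks, each beginning with the last word of the previous chunk.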

ImageContent Model

Image-Specific Attributes:

# content field stores AI-generated description
# data field stores image binary data or file reference
# metadata stores image properties

Image-Specific Methods:

# Image properties (from metadata)
image_content.width                 # Image width in pixels
image_content.height                # Image height in pixels
image_content.file_size             # File size in bytes
image_content.format                # Image format (jpg, png, etc.)

# AI-generated content
image_content.description           # AI-generated description (stored in content field)
image_content.objects_detected      # Detected objects (from metadata)
image_content.scene_type            # Scene classification (from metadata)

AudioContent Model (Planned)

Audio-Specific Attributes:

duration        :float            # Audio duration in seconds
sample_rate     :integer          # Sample rate in Hz
# content field stores transcript
# data field stores audio binary data
# metadata stores audio properties and timestamps

Audio-Specific Methods (Planned):

# Audio properties
audio_content.duration_formatted    # "5:42" format
audio_content.bitrate               # Audio bitrate (from metadata)
audio_content.channels              # Number of audio channels

# Transcript and timestamps
audio_content.transcript            # Full transcript (stored in content field)
audio_content.timestamps            # Word-level timestamps (from metadata)
audio_content.speakers              # Speaker identification (from metadata)
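
Because the audio methods are still planned, the following is only a sketch of how a duration in seconds could be rendered in the "5:42" style shown above. The `format_duration` helper is hypothetical and is not Ragdoll's implementation.

```ruby
# Format a duration in seconds as "M:SS". Hours are folded into minutes,
# so 3725 seconds becomes "62:05".
def format_duration(seconds)
  total = seconds.round
  format("%d:%02d", total / 60, total % 60)
end

FORMATTED = format_duration(342.0)
```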

Embedding Model

Class: Ragdoll::Embedding

Table: ragdoll_embeddings

Attributes:

id                :bigint           # Primary key
embeddable_type   :string           # Polymorphic type (Content class name)
embeddable_id     :bigint           # Polymorphic ID (Content record ID)
chunk_index       :integer          # Order within content
content           :text             # Original text that was embedded
embedding_vector  :vector(1536)     # pgvector column (configurable dimensions)
usage_count       :integer          # Number of times used in searches
returned_at       :datetime         # Last usage timestamp
created_at        :datetime
updated_at        :datetime

Polymorphic Association:

# Belongs to any content type through polymorphic association
belongs_to :embeddable, polymorphic: true

# Can belong to TextContent, ImageContent, or AudioContent
embedding.embeddable                # Returns the associated content object
embedding.embeddable_type           # "Ragdoll::TextContent"
embedding.embeddable_id             # Content record ID

Vector Search Methods:

# pgvector similarity search
Embedding.search_similar(
  query_embedding,
  limit: 20,
  threshold: 0.7,
  filters: {
    embeddable_type: "Ragdoll::TextContent",
    document_type: "pdf",
    embedding_model: "text-embedding-3-large"
  }
)

# Usage analytics
embedding.mark_as_used!             # Increment usage_count and update returned_at
embedding.usage_score               # Calculated usage score for ranking
embedding.embedding_dimensions      # Number of vector dimensions

Search Result Format:

[
  {
    embedding_id: "123",
    embeddable_id: "456",
    embeddable_type: "Ragdoll::TextContent",
    document_id: "789",
    document_title: "AI Research Paper",
    document_location: "/path/to/document.pdf",
    content: "Machine learning algorithms...",
    similarity: 0.85,
    distance: 0.15,
    chunk_index: 5,
    embedding_dimensions: 1536,
    embedding_model: "text-embedding-3-large",
    usage_count: 12,
    returned_at: "2025-01-15T10:30:00Z",
    combined_score: 0.92
  }
]

Model Relationships

A document owns its content records (stored in one STI table), and each content record owns its embeddings through a polymorphic association:

Primary Relationships

erDiagram
    Document ||--o{ TextContent : "has many"
    Document ||--o{ ImageContent : "has many"
    Document ||--o{ AudioContent : "has many"
    TextContent ||--o{ Embedding : "has many (polymorphic)"
    ImageContent ||--o{ Embedding : "has many (polymorphic)"
    AudioContent ||--o{ Embedding : "has many (polymorphic)"

    Document {
        bigint id PK
        string location
        string title
        string document_type
        string status
        json metadata
        json file_metadata
        datetime file_modified_at
    }

    TextContent {
        bigint id PK
        string type "'TextContent'"
        bigint document_id FK
        string embedding_model
        text content
        text data
        json metadata
    }

    ImageContent {
        bigint id PK
        string type "'ImageContent'"
        bigint document_id FK
        string embedding_model
        text content "AI description"
        text data "Image data"
        json metadata
    }

    AudioContent {
        bigint id PK
        string type "'AudioContent'"
        bigint document_id FK
        string embedding_model
        text content "Transcript"
        text data "Audio data"
        json metadata
        float duration
        integer sample_rate
    }

    Embedding {
        bigint id PK
        string embeddable_type FK
        bigint embeddable_id FK
        integer chunk_index
        text content
        vector embedding_vector
        integer usage_count
        datetime returned_at
    }

Association Details

Document Associations:

class Document < ActiveRecord::Base
  # Content associations (STI)
  has_many :contents, class_name: "Content", dependent: :destroy
  has_many :text_contents, -> { where(type: "TextContent") }
  has_many :image_contents, -> { where(type: "ImageContent") }
  has_many :audio_contents, -> { where(type: "AudioContent") }

  # Embedding associations through content
  has_many :text_embeddings, through: :text_contents, source: :embeddings
  has_many :image_embeddings, through: :image_contents, source: :embeddings
  has_many :audio_embeddings, through: :audio_contents, source: :embeddings

  # Access all embeddings across content types
  def all_embeddings(content_type: nil)
    if content_type
      case content_type.to_s
      when 'text' then text_embeddings
      when 'image' then image_embeddings
      when 'audio' then audio_embeddings
      end
    else
      Embedding.where(
        embeddable_type: 'Ragdoll::Content',
        embeddable_id: contents.pluck(:id)
      )
    end
  end
end

Content Associations (STI Base):

class Content < ActiveRecord::Base
  # Parent document relationship
  belongs_to :document, class_name: "Document", foreign_key: "document_id"

  # Polymorphic embedding relationship
  has_many :embeddings, as: :embeddable, dependent: :destroy

  # STI subclasses: TextContent, ImageContent, AudioContent
end

Embedding Associations (Polymorphic):

class Embedding < ActiveRecord::Base
  # Polymorphic association - can belong to any content type
  belongs_to :embeddable, polymorphic: true

  # Access parent document through content
  def document
    embeddable&.document
  end

  # Scopes for different content types
  scope :text_embeddings, -> { where(embeddable_type: "Ragdoll::TextContent") }
  scope :image_embeddings, -> { where(embeddable_type: "Ragdoll::ImageContent") }
  scope :audio_embeddings, -> { where(embeddable_type: "Ragdoll::AudioContent") }
end

Database Constraints and Foreign Keys

Foreign Key Constraints:

-- Document to Content relationship
ALTER TABLE ragdoll_contents 
ADD CONSTRAINT fk_contents_document 
FOREIGN KEY (document_id) REFERENCES ragdoll_documents(id) 
ON DELETE CASCADE;

-- Polymorphic embedding relationships (enforced by application)
-- Note: PostgreSQL doesn't support polymorphic foreign key constraints
-- These are enforced through ActiveRecord validations and callbacks

Unique Constraints:

-- Ensure unique document locations
ALTER TABLE ragdoll_documents 
ADD CONSTRAINT unique_document_location UNIQUE (location);

-- Ensure unique chunk indexes per content
ALTER TABLE ragdoll_embeddings 
ADD CONSTRAINT unique_chunk_per_content 
UNIQUE (embeddable_type, embeddable_id, chunk_index);

Check Constraints:

-- Ensure valid document types
ALTER TABLE ragdoll_documents 
ADD CONSTRAINT valid_document_type 
CHECK (document_type IN ('text', 'image', 'audio', 'pdf', 'docx', 'html', 'markdown', 'mixed'));

-- Ensure valid processing status
ALTER TABLE ragdoll_documents 
ADD CONSTRAINT valid_status 
CHECK (status IN ('pending', 'processing', 'processed', 'error'));

-- Ensure valid content types for STI
ALTER TABLE ragdoll_contents 
ADD CONSTRAINT valid_content_type 
CHECK (type IN ('Ragdoll::TextContent', 
                'Ragdoll::ImageContent', 
                'Ragdoll::AudioContent'));

Index Strategy

Performance Indexes:

-- Document indexes
CREATE INDEX idx_documents_status ON ragdoll_documents(status);
CREATE INDEX idx_documents_type ON ragdoll_documents(document_type);
CREATE INDEX idx_documents_created_at ON ragdoll_documents(created_at);

-- Content indexes (STI table)
CREATE INDEX idx_contents_type ON ragdoll_contents(type);
CREATE INDEX idx_contents_document_id ON ragdoll_contents(document_id);
CREATE INDEX idx_contents_embedding_model ON ragdoll_contents(embedding_model);

-- Embedding indexes
CREATE INDEX idx_embeddings_embeddable ON ragdoll_embeddings(embeddable_type, embeddable_id);
CREATE INDEX idx_embeddings_usage_count ON ragdoll_embeddings(usage_count);
CREATE INDEX idx_embeddings_returned_at ON ragdoll_embeddings(returned_at);

-- pgvector similarity search index
CREATE INDEX idx_embeddings_vector_cosine ON ragdoll_embeddings 
USING ivfflat (embedding_vector vector_cosine_ops) WITH (lists = 100);
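
The `lists = 100` value above depends on table size. The pgvector README suggests, as a starting point, `rows / 1000` lists for up to 1M rows and `sqrt(rows)` above that. The sketch below encodes that heuristic; the `suggested_ivfflat_lists` helper is hypothetical and is not part of Ragdoll or pgvector.

```ruby
# Starting-point heuristic for the ivfflat `lists` parameter:
#   up to 1M rows -> rows / 1000
#   above 1M rows -> sqrt(rows)
def suggested_ivfflat_lists(row_count)
  lists = row_count <= 1_000_000 ? row_count / 1000 : Math.sqrt(row_count).round
  [lists, 1].max # never suggest fewer than one list
end
```

Under this heuristic, `lists = 100` corresponds to a table of roughly 100,000 embeddings; treat the result as a value to benchmark, not a guarantee.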

Full-Text Search Indexes:

-- Document full-text search
CREATE INDEX idx_documents_fulltext ON ragdoll_documents 
USING gin(to_tsvector('english', 
  COALESCE(title, '') || ' ' || 
  COALESCE(metadata->>'summary', '') || ' ' || 
  COALESCE(metadata->>'keywords', '') || ' ' || 
  COALESCE(metadata->>'description', '')
));

-- Content full-text search
CREATE INDEX idx_contents_fulltext ON ragdoll_contents 
USING gin(to_tsvector('english', COALESCE(content, '')));

Instance Methods

Document Methods

Content Retrieval Methods

# Dynamic content access based on primary content type
document.content                    
# Returns combined content from all content types
# For text: concatenated text from all text_contents
# For image: concatenated descriptions from all image_contents
# For audio: concatenated transcripts from all audio_contents

# Content type detection
document.content_types              # => ['text', 'image']
document.primary_content_type       # => 'text'
document.multi_modal?               # => true (if multiple content types)

# Content statistics
document.total_word_count           # Sum across all text content
document.total_character_count      # Sum across all text content
document.total_embedding_count      # Sum across all content types
document.embeddings_by_type         # => { text: 15, image: 3, audio: 0 }

# Content access by type
document.text_contents.each { |tc| puts tc.content }
document.image_contents.each { |ic| puts ic.content }  # AI descriptions
document.audio_contents.each { |ac| puts ac.content }  # Transcripts

Metadata Accessors

# LLM-generated metadata (stored in metadata JSON column)
document.metadata                   # Full metadata hash
document.description                # metadata['description']
document.description = "New desc"   # Updates metadata hash
document.classification             # metadata['classification']
document.classification = "technical"
document.tags                       # metadata['tags'] (array)
document.tags = ['ai', 'research']

# Metadata utility methods
document.has_summary?               # Check if summary exists
document.has_keywords?              # Check if keywords exist
document.keywords_array             # Parse keywords into array
document.add_keyword('machine-learning')
document.remove_keyword('outdated')

# File metadata (stored in file_metadata JSON column)
document.file_metadata              # File processing metadata
document.total_file_size            # Sum of all content file sizes
document.primary_file_type          # Document's primary file type

Processing Status Methods

# Status checking
document.processed?                 # status == 'processed'
document.status                     # 'pending', 'processing', 'processed', 'error'

# Content processing
document.process_content!           # Full processing pipeline:
                                    # 1. Generate embeddings for all content
                                    # 2. Generate LLM metadata
                                    # 3. Update status to 'processed'

document.generate_embeddings_for_all_content!
                                    # Generate embeddings only

document.generate_metadata!         # Generate LLM metadata only

# Processing validation
document.has_files?                 # Check if content has associated files
document.has_pending_content?       # Check for content awaiting processing

File Handling Methods

# File association (through content models)
document.has_files?                 # Any content has file data
document.total_file_size            # Sum of all file sizes
document.primary_file_type          # Main file type

# File metadata access
document.file_modified_at           # Source file modification time
document.location                   # Source file path or URL

# Content creation from files
document.content = "new text"       # Creates TextContent automatically
# For images/audio, use specific content models:
document.image_contents.create!(data: image_data, embedding_model: 'clip')

Content Methods

Embedding Generation

# Base Content methods (inherited by all content types)
content.generate_embeddings!        # Generate embeddings for this content
content.should_generate_embeddings? # Check if embeddings needed
content.content_for_embedding       # Text to use for embedding (overrideable)

# TextContent specific
text_content.generate_embeddings!   # Chunks text and generates embeddings
text_content.chunks                 # Array of content chunks with metadata
text_content.chunk_size             # Tokens per chunk
text_content.overlap                # Token overlap between chunks

# Embedding management
content.embeddings.count            # Number of embeddings
content.embedding_count             # Alias for embeddings.count
content.embeddings.destroy_all      # Remove all embeddings

Content Validation

# Base validations (all content types)
content.valid?                      # ActiveRecord validation
content.errors.full_messages        # Validation error messages

# Content-specific validations
text_content.content.present?       # TextContent requires content
image_content.data.present?         # ImageContent requires data

# Custom validation methods
content.validate_embedding_model    # Ensure model is supported
content.validate_content_size       # Check content size limits

Processing Callbacks

# Automatic processing callbacks
# after_create: Generate embeddings if content is ready
# after_update: Regenerate embeddings if content changed
# before_destroy: Clean up associated embeddings

# Manual callback triggering
content.run_callbacks(:create)      # Trigger create callbacks
content.run_callbacks(:update)      # Trigger update callbacks

# Callback status checking
content.embeddings_generated?       # Check if embeddings exist
content.metadata['embeddings_generated_at']  # Generation timestamp

Embedding Methods

# Instance-level similarity (compare with other embeddings)
embedding.similarity_to(other_embedding)     # Cosine similarity score
embedding.distance_to(other_embedding)       # Distance (1 - similarity)

# Class-level similarity search
Embedding.search_similar(
  query_embedding,
  limit: 20,
  threshold: 0.7,
  filters: {
    embeddable_type: 'Ragdoll::TextContent',
    document_type: 'pdf'
  }
)

# Specialized search methods
embedding.find_similar(limit: 10)           # Find similar embeddings
embedding.find_related_in_document(limit: 5) # Similar chunks in same document
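
The similarity and distance scores above follow the usual cosine definitions, with distance equal to 1 minus similarity. The sketch below computes both in plain Ruby for two arrays of numbers. The helper names are hypothetical; in Ragdoll the computation is presumably delegated to pgvector, not done in Ruby.

```ruby
# Cosine similarity: dot product divided by the product of the L2 norms.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero? # undefined for zero vectors; treated as 0 here

  dot / (norm_a * norm_b)
end

# Cosine distance as used in the search results: 1 - similarity.
def cosine_distance(a, b)
  1.0 - cosine_similarity(a, b)
end
```

Identical directions give a similarity of 1 (distance 0), orthogonal vectors give 0 (distance 1), and opposite directions give -1 (distance 2).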

Usage Tracking

# Usage analytics
embedding.mark_as_used!             # Increment usage_count, update returned_at
embedding.usage_count               # Number of times used in searches
embedding.returned_at               # Last usage timestamp
embedding.last_used_days_ago        # Days since last use

# Usage scoring
embedding.usage_score               # Calculated usage score for ranking
embedding.frequency_score           # Frequency-based component
embedding.recency_score             # Recency-based component

# Usage statistics
embedding.is_popular?               # usage_count > threshold
embedding.is_recent?                # used within recent timeframe
embedding.is_trending?              # increasing usage pattern
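
The source does not give the formula behind `usage_score`. Purely as an illustration of how a frequency component and a recency component might be combined, the sketch below uses a logarithmic frequency term and an exponential recency decay. The formula, the weights, and the 30-day half-life are all assumptions, not Ragdoll's actual scoring.

```ruby
# Hypothetical scoring, NOT Ragdoll's actual formula.

# Frequency component: grows slowly with usage_count (log scale).
def frequency_score(usage_count)
  Math.log(1 + usage_count)
end

# Recency component: 1.0 when just used, halves every `half_life_days`.
def recency_score(days_since_use, half_life_days: 30.0)
  0.5**(days_since_use / half_life_days)
end

# Weighted sum of the two components (weights are assumptions).
def usage_score(usage_count:, days_since_use:, frequency_weight: 0.7, recency_weight: 0.3)
  frequency_weight * frequency_score(usage_count) +
    recency_weight * recency_score(days_since_use)
end
```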

Analytics Methods

# Embedding metadata
embedding.embedding_dimensions      # Vector dimensionality
embedding.embedding_model           # Model used (via content relationship)
embedding.chunk_index               # Position within content

# Content access
embedding.embeddable                # Associated content object
embedding.document                  # Parent document (through content)
embedding.content_preview(length: 100)  # Truncated content preview

# Search result formatting
embedding.to_search_result(similarity: 0.85)
# Returns formatted hash for search APIs

# Performance metrics
embedding.vector_magnitude          # Vector magnitude (for normalization)
embedding.vector_norm               # L2 norm of the vector
embedding.vector_sparsity           # Percentage of zero values

Class Methods

Document Class Methods

Scopes and Query Methods

# Status-based scopes
Document.processed                  # WHERE status = 'processed'
Document.pending                    # WHERE status = 'pending'
Document.processing                 # WHERE status = 'processing'
Document.with_errors                # WHERE status = 'error'

# Content-based scopes
Document.by_type('pdf')             # WHERE document_type = 'pdf'
Document.multi_modal                # Documents with multiple content types
Document.text_only                  # Documents with only text content
Document.with_content               # Documents that have content models
Document.without_content            # Documents missing content models

# Time-based scopes
Document.recent                     # ORDER BY created_at DESC
Document.created_since(1.week.ago)  # WHERE created_at > ?
Document.modified_since(1.day.ago)  # WHERE file_modified_at > ?

# Advanced queries
Document.with_embeddings_count      # Includes embedding count
Document.by_content_length(min: 1000)  # Filter by content length
Document.by_file_size(max: 10.megabytes)  # Filter by file size

Search and Filtering

# PostgreSQL full-text search
Document.search_content(
  "machine learning algorithms",
  limit: 20
)

# Faceted search with metadata filters
Document.faceted_search(
  query: "neural networks",
  keywords: ["deep learning", "AI"],
  classification: "research_paper",
  tags: ["computer-science"],
  limit: 50
)

# Hybrid search (semantic + full-text)
Document.hybrid_search(
  "artificial intelligence applications",
  query_embedding: embedding_vector,
  semantic_weight: 0.7,
  text_weight: 0.3,
  limit: 25
)

# Metadata-based filtering
Document.with_classification('technical_manual')
Document.with_keywords(['api', 'documentation'])
Document.with_tags(['development', 'guide'])
Document.by_metadata_field('complexity', 'advanced')
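
The `semantic_weight` and `text_weight` parameters of the hybrid search suggest a weighted combination of two scores. The sketch below shows one plausible reading: a weighted sum, assuming both scores are already normalized to the 0..1 range. The `hybrid_score` helper and the sample candidates are hypothetical; how Ragdoll actually normalizes and combines scores is not stated in this reference.

```ruby
# Combine a semantic (vector) score and a full-text score into one ranking score.
def hybrid_score(semantic:, text:, semantic_weight: 0.7, text_weight: 0.3)
  semantic_weight * semantic + text_weight * text
end

# Hypothetical candidates with pre-normalized scores.
CANDIDATES = [
  { title: "A", semantic: 0.9, text: 0.1 },
  { title: "B", semantic: 0.5, text: 1.0 },
  { title: "C", semantic: 0.2, text: 0.4 }
].freeze

# Rank candidates by combined score, highest first.
RANKED = CANDIDATES
         .map { |c| c.merge(combined: hybrid_score(semantic: c[:semantic], text: c[:text])) }
         .sort_by { |c| -c[:combined] }
```

With the default 0.7/0.3 weights, a strong semantic match narrowly outranks a strong text match; shifting weight toward `text_weight` reverses that order.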

Statistics and Analytics

# Comprehensive statistics
Document.stats
# Returns:
# {
#   total_documents: 1250,
#   by_status: { processed: 1100, pending: 50, processing: 75, error: 25 },
#   by_type: { pdf: 600, docx: 300, text: 200, image: 100, mixed: 50 },
#   multi_modal_documents: 75,
#   total_text_contents: 1000,
#   total_image_contents: 125,
#   total_audio_contents: 25,
#   total_embeddings: { text: 15000, image: 500, audio: 100 },
#   storage_type: "activerecord_polymorphic"
# }

# Usage analytics
Document.popular(limit: 10)         # Most searched documents
Document.trending(timeframe: 1.week) # Recently popular documents
Document.usage_summary(period: 1.month)  # Usage statistics

# Content analysis
Document.average_word_count          # Average words per document
Document.total_storage_size          # Total storage used
Document.embedding_coverage          # Percentage with embeddings

# Performance metrics
Document.processing_time_stats       # Processing time statistics
Document.error_rate(period: 1.day)   # Error rate percentage
Document.throughput_stats            # Documents processed per hour
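
The `by_status` and `by_type` breakdowns returned by `Document.stats` are grouped counts. As a database-free illustration, the sketch below builds the same shape from an in-memory list using `Enumerable#tally`. The `DocumentRow` struct, the sample rows, and the `summarize` helper are hypothetical; the real method presumably runs grouped SQL queries.

```ruby
# Hypothetical stand-in for a document record.
DocumentRow = Struct.new(:status, :document_type)

ROWS = [
  DocumentRow.new("processed", "pdf"),
  DocumentRow.new("processed", "docx"),
  DocumentRow.new("pending", "pdf"),
  DocumentRow.new("error", "text")
].freeze

# Build a summary hash with total and grouped counts.
def summarize(rows)
  {
    total_documents: rows.size,
    by_status: rows.map(&:status).tally,
    by_type: rows.map(&:document_type).tally
  }
end

SUMMARY = summarize(ROWS)
```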

Batch Operations

# Batch processing
Document.process_pending!            # Process all pending documents
Document.regenerate_embeddings!(model: 'text-embedding-3-large')
Document.bulk_update_metadata(classification: 'archived')

# Batch import
Document.import_from_directory(
  '/path/to/documents',
  file_patterns: ['*.pdf', '*.docx'],
  recursive: true,
  batch_size: 100
)

# Batch cleanup
Document.cleanup_orphaned_content!   # Remove content without documents
Document.remove_old_embeddings!(older_than: 6.months)
Document.vacuum_unused_storage!      # Cleanup unused file storage

Content Class Methods

Content-Type Specific Queries

# Base Content class methods
Content.by_type('TextContent')       # Filter by STI type
Content.with_embeddings              # Content that has embeddings
Content.without_embeddings           # Content missing embeddings
Content.by_embedding_model('text-embedding-3-large')

# TextContent specific
TextContent.by_word_count(min: 500, max: 5000)
TextContent.by_character_count(min: 2000)
TextContent.with_long_content        # Content over threshold
TextContent.recently_processed       # Recently generated embeddings

# ImageContent specific
ImageContent.by_dimensions(min_width: 800, min_height: 600)
ImageContent.by_file_size(max: 5.megabytes)
ImageContent.with_descriptions       # Has AI-generated descriptions
ImageContent.by_format(['jpg', 'png'])

# AudioContent specific (planned)
AudioContent.by_duration(min: 30.seconds, max: 10.minutes)
AudioContent.by_sample_rate(44100)
AudioContent.with_transcripts        # Has speech-to-text transcripts

Content Statistics

# TextContent statistics
TextContent.stats
# Returns:
# {
#   total_text_contents: 1000,
#   by_model: { 'text-embedding-3-large': 600, 'text-embedding-3-small': 400 },
#   total_embeddings: 15000,
#   average_word_count: 1250,
#   average_chunk_size: 1000
# }

# Processing statistics
Content.processing_stats             # Embedding generation statistics
Content.model_usage_stats            # Usage by embedding model
Content.error_rate_by_type           # Error rates by content type

Embedding Class Methods

Advanced Search Methods

# Vector similarity search with filters
Embedding.search_similar(
  query_embedding,
  limit: 20,
  threshold: 0.75,
  filters: {
    embeddable_type: 'Ragdoll::TextContent',
    embedding_model: 'text-embedding-3-large',
    document_type: 'pdf',
    created_after: 1.month.ago
  }
)

# Batch similarity search
Embedding.batch_search_similar(
  [embedding1, embedding2, embedding3],
  limit: 10,
  aggregate_results: true
)

# Specialized search methods
Embedding.find_duplicates(threshold: 0.95)  # Near-duplicate detection
Embedding.find_outliers(threshold: 0.3)     # Low-similarity outliers
Embedding.cluster_similar(max_clusters: 10) # K-means clustering

Usage Analytics

# Usage tracking
Embedding.most_used(limit: 100)     # Highest usage_count
Embedding.recently_used(since: 1.hour.ago)
Embedding.trending(period: 1.day)   # Increasing usage pattern
Embedding.popular_content_types     # Usage by content type

# Performance analytics
Embedding.search_performance_stats  # Search timing statistics
Embedding.model_performance_comparison  # Compare model effectiveness
Embedding.quality_metrics           # Embedding quality assessment

# Cache optimization
Embedding.precompute_popular!       # Cache popular embeddings
Embedding.optimize_indexes!         # Rebuild vector indexes

Batch Operations

# Batch embedding operations
Embedding.regenerate_for_model!(
  old_model: 'text-embedding-ada-002',
  new_model: 'text-embedding-3-large'
)

Embedding.update_usage_analytics!   # Recalculate usage scores
Embedding.cleanup_orphaned!         # Remove embeddings without content
Embedding.normalize_vectors!        # L2 normalize all vectors

# Database maintenance
Embedding.rebuild_vector_indexes!   # Rebuild pgvector indexes
Embedding.vacuum_embeddings_table!  # PostgreSQL VACUUM operation
Embedding.analyze_vector_distribution!  # Update query planner statistics

Model Validations

Ragdoll validates required fields, value formats, and metadata structure at the model level to protect data integrity:

Document Model Validations

Required Fields

class Document < ActiveRecord::Base
  validates :location, presence: true
  validates :title, presence: true
  validates :document_type, presence: true
  validates :status, presence: true
  validates :file_modified_at, presence: true
end

Format Validations

# Document type validation
validates :document_type, 
  inclusion: { 
    in: %w[text image audio pdf docx html markdown mixed],
    message: "must be a valid document type"
  }

# Status validation
validates :status,
  inclusion: {
    in: %w[pending processing processed error],
    message: "must be a valid processing status"
  }

# Location format validation
validates :location, format: {
  with: /\A(https?:\/\/|\/).*\z/,
  message: "must be a valid URL or absolute file path"
}

# Metadata JSON validation
validate :validate_metadata_structure

private

def validate_metadata_structure
  return unless metadata.present?

  # Validate metadata against document type schema
  schema_errors = MetadataSchemas.validate_metadata(document_type, metadata)
  schema_errors.each { |error| errors.add(:metadata, error) }
end
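
The location pattern above accepts only strings that start with `http://`, `https://`, or `/`. The following sketch runs that same regular expression against a few sample strings. The sample values are made up for illustration.

```ruby
# Same pattern as the location format validation above.
LOCATION_FORMAT = /\A(https?:\/\/|\/).*\z/

# Illustrative sample locations (hypothetical values).
samples = [
  "https://example.com/docs/guide.pdf",
  "/var/data/report.txt",
  "./relative/path.md",
  "ftp://example.com/file.txt"
]

results = samples.to_h { |s| [s, s.match?(LOCATION_FORMAT)] }
```

Relative paths and non-HTTP schemes are rejected by the pattern, so a relative path has to be expanded to an absolute one before validation runs.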

Custom Validators

# Custom location validator
validate :validate_location_accessibility

def validate_location_accessibility
  return unless location.present?

  # For file paths, check if file exists and is readable
  if location.start_with?('/')
    unless File.exist?(location) && File.readable?(location)
      errors.add(:location, "file does not exist or is not readable")
    end
  end

  # For URLs, validate format more strictly
  if location.start_with?('http')
    begin
      uri = URI.parse(location)
      unless uri.is_a?(URI::HTTP) || uri.is_a?(URI::HTTPS)
        errors.add(:location, "must be a valid HTTP or HTTPS URL")
      end
    rescue URI::InvalidURIError
      errors.add(:location, "is not a valid URL")
    end
  end
end

# File size validation
validate :validate_reasonable_file_size

def validate_reasonable_file_size
  if location.present? && File.exist?(location)
    file_size = File.size(location)
    max_size = 100.megabytes  # Configurable limit

    if file_size > max_size
      errors.add(:location, "file size (#{file_size} bytes) exceeds maximum (#{max_size} bytes)")
    end
  end
end
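
The URL branch of the location validator can be exercised outside a model. The sketch below, using only Ruby's standard `uri` library, returns the error message the validator would add, or `nil` when the URL is acceptable. `http_url_error` is a hypothetical helper name.

```ruby
require "uri"

# Hypothetical helper mirroring the URL branch of the validator above.
# Returns nil for an acceptable URL, otherwise the error message.
def http_url_error(location)
  uri = URI.parse(location)
  return nil if uri.is_a?(URI::HTTP)  # URI::HTTPS is a subclass of URI::HTTP

  "must be a valid HTTP or HTTPS URL"
rescue URI::InvalidURIError
  "is not a valid URL"
end

ok        = http_url_error("https://example.com/docs/guide.pdf")
wrong     = http_url_error("httpx://example.com/file")
malformed = http_url_error("http://exa mple.com/file")
```

The `httpx://` case shows why the stricter check is useful: the string starts with `http`, so it enters this branch, but it parses as a generic URI rather than an HTTP one.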

Content Model Validations

Base Content Validations

class Content < ActiveRecord::Base
  validates :type, presence: true
  validates :embedding_model, presence: true
  validates :document_id, presence: true

  # Ensure valid STI type
  validates :type, inclusion: {
    in: %w[
      Ragdoll::TextContent
      Ragdoll::ImageContent
      Ragdoll::AudioContent
    ],
    message: "must be a valid content type"
  }

  # Validate embedding model exists
  validate :validate_embedding_model_exists

  private

  def validate_embedding_model_exists
    return unless embedding_model.present?

    valid_models = Ragdoll.config.embedding_config.keys.map(&:to_s)
    unless valid_models.include?(embedding_model)
      errors.add(:embedding_model, "'#{embedding_model}' is not a configured embedding model")
    end
  end
end

TextContent Specific Validations

class TextContent < Content
  validates :content, presence: true
  validates :content, length: {
    minimum: 10,
    maximum: 1_000_000,  # roughly 1 MB of single-byte text
    message: "must be between 10 and 1,000,000 characters"
  }

  # Validate chunk configuration
  validate :validate_chunk_configuration

  private

  def validate_chunk_configuration
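    chunk_size_val = chunk_size
    overlap_val = overlap

    # Skip the range checks when either value is unset; comparing nil would raise
    return if chunk_size_val.nil? || overlap_val.nil?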

    if chunk_size_val <= 0
      errors.add(:chunk_size, "must be greater than 0")
    end

    if overlap_val < 0
      errors.add(:overlap, "cannot be negative")
    end

    if overlap_val >= chunk_size_val
      errors.add(:overlap, "must be less than chunk_size")
    end
  end
end
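
The rule that `overlap` must be smaller than `chunk_size` follows from how sliding-window chunking advances: each step moves forward by `chunk_size - overlap` characters, so the difference has to be at least 1. The sketch below is a simplified character-based chunker written for this illustration. `chunk_text` is a hypothetical helper, not Ragdoll's chunking code.

```ruby
# Hypothetical helper; a simplified character-based sliding-window chunker.
def chunk_text(text, chunk_size:, overlap:)
  raise ArgumentError, "chunk_size must be greater than 0" unless chunk_size.positive?
  raise ArgumentError, "overlap cannot be negative" if overlap.negative?
  raise ArgumentError, "overlap must be less than chunk_size" if overlap >= chunk_size

  return [] if text.empty?

  step = chunk_size - overlap  # at least 1 because of the checks above
  chunks = []
  start = 0
  loop do
    chunks << text[start, chunk_size]
    break if start + chunk_size >= text.length  # last window reached the end

    start += step
  end
  chunks
end

chunks = chunk_text("abcdefghij", chunk_size: 4, overlap: 1)
```

With `overlap` equal to `chunk_size`, the step would be zero and the window would never advance, which is the failure the validation guards against.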

ImageContent Specific Validations

class ImageContent < Content
  validates :data, presence: true

  # Validate image metadata
  validate :validate_image_metadata

  private

  def validate_image_metadata
    return unless metadata.present?

    # Validate dimensions if present
    if metadata['width'] && metadata['height']
      width = metadata['width'].to_i
      height = metadata['height'].to_i

      if width <= 0 || height <= 0
        errors.add(:metadata, "image dimensions must be positive integers")
      end

      # Reasonable size limits
      if width > 50000 || height > 50000
        errors.add(:metadata, "image dimensions are unreasonably large")
      end
    end

    # Validate file format
    if metadata['file_type']
      valid_formats = %w[jpg jpeg png gif bmp webp svg ico tiff tif]
      unless valid_formats.include?(metadata['file_type'].downcase)
        errors.add(:metadata, "unsupported image format: #{metadata['file_type']}")
      end
    end
  end
end

Embedding Model Validations

Vector and Content Validations

class Embedding < ActiveRecord::Base
  validates :embeddable_id, presence: true
  validates :embeddable_type, presence: true
  validates :chunk_index, presence: true
  validates :content, presence: true
  validates :embedding_vector, presence: true

  # Unique chunk index per content
  validates :chunk_index, uniqueness: {
    scope: [:embeddable_id, :embeddable_type],
    message: "must be unique within the same content"
  }

  # Vector dimension validation
  validate :validate_embedding_dimensions

  # Content length validation
  validates :content, length: {
    minimum: 1,
    maximum: 10000,  # Reasonable chunk size limit
    message: "must be between 1 and 10,000 characters"
  }

  # Usage count validation
  validates :usage_count, numericality: {
    greater_than_or_equal_to: 0,
    message: "cannot be negative"
  }

  private

  def validate_embedding_dimensions
    return unless embedding_vector.present?

    # Get expected dimensions for the model
    expected_dimensions = get_expected_dimensions
    actual_dimensions = embedding_vector.length

    if actual_dimensions != expected_dimensions
      errors.add(
        :embedding_vector,
        "has #{actual_dimensions} dimensions, expected #{expected_dimensions}"
      )
    end

    # Validate vector values
    if embedding_vector.any? { |val| !val.is_a?(Numeric) }
      errors.add(:embedding_vector, "must contain only numeric values")
    end

    # Check for NaN or infinite values (only Float defines nan?, so guard by type)
    if embedding_vector.any? { |val| val.is_a?(Float) && (val.nan? || val.infinite?) }
      errors.add(:embedding_vector, "cannot contain NaN or infinite values")
    end
  end

  def get_expected_dimensions
    model_name = embeddable&.embedding_model
    return 1536 unless model_name  # Default OpenAI dimension

    # Look up dimensions from configuration
    config = Ragdoll.config.embedding_config
    config.dig(model_name.to_sym, :dimensions) || 1536
  end
end
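
The three vector checks (dimension count, numeric values, finite values) can be tried in isolation. The sketch below collects the same error messages into an array without ActiveRecord. `vector_errors` is a hypothetical helper, and the expected dimension is passed in rather than looked up from `Ragdoll.config`.

```ruby
# Hypothetical helper; returns the problems found in a candidate vector.
def vector_errors(vector, expected_dimensions:)
  errors = []

  if vector.length != expected_dimensions
    errors << "has #{vector.length} dimensions, expected #{expected_dimensions}"
  end

  errors << "must contain only numeric values" unless vector.all?(Numeric)

  # Only Float defines nan?, so check the type before calling it.
  if vector.any? { |v| v.is_a?(Float) && (v.nan? || v.infinite?) }
    errors << "cannot contain NaN or infinite values"
  end

  errors
end

good    = vector_errors([0.1, 0.2, 0.3], expected_dimensions: 3)
bad_nan = vector_errors([1, Float::NAN], expected_dimensions: 3)
bad_str = vector_errors(["a", 1.0], expected_dimensions: 2)
```

Checking the type before calling `nan?` matters because a vector may contain Integer values or, in the failure case, non-numeric values that do not respond to it.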

Error Handling

Validation Error Processing

# Custom error handling for validation failures
class ValidationErrorHandler
  def self.handle_document_errors(document)
    return { success: true } if document.valid?

    {
      success: false,
      errors: {
        validation_errors: document.errors.full_messages,
        field_errors: document.errors.messages,
        error_count: document.errors.count
      }
    }
  end

  def self.handle_content_errors(content)
    return { success: true } if content.valid?

    {
      success: false,
      content_type: content.class.name,
      errors: {
        validation_errors: content.errors.full_messages,
        field_errors: content.errors.messages,
        suggested_fixes: generate_fix_suggestions(content.errors)
      }
    }
  end

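  def self.generate_fix_suggestions(errors)
    suggestions = []

    # errors.messages maps each attribute to an array of message strings
    errors.messages.each do |field, messages|
      case field
      when :content
        # Match Rails' default "is too short" text and the custom length message
        if messages.any? { |m| m.include?('too short') || m.include?('must be between') }
          suggestions << "Ensure content is between 10 and 1,000,000 characters long"
        end
      when :embedding_model
        suggestions << "Use a configured embedding model: #{available_models.join(', ')}"
      when :chunk_size
        suggestions << "Set chunk_size to a positive integer (recommended: 1000)"
      end
    end

    suggestions
  end

  def self.available_models
    Ragdoll.config.embedding_config.keys.map(&:to_s)
  end

  # A bare `private` does not apply to methods defined with `def self.`
  private_class_method :generate_fix_suggestions, :available_models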
end
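
A caller might turn the hashes returned by the handler into log or response text. The sketch below assumes only the hash shape shown above. `format_validation_result` is a hypothetical helper, and the sample hash is hand-written rather than produced by a real model.

```ruby
# Hypothetical consumer of the hashes returned by ValidationErrorHandler.
def format_validation_result(result)
  return "valid" if result[:success]

  errors = result.fetch(:errors, {})
  lines  = errors.fetch(:validation_errors, []).map { |m| "error: #{m}" }
  lines += errors.fetch(:suggested_fixes, []).map { |s| "hint: #{s}" }
  lines.join("\n")
end

# Hand-written sample shaped like the handle_content_errors return value.
result = {
  success: false,
  content_type: "Ragdoll::TextContent",
  errors: {
    validation_errors: ["Chunk size must be greater than 0"],
    field_errors: { chunk_size: ["must be greater than 0"] },
    suggested_fixes: ["Set chunk_size to a positive integer (recommended: 1000)"]
  }
}

report = format_validation_result(result)
```

Using `fetch` with defaults lets the same formatter handle the document result, which has no `suggested_fixes` key.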

Validation Callbacks

# Before validation callbacks for cleanup
class Document < ActiveRecord::Base
  before_validation :normalize_location
  before_validation :set_default_file_modified_at
  before_validation :sanitize_metadata

  private

  def normalize_location
    return unless location.present?

    # Convert relative paths to absolute paths
    if location.start_with?('./')
      self.location = File.expand_path(location)
    end

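    # Normalize URL protocols. Only the scheme is rewritten: paths and
    # query strings can be case-sensitive, so the rest of the URL is kept as-is.
    if location.match?(/\Ahttps?:\/\//i)
      self.location = location.sub(/\Ahttps?:/i, 'https:')
    end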
  end

  def sanitize_metadata
    return unless metadata.present?

    # Remove nil values and empty strings
    self.metadata = metadata.reject { |k, v| v.nil? || v == '' }

    # Ensure arrays are actually arrays
    ['tags', 'keywords'].each do |field|
      if metadata[field].is_a?(String)
        metadata[field] = metadata[field].split(',').map(&:strip)
      end
    end
  end
end
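
The metadata cleanup step is easy to check on a plain Hash. The sketch below applies the same two rules (drop nil and empty values, split comma-separated `tags` and `keywords` strings) outside a model. `sanitize_metadata` here is a stand-alone hypothetical function that returns a new Hash instead of assigning to an attribute, and the sample data is made up.

```ruby
# Hypothetical stand-alone version of the metadata cleanup step.
def sanitize_metadata(metadata)
  # Remove nil values and empty strings
  cleaned = metadata.reject { |_key, value| value.nil? || value == "" }

  # Convert comma-separated strings into arrays of trimmed values
  %w[tags keywords].each do |field|
    next unless cleaned[field].is_a?(String)

    cleaned[field] = cleaned[field].split(",").map(&:strip)
  end

  cleaned
end

cleaned = sanitize_metadata({
  "title"   => "Quarterly report",
  "summary" => "",
  "author"  => nil,
  "tags"    => "finance, q3 ,report"
})
```

The keys are strings because values read back from a JSON column come with string keys, so symbol keys such as `:tags` would not be picked up by this step.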

This document is part of the Ragdoll documentation suite. For immediate help, see the Quick Start Guide or API Reference.