Models Reference

Ragdoll's ActiveRecord models use Single Table Inheritance (STI) for multi-modal content storage and polymorphic associations to attach embeddings to any content type.

Detailed API Documentation

For complete class and method documentation, see the Ruby API Documentation (RDoc), which provides a detailed technical reference for all Ragdoll models and their methods.

ActiveRecord Models and Relationships

The model architecture provides:

Single Table Inheritance (STI): Content models (TextContent, ImageContent, AudioContent) share a single table
Polymorphic Associations: Embeddings can belong to any content type through polymorphic relationships
PostgreSQL Optimizations: Native JSON columns, full-text search indexes, and pgvector integration
Rich Metadata Support: Flexible metadata storage with validation and type-specific schemas
Usage Analytics: Built-in tracking for search optimization and performance monitoring
Comprehensive Validations: Data integrity through extensive validation rules and callbacks

Core Models

Document Model

Class: Ragdoll::Document

Table: ragdoll_documents

Primary Attributes:

# Core document identification
id                :bigint           # Primary key
location          :string           # Source location (file path, URL, identifier)
title             :string           # Human-readable document title
document_type     :string           # Document format type (text, image, audio, pdf, etc.)
status            :string           # Processing status (pending, processing, processed, error)
file_modified_at  :datetime         # Source file modification timestamp

# Metadata storage
metadata          :json             # LLM-generated structured metadata
file_metadata     :json             # File properties and processing metadata

# Timestamps
created_at        :datetime
updated_at        :datetime

Multi-Modal Content Associations:

# STI-based content relationships
has_many :contents, class_name: "Ragdoll::Content", dependent: :destroy
has_many :text_contents, class_name: "Ragdoll::TextContent"
has_many :image_contents, class_name: "Ragdoll::ImageContent"
has_many :audio_contents, class_name: "Ragdoll::AudioContent"

# Embedding relationships through content
has_many :text_embeddings, through: :text_contents, source: :embeddings
has_many :image_embeddings, through: :image_contents, source: :embeddings
has_many :audio_embeddings, through: :audio_contents, source: :embeddings

Key Instance Methods:

# Content access
document.content                    # Returns combined content from all content types
document.content = "new content"    # Creates appropriate content model

# Multi-modal detection
document.multi_modal?               # True if document has multiple content types
document.content_types              # Array of content types: ['text', 'image', 'audio']
document.primary_content_type       # Primary content type for the document

# Statistics
document.total_word_count           # Sum of words across all text content
document.total_character_count      # Sum of characters across all text content
document.total_embedding_count      # Total embeddings across all content types
document.embeddings_by_type         # Hash: { text: 10, image: 5, audio: 2 }

# Processing
document.processed?                 # True if status == 'processed'
document.process_content!           # Generate embeddings and metadata
document.generate_metadata!         # Generate LLM-based metadata
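
As a rough illustration of what the detection and statistics methods above report, the sketch below derives the same values from plain hashes. The hash layout (`:kind`, `:text`, `:embedding_count`) and the free-standing helper functions are assumptions made for this example; they are not Ragdoll's implementation, which works on ActiveRecord associations.

```ruby
# Hypothetical in-memory stand-in for a document's content records.
contents = [
  { kind: "text",  text: "Machine learning in practice",  embedding_count: 10 },
  { kind: "image", text: "A diagram of a neural network", embedding_count: 5 }
]

# Distinct content kinds present, in first-seen order.
def content_types(contents)
  contents.map { |c| c[:kind] }.uniq
end

# A document is multi-modal when it carries more than one content kind.
def multi_modal?(contents)
  content_types(contents).length > 1
end

# Embedding totals keyed by kind; kinds with no content report 0.
def embeddings_by_type(contents)
  totals = { text: 0, image: 0, audio: 0 }
  contents.each { |c| totals[c[:kind].to_sym] += c[:embedding_count] }
  totals
end

# Word count summed over text content only.
def total_word_count(contents)
  contents.select { |c| c[:kind] == "text" }.sum { |c| c[:text].split.length }
end

puts content_types(contents).inspect
puts multi_modal?(contents)
puts embeddings_by_type(contents).inspect
puts total_word_count(contents)
```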

Search and Query Methods:

# PostgreSQL full-text search
Document.search_content("machine learning")

# Keywords search (array overlap - finds documents with any matching keywords)
Document.search_by_keywords(['machine', 'learning', 'ai'])
# Returns documents with keywords_match_count attribute

# Keywords search (array contains - finds documents with ALL keywords)
Document.search_by_keywords_all(['python', 'programming'])
# Returns documents with total_keywords_count attribute

# Faceted search with metadata filters
Document.faceted_search(
  query: "AI research",
  keywords: ["neural networks"],
  classification: "academic_paper",
  tags: ["machine-learning"]
)

# Hybrid search combining semantic and text search
Document.hybrid_search(
  "deep learning applications",
  query_embedding: embedding_vector,
  semantic_weight: 0.7,
  text_weight: 0.3
)
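
The difference between the two keyword searches ("any" versus "all") can be sketched with plain arrays. The in-memory records and the two helper functions below are hypothetical and only mirror the semantics described in the comments above; the real methods are expected to run as PostgreSQL queries.

```ruby
# Hypothetical in-memory records standing in for documents with keywords.
docs = [
  { title: "Intro to ML",     keywords: %w[machine learning python] },
  { title: "Python Basics",   keywords: %w[python programming] },
  { title: "Cooking at Home", keywords: %w[recipes kitchen] }
]

# "Any" semantics: keep documents sharing at least one keyword with the
# query, ranked by how many keywords matched.
def search_by_keywords(docs, query)
  docs
    .map { |d| d.merge(keywords_match_count: (d[:keywords] & query).length) }
    .select { |d| d[:keywords_match_count].positive? }
    .sort_by { |d| -d[:keywords_match_count] }
end

# "All" semantics: keep documents whose keywords include every query keyword.
def search_by_keywords_all(docs, query)
  docs
    .select { |d| (query - d[:keywords]).empty? }
    .map { |d| d.merge(total_keywords_count: d[:keywords].length) }
end

any_hits = search_by_keywords(docs, %w[machine learning ai])
all_hits = search_by_keywords_all(docs, %w[python programming])

any_hits.each { |d| puts "#{d[:title]} (#{d[:keywords_match_count]} matches)" }
all_hits.each { |d| puts "#{d[:title]} (#{d[:total_keywords_count]} keywords)" }
```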

Content Models (STI Architecture)

Base Class: Ragdoll::Content

Table: ragdoll_contents (shared by all content types)

STI Classes:

Ragdoll::TextContent
Ragdoll::ImageContent
Ragdoll::AudioContent

Shared Attributes:

id              :bigint           # Primary key
type            :string           # STI discriminator (Ragdoll::TextContent, Ragdoll::ImageContent, etc.)
document_id     :bigint           # Foreign key to document
embedding_model :string           # Model used for embeddings
content         :text             # Content text (text, description, transcript)
data            :text             # Raw file data or metadata
metadata        :json             # Content-specific metadata
duration        :float            # Audio duration (audio content only)
sample_rate     :integer          # Audio sample rate (audio content only)
created_at      :datetime
updated_at      :datetime

Polymorphic Relationships:

# Each content model belongs to a document
belongs_to :document, class_name: "Ragdoll::Document"

# Each content model can have many embeddings
has_many :embeddings, class_name: "Ragdoll::Embedding", as: :embeddable

TextContent Model

Specific Validations:

validates :content, presence: true  # Text content is required

Text-Specific Methods:

# Content analysis
text_content.word_count             # Number of words in content
text_content.character_count        # Number of characters in content
text_content.line_count             # Number of lines (from metadata)

# Chunking configuration
text_content.chunk_size             # Tokens per chunk (default: 1000)
text_content.chunk_size = 1500      # Set custom chunk size
text_content.overlap                # Token overlap (default: 200)
text_content.overlap = 300          # Set custom overlap

# Content processing
text_content.chunks                 # Array of content chunks with positions
text_content.generate_embeddings!   # Generate embeddings for all chunks

Text Processing Example:

text_content = document.text_contents.create!(
  content: "Large document text...",
  embedding_model: "text-embedding-3-large",
  metadata: {
    encoding: "UTF-8",
    line_count: 150,
    chunk_size: 1000,
    overlap: 200
  }
)

# Generate embeddings automatically
text_content.generate_embeddings!

# Access generated chunks and embeddings
text_content.chunks.each do |chunk|
  puts "Chunk #{chunk[:chunk_index]}: #{chunk[:content][0..50]}..."
end

text_content.embeddings.each do |embedding|
  puts "Embedding #{embedding.chunk_index}: #{embedding.embedding_vector.length} dimensions"
end
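
To make the interaction between `chunk_size` and `overlap` concrete, here is a minimal sketch of overlapping chunking. It approximates tokens with whitespace-separated words, and the `chunk_text` helper and the `:start_word` key are assumptions of this example rather than Ragdoll's actual chunker.

```ruby
# Split text into overlapping chunks. Tokens are approximated by
# whitespace-separated words, which is an assumption of this sketch.
def chunk_text(text, chunk_size: 1000, overlap: 200)
  raise ArgumentError, "overlap must be smaller than chunk_size" if overlap >= chunk_size

  words = text.split
  step = chunk_size - overlap # how far the window advances each time
  chunks = []
  start = 0
  while start < words.length
    slice = words[start, chunk_size]
    chunks << { chunk_index: chunks.length, start_word: start, content: slice.join(" ") }
    break if start + chunk_size >= words.length # last window reached the end

    start += step
  end
  chunks
end

text = (1..10).map { |i| "w#{i}" }.join(" ")
chunks = chunk_text(text, chunk_size: 4, overlap: 1)
chunks.each { |c| puts "Chunk #{c[:chunk_index]}: #{c[:content]}" }
```

Each chunk starts `chunk_size - overlap` words after the previous one, so consecutive chunks share exactly `overlap` words.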

ImageContent Model

Image-Specific Attributes:

# content field stores AI-generated description
# data field stores image binary data or file reference
# metadata stores image properties

Image-Specific Methods:

# Image properties (from metadata)
image_content.width                 # Image width in pixels
image_content.height                # Image height in pixels
image_content.file_size             # File size in bytes
image_content.format                # Image format (jpg, png, etc.)

# AI-generated content
image_content.description           # AI-generated description (stored in content field)
image_content.objects_detected      # Detected objects (from metadata)
image_content.scene_type            # Scene classification (from metadata)

AudioContent Model (Planned)

Audio-Specific Attributes:

duration        :float            # Audio duration in seconds
sample_rate     :integer          # Sample rate in Hz
# content field stores transcript
# data field stores audio binary data
# metadata stores audio properties and timestamps

Audio-Specific Methods (Planned):

# Audio properties
audio_content.duration_formatted    # "5:42" format
audio_content.bitrate               # Audio bitrate (from metadata)
audio_content.channels              # Number of audio channels

# Transcript and timestamps
audio_content.transcript            # Full transcript (stored in content field)
audio_content.timestamps            # Word-level timestamps (from metadata)
audio_content.speakers             # Speaker identification (from metadata)
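
Since the audio methods are planned rather than implemented, the following is only a sketch of how a `duration_formatted` helper could turn the `duration` column (seconds) into the "5:42" form shown above. The switch to an `h:mm:ss` form for durations of an hour or more is an assumption of this example.

```ruby
# Format a duration in seconds as "m:ss", or "h:mm:ss" once it reaches
# an hour (the hour form is an assumption of this sketch).
def duration_formatted(seconds)
  total = seconds.round
  hours, rest = total.divmod(3600)
  minutes, secs = rest.divmod(60)
  if hours.positive?
    format("%d:%02d:%02d", hours, minutes, secs)
  else
    format("%d:%02d", minutes, secs)
  end
end

puts duration_formatted(342.0)
puts duration_formatted(3725.4)
```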

Embedding Model

Class: Ragdoll::Embedding

Table: ragdoll_embeddings

Attributes:

id                :bigint           # Primary key
embeddable_type   :string           # Polymorphic type (Content class name)
embeddable_id     :bigint           # Polymorphic ID (Content record ID)
chunk_index       :integer          # Order within content
content           :text             # Original text that was embedded
embedding_vector  :vector(1536)     # pgvector column (configurable dimensions)
usage_count       :integer          # Number of times used in searches
returned_at       :datetime         # Last usage timestamp
created_at        :datetime
updated_at        :datetime

Polymorphic Association:

# Belongs to any content type through polymorphic association
belongs_to :embeddable, polymorphic: true

# Can belong to TextContent, ImageContent, or AudioContent
embedding.embeddable                # Returns the associated content object
embedding.embeddable_type           # "Ragdoll::TextContent"
embedding.embeddable_id             # Content record ID

Vector Search Methods:

# pgvector similarity search
Embedding.search_similar(
  query_embedding,
  limit: 20,
  threshold: 0.7,
  filters: {
    embeddable_type: "Ragdoll::TextContent",
    document_type: "pdf",
    embedding_model: "text-embedding-3-large"
  }
)

# Usage analytics
embedding.mark_as_used!             # Increment usage_count and update returned_at
embedding.usage_score               # Calculated usage score for ranking
embedding.embedding_dimensions      # Number of vector dimensions

Search Result Format:

[
  {
    embedding_id: "123",
    embeddable_id: "456",
    embeddable_type: "Ragdoll::TextContent",
    document_id: "789",
    document_title: "AI Research Paper",
    document_location: "/path/to/document.pdf",
    content: "Machine learning algorithms...",
    similarity: 0.85,
    distance: 0.15,
    chunk_index: 5,
    embedding_dimensions: 1536,
    embedding_model: "text-embedding-3-large",
    usage_count: 12,
    returned_at: "2025-01-15T10:30:00Z",
    combined_score: 0.92
  }
]

Model Relationships

A document has many content records, which share one STI table, and each content record has many embeddings through a polymorphic association.

Primary Relationships

erDiagram
    Document ||--o{ TextContent : "has many"
    Document ||--o{ ImageContent : "has many"
    Document ||--o{ AudioContent : "has many"
    TextContent ||--o{ Embedding : "has many (polymorphic)"
    ImageContent ||--o{ Embedding : "has many (polymorphic)"
    AudioContent ||--o{ Embedding : "has many (polymorphic)"

    Document {
        bigint id PK
        string location
        string title
        string document_type
        string status
        json metadata
        json file_metadata
        datetime file_modified_at
    }

    TextContent {
        bigint id PK
        string type "'TextContent'"
        bigint document_id FK
        string embedding_model
        text content
        text data
        json metadata
    }

    ImageContent {
        bigint id PK
        string type "'ImageContent'"
        bigint document_id FK
        string embedding_model
        text content "AI description"
        text data "Image data"
        json metadata
    }

    AudioContent {
        bigint id PK
        string type "'AudioContent'"
        bigint document_id FK
        string embedding_model
        text content "Transcript"
        text data "Audio data"
        json metadata
        float duration
        integer sample_rate
    }

    Embedding {
        bigint id PK
        string embeddable_type FK
        bigint embeddable_id FK
        integer chunk_index
        text content
        vector embedding_vector
        integer usage_count
        datetime returned_at
    }

Association Details

Document Associations:

class Document < ActiveRecord::Base
  # Content associations (STI)
  has_many :contents, class_name: "Content", dependent: :destroy
  has_many :text_contents, class_name: "Ragdoll::TextContent"
  has_many :image_contents, class_name: "Ragdoll::ImageContent"
  has_many :audio_contents, class_name: "Ragdoll::AudioContent"

  # Embedding associations through content
  has_many :text_embeddings, through: :text_contents, source: :embeddings
  has_many :image_embeddings, through: :image_contents, source: :embeddings
  has_many :audio_embeddings, through: :audio_contents, source: :embeddings

  # Access all embeddings across content types
  def all_embeddings(content_type: nil)
    if content_type
      case content_type.to_s
      when 'text' then text_embeddings
      when 'image' then image_embeddings
      when 'audio' then audio_embeddings
      end
    else
      Embedding.where(
        embeddable_type: 'Ragdoll::Content',
        embeddable_id: contents.pluck(:id)
      )
    end
  end
end

Content Associations (STI Base):

class Content < ActiveRecord::Base
  # Parent document relationship
  belongs_to :document, class_name: "Document", foreign_key: "document_id"

  # Polymorphic embedding relationship
  has_many :embeddings, as: :embeddable, dependent: :destroy

  # STI subclasses: TextContent, ImageContent, AudioContent
end

Embedding Associations (Polymorphic):

class Embedding < ActiveRecord::Base
  # Polymorphic association - can belong to any content type
  belongs_to :embeddable, polymorphic: true

  # Access parent document through content
  def document
    embeddable&.document
  end

  # Scopes for different content types
  scope :text_embeddings, -> { where(embeddable_type: "Ragdoll::TextContent") }
  scope :image_embeddings, -> { where(embeddable_type: "Ragdoll::ImageContent") }
  scope :audio_embeddings, -> { where(embeddable_type: "Ragdoll::AudioContent") }
end

Database Constraints and Foreign Keys

Foreign Key Constraints:

-- Document to Content relationship
ALTER TABLE ragdoll_contents 
ADD CONSTRAINT fk_contents_document 
FOREIGN KEY (document_id) REFERENCES ragdoll_documents(id) 
ON DELETE CASCADE;

-- Polymorphic embedding relationships (enforced by application)
-- Note: PostgreSQL doesn't support polymorphic foreign key constraints
-- These are enforced through ActiveRecord validations and callbacks

Unique Constraints:

-- Ensure unique document locations
ALTER TABLE ragdoll_documents 
ADD CONSTRAINT unique_document_location UNIQUE (location);

-- Ensure unique chunk indexes per content
ALTER TABLE ragdoll_embeddings 
ADD CONSTRAINT unique_chunk_per_content 
UNIQUE (embeddable_type, embeddable_id, chunk_index);

Check Constraints:

-- Ensure valid document types
ALTER TABLE ragdoll_documents 
ADD CONSTRAINT valid_document_type 
CHECK (document_type IN ('text', 'image', 'audio', 'pdf', 'docx', 'html', 'markdown', 'mixed'));

-- Ensure valid processing status
ALTER TABLE ragdoll_documents 
ADD CONSTRAINT valid_status 
CHECK (status IN ('pending', 'processing', 'processed', 'error'));

-- Ensure valid content types for STI
ALTER TABLE ragdoll_contents 
ADD CONSTRAINT valid_content_type 
CHECK (type IN ('Ragdoll::TextContent', 
                'Ragdoll::ImageContent', 
                'Ragdoll::AudioContent'));

Index Strategy

Performance Indexes:

-- Document indexes
CREATE INDEX idx_documents_status ON ragdoll_documents(status);
CREATE INDEX idx_documents_type ON ragdoll_documents(document_type);
CREATE INDEX idx_documents_created_at ON ragdoll_documents(created_at);

-- Content indexes (STI table)
CREATE INDEX idx_contents_type ON ragdoll_contents(type);
CREATE INDEX idx_contents_document_id ON ragdoll_contents(document_id);
CREATE INDEX idx_contents_embedding_model ON ragdoll_contents(embedding_model);

-- Embedding indexes
CREATE INDEX idx_embeddings_embeddable ON ragdoll_embeddings(embeddable_type, embeddable_id);
CREATE INDEX idx_embeddings_usage_count ON ragdoll_embeddings(usage_count);
CREATE INDEX idx_embeddings_returned_at ON ragdoll_embeddings(returned_at);

-- pgvector similarity search index
CREATE INDEX idx_embeddings_vector_cosine ON ragdoll_embeddings 
USING ivfflat (embedding_vector vector_cosine_ops) WITH (lists = 100);

Full-Text Search Indexes:

-- Document full-text search
CREATE INDEX idx_documents_fulltext ON ragdoll_documents 
USING gin(to_tsvector('english', 
  title || ' ' || 
  COALESCE(metadata->>'summary', '') || ' ' || 
  COALESCE(metadata->>'keywords', '') || ' ' || 
  COALESCE(metadata->>'description', '')
));

-- Content full-text search
CREATE INDEX idx_contents_fulltext ON ragdoll_contents 
USING gin(to_tsvector('english', COALESCE(content, '')));

Instance Methods

Document Methods

Content Retrieval Methods

# Combined content access
document.content
# Returns the combined content from all content types:
# For text: concatenated text from all text_contents
# For image: concatenated descriptions from all image_contents
# For audio: concatenated transcripts from all audio_contents

# Content type detection
document.content_types              # => ['text', 'image']
document.primary_content_type       # => 'text'
document.multi_modal?               # => true (if multiple content types)

# Content statistics
document.total_word_count           # Sum across all text content
document.total_character_count      # Sum across all text content
document.total_embedding_count      # Sum across all content types
document.embeddings_by_type         # => { text: 15, image: 3, audio: 0 }

# Content access by type
document.text_contents.each { |tc| puts tc.content }
document.image_contents.each { |ic| puts ic.content }  # AI descriptions
document.audio_contents.each { |ac| puts ac.content }  # Transcripts

Metadata Accessors

# LLM-generated metadata (stored in metadata JSON column)
document.metadata                   # Full metadata hash
document.description                # metadata['description']
document.description = "New desc"   # Updates metadata hash
document.classification             # metadata['classification']
document.classification = "technical"
document.tags                       # metadata['tags'] (array)
document.tags = ['ai', 'research']

# Metadata utility methods
document.has_summary?               # Check if summary exists
document.has_keywords?              # Check if keywords exist
document.keywords_array             # Parse keywords into array
document.add_keyword('machine-learning')
document.remove_keyword('outdated')

# File metadata (stored in file_metadata JSON column)
document.file_metadata              # File processing metadata
document.total_file_size            # Sum of all content file sizes
document.primary_file_type          # Document's primary file type

Processing Status Methods

# Status checking
document.processed?                 # status == 'processed'
document.status                     # 'pending', 'processing', 'processed', 'error'

# Content processing
document.process_content!           # Full processing pipeline:
                                    # 1. Generate embeddings for all content
                                    # 2. Generate LLM metadata
                                    # 3. Update status to 'processed'

document.generate_embeddings_for_all_content!
                                    # Generate embeddings only

document.generate_metadata!         # Generate LLM metadata only

# Processing validation
document.has_files?                 # Check if content has associated files
document.has_pending_content?       # Check for content awaiting processing

File Handling Methods

# File association (through content models)
document.has_files?                 # Any content has file data
document.total_file_size            # Sum of all file sizes
document.primary_file_type          # Main file type

# File metadata access
document.file_modified_at           # Source file modification time
document.location                   # Source file path or URL

# Content creation from files
document.content = "new text"       # Creates TextContent automatically
# For images/audio, use specific content models:
document.image_contents.create!(data: image_data, embedding_model: 'clip')

Content Methods

Embedding Generation

# Base Content methods (inherited by all content types)
content.generate_embeddings!        # Generate embeddings for this content
content.should_generate_embeddings? # Check if embeddings needed
content.content_for_embedding       # Text to use for embedding (overrideable)

# TextContent specific
text_content.generate_embeddings!   # Chunks text and generates embeddings
text_content.chunks                 # Array of content chunks with metadata
text_content.chunk_size             # Tokens per chunk
text_content.overlap                # Token overlap between chunks

# Embedding management
content.embeddings.count            # Number of embeddings
content.embedding_count             # Alias for embeddings.count
content.embeddings.destroy_all      # Remove all embeddings

Content Validation

# Base validations (all content types)
content.valid?                      # ActiveRecord validation
content.errors.full_messages        # Validation error messages

# Content-specific validations
text_content.content.present?       # TextContent requires content
image_content.data.present?         # ImageContent requires data

# Custom validation methods
content.validate_embedding_model    # Ensure model is supported
content.validate_content_size       # Check content size limits

Processing Callbacks

# Automatic processing callbacks
# after_create: Generate embeddings if content is ready
# after_update: Regenerate embeddings if content changed
# before_destroy: Clean up associated embeddings

# Manual callback triggering
content.run_callbacks(:create)      # Trigger create callbacks
content.run_callbacks(:update)      # Trigger update callbacks

# Callback status checking
content.embeddings_generated?       # Check if embeddings exist
content.metadata['embeddings_generated_at']  # Generation timestamp

Embedding Methods

Similarity Search

# Instance-level similarity (compare with other embeddings)
embedding.similarity_to(other_embedding)     # Cosine similarity score
embedding.distance_to(other_embedding)       # Distance (1 - similarity)

# Class-level similarity search
Embedding.search_similar(
  query_embedding,
  limit: 20,
  threshold: 0.7,
  filters: {
    embeddable_type: 'Ragdoll::TextContent',
    document_type: 'pdf'
  }
)

# Specialized search methods
embedding.find_similar(limit: 10)           # Find similar embeddings
embedding.find_related_in_document(limit: 5) # Similar chunks in same document
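
The similarity and distance values above follow the usual cosine definitions. The sketch below computes them on plain arrays and applies a `limit` and `threshold` in memory; the helper names and the candidate hash layout are assumptions for illustration, whereas the library itself is described as delegating this work to pgvector.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Cosine distance, using the "1 - similarity" convention.
def cosine_distance(a, b)
  1.0 - cosine_similarity(a, b)
end

# Rank candidates by similarity to the query, keeping only those at or
# above the threshold.
def search_similar(query, candidates, limit: 20, threshold: 0.7)
  candidates
    .map { |c| c.merge(similarity: cosine_similarity(query, c[:vector])) }
    .select { |c| c[:similarity] >= threshold }
    .sort_by { |c| -c[:similarity] }
    .first(limit)
end

candidates = [
  { id: 1, vector: [1.0, 0.0] },
  { id: 2, vector: [0.6, 0.8] },
  { id: 3, vector: [0.0, 1.0] }
]
results = search_similar([1.0, 0.0], candidates, threshold: 0.5)
results.each { |r| puts "#{r[:id]}: #{r[:similarity].round(2)}" }
```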

Usage Tracking

# Usage analytics
embedding.mark_as_used!             # Increment usage_count, update returned_at
embedding.usage_count               # Number of times used in searches
embedding.returned_at               # Last usage timestamp
embedding.last_used_days_ago        # Days since last use

# Usage scoring
embedding.usage_score               # Calculated usage score for ranking
embedding.frequency_score           # Frequency-based component
embedding.recency_score             # Recency-based component

# Usage statistics
embedding.is_popular?               # usage_count > threshold
embedding.is_recent?                # used within recent timeframe
embedding.is_trending?              # increasing usage pattern
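
The source does not give the formula behind `usage_score`, so the following is one plausible way to combine a frequency component and a recency component. The log scaling, the 30-day half-life, and the 0.7/0.3 weights are all assumptions chosen for illustration, not Ragdoll's actual scoring.

```ruby
# Frequency component: log-scaled so heavy use has diminishing returns.
def frequency_score(usage_count)
  Math.log(1 + usage_count)
end

# Recency component: exponential decay with a half-life in days.
# Embeddings that were never returned (nil timestamp) score 0.
def recency_score(returned_at, now: Time.now, half_life_days: 30.0)
  return 0.0 if returned_at.nil?

  days_ago = (now - returned_at) / 86_400.0
  0.5**(days_ago / half_life_days)
end

# Weighted blend of the two components; the weights are illustrative.
def usage_score(usage_count, returned_at, now: Time.now,
                frequency_weight: 0.7, recency_weight: 0.3)
  frequency_weight * frequency_score(usage_count) +
    recency_weight * recency_score(returned_at, now: now)
end

now = Time.utc(2025, 1, 15, 10, 30, 0)
fresh = usage_score(12, now, now: now)
stale = usage_score(12, now - 30 * 86_400, now: now)
puts "fresh=#{fresh.round(3)} stale=#{stale.round(3)}"
```

With equal usage counts, the more recently returned embedding ranks higher, which is the behavior the frequency and recency split is meant to capture.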

Analytics Methods

# Embedding metadata
embedding.embedding_dimensions      # Vector dimensionality
embedding.embedding_model           # Model used (via content relationship)
embedding.chunk_index               # Position within content

# Content access
embedding.embeddable                # Associated content object
embedding.document                  # Parent document (through content)
embedding.content_preview(length: 100)  # Truncated content preview

# Search result formatting
embedding.to_search_result(similarity: 0.85)
# Returns formatted hash for search APIs

# Performance metrics
embedding.vector_magnitude          # Vector magnitude (for normalization)
embedding.vector_norm               # L2 norm of the vector
embedding.vector_sparsity           # Percentage of zero values
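
The vector metrics listed above are simple to state on a plain array. This sketch treats magnitude and L2 norm as the same quantity and defines sparsity as the percentage of exactly-zero components; the free-standing helpers, including `normalize`, are hypothetical and shown only to pin down the arithmetic.

```ruby
# L2 norm (magnitude) of a vector.
def vector_norm(vector)
  Math.sqrt(vector.sum { |x| x * x })
end

# Scale a vector to unit length; a zero vector is returned unchanged.
def normalize(vector)
  norm = vector_norm(vector)
  return vector.dup if norm.zero?

  vector.map { |x| x / norm }
end

# Percentage of components that are exactly zero.
def vector_sparsity(vector)
  return 0.0 if vector.empty?

  100.0 * vector.count(&:zero?) / vector.length
end

v = [3.0, 0.0, 4.0, 0.0]
puts vector_norm(v)
puts normalize(v).inspect
puts vector_sparsity(v)
```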

Class Methods

Document Class Methods

Scopes and Query Methods

# Status-based scopes
Document.processed                  # WHERE status = 'processed'
Document.pending                    # WHERE status = 'pending'
Document.processing                 # WHERE status = 'processing'
Document.with_errors                # WHERE status = 'error'

# Content-based scopes
Document.by_type('pdf')             # WHERE document_type = 'pdf'
Document.multi_modal                # Documents with multiple content types
Document.text_only                  # Documents with only text content
Document.with_content               # Documents that have content models
Document.without_content            # Documents missing content models

# Time-based scopes
Document.recent                     # ORDER BY created_at DESC
Document.created_since(1.week.ago)  # WHERE created_at > ?
Document.modified_since(1.day.ago)  # WHERE file_modified_at > ?

# Advanced queries
Document.with_embeddings_count      # Includes embedding count
Document.by_content_length(min: 1000)  # Filter by content length
Document.by_file_size(max: 10.megabytes)  # Filter by file size

Search and Filtering

# PostgreSQL full-text search
Document.search_content(
  "machine learning algorithms",
  limit: 20
)

# Faceted search with metadata filters
Document.faceted_search(
  query: "neural networks",
  keywords: ["deep learning", "AI"],
  classification: "research_paper",
  tags: ["computer-science"],
  limit: 50
)

# Hybrid search (semantic + full-text)
Document.hybrid_search(
  "artificial intelligence applications",
  query_embedding: embedding_vector,
  semantic_weight: 0.7,
  text_weight: 0.3,
  limit: 25
)

# Metadata-based filtering
Document.with_classification('technical_manual')
Document.with_keywords(['api', 'documentation'])
Document.with_tags(['development', 'guide'])
Document.by_metadata_field('complexity', 'advanced')
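
The `semantic_weight` and `text_weight` parameters of `hybrid_search` suggest a weighted blend of two scores. The sketch below shows one straightforward way such a blend could work, assuming both scores are already normalized to the 0..1 range; `hybrid_score`, `hybrid_merge`, and the input hashes are hypothetical and not taken from the library.

```ruby
# Blend a semantic similarity and a full-text rank into one score.
# Both inputs are assumed to be normalized to the 0..1 range.
def hybrid_score(similarity, text_rank, semantic_weight: 0.7, text_weight: 0.3)
  semantic_weight * similarity + text_weight * text_rank
end

# Merge two result sets keyed by document id. A document missing from
# one set contributes 0 for that component.
def hybrid_merge(semantic_hits, text_hits, semantic_weight: 0.7, text_weight: 0.3, limit: 25)
  ids = (semantic_hits.keys + text_hits.keys).uniq
  scored = ids.map do |id|
    score = hybrid_score(
      semantic_hits.fetch(id, 0.0),
      text_hits.fetch(id, 0.0),
      semantic_weight: semantic_weight,
      text_weight: text_weight
    )
    { document_id: id, combined_score: score }
  end
  scored.sort_by { |hit| -hit[:combined_score] }.first(limit)
end

semantic_hits = { 1 => 0.9, 2 => 0.4 } # document_id => cosine similarity
text_hits     = { 2 => 1.0, 3 => 0.5 } # document_id => normalized text rank
ranked = hybrid_merge(semantic_hits, text_hits)
ranked.each { |hit| puts "#{hit[:document_id]}: #{hit[:combined_score].round(2)}" }
```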

Statistics and Analytics

# Comprehensive statistics
Document.stats
# Returns:
# {
#   total_documents: 1250,
#   by_status: { processed: 1100, pending: 50, processing: 75, error: 25 },
#   by_type: { pdf: 600, docx: 300, text: 200, image: 100, mixed: 50 },
#   multi_modal_documents: 75,
#   total_text_contents: 1000,
#   total_image_contents: 125,
#   total_audio_contents: 25,
#   total_embeddings: { text: 15000, image: 500, audio: 100 },
#   storage_type: "activerecord_polymorphic"
# }

# Usage analytics
Document.popular(limit: 10)         # Most searched documents
Document.trending(timeframe: 1.week) # Recently popular documents
Document.usage_summary(period: 1.month)  # Usage statistics

# Content analysis
Document.average_word_count          # Average words per document
Document.total_storage_size          # Total storage used
Document.embedding_coverage          # Percentage with embeddings

# Performance metrics
Document.processing_time_stats       # Processing time statistics
Document.error_rate(period: 1.day)   # Error rate percentage
Document.throughput_stats            # Documents processed per hour

Batch Operations

# Batch processing
Document.process_pending!            # Process all pending documents
Document.regenerate_embeddings!(model: 'text-embedding-3-large')
Document.bulk_update_metadata(classification: 'archived')

# Batch import
Document.import_from_directory(
  '/path/to/documents',
  file_patterns: ['*.pdf', '*.docx'],
  recursive: true,
  batch_size: 100
)

# Batch cleanup
Document.cleanup_orphaned_content!   # Remove content without documents
Document.remove_old_embeddings!(older_than: 6.months)
Document.vacuum_unused_storage!      # Cleanup unused file storage
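
The `file_patterns`, `recursive`, and `batch_size` options of the batch import map naturally onto `Dir.glob` and `each_slice`. The sketch below shows that mapping on a temporary directory; `matching_files` and `each_import_batch` are hypothetical helpers, and what Ragdoll does with each batch is not covered here.

```ruby
require "tmpdir"
require "fileutils"

# Collect files under a directory that match any of the glob patterns.
def matching_files(directory, file_patterns:, recursive: true)
  prefix = recursive ? File.join(directory, "**") : directory
  file_patterns
    .flat_map { |pattern| Dir.glob(File.join(prefix, pattern)) }
    .select { |path| File.file?(path) }
    .uniq
    .sort
end

# Yield the matching files in fixed-size batches.
def each_import_batch(directory, file_patterns:, recursive: true, batch_size: 100, &block)
  matching_files(directory, file_patterns: file_patterns, recursive: recursive)
    .each_slice(batch_size, &block)
end

batches = []
Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, "nested"))
  %w[a.pdf b.docx notes.txt nested/c.pdf].each do |name|
    FileUtils.touch(File.join(dir, name))
  end

  each_import_batch(dir, file_patterns: ["*.pdf", "*.docx"], batch_size: 2) do |batch|
    batches << batch.map { |path| path.delete_prefix("#{dir}/") }
  end
end

batches.each_with_index { |batch, i| puts "Batch #{i}: #{batch.join(', ')}" }
```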

Content Class Methods

Content-Type Specific Queries

# Base Content class methods
Content.by_type('TextContent')       # Filter by STI type
Content.with_embeddings              # Content that has embeddings
Content.without_embeddings           # Content missing embeddings
Content.by_embedding_model('text-embedding-3-large')

# TextContent specific
TextContent.by_word_count(min: 500, max: 5000)
TextContent.by_character_count(min: 2000)
TextContent.with_long_content        # Content over threshold
TextContent.recently_processed       # Recently generated embeddings

# ImageContent specific
ImageContent.by_dimensions(min_width: 800, min_height: 600)
ImageContent.by_file_size(max: 5.megabytes)
ImageContent.with_descriptions       # Has AI-generated descriptions
ImageContent.by_format(['jpg', 'png'])

# AudioContent specific (planned)
AudioContent.by_duration(min: 30.seconds, max: 10.minutes)
AudioContent.by_sample_rate(44100)
AudioContent.with_transcripts        # Has speech-to-text transcripts

Content Statistics

# TextContent statistics
TextContent.stats
# Returns:
# {
#   total_text_contents: 1000,
#   by_model: { 'text-embedding-3-large': 600, 'text-embedding-3-small': 400 },
#   total_embeddings: 15000,
#   average_word_count: 1250,
#   average_chunk_size: 1000
# }

# Processing statistics
Content.processing_stats             # Embedding generation statistics
Content.model_usage_stats            # Usage by embedding model
Content.error_rate_by_type           # Error rates by content type

Embedding Class Methods

Advanced Search Methods

# Vector similarity search with filters
Embedding.search_similar(
  query_embedding,
  limit: 20,
  threshold: 0.75,
  filters: {
    embeddable_type: 'Ragdoll::TextContent',
    embedding_model: 'text-embedding-3-large',
    document_type: 'pdf',
    created_after: 1.month.ago
  }
)

# Batch similarity search
Embedding.batch_search_similar(
  [embedding1, embedding2, embedding3],
  limit: 10,
  aggregate_results: true
)

# Specialized search methods
Embedding.find_duplicates(threshold: 0.95)  # Near-duplicate detection
Embedding.find_outliers(threshold: 0.3)     # Low-similarity outliers
Embedding.cluster_similar(max_clusters: 10) # K-means clustering

Usage Analytics

# Usage tracking
Embedding.most_used(limit: 100)     # Highest usage_count
Embedding.recently_used(since: 1.hour.ago)
Embedding.trending(period: 1.day)   # Increasing usage pattern
Embedding.popular_content_types     # Usage by content type

# Performance analytics
Embedding.search_performance_stats  # Search timing statistics
Embedding.model_performance_comparison  # Compare model effectiveness
Embedding.quality_metrics           # Embedding quality assessment

# Cache optimization
Embedding.precompute_popular!       # Cache popular embeddings
Embedding.optimize_indexes!         # Rebuild vector indexes

Batch Operations

# Batch embedding operations
Embedding.regenerate_for_model!(
  old_model: 'text-embedding-ada-002',
  new_model: 'text-embedding-3-large'
)

Embedding.update_usage_analytics!   # Recalculate usage scores
Embedding.cleanup_orphaned!         # Remove embeddings without content
Embedding.normalize_vectors!        # L2 normalize all vectors

# Database maintenance
Embedding.rebuild_vector_indexes!   # Rebuild pgvector indexes
Embedding.vacuum_embeddings_table!  # PostgreSQL VACUUM operation
Embedding.analyze_vector_distribution!  # Update query planner statistics

Model Validations

Ragdoll implements comprehensive validation rules to ensure data integrity:

Document Model Validations

Required Fields

class Document < ActiveRecord::Base
  validates :location, presence: true
  validates :title, presence: true
  validates :document_type, presence: true
  validates :status, presence: true
  validates :file_modified_at, presence: true
end

Format Validations¶

# Document type validation
validates :document_type, 
  inclusion: { 
    in: %w[text image audio pdf docx html markdown mixed],
    message: "must be a valid document type"
  }

# Status validation
validates :status,
  inclusion: {
    in: %w[pending processing processed error],
    message: "must be a valid processing status"
  }

# Location format validation
validates :location, format: {
  with: /\A(https?:\/\/|\/).*\z/,
  message: "must be a valid URL or absolute file path"
}

# Metadata JSON validation
validate :validate_metadata_structure

private

def validate_metadata_structure
  return unless metadata.present?

  # Validate metadata against document type schema
  schema_errors = MetadataSchemas.validate_metadata(document_type, metadata)
  schema_errors.each { |error| errors.add(:metadata, error) }
end
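
The location pattern above can be exercised outside of ActiveRecord to see which values it accepts. This sketch copies the regular expression from the validation; the constant and method names are illustrative only.

```ruby
# Same pattern as the location format validation:
# an http(s) URL or a path starting with "/".
LOCATION_FORMAT = /\A(https?:\/\/|\/).*\z/

def valid_location?(location)
  LOCATION_FORMAT.match?(location)
end

samples = [
  "https://example.com/doc.pdf", # accepted: HTTPS URL
  "/var/data/report.txt",        # accepted: absolute path
  "./relative/path.txt",         # rejected: relative path
  "ftp://example.com/file"       # rejected: unsupported scheme
]
results = samples.map { |s| valid_location?(s) }
```

Relative paths are rejected by the pattern, so they need to be expanded to absolute paths before validation runs.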

Custom Validators¶

# Custom location validator
validate :validate_location_accessibility

def validate_location_accessibility
  return unless location.present?

  # For file paths, check if file exists and is readable
  if location.start_with?('/')
    unless File.exist?(location) && File.readable?(location)
      errors.add(:location, "file does not exist or is not readable")
    end
  end

  # For URLs, validate format more strictly
  if location.start_with?('http')
    begin
      uri = URI.parse(location)
      # URI::HTTPS is a subclass of URI::HTTP, so one check covers both schemes
      unless uri.is_a?(URI::HTTP)
        errors.add(:location, "must be a valid HTTP or HTTPS URL")
      end
    rescue URI::InvalidURIError
      errors.add(:location, "is not a valid URL")
    end
  end
end

# File size validation
validate :validate_reasonable_file_size

def validate_reasonable_file_size
  if location.present? && File.exist?(location)
    file_size = File.size(location)
    max_size = 100.megabytes  # Configurable limit

    if file_size > max_size
      errors.add(:location, "file size (#{file_size} bytes) exceeds maximum (#{max_size} bytes)")
    end
  end
end
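
The URL branch of the custom validator can be tried in isolation with Ruby's standard `uri` library. The sketch below is a standalone approximation; `url_error` is a hypothetical helper that returns the message the validator would add, or `nil` when the URL is acceptable.

```ruby
require "uri"

# Returns an error message for an unusable URL, or nil when it parses as HTTP(S).
def url_error(location)
  uri = URI.parse(location)
  # URI::HTTPS inherits from URI::HTTP, so this accepts both schemes.
  uri.is_a?(URI::HTTP) ? nil : "must be a valid HTTP or HTTPS URL"
rescue URI::InvalidURIError
  "is not a valid URL"
end

ok        = url_error("https://example.com/a")
bad_type  = url_error("httpx://example.com")
malformed = url_error("http://exa mple.com")
```

A string that merely starts with `http` can still parse as a generic URI with another scheme, which is why the class check is needed in addition to the prefix test.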

Content Model Validations¶

Base Content Validations¶

class Content < ActiveRecord::Base
  validates :type, presence: true
  validates :embedding_model, presence: true
  validates :document_id, presence: true

  # Ensure valid STI type
  validates :type, inclusion: {
    in: %w[
      Ragdoll::TextContent
      Ragdoll::ImageContent
      Ragdoll::AudioContent
    ],
    message: "must be a valid content type"
  }

  # Validate embedding model exists
  validate :validate_embedding_model_exists

  private

  def validate_embedding_model_exists
    return unless embedding_model.present?

    valid_models = Ragdoll.config.embedding_config.keys.map(&:to_s)
    unless valid_models.include?(embedding_model)
      errors.add(:embedding_model, "'#{embedding_model}' is not a configured embedding model")
    end
  end
end

TextContent Specific Validations¶

class TextContent < Content
  validates :content, presence: true
  validates :content, length: {
    minimum: 10,
    maximum: 1_000_000,  # 1,000,000 characters (about 1 MB of ASCII text)
    message: "must be between 10 and 1,000,000 characters"
  }

  # Validate chunk configuration
  validate :validate_chunk_configuration

  private

  def validate_chunk_configuration
    chunk_size_val = chunk_size
    overlap_val = overlap

    if chunk_size_val <= 0
      errors.add(:chunk_size, "must be greater than 0")
    end

    if overlap_val < 0
      errors.add(:overlap, "cannot be negative")
    end

    if overlap_val >= chunk_size_val
      errors.add(:overlap, "must be less than chunk_size")
    end
  end
end
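
The rule that `overlap` must be smaller than `chunk_size` follows from how sliding-window chunking advances: each window starts `chunk_size - overlap` characters after the previous one, so a non-positive step would never reach the end of the text. The sketch below shows this with a hypothetical `chunk_text` helper; it is a character-based illustration, not Ragdoll's chunker.

```ruby
# Splits text into windows of chunk_size characters, each sharing
# `overlap` characters with the previous window.
def chunk_text(text, chunk_size:, overlap:)
  raise ArgumentError, "chunk_size must be greater than 0" unless chunk_size.positive?
  raise ArgumentError, "overlap must be in 0...chunk_size" unless (0...chunk_size).cover?(overlap)
  return [] if text.empty?

  step = chunk_size - overlap # at least 1 because overlap < chunk_size
  chunks = []
  start = 0
  loop do
    chunks << text[start, chunk_size]
    break if start + chunk_size >= text.length # last window reached the end

    start += step
  end
  chunks
end

chunks = chunk_text("abcdefghij", chunk_size: 4, overlap: 2)
```

With a chunk size of 4 and an overlap of 2, each window repeats the last two characters of the one before it.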

ImageContent Specific Validations¶

class ImageContent < Content
  validates :data, presence: true

  # Validate image metadata
  validate :validate_image_metadata

  private

  def validate_image_metadata
    return unless metadata.present?

    # Validate dimensions if present
    if metadata['width'] && metadata['height']
      width = metadata['width'].to_i
      height = metadata['height'].to_i

      if width <= 0 || height <= 0
        errors.add(:metadata, "image dimensions must be positive integers")
      end

      # Reasonable size limits
      if width > 50000 || height > 50000
        errors.add(:metadata, "image dimensions are unreasonably large")
      end
    end

    # Validate file format
    if metadata['file_type']
      valid_formats = %w[jpg jpeg png gif bmp webp svg ico tiff tif]
      unless valid_formats.include?(metadata['file_type'].downcase)
        errors.add(:metadata, "unsupported image format: #{metadata['file_type']}")
      end
    end
  end
end

Embedding Model Validations¶

Vector and Content Validations¶

class Embedding < ActiveRecord::Base
  validates :embeddable_id, presence: true
  validates :embeddable_type, presence: true
  validates :chunk_index, presence: true
  validates :content, presence: true
  validates :embedding_vector, presence: true

  # Unique chunk index per content
  validates :chunk_index, uniqueness: {
    scope: [:embeddable_id, :embeddable_type],
    message: "must be unique within the same content"
  }

  # Vector dimension validation
  validate :validate_embedding_dimensions

  # Content length validation
  validates :content, length: {
    minimum: 1,
    maximum: 10000,  # Reasonable chunk size limit
    message: "must be between 1 and 10,000 characters"
  }

  # Usage count validation
  validates :usage_count, numericality: {
    greater_than_or_equal_to: 0,
    message: "cannot be negative"
  }

  private

  def validate_embedding_dimensions
    return unless embedding_vector.present?

    # Get expected dimensions for the model
    expected_dimensions = get_expected_dimensions
    actual_dimensions = embedding_vector.length

    if actual_dimensions != expected_dimensions
      errors.add(
        :embedding_vector,
        "has #{actual_dimensions} dimensions, expected #{expected_dimensions}"
      )
    end

    # Validate vector values
    if embedding_vector.any? { |val| !val.is_a?(Numeric) }
      errors.add(:embedding_vector, "must contain only numeric values")
    end

    # Check for NaN or infinite values; guard with Float because Integer
    # does not define #nan? and non-numeric values would raise NoMethodError
    if embedding_vector.any? { |val| val.is_a?(Float) && (val.nan? || val.infinite?) }
      errors.add(:embedding_vector, "cannot contain NaN or infinite values")
    end
  end

  def get_expected_dimensions
    model_name = embeddable&.embedding_model
    return 1536 unless model_name  # Default: 1536 (e.g. OpenAI text-embedding-ada-002)

    # Look up dimensions from configuration
    config = Ragdoll.config.embedding_config
    config.dig(model_name.to_sym, :dimensions) || 1536
  end
end
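
The dimension lookup can be sketched without ActiveRecord. The example assumes the configuration is a hash keyed by model name (as a symbol) with a `:dimensions` entry, which is what the `dig` call above implies; `EMBEDDING_CONFIG`, `expected_dimensions`, and `dimension_error` are hypothetical names used only for illustration.

```ruby
# Hypothetical configuration shape: model name (symbol) => settings hash.
EMBEDDING_CONFIG = {
  "text-embedding-3-small": { dimensions: 1536 },
  "text-embedding-3-large": { dimensions: 3072 }
}.freeze

# Looks up the expected vector size, falling back to a default
# when the model is missing or has no :dimensions entry.
def expected_dimensions(model_name, config: EMBEDDING_CONFIG, default: 1536)
  return default unless model_name

  config.dig(model_name.to_sym, :dimensions) || default
end

# Returns the validation message for a mismatched vector, or nil when it fits.
def dimension_error(vector, model_name)
  expected = expected_dimensions(model_name)
  return nil if vector.length == expected

  "has #{vector.length} dimensions, expected #{expected}"
end

message = dimension_error(Array.new(1536, 0.0), "text-embedding-3-large")
```

One consequence of the fallback is that an unknown model name is silently checked against 1536 dimensions, which can hide a configuration mistake.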

Error Handling¶

Validation Error Processing¶

# Custom error handling for validation failures
class ValidationErrorHandler
  def self.handle_document_errors(document)
    return { success: true } if document.valid?

    {
      success: false,
      errors: {
        validation_errors: document.errors.full_messages,
        field_errors: document.errors.messages,
        error_count: document.errors.count
      }
    }
  end

  def self.handle_content_errors(content)
    return { success: true } if content.valid?

    {
      success: false,
      content_type: content.class.name,
      errors: {
        validation_errors: content.errors.full_messages,
        field_errors: content.errors.messages,
        suggested_fixes: generate_fix_suggestions(content.errors)
      }
    }
  end

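  def self.generate_fix_suggestions(errors)
    suggestions = []

    # errors.messages is a field => [messages] hash; iterating `errors.each`
    # with |field, message| was deprecated in Rails 6.1 and removed in 7.0
    errors.messages.each do |field, messages|
      case field
      when :content
        # Match Rails' default wording and the custom TextContent length message
        if messages.any? { |m| m.include?('too short') || m.include?('must be between') }
          suggestions << "Ensure content is between 10 and 1,000,000 characters"
        end
      when :embedding_model
        suggestions << "Use a configured embedding model: #{available_models.join(', ')}"
      when :chunk_size
        suggestions << "Set chunk_size to a positive integer (recommended: 1000)"
      end
    end

    suggestions
  end

  def self.available_models
    Ragdoll.config.embedding_config.keys.map(&:to_s)
  end

  # A bare `private` does not affect `def self.` methods, so hide the helpers explicitly
  private_class_method :generate_fix_suggestions, :available_models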
end

Validation Callbacks¶

# Before validation callbacks for cleanup
class Document < ActiveRecord::Base
  before_validation :normalize_location
  before_validation :set_default_file_modified_at
  before_validation :sanitize_metadata

  private

  def normalize_location
    return unless location.present?

    # Convert relative paths to absolute paths
    if location.start_with?('./')
      self.location = File.expand_path(location)
    end

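    # Normalize the URL scheme only (lowercase it and upgrade http to https);
    # the rest of the URL is left alone because paths and query strings can be case-sensitive
    if location.match?(/\Ahttps?:\/\//i)
      self.location = location.sub(/\Ahttps?:/i, 'https:')
    end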
  end

  def sanitize_metadata
    return unless metadata.present?

    # Remove nil values and empty strings
    self.metadata = metadata.reject { |k, v| v.nil? || v == '' }

    # Ensure arrays are actually arrays
    ['tags', 'keywords'].each do |field|
      if metadata[field].is_a?(String)
        metadata[field] = metadata[field].split(',').map(&:strip)
      end
    end
  end
end
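
The metadata cleanup can also be written as a pure function, which makes it easy to test without a database. This is a sketch that mirrors the callback above: the method and the `list_fields` parameter are illustrative names, and it returns a new hash instead of assigning to the record.

```ruby
# Drops nil and empty-string values, then turns comma-separated
# strings in the list fields into arrays of trimmed items.
def sanitize_metadata(metadata, list_fields: %w[tags keywords])
  cleaned = metadata.reject { |_key, value| value.nil? || value == "" }

  list_fields.each do |field|
    next unless cleaned[field].is_a?(String)

    cleaned[field] = cleaned[field].split(",").map(&:strip)
  end

  cleaned
end

raw = {
  "title" => "Report",
  "summary" => "",
  "author" => nil,
  "tags" => "ruby, rails ,search"
}
result = sanitize_metadata(raw)
```

The input hash is left untouched, so the caller decides whether to assign the cleaned result back to the record.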

This document is part of the Ragdoll documentation suite. For immediate help, see the Quick Start Guide or API Reference.