Architecture Overview

Ragdoll Core is a database-oriented RAG (Retrieval-Augmented Generation) framework built on ActiveRecord with PostgreSQL as the primary data store. The architecture follows a layered service-oriented design with clear separation of concerns, optimized for performance and extensibility.

System Design and Component Relationships

High-Level Architecture

graph TB
    subgraph "Client Interface Layer"
        Client[Ragdoll::Core::Client<br/>🏗️ Facade Pattern]
        API[Public API Methods<br/>🔌 High-level Interface]
        Delegation[Module Delegation<br/>📡 Method Forwarding]
    end

    subgraph "Service Layer"
        DocProc[Document Processor<br/>📄 Multi-format Parsing]
        DocMgmt[Document Management<br/>🗂️ CRUD Operations]
        EmbedSvc[Embedding Service<br/>🧠 Vector Generation]
        SearchEng[Search Engine<br/>🔍 Semantic & Hybrid Search]
        TextChunk[Text Chunker<br/>✂️ Content Segmentation]
        TextGen[Text Generation Service<br/>💬 LLM Integration]
        Config[Configuration Manager<br/>⚙️ Settings & Providers]
    end

    subgraph "Background Processing Layer"
        ActiveJob[ActiveJob Integration<br/>⚡ Queue Management]
        EmbedJob[Generate Embeddings Job<br/>🔄 Async Processing]
        MetaJob[Metadata Generation Job<br/>📊 LLM Enhancement]
    end

    subgraph "Data Layer"
        DocModel[Document Model<br/>📑 Central Entity]
        TextContent[Text Content<br/>📝 Text Processing]
        ImageContent[Image Content<br/>🖼️ Image Analysis]
        AudioContent[Audio Content<br/>🎵 Audio Processing]
        EmbedModel[Embedding Model<br/>🎯 Vector Storage]
        SearchModel[Search Model<br/>🔍 Query Tracking]
        SearchResultModel[Search Result Model<br/>📊 Analytics]
    end

    subgraph "Infrastructure Layer"
        PostgreSQL[(PostgreSQL + pgvector<br/>🐘 Vector Database)]
        Shrine[Shrine File Storage<br/>📁 Upload Management]
        FullText[Full-text Search<br/>🔎 GIN Indexes]
        VectorSearch[Vector Similarity<br/>📐 IVFFlat Indexes]
    end

    subgraph "External Services"
        LLMProviders[LLM Providers<br/>🤖 OpenAI, Anthropic, etc.]
        FileStorage[File Storage Backends<br/>☁️ Local, S3, etc.]
    end

    %% Client Layer Connections
    Client --> DocProc
    Client --> DocMgmt
    Client --> SearchEng
    Client --> EmbedSvc
    API --> Client
    Delegation --> Client

    %% Service Layer Connections
    DocProc --> TextChunk
    DocMgmt --> DocProc
    DocMgmt --> EmbedJob
    SearchEng --> EmbedSvc
    EmbedSvc --> LLMProviders
    TextGen --> LLMProviders

    %% Background Processing
    EmbedJob --> EmbedSvc
    EmbedJob --> TextChunk
    MetaJob --> TextGen
    ActiveJob --> EmbedJob
    ActiveJob --> MetaJob

    %% Data Layer Connections
    DocModel --> TextContent
    DocModel --> ImageContent
    DocModel --> AudioContent
    TextContent --> EmbedModel
    ImageContent --> EmbedModel
    AudioContent --> EmbedModel
    SearchEng --> SearchModel
    SearchModel --> SearchResultModel
    SearchResultModel --> EmbedModel

    %% Infrastructure Connections
    DocModel --> PostgreSQL
    EmbedModel --> PostgreSQL
    SearchModel --> PostgreSQL
    SearchResultModel --> PostgreSQL
    EmbedModel --> VectorSearch
    SearchModel --> VectorSearch
    DocModel --> FullText
    DocProc --> Shrine
    Shrine --> FileStorage
    VectorSearch --> PostgreSQL
    FullText --> PostgreSQL

    %% Styling
    classDef clientLayer fill:#e1f5fe
    classDef serviceLayer fill:#f3e5f5
    classDef processLayer fill:#fff3e0
    classDef dataLayer fill:#e8f5e8
    classDef infraLayer fill:#fce4ec
    classDef externalLayer fill:#f1f8e9

    class Client,API,Delegation clientLayer
    class DocProc,DocMgmt,EmbedSvc,SearchEng,TextChunk,TextGen,Config serviceLayer
    class ActiveJob,EmbedJob,MetaJob processLayer
    class DocModel,TextContent,ImageContent,AudioContent,EmbedModel,SearchModel,SearchResultModel dataLayer
    class PostgreSQL,Shrine,FullText,VectorSearch infraLayer
    class LLMProviders,FileStorage externalLayer

Data Flow Through the System

1. Document Ingestion Flow

sequenceDiagram
    participant User
    participant Client as Ragdoll::Core::Client
    participant DocProc as Document Processor
    participant DocMgmt as Document Management
    participant Shrine as File Storage
    participant DB as PostgreSQL
    participant Job as Background Job
    participant EmbedSvc as Embedding Service
    participant LLM as LLM Provider

    User->>Client: add_document(path)
    Client->>DocProc: parse(file_path)
    Client->>DocMgmt: add_document(path, content, metadata)
    DocMgmt->>DB: store_document()
    DocMgmt-->>Client: return document_id
    Client-->>User: success response

    Client->>Job: queue GenerateEmbeddings
    Job->>EmbedSvc: process_document()
    EmbedSvc->>LLM: generate_embeddings()
    LLM-->>EmbedSvc: vector_embeddings
    EmbedSvc->>DB: store_embeddings()
    Job->>DB: update_status(processed)

2. Search and Retrieval Flow

flowchart TD
    A[Query Input] --> B{Search Type?}

    B -->|Semantic| C[Generate Query Embedding]
    B -->|Full-text| D[PostgreSQL Text Search]
    B -->|Hybrid| E[Both Semantic + Text]

    C --> F[pgvector Similarity Search]
    D --> G[GIN Index Search]
    E --> H[Weighted Result Combination]

    F --> I[Result Ranking]
    G --> I
    H --> I

    I --> J[Usage Analytics Update]
    J --> K[Context Assembly]
    K --> L[Return Results]

    style A fill:#e1f5fe
    style L fill:#e8f5e8
    style I fill:#fff3e0
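
The "Weighted Result Combination" step in the hybrid branch can be sketched in plain Ruby. This is only an illustration under stated assumptions: `combine_hybrid`, the 0.7/0.3 weights, and the premise that both score sets are already normalized to 0.0..1.0 are hypothetical and not taken from Ragdoll's code.

```ruby
# Hypothetical helper: merges semantic and full-text scores per result id.
# Both inputs map a result id to a score assumed to be normalized to 0.0..1.0.
def combine_hybrid(semantic, fulltext, semantic_weight: 0.7, text_weight: 0.3)
  ids = semantic.keys | fulltext.keys
  combined = ids.map do |id|
    # A result missing from one list contributes 0.0 for that signal.
    score = semantic_weight * semantic.fetch(id, 0.0) +
            text_weight * fulltext.fetch(id, 0.0)
    [id, score.round(4)]
  end
  # Highest combined score first.
  combined.sort_by { |_id, score| -score }
end

semantic = { "doc1" => 0.9, "doc2" => 0.4 }
fulltext = { "doc2" => 1.0, "doc3" => 0.5 }
ranked = combine_hybrid(semantic, fulltext)
ranked.each { |id, score| puts "#{id}: #{score}" }
```

A result found by only one search type still appears in the merged list, ranked by whichever signal it has.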

3. RAG Enhancement Flow

graph LR
    subgraph "Context Retrieval"
        A[User Prompt] --> B[get_context]
        B --> C[search_similar_content]
        C --> D[Similarity Search]
        D --> E[Top K Results]
    end

    subgraph "Prompt Enhancement"
        E --> F[build_enhanced_prompt]
        F --> G[Template Application]
        G --> H[Context Integration]
    end

    subgraph "Output Generation"
        H --> I[Enhanced Prompt]
        I --> J[LLM Processing]
        J --> K[Contextual Response]
    end

    style A fill:#e1f5fe
    style I fill:#fff3e0
    style K fill:#e8f5e8
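
The prompt enhancement stage above amounts to taking the top-K retrieved chunks and substituting them into a template. The sketch below shows that idea only. The template text and the method signature are assumptions, and the library's real `build_enhanced_prompt` may behave differently.

```ruby
# Hypothetical template; the wording Ragdoll actually uses may differ.
PROMPT_TEMPLATE = <<~TEMPLATE
  Use the following context to answer the question.

  Context:
  %<context>s

  Question: %<prompt>s
TEMPLATE

# Joins the first `context_limit` chunks and fills them into the template.
def build_enhanced_prompt(prompt, chunks, context_limit: 5)
  context = chunks.first(context_limit).join("\n\n")
  format(PROMPT_TEMPLATE, context: context, prompt: prompt)
end

chunks = [
  "Ragdoll stores embeddings in PostgreSQL.",
  "pgvector provides similarity search."
]
enhanced = build_enhanced_prompt("Where are embeddings stored?", chunks, context_limit: 1)
puts enhanced
```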

Integration Points with External Services

LLM Providers: OpenAI, Anthropic, Google, Azure, Ollama via ruby_llm gem
File Storage: Shrine gem with configurable backends (filesystem, cloud storage)
Background Processing: ActiveJob with multiple adapter support
Search Extensions: Optional OpenSearch/Elasticsearch integration
Monitoring: Built-in analytics and optional external monitoring tools

Core Components

1. Client Interface Layer

Primary Component: Ragdoll::Core::Client

Responsibilities:

- High-level API facade for all RAG operations
- Document lifecycle management (add, update, delete, list)
- Context retrieval and prompt enhancement
- Multi-modal search capabilities (semantic, hybrid, full-text)
- Health monitoring and system analytics

Key Methods:

# Document Management
add_document(path:) → result_hash
add_text(content:, title:, **options) → document_id
add_directory(path:, recursive: false) → results_array

# Search & Retrieval
search(query:, **options) → search_results
hybrid_search(query:, **options) → weighted_results
get_context(query:, limit: 10, **options) → context_hash
enhance_prompt(prompt:, context_limit: 5, **options) → enhanced_prompt

# Analytics & Health
stats() → system_statistics
healthy?() → boolean
search_analytics(days: 30) → analytics_data

Design Pattern: Facade Pattern - Simplifies complex subsystem interactions

2. Document Processing Pipeline

Primary Component: Ragdoll::DocumentProcessor

Responsibilities:

- Multi-format document parsing (PDF, DOCX, HTML, Markdown, Text)
- Content extraction and normalization
- Format-specific handling and validation
- Error handling for malformed documents

Related Component: Ragdoll::DocumentManagement

Responsibilities:

- Document CRUD operations (create, read, update, delete)
- Database persistence and retrieval
- Document metadata management
- Integration with background job processing

Architecture Details:

class DocumentProcessor
  # Strategy pattern for format-specific parsing
  PARSERS = {
    '.pdf' => :parse_pdf,
    '.docx' => :parse_docx,
    '.html' => :parse_html,
    '.md' => :parse_markdown
  }.freeze

  def parse(file_path)
    # Returns structured result with content, metadata, and document_type
    {
      content: extracted_text,
      metadata: { title:, author:, creation_date:, ... },
      document_type: mime_type
    }
  end
end
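
To make the extension-based dispatch concrete, here is a minimal runnable sketch. `SketchDocumentProcessor`, the `placeholder` helper, and the plain-text fallback for unknown extensions are assumptions made for illustration. The real parsers extract content from the file, whereas these placeholders never open it.

```ruby
# Sketch of strategy-style dispatch keyed on the file extension.
class SketchDocumentProcessor
  PARSERS = {
    ".pdf"  => :parse_pdf,
    ".docx" => :parse_docx,
    ".html" => :parse_html,
    ".md"   => :parse_markdown
  }.freeze

  def parse(file_path)
    ext = File.extname(file_path).downcase
    # Assumed behavior: unknown extensions fall back to plain text.
    handler = PARSERS.fetch(ext, :parse_text)
    send(handler, file_path)
  end

  private

  def parse_pdf(path)      placeholder(path, "pdf") end
  def parse_docx(path)     placeholder(path, "docx") end
  def parse_html(path)     placeholder(path, "html") end
  def parse_markdown(path) placeholder(path, "markdown") end
  def parse_text(path)     placeholder(path, "text") end

  # Stand-in result with the same keys as the structure described above.
  def placeholder(path, type)
    { content: "", metadata: { title: File.basename(path, ".*") }, document_type: type }
  end
end

result = SketchDocumentProcessor.new.parse("/tmp/Report.PDF")
puts result[:document_type]
```

Adding a format then means adding one entry to `PARSERS` and one handler method, without touching `parse`.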

Architecture Details:

class DocumentManagement
  # Handles all document database operations
  def self.add_document(location, content, metadata = {})
    # Create document record with metadata
    # Returns document ID for further processing
  end

  def self.get_document(id)
    # Retrieve document with all associated content
  end

  def self.update_document(id, **updates)
    # Update document metadata and properties
  end
end

Integration Points:

- DocumentProcessor for content parsing
- Background jobs for asynchronous processing
- ActiveRecord models for data persistence

3. Embedding Generation System

Primary Component: Ragdoll::EmbeddingService

Responsibilities:

- Vector embedding generation using multiple LLM providers
- Text preprocessing and normalization
- Batch processing for efficiency
- Cosine similarity calculations
- Provider failover and error handling

Provider Integration:

class EmbeddingService
  SUPPORTED_PROVIDERS = [
    :openai, :anthropic, :google, :azure, 
    :ollama, :huggingface, :openrouter
  ].freeze

  def generate_embedding(text)
    # Unified interface to multiple providers via ruby_llm
    @client.embed(text: clean_text(text))
  end

  def generate_embeddings_batch(texts)
    # Optimized batch processing
    texts.each_slice(batch_size).flat_map { |batch| process_batch(batch) }
  end
end

Design Pattern: Adapter Pattern - Unifies diverse LLM provider APIs
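
The adapter and batching ideas can be exercised without any network access by swapping in a fake provider. Everything named here (`FakeProvider`, `SketchEmbeddingService`, `embed_batch`, the toy two-dimensional vectors) is hypothetical. It only illustrates how a uniform provider interface lets the service slice its input into batches.

```ruby
# Stand-in provider adapter; a real adapter would call an embedding API.
class FakeProvider
  attr_reader :calls

  def initialize
    @calls = 0
  end

  # Returns one toy 2-dimensional vector per input text.
  def embed_batch(texts)
    @calls += 1
    texts.map { |t| [t.length.to_f, t.count("aeiou").to_f] }
  end
end

class SketchEmbeddingService
  def initialize(provider, batch_size: 10)
    @provider = provider
    @batch_size = batch_size
  end

  def generate_embeddings_batch(texts)
    # Normalize whitespace before sending text to the provider.
    cleaned = texts.map { |t| t.to_s.strip.gsub(/\s+/, " ") }
    # One provider call per slice instead of one per text.
    cleaned.each_slice(@batch_size).flat_map { |batch| @provider.embed_batch(batch) }
  end
end

provider = FakeProvider.new
service = SketchEmbeddingService.new(provider, batch_size: 2)
vectors = service.generate_embeddings_batch(["alpha", "beta  gamma", " delta "])
puts "#{vectors.length} vectors from #{provider.calls} provider calls"
```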

4. Search and Retrieval Engine

Primary Component: Ragdoll::SearchEngine

Responsibilities:

- Semantic search using vector embeddings and pgvector
- Hybrid search combining semantic and full-text approaches
- Multi-modal content search across text, image, and audio
- Query embedding generation and similarity matching
- Search result ranking and relevance scoring

Search Types:

Semantic Search: Vector similarity using cosine distance
Hybrid Search: Combines semantic search with keyword extraction and filtering
Content-Type Search: Specialized search for text, image, or audio content
Filtered Search: Document-type and metadata-based filtering

Performance Optimizations:

class SearchEngine
  def search_similar_content(query, **options)
    # Generate query embedding for semantic search
    embedding = @embedding_service.generate_embedding(query)

    # Use pgvector for efficient similarity search
    Ragdoll::Embedding.search_similar(
      embedding,
      limit: options[:limit],
      threshold: options[:threshold],
      filters: options[:filters]
    )
  end

  def search_documents(query, options = {})
    # High-level search interface with filtering.
    # Double-splat the hash: Ruby 3 no longer converts a positional hash to keywords.
    search_similar_content(query, **options)
  end
end
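
What the `limit` and `threshold` options mean can be shown with a brute-force, in-memory version of the same search. This is a sketch for intuition, not how pgvector executes the query. `search_similar` here is a standalone hypothetical function, and the default threshold of 0.7 is an assumption.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |y| y * y })
  norm.zero? ? 0.0 : dot / norm
end

# Scores every item, drops those below the threshold, keeps the best `limit`.
def search_similar(query_vec, items, limit: 10, threshold: 0.7)
  items
    .map { |id, vec| [id, cosine_similarity(query_vec, vec)] }
    .select { |_id, score| score >= threshold }
    .max_by(limit) { |_id, score| score }
end

items = { "a" => [1.0, 0.0], "b" => [0.8, 0.6], "c" => [0.0, 1.0] }
hits = search_similar([1.0, 0.0], items, limit: 2, threshold: 0.5)
hits.each { |id, score| puts "#{id}: #{score.round(3)}" }
```

Item `c` is orthogonal to the query and falls below the threshold, so only `a` and `b` are returned, best match first.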

5. Database Abstraction Layer

Primary Component: Ragdoll::Core::Database

Responsibilities:

- Database connection management and health monitoring
- Automatic migration execution on startup
- Schema versioning and compatibility checks
- Development database reset capabilities
- Connection pooling optimization

Key Features:

class Database
  def self.setup(config)
    establish_connection(config)
    run_migrations if config[:auto_migrate]
    verify_extensions  # Ensure pgvector is available
  end

  def self.healthy?
    connection.active? && verify_schema_version
  rescue StandardError
    false
  end
end

6. Background Job System

Primary Component: Ragdoll::GenerateEmbeddingsJob

Responsibilities:

- Asynchronous embedding generation for new content
- LLM-powered metadata enhancement
- Text extraction and chunking for large documents
- Error handling and retry logic
- Progress tracking and status updates

Job Architecture:

class GenerateEmbeddings < ActiveJob::Base
  def perform(document_id)
    document = Ragdoll::Document.find(document_id)

    # Process each content type polymorphically
    document.content_items.each do |content|
      generate_embeddings_for_content(content)
    end

    document.update!(status: 'processed', processed_at: Time.current)
  end
end

Design Pattern: Observer Pattern - Jobs triggered by model lifecycle events

Architecture Decisions

Choice of ActiveRecord for ORM

Decision: Use ActiveRecord as the primary ORM with PostgreSQL

Rationale:

- Performance: Direct SQL generation with optimization capabilities
- Ecosystem: Rich plugin ecosystem (pgvector, full-text search)
- Migrations: Built-in schema versioning and migration management
- Associations: Powerful polymorphic associations for multi-modal content
- Connection Management: Built-in pooling and health monitoring

Trade-offs:

- ✅ Familiar Rails conventions and patterns
- ✅ Excellent PostgreSQL integration
- ✅ Rich querying capabilities with Arel
- ❌ Not database-agnostic (PostgreSQL-specific features used)
- ❌ Potential N+1 query issues (mitigated with careful includes)

Decision: Use polymorphic associations for content types (text, image, audio)

Rationale:

- Extensibility: Easy addition of new content types without schema changes
- Consistency: Unified embedding storage across all content types
- Performance: Single embedding table with efficient indexing
- Flexibility: Content-type specific processing while maintaining common interfaces

Implementation:

class Document < ActiveRecord::Base
  has_many :text_contents, dependent: :destroy
  has_many :image_contents, dependent: :destroy
  has_many :audio_contents, dependent: :destroy
end

class Embedding < ActiveRecord::Base
  belongs_to :embeddable, polymorphic: true  # Text, Image, or Audio content
end

Database Schema Relationships:

erDiagram
    DOCUMENTS {
        uuid id PK
        string title
        string status
        jsonb metadata "LLM-generated content analysis"
        jsonb file_metadata "System file properties"
        text content "Extracted text content"
        string document_type
        timestamp created_at
        timestamp updated_at
        timestamp processed_at
    }

    TEXT_CONTENTS {
        uuid id PK
        uuid document_id FK
        text content
        integer chunk_index
        integer start_position
        integer end_position
        jsonb metadata
        timestamp created_at
        timestamp updated_at
    }

    IMAGE_CONTENTS {
        uuid id PK
        uuid document_id FK
        string file_data "Shrine attachment"
        string alt_text
        jsonb metadata "Image analysis"
        integer width
        integer height
        string format
        timestamp created_at
        timestamp updated_at
    }

    AUDIO_CONTENTS {
        uuid id PK
        uuid document_id FK
        string file_data "Shrine attachment"
        text transcript
        integer duration_seconds
        string format
        jsonb metadata "Audio analysis"
        timestamp created_at
        timestamp updated_at
    }

    EMBEDDINGS {
        uuid id PK
        uuid embeddable_id FK
        string embeddable_type "Polymorphic type"
        vector vector "pgvector embedding"
        string model_name
        integer vector_dimensions
        integer usage_count
        timestamp last_used_at
        timestamp created_at
        timestamp updated_at
    }

    SEARCHES {
        uuid id PK
        text query "Original search query"
        vector query_embedding "Query vector for similarity"
        string search_type "semantic, hybrid, fulltext"
        integer results_count "Number of results returned"
        float max_similarity_score
        float min_similarity_score
        float avg_similarity_score
        jsonb search_filters "Applied filters"
        jsonb search_options "Search configuration"
        integer execution_time_ms "Query performance"
        string session_id "User session"
        string user_id "User identifier"
        timestamp created_at
        timestamp updated_at
    }

    SEARCH_RESULTS {
        uuid id PK
        uuid search_id FK
        uuid embedding_id FK
        float similarity_score "Result similarity"
        integer result_rank "Position in results"
        boolean clicked "User engagement"
        timestamp clicked_at "Click timestamp"
        timestamp created_at
        timestamp updated_at
    }

    %% Relationships
    DOCUMENTS ||--o{ TEXT_CONTENTS : "has_many"
    DOCUMENTS ||--o{ IMAGE_CONTENTS : "has_many"
    DOCUMENTS ||--o{ AUDIO_CONTENTS : "has_many"

    TEXT_CONTENTS ||--o{ EMBEDDINGS : "embeddable (polymorphic)"
    IMAGE_CONTENTS ||--o{ EMBEDDINGS : "embeddable (polymorphic)"
    AUDIO_CONTENTS ||--o{ EMBEDDINGS : "embeddable (polymorphic)"

    SEARCHES ||--o{ SEARCH_RESULTS : "has_many"
    EMBEDDINGS ||--o{ SEARCH_RESULTS : "has_many"

Dual Metadata Architecture

Decision: Separate LLM-generated content metadata from file metadata

Rationale:

- Separation of Concerns: Technical file properties vs. semantic content analysis
- Search Optimization: Different indexing strategies for different metadata types
- Cost Management: LLM metadata generation only when needed
- Extensibility: Easy addition of domain-specific metadata schemas

Schema Design:

class Document < ActiveRecord::Base
  # System-generated file properties
  # (store_accessor suits jsonb columns, which need no extra serialization)
  store_accessor :file_metadata, :file_size, :mime_type, :dimensions, :processing_params

  # LLM-generated content analysis
  store_accessor :metadata, :summary, :keywords, :classification, :topics, :sentiment
end

Search Tracking Architecture

Decision: Comprehensive search analytics with vector similarity for query analysis

Rationale:

- User Behavior Analytics: Track search patterns, click-through rates, and engagement
- Query Similarity: Use vector embeddings to find similar searches and improve relevance
- Performance Monitoring: Measure search execution times and optimize slow queries
- Session Tracking: Associate searches with users and sessions for personalization
- Automatic Cleanup: Cascade deletion and orphaned search cleanup for data integrity

Schema Design:

class Search < ActiveRecord::Base
  # Vector similarity support for finding similar searches
  has_neighbors :query_embedding
  has_many :search_results, dependent: :destroy
  has_many :embeddings, through: :search_results

  # Analytics methods
  def click_through_rate
    total = search_results.count
    return 0.0 if total.zero?

    # Percentage of this search's results that were clicked
    (search_results.where(clicked: true).count / total.to_f) * 100
  end

  # Find searches with similar queries
  def similar_searches(limit: 5)
    nearest_neighbors(:query_embedding, distance: :cosine).limit(limit)
  end
end

class SearchResult < ActiveRecord::Base
  belongs_to :search
  belongs_to :embedding

  # User engagement tracking
  def mark_as_clicked!
    update!(clicked: true, clicked_at: Time.current)
  end

  # Automatic cleanup when search becomes empty
  after_destroy :cleanup_empty_search

  private

  def cleanup_empty_search
    # Skip when the parent search is itself cascading the delete
    return if destroyed_by_association

    search.destroy unless search.search_results.exists?
  end
end

Key Features:

- Automatic Recording: All searches tracked unless explicitly disabled
- Vector Similarity: Query embeddings enable finding similar searches
- Performance Metrics: Execution time, result counts, and similarity scores
- User Engagement: Click tracking and analytics
- Data Integrity: Cascade deletion and automatic cleanup

Background Processing Approach

Decision: Use ActiveJob with asynchronous embedding generation

Rationale:

- User Experience: Non-blocking document upload and immediate response
- Resource Management: Controlled concurrency for expensive LLM operations
- Reliability: Retry logic and error handling for API failures
- Scalability: Horizontal scaling through job queue workers

Error Handling Strategy:

class GenerateEmbeddings < ActiveJob::Base
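  # Rails 7.1 renamed :exponentially_longer to :polynomially_longer; use the old name on earlier versions
  retry_on EmbeddingService::APIError, wait: :polynomially_longer, attempts: 3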
  discard_on EmbeddingService::InvalidInputError

  def perform(document_id)
    # Robust error handling with fallback strategies
  end
end
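
The retry behavior can be illustrated outside ActiveJob with a small helper. This sketch uses a plain exponential delay (`base**tries`), which is not necessarily the schedule ActiveJob applies. `TransientError`, `with_retries`, and the injectable `sleeper` are hypothetical names introduced so the example runs without real delays.

```ruby
# Hypothetical error class standing in for a recoverable API failure.
class TransientError < StandardError; end

# Runs the block up to `attempts` times, waiting longer after each failure.
# `sleeper` is injectable so the sketch can be run without actually sleeping.
def with_retries(attempts: 3, base: 2, sleeper: ->(seconds) { sleep(seconds) })
  tries = 0
  begin
    tries += 1
    yield tries
  rescue TransientError
    raise if tries >= attempts # give up and surface the last error
    sleeper.call(base**tries)  # 2s, 4s, 8s, ... with the default base
    retry
  end
end

waits = []
calls = 0
result = with_retries(sleeper: ->(seconds) { waits << seconds }) do |attempt|
  calls += 1
  raise TransientError, "rate limited" if attempt < 3
  "ok"
end
puts "result=#{result} calls=#{calls} waits=#{waits.inspect}"
```

Errors that are not `TransientError` propagate immediately, which mirrors the distinction the job draws between retryable API failures and invalid input that should be discarded.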

Vector Database Integration Strategy

Decision: Use PostgreSQL + pgvector instead of dedicated vector databases

Rationale:

- Simplicity: Single database system reduces operational complexity
- Performance: Native PostgreSQL optimizations with pgvector extension
- ACID Compliance: Transactional consistency between documents and embeddings
- Ecosystem: Leverages existing PostgreSQL tooling and expertise

Performance Characteristics:

- IVFFlat Indexing: Approximate nearest-neighbor search that scans only the probed lists rather than every row, giving sub-linear query cost in exchange for some recall
- Concurrent Access: PostgreSQL's MVCC for high concurrency
- Memory Efficiency: Configurable vector dimensions and precision
- Backup/Recovery: Standard PostgreSQL tools work seamlessly

Performance Considerations

Database Indexing Strategy

Vector Indexes:

-- IVFFlat index for similarity search
CREATE INDEX idx_embeddings_vector ON ragdoll_embeddings 
USING ivfflat (vector vector_cosine_ops) WITH (lists = 100);

-- B-tree index for filtered searches; ivfflat indexes cover a single vector column,
-- so the content-type filter gets its own index
CREATE INDEX idx_embeddings_embeddable_type ON ragdoll_embeddings (embeddable_type);
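
The `lists = 100` setting is worth revisiting as the table grows. The helper below encodes the starting points suggested in the pgvector documentation (rows / 1000 lists up to one million rows, the square root of the row count beyond that, and probes near the square root of lists). `ivfflat_starting_params` is a hypothetical name, and the scanned-row figure is a rough estimate that assumes evenly sized lists.

```ruby
# Rough starting values for ivfflat tuning, following pgvector's guidance.
def ivfflat_starting_params(rows)
  lists = rows <= 1_000_000 ? (rows / 1000.0).ceil : Math.sqrt(rows).round
  lists = [lists, 1].max
  probes = [Math.sqrt(lists).round, 1].max
  {
    lists: lists,
    probes: probes,
    # Assumes vectors are spread evenly across lists.
    approx_rows_scanned: (rows.to_f / lists * probes).round
  }
end

params = ivfflat_starting_params(100_000)
puts params.inspect
```

For 100,000 embeddings this suggests 100 lists, the value used in the index definition, and a query with 10 probes would compare against roughly 10,000 vectors instead of all 100,000.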

Text Search Indexes:

-- GIN index for full-text search
CREATE INDEX idx_documents_content_gin ON ragdoll_documents 
USING gin(to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, '')));

-- Metadata search optimization
CREATE INDEX idx_documents_metadata_gin ON ragdoll_documents 
USING gin(metadata);

Performance Optimizations:

graph TD
    subgraph "Query Optimization"
        A[Query Input] --> B{Query Type}
        B -->|Vector Similarity| C[IVFFlat Index]
        B -->|Full-text Search| D[GIN Index]
        B -->|Metadata Filter| E[JSON Index]

        C --> F[pgvector Cosine Distance]
        D --> G[PostgreSQL Text Ranking]
        E --> H[JSON Path Queries]
    end

    subgraph "Caching Strategy"
        I[Frequently Used Embeddings] --> J[In-Memory Cache]
        K[Search Results] --> L[TTL-based Cache]
        M[Configuration] --> N[Memoization]
    end

    subgraph "Connection Management"
        O[Multiple Clients] --> P[Connection Pool]
        P --> Q[Load Balancing]
        Q --> R[Health Monitoring]
    end

    F --> S[Result Ranking]
    G --> S
    H --> S
    S --> T[Analytics Update]
    T --> U[Optimized Results]

    style A fill:#e1f5fe
    style U fill:#e8f5e8
    style J fill:#fff3e0
    style P fill:#f3e5f5

Implementation Details:

- Query Planning: Use of EXPLAIN ANALYZE for query optimization
- Index Maintenance: Regular VACUUM and ANALYZE operations
- Connection Pooling: ActiveRecord pool configuration for concurrency
- Prepared Statements: Automatic statement caching for repeated queries

Async Processing Design

Batch Processing:

- Embedding generation in configurable batch sizes (default: 10)
- Parallel processing of independent content items
- Memory management for large document processing

Queue Management:

# Configuration for different job priorities
class GenerateEmbeddings < ActiveJob::Base
  queue_as :embeddings

  # High priority for interactive operations
  def self.perform_now_if_small(document_id)
    document = Ragdoll::Document.find(document_id)
    if document.estimated_processing_time < 5.seconds
      perform_now(document_id)
    else
      perform_later(document_id)
    end
  end
end

Caching Strategy

Application-Level Caching:

- Configuration memoization for frequently accessed settings
- Embedding result caching for repeated queries
- Search result caching with TTL-based invalidation
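
TTL-based invalidation for search results can be sketched with a small in-memory cache. `TtlCache` is a hypothetical class rather than part of Ragdoll, and the clock is injectable only so that expiry can be demonstrated without waiting.

```ruby
# Minimal in-memory cache whose entries expire after `ttl` seconds.
class TtlCache
  def initialize(ttl:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl
    @clock = clock
    @store = {}
  end

  # Returns the cached value while it is fresh; otherwise runs the block,
  # stores its result with a new expiry time, and returns it.
  def fetch(key)
    entry = @store[key]
    return entry[:value] if entry && @clock.call < entry[:expires_at]

    value = yield
    @store[key] = { value: value, expires_at: @clock.call + @ttl }
    value
  end
end

now = 0.0
cache = TtlCache.new(ttl: 60, clock: -> { now })
computed = 0
first  = cache.fetch("ruby") { computed += 1; "results v1" }
second = cache.fetch("ruby") { computed += 1; "results v2" } # served from cache
now = 61.0                                                   # simulate expiry
third  = cache.fetch("ruby") { computed += 1; "results v3" }
puts [first, second, third].inspect
```

A production cache would also need size limits and thread safety, which this sketch leaves out.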

Database-Level Optimization:

- Query result caching through PostgreSQL's shared buffers
- Materialized views for complex analytics queries
- Partial indexes for status-based filtering

Memory Management:

- Streaming file processing for large documents
- Configurable chunk sizes balancing quality vs. performance
- Connection pool sizing based on concurrency requirements
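
The chunk-size trade-off is easier to reason about with a concrete chunker. The sketch below splits text into fixed-size character windows with overlap and records the `start_position` and `end_position` fields that appear in the TEXT_CONTENTS schema. `chunk_text` and its defaults of 1000 characters with 200 overlap are assumptions, and the actual Text Chunker may split on different boundaries.

```ruby
# Splits text into overlapping character windows.
def chunk_text(text, chunk_size: 1000, overlap: 200)
  raise ArgumentError, "overlap must be smaller than chunk_size" if overlap >= chunk_size

  chunks = []
  start = 0
  step = chunk_size - overlap
  while start < text.length
    finish = [start + chunk_size, text.length].min
    chunks << { content: text[start...finish], start_position: start, end_position: finish }
    break if finish == text.length # last window reached the end of the text

    start += step
  end
  chunks
end

chunks = chunk_text("abcdefghij", chunk_size: 4, overlap: 1)
chunks.each { |c| puts "#{c[:start_position]}-#{c[:end_position]}: #{c[:content]}" }
```

Smaller chunks give more precise matches but produce more embeddings to generate and store, while larger chunks do the reverse.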

Monitoring and Observability

Built-in Analytics:

- Search frequency and result quality tracking
- Embedding generation performance metrics
- Document processing success/failure rates
- API response time monitoring

Health Checks:

def healthy?
  checks = {
    database: Database.healthy?,
    embedding_service: @embedding_service.healthy?,
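    # ActiveJob adapters define no standard health API; `healthy?` is assumed to be provided by the application
    job_queue: ActiveJob::Base.queue_adapter.healthy?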
  }

  checks.all? { |_name, status| status }
end
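
One refinement worth considering is isolating each check so that an exception in one dependency is reported as unhealthy instead of aborting the whole health check. The sketch below shows that pattern with plain lambdas. `run_health_checks` and the report shape are hypothetical.

```ruby
# Runs each named check, treating any StandardError as a failed check.
def run_health_checks(checks)
  results = checks.transform_values do |check|
    begin
      check.call ? true : false
    rescue StandardError
      false
    end
  end
  { healthy: results.values.all?, checks: results }
end

report = run_health_checks(
  database: -> { true },
  embedding_service: -> { raise IOError, "provider unreachable" },
  job_queue: -> { true }
)
puts report.inspect
```

Returning the per-check results alongside the overall flag makes it clear which dependency failed.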

Performance Monitoring:

- Query execution time tracking
- Memory usage monitoring during document processing
- Background job queue depth monitoring
- LLM API latency and error rate tracking

This document is part of the Ragdoll documentation suite. For immediate help, see the Quick Start Guide or API Reference.