Multi-Modal Architecture

Ragdoll's multi-modal architecture is one of its most sophisticated features, designed from the ground up to handle text, image, and audio content as first-class citizens through a polymorphic database design.

Overview

Unlike most RAG systems that retrofit multi-modal support, Ragdoll implements a native multi-modal architecture where different content types are unified through polymorphic associations while maintaining specialized handling for each media type.

Architecture Design

Schema Optimization

Ragdoll uses an optimized polymorphic schema that eliminates field duplication while maintaining full functionality:

  • Embedding model information is stored in the content-specific tables (text_contents, image_contents, audio_contents)
  • Embeddings table contains only embedding-specific data (vector, content chunk, metadata)
  • Polymorphic relationships provide seamless access to embedding model information
  • Zero data duplication across the schema while preserving all functionality

This design provides better data normalization, reduced storage requirements, and maintains referential integrity across all content types.
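
As a rough sketch of the resulting model layer (assuming standard ActiveRecord conventions; the delegate call is illustrative rather than Ragdoll's exact implementation):

class Embedding < ApplicationRecord
  belongs_to :embeddable, polymorphic: true

  # The model name lives on the content record, so expose it
  # through the polymorphic parent (illustrative delegation)
  delegate :embedding_model, to: :embeddable
end

class TextContent < ApplicationRecord
  belongs_to :document
  has_many :embeddings, as: :embeddable, dependent: :destroy
end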

Polymorphic Content System

# Unified embedding storage across all content types
Embedding.where(embeddable_type: 'TextContent')
Embedding.where(embeddable_type: 'ImageContent')
Embedding.where(embeddable_type: 'AudioContent')

# Access embedding model through polymorphic relationship
embedding = Embedding.find(123)
embedding.embedding_model  # Returns content-specific embedding model
# e.g., 'text-embedding-3-small', 'clip-vit-large-patch14', 'whisper-embedding-v1'

# Cross-modal search
results = SearchEngine.search_similar_content("machine learning diagram")
# Can return text, image, and audio results in a single query

Database Schema

-- Central document entity
ragdoll_documents
├── file_data (Shrine attachment)
├── metadata (LLM-generated content analysis)
├── file_metadata (system file properties)
└── document_type (text/image/audio/pdf/mixed)

-- Content-specific tables (each stores embedding_model)
ragdoll_text_contents
├── document_id (foreign key)
├── content (extracted text)
├── chunk_size, chunk_overlap (processing parameters)
├── embedding_model (model used for this content type)
└── language_detected

ragdoll_image_contents
├── document_id (foreign key)
├── file_data (Shrine image attachment)
├── description (AI-generated description)
├── embedding_model (model used for this content type)
├── width, height (image dimensions)
└── alt_text

ragdoll_audio_contents
├── document_id (foreign key)
├── file_data (Shrine audio attachment)
├── transcript (speech-to-text result)
├── embedding_model (model used for this content type)
├── duration_seconds
└── language_detected

-- Polymorphic embeddings (normalized schema - no duplicated fields)
ragdoll_embeddings
├── embeddable_type (TextContent/ImageContent/AudioContent)
├── embeddable_id (references content table)
├── embedding_vector (pgvector)
├── content (original text/description/transcript chunk)
├── chunk_index (position within embeddable content)
└── metadata (embedding-specific data only)
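
For reference, a hypothetical migration matching the embeddings table above (column names are taken from the diagram; the vector dimension and the neighbor gem's t.vector helper are assumptions):

class CreateRagdollEmbeddings < ActiveRecord::Migration[7.1]
  def change
    create_table :ragdoll_embeddings do |t|
      t.references :embeddable, polymorphic: true, null: false
      t.vector :embedding_vector, limit: 1536   # pgvector column (neighbor gem)
      t.text :content                           # original chunk text
      t.integer :chunk_index                    # position within embeddable content
      t.jsonb :metadata, default: {}            # embedding-specific data only
      t.timestamps
    end
  end
end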

Content Type Support

Text Content

Fully Implemented with comprehensive processing:

# Supported formats
text_content = TextContent.create!(
  document: document,
  content: extracted_text,
  chunk_size: 1000,
  chunk_overlap: 200,
  embedding_model: 'text-embedding-3-small',
  language_detected: 'en'
)

# Automatic chunking and embedding generation
Ragdoll::GenerateEmbeddingsJob.perform_later(text_content)

Features:

  • ✅ Intelligent text chunking with boundary detection (sketched below)
  • ✅ Language detection and encoding handling
  • ✅ Configurable chunk size and overlap
  • ✅ Sentence/paragraph boundary preservation
  • ✅ Code-aware chunking for technical documents
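
A minimal sketch of sentence-boundary chunking with overlap (illustrative only; Ragdoll's actual chunker also handles encodings and code-aware splitting):

def chunk_text(text, chunk_size: 1000, overlap: 200)
  sentences = text.split(/(?<=[.!?])\s+/)   # split on sentence boundaries
  chunks = []
  current = +""

  sentences.each do |sentence|
    if current.length + sentence.length > chunk_size && !current.empty?
      chunks << current.strip
      current = +(current[-overlap..] || current)  # carry overlap into the next chunk
    end
    current << sentence << " "
  end

  chunks << current.strip unless current.strip.empty?
  chunks
end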

Image Content

Fully Implemented with AI-powered analysis:

# Image processing with Shrine
image_content = ImageContent.create!(
  document: document,
  file: uploaded_file,  # Shrine attachment
  description: ai_generated_description,
  embedding_model: 'clip-vit-large-patch14',
  width: 1920,
  height: 1080,
  alt_text: "Machine learning workflow diagram"
)

# AI description generation
MetadataGeneratorService.generate_image_description(image_content)

Features:

  • ✅ Shrine file attachment with validation
  • ✅ AI-powered image description generation
  • ✅ Metadata extraction (dimensions, format, size)
  • ✅ Embedding generation from descriptions
  • ✅ Cross-modal search (find images from text queries)

Audio Content

Fully Implemented with transcription support:

# Audio processing with transcription
audio_content = AudioContent.create!(
  document: document,
  file: uploaded_file,  # Shrine attachment
  transcript: speech_to_text_result,
  embedding_model: 'whisper-embedding-v1',
  duration_seconds: 125.5,
  language_detected: 'en'
)

# Transcription and embedding
Ragdoll::ExtractTextJob.perform_later(audio_content)
Ragdoll::GenerateEmbeddingsJob.perform_later(audio_content)

Features:

  • ✅ Shrine file attachment with audio validation
  • ✅ Speech-to-text transcription integration (sketched below)
  • ✅ Duration and metadata extraction
  • ✅ Language detection
  • ✅ Embedding generation from transcripts
  • ✅ Searchable by spoken content
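
The speech-to-text backend is pluggable; as one hedged example, a transcription step via the ruby-openai gem (an assumption — Ragdoll may integrate a different provider):

require "openai"

# Hypothetical transcription helper (ruby-openai gem assumed)
def transcribe(audio_path)
  client = OpenAI::Client.new(access_token: ENV["OPENAI_API_KEY"])
  response = client.audio.transcribe(
    parameters: { model: "whisper-1", file: File.open(audio_path, "rb") }
  )
  response["text"]
end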

Cross-Modal Search

The unified embedding system enables sophisticated cross-modal search capabilities:

Search Across All Content Types

# Single query searches text, image descriptions, and audio transcripts
results = Ragdoll::Core.search(
  query: "neural network architecture",
  content_types: ['text', 'image', 'audio']  # optional filter
)

# Results include:
# - Text documents mentioning neural networks
# - Images with AI-generated descriptions about neural networks
# - Audio files with transcripts discussing neural networks

# Search only images
image_results = Ragdoll::Core.search(
  query: "machine learning diagram",
  content_type: 'image'
)

# Search only audio transcripts
audio_results = Ragdoll::Core.search(
  query: "podcast about AI",
  content_type: 'audio'
)

Advanced Multi-Modal Queries

# Complex cross-modal search with metadata filters
results = Ragdoll::Core.search(
  query: "deep learning",
  content_types: ['text', 'image'],
  metadata_filters: {
    classification: 'technical',
    topics: ['artificial intelligence']
  },
  similarity_threshold: 0.8
)

File Upload and Processing

Shrine Integration

Each content type uses Shrine for production-grade file handling:

# Configuration in shrine_config.rb
require "shrine"
require "shrine/storage/file_system"

Shrine.storages = {
  cache: Shrine::Storage::FileSystem.new("tmp", prefix: "uploads/cache"),
  store: Shrine::Storage::FileSystem.new("storage", prefix: "uploads")
}

# Automatic file validation by content type
class ImageUploader < Shrine
  plugin :validation_helpers

  Attacher.validate do
    validate_max_size 10.megabytes
    validate_mime_type %w[image/jpeg image/png image/gif image/webp]
  end
end
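
An audio uploader follows the same pattern; a minimal sketch, where the size limit and MIME list are assumptions rather than Ragdoll's actual validation rules:

class AudioUploader < Shrine
  plugin :validation_helpers
  plugin :determine_mime_type   # inspect file content rather than trusting the extension

  Attacher.validate do
    validate_max_size 100.megabytes           # assumed limit
    validate_mime_type %w[audio/mpeg audio/wav audio/mp4 audio/ogg]
  end
end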

Processing Pipeline

# Multi-modal document processing
document = Document.create!(
  location: file_path,
  document_type: 'mixed'  # Can contain multiple content types
)

# Automatic content type detection and processing
DocumentProcessor.process(document) do |processor|
  case processor.detected_type
  when 'image'
    processor.create_image_content_with_description
  when 'audio'
    processor.create_audio_content_with_transcription
  when 'text'
    processor.create_text_content_with_chunking
  end
end
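
Content type detection can be as simple as MIME sniffing; a minimal sketch using the marcel gem (an assumption — Ragdoll's DocumentProcessor may detect types differently):

require "marcel"
require "pathname"

# Map a file's detected MIME type to a Ragdoll content type
def detected_type(file_path)
  mime = Marcel::MimeType.for(Pathname.new(file_path))
  case mime
  when %r{\Aimage/} then "image"
  when %r{\Aaudio/} then "audio"
  else "text"
  end
end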

Background Processing

All multi-modal operations are designed for background processing:

# Jobs for each content type
Ragdoll::GenerateEmbeddingsJob.perform_later(text_content)
Ragdoll::GenerateEmbeddingsJob.perform_later(image_content)  # From description
Ragdoll::GenerateEmbeddingsJob.perform_later(audio_content)  # From transcript

# Content analysis jobs
Ragdoll::ExtractTextJob.perform_later(audio_content)         # Speech-to-text
Ragdoll::GenerateSummaryJob.perform_later(text_content)      # Summarization
Ragdoll::ExtractKeywordsJob.perform_later(image_content)     # Image analysis
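
A simplified sketch of what one of these jobs might look like (assumes ActiveJob; EmbeddingService.generate_embedding is hypothetical and stands in for whatever embedding client is configured):

class Ragdoll::GenerateEmbeddingsJob < ActiveJob::Base
  queue_as :embeddings

  def perform(content)
    # Audio embeds its transcript, images their AI description, text its body
    text = if content.respond_to?(:transcript)
             content.transcript
           elsif content.respond_to?(:description)
             content.description
           else
             content.content
           end

    vector = EmbeddingService.generate_embedding(text)  # hypothetical service
    content.embeddings.create!(
      embedding_vector: vector,
      content: text,
      chunk_index: 0
    )
  end
end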

Usage Analytics

Multi-modal search usage is tracked per embedding, enabling analytics by content type:

# Usage tracking across content types
embedding.update!(
  usage_count: embedding.usage_count + 1,
  returned_at: Time.current
)

# Analytics by content type
Document.joins(:embeddings)
        .where(embeddings: { embeddable_type: 'ImageContent' })
        .group(:document_type)
        .count

# Cross-modal performance metrics
SearchEngine.analytics_for_query("machine learning")
# Returns usage stats across text, image, and audio results

API Examples

Adding Multi-Modal Content

# Mixed document with multiple content types
result = Ragdoll::Core.add_document(path: 'presentation.pptx')
# Automatically extracts:
# - Text from slides → TextContent
# - Images from slides → ImageContent
# - Speaker notes → TextContent

# Manual content addition
doc_id = Ragdoll::Core.add_document(path: 'research_paper.pdf')[:document_id]

# Add supplementary image
Ragdoll::Core.add_image(
  document_id: doc_id,
  image_path: 'diagram.png',
  description: 'Neural network architecture diagram'
)

# Add supplementary audio
Ragdoll::Core.add_audio(
  document_id: doc_id,
  audio_path: 'presentation.mp3',
  transcript: 'Today we discuss neural network architectures...'
)

Searching Multi-Modal Content

# Unified search across all content
results = Ragdoll::Core.search(query: 'convolutional neural networks')

results.each do |result|
  case result[:content_type]
  when 'text'
    puts "Text: #{result[:content]}"
  when 'image'
    puts "Image: #{result[:description]} (#{result[:file_url]})"
  when 'audio'
    puts "Audio: #{result[:transcript]} (#{result[:duration]}s)"
  end
end

Performance Considerations

Vector Storage Optimization

-- Approximate nearest-neighbor index for vector similarity search
-- (ivfflat indexes cover only the vector column; filter columns get their own index)
CREATE INDEX idx_embeddings_vector_search
ON ragdoll_embeddings
USING ivfflat (embedding_vector vector_cosine_ops)
WITH (lists = 100);  -- lists tuned per dataset size

-- B-tree index for filtering by content type
CREATE INDEX idx_embeddings_embeddable_type
ON ragdoll_embeddings (embeddable_type);

-- Content-type specific indexes
CREATE INDEX idx_embeddings_text_usage
ON ragdoll_embeddings (embeddable_type, usage_count DESC)
WHERE embeddable_type = 'TextContent';
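
A similarity query that can use these indexes might look like the following, where $1 stands for a bound query vector parameter:

-- Top 10 text chunks nearest to a query vector (cosine distance)
SELECT id, content
FROM ragdoll_embeddings
WHERE embeddable_type = 'TextContent'
ORDER BY embedding_vector <=> $1
LIMIT 10;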

Batch Processing

# Efficient batch embedding generation
TextContent.where(embeddings_count: 0)
           .find_in_batches(batch_size: 100) do |batch|
  Ragdoll::GenerateEmbeddingsJob.perform_later(batch.map(&:id))
end

# Cross-modal batch search
queries = ['AI research', 'neural networks', 'deep learning']
results = SearchEngine.batch_search(queries, content_types: ['text', 'image'])

Extending Multi-Modal Support

Adding New Content Types

# 1. Create new content model
class VideoContent < ApplicationRecord
  belongs_to :document
  has_many :embeddings, as: :embeddable

  # Shrine attachment for video files
  include VideoUploader::Attachment(:file)
end

# 2. Add processing logic
class DocumentProcessor
  def process_video(video_file)
    # Extract frames, transcribe audio, analyze content
    VideoContent.create!(
      document: @document,
      file: video_file,
      transcript: extract_audio_transcript(video_file),
      frame_descriptions: extract_key_frames(video_file),
      duration_seconds: get_video_duration(video_file)
    )
  end
end

# 3. Add embedding generation
class Ragdoll::GenerateEmbeddingsJob
  def perform_for_video(video_content)
    # Generate embeddings from transcript + frame descriptions
    combined_text = "#{video_content.transcript} #{video_content.frame_descriptions.join(' ')}"
    vector = EmbeddingService.generate_embedding(combined_text)

    # The model name lives on the content record (normalized schema),
    # so the embedding row stores only embedding-specific data
    video_content.embeddings.create!(
      embedding_vector: vector,
      content: combined_text,
      chunk_index: 0
    )
  end
end

Best Practices

1. Content Type Strategy

  • Use appropriate content types for your data
  • Consider mixed documents for complex files (PDFs, presentations)
  • Leverage cross-modal search for comprehensive results

2. Performance Optimization

  • Process large files in background jobs
  • Use batch operations for multiple files
  • Implement appropriate caching for frequently accessed content

3. Search Strategy

  • Start with broad cross-modal searches
  • Use content-type filters to narrow results
  • Combine with metadata filters for precision

4. Storage Management

  • Configure appropriate Shrine storage backends
  • Implement file validation and size limits
  • Plan for storage scaling in production

The multi-modal architecture in Ragdoll provides a powerful foundation for building sophisticated document intelligence applications that can understand and search across different types of content seamlessly.