Unified Text-Based RAG Architecture

Ragdoll has evolved from a multi-modal polymorphic architecture to a unified text-based RAG system that converts all media types to comprehensive text representations before vectorization. This approach enables powerful cross-modal search while dramatically simplifying the system architecture.

Overview

The unified text-based architecture represents a fundamental shift in how RAG systems handle diverse content types:

  • All Media → Text: Images become AI-generated descriptions, audio becomes transcripts, documents become extracted text
  • Single Embedding Model: One text embedding model handles all content types
  • Cross-Modal Search: Find any media type through natural language queries
  • Simplified Architecture: Single content model instead of complex polymorphic relationships
  • Better Retrieval: Text descriptions often contain more searchable information than raw media embeddings

Architecture Design

Unified Content Pipeline

graph LR
    subgraph "Input Media"
        PDF[PDF/DOCX]
        IMG[Images]
        AUD[Audio]
        CSV[CSV/JSON]
    end

    subgraph "Text Conversion"
        TE[Text Extraction]
        ID[Image Description<br/>via Vision AI]
        AT[Audio Transcription<br/>via Speech AI]
        SE[Structured Extraction]
    end

    subgraph "Unified Processing"
        TC[Text Content]
        QA[Quality Assessment]
        CH[Chunking]
        EM[Text Embeddings]
    end

    subgraph "Search"
        UI[Unified Index]
        CS[Cross-Modal Search]
    end

    PDF --> TE
    IMG --> ID
    AUD --> AT
    CSV --> SE

    TE --> TC
    ID --> TC
    AT --> TC
    SE --> TC

    TC --> QA
    TC --> CH
    CH --> EM
    EM --> UI
    UI --> CS
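The routing in the diagram above can be sketched as a dispatch on file type. This is an illustrative Ruby sketch, not Ragdoll's actual internals; the `CONVERSION_METHODS` table and `conversion_method_for` helper are hypothetical names.

```ruby
# Hypothetical sketch of the media-type routing shown in the diagram:
# each file extension maps to the conversion step that produces text.
CONVERSION_METHODS = {
  %w[.pdf .docx .html .md]       => :text_extraction,
  %w[.png .jpg .jpeg .gif .webp] => :image_description,
  %w[.mp3 .wav .m4a .flac]       => :audio_transcription,
  %w[.csv .json .xml .yaml .yml] => :structured_extraction
}.freeze

def conversion_method_for(path)
  ext = File.extname(path).downcase
  CONVERSION_METHODS.each do |extensions, method|
    return method if extensions.include?(ext)
  end
  :text_extraction # fallback: treat unknown formats as plain text
end
```

Whatever the real routing looks like, every branch converges on plain text, which is what keeps the rest of the pipeline (quality assessment, chunking, embedding) uniform.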

Database Schema

-- Unified document entity
ragdoll_unified_documents
├── title
├── location (file path/URL)
├── document_type (original format)
├── status (pending/processing/processed/failed)
└── metadata (conversion settings and results)

-- Unified text content
ragdoll_unified_contents
├── unified_document_id (foreign key)
├── content (converted text representation)
├── original_media_type (text/image/audio/document)
├── conversion_method (extraction/description/transcription)
├── content_quality_score (0.0-1.0)
├── word_count
├── character_count
├── embedding_model (single text model)
└── metadata (conversion-specific data)

-- Text embeddings only
ragdoll_embeddings
├── embeddable_type (UnifiedContent)
├── embeddable_id
├── embedding (pgvector - text embeddings only)
├── content (text chunk)
├── chunk_index
└── metadata
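If you are recreating these tables yourself, a migration along the following lines would match the `ragdoll_unified_contents` listing above. This is a hedged sketch assuming a Rails-style ActiveRecord migration with the pgvector extension installed; the column names follow the listing, but the migration Ragdoll actually ships may differ.

```ruby
# Illustrative migration for the unified content table sketched above.
# Assumes Rails 7.x conventions; not necessarily Ragdoll's own migration.
class CreateRagdollUnifiedContents < ActiveRecord::Migration[7.1]
  def change
    create_table :ragdoll_unified_contents do |t|
      t.references :unified_document, null: false,
                   foreign_key: { to_table: :ragdoll_unified_documents }
      t.text    :content, null: false      # converted text representation
      t.string  :original_media_type       # text/image/audio/document
      t.string  :conversion_method         # extraction/description/transcription
      t.float   :content_quality_score     # 0.0-1.0
      t.integer :word_count
      t.integer :character_count
      t.string  :embedding_model
      t.jsonb   :metadata, default: {}     # conversion-specific data
      t.timestamps
    end
  end
end
```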

Text Conversion Services

Document Text Extraction

Extracts text from various document formats:

# PDF text extraction
text_content = Ragdoll::TextExtractionService.extract('research.pdf')
# => "This paper presents a novel approach to..."

# CSV to readable text
csv_text = Ragdoll::TextExtractionService.extract('data.csv')
# => "name: John Smith, age: 30, city: New York..."

# Supported formats: PDF, DOCX, HTML, Markdown, CSV, JSON, XML, YAML

Image to Text Conversion

Generates comprehensive descriptions using vision AI models:

# Image description generation
description = Ragdoll::ImageToTextService.convert(
  'diagram.png',
  detail_level: :comprehensive
)
# => "A flowchart diagram showing the machine learning pipeline with..."

# Detail levels:
# :minimal - Brief one-sentence description
# :standard - Main elements and composition
# :comprehensive - Detailed description including objects, colors, mood
# :analytical - Thorough analysis including artistic elements

Audio to Text Transcription

Converts speech to searchable text:

# Audio transcription
transcript = Ragdoll::AudioToTextService.transcribe('meeting.mp3')
# => "In today's meeting we discussed the Q3 results..."

# Supported providers:
# :openai - Whisper API
# :azure - Speech Services
# :google - Cloud Speech-to-Text
# :whisper_local - Local Whisper installation

Cross-Modal Search

The unified architecture enables powerful search across all media types:

# Find images by describing their content
results = Ragdoll.search(query: "architecture diagram with database symbols")
# Returns images whose AI descriptions match the query

# Search audio by spoken content
results = Ragdoll.search(query: "discussion about machine learning")
# Returns audio files whose transcripts contain these topics

# Mixed results across all media types
results = Ragdoll.search(query: "neural networks")
# Returns:
# - Text documents mentioning neural networks
# - Images with descriptions of neural network diagrams
# - Audio with transcripts discussing neural networks
# All ranked by unified relevance scoring
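How results from different media types end up in a single ranking can be illustrated with a toy scoring function. This is a hypothetical sketch, not Ragdoll's actual scorer: it assumes each chunk carries an embedding-similarity score and blends in the content quality score described below.

```ruby
# Toy unified ranking: blend embedding similarity with content quality.
# Both inputs are assumed to be in 0.0..1.0; the weight is illustrative.
def unified_relevance(similarity, quality_score, quality_weight: 0.2)
  (1.0 - quality_weight) * similarity + quality_weight * quality_score
end

results = [
  { type: :text,  similarity: 0.82, quality: 0.9 },
  { type: :image, similarity: 0.85, quality: 0.5 },
  { type: :audio, similarity: 0.78, quality: 0.8 }
]

# Sort mixed-media results by the blended score, best first.
ranked = results.sort_by { |r| -unified_relevance(r[:similarity], r[:quality]) }
```

Because every media type is represented as text scored by the same embedding model, a single function like this can rank them together; no per-modality normalization step is needed.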

Content Quality Assessment

Automatic assessment of converted content quality:

document = Ragdoll::UnifiedDocument.find(id)
content = document.unified_contents.first

# Quality score (0.0 to 1.0)
puts content.content_quality_score

# Quality factors:
# - Content length (optimal: 50-2000 words)
# - Original media type (text > documents > descriptions > placeholders)
# - Conversion success (full > partial > fallback)

# Batch quality analysis
stats = Ragdoll::UnifiedContent.stats
puts stats[:content_quality_distribution]
# => { high: 150, medium: 75, low: 25 }
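The factors above could be combined roughly as follows. This is a hypothetical heuristic for illustration only, not the formula Ragdoll uses; the weights are invented, and the ideal 50-2000 word range comes from the factor list above.

```ruby
# Illustrative quality heuristic based on the factors listed above;
# not Ragdoll's actual formula. Scores are clamped to 0.0..1.0.
MEDIA_TYPE_WEIGHT = {
  'text' => 1.0, 'document' => 0.9, 'image' => 0.7, 'audio' => 0.7
}.freeze

def estimate_quality(word_count, media_type, fallback: false)
  length_score =
    if word_count.between?(50, 2000) then 1.0
    elsif word_count < 50 then word_count / 50.0   # too short: scale down
    else 2000.0 / word_count                       # too long: scale down
    end
  score = length_score * MEDIA_TYPE_WEIGHT.fetch(media_type, 0.5)
  score *= 0.5 if fallback # penalize fallback/placeholder conversions
  score.clamp(0.0, 1.0)
end
```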

Configuration

Ragdoll.configure do |config|
  # Enable unified text-based models
  config.use_unified_content = true

  # Text conversion settings
  config.text_conversion = {
    # Image description detail
    image_detail_level: :comprehensive,

    # Audio transcription provider
    audio_transcription_provider: :openai,

    # Fallback behavior
    enable_fallback_descriptions: true
  }

  # Single embedding model for all content
  config.embedding_model = "text-embedding-3-large"
  config.embedding_provider = :openai

  # Vision models for image descriptions
  config.vision_models = {
    primary: 'gpt-4-vision-preview',
    fallback: 'claude-3-opus'
  }

  # Audio transcription settings
  config.audio_config = {
    model: 'whisper-1',
    temperature: 0.0
  }
end

Migration from Multi-Modal

For systems migrating from the previous multi-modal architecture:

# Run migration service
migration_service = Ragdoll::MigrationService.new

# Check migration readiness
report = migration_service.create_comparison_report
puts report[:benefits]

# Migrate all documents
results = Ragdoll::MigrationService.migrate_all_documents(
  batch_size: 50,
  process_embeddings: true
)

# Validate migration
validation = migration_service.validate_migration
puts "Passed: #{validation[:passed]}/#{validation[:total_checks]} checks"

Advantages of Unified Text RAG

Simplified Architecture

  • Single content model instead of polymorphic complexity
  • One embedding pipeline for all content types
  • Unified search index

Cross-Modal Capabilities

  • Natural language queries work across all media types
  • Images findable through visual descriptions
  • Audio searchable through spoken content

Cost Effective

  • Single embedding model reduces API costs
  • No need for specialized models per media type
  • Smaller vector storage requirements

Improved Quality

  • AI-generated descriptions often more searchable than raw embeddings
  • Text provides semantic context that visual/audio embeddings miss
  • Quality scoring helps identify and improve weak content

Easier Maintenance

  • One processing pipeline to optimize
  • Consistent search behavior across all content
  • Simpler debugging and monitoring

Examples

Processing Mixed Media

# Add various document types
pdf_doc = Ragdoll.add_document(path: 'research.pdf')
image_doc = Ragdoll.add_document(path: 'diagram.png')
audio_doc = Ragdoll.add_document(path: 'lecture.mp3')
csv_doc = Ragdoll.add_document(path: 'data.csv')

# All converted to searchable text:
# - PDF: Extracted text content
# - Image: AI-generated description
# - Audio: Speech transcript
# - CSV: Structured data as readable text

# Search across all with one query
results = Ragdoll.search(query: "machine learning algorithms")
# Returns relevant content from all document types

Quality-Based Retrieval

# Search with quality filtering
high_quality = Ragdoll.search(
  query: "deep learning",
  min_quality_score: 0.8,
  limit: 10
)

# Reprocess low-quality content
low_quality_docs = Ragdoll::UnifiedDocument
  .joins(:unified_contents)
  .where('unified_contents.content_quality_score < 0.5')

low_quality_docs.each do |doc|
  Ragdoll::UnifiedDocumentManagement.new.reprocess_document(
    doc.id,
    image_detail_level: :analytical
  )
end

Best Practices

  1. Choose Appropriate Detail Levels: Use :comprehensive or :analytical for complex images
  2. Monitor Quality Scores: Regularly check and reprocess low-quality content
  3. Optimize Chunking: Adjust chunk sizes based on your search patterns
  4. Cache Conversions: Converted text is cached to avoid reprocessing
  5. Use Batch Processing: Process multiple documents together for efficiency
  6. Set Quality Thresholds: Filter search results by quality scores
  7. Regular Reprocessing: Periodically reprocess with improved models

Troubleshooting

Low Quality Scores

# Check quality distribution
stats = Ragdoll::UnifiedContent.stats
puts stats[:content_quality_distribution]

# Identify problem documents
problems = Ragdoll::UnifiedDocument
  .joins(:unified_contents)
  .where('unified_contents.content_quality_score < 0.3')
  .pluck(:location, 'unified_contents.original_media_type')

Conversion Failures

# Check failed conversions
failed = Ragdoll::UnifiedDocument.where(status: 'failed')
failed.each do |doc|
  puts "#{doc.location}: #{doc.metadata['error']}"
end

# Retry failed documents
failed.each do |doc|
  Ragdoll::UnifiedDocumentManagement.new.reprocess_document(doc.id)
end

Performance Optimization

# Batch process for efficiency
files = Dir.glob('documents/**/*')
Ragdoll::UnifiedDocumentManagement.new.batch_process_documents(
  files,
  batch_size: 10,
  parallel: true
)

# Monitor processing times
Ragdoll::UnifiedDocument.where(status: 'processed').each do |doc|
  processing_time = doc.updated_at - doc.created_at
  puts "#{doc.location}: #{processing_time}s"
end