Unified Text-Based RAG Architecture
Ragdoll has evolved from a multi-modal polymorphic architecture to a unified text-based RAG system that converts every media type to a text representation before vectorization. Because all content shares one text embedding space, a single natural-language query can retrieve documents, images, and audio, and the system needs only one content model and one embedding pipeline.
Overview
The unified text-based architecture changes how Ragdoll handles diverse content types:
- All Media → Text: Images become AI-generated descriptions, audio becomes transcripts, documents become extracted text
- Single Embedding Model: One text embedding model handles all content types
- Cross-Modal Search: Find any media type through natural language queries
- Simplified Architecture: Single content model instead of complex polymorphic relationships
- Better Retrieval: Text descriptions often contain more searchable information than raw media embeddings
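The "All Media → Text" idea can be pictured as a routing step that picks a conversion method from the file type. The following is a minimal sketch, not Ragdoll's implementation: the `CONVERSIONS` table, the extension lists, and `conversion_for` are hypothetical names chosen for illustration, while the media type and conversion method values mirror those listed in the schema.

```ruby
# Hypothetical routing table: extensions => [original_media_type, conversion_method].
# The extension lists are illustrative, not Ragdoll's actual supported set.
CONVERSIONS = {
  %w[.pdf .docx .html .md .csv .json] => [:document, :extraction],
  %w[.png .jpg .jpeg .gif]            => [:image, :description],
  %w[.mp3 .wav .m4a]                  => [:audio, :transcription],
  %w[.txt]                            => [:text, :extraction]
}.freeze

# Look up how a file would be converted to text, based on its extension.
def conversion_for(path)
  ext = File.extname(path).downcase
  _, route = CONVERSIONS.find { |extensions, _| extensions.include?(ext) }
  raise ArgumentError, "unsupported file type: #{ext.inspect}" unless route

  { original_media_type: route[0], conversion_method: route[1] }
end

puts conversion_for('diagram.png').inspect
```

Whatever the route, the output is plain text, which is what lets a single embedding model serve every content type.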
Architecture Design
Unified Content Pipeline
graph LR
    subgraph "Input Media"
        PDF[PDF/DOCX]
        IMG[Images]
        AUD[Audio]
        CSV[CSV/JSON]
    end
    subgraph "Text Conversion"
        TE[Text Extraction]
        ID[Image Description<br/>via Vision AI]
        AT[Audio Transcription<br/>via Speech AI]
        SE[Structured Extraction]
    end
    subgraph "Unified Processing"
        TC[Text Content]
        QA[Quality Assessment]
        CH[Chunking]
        EM[Text Embeddings]
    end
    subgraph "Search"
        UI[Unified Index]
        CS[Cross-Modal Search]
    end
    PDF --> TE
    IMG --> ID
    AUD --> AT
    CSV --> SE
    TE --> TC
    ID --> TC
    AT --> TC
    SE --> TC
    TC --> QA
    TC --> CH
    CH --> EM
    EM --> UI
    UI --> CS
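The chunking stage in the diagram can be sketched as a word-based splitter with overlap between neighbouring chunks. This is an illustrative assumption rather than Ragdoll's chunker: `chunk_text`, its `chunk_size` and `overlap` parameters, and their defaults are hypothetical.

```ruby
# Split text into overlapping word-based chunks.
# chunk_size and overlap are counted in words; both defaults are illustrative.
def chunk_text(text, chunk_size: 200, overlap: 20)
  unless (0...chunk_size).cover?(overlap)
    raise ArgumentError, 'overlap must be >= 0 and smaller than chunk_size'
  end

  words = text.split
  step = chunk_size - overlap
  chunks = []
  start = 0
  while start < words.length
    chunks << words[start, chunk_size].join(' ')
    break if start + chunk_size >= words.length # last chunk reached the end

    start += step
  end

  # Attach a chunk_index, mirroring the column in the embeddings table.
  chunks.each_with_index.map { |content, index| { chunk_index: index, content: content } }
end

sample = (1..10).map(&:to_s).join(' ')
chunk_text(sample, chunk_size: 4, overlap: 1).each { |chunk| puts chunk.inspect }
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of storing a few words twice.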
Database Schema
-- Unified document entity
ragdoll_unified_documents
├── title
├── location (file path/URL)
├── document_type (original format)
├── status (pending/processing/processed/failed)
└── metadata (conversion settings and results)
-- Unified text content
ragdoll_unified_contents
├── unified_document_id (foreign key)
├── content (converted text representation)
├── original_media_type (text/image/audio/document)
├── conversion_method (extraction/description/transcription)
├── content_quality_score (0.0-1.0)
├── word_count
├── character_count
├── embedding_model (single text model)
└── metadata (conversion-specific data)
-- Text embeddings only
ragdoll_embeddings
├── embeddable_type (UnifiedContent)
├── embeddable_id
├── embedding (pgvector - text embeddings only)
├── content (text chunk)
├── chunk_index
└── metadata
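To make the derived columns concrete, the sketch below builds an in-memory stand-in for a `ragdoll_unified_contents` row from a piece of converted text. `UnifiedContentRow` and `build_content_row` are hypothetical names, and counting words by whitespace splitting is an assumption about how `word_count` might be computed.

```ruby
# In-memory stand-in for a unified content row (not an ActiveRecord model).
UnifiedContentRow = Struct.new(
  :content, :original_media_type, :conversion_method,
  :word_count, :character_count,
  keyword_init: true
)

# Derive the count columns from the converted text itself.
def build_content_row(text, original_media_type:, conversion_method:)
  UnifiedContentRow.new(
    content: text,
    original_media_type: original_media_type,
    conversion_method: conversion_method,
    word_count: text.split.size,   # whitespace-separated tokens
    character_count: text.length
  )
end

row = build_content_row(
  'A flowchart diagram',
  original_media_type: 'image',
  conversion_method: 'description'
)
puts row.to_h.inspect
```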
Text Conversion Services
Document Text Extraction
Extracts text from various document formats:
# PDF text extraction
text_content = Ragdoll::TextExtractionService.extract('research.pdf')
# => "This paper presents a novel approach to..."
# CSV to readable text
csv_text = Ragdoll::TextExtractionService.extract('data.csv')
# => "name: John Smith, age: 30, city: New York..."
# Supported formats: PDF, DOCX, HTML, Markdown, CSV, JSON, XML, YAML
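The "structured data as readable text" step can be approximated with the standard library alone. The sketch below is an assumption about the general technique, not the behaviour of `Ragdoll::TextExtractionService`; `records_to_text` is a hypothetical helper that expects flat JSON objects.

```ruby
require 'json'

# Turn a JSON object, or an array of flat JSON objects, into "key: value"
# lines with one record per line. Nested values are rendered with to_s.
def records_to_text(json)
  records = JSON.parse(json)
  records = [records] unless records.is_a?(Array)

  records.map do |record|
    record.map { |key, value| "#{key}: #{value}" }.join(', ')
  end.join("\n")
end

puts records_to_text('[{"name":"John Smith","age":30,"city":"New York"}]')
```

Keeping the field names next to the values is what makes a query such as "people in New York" match the embedded text.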
Image to Text Conversion
Generates comprehensive descriptions using vision AI models:
# Image description generation
description = Ragdoll::ImageToTextService.convert(
  'diagram.png',
  detail_level: :comprehensive
)
# => "A flowchart diagram showing the machine learning pipeline with..."
# Detail levels:
# :minimal - Brief one-sentence description
# :standard - Main elements and composition
# :comprehensive - Detailed description including objects, colors, mood
# :analytical - Thorough analysis including artistic elements
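One plausible way to implement detail levels is to map each level to a different instruction for the vision model. The prompts below are invented for illustration and are not the prompts Ragdoll sends; `DETAIL_PROMPTS` and `prompt_for` are hypothetical names.

```ruby
# Hypothetical prompt per detail level, paraphrasing the level descriptions.
DETAIL_PROMPTS = {
  minimal: 'Describe this image in one sentence.',
  standard: 'Describe the main elements and composition of this image.',
  comprehensive: 'Describe this image in detail, including objects, colors, and mood.',
  analytical: 'Analyze this image thoroughly, including its artistic elements.'
}.freeze

# Return the prompt for a level, rejecting unknown levels early.
def prompt_for(detail_level = :standard)
  DETAIL_PROMPTS.fetch(detail_level.to_sym) do
    raise ArgumentError, "unknown detail level: #{detail_level.inspect}"
  end
end

puts prompt_for(:comprehensive)
```

Longer descriptions give the embedding model more to work with, but they also cost more vision-model tokens per image.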
Audio to Text Transcription
Converts speech to searchable text:
# Audio transcription
transcript = Ragdoll::AudioToTextService.transcribe('meeting.mp3')
# => "In today's meeting we discussed the Q3 results..."
# Supported providers:
# :openai - Whisper API
# :azure - Speech Services
# :google - Cloud Speech-to-Text
# :whisper_local - Local Whisper installation
Cross-Modal Search
The unified architecture enables powerful search across all media types:
# Find images by describing their content
results = Ragdoll.search(query: "architecture diagram with database symbols")
# Returns images whose AI descriptions match the query
# Search audio by spoken content
results = Ragdoll.search(query: "discussion about machine learning")
# Returns audio files whose transcripts contain these topics
# Mixed results across all media types
results = Ragdoll.search(query: "neural networks")
# Returns:
# - Text documents mentioning neural networks
# - Images with descriptions of neural network diagrams
# - Audio with transcripts discussing neural networks
# All ranked by unified relevance scoring
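Unified relevance scoring works because every chunk, whatever its original media type, lives in the same text embedding space. The sketch below ranks toy three-dimensional vectors by cosine similarity in plain Ruby. The vectors, the `rank` helper, and the in-memory list are illustrative assumptions; a real deployment would delegate this to pgvector.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |y| y * y })
  norm.zero? ? 0.0 : dot / norm
end

# Score every chunk against the query vector and keep the best matches.
def rank(query_vector, chunks, limit: 10)
  chunks
    .map { |chunk| chunk.merge(score: cosine_similarity(query_vector, chunk[:embedding])) }
    .sort_by { |chunk| -chunk[:score] }
    .first(limit)
end

# Toy embeddings: the first axis loosely stands for "neural networks".
chunks = [
  { media: :text,  content: 'Paper on neural networks',        embedding: [0.9, 0.1, 0.0] },
  { media: :image, content: 'Diagram of a neural network',     embedding: [0.8, 0.2, 0.1] },
  { media: :audio, content: 'Meeting about quarterly results', embedding: [0.0, 0.1, 0.9] }
]

rank([1.0, 0.0, 0.0], chunks, limit: 2).each do |hit|
  puts "#{hit[:media]}: #{hit[:content]} (#{hit[:score].round(3)})"
end
```

The text document and the image description outrank the unrelated audio transcript without any media-specific scoring logic.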
Content Quality Assessment
Automatic assessment of converted content quality:
document = Ragdoll::UnifiedDocument.find(id)
content = document.unified_contents.first
# Quality score (0.0 to 1.0)
puts content.content_quality_score
# Quality factors:
# - Content length (optimal: 50-2000 words)
# - Original media type (text > documents > descriptions > placeholders)
# - Conversion success (full > partial > fallback)
# Batch quality analysis
stats = Ragdoll::UnifiedContent.stats
puts stats[:content_quality_distribution]
# => { high: 150, medium: 75, low: 25 }
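The three quality factors can be combined in many ways. The sketch below shows one of them as a weighted sum. The weights, the per-category values, and the bucket thresholds are assumptions made for illustration and do not reproduce Ragdoll's scoring formula.

```ruby
# Hypothetical weights per factor value, following the orderings listed above.
MEDIA_WEIGHT = { text: 1.0, document: 0.9, description: 0.7, placeholder: 0.2 }.freeze
CONVERSION_WEIGHT = { full: 1.0, partial: 0.6, fallback: 0.3 }.freeze

# Full marks inside the 50-2000 word range, tapering off outside it.
def length_score(word_count)
  return 1.0 if (50..2000).cover?(word_count)
  return word_count / 50.0 if word_count < 50

  [2000.0 / word_count, 0.5].max
end

# Weighted sum of the three factors, rounded to two decimals (0.0 to 1.0).
def quality_score(word_count:, source:, conversion:)
  score = 0.4 * length_score(word_count) +
          0.3 * MEDIA_WEIGHT.fetch(source) +
          0.3 * CONVERSION_WEIGHT.fetch(conversion)
  score.round(2)
end

# Assumed thresholds for the high/medium/low distribution.
def bucket(score)
  if score >= 0.8 then :high
  elsif score >= 0.5 then :medium
  else :low
  end
end

scores = [
  quality_score(word_count: 500, source: :text, conversion: :full),
  quality_score(word_count: 10, source: :placeholder, conversion: :fallback)
]
puts scores.map { |score| bucket(score) }.tally.inspect
```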
Configuration
Ragdoll.configure do |config|
  # Enable unified text-based models
  config.use_unified_content = true

  # Text conversion settings
  config.text_conversion = {
    # Image description detail
    image_detail_level: :comprehensive,
    # Audio transcription provider
    audio_transcription_provider: :openai,
    # Fallback behavior
    enable_fallback_descriptions: true
  }

  # Single embedding model for all content
  config.embedding_model = "text-embedding-3-large"
  config.embedding_provider = :openai

  # Vision models for image descriptions
  config.vision_models = {
    primary: 'gpt-4-vision-preview',
    fallback: 'claude-3-opus'
  }

  # Audio transcription settings
  config.audio_config = {
    model: 'whisper-1',
    temperature: 0.0
  }
end
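A primary/fallback pair of vision models suggests a try-in-order strategy. The sketch below shows that pattern under the assumption that the fallback is used only when the primary call raises; `describe_with_fallback`, the `describer` callable, and the placeholder model names are hypothetical and stand in for a real vision API client.

```ruby
# Try each model in priority order and return the first successful description.
# `describer` is any callable taking (model, image_path); here it is a stub.
def describe_with_fallback(image_path, models:, describer:)
  errors = []
  models.each do |model|
    return { model: model, description: describer.call(model, image_path) }
  rescue StandardError => e
    errors << "#{model}: #{e.message}" # remember why this model failed
  end
  raise "all vision models failed (#{errors.join('; ')})"
end

# Stub that simulates an outage of the primary model.
stub_describer = lambda do |model, path|
  raise 'timeout' if model == 'primary-model'

  "Description of #{path} from #{model}"
end

result = describe_with_fallback(
  'diagram.png',
  models: %w[primary-model fallback-model],
  describer: stub_describer
)
puts result[:description]
```

Recording which model produced each description makes it easier to reprocess fallback output later.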
Migration from Multi-Modal
For systems migrating from the previous multi-modal architecture:
# Run migration service
migration_service = Ragdoll::MigrationService.new
# Check migration readiness
report = migration_service.create_comparison_report
puts report[:benefits]
# Migrate all documents
results = Ragdoll::MigrationService.migrate_all_documents(
  batch_size: 50,
  process_embeddings: true
)
# Validate migration
validation = migration_service.validate_migration
puts "Passed: #{validation[:passed]}/#{validation[:total_checks]} checks"
Advantages of Unified Text RAG
Simplified Architecture
- Single content model instead of polymorphic complexity
- One embedding pipeline for all content types
- Unified search index
Better Search
- Natural language queries work across all media types
- Images findable through visual descriptions
- Audio searchable through spoken content
Cost Effective
- Single embedding model reduces API costs
- No need for specialized models per media type
- Smaller vector storage requirements
Improved Quality
- AI-generated descriptions are often more searchable than raw media embeddings
- Text provides semantic context that visual/audio embeddings miss
- Quality scoring helps identify and improve weak content
Easier Maintenance
- One processing pipeline to optimize
- Consistent search behavior across all content
- Simpler debugging and monitoring
Examples
Processing Mixed Media
# Add various document types
pdf_doc = Ragdoll.add_document(path: 'research.pdf')
image_doc = Ragdoll.add_document(path: 'diagram.png')
audio_doc = Ragdoll.add_document(path: 'lecture.mp3')
csv_doc = Ragdoll.add_document(path: 'data.csv')
# All converted to searchable text:
# - PDF: Extracted text content
# - Image: AI-generated description
# - Audio: Speech transcript
# - CSV: Structured data as readable text
# Search across all with one query
results = Ragdoll.search(query: "machine learning algorithms")
# Returns relevant content from all document types
Quality-Based Retrieval
# Search with quality filtering
high_quality = Ragdoll.search(
  query: "deep learning",
  min_quality_score: 0.8,
  limit: 10
)

# Reprocess low-quality content
# (the SQL fragment uses the table name from the schema: ragdoll_unified_contents)
low_quality_docs = Ragdoll::UnifiedDocument
  .joins(:unified_contents)
  .where('ragdoll_unified_contents.content_quality_score < 0.5')

manager = Ragdoll::UnifiedDocumentManagement.new
low_quality_docs.each do |doc|
  manager.reprocess_document(
    doc.id,
    image_detail_level: :analytical
  )
end
Best Practices
- Choose Appropriate Detail Levels: Use :comprehensive or :analytical for complex images
- Monitor Quality Scores: Regularly check and reprocess low-quality content
- Optimize Chunking: Adjust chunk sizes based on your search patterns
- Cache Conversions: Converted text is cached to avoid reprocessing
- Use Batch Processing: Process multiple documents together for efficiency
- Set Quality Thresholds: Filter search results by quality scores
- Regular Reprocessing: Periodically reprocess with improved models
Troubleshooting
Low Quality Scores
# Check quality distribution
stats = Ragdoll::UnifiedContent.stats
puts stats[:content_quality_distribution]
# Identify problem documents
# (SQL fragments use the table name from the schema: ragdoll_unified_contents)
problems = Ragdoll::UnifiedDocument
  .joins(:unified_contents)
  .where('ragdoll_unified_contents.content_quality_score < 0.3')
  .pluck(:location, 'ragdoll_unified_contents.original_media_type')
Conversion Failures
# Check failed conversions
failed = Ragdoll::UnifiedDocument.where(status: 'failed')
failed.each do |doc|
  puts "#{doc.location}: #{doc.metadata['error']}"
end

# Retry failed documents
failed.each do |doc|
  Ragdoll::UnifiedDocumentManagement.new.reprocess_document(doc.id)
end
Performance Optimization
# Batch process for efficiency (skip directories matched by the glob)
files = Dir.glob('documents/**/*').select { |path| File.file?(path) }
Ragdoll::UnifiedDocumentManagement.new.batch_process_documents(
  files,
  batch_size: 10,
  parallel: true
)

# Monitor processing times
# (approximate: elapsed time between record creation and last update)
Ragdoll::UnifiedDocument.where(status: 'processed').find_each do |doc|
  processing_time = doc.updated_at - doc.created_at
  puts "#{doc.location}: #{processing_time.round(2)}s"
end
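For intuition about what batched parallel processing involves, the sketch below runs a worker block over fixed-size batches using one thread per item, and times the run with a monotonic clock. It is a generic pattern rather than the implementation behind `batch_process_documents`; `process_in_batches` and `timed` are hypothetical helpers.

```ruby
# Run the worker block over paths in batches, one thread per item in a batch.
# Results come back in input order; an exception in a worker is re-raised.
def process_in_batches(paths, batch_size: 10, &worker)
  paths.each_slice(batch_size).flat_map do |batch|
    batch.map { |path| Thread.new { worker.call(path) } }.map(&:value)
  end
end

# Measure elapsed wall-clock time with a monotonic clock.
def timed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

paths = %w[research.pdf diagram.png lecture.mp3]
results, elapsed = timed do
  # Stand-in for real conversion work.
  process_in_batches(paths, batch_size: 2) { |path| "converted #{path}" }
end
puts "#{results.length} files in #{elapsed.round(4)}s"
```

Threads help most when the work is dominated by waiting on external APIs, which is typical for vision and transcription calls. Measuring inside the worker also gives a more faithful processing time than comparing record timestamps.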