Document Processing

Ragdoll provides comprehensive document processing through a unified text-based pipeline that converts all media types (documents, images, audio, and structured data) into searchable text representations. This unified approach enables powerful cross-modal search while simplifying the architecture.

Unified Text Processing Pipeline

The document processing pipeline intelligently converts all media types to text:

  • File Format Detection: Automatic detection and routing to conversion services
  • Text Conversion: Media-specific conversion to comprehensive text representations
  • Quality Assessment: Automatic scoring of converted content quality (0.0-1.0)
  • Unified Storage: Single content model for all media types (UnifiedContent)
  • AI-Enhanced Conversion: Vision models for images, speech-to-text for audio
  • Single Embedding Model: One text embedding model for all content types
  • Cross-Modal Search: Find any media type through natural language queries (see the example below)
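
Because every media type is reduced to text, a single natural-language query can retrieve images or audio alongside ordinary documents. A minimal sketch, assuming a top-level search entry point (Ragdoll.search here is illustrative; see the API Reference for the exact interface):

# Add an image; Ragdoll converts it to a searchable text description
Ragdoll.add_document(path: 'diagrams/architecture.png')

# A plain-text query can later match the generated description.
# NOTE: Ragdoll.search is an assumed entry point for illustration.
results = Ragdoll.search(query: 'system architecture diagram')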

Supported File Types

Ragdoll supports a wide range of file formats through specialized parsers:

Text Documents

PDF Processing (pdf-reader gem)

  • Full text extraction from all pages
  • Metadata extraction (title, author, subject, creator, producer)
  • Creation and modification dates
  • Page count and page-by-page processing
  • Graceful error recovery for malformed PDFs
  • Support for password-protected PDFs

DOCX Processing (docx gem)

  • Paragraph text extraction with formatting preservation
  • Table content extraction with structure maintained
  • Core document properties (title, author, subject, description)
  • Keywords and metadata from document properties
  • Creation and modification timestamps
  • Word and paragraph count statistics

HTML and Markdown Parsing

  • Script and style tag removal for clean content
  • HTML tag stripping with whitespace normalization
  • Markdown files processed as plain text
  • File size and encoding detection
  • Content structure and readability preserved

Plain Text Handling

  • UTF-8 encoding with automatic fallback to ISO-8859-1 (see the sketch below)
  • Robust encoding detection and conversion
  • File size and encoding metadata
  • Direct content preservation without modification
  • Supports .txt, .md, and .markdown extensions
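
A minimal sketch of the encoding fallback described above; the helper name is illustrative, not Ragdoll's internal API:

def read_with_encoding_fallback(path)
  content = File.read(path, encoding: "UTF-8")
  return content if content.valid_encoding?

  # Re-read as ISO-8859-1 and transcode to UTF-8, per the fallback rule
  File.read(path, encoding: "ISO-8859-1").encode("UTF-8")
end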

Structured Data Documents

CSV Processing

# CSV files are converted to readable text format
csv_content = "name,age,city\nJohn,30,New York"
# Becomes: "name: John, age: 30, city: New York"

  • Headers extracted from the first row
  • Each row converted to key-value pairs (see the sketch below)
  • Custom CSV parser handles complex quoting
  • Encoding detection with fallback support
  • Empty cells and rows handled gracefully
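
Ragdoll uses its own CSV parser, but the row-to-text transformation can be sketched with the standard library (illustrative only):

require "csv"

# Convert each row into "header: value" pairs, matching the example above
def csv_to_text(csv_string)
  CSV.parse(csv_string, headers: true).map do |row|
    row.to_h.map { |header, value| "#{header}: #{value}" }.join(", ")
  end.join("\n")
end

csv_to_text("name,age,city\nJohn,30,New York")
# => "name: John, age: 30, city: New York"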

JSON Processing

# JSON converted to hierarchical text representation
json_content = '{"user": {"name": "John", "age": 30}}'
# Becomes structured text with proper indentation

  • Nested objects and arrays preserved
  • Long strings truncated intelligently
  • Numeric and boolean values converted
  • Error handling for malformed JSON

XML Processing

  • Tags removed, text content extracted
  • Comments and processing instructions stripped
  • Nested structure preserved as text
  • Namespace handling

YAML Processing

  • Full YAML to text conversion
  • Document structure preserved
  • Complex data types handled safely
  • Front matter extraction for markdown files

Image Documents

Supported Formats

.jpg, .jpeg, .png, .gif, .bmp, .webp, .svg, .ico, .tiff, .tif

Image to Text Conversion

  • AI-powered comprehensive descriptions using vision models
  • Multiple detail levels: minimal, standard, comprehensive, analytical
  • GPT-4 Vision, Claude 3 Opus, and Gemini Pro Vision support
  • Automatic dimension and format extraction
  • Descriptions optimized for semantic search
  • Fallback descriptions from metadata if AI is unavailable (see the sketch below)
  • Quality scoring for generated descriptions
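
When no vision model is available, a description can still be synthesized from file metadata. A minimal sketch of that fallback (the helper name and output format are assumptions):

# Hypothetical fallback: build a minimal searchable description from
# file metadata alone when vision AI is unavailable
def fallback_image_description(path, width:, height:)
  format = File.extname(path).delete(".").upcase
  "#{format} image, #{width}x#{height} pixels, file name: #{File.basename(path, '.*')}"
end

fallback_image_description("photos/team_offsite.jpg", width: 1024, height: 768)
# => "JPG image, 1024x768 pixels, file name: team_offsite"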

Audio Documents

Supported Formats

.mp3, .wav, .m4a, .flac, .ogg, .aac, .wma

Audio Processing

  • Speech-to-text transcription via multiple providers
  • OpenAI Whisper API integration, with a local Whisper installation option
  • Azure Speech Services support
  • Google Cloud Speech-to-Text support
  • Language detection and timestamp extraction
  • Speaker identification
  • Audio metadata extraction (duration, bitrate, codec)
  • Transcripts stored in AudioContent models as searchable text
  • Background job processing for long audio files

Processing Pipeline

The document processing workflow follows a structured six-stage pipeline:

1. File Upload and Validation

# File path processing
document = DocumentProcessor.create_document_from_file(
  'path/to/document.pdf',
  title: 'Custom Title',
  metadata: { source: 'import' }
)

# Upload processing (Shrine compatible)
document = DocumentProcessor.create_document_from_upload(
  uploaded_file,
  title: 'Uploaded Document',
  metadata: { user_id: 123 }
)

# Force option to bypass duplicate detection
document = DocumentProcessor.create_document_from_file(
  'path/to/document.pdf',
  title: 'Forced Duplicate',
  force: true  # Creates new document even if duplicate exists
)

Validation Steps:

  • File existence and accessibility verification
  • Duplicate detection using file hash and metadata comparison
  • File size limits (configurable)
  • Format detection via extension and MIME type
  • Permission checks for file access
  • Malware scanning (if configured)

2. Format Detection and Routing

Primary Detection Method:

def self.determine_document_type(file_path)
  case File.extname(file_path).downcase
  when ".pdf" then "pdf"
  when ".docx" then "docx"
  when ".txt" then "text"
  when ".md", ".markdown" then "markdown"
  when ".html", ".htm" then "html"
  when /\.(jpg|jpeg|png|gif|bmp|webp|svg|ico|tiff|tif)$/i then "image"
  else "text"  # Default fallback
  end
end

Secondary Detection (MIME Type):

  • Used for uploaded files without reliable extensions
  • Content-type header analysis
  • Magic number detection for binary files (see the sketch below)
  • Fallback to text processing for unknown types
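
The sketch below illustrates MIME-based routing using the marcel gem; the gem choice and helper name are assumptions, and Ragdoll's actual detection mechanism may differ:

require "marcel"  # assumption: any magic-number MIME library would do

# Route an uploaded IO by its detected content type
def determine_type_from_mime(io)
  case Marcel::MimeType.for(io)
  when "application/pdf" then "pdf"
  when %r{\Aimage/}      then "image"
  when %r{\Aaudio/}      then "audio"
  else "text"  # same fallback as the extension-based detector above
  end
end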

3. Content Extraction

Multi-Modal STI Architecture:

graph TD
    A[Document] --> B[TextContent]
    A --> C[ImageContent]
    A --> D[AudioContent]
    B --> E[Text Embeddings]
    C --> F[Image Embeddings]
    D --> G[Audio Embeddings]

Content Storage Strategy:

  • TextContent: raw text, processed text, word/character counts
  • ImageContent: AI description, dimensions, file metadata
  • AudioContent: transcript, duration, speaker info (planned)
  • Polymorphic Embeddings: linked to each content type (see the sketch below)
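
A hedged sketch of the STI layout implied by the diagram; the actual class, table, and association names in Ragdoll may differ:

require "active_record"

# Base content class; subclasses share one table via STI
class Content < ActiveRecord::Base
  self.table_name = "ragdoll_contents"
  belongs_to :document
  has_many :embeddings, as: :embeddable  # polymorphic link to embeddings
end

class TextContent  < Content; end  # raw/processed text, word counts
class ImageContent < Content; end  # AI description, dimensions
class AudioContent < Content; end  # transcript, duration (planned)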

4. Metadata Generation with LLM

AI-Powered Analysis:

# Automatic metadata generation
doc.generate_metadata!

# Generated metadata includes:
# - summary: Concise document summary
# - keywords: Extracted key terms
# - classification: Document category
# - description: Detailed description
# - tags: Topical tags

Schema Validation:

  • Document type-specific schemas
  • Required field validation
  • Format and length constraints
  • Error handling with fallback values

5. Content Chunking for Embeddings

TextChunker Integration:

# Configurable chunking strategy
config.chunking[:text][:max_tokens] = 1000
config.chunking[:text][:overlap] = 200
config.chunking[:text][:strategy] = 'sentence_boundary'

Chunking Strategies:

  • Sentence Boundary: respects sentence structure
  • Token-Based: fixed token count with overlap (see the sketch below)
  • Paragraph-Based: natural paragraph breaks
  • Semantic Chunking: content-aware splitting (planned)
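
The sliding-window idea behind token-based chunking, as a self-contained sketch (Ragdoll's TextChunker adds boundary handling on top of this):

# Split a token array into overlapping windows of at most max_tokens
def chunk_tokens(tokens, max_tokens: 1000, overlap: 200)
  raise ArgumentError, "overlap must be smaller than max_tokens" if overlap >= max_tokens

  step = max_tokens - overlap
  chunks = []
  index = 0
  while index < tokens.length
    chunks << tokens[index, max_tokens]
    index += step
  end
  chunks
end

chunk_tokens((1..2500).to_a).map(&:length)
# => [1000, 1000, 900, 100] (each chunk shares 200 tokens with its predecessor)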

6. Database Storage and Indexing

PostgreSQL Storage:

  • Documents: main document metadata and status
  • Contents: STI-based content storage
  • Embeddings: vector storage with pgvector
  • Full-text Indexes: PostgreSQL GIN indexes
  • JSON Metadata: structured metadata with indexes

Index Strategy:

-- Full-text search index
CREATE INDEX idx_documents_fulltext ON ragdoll_documents 
USING gin(to_tsvector('english', title || ' ' || 
  COALESCE(metadata->>'summary', '') || ' ' || 
  COALESCE(metadata->>'keywords', '')));

-- Vector similarity index
CREATE INDEX idx_embeddings_vector ON ragdoll_embeddings 
USING ivfflat (embedding_vector vector_cosine_ops);

-- Duplicate detection indexes
CREATE UNIQUE INDEX idx_documents_location ON ragdoll_documents (location);
CREATE INDEX idx_documents_file_hash ON ragdoll_documents
USING btree((metadata->>'file_hash'));

Duplicate Detection

Ragdoll includes sophisticated duplicate detection to prevent redundant document processing and storage:

Multi-Level Detection Strategy

Primary Detection (Exact Match):

# 1. Location-based detection
existing = Document.find_by(location: file_path)

# 2. Location + modification time for files
existing = Document.find_by(
  location: file_path,
  file_modified_at: File.mtime(file_path)
)

Secondary Detection (Content-Based):

# 3. File content hash (SHA256)
file_hash = Digest::SHA256.file(file_path).hexdigest
existing = Document.where("metadata->>'file_hash' = ?", file_hash).first

# 4. Content hash for text documents
content_hash = Digest::SHA256.hexdigest(content)
existing = Document.where("metadata->>'content_hash' = ?", content_hash).first

Tertiary Detection (Similarity-Based):

# 5. File size + metadata similarity
same_size_docs = Document.where("metadata->>'file_size' = ?", file_size.to_s)
same_size_docs.each do |doc|
  return doc if documents_are_similar?(doc, new_document)
end

Detection Criteria

File-Based Documents:

  • Exact location/path match
  • File modification time comparison
  • SHA256 file content hash
  • File size and type matching
  • Filename similarity (basename)

Web/URL Documents:

  • URL location match
  • Content hash comparison (SHA256)
  • Content length similarity (5% tolerance)
  • Title and document type matching

Metadata Comparison:

def documents_are_similar?(existing_doc, new_doc)
  # Compare basename without extension
  existing_basename = File.basename(existing_doc.location, File.extname(existing_doc.location))
  new_basename = File.basename(new_doc.location, File.extname(new_doc.location))
  return false unless existing_basename == new_basename

  # Compare content length (5% tolerance)
  if existing_doc.content.present? && new_doc.content.present?
    length_diff = (existing_doc.content.length - new_doc.content.length).abs
    max_length = [existing_doc.content.length, new_doc.content.length].max
    return false if max_length > 0 && (length_diff.to_f / max_length) > 0.05
  end

  # Compare document type and title
  return false if existing_doc.document_type != new_doc.document_type
  return false if existing_doc.title != new_doc.title

  true
end

Force Override Option

Bypassing Duplicate Detection:

# Add document with force option
result = Ragdoll.add_document(
  path: 'document.pdf',
  force: true  # Creates new document even if duplicate exists
)

# In DocumentManagement service
if force
  # Modify location to avoid unique constraint violation
  final_location = "#{location}#forced_#{Time.current.to_i}_#{SecureRandom.hex(4)}"
else
  final_location = location
end

Duplicate Detection Configuration

Duplicate Detection Settings:

config.duplicate_detection.tap do |dd|
  dd[:enabled] = true                     # Enable/disable duplicate detection
  dd[:content_similarity_threshold] = 0.95  # Content similarity threshold
  dd[:content_length_tolerance] = 0.05    # 5% content length tolerance
  dd[:check_file_hash] = true            # Enable file hash checking
  dd[:check_content_hash] = true         # Enable content hash checking
  dd[:check_metadata_similarity] = true  # Enable metadata comparison
end

Performance Optimizations

Database Indexes for Fast Lookups:

-- Primary lookup index
CREATE UNIQUE INDEX idx_documents_location ON ragdoll_documents (location);

-- Hash-based lookups
CREATE INDEX idx_documents_file_hash ON ragdoll_documents
USING btree((metadata->>'file_hash'));

CREATE INDEX idx_documents_content_hash ON ragdoll_documents
USING btree((metadata->>'content_hash'));

-- Size-based filtering
CREATE INDEX idx_documents_file_size ON ragdoll_documents 
USING btree((metadata->>'file_size'));

Efficient Detection Process:

  1. Fast Path: exact location match (unique index lookup)
  2. Hash Path: file/content hash lookup (expression index)
  3. Similarity Path: size filtering plus metadata comparison
  4. Fallback: full content analysis if needed

Use Cases and Benefits

Development Environment:

# Avoid re-processing during development
result = Ragdoll.add_document(path: 'test_document.pdf')
# Second call returns existing document ID immediately

result2 = Ragdoll.add_document(path: 'test_document.pdf')
assert_equal result[:document_id], result2[:document_id]

Production Import Scripts:

# Safe bulk import without duplicates
documents.each do |file_path|
  result = Ragdoll.add_document(path: file_path)
  puts "#{result[:duplicate] ? 'Skipped' : 'Added'}: #{file_path}"
end

Content Versioning:

# Force new version when needed
updated_result = Ragdoll.add_document(
  path: 'updated_document.pdf',
  force: true,
  metadata: { version: '2.0', previous_id: original_id }
)

Metadata Generation

Ragdoll uses AI-powered metadata extraction to enhance document searchability and organization:

LLM-Based Content Analysis

MetadataGenerator Service:

generator = Ragdoll::MetadataGenerator.new
metadata = generator.generate_for_document(document)

# Example generated metadata:
{
  "summary" => "This technical document explains the implementation...",
  "keywords" => ["API", "authentication", "security", "OAuth"],
  "classification" => "technical_documentation",
  "description" => "Comprehensive guide to API security practices",
  "tags" => ["development", "security", "best-practices"],
  "sentiment" => "neutral",
  "complexity" => "intermediate",
  "estimated_reading_time" => 15
}

Configurable LLM Models:

# Different models for different tasks
config.summarization_config[:model] = 'openai/gpt-4o'
config.keywords_config[:model] = 'anthropic/claude-3-haiku-20240307'
config.classification_config[:model] = 'openai/gpt-4o-mini'

Schema Validation

Document Type-Specific Schemas:

# Text document schema
MetadataSchemas::TEXT_SCHEMA = {
  summary: { type: :string, required: true, max_length: 500 },
  keywords: { type: :array, items: :string, max_items: 20 },
  classification: { type: :string, enum: CLASSIFICATIONS },
  description: { type: :string, max_length: 1000 },
  tags: { type: :array, items: :string, max_items: 10 }
}

# Image document schema
MetadataSchemas::IMAGE_SCHEMA = {
  description: { type: :string, required: true },
  objects_detected: { type: :array, items: :string },
  scene_type: { type: :string },
  colors: { type: :array, items: :string },
  text_content: { type: :string }  # OCR results
}

Validation Process:

errors = MetadataSchemas.validate_metadata(document_type, metadata)
if errors.any?
  Rails.logger.warn "Metadata validation errors: #{errors.join(', ')}"
  # Apply fallback values for failed fields
end

Summary Generation

Configurable Summary Strategy:

config.summarization_config.tap do |c|
  c[:enable] = true
  c[:model] = 'openai/gpt-4o'
  c[:max_length] = 300
  c[:style] = 'concise'  # concise, detailed, bullet_points
  c[:include_keywords] = true
end

Content-Aware Summarization (see the sketch below):

  • Technical Documents: focus on key concepts and procedures
  • Legal Documents: highlight important clauses and obligations
  • Academic Papers: emphasize methodology and findings
  • General Content: extract main themes and conclusions
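
A sketch of how such content-aware focus might be wired into a summarization prompt; the mapping keys and prompt wording are illustrative, not Ragdoll's internals:

# Hypothetical focus-by-classification mapping for summary prompts
SUMMARY_FOCUS = {
  "technical_documentation" => "key concepts and procedures",
  "legal"                   => "important clauses and obligations",
  "academic"                => "methodology and findings"
}.freeze

def summary_prompt(classification, content)
  focus = SUMMARY_FOCUS.fetch(classification, "main themes and conclusions")
  "Summarize the following document, focusing on #{focus}:\n\n#{content}"
end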

Keyword Extraction

Multi-Strategy Keyword Extraction:

# LLM-based extraction
llm_keywords = extract_keywords_with_llm(content)

# Statistical extraction (TF-IDF)
stats_keywords = extract_keywords_statistical(content)

# Hybrid approach combining both
final_keywords = merge_keyword_strategies(
  llm_keywords, 
  stats_keywords,
  weights: { llm: 0.7, statistical: 0.3 }
)

Keyword Quality Filtering (see the sketch below):

  • Minimum length requirements (more than 3 characters)
  • Stop word removal
  • Duplicate detection and merging
  • Relevance scoring
  • Maximum keyword limits (configurable)
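
These filters can be sketched in a few lines; the stop word list and thresholds here are illustrative, not Ragdoll defaults:

require "set"

# Abbreviated stop word list for illustration only
STOP_WORDS = %w[the and for with that this from into].to_set

def filter_keywords(keywords, max: 20)
  keywords.map { |k| k.downcase.strip }
          .reject { |k| k.length <= 3 || STOP_WORDS.include?(k) }
          .uniq
          .first(max)
end

filter_keywords(["authentication", "the", "OAuth", "security", "OAuth"])
# => ["authentication", "oauth", "security"]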

Classification and Tagging

Hierarchical Classification:

CLASSIFICATIONS = {
  'technical_documentation' => {
    'api_documentation' => ['rest', 'graphql', 'rpc'],
    'user_guides' => ['tutorial', 'how-to', 'reference'],
    'architecture' => ['design', 'patterns', 'infrastructure']
  },
  'business_documents' => {
    'contracts' => ['nda', 'service_agreement', 'license'],
    'reports' => ['financial', 'quarterly', 'analysis'],
    'procedures' => ['policy', 'workflow', 'compliance']
  }
}

Smart Tagging System:

# Auto-generated tags based on content analysis
auto_tags = [
  content_based_tags,      # From text analysis
  format_based_tags,       # From document format
  metadata_based_tags,     # From existing metadata
  context_based_tags       # From file location/name
].flatten.uniq

# User-defined tags (preserved and merged)
final_tags = (user_tags + auto_tags).uniq

Tag Confidence Scoring (see the sketch below):

  • High confidence: direct content matches
  • Medium confidence: contextual indicators
  • Low confidence: statistical correlations
  • Threshold-based filtering for quality control
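
Threshold-based filtering reduces to a one-liner over scored tags; the scores and the 0.5 cutoff are assumptions, not Ragdoll defaults:

# Keep only tags whose confidence score clears the threshold
def filter_tags_by_confidence(scored_tags, threshold: 0.5)
  scored_tags.select { |_tag, score| score >= threshold }.keys
end

filter_tags_by_confidence({ "security" => 0.9, "misc" => 0.2 })
# => ["security"]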

Background Processing

Ragdoll uses background jobs for resource-intensive processing operations:

ActiveJob Integration

Available Background Jobs:

# Text extraction job
Ragdoll::ExtractTextJob.perform_later(document_id)

# Embedding generation job
Ragdoll::GenerateEmbeddingsJob.perform_later(content_id, content_type)

# Summary generation job
Ragdoll::GenerateSummaryJob.perform_later(document_id)

# Keyword extraction job
Ragdoll::ExtractKeywordsJob.perform_later(document_id)

Job Configuration:

config.background_processing_config.tap do |c|
  c[:enable] = true
  c[:queue_name] = 'ragdoll_processing'
  c[:job_timeout] = 300.seconds
  c[:max_retry_attempts] = 3
  c[:retry_backoff] = :exponential
end

Job Queues and Workers

Queue Priority System:

# High priority: user-facing operations
queue_as :ragdoll_high_priority
queue_with_priority 10

# Medium priority: batch processing
queue_as :ragdoll_medium_priority
queue_with_priority 5

# Low priority: background optimization
queue_as :ragdoll_low_priority
queue_with_priority 1

Worker Scaling Configuration:

# Development: Single worker
bundle exec sidekiq -q ragdoll_processing

# Production: Multiple workers with priority queues
bundle exec sidekiq -q ragdoll_high_priority:3 -q ragdoll_medium_priority:2 -q ragdoll_low_priority:1

Error Handling and Retries

Retry Strategy:

class ProcessDocumentJob < ApplicationJob
  retry_on StandardError, wait: :exponentially_longer, attempts: 3
  retry_on ActiveRecord::Deadlocked, wait: 5.seconds, attempts: 3

  discard_on ActiveJob::DeserializationError
  discard_on Ragdoll::Core::UnsupportedFormatError

  def perform(document_id)
    document = Document.find(document_id)
    document.process_content!
  rescue => e
    document&.update(status: 'error', error_message: e.message)
    raise
  end
end

Error Recovery:

  • Automatic retries with exponential backoff
  • Dead letter queue for failed jobs
  • Error notification system
  • Manual job retry capabilities
  • Partial processing recovery

Progress Tracking

Job Status Monitoring:

# Document processing status
document.status  # 'pending', 'processing', 'processed', 'error'

# Detailed progress tracking
processing_info = {
  stage: 'embedding_generation',
  progress: 75,
  total_steps: 4,
  current_step: 3,
  estimated_completion: 2.minutes.from_now
}

document.update(processing_info: processing_info)

Real-time Updates:

# WebSocket integration for live progress
ActionCable.server.broadcast(
  "document_#{document.id}",
  {
    event: 'processing_update',
    progress: 50,
    message: 'Generating embeddings...'
  }
)

Scaling Considerations

Horizontal Scaling:

# Docker Compose example
services:
  ragdoll_worker_1:
    build: .
    command: bundle exec sidekiq -q ragdoll_high_priority:2
    environment:
      - REDIS_URL=redis://redis:6379/0

  ragdoll_worker_2:
    build: .
    command: bundle exec sidekiq -q ragdoll_medium_priority:3 -q ragdoll_low_priority:1
    environment:
      - REDIS_URL=redis://redis:6379/0

Resource Management:

# Memory-aware job processing
class ProcessLargeDocumentJob < ApplicationJob
  def perform(document_id)
    # Process in chunks to manage memory
    document = Document.find(document_id)

    if document.file_size > 50.megabytes
      process_in_chunks(document)
    else
      process_normally(document)
    end
  ensure
    GC.start  # Force garbage collection
  end
end

Performance Monitoring:

# Job performance metrics
class JobMetrics
  def self.track_job_performance(job_name, &block)
    start_time = Time.current
    result = block.call
    duration = Time.current - start_time

    Rails.logger.info "Job #{job_name} completed in #{duration}s"

    # Send to monitoring service
    StatsD.histogram('job.duration', duration, tags: ["job:#{job_name}"])

    result
  end
end

Configuration Options

Ragdoll provides extensive configuration options for document processing:

Chunk Size and Overlap Settings

Text Chunking Configuration:

Ragdoll::Core.configure do |config|
  config.chunking[:text].tap do |c|
    c[:max_tokens] = 1000           # Maximum tokens per chunk
    c[:overlap] = 200               # Token overlap between chunks
    c[:strategy] = 'sentence'       # 'sentence', 'paragraph', 'token'
    c[:min_chunk_size] = 100        # Minimum viable chunk size
    c[:preserve_paragraphs] = true  # Respect paragraph boundaries
    c[:split_on_headers] = true     # Split at header boundaries
  end
end

Content-Type Specific Chunking:

# PDF documents (technical content)
config.chunking[:pdf][:max_tokens] = 1500
config.chunking[:pdf][:overlap] = 300
config.chunking[:pdf][:preserve_page_breaks] = true

# HTML documents (web content)
config.chunking[:html][:max_tokens] = 800
config.chunking[:html][:overlap] = 150
config.chunking[:html][:preserve_structure] = true

# Code documents
config.chunking[:code][:max_tokens] = 2000
config.chunking[:code][:overlap] = 100
config.chunking[:code][:preserve_functions] = true

Model Selection for Metadata Generation

LLM Model Configuration:

config.ruby_llm_config.tap do |llm|
  # Primary models for different tasks
  llm[:openai][:api_key] = ENV['OPENAI_API_KEY']
  llm[:anthropic][:api_key] = ENV['ANTHROPIC_API_KEY']
  llm[:google][:api_key] = ENV['GOOGLE_API_KEY']
end

# Task-specific model assignment
config.models.tap do |m|
  m[:summarization] = 'openai/gpt-4o'           # Best for summaries
  m[:keywords] = 'anthropic/claude-3-haiku'     # Fast keyword extraction
  m[:classification] = 'openai/gpt-4o-mini'     # Cost-effective classification
  m[:description] = 'google/gemini-1.5-pro'    # Detailed descriptions
end

Embedding Model Configuration:

config.embedding_config.tap do |e|
  # Text embeddings
  e[:text][:model] = 'openai/text-embedding-3-large'
  e[:text][:dimensions] = 3072
  e[:text][:batch_size] = 100

  # Image embeddings (planned)
  e[:image][:model] = 'openai/clip-vit-large-patch14'
  e[:image][:dimensions] = 768

  # Audio embeddings (planned)
  e[:audio][:model] = 'openai/whisper-embedding-v1'
  e[:audio][:dimensions] = 1024
end

Processing Timeouts

Timeout Configuration:

config.processing_timeouts.tap do |t|
  # Per-operation timeouts
  t[:file_parsing] = 120.seconds        # File content extraction
  t[:text_extraction] = 60.seconds      # Text processing
  t[:image_analysis] = 180.seconds      # Vision AI processing
  t[:metadata_generation] = 300.seconds # LLM metadata creation
  t[:embedding_generation] = 240.seconds # Vector embedding creation

  # Document size-based scaling
  t[:scaling_factor] = 1.5              # Multiply timeout by this for large docs
  t[:large_document_threshold] = 10.megabytes
end

Background Job Timeouts:

config.background_processing_config.tap do |bg|
  bg[:job_timeout] = 600.seconds        # Maximum job execution time
  bg[:queue_timeout] = 3600.seconds     # Maximum time in queue
  bg[:retry_timeout] = 1800.seconds     # Time between retries
end

Quality Thresholds

Content Quality Filters:

config.quality_thresholds.tap do |q|
  # Minimum content requirements
  q[:min_text_length] = 50              # Minimum characters for processing
  q[:min_word_count] = 10               # Minimum words for meaningful content
  q[:max_empty_lines_ratio] = 0.5       # Maximum ratio of empty lines

  # Metadata quality requirements
  q[:min_summary_length] = 20           # Minimum summary length
  q[:max_summary_length] = 500          # Maximum summary length
  q[:min_keywords_count] = 3            # Minimum number of keywords
  q[:max_keywords_count] = 20           # Maximum number of keywords

  # Embedding quality thresholds
  q[:min_embedding_similarity] = 0.1    # Minimum similarity for relevance
  q[:duplicate_threshold] = 0.95        # Similarity threshold for duplicates
end
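
As a sketch, the minimum-content thresholds above might gate processing like so (hypothetical helper, not Ragdoll's API):

# Hypothetical pre-processing gate built on the thresholds above
def passes_quality_gate?(text, thresholds)
  lines = text.lines
  empty_ratio = lines.empty? ? 1.0 : lines.count { |l| l.strip.empty? }.fdiv(lines.size)

  text.length >= thresholds[:min_text_length] &&
    text.split.size >= thresholds[:min_word_count] &&
    empty_ratio <= thresholds[:max_empty_lines_ratio]
end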

Language Detection and Filtering:

config.language_config.tap do |l|
  l[:enabled] = true
  l[:supported_languages] = ['en', 'es', 'fr', 'de', 'it']
  l[:confidence_threshold] = 0.8         # Minimum language detection confidence
  l[:fallback_language] = 'en'          # Default when detection fails
  l[:skip_unsupported] = false          # Process unsupported languages as text
end

Advanced Processing Options

Performance Optimization:

config.performance_config.tap do |p|
  # Parallel processing
  p[:parallel_processing] = true
  p[:max_parallel_jobs] = 4
  p[:chunk_processing_batch_size] = 50

  # Memory management
  p[:memory_limit] = 2.gigabytes
  p[:gc_frequency] = 100               # GC every N operations
  p[:temp_file_cleanup] = true

  # Caching
  p[:cache_parsed_content] = true
  p[:cache_embeddings] = true
  p[:cache_ttl] = 1.hour
end

Error Handling Configuration:

config.error_handling.tap do |e|
  e[:continue_on_parse_error] = true    # Continue processing other content
  e[:retry_failed_chunks] = true       # Retry failed chunk processing
  e[:max_retry_attempts] = 3           # Maximum retry attempts
  e[:fallback_to_text] = true          # Fallback to text processing
  e[:notify_on_errors] = true          # Send error notifications
end

Environment-Specific Configuration

Development Settings:

if Rails.env.development?
  config.chunking[:text][:max_tokens] = 500    # Smaller chunks for faster processing
  config.processing_timeouts[:metadata_generation] = 60.seconds
  config.background_processing_config[:enable] = false  # Synchronous processing
end

Production Settings:

if Rails.env.production?
  config.chunking[:text][:max_tokens] = 1500   # Larger chunks for efficiency
  config.background_processing_config[:enable] = true
  config.performance_config[:parallel_processing] = true
  config.quality_thresholds[:min_text_length] = 100
end


This document is part of the Ragdoll documentation suite. For immediate help, see the Quick Start Guide or API Reference.