Metadata Schemas¶
Ragdoll implements a sophisticated metadata schema system that ensures consistent, structured metadata generation across different content types. The system uses JSON Schema-based definitions to guide LLM-powered metadata extraction and validate the resulting structured data.
Structured Content Analysis and Validation¶
The metadata schema system provides:
- Content-Type Specific Schemas: Tailored metadata structures for text, images, audio, PDFs, and multi-modal content
- LLM-Guided Generation: Schema-aware prompts ensure consistent metadata format and quality
- Validation Framework: Automatic validation of generated metadata against defined schemas
- Extensible Architecture: Easy addition of new content types and custom schema definitions
- Dual Metadata System: Separation of LLM-generated content metadata and system file metadata
- Quality Assurance: Built-in fallback strategies and error handling for robust metadata generation
Schema Architecture¶
Ragdoll uses a dual metadata architecture that separates concerns between AI-generated content insights and technical file properties:
Dual Metadata System¶
LLM-Generated Content Metadata (`metadata` column)
# Stored in document.metadata (JSON column)
{
"summary": "This research paper explores machine learning applications...",
"keywords": ["machine learning", "neural networks", "AI"],
"classification": "research",
"topics": ["artificial intelligence", "computer science"],
"sentiment": "neutral",
"complexity_level": "advanced",
"reading_time_minutes": 25,
"language": "en",
"tags": ["AI", "research", "academic"]
}
System-Generated File Metadata (`file_metadata` column)
# Stored in document.file_metadata (JSON column)
{
"file_size": 2048576,
"file_type": "pdf",
"page_count": 15,
"creation_date": "2024-01-15T10:30:00Z",
"modification_date": "2024-01-20T14:15:30Z",
"author": "Dr. Jane Smith",
"title": "Advanced ML Techniques",
"encoding": "UTF-8",
"extraction_method": "pdf-reader",
"processing_time_ms": 1250
}
Schema Separation Rationale:

1. Semantic vs. Technical: Content metadata focuses on meaning; file metadata on technical properties
2. LLM vs. System Generated: Different generation methods require different validation approaches
3. Update Patterns: Content metadata may be regenerated; file metadata is typically static
4. Search Optimization: Separate indexes serve semantic search and file-property filtering
5. Schema Evolution: Content schemas evolve with AI capabilities; file schemas remain stable
Integration Patterns:
# Combined search across both metadata types
Document.search_combined(
content_query: "machine learning", # Searches content metadata
file_filters: { file_type: 'pdf', page_count: 10..50 } # Filters file metadata
)
# Unified metadata access
document.combined_metadata # Merges both metadata types
document.searchable_content # Optimized for search indexing
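A minimal sketch of how `combined_metadata` might be implemented, assuming both JSON columns deserialize to plain Ruby hashes (the merge precedence shown is an assumption, not confirmed behavior):

# Content metadata wins on key collisions, since it is the richer,
# regenerable layer; file metadata provides the technical baseline.
def combined_metadata
  (file_metadata || {}).merge(metadata || {})
end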
Content Type Schemas¶
Ragdoll provides specialized schemas for different content types:
graph TD
A[Base Schema] --> B[Text Schema]
A --> C[Image Schema]
A --> D[Audio Schema]
A --> E[PDF Schema]
A --> F[Mixed Schema]
B --> B1["summary, keywords, classification"]
C --> C1["description, objects, scene_type"]
D --> D1["transcript_summary, speakers, content_type"]
E --> E1["document_type, structure, reading_time"]
F --> F1["content_types, cohesion_analysis"]
Schema Inheritance Pattern:
module MetadataSchemas
# Base fields common to all content types
BASE_PROPERTIES = {
summary: { type: "string", description: "Content summary" },
keywords: { type: "array", items: { type: "string" } },
classification: { type: "string" },
tags: { type: "array", items: { type: "string" } }
}.freeze
# Content-specific extensions
TEXT_EXTENSIONS = {
reading_time_minutes: { type: "integer" },
complexity_level: { type: "string", enum: %w[beginner intermediate advanced expert] },
sentiment: { type: "string", enum: %w[positive negative neutral mixed] }
}.freeze
# Combined schema
TEXT_SCHEMA = {
type: "object",
properties: BASE_PROPERTIES.merge(TEXT_EXTENSIONS),
required: %w[summary keywords classification]
}.freeze
end
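Content types resolve to their schemas through a lookup such as `MetadataSchemas.schema_for`, which is referenced throughout this page. A hypothetical sketch of that dispatcher (the exact branches and fallback are assumptions):

module MetadataSchemas
  # Map a document type to its schema constant; fall back to text.
  def self.schema_for(document_type)
    case document_type.to_s
    when "image" then IMAGE_SCHEMA
    when "audio" then AUDIO_SCHEMA
    when "pdf"   then PDF_SCHEMA
    when "mixed" then MIXED_SCHEMA
    else TEXT_SCHEMA
    end
  end
end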
Standard Metadata Fields¶
Ragdoll defines comprehensive metadata fields organized by function and content type:
Content Analysis Fields¶
Summary Generation¶
# Text content summary
summary: {
type: "string",
description: "Concise summary of the text content (2-3 sentences)",
min_length: 50,
max_length: 500,
required: true
}
# Image content summary
summary: {
type: "string",
description: "Brief summary of the image content (1 sentence)",
max_length: 200,
required: true
}
# Multi-modal summary
summary: {
type: "string",
description: "Overall summary combining all content types in the document",
max_length: 600,
required: true
}
Keyword Extraction¶
# Standard keyword field (all content types)
keywords: {
type: "array",
items: { type: "string" },
description: "Relevant keywords and phrases extracted from content",
maxItems: 10,
minItems: 3,
required: true,
validation: {
min_word_length: 3,
no_stopwords: true,
unique_only: true
}
}
# Enhanced keywords with confidence scores
keywords_enhanced: {
type: "array",
items: {
type: "object",
properties: {
term: { type: "string" },
confidence: { type: "number", minimum: 0, maximum: 1 },
category: { type: "string", enum: %w[concept entity action descriptor] }
}
},
maxItems: 15
}
Topic Classification¶
# Primary classification
classification: {
type: "string",
  enum: {  # documentation shorthand: the matching per-content-type list is selected when the schema is built
text: %w[research article blog documentation technical legal financial marketing other],
image: %w[technical diagram photo artwork chart screenshot document other],
audio: %w[educational entertainment business technical musical interview podcast other],
pdf: %w[academic business legal technical manual report presentation other]
},
required: true
}
# Detailed topics array
topics: {
type: "array",
items: { type: "string" },
description: "Main topics discussed in the document",
maxItems: 5,
validation: {
hierarchical: true, # Support parent::child topic structure
taxonomy_validation: true
}
}
# Classification confidence
classification_confidence: {
type: "number",
minimum: 0,
maximum: 1,
description: "Confidence score for the assigned classification"
}
Sentiment Analysis¶
# Basic sentiment (text content)
sentiment: {
type: "string",
enum: %w[positive negative neutral mixed],
description: "Overall sentiment of the text"
}
# Detailed sentiment analysis
sentiment_detailed: {
type: "object",
properties: {
overall: { type: "string", enum: %w[positive negative neutral mixed] },
confidence: { type: "number", minimum: 0, maximum: 1 },
emotional_tone: {
type: "array",
items: { type: "string" },
enum: %w[joy sadness anger fear disgust surprise trust anticipation]
},
subjectivity: { type: "string", enum: %w[objective subjective] }
}
}
# Mood for images and audio
mood: {
type: "string",
enum: {
image: %w[professional casual formal technical artistic dramatic serene energetic other],
audio: %w[formal casual energetic calm professional educational entertaining informative other]
},
description: "Overall mood or tone of the content"
}
Content Categorization¶
# Complexity assessment
complexity_level: {
type: "string",
enum: %w[beginner intermediate advanced expert],
description: "Complexity/difficulty level of the content",
scoring_criteria: {
beginner: "Basic concepts, simple language, introductory material",
intermediate: "Some specialized knowledge required, moderate complexity",
advanced: "Specialized knowledge required, complex concepts",
expert: "Deep expertise required, highly technical content"
}
}
# Reading time estimation
reading_time_minutes: {
type: "integer",
minimum: 1,
maximum: 600, # 10 hours max
description: "Estimated reading time in minutes",
calculation: "Based on 200-250 words per minute average reading speed"
}
# Language detection
language: {
type: "string",
pattern: "^[a-z]{2}(-[A-Z]{2})?$", # ISO 639-1 format
description: "Primary language of the content",
examples: ["en", "es", "fr", "de", "zh-CN"]
}
# User-defined tags
tags: {
type: "array",
items: { type: "string" },
description: "User-defined or AI-suggested tags for organization",
maxItems: 20,
validation: {
no_spaces: false, # Allow multi-word tags
lowercase: true,
unique_only: true
}
}
Technical Metadata¶
Processing Parameters¶
# Content processing metadata (stored in file_metadata)
processing_metadata: {
extraction_method: { type: "string" }, # "pdf-reader", "docx", "image-magick"
processing_time_ms: { type: "integer" },
embedding_model: { type: "string" },
embedding_dimensions: { type: "integer" },
chunk_count: { type: "integer" },
chunk_strategy: { type: "string" },
content_hash: { type: "string" }, # For change detection
last_processed_at: { type: "string", format: "date-time" }
}
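The `content_hash` field makes change detection cheap: reprocessing can be skipped when the hash of the extracted text is unchanged. A minimal sketch using Ruby's standard library (`extracted_text` is a placeholder for the document's raw content):

require "digest"

# Fingerprint the extracted content; compare against the stored
# content_hash before re-running extraction and embedding.
content_hash = Digest::SHA256.hexdigest(extracted_text)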
Quality Metrics¶
# Content quality assessment
quality_metrics: {
content_completeness: { type: "number", minimum: 0, maximum: 1 },
extraction_confidence: { type: "number", minimum: 0, maximum: 1 },
metadata_completeness: { type: "number", minimum: 0, maximum: 1 },
validation_score: { type: "number", minimum: 0, maximum: 1 },
overall_quality: { type: "number", minimum: 0, maximum: 1 }
}
Performance Data¶
# Performance tracking metadata
performance_data: {
file_size_bytes: { type: "integer" },
processing_duration_ms: { type: "integer" },
embedding_generation_time_ms: { type: "integer" },
metadata_generation_time_ms: { type: "integer" },
memory_usage_mb: { type: "number" },
cpu_usage_percent: { type: "number" },
api_calls_made: { type: "integer" },
cost_estimate_usd: { type: "number" }
}
Schema Validation¶
Ragdoll implements comprehensive validation to ensure metadata quality and consistency:
Validation Rules¶
Required Fields Validation¶
# Schema-based required field checking
def self.validate_metadata(document_type, metadata)
schema = schema_for(document_type)
required_fields = schema[:required] || []
errors = []
required_fields.each do |field|
unless metadata.key?(field) && !metadata[field].to_s.strip.empty?
errors << "Missing required field: #{field}"
end
end
errors
end
# Content-specific required fields (defined inside each frozen schema constant)
TEXT_SCHEMA[:required]  # => ["summary", "keywords", "classification"]
IMAGE_SCHEMA[:required] # => ["description", "summary", "scene_type", "classification"]
AUDIO_SCHEMA[:required] # => ["summary", "content_type", "classification"]
PDF_SCHEMA[:required]   # => ["summary", "document_type", "classification"]
MIXED_SCHEMA[:required] # => ["summary", "content_types", "primary_content_type", "classification"]
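Passing incomplete metadata returns the violations rather than raising, so callers decide how to respond. An illustrative call against the text schema (assuming the validator lives alongside `schema_for` on `MetadataSchemas`):

errors = MetadataSchemas.validate_metadata("text", { "summary" => "A short note." })
# => ["Missing required field: keywords", "Missing required field: classification"]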
Data Type Constraints¶
# Type validation with coercion
class MetadataValidator
def self.validate_field_type(field_name, value, field_schema)
expected_type = field_schema[:type]
case expected_type
when 'string'
validate_string_field(field_name, value, field_schema)
when 'array'
validate_array_field(field_name, value, field_schema)
when 'integer'
validate_integer_field(field_name, value, field_schema)
when 'number'
validate_number_field(field_name, value, field_schema)
when 'boolean'
validate_boolean_field(field_name, value, field_schema)
when 'object'
validate_object_field(field_name, value, field_schema)
else
["Unknown field type: #{expected_type}"]
end
end
  # NOTE: a bare `private` does not hide `def self.` methods, so helpers are
  # marked with `private_class_method`. Only the string validator is shown;
  # the array/integer/number/boolean/object validators follow the same pattern.
  private_class_method def self.validate_string_field(field_name, value, schema)
errors = []
unless value.is_a?(String)
return ["#{field_name} must be a string, got #{value.class}"]
end
# Length constraints
if schema[:minLength] && value.length < schema[:minLength]
errors << "#{field_name} must be at least #{schema[:minLength]} characters"
end
if schema[:maxLength] && value.length > schema[:maxLength]
errors << "#{field_name} must be no more than #{schema[:maxLength]} characters"
end
# Enum validation
if schema[:enum] && !schema[:enum].include?(value)
errors << "#{field_name} must be one of: #{schema[:enum].join(', ')}"
end
# Pattern validation
if schema[:pattern] && !value.match?(Regexp.new(schema[:pattern]))
errors << "#{field_name} does not match required pattern"
end
errors
end
end
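Each branch returns an array of error strings, so type checks compose cleanly with the required-field check above. For example:

MetadataValidator.validate_field_type(
  "sentiment", "optimistic",
  { type: "string", enum: %w[positive negative neutral mixed] }
)
# => ["sentiment must be one of: positive, negative, neutral, mixed"]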
Format Validation¶
# Specialized format validators
class FormatValidators
# Language code validation (ISO 639-1)
def self.validate_language_code(code)
    valid_codes = %w[en es fr de it pt ru ja ko zh ar hi] # abbreviated list for illustration
return true if valid_codes.include?(code)
return true if code.match?(/^[a-z]{2}-[A-Z]{2}$/) # en-US format
false
end
# Keyword validation
def self.validate_keywords(keywords)
errors = []
return ["Keywords must be an array"] unless keywords.is_a?(Array)
keywords.each_with_index do |keyword, index|
unless keyword.is_a?(String)
errors << "Keyword at index #{index} must be a string"
next
end
if keyword.length < 3
errors << "Keyword '#{keyword}' must be at least 3 characters"
end
if keyword.length > 50
errors << "Keyword '#{keyword}' must be no more than 50 characters"
end
if keyword.match?(/^\d+$/) # Only numbers
errors << "Keyword '#{keyword}' cannot be only numbers"
end
end
# Check for duplicates
duplicates = keywords.group_by(&:downcase).select { |k, v| v.size > 1 }.keys
if duplicates.any?
errors << "Duplicate keywords found: #{duplicates.join(', ')}"
end
errors
end
# URL validation for image sources
def self.validate_url(url)
return true if url.nil? || url.empty?
begin
uri = URI.parse(url)
uri.is_a?(URI::HTTP) || uri.is_a?(URI::HTTPS)
rescue URI::InvalidURIError
false
end
end
end
Range Constraints¶
# Numeric range validation
class RangeValidator
def self.validate_reading_time(minutes)
errors = []
unless minutes.is_a?(Integer)
return ["Reading time must be an integer"]
end
if minutes < 1
errors << "Reading time must be at least 1 minute"
end
if minutes > 600 # 10 hours
errors << "Reading time cannot exceed 600 minutes (10 hours)"
end
# Warn for unusual values
if minutes > 120 # 2 hours
errors << "Warning: Reading time of #{minutes} minutes seems unusually high"
end
errors
end
def self.validate_confidence_score(score)
return ["Confidence score must be a number"] unless score.is_a?(Numeric)
return ["Confidence score must be between 0 and 1"] unless (0..1).cover?(score)
[]
end
end
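Note how the reading-time check mixes hard failures with an advisory prefixed by "Warning:"; the error reporter below uses that prefix to downgrade severity. For example:

RangeValidator.validate_reading_time(45)       # => []
RangeValidator.validate_reading_time(180)      # => ["Warning: Reading time of 180 minutes seems unusually high"]
RangeValidator.validate_confidence_score(0.85) # => []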
Error Handling¶
Validation Error Reporting¶
class ValidationErrorReporter
def self.generate_detailed_report(document_type, metadata, errors)
{
document_type: document_type,
validation_status: errors.empty? ? 'passed' : 'failed',
error_count: errors.length,
errors: errors.map { |error| format_error(error) },
metadata_completeness: calculate_completeness(document_type, metadata),
suggestions: generate_suggestions(document_type, errors),
schema_version: get_schema_version(document_type)
}
end
  private_class_method def self.format_error(error)
{
message: error,
severity: determine_severity(error),
field: extract_field_name(error),
suggestion: suggest_fix(error)
}
end
  private_class_method def self.determine_severity(error)
case error
when /Missing required field/
'critical'
when /must be one of/
'error'
when /Warning:/
'warning'
else
'error'
end
end
end
Schema Compatibility Checking¶
class SchemaCompatibilityChecker
def self.check_compatibility(old_metadata, new_schema)
compatibility_report = {
compatible: true,
issues: [],
migration_required: false,
breaking_changes: []
}
# Check for removed required fields
old_schema = infer_schema_from_metadata(old_metadata)
new_required = new_schema[:required] || []
old_required = old_schema[:required] || []
removed_required = old_required - new_required
added_required = new_required - old_required
if removed_required.any?
compatibility_report[:issues] << {
type: 'removed_required_fields',
fields: removed_required,
impact: 'breaking_change'
}
compatibility_report[:compatible] = false
end
if added_required.any?
compatibility_report[:issues] << {
type: 'new_required_fields',
fields: added_required,
impact: 'migration_required'
}
compatibility_report[:migration_required] = true
end
compatibility_report
end
end
Fallback Strategies¶
class MetadataFallbackHandler
def self.apply_fallbacks(document_type, invalid_metadata, errors)
fallback_metadata = invalid_metadata.dup
errors.each do |error|
case error
when /Missing required field: summary/
fallback_metadata['summary'] = generate_fallback_summary(invalid_metadata)
when /Missing required field: keywords/
fallback_metadata['keywords'] = extract_fallback_keywords(invalid_metadata)
when /Missing required field: classification/
fallback_metadata['classification'] = infer_classification(document_type, invalid_metadata)
when /Invalid language code/
fallback_metadata['language'] = 'en' # Default to English
when /Reading time.*unreasonable/
fallback_metadata['reading_time_minutes'] = estimate_reading_time(invalid_metadata)
end
end
    # Re-validate with fallbacks applied (delegating to the schema validator)
    new_errors = MetadataSchemas.validate_metadata(document_type, fallback_metadata)
{
fallback_metadata: fallback_metadata,
remaining_errors: new_errors,
fallbacks_applied: errors.length - new_errors.length
}
end
  private_class_method def self.generate_fallback_summary(metadata)
# Generate basic summary from available content
if metadata['description']
metadata['description'][0..200] + (metadata['description'].length > 200 ? '...' : '')
elsif metadata['title']
"Document: #{metadata['title']}"
else
"Content summary not available"
end
end
end
Quality Thresholds¶
class QualityThresholdManager
QUALITY_THRESHOLDS = {
minimum_acceptable: 0.6,
good_quality: 0.8,
excellent_quality: 0.95
}.freeze
def self.assess_metadata_quality(document_type, metadata)
schema = MetadataSchemas.schema_for(document_type)
total_possible_fields = schema[:properties].keys.length
scores = {
completeness: calculate_completeness_score(metadata, schema),
accuracy: calculate_accuracy_score(metadata, schema),
richness: calculate_richness_score(metadata, schema),
consistency: calculate_consistency_score(metadata)
}
overall_score = (
scores[:completeness] * 0.3 +
scores[:accuracy] * 0.4 +
scores[:richness] * 0.2 +
scores[:consistency] * 0.1
)
{
overall_score: overall_score,
quality_level: determine_quality_level(overall_score),
component_scores: scores,
meets_threshold: overall_score >= QUALITY_THRESHOLDS[:minimum_acceptable],
recommendations: generate_quality_recommendations(overall_score, scores)
}
end
end
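The component weights sum to 1.0, so the overall score stays on the same 0-1 scale: completeness 0.9, accuracy 0.8, richness 0.7, and consistency 1.0 yield 0.9 × 0.3 + 0.8 × 0.4 + 0.7 × 0.2 + 1.0 × 0.1 = 0.83, comfortably above the 0.6 acceptance floor. A typical gating call (the regeneration helper is hypothetical):

report = QualityThresholdManager.assess_metadata_quality("text", document.metadata)
regenerate_metadata(document) unless report[:meets_threshold] # hypothetical caller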
Custom Schemas¶
Ragdoll supports creating custom metadata schemas for specialized content types and domain-specific requirements:
Schema Definition¶
JSON Schema Format¶
# Custom schema for legal documents
LEGAL_DOCUMENT_SCHEMA = {
type: "object",
schema_version: "1.0.0",
schema_id: "legal_document_v1",
description: "Metadata schema for legal documents and contracts",
properties: {
# Required base fields (inherited)
summary: {
type: "string",
description: "Legal document summary focusing on key provisions",
minLength: 100,
maxLength: 1000
},
# Legal-specific fields
document_type: {
type: "string",
enum: %w[contract agreement policy statute regulation ordinance brief motion other],
description: "Type of legal document"
},
jurisdiction: {
type: "string",
description: "Legal jurisdiction (e.g., 'US-CA', 'UK', 'EU')",
pattern: "^[A-Z]{2}(-[A-Z]{2})?$"
},
legal_areas: {
type: "array",
items: {
type: "string",
enum: %w[contract corporate employment intellectual_property real_estate family criminal civil other]
},
maxItems: 5,
description: "Areas of law covered by this document"
},
parties: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
role: { type: "string", enum: %w[plaintiff defendant buyer seller lessor lessee employer employee other] },
entity_type: { type: "string", enum: %w[individual corporation llc partnership government other] }
},
required: %w[name role]
},
maxItems: 10
},
effective_date: {
type: "string",
format: "date",
description: "Date when the document becomes effective"
},
expiration_date: {
type: "string",
format: "date",
description: "Date when the document expires (if applicable)"
},
key_terms: {
type: "array",
items: {
type: "object",
properties: {
term: { type: "string" },
definition: { type: "string" },
importance: { type: "string", enum: %w[critical important standard] }
},
required: %w[term definition]
},
maxItems: 20
},
compliance_requirements: {
type: "array",
items: { type: "string" },
description: "Regulatory or legal compliance requirements"
},
risk_level: {
type: "string",
enum: %w[low medium high critical],
description: "Risk assessment level for the document"
}
},
required: %w[summary document_type jurisdiction legal_areas],
# Custom validation rules
custom_validators: %w[validate_jurisdiction_format validate_date_consistency validate_party_roles],
# Schema metadata
created_by: "Legal Team",
created_at: "2024-01-15",
compatible_with: ["base_schema_v1"],
extends: "base_document_schema"
}.freeze
Field Type Definitions¶
# Extended field type system
CUSTOM_FIELD_TYPES = {
# Geographic types
"coordinates" => {
type: "object",
properties: {
latitude: { type: "number", minimum: -90, maximum: 90 },
longitude: { type: "number", minimum: -180, maximum: 180 }
},
required: %w[latitude longitude]
},
# Monetary types
"currency_amount" => {
type: "object",
properties: {
amount: { type: "number", minimum: 0 },
currency: { type: "string", pattern: "^[A-Z]{3}$" }, # ISO 4217
formatted: { type: "string" }
},
required: %w[amount currency]
},
# Person/entity types
"person" => {
type: "object",
properties: {
name: { type: "string" },
email: { type: "string", format: "email" },
role: { type: "string" },
organization: { type: "string" }
},
required: %w[name]
},
# Date range types
"date_range" => {
type: "object",
properties: {
start_date: { type: "string", format: "date" },
end_date: { type: "string", format: "date" },
duration_days: { type: "integer", minimum: 0 }
},
required: %w[start_date end_date]
}
}.freeze
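These reusable definitions are meant to be spliced into a schema's `properties` by reference rather than copied. A hypothetical composition (the constant and field names here are invented for illustration):

# Compose a schema fragment from the shared field types above.
FINANCIAL_AGREEMENT_PROPERTIES = {
  contract_value: CUSTOM_FIELD_TYPES["currency_amount"],
  signatories:    { type: "array", items: CUSTOM_FIELD_TYPES["person"], maxItems: 10 },
  term:           CUSTOM_FIELD_TYPES["date_range"]
}.freeze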
Validation Rule Specification¶
# Custom validation rules for specialized schemas
class CustomValidators
def self.validate_jurisdiction_format(value)
# US state codes, country codes, or international codes
valid_patterns = [
/^US-[A-Z]{2}$/, # US-CA, US-NY
/^[A-Z]{2}$/, # US, UK, DE
/^EU$/, # European Union
/^UN$/ # United Nations
]
return [] if valid_patterns.any? { |pattern| value.match?(pattern) }
["Invalid jurisdiction format: #{value}"]
end
def self.validate_date_consistency(metadata)
errors = []
if metadata['effective_date'] && metadata['expiration_date']
effective = Date.parse(metadata['effective_date'])
expiration = Date.parse(metadata['expiration_date'])
if effective > expiration
errors << "Effective date cannot be after expiration date"
end
if expiration < Date.current
errors << "Warning: Document appears to be expired"
end
end
errors
rescue Date::Error
["Invalid date format in date fields"]
end
def self.validate_party_roles(parties)
return [] unless parties.is_a?(Array)
errors = []
role_counts = parties.group_by { |p| p['role'] }.transform_values(&:count)
# Business logic validation
if role_counts['buyer'] && role_counts['seller']
unless role_counts['buyer'] == role_counts['seller']
errors << "Number of buyers must equal number of sellers in transaction"
end
end
    # Check for conflicting roles (each conflicting party reported once)
    parties.group_by { |p| p['name'] }.each do |name, entries|
      roles = entries.map { |p| p['role'] }.uniq
      if roles.size > 1
        errors << "Party '#{name}' has conflicting roles: #{roles.join(', ')}"
      end
    end
errors
end
end
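Each validator returns an empty array on success, matching the convention used by the schema validators above:

CustomValidators.validate_jurisdiction_format("US-CA")      # => []
CustomValidators.validate_jurisdiction_format("California") # => ["Invalid jurisdiction format: California"]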
Documentation Requirements¶
# Schema documentation template
SCHEMA_DOCUMENTATION_TEMPLATE = {
schema_info: {
name: "Legal Document Schema",
version: "1.0.0",
description: "Comprehensive metadata schema for legal documents",
use_cases: [
"Contract analysis and management",
"Legal compliance tracking",
"Document classification and search"
],
target_documents: ["contracts", "agreements", "policies", "regulations"]
},
field_documentation: {
"jurisdiction" => {
description: "Legal jurisdiction where document applies",
examples: ["US-CA", "UK", "EU"],
validation_notes: "Must follow ISO country codes or US state format",
business_impact: "Critical for determining applicable laws and regulations"
},
"parties" => {
description: "All parties involved in the legal document",
examples: [
{ name: "Acme Corp", role: "buyer", entity_type: "corporation" },
{ name: "John Smith", role: "seller", entity_type: "individual" }
],
validation_notes: "Must include name and role for each party",
business_impact: "Essential for contract management and obligation tracking"
}
},
usage_guidelines: {
best_practices: [
"Always validate jurisdiction format before processing",
"Include all parties mentioned in the document",
"Use standardized role names for consistency"
],
common_mistakes: [
"Forgetting to include all parties",
"Using non-standard jurisdiction codes",
"Inconsistent date formats"
]
}
}.freeze
Schema Registration¶
Schema Loading Mechanisms¶
class CustomSchemaRegistry
@@registered_schemas = {}
@@schema_versions = {}
def self.register_schema(schema_id, schema_definition)
validate_schema_format!(schema_definition)
version = schema_definition[:schema_version] || '1.0.0'
@@registered_schemas[schema_id] = schema_definition
@@schema_versions[schema_id] ||= []
@@schema_versions[schema_id] << version
# Register custom validators if present
if schema_definition[:custom_validators]
register_custom_validators(schema_id, schema_definition[:custom_validators])
end
Rails.logger.info "Registered custom schema: #{schema_id} v#{version}"
end
def self.load_schemas_from_directory(directory_path)
Dir.glob(File.join(directory_path, '*.rb')).each do |schema_file|
load_schema_file(schema_file)
end
end
def self.get_schema(schema_id, version: 'latest')
base_schema = @@registered_schemas[schema_id]
return nil unless base_schema
if version == 'latest'
base_schema
else
get_schema_version(schema_id, version)
end
end
  private_class_method def self.validate_schema_format!(schema)
required_keys = %w[type properties required]
missing_keys = required_keys - schema.keys.map(&:to_s)
if missing_keys.any?
raise "Schema missing required keys: #{missing_keys.join(', ')}"
end
unless schema[:type] == 'object'
raise "Schema type must be 'object', got '#{schema[:type]}'"
end
end
end
# Usage
CustomSchemaRegistry.register_schema('legal_document', LEGAL_DOCUMENT_SCHEMA)
CustomSchemaRegistry.load_schemas_from_directory(Rails.root.join('config', 'schemas'))
Version Management¶
class SchemaVersionManager
def self.create_new_version(schema_id, updates)
current_schema = CustomSchemaRegistry.get_schema(schema_id)
return nil unless current_schema
current_version = current_schema[:schema_version] || '1.0.0'
new_version = increment_version(current_version, updates)
new_schema = current_schema.deep_merge(updates).merge(
schema_version: new_version,
previous_version: current_version,
migration_guide: generate_migration_guide(current_schema, updates)
)
CustomSchemaRegistry.register_schema("#{schema_id}_v#{new_version.gsub('.', '_')}", new_schema)
new_schema
end
  def self.increment_version(current_version, updates)
    major, minor, patch = current_version.split('.').map(&:to_i)
    if updates[:breaking_changes]
      "#{major + 1}.0.0"
    elsif updates[:new_fields]&.any?
"#{major}.#{minor + 1}.0"
else
"#{major}.#{minor}.#{patch + 1}"
end
end
end
Migration Strategies¶
class SchemaMigrator
def self.migrate_metadata(old_metadata, from_schema, to_schema)
migration_plan = create_migration_plan(from_schema, to_schema)
migrated_metadata = old_metadata.dup
migration_plan[:field_mappings].each do |old_field, new_field|
if migrated_metadata.key?(old_field)
migrated_metadata[new_field] = migrated_metadata.delete(old_field)
end
end
migration_plan[:field_transformations].each do |field, transformer|
if migrated_metadata.key?(field)
migrated_metadata[field] = apply_transformation(migrated_metadata[field], transformer)
end
end
migration_plan[:default_values].each do |field, default|
migrated_metadata[field] ||= default
end
{
migrated_metadata: migrated_metadata,
migration_warnings: validate_migrated_metadata(migrated_metadata, to_schema),
migration_log: migration_plan[:log]
}
end
end
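The migrator consumes a plan with three sections: field renames, value transformations, and defaults for newly required fields. A hypothetical plan (all field names are invented, and lambdas are assumed as the transformer representation):

plan = {
  field_mappings:        { "topic" => "topics" },            # rename old fields
  field_transformations: { "topics" => ->(v) { Array(v) } }, # coerce values to the new type
  default_values:        { "language" => "en" },             # backfill new required fields
  log: []
}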
Domain-Specific Schemas¶
Ragdoll provides specialized schemas for common document domains:
Academic Papers¶
Citation Extraction Schema¶
ACADEMIC_PAPER_SCHEMA = {
type: "object",
properties: {
# Standard fields
summary: {
type: "string",
description: "Abstract or executive summary of the research"
},
# Academic-specific fields
paper_type: {
type: "string",
enum: %w[research_paper review_article conference_paper thesis dissertation preprint technical_report other],
description: "Type of academic document"
},
research_areas: {
type: "array",
items: { type: "string" },
description: "Research disciplines and fields covered",
maxItems: 5
},
authors: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
affiliation: { type: "string" },
email: { type: "string", format: "email" },
orcid: { type: "string", pattern: "^\\d{4}-\\d{4}-\\d{4}-\\d{3}[0-9X]$" },
corresponding_author: { type: "boolean" }
},
required: %w[name]
},
minItems: 1
},
publication_info: {
type: "object",
properties: {
journal: { type: "string" },
conference: { type: "string" },
volume: { type: "string" },
issue: { type: "string" },
pages: { type: "string" },
publication_date: { type: "string", format: "date" },
doi: { type: "string", pattern: "^10\\.\\d+/.+" },
isbn: { type: "string" },
publisher: { type: "string" }
}
},
citations: {
type: "array",
items: {
type: "object",
properties: {
title: { type: "string" },
authors: { type: "string" },
year: { type: "integer", minimum: 1900, maximum: 2030 },
source: { type: "string" },
doi: { type: "string" },
citation_type: { type: "string", enum: %w[foundational supportive comparative critical methodological] }
},
required: %w[title authors year]
}
},
methodology: {
type: "object",
properties: {
research_methods: {
type: "array",
items: { type: "string", enum: %w[experimental observational survey case_study meta_analysis theoretical computational qualitative quantitative mixed_methods] }
},
data_sources: { type: "array", items: { type: "string" } },
sample_size: { type: "integer", minimum: 0 },
statistical_methods: { type: "array", items: { type: "string" } }
}
},
funding: {
type: "array",
items: {
type: "object",
properties: {
agency: { type: "string" },
grant_number: { type: "string" },
amount: { type: "number" },
currency: { type: "string" }
},
required: %w[agency]
}
},
peer_review_status: {
type: "string",
enum: %w[peer_reviewed non_peer_reviewed preprint under_review],
description: "Peer review status of the publication"
},
impact_metrics: {
type: "object",
properties: {
citation_count: { type: "integer", minimum: 0 },
h_index_contribution: { type: "number" },
altmetric_score: { type: "number" },
download_count: { type: "integer", minimum: 0 }
}
}
},
required: %w[summary paper_type research_areas authors]
}.freeze
Legal Documents¶
Legal Classification Schema¶
LEGAL_DOCUMENT_SCHEMA = {
type: "object",
properties: {
# Standard fields adapted for legal content
summary: {
type: "string",
description: "Legal summary focusing on key provisions and obligations",
minLength: 100
},
# Legal-specific classification
document_type: {
type: "string",
enum: %w[contract agreement policy statute regulation ordinance brief motion pleading judgment settlement nda license other],
description: "Type of legal document"
},
jurisdiction: {
type: "object",
properties: {
primary: { type: "string", description: "Primary jurisdiction (US-CA, UK, EU)" },
additional: { type: "array", items: { type: "string" } },
court_level: { type: "string", enum: %w[federal state local municipal international] }
},
required: %w[primary]
},
legal_areas: {
type: "array",
items: {
type: "string",
enum: %w[contract corporate employment intellectual_property real_estate family criminal civil tax immigration environmental securities banking healthcare privacy data_protection other]
},
maxItems: 5
},
parties: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
role: {
type: "string",
enum: %w[plaintiff defendant buyer seller lessor lessee employer employee licensor licensee grantor grantee other]
},
entity_type: {
type: "string",
enum: %w[individual corporation llc partnership government nonprofit trust estate other]
},
representation: { type: "string", description: "Legal representation/counsel" }
},
required: %w[name role entity_type]
}
},
key_provisions: {
type: "array",
items: {
type: "object",
properties: {
provision_type: {
type: "string",
enum: %w[payment termination liability confidentiality indemnification warranty limitation_of_liability force_majeure governing_law dispute_resolution other]
},
description: { type: "string" },
critical: { type: "boolean" },
page_reference: { type: "string" }
},
required: %w[provision_type description]
}
},
compliance_requirements: {
type: "array",
items: {
type: "object",
properties: {
regulation: { type: "string" },
requirement: { type: "string" },
deadline: { type: "string", format: "date" },
responsible_party: { type: "string" },
penalty: { type: "string" }
},
required: %w[regulation requirement]
}
},
risk_analysis: {
type: "object",
properties: {
overall_risk_level: { type: "string", enum: %w[low medium high critical] },
financial_exposure: { type: "string" },
operational_risks: { type: "array", items: { type: "string" } },
legal_risks: { type: "array", items: { type: "string" } },
mitigation_strategies: { type: "array", items: { type: "string" } }
}
},
important_dates: {
type: "array",
items: {
type: "object",
properties: {
date: { type: "string", format: "date" },
description: { type: "string" },
date_type: { type: "string", enum: %w[effective_date expiration_date deadline milestone notification_date other] },
recurring: { type: "boolean" },
recurrence_pattern: { type: "string" }
},
required: %w[date description date_type]
}
}
},
required: %w[summary document_type jurisdiction legal_areas parties]
}.freeze
Technical Documentation¶
API Documentation Schema¶
TECHNICAL_DOCUMENTATION_SCHEMA = {
type: "object",
properties: {
# Standard technical fields
summary: {
type: "string",
description: "Technical summary of the documentation content"
},
documentation_type: {
type: "string",
enum: %w[api_reference user_guide developer_guide installation_guide troubleshooting architecture_document code_documentation release_notes other],
description: "Type of technical documentation"
},
# API-specific fields
api_info: {
type: "object",
properties: {
api_name: { type: "string" },
version: { type: "string", pattern: "^\\d+\\.\\d+\\.\\d+" },
base_url: { type: "string", format: "uri" },
authentication_methods: {
type: "array",
items: { type: "string", enum: %w[api_key oauth2 jwt basic_auth bearer_token none] }
},
supported_formats: {
type: "array",
items: { type: "string", enum: %w[json xml yaml csv] }
},
rate_limits: {
type: "object",
properties: {
requests_per_minute: { type: "integer" },
requests_per_hour: { type: "integer" },
requests_per_day: { type: "integer" }
}
}
}
},
endpoints: {
type: "array",
items: {
type: "object",
properties: {
path: { type: "string" },
method: { type: "string", enum: %w[GET POST PUT DELETE PATCH HEAD OPTIONS] },
description: { type: "string" },
parameters: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
type: { type: "string" },
required: { type: "boolean" },
description: { type: "string" },
example: { type: "string" }
},
required: %w[name type required]
}
},
response_codes: {
type: "array",
items: {
type: "object",
properties: {
code: { type: "integer" },
description: { type: "string" },
example: { type: "string" }
}
}
}
},
required: %w[path method description]
}
},
code_examples: {
type: "array",
items: {
type: "object",
properties: {
language: { type: "string", enum: %w[javascript python ruby java php curl bash other] },
code: { type: "string" },
description: { type: "string" },
output_example: { type: "string" }
},
required: %w[language code]
}
},
dependencies: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
version: { type: "string" },
type: { type: "string", enum: %w[runtime development build peer optional] },
purpose: { type: "string" },
license: { type: "string" }
},
required: %w[name version type]
}
},
version_info: {
type: "object",
properties: {
current_version: { type: "string" },
supported_versions: { type: "array", items: { type: "string" } },
deprecated_versions: { type: "array", items: { type: "string" } },
version_history: {
type: "array",
items: {
type: "object",
properties: {
version: { type: "string" },
release_date: { type: "string", format: "date" },
changes: { type: "array", items: { type: "string" } },
breaking_changes: { type: "boolean" }
}
}
}
}
},
technical_requirements: {
type: "object",
properties: {
minimum_versions: {
type: "object",
patternProperties: {
".*": { type: "string" } # e.g., "node": ">=14.0.0", "python": ">=3.8"
}
},
supported_platforms: {
type: "array",
items: { type: "string", enum: %w[windows macos linux docker kubernetes web mobile] }
},
hardware_requirements: {
type: "object",
properties: {
min_memory_mb: { type: "integer" },
min_disk_space_mb: { type: "integer" },
cpu_cores: { type: "integer" }
}
}
}
},
troubleshooting: {
type: "array",
items: {
type: "object",
properties: {
problem: { type: "string" },
solution: { type: "string" },
error_codes: { type: "array", items: { type: "string" } },
category: { type: "string", enum: %w[installation configuration authentication authorization performance security other] }
},
required: %w[problem solution]
}
}
},
required: %w[summary documentation_type]
}.freeze
Code Documentation Schema¶
CODE_DOCUMENTATION_SCHEMA = {
type: "object",
properties: {
summary: {
type: "string",
description: "Summary of the code functionality and purpose"
},
code_type: {
type: "string",
enum: %w[class function module library framework application script configuration other],
description: "Type of code being documented"
},
programming_language: {
type: "string",
enum: %w[javascript typescript python ruby java csharp cpp c php go rust swift kotlin scala other],
description: "Primary programming language"
},
functions_and_methods: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
description: { type: "string" },
parameters: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
type: { type: "string" },
required: { type: "boolean" },
default_value: { type: "string" },
description: { type: "string" }
}
}
},
return_type: { type: "string" },
return_description: { type: "string" },
exceptions: { type: "array", items: { type: "string" } },
complexity: { type: "string", enum: %w[O(1) O(log n) O(n) O(n log n) O(n²) O(2^n) other] }
},
required: %w[name description]
}
},
design_patterns: {
type: "array",
items: { type: "string", enum: %w[singleton factory observer strategy command adapter decorator facade mvc mvp mvvm repository unit_of_work other] }
},
testing_info: {
type: "object",
properties: {
test_coverage: { type: "number", minimum: 0, maximum: 100 },
testing_frameworks: { type: "array", items: { type: "string" } },
test_types: { type: "array", items: { type: "string", enum: %w[unit integration e2e performance security accessibility] } }
}
}
},
required: %w[summary code_type programming_language]
}.freeze
LLM Metadata Generation¶
Ragdoll uses advanced prompt engineering and LLM integration for high-quality metadata generation:
Generation Process¶
Content Analysis Pipeline¶
class Ragdoll::MetadataGenerator
  def generate_for_document(document)
    start_time = Time.current
    content_analyzer = ContentAnalyzer.new(document)
# Stage 1: Content preprocessing
preprocessed_content = content_analyzer.preprocess
# Stage 2: Schema selection
schema = MetadataSchemas.schema_for(document.document_type)
# Stage 3: Model selection
model = select_optimal_model(document, schema)
# Stage 4: Prompt generation
prompt = generate_schema_aware_prompt(preprocessed_content, schema)
# Stage 5: LLM generation
raw_metadata = call_llm_with_retry(model, prompt)
# Stage 6: Validation and cleanup
validated_metadata = validate_and_clean(raw_metadata, schema)
# Stage 7: Quality assessment
quality_score = assess_metadata_quality(validated_metadata, schema)
{
metadata: validated_metadata,
quality_score: quality_score,
model_used: model,
processing_time: Time.current - start_time
}
end
end
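A call returns the metadata alongside provenance and timing, as defined above (values illustrative):

result = Ragdoll::MetadataGenerator.new.generate_for_document(document)
result[:quality_score] # => 0.87
result[:model_used]    # => "openai/gpt-4o-mini"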
Model Selection Strategies¶
class ModelSelector
MODEL_CAPABILITIES = {
'openai/gpt-4o' => {
strengths: [:complex_analysis, :technical_content, :multilingual],
optimal_for: [:academic_papers, :legal_documents, :technical_docs],
cost: :high,
speed: :medium,
max_tokens: 128000
},
'openai/gpt-4o-mini' => {
strengths: [:general_analysis, :speed, :cost_effective],
optimal_for: [:text_documents, :simple_classification],
cost: :low,
speed: :fast,
max_tokens: 128000
},
'anthropic/claude-3-sonnet-20240229' => {
strengths: [:detailed_analysis, :accuracy, :reasoning],
optimal_for: [:complex_documents, :legal_analysis],
cost: :medium,
speed: :medium,
max_tokens: 200000
},
'anthropic/claude-3-haiku-20240307' => {
strengths: [:speed, :simple_tasks, :cost_effective],
optimal_for: [:keyword_extraction, :basic_classification],
cost: :low,
speed: :very_fast,
max_tokens: 200000
}
}.freeze
def self.select_optimal_model(document, schema)
content_complexity = analyze_content_complexity(document)
schema_complexity = analyze_schema_complexity(schema)
# Score models based on document and schema requirements
model_scores = MODEL_CAPABILITIES.map do |model, capabilities|
score = 0
# Content type optimization
if capabilities[:optimal_for].include?(document.document_type.to_sym)
score += 30
end
# Complexity matching
case content_complexity
when :high
score += 20 if capabilities[:strengths].include?(:complex_analysis)
when :medium
score += 15 if capabilities[:cost] == :medium
when :low
score += 25 if capabilities[:speed] == :fast
end
# Schema complexity matching
if schema_complexity == :high && capabilities[:strengths].include?(:detailed_analysis)
score += 15
end
# Cost optimization (configurable weight)
cost_weight = Ragdoll.config.metadata_generation[:cost_optimization_weight] || 0.2
case capabilities[:cost]
when :low then score += (20 * cost_weight)
when :medium then score += (10 * cost_weight)
when :high then score -= (10 * cost_weight)
end
[model, score]
end
# Return the highest scoring model
model_scores.max_by { |model, score| score }.first
end
end
Quality Assessment¶
class MetadataQualityAssessor
def self.assess_quality(metadata, schema, original_content)
scores = {
completeness: assess_completeness(metadata, schema),
accuracy: assess_accuracy(metadata, original_content),
consistency: assess_consistency(metadata),
relevance: assess_relevance(metadata, original_content),
specificity: assess_specificity(metadata)
}
# Weighted overall score
overall_score = (
scores[:completeness] * 0.25 +
scores[:accuracy] * 0.30 +
scores[:consistency] * 0.15 +
scores[:relevance] * 0.20 +
scores[:specificity] * 0.10
)
{
overall_score: overall_score,
component_scores: scores,
quality_level: determine_quality_level(overall_score),
recommendations: generate_improvement_recommendations(scores)
}
end
end
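As with the threshold manager, the weights sum to 1.0. For instance, completeness 0.9, accuracy 0.8, consistency 1.0, relevance 0.7, and specificity 0.6 give 0.9 × 0.25 + 0.8 × 0.30 + 1.0 × 0.15 + 0.7 × 0.20 + 0.6 × 0.10 = 0.815, so accuracy shortfalls weigh most heavily on the final grade.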
Prompt Engineering¶
Schema-Aware Prompts¶
class SchemaAwarePromptBuilder
def build_prompt(content, schema)
prompt_parts = []
# System instruction
prompt_parts << build_system_instruction(schema)
# Schema specification
prompt_parts << build_schema_specification(schema)
# Content context
prompt_parts << build_content_context(content)
# Examples (few-shot learning)
prompt_parts << build_examples(schema)
# Final instruction
prompt_parts << build_final_instruction(schema)
prompt_parts.join("\n\n")
end
private
def build_system_instruction(schema)
<<~PROMPT
You are an expert content analyst specializing in extracting structured metadata from documents.
Your task is to analyze the provided content and generate metadata that strictly follows the given JSON schema.
Key requirements:
- All required fields must be present
- Enum values must match exactly (case-sensitive)
- Array fields must not exceed maxItems limits
- Provide accurate, relevant, and specific information
- If information is not available, use appropriate defaults or null values
- Maintain consistency across related fields
PROMPT
end
def build_schema_specification(schema)
<<~PROMPT
## JSON Schema Specification
```json
#{JSON.pretty_generate(schema)}
```
Required fields: #{schema[:required]&.join(', ') || 'none'}
PROMPT
end
def build_content_context(content)
# Truncate content if too long for model context
truncated_content = truncate_for_model(content)
<<~PROMPT
## Content to Analyze
```
#{truncated_content}
```
PROMPT
end
def build_examples(schema)
example_metadata = generate_example_metadata(schema)
<<~PROMPT
## Example Output Format
```json
#{JSON.pretty_generate(example_metadata)}
```
PROMPT
end
def build_final_instruction(schema)
<<~PROMPT
## Instructions
Analyze the provided content and generate metadata following the exact schema format.
Return only valid JSON that matches the schema - no additional text or explanations.
Focus on:
#{build_focus_points(schema)}
PROMPT
end
def build_focus_points(schema)
focus_points = []
schema[:properties].each do |field, field_schema|
case field
when :summary, "summary"
focus_points << "- Create a concise, informative summary (#{field_schema[:description]})"
when :keywords, "keywords"
focus_points << "- Extract relevant, specific keywords (max #{field_schema[:maxItems] || 10})"
when :classification, "classification"
focus_points << "- Choose the most appropriate classification from: #{field_schema[:enum]&.join(', ')}"
end
end
focus_points.join("\n")
end
end
Consistency Optimization¶
class ConsistencyOptimizer
def self.optimize_for_consistency(prompt, previous_metadata = nil)
optimized_prompt = prompt.dup
if previous_metadata
# Add consistency instruction
consistency_instruction = build_consistency_instruction(previous_metadata)
optimized_prompt += "\n\n#{consistency_instruction}"
end
# Add terminology consistency rules
optimized_prompt += "\n\n#{build_terminology_rules}"
optimized_prompt
end
  private_class_method def self.build_consistency_instruction(previous_metadata)
<<~INSTRUCTION
## Consistency Requirements
Maintain consistency with previously generated metadata:
- Use similar classification categories when appropriate
- Maintain consistent keyword terminology
- Keep similar complexity levels for related content
- Use consistent language and tone in summaries
      Previous metadata reference (a sample of up to 3 entries):
      ```json
      #{JSON.pretty_generate(previous_metadata.sample(3))}
```
INSTRUCTION
end
  private_class_method def self.build_terminology_rules
<<~RULES
## Terminology Standards
- Use consistent technical terms (e.g., "API" not "api" or "Api")
- Standardize compound terms (e.g., "machine learning" not "machine-learning")
- Use full forms for important concepts (e.g., "artificial intelligence" before "AI")
- Maintain consistent capitalization in proper nouns
- Use industry-standard terminology when available
RULES
end
end
Error Reduction Techniques¶
class ErrorReductionStrategies
def self.apply_error_reduction(prompt, schema)
enhanced_prompt = prompt.dup
# Add common error prevention
enhanced_prompt += "\n\n#{build_error_prevention_rules(schema)}"
# Add validation reminders
enhanced_prompt += "\n\n#{build_validation_reminders(schema)}"
enhanced_prompt
end
  private_class_method def self.build_error_prevention_rules(schema)
rules = []
# Enum validation rules
enum_fields = schema[:properties].select { |k, v| v[:enum] }
if enum_fields.any?
rules << "CRITICAL: Enum fields must use exact values from the schema:"
enum_fields.each do |field, field_schema|
rules << "- #{field}: #{field_schema[:enum].join(' | ')}"
end
end
# Array length rules
array_fields = schema[:properties].select { |k, v| v[:type] == 'array' && v[:maxItems] }
if array_fields.any?
rules << "\nArray length limits:"
array_fields.each do |field, field_schema|
rules << "- #{field}: maximum #{field_schema[:maxItems]} items"
end
end
# Required field reminders
if schema[:required]&.any?
rules << "\nRequired fields (cannot be null or empty):"
rules << "- #{schema[:required].join(', ')}"
end
rules.join("\n")
end
  private_class_method def self.build_validation_reminders(schema)
<<~REMINDERS
## Pre-submission Checklist
Before returning your response, verify:
✓ All required fields are present and non-empty
✓ Enum values match exactly (check spelling and case)
✓ Arrays don't exceed maximum item limits
✓ JSON is valid and properly formatted
✓ Field types match schema requirements
✓ No additional fields outside the schema
REMINDERS
end
end
Multi-Language Support¶
class MultiLanguagePromptBuilder < SchemaAwarePromptBuilder
LANGUAGE_INSTRUCTIONS = {
'es' => {
system_instruction: "Eres un analista experto en contenido especializado en extraer metadatos estructurados de documentos.",
focus_keywords: "Extrae palabras clave relevantes y específicas en español",
summary_instruction: "Crea un resumen conciso e informativo en español"
},
'fr' => {
system_instruction: "Vous êtes un analyste de contenu expert spécialisé dans l'extraction de métadonnées structurées de documents.",
focus_keywords: "Extraire des mots-clés pertinents et spécifiques en français",
summary_instruction: "Créer un résumé concis et informatif en français"
},
'de' => {
system_instruction: "Sie sind ein Experte für Inhaltsanalyse, der auf die Extraktion strukturierter Metadaten aus Dokumenten spezialisiert ist.",
focus_keywords: "Extrahieren Sie relevante und spezifische Schlüsselwörter auf Deutsch",
summary_instruction: "Erstellen Sie eine prägnante und informative Zusammenfassung auf Deutsch"
}
}.freeze
def build_prompt(content, schema, language = 'en')
if language == 'en'
super(content, schema)
else
build_multilingual_prompt(content, schema, language)
end
end
private
def build_multilingual_prompt(content, schema, language)
instructions = LANGUAGE_INSTRUCTIONS[language] || LANGUAGE_INSTRUCTIONS['en']
prompt_parts = [
instructions[:system_instruction],
build_schema_specification(schema),
build_content_context(content),
build_language_specific_instructions(language, instructions),
build_final_instruction(schema)
]
prompt_parts.join("\n\n")
end
def build_language_specific_instructions(language, instructions)
<<~INSTRUCTIONS
## Language-Specific Requirements
- Content language: #{language.upcase}
- #{instructions[:summary_instruction]}
- #{instructions[:focus_keywords]}
- Maintain cultural context and terminology appropriate for #{language.upcase}
- Use native language conventions for formatting and style
INSTRUCTIONS
end
end
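Usage mirrors the base builder, with a language code appended (illustrative):

builder = MultiLanguagePromptBuilder.new
prompt = builder.build_prompt(content, MetadataSchemas::TEXT_SCHEMA, "es")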
This document is part of the Ragdoll documentation suite. For immediate help, see the Quick Start Guide or API Reference.