Duplicate Detection
Ragdoll includes a multi-level duplicate detection system that prevents redundant document processing and storage. Several detection criteria are applied in sequence to identify duplicates, with a force option to override detection when a new record is genuinely wanted.
Overview
The duplicate detection system operates on several levels, applied in order from cheapest to most expensive (a sketch of the full cascade follows the list):
- Primary Detection: Exact location and file modification time matching
- Secondary Detection: File content hashing (SHA256)
- Tertiary Detection: Hashing of the extracted text content (SHA256)
- Quaternary Detection: Metadata similarity analysis with configurable length tolerance
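Taken together, the levels form a short-circuiting cascade: each check runs only when the cheaper checks above it have found no match. A minimal sketch of the flow (the find_duplicate helper is hypothetical, not part of the Ragdoll API):
require 'digest'

# Illustrative cascade: return the first match found, cheapest check first
def find_duplicate(file_path, extracted_text)
  # Level 1: exact location match
  doc = Document.find_by(location: file_path)
  return doc if doc

  # Level 2: file content hash
  file_hash = Digest::SHA256.file(file_path).hexdigest
  doc = Document.where("metadata->>'file_hash' = ?", file_hash).first
  return doc if doc

  # Level 3: extracted text content hash
  content_hash = Digest::SHA256.hexdigest(extracted_text)
  doc = Document.where("metadata->>'content_hash' = ?", content_hash).first
  return doc if doc

  # Level 4: metadata similarity (see documents_are_similar? below)
  nil
end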
Detection Methods
Location-Based Detection
Exact Path Matching:
# Primary check - fastest detection method
existing = Document.find_by(location: file_path)
return existing if existing
Path + Modification Time:
# For file-based documents, include modification time
existing = Document.find_by(
  location: file_path,
  file_modified_at: File.mtime(file_path)
)
Hash-Based Detection
File Content Hash (SHA256):
# Most reliable method for file integrity
require 'digest'
file_hash = Digest::SHA256.file(file_path).hexdigest
existing = Document.where("metadata->>'file_hash' = ?", file_hash).first
# Automatically calculated during document processing
metadata[:file_hash] = calculate_file_hash(file_path)
Text Content Hash:
# For extracted text content
content_hash = Digest::SHA256.hexdigest(extracted_text)
existing = Document.where("metadata->>'content_hash' = ?", content_hash).first
# Added to all document types during processing
metadata[:content_hash] = calculate_content_hash(content)
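The calculate_file_hash and calculate_content_hash helpers referenced above could look roughly as follows. This is a sketch; in particular, the whitespace normalization in calculate_content_hash is an assumption rather than confirmed library behavior:
require 'digest'

def calculate_file_hash(file_path)
  # Digest::SHA256.file streams the file, so large files are not loaded into memory
  Digest::SHA256.file(file_path).hexdigest
end

def calculate_content_hash(content)
  # Assumed normalization: collapse whitespace so trivial formatting
  # differences do not produce different hashes
  Digest::SHA256.hexdigest(content.strip.gsub(/\s+/, ' '))
end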
Similarity-Based Detection
Content Length Comparison:
# 5% tolerance for minor content differences
content_length_diff = (existing_length - new_length).abs
max_length = [existing_length, new_length].max
similarity_within_tolerance = (content_length_diff.to_f / max_length) <= 0.05
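As a quick worked example of the tolerance check:
existing_length = 10_000
new_length = 10_400
diff = (existing_length - new_length).abs        # => 400
diff.to_f / [existing_length, new_length].max    # => ~0.038, within the 0.05 tolerance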
Metadata Similarity:
def documents_are_similar?(existing_doc, new_doc)
  # Compare filename (basename without extension)
  existing_basename = File.basename(existing_doc.location, File.extname(existing_doc.location))
  new_basename = File.basename(new_doc.location, File.extname(new_doc.location))
  return false unless existing_basename == new_basename

  # Compare document types
  return false if existing_doc.document_type != new_doc.document_type

  # Compare titles if available
  return false if existing_doc.title != new_doc.title

  # Content length tolerance check
  if existing_doc.content.present? && new_doc.content.present?
    length_diff = (existing_doc.content.length - new_doc.content.length).abs
    max_length = [existing_doc.content.length, new_doc.content.length].max
    return false if max_length > 0 && (length_diff.to_f / max_length) > 0.05
  end

  true
end
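A typical call site, assuming a candidate record has already been fetched and new_document is an unsaved record built from the incoming file (both names are illustrative):
candidate = Document.find_by(document_type: new_document.document_type, title: new_document.title)
return candidate if candidate && documents_are_similar?(candidate, new_document)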
Detection Process Flow
File-Based Documents
graph TD
  A[New Document] --> B{Exact Location Match?}
  B -->|Yes| C[Return Existing Document]
  B -->|No| D{File Hash Match?}
  D -->|Yes| E[Return Existing Document]
  D -->|No| F{Same File Size?}
  F -->|Yes| G[Check Metadata Similarity]
  F -->|No| H[Create New Document]
  G -->|Similar| I[Return Existing Document]
  G -->|Different| H
URL/Web Documents
graph TD
  A[New URL Document] --> B{Exact URL Match?}
  B -->|Yes| C[Return Existing Document]
  B -->|No| D{Content Hash Match?}
  D -->|Yes| E[Return Existing Document]
  D -->|No| F{Same Title & Type?}
  F -->|Yes| G[Check Content Length]
  F -->|No| H[Create New Document]
  G -->|Similar Length| I[Return Existing Document]
  G -->|Different| H
Configuration
Basic Configuration
Ragdoll.configure do |config|
  config.duplicate_detection.tap do |dd|
    dd[:enabled] = true                        # Enable duplicate detection
    dd[:content_similarity_threshold] = 0.95   # Content similarity threshold
    dd[:content_length_tolerance] = 0.05       # 5% content length tolerance
    dd[:check_file_hash] = true                # Enable file hash checking
    dd[:check_content_hash] = true             # Enable content hash checking
    dd[:check_metadata_similarity] = true      # Enable metadata comparison
  end
end
Advanced Configuration
config.duplicate_detection.tap do |dd|
  # Detection method priorities
  dd[:detection_methods] = [
    :location_match,       # Fastest - exact location
    :file_hash_match,      # Most reliable - file content
    :content_hash_match,   # Text content comparison
    :metadata_similarity   # Fuzzy matching
  ]

  # Performance tuning
  dd[:max_similarity_checks] = 100     # Limit expensive similarity checks
  dd[:hash_cache_ttl] = 1.hour         # Cache file hashes
  dd[:enable_parallel_checks] = true   # Parallel hash calculation

  # File size thresholds
  dd[:large_file_threshold] = 50.megabytes
  dd[:skip_hash_for_large_files] = false

  # Content type specific settings
  dd[:pdf_content_tolerance] = 0.02        # Stricter for PDFs
  dd[:text_content_tolerance] = 0.05       # Standard for text
  dd[:image_similarity_threshold] = 0.98   # Very strict for images
end
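The per-type tolerances above imply a lookup at comparison time. A sketch of how that dispatch might work (the tolerance_for helper and the type keys are illustrative assumptions, not confirmed internals):
def tolerance_for(document_type, dd)
  case document_type
  when 'pdf'  then dd[:pdf_content_tolerance]    # 0.02 in the example above
  when 'text' then dd[:text_content_tolerance]   # 0.05 in the example above
  else dd[:content_length_tolerance]             # general fallback
  end
end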
Force Override Option
Basic Usage
# Bypass duplicate detection completely
result = Ragdoll.add_document(
  path: 'document.pdf',
  force: true
)
# Creates new document with modified location to avoid conflicts
# Location becomes: "/path/to/document.pdf#forced_1642608000_a1b2c3d4"
Advanced Force Options
# Force with custom metadata
result = Ragdoll.add_document(
  path: 'document.pdf',
  force: true,
  metadata: {
    version: '2.0',
    reason: 'Updated content',
    previous_document_id: original_id
  }
)
# Force in DocumentManagement service
document_id = Ragdoll::DocumentManagement.add_document(
  location,
  content,
  metadata,
  force: true
)
Location Modification Strategy
When forcing duplicate creation, Ragdoll modifies the location to avoid database unique constraint violations:
if force
  # Append timestamp and random hex to make location unique
  final_location = "#{absolute_location}#forced_#{Time.current.to_i}_#{SecureRandom.hex(4)}"
else
  final_location = absolute_location
end
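The original path remains recoverable by stripping the suffix; a hypothetical helper:
def original_location(location)
  # Remove a trailing "#forced_<timestamp>_<hex>" marker, if present
  location.sub(/#forced_\d+_\h{8}\z/, '')
end

original_location('/path/to/document.pdf#forced_1642608000_a1b2c3d4')
# => "/path/to/document.pdf"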
Database Performance
Indexes for Fast Detection
-- Primary location lookup (unique constraint)
CREATE UNIQUE INDEX idx_documents_location ON ragdoll_documents (location);
-- Hash lookups via B-tree expression indexes on the JSON metadata
-- (GIN has no operator class for plain text equality; B-tree expression indexes do)
CREATE INDEX idx_documents_file_hash ON ragdoll_documents
  ((metadata->>'file_hash'));
CREATE INDEX idx_documents_content_hash ON ragdoll_documents
  ((metadata->>'content_hash'));
-- File size for similarity filtering
CREATE INDEX idx_documents_file_size ON ragdoll_documents
  USING btree ((metadata->>'file_size'));
-- Document type filtering
CREATE INDEX idx_documents_type ON ragdoll_documents (document_type);
-- File modification time for temporal checks
CREATE INDEX idx_documents_file_modified_at ON ragdoll_documents (file_modified_at);
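In a Rails application, the same indexes could be declared in a migration along these lines (a sketch; the migration class name and Rails version are illustrative):
class AddDuplicateDetectionIndexes < ActiveRecord::Migration[7.1]
  def change
    add_index :ragdoll_documents, :location, unique: true, name: 'idx_documents_location'
    # Expression indexes on JSON metadata fields, passed as raw SQL strings
    add_index :ragdoll_documents, "(metadata->>'file_hash')", name: 'idx_documents_file_hash'
    add_index :ragdoll_documents, "(metadata->>'content_hash')", name: 'idx_documents_content_hash'
    add_index :ragdoll_documents, "(metadata->>'file_size')", name: 'idx_documents_file_size'
    add_index :ragdoll_documents, :document_type, name: 'idx_documents_type'
    add_index :ragdoll_documents, :file_modified_at, name: 'idx_documents_file_modified_at'
  end
end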
Query Performance Optimization
Fast Path Queries:
# O(1) lookup via unique index
Document.find_by(location: file_path)
# O(log n) hash lookup via the B-tree expression index
Document.where("metadata->>'file_hash' = ?", file_hash).limit(1)
Efficient Similarity Queries:
# Filter by file size first to reduce comparison set
same_size_docs = Document.where("metadata->>'file_size' = ?", file_size.to_s)
                         .where(document_type: document_type)
                         .limit(50) # Reasonable limit for similarity checks
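Candidates that survive the size filter can then be run through the metadata check defined earlier (new_document here stands for the incoming record being checked):
duplicate = same_size_docs.detect { |doc| documents_are_similar?(doc, new_document) }
return duplicate if duplicate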
Use Cases and Examples
Development Environment
Avoid Re-processing During Development:
# First call processes the document
result1 = Ragdoll.add_document(path: 'test_document.pdf')
puts result1[:message] # "Document 'test_document' added successfully"
# Second call returns existing document immediately
result2 = Ragdoll.add_document(path: 'test_document.pdf')
puts result2[:message] # "Document 'test_document' already exists"
assert_equal result1[:document_id], result2[:document_id]
Production Batch Import
Safe Bulk Import Without Duplicates:
documents = Dir.glob('/import_folder/**/*.pdf')
import_stats = { added: 0, skipped: 0, errors: 0 }
documents.each do |file_path|
  begin
    result = Ragdoll.add_document(path: file_path)
    if result[:duplicate]
      import_stats[:skipped] += 1
      puts "SKIPPED: #{file_path} (duplicate of document #{result[:document_id]})"
    else
      import_stats[:added] += 1
      puts "ADDED: #{file_path} (new document #{result[:document_id]})"
    end
  rescue => e
    import_stats[:errors] += 1
    puts "ERROR: #{file_path} - #{e.message}"
  end
end
puts "\nImport Summary:"
puts "Added: #{import_stats[:added]}"
puts "Skipped: #{import_stats[:skipped]}"
puts "Errors: #{import_stats[:errors]}"
Content Versioning
Managing Document Versions:
# Original document
original = Ragdoll.add_document(
  path: 'manual_v1.pdf',
  metadata: { version: '1.0' }
)

# Updated version - force new document
updated = Ragdoll.add_document(
  path: 'manual_v2.pdf', # Same content, different name
  force: true,
  metadata: {
    version: '2.0',
    supersedes: original[:document_id],
    changelog: 'Added new chapter on advanced features'
  }
)

# Link versions in metadata
Ragdoll.update_document(
  id: original[:document_id],
  metadata: { superseded_by: updated[:document_id] }
)
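Once supersedes/superseded_by links exist, the newest version can be found by walking the chain. A sketch using the metadata keys from this example (the latest_version helper is hypothetical):
def latest_version(document)
  # Follow superseded_by links until reaching a document with no successor
  next_id = document.metadata['superseded_by']
  next_id ? latest_version(Ragdoll::Document.find(next_id)) : document
end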
Multi-Source Document Management
Handling Documents from Different Sources:
# Same document from different sources
local_result = Ragdoll.add_document(
  path: '/local/docs/whitepaper.pdf',
  metadata: { source: 'local_filesystem' }
)

# URL to the same document - detected as a duplicate via content hash
url_result = Ragdoll.add_document(
  path: 'https://example.com/whitepaper.pdf',
  metadata: { source: 'company_website' }
)

# Both return the same document_id due to content hash matching
assert_equal local_result[:document_id], url_result[:document_id]

# Force separate entries if needed for tracking
forced_url_result = Ragdoll.add_document(
  path: 'https://example.com/whitepaper.pdf',
  force: true,
  metadata: {
    source: 'company_website',
    primary_document_id: local_result[:document_id]
  }
)
Monitoring and Analytics
Duplicate Detection Metrics
# Check duplicate detection effectiveness
stats = Ragdoll.stats
puts "Total documents: #{stats[:total_documents]}"
puts "Duplicates prevented: #{stats[:duplicates_prevented]}"
puts "Duplicate detection rate: #{(stats[:duplicates_prevented].to_f / stats[:total_attempts] * 100).round(2)}%"
# Analyze duplicate patterns (documents sharing the same file hash)
duplicate_analysis = Ragdoll::Document.joins(
  "LEFT JOIN ragdoll_documents rd2 ON rd2.metadata->>'file_hash' = ragdoll_documents.metadata->>'file_hash'"
).where("rd2.id IS NOT NULL AND rd2.id != ragdoll_documents.id")
 .group("ragdoll_documents.metadata->>'file_hash'")
 .count

puts "Shared file hash analysis: #{duplicate_analysis}"
Performance Monitoring
# Monitor detection performance
require 'benchmark'

file_path = 'large_document.pdf'
detection_time = Benchmark.measure do
  Ragdoll.add_document(path: file_path)
end
puts "Duplicate detection took: #{detection_time.real}s"

# Log slow detection cases
if detection_time.real > 1.0
  Rails.logger.warn "Slow duplicate detection: #{detection_time.real}s for #{file_path}"
end
Troubleshooting
Common Issues
False Positives (Legitimate Documents Detected as Duplicates):
# Check why document was considered duplicate
result = Ragdoll.add_document(path: 'document.pdf', debug: true)
puts result[:duplicate_reason] # "file_hash_match", "content_similarity", etc.
# Force addition if false positive
result = Ragdoll.add_document(path: 'document.pdf', force: true)
Performance Issues with Large Documents:
# Temporarily skip expensive hash checks for large files
config.duplicate_detection[:skip_hash_for_large_files] = true
config.duplicate_detection[:large_file_threshold] = 10.megabytes
# Or use background processing
config.duplicate_detection[:async_hash_calculation] = true
Database Constraint Violations:
# Check for location conflicts
begin
  result = Ragdoll.add_document(path: 'document.pdf', force: true)
rescue ActiveRecord::RecordNotUnique => e
  # This shouldn't happen with proper force handling
  Rails.logger.error "Unexpected duplicate constraint violation: #{e.message}"
end
Debug Mode
# Enable detailed duplicate detection logging
Ragdoll.configure do |config|
  config.duplicate_detection[:debug] = true
  config.logging_config[:log_level] = :debug
end
# Results in detailed logs:
# DEBUG: Checking location match for /path/to/document.pdf
# DEBUG: No location match found
# DEBUG: Calculating file hash for /path/to/document.pdf
# DEBUG: File hash: a1b2c3d4e5f6...
# DEBUG: Checking for existing documents with same hash
# DEBUG: Found potential duplicate: Document ID 123
# DEBUG: Comparing metadata similarity...
# DEBUG: Documents are similar, returning existing document
This document is part of the Ragdoll documentation suite. For immediate help, see the Quick Start Guide or Document Processing.