Search & Analytics¶

Ragdoll implements a sophisticated search and analytics system that goes far beyond basic semantic similarity. The system combines vector search with usage analytics, full-text search capabilities, and intelligent ranking algorithms to provide enterprise-grade search functionality.

Overview¶

The search system consists of several integrated components:

Semantic Search: pgvector-powered similarity search with cosine distance
Usage Analytics: Frequency and recency-based ranking algorithms
Full-Text Search: PostgreSQL GIN indexes with tsquery support
Cross-Modal Search: Unified search across text, image, and audio content
Hybrid Search: Combining semantic and full-text search with keyword extraction
Keyword Extraction: Intelligent keyword extraction for enhanced search relevance
Smart Ranking: Usage patterns, similarity scores, and metadata weighting

Architecture¶

SearchEngine Core¶

class SearchEngine
  def search_similar_content(query, options = {})
    # 1. Generate query embedding
    query_vector = EmbeddingService.generate_embedding(query)

    # 2. Perform vector similarity search
    similar_embeddings = find_similar_embeddings(
      query_vector, 
      options[:similarity_threshold] || Configuration.search_similarity_threshold,
      options[:limit] || Configuration.max_search_results
    )

    # 3. Apply usage-based ranking
    ranked_results = apply_usage_ranking(similar_embeddings, options)

    # 4. Enhance with metadata and formatting
    format_search_results(ranked_results, options)
  end
end

Database Optimization¶

-- Optimized pgvector indexes for fast similarity search
CREATE INDEX idx_embeddings_vector_cosine 
ON ragdoll_embeddings 
USING ivfflat (embedding_vector vector_cosine_ops) 
WITH (lists = 100);

-- Usage analytics indexes
CREATE INDEX idx_embeddings_usage_analytics 
ON ragdoll_embeddings (usage_count DESC, returned_at DESC);

-- Polymorphic content type indexes
CREATE INDEX idx_embeddings_content_type 
ON ragdoll_embeddings (embeddable_type, embeddable_id);

-- Full-text search indexes
CREATE INDEX idx_embeddings_content_fulltext 
ON ragdoll_embeddings 
USING GIN (to_tsvector('english', content));

Semantic Search¶

Vector Similarity Search¶

# Core similarity search implementation
def find_similar_embeddings(query_vector, threshold, limit)
  Embedding.select(
    "*, " \
    "embedding_vector <=> '#{query_vector}' as distance, " \
    "1 - (embedding_vector <=> '#{query_vector}') as similarity"
  )
  .where("1 - (embedding_vector <=> '#{query_vector}') >= ?", threshold)
  .order("embedding_vector <=> '#{query_vector}'")
  .limit(limit)
end

# Usage example
results = SearchEngine.search_similar_content(
  "machine learning algorithms",
  similarity_threshold: 0.8,
  limit: 20
)

# Search across all content types
def cross_modal_search(query, content_types: nil)
  base_query = Embedding.joins(:embeddable)

  if content_types.present?
    base_query = base_query.where(embeddable_type: content_types.map(&:classify))
  end

  # Generate embeddings and search
  query_vector = EmbeddingService.generate_embedding(query)

  base_query.select(
    "ragdoll_embeddings.*, " \
    "embedding_vector <=> '#{query_vector}' as distance, " \
    "embeddable_type as content_type"
  )
  .where("1 - (embedding_vector <=> '#{query_vector}') >= ?", similarity_threshold)
  .order("embedding_vector <=> '#{query_vector}'")
end

# Search specific content types
text_results = SearchEngine.search_similar_content(
  "neural networks",
  content_types: ['text']
)

image_results = SearchEngine.search_similar_content(
  "architecture diagram", 
  content_types: ['image']
)

# Combined cross-modal search
all_results = SearchEngine.search_similar_content(
  "deep learning concepts",
  content_types: ['text', 'image', 'audio']
)

Hybrid Search¶

Ragdoll's hybrid search combines semantic vector search with keyword-based full-text search to provide more comprehensive and relevant results.

Architecture¶

# High-level hybrid search interface
results = Ragdoll::Core.hybrid_search(
  query: "neural networks and deep learning",
  semantic_weight: 0.7,    # Weight for vector similarity (0.0 - 1.0)
  text_weight: 0.3,        # Weight for full-text search (0.0 - 1.0)
  limit: 20
)

How Hybrid Search Works¶

Query Processing: The input query is processed through both semantic and lexical pipelines
Keyword Extraction: Keywords are extracted from the query using extract_keywords
Semantic Search: Vector embeddings are generated and similarity search is performed
Full-Text Search: PostgreSQL full-text search using GIN indexes
Result Combination: Results are merged with configurable weighting
Ranking: Combined scores determine final result ordering

Implementation Details¶

class Document < ActiveRecord::Base
  # Extract keywords from query string (words > 4 characters)
  def self.extract_keywords(query:)
    return [] if query.nil? || query.strip.empty?

    query.split(/\s+/)
         .map(&:strip)
         .reject(&:empty?)
         .select { |word| word.length > 4 }
  end

  # Hybrid search combining semantic and full-text approaches
  def self.hybrid_search(query, query_embedding: nil, **options)
    limit = options[:limit] || 20
    semantic_weight = options[:semantic_weight] || 0.7
    text_weight = options[:text_weight] || 0.3

    results = []

    # Get semantic search results if embedding provided
    if query_embedding
      semantic_results = embeddings_search(query_embedding, limit: limit)
      results.concat(semantic_results.map do |result|
        result.merge(
          search_type: 'semantic',
          weighted_score: result[:combined_score] * semantic_weight
        )
      end)
    end

    # Get PostgreSQL full-text search results
    text_results = search_content(query, limit: limit)
    text_results.each_with_index do |doc, index|
      score = (limit - index).to_f / limit * text_weight
      results << {
        document_id: doc.id.to_s,
        document_title: doc.title,
        content: doc.content[0..500],
        search_type: 'full_text',
        weighted_score: score,
        document: doc
      }
    end

    # Combine and deduplicate by document_id
    combined = results.group_by { |r| r[:document_id] }
                      .map do |_doc_id, doc_results|
      best_result = doc_results.max_by { |r| r[:weighted_score] }
      total_score = doc_results.sum { |r| r[:weighted_score] }
      search_types = doc_results.map { |r| r[:search_type] }.uniq

      best_result.merge(
        combined_score: total_score,
        search_types: search_types
      )
    end

    combined.sort_by { |r| -r[:combined_score] }.take(limit)
  end
end

Usage Examples¶

# Basic hybrid search
results = client.hybrid_search(
  query: "machine learning algorithms for text processing"
)

# Emphasize semantic similarity
results = client.hybrid_search(
  query: "neural network architectures",
  semantic_weight: 0.8,
  text_weight: 0.2
)

# Emphasize keyword matching
results = client.hybrid_search(
  query: "specific technical term",
  semantic_weight: 0.3,
  text_weight: 0.7
)

# Large result set for comprehensive analysis
results = client.hybrid_search(
  query: "artificial intelligence research",
  limit: 50,
  semantic_weight: 0.6,
  text_weight: 0.4
)

Keyword Extraction¶

The extract_keywords method intelligently identifies significant terms from search queries:

# Extract meaningful keywords from queries
keywords = Ragdoll::Document.extract_keywords(
  query: "machine learning algorithms neural networks"
)
# Returns: ["machine", "learning", "algorithms", "neural", "networks"]

# Handles punctuation and formatting
keywords = Ragdoll::Document.extract_keywords(
  query: "deep-learning, artificial-intelligence!"
)  
# Returns: ["deep-learning,", "artificial-intelligence!"]

# Filters short words automatically
keywords = Ragdoll::Document.extract_keywords(
  query: "AI and ML for big data analysis"
)
# Returns: ["analysis"] # Only words > 4 characters

Best Practices¶

Query Optimization: - Use descriptive queries with both specific terms and conceptual language - For technical searches, use higher text weights to prioritize exact matches - For conceptual searches, use higher semantic weights for broader relevance

Weight Configuration: - Semantic-Heavy (0.8/0.2): Conceptual searches, exploratory research - Balanced (0.6/0.4): General purpose searches, mixed content - Text-Heavy (0.3/0.7): Technical documentation, specific term searches

Performance Considerations: - Hybrid search is more computationally expensive than single-mode search - Consider caching frequently used query patterns - Use appropriate limits to balance comprehensiveness with performance

## Usage Analytics

### Analytics Data Collection

```ruby
class Embedding < ApplicationRecord
  # Automatic usage tracking
  after_find :track_usage, if: :search_context?

  def track_usage
    increment!(:usage_count)
    update_column(:returned_at, Time.current)

    # Record detailed analytics
    SearchAnalytics.record_access(
      embedding: self,
      query: current_search_query,
      user_id: current_user_id,
      session_id: current_session_id,
      timestamp: Time.current
    )
  end

  # Usage-based scoring
  def usage_score
    recency_score = calculate_recency_score
    frequency_score = calculate_frequency_score

    (recency_score * 0.3) + (frequency_score * 0.7)
  end

  private

  def calculate_recency_score
    return 0 unless returned_at

    days_since = (Time.current - returned_at) / 1.day
    Math.exp(-days_since / 30.0)  # Exponential decay over 30 days
  end

  def calculate_frequency_score
    return 0 if usage_count.zero?

    # Logarithmic scaling for usage frequency
    Math.log(usage_count + 1) / Math.log(100)  # Max score at 100 uses
  end
end

Smart Ranking Algorithm¶

class SearchRanker
  def rank_results(embeddings, options = {})
    embeddings.map do |embedding|
      {
        embedding: embedding,
        similarity_score: embedding.similarity,
        usage_score: embedding.usage_score,
        metadata_score: calculate_metadata_score(embedding, options),
        composite_score: calculate_composite_score(embedding, options)
      }
    end.sort_by { |result| -result[:composite_score] }
  end

  private

  def calculate_composite_score(embedding, options)
    weights = options[:ranking_weights] || default_weights

    similarity_weight = weights[:similarity] || 0.6
    usage_weight = weights[:usage] || 0.3
    metadata_weight = weights[:metadata] || 0.1

    (embedding.similarity * similarity_weight) +
    (embedding.usage_score * usage_weight) +
    (calculate_metadata_score(embedding, options) * metadata_weight)
  end

  def calculate_metadata_score(embedding, options)
    document = embedding.embeddable.document
    score = 0.0

    # Boost recent documents
    if document.created_at > 30.days.ago
      score += 0.2
    end

    # Boost documents with rich metadata
    if document.metadata&.key?('classification')
      score += 0.1
    end

    # Classification matching
    if options[:preferred_classification] && 
       document.metadata&.dig('classification') == options[:preferred_classification]
      score += 0.3
    end

    # Topic relevance scoring
    if options[:topic_filters].present? && document.metadata&.key?('topics')
      overlap = (options[:topic_filters] & document.metadata['topics']).size
      score += (overlap.to_f / options[:topic_filters].size) * 0.4
    end

    [score, 1.0].min  # Cap at 1.0
  end
end

Full-Text Search Integration¶

PostgreSQL Full-Text Search¶

class HybridSearchEngine
  def hybrid_search(query, semantic_weight: 0.7, text_weight: 0.3, options: {})
    # 1. Semantic search results
    semantic_results = semantic_search(query, options)

    # 2. Full-text search results
    fulltext_results = fulltext_search(query, options)

    # 3. Combine and weight results
    combine_search_results(
      semantic_results, 
      fulltext_results,
      semantic_weight: semantic_weight,
      text_weight: text_weight
    )
  end

  private

  def fulltext_search(query, options)
    # PostgreSQL full-text search with ranking
    Embedding.select(
      "*, " \
      "ts_rank(to_tsvector('english', content), plainto_tsquery('english', ?)) as text_rank"
    )
    .where("to_tsvector('english', content) @@ plainto_tsquery('english', ?)", query)
    .where("ts_rank(to_tsvector('english', content), plainto_tsquery('english', ?)) > 0.1", query)
    .order("text_rank DESC")
    .limit(options[:limit] || 50)
  end

  def combine_search_results(semantic_results, fulltext_results, weights)
    all_results = {}

    # Add semantic results with weighting
    semantic_results.each do |result|
      all_results[result.id] = {
        embedding: result,
        semantic_score: result.similarity * weights[:semantic_weight],
        text_score: 0,
        composite_score: result.similarity * weights[:semantic_weight]
      }
    end

    # Add/enhance with full-text results
    fulltext_results.each do |result|
      if all_results[result.id]
        # Enhance existing result
        all_results[result.id][:text_score] = result.text_rank * weights[:text_weight]
        all_results[result.id][:composite_score] += result.text_rank * weights[:text_weight]
      else
        # Add new result
        all_results[result.id] = {
          embedding: result,
          semantic_score: 0,
          text_score: result.text_rank * weights[:text_weight],
          composite_score: result.text_rank * weights[:text_weight]
        }
      end
    end

    # Sort by composite score and apply usage ranking
    ranked_results = all_results.values
                               .sort_by { |r| -r[:composite_score] }
                               .first(weights[:limit] || 20)

    apply_usage_ranking(ranked_results)
  end
end

Advanced Search Features¶

Faceted Search¶

class FacetedSearch
  def search_with_facets(query, facets: {})
    base_results = SearchEngine.search_similar_content(query)

    # Apply facet filters
    filtered_results = apply_facet_filters(base_results, facets)

    # Calculate facet counts for UI
    facet_counts = calculate_facet_counts(base_results, facets.keys)

    {
      results: filtered_results,
      facets: facet_counts,
      total_count: filtered_results.size
    }
  end

  private

  def apply_facet_filters(results, facets)
    filtered = results.includes(embeddable: :document)

    facets.each do |facet_type, values|
      case facet_type
      when :content_type
        filtered = filtered.where(embeddable_type: values.map(&:classify))
      when :classification
        filtered = filtered.joins(embeddable: :document)
                          .where("ragdoll_documents.metadata ->> 'classification' IN (?)", values)
      when :topics
        filtered = filtered.joins(embeddable: :document)
                          .where("ragdoll_documents.metadata -> 'topics' ?| array[?]", values)
      when :date_range
        start_date, end_date = values
        filtered = filtered.joins(embeddable: :document)
                          .where(ragdoll_documents: { created_at: start_date..end_date })
      end
    end

    filtered
  end
end

# Usage example
results = FacetedSearch.search_with_facets(
  "machine learning",
  facets: {
    content_type: ['text', 'image'],
    classification: ['research', 'technical'],
    topics: ['artificial intelligence', 'neural networks'],
    date_range: [1.year.ago, Time.current]
  }
)

Search Suggestions and Autocomplete¶

class SearchSuggestions
  def suggest_queries(partial_query, limit: 10)
    # 1. Historical query suggestions
    historical = suggest_from_history(partial_query, limit: limit / 2)

    # 2. Content-based suggestions  
    content_based = suggest_from_content(partial_query, limit: limit / 2)

    # 3. Combine and rank suggestions
    combine_suggestions(historical, content_based).first(limit)
  end

  private

  def suggest_from_history(partial_query, limit:)
    SearchAnalytics.where("query ILIKE ?", "#{partial_query}%")
                   .group(:query)
                   .order("COUNT(*) DESC")
                   .limit(limit)
                   .pluck(:query)
  end

  def suggest_from_content(partial_query, limit:)
    # Extract common phrases from content
    Embedding.where("content ILIKE ?", "%#{partial_query}%")
             .select("regexp_split_to_table(content, E'\\\\s+') as word")
             .where("word ILIKE ?", "#{partial_query}%")
             .group("word")
             .order("COUNT(*) DESC")
             .limit(limit)
             .pluck("word")
  end
end

Search Tracking System¶

Ragdoll includes a comprehensive search tracking system that automatically records all searches performed through the platform. This provides valuable insights into user behavior, search patterns, and system performance.

Database Schema¶

The search tracking system uses two main tables:

ragdoll_searches - Stores search queries and metadata
ragdoll_search_results - Junction table linking searches to returned embeddings

# Search record with vector support for similarity analysis
class Ragdoll::Search < ActiveRecord::Base
  has_many :search_results, dependent: :destroy
  has_many :embeddings, through: :search_results

  # Vector similarity support for finding similar searches
  has_neighbors :query_embedding

  # Automatic tracking is enabled by default
  def self.record_search(query:, query_embedding:, results:, **options)
    search = create!(
      query: query,
      query_embedding: query_embedding,
      search_type: options[:search_type] || 'semantic',
      results_count: results.size,
      search_filters: options[:filters],
      search_options: options[:options],
      execution_time_ms: options[:execution_time_ms],
      session_id: options[:session_id],
      user_id: options[:user_id]
    )

    # Record individual results with similarity scores
    results.each_with_index do |result, index|
      search.search_results.create!(
        embedding_id: result[:embedding_id],
        similarity_score: result[:similarity],
        result_rank: index + 1
      )
    end

    # Calculate and store similarity statistics
    search.calculate_similarity_stats!
    search
  end
end

Automatic Search Tracking¶

All searches are automatically tracked unless explicitly disabled:

# Search with automatic tracking (default)
results = client.search(
  query: "machine learning algorithms",
  session_id: "user_session_123",
  user_id: "user_456"
)

# Disable tracking for specific searches
results = client.search(
  query: "private query",
  track_search: false  # Disables tracking for this search
)

# Hybrid search with tracking
results = client.hybrid_search(
  query: "neural networks",
  session_id: "session_789",
  user_id: "user_456"
)

Search Result Tracking¶

Each search result is tracked with detailed metrics:

class Ragdoll::SearchResult < ActiveRecord::Base
  belongs_to :search
  belongs_to :embedding

  # Track user interactions
  def mark_as_clicked!
    update!(clicked: true, clicked_at: Time.current)
  end

  # Analytics methods
  scope :clicked, -> { where(clicked: true) }
  scope :unclicked, -> { where(clicked: false) }
  scope :high_similarity, ->(threshold) { where('similarity_score >= ?', threshold) }
  scope :by_rank, -> { order(result_rank: :asc) }
end

Finding Similar Searches¶

The system can identify similar searches using vector similarity:

# Find searches similar to a given query
similar_searches = Ragdoll::Search.find_similar(
  query_embedding,
  limit: 10,
  threshold: 0.8
)

# Find searches similar to an existing search
search = Ragdoll::Search.first
similar = search.nearest_neighbors(:query_embedding, distance: :cosine).limit(5)

# Analyze search patterns
similar.each do |similar_search|
  puts "Query: #{similar_search.query}"
  puts "Similarity: #{similar_search.neighbor_distance}"
  puts "Results: #{similar_search.results_count}"
  puts "CTR: #{similar_search.click_through_rate}%"
end

Search Performance Metrics¶

Track execution time and performance characteristics:

# Search records include performance metrics
search = Ragdoll::Search.last
puts "Query: #{search.query}"
puts "Execution time: #{search.execution_time_ms}ms"
puts "Results returned: #{search.results_count}"
puts "Avg similarity: #{search.avg_similarity_score}"
puts "Search type: #{search.search_type}"

# Analyze slow searches
slow_searches = Ragdoll::Search.slow_searches(threshold_ms: 1000)
slow_searches.each do |search|
  puts "Slow query: #{search.query} (#{search.execution_time_ms}ms)"
end

Click-Through Rate (CTR) Analysis¶

Monitor which searches lead to user engagement:

# Calculate CTR for individual searches
search = Ragdoll::Search.find(id)
ctr = search.click_through_rate
puts "CTR: #{ctr}%"

# Analyze CTR by result rank
rank_analysis = Ragdoll::SearchResult.rank_click_analysis
rank_analysis.each do |rank, stats|
  puts "Rank #{rank}: #{stats[:ctr]}% CTR (#{stats[:clicked]}/#{stats[:total]})"
end

# Find top performing embeddings
top_embeddings = Ragdoll::SearchResult.top_performing_embeddings(limit: 10)
top_embeddings.each do |embedding_stats|
  puts "Embedding #{embedding_stats.embedding_id}:"
  puts "  Appearances: #{embedding_stats.appearance_count}"
  puts "  Clicks: #{embedding_stats.click_count}"
  puts "  CTR: #{embedding_stats.ctr}%"
  puts "  Avg Similarity: #{embedding_stats.avg_similarity}"
end

Cleanup and Maintenance¶

The system includes automatic cleanup capabilities:

# Remove orphaned searches (searches with no results)
orphaned_count = Ragdoll::Search.cleanup_orphaned_searches
puts "Cleaned up #{orphaned_count} orphaned searches"

# Remove old unused searches
cleaned_count = Ragdoll::Search.cleanup_old_unused_searches(days: 30)
puts "Cleaned up #{cleaned_count} old unused searches"

# Automatic cascade deletion
# When documents are deleted, associated search results are automatically cleaned up
# Empty searches (with no results) are automatically removed

Analytics and Reporting¶