Class: FactDb::Extractors::RuleBasedExtractor

Inherits:
Base
  • Object
Defined in:
lib/fact_db/extractors/rule_based_extractor.rb

Overview

Rule-based fact extractor using regex patterns

Extracts facts from text using predefined regex patterns for common fact types such as employment, relationships, and locations. It does not require an LLM, but it only finds facts phrased in ways the patterns recognize.

Examples:

Extract facts using patterns

extractor = FactDb::Extractors::RuleBasedExtractor.new
facts = extractor.extract("Paula works at Microsoft in Seattle")

Constant Summary

DATE_PATTERNS =

Returns patterns for extracting start dates.

Returns:

  • (Array<Regexp>)

    patterns for extracting start dates

[
  # "on January 10, 2024"
  /(?:on|since|from|as of|starting)\s+(\w+\s+\d{1,2},?\s+\d{4})/i,
  # "on 2024-01-10"
  /(?:on|since|from|as of|starting)\s+(\d{4}-\d{2}-\d{2})/i,
  # "in January 2024"
  /(?:in|during)\s+(\w+\s+\d{4})/i,
  # "in 2024"
  /(?:in|during)\s+(\d{4})\b/i
].freeze
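
As a rough illustration of how these patterns might be applied, the sketch below tries each one in order and returns the first captured date string. The helper name `first_start_date` is hypothetical and not part of FactDb; the patterns are copied from the constant above.

```ruby
# Patterns copied from DATE_PATTERNS above.
DATE_PATTERNS = [
  /(?:on|since|from|as of|starting)\s+(\w+\s+\d{1,2},?\s+\d{4})/i,
  /(?:on|since|from|as of|starting)\s+(\d{4}-\d{2}-\d{2})/i,
  /(?:in|during)\s+(\w+\s+\d{4})/i,
  /(?:in|during)\s+(\d{4})\b/i
].freeze

# Hypothetical helper (not part of FactDb): returns the first captured
# date string, trying the patterns in array order, or nil if none match.
def first_start_date(text)
  DATE_PATTERNS.each do |pattern|
    match = pattern.match(text)
    return match[1] if match
  end
  nil
end

p first_start_date("Paula joined Microsoft on January 10, 2024") # "January 10, 2024"
p first_start_date("Employed since 2024-01-10")                  # "2024-01-10"
p first_start_date("hired in January 2024")                      # "January 2024"
p first_start_date("promoted in 2024")                           # "2024"
p first_start_date("Paula works at Microsoft")                   # nil
```

The array is ordered from most to least specific, so a first-match loop like this one prefers a full date over a bare year when both could apply.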
END_DATE_PATTERNS =

Returns patterns for extracting end dates.

Returns:

  • (Array<Regexp>)

    patterns for extracting end dates

[
  # "until January 10, 2024"
  /(?:until|through|to|ended|left)\s+(\w+\s+\d{1,2},?\s+\d{4})/i,
  # "until 2024-01-10"
  /(?:until|through|to|ended|left)\s+(\d{4}-\d{2}-\d{2})/i
].freeze
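
The captures are plain strings. One plausible way to turn them into dates is `Date.parse` from Ruby's standard library, as in the sketch below. This section does not show how FactDb itself converts the captures, so treat the conversion step and the helper name `first_end_date` as assumptions.

```ruby
require "date"

# Patterns copied from END_DATE_PATTERNS above.
END_DATE_PATTERNS = [
  /(?:until|through|to|ended|left)\s+(\w+\s+\d{1,2},?\s+\d{4})/i,
  /(?:until|through|to|ended|left)\s+(\d{4}-\d{2}-\d{2})/i
].freeze

# Hypothetical helper (not part of FactDb): finds the first end-date phrase
# and parses its capture into a Date. Returns nil when nothing matches.
def first_end_date(text)
  END_DATE_PATTERNS.each do |pattern|
    match = pattern.match(text)
    return Date.parse(match[1]) if match
  end
  nil
end

puts first_end_date("Paula worked at Microsoft until March 3, 2023") # 2023-03-03
puts first_end_date("Contract ran through 2023-03-03")               # 2023-03-03
```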
EMPLOYMENT_PATTERNS =

Returns patterns for employment facts.

Returns:

  • (Array<Regexp>)

    patterns for employment facts

[
  # "Paula works at Microsoft"
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:works?|worked|is working)[ ]+(?:at|for)[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/,
  # "Paula joined Microsoft"
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:joined|started at|was hired by)[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/,
  # "Paula left Microsoft"
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:left|departed|resigned from|was fired from)[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/,
  # "Paula is a Principal Engineer at Microsoft"
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:is|was|became)[ ]+(?:a[ ]+)?([A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)[ ]+at[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/
].freeze
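
To make the capture groups concrete, the sketch below runs `String#scan` over every employment pattern and collects the captures. The helper name `employment_captures` is hypothetical; the patterns are copied from the constant above.

```ruby
# Patterns copied from EMPLOYMENT_PATTERNS above.
EMPLOYMENT_PATTERNS = [
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:works?|worked|is working)[ ]+(?:at|for)[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/,
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:joined|started at|was hired by)[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/,
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:left|departed|resigned from|was fired from)[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/,
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:is|was|became)[ ]+(?:a[ ]+)?([A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)[ ]+at[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/
].freeze

# Hypothetical helper (not part of FactDb): returns one array of captures
# per match, across all employment patterns.
def employment_captures(text)
  EMPLOYMENT_PATTERNS.flat_map { |pattern| text.scan(pattern) }
end

p employment_captures("Paula works at Microsoft in Seattle")
# [["Paula", "Microsoft"]]
p employment_captures("Paula is a Principal Engineer at Microsoft")
# [["Paula", "Principal Engineer", "Microsoft"]]
p employment_captures("paula works at microsoft")
# []
```

The first three patterns capture person and organization; the fourth also captures a title in the middle. In every case the organization is the last capture. The patterns are case-sensitive, so uncapitalized names are not matched.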
RELATIONSHIP_PATTERNS =

Returns patterns for relationship facts.

Returns:

  • (Array<Regexp>)

    patterns for relationship facts

[
  # "Paula is married to John"
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:is|was)[ ]+(?:married to|engaged to|dating)[ ]+(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b/,
  # "Paula is the CEO of Microsoft"
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:is|was)[ ]+(?:the[ ]+)?(\w+(?:[ ]+\w+)*)[ ]+of[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/
].freeze
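
The sketch below shows what the two relationship patterns capture on their own, before any filtering the extractor may apply afterwards (that filtering is not shown in this section). The helper name `relationship_captures` is hypothetical.

```ruby
# Patterns copied from RELATIONSHIP_PATTERNS above.
RELATIONSHIP_PATTERNS = [
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:is|was)[ ]+(?:married to|engaged to|dating)[ ]+(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b/,
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:is|was)[ ]+(?:the[ ]+)?(\w+(?:[ ]+\w+)*)[ ]+of[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/
].freeze

# Hypothetical helper (not part of FactDb): returns one array of captures
# per match, across both relationship patterns.
def relationship_captures(text)
  RELATIONSHIP_PATTERNS.flat_map { |pattern| text.scan(pattern) }
end

p relationship_captures("Paula is married to John")
# [["Paula", "John"]]
p relationship_captures("Paula is the CEO of Microsoft")
# [["Paula", "CEO", "Microsoft"]]
p relationship_captures("Paula is fond of Microsoft")
# [["Paula", "fond", "Microsoft"]]
```

The middle group of the second pattern is `\w+`, so the regex by itself accepts any "X is Y of Z" phrasing, including ones that are not roles, as the last call shows.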
LOCATION_PATTERNS =

Returns patterns for location facts.

Returns:

  • (Array<Regexp>)

    patterns for location facts

[
  # "Paula lives in Seattle" or "Bob lives in New York City"
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:lives?|lived|is based|was based|relocated|moved)[ ]+(?:in|to)[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*(?:,[ ]+[A-Z]{2})?)\b/,
  # "Microsoft is headquartered in Redmond" or "in Seattle, Washington"
  /(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b[ ]+(?:is|was)[ ]+(?:headquartered|located|based)[ ]+in[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*(?:,[ ]+[A-Z][A-Za-z]+)?)\b/
].freeze
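
The two location patterns treat the text after a comma differently, which the sketch below tries to make visible. The helper name `location_captures` is hypothetical; the patterns are copied from the constant above.

```ruby
# Patterns copied from LOCATION_PATTERNS above.
LOCATION_PATTERNS = [
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:lives?|lived|is based|was based|relocated|moved)[ ]+(?:in|to)[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*(?:,[ ]+[A-Z]{2})?)\b/,
  /(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b[ ]+(?:is|was)[ ]+(?:headquartered|located|based)[ ]+in[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*(?:,[ ]+[A-Z][A-Za-z]+)?)\b/
].freeze

# Hypothetical helper (not part of FactDb): returns one array of captures
# per match, across both location patterns.
def location_captures(text)
  LOCATION_PATTERNS.flat_map { |pattern| text.scan(pattern) }
end

p location_captures("Bob lives in New York City")
# [["Bob", "New York City"]]
p location_captures("Paula moved to Seattle, WA")
# [["Paula", "Seattle, WA"]]
p location_captures("Paula lives in Seattle, Washington")
# [["Paula", "Seattle"]]
p location_captures("Microsoft is headquartered in Seattle, Washington")
# [["Microsoft", "Seattle, Washington"]]
```

The first pattern only keeps a two-letter code after the comma, so a spelled-out state name is dropped from the capture; the second pattern keeps a capitalized word after the comma.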

Instance Attribute Summary

Attributes inherited from Base

#config

Instance Method Summary

Methods inherited from Base

available_types, #extraction_method, for, #initialize

Constructor Details

This class inherits a constructor from FactDb::Extractors::Base

Instance Method Details

#extract(text, context = {}) ⇒ Array<Hash>

Extracts facts from text using regex patterns

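Applies the employment, relationship, and location patterns, in that order, to identify facts along with their entity mentions and temporal information. Returns an empty array for nil or blank text. Results are deduplicated by fact text, keeping the first occurrence.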

Parameters:

  • text (String)

    raw text to extract from

  • context (Hash) (defaults to: {})

    additional context

Options Hash (context):

  • :captured_at (Date, Time)

    default timestamp for facts

Returns:

  • (Array<Hash>)

    array of fact hashes, deduplicated by text



# File 'lib/fact_db/extractors/rule_based_extractor.rb', line 72

def extract(text, context = {})
  return [] if text.nil? || text.strip.empty?

  facts = []

  # Extract employment facts
  facts.concat(extract_employment_facts(text, context))

  # Extract relationship facts
  facts.concat(extract_relationship_facts(text, context))

  # Extract location facts
  facts.concat(extract_location_facts(text, context))

  facts.uniq { |f| f[:text] }
end
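
The final `uniq` call keeps the first fact for each distinct `:text` value, so the order in which the three groups are concatenated decides which duplicate survives. A minimal sketch of that behavior follows; the helper `dedupe_facts` and the `:kind` key are hypothetical, and only the `:text` key is confirmed by the code above.

```ruby
# Hypothetical helper (not part of FactDb): concatenates several lists of
# fact hashes and removes later facts whose :text was already seen.
def dedupe_facts(*fact_lists)
  fact_lists.flatten(1).uniq { |fact| fact[:text] }
end

# Illustrative fact hashes; the :kind key is an assumption for this sketch.
employment = [
  { text: "Paula works at Microsoft", kind: :employment }
]
relationship = [
  { text: "Paula works at Microsoft", kind: :relationship },
  { text: "Paula is married to John", kind: :relationship }
]

result = dedupe_facts(employment, relationship, [])
p result.map { |fact| fact[:text] }
# ["Paula works at Microsoft", "Paula is married to John"]
p result.first[:kind]
# :employment
```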

#extract_entities(text) ⇒ Array<Hash>

Extracts entities from text using regex patterns

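Identifies person names, organization names, and locations using pattern matching. Person candidates are sequences of two or more capitalized words on the same line; candidates that are common words, job titles, common phrases, known places, or organization indicators are skipped. Organizations are taken from the last capture of each EMPLOYMENT_PATTERNS match, and places from the last capture of each LOCATION_PATTERNS match.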

Parameters:

  • text (String)

    raw text to extract from

Returns:

  • (Array<Hash>)

    array of entity hashes with :name and :kind, deduplicated case-insensitively by name



# File 'lib/fact_db/extractors/rule_based_extractor.rb', line 96

def extract_entities(text)
  return [] if text.nil? || text.strip.empty?

  entities = []

  # Extract person names (capitalized word sequences on same line)
  # Use [ ]+ instead of \s+ to avoid matching across newlines
  text.scan(/\b([A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)+)\b/).flatten.uniq.each do |name|
    next if common_word?(name)
    next if job_title?(name)
    next if common_phrase?(name)
    next if known_place?(name)
    next if organization_indicator?(name)

    entities << build_entity(name: name, kind: "person")
  end

  # Extract organization names (from employment patterns)
  EMPLOYMENT_PATTERNS.each do |pattern|
    text.scan(pattern).each do |match|
      org_name = match.last
      entities << build_entity(name: org_name, kind: "organization") unless common_word?(org_name)
    end
  end

  # Extract locations
  LOCATION_PATTERNS.each do |pattern|
    text.scan(pattern).each do |match|
      location = match.last
      entities << build_entity(name: location, kind: "place") unless common_word?(location)
    end
  end

  entities.uniq { |e| e[:name].downcase }
end
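
Two details of this method can be tried in isolation: the person-name scan, which uses `[ ]+` so a name cannot span a line break, and the final case-insensitive `uniq`. The sketch below reproduces only those two steps. The helper names are hypothetical, and the filter predicates (`common_word?`, `job_title?`, and so on) are left out because their definitions are not shown in this section.

```ruby
# Regex copied from the person-name scan above: two or more capitalized
# words separated by spaces only.
PERSON_NAME = /\b([A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)+)\b/
# Variant with \s+ for comparison only; it is NOT what the extractor uses.
PERSON_NAME_ANY_WHITESPACE = /\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b/

# Hypothetical helper: unfiltered person-name candidates.
def person_name_candidates(text, pattern = PERSON_NAME)
  text.scan(pattern).flatten.uniq
end

# Hypothetical helper: the same dedup rule as the last line of the method.
def dedupe_entities(entities)
  entities.uniq { |entity| entity[:name].downcase }
end

text = "Paula Chen\nJohn Smith joined Microsoft"
p person_name_candidates(text)
# ["Paula Chen", "John Smith"]
p person_name_candidates(text, PERSON_NAME_ANY_WHITESPACE)
# ["Paula Chen\nJohn Smith"]

p dedupe_entities([
  { name: "IBM", kind: "organization" },
  { name: "Ibm", kind: "organization" }
])
# [{ name: "IBM", kind: "organization" }] (exact inspect format varies by Ruby version)
```

A single capitalized word such as "Microsoft" is never a person candidate, because the pattern requires at least two words.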