Class: FactDb::Extractors::RuleBasedExtractor
- Defined in: lib/fact_db/extractors/rule_based_extractor.rb
Overview
Rule-based fact extractor using regex patterns
Extracts facts from text using predefined regex patterns for common fact types like employment, relationships, and locations. Does not require an LLM but is limited to recognized patterns.
Constant Summary
- DATE_PATTERNS =
Returns patterns for extracting start dates.
[
  # "on January 10, 2024"
  /(?:on|since|from|as of|starting)\s+(\w+\s+\d{1,2},?\s+\d{4})/i,
  # "on 2024-01-10"
  /(?:on|since|from|as of|starting)\s+(\d{4}-\d{2}-\d{2})/i,
  # "in January 2024"
  /(?:in|during)\s+(\w+\s+\d{4})/i,
  # "in 2024"
  /(?:in|during)\s+(\d{4})\b/i
].freeze
- END_DATE_PATTERNS =
Returns patterns for extracting end dates.
[
  # "until January 10, 2024"
  /(?:until|through|to|ended|left)\s+(\w+\s+\d{1,2},?\s+\d{4})/i,
  /(?:until|through|to|ended|left)\s+(\d{4}-\d{2}-\d{2})/i
].freeze
- EMPLOYMENT_PATTERNS =
Returns patterns for employment facts.
[
  # "Paula works at Microsoft"
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:works?|worked|is working)[ ]+(?:at|for)[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/,
  # "Paula joined Microsoft"
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:joined|started at|was hired by)[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/,
  # "Paula left Microsoft"
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:left|departed|resigned from|was fired from)[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/,
  # "Paula is a Principal Engineer at Microsoft"
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:is|was|became)[ ]+(?:a[ ]+)?([A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)[ ]+at[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/
].freeze
- RELATIONSHIP_PATTERNS =
Returns patterns for relationship facts.
[
  # "Paula is married to John"
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:is|was)[ ]+(?:married to|engaged to|dating)[ ]+(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b/,
  # "Paula is the CEO of Microsoft"
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:is|was)[ ]+(?:the[ ]+)?(\w+(?:[ ]+\w+)*)[ ]+of[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/
].freeze
- LOCATION_PATTERNS =
Returns patterns for location facts.
[
  # "Paula lives in Seattle" or "Bob lives in New York City"
  /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:lives?|lived|is based|was based|relocated|moved)[ ]+(?:in|to)[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*(?:,[ ]+[A-Z]{2})?)\b/,
  # "Microsoft is headquartered in Redmond" or "in Seattle, Washington"
  /(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b[ ]+(?:is|was)[ ]+(?:headquartered|located|based)[ ]+in[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*(?:,[ ]+[A-Z][A-Za-z]+)?)\b/
].freeze
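As a rough illustration of what these constants capture, the sketch below copies the first employment, location, and start-date patterns into local variables and runs them against made-up sentences. The variable names and sample text are assumptions for the example only; how the extractor itself combines the captures is not shown here.

```ruby
# First EMPLOYMENT_PATTERNS entry, copied verbatim from the constant above.
employment_pattern = /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:works?|worked|is working)[ ]+(?:at|for)[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*)\b/
# First LOCATION_PATTERNS entry.
location_pattern = /(\b[A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)*)\b[ ]+(?:lives?|lived|is based|was based|relocated|moved)[ ]+(?:in|to)[ ]+(\b[A-Z][A-Za-z]+(?:[ ]+[A-Z][A-Za-z]+)*(?:,[ ]+[A-Z]{2})?)\b/
# First DATE_PATTERNS entry (case-insensitive, unlike the two above).
start_date_pattern = /(?:on|since|from|as of|starting)\s+(\w+\s+\d{1,2},?\s+\d{4})/i

# Hypothetical sample sentences.
employment_text = "Paula works at Microsoft since January 10, 2024."
location_text = "Bob lives in New York City."

# Capture groups are [subject, object] for the two-group patterns.
employment_captures = employment_pattern.match(employment_text).captures
# => ["Paula", "Microsoft"]
location_captures = location_pattern.match(location_text).captures
# => ["Bob", "New York City"]
start_date = start_date_pattern.match(employment_text)[1]
# => "January 10, 2024"

# The entity patterns are case-sensitive, so lowercase names do not match.
lowercase_match = employment_pattern.match("paula works at microsoft")
# => nil
```

Because the name patterns require an initial capital letter, text that has been lowercased or is written in all caps will yield no facts.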
Instance Attribute Summary
Attributes inherited from Base
Instance Method Summary
- #extract(text, context = {}) ⇒ Array<Hash>
Extracts facts from text using regex patterns.
- #extract_entities(text) ⇒ Array<Hash>
Extracts entities from text using regex patterns.
Methods inherited from Base
available_types, #extraction_method, for, #initialize
Constructor Details
This class inherits a constructor from FactDb::Extractors::Base
Instance Method Details
#extract(text, context = {}) ⇒ Array<Hash>
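Extracts facts from text using regex patterns.
Applies the employment, relationship, and location patterns, in that order, to identify facts with their associated entity mentions and temporal information. Returns an empty array when the text is nil or blank, and removes facts with duplicate text from the result.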
# File 'lib/fact_db/extractors/rule_based_extractor.rb', line 72

def extract(text, context = {})
  return [] if text.nil? || text.strip.empty?

  facts = []

  # Extract employment facts
  facts.concat(extract_employment_facts(text, context))

  # Extract relationship facts
  facts.concat(extract_relationship_facts(text, context))

  # Extract location facts
  facts.concat(extract_location_facts(text, context))

  facts.uniq { |f| f[:text] }
end
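The guard clause and the final `uniq { |f| f[:text] }` can be tried in isolation. The sketch below is a minimal illustration with made-up fact hashes: only the `:text` key comes from the source above, while the `:source` key and the `blank` lambda are hypothetical and exist just to make the behavior visible.

```ruby
# Same blank check as the guard clause in #extract, wrapped in a lambda
# (the lambda itself is illustrative, not part of the library).
blank = ->(text) { text.nil? || text.strip.empty? }

blank_results = [blank.call(nil), blank.call("  \n"), blank.call("Paula")]
# => [true, true, false]

# Hypothetical fact hashes; :source is an assumed key for this example.
facts = [
  { text: "Paula works at Microsoft", source: :employment },
  { text: "Paula works at Microsoft", source: :relationship },
  { text: "Paula lives in Seattle", source: :location }
]

# Array#uniq with a block keeps the first element for each block value,
# so the earlier fact wins when two facts share the same text.
deduped = facts.uniq { |f| f[:text] }
deduped_sources = deduped.map { |f| f[:source] }
# => [:employment, :location]
```

Since employment facts are concatenated first, a fact whose text is produced by more than one pattern group would keep the employment version under this ordering.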
#extract_entities(text) ⇒ Array<Hash>
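Extracts entities from text using regex patterns.
Identifies person names, organization names, and locations using pattern matching. Person candidates are sequences of two or more capitalized words on the same line; common words, job titles, known phrases, known places, and organization indicators are filtered out. Organizations and places are taken from the last capture of the employment and location patterns. The result is deduplicated by name, ignoring case.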
# File 'lib/fact_db/extractors/rule_based_extractor.rb', line 96

def extract_entities(text)
  return [] if text.nil? || text.strip.empty?

  entities = []

  # Extract person names (capitalized word sequences on same line)
  # Use [ ]+ instead of \s+ to avoid matching across newlines
  text.scan(/\b([A-Z][a-z]+(?:[ ]+[A-Z][a-z]+)+)\b/).flatten.uniq.each do |name|
    next if common_word?(name)
    next if job_title?(name)
    next if common_phrase?(name)
    next if known_place?(name)
    next if organization_indicator?(name)

    entities << build_entity(name: name, kind: "person")
  end

  # Extract organization names (from employment patterns)
  EMPLOYMENT_PATTERNS.each do |pattern|
    text.scan(pattern).each do |match|
      org_name = match.last
      entities << build_entity(name: org_name, kind: "organization") unless common_word?(org_name)
    end
  end

  # Extract locations
  LOCATION_PATTERNS.each do |pattern|
    text.scan(pattern).each do |match|
      location = match.last
      entities << build_entity(name: location, kind: "place") unless common_word?(location)
    end
  end

  entities.uniq { |e| e[:name].downcase }
end