Extracting Facts¶
Facts are extracted from content using one of three methods: manual, LLM-powered, or rule-based.
Extraction Methods¶
Manual Extraction¶
Create facts directly via the API:
facts = FactDb.new
# Create entities first
paula = facts.entity_service.create("Paula Chen", type: :person)
microsoft = facts.entity_service.create("Microsoft", type: :organization)
# Create fact with explicit links
fact = facts.fact_service.create(
"Paula Chen joined Microsoft as Principal Engineer",
valid_at: Date.parse("2024-01-10"),
mentions: [
{ entity: paula, role: "subject", text: "Paula Chen" },
{ entity: microsoft, role: "organization", text: "Microsoft" }
],
sources: [
{ source: source, type: "primary", excerpt: "...accepted the offer..." }
]
)
LLM Extraction¶
Use AI to automatically extract facts:
# Configure LLM
FactDb.configure do |config|
config.llm.provider = :openai
config.llm.api_key = ENV['OPENAI_API_KEY']
end
facts = FactDb.new
# Extract facts from source
extracted = facts.extract_facts(source.id, extractor: :llm)
extracted.each do |fact|
puts fact.text
puts " Valid from: #{fact.valid_at}"
puts " Entities: #{fact.entity_mentions.map(&:entity).map(&:name)}"
end
Rule-Based Extraction¶
Use regex patterns for structured content:
The rule-based extractor includes patterns for:
- Dates and time references
- Employment events (joined, promoted, left)
- Title/role changes
- Location references
- Organizational relationships
Setting Default Extractor¶
FactDb.configure do |config|
config.default_extractor = :llm # or :manual, :rule_based
end
# Uses configured default
extracted = facts.extract_facts(source.id)
Fact Structure¶
Every extracted fact includes:
fact = Models::Fact.new(
text: "Paula Chen is Principal Engineer at Microsoft",
digest: "sha256...", # For deduplication
valid_at: Time.parse("2024-01-10"),
invalid_at: nil, # nil = currently valid
status: "canonical", # canonical, superseded, corroborated, synthesized
confidence: 0.95, # Extraction confidence
extraction_method: "llm", # manual, llm, rule_based
metadata: {} # Additional data
)
Entity Mentions¶
Facts link to entities via mentions:
fact.add_mention(
entity: paula,
text: "Paula Chen", # How entity was mentioned
role: "subject", # Role in the fact
confidence: 0.95 # Resolution confidence
)
Mention Roles¶
| Role | Description | Example |
|---|---|---|
subject |
Primary actor | "Paula joined..." |
object |
Target | "...hired Paula" |
organization |
Company/team | "...at Microsoft" |
location |
Place | "...in Seattle" |
role |
Title/position | "...as Engineer" |
temporal |
Time reference | "...in Q4 2024" |
attribute |
Property | "...with 10 years experience" |
Source Links¶
Facts link to source content:
fact.add_source(
source: email_source,
type: "primary",
excerpt: "Paula has accepted our offer to join as Principal Engineer...",
confidence: 0.95
)
Source Types¶
| Type | Description |
|---|---|
primary |
Direct source of the fact |
supporting |
Confirms the fact |
contradicting |
Contradicts the fact |
Batch Extraction¶
Process multiple content items:
source_ids = [source1.id, source2.id, source3.id]
# Sequential processing
results = facts.batch_extract(source_ids, parallel: false)
# Parallel processing (default)
results = facts.batch_extract(source_ids, parallel: true)
results.each do |result|
puts "Source #{result[:source_id]}:"
puts " Facts: #{result[:facts].count}"
puts " Error: #{result[:error]}" if result[:error]
end
Custom Extractors¶
Create custom extractors by extending the base class:
class MyExtractor < FactDb::Extractors::Base
def extract(source)
extracted = []
# Your extraction logic here
# Parse source.content
# Create fact records
extracted
end
end
# Register and use
facts.fact_service.extract_from_source(
source.id,
extractor: MyExtractor.new(config)
)
Extraction Confidence¶
Track confidence levels:
# High confidence - direct statement
fact = facts.fact_service.create(
"Paula is Principal Engineer",
confidence: 0.95
)
# Medium confidence - inferred
fact = facts.fact_service.create(
"Paula likely works in Engineering",
confidence: 0.7
)
# Low confidence - speculation
fact = facts.fact_service.create(
"Paula may be promoted soon",
confidence: 0.4
)
Post-Extraction Processing¶
After extraction, you may want to:
Resolve Entities¶
extracted = facts.extract_facts(source.id, extractor: :llm)
extracted.each do |fact|
fact.entity_mentions.each do |mention|
if mention.entity.nil?
# Resolve unlinked mention
entity = facts.resolve_entity(mention.mention_text)
mention.update!(entity: entity) if entity
end
end
end
Detect Conflicts¶
conflicts = facts.fact_service.resolver.find_conflicts(
entity_id: paula.id
)
conflicts.each do |conflict|
puts "Conflict between:"
puts " #{conflict[:fact1].text}"
puts " #{conflict[:fact2].text}"
end
Corroborate Facts¶
# If multiple sources say the same thing
if fact1.text.similar_to?(fact2.text)
facts.fact_service.resolver.corroborate(fact1.id, fact2.id)
end
Best Practices¶
1. Review LLM Extractions¶
extracted = facts.extract_facts(source.id, extractor: :llm)
extracted.select { |f| f.confidence < 0.8 }.each do |fact|
# Flag for human review
fact.update!(metadata: fact.metadata.merge(needs_review: true))
end
2. Validate Temporal Information¶
# Ensure valid_at is reasonable
if fact.valid_at > Time.current
logger.warn "Future date detected: #{fact.valid_at}"
end
3. Link Sources¶
# Always link facts to their sources
fact = facts.fact_service.create(
"Important fact",
valid_at: Date.today,
sources: [{ source: source_record, type: "primary" }]
)