Extracting Facts¶

Facts are extracted from content using one of three methods: manual, LLM-powered, or rule-based.

Extraction Methods¶

Manual Extraction¶

Create facts directly via the API:

facts = FactDb.new

# Create entities first
paula = facts.entity_service.create("Paula Chen", type: :person)
microsoft = facts.entity_service.create("Microsoft", type: :organization)

# Create a fact with explicit entity and source links
# (`source` is an existing source record created elsewhere)
fact = facts.fact_service.create(
  "Paula Chen joined Microsoft as Principal Engineer",
  valid_at: Date.parse("2024-01-10"),
  mentions: [
    { entity: paula, role: "subject", text: "Paula Chen" },
    { entity: microsoft, role: "organization", text: "Microsoft" }
  ],
  sources: [
    { source: source, type: "primary", excerpt: "...accepted the offer..." }
  ]
)

LLM Extraction¶

Use an LLM to extract facts automatically:

# Configure LLM
FactDb.configure do |config|
  config.llm.provider = :openai
  config.llm.api_key = ENV['OPENAI_API_KEY']
end

facts = FactDb.new

# Extract facts from source
extracted = facts.extract_facts(source.id, extractor: :llm)

extracted.each do |fact|
  puts fact.text
  puts "  Valid from: #{fact.valid_at}"
  puts "  Entities: #{fact.entity_mentions.map(&:entity).map(&:name)}"
end

Rule-Based Extraction¶

Use regex patterns for structured content:

extracted = facts.extract_facts(source.id, extractor: :rule_based)

The rule-based extractor includes patterns for:

Dates and time references
Employment events (joined, promoted, left)
Title/role changes
Location references
Organizational relationships
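To make the idea of pattern-based extraction concrete, here is a minimal plain-Ruby sketch of one employment-event pattern. It is not FactDb's actual rule set: the `EMPLOYMENT_PATTERN` constant and the `extract_employment_events` method are hypothetical names, and the pattern only handles simple "Name joined/left Organization" sentences.

```ruby
require "date"

# Hypothetical pattern: "<Name> joined|left <Organization> [on YYYY-MM-DD]"
EMPLOYMENT_PATTERN = /
  (?<person>[A-Z][a-z]+(?:\s[A-Z][a-z]+)+)   # capitalized multi-word name
  \s(?<event>joined|left)\s
  (?<organization>[A-Z][A-Za-z]+)            # single capitalized word
  (?:\son\s(?<date>\d{4}-\d{2}-\d{2}))?      # optional ISO date
/x

# Returns one hash per matched event; valid_at is nil when no date is present.
def extract_employment_events(text)
  text.to_enum(:scan, EMPLOYMENT_PATTERN).map do
    m = Regexp.last_match
    {
      text: m[0],
      person: m[:person],
      event: m[:event],
      organization: m[:organization],
      valid_at: m[:date] && Date.parse(m[:date])
    }
  end
end

events = extract_employment_events(
  "Paula Chen joined Microsoft on 2024-01-10. Later, Sam Lee left Contoso."
)
events.each do |e|
  puts "#{e[:person]} #{e[:event]} #{e[:organization]} (#{e[:valid_at] || 'undated'})"
end
```

Patterns like this are fast and predictable but brittle, which is why they suit structured content better than free-form prose.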

Setting Default Extractor¶

FactDb.configure do |config|
  config.default_extractor = :llm  # or :manual, :rule_based
end

# Uses configured default
extracted = facts.extract_facts(source.id)

Fact Structure¶

Every extracted fact includes:

fact = Models::Fact.new(
  text: "Paula Chen is Principal Engineer at Microsoft",
  digest: "sha256...",               # For deduplication
  valid_at: Time.parse("2024-01-10"),
  invalid_at: nil,                   # nil = currently valid
  status: "canonical",               # canonical, superseded, corroborated, synthesized
  confidence: 0.95,                  # Extraction confidence
  extraction_method: "llm",          # manual, llm, rule_based
  metadata: {}                       # Additional data
)
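The `digest` field exists so that the same statement is not stored twice. How FactDb computes it is not shown here; the sketch below assumes a SHA-256 hash over whitespace-normalized, lowercased text. The `fact_digest` helper is a hypothetical name used only for illustration.

```ruby
require "digest"

# Hypothetical helper: normalize the text, then hash it, so trivially
# different spellings of the same sentence share one digest.
def fact_digest(text)
  normalized = text.strip.downcase.gsub(/\s+/, " ")
  Digest::SHA256.hexdigest(normalized)
end

a = fact_digest("Paula Chen is Principal Engineer at Microsoft")
b = fact_digest("  paula chen is  Principal Engineer at Microsoft ")
puts a == b # the two digests match after normalization
```

Hashing normalized text only catches near-verbatim duplicates; paraphrases still need a similarity check.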

Entity Mentions¶

Facts link to entities via mentions:

fact.add_mention(
  entity: paula,
  text: "Paula Chen",    # How entity was mentioned
  role: "subject",       # Role in the fact
  confidence: 0.95       # Resolution confidence
)

Mention Roles¶

| Role | Description | Example |
|------|-------------|---------|
| `subject` | Primary actor | "Paula joined..." |
| `object` | Target | "...hired Paula" |
| `organization` | Company/team | "...at Microsoft" |
| `location` | Place | "...in Seattle" |
| `role` | Title/position | "...as Engineer" |
| `temporal` | Time reference | "...in Q4 2024" |
| `attribute` | Property | "...with 10 years experience" |

Source Links¶

Facts link to source content:

fact.add_source(
  source: email_source,
  type: "primary",
  excerpt: "Paula has accepted our offer to join as Principal Engineer...",
  confidence: 0.95
)

Source Types¶

| Type | Description |
|------|-------------|
| `primary` | Direct source of the fact |
| `supporting` | Confirms the fact |
| `contradicting` | Contradicts the fact |

Batch Extraction¶

Process multiple sources in one call:

source_ids = [source1.id, source2.id, source3.id]

# Sequential processing
results = facts.batch_extract(source_ids, parallel: false)

# Parallel processing (default)
results = facts.batch_extract(source_ids, parallel: true)

results.each do |result|
  puts "Source #{result[:source_id]}:"
  puts "  Facts: #{result[:facts].count}"
  puts "  Error: #{result[:error]}" if result[:error]
end
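The result shape used above (one hash per source with `:source_id`, `:facts`, and an optional `:error`) can be produced with a pattern like the following. This is a plain-Ruby sketch using threads, not FactDb's implementation; `batch_with_threads` and the stand-in extractor lambda are hypothetical.

```ruby
# Hypothetical helper: run a callable for each source id in its own thread
# and collect one result hash per source, capturing errors per source so
# that one failure does not abort the whole batch.
def batch_with_threads(source_ids, extractor)
  threads = source_ids.map do |id|
    Thread.new do
      { source_id: id, facts: extractor.call(id), error: nil }
    rescue StandardError => e
      { source_id: id, facts: [], error: e.message }
    end
  end
  threads.map(&:value) # joins each thread; results keep the input order
end

# Stand-in extractor that fails for one id
fake_extractor = lambda do |id|
  raise ArgumentError, "empty source" if id == 2

  ["fact from source #{id}"]
end

results = batch_with_threads([1, 2, 3], fake_extractor)
results.each do |r|
  status = r[:error] ? "error: #{r[:error]}" : "#{r[:facts].count} fact(s)"
  puts "Source #{r[:source_id]}: #{status}"
end
```

Capturing the error inside each worker is what lets the caller inspect `result[:error]` per source instead of rescuing around the whole batch.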

Custom Extractors¶

Create custom extractors by extending the base class:

class MyExtractor < FactDb::Extractors::Base
  def extract(source)
    extracted = []

    # Your extraction logic here
    # Parse source.content
    # Create fact records

    extracted
  end
end

# Use the custom extractor by passing an instance
facts.fact_service.extract_from_source(
  source.id,
  extractor: MyExtractor.new(config)
)

Extraction Confidence¶

Record how strongly the source supports each fact by setting `confidence` when you create it:

# High confidence - direct statement
fact = facts.fact_service.create(
  "Paula is Principal Engineer",
  confidence: 0.95
)

# Medium confidence - inferred
fact = facts.fact_service.create(
  "Paula likely works in Engineering",
  confidence: 0.7
)

# Low confidence - speculation
fact = facts.fact_service.create(
  "Paula may be promoted soon",
  confidence: 0.4
)
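If you want to act on these levels in code, one option is to map scores to named bands. The sketch below is an assumption-laden illustration: the cut-offs (0.9 and 0.6), the 0 to 1 range check, and the `confidence_band` helper are not part of FactDb and were chosen only so the three examples above land in their labelled bands.

```ruby
# Assumed cut-offs; tune them for your own data.
HIGH_CONFIDENCE = 0.9
MEDIUM_CONFIDENCE = 0.6

# Hypothetical helper: classify a confidence score as :high, :medium, or :low.
def confidence_band(score)
  raise ArgumentError, "confidence must be between 0 and 1" unless (0.0..1.0).cover?(score)

  if score >= HIGH_CONFIDENCE then :high
  elsif score >= MEDIUM_CONFIDENCE then :medium
  else :low
  end
end

[0.95, 0.7, 0.4].each { |s| puts "#{s} -> #{confidence_band(s)}" }
```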

Post-Extraction Processing¶

After extraction, you may want to resolve unlinked entities, check for conflicts, and corroborate matching facts.

Resolve Entities¶

extracted = facts.extract_facts(source.id, extractor: :llm)

extracted.each do |fact|
  fact.entity_mentions.each do |mention|
    if mention.entity.nil?
      # Resolve unlinked mention
      entity = facts.resolve_entity(mention.mention_text)
      mention.update!(entity: entity) if entity
    end
  end
end

Detect Conflicts¶

conflicts = facts.fact_service.resolver.find_conflicts(
  entity_id: paula.id
)

conflicts.each do |conflict|
  puts "Conflict between:"
  puts "  #{conflict[:fact1].text}"
  puts "  #{conflict[:fact2].text}"
end

Corroborate Facts¶

# If multiple sources say the same thing
# (similar_to? is illustrative: Ruby's String has no such method, so supply your own comparison)
if fact1.text.similar_to?(fact2.text)
  facts.fact_service.resolver.corroborate(fact1.id, fact2.id)
end
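The snippet above relies on a text-similarity check. One simple way to write such a check is Jaccard overlap of word sets, sketched below in plain Ruby. The `jaccard_similarity` and `similar_text?` helpers and the 0.6 threshold are hypothetical, and this is not necessarily how FactDb compares facts.

```ruby
require "set"

# Hypothetical similarity measure: shared words divided by all distinct
# words across both texts, ignoring case and punctuation.
def jaccard_similarity(a, b)
  tokens_a = a.downcase.scan(/\w+/).to_set
  tokens_b = b.downcase.scan(/\w+/).to_set
  union = tokens_a | tokens_b
  return 0.0 if union.empty?

  (tokens_a & tokens_b).size.to_f / union.size
end

# The threshold is an assumption; raise it to require closer matches.
def similar_text?(a, b, threshold: 0.6)
  jaccard_similarity(a, b) >= threshold
end

puts similar_text?("Paula Chen joined Microsoft", "Paula Chen has joined Microsoft")
puts similar_text?("Paula Chen joined Microsoft", "Sam Lee left Contoso")
```

Word overlap is cheap but ignores meaning, so two sentences with the same words and opposite claims would still score as similar.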

Best Practices¶

1. Review LLM Extractions¶

extracted = facts.extract_facts(source.id, extractor: :llm)

extracted.select { |f| f.confidence < 0.8 }.each do |fact|
  # Flag for human review
  fact.update!(metadata: fact.metadata.merge(needs_review: true))
end

2. Validate Temporal Information¶

# Ensure valid_at is reasonable
if fact.valid_at > Time.current
  logger.warn "Future date detected: #{fact.valid_at}"
end
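The same idea extends to the whole validity window. The following plain-Ruby sketch collects problems instead of only logging one; the `temporal_issues` helper is a hypothetical name, and the specific checks are assumptions about what "reasonable" means for your data.

```ruby
require "time"

# Hypothetical validator: returns a list of problems with a fact's
# validity window. An empty list means no issues were found.
def temporal_issues(valid_at:, invalid_at: nil, now: Time.now)
  issues = []
  issues << "valid_at is missing" if valid_at.nil?
  issues << "valid_at is in the future" if valid_at && valid_at > now
  if valid_at && invalid_at && invalid_at <= valid_at
    issues << "invalid_at is not after valid_at"
  end
  issues
end

issues = temporal_issues(
  valid_at: Time.parse("2024-01-10"),
  invalid_at: Time.parse("2023-12-01")
)
puts issues.empty? ? "OK" : issues.join("; ")
```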

3. Link Sources¶

# Always link facts to their sources
fact = facts.fact_service.create(
  "Important fact",
  valid_at: Date.today,
  sources: [{ source: source_record, type: "primary" }]
)

4. Handle Extraction Errors¶

begin
  extracted = facts.extract_facts(source.id, extractor: :llm)
rescue FactDb::ExtractionError => e
  logger.error "Extraction failed: #{e.message}"
  # Fall back to rule-based extraction
  extracted = facts.extract_facts(source.id, extractor: :rule_based)
end
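When more than one fallback is available, the begin/rescue pattern above generalizes to an ordered chain. This is a plain-Ruby sketch: `extract_with_fallback` is a hypothetical helper, and the lambdas stand in for real extractor calls. In real code you would likely rescue `FactDb::ExtractionError` rather than `StandardError`.

```ruby
# Hypothetical helper: try each extractor in order and return the first
# successful result, along with the errors raised by earlier attempts.
def extract_with_fallback(extractors)
  errors = {}
  extractors.each do |name, extractor|
    return { extractor: name, facts: extractor.call, errors: errors }
  rescue StandardError => e
    errors[name] = e.message
  end
  { extractor: nil, facts: [], errors: errors }
end

result = extract_with_fallback({
  llm: -> { raise "LLM provider unavailable" },
  rule_based: -> { ["Paula Chen joined Microsoft"] }
})
puts "Used #{result[:extractor]}, got #{result[:facts].size} fact(s)"
puts "Earlier failures: #{result[:errors]}" unless result[:errors].empty?
```

Returning the collected errors alongside the facts keeps the failure visible even when a fallback succeeds.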