Ingesting Content¶
Content is the foundation of FactDb - immutable source documents from which facts are extracted.
Basic Ingestion¶
facts = FactDb.new
content = facts.ingest(
"Paula Chen joined Microsoft as Principal Engineer on January 10, 2024.",
type: :announcement
)
Full Options¶
source = facts.ingest(
text_content,
type: :email,
title: "RE: Offer Letter - Paula Chen",
source_uri: "mailto:hr@company.com/msg/12345",
captured_at: Time.parse("2024-01-08 10:30:00"),
metadata: {
from: "hr@company.com",
to: "hiring@company.com",
cc: ["manager@company.com"],
subject: "RE: Offer Letter - Paula Chen",
thread_id: "THR-12345"
}
)
Content Types¶
Choose a type that best describes the source:
| Type | Use Case |
|---|---|
:email |
Email messages |
:document |
General documents, PDFs |
:article |
News articles, blog posts |
:transcript |
Meeting transcripts, interviews |
:report |
Reports, analysis documents |
:announcement |
Official announcements |
:social |
Social media posts |
:form |
Structured forms, surveys |
:note |
Notes, memos |
Metadata¶
Store additional context in metadata:
# Email metadata
metadata: {
from: "sender@example.com",
to: "recipient@example.com",
subject: "Important Update",
message_id: "<abc123@mail.example.com>"
}
# Document metadata
metadata: {
author: "Jane Smith",
version: "2.1",
department: "Engineering",
classification: "internal"
}
# Article metadata
metadata: {
author: "John Doe",
publication: "Tech News",
url: "https://technews.com/article/123",
published_at: "2024-01-15T14:30:00Z"
}
Deduplication¶
Content is automatically deduplicated by SHA256 hash:
# First ingestion - creates new record
source1 = facts.ingest("Hello world", type: :note)
# Second ingestion - returns existing record
source2 = facts.ingest("Hello world", type: :note)
source1.id == source2.id # => true
Timestamps¶
captured_at¶
When the content was captured/received (defaults to current time):
# Email received yesterday
content = facts.ingest(
email_body,
type: :email,
captured_at: Time.parse("2024-01-14 09:00:00")
)
created_at¶
Automatically set when record is created (system timestamp).
Batch Ingestion¶
For multiple documents:
documents = [
{ text: "Doc 1 content", type: :document, title: "Doc 1" },
{ text: "Doc 2 content", type: :document, title: "Doc 2" },
{ text: "Doc 3 content", type: :document, title: "Doc 3" }
]
contents = documents.map do |doc|
facts.ingest(doc[:text], type: doc[:type], title: doc[:title])
end
Source Service¶
For advanced operations, use the source service directly:
# Create source
source = facts.source_service.create(
text_content,
type: :document,
title: "Annual Report"
)
# Find by ID
source = facts.source_service.find(source_id)
# Find by hash
source = facts.source_service.find_by_hash(sha256_hash)
# Search by text
results = facts.source_service.search("quarterly earnings")
# Semantic search (requires embedding)
results = facts.source_service.semantic_search(
"financial performance",
limit: 10
)
Embeddings¶
If you configure an embedding generator, content embeddings are created automatically:
FactDb.configure do |config|
config.embedding_generator = ->(text) {
# Your embedding logic
client.embeddings(input: text)
}
end
# Embeddings generated on ingest
content = facts.ingest(text, type: :document)
content.embedding # => [0.123, -0.456, ...]
Source URIs¶
Track original locations with source_uri:
# Email
source_uri: "mailto:sender@example.com/msg/12345"
# Web page
source_uri: "https://example.com/articles/123"
# File
source_uri: "file:///path/to/document.pdf"
# Database record
source_uri: "db://crm/contacts/12345"
# API
source_uri: "api://salesforce/leads/ABC123"
Best Practices¶
1. Preserve Original Text¶
# Good - preserve original formatting
facts.ingest(original_email_body, type: :email)
# Avoid - don't pre-process
facts.ingest(cleaned_text.strip.downcase, type: :email)
2. Include Context in Metadata¶
content = facts.ingest(
transcript,
type: :transcript,
title: "Q4 2024 Earnings Call",
metadata: {
participants: ["CEO", "CFO", "Analysts"],
duration_minutes: 60,
recording_url: "https://..."
}
)
3. Use Consistent Types¶
# Define content types for your organization
module ContentTypes
EMAIL = :email
SLACK = :slack_message
MEETING = :meeting_transcript
# ...
end
facts.ingest(text, type: ContentTypes::EMAIL)