File Loading¶
HTM can load text-based files (currently markdown) into long-term memory with automatic chunking, source tracking, and re-sync support. This is ideal for building knowledge bases from documentation, notes, or any text content.
Overview¶
The file loading system provides:
- Automatic chunking: Large files are split into chunks that follow the document's structure
- YAML frontmatter extraction: Metadata from file headers is preserved
- Source tracking: Files are tracked for re-sync when content changes
- Duplicate detection: Content hashing prevents duplicate chunks
- Soft delete: Unloading a file soft-deletes its chunks so they can be recovered
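The duplicate-detection idea in the list above can be illustrated with a small standalone sketch. It assumes SHA-256 digests over whitespace-trimmed chunk text; the `unique_chunks` helper is hypothetical, and HTM's actual hashing scheme may differ.

```ruby
require 'digest'
require 'set'

# Hypothetical helper (not part of HTM): keeps only chunks whose trimmed
# content has not been seen before, tracking SHA-256 digests in `seen`.
def unique_chunks(chunks, seen = Set.new)
  chunks.select do |chunk|
    digest = Digest::SHA256.hexdigest(chunk.strip)
    seen.add?(digest) # add? returns nil when the digest is already present
  end
end

chunks = ["Install the gem.", "Run the migrations.", "Install the gem.  "]
kept = unique_chunks(chunks)
puts kept.length # => 2
```

Hashing the content rather than comparing file paths means the same paragraph appearing in two files would be stored only once under this scheme.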
Quick Start¶
require 'htm'
htm = HTM.new(robot_name: "Document Loader")
# Load a single markdown file
result = htm.load_file("docs/guide.md")
# => { file_source_id: 1, chunks_created: 5, chunks_updated: 0, skipped: false }
# Load all markdown files from a directory
results = htm.load_directory("docs/", pattern: "**/*.md")
# => [{ file_path: "docs/guide.md", ... }, { file_path: "docs/api.md", ... }]
# Query nodes from a specific file
nodes = htm.nodes_from_file("docs/guide.md")
# Unload a file (soft deletes chunks)
htm.unload_file("docs/guide.md")
API Reference¶
load_file(path, force: false)¶
Loads a single file into long-term memory.
| Parameter | Type | Default | Description |
|---|---|---|---|
| path | String | required | Path to the file |
| force | Boolean | false | Force reload even if file unchanged |
Returns: Hash with keys:
- file_source_id: ID of the FileSource record
- chunks_created: Number of new chunks created
- chunks_updated: Number of existing chunks updated
- chunks_deleted: Number of chunks removed
- skipped: Whether file was skipped (unchanged)
# Normal load - skips unchanged files
result = htm.load_file("docs/guide.md")
# Force reload even if file hasn't changed
result = htm.load_file("docs/guide.md", force: true)
load_directory(path, pattern: "**/*.md", force: false)¶
Loads all matching files from a directory.
| Parameter | Type | Default | Description |
|---|---|---|---|
| path | String | required | Directory path |
| pattern | String | "**/*.md" | Glob pattern for files |
| force | Boolean | false | Force reload all files |
Returns: Array of result hashes (one per file)
# Load all markdown files
results = htm.load_directory("docs/")
# Load only top-level markdown files
results = htm.load_directory("docs/", pattern: "*.md")
# Load specific subdirectory
results = htm.load_directory("docs/guides/", pattern: "**/*.md")
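To see how the two patterns used above differ, the following sketch compares them with Ruby's own `Dir.glob` on a throwaway directory. Whether HTM resolves `pattern` through `Dir.glob` is an assumption; the `glob_demo` helper and the file names are hypothetical.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: builds a small temporary tree and returns the files
# matched by a recursive pattern and by a top-level-only pattern.
def glob_demo
  Dir.mktmpdir do |root|
    FileUtils.mkdir_p(File.join(root, "guides"))
    File.write(File.join(root, "index.md"), "# Index")
    File.write(File.join(root, "guides", "setup.md"), "# Setup")
    File.write(File.join(root, "notes.txt"), "not markdown")

    {
      recursive: Dir.glob("**/*.md", base: root).sort,
      top_level: Dir.glob("*.md", base: root).sort
    }
  end
end

result = glob_demo
p result[:recursive] # => ["guides/setup.md", "index.md"]
p result[:top_level] # => ["index.md"]
```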
nodes_from_file(path)¶
Returns all nodes loaded from a specific file.
nodes = htm.nodes_from_file("docs/guide.md")
nodes.each do |node|
puts "#{node.id}: #{node.content[0..50]}..."
end
unload_file(path)¶
Soft deletes all nodes from a file and removes the file source.
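As a rough illustration of what soft deletion means, the sketch below marks records with a timestamp instead of removing them. The in-memory hashes, the `deleted_at` key, and both helpers are assumptions made for the example, not HTM's schema.

```ruby
# Hypothetical helpers operating on in-memory records; a real store would
# keep the marker in a database column.
def soft_delete(records, source, now: Time.now)
  records.each { |r| r[:deleted_at] ||= now if r[:source] == source }
end

# Records that have not been soft-deleted.
def active(records)
  records.select { |r| r[:deleted_at].nil? }
end

records = [
  { id: 1, source: "docs/guide.md", deleted_at: nil },
  { id: 2, source: "docs/api.md",   deleted_at: nil }
]

soft_delete(records, "docs/guide.md")
puts active(records).map { |r| r[:id] }.inspect # => [2]
puts records.length                             # => 2
```

Because the rows are still present, recovery amounts to clearing the marker rather than re-loading the file.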
YAML Frontmatter¶
Files with YAML frontmatter have their metadata extracted and stored:
---
title: PostgreSQL Guide
author: HTM Team
tags:
- database
- postgresql
version: 1.2
---
# PostgreSQL Guide
Content starts here...
Access frontmatter via the FileSource model:
source = HTM::Models::FileSource.find_by(file_path: "docs/guide.md")
source.title # => "PostgreSQL Guide"
source.author # => "HTM Team"
source.frontmatter_tags # => ["database", "postgresql"]
source.frontmatter # => { "title" => "...", "author" => "...", ... }
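For readers curious how frontmatter can be separated from the body, here is a minimal sketch using Ruby's standard `yaml` library. The `split_frontmatter` helper and its regular expression are hypothetical; HTM's own parser may handle more edge cases.

```ruby
require 'yaml'

# Hypothetical helper: returns [frontmatter_hash, body]. The hash is empty
# when the text does not open with a `---` delimited block.
def split_frontmatter(text)
  match = /\A---[ \t]*\n(.*?)\n---[ \t]*(?:\n|\z)/m.match(text)
  return [{}, text] unless match

  meta = YAML.safe_load(match[1]) || {}
  [meta, match.post_match]
end

doc = <<~MD
  ---
  title: PostgreSQL Guide
  tags:
    - database
    - postgresql
  ---
  # PostgreSQL Guide
  Content starts here...
MD

meta, body = split_frontmatter(doc)
puts meta["title"]     # => PostgreSQL Guide
p meta["tags"]         # => ["database", "postgresql"]
puts body.lines.first  # => # PostgreSQL Guide
```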
Chunking Strategy¶
HTM uses the Baran gem with MarkdownSplitter for intelligent chunking that respects markdown structure:
- Headers: Chunks break at header boundaries
- Code blocks: Code blocks are kept intact
- Horizontal rules: Natural section breaks
- Configurable size: Control chunk size and overlap
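The interaction between chunk size and overlap can be shown with a deliberately naive character-window splitter. This is not Baran's algorithm (which, per the list above, respects markdown structure); the `window_chunks` helper is hypothetical and only illustrates what the two numbers control.

```ruby
# Hypothetical helper: fixed-size character windows where each chunk
# repeats the last `chunk_overlap` characters of the previous one.
def window_chunks(text, chunk_size:, chunk_overlap:)
  unless chunk_overlap < chunk_size
    raise ArgumentError, "overlap must be smaller than chunk size"
  end

  step = chunk_size - chunk_overlap
  chunks = []
  start = 0
  while start < text.length
    chunks << text[start, chunk_size]
    break if start + chunk_size >= text.length
    start += step
  end
  chunks
end

p window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size: 10, chunk_overlap: 3)
# => ["abcdefghij", "hijklmnopq", "opqrstuvwx", "vwxyz"]
```

A larger overlap preserves more context across chunk boundaries at the cost of storing more repeated text.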
Configuration¶
# Global configuration
HTM.configure do |config|
config.chunk_size = 1024 # Characters per chunk (default: 1024)
config.chunk_overlap = 64 # Overlap between chunks (default: 64)
end
# Or via environment variables
# HTM_CHUNK_SIZE=512
# HTM_CHUNK_OVERLAP=50
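One plausible way such environment variables could be turned into settings is sketched below. The `chunk_settings` helper is hypothetical, and how HTM actually reads or prioritizes these variables is an assumption; only the variable names and the defaults (1024 and 64) come from this page.

```ruby
# Hypothetical helper: reads chunk settings from an environment-like hash,
# falling back to the documented defaults when a variable is unset.
def chunk_settings(env = ENV)
  {
    chunk_size: Integer(env.fetch("HTM_CHUNK_SIZE", "1024")),
    chunk_overlap: Integer(env.fetch("HTM_CHUNK_OVERLAP", "64"))
  }
end

p chunk_settings({ "HTM_CHUNK_SIZE" => "512", "HTM_CHUNK_OVERLAP" => "50" })
# => {:chunk_size=>512, :chunk_overlap=>50} (newer Rubies print {chunk_size: 512, chunk_overlap: 50})
```

Using `Integer(...)` rather than `to_i` makes a malformed value such as `HTM_CHUNK_SIZE=big` raise an error instead of silently becoming 0.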
Per-Loader Configuration¶
loader = HTM::Loaders::MarkdownLoader.new(
htm,
chunk_size: 512,
chunk_overlap: 50
)
loader.load("docs/guide.md")
Re-Sync Behavior¶
The file loading system tracks file modification times for efficient re-syncing:
- First load: Creates FileSource record and chunks
- Subsequent loads: Compares mtime, skips unchanged files
- Changed files: Re-chunks and updates nodes
- Force reload: Bypasses mtime check
# First load - creates chunks
htm.load_file("docs/guide.md")
# => { skipped: false, chunks_created: 5 }
# Second load - skipped (unchanged)
htm.load_file("docs/guide.md")
# => { skipped: true }
# After editing file - re-syncs
htm.load_file("docs/guide.md")
# => { skipped: false, chunks_updated: 2, chunks_created: 1 }
# Force reload
htm.load_file("docs/guide.md", force: true)
# => { skipped: false, chunks_updated: 5 }
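The decision logic described in this section can be sketched in plain Ruby. The `sync_action` helper below is hypothetical and only mirrors the four cases listed above using `File.mtime`; it is not HTM's implementation.

```ruby
require 'tmpdir'

# Hypothetical helper: decides what to do with a file given the mtime
# recorded at the previous load (nil when the file was never loaded).
def sync_action(path, loaded_mtime, force: false)
  return :load if loaded_mtime.nil?
  return :reload if force

  File.mtime(path) > loaded_mtime ? :reload : :skip
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "guide.md")
  File.write(path, "# Guide")
  recorded = File.mtime(path)

  puts sync_action(path, nil)                    # => load
  puts sync_action(path, recorded)               # => skip
  File.utime(recorded + 60, recorded + 60, path) # simulate a later edit
  puts sync_action(path, recorded)               # => reload
  puts sync_action(path, File.mtime(path), force: true) # => reload
end
```

Comparing modification times is cheap, but it will trigger a reload when a file is touched without its content changing.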
FileSource Model¶
The HTM::Models::FileSource model tracks loaded files:
source = HTM::Models::FileSource.find_by(file_path: "docs/guide.md")
source.file_path # Full path to file
source.mtime # Last modification time
source.needs_sync? # Check if file changed since load
source.chunks # Associated nodes (ordered by position)
source.frontmatter # Parsed YAML frontmatter
source.title # Frontmatter title (convenience)
source.author # Frontmatter author (convenience)
source.frontmatter_tags # Tags from frontmatter
Rake Tasks¶
HTM provides rake tasks for file management:
# Load a single file
rake 'htm:files:load[docs/guide.md]'
# Load directory
rake 'htm:files:load_dir[docs/]'
rake 'htm:files:load_dir[docs/,**/*.md]'
# List loaded files
rake htm:files:list
# Show file details
rake 'htm:files:info[docs/guide.md]'
# Unload a file
rake 'htm:files:unload[docs/guide.md]'
# Sync all files (reload changed)
rake htm:files:sync
# Show statistics
rake htm:files:stats
# Force reload with FORCE=true
FORCE=true rake 'htm:files:load[docs/guide.md]'
Best Practices¶
Organize Files Logically¶
# Load by category
htm.load_directory("docs/guides/", pattern: "**/*.md")
htm.load_directory("docs/api/", pattern: "**/*.md")
htm.load_directory("docs/tutorials/", pattern: "**/*.md")
Use Frontmatter for Metadata¶
Tune Chunk Size for Your Content¶
# Smaller chunks for dense technical content
HTM.configure { |c| c.chunk_size = 512 }
# Larger chunks for narrative content
HTM.configure { |c| c.chunk_size = 2048 }
Regular Sync for Updated Content¶
Example: Building a Knowledge Base¶
require 'htm'
# Initialize
htm = HTM.new(robot_name: "Knowledge Base")
# Configure chunking for technical docs
HTM.configure do |config|
config.chunk_size = 768
config.chunk_overlap = 100
end
# Load documentation
htm.load_directory("docs/", pattern: "**/*.md")
# Single files are loaded with load_file, not load_directory
htm.load_file("README.md")
htm.load_file("CHANGELOG.md")
# Query the knowledge base
results = htm.recall(
"How do I configure authentication?",
strategy: :hybrid,
limit: 5
)
results.each do |result|
puts result['content']
puts "---"
end
Related Documentation¶
- Adding Memories - Core memory operations
- Search Strategies - Querying loaded content
- API Reference: HTM - Complete API documentation
- Example: File Loading - Working example