Skip to content

Doc Tool Example

The DocTool provides an interface for document processing operations, particularly for reading and extracting text from PDF documents.

Overview

This example demonstrates how to use the DocTool facade to read and extract text from PDF documents. The tool supports reading single pages, multiple pages, and handles invalid page numbers gracefully.

Example Code

View the complete example: doc_tool_example.rb

Key Features

1. Reading a Single Page

Extract text from a specific page:

doc_tool = SharedTools::Tools::DocTool.new

result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: sample_pdf,
  page_numbers: "1"
)

puts "Total pages: #{result[:total_pages]}"
puts "Page 1 text: #{result[:pages].first[:text]}"

2. Reading Multiple Pages

Extract text from multiple specific pages:

result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: sample_pdf,
  page_numbers: "1, 2, 3"
)

result[:pages].each do |page|
  puts "Page #{page[:page]}: #{page[:text].length} characters"
end

3. Handling Invalid Page Numbers

The tool automatically filters out invalid page numbers:

result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: sample_pdf,
  page_numbers: "1, 999"
)

puts "Valid pages: #{result[:pages].size}"
puts "Invalid pages: #{result[:invalid_pages].inspect}"
# The tool automatically filters out page 999 if it doesn't exist

Extract text for search and analysis:

result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: sample_pdf,
  page_numbers: "1"
)

text = result[:pages].first[:text]

# Search for specific terms
search_terms = ['the', 'and', 'of']

search_terms.each do |term|
  count = text.downcase.scan(/\b#{term}\b/).size
  puts "'#{term}': #{count} occurrences"
end

Document Statistics

Calculate statistics from extracted text:

result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: sample_pdf,
  page_numbers: "1, 2, 3"
)

total_chars = 0
total_words = 0
total_lines = 0

result[:pages].each do |page|
  text = page[:text] || ""
  total_chars += text.length
  total_words += text.split.size
  total_lines += text.lines.count
end

puts "Total characters: #{total_chars}"
puts "Total words: #{total_words}"
puts "Average words per page: #{total_words / result[:pages].size}"

Using Individual Tools Directly

You can also use the PDF reader tool directly:

pdf_tool = SharedTools::Tools::Doc::PdfReaderTool.new

result = pdf_tool.execute(
  doc_path: sample_pdf,
  page_numbers: "1"
)

puts "Pages extracted: #{result[:pages].size}"
puts "Total document pages: #{result[:total_pages]}"

Practical Examples

Finding Section Headers

Extract potential section headers from a document:

result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: sample_pdf,
  page_numbers: "1, 2, 3"
)

result[:pages].each do |page|
  next unless page[:text]

  lines = page[:text].lines
  headers = lines.select do |line|
    line.strip.length > 5 &&
    line.strip == line.strip.upcase &&
    line.strip.match?(/^[A-Z\s]+$/)
  end

  if headers.any?
    puts "Page #{page[:page]}:"
    headers.each { |header| puts "  - #{header.strip}" }
  end
end

Word Frequency Analysis

Analyze word frequency across multiple pages:

# Read first 5 pages
result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: sample_pdf,
  page_numbers: "1, 2, 3, 4, 5"
)

# Combine all text
all_text = result[:pages].map { |p| p[:text] }.join(" ")

# Count word frequencies
words = all_text.downcase.scan(/\b[a-z]{4,}\b/)
freq = Hash.new(0)
words.each { |word| freq[word] += 1 }

# Show top 10 most common words
freq.sort_by { |_, count| -count }.first(10).each_with_index do |(word, count), i|
  puts "#{i + 1}. '#{word}': #{count} times"
end

Error Handling

The tool gracefully handles errors:

result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: "/nonexistent/file.pdf",
  page_numbers: "1"
)

if result[:error]
  puts "Error: #{result[:error]}"
  # The tool returns error information instead of raising exceptions
end

Result Structure

{
  total_pages: <number of pages in document>,
  requested_pages: <array of requested page numbers>,
  pages: [
    {
      page: <page number>,
      text: <extracted text>
    },
    ...
  ],
  invalid_pages: <array of invalid page numbers>,
  error: <error message if failed>
}

Available Actions

  • PDF_READ - Read and extract text from PDF pages

Run the Example

cd examples
bundle exec ruby doc_tool_example.rb

The example requires a test PDF file at test/fixtures/test.pdf. If you don't have one, you'll need to provide a sample PDF.

Requirements

  • The pdf-reader gem must be installed (included in SharedTools dependencies)
  • A valid PDF file to process

Key Takeaways

  • DocTool provides a unified interface for document processing
  • PDF reading supports single pages, multiple pages, and ranges
  • Invalid page numbers are automatically filtered out
  • Extracted text can be used for search, analysis, and processing
  • Error handling is built-in with descriptive error messages
  • Individual tools (PdfReaderTool) can be used directly for more control
  • Results include metadata about total pages and requested pages

Use Cases

  • Extracting text for full-text search
  • Analyzing document content
  • Finding specific information in PDFs
  • Converting PDF content to other formats
  • Document summarization and analysis
  • Building document indexing systems