Doc Tool Example¶
The DocTool provides an interface for document processing operations, particularly for reading and extracting text from PDF documents.
Overview¶
This example demonstrates how to use the DocTool facade to read and extract text from PDF documents. The tool can read a single page or several pages in one call, and it filters out invalid page numbers automatically.
Example Code¶
View the complete example: doc_tool_example.rb
Key Features¶
1. Reading a Single Page¶
Extract text from a specific page:
# sample_pdf holds the path to a PDF file, e.g. test/fixtures/test.pdf
doc_tool = SharedTools::Tools::DocTool.new
result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: sample_pdf,
  page_numbers: "1"
)
puts "Total pages: #{result[:total_pages]}"
puts "Page 1 text: #{result[:pages].first[:text]}"
2. Reading Multiple Pages¶
Extract text from multiple specific pages:
result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: sample_pdf,
  page_numbers: "1, 2, 3"
)
result[:pages].each do |page|
  puts "Page #{page[:page]}: #{page[:text].length} characters"
end
3. Handling Invalid Page Numbers¶
The tool automatically filters out invalid page numbers and reports them in result[:invalid_pages]:
result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: sample_pdf,
  page_numbers: "1, 999"
)
puts "Valid pages: #{result[:pages].size}"
puts "Invalid pages: #{result[:invalid_pages].inspect}"
# The tool automatically filters out page 999 if it doesn't exist
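One way such filtering could work is sketched below. This is not DocTool's actual implementation: `partition_page_numbers` is a hypothetical helper, and it assumes that pages are 1-indexed and that `page_numbers` is a comma-separated string, as in the calls above.

```ruby
# Hypothetical helper, not part of SharedTools.
# Splits a comma-separated page string into pages that exist in the
# document and pages that do not. Assumes 1-indexed page numbers.
def partition_page_numbers(page_numbers, total_pages)
  # Non-numeric entries become 0 via String#to_i and end up in invalid_pages.
  requested = page_numbers.split(",").map { |s| s.strip.to_i }
  valid, invalid = requested.partition { |n| n.between?(1, total_pages) }
  { requested_pages: requested, valid_pages: valid, invalid_pages: invalid }
end

parts = partition_page_numbers("1, 999", 3)
puts "Valid pages: #{parts[:valid_pages].inspect}"
puts "Invalid pages: #{parts[:invalid_pages].inspect}"
```

Keeping the invalid numbers in the return value, rather than silently dropping them, lets the caller decide whether a partially satisfied request is acceptable.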
4. Text Extraction and Search¶
Extract text for search and analysis:
result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: sample_pdf,
  page_numbers: "1"
)
text = result[:pages].first[:text]
# Search for specific terms (whole-word, case-insensitive)
search_terms = ['the', 'and', 'of']
search_terms.each do |term|
  # Regexp.escape keeps terms containing regex metacharacters literal
  count = text.scan(/\b#{Regexp.escape(term)}\b/i).size
  puts "'#{term}': #{count} occurrences"
end
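To search several pages at once, the same counting can be applied per page. The sketch below uses a hand-written hash in place of a real DocTool result; `term_counts_by_page` is a hypothetical helper, and the sample text is made up for illustration.

```ruby
# Hypothetical helper, not part of SharedTools.
# Counts whole-word, case-insensitive matches of each term on every page
# of a result hash that has :pages entries with :page and :text keys.
def term_counts_by_page(result, terms)
  result[:pages].to_h do |page|
    text = page[:text].to_s # tolerate pages without extracted text
    counts = terms.to_h do |term|
      [term, text.scan(/\b#{Regexp.escape(term)}\b/i).size]
    end
    [page[:page], counts]
  end
end

# Stand-in for a DocTool result; the text is illustrative only.
sample_result = {
  pages: [
    { page: 1, text: "The cat and the hat." },
    { page: 2, text: "Of mice and men." }
  ]
}

term_counts_by_page(sample_result, %w[the and of]).each do |page, counts|
  puts "Page #{page}: #{counts.inspect}"
end
```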
Document Statistics¶
Calculate statistics from extracted text:
result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: sample_pdf,
  page_numbers: "1, 2, 3"
)
total_chars = 0
total_words = 0
total_lines = 0
result[:pages].each do |page|
  text = page[:text] || ""
  total_chars += text.length
  total_words += text.split.size
  total_lines += text.lines.count
end
page_count = result[:pages].size
puts "Total characters: #{total_chars}"
puts "Total words: #{total_words}"
puts "Total lines: #{total_lines}"
# Skip the average when no pages were returned to avoid dividing by zero
puts "Average words per page: #{total_words / page_count}" if page_count.positive?
Using Individual Tools Directly¶
You can also use the PDF reader tool directly:
pdf_tool = SharedTools::Tools::Doc::PdfReaderTool.new
result = pdf_tool.execute(
  doc_path: sample_pdf,
  page_numbers: "1"
)
puts "Pages extracted: #{result[:pages].size}"
puts "Total document pages: #{result[:total_pages]}"
Practical Examples¶
Finding Section Headers¶
Extract potential section headers from a document:
result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: sample_pdf,
  page_numbers: "1, 2, 3"
)
result[:pages].each do |page|
  next unless page[:text]
  lines = page[:text].lines
  headers = lines.select do |line|
    line.strip.length > 5 &&
      line.strip == line.strip.upcase &&
      line.strip.match?(/^[A-Z\s]+$/)
  end
  if headers.any?
    puts "Page #{page[:page]}:"
    headers.each { |header| puts " - #{header.strip}" }
  end
end
Word Frequency Analysis¶
Analyze word frequency across multiple pages:
# Read first 5 pages
result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: sample_pdf,
  page_numbers: "1, 2, 3, 4, 5"
)
# Combine all text
all_text = result[:pages].map { |p| p[:text] }.join(" ")
# Count word frequencies
words = all_text.downcase.scan(/\b[a-z]{4,}\b/)
freq = Hash.new(0)
words.each { |word| freq[word] += 1 }
# Show top 10 most common words
freq.sort_by { |_, count| -count }.first(10).each_with_index do |(word, count), i|
  puts "#{i + 1}. '#{word}': #{count} times"
end
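The same counting can be packaged as a reusable method. The sketch below is one possible shape, not part of SharedTools: `top_words` and its keyword arguments are hypothetical names, and it relies on `Enumerable#tally`, which is available from Ruby 2.7.

```ruby
# Hypothetical helper, not part of SharedTools.
# Returns up to `limit` [word, count] pairs, most frequent first, counting
# only words of at least `min_length` ASCII letters.
# The order of words with equal counts is not guaranteed.
def top_words(text, limit: 10, min_length: 4)
  words = text.downcase.scan(/\b[a-z]{#{min_length},}\b/)
  words.tally.sort_by { |_, count| -count }.first(limit)
end

# Illustrative input; in practice this would be the combined page text.
sample_text = "alpha alpha alpha beta beta gamma the"
top_words(sample_text, limit: 2).each_with_index do |(word, count), i|
  puts "#{i + 1}. '#{word}': #{count} times"
end
```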
Error Handling¶
The tool reports failures in the result hash under the error key instead of raising an exception:
result = doc_tool.execute(
  action: SharedTools::Tools::DocTool::Action::PDF_READ,
  doc_path: "/nonexistent/file.pdf",
  page_numbers: "1"
)
if result[:error]
  puts "Error: #{result[:error]}"
  # The tool returns error information instead of raising exceptions
end
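Callers that prefer exceptions can convert the error key into one at a single point. The sketch below assumes only that a failed call sets `:error` in the returned hash; `DocReadError` and `pages_or_raise` are hypothetical names, and the error message is a stand-in, not text produced by the tool.

```ruby
# Hypothetical error class and helper; neither is part of SharedTools.
class DocReadError < StandardError; end

# Returns the pages array, or raises if the result hash carries an :error.
def pages_or_raise(result)
  raise DocReadError, result[:error] if result[:error]

  result[:pages] || []
end

# Stand-in for a failed call; the message text is illustrative only.
failed = { error: "File not found: /nonexistent/file.pdf" }

begin
  pages_or_raise(failed)
rescue DocReadError => e
  puts "Could not read document: #{e.message}"
end
```

This keeps the `if result[:error]` check out of every call site while leaving the tool's own behavior untouched.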
Result Structure¶
{
  total_pages: <number of pages in document>,
  requested_pages: <array of requested page numbers>,
  pages: [
    {
      page: <page number>,
      text: <extracted text>
    },
    ...
  ],
  invalid_pages: <array of invalid page numbers>,
  error: <error message if failed>
}
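For orientation, here is a concrete hash in that shape together with a small lookup helper. The values are invented for illustration, not captured from a real run, and `text_for_page` is a hypothetical helper. Whether the error key is absent or nil on success is not stated here, so the sketch simply leaves it out.

```ruby
# Illustrative values only; not output captured from a real run.
example_result = {
  total_pages: 3,
  requested_pages: [1, 999],
  pages: [{ page: 1, text: "Hello PDF" }],
  invalid_pages: [999]
}

# Hypothetical helper, not part of SharedTools.
# Returns the extracted text of one page, or nil if that page is absent.
def text_for_page(result, number)
  page = (result[:pages] || []).find { |p| p[:page] == number }
  page && page[:text]
end

puts text_for_page(example_result, 1).inspect
puts text_for_page(example_result, 999).inspect
```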
Available Actions¶
- PDF_READ - Read and extract text from PDF pages
Run the Example¶
The example requires a test PDF file at test/fixtures/test.pdf. If you don't have one, you'll need to provide a sample PDF.
Requirements¶
- The pdf-reader gem must be installed (included in SharedTools dependencies)
- A valid PDF file to process
Key Takeaways¶
- DocTool provides a unified interface for document processing
- PDF reading supports single pages, multiple pages, and ranges
- Invalid page numbers are automatically filtered out
- Extracted text can be used for search, analysis, and processing
- Error handling is built-in with descriptive error messages
- Individual tools (PdfReaderTool) can be used directly for more control
- Results include metadata about total pages and requested pages
Use Cases¶
- Extracting text for full-text search
- Analyzing document content
- Finding specific information in PDFs
- Converting PDF content to other formats
- Document summarization and analysis
- Building document indexing systems
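As a sketch of the indexing use case, the extracted pages can be turned into a simple inverted index that maps each word to the pages it appears on. `build_page_index` is a hypothetical helper, not part of SharedTools, and the sample hash stands in for a real DocTool result.

```ruby
require "set"

# Hypothetical helper, not part of SharedTools.
# Maps each lowercase word to the Set of page numbers it appears on,
# given a result hash with :pages entries holding :page and :text.
def build_page_index(result)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  result[:pages].each do |page|
    page[:text].to_s.downcase.scan(/[a-z0-9]+/) do |word|
      index[word] << page[:page]
    end
  end
  index
end

# Stand-in for a DocTool result; the text is illustrative only.
sample = {
  pages: [
    { page: 1, text: "Ruby reads PDF files" },
    { page: 2, text: "PDF text search" }
  ]
}

index = build_page_index(sample)
# Use fetch with a default so lookups of unknown words do not add entries.
puts index.fetch("pdf", Set.new).to_a.inspect
puts index.fetch("missing", Set.new).to_a.inspect
```

A Set per word avoids duplicate page numbers when a word occurs several times on the same page.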