Contributing¶

Guidelines for Contributing to the Project¶

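Welcome to Ragdoll! We appreciate your interest in contributing to this PostgreSQL-focused RAG system. This guide covers environment setup, the development workflow, and the submission process for code, documentation, and test contributions, along with bug reports and feature requests.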

Quick Start for Contributors¶

Fork the repository on GitHub
Clone your fork locally
Set up the development environment
Create a feature branch
Make your changes with tests
Submit a pull request

Getting Started¶

Development Setup¶

Fork and Clone Process¶

# 1. Fork the repository on GitHub (click Fork button)

# 2. Clone your fork
git clone https://github.com/YOUR_USERNAME/ragdoll.git
cd ragdoll

# 3. Add upstream remote
git remote add upstream https://github.com/original-org/ragdoll.git

# 4. Verify remotes
git remote -v
# origin    https://github.com/YOUR_USERNAME/ragdoll.git (fetch)
# origin    https://github.com/YOUR_USERNAME/ragdoll.git (push)
# upstream  https://github.com/original-org/ragdoll.git (fetch)
# upstream  https://github.com/original-org/ragdoll.git (push)

Development Environment Setup¶

Prerequisites:

- Ruby 3.0+ (recommended: 3.2+)
- PostgreSQL 12+ with pgvector extension
- Git 2.0+
- Basic development tools (gcc, make, etc.)
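The prerequisites can be checked before running the setup commands. The following is a minimal sketch, not part of the Ragdoll codebase: the helper names `ruby_version_ok?` and `tool_available?` are hypothetical, and checking for a `psql` executable is only a rough proxy for a working PostgreSQL installation (it says nothing about pgvector).

```ruby
# Hypothetical pre-flight check; not part of the Ragdoll codebase.
MINIMUM_RUBY = Gem::Version.new('3.0.0')

# Returns true when the given Ruby version string meets the documented minimum.
def ruby_version_ok?(version = RUBY_VERSION)
  Gem::Version.new(version) >= MINIMUM_RUBY
end

# Returns true when an executable with the given name is found on PATH.
def tool_available?(name)
  ENV.fetch('PATH', '').split(File::PATH_SEPARATOR).any? do |dir|
    File.executable?(File.join(dir, name))
  end
end

if $PROGRAM_NAME == __FILE__
  puts "Ruby #{RUBY_VERSION}: #{ruby_version_ok? ? 'ok' : 'too old'}"
  %w[git psql].each do |tool|
    puts "#{tool}: #{tool_available?(tool) ? 'found' : 'missing'}"
  end
end
```

Comparing with `Gem::Version` rather than plain strings avoids mistakes such as treating "3.10" as older than "3.2".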

# Install dependencies
bundle install

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys and database credentials

# Set up PostgreSQL database
./bin/setup_postgresql.rb

# Run tests to verify setup
bundle exec rake test

# Check code style
bundle exec rubocop

Initial Configuration¶

# config/development.rb (create if needed)
require 'logger'

Ragdoll::Core.configure do |config|
  config.database_config = {
    adapter: 'postgresql',
    database: 'ragdoll_development',
    username: 'ragdoll',
    password: ENV['DATABASE_PASSWORD'],
    host: 'localhost',
    port: 5432,
    auto_migrate: true,
    logger: Logger.new(STDOUT, level: Logger::INFO)
  }

  # Set up API keys for testing
  config.ruby_llm_config[:openai][:api_key] = ENV['OPENAI_API_KEY']
end
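Because the configuration above reads `DATABASE_PASSWORD` and `OPENAI_API_KEY` from the environment, it can help to fail early when either is unset. This is a minimal sketch under that assumption; `missing_env_vars` is a hypothetical helper, not a Ragdoll API.

```ruby
# Hypothetical helper: collects the names of required environment variables
# that are unset or blank. The variable names come from the configuration
# example above.
REQUIRED_ENV_VARS = %w[DATABASE_PASSWORD OPENAI_API_KEY].freeze

def missing_env_vars(env = ENV, required = REQUIRED_ENV_VARS)
  required.select { |name| env[name].to_s.strip.empty? }
end

missing = missing_env_vars
warn "Missing environment variables: #{missing.join(', ')}" unless missing.empty?
```

Passing the environment in as an argument keeps the helper easy to test with a plain Hash.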

Understanding the Codebase¶

Architecture Overview¶

flowchart TD
    A[Client Layer] --> B[Service Layer]
    B --> C[Model Layer]
    C --> D[Database Layer]

    B --> E[Jobs Layer]
    E --> F[External APIs]

    D --> G[PostgreSQL + pgvector]
    F --> H[LLM Providers]

    subgraph "Core Services"
        I[DocumentProcessor]
        J[EmbeddingService]
        K[SearchEngine]
        L[TextGenerationService]
    end

    B --> I
    B --> J
    B --> K
    B --> L

Code Organization¶

lib/ragdoll/core/
├── client.rb              # Main public API
├── configuration.rb       # System configuration
├── database.rb           # Database setup and migrations
├── document_processor.rb # Multi-format parsing
├── embedding_service.rb  # Vector generation
├── search_engine.rb      # Semantic search
├── text_chunker.rb       # Content segmentation
├── text_generation_service.rb # LLM integration
├── jobs/                 # Background processing
│   ├── generate_embeddings.rb
│   ├── extract_keywords.rb
│   └── generate_summary.rb
├── models/               # ActiveRecord models
│   ├── document.rb
│   ├── embedding.rb
│   └── content.rb        # STI base class
└── services/             # Specialized services
    ├── metadata_generator.rb
    └── image_description_service.rb

Design Patterns¶

Service Layer Pattern: Business logic in dedicated service classes
Repository Pattern: Database access through ActiveRecord models
Factory Pattern: Document creation through DocumentProcessor
Strategy Pattern: Multiple LLM providers via configuration
Observer Pattern: Background jobs triggered by model callbacks
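To make the Strategy Pattern entry concrete, here is a small illustrative sketch of selecting a provider by name. The classes `EchoProvider`, `UppercaseProvider`, and `TextGenerator` are hypothetical stand-ins and do not reflect Ragdoll's actual provider classes or configuration keys.

```ruby
# Illustrative only: these provider classes are hypothetical stand-ins,
# not Ragdoll's actual classes.
class EchoProvider
  def generate(prompt)
    "echo: #{prompt}"
  end
end

class UppercaseProvider
  def generate(prompt)
    prompt.upcase
  end
end

# The generator depends only on the #generate interface, so providers can be
# swapped by name without changing the calling code.
class TextGenerator
  PROVIDERS = {
    echo: EchoProvider,
    uppercase: UppercaseProvider
  }.freeze

  def initialize(provider_name)
    provider_class = PROVIDERS.fetch(provider_name) do
      raise ArgumentError, "Unknown provider: #{provider_name}"
    end
    @provider = provider_class.new
  end

  def generate(prompt)
    @provider.generate(prompt)
  end
end
```

Adding a provider in this sketch means adding one class and one entry in `PROVIDERS`; the callers of `TextGenerator#generate` stay unchanged.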

Coding Conventions¶

Ruby Style: Follow Ruby community conventions
RuboCop: Automated style enforcement
Naming: Descriptive names over comments
Methods: Single responsibility, max 20 lines
Classes: Max 200 lines, extract services for complex logic

Contribution Types¶

Code Contributions¶

Bug Fixes¶

# Create bug fix branch
git checkout -b fix/document-parsing-error

# Write failing test first
# test/core/document_processor_test.rb
def test_handles_corrupted_pdf
  assert_raises(DocumentProcessor::ParseError) do
    DocumentProcessor.parse('test/fixtures/corrupted.pdf')
  end
end

# Implement fix
# lib/ragdoll/core/document_processor.rb
def parse_pdf
  # Add error handling
rescue PDF::Reader::MalformedPDFError => e
  raise ParseError, "Corrupted PDF: #{e.message}"
end

# Verify fix works
bundle exec rake test
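The fix above wraps a low-level parser error in a domain-specific `ParseError`. The sketch below shows the same idea in a form that runs without Ragdoll or the pdf-reader gem: `MalformedInputError` plays the role of `PDF::Reader::MalformedPDFError`, and `SketchProcessor` is a hypothetical stand-in for the real processor.

```ruby
# Stand-ins so the sketch runs without the pdf-reader gem or Ragdoll itself.
# MalformedInputError plays the role of PDF::Reader::MalformedPDFError.
class MalformedInputError < StandardError; end

class SketchProcessor
  class ParseError < StandardError; end

  # reader: any callable that returns text or raises MalformedInputError
  def initialize(reader)
    @reader = reader
  end

  def parse
    @reader.call
  rescue MalformedInputError => e
    # Re-raise as a domain error so callers only need to rescue ParseError.
    raise ParseError, "Corrupted PDF: #{e.message}"
  end
end
```

Ruby records the original exception in `cause` when a new one is raised inside a `rescue` clause, so the underlying error remains available for debugging.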

Feature Implementations¶

# Create feature branch
git checkout -b feature/add-excel-support

# Plan implementation
# 1. Add Excel gem dependency
# 2. Implement parse_excel method
# 3. Add Excel to supported formats
# 4. Write comprehensive tests
# 5. Update documentation

# Implementation example
# Gemfile
gem 'roo', '~> 2.9'
gem 'roo-xls' # legacy .xls support was split out of roo into this gem

# lib/ragdoll/core/document_processor.rb
when '.xlsx', '.xls'
  parse_excel

private

def parse_excel
  workbook = Roo::Spreadsheet.open(@file_path)
  content = extract_excel_content(workbook)
  metadata = extract_excel_metadata(workbook)

  {
    content: content,
    metadata: metadata,
    document_type: 'excel'
  }
end
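The `when '.xlsx', '.xls'` branch above belongs to a dispatch on the file extension. A minimal sketch of that dispatch follows; `parser_for` and the returned symbols are hypothetical placeholders, and the list of extensions other than Excel is an assumption made for illustration.

```ruby
# Hypothetical dispatcher; the returned symbols are placeholders for the
# real parse_* methods.
class UnsupportedFormatError < StandardError; end

def parser_for(file_path)
  case File.extname(file_path).downcase
  when '.pdf'          then :parse_pdf
  when '.xlsx', '.xls' then :parse_excel
  when '.txt', '.md'   then :parse_text
  else
    raise UnsupportedFormatError, "Unsupported format: #{file_path}"
  end
end
```

Lowercasing the extension first means `Report.XLSX` and `report.xlsx` take the same branch.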

Performance Improvements¶

# Example: Optimize batch embedding generation
class EmbeddingService
  def generate_embeddings_batch_optimized(texts, batch_size: 50)
    # Process in smaller batches to reduce memory usage
    texts.each_slice(batch_size).flat_map do |batch|
      generate_embeddings_batch(batch)
    end
  end
end

# Add benchmark test
require 'benchmark'

class EmbeddingServicePerformanceTest < Minitest::Test
  def setup
    # Adjust constructor arguments to match your configuration
    @service = EmbeddingService.new
  end

  def test_batch_processing_performance
    texts = Array.new(1000) { "Sample text #{rand(1000)}" }

    time = Benchmark.measure do
      @service.generate_embeddings_batch_optimized(texts)
    end

    assert time.real < 30, "Batch processing too slow: #{time.real}s"
  end
end
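The batching itself can be tried without calling an external API. In the sketch below, `FakeEmbedder` is a hypothetical stand-in that returns one dummy vector per text and records the size of each batch it receives.

```ruby
# Hypothetical stand-in for an embedding API client; it records batch sizes
# so the batching behaviour is visible.
class FakeEmbedder
  attr_reader :batch_sizes

  def initialize
    @batch_sizes = []
  end

  # Returns one fake single-element "vector" per input text.
  def generate_embeddings_batch(texts)
    @batch_sizes << texts.size
    texts.map { |text| [text.length.to_f] }
  end

  # Splits the input into slices and concatenates the per-batch results.
  def generate_embeddings_batch_optimized(texts, batch_size: 50)
    texts.each_slice(batch_size).flat_map do |batch|
      generate_embeddings_batch(batch)
    end
  end
end

embedder = FakeEmbedder.new
vectors = embedder.generate_embeddings_batch_optimized(Array.new(120) { |i| "text #{i}" })
puts vectors.size                 # 120
puts embedder.batch_sizes.inspect # [50, 50, 20]
```

`flat_map` flattens only one level, so each result stays an individual vector while the per-batch arrays are merged into a single list in input order.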

Documentation Contributions¶

Documentation Updates¶

# Always include code examples
## New Feature Documentation

### Usage

```ruby
# Basic usage
client = Ragdoll::Core.client
result = client.new_feature(param: 'value')

# Advanced usage with options
result = client.new_feature(
  param: 'value',
  advanced_option: true,
  callback: ->(data) { puts data }
)
```

### Configuration

```ruby
Ragdoll::Core.configure do |config|
  config.new_feature_config = {
    enabled: true,
    timeout: 30
  }
end
```

API Documentation¶

# Use YARD-style documentation
class NewService
  # Process documents with advanced filtering
  #
  # @param documents [Array<Document>] Documents to process
  # @param options [Hash] Processing options
  # @option options [String] :filter_type ('all') Filter criteria
  # @option options [Integer] :batch_size (100) Batch processing size
  # @return [Array<Hash>] Processed document results
  # @raise [ProcessingError] When processing fails
  #
  # @example Basic usage
  #   service = NewService.new
  #   results = service.process(documents)
  #
  # @example With options
  #   results = service.process(documents, 
  #     filter_type: 'academic',
  #     batch_size: 50
  #   )
  def process(documents, options = {})
    # Implementation
  end
end

Testing Contributions¶

Test Coverage Improvements¶

# Always test edge cases
class DocumentProcessorTest < Minitest::Test
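  def test_handles_empty_file
    File.write('test/fixtures/empty.txt', '')
    result = DocumentProcessor.parse('test/fixtures/empty.txt')
    assert_equal '', result[:content]
  ensure
    # Remove the generated fixture so it does not leak into other tests
    File.delete('test/fixtures/empty.txt') if File.exist?('test/fixtures/empty.txt')
  end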

  def test_handles_binary_file
    assert_raises(DocumentProcessor::UnsupportedFormatError) do
      DocumentProcessor.parse('test/fixtures/binary.exe')
    end
  end

  def test_handles_very_large_file
    # Create 100MB test file
    large_content = 'x' * (100 * 1024 * 1024)
    File.write('test/fixtures/large.txt', large_content)

    result = DocumentProcessor.parse('test/fixtures/large.txt')
    assert result[:content].length > 0
  ensure
    File.delete('test/fixtures/large.txt') if File.exist?('test/fixtures/large.txt')
  end
end
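An alternative to writing into `test/fixtures` and cleaning up by hand is to generate edge-case files with `Tempfile.create`, which removes the file when its block exits. The sketch below assumes Minitest is available; `read_content` is a hypothetical stand-in for `DocumentProcessor.parse` on plain-text input, and the large file is reduced to 1 MB to keep the example fast.

```ruby
require 'minitest/autorun'
require 'tempfile'

# Hypothetical stand-in for DocumentProcessor.parse on plain-text input.
def read_content(path)
  { content: File.read(path) }
end

class TempfileEdgeCaseTest < Minitest::Test
  def test_handles_empty_file
    Tempfile.create(['empty', '.txt']) do |file|
      assert_equal '', read_content(file.path)[:content]
    end
  end

  def test_handles_large_file
    Tempfile.create(['large', '.txt']) do |file|
      file.write('x' * (1024 * 1024)) # 1 MB keeps the sketch fast
      file.flush
      assert_equal 1024 * 1024, read_content(file.path)[:content].length
    end
  end
end
```

Because each test owns its temporary file, the tests stay independent and nothing is left behind in the repository if an assertion fails.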

Development Process¶

Branch Management¶

Branch Naming Conventions¶

Features: feature/short-description
Bug fixes: fix/issue-description
Documentation: docs/section-being-updated
Refactoring: refactor/component-name
Performance: perf/optimization-area
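These conventions are simple enough to check mechanically, for example in a local Git hook. The following is a sketch only: the prefixes come from the list above, while the lowercase kebab-case rule for the description part and the name `valid_branch_name?` are assumptions.

```ruby
# Prefixes are taken from the naming conventions above; the kebab-case rule
# for the description part is an assumption.
BRANCH_PATTERN = %r{\A(feature|fix|docs|refactor|perf)/[a-z0-9]+(-[a-z0-9]+)*\z}

def valid_branch_name?(name)
  BRANCH_PATTERN.match?(name)
end
```

For example, `valid_branch_name?('feature/add-excel-support')` is true, while a name with an unlisted prefix such as `hotfix/typo` is rejected.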

Commit Message Format¶

Type: Brief description (50 characters max)

Detailed explanation of what changed and why. Include:
- What problem this solves
- How it was implemented
- Any breaking changes
- References to issues (#123)

Types: feat, fix, docs, style, refactor, test, chore

Examples:

feat: Add Excel document processing support

- Implement parse_excel method using roo gem
- Support .xlsx and .xls file formats
- Extract cell values, formulas, and metadata
- Add comprehensive test coverage
- Update documentation with usage examples

Resolves #145

fix: Handle corrupted PDF files gracefully

- Add proper error handling for PDF::Reader exceptions
- Return informative error messages
- Add test cases for various corruption scenarios
- Prevent application crashes during parsing

Fixes #178
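The subject-line rules can be checked with a few lines of Ruby. This sketch assumes the 50-character limit applies to the whole subject line, including the type prefix; `commit_subject_problems` is a hypothetical helper, not an existing project tool.

```ruby
COMMIT_TYPES = %w[feat fix docs style refactor test chore].freeze
MAX_SUBJECT_LENGTH = 50

# Returns a list of problems with the subject line (empty when it is valid).
def commit_subject_problems(subject)
  problems = []
  type, description = subject.split(': ', 2)
  problems << "unknown type '#{type}'" unless COMMIT_TYPES.include?(type)
  problems << 'missing description' if description.to_s.strip.empty?
  problems << "longer than #{MAX_SUBJECT_LENGTH} characters" if subject.length > MAX_SUBJECT_LENGTH
  problems
end
```

Returning a list of problems rather than a boolean lets a hook report every issue in one pass.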

Code Quality¶

Pre-commit Checklist¶

# Run before every commit

# 1. Code style check
bundle exec rubocop

# 2. Run all tests
bundle exec rake test

# 3. Check test coverage
open coverage/index.html   # macOS; use xdg-open on Linux
# Ensure coverage >= 85%

# 4. Run specific tests for changed code
bundle exec rake test TEST=test/core/your_changed_test.rb

# 5. Manual testing of changes
./bin/console
# Test your changes interactively
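The 85% coverage check can also be scripted instead of read off the HTML report. The sketch below assumes the report is produced by SimpleCov and that it writes a `coverage/.last_run.json` summary with the percentage under `result.line` (or `result.covered_percent` in older versions); both the file location and the key names are assumptions to verify against your setup.

```ruby
require 'json'

COVERAGE_THRESHOLD = 85.0

# Extracts the line-coverage percentage from the JSON text of a last-run
# summary. Both key names are assumptions about the coverage tool's output.
def coverage_percent(json_text)
  result = JSON.parse(json_text).fetch('result', {})
  result['line'] || result['covered_percent']
end

# True when a percentage is present and meets the threshold.
def coverage_ok?(json_text, threshold: COVERAGE_THRESHOLD)
  percent = coverage_percent(json_text)
  !percent.nil? && percent >= threshold
end

summary_path = 'coverage/.last_run.json' # assumed location
if File.exist?(summary_path)
  ok = coverage_ok?(File.read(summary_path))
  puts ok ? 'Coverage meets the threshold' : 'Coverage is below the threshold'
end
```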

Code Review Preparation¶

# Before creating PR

# 1. Rebase on latest main
git fetch upstream
git rebase upstream/main

# 2. Squash commits if needed
git rebase -i HEAD~3   # replace 3 with the number of commits to review

# 3. Update documentation
# Update relevant .md files
# Add code examples
# Update CHANGELOG.md

# 4. Self-review your changes
git diff upstream/main..HEAD

Submission Process¶

Pull Request Process¶

PR Creation Guidelines¶

Title: Clear, concise description of changes
Description: Use the PR template
Labels: Add appropriate labels (bug, feature, documentation)
Reviewers: Request reviews from maintainers
Linked Issues: Reference related issues

PR Template¶

## Summary

Brief description of changes and motivation.

## Changes Made

- [ ] Added new feature X
- [ ] Fixed bug in component Y
- [ ] Updated documentation
- [ ] Added test coverage

## Testing

- [ ] All existing tests pass
- [ ] Added new tests for changes
- [ ] Manual testing completed
- [ ] Performance impact assessed

## Documentation

- [ ] Updated relevant documentation
- [ ] Added code examples
- [ ] Updated CHANGELOG.md

## Breaking Changes

- [ ] No breaking changes
- [ ] Breaking changes documented below

## Checklist

- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Code is commented where needed
- [ ] Tests cover edge cases

Review Process¶

Automated Checks: CI must pass
Code Review: At least one maintainer approval
Testing: Manual testing by reviewer
Documentation: Verify docs are complete
Merge: Squash and merge after approval

Code Review Guidelines¶

For Contributors¶

Respond promptly to review feedback
Ask questions if feedback is unclear
Test suggested changes before implementing
Update PR description if scope changes
Be patient - thorough reviews take time

For Reviewers¶

Be constructive and specific in feedback
Explain the why behind suggestions
Acknowledge good code when you see it
Test the changes locally when possible
Consider backward compatibility impact

Bug Reports¶

Issue Templates¶

Bug Report Template¶

**Bug Description**
A clear description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Configure Ragdoll with...
2. Process document...
3. Call search method...
4. See error

**Expected Behavior**
What you expected to happen.

**Actual Behavior**
What actually happened, including error messages.

**Environment**
- Ruby version: [e.g. 3.2.0]
- Ragdoll version: [e.g. 0.1.0]
- PostgreSQL version: [e.g. 14.2]
- Operating System: [e.g. macOS 12.0]

**Additional Context**
- Configuration details
- Log output
- Sample files (if applicable)
- Stack trace

**Possible Solution**
(Optional) Suggest a fix or workaround

Information Requirements¶

Always include:

- Reproduction steps: Minimal code to reproduce
- Environment details: Ruby, PostgreSQL, OS versions
- Configuration: Relevant Ragdoll configuration
- Error messages: Complete stack traces
- Expected vs actual: What should happen vs what happens

For document processing issues:

- File format and size
- Sample file (if not sensitive)
- Processing configuration

For search issues:

- Query text
- Search configuration
- Database state (document count, embedding count)
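Part of the environment section can be generated rather than typed by hand. This is a minimal sketch using only the Ruby standard library; `environment_report` is a hypothetical helper, and the PostgreSQL and Ragdoll versions are left out because they depend on your installation.

```ruby
require 'rbconfig'

# Collects the Ruby and OS details requested in the bug report template.
# PostgreSQL and Ragdoll versions are not detected here; add them by hand.
def environment_report
  {
    'Ruby version' => RUBY_VERSION,
    'Ruby platform' => RUBY_PLATFORM,
    'Operating System' => RbConfig::CONFIG['host_os']
  }
end

environment_report.each { |label, value| puts "- #{label}: #{value}" }
```

The output is formatted as a list so it can be pasted directly under the Environment heading of the template.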

Feature Requests¶

Request Format¶

Feature Request Template¶

**Feature Summary**
One-line summary of the requested feature.

**Problem Statement**
What problem does this solve? What use case does it enable?

**Proposed Solution**
Detailed description of the proposed implementation.

**Alternative Solutions**
Other approaches you've considered.

**Use Cases**
Specific scenarios where this would be useful:
1. Scenario 1: ...
2. Scenario 2: ...

**Implementation Considerations**
- Database schema changes needed
- API changes required
- Backward compatibility impact
- Performance implications

**Priority**
- [ ] Critical - Blocking current work
- [ ] High - Important for upcoming release
- [ ] Medium - Would be nice to have
- [ ] Low - Future consideration

**Would you be willing to implement this?**
- [ ] Yes, I can submit a PR
- [ ] Yes, with guidance
- [ ] No, but I can test
- [ ] No

Evaluation Process¶

Initial Review: Maintainers assess fit with project goals
Community Discussion: Gather feedback from users
Technical Design: Plan implementation approach
Priority Assignment: Based on impact and complexity
Implementation: Either by maintainers or community

Community Guidelines¶

Code of Conduct¶

Behavioral Expectations¶

Be respectful in all interactions
Be inclusive and welcoming to newcomers
Be constructive in criticism and feedback
Be patient with questions and learning
Be professional in all communications

Communication Guidelines¶

Use clear, descriptive titles for issues and PRs
Provide context when asking questions
Search existing issues before creating new ones
Stay on topic in discussions
Use appropriate channels for different types of communication

Communication Channels¶

GitHub Issues: Bug reports, feature requests
GitHub Discussions: General questions, ideas
Pull Requests: Code review and discussion
Documentation: Inline comments and suggestions

Release Process¶

Version Management¶

Ragdoll follows Semantic Versioning (SemVer):

MAJOR.MINOR.PATCH (e.g., 1.2.3)
MAJOR: Breaking changes
MINOR: New features (backward compatible)
PATCH: Bug fixes (backward compatible)
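The bump rules can be illustrated with a short sketch. `bump_version` is a hypothetical helper that handles plain MAJOR.MINOR.PATCH strings only, without pre-release or build metadata.

```ruby
# Hypothetical helper; handles plain MAJOR.MINOR.PATCH strings only
# (no pre-release or build metadata).
def bump_version(version, level)
  major, minor, patch = version.split('.').map { |part| Integer(part) }
  case level
  when :major then "#{major + 1}.0.0"
  when :minor then "#{major}.#{minor + 1}.0"
  when :patch then "#{major}.#{minor}.#{patch + 1}"
  else raise ArgumentError, "Unknown level: #{level}"
  end
end

puts bump_version('1.2.3', :patch) # 1.2.4
puts bump_version('1.2.3', :minor) # 1.3.0
puts bump_version('1.2.3', :major) # 2.0.0
```

Note that a MINOR bump resets PATCH to zero, and a MAJOR bump resets both MINOR and PATCH.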

Release Cycle¶

Patch releases: As needed for critical bugs
Minor releases: Monthly with new features
Major releases: Quarterly with breaking changes

Changelog Maintenance¶

# Changelog

## [Unreleased]

### Added
- New feature descriptions

### Changed
- Modified behavior descriptions

### Fixed
- Bug fix descriptions

### Removed
- Deprecated feature removals

## [1.2.0] - 2024-01-15

### Added
- Excel document processing support
- Batch embedding optimization

### Fixed
- PDF parsing error handling
- Memory leak in background jobs

Getting Help¶

Before Contributing¶

Read the documentation - Check existing docs first
Search issues - Your question might already be answered
Try the troubleshooting guide - Common issues and solutions
Check the development guide - Setup and workflow information

Need Assistance?¶

Questions: Use GitHub Discussions
Bugs: Create detailed issue reports
Features: Submit feature request with use cases
Code: Start with small contributions and ask for guidance

Recognition¶

Contributors are recognized in:

- README.md: Contributor section
- CHANGELOG.md: Release notes
- GitHub: Automatic contribution tracking
- Documentation: Author attribution where appropriate

Thank you for contributing to Ragdoll! 🎉

This document is part of the Ragdoll documentation suite. For immediate help, see the Quick Start Guide or API Reference.