Overview

When a file is uploaded via Active Storage (backed by S3), the system automatically performs the following steps (a code sketch follows the list):
  1. Uploads the file to S3 via Active Storage
  2. Enqueues a ProcessFileJob background job on the file_processing queue
  3. Parses the document with the Unstructured.io API
  4. Generates embeddings with OpenAI
  5. Uploads the vectors to Turbopuffer in batches (default batch size: 100)
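
A minimal sketch of how these steps might fit together inside the job. ProcessFileJob and TurbopufferService.for_organization are documented below; UnstructuredClient, OpenAIEmbedder, and upsert_chunks are illustrative placeholders, not the app's actual classes:

# app/jobs/process_file_job.rb -- illustrative sketch only
class ProcessFileJob < ApplicationJob
  queue_as :file_processing

  def perform(file_data_source_id, options = {})
    source = FileDataSource.find(file_data_source_id)

    # Parse the attached file via the Unstructured.io API
    chunks = UnstructuredClient.parse(source.file, strategy: options.fetch(:strategy, "auto"))

    # Generate one embedding per chunk with OpenAI
    vectors = chunks.map { |chunk| OpenAIEmbedder.embed(chunk) }

    # Upload to Turbopuffer in batches (default: 100)
    turbopuffer = TurbopufferService.for_organization(source.organization_id)
    vectors.each_slice(options.fetch(:batch_size, 100)) do |batch|
      turbopuffer.upsert_chunks(batch)  # hypothetical upload method
    end
  end
end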

Environment Variables

Required Variables

# Unstructured.io API
UNSTRUCTURED_API_KEY=your_unstructured_api_key_here
UNSTRUCTURED_API_URL=https://api.unstructuredapp.io  # Optional, defaults to this

# OpenAI (for embeddings)
OPENAI_API_KEY=your_openai_api_key_here

# Turbopuffer
TURBOPUFFER_API_KEY=your_turbopuffer_api_key_here
TURBOPUFFER_REGION=gcp-us-east4  # Optional, defaults to this

# AWS S3 (Active Storage)
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION=us-east-1
AWS_S3_BUCKET=your_bucket_name

Optional Worker Concurrency

# Default queue worker concurrency (default: 1)
JOB_CONCURRENCY=2

# File processing queue concurrency (default: 2)
FILE_PROCESSING_CONCURRENCY=4
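
These variables presumably feed the Solid Queue worker configuration. A plausible config/queue.yml wiring, assuming the standard Solid Queue layout (a sketch, not the app's actual config):

# config/queue.yml -- illustrative wiring of the concurrency variables
default: &default
  workers:
    - queues: "*"
      threads: <%= ENV.fetch("JOB_CONCURRENCY", 1) %>
    - queues: file_processing
      threads: <%= ENV.fetch("FILE_PROCESSING_CONCURRENCY", 2) %>

development:
  <<: *default

production:
  <<: *default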

Setting Secrets in Production (Fly.io)

# Set Unstructured.io API key
fly secrets set UNSTRUCTURED_API_KEY=your_key_here

# Set OpenAI API key (if not already set)
fly secrets set OPENAI_API_KEY=your_key_here

# Set Turbopuffer API key (if not already set)
fly secrets set TURBOPUFFER_API_KEY=your_key_here
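
You can confirm the secrets are present (values stay hidden) with:

# List secrets set on the app
fly secrets list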

Supported File Types

  • PDF: application/pdf
  • Word Documents: .docx, .doc
  • Text: .txt
  • HTML: .html
  • Markdown: .md
  • PowerPoint: .pptx, .ppt
File Size Limit: 50MB per file
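
A sketch of the Active Storage validation that would enforce these constraints. FileDataSource is the documented model; the constant names and validator below are illustrative assumptions:

class FileDataSource < ApplicationRecord
  has_one_attached :file

  MAX_FILE_SIZE = 50.megabytes
  SUPPORTED_CONTENT_TYPES = %w[
    application/pdf
    application/vnd.openxmlformats-officedocument.wordprocessingml.document
    application/msword
    text/plain
    text/html
    text/markdown
    application/vnd.openxmlformats-officedocument.presentationml.presentation
    application/vnd.ms-powerpoint
  ].freeze

  validate :file_constraints

  private

  def file_constraints
    return unless file.attached?

    errors.add(:file, "exceeds the 50MB limit") if file.byte_size > MAX_FILE_SIZE
    errors.add(:file, "type is not supported") unless SUPPORTED_CONTENT_TYPES.include?(file.content_type)
  end
end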

Usage

Automatic Processing (Default)

Files are automatically processed when uploaded:
# In your controller
file_data_source = FileDataSource.create!(
  organization_id: current_organization.id,
  file: params[:file]
)

# ProcessFileJob is automatically enqueued after commit
# No manual intervention needed!
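
The enqueue-after-commit behavior is typically wired with a model callback; a minimal sketch of what that likely looks like in FileDataSource:

class FileDataSource < ApplicationRecord
  has_one_attached :file

  # Enqueue processing only after the record and its attachment are committed
  after_create_commit :enqueue_processing

  private

  def enqueue_processing
    ProcessFileJob.perform_later(id)
  end
end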

Manual Processing

You can manually trigger processing:
# Trigger with default options
ProcessFileJob.perform_later(file_data_source.id)

# Trigger with custom options
ProcessFileJob.perform_later(
  file_data_source.id,
  {
    strategy: "hi_res",        # Unstructured.io strategy
    chunk_size: 1500,          # Maximum characters per chunk
    overlap: 300,              # Character overlap between chunks
    batch_size: 50             # Turbopuffer upload batch size
  }
)

Checking Processing Status

file_data_source = FileDataSource.find(id)
metadata = file_data_source.metadata

# Check status
metadata["processing_status"]  # "processing", "completed", "failed"

# Timestamps
metadata["processing_started_at"]
metadata["processing_completed_at"]
metadata["processing_failed_at"]

# Results
metadata["chunk_count"]        # Number of chunks uploaded
metadata["total_chunks"]       # Total chunks extracted
metadata["error"]              # Error message if failed

Querying Uploaded Chunks

Once processed, you can query the chunks:
# Get Turbopuffer service for organization
turbopuffer = TurbopufferService.for_organization(organization_id)

# Search for similar content
result = turbopuffer.query_similar(
  "What is the revenue forecast?",
  top_k: 10
)

# Filter by specific file
result = turbopuffer.query_similar(
  "What is the revenue forecast?",
  top_k: 10,
  filters: { file_id: ["Equals", file_data_source.id] }
)

# Delete all chunks for a file
turbopuffer.delete_file_chunks(file_data_source.id)

Monitoring

View Logs

# Local development
tail -f log/development.log | grep ProcessFileJob

# Production (Fly.io)
fly logs --app your-app-name

Troubleshooting

Files not processing

  • Check Solid Queue: Ensure bin/rails solid_queue:start is running
  • Check environment variables: Verify all API keys are set
  • Check job dashboard: Visit http://localhost:3000/jobs

Processing failed

  • Check API keys: Verify UNSTRUCTURED_API_KEY, OPENAI_API_KEY, TURBOPUFFER_API_KEY
  • Check file size: Files over 50MB are automatically discarded
  • Check file type: Verify file type is supported
  • Review logs: Check Rails logs for specific errors
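
Once the underlying issue is fixed, you can inspect the recorded error and re-enqueue the job from the console:

file_data_source = FileDataSource.find(id)
file_data_source.metadata["error"]   # failure message recorded by the job

# Re-run processing with default options
ProcessFileJob.perform_later(file_data_source.id)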

Performance Optimization

Batch Size Tuning

Adjust the Turbopuffer upload batch size to trade memory use against throughput, as in the example below:
  • Smaller batches (50-75): faster feedback, lower memory use
  • Larger batches (100-200): better throughput, higher memory use
ProcessFileJob.perform_later(file_id, { batch_size: 150 })

Concurrency Settings

For high-volume processing:
# Increase file processing workers
FILE_PROCESSING_CONCURRENCY=8

# Scale worker machines
fly scale count worker=5

Cost Estimation

Per 1000-Page Document

  • Unstructured.io: ~$0.10 - $0.50 per document
  • OpenAI Embeddings: ~$0.01 per 1000 pages
  • Turbopuffer: ~$0.0004/month storage
Total: ~$0.11 - $0.51 per 1000-page document, plus minimal storage costs
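
A back-of-the-envelope check of the embedding figure, assuming roughly 500 tokens per page and embedding pricing around $0.02 per million tokens (adjust for your model):

pages           = 1_000
tokens_per_page = 500                  # assumption
usd_per_token   = 0.02 / 1_000_000.0   # assumed embedding rate

pages * tokens_per_page * usd_per_token
# => 0.01, i.e. ~$0.01 per 1000 pages, matching the estimate above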

Next Steps

  • Billing Setup: Configure credit-based billing