Batch Processing

CASSIA supports batch processing to analyze multiple clusters simultaneously. This guide explains how to prepare your data and run batch analysis efficiently.

Model Recommendations

If you're using OpenRouter as your provider, you can specify models like "openai/gpt-4o-2024-11-20" or "anthropic/claude-3.5-sonnet". Here are some model recommendations:

Claude 3.5 Sonnet (Best performance)
- Model ID: "anthropic/claude-3.5-sonnet"
GPT-4o (Balanced option)
- Model ID: "openai/gpt-4o-2024-11-20"
Llama 3.2 (Open source, cost-effective)
- Model ID: "meta-llama/llama-3.2-90b-vision-instruct"
DeepSeek V3 (Open source, nearly free, performance on par with GPT-4o; the most recommended option)
- Model ID: "deepseek/deepseek-chat-v3-0324"
- Model ID (free variant): "deepseek/deepseek-chat-v3-0324:free"

Preparing Marker Data

You have three options for providing marker data:

Create a data frame or CSV file containing your clusters and marker genes
Use the output of Seurat's FindAllMarkers function directly
Use CASSIA's example marker data

# Option 1: Load your own marker data
markers <- read.csv("path/to/your/markers.csv")

# Option 2: Use Seurat's FindAllMarkers output directly
# (assuming you already have a Seurat object)
markers <- FindAllMarkers(seurat_obj)

# Option 3: Load example marker data
markers <- loadExampleMarkers()

# Preview the data
head(markers)

Marker Data Format

CASSIA accepts two formats:

FindAllMarkers Output: The standard output from Seurat's FindAllMarkers function
Simplified Format: A two-column data frame where:
- First column: cluster identifier
- Second column: comma-separated ranked marker genes
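
As an illustration of the simplified format, the sketch below builds the two-column data frame from a long table with one row per gene. The column names (cluster, gene, markers) and the toy gene lists are assumptions made for this example. The source only says the genes are comma-separated, so a plain "," is used as the separator.

```r
# Toy long-format table, already ranked within each cluster (values are illustrative).
long <- data.frame(
  cluster = c("0", "0", "0", "1", "1"),
  gene    = c("CD3D", "CD3E", "IL7R", "MS4A1", "CD79A"),
  stringsAsFactors = FALSE
)

# Collapse each cluster's genes into one comma-separated string,
# keeping the row order (that is, the ranking) within each cluster.
collapsed <- tapply(long$gene, long$cluster, paste, collapse = ",")

# Two-column data frame: cluster identifier first, marker string second.
markers_simple <- data.frame(
  cluster = names(collapsed),
  markers = as.vector(collapsed),
  stringsAsFactors = FALSE
)

print(markers_simple)
```

A data frame shaped like markers_simple is what the simplified format describes. Whether CASSIA expects specific column names is not stated here, so only the column order is relied on.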

Running Batch Analysis

Setting Up Parameters

# Detect available CPU cores
available_cores <- parallel::detectCores()

# Calculate recommended workers (75% of available cores)
recommended_workers <- floor(available_cores * 0.75)

runCASSIA_batch(
    # Required parameters
    marker = markers,                    # Marker data (data frame or file path)
    output_name = "my_annotation",       # Base name for output files
    model = "gpt-4o",                     # Model to use
    tissue = "brain",                    # Tissue type
    species = "human",                   # Species
    
    # Optional parameters
    max_workers = recommended_workers,    # Number of parallel workers
    n_genes = 50,                        # Number of top marker genes to use
    additional_info = "",                # Additional context
    provider = "openai"                  # API provider
)
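
The call above uses the OpenAI provider. For OpenRouter, the model IDs listed under Model Recommendations would be passed as model instead. The sketch below only assembles the argument list. The provider string "openrouter" is an assumption that this page does not confirm, and the final call is left commented out because it requires the CASSIA package and a configured API key.

```r
# Model IDs copied from the Model Recommendations list; the short names are ours.
openrouter_models <- c(
  claude   = "anthropic/claude-3.5-sonnet",
  gpt4o    = "openai/gpt-4o-2024-11-20",
  llama    = "meta-llama/llama-3.2-90b-vision-instruct",
  deepseek = "deepseek/deepseek-chat-v3-0324"
)

# Arguments mirror the runCASSIA_batch() call shown above.
batch_args <- list(
  marker      = "path/to/your/markers.csv",       # data frame or file path
  output_name = "my_annotation",
  model       = openrouter_models[["deepseek"]],
  tissue      = "brain",
  species     = "human",
  provider    = "openrouter"                      # assumed provider string
)

# Uncomment once CASSIA is loaded and an API key is configured:
# do.call(runCASSIA_batch, batch_args)
```

Keeping the model IDs in one named vector makes it easy to rerun the same batch with a different model by changing a single name.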

Parameter Details

Marker Gene Selection:
- Default: top 50 genes per cluster
- Filtering criteria:
  - Adjusted p-value < 0.05
  - Average log2 fold change > 0.25
  - Minimum percentage > 0.1
- If fewer than 50 genes pass filters, all passing genes are used
Parallel Processing:
- max_workers: Number of parallel workers
- Recommended: 75% of available CPU cores
- Example: For a 16-core machine, set to 12
Additional Context (optional):
- Use additional_info to provide experimental context
- Examples:
  - Treatment conditions: "Samples were antibody-treated"
  - Analysis focus: "Please carefully distinguish between cancer and non-cancer cells"
- Tip: Compare results with and without additional context
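
The marker gene selection criteria above can be sketched in base R as follows. This is not CASSIA's internal code. It assumes Seurat-style column names (p_val_adj, avg_log2FC, pct.1, cluster, gene), applies the minimum-percentage filter to pct.1, and ranks by fold change, none of which this page confirms. The helper name select_top_markers is hypothetical.

```r
# Toy table shaped like FindAllMarkers output (all values are made up).
markers <- data.frame(
  cluster    = c("0", "0", "0", "1", "1"),
  gene       = c("CD3D", "CD3E", "ACTB", "MS4A1", "CD79A"),
  p_val_adj  = c(1e-50, 1e-30, 0.20, 1e-40, 1e-10),
  avg_log2FC = c(2.1, 1.8, 0.1, 3.0, 2.5),
  pct.1      = c(0.90, 0.85, 0.99, 0.80, 0.05),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply the three filters, then keep up to n_genes per cluster.
select_top_markers <- function(df, n_genes = 50) {
  keep <- df$p_val_adj < 0.05 & df$avg_log2FC > 0.25 & df$pct.1 > 0.1
  df <- df[keep, , drop = FALSE]
  # Rank within each cluster by fold change, largest first (assumed ranking).
  df <- df[order(df$cluster, -df$avg_log2FC), , drop = FALSE]
  # head() returns every row when fewer than n_genes pass the filters.
  do.call(rbind, lapply(split(df, df$cluster), head, n = n_genes))
}

top <- select_top_markers(markers, n_genes = 50)
print(top[, c("cluster", "gene")])
```

In this toy table ACTB fails the p-value and fold-change filters and CD79A fails the percentage filter, so three genes remain, fewer than 50 per cluster, and all of them are kept.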

Output Files

The analysis generates two files:

my_annotation_full.csv: Complete conversation history
my_annotation_summary.csv: Condensed results summary
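
A minimal sketch for loading the two files back into R is shown below. It derives the file names from output_name as described above and reads only the files that exist in the working directory. The column layout of the files is not described on this page, so the sketch just inspects their structure.

```r
output_name <- "my_annotation"

# File names follow the pattern described above.
output_files <- paste0(output_name, c("_full.csv", "_summary.csv"))

# Read only the files that are actually present.
existing <- output_files[file.exists(output_files)]
results <- lapply(existing, read.csv)
names(results) <- existing

# Inspect whatever was loaded; prints nothing if no file was found.
for (f in names(results)) {
  cat("==", f, "==\n")
  str(results[[f]])
}
```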

Tips for Optimal Results

Resource Management:
- Monitor system resources when setting max_workers
- Start with the recommended 75% of cores and adjust if needed
Marker Gene Selection:
- The default of 50 genes works well for most cases
- Increase it for more complex cell types
- Decrease it if you run into API rate limits
Context Optimization:
- Test runs with and without additional context
- Keep context concise and relevant
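
The 75% rule from the Resource Management tip can be wrapped in a small helper. The function below is a hypothetical convenience, not part of CASSIA. It guards against parallel::detectCores() returning NA, which can happen when the core count cannot be determined, and never returns fewer than one worker.

```r
# Hypothetical helper: worker count as a fraction of the available cores.
recommended_workers <- function(cores = parallel::detectCores(), fraction = 0.75) {
  # detectCores() may return NA; fall back to a single worker in that case.
  if (is.na(cores)) return(1L)
  max(1L, as.integer(floor(cores * fraction)))
}

recommended_workers(16)  # 12
recommended_workers(2)   # 1
recommended_workers()    # depends on the machine
```

The result can be passed as max_workers and lowered if memory use or API rate limits become a problem.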