Batch Processing
CASSIA supports batch processing to analyze multiple clusters simultaneously. This guide explains how to prepare your data and run batch analysis efficiently.
Model Recommendations
If you're using OpenRouter as your provider, you can specify models like "openai/gpt-4o-2024-11-20"
or "anthropic/claude-3.5-sonnet"
. Here are some model recommendations:
- Claude 3.5 Sonnet (Best performance)
- Model ID:
"anthropic/claude-3.5-sonnet"
- Model ID:
- GPT-4o (Balanced option)
- Model ID:
"openai/gpt-4o-2024-11-20"
- Model ID:
- Llama 3.2 (Open source, cost-effective)
- Model ID:
"meta-llama/llama-3.2-90b-vision-instruct"
- Model ID:
- Deepseek v3 (Open source, almost free, and performance on par with gpt4o, most recommended option)
- Model ID:
"deepseek/deepseek-chat-v3-0324"
- Model ID:
"deepseek/deepseek-chat-v3-0324:free"
- Model ID:
Preparing Marker Data
You have three options for providing marker data:
- Create a data frame or CSV file containing your clusters and marker genes
- Use Seurat's
findAllMarkers
function output directly - Use CASSIA's example marker data
# Option 1: Load your own marker data markers <- read.csv("path/to/your/markers.csv") # Option 2: Use Seurat's findAllMarkers output directly # (assuming you already have a Seurat object) markers <- FindAllMarkers(seurat_obj) # Option 3: Load example marker data markers <- loadExampleMarkers() # Preview the data head(markers)
R
Marker Data Format
CASSIA accepts two formats:
- FindAllMarkers Output: The standard output from Seurat's FindAllMarkers function
- Simplified Format: A two-column data frame where:
- First column: cluster identifier
- Second column: comma-separated ranked marker genes
Running Batch Analysis
Setting Up Parameters
# Detect available CPU cores available_cores <- parallel::detectCores() # Calculate recommended workers (75% of available cores) recommended_workers <- floor(available_cores * 0.75) runCASSIA_batch( # Required parameters marker = markers, # Marker data (data frame or file path) output_name = "my_annotation", # Base name for output files model = "gpt-4o", # Model to use tissue = "brain", # Tissue type species = "human", # Species # Optional parameters max_workers = recommended_workers, # Number of parallel workers n_genes = 50, # Number of top marker genes to use additional_info = "", # Additional context provider = "openai" # API provider )
R
Parameter Details
-
Marker Gene Selection:
- Default: top 50 genes per cluster
- Filtering criteria:
- Adjusted p-value < 0.05
- Average log2 fold change > 0.25
- Minimum percentage > 0.1
- If fewer than 50 genes pass filters, all passing genes are used
-
Parallel Processing:
max_workers
: Controls parallel processing threads- Recommended: 80% of available CPU cores
- Example: For a 16-core machine, set to 13
-
Additional Context (optional):
- Use
additional_info
to provide experimental context - Examples:
- Treatment conditions: "Samples were antibody-treated"
- Analysis focus: "Please carefully distinguish between cancer and non-cancer cells"
- Tip: Compare results with and without additional context
- Use
Output Files
The analysis generates two files:
my_annotation_full.csv
: Complete conversation historymy_annotation_summary.csv
: Condensed results summary
Tips for Optimal Results
-
Resource Management:
- Monitor system resources when setting
max_workers
- Start with recommended 75% of cores and adjust if needed
- Monitor system resources when setting
-
Marker Gene Selection:
- Default 50 genes works well for most cases
- Increase for more complex cell types
- Decrease if running into API rate limits
-
Context Optimization:
- Test runs with and without additional context
- Keep context concise and relevant