Subclustering Analysis (Optional)

Subclustering analysis is a powerful technique for studying specific cell populations in greater detail. This tutorial walks you through the process of analyzing subclustered populations, such as T cells or fibroblasts, using Cassia and Seurat.

Prerequisites

A Seurat object containing your single-cell data
The Cassia package installed and loaded
Basic familiarity with R and single-cell analysis

Workflow Summary

Initial Cassia analysis
Subcluster extraction and processing
Marker identification
Cassia subclustering analysis
Uncertainty assessment (optional)

Detailed Steps

1. Initial Analysis

First, run the default Cassia pipeline on your complete dataset to identify major cell populations.

2. Subcluster Processing

Extract and process your target cluster using Seurat:

# Extract target population (example using CD8+ T cells)
cd8_cells <- subset(large, cell_ontology_class == "cd8-positive, alpha-beta t cell")

# Normalize data
cd8_cells <- NormalizeData(cd8_cells)

# Identify variable features
cd8_cells <- FindVariableFeatures(cd8_cells, 
    selection.method = "vst", 
    nfeatures = 2000)

# Scale data
all.genes <- rownames(cd8_cells)
cd8_cells <- ScaleData(cd8_cells, features = all.genes)

# Run PCA
cd8_cells <- RunPCA(cd8_cells, 
    features = VariableFeatures(object = cd8_cells),
    npcs = 30)

# Perform clustering
cd8_cells <- FindNeighbors(cd8_cells, dims = 1:20)
cd8_cells <- FindClusters(cd8_cells, resolution = 0.3)

# Generate UMAP visualization
cd8_cells <- RunUMAP(cd8_cells, dims = 1:20)
R

3. Marker Identification

Identify markers for each subcluster:

# Find markers
cd8_markers <- FindAllMarkers(cd8_cells,
    only.pos = TRUE,
    min.pct = 0.1,
    logfc.threshold = 0.25)

# Filter significant markers
cd8_markers <- cd8_markers %>% filter(p_val_adj < 0.05)

# Save results
write.csv(cd8_markers, "cd8_subcluster_markers.csv")
R

4. Cassia Subclustering Analysis

Run Cassia analysis on the subclusters:

# Basic analysis
runCASSIA_subclusters(
    marker = marker_sub,
    major_cluster_info = "cd8 t cell",
    output_name = "subclustering_results",
    model = "anthropic/claude-3.5-sonnet",
    provider = "openrouter"
)

# For mixed populations
runCASSIA_subclusters(
    marker = marker_sub,
    major_cluster_info = "cd8 t cell mixed with other celltypes",
    output_name = "subclustering_results2",
    model = "anthropic/claude-3.5-sonnet",
    provider = "openrouter"
)
R

5. Uncertainty Assessment (Optional)

For more confident results, calculate CS scores:

# Run multiple iterations
runCASSIA_n_subcluster(
    n = 5,
    marker = marker_sub,
    major_cluster_info = "cd8 t cell",
    output_name = "subclustering_results_n",
    model = "anthropic/claude-3.5-sonnet",
    temperature = 0,
    provider = "openrouter",
    max_workers = 5,
    n_genes = 50L
)

# Calculate similarity scores
similarity_scores <- runCASSIA_similarity_score_batch(
    marker = marker_sub,
    file_pattern = "subclustering_results_n_*.csv",
    output_name = "subclustering_uncertainty",
    max_workers = 6,
    model = "claude-3-5-sonnet-20241022",
    provider = "anthropic",
    main_weight = 0.5,
    sub_weight = 0.5
)
R

Tips and Recommendations

Always run the default Cassia analysis first before subclustering
Adjust clustering resolution based on your data's complexity
When dealing with mixed populations, specify this in the major_cluster_info parameter
Use the uncertainty assessment for more robust results

Output Files

The analysis generates several output files:

cd8_subcluster_markers.csv: Marker genes for each subcluster
subclustering_results.csv: Basic Cassia analysis results
subclustering_uncertainty.csv: Similarity scores (if uncertainty assessment is performed)