Foundation Models for Weakly Supervised Whole Slide Image Classification: A Comprehensive Guide for Biomedical Research

Hannah Simmons Dec 02, 2025

Abstract

This article provides a comprehensive exploration of using foundation models for weakly supervised classification of histopathological Whole Slide Images (WSIs). Tailored for researchers and drug development professionals, it covers the foundational principles of self-supervised learning and Multiple Instance Learning (MIL) that underpin this approach. The content details state-of-the-art methodologies, including specific foundation models like CONCH and Virchow2, and frameworks such as CLAM. It addresses critical challenges like data scarcity and model selection, offering practical optimization strategies. Finally, the article synthesizes evidence from recent large-scale clinical benchmarks, providing performance comparisons and validation techniques to guide the development of robust, clinically applicable computational pathology tools.

Core Concepts: Demystifying Foundation Models and Weak Supervision in Computational Pathology

Whole Slide Imaging is a transformative technology that involves the high-resolution digitization of entire glass pathology slides to create digital images that can be viewed, shared, and analyzed computationally [1] [2]. These gigapixel-sized digital files, known as Whole Slide Images (WSIs), have become fundamental to modern digital pathology, enabling remote diagnostics, collaborative reviews, and the application of artificial intelligence (AI) in pathological analysis [1] [3].

The creation of a WSI involves scanning glass slides using a motorized microscope with a digital camera system that captures numerous individual image tiles at high magnification, which are then computationally stitched together to form a single, comprehensive digital image [3]. This process allows pathologists and researchers to examine specimens on computer screens with the ability to zoom, pan, and annotate specific regions of interest, replicating and enhancing the capabilities of traditional light microscopy [1].

The Annotation Bottleneck in Computational Pathology

Despite the advantages of digitization, a significant challenge hindering the development of AI models for computational pathology is the annotation bottleneck. This refers to the immense difficulty and cost associated with obtaining pixel-level or region-level annotations required for training supervised deep learning models [4].

The annotation bottleneck arises from several factors:

  • WSI Complexity and Size: A single WSI can exceed 10 billion pixels, making comprehensive annotation extremely time-consuming [4].
  • Specialized Expertise Required: Accurate pathological annotation demands highly trained pathologists whose time is limited and valuable [1].
  • Inter-observer Variability: Annotation consistency suffers from subjective interpretations between different pathologists [4].
  • Data Heterogeneity: Variations in staining protocols, scanner models, and tissue processing create domain shifts that complicate annotation standardization [4].

This bottleneck severely constrains the scalability of AI solutions in pathology and has motivated research into alternative learning paradigms, particularly weakly supervised learning approaches that can utilize more readily available annotation types [4].

Foundation Models and Weakly Supervised Learning

Foundation Models (FMs) and Weakly Supervised Learning represent promising approaches to address the annotation bottleneck in computational pathology [4]. Foundation Models are large-scale models pre-trained on broad data that can be adapted to various downstream tasks, while weakly supervised learning utilizes imperfect or higher-level annotations (such as slide-level labels) instead of detailed pixel-level annotations [5].

The integration of these approaches enables researchers to develop AI models for WSI classification with significantly reduced annotation requirements. Current research explores multiple paradigms:

  • Foundation Model + Multi-Instance Learning (MIL): This established paradigm learns mappings from image patterns to slide-level labels without requiring detailed local annotations [4].
  • Vision-Language Models (VLM): These models align image representations with pathological text descriptions to learn richer semantic understanding [4].
  • Multi-Agent Systems: Frameworks like WSI-Agents utilize specialized collaborative agents for different analytical tasks, balancing versatility and accuracy [6].

Table 1: Quantitative Performance of Deep Learning Models in CRC Molecular Subtyping

Model | Sub-image Level Accuracy | Slide Level Accuracy | CMS2 Subtype Accuracy
VGG16 | 53.04% | 51.72% | 75.00%
VGG19+Dropout | 53.04% | 51.72% | 75.00%
VGG24+Dropout | 53.04% | 51.72% | 75.00%
VGG24+BN+Dropout | 53.04% | 51.72% | 75.00%
Inception v3 | 53.04% | 51.72% | 75.00%
Resnet18 | 53.04% | 51.72% | 75.00%
Resnet34 | 53.04% | 51.72% | 75.00%
Resnet50 | 53.04% | 51.72% | 75.00%

Note: Data adapted from a study on colorectal cancer (CRC) consensus molecular subtype (CMS) classification using WSIs [7].

Experimental Protocols for Weakly Supervised WSI Classification

Protocol: Weakly Supervised Classification Using Slide-Level Labels

This protocol outlines a methodology for training classification models using only slide-level labels, based on established approaches in computational pathology research [4] [7].

Materials and Equipment

  • Whole Slide Images (WSIs) in compatible formats (.svs, .ndpi, .tiff)
  • High-performance computing workstation with GPU acceleration
  • WSI processing software (e.g., ASAP, QuPath)
  • Deep learning framework (PyTorch or TensorFlow)

Procedure

  • Data Preprocessing

    • Convert WSIs to manageable patches (e.g., 256×256 pixels)
    • Filter out non-tissue regions using HSV color space transformation and thresholding (a code sketch of this filter follows the protocol)
    • Perform data augmentation (flipping, rotation, color variation)
    • Normalize staining variations across different scanners
  • Feature Extraction

    • Utilize pre-trained foundation models (e.g., CONCH, Virchow) for patch-level feature extraction
    • Process all patches from each WSI through the feature extractor
    • Aggregate patch-level features into slide-level representations
  • Model Training with Multi-Instance Learning

    • Implement attention-based MIL pooling mechanisms
    • Train classifier using only slide-level labels
    • Optimize using Adam optimizer with learning rate 0.0001
    • Validate performance on held-out test set
  • Interpretation and Analysis

    • Generate attention heatmaps to identify informative regions
    • Perform quantitative evaluation on independent test set
    • Compare against ground truth and pathologist interpretations
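
The HSV-based tissue filter referenced in the preprocessing step can be implemented in a few lines. Below is a minimal sketch using OpenCV; the saturation cutoff and minimum tissue fraction are illustrative defaults, not values prescribed by the cited studies.

```python
import cv2
import numpy as np

def is_tissue(patch_rgb: np.ndarray, sat_thresh: int = 13, min_tissue_frac: float = 0.25) -> bool:
    """Keep a patch only if enough of its pixels look like stained tissue.

    patch_rgb: H x W x 3 uint8 RGB patch.
    sat_thresh: minimum HSV saturation (0-255) for a pixel to count as tissue;
                background glass is nearly unsaturated (illustrative default).
    min_tissue_frac: minimum fraction of tissue pixels required to keep the patch.
    """
    hsv = cv2.cvtColor(patch_rgb, cv2.COLOR_RGB2HSV)
    tissue_mask = hsv[:, :, 1] > sat_thresh  # channel 1 = saturation
    return float(tissue_mask.mean()) >= min_tissue_frac
```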

Protocol: Foundation Model Assisted Weak Segmentation

This protocol adapts methods from foundation model research to generate segmentation seeds using only image-level supervision [5].

Procedure

  • Foundation Model Integration

    • Initialize with pre-trained CLIP and SAM models
    • Design task-specific prompts for classification and segmentation
    • Keep foundation model weights frozen during initial training
  • Prompt Learning

    • Learn classification-specific prompts using image-level labels
    • Learn segmentation-specific prompts using contrastive losses
    • Incorporate activation loss supervised by coarse seed maps
  • Seed Generation

    • Process images through prompt-enhanced foundation models
    • Generate high-quality segmentation seeds
    • Use seeds as pseudo-labels for training segmentation networks

Visualization of Methodologies

Weakly Supervised WSI Classification Workflow

[Diagram: WSI → (tiling) → patches → (feature extraction) → features → (aggregation) → MIL model → (classification) → prediction; the slide-level label supervises the MIL model.]

The Annotation Bottleneck in Pathology AI

[Diagram: data → annotation (requires expert pathologists) → trained AI models, with annotation creating the annotation bottleneck; foundation models enable weakly supervised learning, which bypasses detailed annotation and trains models directly.]

The Scientist's Toolkit: Essential Research Reagents and Materials

Table 2: Key Research Reagent Solutions for Computational Pathology

Item | Function | Application Notes
Whole Slide Scanners (e.g., Aperio GT 450) | Digitizes glass slides into WSIs | For research use only; not for diagnostic procedures [3]
WSI Viewing Software (e.g., ImageScope) | Enables visualization and basic annotation of digital slides | Supports remote collaboration and multi-user access [2]
Deep Learning Frameworks (PyTorch, TensorFlow) | Provides infrastructure for model development and training | Essential for implementing weakly supervised algorithms [7]
Pre-trained Foundation Models (e.g., CONCH, Virchow) | Offers pre-learned feature representations for pathology images | Reduces need for extensive task-specific training data [4]
Computational Pathology Platforms (e.g., paiwsit.com) | Data management and analysis systems for WSI datasets | Facilitates storage, organization, and analysis of large WSI collections [7]
Annotation Tools (e.g., ASAP, QuPath) | Enables region-of-interest marking and labeling | Critical for generating ground truth data for validation [7]
High-Performance Computing (GPU servers) | Accelerates model training and inference | Necessary for processing gigapixel-sized WSIs efficiently [7]

Future Perspectives

The integration of foundation models with weakly supervised learning approaches represents a promising path forward for addressing the annotation bottleneck in computational pathology [4]. Emerging research directions include:

  • Unified Generative and Understanding Models: Developing models that can simultaneously perform comprehension and generation tasks for virtual staining and data augmentation [4].
  • AI Agent Frameworks: Creating specialized AI agents that can simulate pathological reasoning through multi-step processes, enhancing interpretability and trust [6].
  • Enhanced Vision-Language Models: Improving the alignment between visual features and pathological semantics to bridge the gap between pixel recognition and true semantic understanding [4].

As these technologies mature, they hold the potential to significantly reduce the dependency on extensive manual annotation while maintaining diagnostic accuracy, ultimately accelerating the adoption of AI-assisted tools in clinical and research pathology workflows.

What are Foundation Models? From Self-Supervised Learning (SSL) to Pre-trained Feature Extractors

Foundation models represent a paradigm shift in artificial intelligence (AI), characterized by their training on vast, broad datasets using self-supervised learning (SSL) techniques, which enables them to serve as versatile base models for a wide array of downstream tasks [8] [9]. These models, typically built on transformer architectures, have demonstrated remarkable adaptability across multiple domains including natural language processing, computer vision, and—more recently—computational pathology [8] [9] [10]. The term "foundation model" was coined by researchers at Stanford University to describe models that are "incomplete but serve as the common basis from which many task-specific models are built via adaptation" [9]. In the context of computational pathology, these models are particularly valuable due to their ability to learn generalized representations from unlabeled data, which can then be fine-tuned for specific diagnostic tasks with limited annotations [11] [10].

The significance of foundation models stems from their scalability and adaptability. Rather than developing AI systems from scratch for each specific task, researchers can use foundation models as starting points to develop specialized applications more quickly and cost-effectively [8]. This is especially relevant in medical imaging domains like whole slide image (WSI) analysis, where labeled data is scarce and expensive to obtain [11] [12]. Foundation models employ transfer learning to apply knowledge gained from one task to others, making them suitable for expansive domains including computer vision and natural language processing [9]. Their self-supervised learning approach allows them to create labels directly from input data without human annotation, setting them apart from previous machine learning architectures that relied on supervised or unsupervised learning [8].

Foundation Models in Computational Pathology

The Unique Challenges of Whole Slide Images

Computational pathology presents distinctive challenges that foundation models are particularly well-suited to address. Whole slide images (WSIs) are gigapixel-sized digital scans of tissue sections, creating significant computational hurdles for analysis [11]. These images exhibit tremendous spatial heterogeneity, with diagnostic regions often representing only tiny fractions of the entire slide—creating what researchers describe as "needles in a haystack" problems [11]. Traditional supervised deep learning approaches require extensive manual annotation of regions of interest, which is time-consuming, expensive, and subject to inter-observer variability among pathologists [11] [13].

The field has increasingly adopted weakly supervised learning paradigms, particularly multiple-instance learning (MIL), where only slide-level labels are available during training without precise localization of diagnostically relevant regions [11] [13]. This approach aligns well with clinical practice where pathologists provide overall diagnoses without pixel-level annotations. However, these methods often require large datasets of WSIs with slide-level labels and can suffer from poor domain adaptation and interpretability issues [11]. Foundation models pretrained using self-supervised learning on large, diverse histopathology datasets offer a promising solution to these challenges by providing robust, transferable feature representations that can be adapted to various diagnostic tasks with limited labeled data [10] [14].

Self-Supervised Learning Approaches for Pathology

Self-supervised learning has emerged as a powerful paradigm for training foundation models in computational pathology, effectively addressing the annotation bottleneck. SSL methods create supervisory signals directly from the data itself without human annotation, allowing models to learn rich visual representations from large-scale unlabeled WSI datasets [10] [14]. The SPT (Slide Pre-trained Transformers) framework exemplifies this approach by treating WSIs as collections of tokens and applying NLP-inspired transformation strategies including splitting, cropping, and masking to generate different "views" for self-supervised pretraining [14].

Multiple SSL objectives have been adapted for histopathology, including:

  • Contrastive learning (e.g., SimCLR): Maximizes agreement between differently augmented views of the same image while minimizing agreement with other images [14]
  • Self-distillation (e.g., BYOL): Uses a momentum encoder to predict representations of augmented views without negative samples [14]
  • Regularization-based methods (e.g., VICReg): Applies variance, invariance, and covariance constraints to learn robust representations [14]
  • Masked image modeling: Randomly masks portions of the input and trains the model to reconstruct the missing parts [10]

These approaches enable foundation models to capture morphologically meaningful patterns in histology images, forming a robust basis for downstream diagnostic tasks.
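
To make the contrastive objective concrete, the sketch below implements the NT-Xent loss popularized by SimCLR for a batch of paired augmented views. It is a generic illustration of the objective, not the exact loss used by any of the pathology models cited above.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent (SimCLR) loss for two batches of embeddings from paired augmented views.

    z1, z2: (B, D) embeddings of the two views of the same B images.
    """
    B = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, D) unit vectors
    sim = z @ z.t() / temperature                       # (2B, 2B) scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    # For row i, the positive pair sits at index i+B (and vice versa).
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(z.device)
    return F.cross_entropy(sim, targets)
```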

Pre-trained Feature Extractors for WSI Classification

Architectural Approaches

Table 1: Comparison of Foundation Model Architectures in Computational Pathology

Model | Architecture | Modality | Pretraining Data | Key Features
TITAN [10] | Vision Transformer | Multimodal (image + text) | 335,645 WSIs + reports | Cross-modal alignment, slide-level representations
CLAM [11] | Attention-based MIL | Visual | Task-specific datasets | Data-efficient, attention-based localization
SPT [14] | Transformer | Visual | Diverse WSI collections | Multiple view generation, Fourier positional encoding
UNI [15] | Transformer | Visual | Large-scale histology patches | Patch-level foundational representations
Virchow2 [15] | Transformer | Visual | Large-scale histology patches | Competitive feature extraction capabilities

Foundation models for computational pathology employ diverse architectural strategies to handle the unique challenges of WSIs. The TITAN (Transformer-based pathology Image and Text Alignment Network) model represents a comprehensive approach, using a Vision Transformer (ViT) backbone pretrained on 335,645 WSIs through visual self-supervised learning and vision-language alignment with corresponding pathology reports [10]. TITAN introduces several innovations to address computational challenges, including processing non-overlapping patches of 512×512 pixels at 20× magnification, using random cropping of feature grids, and employing attention with linear bias (ALiBi) for long-context extrapolation [10].

The CLAM (Clustering-constrained Attention Multiple-instance Learning) framework takes a different approach, specifically designed for data-efficient WSI processing using only slide-level labels [11]. CLAM uses attention-based learning to identify subregions of high diagnostic value and instance-level clustering to refine the feature space [11]. This method generates interpretable heatmaps that highlight regions contributing to classifications, enhancing model transparency without requiring pixel-level annotations during training.

[Diagram: whole slide image (gigapixel) → tissue segmentation → patch extraction (256×256 or 512×512) → feature extraction using a foundation model → attention-based pooling → multiple instance learning with clustering constraints → outputs: slide classification, attention heatmaps, and uncertainty estimates.]

Diagram 1: Weakly Supervised WSI Classification Workflow

Benchmarking Performance

Table 2: Performance Comparison of Weakly Supervised Methods on Cancer Subtyping Tasks

Method | Architecture | RCC Subtyping (AUROC) | NSCLC Subtyping (AUROC) | Lymph Node Metastasis Detection (AUROC) | Data Efficiency
CLAM [11] | Attention-based MIL | 0.99 | 0.97 | 0.96 | High
ViT-based [13] | Vision Transformer | >0.90 | >0.90 | >0.90 | Medium
CNN-based MIL [13] | Convolutional Neural Network | >0.90 | >0.90 | >0.90 | Low-Medium
Classical Weakly Supervised [13] | CNN with averaging | >0.90 | >0.90 | >0.90 | Low

Recent benchmarking studies have demonstrated the effectiveness of foundation models in various computational pathology tasks. In a comprehensive comparison of six AI algorithms across six clinical problems, Vision Transformers (ViTs) were found to outperform convolutional neural networks (CNNs) in clinically relevant prediction tasks, suggesting they could become the new standard in the field [13]. Surprisingly, for predicting molecular alterations, classical weakly-supervised workflows consistently outperformed more complex multiple-instance learning approaches, highlighting the importance of method selection based on specific clinical tasks [13].

The data efficiency of foundation models is particularly notable. CLAM has demonstrated high performance with limited annotations, achieving excellent results in renal cell carcinoma (RCC) and non-small-cell lung cancer (NSCLC) subtyping as well as lymph node metastasis detection with AUROC scores above 0.96 across all tasks [11]. This efficiency is crucial for rare cancers where large annotated datasets are unavailable.

Experimental Protocols

Protocol 1: Implementing CLAM for WSI Classification

Purpose: To implement a weakly supervised classification pipeline for whole slide images using the CLAM framework with foundation model feature extractors.

Materials and Reagents:

  • Whole slide images (WSIs) in SVS or other standard formats
  • High-performance computing environment with GPU acceleration
  • Python 3.8+ with PyTorch and CLAM repository

Procedure:

  • Data Preparation:

    • Organize WSIs into directory structure with slide-level labels
    • Perform tissue segmentation to exclude background regions
    • Extract patches of size 256×256 or 512×512 at desired magnification (typically 20×)
  • Feature Extraction:

    • Use pretrained foundation models (UNI, Virchow2, or PLIP) to convert patches into feature embeddings
    • Save features as .h5 or .pt files with spatial coordinates preserved
    • Example code snippet:
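      (The original snippet is not shown; the following is a minimal sketch in which a torchvision ResNet-50 stands in for a pathology foundation model such as UNI, Virchow2, or PLIP, and `patch_loader` is a placeholder for your tiling pipeline.)

```python
# Minimal sketch: encode pre-extracted patches with a frozen encoder and save
# features plus spatial coordinates to HDF5.
import h5py
import torch
import torchvision

# Stand-in encoder (swap in pathology foundation model weights in practice).
encoder = torch.nn.Sequential(
    *list(torchvision.models.resnet50(weights="IMAGENET1K_V2").children())[:-1],
    torch.nn.Flatten(),
)
encoder.eval()

features, coords = [], []
with torch.no_grad():
    for patch_batch, coord_batch in patch_loader:  # (B, 3, 256, 256), (B, 2); your tiling pipeline
        features.append(encoder(patch_batch))
        coords.append(coord_batch)

with h5py.File("slide_0001.h5", "w") as f:
    f.create_dataset("features", data=torch.cat(features).numpy())
    f.create_dataset("coords", data=torch.cat(coords).numpy())
```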

  • Model Training:

    • Split data into training/validation sets (typically 80/20 or 70/30)
    • Configure CLAM with appropriate parameters:
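      (The exact flags depend on the CLAM release; the values below are illustrative starting points rather than settings prescribed by the cited work.)

```python
# Illustrative CLAM-style hyperparameters (hypothetical values; adapt to your
# CLAM version and dataset).
clam_config = {
    "model_type": "clam_sb",  # single-branch attention variant
    "bag_loss": "ce",         # slide-level cross-entropy
    "inst_loss": "svm",       # instance-level clustering loss
    "bag_weight": 0.7,        # weight of bag loss vs. instance loss
    "drop_out": 0.25,
    "lr": 2e-4,
    "max_epochs": 200,
}
```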

    • Train with slide-level labels using attention-based multiple instance learning
    • Monitor training with accuracy and loss metrics
  • Interpretation and Evaluation:

    • Generate attention heatmaps to visualize diagnostically relevant regions
    • Calculate slide-level predictions and performance metrics (AUROC, accuracy)
    • Perform statistical analysis of model performance

Troubleshooting Tips:

  • For large datasets, use data loaders with appropriate batch sizes
  • If attention maps are diffuse, adjust clustering strength parameters
  • For poor performance, try different foundation model feature extractors

Protocol 2: Cross-modal Foundation Model Pretraining

Purpose: To pretrain a multimodal foundation model for pathology images and text using self-supervised learning.

Materials and Reagents:

  • Large-scale dataset of WSIs (100,000+ slides)
  • Corresponding pathology reports or synthetic captions
  • High-performance computing cluster with multiple GPUs

Procedure:

  • Data Collection and Curation:

    • Gather diverse WSIs across multiple organ systems and stain types
    • Collect corresponding pathology reports or generate synthetic captions using generative AI
    • Ensure data privacy compliance and ethical approvals
  • Vision-Only Pretraining:

    • Apply iBOT framework for masked image modeling at patch level
    • Use knowledge distillation with teacher-student framework
    • Train with random cropping and feature grid sampling
  • Cross-modal Alignment:

    • Implement image-text contrastive learning (ITC)
    • Use masked language modeling (MLM) for text encoding
    • Apply image-text matching (ITM) objectives
  • Evaluation and Validation:

    • Assess representation quality through linear probing on multiple tasks
    • Perform zero-shot classification and cross-modal retrieval tests
    • Validate on rare cancer retrieval to test generalization

Timeline: 4-8 weeks depending on dataset size and computational resources

The Scientist's Toolkit

Table 3: Essential Research Reagents and Computational Tools

Tool/Resource | Type | Function | Access
CLAM [11] | Software Framework | Weakly supervised WSI classification with interpretability | GitHub: mahmoodlab/CLAM
TITAN [10] | Foundation Model | Multimodal slide representation learning | Upon request
UNI [15] | Feature Extractor | Patch-level foundational representations | Open source
Virchow2 [15] | Feature Extractor | Competitive pathology-specific features | Research license
Hugging Face [8] | Model Repository | Access to community-shared models and datasets | huggingface.co
Amazon Bedrock [8] | API Service | Access to foundation models through API | AWS console
PMC [11] [16] | Data Repository | Access to scientific literature and datasets | pmc.ncbi.nlm.nih.gov

The integration of foundation models with weakly supervised learning approaches represents a transformative development in computational pathology. These models address critical challenges in the field, including data efficiency, interpretability, and generalizability across domains and institutions [11] [10]. The ability to leverage self-supervised learning on large unlabeled datasets followed by fine-tuning on small annotated collections makes foundation models particularly valuable for medical applications where expert annotations are scarce and expensive [12] [10].

Future research directions include developing more sophisticated multimodal foundation models that integrate histology with genomic, transcriptomic, and clinical data [10]. There is also growing interest in federated learning approaches that enable model training across institutions without sharing sensitive patient data [13]. As noted in benchmarking studies, the field would benefit from standardized evaluation frameworks and more comprehensive comparisons of emerging foundation models against established baselines [13].

In conclusion, foundation models trained through self-supervised learning have established a new paradigm for computational pathology research. By serving as versatile, powerful feature extractors, these models enable more data-efficient and interpretable whole slide image classification, accelerating the development of AI-assisted diagnostic tools that can ultimately enhance patient care. The protocols and frameworks described in this article provide researchers with practical guidance for leveraging these advanced techniques in their own computational pathology workflows.

Understanding Weakly Supervised Learning and the Multiple Instance Learning (MIL) Paradigm

The advancement of computational pathology is significantly constrained by the enormous cost and expertise required for detailed annotation of gigapixel Whole Slide Images (WSIs). Weakly Supervised Learning (WSL) has emerged as a powerful paradigm to overcome this bottleneck by enabling model development using only slide-level labels, which are more readily available from diagnostic reports, rather than exhaustive, expert-made region-of-interest or pixel-level annotations [11] [17]. This approach is particularly vital for medical image analysis, where acquiring large-scale, fully-annotated datasets is often impractical [18]. Within WSL, Multiple Instance Learning (MIL) has become the predominant framework, treating each WSI as a "bag" containing thousands of unlabeled image patches ("instances") and learning from a single label assigned to the entire bag [19] [11].

The application of MIL in pathology is typically governed by the standard MIL assumption: a bag (slide) is positive if at least one of its instances (patches) is positive, and negative if all instances are negative [19] [17]. This formulation naturally fits various diagnostic tasks, such as detecting cancer metastasis or subtyping tumours, where a slide is positive for a disease if it contains at least one region with malignant cells [11] [17]. The primary challenge lies in effectively aggregating information from thousands of instances to make accurate slide-level predictions while also identifying which instances were most critical for the decision—a key to model interpretability.
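
Stated formally (a standard formulation of the assumption above, with y_k the unobserved label of the k-th patch in a bag of K instances):

```latex
Y \;=\; \max_{k \in \{1,\dots,K\}} y_k \;=\;
\begin{cases}
1, & \text{if } \exists\, k : y_k = 1,\\
0, & \text{if } y_k = 0 \ \text{for all } k.
\end{cases}
```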

Fundamental MIL Architectures and Benchmark Performance

Multiple architectural approaches have been developed to implement the MIL paradigm for WSIs. The following table summarizes the core characteristics and representative performance of key methods.

Table 1: Comparison of Fundamental MIL Architectures for WSI Classification

Architecture | Key Principle | Advantages | Limitations | Representative Performance (AUC)
Instance-based MIL | Classifies each instance individually; aggregates predictions via max-pooling [19]. | Simple, intuitive implementation. | Instance-level classifier may be poorly trained due to lack of labels; can introduce error [19]. | ~0.92-0.95 for lung carcinoma classification [17]
Embedding-based MIL | Maps instances to embeddings; aggregates into a single slide-level representation for classification [19]. | More flexible and often higher performance; end-to-end training [19]. | Lacks inherent interpretability [19]. | High performance on various subtyping tasks [11]
Attention-based MIL (AMIL) | Uses a neural network to assign an attention score to each instance, weighting their contribution to the bag-level representation [19] [11]. | Differentiable; provides instance-level interpretability via attention scores [19]. | Computationally more complex than max-pooling. | 0.97 for tumor detection in TCGA-LUSC [20]
Clustering-constrained Attention MIL (CLAM) | Adds an instance-level clustering loss to attention-based MIL to refine the feature space [11]. | Data-efficient; provides interpretable heatmaps; suitable for multi-class tasks [11]. | Requires additional loss hyperparameter tuning. | 0.975-0.988 across four independent test sets for lung carcinoma [11]
HybridMIL | Combines CNNs with Broad Learning Systems (BLS) to capture multi-level features and global semantics [18]. | Captures multi-level feature correlations; good for classification and localization [18]. | Complex architecture design. | Surpasses other MIL models by up to 8.5% in classification accuracy [18]

The transition from simpler instance-based methods to more advanced attention-based and hybrid models has led to significant gains in both accuracy and interpretability. For example, in a landmark study on lung carcinoma classification, a weakly supervised MIL model outperformed a fully-supervised approach, achieving AUCs between 0.974 and 0.988 on four independent test sets, demonstrating exceptional generalization [11] [17]. The adoption of attention mechanisms has been particularly transformative, as it allows models to learn which regions of a slide are most indicative of the diagnosis without any spatial labels during training [19] [11].

The Emergence of Whole-Slide Foundation Models

A recent paradigm shift in the field is the development of whole-slide foundation models pre-trained on massive, diverse datasets of histopathology images. These models aim to learn general-purpose, transferable representations of WSIs that can be readily adapted to various downstream tasks with minimal task-specific labeling.

Table 2: Overview of Whole-Slide Foundation Models for Digital Pathology

Model Name | Key Innovation | Pre-training Scale | Modality | Reported Performance
Prov-GigaPath [21] | Adapts LongNet for ultra-long-context modeling of entire WSIs (tens of thousands of tiles). | 171,189 WSIs (1.3B image tiles) [21] | Vision & Vision-Language | SOTA on 25/26 tasks; 23.5% AUROC improvement for EGFR mutation prediction vs. second-best [21].
TITAN [10] | A multimodal vision-language model aligned with pathology reports and synthetic captions. | 335,645 WSIs [10] | Vision-Language | Excels in zero-shot classification, rare cancer retrieval, and report generation [10].
CLAM [11] | An attention-based MIL framework with instance-level clustering for data-efficient learning. | Not a foundation model; a method designed for data-efficient learning on smaller datasets [11]. | Vision | Highly data-efficient; achieves high performance with hundreds, not thousands, of labeled WSIs [11].

These foundation models address key limitations of earlier patch-based approaches by explicitly modeling the long-range context and global tissue architecture within a slide [10] [21]. For instance, Prov-GigaPath uses a novel Vision Transformer (ViT) architecture with dilated self-attention to process sequences of tens of thousands of tile embeddings, effectively capturing both local and global patterns across the gigapixel slide [21]. In benchmarks spanning cancer subtyping and mutation prediction, Prov-GigaPath attained state-of-the-art performance on 25 out of 26 tasks, demonstrating the power of large-scale, whole-slide pre-training [21].

Integrated Experimental Protocols

This section outlines detailed protocols for implementing a modern, foundation-model-enhanced MIL pipeline for WSI classification.

Protocol: WSI Preprocessing and Feature Extraction

Purpose: To standardize gigapixel WSIs and convert them into a structured set of feature vectors suitable for MIL model input.

  • Tissue Segmentation:

    • Process the WSI using automated tissue segmentation algorithms to exclude background regions (e.g., glass slide, air bubbles) [11]. This drastically reduces the number of patches to process.
  • Patch Extraction:

    • Tile the segmented tissue region into smaller, non-overlapping patches (e.g., 256x256 or 512x512 pixels at 20x magnification) [11] [17].
    • A stride smaller than the patch size (e.g., 256 pixels for a 512x512 patch, giving 50% overlap) is also commonly used when denser coverage is desired [17].
  • Feature Embedding:

    • Instead of using raw pixel values, extract a low-dimensional feature vector for each patch using a pre-trained encoder.
    • Recommended Encoder: Use a foundation model encoder like Prov-GigaPath's tile encoder or other powerful pre-trained networks (e.g., CONCH) [10] [21]. This transfer learning step is critical for performance.
    • The output is a set of feature vectors {h_1, h_2, ..., h_K}, where K is the number of patches, which constitutes the "bag" for the WSI [19] [11].

[Diagram: gigapixel WSI → tissue segmentation → patch tiling (256×256 pixels) → feature extraction with a pre-trained encoder → bag of feature vectors {h₁, h₂, ..., hₖ}.]

WSI Preprocessing and Feature Extraction Workflow

Protocol: Attention-Based Deep MIL with a Foundation Model Encoder

Purpose: To train an interpretable, high-performance classifier using slide-level labels.

  • Input Preparation:

    • Input the bag of feature vectors {h_1, h_2, ..., h_K} obtained from the preceding preprocessing protocol.
  • Attention Network:

    • Pass each feature vector h_k through a small, fully-connected neural network to generate a scalar attention score a_k [19] [11]. The scores measure the relative importance of each patch.
    • a_k = w^T (tanh(V h_k^T) ⊙ sigm(U h_k^T)), where V, U, and w are learnable parameters [19] (see the code sketch after this protocol).
    • Normalize the scores across all patches in the slide using a softmax function to create attention weights [19].
  • MIL Pooling:

    • Compute the slide-level representation z as a weighted sum of all patch features, using the attention weights: z = Σ (a_k * h_k) [19] [11]. This is a permutation-invariant aggregation [19].
  • Classification and Output:

    • Pass the slide-level representation z through a final classification layer g (e.g., a fully-connected network) to predict the slide-level class probability ϴ(X) [19].
    • The model is trained end-to-end by minimizing a standard classification loss (e.g., cross-entropy) using the slide-level labels [11].
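
A compact PyTorch rendering of steps 2-4 (gated attention, MIL pooling, classification), assuming pre-extracted features and illustrative dimensions:

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Gated attention MIL head: a_k = w^T(tanh(V h_k) ⊙ sigm(U h_k)), softmax-normalized."""

    def __init__(self, feat_dim: int = 1024, attn_dim: int = 256, n_classes: int = 2):
        super().__init__()
        self.V = nn.Linear(feat_dim, attn_dim)
        self.U = nn.Linear(feat_dim, attn_dim)
        self.w = nn.Linear(attn_dim, 1)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, h: torch.Tensor):
        # h: (K, feat_dim) bag of patch features for one slide
        a = self.w(torch.tanh(self.V(h)) * torch.sigmoid(self.U(h)))  # (K, 1) raw scores
        a = torch.softmax(a, dim=0)                                   # normalize over patches
        z = (a * h).sum(dim=0)                                        # (feat_dim,) slide representation
        return self.classifier(z), a.squeeze(-1)                      # logits, attention weights
```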

[Diagram: the bag of feature vectors {h₁, h₂, ..., hₖ} feeds an attention network that produces attention weights {a₁, a₂, ..., aₖ}; a weighted sum (MIL pooling) yields the slide-level representation z, which the classifier g maps to the slide-level prediction ϴ(X).]

Attention-based MIL Architecture

Protocol: Data-Efficient Learning with CLAM

Purpose: To achieve high performance with limited labeled data, a common scenario in clinical research.

  • Steps 1-3 of the preceding protocol: Follow the same feature extraction and attention-based MIL process [11].

  • Instance-Level Clustering:

    • In addition to the slide-level classification loss, introduce an instance-level clustering loss [11].
    • For a given slide, use the model's attention scores to identify "high-attention" and "low-attention" patches.
    • Apply a clustering loss (e.g., a form of cluster center distance loss) to encourage the model to separate the features of high-attention patches from low-attention patches, effectively refining the feature space [11].
    • Under the mutual exclusivity assumption, you can treat high-attention patches from negative classes as "false positives" and add corresponding loss terms [11].
  • Joint Training:

    • The total loss is a weighted sum of the slide-level classification loss and the instance-level clustering loss.
    • This auxiliary task acts as a strong regularizer and enables the model to learn effectively from fewer labeled slides [11].
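
A minimal sketch of the joint objective, assuming a bag-level cross-entropy, a generic instance-level loss on attention-derived pseudo-labels, and an illustrative 0.7/0.3 weighting:

```python
import torch.nn.functional as F

def clam_style_loss(bag_logits, bag_label, inst_logits, inst_pseudo_labels, bag_weight=0.7):
    """Weighted sum of slide-level and instance-level losses (illustrative weighting).

    bag_logits: (n_classes,) slide prediction; bag_label: scalar class-index tensor.
    inst_logits: (M, 2) logits for sampled high/low-attention patches;
    inst_pseudo_labels: (M,) pseudo-labels derived from attention scores.
    """
    bag_loss = F.cross_entropy(bag_logits.unsqueeze(0), bag_label.view(1))
    inst_loss = F.cross_entropy(inst_logits, inst_pseudo_labels)
    return bag_weight * bag_loss + (1.0 - bag_weight) * inst_loss
```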

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key Tools and Resources for MIL-based WSI Research

Category / Item | Specific Examples | Function & Utility
Public WSI Datasets | The Cancer Genome Atlas (TCGA), CAMELYON-16, TCGA-LUNG, BRACS [22] [11] [20] | Provide large, well-characterized, and often publicly available WSI datasets with slide-level and sometimes pixel-level labels for training and benchmarking.
Pre-trained Patch Encoders | CONCH, CTransPath, DINOv2 [10] [21] | Act as powerful feature extractors for image patches. Using models pre-trained on large histopathology or natural image datasets is a form of transfer learning that significantly boosts performance.
Whole-Slide Foundation Models | Prov-GigaPath, TITAN [10] [21] [23] | Provide off-the-shelf, general-purpose slide-level representations. Can be used as strong feature extractors for entire WSIs or fine-tuned for specific tasks, often achieving state-of-the-art results.
MIL Software Frameworks | CLAM [11] | Open-source Python packages that provide high-throughput, easy-to-use pipelines for WSI processing and MIL model training, facilitating rapid prototyping and experimentation.
Computational Hardware | High-Performance Compute (HPC) Clusters, GPUs (NVIDIA) | Essential for processing gigapixel WSIs and training large deep-learning models, especially foundation models with billions of parameters.

The fusion of the Multiple Instance Learning paradigm with large-scale foundation models represents the cutting edge of computational pathology. MIL provides a principled framework for learning from weak, slide-level labels, while foundation models like Prov-GigaPath and TITAN inject powerful, general-purpose representations pre-trained on hundreds of thousands of slides. This synergy enables the development of more accurate, data-efficient, and interpretable AI tools for tasks ranging from cancer diagnosis and subtyping to mutation prediction. As these technologies mature, they hold the profound potential to become integral components of the digital pathology workflow, assisting pathologists and accelerating the pace of biomedical discovery.

The analysis of Whole Slide Images (WSIs) in digital pathology presents a unique computational challenge due to the gigapixel size of the images, which can comprise tens of thousands of individual image tiles. Foundation models, trained on broad data using self-supervision and adapted to various downstream tasks, have emerged as powerful solutions, particularly in scenarios with scarce labeled data. Within this paradigm, three key architectural families have proven instrumental: Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Vision-Language Models (VLMs). CNNs bring inductive biases beneficial for learning local hierarchical features, while ViTs excel at capturing long-range dependencies and global context through self-attention mechanisms. VLMs extend these capabilities by integrating visual and textual information, enabling tasks that require semantic understanding. In the context of weakly supervised learning for WSI classification—where only slide-level labels are available—these architectures form the backbone of Multiple Instance Learning (MIL) frameworks, allowing models to learn from bags of instances (patches) without costly pixel-level or patch-level annotations. Their adaptation and scaling for gigapixel images represent an active and critical area of research [24] [25].

Core Architectural Principles

  • Convolutional Neural Networks (CNNs): CNNs, such as ResNet, operate on the principles of locality and weight sharing. Their inductive bias assumes that nearby pixels are related, making them highly effective at extracting hierarchical local features like edges and textures from image data. They process images through sliding convolutional filters and are known for their parameter efficiency [25]. In WSI analysis, they are often used as tile-level feature extractors.

  • Vision Transformers (ViTs): ViTs split an image into a sequence of fixed-size patches, linearly embed them, and process them through a standard Transformer encoder. The core of the ViT is the self-attention mechanism, which allows the model to weigh the importance of all other patches when encoding a specific patch. This enables ViTs to capture global context and long-range dependencies across the entire image, a significant advantage for understanding complex tissue structures [26] [25].

  • Vision-Language Models (VLMs): VLMs like CONCH and PLIP are typically dual-encoder models that process image and text inputs independently, projecting them into a shared latent representation space. They are trained using contrastive learning objectives that pull the embeddings of matching image-text pairs closer together while pushing non-matching pairs apart. This alignment allows for capabilities such as zero-shot classification, image-to-text retrieval, and text-to-image retrieval by using natural language prompts to query visual data [27] [24].

Quantitative Performance Comparison

The table below summarizes the reported performance of these architectures across various pathology tasks, highlighting their respective strengths.

Table 1: Performance Comparison of Key Architectures on Pathology Tasks

Architecture | Model Example | Task | Dataset | Performance Metric | Score | Key Advantage
CNN-based | ResNet (Weakly-Supervised) | Microorganism Enumeration [26] | Multiple Microbiology Datasets | Overall Performance | Better | Proven reliability, strong local feature extraction
ViT-based | CrossViT (Weakly-Supervised) [26] | Microorganism Enumeration [26] | Multiple Microbiology Datasets | Competent Results | Competent | Computational efficiency on homogeneous data
ViT-based | ViT-WSI [28] | Brain Tumor Subtyping | TCGA, In-house | Patient-Level AUC | 0.960 (IDH1) | High interpretability, powerful feature discovery
ViT-based | ViT-WSI [28] | Molecular Marker Prediction | TCGA, In-house | Patient-Level AUC | 0.874 (p53), 0.845 (MGMT) | Predicts molecular status from H&E stains
VLM | CONCH (Zero-shot) [27] | NSCLC Subtyping | TCGA NSCLC | Accuracy | 90.7% | Large margin over other VLMs, no task-specific training
VLM | CONCH (Zero-shot) [27] | RCC Subtyping | TCGA RCC | Accuracy | 90.2% | +9.8% over next-best (PLIP)
VLM | CONCH (Zero-shot) [27] | BRCA Subtyping | TCGA BRCA | Accuracy | 91.3% | ~35% improvement over other VLMs
Foundation Model | Prov-GigaPath [24] | Pan-Cancer Mutation Prediction | TCGA (18 genes) | Macro AUROC / AUPRC | +3.3% / +8.9% (vs. SOTA) | Whole-slide context modeling, scales to billions of tiles

Application Notes and Experimental Protocols

Protocol 1: Weakly-Supervised WSI Classification using a ViT-based MIL Framework

This protocol outlines the methodology for major primary brain tumor classification using the ViT-WSI model, a representative ViT-based approach for weakly supervised learning [28].

  • Objective: To perform end-to-end classification of WSIs (e.g., brain tumor typing and subtyping) using only slide-level labels and to identify discriminative histological features.
  • Materials:

    • Dataset: WSIs from The Cancer Genome Atlas (TCGA) and/or institutional archives.
    • Label Type: Slide-level diagnostic labels (e.g., glioblastoma, astrocytoma).
    • Computing: High-memory GPU servers capable of processing thousands of image patches per WSI.
  • Methodology:

    • WSI Preprocessing: Tissue patches are extracted from the WSI at a specified magnification (e.g., 20x). Each patch is processed through a pre-trained feature extractor (e.g., a CNN trained on histopathology data) to obtain a feature vector. The entire WSI is represented as a bag of these feature vectors.
    • Feature Projection: The feature vectors are projected into a lower-dimensional space and combined with positional encoding to retain spatial information.
    • Vision Transformer Encoder: The sequence of projected patch features is fed into a standard Vision Transformer encoder. The self-attention mechanism within the ViT models the interrelationships between all patches in the WSI, creating a contextualized embedding for each patch.
    • Classification Head: The [CLS] token output from the ViT encoder, which aggregates global slide information, is connected to a multi-layer perceptron (MLP) for the final slide-level classification.
    • Interpretation via Attribution: A gradient-based attribution analysis (e.g., calculating gradients of the output class with respect to the input patch features) is performed. This highlights the patches that most contributed to the prediction, effectively localizing diagnostic histological features without any pixel-level supervision.
  • Key Considerations: This end-to-end approach avoids the separate feature extraction and aggregation steps of older MIL methods. The quality of the initial feature extractor can significantly impact final performance.
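
The attribution step can be approximated with plain gradient saliency: rank patches by the gradient of the predicted class score with respect to their input features. This generic sketch assumes a model that maps a (K, D) feature bag to slide-level logits; it is not the exact attribution method of the cited ViT-WSI work.

```python
import torch

def patch_attribution(model: torch.nn.Module, feats: torch.Tensor, target_class: int) -> torch.Tensor:
    """Gradient saliency per patch.

    Assumes `model` maps a (K, D) bag of patch features to (n_classes,) slide logits.
    Returns a (K,) saliency score: L2 norm of d(class score)/d(feature) per patch.
    """
    feats = feats.clone().requires_grad_(True)
    logits = model(feats)
    logits[target_class].backward()
    return feats.grad.norm(dim=1)  # larger norm -> patch influenced the prediction more
```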

[Diagram: (1) WSI preprocessing and feature extraction: extract tissue patches, extract feature vectors, project features and add positional encoding; (2) ViT processing: a ViT encoder applies self-attention across patches to produce a global [CLS] token; (3) classification and interpretation: an MLP classifier yields the slide-level prediction, while gradient-based attribution yields a diagnostic heatmap.]

Protocol 2: Zero-Shot WSI Classification using a Vision-Language Foundation Model

This protocol describes how to use a VLM like CONCH for zero-shot classification of WSIs, which requires no task-specific training data [27].

  • Objective: To classify WSIs into diagnostic categories using only natural language descriptions of the classes.
  • Materials:
    • Pre-trained VLM: A model like CONCH, pretrained on histopathology image-caption pairs.
    • Text Prompts: A set of hand-crafted text descriptions for each diagnostic class (e.g., "A whole slide image of invasive ductal carcinoma of the breast").
  • Methodology:

    • Prompt Engineering: Create an ensemble of multiple text prompts for each class to improve robustness (e.g., "breast invasive ductal carcinoma," "a histology image of IDC").
    • Text Embedding: Use the VLM's text encoder to generate embedding vectors for all prompts across all classes.
    • WSI Tiling and Image Embedding: Divide the target WSI into tiles. Use the VLM's image encoder to generate an embedding vector for each tile.
    • Similarity Calculation: Compute the cosine similarity between each tile's image embedding and the text embeddings of all class prompts.
    • Score Aggregation: For each class, aggregate the similarity scores across all tiles in the WSI (e.g., by averaging the top-k most similar tiles, or using a method like MI-Zero). The class with the highest aggregated score is the predicted label.
    • Heatmap Visualization (Optional): To create an interpretable heatmap, visualize the cosine similarity scores between tiles and the predicted class's text prompt across the WSI canvas.
  • Key Considerations: Performance is highly dependent on the quality and diversity of the text prompts. This method is particularly powerful for rare diseases or new tasks where collecting labeled data is infeasible.
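
A minimal sketch of steps 2-5, assuming a dual-encoder VLM exposing `encode_text` and `encode_image` methods (placeholder names; consult the actual model API) and top-k mean aggregation in place of a full MI-Zero implementation:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_slide_prediction(model, tile_batches, class_prompts, top_k: int = 50):
    """Aggregate tile-text cosine similarities into a slide-level zero-shot prediction.

    model: dual-encoder VLM with encode_text / encode_image (placeholder API).
    tile_batches: iterable of (B, 3, H, W) tile tensors from one WSI.
    class_prompts: dict mapping class name -> list of prompt strings.
    """
    # One text embedding per class: encode the prompt ensemble and average.
    text_emb = {
        c: F.normalize(model.encode_text(prompts).mean(dim=0), dim=0)
        for c, prompts in class_prompts.items()
    }
    tile_embs = torch.cat(
        [F.normalize(model.encode_image(b), dim=1) for b in tile_batches]
    )  # (N_tiles, D)
    scores = {}
    for c, t in text_emb.items():
        sims = tile_embs @ t                             # (N_tiles,) cosine similarities
        k = min(top_k, sims.numel())
        scores[c] = sims.topk(k).values.mean().item()    # mean of top-k tiles
    return max(scores, key=scores.get), scores
```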

Table 2: Essential Research Reagents for Weakly Supervised WSI Analysis

Reagent / Resource | Type | Description | Representative Examples
TCGA (The Cancer Genome Atlas) | Dataset | A large, publicly available repository of WSIs with associated genomic and clinical data, serving as a primary benchmark. | TCGA-BRCA, TCGA-LUAD, TCGA-RCC [27] [28]
Prov-Path | Dataset | A very large-scale, real-world dataset from a health network; used for pre-training foundation models. | Prov-GigaPath was pretrained on 1.3 billion tiles from 171k slides [24]
CONCH | Vision-Language Model | A VLM pretrained on 1.17M histopathology image-caption pairs; enables zero-shot transfer. | Used for classification, segmentation, and retrieval [27]
Prov-GigaPath | Foundation Model | An open-weight, ViT-based foundation model pretrained on Prov-Path; excels at whole-slide context modeling. | Used for cancer subtyping and mutation prediction [24]
DINOv2 | Algorithm | A self-supervised learning method used for pre-training feature extractors without labels. | Used as the tile encoder in the Prov-GigaPath model [24]
MI-Zero | Algorithm | A method for aggregating tile-level predictions in a WSI for zero-shot VLM classification. | Used with CONCH for gigapixel WSI classification [27]

[Diagram: text pathway: class prompts and prompt ensembles pass through the VLM text encoder to produce class text embeddings; image pathway: the WSI is split into tiles and encoded by the VLM image encoder into tile embeddings; cross-modal matching: cosine similarities between tile and text embeddings are aggregated (e.g., MI-Zero) into a zero-shot WSI prediction, with an optional similarity heatmap.]

Advanced Integrated Workflow and Emerging Paradigms

Building on the core protocols, this section outlines a sophisticated, integrated workflow that leverages the strengths of both ViTs and VLMs, and introduces emerging paradigms like weakly semi-supervised learning.

Integrated Workflow: Combining ViT Feature Encoding with VLM Zero-Shot Triage

A powerful approach for real-world deployment involves creating a multi-stage pipeline. This protocol uses a VLM for high-throughput, zero-shot triage of easy cases, and a fine-tuned ViT model for complex diagnostics on uncertain cases [27] [28] [24].

  • Objective: To maximize diagnostic accuracy and efficiency by combining the broad, zero-shot capabilities of VLMs with the task-optimized performance of fine-tuned ViTs.
  • Materials: A pre-trained VLM, a ViT architecture (e.g., ViT-WSI, Prov-GigaPath), and a dataset with slide-level labels for fine-tuning.
  • Methodology:
    • Initial Triage: Process all incoming WSIs through the zero-shot VLM pipeline (Protocol 2).
    • Confidence Thresholding: Assign cases where the VLM's prediction confidence exceeds a pre-defined, high threshold to the "Automated Diagnosis" stream. These cases are reported automatically.
    • Expert Routing: Route all low-confidence cases (e.g., difficult or rare subtypes) for fine-tuned analysis. These WSIs are processed by a specialized, fine-tuned ViT model (following Protocol 1), which was trained on a curated dataset of complex cases.
    • Pathologist in the Loop: The predictions from the fine-tuned ViT, along with its interpretability heatmaps, are presented to a pathologist for final review and diagnosis.
  • Key Considerations: This workflow optimizes pathologist workload by automating clear-cut cases and providing decision support for challenging ones. The confidence threshold is a critical parameter that requires careful calibration based on clinical requirements.
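
The routing logic in this workflow reduces to a confidence gate; a minimal sketch with an illustrative threshold:

```python
def triage_case(vlm_label: str, vlm_confidence: float, run_finetuned_vit, threshold: float = 0.95):
    """Route a slide: auto-report confident zero-shot calls, else send to the fine-tuned ViT.

    run_finetuned_vit: placeholder callable returning (label, heatmap);
    the 0.95 threshold is illustrative and must be calibrated clinically.
    """
    if vlm_confidence >= threshold:
        return {"label": vlm_label, "route": "automated_diagnosis"}
    label, heatmap = run_finetuned_vit()
    return {"label": label, "heatmap": heatmap, "route": "pathologist_review"}
```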

Protocol 3: Weakly Semi-Supervised Learning with Cross-Consistency (CroCo)

The CroCo framework addresses a realistic clinical scenario where only a small fraction of WSIs have bag-level labels, and the rest are unlabeled. It moves beyond standard weakly supervised learning into a weakly semi-supervised paradigm [29].

  • Objective: To achieve robust bag-level and instance-level classification using a limited set of labeled WSIs and a large set of unlabeled WSIs.
  • Materials:
    • Dataset: A collection of WSIs where only a small subset has slide-level labels.
    • Model: A dual-branch architecture with heterogeneous classifiers.
  • Methodology:
    • Dual-Branch Architecture: The CroCo framework consists of two branches:
      • Upper Branch (Bag-based): An attention-based MIL pooler that aggregates instance features to make a bag-level prediction.
      • Lower Branch (Instance-based): A classifier that assigns a pseudo-label to each instance.
    • Two-Level Cross Consistency:
      • Bag-Level Consistency: The bag-level predictions from both branches are enforced to be consistent for the same unlabeled WSI.
      • Instance-Level Consistency: The attention scores from the upper branch are used as pseudo-labels to supervise the instance classifier in the lower branch, and vice-versa. For labeled positive bags, the instance-level pseudo-labels are exchanged between branches.
    • Training: The model is trained jointly on the labeled data (using true labels) and the unlabeled data (using the consistency constraints).
  • Key Considerations: This method efficiently leverages the unlabeled data by exploiting the multi-granularity (bag and instance) consistency between two different model perspectives, significantly improving performance when labeled data is scarce.
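
The two consistency terms can be sketched as follows for an unlabeled bag, using soft bag predictions and a crude attention binarization; this is a simplified illustration of the CroCo idea, not the authors' exact losses.

```python
import torch.nn.functional as F

def cross_consistency_losses(bag_logits_upper, bag_logits_lower, attn_scores, inst_logits):
    """Simplified two-level cross consistency for an unlabeled bag.

    bag_logits_upper/lower: (n_classes,) bag predictions from the two branches.
    attn_scores: (K,) attention weights from the upper branch.
    inst_logits: (K, 2) instance predictions from the lower branch.
    """
    # Bag level: make the two branches' soft predictions agree.
    bag_loss = F.mse_loss(bag_logits_upper.softmax(-1), bag_logits_lower.softmax(-1))
    # Instance level: high-attention patches become positive pseudo-labels.
    pseudo = (attn_scores > attn_scores.mean()).long()  # crude binarization (illustrative)
    inst_loss = F.cross_entropy(inst_logits, pseudo)
    return bag_loss + inst_loss
```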

[Diagram: labeled and unlabeled bags pass through a shared instance encoder into two branches: an upper branch with attention-based pooling and a bag classifier, and a lower branch with an instance classifier whose predictions are aggregated (e.g., max) into a second bag prediction. A bag-level consistency loss ties the two bag predictions together, while an instance-level consistency loss exchanges attention scores and instance predictions as pseudo-labels between branches.]

In computational pathology, the classification of Whole Slide Images (WSIs) presents a unique computational challenge due to their gigapixel size, often comprising tens of thousands of image tiles [24]. Multiple Instance Learning (MIL) has emerged as the predominant weakly supervised framework for this task, where a WSI is treated as a "bag" containing numerous patches or "instances" [30]. A significant breakthrough has been the integration of foundation models as powerful feature extractors for these instances. These models, pre-trained on massive datasets, provide generalized, information-rich tile embeddings that dramatically enhance the performance of MIL frameworks on critical tasks such as cancer subtyping, biomarker prediction, and prognosis stratification [31]. This application note explores the synergy between pathology foundation models and MIL, providing a quantitative benchmark of current models, detailed protocols for their implementation, and visualization of the leading architectures driving innovation in the field.

Benchmarking Pathology Foundation Models

Recent independent benchmarking studies have comprehensively evaluated numerous pathology foundation models across thousands of slides and dozens of clinically relevant tasks. The table below summarizes the performance of top-performing models based on their mean Area Under the Receiver Operating Characteristic curve (AUROC) across different task categories.

Table 1: Benchmarking Performance of Leading Pathology Foundation Models

Foundation Model | Model Type | Morphology Tasks (Mean AUROC) | Biomarker Tasks (Mean AUROC) | Prognosis Tasks (Mean AUROC) | Overall Mean AUROC
CONCH | Vision-Language | 0.77 | 0.73 | 0.63 | 0.71
Virchow2 | Vision-Only | 0.76 | 0.73 | 0.61 | 0.71
Prov-GigaPath | Vision-Only | - | 0.72 | - | 0.69
DinoSSLPath | Vision-Only | 0.76 | - | - | 0.69
BiomedCLIP | Vision-Language | - | - | 0.61 | 0.66

This benchmarking study, which involved 19 foundation models and 31 tasks across 6,818 patients, revealed that CONCH, a vision-language model trained on 1.17 million image-caption pairs, and Virchow2, a vision-only model trained on 3.1 million WSIs, achieved the highest overall performance [31]. A key finding was that foundation models trained on distinct cohorts learn complementary features; an ensemble combining CONCH and Virchow2 predictions outperformed individual models in 55% of tasks [31]. Furthermore, the research indicated that data diversity in pre-training may outweigh data volume, as CONCH outperformed BiomedCLIP despite being trained on far fewer image-caption pairs [31].

Table 2: Key Characteristics of Top-Performing Foundation Models

Foundation Model | Pre-training Dataset Scale | Architecture | Key Strengths
CONCH | 1.17M image-caption pairs [31] | Vision-Language [31] | Highest overall performance, excels in morphology tasks [31]
Virchow2 | 3.1M WSIs [31] | Vision-Only [31] | Top performance in biomarker prediction, strong in low-data settings [31]
Prov-GigaPath | 171,189 WSIs (1.3B tiles) [24] | Vision-Transformer (LongNet) [24] | State-of-the-art whole-slide context modeling [24]
TCv2 | Openly available data (TCGA, CPTAC) [32] | Supervised Multi-task Learning [32] | Resource-efficient training, state-of-the-art in cancer subtyping [32]

Experimental Protocols for MIL-Based WSI Classification

Feature Extraction Using Foundation Models

Objective: To generate high-quality tile-level feature embeddings from a gigapixel WSI using a pre-trained foundation model for downstream MIL aggregation.

Materials and Reagents:

  • Whole Slide Image (WSI): A gigapixel image file (e.g., in .svs, .ndpi format).
  • Pre-trained Foundation Model: Such as CONCH, Virchow2, or Prov-GigaPath.
  • Computational Resources: A GPU-equipped workstation (e.g., NVIDIA A100) with sufficient VRAM (>16GB recommended).

Procedure:

  • WSI Pre-processing:
    • Load the WSI using a library like OpenSlide.
    • Tile the WSI at a specified magnification level (e.g., 20x) into non-overlapping patches of standard size (e.g., 256x256 pixels).
    • Perform background detection and filtering to remove non-tissue tiles using thresholding algorithms (e.g., based on color or intensity).
  • Feature Extraction:
    • Load the pre-trained weights of the chosen foundation model.
    • Pass each valid image tile through the model to extract a feature vector.
    • For vision transformer-based models like Prov-GigaPath, this yields an embedding for each tile [24].
    • The output is a feature matrix of dimensions N x D, where N is the number of tiles and D is the dimensionality of the feature vector.
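A minimal sketch of this tiling-and-embedding loop is given below, assuming OpenSlide for slide access and a generic timm vision transformer as a stand-in encoder; a production pipeline would substitute the chosen foundation model's own loader and preprocessing, and the brightness-based background filter is a deliberate simplification of proper tissue detection.

import numpy as np
import openslide
import timm
import torch
from torchvision import transforms

slide = openslide.OpenSlide("slide.svs")  # hypothetical input path
tile_size, level = 256, 0
encoder = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0).eval()
preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])

features = []
width, height = slide.level_dimensions[level]
with torch.no_grad():
    for x in range(0, width, tile_size):
        for y in range(0, height, tile_size):
            tile = slide.read_region((x, y), level, (tile_size, tile_size)).convert("RGB")
            if np.asarray(tile).mean() > 220:  # crude brightness check: skip background
                continue
            features.append(encoder(preprocess(tile).unsqueeze(0)).squeeze(0))

feature_matrix = torch.stack(features)  # N x D matrix for downstream MIL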

MIL Aggregation and Classification

Objective: To aggregate tile-level features into a slide-level representation and perform a classification task (e.g., mutation prediction).

Materials and Reagents:

  • Feature Matrix: The N x D matrix from the previous protocol.
  • MIL Framework: Such as an Attention-based MIL (ABMIL) or a Transformer-based aggregator.

Procedure:

  • Feature Aggregation:
    • Input the feature matrix into the MIL model.
    • The model learns to assign attention weights to each tile, identifying diagnostically relevant regions.
    • A slide-level embedding is generated as a weighted sum of all tile features.
  • Classification Head:
    • The slide-level embedding is passed through a classifier, typically a fully connected neural network layer, to produce a prediction (e.g., BRAF mutation positive/negative).
  • Training and Evaluation:
    • Train the MIL model end-to-end using slide-level labels. The foundation model's feature extractor can be frozen or fine-tuned.
    • Evaluate performance on a held-out test set using metrics such as AUROC and AUPRC.
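The aggregation step above can be sketched as a small PyTorch module; the two-layer attention network and dimensions below are illustrative choices rather than the exact configuration of any published ABMIL variant.

import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=256, n_classes=2):
        super().__init__()
        # The attention network scores each tile; softmax turns scores into weights.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
        )
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, bag):                           # bag: (N, feat_dim)
        scores = self.attention(bag)                  # (N, 1) unnormalized scores
        weights = torch.softmax(scores, dim=0)        # attention over instances
        slide_embedding = (weights * bag).sum(dim=0)  # weighted sum of tile features
        return self.classifier(slide_embedding), weights.squeeze(-1)

logits, attn = AttentionMIL()(torch.randn(5000, 1024))  # one bag of 5,000 tiles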

Advanced Methodologies and Visualization

Workflow Diagram

The following diagram illustrates the end-to-end pipeline for WSI classification using foundation models within an MIL framework.

Diagram: End-to-end MIL pipeline. Whole Slide Image (WSI) → Tiling & Pre-processing → Foundation Model (e.g., CONCH, Virchow2) → Tile Feature Matrix (N x D) → MIL Aggregator (e.g., ABMIL, Transformer) → Slide-Level Embedding → Classification Output.

The CAMCSA Architecture

To address limitations of standard attention-based MIL, novel architectures like CAMCSA (Class Activation Map with Cross-Slide Augmentation) have been developed [30] [33]. The diagram below details its structure.

Diagram: CAMCSA architecture. WSI Feature Matrix → WSICAM Module → Accurate Instance Scores → Cross-Slide Augmentation (CSA; selects significant instances) → Mixed Feature Bag → WSICAM Module → Bag Classification.

The WSICAM module introduces Class Activation Map (CAM) theory to generate instance scores that more accurately represent each tile's contribution to the final slide-level classification, moving beyond simple feature similarity [30] [33]. The CSA module leverages these scores to select significant instances from two different slides, creating a "mixed bag" with a blended label. This augmentation technique enhances model generalization and is particularly effective for imbalanced datasets [30].

The Scientist's Toolkit: Essential Research Reagents and Materials

Table 3: Key Resources for Implementing Foundation Model-Driven WSI Analysis

Category / Item | Specification / Example | Function & Application Note
Foundation Models | CONCH, Virchow2, Prov-GigaPath, TCv2 [31] [24] [32] | Pre-trained feature extractors. Select based on task: vision-language for morphology, large-scale vision-only for biomarkers.
MIL Aggregators | ABMIL, Transformer-based, CAMCSA [31] [30] | Algorithms to aggregate tile features. Transformer-based and advanced methods like CAMCSA often outperform simple attention.
Computational Hardware | NVIDIA A100/A6000 GPU [32] | Accelerates feature extraction and model training. Essential for processing gigapixel WSIs in a feasible time.
WSI Datasets | TCGA, CPTAC, Camelyon16, in-house cohorts [32] | Source of WSIs for training and validation. Ensure external validation cohorts to test generalizability [31].
Software Libraries | Python, PyTorch, OpenSlide, HiMLS | Core programming environment and tools for WSI handling, model implementation, and data management.

The synergy between foundation models and MIL frameworks represents a paradigm shift in computational pathology. By providing powerful, general-purpose feature representations, foundation models like CONCH and Virchow2 have significantly elevated the performance ceiling for weakly supervised WSI classification on tasks ranging from biomarker prediction to prognosis. The experimental protocols and advanced methodologies detailed herein provide a roadmap for researchers to leverage these tools effectively. Future progress will likely be driven by more data-efficient and explainable architectures, such as CAMCSA and TCv2, further cementing the role of foundation models as the cornerstone of digital pathology analysis.

From Theory to Practice: Implementing Weakly Supervised Pipelines with Foundation Models

Whole Slide Image (WSI) classification is a cornerstone of computational pathology, enabling automated diagnosis, prognosis, and biomarker prediction from gigapixel tissue scans. A typical WSI analysis pipeline involves a sequence of critical steps: tissue segmentation to distinguish relevant tissue from background, patching to divide the gigapixel image into manageable segments, and feature extraction to convert pixel data into meaningful numerical representations. The emergence of foundation models, pre-trained on massive datasets via self-supervised learning, has revolutionized feature extraction, facilitating robust and data-efficient downstream analysis. Furthermore, the paradigm of weakly supervised learning, which utilizes only slide-level labels for training, has gained significant traction as it alleviates the need for expensive and tedious pixel-level annotations. This document provides application notes and protocols for constructing a WSI classification pipeline in the context of research on foundation models for weakly supervised whole-slide image classification.

Core Building Blocks of the WSI Pipeline

The processing of a WSI involves a multi-stage pipeline where each stage is critical for the overall performance of the classification system. The workflow progresses from raw WSI input to a final slide-level prediction, with intermediate steps of tissue segmentation, patching, and feature extraction feeding into a weakly supervised classification model. The following diagram illustrates the logical flow and key decision points within this pipeline.

Diagram: Core pre-processing steps. Raw Whole Slide Image (WSI) → Tissue Segmentation → Patching → Feature Extraction (Foundation Model) → Weakly Supervised Model (e.g., CLAM, WholeSIGHT) → Slide-level Prediction & Heatmaps.

Tissue Segmentation

Purpose and Goal: The initial and crucial step in WSI analysis is tissue segmentation, which involves identifying and separating biologically relevant tissue regions from the background (e.g., glass slide, air bubbles, pen marks). This process drastically reduces the computational load by ensuring that subsequent processing is focused only on meaningful areas.

Detailed Protocol:

  • Input: A gigapixel WSI at its highest magnification level.
  • Thumbnail Generation: Generate a low-resolution thumbnail (e.g., 5x or 10x magnification) of the entire WSI to simplify the segmentation task.
  • Color Normalization (Optional): Apply color normalization techniques to standardize stain variations across different scanners and laboratories, improving segmentation robustness.
  • Segmentation Model:
    • Model Architecture: A U-Net is a commonly used architecture for this semantic segmentation task. An in-house trained U-Net model can be utilized to segment tissue versus background on the low-magnification thumbnail [34].
    • Output: A binary mask where pixels corresponding to tissue are labeled 1, and background pixels are labeled 0.
  • Mask Application: The generated binary mask is used to filter out background regions during the patching step. Only patches that fall within the tissue mask are processed further.

Patching and Representative Patch Selection

Purpose and Goal: Given the gigapixel size of WSIs, they must be divided into smaller, manageable patches for analysis. A key challenge is selecting a representative subset of patches that capture the morphological diversity of the tissue while minimizing redundancy, particularly from uninformative normal regions.

Detailed Protocol:

  • Input: The original WSI and its corresponding tissue segmentation mask.
  • Grid-based Patching: Overlay a grid on the WSI at a specified magnification (typically 20x) and extract patches of a fixed size (e.g., 256x256 or 512x512 pixels).
  • Foreground Filtering: Discard any patch where less than a certain threshold (e.g., 50%) of its area is covered by the tissue mask.
  • Representative Patch Selection: Several strategies exist to select the most informative patches and avoid processing thousands of redundant ones.
    • Clustering-based Mosaics: As implemented in the Yottixel tool, patches can be clustered based on color and texture features. A representative subset (a "mosaic") is selected from each cluster to ensure morphological diversity [34].
    • Anomaly Detection with a "Normal Atlas": For tasks focused on pathology, a powerful strategy is to prune patches of normal tissue. An atlas of normal tissue can be created using one-class classifiers (e.g., Isolation Forest, one-class SVM) trained on deep features from confirmed normal WSIs. This model can then identify and filter out "normal" patches from a new WSI, leaving a concise set of patches enriched for abnormal morphology [34]. This can reduce the number of patches by 30-50% without sacrificing performance.
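A sketch of the one-class "normal atlas" filter is shown below, using scikit-learn's IsolationForest on pre-computed patch embeddings; the .npy file names are placeholders for features extracted from confirmed-normal slides and from a new query slide.

import numpy as np
from sklearn.ensemble import IsolationForest

normal_feats = np.load("normal_patch_features.npy")  # (M, D) from confirmed-normal WSIs
query_feats = np.load("new_slide_features.npy")      # (N, D) from a new slide

atlas = IsolationForest(contamination="auto", random_state=0).fit(normal_feats)
is_normal = atlas.predict(query_feats) == 1          # +1 marks inliers (normal-like)
abnormal_feats = query_feats[~is_normal]             # retain only atypical patches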

Feature Extraction Using Foundation Models

Purpose and Goal: Feature extraction transforms image patches into a low-dimensional, numerical vector that encapsulates the salient visual and semantic information. Foundation models, pre-trained on vast histopathology datasets, have become the preferred method for this, providing highly transferable and powerful features.

Detailed Protocol:

  • Input: The selected representative patches from the previous step.
  • Foundation Model Selection: Choose a pre-trained foundation model based on the task and data availability. Recent benchmarks indicate that vision-language models like CONCH and vision-only models like Virchow2 are top performers across various tasks, including morphology assessment, biomarker prediction, and prognosis [31].
  • Feature Extraction:
    • Process: Pass each image patch through the foundation model's encoder. The output from a pre-final layer (e.g., a 768-dimensional or 1024-dimensional vector) is extracted as the feature embedding for that patch.
    • Output: A set of feature vectors, one for each patch, which collectively represent the WSI. These features serve as input to the weakly supervised classification models.

Table 1: Benchmarking of Select Pathology Foundation Models for Feature Extraction

Foundation Model | Model Type | Key Strength | Reported Avg. AUROC (Morphology) | Reported Avg. AUROC (Biomarker) | Reported Avg. AUROC (Prognosis)
CONCH [31] | Vision-Language | Highest overall performance, especially in morphology and prognosis | 0.77 | 0.73 | 0.63
Virchow2 [31] | Vision-Only | Top performance in biomarker prediction, close second overall | 0.76 | 0.73 | 0.61
TITAN [10] | Multimodal WSI | Generates general-purpose slide representations; excels in low-data scenarios | N/A | N/A | N/A
DinoSSLPath [31] | Vision-Only | Strong all-around performer | 0.76 | 0.69 | N/A

Weakly Supervised Classification with Slide-Level Labels

With feature embeddings extracted for every patch in a WSI, the next step is to aggregate them to make a single slide-level prediction using only the overall slide label. This is typically framed as a Multiple Instance Learning (MIL) problem.

Attention-based Multiple Instance Learning

Clustering-constrained Attention Multiple Instance Learning (CLAM) is a seminal data-efficient method that creates a slide-level representation from patch features [11].

Detailed Protocol:

  • Input: A bag of feature vectors {x₁, x₂, ..., xₙ} for a WSI and its slide-level label Y.
  • Attention Branch: The model uses an attention-based pooling module to assign an attention score aₖ to each patch, representing its importance for the slide's classification.
  • Slide-level Representation: A weighted average of all patch features is computed using the attention scores, resulting in a single, refined slide-level feature vector.
  • Classification: This slide-level vector is passed through a fully connected layer for classification.
  • Instance-level Clustering (Key Innovation): CLAM introduces an auxiliary learning objective. During training, it uses the attention scores to generate pseudo-labels for the most and least attended patches, encouraging the model to cluster patch-level features in a semantically meaningful way. This provides additional supervision and improves data efficiency [11].
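The clustering constraint can be sketched as a pseudo-labeling rule over attention scores: the k most-attended patches of a class branch are treated as positive evidence and the k least-attended as negative, supervising an auxiliary instance classifier. The function below is an illustrative simplification, not CLAM's exact implementation.

import torch

def instance_pseudo_labels(attention_scores: torch.Tensor, k: int = 8):
    """attention_scores: (N,) raw scores from one class-specific attention branch."""
    top_idx = torch.topk(attention_scores, k).indices      # most-attended patches
    bottom_idx = torch.topk(-attention_scores, k).indices  # least-attended patches
    idx = torch.cat([top_idx, bottom_idx])
    labels = torch.cat([torch.ones(k), torch.zeros(k)]).long()
    return idx, labels  # train the instance classifier on (features[idx], labels)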

The internal workflow of CLAM, from patch embedding aggregation to slide-level classification and the auxiliary clustering process, is detailed below.

Diagram: CLAM internals. Bag of patch feature embeddings → Attention Module → attention scores → weighted average → Slide-Level Representation → Classifier → Slide-Level Prediction; in parallel, the attention scores generate pseudo-labels for an instance-level clustering loss that feeds back into the model update.

Joint Segmentation and Classification

WholeSIGHT is a graph-based method that performs weakly supervised segmentation and classification simultaneously, using only slide-level labels [35] [36].

Detailed Protocol:

  • Input: A WSI.
  • Tissue Graph Construction: The WSI is converted into a graph where nodes represent superpixels (small, homogeneous tissue regions) and edges represent the interactions between adjacent regions [36].
  • Graph Neural Network (GNN): A GNN processes the tissue graph to contextualize node embeddings.
  • Dual-Headed Architecture:
    • A graph classification head performs the slide-level classification.
    • A node classification head is trained to perform segmentation. During training, this head is supervised by node-level pseudo-labels generated via post-hoc feature attribution on the graph classification head [35] [36].
  • Output: At inference time, the model outputs both the slide-level class and a segmentation map highlighting the histological regions of interest.

Table 2: Comparison of Weakly Supervised Methods for WSI Analysis

Method | Core Architecture | Key Innovation | Outputs | Notable Application
CLAM [11] | Attention-based MIL | Instance-level clustering for data efficiency | Slide-level classification & attention heatmaps | Subtyping of RCC and NSCLC; lymph node metastasis detection
WholeSIGHT [35] [36] | Graph Neural Network | Generates node pseudo-labels from graph attribution | Joint slide-level classification and semantic segmentation | Gleason pattern segmentation and grading in prostate cancer
DG-WSDH [37] | Dynamic Graph + Hashing | Deep hashing for classification and image retrieval | Slide-level classification & binary codes for retrieval | WSI and patch-level retrieval tasks

Experimental Protocols

Protocol: Benchmarking Foundation Models as Feature Extractors

Objective: To evaluate and select the most suitable foundation model for a specific weakly supervised task (e.g., cancer subtyping).

Materials:

  • Dataset: A cohort of WSIs with slide-level labels, split into training, validation, and test sets.
  • Foundation Models: Pre-trained models such as CONCH, Virchow2, UNI, etc.
  • Software: A weakly supervised framework like CLAM.

Procedure:

  • Pre-processing: For all WSIs in the cohort, perform tissue segmentation and patching at 20x magnification (e.g., 256x256 px).
  • Feature Extraction: Extract feature embeddings for each patch using each foundation model under evaluation.
  • Model Training: For each set of features, train an identical weakly supervised model (e.g., CLAM) using the same training/validation split and hyperparameters.
  • Evaluation: Compare the performance of the models on the held-out test set using metrics such as Area Under the Receiver Operating Characteristic Curve (AUROC) and Area Under the Precision-Recall Curve (AUPRC). Assess statistical significance.

Expected Output: A performance ranking of foundation models for the specific task, guiding the optimal model selection [31].
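The evaluation step can be scripted with scikit-learn as below; the label and prediction files are placeholders for held-out test-set outputs of each feature-extractor variant in a binary task.

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = np.load("test_labels.npy")        # (n_slides,) binary ground truth
for name in ["conch", "virchow2", "uni"]:
    y_prob = np.load(f"preds_{name}.npy")  # (n_slides,) positive-class probabilities
    print(name,
          "AUROC = %.3f" % roc_auc_score(y_true, y_prob),
          "AUPRC = %.3f" % average_precision_score(y_true, y_prob))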

Protocol: Implementing Weakly Supervised Segmentation with WholeSIGHT

Objective: To train a model that can simultaneously classify a WSI and segment its diagnostically relevant regions using only slide-level labels.

Materials:

  • Dataset: WSIs with slide-level labels (e.g., Gleason grade). Pixel-level annotations for a test set are required for segmentation evaluation.
  • Software: The WholeSIGHT codebase.

Procedure:

  • Graph Construction: For each WSI, generate a tissue graph. This involves:
    • Superpixel Generation: Use an algorithm like SLIC to oversegment the WSI into superpixels.
    • Node Feature Extraction: Extract features from each superpixel region, potentially using a pre-trained CNN.
    • Edge Definition: Connect nodes based on spatial proximity of superpixels.
  • Model Training: Train the WholeSIGHT model end-to-end.
    • The graph classification head is trained via the slide-level label.
    • The node classification head is trained using pseudo-labels generated from the graph classification head via a method like Grad-CAM.
  • Inference and Evaluation:
    • Classification: Evaluate slide-level prediction accuracy on the test set.
    • Segmentation: Generate the segmentation map by inferring node classes and compare against pixel-level ground truth using metrics like Dice Similarity Coefficient (DSC) [36].
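A minimal tissue-graph construction sketch is given below: SLIC superpixels act as nodes, mean color serves as a stand-in node feature (a real pipeline would use CNN features, as noted above), and edges connect spatially adjacent superpixels. This illustrates the data structure only, not the WholeSIGHT codebase itself.

import numpy as np
from skimage.io import imread
from skimage.segmentation import slic

image = imread("thumbnail.png")  # hypothetical low-resolution WSI thumbnail, (H, W, 3)
labels = slic(image, n_segments=500, compactness=10, start_label=0)

# Node features: mean color per superpixel (placeholder for CNN features).
n_nodes = labels.max() + 1
node_feats = np.stack([image[labels == i].mean(axis=0) for i in range(n_nodes)])

# Edges: connect superpixels that touch horizontally or vertically.
edges = set()
for a, b in zip(labels[:, :-1].ravel(), labels[:, 1:].ravel()):
    if a != b:
        edges.add((min(a, b), max(a, b)))
for a, b in zip(labels[:-1, :].ravel(), labels[1:, :].ravel()):
    if a != b:
        edges.add((min(a, b), max(a, b)))
edge_index = np.array(sorted(edges)).T  # (2, E) adjacency list for a GNN library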

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools and Resources for WSI Analysis

Resource Name | Type | Function/Purpose | Reference/Resource
CLAM | Software Package | A high-throughput, data-efficient framework for weakly supervised WSI classification using attention-based MIL. | GitHub Repository [11]
WholeSIGHT | Software Package | A graph-based method for joint weakly supervised segmentation and classification of WSIs. | GitHub Repository [35]
CONCH / Virchow2 | Foundation Model | Pre-trained models for extracting state-of-the-art feature embeddings from histopathology patches. | [Publication & Model Zoos] [31]
Yottixel | Software Tool | Creates a mosaic of representative patches from a WSI for efficient downstream processing. | [Publication] [34]
Normal Tissue Atlas | Method/Protocol | A one-class classifier approach to filter out normal tissue patches, improving patch selection efficiency. | [Publication] [34]

The advent of digital pathology has generated vast quantities of whole slide images (WSIs), creating unprecedented opportunities for artificial intelligence to transform diagnostic practices. Foundation models, pre-trained on massive datasets using self-supervised learning (SSL), have emerged as powerful tools that learn general-purpose feature representations from unlabeled histopathology images. These models address a critical bottleneck in computational pathology: the scarcity of expensive, expert-curated labels. Unlike task-specific models that require extensive labeled data for each new application, foundation models can be adapted to diverse downstream tasks with minimal fine-tuning, making them particularly valuable for weakly supervised learning scenarios where only slide-level labels are available.

This paradigm is especially relevant for WSI classification, where gigapixel images are too large to process directly. Weakly supervised multiple instance learning (MIL) approaches overcome this by treating a WSI as a "bag" of smaller image patches ("instances"), allowing models to predict slide-level labels while potentially identifying diagnostically relevant regions. Foundation models serve as superior feature extractors for these patches, capturing rich morphological patterns that generalize across various diagnostic tasks, from cancer subtyping and biomarker prediction to rare disease detection. This document provides a detailed overview of five leading pathology foundation models—CONCH, Virchow2, UNI, CTransPath, and Phikon—focusing on their architectures, performance, and practical applications in weakly supervised WSI classification research.

Model Specifications and Comparative Analysis

Technical Specifications of Leading Pathology Foundation Models

Model Name | Architecture Type | Training Data Scale | Parameters | Primary Training Data Sources | Key Innovation / Focus
CONCH [27] [38] | Vision-Language (Multimodal) | 1.17M image-caption pairs [27] | 86 million [39] | Diverse public sources (PubMed, Twitter); No TCGA/PAIP [38] | Contrastive learning aligning images with biomedical text
Virchow2 [31] | Vision-Only (ViT) | 3.1 million WSIs [31] | Information Missing | Proprietary (MSKCC) [40] | Extreme scale; DINOv2 self-supervised learning
UNI [41] | Vision-Only (ViT) | 100k WSIs (UNI v1); >200M images from 350k+ WSIs (UNI2) [41] | 307 million (UNI v1) [40] | Mass-100K (proprietary) [39] | General-purpose feature extraction; Scalability
CTransPath [42] | Hybrid CNN-Transformer | 15.6M patches from 32.2k WSIs [43] | Information Missing | TCGA, PAIP (public) [42] | Combines local (CNN) and global (Transformer) feature learning
Phikon [40] [31] | Vision-Only (ViT) | 6,000 WSIs [39] | 86.4 million [39] | TCGA (public) [39] | SSL adaptation for pathology with a smaller dataset

Performance Benchmarking Across Key Domains

Recent independent benchmarking studies provide critical insights into the real-world performance of these models. A comprehensive evaluation of 19 foundation models across 31 clinical tasks—including morphology, biomarkers, and prognostication—revealed that CONCH and Virchow2 achieved the highest overall performance [31].

Table: Model Performance Across Task Types (Mean AUROC) [31]

Model | Morphology (5 tasks) | Biomarkers (19 tasks) | Prognosis (7 tasks) | Overall (31 tasks)
CONCH | 0.77 | 0.73 | 0.63 | 0.71
Virchow2 | 0.76 | 0.73 | 0.61 | 0.71
Prov-GigaPath | 0.72 | 0.72 | 0.60 | 0.69
DinoSSLPath | 0.76 | 0.68 | 0.59 | 0.69
UNI | 0.71 | 0.68 | 0.60 | 0.68
Virchow | 0.70 | 0.67 | 0.60 | 0.67
CTransPath | 0.70 | 0.67 | 0.59 | 0.67
Phikon | 0.69 | 0.65 | 0.58 | 0.65
PLIP | 0.67 | 0.63 | 0.58 | 0.64

This benchmark demonstrates that CONCH, a vision-language model, performs on par with Virchow2, a vision-only model trained on nearly three times as many WSIs, highlighting the value of incorporating textual information during pre-training [31].

Experimental Protocols for Weakly Supervised WSI Classification

Standardized Workflow for Downstream Task Evaluation

Implementing a foundation model for a weakly supervised classification task involves a multi-stage pipeline. The diagram below illustrates the key steps from WSI processing to slide-level prediction.

Diagram: Standardized workflow. (1) WSI preprocessing: gigapixel Whole Slide Image → tissue segmentation & patching → patch embedding via foundation model. (2) Weakly supervised learning: feature aggregation (MIL framework) → slide-level prediction (classification output) → model performance evaluation (AUROC, balanced accuracy).

Detailed Protocol Steps

Step 1: WSI Preprocessing and Patching

  • Tissue Segmentation: Use saturation thresholding or other algorithms to distinguish tissue regions from background [44]. This critical step eliminates non-informative image areas.
  • Patch Extraction: Segment the tissue areas into smaller, manageable image patches (e.g., 224×224 or 256×256 pixels) at a specified magnification level (typically 20×). A single WSI can generate thousands of patches.
  • Feature Extraction: Process each patch through a pre-trained foundation model to extract feature embeddings. These embeddings (typically 512-1024 dimensional vectors) serve as compact, informative representations of the visual content.

Step 2: Multiple Instance Learning (MIL) Setup

  • Bag Formation: Aggregate all patch-level embeddings from a single WSI to form a "bag" of instances. This bag represents the entire slide for weakly supervised learning.
  • Feature Aggregation: Implement an aggregation mechanism to combine patch-level information into a single slide-level representation. Common approaches include:
    • Attention-based MIL (ABMIL): Uses an attention network to weight patches based on their importance to the slide-level classification [44].
    • Transformer-based Aggregation: Employs transformer blocks to model interactions between patches before final classification [31].
  • Classifier Training: Train a final classification layer on the aggregated slide-level features using the available weak labels (e.g., cancer subtype, mutation status).

Step 3: Evaluation and Validation

  • Performance Metrics: Evaluate model performance using area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), balanced accuracy, and F1-score [31].
  • Validation Strategy: Employ rigorous validation including hold-out testing and external validation on datasets from different institutions to assess generalizability [44].

The Scientist's Toolkit: Essential Research Reagents

Successful implementation of computational pathology workflows requires both computational resources and specialized software tools.

Table: Essential Materials and Computational Tools

Category | Item / Tool Name | Function / Application | Implementation Notes
Foundation Models | CONCH, Virchow2, UNI, CTransPath, Phikon | Feature extraction from histopathology patches | Available via GitHub (e.g., mahmoodlab/CONCH, mahmoodlab/UNI) [41] [38]
ML Frameworks | PyTorch, TensorFlow | Model implementation and training | Essential for building custom MIL architectures
Whole Slide Image Libraries | OpenSlide, CuCIM | WSI reading and processing at multiple magnifications | Enable efficient handling of gigapixel images
Multiple Instance Learning | ABMIL, Transformer-MIL | Slide-level prediction from patch embeddings | ABMIL is a widely used baseline; Transformer-MIL shows promising performance [31]
Computing Infrastructure | High-end GPUs (NVIDIA A100, H100) | Model training and inference | Critical for processing large WSI datasets in reasonable timeframes
Pathology Datasets | TCGA, CPTAC, External cohorts | Model training, benchmarking, and validation | External validation is crucial for assessing generalizability [44]

Performance Analysis and Model Selection Guidelines

Comparative Strengths and Application Fit

CONCH excels in multimodal tasks that benefit from semantic alignment between images and text, demonstrating strong performance across diverse evaluation benchmarks [27] [31]. Its training on diverse public sources without using common benchmarks like TCGA reduces data contamination risks [38]. CONCH is particularly suitable for:

  • Zero-shot classification and retrieval tasks
  • Applications requiring integration of image and textual data
  • Scenarios where explainability via text-based similarity is valuable

Virchow2 leverages massive-scale training data to achieve state-of-the-art performance on many cancer detection and subtyping tasks [31]. Its vision-only approach benefits from:

  • Pan-cancer detection across common and rare cancers
  • Scenarios with abundant computational resources for large model inference
  • Applications where textual data is unavailable or irrelevant

UNI provides a balanced approach with strong performance across multiple tasks at a lower computational cost compared to larger models [44]. Its applications include:

  • Resource-constrained environments
  • General-purpose feature extraction across multiple tissue types
  • Scenarios requiring a robust, well-documented model

CTransPath offers a hybrid architecture that captures both local features (via CNN) and global context (via Transformer) [42]. It performs well despite its smaller training dataset, making it suitable for:

  • Tasks requiring both cellular and tissue-level features
  • Scenarios with limited computational resources for inference
  • Applications where CNN-based feature compatibility is desired

Phikon demonstrates that effective foundation models can be trained with relatively limited data (6,000 WSIs) [39], serving as a valuable baseline for:

  • Proof-of-concept studies with limited resources
  • Comparisons with larger-scale models
  • Educational purposes in computational pathology

Advanced Applications: Ensemble Approaches and Low-Data Scenarios

Research indicates that foundation models trained on distinct cohorts learn complementary features. Ensemble approaches that combine predictions from multiple models (e.g., CONCH and Virchow2) can outperform individual models, achieving superior performance in 55% of tasks in one benchmark [31].

In low-data scenarios—common for rare molecular events—the relative performance of models shifts. While Virchow2 dominates in settings with 300 training patients, PRISM (another foundation model) performs better with 150 patients, and CONCH shows competitive performance with only 75 patients [31]. This suggests that model selection should consider both the foundation model architecture and the amount of available training data for the downstream task.

For non-neoplastic pathology, an area underrepresented in most foundation model training data, performance gaps between pathology-specific and general vision models narrow, particularly for inflammatory conditions [39]. This highlights the importance of domain-specific tuning for applications beyond oncology.

Pathology foundation models represent a transformative advance in computational pathology, enabling robust weakly supervised classification of WSIs across diverse diagnostic tasks. CONCH and Virchow2 currently demonstrate the strongest overall performance, though optimal model selection depends on specific application requirements, available data, and computational resources. The field continues to evolve rapidly, with emerging trends including larger multi-modal architectures, ensemble methods, and improved generalization to non-cancer pathologies. By following the standardized protocols outlined in this document and leveraging appropriate model selection criteria, researchers can effectively harness these powerful tools to advance precision medicine and computational pathology research.

Implementing Attention-Based Multiple Instance Learning (ABMIL) with Foundation Model Features

The integration of foundation models with Attention-Based Multiple Instance Learning (ABMIL) represents a paradigm shift in computational pathology, enabling robust whole-slide image (WSI) analysis using only slide-level labels. This approach leverages transfer learning from large-scale pretrained models to extract discriminative features from gigapixel images, which are then processed through attention mechanisms to identify diagnostically relevant regions. This methodology has demonstrated significant performance improvements across various cancer types while maintaining interpretability through attention heatmaps. The following application notes provide a comprehensive framework for implementing this powerful technique, including quantitative benchmarks, standardized protocols, and practical implementation tools.

Whole-slide images in computational pathology present unique computational challenges due to their gigapixel resolution, typically around 40,000 × 40,000 pixels or 1 GB per image [45]. Multiple Instance Learning (MIL) has emerged as a fundamental framework for analyzing these images using only slide-level labels, where each WSI is treated as a "bag" containing thousands of smaller image patches ("instances") [46] [47]. The core MIL premise states that a bag is positive if it contains at least one positive instance, making it ideal for cancer detection where only slide-level diagnoses are available.

Attention-Based MIL (ABMIL) enhances this framework by introducing an attention mechanism that learns to weight instances based on their importance, enabling both classification and identification of critical regions [46]. This approach provides inherent interpretability through attention heatmaps that highlight morphologically significant areas. Foundation models, pretrained on massive diverse datasets, have recently revolutionized this pipeline by providing superior feature representations compared to traditionally trained models [48] [49]. These models extract semantically rich features that capture essential pathological patterns, significantly boosting downstream task performance.

Quantitative Benchmarking of Pathology Foundation Models

Comprehensive evaluation of foundation models for feature extraction provides critical guidance for model selection in ABMIL pipelines. Recent large-scale benchmarking studies have evaluated 19 pathology foundation models across 13 patient cohorts encompassing 6,828 patients and 9,528 slides from lung, colorectal, gastric, and breast cancers [48].

Table 1: Performance Comparison of Leading Foundation Models on Weakly-Supervised Computational Pathology Tasks

Foundation Model | Average AUROC | Key Strengths | Computational Requirements | Optimal Use Cases
CONCH | 0.891 | Best overall performance, strong multimodal capabilities | High | General-purpose pathology tasks, biomarker prediction
Virchow2 | 0.876 | Excellent morphological feature extraction | Medium-High | Tumor subtyping, prognosis prediction
CHIEF | 0.863 | Strong domain generalization, handles preparation variances | Medium | Multi-center studies, diverse populations
UNI | 0.852 | Competitive single-modal performance | Medium | Resource-constrained environments
RetCCL | 0.844 | Optimized for tissue representation learning | Low-Medium | Screening applications, educational tools

The benchmarking results demonstrate that the CONCH model achieves superior performance across multiple tasks, followed closely by Virchow2 [48]. Importantly, ensemble approaches combining CONCH and Virchow2 predictions achieved performance improvements in 55% of tasks compared to individual models, highlighting the value of model diversity. The research also revealed that data diversity significantly outweighs data quantity in model performance, emphasizing the importance of varied tissue representations and staining protocols in training data.

Table 2: Task-Specific Performance Metrics (AUROC) for Top Foundation Models

Pathology Task | CONCH | Virchow2 | CHIEF | UNI | RetCCL
Lung Cancer Subtyping | 0.912 | 0.899 | 0.881 | 0.867 | 0.859
Colorectal Cancer Detection | 0.885 | 0.872 | 0.866 | 0.851 | 0.842
Breast Cancer Grading | 0.893 | 0.884 | 0.869 | 0.861 | 0.853
Molecular Marker Prediction | 0.874 | 0.863 | 0.851 | 0.839 | 0.827
Survival Outcome Prediction | 0.882 | 0.871 | 0.858 | 0.846 | 0.836

Experimental Protocols and Workflows

Whole-Slide Image Preprocessing and Patch Extraction

Standardized WSI preprocessing is critical for consistent feature extraction and model performance. The following protocol ensures optimal data preparation:

  • Tissue Segmentation and Validation:

    • Apply Otsu's thresholding or deep learning-based segmentation to identify tissue regions excluding background
    • Validate segmentation quality by manual review of 5% of slides randomly selected across batches
    • Keep segmentation parameters consistent across datasets to minimize batch effects
  • Patch Extraction Parameters:

    • Extract non-overlapping patches at 20× magnification (256×256 pixels)
    • Set minimum tissue area threshold to 60% to exclude non-informative regions
    • Employ quality control filters to remove artifacts, folds, and out-of-focus regions
    • Average patch count per WSI: 5,000-15,000 depending on tissue size
  • Color Normalization and Augmentation:

    • Apply Macenko or Vahadane normalization to address staining variability
    • Implement standard augmentation during training: random rotation (90°, 180°, 270°), horizontal/vertical flipping
    • Exclude intensity-based augmentations to preserve pathological information
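The stated augmentation policy translates into a few lines of torchvision; note that only geometric transforms are applied, matching the exclusion of intensity-based augmentations above.

import random
import torchvision.transforms.functional as TF

def augment(patch):
    """Apply stain-preserving geometric augmentation to a PIL image or tensor."""
    patch = TF.rotate(patch, angle=random.choice([0, 90, 180, 270]))
    if random.random() < 0.5:
        patch = TF.hflip(patch)
    if random.random() < 0.5:
        patch = TF.vflip(patch)
    return patch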

Foundation Model Feature Extraction Protocol

Feature extraction from foundation models transforms high-dimensional image patches into compact, semantically rich representations:

  • Model Selection and Configuration:

    • Select appropriate foundation model based on task requirements (refer to Table 1)
    • Utilize pretrained weights without fine-tuning for feature extraction
    • Extract features from penultimate layer before classification heads
  • Feature Extraction Implementation:

    • Process patches in batches of 64-128 depending on GPU memory
    • Generate feature vectors of dimension 512-1024 (varies by foundation model)
    • Aggregate features in HDF5 format with WSI-level indexing
    • Implement quality checks to ensure feature consistency across patches
  • Feature Bank Creation:

    • Store features with associated metadata: WSI identifier, patch coordinates, extraction parameters
    • Apply principal component analysis (PCA) for optional dimensionality reduction
    • Create database indices for efficient retrieval during MIL training
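A sketch of such a feature bank using h5py follows; the dataset names and attribute keys are illustrative conventions rather than a fixed schema, and the arrays are random placeholders.

import h5py
import numpy as np

features = np.random.rand(8000, 1024).astype("float32")  # (N, D) patch embeddings
coords = np.random.randint(0, 40000, size=(8000, 2))     # per-patch (x, y) positions

with h5py.File("SLIDE-0001.h5", "w") as f:
    f.create_dataset("features", data=features, compression="gzip")
    f.create_dataset("coords", data=coords)
    f.attrs["wsi_id"] = "SLIDE-0001"  # metadata for later retrieval
    f.attrs["magnification"] = "20x"
    f.attrs["patch_size"] = 256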

ABMIL Implementation with Foundation Model Features

The integration of foundation model features with ABMIL requires specific architectural considerations and training strategies:

Diagram: ABMIL-foundation model integration. Whole Slide Image (40,000×40,000 pixels) → patch extraction (256×256 pixels) → foundation model feature extraction → feature matrix (N×D) → ABMIL architecture with attention mechanism → slide-level prediction (classification/prognosis), with an attention heatmap for interpretable visualization.

ABMIL-Foundation Model Integration Workflow

  • ABMIL Architecture Configuration:

    • Implement attention mechanism with 128-256 hidden units
    • Initialize attention branch with He initialization
    • Apply L2 regularization (λ=0.0001) to prevent overfitting
    • Use dropout (rate=0.3) between fully connected layers
  • Training Protocol (a configuration sketch follows this list):

    • Employ Adam optimizer with learning rate 1e-4, β₁=0.9, β₂=0.999
    • Implement learning rate reduction on plateau (factor=0.5, patience=5 epochs)
    • Use cross-entropy loss for classification, Cox loss for survival analysis
    • Train for 100-200 epochs with early stopping (patience=15 epochs)
  • Validation and Interpretation:

    • Monitor attention entropy to ensure diverse instance utilization
    • Generate attention heatmaps by projecting attention scores to spatial coordinates
    • Validate heatmap relevance with pathologist review of high-attention regions
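The training protocol above maps onto the following PyTorch configuration sketch; model, train_loader, and compute_val_loss are assumed to exist (e.g., the ABMIL module and a per-bag data loader), and the loop is simplified for illustration.

import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)

best_val, patience, bad_epochs = float("inf"), 15, 0
for epoch in range(200):
    model.train()
    for bag, label in train_loader:     # bag: (N, D) features; label: (1,) class id
        optimizer.zero_grad()
        logits, _ = model(bag)
        loss = F.cross_entropy(logits.unsqueeze(0), label)
        loss.backward()
        optimizer.step()
    val_loss = compute_val_loss(model)  # assumed validation helper
    scheduler.step(val_loss)            # reduce LR on plateau
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:      # early stopping after 15 flat epochs
            break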

Advanced Implementation: AttriMIL Framework

The recently proposed Attribute-Driven MIL (AttriMIL) framework addresses limitations in standard ABMIL by explicitly modeling instance attributes:

  • Attribute Scoring Mechanism:

    • Quantifies each instance's contribution to bag prediction
    • Enables spatial and semantic constraints for improved discrimination
    • Incorporates histopathology-adaptive backbones for optimized feature extraction
  • Implementation Protocol:

    • Integrate attribute scoring layer after feature extraction
    • Apply spatial attribute constraint to model instance correlations within slides
    • Implement attribute ranking constraint to model semantic similarity across slides
    • Balance constraint losses with classification loss using adaptive weighting (α=0.7, β=0.3), as sketched after this list
  • Performance Benefits:

    • Demonstrates 3-5% improvement over standard ABMIL across multiple benchmarks
    • Enhanced ability to identify challenging instances and rare morphologies
    • Improved robustness to tissue heterogeneity and staining variations
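The loss balancing in the implementation protocol reduces to a weighted sum; the sketch below shows only the combination, with the spatial and ranking terms left as stand-ins for AttriMIL's constraint losses.

import torch

def attrimil_loss(cls_loss: torch.Tensor,
                  spatial_loss: torch.Tensor,
                  ranking_loss: torch.Tensor,
                  alpha: float = 0.7, beta: float = 0.3) -> torch.Tensor:
    # Classification loss plus weighted spatial and semantic-ranking constraints.
    return cls_loss + alpha * spatial_loss + beta * ranking_loss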

Successful implementation of ABMIL with foundation models requires specific computational resources and software components:

Table 3: Essential Research Reagents and Computational Resources for ABMIL-Foundation Model Implementation

Category | Specific Resource | Key Specifications | Function in Workflow
Foundation Models | CONCH [48] | Vision-language model, 500M parameters | Primary feature extraction for diverse pathology tasks
Foundation Models | Virchow2 [48] | Transformer architecture, 300M parameters | Specialized morphological feature extraction
Foundation Models | CHIEF [49] | Dual pretraining (unsupervised + weakly supervised) | Domain generalization across populations and protocols
Software Frameworks | PyTorch 1.9+ [50] | CUDA 11.1+, Python 3.8+ | Deep learning framework for ABMIL implementation
Software Frameworks | OpenSlide | Whole-slide image processing library | WSI reading, patch extraction, and normalization
Software Frameworks | HiMIL | Specialized MIL library | Reference ABMIL implementations and extensions
Computational Infrastructure | GPU Cluster | NVIDIA A100 (40GB+ VRAM) | Foundation model inference and ABMIL training
Computational Infrastructure | Storage System | High-speed SSD, 10TB+ capacity | WSI storage and feature bank management
Computational Infrastructure | Data Management | HDF5 + SQLite | Efficient feature storage and retrieval

Advanced Applications and Future Directions

Multimodal Integration with Clinical Data

The fusion of foundation model features with clinical data represents the cutting edge of computational pathology:

  • MPath-Net Framework:

    • Integrates WSI features from ABMIL with pathology reports using Sentence-BERT
    • Achieves 94.65% accuracy on TCGA cancer subtype classification
    • Provides joint interpretability through attention heatmaps and text embeddings
  • Implementation Protocol:

    • Encode clinical text using specialized clinical language models (ClinicalBERT)
    • Apply feature-level fusion by concatenating image and text embeddings
    • Implement cross-modal attention to model interactions between modalities
    • Fine-tune fusion layers with lower learning rate (5e-5) than base encoders
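A minimal feature-level fusion sketch is shown below: the slide embedding from ABMIL is concatenated with a text embedding (e.g., from a sentence encoder) and passed to a small classifier. Dimensions are illustrative, and the cross-modal attention stage described above is omitted for brevity.

import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    def __init__(self, img_dim=1024, txt_dim=768, n_classes=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 512), nn.ReLU(), nn.Linear(512, n_classes)
        )

    def forward(self, slide_emb, text_emb):  # (B, img_dim), (B, txt_dim)
        fused = torch.cat([slide_emb, text_emb], dim=-1)  # feature-level fusion
        return self.fc(fused)

logits = LateFusionHead()(torch.randn(4, 1024), torch.randn(4, 768))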

Survival Analysis with AdvMIL Framework

Adversarial Multiple Instance Learning (AdvMIL) extends ABMIL for survival analysis:

  • Framework Overview:

    • Integrates conditional GAN with MIL for survival distribution estimation
    • Enables semi-supervised learning with unlabeled WSIs
    • Provides robust time-to-event predictions with uncertainty quantification
  • Implementation Details:

    • Replace standard survival loss with adversarial objective function
    • Implement generator with MIL encoder and discriminator with fusion network
    • Apply k-fold training strategy for semi-supervised learning efficiency
    • Regularize with patch occlusion and noise injection for improved robustness

The integration of foundation models with ABMIL represents a significant advancement in weakly supervised computational pathology. Based on current benchmarking results and implementation experiences, we recommend:

  • Model Selection Strategy: Begin with CONCH for general applications or Virchow2 for morphology-focused tasks, considering computational constraints and performance requirements.

  • Data Diversity Priority: Prioritize diverse data collection across populations, preparation methods, and scanner types over simple data quantity increases.

  • Iterative Implementation: Start with standard ABMIL implementation, then advance to AttriMIL for challenging discrimination tasks, and finally incorporate multimodal integration for maximum performance.

  • Validation Framework: Implement comprehensive validation including quantitative metrics, attention visualization, and pathologist correlation studies to ensure clinical relevance.

This protocol provides researchers with a comprehensive framework for implementing ABMIL with foundation model features, enabling robust and interpretable whole-slide image analysis while leveraging the power of modern foundation models pretrained on extensive pathology datasets.

The analysis of gigapixel Whole Slide Images (WSIs) in computational pathology presents a unique challenge for deep learning. Fully supervised methods require manual annotation of vast tissue regions, which is prohibitively time-consuming and expertise-intensive. Clustering-constrained-attention multiple-instance learning (CLAM) is a data-efficient deep-learning method that addresses this bottleneck by requiring only slide-level labels for training, eliminating the need for pixel-level or region-of-interest annotations [11] [51]. This weakly supervised approach reformulates WSI classification as a multiple-instance learning (MIL) problem, where each slide is treated as a "bag" containing thousands of unlabeled tissue patches or "instances" [11]. CLAM distinguishes itself through its interpretable architecture and data efficiency, achieving high performance with fewer training slides compared to standard weakly supervised methods. It has been successfully applied to diverse tasks including tumor subtyping in renal cell and non-small-cell lung carcinomas, and detection of lymph node metastasis, demonstrating adaptability to independent test cohorts and even smartphone microscopy images [11].

Core Methodological Framework of CLAM

The CLAM pipeline operates through a sequential workflow that transforms raw WSIs into slide-level predictions and interpretable heatmaps. The key stages are detailed below [11] [52]:

  • WSI Segmentation and Patching: The tissue region on a gigapixel WSI is automatically segmented to exclude background and artifacts. The segmented tissue is then divided into a vast collection of smaller, manageable image patches (e.g., 256x256 pixels) [52].
  • Feature Extraction: Each image patch is processed by a pre-trained Convolutional Neural Network (CNN) encoder. This step converts high-dimensional pixel data into a low-dimensional feature vector for each patch, creating a set of feature embeddings that represent the entire slide [11].
  • Attention-based Multiple Instance Learning with Clustering: This is the core of CLAM. An attention network examines all patch-level features in a slide and assigns an attention score to each patch, ranking them by their perceived diagnostic value.

    • For a multi-class classification problem (e.g., cancer subtyping), CLAM employs parallel attention branches—one for each class. Each branch computes a unique slide-level representation as a weighted average of all patch features, with weights determined by its class-specific attention scores [11].
    • These class-specific slide-level representations are then fed into a classification layer to produce the final slide-level prediction [11].
    • A clustering constraint is applied as an auxiliary learning objective. The network is encouraged to separate the most-attended and least-attended patches of each class into distinct clusters in the feature space, thereby refining the feature representation and increasing supervisory signals without manual labels [11].
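The parallel-branch design can be sketched as one attention network and one classifier per class; the module below is an illustrative simplification of the multi-branch idea, not the reference CLAM implementation.

import torch
import torch.nn as nn

class MultiBranchAttention(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256, n_classes=3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            for _ in range(n_classes)
        )
        self.classifiers = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(n_classes))

    def forward(self, bag):                           # bag: (N, feat_dim)
        logits = []
        for attn, clf in zip(self.branches, self.classifiers):
            w = torch.softmax(attn(bag), dim=0)       # class-specific attention weights
            logits.append(clf((w * bag).sum(dim=0)))  # class-specific slide vector
        return torch.cat(logits)                      # (n_classes,) class scores

scores = MultiBranchAttention()(torch.randn(3000, 1024))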

Table 1: Key Stages of the CLAM Workflow

Stage | Key Input | Key Output | Primary Function
1. Segmentation & Patching | Raw WSI (.svs, .tiff) | Coordinates of tissue patches | Segments tissue and prepares patches for feature extraction [52].
2. Feature Extraction | Image Patches | Low-dimensional feature embeddings | Converts image pixels into meaningful numerical representations using a pre-trained encoder [11].
3. CLAM Model (Attention & Clustering) | Feature embeddings | Slide-level prediction & attention scores | Identifies diagnostically relevant regions and performs classification [11].
4. Heatmap Visualization | Attention scores & Patch coordinates | Interpretable heatmap overlaid on WSI | Allows visualization of model decisions for clinical validation [11] [52].

Visual Workflow Diagram

The following diagram illustrates the end-to-end CLAM pipeline, from WSI processing to final classification and visualization.

Diagram: End-to-end CLAM pipeline. (1) Segmentation & patching: tissue segmentation followed by patch extraction from the WSI. (2) Feature extraction: a pre-trained CNN encoder converts patches into feature embeddings. (3) CLAM model core: class-specific attention branches (Class A, Class B, ...) build slide representations under a clustering constraint and feed a shared classifier that outputs the slide-level prediction. (4) Interpretation & visualization: attention scores are projected back onto patch locations to render an attention heatmap.

Experimental Protocols and Validation

Protocol: Implementing a CLAM Workflow for Cancer Subtyping

This protocol outlines the steps to train and validate a CLAM model for a representative computational pathology task: histological subtyping of renal cell carcinoma (RCC).

Table 2: Protocol for CLAM-based Cancer Subtyping

Step | Action | Key Parameters & Notes
1. Data Curation | Collect WSIs with slide-level labels (e.g., PRCC, CRCC, CCRCC). Split data into training, validation, and test sets. | Ensure patient-level splits to prevent data leakage. A few hundred slides can suffice due to data efficiency [11].
2. WSI Segmentation & Patching | Run create_patches_fp.py script. | --patch_size 256 --seg --patch --stitch. Use --preset for scanner-specific parameters [52].
3. Feature Extraction | Run extract_features_fp.py script. | Use a pre-trained encoder (e.g., ResNet50 on ImageNet, or a foundation model like CONCH/UNI [52] [53]). Output is an .h5 file per WSI.
4. Model Training | Execute CLAM training script (e.g., main.py --task task_1_tumor_subtyping). | Specify --model_type clam_mb for multi-class. Tune learning rate, dropout, and number of epochs. The clustering loss weight is a critical hyperparameter [11].
5. Model Evaluation | Assess on held-out test set. | Use metrics: Area Under the ROC Curve (AUC), accuracy, and confusion matrix. Generate slide-level predictions and attention scores [11] [53].
6. Visualization & Interpretation | Run create_heatmaps.py script. | Overlays attention scores onto original WSIs. Validate localized regions with a pathologist to confirm morphological relevance [11] [52].

Performance Benchmarking and Quantitative Validation

CLAM has been rigorously validated against standard weakly supervised methods and in some cases, fully supervised approaches. Its performance is marked by high accuracy and data efficiency.

Table 3: Quantitative Performance of CLAM on Representative Tasks

Task / Dataset | Key Metric | CLAM Performance | Comparative Performance
Renal Cell Carcinoma (RCC) Subtyping (3-class: PRCC, CRCC, CCRCC) | Test AUC (Macro-average) | 0.994 ± 0.0013 [53] | Outperforms standard weakly supervised classification algorithms [11].
Non-Small-Cell Lung Cancer (NSCLC) Subtyping | Test AUC | > 0.98 for Lung Adenocarcinoma vs. Squamous Cell Carcinoma [11] | Demonstrates adaptability to independent test cohorts [11].
Breast Cancer (BRCA) Subtyping (IDC vs. ILC) | Test AUC | 0.966 ± 0.018 [53] | Achieves high accuracy using only slide-level labels [11] [53].
Lymph Node Metastasis Detection (Camelyon16) | AUC | > 0.99 [11] | Comparable to state-of-the-art methods trained with extensive data or strong supervision [11].

The framework's data efficiency is a critical advantage. Studies show CLAM achieves high performance while systematically using fewer training slides, making it particularly valuable for rare diseases where large datasets are unavailable [11]. Furthermore, models trained on resection specimens can directly generalize to biopsies and smartphone photomicrographs, underscoring its robustness and adaptability [11].

Evolution Beyond CLAM: Integrated and Foundation Models

The core principles of CLAM have inspired and been integrated into several advanced frameworks and foundation models, expanding the capabilities of weakly supervised computational pathology.

MS-CLAM: Enhancing Localization with Mixed Supervision

MS-CLAM extends CLAM by incorporating mixed supervision, which uses a limited amount of patch-level labels alongside abundant slide-level labels [54]. This approach addresses the challenge that purely weakly supervised models can sometimes produce suboptimal localization. Key enhancements include [54]:

  • Focused Annotation: Pathologists annotate only a small subset of slides (e.g., 12-62% of the dataset) at the patch-level.
  • Attention Regularization: An additional loss term explicitly regularizes the attention scores based on the available patch-level labels, guiding the model to more accurately localize key instances.
  • Performance: With only a fraction of strongly annotated data, MS-CLAM can achieve localization and classification performance close to that of fully supervised models [54].

Synergy with Foundation Models

Recent foundation models pre-trained on massive, diverse histopathology datasets provide powerful feature representations that can significantly boost the performance of MIL frameworks like CLAM.

  • Foundation Models as Feature Extractors: Models like UNI, CONCH, TITAN, and BEPH are trained on hundreds of thousands to millions of histopathology images in a self-supervised manner [52] [10] [53]. They learn rich, general-purpose representations of histopathological morphology.
  • Enhanced CLAM Pipelines: Replacing the standard pre-trained encoder (e.g., ImageNet-trained ResNet) in the CLAM feature extraction stage with a foundation model like CONCH leads to more discriminative patch features. This directly translates to improved performance in downstream weakly supervised classification tasks [52] [53].
  • Whole-Slide Foundation Models: Models like TITAN represent a further evolution by moving beyond patch-level analysis. TITAN is a Vision Transformer trained to encode entire WSIs into a single, general-purpose slide-level representation using self-supervised learning and vision-language alignment with pathology reports [10]. This slide-level embedding can itself be used for tasks like classification, prognosis, and cross-modal retrieval, offering a powerful alternative to MIL-based aggregation [10].

The following diagram illustrates this evolving architectural landscape, from CLAM to integrated foundation models.

Diagram: Evolution of the architectural landscape. Standard CLAM extends to MS-CLAM (mixed supervision) and, with foundation model features, to an integrated pipeline; patch-based foundation models (e.g., UNI, CONCH) feed this integrated pipeline, while whole-slide foundation models (e.g., TITAN) mark a point of future convergence.

Table 4: Comparison of CLAM and Key Foundation Models

| Model | Primary Innovation | Supervision Level | Key Advantage |
|---|---|---|---|
| CLAM [11] | Attention-based MIL with clustering constraint | Weak (slide-level) | Data efficiency and high interpretability |
| MS-CLAM [54] | Incorporates limited patch-level labels | Mixed (slide + some patch) | Improved localization accuracy with minimal annotation cost |
| UNI/CONCH [52] [10] | Self-supervised learning on large-scale histopathology patches | Self-supervised | Provides powerful, domain-specific feature representations for downstream tasks |
| TITAN [10] | Multimodal whole-slide foundation model | Self-supervised + language | Generates general-purpose slide embeddings; enables zero-shot tasks and report generation |
| BEPH [53] | Masked Image Modeling (MIM) pre-training on TCGA | Self-supervised | Strong generalization across multiple cancer types and tasks (patch, WSI, survival) |

The Scientist's Toolkit: Essential Research Reagents

Successful implementation of the frameworks discussed requires a suite of computational tools and data resources.

Table 5: Essential Research Reagents and Resources

| Category | Item | Specification / Example | Function in Workflow |
|---|---|---|---|
| Software & Libraries | CLAM Python Package | GitHub: mahmoodlab/CLAM [52] | Core framework for weakly supervised WSI classification |
| | Foundation Model Encoders | UNI, CONCH, CTransPath [52] [53] | Pre-trained models for extracting powerful patch-level features |
| | Whole-Slide Foundation Models | TITAN [10] | Provides general-purpose slide-level embeddings for diverse tasks |
| Computational Infrastructure | GPU Clusters | NVIDIA GPUs (e.g., A100, V100) | Accelerates model training and inference on large WSI datasets |
| Data Resources | Public WSI Repositories | The Cancer Genome Atlas (TCGA) | Source of large-scale, multi-organ WSI data for training and validation [55] [53] |
| | Task-Specific Datasets | Camelyon16, RCC & NSCLC datasets from BWH/TCGA [11] | Benchmark datasets for model development and comparative performance assessment |

The application of artificial intelligence (AI) in computational pathology represents a paradigm shift in cancer diagnostics and research. Foundation models, pre-trained on large-scale histopathology datasets using self-supervised learning (SSL), have emerged as powerful tools for analyzing Whole Slide Images (WSIs) without requiring extensive manual annotations [31]. This capability is particularly valuable for predicting biomarkers and cancer subtypes, tasks essential for personalized medicine but often limited by data scarcity and annotation costs in traditional supervised approaches.

This application note details a structured framework for leveraging foundation models in weakly supervised learning scenarios. We present benchmark performance data for leading models, provide detailed experimental protocols, and outline essential computational tools, creating a comprehensive resource for researchers and drug development professionals working to translate computational pathology into clinical impact.

Benchmarking Foundation Models for WSI Analysis

Independent benchmarking on clinically relevant tasks is crucial for selecting the appropriate foundation model. A comprehensive evaluation of 19 histopathology foundation models across 31 tasks provides critical performance data for model selection [31].

The table below summarizes the performance of top-performing foundation models across different task categories, measured by Average Area Under the Receiver Operating Characteristic Curve (AUROC) [31].

Table 1: Benchmark Performance of Leading Pathology Foundation Models

| Foundation Model | Model Type | Morphology Tasks (Avg. AUROC) | Biomarker Tasks (Avg. AUROC) | Prognostication Tasks (Avg. AUROC) | Overall Average (Avg. AUROC) |
|---|---|---|---|---|---|
| CONCH | Vision-Language | 0.77 | 0.73 | 0.63 | 0.71 |
| Virchow2 | Vision-Only | 0.76 | 0.73 | 0.61 | 0.71 |
| Prov-GigaPath | Vision-Only | - | 0.72 | - | 0.69 |
| DinoSSLPath | Vision-Only | 0.76 | - | - | 0.69 |

Performance in Low-Data Scenarios

A key advantage of foundation models is their applicability in data-scarce settings common for rare molecular events. Performance remains relatively stable even when downstream models are trained on cohorts as small as 75 patients, with Virchow2 and PRISM showing particular robustness in these scenarios [31].

Experimental Protocol for Weakly Supervised WSI Classification

This protocol outlines the complete workflow for training a weakly supervised classifier to predict biomarkers or cancer subtypes from WSIs using a pre-trained foundation model.

WSI Preprocessing and Tiling

Purpose: Prepare high-resolution WSIs for feature extraction by dividing them into smaller, manageable patches while excluding uninformative tissue regions.

Procedure:

  • Format Conversion: Ensure WSIs are in a compatible digital format (e.g., .svs, .ndpi).
  • Quality Control: Manually review or use an automated quality control model (e.g., PathOrchestra, which achieved >0.95 accuracy in tasks like blur, bubble, and wrinkle detection) to flag and exclude slides with significant artifacts [56].
  • Tissue Detection: Apply a tissue segmentation algorithm (e.g., Otsu's thresholding) to identify regions of interest and exclude background.
  • Tiling: Tessellate the WSI into non-overlapping 256×256 pixel patches at 20× magnification [31] [56] (a minimal code sketch follows this list).
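The tiling step can be expressed compactly in code. The sketch below is a minimal illustration assuming OpenSlide and OpenCV are available; the `tile_wsi` helper, thumbnail size, and tissue-fraction threshold are illustrative choices rather than part of the cited protocols.

```python
# Minimal tiling sketch (assumes OpenSlide and OpenCV are installed).
# tile_wsi, the thumbnail size, and the tissue-fraction threshold are
# illustrative, not prescribed by the cited protocols.
import numpy as np
import cv2
import openslide

def tile_wsi(path, patch_size=256, level=0):
    slide = openslide.OpenSlide(path)
    # Low-resolution thumbnail for tissue detection via Otsu's thresholding;
    # tissue is darker than background, so the inverted mask marks tissue.
    thumb = np.array(slide.get_thumbnail((1024, 1024)).convert("L"))
    _, mask = cv2.threshold(thumb, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    w, h = slide.dimensions
    sx, sy = mask.shape[1] / w, mask.shape[0] / h
    for x in range(0, w - patch_size + 1, patch_size):
        for y in range(0, h - patch_size + 1, patch_size):
            # Keep the patch only if its thumbnail footprint is mostly tissue.
            sub = mask[int(y * sy):int((y + patch_size) * sy) + 1,
                       int(x * sx):int((x + patch_size) * sx) + 1]
            if sub.size and sub.mean() > 127:
                yield slide.read_region((x, y), level,
                                        (patch_size, patch_size)).convert("RGB")
```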

Feature Extraction with Foundation Models

Purpose: Convert image patches into a numerical feature representation using a pre-trained foundation model.

Procedure:

  • Model Selection: Choose a foundation model based on benchmark performance (see Table 1). CONCH or Virchow2 are recommended for general use [31].
  • Feature Extraction:
    • Without modifying the foundation model's weights, pass each image patch through the model.
    • Extract the output feature vector (embedding) from the penultimate layer of the model for each patch.
    • This results in a set of feature vectors {x_1, x_2, ..., x_N} for a slide with N patches (see the sketch after this list).
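A minimal sketch of this frozen-encoder step is shown below, assuming a PyTorch backbone; the `encoder` object stands in for whichever pre-trained model is used (e.g., CONCH or Virchow2 weights obtained from their original releases), and the batch shape is an assumption.

```python
# Hedged sketch of frozen feature extraction; `encoder` stands in for any
# PyTorch foundation-model backbone loaded from its original release.
import torch

@torch.no_grad()  # weights stay frozen: no gradients, no updates
def extract_features(encoder, patch_batches, device="cuda"):
    encoder.eval().to(device)
    # patch_batches yields tensors of shape (B, 3, 256, 256)
    feats = [encoder(b.to(device)).cpu() for b in patch_batches]
    return torch.cat(feats)  # (N, D) matrix: one embedding per patch
```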

Weakly Supervised Aggregation and Classification

Purpose: Aggregate patch-level features into a single slide-level representation and train a classifier using only slide-level labels.

Procedure:

  • Feature Aggregation: Utilize an attention-based multiple instance learning (ABMIL) architecture or a transformer-based aggregator to combine patch-level features into a slide-level embedding [31]. A minimal ABMIL sketch follows this list.
  • Classifier Training:
    • Input the slide-level embeddings into a final classification layer (e.g., a fully connected network with a softmax/sigmoid output).
    • Use a slide-level label (e.g., "BRAF mutant" vs. "BRAF wild-type") as the ground truth for training.
    • Employ a standard cross-entropy loss and backpropagate through the entire network (aggregator and classifier) to fine-tune the model for the specific task.
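The ABMIL sketch below follows the attention-pooling idea (a learned weighting over patch embeddings); the dimensions, gating-free attention, and hyperparameters are illustrative assumptions rather than the exact CLAM or benchmark configuration.

```python
# Minimal attention-based MIL (ABMIL) sketch; dimensions and the ungated
# attention are illustrative simplifications.
import torch
import torch.nn as nn

class ABMIL(nn.Module):
    def __init__(self, in_dim=512, hidden=256, n_classes=2):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.head = nn.Linear(in_dim, n_classes)

    def forward(self, x):                          # x: (N, D) patch features
        a = torch.softmax(self.attn(x), dim=0)     # (N, 1) attention weights
        slide_emb = (a * x).sum(dim=0)             # (D,) slide-level embedding
        return self.head(slide_emb), a

model = ABMIL()
logits, attn = model(torch.randn(1000, 512))       # toy slide with 1,000 patches
loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))
loss.backward()                                    # trains aggregator + classifier only
```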

Model Validation

Purpose: Rigorously assess model performance on held-out data to ensure generalizability.

Procedure:

  • Data Splitting: Partition the dataset into training, validation, and test sets at the patient level to prevent data leakage (see the splitting-and-metrics sketch after this list).
  • Performance Metrics: Evaluate the model on the independent test set using AUROC, Average Precision (AUPRC), Balanced Accuracy, and F1-score, with particular attention to AUPRC for imbalanced datasets [31].
  • External Validation: For the strongest evidence of clinical utility, validate the final model on an external cohort from a different institution that was not used in any part of the training process [31].
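The sketch below illustrates patient-level splitting and the headline metrics with scikit-learn; the `slide_df` frame and its columns are hypothetical stand-ins for real slide-level results.

```python
# Patient-level splitting + metrics sketch; slide_df is a hypothetical toy frame.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics import roc_auc_score, average_precision_score

slide_df = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 4, 5, 6, 7],
    "label":      [1, 1, 0, 1, 0, 0, 1, 0],
    "score":      [0.9, 0.8, 0.2, 0.7, 0.4, 0.1, 0.6, 0.3],
})

# Group by patient so no patient appears in both partitions.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(slide_df, groups=slide_df["patient_id"]))
print("held-out patients:", slide_df.iloc[test_idx]["patient_id"].unique())

# On real data, compute these on the held-out test partition only.
print("AUROC:", roc_auc_score(slide_df["label"], slide_df["score"]))
print("AUPRC:", average_precision_score(slide_df["label"], slide_df["score"]))
```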

The following diagram illustrates the complete experimental workflow.

[Diagram: WSI → Preprocessing & Tiling → Image Patches → Feature Extraction (Foundation Model) → Patch Feature Vectors → Weakly Supervised Aggregation (e.g., ABMIL), guided by the Slide-Level Label → Slide-Level Embedding → Classifier → Prediction (Biomarker / Subtype).]

The Scientist's Toolkit: Key Research Reagents and Materials

The following table details essential computational tools and resources required to implement the described protocols.

Table 2: Essential Research Reagents and Computational Tools

| Item Name | Function / Purpose | Specifications / Examples |
|---|---|---|
| Pre-trained Foundation Model | Provides high-quality feature representations from image patches without task-specific training | CONCH (vision-language), Virchow2 (vision-only), PathOrchestra (vision). Model weights are typically available from original publications [31] [56] |
| Whole Slide Image (WSI) Dataset | Serves as the primary input data for training and evaluation; requires slide-level labels | Datasets should be large-scale and ideally multi-institutional. Examples: TCGA, in-house clinical cohorts. Size: 6,818 patients / 9,528 slides used in [31] |
| Digital Slide Storage | Secure and efficient storage for large WSI files | Institutional PACS (Picture Archiving and Communication System) or research data lakes capable of storing terabytes of image data |
| High-Performance Computing (HPC) | Provides the computational power for feature extraction and model training | GPU clusters with modern GPUs (e.g., NVIDIA A100/V100), sufficient RAM (>64GB), and multi-core CPUs |
| WSI Processing Library | Software for reading WSIs and performing tiling/preprocessing | OpenSlide, OpenSlide Python, CuCIM |
| Deep Learning Framework | Environment for implementing and training the aggregation and classification models | PyTorch or TensorFlow, with libraries like MONAI or TIAToolbox for computational pathology-specific functions |

Foundation models like CONCH and Virchow2, when applied within a weakly supervised learning framework, establish a new state-of-the-art for predicting critical biomarkers and cancer subtypes directly from WSIs. The experimental protocols and benchmarking data provided herein offer a robust foundation for researchers to build upon, accelerating the development of more precise, data-driven diagnostic tools in oncology. Future work should focus on the integration of multimodal data and the validation of these models in prospective clinical settings to fully realize their potential in personalized cancer care.

Overcoming Challenges: Strategies for Data-Efficient and Robust Model Performance

Foundation models (FMs), pre-trained on extensive datasets using self-supervised learning, are revolutionizing the analysis of whole slide images (WSIs) in computational pathology. Their ability to learn general-purpose visual features from largely unlabeled data directly addresses the critical challenge of data scarcity, which traditionally hinders the development of deep learning tools for clinical and research applications. This protocol details the application of FMs for weakly supervised WSI classification, with a specific focus on their performance in low-data and low-prevalence task settings. We provide a benchmarking analysis of contemporary FMs, elaborate on experimental methodologies for leveraging these models with minimal annotations, and outline essential computational tools. The guidance is intended to enable researchers and drug development professionals to implement these approaches effectively, accelerating biomarker discovery and diagnostic model development while minimizing dependency on large, expensively annotated datasets.

The adoption of digital pathology has generated a wealth of whole slide images (WSIs), which are multi-gigapixel digital representations of tissue samples [57]. However, leveraging this data for supervised deep learning is notoriously challenging due to the prohibitively high cost and expertise required for detailed, patch-level annotations [57] [58]. Weakly supervised learning, particularly using only slide-level labels, presents a viable path forward.

Foundation Models (FMs) are models pre-trained on broad data using self-supervision at scale, which can be adapted to a wide range of downstream tasks [59]. In computational pathology, FMs pre-trained on millions of histopathology patches learn powerful, generalizable representations of tissue morphology. These representations can be effectively utilized for downstream classification tasks with very few task-specific labels, thus directly tackling the problem of data scarcity [31] [59]. This document provides application notes and protocols for evaluating and deploying FMs in low-data regimes for WSI analysis.

Benchmarking Foundation Model Performance

Independent benchmarking studies are crucial for selecting the appropriate FM for a specific task. A large-scale evaluation of 19 histopathology foundation models across 31 clinical tasks provides key insights into their performance in data-scarce environments [31].

Table 1: Top-Performing Foundation Models Across Task Types (Mean AUROC)

| Model | Model Type | Overall (31 tasks) | Morphology (5 tasks) | Biomarkers (19 tasks) | Prognostication (7 tasks) |
|---|---|---|---|---|---|
| CONCH | Vision-Language | 0.71 | 0.77 | 0.73 | 0.63 |
| Virchow2 | Vision-Only | 0.71 | 0.76 | 0.73 | 0.61 |
| Prov-GigaPath | Vision-Only | 0.69 | 0.73 | 0.72 | 0.60 |
| DinoSSLPath | Vision-Only | 0.69 | 0.76 | 0.69 | 0.60 |

Table 2: FM Performance with Varying Downstream Training Data [31]

| Sampled Training Cohort Size | Best Performing Model(s) (Number of Tasks Led) | Performance Trend |
|---|---|---|
| 300 Patients | Virchow2 (8 tasks), PRISM (7 tasks) | Virchow2 and PRISM demonstrate dominance with more data |
| 150 Patients | PRISM (9 tasks), Virchow2 (6 tasks) | PRISM shows strong performance with medium data volume |
| 75 Patients | CONCH (5 tasks), PRISM (4 tasks), Virchow2 (4 tasks) | Performance is more balanced, with CONCH leading in more tasks |

Key Observations [31]:

  • Data Diversity Outweighs Volume: The superior performance of CONCH, trained on 1.17 million image-caption pairs, over BiomedCLIP, trained on 15 million, suggests that the diversity and quality of pre-training data are more critical than sheer volume.
  • Robustness in Low-Data Settings: Performance metrics remained relatively stable when the downstream training data was reduced from 150 to 75 patients, indicating that FMs can be effectively fine-tuned with very small labeled datasets.
  • Complementary Strengths: Models trained on distinct cohorts learn complementary features. Ensembling models like CONCH and Virchow2 can outperform individual models, leveraging their diverse strengths [31].

Experimental Protocols for Low-Data WSI Classification

This section outlines a standardized protocol for using FMs in a weakly supervised manner for WSI classification with minimal labeled data.

Protocol 1: Weakly Supervised Patch Embedding Aggregation

This is the most common method for applying FMs to WSIs, requiring only slide-level labels [31] [60].

Workflow Description: The protocol begins with a Whole Slide Image (WSI) as input. The first step is patch extraction and feature embedding using a pre-trained foundation model, which converts the WSI into a set of patch-level feature vectors. These patch embeddings are then aggregated, a step that can be performed with a Transformer encoder or an attention-based multiple instance learning (ABMIL) model. The aggregated features feed the final slide-level classification, which produces a prediction. The weak supervision from the slide-level label is applied to this final output to train the aggregation and classification components.

[Diagram: WSI → 1. Patch Extraction & Feature Embedding → set of feature vectors → 2. Patch Embedding Aggregation → aggregated features → 3. Slide-Level Classification → Prediction, with the slide-level label providing weak supervision at the classification stage.]

Materials and Reagents:

  • Hardware: Server with high-end GPU (e.g., NVIDIA A100 or RTX 4090) and substantial RAM (≥64GB) for handling large WSIs.
  • Software: Python (v3.9+), PyTorch or TensorFlow, and libraries like OpenSlide or CuCIM for WSI handling.
  • Foundation Model: Pre-trained weights for an FM like CONCH or Virchow2 [31].

Step-by-Step Procedure:

  • WSI Preprocessing:
    • Load the WSI using a library like OpenSlide.
    • Segment the tissue from the background using a thresholding method like Otsu's or a hysteresis filter [57].
    • Extract square patches (e.g., 256x256 pixels) from the tissue regions at a desired magnification level (e.g., 20x).
  • Feature Embedding Extraction:

    • Load a pre-trained FM. Keep the model frozen (i.e., do not update its parameters during this step).
    • Pass each extracted patch through the FM to obtain a feature vector (embedding).
    • This results in hundreds to thousands of feature vectors per WSI.
  • Weakly Supervised Aggregation and Classification:

    • This is the trainable part of the pipeline. Use the slide-level label as the ground truth.
    • Aggregation Method: Use an aggregation model like a Transformer encoder [31] or an Attention-Based Multiple Instance Learning (ABMIL) model to combine the patch embeddings into a single slide-level representation (a transformer-based sketch follows this list).
    • Classifier: Train a classifier (e.g., a fully connected layer) on the aggregated slide-level representation to predict the slide-level label.
    • This entire aggregation and classification module can be trained with as few as 75-300 patient samples [31].
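As an alternative to ABMIL, a transformer encoder can perform the aggregation. The sketch below treats patch embeddings as a token sequence with a learnable class token; the depth, width, and head count are illustrative assumptions, not a published configuration.

```python
# Hedged sketch of a transformer-based MIL aggregator with a learnable [CLS] token.
import torch
import torch.nn as nn

class TransformerMIL(nn.Module):
    def __init__(self, dim=512, n_classes=2, depth=2, heads=8):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                                 # x: (1, N, D) embeddings
        tokens = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1)
        return self.head(self.encoder(tokens)[:, 0])      # classify from [CLS]

logits = TransformerMIL()(torch.randn(1, 1000, 512))      # toy slide, 1,000 patches
```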

Protocol 2: Dual-Tier Few-Shot Learning (FAST)

For scenarios with extremely limited budgets for annotation, the FAST paradigm provides a high-performance solution [61].

Workflow Description: The FAST paradigm involves two parallel branches. The Prior Branch uses a pre-trained Vision-Language Model (VLM) with learnable prompt vectors to generate patch classifications, leveraging textual knowledge. Simultaneously, the Cache Branch uses a small set of sparsely annotated patches to learn patch labels through knowledge retrieval. The outputs from both branches are then integrated to produce the final WSI Classification. The entire process is supported by a Dual-Level Annotation Strategy that provides a few WSIs with slide-level labels and a very few patches with fine-grained labels.

[Diagram: a Dual-Level Annotation Strategy supplies few slide-level labels to the Prior Branch (VLM with learnable prompts) and very few patch-level labels to the Cache Branch (knowledge retrieval); the two branches are integrated to produce the final WSI classification.]

Materials and Reagents:

  • All materials from Protocol 1.
  • A vision-language FM (VLFM) like CONCH that has been trained on both images and text [31].
  • A small set of patches with fine-grained labels (e.g., 0.22% of the annotation cost of a fully supervised model) [61].

Step-by-Step Procedure:

  • Dual-Level Annotation:
    • Collect a small number of WSIs and assign them slide-level labels.
    • From these WSIs, annotate an extremely small number of individual patches (e.g., fewer than 100) with fine-grained labels.
  • Dual-Branch Classification:
    • Prior Branch: Utilize a VLFM. Incorporate learnable prompt vectors that are tuned during training to generate patch classifications based on the model's semantic knowledge.
    • Cache Branch: Use the sparsely labeled patches to build a "cache" of known examples. For each unlabeled patch in a WSI, its label is inferred by retrieving and comparing it to the most similar examples in this cache (e.g., using k-Nearest Neighbors).
    • Integration: The predictions from both branches are combined (e.g., via weighted averaging) to produce the final WSI-level prediction. This approach has been shown to come close to fully supervised accuracy at a fraction of the annotation cost [61] (see the sketch after this list).
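The cache-branch retrieval can be illustrated with a simple k-nearest-neighbour lookup in embedding space. The sketch below is a simplification of FAST's mechanism; the toy arrays, `k`, and the fusion weight `alpha` are assumptions for illustration only.

```python
# Toy sketch of cache-branch label inference via k-NN retrieval over the few
# annotated patch embeddings; FAST's actual formulation is more elaborate.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
cache_embeddings = rng.normal(size=(64, 512))     # sparsely annotated patches
cache_labels = rng.integers(0, 2, size=64)        # their fine-grained labels

knn = KNeighborsClassifier(n_neighbors=5).fit(cache_embeddings, cache_labels)
query = rng.normal(size=(1000, 512))              # unlabeled patch embeddings
cache_probs = knn.predict_proba(query)            # cache-branch predictions

# Integration with the prior (VLM) branch, e.g. via weighted averaging:
# final_probs = alpha * prior_probs + (1 - alpha) * cache_probs
```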

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Research Reagents for FM-Based WSI Analysis

| Item Name | Type | Function/Benefit | Example/Citation |
|---|---|---|---|
| CONCH | Vision-Language FM | Excels in morphology, biomarker, and prognostication tasks; strong in low-data scenarios | [31] |
| Virchow2 | Vision-Only FM | Top performer overall; robust across diverse tasks, especially with ~150+ training samples | [31] |
| UNI | Foundation Model | Effective for cross-domain transfer; used successfully in TB detection from pathology pre-training | [60] |
| UMedPT | Multi-Task FM | A foundational model trained on diverse medical imaging tasks (classification, segmentation, detection), enhancing feature robustness for data-scarce domains | [62] |
| Transformer Aggregator | Algorithm | Aggregates patch embeddings into a slide-level representation for classification using slide-level labels only | [31] [60] |
| Optimized Heterogeneous Ensemble (SSL-OHE) | Algorithm | Combines multiple lightweight models via an optimized weighting strategy to improve performance and mitigate class imbalance | [63] |

Foundation models represent a paradigm shift in computational pathology, effectively overcoming the historical bottleneck of data scarcity. Benchmarking studies confirm that models like CONCH and Virchow2 deliver robust performance even when fine-tuning data is severely limited. The experimental protocols outlined—ranging from standard weakly supervised aggregation to advanced few-shot learning—provide researchers with a clear roadmap for implementing these powerful tools. By leveraging FMs, the field can accelerate the development of reproducible and accurate diagnostic and prognostic tools, ultimately advancing personalized pathology and drug development.

In the field of computational pathology, the development of foundation models for weakly supervised whole slide image (WSI) classification has traditionally operated under the assumption that larger datasets yield superior models. However, a paradigm shift is underway, with recent evidence compellingly demonstrating that the diversity of pretraining data often exerts a more powerful influence on model performance than sheer data volume alone. This principle is particularly critical in weakly supervised learning, where models must generalize from slide-level labels without precise regional annotations.

This Application Note synthesizes recent benchmarking studies and novel methodologies to provide researchers and drug development professionals with actionable insights and protocols for constructing more robust and data-efficient computational pathology workflows. The findings underscore that strategic data curation focusing on variety across institutions, scanner types, patient demographics, and tissue sites can enable foundation models to achieve state-of-the-art performance while utilizing significantly fewer computational resources.

Quantitative Evidence: Diversity Over Scale

Comprehensive benchmarking studies directly challenge the conventional wisdom of "scale-first" in foundation model training for computational pathology. The evidence reveals that models trained on more diverse datasets consistently match or surpass the performance of models trained on orders of magnitude more data.

Table 1: Foundation Model Performance vs. Training Data Scale

| Model | Training WSIs | Training Patches | Key Performance Findings | Primary Strength |
|---|---|---|---|---|
| CONCH [31] | - | 1.17M image-caption pairs | Highest mean AUROC (0.71) across 31 tasks; leader in morphology (0.77) and prognosis (0.63) | Vision-language pretraining; complementary feature learning |
| Virchow2 [31] | 3.1 million | - | Overall AUROC of 0.71; top performer in biomarker tasks (0.73) and with 300-patient cohorts | Vision-only model; strong in data-rich scenarios |
| Athena [64] | 282,000 | 115 million | Competes with state-of-the-art on slide-level tasks despite minimal patch count | High WSI diversity; efficient ViT-G/14 architecture |
| UNI [64] | 100,000 | 100 million | Strong performance (AUROC 0.68), but outperformed by more diverse models | Trained on unique patches |
| Prov-GigaPath [31] | - | - | Strong biomarker prediction (mean AUROC 0.72) | - |
| PLIP [31] | - | - | Lower average AUROC (0.64) | - |

Table 2: Impact of Data Diversity on Model Performance

| Diversity Dimension | Impact on Model Performance | Evidence |
|---|---|---|
| Anatomic Tissue Sites | Significant correlation with morphology task performance (r=0.74, p<0.05) [31] | Direct, statistically significant improvement |
| Institutions & Scanners | Enhanced robustness to staining variations and scanning protocols [64] | Improved generalization on external validation |
| Geographic Sources | Exposure to country-specific variations in staining and tissue preparation [64] | Broader feature representation in t-SNE visualizations |
| Data Modalities | Vision-language models (CONCH) outperform vision-only models on multiple tasks [31] | Complementary information from text captions |

The data reveals several critical insights. First, CONCH, a vision-language model trained on 1.17 million image-caption pairs, performs on par with Virchow2, a vision-only model trained on 3.1 million WSIs, and together they outperform other pathology foundation models across morphology, biomarker, and prognostication tasks [31]. This suggests that multimodal training provides a form of data diversity that can compensate for smaller dataset sizes.

Second, the Athena foundation model demonstrates that strategic diversity-focused curation enables competitive performance with dramatically reduced computational burden. Trained on just 115 million tissue patches—several times fewer than recent histopathology foundation models—Athena approaches state-of-the-art performance by maximizing data diversity through random selection of a moderate number of patches per WSI from a repository spanning multiple countries, institutions, and scanner types [64].

Experimental Protocols for Weakly Supervised WSI Classification

Benchmarking Foundation Models as Feature Extractors

Purpose: To evaluate the performance of various pathology foundation models as feature extractors for weakly supervised downstream tasks related to morphology, biomarkers, and prognostication [31].

Materials:

  • 19 pathology foundation models (e.g., CONCH, Virchow2, Prov-GigaPath, UNI, PLIP)
  • 13 patient cohorts (6,818 patients, 9,528 slides) across lung, colorectal, gastric, and breast cancers
  • 31 clinically relevant evaluation tasks (5 morphology, 19 biomarkers, 7 prognosis)

Procedure:

  • Feature Extraction:
    • Tessellate each WSI into small, non-overlapping patches at appropriate magnification (typically 20x)
    • Extract patch-level embeddings using each foundation model without fine-tuning
    • For models with slide encoders, retrieve the original tile-level embeddings rather than slide-level representations [31]
  • Multiple Instance Learning (MIL) Setup:

    • Aggregate patch-level features using a transformer-based MIL aggregator
    • Compare with attention-based MIL (ABMIL) as a baseline [31]
    • Train using only slide-level labels in a weakly supervised manner
  • Model Evaluation:

    • Evaluate on external cohorts not included in any foundation model's training data
    • Measure performance using AUROC, AUPRC, balanced accuracy, and F1 scores
    • Conduct statistical significance testing across models (p<0.05)
  • Low-Data Scenario Testing:

    • Train downstream models on randomly sampled cohorts of 300, 150, and 75 patients
    • Maintain similar positive-to-negative sample ratios
    • Validate on full-size external cohorts to assess data efficiency

Multimodal Data Fusion with MPath-Net

Purpose: To integrate WSIs with pathology reports for enhanced cancer subtype classification in a weakly supervised setting [65].

Materials:

  • TCGA dataset (1,684 cases: 916 kidney, 768 lung)
  • H&E-stained WSIs and corresponding pathology reports
  • Sentence-BERT model for text encoding
  • Multiple Instance Learning (MIL) framework for WSI processing

Procedure:

  • WSI Feature Extraction:
    • Process WSIs using an MIL approach (e.g., ABMIL, TransMIL)
    • Extract features from the second-last layer of the deep neural network
  • Text Feature Extraction:

    • Preprocess pathology reports using standard NLP techniques
    • Generate text embeddings using Sentence-BERT (frozen weights)
    • Extract features from the second-last layer of the transformer
  • Multimodal Fusion:

    • Concatenate 512-dimensional image and text embeddings (a fusion sketch follows this protocol)
    • Pass through custom fine-tuning layers for joint representation learning
    • Employ feature-level fusion strategy to learn cross-modal interactions
  • Model Training & Evaluation:

    • Train end-to-end with image encoder fine-tuned and text encoder frozen
    • Evaluate using accuracy, precision, recall, and F1-score
    • Generate attention heatmaps for interpretable tumor tissue localization
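The feature-level fusion step can be sketched as a small module that concatenates the two 512-dimensional embeddings; the layer widths, dropout, and class count below are illustrative assumptions, not MPath-Net's published configuration.

```python
# Hedged sketch of feature-level fusion of image and text embeddings.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, n_classes=5):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(img_dim + txt_dim, 256), nn.ReLU(),
                                 nn.Dropout(0.25), nn.Linear(256, n_classes))

    def forward(self, img_emb, txt_emb):
        # Concatenate modalities, then learn cross-modal interactions jointly.
        return self.mlp(torch.cat([img_emb, txt_emb], dim=-1))

logits = FusionHead()(torch.randn(8, 512), torch.randn(8, 512))  # toy batch of 8 cases
```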

BRAF Mutation Prediction with Foundation Models and Gradient Boosting

Purpose: To predict BRAF-V600 mutation status directly from histopathological slides using a weakly supervised, image-only pipeline [23].

Materials:

  • Skin Cutaneous Melanoma (SKCM) dataset from TCGA (275 slides)
  • Independent cohort from University Hospital Essen (68 slides)
  • Prov-GigaPath foundation model
  • XGBoost classifier

Procedure:

  • Feature Extraction:
    • Extract patch-level features from WSIs using Prov-GigaPath
    • Fine-tune the foundation model on the target dataset
  • Slide-Level Representation:

    • Aggregate patch-level features using attention-based pooling
    • Alternatively, use principal component analysis (PCA) for dimensionality reduction
  • Classifier Training:

    • Train XGBoost classifier on slide-level representations (a minimal sketch follows this protocol)
    • Compare with Random Forest (RF) and Logistic Regression (LR) baselines
    • Perform cross-validation on TCGA dataset
  • Evaluation:

    • Test on independent UHE cohort for external validation
    • Report AUC with 95% confidence intervals
    • Compare performance with and without XGBoost integration
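A minimal sketch of the classifier-training step follows, assuming slide-level embeddings have already been computed; the toy arrays, embedding dimension, and hyperparameters are illustrative stand-ins.

```python
# Cross-validated XGBoost on precomputed slide-level embeddings (toy data).
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(275, 768))          # stand-in for aggregated slide embeddings
y = rng.integers(0, 2, size=275)         # stand-in for BRAF-V600 status labels

clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```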

Visualization of Workflows and Relationships

Workflow for Weakly Supervised WSI Classification

[Diagram: pretraining phase (diverse WSI collection → patch extraction → foundation model pretraining) feeding the downstream application (feature embeddings → weakly supervised MIL → slide-level prediction).]

Data Diversity vs. Volume Relationship

[Diagram: high diversity (multiple institutions, scanners, tissue types, countries) leads to optimal performance with fewer patches, whereas high volume from a single source yields diminishing returns with scale.]

Multimodal Fusion Architecture

[Diagram: WSI input → patch extraction → foundation-model feature extraction, in parallel with pathology report → text tokenization → Sentence-BERT text encoding; the two embeddings are concatenated, passed through fine-tuning layers, and classified.]

The Scientist's Toolkit: Essential Research Reagents and Materials

Table 3: Key Research Reagent Solutions for Weakly Supervised WSI Classification

| Category | Item | Function/Application | Examples/Notes |
|---|---|---|---|
| Foundation Models | CONCH [31] | Vision-language feature extraction for weakly supervised tasks | Highest overall performer across morphology, biomarkers, prognosis |
| | Virchow2 [31] | Vision-only feature extraction; strong in data-rich scenarios | Close second to CONCH; best for biomarker prediction |
| | Athena [64] | Data-efficient foundation model for diverse datasets | Trained on 115M patches; emphasizes diversity over volume |
| | Prov-GigaPath [23] | BRAF mutation prediction from histopathology slides | Used with XGBoost for state-of-the-art mutation classification |
| Software Frameworks | Multiple Instance Learning (MIL) [31] [65] | Weakly supervised aggregation of patch-level features | Transformer-based or attention-based (ABMIL) aggregation |
| | DINOv2 [64] | Self-supervised training framework for foundation models | Used for pretraining with self-distillation approach |
| | Sentence-BERT [65] | Text embedding for pathology reports in multimodal approaches | Generates semantic representations of clinical text |
| | XGBoost [23] | Gradient boosting for slide-level classification | Enhances foundation model performance for mutation prediction |
| Data Resources | TCGA Datasets [65] [23] | Multi-cancer WSI collections with molecular annotations | Source for SKCM, lung, kidney cancer subtypes |
| | GTEx [64] | Normal tissue reference for model pretraining | Provides complementary normal histology |
| | Institutional Cohorts [31] [23] | External validation sets for model generalization | Critical for assessing real-world performance |

The collective evidence from recent benchmarking studies and methodological innovations firmly establishes that strategic emphasis on pretraining data diversity delivers superior returns compared to merely scaling dataset volume in computational pathology. This principle proves particularly impactful in weakly supervised settings, where models must extract maximal signal from minimal annotations. The protocols and visualizations provided herein offer researchers a structured framework for implementing diversity-first strategies in foundation model development and application. As the field advances, intentional curation of multidomain, multigeography, and multimodal datasets will accelerate the development of more robust, data-efficient, and clinically actionable AI tools for pathology and drug development.

Foundation models, trained on large-scale histopathological datasets using self-supervised or supervised learning, are revolutionizing computational pathology by providing powerful feature extractors for downstream clinical tasks. These models significantly reduce the need for extensive manual annotations and computational resources for task-specific model development. However, with an increasing number of available foundation models, selecting the most appropriate one for a specific clinical task presents a substantial challenge for researchers and practitioners. This guide provides a structured framework for matching pathology foundation models to specific clinical applications, supported by recent benchmarking evidence and detailed experimental protocols.

Foundation Model Performance Across Clinical Domains

Comprehensive benchmarking studies have evaluated numerous foundation models across multiple clinically relevant domains. Understanding their relative strengths in different task categories is essential for appropriate model selection.

Quantitative Performance Comparison

The table below summarizes the performance of leading foundation models across key clinical domains based on a comprehensive benchmark of 19 models on 31 tasks involving 6,818 patients and 9,528 slides [31]:

Table 1: Foundation Model Performance Across Clinical Domains (Mean AUROC)

| Foundation Model | Modality | Morphology Tasks (n=5) | Biomarker Tasks (n=19) | Prognosis Tasks (n=7) | Overall Average |
|---|---|---|---|---|---|
| CONCH | Vision-Language | 0.77 | 0.73 | 0.63 | 0.71 |
| Virchow2 | Vision-only | 0.76 | 0.73 | 0.61 | 0.71 |
| Prov-GigaPath | Vision-only | - | 0.72 | - | 0.69 |
| DinoSSLPath | Vision-only | 0.76 | - | - | 0.69 |
| UNI | Vision-only | - | - | - | 0.68 |
| BiomedCLIP | Vision-Language | - | - | 0.61 | 0.66 |

Note: Dashes indicate where specific domain performance data was not explicitly provided in the source material [31]

Model Selection Guidelines by Clinical Task Type

Based on the benchmarking evidence, the following recommendations can be made for model selection:

  • For morphological analysis tasks (e.g., tissue structure assessment, cellular pattern recognition), CONCH achieves the highest performance (AUROC: 0.77), closely followed by Virchow2 and DinoSSLPath (AUROC: 0.76) [31].

  • For biomarker prediction tasks (e.g., molecular alterations, protein expression), Virchow2 and CONCH demonstrate equivalent top performance (AUROC: 0.73), with Prov-GigaPath as a close contender (AUROC: 0.72) [31].

  • For prognostic outcome prediction, CONCH provides superior performance (AUROC: 0.63) compared to other models, including Virchow2 (AUROC: 0.61) and BiomedCLIP (AUROC: 0.61) [31].

  • For low-data scenarios, Virchow2 performs best with medium-sized training cohorts (n=150), while CONCH shows advantages with very small cohorts (n=75) [31].

  • For multi-task clinical applications, consider model ensembles. An ensemble combining CONCH and Virchow2 predictions outperformed individual models in 55% of tasks by leveraging complementary strengths [31].

Experimental Protocol for Weakly-Supervised Whole Slide Image Classification

This section provides a detailed methodology for applying foundation models to weakly-supervised whole slide image (WSI) classification tasks, consistent with established benchmarking approaches [31].

Materials and Reagent Solutions

Table 2: Essential Research Materials and Computational Tools

| Category | Item | Specification/Function |
|---|---|---|
| Data Sources | Whole Slide Images (WSIs) | Formalin-Fixed Paraffin-Embedded (FFPE) or frozen tissue sections, scanned at 20× or 40× magnification |
| | Clinical Annotations | Slide-level or patient-level labels for morphology, biomarkers, or prognosis |
| Computational Hardware | GPU Cluster | NVIDIA GPUs with ≥16GB VRAM for efficient feature extraction and model training |
| | Storage System | High-capacity storage for WSI repositories (often petabyte-scale) |
| Software Tools | Python Environment | Version 3.8+ with PyTorch or TensorFlow deep learning frameworks |
| | WSI Processing | OpenSlide or CuCIM libraries for patch-level extraction |
| Foundation Models | Vision-Language Models | CONCH (trained on 1.17M image-caption pairs) |
| | Vision-Only Models | Virchow2 (trained on 3.1M WSIs) |
| Downstream Architectures | Multiple Instance Learning | Attention-based MIL (ABMIL) or transformer-based aggregators |

Step-by-Step Workflow Protocol

[Diagram: WSIs (FFPE/frozen) → patch extraction (256×256 pixels at 20×) → feature extraction with a foundation model (options: CONCH, Virchow2, Prov-GigaPath, DinoSSLPath) → feature aggregation (transformer or ABMIL) → weakly-supervised training with slide-level labels → clinical task prediction (biomarker/morphology/prognosis).]

Figure 1: Workflow for weakly-supervised WSI classification using foundation models.

WSI Preprocessing and Patch Extraction

  • Quality Control: Perform visual or automated quality assessment of WSIs to exclude slides with significant artifacts, bubbles, adhesions, or blurring [56].
  • Tissue Segmentation: Apply tissue detection algorithms to identify relevant tissue regions and exclude background areas.
  • Patch Extraction: Tessellate WSIs into non-overlapping 256×256 pixel patches at 20× magnification. A typical WSI may yield 5,000-15,000 patches [31].
  • Data Augmentation: Implement appropriate augmentation strategies (rotation, flipping, color jitter) during training to improve model generalization.

Feature Extraction with Foundation Models

  • Model Selection: Choose an appropriate foundation model based on your clinical task (refer to the model selection guidelines above).
  • Feature Extraction: Process each patch through the foundation model to extract feature embeddings:
    • Use models in "feature extractor" mode without fine-tuning for efficient processing
    • CONCH and Virchow2 have demonstrated strong performance as feature extractors [31]
  • Feature Storage: Save patch-level features in efficient formats (e.g., HDF5) for downstream training.

Weakly-Supervised Model Training

  • Feature Aggregation: Implement multiple instance learning (MIL) frameworks to aggregate patch-level features into slide-level representations:
    • Transformer-based aggregators generally outperform ABMIL by approximately 0.01 AUROC [31]
    • Alternative: Attention-based MIL (ABMIL) for computational efficiency
  • Label Assignment: Use slide-level or patient-level labels that correspond to the clinical task (morphology, biomarker status, or prognosis)
  • Model Configuration (an illustrative setup follows this list):
    • Optimizer: AdamW with learning rate 1e-4 to 1e-5
    • Batch size: 8-32 depending on GPU memory
    • Regularization: weight decay and dropout scaled to dataset size
  • Validation Strategy: Implement rigorous cross-validation with patient-level splits to prevent data leakage
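An illustrative PyTorch setup matching the configuration above is sketched below; the linear stand-in model and scheduler choice are assumptions, and all values are starting points rather than tuned settings.

```python
# Illustrative optimizer/regularization setup for the MIL training stage.
import torch

model = torch.nn.Linear(512, 2)  # stand-in for the aggregator + classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
dropout = torch.nn.Dropout(p=0.25)          # scale dropout with dataset size
# Early stopping and patient-level cross-validation belong in the training
# loop (not shown) to prevent overfitting and data leakage.
```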

Advanced Implementation Considerations

Ensemble Approaches

For maximum performance, consider ensemble methods that leverage complementary strengths of different foundation models:

  • Feature-Level Fusion: Combine features from multiple foundation models before aggregation
  • Prediction-Level Fusion: Average predictions from models trained with different foundation model features (see the sketch after this list)
  • Priority Ensemble: The CONCH + Virchow2 ensemble has demonstrated superior performance in 55% of tasks [31]
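Prediction-level fusion reduces to averaging per-slide probabilities, as in the toy sketch below; equal weights are an assumption and would normally be tuned on a validation set.

```python
# Toy prediction-level fusion of two foundation-model pipelines.
import numpy as np

probs_conch = np.array([0.82, 0.31, 0.67])      # per-slide probabilities, pipeline A
probs_virchow2 = np.array([0.75, 0.40, 0.71])   # per-slide probabilities, pipeline B

ensemble_probs = 0.5 * probs_conch + 0.5 * probs_virchow2
predictions = (ensemble_probs >= 0.5).astype(int)
```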

Low-Data Strategies

When labeled data is scarce (≤100 patients), consider these specialized approaches:

  • Model Selection: CONCH shows advantages in very low-data scenarios (n=75) [31]
  • Transfer Learning: Leverage features from models pretrained on diverse tissue sites, even if different from target domain
  • Regularization: Increase dropout rates, use heavier data augmentation, and implement early stopping

Clinical Deployment Validation

Before clinical deployment, ensure rigorous validation:

  • Multi-Center Testing: Validate performance on external cohorts from different institutions
  • Statistical Testing: Perform significance testing on performance differences between models
  • Clinical Utility Assessment: Evaluate impact on clinical workflows and decision-making

Selecting the appropriate foundation model for specific clinical tasks in computational pathology requires careful consideration of task type, available data, and performance requirements. Current evidence indicates that CONCH and Virchow2 generally provide state-of-the-art performance across multiple domains, with each showing specific strengths in different clinical contexts. By following the protocols and guidelines outlined in this document, researchers can systematically implement and validate foundation models for their specific clinical applications, accelerating the development of robust computational pathology solutions.

Application Notes

The Challenge of Stain Variation in Computational Pathology

In the field of digital histopathology, the analysis of Whole Slide Images (WSIs) is fundamentally hampered by technical variations introduced during tissue preparation and digitization. A model trained on data from one laboratory frequently exhibits a lack of generalization when applied to images from another laboratory due to differences in scanners, staining protocols, and laboratory procedures [66]. These variations cause strong color differences, altering image characteristics like contrast, brightness, and saturation, and creating complex style variations that are distinct from the underlying biological information [66]. Pathologists are trained to cope with these variations, but deep-learning models often struggle, potentially compromising the robustness and reliability of computer-aided diagnostic (CAD) systems [66].

The integration of foundation models into weakly supervised WSI classification research is particularly sensitive to these issues. The pre-training of these models on large, diverse datasets does not inherently confer immunity to domain shift. Therefore, proactive management of stain variation and image artifacts is a critical prerequisite for ensuring that these models can be effectively fine-tuned and deployed in real-world clinical settings, where data heterogeneity is the norm rather than the exception.

Quantitative Analysis of Stain Normalization and Domain Adaptation Techniques

The performance of various stain normalization and domain adaptation methods has been quantitatively evaluated across multiple studies, with metrics focusing on color constancy, segmentation accuracy, and classification improvement. The following tables summarize key findings.

Table 1: Quantitative Performance of Stain Normalization Methods on Color Constancy and Segmentation

| Method | Key Principle | Performance Findings | Dataset |
|---|---|---|---|
| StainCUT [66] | Contrastive learning for unpaired image-to-image translation | Improves metastasis segmentation performance; more efficient in memory and runtime than CycleGAN | Lymph node specimens digitized with two different scanners |
| WSICS [67] | Color/spatial pixel classification with distribution alignment in HSD color model | Yields smallest standard deviation and coefficient of variation for normalized median intensity; significantly improves necrosis quantification performance | 125 H&E WSIs from 3 patients stained in 5 different labs; 30 H&E WSIs of rat liver |
| Macenko et al. [66] | Linear stain separation based on non-negative stain vectors | Used as a preprocessing step to increase tissue classification performance [66] | Various histopathology datasets |
| SASN-IL [68] | Two-stage framework with incomplete label correction and adaptive stain transformation | 10.01% increase in Dice coefficient for gastric cancer segmentation compared to baseline | GCP dataset (200 WSIs of gastric cancer) |

Table 2: Impact of Stain Normalization on Downstream Classification and Clinical Tasks

| Application / Method | Domain Adaptation Strategy | Key Outcome | Context |
|---|---|---|---|
| Supervised Contrastive Domain Adaptation [69] | Training constraint added to supervised contrastive learning | Superior performance for WSI classification of six skin cancer subtypes compared to no adaptation or stain normalization | Multi-center WSI classification |
| Adversarial Training [70] | Unsupervised domain adaptation using CNN-based invariant feature space and Siamese architecture | Significant classification improvement compared to baseline models | Histopathology WSI classification |
| DLMAR (CT Imaging) [71] | Deep learning reconstruction combined with metal artifact reduction | Significantly reduced noise, higher SNR/CNR, and improved diagnostic scores in critically ill patients | Abdominal CT imaging with metal artifacts |

The data indicate that deep-learning-based normalization and end-to-end domain adaptation consistently outperform the no-normalization baseline and often exceed the capabilities of traditional statistical methods. Techniques like StainCUT and SASN-IL, which move beyond a simple two-stage "normalize then analyze" pipeline, offer significant benefits for complex tasks like segmentation [66] [68].

Experimental Protocols

Protocol 1: Unpaired Stain Normalization with StainCUT

This protocol details the procedure for implementing the StainCUT method to normalize stain variations between two unpaired datasets (e.g., from different laboratories) [66].

2.1.1 Research Reagent Solutions

  • Source Domain WSIs (X): A collection of histopathology whole slide images from one source (e.g., Laboratory A).
  • Target Domain WSIs (Y): A collection of histopathology whole slide images from a different source (e.g., Laboratory B), without any paired correspondences to X.
  • Computational Environment: A machine with a capable GPU, deep learning framework (e.g., PyTorch or TensorFlow), and sufficient memory to hold image datasets and models.
  • StainCUT Model: The generator (G) and discriminator (D) neural network architectures as described in the original work [66].

2.1.2 Step-by-Step Methodology

  • Data Preparation:

    • Extract representative image patches (e.g., 256x256 or 512x512 pixels) from the WSIs in both the source (X) and target (Y) domains.
    • Ensure the datasets are unpaired, meaning there is no direct correspondence between patches in X and Y.
    • Preprocess the patches as needed (e.g., normalization of pixel values to a [-1, 1] or [0, 1] range).
  • Model Training:

    • Initialize Networks: Initialize the generator G (composed of an encoder G_enc and decoder G_dec) and the discriminator D.
    • Adversarial Training: Train the model using the adversarial loss, L_GAN(G, D, X, Y), to make the output of the generator G(x) indistinguishable from real images in the target domain Y [66].
    • Contrastive Learning: Apply the contrastive learning strategy as defined by Park et al. [66]. This involves ensuring that corresponding patches in the input x and output G(x) share underlying "content" by maximizing mutual information, without relying on cycle-consistency.
    • Iterate: Alternate between updating the discriminator D (to better distinguish real y from generated G(x)) and the generator G (to better fool D) until convergence. A simplified sketch of this alternating update follows this protocol.
  • Inference and Application:

    • Use the trained generator G to transform all image patches from the source domain X to the target stain style, creating a normalized dataset Y_hat = G(X).
    • This normalized dataset can now be used for downstream tasks like weakly supervised classification, potentially in conjunction with real target domain data.
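The alternating update at the heart of this training scheme is sketched below in PyTorch. The one-layer stand-in networks and the omission of StainCUT's contrastive (patch-wise mutual-information) content loss make this a didactic simplification, not the published method.

```python
# Simplified alternating adversarial update (StainCUT's contrastive content
# loss is omitted; G and D are one-layer stand-ins).
import torch
import torch.nn.functional as F

G = torch.nn.Conv2d(3, 3, kernel_size=1)   # stand-in generator
D = torch.nn.Conv2d(3, 1, kernel_size=1)   # stand-in discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

x = torch.randn(4, 3, 64, 64)              # source-stain patches (toy)
y = torch.randn(4, 3, 64, 64)              # target-stain patches (toy)

# Discriminator step: real target patches vs. generated (detached) ones.
real, fake = D(y), D(G(x).detach())
d_loss = F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) \
       + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: make G(x) look like target-stain images to D.
fake = D(G(x))
g_loss = F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```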

The following workflow diagram illustrates the StainCUT training process:

[Diagram: unpaired source-stain WSIs pass through the generator G (encoder G_enc → decoder G_dec) to produce G(x); the discriminator D receives real target-stain samples and generated samples and outputs a real/fake decision.]

Protocol 2: Stain-Adaptive Segmentation with Incomplete Labels (SASN-IL)

This protocol addresses the practical scenario of performing unsupervised domain adaptation for segmentation when the source domain data has incomplete labels (i.e., only a subset of tumor regions is annotated) [68].

2.2.1 Research Reagent Solutions

  • Source Domain Data: WSIs with incomplete segmentation masks (denoted as D_s) containing false-negative regions.
  • Target Domain Data: Unlabeled WSIs (D_t) with a different stain appearance.
  • Segmentation Network: A mean-teacher architecture for robust learning, typically featuring a student model and an exponential moving average (EMA) teacher model [68].

2.2.2 Step-by-Step Methodology

The SASN-IL framework operates in two main stages.

Stage 1: Incomplete Label Correction

  • Preliminary Training: Train the segmentation network initially on the incompletely labeled source data (D_s).
  • Reliable Model Selection: Identify a model checkpoint from the early phases of training that shows promising performance before it overfits to the false-negative labels.
  • Pseudo-Label Generation: Use this reliable model to generate preliminary segmentation masks (pseudo-labels) on the source training data.
  • Label Correction: Fuse the original incomplete labels with the generated pseudo-labels to create a refined, more complete set of training labels for the source domain. This step aims to rectify the false-negative regions.

Stage 2: Unsupervised Domain Adaptation

  • Adaptive Stain Transformation: For each image in the source and target domains, apply a stain transformation module. The key innovation is that the degree of transformation is adaptive, based on the current segmentation performance of the model on that image [68].
  • Mean-Teacher Training: Train the student model on the stain-transformed images. The teacher model, updated via EMA from the student, generates stable pseudo-labels on the target domain (see the sketch after this list).
  • Consistency Loss: Apply a consistency loss between the student and teacher model predictions on the target domain to enforce agreement and learn domain-invariant features.
  • Supervised Loss: On the source domain, apply a supervised segmentation loss (e.g., cross-entropy or Dice loss) using the corrected labels from Stage 1.
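The mean-teacher mechanics can be sketched compactly: the teacher's weights are an exponential moving average of the student's, and a consistency loss ties their target-domain predictions. The stand-in network, decay value, and MSE consistency below are illustrative assumptions.

```python
# Hedged mean-teacher sketch: EMA weight update plus a consistency loss.
import copy
import torch
import torch.nn.functional as F

student = torch.nn.Conv2d(3, 2, kernel_size=1)   # stand-in segmentation net
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                      # teacher is never backpropagated

def ema_update(teacher, student, decay=0.99):
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.data.mul_(decay).add_(s.data, alpha=1 - decay)

target_batch = torch.randn(2, 3, 64, 64)         # stain-transformed target images (toy)
consistency = F.mse_loss(student(target_batch).softmax(dim=1),
                         teacher(target_batch).softmax(dim=1).detach())
consistency.backward()                           # updates the student only
ema_update(teacher, student)                     # then refresh the teacher
```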

The logic of the two-stage SASN-IL framework is summarized below:

[Diagram: Stage 1 (train on incomplete source labels → select reliable early model → generate and fuse pseudo-labels → corrected source labels) feeding Stage 2 (adaptive stain transformation of the corrected source data and unlabeled target data → mean-teacher segmentation network → segmented target data).]

Protocol 3: Supervised Contrastive Domain Adaptation for WSI Classification

This protocol is designed for a multi-center WSI classification task, where labeled data is available from all domains, but the goal is to improve inter-class separability and domain invariance [69].

2.3.1 Research Reagent Solutions

  • Multi-Center WSI Datasets: Labeled whole slide images from multiple institutions (e.g., Center_A, Center_B).
  • Feature Encoder Network: A backbone network (e.g., ResNet) for feature extraction from image patches.
  • Projection Head: A small neural network that maps features to a lower-dimensional space where the contrastive loss is applied.

2.3.2 Step-by-Step Methodology

  • Feature Extraction: Extract feature embeddings from all WSI patches using the feature encoder.
  • Supervised Contrastive Learning:
    • Create Augmented Views: For each input patch, generate multiple augmented views.
    • Anchor Selection: For a given anchor patch, other patches from the same class (across all domains) are treated as positives.
    • Apply Loss: Apply a supervised contrastive loss function that pulls the anchor's embedding closer to the embeddings of its positive samples (same class) and pushes it away from embeddings of negative samples (different classes), regardless of their domain of origin [69] (a compact sketch follows this list).
  • Domain Adaptation Constraint: Integrate an additional training objective that specifically encourages the model to learn features that are invariant to the domain (e.g., center of origin). This can be an additional loss term that minimizes the divergence between feature distributions from different domains.
  • Classification Head: Finally, train a classifier on top of the domain-invariant and class-separable features obtained from the projection head.
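A compact version of the supervised contrastive objective is sketched below; it is a simplification of the SupCon formulation (averaging over positive pairs rather than per-anchor), and the temperature is an illustrative value.

```python
# Simplified supervised contrastive loss over a batch of projected embeddings.
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, temperature=0.1):
    z = F.normalize(z, dim=1)                        # (B, d) unit-norm embeddings
    sim = z @ z.T / temperature                      # pairwise similarity logits
    self_mask = torch.eye(len(z), dtype=torch.bool)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # Log-probability of each pair, excluding self-similarity from the denominator.
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, -1e9),
                                     dim=1, keepdim=True)
    return -log_prob[pos_mask].mean()                # pull same-class pairs together

loss = supcon_loss(torch.randn(16, 128), torch.randint(0, 4, (16,)))
```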

This process enhances the feature space for better classification across multiple domains, as visualized below:

[Diagram: the feature space before adaptation shows class clusters separated by domain shift; after supervised contrastive domain adaptation, same-class patches from all centers collapse into unified, well-separated clusters.]

Application Notes

The integration of semi-supervised learning (SSL) and model ensembles represents a paradigm shift in analyzing Whole Slide Images (WSIs), directly addressing the critical bottleneck of extensive data annotation in medical AI development. These methodologies enable the construction of robust, expert-level diagnostic tools by efficiently leveraging both limited labeled data and abundant unlabeled data.

The Convergence of SSL and Ensemble Methods in Histopathology

Foundation models, characterized by their large-scale pre-training on vast datasets, are a natural fit for a weakly supervised context. Their ability to learn generalizable data representations from unlabeled or weakly labeled data via self-supervised learning drastically reduces the dependency on costly, expert-annotated datasets [72]. When fine-tuned on specific histopathology tasks, these models can achieve remarkable performance with minimal task-specific labels.

Semi-Supervised Learning (SSL) frameworks, such as the Mean Teacher architecture, have demonstrated that models trained with a small fraction of labeled data can perform on par with fully supervised models. A landmark study on colorectal cancer recognition showed that an SSL model using only ~6,300 labeled patches and ~37,800 unlabeled patches achieved an Area Under the Curve (AUC) of 0.980, showing no significant difference from a supervised model trained on ~44,100 labeled patches (AUC: 0.987) [73]. This demonstrates a dramatic reduction in annotation burden without compromising diagnostic accuracy.

Ensemble Deep Learning further enhances model robustness and accuracy by combining predictions from multiple neural network architectures. For instance, in breast cancer subtype classification, an ensemble of VGG16 and ResNet50 architectures applied to the BACH dataset achieved a patch classification accuracy of 95.31% [74] [75]. On the BreakHis dataset, an ensemble incorporating VGG16, ResNet34, and ResNet50 reached a remarkable WSI classification accuracy of 98.43% [74] [75]. This synergy mitigates individual model biases and variances, leading to more reliable clinical predictions.

Quantitative Performance of Integrated Approaches

The table below summarizes key performance metrics from recent studies applying these methods to cancer diagnosis from WSIs.

Table 1: Performance of SSL and Ensemble Methods in Cancer Diagnosis from WSIs

| Cancer Type | Task | Method | Dataset | Performance | Key Finding |
|---|---|---|---|---|---|
| Colorectal Cancer [73] | Patch-level Diagnosis | Semi-Supervised Learning (Mean Teacher) | 13,111 WSIs, 13 centers | AUC: 0.980 (10% labels) | No significant difference from supervised model (AUC: 0.987) using 100% labels. |
| Breast Cancer [74] [75] | Subtype & Invasiveness Classification | Ensemble (VGG16, ResNet50) | BACH (400 images) | Accuracy: 95.31% (patch-level) | Demonstrates high precision in classifying four distinct breast cancer categories. |
| Breast Cancer [74] [75] | Benign/Malignant Classification | Ensemble (VGG16, ResNet34, ResNet50) | BreakHis (9,109 images) | Accuracy: 98.43% (image-level) | Highlights effectiveness for multi-class classification across magnifications. |
| Prostate Cancer [76] | TMPRSS2:ERG Fusion Prediction | Semi-Supervised, Attention-based DL (CLAM) | TCGA PRAD (436 WSIs) | AUC: 0.84 (validation), 0.72-0.73 (independent test) | Showcases SSL's potential for predicting genetic alterations from H&E stains alone. |

Experimental Protocols

This section provides a detailed, actionable protocol for implementing a semi-supervised ensemble framework for WSI classification, incorporating foundation models.

The following diagram illustrates the end-to-end workflow for WSI analysis, from data preparation to final diagnosis.

[Diagram: SSL ensemble workflow. Input WSI → (1) preprocessing and tiling into patches (e.g., 512x512 px) → (2) foundation model feature extraction (per-patch feature vectors) → (3) semi-supervised training of multiple SSL classifiers → (4) ensemble inference producing patch-level predictions → (5) aggregation via attention scores and cluster-based inference → output slide-level diagnosis and heatmap.]

Detailed Stepwise Protocol

WSI Preprocessing and Tiling
  • Tissue Segmentation: Convert the WSI from RGB to Hue, Saturation, Value (HSV) color space to separate luminance from color information, making the model more resilient to staining variations [76]. Apply a median blur filter (e.g., kernel size 7) and morphological closing to reduce noise and bridge small tissue gaps. Use Otsu's method to generate a binary mask differentiating tissue from background [76]. A code sketch of this masking step follows the list.
  • Tiling/Patching: Use the tissue mask to extract representative patches from the WSI at a specified magnification (e.g., 40x). A common patch size is 2048x2048 pixels, which can be downsampled by a factor of 4 to 512x512 pixels for efficient processing [76]. It is critical to group all tiles derived from the same patient to prevent data leakage between training and validation sets [76].
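
A minimal sketch of the tissue-masking step, assuming OpenCV and a downsampled RGB thumbnail of the WSI; the elliptical closing kernel is an illustrative choice, while the median-blur kernel size follows the protocol.

```python
import cv2
import numpy as np

def tissue_mask(thumbnail_rgb: np.ndarray) -> np.ndarray:
    """HSV conversion -> median blur -> morphological closing -> Otsu threshold,
    returning a binary tissue/background mask (255 = tissue)."""
    hsv = cv2.cvtColor(thumbnail_rgb, cv2.COLOR_RGB2HSV)
    sat = hsv[:, :, 1]                                   # saturation: tissue > background
    sat = cv2.medianBlur(sat, 7)                         # kernel size 7, per the protocol
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    sat = cv2.morphologyEx(sat, cv2.MORPH_CLOSE, kernel) # bridge small tissue gaps
    _, mask = cv2.threshold(sat, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask
```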
Feature Extraction using Foundation Models
  • Leverage Pre-trained Models: Utilize a foundation model, such as a large-scale Vision Transformer (ViT) or a pre-trained convolutional neural network like ResNet50, as a feature extractor [72] [76]. This model should have been pre-trained on a massive and diverse dataset, often via self-supervised learning, to learn powerful, general-purpose visual representations [72].
  • Extract Feature Vectors: Process each 512x512 patch through the foundation model. The output from a penultimate layer (before the classification head) is extracted as a feature vector (e.g., a 1024-dimensional vector) [76]. This step transforms each patch into a compact, meaningful representation for the subsequent classification task. A minimal extraction sketch follows this list.
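
A minimal sketch of penultimate-layer feature extraction, assuming a torchvision ResNet50 pre-trained on ImageNet. Note that ResNet50's pooled features are 2048-dimensional; the 1024-dimensional vectors cited above would come from a different encoder or an added projection layer.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Truncate an ImageNet-pretrained ResNet50 at the global-average-pool layer so the
# network outputs penultimate-layer embeddings instead of class logits.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()                      # drop the classification head
backbone.eval()

preprocess = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_patch(patch_rgb):                            # 512x512 RGB patch (PIL image)
    x = preprocess(patch_rgb).unsqueeze(0)             # (1, 3, 512, 512)
    return backbone(x).squeeze(0)                      # 2048-dim feature vector
```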
Semi-Supervised Learning Training

This protocol adopts the Mean Teacher framework, a state-of-the-art SSL method validated on large-scale pathological image analysis [73].

  • Framework Setup: The Mean Teacher method maintains two models: a Student model and a Teacher model, which are identical in architecture. The Student model is trained via gradient descent on the combined labeled and unlabeled data. The Teacher model's weights are an exponential moving average (EMA) of the Student model's weights, providing stable, consistent targets for the unlabeled data [73].
  • Loss Function Calculation: The total loss is a weighted sum of two components (both are sketched in code after this list):
    • Supervised Loss: The cross-entropy loss calculated on the small set of labeled patches [76] [73].
    • Consistency Loss: The mean squared difference between the Student and Teacher model's predictions for the same unlabeled patch (often after applying data augmentation such as rotation or flipping). This forces the model to be invariant to small perturbations, improving robustness [73].
  • Model Architecture: A common choice for the base classifier is an attention-based Multiple Instance Learning (MIL) framework like CLAM (Clustering-constrained Attention Multiple Instance Learning) [76]. CLAM uses a gated attention mechanism to assign different weights to all patches in a WSI, aggregating them into a single slide-level representation and prediction, while also identifying key patches.
  • Training Parameters: Use the Adam optimizer with a learning rate of 0.0001 and weight decay of 0.00001 [76]. Train for a sufficient number of epochs (e.g., 150) with a validation set for early stopping.
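
A minimal PyTorch sketch of the Mean Teacher update and loss described above; the EMA decay, consistency weight, and flip-based perturbation are illustrative choices, not the exact settings of the cited study.

```python
import torch
import torch.nn.functional as F

def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = 0.999):
    """Teacher weights are an exponential moving average of student weights."""
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1.0 - decay)

def perturb(x: torch.Tensor) -> torch.Tensor:
    # Illustrative perturbation: random horizontal flip (rotation/flipping per protocol)
    return torch.flip(x, dims=[-1]) if torch.rand(()) < 0.5 else x

def mean_teacher_loss(student, teacher, x_lab, y_lab, x_unlab, w_cons=1.0):
    """Weighted sum of supervised cross-entropy and student-teacher consistency."""
    sup = F.cross_entropy(student(x_lab), y_lab)
    with torch.no_grad():                                # teacher gives stable targets
        p_teacher = F.softmax(teacher(perturb(x_unlab)), dim=1)
    p_student = F.softmax(student(perturb(x_unlab)), dim=1)
    cons = F.mse_loss(p_student, p_teacher)              # mean squared difference
    return sup + w_cons * cons
```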

Table 2: SSL Training Configuration based on Mean Teacher Framework

| Component | Specification | Purpose & Rationale |
|---|---|---|
| Labeled Data | ~6,300 patches (10% of total) [73] | Provides ground-truth supervision via cross-entropy loss. |
| Unlabeled Data | ~37,800 patches (60% of total) [73] | Enforces consistency and improves model generalization. |
| Base Model | CLAM (Attention-based MIL) [76] | Enables slide-level prediction and provides interpretability. |
| Optimizer | Adam (α=0.0001, weight decay=0.00001) [76] | Standard for stable and efficient convergence. |
| Key Metric | Area Under the Curve (AUC) | Standard for evaluating diagnostic binary classification performance. |

Ensemble Model Inference and Aggregation
  • Create Model Variants: Develop multiple high-performing SSL models. Diversity can be introduced by:
    • Architectural Diversity: Using different foundation models as feature extractors (e.g., VGG16, ResNet50, ResNet34) [74] [75].
    • Data Diversity: Training models on different splits or subsets of the data, or using different SSL algorithm variations.
  • Aggregate Predictions: For a new WSI, obtain predictions from all models in the ensemble. The final slide-level prediction can be generated through the following strategies (averaging and voting are sketched after this list):
    • Averaging: Calculating the average of the predicted probabilities from all models.
    • Majority Voting: Taking the class label that is predicted by the majority of models [77].
    • Stacking: Using a meta-learner (a simpler model) to learn how to best combine the predictions from the base models [77]. Research indicates stacking often demonstrates superior accuracy in disease prediction [77].
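
A minimal NumPy sketch of the averaging and majority-voting strategies; the function name and probability-matrix layout are illustrative.

```python
import numpy as np

def ensemble_predict(prob_matrix: np.ndarray, method: str = "average") -> int:
    """Combine slide-level class probabilities from an ensemble.

    prob_matrix: shape (n_models, n_classes) for a single WSI.
    """
    if method == "average":                      # soft voting over probabilities
        return int(prob_matrix.mean(axis=0).argmax())
    if method == "majority":                     # hard voting over class labels
        votes = prob_matrix.argmax(axis=1)
        return int(np.bincount(votes).argmax())
    raise ValueError(f"unknown method: {method}")

# Stacking would instead train a meta-learner (e.g., logistic regression) on the
# base models' outputs; it is omitted here because it needs a held-out fold.
```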
Slide-Level and Patient-Level Diagnosis
  • Attention-based Aggregation: Models like CLAM naturally provide an attention score for each patch, indicating its contribution to the final prediction. These scores can be used to generate a heatmap overlaid on the original WSI, highlighting regions most indicative of cancer for pathologist review [76].
  • Cluster-based WSI Inference: For slide-level diagnosis, a positive WSI can be inferred if the number of predicted positive patches exceeds a certain threshold, or if the patches form a spatially significant cluster [73].
  • Patient-Level Diagnosis: A patient-level diagnosis is typically derived from the combined analysis of all WSIs available for that patient [73].

The Scientist's Toolkit: Research Reagent Solutions

The following table catalogues essential computational "reagents" required to implement the described protocols.

Table 3: Essential Research Reagents for SSL and Ensemble WSI Analysis

| Item / Solution | Function / Application | Exemplars & Notes |
|---|---|---|
| Whole Slide Image (WSI) Datasets | Serves as the primary data source for model training and validation. | BACH (Breast Cancer), BreakHis, TCGA (e.g., PRAD, CRC), and internal institutional cohorts [74] [76] [73]. |
| Computational Frameworks | Provides the software environment for building and training deep learning models. | PyTorch or TensorFlow, with specialized libraries for WSI analysis (e.g., CLAM for attention-based MIL) [76]. |
| Pre-trained Foundation Models | Acts as a powerful, generic feature extractor, reducing the need for training from scratch. | Vision Transformers (ViT), self-supervised models (e.g., DINO), or CNNs pre-trained on large natural image datasets (e.g., ImageNet) such as ResNet50 and VGG16 [72] [76]. |
| Semi-Supervised Learning Algorithms | Enables learning from both labeled and unlabeled data. | Mean Teacher framework, which uses a consistency loss between a student and a teacher model to leverage unlabeled data [73]. |
| Ensemble Construction Methods | Combines multiple models to improve predictive performance and robustness. | Stacking (using a meta-learner), Bagging (e.g., Random Forest), or Boosting (e.g., AdaBoost, Gradient Boosting) [77]. Stacking has shown particularly high accuracy [77]. |
| High-Performance Computing (HPC) | Provides the necessary computational power for training large models on massive WSI datasets. | GPU clusters (e.g., NVIDIA A100/V100) for parallel processing, coupled with large-scale memory and storage systems to handle multi-gigabyte WSIs. |

Benchmarks and Validation: Evaluating Foundation Model Performance on Clinical Tasks

In the rapidly evolving field of computational pathology, the development of foundation models for weakly supervised Whole Slide Image (WSI) classification represents a paradigm shift in cancer diagnosis and prognostic prediction. These models, typically trained using only slide-level labels through multiple instance learning (MIL) frameworks, promise to unlock previously inaccessible insights from gigapixel pathological images [78] [79]. However, their transition from research prototypes to clinically viable tools hinges critically on one often-overlooked aspect: rigorous validation using external cohorts. External validation, defined as evaluating model performance on data collected from completely different sources, populations, or institutions than the training data, serves as the ultimate test of model generalizability and robustness [80] [81].

The importance of external validation is particularly pronounced in computational pathology due to the pervasive challenges of batch effects, staining variations, scanner differences, and population heterogeneity. While internal validation metrics might suggest high performance, these can be dangerously misleading due to overfitting to site-specific artifacts or demographic peculiarities [80]. For drug development professionals and translational researchers, this validation gap represents a significant barrier to clinical adoption. A model that fails to generalize across external cohorts may lead to erroneous conclusions in clinical trials or biomarker discovery efforts, potentially compromising patient safety and drug development pipelines.

This application note establishes a comprehensive framework for incorporating external validation into the development lifecycle of foundation models for weakly supervised WSI classification. We provide detailed protocols, experimental designs, and analytical tools to ensure that these powerful AI systems meet the rigorous standards required for clinical research and therapeutic development.

Foundational Concepts: Internal versus External Validation

Key Definitions and Distinctions

Understanding the fundamental differences between internal and external validation is crucial for establishing a robust validation framework. Internal validation encompasses all evaluation performed on data derived from the same source distribution as the training set, including standard train-test splits and cross-validation. While necessary for model development, internal validation provides an optimistically biased performance estimate because the model is evaluated on data with similar technical and biological characteristics [81].

External validation, by contrast, assesses model performance on data collected from completely independent sources—different hospitals, geographical regions, patient populations, or processing protocols. This approach mimics real-world deployment scenarios and provides a realistic estimate of how the model will perform in diverse clinical settings [80] [81]. For foundation models in computational pathology, this distinction is particularly critical due to the multi-center nature of most large-scale clinical trials and the diversity of real-world healthcare institutions.

Quantitative Implications for Model Assessment

Table 1: Comparative Performance Metrics in Internal vs. External Validation

| Metric | Internal Validation | External Validation | Typical Performance Gap | Clinical Significance |
|---|---|---|---|---|
| Accuracy | 0.92-0.95 | 0.75-0.85 | 10-20% decrease | Impacts diagnostic reliability across sites |
| AUC | 0.94-0.98 | 0.70-0.89 | 0.05-0.15 point decrease | Affects screening utility in new populations |
| F1 Score | 0.89-0.93 | 0.65-0.80 | 15-25% decrease | Impacts balanced performance on class-imbalanced data |
| Cohen's Kappa | 0.85-0.90 | 0.60-0.75 | 0.15-0.25 point decrease | Measures agreement beyond chance across institutions |

The performance degradation observed in external validation, as illustrated in Table 1, stems from multiple sources including domain shift, population differences, and technical variations. Research by Cheng et al. demonstrated this phenomenon clearly in their work on predicting TFE3-RCC translocation from H&E-stained WSIs, where area under the curve (AUC) values decreased from 0.842-0.894 in internal testing to lower ranges when applied to external cohorts [82]. Similarly, studies on microsatellite instability (MSI) prediction from H&E images have shown performance drops when models trained on gastric cancer data are applied to colorectal cancer specimens [82].

Methodological Framework: Designing External Validation Studies

Cohort Selection and Eligibility Criteria

The foundation of meaningful external validation lies in the strategic selection of independent cohorts that represent plausible deployment scenarios. We recommend a multi-tiered approach to cohort selection that encompasses varying levels of expected distribution shift:

  • Technical Diversity: Include WSIs from different scanner manufacturers (e.g., Aperio, Hamamatsu, Philips) and models to assess robustness to imaging variations.
  • Procedural Diversity: Incorporate slides processed with different staining protocols (H&E staining variations) and preparation techniques.
  • Geographical Diversity: Source data from multiple geographical regions to capture population heterogeneity in disease presentation.
  • Temporal Diversity: Include historical cohorts to assess temporal generalizability, particularly important for longitudinal studies and drug development programs.

Eligibility criteria should be explicitly documented for both the training/internal validation cohorts and external validation cohorts, including detailed metadata such as patient demographics, sample processing protocols, scanning parameters, and clinical characteristics. This documentation enables meaningful analysis of performance degradation sources and informs model improvement strategies.

Statistical Considerations and Power Analysis

Adequate statistical power is essential for informative external validation studies. The sample size for external validation cohorts should be determined based on pre-specified precision targets for performance metrics (e.g., the width of confidence intervals for AUC values) rather than traditional power calculations for hypothesis testing. We recommend a minimum of 100-200 independent cases per major outcome category in the external cohort to obtain sufficiently precise performance estimates (confidence interval width ≤0.1 for AUC metrics).

For multi-class classification problems, the sample size should ensure adequate representation of each class, particularly for minority classes with clinical significance. In cases of extreme class imbalance, stratified sampling or bootstrap methods can be employed to obtain reliable performance estimates.
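
A percentile-bootstrap sketch for estimating the AUROC confidence-interval width on a candidate external cohort, assuming scikit-learn; the function name, resample count, and seed are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for AUROC; the CI width indicates whether the
    cohort meets a precision target such as width <= 0.1."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if np.unique(y_true[idx]).size < 2:    # resample must contain both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi, hi - lo                     # CI bounds and width
```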

Experimental Protocols for External Validation

Protocol 1: Cross-Institutional Validation for Model Generalizability

Purpose: To evaluate the performance stability of foundation models for WSI classification across multiple independent medical institutions.

Materials:

  • Pre-trained foundation model for weakly supervised WSI classification
  • WSIs from at least 3 external institutions not represented in training data
  • Computational resources for large-scale inference (GPU cluster recommended)
  • Standardized evaluation metrics pipeline

Procedure:

  • Cohort Curation: Assemble external validation cohorts from multiple institutions, ensuring balanced representation across relevant clinical and technical variables.
  • Preprocessing Harmonization: Apply identical preprocessing procedures to all external datasets, including color normalization and quality control checks.
  • Blinded Inference: Execute model predictions on all external cohorts without any model retraining or fine-tuning.
  • Performance Assessment: Calculate comprehensive performance metrics (AUC, accuracy, precision, recall, F1-score) for each institution separately and in pooled analysis. A per-institution sketch follows this list.
  • Variance Analysis: Quantify between-institution performance variability using random effects models or ANOVA.
  • Failure Analysis: Investigate cases of discordant predictions across sites to identify potential technical or biological causes.
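
A minimal sketch of the per-institution and pooled performance computation in step 4, assuming a pandas results table with illustrative columns institution, y_true, and y_score.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def site_level_auc(results: pd.DataFrame) -> pd.Series:
    """Per-institution and pooled AUROC from a table of slide-level predictions."""
    per_site = results.groupby("institution").apply(
        lambda g: roc_auc_score(g["y_true"], g["y_score"])
    )
    per_site.loc["pooled"] = roc_auc_score(results["y_true"], results["y_score"])
    return per_site
```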

Deliverables: Cross-institution performance comparison table, analysis of performance heterogeneity, and identification of potential domain shift factors.

Protocol 2: Stain Robustness Assessment

Purpose: To evaluate model resilience to variations in H&E staining protocols, a common source of performance degradation in computational pathology.

Materials:

  • Pre-trained foundation model
  • Paired WSIs with varying staining intensities/protocols
  • Computational resources for inference
  • Color normalization tools (optional)

Procedure:

  • Stain Variation Cohort: Curate or generate a dataset with systematic staining variations, including over-stained, under-stained, and differently tinted H&E slides.
  • Reference Standard: Establish ground truth diagnoses through pathology consensus review.
  • Model Inference: Execute predictions across the stain variation dataset.
  • Performance Correlation: Analyze the relationship between stain deviation metrics and prediction accuracy.
  • Intervention Assessment: Evaluate whether stain normalization techniques improve robustness.

Deliverables: Stain robustness profile, recommendations for stain normalization prerequisites, and acceptability thresholds for staining variations.

Protocol 3: Scanner-Invariant Performance Evaluation

Purpose: To assess model performance consistency across different digital slide scanner models and manufacturers.

Materials:

  • Pre-trained foundation model
  • WSIs of the same tissue specimens scanned on multiple scanner models
  • Computational resources for inference
  • Scanner metadata recording system

Procedure:

  • Multi-Scanner Dataset: Utilize existing or create new datasets where the same physical slides have been scanned on different scanner models.
  • Paired Analysis: For each tissue specimen, obtain predictions from WSIs generated by different scanners.
  • Consistency Metrics: Calculate within-specimen prediction variability across scanners.
  • Scanner-Specific Performance: Compare aggregate performance metrics across scanner types.
  • Characteristic Analysis: Identify scanner-specific artifacts that correlate with performance degradation.

Deliverables: Scanner-invariance assessment, identification of problematic scanner characteristics, and recommendations for scanner-agnostic training strategies.

Visualization of the External Validation Framework

[Diagram: External validation framework for WSI foundation models. Model development phase: single-institution training data → foundation model training → internal validation. External validation phase: external cohorts from a different institution, a different geography, and a different scanner feed into performance assessment and failure analysis → generalizability report → clinical deployment decision, with model iteration and improvement when requirements are not met.]

Implementation Workflow for Foundation Model Validation

Essential Research Reagents and Computational Tools

Table 2: Research Reagent Solutions for External Validation Studies

| Category | Item | Specifications | Application in External Validation |
|---|---|---|---|
| Reference Datasets | The Cancer Genome Atlas (TCGA) WSIs | >30,000 slides across 25+ cancer types | Provides diverse multi-institutional data for initial external validation |
| Challenging Test Sets | Camelyon17 | 1,000 WSIs from 5 centers with breast cancer lymph node sections | Tests generalizability across medical centers with same tissue type |
| Stain Variation Panels | H&E staining intensity calibration slides | Systematically varied staining protocols | Quantifies model robustness to technical variations in staining |
| Multi-Scanner Datasets | Paired slides scanned on multiple devices | Same tissue on 3+ scanner models | Assesses scanner-induced performance variability |
| Computational Tools | Stain normalization algorithms | Python implementations (OpenCV, scikit-image) | Preprocessing to reduce technical variation across sites |
| Performance Monitoring | Domain shift detection metrics | Maximum Mean Discrepancy, Classifier Confidence Drift | Early detection of performance degradation in external cohorts |
| Statistical Analysis | Mixed-effects model packages | R (lme4) or Python (statsmodels) | Quantifies institution-specific vs. population-level performance |
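
The performance-monitoring entry above cites Maximum Mean Discrepancy (MMD) for domain shift detection. A minimal NumPy sketch of a biased RBF-kernel MMD estimate between two feature sets; the bandwidth sigma is an illustrative choice.

```python
import numpy as np

def rbf_mmd2(X: np.ndarray, Y: np.ndarray, sigma: float = 1.0) -> float:
    """Biased estimate of squared MMD between feature sets X (n, d) and Y (m, d)
    under an RBF kernel; larger values suggest stronger domain shift."""
    def gram(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))
    return float(gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean())
```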

Analytical Framework for Validation Results Interpretation

Quantitative Performance Benchmarks

Establishing predefined performance benchmarks for external validation is essential for objective model assessment. We propose the following tiered system for interpreting external validation results:

  • Tier 1 (Clinically Ready): AUC reduction ≤0.05 in external validation with no significant performance degradation across subgroups. Models in this tier are candidates for immediate clinical implementation.
  • Tier 2 (Promising with Caveats): AUC reduction of 0.05-0.10 with manageable performance variation across sites. These models require ongoing monitoring and potentially targeted improvements.
  • Tier 3 (Research Grade): AUC reduction >0.10 or significant performance disparities across subgroups. These models require substantial refinement before clinical consideration.

Subgroup Performance Disparity Analysis

Beyond overall performance metrics, rigorous external validation must include comprehensive subgroup analysis to identify potential performance disparities. Key subgroups for analysis include:

  • Demographic subgroups (age, sex, race/ethnicity)
  • Disease severity subgroups
  • Technical subgroups (scanner type, staining protocol)
  • Specimen quality subgroups

Statistical tests for interaction should be employed to determine whether performance differences across subgroups are statistically significant, with particular attention to clinically relevant subgroups where performance disparities could exacerbate healthcare inequities.

Case Studies in External Validation for WSI Classification

Successful Implementation: CLAM with Multi-Cohort Validation

The CLAM (Clustering-constrained Attention Multiple-instance Learning) framework demonstrates a principled approach to external validation in weakly supervised WSI classification [78] [79]. In its development, researchers employed not only standard internal validation but also external cohorts from different institutions to verify the generalizability of their attention-based pooling approach for WSI classification. This rigorous validation strategy has contributed to its widespread adoption as a benchmark method in computational pathology research.

Domain-Specific Pretraining Impact: MoCo v2 in Pathology

Research on contrastive learning approaches like Momentum Contrast (MoCo v2) for pathology image analysis has highlighted the importance of domain-specific pretraining followed by external validation [78]. Models pretrained on natural images (e.g., ImageNet) then fine-tuned on pathology data showed significant performance degradation compared to models pretrained directly on pathology images from multiple institutions when evaluated on external cohorts. This case study underscores how training strategies fundamentally impact model generalizability.

Establishing a robust validation framework with comprehensive external cohort analysis is not merely an academic exercise—it is an essential component in the translational pathway for foundation models in computational pathology. The protocols and analytical frameworks presented in this application note provide researchers, scientists, and drug development professionals with standardized methodologies to rigorously assess model generalizability before deployment in clinical trials or healthcare settings.

We recommend the following best practices based on the current evidence and methodological considerations:

  • Proactive External Validation Planning: Integrate external validation considerations from the earliest stages of model development, including strategic partnerships with multiple institutions for diverse validation cohorts.

  • Comprehensive Metadata Collection: Systematically collect and document technical, demographic, and clinical metadata for all validation cohorts to enable nuanced analysis of performance heterogeneity.

  • Transparent Reporting: Clearly report external validation results separately from internal validation performance, including detailed descriptions of external cohort characteristics and any preprocessing adjustments.

  • Iterative Validation: Treat external validation as an iterative process rather than a one-time event, with continuous monitoring and reassessment as new data becomes available or clinical implementations expand to new settings.

By adopting this rigorous framework, the computational pathology community can accelerate the development of robust, generalizable foundation models that fulfill their promise to transform cancer diagnosis, prognosis, and therapeutic development.

In weakly supervised whole slide image (WSI) classification for computational pathology, selecting appropriate performance metrics is crucial for accurately evaluating foundation models. WSIs present unique computational challenges, as a standard gigapixel slide may comprise tens of thousands of image tiles, and analysis typically relies on weakly supervised multiple instance learning (MIL) approaches where only slide-level labels are available [11] [21]. In this context, the areas under the receiver operating characteristic curve (AUROC) and precision-recall curve (AUPRC), along with balanced accuracy, have emerged as fundamental metrics for evaluating model performance across diverse clinical tasks including cancer subtyping, biomarker prediction, and prognosis estimation [31] [44].

These metrics provide complementary insights into model behavior, particularly when dealing with imbalanced datasets common in clinical applications where positive cases may be rare [83]. For instance, in ovarian cancer subtyping, where high-grade serous carcinoma (HGSC) may represent 68% of samples while low-grade serous carcinoma (LGSC) comprises only 5%, traditional accuracy metrics can be misleading [44]. Foundation models like CONCH, Virchow2, and Prov-GigaPath have demonstrated state-of-the-art performance across multiple pathology tasks, with comprehensive benchmarking revealing AUROC values of 0.71-0.77 for morphological tasks, 0.72-0.73 for biomarker prediction, and approximately 0.61-0.63 for prognostic tasks [31].

Table 1: Performance of Leading Foundation Models Across Pathology Tasks

| Foundation Model | Morphology AUROC | Biomarker AUROC | Prognosis AUROC | Overall AUROC |
|---|---|---|---|---|
| CONCH | 0.77 | 0.73 | 0.63 | 0.71 |
| Virchow2 | 0.76 | 0.73 | 0.61 | 0.71 |
| Prov-GigaPath | - | 0.72 | - | 0.69 |
| DinoSSLPath | 0.76 | - | - | 0.69 |
| UNI | - | - | - | 0.68 |

Metric Definitions and Theoretical Foundations

Area Under the Receiver Operating Characteristic Curve (AUROC)

The AUROC represents the probability that a model will rank a randomly chosen positive instance higher than a randomly chosen negative instance [84]. This metric visualizes the trade-off between the true positive rate (TPR) and false positive rate (FPR) across all possible classification thresholds. The ROC curve plots TPR against FPR, with the area under this curve providing a single number that summarizes performance across all thresholds [84]. A key advantage of AUROC is that it is threshold-independent and provides a comprehensive view of model performance. In computational pathology applications, AUROC has been widely adopted for benchmarking foundation models, with recent studies reporting values ranging from 0.64 to 0.77 across different models and tasks [31].

Area Under the Precision-Recall Curve (AUPRC)

The AUPRC evaluates the trade-off between precision (positive predictive value) and recall (sensitivity) across different decision thresholds [84]. Unlike AUROC, AUPRC focuses specifically on the model's performance on the positive class, making it particularly valuable for imbalanced datasets where the positive class is rare. Precision-recall curves plot precision against recall for different probability thresholds, with the area under this curve providing a single metric that emphasizes correct identification of positive instances [83] [84]. In clinical applications where false positives carry significant consequences, such as cancer diagnosis, AUPRC provides crucial insights into model performance that complement AUROC.

Balanced Accuracy

Balanced accuracy is defined as the average of sensitivity and specificity, providing a metric that accounts for class imbalance by giving equal weight to both classes [44]. This contrasts with standard accuracy, which can be misleading when classes are imbalanced, as a model can achieve high accuracy by simply always predicting the majority class. For ovarian cancer subtyping, recent studies using foundation models with attention-based multiple instance learning have reported balanced accuracy values reaching 89-97% on internal hold-out test sets and 74% on external validation [44]. Balanced accuracy is particularly valuable in medical applications where both false positives and false negatives have clinical significance.

Comparative Analysis of Metric Performance Characteristics

AUROC and AUPRC are mathematically related through their shared probabilistic interpretations [83]. Specifically, AUROC weighs all false positives equally, while AUPRC weighs false positives with the inverse of the model's likelihood of outputting a score greater than the given threshold, a quantity referred to as the "firing rate" [83]. This distinction leads to different optimization behaviors: AUROC favors model improvements in an unbiased manner, while AUPRC prioritizes fixing high-score mistakes first [83].

Table 2: Key Characteristics and Applications of Performance Metrics

| Metric | Strengths | Limitations | Ideal Use Cases |
|---|---|---|---|
| AUROC | Threshold-independent; robust to class balance; intuitive interpretation | Less sensitive to performance improvements in imbalanced data; can be overly optimistic with class imbalance | General model comparison; when both classes are equally important; ranking predictions |
| AUPRC | Focuses on positive class; more informative with class imbalance; highlights precision-recall tradeoffs | Value depends on prevalence; difficult to compare across datasets with different prevalences | Imbalanced datasets; when positive class is of primary interest; information retrieval settings |
| Balanced Accuracy | Accounts for class imbalance; intuitive clinical interpretation; balanced view of performance | Threshold-dependent; does not provide full performance picture across thresholds | Clinical applications with balanced importance of sensitivity/specificity; multi-class problems |

In practical applications, the choice between AUROC and AUPRC should be guided by the specific clinical context and dataset characteristics. While a widespread adage suggests that AUPRC is superior for imbalanced datasets, recent analysis challenges this notion, demonstrating that AUPRC might inadvertently favor model improvements in subpopulations with more frequent positive labels, potentially heightening algorithmic disparities [83]. This has significant implications for medical domains featuring imbalanced classification problems with diverse patient populations.

Experimental Protocols for Metric Evaluation

Benchmarking Foundation Models with Multiple Metrics

Comprehensive evaluation of foundation models for computational pathology requires rigorous benchmarking across multiple datasets and clinical tasks. A standardized protocol should include:

Dataset Curation: Collect multi-centric WSI datasets representing diverse patient populations and cancer types. Recent benchmarks have utilized datasets comprising 6,818 patients and 9,528 slides from lung, colorectal, gastric, and breast cancers [31]. For ovarian cancer subtyping, datasets should include the five major subtypes (HGSC, LGSC, CCC, EC, MC) with appropriate representation of rare subtypes [44].

Model Selection: Include diverse foundation models representing different architectural approaches and training methodologies. Recent benchmarks have evaluated 19 foundation models including CONCH (vision-language), Virchow2 (vision-only), Prov-GigaPath (whole-slide modeling), UNI, and others [31].

Evaluation Framework: Implement standardized preprocessing, feature extraction, and multiple instance learning aggregation. The CLAM (Clustering-constrained Attention Multiple-instance Learning) framework provides a validated approach for weakly supervised WSI classification [11]. Calculate AUROC, AUPRC, and balanced accuracy across all tasks with statistical significance testing.

[Diagram: Whole slide images → tissue segmentation and patching → feature extraction (foundation model) → multiple instance learning aggregation → slide-level predictions → performance evaluation (AUROC, AUPRC, balanced accuracy).]

Diagram Title: WSI Classification Workflow

Protocol for Low-Prevalence Scenario Evaluation

To assess metric performance in realistic clinical scenarios with rare positive cases, implement the following protocol:

Data Sampling: Create training cohorts of varying sizes (e.g., 75, 150, and 300 patients) while maintaining similar ratios of positive samples [31]. This evaluates performance in data-scarce settings common for rare biomarkers or conditions.

Task Selection: Include clinically relevant tasks with rare positive cases (<15% prevalence), such as BRAF mutation (10%) or CpG island methylator phenotype (CIMP) status (13%) [31].

Metric Calculation: Compute all three metrics (AUROC, AUPRC, balanced accuracy) for each model and task combination. Use cross-validation or external validation to ensure robustness.

Statistical Analysis: Perform pairwise statistical comparisons between models for each metric. Recent benchmarks have used significance testing with P<0.05 to identify meaningful differences [31].

Implementation Framework for Metric Calculation

Computational Implementation

Calculation of AUROC, AUPRC, and balanced accuracy can be implemented using standard machine learning libraries:
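
For example, a minimal scikit-learn sketch (the labels, scores, and 0.5 threshold are illustrative):

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             balanced_accuracy_score)

y_true = np.array([0, 0, 1, 1, 0, 1])                  # slide-level ground truth
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])    # predicted probabilities

auroc = roc_auc_score(y_true, y_score)                 # threshold-independent
auprc = average_precision_score(y_true, y_score)       # average precision ~ AUPRC
y_pred = (y_score >= 0.5).astype(int)                  # balanced accuracy needs a threshold
bal_acc = balanced_accuracy_score(y_true, y_pred)      # mean of sensitivity and specificity
```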

For comprehensive evaluation, generate both ROC and precision-recall curves across the full range of thresholds to visualize model performance characteristics [84].

Research Reagent Solutions

Table 3: Essential Research Tools for WSI Classification Evaluation

| Research Tool | Function | Implementation Examples |
|---|---|---|
| Foundation Models | Feature extraction from pathology images | CONCH, Virchow2, Prov-GigaPath, UNI, DinoSSLPath [31] [21] |
| MIL Aggregators | Slide-level prediction from tile features | CLAM, ABMIL, TransMIL, K-TOP aggregator [11] [85] |
| Interpretability Tools | Spatial interpretation of model decisions | WEEP, INSIGHT, attention mechanisms [86] [87] |
| Evaluation Frameworks | Standardized benchmarking | Custom benchmarking pipelines, cross-validation schemes [31] [44] |

Interpreting Metric Results in Clinical Context

Comparative Model Performance

When evaluating foundation models for computational pathology, comprehensive metric analysis reveals complementary strengths. Recent benchmarks show that CONCH achieved the highest mean AUROC (0.77) for morphology-related tasks, while Virchow2 and CONCH jointly led in biomarker-related tasks (AUROC 0.73) [31]. For prognosis-related tasks, CONCH achieved the highest mean AUROC (0.63) [31]. Interestingly, in low-data scenarios (n=75 patients), performance distribution became more balanced, with different models excelling in different tasks [31].

Balanced accuracy provides crucial complementary information, particularly for multi-class problems like ovarian cancer subtyping. Recent studies report balanced accuracy values of 89-97% for internal validation and 74% for external validation when using H-optimus-0 foundation models with attention-based MIL [44]. This demonstrates the importance of considering multiple metrics for comprehensive model assessment.

Metric Selection Guidelines

Based on empirical evidence and theoretical considerations, the following guidelines support appropriate metric selection:

  • Use AUROC when seeking a general overview of model performance across all thresholds, when both classes are equally important, and for comparing models across datasets with similar characteristics [84].

  • Prioritize AUPRC when working with highly imbalanced datasets where the positive class is of primary interest, when false positives have significant consequences, and in information retrieval settings where precision is critical [83] [84].

  • Incorporate Balanced Accuracy for clinical applications where both sensitivity and specificity are equally important, for multi-class problems, and when communicating results to clinical stakeholders who may find it more intuitive [44].

  • Report Multiple Metrics comprehensively, as each provides different insights into model behavior. Leading benchmarking studies consistently report AUROC, AUPRC, and balanced accuracy to provide a complete performance picture [31] [44].

[Diagram: Metric selection decision flow. Start by evaluating class balance: balanced data → report AUROC (general performance, ranking capability); imbalanced data → report AUPRC (focus on positive class). Then identify the primary class of interest and consider the clinical context: clinical settings and multi-class problems → report balanced accuracy (clinical interpretability); for comprehensive evaluation → report multiple metrics.]

Diagram Title: Metric Selection Decision Framework

AUROC, AUPRC, and balanced accuracy provide complementary insights for evaluating foundation models in weakly supervised WSI classification. AUROC offers a comprehensive view of model performance across all thresholds, AUPRC emphasizes correct identification of positive instances in imbalanced datasets, and balanced accuracy provides an intuitive measure of clinical utility. Empirical benchmarking demonstrates that leading foundation models like CONCH and Virchow2 achieve AUROC values of 0.71-0.77 across diverse pathology tasks, with balanced accuracy reaching 89-97% in specialized applications like ovarian cancer subtyping [31] [44].

Researchers should select metrics based on dataset characteristics, clinical requirements, and specific application needs, while routinely reporting multiple metrics to provide comprehensive performance assessment. As foundation models continue to evolve in computational pathology, rigorous evaluation using these complementary metrics will be essential for translating algorithmic advances into clinically useful tools.

The integration of artificial intelligence (AI) in computational pathology represents a paradigm shift in diagnostic medicine and biomarker discovery. Foundation models, pre-trained on vast datasets of histopathological whole slide images (WSIs) through self-supervised learning (SSL), have emerged as powerful tools for extracting clinically relevant information from gigapixel images [31] [10]. These models serve as versatile feature extractors, enabling the development of specialized downstream models for tasks such as cancer subtyping, mutation prediction, and prognosis estimation with significantly reduced requirements for labeled data [31]. This application note synthesizes findings from large-scale benchmarking studies across 31 clinical tasks to provide researchers and drug development professionals with evidence-based guidelines for model selection, implementation, and validation in weakly supervised computational pathology workflows.

Benchmarking Results: Performance Across Clinical Tasks

Comprehensive benchmarking of 19 foundation models on 31 clinically relevant tasks has revealed critical insights into model performance characteristics across different prediction domains. The evaluation encompassed tasks related to morphology (n=5), biomarkers (n=19), and prognostication (n=7) using data from 6,818 patients and 9,528 slides across lung, colorectal, gastric, and breast cancers [31].

Table 1: Top-Performing Foundation Models Across Task Categories

| Model | Model Type | Morphology AUROC | Biomarker AUROC | Prognostication AUROC | Overall AUROC |
|---|---|---|---|---|---|
| CONCH | Vision-Language | 0.77 | 0.73 | 0.63 | 0.71 |
| Virchow2 | Vision-Only | 0.76 | 0.73 | 0.61 | 0.71 |
| Prov-GigaPath | Vision-Only | 0.69 | 0.72 | 0.61 | 0.69 |
| DinoSSLPath | Vision-Only | 0.76 | 0.68 | 0.60 | 0.69 |

Table 2: Performance in Low-Data Scenarios (Number of Tasks Where Model Ranked First)

| Model | 300 Patients | 150 Patients | 75 Patients |
|---|---|---|---|
| Virchow2 | 8 | 6 | 4 |
| PRISM | 7 | 9 | 4 |
| CONCH | 5 | 4 | 5 |

The vision-language model CONCH demonstrated superior overall performance, achieving the highest mean area under the receiver operating characteristic curve (AUROC) of 0.77 for morphology tasks and 0.63 for prognostication tasks, while tying with Virchow2 (a vision-only model) on biomarker tasks with an AUROC of 0.73 [31]. Interestingly, in low-data scenarios with only 75 patients for downstream training, performance remained relatively stable compared to medium-sized cohorts (150 patients), highlighting the potential of foundation models to accelerate research in rare diseases and low-prevalence clinical scenarios [31].

Experimental Protocols for Weakly Supervised WSI Classification

Feature Extraction and Aggregation Protocol

The standard workflow for weakly supervised WSI classification using foundation models involves sequential feature extraction and aggregation steps:

  • WSI Preprocessing and Tiling: Convert gigapixel WSIs into smaller, manageable patches. Standard protocols involve dividing each WSI into non-overlapping patches of 512×512 pixels at 20× magnification, generating hundreds to thousands of patches per slide [10].

  • Feature Extraction: Process each image patch through a pre-trained foundation model to generate embeddings. Most models produce 768-dimensional feature vectors per patch, capturing morphological patterns in tissue organization and cellular structure [31] [10].

  • Feature Aggregation: Compile patch-level embeddings into slide-level representations using multiple instance learning (MIL) frameworks. Transformer-based aggregation slightly outperformed attention-based MIL (ABMIL) with an average AUROC difference of 0.01 across all tasks [31]. A minimal ABMIL-style aggregator is sketched after this list.

  • Downstream Model Training: Train task-specific classifiers on the aggregated slide-level representations using weak labels. For optimal performance in low-data regimes, leverage the entire dataset for evaluation while training on limited samples to assess model robustness [31].
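
A minimal PyTorch sketch of a gated attention-based MIL aggregator in the style of ABMIL; the hidden width, class count, and default embedding dimension are illustrative.

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Gated attention MIL: patch embeddings (N, D) -> attention-weighted
    slide embedding -> slide-level logits, plus per-patch attention weights."""
    def __init__(self, dim: int = 768, hidden: int = 256, n_classes: int = 2):
        super().__init__()
        self.V = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.U = nn.Sequential(nn.Linear(dim, hidden), nn.Sigmoid())
        self.w = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, patch_feats: torch.Tensor):        # (N, D) for one slide
        a = self.w(self.V(patch_feats) * self.U(patch_feats))  # (N, 1) gated scores
        a = torch.softmax(a, dim=0)                      # attention over patches
        slide_feat = (a * patch_feats).sum(dim=0)        # (D,) slide embedding
        return self.classifier(slide_feat), a.squeeze(-1)
```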

Benchmarking and Validation Protocol

Rigorous benchmarking requires standardized evaluation across multiple dimensions:

  • Dataset Curation: Assemble multi-institutional datasets with external validation cohorts to prevent data leakage and ensure generalizability. The referenced benchmark utilized data from 6,818 patients across 13 cohorts, with strict separation between pretraining and evaluation data [31].

  • Task Selection: Include clinically relevant endpoints across diagnostic, predictive, and prognostic domains. The benchmark incorporated 31 tasks including morphological classification (tumor subtyping), biomarker prediction (microsatellite instability, BRAF mutations), and survival analysis [31].

  • Performance Metrics: Employ multiple complementary metrics including AUROC, area under the precision-recall curve (AUPRC), balanced accuracy, and F1 scores, with particular emphasis on AUPRC for imbalanced datasets [31].

[Diagram: Gigapixel WSI → tiling and preprocessing (512×512 patches at 20×) → feature extraction (foundation model) → feature aggregation (transformer or ABMIL) → downstream task-specific classifier trained with weak slide-level labels → clinical prediction (diagnosis, biomarker, prognosis).]

WSI Classification Workflow

Research Reagent Solutions: Essential Materials for Implementation

Table 3: Essential Research Reagents for Foundation Model Implementation

| Resource Category | Specific Examples | Function & Application |
|---|---|---|
| Foundation Models | CONCH, Virchow2, Prov-GigaPath, DinoSSLPath | Pre-trained feature extractors for histopathology images; encode morphological patterns into transferable representations [31] |
| Computational Frameworks | HIA (Histopathology Image Analysis) | Reusable benchmarking platform implementing multiple weakly supervised learning approaches; supports classical and MIL workflows [13] |
| Dataset Resources | TCGA (The Cancer Genome Atlas), Mass-340K, proprietary cohorts | Curated WSI collections with clinical annotations; essential for model validation and external testing [13] [31] |
| Feature Aggregation Methods | Transformer-based aggregation, ABMIL, spatial averaging | Algorithms for compiling patch-level features into slide-level representations; critical for weakly supervised learning [13] [31] |

Implementation Considerations and Future Directions

Model Selection Guidelines

Based on comprehensive benchmarking, researchers should consider the following model selection criteria:

  • For general-purpose applications: CONCH or Virchow2 provide the most consistent performance across diverse task types [31].
  • For low-data scenarios: Virchow2 and PRISM demonstrate superior performance with limited training samples (75-150 patients) [31].
  • For biomarker prediction: Vision-language models (CONCH) and large vision-only models (Virchow2) achieve the highest AUROCs (0.73 average) [31].
  • For computational efficiency: Consider model architecture and feature dimensionality, as processing thousands of patches per slide imposes significant computational demands.

Emerging Paradigms and Ethical Considerations

The field of computational pathology is rapidly evolving with several emerging trends. Multimodal foundation models like TITAN (Transformer-based pathology Image and Text Alignment Network) demonstrate capabilities for cross-modal retrieval and pathology report generation, pretrained on 335,645 WSIs with corresponding pathology reports [10]. Simultaneously, novel approaches such as the Pathology Expertise Acquisition Network (PEAN) leverage eye-tracking data to capture pathologists' visual attention patterns, reducing annotation time to 4% of manual annotation while achieving an AUROC of 0.992 on diagnostic prediction [88].

The integration of foundation models into clinical practice must be framed as a "social experiment" requiring ethical oversight [89]. Implementation should follow incremental approaches with continuous monitoring, risk containment strategies, and flexibility to adapt or discontinue use if unforeseen risks emerge [89].

[Diagram: Model benchmarking (19 models, 31 tasks) informs both data regime analysis (75-300 patients) and clinical task performance (morphology, biomarker, prognosis); these feed model selection based on task and data, followed by clinical implementation within an ethical framework and continuous performance monitoring, with a feedback loop from monitoring back to implementation.]

Benchmark to Implementation Pathway

Foundation models are transforming computational pathology by providing powerful, adaptable base models for various diagnostic and prognostic tasks. A key development in this domain is the emergence of vision-language models (VLMs), which learn from paired image-text data, alongside more traditional vision-only models (VMs) trained exclusively on histology images. Understanding the performance characteristics, strengths, and weaknesses of these two paradigms is crucial for researchers and drug development professionals selecting the optimal approach for weakly supervised Whole Slide Image (WSI) classification. This analysis synthesizes recent benchmarking evidence to guide model selection and application in digital pathology workflows, framing the discussion within the broader thesis of using foundation models for weakly supervised learning.

Performance Benchmarking Across Clinical Tasks

Comprehensive benchmarking of 19 foundation models on 31 clinically relevant tasks across 6,818 patients reveals distinct performance patterns between leading vision and vision-language models. The evaluation encompassed tasks related to morphology (5 tasks), biomarkers (19 tasks), and prognostication (7 tasks) using multiple patient cohorts from lung, colorectal, gastric, and breast cancers [31].

Table 1: Overall Performance of Top Foundation Models Across Task Types

| Model | Model Type | Mean Morphology AUROC | Mean Biomarker AUROC | Mean Prognostication AUROC | Overall Mean AUROC |
|---|---|---|---|---|---|
| CONCH | Vision-Language | 0.77 | 0.73 | 0.63 | 0.71 |
| Virchow2 | Vision-Only | 0.76 | 0.73 | 0.61 | 0.71 |
| Prov-GigaPath | Vision-Only | 0.69* | 0.72 | 0.61* | 0.69 |
| DinoSSLPath | Vision-Only | 0.76 | 0.68* | 0.60* | 0.69 |

Note: Values marked with * are estimated from available data in the benchmarking study [31].

The vision-language model CONCH demonstrates strong overall performance, achieving the highest mean AUROC in morphological and prognostic tasks, while matching the top performance in biomarker prediction [31]. This indicates that incorporating language priors during pretraining can enhance a model's ability to capture clinically relevant histopathological patterns.

Performance in Low-Data Scenarios

A critical consideration for real-world clinical applications is model performance when labeled training data is scarce, particularly for rare molecular events or conditions.

Table 2: Performance in Data-Scarce Settings (Number of Tasks Where Model Ranked First)

| Model | Model Type | n=300 Patients | n=150 Patients | n=75 Patients |
|---|---|---|---|---|
| Virchow2 | Vision-Only | 8 | 6 | 4 |
| PRISM | Vision-Language | 7 | 9 | 4 |
| CONCH | Vision-Language | 5* | 4* | 5 |

Note: Values marked with * are estimated from available data [31].

In low-data scenarios, vision-only models like Virchow2 maintain strong performance, suggesting robust feature learning from large-scale image-only pretraining. However, the vision-language model PRISM shows particular advantage in medium-sized cohorts (n=150), potentially due to beneficial language-guided regularization [31].

Experimental Protocols for Weakly Supervised WSI Classification

Standardized Benchmarking Protocol

To ensure reproducible evaluation of foundation models for weakly supervised WSI classification, researchers should adhere to the following protocol:

A. Data Preparation and Preprocessing

  • Whole Slide Image Tiling: Segment WSIs into non-overlapping patches at 20× magnification (typically 256×256 or 512×512 pixels) using automated tissue segmentation to exclude background regions [31] [11].
  • Feature Extraction: Process each image patch through a foundation model to extract feature embeddings (e.g., 768-dimensional vectors from CONCH) [27] [10].
  • Feature Grid Construction: Spatially arrange patch features in a 2D grid replicating their original positions in the WSI to preserve spatial context [10].

B. Weakly Supervised Learning Setup

  • Multiple Instance Learning (MIL) Formulation: Treat each WSI as a "bag" containing multiple instances (patches), with only slide-level labels available [11].
  • Attention-Based Aggregation: Implement an attention mechanism to weight the importance of different patches for slide-level prediction, enabling interpretable heatmaps [11].
  • Transformer-Based Modeling: Utilize transformer architectures to model relationships between patches, capturing both local and global contextual information [31] [24].

C. Evaluation Framework

  • Task Diversity: Benchmark across multiple clinical domains including cancer subtyping, biomarker prediction, and prognostic outcome prediction [31].
  • Data Splitting: Employ rigorous train/validation/test splits with external validation cohorts to prevent data leakage and ensure generalizability [31].
  • Performance Metrics: Report AUROC (Area Under Receiver Operating Characteristic) and AUPRC (Area Under Precision-Recall Curve), with particular emphasis on AUPRC for imbalanced tasks [31].

Advanced Multimodal Integration Protocol

For vision-language models, specialized protocols leverage their dual-modality capabilities:

A. Zero-Shot Classification Protocol

  • Prompt Engineering: Create text prompts for each class of interest (e.g., "invasive ductal carcinoma of the breast") using clinical terminology [27].
  • Text Encoding: Process prompts through the model's text encoder to generate class-specific text embeddings [27].
  • Similarity Calculation: For each image patch, compute cosine similarity between image embeddings and all class text embeddings [27].
  • Slide-Level Aggregation: Aggregate patch-level similarity scores using methods like MI-Zero to generate slide-level predictions without task-specific training [27]. A minimal sketch follows this list.
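
A minimal PyTorch sketch of the similarity and aggregation steps, assuming patch and class-prompt embeddings have already been produced by the model's image and text encoders; the top-k pooling size is an illustrative stand-in for MI-Zero-style aggregation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_slide_prediction(patch_embeds: torch.Tensor,
                               class_text_embeds: torch.Tensor,
                               top_k: int = 50):
    """Cosine similarity between patch and class-prompt embeddings, pooled
    over the top-k most similar patches per class.

    patch_embeds:      (N, D) image embeddings for one WSI
    class_text_embeds: (C, D) text embeddings, one per class prompt
    """
    img = F.normalize(patch_embeds, dim=1)
    txt = F.normalize(class_text_embeds, dim=1)
    sims = img @ txt.t()                                 # (N, C) patch-level similarities
    k = min(top_k, sims.size(0))
    slide_scores = sims.topk(k, dim=0).values.mean(dim=0)  # (C,) pooled class scores
    return int(slide_scores.argmax()), slide_scores
```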

B. Cross-Modal Alignment Protocol

  • Report-Based Pretraining: Align WSI representations with corresponding pathology reports using contrastive learning [10].
  • Synthetic Caption Generation: Generate fine-grained morphological descriptions using multimodal generative AI copilots (e.g., PathChat) for ROI-level alignment [10].
  • Multimodal Fusion: Implement cross-attention mechanisms between visual and textual representations to enable bidirectional vision-language understanding [90].

Visualization of Foundation Model Workflows

Benchmarking Workflow for Foundation Model Evaluation

[Diagram: Benchmarking workflow. Input data (WSIs tiled into patches) → feature extraction through vision-only and vision-language foundation models → weakly supervised MIL aggregation → performance metrics → comparative analysis.]

Vision-Language Model Architecture

(Architecture diagram) WSI input → patch embeddings → image encoder; text input → text embeddings → text encoder. The two encoders feed two pretraining objectives: multimodal fusion and contrastive learning. Contrastive learning yields a joint embedding space that enables zero-shot classification and cross-modal retrieval, while multimodal fusion also supports zero-shot classification.

The Scientist's Toolkit: Essential Research Reagents

Table 3: Foundation Models and Computational Tools for Weakly Supervised WSI Analysis

| Resource | Type | Primary Function | Key Features |
|---|---|---|---|
| CONCH | Vision-Language Model | Whole-slide classification and cross-modal retrieval | Trained on 1.17M image-caption pairs; excels in morphology tasks [31] [27] |
| Virchow2 | Vision-Only Model | Feature extraction for downstream pathology tasks | Trained on 3.1M WSIs; strong performance in biomarker prediction [31] |
| Prov-GigaPath | Vision-Only Model | Whole-slide representation learning | LongNet architecture for long-sequence modeling; trained on 1.3B image tiles [24] |
| TITAN | Multimodal Whole-Slide Model | Slide-level representation and report generation | Aligns WSIs with pathology reports; enables zero-shot classification [10] |
| CLAM | MIL Framework | Weakly supervised WSI classification | Attention-based learning; generates interpretable heatmaps [11] |
| AAMM | Multimodal Framework | Abnormal region detection and classification | Integrates patch, cell, and text features; reduces computation via informative patch selection [91] |

The comparative analysis reveals that both vision and vision-language foundation models offer distinct advantages for weakly supervised WSI classification. Vision-language models like CONCH demonstrate superior performance in morphological analysis and zero-shot capabilities, while vision-only models like Virchow2 maintain strong performance in biomarker prediction, particularly in low-data scenarios. The emerging paradigm of whole-slide foundation models such as TITAN and Prov-GigaPath represents a significant advancement by directly modeling slide-level context. For researchers and drug development professionals, selection between these approaches should be guided by specific application requirements: VLMs for tasks requiring language understanding and zero-shot adaptability, VMs for traditional biomarker prediction, and whole-slide models for applications demanding comprehensive slide-level context. Future developments will likely focus on integrating the strengths of both paradigms while improving computational efficiency and accessibility.

The following tables (Tables 4-6) summarize key quantitative results from recent clinical validation studies of AI-based diagnostic and prognostic tasks on Whole Slide Images (WSIs).

Table 4: Diagnostic and Molecular Classification Performance in Glioma

| Task | Dataset | Metric | Performance | Citation |
|---|---|---|---|---|
| 5-class Glioma Subtyping | Internal Validation | Overall Accuracy | 0.79 | [92] |
| 5-class Glioma Subtyping | Multi-center Testing | Overall Accuracy | 0.73 | [92] |
| IDH Mutation Prediction | TCGA | AUC | 0.9488 | [93] |
| Astrocytoma Grading | TCGA | AUC | 0.9419 | [93] |
| Astrocytoma Grading | West China Hospital (External) | AUC | 0.9048 | [93] |

Table 5: Prognostic Prediction Performance in Bladder Cancer

| Model | Cohort | Metric | Performance | Citation |
|---|---|---|---|---|
| MibcMLP (prognostic) | Internal Validation | C-index | 0.631 | [94] |
| MibcMLP (prognostic) | External Validation | C-index | 0.622 | [94] |
| UniVisionNet (prognostic) | Training Cohort (CMUFH) | C-index | 0.853 | [95] |
| UniVisionNet (prognostic) | External Validation (TCGA) | C-index | 0.661 | [95] |
| BlaPaSeg (tissue segmentation) | Multiple Cohorts | AUC | 0.9906-0.9945 | [95] |

Table 6: Metastasis Detection Performance in Lymph Nodes

| Factor / Result | VMD Tool | DPL Tool | Citation |
|---|---|---|---|
| Macrometastases | Excellent sensitivity across multiple tumor types | Excellent sensitivity across multiple tumor types | [96] |
| Micrometastases & isolated tumor cells | Good sensitivity | Slightly higher sensitivity, particularly in lung cancer and melanoma | [96] |
| False-positive alert rates | Substantial; generally higher, especially in lung/breast cancer | Substantial, but lower than VMD | [96] |

Detailed Experimental Protocols

Weakly Supervised Subtyping of Gliomas

Objective: To develop a weakly supervised deep learning model for classifying glioma subtypes and predicting molecular markers using only slide-level labels.

Materials:

  • WSIs: 733 H&E-stained WSIs from four medical centers for model development and multi-center testing [92].
  • Computational Framework: A subtask-guided multi-instance learning (MIL) image-to-label training pipeline [92].

Procedure:

  • WSI Preprocessing: Digitize glass slides into WSIs using a slide scanner at 20× magnification. Tile each WSI into small patches (e.g., 512×512 pixels) using tools like the OpenSlide DeepZoom module [92] [93] (a minimal tiling sketch follows this procedure).
  • Color Normalization: Apply a standard color normalization method, such as the Reinhard method, to account for variations in H&E staining across different centers [93].
  • Tissue Detection & Patch Selection: Use a tissue detection algorithm (e.g., saturation thresholding) to filter out background and non-informative regions [44]. Employ "patch prompting" by selecting only 1% of tiled image patches classified as tumor for model input, which reduces computational cost and regularizes training [92].
  • Feature Extraction: Use a pre-trained convolutional neural network (e.g., ResNet-50) to extract feature vectors from each selected patch [93].
  • Slide-Level Classification with MIL: Train an attention-based multiple instance learning (ABMIL) classifier. The model learns to aggregate patch-level features into a slide-level representation and assign a diagnostic label, focusing attention on morphologically critical patches [92] [93].
  • Validation: Perform hold-out validation and external validation on multi-center datasets to assess model generalizability [92].
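
A minimal sketch of the tiling and tissue-filtering steps using OpenSlide's DeepZoom module; the saturation threshold and level selection below are illustrative assumptions, not the study's exact settings:

```python
# Minimal sketch: WSI tiling with OpenSlide DeepZoom plus a simple
# saturation-threshold tissue filter to drop background tiles.
import numpy as np
from openslide import OpenSlide
from openslide.deepzoom import DeepZoomGenerator

def tissue_tiles(wsi_path, tile_size=512, sat_threshold=15):
    slide = OpenSlide(wsi_path)
    dz = DeepZoomGenerator(slide, tile_size=tile_size, overlap=0,
                           limit_bounds=True)
    level = dz.level_count - 1                   # highest-resolution level
    cols, rows = dz.level_tiles[level]
    for col in range(cols):
        for row in range(rows):
            tile = dz.get_tile(level, (col, row)).convert("RGB")
            sat = np.asarray(tile.convert("HSV"))[:, :, 1]
            if sat.mean() > sat_threshold:       # keep tissue, drop background
                yield (col, row), tile
```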

Survival Risk Stratification in Bladder Cancer

Objective: To create an end-to-end deep learning system for predicting overall survival (OS) risk in bladder cancer patients from WSIs.

Materials:

  • WSIs and Data: A multicenter cohort of 1108 BCa patients with H&E-stained WSIs and corresponding clinical follow-up data [95].

Procedure:

  • Tile-Level Tissue Segmentation: Process WSIs with a pre-trained tile classification network (BlaPaSeg, based on ResNeXt50) to generate multi-class tissue probability heatmaps and segmentation maps [95].
  • Macro-Level Prognostic Network (MacroVisionNet): Train a network that analyzes the broad tissue distribution patterns from the probability heatmaps to predict survival risk [95].
  • Integrated Prognostic Network (UniVisionNet): Develop a second network that integrates the macro-level tissue distribution features from MacroVisionNet with micro-level tumor patch features extracted via a self-supervised learning network. This captures both global tissue architecture and local cellular morphology [95].
  • Risk Stratification: Determine optimal risk-score cutoffs for MacroVisionNet and UniVisionNet using maximally selected rank statistics on the training cohort (see the sketch after this procedure). Stratify patients into high-risk and low-risk groups for survival analysis [95].
  • Validation and Biomarker Exploration: Validate the prognostic system on internal and external validation cohorts. Perform multivariable Cox regression analysis, adjusting for clinical covariates like age, T stage, and N stage, to confirm the model's independent prognostic value [95].
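
A minimal sketch of the cutoff search behind maximally selected rank statistics: scan candidate risk-score cutoffs and keep the one maximizing the log-rank statistic on the training cohort. The lifelines library and the array names are assumptions for illustration:

```python
# Minimal sketch: maximally selected rank statistic for risk stratification.
# risk, time, event are 1D numpy arrays (risk scores, follow-up times,
# event indicators) for the training cohort.
import numpy as np
from lifelines.statistics import logrank_test

def optimal_cutoff(risk, time, event, quantiles=np.linspace(0.1, 0.9, 33)):
    best_cut, best_stat = None, -np.inf
    for cut in np.quantile(risk, quantiles):     # candidate cutoffs
        high = risk > cut
        if high.sum() == 0 or (~high).sum() == 0:
            continue                             # both groups must be non-empty
        res = logrank_test(time[high], time[~high],
                           event_observed_A=event[high],
                           event_observed_B=event[~high])
        if res.test_statistic > best_stat:
            best_cut, best_stat = cut, res.test_statistic
    return best_cut

# high_risk = risk_scores > optimal_cutoff(risk_scores, os_time, os_event)
```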

Lymph Node Metastasis Detection

Objective: To evaluate the performance of CE-certified AI tools for detecting lymph node metastases across multiple tumor types, both within and beyond their intended use.

Materials:

  • AI Tools: Visiopharm Metastasis Detection App (VMD) and DeepPath LYDIA (DPL) [96].
  • Dataset: 455 retrospective patient cases with LN metastases from six tumor types (e.g., melanoma, colorectal, head and neck, lung, vulvar, and breast cancer), including both sentinel and non-sentinel LNs. An additional 1012 tumor-negative slides are used to assess false-positive rates [96].

Procedure:

  • Reference Standard Establishment: Have expert pathologists review the WSIs and establish the ground truth for metastasis presence and size (macrometastases, micrometastases, isolated tumor cells) according to standard clinical practice [96].
  • AI Tool Execution: Process all WSIs through both AI tools (VMD and DPL) using their standard operating procedures.
  • Performance Assessment (a minimal computation sketch follows this procedure):
    • Calculate sensitivity per patient case and stratify results based on metastasis size.
    • Assess the false-positive alert (FPA) rate by running the tools on the 1012 tumor-negative slides.
  • Analysis: Compare the performance of the two tools across the different tumor types, focusing on their performance within their officially intended uses and their generalizability to other cancers [96].
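
A minimal pandas sketch of the two assessment computations; the column names and toy values are illustrative assumptions about how such results might be tabulated:

```python
# Minimal sketch: per-case sensitivity by metastasis size class and the
# false-positive alert (FPA) rate on tumor-negative slides.
import pandas as pd

cases = pd.DataFrame({
    "size_class": ["macro", "macro", "micro", "itc", "micro"],
    "detected":   [True,    True,    False,   True,  True],  # AI flagged case
})
negatives = pd.Series([False, True, False, False])  # alerts on negative slides

sensitivity = cases.groupby("size_class")["detected"].mean()
fpa_rate = negatives.mean()
print(sensitivity)                  # sensitivity per metastasis size class
print(f"FPA rate: {fpa_rate:.1%}")  # fraction of negative slides with alerts
```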

Workflow and Pathway Diagrams

Weakly Supervised WSI Classification Workflow

(Workflow diagram) Input: Whole Slide Image (WSI) → Tissue Detection & Tiling → Color Normalization → Feature Extraction (pre-trained CNN or foundation model) → Multi-Instance Learning (MIL) → Slide-Level Prediction.

Integrated Prognostic System for Bladder Cancer

(Workflow diagram) Input: Bladder Cancer WSI → BlaPaSeg Network → Tissue Probability Heatmaps → MacroVisionNet (analyzes tissue distribution). The heatmaps, MacroVisionNet's output, and microscopic patch features all feed UniVisionNet (integrates macro- and micro-level features) → Output: Survival Risk Score.

The Scientist's Toolkit: Key Research Reagents & Materials

Table 7: Essential Components for WSI Analysis Pipelines

| Item / Category | Specific Examples / Functions | Application in Featured Studies |
|---|---|---|
| Tissue samples & staining | Formalin-Fixed Paraffin-Embedded (FFPE) tissue blocks; Hematoxylin and Eosin (H&E) staining | Standard sample preparation for creating diagnostic WSIs across all cancer types [92] [94] [96] |
| Slide digitization | High-throughput slide scanners (e.g., Leica Aperio AT2, Hamamatsu) | WSIs digitized at 20× or 40× magnification for computational analysis [92] [44] |
| Foundation models & feature extractors | CONCH, UNI, H-optimus-0; self-supervised learning on large histopathology datasets | Powerful patch feature extractors providing versatile, transferable representations for downstream tasks [44] [10] |
| Core AI architectures | Multiple Instance Learning (MIL), attention mechanisms, Vision Transformers (ViTs) | Enable slide-level prediction from patch-level features without dense annotations; core to all featured protocols [92] [10] [93] |
| Computational libraries & frameworks | PyTorch/TensorFlow, OpenSlide, CLAM, TIAToolbox | WSI handling, model development, and implementation of specific algorithms like CLAM [44] [93] |
| Validation & statistical analysis | Cross-validation, Cox regression (for survival), C-index, AUC, sensitivity/specificity | Assessing model performance, prognostic value, and generalizability in clinical tasks [94] [93] [95] |

Conclusion

The integration of foundation models with weakly supervised learning frameworks represents a paradigm shift in computational pathology, effectively addressing the critical challenge of limited annotations. Evidence from large-scale benchmarks consistently shows that models like CONCH and Virchow2 achieve state-of-the-art performance across diverse tasks, from biomarker prediction to prognosis. Key takeaways indicate that data diversity in pretraining is a crucial success factor, and that model ensembles can leverage complementary strengths for superior performance. Future directions should focus on developing even more data-efficient algorithms, improving model interpretability for clinical trust, and pursuing large-scale prospective validation to firmly integrate these powerful tools into routine biomedical research and clinical decision-making, ultimately advancing the field of precision medicine.

References