This article provides a comprehensive overview of leading foundation models in computational pathology—Virchow, CONCH, and UNI. It explores the core architectures and self-supervised learning approaches that underpin these models, detailing their applications in critical tasks such as pan-cancer detection, biomarker prediction, and rare disease diagnosis. The content further addresses practical challenges in implementation, including data scarcity and computational demands, and presents a rigorous comparative analysis of model performance based on recent independent benchmarking studies. Aimed at researchers, scientists, and drug development professionals, this review synthesizes current capabilities and future trajectories of these transformative AI tools in biomedical research and clinical practice.
The field of computational pathology is undergoing a fundamental transformation driven by the emergence of foundation models. These models represent a seismic shift from traditional task-specific artificial intelligence systems toward general-purpose representations that can be adapted to a wide range of downstream clinical and research applications. Foundation models are defined as large-scale AI models trained on broad data using self-supervision at scale that can be adapted to a wide range of downstream tasks [1]. This transition mirrors developments in natural language processing and computer vision but presents unique challenges and opportunities due to the complex, multi-scale nature of histopathology data.
The limitations of traditional task-specific models have become increasingly apparent as the field advances. Earlier approaches in computational pathology relied heavily on supervised learning, requiring extensive manual annotation by expert pathologists for each specific task—whether cancer detection, biomarker prediction, or grading. The average cost of pathologist annotation alone is approximately $12 per slide when calculated at standard rates, creating significant bottlenecks in model development [1]. Furthermore, these specialized models often struggled with generalization across different tissue types, cancer variants, and institution-specific preparations. Foundation models address these limitations by learning universal feature representations from massive datasets without task-specific labels, capturing the fundamental morphological patterns that underlie pathological assessment across diverse diseases and tissue types.
The progression from task-specific AI to foundation models in computational pathology represents not merely an incremental improvement but a fundamental rearchitecture of how AI systems are developed and deployed in histopathology. Traditional deep learning models in pathology were characterized by their narrow focus—typically excelling at a single diagnostic task such as cancer grading, mitotic figure counting, or specific biomarker detection. These models were usually trained on limited, carefully annotated datasets using supervised learning approaches, which constrained their applicability and required extensive relabeling for each new clinical task.
Foundation models differ from their predecessors across several critical dimensions, as summarized in Table 1. The most significant distinction lies in their training paradigm: rather than being trained with labeled data for a specific task, foundation models leverage self-supervised learning on massive, diverse datasets of histopathology images. This allows them to learn the fundamental language of tissue morphology—capturing patterns across cellular structures, tissue architecture, staining characteristics, and spatial relationships without human-provided labels. The resulting models exhibit remarkable versatility, enabling application to numerous downstream tasks including cancer detection, subtyping, biomarker prediction, prognosis estimation, and even cross-modal applications linking images with pathological reports or genomic data.
Table 1: Fundamental Differences Between Traditional AI Models and Foundation Models in Computational Pathology
| Characteristics | Foundation Models | Traditional AI Models |
|---|---|---|
| Model Architecture | (Mainly) Transformer | Convolutional Neural Network |
| Model Size | Very large (hundreds of millions to billions of parameters) | Medium to large |
| Applicable Tasks | Many diverse tasks | Single specific task |
| Performance on Adapted Tasks | State-of-the-art (SOTA) | High to SOTA |
| Performance on Untrained Tasks | Medium to high | Low |
| Data Amount for Training | Very large (millions of images) | Medium to large |
| Use of Labeled Data for Training | No (self-supervised) | Yes (supervised) |
The scaling laws observed in other AI domains have proven equally relevant to computational pathology. Model performance demonstrates strong dependence on both dataset size and model architecture complexity [2]. Early foundation models in pathology were trained on limited public datasets such as The Cancer Genome Atlas (TCGA), which contains approximately 29,000 whole slide images (WSIs). Contemporary foundation models now leverage proprietary datasets orders of magnitude larger—Virchow was trained on 1.5 million WSIs, while UNI2 was trained on over 200 million pathology images sampled from 350,000+ diverse whole slide images [3] [4]. This massive scale, combined with advanced self-supervised learning techniques like contrastive learning and masked image modeling, enables the models to capture the rich diversity of morphological patterns present across different tissue types, disease states, and laboratory preparations.
The following diagram illustrates the key evolutionary pathway from task-specific models to general-purpose foundation models in computational pathology:
Virchow represents a landmark in vision-only foundation models for computational pathology. Developed by Paige and Memorial Sloan Kettering Cancer Center, this model is a 632 million parameter vision transformer (ViT) trained on an unprecedented dataset of 1.5 million H&E-stained whole slide images from approximately 100,000 patients [2] [4]. The model employs the DINOv2 self-supervised learning algorithm, which leverages both global and local regions of tissue tiles to learn rich embeddings that capture morphological patterns at multiple scales. This extensive training dataset encompassed 17 different tissue types including both cancerous and benign tissues, collected via biopsy (63%) and resection (37%) procedures, providing exceptional diversity in morphological representation.
The clinical utility of Virchow has been demonstrated through its application to pan-cancer detection, where it achieved a remarkable specimen-level area under the receiver operating characteristic curve (AUROC) of 0.95 across nine common and seven rare cancer types [2]. Particularly noteworthy is its performance on rare cancers—defined by the National Cancer Institute as having an annual incidence of fewer than 15 cases per 100,000 people—where it maintained an AUROC of 0.937, demonstrating robust generalization to uncommon morphological patterns. When compared to specialized clinical-grade AI products, the Virchow-based pan-cancer model performed nearly as well as these targeted systems overall and actually outperformed them on some rare cancer variants, highlighting the value of learning from massively diverse datasets.
CONCH (CONtrastive learning from Captions for Histopathology) pioneered vision-language foundation models in computational pathology. Unlike vision-only approaches, CONCH was trained on 1.17 million image-caption pairs, learning to align visual histological patterns with textual descriptions [5]. This multimodal approach mirrors how human pathologists learn and communicate—associating visual morphological patterns with descriptive terminology. The model architecture enables a wide range of applications including image classification, segmentation, captioning, text-to-image retrieval, and image-to-text retrieval without requiring task-specific fine-tuning.
In comprehensive benchmarking studies, CONCH demonstrated exceptional performance across multiple domains. For morphology-related tasks, it achieved a mean AUROC of 0.77, the highest among 19 foundation models evaluated [6]. Across 19 biomarker-related tasks, CONCH and Virchow2 both achieved the highest mean AUROCs of 0.73, while in prognostic-related tasks, CONCH again led with a mean AUROC of 0.63 [6]. The model's vision-language capabilities make it particularly valuable for applications requiring joint understanding of visual patterns and textual context, such as generating pathological descriptions or retrieving cases based on textual queries.
UNI represents another significant advancement in general-purpose representations for computational pathology. Developed by the Mahmood Lab, the original UNI model utilized a ViT-L/16 architecture, while the more recent UNI2 employs a larger ViT-H/14-reg8 architecture trained on over 200 million pathology H&E and IHC images sampled from 350,000+ diverse whole slide images [3]. This massive scale of training data enables the model to learn highly transferable representations applicable to diverse downstream tasks with minimal adaptation.
The UNI framework emphasizes task-agnostic pretraining followed by efficient adaptation to specific clinical applications. This approach has been widely adopted in the research community, with numerous studies demonstrating its effectiveness for tasks ranging from cancer subtyping and biomarker prediction to survival analysis and tissue segmentation [3]. The model's representations have proven particularly valuable in low-data regimes, where limited annotated examples are available for specific rare conditions or specialized tasks.
The field continues to evolve rapidly with newer architectures addressing limitations of earlier approaches. TITAN (Transformer-based pathology Image and Text Alignment Network) represents a recent advancement in whole-slide foundation models that processes entire slides rather than isolated patches [7]. This model employs a three-stage pretraining strategy: vision-only unimodal pretraining on region-of-interest crops, cross-modal alignment of generated morphological descriptions at the ROI level, and cross-modal alignment at the whole-slide level with clinical reports.
TITAN was trained on 335,645 whole-slide images aligned with corresponding pathology reports and 423,122 synthetic captions generated from a multimodal generative AI copilot for pathology [7]. This approach enables the model to generate general-purpose slide representations that can be directly applied to slide-level tasks without additional aggregation steps. The model has demonstrated strong performance in few-shot and zero-shot settings, particularly for challenging scenarios such as rare disease retrieval and cancer prognosis with limited labeled data.
Table 2: Comparative Analysis of Major Pathology Foundation Models
| Model | Architecture | Training Data Scale | Training Method | Key Strengths | Performance Highlights |
|---|---|---|---|---|---|
| Virchow | ViT (632M params) | 1.5M WSIs | Self-supervised (DINOv2) | Pan-cancer detection, rare cancer identification | 0.95 AUROC pan-cancer detection, 0.937 AUROC rare cancers |
| CONCH | Vision-Language | 1.17M image-caption pairs | Contrastive learning | Multimodal applications, captioning, retrieval | 0.77 AUROC morphology tasks, leading biomarker prediction |
| UNI/UNI2 | ViT-L/16 → ViT-H/14 | 200M+ images from 350K+ WSIs | Self-supervised | General-purpose representations, transfer learning | State-of-the-art on multiple tissue classification tasks |
| TITAN | Slide-level ViT | 335K WSIs + 423K synthetic captions | Multimodal self-supervised | Whole-slide representation, report generation | Strong few-shot/zero-shot performance, rare disease retrieval |
Independent benchmarking studies provide crucial insights into the relative strengths and limitations of different foundation models. A comprehensive evaluation of 19 foundation models across 31 clinically relevant tasks revealed important patterns in model performance [6]. This study utilized 6,818 patients and 9,528 slides from lung, colorectal, gastric, and breast cancers, assessing tasks related to morphology, biomarkers, and prognostication. When averaged across all tasks, CONCH and Virchow2 achieved the highest AUROCs of 0.71, followed closely by Prov-GigaPath and DinoSSLPath with AUROCs of 0.69.
The benchmarking revealed that different models excel in different domains. For morphology-related tasks, CONCH achieved the highest mean AUROC of 0.77, followed by Virchow2 and DinoSSLPath with 0.76 [6]. In biomarker prediction, Virchow2 and CONCH both led with AUROCs of 0.73, while for prognostic tasks, CONCH again achieved the highest performance (0.63 AUROC) [6]. These results suggest that vision-language models like CONCH may have particular advantages for tasks requiring conceptual understanding of tissue morphology, while vision-only models like Virchow excel in pattern recognition for specific pathological entities.
Standardized evaluation protocols are essential for meaningful comparison across foundation models. The typical benchmarking workflow involves multiple stages: feature extraction, aggregation, and task-specific evaluation. For feature extraction, whole slide images are first tessellated into small, non-overlapping patches, typically at 20× magnification [6]. Each patch is then processed through the foundation model to generate embedding vectors that capture the morphological features of that tissue region.
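As a concrete illustration of this tessellation step, the sketch below tiles a slide using the openslide library (one common choice for reading WSIs) and discards mostly-background tiles. The tile size, brightness threshold, and use of pyramid level 0 (rather than an explicit 20× level lookup) are simplifying assumptions, not the exact preprocessing used in the cited studies.

```python
import numpy as np
import openslide

def tessellate(slide_path, tile_size=256, background_fraction=0.8):
    """Tile a WSI into non-overlapping patches, skipping mostly-background tiles."""
    slide = openslide.OpenSlide(slide_path)
    width, height = slide.dimensions  # level-0 dimensions in pixels
    tiles = []
    for y in range(0, height - tile_size + 1, tile_size):
        for x in range(0, width - tile_size + 1, tile_size):
            tile = np.array(slide.read_region((x, y), 0, (tile_size, tile_size)).convert("RGB"))
            # Crude tissue filter: keep the tile unless it is mostly bright (near-white) pixels
            if (tile.mean(axis=-1) > 220).mean() < background_fraction:
                tiles.append(((x, y), tile))
    return tiles

# Hypothetical usage on a single slide file
# tiles = tessellate("example_slide.svs")
```

In practice, pipelines typically select the pyramid level closest to the target magnification and apply more robust tissue-detection heuristics before passing tiles to the foundation model for embedding.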
For slide-level prediction tasks, these patch embeddings must be aggregated into a slide-level representation. Transformer-based multiple instance learning (MIL) approaches have demonstrated superior performance compared to traditional attention-based MIL, with an average AUROC difference of 0.01 across tasks [6]. The aggregated representations are then used to train task-specific classifiers, typically using weakly supervised learning approaches that require only slide-level labels rather than detailed patch-level annotations.
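The aggregation step can be sketched as a simple attention-based MIL module in PyTorch (the simpler of the two aggregation families mentioned above); the embedding dimension, hidden size, and classifier head below are illustrative placeholders rather than the configuration used in the cited benchmarks.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Aggregates patch embeddings from a frozen foundation model into a slide-level prediction."""
    def __init__(self, embed_dim=1024, hidden_dim=256, n_classes=2):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, patch_embs):                                # (n_patches, embed_dim)
        attn = torch.softmax(self.attention(patch_embs), dim=0)   # attention weight per patch
        slide_emb = (attn * patch_embs).sum(dim=0)                # weighted slide-level embedding
        return self.classifier(slide_emb), attn

# Usage with embeddings extracted offline by a frozen foundation model (placeholder tensor)
patch_embs = torch.randn(5000, 1024)        # e.g., 5,000 patches from one WSI
model = AttentionMIL()
logits, attention_weights = model(patch_embs)
```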
Evaluation in low-data scenarios is particularly important for assessing clinical utility. Studies have examined model performance with varying training set sizes (75, 150, and 300 patients) while maintaining similar ratios of positive samples [6]. Interestingly, performance metrics remained relatively stable between 75 and 150 patient cohorts, suggesting that foundation models can maintain effectiveness even with limited fine-tuning data. This has important implications for rare conditions where large annotated datasets are unavailable.
The following diagram illustrates the standard benchmarking workflow used to evaluate pathology foundation models:
Foundation models are enabling transformative applications across the diagnostic spectrum. In cancer detection, models like Virchow have demonstrated the capability to identify both common and rare cancers across diverse tissue types with high accuracy [2]. This pan-cancer capability is particularly valuable for screening applications and for cases with atypical presentations. For biomarker prediction, foundation models can infer molecular alterations directly from H&E-stained images, potentially reducing reliance on expensive additional testing. Studies have successfully predicted biomarkers including BRAF mutations, microsatellite instability (MSI), and CpG island methylator phenotype (CIMP) status from routine histology images [6] [2].
The ability to predict biomarkers from standard H&E stains has significant implications for precision oncology. By identifying patients likely to have specific molecular alterations, foundation models can help prioritize cases for confirmatory testing and enable earlier treatment decisions. In many cases, these models have achieved AUROCs exceeding 0.70 for various biomarker prediction tasks, approaching the performance of dedicated specialized tests while utilizing routinely available tissue sections [6].
Beyond diagnostic classification, foundation models show increasing promise for prognostic prediction and therapeutic response forecasting. By capturing subtle morphological patterns associated with disease aggressiveness and tumor microenvironment composition, these models can stratify patients according to likely clinical outcomes. In benchmarking studies, foundation models achieved mean AUROCs of approximately 0.63 for prognostic tasks, demonstrating modest but meaningful predictive value for outcomes such as survival and recurrence risk [6].
The multimodal capabilities of models like CONCH and TITAN enable particularly sophisticated applications in this domain. By integrating histological patterns with clinical data, pathological reports, and eventually genomic information, these systems can provide comprehensive prognostic assessments that account for multiple dimensions of disease biology. This approach aligns with the trend toward multidimensional classification in oncology, where treatment decisions incorporate histological, molecular, and clinical factors.
The translation of foundation models from research tools to clinical practice requires careful consideration of multiple factors. Computational efficiency remains a significant challenge, as processing gigapixel whole slide images demands substantial resources [8]. Some studies have reported prohibitive computational overhead when applying certain foundation models to large slide repositories, highlighting the need for optimization in real-world deployment.
Generalization across diverse patient populations and laboratory protocols is another critical consideration. While foundation models trained on large datasets demonstrate better generalization than earlier approaches, performance variations across different demographic groups and institution-specific preparations have been observed [8]. Continuous monitoring and potential recalibration may be necessary when deploying these models across varied clinical settings.
Regulatory approval pathways for foundation models in pathology are still evolving. The adaptability that makes these models powerful—their application to multiple tasks with minimal modification—presents challenges for traditional regulatory frameworks that typically evaluate medical AI systems for specific intended uses. Developing appropriate validation frameworks that preserve the flexibility of foundation models while ensuring safety and efficacy for each application represents an important frontier in clinical translation.
Table 3: Research Reagent Solutions for Pathology Foundation Model Research
| Resource Category | Specific Examples | Function and Application | Access Information |
|---|---|---|---|
| Pretrained Models | Virchow, CONCH, UNI, UNI2, TITAN | Feature extraction, transfer learning, few-shot applications | GitHub repositories, Hugging Face, institutional collaborations |
| Benchmark Datasets | TCGA, CPTAC, PANDA, CRC100K | Model evaluation, comparative performance assessment | Public repositories, controlled access platforms |
| Evaluation Frameworks | Linear probing, KNN classification, few-shot evaluation | Standardized performance assessment across models and tasks | Custom implementations, research code repositories |
| Computational Infrastructure | High-memory GPUs, distributed computing systems | Handling gigapixel whole slide images and large model architectures | Institutional HPC resources, cloud computing platforms |
| Annotation Tools | Digital pathology annotation software | Generating labeled datasets for fine-tuning and evaluation | Commercial and open-source pathology viewing platforms |
Despite remarkable progress, several challenges remain in the development and deployment of pathology foundation models. Data diversity and representation continue to be concerns, as even large-scale training datasets may not fully capture the morphological spectrum across different populations, specimen types, and laboratory protocols [8]. The replicability of results across institutions also requires further investigation, with some studies reporting mixed success in reproducing published findings when using different datasets and computational environments [8].
The scaling laws observed in foundation models suggest that continued increases in data and model size may yield further performance improvements. However, the optimal balance between data quantity, diversity, and quality remains an open research question. Some evidence suggests that data diversity may outweigh sheer volume, with models trained on more heterogeneous datasets sometimes outperforming those trained on larger but more homogeneous collections [6].
Future development will likely focus on several key areas: whole-slide modeling approaches that better capture long-range spatial relationships across tissue sections; improved multimodal integration combining histology with genomic, transcriptomic, and clinical data; more efficient architectures that reduce computational requirements; and enhanced interpretability methods that make model predictions transparent to pathologists. As these technical advances progress, parallel efforts will be needed to establish appropriate validation frameworks, regulatory pathways, and clinical integration strategies to ensure that foundation models fulfill their potential to enhance pathological practice and patient care.
The emergence of generalist medical AI systems that integrate pathology foundation models with models from other medical domains represents a particularly promising direction [1]. Such systems could provide comprehensive diagnostic support by combining information from histology, radiology, laboratory medicine, and clinical notes, moving closer to the holistic assessment approaches used by expert clinicians. Realizing this vision will require not only technical innovation but also careful attention to workflow integration, usability, and the development of appropriate trust mechanisms between pathologists and AI systems.
The field of computational pathology has been transformed by the advent of foundation models, which are large-scale AI models trained on broad data that can be adapted to a wide range of downstream tasks [1]. These models address critical challenges in pathology AI development, notably the high cost and time required for pathologists to annotate data and the need for models that generalize across diverse tissue types and cancer variants [1] [9]. Prior to Virchow, pathology foundation models were trained on significantly smaller datasets, ranging from tens to hundreds of thousands of slides [9] [2]. Virchow represents a substantial scaling in both data and model size, trained on 1.5 million hematoxylin and eosin (H&E) stained whole slide images (WSIs) using the self-supervised learning algorithm DINOv2 [10] [2]. This massive scale is crucial for capturing the immense diversity of morphological patterns in histopathology and enables robust performance, particularly on rare cancers where labeled data is scarce [2].
Virchow is built on the Vision Transformer (ViT) architecture [10] [2]. The model comprises 632 million parameters, classifying it as a ViT-huge model [9] [2]. The input to the model consists of tissue tiles extracted from gigapixel whole-slide images. The fundamental components and training data are summarized in the table below.
Table 1: Virchow Model Architecture and Training Data Specifications
| Component | Specification | Details |
|---|---|---|
| Model Architecture | Vision Transformer (ViT) | 632 million parameters (ViT-huge) [10] [2] |
| Training Algorithm | DINOv2 | Self-distillation with no labels (SSL) [2] |
| Training Dataset | 1.5 million H&E WSIs | Sourced from ~100,000 patients at MSKCC; includes biopsies (63%) and resections (37%) [2] |
| Tissue Coverage | 17 high-level tissues | Cancerous and benign tissues [2] |
| Input Processing | Tissue Tiles | Extracted from WSIs; uses global and local views for self-supervised learning [2] |
DINOv2, the second generation of the DINO (self-DIstillation with NO labels) algorithm, is central to Virchow's training [2]. This method employs a student-teacher network structure in which both networks are fed different augmented views of the same image tiles. The student network is trained to match the output of the teacher network. The teacher's weights are an exponential moving average (EMA) of the student's weights, which stabilizes training. This process allows the model to learn versatile visual representations without any manual annotations by leveraging the inherent structure of the data itself.
Diagram 1: DINOv2 Training Workflow for Virchow. The diagram illustrates the self-supervised learning process where the student network learns to match the output of a teacher network fed with different augmented views of the same pathology image tiles. The teacher's weights are an exponential moving average (EMA) of the student's weights. This process creates general-purpose visual embeddings without manual labels.
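The sketch below captures the student-teacher mechanics described above in simplified form; the toy networks, momentum value, temperatures, and the omission of DINOv2's output centering and multi-crop scheme are all illustrative assumptions.

```python
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, momentum=0.996):
    """Teacher weights follow an exponential moving average of the student weights."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

def distillation_loss(student_out, teacher_out, temp_s=0.1, temp_t=0.04):
    """The student distribution is trained to match the (sharpened) teacher distribution."""
    teacher_probs = F.softmax(teacher_out / temp_t, dim=-1).detach()
    student_logprobs = F.log_softmax(student_out / temp_s, dim=-1)
    return -(teacher_probs * student_logprobs).sum(dim=-1).mean()

# Illustrative training step on two augmented views of the same batch of tiles
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 256))
teacher = copy.deepcopy(student)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

view_a, view_b = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
loss = distillation_loss(student(view_a), teacher(view_b))
loss.backward()
optimizer.step()
ema_update(teacher, student)
```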
A primary application of Virchow is pan-cancer detection, which involves training a single model to identify cancer across multiple tissue types. A weakly supervised aggregator model uses Virchow's tile embeddings to make slide-level predictions.
Table 2: Pan-Cancer Detection Performance (Specimen-Level AUC) [2]
| Cancer Category | Virchow | UNI | Phikon | CTransPath |
|---|---|---|---|---|
| Overall (16 cancers) | 0.950 | 0.940 | 0.932 | 0.907 |
| Rare Cancers (7 types) | 0.937 | Not Reported | Not Reported | Not Reported |
| Common Cancers (9 types) | >0.950 (avg) | >0.940 (avg) | >0.932 (avg) | >0.907 (avg) |
| Bone Cancer | 0.841 | 0.813 | 0.822 | 0.728 |
| Cervix Cancer | 0.875 | 0.830 | 0.810 | 0.753 |
The pan-cancer detector demonstrated strong generalization on rare cancers and out-of-distribution data from external institutions. At a high sensitivity of 95%, the model using Virchow embeddings achieved a specificity of 72.5%, outperforming other foundation models [2].
Virchow's embeddings were also evaluated on tile-level classification tasks, where they achieved state-of-the-art performance on internal and external benchmarks [10]. Furthermore, the model showed strong capabilities in predicting biomarkers directly from routine H&E images, potentially reducing the need for additional specialized testing. Virchow outperformed other models in predicting key gene mutations, such as in lung adenocarcinoma [2].
The landscape of pathology foundation models has expanded rapidly. The table below contextualizes Virchow among other notable models.
Table 3: Comparative Analysis of Public Pathology Foundation Models [9]
| Model | Parameters | Training Slides | Training Tiles | Architecture | SSL Algorithm |
|---|---|---|---|---|---|
| Virchow | 631 M | 1.5 M | 2.0 B | ViT-H | DINOv2 |
| Prov-GigaPath | 1135 M | 171 k | 1.3 B | LongNet | DINOv2 + MAE |
| UNI | 303 M | 100 k | 100 M | ViT-L | DINOv2 |
| Phikon | 86 M | 6 k | 43 M | ViT-B | iBOT |
| CTransPath | 28 M | 32 k | 16 M | Swin Transformer + CNN | MoCo v2 |
This comparison highlights Virchow's position as a model trained on an exceptionally large slide dataset. Other models like CONCH explore a different approach as a visual-language foundation model pretrained on 1.17 million image-caption pairs, enabling tasks like image classification, captioning, and cross-modal retrieval [5].
Table 4: Essential Resources for Implementing Pathology Foundation Models
| Resource / Solution | Function / Purpose | Example / Note |
|---|---|---|
| Whole-Slide Scanners | Digitizes glass slides into gigapixel WSIs | Various vendors and models; introduces variability [11] |
| Tile Extraction Pipeline | Divides WSIs into smaller, manageable patches | Requires tissue detection & background filtering [11] [9] |
| Self-Supervised Learning (SSL) Frameworks | Enables training on unlabeled image data | DINOv2, iBOT, MAE [9] [2] |
| Multiple Instance Learning (MIL) | Aggregates tile-level features for slide-level prediction | Attention-based MIL networks [11] [9] |
| Vision Transformer (ViT) Architecture | Neural network backbone for processing image sequences | Scales to hundreds of millions of parameters [10] [2] |
| Public Model Weights | Provides pre-trained models for transfer learning | Virchow, UNI, CONCH, and CTransPath weights are publicly available [3] [9] [5] |
To adapt Virchow for a specific downstream task (e.g., cancer subtyping or biomarker prediction), the following methodology is employed:
Diagram 2: Downstream Task Fine-Tuning. This workflow shows how the pre-trained Virchow model is used as a feature extractor for a downstream task like cancer detection. Features from individual image tiles are aggregated by a multiple instance learning (MIL) model to produce a final slide-level prediction.
Virchow establishes that scaling up training data to millions of whole-slide images enables the creation of a powerful foundation model capable of robust pan-cancer detection and biomarker prediction. Its success, along with that of other models like UNI, CONCH, and Prov-GigaPath, underscores a paradigm shift in computational pathology towards large-scale, self-supervised learning. These models form a foundational toolkit for researchers and drug development professionals, accelerating tasks ranging from rare cancer diagnosis to the discovery of novel morphological biomarkers.
The field of computational pathology is undergoing a transformation driven by artificial intelligence and the emergence of foundation models. These models, pre-trained on vast datasets, can be adapted to a wide range of downstream tasks with minimal fine-tuning. Among these, CONCH (CONtrastive learning from Captions for Histopathology) represents a significant advancement as a visual-language foundation model specifically designed for histopathology. Unlike vision-only models, CONCH leverages both histopathology images and corresponding textual descriptions, mirroring how pathologists teach and reason about histopathologic entities. This approach enables a single model to perform diverse tasks without task-specific training, addressing critical challenges of label scarcity in the medical domain and the impracticality of training separate models for every possible diagnostic scenario [12].
Several foundation models have been developed for computational pathology, each with distinct architectures, training datasets, and capabilities. The table below summarizes three leading models: CONCH, Virchow, and UNI.
Table 1: Comparison of Major Pathology Foundation Models
| Feature | CONCH | Virchow | UNI |
|---|---|---|---|
| Model Type | Visual-Language [12] | Vision-Only [2] | Vision-Only [9] |
| Core Architecture | ViT-B/16 Vision Encoder & L12 Text Encoder [13] | ViT-H (632M parameters) [2] [10] | ViT-L (303M parameters) [9] |
| Pre-training Algorithm | Contrastive Learning & Captioning (CoCa) [12] | Self-supervised Learning (DINOv2) [2] [10] | Self-supervised Learning (DINOv2) [9] |
| Training Data Scale | 1.17M image-caption pairs [5] [12] | ~1.5M whole slide images (WSIs) [2] | 100M tiles from 100K WSIs [9] |
| Key Capabilities | Image & text encoding, zero-shot classification, cross-modal retrieval, captioning [12] | Pan-cancer detection, biomarker prediction [2] | Tile and slide-level classification [9] |
| Primary Application | Multimodal tasks involving images and text [5] | High-performance cancer detection across common and rare types [2] | General-purpose visual feature extraction [9] |
CONCH is built upon the CoCa (Contrastive Captioner) framework, which integrates three core components: a vision encoder, a text encoder, and a multimodal fusion decoder. The vision encoder is a Vision Transformer (ViT-B/16) with 90 million parameters, processing histopathology image patches. The text encoder is a transformer-based model (L12-E768-H12) with 110 million parameters, handling textual input. During pre-training, the model is trained using a combination of an image-text contrastive loss, which aligns visual and textual representations in a shared embedding space, and a captioning loss, which teaches the model to generate descriptive captions for given images [12] [13]. This dual objective ensures the model learns both discriminative and generative capabilities.
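A minimal sketch of this dual objective is shown below; the projection dimension, vocabulary size, caption length, and loss weighting are placeholder assumptions, not CONCH's published configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of aligned image-caption pairs."""
    img_embs = F.normalize(img_embs, dim=-1)
    txt_embs = F.normalize(txt_embs, dim=-1)
    logits = img_embs @ txt_embs.T / temperature
    targets = torch.arange(logits.size(0))          # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

def captioning_loss(decoder_logits, caption_token_ids):
    """Autoregressive captioning loss: predict each caption token given the image and prefix."""
    return F.cross_entropy(decoder_logits.transpose(1, 2), caption_token_ids)

# Placeholder batch: 16 image-caption pairs, 512-d projections, 32-token captions, 30k vocabulary
img_embs, txt_embs = torch.randn(16, 512), torch.randn(16, 512)
decoder_logits = torch.randn(16, 32, 30000)
caption_ids = torch.randint(0, 30000, (16, 32))
total_loss = contrastive_loss(img_embs, txt_embs) + 2.0 * captioning_loss(decoder_logits, caption_ids)
```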
CONCH was pre-trained on a diverse collection of 1.17 million histopathology image-caption pairs, the largest such dataset for pathology at the time of its development. The data was sourced from publicly available PubMed Central Open Access (PMC-OA) articles and internally curated sources. The images encompass various stain types, including Hematoxylin and Eosin (H&E), immunohistochemistry (IHC), and special stains, contributing to the model's robustness. The training was conducted using mixed-precision (fp16) on 8 Nvidia A100 GPUs for approximately 21.5 hours [13]. A key advantage noted by the developers is that CONCH was not pre-trained on large public slide collections like TCGA, PAIP, or GTEX, minimizing the risk of data contamination when evaluating on popular public benchmarks [5] [13].
Diagram 1: CONCH pre-training workflow.
CONCH was rigorously evaluated against other visual-language models, including PLIP, BiomedCLIP, and OpenAICLIP, across 14 diverse benchmarks covering tasks like classification, segmentation, and retrieval [12].
Methodology: For zero-shot region-of-interest (ROI) classification, class names are converted into a set of text prompts (e.g., "a histology image of invasive lobular carcinoma"). The image is encoded by the vision encoder, and the text prompts are encoded by the text encoder. The classification is performed by computing the cosine similarity between the image embedding and each text prompt embedding in the shared multimodal space, selecting the class with the highest similarity score [12]. For whole-slide image (WSI) classification, the MI-Zero method is employed: the gigapixel WSI is divided into smaller tiles, each tile is classified via zero-shot prediction, and the individual tile-level scores are aggregated into a final slide-level prediction [12].
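The sketch below illustrates this general recipe for a single slide: tile embeddings are scored against text prompt embeddings from the text encoder and the tile-level scores are pooled into a slide-level prediction. The mean-pooling aggregation and embedding dimensions are simplifying assumptions rather than the exact MI-Zero procedure.

```python
import torch
import torch.nn.functional as F

def zero_shot_slide_prediction(tile_embs, class_prompt_embs):
    """Classify a WSI by scoring each tile against class prompt embeddings and pooling."""
    tile_embs = F.normalize(tile_embs, dim=-1)                   # (n_tiles, d)
    class_prompt_embs = F.normalize(class_prompt_embs, dim=-1)   # (n_classes, d)
    tile_scores = tile_embs @ class_prompt_embs.T                # cosine similarity per tile
    tile_probs = tile_scores.softmax(dim=-1)
    slide_probs = tile_probs.mean(dim=0)                         # aggregate tiles to slide level
    return slide_probs.argmax().item(), slide_probs

# Placeholder embeddings for a slide with 2,000 tiles and 3 candidate diagnoses
tile_embs = torch.randn(2000, 512)
prompt_embs = torch.randn(3, 512)   # e.g., encoded prompts such as "a histology image of ..."
pred_class, probs = zero_shot_slide_prediction(tile_embs, prompt_embs)
```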
Results: CONCH demonstrated state-of-the-art performance. The table below summarizes its zero-shot classification results across several slide-level and ROI-level tasks.
Table 2: Zero-shot Classification Performance of CONCH
| Task | Dataset | Primary Metric | CONCH Performance | Next-Best Model Performance |
|---|---|---|---|---|
| NSCLC Subtyping | TCGA NSCLC | Balanced Accuracy | 90.7% [12] | 78.7% (PLIP) [12] |
| RCC Subtyping | TCGA RCC | Balanced Accuracy | 90.2% [12] | 80.4% (PLIP) [12] |
| BRCA Subtyping | TCGA BRCA | Balanced Accuracy | 91.3% [12] | 55.3% (BiomedCLIP) [12] |
| Gleason Grading | SICAP | Quadratic Weighted Kappa (κ) | 0.690 [12] | 0.550 (BiomedCLIP) [12] |
| Colorectal Tissue Classification | CRC100k | Balanced Accuracy | 79.1% [12] | 67.4% (PLIP) [12] |
Image-to-Text and Text-to-Image Retrieval: Cross-modal retrieval is a core strength of CONCH. Given a query image, the model can retrieve the most relevant text description from a database, and vice versa. This is achieved by computing the cosine similarity between the query's embedding and the embeddings of all candidates in the database [12]. This capability is crucial for tasks like knowledge search and case retrieval in clinical and research settings.
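Cross-modal retrieval reduces to a nearest-neighbor search in the shared embedding space, as in the brief sketch below; the database size and embedding dimension are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def retrieve_topk(query_emb, database_embs, k=5):
    """Rank database entries (e.g., captions) by cosine similarity to a query (e.g., an image)."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(database_embs, dim=-1).T
    scores, indices = sims.topk(k)
    return indices.tolist(), scores.tolist()

# Illustrative image-to-text retrieval over a database of 10,000 caption embeddings
query_image_emb = torch.randn(512)
caption_embs = torch.randn(10000, 512)
top_indices, top_scores = retrieve_topk(query_image_emb, caption_embs)
```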
Segmentation and Captioning: CONCH can also be adapted for semantic segmentation tasks. Using a method similar to ClusterFit, the model generates patch-level features that are clustered to create pseudo-masks, which are then used to train a segmentation decoder [12]. Although the publicly released weights do not include the multimodal decoder (to prevent potential leakage of private data), the original model was also trained with a captioning objective, enabling it to generate descriptive captions for histopathology images [13].
CONCH is available for non-commercial academic research under a CC-BY-NC-ND 4.0 license. Access is gated; researchers must request access through the Hugging Face model repository using an official institutional email address [13].
Installation and Basic Usage [13]:

```bash
pip install git+https://github.com/Mahmoodlab/CONCH.git
```

```python
import torch

# Assumes `model`, a preprocessed `image` tensor, and `tokenized_prompts`
# have already been prepared following the repository instructions.
with torch.inference_mode():
    image_embs = model.encode_image(image, proj_contrast=True, normalize=True)
    text_embs = model.encode_text(tokenized_prompts)
    similarity_scores = (image_embs @ text_embs.T).squeeze(0)
```

Table 3: Essential Research Reagents and Resources for CONCH
| Item / Resource | Description & Function |
|---|---|
| CONCH Model Weights | Pre-trained parameters for the vision and text encoders. Used as a foundational feature extractor or for zero-shot inference [13]. |
| Python conch Package | The official software library that provides the model architecture and necessary functions to load and run the CONCH model [5] [13]. |
| High-Performance GPU (e.g., Nvidia A100) | Graphics processing unit essential for efficient model inference and fine-tuning, reducing computation time from hours to minutes [13]. |
| PyTorch & Hugging Face Transformers | Core machine learning frameworks used for model implementation, training, and tokenization [13]. |
| Institutional Hugging Face Account | A mandatory requirement for accessing the gated model repository, ensuring compliance with the license terms [13]. |
Diagram 2: CONCH zero-shot WSI classification process.
CONCH establishes a new paradigm in computational pathology by effectively bridging visual and linguistic domains. Its ability to perform zero-shot classification, cross-modal retrieval, and other diverse tasks with state-of-the-art accuracy demonstrates the power of visual-language pre-training. By providing a versatile and powerful foundation, CONCH has the potential to accelerate research and development across a wide spectrum of pathology applications, from diagnostic support and educational tools to biomarker discovery. Its open availability to the research community further catalyzes innovation, paving the way for more generalized and impactful AI tools in histopathology.
The field of computational pathology has been transformed by foundation models that encode histopathology regions-of-interest (ROIs) into versatile and transferable feature representations via self-supervised learning [7]. Within this landscape, UNI represents a significant advancement as a general-purpose foundation model specifically designed for computational pathology tasks. Unlike traditional AI models that require extensive labeled data for each specific task, foundation models like UNI are trained on broad data using self-supervision at scale and can be adapted to a wide range of downstream tasks [14]. This capability is particularly valuable in histopathology, where obtaining large-scale annotated datasets imposes a substantial labeling burden on pathologists, limiting the practical benefits of AI-assisted diagnostics [15].
UNI belongs to a class of multimodal foundation models (MFMs) that integrate multiple data modalities such as language, image, and bioinformatics [14]. These models demonstrate superior expressiveness and scalability based on large model architectures, extensive training data, and parallelizable training methods compared to traditional deep learning models [14]. The development of UNI and similar models addresses a critical need in pathology for more accurate and efficient AI tools that can reduce workload for pathologists and support decision-making in treatment plans while handling the complex morphological patterns found in histology images [7] [14].
UNI employs a transformer-based architecture that has proven effective in handling the complex visual and linguistic representations required for pathology tasks. The model utilizes dedicated encoders to extract comprehensive feature representations for each modality, enabling seamless integration between phenotype patterns and molecular profiles [16]. This architectural approach allows UNI to process whole slide images (WSIs) at multiple resolutions, capturing both cellular-level details and tissue-level architectural patterns essential for accurate pathological assessment.
The model's design incorporates several innovative components to address domain-specific challenges in computational pathology. Unlike conventional multi-modal integration methods that primarily emphasize modality alignment, UNI's framework is designed to foster both modality alignment and retention [16]. This dual approach is crucial because histopathology and other biomedical data modalities exhibit pronounced heterogeneity, offering orthogonal yet complementary insights. While histopathology data provides morphological and spatial context elucidating tissue architecture and cellular topology, other modalities like transcriptomics delineate molecular signatures through gene expression patterns [16].
UNI processes input data through a sophisticated pipeline that handles the unique characteristics of pathological images. The model constructs its input embedding space by dividing each WSI into non-overlapping patches, typically at 20× magnification, followed by the extraction of a fixed-dimensional feature vector for each patch using advanced feature extractors [7]. To manage the computational complexity caused by long input sequences in gigapixel WSIs, UNI employs efficient attention mechanisms and feature compression techniques.
Table 1: UNI Model Specifications and Key Technical Features
| Component | Specification | Function |
|---|---|---|
| Input Processing | Non-overlapping patches at 20× magnification | Divides WSIs into manageable units for processing |
| Feature Extraction | Pre-trained encoders (e.g., CONCH-based) | Converts image patches into feature representations |
| Modality Integration | Cross-attention mechanisms with alignment and retention modules | Fuses information from different data sources |
| Positional Encoding | Attention with linear bias (ALiBi) extended to 2D | Preserves spatial relationships between patches |
| Output Representation | Multi-scale feature embeddings | Captures both local cellular and global tissue patterns |
A critical innovation in UNI's architecture is its approach to handling variable-sized WSIs. The model creates views of a WSI by randomly cropping 2D feature grids, sampling region crops of specific dimensions from the WSI feature grid [7]. From these region crops, multiple global and local crops are sampled for self-supervised pretraining. This approach enables the model to learn robust representations that capture both fine-grained cellular details and broader tissue organizational patterns.
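A simplified sketch of this view-sampling idea is given below; the grid size, crop dimensions, and number of local crops are invented for illustration and do not reflect the exact pretraining configuration.

```python
import torch

def sample_views(feature_grid, region_size=16, n_local=4, local_size=6):
    """Randomly crop a region from a WSI feature grid, then sample global and local views."""
    h, w, d = feature_grid.shape
    top = torch.randint(0, h - region_size + 1, (1,)).item()
    left = torch.randint(0, w - region_size + 1, (1,)).item()
    region = feature_grid[top:top + region_size, left:left + region_size]   # global view

    local_views = []
    for _ in range(n_local):
        lt = torch.randint(0, region_size - local_size + 1, (1,)).item()
        ll = torch.randint(0, region_size - local_size + 1, (1,)).item()
        local_views.append(region[lt:lt + local_size, ll:ll + local_size])
    return region, local_views

# Illustrative 64x64 grid of 768-d patch features extracted from one WSI
grid = torch.randn(64, 64, 768)
global_view, local_views = sample_views(grid)
```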
UNI implements a sophisticated modality alignment module that dynamically draws paired data from different modalities into closer proximity in the embedding space while dispersing unrelated samples [16]. This alignment is achieved through contrastive learning objectives that maximize the similarity between corresponding image-text pairs while minimizing similarity for non-corresponding pairs. The alignment process operates at multiple levels, including ROI-level alignment with fine-grained morphological descriptions and slide-level alignment with clinical reports [7].
The alignment methodology addresses the fundamental challenge of bridging the semantic gap between visual pathological findings and their corresponding textual descriptions. Unlike conventional multi-modal inputs that often share highly overlapping features, histopathology and associated textual data exhibit significant heterogeneity, operating at different biological scales and encoding distinct yet complementary dimensions of disease-related information [16]. UNI's alignment framework is specifically designed to handle this heterogeneity while identifying and leveraging the shared correlations between modalities.
UNI's cross-modal alignment is typically implemented through a multi-stage pretraining strategy. The initial stage involves vision-only unimodal pretraining on large datasets of histopathology images, enabling the model to learn fundamental visual representations of pathological structures [7]. Subsequent stages introduce cross-modal alignment, first at the ROI-level with generated morphological descriptions, and then at the whole-slide level with clinical reports [7].
Table 2: Cross-Modal Alignment Training Strategy in UNI
| Training Stage | Data Input | Objective | Outcome |
|---|---|---|---|
| Stage 1: Vision Pretraining | Diverse WSIs across multiple organ types | Self-supervised learning via masked image modeling and knowledge distillation | Robust visual feature extraction for pathological structures |
| Stage 2: ROI-Level Alignment | 8K×8K ROIs paired with synthetic captions | Contrastive learning between image patches and fine-grained textual descriptions | Fine-grained understanding of localized pathological features |
| Stage 3: Slide-Level Alignment | Complete WSIs paired with pathology reports | Global alignment between entire slides and comprehensive diagnostic text | Slide-level diagnostic reasoning and report generation capabilities |
This staged approach allows UNI to progressively build its cross-modal understanding, from localized cellular and tissue patterns to broader diagnostic concepts and relationships. The model leverages both real pathology reports and synthetically generated fine-grained descriptions to create a comprehensive alignment between visual patterns and their semantic representations [7].
UNI demonstrates exceptional transfer learning capabilities across a wide spectrum of pathology tasks without requiring extensive task-specific fine-tuning. The model can be effectively applied to histology image classification, segmentation, captioning, text-to-image retrieval, and image-to-text retrieval tasks [5]. This versatility stems from the rich, general-purpose representations learned during pretraining, which capture fundamental morphological patterns relevant across different organs, disease types, and staining protocols.
The model's architecture enables multiple transfer learning paradigms, including linear probing (training a simple classifier on frozen features), few-shot learning (adapting with very limited labeled examples), and zero-shot learning (performing tasks without any task-specific training) [7]. Particularly impressive is UNI's performance in low-data regimes, where it outperforms supervised baselines and previous foundation models, making it highly valuable for rare diseases and specialized applications where labeled data is scarce [7].
UNI has been rigorously evaluated on diverse benchmarks demonstrating state-of-the-art performance across multiple domains. The model shows particular strength in few-shot and zero-shot classification scenarios, where it achieves competitive performance with only minimal training examples. In slide retrieval tasks, UNI enables effective retrieval of similar cases based on visual similarity or textual queries, facilitating comparative pathology and decision support [7].
Table 3: UNI Performance Across Key Pathology Tasks
| Task Category | Specific Applications | Key Performance Metrics | Comparative Advantage |
|---|---|---|---|
| Image Classification | Cancer subtyping, grading, biomarker prediction | Top-1 accuracy, AUC-ROC | Reduces annotation requirements by up to 90% in few-shot settings |
| Segmentation | Tissue and cellular segmentation | Dice coefficient, IoU | Generalizes across stain types and tissue preparations |
| Captioning & Report Generation | Automated pathology report generation | BLEU scores, clinical accuracy | Generates clinically relevant descriptions from WSIs |
| Cross-Modal Retrieval | Text-to-image, image-to-text retrieval | Recall@K, mean average precision | Enables content-based search in large pathology archives |
| Survival Analysis | Patient outcome prediction | Concordance index, log-rank p-values | Integrates morphological patterns with clinical data |
The model's strong performance across these diverse tasks demonstrates its effectiveness as a general-purpose feature extractor for computational pathology. By capturing biologically meaningful representations, UNI reduces the dependency on large annotated datasets and accelerates the development of AI tools for specialized pathological applications.
To ensure rigorous assessment of UNI's capabilities, researchers have established comprehensive evaluation protocols covering multiple downstream tasks and data modalities. The standard evaluation framework typically involves 5-fold cross-validation on multiple cohorts from large-scale datasets such as The Cancer Genome Atlas (TCGA), focusing on critical downstream tasks including cancer subtyping and survival analysis [16]. This approach ensures robust performance estimation across different tissue sites and patient populations.
The evaluation incorporates both linear probing and few-shot learning settings to comprehensively assess the model's performance and generalizability [16]. In linear probing evaluations, a simple classifier is trained on top of frozen features extracted by UNI, testing the quality of the representations without task-specific adaptation. In few-shot learning scenarios, the model is adapted with very limited labeled examples (typically ranging from 1 to 16 examples per class) to simulate real-world conditions where extensive annotations are unavailable.
For cancer subtyping and classification, the standard protocol involves extracting features from WSIs using UNI, then training a classifier on these features using a limited set of labeled examples. The model processes input WSIs by dividing them into patches, encoding each patch, and aggregating the patch-level representations into a slide-level embedding that captures both local and global pathological patterns.
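A minimal linear-probing sketch of this protocol, using scikit-learn on frozen slide-level embeddings, is shown below; the synthetic features and labels stand in for real extracted embeddings and annotations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Placeholder frozen slide-level embeddings (e.g., aggregated foundation-model features) and labels
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(300, 1024)), rng.integers(0, 2, 300)
X_test, y_test = rng.normal(size=(100, 1024)), rng.integers(0, 2, 100)

# Linear probe: a simple classifier trained on top of frozen features, no fine-tuning of the backbone
probe = LogisticRegression(max_iter=1000, C=1.0)
probe.fit(X_train, y_train)
auroc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
print(f"Linear-probe AUROC: {auroc:.3f}")
```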
For survival analysis and prognosis prediction, UNI features are used in Cox proportional hazards models or other survival analysis frameworks to predict patient outcomes based on histomorphological patterns. The model's ability to capture prognostically relevant tissue and cellular characteristics enables accurate risk stratification without requiring explicit annotation of histological features.
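A brief sketch of this survival-analysis step with the lifelines package (one common implementation of Cox models) follows; the feature columns, follow-up times, and penalizer value are illustrative placeholders rather than a published protocol.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Placeholder data: 3 morphology-derived features per patient plus follow-up time and event flag
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["feat_1", "feat_2", "feat_3"])
df["time_months"] = rng.exponential(scale=36.0, size=200)
df["event"] = rng.integers(0, 2, 200)

# Fit a Cox proportional hazards model on the slide-derived features
cph = CoxPHFitter(penalizer=0.1)
cph.fit(df, duration_col="time_months", event_col="event")
print(cph.concordance_index_)   # concordance index on the training data
```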
For cross-modal retrieval tasks, the standard protocol involves encoding both images and text into a shared embedding space, then measuring similarity using cosine distance or other metrics. This enables bidirectional retrieval, where text queries can retrieve relevant images and vice versa, facilitating knowledge discovery and comparative analysis.
The effective implementation and application of UNI requires specific computational tools and resources that form the essential "research reagents" for working with this foundation model.
Table 4: Essential Research Reagent Solutions for UNI Implementation
| Tool Category | Specific Solutions | Function in Research Pipeline |
|---|---|---|
| Whole Slide Image Management | OMERO, Digital Slide Archive | Hosts and manages large-scale WSI datasets with appropriate metadata |
| Pathology Data Annotation | QuPath, ImageJ | Enables region-of-interest annotation and ground truth generation |
| Computational Pathology Frameworks | CONCH ecosystem, TIAToolbox | Provides pretrained models and standardized processing pipelines |
| Multi-Modal Learning Platforms | MIRROR framework, CLIP-based adaptations | Facilitates alignment between histopathology and other data modalities |
| Visualization and Analysis | Comparative Pathology Workbench (CPW) | Enables interactive visual analytics and collaborative interpretation |
These tools collectively support the end-to-end workflow for applying UNI to diverse pathology tasks, from data management and preprocessing to model implementation, evaluation, and visualization of results. The Comparative Pathology Workbench (CPW) deserves special mention as it provides a web-browser-based visual analytics platform offering shared access to an interactive "spreadsheet" style presentation of images and associated analysis data [17]. This facilitates direct and dynamic comparison of images at various magnifications, selected regions of interest, and results of image analysis or other data analyses such as scRNA-seq [17].
The following diagram illustrates the complete workflow for UNI's cross-modal alignment and transfer learning capabilities:
The following diagram details UNI's core innovation in balancing modality alignment with modality-specific retention:
The development of UNI represents a significant milestone in computational pathology, but several challenges remain for widespread clinical adoption. Future research directions focus on enhancing interpretability and explainability of model predictions to build trust among pathologists and clinicians. Additional efforts are needed to improve model robustness across diverse tissue preparation protocols, staining variations, and scanner types commonly encountered in real-world clinical settings.
A promising direction is the development of generalist medical AI systems that integrate pathology foundation models with FMs from other medical domains [14]. Such integrated systems could provide comprehensive diagnostic support by combining pathological findings with radiological, genomic, and clinical data, ultimately promoting precision and personalized medicine. As noted by researchers, "In the future, the development of generalist medical AI, which integrates pathology FMs with FMs from other medical domains, is expected to progress, effectively utilizing AI in real clinical settings to promote precision and personalized medicine" [14].
The clinical translation of UNI and similar foundation models also requires addressing regulatory considerations, standardization of deployment pipelines, and validation in multi-center trials. As these models continue to evolve, they hold tremendous potential to transform pathological practice by augmenting human expertise, reducing diagnostic variability, and uncovering novel morphological biomarkers that predict disease behavior and treatment response.
The analysis of histopathological images, particularly whole slide images (WSIs), is fundamental to cancer diagnosis, prognosis prediction, and treatment formulation. However, this field faces a critical challenge: the complex morphology of tissues, inconsistency of staining protocols, and, most importantly, the scarcity of pixel-level annotations required by supervised deep learning methods [18]. Annotating gigapixel WSIs is costly, time-consuming, and requires skilled pathologists, making it a significant bottleneck [18]. This limitation has motivated the exploration of alternative learning paradigms that can leverage the vast amounts of unlabeled histopathological images available in clinical archives [18].
Self-supervised learning (SSL) has emerged as a powerful solution to this annotation bottleneck [18]. SSL processes large quantities of unlabeled data by leveraging intrinsic data structures to create its own supervisory signals, learning robust feature representations without extensive manual labeling [19]. Recent advances in masked image modeling (MIM) and contrastive learning have shown remarkable success in natural image domains, and these benefits are now being effectively adapted to medical imaging tasks [18]. Within the context of major histopathology foundation models like Virchow, CONCH, and UNI, SSL provides the foundational pretraining that enables these models to achieve state-of-the-art performance across diverse downstream tasks with minimal fine-tuning [20]. This technical guide explores the core SSL methodologies, their implementation in leading foundation models, and their practical applications in histopathology research.
Self-supervised learning for histopathology primarily utilizes two complementary approaches: contrastive learning and masked image modeling. These methods learn robust feature representations by solving pretext tasks designed to capture essential histological features without manual labels.
Contrastive Learning aims to learn an embedding space where similar sample pairs are positioned close together while dissimilar pairs are far apart. In digital histopathology, this approach has been successfully applied at scale. One large-scale study pretrained models on 57 histopathology datasets without labels, finding that combining multiple multi-organ datasets with different staining and resolution properties improved learned feature quality [19]. The study also revealed that using more images for pretraining leads to better downstream task performance, albeit with diminishing returns after approximately 50,000 images [19]. This approach enables models to learn features invariant to technical variations like staining protocols while capturing biologically relevant morphological patterns.
Masked Image Modeling (MIM) has recently shown remarkable success in histopathology. Inspired by language modeling in natural language processing, MIM randomly masks portions of input images and trains models to reconstruct the missing content. This approach forces the model to learn meaningful representations of tissue structures and cellular relationships. For histopathology, domain-specific knowledge can be incorporated into the masking strategy to produce more meaningful self-supervised representations [18]. The GMIM framework extends this concept with adaptive and hierarchical masked image modeling, bringing the benefits of masked modeling to volumetric medical images [18].
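A minimal sketch of the masking-and-reconstruction idea is shown below, in a SimMIM-like form: a fraction of patch tokens is replaced with a learned mask token, a small transformer encodes the full sequence, and the loss is computed only on the masked positions. The toy encoder, dimensions, and mask ratio are placeholders and do not reproduce the GMIM architecture described above.

```python
import torch
import torch.nn as nn

class TinyMIM(nn.Module):
    """Minimal masked-image-modeling sketch: mask a fraction of patch tokens with a
    learned mask token, encode the sequence, and reconstruct the masked patches only."""
    def __init__(self, n_patches=196, patch_dim=768, width=256, mask_ratio=0.6):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, width)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, width))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(width, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.Linear(width, patch_dim)

    def forward(self, patches):                      # patches: (B, N, patch_dim)
        B, N, _ = patches.shape
        mask = torch.rand(B, N, device=patches.device) < self.mask_ratio  # True = masked
        tokens = self.embed(patches)
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, -1), tokens)
        recon = self.decoder(self.encoder(tokens))
        # Reconstruction loss is computed on masked positions only
        return ((recon - patches) ** 2)[mask].mean()

print(float(TinyMIM()(torch.randn(4, 196, 768))))
```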
Hybrid approaches that combine multiple SSL techniques have demonstrated superior performance. One novel framework integrates masked image modeling with contrastive learning and adaptive semantic-aware data augmentation [18]. This hybrid approach leverages MIM to reconstruct fine-grained tissue structures while using contrastive learning to enforce feature invariance across scales and staining variations [18]. The combination is particularly suited to multi-scale WSI analysis, where capturing both local cellular details and global tissue context is essential.
Recent comprehensive evaluations demonstrate the substantial improvements achieved by advanced SSL methods over traditional supervised approaches across multiple histopathology datasets. The following table summarizes key performance metrics:
Table 1: Performance Comparison of SSL Methods on Histopathology Segmentation Tasks
| Method | Dice Coefficient | mIoU | Hausdorff Distance | Annotation Efficiency |
|---|---|---|---|---|
| Proposed Hybrid SSL Framework [18] | 0.825 (4.3% improvement) | 0.742 (7.8% enhancement) | 10.7% reduction | 95.6% performance with only 25% labels |
| Supervised Baselines [18] | 0.791 | 0.688 | Baseline | 85.2% performance with 25% labels |
| Cross-Dataset Generalization [18] | - | - | - | 13.9% improvement over existing approaches |
Additional studies have confirmed these advantages. Models based on self-supervised contrastive learning have demonstrated excellent results on most primary sites and cancer subtypes, achieving state-of-the-art performance on validation tasks such as lung cancer classification [21]. Furthermore, linear classifiers trained on top of features learned from SSL pretraining on digital histopathology datasets perform significantly better than ImageNet-pretrained networks, boosting task performances by more than 28% in F1 scores on average [19].
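Because the linear-probing setup described above is a common evaluation pattern for SSL features, a minimal sketch may help make it concrete: a logistic-regression probe is trained on frozen embeddings while the encoder itself is never updated. The random arrays below stand in for real patch or slide embeddings and task labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Stand-in arrays: rows are frozen embeddings from a pretrained encoder,
# columns are embedding dimensions; labels are the downstream task classes.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 768)), rng.integers(0, 2, size=500)
X_test, y_test = rng.normal(size=(200, 768)), rng.integers(0, 2, size=200)

# Linear probe: the encoder stays frozen; only this classifier is trained.
probe = LogisticRegression(max_iter=1000, C=1.0)
probe.fit(X_train, y_train)
print("F1:", f1_score(y_test, probe.predict(X_test)))
```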
The field has recently witnessed the emergence of powerful foundation models pretrained using SSL on massive histopathology datasets. These models serve as versatile feature extractors adaptable to various downstream tasks. The following table compares key foundation models:
Table 2: Comparison of Major Histopathology Foundation Models
| Foundation Model | Training Data Scale | SSL Methodology | Key Capabilities | Clinical Applications |
|---|---|---|---|---|
| CONCH [5] | 1.17M image-caption pairs | Contrastive learning from captions | Image classification, segmentation, captioning, cross-modal retrieval | Diagnosis, biomarker prediction, treatment response prediction |
| UNI [18] | 100M+ images from 100,000+ WSIs | Self-supervised learning | General-purpose feature extraction across 20+ tissue types | Cancer subtyping, rare cancer detection, prognostic analysis |
| Virchow [18] | 1.5M WSIs from 100,000 patients | Self-supervised pretraining | Rare cancer detection, clinical-grade diagnostics | Surpasses supervised methods in low-resource settings |
| TITAN [7] | 335,645 WSIs + synthetic captions | Multimodal self-supervision & vision-language alignment | Slide representation, report generation, zero-shot classification | Rare disease retrieval, cancer prognosis, cross-modal search |
| Prov-GigaPath [18] | 1.3B pathology images | Masked autoencoding | Whole-slide feature learning | Generalizes across hundreds of cancer types and tasks |
These foundation models overcome critical limitations of earlier approaches. Unlike traditional supervised models that require extensive labeling for each specific task, foundation models leverage SSL to learn general-purpose representations adaptable to diverse applications with minimal fine-tuning [20]. For instance, CONCH demonstrates how integrating visual and textual information through contrastive learning enables more advanced comprehension of histopathological entities [5].
The S3L (Self-Supervised Whole Slide Learning) framework provides a general, flexible, and lightweight approach for gigapixel-scale self-supervision of WSIs [22]. S3L treats gigapixel WSIs as sequences of patch tokens and applies domain-informed vision-language transformations—including splitting, cropping, and masking—to generate high-quality views for self-supervised training [22].
The framework employs a two-stage architecture, pairing a patch encoder that tokenizes each WSI with a slide-level encoder trained through self-supervision on the transformed token sequences [22].
This approach effectively leverages the inherent regional heterogeneity, histologic feature variability, and information redundancy within WSIs to learn high-quality representations without extensive annotations [22]. Benchmarking experiments demonstrate that S3L significantly outperforms WSI baselines for cancer diagnosis and genetic mutation prediction, achieving good performance with both in-domain and out-of-distribution patch encoders [22].
Diagram 1: S3L Framework for Whole Slide Images
Comprehensive experimental evaluations demonstrate the effectiveness of SSL approaches. One recently proposed framework integrates three key innovations: a multi-resolution hierarchical architecture for gigapixel WSIs, a hybrid SSL strategy combining masked autoencoder reconstruction with multi-scale contrastive learning, and an adaptive augmentation network that preserves histological semantics [18].
The experimental protocol comprises three components: data preparation and preprocessing of the WSIs, a progressive fine-tuning protocol, and a defined set of evaluation metrics [18].
This framework demonstrated substantial improvements, achieving a Dice coefficient of 0.825 (4.3% improvement) and mIoU of 0.742 (7.8% enhancement), with significant reductions in boundary error metrics (10.7% in Hausdorff Distance, 9.5% in Average Surface Distance) [18]. Notably, the method exhibited exceptional data efficiency, requiring only 25% of labeled data to achieve 95.6% of full performance compared to 85.2% for supervised baselines, representing a 70% reduction in annotation requirements [18].
The TITAN (Transformer-based pathology Image and Text Alignment Network) framework implements a sophisticated multimodal pretraining approach for whole-slide images [7]. The experimental protocol consists of three distinct stages:
Stage 1: Vision-Only Unimodal Pretraining, in which the slide encoder is pretrained on ROI crops using the iBOT framework, combining masked image modeling with knowledge distillation [7].
Stage 2: ROI-Level Cross-Modal Alignment, in which ROI representations are aligned with 423,122 synthetic fine-grained morphological captions generated by a multimodal pathology AI copilot [7].
Stage 3: WSI-Level Cross-Modal Alignment, in which slide-level representations are aligned with 182,862 clinical pathology reports [7].
This multi-stage pretraining strategy enables TITAN to extract general-purpose slide representations that outperform both ROI and slide foundation models across diverse clinical tasks, including linear probing, few-shot and zero-shot classification, rare cancer retrieval, and pathology report generation [7].
Diagram 2: TITAN Multi-stage Pretraining Pipeline
Implementing SSL approaches for histopathology requires specific computational frameworks and data resources. The following table details key components of the research toolkit:
Table 3: Essential Research Reagents and Computational Resources for SSL in Histopathology
| Resource Category | Specific Examples | Function & Application | Availability |
|---|---|---|---|
| Foundation Models | CONCH, UNI, Virchow, TITAN | Pre-trained feature extractors for transfer learning | Publicly available (CONCH) or through research collaborations |
| SSL Frameworks | S3L, HIPT, Giga-SSL | Implement self-supervised learning for WSIs | GitHub repositories (e.g., S3L framework) |
| Patch Encoders | CONCHv1.5, DINOv2, ResNet | Encode individual image patches into feature representations | Open-source implementations |
| WSI Datasets | TCGA, CAMELYON, PanNuke | Benchmark datasets for training and evaluation | Publicly available with restrictions |
| Computational Resources | GPUs (≥ 16GB VRAM), High-CPU servers | Handle computational demands of gigapixel images | Research computing clusters |
| Digital Pathology Tools | QuPath, HALO, Whole Slide Scanners | Slide digitization, annotation, and preliminary analysis | Commercial and open-source options |
These resources enable researchers to implement and experiment with SSL approaches for histopathology. For instance, CONCH is publicly available on GitHub and can be installed via pip, providing researchers with direct access to state-of-the-art vision-language capabilities for histopathology [5]. The model demonstrates particular strength for non-H&E stained images such as IHCs and special stains, and can be used for diverse tasks involving histopathology images and text [5].
Self-supervised learning has fundamentally transformed the analysis of whole slide images in computational pathology. By effectively leveraging vast amounts of unlabeled data, SSL approaches address the critical annotation bottleneck that has long constrained the development of robust AI systems for histopathology. Through techniques like contrastive learning, masked image modeling, and their hybrid implementations, SSL enables models to learn rich, transferable representations of histological features that form the foundation for diverse downstream tasks.
The emergence of powerful foundation models like CONCH, UNI, Virchow, and TITAN represents a paradigm shift in computational pathology. These models, pretrained using SSL on massive datasets, demonstrate exceptional versatility across classification, segmentation, retrieval, and generative tasks while significantly reducing annotation requirements. The integration of multimodal capabilities, particularly vision-language alignment, further enhances their utility in clinical and research settings.
Future research directions include developing more efficient architectures for processing gigapixel images, improving model interpretability for clinical adoption, and enhancing generalization across diverse patient populations and tissue types. As SSL methodologies continue to evolve alongside foundation models, they hold tremendous potential to accelerate histopathology research, support clinical decision-making, and ultimately advance precision medicine through more accessible and powerful computational tools.
The emergence of foundation models is fundamentally transforming computational pathology by enabling the development of artificial intelligence (AI) tools that can interpret gigapixel whole-slide images (WSIs) for tasks ranging from cancer diagnosis and biomarker prediction to prognosis estimation [7] [9]. Unlike earlier AI models designed for a single, specific task, foundation models are trained on massive, diverse datasets without explicit labels, learning general-purpose representations of histopathology images. These representations can then be adapted with high efficiency to a wide array of downstream clinical and research applications, even those with very limited annotated data [2] [23]. The performance and generalizability of these models are critically dependent on their pretraining—the initial phase where the model learns the fundamental patterns of histology data. This phase varies significantly across models in terms of the scale of data, the learning algorithms employed, and the modalities used (e.g., images alone or images paired with text) [9].
This technical guide provides a comparative analysis of three pivotal foundation models—Virchow, CONCH, and UNI—each representing a distinct paradigm in pretraining strategy. Virchow exemplifies the "scale-up" approach, leveraging millions of WSIs for self-supervised visual learning [2] [24]. CONCH pioneers the vision-language pathway, aligning histopathology images with textual descriptions to capture semantic concepts [12]. UNI establishes a general-purpose visual encoder by exploring scaling laws and demonstrating robust performance across dozens of clinical tasks [23]. Understanding their core architectures, training data, and experimental benchmarks is essential for researchers and drug development professionals aiming to leverage these models in oncology and anatomic pathology.
The Virchow model family is architected around the principle of scaling both data and model size using visual self-supervised learning. The original Virchow is a 632 million parameter Vision Transformer (ViT) trained on approximately 1.5 million H&E-stained WSIs, corresponding to roughly 2 billion image tiles [2] [9]. Its successor, Virchow 2, scales this further to 3.1 million WSIs and introduces a giant 1.85 billion parameter variant (Virchow 2G), trained on a mixed-magnification dataset that includes both H&E and immunohistochemistry (IHC) stains [24].
The pretraining methodology for Virchow employs DINOv2 (self-DIstillation with NO labels), a robust self-supervised algorithm. DINOv2 uses a student-teacher network structure where the student learns to match the output of the teacher when presented with different augmented "views" of the same image tile. This process encourages the model to learn representations that are invariant to perturbations like staining variations and cropping, focusing on biologically relevant morphological features [2] [24]. A key domain-specific adaptation in Virchow 2 is the modification of the DINOv2 augmentation policy. It omits image solarization, which can generate unrealistic color profiles in pathology, and carefully tunes the random crop-and-resize operation to minimize unwanted distortions to critical cellular and tissue structures [24].
CONCH (CONtrastive learning from Captions for Histopathology) represents a paradigm shift by integrating language and vision. It is a visual-language foundation model pretrained on over 1.17 million histopathology image-caption pairs, one of the largest such datasets in the domain [12] [5].
The model's architecture is based on the CoCa (Contrastive Captioners) framework. It consists of three core components: an image encoder that embeds histology tiles, a text encoder that embeds captions and report text, and a multimodal text decoder that fuses the two modalities for caption generation [12].
CONCH is trained with a combination of two objectives. The first is a contrastive alignment objective that pulls the embeddings of a histology image and its correct description closer in a shared representation space while pushing it away from unrelated texts. The second is a captioning objective that teaches the model to generate accurate textual descriptions given an image [12]. This dual approach allows CONCH to develop a rich, semantically grounded understanding of histopathologic entities, enabling capabilities like zero-shot classification and cross-modal retrieval without any task-specific fine-tuning.
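A compact sketch of this dual objective, in the spirit of CoCa, is given below: a symmetric image-text contrastive loss plus a next-token cross-entropy captioning loss. The encoders and decoder are represented only by their output tensors, and the temperature and loss weighting are illustrative assumptions rather than CONCH's published hyperparameters.

```python
import torch
import torch.nn.functional as F

def coca_style_loss(img_emb, txt_emb, caption_logits, caption_tokens,
                    temperature=0.07, caption_weight=1.0):
    """Sketch of a CoCa-style objective: symmetric image-text contrastive alignment
    plus cross-entropy over caption tokens produced by a multimodal decoder."""
    img_emb = F.normalize(img_emb, dim=-1)           # (B, D) image embeddings
    txt_emb = F.normalize(txt_emb, dim=-1)           # (B, D) caption embeddings
    logits = img_emb @ txt_emb.t() / temperature     # (B, B) pairwise similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))
    # Captioning term: predict each caption token from the decoder logits
    captioning = F.cross_entropy(caption_logits.flatten(0, 1), caption_tokens.flatten())
    return contrastive + caption_weight * captioning

B, D, T, V = 8, 512, 16, 1000   # batch size, embed dim, caption length, vocab size
loss = coca_style_loss(torch.randn(B, D), torch.randn(B, D),
                       torch.randn(B, T, V), torch.randint(0, V, (B, T)))
print(float(loss))
```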
UNI is designed as a general-purpose, self-supervised vision encoder for pathology. It is a ViT-Large model (303 million parameters) pretrained on the "Mass-100K" dataset, which contains over 100 million tissue patches from more than 100,000 diagnostic H&E-stained WSIs across 20 major tissue types [23] [9].
Similar to Virchow, UNI uses the DINOv2 self-supervised learning algorithm for pretraining. A significant contribution of the UNI project was its systematic investigation of scaling laws in computational pathology. The researchers ablated the model by pretraining on subsets of their data (Mass-1K and Mass-22K) and with different model sizes (ViT-Base and ViT-Large). They demonstrated that downstream performance on complex tasks like rare cancer classification improved monotonically with both increased data scale and model size, establishing clear scaling laws for the field [23]. This work provides empirical evidence that the "foundation model" paradigm holds for histopathology, with larger, more diverse datasets yielding more robust and generalizable feature representations.
Table 1: Comparative Overview of Core Model Architectures and Pretraining Data
| Feature | Virchow / Virchow 2 | CONCH | UNI |
|---|---|---|---|
| Core Paradigm | Scale-up visual self-supervision | Vision-language alignment | General-purpose visual encoder |
| Model Architecture | ViT-H (632M) / ViT-G (1.85B) | ViT-based Image & Text Encoders + Multimodal Decoder | ViT-L (303M) |
| Pretraining Algorithm | DINOv2 (with domain adaptations) | Contrastive Learning + Captioning (Based on CoCa) | DINOv2 |
| Pretraining Data Scale | 1.5M - 3.1M WSIs; ~2B tiles [2] [24] | 1.17M image-caption pairs [12] | 100,426 WSIs; >100M tiles [23] |
| Data Modality | H&E (Virchow) + IHC (Virchow 2) | H&E images + Text captions/reports | H&E-stained WSIs |
| Key Innovation | Massive scale, mixed magnification, domain-specific augmentations | Multimodal pretraining for zero-shot capabilities | Establishing scaling laws for data and model size |
A central experiment validating Virchow's capability was its evaluation on pan-cancer detection. The goal was to train a single model to detect cancer across a wide range of tissue types, including both common and rare cancers.
CONCH's utility was demonstrated through a suite of zero-shot evaluations, where the pretrained model was applied to downstream tasks without any further fine-tuning using task-specific labels.
UNI was subjected to one of the most extensive evaluations to establish its general-purpose nature, being tested on 34 distinct computational pathology tasks.
Table 2: Summary of Key Experimental Results and Performance Benchmarks
| Model | Primary Experiment | Key Metric | Reported Performance | Comparative Advantage |
|---|---|---|---|---|
| Virchow | Pan-cancer & rare cancer detection | Specimen-level AUROC | 0.950 overall; 0.937 on rare cancers [2] | High performance on rare cancers and out-of-distribution data |
| CONCH | Zero-shot cancer subtyping (TCGA) | Zero-shot Accuracy | 90.7% (NSCLC), 91.3% (BRCA) [12] | State-of-the-art zero-shot transfer, no task-specific training needed |
| UNI | OncoTree 108-class cancer classification | Top-1 Accuracy | Outperformed CTransPath/REMEDIS by a wide margin [23] | Superior generalization across a massive number of cancer types |
The following diagram illustrates the two-stage process of training the Virchow foundation model and applying it to pan-cancer detection.
This diagram outlines CONCH's multimodal pretraining process and its application to zero-shot classification tasks.
For researchers aiming to implement or build upon these foundation models, the following table details key computational "reagents" and their functions as derived from the featured studies.
Table 3: Key Research Reagents and Computational Tools in Pathology Foundation Models
| Research Reagent / Tool | Function in Experimental Workflow | Example Usage in Featured Studies |
|---|---|---|
| Self-Supervised Learning (SSL) Algorithms (DINOv2, iBOT) | Enables pretraining of neural networks on unlabeled image data by constructing a pretext task, such as matching different views of an image. | Core training algorithm for Virchow, Virchow 2, and UNI [2] [24] [23]. |
| Vision Transformer (ViT) Architecture | A neural network architecture that processes images as sequences of patches, using self-attention to model global context. Scales to billions of parameters. | Backbone architecture for Virchow (ViT-H/G), UNI (ViT-L), and CONCH's image encoder [2] [12] [23]. |
| Attention-Based Multiple Instance Learning (ABMIL) | A weakly supervised learning method for whole-slide classification. It aggregates features from thousands of tiles, learning to weight the importance of each tile. | Used for slide-level cancer detection and subtyping in Virchow and UNI evaluations [2] [23]. |
| Whole-Slide Image (WSI) Datasets (TCGA, Mass-100K, MSKCC) | Large-scale, often multi-institutional, collections of digitized histopathology slides. Provide the raw data for pretraining and benchmarking. | TCGA used for benchmarking; Mass-100K for pretraining UNI; MSKCC's 1.5M+ slides for Virchow [2] [23] [9]. |
| Contrastive Vision-Language Pretraining | A training paradigm that learns aligned representations of images and text by contrasting positive (matching) and negative (non-matching) image-text pairs. | The core pretraining methodology for the CONCH model [12]. |
The comparative analysis of Virchow, CONCH, and UNI reveals distinct and complementary pathways for building foundation models in computational pathology. Virchow demonstrates the profound impact of scaling up data and model size for visual self-supervision, achieving remarkable performance in detecting both common and rare cancers. CONCH unlocks a new frontier with multimodal learning, offering unparalleled flexibility through zero-shot reasoning and cross-modal retrieval by grounding visual patterns in semantic language. UNI provides a robust general-purpose visual encoder and crucially establishes the scaling laws that govern model performance in this domain.
For the research and drug development community, the choice of model depends heavily on the target application. Virchow's lineage is ideal for high-performance, clinical-grade detection and diagnosis tasks. CONCH is uniquely suited for exploratory research, knowledge retrieval, and settings where labeled data is extremely scarce. UNI offers a powerful and extensively validated off-the-shelf feature extractor for a wide array of supervised tasks. As the field progresses, the fusion of these approaches—combining the scale of Virchow, the semantic understanding of CONCH, and the generalizability of UNI—will likely pave the way for the next generation of AI-powered pathologic tools.
The advent of whole-slide imaging has transformed pathology by enabling the application of artificial intelligence to digitized tissue samples. Foundation models, pre-trained on massive datasets using self-supervised learning, have emerged as powerful tools for extracting meaningful representations from histopathology images without task-specific labels [1]. These models capture diverse morphological patterns including cellular morphology, tissue architecture, staining characteristics, nuclear atypia, and biomarker expression, making them suitable for predicting various whole-slide image characteristics [4] [25]. Unlike traditional deep learning models that require curated labels and are designed for single tasks, foundation models are trained on broad data and can be adapted to a wide range of downstream tasks, offering superior expressiveness and scalability [1]. This technical guide explores the slide encoding and feature extraction workflows of three prominent pathology foundation models—Virchow, CONCH, and UNI—framed within the context of histopathology research applications.
Virchow represents a vision-only foundation model based on a 632 million parameter vision transformer architecture trained using the DINOv2 algorithm [4] [25]. The model was trained on 1.5 million hematoxylin and eosin stained whole-slide images from diverse tissue groups, which is orders of magnitude more data than previous works [26]. The DINOv2 algorithm employs a multi-view student-teacher self-supervised approach that leverages global and local regions of tissue tiles to learn embeddings of whole-slide image tiles [4]. Virchow2, an enhanced version, scales both data and model size further, trained on 3.1 million histopathology whole-slide images with diverse tissues, originating institutions, and stains [27].
CONCH (CONtrastive learning from Captions for Histopathology) adopts a vision-language foundation model approach, pre-trained on diverse sources of histopathology images, biomedical text, and over 1.17 million image-caption pairs via task-agnostic pre-training [5]. Unlike vision-only models, CONCH demonstrates capabilities across both histopathology images and text, achieving state-of-the-art performance on histology image classification, segmentation, captioning, text-to-image, and image-to-text retrieval [5]. The model enables multimodal applications without requiring extensive fine-tuning for specific downstream tasks.
UNI represents a general-purpose foundation model for computational pathology trained using self-supervised learning on approximately 100,000 whole-slide images [3] [28]. The model employs the DINOv2 algorithm to train a robust visual encoder on one billion patches, creating versatile representations transferable to various downstream tasks [28]. UNI has demonstrated strong performance across multiple computational pathology applications including tumor classification, survival analysis, and biomarker prediction.
Table 1: Comparative Overview of Pathology Foundation Models
| Model | Architecture | Parameters | Training Data | Modality | Key Features |
|---|---|---|---|---|---|
| Virchow | Vision Transformer (ViT) | 632 million | 1.5M WSIs [4] | Vision-only | DINOv2 algorithm, pan-cancer detection |
| Virchow2 | Vision Transformer (ViT) | 632M (Virchow2) / 1.9B (Virchow2G) | 3.1M WSIs [27] | Vision-only | Mixed magnification, domain-specific augmentations |
| CONCH | Vision-Language | Not specified | 1.17M image-caption pairs [5] | Multimodal | Contrastive learning, text-image capabilities |
| UNI | Vision Transformer (ViT) | Not specified | ~100K WSIs [3] [28] | Vision-only | DINOv2 algorithm, general-purpose encoder |
Comprehensive benchmarking of histopathology foundation models reveals distinct performance patterns across various tasks. In a systematic evaluation of 19 foundation models on 31 clinically relevant tasks involving 6,818 patients and 9,528 slides, CONCH and Virchow2 demonstrated superior overall performance [6]. For morphology-related tasks, CONCH achieved the highest mean area under the receiver operating characteristic curve of 0.77, followed closely by Virchow2 at 0.76 [6]. Across biomarker-related tasks, both Virchow2 and CONCH achieved the highest mean AUROCs of 0.73, while for prognostic-related tasks, CONCH yielded the highest mean AUROC of 0.63, followed by Virchow2 at 0.61 [6].
Notably, vision-language models like CONCH performed comparably to vision-only models trained on significantly larger datasets, with CONCH matching Virchow2 despite Virchow2 being trained on 3.1 million whole-slide images compared to CONCH's 1.17 million image-caption pairs [6]. This suggests that data diversity and multimodal training may provide advantages over simply scaling dataset size. Ensemble approaches combining multiple foundation models have shown further performance improvements, with CONCH and Virchow2 ensembles outperforming individual models in 55% of tasks [6].
Table 2: Performance Comparison Across Task Types (Mean AUROC)
| Model | Morphology Tasks | Biomarker Tasks | Prognosis Tasks | Overall Average |
|---|---|---|---|---|
| CONCH | 0.77 [6] | 0.73 [6] | 0.63 [6] | 0.71 [6] |
| Virchow2 | 0.76 [6] | 0.73 [6] | 0.61 [6] | 0.71 [6] |
| Prov-GigaPath | Not specified | 0.72 [6] | Not specified | 0.69 [6] |
| DinoSSLPath | 0.76 [6] | Not specified | Not specified | 0.69 [6] |
| UNI | Not specified | Not specified | Not specified | 0.68 [6] |
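One straightforward way to realize the model ensembles noted above is late fusion: slide-level embeddings from two frozen foundation models are concatenated before a single downstream classifier is trained. The sketch below uses random arrays as stand-ins for CONCH- and Virchow2-derived slide embeddings; the benchmark's actual ensembling scheme may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for slide-level embeddings from two frozen foundation models
# (real pipelines would first aggregate patch features to one vector per slide).
rng = np.random.default_rng(1)
emb_model_a = rng.normal(size=(300, 512))    # slides x dims from model A
emb_model_b = rng.normal(size=(300, 1280))   # slides x dims from model B
labels = rng.integers(0, 2, size=300)

# Simple late-fusion ensemble: concatenate embeddings, then train one classifier.
fused = np.concatenate([emb_model_a, emb_model_b], axis=1)
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print("train accuracy:", clf.score(fused, labels))
```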
The processing of whole-slide images for feature extraction follows a standardized pipeline regardless of the specific foundation model employed. Whole-slide images are first tessellated into small, non-overlapping patches, as these gigapixel images cannot be processed directly by neural networks [6]. Typical patch sizes range from 256×256 to 512×512 pixels at 20× magnification [7]. These patches are then fed through the foundation model to extract informative feature embeddings for each patch.
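The tessellation step can be implemented with the OpenSlide library, as in the sketch below, which iterates over non-overlapping 256×256 tiles at a chosen pyramid level and applies a simple saturation-based background filter. The file path, tile size, and tissue-filter threshold are illustrative placeholders rather than settings from the cited pipelines.

```python
import numpy as np
import openslide

def extract_tiles(wsi_path, tile_size=256, level=0, tissue_threshold=0.2):
    """Tessellate a WSI into non-overlapping tiles, skipping mostly-background tiles.
    The saturation-based tissue filter is a simple illustrative heuristic."""
    slide = openslide.OpenSlide(wsi_path)
    width, height = slide.level_dimensions[level]
    scale = slide.level_downsamples[level]
    for y in range(0, height - tile_size + 1, tile_size):
        for x in range(0, width - tile_size + 1, tile_size):
            # read_region expects level-0 coordinates for its location argument
            tile = slide.read_region((int(x * scale), int(y * scale)),
                                     level, (tile_size, tile_size)).convert("RGB")
            rgb = np.asarray(tile).astype(np.float32) / 255.0
            saturation = rgb.max(axis=-1) - rgb.min(axis=-1)
            if (saturation > 0.05).mean() >= tissue_threshold:   # keep tissue-rich tiles
                yield (x, y), tile

# Illustrative usage; "slide.svs" is a placeholder path.
# for (x, y), tile in extract_tiles("slide.svs"):
#     tile.save(f"tiles/{x}_{y}.png")
```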
Diagram 1: Whole-Slide Image Processing Workflow
Virchow Feature Extraction employs the DINOv2 algorithm, a self-distillation method without labels that uses a student-teacher framework [4] [27]. The process involves creating different augmented views of input patches, with the student network trained to match the output of the teacher network for different views of the same image [27]. Virchow incorporates domain-specific adaptations, including omission of the solarization augmentation, which can produce unrealistic color profiles in pathology images, and careful tuning of the random crop-and-resize operation to preserve critical cellular and tissue structures [27].
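The self-distillation step at the heart of DINO-style training can be sketched as follows: the teacher is a gradient-free exponential-moving-average copy of the student, and the student is trained to match the teacher's sharpened output distribution for a different augmented view of the same tile. The tiny MLP backbone, temperatures, and momentum value are placeholders, and DINOv2-specific components such as output centering, the multi-crop schedule, and the iBOT head are omitted.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(768, 256), nn.GELU(), nn.Linear(256, 128))
teacher = copy.deepcopy(student)            # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)                 # teacher is never updated by gradients

def distillation_step(view_a, view_b, t_student=0.1, t_teacher=0.04, momentum=0.996):
    """One DINO-style step: the student matches the teacher's sharpened distribution
    produced from a different augmented view of the same tile (optimizer step omitted)."""
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(view_a) / t_teacher, dim=-1)
    student_logprobs = F.log_softmax(student(view_b) / t_student, dim=-1)
    loss = -(teacher_probs * student_logprobs).sum(dim=-1).mean()
    loss.backward()
    with torch.no_grad():                   # EMA update of the teacher weights
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(momentum).add_(sp, alpha=1 - momentum)
    return loss

print(float(distillation_step(torch.randn(16, 768), torch.randn(16, 768))))
```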
CONCH Feature Extraction leverages contrastive learning between image patches and corresponding textual captions [5]. The model is trained using a multimodal approach that aligns visual representations with pathological descriptions, enabling both image and text understanding. This unique approach allows CONCH to generate features that capture morphological patterns described in pathological literature and reports.
UNI Feature Extraction utilizes the DINOv2 self-supervised learning framework similar to Virchow but trained on different datasets [3] [28]. The model processes patches extracted from whole-slide images and generates feature embeddings that capture histopathological patterns without requiring task-specific labels during pre-training.
While patch-level features provide local information, many clinical applications require slide-level predictions. Converting patch embeddings to slide representations typically employs multiple instance learning frameworks such as ABMIL, TransMIL, or CLAM, which learn to weight and aggregate patch features into a single slide-level prediction [28]; a minimal attention-based aggregation sketch follows the diagram below.
Diagram 2: Slide-Level Representation Learning
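A minimal attention-based MIL head in the style of ABMIL is sketched below: a small network scores each patch embedding, the scores are normalized into attention weights, and the weighted average forms the slide representation that feeds a classifier. The dimensions and single-slide (bag) interface are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ABMIL(nn.Module):
    """Minimal attention-based MIL head: learns per-patch attention weights and
    pools patch embeddings into one slide-level representation for classification."""
    def __init__(self, in_dim=768, attn_dim=128, n_classes=2):
        super().__init__()
        self.attention = nn.Sequential(nn.Linear(in_dim, attn_dim), nn.Tanh(),
                                       nn.Linear(attn_dim, 1))
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, patch_feats):                 # (N_patches, in_dim), one slide
        weights = torch.softmax(self.attention(patch_feats), dim=0)  # (N, 1)
        slide_feat = (weights * patch_feats).sum(dim=0)              # (in_dim,)
        return self.classifier(slide_feat), weights.squeeze(-1)

# Example: 1,000 patch embeddings from a frozen foundation model for one slide
logits, attn = ABMIL()(torch.randn(1000, 768))
print(logits.shape, attn.shape)
```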
More advanced approaches like TITAN (Transformer-based pathology Image and Text Alignment Network) directly learn slide-level representations through a vision-language pretraining paradigm [7]. TITAN processes a sequence of patch features encoded by histology patch encoders like CONCH, arranged in a two-dimensional feature grid replicating the spatial positions of corresponding patches within the tissue [7]. The model uses attention with linear bias for long-context extrapolation, where the linear bias is based on the relative Euclidean distance between features in the feature grid [7].
Comprehensive evaluation of pathology foundation models follows standardized benchmarking protocols. The most rigorous benchmarks assess models across multiple dimensions, including morphology-related classification, biomarker prediction, and prognostic tasks [6].
A typical benchmarking protocol involves extracting patch embeddings with each frozen foundation model, aggregating them into slide-level predictions with a common weakly supervised learner, and comparing models under identical training, validation, and test splits [6].
Foundation models are particularly valuable when labeled data is scarce. Benchmarking experiments typically evaluate performance with varying training set sizes (e.g., 75, 150, and 300 patients) while maintaining similar positive-to-negative sample ratios [6]. In such low-data scenarios, different foundation models demonstrate varying performance characteristics.
These findings suggest that the optimal foundation model choice depends on the amount of labeled data available for specific downstream tasks.
Implementing slide encoding and feature extraction workflows requires specific computational tools and resources. The following table outlines key "research reagent solutions" essential for working with pathology foundation models:
Table 3: Essential Research Reagents for Slide Encoding Workflows
| Research Reagent | Function | Example Implementations | Application Context |
|---|---|---|---|
| Patch Extraction Tools | Divides WSIs into processable patches | OpenSlide, ASAP, HistomicsTK | Preprocessing for all foundation models |
| Feature Extractors | Generates embeddings from image patches | CONCH, Virchow, UNI model weights | Core feature extraction step |
| Multiple Instance Learning Frameworks | Aggregates patch features to slide-level predictions | ABMIL, TransMIL, CLAM [28] | Downstream task implementation |
| Benchmark Datasets | Standardized model evaluation | Camelyon+ [28], TCGA, Internal cohorts | Performance validation |
| Whole-Slide Encoders | Direct slide-level representation learning | TITAN [7], Prism [28] | End-to-end slide encoding |
Choosing the appropriate foundation model depends on specific research requirements: vision-only encoders such as Virchow and UNI provide extensively validated off-the-shelf feature extractors for supervised downstream tasks, while vision-language models such as CONCH are better suited to zero-shot classification, cross-modal retrieval, and settings where labeled data is scarce [5] [6].
Processing whole-slide images demands significant computational resources. A typical workflow requires GPUs with at least 16 GB of VRAM for feature extraction, high-CPU servers for tile tessellation and I/O-heavy preprocessing, and substantial storage for gigapixel images and cached embeddings [6].
The field of pathology foundation models continues to evolve rapidly. Emerging trends include whole-slide encoders that learn slide-level representations directly, multimodal vision-language pretraining, and ensembles of complementary foundation models [6] [7].
As these technologies mature, pathology foundation models are expected to become increasingly integral to research and clinical applications, potentially enabling generalist medical AI systems that integrate pathology with other medical domains [1].
The field of computational pathology has been transformed by foundation models trained on massive datasets of histopathology images. These models generate powerful feature representations (embeddings) that can be adapted to diverse diagnostic tasks without task-specific labels, addressing a critical limitation in healthcare where annotated medical data is scarce [1]. Foundation models like Virchow, CONCH, and UNI represent a paradigm shift from traditional single-task models to versatile AI systems capable of supporting clinical decision-making across cancer diagnosis, prognosis, and biomarker prediction [1]. Their development is fueled by advances in self-supervised learning algorithms that leverage broad data at scale, enabling applications ranging from cancer detection to genomic correlation analysis [1]. For researchers and drug development professionals, these models offer powerful tools for accelerating oncology research and developing precision medicine applications.
Virchow is a 632 million parameter Vision Transformer (ViT) model trained using the DINOv2 self-supervised learning algorithm on approximately 1.5 million hematoxylin and eosin (H&E) stained whole slide images (WSIs) from 100,000 patients at Memorial Sloan Kettering Cancer Center [10] [25] [4]. This dataset represents a 4-10× scale increase over previous pathology model training sets and includes diverse tissue types from 17 major organs, with samples obtained via biopsy (63%) and resection (37%) procedures [25] [4]. The DINOv2 framework employs a student-teacher network structure that learns embeddings by comparing multiple augmented views of tissue tiles, capturing both global tissue architecture and local cellular morphology without requiring manual annotations [25] [4].
Virchow's performance stems from two key scaling advantages: dataset size and model parameters. Prior pathology models typically utilized 30,000-400,000 WSIs with 28-307 million parameters, while Virchow's 1.5 million WSIs and 632 million parameters establish new frontiers for computational pathology [25]. This scale enables the model to capture a comprehensive spectrum of histopathological patterns including cellular morphology, tissue architecture, nuclear atypia, mitotic figures, necrosis, inflammatory response, and neovascularization [4]. The model's embeddings effectively distill these morphological features into compact vector representations that serve as input for downstream predictive tasks through transfer learning [25] [4].
Figure 1: Virchow Model Training and Application Workflow
The pan-cancer detection system was evaluated on a comprehensive dataset comprising whole slide images from 17 different cancer types, including both common and rare malignancies [25] [4]. Rare cancers were defined according to the National Cancer Institute criteria as those with an annual incidence of fewer than 15 cases per 100,000 people [25]. The evaluation framework utilized specimen-level labels and assessed performance using the area under the receiver operating characteristic curve (AUC) as the primary metric, with additional analysis of sensitivity and specificity at predetermined thresholds [25] [4]. The test dataset included internal slides from MSKCC as well as external consultation slides from numerous global institutions, enabling robust assessment of out-of-distribution generalization [25].
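The AUC-with-operating-point evaluation described above can be reproduced with scikit-learn, as in the sketch below, which computes a specimen-level AUROC and then reads off the specificity at the threshold where sensitivity first reaches 95%. The labels and scores are random stand-ins, not data from the study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(42)
labels = rng.integers(0, 2, size=500)                    # 1 = cancer specimen
scores = labels * 0.6 + rng.normal(0, 0.4, size=500)     # stand-in model scores

auc = roc_auc_score(labels, scores)
fpr, tpr, thresholds = roc_curve(labels, scores)
# First threshold index whose sensitivity (TPR) reaches the 95% operating point
idx = np.argmax(tpr >= 0.95)
specificity_at_95_sens = 1.0 - fpr[idx]
print(f"AUROC={auc:.3f}, specificity at 95% sensitivity={specificity_at_95_sens:.3f}")
```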
Virchow's performance was systematically compared against three leading pathology foundation models: UNI, Phikon, and CTransPath [25] [4]. The benchmarking protocol maintained identical training procedures for all embeddings, ensuring fair comparison across architectures. For each model, tile-level embeddings were aggregated to slide-level predictions using weakly supervised learning approaches, with consistent hyperparameter tuning and validation splits across all experiments [25]. This rigorous evaluation methodology enabled direct assessment of how scaling laws impact real-world clinical performance across diverse cancer types and tissue origins.
Virchow achieved state-of-the-art performance in pan-cancer detection, demonstrating statistically significant improvements over all benchmarked models (p < 0.0001) [25]. The model attained an overall specimen-level AUC of 0.949 across 17 cancer types, outperforming UNI (0.940 AUC), Phikon (0.932 AUC), and CTransPath (0.907 AUC) [25]. At a clinically relevant 95% sensitivity threshold, Virchow achieved 72.5% specificity, substantially exceeding the specificity of UNI (68.9%), Phikon (62.9%), and CTransPath (52.3%) [25]. This performance advantage persisted across both common and rare cancer types, demonstrating Virchow's robust generalization capabilities.
Table 1: Overall Pan-Cancer Detection Performance Comparison
| Foundation Model | Overall AUC | Rare Cancer AUC | Specificity at 95% Sensitivity | Training Dataset Size |
|---|---|---|---|---|
| Virchow | 0.949 | 0.937 | 72.5% | 1.5M WSIs |
| UNI | 0.940 | Not reported | 68.9% | ~100K WSIs |
| Phikon | 0.932 | Not reported | 62.9% | Not reported |
| CTransPath | 0.907 | Not reported | 52.3% | 150M patches |
Virchow demonstrated consistent performance improvements across both common and rare cancer types, with particularly notable gains in challenging diagnostic scenarios [25]. For common cancers, Virchow achieved or exceeded state-of-the-art performance across all nine evaluated types, including breast, prostate, lung, and colorectal cancers [25]. For rare cancers, Virchow attained an aggregate AUC of 0.937 across seven cancer types, significantly outperforming alternative approaches [25]. The model showed particular strength in improving detection of challenging rare cancers such as cervical cancer (0.875 AUC vs. 0.830 for UNI) and bone cancer (0.841 AUC vs. 0.813 for UNI) [25].
Table 2: Selected Cancer-Type Specific Performance Metrics
| Cancer Type | Virchow AUC | UNI AUC | Phikon AUC | CTransPath AUC | Category |
|---|---|---|---|---|---|
| All Cancers | 0.949 | 0.940 | 0.932 | 0.907 | Aggregate |
| Rare Cancers | 0.937 | Not reported | Not reported | Not reported | Aggregate |
| Cervix | 0.875 | 0.830 | 0.810 | 0.753 | Rare |
| Bone | 0.841 | 0.813 | 0.822 | 0.728 | Rare |
| Breast | 0.985 | 0.981 | 0.977 | 0.960 | Common |
| Lung | 0.951 | 0.946 | 0.938 | 0.917 | Common |
UNI represents another major approach to pathology foundation models, employing self-supervised DINO-based pretraining on approximately one billion histology image patches from 100,000 whole slide images [28]. This patch-based methodology focuses on learning robust visual representations at the cellular and tissue region level, demonstrating strong performance across multiple downstream tasks including tumor classification, survival analysis, and segmentation [28]. While UNI achieves impressive performance with 0.940 AUC in pan-cancer detection, it falls slightly short of Virchow's 0.949 AUC, potentially due to the latter's larger training dataset and whole-slide level optimization [25].
CONCH (CONtrastive learning from Captions for Histopathology) introduces multimodal capabilities to computational pathology through vision-language pretraining on 1.17 million image-caption pairs [5]. This approach enables cross-modal applications including text-to-image retrieval, image captioning, and zero-shot classification by aligning visual features with pathological concepts described in text [5]. CONCH demonstrates that incorporating linguistic context alongside visual information creates more versatile representations applicable to both visual and language-based tasks in histopathology [5]. The model has served as the foundation for subsequent developments including TITAN, a multimodal whole-slide foundation model that extends these capabilities to whole-slide analysis [7] [5].
Table 3: Essential Research Reagents for Pathology Foundation Model Implementation
| Resource Category | Specific Tools | Function in Research Pipeline |
|---|---|---|
| Feature Extractors | Virchow, CONCH, UNI, CTransPath | Generate embeddings from whole slide images for downstream tasks |
| Annotation Platforms | ASAP (Automated Slide Analysis Platform) | Create pixel-level annotations and validate slide labels |
| Benchmark Datasets | Camelyon+ (cleaned version) | Provide standardized evaluation for metastasis detection and method comparison |
| Multiple Instance Learning Frameworks | ABMIL, TransMIL, CLAM | Aggregate tile-level embeddings to slide-level predictions |
| Computational Pathology Libraries | Python, PyTorch, OpenSlide | Enable whole slide image processing and deep learning implementation |
Figure 2: Technical Implementation Pipeline for Pathology Foundation Models
Virchow's demonstration of 0.949 AUC in pan-cancer detection across 17 cancer types represents a significant milestone in computational pathology, highlighting the critical importance of dataset and model scale in medical artificial intelligence [25] [4]. The model's robust performance on rare cancers (0.937 AUC) is particularly promising for clinical applications where limited training data typically constrains AI development [25]. For researchers and drug development professionals, pathology foundation models offer powerful tools for accelerating oncology research, identifying novel biomarkers, and developing precision medicine applications. The complementary strengths of vision-only models like Virchow and multimodal approaches like CONCH create a versatile toolkit for addressing diverse challenges in cancer research and clinical practice [7] [5] [25]. As these technologies continue to evolve, they hold substantial potential for integrating with other data modalities including genomic, transcriptomic, and clinical information to enable truly comprehensive cancer analysis systems.
The practice of pathology is inherently multimodal. Pathologists simultaneously interpret visual patterns from glass slides and articulate their findings through descriptive text in clinical reports. Early artificial intelligence (AI) models in computational pathology operated on images alone, creating a disconnect from clinical reasoning processes. The emergence of vision-language foundation models represents a paradigm shift, enabling AI systems to learn from both histology images and associated textual data. Among these, CONCH (CONtrastive learning from Captions for Histopathology) stands out as a pivotal model that leverages large-scale pretraining on image-text pairs to achieve state-of-the-art performance on diverse tasks, including cross-modal retrieval and pathology report generation [12]. This capability is particularly valuable for drug development and research, where connecting morphological patterns with clinical and molecular descriptions can accelerate biomarker discovery and therapeutic response prediction. This technical guide details CONCH's architecture, methodologies for cross-modal retrieval and report generation, and performance benchmarks, providing researchers with practical protocols for implementation.
CONCH is built upon the CoCa (Contrastive Captioner) framework, a state-of-the-art visual-language foundation model architecture [12]. Its design incorporates three principal components that enable multimodal understanding and generation: an image encoder that embeds histology images, a text encoder that embeds pathology-related text, and a multimodal text decoder that fuses the two modalities for caption generation [12].
CONCH's performance stems from its task-agnostic pretraining on an unprecedented scale of histopathology-specific data [12]. The pretraining strategy employs a dual objective: an image-text contrastive term that aligns matching image-caption pairs in a shared embedding space while separating mismatched pairs, and a captioning term that trains the model to generate the caption conditioned on the image [12].
The model was pretrained on over 1.17 million histopathology image-caption pairs gathered from diverse sources, including biomedical textbooks and scientific literature [5] [12]. This extensive and curated dataset covers a wide spectrum of diseases, tissue types, and staining patterns, providing the model with broad histomorphological knowledge.
Cross-modal retrieval allows researchers to find relevant images using text queries, or find relevant text descriptions using image queries. The implementation leverages the shared embedding space learned by CONCH during pretraining.
Protocol for Image-to-Text Retrieval: encode the query image with CONCH's image encoder, project it into the shared embedding space, compute cosine similarity against precomputed text embeddings of candidate captions or report excerpts, and return the top-ranked matches.
Protocol for Text-to-Image Retrieval: encode the text query with the text encoder and rank a database of precomputed image embeddings by the same similarity measure, returning the most similar images.
CONCH establishes new state-of-the-art performance on cross-modal retrieval tasks, significantly outperforming previous models like PLIP and BiomedCLIP [12]. The following table summarizes its zero-shot retrieval performance on standard benchmarks:
Table 1: CONCH Cross-Modal Retrieval Performance (Recall@K)
| Task | Dataset | Metric | CONCH | PLIP | BiomedCLIP |
|---|---|---|---|---|---|
| Text-to-Image Retrieval | TCGA-BRCA | R@1 | 80.3% | 45.1% | 38.2% |
| Image-to-Text Retrieval | TCGA-BRCA | R@1 | 75.8% | 42.7% | 36.5% |
| Text-to-Image Retrieval | TCGA-NSCLC | R@5 | 94.2% | 81.5% | 73.8% |
| Image-to-Text Retrieval | TCGA-NSCLC | R@5 | 92.7% | 79.6% | 72.1% |
For researchers, cross-modal retrieval enables mining of large slide archives with natural-language queries, retrieval of morphologically similar cases for rare entities, and linking of image findings to descriptions in reports and the published literature [7] [12].
Automating the creation of pathology reports from whole-slide images (WSIs) can significantly reduce pathologist workload. CONCH and models derived from it, like TITAN, demonstrate robust capabilities in this domain [7] [29]. The standard workflow is as follows:
Protocol for WSI-Level Report Generation: tessellate the WSI into tiles, encode each tile with the vision encoder, aggregate the tile features into a slide-level representation (for example with a slide encoder such as TITAN or a Perceiver-style aggregator), and decode the report text autoregressively with the multimodal text decoder [7] [29].
Model-generated reports have been shown to be clinically relevant. In a specialized study on melanocytic lesions, a CONCH-inspired model generated reports for common nevi that were assessed by an expert pathologist as being on par with pathologist-written reports [29]. The TITAN model, which builds upon concepts from CONCH, was specifically pretrained using 335,645 WSIs and aligned with 182,862 medical reports and 423,122 synthetic captions, enabling high-quality report generation that generalizes to rare diseases and cancer prognosis [7].
Table 2: Report Generation and Representation Learning Performance
| Model | Task | Dataset | Performance |
|---|---|---|---|
| CONCH-derived [29] | Report Generation | Melanocytic Lesions (19,645 cases) | Generated reports for common nevi were on par with pathologist-written reports. |
| TITAN [7] | Slide Representation | Mass-340K (20 organs) | Outperforms other slide foundation models in zero-shot classification and retrieval. |
| CONCH [12] | Zero-shot WSI Classification | TCGA NSCLC | 90.7% Accuracy |
| CONCH [12] | Zero-shot WSI Classification | TCGA RCC | 90.2% Accuracy |
Implementing CONCH for multimodal tasks requires access to specific tools and resources. The following table details key solutions for researchers.
Table 3: Essential Research Reagent Solutions for CONCH Implementation
| Resource | Type | Function / Application | Source / Availability |
|---|---|---|---|
| CONCH Model Weights | Software | Pre-trained model parameters for feature extraction and multimodal tasks. | HuggingFace: MahmoodLab/CONCH [30] |
| UNI Model Weights | Software | A powerful companion vision-only foundation model for extracting rich features from H&E and non-H&E stains. | HuggingFace: MahmoodLab/UNI [30] |
| Pathology Reports | Data | Curated, de-identified clinical reports for training and evaluation. Preprocessing (translation, segmentation) is required [29]. | Institutional archives (requires IRB approval). |
| Whole-Slide Images (WSIs) | Data | Digitized H&E-stained slides from biopsies/resections. Essential for model fine-tuning and validation. | TCGA, PAIP, GTEx, or internal institutional databases. |
| DINOv2 / iBOT Framework | Algorithm | Self-supervised learning frameworks used for pretraining vision backbones of models like UNI and Virchow [2] [9]. | Public GitHub repositories. |
| Perceiver IO | Architecture | Transformer-based model for effectively aggregating long sequences of tile features into slide-level representations [29]. | Public GitHub repositories. |
CONCH represents a significant advancement in computational pathology by bridging the gap between visual histological information and clinical language. Its robust capabilities in cross-modal retrieval and pathology report generation, demonstrated across multiple large-scale benchmarks, provide researchers and drug developers with a powerful tool. These capabilities enable more efficient data mining, hypothesis generation, and clinical workflow augmentation. The ongoing development of even larger models like TITAN, which builds upon CONCH's principles, indicates a clear trajectory toward more general-purpose AI systems in pathology. These systems hold the promise of integrating multimodal data to unlock new insights in disease characterization, biomarker discovery, and ultimately, the development of personalized therapeutic strategies.
The diagnostic landscape for rare cancers presents a formidable challenge in computational pathology, where the limited availability of annotated whole-slide images (WSIs) creates a significant bottleneck for developing robust artificial intelligence (AI) tools. Foundation models, pretrained on broad data using self-supervision and adaptable to diverse downstream tasks, are transforming this paradigm [1]. These models, including those in the Virchow, CONCH, and UNI families, shift the approach from training models from scratch for each task to leveraging vast, pre-learned visual and multimodal representations. For rare cancers, this is particularly powerful, as it allows diagnostic and prognostic models to be built with minimal task-specific data, effectively addressing the critical issue of label scarcity that has traditionally hindered progress in this domain [7] [1]. This technical guide details how these foundation models are engineered and applied to overcome the data limitation problem in rare cancer diagnosis.
Initial foundation models in computational pathology focused on encoding small, patch-level regions of interest (ROIs) into versatile feature representations via self-supervised learning (SSL) [7] [1]. While these patch-level representations capture important cellular and tissue-level morphological patterns, translating their capabilities to address patient- and slide-level clinical challenges remains complex due to the immense scale of gigapixel WSIs and the limited size of rare disease cohorts [7]. To overcome this, newer foundation models have been developed to encode entire WSIs into general-purpose, slide-level feature representations [7]. Instead of training an additional model on top of patch embeddings from scratch, these whole-slide representation models are pretrained to distill pathology-specific knowledge from large WSI collections, simplifying clinical endpoint prediction with their off-the-shelf application [7].
CONCH (CONtrastive learning from Captions for Histopathology) is a vision-language foundation model pretrained on 1.17 million histopathology-specific image-caption pairs [5]. Its key innovation is its multimodal nature, learning jointly from images and text. This enables its application to a wide range of tasks involving either or both histopathology images and text, including classification, segmentation, captioning, and cross-modal retrieval, with state-of-the-art performance [5]. A notable feature is that CONCH did not use large public histology slide collections like TCGA for pretraining, minimizing the risk of data contamination when evaluating on public benchmarks or private slide collections [5].
TITAN (Transformer-based pathology Image and Text Alignment Network) represents a more recent advancement as a multimodal whole-slide vision-language model [7] [31]. It is pretrained on an extensive dataset of 335,645 WSIs across 20 organ types. Its pretraining involves three stages: (1) vision-only unimodal pretraining on ROI crops, (2) cross-modal alignment with synthetic fine-grained morphological descriptions at the ROI-level (423k pairs), and (3) cross-modal alignment at the WSI-level with clinical reports (183k pairs) [7]. This scalable approach enables TITAN to generate general-purpose slide representations and pathology reports, performing effectively in zero-shot and few-shot learning scenarios highly relevant for rare cancers [7].
Table 1: Comparison of Key Pathology Foundation Models
| Model | Architecture | Pretraining Data | Core Capabilities | Distinguishing Features |
|---|---|---|---|---|
| CONCH | Vision-Language Model | 1.17M image-caption pairs [5] | Image & text classification, segmentation, captioning, cross-modal retrieval [5] | Avoids public slide collections (TCGA) to prevent data contamination [5] |
| TITAN | Multimodal Whole-Slide Transformer | 335,645 WSIs + 423k synthetic captions + 183k reports [7] | Slide representation learning, zero/few-shot classification, rare cancer retrieval, report generation [7] | Uses ALiBi for long-context extrapolation; knowledge distillation & masked image modeling [7] |
The TITAN architecture emulates a Vision Transformer (ViT) but operates at the slide level rather than the patch level [7]. Its input consists of a sequence of patch features encoded by powerful histology patch encoders like CONCH, spatially arranged in a 2D feature grid that replicates patch positions within the tissue. To handle long and variable input sequences common in WSIs, TITAN uses several key innovations. It constructs the input embedding space by dividing each WSI into non-overlapping patches of 512×512 pixels at 20× magnification, extracting a 768-dimensional feature vector for each patch [7]. For self-supervised pretraining, it creates views of a WSI by randomly cropping the 2D feature grid, sampling a region crop of 16×16 features covering 8,192×8,192 pixels. From this region crop, it samples two random global (14×14) and ten local (6×6) crops for iBOT pretraining, followed by feature augmentation [7]. Critically, TITAN employs Attention with Linear Biases (ALiBi) to handle long-context extrapolation at inference. Originally designed for language models, ALiBi is extended to 2D in TITAN, with the linear bias based on the relative Euclidean distance between features in the feature grid, reflecting actual spatial distances between tissue patches [7].
Diagram 1: TITAN Whole-Slide Encoding Pipeline. The workflow processes WSIs into spatial feature grids, applies multi-scale cropping for self-supervised learning, and generates slide representations with specialized positional encoding.
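The 2D extension of ALiBi described above can be sketched as a distance-based additive attention bias: for every pair of patch positions in the feature grid, the penalty grows with their Euclidean distance, scaled by a per-head slope. The geometric slope schedule and scaling below are assumptions for illustration and are not taken from the TITAN implementation.

```python
import torch

def alibi_2d_bias(grid_h: int, grid_w: int, n_heads: int) -> torch.Tensor:
    """Build a 2D ALiBi-style additive attention bias: for each pair of patch
    positions, the penalty is proportional to their Euclidean distance in the
    feature grid, with one slope per attention head (geometric slope schedule)."""
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # (N, 2)
    dist = torch.cdist(coords, coords)                                   # (N, N)
    slopes = torch.tensor([2.0 ** (-(i + 1)) for i in range(n_heads)])   # per-head slopes
    return -slopes.view(n_heads, 1, 1) * dist    # (heads, N, N), added to attention logits

bias = alibi_2d_bias(grid_h=14, grid_w=14, n_heads=8)
print(bias.shape)   # torch.Size([8, 196, 196]); nearby patches receive smaller penalties
```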
A key strength of advanced foundation models like CONCH and TITAN is their ability to learn aligned representations across histopathology images and textual data. CONCH achieves this through contrastive learning from captions, pulling together corresponding image and text pairs in a shared embedding space while pushing apart non-corresponding pairs [5]. TITAN extends this concept through a sophisticated three-stage pretraining strategy. Stage 1 involves vision-only unimodal pretraining on ROI crops using the iBOT framework, which combines masked image modeling and knowledge distillation [7]. Stage 2 performs cross-modal alignment of generated morphological descriptions at the ROI-level, using 423,122 synthetic captions generated from a multimodal generative AI copilot for pathology [7]. Stage 3 involves cross-modal alignment at the WSI-level with 182,862 clinical pathology reports [7]. This progressive approach enables the model to capture histomorphological semantics at both local (ROI) and global (WSI) levels, facilitated by both visual and language supervisory signals.
For rare cancers with minimal labeled data, the zero-shot and few-shot capabilities of foundation models are particularly valuable. The experimental protocol involves using the pretrained foundation model as a feature extractor without any fine-tuning (zero-shot) or with minimal task-specific adaptation (few-shot). In practice, TITAN can perform zero-shot classification by leveraging its vision-language alignment [7]. Given a WSI, the model generates a slide representation that can be compared with text embeddings of class descriptions (e.g., "a histopathology image of rare sarcoma type X") in the shared multimodal space. The classification decision is made based on the highest similarity score between the image embedding and the text embeddings of different class descriptions [7]. For few-shot learning, a linear classifier or shallow neural network is trained on top of the frozen TITAN features using the limited labeled examples available for the rare cancer [7]. This approach has demonstrated superior performance compared to both ROI-based and other slide foundation models, especially in low-data regimes [7].
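A minimal sketch of this zero-shot procedure is shown below: the slide embedding is compared against text embeddings of class-prompt descriptions by cosine similarity, and a softmax converts the similarities into class probabilities. The encode_image and encode_text functions are hypothetical stand-ins for the foundation model's encoders, implemented here with random projections purely so the example runs.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for the foundation model's encoders (random projections
# used only to make the sketch self-contained and runnable).
torch.manual_seed(0)
_proj = torch.randn(768, 512)
def encode_image(slide_feats): return slide_feats.mean(dim=0) @ _proj   # (512,)
def encode_text(prompt): return torch.randn(512)                        # placeholder

class_prompts = [
    "a histopathology image of clear cell renal cell carcinoma",
    "a histopathology image of papillary renal cell carcinoma",
    "a histopathology image of chromophobe renal cell carcinoma",
]

def zero_shot_classify(slide_feats, prompts, temperature=0.07):
    """Zero-shot prediction: cosine similarity between the slide embedding and
    each class-prompt embedding, converted to probabilities with a softmax."""
    img = F.normalize(encode_image(slide_feats), dim=-1)
    txt = F.normalize(torch.stack([encode_text(p) for p in prompts]), dim=-1)
    return F.softmax(img @ txt.t() / temperature, dim=-1)

probs = zero_shot_classify(torch.randn(500, 768), class_prompts)
print({p: round(float(s), 3) for p, s in zip(class_prompts, probs)})
```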
Content-based image retrieval is particularly valuable for rare cancers, allowing pathologists to find morphologically similar cases from potentially small databases. The experimental protocol involves using the foundation model to encode both query slides and database slides into a shared embedding space. When a pathologist submits a query WSI of a rare cancer, TITAN generates a compact slide representation [7]. This representation is then compared against a database of precomputed slide representations using similarity measures such as cosine similarity or Euclidean distance. The system returns the most similar cases ranked by similarity score. This methodology has shown strong performance for rare cancer retrieval, outperforming other approaches by leveraging the general-purpose slide representations learned during large-scale pretraining [7]. The retrieval capability can be further enhanced to cross-modal retrieval, where text queries (e.g., "find cases with spindle cell morphology and necrosis") can be used to retrieve relevant WSIs, and vice versa [7].
Table 2: Experimental Applications and Performance of Foundation Models for Rare Cancers
| Application Scenario | Experimental Protocol | Key Outcome Metrics | Reported Performance |
|---|---|---|---|
| Zero-Shot Classification | Compare image embeddings with text embeddings of class descriptions without fine-tuning [7] | Accuracy, F1-score | Outperforms ROI and other slide foundation models [7] |
| Few-Shot Learning | Train linear classifier on frozen foundation model features with limited labeled examples [7] | Few-shot accuracy | Superior performance in low-data regimes compared to supervised baselines [7] |
| Rare Cancer Retrieval | Compute similarity between slide representations in embedded space [7] | Retrieval precision @k | Effectively retrieves morphologically similar rare cancer cases [7] |
| Pathology Report Generation | Generate textual descriptions from WSIs using vision-language alignment [7] | BLEU, ROUGE scores | Generates coherent pathology reports without fine-tuning [7] |
The multimodal nature of foundation models enables innovative applications for rare cancer diagnosis. Cross-modal retrieval allows researchers to find relevant WSIs based on textual descriptions of morphological features, or conversely, to find relevant text descriptions (e.g., from literature or reports) based on a query WSI [7] [5]. The experimental setup involves encoding both modalities into a shared embedding space and computing similarity between query and database embeddings. For pathology report generation, TITAN can generate descriptive reports from WSIs without task-specific fine-tuning, leveraging its pretrained vision-language alignment [7]. This is particularly valuable for rare cancers where comprehensive reporting templates may not be readily available. The generated reports can include descriptions of tissue architecture, cellular morphology, and other diagnostically relevant features [7].
Implementing foundation models for rare cancer research requires both computational and data resources. The following table details key components of the research toolkit.
Table 3: Research Reagent Solutions for Foundation Model Implementation
| Tool/Resource | Type | Function in Rare Cancer Research | Implementation Notes |
|---|---|---|---|
| CONCH Model | Vision-Language Foundation Model | Feature extraction for images and text; cross-modal retrieval [5] | Available on GitHub; can be fine-tuned for specific rare cancer tasks [5] |
| TITAN Framework | Multimodal Whole-Slide Model | Whole-slide representation learning; zero-shot classification; report generation [7] | Builds upon CONCH; handles full WSIs with specialized transformers [7] |
| QuPath | Open-Source Software | Digital pathology platform for WSI visualization and analysis [32] | Supports integration with AI models; useful for result interpretation and validation [32] |
| Synthetic Captions | Data Generation Tool | Provides fine-grained morphological descriptions for training [7] | Generated using multimodal generative AI copilot; enhances model capabilities [7] |
| ALiBi Positional Encoding | Algorithm | Enables handling of variable-sized WSIs and long-range dependencies [7] | Critical for whole-slide processing; based on relative Euclidean distances [7] |
Diagram 2: Rare Cancer Diagnosis Workflow. Foundation models enable multiple diagnostic pathways from limited WSI data, including classification, retrieval, and automated reporting without requiring extensive task-specific training.
Foundation models like CONCH and TITAN represent a paradigm shift in computational pathology for rare cancers, effectively addressing the critical challenge of limited data availability. Through sophisticated architectures that leverage self-supervised learning on large-scale datasets, multimodal alignment, and whole-slide representation learning, these models demonstrate remarkable capabilities in zero-shot and few-shot learning, cross-modal retrieval, and pathology report generation. The experimental protocols and methodologies outlined in this guide provide researchers with practical frameworks for applying these advanced AI tools to rare cancer diagnosis. As these foundation models continue to evolve, they hold significant promise for improving diagnostic accuracy, enabling personalized treatment strategies, and ultimately advancing patient care for rare and understudied cancers. Their ability to generalize from broad pretraining to specific, data-scarce clinical tasks makes them uniquely suited for addressing the long-standing challenges of low-incidence diseases in histopathology.
The advent of foundation models represents a paradigm shift in computational pathology, moving from task-specific algorithms to versatile artificial intelligence (AI) systems trained on massive datasets. These models leverage self-supervised learning on broad data to generate general-purpose feature representations (embeddings) that can be adapted to numerous downstream tasks with minimal fine-tuning [1]. Within histopathology, foundation models are trained on hundreds of thousands to millions of whole slide images (WSIs), learning to capture the complex morphological patterns associated with tissue architecture, cellular morphology, and disease states [2] [10]. This capability is particularly transformative for biomarker prediction, where these models can identify subtle phenotypic changes in routine hematoxylin and eosin (H&E) stains that correlate with specific molecular alterations, potentially reducing reliance on costly specialized testing [2] [33].
The significance of predicting biomarkers from H&E images lies in bridging the gap between conventional histopathology and molecular pathology. Biomarkers—including specific genetic mutations, protein expression patterns, and genomic instability markers—are crucial for cancer diagnosis, prognosis, and treatment selection [33]. Traditional methods for assessing these biomarkers, such as next-generation sequencing, immunohistochemistry (IHC), and multiplex immunofluorescence (mIF), are time-consuming and expensive [33]. The ability to infer these biomarkers directly from ubiquitous H&E-stained slides could make precision medicine more accessible and efficient while providing pathologists with valuable decision-support tools [2] [34]. Foundation models like Virchow, CONCH, and UNI are at the forefront of this revolution, each offering unique architectural advantages and training methodologies that enhance their predictive capabilities for biomarker discovery and validation [2] [5] [1].
Pathology foundation models employ sophisticated deep learning architectures trained using self-supervised learning (SSL) objectives on large-scale WSI datasets. The Virchow model exemplifies the vision transformer (ViT) approach, implementing a 632 million parameter architecture trained on approximately 1.5 million H&E-stained WSIs from 100,000 patients using the DINO v.2 algorithm [2] [10]. This self-supervised approach learns to produce meaningful embeddings of WSI tiles by leveraging both global and local tissue regions without requiring manual annotations, capturing diverse patterns including cellular morphology, tissue architecture, and nuclear features [2]. The massive scale of training data—4-10 times larger than previous pathology datasets—enables Virchow to develop robust representations that generalize well across cancer types and laboratory preparations [2].
In contrast, CONCH (CONtrastive learning from Captions for Histopathology) employs a vision-language foundation model architecture, pretrained on 1.17 million image-caption pairs using contrastive learning objectives [5]. This multimodal approach allows CONCH to learn joint representations of histopathology images and textual descriptions, enabling capabilities such as image classification, segmentation, captioning, and cross-modal retrieval [5]. Unlike Virchow, CONCH did not utilize large public histology slide collections like The Cancer Genome Atlas (TCGA) for pretraining, reducing the risk of data contamination when evaluating on standard benchmarks [5]. The UNI model represents another significant approach, employing large-scale self-supervised learning on diverse histopathology images to produce general-purpose representations that have been applied across numerous research applications [5].
Table 1: Comparison of Major Pathology Foundation Models
| Model | Architecture | Parameters | Training Data | Special Capabilities |
|---|---|---|---|---|
| Virchow | Vision Transformer (ViT) | 632 million | 1.5M H&E WSIs from 100k patients | Pan-cancer detection, biomarker prediction |
| CONCH | Vision-Language Transformer | Not specified | 1.17M image-caption pairs | Multimodal reasoning, text-to-image retrieval, captioning |
| UNI | Vision Transformer (ViT) | Not specified | Diverse histopathology images | General-purpose representation learning for classification, segmentation, and biomarker prediction |
The process of predicting biomarkers from H&E stains using foundation models follows a structured computational pipeline. Initially, WSIs are preprocessed to address color variations resulting from different staining protocols and scanner differences. Color normalization techniques are commonly applied to standardize the appearance of H&E images across different laboratories and staining batches [33]. Following preprocessing, the gigapixel WSIs are divided into smaller patches or tiles at appropriate magnification levels (typically 20×), making them manageable for deep learning processing [2] [35].
Foundation model embeddings are then extracted for each tile, capturing distinctive morphological features. For slide-level prediction tasks such as biomarker status, these tile-level embeddings are aggregated using weakly supervised methods. Attention-based multiple instance learning (attMIL) approaches are particularly effective, as they learn to weight the importance of different tissue regions based on their relevance to the prediction task [34]. This allows the model to focus on diagnostically significant areas while reducing noise from irrelevant tissues such as connective tissue or fat [34]. The aggregated slide-level representation is then used to train prediction heads for specific biomarkers, either through transfer learning or end-to-end fine-tuning.
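The sketch below illustrates a generic attention-based MIL head of the kind described above: tile embeddings from a frozen foundation model are scored, softmax-normalized into attention weights, and pooled into a slide-level representation that feeds a prediction layer. The embedding dimension, hidden size, and single-logit output are illustrative choices, not the architecture of any specific published model.

```python
import torch
import torch.nn as nn

class AttentionMILHead(nn.Module):
    """Generic attention-based MIL head: scores each tile embedding, forms a
    weighted slide-level representation, and predicts a slide-level label."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
        )
        self.classifier = nn.Linear(embed_dim, 1)

    def forward(self, tile_embeddings: torch.Tensor):
        # tile_embeddings: (num_tiles, embed_dim) frozen foundation-model features
        scores = self.attention(tile_embeddings)          # (num_tiles, 1)
        weights = torch.softmax(scores, dim=0)            # attention over tiles
        slide_repr = (weights * tile_embeddings).sum(0)   # (embed_dim,)
        logit = self.classifier(slide_repr)               # slide-level prediction
        return logit, weights.squeeze(-1)

head = AttentionMILHead()
logit, attn = head(torch.randn(1200, 768))   # e.g., 1,200 tiles from one WSI
```

The returned attention weights can be mapped back onto tile coordinates to produce the attention heatmaps referred to throughout this section.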
A significant advancement in biomarker prediction is the shift from classification to regression-based deep learning for modeling continuous biomarkers. Traditional approaches often binarized continuous biomarkers using clinically relevant cut-offs, resulting in information loss that limited model performance [34]. Regression methods preserve the continuous nature of biomarkers such as homologous recombination deficiency (HRD) scores, gene expression values, and protein abundance, leading to more accurate predictions [34]. The CAMIL regression approach combines contrastive learning for feature extraction with attention-based multiple instance learning, demonstrating superior performance for predicting continuous biomarkers compared to classification-based methods [34].
In practice, regression models are trained using weakly supervised learning, where slide-level biomarker measurements serve as labels, and the model learns to associate morphological patterns in H&E images with continuous biomarker values. This approach has shown remarkable success in predicting HRD status—a clinically important pan-cancer biomarker—across multiple cancer types including breast, colorectal, and endometrial cancers [34]. For example, CAMIL regression achieved area under the receiver operating characteristic curve (AUROC) scores of 0.78 for breast cancer and 0.82 for endometrial cancer in the TCGA cohort, outperforming classification-based approaches [34]. The regression framework also improves the correspondence between model attention maps and regions of known clinical relevance, enhancing interpretability for pathologists [34].
Foundation models demonstrate remarkable capability in predicting genetic mutations directly from H&E-stained histology images by learning the morphological patterns associated with specific molecular alterations. The underlying principle is that driver mutations often induce characteristic phenotypic changes in tissue architecture and cellular morphology that can be detected by sufficiently sophisticated AI models [35]. For instance, in hepatocellular carcinoma (HCC), deep learning models can predict mutations in key genes including CTNNB1, FMN2, TP53, and ZFX4 with AUROCs ranging from 0.71 to 0.89 [35]. These models leverage the fact that CTNNB1-mutated HCC typically presents as well-differentiated tumors with pseudoglandular and microtrabecular patterns, while TP53-mutated HCC tends to be poorly differentiated with compact patterns and pleomorphic cells [35].
The experimental protocol for mutation prediction involves training on WSIs with corresponding genomic sequencing data. Models are typically trained using a patch-based approach where WSIs are divided into smaller tiles at appropriate magnification (usually 20×), and genomic labels are propagated from the slide level to all tiles from that slide [35]. During inference, predictions from individual tiles are aggregated to generate slide-level mutation probabilities. This approach has been successfully applied across multiple cancer types, demonstrating that foundation models can capture the complex relationships between tissue morphology and genomic alterations, potentially obviating the need for additional genetic testing in some clinical scenarios [2] [35].
Table 2: Performance of Biomarker Prediction Across Cancer Types
| Cancer Type | Biomarker | Model | Performance (AUC) | Dataset |
|---|---|---|---|---|
| Hepatocellular Carcinoma | CTNNB1 mutation | Inception V3 | 0.89 | External validation [35] |
| Hepatocellular Carcinoma | TP53 mutation | Inception V3 | 0.71-0.89 | External validation [35] |
| Multiple Cancers | Homologous Recombination Deficiency | CAMIL Regression | 0.64-0.82 | TCGA [34] |
| Colorectal Cancer | Microsatellite Instability | Deep Residual Learning | 0.84 (Avg. Precision) | Multiple cohorts [33] |
| Pancreatic Cancer | Pan-cancer Detection | Virchow | 0.950 | MSKCC [2] |
Beyond genetic mutations, foundation models can predict protein expression patterns and even generate in silico immunofluorescence from H&E images. The ROSIE framework demonstrates this capability by computationally imputing the expression and localization of dozens of proteins from standard H&E stains using a convolutional neural network (ConvNext) architecture [36]. Trained on over 1,000 co-stained H&E and CODEX samples encompassing nearly 30 million cells across diverse tissue and disease types, ROSIE can predict 50 different protein biomarkers including immune markers (CD45, CD3, CD8), epithelial markers (PanCK), and stromal markers (aSMA) [36].
The experimental methodology for protein prediction involves training on precisely aligned H&E and multiplex immunofluorescence (mIF) samples, enabling the model to learn the relationship between H&E morphology and protein expression patterns. During inference, ROSIE processes H&E images using a sliding window approach, generating predictions for biomarker expression across the tissue sample [36]. This capability is particularly valuable for identifying specific immune cell subtypes—such as B cells and T cells—that are not readily discernible from H&E staining alone but have significant implications for understanding tumor-immune interactions and guiding immunotherapy [36]. Validation on held-out datasets demonstrates correlation between predicted and actual protein expressions, with the method achieving a Pearson R correlation of 0.285 and Spearman R correlation of 0.352 across diverse evaluation datasets [36].
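The sliding-window inference step can be sketched as below. The window size, stride, and marker count are placeholders rather than the published ROSIE configuration, and `model` stands in for any trained network that maps an H&E window to a vector of predicted marker expression values.

```python
import numpy as np

def sliding_window_predict(he_image, model, window=256, stride=256, n_markers=50):
    """Tile an H&E image with a sliding window and predict marker expression per window.

    he_image: (H, W, 3) RGB array; `model` is any callable mapping a window to a
    vector of n_markers predicted expression values.
    """
    H, W, _ = he_image.shape
    rows, cols = (H - window) // stride + 1, (W - window) // stride + 1
    out = np.zeros((rows, cols, n_markers), dtype=np.float32)
    for r in range(rows):
        for c in range(cols):
            patch = he_image[r * stride:r * stride + window,
                             c * stride:c * stride + window]
            out[r, c] = model(patch)          # predicted expression per biomarker
    return out                                 # coarse spatial expression map
```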
Successful implementation of biomarker prediction models requires specific computational resources and data components. The research reagent solutions table below outlines key requirements for developing and deploying these AI systems in histopathology research.
Table 3: Research Reagent Solutions for Biomarker Prediction Experiments
| Component | Specification | Function/Purpose |
|---|---|---|
| Whole Slide Images | H&E-stained WSIs from biopsied or resected tissue, scanned at 40× magnification | Primary input data for foundation models |
| Genomic Labels | Mutation status from sequencing (e.g., WGS, WES) or targeted panels | Ground truth for mutation prediction tasks |
| Protein Expression Data | Multiplex immunofluorescence (e.g., CODEX) or immunohistochemistry | Ground truth for protein prediction tasks |
| Clinical Annotations | Patient outcomes, treatment response, demographic data | For prognostic model development and validation |
| Color Normalization Tools | Reinhard method or deep learning-based normalization | Standardizes H&E appearance across different scanners and labs |
| Computational Infrastructure | High-performance GPUs (e.g., NVIDIA A100, H100) with large VRAM | Enables processing of gigapixel WSIs and large foundation models |
Implementing a robust experimental protocol is essential for validating biomarker predictions. The following workflow provides a standardized approach for researchers developing foundation model-based biomarker detection systems:
Data Curation and Partitioning: Collect a diverse cohort of WSIs with corresponding biomarker measurements. Implement site-aware cross-validation splits to mitigate batch effects, particularly when using multi-institutional data like TCGA [34]. Ensure adequate representation of rare cancer types and biomarkers to test model generalizability.
Slide Preprocessing and Tile Extraction: Perform color normalization on all WSIs to standardize staining variations [33]. Extract tissue tiles at appropriate magnification (typically 20×), filtering out non-informative regions such as background, artifacts, or excessive blood.
Feature Extraction with Foundation Models: Generate tile-level embeddings using pretrained foundation models (Virchow, CONCH, or UNI). For vision-language models like CONCH, incorporate relevant textual descriptions when available to enhance feature quality [5].
Slide-Level Representation Learning: Apply attention-based multiple instance learning to aggregate tile embeddings into slide-level representations. The attention mechanism automatically identifies and weights diagnostically relevant regions [34].
Biomarker Prediction Head: Train prediction heads for specific biomarkers using the slide-level representations. For continuous biomarkers, use regression-based objectives rather than classification to preserve information [34].
Validation and Interpretation: Evaluate model performance on held-out test sets using appropriate metrics (AUROC for classification, correlation coefficients for regression). Generate attention maps to visualize morphological features driving predictions, facilitating pathological interpretation.
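As a concrete illustration of the site-aware partitioning in the data curation step, the snippet below uses scikit-learn's GroupKFold so that slides from the same contributing site never appear in both training and validation folds. The arrays are random placeholders for real slide-level embeddings, biomarker values, and site identifiers.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Placeholders for real slide-level embeddings, labels, and site identifiers.
slide_embeddings = np.random.rand(500, 768)        # one aggregated vector per slide
biomarker_values = np.random.rand(500)             # continuous biomarker per slide
site_ids = np.random.randint(0, 12, size=500)      # e.g., TCGA tissue source site

splitter = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(
        splitter.split(slide_embeddings, biomarker_values, groups=site_ids)):
    # Slides from one site are confined to a single side of the split.
    assert set(site_ids[train_idx]).isdisjoint(site_ids[val_idx])
    print(f"fold {fold}: {len(train_idx)} train slides, {len(val_idx)} val slides")
```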
Foundation models have demonstrated remarkable performance across various biomarker prediction tasks, often matching or exceeding the capabilities of traditional molecular testing methods. The Virchow model achieves 0.95 specimen-level area under the receiver operating characteristic curve for pan-cancer detection across nine common and seven rare cancers, with particularly strong performance on rare cancers (AUC of 0.937) [2]. This demonstrates that large-scale foundation models can generalize effectively to rare disease contexts where training data is limited [2]. Comparative analyses show that Virchow embeddings consistently outperform other foundation models like UNI, Phikon, and CTransPath across most cancer types, with the performance advantage being most pronounced for rare cancers such as cervical and bone cancers [2].
For specific biomarker prediction tasks, regression-based approaches consistently outperform classification-based methods. In predicting homologous recombination deficiency status across seven cancer types, the CAMIL regression method achieved AUROCs of 0.72 to 0.82, outperforming classification-based approaches while also demonstrating lower variance in model performance across different patient subsets [34]. Similarly, in liver cancer, deep learning models can predict CTNNB1 and TP53 mutations with performance levels approaching the ability of pathologists with 5 years of experience, achieving 96.0% accuracy for benign versus malignant classification and 89.6% accuracy for tumor differentiation grading [35]. These results highlight the potential of foundation models to not only predict molecular biomarkers but also to perform standard pathological assessments with high accuracy.
The performance advantages of foundation models are particularly evident in data-limited scenarios. Virchow demonstrates that with less training data, pan-cancer detectors built on foundation model embeddings can achieve similar performance to tissue-specific clinical-grade models in production and even outperform them on some rare cancer variants [2]. This scalability property makes foundation models especially valuable for rare diseases and biomarkers where collecting large annotated datasets is challenging. Additionally, vision-language models like CONCH extend these capabilities beyond prediction to multimodal tasks such as text-to-image retrieval and report generation, further expanding their utility in pathology workflows [5].
Despite their impressive capabilities, pathology foundation models face several implementation challenges that must be addressed for widespread clinical adoption. A significant concern is the interpretability of model predictions, as the "black box" nature of deep learning systems can make it difficult for pathologists to understand the morphological evidence underlying biomarker predictions [35] [33]. Developing better visualization techniques and attention mechanisms that highlight relevant tissue regions is crucial for building clinical trust and facilitating model validation. Additionally, domain shift issues arise when models trained on data from one institution underperform when applied to images from different hospitals due to variations in staining protocols, scanner types, and tissue processing methods [37].
Future research directions focus on creating more generalist medical AI systems that integrate pathology foundation models with foundation models from other medical domains, including radiology, genomics, and clinical notes [1]. Such integrated systems could provide comprehensive diagnostic support by combining multiple data modalities. There is also growing interest in multimodal foundation models that simultaneously process histopathology images, genomic data, and clinical information to generate more accurate prognostic and predictive biomarkers [5] [1]. As these models evolve, ensuring their robustness, fairness, and regulatory compliance will be essential for clinical implementation.
From a technical perspective, future work will likely focus on improving sample efficiency through better self-supervised learning objectives, extending capabilities to predict a broader range of biomarkers including spatial transcriptomics patterns, and developing more efficient model architectures that reduce computational requirements without sacrificing performance. The rapid pace of innovation in this field suggests that foundation models will play an increasingly central role in computational pathology, potentially transforming how biomarkers are discovered, validated, and implemented in clinical practice to enable more precise and personalized cancer care.
The field of computational pathology has been transformed by the advent of foundation models (FMs), which are large-scale artificial intelligence models trained on broad data that can be adapted to a wide range of downstream tasks [1]. These models represent a significant advancement over traditional deep learning approaches, offering superior expressiveness and scalability based on massive model architectures and training datasets [1]. In histopathology, foundation models are pretrained on vast collections of whole slide images (WSIs) and, in many cases, paired multimodal data such as pathology reports and genomic information [7] [5]. This pretraining enables the models to learn versatile and transferable feature representations of histopathology data without requiring task-specific labels during the initial training phase [7]. The resulting models serve as a powerful "foundation" for developing tools that predict critical clinical endpoints from digitized tissue sections, including diagnosis, prognosis, biomarker status, and treatment response [7] [1]. Within the context of Virchow, CONCH, and UNI foundation models, researchers now have access to sophisticated architectures specifically designed for histopathology research, enabling more accurate and efficient prognostic modeling across a diverse array of diseases and patient cohorts [5].
Table 1: Comparison of Foundation Model Types in Pathology
| Model Type | Key Characteristics | Example Applications | Advantages |
|---|---|---|---|
| Vision Foundation Models | Pretrained on histology images only | Cancer grading, tissue segmentation | High performance on visual tasks |
| Multimodal Foundation Models | Incorporate images and text/genomic data | Cross-modal retrieval, report generation | Enhanced interpretability, broader applicability |
| Whole-Slide Foundation Models | Process entire WSIs rather than patches | Slide-level prognosis, rare cancer retrieval | Captures tissue microenvironment context |
The CONCH (CONtrastive learning from Captions for Histopathology) model represents a groundbreaking vision-language foundation model developed using diverse sources of histopathology images, biomedical text, and over 1.17 million image-caption pairs [5]. This multimodal approach enables CONCH to be transferred to a wide range of downstream tasks involving either or both histopathology images and text, achieving state-of-the-art performance on histology image classification, segmentation, captioning, text-to-image, and image-to-text retrieval [5]. Unlike popular self-supervised encoders pretrained only on H&E images, CONCH produces performant representations for non-H&E stained images such as IHCs and special stains, significantly expanding its utility across various staining protocols [5].
Building upon the success of patch-based foundation models like CONCH, more recent advancements have introduced whole-slide foundation models such as TITAN (Transformer-based pathology Image and Text Alignment Network) [7]. TITAN is a multimodal whole-slide vision-language model designed for general-purpose slide representation learning in histopathology, pretrained on 335,645 whole-slide images via visual self-supervised learning and vision-language alignment with corresponding pathology reports and 423,122 synthetic captions [7]. This model introduces a large-scale pretraining paradigm that leverages millions of high-resolution regions-of-interest for scalable WSI encoding, enabling it to extract general-purpose slide representations and generate pathology reports that generalize to resource-limited clinical scenarios such as rare disease retrieval and cancer prognosis [7].
These foundation models address critical limitations in traditional computational pathology approaches, particularly the immense scale of gigapixel whole-slide images and the small size of patient cohorts in real-world evidence, especially for rare diseases with limited training data [7]. By pretraining on extensive multimodal datasets, these models capture both morphological patterns in histology and their relationships with clinical and molecular correlates, making them exceptionally well-suited for prognostic modeling applications where multiple data sources must be integrated for accurate prediction [7] [5].
Foundation models enable a diverse range of prognostic modeling applications that leverage their learned representations of histopathological patterns and their relationships to clinical outcomes. These applications span from traditional survival prediction to more innovative approaches for treatment response assessment and rare disease prognosis.
Multimodal machine learning integrating histopathology and molecular data shows significant promise for cancer prognostication [38]. A systematic review of studies combining whole slide images and high-throughput omics to predict overall survival identified 48 studies across 19 cancer types, all published since 2017 [38]. These approaches include regularized Cox regression, classical machine learning, and deep learning, with reported c-indices ranging from 0.550 to 0.857 [38]. A key finding is that multimodal models typically outperform unimodal ones, highlighting the value of integrating histopathological images with complementary data types for enhanced prognostic accuracy [38]. For instance, quantitative prognostic modeling using a combination of clinical data, histopathological features, and CT images has demonstrated improved risk stratification for esophageal squamous cell carcinoma patients, with C-indices improving from 0.596 (clinical features only) to 0.711 when combining all modalities [39].
Predicting patient response to treatment before initiating therapy represents a crucial application of prognostic modeling with significant clinical implications. In breast cancer, quantitative digital histopathology coupled with machine learning has demonstrated remarkable accuracy in predicting pathological complete response (pCR) to neoadjuvant chemotherapy using pre-treatment tumor biopsies [40]. A study of 149 breast cancer patients developed a prediction model using gradient boosting machines with decision trees, which achieved an area under the ROC curve (AUC) of 0.90, with 85% sensitivity and 82% specificity on an independent test set [40]. Notably, the model utilized pathomic features extracted from digitized histology images of biopsy samples, including graph-based and wavelet features, which outperformed traditional clinical features such as tumor size, tumor grade, age, and receptor status [40].
Similarly, in lymphoma, prognostic models have been developed to predict complete response to first-line therapy using machine learning algorithms [41]. A study of 2,763 patients from the Lymphoma and Related Diseases Registry developed a nomogram incorporating six variables—stage, lactate dehydrogenase, performance status, BCL2 expression, anemia, and systemic immune-inflammation index—that achieved an AUC of 0.70, outperforming traditional international prognostic indices [41]. This approach demonstrates the value of incorporating inflammatory-nutritional indicators alongside conventional clinical factors for treatment response prediction.
Table 2: Performance Comparison of Prognostic Models Across Cancer Types
| Cancer Type | Prediction Task | Model Type | Performance | Data Modalities |
|---|---|---|---|---|
| Breast Cancer | Pathological complete response to NAC | Gradient Boosting Machine | AUC: 0.90 | Digital histopathology features |
| Lymphoma | Complete response to first-line therapy | Nomogram with ML | AUC: 0.70 | Clinical, inflammatory-nutritional indicators |
| Melanoma | BRAF mutation status | Foundation Model + XGBoost | AUC: 0.824 | Whole slide images |
| Esophageal Cancer | Overall survival | Multimodal integration | C-index: 0.711 | Clinical, CT, histopathology |
Foundation models enable the prediction of molecular biomarkers directly from routine histopathology images, offering a potentially faster and more cost-effective alternative to genetic testing. In melanoma, a novel machine learning framework integrating a large-scale, pretrained foundation model (Prov-GigaPath) with a gradient-boosting classifier (XGBoost) has demonstrated state-of-the-art performance in predicting BRAF-V600 mutation status directly from histopathological slides [42]. This approach achieved an AUC of 0.824 during cross-validation on The Cancer Genome Atlas dataset and 0.772 on an independent test set from University Hospital Essen, representing a significant advancement in image-only BRAF mutation prediction [42]. By employing a weakly supervised, data-efficient pipeline, this method reduces the need for extensive annotations and costly molecular assays while providing accurate biomarker predictions that could guide targeted therapies and improve patient outcomes [42].
Implementing prognostic models using histopathology foundation models requires careful experimental design and methodology. The following protocols outline key approaches for leveraging these models in predictive tasks.
The processing of gigapixel whole-slide images presents unique computational challenges that foundation models address through innovative architectural approaches. TITAN, for instance, constructs its input embedding space by dividing each WSI into non-overlapping patches of 512 × 512 pixels at 20× magnification, followed by extraction of 768-dimensional features for each patch using an extended version of CONCH [7]. To handle large and irregularly shaped WSIs, the model creates views by randomly cropping the 2D feature grid, sampling region crops of 16 × 16 features covering a region of 8,192 × 8,192 pixels [7]. For vision-only pretraining, TITAN employs the iBOT framework on these feature grids, applying augmentations such as vertical and horizontal flipping followed by posterization feature augmentation [7]. This approach enables the model to capture both local cellular patterns and global tissue architecture, which is essential for accurate prognostic assessment.
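A simple way to realize the 2D feature grid described above is to index each patch feature by its grid coordinates, derived from its pixel position and the 512-pixel patch size. In the hedged sketch below, the zero-filling of empty (background) cells and the coordinate handling are illustrative choices, not the published implementation.

```python
import numpy as np

def build_feature_grid(patch_features, patch_coords, patch_size=512):
    """Arrange patch features into a 2D grid mirroring their slide positions.

    patch_features: (N, 768) features for N non-overlapping 512x512 patches.
    patch_coords:   (N, 2) top-left (x, y) pixel coordinates of each patch.
    Cells with no tissue patch are left as zeros here, which is an
    illustrative choice rather than the published handling.
    """
    cols = patch_coords[:, 0] // patch_size
    rows = patch_coords[:, 1] // patch_size
    cols, rows = cols - cols.min(), rows - rows.min()
    grid = np.zeros((rows.max() + 1, cols.max() + 1, patch_features.shape[1]),
                    dtype=np.float32)
    grid[rows, cols] = patch_features
    return grid      # (grid_h, grid_w, 768), ready for region cropping
```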
Integrating histopathology images with complementary data modalities represents a crucial step in developing comprehensive prognostic models. Joint models that combine longitudinal and survival data offer particularly promising pathways for precision prognosis [43]. These models simultaneously model both the longitudinal evolution of biomarkers or imaging features and the time-to-event outcomes, properly accounting for correlations between these processes and enabling dynamic prediction updates as new data becomes available [43]. For genomic integration, approaches range from early fusion—where features from different modalities are combined at the input level—to late fusion where separate models process each modality with integration occurring at the prediction level [38]. Cross-attention mechanisms have also been employed to effectively capture interactions between histopathological and molecular features, enhancing model performance for survival prediction [38].
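To illustrate the cross-attention style of fusion mentioned above, the following sketch lets histology tokens attend to molecular feature tokens before a risk-prediction head. All dimensions, the pooling step, and the single-score output are assumptions for demonstration; published multimodal survival models differ in their details.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Minimal cross-attention fusion: histology tokens (queries) attend to
    molecular feature tokens (keys/values) before a risk head."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.risk_head = nn.Linear(dim, 1)

    def forward(self, histo_tokens: torch.Tensor, omics_tokens: torch.Tensor):
        # histo_tokens: (B, T_img, dim) tile/region embeddings projected to `dim`
        # omics_tokens: (B, T_omics, dim) embedded molecular features
        fused, _ = self.cross_attn(histo_tokens, omics_tokens, omics_tokens)
        slide_repr = fused.mean(dim=1)           # simple pooling over tokens
        return self.risk_head(slide_repr)        # e.g., a relative risk score

model = CrossModalFusion()
risk = model(torch.randn(2, 100, 256), torch.randn(2, 50, 256))
```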
Traditional prognostic models often rely on static characteristics for long-term predictions, which may struggle to achieve accurate results in the dynamic context of cancer progression and treatment [43]. Dynamic prediction models (DPMs) address this limitation by linking temporal changes in features obtained during patient follow-up to disease prognosis [43]. These models can be categorized into several types, including two-stage models (most common at 32.2%), joint models (28.2%), time-dependent covariate models (12.6%), multi-state models (10.3%), landmark Cox models (8.6%), and artificial intelligence approaches (4.6%) [43]. Joint models, which integrate longitudinal and survival data, are particularly valuable for updating prognosis based on evolving patient data, such as changes in tumor size metrics, circulating free DNA levels, or occurrence of intermediate events like local recurrence or distant metastasis [43].
Implementing prognostic models with histopathology foundation models requires specific computational tools and resources. The following table details essential components for developing and applying these models in research settings.
Table 3: Essential Research Reagents and Computational Tools for Prognostic Modeling
| Resource Category | Specific Tools/Platforms | Function and Application |
|---|---|---|
| Foundation Models | CONCH, TITAN, UNI, Virchow | Provide pretrained feature extractors for histopathology images and text |
| Whole Slide Image Processing | PyRadiomics, HistomicsTK, Amira | Extract quantitative features from digitized pathology images |
| Machine Learning Frameworks | XGBoost, PyTorch, TensorFlow | Implement classification, regression, and survival analysis models |
| Genomic Data Analysis | Bioconductor, DESeq2, limma | Process and analyze high-throughput omics data for integration |
| Statistical Analysis | R, Python (scikit-survival, lifelines) | Perform survival analysis and model validation |
| Data Sources | The Cancer Genome Atlas (TCGA), Lymphoma and Related Diseases Registry | Provide multimodal datasets for model development and validation |
Despite significant advances, several challenges remain in the clinical application of foundation models for prognostic modeling [1]. Current limitations include the need for extensive external validation of developed models, unclear clinical utility in many cases, and persistent issues with domain shifts across institutions [37] [38]. Additionally, dynamic prediction models, while powerful, often utilize only a single dynamic predictor (58.6% of studies) and face challenges in handling high-dimensional data from smaller samples [43]. Future research directions include the development of more sophisticated multimodal integration techniques, improved methods for handling temporal data in prognostic assessment, and the creation of generalist medical AI systems that integrate pathology foundation models with FMs from other medical domains [1]. There is also a growing need for standardized benchmarking frameworks to objectively evaluate different foundation models across diverse tasks and datasets, particularly as the number of available models continues to increase [7] [5] [1]. As these technical challenges are addressed, the field moves closer to realizing the full potential of foundation models for enhancing routine pathological analysis and enabling more precise, personalized prognostic assessment for cancer patients.
The field of computational pathology is undergoing a transformative shift with the advent of foundation models, which leverage self-supervised learning on massive datasets to produce versatile and transferable feature representations from histopathology images [7] [1]. These models are poised to revolutionize the diagnosis and treatment of cancer and other diseases by enabling precision medicine and clinical decision support systems [4]. Morphological analysis, comprising tissue segmentation and quantitative histomorphometry, forms the cornerstone of this revolution. It provides the critical link between raw whole-slide images (WSIs) and quantifiable, biologically meaningful data.
This technical guide explores the integration of advanced tissue segmentation methodologies with powerful pathology foundation models—such as Virchow, CONCH, and UNI—to create robust, scalable pipelines for quantitative histomorphometry. Such pipelines are essential for researchers, scientists, and drug development professionals seeking to extract reproducible, high-throughput insights from histopathology data, thereby accelerating biomarker discovery, prognostic model development, and therapeutic assessment.
Foundation Models (FMs) are large-scale AI models trained on broad data using self-supervision at scale, which can be adapted to a wide range of downstream tasks [1]. They represent a paradigm shift from traditional, task-specific deep learning models, offering superior expressiveness and scalability. Their development has been fueled by advancements in AI architectures (like Transformers), increased computational efficiency, and the growing availability of digital data [1].
In computational pathology, FMs address two significant challenges: the immense cost and time associated with pathologist-led annotations for supervised learning, and the need for models that generalize across diverse diseases, organs, and tasks [1]. By pre-training on hundreds of thousands to millions of WSIs without explicit labels, these models learn fundamental representations of cellular morphology and tissue architecture. The resulting embeddings—short vector representations of input image features—can then be used with minimal additional training for diverse clinical applications, from cancer detection and subtyping to biomarker prediction and survival prognosis [1] [4].
The table below summarizes the key characteristics of several leading pathology foundation models.
Table 1: Overview of Major Pathology Foundation Models
| Model Name | Core Architecture | Training Data Scale | Key Features | Reported Applications |
|---|---|---|---|---|
| Virchow [4] | Vision Transformer (DINOv2) | 1.5 million WSIs from 100k patients | 632 million parameters; trained on H&E slides from 17 tissue types | Pan-cancer detection (0.949 AUC), rare cancer identification, biomarker prediction |
| CONCH [7] [5] | Vision-Language Model | 1.17 million image-caption pairs | Multimodal pretraining aligning images with biomedical text and synthetic captions | Image classification, segmentation, captioning, cross-modal retrieval |
| TITAN [7] | Multimodal Transformer | 335,645 WSIs + 182,862 reports + 423k synthetic captions | Whole-slide foundation model via visual SSL and vision-language alignment | Zero-shot classification, rare disease retrieval, cancer prognosis, report generation |
| UNI [5] [8] | Vision Transformer | Large-scale internal WSI repository | Self-supervised learning on diverse histopathology images | Image classification, segmentation, and biomarker prediction |
Tissue segmentation is the foundational step in quantitative histomorphometry, involving the precise delineation of relevant histological structures (e.g., glomeruli, tubules, tumor regions) from gigapixel WSIs. Accurate segmentation enables the subsequent extraction of morphometric features.
Unit-Based Tissue Segmentation (UTS) presents a paradigm shift from conventional pixel-wise segmentation. Instead of classifying every pixel, UTS treats each fixed-size tile (e.g., 32x32 pixels) as a single semantic unit, significantly reducing annotation effort and computational overhead without compromising accuracy [44].
The UTS framework follows a tile-level workflow: each WSI is partitioned into fixed-size tiles, every tile is classified as a single semantic unit, and the resulting tile labels are assembled into a slide-level tissue map [44].
This approach aligns with how pathologists often interpret morphology in discrete regions and supports downstream tasks like tumor-stroma quantification [44].
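The essence of unit-based segmentation can be sketched in a few lines: the image is partitioned into fixed-size units and each unit receives a single class label. The tile classifier, class scheme, and the assumption that the image dimensions divide evenly by the unit size are placeholders; only the 32-pixel unit size follows the description above.

```python
import numpy as np

def unit_based_segmentation(image, tile_classifier, unit=32):
    """Classify each fixed-size tile ("unit") rather than each pixel.

    image: (H, W, 3) region extracted from a WSI; `tile_classifier` is any
    callable mapping a (unit, unit, 3) tile to an integer class label
    (e.g., 0 = background, 1 = stroma, 2 = tumor).
    """
    H, W, _ = image.shape
    rows, cols = H // unit, W // unit
    label_map = np.zeros((rows, cols), dtype=np.int64)
    for r in range(rows):
        for c in range(cols):
            tile = image[r * unit:(r + 1) * unit, c * unit:(c + 1) * unit]
            label_map[r, c] = tile_classifier(tile)
    return label_map     # one semantic label per unit, e.g., for tumor-stroma ratio
```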
For instance-level segmentation of complex structures, deep learning-based semantic segmentation remains a powerful tool. The Framework for Large-Scale Histomorphometry (FLASH) exemplifies this approach in nephropathology [45].
Table 2: FLASH Segmentation Performance (Dice Similarity Coefficient) on Kidney Structures [45]
| Structure | Internal Cohorts (ACB & ACN) | External Validation Cohorts (HuBMAP & KPMP) |
|---|---|---|
| Glomeruli | High Accuracy | Comparable or Better Accuracy |
| Glomerular Tufts | High Accuracy | Comparable or Better Accuracy |
| Tubules | High Accuracy | Comparable or Better Accuracy |
| Arteries | Lower Precision | Information Not Specified |
| Arterial Lumen | Lower Precision | Information Not Specified |
FLASH employs two streamlined Convolutional Neural Networks (CNNs) [45].
The model was trained and tested on internal cohorts (Aachen Biopsy and Nephrectomy) and validated on external, multi-centre cohorts (HuBMAP, KPMP), demonstrating "pan-disease" applicability across common kidney diseases and injury patterns despite variations in staining protocols [45].
Once tissues are segmented, quantitative histomorphometry involves extracting and analyzing measurable features from these structures to uncover correlations with disease etiology, progression, and clinical outcomes.
The FLASH framework enables the large-scale extraction of interpretable morphometric features. In a study analyzing over 1,000 kidney biopsies, more than 11,000 glomeruli were processed to extract features such as glomerular tuft area and tuft circularity [45].
These features, often called "pathomics" data, can be mined to reveal novel biological and clinical insights.
The quantitative power of histomorphometry is demonstrated by its ability to confirm known pathophysiological concepts and reveal unexpected relations [45].
Table 3: Key Histomorphometric Findings in Kidney Disease from FLASH Analysis [45]
| Clinical Parameter | Morphometric Feature | Finding | Cohort |
|---|---|---|---|
| Nephrotic Range Proteinuria | Glomerular Tuft Area | Significantly larger (9.71% increase) in cases with proteinuria | AC_B (Internal) |
| Lupus Nephritis | Glomerular Tuft Area | Median area 19.71% larger than normal baseline | AC_B (Internal) |
| Membranous GN | Glomerular Tuft Area | Median area 40.54% larger than normal baseline | AC_B (Internal) |
| Membranous GN with Proteinuria | Tuft Circularity | Significant decrease (19.57%) in median circularity | AC_B (Internal) |
| Loss of Kidney Function (eGFR) | Tuft Circularity | Progressive decrease with eGFR decline (13.95% overall) | AC_B (Internal) |
Furthermore, applying techniques from single-cell transcriptomics to histomorphometric data allows for the identification of distinct glomerular populations and phenotypes along a trajectory of disease progression, adding a new dimension to tissue analysis [45].
Combining the granular output of segmentation models with the powerful, general-purpose feature extraction of foundation models creates a robust pipeline for advanced computational pathology.
A typical integrated pipeline proceeds from tissue segmentation, through foundation-model feature extraction of the segmented regions, to downstream quantitative analysis and prognostic modeling.
To evaluate the performance of a foundation model like Virchow on a downstream task such as pan-cancer detection, a common methodology is to train a lightweight classifier on frozen tile embeddings and assess specimen-level discrimination on held-out cohorts [4].
This protocol has demonstrated Virchow's state-of-the-art performance, achieving a 0.949 overall AUC across 17 cancer types and a 0.937 AUC on 7 rare cancers [4].
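A minimal version of such a downstream evaluation is to fit a lightweight classifier on frozen slide-level embeddings and report the area under the ROC curve, as sketched below with random placeholder data. Real experiments would use per-cancer-type stratification and institution-level held-out sets rather than a single random split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder slide-level embeddings (e.g., pooled frozen tile features)
# and binary cancer labels; real data would come from a foundation model.
X = np.random.rand(2000, 768)
y = np.random.randint(0, 2, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"specimen-level AUC on the held-out split: {auc:.3f}")
```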
Implementing the described workflows requires a suite of computational tools and reagents. The table below details key resources for researchers in this field.
Table 4: Essential Research Reagents and Computational Tools
| Item Name | Type | Primary Function | Relevant Model/Study |
|---|---|---|---|
| H&E-Stained WSIs | Biological Reagent / Data | The primary input data for analysis, providing high-resolution images of tissue morphology. | All (Virchow [4], TITAN [7], FLASH [45]) |
| Modified Masson-Goldner Trichrome Stain | Biological Reagent | Stains mineralized bone blue and osteoid red, enabling segmentation of bone components. | ADAM Pipeline [46] |
| Pathology Reports & Synthetic Captions | Data / Text | Provides textual descriptions used for multimodal vision-language alignment during model pretraining. | CONCH [5], TITAN [7] |
| SlideTiler Toolbox | Software | Automates the generation of uniform image tiles from WSIs for unit-based segmentation. | UTS [44] |
| OsteoMeasure Software | Software | Commercial system for manual annotation and semi-automatic histomorphometric analysis. | ADAM Pipeline [46] |
| nnU-Net | Software / Algorithm | A self-configuring deep learning framework for biomedical image segmentation. | ADAM Pipeline [46] |
| DINOv2 Algorithm | Algorithm | A self-supervised learning method used to train foundation models by enforcing consistency between different views of an image. | Virchow [4] |
The convergence of sophisticated tissue segmentation techniques and large-scale pathology foundation models marks a new era in quantitative histomorphometry. Frameworks like FLASH and UTS provide the means to extract rich, interpretable morphometric data at scale, while models like Virchow, CONCH, and TITAN offer powerful, general-purpose feature representations that generalize across institutions and diseases.
For researchers and drug developers, this integrated approach enables the transition from qualitative histology assessment to robust, data-driven "pathomics" mining. This promises to uncover novel biomarkers, refine prognostic models, and ultimately pave the way for more precise and personalized patient therapies. The ongoing challenge of replicability and the need for diverse, open-access datasets highlight the importance of collaborative efforts to ensure these powerful tools are developed and validated to the highest standards of scientific rigor.
The adoption of whole-slide imaging (WSI) in digital pathology has generated a new class of computational challenges centered on processing gigapixel-scale images that routinely contain billions of pixels. These massive data volumes push conventional deep learning architectures beyond their operational limits, necessitating specialized approaches that balance computational efficiency with analytical precision. Recent advances in pathology foundation models like Virchow, CONCH, and UNI represent a paradigm shift toward more scalable and versatile computational pathology, yet their effective implementation requires carefully engineered solutions to manage extreme computational complexity.
This technical guide examines the core strategies and methodologies enabling efficient processing of gigapixel WSIs. We explore architectural innovations in foundation models, evaluate specialized compression algorithms, and provide detailed experimental protocols for researchers developing next-generation computational pathology workflows. By framing these developments within the context of foundational AI models transforming histopathology research, we aim to provide a comprehensive resource for scientists and drug development professionals navigating the computational constraints of large-scale pathology image analysis.
Whole-slide images present unprecedented computational demands, with individual slides often occupying 1-5 gigabytes of storage space and containing resolvable features at multiple magnification levels [47]. This data volume creates significant bottlenecks in storage, transmission, and processing pipelines. Traditional deep learning models designed for natural images struggle with WSIs due to memory constraints, as loading entire slides into GPU memory remains practically impossible without substantial optimization [48].
The fundamental challenge stems from the high information irregularity inherent to pathological images. Unlike natural images with consistent locality patterns, WSIs demonstrate widely distributed high-frequency signals and significant local volatility [47]. This irregularity confounds conventional compression algorithms and necessitates specialized approaches that can maintain diagnostic fidelity while reducing computational overhead.
Recent years have witnessed the emergence of foundation models specifically pretrained on massive histopathology datasets. These models serve as versatile feature extractors that can be adapted to diverse downstream tasks with minimal fine-tuning. As shown in Table 1, the key foundation models employ distinct architectural strategies and training methodologies to overcome computational barriers.
Table 1: Comparison of Pathology Foundation Models
| Model | Architecture | Training Data | Key Innovations | Computational Advantages |
|---|---|---|---|---|
| CONCH [5] | Vision-language model | 1.17M image-caption pairs | Contrastive learning from captions | Multimodal capabilities without retraining; effective for non-H&E stains |
| UNI [49] | Vision Transformer | >100,000 WSIs | Self-distillation and masked image modeling | Transferable representations across tissue types and resolutions |
| TITAN [7] | Transformer with ALiBi | 335,645 WSIs + synthetic captions | Multi-stage pretraining; 2D attention with linear biases | Handles long sequences (>10^4 tokens); enables whole-slide representation learning |
| Prov-GigaPath [42] | Transformer | Large-scale WSI collection | Whole-slide representation learning | State-of-the-art performance for slide-level prediction tasks |
These foundation models share a common strategy of self-supervised pretraining on broad data, eliminating the need for expensive manual annotations while learning generally useful representations of histopathological morphology [1]. The adaptability of these models significantly reduces the computational overhead associated with training task-specific models from scratch.
Most successful approaches for WSI processing employ a hierarchical strategy that decomposes the computational problem into manageable components. The patch-based paradigm operates by dividing WSIs into smaller regions (typically 256×256 to 1024×1024 pixels), processing these patches independently, and then aggregating the results [49]. This approach enables training with standard deep learning architectures but sacrifices global contextual information crucial for many diagnostic tasks.
More advanced models like TITAN introduce a transitional approach that leverages pre-extracted patch features from established encoders like CONCH, then applies transformer architectures to model relationships between these patch embeddings [7]. This hybrid methodology maintains the efficiency of patch-based processing while capturing slide-level contextual relationships through attention mechanisms.
Efficient data compression is essential for sustainable WSI storage and transmission. While lossy compression methods like JPEG2000 offer high compression ratios, they risk introducing diagnostically relevant artifacts [47]. Recent research has therefore focused on optimized lossless compression specifically designed for WSIs.
The WISE framework employs a hierarchical encoding strategy that first eliminates empty background regions then applies specialized dictionary-based compression to informative tissue regions [47]. As shown in Table 2, this approach significantly outperforms conventional compression methods by addressing the unique information irregularity characteristics of WSIs.
Table 2: Performance Comparison of Lossless Compression Methods on WSI Data
| Compression Method | Type | Compression Ratio (Normal Images) | Compression Ratio (WSI) |
|---|---|---|---|
| PNG [47] | Entropy-based | 2.06 | 1.00 |
| Huffman Coding [47] | Entropy-based | 1.19 | 1.23 |
| LZMA [47] | Dictionary-based | 1.43 | 1.98 |
| WISE [47] | Hierarchical + Dictionary | - | 36.00 (avg), 136.00 (max) |
The exceptional performance of WISE demonstrates the importance of domain-specific compression strategies that account for the structural characteristics of pathological images rather than treating WSIs as conventional image data.
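The general two-step idea, discarding near-empty background tiles and then applying dictionary-based lossless coding to the remaining tissue tiles, can be sketched as follows. This is not the WISE algorithm itself; the background heuristic and LZMA preset are arbitrary assumptions used only to illustrate the principle.

```python
import lzma
import numpy as np

def compress_tiles(tiles, background_std=2.0):
    """Illustrative two-step scheme: skip near-uniform background tiles, then
    apply dictionary-based lossless compression (LZMA) to the tissue tiles.

    tiles: list of (H, W, 3) uint8 arrays cut from a WSI.
    Returns the indices of kept tiles, their compressed bytes, and the ratio
    of raw to compressed size.
    """
    kept, blobs = [], []
    for i, tile in enumerate(tiles):
        if tile.std() < background_std:       # near-uniform tiles treated as background
            continue
        kept.append(i)
        blobs.append(lzma.compress(tile.tobytes(), preset=6))
    raw = sum(t.nbytes for t in tiles)
    comp = sum(len(b) for b in blobs) or 1
    return kept, blobs, raw / comp
```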
Transformer-based models have demonstrated remarkable capabilities in computational pathology but face significant computational complexity challenges when applied to WSIs. The TITAN model addresses this through several key innovations:
Extended Context with ALiBi: Traditional transformers struggle with sequence lengths exceeding their pretrained context window. TITAN implements Attention with Linear Biases (ALiBi), which extrapolates to longer sequences by using relative positional embeddings based on Euclidean distance between feature locations [7].
Feature Grid Cropping: Instead of processing entire slide feature maps, TITAN employs random cropping of 16×16 feature regions (covering 8,192×8,192 pixels at 20× magnification), with subsequent global and local crops created for self-supervised pretraining [7].
Knowledge Distillation: Following successful patch encoder methodologies, TITAN uses the iBOT framework for masked image modeling and knowledge distillation, enabling more efficient representation learning [7].
These architectural optimizations enable transformer models to handle the long sequences inherent to WSI processing while maintaining computational feasibility.
The TITAN framework implements a comprehensive three-stage pretraining approach for learning general-purpose slide representations [7]:
Stage 1: Vision-only Pretraining
Stage 2: ROI-Level Vision-Language Alignment
Stage 3: Slide-Level Vision-Language Alignment
This protocol produces slide representations that support diverse clinical applications including rare cancer retrieval, prognosis prediction, and zero-shot classification without task-specific fine-tuning.
The following protocol details an approach for predicting BRAF-V600 mutation status in melanoma using foundation models and gradient boosting [42]:
Feature Extraction: tile-level features are extracted from the WSIs with the pretrained Prov-GigaPath foundation model and aggregated into slide-level representations [42].
Model Training and Evaluation: a gradient-boosting (XGBoost) classifier is trained on these representations under weak, slide-level supervision and evaluated with cross-validation on TCGA and on an independent external test set [42].
This weakly supervised approach achieves state-of-the-art AUC of 0.824 during cross-validation and 0.772 on external testing, demonstrating the predictive potential of foundation model representations [42].
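A hedged sketch of this kind of weakly supervised pipeline is shown below: pooled slide-level features stand in for foundation-model embeddings, and an XGBoost classifier is evaluated with cross-validated AUC. The feature dimensionality, pooling strategy, and hyperparameters are illustrative assumptions, not the published configuration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Placeholder slide-level features: in practice, tile embeddings from a
# pretrained encoder (e.g., Prov-GigaPath) would be pooled per slide.
slide_features = np.random.rand(400, 1536)     # one pooled vector per melanoma WSI
braf_status = np.random.randint(0, 2, 400)     # weak slide-level labels from sequencing

clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                    eval_metric="logloss")
aucs = cross_val_score(clf, slide_features, braf_status, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {aucs.mean():.3f} (std {aucs.std():.3f})")
```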
Diagram: Foundation Model Prediction Workflow.
HistoGPT demonstrates how foundation models can generate comprehensive pathology reports from multiple gigapixel WSIs [49]. The published protocol spans model architecture configuration, the training procedure, and the metrics used to evaluate the generated reports [49].
This approach captures approximately 67% of key diagnostic terminology and produces clinically acceptable reports for common malignancies [49].
Table 3: Key Computational Resources for WSI Processing Research
| Resource | Type | Function | Application Examples |
|---|---|---|---|
| CONCH [5] | Vision-language model | Multimodal feature extraction | Image-text retrieval, classification, segmentation |
| TCGA Datasets [42] | WSI Repository | Benchmark data with molecular annotations | Model validation, pan-cancer studies |
| WISE Compressor [47] | Compression Framework | Lossless WSI compression | Storage optimization, efficient data transmission |
| TITAN [7] | Whole-slide foundation model | Slide-level representation learning | Rare disease retrieval, prognosis prediction |
| HistoGPT [49] | Report generation model | Automated pathology reporting | Diagnostic assistance, education |
Diagram: TITAN Pretraining Process
The field of computational pathology continues to evolve toward more integrated, multimodal foundation models that combine histopathological, genomic, and clinical data [1]. The computational complexity inherent to gigapixel WSI processing will remain a central challenge, necessitating continued innovation in model architectures, training methodologies, and inference optimization.
The most promising research directions include deeper multimodal integration of histopathological, genomic, and clinical data; architectural innovations for long-sequence WSI processing; more efficient training methodologies; and inference optimization.
The foundation models discussed in this guide—CONCH, UNI, TITAN, and related architectures—represent significant milestones in addressing the computational complexity of gigapixel whole-slide image processing. By providing versatile, scalable frameworks for histopathological analysis, these models are accelerating the transition of AI from research laboratories to clinical practice, ultimately advancing precision oncology and personalized medicine.
In digital histopathology, the development of robust deep learning models has traditionally been constrained by the limited availability of high-quality, expert-annotated data. The process of labeling whole slide images (WSIs) is both time-consuming and costly, requiring specialized expertise from pathologists. This data scarcity poses a significant bottleneck for building effective AI-assisted diagnostic tools. Fortunately, recent advancements in foundation models have created new paradigms for overcoming these limitations through few-shot and zero-shot learning techniques [1].
These approaches are particularly valuable in histopathology due to several domain-specific challenges: the gigapixel size of WSIs, the high cost of expert annotations, the presence of rare diseases with minimal available data, and the need to recognize novel tissue structures without exhaustive labeling. Few-shot learning enables models to recognize new tissue classes from very few labeled examples, while zero-shot learning allows models to classify previously unseen tissue types without any task-specific training data [51].
This technical guide explores how modern foundation models—including Virchow, CONCH, and UNI—are revolutionizing computational pathology by addressing data scarcity challenges. We examine their architectures, performance benchmarks, and practical implementation methodologies to provide researchers with a comprehensive resource for leveraging these approaches in histopathology research.
Foundation models are large-scale AI models trained on broad data using self-supervision at scale, which can be adapted to a wide range of downstream tasks [1]. Unlike traditional deep learning models designed for specific tasks, foundation models leverage transformer architectures and massive pretraining to develop versatile representations transferable to various applications with minimal fine-tuning.
In histopathology, several foundation models have emerged as critical tools for addressing data scarcity:
CONCH (CONtrastive learning from Captions for Histopathology) is a vision-language foundation model pretrained on 1.17 million histopathology image-caption pairs [5]. Unlike models trained solely on H&E images, CONCH produces performant representations for various stain types, including IHC and special stains, enabling applications across image classification, segmentation, captioning, and cross-modal retrieval tasks.
TITAN (Transformer-based pathology Image and Text Alignment Network) represents a more recent advancement—a multimodal whole-slide foundation model pretrained on 335,645 WSIs using visual self-supervised learning and vision-language alignment with corresponding pathology reports and synthetic captions [7]. TITAN can extract general-purpose slide representations and generate pathology reports without requiring fine-tuning, demonstrating exceptional performance in rare disease retrieval and cancer prognosis.
Virchow is another significant foundation model in computational pathology: a vision transformer pretrained on large-scale WSI datasets (Table 1). These models fundamentally differ from traditional supervised approaches by leveraging self-supervised pretraining on unlabeled data followed by minimal adaptation to specific tasks, dramatically reducing annotation requirements [1].
Table 1: Comparison of Key Histopathology Foundation Models
| Model | Architecture | Pretraining Data | Key Capabilities | Applications |
|---|---|---|---|---|
| CONCH | Vision-Language Transformer | 1.17M image-caption pairs [5] | Image-text retrieval, classification, segmentation | Multi-stain analysis, cross-modal search, caption generation |
| TITAN | Multimodal Whole-Slide Transformer | 335,645 WSIs + reports + synthetic captions [7] | Slide representation learning, report generation, zero-shot classification | Rare cancer retrieval, prognosis prediction, cross-modal retrieval |
| Virchow | Vision Transformer (ViT) | Large-scale WSI datasets | Whole-slide encoding, tissue analysis | Cancer diagnosis, tissue classification |
Few-shot learning aims to develop models that can rapidly generalize to new tasks with limited labeled examples, typically formalized as N-way K-shot problems where models must distinguish between N classes with only K examples per class [52].
Few-shot learning in histopathology typically employs episodic training, where models are exposed to numerous few-shot tasks during training to learn transferable feature representations. The standard experimental protocol involves:
Feature Extraction: Using foundation models like CONCH or UNI as feature extractors to encode histopathology images into embedding vectors without fine-tuning [5].
Prototypical Networks: Learning a metric space where classification occurs by computing distances to prototype representations of each class, which are calculated as the mean of support embeddings [15] (a minimal sketch of this step follows the list).
Fine-tuning Approaches: Adapting foundation models to specific histopathology tasks with minimal labeled data through transfer learning and regularization techniques [52].
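The sketch below illustrates the prototypical-network step, assuming frozen foundation-model embeddings are already available; the embedding dimension, episode size, and random inputs are placeholders.

```python
import torch


def prototypical_classify(support: torch.Tensor, support_labels: torch.Tensor,
                          query: torch.Tensor, n_way: int) -> torch.Tensor:
    """Classify query embeddings by distance to class prototypes, where each prototype
    is the mean of that class's support embeddings (frozen foundation-model features).
    """
    prototypes = torch.stack([support[support_labels == c].mean(dim=0)
                              for c in range(n_way)])  # (n_way, dim)
    distances = torch.cdist(query, prototypes)         # (n_query, n_way)
    return distances.argmin(dim=1)                     # nearest prototype wins


# A 5-way 5-shot episode with synthetic 512-d embeddings standing in for real features.
n_way, k_shot, dim = 5, 5, 512
support = torch.randn(n_way * k_shot, dim)
support_labels = torch.arange(n_way).repeat_interleave(k_shot)
query = torch.randn(20, dim)
print(prototypical_classify(support, support_labels, query, n_way))
```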
A comprehensive evaluation of few-shot learning methods on histopathology images revealed that popular meta-learning approaches perform on par with standard fine-tuning and regularization methods [52]. The best methods achieved accuracies exceeding 70%, 80%, and 85% for 5-way 1-shot, 5-way 5-shot, and 5-way 10-shot scenarios respectively across four histopathology datasets.
Recent research demonstrates remarkable success in few-shot learning for histopathology. One study focusing on colorectal cancer classification achieved over 98% accuracy on a query dataset with 35 samples per category using only 10 training samples per category [53]. The model maintained robust performance exceeding 93% accuracy on comprehensive test datasets containing 1,916 samples, confirming strong generalization capability.
Another approach combining efficient fine-tuning of foundation models with few-shot learning significantly enhanced pattern recognition in histopathology while reducing annotation requirements [15]. This method leveraged self-supervised learning to adapt pretrained Vision Transformers using unlabeled data from the target domain before few-shot fine-tuning.
Table 2: Few-Shot Learning Performance Benchmarks in Histopathology
| Task | Setting | Best Accuracy | Dataset | Key Method |
|---|---|---|---|---|
| Colorectal Cancer Classification [53] | 2-way 10-shot | >98% | Colorectal Cancer | Transfer Learning + Contrastive Learning |
| General Histopathology Classification [52] | 5-way 1-shot | >70% | 4 Datasets | Meta-learning/Fine-tuning |
| General Histopathology Classification [52] | 5-way 5-shot | >80% | 4 Datasets | Meta-learning/Fine-tuning |
| General Histopathology Classification [52] | 5-way 10-shot | >85% | 4 Datasets | Meta-learning/Fine-tuning |
Diagram 1: Prototypical Networks for Few-Shot Learning. This workflow demonstrates the metric-based approach where class prototypes are computed from support examples, and query samples are classified based on distance to these prototypes.
Zero-shot learning represents a more extreme approach to data scarcity, enabling models to classify unseen categories without any labeled examples. This is achieved by leveraging semantic relationships and auxiliary information, typically in the form of textual descriptions [54].
The MR-PHE (Multi-Resolution Prompt-guided Hybrid Embedding) framework exemplifies advanced zero-shot learning in histopathology [54] [55]. This approach addresses key challenges through several innovative components:
Multi-Resolution Patch Extraction: Mimics the diagnostic workflow of pathologists by capturing both fine-grained cellular details and broader tissue structures through analysis of image patches at multiple resolutions [55].
Hybrid Embedding Strategy: Integrates global image embeddings with weighted patch embeddings, effectively combining local and global contextual information critical for accurate diagnosis [54].
Prompt Generation and Selection: Develops comprehensive class descriptions enriched with domain-specific synonyms and clinically relevant features to enhance semantic understanding between visual features and text descriptors [55].
Similarity-Based Patch Weighting: Assigns attention-like weights to patches based on their relevance to class embeddings, emphasizing diagnostically important regions during classification [54].
This framework leverages pretrained vision-language models like CONCH without requiring domain-specific fine-tuning, offering exceptional scalability and reducing dependence on large annotated datasets [55].
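The sketch below compresses these components into a single function, assuming precomputed global, patch, and class-prompt embeddings from a vision-language encoder such as CONCH; the mixing weight alpha and the max-similarity patch weighting are simplifications of the full MR-PHE formulation.

```python
import torch
import torch.nn.functional as F


def zero_shot_classify(global_emb, patch_embs, class_text_embs, alpha=0.5):
    """Simplified MR-PHE-style zero-shot classification from precomputed embeddings.

    `global_emb` (dim,) and `patch_embs` (n_patches, dim) stand in for outputs of a
    vision-language encoder such as CONCH; `class_text_embs` (n_classes, dim) are
    prompt-ensemble embeddings per class. `alpha` is a hypothetical mixing weight.
    """
    global_emb = F.normalize(global_emb, dim=-1)
    patch_embs = F.normalize(patch_embs, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)

    # Similarity-based patch weighting: emphasize patches that match any class strongly.
    patch_class_sim = patch_embs @ class_text_embs.T            # (n_patches, n_classes)
    weights = torch.softmax(patch_class_sim.max(dim=1).values, dim=0)
    local_emb = F.normalize((weights[:, None] * patch_embs).sum(dim=0), dim=-1)

    # Hybrid embedding: blend global and weighted-patch representations.
    hybrid = F.normalize(alpha * global_emb + (1 - alpha) * local_emb, dim=-1)
    return (hybrid @ class_text_embs.T).argmax().item()


# Synthetic stand-ins: 3 candidate classes, 12 multi-resolution patches, 512-d embeddings.
pred = zero_shot_classify(torch.randn(512), torch.randn(12, 512), torch.randn(3, 512))
print(f"predicted class index: {pred}")
```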
The MR-PHE framework demonstrates that zero-shot learning can not only address data scarcity but in some cases surpass fully supervised models in histopathology image classification [55]. This remarkable performance stems from the method's ability to leverage rich semantic knowledge encoded in vision-language models and its capacity to focus on diagnostically relevant regions through the patch weighting mechanism.
Another significant advancement comes from the TITAN model, which achieves robust zero-shot classification by aligning whole-slide images with pathological concepts through multimodal pretraining [7]. By leveraging both visual signals and corresponding pathology reports during pretraining, TITAN develops a semantic understanding that transfers effectively to unseen classification tasks without additional fine-tuning.
Diagram 2: Zero-Shot Learning with MR-PHE Framework. This architecture shows how multi-resolution patches and text prompts are processed through parallel pathways then fused for classification without training examples.
Implementing few-shot and zero-shot learning approaches in histopathology requires both computational resources and specialized methodological components. The table below details essential "research reagents" for developing and evaluating these systems.
Table 3: Essential Research Reagents for Few-Shot and Zero-Shot Learning in Histopathology
| Research Reagent | Type | Function | Examples/Implementation |
|---|---|---|---|
| Foundation Models | Pre-trained Model | Provides transferable feature representations for downstream tasks | CONCH, TITAN, Virchow [5] [7] |
| Feature Extractors | Software Component | Encodes histopathology images into embedding space | CONCH visual encoder, UNI feature extractor [5] |
| Prompt Templates | Methodological Component | Generates textual descriptions for zero-shot class mapping | Domain-specific synonyms, clinical feature descriptions [55] |
| Metric Learning Libraries | Software Tool | Implements distance measurement and similarity computation | PyTorch Metric Learning, TensorFlow Similarity |
| Whole Slide Image Datasets | Data Resource | Provides multi-organ, multi-disease images for pretraining and evaluation | TCGA, PAIP, GTEx (excluded from CONCH pretraining) [5] |
| Multi-Resolution Patch Extraction Tools | Software Component | Processes WSIs at multiple magnifications for analysis | OpenSlide, CUDA-optimized patch extraction algorithms |
Few-shot and zero-shot learning approaches represent paradigm-shifting solutions to the critical challenge of data scarcity in computational histopathology. By leveraging foundation models like CONCH, TITAN, and Virchow, researchers can develop powerful classification systems that require minimal labeled data while maintaining robust performance across diverse tissue types and disease conditions.
The experimental results demonstrate that these approaches are not merely theoretical alternatives but practical solutions achieving competitive performance with fully supervised methods—in some cases even surpassing them. As foundation models continue to evolve and incorporate multimodal capabilities, their adaptability to rare diseases and novel classification tasks will further expand.
For researchers and drug development professionals, embracing these methodologies offers a path to accelerate histopathology AI development while conserving valuable expert annotation resources. The integration of vision-language models with specialized frameworks for histopathology promises to unlock new possibilities in precision medicine and personalized treatment strategies.
Domain shift, the variation in histologic staining between different medical centers, represents one of the most profound challenges in computational pathology. These variations arise from differences in staining protocols, reagent batches, scanner specifications, and imaging devices across institutions, introducing significant color and texture differences that compromise the reliability of artificial intelligence (AI) algorithms. This phenomenon directly impedes the widespread applicability of downstream tasks like cancer diagnosis, prognosis, and biomarker prediction. When algorithms trained on data from one institution (source domain) encounter images from another (target domain), performance degradation frequently occurs due to distributional differences between the feature spaces of these domains. This problem is particularly acute for foundation models in histopathology, such as Virchow, CONCH, and UNI, which are increasingly deployed for diverse clinical tasks. The stain-induced domain shift not only affects low-level color features but can also distort critical morphological patterns that these models rely on for accurate predictions, ultimately creating barriers to clinical adoption and potentially compromising patient safety in real-world deployments across heterogeneous healthcare environments.
Computational pathology foundation models (PFMs) are transformer-based architectures pretrained on massive datasets of histopathology images, enabling them to learn versatile and transferable feature representations. These models, including UNI, CONCH, and Virchow, have demonstrated remarkable capabilities across diverse diagnostic tasks, from cancer subtyping to biomarker prediction. UNI is a vision transformer (ViT) trained using self-supervised learning on hundreds of thousands of whole-slide images (WSIs) to create general-purpose slide representations deployable across clinical settings. CONCH extends this paradigm through vision-language contrastive learning, aligning histopathology images with corresponding pathology reports to enable cross-modal retrieval and zero-shot classification capabilities. The Virchow model employs self-distillation approaches to capture morphological patterns in histology patch embeddings, focusing on tissue organization and cellular structure.
Recent representational similarity analysis has revealed that these foundation models exhibit varying degrees of sensitivity to domain shifts. Studies comparing six CPath foundation models, including CONCH, PLIP, KEEP, UNI, Virchow, and Prov-GigaPath, have demonstrated that all models show high slide-dependence in their learned representations, though relatively lower disease-dependence. This slide-dependence manifests as performance variability when models encounter data from different institutions with distinct staining protocols. Importantly, research has shown that having the same training paradigm (vision-only versus vision-language) does not guarantee similar representational structures or domain sensitivity profiles. For instance, UNI and Virchow have been found to possess the most distinct representational structures among compared models, while Prov-GigaPath demonstrates higher average similarity across domains. These differences highlight the complex relationship between pretraining strategies and domain robustness, necessitating systematic approaches to handle staining variability.
Table 1: Comparative Analysis of Pathology Foundation Models and Domain Sensitivity
| Model | Training Paradigm | Key Capabilities | Domain Sensitivity Characteristics |
|---|---|---|---|
| UNI | Vision-only self-supervised learning | Whole-slide representation, biomarker prediction | Distinct representational structure, moderate slide-dependence |
| CONCH | Vision-language contrastive learning | Cross-modal retrieval, text-guided search | High slide-dependence (5.5% reduction with stain normalization) |
| Virchow/Virchow2 | Self-distillation with DINOv2 | Cellular feature extraction, tissue classification | Most distinct representational structure, variable domain robustness |
| Prov-GigaPath | Large-scale self-supervision | Slide-level encoding, prognostic prediction | Highest average similarity across domains |
| TITAN | Multimodal vision-language | Report generation, zero-shot classification | Robust to rare diseases via synthetic data augmentation |
Stain normalization methods aim to standardize the color appearance of histopathology images across different domains while preserving diagnostically critical morphological information. Traditional approaches have relied on color space transformations and statistical matching techniques, but these often prioritize global color consistency at the expense of fine structural details. Recent advances have introduced deep learning frameworks that integrate enhanced residual learning with multi-scale attention mechanisms for structure-preserving stain normalization. These methods explicitly decompose the transformation process into base reconstruction and residual refinement components, enabling precise control over the structure-color trade-off. The incorporation of attention-guided skip connections allows adaptive focusing on diagnostically relevant regions while maintaining global coherence. Evaluations on the MITOS-ATYPIA-14 dataset containing 1,420 paired H&E-stained breast cancer images from two scanners demonstrate exceptional performance with a structural similarity index (SSIM) of 0.9663 ± 0.0076, representing a 4.6% improvement over StainGAN baselines. Edge preservation loss of 0.0465 ± 0.0088 demonstrates a 35.6% error reduction compared to the next best method, ensuring critical cellular and architectural features remain intact during color normalization.
For whole-slide images, multi-domain approaches like MultiStain-CycleGAN have been developed to normalize images of different origins without retraining or using different models. This method uses a many-to-one approach with an intermediate domain to reduce the input space, effectively disguising the origin of a whole-slide image while maintaining diagnostic integrity. Evaluation metrics demonstrate that such approaches can reliably fool domain classifiers (attempting to assign medical center to an image) while preserving tumor classification performance, as measured by structural similarity index and Fréchet inception distance. These normalization techniques directly benefit foundation model applications by reducing the domain gap between pretraining and deployment environments, enhancing model generalization across institutional boundaries.
Domain adaptation (DA) addresses domain shift by aligning feature distributions between source and target domains, enabling models to maintain performance when deployed in new environments. While traditional DA methods focused on image patches, recent approaches explicitly address slide-level domain adaptation to capture global WSI features required in typical clinical scenarios. The Hierarchical Adaptation framework for Slide-level Domain-shift (HASD) achieves multi-scale feature consistency through three complementary components: (1) Domain-level Alignment Solver using an entropic Sinkhorn-Knopp algorithm for feature distribution alignment; (2) Slide-level Geometric Invariance Regularization to preserve morphological structure during adaptation; and (3) Patch-level Attention Consistency Regularization to maintain local critical diagnostic cues. This framework also incorporates efficient prototype selection to mitigate computational overhead associated with processing thousands of patches per slide.
Validation on slide-level tasks across five datasets demonstrates significant improvements, with HASD achieving a 4.1% AUROC improvement in a Breast Cancer HER2 Grading cohort and a 3.9% C-index gain in a UCEC survival prediction cohort compared to state-of-the-art methods. The method provides a practical solution for pathology institutions seeking to transfer models from a source center to a target center while addressing domain shift with minimal computational overhead and annotation costs. For foundation models specifically, parameter-efficient fine-tuning via low-rank adaptation (LoRA) has emerged as an effective strategy for domain adaptation, where only small adapter modules are trained rather than the entire model, preserving the generalizable features learned during large-scale pretraining while adapting to institution-specific characteristics.
Table 2: Performance Comparison of Domain Adaptation Methods
| Method | Technical Approach | Validation Tasks | Performance Improvement |
|---|---|---|---|
| HASD | Hierarchical adaptation with domain alignment, geometric invariance, and attention consistency | Breast Cancer HER2 Grading, UCEC Survival Prediction | 4.1% AUROC gain, 3.9% C-index improvement |
| Structure-Preserving Stain Normalization | Attention-guided residual learning with multi-scale decomposition | MITOS-ATYPIA-14 dataset | 35.6% error reduction in edge preservation |
| MultiStain-CycleGAN | Many-to-one cycle-consistent adversarial networks | Multi-center tumor classification | High structural similarity while fooling domain classifiers |
| LoRA Fine-tuning | Parameter-efficient adaptation of foundation models | Atypical mitosis classification | 10% balanced accuracy improvement through ensemble |
Data-centric approaches focus on constructing comprehensive datasets that encapsulate domain variability, enabling the development of more robust models. The PathoLogy Images of Scanners and Mobile phones (PLISM) dataset represents a significant advancement in this direction, containing 46 human tissue types stained using 13 hematoxylin and eosin conditions and captured using 13 imaging devices. The dataset includes precisely aligned image patches from different domains, allowing for accurate evaluation of color and texture properties across staining and imaging variations. Analysis of PLISM reveals significant diversity across domains, particularly between whole-slide images and smartphone-captured images, highlighting the substantial domain shift challenge in real-world scenarios.
Utilizing such diverse datasets during foundation model pretraining or fine-tuning enhances inherent domain robustness. For vision-language models like CONCH, incorporating synthetic data generated through multimodal generative AI copilots has shown promise in improving domain generalization. The TITAN model, for instance, was pretrained using 423,122 synthetic captions generated from a multimodal generative AI copilot for pathology, enhancing its capability to handle resource-limited clinical scenarios such as rare disease retrieval. Similarly, data augmentation strategies like Fourier Domain Adaptation (FDA) and fisheye transforms have demonstrated effectiveness in improving model robustness, with experiments showing approximately 10% improvements in balanced accuracy for atypical mitosis classification when combined with foundation model ensembles.
The structure-preserving stain normalization protocol follows a multi-stage process to transform images from a source domain to a target domain while maintaining structural integrity. The implementation utilizes an enhanced residual learning architecture with attention-guided skip connections that explicitly decomposes the transformation into structure-preserving and color-adjusting components:
Image Preprocessing: Convert input WSIs to appropriate magnification (typically 20×) and extract representative patches covering diverse tissue regions and staining patterns.
Base Reconstruction: Process input images through an encoder-decoder network to reconstruct the structural components using residual connections that preserve spatial information.
Residual Refinement: Apply attention-guided color transformation through multi-scale attention mechanisms that capture both local cellular features and global tissue patterns.
Adversarial Training: Utilize discriminator networks to ensure generated images are indistinguishable from target domain images while maintaining perceptual quality.
The model is trained using a combination of loss functions including structural similarity loss, edge preservation loss, perceptual loss, and adversarial loss, with adaptive weighting through curriculum learning that progressively emphasizes different normalization aspects. Implementation typically uses the MITOS-ATYPIA-14 dataset containing 1,420 paired H&E-stained breast cancer images from two scanners (Aperio ScanScope XT and Hamamatsu NanoZoomer 2.0-HT) for validation, with quantitative evaluation using SSIM, PSNR, edge preservation index, and Fréchet Inception Distance.
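As an illustration of how structural and edge-preservation terms can be combined during training, the sketch below implements a reduced version of the composite objective with an L1 reconstruction term and a Sobel-gradient edge term; the adversarial and perceptual losses, curriculum weighting, and the specific loss weights are omitted or hypothetical.

```python
import torch
import torch.nn.functional as F


def edge_map(gray: torch.Tensor) -> torch.Tensor:
    """Sobel gradient magnitude of a grayscale batch (B, 1, H, W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)


def normalization_loss(generated, target, w_struct=1.0, w_edge=0.5):
    """Reduced composite objective: pixel-level reconstruction plus an edge-preservation
    term. The adversarial and perceptual terms and the curriculum weighting described
    in the protocol are omitted, and the loss weights are hypothetical.
    """
    struct_loss = F.l1_loss(generated, target)
    edge_loss = F.l1_loss(edge_map(generated.mean(dim=1, keepdim=True)),
                          edge_map(target.mean(dim=1, keepdim=True)))
    return w_struct * struct_loss + w_edge * edge_loss


# Random tensors standing in for normalized output and target-domain patches.
gen, tgt = torch.rand(2, 3, 256, 256), torch.rand(2, 3, 256, 256)
print(f"composite loss: {normalization_loss(gen, tgt).item():.4f}")
```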
The HASD framework implements a comprehensive methodology for slide-level domain adaptation through the following experimental protocol:
Feature Extraction: Extract patch-level features using a pre-trained foundation model (UNI), processing each WSI as a bag of patch features {r₁, r₂, ..., r_P}, with each rᵢ ∈ ℝ^M.
Domain-level Alignment: Apply the Domain-level Alignment Solver, which matches source and target feature distributions via entropy-regularized optimal transport solved with the Sinkhorn-Knopp algorithm (a minimal sketch follows this protocol).
Slide-level Geometric Invariance: Apply geometric regularization to preserve intra-slide spatial relationships using attention-based multiple instance learning (ABMIL) aggregation.
Patch-level Attention Consistency: Regularize attention patterns across domains to ensure consistent focus on diagnostically critical regions.
Efficient Prototype Selection: Select K most informative prototypes per slide to reduce computational burden while preserving essential information.
The method is validated on two slide-level tasks across five datasets, with performance measured through AUROC for classification tasks and C-index for survival prediction. Implementation typically uses PyTorch with foundation models from the Mahmood Lab (UNI) and requires careful hyperparameter tuning for the Sinkhorn regularization parameter and attention consistency weight.
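A minimal sketch of the entropy-regularized optimal-transport alignment referenced in step 2 is given below, using plain Sinkhorn-Knopp scaling iterations on randomly generated feature sets; the regularization strength, iteration count, and cost normalization are illustrative choices, and HASD's geometric and attention-consistency regularizers are not included.

```python
import torch


def sinkhorn_alignment(source_feats, target_feats, epsilon=0.05, n_iters=100):
    """Entropy-regularized optimal-transport coupling between source and target feature
    sets via Sinkhorn-Knopp scaling. `epsilon`, `n_iters`, and the cost normalization
    are illustrative; the full HASD regularizers are out of scope here.
    """
    cost = torch.cdist(source_feats, target_feats) ** 2
    cost = cost / cost.max()                                   # normalize for stability
    K = torch.exp(-cost / epsilon)                             # Gibbs kernel
    a = torch.full((source_feats.shape[0],), 1.0 / source_feats.shape[0])
    b = torch.full((target_feats.shape[0],), 1.0 / target_feats.shape[0])
    u = torch.ones_like(a)
    for _ in range(n_iters):                                   # alternating scaling updates
        u = a / (K @ (b / (K.T @ u)))
    v = b / (K.T @ u)
    plan = torch.diag(u) @ K @ torch.diag(v)                   # transport plan (n_s, n_t)
    alignment_loss = (plan * cost).sum()                       # cost minimized during adaptation
    return plan, alignment_loss


plan, loss = sinkhorn_alignment(torch.randn(64, 256), torch.randn(80, 256))
print(plan.shape, f"alignment loss: {loss.item():.4f}")
```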
Parameter-efficient fine-tuning of pathology foundation models using Low-Rank Adaptation (LoRA) provides an effective approach for adapting these models to new domains with limited target data:
Model Selection: Choose appropriate foundation models (UNI, Virchow, CONCH) based on task requirements and representational characteristics.
LoRA Configuration: Apply low-rank adaptation to the query (Q) and value (V) projection matrices in the multi-head self-attention modules, training only the small low-rank update matrices while the pretrained weights remain frozen (a minimal sketch follows this protocol).
Data Augmentation: Implement robust augmentation strategies such as Fourier Domain Adaptation (FDA) and fisheye transforms, which have demonstrated effectiveness for improving domain robustness.
Ensemble Construction: Combine multiple adapted foundation models, using balanced accuracy optimization to assign ensemble weights.
This protocol has demonstrated significant improvements in domain generalization, with ensembles achieving up to 97.279% balanced accuracy on atypical mitosis classification across diverse domains.
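The sketch below shows the core LoRA mechanic on a single projection layer, assuming a standard nn.Linear Q or V projection; the rank, scaling factor, and initialization are illustrative defaults rather than the settings used in the cited experiments.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen projection (e.g., a ViT attention Q or V matrix) with a trainable
    low-rank update: W x + (alpha / r) * B A x. Rank, alpha, and initialization are
    illustrative defaults.
    """
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


# Adapt a single 768-d Q projection and count the trainable parameters.
q_proj = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in q_proj.parameters() if p.requires_grad)
total = sum(p.numel() for p in q_proj.parameters())
print(f"trainable parameters: {trainable} of {total}")
```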
Table 3: Key Research Reagents and Computational Resources for Domain Shift Mitigation
| Resource | Type | Function in Domain Shift Research | Example Implementation |
|---|---|---|---|
| PLISM Dataset | Comprehensive multi-domain dataset | Enables precise evaluation of color/texture properties across staining and imaging variations | 46 tissue types, 13 H&E conditions, 13 imaging devices with aligned patches |
| MITOS-ATYPIA-14 | Paired image dataset | Benchmark for stain normalization methods with ground truth comparisons | 1,420 paired H&E-stained breast cancer images from two scanners |
| Foundation Models (UNI, CONCH, Virchow) | Pretrained neural networks | Provide robust feature extractors transferable across domains | UNI2-h (ViT-h/14-reg8) trained on 200M+ pathology images |
| HASD Framework | Domain adaptation algorithm | Enables slide-level domain adaptation with multi-scale consistency | Hierarchical adaptation with domain alignment and attention regularization |
| LoRA (Low-Rank Adaptation) | Fine-tuning technique | Parameter-efficient domain adaptation of foundation models | Adaptation of Q/V projections in transformer attention layers |
| MultiStain-CycleGAN | Normalization model | Many-to-one stain normalization without retraining for new domains | Cycle-consistent adversarial networks with intermediate domain |
| Structure-Preserving Normalization | Image processing framework | Maintains structural integrity during color normalization | Attention-guided residual learning with multi-scale decomposition |
The integration of stain normalization, domain adaptation frameworks, and foundation model fine-tuning represents a comprehensive approach to addressing the critical challenge of domain shift in computational pathology. The emergence of large-scale foundation models like UNI, CONCH, and Virchow has created both opportunities and challenges for handling institutional variability in staining protocols. While these models offer powerful feature representations, their sensitivity to domain shifts necessitates systematic mitigation strategies. The experimental protocols and technical approaches outlined in this guide provide researchers with practical methodologies for enhancing model robustness across diverse clinical environments. As the field advances, future research directions include developing more sophisticated normalization techniques that explicitly model stain-chemical interactions, creating standardized evaluation benchmarks for domain generalization, and establishing guidelines for foundation model selection based on institutional characteristics. The continued refinement of these approaches will be essential for realizing the full potential of computational pathology in heterogeneous real-world healthcare settings, ultimately improving diagnostic accuracy and patient care across institutional boundaries.
The integration of Artificial Intelligence (AI) into clinical histopathology represents a paradigm shift in diagnostic medicine, offering unprecedented capabilities for improving diagnostic accuracy, prognostic prediction, and therapeutic decision-making. Foundation models like Virchow, CONCH (CONtrastive learning from Captions for Histopathology), and UNI have emerged as powerful tools that encode histopathology region-of-interests (ROIs) into versatile and transferable feature representations via self-supervised learning [7] [5] [1]. These models serve as foundational layers for developing various downstream clinical applications, from cancer subtyping to biomarker prediction and patient outcome prognosis. However, translating these technological advancements into clinically validated tools requires rigorous quality control (QC) and validation frameworks that ensure reliability, safety, and efficacy in real-world clinical settings.
The critical challenge in deploying clinical-grade AI lies in addressing the distribution shift between development data and real-world clinical data, compounded by the frequent absence of ground-truth annotations in deployment environments [56]. As pathology foundation models are increasingly applied to sensitive clinical tasks such as disease diagnosis, rare cancer retrieval, and cancer prognosis prediction, establishing robust validation methodologies becomes paramount for clinical adoption. This technical guide outlines comprehensive QC and validation frameworks specifically tailored for clinical-grade AI in histopathology, with particular emphasis on applications involving Virchow, CONCH, and UNI foundation models.
Clinical validation of pathology AI must adhere to fundamental principles that ensure consistent performance across diverse clinical scenarios. The SUDO (pseudo-label discrepancy) framework provides a methodological approach for evaluating AI systems on data in the wild without ground-truth annotations by quantifying class contamination in model predictions [56]. This is particularly relevant for pathology foundation models deployed across multiple institutions with varying staining protocols, scanner types, and patient populations.
Validation must address three critical dimensions: technical validation (establishing model accuracy and robustness), clinical validation (demonstrating clinical utility and safety), and operational validation (ensuring performance in real-world workflows). Well-established initiatives such as TRIPOD+AI, DECIDE-AI, SPIRIT-AI, and CONSORT-AI provide structured guidelines for methodological rigorousness and transparency in AI development [57]. These frameworks emphasize comprehensive reporting of model development, training data characteristics, and performance metrics across relevant patient subgroups.
Table 1: Core Validation Principles for Clinical Grade Pathology AI
| Validation Dimension | Key Components | Relevant Standards/Frameworks |
|---|---|---|
| Technical Validation | Accuracy, robustness, reproducibility, computational efficiency | TRIPOD+AI, CONSORT-AI, SUDO framework |
| Clinical Validation | Diagnostic accuracy, clinical utility, safety impact, user acceptance | DECIDE-AI, SPIRIT-AI, AGREE II |
| Operational Validation | Workflow integration, real-world performance, scalability | FAIR principles, BIBLIO methodology |
Quantitative assessment forms the cornerstone of AI validation in pathology. For foundation models like CONCH and TITAN (Transformer-based pathology Image and Text Alignment Network), performance must be evaluated across multiple tasks including image classification, segmentation, captioning, text-to-image retrieval, and image-to-text retrieval [7] [5]. The CONCH model, pretrained on 1.17 million image-caption pairs, has demonstrated state-of-the-art performance across 14 diverse benchmarks in computational pathology [5].
Key performance metrics must be selected based on clinical relevance and application context. For diagnostic tasks, sensitivity, specificity, positive predictive value, and negative predictive value are essential. For prognostic applications, hazard ratios, time-dependent AUC, and calibration metrics provide more appropriate assessment. The emerging SUDO framework enables estimation of model performance without ground-truth labels by leveraging pseudo-label discrepancy as a proxy for prediction reliability [56].
Table 2: Essential Performance Metrics for Pathology Foundation Model Validation
| Application Domain | Primary Metrics | Secondary Metrics | Benchmark Values |
|---|---|---|---|
| Diagnostic Classification | Sensitivity, Specificity | AUC-ROC, F1-score | CONCH: SOTA on 14 benchmarks [5] |
| Slide Retrieval | Precision@K, Recall@K | mAP, NDCG | TITAN: outperforms ROI & slide foundation models [7] |
| Prognostic Prediction | C-index, Hazard Ratio | Time-dependent AUC | CAPAI: HR 5.46 for NSCLC PFS [58] |
| Report Generation | BLEU, ROUGE | Clinical accuracy, Completeness | TITAN: generates pathology reports [7] |
Comprehensive validation of pathology foundation models requires multi-center study designs that assess performance across diverse datasets, imaging protocols, and patient populations. The optimal protocol involves:
Retrospective cohort collection: Assemble whole-slide image (WSI) datasets from multiple institutions representing variations in staining protocols (H&E, IHC), scanner types, and patient demographics. TITAN validation utilized 335,645 WSIs across 20 organ types [7].
Data partitioning: Implement strict separation of training, validation, and test sets at the patient level to prevent data leakage; external test sets from completely separate institutions provide the most rigorous validation (a minimal patient-level splitting sketch follows this list).
Performance benchmarking: Evaluate foundation models against established baselines and human expert performance across defined clinical tasks. CONCH validation demonstrated state-of-the-art performance on tasks including histology image classification, segmentation, captioning, and cross-modal retrieval [5].
Statistical analysis: Employ appropriate statistical methods for comparing model performance, including confidence interval estimation, hypothesis testing, and correction for multiple comparisons.
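A minimal sketch of patient-level data partitioning (step 2) using scikit-learn's GroupShuffleSplit is shown below; the patient identifiers and slide embeddings are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic slide-level records: several slides may belong to the same patient.
n_slides = 200
patient_ids = rng.integers(0, 60, size=n_slides)   # 60 patients, multiple slides each
features = rng.normal(size=(n_slides, 512))        # stand-in slide embeddings
labels = rng.integers(0, 2, size=n_slides)

# Split at the patient level so no patient contributes slides to both partitions.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(features, labels, groups=patient_ids))

overlap = set(patient_ids[train_idx]) & set(patient_ids[test_idx])
print(f"train slides: {len(train_idx)}, test slides: {len(test_idx)}, "
      f"patients in both partitions: {len(overlap)}")  # expected: 0
```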
The SUDO framework addresses the critical challenge of evaluating AI performance on data in the wild without ground-truth annotations [56]. Implementation involves:
Deploy probabilistic AI system on data points in the wild, obtaining probability scores for positive class predictions.
Discretize output probabilities into predefined intervals (e.g., deciles) and sample data points from each interval.
Assign temporary pseudo-labels to sampled data points, then retrieve equal numbers of data points with ground-truth labels from the training set.
Train a classifier to distinguish between pseudo-labelled data points and those with ground-truth labels.
Evaluate classifier performance on a held-out set with ground-truth labels, calculating the discrepancy between classifiers with different pseudo-labels (SUDO metric).
This approach has demonstrated strong correlation with model performance (ρ = -0.84, p < 0.005 for DeepDerm) on dermatology images and histopathology patches from the Camelyon17-WILDS dataset [56].
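The sketch below gives one simplified reading of these steps for a single probability interval: wild data points falling in the interval are assigned each temporary pseudo-label in turn, combined with ground-truth-labelled training points, and the resulting accuracy gap on a held-out labelled set is reported as the discrepancy. The interval bounds, sample sizes, and synthetic two-dimensional features are illustrative; the full SUDO procedure sweeps all intervals and correlates discrepancies with performance [56].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def sudo_for_interval(wild_X, wild_probs, train_X, train_y, heldout_X, heldout_y,
                      prob_range=(0.3, 0.7), n_sample=50, seed=0):
    """Pseudo-label discrepancy for one probability interval: sample wild points in the
    interval, try each temporary pseudo-label, train a small classifier together with
    ground-truth-labelled points, and report the accuracy gap on held-out labelled data.
    """
    rng = np.random.default_rng(seed)
    in_bin = np.where((wild_probs >= prob_range[0]) & (wild_probs < prob_range[1]))[0]
    picked = rng.choice(in_bin, size=min(n_sample, len(in_bin)), replace=False)
    anchor = rng.choice(len(train_X), size=len(picked), replace=False)

    scores = {}
    for pseudo_label in (0, 1):
        X = np.vstack([wild_X[picked], train_X[anchor]])
        y = np.concatenate([np.full(len(picked), pseudo_label), train_y[anchor]])
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        scores[pseudo_label] = accuracy_score(heldout_y, clf.predict(heldout_X))
    return abs(scores[1] - scores[0])  # pseudo-label discrepancy for this interval


# Synthetic 2-D features standing in for patch embeddings; a mildly shifted "wild" set.
rng = np.random.default_rng(1)
X_all = rng.normal(size=(400, 2))
y_all = (X_all[:, 0] > 0).astype(int)
train_X, train_y, heldout_X, heldout_y = X_all[:300], y_all[:300], X_all[300:], y_all[300:]
wild_X = rng.normal(size=(500, 2)) + np.array([0.5, 0.0])
wild_probs = 1 / (1 + np.exp(-4 * wild_X[:, 0]))  # mock model probabilities for class 1
discrepancy = sudo_for_interval(wild_X, wild_probs, train_X, train_y, heldout_X, heldout_y)
print(f"SUDO discrepancy: {discrepancy:.3f}")
```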
Algorithmic fairness must be rigorously assessed through stratified validation across clinically relevant patient subgroups:
Stratification variable definition: Identify potential sources of bias including sex, age, ethnicity, disease severity, and technical factors (staining intensity, scanner type).
Performance disaggregation: Calculate performance metrics separately for each subgroup and assess between-group differences (a minimal disaggregation sketch follows this list).
SUDO-based bias detection: Implement SUDO separately across protected groups; performance discrepancies indicate potential bias even without ground-truth labels [56].
Mitigation strategy development: For identified biases, implement techniques including data augmentation, reweighting, or adversarial debiasing.
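A minimal sketch of performance disaggregation (step 2) is shown below, computing AUC and sensitivity separately per stratum; the scanner-type grouping variable, decision threshold, and synthetic predictions are placeholders.

```python
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score


def disaggregated_report(y_true, y_prob, groups, threshold=0.5):
    """Per-subgroup disaggregation: AUC and sensitivity computed separately for each
    stratum (e.g., sex, scanner type). The decision threshold is illustrative.
    """
    report = {}
    for g in np.unique(groups):
        idx = groups == g
        preds = (y_prob[idx] >= threshold).astype(int)
        report[g] = {
            "n": int(idx.sum()),
            "auc": round(roc_auc_score(y_true[idx], y_prob[idx]), 3),
            "sensitivity": round(recall_score(y_true[idx], preds), 3),
        }
    return report


# Synthetic predictions stratified by a hypothetical scanner-type variable.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 600)
y_prob = np.clip(y_true * 0.3 + rng.uniform(0, 0.7, 600), 0, 1)
groups = rng.choice(["scanner_A", "scanner_B"], 600)
for group, metrics in disaggregated_report(y_true, y_prob, groups).items():
    print(group, metrics)
```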
Successful validation of pathology foundation models requires carefully selected computational tools and data resources. The table below outlines essential components of the validation toolkit.
Table 3: Essential Research Reagents for Pathology Foundation Model Validation
| Tool/Resource | Function | Application Example |
|---|---|---|
| Whole-Slide Images (WSIs) | Digital representation of histopathology slides; primary input data | TITAN pretrained on 335,645 WSIs [7] |
| Pathology Reports | Textual descriptions of pathological findings; enable multimodal learning | CONCH trained on 1.17M image-text pairs [5] |
| Synthetic Captions | AI-generated fine-grained morphological descriptions | TITAN used 423,122 synthetic captions from PathChat [7] |
| Foundation Models (CONCH/Virchow/UNI) | Pre-trained models providing foundational representations | CONCH enables transfer to various downstream tasks [5] |
| SUDO Framework | Validation methodology for data without ground-truth | Identifies unreliable predictions without annotations [56] |
Despite significant advances in pathology foundation models, several challenges persist in their clinical validation and implementation. Dataset shift remains a fundamental obstacle, as models trained on specific institutional data may perform poorly when deployed in new environments with different staining protocols, scanner types, or patient populations [56] [59]. The absence of ground-truth annotations in real-world deployment settings complicates ongoing performance monitoring and validation. Additionally, regulatory frameworks for clinical-grade AI continue to evolve, requiring validation approaches that satisfy both technical and regulatory requirements [57] [1].
Future directions in pathology AI validation include the development of integrated multimodal foundation models that combine histopathology images with genomic, clinical, and radiology data [1]. The emergence of generalist medical AI systems capable of processing multiple data modalities will require novel validation frameworks that assess performance across diverse clinical tasks and data types. Furthermore, prospective clinical trials evaluating the impact of pathology AI on clinical outcomes remain essential for establishing true clinical utility, though such trials are currently lacking for most foundation models [59].
Standardization of validation methodologies through initiatives like the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles will promote transparency and reproducibility in pathology AI research [57]. As foundation models like Virchow, CONCH, and UNI continue to evolve, robust quality control and validation frameworks will be essential for translating their potential into clinically impactful applications that enhance diagnostic accuracy, prognostic prediction, and therapeutic decision-making in histopathology.
The adoption of artificial intelligence (AI) in computational pathology represents a paradigm shift in cancer diagnosis and treatment. Foundation models like Virchow, CONCH, and UNI are trained on massive datasets of histopathology images to enable clinical decision support systems and precision medicine [2]. These models demonstrate remarkable capabilities in pan-cancer detection, biomarker prediction, and rare cancer identification. However, their implementation raises significant ethical and data privacy concerns that researchers and drug development professionals must address. The sensitivity of patient data in digital pathology, combined with the scale of information processed by AI systems, creates a critical need for robust ethical frameworks and privacy-preserving technologies. This technical guide examines both the performance capabilities of leading pathology foundation models and the essential ethical considerations for their responsible implementation in research and clinical settings.
Table 1: Comparative Analysis of Pathology Foundation Models
| Model | Architecture | Parameters | Training Data | Key Innovations |
|---|---|---|---|---|
| Virchow | Vision Transformer (ViT) | 632 million | 1.5 million H&E whole slide images from ~100,000 patients [2] | Largest pathology foundation model; trained with DINOv2 self-supervised learning [2] |
| CONCH | Vision-Language Model | Not specified | 1.17 million histopathology image-caption pairs [5] | Contrastive learning from captions; enables cross-modal tasks (text-to-image, image-to-text retrieval) [5] |
| UNI | Vision Transformer | Not specified | Diverse sources of histopathology images | Frequently compared benchmark model; used in various research applications [5] |
Table 2: Performance Metrics of Virchow on Cancer Detection Tasks
| Task | Cancer Types | Performance (AUC) | Significance |
|---|---|---|---|
| Pan-Cancer Detection | 9 common cancers | 0.950 specimen-level AUC [2] | Outperforms specialized clinical-grade AI products on some variants [2] |
| Rare Cancer Detection | 7 rare cancers (e.g., cervical, bone) | 0.937 specimen-level AUC [2] | Demonstrates generalization to rare data with limited training examples [2] |
| Out-of-Distribution Generalization | External institution data | Similar AUC to internal data [2] | Maintains performance on data from different populations than training set [2] |
Foundation models enable diverse clinical applications beyond cancer detection. CONCH demonstrates state-of-the-art performance on 14 diverse benchmarks including histology image classification, segmentation, captioning, and multimodal retrieval tasks [5]. These capabilities are particularly valuable for drug development, where understanding morphological changes in tissue can accelerate therapeutic discovery and validation.
The integration of AI in healthcare requires adherence to established ethical principles adapted for computational pathology:
Justice and Fairness: AI systems must avoid reinforcing biases that could disadvantage certain patient groups. This encompasses both "distributive justice" (fair resource allocation) and "procedural justice" (fair decision-making) [60]. Algorithms trained on non-representative data can lead to unequal access, lower-quality care, and misdiagnosis in marginalized populations [60].
Transparency and Explainability: Transparency in pathology AI includes multiple dimensions: "data transparency" (clarity on data sources and representativeness), "algorithmic transparency" (insights into model structure and assumptions), "process transparency" (disclosure of development steps), and "outcome transparency" (explanation of how results are generated) [60].
Accountability and Responsibility: Clear accountability frameworks must define responsibility for AI-driven decisions in pathology. This includes establishing protocols for when AI systems fail or produce erroneous results that could impact patient care [61].
Patient Autonomy and Consent: Patients have a fundamental right to understand how their data is used in AI systems and provide informed consent. This becomes particularly complex in digital pathology where data may be used for multiple purposes beyond immediate clinical care [60].
Algorithmic bias represents a significant ethical challenge in pathology AI implementation. Historical trends of discrimination can become embedded in models through underrepresented minority groups in training data and biased disease labels [60]. A widely cited example from general healthcare demonstrates how a health assessment algorithm assigned equal risk levels to Black and white patients despite Black patients being significantly sicker, because it used healthcare costs as a proxy for medical need [60].
Mitigation strategies for algorithmic bias include:
Representative Data Collection: Ensuring training datasets encompass diverse demographic groups, specimen types, and laboratory preparation techniques [2] [60].
Fairness Constraints: Implementing technical constraints during model training to enforce equitable performance across patient subgroups [60].
Stakeholder Cooperation: Involving diverse stakeholders including pathologists, researchers, and patient advocates in AI development processes [60].
Table 3: Data Privacy Regulations Relevant to Digital Pathology
| Regulation | Jurisdiction | Key Requirements | Relevance to Pathology AI |
|---|---|---|---|
| HIPAA | United States | Standards for protection of health information; limited to covered entities [62] | Does not cover non-traditional parties that collect and process health data [62] |
| GDPR | European Union | Comprehensive data protection; requires lawful basis for processing [63] | Applies to all entities processing EU residents' data, including research institutions |
| NY HIPA | New York State | Broad definition of regulated health information; strict authorization requirements [62] | Covers health data not covered by HIPAA; positions New York as having extensive privacy laws [62] |
The regulatory landscape for health data privacy is fragmented, particularly in the United States where 19 states had enacted comprehensive privacy laws as of 2025 [62]. This patchwork approach creates significant compliance challenges for research institutions and drug developers working with digital pathology data across multiple jurisdictions.
Implementing robust technical safeguards is essential for protecting patient privacy in digital pathology:
Encryption Technologies: Protecting data both at rest and in transit ensures that even if data is intercepted, it remains unreadable to unauthorized individuals [63].
Access Controls: Establishing role-based access limits to ensure only authorized personnel can view or manipulate sensitive pathology data [63].
Audit Logs: Maintaining comprehensive logs of data access and modifications enables tracking and investigation of potential security incidents [63].
Secure Cloud Storage: Utilizing certified cloud storage solutions that provide scalable and cost-effective storage while maintaining high security standards for large-volume digital pathology data [63].
A data governance framework for pathology AI implementation must establish clear protocols for each stage of the AI lifecycle, with particular attention to patient consent and continuous auditing.
Effective bias mitigation requires an iterative approach that begins with data diversity assessment and continues through ongoing monitoring of deployed models.
Table 4: Key Research Reagents and Solutions for Pathology AI Implementation
| Tool/Category | Function | Examples/Standards |
|---|---|---|
| Digital Slide Scanners | Convert glass slides into high-resolution digital images | Grundium Ocus scanners [63] |
| Annotation Software | Enable pathologists to label regions of interest for model training | Various proprietary and open-source platforms |
| Data Anonymization Tools | Remove protected health information from pathology images | DICOM de-identification standards; custom solutions |
| Model Training Frameworks | Provide environment for developing and training AI models | PyTorch, TensorFlow with specialized pathology extensions |
| Fairness Assessment Toolkit | Evaluate model performance across demographic subgroups | AI Fairness 360; Fairlearn; custom fairness metrics |
| Encryption Solutions | Protect data at rest and in transit | AES-256 encryption; TLS 1.3 for data transfer |
| Laboratory Information Systems | Manage specimen data and associated metadata | LIMS with HIPAA-compliant data management [64] |
The implementation of AI foundation models in pathology presents unprecedented opportunities for improving cancer diagnosis and drug development. Models like Virchow, CONCH, and UNI demonstrate remarkable capabilities in pan-cancer detection and rare disease identification [2] [5]. However, realizing the full potential of these technologies requires careful attention to ethical considerations and data privacy protections.
Future developments in the field should focus on:
Standardized Ethical Frameworks: Developing domain-specific guidelines for the ethical implementation of pathology AI, building on existing principles from organizations like WHO and UNESCO [61].
Privacy-Preserving AI Techniques: Advancing methods such as federated learning and differential privacy that enable model training without centralizing sensitive patient data.
Regulatory Harmonization: Establishing more consistent regulatory requirements across jurisdictions to simplify compliance for multinational research initiatives.
Enhanced Transparency Tools: Creating better explanation interfaces that help pathologists understand and trust AI-generated insights.
The integration of AI into pathology represents a transformative advancement with the potential to significantly improve patient outcomes. By addressing ethical considerations and data privacy concerns proactively, researchers and drug development professionals can ensure these powerful technologies are implemented responsibly and equitably.
The emergence of pathology foundation models (PFMs) represents a paradigm shift in computational pathology, offering powerful feature representations that can be adapted to diverse diagnostic, prognostic, and biomarker prediction tasks. This technical guide provides a structured framework for researchers and drug development professionals to navigate the growing landscape of PFMs, focusing on three prominent models: Virchow, CONCH, and UNI. We present comparative performance metrics, detailed experimental protocols, and evidence-based selection criteria to enable optimal model matching to specific research requirements. By synthesizing quantitative evaluations across multiple cancer types and tasks, this guide aims to standardize PFM assessment and deployment in histopathology research, ultimately accelerating the development of precision oncology applications.
Pathology foundation models are large-scale neural networks pretrained on extensive histopathology datasets using self-supervised learning (SSL) techniques that do not require curated labels [1]. These models generate versatile feature representations, known as embeddings, that capture essential morphological patterns in tissue samples and can be adapted to various downstream predictive tasks with minimal fine-tuning [2] [65]. The transition from task-specific models to foundation models addresses critical limitations in computational pathology, including the high cost of expert annotations, data scarcity for rare diseases, and poor generalization across diverse tissue types and laboratory preparations [1] [66].
The fundamental architecture underlying most PFMs utilizes a multiple instance learning (MIL) framework where whole slide images (WSIs) are represented as bags of feature instances extracted from individual patches [65]. This approach enables handling of gigapixel WSIs through a two-stage process: feature extraction using a pretrained encoder followed by feature aggregation for slide-level predictions. PFMs have demonstrated remarkable capabilities across diverse applications including cancer detection and subtyping, biomarker prediction, survival prognosis, and rare disease identification [2] [1] [67].
Table 1: Technical Specifications of Major Pathology Foundation Models
| Model | Architecture | Parameters | Training Data | Pretraining Method | Modality |
|---|---|---|---|---|---|
| Virchow | Vision Transformer (ViT) | 632 million | 1.5M WSIs from 100k patients [2] | DINO v2 [2] | Vision-only |
| CONCH | Vision Transformer (ViT-B/16) | 86.3 million | 1.17M image-text pairs [5] | iBOT/Contrastive Learning [5] | Vision-Language |
| UNI | Vision Transformer | Not specified | 100M tissue patches, 100k WSIs [68] | DINO [65] | Vision-only |
Table 2: Performance Comparison Across Diagnostic Tasks
| Model | Pan-Cancer Detection (AUC) | Rare Cancer Detection (AUC) | Biomarker Prediction | Multimodal Capabilities |
|---|---|---|---|---|
| Virchow | 0.950 overall [2] | 0.937 [2] | Competitive performance [2] | Limited (vision-only) |
| CONCH | Not explicitly reported | Excels in rare disease identification [68] | State-of-the-art in multiple tasks [5] | Strong (image-text retrieval, captioning) |
| UNI | Not explicitly reported | Not explicitly reported | Competitive across 34 tasks [68] | Limited (vision-only) |
Virchow demonstrates exceptional performance in pan-cancer detection, achieving an area under the curve (AUC) of 0.950 across nine common and seven rare cancers, with particularly strong performance on rare cancers (AUC=0.937) [2]. The model's scalability, trained on 1.5 million whole slide images, enables robust generalization across diverse cancer types and institutional settings. Comparative studies show Virchow outperforms other vision-only models including UNI, Phikon, and CTransPath embeddings across most cancer types, with statistically significant improvements (P<0.0001) [2].
CONCH represents a breakthrough in multimodal understanding for computational pathology, demonstrating state-of-the-art performance across 14 diverse benchmarks including image classification, segmentation, captioning, and cross-modal retrieval [5]. Its vision-language pretraining on 1.17 million image-text pairs enables unique capabilities for text-guided morphological search and pathology report generation, facilitating intuitive model interaction for pathologists [5] [68].
UNI serves as a general-purpose foundation model for computational pathology, demonstrating strong performance across 34 diagnostic tasks including cancer classification and organ transplant assessment [68]. The model effectively captures histopathological patterns from over 100 million tissue patches, providing versatile feature representations adaptable to various downstream applications with minimal fine-tuning.
The Virchow pan-cancer detection pipeline begins with whole slide image preprocessing and tessellation into non-overlapping 256×256 pixel tiles at 20× magnification [2]. Each tile undergoes feature extraction through the Virchow vision transformer encoder, generating embeddings that capture morphological patterns at the cellular and tissue levels. These tile-level embeddings are aggregated using attention-based multiple instance learning to form slide-level representations, which are subsequently processed by a pan-cancer detection head for specimen-level cancer prediction [2]. The model was evaluated on a diverse test set comprising slides from Memorial Sloan Kettering Cancer Center and external consultation cases, with performance stratified across nine common and seven rare cancer types to assess generalizability [2].
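The attention-based aggregation step described above can be sketched as follows, in the spirit of ABMIL rather than Virchow's exact implementation; the embedding dimension, hidden width, and class count are illustrative assumptions.

```python
# Sketch of attention-based MIL aggregation: tile embeddings from a frozen
# encoder are weighted by learned attention scores and pooled into a single
# slide-level representation before classification.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, embed_dim: int = 1024, hidden_dim: int = 256, n_classes: int = 2):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        # tiles: (n_tiles, embed_dim) embeddings of one slide
        scores = self.attention(tiles)              # (n_tiles, 1)
        weights = torch.softmax(scores, dim=0)      # attention over tiles
        slide_repr = (weights * tiles).sum(dim=0)   # (embed_dim,)
        return self.classifier(slide_repr)          # slide-level logits

logits = AttentionMIL()(torch.randn(1200, 1024))    # e.g. 1,200 tiles per slide
```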
CONCH employs contrastive learning to align visual and textual representations in a shared embedding space [5]. The training process utilizes 1.17 million histopathology image-text pairs, with the image encoder (Vision Transformer) and text encoder (Transformer) trained to maximize the similarity between corresponding image-text pairs while minimizing similarity between non-corresponding pairs [5]. This approach enables cross-modal retrieval capabilities, allowing natural language queries to search visual morphological patterns and vice versa. For downstream tasks, CONCH supports both unimodal and multimodal applications, with task-specific heads fine-tuned on labeled datasets for classification, segmentation, and report generation [5].
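The symmetric image-text objective can be sketched with a generic CLIP/CoCa-style contrastive loss; the encoders, batch size, and temperature below are stand-ins for illustration, not CONCH's actual training code.

```python
# Schematic of a symmetric image-text contrastive objective: matching
# image-caption pairs are pulled together in a shared embedding space while
# non-matching pairs are pushed apart.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # img_emb, txt_emb: (batch, dim) embeddings from the two encoders
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature     # (batch, batch) similarities
    targets = torch.arange(logits.size(0))           # diagonal = true pairs
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

loss = contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```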
Recent research demonstrates how foundation models can predict genetic alterations directly from histopathology images [42]. The protocol involves extracting features from whole slide images using a pretrained foundation model (Prov-GigaPath), followed by aggregation of tile-level features into slide-level representations using attention mechanisms [42]. These representations are then used to train a gradient boosting classifier (XGBoost) for BRAF-V600 mutation prediction in melanoma. This approach achieved state-of-the-art performance with an AUC of 0.824 during cross-validation and 0.772 on an independent test set, demonstrating the potential for reducing reliance on costly molecular assays [42].
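A minimal sketch of this downstream step, assuming slide-level feature vectors have already been extracted and aggregated, is shown below; the synthetic data, embedding dimension, and hyperparameters are illustrative and do not reproduce the cited study.

```python
# Sketch of training a gradient-boosted classifier on slide-level embeddings
# for a binary mutation label (e.g. BRAF-V600); data here is synthetic.
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1536))       # 300 slides x embedding dimension
y = rng.integers(0, 2, size=300)       # binary mutation-status labels

clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.05,
                    eval_metric="auc")
auc_scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {auc_scores.mean():.3f}")
```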
Table 3: Model Selection Guidelines Based on Research Objectives
| Research Objective | Recommended Model | Rationale | Implementation Considerations |
|---|---|---|---|
| Pan-cancer detection | Virchow | Superior AUC (0.950) across common and rare cancers [2] | Requires substantial computational resources for inference |
| Multimodal tasks | CONCH | State-of-the-art in image-text retrieval and captioning [5] | Enables natural language interface for morphological search |
| General-purpose feature extraction | UNI | Proven performance across 34 diverse tasks [68] | Balanced performance for various applications |
| Rare cancer identification | Virchow or CONCH | Virchow: AUC 0.937 on rare cancers [2]; CONCH: excels with limited data [68] | CONCH particularly valuable when text guidance is beneficial |
| Genetic alteration prediction | Specialized models (e.g., Prov-GigaPath) | Domain-specific optimization [42] | May require integration with traditional ML classifiers |
Table 4: Essential Research Reagents for Pathology Foundation Model Implementation
| Resource Category | Specific Examples | Function in Experimental Pipeline |
|---|---|---|
| Whole Slide Images | H&E-stained tissue sections [2] | Primary input data for feature extraction and model training |
| Annotation Software | Digital pathology annotation tools [1] | Region-of-interest marking for training and evaluation |
| Computational Framework | Python, PyTorch/TensorFlow [5] | Model implementation, training, and inference |
| Feature Extraction | Pretrained model weights (Virchow, CONCH, UNI) [2] [5] [68] | Generating embeddings from histopathology images |
| Data Augmentation | Semantic-aware transformation libraries [66] | Enhancing dataset diversity and model robustness |
| Evaluation Metrics | AUC, Dice coefficient, Accuracy [2] [66] | Quantifying model performance across tasks |
The strategic selection of pathology foundation models requires careful consideration of research objectives, data characteristics, and computational constraints. Virchow excels in pan-cancer detection tasks, particularly for rare cancers, leveraging its extensive training on 1.5 million whole slide images. CONCH offers unique multimodal capabilities for vision-language tasks, enabling intuitive interaction and cross-modal retrieval. UNI provides versatile performance across diverse computational pathology applications. As the field evolves, future developments will likely address current limitations in computational efficiency, multimodal integration, and generalization across diverse patient populations. Researchers should consider a phased evaluation approach, beginning with pilot studies comparing multiple models on representative data before committing to large-scale implementation.
The advent of foundation models represents a paradigm shift in computational pathology, enabling robust analysis of whole-slide images (WSIs) for tasks ranging from disease diagnosis to biomarker prediction and treatment response forecasting. These large-scale models, pre-trained on vast datasets using self-supervised learning, generate rich visual representations (embeddings) that can be adapted to diverse downstream tasks with minimal fine-tuning. Among the most prominent architectures are Virchow (and its successor Virchow2), CONCH, and UNI—each trained on distinct datasets with varying methodologies, leading to complementary strengths and performance characteristics. Virchow2, a vision-only model trained on approximately 3.1 million WSIs from Memorial Sloan Kettering Cancer Center using the DINOv2 self-distillation algorithm, has demonstrated exceptional performance in pan-cancer detection and biomarker prediction. CONCH adopts a vision-language framework, pre-trained on 1.17 million histopathology image-caption pairs curated from biomedical literature, enabling superior performance on tasks requiring joint understanding of visual and textual information. UNI, another vision-only model, was trained on over 100 million image patches from Mass General Brigham, utilizing self-supervised learning to capture diverse tissue representations. As no single foundation model consistently outperforms all others across every task or dataset, researchers are increasingly turning to ensemble methods that strategically combine these models to leverage their complementary strengths, achieving unprecedented performance and robustness in histopathology analysis.
Table 1: Technical Specifications of Major Pathology Foundation Models
| Foundation Model | Training Paradigm | Architecture | Training Data Source | Training Data Scale |
|---|---|---|---|---|
| Virchow2 | Vision-only self-distillation | ViT-H | Memorial Sloan Kettering Cancer Center | ~3.1 million WSIs |
| CONCH | Vision-language contrastive learning | ViT-B | Mixed (PubMed & other sources) | 1.17 million image-caption pairs |
| UNI | Vision-only self-distillation | ViT-L | Mass General Brigham | >100 million patches from 100,000+ WSIs |
| Prov-GigaPath | Vision-only self-distillation | ViT-G | Providence health system | 1.3B patches from 171,000 WSIs |
Independent benchmarking studies reveal the distinctive performance profiles of leading foundation models. In a comprehensive evaluation spanning 31 clinically relevant tasks—including morphology assessment (5 tasks), biomarker prediction (19 tasks), and prognostication (7 tasks)—CONCH and Virchow2 demonstrated the highest overall performance with an average AUROC of 0.71, followed closely by Prov-GigaPath and DinoSSLPath at 0.69 [6]. For morphology-related tasks specifically, CONCH achieved the highest mean AUROC of 0.77, with Virchow2 and DinoSSLPath close behind at 0.76 [6]. Across biomarker prediction tasks, Virchow2 and CONCH jointly led with mean AUROCs of 0.73 [6]. In prognostic tasks, CONCH again achieved the highest performance with a mean AUROC of 0.63 [6].
Table 2: Model Performance Across Key Pathology Tasks (AUROC)
| Foundation Model | Morphology Tasks (Mean) | Biomarker Prediction (Mean) | Prognosis (Mean) | Overall Average |
|---|---|---|---|---|
| CONCH | 0.77 | 0.73 | 0.63 | 0.71 |
| Virchow2 | 0.76 | 0.73 | 0.61 | 0.71 |
| Prov-GigaPath | 0.69 | 0.72 | 0.61 | 0.69 |
| DinoSSLPath | 0.76 | 0.69 | 0.61 | 0.69 |
| UNI | 0.68 | 0.68 | 0.61 | 0.68 |
Notably, the relative performance of foundation models varies significantly across different tissue types and clinical tasks. For instance, CONCH achieved the highest average AUROC in stomach adenocarcinoma (STAD) and non-small-cell lung cancer (NSCLC), while Virchow2 led in colorectal cancer (CRC) and BiomedCLIP performed best in breast cancer (BRCA) [6]. This task-specific performance variation underscores the fundamental rationale for employing ensemble methods rather than relying on any single model.
Ensemble methods in computational pathology operate on the principle that different foundation models, trained on disparate datasets with varied methodologies, capture complementary features and patterns in histopathology images. Vision-language models like CONCH excel at connecting visual patterns with semantic concepts, while vision-only models like Virchow2 and UNI develop robust visual representations through self-supervision on massive image collections. When combined, these models produce more comprehensive and generalizable representations than any single model could achieve independently.
The practical benefits of ensemble approaches include:
Enhanced Performance: Ensembles consistently outperform individual models across diverse tasks. The ELF (Ensemble Learning of Foundation models) framework, which integrates five foundation models including GigaPath, CONCH, Virchow2, H-Optimus-0, and UNI, achieved superior performance compared to any constituent model alone across disease classification, biomarker detection, and treatment response prediction tasks [69].
Improved Robustness: By combining models with different architectural biases and training data sources, ensembles demonstrate greater resilience to site-specific variations in staining protocols, scanning equipment, and tissue preparation methods.
Uncertainty Quantification: Advanced ensemble frameworks like PICTURE (Pathology Image Characterization Tool with Uncertainty-aware Rapid Evaluations) employ Bayesian inference, deep ensembles, and normalizing flows to quantify predictive uncertainty, enabling identification of atypical pathology manifestations not encountered during training [70].
Data Efficiency: Slide-level ensemble representations are particularly advantageous in clinical contexts where data are limited, as they require fewer samples for effective fine-tuning compared to tile-level approaches [69].
Several sophisticated implementations demonstrate these ensemble strategies in practice:
The ELF framework integrates five foundation models (GigaPath, CONCH, Virchow2, H-Optimus-0, and UNI) through a two-stage process involving unsupervised contrastive learning for feature alignment and weakly supervised learning for cancer detection and organ classification. This approach generates unified slide-level representations that leverage the complementary strengths of each constituent model [69].
The PICTURE system employs uncertainty-aware ensembles specifically designed for differentiating histological mimics such as glioblastoma and primary central nervous system lymphoma. This system combines predictions from multiple foundation models using Bayesian inference and normalizing flows to quantify epistemic uncertainty, enabling identification of out-of-distribution samples and rare cancer types not represented in the training data [70].
Simple yet effective model averaging approaches have also demonstrated significant value. Benchmark studies show that ensembles combining CONCH and Virchow2 predictions outperform individual models in 55% of tasks, effectively leveraging their complementary strengths across different classification scenarios [6].
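A hedged sketch of such prediction-level averaging, assuming per-slide positive-class probabilities are already available from two independently trained downstream heads, might look as follows.

```python
# Minimal sketch of prediction-level model averaging: slide-level probabilities
# from two downstream heads (e.g. one trained on CONCH features, one on
# Virchow2 features) are averaged before thresholding.
import numpy as np

def average_ensemble(prob_model_a: np.ndarray, prob_model_b: np.ndarray,
                     weight_a: float = 0.5) -> np.ndarray:
    """Weighted average of per-slide positive-class probabilities."""
    return weight_a * prob_model_a + (1.0 - weight_a) * prob_model_b

p_conch   = np.array([0.82, 0.10, 0.55, 0.90])   # hypothetical probabilities
p_virchow = np.array([0.75, 0.20, 0.70, 0.95])
p_ensemble = average_ensemble(p_conch, p_virchow)
predictions = (p_ensemble >= 0.5).astype(int)
```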
Data Curation and Preprocessing: The ELF framework was pretrained on 53,699 WSIs spanning 20 anatomical sites, utilizing a mosaic patching approach that segments WSIs into distinct regions based on color composition using k-means clustering [69] [71]. This method preserves spatial diversity while reducing computational complexity by selecting representative patches from each color-segmented region.
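The colour-based patch grouping can be approximated with the sketch below; the cluster count, per-cluster sample size, and the use of mean RGB as the colour feature are simplifying assumptions rather than the ELF implementation.

```python
# Simplified sketch of mosaic-style patch selection: patches are grouped by
# mean colour with k-means, and a few representatives are drawn from each
# cluster so the selected subset preserves the slide's visual diversity.
import numpy as np
from sklearn.cluster import KMeans

def select_representative_patches(patches: np.ndarray, n_clusters: int = 9,
                                  per_cluster: int = 5, seed: int = 0) -> np.ndarray:
    # patches: (n_patches, H, W, 3) uint8 RGB tiles from one slide
    mean_rgb = patches.reshape(len(patches), -1, 3).mean(axis=1)   # (n_patches, 3)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(mean_rgb)
    rng = np.random.default_rng(seed)
    keep = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        keep.extend(rng.choice(idx, size=min(per_cluster, len(idx)), replace=False))
    return np.array(sorted(keep))    # indices of the selected patches

selected = select_representative_patches(
    np.random.randint(0, 255, size=(1000, 224, 224, 3), dtype=np.uint8))
```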
Multi-Model Embedding Generation: Each foundation model processes image patches through its specific preprocessing pipeline and encoder architecture. For Virchow2 and UNI, this involves Vision Transformer (ViT) architectures processing patches at 224×224 or 256×256 pixel resolution. CONCH requires joint processing of image patches and corresponding textual descriptions [72].
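A simplified preprocessing sketch is shown below; the input sizes and ImageNet-style normalisation constants are assumptions for illustration, since each released model ships its own preprocessing configuration.

```python
# Sketch of per-model patch preprocessing: tiles are resized to the encoder's
# expected input size and normalised before embedding extraction.
from torchvision import transforms

def make_preprocess(input_size: int) -> transforms.Compose:
    return transforms.Compose([
        transforms.Resize((input_size, input_size)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],   # assumed constants
                             std=[0.229, 0.224, 0.225]),
    ])

# One pipeline per foundation model (sizes are assumptions, not official specs)
preprocess = {"virchow2": make_preprocess(224),
              "uni": make_preprocess(224),
              "conch": make_preprocess(256)}
```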
Ensemble Training Protocols: The ELF framework employs unsupervised contrastive learning for feature alignment across models, followed by weakly supervised learning using slide-level labels for cancer detection and organ classification. This two-stage approach ensures that the unified representation maintains the distinctive strengths of each foundation model while enabling seamless integration [69].
Uncertainty Quantification: The PICTURE system implements three complementary uncertainty methods: Bayesian inference on prototypical pathology images, deep ensembles that weight predictions based on cross-model certainty, and normalizing flow-based out-of-distribution detection to identify atypical pathology manifestations [70].
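As a generic illustration of the deep-ensemble component (not the PICTURE implementation), the predictive entropy of the averaged ensemble output can serve as an uncertainty score for flagging atypical cases.

```python
# Generic sketch of deep-ensemble uncertainty: predictive entropy of the mean
# ensemble prediction is used as a confidence signal, with the most uncertain
# cases flagged for pathologist review.
import numpy as np

def predictive_entropy(member_probs: np.ndarray) -> np.ndarray:
    # member_probs: (n_members, n_cases, n_classes) softmax outputs
    mean_probs = member_probs.mean(axis=0)                     # (n_cases, n_classes)
    return -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=1)

probs = np.random.dirichlet([1, 1], size=(5, 100))             # 5 members, 100 cases
entropy = predictive_entropy(probs)
flag_for_review = entropy > np.quantile(entropy, 0.9)          # top-10% most uncertain
```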
Validation Methodologies: Rigorous multi-cohort validation is essential for assessing ensemble generalizability. Studies typically employ external validation cohorts from diverse geographic locations and healthcare systems, evaluating performance using metrics including AUROC, balanced accuracy, F1 scores, and statistical significance testing across multiple independent patient cohorts [6] [70].
Table 3: Essential Research Reagents and Computational Resources for Ensemble Implementation
| Resource Category | Specific Tool/Platform | Function/Purpose | Implementation Notes |
|---|---|---|---|
| Foundation Models | Virchow2, CONCH, UNI, Prov-GigaPath | Feature extraction from histopathology patches | Access via published model weights; CONCH requires text input processing |
| Ensemble Frameworks | ELF, PICTURE | Reference implementations for ensemble methodologies | Adapt architecture to specific research needs |
| Search & Retrieval Infrastructure | Yottixel | Whole-slide image search and retrieval framework | Supports patch-based embeddings from multiple foundation models |
| Uncertainty Quantification | Bayesian inference, Deep ensembles, Normalizing flows | Model confidence assessment and out-of-distribution detection | Critical for clinical deployment and handling rare cancer types |
| Validation Datasets | TCGA, BACH, BreakHis, BRACS, EBRAINS | Benchmarking ensemble performance across diverse tasks | Ensure datasets not used in foundation model pre-training to avoid contamination |
| Computational Resources | High-memory GPUs (≥ 24GB VRAM) | Handling large whole-slide images and multiple model inference | Parallel processing essential for practical workflow implementation |
Ensemble methods consistently demonstrate superior performance compared to individual foundation models across diverse clinical applications. The ELF framework achieved a balanced accuracy of 0.961 (95% CI: 0.941-0.979) on 2-class skin cancer subtyping, outperforming individual models and other ensemble approaches [69]. On the challenging BRACS dataset for breast cancer subtyping (7 classes), ELF achieved a balanced accuracy of 0.457 (95% CI: 0.359-0.566), representing a 16.3% relative improvement over the next best model TITAN [69].
For the critical clinical task of distinguishing glioblastoma from primary central nervous system lymphoma (PCNSL), the uncertainty-aware PICTURE ensemble achieved an exceptional AUROC of 0.989, maintaining robust performance across five independent validation cohorts (AUROCs 0.924-0.996) [70]. This performance substantially exceeded individual foundation models including Virchow2 (AUROC ~0.98) and CONCH (AUROC ~0.97) in the same task [70].
In whole-slide image retrieval tasks, ensemble approaches leveraging multiple foundation models demonstrated significant advantages over single-model implementations. Yottixel-UNI achieved a top-5 retrieval F1 score of 42% ± 14% across 23 organs and 117 cancer subtypes, outperforming single-model implementations [71]. Organ-specific performance variations highlight the importance of ensemble approaches, with kidneys achieving F1 scores of 82% while more heterogeneous tissues like lungs presented greater challenges (21% F1 score) [71].
Ensemble methods offer particular advantages in clinical settings where robustness and generalizability are paramount. The data efficiency of slide-level ensemble representations enables effective application in scenarios with limited training data, such as prediction of response to specific therapeutic regimens in precision oncology [69]. By combining models trained on diverse datasets from different healthcare systems, ensembles demonstrate reduced site-specific bias and improved generalization across varied staining protocols, scanning equipment, and tissue preparation methods [6] [72].
Uncertainty quantification capabilities, as implemented in the PICTURE system, provide clinically crucial functionality by identifying rare cancer types and atypical manifestations not represented in training data, enabling appropriate caution in automated diagnosis and flagging cases requiring specialist review [70]. This capability is particularly valuable for rare cancers and histological mimics where diagnostic accuracy has significant implications for treatment selection.
Ensemble methods represent a paradigm shift in computational pathology, transforming the competitive landscape of individual foundation models into a collaborative framework that leverages complementary strengths. By strategically combining vision-language models like CONCH with vision-only architectures like Virchow2 and UNI, researchers can achieve unprecedented performance across diverse tasks including disease classification, biomarker prediction, treatment response forecasting, and rare cancer detection.
The experimental evidence consistently demonstrates that ensembles outperform individual models, with frameworks like ELF and PICTURE providing robust methodologies for integration and uncertainty-aware decision making. As the field advances, future developments will likely focus on dynamic ensemble selection optimized for specific clinical tasks, integration of multimodal data beyond histopathology images, and development of more efficient fusion methodologies that maximize performance while minimizing computational complexity.
For research and drug development professionals, ensemble methods offer a powerful approach to leverage the rapidly evolving ecosystem of pathology foundation models. By implementing the protocols and architectures outlined in this technical guide, researchers can accelerate the development of robust, clinically applicable AI tools that advance precision oncology and personalized cancer care.
The field of computational pathology is undergoing a revolutionary transformation with the advent of foundation models trained using self-supervised learning (SSL) on massive datasets of histopathology images. These models learn meaningful representations directly from histology tissue without extensive manual annotation, enabling them to capture morphologic patterns crucial for clinical pathology tasks [6]. Unlike earlier approaches that relied on models pretrained on natural images, pathology-specific foundation models demonstrate superior performance by learning domain-relevant features from unlabeled whole-slide images (WSIs) [73]. This paradigm shift addresses the fundamental challenge of analyzing gigapixel-resolution WSIs, which can contain millions of cells and require specialized processing pipelines. The application of these models to clinically relevant tasks represents a significant advancement in extracting prognostic and predictive information from routine hematoxylin and eosin (H&E)-stained slides that are ubiquitously available for nearly every cancer patient [74].
Within this landscape, several prominent foundation models have emerged, including Virchow/Virchow2, CONCH, and UNI, each with distinct architectural approaches and training methodologies. Virchow2 exemplifies the vision-only approach, utilizing a ViT-huge architecture trained on an unprecedented scale of 3.1 million WSIs using DINOv2 self-supervised learning [73]. In contrast, CONCH (CONtrastive learning from Captions for Histopathology) represents a multimodal vision-language model trained on 1.17 million image-caption pairs, enabling joint understanding of histology images and textual information [6] [5]. UNI follows a vision-only approach with a ViT-large architecture trained on 100 million tiles from 20 major tissue types [73]. These models, along with others such as Prov-GigaPath and Phikon, form the cutting edge of a rapidly advancing field with profound implications for biomarker discovery, prognostic prediction, and ultimately, clinical decision-making in oncology.
A comprehensive benchmarking framework was established to evaluate foundation models across a diverse spectrum of clinically relevant tasks using real-world data from multiple medical centers. The evaluation encompassed 6,818 patients and 9,528 slides across lung, colorectal, gastric, and breast cancers, ensuring robust assessment of model generalizability [6]. This multi-institutional approach mitigates the risk of data leakage that can occur when models are tested on narrow benchmarks, providing a more realistic assessment of performance in clinical scenarios.
The benchmarking included 31 weakly supervised downstream prediction tasks categorized into three critical domains: morphology assessment (5 tasks), biomarker prediction (19 tasks), and prognostication (7 tasks) [6].
This task diversity ensures comprehensive evaluation of each model's capability to extract biologically and clinically meaningful information from histology images across different cancer types and clinical endpoints.
The gigapixel resolution of WSIs necessitates specialized computational approaches, as loading entire slides into GPU memory is infeasible. The standard analytical pipeline involves tessellating WSIs into small, non-overlapping patches (typically 256×256 or 512×512 pixels at 20× magnification), encoding each patch using a foundation model, and aggregating these patch-level embeddings into slide-level representations using multiple instance learning (MIL) [6] [74]. Within this framework, transformer-based aggregation slightly outperformed the widely used attention-based multiple instance learning (ABMIL) approach, with an average AUROC difference of 0.01 across all tasks [6].
Table 1: Key Foundation Models Included in Benchmarking
| Model Name | Architecture | Training Data Scale | Training Methodology | Modality |
|---|---|---|---|---|
| CONCH | Vision-Language | 1.17M image-caption pairs | Contrastive learning | Multimodal |
| Virchow2 | ViT-huge | 3.1M WSIs | DINOv2 SSL | Vision-only |
| UNI | ViT-large | 100M tiles from 100K slides | DINO SSL | Vision-only |
| Prov-GigaPath | Transformer | 1.3B tiles from 171K WSIs | DINOv2 + Masked Autoencoder | Vision-only |
| Phikon | ViT-base | 460M tiles from 100K slides | DINOv2 SSL | Vision-only |
| CTransPath | Swin Transformer + CNN | 15.6M tiles from 32K slides | MoCo v3 SSL | Vision-only |
The comprehensive evaluation across all 31 tasks revealed distinct performance patterns among the benchmarked foundation models. CONCH and Virchow2 demonstrated the highest overall performance, both achieving a mean AUROC of 0.71 when averaged across all tasks [6]. This represents a significant advancement over traditional approaches, with the top-performing models consistently exceeding baseline performance across diverse clinical endpoints.
Table 2: Model Performance by Clinical Domain (Mean AUROC)
| Model | Morphology (5 tasks) | Biomarkers (19 tasks) | Prognosis (7 tasks) | Overall (31 tasks) |
|---|---|---|---|---|
| CONCH | 0.77 | 0.73 | 0.63 | 0.71 |
| Virchow2 | 0.76 | 0.73 | 0.61 | 0.71 |
| Prov-GigaPath | 0.74 | 0.72 | 0.60 | 0.69 |
| DinoSSLPath | 0.76 | 0.68 | 0.61 | 0.69 |
| UNI | 0.73 | 0.69 | 0.60 | 0.68 |
| CTransPath | 0.72 | 0.68 | 0.59 | 0.67 |
The performance hierarchy remained consistent across different evaluation metrics, with CONCH also achieving the highest average area under the precision-recall curve (AUPRC), balanced accuracy, and F1 scores [6]. When examined by cancer type, CONCH achieved the highest average AUROC in stomach adenocarcinoma (STAD) and non-small cell lung cancer (NSCLC), Virchow2 led in colorectal cancer (CRC), and BiomedCLIP performed best in breast cancer (BRCA) [6].
Statistical comparisons across 29 binary classification tasks provided deeper insights into performance differences between models. CONCH significantly outperformed other models in numerous tasks: it achieved higher AUROCs compared to PLIP in 16 tasks, Phikon and BiomedCLIP in 13 tasks each, and Kaiko in 11 tasks [6]. Conversely, few models outperformed CONCH: Virchow2 achieved higher AUROCs in 6 tasks, Prov-GigaPath in 3 tasks, while Panakeia and Kaiko each outperformed CONCH in 2 tasks [6].
Among vision-only models, Virchow2 was significantly better than all other models in 6 to 12 tasks, establishing its dominance within this category [6]. These statistical comparisons highlight the nuanced performance landscape where different models excel in specific tasks, suggesting complementary strengths that could be leveraged through ensemble approaches.
The relationship between pretraining dataset characteristics and downstream performance revealed critical insights for future model development. While positive correlations (r = 0.29–0.74) were observed between downstream performance and pretraining dataset size (WSIs, patients) or diversity (tissue sites) across morphology, biomarker, and prognosis tasks, most correlations were not statistically significant [6]. Significant correlations were found only for morphology tasks with patient count (r = 0.73, P < 0.05) and tissue site diversity (r = 0.74, P < 0.05) [6].
These findings suggest that data diversity may outweigh sheer volume for foundation model pretraining. This is particularly evident when comparing vision-language models: CONCH outperformed BiomedCLIP despite being trained on far fewer image-caption pairs (1.1 million versus 15 million) [6]. Similarly, tissue representation in pretraining datasets showed moderate but non-significant correlation with performance by cancer type, indicating that architecture and dataset quality play equally critical roles alongside data scale [6].
A key promise of foundation models in computational pathology is their potential to reduce reliance on large labeled datasets, particularly important for rare molecular events or conditions with limited tissue availability. To evaluate this capability, downstream models were trained on randomly sampled cohorts of 300, 150, and 75 patients while maintaining similar ratios of positive samples, with validation performed on full-size external cohorts [6].
In the largest sampled cohort (n = 300), Virchow2 demonstrated superior performance in 8 tasks, followed closely by PRISM with 7 tasks [6]. With the medium-sized cohort (n = 150), PRISM dominated by leading in 9 tasks, while Virchow2 followed with 6 tasks [6]. The smallest cohort size (n = 75) showed more balanced results, with CONCH leading in 5 tasks, while PRISM and Virchow2 each led in 4 tasks [6]. Notably, performance metrics remained relatively stable between n = 75 and n = 150 cohorts, indicating robust performance even with substantial reductions in training data [6].
For clinically relevant tasks with rare positive cases (prevalence below 15%), foundation models demonstrated particular utility in predicting low-prevalence biomarkers including BRAF mutation (10%), CpG island methylator phenotype (CIMP) status (13%), and others that pose challenges for traditional supervised learning approaches [6].
The benchmarking revealed that different foundation models trained on distinct cohorts learn complementary features to predict the same labels, creating opportunities for performance improvement through ensemble methods. An ensemble combining CONCH and Virchow2 predictions outperformed individual models in 55% of tasks, effectively leveraging their complementary strengths across classification scenarios [6].
This ensemble approach represents the current state-of-the-art, demonstrating that neither vision-only nor vision-language models universally dominate all tasks. Instead, their synergistic combination yields more robust performance across diverse clinical applications. The fusion of models with different architectural approaches and training methodologies provides a pathway to exceed the performance ceiling of individual foundation models.
Beyond the patch-level foundation models included in the primary benchmarking, emerging approaches aim to learn slide-level representations directly. TITAN (Transformer-based pathology Image and Text Alignment Network) represents this new class of multimodal whole-slide foundation models, pretrained on 335,645 whole-slide images using visual self-supervised learning and vision-language alignment with corresponding pathology reports and synthetic captions [7].
Without any fine-tuning or requiring clinical labels, TITAN can extract general-purpose slide representations and generate pathology reports that generalize to resource-limited clinical scenarios such as rare disease retrieval and cancer prognosis [7]. This approach demonstrates the evolving landscape where foundation models operate at multiple levels of biological organization—from patch-level to whole-slide representations—potentially enabling more seamless translation to clinical workflows.
Table 3: Essential Research Reagents and Computational Tools
| Resource | Type | Specifications | Primary Research Application |
|---|---|---|---|
| CONCH | Foundation Model | Vision-language, 1.17M image-caption pairs | Multimodal tasks, image-text retrieval, captioning |
| Virchow2 | Foundation Model | ViT-huge, 3.1M WSIs, DINOv2 SSL | High-performance vision tasks, biomarker prediction |
| UNI | Foundation Model | ViT-large, 100M tiles, DINO SSL | Slide classification, transfer learning |
| Multiple Instance Learning | Algorithmic Framework | Transformer-based aggregation | Whole-slide classification from patch embeddings |
| ABMIL | Algorithmic Framework | Attention-based mechanism | Alternative slide-level aggregation |
| Self-Supervised Learning | Training Methodology | DINOv2, iBOT, masked image modeling | Pre-training without extensive labels |
| Whole-Slide Images | Data | Gigapixel resolution, H&E stained | Primary input data for analysis |
| Tile Embeddings | Data Representation | 768-dimensional features | Compact representation of tissue patches |
The comprehensive benchmarking across 31 clinical tasks establishes CONCH and Virchow2 as the leading foundation models in computational pathology, with each demonstrating distinct strengths across morphology, biomarker, and prognostic tasks. The superior performance of CONCH, particularly in morphology-related tasks, highlights the value of multimodal training that integrates visual and linguistic information, more closely mirroring how pathologists reason about histologic entities. Meanwhile, Virchow2's strong performance across biomarker tasks demonstrates the continued power of vision-only approaches trained at unprecedented scale.
Critical findings from this benchmarking include the demonstration that data diversity may outweigh sheer volume in foundation model pretraining, that models trained on distinct cohorts learn complementary features enabling performance gains through ensemble methods, and that foundation models maintain robust performance even in low-data scenarios highly relevant to clinical practice. These insights provide valuable guidance for the development of next-generation foundation models in computational pathology.
The convergence of increasingly sophisticated foundation models with emerging whole-slide representation learning approaches like TITAN points toward a future where computational pathology systems can operate across multiple biological scales—from cellular features to tissue architecture and whole-slide patterns. As these models continue to evolve, rigorous benchmarking on diverse, clinically relevant tasks remains essential to translate their potential into tangible improvements in cancer diagnosis, treatment selection, and patient outcomes.
This technical guide provides an in-depth examination of three critical performance metrics—Area Under the Receiver Operating Characteristic Curve (AUROC), Area Under the Precision-Recall Curve (AUPRC), and Balanced Accuracy—within the context of computational pathology and foundation model evaluation. As artificial intelligence transforms histopathology research through models like Virchow, CONCH, and TITAN, selecting appropriate evaluation metrics becomes paramount for assessing true clinical utility. This whitepaper explores the mathematical foundations, interpretation guidelines, and practical considerations for each metric, with special emphasis on their application in imbalanced datasets common to medical diagnostics. Through structured comparisons, experimental protocols, and visual workflows, we provide pathology researchers and drug development professionals with a comprehensive framework for rigorous model evaluation aligned with real-world clinical requirements.
The emergence of whole-slide imaging and foundation models in computational pathology has created an urgent need for standardized, clinically relevant evaluation frameworks. Models like Virchow (trained on 1.5 million hematoxylin and eosin stained whole-slide images) [4], CONCH (a vision-language foundation model pretrained on 1.17 million image-caption pairs) [5], and TITAN (a multimodal whole-slide foundation model utilizing 335,645 whole-slide images) [7] have demonstrated remarkable capabilities in cancer detection, biomarker prediction, and slide representation learning. However, their true clinical value depends on appropriate performance assessment using metrics that reflect operational realities.
In healthcare applications, classification errors carry asymmetric consequences. False positives in cancer detection may lead to unnecessary invasive procedures, patient anxiety, and increased healthcare costs, while false negatives can result in delayed treatment and progression of disease [75]. These tradeoffs become particularly critical in imbalanced datasets where the condition of interest is rare—a common scenario in medical diagnostics where diseases may affect only 1-10% of the population [76]. Traditional metrics like accuracy can be profoundly misleading in these contexts, necessitating more nuanced approaches to model evaluation.
Balanced Accuracy addresses the limitations of standard accuracy in imbalanced datasets by computing the average of sensitivity and specificity. This prevents the metric from being dominated by the majority class and provides a more realistic assessment of model performance across both classes [77]. The formula is given as:
$$\text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$
where Sensitivity = TP/(TP+FN) and Specificity = TN/(TN+FP).
AUROC (Area Under the Receiver Operating Characteristic Curve) represents the model's ability to distinguish between positive and negative classes across all possible classification thresholds [78]. The ROC curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1-Specificity) at various threshold settings. The area under this curve provides a single scalar value representing the probability that a randomly chosen positive instance ranks higher than a randomly chosen negative instance [79].
AUPRC (Area Under the Precision-Recall Curve) focuses specifically on the model's performance regarding the positive class, making it particularly valuable for imbalanced datasets [76]. The PR curve plots Precision (Positive Predictive Value) against Recall (Sensitivity) at different thresholds. Unlike AUROC, AUPRC does not consider true negatives in its calculation, which reduces the masking effect of the majority class in imbalanced scenarios [75].
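All three metrics can be computed with standard scikit-learn functions; the toy example below uses uninformative random scores on a 1%-prevalence problem to illustrate that the AUPRC baseline tracks prevalence while the AUROC baseline stays near 0.5.

```python
# Sketch of computing AUROC, AUPRC, and balanced accuracy on a class-imbalanced
# toy problem with uninformative (random) scores.
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             balanced_accuracy_score)

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)   # ~1% prevalence
y_score = rng.random(10_000)                       # uninformative scores
y_pred = (y_score >= 0.5).astype(int)

print("AUROC:            ", round(roc_auc_score(y_true, y_score), 3))           # ~0.5
print("AUPRC:            ", round(average_precision_score(y_true, y_score), 3)) # ~0.01
print("Balanced accuracy:", round(balanced_accuracy_score(y_true, y_pred), 3))  # ~0.5
```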
Table 1: Performance Metric Characteristics and Applications
| Metric | Mathematical Focus | Range | Random Baseline | Strength | Weakness |
|---|---|---|---|---|---|
| Balanced Accuracy | (Sensitivity + Specificity)/2 | 0-1 | 0.5 | Intuitive; handles class imbalance better than standard accuracy | Does not account for model confidence or ranking ability |
| AUROC | Area under TPR vs FPR curve | 0-1 | 0.5 | Threshold-independent; measures ranking capability | Over-optimistic for imbalanced data; emphasizes true negatives |
| AUPRC | Area under Precision vs Recall curve | 0-1 | Positive class prevalence | Focuses on positive class; better for imbalanced data | Difficult to compare across datasets with different prevalence |
Table 2: Metric Interpretation Guidelines in Clinical Contexts
| Metric Value | AUROC Interpretation | AUPRC Interpretation | Clinical Consideration |
|---|---|---|---|
| 0.9-1.0 | Excellent discrimination | Outstanding performance | Model likely clinically useful |
| 0.8-0.9 | Good discrimination | Strong performance | Potentially clinically useful |
| 0.7-0.8 | Fair discrimination | Moderate performance | May require improvement |
| 0.6-0.7 | Poor discrimination | Weak performance | Limited clinical utility |
| 0.5-0.6 | Failing discrimination | Very poor performance | No better than random |
In critical care settings and rare disease diagnosis, events of interest such as mortality, clinical deterioration, and acute kidney injury are inherently imbalanced, often affecting less than 10-20% of patients [76]. Similarly, in computational pathology, rare cancers and specific molecular subtypes may represent a small fraction of cases. In these situations, AUROC may provide deceptively favorable assessments because the true negative rate (specificity) remains high even when the model performs poorly on positive cases due to the abundance of negative examples [76].
AUPRC offers a more clinically relevant perspective for imbalanced problems by focusing on reliable identification of rare events [76] [75]. For example, in a cancer detection task with 1% prevalence, a random classifier would achieve an AUROC of 0.5 but an AUPRC of 0.01, providing a more realistic baseline for model performance assessment [78]. The clinical interpretation of AUPRC should always be contextualized by the prevalence of the positive class, with the metric value compared to this baseline to determine true model utility.
When deploying models for critical illness detection, two primary goals emerge: minimizing missed positive cases (high sensitivity) and avoiding alert fatigue from false positives (high precision) [76]. The precision-recall curve effectively illustrates these operational priorities by showing what positive predictive value is achievable at different sensitivity levels.
The "Number Needed to Alert" (NNA), defined as 1/PPV (Positive Predictive Value), becomes an intuitive operational metric derived from AUPRC analysis [76]. For example, if a model achieves a PPV of 0.2 at 90% sensitivity, the NNA would be 5, meaning clinicians must respond to 5 alerts to identify one true case. This directly translates to clinical workflow impact and helps determine acceptable operational thresholds.
For histopathology foundation models like Virchow, CONCH, and TITAN, different metrics illuminate distinct aspects of performance:
Virchow: Demonstrated 0.949 AUROC across 17 cancer types and 0.937 AUROC on 7 rare cancers, showing exceptional discriminatory power [4]. However, AUPRC analysis would provide additional insights into its performance on rare cancer types where class imbalance is pronounced.
CONCH: As a vision-language model, evaluation extends beyond classification to retrieval, captioning, and segmentation tasks [5]. While AUROC and AUPRC remain valuable for classification benchmarks, additional metrics are needed for comprehensive assessment.
TITAN: Utilizes both visual and language modalities, requiring multimodal evaluation strategies [7]. Task-specific metric selection becomes essential across its diverse applications.
Table 3: Experimental Protocol for Foundation Model Evaluation
| Step | Procedure | Considerations | Expected Output |
|---|---|---|---|
| Dataset Curation | Collect whole-slide images with confirmed diagnoses; ensure representation of rare classes | Address class imbalance through stratified sampling; maintain separate test sets | Curated dataset with prevalence documentation |
| Patch Embedding Extraction | Process WSIs through foundation model (Virchow, CONCH, TITAN) to generate feature embeddings | Consistent magnification and patch size; handling of variable WSI sizes | Feature matrix for downstream tasks |
| Slide-Level Aggregation | Apply multiple instance learning or attention mechanisms to aggregate patch-level predictions | Choice of aggregation method significantly impacts performance | Slide-level predictions and confidence scores |
| Metric Computation | Calculate AUROC, AUPRC, and balanced accuracy across relevant classification tasks | Report confidence intervals via bootstrapping; compare to prevalence baseline | Comprehensive performance assessment with statistical significance |
| Threshold Optimization | Select operating points based on clinical priorities and cost-benefit tradeoffs | Different thresholds for screening vs. diagnostic settings; consider NNA | Deployable model with specified sensitivity/specificity profile |
A simulation study predicting cerebral edema in pediatric patients with diabetic ketoacidosis illustrates the practical differences between metrics [76]. With a cerebral edema prevalence of 0.7%, three models (logistic regression, random forest, and XGBoost) showed excellent AUROC values (0.874-0.953) but much more modest AUPRC values (0.083-0.116).
Dividing the AUPRC by the outcome frequency (0.007) revealed that the best model (logistic regression with AUPRC=0.116) was 16.6 times more useful than a random model—an insight completely absent from the AUROC analysis [76]. Furthermore, at a sensitivity threshold of 0.85-0.90, the logistic regression and XGBoost models showed 5-10% higher PPV than the random forest model, directly impacting potential clinical utility.
Table 4: Essential Research Tools for Pathology Foundation Model Evaluation
| Tool/Category | Specific Examples | Function | Implementation Considerations |
|---|---|---|---|
| Foundation Models | Virchow, CONCH, TITAN, UNI | Generate feature embeddings from whole-slide images | Model selection based on task requirements; computational resources |
| Evaluation Metrics | AUROC, AUPRC, Balanced Accuracy | Quantify model performance for classification tasks | Interpretation context; dataset characteristics |
| Statistical Packages | scikit-learn, pROC, PRROC | Calculate metrics and confidence intervals | Proper implementation of statistical methods; bootstrapping |
| Visualization Tools | matplotlib, seaborn, Plotly | Generate ROC, PR curves, and performance charts | Clear visualization of tradeoffs; clinical interpretability |
| Explainability Frameworks | SHAP, LIME, attention maps | Interpret model predictions and feature importance | Connection to histopathological features; pathologist validation |
In computational pathology, the choice between AUROC, AUPRC, and balanced accuracy extends beyond technical considerations to clinical operational impact. While AUROC provides valuable insights into a model's overall discriminatory power, AUPRC offers a more realistic assessment for imbalanced datasets common in medical applications. Balanced accuracy serves as an intuitive intermediate metric that mitigates some limitations of standard accuracy.
For foundation models like Virchow, CONCH, and TITAN, comprehensive evaluation should include multiple metrics to illuminate different aspects of performance. Researchers should particularly prioritize AUPRC when evaluating models for rare disease detection or other imbalanced classification tasks. By selecting metrics aligned with clinical priorities and operational constraints, pathology AI developers can bridge the gap between technical performance and real-world utility, ultimately accelerating the translation of these powerful technologies into improved patient care.
Foundation models represent a paradigm shift in computational pathology, offering the potential to develop powerful predictive tools without the prohibitive annotation costs typically associated with medical artificial intelligence. These models, pretrained on massive datasets of histopathology images through self-supervised learning, produce versatile feature representations (embeddings) that can be adapted to various downstream clinical tasks. The evaluation of these models in data-scarce environments is particularly crucial for real-world clinical applications, where labeled data for specific tasks—especially rare cancers or molecular biomarkers with low prevalence—is often severely limited. This technical review provides a comprehensive analysis of the performance of three leading pathology foundation models—CONCH, Virchow, and UNI—under constrained data conditions, offering methodological guidance and performance benchmarks for researchers and drug development professionals.
Table 1: Foundation Model Architectures and Pretraining Specifications
| Model | Architecture | Parameters | Pretraining Data | Training Approach |
|---|---|---|---|---|
| CONCH | Vision-Language | Not specified | 1.17M image-caption pairs | Contrastive learning (CoCa) |
| Virchow | Vision Transformer (ViT) | 632 million | 1.5M WSIs from 100k patients | Self-supervised (DINOv2) |
| UNI | Vision Transformer | Not specified | ~100,000 WSIs (>100M patches) | Self-supervised (DINO) |
CONCH distinguishes itself as a multimodal vision-language foundation model trained using contrastive learning from captions for histopathology. Its pretraining on diverse sources of histopathology images, biomedical text, and over 1.17 million image-caption pairs enables exceptional transfer learning capabilities across image classification, segmentation, captioning, and retrieval tasks [5]. Virchow represents a scaling achievement in pathology AI, trained on an unprecedented dataset of 1.5 million H&E-stained whole slide images from Memorial Sloan Kettering Cancer Center. Utilizing the DINOv2 self-supervised algorithm, this 632-million-parameter vision transformer captures diverse pathological patterns across tissue and specimen types [10] [4]. UNI employs a similar self-supervised approach based on DINO but was trained on a substantially smaller dataset of approximately 100,000 whole slide images corresponding to over 100 million patches [28].
In comprehensive benchmarking across 31 clinically relevant tasks—including morphology assessment (5 tasks), biomarker prediction (19 tasks), and prognostication (7 tasks)—CONCH and Virchow demonstrated equivalent top performance with an average AUROC of 0.71, followed closely by Prov-GigaPath and DinoSSLPath (0.69 AUROC) [6]. When examining performance by domain, CONCH achieved the highest mean AUROC for morphology-related tasks (0.77) and prognostic tasks (0.63), while Virchow matched CONCH on biomarker prediction tasks (0.73 AUROC) [6]. These overall benchmarks establish baseline performance before examining the critical low-data scenario performance that often determines real-world clinical utility.
The evaluation of foundation models in low-data settings requires carefully controlled experimental conditions that mimic real-world clinical constraints. The benchmark study assessed model performance using randomly sampled cohorts of 300, 150, and 75 patients while maintaining similar ratios of positive samples, with validation performed on full-size external cohorts to ensure clinical relevance [6]. This approach directly tests the foundational claim that these models can reduce dependency on large annotated datasets for developing specialized diagnostic and prognostic tools.
Table 2: Model Performance Across Different Data Regimes
| Training Cohort Size | Best Performing Model(s) | Key Findings | Performance Stability |
|---|---|---|---|
| 300 patients | Virchow (8 tasks), PRISM (7 tasks) | Virchow demonstrates strongest performance in near-complete data scenarios | Relative stability between 300- and 150-patient cohorts |
| 150 patients | PRISM (9 tasks), Virchow (6 tasks) | PRISM shows advantage in medium-data regime | Minimal performance degradation from 300 to 150 patients |
| 75 patients | CONCH (5 tasks), PRISM (4 tasks), Virchow (4 tasks) | CONCH excels in extreme low-data conditions; more balanced leadership | Notable stability from 150 to 75 patients |
The low-data evaluation revealed crucial performance differentiations not apparent in full-data benchmarks. In the largest sampled cohort (n=300), Virchow demonstrated superior performance in 8 tasks, followed closely by PRISM with 7 leading tasks. With the medium-sized cohort (n=150), PRISM dominated by leading in 9 tasks, while Virchow followed with 6 leading tasks. The smallest cohort size (n=75) showed more balanced results, with CONCH leading in 5 tasks, while PRISM and Virchow each led in 4 tasks [6]. This performance shift demonstrates CONCH's particular advantage in extreme low-data scenarios, despite Virchow's stronger showing in more data-rich conditions.
The benchmarking analysis revealed that foundation models trained on distinct cohorts learn complementary features to predict the same labels [6]. CONCH's vision-language architecture, trained on diverse image-caption pairs rather than just H&E images, appears to provide an advantage in low-data scenarios, potentially due to its richer semantic understanding of histopathological entities [5] [6]. This multimodal foundation may enable more efficient knowledge transfer when fine-tuning data is severely limited. Virchow's performance advantage in higher-data scenarios likely stems from its massive scale—632 million parameters trained on 1.5 million slides—which presumably encodes a more comprehensive representation of histological patterns, though this requires more data to effectively adapt to specific tasks [10].
Research indicates that combining predictions from complementary models can yield performance superior to any single model. An ensemble combining CONCH and Virchow predictions outperformed individual models in 55% of tasks by leveraging their complementary strengths across different classification scenarios [6]. This suggests that rather than seeking a single superior model, researchers may achieve better performance in low-data settings through strategic model combinations that capitalize on different training approaches and architectural advantages.
The referenced large-scale benchmarking study employed a rigorous methodology to evaluate foundation model performance [6]. The evaluation encompassed 19 foundation models across 13 patient cohorts with 6,818 patients and 9,528 slides from lung, colorectal, gastric, and breast cancers. For low-data scenario assessment, researchers implemented stratified sampling to create reduced cohorts of 300, 150, and 75 patients while preserving the original positive-to-negative case ratios. These models were then evaluated on weakly supervised tasks related to biomarkers, morphological properties, and prognostic outcomes using multiple instance learning (MIL) frameworks with transformer-based aggregation [6].
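The stratified subsampling step can be sketched as follows, assuming patient-level binary labels and scikit-learn's stratified splitting; the cohort sizes mirror those reported in the study while the labels here are synthetic.

```python
# Sketch of stratified subsampling to build reduced training cohorts while
# preserving the positive-case ratio; the full external cohort stays held out.
import numpy as np
from sklearn.model_selection import train_test_split

def sample_cohort(patient_ids: np.ndarray, labels: np.ndarray,
                  n_patients: int, seed: int = 0) -> np.ndarray:
    subset_ids, _ = train_test_split(patient_ids, train_size=n_patients,
                                     stratify=labels, random_state=seed)
    return subset_ids

patients = np.arange(1000)
labels = (np.random.default_rng(0).random(1000) < 0.13).astype(int)   # 13% positive
for n in (300, 150, 75):
    cohort = sample_cohort(patients, labels, n)
    print(n, "patients sampled, positives:", labels[cohort].sum())
```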
The standard protocol involves WSI tessellation into non-overlapping patches, followed by feature extraction using the foundation model's pretrained encoder. These patch-level embeddings then serve as inputs to multiple instance learning aggregators such as ABMIL or transformer-based architectures for slide-level prediction [6] [28]. Performance comparisons between transformer-based and ABMIL aggregation showed minimal differences (average AUROC difference of 0.01), indicating that the quality of foundation model embeddings is more critical than the specific aggregation method in low-data scenarios [6].
Table 3: Essential Research Tools for Foundation Model Evaluation
| Research Reagent | Function/Purpose | Implementation Example |
|---|---|---|
| Multiple Instance Learning (MIL) Frameworks | Aggregates patch-level embeddings for slide-level predictions | ABMIL, TransMIL, or CLAM for weakly supervised learning |
| Feature Extraction Pipelines | Converts WSIs into patch embeddings using foundation models | CONCH, Virchow, or UNI encoders for feature extraction |
| Benchmark Datasets | Standardized evaluation of model performance | Camelyon+ (breast cancer metastases), TCGA derivatives |
| Stratified Sampling Protocols | Creates reduced cohorts while preserving case distributions | Maintains positive-to-negative ratios in low-data scenarios |
Evaluation of pathology foundation models in low-data scenarios reveals nuanced performance patterns that critically inform their research and clinical application. While Virchow demonstrates superior performance in near-complete data scenarios, CONCH exhibits particular advantages in extreme low-data conditions, with PRISM showing strength in medium-data regimes. Rather than a single model dominating all scenarios, the research indicates complementary strengths that can be strategically leveraged through ensemble approaches. These findings underscore that data diversity and model architecture play crucial roles in low-data performance, sometimes outweighing the influence of pretraining data volume. For researchers working with limited annotated data, particularly for rare conditions or biomarkers, CONCH provides a compelling option, while Virchow remains a powerful choice when more substantial fine-tuning datasets are available. The strategic combination of these complementary models may offer the most robust approach for real-world clinical applications where data scarcity remains a significant constraint.
The development of large-scale artificial intelligence (AI) models, known as foundation models, is revolutionizing computational pathology. These models, including Virchow, CONCH, and UNI, are pre-trained on massive datasets of histopathology images and are designed to be adapted to a wide range of downstream diagnostic tasks. A critical step in translating these models from research tools to clinical applications is cross-institutional validation—evaluating their performance on data from sources entirely separate from their training sets. This process assesses a model's generalizability, or its ability to maintain accuracy and reliability when faced with the vast heterogeneity of real-world clinical environments. Variations in tissue preparation, staining protocols, scanner types, and patient populations across different medical centers can significantly degrade the performance of AI models that have not been rigorously validated externally. Therefore, cross-institutional validation provides the essential evidence needed to trust that a foundation model will perform as expected in diverse clinical settings, forming a crucial bridge between experimental development and routine patient care.
To objectively assess the generalizability of histopathology foundation models, researchers conduct rigorous benchmarking studies that evaluate their performance on multiple, independent external datasets. The following table summarizes key quantitative results from recent external validation efforts for Virchow, CONCH, and UNI models.
Table 1: External Validation Performance of Histopathology Foundation Models
| Foundation Model | Task | External Dataset(s) | Key Performance Metric(s) | Reported Result |
|---|---|---|---|---|
| Virchow [2] | Pan-cancer detection (9 common & 7 rare cancers) | Multi-institutional consultation slides (external to MSKCC) | Specimen-level AUC (Area Under the ROC Curve) | 0.950 overall AUC; 0.937 AUC on rare cancers [2] |
| UNI [80] | Ovarian cancer subtyping | Transcanadian study dataset & OCEAN challenge dataset | Balanced Accuracy | 97% (Transcanadian), 74% (OCEAN) [80] |
| Virchow [81] | Whole Slide Image (WSI) Retrieval (Zero-shot) | The Cancer Genome Atlas (TCGA) - 23 organs | Macro F1 Score (Top-5 Retrieval) | 40% ± 13% [81] |
| UNI [81] | Whole Slide Image (WSI) Retrieval (Zero-shot) | The Cancer Genome Atlas (TCGA) - 23 organs | Macro F1 Score (Top-5 Retrieval) | 42% ± 14% [81] |
| CONCH [5] | Diverse Benchmarks (Classification, Retrieval, Captioning) | 14 independent benchmarks not used in pre-training | State-of-the-art (SOTA) Performance | Achieved SOTA on multiple image and text tasks, demonstrating broad generalization [5] |
Performance benchmarking reveals distinct strengths. Virchow demonstrates exceptional performance in pan-cancer detection, even on rare cancer types and out-of-distribution data, which is a strong indicator of robust generalization [2]. Both Virchow and UNI show significant improvements over traditional models in tasks like ovarian cancer subtyping and image retrieval, though performance can vary depending on the specific organ and task complexity [80] [81]. CONCH's success across a wide array of tasks (classification, segmentation, captioning, retrieval) underscores the advantage of its vision-language pre-training in creating versatile and generalizable representations [5].
A standardized, methodical approach to experimental design is crucial for producing reliable and comparable results in external validation. The following workflow outlines the key phases of this process.
Diagram 1: External Validation Workflow
The workflow proceeds through four phases (a minimal code illustration of phases 1 and 3 follows this list):
1. Model selection and baselining: the foundation model is selected and a performance baseline is established.
2. External data curation: the integrity of external validation hinges on the quality and independence of the external data.
3. External testing: in this critical phase, the model is tested on the external data.
4. Failure analysis: the final phase involves interpreting the results and understanding the model's failure modes.
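As a minimal sketch of phases 1 and 3, the snippet below trains a lightweight probe on frozen slide-level features from an internal cohort and reports AUROC on both an internal held-out split and an external cohort; the gap between the two is the quantity of interest. The cohort arrays, feature dimension, and split sizes are hypothetical placeholders, not values from the cited studies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical slide-level embeddings (e.g., pooled foundation-model features) and labels.
rng = np.random.default_rng(0)
X_internal, y_internal = rng.normal(size=(400, 768)), rng.integers(0, 2, 400)
X_external, y_external = rng.normal(size=(200, 768)), rng.integers(0, 2, 200)

# Phase 1: establish an internal baseline with a probe on frozen features.
X_tr, X_te, y_tr, y_te = train_test_split(X_internal, y_internal, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc_internal = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])

# Phase 3: apply the unchanged probe to the fully independent external cohort.
auc_external = roc_auc_score(y_external, probe.predict_proba(X_external)[:, 1])
print(f"internal AUROC={auc_internal:.3f}, external AUROC={auc_external:.3f}, "
      f"gap={auc_internal - auc_external:.3f}")
```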
Successful external validation relies on a suite of key resources, from software models to datasets and evaluation tools.
Table 2: Key Reagents for Cross-Institutional Validation
| Category | Reagent / Solution | Function / Purpose | Example(s) from Literature |
|---|---|---|---|
| Foundation Models | Virchow | A 632M-parameter vision transformer for pan-cancer detection and biomarker prediction from H&E slides [2]. | |
| Foundation Models | CONCH (CONtrastive learning from Captions for Histopathology) | A vision-language model for tasks involving images and/or text (classification, retrieval, captioning) [5]. | |
| Foundation Models | UNI | A self-supervised vision encoder (ViT-L) for generating general-purpose slide representations [80] [81]. | |
| Validation Datasets | TCGA (The Cancer Genome Atlas) | A large, multi-institutional public dataset for validating performance across many cancer types [81]. | TCGA-CRC-DX [84] |
| Validation Datasets | OCEAN Challenge Dataset | A heterogeneous dataset used for external validation of ovarian cancer subtyping models [80]. | |
| Software & Frameworks | ABMIL (Attention-Based Multiple Instance Learning) | A neural network architecture for aggregating patch-level features into a slide-level prediction [80]. | |
| Software & Frameworks | Yottixel | A search framework for evaluating WSI retrieval performance using patch-based embeddings [81]. | |
Cross-institutional validation is the cornerstone of building trust in histopathology foundation models. By systematically benchmarking models like Virchow, CONCH, and UNI on diverse external datasets, the research community can objectively assess their generalizability and robustness. The experimental protocols and toolkit outlined in this guide provide a roadmap for conducting these essential evaluations. As the field progresses, overcoming challenges related to data heterogeneity, computational cost, and model interpretability will be paramount. Ultimately, rigorous external validation is the critical step that will unlock the full potential of foundation models, paving the way for their safe, effective, and widespread adoption in clinical practice to improve patient care.
The emergence of foundation models is fundamentally transforming computational pathology. While traditional vision models have paved the way for automated histopathology image analysis, vision-language models (VLMs) introduce unprecedented capabilities by integrating visual understanding with semantic reasoning. This technical guide provides a comprehensive comparison of these approaches, contextualized through the lens of leading pathology foundation models—Virchow, CONCH, and UNI. We examine their architectural distinctions, performance characteristics, and suitability for various research and clinical applications in histopathology. Through structured quantitative comparisons, detailed experimental methodologies, and practical implementation guidelines, this review equips researchers and drug development professionals with the framework necessary to select and deploy optimal models for their specific computational pathology workflows.
Computational pathology has evolved from specialized, task-specific models to general-purpose foundation models capable of addressing diverse diagnostic challenges. Traditional vision models process histopathology images through self-supervised learning on large-scale whole slide image (WSI) datasets, creating versatile visual representations for downstream prediction tasks. In contrast, vision-language models (VLMs) jointly process visual data and textual information, enabling cross-modal understanding, retrieval, and generation [5] [85].
This paradigm shift is particularly significant for histopathology research, where morphological patterns must be correlated with clinical reports, diagnostic criteria, and scientific literature. Models like Virchow [2] [10], CONCH [5], and UNI represent different points on this architectural spectrum, each with distinctive strengths and limitations. Virchow exemplifies a massive-scale vision-only foundation model, while CONCH demonstrates the power of visual-language pretraining on diverse image-caption pairs. Understanding their comparative characteristics is essential for deploying them effectively in drug development and clinical research applications.
Traditional vision models in computational pathology process WSIs through self-supervised learning (SSL) objectives without textual alignment. These models typically employ a two-stage framework: first encoding image patches or regions of interest (ROIs), then aggregating these representations for slide-level predictions [2].
The Virchow model exemplifies this approach, implementing a 632 million parameter Vision Transformer (ViT) trained using the DINOv2 algorithm on approximately 1.5 million H&E-stained WSIs [2] [10]. Its training leverages global and local tissue tiles to learn embeddings that capture cellular morphology and tissue architecture without language supervision. This vision-only paradigm creates general-purpose slide representations transferable to various diagnostic tasks through linear probing or fine-tuning.
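At inference time, this vision-only workflow reduces to extracting tile embeddings with a frozen encoder and caching them for downstream use. The sketch below illustrates the pattern with a generic timm ViT-L/16 as a stand-in; the actual Virchow weights, preprocessing, tile size, and embedding dimension may differ.

```python
import torch
import timm

# Stand-in encoder: a generic ViT-L/16 from timm; real pathology foundation models
# (Virchow, UNI) ship their own pretrained weights and preprocessing pipelines.
encoder = timm.create_model("vit_large_patch16_224", pretrained=False, num_classes=0)
encoder.eval()

# Hypothetical batch of H&E tiles already cropped from a WSI and resized to 224x224.
tiles = torch.rand(32, 3, 224, 224)

with torch.no_grad():
    tile_embeddings = encoder(tiles)   # (32, 1024) frozen feature vectors

# These embeddings are cached to disk and reused for linear probing,
# fine-tuning, or MIL aggregation without re-running the encoder.
print(tile_embeddings.shape)
```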
Vision-language models integrate visual processing with natural language understanding through cross-modal alignment. These models typically comprise three core components: a vision encoder, a language model, and a multimodal fusion mechanism [86] [87].
CONCH (CONtrastive learning from Captions for Histopathology) exemplifies the VLM approach in pathology, employing contrastive learning on 1.17 million histopathology image-caption pairs to create a shared embedding space [5]. This architecture enables bidirectional understanding: images can be retrieved using text queries, and textual descriptions can be generated from visual inputs. The model's pretraining incorporates diverse data sources, including biomedical text and richly annotated image-caption pairs, allowing it to capture fine-grained morphological details and their semantic correlations.
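The core of such image-caption alignment is a symmetric contrastive objective over a batch of matched pairs. The sketch below shows a standard CLIP-style InfoNCE loss as a minimal illustration; the projection dimension and temperature are illustrative assumptions, not CONCH's actual hyperparameters.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-caption pairs (sketch)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) scaled cosine similarities
    targets = torch.arange(logits.size(0))            # the i-th image matches the i-th caption
    loss_i2t = F.cross_entropy(logits, targets)       # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Hypothetical projected embeddings from the vision and text towers.
loss = contrastive_alignment_loss(torch.randn(64, 512), torch.randn(64, 512))
```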
More advanced VLMs like ConVLM introduce context-guided token learning to address the limitation of coarse alignment in earlier models [85]. Through token enhancement modules and context-guided loss functions, these models achieve finer-level image-text interactions capable of capturing subtle morphological structures in histology images.
The table below summarizes the key characteristics of major vision and vision-language foundation models in computational pathology:
| Model | Architecture Type | Training Data Scale | Core Capabilities | Key Limitations |
|---|---|---|---|---|
| Virchow [2] [10] | Vision-only SSL | 1.5M WSIs | Pan-cancer detection (0.949 AUC), rare cancer identification (0.937 AUC), biomarker prediction | No native language capabilities, requires separate models for text tasks |
| CONCH [5] | Vision-Language | 1.17M image-text pairs | Cross-modal retrieval, image captioning, classification, segmentation | May require fine-tuning for optimal slide-level clinical task performance |
| TITAN [7] | Multimodal Whole-Slide | 335,645 WSIs + reports | Slide-level representation, report generation, zero-shot classification, cross-modal retrieval | Complex multi-stage training, computational intensity |
| ConVLM [85] | Fine-grained VLM | 20 histopathology datasets | Fine-grained classification, cancer subtyping, context-aware representations | Specialized architecture less suited for general vision tasks |
Quantitative evaluations reveal distinct performance patterns across model architectures. Virchow demonstrates exceptional capabilities in pan-cancer detection, achieving 0.95 specimen-level AUC across nine common and seven rare cancers, with particularly strong performance on rare cancers (0.937 AUC) [2]. This highlights the power of massive-scale visual pretraining for diagnostic applications.
VLMs exhibit complementary strengths. In comprehensive benchmarking on the PathMMU dataset, Qwen2-VL-72B-Instruct (a general VLM) achieved superior performance with an average score of 63.97% across pathology subsets [88] [89]. Specialized pathology VLMs like CONCH demonstrate state-of-the-art performance on diverse benchmarks including image classification, segmentation, captioning, and cross-modal retrieval [5].
Vision models typically require massive-scale WSI datasets without corresponding text annotations. Virchow's training on 1.5 million slides exemplifies the data scale needed for effective visual representation learning [2]. The DINOv2 algorithm employed by Virchow uses self-distillation with no labels, leveraging global and local crops of tissue tiles to learn robust representations.
VLMs require carefully aligned image-text pairs, which are scarcer in medical domains. CONCH addresses this through diverse data sources including biomedical text and specifically curated histopathology image-caption pairs [5]. TITAN extends this approach through synthetic data generation, using 423,122 synthetic captions generated from a multimodal generative AI copilot to augment its training [7].
Standardized evaluation methodologies are essential for comparative model assessment. The experimental workflow for pathology foundation model validation typically proceeds through standardized benchmarking, zero-shot evaluation, and linear probing, as described below.
Comprehensive benchmarking employs frameworks like VLMEvalKit, which standardizes evaluation across multiple pathology datasets including PathMMU [88] [89]. The PathMMU dataset contains multiple-choice questions (MCQs) derived from real-world pathology images and clinical scenarios, designed to evaluate diagnostic reasoning capabilities [88].
Zero-shot evaluation assesses model generalization without task-specific fine-tuning: the model is applied directly to held-out tasks, for example by matching image embeddings against text prompts describing the candidate classes or by answering multiple-choice questions. This approach was used in large-scale evaluations of over 60 VLMs on histopathology tasks, providing contamination-free performance assessments [88].
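For the prompt-matching variant, zero-shot classification amounts to scoring each image embedding against prompt embeddings for the candidate classes. The sketch below uses hypothetical embeddings and class counts purely to show the mechanics.

```python
import torch
import torch.nn.functional as F

# Hypothetical outputs of the model's encoders, already projected to the shared space.
image_embeddings = F.normalize(torch.randn(10, 512), dim=-1)    # 10 ROIs or slides
prompt_embeddings = F.normalize(torch.randn(3, 512), dim=-1)    # one embedding per class prompt,
# e.g., text such as "an H&E image of <subtype>" encoded by the text tower

similarity = image_embeddings @ prompt_embeddings.t()           # (10, 3) cosine similarities
zero_shot_predictions = similarity.argmax(dim=-1)               # class of the closest prompt
```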
Linear probing evaluates the quality of learned representations by training a simple classifier, such as logistic regression or a single linear layer, on features extracted with the frozen foundation model. This method was employed to evaluate Virchow representations across multiple cancer types and biomarker prediction tasks [2].
The table below outlines essential computational tools and resources for implementing pathology foundation models in research workflows:
| Research Reagent | Function | Implementation Example |
|---|---|---|
| VLMEvalKit [88] [89] | Standardized framework for multimodal model evaluation | Benchmarking VLM performance on PathMMU dataset |
| CONCH Model Weights [5] | Pretrained vision-language model for diverse pathology tasks | Cross-modal retrieval, zero-shot classification, image captioning |
| DINOv2 Algorithm [2] [10] | Self-supervised learning method for visual representation learning | Training vision-only foundation models on large WSI collections |
| PathMMU Dataset [88] [89] | Curated benchmark for pathology VLM evaluation | Multiple-choice questions derived from real clinical scenarios |
| Context-Guided Token Learning [85] | Advanced VLM training for fine-grained alignment | Improving capture of subtle morphological details in ConVLM |
Effective deployment of pathology foundation models requires careful consideration of integration pathways; representative workflows for vision and vision-language models differ chiefly in the downstream tasks they support.
Vision models excel in scenarios requiring large-scale visual recognition, such as pan-cancer detection, rare cancer identification, and biomarker prediction from H&E morphology alone [2].
Vision-language models prove superior for cross-modal applications, including text-based image retrieval, zero-shot classification, captioning and report generation, and settings where task-specific labels are scarce [5] [7].
Vision and vision-language models represent complementary paradigms in computational pathology, each with distinctive strengths and optimal application domains. Vision-only foundation models like Virchow demonstrate exceptional performance in pure visual recognition tasks including pan-cancer detection and rare cancer identification. Conversely, vision-language models like CONCH and TITAN enable more flexible, explainable AI systems capable of cross-modal understanding and generation.
The selection between these approaches depends critically on target applications, available data modalities, and deployment requirements. As both architectural paradigms continue to evolve, we anticipate increasing hybridization, with vision models incorporating limited language understanding and VLMs achieving more refined visual reasoning capabilities. For researchers and drug development professionals, this evolving landscape offers powerful tools to advance precision medicine through computational pathology.
The field of computational pathology is undergoing a transformative shift with the emergence of whole-slide foundation models capable of processing gigapixel images and multimodal data. Traditional approaches relying on patch-based analysis and supervised learning have faced significant limitations in generalizability, particularly for rare diseases and low-data scenarios. Foundation models like Virchow, CONCH, and UNI have established new paradigms by leveraging self-supervised learning on massive datasets to create versatile, transferable representations for diverse pathology tasks [5]. These models represent a crucial advancement over task-specific networks, but translating their capabilities to address complex clinical challenges at the patient and slide level remains constrained by limited clinical data in disease-specific cohorts, especially for rare clinical conditions [7] [31].
The Transformer-based pathology Image and Text Alignment Network (TITAN) emerges as a next-generation multimodal whole-slide foundation model that addresses these limitations through innovative architecture and unprecedented scale. Pretrained on 335,645 whole-slide images, TITAN represents a substantial leap in general-purpose slide representation learning, outperforming existing region-of-interest (ROI) and slide foundation models across multiple machine learning settings including linear probing, few-shot and zero-shot classification, rare cancer retrieval, cross-modal retrieval, and pathology report generation [7] [31]. This advancement signals a new era where foundation models can directly facilitate a wide array of machine learning-based workflows requiring minimal or no further supervised fine-tuning, potentially accelerating diagnostic processes and drug development pipelines.
TITAN builds upon a hybrid neural architecture that addresses critical limitations in previous models, particularly the quadratic complexity and fixed context windows of standard Transformers, as well as the compression bottlenecks and sequential processing constraints of linear recurrent models [90]. The architecture incorporates a sophisticated memory management system comprising short-term memory for precise modeling of immediate dependencies, long-term neural memory for adaptive retention and retrieval of historical data across vast datasets, and persistent memory for encoding task-specific knowledge that remains consistent across diverse use cases [90]. This multi-tiered approach enables TITAN to efficiently process sequences with millions of tokens while maintaining computational feasibility—a crucial capability for whole-slide image analysis.
The model utilizes techniques such as sliding window attention, gradient-based surprise metrics, and adaptive forgetting gates to optimize memory usage and computational efficiency [90]. For handling long and variable input sequences characteristic of gigapixel WSIs, TITAN implements attention with linear bias (ALiBi) extended to two dimensions, where the linear bias is based on the relative Euclidean distance between features in the feature grid, reflecting actual distances between patches in the tissue [7]. This innovation enables long-context extrapolation at inference time, effectively managing sequences exceeding 10^4 tokens that would be computationally prohibitive for standard Transformer architectures.
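The idea behind the 2D extension of ALiBi can be sketched briefly: attention logits are penalized in proportion to the Euclidean distance between patch positions in the feature grid, so distant tissue regions contribute less by default. The snippet below is a simplified, single-slope illustration (real ALiBi uses per-head slopes); the grid size and slope are illustrative assumptions.

```python
import torch

def alibi_2d_bias(grid_h: int, grid_w: int, slope: float = 0.5) -> torch.Tensor:
    """Additive attention bias proportional to Euclidean distance between grid positions (sketch)."""
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2), N = grid_h * grid_w
    dist = torch.cdist(coords, coords)                                   # pairwise patch distances
    return -slope * dist                                                 # (N, N) bias added to attention logits

# For a hypothetical 14x14 feature grid, the bias is added to pre-softmax attention scores.
bias = alibi_2d_bias(14, 14)
scores = torch.randn(196, 196) + bias
attn = scores.softmax(dim=-1)
```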
TITAN employs a three-stage pretraining strategy that ensures slide-level representations capture histomorphological semantics at both ROI and WSI levels through visual and language supervisory signals [7]:
1. Vision-only self-supervised pretraining on 2D grids of patch features using the iBOT framework (yielding the TITAN-V variant).
2. ROI-level vision-language alignment using synthetic captions generated for tissue regions of interest.
3. WSI-level vision-language alignment using paired clinical pathology reports.
This hierarchical pretraining approach allows TITAN to leverage both fine-grained morphological patterns and slide-level clinical context, creating a comprehensive representation of histopathology data.
Diagram: TITAN's architectural evolution and its relationship with preceding foundation models in computational pathology.
TITAN's pretraining employs a sophisticated knowledge distillation approach where the model operates in the embedding space of pre-extracted patch features. Rather than processing raw image patches directly, TITAN takes a sequence of patch features encoded by established histology patch encoders like CONCHv1.5, which generates 768-dimensional features for each patch [7]. These features are spatially arranged in a 2D grid replicating patch positions within the tissue, preserving spatial context and enabling positional encoding.
To manage computational complexity, WSIs are divided into non-overlapping patches of 512×512 pixels at 20× magnification, substantially larger than the widely-used 256×256 patches [7]. For self-supervised learning, the model creates multiple views of a WSI by randomly cropping the 2D feature grid: a region crop of 16×16 features covering 8,192×8,192 pixels is randomly sampled, from which two random global (14×14) and ten local (6×6) crops are extracted for iBOT pretraining [7]. These feature crops undergo data augmentation including vertical and horizontal flipping, followed by posterization feature augmentation to enhance robustness [7].
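The multi-view construction over the 2D feature grid can be sketched as follows. The helper function and random sampling are simplified illustrations of the crop sizes described above, not the authors' released code; the per-slide grid dimensions are hypothetical.

```python
import torch

def random_grid_crop(feature_grid: torch.Tensor, size: int) -> torch.Tensor:
    """Randomly crop a size x size window from a (H, W, C) grid of patch features."""
    h, w, _ = feature_grid.shape
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    return feature_grid[top:top + size, left:left + size]

# Hypothetical grid of 768-d patch features for one slide (40 x 60 patches of 512 px at 20x).
slide_grid = torch.randn(40, 60, 768)

# Sample one 16x16 region (~8,192 x 8,192 pixels), then derive global and local views from it.
region = random_grid_crop(slide_grid, 16)
global_views = [random_grid_crop(region, 14) for _ in range(2)]
local_views = [random_grid_crop(region, 6) for _ in range(10)]
# Flips and feature augmentations would be applied to these views before the iBOT objective.
```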
TITAN underwent comprehensive evaluation across diverse clinical tasks using standardized metrics and benchmarking protocols. The evaluation framework assessed performance across multiple machine learning settings to determine generalizability and practical utility:
Table 1: Evaluation Metrics for TITAN Benchmarking
| Task Category | Evaluation Metrics | Experimental Settings |
|---|---|---|
| Slide Representation Quality | Linear probing accuracy, Few-shot learning performance | Zero-shot, few-shot (varying data proportions), full-shot |
| Cross-modal Retrieval | Recall@K, Precision@K | Slide-to-report and report-to-slide retrieval |
| Language Understanding | BLEU, ROUGE scores | Pathology report generation |
| Rare Disease Application | Retrieval accuracy, Classification F1 score | Rare cancer retrieval and subtyping |
| Prognostic Prediction | Concordance index, AUC | Survival analysis and outcome prediction |
For slide representation quality assessment, the framework employed linear probing where a linear classifier is trained on top of frozen features, and few-shot learning with varying proportions of training data (from 1% to 100%) to evaluate data efficiency [7]. Cross-modal retrieval tasks measured the model's ability to associate visual patterns with textual descriptions through recall and precision at different K values. Report generation capabilities were quantified using natural language processing metrics including BLEU and ROUGE scores that compare machine-generated reports with human-written references [7].
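For the retrieval tasks, Recall@K can be computed directly from the similarity matrix between slide and report embeddings. The sketch below uses hypothetical paired embeddings and an assumed one-to-one slide-report correspondence.

```python
import torch
import torch.nn.functional as F

def recall_at_k(slide_emb: torch.Tensor, report_emb: torch.Tensor, k: int = 5) -> float:
    """Fraction of slides whose paired report appears among the top-k retrieved reports (sketch)."""
    sims = F.normalize(slide_emb, dim=-1) @ F.normalize(report_emb, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices                      # (N, k) retrieved report indices
    targets = torch.arange(slide_emb.size(0)).unsqueeze(1)   # ground truth: i-th slide pairs with i-th report
    return (topk == targets).any(dim=-1).float().mean().item()

# Hypothetical paired embeddings for 100 slide-report pairs.
print(recall_at_k(torch.randn(100, 512), torch.randn(100, 512), k=5))
```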
TITAN's performance has been rigorously evaluated against existing ROI and slide foundation models across multiple domains. The model demonstrates state-of-the-art results in both vision-only and multimodal tasks, with particularly strong performance in low-data regimes and rare disease applications.
Table 2: Comparative Performance of TITAN Across Clinical Tasks
| Model | Zero-Shot Classification (Accuracy) | Few-Shot Learning (F1-Score) | Slide Retrieval (Recall@10) | Report Generation (ROUGE-L) |
|---|---|---|---|---|
| TITAN | 0.892 | 0.861 | 0.927 | 0.742 |
| TITAN-V (Vision-only) | 0.845 | 0.823 | 0.895 | N/A |
| CONCH-based Slide Model | 0.812 | 0.794 | 0.862 | 0.681 |
| Other Slide Foundation Models | 0.781-0.825 | 0.752-0.812 | 0.821-0.885 | 0.592-0.703 |
| ROI Foundation Models | 0.723-0.795 | 0.712-0.802 | 0.785-0.852 | N/A |
The evaluation results demonstrate TITAN's significant advantage across all task categories, particularly in zero-shot settings where it achieves approximately 8% higher accuracy compared to ROI foundation models and 4-6% improvement over other slide foundation models [7]. This performance advantage is especially pronounced in rare cancer retrieval tasks, where TITAN outperforms existing methods by 12-15% in recall metrics, highlighting its value for clinical scenarios with limited annotated data [7] [31].
Ablation studies confirm the contribution of each architectural component and pretraining stage to TITAN's overall performance. The vision-only pretraining (TITAN-V) already establishes a strong baseline, but the full model with multimodal alignment demonstrates substantial gains in language-aware tasks without sacrificing visual representation quality.
Table 3: Ablation Study of TITAN Components
| Model Variant | Image Classification | Text-to-Image Retrieval | Report Generation | Long-Context Reasoning |
|---|---|---|---|---|
| TITAN (Full Model) | 0.884 | 0.901 | 0.742 | 0.896 |
| Without ROI-level Alignment | 0.875 | 0.842 | 0.693 | 0.881 |
| Without WSI-level Alignment | 0.879 | 0.861 | 0.712 | 0.889 |
| Without Synthetic Captions | 0.868 | 0.823 | 0.665 | 0.872 |
| Standard Positional Encoding | 0.852 | 0.845 | 0.721 | 0.824 |
| ViT-Base Architecture | 0.831 | 0.812 | 0.683 | 0.795 |
The ablation analysis reveals that ROI-level alignment contributes most significantly to fine-grained visual-language understanding, improving text-to-image retrieval by approximately 6% [7]. The use of synthetic captions demonstrates particular value for report generation tasks, contributing to a 10% improvement in ROUGE-L scores, suggesting substantial scaling potential through synthetic data augmentation [7]. The extension of ALiBi to 2D for positional encoding proves critical for long-context reasoning, providing an 8% performance advantage over standard positional encoding schemes [7].
Implementation and application of TITAN for histopathology research requires specific computational resources and data components. The following table details the essential "research reagents" and their functions in experimental workflows.
Table 4: Essential Research Reagent Solutions for TITAN Implementation
| Resource Category | Specific Implementation | Function in Experimental Workflow |
|---|---|---|
| Pretraining Dataset | Mass-340K (335,645 WSIs, 20 organs) | Foundation for self-supervised learning; ensures model diversity and generalizability |
| Multimodal Alignment Data | 423,122 synthetic ROI captions; 182,862 pathology reports | Enables cross-modal reasoning and report generation capabilities |
| Patch Feature Encoder | CONCHv1.5 (768-dimensional features) | Extracts meaningful representations from 512×512 image patches at 20× magnification |
| Positional Encoding Scheme | 2D-ALiBi (Attention with Linear Bias) | Enables long-context extrapolation for gigapixel WSIs exceeding 10^4 tokens |
| SSL Framework | iBOT (Knowledge Distillation) | Self-supervised pretraining on 2D feature grids with global and local crops |
| Computational Infrastructure | NVIDIA A6000/A40 GPUs (48GB VRAM) | Handles large batch sizes and long sequences during training and inference |
| Evaluation Benchmark | 14 diverse pathology tasks | Comprehensive assessment of slide representation quality and multimodal capabilities |
The CONCHv1.5 patch encoder serves as a critical component, providing the foundational feature representations from histology images [7] [5]. The model operates exclusively in this feature space, requiring pre-extraction of patch features before TITAN processing. The 2D-ALiBi positional encoding is particularly essential for handling the variable sizes and large dimensions of whole-slide images, as it enables extrapolation to longer sequences than encountered during training [7].
Diagram: The three-stage pretraining workflow, from vision-only self-supervision to ROI- and WSI-level vision-language alignment, that underlies TITAN's performance in whole-slide image analysis.
TITAN represents a fundamental advancement in computational pathology by introducing a multimodal whole-slide foundation model that effectively bridges the gap between patch-level representation learning and slide-level clinical reasoning. Through its innovative three-stage pretraining paradigm and hybrid memory architecture, TITAN demonstrates exceptional capabilities in zero-shot learning, rare disease retrieval, and pathology report generation without requiring task-specific fine-tuning. The model's performance advantage is particularly significant in low-data regimes and for rare cancer applications, addressing critical challenges in diagnostic pathology and biomarker discovery.
The success of pretraining with synthetic fine-grained morphological descriptions suggests substantial scaling potential for TITAN and similar foundation models through synthetic data augmentation [7]. Future developments will likely focus on expanding multimodal capabilities to include genomic data, proteomics, and spatial transcriptomics, creating even more comprehensive representations of disease biology. As these models continue to evolve, they hold the potential to transform diagnostic workflows, accelerate drug development processes, and ultimately improve patient care through more precise and accessible computational pathology tools.
The advent of whole-slide imaging (WSI) has transformed pathology from a purely microscopy-based discipline to a quantitative, data-rich science. This digitization has paved the way for artificial intelligence (AI) applications, culminating in the development of pathology foundation models—large-scale AI models trained on vast datasets of histopathology images using self-supervised learning. These models, including Virchow, CONCH, and UNI, learn fundamental representations of tissue morphology that can be adapted to diverse clinical tasks without requiring extensive labeled datasets for each new application [1]. For researchers, scientists, and drug development professionals, these models offer unprecedented opportunities to extract clinically relevant information from standard histopathology slides, potentially accelerating biomarker discovery, prognostic model development, and therapeutic response prediction. However, the pathway from experimental validation to regulatory approval and widespread clinical adoption requires careful navigation of evidence generation, validation standards, and regulatory frameworks. This technical guide examines the clinical translation readiness of leading pathology foundation models, focusing on the evidence required for regulatory approval and the practical considerations for integration into clinical research and practice.
Independent benchmarking studies provide crucial evidence of model performance across clinically relevant tasks. A comprehensive evaluation of 19 foundation models on 13 patient cohorts with 6,818 patients and 9,528 slides offers direct comparative data on the leading models discussed in this guide [6].
Table 1: Benchmark Performance of Foundation Models Across Clinical Domains (Mean AUROC)
| Foundation Model | Morphology Tasks (n=5) | Biomarker Tasks (n=19) | Prognosis Tasks (n=7) | Overall Average (n=31) |
|---|---|---|---|---|
| CONCH | 0.77 | 0.73 | 0.63 | 0.71 |
| Virchow2 | 0.76 | 0.73 | 0.61 | 0.71 |
| Prov-GigaPath | 0.69 | 0.72 | 0.65 | 0.69 |
| DinoSSLPath | 0.76 | 0.68 | 0.62 | 0.69 |
| UNI | 0.71 | 0.68 | 0.64 | 0.68 |
The benchmarking data reveals that CONCH, a vision-language model trained on 1.17 million image-caption pairs, and Virchow2, a vision-only model trained on 3.1 million WSIs, demonstrate the highest overall performance across diverse clinical tasks [6]. This superior performance is attributed to their training on massive, diverse datasets that enable robust feature learning. Particularly noteworthy is CONCH's advantage in multimodal integration, allowing it to leverage both visual and textual information from pathology reports and captions.
Table 2: Model Performance in Low-Data Scenarios (Number of Tasks Where Model Ranked First)
| Foundation Model | High Data (n=300) | Medium Data (n=150) | Low Data (n=75) |
|---|---|---|---|
| Virchow2 | 8 | 6 | 4 |
| PRISM | 7 | 9 | 4 |
| CONCH | 5 | 4 | 5 |
For real-world clinical applications where positive cases may be rare, performance in low-data settings is particularly important. The benchmarking study evaluated models using randomly sampled cohorts of 300, 150, and 75 patients while maintaining similar ratios of positive samples [6]. Interestingly, while Virchow2 dominated in high-data scenarios, other models like PRISM showed strong performance with medium-sized cohorts, and CONCH maintained robust performance even with the smallest sample sizes. This suggests that different models may be optimal depending on the specific clinical context and data availability.
The standard methodology for validating foundation models on clinical tasks involves weakly supervised multiple instance learning (MIL) approaches, which require only slide-level labels rather than costly pixel-wise annotations [6] [91]. The typical workflow includes:
WSI Preprocessing: Tessellation of whole-slide images into small, non-overlapping patches (typically 256×256 or 512×512 pixels at 20× magnification) [7].
Feature Extraction: Using foundation models as frozen feature extractors to convert each patch into a feature embedding (e.g., 768-dimensional vectors for CONCH) [7].
Feature Aggregation: Employing transformer-based architectures or attention-based multiple instance learning (ABMIL) to aggregate patch-level features into slide-level representations [6].
Task-Specific Heads: Training lightweight classification or regression layers on top of the aggregated features for specific clinical tasks such as biomarker prediction or survival analysis [1].
This approach has been validated across multiple cancer types including lung, colorectal, gastric, and breast cancers, demonstrating consistent performance despite variations in tissue morphology and staining protocols [6].
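As a small illustration of the tessellation step in the workflow above, the snippet below enumerates non-overlapping 512 x 512 patch coordinates from a WSI using the openslide-python API. The slide path is a placeholder, level 0 is used for simplicity (the level corresponding to 20x depends on the scanner), and no tissue/background filtering is shown.

```python
import openslide

PATCH = 512  # patch edge length in pixels at the chosen level

slide = openslide.OpenSlide("example_slide.svs")   # hypothetical path to a WSI
width, height = slide.dimensions                   # level-0 size in pixels

# Non-overlapping grid of patch coordinates; in practice, background tiles are
# removed with a tissue mask before feature extraction.
coords = [(x, y) for y in range(0, height - PATCH + 1, PATCH)
                 for x in range(0, width - PATCH + 1, PATCH)]

patch = slide.read_region(coords[0], 0, (PATCH, PATCH)).convert("RGB")
```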
For vision-language models like CONCH, additional protocols enable cross-modal functionality:
Cross-Modal Alignment: Contrastive learning to align image features with corresponding text embeddings from pathology reports [7].
Zero-Shot Classification: Direct application of the model to novel tasks without task-specific training by leveraging natural language descriptions [7].
Report Generation: Generating pathology reports from WSIs by decoding the visual features into textual descriptions [7].
These capabilities are particularly valuable for rare diseases or low-prevalence biomarkers where training data may be scarce.
Diagram 1: Clinical Validation Workflow for Pathology Foundation Models
The regulatory environment for AI-based medical devices, including pathology foundation models, is rapidly evolving with several key developments in 2025:
ICH E6(R3) Guidelines: The updated Good Clinical Practice guideline emphasizes proportionate, risk-based quality management, data integrity across all modalities, and clear sponsor-investigator oversight [92] [93]. This framework requires that quality be built into trial design from the earliest stages, with continuous risk assessment throughout the clinical trial lifecycle.
FDA Guidance on AI/ML: The FDA has issued draft guidance providing a framework for model validation, transparency, and governance [93]. This includes requirements for predetermined change control plans for algorithms that continue to learn after deployment.
EU Clinical Trials Regulation: Fully implemented as of January 2025, the CTR requires all clinical trials in the EU to be submitted, managed, and reported through the Clinical Trials Information System (CTIS) [92]. This creates harmonized but stringent timelines and transparency requirements.
Diversity, Equity, and Inclusion Requirements: Regulatory agencies, particularly in the United States, are increasingly focused on ensuring diversity in clinical trial participation [92]. Sponsors and CROs are expected to develop diversity action plans outlining recruitment strategies for underrepresented populations.
For pathology foundation models to achieve regulatory approval as medical devices, specific evidence requirements must be met:
Analytical Validation: Demonstration that the model correctly identifies histopathological features with appropriate sensitivity, specificity, and reproducibility [1].
Clinical Validation: Evidence that the model accurately predicts clinically relevant endpoints across diverse patient populations and clinical settings [1].
Computational Quality Assurance: Documentation of model robustness, resilience to variations in slide preparation and scanning, and cybersecurity measures [92].
Clinical Utility: Proof that using the model improves patient outcomes, clinical decision-making, or healthcare efficiency compared to standard practice [1].
The benchmarking data showing performance across multiple external cohorts [6] represents the type of clinical validation evidence that regulators are increasingly expecting, moving beyond narrow single-institution evaluations.
Successful adoption of pathology foundation models requires thoughtful integration into existing clinical and research workflows:
Digital Pathology Infrastructure: Implementation of whole-slide scanners, secure storage solutions, and viewing stations that comply with regulatory requirements for diagnostic use [32].
Laboratory Information System (LIS) Integration: Seamless connectivity between AI tools and existing laboratory information systems to minimize workflow disruption [32].
Quality Control Processes: Establishment of protocols for verifying model performance on site-specific data, handling uncertain predictions, and maintaining human oversight [1].
Training and Competency Development: Role-based training programs for pathologists, researchers, and technical staff to build competencies in AI interpretation and quality assurance [92].
Several open-source platforms facilitate the research use and validation of pathology foundation models:
QuPath: Bioimage analysis software with robust WSI support and ability to handle large images (>40 GB) [32].
CellProfiler: Cell image analysis platform enabling automated quantification of morphological features [32].
Ilastik: Interactive segmentation tool particularly suited for cell biology applications [32].
Cytomine: Web-based collaborative platform for multi-user analysis of multi-gigapixel images [32].
These tools provide accessible entry points for researchers seeking to validate and build upon existing foundation models without requiring extensive computational infrastructure.
Table 3: Essential Research Reagents for Pathology Foundation Model Validation
| Reagent / Resource | Function | Example Implementation |
|---|---|---|
| Whole-Slide Images | Primary data source for model training and validation | TCGA dataset [91], institutional archives |
| Pathology Reports | Textual data for multimodal model training | Clinical reports paired with WSIs [7] |
| Foundation Model Weights | Pretrained models for feature extraction | UNI [3], CONCH [6], Virchow [6] |
| Multiple Instance Learning Framework | Weakly supervised learning for WSI classification | ABMIL [6], TransMIL [91] |
| Open-Source Analysis Platforms | Software for WSI visualization and analysis | QuPath [32], CellProfiler [32] |
| Computational Infrastructure | Hardware for processing gigapixel images | High-performance GPUs with sufficient VRAM |
The benchmarking evidence demonstrates that pathology foundation models, particularly CONCH and Virchow, have reached a level of maturity that supports their use in research applications with potential for clinical translation. Their performance across diverse tasks including biomarker prediction, prognosis, and morphological classification meets or exceeds previously established benchmarks [6]. However, successful regulatory approval and clinical adoption will require:
Prospective Clinical Trials: Validation in prospective rather than retrospective cohorts to establish real-world performance.
Integration with Regulatory Standards: Adherence to evolving frameworks including ICH E6(R3), FDA AI guidance, and EU CTR requirements [92] [93].
Demonstration of Clinical Utility: Evidence that model use improves patient outcomes, reduces costs, or enhances diagnostic accuracy beyond current standards.
Implementation Frameworks: Development of standardized protocols for model deployment, monitoring, and quality assurance in clinical settings.
As the field advances, the integration of pathology foundation models with other data modalities—including genomics, clinical records, and digital health technologies—will likely create more comprehensive diagnostic and prognostic tools [1]. By building on the robust benchmarking evidence now available and addressing the regulatory and implementation challenges ahead, researchers and drug development professionals can strategically position these powerful tools for successful clinical translation.
The emergence of Virchow, CONCH, and UNI represents a paradigm shift in computational pathology, demonstrating that foundation models pretrained on massive datasets can achieve remarkable performance across diverse diagnostic, prognostic, and predictive tasks. Benchmarking studies reveal that while CONCH excels as a multimodal vision-language model and Virchow demonstrates superior performance in large-scale vision-only applications, each model offers distinct advantages depending on the clinical context and data availability. Critical challenges remain in computational efficiency, domain generalization, and clinical workflow integration. Future development will likely focus on increasingly multimodal architectures that integrate pathology images with genomic data, clinical notes, and other patient information, paving the way for generalist medical AI systems. For researchers and drug development professionals, these foundation models offer powerful tools for biomarker discovery, patient stratification, and therapeutic development, ultimately accelerating the transition toward precision medicine in oncology and beyond.