Our Datasets
Over the past several years, we have created a large number of high-quality datasets for many malignant tumors.

Many of these datasets were manually annotated with high precision by experts and
form the foundation of our computational pathology algorithms.

If you are interested in a collaboration or in testing your algorithm on our datasets, please feel free to contact us.
#prostate cancer
Prostate cancer: two tumor detection datasets

Two datasets from our study:
"High-accuracy prostate cancer pathology using deep learning", published in Nature Machine Intelligence, 2020
Ground truth: patch-level label.
Tissue classes: tumor, benign.
Patch size: 152 x 152 µm (ca. 600 px)
Magnification: 20x
Dataset 1 (Institute 1): ~145 000 patches
Dataset 2 (Institute 2): ~34 000 patches
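
The patch dimensions listed above imply a pixel resolution that is easy to check. The following is a minimal sketch, assuming square patches and treating the approximate 600 px figure as exact; the function name is illustrative and is not part of any tooling released with the datasets.

```python
def microns_per_pixel(side_um: float, side_px: int) -> float:
    """Return the resolution (µm per pixel) implied by a square patch."""
    if side_px <= 0:
        raise ValueError("side_px must be positive")
    return side_um / side_px


# Values taken from the dataset description: 152 µm side, ca. 600 px.
mpp = microns_per_pixel(152.0, 600)
print(f"{mpp:.3f} µm/px")  # prints "0.253 µm/px"
```

Because the pixel count is given only approximately, the result should be read as an estimate rather than as the scanner's calibrated resolution.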

Link to publication

Link to Dataset/Zenodo


Prostate cancer: Gleason grading dataset

One of the test datasets from our study:
"High-accuracy prostate cancer pathology using deep learning", published in Nature Machine Intelligence, 2020
Image source: The Cancer Genome Atlas (TCGA)
Ground truth: Gleason scores assigned by three pathologists.
Image size: large ROIs
Magnification: 20x
Number of images: 218

Link to publication


Prostate cancer: segmentation dataset

A dataset from our study:
"High-accuracy prostate cancer pathology using deep learning", published in Nature Machine Intelligence, 2020
Image source: The Cancer Genome Atlas (TCGA)
Ground truth: precise gland-level annotations.
Tissue classes: 5 tissue classes.
Number of whole-slide images annotated: 389.

Link to publication


Prostate cancer: six tumor detection datasets
(multiple institutes, multiple scanners)
Six datasets from our study:
"Quality control stress test for deep learning-based diagnostic model in digital pathology"
Modern Pathology 2021
Ground truth: patch-level label.
Tissue classes: tumor, benign glandular, benign non-glandular.
Patch size: 152 x 152 µm (ca. 600 px)
Magnification: 20x-40x
Size of each dataset: 120 000 patches

Zenodo Part 1

Zenodo Part 2


Prostate cancer: six biopsy datasets
(five institutes, three scanners)
Six datasets from our study:
"An international multi-institutional validation study of deep learning-based classifier for prostate cancer detection and Gleason grading in biopsy samples"
(Under Review)
Ground truth: slide-level labels (tumor/benign/uncertain).
Magnification: 20x-40x
Total number of whole-slide images: 5922.

Prostate cancer: two biopsy Gleason grading datasets

Two datasets from our study:
"An international multi-institutional validation study of deep learning-based classifier for prostate cancer detection and Gleason grading in biopsy samples"
(Under Review)
Ground truth: slide-level Gleason grading (11 pathologists).
Magnification: 40x
Number of whole-slide images: 227 / 159.

#oesophageal cancer
Oesophageal cancer: four classification datasets (multiple institutes, multiple scanners)

Four datasets from our study:
"Artificial intelligence for tumor detection and histological regression grading in oesophageal adenocarcinomas: a retrospective algorithm development and validation study"
Lancet Digital Health 2023
Number of institutes: three + TCGA
Ground truth: patch-level (11 tissue classes).
Patch size: 256 px at MPP 0.7813
Size of datasets: 32 796 - 178 187 patches
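
For orientation, the patch size in pixels and the MPP (microns per pixel) value stated above can be converted to a physical patch size. This is a minimal sketch, assuming square patches; the helper name is hypothetical and chosen only for illustration.

```python
def patch_side_um(side_px: int, mpp: float) -> float:
    """Return the physical side length (µm) of a square patch."""
    if side_px <= 0 or mpp <= 0:
        raise ValueError("side_px and mpp must be positive")
    return side_px * mpp


# Values taken from the dataset description: 256 px at MPP 0.7813.
side = patch_side_um(256, 0.7813)
print(f"{side:.1f} µm")  # prints "200.0 µm"
```

Under these assumptions each patch covers roughly 200 x 200 µm of tissue.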

Link to publication

Link to Dataset/Zenodo


Oesophageal cancer: four segmentation datasets (multiple institutes, multiple scanners)

Four datasets from our study:
"Artificial intelligence for tumor detection and histological regression grading in oesophageal adenocarcinomas: a retrospective algorithm development and validation study"
Lancet Digital Health 2023
Ground truth: manual annotations.
Magnification: 40x
Number of institutes: three + TCGA
Number of annotated WSIs: 215 / 62 / 214 / 22.

Oesophageal cancer: one large histological regression grading dataset

One dataset from our study:
"Artificial intelligence for tumor detection and histological regression grading in oesophageal adenocarcinomas: a retrospective algorithm development and validation study"
Lancet Digital Health 2023
Ground truth: case-level, % of vital tumor tissue.
Magnification: 40x
Number of patient cases: 95
Number of WSIs: 1407.

#colorectal cancer
Colorectal cancer: one large high-quality segmentation dataset

A dataset from our study:
"Clinical-grade tumor detection and tissue segmentation in colorectal specimens using artificial intelligence tool"
(Under Review)
Source of cases: TCGA
Ground truth: highly precise manual annotations (17 tissue classes)
Magnification: 40x

Number of cases/WSIs: 241

SemiCOL Challenge


Colorectal cancer: extended CRAG dataset

A dataset from our study:
"Clinical-grade tumor detection and tissue segmentation in colorectal specimens using artificial intelligence tool"
(Under Review)
Source: CRAG dataset from TIA Center (please consider citing the original publications)
Modification: correction of existing annotations, added annotations for nine additional tissue classes.
Ground truth: manual annotations (10 tissue classes)
Magnification: 20x

Number of ROIs: 214.

Link to Zenodo (coming soon)


Colorectal cancer: three segmentation datasets (three institutes, three scanners)

Three datasets from our study:
"Clinical-grade tumor detection and tissue segmentation in colorectal specimens using artificial intelligence tool"
(Under Review)
Source: three pathology institutes
Ground truth: highly precise manual annotations (11 tissue classes)
Magnification: 40x

Number of WSIs: 30 / 10 / 10.

Link to Zenodo (coming soon)


Colorectal cancer: biopsy tumor detection datasets (four institutes, three scanners)

Four datasets from our study:
"Clinical-grade tumor detection and tissue segmentation in colorectal specimens using artificial intelligence tool"
(Under Review)
Source: four pathology institutes
Ground truth: slide-level labels.
Magnification: 20x-40x

Number of WSIs: 356 / 212 / 652 / 675 (total n=1895)


#lung cancer
Lung cancer: large segmentation dataset

A dataset from our ongoing study on lung cancer.

Source: TCGA (adenocarcinoma/squamous cell carcinoma)
Ground truth: highly precise manual annotations (14 tissue classes).
Magnification: 40x

Number of manually annotated WSIs: 264

(AdenoCa 157 / SqCC 107)


#kidney cancer
Kidney cancer: large segmentation dataset

A dataset from our ongoing study on kidney cancer.

Source: TCGA + one pathology institute.
Ground truth: precise manual annotations (12 tissue classes)
Magnification: 40x

Number of WSIs: 430 (ccRCC 131, pRCC 244, chrRCC 55) + oncocytomas (n=50).


Kidney cancer: three segmentation datasets
(three institutes)

Three test datasets from our ongoing study on kidney cancer.

Source: three pathology institutes.
Ground truth: precise manual annotations (12 tissue classes)
Magnification: 40x
Tumor entities: ccRCC, pRCC, chrRCC, oncocytomas

Number of manually annotated WSIs per institute: 30


#lymph node metastasis
Lung cancer: segmentation dataset

One dataset from our ongoing study on lung cancer metastasis detection.

Source: one pathology institute.
Ground truth: precise manual annotations (6 tissue classes)
Magnification: 40x
Tumor entities: AdenoCa, SqCC

Number of manually annotated WSIs: 253


Colorectal cancer: segmentation dataset

One dataset from our ongoing study on colorectal cancer metastasis detection.

Source: one pathology institute.
Ground truth: precise manual annotations (6 tissue classes)
Magnification: 40x
Tumor entities: Adenocarcinoma

Number of manually annotated WSIs: 288


Prostate cancer: segmentation dataset

One dataset from our ongoing study on prostate cancer metastasis detection.

Source: one pathology institute.
Ground truth: precise manual annotations (6 tissue classes)
Magnification: 40x

Number of manually annotated WSIs: 216


#tissue detection
Tissue detection: segmentation dataset


Source: multiple pathology institutes + TCGA.
Ground truth: precise manual annotations (tissue/background)
Magnification: 40x

Number of manually annotated WSIs: 208


#histological artifacts
Histological artifact detection: segmentation dataset


Source: multiple pathology institutes + TCGA.
Ground truth: precise manual annotations (normal tissue + 7 histological artifacts)
Magnification: 40x

Number of manually annotated WSIs: > 300


© tolklab.de 2023