Trident WSI dataset

torchmil.datasets.TridentWSIDataset

Bases: ProcessedMILDataset

This class represents a dataset of Whole Slide Images (WSI) for Multiple Instance Learning (MIL) that was processed using the TRIDENT repository.

Directory structure. For more information on the processing of the bags, refer to the ProcessedMILDataset class. This dataset expects the directory structure provided by the TRIDENT repository. The base_path argument should point to the base directory of the TRIDENT output, of the form {mag}x_{ps}px_{opx}px_overlap/. In this folder, the following folders are expected:

base_path
├──features_{feature_extractor}
|   ├── wsi1.h5
|   ├── wsi2.h5
|   └── ...
├──patches
|   ├── wsi1_patches.h5
|   ├── wsi2_patches.h5
|   └── ...
└──patch_labels (optional)
    ├── wsi1.h5
    ├── wsi2.h5
    └── ...
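
Before building the dataset, it can help to verify that a TRIDENT output folder matches this layout. The following is a minimal sketch using only the standard library. The helper names `check_trident_layout` and `list_wsi_names` are hypothetical and not part of torchmil, and treating the `.h5` file stems in the features folder as WSI names is an assumption.

```python
from pathlib import Path


def check_trident_layout(base_path, feature_extractor):
    """Report which of the expected TRIDENT folders exist under base_path."""
    base = Path(base_path)
    expected = {
        "features": base / f"features_{feature_extractor}",
        "patches": base / "patches",
        "patch_labels": base / "patch_labels",  # optional folder
    }
    return {name: path.is_dir() for name, path in expected.items()}


def list_wsi_names(base_path, feature_extractor):
    """List candidate WSI names as the stems of the .h5 feature files (assumption)."""
    features_dir = Path(base_path) / f"features_{feature_extractor}"
    return sorted(p.stem for p in features_dir.glob("*.h5"))
```

A missing `patch_labels` entry is not an error in itself, since that folder is optional.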

Adjacency matrix. If the coordinates of the patches are available, an adjacency matrix representing the spatial relationships between the patches is built. Please refer to the ProcessedMILDataset class for more information on how the adjacency matrix is built.
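
To give an intuition for the role of the distance threshold, here is a minimal pure-Python sketch of a binary spatial adjacency. It assumes that two patches are connected when the Euclidean distance between their coordinates is at most dist_thr, with the documented default of sqrt(2) * patch_size. The function `build_binary_adjacency` is hypothetical; the actual construction, including normalization and the handling of self-loops, is defined in ProcessedMILDataset and is not reproduced here.

```python
import math


def build_binary_adjacency(coords, patch_size=512, dist_thr=None):
    """Return the (row, col) index pairs of patches that count as neighbours."""
    if dist_thr is None:
        # Default threshold documented for the dist_thr argument.
        dist_thr = math.sqrt(2) * patch_size
    edges = []
    for i, (xi, yi) in enumerate(coords):
        for j, (xj, yj) in enumerate(coords):
            # Self-loops are skipped in this sketch (an assumption).
            if i != j and math.hypot(xi - xj, yi - yj) <= dist_thr:
                edges.append((i, j))
    return edges


# Three patches laid out side by side on a 512-pixel grid.
edges = build_binary_adjacency([(0, 0), (512, 0), (1024, 0)])
```

The list of index pairs mirrors the idea behind a sparse COO layout, where only the non-zero entries of the matrix are stored.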

WSI-level labels. The labels of the WSIs can be provided in two ways:

  1. As a directory containing one file per WSI, following the same structure as the features and patches folders.
  2. As a CSV file containing the WSI names and their corresponding labels. In this case, the user must provide the column names for the WSI names and labels using the wsi_name_col and wsi_label_col keyword arguments, respectively.
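
For the CSV option, a labels file can be produced with the standard library. The sketch below is illustrative only: the helper `write_wsi_labels` and the column names `slide_id` and `label` are hypothetical, and whether the WSI names must match the feature file names without extension is an assumption to check against your data.

```python
import csv


def write_wsi_labels(csv_path, labels, name_col="slide_id", label_col="label"):
    """Write a {wsi_name: label} mapping to a two-column CSV file."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[name_col, label_col])
        writer.writeheader()
        for name, label in labels.items():
            writer.writerow({name_col: name, label_col: label})
```

With such a file, the same column names would then be passed as wsi_name_col and wsi_label_col.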

Patch-level labels. The labels of the patches can be provided through the patch_labels_path argument. This should be a directory containing one '.h5' file per WSI. This file should have "patch_labels" as a key, which should contain an array with the labels of the patches. The order of the patch labels should be the same as the order of the features and coordinates of the patches.

__init__(base_path, labels_path, feature_extractor, patch_labels_path=None, wsi_names=None, bag_keys=['X', 'Y', 'y_inst', 'adj', 'coords'], patch_size=512, dist_thr=None, adj_with_dist=False, norm_adj=True, load_at_init=True, wsi_name_col=None, wsi_label_col=None)

Class constructor.

Parameters:

  • base_path (str) –

    Path to the base directory containing the TRIDENT folders.

  • labels_path (str) –

    Path to the directory or CSV file containing the labels of the WSIs.

  • feature_extractor (str) –

    Feature extractor used to extract the features. This will determine the features folder name.

  • patch_labels_path (str, default: None ) –

    Path to the directory containing the labels of the patches.

  • wsi_names (list, default: None ) –

    List of the names of the WSIs to load. If None, all the WSIs in the features_{feature_extractor} folder are loaded.

  • bag_keys (list, default: ['X', 'Y', 'y_inst', 'adj', 'coords'] ) –

    List of keys to use for the bags. Must be in ['X', 'Y', 'y_inst', 'adj', 'coords'].

  • patch_size (int, default: 512 ) –

    Size of the patches.

  • dist_thr (float, default: None ) –

    Distance threshold for building the adjacency matrix. If None, it is set to sqrt(2) * patch_size.

  • adj_with_dist (bool, default: False ) –

    If True, the adjacency matrix is built using the Euclidean distance between the patch features. If False, the adjacency matrix is binary.

  • norm_adj (bool, default: True ) –

    If True, normalize the adjacency matrix.

  • load_at_init (bool, default: True ) –

    If True, load the bags at initialization. If False, load the bags on demand.

  • wsi_name_col (str, default: None ) –

    Name of the column containing the WSI names in the CSV file provided in labels_path. Only used if labels_path is a CSV file.

  • wsi_label_col (str, default: None ) –

    Name of the column containing the WSI labels in the CSV file provided in labels_path. Only used if labels_path is a CSV file.
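
The parameters above can be combined as in the following sketch. The keyword names come from the constructor signature, while every value (paths, extractor name, column names) is a placeholder. The wrapper `build_dataset` is hypothetical and only defers the import, so the snippet can be read without torchmil installed.

```python
# Keyword arguments mirroring the documented constructor signature.
# All values are placeholders and do not point to real data.
dataset_kwargs = dict(
    base_path="/data/trident_out/20x_512px_0px_overlap",
    labels_path="/data/labels.csv",
    feature_extractor="my_extractor",
    bag_keys=["X", "Y", "adj", "coords"],  # skip patch-level labels
    patch_size=512,
    load_at_init=False,  # load bags on demand instead of at construction
    wsi_name_col="slide_id",
    wsi_label_col="label",
)


def build_dataset(**kwargs):
    """Create the dataset; torchmil is imported only when this is called."""
    from torchmil.datasets import TridentWSIDataset

    return TridentWSIDataset(**kwargs)


# Usage (requires torchmil and a real TRIDENT output folder):
# dataset = build_dataset(**dataset_kwargs)
```

Setting load_at_init=False is shown here because loading every bag up front may be costly for large cohorts; whether it pays off depends on your storage and access pattern.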

__getitem__(index)

Parameters:

  • index (int) –

    Index of the bag to retrieve.

Returns:

  • bag_dict ( TensorDict ) –

    Dictionary containing the keys defined in bag_keys and their corresponding values.

    • X: Features of the bag, of shape (bag_size, ...).
    • Y: Label of the bag.
    • y_inst: Instance labels of the bag, of shape (bag_size, ...).
    • adj: Adjacency matrix of the bag. It is a sparse COO tensor of shape (bag_size, bag_size). If norm_adj=True, the adjacency matrix is normalized.
    • coords: Coordinates of the bag, of shape (bag_size, coords_dim).
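
A quick way to inspect a returned bag is to list the shape of each entry. The helper below is a hypothetical sketch, not part of torchmil. It relies only on the bag exposing `keys()` and item access, and on each value having a `.shape` attribute, as tensors do.

```python
def describe_bag(bag):
    """Return {key: shape} for each entry of a bag; None if a value has no shape."""
    summary = {}
    for key in bag.keys():
        value = bag[key]
        summary[key] = tuple(value.shape) if hasattr(value, "shape") else None
    return summary


# Usage, assuming `dataset` is an already constructed TridentWSIDataset:
# print(describe_bag(dataset[0]))
```

Only the keys requested through bag_keys would appear in the summary.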