Processed MIL dataset

torchmil.datasets.ProcessedMILDataset

Bases: Dataset

This class represents a general MIL dataset where the bags have been processed and saved as numpy or .h5 files. It enforces strict data availability for core components, failing fast if expected files are missing.

MIL processing and directory structure. The dataset expects pre-processed bags saved as individual numpy or .h5 files.

  • A feature file should yield an array of shape (bag_size, ...), where ... represents the shape of the features.
  • A label file should yield an array of arbitrary shape, e.g., (1,) for binary classification.
  • An instance label file should yield an array of shape (bag_size, ...), where ... represents the shape of the instance labels.
  • A coordinates file should yield an array of shape (bag_size, coords_dim), where coords_dim is the dimension of the coordinates.

Bag keys and directory structure. The dataset can be initialized with a list of bag keys, which are used to choose which data to load. This dataset expects the following directory structure:

features_path/ (if "X" in bag_keys)
├── bag1.ext
├── bag2.ext
└── ...
labels_path/ (if "Y" in bag_keys)
├── bag1.ext
├── bag2.ext
└── ...
inst_labels_path/ (if "y_inst" in bag_keys)
├── bag1.ext
├── bag2.ext
└── ...
coords_path/ (if "coords" or "adj" in bag_keys)
├── bag1.ext
├── bag2.ext
└── ...
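As a rough illustration of the layout above, the following sketch writes two toy bags to a temporary directory with NumPy. The sub-directory names, bag sizes, and feature dimension are assumptions made for the example; only the per-file shapes follow the description above.

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical root and sub-directory names, chosen only for this example.
root = Path(tempfile.mkdtemp())
dirs = {name: root / name for name in ("features", "labels", "inst_labels", "coords")}
for d in dirs.values():
    d.mkdir()

rng = np.random.default_rng(0)
for bag_name, bag_size in [("bag1", 5), ("bag2", 8)]:
    # Features: (bag_size, feat_dim); feat_dim = 16 is an arbitrary choice.
    np.save(dirs["features"] / f"{bag_name}.npy", rng.normal(size=(bag_size, 16)))
    # Bag label: arbitrary shape, here (1,) as in binary classification.
    np.save(dirs["labels"] / f"{bag_name}.npy", np.array([1]))
    # Instance labels: (bag_size,).
    np.save(dirs["inst_labels"] / f"{bag_name}.npy", rng.integers(0, 2, size=(bag_size,)))
    # Coordinates: (bag_size, coords_dim); coords_dim = 2 here.
    np.save(dirs["coords"] / f"{bag_name}.npy", rng.integers(0, 10, size=(bag_size, 2)))

print(sorted(p.name for p in dirs["features"].iterdir()))
```

The four directories could then be passed as features_path, labels_path, inst_labels_path and coords_path, with file_ext='.npy'.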

Adjacency matrix. If the coordinates of the instances are available, the adjacency matrix will be built using the Euclidean distance between the coordinates. Formally, the adjacency matrix \(\mathbf{A} = \left[ A_{ij} \right]\) is defined as:

\[A_{ij} = \begin{cases} d_{ij}, & \text{if } \left\| \mathbf{c}_i - \mathbf{c}_j \right\| \leq \text{dist_thr}, \\ 0, & \text{otherwise}, \end{cases} \quad d_{ij} = \begin{cases} 1, & \text{if } \text{adj_with_dist=False}, \\ \exp\left( -\frac{\left\| \mathbf{x}_i - \mathbf{x}_j \right\|}{d} \right), & \text{if } \text{adj_with_dist=True}. \end{cases}\]

where \(\mathbf{c}_i\) and \(\mathbf{c}_j\) are the coordinates of the instances \(i\) and \(j\), respectively, \(\text{dist_thr}\) is a threshold distance, and \(\mathbf{x}_i \in \mathbb{R}^d\) and \(\mathbf{x}_j \in \mathbb{R}^d\) are the features of instances \(i\) and \(j\), respectively.
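The definition above can be sketched in dense NumPy as follows. This is a literal reading of the formula, not the library's _build_adj implementation: build_adj_dense is a hypothetical helper, it assumes 2-D feature arrays of shape (bag_size, d), and it keeps the diagonal entries because the formula as written includes them.

```python
import numpy as np

def build_adj_dense(coords, feats=None, dist_thr=1.5, adj_with_dist=False):
    """Dense adjacency matrix following the formula literally (hypothetical helper)."""
    # Pairwise Euclidean distances between coordinates, shape (n, n).
    coord_dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    mask = coord_dist <= dist_thr
    if adj_with_dist:
        d = feats.shape[1]  # feature dimension
        feat_dist = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
        weights = np.exp(-feat_dist / d)
    else:
        weights = np.ones_like(coord_dist)
    # Keep the weight where the coordinates are close enough, zero elsewhere.
    return np.where(mask, weights, 0.0)

coords = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
feats = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])

A_bin = build_adj_dense(coords)                           # binary edges
A_w = build_adj_dense(coords, feats, adj_with_dist=True)  # feature-weighted edges
print(A_bin)
print(A_w)
```

With these toy inputs only instances 0 and 1 lie within dist_thr of each other, so they are the only off-diagonal pair connected in either matrix.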

How bags are built. When the __getitem__ method is called, the bag is built as follows (pseudocode):

  1. The __getitem__ method is called with an index.
  2. The bag name is retrieved from the list of bag names.
  3. The _build_bag method is called with the bag name:
    3.1. The _build_bag method loads the bag from disk using the _load_bag method. This method loads the features, labels, instance labels and coordinates from disk using the _load_features, _load_labels, _load_inst_labels and _load_coords methods.
    3.2. If the coordinates have been provided, it builds the adjacency matrix using the _build_adj method.
  4. The bag is returned as a dictionary containing the keys defined in bag_keys and their corresponding values.

This behaviour can be extended or modified by overriding the corresponding methods.
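The flow described above can be mimicked with a small stand-in class. ToyBagDataset is hypothetical: it keeps bags in memory instead of reading files, returns plain dictionaries of NumPy arrays rather than a TensorDict, and builds a dense binary adjacency matrix. It is meant only to show where each method sits in the call chain.

```python
import numpy as np

class ToyBagDataset:
    """Hypothetical stand-in that mirrors the described call chain."""

    def __init__(self, store, bag_names, bag_keys=("X", "Y", "coords", "adj"), dist_thr=1.5):
        self.store = store          # maps bag name -> dict of arrays (replaces disk reads)
        self.bag_names = bag_names
        self.bag_keys = bag_keys
        self.dist_thr = dist_thr

    def _load_bag(self, name):
        # The real dataset reads features, labels, etc. from disk at this step.
        return dict(self.store[name])

    def _build_adj(self, coords):
        # Binary adjacency from pairwise coordinate distances.
        dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        return (dist <= self.dist_thr).astype(np.float32)

    def _build_bag(self, name):
        bag = self._load_bag(name)
        if "adj" in self.bag_keys and "coords" in bag:
            bag["adj"] = self._build_adj(bag["coords"])
        # Return only the requested keys.
        return {k: bag[k] for k in self.bag_keys if k in bag}

    def __len__(self):
        return len(self.bag_names)

    def __getitem__(self, index):
        return self._build_bag(self.bag_names[index])

store = {
    "bag1": {
        "X": np.zeros((3, 4)),
        "Y": np.array([1]),
        "coords": np.array([[0, 0], [0, 1], [5, 5]]),
    }
}
ds = ToyBagDataset(store, ["bag1"])
bag = ds[0]
print({k: v.shape for k, v in bag.items()})
```

Overriding a single method, for example _build_adj in a subclass, changes one step while leaving the rest of the chain untouched.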

Custom file reading. By default, the dataset supports reading .npy and .h5 files using the default_read_file function. However, users can provide a custom file reading function through the read_file_fn argument in the constructor. This function must take as input the file path and the key type (one of 'features', 'labels', 'inst_labels', 'coords') and return the corresponding data as a numpy array.

    import numpy as np

    def custom_read_file(file_path: str, key_type: str) -> np.ndarray:
        # Custom logic to read the file and return the data as a numpy array
        ...
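One possible concrete reader, offered as a sketch: it still loads .npy files but casts the result to a dtype that depends on the key type. The dtype choices (int64 for labels, float32 for everything else) are assumptions for illustration, not something the library requires.

```python
import os
import tempfile

import numpy as np

def custom_read_file(file_path: str, key_type: str) -> np.ndarray:
    """Read a .npy file and cast it to a dtype chosen per key type."""
    data = np.load(file_path)
    if key_type in ("labels", "inst_labels"):
        return data.astype(np.int64)
    # 'features' and 'coords' are returned as float32 in this sketch.
    return data.astype(np.float32)

# Quick check on a temporary file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "bag1.npy")
np.save(path, np.arange(6, dtype=np.float64).reshape(3, 2))

feats = custom_read_file(path, "features")
labels = custom_read_file(path, "labels")
print(feats.dtype, labels.dtype)
```

Such a function would be supplied to the constructor as read_file_fn=custom_read_file.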

__init__(features_path=None, labels_path=None, inst_labels_path=None, coords_path=None, bag_names=None, bag_keys=['X', 'Y', 'y_inst', 'adj', 'coords'], file_ext='.npy', dist_thr=1.5, adj_with_dist=False, norm_adj=True, load_at_init=False, read_file_fn=default_read_file, verbose=True)

Class constructor.

Parameters:

  • features_path (str, default: None ) –

    Path to the directory containing the features.

  • labels_path (str, default: None ) –

    Path to the directory containing the bag labels.

  • inst_labels_path (str, default: None ) –

    Path to the directory containing the instance labels.

  • coords_path (str, default: None ) –

    Path to the directory containing the coordinates.

  • bag_keys (list, default: ['X', 'Y', 'y_inst', 'adj', 'coords'] ) –

    List of keys to load the bags data. The TensorDict returned by the __getitem__ method will have these keys. Possible keys are:

    • "X": Load the features of the bag.
    • "Y": Load the label of the bag.
    • "y_inst": Load the instance labels of the bag.
    • "adj": Load the adjacency matrix of the bag. It requires the coordinates to be loaded.
    • "coords": Load the coordinates of the bag.

  • file_ext (str, default: '.npy' ) –

    Extension of the files to load. Can be '.npy' or '.h5'.

  • bag_names (list, default: None ) –

    List of bag names to load. If None, all bags from the features_path are loaded.

  • dist_thr (float, default: 1.5 ) –

    Distance threshold for building the adjacency matrix.

  • adj_with_dist (bool, default: False ) –

    If True, the non-zero entries of the adjacency matrix are weighted by the Euclidean distance between the instance features, as exp(-||x_i - x_j|| / d). If False, the adjacency matrix is binary.

  • norm_adj (bool, default: True ) –

    If True, normalize the adjacency matrix.

  • load_at_init (bool, default: False ) –

    If True, load the bags at initialization. If False, load the bags on demand.

  • read_file_fn (callable, default: default_read_file ) –

    Function to read the files from disk. It must take as input the file path and the key type (one of 'features', 'labels', 'inst_labels', 'coords') and return the corresponding data as a numpy array.

  • verbose (bool, default: True ) –

    If True, warning messages are displayed.

__getitem__(index)

Parameters:

  • index (int) –

    Index of the bag to retrieve.

Returns:

  • bag_dict ( TensorDict ) –

    Dictionary containing the keys defined in bag_keys and their corresponding values.

    • X: Features of the bag, of shape (bag_size, ...).
    • Y: Label of the bag.
    • y_inst: Instance labels of the bag, of shape (bag_size, ...).
    • adj: Adjacency matrix of the bag. It is a sparse COO tensor of shape (bag_size, bag_size). If norm_adj=True, the adjacency matrix is normalized.
    • coords: Coordinates of the bag, of shape (bag_size, coords_dim).
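For readers unfamiliar with the COO layout mentioned for adj, the sketch below shows the two pieces such a format stores, using NumPy only. The example matrix is made up; in PyTorch the same indices and values are what torch.sparse_coo_tensor takes, and to_dense() recovers the full matrix.

```python
import numpy as np

# Made-up adjacency matrix for a bag with three instances.
dense_adj = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])

rows, cols = np.nonzero(dense_adj)
indices = np.stack([rows, cols])   # shape (2, nnz): row and column of each non-zero
values = dense_adj[rows, cols]     # shape (nnz,): the non-zero entries themselves

# Rebuild the dense matrix from the COO pieces to confirm the round trip.
rebuilt = np.zeros_like(dense_adj)
rebuilt[indices[0], indices[1]] = values
print(indices.shape, values.shape)
```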