Processed MIL dataset
torchmil.datasets.ProcessedMILDataset
Bases: Dataset
This class represents a general MIL dataset where the bags have been processed and saved as numpy or .h5 files. It enforces strict data availability for core components, failing fast if expected files are missing.
MIL processing and directory structure. The dataset expects pre-processed bags saved as individual numpy or .h5 files.
- A feature file should yield an array of shape (bag_size, ...), where ... represents the shape of the features.
- A label file should yield an array of arbitrary shape, e.g., (1,) for binary classification.
- An instance label file should yield an array of shape (bag_size, ...), where ... represents the shape of the instance labels.
- A coordinates file should yield an array of shape (bag_size, coords_dim), where coords_dim is the dimension of the coordinates.
Bag keys and directory structure. The dataset can be initialized with a list of bag keys, which are used to choose which data to load. This dataset expects the following directory structure:
features_path/ (if "X" in bag_keys)
├── bag1.ext
├── bag2.ext
└── ...
labels_path/ (if "Y" in bag_keys)
├── bag1.ext
├── bag2.ext
└── ...
inst_labels_path/ (if "y_inst" in bag_keys)
├── bag1.ext
├── bag2.ext
└── ...
coords_path/ (if "coords" or "adj" in bag_keys)
├── bag1.ext
├── bag2.ext
└── ...
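To make the expected layout concrete, the following sketch writes a few random bags to disk as .npy files, one file per bag in each directory. The directory names (`features`, `labels`, `inst_labels`, `coords`), the helper name `write_toy_dataset`, and the rule used to derive the bag label are assumptions made for illustration; only the array shapes follow the description above.

```python
import os

import numpy as np


def write_toy_dataset(root, n_bags=3, feat_dim=8, coords_dim=2, seed=0):
    """Write random bags to disk, one .npy file per bag and per component."""
    rng = np.random.default_rng(seed)
    # Hypothetical directory names; any four paths can be passed to the dataset.
    dirs = {
        key: os.path.join(root, key)
        for key in ("features", "labels", "inst_labels", "coords")
    }
    for path in dirs.values():
        os.makedirs(path, exist_ok=True)

    bag_names = []
    for i in range(n_bags):
        name = f"bag{i + 1}"
        bag_size = int(rng.integers(4, 10))
        arrays = {
            # (bag_size, feat_dim)
            "features": rng.normal(size=(bag_size, feat_dim)).astype(np.float32),
            # (bag_size,)
            "inst_labels": rng.integers(0, 2, size=(bag_size,)),
            # (bag_size, coords_dim)
            "coords": rng.integers(0, 5, size=(bag_size, coords_dim)),
        }
        # Bag label of shape (1,): positive if any instance is positive
        # (an assumption of this toy example, not a requirement of the dataset).
        arrays["labels"] = np.array([arrays["inst_labels"].max()])
        for key, array in arrays.items():
            np.save(os.path.join(dirs[key], name + ".npy"), array)
        bag_names.append(name)
    return dirs, bag_names
```

Every component of a bag is stored under the same file name (`bag1.npy`, `bag2.npy`, ...), so the bag name alone is enough to locate all of its files.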
Adjacency matrix. If the coordinates of the instances are available, the adjacency matrix will be built using the Euclidean distance between the coordinates. Formally, the adjacency matrix \(\mathbf{A} = \left[ A_{ij} \right]\) is defined as:
\[
A_{ij} = \begin{cases} d_{ij}, & \text{if } \left\| \mathbf{c}_i - \mathbf{c}_j \right\| \leq \text{dist_thr}, \\ 0, & \text{otherwise}, \end{cases} \quad d_{ij} = \begin{cases} 1, & \text{if } \text{adj_with_dist=False}, \\ \exp\left( -\frac{\left\| \mathbf{x}_i - \mathbf{x}_j \right\|}{d} \right), & \text{if } \text{adj_with_dist=True}. \end{cases}
\]
where \(\mathbf{c}_i\) and \(\mathbf{c}_j\) are the coordinates of the instances \(i\) and \(j\), respectively, \(\text{dist_thr}\) is a threshold distance, and \(\mathbf{x}_i \in \mathbb{R}^d\) and \(\mathbf{x}_j \in \mathbb{R}^d\) are the features of instances \(i\) and \(j\), respectively.
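The definition can be read as the following NumPy sketch. It is a dense, literal transcription of the formula, not the library's `_build_adj`: the function name `adjacency_from_formula` is hypothetical, the result is a dense array rather than a sparse tensor, no normalization is applied, and the diagonal is kept because the formula as written does not exclude \(i = j\).

```python
import numpy as np


def adjacency_from_formula(coords, feats=None, dist_thr=1.5, adj_with_dist=False):
    """Dense adjacency matrix following the definition of A_ij above."""
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances between coordinates, shape (bag_size, bag_size).
    coord_dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    neighbours = coord_dist <= dist_thr

    if adj_with_dist:
        if feats is None:
            raise ValueError("feats is required when adj_with_dist=True")
        feats = np.asarray(feats, dtype=float)
        d = feats.shape[1]  # feature dimension, the d in the formula
        feat_dist = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
        weights = np.exp(-feat_dist / d)
    else:
        weights = np.ones_like(coord_dist)

    # d_ij where the coordinates are within dist_thr, 0 elsewhere.
    return np.where(neighbours, weights, 0.0)
```

With the default `dist_thr=1.5` and integer grid coordinates, two instances are connected when they are horizontal, vertical, or diagonal neighbours, since \(\sqrt{2} \approx 1.41 \leq 1.5\).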
How bags are built.
When the __getitem__ method is called, the bag is built as follows (pseudocode):
1. The __getitem__ method is called with an index.
2. The bag name is retrieved from the list of bag names.
3. The _build_bag method is called with the bag name:
   3.1. The _build_bag method loads the bag from disk using the _load_bag method. This method loads the features, labels, instance labels and coordinates from disk using the _load_features, _load_labels, _load_inst_labels and _load_coords methods.
   3.2. If the coordinates have been provided, it builds the adjacency matrix using the _build_adj method.
4. The bag is returned as a dictionary containing the keys defined in bag_keys and their corresponding values.
This behaviour can be extended or modified by overriding the corresponding methods.
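The flow above can be sketched with a small stand-in class. `ToyBagDataset` is hypothetical and is not torchmil's implementation: it reuses the method names `_load_bag`, `_build_adj` and `_build_bag` only to mirror the steps, folds the four per-component loaders into a single `_load` helper, returns plain NumPy arrays in a dict, and builds a dense binary adjacency matrix.

```python
import os

import numpy as np


class ToyBagDataset:
    """Illustrative stand-in that mirrors the bag-building steps."""

    def __init__(self, paths, bag_names, bag_keys, read_file_fn, dist_thr=1.5):
        self.paths = paths  # e.g. {"features": "...", "coords": "..."}
        self.bag_names = bag_names
        self.bag_keys = bag_keys
        self.read_file_fn = read_file_fn
        self.dist_thr = dist_thr

    def _load(self, name, key_type):
        path = os.path.join(self.paths[key_type], name + ".npy")
        return self.read_file_fn(path, key_type)

    def _load_bag(self, name):
        # Step 3.1: load only the components requested through bag_keys.
        bag = {}
        if "X" in self.bag_keys:
            bag["X"] = self._load(name, "features")
        if "Y" in self.bag_keys:
            bag["Y"] = self._load(name, "labels")
        if "y_inst" in self.bag_keys:
            bag["y_inst"] = self._load(name, "inst_labels")
        if "coords" in self.bag_keys or "adj" in self.bag_keys:
            bag["coords"] = self._load(name, "coords")
        return bag

    def _build_adj(self, coords):
        # Step 3.2: binary adjacency from pairwise coordinate distances.
        dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        return (dist <= self.dist_thr).astype(np.float32)

    def _build_bag(self, name):
        bag = self._load_bag(name)
        if "adj" in self.bag_keys:
            bag["adj"] = self._build_adj(bag["coords"])
        # Step 4: keep only the keys the user asked for.
        return {key: value for key, value in bag.items() if key in self.bag_keys}

    def __len__(self):
        return len(self.bag_names)

    def __getitem__(self, index):
        # Steps 1 and 2: map the index to a bag name, then build the bag.
        return self._build_bag(self.bag_names[index])
```

In this sketch, changing how the graph is built only requires a subclass that overrides `_build_adj`; the rest of the flow is untouched, which is the extension pattern the text describes.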
Custom file reading.
By default, the dataset supports reading .npy and .h5 files using the default_read_file function.
However, users can provide a custom file reading function through the read_file_fn argument in the constructor.
This function must take as input the file path and the key type (one of 'features', 'labels', 'inst_labels', 'coords') and return the corresponding data as a numpy array.
import numpy as np

def custom_read_file(file_path: str, key_type: str) -> np.ndarray:
    # Custom logic to read the file and return the data as a numpy array
    ...
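As one possible instance of such a function, the sketch below still reads .npy files but post-processes the array depending on the key type. The function name `casting_read_file` and the specific conversions (float32 features, labels with at least one dimension) are illustrative choices, not behaviour required by the dataset.

```python
import numpy as np

# Key types listed in the description of read_file_fn.
VALID_KEY_TYPES = ("features", "labels", "inst_labels", "coords")


def casting_read_file(file_path: str, key_type: str) -> np.ndarray:
    """Read a .npy file and adjust dtype or shape depending on the key type."""
    if key_type not in VALID_KEY_TYPES:
        raise ValueError(f"unknown key type: {key_type!r}")
    data = np.load(file_path)
    if key_type == "features":
        # Store features in any dtype on disk, hand them over as float32.
        data = data.astype(np.float32)
    elif key_type == "labels":
        # Turn a scalar label saved as a 0-d array into shape (1,).
        data = np.atleast_1d(data)
    return data
```

The function would then be passed to the constructor as `read_file_fn=casting_read_file`.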
__init__(features_path=None, labels_path=None, inst_labels_path=None, coords_path=None, bag_names=None, bag_keys=['X', 'Y', 'y_inst', 'adj', 'coords'], file_ext='.npy', dist_thr=1.5, adj_with_dist=False, norm_adj=True, load_at_init=False, read_file_fn=default_read_file, verbose=True)
Class constructor.
Parameters:
- features_path (str, default: None) – Path to the directory containing the features.
- labels_path (str, default: None) – Path to the directory containing the bag labels.
- inst_labels_path (str, default: None) – Path to the directory containing the instance labels.
- coords_path (str, default: None) – Path to the directory containing the coordinates.
- bag_keys (list, default: ['X', 'Y', 'y_inst', 'adj', 'coords']) – List of keys to load the bags data. The TensorDict returned by the __getitem__ method will have these keys. Possible keys are:
    - "X": Load the features of the bag.
    - "Y": Load the label of the bag.
    - "y_inst": Load the instance labels of the bag.
    - "adj": Load the adjacency matrix of the bag. It requires the coordinates to be loaded.
    - "coords": Load the coordinates of the bag.
- file_ext (str, default: '.npy') – File type of files to be loaded. Can be '.npy' or '.h5'.
- bag_names (list, default: None) – List of bag names to load. If None, all bags from the features_path are loaded.
- dist_thr (float, default: 1.5) – Distance threshold for building the adjacency matrix.
- adj_with_dist (bool, default: False) – If True, the adjacency matrix is built using the Euclidean distance between the instance features. If False, the adjacency matrix is binary.
- norm_adj (bool, default: True) – If True, normalize the adjacency matrix.
- load_at_init (bool, default: False) – If True, load the bags at initialization. If False, load the bags on demand.
- read_file_fn (callable, default: default_read_file) – Function to read the files from disk. It must take as input the file path and the key type (one of 'features', 'labels', 'inst_labels', 'coords') and return the corresponding data as a numpy array.
- verbose (bool, default: True) – If True, warning messages are displayed.
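A call to the constructor might look like the sketch below. The parameter names and default values are taken from the signature above; the wrapper `make_dataset`, the `root` argument and the four subdirectory names are assumptions for illustration. The import sits inside the function so that the snippet can be defined without torchmil installed.

```python
import os


def make_dataset(root, bag_names=None):
    """Build a ProcessedMILDataset from four sibling directories under root."""
    # Imported lazily; requires torchmil to be installed when called.
    from torchmil.datasets import ProcessedMILDataset

    return ProcessedMILDataset(
        # Hypothetical directory names under root.
        features_path=os.path.join(root, "features"),
        labels_path=os.path.join(root, "labels"),
        inst_labels_path=os.path.join(root, "inst_labels"),
        coords_path=os.path.join(root, "coords"),
        bag_names=bag_names,  # None: use all bags found in features_path
        bag_keys=["X", "Y", "y_inst", "adj", "coords"],
        file_ext=".npy",
        dist_thr=1.5,
        adj_with_dist=False,
        norm_adj=True,
        load_at_init=False,  # read bags from disk on demand
        verbose=True,
    )
```

Dropping "adj" and "coords" from `bag_keys` should, according to the directory listing above, remove the need for `coords_path`.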
__getitem__(index)
Parameters:
- index (int) – Index of the bag to retrieve.
Returns:
- bag_dict (TensorDict) – Dictionary containing the keys defined in bag_keys and their corresponding values.
    - X: Features of the bag, of shape (bag_size, ...).
    - Y: Label of the bag.
    - y_inst: Instance labels of the bag, of shape (bag_size, ...).
    - adj: Adjacency matrix of the bag. It is a sparse COO tensor of shape (bag_size, bag_size). If norm_adj=True, the adjacency matrix is normalized.
    - coords: Coordinates of the bag, of shape (bag_size, coords_dim).
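The shapes listed above all share the same leading bag_size, which makes a quick consistency check possible. The helper `check_bag_shapes` below is a hypothetical utility, not part of torchmil; it relies only on each entry exposing a `.shape` attribute, so it is demonstrated here on NumPy arrays.

```python
import numpy as np


def check_bag_shapes(bag):
    """Return the common bag_size, or raise ValueError if the entries disagree."""
    sizes = {}
    for key in ("X", "y_inst", "coords"):
        if key in bag:
            sizes[key] = bag[key].shape[0]
    if "adj" in bag:
        rows, cols = bag["adj"].shape
        if rows != cols:
            raise ValueError(f"adj must be square, got {(rows, cols)}")
        sizes["adj"] = rows
    if len(set(sizes.values())) > 1:
        raise ValueError(f"inconsistent bag_size across keys: {sizes}")
    # None when the bag holds none of the checked keys (e.g. only "Y").
    return next(iter(sizes.values()), None)


# A toy bag with five instances, built by hand for the check.
bag = {
    "X": np.zeros((5, 16)),
    "Y": np.array([1]),
    "y_inst": np.zeros(5),
    "adj": np.eye(5),
    "coords": np.zeros((5, 2)),
}
bag_size = check_bag_shapes(bag)
```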