
Video Classification Dataset

torchmil.datasets.VideoClassificationDataset

Bases: ProcessedMILDataset

This class represents a dataset of videos for Multiple Instance Learning (MIL).

MIL and Video Classification. Videos are sequences of frames that capture motion and temporal information. In the context of MIL, a video is considered a bag, and the frames are considered instances.

Directory structure. The bags are assumed to have been processed and saved as NumPy (.npy) files, one file per video in each directory. For more information on how the bags are processed, refer to the ProcessedMILDataset class. This dataset expects the following directory structure:

features_path
├── video1.npy
├── video2.npy
└── ...
labels_path
├── video1.npy
├── video2.npy
└── ...
frame_labels_path
├── video1.npy
├── video2.npy
└── ...
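
As a rough illustration of the layout above, the following sketch writes a few toy bags to disk with NumPy. The helper name `write_toy_video_bags`, the sub-directory names, and the array shapes chosen for the labels (a scalar bag label and a 1-D vector of frame labels) are assumptions made for this example; the exact format expected by ProcessedMILDataset should be checked against its documentation.

```python
from pathlib import Path

import numpy as np


def write_toy_video_bags(root, n_videos=2, n_frames=5, feat_dim=8, seed=0):
    """Write random toy bags under root (hypothetical helper, not part of torchmil)."""
    rng = np.random.default_rng(seed)
    root = Path(root)
    # Hypothetical directory names; any three directories would do.
    dirs = {name: root / name for name in ("features", "labels", "frame_labels")}
    for d in dirs.values():
        d.mkdir(parents=True, exist_ok=True)

    names = []
    for k in range(1, n_videos + 1):
        name = f"video{k}"
        # One row of features per frame, kept in temporal order.
        X = rng.normal(size=(n_frames, feat_dim)).astype(np.float32)
        # Assumed label format: one binary label per frame.
        y_inst = rng.integers(0, 2, size=n_frames)
        # Standard MIL assumption: the bag is positive if any frame is positive.
        Y = np.array(y_inst.max())
        np.save(dirs["features"] / f"{name}.npy", X)
        np.save(dirs["labels"] / f"{name}.npy", Y)
        np.save(dirs["frame_labels"] / f"{name}.npy", y_inst)
        names.append(name)
    return dirs, names
```

The same file name is used in all three directories, so a video's features, bag label, and frame labels can be matched by name.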

Order of the frames and the adjacency matrix. This dataset assumes that the frames of each video are stored in temporal order. An adjacency matrix \(\mathbf{A} = \left[ A_{ij} \right]\) that connects consecutive frames is built from this ordering:

\[\begin{equation} A_{ij} = \begin{cases} d_{ij}, & \text{if } \lvert i - j \rvert = 1, \\ 0, & \text{otherwise}, \end{cases} \quad d_{ij} = \begin{cases} 1, & \text{if } \text{adj_with_dist=False}, \\ \exp\left( -\frac{\left\| \mathbf{x}_i - \mathbf{x}_j \right\|}{d} \right), & \text{if } \text{adj_with_dist=True}. \end{cases} \end{equation}\]

where \(\mathbf{x}_i \in \mathbb{R}^d\) and \(\mathbf{x}_j \in \mathbb{R}^d\) are the features of instances \(i\) and \(j\), respectively, and \(d\) is the feature dimension.
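
The definition above can be sketched in plain NumPy. This is a minimal dense illustration of the formula, not the library's implementation; the function name `chain_adjacency` is hypothetical, and `X` is assumed to be a 2-D array of shape `(bag_size, d)`.

```python
import numpy as np


def chain_adjacency(X, adj_with_dist=False):
    """Dense adjacency linking consecutive frames (illustrative sketch)."""
    n, d = X.shape
    A = np.zeros((n, n), dtype=np.float64)
    if n < 2:
        return A  # a single frame has no neighbours

    idx = np.arange(n - 1)
    if adj_with_dist:
        # Euclidean distance between each frame and the next one,
        # scaled by the feature dimension d as in the equation.
        dist = np.linalg.norm(X[1:] - X[:-1], axis=1)
        weights = np.exp(-dist / d)
    else:
        weights = np.ones(n - 1)

    # Only entries with |i - j| = 1 are nonzero; the matrix is symmetric.
    A[idx, idx + 1] = weights
    A[idx + 1, idx] = weights
    return A
```

For a bag with `n` frames the matrix has at most `2 * (n - 1)` nonzero entries, which is why a sparse representation is a natural fit.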

__init__(features_path, labels_path, frame_labels_path=None, video_names=None, bag_keys=['X', 'Y', 'y_inst', 'adj', 'coords'], adj_with_dist=False, norm_adj=True, load_at_init=True)

Class constructor.

Parameters:

  • features_path (str) –

    Path to the directory containing the feature matrices of the videos.

  • labels_path (str) –

    Path to the directory containing the labels of the videos.

  • frame_labels_path (str, default: None ) –

    Path to the directory containing the labels of the frames.

  • video_names (list, default: None ) –

    List of the names of the videos to load. If None, all videos in the features_path directory are loaded.

  • bag_keys (list, default: ['X', 'Y', 'y_inst', 'adj', 'coords'] ) –

    List of keys to include in each bag. Each key must be one of ['X', 'Y', 'y_inst', 'adj', 'coords'].

  • adj_with_dist (bool, default: False ) –

    If True, the nonzero entries of the adjacency matrix are weighted using the Euclidean distance between the features of consecutive frames. If False, the adjacency matrix is binary.

  • norm_adj (bool, default: True ) –

    If True, normalize the adjacency matrix.

  • load_at_init (bool, default: True ) –

    If True, load the bags at initialization. If False, load the bags on demand.
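
A possible way to call the constructor with the parameters listed above is sketched below. The wrapper `load_video_dataset` and the example paths are hypothetical; the keyword arguments are taken from the signature documented here, and the choice of `bag_keys` is only one plausible configuration.

```python
def load_video_dataset(features_path, labels_path, frame_labels_path=None,
                       bag_keys=("X", "Y", "adj")):
    """Build a VideoClassificationDataset (hypothetical convenience wrapper)."""
    # Imported lazily so this sketch can be defined without torchmil installed.
    from torchmil.datasets import VideoClassificationDataset

    return VideoClassificationDataset(
        features_path=features_path,
        labels_path=labels_path,
        frame_labels_path=frame_labels_path,
        video_names=None,        # None loads every video found in features_path
        bag_keys=list(bag_keys),
        adj_with_dist=True,      # weight edges by feature distance
        norm_adj=True,
        load_at_init=False,      # defer loading until a bag is requested
    )


# Example call (paths are placeholders):
# dataset = load_video_dataset("data/features", "data/labels")
# bag = dataset[0]
# features, adjacency = bag["X"], bag["adj"]
```

Setting `load_at_init=False` may be preferable for large collections, since bags are then read from disk only when indexed.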

__getitem__(index)

Parameters:

  • index (int) –

    Index of the bag to retrieve.

Returns:

  • bag_dict ( TensorDict ) –

    Dictionary containing the keys defined in bag_keys and their corresponding values.

    • X: Features of the bag, of shape (bag_size, ...).
    • Y: Label of the bag.
    • y_inst: Instance labels of the bag, of shape (bag_size, ...).
    • adj: Adjacency matrix of the bag. It is a sparse COO tensor of shape (bag_size, bag_size). If norm_adj=True, the adjacency matrix is normalized.
    • coords: Coordinates of the bag, of shape (bag_size, coords_dim).
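
To make the `adj` entry more concrete, the sketch below normalizes a dense adjacency matrix and converts it to COO form (an index array plus a value array). The documentation does not state which normalization is applied when `norm_adj=True`; the symmetric normalization \(\mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}\) used here is an assumption, chosen because it is common in graph-based models. The function name `normalize_and_to_coo` is hypothetical.

```python
import numpy as np


def normalize_and_to_coo(A, eps=1e-12):
    """Symmetric normalization followed by COO extraction (illustrative sketch)."""
    # Degree of each node; eps guards against division by zero for isolated nodes.
    deg = A.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, eps))
    # Assumed normalization: D^{-1/2} A D^{-1/2}.
    A_norm = inv_sqrt[:, None] * A * inv_sqrt[None, :]

    # COO format stores only the nonzero entries: their (row, col) indices and values.
    rows, cols = np.nonzero(A_norm)
    indices = np.stack([rows, cols])   # shape (2, nnz)
    values = A_norm[rows, cols]        # shape (nnz,)
    return indices, values, A.shape
```

The returned triple mirrors what a sparse COO tensor holds: indices of shape `(2, nnz)`, values of shape `(nnz,)`, and the overall size `(bag_size, bag_size)`.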