datasets package¶

Submodules¶

datasets.bcss_data_module module¶

Module containing the data module for the BCSS dataset

class datasets.bcss_data_module.BCSSDataModule(*args, **kwargs)[source]¶

Bases: datasets.data_module.ActiveLearningDataModule

Initializes the BCSS data module.

Parameters

data_dir – Path of the directory that contains the data.
batch_size – Batch size.
num_workers – Number of workers for DataLoader.
cache_size (int, optional) – Number of images to keep in memory between epochs to speed up data loading (default = 0).
active_learning_mode (bool, optional) – Whether the datamodule should be configured for active learning or for conventional model training (default = False).
initial_training_set_size (int, optional) – Initial size of the training set if the active learning mode is activated.
pin_memory (bool, optional) – pin_memory parameter as defined by the PyTorch DataLoader class.
shuffle (bool, optional) – Flag if the data should be shuffled.
channels (int, optional) – Number of channels of the images. 3 means RGB, 1 means greyscale.
image_shape (tuple, optional) – Shape of the image.
target_label (int, optional) – The label to use for learning. Details are in BCSSDataset.
combine_foreground_classes (bool, optional) – Flag if the non-zero values of the annotations should be merged (default = False).
val_set_size (float, optional) – The size of the validation set (default = 0.3).
stratify (bool, optional) – The option to stratify the train val split by the institutes.
random_state (int, optional) – Controls the data splitting and shuffling. Pass an int for reproducible output across multiple runs.
**kwargs – Further, dataset specific parameters.

static build_stratification_labels(image_paths)[source]¶: Build a list with class labels used for a stratified split
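The interplay of val_set_size, stratify and random_state can be illustrated with a small, framework-free sketch. It assumes that one stratification label per image (for example the donating institute) is already available; the function name stratified_train_val_split is hypothetical and is not part of this package.

```python
import random
from collections import defaultdict


def stratified_train_val_split(items, labels, val_set_size=0.3, random_state=None):
    """Split items so that every label keeps roughly the same share in both parts.

    Hypothetical helper for illustration only; not the package's implementation.
    """
    rng = random.Random(random_state)

    # Group the items by their stratification label.
    groups = defaultdict(list)
    for item, label in zip(items, labels):
        groups[label].append(item)

    train, val = [], []
    # Iterate over the labels in sorted order to keep the result reproducible.
    for label in sorted(groups):
        members = list(groups[label])
        rng.shuffle(members)
        n_val = round(len(members) * val_set_size)
        val.extend(members[:n_val])
        train.extend(members[n_val:])
    return train, val
```

Passing the same integer as random_state yields the same split on every call, which mirrors the reproducibility note in the parameter description.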

data_channels()[source]¶: Returns the number of channels

static discover_paths(image_dir, mask_dir)[source]¶

Discover the .png files in a given directory.

Parameters

image_dir – The directory to the images.
mask_dir – The directory to the annotations.

Returns

list of file paths as tuple of image paths, annotation paths
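A minimal sketch of this kind of discovery, assuming that every annotation has the same file name as its image. The pairing rule and the function name discover_png_pairs are assumptions made for illustration and may differ from the actual implementation.

```python
from pathlib import Path


def discover_png_pairs(image_dir, mask_dir):
    """Return (image_paths, annotation_paths) for .png files found in both folders."""
    image_dir, mask_dir = Path(image_dir), Path(mask_dir)
    image_paths, annotation_paths = [], []
    for image_path in sorted(image_dir.glob("*.png")):
        # Assumption: the annotation uses the same file name as the image.
        mask_path = mask_dir / image_path.name
        if mask_path.is_file():
            image_paths.append(image_path)
            annotation_paths.append(mask_path)
    return image_paths, annotation_paths
```

Images without a matching annotation are skipped in this sketch, so both returned lists always have the same length and order.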

id_to_class_names()[source]¶

Returns: A mapping of class indices to descriptive class names.
Return type: Dict[int, str]

label_items(ids, pseudo_labels=None)[source]¶: Moves the given samples from the unlabeled dataset to the labeled dataset.

multi_label()[source]¶

Returns: Whether the dataset is a multi-label or a single-label dataset.
Return type: bool

train_dataloader()[source]¶

Returns: PyTorch dataloader or Keras sequence representing the training set.

datasets.bcss_data_module.copy_test_set_to_separate_folder(source_dir, target_dir)[source]¶

Reproduces the test set used in the baseline implementation of the challenge, by copying the scans of the respective institution into a separate folder.

Parameters

source_dir (str) – Directory where all the downloaded images and masks are stored.
target_dir (str) – Directory where to store the test data.
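The copying step could look roughly like the sketch below. It assumes TCGA-style file names whose second dash-separated field identifies the institution; the institute codes AA and BB are placeholders, not the codes used by the challenge baseline, and copy_files_of_institutes is a hypothetical name.

```python
import shutil
from pathlib import Path

# Placeholder institute codes, used only for illustration.
EXAMPLE_TEST_INSTITUTES = frozenset({"AA", "BB"})


def institute_code(file_name):
    """Return the second dash-separated field, e.g. 'TCGA-AA-0001-DX1.png' -> 'AA'."""
    parts = file_name.split("-")
    return parts[1] if len(parts) > 1 else ""


def copy_files_of_institutes(source_dir, target_dir, institutes=EXAMPLE_TEST_INSTITUTES):
    """Copy all .png files of the given institutes, keeping the sub-folder layout."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    copied = []
    # sorted() materialises the file list before any copy is made.
    for path in sorted(source_dir.rglob("*.png")):
        if institute_code(path.name) in institutes:
            destination = target_dir / path.relative_to(source_dir)
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, destination)
            copied.append(destination)
    return copied
```

The sketch copies rather than moves the files, so the source directory stays complete.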

datasets.bcss_dataset module¶

Module to load and batch the BCSS dataset

class datasets.bcss_dataset.BCSSDataset(image_paths, annotation_paths, cache_size=0, target_label=1, is_unlabeled=False, shuffle=True, channels=3, image_shape=(300, 300), random_state=None)[source]¶

Bases: torch.utils.data.dataset.Dataset[torch.utils.data.dataset.T_co]

The BCSS dataset contains over 20,000 segmentation annotations of tissue regions from breast cancer images from TCGA. A detailed description can be found on the challenge website or on GitHub.

Parameters

image_paths (List[Path]) – List with all images to load, can be obtained by datasets.bcss_data_module.BCSSDataModule.discover_paths().
annotation_paths (List[Path]) – List with all annotations to load, can be obtained by datasets.bcss_data_module.BCSSDataModule.discover_paths().
target_label (int, optional) –
The label to use for learning. The following labels are in the annotations:
- outside_roi 0
- tumor 1
- stroma 2
- lymphocytic_infiltrate 3
- necrosis_or_debris 4
- glandular_secretions 5
- blood 6
- exclude 7
- metaplasia_NOS 8
- fat 9
- plasma_cells 10
- other_immune_infiltrate 11
- mucoid_material 12
- normal_acinus_or_duct 13
- lymphatics 14
- undetermined 15
- nerve 16
- skin_adnexa 17
- blood_vessel 18
- angioinvasion 19
- dcis 20
- other 21
is_unlabeled (bool, optional) – Whether the dataset is used as “unlabeled” for the active learning loop.
shuffle (bool, optional) – Whether the data should be shuffled.
channels (int, optional) – Number of channels of the images. 3 means RGB, 1 means greyscale.
image_shape (tuple, optional) – Shape of the image.
random_state (int, optional) – Controls the data shuffling. Pass an int for reproducible output across multiple runs.
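One plausible reading of target_label is that the multi-class annotation is reduced to a binary mask for the selected class. The sketch below illustrates that reading on nested lists; it is an assumption about the behaviour, and LABEL_IDS and binary_mask are hypothetical names.

```python
# Excerpt of the label list documented above (name -> ID).
LABEL_IDS = {
    "outside_roi": 0,
    "tumor": 1,
    "stroma": 2,
    "lymphocytic_infiltrate": 3,
    "necrosis_or_debris": 4,
}


def binary_mask(annotation, target_label=1):
    """Return a mask that is 1 where the annotation equals target_label and 0 elsewhere."""
    return [[1 if value == target_label else 0 for value in row] for row in annotation]


# Usage: select the stroma class from a tiny 2 x 3 annotation.
example_annotation = [[0, 1, 2], [1, 1, 3]]
stroma_mask = binary_mask(example_annotation, target_label=LABEL_IDS["stroma"])
```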

add_image(image_path, annotation_path)[source]¶

Adds an image to this dataset.

Parameters

image_path – Path of the image to be added.
annotation_path – Path of the annotation of the image to be added.

Returns

None. Raises ValueError if image already exists.

static get_case_id(filepath)[source]¶: Gets the case ID for a given filepath.

static get_institute_name(filepath)[source]¶: Gets the name of the institute which donated the image.

image_ids()[source]¶: Returns the case ID of each image.

static normalize(img)[source]¶

Normalizes an image by:

Dividing by the mean value

Subtracting the std

Parameters: img – The input image that should be normalized.
Returns: Normalized image with background values normalized to -1

num_pseudo_labels()[source]¶

Returns: Number of items with pseudo-labels in the dataset.
Return type: int

reinforce_type(expected_type)¶: Reinforces the type for a DataPipe instance. ‘expected_type’ must be a subtype of the original type hint, which restricts the type requirement of the DataPipe instance.

remove_image(image_path, annotation_path)[source]¶

Removes an image from this dataset.

Parameters

image_path – Path of the image to be removed.
annotation_path – Path of the annotation of the image to be removed.

Returns

None. Raises ValueError if image does not exist.

size()[source]¶

Returns: Size of the dataset.
Return type: int

slices_per_image(**kwargs)[source]¶: For each image returns the number of slices

datasets.brats_data_module module¶

Module containing the data module for BraTS data

class datasets.brats_data_module.BraTSDataModule(*args, **kwargs)[source]¶

Bases: datasets.data_module.ActiveLearningDataModule

Initializes the BraTS data module.

Parameters

data_dir (string) – Path of the directory that contains the data.
batch_size (int) – Batch size.
num_workers (int) – Number of workers for DataLoader.
active_learning_mode (bool, optional) – Whether the datamodule should be configured for active learning or for conventional model training (default = False).
batch_size_unlabeled_set (int, optional) – Batch size for the unlabeled set. Defaults to batch_size.
cache_size (int, optional) – Number of images to keep in memory between epochs to speed up data loading (default = 0).
initial_training_set_size (int, optional) – Initial size of the training set if the active learning mode is activated.
pin_memory (bool, optional) – pin_memory parameter as defined by the PyTorch DataLoader class.
shuffle (boolean) – Flag if the data should be shuffled.
dim (int) – 2 or 3 to define whether the datasets should return 2D slices or whole 3D images.
combine_foreground_classes (bool, optional) – Flag if the non-zero values of the annotations should be merged (default = False).
mask_filter_values (Tuple[int], optional) – Values from the annotations which should be used. Defaults to using all values.
random_state (int, optional) – Random state for splitting the data into an initial training set and an unlabeled set and for shuffling the data. Pass an int for reproducibility across runs.
only_return_true_labels (bool, optional) – Whether only true labels or also pseudo-labels are to be returned. Defaults to False.
**kwargs – Further, dataset specific parameters.

static discover_paths(dir_path, modality='flair', random_samples=None)[source]¶

Discover the .nii.gz file paths with a given modality

Parameters

dir_path – Directory to discover paths in.
modality (string, optional) – Modality of the scan.
random_samples – Number of random samples to draw from the dataset.

Returns

list of files as tuple of image paths, annotation paths
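A rough sketch of modality-based discovery. It assumes file names of the form <case>_<modality>.nii.gz with the annotation stored as <case>_seg.nii.gz next to the image; this naming scheme, the extra random_state argument and the name discover_nifti_pairs are assumptions for illustration.

```python
import random
from pathlib import Path


def discover_nifti_pairs(dir_path, modality="flair", random_samples=None, random_state=None):
    """Return (image_paths, annotation_paths) for all scans of one modality."""
    suffix = f"_{modality}.nii.gz"
    pairs = []
    for image_path in sorted(Path(dir_path).rglob(f"*{suffix}")):
        # Assumption: the annotation shares the case prefix and ends in '_seg.nii.gz'.
        case_prefix = image_path.name[: -len(suffix)]
        annotation_path = image_path.with_name(case_prefix + "_seg.nii.gz")
        if annotation_path.is_file():
            pairs.append((image_path, annotation_path))

    # Optionally draw a random subset of the discovered cases.
    if random_samples is not None and random_samples < len(pairs):
        pairs = random.Random(random_state).sample(pairs, random_samples)

    return [image for image, _ in pairs], [annotation for _, annotation in pairs]
```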

id_to_class_names()[source]¶

Returns: A mapping of class indices to descriptive class names.
Return type: Dict[int, str]

label_items(ids, pseudo_labels=None)[source]¶: Moves the given samples from the unlabeled dataset to the labeled dataset.

multi_label()[source]¶

Returns: Whether the dataset is a multi-label or a single-label dataset.
Return type: bool

train_dataloader()[source]¶

Returns: PyTorch dataloader or Keras sequence representing the training set.

datasets.collate module¶

Module to collate batches

datasets.collate.batch_padding_collate_fn(batch, pad_value=0)[source]¶

Collates a batch and pads tensors to the same size before stacking them.

Parameters: batch (List[Union[tuple, str, torch.Tensor]]) – The batch in List form.
Returns: The batch collated.

datasets.data_module module¶

Module containing abstract classes for the data modules

class datasets.data_module.ActiveLearningDataModule(*args, **kwargs)[source]¶

Bases: pytorch_lightning.core.datamodule.LightningDataModule, abc.ABC

Abstract base class to structure the dataset creation for active learning

Parameters

data_dir (str) – Path of the directory that contains the data.
batch_size (int) – Batch size.
num_workers (int) – Number of workers for DataLoader.
batch_size_unlabeled_set (int, optional) – Batch size for the unlabeled set. Defaults to batch_size.
active_learning_mode (bool, optional) – Whether the datamodule should be configured for active learning or for conventional model training (default = False).
initial_training_set_size (int, optional) – Initial size of the training set if the active learning mode is activated.
pin_memory (bool, optional) – pin_memory parameter as defined by the PyTorch DataLoader class.
shuffle (bool, optional) – Flag if the data should be shuffled.
**kwargs – Further, dataset specific parameters.

static data_channels()[source]¶

Can be overridden by subclasses if the data has multiple channels.

Returns: The number of data channels. Defaults to 1.

abstract id_to_class_names()[source]¶

Returns: A mapping of class indices to descriptive class names.
Return type: Dict[int, str]

abstract label_items(ids, pseudo_labels=None)[source]¶

Moves data items from the unlabeled set to one of the labeled sets (training, validation or test set).

Parameters

ids (List[str]) – IDs of the items to be labeled.
pseudo_labels (Dict[str, Any], optional) – Optional pseudo labels for (some of) the selected data items.

Returns

None.
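The bookkeeping behind label_items can be pictured with the following sketch, which tracks item IDs in plain sets. The class LabelPool is hypothetical; concrete data modules delegate this work to their dataset classes.

```python
class LabelPool:
    """Tracks which item IDs are labeled, unlabeled, or carry a pseudo label."""

    def __init__(self, labeled_ids, unlabeled_ids):
        self.labeled = set(labeled_ids)
        self.unlabeled = set(unlabeled_ids)
        self.pseudo_labels = {}

    def label_items(self, ids, pseudo_labels=None):
        """Move the given IDs from the unlabeled to the labeled pool."""
        pseudo_labels = pseudo_labels or {}
        for item_id in ids:
            if item_id not in self.unlabeled:
                raise ValueError(f"{item_id!r} is not part of the unlabeled set")
            self.unlabeled.remove(item_id)
            self.labeled.add(item_id)
            # Items without an entry in pseudo_labels keep their true label.
            if item_id in pseudo_labels:
                self.pseudo_labels[item_id] = pseudo_labels[item_id]
```

In an active learning loop, a query strategy would select the IDs and the data module would then be asked to label them before the next training round.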

abstract multi_label()[source]¶

Returns: Whether the dataset is a multi-label or a single-label dataset.
Return type: bool

num_classes()[source]¶

Returns: Number of classes.

setup(stage=None)[source]¶

Creates the datasets managed by this data module.

Parameters: stage – Current training stage.

test_dataloader()[source]¶

Returns: PyTorch dataloader or Keras sequence representing the test set.

test_set_size()[source]¶

Returns: Size of test set.

train_dataloader()[source]¶

Returns: PyTorch dataloader or Keras sequence representing the training set.

training_set_num_pseudo_labels()[source]¶

Returns: Number of pseudo-labels in training set.

training_set_size()[source]¶

Returns: Size of training set.

unlabeled_dataloader()[source]¶

Returns: PyTorch dataloader or Keras sequence representing the unlabeled set.

unlabeled_set_size()[source]¶

Returns: Number of unlabeled items.

val_dataloader()[source]¶

Returns: PyTorch dataloader or Keras sequence representing the validation set.

validation_set_size()[source]¶

Returns: Size of validation set.

datasets.dataset_hooks module¶

Module defining hooks that each dataset class should implement

class datasets.dataset_hooks.DatasetHooks[source]¶

Bases: abc.ABC

Class that defines hooks that should be implemented by each dataset class.

abstract image_ids()[source]¶

Returns: List of all image IDs included in the dataset.

abstract num_pseudo_labels()[source]¶

Returns: Number of items with pseudo-labels in the dataset.
Return type: int

abstract size()[source]¶

Returns: Size of the dataset.
Return type: int

abstract slices_per_image(**kwargs)[source]¶

Parameters

kwargs – Dataset specific parameters.

Returns

Number of slices that each image of the dataset contains. If a single integer value is provided, it is assumed that all images of the dataset have the same number of slices.

Return type

Union[int, List[int]]
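To illustrate the contract of the four hooks, the sketch below declares a stand-in abstract class with the same method names and a tiny in-memory implementation. Both class names are hypothetical; a real dataset would subclass datasets.dataset_hooks.DatasetHooks instead.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Union


class DatasetHooksSketch(ABC):
    """Stand-in that mirrors the four documented hooks."""

    @abstractmethod
    def image_ids(self) -> List[str]: ...

    @abstractmethod
    def num_pseudo_labels(self) -> int: ...

    @abstractmethod
    def size(self) -> int: ...

    @abstractmethod
    def slices_per_image(self, **kwargs) -> Union[int, List[int]]: ...


class InMemoryDataset(DatasetHooksSketch):
    """Toy dataset that only stores the slice count of each image."""

    def __init__(self, slice_counts: Dict[str, int], pseudo_labeled: Iterable[str] = ()):
        self._slice_counts = dict(slice_counts)
        self._pseudo_labeled = set(pseudo_labeled)

    def image_ids(self) -> List[str]:
        return list(self._slice_counts)

    def num_pseudo_labels(self) -> int:
        return len(self._pseudo_labeled)

    def size(self) -> int:
        # Here the dataset size is the total number of slices.
        return sum(self._slice_counts.values())

    def slices_per_image(self, **kwargs) -> Union[int, List[int]]:
        counts = list(self._slice_counts.values())
        # A single integer signals that every image has the same number of slices.
        return counts[0] if len(set(counts)) == 1 else counts
```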

datasets.decathlon_data_module module¶

Module containing the data module for decathlon data

class datasets.decathlon_data_module.DecathlonDataModule(*args, **kwargs)[source]¶

Bases: datasets.data_module.ActiveLearningDataModule

Initializes the Decathlon data module.

Parameters

data_dir (string) – Path of the directory that contains the data.
batch_size (int) – Batch size.
num_workers (int) – Number of workers for DataLoader.
task (str, optional) – The task from the medical segmentation decathlon.
active_learning_mode (bool, optional) – Whether the datamodule should be configured for active learning or for conventional model training (default = False).
batch_size_unlabeled_set (int, optional) – Batch size for the unlabeled set. Defaults to batch_size.
cache_size (int, optional) – Number of images to keep in memory between epochs to speed up data loading (default = 0).
initial_training_set_size (int, optional) – Initial size of the training set if the active learning mode is activated.
pin_memory (bool, optional) – pin_memory parameter as defined by the PyTorch DataLoader class.
shuffle (bool, optional) – Flag if the data should be shuffled.
dim (int) – 2 or 3 to define whether the datasets should return 2D slices or whole 3D images.
combine_foreground_classes (bool, optional) – Flag if the non-zero values of the annotations should be merged (default = False).
mask_filter_values (Tuple[int], optional) – Values from the annotations which should be used. Defaults to using all values.
random_state (int, optional) – Random state for splitting the data into an initial training set and an unlabeled set and for shuffling the data. Pass an int for reproducibility across runs.
only_return_true_labels (bool, optional) – Whether only true labels or also pseudo-labels are to be returned. Defaults to False.
**kwargs – Further, dataset specific parameters.

data_channels()[source]¶: Returns the number of data channels.

static discover_paths(dir_path, subset, random_samples=None, random_state=None)[source]¶

Discover the .nii.gz file paths from the corresponding JSON file.

Parameters

dir_path (str) – Directory the dataset is inside.
subset (Literal["train", "val", "test"]) – The subset of paths of the whole dataset.
random_samples (int, optional) – The number of random samples to draw from the dataset.

Returns

list of files as tuple of image paths, annotation paths
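A sketch of JSON-based discovery, assuming the layout commonly used by the Medical Segmentation Decathlon: a dataset.json file whose "training" entry lists objects with "image" and "label" paths relative to the dataset directory. How the subset argument maps onto this file is not covered here, and discover_paths_from_json is a hypothetical name.

```python
import json
import random
from pathlib import Path


def discover_paths_from_json(dir_path, random_samples=None, random_state=None):
    """Return (image_paths, annotation_paths) listed under 'training' in dataset.json."""
    dir_path = Path(dir_path)
    with open(dir_path / "dataset.json", encoding="utf-8") as json_file:
        cases = json.load(json_file)["training"]

    # Optionally draw a reproducible random subset of the listed cases.
    if random_samples is not None and random_samples < len(cases):
        cases = random.Random(random_state).sample(cases, random_samples)

    image_paths = [dir_path / case["image"] for case in cases]
    annotation_paths = [dir_path / case["label"] for case in cases]
    return image_paths, annotation_paths
```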

id_to_class_names()[source]¶

Returns: A mapping of class indices to descriptive class names.
Return type: Dict[int, str]

label_items(ids, pseudo_labels=None)[source]¶: Moves the given samples from the unlabeled dataset to the labeled dataset.

multi_label()[source]¶

Returns: Whether the dataset is a multi-label or a single-label dataset.
Return type: bool

train_dataloader()[source]¶

Returns: PyTorch dataloader or Keras sequence representing the training set.

datasets.doubly_shuffled_nifti_dataset module¶

Module to load and batch nifti datasets

class datasets.doubly_shuffled_nifti_dataset.DoublyShuffledNIfTIDataset(image_paths, annotation_paths, cache_size=0, combine_foreground_classes=False, mask_filter_values=None, is_unlabeled=False, shuffle=False, transform=None, target_transform=None, dim=2, slice_indices=None, case_id_prefix='train', random_state=None, only_return_true_labels=False)[source]¶

Bases: torch.utils.data.dataset.Dataset[torch.utils.data.dataset.T_co]

This dataset can be used with NIfTI images. It is iterable and can return both 2D and 3D images.

Parameters

image_paths (List[str]) – List with the paths to the images. Has to contain paths of all images which can ever become part of the dataset.
annotation_paths (List[str]) – List with the paths to the annotations. Has to contain paths of all images which can ever become part of the dataset.
cache_size (int, optional) – Number of images to keep in memory to speed up data loading in subsequent epochs. Defaults to zero.
combine_foreground_classes (bool, optional) – Flag if the non-zero values of the annotations should be merged. Defaults to False.
mask_filter_values (Tuple[int], optional) – Values from the annotations which should be used. Defaults to using all values.
shuffle (bool, optional) – Whether the data should be shuffled.
transform (Callable[[Any], Tensor], optional) – Function to transform the images.
target_transform (Callable[[Any], Tensor], optional) – Function to transform the annotations.
dim (int, optional) – 2 or 3 to define whether the dataset should return 2D slices or whole 3D images. Defaults to 2.
slice_indices (List[np.array], optional) – Array of indices per image which should be part of the dataset. Uses all slices if None. Defaults to None.
random_state (int, optional) – Controls the data shuffling. Pass an int for reproducible output across multiple runs.
only_return_true_labels (bool, optional) – Whether only true labels or also pseudo-labels are to be returned. Defaults to False.

add_image(image_id, slice_index=0, pseudo_label=None)[source]¶

Adds an image to this dataset.

Parameters

image_id (str) – The id of the image.
slice_index (int) – Index of the slice to be added.
pseudo_label (np.array, optional) – An optional pseudo label for the slice. If no pseudo label is provided, the actual label from the corresponding file is used.

static generate_active_learning_split(filepaths, dim, initial_training_set_size, random_state=None)[source]¶

Generates a split between initial training set and initially unlabeled set for active learning.

Parameters

filepaths (List[str]) – The file paths to the Nifti files.
dim (int) – The dimensionality of the dataset. (2 or 3.)
initial_training_set_size (int) – The number of samples in the initial training set.
random_state (int, optional) – The random state used to generate the split. Pass an int for reproducibility across runs.

Returns

A tuple of two lists of np.arrays. The lists contain one array per filepath which contains the slice indices of the slices which should be part of the training and unlabeled sets respectively. The lists can be passed as slice_indices for initialization of a DoublyShuffledNIfTIDataset.
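The structure of the returned value can be illustrated for the 2D case with the sketch below. It takes the number of slices per image directly instead of reading NIfTI files and returns plain lists rather than np.arrays; both simplifications, as well as the name active_learning_split, are assumptions for illustration.

```python
import random


def active_learning_split(slices_per_image, initial_training_set_size, random_state=None):
    """Randomly assign slices to an initial training set and an unlabeled set.

    Returns two lists with one list of slice indices per image.
    """
    all_slices = [
        (image_index, slice_index)
        for image_index, num_slices in enumerate(slices_per_image)
        for slice_index in range(num_slices)
    ]
    rng = random.Random(random_state)
    num_training = min(initial_training_set_size, len(all_slices))
    training_slices = set(rng.sample(all_slices, num_training))

    training = [[] for _ in slices_per_image]
    unlabeled = [[] for _ in slices_per_image]
    for image_index, slice_index in all_slices:
        target = training if (image_index, slice_index) in training_slices else unlabeled
        target[image_index].append(slice_index)
    return training, unlabeled
```

Each inner list lines up with one file path, so the two results have the same per-image layout that the slice_indices parameter expects.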

get_images_by_id(case_ids)[source]¶

Retrieves the last n images and corresponding case ids from the images that were last added to the dataset.

Parameters: case_ids (List[str]) – List with case_ids to get.
Returns: A list of all the images with provided case ids.

get_items_for_logging(case_ids)[source]¶

Creates a list of files as tuple of image id and slice index.

Parameters: case_ids (List[str]) – List with case_ids to get.

image_ids()[source]¶

static normalize(img)[source]¶

Normalizes an image by

Dividing by the maximum value
Subtracting the mean, zeros will be ignored while calculating the mean
Dividing by the negative minimum value

Parameters: img – The input image that should be normalized.
Returns: Normalized image with background values normalized to -1
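The three steps can be traced with a small sketch on a 2D nested list. It assumes a non-negative image that contains zero-valued background and at least one non-zero value; under that assumption the background ends up at exactly -1. The function name normalize_sketch is hypothetical, and the real method operates on arrays.

```python
def normalize_sketch(image):
    """Apply the three documented normalization steps to a 2D nested list."""
    # Step 1: divide by the maximum value.
    maximum = max(value for row in image for value in row)
    scaled = [[value / maximum for value in row] for row in image]

    # Step 2: subtract the mean of the non-zero values.
    nonzero = [value for row in scaled for value in row if value != 0]
    mean = sum(nonzero) / len(nonzero)
    centered = [[value - mean for value in row] for row in scaled]

    # Step 3: divide by the negative minimum, which maps the minimum to -1.
    minimum = min(value for row in centered for value in row)
    return [[value / -minimum for value in row] for row in centered]
```

For the input [[0, 2], [4, 0]] the scaled image is [[0, 0.5], [1, 0]], the mean of the non-zero values is 0.75, and the background value of -0.75 becomes -1 after the final division.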

num_pseudo_labels()[source]¶

Returns: Number of items with pseudo-labels in the dataset.
Return type: int

read_mask_for_image(image_index)[source]¶

Reads the mask for the image from file. Uses correct mask specific parameters.

Parameters: image_index (int) – Index of the image to load.

reinforce_type(expected_type)¶: Reinforces the type for a DataPipe instance. ‘expected_type’ must be a subtype of the original type hint, which restricts the type requirement of the DataPipe instance.

remove_image(image_id, slice_index=0)[source]¶

Removes an image from this dataset.

Parameters

image_id (str) – The id of the image.
slice_index (int) – Index of the slice to be removed.

size()[source]¶

Returns: Size of the dataset.
Return type: int

slices_per_image(**kwargs)[source]¶
