datasets package

Submodules

datasets.bcss_data_module module

Module containing the data module for the BCSS dataset

class datasets.bcss_data_module.BCSSDataModule(*args, **kwargs)[source]

Bases: datasets.data_module.ActiveLearningDataModule

Initializes the BCSS data module.

Parameters
  • data_dir – Path of the directory that contains the data.

  • batch_size – Batch size.

  • num_workers – Number of workers for DataLoader.

  • cache_size (int, optional) – Number of images to keep in memory between epochs to speed-up data loading (default = 0).

  • active_learning_mode (bool, optional) – Whether the datamodule should be configured for active learning or for conventional model training (default = False).

  • initial_training_set_size (int, optional) – Initial size of the training set if the active learning mode is activated.

  • pin_memory (bool, optional) – pin_memory parameter as defined by the PyTorch DataLoader class.

  • shuffle (bool, optional) – Flag if the data should be shuffled.

  • channels (int, optional) – Number of channels of the images. 3 means RGB, 1 means greyscale.

  • image_shape (tuple, optional) – Shape of the image.

  • target_label (int, optional) – The label to use for learning. Details are in BCSSDataset.

  • combine_foreground_classes (bool, optional) – Flag if the non zero values of the annotations should be merged. (default = False)

  • val_set_size (float, optional) – The size of the validation set (default = 0.3).

  • stratify (bool, optional) – The option to stratify the train val split by the institutes.

  • random_state (int, optional) – Controls the data splitting and shuffling. Pass an int for reproducible output across multiple runs.

  • **kwargs – Further, dataset specific parameters.

static build_stratification_labels(image_paths)[source]

Build a list with class labels used for a stratified split
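
A minimal sketch of how a stratified train/validation split can use such labels, assuming one label per image (for example, the donating institute). The function name stratified_split and its arguments are hypothetical and not part of this package; the actual data module may rely on a library routine instead.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, val_set_size=0.3, random_state=None):
    """Split items into train and validation lists, keeping label proportions.

    Hypothetical helper for illustration; not the package's implementation.
    """
    rng = random.Random(random_state)

    # Group the items by their stratification label.
    groups = defaultdict(list)
    for item, label in zip(items, labels):
        groups[label].append(item)

    train, val = [], []
    for label in sorted(groups):
        members = groups[label][:]
        rng.shuffle(members)
        # Each group contributes the same fraction to the validation set.
        n_val = round(len(members) * val_set_size)
        val.extend(members[:n_val])
        train.extend(members[n_val:])
    return train, val


paths = [f"img_{i}.png" for i in range(10)]
institutes = ["A1"] * 6 + ["BH"] * 4  # made-up institute codes
train_paths, val_paths = stratified_split(
    paths, institutes, val_set_size=0.5, random_state=0
)
```

Passing the same random_state reproduces the same split, which mirrors the behaviour described for the random_state parameter above.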

data_channels()[source]

Returns the number of channels

static discover_paths(image_dir, mask_dir)[source]

Discover the .png files in a given directory.

Parameters
  • image_dir – The directory to the images.

  • mask_dir – The directory to the annotations.

Returns

list of file paths as tuple of image paths, annotation paths
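
The pairing of images and annotations could look roughly like the following sketch. It assumes that each mask has the same file name as its image, which the documentation does not state; discover_png_pairs is a hypothetical stand-in, not the method documented here.

```python
import tempfile
from pathlib import Path


def discover_png_pairs(image_dir, mask_dir):
    """Return (image_paths, annotation_paths) for .png files found in both directories."""
    image_dir, mask_dir = Path(image_dir), Path(mask_dir)
    image_paths, annotation_paths = [], []
    for image_path in sorted(image_dir.glob("*.png")):
        mask_path = mask_dir / image_path.name
        # Assumption: a mask shares the file name of its image.
        if mask_path.exists():
            image_paths.append(image_path)
            annotation_paths.append(mask_path)
    return image_paths, annotation_paths


# Small demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    images, masks = Path(tmp, "images"), Path(tmp, "masks")
    images.mkdir()
    masks.mkdir()
    for name in ("a.png", "b.png"):
        (images / name).touch()
        (masks / name).touch()
    (images / "unpaired.png").touch()  # has no mask and is skipped
    found_images, found_masks = discover_png_pairs(images, masks)
    found_names = [path.name for path in found_images]
```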

id_to_class_names()[source]
Returns

A mapping of class indices to descriptive class names.

Return type

Dict[int, str]

label_items(ids, pseudo_labels=None)[source]

Moves the given samples from the unlabeled dataset to the labeled dataset.

multi_label()[source]
Returns

Whether the dataset is a multi-label or a single-label dataset.

Return type

bool

train_dataloader()[source]
Returns

PyTorch dataloader or Keras sequence representing the training set.

datasets.bcss_data_module.copy_test_set_to_separate_folder(source_dir, target_dir)[source]

Reproduces the test set used in the baseline implementation of the challenge, by copying the scans of the respective institution into a separate folder.

Parameters
  • source_dir (str) – Directory where all the downloaded images and masks are stored.

  • target_dir (str) – Directory where to store the test data.

datasets.bcss_dataset module

Module to load and batch the BCSS dataset

class datasets.bcss_dataset.BCSSDataset(image_paths, annotation_paths, cache_size=0, target_label=1, is_unlabeled=False, shuffle=True, channels=3, image_shape=(300, 300), random_state=None)[source]

Bases: torch.utils.data.dataset.Dataset[torch.utils.data.dataset.T_co]

The BCSS dataset contains over 20,000 segmentation annotations of tissue regions from breast cancer images from TCGA. A detailed description can be found on the challenge website or on GitHub.

Parameters
  • image_paths (List[Path]) – List with all images to load, can be obtained by datasets.bcss_data_module.BCSSDataModule.discover_paths().

  • annotation_paths (List[Path]) – List with all annotations to load, can be obtained by datasets.bcss_data_module.BCSSDataModule.discover_paths().

  • target_label (int, optional) –

    The label to use for learning. Following labels are in the annotations:

    • outside_roi 0

    • tumor 1

    • stroma 2

    • lymphocytic_infiltrate 3

    • necrosis_or_debris 4

    • glandular_secretions 5

    • blood 6

    • exclude 7

    • metaplasia_NOS 8

    • fat 9

    • plasma_cells 10

    • other_immune_infiltrate 11

    • mucoid_material 12

    • normal_acinus_or_duct 13

    • lymphatics 14

    • undetermined 15

    • nerve 16

    • skin_adnexa 17

    • blood_vessel 18

    • angioinvasion 19

    • dcis 20

    • other 21

  • is_unlabeled (bool, optional) – Whether the dataset is used as “unlabeled” for the active learning loop.

  • shuffle (bool, optional) – Whether the data should be shuffled.

  • channels (int, optional) – Number of channels of the images. 3 means RGB, 1 means greyscale.

  • image_shape (tuple, optional) – Shape of the image.

  • random_state (int, optional) – Controls the data shuffling. Pass an int for reproducible output across multiple runs.
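
One plausible reading of target_label is that the annotation is reduced to a binary mask for the chosen class. The sketch below illustrates that reading on nested lists; it is an assumption, and select_target_label is a hypothetical helper rather than part of BCSSDataset.

```python
def select_target_label(annotation, target_label=1):
    """Return a binary mask: 1 where the annotation equals target_label, else 0."""
    return [
        [1 if value == target_label else 0 for value in row]
        for row in annotation
    ]


# Toy annotation using label ids from the list above (1 = tumor, 2 = stroma, 9 = fat).
annotation = [[0, 1, 2], [1, 1, 9]]
tumor_mask = select_target_label(annotation, target_label=1)
```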

add_image(image_path, annotation_path)[source]

Adds an image to this dataset.

Parameters
  • image_path – Path of the image to be added.

  • annotation_path – Path of the annotation of the image to be added.

Returns

None. Raises ValueError if image already exists.

static get_case_id(filepath)[source]

Gets the case ID for a given filepath.

static get_institute_name(filepath)[source]

Gets the name of the institute which donated the image.
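
TCGA barcodes start with TCGA, followed by a tissue source site code and a participant code. Assuming the file names begin with such a barcode, both identifiers can be derived as in the sketch below. The helper parse_tcga_file_name and the example file name are illustrative only; the sketch returns the site code, not a full institute name.

```python
from pathlib import Path


def parse_tcga_file_name(filepath):
    """Extract the case ID and tissue source site code from a TCGA-style file name."""
    stem = Path(filepath).stem
    barcode = stem.split("_")[0]  # e.g. "TCGA-A1-A0SK-DX1"
    fields = barcode.split("-")
    if len(fields) < 3 or fields[0] != "TCGA":
        raise ValueError(f"Not a TCGA-style file name: {filepath}")
    return {
        "case_id": "-".join(fields[:3]),  # project, site and participant
        "institute_code": fields[1],  # tissue source site code
    }


# Illustrative file name; the real naming scheme is an assumption.
parsed = parse_tcga_file_name("TCGA-A1-A0SK-DX1_xmin45749_ymin25055_MPP-0.2500.png")
```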

image_ids()[source]

Returns the case ID of each image.

static normalize(img)[source]

Normalizes an image by:

  1. Dividing by the mean value

  2. Subtracting the std

Parameters

img – The input image that should be normalized.

Returns

Normalized image with background values normalized to -1

num_pseudo_labels()[source]
Returns

Number of items with pseudo-labels in the dataset.

Return type

int

reinforce_type(expected_type)

Reinforces the type of a DataPipe instance. The ‘expected_type’ must be a subtype of the original type hint, so that it restricts the type requirement of the DataPipe instance.

remove_image(image_path, annotation_path)[source]

Removes an image from this dataset.

Parameters
  • image_path – Path of the image to be removed.

  • annotation_path – Path of the annotation of the image to be removed.

Returns

None. Raises ValueError if image does not exist.

size()[source]
Returns

Size of the dataset.

Return type

int

slices_per_image(**kwargs)[source]

Returns the number of slices for each image.

datasets.brats_data_module module

Module containing the data module for brats data

class datasets.brats_data_module.BraTSDataModule(*args, **kwargs)[source]

Bases: datasets.data_module.ActiveLearningDataModule

Initializes the BraTS data module.

Parameters
  • data_dir (string) – Path of the directory that contains the data.

  • batch_size (int) – Batch size.

  • num_workers (int) – Number of workers for DataLoader.

  • active_learning_mode (bool, optional) – Whether the datamodule should be configured for active learning or for conventional model training (default = False).

  • batch_size_unlabeled_set (int, optional) – Batch size for the unlabeled set. Defaults to batch_size.

  • cache_size (int, optional) – Number of images to keep in memory between epochs to speed-up data loading (default = 0).

  • initial_training_set_size (int, optional) – Initial size of the training set if the active learning mode is activated.

  • pin_memory (bool, optional) – pin_memory parameter as defined by the PyTorch DataLoader class.

  • shuffle (boolean) – Flag if the data should be shuffled.

  • dim (int) – 2 or 3 to define if the datasets should return 2D slices or whole 3D images.

  • combine_foreground_classes (bool, optional) – Flag if the non zero values of the annotations should be merged. (default = False)

  • mask_filter_values (Tuple[int], optional) – Values from the annotations which should be used. Defaults to using all values.

  • random_state (int, optional) – Random state for splitting the data into an initial training set and an unlabeled set and for shuffling the data. Pass an int for reproducibility across runs.

  • only_return_true_labels (bool, optional) – Whether only true labels or also pseudo-labels are to be returned. Defaults to False.

  • **kwargs – Further, dataset specific parameters.
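
The effect of mask_filter_values and combine_foreground_classes on an annotation can be sketched as follows. The sketch assumes that filtered-out values become background (0) and that merged foreground classes become 1; preprocess_mask is a hypothetical helper, not the package's code.

```python
def preprocess_mask(mask, mask_filter_values=None, combine_foreground_classes=False):
    """Apply label filtering and optional foreground merging to a 2D mask."""
    result = []
    for row in mask:
        new_row = []
        for value in row:
            # Assumption: labels outside mask_filter_values are treated as background.
            if mask_filter_values is not None and value not in mask_filter_values:
                value = 0
            # Assumption: all remaining non-zero labels are merged into class 1.
            if combine_foreground_classes and value != 0:
                value = 1
            new_row.append(value)
        result.append(new_row)
    return result


mask = [[0, 1, 2], [4, 4, 0]]
filtered = preprocess_mask(mask, mask_filter_values=(1, 4))
merged = preprocess_mask(mask, combine_foreground_classes=True)
```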

static discover_paths(dir_path, modality='flair', random_samples=None)[source]

Discover the .nii.gz file paths with a given modality

Parameters
  • dir_path – directory to discover paths in

  • modality (string, optional) – modality of scan

  • random_samples – the amount of random samples from the data sets

Returns

list of files as tuple of image paths, annotation paths

id_to_class_names()[source]
Returns

A mapping of class indices to descriptive class names.

Return type

Dict[int, str]

label_items(ids, pseudo_labels=None)[source]

Moves the given samples from the unlabeled dataset to the labeled dataset.

multi_label()[source]
Returns

Whether the dataset is a multi-label or a single-label dataset.

Return type

bool

train_dataloader()[source]
Returns

PyTorch dataloader or Keras sequence representing the training set.

datasets.collate module

Module to collate batches

datasets.collate.batch_padding_collate_fn(batch, pad_value=0)[source]

Collates a batch and pads tensors to the same size before stacking them.

Parameters

batch (List[Union[tuple, str, torch.Tensor]]) – The batch in List form.

Returns

The batch collated.
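
The padding idea can be illustrated without PyTorch. The sketch below pads 2D nested lists to the largest height and width in the batch; the documented function operates on tensors, so pad_and_stack is only a simplified, hypothetical analogue.

```python
def pad_and_stack(batch, pad_value=0):
    """Pad every 2D item to the largest height and width in the batch."""
    max_rows = max(len(item) for item in batch)
    max_cols = max(len(row) for item in batch for row in item)
    padded = []
    for item in batch:
        # Pad each existing row on the right.
        rows = [list(row) + [pad_value] * (max_cols - len(row)) for row in item]
        # Append full rows of padding at the bottom.
        rows += [[pad_value] * max_cols for _ in range(max_rows - len(item))]
        padded.append(rows)
    return padded


batch = [[[1, 2], [3, 4]], [[5]]]
collated = pad_and_stack(batch, pad_value=0)
```

After padding, every item has the same shape, so the batch can be stacked into a single array or tensor.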

datasets.data_module module

Module containing abstract classes for the data modules

class datasets.data_module.ActiveLearningDataModule(*args, **kwargs)[source]

Bases: pytorch_lightning.core.datamodule.LightningDataModule, abc.ABC

Abstract base class to structure the dataset creation for active learning

Parameters
  • data_dir (str) – Path of the directory that contains the data.

  • batch_size (int) – Batch size.

  • num_workers (int) – Number of workers for DataLoader.

  • batch_size_unlabeled_set (int, optional) – Batch size for the unlabeled set. Defaults to batch_size.

  • active_learning_mode (bool, optional) – Whether the datamodule should be configured for active learning or for conventional model training (default = False).

  • initial_training_set_size (int, optional) – Initial size of the training set if the active learning mode is activated.

  • pin_memory (bool, optional) – pin_memory parameter as defined by the PyTorch DataLoader class.

  • shuffle (bool, optional) – Flag if the data should be shuffled.

  • **kwargs – Further, dataset specific parameters.

static data_channels()[source]

Can be overwritten by subclasses if the data has multiple channels.

Returns

The amount of data channels. Defaults to 1.

abstract id_to_class_names()[source]
Returns

A mapping of class indices to descriptive class names.

Return type

Dict[int, str]

abstract label_items(ids, pseudo_labels=None)[source]

Moves data items from the unlabeled set to one of the labeled sets (training, validation or test set).

Parameters
  • ids (List[str]) – IDs of the items to be labeled.

  • pseudo_labels (Dict[str, Any], optional) – Optional pseudo labels for (some of) the selected data items.

Returns

None.
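
The bookkeeping behind label_items can be sketched with plain sets. LabelPool is a hypothetical class for illustration; the concrete data modules move items between dataset objects rather than sets of IDs.

```python
class LabelPool:
    """Tracks which item IDs are labeled, unlabeled, or pseudo-labeled."""

    def __init__(self, labeled_ids, unlabeled_ids):
        self.labeled = set(labeled_ids)
        self.unlabeled = set(unlabeled_ids)
        self.pseudo_labels = {}

    def label_items(self, ids, pseudo_labels=None):
        pseudo_labels = pseudo_labels or {}
        # Validate first so that a failure leaves the pool unchanged.
        unknown = [item_id for item_id in ids if item_id not in self.unlabeled]
        if unknown:
            raise ValueError(f"Not in the unlabeled set: {unknown}")
        for item_id in ids:
            self.unlabeled.remove(item_id)
            self.labeled.add(item_id)
            if item_id in pseudo_labels:
                self.pseudo_labels[item_id] = pseudo_labels[item_id]


pool = LabelPool(["case_0"], ["case_1", "case_2", "case_3"])
pool.label_items(["case_1", "case_2"], pseudo_labels={"case_2": "predicted mask"})
```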

abstract multi_label()[source]
Returns

Whether the dataset is a multi-label or a single-label dataset.

Return type

bool

num_classes()[source]
Returns

Number of classes.

setup(stage=None)[source]

Creates the datasets managed by this data module.

Parameters

stage – Current training stage.

test_dataloader()[source]
Returns

PyTorch dataloader or Keras sequence representing the test set.

test_set_size()[source]
Returns

Size of test set.

train_dataloader()[source]
Returns

PyTorch dataloader or Keras sequence representing the training set.

training_set_num_pseudo_labels()[source]
Returns

Number of pseudo-labels in training set.

training_set_size()[source]
Returns

Size of training set.

unlabeled_dataloader()[source]
Returns

PyTorch dataloader or Keras sequence representing the unlabeled set.

unlabeled_set_size()[source]
Returns

Number of unlabeled items.

val_dataloader()[source]
Returns

PyTorch dataloader or Keras sequence representing the validation set.

validation_set_size()[source]
Returns

Size of validation set.

datasets.dataset_hooks module

Module defining hooks that each dataset class should implement

class datasets.dataset_hooks.DatasetHooks[source]

Bases: abc.ABC

Class that defines hooks that should be implemented by each dataset class.

abstract image_ids()[source]
Returns

List of all image IDs included in the dataset.

abstract num_pseudo_labels()[source]
Returns

Number of items with pseudo-labels in the dataset.

Return type

int

abstract size()[source]
Returns

Size of the dataset.

Return type

int

abstract slices_per_image(**kwargs)[source]
Parameters

kwargs – Dataset specific parameters.

Returns

Number of slices that each image of the dataset contains. If a single integer value is provided, it is assumed that all images of the dataset have the same number of slices.

Return type

Union[int, List[int]]
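
Code that consumes this return value has to handle both forms. A minimal sketch of such handling follows; expand_slices_per_image is a hypothetical helper and not part of the package.

```python
def expand_slices_per_image(slices_per_image, num_images):
    """Turn an int or a list of ints into one slice count per image."""
    if isinstance(slices_per_image, int):
        # A single integer means every image has the same number of slices.
        return [slices_per_image] * num_images
    slices = list(slices_per_image)
    if len(slices) != num_images:
        raise ValueError("Expected one slice count per image")
    return slices


uniform = expand_slices_per_image(155, num_images=3)
varying = expand_slices_per_image([10, 12], num_images=2)
```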

datasets.decathlon_data_module module

Module containing the data module for decathlon data

class datasets.decathlon_data_module.DecathlonDataModule(*args, **kwargs)[source]

Bases: datasets.data_module.ActiveLearningDataModule

Initializes the Decathlon data module.

Parameters
  • data_dir (string) – Path of the directory that contains the data.

  • batch_size (int) – Batch size.

  • num_workers (int) – Number of workers for DataLoader.

  • task (str, optional) – The task from the medical segmentation decathlon.

  • active_learning_mode (bool, optional) – Whether the datamodule should be configured for active learning or for conventional model training (default = False).

  • batch_size_unlabeled_set (int, optional) – Batch size for the unlabeled set. Defaults to batch_size.

  • cache_size (int, optional) – Number of images to keep in memory between epochs to speed-up data loading (default = 0).

  • initial_training_set_size (int, optional) – Initial size of the training set if the active learning mode is activated.

  • pin_memory (bool, optional) – pin_memory parameter as defined by the PyTorch DataLoader class.

  • shuffle (bool, optional) – Flag if the data should be shuffled.

  • dim (int) – 2 or 3 to define if the datasets should return 2D slices or whole 3D images.

  • combine_foreground_classes (bool, optional) – Flag if the non zero values of the annotations should be merged. (default = False)

  • mask_filter_values (Tuple[int], optional) – Values from the annotations which should be used. Defaults to using all values.

  • random_state (int, optional) – Random state for splitting the data into an initial training set and an unlabeled set and for shuffling the data. Pass an int for reproducibility across runs.

  • only_return_true_labels (bool, optional) – Whether only true labels or also pseudo-labels are to be returned. Defaults to False.

  • **kwargs – Further, dataset specific parameters.

data_channels()[source]

Returns the amount of data channels.

static discover_paths(dir_path, subset, random_samples=None, random_state=None)[source]

Discover the .nii.gz file paths from the corresponding JSON file.

Parameters
  • dir_path (str) – Directory the dataset is inside.

  • subset (Literal["train", "val", "test"]) – The subset of paths of the whole dataset.

  • random_samples (int, optional) – The amount of random samples from the data set.

Returns

list of files as tuple of image paths, annotation paths
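
The Medical Segmentation Decathlon ships a dataset.json per task whose "training" entry lists image and label paths. The sketch below reads only that entry and optionally draws a random subset. How the documented method derives the "val" subset is not described here, so that part is left out; read_training_pairs is a hypothetical helper.

```python
import json
import random
import tempfile
from pathlib import Path


def read_training_pairs(dir_path, random_samples=None, random_state=None):
    """Return (image_paths, annotation_paths) listed under "training" in dataset.json."""
    dir_path = Path(dir_path)
    with open(dir_path / "dataset.json") as file:
        entries = json.load(file)["training"]
    if random_samples is not None and random_samples < len(entries):
        entries = random.Random(random_state).sample(entries, random_samples)
    image_paths = [str(dir_path / entry["image"]) for entry in entries]
    annotation_paths = [str(dir_path / entry["label"]) for entry in entries]
    return image_paths, annotation_paths


# Demonstration with a made-up metadata file.
with tempfile.TemporaryDirectory() as tmp:
    metadata = {
        "training": [
            {"image": f"./imagesTr/case_{i}.nii.gz", "label": f"./labelsTr/case_{i}.nii.gz"}
            for i in range(5)
        ]
    }
    Path(tmp, "dataset.json").write_text(json.dumps(metadata))
    images, labels = read_training_pairs(tmp, random_samples=2, random_state=42)
```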

id_to_class_names()[source]
Returns

A mapping of class indices to descriptive class names.

Return type

Dict[int, str]

label_items(ids, pseudo_labels=None)[source]

Moves the given samples from the unlabeled dataset to the labeled dataset.

multi_label()[source]
Returns

Whether the dataset is a multi-label or a single-label dataset.

Return type

bool

train_dataloader()[source]
Returns

PyTorch dataloader or Keras sequence representing the training set.

datasets.doubly_shuffled_nifti_dataset module

Module to load and batch nifti datasets

class datasets.doubly_shuffled_nifti_dataset.DoublyShuffledNIfTIDataset(image_paths, annotation_paths, cache_size=0, combine_foreground_classes=False, mask_filter_values=None, is_unlabeled=False, shuffle=False, transform=None, target_transform=None, dim=2, slice_indices=None, case_id_prefix='train', random_state=None, only_return_true_labels=False)[source]

Bases: torch.utils.data.dataset.Dataset[torch.utils.data.dataset.T_co]

This dataset can be used with NIfTI images. It is iterable and can return both 2D and 3D images.

Parameters
  • image_paths (List[str]) – List with the paths to the images. Has to contain paths of all images which can ever become part of the dataset.

  • annotation_paths (List[str]) – List with the paths to the annotations. Has to contain paths of all images which can ever become part of the dataset.

  • cache_size (int, optional) – Number of images to keep in memory to speed-up data loading in subsequent epochs. Defaults to zero.

  • combine_foreground_classes (bool, optional) – Flag if the non-zero values of the annotations should be merged. Defaults to False.

  • mask_filter_values (Tuple[int], optional) – Values from the annotations which should be used. Defaults to using all values.

  • shuffle (bool, optional) – Whether the data should be shuffled.

  • transform (Callable[[Any], Tensor], optional) – Function to transform the images.

  • target_transform (Callable[[Any], Tensor], optional) – Function to transform the annotations.

  • dim (int, optional) – 2 or 3 to define if the dataset should return 2D slices or whole 3D images. Defaults to 2.

  • slice_indices (List[np.array], optional) – Array of indices per image which should be part of the dataset. Uses all slices if None. Defaults to None.

  • random_state (int, optional) – Controls the data shuffling. Pass an int for reproducible output across multiple runs.

  • only_return_true_labels (bool, optional) – Whether only true labels or also pseudo-labels are to be returned. Defaults to False.

add_image(image_id, slice_index=0, pseudo_label=None)[source]

Adds an image to this dataset.

Parameters
  • image_id (str) – The id of the image.

  • slice_index (int) – Index of the slice to be added.

  • pseudo_label (np.array, optional) – An optional pseudo label for the slice. If no pseudo label is provided, the actual label from the corresponding file is used.

static generate_active_learning_split(filepaths, dim, initial_training_set_size, random_state=None)[source]

Generates a split between initial training set and initially unlabeled set for active learning.

Parameters
  • filepaths (List[str]) – The file paths to the Nifti files.

  • dim (int) – The dimensionality of the dataset. (2 or 3.)

  • initial_training_set_size (int) – The number of samples in the initial training set.

  • random_state (int, optional) – The random state used to generate the split. Pass an int for reproducibility across runs.

Returns

A tuple of two lists of np.arrays. The lists contain one array per filepath which contains the slice indices of the slices which should be part of the training and unlabeled sets respectively. The lists can be passed as slice_indices for initialization of a DoublyShuffledNIfTIDataset.
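
The shape of the returned value can be illustrated for the 2D case. The sketch below takes the number of slices per image directly instead of reading NIfTI files, and uses plain lists instead of np.arrays; split_slices is a hypothetical helper, not the documented method.

```python
import random


def split_slices(slices_per_image, initial_training_set_size, random_state=None):
    """Randomly assign slices to an initial training set and an unlabeled set."""
    rng = random.Random(random_state)
    all_slices = [
        (image_index, slice_index)
        for image_index, count in enumerate(slices_per_image)
        for slice_index in range(count)
    ]
    if initial_training_set_size > len(all_slices):
        raise ValueError("Initial training set is larger than the dataset")
    chosen = set(rng.sample(all_slices, initial_training_set_size))

    # One list of slice indices per image, for each of the two sets.
    training = [[] for _ in slices_per_image]
    unlabeled = [[] for _ in slices_per_image]
    for image_index, slice_index in all_slices:
        target = training if (image_index, slice_index) in chosen else unlabeled
        target[image_index].append(slice_index)
    return training, unlabeled


counts = [3, 5, 2]  # made-up slice counts for three images
training_slices, unlabeled_slices = split_slices(counts, 4, random_state=1)
```

Each of the two results has one entry per image, which matches the structure expected by the slice_indices parameter.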

get_images_by_id(case_ids)[source]

Retrieves the images with the given case IDs from the dataset.

Parameters

case_ids (List[str]) – List with case_ids to get.

Returns

A list of all the images with provided case ids.

get_items_for_logging(case_ids)[source]

Creates a list of files as tuple of image id and slice index.

Parameters

case_ids (List[str]) – List with case_ids to get.

image_ids()[source]
static normalize(img)[source]
Normalizes an image by:
  1. Dividing by the maximum value

  2. Subtracting the mean, zeros will be ignored while calculating the mean

  3. Dividing by the negative minimum value

Parameters

img – The input image that should be normalized.

Returns

Normalized image with background values normalized to -1
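
The three steps can be traced on a flat list of intensities. This is a sketch under the assumption that the background is 0 and all other intensities are positive; the documented method works on image arrays, and normalize_values is a hypothetical helper.

```python
def normalize_values(values):
    """Apply the three normalization steps to a flat list of intensities."""
    maximum = max(values)
    if maximum == 0:
        return [float(value) for value in values]  # empty image, nothing to scale

    # 1. Divide by the maximum value.
    scaled = [value / maximum for value in values]

    # 2. Subtract the mean, ignoring zeros when computing it.
    non_zero = [value for value in scaled if value != 0]
    mean = sum(non_zero) / len(non_zero)
    centered = [value - mean for value in scaled]

    # 3. Divide by the negative minimum, which maps the background to -1.
    minimum = min(centered)
    if minimum == 0:
        return centered  # constant image, avoid division by zero
    return [value / -minimum for value in centered]


normalized = normalize_values([0, 2, 4])
```

In this example the background value 0 ends up at -1, while the two foreground values end up at roughly -0.33 and 0.33.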

num_pseudo_labels()[source]
Returns

Number of items with pseudo-labels in the dataset.

Return type

int

read_mask_for_image(image_index)[source]

Reads the mask for the image from file. Uses correct mask specific parameters.

Parameters

image_index (int) – Index of the image to load.

reinforce_type(expected_type)

Reinforces the type of a DataPipe instance. The ‘expected_type’ must be a subtype of the original type hint, so that it restricts the type requirement of the DataPipe instance.

remove_image(image_id, slice_index=0)[source]

Removes an image from this dataset.

Parameters
  • image_id (str) – The id of the image.

  • slice_index (int) – Index of the slice to be removed.

size()[source]
Returns

Size of the dataset.

Return type

int

slices_per_image(**kwargs)[source]

Module contents