datasets package¶
Submodules¶
datasets.bcss_data_module module¶
Module containing the data module for the BCSS dataset
- class datasets.bcss_data_module.BCSSDataModule(*args, **kwargs)[source]¶
Bases:
datasets.data_module.ActiveLearningDataModule
Initializes the BCSS data module.
- Parameters
data_dir – Path of the directory that contains the data.
batch_size – Batch size.
num_workers – Number of workers for DataLoader.
cache_size (int, optional) – Number of images to keep in memory between epochs to speed up data loading (default = 0).
active_learning_mode (bool, optional) – Whether the datamodule should be configured for active learning or for conventional model training (default = False).
initial_training_set_size (int, optional) – Initial size of the training set if the active learning mode is activated.
pin_memory (bool, optional) – pin_memory parameter as defined by the PyTorch DataLoader class.
shuffle (bool, optional) – Flag if the data should be shuffled.
channels (int, optional) – Number of channels of the images. 3 means RGB, 2 means greyscale.
image_shape (tuple, optional) – Shape of the image.
target_label (int, optional) – The label to use for learning. Details are in BCSSDataset.
combine_foreground_classes (bool, optional) – Flag if the non-zero values of the annotations should be merged (default = False).
val_set_size (float, optional) – The size of the validation set (default = 0.3).
stratify (bool, optional) – The option to stratify the train val split by the institutes.
random_state (int, optional) – Controls the data splitting and shuffling. Pass an int for reproducible output across multiple runs.
**kwargs – Further, dataset specific parameters.
- static build_stratification_labels(image_paths)[source]¶
Build a list with class labels used for a stratified split
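The interplay of stratify, val_set_size and random_state can be pictured with a small per-group split. This is only a sketch under stated assumptions: the stratification labels are taken to be one institute name per image, and stratified_split is a hypothetical helper, not a function of this package.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, val_set_size=0.3, random_state=None):
    """Split items into (train, val) so each label keeps roughly the same share.

    Hypothetical helper for illustration; not the package's implementation.
    """
    rng = random.Random(random_state)

    # Group the items by their stratification label (e.g. institute name).
    groups = defaultdict(list)
    for item, label in zip(items, labels):
        groups[label].append(item)

    train, val = [], []
    for label in sorted(groups):
        members = groups[label][:]
        rng.shuffle(members)
        # Take the same fraction of every group for the validation set.
        n_val = round(len(members) * val_set_size)
        val.extend(members[:n_val])
        train.extend(members[n_val:])
    return train, val


paths = [f"img_{i}.png" for i in range(20)]
institutes = ["A1"] * 10 + ["BH"] * 10
train_paths, val_paths = stratified_split(paths, institutes, 0.3, random_state=42)
```

Passing the same integer as random_state reproduces the same split, which matches the intent described for the random_state parameter.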
- static discover_paths(image_dir, mask_dir)[source]¶
Discover the .png files in a given directory.
- Parameters
image_dir – The directory to the images.
mask_dir – The directory to the annotations.
- Returns
list of file paths as tuple of image paths, annotation paths
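A minimal sketch of such a discovery step, assuming that an image and its annotation share the same file name in the two directories. The helper discover_png_pairs is hypothetical and only illustrates the shape of the returned tuple.

```python
import tempfile
from pathlib import Path


def discover_png_pairs(image_dir, mask_dir):
    """Return (image_paths, annotation_paths) for .png files present in both dirs.

    Hypothetical helper: assumes an image and its mask share the same file name.
    """
    image_paths, annotation_paths = [], []
    for image_path in sorted(Path(image_dir).glob("*.png")):
        mask_path = Path(mask_dir) / image_path.name
        if mask_path.exists():
            image_paths.append(image_path)
            annotation_paths.append(mask_path)
    return image_paths, annotation_paths


# Small demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as root:
    images = Path(root, "images")
    masks = Path(root, "masks")
    images.mkdir()
    masks.mkdir()
    for name in ("a.png", "b.png", "notes.txt"):
        (images / name).touch()
    (masks / "a.png").touch()  # only a.png has an annotation

    found_images, found_masks = discover_png_pairs(images, masks)
    found_names = [path.name for path in found_images]
```

Images without a matching annotation are skipped in this sketch; whether the package does the same is not stated in this reference.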
- id_to_class_names()[source]¶
- Returns
A mapping of class indices to descriptive class names.
- Return type
Dict[int, str]
- label_items(ids, pseudo_labels=None)[source]¶
Moves the given samples from the unlabeled dataset to the labeled dataset.
- datasets.bcss_data_module.copy_test_set_to_separate_folder(source_dir, target_dir)[source]¶
Reproduces the test set used in the baseline implementation of the challenge, by copying the scans of the respective institution into a separate folder.
- Parameters
source_dir (str) – Directory where all the downloaded images and masks are stored.
target_dir (str) – Directory where to store the test data.
datasets.bcss_dataset module¶
Module to load and batch the BCSS dataset
- class datasets.bcss_dataset.BCSSDataset(image_paths, annotation_paths, cache_size=0, target_label=1, is_unlabeled=False, shuffle=True, channels=3, image_shape=(300, 300), random_state=None)[source]¶
Bases:
torch.utils.data.dataset.Dataset[torch.utils.data.dataset.T_co]
The BCSS dataset contains over 20,000 segmentation annotations of tissue regions from breast cancer images from TCGA. A detailed description can be found either at the challenge website or on GitHub.
- Parameters
image_paths (List[Path]) – List with all images to load, can be obtained by datasets.bcss_data_module.BCSSDataModule.discover_paths().
annotation_paths (List[Path]) – List with all annotations to load, can be obtained by datasets.bcss_data_module.BCSSDataModule.discover_paths().
target_label (int, optional) – The label to use for learning. The annotations contain the following labels:
outside_roi 0
tumor 1
stroma 2
lymphocytic_infiltrate 3
necrosis_or_debris 4
glandular_secretions 5
blood 6
exclude 7
metaplasia_NOS 8
fat 9
plasma_cells 10
other_immune_infiltrate 11
mucoid_material 12
normal_acinus_or_duct 13
lymphatics 14
undetermined 15
nerve 16
skin_adnexa 17
blood_vessel 18
angioinvasion 19
dcis 20
other 21
is_unlabeled (bool, optional) – Whether the dataset is used as “unlabeled” for the active learning loop.
shuffle (bool, optional) – Whether the data should be shuffled.
channels (int, optional) – Number of channels of the images. 3 means RGB, 2 means greyscale.
image_shape (tuple, optional) – Shape of the image.
random_state (int, optional) – Controls the data shuffling. Pass an int for reproducible output across multiple runs.
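The label table listed under target_label can be written down as a mapping. The sketch below also shows one plausible reading of target_label, namely that pixels equal to it become foreground and all others background. That reading is an assumption, and binary_mask is a hypothetical helper operating on nested lists rather than on the arrays the dataset actually uses.

```python
# Label ids and names as listed in the parameter description above.
BCSS_LABELS = {
    0: "outside_roi", 1: "tumor", 2: "stroma", 3: "lymphocytic_infiltrate",
    4: "necrosis_or_debris", 5: "glandular_secretions", 6: "blood",
    7: "exclude", 8: "metaplasia_NOS", 9: "fat", 10: "plasma_cells",
    11: "other_immune_infiltrate", 12: "mucoid_material",
    13: "normal_acinus_or_duct", 14: "lymphatics", 15: "undetermined",
    16: "nerve", 17: "skin_adnexa", 18: "blood_vessel", 19: "angioinvasion",
    20: "dcis", 21: "other",
}


def binary_mask(annotation, target_label=1):
    """Map a 2D annotation (nested lists of label ids) to a 0/1 mask.

    Assumption for illustration: target_label marks the foreground class.
    """
    return [[1 if value == target_label else 0 for value in row] for row in annotation]


example_annotation = [[0, 1, 1], [2, 1, 9]]
tumor_mask = binary_mask(example_annotation, target_label=1)
```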
- add_image(image_path, annotation_path)[source]¶
Adds an image to this dataset.
- Parameters
image_path – Path of the image to be added.
annotation_path – Path of the annotation of the image to be added.
- Returns
None. Raises ValueError if image already exists.
- static get_institute_name(filepath)[source]¶
Gets the name of the institute which donated the image.
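As an illustration, TCGA barcodes have the form TCGA-<site>-<participant>-..., where the second field identifies the tissue source site. The sketch below assumes that the file names start with such a barcode and that the institute corresponds to this site code; institute_code is a hypothetical helper, and the file name in the example is purely illustrative.

```python
from pathlib import Path


def institute_code(filepath):
    """Extract the tissue source site code from a TCGA-style file name.

    Assumption: the file name starts with a barcode such as TCGA-A1-A0SK-...
    """
    parts = Path(filepath).name.split("-")
    if len(parts) < 3 or parts[0] != "TCGA":
        raise ValueError(f"Not a TCGA-style file name: {filepath}")
    return parts[1]


code = institute_code("images/TCGA-A1-A0SK-DX1_example.png")
```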
- static normalize(img)[source]¶
Normalizes an image by:
Dividing by the mean value
Subtracting the std
- Parameters
img – The input image that should be normalized.
- Returns
Normalized image with background values normalized to -1
- num_pseudo_labels()[source]¶
- Returns
Number of items with pseudo-labels in the dataset.
- Return type
int
- reinforce_type(expected_type)¶
Reinforces the type of the DataPipe instance. The expected_type must be a subtype of the original type hint, so that it restricts the type requirement of the DataPipe instance.
datasets.brats_data_module module¶
Module containing the data module for brats data
- class datasets.brats_data_module.BraTSDataModule(*args, **kwargs)[source]¶
Bases:
datasets.data_module.ActiveLearningDataModule
Initializes the BraTS data module.
- Parameters
data_dir (string) – Path of the directory that contains the data.
batch_size (int) – Batch size.
num_workers (int) – Number of workers for DataLoader.
active_learning_mode (bool, optional) – Whether the datamodule should be configured for active learning or for conventional model training (default = False).
batch_size_unlabeled_set (int, optional) – Batch size for the unlabeled set. Defaults to batch_size.
cache_size (int, optional) – Number of images to keep in memory between epochs to speed up data loading (default = 0).
initial_training_set_size (int, optional) – Initial size of the training set if the active learning mode is activated.
pin_memory (bool, optional) – pin_memory parameter as defined by the PyTorch DataLoader class.
shuffle (boolean) – Flag if the data should be shuffled.
dim (int) – 2 or 3 to define if the datasets should return 2d slices or whole 3d images.
combine_foreground_classes (bool, optional) – Flag if the non-zero values of the annotations should be merged (default = False).
mask_filter_values (Tuple[int], optional) – Values from the annotations which should be used. Defaults to using all values.
random_state (int, optional) – Random state for splitting the data into an initial training set and an unlabeled set and for shuffling the data. Pass an int for reproducibility across runs.
only_return_true_labels (bool, optional) – Whether only true labels or also pseudo-labels are to be returned. Defaults to False.
**kwargs – Further, dataset specific parameters.
- static discover_paths(dir_path, modality='flair', random_samples=None)[source]¶
Discover the .nii.gz file paths with a given modality.
- Parameters
dir_path – directory to discover paths in
modality (string, optional) – modality of scan
random_samples – the amount of random samples from the data sets
- Returns
list of files as tuple of image paths, annotation paths
- id_to_class_names()[source]¶
- Returns
A mapping of class indices to descriptive class names.
- Return type
Dict[int, str]
- label_items(ids, pseudo_labels=None)[source]¶
Moves the given samples from the unlabeled dataset to the labeled dataset.
datasets.collate module¶
Module to collate batches
datasets.data_module module¶
Module containing abstract classes for the data modules
- class datasets.data_module.ActiveLearningDataModule(*args, **kwargs)[source]¶
Bases:
pytorch_lightning.core.datamodule.LightningDataModule, abc.ABC
Abstract base class to structure the dataset creation for active learning
- Parameters
data_dir (str) – Path of the directory that contains the data.
batch_size (int) – Batch size.
num_workers (int) – Number of workers for DataLoader.
batch_size_unlabeled_set (int, optional) – Batch size for the unlabeled set. Defaults to batch_size.
active_learning_mode (bool, optional) – Whether the datamodule should be configured for active learning or for conventional model training (default = False).
initial_training_set_size (int, optional) – Initial size of the training set if the active learning mode is activated.
pin_memory (bool, optional) – pin_memory parameter as defined by the PyTorch DataLoader class.
shuffle (bool, optional) – Flag if the data should be shuffled.
**kwargs – Further, dataset specific parameters.
- static data_channels()[source]¶
Can be overridden by subclasses if the data has multiple channels.
- Returns
The amount of data channels. Defaults to 1.
- abstract id_to_class_names()[source]¶
- Returns
A mapping of class indices to descriptive class names.
- Return type
Dict[int, str]
- abstract label_items(ids, pseudo_labels=None)[source]¶
Moves data items from the unlabeled set to one of the labeled sets (training, validation or test set).
- Parameters
ids (List[str]) – IDs of the items to be labeled.
pseudo_labels (Dict[str, Any], optional) – Optional pseudo labels for (some of) the selected data items.
- Returns
None.
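The bookkeeping behind label_items can be sketched with plain Python containers. LabelPool is a hypothetical stand-in for a data module, and the item IDs used below are made up. The sketch only shows how items leave the unlabeled set and how optional pseudo labels could be tracked.

```python
class LabelPool:
    """Hypothetical, simplified model of the labeled/unlabeled bookkeeping."""

    def __init__(self, unlabeled_ids):
        self.unlabeled = set(unlabeled_ids)
        # Maps item id -> pseudo label, or None if the true label is used.
        self.labeled = {}

    def label_items(self, ids, pseudo_labels=None):
        """Move the given items from the unlabeled set to the labeled set."""
        pseudo_labels = pseudo_labels or {}
        for item_id in ids:
            if item_id not in self.unlabeled:
                raise ValueError(f"Item {item_id} is not in the unlabeled set.")
            self.unlabeled.remove(item_id)
            self.labeled[item_id] = pseudo_labels.get(item_id)

    def num_pseudo_labels(self):
        """Number of labeled items that carry a pseudo label."""
        return sum(label is not None for label in self.labeled.values())


pool = LabelPool(["case_0-0", "case_0-1", "case_1-0"])
pool.label_items(
    ["case_0-0", "case_1-0"],
    pseudo_labels={"case_1-0": [[0, 1], [1, 1]]},
)
```

Items without an entry in pseudo_labels are treated as carrying their true label, mirroring the "some of the selected data items" wording above.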
- abstract multi_label()[source]¶
- Returns
Whether the dataset is a multi-label or a single-label dataset.
- Return type
bool
- setup(stage=None)[source]¶
Creates the datasets managed by this data module.
- Parameters
stage – Current training stage.
- train_dataloader()[source]¶
- Returns
PyTorch dataloader or Keras sequence representing the training set.
- unlabeled_dataloader()[source]¶
- Returns
PyTorch dataloader or Keras sequence representing the unlabeled set.
datasets.dataset_hooks module¶
Module defining hooks that each dataset class should implement
- class datasets.dataset_hooks.DatasetHooks[source]¶
Bases:
abc.ABC
Class that defines hooks that should be implemented by each dataset class.
- abstract num_pseudo_labels()[source]¶
- Returns
Number of items with pseudo-labels in the dataset.
- Return type
int
- abstract slices_per_image(**kwargs)[source]¶
- Parameters
kwargs – Dataset specific parameters.
- Returns
Number of slices that each image of the dataset contains. If a single integer value is provided, it is assumed that all images of the dataset have the same number of slices.
- Return type
Union[int, List[int]]
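Callers of slices_per_image have to handle both return forms. A minimal sketch of how the Union[int, List[int]] value could be expanded to one entry per image follows; expand_slices_per_image is a hypothetical helper, not part of this package.

```python
from typing import List, Union


def expand_slices_per_image(
    slices_per_image: Union[int, List[int]], num_images: int
) -> List[int]:
    """Return one slice count per image, whichever form was provided."""
    if isinstance(slices_per_image, int):
        # A single integer means that all images have the same number of slices.
        return [slices_per_image] * num_images
    if len(slices_per_image) != num_images:
        raise ValueError("Expected one slice count per image.")
    return list(slices_per_image)


uniform = expand_slices_per_image(155, num_images=3)
individual = expand_slices_per_image([10, 12], num_images=2)
```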
datasets.decathlon_data_module module¶
Module containing the data module for decathlon data
- class datasets.decathlon_data_module.DecathlonDataModule(*args, **kwargs)[source]¶
Bases:
datasets.data_module.ActiveLearningDataModule
Initializes the Decathlon data module.
- Parameters
data_dir (string) – Path of the directory that contains the data.
batch_size (int) – Batch size.
num_workers (int) – Number of workers for DataLoader.
task (str, optional) – The task from the medical segmentation decathlon.
active_learning_mode (bool, optional) – Whether the datamodule should be configured for active learning or for conventional model training (default = False).
batch_size_unlabeled_set (int, optional) – Batch size for the unlabeled set. Defaults to batch_size.
cache_size (int, optional) – Number of images to keep in memory between epochs to speed up data loading (default = 0).
initial_training_set_size (int, optional) – Initial size of the training set if the active learning mode is activated.
pin_memory (bool, optional) – pin_memory parameter as defined by the PyTorch DataLoader class.
shuffle (bool, optional) – Flag if the data should be shuffled.
dim (int) – 2 or 3 to define if the datasets should return 2d slices or whole 3d images.
combine_foreground_classes (bool, optional) – Flag if the non-zero values of the annotations should be merged (default = False).
mask_filter_values (Tuple[int], optional) – Values from the annotations which should be used. Defaults to using all values.
random_state (int, optional) – Random state for splitting the data into an initial training set and an unlabeled set and for shuffling the data. Pass an int for reproducibility across runs.
only_return_true_labels (bool, optional) – Whether only true labels or also pseudo-labels are to be returned. Defaults to False.
**kwargs – Further, dataset specific parameters.
- static discover_paths(dir_path, subset, random_samples=None, random_state=None)[source]¶
Discover the .nii.gz file paths from the corresponding JSON file.
- Parameters
dir_path (str) – Directory the dataset is inside.
subset (Literal["train", "val", "test"]) – The subset of paths of the whole dataset.
random_samples (int, optional) – The amount of random samples from the data set.
- Returns
list of files as tuple of image paths, annotation paths
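A sketch of reading image and annotation paths from a JSON description. It assumes the layout commonly shipped with Medical Segmentation Decathlon tasks: a dataset.json file whose "training" list holds entries with "image" and "label" keys. The helper discover_training_pairs is hypothetical and covers only the training entries; how the "val" subset is derived is not described in this reference.

```python
import json
import random
import tempfile
from pathlib import Path


def discover_training_pairs(dir_path, random_samples=None, random_state=None):
    """Return (image_paths, annotation_paths) listed in <dir_path>/dataset.json.

    Hypothetical helper; assumes a "training" list of {"image", "label"} entries.
    """
    dataset_dir = Path(dir_path)
    with open(dataset_dir / "dataset.json") as json_file:
        entries = json.load(json_file)["training"]

    # Optionally draw a reproducible random subset of the entries.
    if random_samples is not None and random_samples < len(entries):
        entries = random.Random(random_state).sample(entries, random_samples)

    image_paths = [str(dataset_dir / entry["image"]) for entry in entries]
    annotation_paths = [str(dataset_dir / entry["label"]) for entry in entries]
    return image_paths, annotation_paths


# Small demonstration with a made-up dataset description.
with tempfile.TemporaryDirectory() as root:
    description = {
        "training": [
            {"image": "./imagesTr/case_001.nii.gz", "label": "./labelsTr/case_001.nii.gz"},
            {"image": "./imagesTr/case_002.nii.gz", "label": "./labelsTr/case_002.nii.gz"},
        ]
    }
    Path(root, "dataset.json").write_text(json.dumps(description))

    image_paths, annotation_paths = discover_training_pairs(root)
    sampled_images, sampled_annotations = discover_training_pairs(
        root, random_samples=1, random_state=0
    )
```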
- id_to_class_names()[source]¶
- Returns
A mapping of class indices to descriptive class names.
- Return type
Dict[int, str]
- label_items(ids, pseudo_labels=None)[source]¶
Moves the given samples from the unlabeled dataset to the labeled dataset.
datasets.doubly_shuffled_nifti_dataset module¶
Module to load and batch nifti datasets
- class datasets.doubly_shuffled_nifti_dataset.DoublyShuffledNIfTIDataset(image_paths, annotation_paths, cache_size=0, combine_foreground_classes=False, mask_filter_values=None, is_unlabeled=False, shuffle=False, transform=None, target_transform=None, dim=2, slice_indices=None, case_id_prefix='train', random_state=None, only_return_true_labels=False)[source]¶
Bases:
torch.utils.data.dataset.Dataset[torch.utils.data.dataset.T_co]
This dataset can be used with NIfTI images. It is iterable and can return both 2D and 3D images.
- Parameters
image_paths (List[str]) – List with the paths to the images. Has to contain paths of all images which can ever become part of the dataset.
annotation_paths (List[str]) – List with the paths to the annotations. Has to contain paths of all images which can ever become part of the dataset.
cache_size (int, optional) – Number of images to keep in memory to speed up data loading in subsequent epochs. Defaults to zero.
combine_foreground_classes (bool, optional) – Flag if the non-zero values of the annotations should be merged. Defaults to False.
mask_filter_values (Tuple[int], optional) – Values from the annotations which should be used. Defaults to using all values.
shuffle (bool, optional) – Whether the data should be shuffled.
transform (Callable[[Any], Tensor], optional) – Function to transform the images.
target_transform (Callable[[Any], Tensor], optional) – Function to transform the annotations.
dim (int, optional) – 2 or 3 to define if the dataset should return 2d slices or whole 3d images. Defaults to 2.
slice_indices (List[np.array], optional) – Array of indices per image which should be part of the dataset. Uses all slices if None. Defaults to None.
random_state (int, optional) – Controls the data shuffling. Pass an int for reproducible output across multiple runs.
only_return_true_labels (bool, optional) – Whether only true labels or also pseudo-labels are to be returned. Defaults to False.
- add_image(image_id, slice_index=0, pseudo_label=None)[source]¶
Adds an image to this dataset.
- Parameters
image_id (str) – The id of the image.
slice_index (int) – Index of the slice to be added.
pseudo_label (np.array, optional) – An optional pseudo label for the slice. If no pseudo label is provided, the actual label from the corresponding file is used.
- static generate_active_learning_split(filepaths, dim, initial_training_set_size, random_state=None)[source]¶
Generates a split between initial training set and initially unlabeled set for active learning.
- Parameters
filepaths (List[str]) – The file paths to the Nifti files.
dim (int) – The dimensionality of the dataset. (2 or 3.)
initial_training_set_size (int) – The number of samples in the initial training set.
random_state (int, optional) – The random state used to generate the split. Pass an int for reproducibility across runs.
- Returns
A tuple of two lists of np.arrays. The lists contain one array per filepath which contains the slice indices of the slices which should be part of the training and unlabeled sets respectively. The lists can be passed as slice_indices for initialization of a DoublyShuffledNIfTIDataset.
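The shape of the returned value can be illustrated for the 2D case. In this sketch the number of slices per image is passed in directly instead of being read from the NIfTI files, and plain lists stand in for the np.arrays. split_slice_indices is a hypothetical helper, not the package's implementation.

```python
import random


def split_slice_indices(slices_per_image, initial_training_set_size, random_state=None):
    """Randomly assign slices to the initial training set or the unlabeled set.

    Returns two lists with one list of slice indices per image each.
    """
    rng = random.Random(random_state)

    # Enumerate every (image index, slice index) pair of the dataset.
    all_slices = [
        (image_index, slice_index)
        for image_index, num_slices in enumerate(slices_per_image)
        for slice_index in range(num_slices)
    ]
    chosen = set(rng.sample(all_slices, initial_training_set_size))

    training = [[] for _ in slices_per_image]
    unlabeled = [[] for _ in slices_per_image]
    for image_index, slice_index in all_slices:
        target = training if (image_index, slice_index) in chosen else unlabeled
        target[image_index].append(slice_index)
    return training, unlabeled


training_indices, unlabeled_indices = split_slice_indices(
    [3, 2], initial_training_set_size=2, random_state=0
)
```

Each slice ends up in exactly one of the two sets, so the two per-image lists together always cover all slices of that image.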
- get_images_by_id(case_ids)[source]¶
Retrieves the images with the given case ids from the dataset.
- Parameters
case_ids (List[str]) – List with case_ids to get.
- Returns
A list of all the images with provided case ids.
- get_items_for_logging(case_ids)[source]¶
Creates a list of files as tuple of image id and slice index.
- Parameters
case_ids (List[str]) – List with case_ids to get.
- static normalize(img)[source]¶
Normalizes an image by:
Dividing by the maximum value
Subtracting the mean, zeros will be ignored while calculating the mean
Dividing by the negative minimum value
- Parameters
img – The input image that should be normalized.
- Returns
Normalized image with background values normalized to -1
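The three steps can be traced on a flat list of intensities. This is a sketch under stated assumptions: the input is non-negative, contains at least one non-zero value, and uses 0 for the background. normalize_values is a hypothetical pure-Python stand-in for the array-based method.

```python
def normalize_values(values):
    """Apply the three normalization steps to a flat list of intensities."""
    # Step 1: divide by the maximum value, so the range becomes [0, 1].
    peak = max(values)
    scaled = [value / peak for value in values]

    # Step 2: subtract the mean, ignoring zeros when computing the mean.
    foreground = [value for value in scaled if value != 0]
    mean = sum(foreground) / len(foreground)
    centered = [value - mean for value in scaled]

    # Step 3: divide by the negative minimum, which maps the minimum to -1.
    lowest = min(centered)
    return [value / -lowest for value in centered]


normalized = normalize_values([0, 0, 2, 4])
```

After step 2 the background sits at minus the foreground mean, which is the minimum whenever the input contains zeros, so step 3 maps it to -1 as stated in the return description.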
- num_pseudo_labels()[source]¶
- Returns
Number of items with pseudo-labels in the dataset.
- Return type
int
- read_mask_for_image(image_index)[source]¶
Reads the mask for the image from file. Uses correct mask specific parameters.
- Parameters
image_index (int) – Index of the image to load.
- reinforce_type(expected_type)¶
Reinforces the type of the DataPipe instance. The expected_type must be a subtype of the original type hint, so that it restricts the type requirement of the DataPipe instance.