query_strategies package¶
Submodules¶
query_strategies.interpolation_sampling_strategy module¶
Module for interpolation sampling strategy
- class query_strategies.interpolation_sampling_strategy.InterpolationSamplingStrategy(**kwargs)[source]¶
Bases:
query_strategies.query_strategy.QueryStrategy
Class for selecting blocks to label by highest uncertainty and then interpolating within those blocks to generate additional pseudo labels.
- Parameters
**kwargs –
Optional keyword arguments:
- prefer_blocks_without_pseudo_labels (bool, optional): Whether blocks that do not contain existing pseudo-labels should always be labeled before starting labeling of blocks that contain pseudo-labels. Defaults to False.
- block_selection (str): The selection strategy for the blocks to interpolate: “uncertainty” | “random”.
- block_thickness (int): The thickness of the interpolation blocks. Defaults to 5.
- calculation_method (str): Specification of the method used to calculate the uncertainty values: “distance” | “entropy”.
- exclude_background (bool): Whether to exclude the background dimension in calculating the uncertainty value.
- epsilon (float): Small numerical value used for smoothing when using “entropy” as the uncertainty metric.
- interpolation_type (str): The interpolation algorithm to use: “signed-distance” | “morph-contour”.
- interpolation_quality_metric (str): The metric used for evaluating the performance of the interpolation, e.g. “dice”.
- random_state (int, optional): Random state for selecting items to label. Pass an int for reproducible outputs across multiple runs.
- disable_interpolation (bool, optional): Whether the block selection strategy should be run without actually interpolating slices. Defaults to False.
- select_items_to_label(models, data_module, items_to_label, **kwargs)[source]¶
Uses a sampling strategy to select blocks for labeling and generates pseudo labels by interpolation between the bottom and the top slice of a block.
- Parameters
models – Current models that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
items_to_label (int) – Number of items that should be selected for labeling.
**kwargs – Additional, strategy-specific parameters.
- Returns
List of IDs of the data items to be labeled and a dictionary of pseudo labels with the corresponding IDs as keys.
- Return type
Tuple[List[str], Dict[str, np.array]]
- query_strategies.interpolation_sampling_strategy.morphological_contour_interpolation(top, bottom, block_thickness)[source]¶
Interpolates between top and bottom slices using the morphological_contour_interpolator from ITK.
- Parameters
top (np.array) – The top slice of the block.
bottom (np.array) – The bottom slice of the block.
block_thickness (int) – The thickness of the block.
- Returns
The interpolated slices between top and bottom.
- Return type
np.array
- query_strategies.interpolation_sampling_strategy.signed_distance_interpolation(top, bottom, block_thickness)[source]¶
Interpolates between top and bottom slices if possible. Uses a signed distance function to interpolate.
- Parameters
top (np.array) – The top slice of the block.
bottom (np.array) – The bottom slice of the block.
block_thickness (int) – The thickness of the block.
- Returns
The interpolated slices between top and bottom.
- Return type
np.array
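The idea behind signed-distance interpolation can be sketched as follows. This is a minimal illustration, not the module's implementation: the function names are hypothetical, the distance transform is a brute-force NumPy version suitable only for small masks, and it assumes that block_thickness counts the top and bottom slices (so block_thickness - 2 slices are generated, with block_thickness of at least 3) and that interpolation is "not possible" when one of the slices is empty or completely filled.

```python
import numpy as np


def signed_distance(mask):
    """Brute-force signed Euclidean distance: negative inside the mask, positive outside."""
    mask = np.asarray(mask, dtype=bool)
    pixels = np.argwhere(np.ones_like(mask))  # every pixel coordinate, row-major
    inside = np.argwhere(mask)
    outside = np.argwhere(~mask)

    def nearest(targets):
        # Distance from each pixel to its closest target pixel
        diff = pixels[:, None, :] - targets[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

    return (nearest(inside) - nearest(outside)).reshape(mask.shape)


def interpolate_signed_distance(top, bottom, block_thickness):
    """Hypothetical sketch: returns the slices between bottom and top, or None."""
    top = np.asarray(top, dtype=bool)
    bottom = np.asarray(bottom, dtype=bool)
    # Assumption: interpolation is not possible for empty or completely filled slices
    for mask in (top, bottom):
        if not mask.any() or mask.all():
            return None
    sd_top, sd_bottom = signed_distance(top), signed_distance(bottom)
    slices = []
    for i in range(1, block_thickness - 1):
        t = i / (block_thickness - 1)  # 0 at the bottom slice, 1 at the top slice
        blended = (1 - t) * sd_bottom + t * sd_top
        slices.append(blended < 0)  # negative distance means foreground
    return np.stack(slices)
```

Blending the two distance maps linearly and thresholding at zero lets the foreground contour move gradually from the bottom shape to the top shape.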
query_strategies.query_strategy module¶
Module containing abstract superclass for query strategies.
- class query_strategies.query_strategy.QueryStrategy[source]¶
Bases:
abc.ABC
Abstract superclass for query strategies.
- abstract select_items_to_label(models, data_module, items_to_label, **kwargs)[source]¶
Selects the subset of the unlabeled data that should be labeled next.
- Parameters
models – Current models that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
items_to_label – Number of items that should be selected for labeling.
**kwargs – Additional, strategy-specific parameters.
- Returns
List of IDs of the data items to be labeled and an optional dictionary of pseudo labels with the corresponding IDs as keys.
- Return type
Tuple[List[str], Optional[Dict[str, np.array]]]
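A custom strategy would subclass the abstract base class and return the documented tuple. The sketch below uses a local stand-in for QueryStrategy so that it runs on its own; FirstItemsStrategy and the dictionary used in place of the data module are hypothetical and only illustrate the contract.

```python
from abc import ABC, abstractmethod


class QueryStrategy(ABC):
    """Local stand-in mirroring the documented abstract interface."""

    @abstractmethod
    def select_items_to_label(self, models, data_module, items_to_label, **kwargs):
        """Return (list of case IDs, optional dict of pseudo labels)."""


class FirstItemsStrategy(QueryStrategy):
    """Hypothetical strategy that simply picks the first N unlabeled case IDs."""

    def select_items_to_label(self, models, data_module, items_to_label, **kwargs):
        # Assumption: the data module stand-in exposes the unlabeled IDs under this key
        case_ids = list(data_module["unlabeled_case_ids"])
        return case_ids[:items_to_label], None  # no pseudo labels are generated


strategy = FirstItemsStrategy()
selected, pseudo_labels = strategy.select_items_to_label(
    models=None,
    data_module={"unlabeled_case_ids": ["scan_1-0", "scan_1-1", "scan_2-0"]},
    items_to_label=2,
)
```

Because select_items_to_label is abstract, instantiating the base class directly raises a TypeError.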
query_strategies.random_sampling_strategy module¶
Module for random sampling strategy
- class query_strategies.random_sampling_strategy.RandomSamplingStrategy(random_state=None, **kwargs)[source]¶
Bases:
query_strategies.query_strategy.QueryStrategy
Class for selecting items via a random sampling strategy
- Parameters
random_state (int, optional) – Random state for selecting items to label. Pass an int for reproducible outputs across multiple runs.
- select_items_to_label(models, data_module, items_to_label, **kwargs)[source]¶
Selects a random subset of the unlabeled data that should be labeled next. Randomisation relies on shuffling the dataset.
- Parameters
models – Current models that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
items_to_label (int) – Number of items that should be selected for labeling.
**kwargs – Additional, strategy-specific parameters.
- Returns
List of IDs of the data items to be labeled and None because no pseudo labels are generated.
- Return type
Tuple[List[str], None]
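The effect of random_state can be illustrated with a small standalone sketch. It shuffles a plain list of case IDs with Python's random module; the function name and the list input are assumptions made for illustration, and the real strategy works on the data module instead.

```python
import random


def select_random_items(case_ids, items_to_label, random_state=None):
    """Hypothetical sketch: shuffle the unlabeled IDs and take the first N."""
    rng = random.Random(random_state)  # a fixed seed makes the shuffle reproducible
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    return shuffled[:items_to_label], None  # no pseudo labels are generated


ids = ["scan_%d-%d" % (scan, s) for scan in range(3) for s in range(4)]
first_run, _ = select_random_items(ids, 5, random_state=42)
second_run, _ = select_random_items(ids, 5, random_state=42)
```

With the same integer seed, both runs select the same items.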
query_strategies.representativeness_sampling_clustering module¶
Clustering-based representativeness sampling strategy
- class query_strategies.representativeness_sampling_clustering.ClusteringBasedRepresentativenessSamplingStrategy(clustering_algorithm='mean_shift', feature_type='model_features', feature_dimensionality=10, **kwargs)[source]¶
Bases:
query_strategies.representativeness_sampling_strategy_base.RepresentativenessSamplingStrategyBase
Representativeness sampling strategy that clusters the feature vectors and randomly selects items from the clusters least represented in the training set.
- Parameters
clustering_algorithm (string, optional) –
Clustering algorithm to be used: “mean_shift” | “k_means” | “scans”:
“mean_shift”: the mean shift clustering algorithm is used, allowing a variable number of clusters.
“k_means”: the k-means clustering algorithm is used, with a fixed number of clusters.
“scans”: all slices from one scan are considered to represent one cluster.
Defaults to “mean_shift”.
feature_type (string, optional) –
Type of feature vectors to be used: “model_features” | “image_features”:
“model_features”: Feature vectors retrieved from the inner layers of the model are used.
“image_features”: The input images are used as feature vectors.
Defaults to “model_features”.
feature_dimensionality (int, optional) – Number of dimensions the reduced feature vector should have. Defaults to 10.
**kwargs –
Optional keyword arguments:
- bandwidth (float, optional): Kernel bandwidth of the mean shift clustering algorithm. Defaults to 5. Only used if clustering_algorithm = “mean_shift”.
- cluster_all (bool, optional): Whether all data items including outliers should be assigned to a cluster. Defaults to False. Only used if clustering_algorithm = “mean_shift”.
- n_clusters (int, optional): Number of clusters. Defaults to 10. Only used if clustering_algorithm = “k_means”.
- random_state (int, optional): Random state for centroid initialization of k-means algorithm. Defaults to None. Only used if clustering_algorithm = “k_means”.
- compute_representativeness_scores(model, data_module, feature_vectors_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]¶
Computes representativeness scores for all unlabeled items.
- Parameters
model (PytorchModel) – Current model that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
feature_vectors_training_set (np.ndarray) – Feature vectors of the items in the training set.
feature_vectors_unlabeled_set (np.ndarray) – Feature vectors of the items in the unlabeled set.
case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.
- Returns
Representativeness score for each item in the unlabeled set. Items that are underrepresented in the training set receive higher scores.
- Return type
List[float]
- on_select_item(case_id)[source]¶
Callback that is called when an item is selected for labeling.
- Parameters
case_id (string) – Case ID of the selected item.
- prepare_representativeness_computation(feature_vectors_training_set, case_ids_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]¶
Clusters the feature vectors.
- Parameters
feature_vectors_training_set (numpy.ndarray) – Feature vectors of the items in the training set.
case_ids_training_set (List[str]) – Case IDs of the items in the training set.
feature_vectors_unlabeled_set (numpy.ndarray) – Feature vectors of the items in the unlabeled set.
case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.
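One way to turn cluster assignments into representativeness scores is sketched below. The scoring rule (one minus the share of training items that fall into the item's cluster) and the function name are assumptions chosen for illustration; the source only states that items from clusters least represented in the training set are preferred.

```python
from collections import Counter


def cluster_coverage_scores(training_cluster_labels, unlabeled_cluster_labels):
    """Hypothetical scoring: clusters with few training items yield high scores."""
    counts = Counter(training_cluster_labels)
    total = len(training_cluster_labels)
    # A cluster that never occurs in the training set gets the maximum score of 1.0
    return [1 - counts[label] / total for label in unlabeled_cluster_labels]


scores = cluster_coverage_scores(
    training_cluster_labels=[0, 0, 0, 1],
    unlabeled_cluster_labels=[0, 1, 2],
)
```

Here the item in cluster 2, which has no training items at all, receives the highest score.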
query_strategies.representativeness_sampling_distances module¶
Distance-based representativeness sampling strategy
- class query_strategies.representativeness_sampling_distances.DistanceBasedRepresentativenessSamplingStrategy(feature_type='model_features', feature_dimensionality=10, distance_metric='euclidean', **kwargs)[source]¶
Bases:
query_strategies.representativeness_sampling_strategy_base.RepresentativenessSamplingStrategyBase
Representativeness sampling strategy that selects the items with the highest average feature distance to the items in the training set.
- Parameters
feature_type (string, optional) –
Type of feature vectors to be used: “model_features” | “image_features”:
“model_features”: Feature vectors retrieved from the inner layers of the model are used.
“image_features”: The input images are used as feature vectors.
Defaults to “model_features”.
feature_dimensionality (int, optional) – Number of dimensions the reduced feature vector should have. Defaults to 10.
distance_metric (string, optional) – Metric to be used for calculating the distance between feature vectors: “euclidean” | “cosine” | “russellrao”.
- compute_representativeness_scores(model, data_module, feature_vectors_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]¶
Computes representativeness scores for all unlabeled items.
- Parameters
model (PytorchModel) – Current model that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
feature_vectors_training_set (np.ndarray) – Feature vectors of the items in the training set.
feature_vectors_unlabeled_set (np.ndarray) – Feature vectors of the items in the unlabeled set.
case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.
- Returns
Representativeness score for each item in the unlabeled set. Items that are underrepresented in the training set receive higher scores.
- Return type
List[float]
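The "highest average feature distance" criterion can be sketched in a few lines. The example below covers only the “euclidean” metric and uses plain Python sequences instead of NumPy arrays; the function name is hypothetical.

```python
import math


def average_distance_scores(feature_vectors_training_set, feature_vectors_unlabeled_set):
    """Hypothetical sketch: mean Euclidean distance of each unlabeled item to the training set."""
    n_training = len(feature_vectors_training_set)
    return [
        sum(math.dist(unlabeled, training) for training in feature_vectors_training_set)
        / n_training
        for unlabeled in feature_vectors_unlabeled_set
    ]


scores = average_distance_scores(
    feature_vectors_training_set=[(0.0, 0.0)],
    feature_vectors_unlabeled_set=[(3.0, 4.0), (0.0, 1.0)],
)
most_distant_index = scores.index(max(scores))
```

The item that lies farthest from the training features on average receives the highest score and would be selected first.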
query_strategies.representativeness_sampling_strategy_base module¶
Base class for implementing representativeness sampling strategies
- class query_strategies.representativeness_sampling_strategy_base.RepresentativenessSamplingStrategyBase(feature_type='model_features', feature_dimensionality=10, **kwargs)[source]¶
Bases:
query_strategies.query_strategy.QueryStrategy, abc.ABC
Base class for implementing representativeness sampling strategies
- Parameters
feature_dimensionality (int, optional) – Number of dimensions the reduced feature vector should have. Defaults to 10.
feature_type (string, optional) –
Type of feature vectors to be used: “model_features” | “image_features”:
“model_features”: Feature vectors retrieved from the inner layers of the model are used.
“image_features”: The input images are used as feature vectors.
Defaults to “model_features”.
- abstract compute_representativeness_scores(model, data_module, feature_vectors_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]¶
Must be overridden in subclasses to compute the representativeness scores for the items in the unlabeled set.
- Parameters
model (PytorchModel) – Current model that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
feature_vectors_training_set (np.ndarray) – Feature vectors of the items in the training set.
feature_vectors_unlabeled_set (np.ndarray) – Feature vectors of the items in the unlabeled set.
case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.
- Returns
Representativeness score for each item in the unlabeled set. Items that are underrepresented in the training set should receive higher scores.
- Return type
List[float]
- on_select_item(case_id)[source]¶
Callback that is called when an item is selected for labeling.
- Parameters
case_id (string) – Case ID of the selected item.
- prepare_representativeness_computation(feature_vectors_training_set, case_ids_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]¶
Can be overridden in subclasses to perform global computations on all feature vectors before item selection starts.
- Parameters
feature_vectors_training_set (numpy.ndarray) – Feature vectors of the items in the training set.
case_ids_training_set (List[str]) – Case IDs of the items in the training set.
feature_vectors_unlabeled_set (numpy.ndarray) – Feature vectors of the items in the unlabeled set.
case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.
- reduce_features(feature_vectors, epsilon=1e-10)[source]¶
Reduces the dimensionality of feature vectors using a principal component analysis.
- Parameters
feature_vectors (numpy.array) – Feature vectors to be reduced.
epsilon (float, optional) – Smoothing operator.
- Returns
Reduced feature vectors.
- Return type
numpy.array
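A principal component analysis of this kind can be sketched with NumPy's singular value decomposition. This is an illustration under assumptions, not the method's actual code: the function name and the n_components argument are hypothetical, and epsilon is assumed to guard against division by zero when standardising constant features, which the source does not specify.

```python
import numpy as np


def reduce_features_pca(feature_vectors, n_components, epsilon=1e-10):
    """Hypothetical sketch: project flattened feature vectors onto the top principal components."""
    x = np.asarray(feature_vectors, dtype=float)
    x = x.reshape(len(x), -1)  # one flat feature vector per item
    # Assumption: epsilon avoids division by zero for features with zero variance
    x = (x - x.mean(axis=0)) / (x.std(axis=0) + epsilon)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:n_components].T


rng = np.random.default_rng(0)
features = rng.normal(size=(6, 5))
reduced = reduce_features_pca(features, n_components=2)
```

The result has one row per item and n_components columns, with the first column carrying the largest share of the variance.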
- select_items_to_label(models, data_module, items_to_label, **kwargs)[source]¶
Selects a subset of the unlabeled data that increases the representativeness of the training set.
- Parameters
models (PytorchModel) – Current models that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
items_to_label (int) – Number of items that should be selected for labeling.
**kwargs – Additional, strategy-specific parameters.
- Returns
List of IDs of the data items to be labeled and None because no pseudo labels are generated.
- Return type
Tuple[List[str], None]
query_strategies.representativeness_sampling_uncertainty module¶
Combined representativeness and uncertainty sampling strategy
- class query_strategies.representativeness_sampling_uncertainty.UncertaintyRepresentativenessSamplingStrategy(representativeness_algorithm='cluster_coverage', calculation_method='entropy', **kwargs)[source]¶
Bases:
query_strategies.representativeness_sampling_strategy_base.RepresentativenessSamplingStrategyBase
Sampling strategy that combines representativeness and uncertainty sampling.
- Parameters
representativeness_algorithm (string, optional) –
The algorithm to be used to select the most representative samples: “most_distant_sample” | “cluster_coverage”. Defaults to “cluster_coverage”.
- “most_distant_sample”: The unlabeled item that has the highest feature distance to the labeled set is selected for labeling.
- “cluster_coverage”: The features of the unlabeled and labeled items are clustered and an item from the most underrepresented cluster is selected for labeling.
calculation_method (string, optional) – The algorithm to be used for computing the uncertainty: “distance” | “entropy”.
- compute_representativeness_scores(model, data_module, feature_vectors_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]¶
Computes representativeness scores for all unlabeled items.
- Parameters
model (PytorchModel) – Current model that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
feature_vectors_training_set (np.ndarray) – Feature vectors of the items in the training set.
feature_vectors_unlabeled_set (np.ndarray) – Feature vectors of the items in the unlabeled set.
case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.
- Returns
Representativeness score for each item in the unlabeled set. Items that are underrepresented in the training set receive higher scores.
- Return type
List[float]
- prepare_representativeness_computation(feature_vectors_training_set, case_ids_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]¶
Prepares computation of representativeness scores.
- Parameters
feature_vectors_training_set (numpy.ndarray) – Feature vectors of the items in the training set.
case_ids_training_set (List[str]) – Case IDs of the items in the training set.
feature_vectors_unlabeled_set (numpy.ndarray) – Feature vectors of the items in the unlabeled set.
case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.
query_strategies.uncertainty_sampling_strategy module¶
Module for uncertainty sampling strategy
- class query_strategies.uncertainty_sampling_strategy.UncertaintySamplingStrategy(**kwargs)[source]¶
Bases:
query_strategies.query_strategy.QueryStrategy
Class for selecting items to label by highest uncertainty
- Parameters
**kwargs –
Optional keyword arguments:
- calculation_method (str): Specification of the method used to calculate the uncertainty values: “distance” | “entropy”.
- exclude_background (bool): Whether to exclude the background dimension in calculating the uncertainty value.
- prefer_unique_scans (bool): Whether to prefer unique scans among the uncertain scan-slice combinations, if possible. E.g. with items_to_label set to 2: [‘slice_1-32’, ‘slice_1-33’, ‘slice_2-50’] -> [‘slice_1-32’, ‘slice_2-50’]
- epsilon (float): Small numerical value used for smoothing when using “entropy” as the uncertainty metric.
- compute_uncertainties(model, data_module)[source]¶
- Parameters
model (PytorchModel) – Current model that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
- Returns
Model uncertainties and case IDs for all items in the unlabeled set.
- Return type
Tuple[List[float], List[str]]
- select_items_to_label(models, data_module, items_to_label, **kwargs)[source]¶
Selects the subset of the unlabeled data with the highest uncertainty that should be labeled next.
- Parameters
models – Current models that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
items_to_label (int) – Number of items that should be selected for labeling.
calculation_method (str, optional) – Specification of the method used to calculate the uncertainty values. (default = ‘distance’)
**kwargs – Additional, strategy-specific parameters.
- Returns
List of IDs of the data items to be labeled and None because no pseudo labels are generated.
- Return type
Tuple[List[str], None]
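Given the output shape of compute_uncertainties (a list of uncertainty values and a list of case IDs), the selection step can be sketched as a top-k ranking. The function name is hypothetical, and the sketch assumes that a larger value means higher uncertainty.

```python
def select_most_uncertain(uncertainties, case_ids, items_to_label):
    """Hypothetical sketch: return the IDs of the N items with the largest uncertainty values."""
    ranked = sorted(zip(uncertainties, case_ids), key=lambda pair: pair[0], reverse=True)
    selected = [case_id for _, case_id in ranked[:items_to_label]]
    return selected, None  # no pseudo labels are generated


selected, pseudo_labels = select_most_uncertain(
    uncertainties=[0.2, 0.9, 0.5],
    case_ids=["scan_1-0", "scan_1-1", "scan_2-0"],
    items_to_label=2,
)
```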
query_strategies.utils module¶
Module containing functions used for different query strategies.
- query_strategies.utils.clean_duplicate_scans(uncertainties, items_to_label)[source]¶
Removes duplicate scans from the list if possible. If the minimum number of samples cannot be reached without duplicates, the duplicates are kept.
- Parameters
uncertainties (List[Tuple[float, str]]) – List with tuples of uncertainty value and case id.
items_to_label (int) – Number of items that should be selected for labeling.
- Returns
A cleaned list of tuples.
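The behaviour described above can be sketched as follows. The sketch assumes that the input list is already ordered by priority and that a case ID has the form "<scan>-<slice>", so the scan is everything before the last hyphen; both assumptions and the function name are illustrative only.

```python
def clean_duplicate_scans_sketch(uncertainties, items_to_label):
    """Hypothetical sketch: keep one entry per scan, refilling with duplicates only if needed."""
    seen_scans = set()
    unique, duplicates = [], []
    for value, case_id in uncertainties:
        scan_id = case_id.rsplit("-", 1)[0]  # assumption: "<scan>-<slice>" naming
        if scan_id in seen_scans:
            duplicates.append((value, case_id))
        else:
            seen_scans.add(scan_id)
            unique.append((value, case_id))
    # Fall back to duplicates when there are not enough distinct scans
    missing = max(0, items_to_label - len(unique))
    return unique + duplicates[:missing]


ranked = [(0.9, "slice_1-32"), (0.8, "slice_1-33"), (0.7, "slice_2-50")]
two_items = clean_duplicate_scans_sketch(ranked, items_to_label=2)
three_items = clean_duplicate_scans_sketch(ranked, items_to_label=3)
```

With items_to_label set to 2, the second slice of the first scan is dropped; with 3, it is kept because there are not enough distinct scans.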
- query_strategies.utils.distance_to_max_uncertainty(predictions, max_uncertainty_value=0.5, **kwargs)[source]¶
Calculates the uncertainties based on the distance to a maximum uncertainty value:
\[\sum | max\_uncertainty\_value - predictions | \]
- Parameters
predictions (torch.Tensor) – The predictions of the model.
max_uncertainty_value (float, optional) – The maximum value of uncertainty in the predictions. (default = 0.5)
**kwargs – Keyword arguments specific for this calculation.
- Returns
Uncertainty value for each image in the batch of predictions.
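The formula above can be evaluated by hand for a small batch. The sketch below uses nested Python lists (one flat list of predicted probabilities per image) in place of a torch.Tensor, and the function name is hypothetical.

```python
def distance_to_max_uncertainty_sketch(predictions, max_uncertainty_value=0.5):
    """Hypothetical sketch: sum of |max_uncertainty_value - p| over all predictions of an image."""
    return [
        sum(abs(max_uncertainty_value - p) for p in image)
        for image in predictions
    ]


batch = [
    [0.5, 0.5],  # every prediction sits exactly at the maximum uncertainty value
    [1.0, 0.0],  # every prediction is fully confident
]
values = distance_to_max_uncertainty_sketch(batch)
```

Predictions that sit at max_uncertainty_value contribute zero to the sum, so the first image yields 0.0 and the second yields 1.0.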
- query_strategies.utils.entropy(predictions, max_uncertainty_value=0.5, **kwargs)[source]¶
Calculates the uncertainties based on the entropy of the distance to a maximum uncertainty value:
\[- \sum | max\_uncertainty\_value - predictions | \cdot | \log({max\_uncertainty\_value - predictions}) | \]
- Parameters
predictions (torch.Tensor) – The predictions of the model.
max_uncertainty_value (float, optional) – The maximum value of uncertainty in the predictions. (default = 0.5)
**kwargs –
Keyword arguments specific for this calculation:
epsilon (float): Small numerical value used for smoothing the calculation. (default = 1e-10)
- Returns
Uncertainty value for each image in the batch of predictions.
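A rough numerical reading of the formula is sketched below. The source does not say where epsilon enters or how negative differences are handled inside the logarithm, so this sketch assumes that the absolute distance is smoothed with epsilon before the logarithm is taken. It again uses nested Python lists instead of a torch.Tensor, and the function name is hypothetical.

```python
import math


def entropy_sketch(predictions, max_uncertainty_value=0.5, epsilon=1e-10):
    """Hypothetical sketch of the entropy-style uncertainty for each image in a batch."""
    values = []
    for image in predictions:
        total = 0.0
        for p in image:
            # Assumption: smooth the absolute distance so that log(0) never occurs
            distance = abs(max_uncertainty_value - p) + epsilon
            total += distance * abs(math.log(distance))
        values.append(-total)
    return values


values = entropy_sketch([[0.5], [1.0]])
```

Under these assumptions every value is non-positive, and an image whose predictions sit at max_uncertainty_value yields a value close to zero.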
- query_strategies.utils.select_uncertainty_calculation(calculation_method)[source]¶
Selects the calculation function based on the provided name.
- Parameters
calculation_method (str) – Name of the calculation method. Allowable values: “distance” | “entropy”.
- Returns
A callable function to calculate uncertainty based on predictions.
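Name-based selection of this kind is commonly implemented as a dictionary lookup. The sketch below uses two placeholder functions in place of the real calculations; raising a ValueError for unknown names is an assumption, since the source does not describe the error behaviour.

```python
def distance_placeholder(predictions, **kwargs):
    """Stand-in for the "distance" calculation."""
    return "distance"


def entropy_placeholder(predictions, **kwargs):
    """Stand-in for the "entropy" calculation."""
    return "entropy"


_CALCULATIONS = {
    "distance": distance_placeholder,
    "entropy": entropy_placeholder,
}


def select_uncertainty_calculation_sketch(calculation_method):
    """Hypothetical sketch: map a method name to the matching callable."""
    try:
        return _CALCULATIONS[calculation_method]
    except KeyError:
        # Assumption: unknown names are rejected with a ValueError
        allowed = " | ".join(sorted(_CALCULATIONS))
        raise ValueError(
            "Unknown calculation method %r, expected one of: %s" % (calculation_method, allowed)
        ) from None


calculate = select_uncertainty_calculation_sketch("entropy")
```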