query_strategies package¶

Submodules¶

query_strategies.interpolation_sampling_strategy module¶

Module for interpolation sampling strategy

class query_strategies.interpolation_sampling_strategy.InterpolationSamplingStrategy(**kwargs)[source]¶

Bases: query_strategies.query_strategy.QueryStrategy

Class for selecting blocks to label by highest uncertainty and then interpolating within those blocks to generate additional pseudo labels.

Parameters

**kwargs –

Optional keyword arguments:
- prefer_blocks_without_pseudo_labels (bool, optional): Whether blocks that do not contain existing pseudo-labels should always be labeled before blocks that contain pseudo-labels. Defaults to False.
- block_selection (str): The selection strategy for the blocks to interpolate: “uncertainty” | “random”.
- block_thickness (int): The thickness of the interpolation blocks. Defaults to 5.
- calculation_method (str): The method used to calculate the uncertainty values: “distance” | “entropy”.
- exclude_background (bool): Whether to exclude the background dimension when calculating the uncertainty value.
- epsilon (float): Small numerical value used for smoothing when using “entropy” as the uncertainty metric.
- interpolation_type (str): The interpolation algorithm to use: “signed-distance” | “morph-contour”.
- interpolation_quality_metric (str): The metric used for evaluating the performance of the interpolation, e.g. “dice”.
- random_state (int, optional): Random state for selecting items to label. Pass an int for reproducible outputs across multiple runs.
- disable_interpolation (bool, optional): Whether the block selection strategy should be run without actually interpolating slices. Defaults to False.

select_items_to_label(models, data_module, items_to_label, **kwargs)[source]¶

Uses a sampling strategy to select blocks for labeling and generates pseudo labels by interpolation between the bottom and the top slice of a block.

Parameters

models – Current models that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
items_to_label (int) – Number of items that should be selected for labeling.
**kwargs – Additional, strategy-specific parameters.

Returns

List of IDs of the data items to be labeled and a dictionary of pseudo labels with the corresponding IDs as keys.

Return type

Tuple[List[str], Dict[str, np.array]]

query_strategies.interpolation_sampling_strategy.morphological_contour_interpolation(top, bottom, block_thickness)[source]¶

Interpolates between top and bottom slices using the morphological_contour_interpolator from ITK.

Parameters

top (np.array) – The top slice of the block.
bottom (np.array) – The bottom slice of the block.
block_thickness (int) – The thickness of the block.

Returns

The interpolated slices between top and bottom.

Return type

np.array

query_strategies.interpolation_sampling_strategy.signed_distance_interpolation(top, bottom, block_thickness)[source]¶

Interpolates between top and bottom slices if possible. Uses a signed distance function to interpolate.

Parameters

top (np.array) – The top slice of the block.
bottom (np.array) – The bottom slice of the block.
block_thickness (int) – The thickness of the block.

Returns

The interpolated slices between top and bottom.

Return type

np.array
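The idea behind signed-distance interpolation can be illustrated on small binary masks in pure Python. This is a rough sketch under stated assumptions, not the package's implementation: the helper names `signed_distance` and `interpolate_masks`, the brute-force distance computation, the sign convention, and the `n_between` parameter (number of intermediate slices) are all hypothetical.

```python
import math


def signed_distance(mask):
    """Brute-force signed distance map: negative inside the mask, positive outside."""
    rows, cols = len(mask), len(mask[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    inside = [p for p in cells if mask[p[0]][p[1]]]
    outside = [p for p in cells if not mask[p[0]][p[1]]]
    if not inside or not outside:
        raise ValueError("mask must contain both foreground and background")
    result = [[0.0] * cols for _ in range(rows)]
    for r, c in cells:
        # Distance to the nearest cell of the opposite class.
        targets = outside if mask[r][c] else inside
        distance = min(math.dist((r, c), t) for t in targets)
        result[r][c] = -distance if mask[r][c] else distance
    return result


def interpolate_masks(top, bottom, n_between):
    """Blend the two signed distance maps linearly and threshold at zero."""
    sd_top, sd_bottom = signed_distance(top), signed_distance(bottom)
    slices = []
    for i in range(1, n_between + 1):
        t = i / (n_between + 1)  # 0 < t < 1, position between top and bottom
        blended = [
            [(1 - t) * a + t * b for a, b in zip(row_top, row_bottom)]
            for row_top, row_bottom in zip(sd_top, sd_bottom)
        ]
        slices.append([[1 if v < 0 else 0 for v in row] for row in blended])
    return slices


top = [[0, 1, 1, 1, 1, 1, 0]]
bottom = [[0, 0, 0, 1, 0, 0, 0]]
middle = interpolate_masks(top, bottom, 1)  # -> [[[0, 0, 1, 1, 1, 0, 0]]]
```

The intermediate slice has a foreground width of 3, between the widths 5 and 1 of the two outer slices. How `block_thickness` maps to the number of intermediate slices is not stated here, so the sketch takes that number directly.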

query_strategies.query_strategy module¶

Module containing abstract superclass for query strategies.

class query_strategies.query_strategy.QueryStrategy[source]¶

Bases: abc.ABC

Abstract superclass for query strategies.

abstract select_items_to_label(models, data_module, items_to_label, **kwargs)[source]¶

Selects subset of the unlabeled data that should be labeled next.

Parameters

models – Current models that should be improved by selecting additional data for labeling.
data_module – A data module object providing the unlabeled data.
items_to_label – Number of items that should be selected for labeling.
**kwargs – Additional, strategy-specific parameters.

Returns

List of IDs of the data items to be labeled and an optional dictionary of pseudo labels with the corresponding IDs as keys.

Return type

Tuple[List[str], Optional[Dict[str, np.array]]]
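A custom strategy would override `select_items_to_label` and return the documented tuple. The following is a minimal sketch, not code from the package: the `QueryStrategy` class below is a local stand-in for the real base class so that the example is self-contained, and `FirstItemsStrategy` and the `unlabeled_case_ids` attribute are hypothetical names.

```python
from abc import ABC, abstractmethod
from types import SimpleNamespace


class QueryStrategy(ABC):
    """Local stand-in for query_strategies.query_strategy.QueryStrategy (assumption)."""

    @abstractmethod
    def select_items_to_label(self, models, data_module, items_to_label, **kwargs):
        """Return (list of case IDs, optional dict of pseudo labels)."""


class FirstItemsStrategy(QueryStrategy):
    """Hypothetical strategy that simply picks the first unlabeled items."""

    def select_items_to_label(self, models, data_module, items_to_label, **kwargs):
        # `unlabeled_case_ids` is a made-up attribute used only for this sketch.
        case_ids = list(data_module.unlabeled_case_ids)
        # No pseudo labels are generated, so the second element is None.
        return case_ids[:items_to_label], None


fake_data_module = SimpleNamespace(unlabeled_case_ids=["a-1", "a-2", "b-1"])
selected, pseudo_labels = FirstItemsStrategy().select_items_to_label(
    models=None, data_module=fake_data_module, items_to_label=2
)
```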

query_strategies.random_sampling_strategy module¶

Module for random sampling strategy

class query_strategies.random_sampling_strategy.RandomSamplingStrategy(random_state=None, **kwargs)[source]¶

Bases: query_strategies.query_strategy.QueryStrategy

Class for selecting items via a random sampling strategy

Parameters: random_state (int, optional) – Random state for selecting items to label. Pass an int for reproducible outputs across multiple runs.

select_items_to_label(models, data_module, items_to_label, **kwargs)[source]¶

Selects a random subset of the unlabeled data that should be labeled next. Randomisation relies on the shuffling of the dataset.

Parameters

models – Current models that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
items_to_label (int) – Number of items that should be selected for labeling.
**kwargs – Additional, strategy-specific parameters.

Returns

List of IDs of the data items to be labeled and None because no pseudo labels are generated.

Return type

Tuple[List[str], None]
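The effect of `random_state` on reproducibility can be sketched with the standard library. Note that this sketch swaps in `random.Random.sample` instead of dataset shuffling, and the function name `select_random_items` is hypothetical.

```python
import random


def select_random_items(case_ids, items_to_label, random_state=None):
    """Pick a random subset of case IDs; a fixed random_state gives repeatable picks."""
    rng = random.Random(random_state)
    count = min(items_to_label, len(case_ids))
    # Mirrors the documented return type: (selected IDs, None for pseudo labels).
    return rng.sample(list(case_ids), count), None


case_ids = [f"scan_{i}-0" for i in range(10)]
first_run, _ = select_random_items(case_ids, 3, random_state=42)
second_run, _ = select_random_items(case_ids, 3, random_state=42)
```

With the same integer seed, `first_run` and `second_run` contain the same IDs in the same order.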

query_strategies.representativeness_sampling_clustering module¶

Clustering-based representativeness sampling strategy

class query_strategies.representativeness_sampling_clustering.ClusteringBasedRepresentativenessSamplingStrategy(clustering_algorithm='mean_shift', feature_type='model_features', feature_dimensionality=10, **kwargs)[source]¶

Bases: query_strategies.representativeness_sampling_strategy_base.RepresentativenessSamplingStrategyBase

Representativeness sampling strategy that clusters the feature vectors and randomly selects items from the clusters least represented in the training set.

Parameters

clustering_algorithm (string, optional) –
Clustering algorithm to be used: “mean_shift” | “k_means” | “scans”:
- “mean_shift”: the mean shift clustering algorithm is used, allowing a variable number of clusters.
- “k_means”: the k-means clustering algorithm is used, with a fixed number of clusters.
- “scans”: all slices from one scan are considered to represent one cluster.
Defaults to “mean_shift”.
feature_type (string, optional) –
Type of feature vectors to be used: “model_features” | “image_features”:
- “model_features”: Feature vectors retrieved from the inner layers of the model are used.
- “image_features”: The input images are used as feature vectors.
Defaults to “model_features”.
feature_dimensionality (int, optional) – Number of dimensions the reduced feature vector should have. Defaults to 10.
**kwargs –
Optional keyword arguments:
- bandwidth (float, optional): Kernel bandwidth of the mean shift clustering algorithm. Defaults to 5. Only used if clustering_algorithm = “mean_shift”.
- cluster_all (bool, optional): Whether all data items including outliers should be assigned to a cluster. Defaults to False. Only used if clustering_algorithm = “mean_shift”.
- n_clusters (int, optional): Number of clusters. Defaults to 10. Only used if clustering_algorithm = “k_means”.
- random_state (int, optional): Random state for centroid initialization of k-means algorithm. Defaults to None. Only used if clustering_algorithm = “k_means”.

compute_representativeness_scores(model, data_module, feature_vectors_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]¶

Computes representativeness scores for all unlabeled items.

Parameters

model (PytorchModel) – Current model that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
feature_vectors_training_set (np.ndarray) – Feature vectors of the items in the training set.
feature_vectors_unlabeled_set (np.ndarray) – Feature vectors of the items in the unlabeled set.
case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.

Returns

Representativeness score for each item in the unlabeled set. Items that are underrepresented in the training set receive higher scores.

Return type

List[float]
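One way such cluster-based scores could be derived is sketched below. This is an assumption for illustration only and may differ from the scoring the class actually uses: each unlabeled item is scored by the share of its cluster that is not yet in the training set, and the function name `cluster_coverage_scores` is hypothetical. Cluster labels are taken as given, so no clustering library is needed.

```python
from collections import Counter


def cluster_coverage_scores(training_labels, unlabeled_labels):
    """Score unlabeled items higher when their cluster has few training items."""
    training_counts = Counter(training_labels)
    total_counts = training_counts + Counter(unlabeled_labels)
    return [
        # Fraction of the item's cluster that is NOT covered by the training set.
        1.0 - training_counts[label] / total_counts[label]
        for label in unlabeled_labels
    ]


# Cluster 0 is well covered, cluster 1 partly, cluster 2 not at all.
scores = cluster_coverage_scores(
    training_labels=[0, 0, 0, 1],
    unlabeled_labels=[0, 1, 1, 2],
)
```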

on_select_item(case_id)[source]¶

Callback that is called when an item is selected for labeling.

Parameters: case_id (string) – Case ID of the selected item.

prepare_representativeness_computation(feature_vectors_training_set, case_ids_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]¶

Clusters the feature vectors.

Parameters

feature_vectors_training_set (numpy.ndarray) – Feature vectors of the items in the training set.
case_ids_training_set (List[str]) – Case IDs of the items in the training set.
feature_vectors_unlabeled_set (numpy.ndarray) – Feature vectors of the items in the unlabeled set.
case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.

query_strategies.representativeness_sampling_distances module¶

Distance-based representativeness sampling strategy

class query_strategies.representativeness_sampling_distances.DistanceBasedRepresentativenessSamplingStrategy(feature_type='model_features', feature_dimensionality=10, distance_metric='euclidean', **kwargs)[source]¶

Bases: query_strategies.representativeness_sampling_strategy_base.RepresentativenessSamplingStrategyBase

Representativeness sampling strategy that selects the items with the highest average feature distance to the items in the training set.

Parameters

feature_type (string, optional) –
Type of feature vectors to be used: “model_features” | “image_features”:
- “model_features”: Feature vectors retrieved from the inner layers of the model are used.
- “image_features”: The input images are used as feature vectors.
Defaults to “model_features”.
feature_dimensionality (int, optional) – Number of dimensions the reduced feature vector should have. Defaults to 10.
distance_metric (string, optional) – Metric to be used for calculating the distance between feature vectors: “euclidean” | “cosine” | “russellrao”.

compute_representativeness_scores(model, data_module, feature_vectors_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]¶

Computes representativeness scores for all unlabeled items.

Parameters

model (PytorchModel) – Current model that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
feature_vectors_training_set (np.ndarray) – Feature vectors of the items in the training set.
feature_vectors_unlabeled_set (np.ndarray) – Feature vectors of the items in the unlabeled set.
case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.

Returns

Representativeness score for each item in the unlabeled set. Items that are underrepresented in the training set receive higher scores.

Return type

List[float]
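The "highest average feature distance" criterion from the class description can be sketched with the Euclidean metric in pure Python. The function name `average_distance_scores` is hypothetical, and the real class works on NumPy arrays and supports further metrics.

```python
import math


def average_distance_scores(training_vectors, unlabeled_vectors):
    """Mean Euclidean distance from each unlabeled vector to all training vectors."""
    return [
        sum(math.dist(u, t) for t in training_vectors) / len(training_vectors)
        for u in unlabeled_vectors
    ]


training = [(0.0, 0.0), (0.0, 2.0)]
unlabeled = [(0.0, 1.0), (0.0, 10.0)]
scores = average_distance_scores(training, unlabeled)  # -> [1.0, 9.0]
```

The second unlabeled item lies far from both training items, so it receives the higher score.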

query_strategies.representativeness_sampling_strategy_base module¶

Base class for implementing representativeness sampling strategies

class query_strategies.representativeness_sampling_strategy_base.RepresentativenessSamplingStrategyBase(feature_type='model_features', feature_dimensionality=10, **kwargs)[source]¶

Bases: query_strategies.query_strategy.QueryStrategy, abc.ABC

Base class for implementing representativeness sampling strategies

Parameters

feature_dimensionality (int, optional) – Number of dimensions the reduced feature vector should have. Defaults to 10.
feature_type (string, optional) –
Type of feature vectors to be used: “model_features” | “image_features”:
- “model_features”: Feature vectors retrieved from the inner layers of the model are used.
- “image_features”: The input images are used as feature vectors.
Defaults to “model_features”.

abstract compute_representativeness_scores(model, data_module, feature_vectors_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]¶

Must be overridden in subclasses to compute the representativeness scores for the items in the unlabeled set.

Parameters

model (PytorchModel) – Current model that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
feature_vectors_training_set (np.ndarray) – Feature vectors of the items in the training set.
feature_vectors_unlabeled_set (np.ndarray) – Feature vectors of the items in the unlabeled set.
case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.

Returns

Representativeness score for each item in the unlabeled set. Items that are underrepresented in the training set should receive higher scores.

Return type

List[float]

on_select_item(case_id)[source]¶

Callback that is called when an item is selected for labeling.

Parameters: case_id (string) – Case ID of the selected item.

prepare_representativeness_computation(feature_vectors_training_set, case_ids_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]¶

Can be overridden in subclasses to perform global computations on all feature vectors before item selection starts.

Parameters

feature_vectors_training_set (numpy.ndarray) – Feature vectors of the items in the training set.
case_ids_training_set (List[str]) – Case IDs of the items in the training set.
feature_vectors_unlabeled_set (numpy.ndarray) – Feature vectors of the items in the unlabeled set.
case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.

reduce_features(feature_vectors, epsilon=1e-10)[source]¶

Reduces the dimensionality of feature vectors using a principal component analysis.

Parameters

feature_vectors (numpy.array) – Feature vectors to be reduced.
epsilon (float, optional) – Smoothing operator.

Returns

Reduced feature vectors.

Return type

numpy.array

select_items_to_label(models, data_module, items_to_label, **kwargs)[source]¶

Selects a subset of the unlabeled data that increases the representativeness of the training set.

Parameters

models (PytorchModel) – Current models that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
items_to_label (int) – Number of items that should be selected for labeling.
**kwargs – Additional, strategy-specific parameters.

Returns

List of IDs of the data items to be labeled and None because no pseudo labels are generated.

Return type

Tuple[List[str], None]

query_strategies.representativeness_sampling_uncertainty module¶

Combined representativeness and uncertainty sampling strategy

class query_strategies.representativeness_sampling_uncertainty.UncertaintyRepresentativenessSamplingStrategy(representativeness_algorithm='cluster_coverage', calculation_method='entropy', **kwargs)[source]¶

Bases: query_strategies.representativeness_sampling_strategy_base.RepresentativenessSamplingStrategyBase

Sampling strategy that combines representativeness and uncertainty sampling.

Parameters

representativeness_algorithm (string, optional) –
The algorithm to be used to select the most representative samples: “most_distant_sample” | “cluster_coverage”. Defaults to “cluster_coverage”.
- “most_distant_sample”: The unlabeled item that has the highest feature distance to the labeled set is selected for labeling.
- “cluster_coverage”: The features of the unlabeled and labeled items are clustered and an item from the most underrepresented cluster is selected for labeling.
calculation_method (string, optional) – The algorithm to be used for computing the uncertainty: “distance” | “entropy”.

compute_representativeness_scores(model, data_module, feature_vectors_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]¶

Computes representativeness scores for all unlabeled items.

Parameters

model (PytorchModel) – Current model that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
feature_vectors_training_set (np.ndarray) – Feature vectors of the items in the training set.
feature_vectors_unlabeled_set (np.ndarray) – Feature vectors of the items in the unlabeled set.
case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.

Returns

Representativeness score for each item in the unlabeled set. Items that are underrepresented in the training set receive higher scores.

Return type

List[float]

prepare_representativeness_computation(feature_vectors_training_set, case_ids_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]¶

Prepares computation of representativeness scores.

Parameters

feature_vectors_training_set (numpy.ndarray) – Feature vectors of the items in the training set.
case_ids_training_set (List[str]) – Case IDs of the items in the training set.
feature_vectors_unlabeled_set (numpy.ndarray) – Feature vectors of the items in the unlabeled set.
case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.

query_strategies.uncertainty_sampling_strategy module¶

Module for uncertainty sampling strategy

class query_strategies.uncertainty_sampling_strategy.UncertaintySamplingStrategy(**kwargs)[source]¶

Bases: query_strategies.query_strategy.QueryStrategy

Class for selecting items to label by highest uncertainty

Parameters

**kwargs –

Optional keyword arguments:
- calculation_method (str): The method used to calculate the uncertainty values: “distance” | “entropy”.
- exclude_background (bool): Whether to exclude the background dimension when calculating the uncertainty value.
- prefer_unique_scans (bool): Whether to prefer unique scans among the uncertain scan-slice combinations, if possible. E.g. with items_to_label set to 2: [‘slice_1-32’, ‘slice_1-33’, ‘slice_2-50’] -> [‘slice_1-32’, ‘slice_2-50’].
- epsilon (float): Small numerical value used for smoothing when using “entropy” as the uncertainty metric.

compute_uncertainties(model, data_module)[source]¶

Parameters

model (PytorchModel) – Current model that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.

Returns

Model uncertainties and case IDs for all items in the unlabeled set.

Return type

Tuple[List[float], List[str]]

select_items_to_label(models, data_module, items_to_label, **kwargs)[source]¶

Selects subset of the unlabeled data with the highest uncertainty that should be labeled next.

Parameters

models – Current models that should be improved by selecting additional data for labeling.
data_module (ActiveLearningDataModule) – A data module object providing data.
items_to_label (int) – Number of items that should be selected for labeling.
calculation_method (str, optional) – Specification of the method used to calculate the uncertainty values. (default = ‘distance’)
**kwargs – Additional, strategy-specific parameters.

Returns

List of IDs of the data items to be labeled and None because no pseudo labels are generated.

Return type

Tuple[List[str], None]
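Given the two lists returned by `compute_uncertainties`, the selection step could look roughly like the sketch below. It assumes that a larger value means higher uncertainty, which may not hold for every calculation method, and the function name `select_most_uncertain` is hypothetical.

```python
import heapq


def select_most_uncertain(uncertainties, case_ids, items_to_label):
    """Return the case IDs with the largest uncertainty values and None for pseudo labels."""
    # Assumption: larger value = more uncertain.
    ranked = heapq.nlargest(items_to_label, zip(uncertainties, case_ids))
    return [case_id for _, case_id in ranked], None


selected, pseudo_labels = select_most_uncertain(
    uncertainties=[0.2, 0.9, 0.5],
    case_ids=["slice_1-32", "slice_1-33", "slice_2-50"],
    items_to_label=2,
)
# selected -> ["slice_1-33", "slice_2-50"]
```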

query_strategies.utils module¶

Module containing functions used for different query strategies.

query_strategies.utils.clean_duplicate_scans(uncertainties, items_to_label)[source]¶

Removes duplicate scans from the list if possible. If the minimum number of samples can’t be reached without duplicates, duplicates are kept.

Parameters

uncertainties (List[Tuple[float, str]]) – List with tuples of uncertainty value and case id.
items_to_label (int) – Number of items that should be selected for labeling.

Returns

A cleaned list of tuples.
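The described behaviour can be sketched as follows. This is an illustration, not the package's code: it assumes that the scan ID is the part of the case ID before the last hyphen (as in ‘slice_1-32’), that the input is already ordered by priority, and that the unmodified list is returned when too few unique scans exist. The name `clean_duplicate_scans_sketch` is hypothetical.

```python
def clean_duplicate_scans_sketch(uncertainties, items_to_label):
    """Keep only the first entry per scan if enough unique scans are available."""
    seen_scans = set()
    unique = []
    for value, case_id in uncertainties:
        scan_id = case_id.rsplit("-", 1)[0]  # assumed "<scan>-<slice>" layout
        if scan_id not in seen_scans:
            seen_scans.add(scan_id)
            unique.append((value, case_id))
    if len(unique) < items_to_label:
        # Not enough unique scans: keep the duplicates.
        return list(uncertainties)
    return unique


ranked = [(0.9, "slice_1-32"), (0.8, "slice_1-33"), (0.7, "slice_2-50")]
cleaned = clean_duplicate_scans_sketch(ranked, items_to_label=2)
# cleaned -> [(0.9, "slice_1-32"), (0.7, "slice_2-50")]
```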

query_strategies.utils.distance_to_max_uncertainty(predictions, max_uncertainty_value=0.5, **kwargs)[source]¶

Calculates the uncertainties based on the distance to a maximum uncertainty value:

\[\sum | max\_uncertainty\_value - predictions | \]

Parameters

predictions (torch.Tensor) – The predictions of the model.
max_uncertainty_value (float, optional) – The maximum value of uncertainty in the predictions. (default = 0.5)
**kwargs – Keyword arguments specific for this calculation.

Returns

Uncertainty value for each image in the batch of predictions.
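The formula can be sketched in pure Python on nested lists, one inner list of flattened predictions per image. The real function operates on `torch.Tensor` inputs; the name `distance_to_max_uncertainty_sketch` and the list-based input are assumptions made for this example.

```python
def distance_to_max_uncertainty_sketch(predictions, max_uncertainty_value=0.5):
    """Sum of absolute distances to the maximum uncertainty value, per image."""
    return [
        sum(abs(max_uncertainty_value - p) for p in image)
        for image in predictions
    ]


batch = [
    [0.5, 0.5],  # predictions exactly at the maximum uncertainty value
    [0.0, 1.0],  # fully confident predictions
]
values = distance_to_max_uncertainty_sketch(batch)  # -> [0.0, 1.0]
```

By construction, a smaller sum means the predictions lie closer to the maximum uncertainty value.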

query_strategies.utils.entropy(predictions, max_uncertainty_value=0.5, **kwargs)[source]¶

Calculates the uncertainties based on the entropy of the distance to a maximum uncertainty value:

\[- \sum | max\_uncertainty\_value - predictions | \cdot | \log({max\_uncertainty\_value - predictions}) | \]

Parameters

predictions (torch.Tensor) – The predictions of the model.
max_uncertainty_value (float, optional) – The maximum value of uncertainty in the predictions. (default = 0.5)
**kwargs –
Keyword arguments specific for this calculation:
- epsilon (float): The smoothing value to avoid the magic number. (default = 1e-10)

Returns

Uncertainty value for each image in the batch of predictions.
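A literal reading of the formula can be sketched in pure Python. Several details are assumptions rather than documented behaviour: the absolute distance is used inside the logarithm, `epsilon` is added there to avoid evaluating log(0), and the input is a nested list rather than a `torch.Tensor`. The name `entropy_sketch` is hypothetical.

```python
import math


def entropy_sketch(predictions, max_uncertainty_value=0.5, epsilon=1e-10):
    """Negative sum of distance * |log(distance + epsilon)|, per image."""
    results = []
    for image in predictions:
        total = 0.0
        for p in image:
            distance = abs(max_uncertainty_value - p)
            # epsilon keeps the logarithm finite when distance is zero (assumption).
            total += distance * abs(math.log(distance + epsilon))
        results.append(-total)
    return results


values = entropy_sketch([[0.5, 0.5], [0.0, 1.0]])
```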

query_strategies.utils.select_uncertainty_calculation(calculation_method)[source]¶

Selects the calculation function based on the provided name.

Parameters: calculation_method (str) – Name of the calculation method. Allowable values: “distance” | “entropy”.
Returns: A callable function to calculate uncertainty based on predictions.
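Name-based selection of this kind is commonly written as a dictionary lookup. The sketch below is generic and not the package's code: `select_calculation`, the `registry` argument, and the `ValueError` raised for unknown names are assumptions, since the behaviour for invalid names is not documented here.

```python
def select_calculation(calculation_method, registry):
    """Look up a calculation function by name in a name-to-callable mapping."""
    try:
        return registry[calculation_method]
    except KeyError:
        allowed = " | ".join(sorted(registry))
        raise ValueError(
            f"Unknown calculation method {calculation_method!r}; expected one of: {allowed}"
        ) from None


# Placeholder callables stand in for the real uncertainty functions.
registry = {
    "distance": lambda predictions: [sum(abs(0.5 - p) for p in image) for image in predictions],
    "entropy": lambda predictions: [0.0 for _ in predictions],
}
calculate = select_calculation("distance", registry)
values = calculate([[0.0, 1.0]])  # -> [1.0]
```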


Module contents¶
