query_strategies package

Submodules

query_strategies.interpolation_sampling_strategy module

Module for interpolation sampling strategy

class query_strategies.interpolation_sampling_strategy.InterpolationSamplingStrategy(**kwargs)[source]

Bases: query_strategies.query_strategy.QueryStrategy

Class for selecting blocks to label by highest uncertainty and then interpolating within those blocks to generate additional pseudo labels.

Parameters

**kwargs

Optional keyword arguments:

  • prefer_blocks_without_pseudo_labels (bool, optional): Whether blocks that do not contain
    existing pseudo-labels should always be labeled before blocks that contain pseudo-labels.
    Defaults to False.
  • block_selection (str): The selection strategy for the blocks to interpolate: “uncertainty” | “random”.

  • block_thickness (int): The thickness of the interpolation blocks. Defaults to 5.

  • calculation_method (str): Specification of the method used to calculate the uncertainty
    values: “distance” | “entropy”.
  • exclude_background (bool): Whether to exclude the background dimension in calculating the
    uncertainty value.
  • epsilon (float): Small numerical value used for smoothing when using “entropy” as the uncertainty
    metric.
  • interpolation_type (str): The interpolation algorithm to use:
    “signed-distance” | “morph-contour”.
  • interpolation_quality_metric (str): The metric used for evaluating the performance of the
    interpolation, e.g. “dice”.
  • random_state (int, optional): Random state for selecting items to label. Pass an int for reproducible
    outputs across multiple runs.
  • disable_interpolation (bool, optional): Whether the block selection strategy should be run without
    actually interpolating slices. Defaults to False.

select_items_to_label(models, data_module, items_to_label, **kwargs)[source]

Uses a sampling strategy to select blocks for labeling and generates pseudo labels by interpolation between the bottom and the top slice of a block.

Parameters
  • models – Current models that should be improved by selecting additional data for labeling.

  • data_module (ActiveLearningDataModule) – A data module object providing data.

  • items_to_label (int) – Number of items that should be selected for labeling.

  • **kwargs – Additional, strategy-specific parameters.

Returns

List of IDs of the data items to be labeled and a dictionary of pseudo labels with the corresponding IDs as keys.

Return type

Tuple[List[str], Dict[str, np.array]]

query_strategies.interpolation_sampling_strategy.morphological_contour_interpolation(top, bottom, block_thickness)[source]

Interpolates between top and bottom slices using the morphological_contour_interpolator from ITK.

Parameters
  • top (np.array) – The top slice of the block.

  • bottom (np.array) – The bottom slice of the block.

  • block_thickness (int) – The thickness of the block.

Returns

The interpolated slices between top and bottom.

Return type

np.array

query_strategies.interpolation_sampling_strategy.signed_distance_interpolation(top, bottom, block_thickness)[source]

Interpolates between top and bottom slices if possible. Uses a signed distance function to interpolate.

Parameters
  • top (np.array) – The top slice of the block.

  • bottom (np.array) – The bottom slice of the block.

  • block_thickness (int) – The thickness of the block.

Returns

The interpolated slices between top and bottom.

Return type

np.array
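
The idea behind interpolating with a signed distance function can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper names signed_distance_sketch and interpolate_block_sketch are hypothetical, the masks are assumed to be binary, the distance transform is a brute-force NumPy version suitable only for small arrays, and block_thickness is assumed to count the top and bottom slices, so that block_thickness - 2 slices are generated.

```python
import numpy as np


def signed_distance_sketch(mask):
    """Signed distance map: negative inside the mask, positive outside."""
    mask = np.asarray(mask).astype(bool)
    inside = np.argwhere(mask)
    outside = np.argwhere(~mask)
    if len(inside) == 0 or len(outside) == 0:
        # Assumption: interpolation is not possible for empty or full masks.
        raise ValueError("mask must contain both foreground and background")
    pixels = np.argwhere(np.ones(mask.shape, dtype=bool))  # row-major order

    def nearest(points):
        # Distance from every pixel to its closest point in `points`.
        diff = pixels[:, None, :] - points[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1).reshape(mask.shape)

    return np.where(mask, -nearest(outside), nearest(inside))


def interpolate_block_sketch(top, bottom, block_thickness):
    """Blend the signed distance maps linearly and threshold at zero."""
    sd_top = signed_distance_sketch(top)
    sd_bottom = signed_distance_sketch(bottom)
    slices = []
    for i in range(1, block_thickness - 1):
        weight = i / (block_thickness - 1)
        blended = (1 - weight) * sd_top + weight * sd_bottom
        slices.append((blended < 0).astype(np.uint8))
    if not slices:
        return np.zeros((0,) + sd_top.shape, dtype=np.uint8)
    return np.stack(slices)
```

Blending distance maps instead of the masks themselves lets the object boundary move smoothly from the top shape to the bottom shape.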

query_strategies.query_strategy module

Module containing abstract superclass for query strategies.

class query_strategies.query_strategy.QueryStrategy[source]

Bases: abc.ABC

Abstract superclass for query strategies.

abstract select_items_to_label(models, data_module, items_to_label, **kwargs)[source]

Selects a subset of the unlabeled data that should be labeled next.

Parameters
  • models – Current models that should be improved by selecting additional data for labeling.

  • data_module – A data module object providing the unlabeled dataset.

  • items_to_label – Number of items that should be selected for labeling.

  • **kwargs – Additional, strategy-specific parameters.

Returns

List of IDs of the data items to be labeled and an optional dictionary of pseudo labels with the corresponding IDs as keys.

Return type

Tuple[List[str], Optional[Dict[str, np.array]]]
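
A concrete strategy implements this interface by returning the selected IDs together with either a pseudo-label dictionary or None. The sketch below uses a stand-in abstract class (QueryStrategySketch) that only mirrors the documented signature, and a toy subclass (FirstItemsStrategy); both names are hypothetical, and data_module is assumed to be a plain list of case IDs rather than a real data module.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional, Tuple


class QueryStrategySketch(ABC):
    """Stand-in that mirrors the documented abstract interface."""

    @abstractmethod
    def select_items_to_label(
        self, models, data_module, items_to_label, **kwargs
    ) -> Tuple[List[str], Optional[Dict[str, Any]]]:
        """Return the selected case IDs and optional pseudo labels."""


class FirstItemsStrategy(QueryStrategySketch):
    """Toy strategy that picks the first `items_to_label` case IDs."""

    def select_items_to_label(self, models, data_module, items_to_label, **kwargs):
        # Assumption: `data_module` is simply an iterable of case IDs here.
        selected = list(data_module)[:items_to_label]
        # No pseudo labels are generated, so the second element is None.
        return selected, None
```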

query_strategies.random_sampling_strategy module

Module for random sampling strategy

class query_strategies.random_sampling_strategy.RandomSamplingStrategy(random_state=None, **kwargs)[source]

Bases: query_strategies.query_strategy.QueryStrategy

Class for selecting items via a random sampling strategy

Parameters

random_state (int, optional) – Random state for selecting items to label. Pass an int for reproducible outputs across multiple runs.

select_items_to_label(models, data_module, items_to_label, **kwargs)[source]

Selects a random subset of the unlabeled data that should be labeled next. Randomisation is achieved by shuffling the dataset.

Parameters
  • models – Current models that should be improved by selecting additional data for labeling.

  • data_module (ActiveLearningDataModule) – A data module object providing data.

  • items_to_label (int) – Number of items that should be selected for labeling.

  • **kwargs – Additional, strategy-specific parameters.

Returns

List of IDs of the data items to be labeled and None because no pseudo labels are generated.

Return type

Tuple[List[str], None]
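
The effect of random_state on the selection can be illustrated with a small standalone function. This is a sketch under assumptions: select_random_items is a hypothetical helper that works on a plain list of case IDs, whereas the class above is documented to rely on the shuffling of the dataset.

```python
import random
from typing import List, Optional, Tuple


def select_random_items(
    case_ids: List[str], items_to_label: int, random_state: Optional[int] = None
) -> Tuple[List[str], None]:
    """Shuffle a copy of the IDs and keep the first `items_to_label`."""
    rng = random.Random(random_state)  # seeded generator for reproducibility
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    # Random sampling produces no pseudo labels.
    return shuffled[:items_to_label], None
```

Passing the same integer as random_state yields the same selection on every run, which is what the parameter description refers to.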

query_strategies.representativeness_sampling_clustering module

Clustering-based representativeness sampling strategy

class query_strategies.representativeness_sampling_clustering.ClusteringBasedRepresentativenessSamplingStrategy(clustering_algorithm='mean_shift', feature_type='model_features', feature_dimensionality=10, **kwargs)[source]

Bases: query_strategies.representativeness_sampling_strategy_base.RepresentativenessSamplingStrategyBase

Representativeness sampling strategy that clusters the feature vectors and randomly selects items from the clusters least represented in the training set.

Parameters
  • clustering_algorithm (string, optional) –

    Clustering algorithm to be used: “mean_shift” | “k_means” | “scans”:

    • “mean_shift”: the mean shift clustering algorithm is used, allowing a variable number of clusters.

    • “k_means”: the k-means clustering algorithm is used, with a fixed number of clusters.

    • “scans”: all slices from one scan are considered to represent one cluster.

    Defaults to “mean_shift”.

  • feature_type (string, optional) –

    Type of feature vectors to be used: “model_features” | “image_features”:

    • “model_features”: Feature vectors retrieved from the inner layers of the model are used.

    • “image_features”: The input images are used as feature vectors.

    Defaults to “model_features”.

  • feature_dimensionality (int, optional) – Number of dimensions the reduced feature vector should have. Defaults to 10.

  • **kwargs

    Optional keyword arguments:

    • bandwidth (float, optional): Kernel bandwidth of the mean shift clustering algorithm. Defaults to 5.
      Only used if clustering_algorithm = “mean_shift”.
    • cluster_all (bool, optional): Whether all data items including outliers should be assigned to a cluster.
      Defaults to False. Only used if clustering_algorithm = “mean_shift”.
    • n_clusters (int, optional): Number of clusters. Defaults to 10. Only used if
      clustering_algorithm = “k_means”.
    • random_state (int, optional): Random state for centroid initialization of k-means algorithm. Defaults to
      None. Only used if clustering_algorithm = “k_means”.

compute_representativeness_scores(model, data_module, feature_vectors_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]

Computes representativeness scores for all unlabeled items.

Parameters
  • model (PytorchModel) – Current model that should be improved by selecting additional data for labeling.

  • data_module (ActiveLearningDataModule) – A data module object providing data.

  • feature_vectors_training_set (np.ndarray) – Feature vectors of the items in the training set.

  • feature_vectors_unlabeled_set (np.ndarray) – Feature vectors of the items in the unlabeled set.

  • case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.

Returns

Representativeness score for each item in the unlabeled set. Items that are underrepresented in the training set receive higher scores.

Return type

List[float]
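
One way to turn cluster assignments into such scores is sketched below. The function cluster_coverage_scores and its scoring rule (one minus the share of training items in the item's cluster) are assumptions made for illustration; the documentation does not state the exact formula used by the class.

```python
from collections import Counter
from typing import List, Sequence


def cluster_coverage_scores(
    train_clusters: Sequence[int], unlabeled_clusters: Sequence[int]
) -> List[float]:
    """Score unlabeled items higher when their cluster is rare in training."""
    counts = Counter(train_clusters)
    total = max(len(train_clusters), 1)  # avoid division by zero
    # A cluster absent from the training set gets the maximum score of 1.0.
    return [1.0 - counts[cluster] / total for cluster in unlabeled_clusters]
```

With training clusters [0, 0, 0, 1], an unlabeled item in cluster 0 scores lower than one in cluster 1, and an item in an unseen cluster scores highest.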

on_select_item(case_id)[source]

Callback that is called when an item is selected for labeling.

Parameters

case_id (string) – Case ID of the selected item.

prepare_representativeness_computation(feature_vectors_training_set, case_ids_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]

Clusters the feature vectors.

Parameters
  • feature_vectors_training_set (numpy.ndarray) – Feature vectors of the items in the training set.

  • case_ids_training_set (List[str]) – Case IDs of the items in the training set.

  • feature_vectors_unlabeled_set (numpy.ndarray) – Feature vectors of the items in the unlabeled set.

  • case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.

query_strategies.representativeness_sampling_distances module

Distance-based representativeness sampling strategy

class query_strategies.representativeness_sampling_distances.DistanceBasedRepresentativenessSamplingStrategy(feature_type='model_features', feature_dimensionality=10, distance_metric='euclidean', **kwargs)[source]

Bases: query_strategies.representativeness_sampling_strategy_base.RepresentativenessSamplingStrategyBase

Representativeness sampling strategy that selects the items with the highest average feature distance to the items in the training set.

Parameters
  • feature_type (string, optional) –

    Type of feature vectors to be used: “model_features” | “image_features”:

    • “model_features”: Feature vectors retrieved from the inner layers of the model are used.

    • “image_features”: The input images are used as feature vectors.

    Defaults to “model_features”.

  • feature_dimensionality (int, optional) – Number of dimensions the reduced feature vector should have. Defaults to 10.

  • distance_metric (string, optional) – Metric to be used for calculating the distance between feature vectors: “euclidean” | “cosine” | “russellrao”.

compute_representativeness_scores(model, data_module, feature_vectors_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]

Computes representativeness scores for all unlabeled items.

Parameters
  • model (PytorchModel) – Current model that should be improved by selecting additional data for labeling.

  • data_module (ActiveLearningDataModule) – A data module object providing data.

  • feature_vectors_training_set (np.ndarray) – Feature vectors of the items in the training set.

  • feature_vectors_unlabeled_set (np.ndarray) – Feature vectors of the items in the unlabeled set.

  • case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.

Returns

Representativeness score for each item in the unlabeled set. Items that are underrepresented in the training set receive higher scores.

Return type

List[float]
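
The average feature distance to the training set can be sketched with plain NumPy. The helper mean_distance_scores is hypothetical and covers only the “euclidean” and “cosine” metrics; “russellrao” is left out of this sketch.

```python
import numpy as np


def mean_distance_scores(train, unlabeled, metric="euclidean"):
    """Average distance of each unlabeled vector to all training vectors."""
    train = np.asarray(train, dtype=float)
    unlabeled = np.asarray(unlabeled, dtype=float)
    if metric == "euclidean":
        diff = unlabeled[:, None, :] - train[None, :, :]
        dist = np.linalg.norm(diff, axis=-1)
    elif metric == "cosine":
        # Cosine distance = 1 - cosine similarity (vectors must be non-zero).
        u = unlabeled / np.linalg.norm(unlabeled, axis=1, keepdims=True)
        t = train / np.linalg.norm(train, axis=1, keepdims=True)
        dist = 1.0 - u @ t.T
    else:
        raise ValueError(f"unsupported metric in this sketch: {metric}")
    # Rows index unlabeled items, columns index training items.
    return dist.mean(axis=1).tolist()
```

Items far from everything in the training set receive the highest scores and would be selected first.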

query_strategies.representativeness_sampling_strategy_base module

Base class for implementing representativeness sampling strategies

class query_strategies.representativeness_sampling_strategy_base.RepresentativenessSamplingStrategyBase(feature_type='model_features', feature_dimensionality=10, **kwargs)[source]

Bases: query_strategies.query_strategy.QueryStrategy, abc.ABC

Base class for implementing representativeness sampling strategies

Parameters
  • feature_dimensionality (int, optional) – Number of dimensions the reduced feature vector should have. Defaults to 10.

  • feature_type (string, optional) –

    Type of feature vectors to be used: “model_features” | “image_features”:

    • “model_features”: Feature vectors retrieved from the inner layers of the model are used.

    • “image_features”: The input images are used as feature vectors.

    Defaults to “model_features”.

abstract compute_representativeness_scores(model, data_module, feature_vectors_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]

Must be overridden in subclasses to compute the representativeness scores for the items in the unlabeled set.

Parameters
  • model (PytorchModel) – Current model that should be improved by selecting additional data for labeling.

  • data_module (ActiveLearningDataModule) – A data module object providing data.

  • feature_vectors_training_set (np.ndarray) – Feature vectors of the items in the training set.

  • feature_vectors_unlabeled_set (np.ndarray) – Feature vectors of the items in the unlabeled set.

  • case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.

Returns

Representativeness score for each item in the unlabeled set. Items that are underrepresented in the training set should receive higher scores.

Return type

List[float]

on_select_item(case_id)[source]

Callback that is called when an item is selected for labeling.

Parameters

case_id (string) – Case ID of the selected item.

prepare_representativeness_computation(feature_vectors_training_set, case_ids_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]

Can be overridden in subclasses to perform global computations on all feature vectors before item selection starts.

Parameters
  • feature_vectors_training_set (numpy.ndarray) – Feature vectors of the items in the training set.

  • case_ids_training_set (List[str]) – Case IDs of the items in the training set.

  • feature_vectors_unlabeled_set (numpy.ndarray) – Feature vectors of the items in the unlabeled set.

  • case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.

reduce_features(feature_vectors, epsilon=1e-10)[source]

Reduces the dimensionality of feature vectors using a principal component analysis.

Parameters
  • feature_vectors (numpy.array) – Feature vectors to be reduced.

  • epsilon (float, optional) – Smoothing operator.

Returns

Reduced feature vectors.

Return type

numpy.array
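
A principal component analysis of this kind can be sketched with a singular value decomposition. The function reduce_features_sketch is hypothetical, and the role given to epsilon here (guarding the standardization against division by zero) is an assumption, since the documentation only calls it a smoothing operator.

```python
import numpy as np


def reduce_features_sketch(feature_vectors, n_components=10, epsilon=1e-10):
    """Project standardized feature vectors onto their principal components."""
    x = np.asarray(feature_vectors, dtype=float)
    x = x.reshape(len(x), -1)  # flatten each item to one feature vector
    # Assumption: epsilon keeps constant features from dividing by zero.
    x = (x - x.mean(axis=0)) / (x.std(axis=0) + epsilon)
    n_components = min(n_components, *x.shape)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:n_components].T
```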

select_items_to_label(models, data_module, items_to_label, **kwargs)[source]

Selects a subset of the unlabeled data that increases the representativeness of the training set.

Parameters
  • models (PytorchModel) – Current models that should be improved by selecting additional data for labeling.

  • data_module (ActiveLearningDataModule) – A data module object providing data.

  • items_to_label (int) – Number of items that should be selected for labeling.

  • **kwargs – Additional, strategy-specific parameters.

Returns

List of IDs of the data items to be labeled and None because no pseudo labels are generated.

Return type

Tuple[List[str], None]

query_strategies.representativeness_sampling_uncertainty module

Combined representativeness and uncertainty sampling strategy

class query_strategies.representativeness_sampling_uncertainty.UncertaintyRepresentativenessSamplingStrategy(representativeness_algorithm='cluster_coverage', calculation_method='entropy', **kwargs)[source]

Bases: query_strategies.representativeness_sampling_strategy_base.RepresentativenessSamplingStrategyBase

Sampling strategy that combines representativeness and uncertainty sampling.

Parameters
  • representativeness_algorithm (string, optional) –

    The algorithm to be used to select the most representative samples: “most_distant_sample” | “cluster_coverage”. Defaults to “cluster_coverage”.

    • “most_distant_sample”: The unlabeled item that has the highest feature distance to the labeled set
      is selected for labeling.
    • “cluster_coverage”: The features of the unlabeled and labeled items are clustered and an item from
      the most underrepresented cluster is selected for labeling.

  • calculation_method (string, optional) – The algorithm to be used for computing the uncertainty: “distance” | “entropy”.

compute_representativeness_scores(model, data_module, feature_vectors_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]

Computes representativeness scores for all unlabeled items.

Parameters
  • model (PytorchModel) – Current model that should be improved by selecting additional data for labeling.

  • data_module (ActiveLearningDataModule) – A data module object providing data.

  • feature_vectors_training_set (np.ndarray) – Feature vectors of the items in the training set.

  • feature_vectors_unlabeled_set (np.ndarray) – Feature vectors of the items in the unlabeled set.

  • case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.

Returns

Representativeness score for each item in the unlabeled set. Items that are underrepresented in the training set receive higher scores.

Return type

List[float]

prepare_representativeness_computation(feature_vectors_training_set, case_ids_training_set, feature_vectors_unlabeled_set, case_ids_unlabeled_set)[source]

Prepares computation of representativeness scores.

Parameters
  • feature_vectors_training_set (numpy.ndarray) – Feature vectors of the items in the training set.

  • case_ids_training_set (List[str]) – Case IDs of the items in the training set.

  • feature_vectors_unlabeled_set (numpy.ndarray) – Feature vectors of the items in the unlabeled set.

  • case_ids_unlabeled_set (List[str]) – Case IDs of the items in the unlabeled set.

query_strategies.uncertainty_sampling_strategy module

Module for uncertainty sampling strategy

class query_strategies.uncertainty_sampling_strategy.UncertaintySamplingStrategy(**kwargs)[source]

Bases: query_strategies.query_strategy.QueryStrategy

Class for selecting items to label by highest uncertainty

Parameters

**kwargs

Optional keyword arguments:

  • calculation_method (str): Specification of the method used to calculate the uncertainty
    values: “distance” | “entropy”.
  • exclude_background (bool): Whether to exclude the background dimension in calculating the
    uncertainty value.
  • prefer_unique_scans (bool): Whether to prefer slices from distinct scans among the most uncertain
    scan-slice combinations, if possible. E.g. with items_to_label set to 2:
    [‘slice_1-32’, ‘slice_1-33’, ‘slice_2-50’] -> [‘slice_1-32’, ‘slice_2-50’]
  • epsilon (float): Small numerical value used for smoothing when using “entropy” as the uncertainty metric.

compute_uncertainties(model, data_module)[source]

Computes the model uncertainties for all items in the unlabeled set.

Parameters
  • model (PytorchModel) – Current model that should be improved by selecting additional data for labeling.

  • data_module (ActiveLearningDataModule) – A data module object providing data.

Returns

Model uncertainties and case IDs for all items in the unlabeled set.

Return type

Tuple[List[float], List[str]]

select_items_to_label(models, data_module, items_to_label, **kwargs)[source]

Selects the subset of the unlabeled data with the highest uncertainty to be labeled next.

Parameters
  • models – Current models that should be improved by selecting additional data for labeling.

  • data_module (ActiveLearningDataModule) – A data module object providing data.

  • items_to_label (int) – Number of items that should be selected for labeling.

  • calculation_method (str, optional) – Specification of the method used to calculate the uncertainty values. (default = ‘distance’)

  • **kwargs – Additional, strategy-specific parameters.

Returns

List of IDs of the data items to be labeled and None because no pseudo labels are generated.

Return type

Tuple[List[str], None]

query_strategies.utils module

Module containing functions used for different query strategies.

query_strategies.utils.clean_duplicate_scans(uncertainties, items_to_label)[source]

Removes duplicate scans from the list if possible. If the minimum number of samples can’t be reached without duplicates, the duplicates are kept.

Parameters
  • uncertainties (List[Tuple[float, str]]) – List with tuples of uncertainty value and case id.

  • items_to_label (int) – Number of items that should be selected for labeling.

Returns

A cleaned list of tuples.
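
The deduplication described above can be sketched as follows. The function clean_duplicate_scans_sketch is a hypothetical re-implementation under two assumptions: the input list is already sorted by decreasing uncertainty, and a case ID has the form <scan>-<slice> (as in ‘slice_1-32’), so the scan is everything before the last hyphen.

```python
from typing import List, Tuple


def clean_duplicate_scans_sketch(
    uncertainties: List[Tuple[float, str]], items_to_label: int
) -> List[Tuple[float, str]]:
    """Keep one entry per scan, falling back to duplicates if needed."""
    seen = set()
    unique, duplicates = [], []
    for value, case_id in uncertainties:
        scan_id = case_id.rsplit("-", 1)[0]  # assumed ID format
        if scan_id in seen:
            duplicates.append((value, case_id))
        else:
            seen.add(scan_id)
            unique.append((value, case_id))
    # Refill with duplicates only when unique scans cannot cover the request.
    missing = max(items_to_label - len(unique), 0)
    return unique + duplicates[:missing]
```

In this sketch the fallback duplicates are appended after the unique entries instead of being merged back by uncertainty, which is a simplification.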

query_strategies.utils.distance_to_max_uncertainty(predictions, max_uncertainty_value=0.5, **kwargs)[source]

Calculates the uncertainties based on the distance to a maximum uncertainty value:

\[\sum \left| \text{max\_uncertainty\_value} - \text{predictions} \right|\]

Parameters
  • predictions (torch.Tensor) – The predictions of the model.

  • max_uncertainty_value (float, optional) – The maximum value of uncertainty in the predictions. (default = 0.5)

  • **kwargs – Keyword arguments specific for this calculation.

Returns

Uncertainty value for each image in the batch of predictions.
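
The formula above can be evaluated per image as in the following sketch. It uses NumPy arrays as a stand-in for the torch.Tensor expected by the documented function, and distance_to_max_uncertainty_sketch is a hypothetical name.

```python
import numpy as np


def distance_to_max_uncertainty_sketch(predictions, max_uncertainty_value=0.5):
    """Sum of absolute distances to the maximum uncertainty value per image."""
    predictions = np.asarray(predictions, dtype=float)
    distance = np.abs(max_uncertainty_value - predictions)
    # Sum over every axis except the leading batch axis.
    return distance.reshape(len(predictions), -1).sum(axis=1)
```

A sum of zero means every prediction equals max_uncertainty_value; larger sums mean the predictions lie further away from it.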

query_strategies.utils.entropy(predictions, max_uncertainty_value=0.5, **kwargs)[source]

Calculates the uncertainties based on the entropy of the distance to a maximum uncertainty value:

\[- \sum \left| \text{max\_uncertainty\_value} - \text{predictions} \right| \cdot \left| \log(\text{max\_uncertainty\_value} - \text{predictions}) \right|\]

Parameters
  • predictions (torch.Tensor) – The predictions of the model.

  • max_uncertainty_value (float, optional) – The maximum value of uncertainty in the predictions. (default = 0.5)

  • **kwargs

    Keyword arguments specific for this calculation:

    • epsilon (float): Small numerical value used for smoothing. (default = 1e-10)

Returns

Uncertainty value for each image in the batch of predictions.
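
A per-image evaluation of the formula above might look like the sketch below. It rests on two assumptions that the documentation does not confirm: the logarithm is applied to the absolute distance, and epsilon is added inside the logarithm so that log(0) never occurs. NumPy stands in for torch, and entropy_sketch is a hypothetical name.

```python
import numpy as np


def entropy_sketch(predictions, max_uncertainty_value=0.5, epsilon=1e-10):
    """Negative sum of distance times absolute log-distance per image."""
    predictions = np.asarray(predictions, dtype=float)
    distance = np.abs(max_uncertainty_value - predictions)
    # Assumption: epsilon is added inside the log for numerical stability.
    terms = distance * np.abs(np.log(distance + epsilon))
    # Sum over every axis except the leading batch axis.
    return -terms.reshape(len(predictions), -1).sum(axis=1)
```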

query_strategies.utils.select_uncertainty_calculation(calculation_method)[source]

Selects the calculation function based on the provided name.

Parameters

calculation_method (str) – Name of the calculation method. Allowable values: “distance” | “entropy”.

Returns

A callable function to calculate uncertainty based on predictions.
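
Selecting a function by name is commonly done with a dictionary lookup, as in this sketch. The two placeholder functions and the ValueError for unknown names are assumptions made for illustration; how the documented function handles an invalid name is not stated.

```python
from typing import Callable, Dict


def distance_stub(predictions, **kwargs):
    """Placeholder for the distance-based calculation."""
    return [sum(abs(0.5 - p) for p in row) for row in predictions]


def entropy_stub(predictions, **kwargs):
    """Placeholder for the entropy-based calculation."""
    raise NotImplementedError("placeholder only")


# Maps the documented method names to the callables.
_CALCULATIONS: Dict[str, Callable] = {
    "distance": distance_stub,
    "entropy": entropy_stub,
}


def select_uncertainty_calculation_sketch(calculation_method: str) -> Callable:
    """Return the callable registered under `calculation_method`."""
    try:
        return _CALCULATIONS[calculation_method]
    except KeyError:
        # Assumption: unknown names are rejected with a ValueError.
        raise ValueError(f"unknown calculation method: {calculation_method}") from None
```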

Module contents
