Evaluation

Evaluation base

class nussl.evaluation.EvaluationBase(true_sources_list, estimated_sources_list, source_labels=None, compute_permutation=False, best_permutation_key=None, **kwargs)[source]

Base class for all Evaluation classes for source separation algorithms in nussl. Contains common functions for all evaluation techniques. This class should not be instantiated directly.

Both true_sources_list and estimated_sources_list get validated using the private method _verify_input_list(). If your evaluation needs to verify that its input is set up correctly (recommended), override that method to add the checking.

Parameters
  • true_sources_list (list) – List of objects that contain one ground truth source per object. In some instances (such as the BSSEval objects) this list is filled with AudioSignals but in other cases it is populated with MaskBase -derived objects (i.e., either a BinaryMask or SoftMask object).

  • estimated_sources_list (list) – List of objects that contain source estimations from a source separation algorithm. List should be populated with the same type of objects and in the same order as true_sources_list.

  • source_labels (list) – List of strings that are labels for each source to be used as keys for the scores. Default value is None and in that case labels use the file_name attribute. If that is also None, then the source labels are Source 0, Source 1, etc.

  • compute_permutation (bool) – Whether or not to evaluate in a permutation-invariant fashion, where the estimates are permuted to match the true sources. Only the best permutation according to best_permutation_key is returned to the scores dict. Defaults to False.

  • best_permutation_key (str) – Which metric to use to decide which permutation of the sources was best.

  • **kwargs (dict) – Any additional keyword arguments are passed on to evaluate_helper.

Methods

evaluate()

This function encapsulates the main functionality of all evaluation classes.

evaluate_helper(references, estimates, **kwargs)

This function should be implemented by each class that inherits this class.

get_candidates()

This gets all the possible candidates for evaluation.

preprocess()

Takes the objects contained in true_sources_list and estimated_sources_list and processes them into numpy arrays that have shape (…, n_channels, n_sources).

Attributes

scores

A dictionary that stores all scores from the evaluation method.

evaluate()[source]

This function encapsulates the main functionality of all evaluation classes. It performs the following steps, some of which must be implemented in subclasses of EvaluationBase.

  1. Preprocesses the data into numpy arrays that get passed into your evaluation function.

  2. Gets all possible candidates that will be evaluated in your evaluation function.

  3. For each candidate, runs the evaluation function (must be implemented in subclass).

  4. Finds the results from the best candidate.

  5. Returns a dictionary containing those results.

Steps 1 and 3 must be implemented by the subclass while the others are implemented by EvaluationBase.

Returns

A dictionary containing the scores for each source for the best candidate.

evaluate_helper(references, estimates, **kwargs)[source]

This function should be implemented by each class that inherits this class. The function should take in a numpy array containing the references and one for the estimates and compute evaluation measures between the two arrays. The results should be stored in a list of dictionaries. For example, a BSSEval evaluator may return a dictionary as follows, for a single estimate:

[
    {                       # results for the first estimate
        "SDR": [5.6, 5.2],  # one value per channel (or per window, or both)
        "SIR": [9.2, 8.9],
        "SAR": [4.1, 4.3]
    },
    ...                     # more results for the other estimates
]

Each metric should be a key in the dictionary pointing to a value which is a list. The list contains the metrics for however the algorithm was implemented (e.g. there might be two values, one for each channel in a stereo mix, or there might be a sequence of values, one for each window that was evaluated).

Parameters
  • references (np.ndarray) – References kept in a numpy array. Should have shape (…, n_channels, n_sources).

  • estimates (np.ndarray) – Estimates kept in a numpy array. Should have shape (…, n_channels, n_sources).

  • kwargs (dict) – Keyword arguments with any additional arguments to be used in the function (e.g. window_size, hop_length).

Returns

A list of dictionaries containing the measures for each estimate and reference pair.

get_candidates()[source]

This gets all the possible candidates for evaluation. If compute_permutation is False, then the estimates and the references are assumed to be in the same order. The first N estimates will be compared to the first N references, where N is min(len(estimates), len(references)).

If compute_permutation is True, and len(estimates) == len(references), then every possible ordering of the estimates will be tried to match to the references. So if there are 3 references and 3 estimates, a total of 3! = 6 candidates will be generated.

If compute_permutation is True and len(estimates) > len(references), then every combination of len(references) estimates will be tried, along with every permutation of each combination. If there are 2 references and 4 estimates, then (4 choose 2) = 6 combinations will be tried. For each of those pairs, there are 2! = 2 permutations, so a total of 12 candidates will be generated (this counting is sketched below).

Returns

Two lists of combinations and permutations that should be tried. Each element of the list contains the indices that are used to find the sources that are compared to each other.
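As an illustration of the counting described above, here is a minimal sketch for the hypothetical 2-reference, 4-estimate case. It shows the arithmetic only; it is not nussl's implementation, which returns the combinations and permutations as two separate lists.

# Sketch of the candidate counting: combinations of estimate indices,
# then every ordering (permutation) of each combination.
from itertools import combinations, permutations

n_references = 2
n_estimates = 4

candidates = [
    ordering
    for combo in combinations(range(n_estimates), n_references)
    for ordering in permutations(combo)
]

print(len(candidates))  # (4 choose 2) * 2! = 6 * 2 = 12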

preprocess()[source]

Takes the objects contained in true_sources_list and estimated_sources_list and processes them into numpy arrays that have shape (…, n_channels, n_sources).

Returns

references, estimates in that order as np.ndarrays.

Note

Make sure to return the preprocessed data in the order (references, estimates)!

property scores

A dictionary that stores all scores from the evaluation method. Gets populated when evaluate() gets run.
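To make the workflow concrete, here is a minimal sketch of a custom evaluator that subclasses EvaluationBase. It assumes AudioSignal inputs; the class name and the mean-absolute-error metric are purely illustrative, not part of nussl.

# A minimal sketch of a custom evaluator built on EvaluationBase.
# Assumes AudioSignal inputs; the "MAE" metric is illustrative only.
import numpy as np
import nussl

class MAEEvaluation(nussl.evaluation.EvaluationBase):
    def preprocess(self):
        # Stack audio_data into arrays of shape (n_samples, n_channels, n_sources).
        references = np.stack(
            [s.audio_data.T for s in self.true_sources_list], axis=-1)
        estimates = np.stack(
            [s.audio_data.T for s in self.estimated_sources_list], axis=-1)
        return references, estimates

    def evaluate_helper(self, references, estimates, **kwargs):
        results = []
        for j in range(references.shape[-1]):
            error = np.abs(references[..., j] - estimates[..., j])
            # One value per channel, matching the output format described above.
            results.append({'MAE': error.mean(axis=0).tolist()})
        return results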

BSS Evaluation base

class nussl.evaluation.BSSEvaluationBase(true_sources_list, estimated_sources_list, source_labels=None, compute_permutation=False, best_permutation_key='SDR', **kwargs)[source]

Base class for all evaluation classes that are based on BSSEval metrics. This contains some useful verification and preprocessing functions that are used in many separation-based evaluations. Specific evaluation metrics are thin wrappers around this base class, essentially only implementing the self.evaluate_helper function.

Both true_sources_list and estimated_sources_list get validated using the private method _verify_input_list(). If your evaluation needs to verify that its input is set up correctly (recommended), override that method to add the checking.

Parameters
  • true_sources_list (list) – List of objects that contain one ground truth source per object. In some instances (such as the BSSEval objects) this list is filled with AudioSignals but in other cases it is populated with MaskBase -derived objects (i.e., either a BinaryMask or SoftMask object).

  • estimated_sources_list (list) – List of objects that contain source estimations from a source separation algorithm. List should be populated with the same type of objects and in the same order as true_sources_list.

  • source_labels (list) – List of strings that are labels for each source to be used as keys for the scores. Default value is None and in that case labels use the file_name attribute. If that is also None, then the source labels are Source 0, Source 1, etc.

  • compute_permutation (bool) – Whether or not to evaluate in a permutation-invariant fashion, where the estimates are permuted to match the true sources. Only the best permutation according to best_permutation_key is returned to the scores dict. Defaults to False.

  • best_permutation_key (str) – Which metric to use to decide which permutation of the sources was best.

  • **kwargs (dict) – Any additional arguments are passed on to evaluate_helper.

Methods

preprocess()

Implements preprocess by stacking the audio_data inside each AudioSignal object in both self.true_sources_list and self.estimated_sources_list.

preprocess()[source]

Implements preprocess by stacking the audio_data inside each AudioSignal object in both self.true_sources_list and self.estimated_sources_list.

Returns

Tuple containing reference and estimate arrays.

Return type

tuple

Scale invariant BSSEval

class nussl.evaluation.BSSEvalScale(true_sources_list, estimated_sources_list, source_labels=None, compute_permutation=False, best_permutation_key='SDR', **kwargs)[source]

Methods

evaluate_helper(references, estimates[, …])

Implements evaluation using new BSSEval metrics [1].

preprocess()

Scale-invariant metrics expect zero-mean references and estimates.

evaluate_helper(references, estimates, compute_sir_sar=True)[source]

Implements evaluation using new BSSEval metrics [1]. This computes every metric described in [1], including:

  • SI-SDR: Scale-invariant source-to-distortion ratio. Higher is better.

  • SI-SIR: Scale-invariant source-to-interference ratio. Higher is better.

  • SI-SAR: Scale-invariant source-to-artifact ratio. Higher is better.

  • SD-SDR: Scale-dependent source-to-distortion ratio. Higher is better.

  • SNR: Signal-to-noise ratio. Higher is better.

  • SRR: The source-to-rescaled-source ratio. This corresponds to a term that punishes the estimate if its scale is off relative to the reference. This is an unnumbered equation in [1], but is the term on page 2, second column, second to last line: ||s - alpha*s||**2. s is factored out. Higher is better.

  • SI-SDRi: Improvement in SI-SDR over using the mixture as the estimate. Higher is better.

  • SD-SDRi: Improvement in SD-SDR over using the mixture as the estimate. Higher is better.

  • SNRi: Improvement in SNR over using the mixture as the estimate. Higher is better.

Note:

If compute_sir_sar = False, then you’ll get np.nan for SI-SIR and SI-SAR!

References:

[1] Le Roux, J., Wisdom, S., Erdogan, H., & Hershey, J. R. (2019, May). SDR–half-baked or well done?. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 626-630). IEEE.

preprocess()[source]

Scale-invariant metrics expect zero-mean references and estimates.
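For reference, a hedged usage sketch of BSSEvalScale follows. The file paths are placeholders; in practice estimated_sources would come from a separation algorithm, but here the references are reused as estimates just to make the call self-contained.

# Hedged usage sketch for BSSEvalScale; file paths are placeholders.
import nussl

vocals = nussl.AudioSignal('vocals.wav')
accompaniment = nussl.AudioSignal('accompaniment.wav')

true_sources = [vocals, accompaniment]
estimated_sources = [vocals, accompaniment]  # stand-in for real estimates

evaluator = nussl.evaluation.BSSEvalScale(
    true_sources, estimated_sources,
    source_labels=['vocals', 'accompaniment'],
)
scores = evaluator.evaluate()  # also stored on evaluator.scores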

nussl.evaluation.scale_bss_eval(references, estimate, mixture, idx, compute_sir_sar=True)[source]

Computes metrics for the reference at index idx in references relative to the given estimate. This only works for mono audio; each channel should be evaluated independently when calling this function. Lovingly borrowed from Gordon Wichern and Jonathan Le Roux at Mitsubishi Electric Research Labs.

This returns 9 numbers (in this order):

  • SI-SDR: Scale-invariant source-to-distortion ratio. Higher is better.

  • SI-SIR: Scale-invariant source-to-interference ratio. Higher is better.

  • SI-SAR: Scale-invariant source-to-artifact ratio. Higher is better.

  • SD-SDR: Scale-dependent source-to-distortion ratio. Higher is better.

  • SNR: Signal-to-noise ratio. Higher is better.

  • SRR: The source-to-rescaled-source ratio. This corresponds to a term that punishes the estimate if its scale is off relative to the reference. This is an unnumbered equation in [1], but is the term on page 2, second column, second to last line: ||s - alpha*s||**2. s here is factored out. Higher is better.

  • SI-SDRi: Improvement in SI-SDR over using the mixture as the estimate.

  • SD-SDRi: Improvement in SD-SDR over using the mixture as the estimate.

  • SNRi: Improvement in SNR over using the mixture as the estimate.

References:

[1] Le Roux, J., Wisdom, S., Erdogan, H., & Hershey, J. R. (2019, May). SDR–half-baked or well done?. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 626-630). IEEE.

Parameters
  • references (np.ndarray) – object containing the references data. Of shape (n_samples, n_sources).

  • estimate (np.ndarray) – object containing the estimate data. Of shape (n_samples, 1).

  • mixture (np.ndarray) – object containing the mixture data. Of shape (n_samples, 1).

  • idx (int) – Which reference to compute metrics against.

  • compute_sir_sar (bool, optional) – Whether or not to compute SIR/SAR metrics, which can be computationally expensive and may not be relevant for your evaluation. Defaults to True

Returns

SI-SDR, SI-SIR, SI-SAR, SD-SDR, SNR, SRR, SI-SDRi, SD-SDRi, SNRi

Return type

tuple
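For intuition, here is a minimal sketch of the scale-invariant SDR computation for a single mono reference/estimate pair, following the definition in [1]. The function name and the eps stabilizer are illustrative; use scale_bss_eval for actual evaluation.

# Illustrative SI-SDR computation for one mono reference/estimate pair.
import numpy as np

def si_sdr(reference, estimate, eps=1e-10):
    # Zero-mean both signals, as the preprocess step above requires.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Optimal scaling of the reference onto the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    projection = alpha * reference
    noise = estimate - projection
    return 10 * np.log10(
        (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps))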

BSSEvalV4 (museval)

class nussl.evaluation.BSSEvalV4(true_sources_list, estimated_sources_list, source_labels=None, compute_permutation=False, best_permutation_key='SDR', **kwargs)[source]

Methods

evaluate_helper(references, estimates, **kwargs)

Implements evaluation using museval.metrics.bss_eval

evaluate_helper(references, estimates, **kwargs)[source]

Implements evaluation using museval.metrics.bss_eval
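A hedged usage sketch for BSSEvalV4, analogous to the BSSEvalScale example above. The file paths are placeholders; per the signature above, additional keyword arguments are passed on to evaluate_helper, which wraps museval.metrics.bss_eval.

# Hedged usage sketch for BSSEvalV4 (museval-based); file paths are placeholders.
import nussl

true_sources = [nussl.AudioSignal('vocals.wav'),
                nussl.AudioSignal('accompaniment.wav')]
estimated_sources = [nussl.AudioSignal('est_vocals.wav'),
                     nussl.AudioSignal('est_accompaniment.wav')]

evaluator = nussl.evaluation.BSSEvalV4(
    true_sources, estimated_sources,
    source_labels=['vocals', 'accompaniment'])
scores = evaluator.evaluate()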

Precision and recall on masks

class nussl.evaluation.PrecisionRecallFScore(true_sources_list, estimated_sources_list, source_labels=None, compute_permutation=False, best_permutation_key='F1-Score', **kwargs)[source]

This class provides common statistical metrics for determining how well a source separation algorithm in nussl was able to create a binary mask compared to a known binary mask. The metrics used here are Precision, Recall, F-Score (sometimes called F-measure or F1-score), and Accuracy (not reflected in the name of the class; it is simply the number of correct entries divided by the total).

Notes

  • PrecisionRecallFScore can only be run using binary_mask objects. The constructor expects a list of binary_mask objects for both the ground truth sources and the estimated sources.

  • PrecisionRecallFScore does not calculate the correct permutation of the estimated and ground truth sources; they are expected to be in the correct order when they are passed into PrecisionRecallFScore.

Methods

evaluate_helper(references, estimates, **kwargs)

Determines the precision, recall, f-score, and accuracy of each binary_mask object in true_sources_mask_list and estimated_sources_mask_list.

preprocess()

Takes the objects contained in true_sources_list and estimated_sources_list and processes them into numpy arrays that have shape (…, n_channels, n_sources).

Parameters
  • true_sources_mask_list (list) – List of binary_mask objects representing the ground truth sources.

  • estimated_sources_mask_list (list) – List of binary_mask objects representing the estimates from a source separation object

  • source_labels (list) (Optional) – List of str with labels for each source. If no labels are provided, sources will be labeled Source 0, Source 1, ... etc.

evaluate_helper(references, estimates, **kwargs)[source]

Determines the precision, recall, f-score, and accuracy of each binary_mask object in true_sources_mask_list and estimated_sources_mask_list. Returns a list of results that is formatted like so:

[
    {'Accuracy': 0.83,
     'Precision': 0.78,
     'Recall': 0.81,
     'F1-Score': 0.77},

    {'Accuracy': 0.22,
     'Precision': 0.12,
     'Recall': 0.15,
     'F1-Score': 0.19}
]

Returns

A list of scores containing the accuracy, precision, recall, and F1-score between the binary_mask objects in true_sources_mask_list and estimated_sources_mask_list.

Return type

self.scores (dict)

preprocess()[source]

Takes the objects contained in true_sources_list and estimated_sources_list and processes them into numpy arrays that have shape (…, n_channels, n_sources).

Returns

references, estimates in that order as np.ndarrays.

Note

Make sure to return the preprocessed data in the order (references, estimates)!
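A hedged usage sketch follows. The BinaryMask import path and the mask shape (n_freq, n_time, n_channels) are assumptions, and the random masks exist only to make the example self-contained.

# Hedged usage sketch for PrecisionRecallFScore with binary masks.
import numpy as np
import nussl
from nussl.core.masks import BinaryMask  # assumed import path

shape = (513, 400, 1)  # assumed (n_freq, n_time, n_channels)
true_mask = BinaryMask(np.random.rand(*shape) > 0.5)
estimated_mask = BinaryMask(np.random.rand(*shape) > 0.5)

evaluator = nussl.evaluation.PrecisionRecallFScore(
    [true_mask], [estimated_mask], source_labels=['vocals'])
scores = evaluator.evaluate()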

Aggregators

nussl.evaluation.aggregate_score_files(json_files, aggregator=np.nanmedian)[source]

Takes a list of json files output by an Evaluation method in nussl and aggregates all the metrics into a Pandas dataframe. Sample output:

                        SDR        SIR        SAR
drums  oracle0.json   9.086025  15.025801  10.362709
       random0.json  -6.539877  -6.087538   3.508338
       oracle1.json   9.591432  14.335700  11.365882
       random1.json  -1.358840  -0.993666   9.577297
bass   oracle0.json   7.936720  12.843092   9.631929
       random0.json  -4.190299  -3.730649   5.802003
       oracle1.json   8.581090  12.513445  10.831370
       random1.json   0.365171   0.697621  11.693103
other  oracle0.json   2.024207   6.133359   4.158805
       random0.json  -9.857085  -9.481909   0.965199
       oracle1.json   3.961383   6.861785   7.085745
       random1.json  -4.042277  -3.707997   7.260934
vocals oracle0.json  12.169686  16.650161  14.085037
       random0.json  -2.440166  -1.884026   6.760966
       oracle1.json  12.409913  16.248470  14.725983
       random1.json   1.609577   1.958037  12.738970
Parameters
  • json_files (list) – List of JSON files that will be parsed for metrics.

  • aggregator (callable, optional) – How to aggregate results within a single track. Defaults to np.nanmedian.

Returns

Pandas dataframe containing the aggregated metrics.

Return type

pd.DataFrame
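A hedged usage sketch: collect the JSON files written during evaluation and aggregate them into a dataframe. The glob pattern is a placeholder, and the final groupby assumes the (source, file) index layout shown in the sample output above.

# Hedged usage sketch for aggregate_score_files.
import glob
import numpy as np
import nussl

json_files = glob.glob('results/*.json')  # placeholder pattern
df = nussl.evaluation.aggregate_score_files(json_files, aggregator=np.nanmedian)
print(df.groupby(level=0).median())  # summary per source across files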