Evaluation
Evaluation base
class nussl.evaluation.EvaluationBase(true_sources_list, estimated_sources_list, source_labels=None, compute_permutation=False, best_permutation_key=None, **kwargs)

Base class for all evaluation classes for source separation algorithms in nussl. Contains common functions for all evaluation techniques. This class should not be instantiated directly.
Both true_sources_list and estimated_sources_list are validated using the private method _verify_input_list(). If your evaluation needs to verify that the input is set correctly (recommended), override that method to add the checking.

- Parameters
true_sources_list (list) – List of objects that contain one ground truth source per object. In some instances (such as the BSSEval objects) this list is filled with AudioSignal objects, but in other cases it is populated with MaskBase-derived objects (i.e., either a BinaryMask or a SoftMask object).

estimated_sources_list (list) – List of objects that contain source estimates from a source separation algorithm. This list should be populated with the same type of objects, and in the same order, as true_sources_list.
source_labels (list) – List of strings that are labels for each source, used as keys for the scores. Defaults to None, in which case the labels are taken from the file_name attribute. If that is also None, the sources are labeled Source 0, Source 1, etc.
compute_permutation (bool) – Whether or not to evaluate in a permutation-invariant fashion, where the estimates are permuted to match the true sources. Only the best permutation according to best_permutation_key is returned in the scores dict. Defaults to False.

best_permutation_key (str) – Which metric to use to decide which permutation of the sources was best.

**kwargs (dict) – Any additional keyword arguments are passed on to evaluate_helper.
Methods

evaluate() – This function encapsulates the main functionality of all evaluation classes.
evaluate_helper(references, estimates, **kwargs) – This function should be implemented by each class that inherits this class.
get_candidates() – Gets all the possible candidates for evaluation.
preprocess() – Takes the objects contained in true_sources_list and estimated_sources_list and processes them into numpy arrays that have shape (…, n_channels, n_sources).

Attributes

scores – A dictionary that stores all scores from the evaluation method.
evaluate()

This function encapsulates the main functionality of all evaluation classes. It performs the following steps, some of which must be implemented in subclasses of EvaluationBase:

1. Preprocesses the data into numpy arrays that get passed into your evaluation function.
2. Gets all possible candidates that will be evaluated in your evaluation function.
3. For each candidate, runs the evaluation function (must be implemented in the subclass).
4. Finds the results from the best candidate.
5. Returns a dictionary containing those results.

Steps 1 and 3 must be implemented by the subclass, while the others are implemented by EvaluationBase.

- Returns
A dictionary containing the scores for each source for the best candidate.
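At the level of the steps above, the flow can be sketched roughly as follows. This is an illustration only, not nussl's actual implementation: the candidate handling is simplified (get_candidates really returns separate lists of combinations and permutations), and the attribute names simply mirror the constructor arguments.

    # A rough sketch of the evaluate() flow; illustration only, not nussl's code.
    # Written as a standalone function taking `self` for brevity.
    import numpy as np

    def evaluate(self):
        references, estimates = self.preprocess()   # step 1 (subclass)
        candidates = self.get_candidates()          # step 2 (base class),
                                                    # simplified to one flat list
        all_results = []
        for candidate in candidates:                # candidate = estimate indices
            # step 3 (subclass): evaluate this selection/ordering of the estimates
            results = self.evaluate_helper(
                references, estimates[..., list(candidate)])
            all_results.append(results)

        # step 4: keep the candidate whose scores are best according to
        # best_permutation_key (e.g. highest mean SDR)
        best = max(all_results, key=lambda results: np.nanmean(
            [np.nanmean(r[self.best_permutation_key]) for r in results]))

        # step 5: key the per-source results by the source labels
        return dict(zip(self.source_labels, best))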
evaluate_helper(references, estimates, **kwargs)

This function should be implemented by each class that inherits this class. The function should take in a numpy array containing the references and one containing the estimates, and compute evaluation measures between the two arrays. The results should be stored in a list of dictionaries. For example, a BSSEval evaluator may return a dictionary like the following for a single estimate:
    [
        {                       # results for first estimate
            #        ch0  ch1     (or windows, or both)
            "SDR": [5.6, 5.2],  # metric
            "SIR": [9.2, 8.9],  # metric
            "SAR": [4.1, 4.3]   # metric
        },
        ...                     # more results for other estimates
    ]
Each metric should be a key in the dictionary pointing to a list of values. The list will contain the metrics for however the algorithm was implemented (e.g., there might be two values, one for each channel in a stereo mix, or there might be a sequence of values, one for each window that was evaluated).
- Parameters
references (np.ndarray) – References kept in a numpy array. Should have shape (…, n_channels, n_sources).

estimates (np.ndarray) – Estimates kept in a numpy array. Should have shape (…, n_channels, n_sources).

kwargs (dict) – Keyword arguments with any additional arguments to be used in the function (e.g. window_size, hop_length).

- Returns
A list of dictionaries containing the measures corresponding to each estimate and reference.
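As a concrete illustration, a hypothetical subclass (not part of nussl) might look like the sketch below. It assumes the constructor arguments are stored as self.true_sources_list and self.estimated_sources_list and that those lists hold AudioSignal objects whose audio_data is shaped (n_channels, n_samples); it returns one dictionary per estimate, with a per-channel list for each metric.

    # Hypothetical evaluator, not part of nussl; shows the expected array shapes
    # going in and the list-of-dicts format coming out of evaluate_helper.
    import numpy as np
    import nussl

    class MSEEvaluator(nussl.evaluation.EvaluationBase):
        def preprocess(self):
            # stack the per-source signals into (n_samples, n_channels, n_sources)
            references = np.stack(
                [s.audio_data.T for s in self.true_sources_list], axis=-1)
            estimates = np.stack(
                [s.audio_data.T for s in self.estimated_sources_list], axis=-1)
            return references, estimates

        def evaluate_helper(self, references, estimates, **kwargs):
            results = []
            for j in range(references.shape[-1]):   # one dict per estimate
                per_channel_mse = np.mean(
                    (references[..., j] - estimates[..., j]) ** 2, axis=0)
                results.append({'MSE': per_channel_mse.tolist()})
            return results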
get_candidates()

This gets all the possible candidates for evaluation. If compute_permutation is False, then the estimates and the references are assumed to be in the same order. The first N estimates will be compared to the first N references, where N is min(len(estimates), len(references)).
If compute_permutation is True and len(estimates) == len(references), then every possible ordering of the estimates will be tried against the references. So if there are 3 references and 3 estimates, a total of 3! = 6 candidates will be generated.
If compute_permutation is True and len(estimates) > len(references), then every combination of size len(references) estimates will be tried as well as their permutations. If there are 2 references and 4 estimates, then (4 choose 2) = 6 combos will be tried. For each of those pairs of 2, there will be 2! = 2 permutations. So a total of 12 candidates will be generated.
- Returns
Two lists of combinations and permutations that should be tried. Each element of the list contains the indices that are used to find the sources that are compared to each other.
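The candidate counting in the example above (2 references, 4 estimates, compute_permutation=True) can be reproduced with itertools; this only illustrates the counting, not nussl's internal representation.

    # Reproduces the 12-candidate count described above; illustration only.
    from itertools import combinations, permutations

    n_references, n_estimates = 2, 4
    combos = list(combinations(range(n_estimates), n_references))
    candidates = [p for c in combos for p in permutations(c)]
    print(len(combos))      # (4 choose 2) = 6
    print(len(candidates))  # 6 combos * 2! orderings = 12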
preprocess()

Takes the objects contained in true_sources_list and estimated_sources_list and processes them into numpy arrays that have shape (…, n_channels, n_sources).
- Returns
references, estimates in that order as np.ndarrays.
Note
Make sure to return the preprocessed data in the order (references, estimates)!
property scores

A dictionary that stores all scores from the evaluation method. Gets populated when evaluate() is run.
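Because scores is a plain dictionary, it can be written to disk and later aggregated with aggregate_score_files (described at the end of this page). A minimal sketch, where evaluator is any concrete evaluation object and the output path is hypothetical:

    # Persist the scores of one evaluation run as JSON; `evaluator` is any
    # concrete evaluator and 'oracle0.json' is a hypothetical output path.
    import json

    evaluator.evaluate()                  # populates evaluator.scores
    with open('oracle0.json', 'w') as f:
        json.dump(evaluator.scores, f, indent=2)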
BSS Evaluation base
class nussl.evaluation.BSSEvaluationBase(true_sources_list, estimated_sources_list, source_labels=None, compute_permutation=False, best_permutation_key='SDR', **kwargs)

Base class for all evaluation classes that are based on BSSEval metrics. This contains some useful verification and preprocessing functions that are used in many separation-based evaluations. Specific evaluation metrics are thin wrappers around this base class, basically only implementing the self.evaluate_helper function.

Both true_sources_list and estimated_sources_list are validated using the private method _verify_input_list(). If your evaluation needs to verify that the input is set correctly (recommended), override that method to add the checking.

- Parameters
true_sources_list (list) – List of objects that contain one ground truth source per object. In some instances (such as the BSSEval objects) this list is filled with AudioSignal objects, but in other cases it is populated with MaskBase-derived objects (i.e., either a BinaryMask or a SoftMask object).

estimated_sources_list (list) – List of objects that contain source estimates from a source separation algorithm. This list should be populated with the same type of objects, and in the same order, as true_sources_list.
source_labels (list) – List of strings that are labels for each source, used as keys for the scores. Defaults to None, in which case the labels are taken from the file_name attribute. If that is also None, the sources are labeled Source 0, Source 1, etc.
compute_permutation (bool) – Whether or not to evaluate in a permutation-invariant fashion, where the estimates are permuted to match the true sources. Only the best permutation according to best_permutation_key is returned in the scores dict. Defaults to False.

best_permutation_key (str) – Which metric to use to decide which permutation of the sources was best.

**kwargs (dict) – Any additional keyword arguments are passed on to evaluate_helper.
Methods

preprocess() – Implements preprocess by stacking the audio_data inside each AudioSignal object in both self.true_sources_list and self.estimated_sources_list.
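Roughly, that stacking amounts to the sketch below, assuming each AudioSignal's audio_data array is shaped (n_channels, n_samples); the exact implementation in nussl may differ.

    # Approximate shape of what preprocess produces; not nussl's exact code.
    import numpy as np

    references = np.stack([s.audio_data.T for s in true_sources_list], axis=-1)
    estimates = np.stack([s.audio_data.T for s in estimated_sources_list], axis=-1)
    # both arrays end up with shape (n_samples, n_channels, n_sources)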
Scale invariant BSSEval
class nussl.evaluation.BSSEvalScale(true_sources_list, estimated_sources_list, source_labels=None, compute_permutation=False, best_permutation_key='SDR', **kwargs)

Methods
evaluate_helper(references, estimates[, …]) – Implements evaluation using new BSSEval metrics [1].
preprocess() – Scale-invariant metrics expect zero-mean-centered references and sources.
evaluate_helper(references, estimates, compute_sir_sar=True)

Implements evaluation using new BSSEval metrics [1]. This computes every metric described in [1], including:
SI-SDR: Scale-invariant source-to-distortion ratio. Higher is better.
SI-SIR: Scale-invariant source-to-interference ratio. Higher is better.
SI-SAR: Scale-invariant source-to-artifact ratio. Higher is better.
SD-SDR: Scale-dependent source-to-distortion ratio. Higher is better.
SNR: Signal-to-noise ratio. Higher is better.
SRR: The source-to-rescaled-source ratio. This corresponds to a term that punishes the estimate if its scale is off relative to the reference. This is an unnumbered equation in [1], but is the term on page 2, second column, second to last line: ||s - alpha*s||**2. s is factored out. Higher is better.
SI-SDRi: Improvement in SI-SDR over using the mixture as the estimate. Higher is better.
SD-SDRi: Improvement in SD-SDR over using the mixture as the estimate. Higher is better.
SNRi: Improvement in SNR over using the mixture as the estimate. Higher is better.
Note:
If compute_sir_sar = False, then you’ll get np.nan for SI-SIR and SI-SAR!
References:
[1] Le Roux, J., Wisdom, S., Erdogan, H., & Hershey, J. R. (2019, May). SDR–half-baked or well done?. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 626-630). IEEE.
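A usage sketch for BSSEvalScale. The ground-truth and estimate arrays are hypothetical placeholders, AudioSignal is assumed to accept an (n_channels, n_samples) array via audio_data_array, and the exact nesting of the returned scores dict may vary.

    # Usage sketch; `ground_truth_arrays` and `estimated_arrays` are hypothetical
    # lists of (n_channels, n_samples) numpy arrays.
    import nussl

    sample_rate = 16000
    true_sources = [nussl.AudioSignal(audio_data_array=a, sample_rate=sample_rate)
                    for a in ground_truth_arrays]
    estimates = [nussl.AudioSignal(audio_data_array=a, sample_rate=sample_rate)
                 for a in estimated_arrays]

    bss = nussl.evaluation.BSSEvalScale(
        true_sources, estimates, source_labels=['vocals', 'accompaniment'])
    scores = bss.evaluate()
    print(scores['vocals']['SI-SDR'])   # one value per channel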
nussl.evaluation.scale_bss_eval(references, estimate, mixture, idx, compute_sir_sar=True)

Computes metrics for references[idx] relative to the chosen estimate. This only works for mono audio; each channel should be processed independently when calling this function. Lovingly borrowed from Gordon Wichern and Jonathan Le Roux at Mitsubishi Electric Research Labs.
This returns 9 numbers (in this order):
SI-SDR: Scale-invariant source-to-distortion ratio. Higher is better.
SI-SIR: Scale-invariant source-to-interference ratio. Higher is better.
SI-SAR: Scale-invariant source-to-artifact ratio. Higher is better.
SD-SDR: Scale-dependent source-to-distortion ratio. Higher is better.
SNR: Signal-to-noise ratio. Higher is better.
SRR: The source-to-rescaled-source ratio. This corresponds to a term that punishes the estimate if its scale is off relative to the reference. This is an unnumbered equation in [1], but is the term on page 2, second column, second to last line: ||s - alpha*s||**2. s here is factored out. Higher is better.
SI-SDRi: Improvement in SI-SDR over using the mixture as the estimate.
SD-SDRi: Improvement in SD-SDR over using the mixture as the estimate.
SNRi: Improvement in SNR over using the mixture as the estimate.
References:
[1] Le Roux, J., Wisdom, S., Erdogan, H., & Hershey, J. R. (2019, May). SDR–half-baked or well done?. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 626-630). IEEE.
- Parameters
references (np.ndarray) – object containing the references data. Of shape (n_samples, n_sources).
estimate (np.ndarray) – object containing the estimate data. Of shape (n_samples, 1).
mixture (np.ndarray) – object containing the mixture data. Of shape (n_samples, 1).
idx (int) – Which reference to compute metrics against.
compute_sir_sar (bool, optional) – Whether or not to compute SIR/SAR metrics, which can be computationally expensive and may not be relevant for your evaluation. Defaults to True.
- Returns
SI-SDR, SI-SIR, SI-SAR, SD-SDR, SNR, SRR, SI-SDRi, SD-SDRi, SNRi
- Return type
tuple
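A per-channel usage sketch of scale_bss_eval; the arrays here are hypothetical and shaped as documented above.

    # Usage sketch for a single channel; `references` is (n_samples, n_sources),
    # `estimate` and `mixture` are (n_samples, 1). All arrays are hypothetical.
    from nussl.evaluation import scale_bss_eval

    (si_sdr, si_sir, si_sar, sd_sdr, snr,
     srr, si_sdri, sd_sdri, snri) = scale_bss_eval(
        references, estimate, mixture, idx=0, compute_sir_sar=True)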
BSSEvalV4 (museval)
class nussl.evaluation.BSSEvalV4(true_sources_list, estimated_sources_list, source_labels=None, compute_permutation=False, best_permutation_key='SDR', **kwargs)

Methods

evaluate_helper(references, estimates, **kwargs) – Implements evaluation using museval.metrics.bss_eval.
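A usage sketch; BSSEvalV4 takes the same constructor arguments as the other BSSEval classes, and true_sources and estimates are hypothetical lists of AudioSignal objects as in the BSSEvalScale example above.

    # Usage sketch; `true_sources` and `estimates` are hypothetical lists of
    # AudioSignal objects.
    import nussl

    bss_v4 = nussl.evaluation.BSSEvalV4(true_sources, estimates)
    scores = bss_v4.evaluate()   # metrics are computed by museval.metrics.bss_eval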
Precision and recall on masks
class nussl.evaluation.PrecisionRecallFScore(true_sources_list, estimated_sources_list, source_labels=None, compute_permutation=False, best_permutation_key='F1-Score', **kwargs)

This class provides common statistical metrics for determining how well a source separation algorithm in nussl was able to create a binary mask compared to a known binary mask. The metrics used here are Precision, Recall, F-Score (sometimes called F-measure or F1-Score), and Accuracy (not reflected in the name of the class; it is simply # correct / total).
Methods

evaluate_helper(references, estimates, **kwargs) – Determines the precision, recall, f-score, and accuracy of each binary_mask object in true_sources_mask_list and estimated_sources_mask_list.
preprocess() – Takes the objects contained in true_sources_list and estimated_sources_list and processes them into numpy arrays that have shape (…, n_channels, n_sources).
Notes

* PrecisionRecallFScore can only be run using binary_mask objects. The constructor expects a list of binary_mask objects for both the ground truth sources and the estimated sources.
* PrecisionRecallFScore does not calculate the correct permutation of the estimated and ground truth sources; they are expected to be in the correct order when they are passed into PrecisionRecallFScore.

- Parameters
true_sources_mask_list (list) – List of binary_mask objects representing the ground truth sources.

estimated_sources_mask_list (list) – List of binary_mask objects representing the estimates from a source separation object.

source_labels (list, optional) – List of str with labels for each source. If no labels are provided, sources will be labeled Source 0, Source 1, ... etc.
evaluate_helper(references, estimates, **kwargs)

Determines the precision, recall, f-score, and accuracy of each binary_mask object in true_sources_mask_list and estimated_sources_mask_list. Returns a list of results that is formatted like so:

    [
        {'Accuracy': 0.83,
         'Precision': 0.78,
         'Recall': 0.81,
         'F1-Score': 0.77},
        {'Accuracy': 0.22,
         'Precision': 0.12,
         'Recall': 0.15,
         'F1-Score': 0.19}
    ]
- Returns
A list of scores containing the accuracy, precision, recall, and F1-score between the corresponding binary_mask objects in true_sources_mask_list and estimated_sources_mask_list.
- Return type
self.scores (dict)
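A usage sketch, assuming BinaryMask is available as nussl.core.masks.BinaryMask and accepts a boolean numpy array; the mask arrays themselves are hypothetical, and the exact nesting of the returned scores dict may vary.

    # Usage sketch; `ground_truth_masks` and `estimated_masks` are hypothetical
    # lists of boolean numpy arrays.
    import nussl

    true_masks = [nussl.core.masks.BinaryMask(m) for m in ground_truth_masks]
    est_masks = [nussl.core.masks.BinaryMask(m) for m in estimated_masks]

    prf = nussl.evaluation.PrecisionRecallFScore(true_masks, est_masks)
    scores = prf.evaluate()
    print(scores['Source 0']['F1-Score'])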
preprocess()

Takes the objects contained in true_sources_list and estimated_sources_list and processes them into numpy arrays that have shape (…, n_channels, n_sources).
- Returns
references, estimates in that order as np.ndarrays.
Note
Make sure to return the preprocessed data in the order (references, estimates)!
Aggregators
nussl.evaluation.aggregate_score_files(json_files, aggregator=np.nanmedian)

Takes a list of JSON files output by an evaluation method in nussl and aggregates all the metrics into a pandas DataFrame. Sample output:
                              SDR        SIR        SAR
    drums   oracle0.json   9.086025  15.025801  10.362709
            random0.json  -6.539877  -6.087538   3.508338
            oracle1.json   9.591432  14.335700  11.365882
            random1.json  -1.358840  -0.993666   9.577297
    bass    oracle0.json   7.936720  12.843092   9.631929
            random0.json  -4.190299  -3.730649   5.802003
            oracle1.json   8.581090  12.513445  10.831370
            random1.json   0.365171   0.697621  11.693103
    other   oracle0.json   2.024207   6.133359   4.158805
            random0.json  -9.857085  -9.481909   0.965199
            oracle1.json   3.961383   6.861785   7.085745
            random1.json  -4.042277  -3.707997   7.260934
    vocals  oracle0.json  12.169686  16.650161  14.085037
            random0.json  -2.440166  -1.884026   6.760966
            oracle1.json  12.409913  16.248470  14.725983
            random1.json   1.609577   1.958037  12.738970
- Parameters
json_files (list) – List of JSON files that will be parsed for metrics.
aggregator (callable, optional) – How to aggregate results within a single track. Defaults to np.nanmedian.
- Returns
Pandas dataframe containing the aggregated metrics.
- Return type
pd.DataFrame
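A usage sketch; the directory of JSON files is hypothetical, and the two-level (source, file) index matches the sample output above.

    # Usage sketch; 'results/*.json' is a hypothetical location for score files
    # written by the evaluators above.
    import glob
    import numpy as np
    from nussl.evaluation import aggregate_score_files

    json_files = glob.glob('results/*.json')
    df = aggregate_score_files(json_files, aggregator=np.nanmedian)
    print(df.loc['vocals'])               # metrics for one source across files
    print(df.groupby(level=0).median())   # median of each metric per source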