Machine Learning

SeparationModel

class nussl.ml.SeparationModel(config, verbose=False)[source]

SeparationModel takes a configuration file or dictionary that describes the model structure, which is some combination of MelProjection, Embedding, RecurrentStack, ConvolutionalStack, and other modules found in nussl.ml.networks.modules.

Parameters

config – (str, dict) Either a config dictionary that defines the model and its connections, or the path to a JSON file containing the dictionary. If the latter, the file will be loaded and used.

References

Hershey, J. R., Chen, Z., Le Roux, J., & Watanabe, S. (2016, March). Deep clustering: Discriminative embeddings for segmentation and separation. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on (pp. 31-35). IEEE.

Luo, Y., Chen, Z., Hershey, J. R., Le Roux, J., & Mesgarani, N. (2017, March). Deep clustering and conventional networks for music separation: Stronger together. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on (pp. 61-65). IEEE.

Methods

forward(data)

Performs a forward pass through the model given data, a dictionary containing the input data for the model.

save(location[, metadata])

Saves a SeparationModel to a location as a dictionary with the weights and model configuration.

See also

ml.register_module to register your custom modules with SeparationModel.

Examples

>>> config = nussl.ml.networks.builders.build_recurrent_dpcl(
...     num_features=512, hidden_size=300, num_layers=3, bidirectional=True,
...     dropout=0.3, embedding_size=20,
...     embedding_activation=['sigmoid', 'unit_norm'])
>>> model = SeparationModel(config)

forward(data)[source]

Parameters
  • data – (dict) a dictionary containing the input data for the model. Keys should match the input_keys in self.input.

save(location, metadata=None)[source]

Saves a SeparationModel to the given location as a dictionary containing the weights and model configuration.

Parameters
  • location – (str) Where you want the model saved, as a path.

  • metadata – (dict, optional) Additional metadata to save alongside the model. Defaults to None.

Returns

where the model was saved.

Return type

(str)
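
For example, a minimal sketch of saving and then inspecting a model. Loading the file with torch.load is an assumption here; the documentation above only states that a dictionary with the weights and configuration is written:

    import torch
    import nussl

    config = nussl.ml.networks.builders.build_recurrent_dpcl(
        num_features=512, hidden_size=300, num_layers=3, bidirectional=True,
        dropout=0.3, embedding_size=20,
        embedding_activation=['sigmoid', 'unit_norm'])
    model = nussl.ml.SeparationModel(config)

    # save returns the path the model was written to
    path = model.save('model.pth')
    saved = torch.load(path)  # dictionary with weights and model configuration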

Building blocks for SeparationModel

Helpers for common deep networks

Functions that make it easy to build commonly used source separation architectures. Currently contains mask inference, deep clustering, and chimera networks that are based on recurrent neural networks. These functions are a good place to start when creating your own network topologies. Since there can be dependencies between layers that depend on input size, it’s good to work these out in functions like those below.

Functions

build_dual_path_recurrent_end_to_end(…[, …])

Builds a config for a dual path recurrent network that operates on the time-series.

build_open_unmix_like(num_features, …[, …])

This is a builder for an open-unmix LIKE (UMX) architecture for music source separation.

build_recurrent_chimera(num_features, …[, …])

Builds a config for a Chimera network that can be passed to SeparationModel.

build_recurrent_dpcl(num_features, …[, …])

Builds a config for a deep clustering network that can be passed to SeparationModel.

build_recurrent_end_to_end(num_filters, …)

Builds a config for a BLSTM-based network that operates on the time-series.

build_recurrent_mask_inference(num_features, …)

Builds a config for a mask inference network that can be passed to SeparationModel.

nussl.ml.networks.builders.build_dual_path_recurrent_end_to_end(num_filters, filter_length, hop_length, chunk_size, hop_size, hidden_size, num_layers, bidirectional, bottleneck_size, num_sources, mask_activation, num_audio_channels=1, window_type='sqrt_hann', skip_connection=False, rnn_type='lstm', mix_key='mix_audio')[source]

Builds a config for a dual path recurrent network that operates on the time-series. Uses a learned filterbank within the network.

Parameters
  • num_filters (int) – Number of learnable filters in the front end network.

  • filter_length (int) – Length of the filters.

  • hop_length (int) – Hop length between frames.

  • chunk_size (int) – Size of each chunk processed by the dual-path blocks.

  • hop_size (int) – Hop between successive chunks.

  • hidden_size (int) – Hidden size of the RNN.

  • num_layers (int) – Number of layers in the RNN.

  • bidirectional (bool) – Whether the RNN is bidirectional.

  • bottleneck_size (int) – Size of the bottleneck between the learned filterbank and the dual-path blocks.

  • num_sources (int) – Number of sources to create masks for.

  • mask_activation (list of str) – Activation of the mask (‘sigmoid’, ‘softmax’, etc.). See nussl.ml.networks.modules.Embedding.

  • num_audio_channels (int) – Number of audio channels in input (e.g. mono or stereo). Defaults to 1.

  • window_type (str, optional) – Type of windowing function to apply to each frame. Defaults to ‘sqrt_hann’.

  • skip_connection (bool, optional) – Whether to use a skip connection in the dual-path blocks. Defaults to False.

  • rnn_type (str, optional) – RNN type, either ‘lstm’ or ‘gru’. Defaults to ‘lstm’.

  • mix_key (str, optional) – The key to look for in the input dictionary that contains the mixture audio. Defaults to ‘mix_audio’.

Returns

A TASNet configuration that can be passed to SeparationModel.

Return type

dict
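
A sketch of building and instantiating this config; the parameter values below are arbitrary choices for illustration:

    import nussl

    config = nussl.ml.networks.builders.build_dual_path_recurrent_end_to_end(
        num_filters=64, filter_length=16, hop_length=8,
        chunk_size=100, hop_size=50, hidden_size=128, num_layers=4,
        bidirectional=True, bottleneck_size=64, num_sources=2,
        mask_activation=['sigmoid'])
    model = nussl.ml.SeparationModel(config)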

nussl.ml.networks.builders.build_open_unmix_like(num_features, hidden_size, num_layers, bidirectional, dropout, num_sources, num_audio_channels=1, add_embedding=False, embedding_size=20, embedding_activation='sigmoid', rnn_type='lstm', mix_key='mix_magnitude')[source]

This is a builder for an open-unmix LIKE (UMX) architecture for music source separation.

The architecture is not exactly the same but is very similar for the most part. This architecture also has the option of having an embedding space attached to it, making it a UMX + Chimera network that you can regularize with a deep clustering loss.

Parameters
  • num_features (int) – Number of features in the input spectrogram (usually the STFT window length // 2 + 1).

  • hidden_size (int) – Hidden size of the RNN. Will be hidden_size // 2 if bidirectional is True.

  • num_layers (int) – Number of layers in the RNN.

  • bidirectional (bool) – Whether the RNN is bidirectional.

  • dropout (float) – Amount of dropout to be used between layers of RNN.

  • num_sources (int) – Number of sources to create masks for.

  • num_audio_channels (int) – Number of audio channels in input (e.g. mono or stereo). Defaults to 1.

  • add_embedding (bool) – Whether or not to add an embedding layer to this to make this a Chimera network. If True, then embedding_size and embedding_activation will be used to define this.

  • embedding_size (int) – Embedding dimensionality of the deep clustering network.

  • embedding_activation (list of str) – Activation of the embedding (‘sigmoid’, ‘softmax’, etc.). See nussl.ml.networks.modules.Embedding.

  • rnn_type (str, optional) – RNN type, either ‘lstm’ or ‘gru’. Defaults to ‘lstm’.

  • mix_key (str, optional) – The key to look for in the input dictionary that contains the mixture spectrogram. Defaults to ‘mix_magnitude’.

Returns

An OpenUnmix-like configuration that can be passed to SeparationModel.

Return type

dict
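
For instance, a sketch of a UMX-like config with the optional embedding head enabled (all values are illustrative):

    import nussl

    config = nussl.ml.networks.builders.build_open_unmix_like(
        num_features=1025, hidden_size=512, num_layers=3, bidirectional=True,
        dropout=0.4, num_sources=2, num_audio_channels=1,
        add_embedding=True, embedding_size=20, embedding_activation='sigmoid')
    model = nussl.ml.SeparationModel(config)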

nussl.ml.networks.builders.build_recurrent_chimera(num_features, hidden_size, num_layers, bidirectional, dropout, embedding_size, embedding_activation, num_sources, mask_activation, num_audio_channels=1, rnn_type='lstm', normalization_class='BatchNorm', mix_key='mix_magnitude')[source]

Builds a config for a Chimera network that can be passed to SeparationModel. Chimera networks are so-called because they have two “heads” which can be trained via different loss functions. In traditional Chimera, one head is trained using a deep clustering loss while the other is trained with a mask inference loss. This Chimera network uses a recurrent neural network (RNN) to process the input representation.

Parameters
  • num_features (int) – Number of features in the input spectrogram (usually the STFT window length // 2 + 1).

  • hidden_size (int) – Hidden size of the RNN.

  • num_layers (int) – Number of layers in the RNN.

  • bidirectional (bool) – Whether the RNN is bidirectional.

  • dropout (float) – Amount of dropout to be used between layers of RNN.

  • embedding_size (int) – Embedding dimensionality of the deep clustering network.

  • embedding_activation (list of str) – Activation of the embedding (‘sigmoid’, ‘softmax’, etc.). See nussl.ml.networks.modules.Embedding.

  • num_sources (int) – Number of sources to create masks for.

  • mask_activation (list of str) – Activation of the mask (‘sigmoid’, ‘softmax’, etc.). See nussl.ml.networks.modules.Embedding.

  • num_audio_channels (int) – Number of audio channels in input (e.g. mono or stereo). Defaults to 1.

  • rnn_type (str, optional) – RNN type, either ‘lstm’ or ‘gru’. Defaults to ‘lstm’.

  • normalization_class (str, optional) – Type of normalization to apply, either ‘InstanceNorm’ or ‘BatchNorm’. Defaults to ‘BatchNorm’.

  • mix_key (str, optional) – The key to look for in the input dictionary that contains the mixture spectrogram. Defaults to ‘mix_magnitude’.

Returns

A recurrent Chimera network configuration that can be passed to SeparationModel.

Return type

dict
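
A sketch of building a Chimera config and model (values are illustrative):

    import nussl

    config = nussl.ml.networks.builders.build_recurrent_chimera(
        num_features=257, hidden_size=300, num_layers=2, bidirectional=True,
        dropout=0.3, embedding_size=20,
        embedding_activation=['sigmoid', 'unit_norm'],
        num_sources=2, mask_activation=['softmax'])
    model = nussl.ml.SeparationModel(config)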

nussl.ml.networks.builders.build_recurrent_dpcl(num_features, hidden_size, num_layers, bidirectional, dropout, embedding_size, embedding_activation, num_audio_channels=1, rnn_type='lstm', normalization_class='BatchNorm', mix_key='mix_magnitude')[source]

Builds a config for a deep clustering network that can be passed to SeparationModel. This deep clustering network uses a recurrent neural network (RNN) to process the input representation.

Parameters
  • num_features (int) – Number of features in the input spectrogram (usually the STFT window length // 2 + 1).

  • hidden_size (int) – Hidden size of the RNN.

  • num_layers (int) – Number of layers in the RNN.

  • bidirectional (bool) – Whether the RNN is bidirectional.

  • dropout (float) – Amount of dropout to be used between layers of RNN.

  • embedding_size (int) – Embedding dimensionality of the deep clustering network.

  • embedding_activation (list of str) – Activation of the embedding (‘sigmoid’, ‘softmax’, etc.). See nussl.ml.networks.modules.Embedding.

  • num_audio_channels (int) – Number of audio channels in input (e.g. mono or stereo). Defaults to 1.

  • rnn_type (str, optional) – RNN type, either ‘lstm’ or ‘gru’. Defaults to ‘lstm’.

  • normalization_class (str, optional) – Type of normalization to apply, either ‘InstanceNorm’ or ‘BatchNorm’. Defaults to ‘BatchNorm’.

  • mix_key (str, optional) – The key to look for in the input dictionary that contains the mixture spectrogram. Defaults to ‘mix_magnitude’.

Returns

A recurrent deep clustering network configuration that can be passed to SeparationModel.

Return type

dict
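
A sketch of a forward pass through a built model; the input shape convention (batch, num_frames, num_features, num_audio_channels) and the 'embedding' output key are assumptions for illustration:

    import torch
    import nussl

    config = nussl.ml.networks.builders.build_recurrent_dpcl(
        num_features=257, hidden_size=300, num_layers=2, bidirectional=True,
        dropout=0.3, embedding_size=20,
        embedding_activation=['sigmoid', 'unit_norm'])
    model = nussl.ml.SeparationModel(config)

    # assumed shape: (batch, num_frames, num_features, num_audio_channels)
    data = {'mix_magnitude': torch.rand(1, 400, 257, 1)}
    output = model(data)  # dict of outputs; 'embedding' is the assumed key here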

nussl.ml.networks.builders.build_recurrent_end_to_end(num_filters, filter_length, hop_length, window_type, hidden_size, num_layers, bidirectional, dropout, num_sources, mask_activation, num_audio_channels=1, mask_complex=False, trainable=False, rnn_type='lstm', mix_key='mix_audio', normalization_class='BatchNorm')[source]

Builds a config for a BLSTM-based network that operates on the time-series. Uses an STFT within the network and can apply the mixture phase to the estimate, or can learn a mask on the phase as well as the magnitude.

Parameters
  • num_filters (int) – Number of learnable filters in the front end network.

  • filter_length (int) – Length of the filters.

  • hop_length (int) – Hop length between frames.

  • window_type (str) – Type of windowing function to apply to each frame.

  • hidden_size (int) – Hidden size of the RNN.

  • num_layers (int) – Number of layers in the RNN.

  • bidirectional (bool) – Whether the RNN is bidirectional.

  • dropout (float) – Amount of dropout to be used between layers of RNN.

  • num_sources (int) – Number of sources to create masks for.

  • mask_activation (list of str) – Activation of the mask (‘sigmoid’, ‘softmax’, etc.). See nussl.ml.networks.modules.Embedding.

  • num_audio_channels (int) – Number of audio channels in input (e.g. mono or stereo). Defaults to 1.

  • mask_complex (bool, optional) – Whether to also place a mask on the complex part, or whether to just use the mixture phase.

  • trainable (bool, optional) – Whether to learn the filters, which start from a Fourier basis.

  • rnn_type (str, optional) – RNN type, either ‘lstm’ or ‘gru’. Defaults to ‘lstm’.

  • normalization_class (str, optional) – Type of normalization to apply, either ‘InstanceNorm’ or ‘BatchNorm’. Defaults to ‘BatchNorm’.

  • mix_key (str, optional) – The key to look for in the input dictionary that contains the mixture audio. Defaults to ‘mix_audio’.

Returns

A recurrent end-to-end network configuration that can be passed to SeparationModel.

Return type

dict

nussl.ml.networks.builders.build_recurrent_mask_inference(num_features, hidden_size, num_layers, bidirectional, dropout, num_sources, mask_activation, num_audio_channels=1, rnn_type='lstm', normalization_class='BatchNorm', mix_key='mix_magnitude')[source]

Builds a config for a mask inference network that can be passed to SeparationModel. This mask inference network uses a recurrent neural network (RNN) to process the input representation.

Parameters
  • num_features (int) – Number of features in the input spectrogram (usually the STFT window length // 2 + 1).

  • hidden_size (int) – Hidden size of the RNN.

  • num_layers (int) – Number of layers in the RNN.

  • bidirectional (bool) – Whether the RNN is bidirectional.

  • dropout (float) – Amount of dropout to be used between layers of RNN.

  • num_sources (int) – Number of sources to create masks for.

  • mask_activation (list of str) – Activation of the mask (‘sigmoid’, ‘softmax’, etc.). See nussl.ml.networks.modules.Embedding.

  • num_audio_channels (int) – Number of audio channels in input (e.g. mono or stereo). Defaults to 1.

  • rnn_type (str, optional) – RNN type, either ‘lstm’ or ‘gru’. Defaults to ‘lstm’.

  • normalization_class (str, optional) – Type of normalization to apply, either ‘InstanceNorm’ or ‘BatchNorm’. Defaults to ‘BatchNorm’.

  • mix_key (str, optional) – The key to look for in the input dictionary that contains the mixture spectrogram. Defaults to ‘mix_magnitude’.

Returns

A recurrent mask inference network configuration that can be passed to SeparationModel.

Return type

dict
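
A sketch of a mask inference model in use; the input shape and output key names are assumptions for illustration:

    import torch
    import nussl

    config = nussl.ml.networks.builders.build_recurrent_mask_inference(
        num_features=257, hidden_size=300, num_layers=2, bidirectional=True,
        dropout=0.3, num_sources=2, mask_activation=['softmax'])
    model = nussl.ml.SeparationModel(config)

    data = {'mix_magnitude': torch.rand(1, 400, 257, 1)}  # assumed shape
    output = model(data)
    # 'mask' and 'estimates' are assumed output keys; the estimates are
    # typically the masks applied to mix_magnitude, one per source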

Confidence measures

There are ways to measure the quality of a separated source without requiring ground truth. These functions operate on the output of clustering-based separation algorithms and work by analyzing the clusterability of the feature space used to generate the separated sources.

Functions

dpcl_classic_confidence(audio_signal, …[, …])

Computes the clusterability in two steps:

jensen_shannon_confidence(audio_signal, …)

Calculates the clusterability of a space by comparing a K-cluster GMM with a 1-cluster GMM on the same features.

jensen_shannon_divergence(gmm_p, gmm_q[, …])

Compute Jensen-Shannon (JS) divergence between two Gaussian Mixture Models via sampling.

loudness_confidence(audio_signal, features, …)

Computes the clusterability of the feature space by comparing the absolute size of each cluster.

posterior_confidence(audio_signal, features, …)

Calculates the clusterability of an embedding space by looking at the strength of the assignments of each point to a specific cluster.

silhouette_confidence(audio_signal, …[, …])

Uses the silhouette score to compute the clusterability of the feature space.

whitened_kmeans_confidence(audio_signal, …)

Computes the clusterability in two steps:

nussl.ml.confidence.dpcl_classic_confidence(audio_signal, features, num_sources, threshold=95, **kwargs)[source]

Computes the clusterability in two steps:

  1. Cluster the feature space using KMeans into assignments

  2. Compute the classic deep clustering loss between the features and the assignments.

Parameters
  • audio_signal (AudioSignal) – AudioSignal object which will be used to compute the mask over which to compute the confidence measure. This can be None, if and only if representation is passed as a keyword argument to this function.

  • features (np.ndarray) – Numpy array containing the features to be clustered. Should have the same dimensions as the representation.

  • num_sources (int) – Number of sources to cluster the features into.

  • threshold (int, optional) – Threshold by loudness. Points below the threshold are excluded from being used in the confidence measure. Defaults to 95.

  • kwargs – Keyword arguments to _get_loud_bins_mask. Namely, representation can go here as a keyword argument.

Returns

Confidence given by deep clustering loss.

Return type

float
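
A sketch of the calling convention shared by the confidence measures in this module; the file path and the random stand-in features are placeholders:

    import numpy as np
    import nussl

    signal = nussl.AudioSignal('mixture.wav')  # hypothetical input file
    signal.stft()
    representation = np.abs(signal.stft_data)

    # stand-in for, e.g., embeddings from a deep clustering model; adding an
    # extra embedding dimension onto the representation's dimensions is an
    # assumption for illustration
    features = np.random.rand(*representation.shape, 20)

    confidence = nussl.ml.confidence.dpcl_classic_confidence(
        signal, features, num_sources=2, threshold=95)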

nussl.ml.confidence.jensen_shannon_confidence(audio_signal, features, num_sources, threshold=95, n_samples=100000, **kwargs)[source]

Calculates the clusterability of a space by comparing a K-cluster GMM with a 1-cluster GMM on the same features. This function fits both GMMs to the points that are above the specified threshold (defaults to 95, i.e. the 95th percentile of all the data). This saves computation time and lets the confidence measure focus on the louder, more perceptually important points.

References:

Seetharaman, Prem, Gordon Wichern, Jonathan Le Roux, and Bryan Pardo. “Bootstrapping Single-Channel Source Separation via Unsupervised Spatial Clustering on Stereo Mixtures”. 44th International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, May, 2019

Seetharaman, Prem. Bootstrapping the Learning Process for Computer Audition. Diss. Northwestern University, 2019.

Parameters
  • audio_signal (AudioSignal) – AudioSignal object which will be used to compute the mask over which to compute the confidence measure. This can be None, if and only if representation is passed as a keyword argument to this function.

  • features (np.ndarray) – Numpy array containing the features to be clustered. Should have the same dimensions as the representation.

  • num_sources (int) – Number of sources to cluster the features into.

  • threshold (int, optional) – Threshold by loudness. Points below the threshold are excluded from being used in the confidence measure. Defaults to 95.

  • kwargs – Keyword arguments to _get_loud_bins_mask. Namely, representation can go here as a keyword argument.

Returns

Confidence given by Jensen-Shannon divergence.

Return type

float

nussl.ml.confidence.jensen_shannon_divergence(gmm_p, gmm_q, n_samples=100000)[source]

Compute Jensen-Shannon (JS) divergence between two Gaussian Mixture Models via sampling. JS divergence is also known as symmetric Kullback-Leibler divergence. JS divergence has no closed form in general for GMMs, thus we use sampling to compute it.

Parameters
  • gmm_p (GaussianMixture) – A GaussianMixture object fit to some data.

  • gmm_q (GaussianMixture) – Another GaussianMixture object fit to some data.

  • n_samples (int) – Number of samples to use to estimate JS divergence.

Returns

JS divergence between gmm_p and gmm_q
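
For example, a minimal sketch comparing two scikit-learn mixtures fit to the same data:

    import numpy as np
    from sklearn.mixture import GaussianMixture
    import nussl

    X = np.random.randn(1000, 2)
    gmm_p = GaussianMixture(n_components=2).fit(X)
    gmm_q = GaussianMixture(n_components=1).fit(X)

    js = nussl.ml.confidence.jensen_shannon_divergence(
        gmm_p, gmm_q, n_samples=10000)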

nussl.ml.confidence.loudness_confidence(audio_signal, features, num_sources, threshold=95, **kwargs)[source]

Computes the clusterability of the feature space by comparing the absolute size of each cluster.

References:

Seetharaman, Prem, Gordon Wichern, Jonathan Le Roux, and Bryan Pardo. “Bootstrapping Single-Channel Source Separation via Unsupervised Spatial Clustering on Stereo Mixtures”. 44th International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, May, 2019

Seetharaman, Prem. Bootstrapping the Learning Process for Computer Audition. Diss. Northwestern University, 2019.

Parameters
  • audio_signal (AudioSignal) – AudioSignal object which will be used to compute the mask over which to compute the confidence measure. This can be None, if and only if representation is passed as a keyword argument to this function.

  • features (np.ndarray) – Numpy array containing the features to be clustered. Should have the same dimensions as the representation.

  • num_sources (int) – Number of sources to cluster the features into.

  • threshold (int, optional) – Threshold by loudness. Points below the threshold are excluded from being used in the confidence measure. Defaults to 95.

  • kwargs – Keyword arguments to _get_loud_bins_mask. Namely, representation can go here as a keyword argument.

Returns

Confidence given by size of smallest cluster.

Return type

float

nussl.ml.confidence.posterior_confidence(audio_signal, features, num_sources, threshold=95, **kwargs)[source]

Calculates the clusterability of an embedding space by looking at the strength of the assignments of each point to a specific cluster. The more points that are “in between” clusters (i.e. have no strong assignment), the lower the clusterability.

References:

Seetharaman, Prem, Gordon Wichern, Jonathan Le Roux, and Bryan Pardo. “Bootstrapping Single-Channel Source Separation via Unsupervised Spatial Clustering on Stereo Mixtures”. 44th International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, May, 2019

Seetharaman, Prem. Bootstrapping the Learning Process for Computer Audition. Diss. Northwestern University, 2019.

Parameters
  • audio_signal (AudioSignal) – AudioSignal object which will be used to compute the mask over which to compute the confidence measure. This can be None, if and only if representation is passed as a keyword argument to this function.

  • features (np.ndarray) – Numpy array containing the features to be clustered. Should have the same dimensions as the representation.

  • num_sources (int) – Number of sources to cluster the features into.

  • threshold (int, optional) – Threshold by loudness. Points below the threshold are excluded from being used in the confidence measure. Defaults to 95.

  • kwargs – Keyword arguments to _get_loud_bins_mask. Namely, representation can go here as a keyword argument.

Returns

Confidence given by posteriors.

Return type

float

nussl.ml.confidence.silhouette_confidence(audio_signal, features, num_sources, threshold=95, max_points=1000, **kwargs)[source]

Uses the silhouette score to compute the clusterability of the feature space.

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined when the number of labels satisfies 2 <= n_labels <= n_samples - 1.

References:

Seetharaman, Prem. Bootstrapping the Learning Process for Computer Audition. Diss. Northwestern University, 2019.

Peter J. Rousseeuw (1987). “Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis”. Computational and Applied Mathematics 20: 53-65.

Parameters
  • audio_signal (AudioSignal) – AudioSignal object which will be used to compute the mask over which to compute the confidence measure. This can be None, if and only if representation is passed as a keyword argument to this function.

  • features (np.ndarray) – Numpy array containing the features to be clustered. Should have the same dimensions as the representation.

  • num_sources (int) – Number of sources to cluster the features into.

  • threshold (int, optional) – Threshold by loudness. Points below the threshold are excluded from being used in the confidence measure. Defaults to 95.

  • kwargs – Keyword arguments to _get_loud_bins_mask. Namely, representation can go here as a keyword argument.

  • max_points (int, optional) – Maximum number of points to compute the Silhouette score for. Silhouette score is a costly operation. Defaults to 1000.

Returns

Confidence given by Silhouette score.

Return type

float

nussl.ml.confidence.whitened_kmeans_confidence(audio_signal, features, num_sources, threshold=95, **kwargs)[source]

Computes the clusterability in two steps:

  1. Cluster the feature space using KMeans into assignments

  2. Compute the Whitened K-Means loss between the features and the assignments.

Parameters
  • audio_signal (AudioSignal) – AudioSignal object which will be used to compute the mask over which to compute the confidence measure. This can be None, if and only if representation is passed as a keyword argument to this function.

  • features (np.ndarray) – Numpy array containing the features to be clustered. Should have the same dimensions as the representation.

  • num_sources (int) – Number of sources to cluster the features into.

  • threshold (int, optional) – Threshold by loudness. Points below the threshold are excluded from being used in the confidence measure. Defaults to 95.

  • kwargs – Keyword arguments to _get_loud_bins_mask. Namely, representation can go here as a keyword argument.

Returns

Confidence given by whitened k-means loss.

Return type

float

Training

nussl.ml.train.create_train_and_validation_engines(train_func, val_func=None, device='cpu')[source]

Helper function for creating an ignite Engine object with helpful defaults. This sets up an Engine that has four handlers attached to it:

  • prepare_batch: before a batch is passed to train_func or val_func, this function runs, moving every item in the batch (which is a dictionary) to the appropriate device (‘cpu’ or ‘cuda’).

  • book_keeping: sets up some dictionaries that are used for bookkeeping so one can easily track the epoch and iteration losses for both training and validation.

  • add_to_iter_history: records the iteration, epoch, and past iteration losses into the dictionaries set up by book_keeping.

  • clear_iter_history: resets the current iteration history of losses after moving the current iteration history into past iteration history.

Parameters
  • train_func (func) – Function that provides the closure for training for a single batch.

  • val_func (func, optional) – Function that provides the closure for validating a single batch. Defaults to None.

  • device (str, optional) – Device to move tensors to. Defaults to ‘cpu’.
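
A sketch of how engines and closures fit together, assuming a config from one of the builders above and that this function returns a (trainer, validator) pair of engines; the loss and optimizer choices are illustrative:

    import torch
    import nussl

    model = nussl.ml.SeparationModel(config)  # config from a builder above
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_dictionary = {'DeepClusteringLoss': {'weight': 1.0}}

    train_closure = nussl.ml.train.closures.TrainClosure(
        loss_dictionary, optimizer, model)
    val_closure = nussl.ml.train.closures.ValidationClosure(
        loss_dictionary, model)

    trainer, validator = nussl.ml.train.create_train_and_validation_engines(
        train_closure, val_closure, device='cpu')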

nussl.ml.train.add_tensorboard_handler(tensorboard_folder, engine, every_iteration=False)[source]

Every key in engine.state.epoch_history[-1] is logged to TensorBoard.

Parameters
  • tensorboard_folder (str) – Where the tensorboard logs should go.

  • engine (ignite.Engine) – The engine to log.

  • every_iteration (bool, optional) – Whether to also log the values at every iteration.

nussl.ml.train.cache_dataset(dataset)[source]

Runs through an entire dataset and caches it if nussl.datasets.transforms.Cache is in dataset.transform. If there is no caching, or dataset.cache_populated = True, then this function simply iterates through the dataset and does nothing.

This function can also take a torch.utils.data.DataLoader object wrapped around a nussl.datasets.BaseDataset object.

Parameters

dataset (nussl.datasets.BaseDataset) – Must be a subclass of nussl.datasets.BaseDataset.

nussl.ml.train.add_validate_and_checkpoint(output_folder, model, optimizer, train_data, trainer, val_data=None, validator=None)[source]

This adds the following handler to the trainer:

  • validate_and_checkpoint: this runs the validator on the validation dataset (val_data) using a defined validation process function val_func. These are optional; if they are not provided, no validation is run and the model is simply checkpointed. The model is always saved to {output_folder}/checkpoints/latest.model.pth. If the model also has the lowest validation loss so far, it is saved to {output_folder}/checkpoints/best.model.pth as well. This handler is attached to Events.EPOCH_COMPLETED on the trainer. After completion, it fires a ValidationEvents.VALIDATION_COMPLETED event.

Parameters
  • output_folder (str) – Folder where the model checkpoints will be saved (see above).

  • model (torch.nn.Module) – Model that is being trained (typically a SeparationModel).

  • optimizer (torch.optim.Optimizer) – Optimizer being used to train the model.

  • train_data (BaseDataset) – dataset that is being used to train the model. This is to save additional metadata information alongside the model checkpoint such as the STFTParams, dataset folder, length, list of transforms, etc.

  • trainer (ignite.Engine) – Engine for trainer

  • validator (ignite.Engine, optional) – Engine for validation. Defaults to None.

  • val_data (torch.utils.data.Dataset, optional) – The validation data. Defaults to None.
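
Continuing the sketch from create_train_and_validation_engines above; the output folder, datasets, and data loader are placeholders:

    nussl.ml.train.add_validate_and_checkpoint(
        'runs/exp0', model, optimizer, train_dataset, trainer,
        val_data=val_dataset, validator=validator)
    nussl.ml.train.add_stdout_handler(trainer, validator)

    trainer.run(train_dataloader, max_epochs=10)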

nussl.ml.train.add_stdout_handler(trainer, validator=None)[source]

This adds the following handler to the trainer engine, and also sets up Timers:

  • log_epoch_to_stdout: This logs the results of a model after it has trained for a single epoch on both the training and validation set. The output typically looks like this:

    EPOCH SUMMARY
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    - Epoch number: 0010 / 0010
    - Training loss:   0.583591
    - Validation loss: 0.137209
    - Epoch took: 00:00:03
    - Time since start: 00:00:32
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Saving to test.
    Output @ tests/local/trainer
    
Parameters
  • trainer (ignite.Engine) – Engine for trainer

  • validator (ignite.Engine, optional) – Engine for validation. Defaults to None.

nussl.ml.train.add_progress_bar_handler(*engines)[source]

Adds a progress bar to each engine. Keeps track of a running average of the loss as well.

Usage:

    tr_engine, val_engine = ...
    add_progress_bar_handler(tr_engine, val_engine)

class nussl.ml.train.ValidationEvents[source]

Events based on validation running

VALIDATION_COMPLETED = 'validation_completed'
VALIDATION_STARTED = 'validation_started'
class nussl.ml.train.BackwardsEvents[source]

Events based on the backwards pass running

BACKWARDS_COMPLETED = 'backwards_completed'

Loss functions

Classes

CombinationInvariantLoss(loss_function)

Variant on Permutation Invariant Loss where a combination of the sources output by the model is used.

DeepClusteringLoss()

Computes the deep clustering loss with weights.

KLDivLoss([size_average, reduce, reduction])

L1Loss([size_average, reduce, reduction])

MSELoss([size_average, reduce, reduction])

PermutationInvariantLoss(loss_function)

Computes the Permutation Invariant Loss (PIT) [1] by permuting the estimated sources and the reference sources.

SISDRLoss([scaling, return_scaling, …])

Computes the Scale-Invariant Source-to-Distortion Ratio between a batch of estimated and reference audio signals.

WhitenedKMeansLoss()

Computes the whitened K-Means loss with weights.

class nussl.ml.train.loss.CombinationInvariantLoss(loss_function)[source]

Variant on Permutation Invariant Loss where a combination of the sources output by the model is used. This way a model can output more sources than there are in the ground truth. A subset of the output sources will be compared using Permutation Invariant Loss with the ground truth estimates.

For when you’re trying to match the estimates to the sources but you don’t know the order in which your model outputs the estimates AND you are outputting more estimates than there are sources.

Attributes

DEFAULT_KEYS

Default mapping of keys in the data dictionary to the arguments of this loss.

Methods

forward(estimates, targets)

Defines the computation performed at every call.

DEFAULT_KEYS = {'estimates': 'estimates', 'source_magnitudes': 'targets'}
forward(estimates, targets)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class nussl.ml.train.loss.DeepClusteringLoss[source]

Computes the deep clustering loss with weights. Equation (7) in [1].

References:

[1] Wang, Z. Q., Le Roux, J., & Hershey, J. R. (2018, April). Alternative Objective Functions for Deep Clustering. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Attributes

DEFAULT_KEYS

Default mapping of keys in the data dictionary to the arguments of this loss.

Methods

forward(embedding, assignments, weights)

Defines the computation performed at every call.

DEFAULT_KEYS = {'embedding': 'embedding', 'ideal_binary_mask': 'assignments', 'weights': 'weights'}
forward(embedding, assignments, weights)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class nussl.ml.train.loss.KLDivLoss(size_average=None, reduce=None, reduction='mean')[source]

Attributes

DEFAULT_KEYS

Default mapping of keys in the data dictionary to the arguments of this loss.

DEFAULT_KEYS = {'estimates': 'input', 'source_magnitudes': 'target'}
class nussl.ml.train.loss.L1Loss(size_average=None, reduce=None, reduction='mean')[source]

Attributes

DEFAULT_KEYS

Default mapping of keys in the data dictionary to the arguments of this loss.

DEFAULT_KEYS = {'estimates': 'input', 'source_magnitudes': 'target'}
class nussl.ml.train.loss.MSELoss(size_average=None, reduce=None, reduction='mean')[source]

Attributes

DEFAULT_KEYS

Default mapping of keys in the data dictionary to the arguments of this loss.

DEFAULT_KEYS = {'estimates': 'input', 'source_magnitudes': 'target'}
class nussl.ml.train.loss.PermutationInvariantLoss(loss_function)[source]

Computes the Permutation Invariant Loss (PIT) [1] by permuting the estimated sources and the reference sources. Takes the best permutation and only backprops the loss from that.

For when you’re trying to match the estimates to the sources but you don’t know the order in which your model outputs the estimates.

References:

[1] Yu, Dong, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen. “Permutation invariant training of deep models for speaker-independent multi-talker speech separation.” In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 241-245. IEEE, 2017.
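
A minimal sketch of wrapping a base loss with PIT; passing the wrapped loss as an instance and placing sources on the last axis are assumptions for illustration:

    import torch
    from nussl.ml.train.loss import L1Loss, PermutationInvariantLoss

    loss_fn = PermutationInvariantLoss(L1Loss())

    # assumed shape: (batch, num_frames, num_features, num_sources)
    estimates = torch.rand(4, 100, 257, 2)
    targets = torch.rand(4, 100, 257, 2)
    loss = loss_fn(estimates, targets)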

Attributes

DEFAULT_KEYS

Default mapping of keys in the data dictionary to the arguments of this loss.

Methods

forward(estimates, targets)

Defines the computation performed at every call.

DEFAULT_KEYS = {'estimates': 'estimates', 'source_magnitudes': 'targets'}
forward(estimates, targets)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class nussl.ml.train.loss.SISDRLoss(scaling=True, return_scaling=False, reduction='mean', zero_mean=True)[source]

Computes the Scale-Invariant Source-to-Distortion Ratio between a batch of estimated and reference audio signals. Used in end-to-end networks. This is essentially a batch PyTorch version of the function nussl.evaluation.bss_eval.scale_bss_eval and can be used to compute SI-SDR or SNR.

Parameters
  • scaling (bool, optional) – Whether to use scale-invariant (True) or signal-to-noise ratio (False). Defaults to True.

  • return_scaling (bool, optional) – Whether to only return the scaling factor that the estimate gets scaled by relative to the reference. This is just for monitoring this value during training, don’t actually train with it! Defaults to False.

  • reduction (str, optional) – How to reduce across the batch (either ‘mean’, ‘sum’, or none). Defaults to ‘mean’.

  • zero_mean (bool, optional) – Zero mean the references and estimates before computing the loss. Defaults to True.
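
A short sketch; the audio shape convention (batch, num_samples, num_sources) is an assumption for illustration:

    import torch
    from nussl.ml.train.loss import SISDRLoss

    loss_fn = SISDRLoss()  # scale-invariant by default

    # assumed shape: (batch, num_samples, num_sources)
    estimates = torch.randn(4, 16000, 2)
    references = torch.randn(4, 16000, 2)
    loss = loss_fn(estimates, references)  # a loss, so lower is better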

Attributes

DEFAULT_KEYS

Default mapping of keys in the data dictionary to the arguments of this loss.

Methods

forward(estimates, references)

Defines the computation performed at every call.

DEFAULT_KEYS = {'audio': 'estimates', 'source_audio': 'references'}
forward(estimates, references)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class nussl.ml.train.loss.WhitenedKMeansLoss[source]

Computes the whitened K-Means loss with weights. Equation (6) in [1].

References:

[1] Wang, Z. Q., Le Roux, J., & Hershey, J. R. (2018, April). Alternative Objective Functions for Deep Clustering. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Attributes

DEFAULT_KEYS

Default mapping of keys in the data dictionary to the arguments of this loss.

Methods

forward(embedding, assignments, weights)

Defines the computation performed at every call.

DEFAULT_KEYS = {'embedding': 'embedding', 'ideal_binary_mask': 'assignments', 'weights': 'weights'}
forward(embedding, assignments, weights)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Closures

Classes

Closure(loss_dictionary[, combination_approach])

Closures are used with ignite Engines to train a model given an optimizer and a set of loss functions.

TrainClosure(loss_dictionary, optimizer, …)

This closure takes an optimization step on a SeparationModel object given a loss.

ValidationClosure(loss_dictionary, model, …)

This closure validates the model on some data dictionary.

Exceptions

ClosureException

Exception class for errors when working with closures in nussl.

class nussl.ml.train.closures.Closure(loss_dictionary, combination_approach='combine_by_sum', *args, **kwargs)[source]

Closures are used with ignite Engines to train a model given an optimizer and a set of loss functions. Closures perform forward passes of models given the input data. The loss is computed via self.compute_loss. The forward pass is implemented via the object’s __call__ method.

This closure object provides a way to define the loss functions you want to use to train your model as a loss dictionary that is structured as follows:

loss_dictionary = {
    'LossClassName': {
        'weight': [how much to weight the loss in the sum, defaults to 1],
        'keys': [key mapping items in dictionary to arguments to loss],
        'args': [any positional arguments to the loss class],
        'kwargs': [keyword arguments to the loss class],
    }
}

Methods

combine_by_multiply(loss_output)

combine_by_multitask(loss_output)

Implements a multitask learning objective [1] where each loss is weighted by a learned parameter.

combine_by_sum(loss_output)

compute_loss(output, target)

The keys value will default to LossClassName.DEFAULT_KEYS, which can be found in nussl.ml.train.loss within each available class. Here’s an example of a Chimera loss combining deep clustering with permutation invariant L1 loss:

loss_dictionary = {
    'DeepClusteringLoss': {
        'weight': .2,
    },
    'PermutationInvariantLoss': {
        'weight': .8,
        'args': ['L1Loss']
    }
}

Or if you’re using permutation invariant loss but need to specify arguments to the loss function being wrapped by PIT, you can do this:

loss_dictionary = {
    'PITLoss': {
        'class': 'PermutationInvariantLoss',
        'keys': {'audio': 'estimates', 'source_audio': 'targets'},
        'args': [{
            'class': 'SISDRLoss',
            'kwargs': {'scaling': False}
        }]
    }
}

If you have your own loss function classes you wish to use, you can pass those into the loss dictionary and make them discoverable by the closure by using ml.register_loss.

Parameters
  • loss_dictionary (dict) – Dictionary of losses described above.

  • combination_approach (str) – How to combine losses, if there are multiple losses. The default is that the losses will be combined via a weighted sum (‘combine_by_sum’). Can also do ‘combine_by_multiply’. Defaults to ‘combine_by_sum’.

  • args – Positional arguments to combination_approach.

  • kwargs – Keyword arguments to combination_approach.

See also

ml.register_loss to register your loss functions with this closure.

combine_by_multiply(loss_output)[source]
combine_by_multitask(loss_output)[source]

Implements a multitask learning objective [1] where each loss is weighted by a learned parameter with the following function:

combined_loss = sum_i exp(-weight_i) * loss_i + weight_i

where i indexes each loss. The weights come from the loss dictionary and can point to nn.Parameter tensors that get learned jointly with the model.

References:

[1] Kendall, Alex, Yarin Gal, and Roberto Cipolla. “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
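
A sketch of using this combination approach; treating each weight as an nn.Parameter that you also hand to the optimizer is an assumption based on the description above:

    import torch
    import torch.nn as nn
    import nussl

    w_dc = nn.Parameter(torch.zeros(1))
    w_l1 = nn.Parameter(torch.zeros(1))
    loss_dictionary = {
        'DeepClusteringLoss': {'weight': w_dc},
        'L1Loss': {'weight': w_l1},
    }
    closure = nussl.ml.train.closures.Closure(
        loss_dictionary, combination_approach='combine_by_multitask')
    # pass w_dc and w_l1 to the optimizer along with model.parameters()
    # so they are learned jointly with the model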

combine_by_sum(loss_output)[source]
compute_loss(output, target)[source]
exception nussl.ml.train.closures.ClosureException[source]

Exception class for errors when working with closures in nussl.

class nussl.ml.train.closures.TrainClosure(loss_dictionary, optimizer, model, *args, **kwargs)[source]

This closure takes an optimization step on a SeparationModel object given a loss.

Parameters
  • loss_dictionary (dict) – Dictionary containing loss functions and specification.

  • optimizer (torch Optimizer) – Optimizer to use to train the model.

  • model (SeparationModel) – The model to be trained.

class nussl.ml.train.closures.ValidationClosure(loss_dictionary, model, *args, **kwargs)[source]

This closure validates the model on some data dictionary.

Parameters
  • loss_dictionary (dict) – Dictionary containing loss functions and specification.

  • model (SeparationModel) – The model to be validated.