Separation algorithms

Base classes

These classes are used to build every type of source separation algorithm currently in nussl. They provide helpful utilities so that the end user only has to implement one or two functions to create a new separation algorithm, depending on the kind of algorithm being implemented.

Base for all methods

class nussl.separation.SeparationBase(input_audio_signal)[source]

Base class for all separation algorithms in nussl.

Do not instantiate this class directly; it will not do anything on its own.

Parameters

input_audio_signal (AudioSignal) – This will always be a copy of the provided AudioSignal object.

Attributes

audio_signal

Copy of AudioSignal that is made on initialization.

sample_rate

Sample rate of audio_signal.

stft_params

STFTParams object containing the STFT parameters of the copied AudioSignal.

Methods

make_audio_signals()

Makes audio_signal.AudioSignal objects after the separation algorithm is run.

run()

Runs separation algorithm.

property audio_signal

Copy of AudioSignal that is made on initialization.

make_audio_signals()[source]

Makes audio_signal.AudioSignal objects after the separation algorithm is run.

Raises

NotImplementedError – Cannot call base class

run()[source]

Runs separation algorithm.

Raises

NotImplementedError – Cannot call base class

property sample_rate

Sample rate of audio_signal. Literally audio_signal.sample_rate.

Type

(int)

property stft_params

STFTParams object containing the STFT parameters of the copied AudioSignal.
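
To make the contract above concrete, here is a minimal, hypothetical sketch of a SeparationBase subclass. The class name and its trivial "separation" are purely illustrative; it only shows the two methods a subclass is expected to implement and the documented audio_signal property.

import nussl

class PassThrough(nussl.separation.SeparationBase):
    """Hypothetical algorithm that 'separates' a mixture into one source:
    a copy of the mixture itself. Illustrates the two required methods."""

    def run(self):
        # A real algorithm would analyze self.audio_signal here (e.g., via
        # its STFT) and store intermediate results on self.
        self.estimate = self.audio_signal
        return self.estimate

    def make_audio_signals(self):
        # Return a list of AudioSignal objects, one per separated source.
        return [self.estimate]

# Hypothetical usage ('mixture.wav' is a placeholder path). Calling the object
# runs the algorithm and builds the estimates, as in the examples later on
# this page.
mixture_signal = nussl.AudioSignal('mixture.wav')
estimates = PassThrough(mixture_signal)()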

Base for masking-based methods

class nussl.separation.MaskSeparationBase(input_audio_signal, mask_type='soft', mask_threshold=0.5)[source]

Base class for separation algorithms that create a mask (binary or soft) to do their separation. Most algorithms in nussl are derived from MaskSeparationBase.

Although this class will do nothing if you instantiate and run it by itself, algorithms that are derived from this class are expected to return a list of separation.masks.mask_base.MaskBase-derived objects (i.e., either a separation.masks.binary_mask.BinaryMask or separation.masks.soft_mask.SoftMask object) by their run() method. Being a subclass of MaskSeparationBase is an implicit contract assuring this. Returning a separation.masks.mask_base.MaskBase-derived object standardizes algorithm return types for evaluation.evaluation_base.EvaluationBase-derived objects.

Parameters
  • input_audio_signal – (audio_signal.AudioSignal) An audio_signal.AudioSignal object containing the mixture to be separated.

  • mask_type – (str, BinaryMask, or SoftMask) Indicates whether to make binary or soft masks. See mask_type property for details.

  • mask_threshold – (float) Value between [0.0, 1.0] to convert a soft mask to a binary mask. See mask_threshold property for details.

Methods

make_audio_signals()

Makes audio_signal.AudioSignal objects after the mask-based separation algorithm is run.

ones_mask(shape)

Creates a new ones mask with this object’s type.

run()

Runs mask-based separation algorithm.

zeros_mask(shape)

Creates a new zeros mask with this object’s type.

Attributes

mask_threshold

Threshold for determining True/False if mask_type is BINARY_MASK.

mask_type

This property indicates what type of mask the derived algorithm will create and return from run().

make_audio_signals()[source]

Makes audio_signal.AudioSignal objects after the mask-based separation algorithm is run. This looks in self.result_masks, which must be filled by run() in the algorithm that subclasses this. It applies each mask to the mixture audio signal and returns a list of the estimates, each of which is an AudioSignal object.

Returns

List of AudioSignal objects corresponding to the separated estimates.

Return type

list
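
As a concrete, hypothetical illustration of this contract, the sketch below fills self.result_masks inside run() using the documented ones_mask/zeros_mask helpers, so that the inherited make_audio_signals() can apply them. The class name, split_bin parameter, and the assumption that MaskBase objects expose their underlying array as .mask are all illustrative rather than taken from this page.

import nussl

class NaiveBandSplit(nussl.separation.MaskSeparationBase):
    """Hypothetical mask-based algorithm: everything below a fixed STFT bin
    goes to one source, everything above it to the other."""

    def __init__(self, input_audio_signal, split_bin=100, mask_type='soft'):
        super().__init__(input_audio_signal, mask_type=mask_type)
        self.split_bin = split_bin

    def run(self):
        stft = self.audio_signal.stft()          # mask shape follows the STFT
        low = self.zeros_mask(stft.shape)
        high = self.ones_mask(stft.shape)
        # Assumption: MaskBase objects expose their data as a `.mask` array.
        low.mask[:self.split_bin, ...] = 1
        high.mask[:self.split_bin, ...] = 0
        self.result_masks = [low, high]
        return self.result_masks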

property mask_threshold

Threshold for determining True/False if mask_type is BINARY_MASK. Some algorithms will first make a soft mask and then convert that to a binary mask using this threshold parameter. All values of the soft mask are between [0.0, 1.0] and as such mask_threshold is expected to be a float between [0.0, 1.0].

Returns

Value between [0.0, 1.0] that indicates the True/False cutoff when converting a soft mask to a binary mask.

Return type

mask_threshold (float)

Raises

ValueError – If not a float or if set outside [0.0, 1.0].

property mask_type

This property indicates what type of mask the derived algorithm will create and return from run(). Options are either ‘soft’ or ‘binary’. mask_type is usually set when initializing a MaskSeparationBase-derived class and defaults to ‘soft’.

This property, though stored as a string, can be set in two ways when initializing:

  • First, it is possible to set this property with a string. Only 'soft' and 'binary' are accepted (case insensitive); every other value will raise an error. When initializing with a string, two helper attributes are provided: BINARY_MASK and SOFT_MASK.

    It is HIGHLY encouraged to use these, as the API may change and code that assigns bare strings (e.g. mask_type = 'soft' or mask_type = 'binary') might not be future-proof. BINARY_MASK and SOFT_MASK are safe aliases in case the underlying values change.

  • The second way to set this property is with a class prototype: either the separation.masks.binary_mask.BinaryMask or separation.masks.soft_mask.SoftMask class. This is probably the most stable way to set it, and it’s fairly succinct. For example, mask_type = nussl.BinaryMask or mask_type = nussl.SoftMask are both perfectly valid.

Though uncommon, this can also be set outside of __init__().

Examples of both methods are shown below.

Returns

Either 'soft' or 'binary'.

Return type

mask_type (str)

Raises

ValueError – If set to an invalid value.

Example:

import nussl
mixture_signal = nussl.AudioSignal()

# Two options for determining mask upon init...

# Option 1: Init with a string (BINARY_MASK is a string 'constant')
repet_sim = nussl.RepetSim(mixture_signal, mask_type=nussl.MaskSeparationBase.BINARY_MASK)

# Option 2: Init with a class type
ola = nussl.OverlapAdd(mixture_signal, mask_type=nussl.SoftMask)

# It's also possible to change these values after init by changing the `mask_type` property...
repet_sim.mask_type = nussl.MaskSeparationBase.SOFT_MASK  # using a string
ola.mask_type = nussl.BinaryMask  # or using a class type

ones_mask(shape)[source]

Creates a new ones mask with this object’s type.

Parameters

shape (tuple) – tuple with shape of mask

Returns

A subclass of MaskBase containing 1s.

run()[source]

Runs mask-based separation algorithm. Base class: Do not call directly!

Raises

NotImplementedError – Cannot call base class!

zeros_mask(shape)[source]

Creates a new zeros mask with this object’s type.

Parameters

shape (tuple) – tuple with shape of mask

Returns

A subclass of MaskBase containing 0s.

Base for clustering-based methods

class nussl.separation.ClusteringSeparationBase(input_audio_signal, num_sources, clustering_type='KMeans', fit_clusterer=True, percentile=90, beta=5.0, mask_type='soft', mask_threshold=0.5, **kwargs)[source]

A base class for any clustering-based separation approach. Subclasses of this class must implement just one function to use it: extract_features. This function should use the internal variables of the class to extract the appropriate time-frequency features of the signal. These time-frequency features will then be clustered by cluster_features. Masks will then be produced by the run function and applied to the audio signal to produce separated estimates.

Parameters
  • input_audio_signal – (AudioSignal) An AudioSignal object containing the mixture to be separated.

  • num_sources (int) – Number of sources to cluster the features of and separate the mixture.

  • clustering_type (str) – One of ‘KMeans’, ‘GaussianMixture’, and ‘MiniBatchKMeans’. The clustering approach to use on the features. Defaults to ‘KMeans’.

  • fit_clusterer (bool, optional) – Whether or not to call fit on the clusterer. If False, then the clusterer should already be fit for this to work. Defaults to True.

  • percentile (int, optional) – Percentile of time-frequency points to consider by loudness. Audio spectrograms are very high dimensional, and louder points tend to matter more than quieter points. By setting the percentile high, one can more efficiently cluster an auditory scene by considering only points above that threshold. Defaults to 90 (which means the loudest 10 percent of time-frequency points will be used for clustering).

  • beta (float, optional) – When using KMeans, we use soft KMeans, which has an additional parameter beta. beta controls how soft the assignments are. As beta increases, the assignments become more binary (either 0 or 1). Defaults to 5.0, a value discovered through cross-validation.

  • mask_type (str, optional) – Masking approach to use. Passed up to MaskSeparationBase.

  • mask_threshold (float, optional) – Threshold for masking. Passed up to MaskSeparationBase.

  • **kwargs (dict, optional) – Additional keyword arguments that are passed to the clustering object (one of KMeans, GaussianMixture, or MiniBatchKMeans).

Raises

SeparationException – If clustering type is not one of the allowed ones, or if the output of extract_features has the wrong shape according to the STFT shape of the AudioSignal.

Methods

cluster_features(features, clusterer)

Clusters each time-frequency point according to its features.

confidence([approach])

In clustering-based separation algorithms, we can compute a confidence measure based on the clusterability of the feature space.

extract_features()

This function should be implemented by the subclass.

run([features])

Clusters the features using the chosen clustering algorithm.

cluster_features(features, clusterer)[source]

Clusters each time-frequency point according to its features. Features should be on the last axis.

Features should come in the shape:

(…, n_features)

Parameters
  • features (np.ndarray) – Features to cluster, for each time-frequency point.

  • clusterer (object) – Clustering object to use.

Returns

Responsibilities for each cluster for each time-frequency point.

Return type

np.ndarray

confidence(approach='silhouette_confidence', **kwargs)[source]

In clustering-based separation algorithms, we can compute a confidence measure based on the clusterability of the feature space. This can be computed only after the features have been extracted by extract_features.

Parameters
  • approach (str, optional) – What approach to use for getting the confidence measure. Options are ‘jensen_shannon_confidence’, ‘posterior_confidence’, ‘silhouette_confidence’, ‘loudness_confidence’, ‘whitened_kmeans_confidence’, ‘dpcl_classic_confidence’. Defaults to ‘silhouette_confidence’.

  • kwargs – Keyword arguments to the function being used to compute the confidence.

extract_features()[source]

This function should be implemented by the subclass. It should extract features. If the STFT shape is (n_freq, n_time, n_chan), the output of this function should be (n_freq, n_time, n_chan, n_features).
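
The following hypothetical sketch shows what an extract_features implementation might look like. The specific features (log-magnitude and phase of each time-frequency point) and the class name are chosen only to illustrate the required output shape, not because they are a good clustering space.

import numpy as np
import nussl

class MagPhaseClustering(nussl.separation.ClusteringSeparationBase):
    """Hypothetical clustering-based algorithm built on hand-crafted features."""

    def extract_features(self):
        stft = self.audio_signal.stft()        # (n_freq, n_time, n_chan), complex
        log_magnitude = np.log(np.abs(stft) + 1e-8)
        phase = np.angle(stft)
        # Output shape must be (n_freq, n_time, n_chan, n_features).
        return np.stack([log_magnitude, phase], axis=-1)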

run(features=None)[source]

Clusters the features using the chosen clustering algorithm.

Parameters

features (np.ndarray, optional) – If features are given, then the extract_features step will be skipped. Defaults to None (so extract_features will be run).

Raises

SeparationException – If features.shape doesn’t match what is expected in the STFT of the audio signal, an exception is raised.

Returns

List of Mask objects in self.result_masks.

Return type

list

Mix-in for NMF-based methods

class nussl.separation.NMFMixin[source]

Methods

fit(audio_signals, n_components[, …])

Fits an NMF model to the magnitude spectrograms of each audio signal.

inverse_transform(components, activations)

Reconstructs the magnitude spectrogram by matrix multiplying the components with the activations.

transform(audio_signal, model)

Use an already fit model to transform the magnitude spectrogram of an audio signal into components and activations.

static fit(audio_signals, n_components, beta_loss='frobenius', l1_ratio=0.5, **kwargs)[source]

Fits an NMF model to the magnitude spectrograms of each audio signal. If audio_signals is a list, the magnitude spectrograms of each signal are concatenated into a single data matrix to which NMF is fit. If audio_signals is a single audio signal, then NMF is fit only to the magnitude spectrogram for that audio signal. If any of the audio signals are multichannel, the channels are concatenated into a single (longer) data matrix.

Parameters
  • audio_signals (list or AudioSignal) – AudioSignal object(s) that NMF will be fit to.

  • n_components (int) – Number of components to use in the NMF module. Corresponds to number of spectral templates.

  • beta_loss (float or string) – String must be in {‘frobenius’, ‘kullback-leibler’, ‘itakura-saito’}. Beta divergence to be minimized, measuring the distance between X and the dot product WH. Note that values different from ‘frobenius’ (or 2) and ‘kullback-leibler’ (or 1) lead to significantly slower fits. Note that for beta_loss <= 0 (or ‘itakura-saito’), the input matrix X cannot contain zeros. Used only in ‘mu’ solver. Defaults to ‘frobenius’.

  • l1_ratio (float) – The regularization mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an elementwise L2 penalty (aka Frobenius Norm). For l1_ratio = 1 it is an elementwise L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2. Defaults to 0.5.

  • kwargs (dict) – Additional keyword arguments to initialization of the NMF decomposition method.

Returns

A tuple of three values:

  • model (NMF) – Fitted NMF model for the audio signal(s).

  • components (np.ndarray) – Spectral templates, shape (n_components, n_features).

  • activations (np.ndarray) – Activations, shape (n_components, n_time, n_channels). The shape is as if it were an STFT, but with components as the features rather than frequencies.

static inverse_transform(components, activations)[source]

Reconstructs the magnitude spectrogram by matrix multiplying the components with the activations. Components and activations are treated as 2D matrices; if they have more dimensions, the first dimension is interpreted as a batch dimension.

Parameters
  • components (np.ndarray) – Spectral templates (n_components, n_features)

  • activations (np.ndarray) – Activations, shape (n_components, n_time, n_channels). The shape is as if it were an STFT, but with components as the features rather than frequencies.

static transform(audio_signal, model)[source]

Use an already fit model to transform the magnitude spectrogram of an audio signal into components and activations. These can be multiplied to reconstruct the original matrix, or used to separate out sounds that correspond to components in the model.

Parameters
  • audio_signal (AudioSignal) – AudioSignal object to transform with model.

  • model (NMF) – NMF model to separate with. Must be fitted prior to this call.

Returns

A tuple of two values:

  • components (np.ndarray) – Spectral templates, shape (n_components, n_features).

  • activations (np.ndarray) – Activations, shape (n_components, n_time, n_channels). The shape is as if it were an STFT, but with components as the features rather than frequencies.
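
A hypothetical end-to-end use of these three static methods is sketched below, following the return descriptions above. 'mixture.wav' is a placeholder path and the component count is arbitrary.

import nussl

mix = nussl.AudioSignal('mixture.wav')   # placeholder path

# Fit a 16-component NMF model to the mixture's magnitude spectrogram.
model, components, activations = nussl.separation.NMFMixin.fit(mix, n_components=16)

# Transform a signal with the already-fit model...
components, activations = nussl.separation.NMFMixin.transform(mix, model)

# ...and reconstruct an approximate magnitude spectrogram from the factors.
approx_magnitudes = nussl.separation.NMFMixin.inverse_transform(components, activations)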

Mix-in for deep methods

class nussl.separation.DeepMixin[source]

Methods

load_model(model_path[, device])

Loads the model at specified path model_path.

load_model(model_path, device='cpu')[source]

Loads the model at the specified path model_path onto the specified device.

Parameters
  • model_path (str) – Path to a model saved as a SeparationModel.

  • device (str or torch.Device) – Loads the model on CPU or GPU. Defaults to ‘cpu’.

Returns

A tuple of two values:

  • model (SeparationModel) – Loaded model (an nn.Module).

  • metadata (dict) – Metadata associated with the model, used for preparing the input data for the model.
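
A hypothetical usage sketch: since the deep separation classes later on this page mix in DeepMixin, a model can be loaded after initialization, per the return description above. Both file paths are placeholders.

import nussl

mix = nussl.AudioSignal('mixture.wav')                     # placeholder path
separator = nussl.separation.deep.DeepMaskEstimation(mix)  # no model yet
model, metadata = separator.load_model('checkpoint.pth')   # placeholder path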

Benchmark methods

These methods are used for obtaining upper and lower baselines for source separation algorithms.

High pass filter

class nussl.separation.benchmark.HighLowPassFilter(input_audio_signal, high_pass_cutoff_hz, mask_type='binary')[source]

Implements a very simple separation algorithm that just masks everything below the specified cutoff frequency (in Hz). It does this by zeroing out the associated FFT bins via a mask to produce the “high” source; the residual is the “low” source.

Parameters
  • input_audio_signal (AudioSignal) – Signal to separate.

  • high_pass_cutoff_hz (float) – Cutoff in Hz. Will be rounded to the nearest frequency bin.

  • mask_type (str, optional) – Mask type. Defaults to ‘binary’.

Ideal binary mask

class nussl.separation.benchmark.IdealBinaryMask(input_audio_signal, sources, mask_type='binary', mask_threshold=0.5)[source]

Implements an ideal binary mask (IBM) that is computed using the known ground truth sources. This is one of the upper baselines.

Parameters
  • input_audio_signal (AudioSignal) – Signal to separate.

  • sources (list) – List of audio signal objects that correspond to the sources.

  • mask_type (str, optional) – Mask type. Defaults to ‘binary’.
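
A hypothetical usage sketch of this upper baseline, assuming ground-truth stems are available. The file paths are placeholders, and AudioSignal addition is used here to form the mixture.

import nussl

vocals = nussl.AudioSignal('vocals.wav')                 # placeholder paths
accompaniment = nussl.AudioSignal('accompaniment.wav')
mix = vocals + accompaniment                             # mix the ground-truth stems

ibm = nussl.separation.benchmark.IdealBinaryMask(mix, [vocals, accompaniment])
estimates = ibm()    # run() + make_audio_signals(), as with any nussl separator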

Ideal ratio mask

class nussl.separation.benchmark.IdealRatioMask(input_audio_signal, sources, approach='psa', mask_type='soft', mask_threshold=0.5, **kwargs)[source]

Implements an ideal ratio mask (IRM) that is computed using the known ground truth sources. This is one of the upper baselines.

Parameters
  • input_audio_signal (AudioSignal) – Signal to separate.

  • sources (list) – List of audio signal objects that correspond to the sources.

  • approach (str) – Either ‘psa’ (phase sensitive spectrum approximation) or ‘msa’ (magnitude spectrum approximation). Generally ‘psa’ does better.

  • mask_type (str, optional) – Mask type. Defaults to ‘soft’.

  • mask_threshold (float, optional) – Masking threshold. Defaults to 0.5.

  • kwargs (dict) – Extra keyword arguments are passed to the transform classes at initialization.

Wiener filter

class nussl.separation.benchmark.WienerFilter(input_audio_signal, estimates, iterations=1, mask_type='soft', mask_threshold=0.5, **kwargs)[source]

Implements a multichannel Wiener filter that is computed by using some source estimates. When using the estimates produced by IdealRatioMask or IdealBinaryMask, this is one of the upper baselines.

Parameters
  • input_audio_signal (AudioSignal) – Signal to separate.

  • estimates (list) – List of audio signal objects that correspond to the estimates.

  • iterations (int) – Number of iterations for expectation-maximization in Wiener filter.

  • mask_type (str, optional) – Mask type. Defaults to ‘soft’.

  • mask_threshold (float, optional) – Threshold for converting a soft mask to a binary mask. Defaults to 0.5.

  • kwargs (dict) – Additional keyword arguments to norbert.wiener.
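
A hypothetical sketch showing how the Wiener filter can refine estimates produced by another separator (FT2D is used here purely as an example); 'mixture.wav' is a placeholder path.

import nussl

mix = nussl.AudioSignal('mixture.wav')            # placeholder path

# Get rough estimates from some other algorithm first...
ft2d = nussl.separation.primitive.FT2D(mix)
rough_estimates = ft2d()

# ...then refine them with the multichannel Wiener filter.
wf = nussl.separation.benchmark.WienerFilter(mix, rough_estimates, iterations=2)
refined_estimates = wf()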

Mix as estimate

class nussl.separation.benchmark.MixAsEstimate(input_audio_signal, num_sources)[source]

This algorithm does nothing but scale the mix by the number of sources. This can be used to compute the improvement metrics (e.g. improvement in SDR over using the mixture as the estimate).

Parameters
  • input_audio_signal (AudioSignal) – Signal to separate.

  • num_sources (int) – How many sources to return.

Deep methods

Deep networks can be used for source separation via these classes.

Deep clustering

class nussl.separation.deep.DeepClustering(input_audio_signal, num_sources, model_path=None, device='cpu', **kwargs)[source]

Clusters the embedding produced by a deep model for every time-frequency point. This is the deep clustering source separation approach. It is flexible with the number of sources. It expects that the model outputs a dictionary where one of the keys is ‘embedding’. This uses the DeepMixin class to load the model and set the audio signal’s parameters to be appropriate for the model.

Parameters
  • input_audio_signal – (AudioSignal) An AudioSignal object containing the mixture to be separated.

  • num_sources (int) – Number of sources to cluster the features of and separate the mixture.

  • model_path (str, optional) – Path to the model that will be used. Can be None, so that you can initialize a class and load the model later. Defaults to None.

  • device (str, optional) – Device to put the model on. Defaults to ‘cpu’.

  • **kwargs (dict) – Keyword arguments for ClusteringSeparationBase and the clustering object used for clustering (one of KMeans, GaussianMixture, MiniBatchKMeans).

Raises

SeparationException – If ‘embedding’ isn’t in the output of the model.

Deep mask estimation

class nussl.separation.deep.DeepMaskEstimation(input_audio_signal, model_path=None, device='cpu', **kwargs)[source]

Separates an audio signal using the masks produced by a deep model for every time-frequency point. It expects that the model outputs a dictionary where one of the keys is ‘masks’. This uses the DeepMixin class to load the model and set the audio signal’s parameters to be appropriate for the model.

Parameters
  • input_audio_signal – (AudioSignal) An AudioSignal object containing the mixture to be separated.

  • model_path (str, optional) – Path to the model that will be used. Can be None, so that you can initialize a class and load the model later. Defaults to None.

  • device (str, optional) – Device to put the model on. Defaults to ‘cpu’.

  • **kwargs (dict) – Keyword arguments for MaskSeparationBase.
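
A hypothetical end-to-end sketch; 'mask_model.pth' and 'mixture.wav' are placeholder paths, not files shipped with nussl.

import nussl

mix = nussl.AudioSignal('mixture.wav')
separator = nussl.separation.deep.DeepMaskEstimation(
    mix, model_path='mask_model.pth', device='cpu')
estimates = separator()   # runs the model, applies the masks, returns AudioSignals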

Deep audio estimation

class nussl.separation.deep.DeepAudioEstimation(input_audio_signal, model_path=None, device='cpu', **kwargs)[source]

Separates an audio signal using a model that produces separated sources directly in the waveform domain. It expects that the model outputs a dictionary where one of the keys is ‘audio’. This uses the DeepMixin class to load the model and set the audio signal’s parameters to be appropriate for the model.

Parameters
  • input_audio_signal – (AudioSignal) An AudioSignal object containing the mixture to be separated.

  • model_path (str, optional) – Path to the model that will be used. Can be None, so that you can initialize a class and load the model later. Defaults to None.

  • device (str, optional) – Device to put the model on. Defaults to ‘cpu’.

  • **kwargs (dict) – Keyword arguments for MaskSeparationBase.

Composite methods

These are methods that use the outputs of multiple separation algorithms to build better, more robust separation estimates.

Ensemble clustering

class nussl.separation.composite.EnsembleClustering(input_audio_signal, num_sources, separators, weights=None, returns=None, num_cascades=1, extracted_feature='masks', clustering_type='KMeans', fit_clusterer=True, percentile=90, beta=5.0, mask_type='soft', mask_threshold=0.5, **kwargs)[source]

Run multiple separation algorithms on a single mixture and concatenate their masks to input into a clustering algorithm.

This algorithm allows you to combine the outputs of multiple separation algorithms, fusing them into a single output via clustering. It was first developed in [1]. When used with primitive separation algorithms, it becomes the PrimitiveClustering algorithm described in [1].

References:

[1] Seetharaman, Prem. Bootstrapping the Learning Process for Computer Audition.

Diss. Northwestern University, 2019.

Parameters
  • input_audio_signal (AudioSignal) – Signal to separate.

  • num_sources (int) – Number of sources to separate from signal.

  • separators (list) – List of instantiated separation algorithms that will be run on the input audio signal.

  • weights (list, optional) – Weight to give to each algorithm in the resultant feature vector. For example, [3, 1], will repeat the features from the first algorithm 3 times and the second algorithm 1 time. Defaults to None - every algorithm gets a weight of 1.

  • returns (list, optional) – Which outputs of each algorithm to keep in the resultant feature vector. Defaults to None.

  • num_cascades (int, optional) – The outputs of the algorithms can be cascaded into one another: the outputs of the first layer of algorithms are fed back into each separation algorithm to create more features. Defaults to 1.

  • extracted_feature (str, optional) – Which feature to extract from each algorithm. Must be one of [‘estimates’, ‘masks’]. estimates will reconstruct a soft mask using the output of the algorithm (useful if the algorithm is not a masking based separation algorithm). masks will use the data in the result_masks attribute of the separation algorithm. Defaults to ‘masks’.

  • clustering_type (str) – One of ‘KMeans’, ‘GaussianMixture’, and ‘MiniBatchKMeans’. The clustering approach to use on the features. Defaults to ‘KMeans’.

  • fit_clusterer (bool, optional) – Whether or not to call fit on the clusterer. If False, then the clusterer should already be fit for this to work. Defaults to True.

  • percentile (int, optional) – Percentile of time-frequency points to consider by loudness. Audio spectrograms are very high dimensional, and louder points tend to matter more than quieter points. By setting the percentile high, one can more efficiently cluster an auditory scene by considering only points above that threshold. Defaults to 90 (which means the loudest 10 percent of time-frequency points will be used for clustering).

  • beta (float, optional) – When using KMeans, we use soft KMeans, which has an additional parameter beta. beta controls how soft the assignments are. As beta increases, the assignments become more binary (either 0 or 1). Defaults to 5.0, a value discovered through cross-validation.

  • mask_type (str, optional) – Masking approach to use. Passed up to MaskSeparationBase.

  • mask_threshold (float, optional) – Threshold for masking. Passed up to MaskSeparationBase.

  • **kwargs (dict, optional) – Additional keyword arguments that are passed to the clustering object (one of KMeans, GaussianMixture, or MiniBatchKMeans).

Example

import nussl
from nussl.separation import (
    primitive,
    factorization,
    composite,
    SeparationException
)

# `mix` is the mixture to separate; 'mixture.wav' is a placeholder path.
mix = nussl.AudioSignal('mixture.wav')

separators = [
    primitive.FT2D(mix),
    factorization.RPCA(mix),
    primitive.Melodia(mix, voicing_tolerance=0.2),
    primitive.HPSS(mix),
]

weights = [3, 3, 1, 1]
returns = [[1], [1], [1], [0]]

ensemble = composite.EnsembleClustering(
    mix, 2, separators, weights=weights, returns=returns)
estimates = ensemble()

Factorization-based methods

These methods use a factorization-based algorithm, such as robust principal component analysis or independent component analysis, to separate the auditory scene.

Robust principal component analysis

class nussl.separation.factorization.RPCA(input_audio_signal, high_pass_cutoff=100, num_iterations=100, epsilon=1e-07, mask_type='soft', mask_threshold=0.5)[source]

Implements foreground/background separation using RPCA.

Huang, Po-Sen, et al. “Singing-voice separation from monaural recordings using robust principal component analysis.” Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012.

Parameters
  • input_audio_signal (AudioSignal) – The AudioSignal object that has the audio data that RPCA will be run on.

  • high_pass_cutoff (float, optional) – Value (in Hz) for the high pass cutoff filter. Defaults to 100.

  • num_iterations (int, optional) – how many iterations to run RPCA for. Defaults to 100.

  • epsilon (float, optional) – Stopping criterion for RPCA convergence. Defaults to 1e-7.

  • mask_type (str, optional) – Type of mask to use. Defaults to ‘soft’.

  • mask_threshold (float, optional) – Threshold for mask. Defaults to 0.5.

Independent component analysis

class nussl.separation.factorization.ICA(audio_signals, max_iterations=200, **kwargs)[source]

Separate sources using Independent Component Analysis, given observations of the audio scene. nussl’s ICA is a wrapper for scikit-learn’s implementation of FastICA, and provides a way to interoperate between nussl’s AudioSignal objects and FastICA.

References

scikit-learn FastICA

Parameters
  • audio_signals – list of AudioSignal objects containing the observations of the mixture. Will be converted into a single multichannel AudioSignal.

  • max_iterations (int) – Max number of iterations to run ICA for. Defaults to 200.

  • **kwargs – Additional keyword arguments that will be passed to sklearn.decomposition.FastICA
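
A hypothetical usage sketch, where each observation is one microphone's recording of the same scene; the file paths are placeholders.

import nussl

observations = [
    nussl.AudioSignal('mic_1.wav'),   # placeholder paths
    nussl.AudioSignal('mic_2.wav'),
]
ica = nussl.separation.factorization.ICA(observations)
estimates = ica()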

Primitive methods

These methods are based on primitives: hard-wired perceptual grouping cues that are used automatically by the brain. The term was coined by Albert Bregman in his book Auditory Scene Analysis.

Cluster sources by timbre

class nussl.separation.primitive.TimbreClustering(input_audio_signal, num_sources, n_components, n_mfcc=13, nmf_kwargs=None, **kwargs)[source]

Implements separation by timbre via NMF with MFCC clustering. The steps are:

  1. Factorize the magnitude spectrogram of the mixture with NMF.

  2. Take MFCC coefficients of each component.

  3. Express each time-frequency bin as a combination of components.

  4. The features for each time-frequency bin are the weighted combination of the MFCCs of each component.

  5. Cluster each time-frequency bin based on these features.

Parameters
  • input_audio_signal (AudioSignal) – Signal to separate.

  • num_sources (int) – Number of sources to separate the mixture into.

  • n_components (int) – Number of components to use in the NMF model. Corresponds to number of spectral templates.

  • n_mfcc (int) – Number of MFCC coefficients to use. Defaults to 13.

  • nmf_kwargs (dict) – Dictionary containing keyword arguments for NMFMixin.fit.

  • kwargs (dict) – Extra keyword arguments are passed to ClusteringSeparationBase.

Foreground/background via 2DFT

class nussl.separation.primitive.FT2D(input_audio_signal, neighborhood_size=(1, 25), high_pass_cutoff=100.0, quadrants_to_keep=(0, 1, 2, 3), filter_approach='local_std', use_bg_2dft=True, mask_type='soft', mask_threshold=0.5)[source]

This separation method is based on using 2DFT image processing for source separation [1].

The algorithm has five main steps:

  1. Take the 2DFT of Magnitude STFT.

  2. Identify peaks in the 2DFT (these correspond to repeating patterns).

  3. Mask everything but the peaks and invert the masked 2DFT back to a magnitude STFT.

  4. Take the residual peakless 2DFT and invert that to a magnitude STFT.

  5. Compare the two magnitude STFTs to construct masks for the foreground and the background.

The algorithm runs on each channel independently. The masks can be constructed in two ways: either using the background (peaky) 2DFT as the reference for making masks, or using the foreground (peakless) 2DFT as the reference for making masks. This behavior can be toggled via use_bg_2dft=[True, False]. Using the background biases the algorithm towards separating repeating patterns whereas using the foreground biases it towards preserving micromodulation in the foreground.

There are two ways to identify peaks: the original method laid out in [1], which identifies peaks in a binary way (above some threshold), or the local_std method, developed and investigated in Chapter 2 of [2]. The latter identifies peaks in a soft way, can thus be used to construct soft masks, and generally performs better.

The main hyperparameter to consider is the neighborhood_size, which determines how big of a filter to apply when looking for peaks. It is two dimensional, but generally keeping 1D horizontal filters is the way to go.

References

[1] Seetharaman, Prem, Fatemeh Pishdadian, and Bryan Pardo.

“Music/Voice Separation Using the 2D Fourier Transform.” 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017.

[2] Seetharaman, Prem. Bootstrapping the Learning Process for Computer Audition.

Diss. Northwestern University, 2019.

Parameters
  • input_audio_signal (AudioSignal) – Signal to separate.

  • neighborhood_size (tuple, optional) – 2-tuple of ints telling the filter size to look for peaks in frequency and time: (f, t). Defaults to (1, 25).

  • high_pass_cutoff (float, optional) – Time-frequency bins below this cutoff will be assigned to the background. Defaults to 100.0.

  • quadrants_to_keep (tuple, optional) – Each quadrant of the 2DFT can be filtered out when separating. Can be used for separating out upward spectro-temporal patterns (via (0, 2)) from downward ones (via (1, 3)). Defaults to (0, 1, 2, 3).

  • filter_approach (str, optional) – One of ‘original’ or ‘local_std’. Which filtering approach to apply to identify peaks. Defaults to ‘local_std’.

  • use_bg_2dft (bool, optional) – Whether to use the background or foreground 2DFT as the reference for constructing masks. Defaults to True.

  • mask_type (str, optional) – Mask type. Defaults to ‘soft’.

  • mask_threshold (float, optional) – Masking threshold. Defaults to 0.5.

Harmonic/percussive separation

class nussl.separation.primitive.HPSS(input_audio_signal, kernel_size=31, mask_type='soft', mask_threshold=0.5)[source]

Implements harmonic/percussive source separation based on [1]. This is a wrapper around the librosa implementation.

References:

[1] Fitzgerald, Derry. “Harmonic/percussive separation using median filtering.”

13th International Conference on Digital Audio Effects (DAFX10), Graz, Austria, 2010.

[2] Driedger, Müller, Disch. “Extending harmonic-percussive separation of audio.”

15th International Society for Music Information Retrieval Conference (ISMIR 2014) Taipei, Taiwan, 2014.

Parameters
  • input_audio_signal (AudioSignal) – signal to separate.

  • kernel_size (int or tuple (kernel_harmonic, kernel_percussive)) – kernel size(s) for the median filters.

  • mask_type (str, optional) – Mask type. Defaults to ‘soft’.

  • mask_threshold (float, optional) – Masking threshold. Defaults to 0.5.

Foreground/background via REPET

class nussl.separation.primitive.Repet(input_audio_signal, min_period=None, max_period=None, period=None, high_pass_cutoff=100.0, mask_type='soft', mask_threshold=0.5)[source]

Implements the original REpeating Pattern Extraction Technique algorithm using the beat spectrum.

REPET is a simple method for separating a repeating background from a non-repeating foreground in an audio mixture. It assumes a single repeating period over the whole signal duration, and finds that period based on finding a peak in the beat spectrum. The period can also be provided exactly, or you can give Repet a guess of the min and max period. Once it has a period, it “overlays” spectrogram sections of length period to create a median model (the background).

References:

[1] Rafii, Zafar, and Bryan Pardo.

“Repeating pattern extraction technique (REPET): A simple method for music/voice separation.” IEEE transactions on audio, speech, and language processing 21.1 (2012): 73-84.

Parameters
  • input_audio_signal (AudioSignal) – Signal to separate.

  • min_period (float, optional) – minimum time to look for repeating period in terms of seconds.

  • max_period (float, optional) – maximum time to look for repeating period in terms of seconds.

  • period (float, optional) – exact time that the repeating period is (in seconds).

  • high_pass_cutoff (float, optional) – value (in Hz) for the high pass cutoff filter.

  • mask_type (str, optional) – Mask type. Defaults to ‘soft’.

  • mask_threshold (float, optional) – Masking threshold. Defaults to 0.5.
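
A hypothetical usage sketch; 'mixture.wav' is a placeholder path, and the order of the returned estimates is an assumption.

import nussl

mix = nussl.AudioSignal('mixture.wav')
repet = nussl.separation.primitive.Repet(mix)
estimates = repet()   # two AudioSignals: background and foreground (order assumed)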

Foreground/background via REPET-SIM

class nussl.separation.primitive.RepetSim(input_audio_signal, similarity_threshold=0, min_distance_between_frames=1, max_repeating_frames=100, high_pass_cutoff=100, mask_type='soft', mask_threshold=0.5)[source]

Implements the REpeating Pattern Extraction Technique algorithm using the Similarity Matrix (REPET-SIM).

REPET is a simple method for separating the repeating background from the non-repeating foreground in a piece of audio mixture. REPET-SIM is a generalization of REPET, which looks for similarities instead of periodicities.

References:

[1] Zafar Rafii and Bryan Pardo.

“Music/Voice Separation using the Similarity Matrix,” 13th International Society on Music Information Retrieval, Porto, Portugal, October 8-12, 2012.

Parameters
  • input_audio_signal (AudioSignal) – Audio signal to be separated.

  • similarity_threshold (int, optional) – Threshold for considering two frames to be similar. Defaults to 0.

  • min_distance_between_frames (float, optional) – Number of seconds two frames must be apart to be considered neighbors. Defaults to 1.

  • max_repeating_frames (int, optional) – Max number of frames to consider as neighbors. Defaults to 100.

  • high_pass_cutoff (float, optional) – Cutoff for high pass filters. Bins below this cutoff will be given to the background. Defaults to 100.

  • mask_type (str, optional) – Mask type to use. Defaults to ‘soft’.

  • mask_threshold (float, optional) – Threshold for converting the mask to binary. Defaults to 0.5.

Vocal melody extraction via Melodia

class nussl.separation.primitive.Melodia(input_audio_signal, high_pass_cutoff=100, minimum_frequency=55.0, maximum_frequency=1760.0, voicing_tolerance=0.2, minimum_peak_salience=0.0, compression=0.5, num_overtones=40, apply_vowel_filter=False, smooth_length=5, add_lower_octave=False, mask_type='soft', mask_threshold=0.5)[source]

Implements melody extraction using Melodia [1].

This requires Melodia to be installed as a vamp plugin, as well as vampy for Python. Install Melodia via: https://www.upf.edu/web/mtg/melodia. Note that Melodia is licensed for NON-COMMERCIAL use only.

References:

[1] J. Salamon and E. Gómez, “Melody Extraction from Polyphonic Music Signals using

Pitch Contour Characteristics”, IEEE Transactions on Audio, Speech and Language Processing, 20(6):1759-1770, Aug. 2012.

Parameters
  • input_audio_signal (AudioSignal object) – The AudioSignal object that has the audio data that Melodia will be run on.

  • high_pass_cutoff (optional, float) – value (in Hz) for the high pass cutoff filter.

  • minimum_frequency (optional, float) – minimum frequency in Hertz (default 55.0)

  • maximum_frequency (optional, float) – maximum frequency in Hertz (default 1760.0)

  • voicing_tolerance (optional, float) – Greater values will result in more pitch contours included in the final melody. Smaller values will result in fewer pitch contours included in the final melody (default 0.2).

  • minimum_peak_salience (optional, float) – a hack to avoid silence turning into junk contours when analyzing monophonic recordings (e.g. solo voice with no accompaniment). Generally you want to leave this untouched (default 0.0).

  • num_overtones (optional, int) – Number of overtones to use when creating melody mask.

  • apply_vowel_filter (optional, bool) – Whether or not to apply a vowel filter on the resynthesized melody signal when masking.

  • smooth_length (optional, int) – Number of frames to smooth discontinuities in the mask.

  • add_lower_octave (optional, bool) – Use the octave below the fundamental frequency as well, to take care of octave errors in pitch tracking, since we only care about the mask. Defaults to False.

  • mask_type (optional, str) – Type of mask to use.

  • mask_threshold (optional, float) – Threshold for mask to convert to binary.

Spatial methods

These methods are based on primitive spatial cues.

Cluster by inter-phase and inter-level difference

class nussl.separation.spatial.SpatialClustering(input_audio_signal, num_sources, clustering_type='KMeans', fit_clusterer=True, percentile=90, beta=5.0, mask_type='soft', mask_threshold=0.5, **kwargs)[source]

Implements clustering on IPD/ILD features between the first two channels.

IPD/ILD features are inter-phase difference and inter-level difference features. Sounds coming from different directions will naturally cluster in IPD/ILD space.

This subclasses ClusteringSeparationBase, which handles all of the clustering functionality behind this class.
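
A hypothetical usage sketch on a stereo recording; 'stereo_mix.wav' is a placeholder path.

import nussl

mix = nussl.AudioSignal('stereo_mix.wav')   # must have at least two channels
spatial = nussl.separation.spatial.SpatialClustering(mix, num_sources=2)
estimates = spatial()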

PROJET: Separate via spatial projections

class nussl.separation.spatial.Projet(input_audio_signal, num_sources, estimates=None, num_iterations=50, maximum_delay_in_samples=20, location_set_panning=30, location_set_delay=17, projection_set_panning=10, projection_set_delay=9, beta=1, alpha=1, device='cpu')[source]

Implements the PROJET algorithm for spatial audio separation using projections. This implementation uses PyTorch to speed up computation considerably. PROJET does the following steps:

  1. Project the complex stereo STFT onto multiple angles and delay via projection and delay matrix transformations.

  2. Initialize the parameters of the system to “remix” these projections along with PSDs of the sources such that they try to reconstruct the original stereo mixture.

  3. Find the optimal parameters via multiplicative update rules for P and for Q.

  4. Use the discovered parameters to isolate the sources via spatial cues.

This implementation considers BOTH panning and delays when isolating sources. PROJET is not a masking-based method; it estimates the sources directly by projecting the complex STFT.

Parameters
  • input_audio_signal (AudioSignal) – Audio signal to separate.

  • num_sources (int) – Number of source to separate.

  • estimates (list of AudioSignal) – Initial estimates for the separated sources, if available. These will be used to initialize the update algorithm. So one could, for example, run FT2D on a signal and then refine the estimates using PROJET. Defaults to None (randomly initialize P).

  • num_iterations (int, optional) – Number of iterations to do for the update rules for P and Q. Defaults to 50.

  • maximum_delay_in_samples (int, optional) – Maximum delay in samples that you are willing to consider in the projection matrices. Defaults to 20.

  • location_set_panning (int, optional) – How many locations in panning you are willing to consider. Defaults to 30.

  • location_set_delay (int, optional) – How many delays you are willing to consider. Defaults to 17.

  • projection_set_panning (int, optional) – How many projections you are willing to use in panning space. Defaults to 10.

  • projection_set_delay (int, optional) – How many delays you are willing to project the mixture onto in delay space. Defaults to 9.

  • beta (int, optional) – Beta in beta divergence. See Table 1 in [1]. Defaults to 1.

  • alpha (int, optional) – Power to raise each power spectral density estimate of each source to. Defaults to 1.

  • device (str, optional) – Device to use when performing update rules. ‘cuda’ will be fastest, if available. Defaults to ‘cpu’.

References:

[1] Fitzgerald, Derry, Antoine Liutkus, and Roland Badeau.

“Projection-based demixing of spatial audio.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 24.9 (2016): 1560-1572.

[2] Fitzgerald, Derry, Antoine Liutkus, and Roland Badeau.

“Projet—spatial audio separation using projections.” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016.

DUET

class nussl.separation.spatial.Duet(input_audio_signal, num_sources, attenuation_min=-3, attenuation_max=3, num_attenuation_bins=50, delay_min=-3, delay_max=3, num_delay_bins=50, peak_threshold=0.0, attenuation_min_distance=5, delay_min_distance=5, p=1, q=0, mask_type='binary')[source]

The DUET algorithm was originally proposed by S. Rickard and F. Dietrich for DOA estimation, and further developed for BSS and demixing by A. Jourjine, S. Rickard, and O. Yilmaz.

DUET extracts sources using the symmetric attenuation and relative delay between two channels. The symmetric attenuation is calculated from the ratio of the two channels’ STFT amplitudes, and the delay is the arrival delay between the two sensors used to record the audio signal. These two values are clustered as peaks on a histogram to determine where each source occurs. This implementation of DUET creates and returns Mask objects after the run() function, which can then be applied to the original audio signal to extract each individual source.

References:

[1] Rickard, Scott. “The DUET blind source separation algorithm.”

Blind Speech Separation. Springer Netherlands, 2007. 217-241.

[2] Yilmaz, Ozgur, and Scott Rickard. “Blind separation of speech mixtures

via time-frequency masking.” Signal Processing, IEEE transactions on 52.7 (2004): 1830-1847.

Parameters
  • input_audio_signal (AudioSignal) – A two-channel AudioSignal object containing the stereo mixture to separate.

  • num_sources (int) – Number of sources to find.

  • attenuation_min (int) – Minimum distance in utils.find_peak_indices, change if not enough peaks are identified.

  • attenuation_max (int) – Used for creating a histogram without outliers.

  • num_attenuation_bins (int) – Number of bins for attenuation.

  • delay_min (int) – Lower bound on delay, used as minimum distance in utils.find_peak_indices.

  • delay_max (int) – Upper bound on delay, used for creating a histogram without outliers.

  • num_delay_bins (int) – Number of bins for delay.

  • peak_threshold (float) – Value in [0, 1] for peak picking.

  • attenuation_min_distance (int) – Minimum distance between peaks wrt attenuation.

  • delay_min_distance (int) – Minimum distance between peaks wrt delay.

  • p (int) – Weight the histogram with the symmetric attenuation estimator.

  • q (int) – Weight the histogram with the delay estimator.

Notes

On page 8 of his paper, Rickard recommends p=1 and q=0 as a default starting point and p=.5, q=0 if one source is more dominant.
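
A hypothetical usage sketch on a two-channel mixture; 'stereo_mix.wav' is a placeholder path.

import nussl

mix = nussl.AudioSignal('stereo_mix.wav')
duet = nussl.separation.spatial.Duet(mix, num_sources=2)
estimates = duet()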

Attributes

stft_ch0

A Numpy matrix containing the stft data of channel 0.

Type

np.array

stft_ch1

A Numpy matrix containing the stft data of channel 1.

Type

np.array

frequency_matrix

A Numpy matrix containing the frequencies of analysis.

Type

np.array

symmetric_atn

A Numpy matrix containing the symmetric attenuation between the two channels.

Type

np.array

delay

A Numpy matrix containing the delay between the two channels.

Type

np.array

num_time_bins

The number of time bins for the frequency matrix and mask arrays.

Type

int

num_frequency_bins

The number of frequency bins for the mask arrays.

Type

int

attenuation_bins

A Numpy array containing the attenuation bins for the histogram.

Type

np.array

delay_bins

A Numpy array containing the delay bins for the histogram.

Type

np.array

normalized_attenuation_delay_histogram

A normalized Numpy matrix containing the attenuation delay histogram, which has peaks for each source.

Type

np.array

attenuation_delay_histogram

A non-normalized Numpy matrix containing the attenuation delay histogram, which has peaks for each source.

Type

np.array

peak_indices

A Numpy array containing the indices of the peaks for the histogram.

Type

np.array

separated_sources

A Numpy array of arrays containing each separated source.

Type

np.array