Separation algorithms¶
Base classes¶
These classes are used to build every type of source separation algorithm currently in nussl. They provide helpful utilities and make it such that the end-user only has to implement one or two functions to create a new separation algorithm, depending on what sort of algorithm they are trying to implement.
Base for all methods¶
class nussl.separation.SeparationBase(input_audio_signal)
Base class for all separation algorithms in nussl.
Do not call this class directly. It will not do anything.
- Parameters
input_audio_signal (AudioSignal) – This will always be a copy of the provided AudioSignal object.
Attributes
audio_signal – Copy of the AudioSignal that is made on initialization.
sample_rate – Sample rate of audio_signal.
stft_params – STFTParams object containing the STFT parameters of the copied AudioSignal.
Methods
make_audio_signals() – Makes audio_signal.AudioSignal objects after the separation algorithm is run.
run() – Runs the separation algorithm.
property audio_signal
Copy of the AudioSignal that is made on initialization.
make_audio_signals()
Makes audio_signal.AudioSignal objects after the separation algorithm is run.
- Raises
NotImplementedError – Cannot call the base class.
property sample_rate
Sample rate of audio_signal. Literally audio_signal.sample_rate.
- Type
(int)
property stft_params
STFTParams object containing the STFT parameters of the copied AudioSignal.
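To make the contract above concrete, here is a minimal, hedged sketch of a SeparationBase subclass. The class name, the stored intermediate attribute, and the make_copy_with_audio_data call are illustrative assumptions, not part of the documented API above; the only requirement stated by the base class is that run() and make_audio_signals() are implemented.

import nussl

class PassthroughSeparation(nussl.separation.SeparationBase):
    """Hypothetical subclass: returns the mixture itself as the only estimate."""

    def run(self):
        # run() does the actual computation; here we just keep the mixture's samples
        self._estimate_data = self.audio_signal.audio_data
        return self._estimate_data

    def make_audio_signals(self):
        # Build AudioSignal estimates from whatever run() computed.
        # make_copy_with_audio_data is assumed to copy the mixture's parameters
        # (sample rate, STFT settings) onto a new signal holding the given samples.
        estimate = self.audio_signal.make_copy_with_audio_data(self._estimate_data)
        return [estimate]

mixture = nussl.AudioSignal('mixture.wav')    # hypothetical input file
estimates = PassthroughSeparation(mixture)()  # assumed to call run(), then make_audio_signals()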
Base for masking-based methods¶
class nussl.separation.MaskSeparationBase(input_audio_signal, mask_type='soft', mask_threshold=0.5)
Base class for separation algorithms that create a mask (binary or soft) to do their separation. Most algorithms in nussl are derived from MaskSeparationBase.
Although this class will do nothing if you instantiate and run it by itself, algorithms that are derived from this class are expected to return a list of separation.masks.mask_base.MaskBase-derived objects (i.e., either a separation.masks.binary_mask.BinaryMask or a separation.masks.soft_mask.SoftMask object) from their run() method. Being a subclass of MaskSeparationBase is an implicit contract assuring this. Returning a separation.masks.mask_base.MaskBase-derived object standardizes algorithm return types for evaluation.evaluation_base.EvaluationBase-derived objects.
- Parameters
input_audio_signal – (audio_signal.AudioSignal) An audio_signal.AudioSignal object containing the mixture to be separated.
mask_type – (str, BinaryMask, or SoftMask) Indicates whether to make binary or soft masks. See the mask_type property for details.
mask_threshold – (float) Value between [0.0, 1.0] to convert a soft mask to a binary mask. See the mask_threshold property for details.
Methods
make_audio_signals() – Makes audio_signal.AudioSignal objects after the mask-based separation algorithm is run.
ones_mask(shape) – Creates a new ones mask with this object’s type.
run() – Runs the mask-based separation algorithm.
zeros_mask(shape) – Creates a new zeros mask with this object’s type.
Attributes
mask_threshold – Threshold of determining True/False if mask_type is BINARY_MASK.
mask_type – This property indicates what type of mask the derived algorithm will create and be returned by run().
.-
make_audio_signals
()[source]¶ Makes
audio_signal.AudioSignal
objects after mask-based separation algorithm is run. This looks inself.result_masks
which must be filled byrun
in the algorithm that subclasses this. It applies each mask to the mixture audio signal and returns a list of the estimates, which are each AudioSignal objects.- Returns
- List of AudioSignal objects corresponding to the
separated estimates.
- Return type
list
property mask_threshold
Threshold for determining True/False if mask_type is BINARY_MASK. Some algorithms will first make a soft mask and then convert it to a binary mask using this threshold parameter. All values of the soft mask are between [0.0, 1.0], and as such mask_threshold is expected to be a float between [0.0, 1.0].
- Returns
Value between [0.0, 1.0] that indicates the True/False cutoff when converting a soft mask to a binary mask.
- Return type
mask_threshold (float)
- Raises
ValueError – If not a float or if set outside [0.0, 1.0].
property mask_type
This property indicates what type of mask the derived algorithm will create and return from run(). Options are either 'soft' or 'binary'. mask_type is usually set when initializing a MaskSeparationBase-derived class and defaults to 'soft'.
This property, though stored as a string, can be set in two ways when initializing:
First, it is possible to set this property with a string. Only 'soft' and 'binary' are accepted (case insensitive); every other value will raise an error. When initializing with a string, two helper attributes are provided: BINARY_MASK and SOFT_MASK. It is HIGHLY encouraged to use these, as the API may change and code that uses bare strings (e.g. mask_type = 'soft' or mask_type = 'binary') for assignment might not be future-proof. BINARY_MASK and SOFT_MASK are safe aliases in case the underlying types change.
The second way to set this property is with a class prototype of either the separation.masks.binary_mask.BinaryMask or separation.masks.soft_mask.SoftMask class. This is probably the most stable way to set this, and it’s fairly succinct. For example, mask_type = nussl.BinaryMask or mask_type = nussl.SoftMask are both perfectly valid.
Though uncommon, this can also be set outside of __init__(). Examples of both methods are shown below.
- Returns
Either 'soft' or 'binary'.
- Return type
mask_type (str)
- Raises
ValueError – If set invalidly.
Example:
import nussl

mixture_signal = nussl.AudioSignal()

# Two options for determining mask type upon init...

# Option 1: Init with a string (BINARY_MASK is a string 'constant')
repet_sim = nussl.RepetSim(mixture_signal, mask_type=nussl.MaskSeparationBase.BINARY_MASK)

# Option 2: Init with a class type
ola = nussl.OverlapAdd(mixture_signal, mask_type=nussl.SoftMask)

# It's also possible to change these values after init by changing the `mask_type` property...
repet_sim.mask_type = nussl.MaskSeparationBase.SOFT_MASK  # using a string
ola.mask_type = nussl.BinaryMask                          # or using a class type
ones_mask(shape)
Creates a new ones mask with this object’s type.
- Parameters
shape (tuple) – Tuple with the shape of the mask.
- Returns
A subclass of MaskBase containing 1s.
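As a complement to the example above, here is a hedged sketch of a trivial MaskSeparationBase subclass. The class name is hypothetical; it only demonstrates how run() is expected to fill self.result_masks so that the inherited make_audio_signals() can apply them to the mixture.

import nussl

class TrivialMaskSeparation(nussl.separation.MaskSeparationBase):
    """Hypothetical subclass: produces one all-pass mask and one all-reject mask."""

    def run(self):
        # A mask has the same shape as the mixture STFT: (n_freq, n_time, n_chan)
        stft = self.audio_signal.stft()
        self.result_masks = [
            self.ones_mask(stft.shape),   # keeps everything
            self.zeros_mask(stft.shape),  # silences everything
        ]
        return self.result_masks

    # make_audio_signals() is inherited: it applies each mask in
    # self.result_masks to the mixture and returns AudioSignal estimates.

mixture = nussl.AudioSignal('mixture.wav')   # hypothetical input file
estimates = TrivialMaskSeparation(mixture, mask_type='binary')()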
Base for clustering-based methods¶
class nussl.separation.ClusteringSeparationBase(input_audio_signal, num_sources, clustering_type='KMeans', fit_clusterer=True, percentile=90, beta=5.0, mask_type='soft', mask_threshold=0.5, **kwargs)
A base class for any clustering-based separation approach. Subclasses of this class must implement just one function to use it: extract_features. This function should use the internal variables of the class to extract the appropriate time-frequency features of the signal. These time-frequency features will then be clustered by cluster_features. Masks will then be produced by the run function and applied to the audio signal to produce separated estimates.
- Parameters
input_audio_signal – (AudioSignal) An AudioSignal object containing the mixture to be separated.
num_sources (int) – Number of sources to cluster the features of and separate the mixture.
clustering_type (str) – One of ‘KMeans’, ‘GaussianMixture’, and ‘MiniBatchKMeans’. The clustering approach to use on the features. Defaults to ‘KMeans’.
fit_clusterer (bool, optional) – Whether or not to call fit on the clusterer. If False, then the clusterer should already be fit for this to work. Defaults to True.
percentile (int, optional) – Percentile of time-frequency points to consider by loudness. Audio spectrograms are very high dimensional, and louder points tend to matter more than quieter points. By setting the percentile high, one can more efficiently cluster an auditory scene by considering only points above that threshold. Defaults to 90 (which means the top 10 percentile of time-frequency points will be used for clustering).
beta (float, optional) – When using KMeans, we use soft KMeans, which has an additional parameter beta. beta controls how soft the assignments are. As beta increases, the assignments become more binary (either 0 or 1). Defaults to 5.0, a value discovered through cross-validation.
mask_type (str, optional) – Masking approach to use. Passed up to MaskSeparationBase.
mask_threshold (float, optional) – Threshold for masking. Passed up to MaskSeparationBase.
**kwargs (dict, optional) – Additional keyword arguments that are passed to the clustering object (one of KMeans, GaussianMixture, or MiniBatchKMeans).
- Raises
SeparationException – If clustering type is not one of the allowed ones, or if the output of extract_features has the wrong shape according to the STFT shape of the AudioSignal.
Methods
cluster_features(features, clusterer) – Clusters each time-frequency point according to the features for each time-frequency point.
confidence([approach]) – Computes a confidence measure based on the clusterability of the feature space.
extract_features() – This function should be implemented by the subclass.
run([features]) – Clusters the features using the chosen clustering algorithm.
cluster_features(features, clusterer)
Clusters each time-frequency point according to the features for each time-frequency point. Features should be on the last axis, so they should come in with the shape (…, n_features).
- Parameters
features (np.ndarray) – Features to cluster, for each time-frequency point.
clusterer (object) – Clustering object to use.
- Returns
Responsibilities for each cluster for each time-frequency point.
- Return type
np.ndarray
confidence(approach='silhouette_confidence', **kwargs)
In clustering-based separation algorithms, we can compute a confidence measure based on the clusterability of the feature space. This can be computed only after the features have been extracted by extract_features.
- Parameters
approach (str, optional) – What approach to use for getting the confidence measure. Options are ‘jensen_shannon_confidence’, ‘posterior_confidence’, ‘silhouette_confidence’, ‘loudness_confidence’, ‘whitened_kmeans_confidence’, ‘dpcl_classic_confidence’. Defaults to ‘silhouette_confidence’.
kwargs – Keyword arguments to the function being used to compute the confidence.
extract_features()
This function should be implemented by the subclass. It should extract features. If the STFT shape is (n_freq, n_time, n_chan), the output of this function should be (n_freq, n_time, n_chan, n_features).
run(features=None)
Clusters the features using the chosen clustering algorithm.
- Parameters
features (np.ndarray, optional) – If features are given, then the extract_features step will be skipped. Defaults to None (so extract_features will be run.)
- Raises
SeparationException – If features.shape doesn’t match what is expected in the STFT of the audio signal, an exception is raised.
- Returns
List of Mask objects in self.result_masks.
- Return type
list
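Here is a hedged sketch of a minimal ClusteringSeparationBase subclass. The class name and the choice of feature (log-magnitude of the mixture STFT) are illustrative assumptions; the only documented requirement is that extract_features returns an array shaped (n_freq, n_time, n_chan, n_features).

import numpy as np
import nussl

class MagnitudeClustering(nussl.separation.ClusteringSeparationBase):
    """Hypothetical subclass: clusters time-frequency points on a single
    feature, the log-magnitude of the mixture STFT."""

    def extract_features(self):
        stft = self.audio_signal.stft()        # (n_freq, n_time, n_chan), complex
        log_mag = np.log(np.abs(stft) + 1e-8)  # avoid log(0)
        return log_mag[..., None]              # (n_freq, n_time, n_chan, 1)

mixture = nussl.AudioSignal('mixture.wav')     # hypothetical input file
separator = MagnitudeClustering(mixture, num_sources=2, clustering_type='KMeans')
estimates = separator()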
Mix-in for NMF-based methods¶
class nussl.separation.NMFMixin
Methods
fit(audio_signals, n_components[, …]) – Fits an NMF model to the magnitude spectrograms of each audio signal.
inverse_transform(components, activations) – Reconstructs the magnitude spectrogram by matrix multiplying the components with the activations.
transform(audio_signal, model) – Uses an already-fit model to transform the magnitude spectrogram of an audio signal into components and activations.
static fit(audio_signals, n_components, beta_loss='frobenius', l1_ratio=0.5, **kwargs)
Fits an NMF model to the magnitude spectrograms of each audio signal. If audio_signals is a list, the magnitude spectrograms of each signal are concatenated into a single data matrix to which NMF is fit. If audio_signals is a single audio signal, then NMF is fit only to the magnitude spectrogram of that audio signal. If any of the audio signals are multichannel, the channels are concatenated into a single (longer) data matrix.
- Parameters
audio_signals (list or AudioSignal) – AudioSignal object(s) that NMF will be fit to.
n_components (int) – Number of components to use in the NMF module. Corresponds to number of spectral templates.
beta_loss (float or string) – String must be in {‘frobenius’, ‘kullback-leibler’, ‘itakura-saito’}. Beta divergence to be minimized, measuring the distance between X and the dot product WH. Note that values different from ‘frobenius’ (or 2) and ‘kullback-leibler’ (or 1) lead to significantly slower fits. Note that for beta_loss <= 0 (or ‘itakura-saito’), the input matrix X cannot contain zeros. Used only in ‘mu’ solver. Defaults to ‘frobenius’.
l1_ratio (float) – The regularization mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an elementwise L2 penalty (aka Frobenius norm). For l1_ratio = 1 it is an elementwise L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2. Defaults to 0.5.
kwargs (dict) – Additional keyword arguments to initialization of the NMF decomposition method.
- Returns
model (NMF) – Fitted NMF model for the audio signal(s).
components (np.ndarray) – Spectral templates, shape (n_components, n_features).
activations (np.ndarray) – Activations, shape (n_components, n_time, n_channels). The shape is laid out like an STFT, but with components as the features rather than the frequencies of the STFT.
static inverse_transform(components, activations)
Reconstructs the magnitude spectrogram by matrix multiplying the components with the activations. Components and activations are considered to be 2D matrices, but if they have more dimensions, then the first dimension is interpreted to be the batch dimension.
- Parameters
components (np.ndarray) – Spectral templates (n_components, n_features)
activations (np.ndarray) – Activations (n_components, n_time, n_channels) The shape here is as if it was like an STFT but with components as the features rather than frequencies of the STFT.
static transform(audio_signal, model)
Uses an already-fit model to transform the magnitude spectrogram of an audio signal into components and activations. These can be multiplied to reconstruct the original matrix, or used to separate out sounds that correspond to components in the model.
- Parameters
audio_signal (AudioSignal) – AudioSignal object to transform with model.
model (NMF) – NMF model to separate with. Must be fitted prior to this call.
- Returns
components (np.ndarray) – Spectral templates, shape (n_components, n_features).
activations (np.ndarray) – Activations, shape (n_components, n_time, n_channels). The shape is laid out like an STFT, but with components as the features rather than the frequencies of the STFT.
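A hedged usage sketch follows. Because fit, transform, and inverse_transform are static methods, they can be called directly on NMFMixin; the return-value unpacking below (model, components, activations) follows the Returns descriptions above, and the file path is hypothetical.

import nussl

mix = nussl.AudioSignal('mixture.wav')   # hypothetical input file

# Fit an NMF model with 16 spectral templates to the mixture's magnitude spectrogram
model, components, activations = nussl.separation.NMFMixin.fit(mix, 16)

# Transform a signal with the already-fit model
components, activations = nussl.separation.NMFMixin.transform(mix, model)

# Reconstruct an approximation of the magnitude spectrogram
reconstruction = nussl.separation.NMFMixin.inverse_transform(components, activations)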
Mix-in for deep methods¶
class nussl.separation.DeepMixin
Methods
load_model(model_path[, device]) – Loads the model at the specified path model_path.
load_model(model_path, device='cpu')
Loads the model at the specified path model_path. Uses the GPU if available.
- Parameters
model_path (str) – Path to a model saved as a SeparationModel.
device (str or torch.Device) – Loads the model on CPU or GPU. Defaults to 'cpu'.
- Returns
model (SeparationModel) – Loaded model, an nn.Module.
metadata (dict) – Metadata associated with the model, used for making the input data into the model.
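A hedged sketch of deferred model loading through a DeepMixin-derived class. The checkpoint path is hypothetical, and DeepMaskEstimation is used only as an example host class (model_path is left as None at construction so the model can be loaded later, as described below for the deep methods).

import nussl

mixture = nussl.AudioSignal('mixture.wav')                      # hypothetical input file
separator = nussl.separation.deep.DeepMaskEstimation(mixture)   # model_path=None: load later

# load_model comes from DeepMixin
separator.load_model('checkpoints/mask_model.pth', device='cpu')  # hypothetical checkpoint
estimates = separator()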
Benchmark methods¶
These methods are used for obtaining upper and lower baselines for source separation algorithms.
High pass filter¶
class nussl.separation.benchmark.HighLowPassFilter(input_audio_signal, high_pass_cutoff_hz, mask_type='binary')
Implements a very simple separation algorithm that just masks everything below the specified frequency in Hz. It does this by zeroing out the associated FFT bins via a mask to produce the “high” source; the residual is the “low” source.
- Parameters
input_audio_signal (AudioSignal) – Signal to separate.
high_pass_cutoff_hz (float) – Cutoff in Hz. Will be rounded off to the nearest frequency bin.
mask_type (str, optional) – Mask type. Defaults to ‘binary’.
Ideal binary mask¶
class nussl.separation.benchmark.IdealBinaryMask(input_audio_signal, sources, mask_type='binary', mask_threshold=0.5)
Implements an ideal binary mask (IBM) that is computed using the known ground truth sources. This is one of the upper baselines.
- Parameters
input_audio_signal (AudioSignal) – Signal to separate.
sources (list) – List of audio signal objects that correspond to the sources.
mask_type (str, optional) – Mask type. Defaults to ‘binary’.
Ideal ratio mask¶
class nussl.separation.benchmark.IdealRatioMask(input_audio_signal, sources, approach='psa', mask_type='soft', mask_threshold=0.5, **kwargs)
Implements an ideal ratio mask (IRM) that is computed using the known ground truth sources. This is one of the upper baselines.
- Parameters
input_audio_signal (AudioSignal) – Signal to separate.
sources (list) – List of audio signal objects that correspond to the sources.
approach (str) – Either ‘psa’ (phase sensitive spectrum approximation) or ‘msa’ (magnitude spectrum approximation). Generally ‘psa’ does better.
mask_type (str, optional) – Mask type. Defaults to ‘soft’.
mask_threshold (float, optional) – Masking threshold. Defaults to 0.5.
kwargs (dict) – Extra keyword arguments are passed to the transform classes at initialization.
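A hedged sketch of using IdealRatioMask as an upper baseline. The file paths are hypothetical; the ground-truth sources are assumed to sum to the mixture.

import nussl

mixture = nussl.AudioSignal('mixture.wav')                # hypothetical files: the mixture
vocals = nussl.AudioSignal('vocals.wav')                  # and its ground-truth sources
accompaniment = nussl.AudioSignal('accompaniment.wav')

irm = nussl.separation.benchmark.IdealRatioMask(
    mixture, [vocals, accompaniment], approach='psa')
estimates = irm()   # best-case estimates given the chosen mask type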
Wiener filter¶
class nussl.separation.benchmark.WienerFilter(input_audio_signal, estimates, iterations=1, mask_type='soft', mask_threshold=0.5, **kwargs)
Implements a multichannel Wiener filter that is computed by using some source estimates. When using the estimates produced by IdealRatioMask or IdealBinaryMask, this is one of the upper baselines.
- Parameters
input_audio_signal (AudioSignal) – Signal to separate.
estimates (list) – List of audio signal objects that correspond to the estimates.
iterations (int) – Number of iterations for expectation-maximization in Wiener filter.
mask_type (str, optional) – Mask type. Defaults to ‘soft’.
mask_threshold (float, optional) – Threshold for masking binary. Defaults to 0.5.
kwargs (dict) – Additional keyword arguments to norbert.wiener.
Mix as estimate¶
class nussl.separation.benchmark.MixAsEstimate(input_audio_signal, num_sources)
This algorithm does nothing but scale the mix by the number of sources. This can be used to compute the improvement metrics (e.g. improvement in SDR over using the mixture as the estimate).
- Parameters
input_audio_signal (AudioSignal) – Signal to separate.
num_sources (int) – How many sources to return.
Deep methods¶
Deep networks can be used for source separation via these classes.
Deep clustering¶
class nussl.separation.deep.DeepClustering(input_audio_signal, num_sources, model_path=None, device='cpu', **kwargs)
Clusters the embedding produced by a deep model for every time-frequency point. This is the deep clustering source separation approach. It is flexible with the number of sources. It expects that the model outputs a dictionary where one of the keys is ‘embedding’. This uses the DeepMixin class to load the model and set the audio signal’s parameters to be appropriate for the model.
- Parameters
input_audio_signal – (AudioSignal) An AudioSignal object containing the mixture to be separated.
num_sources (int) – Number of sources to cluster the features of and separate the mixture.
model_path (str, optional) – Path to the model that will be used. Can be None, so that you can initialize a class and load the model later. Defaults to None.
device (str, optional) – Device to put the model on. Defaults to ‘cpu’.
**kwargs (dict) – Keyword arguments for ClusteringSeparationBase and the clustering object used for clustering (one of KMeans, GaussianMixture, MiniBatchKMeans).
- Raises
SeparationException – If ‘embedding’ isn’t in the output of the model.
Deep mask estimation¶
class nussl.separation.deep.DeepMaskEstimation(input_audio_signal, model_path=None, device='cpu', **kwargs)
Separates an audio signal using the masks produced by a deep model for every time-frequency point. It expects that the model outputs a dictionary where one of the keys is ‘masks’. This uses the DeepMixin class to load the model and set the audio signal’s parameters to be appropriate for the model.
- Parameters
input_audio_signal – (AudioSignal) An AudioSignal object containing the mixture to be separated.
model_path (str, optional) – Path to the model that will be used. Can be None, so that you can initialize a class and load the model later. Defaults to None.
device (str, optional) – Device to put the model on. Defaults to ‘cpu’.
**kwargs (dict) – Keyword arguments for MaskSeparationBase.
Deep audio estimation¶
class nussl.separation.deep.DeepAudioEstimation(input_audio_signal, model_path=None, device='cpu', **kwargs)
Separates an audio signal using a model that produces separated sources directly in the waveform domain. It expects that the model outputs a dictionary where one of the keys is ‘audio’. This uses the DeepMixin class to load the model and set the audio signal’s parameters to be appropriate for the model.
- Parameters
input_audio_signal – (AudioSignal) An AudioSignal object containing the mixture to be separated.
model_path (str, optional) – Path to the model that will be used. Can be None, so that you can initialize a class and load the model later. Defaults to None.
device (str, optional) – Device to put the model on. Defaults to ‘cpu’.
**kwargs (dict) – Keyword arguments for MaskSeparationBase.
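A hedged sketch of running a deep method end to end. The model path is hypothetical and must point to a trained SeparationModel checkpoint whose output dictionary contains the key the chosen class expects ('embedding' for DeepClustering).

import nussl

mixture = nussl.AudioSignal('mixture.wav')     # hypothetical input file

dpcl = nussl.separation.deep.DeepClustering(
    mixture, num_sources=2,
    model_path='checkpoints/dpcl_model.pth',   # hypothetical checkpoint
    device='cpu')
estimates = dpcl()   # list of AudioSignal objects, one per source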
Composite methods¶
These are methods that use the output of multiple separation algorithms to build better, more robust separation estimates.
Ensemble clustering¶
class nussl.separation.composite.EnsembleClustering(input_audio_signal, num_sources, separators, weights=None, returns=None, num_cascades=1, extracted_feature='masks', clustering_type='KMeans', fit_clusterer=True, percentile=90, beta=5.0, mask_type='soft', mask_threshold=0.5, **kwargs)
Runs multiple separation algorithms on a single mixture and concatenates their masks to input into a clustering algorithm.
This algorithm allows you to combine the outputs of multiple separation algorithms, fusing them into a single output via clustering. It was first developed in [1]. When used with primitive separation algorithms, it becomes the PrimitiveClustering algorithm described in [1].
References:
- [1] Seetharaman, Prem. Bootstrapping the Learning Process for Computer Audition.
Diss. Northwestern University, 2019.
- Parameters
input_audio_signal (AudioSignal) – Signal to separate.
num_sources (int) – Number of sources to separate from signal.
separators (list) – List of instantiated separation algorithms that will be run on the input audio signal.
weights (list, optional) – Weight to give to each algorithm in the resultant feature vector. For example, [3, 1], will repeat the features from the first algorithm 3 times and the second algorithm 1 time. Defaults to None - every algorithm gets a weight of 1.
returns (list, optional) – Which outputs of each algorithm to keep in the resultant feature vector. Defaults to None.
num_cascades (int, optional) – The output of each algorithm can be cascaded into one another. The outputs of the first layer of algorithms will be fed back into each separation algorithm to create more features. Defaults to 1.
extracted_feature (str, optional) – Which feature to extract from each algorithm. Must be one of [‘estimates’, ‘masks’]. estimates will reconstruct a soft mask using the output of the algorithm (useful if the algorithm is not a masking based separation algorithm). masks will use the data in the result_masks attribute of the separation algorithm. Defaults to ‘masks’.
clustering_type (str) – One of ‘KMeans’, ‘GaussianMixture’, and ‘MiniBatchKMeans’. The clustering approach to use on the features. Defaults to ‘KMeans’.
fit_clusterer (bool, optional) – Whether or not to call fit on the clusterer. If False, then the clusterer should already be fit for this to work. Defaults to True.
percentile (int, optional) – Percentile of time-frequency points to consider by loudness. Audio spectrograms are very high dimensional, and louder points tend to matter more than quieter points. By setting the percentile high, one can more efficiently cluster an auditory scene by considering only points above that threshold. Defaults to 90 (which means the top 10 percentile of time-frequency points will be used for clustering).
beta (float, optional) – When using KMeans, we use soft KMeans, which has an additional parameter beta. beta controls how soft the assignments are. As beta increases, the assignments become more binary (either 0 or 1). Defaults to 5.0, a value discovered through cross-validation.
mask_type (str, optional) – Masking approach to use. Passed up to MaskSeparationBase.
mask_threshold (float, optional) – Threshold for masking. Passed up to MaskSeparationBase.
**kwargs (dict, optional) – Additional keyword arguments that are passed to the clustering object (one of KMeans, GaussianMixture, or MiniBatchKMeans).
Example
from nussl.separation import (
    primitive, factorization, composite, SeparationException
)

separators = [
    primitive.FT2D(mix),
    factorization.RPCA(mix),
    primitive.Melodia(mix, voicing_tolerance=0.2),
    primitive.HPSS(mix),
]

weights = [3, 3, 1, 1]
returns = [[1], [1], [1], [0]]

ensemble = composite.EnsembleClustering(
    mix, 2, separators, weights=weights, returns=returns)
estimates = ensemble()
Factorization-based methods¶
These methods use some sort of factorization-based algorithm, like robust principal component analysis or independent component analysis, to separate the auditory scene.
Robust principal component analysis¶
class nussl.separation.factorization.RPCA(input_audio_signal, high_pass_cutoff=100, num_iterations=100, epsilon=1e-07, mask_type='soft', mask_threshold=0.5)
Implements foreground/background separation using RPCA.
Huang, Po-Sen, et al. “Singing-voice separation from monaural recordings using robust principal component analysis.” Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012.
- Parameters
input_audio_signal (AudioSignal) – The AudioSignal object that has the audio data that RPCA will be run on.
high_pass_cutoff (float, optional) – Value (in Hz) for the high pass cutoff filter. Defaults to 100.
num_iterations (int, optional) – how many iterations to run RPCA for. Defaults to 100.
epsilon (float, optional) – Stopping criterion for RPCA convergence. Defaults to 1e-7.
mask_type (str, optional) – Type of mask to use. Defaults to ‘soft’.
mask_threshold (float, optional) – Threshold for mask. Defaults to 0.5.
Independent component analysis¶
class nussl.separation.factorization.ICA(audio_signals, max_iterations=200, **kwargs)
Separates sources using Independent Component Analysis, given observations of the audio scene. nussl’s ICA is a wrapper for scikit-learn’s implementation of FastICA, and provides a way to interoperate between nussl’s AudioSignal objects and FastICA.
- Parameters
audio_signals – list of AudioSignal objects containing the observations of the mixture. Will be converted into a single multichannel AudioSignal.
max_iterations (int) – Max number of iterations to run ICA for. Defaults to 200.
**kwargs – Additional keyword arguments that will be passed to sklearn.decomposition.FastICA
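A hedged usage sketch. The file paths are hypothetical; observations is a list of AudioSignal objects recording the same scene, which ICA converts into a single multichannel AudioSignal internally.

import nussl

observations = [
    nussl.AudioSignal('mic_1.wav'),   # hypothetical recordings of the
    nussl.AudioSignal('mic_2.wav'),   # same scene from different sensors
]
ica = nussl.separation.factorization.ICA(observations, max_iterations=200)
sources = ica()   # list of AudioSignal objects, one per recovered source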
Primitive methods¶
These methods are based on primitives: hard-wired perceptual grouping cues that are used automatically by the brain. The term “primitives” was coined by Albert Bregman in his book Auditory Scene Analysis.
Cluster sources by timbre¶
class nussl.separation.primitive.TimbreClustering(input_audio_signal, num_sources, n_components, n_mfcc=13, nmf_kwargs=None, **kwargs)
Implements separation by timbre via NMF with MFCC clustering. The steps are:
1. Factorize the magnitude spectrogram of the mixture with NMF.
2. Take MFCC coefficients of each component.
3. Express each time-frequency bin as a combination of components.
4. Compute the features for each time-frequency bin as the weighted combination of the MFCCs of each component.
5. Cluster each time-frequency bin based on these features.
- Parameters
input_audio_signal (AudioSignal) – Signal to separate.
n_components (int) – Number of components to use in the NMF model. Corresponds to number of spectral templates.
n_mfcc (int) – Number of MFCC coefficients to use. Defaults to 13.
nmf_kwargs (dict) – Dictionary containing keyword arguments for NMFMixin.fit.
kwargs (dict) – Extra keyword arguments are passed to ClusteringSeparationBase.
Foreground/background via 2DFT¶
class nussl.separation.primitive.FT2D(input_audio_signal, neighborhood_size=(1, 25), high_pass_cutoff=100.0, quadrants_to_keep=(0, 1, 2, 3), filter_approach='local_std', use_bg_2dft=True, mask_type='soft', mask_threshold=0.5)
This separation method is based on using 2DFT image processing for source separation [1].
The algorithm has five main steps:
1. Take the 2DFT of the magnitude STFT.
2. Identify peaks in the 2DFT (these correspond to repeating patterns).
3. Mask everything but the peaks and invert the masked 2DFT back to a magnitude STFT.
4. Take the residual, peakless 2DFT and invert that to a magnitude STFT.
5. Compare the two magnitude STFTs to construct masks for the foreground and the background.
The algorithm runs on each channel independently. The masks can be constructed in two ways: either using the background (peaky) 2DFT as the reference for making masks, or using the foreground (peakless) 2DFT as the reference for making masks. This behavior can be toggled via use_bg_2dft=[True, False]. Using the background biases the algorithm towards separating repeating patterns whereas using the foreground biases it towards preserving micromodulation in the foreground.
There are two ways to identify peaks: either using the original method, as laid out in the paper which identifies peaks in a binary way as being above some threshold, or by using the local_std way, which was developed and investigated in Chapter 2 in [2]. This method identifies peaks in a soft way and can thus be used to construct soft masks. It generally performs better.
The main hyperparameter to consider is the neighborhood_size, which determines how big of a filter to apply when looking for peaks. It is two dimensional, but generally keeping 1D horizontal filters is the way to go.
References
- [1] Seetharaman, Prem, Fatemeh Pishdadian, and Bryan Pardo.
“Music/Voice Separation Using the 2D Fourier Transform.” 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017.
- [2] Seetharaman, Prem. Bootstrapping the Learning Process for Computer Audition.
Diss. Northwestern University, 2019.
- Parameters
input_audio_signal (AudioSignal) – Signal to separate.
neighborhood_size (tuple, optional) – 2-tuple of ints telling the filter size to look for peaks in frequency and time: (f, t). Defaults to (1, 25).
high_pass_cutoff (float, optional) – Time-frequency bins below this cutoff will be assigned to the background. Defaults to 100.0.
quadrants_to_keep (tuple, optional) – Each quadrant of the 2DFT can be filtered out when separating. Can be used for separating out upward spectro-temporal patterns (via (0, 2)) from downward ones (via (1, 3)). Defaults to (0, 1, 2, 3).
filter_approach (str, optional) – One of ‘original’ or ‘local_std’. Which filtering approach to apply to identify peaks. Defaults to ‘local_std’.
use_bg_2dft (bool, optional) – Whether to use the background or foreground 2DFT as the reference for constructing masks. Defaults to True.
mask_type (str, optional) – Mask type. Defaults to ‘soft’.
mask_threshold (float, optional) – Masking threshold. Defaults to 0.5.
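A hedged usage sketch for FT2D. The file path is hypothetical, and the order of the returned estimates (background first, then foreground) is an assumption based on the algorithm description above.

import nussl

mixture = nussl.AudioSignal('mixture.wav')   # hypothetical input file

# Bias toward extracting the repeating background...
ft2d = nussl.separation.primitive.FT2D(mixture, use_bg_2dft=True)
background, foreground = ft2d()              # order assumed: background, then foreground

# ...or bias toward preserving foreground micromodulation
ft2d_fg = nussl.separation.primitive.FT2D(mixture, use_bg_2dft=False)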
Harmonic/percussive separation¶
class nussl.separation.primitive.HPSS(input_audio_signal, kernel_size=31, mask_type='soft', mask_threshold=0.5)
Implements harmonic/percussive source separation based on [1]. This is a wrapper around the librosa implementation.
References:
- [1] Fitzgerald, Derry. “Harmonic/percussive separation using median filtering.”
13th International Conference on Digital Audio Effects (DAFX10), Graz, Austria, 2010.
- [2] Driedger, Müller, Disch. “Extending harmonic-percussive separation of audio.”
15th International Society for Music Information Retrieval Conference (ISMIR 2014) Taipei, Taiwan, 2014.
- Parameters
input_audio_signal (AudioSignal) – signal to separate.
kernel_size (int or tuple (kernel_harmonic, kernel_percussive)) – kernel size(s) for the median filters.
mask_type (str, optional) – Mask type. Defaults to ‘soft’.
mask_threshold (float, optional) – Masking threshold. Defaults to 0.5.
Foreground/background via REPET¶
class nussl.separation.primitive.Repet(input_audio_signal, min_period=None, max_period=None, period=None, high_pass_cutoff=100.0, mask_type='soft', mask_threshold=0.5)
Implements the original REpeating Pattern Extraction Technique (REPET) algorithm using the beat spectrum.
REPET is a simple method for separating a repeating background from a non-repeating foreground in an audio mixture. It assumes a single repeating period over the whole signal duration, and finds that period based on finding a peak in the beat spectrum. The period can also be provided exactly, or you can give Repet a guess of the min and max period. Once it has a period, it “overlays” spectrogram sections of length period to create a median model (the background).
References:
- [1] Rafii, Zafar, and Bryan Pardo.
“Repeating pattern extraction technique (REPET): A simple method for music/voice separation.” IEEE transactions on audio, speech, and language processing 21.1 (2012): 73-84.
- Parameters
input_audio_signal (AudioSignal) – Signal to separate.
min_period (float, optional) – minimum time to look for repeating period in terms of seconds.
max_period (float, optional) – maximum time to look for repeating period in terms of seconds.
period (float, optional) – exact time that the repeating period is (in seconds).
high_pass_cutoff (float, optional) – value (in Hz) for the high pass cutoff filter.
mask_type (str, optional) – Mask type. Defaults to ‘soft’.
mask_threshold (float, optional) – Masking threshold. Defaults to 0.5.
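A hedged usage sketch for Repet. The file path and period bounds are illustrative, and the estimate order (repeating background first, then foreground) is an assumption based on the description above.

import nussl

mixture = nussl.AudioSignal('mixture.wav')   # hypothetical input file

# Give REPET a range in which to search for the repeating period (in seconds)
repet = nussl.separation.primitive.Repet(mixture, min_period=0.8, max_period=8.0)
background, foreground = repet()             # order assumed: background, then foreground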
Foreground/background via REPET-SIM¶
class nussl.separation.primitive.RepetSim(input_audio_signal, similarity_threshold=0, min_distance_between_frames=1, max_repeating_frames=100, high_pass_cutoff=100, mask_type='soft', mask_threshold=0.5)
Implements the REpeating Pattern Extraction Technique algorithm using the Similarity Matrix (REPET-SIM).
REPET is a simple method for separating the repeating background from the non-repeating foreground in a piece of audio mixture. REPET-SIM is a generalization of REPET, which looks for similarities instead of periodicities.
References:
- [1] Zafar Rafii and Bryan Pardo.
“Music/Voice Separation using the Similarity Matrix,” 13th International Society on Music Information Retrieval, Porto, Portugal, October 8-12, 2012.
- Parameters
input_audio_signal (AudioSignal) – Audio signal to be separated.
similarity_threshold (int, optional) – Threshold for considering two frames to be similar. Defaults to 0.
min_distance_between_frames (float, optional) – Number of seconds two frames must be apart to be considered neighbors. Defaults to 1.
max_repeating_frames (int, optional) – Max number of frames to consider as neighbors. Defaults to 100.
high_pass_cutoff (float, optional) – Cutoff for high pass filters. Bins below this cutoff will be given to the background. Defaults to 100.
mask_type (str, optional) – Mask type to use. Defaults to ‘soft’.
mask_threshold (float, optional) – Threshold for converting the mask to binary. Defaults to 0.5.
Vocal melody extraction via Melodia¶
class nussl.separation.primitive.Melodia(input_audio_signal, high_pass_cutoff=100, minimum_frequency=55.0, maximum_frequency=1760.0, voicing_tolerance=0.2, minimum_peak_salience=0.0, compression=0.5, num_overtones=40, apply_vowel_filter=False, smooth_length=5, add_lower_octave=False, mask_type='soft', mask_threshold=0.5)
Implements melody extraction using Melodia [1].
This requires Melodia to be installed as a Vamp plugin, as well as the vampy package for Python. Install Melodia via: https://www.upf.edu/web/mtg/melodia. Note that Melodia is licensed for NON-COMMERCIAL use only.
References:
- [1] J. Salamon and E. Gómez, “Melody Extraction from Polyphonic Music Signals using
Pitch Contour Characteristics”, IEEE Transactions on Audio, Speech and Language Processing, 20(6):1759-1770, Aug. 2012.
- Parameters
input_audio_signal (AudioSignal object) – The AudioSignal object that has the audio data that Melodia will be run on.
high_pass_cutoff (optional, float) – value (in Hz) for the high pass cutoff filter.
minimum_frequency (optional, float) – minimum frequency in Hertz (default 55.0)
maximum_frequency (optional, float) – maximum frequency in Hertz (default 1760.0)
voicing_tolerance (optional, float) – Greater values will result in more pitch contours included in the final melody. Smaller values will result in fewer pitch contours included in the final melody (default 0.2).
minimum_peak_salience (optional, float) – a hack to avoid silence turning into junk contours when analyzing monophonic recordings (e.g. solo voice with no accompaniment). Generally you want to leave this untouched (default 0.0).
num_overtones (optional, int) – Number of overtones to use when creating melody mask.
apply_vowel_filter (optional, bool) – Whether or not to apply a vowel filter on the resynthesized melody signal when masking.
smooth_length (optional, int) – Number of frames to smooth discontinuities in the mask.
add_lower_octave (optional, bool) – Use the octave below the fundamental frequency as well, to take care of octave errors in pitch tracking, since we only care about the mask. Defaults to False.
mask_type (optional, str) – Type of mask to use.
mask_threshold (optional, float) – Threshold for mask to convert to binary.
Spatial methods¶
These methods are based on primitive spatial cues.
Cluster by inter-phase and inter-level difference¶
class nussl.separation.spatial.SpatialClustering(input_audio_signal, num_sources, clustering_type='KMeans', fit_clusterer=True, percentile=90, beta=5.0, mask_type='soft', mask_threshold=0.5, **kwargs)
Implements clustering on IPD/ILD features between the first two channels.
IPD/ILD features are inter-phase difference and inter-level difference features. Sounds coming from different directions will naturally cluster in IPD/ILD space.
This subclasses ClusteringSeparationBase, which handles all of the clustering functionality behind this class.
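A hedged usage sketch. The file path is hypothetical; the input is assumed to be a two-channel AudioSignal, since the IPD/ILD features are computed between the first two channels.

import nussl

stereo_mix = nussl.AudioSignal('stereo_mixture.wav')   # hypothetical two-channel file
spatial = nussl.separation.spatial.SpatialClustering(stereo_mix, num_sources=2)
estimates = spatial()   # one AudioSignal per spatially distinct source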
PROJET: Separate via spatial projections¶
class nussl.separation.spatial.Projet(input_audio_signal, num_sources, estimates=None, num_iterations=50, maximum_delay_in_samples=20, location_set_panning=30, location_set_delay=17, projection_set_panning=10, projection_set_delay=9, beta=1, alpha=1, device='cpu')
Implements the PROJET algorithm for spatial audio separation using projections. This implementation uses PyTorch to speed up computation considerably. PROJET performs the following steps:
1. Project the complex stereo STFT onto multiple angles and delays via projection and delay matrix transformations.
2. Initialize the parameters of the system to “remix” these projections, along with PSDs of the sources, such that they try to reconstruct the original stereo mixture.
3. Find the optimal parameters via multiplicative update rules for P and for Q.
4. Use the discovered parameters to isolate the sources via spatial cues.
This implementation considers BOTH panning and delays when isolating sources. PROJET is not a masking-based method; it estimates the sources directly by projecting the complex STFT.
- Parameters
input_audio_signal (AudioSignal) – Audio signal to separate.
num_sources (int) – Number of source to separate.
estimates (list of AudioSignal) – initial estimates for the separated sources if available. These will be used to initialize the update algorithm. So one could (for example), run FT2D on a signal and then refine the estimates using PROJET. Defaults to None (randomly initialize P).
num_iterations (int, optional) – Number of iterations to do for the update rules for P and Q. Defaults to 50.
maximum_delay_in_samples (int, optional) – Maximum delay in samples that you are willing to consider in the projection matrices. Defaults to 20.
location_set_panning (int, optional) – How many locations in panning you are willing to consider. Defaults to 30.
location_set_delay (int, optional) – How many delays you are willing to consider. Defaults to 17.
projection_set_panning (int, optional) – How many projections you are willing to use in panning-space. Defaults to 10.
projection_set_delay (int, optional) – How many projections of the mixture you are willing to use in delay-space. Defaults to 9.
beta (int, optional) – Beta in beta divergence. See Table 1 in [1]. Defaults to 1.
alpha (int, optional) – Power to raise each power spectral density estimate of each source to. Defaults to 1.
device (str, optional) – Device to use when performing update rules. ‘cuda’ will be fastest, if available. Defaults to ‘cpu’.
References:
- [1] Fitzgerald, Derry, Antoine Liutkus, and Roland Badeau.
“Projection-based demixing of spatial audio.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 24.9 (2016): 1560-1572.
- [2] Fitzgerald, Derry, Antoine Liutkus, and Roland Badeau.
“Projet—spatial audio separation using projections.” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016.
DUET¶
class nussl.separation.spatial.Duet(input_audio_signal, num_sources, attenuation_min=-3, attenuation_max=3, num_attenuation_bins=50, delay_min=-3, delay_max=3, num_delay_bins=50, peak_threshold=0.0, attenuation_min_distance=5, delay_min_distance=5, p=1, q=0, mask_type='binary')
The DUET algorithm was originally proposed by S. Rickard and F. Dietrich for DOA estimation, and further developed for BSS and demixing by A. Jourjine, S. Rickard, and O. Yilmaz.
DUET extracts sources using the symmetric attenuation and relative delay between two channels. The symmetric attenuation is calculated from the ratio of the two channels’ stft amplitudes, and the delay is the arrival delay between the two sensors used to record the audio signal. These two values are clustered as peaks on a histogram to determine where each source occurs. This implementation of DUET creates and returns Mask objects after the run() function, which can then be applied to the original audio signal to extract each individual source.
References:
- [1] Rickard, Scott. “The DUET blind source separation algorithm.”
Blind Speech Separation. Springer Netherlands, 2007. 217-241.
- [2] Yilmaz, Ozgur, and Scott Rickard. “Blind separation of speech mixtures
via time-frequency masking.” Signal Processing, IEEE transactions on 52.7 (2004): 1830-1847.
- Parameters
input_audio_signal (np.array) – a 2-row Numpy matrix containing samples of the two-channel mixture.
num_sources (int) – Number of sources to find.
attenuation_min (int) – Minimum distance in utils.find_peak_indices, change if not enough peaks are identified.
attenuation_max (int) – Used for creating a histogram without outliers.
num_attenuation_bins (int) – Number of bins for attenuation.
delay_min (int) – Lower bound on delay, used as minimum distance in utils.find_peak_indices.
delay_max (int) – Upper bound on delay, used for creating a histogram without outliers.
num_delay_bins (int) – Number of bins for delay.
peak_threshold (float) – Value in [0, 1] for peak picking.
attenuation_min_distance (int) – Minimum distance between peaks wrt attenuation.
delay_min_distance (int) – Minimum distance between peaks wrt delay.
p (int) – Weight the histogram with the symmetric attenuation estimator.
q (int) – Weight the histogram with the delay estimator.
Notes
On page 8 of his paper, Rickard recommends p=1 and q=0 as a default starting point and p=.5, q=0 if one source is more dominant.
Attributes
stft_ch0 (np.array) – A Numpy matrix containing the STFT data of channel 0.
stft_ch1 (np.array) – A Numpy matrix containing the STFT data of channel 1.
frequency_matrix (np.array) – A Numpy matrix containing the frequencies of analysis.
symmetric_atn (np.array) – A Numpy matrix containing the symmetric attenuation between the two channels.
delay (np.array) – A Numpy matrix containing the delay between the two channels.
num_time_bins (int) – The number of time bins for the frequency matrix and mask arrays.
num_frequency_bins (int) – The number of frequency bins for the mask arrays.
attenuation_bins (np.array) – A Numpy array containing the attenuation bins for the histogram.
delay_bins (np.array) – A Numpy array containing the delay bins for the histogram.
normalized_attenuation_delay_histogram (np.array) – A normalized Numpy matrix containing the attenuation-delay histogram, which has peaks for each source.
attenuation_delay_histogram (np.array) – A non-normalized Numpy matrix containing the attenuation-delay histogram, which has peaks for each source.
peak_indices (np.array) – A Numpy array containing the indices of the peaks for the histogram.
separated_sources (np.array) – A Numpy array of arrays containing each separated source.
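A hedged usage sketch for Duet. The file path is hypothetical; the input is assumed to be a two-channel AudioSignal, as with the other separation classes, and access to the masks through result_masks follows the mask-based base-class convention described earlier.

import nussl

stereo_mix = nussl.AudioSignal('stereo_mixture.wav')   # hypothetical two-channel file
duet = nussl.separation.spatial.Duet(stereo_mix, num_sources=3)
estimates = duet()            # one AudioSignal per located source
masks = duet.result_masks     # assumed: binary Mask objects produced by run()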