Abstract
This study addresses the problem of under-determined speech source separation from multichannel microphone signals, i.e. the convolutive mixtures of multiple sources. The time-domain signals are first transformed to the short-time Fourier transform (STFT) domain. To represent the room filters in the STFT domain, instead of the widely used narrowband assumption, the authors propose to use a more accurate model, i.e. the convolutive transfer function (CTF). At each frequency band, the CTF coefficients of the mixing filters and the STFT coefficients of the sources are jointly estimated by maximising the likelihood of the microphone signals, which is resolved by an expectation-maximisation algorithm. Experiments show that the proposed method provides very satisfactory performance under highly reverberant environments.
Introduction
Most speech source separation techniques are designed in the short-time Fourier transform (STFT) domain, where the narrowband assumption is generally used, e.g. [1–4]. Under the narrowband assumption, at each frequency band, the time-domain filter is represented by the acoustic transfer function (ATF), and the time-domain convolutive process is transformed into a product between the ATF and the STFT coefficients of the source signal. This assumption is also referred to as the multiplicative transfer function approximation [5]. Based on the ATF or its variants, e.g. relative transfer functions [6, 7], beamforming techniques are widely used for multichannel speech source separation and speech enhancement. Popular beamformers for multisource separation include the linearly constrained minimum variance/power beamformers [6, 8]. Further, because of the spectral sparsity of speech, the microphone signal can be assumed to be dominated by only one speech source in each time–frequency (TF) bin. This is referred to as the W-disjoint orthogonality (WDO) assumption [1]. The binary masking (BM) method [1, 2] and the ℓ1-norm minimisation method [3] exploit this WDO assumption. More examples of narrowband assumption-based techniques can be found in [9] and the references therein.
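As a hedged sketch in our own notation (the paper's displayed equations are not reproduced here): with a_i(k) the ATF of channel i at frequency k, and s(k,p) the source STFT coefficient at frequency k and frame p, the narrowband assumption states

\[ x_i(k,p) \approx a_i(k)\, s(k,p), \]

i.e. the time-domain convolution reduces to a frequency-wise scalar product.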
In real scenarios, the time-domain filter, i.e. the room impulse response (RIR), is normally much longer than the STFT window (frame) length, since the latter should be set sufficiently short to account for the local stationarity of speech. In this case, the narrowband assumption is no longer valid, which leads to unsatisfactory speech source separation performance. In the literature, only a few studies have questioned the validity of the narrowband assumption and attempted to tackle this problem. Under the narrowband assumption, the theoretical covariance matrix of one source image is a rank-one matrix [4]. To mitigate the invalidity of this matrix in practice, a full-rank spatial covariance matrix (FR-SCM) was adopted in [10], even though the narrowband assumption is still used. To circumvent the inaccuracy of the narrowband assumption, the wideband time-domain convolution model was used in [11, 12], where the source STFT coefficients are recovered by minimising the fit cost between the time-domain mixtures and the time-domain convolution model. Meanwhile, based on the Lasso technique, the ℓ1-norm of the source STFT coefficients is minimised as well to impose the sparsity of the speech spectra. This method achieves good performance, but its computational complexity is very large due to the high cost of the time-domain convolution operations. In [13, 14], based on the criterion of likelihood maximisation, a variational expectation-maximisation (EM) algorithm was proposed, also using the time-domain convolution model and the STFT-domain signal model.
In [15], it was shown that the time-domain convolution can be exactly represented in the STFT domain by cross-band filters. More precisely, the STFT coefficients of the source image can be computed by summing multiple convolutions (over frequencies) between the STFT coefficients of the source signal and the STFT-domain filters. Note that the convolution is conducted along the frame axis. To simplify the analysis, for each frequency, only the band-to-band filter, i.e. the convolutive transfer function (CTF) model [16], is used, with the cross-band information omitted. Compared to the narrowband assumption, which uses a frequency-wise scalar product to approximate the time-domain convolution, the CTF model uses a frequency-wise convolution and is thus more accurate. Following the principle of the wideband Lasso [11], a subband Lasso technique based on the CTF model was proposed in [17], which largely reduces the complexity relative to the wideband Lasso technique. In [18], two CTF inverse filtering methods were proposed based on the multiple-input/output inverse theorem (MINT). In [19], the CTF was integrated into the generalised sidelobe canceller beamformer. A CTF-based EM algorithm was proposed in [20] for single-source dereverberation, in which a Kalman filter was exploited to achieve an online EM update. The cross-band filters were adopted in [21], combined with a non-negative matrix factorisation model for the source signal. To estimate the source signals, the likelihood of the microphone signals is maximised via a variational EM algorithm. In [22], also based on likelihood maximisation and EM, an STFT-domain convolutive model was used for source separation, combined with a hidden Markov model (HMM) for source activity. Even though this STFT-domain convolutive model was not named CTF in [22], it actually plays the same role as the CTF.
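To make the cross-band representation concrete, here is a hedged sketch in our own notation: the STFT coefficients of the source image are

\[ x(k,p) = \sum_{k'} \sum_{q} h_{k,k'}(q)\, s(k', p-q), \]

where the filters h_{k,k'} couple frequency band k with every band k', and the convolution runs along the frame index p; the CTF model retains only the band-to-band term k' = k.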
Due to the high model complexity, the above-mentioned source separation techniques that go beyond the narrowband assumption cannot actually be performed in a fully blind manner; in other words, some prior knowledge is required. For example, the RIRs for the wideband Lasso techniques, and the CTFs for the CTF-Lasso and MINT techniques, are required to be known or well estimated. Both the mixing filters and the source parameters are required as a good initialisation for the (variational) EM techniques [13, 14, 21, 22]. A blind multichannel CTF identification method was proposed in [23], and the identified CTF can be fed into the semi-blind methods. However, this CTF identification method is only suitable for the single-source case.
In the present paper, based on the CTF model, we propose a likelihood maximisation method for speech source separation. First, the CTF model is presented in a source mixture probabilistic framework. Then, an EM algorithm is proposed to solve the likelihood maximisation problem. The STFT coefficients of the source signals are taken as hidden variables and are estimated in the expectation step (E-step). The CTF coefficients and source parameters are estimated in the maximisation step (M-step). Experiments show that the proposed method performs better than the narrowband assumption-based methods [1, 10] and the CTF-Lasso method [17] within a semi-blind setup where the mixing filters are initialised with a perturbed version of the ground-truth CTF.
The rest of this paper is organised as follows. Section 2 presents the CTF formulation, which is plugged into a probabilistic framework in Section 3. The proposed EM algorithm is given in Section 4. Experiments are presented in Section 5. Section 6 concludes the paper. This paper is an extension of a conference paper [24]. The main improvements over [24] are: (i) we present the methodology in more detail, including the two vector/matrix formulations in Section 3, the detailed derivation of the EM algorithm in Section 4 and the execution process of the EM algorithm in Algorithm 1; (ii) in Section 5, we add experiments with CTF perturbations and analyse the computational complexity of the proposed method.
CTF formulation
In an indoor (room) environment, a speech source signal propagates to the receivers (microphones) through the room effect. In the time domain, the received source image is given by the linear convolution between the speech source signal and the RIR.
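A hedged sketch of this time-domain convolution, in our own notation with a_i(n) the RIR from the source to microphone i and s(n) the source signal:

\[ x_i(n) = (a_i \ast s)(n) = \sum_{t} a_i(t)\, s(n-t). \]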
We use the CTF model to circumvent the inaccuracy of the narrowband assumption; the source image can then be represented in the STFT domain as follows [16].
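A hedged reconstruction of this CTF model, consistent with [16, 17] and written in our own notation, is

\[ x_i(k,p) \approx \sum_{q=0}^{Q-1} h_i(k,q)\, s(k, p-q), \]

where h_i(k,q) denotes the CTF coefficients of channel i at frequency k, Q is the CTF length in frames, and the convolution runs along the frame index p; the CTF coefficients themselves are derived from the time-domain filter, cf. (3).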
Mixture model formulations
Basic formulation for mixture model
We consider a source separation problem with J sources and I sensors, which could be either underdetermined (J > I) or (over)determined (J ≤ I). Using the CTF formulation (2), in the STFT domain, the microphone signal is given by the mixture model (4).
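A hedged sketch of this mixture model, in our notation with e_i(k,p) an additive noise term:

\[ x_i(k,p) = \sum_{j=1}^{J} \sum_{q=0}^{Q-1} h_{ij}(k,q)\, s_j(k, p-q) + e_i(k,p), \qquad i = 1, \dots, I. \]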
Probabilistic model
In the source separation literature, each source signal is normally assumed to be independent of the other sources, and also independent across STFT frames and frequencies. Each STFT coefficient is assumed to follow a complex Gaussian distribution with a zero mean and a time- and frequency-dependent variance [4, 10], i.e. its probability density function (pdf) takes the standard zero-mean complex Gaussian form.
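As a hedged sketch, writing v_j(k,p) (our notation) for the variance of source j at TF bin (k,p), this pdf is

\[ p\big(s_j(k,p)\big) = \frac{1}{\pi v_j(k,p)}\, \exp\!\left(-\frac{|s_j(k,p)|^2}{v_j(k,p)}\right). \]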
Since the proposed source separation method is carried out independently at each frequency, hereafter, we omit the frequency index k for notational simplicity.
Vector/matrix Formulation 1
To formulate the mixture model (4) more compactly, there are several different choices for organising the signals and the convolution operation in vector/matrix form. To facilitate the derivation of the EM algorithm, we use two different vector/matrix formulations. In this section, Formulation 1 is presented, which enables us to easily derive the M-step; Formulation 2, which is used for the E-step derivation, is presented in the next section. The two formulations differ only in the organisation of the variables and parameters; hence, transforming from the E-step to the M-step, and vice versa, only requires reorganising the vector/matrix elements.
In Formulation 1, we define the source signals in vector form for each source and frame.
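A plausible construction, given as a hedged sketch in our own notation, stacks for each source j and frame p the Q source STFT coefficients involved in the CTF convolution:

\[ \mathbf{s}_j(p) = \big[\, s_j(p),\; s_j(p-1),\; \dots,\; s_j(p-Q+1) \,\big]^{\top}, \]

so that each CTF convolution in (4) becomes an inner product between \mathbf{s}_j(p) and the corresponding CTF coefficient vector.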
Vector/matrix Formulation 2
In Formulation 2, the STFT coefficients of all sources and frames are stacked into a single vector, and the CTF convolution is represented by a convolution matrix; the CTF convolution matrix, the source covariance matrix and the noise covariance matrix are constructed in (6), (8) and (9), respectively.
Expectation-maximisation algorithm
Collecting the source variances for all sources and frames, we obtain the source variance set. The parameter set of the present problem comprises the CTF coefficients and the source parameters, organised either according to Formulation 1 or to Formulation 2. The likelihood of the mixture is maximised by an EM algorithm, in which the parameters are the optimisation variables. Meanwhile, the STFT coefficients of the source signals are taken as hidden variables, whose posterior statistics are inferred, and the posterior mean is taken as the estimate of the source signals. The proposed EM algorithm is summarised in Algorithm 1.
E-step
The E-step is derived based on Formulation 2. Using the parameter estimates given by (16) in the preceding M-step in Formulation 1, we construct the CTF convolution matrix, the source covariance matrix and the noise covariance matrix following (6), (8) and (9), respectively.
In Formulation 2, the posterior distribution of the source signals given the mixture is considered. Since both the prior and the likelihood are Gaussian, the posterior is also Gaussian. The posterior mean and covariance are identified from the exponent of the posterior density, which yields (10).
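For a linear-Gaussian model, these posterior statistics have a standard closed form; as a hedged sketch, with H the CTF convolution matrix, Σ_s the source covariance matrix, Σ_e the noise covariance matrix and x the stacked mixture vector (all in our notation),

\[ \Sigma_{\mathrm{post}} = \left( H^{\mathsf{H}} \Sigma_e^{-1} H + \Sigma_s^{-1} \right)^{-1}, \qquad \boldsymbol{\mu} = \Sigma_{\mathrm{post}}\, H^{\mathsf{H}} \Sigma_e^{-1}\, \mathbf{x}, \]

which is the form one expects (10) to take.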
M-step
The M-step is derived based on Formulation 1. Collecting the multichannel mixture vectors and source signal vectors along frames, we obtain the observation set and the source signal set. The complete-data (observations and hidden variables) likelihood function is then formed.
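As a hedged sketch of its generic form (θ denoting the parameter set in our notation), the Gaussian models make the complete-data log-likelihood separable over frames:

\[ \log p(\mathbf{x}, \mathbf{s};\, \theta) = \sum_{p} \Big[ \log p\big(\mathbf{x}(p) \mid \mathbf{s}(p);\, \theta\big) + \log p\big(\mathbf{s}(p);\, \theta\big) \Big]. \]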
Reformulation: the posterior statistics required by Formulation 1 can be obtained by reorganising the posterior mean and covariance derived in the preceding E-step. The reformulation mainly consists of collecting the elements with the same source and frame indices.
Taking the (complex) derivatives of the expected complete-data log-likelihood with respect to the CTF coefficients (where * denotes the complex conjugate) and the source parameters, and setting them to zero, yields the parameter updates in (16).
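For instance, a hedged sketch of the source variance update, standard for Gaussian EM: the variance is set to the posterior second-order moment of the corresponding source coefficient,

\[ v_j(p) = \mathrm{E}\big[\, |s_j(p)|^2 \,\big] = |\mu_j(p)|^2 + \sigma_j^2(p), \]

where μ_j(p) and σ_j²(p) denote the posterior mean and variance obtained in the E-step (our notation).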
Algorithm 1
EM for multichannel audio source separation (MASS) with CTF
Input: mixture STFT coefficients; initial parameters.
repeat
E-step
1 Construct the CTF convolution matrix, the source covariance matrix and the noise covariance matrix following (6), (8) and (9), respectively.
2 Compute the posterior mean and covariance of the source signals following (10).
3 Compute the posterior second-order moments following (11).
M-step
4 Construct the Formulation 1 vectors and matrices following (14).
5 Update the CTF coefficients and the source parameters following (16).
until convergence
Output: STFT coefficients of the source signals (posterior means).
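To make the execution flow of Algorithm 1 concrete, the following is a minimal Python sketch of the per-frequency EM loop under a stacked linear-Gaussian formulation; all variable names (x, H, v, Sigma_e) and the unstructured filter update are our own illustrative assumptions, not the paper's exact constructions (6), (8)–(16).

```python
import numpy as np

def em_ctf_separation(x, H0, v0, Sigma_e, n_iter=30, update_ctf=True):
    """x: stacked mixture vector; H0: initial CTF convolution matrix;
    v0: initial source variances (one per stacked source coefficient);
    Sigma_e: noise covariance matrix (kept fixed, as in the experiments)."""
    H, v = H0.astype(complex), v0.astype(float)
    Se_inv = np.linalg.inv(Sigma_e)
    for _ in range(n_iter):
        # E-step: Gaussian posterior of the stacked source coefficients
        Sigma_post = np.linalg.inv(H.conj().T @ Se_inv @ H + np.diag(1.0 / v))
        mu = Sigma_post @ (H.conj().T @ Se_inv @ x)
        # Posterior second-order moment, needed by the M-step
        R = Sigma_post + np.outer(mu, mu.conj())
        # M-step: source variances = posterior second moments (floored)
        v = np.maximum(np.real(np.diag(R)), 1e-10)
        if update_ctf:
            # Unstructured least-squares filter update from the posterior
            # statistics (a simplification of the structured CTF update (16))
            H = np.outer(x, mu.conj()) @ np.linalg.inv(R)
    return mu, H, v
```

In the paper, the updates exploit the structure of the CTF convolution matrices; the unstructured update above is only meant to convey the flow of the E- and M-steps.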
Experiments
Experimental configuration
Binaural (two-channel) simulated signals were used to evaluate the proposed EM algorithm. The experiments were conducted under various acoustic conditions, in terms of reverberation time, number of sources and intensity of the room filter perturbation.
Simulation setup
We use a KEMAR dummy head [25] with one microphone embedded in each ear as the recording system. The head-related impulse responses (HRIRs) for a large grid of directions were measured in advance. The ROOMSIM simulator [26] simulates the binaural RIRs (BRIRs) using these HRIRs for both the direct-path wave and the reflections. Four reverberant conditions were simulated, with reverberation times T60 of 0 s (the anechoic case), 0.22, 0.5 and 0.79 s, respectively. The TIMIT [27] speech signals were used as the speech source signals and were convolved with the simulated BRIRs to generate the microphone signals (mixtures). The sampling rate of both the source signals and the microphone signals is 16 kHz, and the length of the source signals is about 3 s. The speech sources were located at different directions in front of the dummy head. The noisy microphone signals were generated by adding spatially uncorrelated stationary speech-like noise to the noise-free signals. One signal-to-noise ratio condition, i.e. 20 dB, was tested. The STFT uses a Hamming window with a length of 1024 samples (64 ms) and a frame step of 256 samples (16 ms). In this experiment, the ground-truth noise covariance matrix is used and is kept fixed during the EM iterations.
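The following sketch reproduces the stated STFT configuration (16 kHz sampling rate, 1024-sample Hamming window, 256-sample step); the use of SciPy here is our own choice of tool, not the authors' implementation (which is in MATLAB).

```python
import numpy as np
from scipy.signal import stft

fs = 16000
win_len, hop = 1024, 256                      # 64 ms window, 16 ms step
x = np.random.randn(3 * fs)                   # placeholder 3 s signal
f, t, X = stft(x, fs=fs, window='hamming',
               nperseg=win_len, noverlap=win_len - hop)
print(X.shape)                                # (n_frequencies, n_frames)
```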
EM initialisation
Depending on what type of prior knowledge is available, the EM algorithm can be initialised either from the E-step or from the M-step. For both choices, the accuracy of the initialisation is crucial for the EM iterations to converge to a good solution. In this experiment, we initialise the EM algorithm from the M-step. Due to the difficulty of blind initialisation, we consider a semi-blind initialisation scheme. The time-domain filters are assumed to be known, from which the CTFs are computed by (3). To fit the realistic situation that the time-domain filters (or CTFs) would actually be blindly estimated and thus suffer from some estimation error, proportional Gaussian random noise is added to the time-domain filters to generate the perturbed filters. The normalised projection misalignment (NPM) [28], in decibels (dB), is used to measure the intensity of the perturbation: the lower the NPM, the less intense the perturbation. To obtain a good initialisation of the source variances, the CTF-Lasso method proposed in [17] is first applied, which solves the following problem.
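A hedged sketch of this CTF-Lasso criterion, in our notation with regularisation weight λ:

\[ \min_{\{s_j\}} \; \frac{1}{2} \sum_{i=1}^{I} \Big\| x_i - \sum_{j=1}^{J} h_{ij} \ast s_j \Big\|_2^2 \; + \; \lambda \sum_{j=1}^{J} \| s_j \|_1, \]

i.e. a least-squares fit of the CTF mixture model regularised by the ℓ1-norm of the source STFT coefficients to promote spectral sparsity.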
Baseline methods
Three baseline methods were used for comparison: (i) the CTF-Lasso method used for initialisation; (ii) the BM method [1], which is based on the narrowband approximation. To make a fair comparison, the narrowband mixing filters are also computed using the known perturbed time-domain filters. However, to compute the one-tap (narrowband) mixing filters for the high reverberation time cases, the time-domain filters should first be truncated to a length equal to (or less than) the STFT window length. Based on some pilot experiments, we use the HRIRs (without reverberation) as the truncated filters, which achieves the best results. For source separation, each TF bin is assigned to one of the sources based on the mixing filters; (iii) the FR-SCM method [10]. The FR-SCM of each source was separately estimated using the corresponding source image and kept fixed during the EM iterations, following the line of the semi-oracle experiments in [10].
Performance metrics
The signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR) and signal-to-artefact ratio (SAR) [29], all in decibels (dB), are used as the separation performance metrics. In the following, three sets of experiments are conducted: (i) for various reverberation times, (ii) for various numbers of sources, i.e. with 2, 3, 4 and 5 sources, and (iii) with various NPM settings. For each condition, the metric scores are averaged over 20 mixtures.
The computational complexity of each method is measured by the real-time factor, i.e. the processing time of a method divided by the duration of the processed signal. Note that all methods are implemented in MATLAB.
Results as a function of reverberation time
Fig. 1 plots the performance measures obtained for the four reverberation times. The number of sources is 3. NPM is set to −35 dB, for which the filter perturbation is light, so the CTFs used by CTF-Lasso and the proposed method are very accurate. Therefore, for this experiment, the CTFs are kept fixed during the EM iterations. For the anechoic case, all four methods largely improve the SDR and SIR scores compared to the unprocessed signals, but slightly reduce the SAR score. This indicates that the four methods can efficiently separate the multiple sources, but introduce some artefacts. In particular, BM suffers from more artefacts than the other methods, since the hard assignment of the TF bins to the dominant source largely distorts the less dominant sources. As T60 increases, the performance measures of BM and FR-SCM decrease dramatically, since the length of the RIRs becomes much larger than the STFT window and the narrowband approximation is no longer valid for the high reverberation cases. FR-SCM outperforms BM due to the use of the full-rank covariance matrix, which, however, is only suitable for the low reverberation cases. This is evidenced by the fact that the advantage of FR-SCM over BM is most prominent when T60 is 0.22 s. In contrast to BM and FR-SCM, CTF-Lasso and the proposed EM method achieve good performance. It is somewhat surprising that their performance measures actually increase with the reverberation time. A possible reason is that the long filters carry more information to differentiate and separate the multiple sources; of course, for this to hold, the filters should be accurate enough. Compared to CTF-Lasso, whose outputs are taken as the initial point of the EM algorithm, the EM algorithm improves the SDR by about 1.5 dB for all reverberation times, which indicates that the EM iterations are able to refine the quality of the source estimates.
[Fig. 1 omitted: SDR, SIR and SAR as a function of reverberation time.]
Results as a function of number of sources
Fig. 2 plots the results for various numbers of sources, at a fixed reverberation time. NPM is set to −35 dB, and the CTFs are kept fixed during the EM iterations. As expected, the performance measures of all methods degrade when the number of sources increases. For BM, the WDO assumption for speech sources becomes less valid when more sources are present. For FR-SCM, CTF-Lasso and the proposed EM method, the mutual confusion between sources increases with the number of sources. The performance degradation rate of the four methods is similar to that of the unprocessed signals. Overall, when the CTFs are properly initialised, CTF-Lasso and the proposed EM method achieve good source separation performance even for mixtures of five sources using only two microphones.
[Fig. 2 omitted: SDR, SIR and SAR as a function of the number of sources.]
Results as a function of NPM
To evaluate the proposed method under conditions where the initialised CTFs suffer from a large estimation error, we conducted experiments with various NPM settings, and the results are shown in Fig. 3. Since the initialisation is not accurate, in this experiment the CTFs are updated during the EM iterations in order to refine them. To demonstrate the efficiency of the CTF update, the results with fixed CTFs are also given. As NPM increases, the performance of CTF-Lasso and the proposed EM method degrades considerably. When NPM is larger than −20 dB, the proposed method with fixed CTFs does not improve upon CTF-Lasso, which means that the source estimates cannot be refined based on the inaccurate CTFs. In contrast, the proposed method with updated CTFs does improve upon CTF-Lasso, owing to the refinement of the CTFs in the M-step. The performance measures of CTF-Lasso and the proposed method become close to those of BM and FR-SCM, which indicates that the CTF-based methods are more sensitive to filter perturbations than the narrowband assumption-based methods.
[Fig. 3 omitted: SDR, SIR and SAR as a function of NPM.]
Computational complexity analysis
Table 1 shows the real-time factor of the four methods for the case with three sources and a T60 of 0.5 s. BM is the fastest, since it is a one-step method. The other three are all iterative methods, whose computational complexity is proportional to the number of iterations. FR-SCM and the proposed method are similar in the sense that both are based on EM iterations: estimating the source statistics using a Wiener-like filter in the E-step, and estimating the mixing filters in the M-step. The main difference between them is that the proposed method uses the CTF with a length of Q, while FR-SCM uses a mixing filter with a length of 1. As a result, the proposed method has a much larger complexity than FR-SCM. CTF-Lasso also uses the CTF; however, unlike FR-SCM and the proposed method, in which several matrix inversions are performed (as shown in (10) and (16)), the Lasso optimisation only involves convolution operations. As a result, the real-time factor of CTF-Lasso is lower than those of FR-SCM and the proposed method.
Table 1 Real-time factor of the four methods
BM | FR-SCM | CTF-Lasso | Proposed
0.01 | 81.7 | 54.2 | 630.0
Conclusion
In this work, an EM algorithm has been proposed for speech source separation. The subband convolutive model, i.e. the CTF model, was adopted. To concisely derive the E-step and the M-step, two convolution vector/matrix formulations were used. The CTF model-based methods, i.e. CTF-Lasso and the proposed EM method, outperform the narrowband assumption-based methods, i.e. BM and FR-SCM, in the reverberant cases. The proposed EM algorithm is capable of refining the CTFs and source estimates starting from the output of CTF-Lasso, and thus improves the source separation performance. Only semi-blind experiments were conducted in this work, due to the difficulty of EM initialisation. In the future, a blind CTF identification method could be developed to enable a blind initialisation of the EM algorithm. To this aim, the CTF identification methods proposed in [23, 30] could be combined, which were developed in the contexts of single-source dereverberation and multisource localisation, respectively.
Acknowledgment
This research has received funding from the ERC Advanced Grant VHIA (#340113).
References
[1] Yilmaz O., Rickard S.: 'Blind separation of speech mixtures via time-frequency masking', IEEE Trans. Signal Process., 2004, 52, (7), pp. 1830–1847
[2] Mandel M.I., Weiss R.J., Ellis D.P.: 'Model-based expectation-maximization source separation and localization', IEEE Trans. Audio Speech Lang. Process., 2010, 18, (2), pp. 382–394
[3] Winter S., Kellermann W., Sawada H., et al.: 'MAP-based underdetermined blind source separation of convolutive mixtures by hierarchical clustering and l1-norm minimization', EURASIP J. Appl. Signal Process., 2007, 2007, (1), pp. 81–81
[4] Ozerov A., Févotte C.: 'Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation', IEEE Trans. Audio Speech Lang. Process., 2010, 18, (3), pp. 550–563
[5] Avargel Y., Cohen I.: 'On multiplicative transfer function approximation in the short-time Fourier transform domain', IEEE Signal Process. Lett., 2007, 14, (5), pp. 337–340
[6] Gannot S., Burshtein D., Weinstein E.: 'Signal enhancement using beamforming and nonstationarity with applications to speech', IEEE Trans. Signal Process., 2001, 49, (8), pp. 1614–1626
[7] Li X., Girin L., Horaud R., et al.: 'Estimation of relative transfer function in the presence of stationary noise based on segmental power spectral density matrix subtraction'. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, Australia, 2015, pp. 320–324
[8] Van Trees H.L.: 'Detection, estimation, and modulation theory' (John Wiley & Sons, USA, 2004)
[9] Gannot S., Vincent E., Markovich-Golan S., et al.: 'A consolidated perspective on multimicrophone speech enhancement and source separation', IEEE/ACM Trans. Audio Speech Lang. Process., 2017, 25, (4), pp. 692–730
[10] Duong N., Vincent E., Gribonval R.: 'Under-determined reverberant audio source separation using a full-rank spatial covariance model', IEEE Trans. Audio Speech Lang. Process., 2010, 18, (7), pp. 1830–1840
[11] Kowalski M., Vincent E., Gribonval R.: 'Beyond the narrowband approximation: wideband convex methods for under-determined reverberant audio source separation', IEEE Trans. Audio Speech Lang. Process., 2010, 18, (7), pp. 1818–1829
[12] Arberet S., Vandergheynst P., Carrillo J.-P., et al.: 'Sparse reverberant audio source separation via reweighted analysis', IEEE Trans. Audio Speech Lang. Process., 2013, 21, (7), pp. 1391–1402
[13] Leglaive S., Badeau R., Richard G.: 'Multichannel audio source separation: variational inference of time-frequency sources from time-domain observations'. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, 2017, pp. 26–30
[14] Leglaive S., Badeau R., Richard G.: 'Separating time-frequency sources from time-domain convolutive mixtures using non-negative matrix factorization'. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2017, pp. 264–268
[15] Avargel Y., Cohen I.: 'System identification in the short-time Fourier transform domain with crossband filtering', IEEE Trans. Audio Speech Lang. Process., 2007, 15, (4), pp. 1305–1319
[16] Talmon R., Cohen I., Gannot S.: 'Relative transfer function identification using convolutive transfer function approximation', IEEE Trans. Audio Speech Lang. Process., 2009, 17, (4), pp. 546–555
[17] Li X., Girin L., Horaud R.: 'Audio source separation based on convolutive transfer function and frequency-domain lasso optimization'. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, 2017, pp. 541–545
[18] Li X., Girin L., Gannot S., et al.: 'Multichannel speech separation and enhancement using the convolutive transfer function', IEEE/ACM Trans. Audio Speech Lang. Process., 2018
[19] Talmon R., Cohen I., Gannot S.: 'Convolutive transfer function generalized sidelobe canceler', IEEE Trans. Audio Speech Lang. Process., 2009, 17, (7), pp. 1420–1434
[20] Schwartz B., Gannot S., Habets E.A.: 'Online speech dereverberation using Kalman filter and EM algorithm', IEEE/ACM Trans. Audio Speech Lang. Process., 2015, 23, (2), pp. 394–406
[21] Badeau R., Plumbley M.D.: 'Multichannel high-resolution NMF for modeling convolutive mixtures of non-stationary signals in the time-frequency domain', IEEE/ACM Trans. Audio Speech Lang. Process., 2014, 22, (11), pp. 1670–1680
[22] Higuchi T., Kameoka H.: 'Joint audio source separation and dereverberation based on multichannel factorial hidden Markov model'. IEEE Int. Workshop on Machine Learning for Signal Processing (MLSP), Reims, France, 2014, pp. 1–6
[23] Li X., Gannot S., Girin L., et al.: 'Multichannel identification and nonnegative equalization for dereverberation and noise reduction based on convolutive transfer function', IEEE/ACM Trans. Audio Speech Lang. Process., 2018, 26, (10), pp. 1755–1768
[24] Li X., Girin L., Horaud R.: 'An EM algorithm for audio source separation based on the convolutive transfer function'. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2017, pp. 56–60
[25] Gardner W.G., Martin K.D.: 'HRTF measurements of a KEMAR dummy-head microphone', J. Acoust. Soc. Am., 1995, 97, (6), pp. 3907–3908
[26] Campbell D.: 'The ROOMSIM user guide (v3.3)', https://pimsgrc.nasa.gov/plots/user/acoustics/roomsim/Roomsim%20User%20Guide%20v3p3.htm, 2004
[27] Garofolo J.S., Lamel L.F., Fisher W.M., et al.: 'Getting started with the DARPA TIMIT CD-ROM: an acoustic phonetic continuous speech database', National Institute of Standards and Technology (NIST), Gaithersburg, MD, 1988
[28] Morgan D.R., Benesty J., Sondhi M.M.: 'On the evaluation of estimated impulse responses', IEEE Signal Process. Lett., 1998, 5, (7), pp. 174–176
[29] Vincent E., Gribonval R., Févotte C.: 'Performance measurement in blind audio source separation', IEEE Trans. Audio Speech Lang. Process., 2006, 14, (4), pp. 1462–1469
[30] Li X., Girin L., Horaud R., et al.: 'Multiple-speaker localization based on direct-path features and likelihood maximization with spatial sparsity regularization', IEEE/ACM Trans. Audio Speech Lang. Process., 2017, 25, (10), pp. 1997–2012
Author affiliations: 1 INRIA Grenoble Rhône-Alpes, Montbonnot Saint-Martin, France; 2 Université Grenoble Alpes, Saint-Martin d'Hères, France.