Abstract
This study addresses the problem of under-determined speech source separation from multichannel microphone signals, i.e. the convolutive mixtures of multiple sources. The time-domain signals are first transformed to the short-time Fourier transform (STFT) domain. To represent the room filters in the STFT domain, instead of the widely used narrowband assumption, the authors propose to use a more accurate model, i.e. the convolutive transfer function (CTF). At each frequency band, the CTF coefficients of the mixing filters and the STFT coefficients of the sources are jointly estimated by maximising the likelihood of the microphone signals, which is resolved by an expectation-maximisation algorithm. Experiments show that the proposed method provides very satisfactory performance under highly reverberant environments.
Introduction
Most speech source separation techniques are designed in the short-time Fourier transform (STFT) domain, where the narrowband assumption is generally used, e.g. [1–4]. Under the narrowband assumption, at each frequency band, the time-domain filter is represented by the acoustic transfer function (ATF), and the time-domain convolutive process is transformed into a product between the ATF and the STFT coefficients of the source signal. This assumption is also referred to as the multiplicative transfer function approximation [5]. Based on the ATF or its variants, e.g. relative transfer functions [6, 7], beamforming techniques are widely used for multichannel speech source separation and speech enhancement. Popular beamformers for multisource separation include the linearly constrained minimum variance/power beamformers [6, 8]. Further, because of the spectral sparsity of speech, the microphone signal can be assumed to be dominated by only one speech source in each time–frequency (TF) bin. This is referred to as the W-disjoint orthogonality (WDO) assumption [1]. The binary masking (BM) method [1, 2] and the ℓ1-norm minimisation method [3] exploit this WDO assumption. More examples of narrowband assumption-based techniques can be found in [9] and the references therein.
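As a hedged sketch in our own notation (the paper's displayed equations are not reproduced here): with a_i(k) the ATF of channel i at frequency k, and s(k,p) the source STFT coefficient at frequency k and frame p, the narrowband assumption states

\[ x_i(k,p) \approx a_i(k)\, s(k,p), \]

i.e. the time-domain convolution reduces to a frequency-wise scalar product.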
In real scenarios, the time-domain filter, i.e. the room impulse response (RIR), is normally much longer than the STFT window (frame) length, since the latter should be set sufficiently short to account for the local stationarity of speech. In this case, the narrowband assumption is no longer valid, which leads to unsatisfactory speech source separation performance. In the literature, only a few studies have questioned the validity of the narrowband assumption and attempted to tackle this problem. Under the narrowband assumption, the theoretical covariance matrix of one source image is a rank-one matrix [4]. To mitigate the invalidity of this matrix in practice, a full-rank spatial covariance matrix (FR-SCM) was adopted in [10], even though the narrowband assumption is still used. To circumvent the inaccuracy of the narrowband assumption, the wideband time-domain convolution model was used in [11, 12], where the source STFT coefficients are recovered by minimising the fit cost between the time-domain mixtures and the time-domain convolution model. Meanwhile, based on the Lasso technique, the ℓ1-norm of the source STFT coefficients is minimised as well to impose the sparsity of the speech spectra. This method achieves good performance, but its computational complexity is very large due to the high cost of the time-domain convolution operations. In [13, 14], based on the criterion of likelihood maximisation, a variational expectation-maximisation (EM) algorithm was proposed, also using the time-domain convolution model and the STFT-domain signal model.
In [15], it was shown that the time-domain convolution can be exactly represented in the STFT domain by cross-band filters. More precisely, the STFT coefficients of the source image can be computed by summing multiple convolutions (over frequencies) between the STFT coefficients of the source signal and the STFT-domain filters. Note that the convolution is conducted along the frame axis. To simplify the analysis, for each frequency, only the band-to-band filter, i.e. the convolutive transfer function (CTF) model [16], is used, with the cross-band information omitted. Compared to the narrowband assumption, which uses a frequency-wise scalar product to approximate the time-domain convolution, the CTF model uses a frequency-wise convolution and is thus more accurate. Following the principle of the wideband Lasso [11], a subband Lasso technique based on the CTF model was proposed in [17], which largely reduces the complexity relative to the wideband Lasso technique. In [18], two CTF inverse filtering methods were proposed based on the multiple-input/output inverse theorem (MINT). In [19], the CTF was integrated into the generalised sidelobe canceller beamformer. A CTF-based EM algorithm was proposed in [20] for single-source dereverberation, in which a Kalman filter was exploited to achieve an online EM update. The cross-band filters were adopted in [21], combined with a non-negative matrix factorisation model for the source signal. To estimate the source signals, the likelihood of the microphone signals is maximised via a variational EM algorithm. In [22], also based on likelihood maximisation and EM, an STFT-domain convolutive model was used for source separation, combined with a hidden Markov model (HMM) for source activity. Even though this STFT-domain convolutive model was not named CTF in [22], it actually plays the same role as the CTF.
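To make the cross-band representation concrete, here is a hedged sketch in our own notation: the STFT coefficients of the source image are

\[ x(k,p) = \sum_{k'} \sum_{q} h_{k,k'}(q)\, s(k', p-q), \]

where the filters h_{k,k'} couple frequency band k with every band k', and the convolution runs along the frame index p; the CTF model retains only the band-to-band term k' = k.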
Due to the high model complexity, the above-mentioned source separation techniques that go beyond the narrowband assumption cannot actually be performed in a fully blind manner; in other words, some prior knowledge is required. For example, the RIRs for the wideband Lasso techniques, and the CTFs for the CTF-Lasso and MINT techniques, are required to be known or well estimated. Both the mixing filters and the source parameters are required as a good initialisation for the (variational) EM techniques [13, 14, 21, 22]. A blind multichannel CTF identification method was proposed in [23], and the identified CTF can be fed into the semi-blind methods. However, this CTF identification method is only suitable for the single-source case.
In the present paper, based on the CTF model, we propose a likelihood maximisation method for speech source separation. First, the CTF model is presented in a source mixture probabilistic framework. Then, an EM algorithm is proposed to solve the likelihood maximisation problem. The STFT coefficients of the source signals are taken as hidden variables and are estimated in the expectation step (E-step). The CTF coefficients and source parameters are estimated in the maximisation step (M-step). Experiments show that the proposed method performs better than the narrowband assumption-based methods [1, 10] and the CTF-Lasso method [17] within a semi-blind setup where the mixing filters are initialised with a perturbed version of the ground-truth CTF.
The rest of this paper is organised as follows. Section 2 presents the CTF formulation, which is plugged into a probabilistic framework in Section 3. The proposed EM algorithm is given in Section 4. Experiments are presented in Section 5. Section 6 concludes the paper. This paper is an extension of a conference paper [24]. The main improvements over [24] are: (i) we present the methodology in more detail, including the two vector/matrix formulations in Section 3, the detailed derivation of the EM algorithm in Section 4 and the execution process of the EM algorithm in Algorithm 1; (ii) in Section 5, we add experiments with CTF perturbations and analyse the computational complexity of the proposed method.
CTF formulation
In an indoor (room) environment, a speech source signal propagates to the receivers (microphones) through the room effect. In the time domain, the received source image is given by the linear convolution between the speech source signal and the RIR.
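A hedged sketch of this time-domain convolution, in our own notation with a_i(n) the RIR from the source to microphone i and s(n) the source signal:

\[ x_i(n) = (a_i \ast s)(n) = \sum_{t} a_i(t)\, s(n-t). \]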
We use the CTF model to circumvent the inaccuracy of the narrowband assumption; the source image can then be represented in the STFT domain as follows [16].
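A hedged reconstruction of this CTF model, consistent with [16, 17] and written in our own notation, is

\[ x_i(k,p) \approx \sum_{q=0}^{Q-1} h_i(k,q)\, s(k, p-q), \]

where h_i(k,q) denotes the CTF coefficients of channel i at frequency k, Q is the CTF length in frames, and the convolution runs along the frame index p; the CTF coefficients themselves are derived from the time-domain filter, cf. (3).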
Mixture model formulations
Basic formulation for mixture model
We consider a source separation problem with J sources and I sensors, which could be either underdetermined (J > I) or (over)determined (J ≤ I). Using the CTF formulation (2), in the STFT domain, the microphone signal is given by the mixture model (4).
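A hedged sketch of this mixture model, in our notation with e_i(k,p) an additive noise term:

\[ x_i(k,p) = \sum_{j=1}^{J} \sum_{q=0}^{Q-1} h_{ij}(k,q)\, s_j(k, p-q) + e_i(k,p), \qquad i = 1, \dots, I. \]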
Probabilistic model
In the source separation literature, each source signal is normally assumed to be independent of the other sources, and also independent across STFT frames and frequencies. Each STFT coefficient is assumed to follow a complex Gaussian distribution with a zero mean and a time- and frequency-dependent variance [4, 10], i.e. its probability density function (pdf) takes the standard zero-mean complex Gaussian form.
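As a hedged sketch, writing v_j(k,p) (our notation) for the variance of source j at TF bin (k,p), this pdf is

\[ p\big(s_j(k,p)\big) = \frac{1}{\pi v_j(k,p)}\, \exp\!\left(-\frac{|s_j(k,p)|^2}{v_j(k,p)}\right). \]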
Since the proposed source separation method is carried out independently at each frequency, hereafter, we omit the frequency index k for notational simplicity.
Vector/matrix Formulation 1
To formulate the mixture model (4) more compactly, there are several different choices for organising the signals and the convolution operation in vector/matrix form. To facilitate the derivation of the EM algorithm, we use two different vector/matrix formulations. In this section, Formulation 1 is presented, which enables us to easily derive the M-step; Formulation 2, which is used for the E-step derivation, is presented in the next section. The two formulations differ only in the organisation of the variables and parameters; hence, transforming from the E-step to the M-step, and vice versa, only requires reorganising the vector/matrix elements.
In Formulation 1, we define the source signals in vector form for each source and frame.
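A plausible construction, given as a hedged sketch in our own notation, stacks for each source j and frame p the Q source STFT coefficients involved in the CTF convolution:

\[ \mathbf{s}_j(p) = \big[\, s_j(p),\; s_j(p-1),\; \dots,\; s_j(p-Q+1) \,\big]^{\top}, \]

so that each CTF convolution in (4) becomes an inner product between \mathbf{s}_j(p) and the corresponding CTF coefficient vector.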
Vector/matrix Formulation 2
In Formulation 2, the STFT coefficients of all sources and frames are stacked into a single vector, and the CTF convolution is represented by a convolution matrix; the CTF convolution matrix, the source covariance matrix and the noise covariance matrix are constructed in (6), (8) and (9), respectively.
Expectation-maximisation algorithm
Collecting the source variances for all sources and frames, we obtain the source variance set. The parameter set of the present problem comprises the CTF coefficients and the source parameters, organised either according to Formulation 1 or to Formulation 2. The likelihood of the mixture is maximised by an EM algorithm, in which the parameters are the optimisation variables. Meanwhile, the STFT coefficients of the source signals are taken as hidden variables, whose posterior statistics are inferred, and the posterior mean is taken as the estimate of the source signals. The proposed EM algorithm is summarised in Algorithm 1.
E-step
The E-step is derived based on Formulation 2. Using the parameter estimates given by (16) in the preceding M-step in Formulation 1, we construct the CTF convolution matrix, the source covariance matrix and the noise covariance matrix following (6), (8) and (9), respectively.
In Formulation 2, the posterior distribution of the source signals given the mixture is considered. Since both the prior and the likelihood are Gaussian, the posterior is also Gaussian. The posterior mean and covariance are identified from the exponent of the posterior density, which yields (10).
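For a linear-Gaussian model, these posterior statistics have a standard closed form; as a hedged sketch, with H the CTF convolution matrix, Σ_s the source covariance matrix, Σ_e the noise covariance matrix and x the stacked mixture vector (all in our notation),

\[ \Sigma_{\mathrm{post}} = \left( H^{\mathsf{H}} \Sigma_e^{-1} H + \Sigma_s^{-1} \right)^{-1}, \qquad \boldsymbol{\mu} = \Sigma_{\mathrm{post}}\, H^{\mathsf{H}} \Sigma_e^{-1}\, \mathbf{x}, \]

which is the form one expects (10) to take.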
M-step
The M-step is derived based on Formulation 1. Collecting the multichannel mixture vectors and source signal vectors along frames, we obtain the observation set and the source signal set. The complete-data (observations and hidden variables) likelihood function is then formed.
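As a hedged sketch of its generic form (θ denoting the parameter set in our notation), the Gaussian models make the complete-data log-likelihood separable over frames:

\[ \log p(\mathbf{x}, \mathbf{s};\, \theta) = \sum_{p} \Big[ \log p\big(\mathbf{x}(p) \mid \mathbf{s}(p);\, \theta\big) + \log p\big(\mathbf{s}(p);\, \theta\big) \Big]. \]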
Reformulation: the posterior statistics required by Formulation 1 can be obtained by reorganising the posterior mean and covariance derived in the preceding E-step. The reformulation mainly consists of collecting the elements with the same source and frame indices.
Taking the (complex) derivatives of the expected complete-data log-likelihood with respect to the CTF coefficients (where * denotes the complex conjugate) and the source parameters, and setting them to zero, yields the parameter updates in (16).
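For instance, a hedged sketch of the source variance update, standard for Gaussian EM: the variance is set to the posterior second-order moment of the corresponding source coefficient,

\[ v_j(p) = \mathrm{E}\big[\, |s_j(p)|^2 \,\big] = |\mu_j(p)|^2 + \sigma_j^2(p), \]

where μ_j(p) and σ_j²(p) denote the posterior mean and variance obtained in the E-step (our notation).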
Algorithm 1
EM for multichannel audio source separation (MASS) with CTF
Input: mixture STFT coefficients; initial parameters.
repeat
E-step
1 Construct the CTF convolution matrix, the source covariance matrix and the noise covariance matrix following (6), (8) and (9), respectively.
2 Compute the posterior mean and covariance of the source signals following (10).
3 Compute the posterior second-order moments following (11).
M-step
4 Construct the Formulation 1 vectors and matrices following (14).
5 Update the CTF coefficients and the source parameters following (16).
until convergence
Output: STFT coefficients of the source signals (posterior means).
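To make the execution flow of Algorithm 1 concrete, the following is a minimal Python sketch of the per-frequency EM loop under a stacked linear-Gaussian formulation; all variable names (x, H, v, Sigma_e) and the unstructured filter update are our own illustrative assumptions, not the paper's exact constructions (6), (8)–(16).

```python
import numpy as np

def em_ctf_separation(x, H0, v0, Sigma_e, n_iter=30, update_ctf=True):
    """x: stacked mixture vector; H0: initial CTF convolution matrix;
    v0: initial source variances (one per stacked source coefficient);
    Sigma_e: noise covariance matrix (kept fixed, as in the experiments)."""
    H, v = H0.astype(complex), v0.astype(float)
    Se_inv = np.linalg.inv(Sigma_e)
    for _ in range(n_iter):
        # E-step: Gaussian posterior of the stacked source coefficients
        Sigma_post = np.linalg.inv(H.conj().T @ Se_inv @ H + np.diag(1.0 / v))
        mu = Sigma_post @ (H.conj().T @ Se_inv @ x)
        # Posterior second-order moment, needed by the M-step
        R = Sigma_post + np.outer(mu, mu.conj())
        # M-step: source variances = posterior second moments (floored)
        v = np.maximum(np.real(np.diag(R)), 1e-10)
        if update_ctf:
            # Unstructured least-squares filter update from the posterior
            # statistics (a simplification of the structured CTF update (16))
            H = np.outer(x, mu.conj()) @ np.linalg.inv(R)
    return mu, H, v
```

In the paper, the updates exploit the structure of the CTF convolution matrices; the unstructured update above is only meant to convey the flow of the E- and M-steps.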
Experiments
Experimental configuration
Binaural (two-channel) simulated signals were used to evaluate the proposed EM algorithm. The experiments were conducted under various acoustic conditions, in terms of reverberation time, number of sources and intensity of the room filter perturbation.
Simulation setup
We use a KEMAR dummy head [25] with one microphone embedded in each ear as the recording system. The head-related impulse responses (HRIRs) for a large grid of directions were measured in advance. The ROOMSIM simulator [26] simulates the binaural RIRs (BRIRs) using these HRIRs for both the direct-path wave and the reflections. Four reverberant conditions were simulated, with reverberation times T60 of 0 s (the anechoic case), 0.22, 0.5 and 0.79 s, respectively. The TIMIT [27] speech signals were used as the speech source signals and were convolved with the simulated BRIRs to generate the microphone signals (mixtures). The sampling rate of both the source signals and the microphone signals is 16 kHz, and the length of the source signals is about 3 s. The speech sources were located at different directions in front of the dummy head. The noisy microphone signals were generated by adding spatially uncorrelated stationary speech-like noise to the noise-free signals. One signal-to-noise ratio condition, i.e. 20 dB, was tested. The STFT uses a Hamming window with a length of 1024 samples (64 ms) and a frame step of 256 samples (16 ms). In this experiment, the ground-truth noise covariance matrix is used and is kept fixed during the EM iterations.
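The following sketch reproduces the stated STFT configuration (16 kHz sampling rate, 1024-sample Hamming window, 256-sample step); the use of SciPy here is our own choice of tool, not the authors' implementation (which is in MATLAB).

```python
import numpy as np
from scipy.signal import stft

fs = 16000
win_len, hop = 1024, 256                      # 64 ms window, 16 ms step
x = np.random.randn(3 * fs)                   # placeholder 3 s signal
f, t, X = stft(x, fs=fs, window='hamming',
               nperseg=win_len, noverlap=win_len - hop)
print(X.shape)                                # (n_frequencies, n_frames)
```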
EM initialisation
Depending on what type of prior knowledge is available, the EM algorithm can be initialised either from the E-step or from the M-step. For both choices, the accuracy of the initialisation is crucial for the EM iterations to converge to a good solution. In this experiment, we initialise the EM algorithm from the M-step. Due to the difficulty of blind initialisation, we consider a semi-blind initialisation scheme. The time-domain filters are assumed to be known, from which the CTFs are computed by (3). To fit the realistic situation that the time-domain filters (or CTFs) would actually be blindly estimated and thus suffer from some estimation error, proportional Gaussian random noise is added to the time-domain filters to generate the perturbed filters. The normalised projection misalignment (NPM) [28], in decibels (dB), is used to measure the intensity of the perturbation: the lower the NPM, the less intense the perturbation. To obtain a good initialisation of the source variances, the CTF-Lasso method proposed in [17] is first applied, which solves the following problem.
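A hedged sketch of this CTF-Lasso criterion, in our notation with regularisation weight λ:

\[ \min_{\{s_j\}} \; \frac{1}{2} \sum_{i=1}^{I} \Big\| x_i - \sum_{j=1}^{J} h_{ij} \ast s_j \Big\|_2^2 \; + \; \lambda \sum_{j=1}^{J} \| s_j \|_1, \]

i.e. a least-squares fit of the CTF mixture model regularised by the ℓ1-norm of the source STFT coefficients to promote spectral sparsity.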
Baseline methods
Three baseline methods were used for comparison: (i) the CTF-Lasso method used for initialisation; (ii) the BM method [1], which is based on the narrowband approximation. To make a fair comparison, the narrowband mixing filters are also computed using the known perturbed time-domain filters. However, to compute the one-tap (narrowband) mixing filters for the high reverberation time cases, the time-domain filters should first be truncated to a length equal to (or less than) the STFT window length. Based on some pilot experiments, we use the HRIRs (without reverberation) as the truncated filters, which achieves the best results. For source separation, each TF bin is assigned to one of the sources based on the mixing filters; (iii) the FR-SCM method [10]. The FR-SCM of each source was separately estimated using the corresponding source image and kept fixed during the EM iterations, following the line of the semi-oracle experiments in [10].
Performance metrics
The signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR) and signal-to-artefact ratio (SAR) [29], all in decibels (dB), are used as the separation performance metrics. In the following, three sets of experiments are conducted: (i) for various reverberation times, (ii) for various numbers of sources, i.e. with 2, 3, 4 and 5 sources, and (iii) with various NPM settings. For each condition, the metric scores are averaged over 20 mixtures.
The computational complexity of each method is measured by the real-time factor, i.e. the processing time of a method divided by the duration of the processed signal. Note that all methods are implemented in MATLAB.
Results as a function of reverberation time
Fig. 1 plots the performance measures obtained for the four reverberation times. The number of sources is 3. NPM is set to −35 dB, for which the filter perturbation is light, so the CTFs used by CTF-Lasso and the proposed method are very accurate. Therefore, for this experiment, the CTFs are kept fixed during the EM iterations. For the anechoic case, all four methods largely improve the SDR and SIR scores compared to the unprocessed signals, but slightly reduce the SAR score. This indicates that the four methods can efficiently separate the multiple sources, but introduce some artefacts. In particular, BM suffers from more artefacts than the other methods, since the hard assignment of the TF bins to the dominant source largely distorts the less dominant sources. As T60 increases, the performance measures of BM and FR-SCM decrease dramatically, since the length of the RIRs becomes much larger than the STFT window and the narrowband approximation is no longer valid for the high reverberation cases. FR-SCM outperforms BM due to the use of the full-rank covariance matrix, which, however, is only suitable for the low reverberation cases. This is evidenced by the fact that the advantage of FR-SCM over BM is most prominent when T60 is 0.22 s. In contrast to BM and FR-SCM, CTF-Lasso and the proposed EM method achieve good performance. It is somewhat surprising that their performance measures actually increase with the reverberation time. A possible reason is that the long filters carry more information to differentiate and separate the multiple sources; of course, for this to hold, the filters should be accurate enough. Compared to CTF-Lasso, whose outputs are taken as the initial point of the EM algorithm, the EM algorithm improves the SDR by about 1.5 dB for all reverberation times, which indicates that the EM iterations are able to refine the quality of the source estimates.
[Fig. 1 omitted: SDR, SIR and SAR as a function of reverberation time.]
Results as a function of number of sources
Fig. 2 plots the results for various numbers of sources, at a fixed reverberation time. NPM is set to −35 dB, and the CTFs are kept fixed during the EM iterations. As expected, the performance measures of all methods degrade when the number of sources increases. For BM, the WDO assumption for speech sources becomes less valid when more sources are present. For FR-SCM, CTF-Lasso and the proposed EM method, the mutual confusion between sources increases with the number of sources. The performance degradation rate of the four methods is similar to that of the unprocessed signals. Overall, when the CTFs are properly initialised, CTF-Lasso and the proposed EM method achieve good source separation performance even for mixtures of five sources using only two microphones.
[Fig. 2 omitted: SDR, SIR and SAR as a function of the number of sources.]
Results as a function of NPM
To evaluate the proposed method under conditions where the initialised CTFs suffer from a large estimation error, we conducted experiments with various NPM settings, and the results are shown in Fig. 3. Since the initialisation is not accurate, in this experiment the CTFs are updated during the EM iterations in order to refine them. To demonstrate the efficiency of the CTF update, the results with fixed CTFs are also given. As NPM increases, the performance of CTF-Lasso and the proposed EM method degrades considerably. When NPM is larger than −20 dB, the proposed method with fixed CTFs does not improve upon CTF-Lasso, which means that the source estimates cannot be refined based on the inaccurate CTFs. In contrast, the proposed method with updated CTFs does improve upon CTF-Lasso, owing to the refinement of the CTFs in the M-step. The performance measures of CTF-Lasso and the proposed method become close to those of BM and FR-SCM, which indicates that the CTF-based methods are more sensitive to filter perturbations than the narrowband assumption-based methods.
[Fig. 3 omitted: SDR, SIR and SAR as a function of NPM.]
Computational complexity analysis
Table 1 shows the real-time factor of the four methods for the case with three sources and a T60 of 0.5 s. BM is the fastest, since it is a one-step method. The other three are all iterative methods, whose computational complexity is proportional to the number of iterations. FR-SCM and the proposed method are similar in the sense that both are based on EM iterations: estimating the source statistics using a Wiener-like filter in the E-step, and estimating the mixing filters in the M-step. The main difference between them is that the proposed method uses the CTF with a length of Q, while FR-SCM uses a mixing filter with a length of 1. As a result, the proposed method has a much larger complexity than FR-SCM. CTF-Lasso also uses the CTF; however, unlike FR-SCM and the proposed method, in which several matrix inversions are performed (as shown in (10) and (16)), the Lasso optimisation only involves convolution operations. As a result, the real-time factor of CTF-Lasso is lower than those of FR-SCM and the proposed method.
Table 1 Real-time factor of the four methods
BM | FR-SCM | CTF-Lasso | Proposed
0.01 | 81.7 | 54.2 | 630.0
Conclusion
In this work, an EM algorithm has been proposed for speech source separation. The subband convolutive model, i.e. the CTF model, was adopted. To concisely derive the E-step and the M-step, two convolution vector/matrix formulations were used. The CTF model-based methods, i.e. CTF-Lasso and the proposed EM method, outperform the narrowband assumption-based methods, i.e. BM and FR-SCM, in the reverberant cases. The proposed EM algorithm is capable of refining the CTFs and source estimates starting from the output of CTF-Lasso, and thus improves the source separation performance. Only semi-blind experiments were conducted in this work, due to the difficulty of EM initialisation. In the future, a blind CTF identification method could be developed to enable a blind initialisation of the EM algorithm. To this aim, the CTF identification methods proposed in [23, 30] could be combined, which were developed in the contexts of single-source dereverberation and multisource localisation, respectively.
Acknowledgment
This research has received funding from the ERC Advanced Grant VHIA (#340113).
References
[1] Yilmaz O., Rickard S.: 'Blind separation of speech mixtures via time-frequency masking', IEEE Trans. Signal Process., 2004, 52, (7), pp. 1830–1847
[2] Mandel M.I., Weiss R.J., Ellis D.P.: 'Model-based expectation-maximization source separation and localization', IEEE Trans. Audio Speech Lang. Process., 2010, 18, (2), pp. 382–394
[3] Winter S., Kellermann W., Sawada H., et al.: 'MAP-based underdetermined blind source separation of convolutive mixtures by hierarchical clustering and l1-norm minimization', EURASIP J. Appl. Signal Process., 2007, 2007, (1), pp. 81–81
[4] Ozerov A., Févotte C.: 'Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation', IEEE Trans. Audio Speech Lang. Process., 2010, 18, (3), pp. 550–563
[5] Avargel Y., Cohen I.: 'On multiplicative transfer function approximation in the short-time Fourier transform domain', IEEE Signal Process. Lett., 2007, 14, (5), pp. 337–340
[6] Gannot S., Burshtein D., Weinstein E.: 'Signal enhancement using beamforming and nonstationarity with applications to speech', IEEE Trans. Signal Process., 2001, 49, (8), pp. 1614–1626
[7] Li X., Girin L., Horaud R., et al.: 'Estimation of relative transfer function in the presence of stationary noise based on segmental power spectral density matrix subtraction'. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, Australia, 2015, pp. 320–324
[8] Van Trees H.L.: 'Detection, estimation, and modulation theory' (John Wiley & Sons, USA, 2004)
[9] Gannot S., Vincent E., Markovich-Golan S., et al.: 'A consolidated perspective on multimicrophone speech enhancement and source separation', IEEE/ACM Trans. Audio Speech Lang. Process., 2017, 25, (4), pp. 692–730
[10] Duong N., Vincent E., Gribonval R.: 'Under-determined reverberant audio source separation using a full-rank spatial covariance model', IEEE Trans. Audio Speech Lang. Process., 2010, 18, (7), pp. 1830–1840
[11] Kowalski M., Vincent E., Gribonval R.: 'Beyond the narrowband approximation: wideband convex methods for under-determined reverberant audio source separation', IEEE Trans. Audio Speech Lang. Process., 2010, 18, (7), pp. 1818–1829
[12] Arberet S., Vandergheynst P., Carrillo J.-P., et al.: 'Sparse reverberant audio source separation via reweighted analysis', IEEE Trans. Audio Speech Lang. Process., 2013, 21, (7), pp. 1391–1402
[13] Leglaive S., Badeau R., Richard G.: 'Multichannel audio source separation: variational inference of time-frequency sources from time-domain observations'. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, 2017, pp. 26–30
[14] Leglaive S., Badeau R., Richard G.: 'Separating time-frequency sources from time-domain convolutive mixtures using non-negative matrix factorization'. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2017, pp. 264–268
[15] Avargel Y., Cohen I.: 'System identification in the short-time Fourier transform domain with crossband filtering', IEEE Trans. Audio Speech Lang. Process., 2007, 15, (4), pp. 1305–1319
[16] Talmon R., Cohen I., Gannot S.: 'Relative transfer function identification using convolutive transfer function approximation', IEEE Trans. Audio Speech Lang. Process., 2009, 17, (4), pp. 546–555
[17] Li X., Girin L., Horaud R.: 'Audio source separation based on convolutive transfer function and frequency-domain lasso optimization'. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, 2017, pp. 541–545
[18] Li X., Girin L., Gannot S., et al.: 'Multichannel speech separation and enhancement using the convolutive transfer function', IEEE/ACM Trans. Audio Speech Lang. Process., 2018
[19] Talmon R., Cohen I., Gannot S.: 'Convolutive transfer function generalized sidelobe canceler', IEEE Trans. Audio Speech Lang. Process., 2009, 17, (7), pp. 1420–1434
[20] Schwartz B., Gannot S., Habets E.A.: 'Online speech dereverberation using Kalman filter and EM algorithm', IEEE/ACM Trans. Audio Speech Lang. Process., 2015, 23, (2), pp. 394–406
[21] Badeau R., Plumbley M.D.: 'Multichannel high-resolution NMF for modeling convolutive mixtures of non-stationary signals in the time-frequency domain', IEEE/ACM Trans. Audio Speech Lang. Process., 2014, 22, (11), pp. 1670–1680
[22] Higuchi T., Kameoka H.: 'Joint audio source separation and dereverberation based on multichannel factorial hidden Markov model'. IEEE Int. Workshop on Machine Learning for Signal Processing (MLSP), Reims, France, 2014, pp. 1–6
[23] Li X., Gannot S., Girin L., et al.: 'Multichannel identification and nonnegative equalization for dereverberation and noise reduction based on convolutive transfer function', IEEE/ACM Trans. Audio Speech Lang. Process., 2018, 26, (10), pp. 1755–1768
[24] Li X., Girin L., Horaud R.: 'An EM algorithm for audio source separation based on the convolutive transfer function'. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2017, pp. 56–60
[25] Gardner W.G., Martin K.D.: 'HRTF measurements of a KEMAR dummy-head microphone', J. Acoust. Soc. Am., 1995, 97, (6), pp. 3907–3908
[26] Campbell D.: 'The ROOMSIM user guide (v3.3)', https://pimsgrc.nasa.gov/plots/user/acoustics/roomsim/Roomsim%20User%20Guide%20v3p3.htm, 2004
[27] Garofolo J.S., Lamel L.F., Fisher W.M., et al.: 'Getting started with the DARPA TIMIT CD-ROM: an acoustic phonetic continuous speech database', National Institute of Standards and Technology (NIST), Gaithersburg, MD, 1988
[28] Morgan D.R., Benesty J., Sondhi M.M.: 'On the evaluation of estimated impulse responses', IEEE Signal Process. Lett., 1998, 5, (7), pp. 174–176
[29] Vincent E., Gribonval R., Févotte C.: 'Performance measurement in blind audio source separation', IEEE Trans. Audio Speech Lang. Process., 2006, 14, (4), pp. 1462–1469
[30] Li X., Girin L., Horaud R., et al.: 'Multiple-speaker localization based on direct-path features and likelihood maximization with spatial sparsity regularization', IEEE/ACM Trans. Audio Speech Lang. Process., 2017, 25, (10), pp. 1997–2012
Author affiliations: 1 INRIA Grenoble Rhône-Alpes, Montbonnot Saint-Martin, France; 2 Université Grenoble Alpes, Saint-Martin d'Hères, France.