1. Introduction
Chronic Respiratory Diseases (CRDs) can be defined as disorders of the airways and other physiological structures of the respiratory system. One of the most common CRDs is Chronic Obstructive Pulmonary Disease (COPD) that is responsible for more than 3 million deaths of people each year which is equivalent to 6% of all deaths worldwide [1]. COPD is often characterized by the presence of wheeze sounds since wheezes provide relevant clues that alert about a respiratory disorder [2,3]. Although CRDs currently have no medical cure, early detection of wheezing from auscultation can lead to treatment when the disease is in its early stage, thus improving people’s quality of life. Although there are other clinical alternatives, such as chest radiography and laboratory analysis, auscultation remains the main technique used in most of the health centers in low-income and middle-income countries to provide the first medical diagnosis of the status of the lung due to its low cost, safety and non-invasive nature. Nevertheless, this early detection by the physician depends largely on the subjective diagnosis based on both the training and expertise in interpreting what hears with the stethoscope and the vulnerability to normal respiratory sounds that can mask the presence of sounds of interest, such as wheezing [4]. Today, many researchers continue to investigate in biomedical signal processing to enhance the clarity of the wheezing sounds with the aim that all useful medical information contained in the wheezing sound signal is heard in the process of auscultation.
In general terms, the respiratory sounds can be classified into two main categories: normal and abnormal (adventitious, such as wheezes), according to the Computerized Respiratory Sound Analysis (CORSA) guidelines [5]. Although wheeze and normal respiratory sounds appear simultaneously since both of them are generated by the same air flow through the lungs, normal respiratory sounds are always present in each respiratory cycle since they are automatically generated by the breathing process. However, the occurrence of wheezing sounds is random because of the respiratory disorder so they do not have to be present in all breathing cycles. So, normal respiratory sounds (RS) are generated by healthy lungs and they are represented by broadband spectrum where most of the energy is concentrated in the spectral band 60 Hz–1000 Hz [6]. Wheeze sounds (WS) are abnormal sounds, generated by unhealthy lungs that suffer narrowing of airways, superimposed onto the RS. Therefore, WS can be described as pitched and continuous sounds which usually have a fundamental frequency (pitch) located between 100 Hz–1000 Hz with duration longer than 100 ms, displaying spectral trajectories of narrowband spectral peaks [7] as shown in Figure 1. In this work, any single-channel signal composed of both RS and WS will be referred as mixture.
It is common that the cognitive capacity of the physician is reduced throughout the day as the number of hours spent analyzing respiratory sounds increases, a fact that is exacerbated by the stress to which the physician is subjected to certain medical cases [8,9]. The presence of WS is often associated with obstructions of the airways. However, the interference caused by RS causes the loss of relevant wheezing content in WS which makes it difficult to provide a reliable diagnosis of the status of the lung according to what is being heard through the stethoscope. Sound source separation approaches have been widely applied to overcome this problem by isolating the sounds of interest (target) from those that act as acoustic interference (non-target) [10].
Many biomedical signal processing challenges, such as ambient denoising [11], wheezing detection and classification are still open to the machine learning research community. In [11], a denoising approach is proposed to remove ambient noise from lung sound recordings by means of an adaptive subtraction method that operates in the spectral domain. Focusing on both wheezing detection and classification tasks, the initial works are based on spectral peaks analysis applying thresholding [2,12,13,14,15] that obtain sensitivity/specificity results from 71% to 98%. Like this, Taplidou and Hadjileontiadis [14] proposed a spectro-temporal wheeze detector that automatically locates and identifies wheeze sounds based on spectral trend elimination, separation of the spectrum into frequency bands and peak detection/classification. Most of the wheezing detection and classification approaches are based on the feature extraction and classifier configuration: (i) Musical features and Logistic Regression Model (LRM) [16]; (ii) Spectral features and Support Vector Machine (SVM) such as Power spectral density mean and harmonics [17], Intensity, mean frequency and standard deviation frequency [18], Power spectral band [19], Tonality index [20] and Ensemble Empirical Mode Decomposition (EEMD) [21]; and finally, (iii) Mel Frequency Cepstral Coefficients (MFCC) using K-nearest neighbour (KNN) [22], LRM [23] and Gaussian Mixture Model (GMM) [24], that obtain sensitivity/specificity results from 90% to 99%. Thus, a wheezing detection [20] was developed at the segment level by means of a SVM classifier whose features are the spectral envelope variation and a tonality index. Other works have been focused on the wavelet domain [25,26]. In this context, Ulukaya et al. [26] presented a tunable RAtional Dilation Wavelet Transform (RADWT) based method to discriminate monophonic and polyphonic wheeze sounds by means of localized energy peaks which are calculated from wavelet coefficients. Other studies have applied different types of neural networks (NN) to wheezing sound analysis [27,28,29,30] obtaining the best promising performance in terms of sensitivity and specificity results, specifically, from 86% and 100%. Thus, Lin et al. [27] introduce a method that searches for horizontal or nearly horizontal edges of the spectrogram and a back-propagation neural network (BPNN) classifier is applied using features such as, frequency range and the slope of the potential wheeze. However, wheezing detection and classification tasks could be improved applying sound source separation techniques as a preliminary step since these techniques can increase the clarity of the wheezing content hidden in the signal being auscultated. Although very few works [31,32] have addressed in depth the separation of wheezing sound sources to the best of our knowledge, all of them are based on Non-negative Matrix Factorization (NMF) since NMF is a recent and promising tool that can extract hidden sound events with physical interpretation in nature. Specifically, Torre et al. [31] present a constrained NMF approach to separate wheezes from respiratory sounds applied to single-channel mixtures. The proposed constraints, smoothness and sparseness, model common spectral behavior shown by wheezes and normal breath sounds. Results report that the proposed method improves the acoustic quality of the wheezes removing most of the respiratory sounds.
In this paper, an extended version of Non-negative Matrix Partial Co-Factorization (NMPCF) is proposed to suppress RS while preserving the wheezing acoustic content. Here, we assume that RS can be considered as repetitive sound events during breathing so, RS can be modeled by sharing together the spectral patterns found in each respiratory stage (segment), inspiration or expiration, with a respiratory training signal. However, this sharing of patterns can not be applied to wheezes since WS could not be present at each segment due to their unpredictable nature in time motivated by the pulmonary disorder. To improve the sound separation performance of the conventional NMPCF that treats equally all segments of the spectrogram, the main contribution of the proposed method adds higher importance to those segments classified as non-wheezing using inter-segment information informed by a wheezing detection system. As a result, our proposal is able to characterize RS more accurately by forcing to model more on those non-wheezing segments in the bases sharing process into the NMPCF decomposition.
The rest of this paper is structured as follows. First, Section 2 briefly reviews the background of the most relevant approaches based on Non-negative Matrix Factorization and Non-negative Matrix Partial Co-Factorization. Section 3 details the proposed method. Section 4 discusses the evaluation and the experimental results. Finally, conclusions and further research are presented in Section 5.
2. Background 2.1. Non-Negative Matrix Factorization
Non-negative Matrix Factorization (NMF) [33,34] is a rank-reduction method that has been widely applied to learning images [35] and audio [36]. NMF includes the non-negativity constraint to recover hidden patterns of the input data using basis and activation matrices. Considering a monaural input mixturex(t), composed of sources of interest (target)xW(t)and non-target sourcesxR(t), NMF factorizes the input spectrogramXinto the product of two non-negative matrices: basis matrixU∈R+F×Kand activation matrixV∈R+K×T as shown in Equation (1). We assume an approximate linear additivity between the input spectrogramsXW∈R+F×TandXR∈R+F×T. The subscript W is often used to refer the sounds of interest and the subscript R is applied to the sounds that act as acoustic interference,
X=XW+XR≈X^=X^W+X^R=UV=UWURVWVR=UW VW+UR VR
obtaining the estimated spectrogramsX^∈R+F×T,X^W∈R+F×T,X^R∈R+F×Twith F frequency bins and T frames using K bases and the corresponding time-varying activations. Therefore,Ucan be interpreted as a dictionary of spectral bases or patterns that represents the frequency information associated to the target and non-target sources active in the input spectrogram. Instead,Vrepresents a matrix of activations that indicates the activity of each spectral basis in a given frame.
NMF is often calculated using an iterative algorithm, based on multiplicative update rules [33], to obtain those parameters that reduce the cost functionD(X|X^)based on penalizing the error reconstruction betweenXandX^. In this paper, the generalized Kullback-Liebler divergenceDKL(X|X^) [37] has been applied because it confirms the non-negativity ofUandV as can be observed in Equations (3) and (4). In addition, recent works [32,38] report thatDKL(X|X^)can be used in biomedical signal processing to achieve promising results,
DKLX|X^=XlogXX^−X+X^
Uz←Uz⊙X⊘UVVzT⊘(1VzT),z=W,R
Vz←Vz⊙UzTX⊘UV⊘(UzT1),z=W,R
whereUW∈R+F×KW,UR∈R+F×KR,VW∈R+KW×TandVR∈R+KR×Tare initialized as random positive matrices,1∈R+F×Trepresents an all-ones matrix,Tis the transpose operator, ⊙ is the element-wise multiplication, ⊘ is the element-wise division andK=KW+KRindicates the number of bases, being KWthe number of bases related to the sounds of interest andKRthe number of bases related to the acoustic interference.
The main drawbacks shown by NMF can be summarized in the following three points: (i) poor signal quality when the iterative algorithm reaches a poor local minimum; (ii) NMF can not reconstruct each source because it does not have enough information to cluster all the bases generated by the same source; (iii) NMF does not guarantee a parts-based objects reconstruction with physical meaning as occurs in nature [39]. To overcome this problem, three approaches have been widely proposed in literature [40]: (i) supervised NMF (SNMF) [41,42] in whichUWandURare learned in advanced by means of training and fixed during the iterative process. As a result, only the activations matricesVWandVR are updated; (ii) semi-supervised NMF (SSNMF) [43,44] in whichURis learned in advanced by means of training and fixed during the iterative process. As a result,VW,VRandUW are updated; and (iii) constrained NMF (CNMF) in which no training is used because different constraints are included into the factorization procedure to model the specific time-frequency characteristics of the sources to extract [45,46].
To sum up, SNMF, SSNMF and CNMF find better solutions compared to NMF since all of them model, into the bases or activations obtained from the factorization, temporal or spectral behaviors shown by the sounds, that are intended to be recovered, in nature. Nevertheless, the main disadvantages observed in both SNMF and SSNMF are the following: (i) highly dependent of the training data so, the separation performance is limited to the spectral similarity between the training and sounds contained in the input mixture and; (ii) there may not be public training databases available. On the other hand, the main disadvantage observed in constrained NMF approaches, such as CNMF is the difficulty of mathematically defining both the constraints that correctly model the temporal and spectral behaviors shown by the target sources and their incorporation into the cost function on which the factorization is based [47].
2.2. Non-Negative Matrix Partial Co-Factorization
Non-negative Matrix Partial Co-Factorization (NMPCF) has been used in several audio processing tasks, such as extraction of rhythmic sources [48,49,50], singing-voice separation [51] or speaker diarization [52]. The main idea of NMPCF is to apply a joint matrix factorization using multiple input matrices to obtain a set of shared spectral bases or temporal activations.
In general, NMPCF-based methods can be classified into four approaches: (i) semi-supervised factorization (1S-NMPCF) [50] in which a joint decomposition, considering the input mixture and a training matrix related to repetitive sounds, is performed by sharing some bases active in both of them [48]; (ii) supervised factorization (2S-NMPCF) in which a joint decomposition, considering the input mixture and two training matrices related to repetitive and non-repetitive sounds, is performed by sharing some bases active between each training matrix and the input mixture [51]; (iii) unsupervised factorization (T-NMPCF) [50] in which a joint decomposition using multiple shorter segments from the input mixture is obtained factorizing them into repetitive sound events by finding common bases across segments [49]; and (iv) semi-supervised factorization (ST-NMPCF) [50] in which a joint decomposition of the input mixture is performed using a training matrix associated to repetitive sound events and multiple shorter segments to make advantage of both spectral and temporal modelling of repetitive sounds.
However, NMPCF-based approaches treat all segments of the input mixture decomposition together equally, ignoring the importance of each specific segment in the modelling of the repetitive and non-repetitive sounds. As a result, it could be interesting to investigate how to include the importance of different segments according their spectral content to weight the spectral modelling of the repetitive sounds in the joint factorization and as a consequence, to improve the separation quality of the sounds of interest. 3. Proposed Method
The aim of the proposed method is to enhance the quality of the WS by removing the RS that implicitly appear in the human breathing process. In order to improve the separation performance between WS and RS of the NMF-based and NMPCF-based baseline methods, we propose a modified NMPCF approach denominated Informed Inter-Segment Non-negative Matrix Partial Co-Factorization (IIS-NMPCF) that adds higher importance into the NMPCF factorization to those segments in which WS are not present. For this purpose, IIS-NMPCF consists of three stages: (i) Segmentation; (ii) Classification between presence/absence of WS and finally (iii) Adding weighting into the NMPCF decomposition. The flowchart of the proposed method is shown in Figure 2, and details are depicted in the following Section 3.1 and Section 3.2.
3.1. Time-Frequency Signal Representation
Letx[n]denote the n-th sample of a mixture signal, which consists of the sum of wheezingw[n]and normal respiratory soundsr[n]. The magnitude spectrogramXof a mixture signalx[n]can be represented asX=XW+XR, beingXWthe magnitude spectrogram of only WS andXRthe magnitude spectrogram of only RS. Each unitXf,tis defined by the f-th frequency bin at the t-th frame and is calculated from the magnitude of the Short-Time Fourier Transform (STFT) using a Hamming window of N samples with 25% overlap. A normalization process is applied in order to ensure that the proposed method can be independent of the size and scale of the magnitude spectrogramX. To avoid complex nomenclature throughout the paper, the variableXis hereinafter referred to the normalized magnitude spectrogramX¯computed as follows,
X¯=X∑f,t Xf,tFT
Besides,y[n] denote the n-th sample of the respiratory training signal, which consists of a concatenation of different respiratory stages composed only of RS (for more details see Section 4.3). The magnitude spectrogramYof the respiratory training signaly[n]has been calculated following the same procedure used with the previous magnitude spectrogramX.
3.2. Wheezing Sound Separation Using Informed Inter-Segment NMPCF The key assumptions behind the proposed method IIS-NMPCF to apply WS and RS source sound separation are the following:
(i)
RS are often characterized by similar spectral patterns that represent a wideband noise spectrum showing time and frequency smoothness [32]. In this way,Ycan be useful to replicate these similar RS spectro-temporal behaviors observed in most of the subjects.
(ii)
In addition, RS can be considered as repetitive events in human breathing so, RS can be modeled sharing common spectral patterns that can be found throughout all breathing stages (segments), that is, some basis vectors can be shared during the inter-segment analysis due to the repeatability of RS. If we divide the input mixture spectrogramXinto segmentsX(1),X(2), …,X(L), we can get L-segments from the given mixturex[n] that share common spectral patterns. For this purpose, we have used AMIE_SEG [53] that automatically allows to segment the mixture spectrogramXinto inspiratory and expiratory stages.
(iii)
However, WS can be present or absent in the respiratory stages due to the pulmonary disorder. Therefore, we can define an indicatorC(l)to distinguish between non-wheezing (C(l)=0) and wheezing (C(l)=1) segments. Note that the term(l)refers to the segment identifierl=1,…,Lof the mixture spectrogramX . In the case of wheezing segments, the spectral patterns of both RS and WS are present. For this reason, we propose to weight the importance of wheezing and non-wheezing segments into the conventional NMPCF decomposition to improve the wheezing sound separation performance. The classification between non-wheezing and wheezing segments is provided by a wheezing detection algorithm previously developed by authors [54].
Considering two input spectrogramsXandY, the factorization of the conventional ST-NMPCF lets the common respiratory basis vectorsURbe shared jointly between the spectrogramYand L-segmentsX(1),X(2), …,X(L)of the input spectrogramX (see Figure 3),
X(l)≈X^(l)=X^R(l)+X^W(l)=URdiag(DxR(l))VR(l)+UW(l)diag(DxW(l))VW(l)
Y≈Y^=URdiag(DyR)HR
whereX^,Y^are the estimated or reconstructed spectrograms of the input mixture and the respiratory training signal;X^R,X^Ware the estimated spectrograms of the RS and WS;UR,UWare the estimated basis matrices of the RS and WS;VR,VWare the estimated activation matrices of the RS and WS for the mixture;HRis the estimated activation matrix of the RS for the respiratory training signal. All of these matrices are non-negative matrices. The number of respiratory and wheezing components will be denoted asKRandKW, respectively. TheL2-norm of each column ofURorUWis equal to 1.0. The termsDxRandDxWrepresent vectors with theL2-norm of each activation component of RS and WS, respectively. Similarly, the termDyRrepresents a vector with theL2-norm of each activation component of RS. Therefore, theL2-norm of each row ofVR,VWorHRbe equal to 1.0 due to the normalization procedure at each iteration. The operatordiag()is the diagonal matrix.
Figure 3 depicts those models with L-segments of the mixture spectrogramXand the respiratory training spectrogramY . As mentioned in the key assumption (i), Equation (7) models the respiratory training reconstruction by letting the estimated basis matrixUR to contain spectral patterns that define the common behavior of RS. As mentioned in the key assumption (ii), Equation (6) aims to learn the common basis vectorsURof L-segmentsX(1),X(2), …,X(L)to model repetitive spectral components throughout the segments, since RS can be considered as repetitive sound events in human breathing. On the other hand,UW(l)is responsible for recovering WS that can be contained in each segment. Combining the two previous factorization models,URcan model both spectral characteristics of the respiratory trainingYand temporally repeating components belonging to the segmentsX(1),X(2), …,X(L). Considering the previous assumption (iii), the main contribution of the proposed method is to give greater importance, by means of weighting, to those segments classified as non-wheezing (C(l)=0) in the NMPCF decomposition to learn more accurate the common basis vectorsUR since these segments will not be interfered by WS so, the spectral modelling of RS will be more acoustically reliable. In Figure 3, the segmentsX(2)andX(L)are classified as non-wheezing segments.
The objective function of the proposed method IIS-NMPCF can be constructed to minimize the residuals of the models (6) and (7),
ΓIIS−NMPCF=∑l=1LλC(l) DKLX(l)|X^(l)+LDFUR|0+∑l=1LDFUW(l)|0︷ObjectivefunctionappliedtothesetofL−segmentsoftheinputmixture++αDKLY|Y^+DFUR|0︷Objectivefunctionappliedtotherespiratorytraining
whereDKLis the Kullback–Leibler divergence used to calculate the signal reconstruction error for each segmentDKLX(l)|X^(l)and the respiratory training spectrogramDKLY|Y^. The penalization termDF represents the Frobenius norm applied to each basis matrix in order to prevent basis vectors from convergence to too small values [50]. The weighting factorλC(l) controls the relative importance of each segment matrixX(l)depending on the type of segment, wheezing (C(l)=1) or non-wheezing (C(l)=0), in the factorization model. The weighting factorαcontrols the relative importance of the respiratory training matrixYin the factorization model.
Highlight that the weighting factorλC(l) plays a crucial role in the proposed method. The reason is becauseλC(l) controls the importance of which segments are more relevant in the modelling of the spectral patterns related to RS, specifically, those segments in which WS are not detected. Therefore, the following considerations about the parameterλC(l) must be taken into account:
(a)
According to the estimated basis matrixURorUW(l), the weighting factorλC(l) can be classified asλRC(l) orλWC(l) , respectively. As mentioned above, WS are always overlapped with RS so, we assume that none of the segments will model the behaviour of WS better than another. However, RS can be found isolated in some segments of human breathing due to the unpredictable nature of the pulmonary disorder. In this case, those segments in which WS are not contained will be more relevant to model the behaviour of RS. In this manner,λWC(l) will set the same value for all segments, that is,λWC(l) =λW,l=1,…,LandλRC(l) will be variable depending on the type of segment, wheezing (C(l)=1) or non-wheezing (C(l)=0), is analyzed. In addition, the value assigned to the weighing factors must satisfyλRC(l) >λW (see Section 4.4) since RS are always present in all segments of the input mixture and WS may not be.
(b)
Focusing on the type of segment indicated by the parameterC(l), the weighting factorλRC(l) can be classified asλR0orλR1. The parameterλR0is associated with the non-wheezing segments (C(l)=0) andλR1is associated with the wheezing segments (C(l)=1). This allows to give greater importance to non-wheezing segments for the modeling of respiratory basisUR. As consequence, the value assigned to the weighing factors must satisfyλR0>λR1 (see Section 4.4).
Given the above, the estimated basis matricesUR,UW(l)and activations matricesVR(l),VW(l),HRcan be obtained by applying a gradient descent algorithm based on multiplicative update rules as follows,
UR←UR⊙∑l=1LλRC(l) X(l)⊘X^(l)diagDxR(l)VR(l)T+αY⊘Y^diagDyRHRT∑l=1LλRC(l) 1F,T diagDxR(l)VR(l)T+α1F,T diagDyRHRT+2L+1UR
UW(l)←UW(l)⊙λWX(l)⊘X^(l)diagDxW(l)VW(l)TλW 1F,T diagDxW(l)VW(l)T+2UW(l)
VR(l)←VR(l)⊙URdiagDxR(l)TX(l)⊘X^(l)URdiagDxR(l)T 1F,T
VW(l)←VW(l)⊙UW(l)diagDxW(l)TX(l)⊘X^(l)UW(l)diagDxW(l)T 1F,T
HR←HR⊙URdiagDyRTY⊘Y^URdiagDyRT 1F,T
The set of matricesUR,UW(l),VR(l),VW(l),HR are obtained updating the rules (9)–(13) until the algorithm converges or reaches a maximum number of iterations M. At each iteration, the activation matricesVR(l),VW(l),HRand the basis matricesUR,UW(l)must be normalized applying theL2 -norm (see Equation (14)). As a result,DxR,DxW,DyRmust be updated multiplying by theL2 -norm obtained at each previous normalization (see Equation (15)). The normalization process ensures that both the sum of the square elements of each k-th column of the basis matricesUR,UW(l)and the sum of the square elements of each k-th row of the activation matricesVR(l),VW(l),HR equals 1.0 [46].
G(k)=G(k)∑G2(k)
DJ(k)=DJ(k)∑G2(k)
where (G, J, k) = {(UR, R,kR), (UW(l), W,kW), (VR(l), R,kR), (VW(l), W,kW), (HR, R,kR)} respectively. If we consider the basis matrixG=(UR,UW(l))→∑G2(k)=∑f=1F G2(f,k). If we consider the activation matrixG=(VR(l),VW(l),HR)→∑G2(k)=∑t=1T G2(k,t).
After the updating process, the estimated spectrogramsX^R(l)andX^W(l)for each segment can be reconstructed as follows:
X^R(l)=URdiag(DxR(l))VR(l)
X^W(l)=UW(l)diag(DxW(l))VW(l)
Note thatX^R(l)andX^W(l) must be denormalized by multiplying by the denominator of Equation (5). A Wiener filtering [32,55] has been applied in order to ensure a conservative signal reconstruction and to obtain the estimated complex wheezing and respiratory spectrogram of each segment.X^R andX^Ware obtained by concatenating the estimated complex spectrograms of each segment,X^R=X^R(1),X^R(2),…,X^R(L)andX^W=X^W(1),X^W(2),…,X^W(L), respectively. Finally, the inverse overlap-add STFT is applied to synthesize the estimated RS signalr^[n]and the estimated WS signalw^[n]in time domain using the phase of the input mixture. The wheezing/normal respiratory sound separation procedure is summarized in Algorithm 1.
Algorithm 1 Wheezing sound separation using IIS-NMPCF. |
Require:x[n],y[n],KR,KW,λR0,λR1,λW,αand M. 1) Compute the normalized magnitude spectrogramXof the mixturex[n]. 2) Compute the normalized magnitude spectrogramYof the trainingy[n]. 3) Divide the spectrogramXinto L-segmentsX(1),X(2), …,X(L) using AMIE_SEG [53]. 4) Classify the L-segments into wheezing (C(l)=1) and non-wheezing (C(l)=0 ) using a wheezing detection algorithm [54]. 5) Initialize each activation and basis matrixUR,UW(l),VR(l),VW(l),HRwith random non-negative values. 6) Update each activation and basis matrixUR,UW(l),VR(l),VW(l),HR using Equations (9)–(13) for the predefined number of iterations M. At each iteration, normalize each activation and basis matrixUR,UW(l),VR(l),VW(l),HRand update the termsDxR,DxWandDyR using Equations (14) and (15). 7) Compute the estimated magnitude spectrogramsX^R(l)andX^W(l). 8) Denormalize the estimated magnitude spectrogramsX^R(l)andX^W(l). 9) Apply a Wiener filtering [32] onX^R(l)andX^W(l). 10) Concatenate all the estimated complex respiratory spectrograms:X^R=X^R(1),X^R(2),…,X^R(L). 11) Concatenate all the estimated complex wheezing spectrograms:X^W=X^W(1),X^W(2),…,X^W(L). 12) Synthesizer^[n]. 13) Synthesizew^[n]. returnr^[n]andw^[n] |
4. Experimental Results 4.1. Dataset and Metric
Because there is no public database where only wheeze sounds can be found to the best of our knowledge, two datasets P1 and T1 (T1H, T1M and T1L), detailed in Table 1, have been used in the evaluation of the proposed method with a total of 64 recordings considering the two databases. Specifically, the database P1 consists of 48 recordings (that is, 3/4 of the total recordings used in the experiments) and the database T1 consists of 16 recordings (that is, 1/4 of the total recordings used in the experiments). The dataset P1 has been used in the hyperparametric optimization process (see Section 4.4) while the dataset T1 has been used in the separation testing (see Section 4.5). The databases P1 and T1 have been created by collecting a set of recordings from different subjects of the most widely used Internet pulmonary repositories [56,57,58,59,60,61,62,63,64,65,66,67,68]. These recordings, captured from the trachea, anterior, and posterior chest using either a stethoscope or microphone, were collected from subjects with different pathologies, including asthma, bronchitis or COPD. The databases P1 and T1 have been created by randomly selecting recordings from the above-mentioned repositories. It must be highlighted that P1 is not a part of T1 in order to validate the results. Therefore, the recordings selected for the database P1 are not the same as the recordings selected for the database T1. In total, these databases provide 1474 s of recording, 96 unhealthy subjects, 874 respiratory events (a respiratory event is defined as inspiration or expiration) and 133 wheezes. Note that each recording has been created using single-channel configuration, a sampling rate equals 2048 Hz and a bit resolution of 16 bits.
Specifically, the datasets P1 and T1 (T1H, T1M and T1L) have been created mixing only WS recordings manually separatedw[n], in which respiratory sounds are inactive, and only RS recordingsr[n], in which wheezing sounds are inactive, obtained from the above-mentioned repositories. Highlight that wheezing sounds cannot be recorded isolated since WS are always overlapped with RS, that is, both sounds are produced by the same bronchial tree in the lungs. To do this, a MATLAB tool, designed by the authors, has been used to visually modify the spectrogram values. Specifically, this tool behaves as an eraser that allows us, by means of the mouse, to set to zero those bins of the spectrogram that we observe that do not belong to a wheeze sound, a fact that is also verified by a listening inspection of the resulting signal. Therefore, only the bins corresponding to WS have been kept active for each signalw[n]. Both the fundamental component of WS and its corresponding harmonics have been considered. Note that the recordings used to create the database P1 are different from those used to create the database T1.
The datasets T1H (SNR = 5 dB), T1M (SNR = 0 dB) and T1L (SNR = −5 dB) are composed of the same set of signalsw[n]andr[n]but they have been mixed using a different Signal-to-Noise Ratio (SNR). Specifically, T1H is composed of mixtures in which the power ofw[n]is 5 dB greater compared tor[n]so, WS are louder than RS. The dataset T1M is composed of mixtures in which the power of bothw[n]andr[n]is the same so, both type of sounds is similarly audible. Finally, the dataset T1L is composed of mixtures in which the power ofw[n]is 5 dB lower compared tor[n]so, RS are louder than WS. Note that in each mixture process, the power related tow[n]andr[n]are calculated and the signal with the highest power is left fixed while the signal with the lowest power is scaled to obtain the desired SNR in order to avoid audio saturation or distortion in the signal scaling process.
To assess the sound separation performance of the proposed method, the BSS EVAL toolbox [69,70] has been applied because it is widely used in the field of sound source separation. The metrics used are the following: (1) Source-to-distortion ratio (SDR), which provides information on the overall quality of the separation process; (2) Source-to-interferences ratio (SIR), which reports the presence of WS contained in RS and vice versa; and (3) Source-to-artifacts ratio (SAR), which provides information on the artifacts in the separated signal from separation and/or resynthesis. The principle to obtain the value of these metrics is to decompose the total error, between the estimated target signals^[n]and the original target signals[n] , in three terms related to three types of error, as follows [70]:
s^[n]−s[n]=esinterf[n]+esartifacts[n]+esspatial[n]
whereesinterf[n]is the error term related to the interference produced by the unwanted sources;esartifacts[n]is the error term attributed to the artifacts generated by the separation algorithm; andesspatial[n]is the error term attributed to spatial distortion. We can now define the SDR, SIR and SAR values, expressed in dB, as follows:
SDR=10log10s[n]2 esinterf[n]+esartifacts[n]+esspatial[n]2
SIR=10log10s[n]2 esinterf[n]2
SAR=10log10s[n]+esinterf[n]+esspatial[n]2 esartifacts[n]2
Note that the term s indicates the target signal to be analyzed. In this article s could be the wheezing signals (s=w) and the respiratory signals (s=r). Therefore, in the case of the wheezing signals (s^[n],s[n],esinterf[n],esartifacts[n],esspatial[n]) = (w^[n],w[n],ewinterf[n],ewartifacts[n],ewspatial[n]) and in the case of the respiratory signals (s^[n],s[n],esinterf[n],esartifacts[n],esspatial[n]) = (r^[n],r[n],erinterf[n],erartifacts[n],erspatial[n]). The estimated signalsw^[n],r^[n]are obtained by the separation algorithm, the original signalsw[n],r[n] are obtained from the original separated signals used in the creation of the mixtures of the databases and the error terms are obtained using the BSS EVAL toolbox. We refer the reader to [70] for more details.
In this article, three different sets of SDR, SIR and SAR metrics will be analyzed as follows: (i) SDRw, SIRwand SARware referred to WS, (ii) SDRr, SIRrand SARrare referred to RS; and (iii) SDRmis associated to the average considering SDRwand SDRr, SIRmis associated to the average considering SIRwand SIRr, and SARmis associated to the average considering SARwand SARr.
4.2. Experiments Setup
According to the results obtained in similar works [32,54] related to wheezing sound analysis, the following parameters provided the best trade-off between the separation performance and the computational cost: sampling ratefs=2048Hz, Hamming window withN=256samples length and 25% overlap (temporal resolution of 31.3 ms), and a discrete Fourier transform using 2N points.
The performance of the proposed method depends on the initial values with which each activation and basis matrix is initialized. For this reason, we have evaluated four times each input mixture with the proposed method and therefore, the results are averaged values. Furthermore, the convergence of the proposed method was empirically achieved after 50 iterations for all mixtures, so M = 50 iterations. 4.3. Comparison Methods
A set of reference baseline sound source separation methods have been compared to assess the sound separation performance achieved by the proposed method (IIS-NMPCF). As mentioned in Section 2, these methods can be divided into two groups: (i) NMF-based methods (NMF, SNMF, SSNMF and CNMF); and (ii) NMPCF-based methods (1S-NMPCF, 2S-NMPCF, T-NMPCF and ST-NMPCF). Highlight that the main parameters of the previous baseline methods have been optimized using the database P1. However, the following considerations must be taken into account to a fair comparison:
-
A training signaly[n], created to simulate the behavior of RS, is used in the baseline methods SNMF, SSNMF, 1S-NMPCF, 2S-NMPCF, ST-NMPCF and the proposed method IIS-NMPCF. The training signaly[n] has been created by concatenating randomly a set of normal respiratory stages only composed of RS obtained from the previously mentioned Internet pulmonary repositories [56,57,58,59,60,61,62,63,64,65,66,67,68]. Specifically, the signaly[n]has a temporal duration of 128 s and 54 respiratory stages (inspiration or expiration). Note that the normal respiratory stages used to construct y[n] do not correspond to any of the respiratory stages used in the databases P1 or T1.
-
SNMF and 2S-NMPCF must use a training signal to simulate the behaviour of wheezing sounds. Taking into account that WS can be defined as continuous adventitious sounds that show a pitched sound (see Section 1), a signalz[n] has been created by concatenating a set of single pitches located along the frequency band 100 Hz–1000 Hz in which WS are typically present. Each pitch is represented by a sinusoidal signal multiplied by a Hamming window of N samples. The distance between the frequencies of each pitch is equal to the value provided by the spectral spacing of the model. Considering that all evaluated methods have used the same parameters previously mentioned in Section 4.2, the spectral spacing equals to 4 Hz.
-
T-NMPCF and ST-NMPCF as well as IIS-NMPCF has been implemented using AMIE_SEG [53] to divide the input spectrogramXinto the L-segmentsX(1),X(2), …,X(L).
-
CNMF has been evaluated using its optimal parameters found in [32].
4.4. Optimization
The proposed method employs a wide range of parametersKR,KW,α,λW,λR0andλR1that can affect significantly the separation performance and the reconstructed sound quality. A hyperparametric optimization procedure has been applied to the main parameters of the proposed method IIS-NMPCF to obtain the optimal parameters that maximize the audio quality of the estimated wheezing signalw^[n]. In this work, a preliminary evaluation using visual inspection reduced the parameter space as follows:KR= (8, 16, 32, 64, 128, 256, 512),KW= (8, 16, 32, 64, 128, 256, 512),α= (0, 0.01, 0.1, 1, 10, 100),λW= (0.001, 0.01, 0.1, 1, 10),λR0= (0.001, 0.01, 0.1, 1, 10, 100) andλR1= (0.001, 0.01, 0.1, 1, 10, 100).
The hyperparametric procedure is performed for each mixture of the dataset P1 in order to obtain the audio quality of the estimated wheeze signalw^[n]in terms of SDRw, SIRwand SARw. This procedure has been computed by evaluating all the possible combinations of the parametersKR,KW,α,λW,λR0andλR1that can be found within the parameter space defined above, providing the SDRw, SIRwand SARw average values for each combination of parameters. Table 2 shows the optimal combination of the previous parameters that provides the best separation performance in terms of SDRw . Specifically, the optimal parameters corroborate our previous assumptions described in Section 3.2: (i) the highest weighting factorλR0=10is due to the high importance of the non-wheezing segments in the factorization of the respiratory bases since RS can be modeled by sharing spectral patterns that can be found in all non-wheezing segments during the breathing process; (ii) the second highest weighting factorα=1is associated with the training signal since RS typically show common spectral behavior; (iii) the low weighting factorλR1=0.1is associated with the wheezing segments in the factorization of the respiratory bases since WS can interfere in the RS reconstruction; and (iv) the lowest weighting factorλW=0.01is due to none of the L-segments is only composed by isolated WS.
Focusing on the parameter space defined above and keeping the optimal parameters shown in Table 2, the aim of the rest of the section is to analyze the stability and efficiency of the proposed method when its main parametersKW,KR,α,λW,λR0andλR1are distanced from the optimal values.
Figure 4 shows the SDRwresults varying the number of respiratoryKRand wheezingKW components. Figure 4 shows that the difference, in terms of SDRw, between the configuration of the parametersKRandKWthat provides the best performance (SDRw= 16.99 dB) and the worst performance (SDRw= 14.01 dB) is approximately 3 dB. Therefore, the proposed method is stable within the defined parameter spaceKWandKRsince the maximum loss that the algorithm can suffer is less than 3 dB regardless of the number of wheezingKWand respiratoryKRcomponents evaluated. Besides, the difference in SDRwresults is marginal (less than 0.2 dB) either usingKW≥256andKR≥256or (less than 0.3 dB) usingKW≤16andKR≤16. Highlight that the proposed factorization model needs a minimum of respiratory and wheezing components so that WS and RS can be modelled correctly. An empirical analysis showed that the SDRwresults start to drop significantly whenKW<16andKR<16 . Figure 4 shows that SDRwresults increase when the number of wheezing components is greater than the number of respiratory components (KW>KR). Specifically, comparing the parameter spaceKW∈[32−512]andKR∈[8−16]withKW∈[8−16]andKR∈[32−512], the performance of the method, in terms of SDRw, improves by about 1.7 dB. As a result, RS seem to be modelled with a lower number of bases than WS. Finally, the best performance of the proposed method IIS-NMPCF can be found in the parameter space comprised byKW∈[32−128]andKR∈[32−128]with SDRw results above 16.5 dB. As previously indicated in Table 2, the proposed method provides its highest wheezing separation performance, SDRw= 16.99 dB, usingKW=64andKR=32.
Figure 5 shows the optimization of the parametersλW,λR0andλR1of the proposed method in terms of SDRw results, of the proposed method. Figure 5E shows a poor wheezing separation when the proposed method uses aλW=10since the performance of the proposed method decreases exponentially (below 2 dB) in this scenario. The reason seems to indicate that WS are always overlapped with RS since both are produced by the same airflow through the bronchial tree of the lungs. Therefore, the proposed method wrongly models the wheeze bases whenλW≥10 since it assumes that the L-segments of the input mixture are composed mostly of prominent WS. Figure 5A shows that SDRwresults decrease significantly whenλW=0.001. In this case, the use of an excessively low weighting factor makes WS less important in the factorization process, causing that the separation process is not performed correctly since the estimated respiratory signalr^[n] contains both WS and RS. Figure 5B,D show the lower and upper limit of the weighting factorλW so that the performance of the method is not drastically affected. Figure 5 shows an improvement of the wheeze separation performance of the proposed method whenλR0>λR1. Results suggest that, unlike the wheezing segments, the non-wheezing segments improve the modeling of the RS bases since these segment do not contain wheeze content so, they are not interfered by WS. As a result,λR0must be greater thanλR1to increase the quality of the reconstructed respiratory signalr^[n]. In the parameter space comprised byλR0∈[0.001−100]andλR1∈[10−100], the SDRw results are reduced significantly as can be seen in Figure 5. Therefore, a remarkable increase ofλR1causes that the factorization model inserts a large proportion of wheezing interferences into the reconstructed respiratory signal. This fact produces more of the WS to be present in the reconstructed respiratory signalr^[n]rather than in the reconstructed wheezing signalw^[n]. It can be observed that the maximum SDRw value, approximately equal to 17 dB in Figure 5B, is provided by the proposed method for the set of parametersλW=0.01,λR1=0.1andλR0=10 . This optimization process confirms the assumptions introduced in Section 3.2. Firstly, the proposed method provides the greatest importance, with a weighting factor ofλR0=10, to the non-wheezing segments for the factorization of the basis matrix related to RS. Secondly, the proposed method provides less importance, with a weighing factor ofλR1=0.1, to the wheezing segments for the factorization of the basis matrix of the RS. Finally, the proposed method provides the least importance, with a weighting factor ofλW=0.01, to the L-segments that composes the input mixture signal for the factorization of the basis matrix of WS, as in none of these segments are WS isolated.
Note that whenλW=λR0=λR1 , the proposed method works similarly to the conventional NMPCF approach, that is, ST-NMPCF. In particular, Figure 5B shows that ST-NMPCF obtains a SDRwresult equal to 13 dB (4 dB less than the optimal value obtained with the proposed method) usingλW=λR0=λR1=0.01. This improvement provided by the proposed method confirms that adding different weighting factors to different segments of the input mixture into the NMPCF factorization enhances the acoustic fidelity of the spectral content of both RS and WS in the sound separation.
Focusing on the importance of the respiratory training signaly[n] in the proposed IIS-NMPCF approach, Figure 6 shows SDRw, SIRwand SARwresults of the estimated wheezing signal evaluating the parameter space of the weighting factorα. Each box represents 48 data points, one for each mixture of the optimization dataset P1: each blue box represents the analysis for SDRwvalues; each red box represents the analysis for SIRwvalues; and each black box represents the analysis for SARwvalues. The lower and upper lines of each box show the 25th and 75th percentiles. The line in the middle of each box represents the median value. The diamond in the center of each box represents the average value. The lines extending above and below each box show the extent of the rest of the samples, excluding outliers. Outliers are defined as points that are over 1.5 times the interquartile range from the sample median, which are shown as crosses. The proposed method usingα=0, henceforth called IIS0-NMPCF, does not use any training to model the respiratory bases. IIS0-NMPCF shows an efficient performance with an average separation results of SDRw= 14 dB, SIRw= 18 dB and SARw= 15 dB. Based on these results, it can be confirmed that IIS0-NMPCF maintains a remarkable performance in the quality of the estimated wheezing signalw^[n]. However, the best average separation results, SDRw= 17 dB, SIRw= 22 dB and SARw= 20 dB, are obtained usingα=1. The optimal configuration of the proposed method IIS-NMPCF (α=1) produces a significant improvement of 3 dB in SDRw, 4 dB in SIRwand 5 dB in SARwcompared to IIS0-NMPCF. As a result, two conclusions are stated: (i) the performance of IIS-NMPCF is mainly due to the importance of the different segments depending on the presence or absence of WS so, not using any respiratory training signal the method maintains good separation results; and (ii) the use of a respiratory training signal significantly improves the performance of the proposed method IIS-NMPCF since it is combined both the information provided by the spectral patterns found at inter-segments with the information provided by the spectral patterns found in the respiratory training signal. This fact implies that the probability of finding wheezing interferences in the factorized respiratory bases decreases considerably.
Moreover, SDRw, SIRwand SARwresults, obtained usingα>10, suffer a significant decrease compared to the best performance provided by the proposed method (α=1 ) as shown in Figure 6. In this case (α>λR0), the factorization gives more importance to the spectral patterns obtained from the respiratory training signal instead of the spectral patterns shared between the different segments, that is, the proposed method IIS-NMPCF performs similarly to 1S-NMPCF.
4.5. Results and Discussion
This section assesses the sound quality of the estimated or reconstructed WS and RS obtained by the proposed method (IIS0 -NMPCF and IIS-NMPCF) and the baseline separation NMF-based and NMPCF-based methods described in Section 2. Table 3 describes the methods evaluated, indicating the approach on which they are based and the spectro-temporal information used in the modelling of WS and RS.
Next, SDR, SIR and SAR results of the estimated wheezing signalw^[n]and the estimated respiratory signalr^[n] obtained by the proposed method and the aforementioned baseline methods evaluating the testing datasets T1H (see Figure 7), T1M (see Figure 8) and T1L (see Figure 9) are analyzed to extract interesting information about the sound separation performance of the methods evaluated. Each blue box corresponds to the SDRw, SIRwand SARwresults of the estimated wheezing signal while each red box corresponds to the SDRr, SIRrand SARr results of the estimated respiratory signal. Note that the methods have been shown sorted from lowest to highest separation performance to represent results as a ranking. The following information can be derived from the analysis of results from Figure 7, Figure 8 and Figure 9:
-
The decrease in SNR affects significantly the SDR and SIR results for both WS and RS. Focusing on Figure 7 in which SNR = 5 dB, results tend to be higher for reconstructed WS compared to the reconstructed RS because WS are louder than RS, so the sound separation benefits the audio quality of the reconstructed WS. Focusing on Figure 8 in which SNR = 0 dB, results for both WS and RS tend to remain stable because both WS and RS are similarly audible, so the performance of the sound separation seems to work equally between WS and RS. However, in Figure 9 in which SNR = −5 dB, results tend to be better for reconstructed RS since RS are louder than WS. This decrease in SNR implies that SDRmand SIRmresults are worse in T1L compared to T1H. The reason is because RS are louder than WS when SNR < 0 dB (T1L) and as a consequence, WS be inaudible in this acoustic scenario so, the reduction of the SNR implies a greater time-frequency overlapping from RS to WS than the opposite.
- The standard NMF is ranked at the bottom, obtaining the worst sound separation performance since it achieves the signal reconstruction but not a factorization composed of audio events with physical meaning. The standard NMF cannot group the factorized bases to the sound source that generated them unlike the other methods because the standard NMF does not incorporate any type of information into the factorization process to model the spectro-temporal characteristics shown by WS and RS.
- Semi-supervised approaches (SSNMF and 1S-NMPCF) obtain better performance compared to supervised approaches (SNMF and 2S-NMPCF). Regardless of the approach, NMF or NMPCF, the use of the RS training signal is more effective that the use of both RS and WS training signals. It indicates that both training signals provide over-information that causes spectro-temporal ambiguity in the factorization of both WS and RS dictionaries.
- NMPCF-based methods (1S-NMPCF) obtain better separation performance than NMF-based methods (SSNMF). This fact seems to be because SSNMF uses a fixed dictionary composed of respiratory bases previously trained. However, 1S-NMPCF does not need a previous training stage, since it applies a joint matrix factorization using the input mixture and the respiratory training to obtain a dynamic dictionary of respiratory bases shared between both signals, obtaining a different dictionary of bases for each input mixture.
- Comparing NMPCF-based methods, T-NMPCF improves the separation performance compared to 1S-NMPCF. Results suggest that the dictionary of respiratory bases is more efficient when the input mixture is divided into segments in order to find repetitive patterns of RS.
-
ST-NMPCF, the combination of the approaches 1S-NMPCF and T-NMPCF, obtains a significant improvement of the wheezing separation performance. Specifically, SDRw= 5.96 dB and SIRw = 9.73 dB evaluating T1H (Figure 7). It indicates that a more reliable modelling of RS can be achieved using jointly the shared respiratory spectral patterns along the segments and a prior knowledge of the respiratory spectral content by means of the respiratory training signal.
-
CNMF [32] obtains competitive SDR SIR and SAR results compared to the methods above, ranking fourth. In some cases, WS and RS are modelled efficiently by applying its proposed constraints, but in other cases in which WS and RS are uncommon, CNMF does not model properly the spectro-temporal behavior of the target sounds.
Focusing on the main contribution proposed in this work, the incorporation of higher importance to those segments classified as non-wheezing in the co-factorization process, Figure 7, Figure 8 and Figure 9 reveal the following information:
-
A significant separation performance improvement over the conventional T-NMPCF and ST-NMPCF is achieved adding greater importance to the non-wheezing segments in the co-factorization process. The SDRwimprovement of IIS0-NMPCF over T-NMPCF is about 8.31 dB (T1H), 5.18 dB (T1M) and 4.85 dB (T1L). The SIRwimprovement of IIS0-NMPCF over T-NMPCF is about 11.09 dB (T1H), 10.18 dB (T1M) and 8.33 dB (T1L). The SDRwimprovement of IIS0-NMPCF over ST-NMPCF is about 2.67 dB (T1H), 3.03 dB (T1M) and 1.69 dB (T1L). The SIRwimprovement of IIS0-NMPCF over ST-NMPCF is about 1.98 dB (T1H), 2.25 dB (T1M) and 1.87 dB (T1L). Results suggest that the inclusion of inter-segment information into the co-factorization process for modeling repetitive RS improves significantly the separation performance because it avoids that the respiratory spectral patterns obtained from the factorization remaining uncontaminated in wheezing segments.
-
Adding prior knowledge of RS to IIS0-NMPCF improves significantly the sound separation performance. The SDRwimprovement of IIS-NMPCF over IIS0-NMPCF is about 3.07 dB (T1H), 2.89 dB (T1M) and 4.12 dB (T1L). The SIRwimprovement of IIS-NMPCF over IIS0-NMPCF is about 4.96 dB (T1H), 3.23 dB (T1M) and 3.02 dB (T1L). However, the dispersion between SDR and SIR results increases when the respiratory training signal is incorporated into the co-factorization process.
Focusing on the SAR results observed in Figure 7C, Figure 8C and Figure 9C: (i) NMPCF-based methods produce fewer artifacts than NMF-based methods; (ii) the spectro-temporal information used in the modelling of WS and RS allows to reduce the ambiguity that NMPCF-based methods are affected by decreasing the amount of artifacts. For this reason, the proposed method IIS-NMPCF, which uses more spectro-temporal information to model RS compared to the other NMPCF-based methods, obtains the best separation performance in terms of SAR.
In order to guarantee the relevance of the respiratory and wheezing SDR, SIR and SAR results shown in Figure 7, Figure 8 and Figure 9, an analysis of the statistical significance, using an one-side paired t-test, has been performed comparing the proposed method (IIS-NMPCF) with the rest of the evaluated methods as shown in Table 4, Table 5 and Table 6. It can be observed that results confirm the significant improvement obtained by IIS-NMPCF compared to the other evaluated methods.
Finally, a set of spectrograms are presented in Figure 10 and Figure 11 in order to display the sound separation performance obtained by each of the assessed methods. Unlike the other evaluated methods, it can be observed that the proposed method IIS-NMPCF removes most of the RS in the estimated wheezing spectrogramX^Wkeeping most of the wheezing spectral content. This fact confirms the advantage of the proposed method since most of the clinical useful information contained in the estimated spectrogramX^W will be available to the physician to maximize the reliability of medical diagnosis. The MATLAB implementation of the proposed method is shared by the authors and can be downloaded from GitHub (https://github.com/JTORRECRUZ/Sensors_IIS-NMPCF).
5. Conclusions
We propose an extended version of Non-negative Matrix Partial Co-Factorization (NMPCF) approach to separate wheezing and respiratory sounds improving their acoustic quality. We assume that RS can be considered as sound events that are repeated during the human breathing process. However, WS may or may not be present along the segments due to the unpredictable nature of the pulmonary disorder. The main contribution of the proposed method is to add importance to the segments classified as non-wheezing to improve the sound separation performance of the conventional NMPCF which treats all segments of the input spectrogram equally. As a result, our proposal (IIS0-NMPCF/IIS-NMPCF) is able to characterize RS more accurately by forcing to model more on those non-wheezing segments in the bases sharing process into the NMPCF approach.
The main conclusions from the experimental results indicate that adding more importance to the non-wheezing segments into the decomposition procedure (NMPCF) models more accurate the spectro-temporal characteristics related to repetitive sound events of the mixture. In this work, these repetitive sound events are represented by RS that are present in all cycles of the breathing. Experimental SDR, SIR and SAR results report that the proposed method IIS-NMPCF outperforms significantly all evaluated methods providing competitive and promising results in the wheezing sound separation. This fact confirms the ability of the proposed method to improve the sound quality of WS maximizing both the removal of the acoustic interference caused by RS and that as much wheezing content is maintained. As a result, all useful medical information contained in the estimated wheezing can be clearly preserved. It can be observed that the separation performance for the different evaluated methods drops when the SNR decreases. Considering the acoustic scenario in which RS are louder than WS (SNR < 0 dB), WS are barely audible due to the high interference produced by RS. Although in this case the reduction of the SNR implies a greater time-frequency overlapping from RS to WS, our proposal still achieves the best performance compared to the other baseline methods evaluating. Therefore, the proposed method can be considered an useful tool to be applied in sound environments in which WS are barely audible. Future work will focus on the development of new constraints to be incorporated into NMF-based approaches for modelling different types of WS according to their spectral content in order to automatically classify the severity of the lung disorder.
ID1 | ID2 | ID3 | ID4 | ID5 | ID6 | ID7 | ID8 | ID9 |
---|---|---|---|---|---|---|---|---|
P1 | 48 | 5–24 | 721 | [0–9] | [4–16] | 496 | [1–8] | 92 |
T1H | 16 | 7–22 | 251 | 5 | [6–14] | 126 | [1–5] | 41 |
T1M | 16 | 7–22 | 251 | 0 | [6–14] | 126 | [1–5] | 41 |
T1L | 16 | 7–22 | 251 | −5 | [6–14] | 126 | [1–5] | 41 |
ID1: identifier; ID2: number of recordings captured from unhealthy subjects; ID3: the shortest and longest duration, in seconds, captured from recordings; ID4: total duration in seconds; ID5: the lowest and highest SNR, in dB, between WS and RS; ID6: the minimum and maximum number of respiratory events found in the recordings; ID7: the total number of respiratory events; ID8: the minimum and maximum number of wheezes found in the recordings; ID9: the total number of wheezes.
IIS-NMPCF Approach Parameters | KW | KR | λR0 | α | λR1 | λW |
Optimal values | 64 | 32 | 10 | 1 | 0.1 | 0.01 |
Method | Approach | Modelling Associated to WS and RS |
---|---|---|
NMF | NMF | |
SSNMF | NMF | y[n] |
SNMF | NMF | y[n]andz[n] |
CNMF | NMF | Sparseness and Smoothness constraints |
1S-NMPCF | NMPCF | y[n] |
2S-NMPCF | NMPCF | y[n]andz[n] |
T-NMPCF | NMPCF | L-segments |
ST-NMPCF | NMPCF | L-segments andy[n] |
IIS0 -NMPCF | NMPCF | L-segments andC(l) |
IIS-NMPCF | NMPCF | L-segments,C(l)andy[n] |
Method | SDRr | SIRr | SARr | SDRw | SIRw | SARw |
---|---|---|---|---|---|---|
NMF | 6.1×10−10 | 4.1×10−10 | 5.4×10−3 | 1.9×10−11 | 1.8×10−11 | 4.8×10−7 |
SSNMF | 1.4×10−7 | 5.5×10−8 | 4.8×10−3 | 3.2×10−10 | 4.4×10−12 | 8.9×10−8 |
SNMF | 1×10−7 | 3.1×10−7 | 4.5×10−8 | 7.9×10−12 | 1.4×10−10 | 1.2×10−2 |
2S-NMPCF | 3.3×10−7 | 5.5×10−6 | 4.9×10−7 | 2.6×10−11 | 1.7×10−10 | 1.8×10−3 |
1S-NMPCF | 6×10−6 | 4×10−7 | 2.7×10−6 | 5.2×10−10 | 9.2×10−11 | 4.9×10−3 |
T-NMPCF | 3.8×10−5 | 5.7×10−5 | 3.3×10−4 | 4.4×10−9 | 8.9×10−9 | 2.9×10−2 |
CNMF | 1.6×10−4 | 1.7×10−3 | 1.8×10−7 | 2.6×10−6 | 1.3×10−6 | 7.2×10−4 |
ST-NMPCF | 1.5×10−4 | 9.5×10−6 | 5.2×10−2 | 3.9×10−5 | 5.3×10−7 | 2.2×10−2 |
IIS0-NMPCF | 4×10−2 | 2.2×10−3 | 1×10−1 | 4.2×10−2 | 8.2×10−3 | 1.1×10−1 |
Each cell shows the parameterρthat represents the probability of setting a statistically significant result. Considering a confidence interval of 95%, small values ofρ < 0.05 indicate that there exists statistical significance of the results evaluated.
Method | SDRr | SIRr | SARr | SDRw | SIRw | SARw |
---|---|---|---|---|---|---|
NMF | 4×10−13 | 4.1×10−14 | 9.4×10−2 | 6.2×10−13 | 2×10−12 | 2.7×10−9 |
SNMF | 4.3×10−13 | 7.3×10−13 | 4.3×10−2 | 2.8×10−13 | 1×10−10 | 1.3×10−10 |
SSNMF | 9.7×10−11 | 2.3×10−10 | 2.5×10−6 | 5.6×10−11 | 4.3×10−10 | 2.9×10−7 |
2S-NMPCF | 6.9×10−11 | 1.1×10−11 | 4.9×10−6 | 5.1×10−11 | 5.8×10−11 | 3.5×10−8 |
1S-NMPCF | 8.7×10−8 | 4.9×10−10 | 1.3×10−6 | 1.7×10−8 | 1.4×10−9 | 3.3×10−7 |
T-NMPCF | 9.7×10−9 | 3.7×10−9 | 1.1×10−7 | 6.9×10−9 | 1.6×10−7 | 1.2×10−9 |
CNMF | 9.6×10−7 | 1.1×10−5 | 5.7×10−5 | 9.4×10−6 | 7.4×10−5 | 5.2×10−7 |
ST-NMPCF | 1.9×10−4 | 1.6×10−4 | 4.4×10−2 | 8.4×10−5 | 4.3×10−4 | 1.3×10−9 |
IIS0-NMPCF | 4.1×10−2 | 6.3×10−4 | 4×10−1 | 2×10−2 | 3.1×10−2 | 3.3×10−4 |
Each cell shows the parameterρthat represents the probability of setting a statistically significant result. Considering a confidence interval of 95%, small values ofρ < 0.05 indicate that there exists statistical significance of the results evaluated.
Method | SDRr | SIRr | SARr | SDRw | SIRw | SARw |
---|---|---|---|---|---|---|
NMF | 2.1×10−9 | 6.8×10−12 | 3.9×10−8 | 2.2×10−9 | 1.5×10−12 | 8.6×10−10 |
SNMF | 5.5×10−10 | 3×10−11 | 1.9×10−5 | 1.7×10−9 | 2.7×10−12 | 8×10−12 |
SSNMF | 5.3×10−10 | 1.1×10−13 | 6.5×10−10 | 1.2×10−9 | 6.2×10−11 | 3.6×10−10 |
2S-NMPCF | 2.8×10−10 | 3.1×10−9 | 4.7×10−10 | 1.1×10−8 | 1.2×10−10 | 9.3×10−10 |
1S-NMPCF | 1.8×10−9 | 9.9×10−12 | 5.5×10−12 | 2.1×10−8 | 3.3×10−9 | 2.9×10−6 |
T-NMPCF | 1.5×10−7 | 1.7×10−8 | 5.4×10−12 | 1.8×10−7 | 1.2×10−7 | 2.5×10−8 |
CNMF | 2×10−5 | 4.4×10−4 | 3.9×10−2 | 4.7×10−9 | 3.4×10−4 | 5.6×10−6 |
ST-NMPCF | 5.6×10−4 | 4×10−6 | 3.7×10−4 | 1.9×10−6 | 1.1×10−3 | 5.2×10−6 |
IIS0-NMPCF | 3.6×10−2 | 9.6×10−3 | 3×10−6 | 2.5×10−3 | 2.7×10−2 | 4.3×10−3 |
Each cell shows the parameterρthat represents the probability of setting a statistically significant result. Considering a confidence interval of 95%, small values ofρ < 0.05 indicate that there exists statistical significance of the results evaluated.
Author Contributions
Conceptualization, J.D.L.T.C., F.J.C.Q., N.R.R. and P.V.C.; methodology, J.D.L.T.C., F.J.C.Q., N.R.R. and P.V.C.; software, J.D.L.T.C., F.J.C.Q. and J.J.C.O.; validation, J.D.L.T.C. and J.J.C.O.; writing-original draft, J.D.L.T.C., F.J.C.Q. and J.J.C.O.; writing-review and editing, N.R.R. and P.V.C.; supervision, N.R.R. and P.V.C. All authors have read and agreed to the submitted version of the manuscript.
Funding
This work was supported by the Programa Operativo FEDER Andalucía 2014-2020 under project with reference 1257914.
Acknowledgments
The authors would like to thank the pulmonologist Gerardo Pérez Chica from the University Hospital of Jaén (Spain) for all the constructive discussions about the sound wheezing in the auscultation process.
Conflicts of Interest
The authors declare no conflict of interest.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2020. This work is licensed under http://creativecommons.org/licenses/by/3.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Wheezing reveals important cues that can be useful in alerting about respiratory disorders, such as Chronic Obstructive Pulmonary Disease. Early detection of wheezing through auscultation will allow the physician to be aware of the existence of the respiratory disorder in its early stage, thus minimizing the damage the disorder can cause to the subject, especially in low-income and middle-income countries. The proposed method presents an extended version of Non-negative Matrix Partial Co-Factorization (NMPCF) that eliminates most of the acoustic interference caused by normal respiratory sounds while preserving the wheezing content needed by the physician to make a reliable diagnosis of the subject’s airway status. This extension, called Informed Inter-Segment NMPCF (IIS-NMPCF), attempts to overcome the drawback of the conventional NMPCF that treats all segments of the spectrogram equally, adding greater importance for signal reconstruction of repetitive sound events to those segments where wheezing sounds have not been detected. Specifically, IIS-NMPCF is based on a bases sharing process in which inter-segment information, informed by a wheezing detection system, is incorporated into the factorization to reconstruct a more accurate modelling of normal respiratory sounds. Results demonstrate the significant improvement obtained in the wheezing sound quality by IIS-NMPCF compared to the conventional NMPCF for all the Signal-to-Noise Ratio (SNR) scenarios evaluated, specifically, an SDR, SIR and SAR improvement equals 5.8 dB, 4.9 dB and 7.5 dB evaluating a noisy scenario with SNR = −5 dB.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer