As a combustible fossil energy source, coal is widely used in power generation, heating, forging, and steelmaking,1 and plays a key role in national daily life. As modernization advances and energy demand rises, coal remains an important component of social development.2 Coal gangue is the main solid waste produced during coal mining; it is rich in sulfur and heavy metals and releases large quantities of harmful substances when burned, causing environmental pollution.3 Automatic and efficient separation of gangue and other impurities from mined coal to improve coal purity is therefore of great significance to mining efficiency and conducive to the green development of the coal industry.4 Traditional sorting methods include manual sorting and wet separation. In manual sorting, workers rely on their own experience to pick large pieces of coal and gangue out of the mixed ore, which is time-consuming, laborious, and inefficient. Wet separation exploits the density difference between coal and gangue: a separating medium is deployed and large-scale mechanical equipment performs the sorting, but this approach wastes water and causes environmental pollution, among other problems.2
In recent years, artificial intelligence and image processing technologies have developed continuously and are now widely used across industries. Doung et al.5 used neural network algorithms to build an expert system that recognizes fruits accurately and quickly. Tang et al.6 designed a signal enhancement and compression method for motor faults to realize fault diagnosis on an Internet of Things (IoT) platform. Setio et al.7 proposed a novel computer-aided detection system for lung nodules based on multiview convolutional networks, effectively achieving high-quality nodule detection. Image classification, recognition, and segmentation combined with deep learning have become a research hotspot for gangue sorting. Sun et al.8 analyzed the visual differences between coal and gangue images, extracted texture as auxiliary information based on morphology and on differences in physicochemical properties, extracted texture information from the surface traces caused by transportation, and established classifiers for classification. Lv et al.9 combined traditional computer vision with deep learning to design a multichannel feature fusion layer, optimized the convolutional neural network (CNN) in the discriminator from the perspectives of both the loss function and the classifier, and designed a decision function to unify the outputs of the detector and the discriminator, effectively improving detection accuracy. Li et al.10 first conducted comparative tests on three image threshold segmentation methods to prove the effectiveness of clustering, and then trained classifiers using grayscale features, texture features, and joint features combining skewness and contrast, respectively, to recognize and locate coal and gangue effectively. Although existing methods can recognize coal and gangue with good detection accuracy, they can only determine whether coal or gangue is present in an image and cannot obtain shape and edge contour information.
With advances in neural networks and computer hardware, a series of semantic segmentation models with excellent performance have been proposed, such as the fully convolutional network (FCN),11 SegNet,12 U-Net,13 DeepLab,14 and Mask R-CNN.15 Semantic segmentation of coal and gangue classifies the pixels in an image and assigns different semantic category labels to coal and gangue based on pixel category and location, achieving pixel-level segmentation.16 Wang et al.17 proposed a segmentation method based on star-algorithm edge detection, which extracts the pixel matrix along the x and y directions and maps it to a single-valued image of equal size based on monotonic variation to realize edge detection of coal and gangue. Luo et al.18 established a grayscale fluctuation equation by calculating the amplitude of pixel changes in different directions, extracted multiple features from it to realize target contour segmentation, and optimized the contours using historical and predictive information, removing a large number of false contours, reproducing part of the hidden contours, and effectively improving segmentation accuracy. Lv et al.19 designed the loss function of a joint network and a feature interaction channel between the shared coding module and the parallel decoding branches based on multitask learning theory to realize detection and segmentation of oversized gangue. He et al.20 used optoelectronics to obtain independent image targets and proposed a concave-point detection and segmentation algorithm, which translates concave-point detection into a positional relationship between a pixel and a linear equation and finally generates a segmentation line from the concave points. Wang et al.21 developed a multibranch parallel feature extraction bottleneck to establish a lightweight backbone network and designed a novel neck structure that aggregates channel and spatial information into the backbone's contextual information to enhance the location and boundary information of the coal and gangue, effectively providing category and location information in dense scenes. He et al.22 proposed pit detection and segmentation algorithms to solve target sticking and overlapping, designed open-loop and closed-loop crossover algorithms that use a conjugate line to detect pits by determining the position and distance of pixels relative to that line, and then set distance constraints to obtain the segmentation line corresponding to each pit via a minimum-distance search, realizing coal and gangue segmentation. Fu et al.23 proposed a fast clustering segmentation algorithm for waterpixels based on gradient enhancement, which enhances edge gradient features with multiscale details, reconstructs the gradient watershed transform based on multiscale morphology, and finally performs statistics and clustering on the resulting superpixel map to obtain the final segmentation. Liu et al.24 introduced the InceptionV1 module to replace some convolutional blocks in U-Net and integrated the CPAM attention module to address illumination changes, improving segmentation accuracy over the original model.
Li et al.25 proposed a mask-region convolutional neural network segmentation method that can recognize fine coal dust, and combined focal loss with the Dice coefficient in a new mask loss function to overcome mis-segmentation caused by edge effects, achieving effective results. All of the above methods can segment the edge information and shape of coal and gangue, but for intelligent sorting by manipulators, segmentation accuracy must be improved further to raise the gripping success rate.
Therefore, this paper improves on the Mask R-CNN26 framework; the proposed model effectively improves segmentation accuracy and realizes pixel-level segmentation of coal and gangue. The main contributions of this paper are as follows:
A multichannel forward-linked confusion convolution module is designed to enhance the feature extraction capability of the network; stacked copies of this module replace the ResNet50 backbone. We therefore use the initials MFCCM, for "Multichannel Forward-Linked Confusion Convolution Module," to name the improved model.
To address insufficient communication of feature information, this paper proposes the high-resolution feature pyramid network (HR-FPN) structure, which fully integrates feature information through multichannel information propagation and is paired with the SRP structure to expand the receptive field of the network and realize efficient propagation of gradient information.
A multiscale Mask head structure is proposed, which effectively utilizes local and contextual information to achieve feature enhancement and better segmentation.
The rest of the paper is organized as follows: Section 2 outlines the model structure of Mask R-CNN, giving a graphical and textual description of the network structure proposed in this paper. Section 3 describes the experimental data set, equipment environment, experimental parameters, and evaluation metrics, and conducts ablation experiments and comparison tests to further demonstrate the advantages of the proposed model in this paper. Section 4 summarizes the paper and the subsequent further work.
METHODS

Mask R-CNN algorithm

The Mask R-CNN algorithm used in this paper extends the classical target detection framework Faster R-CNN,27 as shown in Figure 1, and consists of two phases. The first phase scans the initial feature maps output by the FPN and generates regions of interest (ROIs), the same as in Faster R-CNN.28 The second phase is the ROI pooling operation; unlike Faster R-CNN, the algorithm uses an ROIAlign layer, which applies bilinear interpolation to compute exact feature values at four regular sampling locations in each ROI bin, avoiding the large quantization errors caused by misalignment between the ROIs and the extracted features, and finally uses a fully convolutional network (FCN) to generate a binary mask for each aligned ROI.29 The loss function of the model consists of three parts: box loss, classification loss, and segmentation loss. We introduce the three structures of Mask R-CNN, the FPN, the region proposal network (RPN), and ROIAlign, in sequence below.
FPN
The feature pyramid structure plays an important role in multiscale target detection because targets of each size carry different feature information. Low-level feature maps have high image resolution and can effectively extract the location information of the target under test, but carry less semantic information. High-level feature maps, on the contrary, have rich semantic information, but their low resolution makes the location information of the target ambiguous.30 FPN uses a top-down hierarchical structure with lateral connections to transform a single-scale feature extraction problem into a multiscale one, and requires no additional image inputs or parameter settings.31
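As a quick illustration of this top-down fusion, the sketch below builds a standard FPN with torchvision's built-in module; the channel counts follow the usual ResNet50 stage outputs and are assumptions for illustration, not values taken from this paper.

```python
from collections import OrderedDict

import torch
from torchvision.ops import FeaturePyramidNetwork

# Standard ResNet50 stage channels (assumed here for illustration).
fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048], out_channels=256)

feats = OrderedDict(
    c2=torch.randn(1, 256, 200, 200),   # high resolution, weak semantics
    c3=torch.randn(1, 512, 100, 100),
    c4=torch.randn(1, 1024, 50, 50),
    c5=torch.randn(1, 2048, 25, 25),    # low resolution, rich semantics
)
outs = fpn(feats)  # top-down path upsamples and adds; every output has 256 channels
for name, f in outs.items():
    print(name, tuple(f.shape))
```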
RPN
Mask R-CNN uses the RPN structure to generate candidate boxes. Its principle is to generate candidate boxes of different sizes and aspect ratios at each pixel location and to perform classification and bounding-box regression on them, outputting the category and boundary coordinates of the ROIs.32 Compared with the traditional method of first dividing candidate regions in the image with a sliding window and then determining the object category, region proposals are obtained at almost no extra cost, which effectively improves the efficiency of obtaining detection boxes.
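To make "candidate boxes at each pixel location" concrete, here is a minimal pure-PyTorch sketch of anchor generation for one feature map; the scales, ratios, and stride below are illustrative assumptions, not the paper's settings.

```python
import torch

def make_anchors(feat_h, feat_w, stride=16, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * len(scales) * len(ratios), 4) anchors as (x1, y1, x2, y2)."""
    base = []
    for s in scales:
        for r in ratios:
            h = s * r ** 0.5            # convention assumed: ratio = h / w, area kept ~ s*s
            w = s / r ** 0.5
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = torch.tensor(base)                               # (A, 4), centered at the origin
    ys = (torch.arange(feat_h) + 0.5) * stride              # anchor centers in image coords
    xs = (torch.arange(feat_w) + 0.5) * stride
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    shifts = torch.stack([cx, cy, cx, cy], dim=-1).reshape(-1, 1, 4)
    return (shifts + base).reshape(-1, 4)                   # one anchor set per location

anchors = make_anchors(50, 50)          # 50 * 50 * 9 = 22,500 candidate boxes
print(anchors.shape)                    # torch.Size([22500, 4])
```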
ROI align
Figure 1. Structure of Mask R-CNN network. FCN, fully convolutional network; ROI, region of interest.
The ROIs generated by the RPN structure are fed into ROI Align together with the corresponding feature maps, and each ROI is resized by ROI Align to meet the input requirements of the fully connected layers. Faster R-CNN uses an ROI Pooling operation; after the feature maps have been convolved and pooled several times, the final feature map size differs from that of the initial input, which causes large errors when pixel-level segmentation is performed directly.33 The ROI Align layer in Mask R-CNN instead uses bilinear interpolation to extract the pixel features of each ROI on the corresponding feature map and performs proper alignment, which effectively avoids the quantization error caused by the pooling operation and improves segmentation accuracy.
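The ROIAlign step can be reproduced with torchvision's roi_align; the feature map size, stride, and box in this sketch are illustrative assumptions rather than values from the paper.

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 50, 50)                        # feature map at stride 16 (assumed)
rois = torch.tensor([[0, 100.0, 120.0, 300.0, 360.0]])    # (batch_idx, x1, y1, x2, y2)

# Bilinear sampling at regular points inside each bin; aligned=True applies the
# half-pixel offset correction, avoiding ROI Pooling's quantization error.
pooled = roi_align(feat, rois, output_size=(7, 7),
                   spatial_scale=1.0 / 16, sampling_ratio=2, aligned=True)
print(pooled.shape)                                        # torch.Size([1, 256, 7, 7])
```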
Mask R-CNN was designed for large-scale computer vision tasks, which enables its direct application with good recognition and segmentation accuracy in many fields, such as medical diagnosis,34 agriculture,35 and industry.36 However, there is still room to increase its accuracy when it targets a single scene, and many variants of Mask R-CNN have therefore been proposed for better application in various scenarios.
MFCCM

In the target segmentation task, effective extraction of RGB image feature information is one of the key steps toward pixel-level segmentation. In coal and gangue images, different pieces of coal and gangue differ little in shape and size, and their color and texture are similar, so the images contain relatively little distinguishing feature information and the backbone network must discriminate effectively. The original Mask R-CNN uses ResNet5037 as the feature extraction network; as network depth increases, its residual structure fully fuses shallow and deep features and effectively prevents gradient vanishing during training. In this paper, we redesign the feature extraction network and propose a multichannel forward-linked confusion convolution module for building it, as shown in Figure 2.
Figure 2. Structure of feature extraction network. (A) Conv block and (B) feature extraction network.
"Channel Shuffle" denotes the channel shuffling operation,38 which recombines the different groups of feature information in the final output of the Conv Block. It ensures that feature maps of different channel groups can be exchanged without increasing computational cost, improves the learning of feature information between groups, and enhances the segmentation accuracy of the network model. In Figure 2A, "Conv" stands for two-dimensional convolution, "BN" for batch normalization, and "ReLU" for the activation function, which sets negative outputs of neural network nodes to 0, reduces the interdependence between nodes, and effectively mitigates overfitting during training.39 The convolutional block combines the classical convolution operation, the residual structure, and the shuffle operation; features propagate through a multichannel forward-linked confusion network, so the basic features of the image are propagated in series within different channels, deep features are combined and shuffled with shallow features, and the output feature maps carry distinctive gangue features, achieving a richer combination of features and gradients.
Assume that a feature map $X = \{x_1, x_2, \ldots, x_C\}$, $X \in \mathbb{R}^{C \times W \times H}$, is given, where $x_i$ denotes the feature submap of the $i$th channel of $X$, and $C$, $W$, and $H$ represent the number of channels, the width, and the height of the feature map, respectively. The feature map is then fed into the feature extraction network to extract coal and gangue feature information, represented by Equations (1) and (2):

$$F_l = \delta(\mathrm{BN}(\mathrm{Conv}(F_{l-1}))) \tag{1}$$

$$Y = \mathrm{Shuffle}(\mathrm{Concat}(F_1, F_2, \ldots, F_L)) \tag{2}$$

where $\mathrm{Conv}(\cdot)$, $\mathrm{BN}(\cdot)$, and $\delta(\cdot)$ represent the convolution operation, the normalization operation, and the activation function, respectively, and $F_{l-1}$ denotes the output of the previous layer, which serves as the input of the current Conv Block. After channel summation and dimensional splicing, the result is sent to the shuffle operation to disrupt the splicing, which makes the feature information fully intermingle; the collected output of each Conv Block thus effectively captures the channel dependency among the aggregated features.
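The sketch below gives a minimal reading of one Conv Block under Equations (1) and (2): Conv-BN-ReLU, a residual (forward-link) addition, then a channel shuffle. The kernel size, group count, and exact residual wiring are assumptions, since Figure 2A's precise configuration is not reproduced here.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channel groups (ShuffleNet-style) so information mixes across groups."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class ConvBlock(nn.Module):
    """Hypothetical Conv Block sketch: delta(BN(Conv(x))) + residual, then shuffle."""
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0      # shuffle needs evenly divisible channels
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.groups = groups

    def forward(self, x):
        y = self.relu(self.bn(self.conv(x)))     # Eq. (1)
        y = y + x                                 # residual forward link
        return channel_shuffle(y, self.groups)   # confusion/shuffle step, cf. Eq. (2)

block = ConvBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])
```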
Improved multiscale feature pyramid

In the segmentation task, coal and gangue come in different shapes and sizes. The FPN literature has shown that using only the feature map output by the last layer of the feature extraction network, although rich in semantic information and beneficial for classifying coal and gangue, leads to inaccurate object localization and related problems. We therefore need to make good use of the feature maps output at each scale to improve the accuracy of recognizing and segmenting coal and gangue of different sizes.
The commonly used FPN structure is illustrated in Figure 3, where C2, C3, C4, and C5 are the feature maps of four different scales fed into the FPN by the feature extraction network, and P2, P3, P4, and P5 are the outputs of the FPN layer, which are subsequently fed into the RPN to generate ROIs. Analyzing this structure, we find that feature information flows only between neighboring nodes in each layer; circulation between different layers relies on up-sampling and convolution along top-down and bottom-up paths, leaving few paths for information exchange. High-resolution feature maps generated by nearest-neighbor or bilinear interpolation exhibit blurred edges and image discontinuities, which affect segmentation accuracy.40 We therefore design the HR-FPN structure to realize information interaction between the different output scales of the feature extraction network, improving the clarity of edge contours and alleviating image discontinuity, as shown in Figure 4.
Figure 3. Structure of common feature pyramids. (A) Default, (B) FPN, (C) PAN, and (D) BiFPN.
Figure 4. Structure of high-resolution feature pyramid network (HR-FPN). (A) HR-FPN, (B) Up and down, and (C) stacked residual pooling (SRP).
Unlike the traditional top-down and bottom-up paths, the outputs of the feature extraction network, C2, C3, C4, and C5, all undergo Up or Down operations and pass their own feature information to the nodes of different layers; the information paths are effectively shortened through ADD operations that fully integrate the information flows from different scales. Figure 4B explains the arrows in Figure 4A: an upward-facing arrow denotes a Down operation and vice versa an Up operation, a horizontal arrow denotes a convolution operation, and the slope of an arrow reflects the up- or down-sampling factor. Here, Up ×n denotes an up-sampling operation whose output is n times the size of the original, and Down ×n denotes a down-sampling operation whose input is n times the size of the output; the feature information is aggregated after up- and down-sampling, effectively combining the feature information of each scale. Table 1 lists the operations from the C layer to the M layer.
Table 1 Operations from the C layer to the M layer.

Paths | Operation | Paths | Operation |
C5 to M5 | Conv2d | C3 to M5 | Down ×4 |
C5 to M4 | Up ×2 | C3 to M4 | Down ×2 |
C5 to M3 | Up ×4 | C3 to M3 | Conv2d |
C5 to M2 | Up ×8 | C3 to M2 | Up ×2 |
C4 to M5 | Down ×2 | C2 to M5 | Down ×8 |
C4 to M4 | Conv2d | C2 to M4 | Down ×4 |
C4 to M3 | Up ×2 | C2 to M3 | Down ×2 |
C4 to M2 | Up ×4 | C2 to M2 | Conv2d |
Here we define the output feature map of the $M$ layer as $M_i$ ($i = 2, 3, 4, 5$), which can be expressed as follows:

$$M_i = \mathrm{Conv2d}(C_i) + \sum_{\substack{j=2 \\ j \neq i}}^{5} T_{j \to i}(C_j) \tag{3}$$

where $T_{j \to i}$ denotes the Up ×n or Down ×n operation listed in Table 1, and up-sampling (Upsample) is realized by bilinear interpolation. The output of the $M$ layer is then fed into the right half of the structure, whose output is defined as $P_i$.
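The all-to-all fusion of Table 1 and Equation (3) can be sketched as below: every C level is projected by a lateral convolution, resized to each target M level, and summed. The 1 × 1 lateral convolutions and the use of bilinear resizing for both Up and Down paths are assumptions for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HRFPNFuse(nn.Module):
    """Sketch of HR-FPN fusion (Figure 4A / Table 1): each M level sums resized
    versions of every C level, so all scales exchange information directly."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)

    def forward(self, feats):            # feats = [C2, C3, C4, C5], strides 4/8/16/32
        feats = [l(f) for l, f in zip(self.lateral, feats)]
        outs = []
        for i, target in enumerate(feats):
            m = target.clone()           # Conv2d path: same-scale contribution
            for j, src in enumerate(feats):
                if j == i:
                    continue
                # Up xn / Down xn approximated by bilinear resize to the target size.
                m = m + F.interpolate(src, size=target.shape[-2:],
                                      mode="bilinear", align_corners=False)
            outs.append(m)               # M2 .. M5
        return outs

fuse = HRFPNFuse()
cs = [torch.randn(1, c, s, s) for c, s in [(256, 80), (512, 40), (1024, 20), (2048, 10)]]
print([tuple(m.shape) for m in fuse(cs)])
```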
In CNNs, contextual information helps improve segmentation performance to some extent, and the use of global context depends on the size of the receptive field. For this reason, we expand the receptive field of the feature pyramid by adding the stacked residual pooling (SRP)41 structure behind the proposed improved FPN structure. As shown in Figure 4C, each residual branch contains a max pooling and a convolution, with max-pooling filter sizes of 1, 3, and 5, respectively. The input of each pooling operation is the output of the previous residual module, and the convolution immediately after each pooling weights the feature information, eliminating the mixing effect introduced by the preceding up-sampling step. The output of each residual module is continuously superimposed on the input features for updating, which enhances feature fusion, reduces information loss in deep network computation, and ensures efficient propagation of gradient information.
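A sketch of the SRP chain as just described: three branches with max-pooling kernels 1, 3, and 5, each followed by a convolution, each fed by the previous branch, and each output accumulated onto the features. Stride-1 pooling with padding (to preserve spatial size) and the channel count are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class SRP(nn.Module):
    """Stacked residual pooling sketch (Figure 4C); channel count assumed."""
    def __init__(self, channels=256):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2),  # keeps H, W
                nn.Conv2d(channels, channels, 3, padding=1),            # re-weights features
            )
            for k in (1, 3, 5)
        )

    def forward(self, x):
        out, feed = x, x
        for branch in self.branches:
            feed = branch(feed)   # each pooling takes the previous branch's output
            out = out + feed      # residual accumulation onto the input features
        return out

srp = SRP()
print(srp(torch.randn(1, 256, 40, 40)).shape)   # torch.Size([1, 256, 40, 40])
```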
Multiscale Mask head

The head network is responsible for decoding the semantic features in each channel, from which the class and position information of pixels must be obtained. The head of the Mask R-CNN network has three branches: the box head, the class head, and the mask head. The class head and box head share fully connected layers: the feature information output by ROIAlign at a size of 7 × 7 × 256 is fed into the fully connected layers, which raise the dimensionality to 1024 channels before the coal and gangue are classified and regressed.42 The mask head differs in that its input feature size is 14 × 14 × 256, and the target mask is then obtained through the FCN layers.43 As shown in Figure 5A, the original Mask head structure is relatively simple and can be further improved.
Figure 5. Mask head structure. (A) Mask head, and (B) Multiscale Mask head. ROI, region of interest.
Zhao et al.44 found that the vast majority of errors in a segmentation task are linked to contextual relationships and global information in different receptive fields. We therefore designed the multiscale Mask head to collect useful local-global information. First, the input ROI is copied three times. The first path applies a transposed convolution that doubles the feature map size, enriching its position information; a convolution then weights the information in the feature map and enlarges the receptive field, and finally average pooling returns the feature map to its original size, a step that reduces spatial resolution and effectively avoids overfitting. The second path does the opposite, shrinking the feature map and then enlarging it to strengthen its semantic information; the two paths are then spliced to complement each other. The third path keeps the original ROI size, and its output after a convolution-activation operation is fused with the spliced output of the first two paths. This flexibly aggregates multiscale contextual information, increases the diversity of information, and effectively utilizes local-global and contextual information to achieve feature enhancement and better segmentation.
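A rough sketch of the three-path head in Figure 5B follows; the channel counts and fusion by concatenation plus a 1 × 1 convolution are assumptions, not the exact design.

```python
import torch
import torch.nn as nn

class MultiscaleMaskHead(nn.Module):
    """Three-branch mask head sketch: enlarge-then-shrink, shrink-then-enlarge,
    and identity-scale paths, fused at the original ROI resolution."""
    def __init__(self, c=256):
        super().__init__()
        self.up_branch = nn.Sequential(
            nn.ConvTranspose2d(c, c, 2, stride=2),     # 14x14 -> 28x28: richer location info
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.AvgPool2d(2),                           # back to 14x14; tames overfitting
        )
        self.down_branch = nn.Sequential(
            nn.AvgPool2d(2),                           # 14x14 -> 7x7: stronger semantics
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        )
        self.id_branch = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(3 * c, c, 1)             # viewpoint fusion of the three paths

    def forward(self, roi):                            # roi: (N, 256, 14, 14)
        a, b, d = self.up_branch(roi), self.down_branch(roi), self.id_branch(roi)
        return self.fuse(torch.cat([a, b, d], dim=1))

head = MultiscaleMaskHead()
print(head(torch.randn(2, 256, 14, 14)).shape)         # torch.Size([2, 256, 14, 14])
```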
Loss function

As described in Section 2.1, the total loss function of the algorithm and its component losses are as follows:

$$L = L_{cls} + L_{box} + L_{mask} \tag{4}$$

$$L_{cls} = -\frac{1}{N_{cls}} \sum_i \left[ p_i^{*} \log p_i + (1 - p_i^{*}) \log(1 - p_i) \right] \tag{5}$$

$$L_{box} = \frac{1}{N_{reg}} \sum_i p_i^{*} R(t_i - t_i^{*}) \tag{6}$$

$$L_{mask} = -\frac{1}{m^2} \sum_{1 \leq i,j \leq m} \left[ y_{ij} \log \hat{y}_{ij} + (1 - y_{ij}) \log(1 - \hat{y}_{ij}) \right] \tag{7}$$

where $L_{cls}$ denotes the classification loss, $L_{box}$ is the regression loss, and $L_{mask}$ is the mask loss. The $p_i$ and $p_i^{*}$ in Equation (5) represent the predicted and true values of the anchor box, respectively. The $t_i$ and $t_i^{*}$ in Equation (6) denote the predicted and true coordinates, respectively, and $N_{reg}$ denotes the number of pixels in the feature map. The Smooth L1 loss function is denoted by $R$; this loss function does not amplify the loss and makes the algorithm more robust. Equation (7) uses the binary cross-entropy loss function, where $\hat{y}_{ij}$ and $y_{ij}$ denote the predicted and true values of the mask at pixel coordinates $(i, j)$ in the $m \times m$ region. Through the continuous feedback of these three loss functions during training, the segmentation positions and predictions become more accurate.
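For orientation, a hedged sketch of the three terms in Equations (4)-(7) using PyTorch's built-in losses; the tensor names and the sampling details (for example, how positive anchors are selected) are assumptions.

```python
import torch
import torch.nn.functional as F

def mask_rcnn_losses(cls_logits, cls_targets, box_preds, box_targets, pos_mask,
                     mask_logits, mask_targets):
    # Eq. (5): classification loss over anchor/ROI labels (cross-entropy form).
    l_cls = F.cross_entropy(cls_logits, cls_targets)
    # Eq. (6): Smooth L1 regression, computed on positive samples only.
    l_box = F.smooth_l1_loss(box_preds[pos_mask], box_targets[pos_mask])
    # Eq. (7): per-pixel binary cross-entropy over the m x m mask region.
    l_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    # Eq. (4): the total loss is the sum of the three terms.
    return l_cls + l_box + l_mask

loss = mask_rcnn_losses(torch.randn(8, 3), torch.randint(0, 3, (8,)),
                        torch.randn(8, 4), torch.randn(8, 4),
                        torch.tensor([True] * 4 + [False] * 4),
                        torch.randn(8, 1, 28, 28), torch.rand(8, 1, 28, 28))
print(loss.item())
```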
EXPERIMENTATION AND ANALYSIS

Experimental details

In this paper, a small belt conveyor is used to simulate coal and gangue arranged randomly on a conveyor belt during coal mine production. A USB 2.0 camera collects images of the coal and gangue, and a halogen light source (a Sumita Optical Glass LS-LHA, maximum working power 150 W) illuminates the whole system. An HP laptop serves as the storage for the image acquisition system. We collected a total of 1500 images covering four cases: coal only, gangue only, coal and gangue mixed, and coal and gangue overlapping. All images were captured under the same lighting and background conditions and at the same resolution. The data were divided 9:1 into a training-plus-validation set and a test set, and the former was further divided 9:1 into training and validation sets, as shown in Table 2.
Table 2 Coal and gangue data information.
All experiments in this paper were run under the Windows 11 64-bit operating system on an Intel(R) Core(TM) i7-12700F processor, an NVIDIA GeForce RTX 3060 GPU, and 32 GB of RAM. PyCharm (Python 3.9) was used as the development platform with the PyTorch 1.10.1 deep learning framework. The experiments were based on Mask R-CNN with extensive modifications; we did not use the officially provided pretrained weights but retrained the parameters of each part of the network model. All experiments used the SGD optimizer for parameter updating, with 100 training epochs and a batch size of 4. The training hyperparameters are shown in Table 3.
Table 3 Training hyperparameters.
Hyperparameters | Value |
Epoch | 100 |
Batch size | 4 |
Initial learning rate | 0.004 |
Momentum | 0.9 |
Weight decay | 0.0001 |
Num_classes | 2 |
Steps_per_Epoch | 50 |
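The Table 3 settings map directly onto a standard SGD configuration; in this minimal sketch the placeholder module only stands in for the full MFCCM-Mask R-CNN model.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, 3)  # placeholder standing in for MFCCM-Mask R-CNN
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.004,             # initial learning rate (Table 3)
                            momentum=0.9,         # momentum (Table 3)
                            weight_decay=0.0001)  # weight decay (Table 3)
EPOCHS, BATCH_SIZE, NUM_CLASSES = 100, 4, 2       # remaining Table 3 values
```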
In this paper, we choose Precision, Recall, F1 score, and five of the standard COCO evaluation metrics: AP, AP0.5, AP0.75, APmedium, and APlarge. Precision represents the proportion of recognized coal and gangue images that are correct, Recall is the proportion of all positive samples that are correctly recognized, and the F1 score is the weighted average of Precision and Recall. AP denotes the value averaged over 10 IoU thresholds in the interval [0.5, 0.95] in increments of 0.05. AP0.5 and AP0.75 denote the AP calculated at IoU thresholds of 0.5 and 0.75, respectively. The COCO metrics also include the concept of object size, divided into three categories according to the number of pixels contained in the segmentation mask using the two thresholds 1024 and 9216. Since the masks of coal and gangue in this paper all exceed 1024 pixels, we do not use the small-object AP metric. The specific formulas of some evaluation indicators are as follows:

$$P = \frac{TP}{TP + FP} \tag{8}$$

$$R = \frac{TP}{TP + FN} \tag{9}$$

$$F1 = \frac{2 \times P \times R}{P + R} \tag{10}$$

$$AP = \frac{1}{10} \sum_{IoU \in \{0.5,\, 0.55,\, \ldots,\, 0.95\}} AP_{IoU} \tag{11}$$

$$AP_{num} = \frac{1}{n_c} \sum_{c=1}^{n_c} AP_c^{IoU = num} \tag{12}$$

where TP denotes the number of positive samples correctly identified, FN the number of positive samples not identified, FP the number of negative samples misidentified, and $n_c$ the number of categories; $num$ takes either 0.5 or 0.75. In the tables of ablation and comparison experiments in Sections 3.4 and 3.5, we use APm and APl to denote APmedium and APlarge, respectively.
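Equations (8)-(10) translate directly into a few lines of code; the counts below are made-up illustrative numbers, not experimental results.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, Recall, and F1 from Equations (8)-(10)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

print(precision_recall_f1(tp=191, fp=2, fn=1))  # illustrative counts only
```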
Analysis of model training and testing results

To validate the segmentation performance of the improved Mask R-CNN model, we evaluate it on the 150-image test set, as shown by the curves in Figure 6. Because the introduced residual structure increases the difficulty of network learning, and owing to the influence of learning rate and batch size, the improved model oscillates somewhat at first and shows a downward trend around epoch 15, which is a normal phenomenon. After 25 training epochs, the individual evaluation metrics converge without overfitting. As the figure shows, the improved model achieves high segmentation accuracy for coal and gangue.
Ablation experiments

To better verify the performance of our proposed model, this paper compares and analyzes the three improved modules in turn, conducting ablation experiments on the coal and gangue data set under consistent experimental environments, equipment, and parameters. Table 4 lists the models obtained with the different improvements, and Table 5 shows their ablation results on the coal and gangue data set. Model A is the original Mask R-CNN network. Model B replaces the original ResNet50 feature extraction network with the multichannel forward-linked confusion convolution module, abbreviated MFCCM in the table. Model C replaces the original FPN structure with our proposed HR-FPN structure on the basis of B. Model D replaces the original Mask head with our proposed multiscale Mask head on the basis of C.
Table 4 Ablation models for different improved methods.
Model | MFCCM | HR-FPN | Multiscale mask head |
A | × | × | × |
B | √ | × | × |
C | √ | √ | × |
D | √ | √ | √ |
Table 5 Results of ablation experiments (percentage).
Model | P | R | F1 | AP | AP0.5 | AP0.75 | APm | APl |
A | 95.72 | 98.98 | 97.32 | 74.2 | 95.7 | 88.7 | 60.4 | 77.1 |
B | 96.76 | 98.98 | 97.85 | 80.1 | 96.8 | 92.1 | 74.7 | 81.6 |
C | 97.09 | 99.48 | 98.27 | 80.1 | 97.1 | 95.2 | 74.8 | 81.6 |
D | 97.38 | 99.49 | 98.42 | 80.8 | 97.4 | 92.5 | 74.9 | 82.2 |
As shown in Table 5, Model A gives the results of the original Mask R-CNN on the coal and gangue data set. In Model B, after the original feature extraction network is replaced with our proposed MFCCM, every index except the unchanged recall improves considerably, which proves that the proposed module extracts features from the input image effectively. In Model C, after HR-FPN is adopted, the model's metrics improve again, with AP0.75 increasing from 92.1% to 95.2%, indicating that positive and negative samples are better identified at the 0.75 IoU threshold and that the model excels in the segmentation task despite the stricter evaluation criterion. In Model D, introducing the multiscale Mask head sacrifices part of the AP0.75 value, but precision, recall, F1, and the other indicators are optimal: compared with the original model, precision improves by 1.66%, recall and F1 improve by 0.51% and 1.1%, and APm and APl improve by 14.5% and 5.1%, respectively. This shows that our proposed model can accurately segment coal and gangue at different scales.
Comparative experiments

To better demonstrate the effectiveness and innovativeness of our proposed model, we run two sets of comparison tests: the first compares different FPNs experimentally, and the second compares different algorithms on the same coal and gangue data set. The results are shown in Tables 6 and 7 below.
Table 6 Comparative tests of different FPNs (percentage).
Type | P | R | F1 | AP | AP0.5 | AP0.75 | APm | APl |
FPN | 96.76 | 98.98 | 97.85 | 80.1 | 96.8 | 92.1 | 74.7 | 81.6 |
PAN | 96.39 | 98.47 | 97.41 | 79.5 | 96.4 | 91.6 | 70 | 82.2 |
BiFPN | 95.66 | 98.47 | 97.04 | 77.2 | 95.7 | 91.2 | 67.5 | 79.4 |
HR-FPN | 97.1 | 99.48 | 98.27 | 80.1 | 97.1 | 95.2 | 74.8 | 81.6 |
Table 7 Comparative tests of different models (percentage).
Network models | P | R | F1 |
Unet | 96.62 | 97.5 | 97.05 |
Deeplab V3+ | 96.47 | 96.88 | 96.67 |
Ghost-Mask | 95.8 | 98.47 | 97.11 |
VGG-Mask | 95.78 | 98.47 | 97.1 |
Mask R-CNN | 95.72 | 98.98 | 97.32 |
Yoloact | 96.3 | 98.7 | 97.48 |
Yolov7 | 98.8 | 97.4 | 98.1 |
Ours | 97.38 | 99.49 | 98.42 |
Table 6 shows the experimental results of different feature pyramid structures on our data set. FPN uses a top-down structure that passes high-level feature maps to the bottom level through up-sampling, giving the bottom level semantic information as well. The path aggregation network (PAN)45 adds bottom-up paths to FPN to better superimpose and fuse feature information between the bottom and top layers, but increases training and prediction time. BiFPN46 integrates bidirectional cross-scale connectivity and fast normalized fusion, aiming to enhance feature fusion between layers and improve feature pyramid performance. The experimental data show that the HR-FPN proposed in this paper outperforms the other three on all indices except APl, where it is slightly inferior. Figure 7 below visualizes the feature maps output by the different FPNs; the HR-FPN structure covers coal and gangue of medium pixel size more accurately and localizes the focus area precisely. FPN, PAN, and BiFPN produce blurred feature information and contours for medium pixel sizes but clear edge contours for larger coal and gangue; PAN in particular has better spatial location perception for larger objects, which is why it achieves the best APl in Table 6. As Figure 7 shows, the feature pyramid proposed in this paper effectively retains the spatial location and feature information of medium-sized targets, laying a good foundation for the subsequent work.
Figure 7. Feature map visualization. FPN, feature pyramid network; HR-FPN, high-resolution feature pyramid network; PAN, Path aggregation network.
Table 7 reports the evaluation metrics of other current segmentation network models on our data set. Our method achieves the best recall and F1, reaching 99.49% and 98.42%, respectively. Compared with the segmentation networks Yoloact,47 Unet,48 and Deeplab V3+,49 its precision is 1.08%, 0.76%, and 0.91% higher, respectively. Although Yolov750 has high recognition accuracy, the pixel information inside its generated boxes is not well utilized. The third and fourth rows of the table correspond to replacing the feature extraction network of Mask R-CNN with the GhostNet51 and VGGNet52 backbones; the data show that our feature extraction network composed of MFCCM blocks extracts features effectively, and the improved model predicts shape well, effectively mitigating the low segmentation efficiency of the original Mask R-CNN and improving pixel-level segmentation of coal and gangue.
Figure 8 below shows the predictions of the different models on the same images with the same parameters, thresholds, and experimental settings. Deeplab V3+ misidentifies pixels in image 1 and confuses coal with gangue in images 4 and 5. Confusion also occurs in image 5 of the Unet predictions. Ghost-Mask segments one coal block too few in image 5, and VGG-Mask segments poorly, with discontinuous segmented pixels. Yoloact misrecognizes objects in image 5, and its edge contours are not smooth. Yolov7 has high recognition accuracy but cannot segment the coal and gangue shapes effectively. Compared with Mask R-CNN, our model offers good segmentation accuracy and clear edge contours, and the predicted shapes are closer to the actual shapes of coal and gangue in the RGB images. It is not difficult to see from the figure that our proposed MFCCM-Mask R-CNN model is more advantageous for coal and gangue segmentation.
CONCLUSION

Recognition and edge contour segmentation of objects with irregular shapes and sizes has long been a hot research issue in industry, agriculture, and medicine, and the recognition and segmentation of coal and gangue are of great significance for intelligent sorting in the construction of intelligent coal mines and green mining. This paper proposes the MFCCM-Mask R-CNN segmentation network model for coal and gangue segmentation, trains the network parameters on a self-built coal and gangue database, and performs ablation and model comparison experiments on the trained network. The conclusions are as follows:
The multichannel forward-linked confusion convolution module proposed in this paper combines the classical convolution operation, the residual structure, and the shuffle operation, allowing the image's feature information to propagate in series across different channels so that the network can effectively learn the coal and gangue features in the image, laying a foundation for improved segmentation accuracy.
The HR-FPN feature pyramid structure fully fuses the outputs of the feature extraction network with information flows at different scales after Up and Down operations, which effectively shortens the information exchange paths and strengthens the network's focus on small-pixel coal and gangue; the SRP structure then expands the receptive field to realize efficient gradient information propagation, effectively reducing the miss rate.
The multiscale Mask head structure can increase the diversity of information and utilize local-global and contextual information to achieve feature enhancement for better segmentation.
The next step will be to combine the algorithm proposed in this paper with a collaborative robotic arm, using the segmentation results to guide the arm in sorting coal and gangue, and to lightweight the model while maintaining its accuracy.
We thank the editor-in-chief and reviewers for their valuable comments on this paper, and the other authors of this article for their help with the article. This research was funded by Natural Science Research Project of Anhui Educational Committee (No. KJ2021A0427), Postgraduate Innovation Fund of Anhui University of Science and Technology (No. 2023cx2081).
CONFLICT OF INTEREST STATEMENT

The authors declare no conflict of interest.
DATA AVAILABILITY STATEMENT

Data will be made available on request.
© 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Intelligent sorting of coal and gangue is of great significance to the intelligent construction of coal mines and to green development. In this study, we propose a coal and gangue segmentation method that improves the classical segmentation network Mask R-CNN, denoted Multichannel Forward-Linked Confusion Convolution Module (MFCCM)-Mask R-CNN. First, we design the MFCCM and stack it to construct the feature extraction network; second, we design a multiscale high-resolution feature pyramid network structure to realize multipath fusion of feature information and enhance the position and contour information of the target; and finally, we propose a multiscale Mask head to increase the diversity of information and capture more representative and distinctive features. Training and testing on self-built RGB coal and gangue data sets, the improved algorithm reaches an accuracy of 97.38%, an improvement of 1.66% over the original model. Compared with other segmentation models (Unet, Deeplab V3+, Yoloact, Yolov7) and with models whose backbone networks were replaced, MFCCM-Mask R-CNN has higher precision and recall and can more accurately realize efficient segmentation of coal and gangue.