1. Introduction
Recent advancements have boosted demand for poultry products such as meat and eggs. Meat ducks, valued for their high-quality proteins and fats [1], grow faster than other poultry and quickly reach market weight. Proper breeding density is critical for maximizing survival rates and barn space efficiency [2], both of which are essential for large-scale production and animal welfare. Enhancing breeding efficiency requires real-time monitoring of density and effective management of duck populations. As large-scale breeding becomes the predominant operational model in the industry, relying exclusively on manual labor to manage duck houses and control breeding density becomes increasingly impractical. Human observation of extensive surveillance footage is inefficient and depends heavily on the subjective judgment of personnel, lacking a standardized approach [3]; it also increases labor intensity and expenses and hinders the advancement of the farming sector. By leveraging intelligent farming technologies grounded in deep learning algorithms, it is possible to raise production efficiency through automation and smart decision-making support systems. These advancements facilitate precise calculation of optimal breeding densities and streamline farming processes, lowering labor costs while contributing to an overall enhancement in productivity and economic returns [4].
Girshick et al. [5] introduced the R-CNN algorithm, marking a significant advancement in object detection. However, R-CNN warped each candidate region to a fixed size and processed it independently, leading to slow detection and substantial storage requirements for region proposals. To address these limitations, Fast R-CNN [6] and Faster R-CNN [7] were subsequently developed. Fast R-CNN adopted an RoI pooling layer, a single-scale simplification of Spatial Pyramid Pooling (SPP), to accelerate processing, while Faster R-CNN introduced Region Proposal Networks (RPNs) to significantly boost both speed and detection accuracy.
In large-scale breeding, high poultry density often leads to occlusion and crowding. Traditional machine learning methods struggle to perform effectively under such conditions. Zhang et al. [8] pioneered the use of deep learning for crowd counting with a multi-column convolutional neural network, marking the first application of this technique to counting tasks. Although object detection-based counting methods have demonstrated effectiveness, they require significant manual effort for bounding box annotation, which is both costly and time-consuming. In densely occluded real-world settings, these methods often produce inaccurate counts. To overcome these challenges, Cao et al. [9] enhanced the point-supervised localization and counting method proposed by Laradji et al. [10], aiming to enable real-time processing of camera-captured images. By using DenseNet, introduced by Huang et al. [11], as the backbone network and employing point annotation instead of bounding boxes, their approach significantly reduces annotation time and boosts data processing efficiency compared to traditional methods.
Since then, research has advanced these methods, particularly in breeding scenarios. Tian et al. [12] developed a 13-layer convolutional network combining Count CNN [13] and ResNeXt [14] for pig counting, achieving low error rates on sparse datasets. Wu et al. [15] addressed fish counting on dense datasets using dilated convolutions to increase the receptive field without reducing resolution and channel attention modules to enhance feature selection, achieving a mean absolute error (MAE) of 7.07 and a mean squared error (MSE) of 19.16. Li et al. [16] used a multi-column neural network with convolutions replacing fully connected layers, resulting in an MAE of 3.33 and MSE of 4.58 on a dataset of 3200 seedling images. Sun et al. [17] introduced a comprehensive dataset of 5637 fish images and proposed an innovative two-branch network that merges convolutional layers with Transformer architectures. By integrating density maps from the convolution branch into the Transformer encoder for regression purposes, their model attained state-of-the-art performance, achieving an MAE of 15.91 and an MSE of 34.08.
Despite these advancements, significant challenges remain in large-scale breeding counting methods. Low-quality images can negatively impact model detection accuracy, while dense scenes further exacerbate the cost and complexity of bounding box annotations. Traditional labeling approaches also face difficulties in achieving precise identification, particularly in areas with severe occlusion and indistinct features.
To address the specified challenges, this study investigates a point-based detection method that reduces data annotation costs and minimizes computational demands by focusing on the approximate positioning of target points rather than bounding box regression. Leveraging PENet [18] and IOCFormer [17], we propose the Feature Enhancement Module with Parallel Network Architecture (FE-P Net), tailored for the current application. FE-P Net comprises two branches: the primary branch employs a Transformer framework with a regression head for final detection outputs, while the auxiliary convolutional branch generates density maps to enhance feature extraction in the primary branch’s encoder. Additionally, to handle low-resolution images common in industrial settings, a Gaussian pyramid module is integrated into the feature extraction process, improving feature representation and mitigating uneven illumination effects. To address feature redundancy caused by low-quality images, spatial and channel reconstruction units are incorporated into the convolutional branch, filtering out irrelevant information and emphasizing crucial features. Our method enhances annotation efficiency and detection accuracy, particularly for livestock farming data, ensuring robust performance under challenging conditions.
2. Materials and Methods
2.1. Image Acquisition for Meat Duck Counting in Large-Scale Farm
The dataset used in this experiment was collected from a poultry house located in Lishui District, Nanjing City. Spanning approximately 500 square meters, the poultry house uses net-bed breeding methods. Data were collected over June and July 2023 using Hikvision Ezviz CS-CB2 (2WF-BK) cameras, which have a resolution of 1920 × 1080 and a viewing angle of about 45°. Over 136 min of video footage were recorded, with frames sampled every 30 s. After filtering out insect-obscured images, 242 usable images were obtained, containing 20,550 meat duck instances, averaging 84.92 instances per image.
As shown in Figure 1a, the collected images exhibit low resolution, significant occlusion between targets, considerable variations in individual sizes, and varying densities across different regions. These factors present substantial challenges for detection tasks.
2.2. Dataset Construction for Meat Duck Counting in Large-Scale Farm
The 242 collected images were randomly split into training and testing sets at a ratio of 7:3, yielding 171 images for the training set and 71 images for the testing set. The images were then annotated using Labelme 4.5.9, with the annotation results shown in Figure 1b.
2.3. Detail-Enhanced Parallel Density Estimation Model Architecture
As shown in Figure 2, our proposed network model combines density-based [19] and regression-based [20] approaches through three key components: the Detail Enhancement Module (DEM), a density branch utilizing convolutional networks, and a regression branch leveraging Transformers.
The DEM enhances edge details using the Laplacian operator and captures multi-scale features via a pyramid module. It highlights duck-specific features while reducing high-frequency background noise through convolutional and pooling operations, thereby mitigating ambiguities caused by low-resolution images. The density branch generates location-sensitive density maps through a convolutional encoder. These maps are shared between the density head and the regression branch, enhancing duck position extraction and preserving spatial context. The regression branch integrates positional information from the density branch with global semantics extracted by the Transformer. This fusion of local details and global context enables highly accurate quantity prediction.
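To make this data flow concrete, the following minimal sketch outlines how the three components could be wired together; the module classes and names are illustrative placeholders rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class FEPNetSketch(nn.Module):
    """Illustrative skeleton of the dual-branch data flow (placeholder modules)."""

    def __init__(self, dem: nn.Module, density_branch: nn.Module, regression_branch: nn.Module):
        super().__init__()
        self.dem = dem                              # Detail Enhancement Module
        self.density_branch = density_branch        # convolutional branch -> density map + features
        self.regression_branch = regression_branch  # Transformer branch -> point coordinates + scores

    def forward(self, image: torch.Tensor):
        enhanced = self.dem(image)                                 # edge- and scale-enhanced features
        density_map, density_feat = self.density_branch(enhanced)  # location-sensitive density map
        points, scores = self.regression_branch(enhanced, density_feat)
        return density_map, points, scores
```

The density features are passed into the regression branch, mirroring the feature sharing between the two branches described above.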
Composed of these components, the architecture ensures synergistic performance, effectively addressing the complexities of large-scale poultry breeding. This enables FE-P Net to achieve accurate meat duck counts even under challenging conditions.
2.3.1. Detail Enhancement Module (DEM)
The overall structure of the DEM is illustrated in Figure 3. It first uses the Laplacian Pyramid Enhancement Model (LPEM) [21], which is based on a pyramid structure, to extract multi-level features from input images. These feature maps are then refined by the Edge Enhancement and Background Noise Reduction Model (EBM), significantly improving their quality and providing a robust foundation for both the density and regression branches. The enhanced high-level features are subsequently upsampled, combined with low-level features, fused with the original image, and finally passed to the encoder.
The Laplacian pyramid employs a Gaussian kernel to capture multi-scale information. Each Gaussian pyramid operation halves the image dimensions, reducing the resolution to one-fourth of the original. For an input image $I$, with $\mathrm{Down}(\cdot)$ denoting downsampling and $\mathrm{Gauss}(\cdot)$ denoting Gaussian filtering, the Gaussian pyramid operation is
$G_{i} = \mathrm{Down}\big(\mathrm{Gauss}(G_{i-1})\big), \quad G_{0} = I. \quad (1)$
Since Gaussian pyramid downsampling is irreversible and leads to information loss, the Laplacian pyramid is employed to enable reconstruction of the original image:
$L_{i} = G_{i} - \mathrm{Up}(G_{i+1}), \quad (2)$
where $L_{i}$ and $G_{i}$ represent the Laplacian pyramid and the Gaussian pyramid at the $i$-th level, respectively, and $\mathrm{Up}(\cdot)$ is the upsampling function. During image reconstruction, the original resolution can be restored by performing the inverse operations. Ultimately, the Laplacian pyramid generates multi-scale features.

The EBM was designed to enhance local detail extraction and suppress high-frequency noise. It processes the multi-scale information through two components: the Edge Enhancement module (illustrated in Figure 4) and the Low-Frequency Filter module.
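As a concrete illustration of Equations (1) and (2), the sketch below builds Gaussian and Laplacian pyramids with a fixed binomial kernel and bilinear resampling; the kernel values, level count, and pooling-based downsampling are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def gaussian_blur(x: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    # Depthwise blur: apply the same 2D kernel to every channel.
    c = x.shape[1]
    k = kernel.expand(c, 1, *kernel.shape[-2:])
    return F.conv2d(x, k, padding=kernel.shape[-1] // 2, groups=c)

def build_pyramids(img: torch.Tensor, levels: int = 3):
    # Illustrative 5x5 binomial approximation of a Gaussian kernel.
    k1d = torch.tensor([1., 4., 6., 4., 1.])
    k2d = torch.outer(k1d, k1d)
    k2d = (k2d / k2d.sum()).to(img)

    gauss = [img]
    for _ in range(levels):
        blurred = gaussian_blur(gauss[-1], k2d)
        gauss.append(F.avg_pool2d(blurred, 2))        # Eq. (1): blur + downsample
    laplace = []
    for i in range(levels):
        up = F.interpolate(gauss[i + 1], size=gauss[i].shape[-2:],
                           mode="bilinear", align_corners=False)
        laplace.append(gauss[i] - up)                 # Eq. (2): L_i = G_i - Up(G_{i+1})
    return gauss, laplace

def reconstruct(gauss_top: torch.Tensor, laplace: list) -> torch.Tensor:
    # Inverse operation: upsample and add back the Laplacian residuals level by level.
    img = gauss_top
    for lap in reversed(laplace):
        img = F.interpolate(img, size=lap.shape[-2:],
                            mode="bilinear", align_corners=False) + lap
    return img
```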
The context branch captures global context by modeling long-range dependencies and processes them with residual blocks to extract semantic features. For a given input feature, using a linear activation function, a convolution operation, a residual block, and the Softmax function, the processed feature is defined as:
(3)
(4)
The edge branch applies gradient operators in the horizontal and vertical directions to the input features, enhancing texture information through gradient extraction. The processed feature is
(5)
The outputs of the two branches are concatenated and fused via a convolution to generate the final edge-enhanced feature.
The Low-Frequency Filter module captures multi-scale low-frequency semantic information, as detailed in Figure 5, enriching the semantic content of images.
For the input features, a convolution first adjusts the channel count to 32. The resulting features are then separated along the channel dimension into components [22], each of which is processed by adaptive average pooling with a different kernel size and then upsampled. The pooling operation is defined as:
$Y_{k} = \mathrm{Up}\big(\mathrm{AAP}_{k}(X_{c})\big), \quad (6)$
Here, $X_{c}$ represents a feature component after channel separation, $\mathrm{Up}(\cdot)$ denotes bilinear interpolation upsampling, and $\mathrm{AAP}_{k}(\cdot)$ refers to adaptive average pooling with kernel size $k$. The resulting feature tensors are concatenated and processed with a convolution operation to recover the output feature.
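A minimal sketch of such a low-frequency filtering branch is shown below, assuming illustrative pooling output sizes and 1×1 convolutions for channel adjustment and fusion (the paper's exact kernel sizes are not reproduced here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowFrequencyFilterSketch(nn.Module):
    """Illustrative multi-scale low-frequency branch: pool at several scales,
    upsample back, concatenate, and fuse with a convolution."""

    def __init__(self, in_channels: int, mid_channels: int = 32,
                 pool_sizes=(1, 2, 4, 8)):  # pool sizes are placeholders
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        self.pool_sizes = pool_sizes
        self.fuse = nn.Conv2d(mid_channels * len(pool_sizes), in_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        xc = self.reduce(x)                                    # adjust channel count
        branches = []
        for s in self.pool_sizes:
            pooled = F.adaptive_avg_pool2d(xc, output_size=s)  # low-frequency content
            branches.append(F.interpolate(pooled, size=(h, w),
                                          mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(branches, dim=1))           # recover fused feature

# Usage: y = LowFrequencyFilterSketch(in_channels=64)(torch.randn(1, 64, 32, 32))
```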
2.3.2. Density Branch Based on Convolutional Networks
CNNs employ fixed-size kernels to capture local semantic information [22], excelling in extracting detailed local features. However, the similarity in color and texture between the background and objects in our dataset necessitates the EBM to differentiate their features. Convolution operations can introduce redundancy by indiscriminately extracting features across both spatial and channel dimensions [23], potentially impacting final predictions. To address this redundancy, this section integrates a Spatial Reconstruction Unit (SRU) and Channel Reconstruction Unit (CRU) into the convolution branch [24]. The structure of the density branch, based on convolutions, is illustrated in Figure 6.
The input features first undergo two sequential convolutions to compress the channel dimensions. These features are then refined by SRU and CRU modules to reduce redundancy. Finally, the processed features are used for counting loss calculation and feature fusion in the regression branch. The final expression is
$F_{d} = \mathrm{CRU}\Big(\mathrm{SRU}\big(\mathrm{Conv}\big(\mathrm{Conv}(F_{in})\big)\big)\Big), \quad (7)$
where $F_{in}$ denotes the input features and $F_{d}$ the refined features passed to the density head and the regression branch.
Notably, the convolutional branch incorporates both an SRU and CRU in addition to convolution operations to address information redundancy. The SRU primarily comprises separation and reconstruction stages. The separation stage identifies feature maps rich in spatial information from those with less spatial detail. This is achieved using weight factors from group normalization (GN) layers [25] to evaluate the richness of the feature maps. The GN formula is
$\mathrm{GN}(x) = \gamma\,\frac{x - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + \beta, \quad (8)$
where $\mu$ and $\sigma^{2}$ are the mean and variance of the features within a group, $\gamma$ is a trainable scaling factor that evaluates the richness of the feature maps, $\beta$ is a bias parameter, and $\varepsilon$ is a small positive constant that avoids division by zero. To unify the sample distribution and accelerate network convergence, the trainable parameters $\gamma$ are normalized to yield the weights $W_{\gamma}$:
$W_{\gamma} = \{w_{i}\} = \frac{\gamma_{i}}{\sum_{j=1}^{C}\gamma_{j}}, \quad i, j = 1, 2, \ldots, C, \quad (9)$
where $C$ is the number of channels.
After the normalized weights are obtained, the Sigmoid function processes the weight vector to amplify the differences between information-rich and redundant features. A gating unit then splits the processed weight vector in two directions: based on a preset threshold $\theta$, elements greater than $\theta$ are set to 1, forming $W_{1}$, and elements less than or equal to $\theta$ are set to 0, forming the complementary vector $W_{2}$:
$W_{1} = \begin{cases} 1, & w_{i} > \theta \\ 0, & w_{i} \le \theta \end{cases}, \qquad W_{2} = 1 - W_{1}. \quad (10)$
The thresholding operation produces two weight vectors, which are multiplied element-wise with the input features to generate $X_{1}^{w}$ and $X_{2}^{w}$: $X_{1}^{w}$ contains the richer feature information, while $X_{2}^{w}$ carries less semantic information. To eliminate redundancy while preserving useful features, a spatial reconstruction step recombines the two groups of features.
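The separation-and-reconstruction logic can be sketched as follows, in the spirit of SCConv [24]: group normalization scaling factors score channel informativeness, a sigmoid and threshold produce complementary masks, and the two groups are recombined. The threshold value and the cross-reconstruction detail are simplified assumptions, not the authors' exact module.

```python
import torch
import torch.nn as nn

class SRUSketch(nn.Module):
    """Simplified Spatial Reconstruction Unit in the spirit of SCConv [24]."""

    def __init__(self, channels: int, groups: int = 4, threshold: float = 0.5):
        super().__init__()
        # Assumes `channels` is even and divisible by `groups`.
        self.gn = nn.GroupNorm(groups, channels)
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xn = self.gn(x)
        # Eq. (9): normalize the GN scale factors to score channel informativeness.
        w_gamma = (self.gn.weight / self.gn.weight.sum()).view(1, -1, 1, 1)
        w = torch.sigmoid(w_gamma * xn)               # amplify differences between features
        info_mask = (w >= self.threshold).float()     # Eq. (10): W1, informative positions
        weak_mask = 1.0 - info_mask                   # W2, redundant positions
        x1 = info_mask * xn                           # information-rich part
        x2 = weak_mask * xn                           # less informative part
        # Cross reconstruction: exchange channel halves and concatenate (simplified).
        a1, a2 = torch.chunk(x1, 2, dim=1)
        b1, b2 = torch.chunk(x2, 2, dim=1)
        return torch.cat([a1 + b2, a2 + b1], dim=1)
```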
Traditional convolution operations, which use fixed-size kernels for feature extraction along each channel, can introduce channel redundancy. To address this, the CRU module employs a separation-and-fusion strategy. Specifically, the input features are split into groups, and $1\times1$ convolutions compress the channels of these groups, thereby reducing computational cost. The key components, group-wise convolution (GWC) [26] and point-wise convolution (PWC) [27], minimize channel redundancy: GWC applies grouped convolutions to the feature maps, while PWC uses $1\times1$ convolutions to enhance inter-channel communication. In the upper branch, the output $Y_{1}$ is refined to reduce redundancy and improve efficiency:
$Y_{1} = M^{G}X_{up} + M^{P1}X_{up}, \quad (11)$
where $M^{G}$ is the trainable weight parameter of the GWC module, $M^{P1}$ is the learnable parameter of the PWC module in the upper branch, and $X_{up}$ is the input to the upper branch. The lower branch concatenates the PWC-processed features with its original input to obtain the output $Y_{2}$:
$Y_{2} = \mathrm{Concat}\big(M^{P2}X_{low},\; X_{low}\big), \quad (12)$
where $M^{P2}$ is the trainable parameter of the PWC module in the lower branch, and $X_{low}$ is the input to the lower branch. The features from the two branches are then fused. First, they are concatenated and passed through adaptive average pooling to obtain the pooled feature weight vector $S$:
$S = \mathrm{Pool}\big(\mathrm{Concat}(Y_{1}, Y_{2})\big). \quad (13)$
The obtained weight vector is then used to derive the weighting coefficients $\beta_{1}$ and $\beta_{2}$, which scale the two branch outputs. Finally, the weighted features are aggregated to produce the final output $Y$:
$Y = \beta_{1}Y_{1} + \beta_{2}Y_{2}. \quad (14)$
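A compact sketch of the CRU-style fusion in Equations (11)–(14) follows, with an illustrative channel split ratio, group count, and kernel size; it mirrors the GWC/PWC structure of SCConv [24] rather than reproducing the authors' exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRUSketch(nn.Module):
    """Simplified Channel Reconstruction Unit in the spirit of SCConv [24]."""

    def __init__(self, channels: int, split: float = 0.5, groups: int = 2):
        super().__init__()
        # Assumes the split channel counts are divisible by `groups`.
        self.up_ch = int(channels * split)      # "upper" branch channels
        self.low_ch = channels - self.up_ch     # "lower" branch channels
        # Upper branch: group-wise conv (GWC) + point-wise conv (PWC), Eq. (11).
        self.gwc = nn.Conv2d(self.up_ch, channels, kernel_size=3, padding=1, groups=groups)
        self.pwc1 = nn.Conv2d(self.up_ch, channels, kernel_size=1)
        # Lower branch: PWC output concatenated with its input, Eq. (12).
        self.pwc2 = nn.Conv2d(self.low_ch, channels - self.low_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_up, x_low = torch.split(x, [self.up_ch, self.low_ch], dim=1)
        y1 = self.gwc(x_up) + self.pwc1(x_up)                  # Eq. (11)
        y2 = torch.cat([self.pwc2(x_low), x_low], dim=1)       # Eq. (12)
        # Eqs. (13)-(14): pool each branch, derive branch weights, and fuse.
        s = torch.stack([F.adaptive_avg_pool2d(y1, 1),
                         F.adaptive_avg_pool2d(y2, 1)], dim=0)
        beta = torch.softmax(s, dim=0)                         # weights across the two branches
        return beta[0] * y1 + beta[1] * y2                     # Eq. (14)

# Usage: y = CRUSketch(channels=64)(torch.randn(1, 64, 32, 32))
```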
2.3.3. Regression Branch Based on Transformer
The regression branch comprises three main components: a density-enhanced encoder based on the Transformer (DETE), a Transformer decoder [28], and a regression head. Unlike traditional Transformer encoders, the DETE module enhances feature extraction by integrating outputs from a convolutional branch. This integration enables more precise capture of both global semantic information and local detail features, improving the precision of duck localization and quantity estimation. The detailed architecture of the DETE module is illustrated in Figure 7.
The module maps the information extracted by the detail enhancement module into one token sequence while simultaneously mapping the features extracted by the density branch into another. These sequences are then combined with positional embeddings and fed into a Transformer layer. The expressions for this process are as follows:
(15)
(16)
Here, the flattening operation reshapes the feature maps into token sequences for subsequent Transformer processing. The module consists of four Transformer layers, each of which processes features as follows:
(17)
(18)
(19)
where $\mathrm{Conv}(\cdot)$ denotes a convolution operation. The DETE module merges the output of the previous layer with the convolved density features before each Transformer layer. This process gradually combines global and local information, enabling the extraction of richer and more accurate features. After processing by the DETE module, the resulting feature maps contain precise positional information. The Transformer decoder then takes these features and a set of trainable query vectors as inputs and outputs the decoded embedding vectors.
2.3.4. Loss Function
After forward propagation, the classification head maps the output embeddings to a vector of confidence scores, and the regression head maps them to point coordinates [20]. Each query therefore yields a prediction consisting of a confidence score for the corresponding pixel and a pair of predicted coordinates. These predictions are compared with the ground truth points, and each ground truth point is matched with one prediction using the Hungarian algorithm [29]. To compute the matching cost, for the $i$-th ground truth point and its corresponding prediction, the average distance from the prediction to the other points and the average distance from the ground truth point to the other points are first calculated:
(20)
With $c_{i}$ denoting the confidence score of the $i$-th prediction point, the cost function is
(21)
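The one-to-one matching step can be sketched with the Hungarian algorithm as implemented in SciPy; the cost below combines pairwise distances with (1 − confidence) as a stand-in for Equation (21), whose exact weighting is not reproduced here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_points(pred_xy: np.ndarray, pred_conf: np.ndarray,
                 gt_xy: np.ndarray, dist_weight: float = 1.0):
    """One-to-one assignment of predicted points to ground truth points.

    pred_xy: (M, 2) predicted coordinates; pred_conf: (M,) confidence scores;
    gt_xy: (K, 2) ground truth coordinates, with M >= K assumed.
    The cost combines Euclidean distance and (1 - confidence); the weighting is illustrative.
    """
    # Pairwise distances between every ground truth point and every prediction.
    dists = np.linalg.norm(gt_xy[:, None, :] - pred_xy[None, :, :], axis=-1)   # (K, M)
    cost = dist_weight * dists + (1.0 - pred_conf)[None, :]                     # (K, M)
    gt_idx, pred_idx = linear_sum_assignment(cost)   # Hungarian algorithm
    return list(zip(gt_idx.tolist(), pred_idx.tolist()))

# Usage with dummy data:
# matches = match_points(np.random.rand(10, 2), np.random.rand(10), np.random.rand(6, 2))
```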
After the matching relationships between prediction points and ground truth points are determined, the classification loss evaluates the confidence loss of each point; a higher loss value indicates that the features around the point are more distinct from the target object. This loss is calculated using binary cross-entropy:
$\mathcal{L}_{cls} = -\frac{1}{M}\sum_{i=1}^{M}\Big[y_{i}\log\hat{c}_{i} + (1 - y_{i})\log\big(1 - \hat{c}_{i}\big)\Big], \quad (22)$
where $M$ is the number of predictions, $y_{i} = 1$ if the $i$-th prediction is matched to a ground truth point and $y_{i} = 0$ otherwise, and $\hat{c}_{i}$ is its predicted confidence.
During loss calculation, the predicted confidence values are first mapped to the range [0, 1]. To bring the predicted coordinates closer to the true values, a localization loss $\mathcal{L}_{loc}$ expresses the difference between the predicted and ground-truth point coordinates:
(23)
With $\lambda$ being the weight of the density branch loss and $\mathcal{L}_{den}$ denoting that loss, the overall loss function is
$\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{loc} + \lambda\,\mathcal{L}_{den}. \quad (24)$
2.4. Implementation Details
The experiments were conducted on Linux Ubuntu 20.04 with 32 GB of RAM and a 2080 Ti GPU, using Python 3.7.13 and PyTorch 1.13.0. Network parameters were initialized with the Kaiming uniform initialization method. The Adam optimizer was used with a weight decay of 0.0005. Training ran for 3500 iterations with an initial learning rate of 0.00001, which was reduced to 0.1 times its value at the 1200th iteration by a MultiStepLR scheduler.
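Expressed in PyTorch, the optimizer and learning-rate schedule described above look roughly as follows; the model and training step are placeholders.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(10, 1)   # placeholder standing in for the actual FE-P Net model

optimizer = Adam(model.parameters(), lr=1e-5, weight_decay=5e-4)
scheduler = MultiStepLR(optimizer, milestones=[1200], gamma=0.1)  # 0.1x at iteration 1200

for iteration in range(3500):
    optimizer.zero_grad()
    # Placeholder forward/backward pass; replace with the real data loader and loss.
    loss = model(torch.randn(4, 10)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
```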
2.5. Evaluation Metrics
The evaluation metrics used for the counting task were MAE and MSE. MAE measures the difference between the predicted and ground truth counts:
$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\big|\hat{y}_{i} - y_{i}\big|, \quad (25)$
where $N$ is the number of test images, $\hat{y}_{i}$ and $y_{i}$ are the predicted and true counts for the $i$-th image, respectively, and $|\cdot|$ denotes the absolute value. In counting tasks, MAE represents the average error in estimating individual counts; a smaller MAE indicates more accurate predictions. However, MAE does not highlight outliers. MSE, which uses squared differences to emphasize larger errors and highlight outliers, is defined as:
$\mathrm{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(\hat{y}_{i} - y_{i}\big)^{2}}. \quad (26)$
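Both metrics can be computed directly from per-image counts, as in the sketch below; here MSE is taken as the root of the mean squared error, following the common convention in counting work (an assumption about the exact definition used).

```python
import numpy as np

def counting_metrics(pred_counts, true_counts):
    """MAE and MSE over per-image counts (MSE here is the root of the mean
    squared error, a common convention in counting papers)."""
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    mae = np.mean(np.abs(pred - true))
    mse = np.sqrt(np.mean((pred - true) ** 2))
    return mae, mse

# Example: counting_metrics([83, 90, 78], [85, 88, 80]) returns (2.0, 2.0)
```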
3. Results
3.1. Count Result Analysis
Figure 8 shows the prediction results of our method. Specifically, (a) represents the original images in the dataset, (b) shows the visualized label values generated from the annotation files, (c) displays the density map predicted by the convolution branch, and (d) shows the coordinate results regressed by the regression branch.
Table 1 compares the counting accuracy of various models on the meat duck counting dataset. Compared to the classical convolutional networks MFF [30] and P2PNet [31], our method improves MAE by 2.92 and 1.63 and MSE by 2.95 and 0.90, achieving accuracy improvements of 3.45% and 1.68%, respectively. Compared to the Transformer-based models MAN [32] and CCST [33], our method improves MAE by 1.14 and 0.87 and MSE by 1.11 and 0.48, achieving improvements of 1.38% and 1.01%, respectively. Overall, Transformer-based detectors show higher accuracy than convolutional detectors, because convolution operations are sensitive to fine-grained local semantic information but degrade when extracting features from low-resolution images. Our method achieved a mean absolute error of 3.01 and an accuracy of 96.46% on the meat duck dataset, outperforming the other models, which indicates that it can better detect individual meat ducks in poultry house environments.
Figure 9 shows that Transformer-based networks excel in high-density scenarios, while convolution-based networks perform better in low-density areas. Transformers use an attention mechanism to capture long-range dependencies, improving detection of closely spaced objects without misidentifying the background. In contrast, convolutional networks effectively capture local features and edges due to their receptive fields, aiding precise object localization. Our method integrates density features from convolution branches with Transformer-extracted features using a density-fused Transformer encoder. This approach balances global semantics and local details, enhancing overall detection performance.
3.2. Ablation Experiments
Ablation experiments were conducted to evaluate the impact of the various substructures on model performance. Our method includes two branches and a detail enhancement module. Several ablation studies were designed to validate their effectiveness, with the results shown in Table 2.
Without the convolution branch, the model's MAE increased to 8.44 and its MSE to 10.29. This increase is attributed to the absence of the density-enhanced Transformer encoder and of the counting loss supplied by the convolution branch, rendering that loss strategy ineffective. Adding the convolution branch significantly improved performance, reducing the MAE to 4.15 and the MSE to 4.95. Incorporating the LPEM module further enhanced feature representation, mitigating low-resolution image issues and improving object localization and edge detection, which lowered the MAE to 3.56. Finally, adding the SCCONV module to the convolution branch eliminated feature redundancy, allowing the Transformer encoder to integrate more precise features and reduce noise, ultimately lowering the MAE to 3.01. These experiments demonstrate the positive contribution of each module and validate the superiority of the dual-branch structure.
The encoder plays a crucial role in feature extraction, generating feature maps from input images to support subsequent detection tasks. Selecting an appropriate feature extraction network can significantly enhance model performance. This section evaluates the use of different feature extraction networks as the backbone encoder through a series of experiments, with the results shown in Table 3.
The experimental results indicate that using ResNet-50 [34] as the backbone encoder achieves the best performance. Shallower networks such as ResNet-18 and ResNet-34 offer faster inference but provide less comprehensive feature extraction, leading to reduced accuracy. In contrast, using ResNet-101 as the encoder results in over-extraction due to its deeper layers, which also decreases accuracy.
4. Conclusions
In practical production environments, image quality is often low, and traditional bounding box annotation methods are labor-intensive. Our study introduces the FE-P Net architecture, which uses point annotations combined with a feature enhancement module to refine image features and improve detection accuracy. This approach offers a novel solution for enhancing poultry counting accuracy in farming scenarios.
Unlike existing methods that struggle with low-resolution images, FE-P Net employs a dual-branch architecture to integrate local detail extraction and global semantic understanding. By refining edge details and enhancing feature representations while reducing redundancy in both the spatial and channel dimensions, FE-P Net effectively mitigates issues related to low resolution, blurred edges, and uneven lighting. This results in significant performance improvements, achieving 96.46% accuracy on the meat duck dataset. Comparative experiments confirm that FE-P Net outperforms other models on this dataset, successfully completing the counting task. Ablation studies further validate the contributions of each module to overall network performance.
The findings have important practical implications for poultry breeding. Deploying FE-P Net in real-time monitoring systems enables automated and accurate population counts, which is crucial for optimizing space utilization, improving animal welfare, and enhancing farm management efficiency. The use of point annotations significantly reduces the labor required for labeling, facilitating scalable deployment across large farming operations. However, there are areas for improvement. Manual point annotations still require precise placement, especially in highly crowded scenes with frequent occlusions. Further research should explore the model's robustness under varying lighting and environmental conditions. Combining FE-P Net with larger models could further enhance performance and adaptability.
In conclusion, this study advances the application of deep learning techniques in poultry breeding management. By addressing key challenges in image quality and object density, FE-P Net provides a promising solution for automating meat duck counting. As agriculture continues its digital transformation, innovations like ours will be essential in driving productivity and sustainability.
Conceptualization, W.T. and H.Q.; methodology, W.T. and M.L.; software, W.T.; validation, H.Q., X.C. and T.W.; formal analysis, Y.E.X.; investigation, X.C. and S.S.; resources, X.C. and S.S.; data curation, W.T.; writing—original draft preparation, W.T.; writing—review and editing, H.Q., M.L. and X.C.; visualization, H.Q.; supervision, X.C.; project administration, X.C. and M.L.; funding acquisition, Y.E.X. and S.S. All authors have read and agreed to the published version of the manuscript.
Not applicable.
Not applicable.
The data presented in this study are available on request from the corresponding author. The data are not publicly available because they were collected in privately owned poultry facilities.
The authors would like to express their gratitude for the valuable feedback and suggestions provided by all the anonymous reviewers and the editorial team.
The authors declare no conflicts of interest.
Figure 1. (a) Data were collected with a Hikvision Ezviz camera. (b) Original images and labeled images.
Figure 4. Structure diagram of the Edge Enhancement module (with context and edge branches).
Figure 8. The prediction results of our method: (a) the original image in the dataset, (b) the visualized label values generated from the annotation file, (c) the density map predicted by the convolution branch, (d) the coordinate results regressed by the regression branch.
Table 1. Comparison of counting accuracy among models.
Model Name | MAE | MSE | Accuracy |
---|---|---|---|
CCST | 3.89 | 4.69 | 95.41% |
MAN | 4.18 | 5.32 | 95.08% |
MFF | 5.93 | 7.16 | 93.01% |
TransCrowd | 3.95 | 5.03 | 95.35% |
P2PNet | 4.64 | 5.11 | 94.53% |
IOC | 4.15 | 4.95 | 95.11% |
FE-P | 3.01 | 4.21 | 96.46% |
The bold values represent the optimal performance.
Table 2. Ablation experiments for the different modules.
LPEM | CNN Branch | SCCONV | MAE | MSE |
---|---|---|---|---|
× | × | × | 8.44 | 10.29 |
× | ✔ | × | 4.15 | 4.95 |
✔ | ✔ | × | 3.56 | 4.32 |
✔ | ✔ | ✔ | 3.01 | 4.15 |
A ✔ symbol indicates that the module has been added and a × symbol indicates that the module has not been added. The bold values represent the optimal performance.
Table 3. Ablation experiments for the different feature extraction networks.
Encoder | MAE | MSE | Accuracy |
---|---|---|---|
ResNet-18 | 3.93 | 4.77 | 95.37% |
ResNet-34 | 3.76 | 4.63 | 95.57% |
ResNet-50 | 3.02 | 4.15 | 96.46% |
ResNet-101 | 3.62 | 4.67 | 95.73% |
The bold values represent the optimal performance.
References
1. Pereira, P.M.D.C.C.; Vicente, A.F.D.R.B. Meat nutritional composition and nutritive role in the human diet. Meat Sci.; 2013; 93, pp. 586-592. [DOI: https://dx.doi.org/10.1016/j.meatsci.2012.09.018] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/23273468]
2. Pettit-Riley, R.; Estevez, I. Effects of density on perching behavior of broiler chickens. Appl. Anim. Behav. Sci.; 2001; 71, pp. 127-140. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/11179565]
3. Jiang, K.; Xie, T.; Yan, R.; Wen, X.; Li, D.; Jiang, H.; Jiang, N.; Feng, L.; Duan, X.; Wang, J. An Attention Mechanism-Improved YOLOv7 Object Detection Algorithm for Hemp Duck Count Estimation. Agriculture; 2022; 12, 1659. [DOI: https://dx.doi.org/10.3390/agriculture12101659]
4. Fan, P.; Yan, B. Research on the Application of New Technologies and Products in Intelligent Breeding. Livest. Poult. Ind.; 2023; 34, pp. 36-38.
5. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation; IEEE Computer Society: Washington, DC, USA, 2014.
6. Girshick, R. Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision; Santiago, Chile, 7–13 December 2015.
7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell.; 2017; 39, pp. 1137-1149. [DOI: https://dx.doi.org/10.1109/TPAMI.2016.2577031] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27295650]
8. Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Las Vegas, NV, USA, 27–30 June 2016.
9. Cao, L.; Xiao, Z.; Liao, X.; Yao, Y.; Wu, K.; Mu, J.; Li, J.; Pu, H. Automated Chicken Counting in Surveillance Camera Environments Based on the Point Supervision Algorithm: LC-DenseFCN. Agriculture; 2021; 11, 493. [DOI: https://dx.doi.org/10.3390/agriculture11060493]
10. Laradji, I.H.; Rostamzadeh, N.; Pinheiro, P.O.; Vazquez, D.; Schmidt, M. Where Are the Blobs: Counting by Localization with Point Supervision; Springer: Cham, Switzerland, 2018.
11. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks; IEEE Computer Society: Washington, DC, USA, 2016.
12. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV); Seoul, Korea, 27 October–2 November 2019.
13. Oñoro-Rubio, D.; López-Sastre, R.J. Towards Perspective-Free Object Counting with Deep Learning. Proceedings of the European Conference on Computer Vision (ECCV); Amsterdam, The Netherlands, 11–14 October 2016.
14. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks; IEEE: Piscataway, NJ, USA, 2016.
15. Wu, J.; Zhou, Y.; Yu, H.; Zhang, Y.; Li, J. A Novel Fish Counting Method with Adaptive Weighted Multi-Dilated Convolutional Neural Network; IEEE: Piscataway, NJ, USA, 2021.
16. Li, W.; Zhu, Q.; Zhang, H.; Xu, Z.; Li, Z. A lightweight network for portable fry counting devices. Appl. Soft Comput.; 2023; 136, 110140.
17. Sun, G.; An, Z.; Liu, Y.; Liu, C.; Sakaridis, C.; Fan, D.; Gool, L.V. Indiscernible Object Counting in Underwater Scenes. Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Vancouver, BC, Canada, 17–24 June 2023; pp. 13791-13801.
18. Hu, M.; Wang, S.; Li, B.; Ning, S.; Fan, L.; Gong, X. PENet: Towards Precise and Efficient Image Guided Depth Completion. arXiv; 2021; arXiv: 2103.00783
19. Wang, B.; Liu, H.; Samaras, D.; Hoai, M. Distribution Matching for Crowd Counting. arXiv; 2020; arXiv: 2009.13077
20. Liang, D.; Xu, W.; Bai, X. An End-to-End Transformer Model for Crowd Localization. arXiv; 2022; arXiv: 2202.13065
21. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Honolulu, HI, USA, 21–26 July 2017.
22. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis.; 2015; 115, pp. 211-252. [DOI: https://dx.doi.org/10.1007/s11263-015-0816-y]
23. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv; 2019; arXiv: 1905.11946
24. Li, J.; Wen, Y.; He, L. SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy. Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Vancouver, BC, Canada, 17–24 June 2023; pp. 6153-6162.
25. Wu, Y.; He, K. Group Normalization; Springer: New York, NY, USA, 2018.
26. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks; NIPS: Grenada, Spain, 2012.
27. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv; 2017; arXiv: 1704.04861
28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv; 2017; arXiv: 1706.03762
29. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv; 2020; arXiv: 2005.12872
30. Zhang, X. Deep Learning-based Multi-focus Image Fusion: A Survey and A Comparative Study. IEEE Trans. Pattern Anal. Mach. Intell.; 2021; 44, pp. 4819-4838.
31. Yin, K.; Huang, H.; Cohen-Or, D.; Zhang, H. P2P-NET: Bidirectional Point Displacement Net for Shape Transform. ACM Trans. Graph.; 2018; 37, pp. 152.1-152.13. [DOI: https://dx.doi.org/10.1145/3197517.3201288]
32. Lin, H.; Ma, Z.; Ji, R.; Wang, Y.; Hong, X. Boosting Crowd Counting via Multifaceted Attention. arXiv; 2022; arXiv: 2203.02636
33. Li, B.; Zhang, Y.; Xu, H.; Yin, B. CCST: Crowd counting with swin transformer. Vis. Comput.; 2022; 39, pp. 2671-2682.
34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Las Vegas, NV, USA, 27–30 June 2016; pp. 770-778. [DOI: https://dx.doi.org/10.1109/CVPR.2016.90]
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Traditional object detection methods for meat duck counting suffer from high manual costs, low image quality, and varying object sizes. To address these issues, this paper proposes FE-P Net, an image enhancement-based parallel density estimation network that integrates CNNs with Transformer models. FE-P Net employs a Laplacian pyramid to extract multi-scale features, effectively reducing the impact of low-resolution images on detection accuracy. Its parallel architecture combines convolutional operations with attention mechanisms, enabling the model to capture both global semantics and local details, thus enhancing its adaptability across diverse density scenarios. The Reconstructed Convolution Module is a crucial component that helps distinguish targets from backgrounds, significantly improving feature extraction accuracy. Validated on a meat duck counting dataset in breeding environments, FE-P Net achieved 96.46% accuracy in large-scale settings, demonstrating state-of-the-art performance. The model shows robustness across various densities, providing valuable insights for poultry counting methods in agricultural contexts.
1 College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 210095, China;
2 Xinjiang Intelligent Livestock Key Laboratory, Xinjiang Uygur Autonomous Region Academy of Animal Science, Urumqi 831399, China