1. Introduction
Vision-based pedestrian recognition [1–3] is applied widely, including video surveillance [4,5], driving assistance [6,7], human-machine interaction [8], advanced robotics, video indexing [9], and automated driving systems [10]. Recognizing a pedestrian has always been critical to achieving an accurate and precise safety system [11]. Human pose estimation (HPE) [12] is one of the most challenging fields and a hot topic of study for researchers [13]. From an image or video of a human, the posture, orientation, position, or three-dimensional (3D) [14] placement of the body is estimated. 2D or 3D imaging techniques [15] are used in HPE to determine the pose, orientation, action, location of human body parts [16], and facial expression [17]. The primary goal of HPE is to recognize a human body and its parts, i.e., the head, hands, elbows, arms, wrists, ankles, and shoulders; these parts are crucial for analyzing the human in images. In graphics applications, e.g., animated movies, action recognition, biometrics [18,19], simulations, and games, human positions and orientations need to be estimated, so HPE is particularly critical there. Some existing systems require a set of cameras to cover the monitored area for HPE. Stationary camera installations are also used, but these are expensive. In addition, calibrated and conventional cameras can be used for motion-based recognition [20].
One of the most significant issues is the complexity of viewpoint variation. For computer vision researchers, HPE has always remained a challenging task due to variations in appearance, occlusion, barely visible joints, and small joints [21]. Therefore, different camera angles are used in different scenarios [22]. HPE is used in many real-world applications, such as pedestrian orientation and action recognition [23], medical image analysis [24], human-computer interaction [25], understanding of human behavior [26], surveillance, sports, and biomechanics [27]. In intelligent video surveillance systems, deep learning technologies are applied for human activity recognition (HAR) [28–31], which predicts human behaviors and estimates human poses [32]. To estimate the pose of a human or pedestrian, its orientation is vitally important in computer vision research. The most crucial task in the human body orientation estimation (HBOE) [33] challenge is to accurately evaluate the direction of motion of a pedestrian in a given video or photograph. Accurate HBOE can significantly improve the estimation of human posture as a vital component of a behavior analysis system [34]. Estimating the orientation of the human body, or of a specific part of it, is significant for a variety of activities, for example, healthcare scenarios [35], counting people [36], fall detection [37], remotely tracking a patient's recovery [38], predicting falls in older adults [39], and robot-human interaction, because robots orient themselves to observe and interact with humans more naturally [40]. Security cameras can also identify people's behavior more precisely. Body orientation, by indicating the walking direction of a pedestrian, is an excellent cue for predicting what the pedestrian is likely to do next in autonomous driving [41]. Fig 1(A) depicts people viewed from several cameras with different orientations and pose angles. Fig 1(B) depicts the eight pose angles: 0° South (Front), 45° South East (Right Front), 90° East (Right), 135° North East (Right Back), 180° North (Back), 225° North West (Left Back), 270° West (Left), and 315° South West (Left Front).
[Figure omitted. See PDF.]
(a) People’s perspectives with different orientations and pose angles (b) 0° South (Front), 45° South East (Right Front), 90° East (Right), 135° North East (Right Back), 180° North (Back), 225° North West (Left Back), 270° West (Left), 315° South West (Left Front) (the pedestrian images in this diagram are taken from PKU-reid [42], BDBO [43], and TUD [44,45] datasets).
In computer vision, HBOE is a critical problem investigated primarily for pedestrian safety and activity prediction [46] and robotic applications [47]. The direction of interest in a video surveillance system is dictated by the orientation of the human body or head [48]. For effective social engagement, body and head alignment and movement are essential nonverbal communication skills; some individuals with social communication difficulties may struggle to display normative nonverbal communication indicators, such as maintaining regular eye contact and consistently orienting their body towards a speaker [49]. Red, green, and blue (RGB) [50] cameras are used in most body orientation estimation scenarios due to their low cost [51]. However, RGB-D cameras, such as Intel RealSense [52] and Microsoft Kinect, are also used. When there is not enough data to extract relevant features, handcrafted features are constructed and work admirably [52]. Aside from handcrafted features, in an end-to-end learning system an automatic feature extraction technique can also address most classification tasks; at the same time, deep CNN-based models have proved more competitive in object recognition tasks [53]. To determine human body orientation [54–57], numerous motion-tracking systems have been established, including LIDAR systems [60] used as motion trackers [61] as well as mechanical [62] and optical motion trackers [58]. HBOE from RGB photographs has seen significant development in recent years and has been performed successfully in a limited number of cases [59]. However, uncontrolled conditions, such as illumination changes, occlusions, and significant posture changes, have reduced the effectiveness of RGB-based systems.
CNN-based techniques are used to resolve a variety of performance difficulties. Therefore, a deep learning CNN-based method is developed in this work to estimate human pose appearances. The following are the primary contributions of this manuscript:
1. Grayscale images taken with 2D cameras are used as input for the proposed model. Because standard images contain some noise, the results are compromised; dehazing is therefore used as a preprocessing step to clear hazy, low-visibility, low-resolution, or otherwise degraded images and improve their visibility.
2. A CNN-based system comprising 66 layers with branches B1, B2, and B3 is proposed for feature extraction to depict appearance-based orientation and full-body pose estimation. Both still images and image sequences are used in the proposed system.
3. Ant colony selection (ACS) [60] is used to select an optimized feature set, which is provided to nine (9) different classifiers to classify and estimate the full-body pose of a pedestrian. The outcomes are assessed against existing techniques, and promising results are obtained.
The manuscript is organized as follows. Section 1 is the introduction; it provides an overview of the problem being addressed, the motivation behind the research, and the objectives of the study, and it outlines the significance and scope of the work. Section 2 covers related work; it highlights the key findings, methodologies, and gaps in the current knowledge that the present study aims to address, thereby establishing the context and foundation for the research. Section 3 details the proposed methodology, including a comprehensive description of the proposed CNN framework, the architectural setup, and the implementation specifics needed for replication and validation. Section 4 presents the results and discussion: it introduces the datasets, including the newly created large dataset for full-body pose and orientation classification (BDBO), describes the experimental setup and evaluation metrics, and provides a detailed quantitative and qualitative analysis of the results. Section 5 compares the proposed method with existing works, and Section 6 concludes the manuscript.
2. Related work
Existing techniques focus on the highest-level semantic feature embedding [61], ignoring the insights concealed in prior layers. Furthermore, due to misalignment and pose fluctuations, pose-related information is not fully utilized [62]. Numerous studies on human pose estimation have been conducted. HPE approaches fall into two sets: (a) top-down [63] and (b) bottom-up [64]. Top-down [65] and bottom-up approaches [66] are employed in the literature to discover all key points before grouping them into distinct instances. According to the estimated number of people in the image, HPE addresses either a group of people or a single person.
Compared to multi-person and multiview pose estimation, estimating the pose of a single person from a given image that may (and usually does) include more than one person is substantially more tractable [67]. Both approaches can also determine an individual's posture and pose in an image. For the single-person setting, deep learning techniques can be divided into regression-based approaches [68] and body part detection-based approaches [68]. Regression models learn a mapping from an input image to body joints or model parameters using an end-to-end approach [69]. Body part detection algorithms aim to estimate the relative positions of various joints and body parts [70], commonly handled by heatmap representations [71]. Early research relied on regression systems; for example, [72] proposed DeepPose [73], a cascaded deep neural network regressor, to estimate joints and body parts from photos. As a result of DeepPose's exceptional performance, the HPE research framework shifted away from traditional methodologies and toward deep learning, notably CNNs [74]. Based on GoogleNet [75], an Iterative Error Feedback (IEF) [76] network was presented that feeds estimation errors back into the input space to gradually refine an initial estimate.
For 2D human poses, the top-down technique first detects the number and position of people and then processes each individual. Top-down approaches can also benefit from the strengths of bottom-up approaches. Several well-known strategies for estimating the pose of a single person already exist. Still, they make an early commitment: if the single-person detector fails, which is frequent when individuals are in close proximity, there is no prospect of recovery. In earlier years, sets of handcrafted features, geometric features, and perspective relationships were employed to predict the 3D poses of humans. In recent years, the use of deep neural networks for image-based 3D HPE has expanded, mainly due to considerable developments in deep learning methodologies [77]. Estimating the human body or its shape from color photographs is difficult. A unique work is presented in [78] to analyze the shape of the human body from numerous color images using off-the-shelf segmentation [79] without regard for stance, backdrop, or camera viewpoint. To determine the 3D human pose from a single RGB image, a framework is presented in [80]; the reconstruction network, depth map, 2D pose estimator, and monocular image establish a dynamic and user-friendly system. In [81], metric-scale truncation-robust (MeTRo) pose estimation is presented, using volumetric heatmaps for root-relative localization; rather than being restricted to image space, a fully convolutional network directly estimates a person's location in metric space.
Two main approaches to estimating the 3D human pose [82] of a single person are known as model-free and model-based [82] approaches, depending on whether or not they estimate the 3D human position using a human body model [83]. 3D multi-person HPE comprises top-down approaches that use a human detection network to identify single-person regions; a 3D pose architecture then computes an individual 3D pose for every detected person, and the global coordinate system is used to align the three-dimensional postures. Bottom-up approaches, in contrast, first estimate all joints and their depth maps and then associate body parts with each individual using relative depth and root depth [84].
The existing techniques used for HPE exhibit several differences and limitations. A few techniques neglect key insights from earlier layers that contain crucial pose-related data. Some do not adequately address misalignment and pose fluctuations, which may lead to mistakes in pose estimation. In crowded environments in particular, top-down techniques fail if the initial single-person detector fails. Estimating 3D poses from color photographs is challenging since there is little depth information. Many techniques start with the assumption that there is just one person present; when this assumption is wrong, serious errors can result. Accurate reconstruction of 3D poses depends on depth information, which is absent when estimating human poses from 2D photos. The complexity of multi-person pose estimation arises from the necessity to accurately estimate the poses of numerous individuals while keeping them distinct. People standing close together can cause pose estimation mistakes because overlapping body parts are difficult for algorithms to distinguish. Even with recent advances, deep learning models still struggle to predict poses correctly in complicated settings such as changing camera angles and occlusions.
The proposed method incorporates advanced and innovative strategies that enhance performance, addressing several limitations of existing pedestrian full-body pose and orientation estimation techniques. Typically applied to color images, traditional models often struggle with noise and feature selection. In contrast, the proposed approach utilizes grayscale images and includes dehazing as a preprocessing step, significantly improving accuracy and visibility. The method features a 66-layer CNN model with three distinct branches (B1, B2, and B3), which excels at capturing complex features such as poses and orientations and effectively handles still images with diverse backgrounds.
Additionally, feature optimization using ACS ensures that only the most relevant and appropriate features are utilized, which enhances the robustness, accuracy, and efficiency of classification. Cross-validation was conducted on three distinct datasets, where the proposed method outperformed the state-of-the-art models. The method's average accuracies of 95% and 97% across various configurations demonstrate its effectiveness, robustness, and efficiency.
3. Proposed methodology
This portion provides a comprehensive outline of the proposed CNN architecture. Fig 2 depicts a complete sketch of the proposed approach paradigm.
[Figure omitted. See PDF.]
(The pedestrian images in this diagram are taken from BDBO, PKU-reid, and TUD datasets).
The description of the proposed 66-layer CNN model is also included in this section. The primary processes of the proposed CNN framework, including pre-training on the CIFAR-100 dataset [85] with 100 different classes, are also covered. Preprocessing (dehazing), feature extraction from the extensive dataset for body orientation (BDBO) [43], PKU-Reid [42], and TUD multiview pedestrians [44,45] datasets using the proposed deep learning-based CNN model, selection of feature subsets using ant colony selection, and classification/estimation using different classifiers are also elaborated in this section.
3.1 Preprocessing
Image pre-processing is crucial for preparing images in deep learning for tasks like training, testing, and inference. It involves resizing, orientation, color corrections, and haze removal. Dehazing is employed as a preprocessing step to reduce haze in images, resulting in improved visibility and smoother appearance. Haze removal is challenging but essential for computer vision and photography in low-light or bad-weather scenarios. Researchers have focused on obtaining high-quality dehazed images in the past decade. Poor weather conditions, such as haze, fog, mist, or smog, reduce color and contrast in images. In computer vision, Eq (1) is usually used to describe the formation of a foggy or hazy image [85].
I(x) = J(x) t(x) + A (1 - t(x))    (1)
In Eq (1), the hazy image is represented by I(x), J(x) represents the reconstructed dehazed image, A represents the airlight (global atmospheric light), and t(x) represents the transmission. The transmission t(x) is the fraction of light that is not scattered and reaches the camera. The basic steps in the dehazing [86] process are as follows: hazy images are processed to generate image inputs and weight maps; the weight maps from the first input are used as subsequent weight maps for the second input; normalized weight maps are created from these resultant weight maps; and Gaussian and Laplacian pyramids are applied to the normalized weight maps of the derived input images. The outcome is an improved dehazed image suitable for use in deep-learning models for further processing. Fig 3 depicts the complete illustration of the dehazing process. To reconstruct the dehazed image, the dehazing process inverts the model of Eq (1), as shown in Eq (2):

J(x) = (I(x) - A) / t(x) + A    (2)
[Figure omitted. See PDF.]
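To make Eqs (1) and (2) concrete, the following NumPy sketch synthesizes a hazy image from a clean image with a known airlight A and transmission t(x), and then recovers the dehazed image by inverting the model. The constant airlight value and the lower bound t0 on the transmission are illustrative assumptions, not values taken from the proposed pipeline.

```python
import numpy as np

def apply_haze(J, A, t):
    """Forward model of Eq (1): I(x) = J(x)*t(x) + A*(1 - t(x))."""
    return J * t + A * (1.0 - t)

def dehaze(I, A, t, t0=0.1):
    """Inverse model of Eq (2): J(x) = (I(x) - A)/t(x) + A.

    t is clipped from below (t0) to avoid amplifying noise where the
    transmission is close to zero.
    """
    t = np.maximum(t, t0)
    return np.clip((I - A) / t + A, 0.0, 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    J = rng.random((64, 64))      # clean grayscale image in [0, 1]
    t = np.full_like(J, 0.6)      # assumed constant transmission map
    A = 0.9                       # assumed airlight

    I_hazy = apply_haze(J, A, t)
    J_rec = dehaze(I_hazy, A, t)
    print("max reconstruction error:", np.abs(J - J_rec).max())
```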
During the dehazing process, three image maps are generated: luminance, chromatic, and saliency weight maps. The luminance weight map assigns higher values to brighter regions and lower values to darker regions, aiding in contrast enhancement, brightness adjustment, and visibility improvement in image dehazing. One standard method for generating a luminance weight map is based on the local contrast and frequency content of the image, which can be expressed using the following equation:(3)
In Eq (3), L(x,y) is the luminance value at pixel (x,y), and R(x,y), G(x,y), and B(x,y) are the red, green, and blue values at the same pixel. The constants k and σ control the range and sensitivity of the weight map.
The chromatic weight map is derived from the color content of an image, assigning higher values to regions with rich or saturated colors, as expressed in Eq (4), where C(x,y) is the chromatic value at pixel (x,y) and R(x,y), G(x,y), and B(x,y) are the RGB color values. The saliency weight map captures attention-grabbing features in an image by analyzing visual cues such as color contrast, texture, and object boundaries; in image dehazing it highlights hazy regions and is applied to restore colors and contrast. A standard method for generating the saliency weight map in image dehazing is based on contrast and color statistics, as expressed in Eq (5).
(5)
S(x,y) is the saliency value, and k and σ are constants controlling the saliency range and sensitivity.
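Since the exact forms of Eqs (3) to (5) are not reproduced here, the following sketch shows one plausible way to compute and normalize the three weight maps described above. The specific luminance coefficients, the saturation-based chromatic weight, and the Gaussian-blur saliency estimate are assumptions for illustration, not the equations used in the proposed preprocessing.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def weight_maps(rgb):
    """Compute illustrative luminance, chromatic, and saliency weight maps
    for an RGB image with values in [0, 1]."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    # Luminance weight: brighter regions get higher values.
    luminance = 0.299 * R + 0.587 * G + 0.114 * B

    # Chromatic weight: saturated regions get higher values.
    mx, mn = rgb.max(axis=-1), rgb.min(axis=-1)
    chromatic = (mx - mn) / (mx + 1e-6)

    # Saliency weight: deviation of the blurred luminance from its mean.
    blurred = gaussian_filter(luminance, sigma=3)
    saliency = np.abs(blurred - blurred.mean())

    # Normalize so the three maps sum to one at every pixel.
    stack = np.stack([luminance, chromatic, saliency], axis=0) + 1e-6
    return stack / stack.sum(axis=0, keepdims=True)

if __name__ == "__main__":
    img = np.random.default_rng(1).random((32, 32, 3))
    w = weight_maps(img)
    print(w.shape, w.sum(axis=0).round(3).min())  # (3, 32, 32), sums to 1.0
```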
3.2 Acquisition of deep features
The proposed framework extracts features using a deep learning-based trained CNN framework. For this purpose, a third-party pre-existing dataset, CIFAR-100, is used for pre-training. CIFAR-100 is an image dataset containing 100 different classes. For training, 500 images are used in each class, and for validation, 100 images per class are employed. The training and validation images are then merged for pre-training, so that 600 images per class are available.
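As a minimal sketch of this pre-training data preparation, the snippet below loads CIFAR-100 with torchvision and merges the two official splits so that each of the 100 classes contributes 600 images, as stated above. The torchvision dependency and the concatenation step are illustrative assumptions, since the original experiments were run in MATLAB.

```python
import numpy as np
from torchvision.datasets import CIFAR100

# Download the official splits (50,000 train + 10,000 test images, 100 classes).
train = CIFAR100(root="./data", train=True, download=True)
test = CIFAR100(root="./data", train=False, download=True)

# Merge both splits so each class contributes 600 images (500 + 100 per class).
images = np.concatenate([train.data, test.data])     # (60000, 32, 32, 3)
labels = np.array(train.targets + test.targets)      # (60000,)
print(images.shape, np.bincount(labels).min())       # 600 images per class
```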
3.2.1 Proposed MBDLP-Net.
A CNN-based model with 66 layers is proposed. Multi-branched deep learning pose net (MBDLP-Net) refers to the whole architecture of the presented deep CNN network. Fig 4 shows a graphical representation of the proposed MBDLP-Net, and Table 1 depicts the arrangement and complete formation of its layers.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
In the proposed MBDLP-Net architecture, several common types of layers are utilized. The primary building blocks are convolutional layers, which apply filters to input images, resulting in activations. A feature map is created by repeatedly applying the same filter to the input image; this feature map indicates the intensity and positions of recognized features. The innovation of convolutional neural networks (CNNs) lies in their ability to automatically learn numerous filters relevant to a training dataset, even for complex tasks like image classification. This allows specific and essential features to be detected anywhere in the input image. To generate a feature map, the CNN applies a filter to an input that summarizes the presence of observed features. For a CNN with an input of size W x W x D, Dout kernels with a spatial dimension of F, stride S, and padding P, the size of the output can be calculated as:

Wout = (W - F + 2P) / S + 1    (6)

giving an output volume of size Wout x Wout x Dout.
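A small helper, shown below, evaluates Eq (6); the example dimensions are arbitrary and only illustrate the formula.

```python
def conv_output_size(W, F, S=1, P=0):
    """Spatial output size of a convolution per Eq (6): (W - F + 2P)/S + 1."""
    return (W - F + 2 * P) // S + 1

# A 32x32 input with 5x5 filters, stride 1, and padding 2 keeps its size.
print(conv_output_size(32, 5, S=1, P=2))   # 32
# A 64x64 input with 3x3 filters, stride 2, and padding 1 is halved.
print(conv_output_size(64, 3, S=2, P=1))   # 32
```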
The convolutional layer can be expressed mathematically as:

h_j^n = Σ_k h_k^(n-1) * w_kj^n    (7)

In Eq (7), h_j^n is the output feature map, h_k^(n-1) is the input feature map, and w_kj^n is the kernel. ReLU is an activation function that stands for rectified linear unit; it is mathematically defined as y = max(0, x). ReLU is the most commonly used activation function in neural networks, particularly in CNNs, and if there is any uncertainty about which activation function to use when designing a network, ReLU is usually a good first choice. It can be expressed mathematically as:

f(x) = max(0, x)    (8)
Compared with tanh and sigmoid, ReLU is a simpler function. Its gradient is also piecewise constant, which makes the backward computation (i.e., backpropagation) inexpensive. It is usually one of the first activation functions to be tried in network design.
The proposed strategy also uses several batch normalization layers. Batch_Norm [87] is an approach used to normalize a channel of neurons over a mini-batch: it computes the mean and variance per mini-batch, and the features are centered by the mean and divided by the standard deviation. For a mini-batch B = {I_1, ..., I_w}, the batch mean is expressed as:

μ_B = (1/w) Σ_{i=1}^{w} I_i    (9)
In Eq (9), w denotes the number of feature maps in the mini-batch, and the variance over the mini-batch is expressed as:

σ_B^2 = (1/w) Σ_{i=1}^{w} (I_i - μ_B)^2    (10)
Normalization is then carried out using the following expression:

Î_i = (I_i - μ_B) / sqrt(σ_B^2 + ε)    (11)
In Eq (11), ε is a small constant added for numerical stability. Both ReLU and leaky ReLU are used in the proposed CNN design. ReLU maps all values less than 0 to 0, as written below and in [88]:

f(u) = max(0, u)    (12)
For values below zero, leaky ReLU has a small slope instead of zero: when u is negative, leaky ReLU gives v = 0.01u. Further in-depth information about CNNs can be found in other existing works [89–91].
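The following NumPy sketch evaluates Eqs (8) to (12) on a toy mini-batch; the ε value and the 0.01 leaky slope follow the text, while the batch shape is an arbitrary illustration.

```python
import numpy as np

def relu(u):
    """Eq (8)/(12): max(0, u)."""
    return np.maximum(0.0, u)

def leaky_relu(u, slope=0.01):
    """Leaky ReLU: v = u for u >= 0, v = 0.01*u otherwise."""
    return np.where(u >= 0, u, slope * u)

def batch_norm(batch, eps=1e-5):
    """Eqs (9)-(11): normalize a mini-batch per channel.

    batch has shape (w, channels); mean and variance are computed
    over the w samples of the mini-batch.
    """
    mu = batch.mean(axis=0)                   # Eq (9)
    var = batch.var(axis=0)                   # Eq (10)
    return (batch - mu) / np.sqrt(var + eps)  # Eq (11)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(loc=3.0, scale=2.0, size=(16, 4))   # toy mini-batch
    xn = batch_norm(x)
    print(xn.mean(axis=0).round(3), xn.std(axis=0).round(3))
    print(relu(np.array([-1.0, 2.0])), leaky_relu(np.array([-1.0, 2.0])))
```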
Image visualizations at the convolutional layers Conv1, Conv2, Conv3, Conv4, and Conv5 of the proposed MBDLP-Net on the PKU-Reid dataset are presented in Fig 5.
[Figure omitted. See PDF.]
Visualizations at different convolutional layers (a) Conv1, (b) Conv2, (c) Conv3, (d) Conv4, and (e) Conv5 of the proposed MBDLP-Net on the PKU-Reid dataset.
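The full 66-layer MBDLP-Net is specified in Table 1 and Fig 4; the PyTorch sketch below is only a scaled-down illustration of the three-branch idea, with B1, B2, and B3 operating at different receptive fields and concatenated before the feature layer. All layer counts, channel widths, and kernel sizes here are assumptions for illustration and do not reproduce the actual architecture.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """A small convolutional branch with a configurable kernel size."""
    def __init__(self, in_ch, out_ch, kernel):
        super().__init__()
        pad = kernel // 2
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel, padding=pad),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel, padding=pad),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.01, inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):
        return self.block(x).flatten(1)

class ToyMultiBranchNet(nn.Module):
    """Illustrative three-branch feature extractor followed by a classifier."""
    def __init__(self, num_classes=8, in_ch=1):
        super().__init__()
        self.b1 = Branch(in_ch, 32, kernel=3)   # fine details
        self.b2 = Branch(in_ch, 32, kernel=5)   # mid-scale context
        self.b3 = Branch(in_ch, 32, kernel=7)   # coarse context
        self.head = nn.Linear(3 * 32, num_classes)

    def forward(self, x):
        feats = torch.cat([self.b1(x), self.b2(x), self.b3(x)], dim=1)
        return self.head(feats)

if __name__ == "__main__":
    net = ToyMultiBranchNet()
    out = net(torch.randn(2, 1, 128, 64))   # grayscale pedestrian crops
    print(out.shape)                         # torch.Size([2, 8])
```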
3.3 Feature selection with ant colony optimization
For feature selection/optimization, ant colony selection (ACS) [92,93] is used, an approach inspired by the foraging behavior of ants [94]. As they migrate from one location to another, ants deposit a substance known as a "pheromone," whose strength decays over time. Ants follow the pheromone trail, which encourages them to follow the less expensive route. Accordingly, ants migrate from position to position, moving from vertex to vertex across a network. A vertex represents a feature, and the edges connecting vertices guide the selection of the next feature. The process repeats until the best features are discovered; when the minimal number of vertices is reached and the stopping condition is met, the technique terminates. The vertices are connected in a mesh topology. Finally, the optimized feature set is fed into the classifiers for classification. At any given time, an ant's feature selection is determined by a transition probability. Mathematically, the pheromone update of ACO is as follows:
τ_xy = (1 - ρ) τ_xy + Σ_k Δτ_xy^k    (13)
The left side of Eq (13) indicates the updated amount of pheromone on a given edge (x, y), and ρ is the rate at which the pheromone evaporates.
The amount of pheromone deposited is indicated by the final term on the right side.
Δτ_xy^k = Q / L_k, if ant k traversed edge (x, y); 0 otherwise    (14)
Q is a constant, and L_k is the length (cost) of the tour constructed by ant k.
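A compact sketch of ant-colony-based feature selection following Eqs (13) and (14) is given below. The number of ants, the evaporation rate ρ, the constant Q, and the toy relevance-based scoring function are illustrative assumptions; the actual implementation selects subsets of 100 to 1000 deep features and scores them with the downstream classifiers.

```python
import numpy as np

def aco_feature_selection(relevance, n_select, n_ants=20, n_iters=30,
                          rho=0.1, Q=1.0, seed=0):
    """Select n_select feature indices using a simplified ant colony search.

    relevance : per-feature score used as a stand-in for classifier feedback.
    """
    rng = np.random.default_rng(seed)
    n_features = len(relevance)
    tau = np.ones(n_features)                      # pheromone per feature

    best_subset, best_cost = None, np.inf
    for _ in range(n_iters):
        subsets, costs = [], []
        for _ in range(n_ants):
            prob = tau / tau.sum()                 # transition probability
            subset = rng.choice(n_features, size=n_select,
                                replace=False, p=prob)
            cost = 1.0 / (relevance[subset].sum() + 1e-9)  # tour "length" L_k
            subsets.append(subset)
            costs.append(cost)
            if cost < best_cost:
                best_subset, best_cost = subset, cost

        tau *= (1.0 - rho)                         # evaporation, Eq (13)
        for subset, cost in zip(subsets, costs):
            tau[subset] += Q / cost                # deposit, Eq (14)
    return np.sort(best_subset)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    relevance = rng.random(1000)                   # stand-in for 1000 deep features
    chosen = aco_feature_selection(relevance, n_select=100)
    print(chosen[:10], relevance[chosen].mean() > relevance.mean())
```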
3.4 Classification
Classification is carried out using nine (9) different classifiers, of which six (6) are SVM-based (cubic (CSVM), coarse Gaussian (CGSVM), medium Gaussian (MGSVM), fine Gaussian (FGSVM), quadratic (QSVM), and linear (LSVM)) and three (3) are KNN-based (cosine (CKNN), coarse (RKNN), and fine (FKNN)). In the classification paradigm, the classifiers take feature sets of 100, 250, 500, 750, and 1000 features as input and predict the class label, identifying and categorizing pedestrian poses into eight (8) bins or angles. Several variables, including deep layer selection, activation function, and weight initialization, influence the accuracy. The model's accuracy is further improved via image preprocessing (dehazing). The classifiers are trained to predict the poses with high accuracy.
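The SVM and KNN variants named above mirror the classifier presets found in MATLAB's Classification Learner app, which was used for the experiments. As a rough Python equivalent, the scikit-learn sketch below instantiates comparable models; the kernel degrees, gamma settings, and neighbor counts are assumptions chosen to approximate those presets rather than their exact MATLAB definitions, and the synthetic data stands in for the selected deep features.

```python
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

classifiers = {
    "LSVM": SVC(kernel="linear"),
    "QSVM": SVC(kernel="poly", degree=2),
    "CSVM": SVC(kernel="poly", degree=3),
    "FGSVM": SVC(kernel="rbf", gamma=2.0),    # fine Gaussian: small kernel scale
    "MGSVM": SVC(kernel="rbf", gamma=0.1),    # medium Gaussian
    "CGSVM": SVC(kernel="rbf", gamma=0.01),   # coarse Gaussian
    "FKNN": KNeighborsClassifier(n_neighbors=1),
    "RKNN": KNeighborsClassifier(n_neighbors=100),
    "CKNN": KNeighborsClassifier(n_neighbors=10, metric="cosine"),
}

# Toy stand-in for the selected deep features: 8 orientation classes.
X, y = make_classification(n_samples=800, n_features=100, n_informative=40,
                           n_classes=8, random_state=0)

for name, clf in classifiers.items():
    model = make_pipeline(StandardScaler(), clf)
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```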
4. Results and discussion
The primary aim of the proposed work is to establish a deep learning-based CNN framework for dealing with the provided datasets. The MBDLP-Net CNN network presented here extracts only the most influential and powerful features after feature selection. Because this research aims to extract features from existing datasets using the proposed CNN architecture, pre-training is done on a third-party dataset, CIFAR-100. Furthermore, the proposed 66-layer MBDLP-Net was developed after rigorous testing and assessment; fine-tuning, layer addition, and layer removal were the most common techniques involved. In its final form, the 66-layer framework proved to deliver the best performance. This section explains the proposed framework and its outcomes: it first describes the datasets, then the performance evaluation procedure, and finally the assessments and results in detail. All simulations and experiments for this manuscript are carried out on an Intel® Core™ i7-7700 @ 3.60 GHz with 16 GB of system memory, supported by an NVIDIA GeForce GTX 1070 (GP104 GPU) with 8 GB of dedicated memory and a maximum digital resolution of 7680x4320 @ 60 Hz. Windows 10 Pro 64-bit (x64) is used as the operating system, and the coding and simulations are carried out in MATLAB R2021b and R2022b.
4.1 Datasets
This work employs three distinct datasets for performance evaluation. The first, the PKU-Reid dataset, is focused on pedestrians and has also been used for human re-identification. BDBO is the second dataset used in this work, and the third is TUD multiview pedestrians. Each dataset comprises images in 8 bins, corresponding to eight differently angled poses of different humans/pedestrians. Large datasets are required for ML processes to operate well during training, so the dataset size is expanded by augmentation. Adding Gaussian noise, mirroring, color-shifting, and adding salt-and-pepper noise are four major types of augmentation; in the proposed technique, one kind of data augmentation (color-shifting) is utilized, which nearly doubled the dataset size. When a deep neural network is trained, data augmentation techniques help reduce overfitting. The majority of the enhancement processes described in image processing are used to improve image recognition or classification models.
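As a minimal sketch of the color-shifting augmentation mentioned above, the snippet below perturbs the image channels by a random offset before the grayscale conversion used by the model; the shift range is an assumption, since the exact augmentation parameters are not specified.

```python
import numpy as np

def color_shift(image, max_shift=20, rng=None):
    """Shift each RGB channel by a random integer offset in [-max_shift, max_shift]."""
    rng = rng or np.random.default_rng()
    shifts = rng.integers(-max_shift, max_shift + 1, size=3)
    shifted = image.astype(np.int16) + shifts          # broadcast over channels
    return np.clip(shifted, 0, 255).astype(np.uint8)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(128, 64, 3), dtype=np.uint8)
    augmented = [img, color_shift(img, rng=rng)]        # original + shifted copy
    print(len(augmented), augmented[1].shape)           # dataset size doubled
```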
To the best of our knowledge, pedestrian gender categorization results were obtained utilizing pioneering methodologies that were thoroughly investigated and tested on the PKU-Reid dataset. This dataset has served as a baseline for a variety of pedestrian-related studies since its inception in 2016. However, there is a significant lack of specialized studies on pedestrian pose estimation utilizing the PKU-Reid dataset. This lack of precedent makes it difficult to directly compare the outcomes of our suggested strategy to existing methods, as there are no documented studies or results in the literature that focus on this aspect.
Although the PKU-Reid dataset is a well-known resource for pedestrian re-identification and related tasks, no studies have used it specifically to assess a pedestrian's posture. This lack of relevant literature emphasizes the originality of our method while also highlighting the difficulty of confirming our findings against previous research. As a result, our findings must be understood as an early investigation into this field, laying the groundwork for future research rather than providing a direct comparison with earlier techniques.
The BDBO is an essential resource that drives the advancement of sophisticated body orientation estimation techniques. Its extensive and meticulously annotated data allow researchers and practitioners to develop models that are highly accurate and applicable across various real-world scenarios. Characterized by a large volume of diverse samples, the BDBO is carefully curated to support research and development in body orientation analysis, providing the necessary data for training robust machine learning and computer vision models.
The details concerning the PKU-Reid, BDBO, and TUD multiview pedestrians datasets, including their specific characteristics and how they contribute to the overall performance evaluation, are also provided.
Concerning probable limitations, it is acknowledged that every dataset has its own constraints that could affect the generalizability of the results of the proposed methodology. For example, because the PKU-Reid dataset is widely used for pedestrian re-identification, it may not fully capture the range and diversity of real-world situations, which could limit the generalization of the proposed model to unseen scenarios. Although the BDBO dataset is effective for orientation estimation, it might carry biases due to the specific angles and poses it includes.
Another dataset, the TUD multiview pedestrians dataset, is recognized for its wide-ranging angles and multiview viewpoints. It offers valuable diversity for pose estimation, but some challenges remain. Because it emphasizes controlled situations, the dataset might not entirely reflect the complexity of real-world pedestrian behavior. Furthermore, the 8-bin images captured from several angles may introduce certain biases in pose estimation, particularly for less common or more dynamic poses that are not included in the dataset.
Data augmentation techniques are utilized to address these limitations and to reduce overfitting. Specifically, the color-shifting augmentation technique is used, which effectively doubled the dataset size and helped enhance the model's robustness. However, it is important to recognize that using only one augmentation technique could still leave some characteristics under-explored. The details of all three datasets are elaborated in Table 2.
[Figure omitted. See PDF.]
4.2 Performance measures
The performance evaluation of the presented algorithms is carried out using different measures, including Accuracy (ACC), TPR/Sensitivity (SE), TNR/Specificity (SP), Negative Predictive Value (NPV), Precision/Positive Predictive Value (PPV), False Positive Rate (FPR), and False Negative Rate (FNR).
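For reference, the sketch below computes the listed measures from the entries of a binary (one-vs-rest) confusion matrix; in the multi-class experiments these quantities are obtained per orientation bin and then averaged, and the example counts are arbitrary.

```python
def classification_measures(tp, fp, tn, fn):
    """Compute ACC, sensitivity, specificity, PPV, NPV, FPR, and FNR."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)        # TPR / sensitivity / recall
    sp = tn / (tn + fp)        # TNR / specificity
    ppv = tp / (tp + fp)       # precision / positive predictive value
    npv = tn / (tn + fn)       # negative predictive value
    fpr = fp / (fp + tn)       # false positive rate = 1 - SP
    fnr = fn / (fn + tp)       # false negative rate = 1 - SE
    return {"ACC": acc, "SE": se, "SP": sp, "PPV": ppv,
            "NPV": npv, "FPR": fpr, "FNR": fnr}

# Example: one orientation bin treated as the positive class.
print(classification_measures(tp=90, fp=10, tn=880, fn=20))
```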
4.3 Experimental detail
To achieve the best outcomes, extensive testing across numerous iterations of particular feature sets is performed, and a few important observations are discussed in this section, along with a brief assessment of the accuracy of each test. Numerous tests are carried out with varying numbers of features during the feature selection process; these feature sets are then fed to the chosen classifiers for prediction/classification. Overall, CSVM proved to be the best in terms of ACC across all performance metrics, while fine KNN had the second-best overall performance.
4.3.1 Experiment-I: Results on normal/hazy images of PKU-Reid dataset: Performance evaluation of normal/hazy images of the PKU-Reid dataset using five (5) feature subsets (100, 250, 500, 750, and 1000 features) with SVM and KNN variants is expressed in Table 3.
The best result, 70.3% ACC, is achieved by QSVM using the feature subsets of 750 and 1000 features, while the second-best result, 69.69% ACC, is obtained by CSVM with the 750-feature subset, as shown in Table 3. The best results of 61.0%, 66.96%, 69.36%, 69.86%, and 70.3% ACC are attained using the 100-, 250-, 500-, 750-, and 1000-feature subsets, respectively, with QSVM. It is observed that QSVM performed better when predicting results under the ACC protocol on normal/hazy images of the PKU-Reid dataset.
[Figure omitted. See PDF.]
The results produced by each SVM and KNN variant in terms of ACC are given in Table 3. It is noted from the results that the best AUC of 0.70 is achieved by QSVM with both the 750- and 1000-feature subsets, while the second-best AUC of 0.70 is produced by CSVM utilizing the 750-feature subset. SVM variants performed better than KNN variants under ACC on normal/hazy images of the PKU-Reid dataset. A summary of the highest ACC results achieved by different classifiers on normal/hazy images of the PKU-Reid dataset is given in Fig 8, and the best results on the PKU-Reid dataset in terms of ACC are depicted in Table 4.
[Figure omitted. See PDF.]
From the obtained results given in Table 4, the proposed technique achieved ACC values of 0.70, 0.58, 0.63, 0.52, 0.70, 0.64, 0.46, 0.46, and 0.48 utilizing the SVM-based classifiers CSVM, CGSVM, MGSVM, FGSVM, QSVM, and LSVM and the KNN-based classifiers CKNN, RKNN, and FKNN, respectively, on normal/hazy images of the PKU-Reid dataset.
The normal/hazy photos of the PKU-Reid dataset can be effectively processed using the proposed method. The best accuracy of 70.31% is achieved by QSVM with the 1000-feature subset, with a training time of 147.74 sec and a prediction speed of ~190 obs/sec; the total validation cost for this training is 1071. The confusion matrix based on the number of observations is given in Fig 6(A), the confusion matrix based on TPR and FNR in Fig 6(B), and the confusion matrix based on PPV and FDR in Fig 6(C), all for normal/hazy images of the PKU-Reid dataset. The class-wise best ROC results on normal/hazy images of the PKU-Reid dataset are presented in Fig 7(A) 0°, 7(B) 45°, 7(C) 90°, 7(D) 135°, 7(E) 180°, 7(F) 225°, 7(G) 270°, and 7(H) 315°.
[Figure omitted. See PDF.]
(a) The confusion matrix based on the number of observations, (b) the confusion matrix based on TPR and FNR, and (c) the confusion matrix based on PPV and FDR, on normal/hazy images of the PKU-Reid dataset using Quadratic SVM.
[Figure omitted. See PDF.]
(a) 0, (b) 45, (c) 90, (d) 135, (e) 180, (f) 225, (g) 270, and (h) 315 class wise best ROC results on normal/hazy images of PKU-Reid dataset using Quadratic SVM.
4.3.2 Experiment-II: Results on dehazed images of PKU-Reid dataset: Performance evaluation of dehazed images of the PKU-Reid dataset using five (5) feature subsets (100, 250, 500, 750, and 1000 features) with SVM and KNN variants is expressed in Table 5.
The best results of 94.62% and 94.34% ACC are achieved by CSVM with the 750- and 1000-feature subsets, respectively, whereas the second-best results of 84.24%, 91.76%, 92.34%, 92.65%, and 93.23% ACC are obtained by QSVM with the 100-, 250-, 500-, 750-, and 1000-feature subsets, respectively, as shown in Table 5. With CSVM, ACC values of 88.71%, 93.45%, 93.31%, 94.62%, and 94.34% are attained using the 100-, 250-, 500-, 750-, and 1000-feature subsets, respectively. It is observed that CSVM performed better when predicting results under the ACC protocol on dehazed images of the PKU-Reid dataset.
[Figure omitted. See PDF.]
The results produced by each SVM and KNN variant in terms of ACC are given in Table 6. It is noted from the results that the best ACC values of 0.95 and 0.94 are achieved by CSVM with the 750- and 1000-feature subsets, respectively. The second-best ACC of 0.93 is achieved by QSVM utilizing the 750- and 1000-feature subsets. SVM variants performed better than KNN variants under ACC on dehazed images of the PKU-Reid dataset.
[Figure omitted. See PDF.]
From Table 6, the proposed approach achieved ACC values of 0.95, 0.81, 0.89, 0.84, 0.93, 0.85, 0.62, 0.62, and 0.86 utilizing the SVM-based classifiers CSVM, CGSVM, MGSVM, FGSVM, QSVM, and LSVM and the KNN-based classifiers CKNN, RKNN, and FKNN, respectively, on dehazed images of the PKU-Reid dataset.
The proposed method produces the best results on dehazed photos from the PKU-Reid dataset. The best accuracy of 94.62% is achieved by CSVM with the 750-feature subset, with a training time of 97.216 sec and a prediction speed of ~310 obs/sec; the total validation cost for this training is 204. The confusion matrix based on the number of observations is given in Fig 8(A), the confusion matrix based on TPR and FNR in Fig 8(B), and the confusion matrix based on PPV and FDR in Fig 8(C), all for dehazed images of the PKU-Reid dataset. The class-wise best ROC results on dehazed images of the PKU-Reid dataset are presented in Fig 9(A) 0°, 9(B) 45°, 9(C) 90°, 9(D) 135°, 9(E) 180°, 9(F) 225°, 9(G) 270°, and 9(H) 315°.
[Figure omitted. See PDF.]
(a) The confusion matrix based on the number of observations, (b) the confusion matrix based on TPR and FNR, and (c) the confusion matrix based on PPV and FDR, on dehazed images of the PKU-Reid dataset using Cubic SVM.
[Figure omitted. See PDF.]
(a) 0, (b) 45, (c) 90, (d) 135, (e) 180, (f) 225, (g) 270, and (h) 315 class-wise best ROC results on dehazed images of PKU-Reid dataset using Cubic SVM.
4.3.3 Experiment-III: Results on standard/hazy images of BDBO dataset: Performance evaluation of standard/hazy images of the BDBO dataset using five (5) feature subsets (100, 250, 500, 750, and 1000 features) with SVM and KNN variants is expressed in Table 7.
The best results of 92.8%, 93.0%, and 93.3% ACC are achieved by CSVM with the 500-, 750-, and 1000-feature subsets, respectively, whereas the second-best results of 91.9%, 91.9%, 91.9%, 92.0%, and 92.0% ACC are obtained by FKNN with the 100-, 250-, 500-, 750-, and 1000-feature subsets, as shown in Table 7. It is observed that CSVM performed better when predicting results under the ACC protocol on standard/hazy images of the BDBO dataset.
[Figure omitted. See PDF.]
The results produced by each SVM and KNN variant in terms of ACC are given in Table 8. It is noted from the results that the best ACC of 0.93 is achieved by CSVM with the 500-, 750-, and 1000-feature subsets, while FKNN produces the second-best ACC of 0.92 with all feature subsets (100, 250, 500, 750, and 1000 features). SVM variants performed better than KNN variants under AUC on standard/hazy images of the BDBO dataset.
[Figure omitted. See PDF.]
From the results given in Table 8, the proposed approach achieved AUC values of 0.93, 0.79, 0.92, 0.90, 0.91, 0.73, 0.86, 0.74, and 0.92 utilizing the SVM-based classifiers CSVM, CGSVM, MGSVM, FGSVM, QSVM, and LSVM and the KNN-based classifiers CKNN, RKNN, and FKNN, respectively, on standard/hazy images of the BDBO dataset.
The proposed approach produces the best results on the BDBO dataset's normal/hazy images. CSVM achieves the best accuracy of 93.30% with the 1000-feature subset, with a training time of 21069 sec and a prediction speed of ~21 obs/sec; the total validation cost for this training is 4608. The confusion matrix based on the number of observations is given in Fig 10(A), the confusion matrix based on TPR and FNR in Fig 10(B), and the confusion matrix based on PPV and FDR in Fig 10(C), all for normal/hazy images of the BDBO dataset. The class-wise best ROC results on standard/hazy images of the BDBO dataset are presented in Fig 11(A) 0°, 11(B) 45°, 11(C) 90°, 11(D) 135°, 11(E) 180°, 11(F) 225°, 11(G) 270°, and 11(H) 315°.
[Figure omitted. See PDF.]
(a) The confusion matrix based on the number of observations, (b) the confusion matrix based on TPR and FNR, and (c) the confusion matrix based on PPV and FDR, on normal/hazy images of the BDBO dataset using Cubic SVM.
[Figure omitted. See PDF.]
(a) 0, (b) 45, (c) 90, (d) 135, (e) 180, (f) 225, (g) 270, and (h) 315 class-wise best ROC results on standard/hazy images of BDBO dataset using Cubic SVM.
4.3.4 Experiment-IV: Results on dehazed images of BDBO dataset: Performance evaluation of dehazed images of the BDBO dataset using five (5) feature subsets (100, 250, 500, 750, and 1000 features) with SVM and KNN variants is expressed in Table 9.
The best results of 94.7% and 94.8% ACC are achieved by CSVM with the 750- and 1000-feature subsets, respectively, whereas the second-best results of 93.7%, 94.2%, 94.1%, 93.9%, and 94.1% ACC are obtained by FKNN with the 100-, 250-, 500-, 750-, and 1000-feature subsets, as shown in Table 9. It is observed that CSVM performed better when predicting results under the ACC protocol on dehazed images of the BDBO dataset.
[Figure omitted. See PDF.]
The results produced by each SVM and KNN variant in terms of ACC are given in Table 10. It is noted from the results that CSVM achieves the best ACC of 0.95 with both the 750- and 1000-feature subsets. The next-best ACC of 0.94 is produced by FKNN utilizing the 100-, 250-, 500-, 750-, and 1000-feature subsets. SVM variants performed better than KNN variants under the ACC protocol on dehazed images of the BDBO dataset.
[Figure omitted. See PDF.]
From the results given in Table 10, the proposed method achieved AUC values of 0.95, 0.81, 0.93, 0.91, 0.91, 0.78, 0.87, 0.76, and 0.94 utilizing the SVM-based classifiers CSVM, CGSVM, MGSVM, FGSVM, QSVM, and LSVM and the KNN-based classifiers CKNN, RKNN, and FKNN, respectively, on dehazed images of the BDBO dataset.
The proposed approach performs well on dehazed photos from the BDBO dataset. CSVM achieves the best accuracy of 94.86% with the 1000-feature subset, with a training time of 19349 sec and a prediction speed of ~23 obs/sec; the total validation cost for this training is 3579. The confusion matrix based on the number of observations is given in Fig 12(A), the confusion matrix based on TPR and FNR in Fig 12(B), and the confusion matrix based on PPV and FDR in Fig 12(C), all for dehazed images of the BDBO dataset. The class-wise best ROC results on dehazed images of the BDBO dataset are presented in Fig 13(A) 0°, 13(B) 45°, 13(C) 90°, 13(D) 135°, 13(E) 180°, 13(F) 225°, 13(G) 270°, and 13(H) 315°.
[Figure omitted. See PDF.]
(a) The confusion matrix based on the number of observations, (b) the confusion matrix based on TPR and FNR, and (c) the confusion matrix based on PPV and FDR, on dehazed images of the BDBO dataset using Cubic SVM.
[Figure omitted. See PDF.]
Class wise best ROC results on dehazed images of BDBO dataset using Cubic SVM (a) 0, (b) 45, (c) 90, (d) 135, (e) 180, (f) 225, (g) 270, and (h) 315.
4.3.5 Experiment-V: Results on dehazed images of BDBO dataset: Performance evaluation of dehazed images of the BDBO dataset using five (5) feature subsets (100, 250, 500, 750, and 1000 features) with a pretrained Inception net and four (4) classifiers, i.e., CSVM, MGSVM, QSVM, and FKNN, is expressed in Table 11.
The ACC results produced by each classifier, i.e., CSVM, MGSVM, QSVM, and FKNN, are given in Table 11. These results are obtained using a pretrained Inception net. It is noted from the results that CSVM achieves the best ACC of 0.60 with the 750-feature set, while the second-best result is 0.58 with the 500-feature set. The next-best ACC of 0.56 is produced by CSVM and QSVM utilizing the 250- and 750-feature subsets, respectively.
[Figure omitted. See PDF.]
4.3.6 Experiment-VI: Results on dehazed images of BDBO dataset: Performance evaluation of dehazed images of the BDBO dataset using five (5) feature subsets (100, 250, 500, 750, and 1000 features) with a pretrained VGG16 and four (4) classifiers, i.e., CSVM, MGSVM, QSVM, and FKNN, is expressed in Table 12.
The ACC results produced by each classifier, i.e., CSVM, MGSVM, QSVM, and FKNN, are given in Table 12. These results are obtained using a pretrained VGG16. It is noted from the results that CSVM achieves the best ACC of 0.80 with both the 750- and 1000-feature sets, while the second-best result is 0.78 with the 500-feature set. The next-best ACC of 0.75 is produced by QSVM utilizing the 1000-feature subset.
[Figure omitted. See PDF.]
4.3.7 Experiment-VII: Results on dehazed images of BDBO dataset: Performance evaluation of dehazed images of the BDBO dataset using five (5) feature subsets (100, 250, 500, 750, and 1000 features) with a pretrained ResNet and four (4) classifiers, i.e., CSVM, MGSVM, QSVM, and FKNN, is expressed in Table 13.
The ACC results produced by each classifier, i.e., CSVM, MGSVM, QSVM, and FKNN, are given in Table 13. These results are obtained using a pretrained ResNet. It is noted from the results that CSVM achieves the best ACC of 0.69 with the 1000-feature set, while the second-best result is 0.68 with the 750-feature set. The next-best ACC of 0.66 is also produced by CSVM.
[Figure omitted. See PDF.]
4.3.8 Experiment-VIII: Results on images of TUD dataset: Performance evaluation of images of the TUD dataset using five (5) feature subsets (100, 250, 500, 750, and 1000 features) with the MBDLP-Net CNN network and four (4) classifiers, i.e., CSVM, MGSVM, QSVM, and FKNN, is expressed in Table 14.
The ACC results produced by each classifier, i.e., CSVM, MGSVM, QSVM, and FKNN, are given in Table 14. These results are obtained using the MBDLP-Net CNN network. It is noted from the results that CSVM achieves the best ACC of 0.97 with the 100-, 250-, 500-, 750-, and 1000-feature sets, while MGSVM reaches 0.97 with the 1000-feature set and FKNN reaches 0.97 with the 100-, 250-, 750-, and 1000-feature sets. The second-best result of 0.96 is achieved by QSVM with the 100-, 250-, 500-, 750-, and 1000-feature sets; MGSVM also achieves 0.96 with the 500- and 750-feature sets, and FKNN with the 500-feature subset. A summary of the experimental ACC results on the TUD dataset using the five (5) feature subsets (100, 250, 500, 750, and 1000 features) with variants of SVM and KNN classifiers is presented in Table 15.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
5. Comparison with existing works
Two types of accuracy comparisons are carried out: the whole-body orientation accuracies obtained by the proposed method are compared with various recently developed existing approaches, and with pre-trained models or architectures.
The best performance of the proposed model evaluated with four (4) main classifiers, i.e., CSVM, MGSVM, QSVM, and FKNN, is compared with the results obtained from the same classifiers using pre-trained models or architectures, namely Inception in Table 11, VGG16 in Table 12, and ResNet in Table 13. The comparison of existing pretrained and proposed systems in terms of the mean of complete body orientation accuracies in Table 16 aims to demonstrate that the proposed network achieves higher accuracy and precision in estimating pedestrian poses, thereby validating its effectiveness over the traditional pretrained networks mentioned.
[Figure omitted. See PDF.]
In comparison with recent existing approaches, the result of the proposed approach is found to be the best, most robust, and superior. The proposed framework is also examined for scalability concerns with respect to feature size: it is observed during the experiments that increasing the feature size somewhat improves accuracy, while the computation time also increases with feature size. To address this, feature selection is used. Table 17 shows the direct comparison of existing approaches with the proposed system on the BDBO dataset, and Table 18 shows the direct comparison of existing approaches with the proposed system on the TUD dataset.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
The proposed method effectively addresses several limitations of existing techniques for pedestrian full-body pose and orientation estimation by introducing advanced, progressive approaches that result in improved performance. Traditional models often struggle with noise, particularly when working with color images, and may suffer from suboptimal feature selection. In contrast, our methodology utilizes grayscale images and incorporates dehazing as a preprocessing step, significantly enhancing visibility and accuracy. The 66-layer CNN model, featuring three distinct branches (B1, B2, B3), captures complex features of poses and orientations more effectively and is well-suited to handling still images with varying backgrounds. Additionally, ACS is used for feature optimization, ensuring that only the most relevant and appropriate features are used, thereby enhancing classification efficiency, robustness, and accuracy. The method has been tested across three independent datasets, consistently outperforming state-of-the-art models and achieving mean accuracies of 95% and 97%, demonstrating its efficiency, robustness, and effectiveness in various configurations.
6. Conclusion
The pose and orientation of a pedestrian in a specified direction and his movement are sometimes only loosely correlated. In conclusion, a deep CNN learning-based method for recognizing the pedestrian full-body pose/orientation, MBDLP-Net, is proposed. The proposed CNN approach is tested individually for full-body pose estimation on the publicly available PKU-Reid and TUD multiview pedestrians datasets and on the BDBO dataset. Because full-body appearance-based classification is used, this technique can be applied to image sequences as well as still images. After obtaining the results, a comparison is carried out with state-of-the-art existing techniques; the proposed method is found to be the best among the existing classification approaches, with a prominent best outcome. For full-body pose estimation, an accuracy of 0.95 (95%) is achieved with BDBO and PKU-Reid, while 0.97 (97%) is achieved with the TUD multiview pedestrians dataset. The accuracy achieved by the proposed approach shows its robustness. For even more accurate results, the proposed CNN model may be fine-tuned further.
Furthermore, in this study, the proposed technique is applied to just eight orientation classes, but it may be extended to other orientations in the future for improved predictions. The proposed method can also be beneficial for improved path prediction in human behavior analysis, danger assessment, people tracking, and detecting human activities and actions.
References
1. 1. Fayyaz M., Yasmin M., Sharif M., Iqbal T., Raza M., Babar M.I.J.N.C., and Applications, Pedestrian gender classification on imbalanced and small sample datasets using deep and traditional features. 2023. 35(16): p. 11937–11968.
2. 2. Rehman S.-U., Chen Z., Raza M., Wang P., and Zhang Q.J.N., Person re-identification post-rank optimization via hypergraph-based learning. 2018. 287: p. 143–153.
3. 3. Raza M., Sharif M., Yasmin M., Khan M.A., Saba T., and Fernandes S.L., Appearance based pedestrians’ gender recognition by employing stacked auto encoders in deep learning. Future Generation Computer Systems, 2018. 88: p. 28–39.
4. 4. Nasir I.M., Raza M., Ulyah S.M., Shah J.H., Fitriyani N.L., and Syafrudin M., ENGA: Elastic Net-Based Genetic Algorithm for human action recognition. Expert Systems with Applications, 2023. 227: p. 120311.
5. 5. Saba T., Rehman A., Latif R., Fati S.M., Raza M., and Sharif M., Suspicious activity recognition using proposed deep L4-branched-ActionNet with entropy coded ant colony system optimization. IEEE Access, 2021. 9: p. 89181–89197.
6. 6. Alkinani M.H., Khan W.Z., Arshad Q., and Raza M., HSDDD: a hybrid scheme for the detection of distracted driving through fusion of deep learning and handcrafted features. Sensors, 2022. 22(5): p. 1864. pmid:35271011
7. 7. Fayyaz M., Yasmin M., Sharif M., and Raza M., J-LDFR: joint low-level and deep neural network feature representations for pedestrian gender classification. Neural Computing and Applications, 2021. 33: p. 361–391.
8. 8. Li C. and Yung N.H.J.J.o.S.P.S., Arm Poses Modeling for Pedestrians with Motion Prior. 2016. 84(2): p. 237–249.
9. 9. Yoon, S.M., J. Song, K.-S. Hahn, and G.-J. Yoon. Simultaneous detection of pedestrians, pose, and the camera viewpoint from 3D models. In 2015 International Conference on Information and Communication Technology Convergence (ICTC),IEEE,(2015),p.83-88.
10. 10. Mínguez R.Q., Alonso I.P., Fernández-Llorca D., and Sotelo M.Á., Pedestrian Path, Pose, and Intention Prediction Through Gaussian Process Dynamical Models and Pedestrian Activity Recognition. IEEE Transactions on Intelligent Transportation Systems, 2018. 20(5): p. 1803–1814.
11. 11. Yano S., Gu Y., and Kamijo S.J.I.J.o.I.T.S.R., Estimation of pedestrian pose and orientation using on-board camera with histograms of oriented gradients features. 2016. 14(2): p. 75–84.
12. 12. Menan, V., A. Gawesha, P. Samarasinghe, and D. Kasthurirathna. DS-HPE: Deep Set for Head Pose Estimation. in 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC),2023,IEEE,p.1179-1184.
13. 13. Li B., Chen Y., and Wang F.-Y.J.I.T.o.V.T., Pedestrian detection based on clustered poselet models and hierarchical and–or grammar. 2014. 64(4): p. 1435–1444.
14. 14. Hariyono J. and Jo K.-H.J.N., Detection of pedestrian crossing road: A study on pedestrian pose recognition. 2017. 234: p. 144–153.
15. 15. Schut D.E., Wood R.M., Trull A.K., Schouten R., van Liere R., van Leeuwen T., Batenburg K.J.J.P.B., and Technology, Joint 2D to 3D image registration workflow for comparing multiple slice photographs and CT scans of apple fruit with internal disorders. 2024. 211: p. 112814.
16. 16. Liu, W., J. Chen, C. Li, C. Qian, X. Chu, and X. Hu. A cascaded inception of inception network with attention modulated feature fusion for human pose estimation. in Thirty-second AAAI conference on artificial intelligence, 32 (1), 2018,.
17. 17. Shah J.H., Sharif M., Yasmin M., and Fernandes S.L.J.P.R.L., Facial expressions classification and false label reduction using LDA and threefold SVM. 2020. 139: p. 166–173.
* View Article
* Google Scholar
18. 18. Sharif M., Raza M., Shah J.H., Yasmin M., S.L.J.H.o.m.i.s.t. Fernandes, and applications, An overview of biometrics methods. 2019: p. 15–35.
* View Article
* Google Scholar
19. 19. Shah J.H., Chen Z., Sharif M., Yasmin M., S.L.J.J.o.M.i.M. Fernandes, and Biology, A novel biomechanics-based approach for person re-identification by generating dense color sift salience features. 2017. 17(07): p. 1740011.
* View Article
* Google Scholar
20. 20. Nägeli T., Oberholzer S., Plüss S., Alonso-Mora J., and Hilliges O.J.A.T.o.G., Flycon: real-time environment-independent multi-view human pose estimation with aerial vehicles. 2018. 37(6): p. 1–14.
* View Article
* Google Scholar
21. 21. Dong L., Chen X., Wang R., Zhang Q., E.J.I.T.o.C. Izquierdo, and S.f.V. Technology, ADORE: An adaptive Holons representation framework for human pose estimation. 2017. 28(10): p. 2803–2813.
* View Article
* Google Scholar
22. 22. Liu G., Tian G., Li J., Zhu X., and Wang Z.J.I.S.J., Human action recognition using a distributed rgb-depth camera network. 2018. 18(18): p. 7570–7576.
* View Article
* Google Scholar
23. 23. Luvizon, D.C., D. Picard, and H. Tabia. 2d/3d pose estimation and action recognition using multitask deep learning. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
24. 24. Srivastav, V., Thibaut Issenhuth, Abdolrahim Kadkhodamohammadi, Michel de Mathelin, Afshin Gangi, and Nicolas Padoy. "MVOR: A multi-view RGB-D operating room dataset for 2D and 3D human pose estimation." arXiv preprint arXiv:1808.08180 (2018). 2018.
25. 25. Zimmermann, C., T. Welschehold, C. Dornhege, W. Burgard, and T. Brox. 3d human pose estimation in rgbd images for robotic task learning. in 2018 IEEE International Conference on Robotics and Automation (ICRA). 2018. IEEE.
26. 26. Butepage, J., M.J. Black, D. Kragic, and H. Kjellstrom. Deep representation learning for human motion prediction and classification. in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, p.6158-6166.
27. 27. Marin-Jimenez M.J., Romero-Ramirez F.J., Munoz-Salinas R., R.J.J.o.V.C. Medina-Carnicer, and I. Representation, 3D human pose estimation from depth maps using a deep combination of poses. 2018. 55: p. 627–639.
* View Article
* Google Scholar
28. 28. Ray A., Kolekar M.H., Balasubramanian R., and A.J.I.J.o.I.M.D.I. Hafiane, Transfer learning enhanced vision-based human activity recognition: a decade-long analysis. 2023. 3(1): p. 100142.
* View Article
* Google Scholar
29. 29. Chauhan P., Mandoria H.L., Negi A., Kumar K., Choudhury A., and Dahiya S., Leveraging Advanced Convolutional Neural Networks and Transfer Learning for Vision-Based Human Activity Recognition, in Robotics, Control and Computer Vision: Select Proceedings of ICRCCV 2022. 2023, Springer. p. 239–248.
30. 30. Camarena F., Gonzalez-Mendoza M., Chang L., Cuevas-Ascencio R.J.M., and Applications C., An Overview of the Vision-Based Human Action Recognition Field. 2023. 28(2): p. 61.
* View Article
* Google Scholar
31. 31. Camarena F., Gonzalez-Mendoza M., Chang L., and Cuevas-Ascencio R.J., A Concise Overview of the Vision-based Human Action Recognition Field. Preprints (www.preprints.org), 2023, p.1–12.
* View Article
* Google Scholar
32. 32. Shamsolmoali P., Zareapoor M., Zhou H., J.J.A.T.o.M.C. Yang, Communications, and Applications, AMIL: Adversarial Multi-instance Learning for Human Pose Estimation. 2020. 16(1s): p. 1–23.
* View Article
* Google Scholar
33. 33. Zhao, R., M. Li, Z. Yang, B. Lin, X. Zhong, X. Ren, D. Cai, and B. Wu. Towards Fine-Grained HBOE with Rendered Orientation Set and Laplace Smoothing. in Proceedings of the AAAI Conference on Artificial Intelligence, 38 (7), 2024, p.7505-7513.
34. 34. Insafutdinov E., Pishchulin L., Andres B., Andriluka M., and Schiele B.. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. in European Conference on Computer Vision. 2016. Springer.
* View Article
* Google Scholar
35. 35. Ciuti G., Ricotti L., Menciassi A., and Dario P.J.S., MEMS sensor technologies for human centred applications in healthcare, physical activities, safety and environmental sensing: A review on research activities in Italy. 2015. 15(3): p. 6441–6468. pmid:25808763
* View Article
* PubMed/NCBI
* Google Scholar
36. 36. Ma H., Liu L., Zhou A., and Zhao D.J.I.I.o.T.J., On networking of Internet of Things: Explorations and challenges. 2015. 3(4): p. 441–452.
* View Article
* Google Scholar
37. 37. Nair, L.H. AHRS based body orientation estimation for real time fall detection. in 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS). 2017. IEEE.
38. 38. Del Rosario M.B., Lovell N.H., Fildes J., Holgate K., Yu J., Ferry C., Schreier G., Ooi S.-Y., S.J.J.I.j.o.b. Redmond, and h. informatics, Evaluation of an mHealth-based adjunct to outpatient cardiac rehabilitation. 2017. 22(6): p. 1938–1948.
* View Article
* Google Scholar
39. 39. Wang K., Delbaere K., Brodie M.A., Lovell N.H., Kark L., Lord S.R., S.J.J.I.j.o.b. Redmond, and h. informatics, Differences between gait on stairs and flat surfaces in relation to fall risk and future falls. 2017. 21(6): p. 1479–1486. pmid:28278486
* View Article
* PubMed/NCBI
* Google Scholar
40. 40. Choi, J., Beom-Jin Lee, and Byoung-Tak Zhang. "Human body orientation estimation using convolutional neural network." arXiv preprint arXiv:1609.01984 (2016).
41. 41. Flohr F., Dumitru-Guzu M., Kooij J.F., and Gavrila D.M.J.I.T.o.I.T.S., A probabilistic framework for joint pedestrian head and body orientation estimation. 2015. 16(4): p. 1872–1882.
* View Article
* Google Scholar
42. 42. Ma, L., H. Liu, L. Hu, C. Wang, and Q. Sun, Orientation driven bag of appearances for person re-identification. arXiv preprint arXiv:1605.02464, 2016.
43. 43. Raza M., Chen Z., Rehman S.-U., Wang P., and Bao P., Appearance based pedestrians’ head pose and body orientation estimation using deep learning. Neurocomputing, 2018. 272: p. 647–659.
* View Article
* Google Scholar
44. 44. Andriluka, M., S. Roth, and B. Schiele. Monocular 3d pose estimation and tracking by detection. in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2010. Ieee.
45. 45. Othmezouri, G., I. Sakata, B. Schiele, M. Andriluka, and S. Roth, Monocular 3D pose estimation and tracking by detection. 2015, Google Patents, U.S. Patent 8,958,600.
46. 46. Rehder, E., H. Kloeden, and C. Stiller. Head detection and orientation estimation for pedestrian safety. in 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), IEEE-2014, p.2292-2297
47. 47. Lewandowski, B., D. Seichter, T. Wengefeld, L. Pfennig, H. Drumm, and H.-M. Gross. Deep orientation: Fast and robust upper body orientation estimation for mobile robotic applications. in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2019. IEEE.
48. 48. Smith K., Ba S.O., Odobez J.-M., D.J.I.t.o.p.a. Gatica-Perez, and m. intelligence, Tracking the visual focus of attention for a varying number of wandering people. 2008. 30(7): p. 1212–1229.
* View Article
* Google Scholar
49. 49. Bal V.H., Kim S.H., Fok M., and Lord C.J.A.R., Autism spectrum disorder symptoms from ages 2 to 19 years: Implications for diagnosing adolescents and young adults. 2019. 12(1): p. 89–99.
* View Article
* Google Scholar
50. 50. Roumaissa B., Mohamed Chaouki B.J.M.T., and Applications, Hand pose estimation based on regression method from monocular RGB cameras for handling occlusion. 2024. 83(7): p. 21497–21523.
* View Article
* Google Scholar
51. 51. Liem, M.C. and D.M. Gavrila. Person appearance modeling and orientation estimation using spherical harmonics. in 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). 2013. IEEE.
52. 52. Lin, M. and Z. Chen. Salient region detection via low-level features and high-level priors. in 2015 IEEE International Conference on Digital Signal Processing (DSP). 2015. IEEE.
53. 53. Cheng G., Zhou P., Han J.J.I.T.o.G., and Sensing R., Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. 2016. 54(12): p. 7405–7415.
* View Article
* Google Scholar
54. 54. Tendulkar, P., D. Surís, and C. Vondrick. FLEX: Full-Body Grasping Without Full-Body Grasps. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
55. 55. Sun W., Iwata S., Tanaka Y., and Sakamoto T.J.a.p.a., Radar-Based Estimation of Human Body Orientation Using Respiratory Features and Hierarchical Regression Model. 2023.
* View Article
* Google Scholar
56. 56. Pascucci F., Cesari P., Bertucco M., and Latash M.L.J.E.B.R., Postural adjustments to self-triggered perturbations under conditions of changes in body orientation. 2023: p. 1–15. pmid:37479771
* View Article
* PubMed/NCBI
* Google Scholar
57. 57. Kim H., Shin C., and Cho Y.-J.J.I.A., Human Motion Prediction by Combining Spatial and Temporal Information with Independent Global Orientation. 2023. VOLUME 11, 2023: p. 98818–98829.
* View Article
* Google Scholar
58. 58. Lamine H., Bennour S., Laribi M., Romdhane L., S.J.C.m.i.B. Zaghloul, and B. engineering, Evaluation of calibrated kinect gait kinematics using a vicon motion capture system. 2017. 20(sup1): p. S111–S112. pmid:29088586
* View Article
* PubMed/NCBI
* Google Scholar
59. 59. Sheng L., Cai J., Cham T.-J., Pavlovic V., K.N.J.I.T.o.P.A. Ngan, and M. Intelligence, Visibility constrained generative model for depth-based 3D facial pose tracking. 2018. 41(8): p. 1994–2007. pmid:30369437
* View Article
* PubMed/NCBI
* Google Scholar
60. 60. Wu Q., Chen H., and Liu B.J.I.A., Path Planning of Agricultural Information Collection Robot Integrating Ant Colony Algorithm and Particle Swarm Algorithm. 2024. VOLUME 12, 2024: p. 50821–50833.
* View Article
* Google Scholar
61. 61. Wang C. and Lv S.J.A.S., Prefix Data Augmentation for Contrastive Learning of Unsupervised Sentence Embedding. 2024. 14(7): p. 2880.
* View Article
* Google Scholar
62. 62. Luo, X., H.L. Duong, and W. Liu. Person re-identification via pose-aware multi-semantic learning. in 2020 IEEE International Conference on Multimedia and Expo (ICME),2020-IEEE, p1-6.
63. 63. Su, Z., Ming Ye, Guohui Zhang, Lei Dai, and Jianda Sheng. "Cascade feature aggregation for human pose estimation." arXiv preprint arXiv:1902.07837 (2019).
64. 64. Cao, Z., T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
65. 65. Liu, W., J. Chen, C. Li, C. Qian, X. Chu, and X. Hu. A cascaded inception of inception network with attention modulated feature fusion for human pose estimation. in Proceedings of the AAAI Conference on Artificial Intelligence. 2018.
66. 66. Jin, S., W. Liu, W. Ouyang, and C. Qian. Multi-person articulated tracking with spatial and temporal embeddings. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
67. 67. Fieraru, M., A. Khoreva, L. Pishchulin, and B. Schiele. Learning to refine human pose estimation. in Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2018.
* View Article
* Google Scholar
68. 68. Ye X., W.-y. Zhou, and L.-a. Dong, Body Part-Based Person Re-identification Integrating Semantic Attributes. Neural Processing Letters, 2019. 49(3): p. 1111–1124.
* View Article
* Google Scholar
69. 69. Toshev A. and Szegedy C.D.J.C., Human pose estimation via deep neural networks’. p. 1653–1660.
* View Article
* Google Scholar
70. 70. Newell, A., K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. in European conference on computer vision. 2016. Springer.
71. 71. Wang X., Zhao C., Miao D., Wei Z., Zhang R., and Ye T., Fusion of multiple channel features for person re-identification. Neurocomputing, 2016. 213: p. 125–136.
* View Article
* Google Scholar
72. 72. Zhang, F., X. Zhu, and M. Ye. Fast human pose estimation. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
* View Article
* Google Scholar
73. 73. Kaymaz M., Ayzit R., Akgün O., Atik K.C., Erdem M., Yalcin B., Cetin G., N.K.J.J.o.I. Ure, and R. Systems, Trading-Off Safety with Agility Using Deep Pose Error Estimation and Reinforcement Learning for Perception-Driven UAV Motion Planning. 2024. 110(2): p. 1–17.
* View Article
* Google Scholar
74. 74. Toshev, A. and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.
75. 75. Szegedy, C., W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
76. 76. Carreira, J., P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
77. 77. Xu, T. and W. Takano. Graph stacked hourglass networks for 3d human pose estimation. in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.
78. 78. Sengupta, A., Ignas Budvytis, and Roberto Cipolla. "Probabilistic 3D human shape and pose estimation from multiple unconstrained images in the wild." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16094–16104. 2021.
79. 79. Amin J., Sharif M., Yasmin M., Saba T., Anjum M.A., and Fernandes S.L.J.J.o.m.s., A new approach for brain tumor segmentation and classification based on score level fusion using transfer learning. 2019. 43: p. 1–16. pmid:31643004
* View Article
* PubMed/NCBI
* Google Scholar
80. 80. Diaz-Arias, A., D. Shin, M. Messmore, and S. Baek. On the role of depth predictions for 3D human pose estimation. in Proceedings of the Future Technologies Conference (FTC) 2022, Volume 1. 2022. Springer.
81. 81. Sárándi I., Linder Timm, Arras Kai Oliver, and Leibe Bastian. "Metrabs: metric-scale truncation-robust heatmaps for absolute 3d human pose estimation." IEEE Transactions on Biometrics, Behavior, and Identity Science 3, no. 1 (2020): 16–30.
* View Article
* Google Scholar
82. 82. Wang, Z., Jimei Yang, and Charless Fowlkes. "The best of both worlds: combining model-based and nonparametric approaches for 3D human body estimation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2318–2327., 2022.
83. 83. Arnab, A., C. Doersch, and A. Zisserman. Exploiting temporal context for 3D human pose estimation in the wild. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
84. 84. Zhen, J., Q. Fang, J. Sun, W. Liu, W. Jiang, H. Bao, and X. Zhou. SMAP: Single-Shot Multi-Person Absolute 3D Pose Estimation. in European Conference on Computer Vision. 2020. Springer.
85. 85. Kponou E.A., Wang Z., and Li L.J.I.J.C.S.M.C., A comprehensive study on fast image dehazing techniques. 2013. 2: p. 146–152.
* View Article
* Google Scholar
86. 86. (2022)., S.A.M., Image Dehazing.zip, (https://www.mathworks.com/matlabcentral/fileexchange/47147-image-dehazing-zip), MATLAB Central File Exchange. Retrieved June 29, 2022. 2022.
87. 87. Ioffe, S. and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. in International conference on machine learning. 2015. PMLR.
88. 88. Liu Y., Wang X., Wang L., Liu D.J.A.M., and Computation A modified leaky ReLU scheme (MLRS) for topology optimization with multiple materials. 2019. 352: p. 188–204.
* View Article
* Google Scholar
89. 89. Bouvrie, J., Notes on convolutional neural networks. Neural Nets, MIT CBCL Tech Report, 2006: p. 47–60.
90. 90. Li Z., Liu F., Yang W., Peng S., J.J.I.t.o.n.n. Zhou, and l. systems, A survey of convolutional neural networks: analysis, applications, and prospects. 2021: p. 1–21.
* View Article
* Google Scholar
91. 91. Wu J., Introduction to convolutional neural networks. National Key Lab for Novel Software Technology. Nanjing University. China, 2017. 5: p. 23.
92. 92. Bajpai A., Yadav R.J.I.J.o.S., and T. Research, Ant colony optimization (ACO) for the traveling salesman problem (TSP) using partitioning. 2015. 4(09): p. 376–381.
* View Article
* Google Scholar
93. 93. Zhang L., Zhu Y., and Shi X.J.I., A Hierarchical Decision-Making Method with a Fuzzy Ant Colony Algorithm for Mission Planning of Multiple UAVs. 2020. 11(4): p. 226.
* View Article
* Google Scholar
94. 94. Rashno A., Nazari B., Sadri S., and Saraee M.J.N., Effective pixel classification of mars images based on ant colony optimization feature selection and extreme learning machine. 2017. 226: p. 66–79.
* View Article
* Google Scholar
95. 95. Raza M., Sharif M., Yasmin M., Khan M.A., Saba T., and Fernandes S.L.J.F.G.C.S., Appearance based pedestrians’ gender recognition by employing stacked auto encoders in deep learning. 2018. 88: p. 28–39.
* View Article
* Google Scholar
96. 96. Heo D., Nam J.Y., and Ko B.C.J.S., Estimation of pedestrian pose orientation using soft target training based on teacher–student framework. 2019. 19(5): p. 1147.
* View Article
* Google Scholar
97. 97. Kim S.-S., Gwak I.-Y., and Lee S.-W.J.I.t.o.i.t.s., Coarse-to-fine deep learning of continuous pedestrian orientation based on spatial co-occurrence feature. 2019. 21(6): p. 2522–2533.
* View Article
* Google Scholar
98. 98. Yu, D., H. Xiong, Q. Xu, J. Wang, and K. Li. Continuous pedestrian orientation estimation using human keypoints. in 2019 IEEE International Symposium on Circuits and Systems (ISCAS). 2019. IEEE.
99. 99. de Paiva, P.V.V., M.R. Batista, and J.J.G. Ramos. Estimating human body orientation using skeletons and extreme gradient boosting. in 2020 Latin American robotics symposium (LARS), 2020 Brazilian symposium on robotics (SBR) and 2020 workshop on robotics in education (WRE). 2020. IEEE.
100. 100. Dafrallah S., Amine A., Mousset S., and Bensrhair A.J.I.A., Monocular pedestrian orientation recognition based on capsule network for a novel collision warning system. 2021. 9: p. 141635–141650.
* View Article
* Google Scholar
Citation: Shahid MA, Raza M, Sharif M, Alshenaifi R, Kadry S (2025) Pedestrian POSE estimation using multi-branched deep learning pose net. PLoS ONE 20(1): e0312177. https://doi.org/10.1371/journal.pone.0312177
About the Authors:
Muhammad Alyas Shahid
Roles: Conceptualization
Affiliation: Department of Computer Science, COMSATS University Islamabad, Wah Campus, Islamabad, Pakistan
Mudassar Raza
Roles: Methodology
E-mail: [email protected] (MR); [email protected] (RA)
Affiliation: Namal University, Mianwali, Pakistan
Muhammad Sharif
Roles: Investigation
Affiliation: Department of Computer Science, COMSATS University Islamabad, Wah Campus, Islamabad, Pakistan
Reem Alshenaifi
Roles: Formal analysis
E-mail: [email protected] (MR); [email protected] (RA)
Affiliation: Department of Information Technology, College of Computer Sciences and Information Technology, Majmaah University, Majmaah, Saudi Arabia
Seifedine Kadry
Roles: Project administration
Affiliations: Department of Computer Science and Mathematics, Lebanese American University, Beirut, Lebanon, Noroff University College, Kristiansand, Norway
ORCID: https://orcid.org/0000-0002-8596-0814
© 2025 Shahid et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract
Recognizing the pose and direction of a pedestrian, including head and full-body pose and orientation, is a complex problem in human activity-recognition scenarios. A person may walk in one direction while directing attention to another, so it is often desirable to analyze such orientation estimates with computer-vision tools for automated analysis of pedestrian behavior and intention. This article presents a deep-learning approach to pedestrian full-body pose estimation. A pre-trained, supervised, multi-branched deep learning pose net (MBDLP-Net) is proposed for estimation and classification. For full-body pose and orientation estimation, three independent datasets are used: an extensive dataset for body orientation (BDBO), PKU-Reid, and TUD Multiview Pedestrians. Independently, the proposed technique is trained on the CIFAR-100 dataset with 100 classes. The approach is tested on the publicly accessible BDBO, PKU-Reid, and TUD datasets. The results show a mean accuracy for full-body pose estimation of 0.95 on BDBO and PKU-Reid and 0.97 on TUD Multiview Pedestrians. The performance results show that the proposed technique efficiently distinguishes full-body poses and orientations in various configurations, and its efficacy is compared with existing pretrained, robust, and state-of-the-art methodologies.
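As a point of reference only, the reported figures are per-dataset mean accuracies over discrete pose/orientation classes. The sketch below is not the authors' code; it shows one common way such a mean per-class accuracy could be computed from a classifier's predictions. The class count, label arrays, and random test data are illustrative assumptions.

```python
import numpy as np

def mean_per_class_accuracy(y_true, y_pred, num_classes):
    """Average of per-class accuracies (one common 'mean accuracy' definition).

    y_true, y_pred: integer class labels, e.g., discrete body-orientation bins.
    Classes absent from y_true are skipped so they do not distort the average.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            per_class.append(np.mean(y_pred[mask] == c))
    return float(np.mean(per_class))

# Illustrative usage with 8 hypothetical orientation classes and synthetic labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 8, size=200)
preds = labels.copy()
noise = rng.random(200) < 0.05                      # corrupt ~5% of predictions
preds[noise] = rng.integers(0, 8, size=noise.sum())
print(f"mean per-class accuracy: {mean_per_class_accuracy(labels, preds, 8):.2f}")
```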