ABSTRACT
This paper explains how to detect the 2D pose of multiple people in an image. The approach uses Part Affinity Fields (PAFs), a non-parametric representation, for part association; confidence maps for part detection; multi-person parsing using PAFs; and simultaneous detection and association. The method achieves high accuracy and runtime performance regardless of the number of people in the image. The architecture placed first in the inaugural COCO 2016 keypoints challenge, and it exceeds the previous state-of-the-art result on the MPII Multi-Person benchmark in both performance and efficiency.
Keywords: Real-time performance, Part affinity fields, Part detection, Multi-person parsing, Confidence maps
Date of Submission: June 09, 2020 Date of Acceptance: July 07, 2020
I. INTRODUCTION
Human 2D pose estimation is the problem of localizing anatomical keypoints or "parts"; it is used to find the body parts of individuals [16,18,19,15,14,13,12,11,9]. The problem presents a set of challenges. First, each image may contain an unknown number of people, who can appear at any position or scale. Second, interactions between people induce complex spatial interference, due to contact and occluding joints, which makes associating parts with individuals difficult. The third challenge is real-time performance: in top-down approaches there is a positive correlation between the number of people in the image and the computational cost, so the more people there are, the greater the cost. In contrast, bottom-up approaches have the potential to decouple runtime complexity from the number of people in the image.
II. HUMAN POSE ESTIMATION
Human pose estimation has a positive impact on society, because human pose estimation from multiple views can be used in motion capture, surveillance, and sports capture systems. Motion capture systems are useful for the film industry, especially for animating cartoon characters; current technology is based on marker-based solutions that work only in a studio environment. Human pose estimation is also very useful in sports. For example, estimating the pose of football or volleyball players captured from different views supports the analysis of a game. Furthermore, body pose estimation in sports activities can be used to study the tactics of a team and its opponents. Body pose estimation is also used in surveillance: public or crowded places are monitored by multi-view camera systems, and automatic human pose estimation could make the recognition of unusual human actions and activities easier.
We can also estimate the body pose of the surgeons and staff in the operating room (OR). Why perform human pose estimation in the OR? One motivation is related to surgical workflow modeling. Surgical workflow refers to the phase recovery and analysis of a medical operation. For this, a number of signals available inside the OR are employed.
These signals come from different instruments, monitoring systems, and medical devices. Within this environment, pose estimation from a multi-view camera system serves as an additional input modality for surgical workflow analysis and modeling. For instance, 3D body poses can be used to identify human activities and thus can contribute to the phase recognition of a medical operation.
Human pose estimation is also useful for autonomous cars. Car accidents reportedly account for about two percent of deaths globally each year; as such, an intelligent system tracking driver pose may be useful for emergency alerts. In autonomous cars, pedestrian detection algorithms have been used successfully to enable the car to make smarter decisions. Human pose estimation also applies to assisted living, where personal care robots may be deployed; such robots require high-accuracy human detection and pose estimation to perform a variety of tasks, such as fall detection. Other applications include animal tracking and behavior understanding, sign language detection, advanced human-computer interaction, and markerless motion capture.
III. RELATED WORK
A. Single-Person Pose Estimation
The conventional approach [1,2,3,4,5,6,7,8,16,10] to articulated human pose estimation is to perform inference over a combination of local observations on body parts and the spatial dependencies between them. The spatial model for articulated pose estimation is based either on tree-structured graphical models or on non-tree models. The tree-structured model encodes the spatial relationship between adjacent parts following a kinematic chain; the non-tree model augments the tree structure with additional edges to capture occlusion, symmetry, and long-range relationships. To obtain local observations of body parts, Convolutional Neural Networks (CNNs) are used. The convolutional pose machines architecture proposed by Wei et al. [16] used a multi-stage design based on a sequential prediction framework that iteratively incorporates global context. Supervision is enforced at the end of each stage to alleviate the problem of vanishing gradients during training. Newell et al. [1] showed that intermediate supervision is also beneficial in a stacked hourglass architecture. However, all of these methods assume a single person.
B. Multi-Person Pose Estimation
For multi-person pose estimation, most approaches have used a top-down strategy that first detects people and then estimates the pose of each person independently in each detected region. Although this strategy makes the techniques developed for single-person estimation directly applicable, it suffers from early commitment to the person detector and fails to capture the spatial dependencies across different people that require global inference. Some approaches have started to consider inter-person dependencies. Eichner et al. extended pictorial structures to take into account a set of interacting people and depth order, but unfortunately still required a person detector to initialize detection hypotheses. Pishchulin et al. proposed a bottom-up approach that jointly labels part detection candidates and associates them with individual people, with pairwise scores regressed from the spatial offsets of detected parts. This approach does not depend on person detections; however, solving the proposed integer linear program over the fully connected graph is an NP-hard problem, and the average processing time for a single image is on the order of hours. Insafutdinov et al. built on this with a stronger part detector based on ResNet and image-dependent pairwise scores, and improved the runtime with an incremental optimization approach, but the method still takes several minutes per image, with a limit of at most 150 part proposals.
IV. METHODS
Fig. 2 illustrates the overall pipeline of our method. The system takes as input a color image of size w × h (Fig. 2a) and produces as output the 2D locations of anatomical keypoints for each person in the image (Fig. 2e). First, a feedforward network simultaneously predicts a set of 2D confidence maps S of body part locations (Fig. 2b) and a set of 2D vector fields L of part affinities, which encode the degree of association between parts (Fig. 2c). The set S = (S_1, S_2, ..., S_J) has J confidence maps, one per part, where S_j ∈ R^{w×h}, j ∈ {1...J}. The set L = (L_1, L_2, ..., L_C) has C vector fields, one per limb, where L_c ∈ R^{w×h×2}, c ∈ {1...C}; each image location in L_c encodes a 2D vector (as shown in Fig. 1). Finally, the confidence maps and the affinity fields are parsed by greedy inference (Fig. 2d) to output the 2D keypoints for all people in the image.
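To make the notation concrete, the following minimal sketch shows the shapes of the two outputs. The values J = 18 parts, C = 19 limb types, and an 8× network stride are illustrative assumptions, not values stated in this paragraph:

```python
import numpy as np

# Hypothetical sizes: J body parts, C limb types, input w x h, network stride 8.
J, C = 18, 19
h, w, stride = 368, 368, 8
hm, wm = h // stride, w // stride

S = np.zeros((J, hm, wm), dtype=np.float32)     # confidence maps, one per part
L = np.zeros((C, 2, hm, wm), dtype=np.float32)  # PAFs: a 2D vector per pixel per limb

# S[j, y, x] is the confidence that part j appears at (x, y);
# L[c, :, y, x] is the unit vector along limb c at (x, y), and zero off-limb.
```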
A. Simultaneous Detection and Association
Our architecture, shown in Fig. 3, simultaneously predicts detection confidence maps and affinity fields that encode part-to-part association. The network is split into two branches: the top branch predicts the confidence maps, and the bottom branch predicts the affinity fields.
Each branch is an iterative prediction architecture, following Wei et al. [16], which refines the predictions over successive stages t ∈ {1, ..., T}, with intermediate supervision at each stage.
The image is first analyzed by a convolutional network, generating a set of feature maps F that is input to the first stage of each branch. At the first stage, the network produces a set of detection confidence maps S^1 = \rho^1(F) and a set of part affinity fields L^1 = \phi^1(F), where \rho^1 and \phi^1 are the CNNs for inference at the first stage. In each subsequent stage, the predictions from both branches in the previous stage, along with the original image features F, are concatenated and used to produce refined predictions:
S^t = \rho^t(F, S^{t-1}, L^{t-1}), \quad \forall t \ge 2    (1)
L^t = \phi^t(F, S^{t-1}, L^{t-1}), \quad \forall t \ge 2    (2)
where \rho^t and \phi^t are the CNNs for inference at stage t.
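As an illustration, the PyTorch sketch below implements the two-branch, multi-stage refinement of Eqs. (1)-(2). The layer counts, channel widths, and the `branch` helper are placeholder assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

def branch(in_ch, out_ch):
    # Placeholder stage sub-network; the real stages are deeper.
    return nn.Sequential(
        nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(),
        nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
        nn.Conv2d(128, out_ch, 1),
    )

class TwoBranchStages(nn.Module):
    def __init__(self, feat_ch=128, J=18, C=19, T=6):
        super().__init__()
        paf_ch = 2 * C  # each limb has an (x, y) vector channel pair
        self.stage1_S = branch(feat_ch, J)
        self.stage1_L = branch(feat_ch, paf_ch)
        refine_in = feat_ch + J + paf_ch  # F concatenated with S^{t-1}, L^{t-1}
        self.refine_S = nn.ModuleList(branch(refine_in, J) for _ in range(T - 1))
        self.refine_L = nn.ModuleList(branch(refine_in, paf_ch) for _ in range(T - 1))

    def forward(self, F):                    # F: feature maps from a backbone CNN
        S, L = self.stage1_S(F), self.stage1_L(F)
        outputs = [(S, L)]                   # keep per-stage outputs for the losses
        for rho_t, phi_t in zip(self.refine_S, self.refine_L):
            x = torch.cat([F, S, L], dim=1)  # Eqs. (1)-(2): condition on F, S, L
            S, L = rho_t(x), phi_t(x)
            outputs.append((S, L))
        return outputs
```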
Fig. 4 shows the refinement of the confidence maps and affinity fields across stages. To guide the network to iteratively predict confidence maps of body parts in the first branch and PAFs in the second branch, we apply two loss functions at the end of each stage, one at each branch respectively. We use an L2 loss between the estimated predictions and the ground truth maps and fields. Here, we weight the loss functions spatially to address a practical issue: some datasets do not completely label all people. Specifically, the loss functions at stage t are

f_S^t = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \| S_j^t(p) - S_j^*(p) \|_2^2    (3)

f_L^t = \sum_{c=1}^{C} \sum_{p} W(p) \cdot \| L_c^t(p) - L_c^*(p) \|_2^2    (4)

where S_j^* is the ground truth part confidence map, L_c^* is the ground truth part affinity vector field, and W is a binary mask with W(p) = 0 when the annotation is missing at image location p. The overall objective is

f = \sum_{t=1}^{T} (f_S^t + f_L^t)    (5)
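A minimal sketch of the spatially weighted losses of Eqs. (3)-(5), assuming per-stage outputs in the format returned by the architecture sketch above; the function names are our own:

```python
import torch

def stage_loss(S_pred, S_gt, L_pred, L_gt, mask):
    # `mask` is the binary W(p) map of shape (B, 1, h, w), zero where
    # annotations are missing; it broadcasts over the channel dimension.
    f_S = (mask * (S_pred - S_gt) ** 2).sum()
    f_L = (mask * (L_pred - L_gt) ** 2).sum()
    return f_S + f_L

def total_loss(outputs, S_gt, L_gt, mask):
    # Intermediate supervision: the loss is applied at every stage, Eq. (5).
    return sum(stage_loss(S, S_gt, L, L_gt, mask) for S, L in outputs)
```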
B. Confidence Maps for Part Detection
Ideally, if a single person appears in the image, a single peak should exist in each confidence map when the corresponding part is visible; if multiple people appear, there should be a peak corresponding to each visible part j for each person k.
We first generate individual confidence maps S^*_{j,k} for each person k. Let x_{j,k} ∈ R^2 be the ground truth position of body part j for person k in the image. The value at location p ∈ R^2 in S^*_{j,k} is defined as

S_{j,k}^*(p) = \exp\left( -\| p - x_{j,k} \|_2^2 / \sigma^2 \right)    (6)

where \sigma controls the spread of the peak. The ground truth confidence map to be predicted by the network is an aggregation of the individual confidence maps via a max operator,

S_j^*(p) = \max_k S_{j,k}^*(p)    (7)

We take the maximum of the confidence maps rather than the average so that the precision of nearby peaks is preserved, as illustrated in Fig. 5. At test time, we predict confidence maps (as shown in the first row of Fig. 4) and obtain body part candidates by performing non-maximum suppression.
C. Part Affinity Fields for Part Association
Given a set of detected body parts, how do we assemble them into the full-body poses of an unknown number of people? We need a confidence measure of the association for each pair of body part detections, i.e., that they belong to the same person. One possible way to measure the association is to detect an additional midpoint between each pair of parts on a limb and check for its incidence between candidate part detections, as shown in Fig. 5b. However, when people crowd together, as they tend to do, these midpoints are likely to support false associations (shown as green lines in Fig. 5b). Such false associations arise because of two limitations of the representation: (1) it encodes only the position, and not the orientation, of each limb; (2) it reduces the region of support of a limb to a single point.
To address these limitations, we present a novel feature representation called part affinity fields that preserves both location and orientation information across the region of support of the limb (as shown in Fig. 1c). The part affinity is a 2D vector field for each limb, also shown in Fig. 1d: for each pixel in the area belonging to a particular limb, a 2D vector encodes the direction that points from one part of the limb to the other. Each type of limb has a corresponding affinity field joining its two associated body parts. Consider a single limb shown in the figure below. Let x_{j1,k} and x_{j2,k} be the ground truth positions of body parts j1 and j2 of limb c for person k in the image. If a point p lies on the limb, the value at L^*_{c,k}(p) is a unit vector that points from j1 to j2; for all other points, the vector is zero-valued.
To evaluate f_L in Eq. (5) during training, we define the ground truth part affinity vector field L^*_{c,k} at an image point p as

L_{c,k}^*(p) = v \text{ if } p \text{ is on limb } c,k; \; 0 \text{ otherwise}    (8)

Here v = (x_{j2,k} - x_{j1,k}) / \| x_{j2,k} - x_{j1,k} \|_2 is the unit vector in the direction of the limb. The set of points on the limb is defined as those within a distance threshold of the line segment, i.e., those points p for which

0 \le v \cdot (p - x_{j1,k}) \le l_{c,k} \quad \text{and} \quad | v_{\perp} \cdot (p - x_{j1,k}) | \le \sigma_l

where the limb width \sigma_l is a distance in pixels, the limb length is l_{c,k} = \| x_{j2,k} - x_{j1,k} \|_2, and v_{\perp} is a vector perpendicular to v.
The ground truth part affinity field averages the affinity fields of all people in the image,

L_c^*(p) = \frac{1}{n_c(p)} \sum_k L_{c,k}^*(p)    (9)

where n_c(p) is the number of non-zero vectors at point p across all k people (i.e., the average at pixels where the limbs of different people overlap).
During testing, we measure the association between candidate part detections by computing the line integral over the corresponding PAF along the candidate limb that would be formed by connecting the detected body parts. Specifically, for two candidate part locations d_{j1} and d_{j2}, we sample the predicted part affinity field L_c along the line segment to measure the confidence in their association:

E = \int_{u=0}^{u=1} L_c(p(u)) \cdot \frac{d_{j2} - d_{j1}}{\| d_{j2} - d_{j1} \|_2} \, du    (10)

where p(u) interpolates the position between the two body parts d_{j1} and d_{j2},

p(u) = (1 - u) \, d_{j1} + u \, d_{j2}    (11)

In practice, we approximate the integral by sampling and summing uniformly spaced values of u.
D. Multi-Person Parsing Using PAFs
We perform non-maximum suppression on the detection confidence maps to obtain a discrete set of part candidate locations. For each part, we may have several candidates, due to multiple people in the image or false positives (shown in Fig. 2b). These part candidates define a large set of possible limbs. We score each candidate limb using the line integral computation on the PAF, defined in Eq. (10). The problem of finding the optimal parse corresponds to a K-dimensional matching problem that is known to be NP-hard (shown in Fig. 2c). In this paper, we present a greedy relaxation that consistently produces high-quality matches. We conjecture the reason is that the pairwise association scores implicitly encode global context, due to the large receptive field of the PAF network. Formally, we first obtain a set of body part detection candidates D_J for multiple people, where D_J = {d_j^m : j ∈ {1...J}, m ∈ {1...N_j}}, with N_j the number of candidates of part j, and d_j^m ∈ R^2 the location of the m-th detection candidate of body part j. These part detection candidates still need to be associated with other parts from the same person; in other words, we need to find the pairs of part detections that are in fact connected limbs. We define a variable z_{j1 j2}^{mn} ∈ {0, 1} to indicate whether two detection candidates d_{j1}^m and d_{j2}^n are connected, and the goal is to find the optimal assignment for the set of all possible connections, Z = {z_{j1 j2}^{mn} : j1, j2 ∈ {1...J}, m ∈ {1...N_{j1}}, n ∈ {1...N_{j2}}}. If we consider a single pair of parts j1 and j2 (e.g., neck and right hip) for the c-th limb, finding the optimal association reduces to a maximum weight bipartite graph matching problem. This case is shown in Fig. 1b. In this graph matching problem, the nodes of the graph are the body part detection candidates D_{j1} and D_{j2}, and the edges are all possible connections between pairs of detection candidates. Additionally, each edge is weighted by Eq. (10), the part affinity aggregate. A matching in a bipartite graph is a subset of the edges chosen such that no two edges share a node. Our goal is to find a matching with maximum weight for the chosen edges,
\max_{Z_c} E_c = \max_{Z_c} \sum_{m \in D_{j1}} \sum_{n \in D_{j2}} E_{mn} \cdot z_{j1 j2}^{mn}    (12)

\text{s.t.} \quad \forall m \in D_{j1}, \; \sum_{n \in D_{j2}} z_{j1 j2}^{mn} \le 1    (13)

\forall n \in D_{j2}, \; \sum_{m \in D_{j1}} z_{j1 j2}^{mn} \le 1    (14)
where E_c is the overall weight of the matching for limb type c, Z_c is the subset of Z for limb type c, and E_{mn} is the part affinity between parts d_{j1}^m and d_{j2}^n defined in Eq. (10). Eqs. (13) and (14) enforce that no two limbs of the same type (e.g., two left forearms) share a part. We can use the Hungarian algorithm to obtain the optimal matching.
When it comes to finding the full body pose of multiple people, determining Z is a K-dimensional matching problem. This problem is NP-hard, and many relaxations exist. In this work, we add two relaxations to the optimization, specialized to our domain. First, we choose a minimal number of edges to obtain a spanning tree skeleton of human pose rather than using the complete graph, as shown in Fig. 2c. Second, we further decompose the matching problem into a set of bipartite matching subproblems and determine the matching in adjacent tree nodes independently, as shown in Fig. 2d.
We show detailed comparison results in Section V.A, which demonstrate that minimal greedy inference well-approximates the global solution at a fraction of the computational cost. The reason is that the relationship between adjacent tree nodes is modeled explicitly by the PAFs, while the relationship between non-adjacent tree nodes is modeled implicitly by the CNN. This property arises because the CNN is trained with a large receptive field, and PAFs from non-adjacent tree nodes also influence the predicted PAF.
With these two relaxations, the optimization decomposes simply as:

\max_{Z} E = \sum_{c=1}^{C} \max_{Z_c} E_c    (15)

We therefore obtain the limb connection candidates for each limb type independently using Eqs. (12)-(14). With all limb connection candidates, we can assemble the connections that share the same part detection candidates into the full-body poses of multiple people. Our optimization scheme over the tree structure is orders of magnitude faster than the optimization over the fully connected graph.
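A simplified sketch of this final assembly step; the `LIMB_PARTS` table and the greedy extension rule are assumptions, and a full implementation would also handle merging partially overlapping person hypotheses:

```python
def assemble_people(connections, LIMB_PARTS):
    """Group matched limbs into person instances. `connections[c]` holds the
    (m, n, score) triples for limb type c; LIMB_PARTS[c] = (j1, j2) names the
    two part types that limb c joins, ordered along the skeleton tree."""
    people = []  # each person: dict mapping part type -> candidate index
    for c, matches in enumerate(connections):
        j1, j2 = LIMB_PARTS[c]
        for m, n, _ in matches:
            for person in people:
                if person.get(j1) == m:      # limb extends an existing person
                    person[j2] = n
                    break
            else:                            # no owner found: start a new person
                people.append({j1: m, j2: n})
    return people
```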
V. RESULTS
We evaluate our method on two benchmarks for multi-person pose estimation: the MPII human multi-person dataset and the COCO 2016 keypoints challenge dataset. These two datasets collect images in diverse scenarios that contain many real-world challenges such as crowding, scale variation, occlusion, and contact.
Table 1: Results on the MPII subset of 288 images.
Our approach placed first in the inaugural COCO 2016 keypoints challenge and significantly exceeds the previous state-of-the-art result on the MPII multi-person benchmark, as we will see in the following lines. We also provide a runtime analysis to evaluate the efficiency of the system. Fig. 1 shows some real results from our algorithm.
A. Results on the MPII Multi-Person Dataset
For comparison on the MPII dataset, we use the measure of mean Average Precision (mAP) of all body parts based on the PCKh threshold. Table 1 compares mAP performance between our method and other approaches on the same subset of 288 testing images, and Table 2 compares on the full MPII testing set. Besides these measures, we compare the average optimization time per image in seconds. For the 288-image subset, our method outperforms previous state-of-the-art bottom-up methods by 8.5% mAP. Remarkably, our estimated time is 6 orders of magnitude less. We report a more detailed runtime analysis in Section V.C. For the full MPII testing set, our method without scale search already outperforms previous state-of-the-art methods by a large margin, i.e., a 13% absolute increase in mAP. Using a 3-scale search (0.7x, 1x and 1.3x) further increases the performance to 75.6% mAP. The mAP comparison with previous bottom-up approaches indicates the effectiveness of our novel feature representation, PAFs, for associating body parts. Based on the tree structure, our greedy parsing method achieves better accuracy than a graph-cut optimization based on a fully connected graph structure. We train our model based on a fully connected graph, and compare results by choosing all edges and minimal tree edges. Their similar performance shows that it suffices to use the minimal edges. We also trained another model that only learns the minimal edges, to fully utilize the network capacity. This approach outperforms Fig. 6c and even Fig. 6b, while maintaining efficiency. The reason is that the much smaller number of part association channels (13 edges of a tree vs. 91 edges of a graph) makes training convergence easier.
Fig. 7a shows an ablation analysis on our validation set. For the threshold of PCKh-0.5, the result using PAFs outperforms the results using the midpoint representations; specifically, it is 2.9% higher than one-midpoint and 2.3% higher than two intermediate points. The PAFs, which encode both position and orientation information of human limbs, are better able to distinguish the common cross-over cases, e.g., overlapping arms. Training with masks of unlabeled persons further improves the performance by 2.3%, because it avoids penalizing true positive predictions in the loss during training. If we use the ground-truth keypoint locations with our parsing algorithm, we can obtain a mAP of 88.3%. In Fig. 7a, the mAP of our parsing with GT detection is constant across different PCKh thresholds because there is no localization error. Using GT connections with our keypoint detection achieves a mAP of 81.6%. It is notable that our parsing algorithm based on PAFs achieves a similar mAP as using GT connections (79.4% vs. 81.6%). This indicates that parsing based on PAFs is quite robust in associating correct part detections. Fig. 7b shows a comparison of performance across stages. The mAP increases monotonically with the iterative refinement framework. Fig. 3 shows the qualitative improvement of the predictions over stages.
B. Results on the COCO Keypoints Challenge
The COCO training set consists of over 100K person instances labeled with over 1 million total keypoints (i.e., body parts). The testing set contains "test-challenge", "test-dev" and "test-standard" subsets, which have roughly 20K images each. The COCO evaluation defines the object keypoint similarity (OKS) and uses the mean average precision (AP) over 10 OKS thresholds as the main competition metric.
The OKS plays the same role as the IoU in object detection. It is calculated from the scale of the person and the distance between predicted points and GT points. Table 3 shows results from the top teams in the challenge. It is notable that our method has lower accuracy than the top-down methods on people of smaller scales (AP^M). The reason is that our method has to deal in one shot with a much larger scale range, spanned by all people in the image. In contrast, top-down methods can rescale the patch of each detected region to a larger size and thus suffer less degradation at smaller scales.
In Table 4, we report self-comparisons on a subset of the COCO validation set, i.e., 1160 randomly selected images. If we use the GT bounding box and a single-person CPM [16], we can achieve an upper bound for the top-down approach using CPM, which is 62.7% AP. If we use the state-of-the-art object detector, Single Shot MultiBox Detector (SSD) [17], the performance drops 10%. This comparison shows that the performance of top-down approaches depends heavily on the person detector. In contrast, our bottom-up method achieves 58.4% AP. If we refine the results of our method by applying a single-person CPM on each rescaled region of the estimated people parsed by our method, we gain a 2.6% overall AP increase. Note that we only update estimates on predictions on which both methods agree well enough, resulting in improved precision and recall. We expect a larger scale search can further improve the performance of our bottom-up method. Fig. 8 shows a breakdown of errors of our method on the COCO validation set. Most of the false positives come from imprecise localization rather than background confusion. This indicates that there is more room for improvement in capturing spatial dependencies than in recognizing body part appearances.
C. Runtime Analysis
To analyze the runtime performance of our method, we collect videos with a varying number of people. The original frame size is 1080x1920, which we resize to 368x654 during testing to fit in GPU memory. The runtime analysis is performed on a laptop with one NVIDIA GeForce GTX-1080 GPU. In Fig. 8d, we use person detection and single-person CPM as a top-down comparison, where the runtime is roughly proportional to the number of people in the image. In contrast, the runtime of our bottom-up approach grows relatively slowly with an increasing number of people.
The runtime consists of two major parts: (1) CNN processing time, whose runtime complexity is O(1), constant with a varying number of people; (2) multi-person parsing time, whose runtime complexity is O(n^2), where n is the number of people. However, the parsing time does not significantly affect the overall runtime because it is two orders of magnitude less than the CNN processing time; e.g., for 9 people, the parsing takes 0.58 ms while the CNN takes 99.6 ms. Our method achieves a speed of 8.8 fps for a video with 19 people.
VI. CONCLUSION
Multi-person 2D pose estimation enables machines to understand and interpret humans and their interactions. In this paper, we first present a representation of keypoint association that encodes both the position and orientation of human limbs. Second, we design an architecture that jointly learns part detection and association. Third, we demonstrate that a greedy parsing algorithm produces high-quality parses of body poses and preserves efficiency regardless of the number of people. Fourth, we show that PAF refinement is more important than combined PAF and body part location refinement, leading to a large increase in both runtime performance and accuracy. Fifth, we show that combining body and foot estimation into a single model improves the accuracy of each component individually and reduces runtime.
REFERENCES
[1] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[2] W. Ouyang, X. Chu, and X. Wang. Multi-source deep learning for human pose estimation. In CVPR, 2014.
[3] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In CVPR, 2015.
[4] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
[5] X. Chen and A. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS, 2014.
[6] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.
[7] V. Belagiannis and A. Zisserman. Recurrent human pose estimation. In 12th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 2017.
[8] T. Pfister, J. Charles, and A. Zisserman. Flowing ConvNets for human pose estimation in videos. In ICCV, 2015.
[9] V. Ramakrishna, D. Munoz, M. Hebert, J. A. Bagnell, and Y. Sheikh. Pose machines: Articulated pose estimation via inference machines. In ECCV, 2014.
[10] A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, 2016.
[11] D. Ramanan, D. A. Forsyth, and A. Zisserman. Strike a pose: Tracking people by finding stylized poses. In CVPR, 2005.
[12] D. Ramanan, D. A. Forsyth, and A. Zisserman. Strike a pose: Tracking people by finding stylized poses. In CVPR, 2005.
[13] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. In TPAMI, 2013.
[14] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Poselet conditioned pictorial structures. In CVPR, 2013.
[15] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. In IJCV, 2005.
[16] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
[17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. SSD: Single shot multibox detector. In ECCV, 2016.
[18] M. Andriluka, S. Roth, and B. Schiele. Monocular 3D pose estimation and tracking by detection. In CVPR, 2010.
[19] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In CVPR, 2009.
1 Faculty of Computer and Artificial Intelligence, Department of Information Systems, Helwan University, Cairo, Egypt
2 Faculty of Computer and Artificial Intelligence, Department of Computer Science, Helwan University, Cairo, Egypt