1. Introduction

3D structure recovery of articulated objects (i.e., comprising multiple connected rigid parts) from a set of 2D point tracks through multiple monocular images is a challenging computer vision problem [1,2,3,4]. Articulated structure recovery is ill-posed due to missing information about the third dimension [5]. Its applications include gesture and activity recognition, character animation in movies and games, and motion analysis in sport and robotics.

Recently, multiple learning-based approaches that recover 3D structures from 2D landmarks have been introduced [6,7,8,9]. These methods show state-of-the-art accuracy across public benchmarks. However, they are restricted to a specific kind of structure (e.g., human skeleton) and require extensive datasets for training. Moreover, they often fail to recover poses that are different from the training examples (see Section 4.2.5). When a scene includes different types of articulated objects, different methods have to be applied to reconstruct the whole scene.

In this paper, we introduce a general approach for accurate recovery of 3D poses of any articulated structure from 2D observations that does not rely on training data (see Figure 1). We build upon the recent progress in non-rigid structure from motion (NRSfM), which is a general technique for non-rigid 3D reconstruction from 2D point tracks. However, when considering an articulated object as a general non-rigid one, reconstructions can evince significant variations in the distances between the connected joints (see Section 4.2.3). These distances have to remain nearly constant across all articulated poses. Our method relies on this assumption and imposes a spatio-temporal constraint on the bone lengths.

We call our approach Structure from Articulated Motion (SfAM). We apply an articulated structure term as a soft constraint on top of the classic optimization problem of NRSfM [10]. This term enforces the bone lengths—though not known in advance—to remain constant across all frames. Our optimization strategy alternates between the classic NRSfM problem and our articulated structure term until they both converge. This allows for recovering the geometry together with the 3D joint positions and the method does not rely on known bone lengths. Starting from a rough initialization of the articulated structure (e.g., a human arm is longer than a leg), SfAM still converges to the correct structure proportions (see Section 4.2.3). Figure 2 illustrates the significant difference between results produced by a general-purpose NRSfM technique [11] and our SfAM.

To summarise, our contributions are:

A generic framework for articulated structure recovery which achieves state-of-the-art accuracy among non-learning-based methods across public datasets. Moreover, its performance is close to that of state-of-the-art learning-based methods, while it is not restricted to specific objects (see Section 4) and does not require training data.
SfAM recovers sequence-specific bone proportions together with 3D joints (see Section 3). Thus, it does not need known bone lengths.
The articulated prior energy term makes our approach robust to noisy 2D observations (see Section 4.2.2) by imposing additional constraints on the 3D structure.

In this paper, we show that a non-learning-based approach can perform on par with state-of-the-art learning-based methods and even outperform some of them in real-world scenes (see Section 4.2.5). We demonstrate the effectiveness of SfAM for the recovery of different articulated structures through extensive quantitative and qualitative evaluation on different datasets [12,13,14] and real-world scenes (see Section 4). To the best of our knowledge, our SfAM is the first NRSfM approach evaluated on such comprehensive datasets as Human 3.6m [12] and NYU hand pose [14]. As a side effect, our method can be used for precise articulated model estimation, e.g., to generate personalized human skeleton rigs (see Section 4.2.3). This contrasts sharply with most recent supervised learning approaches, which require extensive labeled databases for training and still often fail when unfamiliar poses are observed (see Section 4.2.5). Moreover, minor changes in the inputs lead to significant variations in the poses, which makes the results of learning-based methods very difficult or impossible to reproduce.

2. Related Work

Rigid and Non-Rigid Structure from Motion. Factorization-based Structure from Motion (SfM) is a general technique for 3D structure recovery from 2D point tracks. An SfM problem is well-posed for rigid objects due to the rigidity constraint [15]. Early extensions of Tomasi and Kanade’s method [15] for the non-rigid case rely on rank and orthonormality constraints [16,17]. Subsequent methods investigated shape basis priors [18], temporal smoothness priors [19], trajectory space constraints [20] as well as such fundamental questions as shape basis uniqueness [21,22]. More recent methods combine priors in the metric and trajectory spaces [23]. To improve the reconstruction of stronger nonlinear deformations, Zhu et al. [24] introduce unions of linear subspaces. Dai et al. [10] propose an NRSfM method with as few additional constraints as possible. Lately, the focus of NRSfM research is drawn to the problem of scalability [11,25], i.e., the consistent performance across different scenarios and linear computational complexity in the number of points. Our SfAM is a scalable approach which builds upon the work of Ansari et al. [11]. In contrast to [11], we recover articulated structures with higher accuracy.

Articulated and Multibody Structure from Motion. Over the last few years, several SfM approaches for articulated motion recovery were proposed. Some of them relax the global rigidity constraint for multiple parts [26,27] so that each of the parts is constrained to be rigid. They can handle relatively simple articulated motions, as the segmentation and the structure composition are assumed to be unknown [26]. As a result, these methods are hardly applicable to such complicated scenarios as human and hand pose recovery. Tresadern and Reid [28], Yan and Pollefeys [29] and Palladini et al. [26] address the articulated case with two rigid body parts and detect a hinge joint. Later, an approach with spatial smoothness and segmentation dealing with an arbitrary number of rigid parts was proposed by Fayad et al. [30]. Park and Sheikh [31] reconstruct trajectories given parent trajectories and known bone length, known camera, and root motion for each frame. Their objective is highly nonlinear and requires good initialization of trajectory parameters. In contrast, our method recovers sequence-specific bone proportions and does not rely on given bone lengths. Next, Valmadre et al. [32] propose a dynamic-programming approach for the reconstruction of articulated 3D trees from input 2D joint positions operating in linear time. Multibody SfM methods reconstruct multiple independent rigid body transformations and non-rigid deformations in the same scene [27,33]. In contrast, our approach is more general as it imposes a soft constraint of articulated motion on top of classic NRSfM.

Piecewise and Locally Rigid Structure from Motion. Piecewise rigid approaches interpret the structure as locally rigid in the spatial domain [34,35]. Several methods divide the structure into patches, each of which can deform non-rigidly [36,37]. Operating at a high level of granularity allows these methods to reconstruct large deformations, as opposed to methods relying on linear low-rank subspace models [36]. Rehan et al. [38] penalize deviations of the bone lengths from the average distances between the joints over the whole sequence. This form of constraint does not guarantee a realistic reconstruction though, as it struggles to compensate for inaccurate 2D estimations or 3D inaccuracies in short time intervals.

Monocular 3D Human Body and Hand Pose Estimation. Bone length constraints are widely used in the single-view regression of 3D human poses. One of the early works in this domain operates on single uncalibrated images and imposes constraints on the relative bone lengths [39]. It is capable of reconstructing a human pose up to scale. Later, an enhancement for multiple frames with bone symmetry and rigidity constraints (joints representing the same bone move rigidly relative to each other) was introduced by Wei and Chai [40]. Akhter and Black [41] use a pose prior that captures pose-dependent joint angle limits. Ramakrishna et al. [1] use a sum of squared bone lengths term that can still lead to unrealistic poses. Wandt et al. [2] constrain the bone lengths to be invariant. Their trilinear factorization approach relies on pre-trained body poses serving as a shape prior and transcendental functions modeling periodic motion peculiar to the human gait. An adaptation of this approach to hand gestures would require the acquisition of a new shape prior. Wandt et al. [42] constrain the sum of squared bone lengths of the articulated structure to be invariant throughout the image sequence. However, the length of each bone can still vary. One of the modern methods for human pose and appearance estimation is MonoPerfCap of Xu et al. [43]. It imposes implicit bone length constraints through a dense template tailored to a specific person and captured in an external acquisition process.

Recently, many learning-based approaches for human pose and hand pose estimation have been presented in the literature [9,44,45,46,47,48,49,50,51]. In [7], weak supervision constrains the output of the network with fixed bone proportions taken from the training dataset. Sun et al. [52] exploit a joint connection structure and use bones instead of joints for pose representation. Wandt and Rosenhahn [53] use a kinematic chain representation and include bone length information in their loss function during training. In contrast to our SfAM, [53] is not as robust to noisy 2D input (see Section 4.2.2). All these methods are highly specialized and rely on extensive collections of training data. In contrast, our SfAM is a general approach that can cope with different articulated structures, with no need for labeled datasets.

3. The Proposed SfAM Approach

Figure 3 shows a high-level overview of our approach. Following factorization-based NRSfM [10], we first recover the camera pose using 2D landmarks (Section 3.2). For 3D structure recovery, we extend the target energy function of the classic NRSfM problem [10,11] by our articulated prior term (Section 3.3.1).

We assume that sparse 2D correspondences are given. In Section 3.3.2, we show how our new energy is efficiently optimized by alternating between the fixed-point continuation algorithm [54] and the Levenberg–Marquardt method [55,56]. This leads to an accurate reconstruction of articulated motions of different structures.

3.1. Factorization Model

The input to SfAM is the measurement matrix $W = [W_{1}, W_{2}, \dots, W_{T}]^{T} \in \mathbb{R}^{2T \times N}$ with $N$ 2D joints tracked over $T$ frames. Every $W_{t}$, $t \in \{1, \dots, T\}$, is registered to the centroid of the observed structure and the translation is resolved in advance. Most of the NRSfM methods assume orthographic projection, as the intrinsic camera model is usually not known. Even though some benchmarks (e.g., [12]) provide camera parameters, we develop a general approach for uncalibrated settings. Following standard SfM approaches, we assume that every 2D projection $W_{t}$ can be factorized into a camera pose-projection matrix $R_{t} \in \mathbb{R}^{2 \times 3}$ and 3D structure $S_{t} \in \mathbb{R}^{3 \times N}$ so that $W_{t} = R_{t} S_{t}$. We assume that the articulated structure deforms under the low-rank shape model [11,16]. Thus, $S = [S_{1}, S_{2}, \dots, S_{T}]^{T}$ can be parametrized by the set of unknown basis shapes $B \in \mathbb{R}^{3K \times N}$ of cardinality $K$ and the coefficient matrix $C \in \mathbb{R}^{T \times K}$:

(1) $W = R S = \underbrace{R (C \otimes I_{3})}_{M} B = M B,$

where $R = \mathrm{bkdiag}(R_{1}, R_{2}, \dots, R_{T})$ is the joint camera pose-projection matrix, $I_{3}$ is a $3 \times 3$ identity matrix and $\otimes$ denotes the Kronecker product.
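The factorization in Equation (1) can be illustrated on synthetic data. The following is a minimal NumPy sketch, not the authors' implementation: the toy sizes, the random cameras and the helper names `random_orthographic_camera` and `block_diag` are assumptions made for illustration only. It checks that the compact product $R (C \otimes I_{3}) B$ agrees with the frame-wise products $R_{t} S_{t}$.

```python
import numpy as np

def random_orthographic_camera(rng):
    """Return a 2x3 matrix with orthonormal rows (two rows of a random orthogonal matrix)."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return q[:2, :]

def block_diag(blocks):
    """Place the 2x3 camera blocks R_1..R_T on the diagonal of a 2T x 3T matrix."""
    T = len(blocks)
    R = np.zeros((2 * T, 3 * T))
    for t, Rt in enumerate(blocks):
        R[2 * t:2 * t + 2, 3 * t:3 * t + 3] = Rt
    return R

rng = np.random.default_rng(0)
T, N, K = 6, 10, 2                        # toy sizes (assumed for illustration)
cams = [random_orthographic_camera(rng) for _ in range(T)]
C = rng.standard_normal((T, K))           # shape coefficients c_tk
B = rng.standard_normal((3 * K, N))       # K stacked 3 x N basis shapes

R = block_diag(cams)
M = R @ np.kron(C, np.eye(3))             # 2T x 3K motion matrix
W = M @ B                                 # 2T x N measurement matrix

# Frame-wise check: W_t = R_t S_t with S_t = sum_k c_tk B_k
S = np.kron(C, np.eye(3)) @ B             # 3T x N stacked shapes
W_framewise = np.vstack([cams[t] @ S[3 * t:3 * t + 3] for t in range(T)])
```

Because $M$ has only $3K$ columns, the rank of $W$ is at most $3K$, which is what makes the SVD-based initialization of Section 3.2 possible.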

3.2. Recovery of Camera Poses

Applying singular value decomposition to $W$, we obtain initial estimates of $M$ and $B$ from Equation (1) up to an invertible corrective transformation $Q \in \mathbb{R}^{3K \times 3K}$:

(2) $W \cong M' B' \cong \underbrace{M' Q}_{M} \underbrace{Q^{-1} B'}_{B} = M B .$

In the following, we use the shortcuts $M'_{2t-1:2t} \in \mathbb{R}^{2 \times 3K}$ for the t-th pair of rows of $M'$ and $Q_{k} \in \mathbb{R}^{3K \times 3}$ for the k-th column triplet of $Q$, $k \in \{1, \dots, K\}$. Considering (1) and (2), for every $t \in \{1, \dots, T\}$ and $k \in \{1, \dots, K\}$, we have:

(3) $M'_{2t-1:2t} Q_{k} = c_{tk} R_{t} .$

Using the orthonormality constraints $R_{t} R_{t}^{T} = I_{2}$ and denoting $F_{k} = Q_{k} Q_{k}^{T}$, we obtain:

(4) $\begin{cases} M'_{2t-1} F_{k} M'^{T}_{2t-1} = M'_{2t} F_{k} M'^{T}_{2t} = c_{tk}^{2}, \\ M'_{2t-1} F_{k} M'^{T}_{2t} = 0 . \end{cases}$
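The step from (3) to (4) can be spelled out as follows, assuming that $F_{k} = Q_{k} Q_{k}^{T}$ denotes the Gram matrix of the k-th column triplet of $Q$:

```latex
\begin{aligned}
M'_{2t-1:2t} \, F_{k} \, M'^{T}_{2t-1:2t}
  &= (M'_{2t-1:2t} Q_{k}) (M'_{2t-1:2t} Q_{k})^{T} \\
  &= c_{tk}^{2} \, R_{t} R_{t}^{T} \\
  &= c_{tk}^{2} \, I_{2} .
\end{aligned}
```

Reading off the two diagonal entries and the off-diagonal entry of this $2 \times 2$ identity gives the row-wise constraints in (4): both rows have the same squared norm under $F_{k}$, and they are orthogonal to each other.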

Therefore, the following systems of equations can be written for every t and k:

(5) $\underbrace{\begin{bmatrix} M'_{2t-1} \otimes M'^{T}_{2t-1} - M'_{2t} \otimes M'^{T}_{2t} \\ M'_{2t-1} \otimes M'^{T}_{2t} \end{bmatrix}}_{G_{t}} \mathrm{vec}(F_{k}) = 0,$

where $\mathrm{vec}(\cdot)$ is the vectorization operator permuting an $m \times n$ matrix to an $mn$ column vector. Stacking all $G_{t}$ vertically, we obtain:

(6) $G \, \mathrm{vec}(F_{k}) = 0,$

where $G = [G_{1}, G_{2}, \dots, G_{T}]^{T}$. Finding an optimal $F_{k}$ can be performed by solving the optimization problem:

(7) $\min_{F_{k}} \| G \, \mathrm{vec}(F_{k}) \|^{2} .$

Due to the rank-3 constraint on every $F_{k}$ , this problem is solved by the iterative shrinkage-thresholding (IST) method [57]. Once an optimal $F$ is found, the corrective transformation $Q$ is recovered by Cholesky decomposition. Using $Q$ , $R$ is recovered from Equations (1)–(4).
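To make the linear system (5)–(6) concrete, the sketch below builds $G$ for synthetic data and verifies that the true $F_{k} = Q_{k} Q_{k}^{T}$ lies in its null space. It is a hedged illustration only: the helper `build_G`, the toy sizes and the well-conditioned random $Q$ are assumptions, the rows of $M'$ are treated as plain vectors with NumPy's row-major `ravel()` standing in for $\mathrm{vec}(\cdot)$, and the rank-constrained IST solver [57] itself is not shown.

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 8, 2                                        # toy sizes (assumed)

# Ground-truth motion matrix M = R (C kron I_3), built frame by frame.
C = rng.standard_normal((T, K))
rows = []
for t in range(T):
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    Rt = q[:2, :]                                  # 2x3 camera with orthonormal rows
    rows.append(np.hstack([C[t, k] * Rt for k in range(K)]))
M = np.vstack(rows)                                # 2T x 3K

# SVD determines M only up to an invertible Q; simulate M' = M Q^{-1}.
U, _ = np.linalg.qr(rng.standard_normal((3 * K, 3 * K)))
Q = U @ np.diag(rng.uniform(0.5, 2.0, 3 * K))      # well-conditioned by construction
M_prime = M @ np.linalg.inv(Q)

def build_G(M_prime):
    """Stack the two linear constraints per frame acting on the row-major vec(F_k)."""
    blocks = []
    for t in range(M_prime.shape[0] // 2):
        a, b = M_prime[2 * t], M_prime[2 * t + 1]
        blocks.append(np.kron(a, a) - np.kron(b, b))   # equal squared norms
        blocks.append(np.kron(a, b))                   # orthogonality
    return np.vstack(blocks)

G = build_G(M_prime)                               # 2T x (3K)^2
k = 0
Q_k = Q[:, 3 * k:3 * k + 3]                        # k-th column triplet of Q
F_k = Q_k @ Q_k.T                                  # rank-3, positive semidefinite
residual = G @ F_k.ravel()                         # vanishes up to round-off
```

The identity used here is $a F b^{T} = (a \otimes b) \cdot \mathrm{vec}(F)$ for row vectors $a$, $b$ and a row-major vectorization of $F$.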

3.3. Articulated Structure Recovery

3.3.1. Articulated Structure Representation

Having found $R$ , we recover $S$ . Note that we optionally rely on an updated $W$ after the smooth shape trajectory step which imposes additional constraints on point trajectories and reduces the overall number of unknowns; please refer to [11] for more details. We rearrange the shape matrix $S$ to

(8) $S^{\#} = \begin{bmatrix} X_{11} \dots X_{1N} & Y_{11} \dots Y_{1N} & Z_{11} \dots Z_{1N} \\ \vdots & \vdots & \vdots \\ X_{T1} \dots X_{TN} & Y_{T1} \dots Y_{TN} & Z_{T1} \dots Z_{TN} \end{bmatrix},$

where $(X_{tn}, Y_{tn}, Z_{tn})$, $n \in \{1, \dots, N\}$, is the 3D coordinate of each joint in $S$. $S^{\#}$ can be represented as:

(9) $S^{\#} = [P_{x} \; P_{y} \; P_{z}] (I_{3} \otimes S),$

where $P_{x}, P_{y}, P_{z} \in \mathbb{R}^{T \times 3T}$ are binary row selectors. We follow [10,11] and represent the optimal non-rigid structure by:

(10) $\min_{S} \| S^{\#} \Pi \|_{*}, \quad \text{s.t.} \quad W = R S,$

where $\Pi = (I - \frac{1}{T} \mathbf{1} \mathbf{1}^{T})$ ($\mathbf{1}$ is a vector of ones) and $\| \cdot \|_{*}$ denotes the nuclear norm. Note that $\mathrm{rank}(S^{\#}) \leq K$, and the mean 3D component is removed from $S^{\#}$. As shown in Figure 2, non-rigid structures recovered by the optimization of (10) can have significant variations in bone lengths. This often leads to unrealistic poses and body proportions. Unlike general non-rigid structures, in articulated structures, individual rigid parts or bones have constant lengths throughout the whole sequence. Moreover, all the bones follow constant proportions. These constraints are called articulated priors. We incorporate the articulated priors into the objective function (10) in the form of the following energy term:

(11) $E_{BL}(S) = \sum_{t=1}^{T} \sum_{b=1}^{B} e_{tb}(S),$

where $e_{tb}(S) = (D_{b}^{t} - L_{b})^{2}$ is an energy term for bone $b$ and frame $t$, and $L_{b}$ is the initial normalized length of bone $b$. The normalization is done with respect to the sum of all initial bone lengths. $D_{b}^{t} = \| X_{a_{b}}^{t} - X_{c_{b}}^{t} \|_{2}$ is the Euclidean distance between joints $X_{a_{b}}^{t}$ and $X_{c_{b}}^{t}$ connected by bone $b$; $B$ is the number of bones of the articulated structure. Vectors $a = [X_{a_{1}}, X_{a_{2}}, \dots, X_{a_{B}}]$ and $c = [X_{c_{1}}, X_{c_{2}}, \dots, X_{c_{B}}]$ define the parent and child joints of bones, respectively.
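The energy term (11) is straightforward to evaluate. The following is a minimal pure-Python sketch under stated assumptions: the function names, the list-based data layout and the three-joint toy chain are hypothetical, and the joints are assumed to be expressed at a scale where the bone lengths sum to one, so that distances are comparable to the normalized $L_{b}$.

```python
import math

def normalized_lengths(lengths):
    """Normalize bone lengths by their sum, as described for L_b."""
    total = sum(lengths)
    return [length / total for length in lengths]

def bone_length_energy(frames, bones, L):
    """E_BL = sum over frames t and bones b of (D_b^t - L_b)^2.

    frames: list of frames, each a list of (x, y, z) joint positions
    bones:  list of (parent_index, child_index) pairs
    L:      target length per bone, in the same order as `bones`
    """
    energy = 0.0
    for joints in frames:
        for (a, c), L_b in zip(bones, L):
            D = math.dist(joints[a], joints[c])    # Euclidean joint distance
            energy += (D - L_b) ** 2
    return energy

# Toy chain with three joints and two bones (illustrative values only).
frames = [
    [(0.0, 0.0, 0.0), (1 / 3, 0.0, 0.0), (1 / 3, 2 / 3, 0.0)],   # matches L exactly
    [(0.0, 0.0, 0.0), (2 / 3, 0.0, 0.0), (2 / 3, 4 / 3, 0.0)],   # both bones stretched
]
bones = [(0, 1), (1, 2)]
L = normalized_lengths([1.0, 2.0])
energy = bone_length_energy(frames, bones, L)
```

Only the second frame contributes to the energy here, since its bones are twice as long as the targets.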

Unlike some previous works [7,41,58,59], we do not require predefined bone lengths or proportions. SfAM recovers optimal articulated structure that minimizes the total energy:

(12) $\min_{S} \left( \| S^{\#} \|_{*} + \frac{\beta}{2} E_{BL}(S) \right), \quad \text{s.t.} \quad W = R S,$

where $\beta$ is a scalar weight. Implementing the articulated prior (11) as a soft constraint makes the overall method robust to incorrect initialization of bone lengths.
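As an illustration of how the two terms of (12) combine, the sketch below evaluates the unconstrained part of the objective for a stacked shape matrix. It assumes that $S$ stacks the $3 \times N$ frames $S_{t}$ row-wise, so that the rearrangement (8) reduces to a row-major reshape; the function names are hypothetical and the constraint $W = RS$ is not enforced here.

```python
import numpy as np

def rearrange(S, T):
    """Map the stacked 3T x N shape matrix S to the T x 3N matrix S^#."""
    N = S.shape[1]
    # Row t becomes [X_t1..X_tN, Y_t1..Y_tN, Z_t1..Z_tN].
    return S.reshape(T, 3 * N)

def total_energy(S, T, bones, L, beta):
    """Nuclear norm of S^# plus beta/2 times the bone length energy E_BL(S)."""
    S_sharp = rearrange(S, T)
    nuclear = np.linalg.norm(S_sharp, ord="nuc")       # sum of singular values
    joints = S.reshape(T, 3, -1)                       # joints[t, :, n] is joint n in frame t
    parents = joints[:, :, [a for a, _ in bones]]
    children = joints[:, :, [c for _, c in bones]]
    D = np.linalg.norm(parents - children, axis=1)     # T x B bone lengths
    e_bl = np.sum((D - np.asarray(L)) ** 2)
    return nuclear + 0.5 * beta * e_bl
```

With `beta = 0` the function reduces to the nuclear norm alone; increasing `beta` puts more weight on constant bone lengths relative to the low-rank prior.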

3.3.2. Energy Optimization

Since (12) contains a nonlinear term $E_{B L} (S)$ , we introduce an auxiliary variable $A$ and obtain the following optimization problem which is linear with respect to $S$ :

(13) $\min_{S} \| S^{\#} \|_{*} + \frac{\beta}{2} \min_{A} E_{BL}(A), \quad \text{s.t.} \quad W = R S \ \text{and} \ A = S .$

We rewrite (13) in the Lagrangian form:

(14) $L(S, A, \mu) = \mu \| S^{\#} \|_{*} + \frac{\beta}{2} E_{BL}(A) + \frac{1}{2} \| W - R S \|_{F}^{2} + \frac{1}{2} \| A - S \|_{F}^{2},$

where $\| \cdot \|_{F}$ denotes the Frobenius norm and $\mu$ is a parameter. We split (14) into two subproblems:

(15) $\min_{S} L(S, \mu) = \min_{S} \left( \mu \| S^{\#} \|_{*} + \frac{1}{2} \| W - R S \|_{F}^{2} + \frac{1}{2} \| A - S \|_{F}^{2} \right)$

and

(16) $\min_{A} L(A) = \min_{A} \left( \frac{\beta}{2} E_{BL}(A) + \frac{1}{2} \| A - S \|_{F}^{2} \right) .$

We alternate between the subproblems (15) and (16) and iterate until convergence. $A$ remains fixed in (15) and $S$ remains fixed in (16). In every optimization step, the subproblem (15) updates the 3D structure so that it more accurately projects to the observed 2D landmarks. The subproblem (16) penalizes the difference in bone lengths among all frames while recovering the sequence-specific bone proportions. The bone lengths of the recovered optimal 3D structures are almost constant throughout the whole image sequence but different from the initial $L_{b}$ .

The subproblem (15) is linear and solved by the fixed-point continuation (FPC) method [54]. First, we obtain the gradient of $\frac{1}{2} (\| W - R S \|_{F}^{2} + \| A - S \|_{F}^{2})$ with respect to $S^{\#}$:

(17) $g(S^{\#}, A) = \frac{\partial \, \frac{1}{2} (\| W - R S \|_{F}^{2} + \| A - S \|_{F}^{2})}{\partial S^{\#}} = [P_{x} \; P_{y} \; P_{z}] (I_{3} \otimes (R^{T} (R S - W) + (S - A))) .$

Next, FPC for $\min_{S} L(S, \mu)$ instantiates as:

(18) $\begin{aligned} Y^{(t+1)} &= S^{\#(t)} - \tau g(S^{\#(t)}, A^{(t)}), \\ S^{\#(t+1)} &= \mathcal{S}_{\tau \mu^{(t)}}(Y^{(t+1)}), \\ \mu^{(t+1)} &= \rho \mu^{(t)}, \end{aligned}$

where $\mathcal{S}_{\nu}(\cdot)$ is the matrix shrinkage operator [54] and $\tau > 0$ is a free parameter.
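One FPC iteration of (18) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' C++ code: the function names are hypothetical, the matrix shrinkage operator is implemented as soft-thresholding of singular values, and the gradient is computed on the stacked $3T \times N$ matrix $S$ and only then rearranged to $T \times 3N$ by a reshape, which assumes the row-wise stacking of frames.

```python
import numpy as np

def shrink(Y, nu):
    """Matrix shrinkage: soft-threshold the singular values of Y by nu."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - nu, 0.0)) @ Vt

def fpc_step(S, A, R, W, mu, tau, rho):
    """One iteration of Eq. (18) on the stacked 3T x N shape matrix S."""
    T = S.shape[0] // 3
    grad = R.T @ (R @ S - W) + (S - A)           # gradient of the smooth terms w.r.t. S
    Y_sharp = (S - tau * grad).reshape(T, -1)    # gradient step, rearranged to T x 3N
    S_sharp = shrink(Y_sharp, tau * mu)          # shrinkage with threshold tau * mu
    return S_sharp.reshape(S.shape), rho * mu    # updated shapes and continuation parameter
```

A call such as `fpc_step(S, A, R, W, mu=1.0, tau=0.2, rho=0.25)` uses the initial values listed in Algorithm 1; since $\rho < 1$, the shrinkage threshold decays geometrically over the iterations.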

The second subproblem (16) is nonlinear and is optimized in every iteration of (18) using the Levenberg–Marquardt implementation of Ceres [60]. Let $r_{l}$, $l \in \{1, \dots, TN\}$, denote the residuals of $\frac{1}{2} \| A - S \|_{F}^{2}$. We aggregate all residuals $e_{tb}(A)$ from (11) (note that $S$ in (11) is substituted by $A$) and $r_{l}$ into a single function:

(19) $F(A) = [e_{11}(A), \dots, e_{TB}(A), r_{1}, \dots, r_{TN}]^{T} : \mathbb{R}^{3TN} \to \mathbb{R}^{BT + TN} .$

Next, the objective function (16) can be compactly written in terms of $A$ as:

(20) $L(A) = \| F(A) \|_{2}^{2} .$

The target nonlinear energy optimization problem consists of finding an optimal parameter set $A'$ so that:

(21) $A' = \arg\min_{A} \| F(A) \|_{2}^{2} .$

We solve (21) iteratively. In every optimization step $k$, the objective is linearized in the vicinity of the current solution $A_{k}$ by the first-order Taylor expansion:

(22) $F(A_{k} + \Delta A) \approx F(A_{k}) + J(A_{k}) \Delta A,$

with $J(A_{k}) \in \mathbb{R}^{(BT + TN) \times 3TN}$ being the Jacobian of $F$ at $A_{k}$. For every iteration, the objective for $\Delta A$ reads:

(23) $\min_{\Delta A} \| J(A_{k}) \Delta A + F(A_{k}) \|^{2} .$

In Ceres [60], the optimum is computed in the least-squares sense with the Levenberg–Marquardt method:

(24) $[J(A_{k})^{T} J(A_{k}) + \lambda_{k} I] \Delta A = - J(A_{k})^{T} F(A_{k}),$

where $\lambda_{k} > 0$ is a parameter and $I$ is an identity matrix.
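The update (24) can be tried out on a toy problem without Ceres. The sketch below is a simplified stand-in, not the solver used in the paper: it uses a forward-difference Jacobian and a fixed damping $\lambda$, whereas a production solver adapts $\lambda_{k}$ and typically uses analytic or automatic derivatives. The residual layout is also an assumption: the bone residual is taken as $\sqrt{\beta} (D - L)$ so that the squared norm of the residual vector equals $\beta (D - L)^{2} + \| A - S \|^{2}$ for a single bone in a single frame.

```python
import numpy as np

def numeric_jacobian(F, x, eps=1e-6):
    """Forward-difference Jacobian of the residual function F at x."""
    f0 = F(x)
    J = np.zeros((f0.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (F(x + dx) - f0) / eps
    return J

def lm_step(F, x, lam):
    """Solve (J^T J + lam I) dx = -J^T F(x) and return the updated x."""
    J = numeric_jacobian(F, x)
    f = F(x)
    dx = np.linalg.solve(J.T @ J + lam * np.eye(x.size), -J.T @ f)
    return x + dx

# Toy problem: two joints that are 2 units apart, one bone with target length 1.
S = np.array([0.0, 0.0, 0.0, 2.0, 0.0, 0.0])    # current shape estimate (fixed here)
L, beta = 1.0, 1.5

def residuals(a):
    d = np.linalg.norm(a[:3] - a[3:])            # current bone length
    return np.concatenate([[np.sqrt(beta) * (d - L)], a - S])

def cost(a):
    return float(np.sum(residuals(a) ** 2))

a = S.copy()
for _ in range(20):
    a = lm_step(residuals, a, lam=1e-3)
bone_length = np.linalg.norm(a[:3] - a[3:])
```

The optimum is a compromise: the bone shrinks towards its target length, but the proximity term keeps the joints close to $S$, which mirrors the soft-constraint behaviour of the articulated prior.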

The algorithm is summarized in Algorithm 1.

Algorithm 1: Structure from Articulated Motion (SfAM)

Input: initial normalized bone lengths $L_{b}$, measurement matrix $W \in \mathbb{R}^{2T \times N}$ with 2D point tracks
Output: poses $R \in \mathbb{R}^{2T \times 3T}$ and 3D shapes $S \in \mathbb{R}^{3T \times N}$
Initialize: $S^{(0)}$ is initialized as in [11], $A^{(0)} = S^{(0)}$, $\beta = 1.5$, $\mu^{(0)} = 1$, $\rho = 0.25$, $\tau = 0.2$
step 1: recover $R$ with IST method [57] (Section 3.2)
step 2 (optional): smooth point trajectories in $W$ [11]
step 3: while not converged do
    $A^{(t+1)} = \arg\min_{A} \left( \frac{\beta}{2} E_{BL}(A) + \frac{1}{2} \| S^{(t)} - A \|_{F}^{2} \right)$ (optimize with Levenberg–Marquardt [55,56])
    $g^{(t+1)} = R^{T} (R S^{(t)} - W) + (S^{(t)} - A^{(t+1)})$
    $Y^{(t+1)} = S^{(t)} - \tau g^{(t+1)}$
    $S^{(t+1)} = \mathcal{S}_{\tau \mu^{(t)}}(Y^{(t+1)})$
    $\mu^{(t+1)} = \mu^{(t)} \rho$
end while

4. Experiments and Results

We extensively evaluate our SfAM on several datasets including Human 3.6m [12], synthetic sequences of Akhter et al. [13] and NYU hand pose [14] dataset. Moreover, we demonstrate qualitative results on challenging community videos. In total, our SfAM is compared to over thirty state-of-the-art model-based and learning-based methods (see Table 1 and Table 2). We also implement SMSR of Ansari et al. [11], which is the most related approach to our SfAM and evaluate it on [12,14] as well as community videos. Moreover, we extend SMSR [11] with the local rigidity constraint of Rehan et al. [38] and include it into our comparison.

In Section 4.2.2, we evaluate the robustness of our approach to inaccuracies in 2D landmarks. The proposed SfAM recovers correct articulated structures given highly inaccurate initial bone lengths in Section 4.2.3. Finally, in Section 4.2.5, we highlight the numerous cases when our method performs better than state-of-the-art learning-based approaches in real-world scenes.

In all experiments, we use a sliding time window of 200 frames. For sequences shorter than 200 frames, we run our method on the whole sequence at once. All experiments are performed on a system with 32 GB RAM and twelve-core Intel Xeon CPU running at 3.6 GHz. Our framework is implemented in C++. Average processing time for a single frame from the Human 3.6m dataset [12] with given 2D annotations amounts to 140 ms.

4.1. Evaluation Methodology

We follow the established evaluation methodology in the area of NRSfM and rigidly align our 3D reconstructions to the ground truth. We report the reconstruction error $E_{3 D}$ in mm between ground truth joint positions $\bar{S_{n}^{t}}$ and aligned 3D reconstructions $G (S_{n}^{t})$ :

(25) $E_{3D} = \min_{G} \frac{1}{T} \frac{1}{N} \sum_{t=1}^{T} \sum_{n=1}^{N} \| \bar{S_{n}^{t}} - G(S_{n}^{t}) \|_{2},$

where $n \in \{1, \dots, N\}$, $t \in \{1, \dots, T\}$, $T$ is the number of frames in the sequence and $N$ is the number of joints of the articulated object. For some datasets, we report the normalized mean 3D error:

(26) $e_{3D} = \min_{G} \frac{1}{\sigma T} \frac{1}{N} \sum_{t=1}^{T} \sum_{n=1}^{N} \| \bar{S_{n}^{t}} - G(S_{n}^{t}) \|_{2}^{2}, \quad \text{with} \quad \sigma = \min_{G} \frac{1}{3T} \sum_{t=1}^{T} (\sigma_{tx} + \sigma_{ty} + \sigma_{tz}),$

where $\sigma_{tx}$, $\sigma_{ty}$ and $\sigma_{tz}$ denote normalized variances of reconstructions $G(S_{n}^{t})$ along the $x$, $y$, $z$-axes, respectively.
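A possible way to compute an error of the form (25) is sketched below. Several choices here are assumptions and not taken from the text: the alignment $G$ is restricted to a rotation plus translation, it is estimated per frame in the least-squares sense with the Kabsch algorithm (which minimizes squared distances rather than the sum of distances in (25)), and the function names are hypothetical.

```python
import numpy as np

def rigid_align(X, Y):
    """Rotate and translate X (P x 3) onto Y (P x 3) in the least-squares sense (Kabsch)."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    H = (X - mx).T @ (Y - my)                    # 3 x 3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])                   # exclude reflections
    Rot = Vt.T @ D @ U.T
    return (X - mx) @ Rot.T + my

def mean_joint_error(S_rec, S_gt):
    """Mean Euclidean joint error over T frames and N joints; inputs have shape T x N x 3."""
    T = S_rec.shape[0]
    err = 0.0
    for t in range(T):
        aligned = rigid_align(S_rec[t], S_gt[t])
        err += np.linalg.norm(aligned - S_gt[t], axis=1).mean()
    return err / T
```

If the ground truth is given in millimetres, the returned value is in millimetres as well. A reconstruction that differs from the ground truth only by a rotation and a translation yields an error of zero up to round-off.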

4.2. Human Pose Estimation

4.2.1. Human 3.6m Dataset

Human 3.6m [12] is currently the largest dataset for monocular 3D human pose sensing. It is widely used for evaluation of learning-based human pose estimation methods. Table 1 gives an overview of the quantitative results on the Human 3.6m [12]. We highlight approaches that are trained on Human 3.6m [12] with “*”. We follow three common evaluation protocols. In Protocol #1, we compare the methods on two subjects (S9 and S11). The original frame rate of 50 fps is reduced to 10 fps. The learning-based approaches marked with “*” use subjects S1, S5, S6, S7, S8 and all camera views for training. Testing is done for all cameras. For Protocol #2, only the frontal view (“camera3”) is used for evaluation. For Protocol #3, evaluation is done on every 64th frame of subject S11 for all cameras. The learning-based approaches marked with “*” use subjects S1, S5, S6, S7, S8 and S9 for training.

For all methods and under all evaluation protocols, we report the reconstruction error $E_{3 D}$ after the rigid alignment of the recovered structures with ground truth. In our method, the bone lengths are initialized with the average values for all the subjects from the dataset.

As Table 1 shows, our accuracy is competitive with the best-performing learning-based approaches that are trained on Human 3.6m [12]. In Section 4.2.5, we demonstrate that our approach works better in real-world scenes which are different from this dataset.

In Figure 4, we visualize several reconstructions of highly challenging scenes by SMSR [11] and the proposed SfAM. See Figure A1 for additional visualizations.

4.2.2. Robustness to Inaccurate 2D Point Tracks

We validate the robustness of our approach to inaccuracies in 2D landmarks on Human 3.6m [12]. We compare our SfAM to state-of-the-art learning-based methods [9,47,53] trained on ground truth 2D data. We add Gaussian noise with increasing values of the standard deviation to the 2D ground truth point tracks. The reconstruction error as a function of the standard deviation of the noise is plotted in Figure 5a. SfAM is more robust than the compared methods for moderate and high perturbations, and the error grows very slowly with the increasing noise level. In contrast to our SfAM, the errors of [9,47,53] grow very fast even with a low level of noise. Note that we evaluate our method on a higher level of noise than [9,47,53]. The average error of the currently best performing 2D detectors is between 10 and 15 pixels [79,80]. We see that, for 10–15 pixels, SfAM has comparable error to the most accurate learning-based approaches while not relying on training data and being generalizable to different object classes.
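The perturbation used in this experiment can be sketched as follows. This is a minimal illustration with assumed details: the function name is hypothetical, the noise is drawn independently per coordinate, and the standard deviation is taken to be in pixels, matching the units of the 2D tracks.

```python
import numpy as np

def add_track_noise(W, sigma, seed=0):
    """Return a copy of the 2T x N measurement matrix with i.i.d. Gaussian noise of std sigma."""
    rng = np.random.default_rng(seed)
    return W + rng.normal(0.0, sigma, size=W.shape)
```

Sweeping `sigma` over a range of values and re-running the reconstruction for each perturbed `W` produces an error curve of the kind described above.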

4.2.3. Robustness to Incorrectly Initialized Bone Lengths and Real Bone Length Recovery

We study the accuracy of SfAM in recovering articulated structures given incorrectly initialized bone proportions (normalized bone lengths) on the subject S11 from Human 3.6m [12]. Starting from the ground truth initialization of bone lengths (obtained from the dataset), we perturb every bone length with Gaussian noise of increasing standard deviation in the range $[0, 70]$ mm. This allows us to analyze the recovered bone lengths and the robustness of SfAM to noise in a controlled and well-defined setting. The results of the experiment are plotted in Figure 5b. If the structure is initialized with anthropometric priors from [81], the error increases by only 3%. Note that our error in bone length estimation is only slightly affected by the increasing levels of noise. It is equal to 54 mm with ground truth initialization and grows just to 66 mm with $σ = 70$ mm. Note that the anthropometric prior corresponds to $σ \approx 15$ mm.

Given incorrect initial bone lengths, SfAM recovers not only correct poses, but also accurate sequence-specific bone lengths. We calculate the average difference between ground truth bone lengths of subject S11 and the initial ones, provided to our method. We do the same for the recovered structures. The results are best viewed in Figure 5c. Thus, SfAM can be used for precise skeleton estimation.
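The bone length perturbation and the comparison against ground truth can be sketched in pure Python. The helper names are hypothetical, the mean absolute difference is an assumed reading of "average difference", and the example lengths are illustrative placeholders rather than values from the dataset.

```python
import random

def perturb_bone_lengths(lengths_mm, sigma_mm, seed=0):
    """Add zero-mean Gaussian noise with standard deviation sigma_mm to every bone length."""
    rng = random.Random(seed)
    return [length + rng.gauss(0.0, sigma_mm) for length in lengths_mm]

def normalize(lengths):
    """Divide by the sum of all lengths to obtain bone proportions."""
    total = sum(lengths)
    return [length / total for length in lengths]

def mean_abs_difference(lengths_a, lengths_b):
    """Average absolute difference between two lists of bone lengths."""
    return sum(abs(a - b) for a, b in zip(lengths_a, lengths_b)) / len(lengths_a)

# Placeholder skeleton with four bones (lengths in mm, made up for illustration).
ground_truth = [450.0, 440.0, 280.0, 250.0]
initial = perturb_bone_lengths(ground_truth, sigma_mm=70.0)
initial_proportions = normalize(initial)          # what would be passed as L_b
initial_error = mean_abs_difference(ground_truth, initial)
```

Computing the same `mean_abs_difference` between the ground truth and the bone lengths measured on the recovered structures gives the second quantity compared in this experiment.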

We also calculate standard deviations of bone lengths of the reconstructed objects for SMSR [11] and SfAM. Figure 5d shows that the standard deviation of bone lengths is very high for SMSR [11], as it considers a human as a general non-rigid object and changes the bone lengths from frame to frame. SfAM reduces the average standard deviation by 514% leading to a more accurate pose reconstruction and structure recovery. In Figure 5d, “Upper Legs” and “Lower Legs” denote bones between the hip/knee and knee/ankle, respectively; “Upper Arms” and “Lower Arms” denote bones between shoulder/elbow and elbow/wrist, respectively.

4.2.4. Synthetic NRSfM Datasets

Synthetic sequences of Akhter et al. [13] are commonly used for the evaluation of sparse NRSfM. We compare our approach with previous SfM methods on challenging synthetic sequences with a large variety of human motions (Drink, Pickup, Stretch, and Yoga) [20]. Some pairs of joints remain locally rigid in these sequences. We activate the articulated constraint for those points and evaluate our method. Table 2 shows the results of SfAM and previous SfM methods.

The errors $e_{3 D}$ for other listed methods are taken from PPTA [78] and SMSR [11]. Only PPTA [78] outperforms SfAM on Drink, whereas CSF2 [23] achieves a comparable $e_{3 D}$ . SfAM achieves the most consistent performance among all compared algorithms.

4.2.5. Real-World Videos

Our algorithm is capable of recovering human motion from challenging real-world videos. We compare our results with the state-of-the-art learning-based approach of Martinez et al. [9] and one of the best performing general-purpose NRSfM methods SMSR [11]. Since ground truth 2D annotations are not available, we use OpenPose [82] for 2D human body landmark extraction. Bone lengths are initialized with the values from anthropometric data tables [81]. As Figure 6 shows, [9] fails to correctly recover poses that are different from the training dataset [12]. SMSR [11] produces unrealistic human body structures. In contrast to [9,11], our method successfully recovers 3D human poses in real-world scenes.

4.3. Hand Pose Estimation

We also evaluate SfAM on the NYU hand pose dataset [14], which provides 2D and 3D ground truth annotations for 8252 different hand poses. The hand model consists of 30 bones. Hand pose recovery is a challenging problem due to occlusion and many degrees of freedom. We compare the performance of our approach with SMSR [11] and its modification with the local rigidity constraint of Rehan et al. [38]. Quantitatively, SfAM achieves $E_{3D}$ of $14.2$ mm. In contrast, $E_{3D}$ of SMSR [11] is $22.2$ mm, and SMSR with articulated body constraints [38] shows $E_{3D}$ of $19.4$ mm. Hence, adding our articulated prior term to [11] reduces the error by 36%; put differently, the error of SMSR [11] is 56% higher than that of SfAM. The qualitative results are shown in Figure 7. As for human bodies, SfAM achieves a lower error because it keeps bone lengths constant between frames. When SMSR [11] fails to reconstruct the correct 3D pose, SfAM still outputs plausible results.
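For reference, the relative differences implied by the three reported errors can be worked out directly from the numbers above:

```latex
\begin{aligned}
\text{SfAM vs. SMSR [11]:} \quad
  & \frac{22.2 - 14.2}{22.2} \approx 0.36, \qquad \frac{22.2}{14.2} \approx 1.56, \\
\text{SfAM vs. SMSR with [38]:} \quad
  & \frac{19.4 - 14.2}{19.4} \approx 0.27 .
\end{aligned}
```

The first ratio is the error reduction relative to SMSR, while the second expresses how much larger the SMSR error is relative to SfAM.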

5. Conclusions

We present a new method for 3D articulated structure recovery from 2D landmarks. The proposed approach is general and not restricted to specific structures or motions. Integrating our soft articulated prior term into a general-purpose NRSfM approach and optimizing in an alternating fashion yields accurate and stable results.

In contrast to the vast majority of state-of-the-art approaches, SfAM does not require training data or known bone lengths. By ensuring consistency of bone lengths throughout the whole sequence, it optimizes sequence-specific bone proportions and recovers 3D structures. In extensive experiments, it proves its generalizability and shows accuracy close to state-of-the-art on public benchmarks. It also shows a remarkable improvement in accuracy compared to other model-based approaches. Moreover, our method outperforms learning-based approaches in complicated real-world videos. All in all, we show that high accuracy on benchmarks can be achieved without the need for training and parameter tuning for specific datasets.

In future work, we plan to apply SfAM to animal shape estimation and recovery of personalized human skeletons. We also believe it can boost the development of methods for human and hand pose estimation with semi-supervision.

Author Contributions

Conceptualization, O.K., V.G. and A.E.; methodology, O.K. and V.G.; software, O.K. and V.G.; validation, O.K., V.G., J.M., A.E. and D.S.; formal analysis, O.K. and V.G.; investigation, O.K.; resources, O.K., V.G., and J.M.; data curation, O.K., V.G., and J.M.; writing—original draft preparation, O.K. and V.G.; writing—review and editing, O.K., V.G., J.M. and A.E.; visualization, O.K.; supervision, D.S.

Funding

This research was funded by the project VIDETE of the German Federal Ministry of Education and Research (BMBF), Grant No. 01IW18002.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SfAM	Structure from Articulated Motion
SfM	Structure from Motion
NRSfM	Non-Rigid Structure from Motion
FPC	Fixed-Point Continuation
SMSR	Scalable Monocular Surface Reconstruction
IST	Iterative Shrinkage-Thresholding

Appendix A

Figure A1

Additional visualizations of our results and reconstructions with NRSfM of Ansari et al. [11] on several sequences from [12]. (a)–(c): our results on sitting, photo and discussion. These sequences and poses are among the most challenging in the dataset. (d): comparison of our SfAM and NRSfM [11].

Figures and Tables

Figure 1. We recover different articulated structures from real-world videos with high accuracy and no need for training data. Our Structure from Articulated Motion (SfAM) approach is not restricted to a single object class and only requires a rough articulated structure prior. The reconstructions are provided under different view angles.

Figure 2. Side-by-side comparison of the non-rigid structure from motion (NRSfM) method [11] and our SfAM. Reconstruction results of [11] violate anthropometric properties of the human skeleton due to changing bone lengths from frame to frame.

Figure 3. The pipeline of the proposed SfAM approach. Following factorization-based NRSfM, we first recover the camera pose using 2D position observations. Then, we recover 3D articulated structure by optimizing our new energy functional accounting for articulated priors.

Figure 4. Comparison of our SfAM and NRSfM [11] on Human 3.6m [12]. NRSfM considers humans as general non-rigid objects and changes bone lengths from frame to frame.

Figure 5. (a): the reconstruction error $e_{3D}$ under 2D noise; (b): $e_{3D}$ under incorrect bone length initializations; (c): average bone length error for increasing levels of Gaussian noise before (red) and after (green) the optimization; (d): standard deviation of bone lengths for SMSR [11] and our SfAM.

Figure 6. Comparison of our SfAM, NRSfM [11], and the learning-based method of Martinez et al. [9] on challenging real-world videos.

Figure 7. Comparison of our SfAM to NRSfM [11] on the NYU hand pose dataset [14].

Table 1

The reconstruction error $E_{3D}$ (in mm) of SfAM and previous methods on the Human 3.6m dataset [12] under the evaluation protocols P1, P2 and P3. “*” indicates learning-based methods which are trained on Human 3.6m [12]. We outperform all model-based approaches and come very close to the tuned supervised learning techniques.

Method	P1	P2	P3
Zhou et al. [3] *	106.7	-	-
Akhter et al. [41]	-	181.1	-
Ramakrishna et al. [1]	-	157.3	-
Bogo et al. [61]	-	82.3	-
Kanazawa et al. [45] *	67.5	66.5	-
Moreno-Noguer [47] *	62.2	-	-
Yasin et al. [59]	-	-	110.2
Rogez et al. [62]	-	-	88.1
Chen, Ramanan [63] *	-	-	82.7
Nie et al. [64] *	-	-	79.5
Sun et al. [52] *	-	-	48.3
Omran et al. [65] *	59.9	-	-
Zhou et al. [66] *	54.7	-	-
Mehta et al. [8] *	54.6	-	-
Pavlakos et al. [67] *	51.9	-	-
Kinauer et al. [68] *	50.3	-	-
Tekin et al. [69] *	50.1	-	-
Rogez et al. [44] *	49.2	51.1	42.7
Habibie et al. [70] *	49.2	-	-
Martinez et al. [9] *	45.6	-	-
Zhao et al. [71] *	43.8	-	-
Pavlakos et al. [46] *	41.8	-	-
Arnab, Doersch et al. [72] *	41.6	-	-
Chen, Lin et al. [73] *	41.6	-	-
Sun et al. [74] *	40.6	-	-
Wandt, Rosenhahn [53] *	38.2	-	-
Pavllo et al. [75] *	36.5	-	-
Dabral et al. [58] *	36.3	-	-
SMSR [11]	106.6	105.2	102.9
SMSR [11]+[38]	145.2	124.0	139.9
Our SfAM	51.2	51.7	53.9

Table 2

The normalized mean 3D error $e_{3 D}$ of previous NRSfM methods and our SfAM for synthetic sequences [20].

Method	Drink	PickUp	Stretch	Yoga
MP [76]	0.4604	0.4332	0.8549	0.8039
PTA [20]	0.0250	0.2369	0.1088	0.1625
CSF1 [77]	0.0223	0.2301	0.0710	0.1467
CSF2 [23]	0.0223	0.2277	0.0684	0.1465
BMM [10]	0.0266	0.1731	0.1034	0.1150
Lee [37]	0.8754	1.0689	0.9005	1.2276
PPTA [78]	0.011	0.235	0.084	0.158
SMSR [11]	0.0287	0.2020	0.0783	0.1493
SMSR [11]+[38]	0.4348	0.4965	0.3721	0.4471
Our SfAM	0.0226	0.1921	0.0673	0.1242

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Recovery of articulated 3D structure from 2D observations is a challenging computer vision problem with many applications. Current learning-based approaches achieve state-of-the-art accuracy on public benchmarks but are restricted to specific types of objects and motions covered by the training datasets. Model-based approaches do not rely on training data but show lower accuracy on these datasets. In this paper, we introduce a model-based method called Structure from Articulated Motion (SfAM), which can recover multiple object and motion types without training on extensive data collections. At the same time, it performs on par with learning-based state-of-the-art approaches on public benchmarks and outperforms previous non-rigid structure from motion (NRSfM) methods. SfAM is built upon a general-purpose NRSfM technique while integrating a soft spatio-temporal constraint on the bone lengths. We use an alternating optimization strategy to recover optimal geometry (i.e., bone proportions) together with 3D joint positions by enforcing bone length consistency over a series of frames. SfAM is highly robust to noisy 2D annotations, generalizes to arbitrary objects and does not rely on training data, which is shown in extensive experiments on public benchmarks and real video sequences. We believe that it brings a new perspective to the domain of monocular 3D recovery of articulated structures, including human motion capture.

Details

Title

Structure from Articulated Motion: Accurate and Stable Monocular 3D Reconstruction without Training Data

Author

Kovalenko, Onorina¹; Golyanik, Vladislav²; Malik, Jameel³; Elhayek, Ahmed⁴; Stricker, Didier⁵

¹ Department Augmented Vision, German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; [email protected] (J.M.); [email protected] (A.E.); [email protected] (D.S.)
² Department of Computer Graphics, Max Planck Institute for Informatics, 66123 Saarbrücken, Germany; [email protected]
³ Department Augmented Vision, German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; [email protected] (J.M.); [email protected] (A.E.); [email protected] (D.S.); Department of Computer Science, University of Kaiserslautern, 67663 Kaiserslautern, Germany; School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), 44000 Islamabad, Pakistan
⁴ Department Augmented Vision, German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; [email protected] (J.M.); [email protected] (A.E.); [email protected] (D.S.); Department of Computer Science, University of Prince Mugrin (UPM), 20012 Madinah, Saudi Arabia
⁵ Department Augmented Vision, German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; [email protected] (J.M.); [email protected] (A.E.); [email protected] (D.S.); Department of Computer Science, University of Kaiserslautern, 67663 Kaiserslautern, Germany

First page

4603

Publication year

2019

Publication date

2019

Publisher

MDPI AG

e-ISSN

14248220

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/s19204603

ProQuest document ID

2535498473
