A Review of Facial Landmark Extraction in 2D Images and Videos Using Deep Learning

1. Introduction

The face is a key component in conveying verbal and non-verbal information during interactions between humans. By looking at a face, humans can extract a great deal of non-verbal information, such as identity, intent, and emotion. In computer vision, the detection of facial landmarks (Figure 1) is usually an important step toward extracting such information automatically, and several facial analysis techniques depend on precise landmark detection. Head pose estimation [1] and facial expression recognition [2] algorithms rely heavily on the facial shape information given by landmark positions. For instance, facial landmarks around the eyes can provide an initial guess of the pupil center positions for eye detection, as well as for eye-gaze tracking [3]. In facial recognition, landmarks on a 2D image are usually used along with a 3D head model to frontalize the face and reduce within-subject variability, which improves recognition accuracy [4]. Further, facial information obtained through facial landmarks can improve applications in the fields of human-computer interaction, entertainment, security surveillance, and medicine [5,6,7,8].

Facial landmark detection algorithms seek to identify the locations of key landmark points in facial images or videos. Such key points are either dominant points that describe the unique location of a facial component (e.g., an eye corner) or interpolated points that connect those dominant points around the facial components and along the facial contour. Formally, given a facial image denoted as $I$, a landmark detection algorithm predicts the locations of $D$ landmarks $x = \{x_{1}, y_{1}, x_{2}, y_{2}, \dots, x_{D}, y_{D}\}$, where $x_{i}$ and $y_{i}$ represent the image coordinates of the $i$-th facial landmark.
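
As a small illustration of this notation, the sketch below packs D landmark points into the flat vector x and unpacks it again. The helper names `pack_shape` and `unpack_shape`, as well as the five example points, are hypothetical and introduced only for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def pack_shape(points: List[Point]) -> List[float]:
    """Flatten D (x, y) points into the vector [x1, y1, ..., xD, yD]."""
    flat: List[float] = []
    for x, y in points:
        flat.extend([x, y])
    return flat


def unpack_shape(flat: List[float]) -> List[Point]:
    """Inverse of pack_shape: recover the D (x, y) points."""
    if len(flat) % 2 != 0:
        raise ValueError("a shape vector must have even length 2D")
    return [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


# Hypothetical five-point shape (two eyes, nose tip, two mouth corners).
five_points = [(30.0, 40.0), (70.0, 40.0), (50.0, 60.0), (35.0, 80.0), (65.0, 80.0)]
x = pack_shape(five_points)
```

With D = 5 the vector `x` has 2D = 10 entries, and `unpack_shape(x)` returns the original list of points.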

Facial landmark extraction is challenging for several reasons: firstly, facial appearance changes significantly across subjects and under different facial expressions and head poses; secondly, facial occlusions by other objects, or self-occlusion due to extreme head poses, lead to incomplete facial appearance information; and thirdly, environmental conditions such as illumination can affect the appearance of the face in images.

Over the past few years, there have been important developments in facial landmark detection algorithms. Early works focused on less challenging facial images without the variations described above. Facial landmark detection algorithms usually aimed at handling several variations within fixed categories, and the facial images were usually collected under controlled conditions, in which, for instance, facial poses and facial expressions can only fall into certain categories. More recently, researchers began focusing on challenging in-the-wild conditions, in which facial images may exhibit arbitrary facial expressions, head poses, illumination, facial occlusions, etc. In general, a robust method that can handle all of those variations is still lacking [5,6]. Algorithms for facial landmark extraction can be split into two categories: generative and discriminative algorithms. Generative methods build an object representation, and then search for the region most similar to the object without taking background information into account. Discriminative methods train an online binary classifier to adaptively separate the object from the background, which makes them more robust against appearance variations of an object. Generative models include part-based generative models, such as Active Shape Models (ASM) [10,11], and holistic generative models, such as Active Appearance Models (AAM) [12,13,14,15,16], which model the shape of the face and its appearance as probabilistic distributions. Among discriminative models, cascaded regression models (CRMs) [9,17,18,19,20,21,22,23] have gained widespread popularity due to their excellent performance and low complexity. Several exhaustive surveys exist for facial landmark detection and tracking [5,24,25,26,27]. This survey is focused on landmark extraction and tracking using deep-learning techniques.
We focus on deep-learning methods for facial landmark detection because, since the arrival of deep neural networks [28,29], deep-learning methods have achieved state-of-the-art performance in identity and face recognition, object detection, and the other research fields mentioned above. In the last few years, some tasks closely related to facial landmark extraction have also appeared in the literature, such as video facial landmark extraction and 3D facial landmark extraction. The former consists of tracking facial landmarks in successive video frames of the same person, taking advantage of temporal redundancy and identity; its hardest challenge is represented by in-the-wild video frames. The aim of 3D facial landmark extraction is to detect facial landmarks in arbitrary poses and to recover the projected 3D locations of invisible facial landmarks, given a 2D image. This article considers the newest deep-learning methods for facial landmark extraction in 2D images and video. An analysis of many methods is given, and their performance is compared based on the measures used in the literature. The most common datasets used for both training and performance assessment are reported and briefly analyzed. Finally, the main challenges and future research directions are provided.

2. Landmarks Extraction

In brief, the aim of facial landmark extraction is to detect facial landmark coordinates in an image, inside a face-region bounding box. Duffner et al. [30] used a Convolutional Neural Network (CNN) to extract five landmarks on a low-resolution facial image. They trained a CNN to predict facial feature likelihood maps, supervised by ground-truth maps defined through a Gaussian distribution centered on the positions of the features. This was arguably the first study to propose this idea and to successfully establish image-to-image mapping for facial landmark detection. However, the authors did not address applications that require dense prediction on high-resolution images; they proposed an optimized and efficient CNN only for a limited number of fiducial points. In another article, Luo et al. [31] proposed extracting the facial landmarks from face-parsing segmentation results obtained with a deep belief network. Their method is composed of four parts: a face detector, a face part detector, a components detector, and component segmentators. Luo et al. designed these parts as layers in a Bayesian framework, assuming a spatial consistency prior between them.
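
To make the idea of a Gaussian ground-truth likelihood map concrete, the following minimal sketch builds one such map for a single landmark. It is not the implementation of [30]: the function name `gaussian_target_map`, the map size, and the value of `sigma` are assumptions chosen only for illustration.

```python
import math
from typing import List


def gaussian_target_map(height: int, width: int, cx: float, cy: float,
                        sigma: float = 1.5) -> List[List[float]]:
    """Return a height x width map with a Gaussian bump centered at (cx, cy).

    sigma (in pixels) is an assumed value, not taken from the cited work.
    """
    heatmap = []
    for row in range(height):
        line = []
        for col in range(width):
            sq_dist = (col - cx) ** 2 + (row - cy) ** 2
            line.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
        heatmap.append(line)
    return heatmap


# Target map for a hypothetical landmark at column 5, row 2 of an 8 x 8 map.
target = gaussian_target_map(height=8, width=8, cx=5, cy=2)
```

The map peaks at the landmark position and decays smoothly around it, so a network trained against it is penalized less for near misses than for distant ones.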

Each of these layers was pre-trained without supervision through layer-wise Restricted Boltzmann Machines (RBMs) and fine-tuned using logistic regression. Finally, the segmentators were trained as deep auto-encoders. Seeing the great success of deep-learning techniques in image classification [28], researchers started to extract sparse facial landmarks with similar structures. Sun et al. [32] used a cascaded coarse-to-fine CNN to detect five facial fiducial points. A three-stage method was adopted, with several CNNs in each stage. The CNNs in the first stage estimate the rough positions of several different sets of landmarks, each of which is then separately refined by the CNNs in the following stages. Despite the novelty and high accuracy of this approach, the input of each CNN depends on local patches extracted from the output of the previous one. TCDCN (Tasks-Constrained Deep Convolutional Network) [33] uses multi-task learning to optimize the performance of five-point landmark extraction. The authors showed that auxiliary facial attributes, such as gender and pose, are useful for landmark extraction, while simultaneously providing further information at inference time. Kumar et al. [34] showed that local patch features extracted by a CNN work better with a linear regressor to provide a five-point prediction. In Zhang et al. [35], a fine-tuned, pre-trained CNN was used to extract local facial patch features, followed by a cascaded regressor to extract the landmarks. Both articles have shown that CNNs can serve as good feature extractors in the conventional cascaded regression framework mentioned in the introduction. We now turn to dense facial landmarks, which are landmarks that are not necessarily semantic, but which can also lie on a contour. In Zhang et al. [36], a coarse-to-fine encoder–decoder network was used to extract 68 fiducial landmarks.
The authors designed a four-stage cascaded encoder–decoder with growing input resolution across stages. Landmark positions were updated at the end of each stage with the output of the CNN. The authors later improved the method by adding an occlusion-recovering auto-encoder that reconstructs occluded facial parts in order to avoid estimation errors caused by occlusions [37]. The occlusion-recovering network reconstructs the original face from the occluded one and is trained on a synthetic occluded dataset. Sun et al. [38] used a multi-layer perceptron (MLP) as a graph transformer network [39] to replace the regressors in a cascaded regression method for detecting facial landmarks. The authors showed that such a combination can be trained entirely by backpropagation. Wu et al. [40] designed a three-way factorized Restricted Boltzmann Machine (RBM) [41] to build a deep face-shape model for predicting 68-point landmarks. The main drawback of using only one CNN to directly obtain a dense prediction is that the network is trained to reach its best result only on the global shape, which can lead to local inaccuracies. A simple and immediate remedy is to refine different facial parts locally and independently as a post-processing step. Fan et al. [42] and Huang et al. [43] extracted dense facial landmarks with a CNN that estimates rough positions, followed by several small regional CNNs that refine different parts locally. This structure is more time-consuming, but can significantly improve precision. In another article, Lv et al. [44] used two-stage re-initialization with a deep network regressor in each stage. The framework is composed of a global step, where a coarse landmark shape is extracted, and a local step, where the fiducial points of each facial part are estimated separately.
A novelty is that the global or local transformation parameters are predicted by a CNN to re-initialize the face region to a standard shape before landmark extraction. Performance on large poses is thereby greatly improved. In Shao et al. [45], adaptive weights were applied to different landmarks during the different steps of training. The authors assigned a higher coefficient to some important landmarks, such as the eye and mouth corners, at the start of the training process, and then reduced these weights once the result converged. In this manner, the neural network first learns a robust global shape, and then learns more refined local predictions. In addition, even though deep-learning-based techniques are not sensitive to initialization, large head poses still represent a major challenge. Wu et al. [6] proposed a tweaked structure at the end of the vanilla network of TCDCN [33], where different branches regress shapes for different head poses. More recently, Trigeorgis et al. [46] proposed a cascaded regression method with a Recurrent Neural Network (RNN), named the Mnemonic Descent Method (MDM). In MDM, CNNs extract patch features, replacing hand-crafted feature extractors such as SIFT [47] in SDM [17]. Furthermore, the RNN acts as a memory unit that shares information between the cascade levels. The RNN facilitates the joint optimization of the regressors, treating the cascade as a non-linear dynamical system. Another widely used approach consists of training the CNN to output likelihood maps (also called response maps, probability maps, voting maps, or heat maps), as originally proposed by [30,48]. The value of a pixel on a map can be seen as the probability that the corresponding landmark is located at that pixel. Zadeh et al. [49] proposed using a DCNN to produce local response maps and fitting the model as a Constrained Local Model (CLM).
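
Since each likelihood map scores every pixel for one landmark, landmark coordinates can be read off the maps. The sketch below uses the simplest decoding rule, taking the pixel with the highest response in each map; this is one common choice and an assumption here, as the cited methods may decode their maps differently. The names `decode_landmark` and `decode_shape` and the toy maps are hypothetical.

```python
from typing import List, Tuple

Heatmap = List[List[float]]


def decode_landmark(heatmap: Heatmap) -> Tuple[int, int]:
    """Return the (x, y) pixel with the highest response in one map."""
    best_x, best_y, best_val = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best_val:
                best_x, best_y, best_val = x, y, val
    return best_x, best_y


def decode_shape(heatmaps: List[Heatmap]) -> List[Tuple[int, int]]:
    """Decode one (x, y) position per likelihood map."""
    return [decode_landmark(h) for h in heatmaps]


# Two toy 3 x 4 likelihood maps, one per landmark.
maps = [
    [[0.0, 0.1, 0.0, 0.0],
     [0.1, 0.9, 0.2, 0.0],
     [0.0, 0.1, 0.0, 0.0]],
    [[0.0, 0.0, 0.0, 0.3],
     [0.0, 0.0, 0.2, 0.8],
     [0.0, 0.0, 0.0, 0.1]],
]
shape = decode_shape(maps)
```

Here `shape` contains the position (1, 1) for the first landmark and (3, 1) for the second, given as (column, row).
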
Since a deep encoder-decoder can provide image-to-image mapping, Lai et al. [50] made use of a full CNN to predict a starting face shape instead of the mean shape that is commonly used in cascaded regression models [17,18]. The authors introduced Shape-Indexed Pooling as a feature-mapping function to extract local features around each point, which were then fed to the regressor. In the first version, the authors used a fully-connected layer to regress the final shape step by step; in the second version, inspired by MDM [46], they replaced it with an RNN. In Xiao et al. [51], an attention mechanism [52,53] was used, whereby landmarks near the attention center were subject to a specific refinement procedure. Wang et al. [54] studied an approach for detecting the landmarks of multiple faces through likelihood maps. Thanks to an ROI pooling branch [55], face detection is not necessary, and non-face activations are removed from all of the likelihood maps. Concerning likelihood maps, another interesting method was proposed by Kowalski et al. [56]. The authors predicted the transformation to a standard pose and a feature image at the same time, together with global likelihood maps, in a cascaded fashion. The networks in different stages share information by taking the transformation parameters from the previous stage. Many recent studies have focused on multi-task CNN methods to obtain semantic information in addition to facial fiducial points. Besides the aforementioned TCDCN [33], Zhang et al. proposed MTCNN [57], a three-stage structure composed of CNNs that jointly perform face detection, face classification, and landmark extraction. In the first stage, the authors designed a fast Proposal Network (P-Net) to obtain facial region candidates and landmarks on low-resolution images.
These candidates were then refined in the next stage through a Refinement Network (R-Net), followed by an Output Network (O-Net) that produces the final bounding boxes and landmark positions from higher-resolution inputs. Ranjan et al. [58] designed a multi-branch CNN, named the All-in-One CNN, to simultaneously detect the face region, facial landmarks, head pose, smile probability, gender, and age of the person by sharing a common convolutional feature extractor. The preceding paragraphs reported methods for facial landmark extraction; these works can be divided into the following categories:

Sparse facial landmark detection;
Dense facial landmark detection;
Landmark detection with RNN;
Landmark detection with likelihood maps;
Landmark detection with multi-task learning.

However, some studies that make use of DCNNs are harder to categorize. In a study by Belharbi et al. [59], facial landmark detection is treated as a structured-output problem. In a study by Güler et al. [60], facial landmarks were extracted in a deformation-free space, defined by the two-dimensional U-V mapping of a 3D Morphable Model. The authors divided the regression of landmark positions into two parts, which they call quantized classification and residual regression: the former locates the band region, and the latter estimates the residual relative to that band region. A coarse band position is predicted through quantized classification, and a refined shape is predicted through residual regression. This algorithm gives a dense mapping between a three-dimensional object template and an input image. Finally, its output provides a robust initialization for CRMs. Face landmark extraction has changed considerably, moving from directly predicting point coordinates with fully-connected layers to predicting landmark positions through likelihood maps using CNNs, along with several other interesting ideas. It is noticeable that in many of the listed works, the CNN is no longer regarded simply as a learnable image feature extractor, but rather as a multi-functional tool for processing different types of information for facial landmark extraction.
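
The split into a coarse band and a residual, described above for [60], can be illustrated with a single coordinate. The sketch below is only one plausible parameterization and not the authors' implementation: it assumes a coordinate in [0, 1), equally wide bands, and a residual expressed as a fraction of the band width. The names `encode_coordinate` and `decode_coordinate` are hypothetical.

```python
from typing import Tuple


def encode_coordinate(value: float, num_bands: int) -> Tuple[int, float]:
    """Split a coordinate in [0, 1) into (band index, residual inside the band).

    The band index is the target of the classification step; the residual,
    in [0, 1), is the target of the regression step.
    """
    if not 0.0 <= value < 1.0:
        raise ValueError("value must lie in [0, 1)")
    scaled = value * num_bands
    band = min(int(scaled), num_bands - 1)
    return band, scaled - band


def decode_coordinate(band: int, residual: float, num_bands: int) -> float:
    """Recombine a band index and a residual into a coordinate."""
    return (band + residual) / num_bands


# Example: coordinate 0.40625 with 8 bands falls in band 3, a quarter of the way in.
q, r = encode_coordinate(0.40625, 8)
u = decode_coordinate(q, r, 8)
```

Classification only has to pick one of a few bands, while regression only has to cover the short range inside a band, which is the division of labor the text describes.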

3. Landmarks Tracking

The task of video facial landmark extraction, also referred to as sequential facial landmark extraction, aims at aligning a sequence of person-specific images by exploiting the continuity of information in the video. An immediate idea for video face tracking is person-specific modeling, since personal identity information remains unchanged. In Chrysos et al. [61], a tracking pipeline was shown to improve tracking robustness in videos that contained speech. In addition to generic face detection and landmark extraction, the authors designed a person-specific face detector based on a Deformable Part Model [62,63] and a person-specific generative landmark localizer. The latter iteratively updates the generic/person-specific appearance variations and the shape/appearance parameters in turn. The method was used for the semi-automatic annotation of the 300VW [64] dataset. Asthana et al. [19] reformulated cascaded regression in a parallel form to allow fast and efficient learning of each cascade level. The information retained by each subsequent cascade level was derived from the statistical distribution of the previous cascade regressor. More recent studies on cascaded regression are CCR [65] and iCCR [66]. The authors studied a continuous regression method, reformulating it so that the algorithm did not require sampling over perturbed shapes (e.g., flipping, rotation, scaling). As a consequence, the computational complexity was greatly reduced compared to traditional cascaded regression-based methods. Bayesian filters are very common in object tracking, so an immediate idea is to combine them (e.g., Kalman filtering) with state-of-the-art landmark localizers. In Prabhu et al. [67], a Kalman filter was exploited to track the facial landmarks through the head positions, head orientations, and facial shapes in video sequences. The 300VW workshop [64] proposed a challenging dataset focused on the tracking of landmarks.
Among all the methods, one of the main ones designed a pose-specific CRM [68], and another used a progressive initialization [69], which aims to mitigate the problem of initialization in extreme poses. Taking inspiration from the incremental learning method [70], Peng et al. [71] used a CNN with likelihood maps to evaluate the fitting results at the end of the network, which makes the results more reliable. RED-Net [72] was introduced to improve the performance of facial landmark extraction on video by separating identity information from pose/expression information. The former can be regarded as invariant within a video, while the pose and expression information changes over time. The authors proposed a dual-path network using likelihood maps, in which one path extracts the identity information, while the other path learns the pose/expression information by means of an RNN. Recently, Gu et al. [73] added a one-layer RNN to the final part of a VGG network for tracking facial landmarks and head poses. The authors showed that Bayesian filters can be formulated as a linearly-activated RNN without bias. According to their results, tracking with an RNN is more accurate and reliable than both frame-by-frame detection and state-of-the-art landmark localizers tracked with a Kalman filter. Another algorithm that uses an RNN is TSTN, proposed by Liu et al. [39]. It adopts two network streams, spatial and temporal. The spatial stream learns to transform local facial patches into shape residuals, which are then used to refine the current facial shape based on the previous shape. The temporal stream is designed as a deep encoder–decoder with two RNN layers for capturing facial dynamics in the temporal dimension. This stream takes consecutive frames as input and produces the temporal shape update. The final shape is determined by a weighted fusion of the shape updates of the two streams.
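
The final fusion step of such a two-stream design can be sketched in a few lines. How TSTN obtains its fusion weights is not stated here, so the single fixed scalar `spatial_weight` below is purely an assumption, as are the function name `fuse_updates` and the example numbers.

```python
from typing import List


def fuse_updates(prev_shape: List[float],
                 spatial_update: List[float],
                 temporal_update: List[float],
                 spatial_weight: float = 0.5) -> List[float]:
    """Add a convex combination of two shape updates to the previous shape.

    All shapes are flat vectors [x1, y1, ..., xD, yD].
    """
    if not 0.0 <= spatial_weight <= 1.0:
        raise ValueError("spatial_weight must lie in [0, 1]")
    w = spatial_weight
    return [p + w * s + (1.0 - w) * t
            for p, s, t in zip(prev_shape, spatial_update, temporal_update)]


# One landmark at (10, 20); the two streams propose different displacements.
fused = fuse_updates([10.0, 20.0], [2.0, -1.0], [4.0, 1.0], spatial_weight=0.5)
```

With equal weights the landmark moves by the average of the two proposed displacements, giving (13, 20) in this example.
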
A Long Short-Term Memory (LSTM) module was used by Hou et al. [74] to guide the spatial estimation for the next step, just as in MDM, and to simultaneously guide the estimation for the next frame. Overall, it is noticeable that deep-learning methods are not yet widely used in video landmark extraction because of their high complexity, model size, and memory requirements, which are still a significant problem for real-time detection on mobile platforms.
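
Since Bayesian filters such as the Kalman filter recur throughout this section, a minimal sketch may help. It smooths each landmark coordinate independently with a scalar random-walk Kalman filter. This is a deliberate simplification and not the formulation of [67], which tracks head position, head orientation, and facial shape. The class `ScalarKalman`, the function `smooth_track`, and the noise variances are assumptions chosen for illustration.

```python
from typing import List


class ScalarKalman:
    """Random-walk Kalman filter for a single coordinate."""

    def __init__(self, initial: float, process_var: float = 1.0,
                 measurement_var: float = 4.0) -> None:
        self.estimate = initial
        self.variance = measurement_var   # uncertainty of the first detection
        self.process_var = process_var
        self.measurement_var = measurement_var

    def update(self, measurement: float) -> float:
        # Predict: the position is assumed unchanged, but uncertainty grows.
        prior_var = self.variance + self.process_var
        # Correct: blend prediction and measurement using the Kalman gain.
        gain = prior_var / (prior_var + self.measurement_var)
        self.estimate += gain * (measurement - self.estimate)
        self.variance = (1.0 - gain) * prior_var
        return self.estimate


def smooth_track(detections: List[List[float]]) -> List[List[float]]:
    """Smooth per-frame flat shapes [x1, y1, ...] coordinate by coordinate."""
    filters = [ScalarKalman(c) for c in detections[0]]
    track = [list(detections[0])]
    for frame in detections[1:]:
        track.append([f.update(c) for f, c in zip(filters, frame)])
    return track


# Jittery per-frame detections of one landmark: x wobbles, y is stable.
detections = [[10.0, 20.0], [12.0, 20.0], [9.0, 20.0], [11.0, 20.0]]
track = smooth_track(detections)
```

The smoothed x coordinate wobbles less than the raw detections, while the stable y coordinate is left unchanged. A tracker for real video would normally also model velocity, which this sketch omits.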

4. Datasets

In this section, the datasets most frequently used in the literature are listed and analyzed; they are divided into image datasets and video datasets. Image datasets can generally be split into two groups: those whose images are taken under constrained conditions, such as controlled lighting and specific poses, and those whose images are taken under unconstrained conditions, which are usually referred to as in-the-wild datasets. The main image datasets are listed below:

Multi-PIE [75] is one of the largest constrained datasets. It contains 337 subjects in 15 views, with 19 illumination conditions and six different expressions. The facial landmarks are labeled with 39 points or 68 points.
XM2VTS [76] contains four recordings of 295 subjects, taken over a period of 4 months. Each recording contains a speaking head shot and a rotating head shot. The dataset is annotated with 68 points, and is included in the 300W challenge [77].
300-W [77] (300 Faces in-the-Wild Challenge) has been the most popular in-the-wild dataset in recent years. It combines several datasets, such as Helen [78], LFPW [79], and AFW [80], with a newly introduced challenging dataset, iBug. In total, it contains 3837 images and a further test set with 300 indoor and 300 outdoor images. All the images are annotated with 68 points. The dataset is commonly divided into two parts: the common subset, including LFPW and Helen, and the challenging subset, with AFW and iBug.
Menpo [81] is the largest in-the-wild facial landmark dataset. Its training set contains 6679 semi-frontal view face images, annotated with 68 points, and 5335 profile view face images, annotated with 39 points. The test set is composed of 12006 frontal view images and 4253 profile view images. It was introduced for the Menpo challenge in 2017 to pose an even more difficult test of the robustness of facial landmark extraction algorithms, since it involves a high variation of poses, lighting conditions, and occlusions.

Video-based annotated datasets are used for sequential facial landmark extraction. 300-VW [64] has the largest number of videos and frames annotated with facial points. The dataset is composed of 50 training videos and 64 videos for testing, which are further divided into three scenarios, according to different light conditions, expressions, head poses, and occlusions. All of the frames are annotated with 68 points in a semi-automatic manner. The Menpo 3D tracking [82] dataset is the only dataset in which 3D facial landmarks are annotated in video, using the 3DMM algorithm [83]. The dataset contains 55 videos from the 300VW dataset, re-annotated in 3D, in addition to all of the images in 300W and Menpo, re-annotated in the same way. The dataset provides not only the landmarks in projected image space, but also the landmarks in 3D space.

5. Evaluation Metrics and Comparison

To provide comparable results regardless of image size and camera focus, it is common in the literature to measure the distance between the ground truth and the detection result by the Normalized Mean Error (NME) $e$, calculated as:

$e = \frac{\| S - S^{*} \|_{2}}{d^{*}},$

where $S$ and $S^{*}$ represent the detected shape and the ground-truth shape, respectively, and $d^{*}$ is a normalizing distance, which could be the inter-ocular distance (IOD) or the inter-pupil distance. Often, the bounding box diagonal or the geometric mean of the image length and height is used as $d^{*}$ instead, since the distance between the two eyes can become too small on 3D/large-pose datasets. Another metric that is based on the NME is the Cumulative Error Distribution (CED) curve. The CED represents the proportion of images in the test set having an error below a given threshold. This curve provides a visual summary of the algorithm performance in different situations, and the Area Under the Curve (AUC) provides a quantitative summary of how the algorithm performs at progressive mean errors:

$AUC_{\alpha} = \int_{0}^{\alpha} f(e) \, de,$

where $e$ is the normalized error, $f(e)$ is the cumulative error distribution (CED) function, and $\alpha$ is the upper bound of the definite integral. A bigger AUC value generally means that the algorithm has better performance. The failure rate is used to measure the robustness of an algorithm: an NME threshold is chosen as the failure threshold, and the proportion of failed detections is calculated to represent the capacity of handling difficult images.

We are now ready to compare different facial landmark extraction methods, as well as different deep-compression models. This comparison includes traditional cascaded regression methods and deep-learning-based facial landmark extraction methods. Zhang et al. [33] and Kowalski et al. [56] both provide a good benchmark of several popular methods, measured by the normalized inter-ocular distance error. Table 1 shows the performance of seven non-deep-learning 2D facial landmark extraction methods and nine deep-learning facial landmark extraction methods evaluated on 300W by the normalized inter-ocular distance error. The failure rate is not included in the table since the choice of threshold is not objective. Table 2 reports a comparison of the Mean Error (%) (normalized by face size) of different video facial landmark extraction methods on 300VW. In general, the deep-learning-based algorithms outperform the others. However, in terms of speed, all of the cascaded-regression-based methods can run in real time even with a Matlab implementation [26], while some deep-learning-based methods can achieve real-time detection on a powerful GPU or CPU, although most runtimes on CPU are not satisfactory.
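
The metrics defined in this section (NME, CED, AUC, and failure rate) can be sketched in a few lines. The function names and example values are hypothetical. One assumption should be stated clearly: `nme` averages the per-landmark Euclidean distances before normalizing, which follows common practice, whereas the formula above writes the shape distance compactly as a single norm. The `auc` function integrates the empirical (step-shaped) CED exactly, using the identity that the integral of the CED from 0 to α equals the mean of max(0, α − e) over the per-image errors e.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def nme(pred: Sequence[Point], gt: Sequence[Point], norm_dist: float) -> float:
    """Mean per-landmark Euclidean error divided by the normalizing distance."""
    dists = [math.hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists) / norm_dist


def ced(errors: Sequence[float], threshold: float) -> float:
    """Proportion of images whose error does not exceed the threshold."""
    return sum(e <= threshold for e in errors) / len(errors)


def auc(errors: Sequence[float], alpha: float) -> float:
    """Exact integral of the empirical CED from 0 to alpha."""
    return sum(max(0.0, alpha - e) for e in errors) / len(errors)


def failure_rate(errors: Sequence[float], threshold: float) -> float:
    """Proportion of images whose error exceeds the failure threshold."""
    return 1.0 - ced(errors, threshold)


# Hypothetical per-image NME values for a four-image test set.
errors: List[float] = [0.02, 0.04, 0.06, 0.10]
example_nme = nme([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)], norm_dist=10.0)
example_auc = auc(errors, alpha=0.08)
example_failure = failure_rate(errors, threshold=0.08)
```

Dividing `auc(errors, alpha)` by `alpha` rescales the area to the range [0, 1], which makes values computed with different upper bounds easier to compare.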

6. Main Challenges

Despite the advances reported in the articles analyzed above, research on improved face landmarking techniques continues. Emerging applications require landmarking algorithms to run in real time with the computational power of an embedded system, such as an intelligent camera or a smartphone. Furthermore, these applications require algorithms that are increasingly robust against a variety of confounding factors, such as out-of-plane poses, occlusions, illumination effects, and expressions. The factors that compromise the performance of facial landmark detection are detailed as follows:

Variability: Landmark appearances differ due to intrinsic factors, such as face variability between individuals, but also due to extrinsic factors, such as partial occlusion, illumination, expression, pose, and camera resolution. Facial landmarks can sometimes be only partially observed due to occlusions by hair or hand movements, or self-occlusion due to extensive head rotations. The other two major variations that compromise the success of landmark detection are illumination artifacts and facial expressions. A face landmarking algorithm that works well under and across all intrinsic variations of faces, and that delivers the target points in a time-efficient manner, has not yet proved feasible.
Acquisition conditions: Much as in the case of face recognition, acquisition conditions, such as illumination, resolution, and background clutter, can affect landmark localization performance. This is attested by the fact that landmark localizers trained on one database usually have inferior performance when tested on another database.
The number of landmarks and their accuracy requirements: The accuracy requirements and the number of landmark points vary based on the intended application. For example, coarser detection of only the primary landmarks, e.g., the nose tip, four eye corners, and two mouth corners, or even the bounding box enclosing these landmarks, may be adequate for face detection or face recognition tasks. On the other hand, higher-level tasks, such as facial expression understanding or facial animation, require a greater number of landmarks, from 20–30 up to 60–80, as well as greater spatial accuracy. As for the accuracy requirement, fiducial landmarks, such as those on the eyes and nose, need to be determined more accurately, as they often guide the search for secondary landmarks with less prominent or reliable image evidence. It has been observed, however, that landmarks on the rim of the face, such as the chin, cannot be accurately localized in either manual annotation or automatic detection. Shape-guided algorithms can benefit from the richer information coming from a larger set of landmarks. For example, Milborrow and Nicolls [88] have shown that the accuracy of landmark localization increases proportionally to the number of landmarks considered, and have recorded a 50% improvement as the ensemble increases from 3 to 68 landmarks.

In the final analysis, accurate and precise landmarking remains a difficult problem since, with few exceptions, landmarks do not necessarily correspond to high-gradient or other salient points. Hence, low-level image processing tools remain inadequate to detect them, and recourse has to be made to higher-order face shape information. This probably explains the tens of algorithms presented and the hundreds of articles published in the last two decades in the quest to develop a landmarking scheme on par with human annotators [5,26,27].

7. Conclusions

In this survey, recent deep-learning-based 2D facial landmark extraction methods were reviewed. After analyzing face landmarking techniques, comparing their performance, and examining the main challenges, we can conclude that deep-learning-based algorithms outperform the others in terms of precision. However, computational efficiency remains a major constraint, especially for video facial landmark extraction. Although deep-learning methods achieve excellent performance on many datasets, facial landmark extraction on resource-limited platforms has not been solved. One future research direction is to investigate compression methods, such as Shuffle-Net [89]. Another direction is to focus on precision [90], since specific applications like animation demand high precision to achieve perfect rendering. Other promising research paths in landmarking techniques are the following. Sparse dictionaries: the paradigm of recognition under a sparsity constraint and the building of discriminative dictionaries seems to be one viable method; the discriminative sparse dictionary can be constructed per landmark [91,92] or collectively, as in [93]. Adaboost-selected features for multiview landmarking: Gabor or Haar wavelet features are selected via a modified Adaboost scheme, in which the commonality and geometric configuration of landmark appearances are exploited [94]. Finally, multiframe landmarking: the determination of landmark positions exploits the information in subsequent frames of a video, using, for example, spatio-temporal representations [95].

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Figure and Tables

Figure 1. Sample image with predicted facial landmarks. An ensemble of randomized regression trees is used to detect 194 landmarks on a face from a single image [9].

Table 1

Performance comparison of different landmark extraction methods based on NME, AUC, and FPS (frames per second). The upper part of the table lists non-deep-learning methods, while the lower part lists deep-learning methods. Data was obtained from [33,56] and the original publications (*). The measure from [84] was obtained using a threshold of $0.07$ on the 300W-private dataset.

Non-Deep-Learning Method	Year	Database	NME (%)	$AUC_{0.08}$ (%)	FPS on Video
DRMF [85]	2013	300W	9.22	-	0.5
RCPR [22]	2013	300W	8.35	-	80
ESR [18]	2014	300W	7.58	43.12	-
SDM [17]	2013	300W	7.52	42.94	40
ERT [9]	2014	300W	6.40	-	25
LBF [21]	2014	300W	6.32	-	3000
CFSS [23]	2015	300W	5.76	55.9 *	10
Deep-Learning Method	Year	Database	NME (%)	$AUC_{0.08}$ (%)	FPS on Video
CFAN [36]	2014	300W	7.69	-	20
TCDCN [33]	2014	300W	5.54	41.7 *	58
TSR [44]	2017	300W	4.99	-	111
RAR [51]	2016	300W	4.94	-	-
DRR [50]	2018	300W	4.90	-	-
MDM [46]	2016	300W	4.05	52.12	-
DAN [56]	2017	300W	3.59	55.33	73
2DFAN [84]	2017	300W	-	66.90 *	30
DenseReg + MDM [60]	2017	300W	-	52.19	8

Table 2

Comparison of the Mean Error (%) (normalized by face size) of different video facial landmark extraction methods on 300VW. The data was obtained from [72] and original publications. * indicates deep-learning-based methods. $^{†}$ indicates that the runtime is measured on GPU (graphics processing unit).

Video Facial Landmark Extraction Method Comparison on the 300VW Dataset
Method	NME	FPS
ESR [18]	7.09	67
SDM [17]	7.25	40
CFSS [23]	6.13	10
PIEFA [86]	6.37	-
CFAN * [36]	6.64	20
TCDCN * [33]	7.59	59
RED * [72]	6.25	33 $^{†}$
RED-Res * [87]	4.75	18 $^{†}$
RNN * [73]	6.16	-
TSTN * [39]	5.59	30

© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

The task of facial landmark extraction is fundamental in several applications which involve facial analysis, such as facial expression analysis, identity and face recognition, facial animation, and 3D face reconstruction. Thanks to the most recent advances in deep-learning techniques, the performance of methods for facial landmark extraction has been substantially improved, even on in-the-wild datasets. Thus, this article presents an updated survey on facial landmark extraction on 2D images and video, focusing on methods that make use of deep-learning techniques. An analysis of many approaches, comparing their performance, is provided. Finally, an analysis of common datasets, challenges, and future research directions is provided.

Details

Title

A Review of Facial Landmark Extraction in 2D Images and Videos Using Deep Learning

Author

Bodini, Matteo

Publication year

2019

Publication date

2019

Publisher

MDPI AG

e-ISSN

25042289

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/bdcc3010014

ProQuest document ID

2546940618
