Abstract
A deep learning model based on smooth tracking, the adaptive lightweight eye control tracking Transformer (ALETT), is proposed to address the accuracy and real-time issues of eye control human–computer interaction interfaces. Built on convolutional neural network and recurrent neural network frameworks, ALETT demonstrates excellent performance in experimental validation on the EYEDIAP, DUT, and GazeCapture datasets. On the EYEDIAP dataset, the proposed model achieved an area under the curve (AUC) value of 0.934 and a specificity of 0.902, with a mean absolute error of only 0.065 and an average inference time of 12.453 ms. On the DUT dataset, the AUC value is 0.917, the specificity 0.895, the mean absolute error 0.062, and the inference time 11.789 ms. The best performance is achieved on the GazeCapture dataset, with an AUC value of 0.944, specificity of 0.910, mean absolute error of 0.056, and inference time of 10.756 ms. The findings indicate that the proposed model offers significant advantages in improving the accuracy and response speed of eye control interaction, opening new possibilities for applying eye control technology in fields such as virtual reality, education and training, and health monitoring.
Article highlights
A lightweight, adaptive eye-tracking model, ALETT, is introduced for improved human-computer interaction.
ALETT demonstrates high accuracy and low inference time across multiple datasets, including EYEDIAP and GazeCapture.
The proposed method has advantages in improving the accuracy and response speed of eye control interaction.
Introduction
Eye control technology, as an emerging human–computer interaction (HCI) method, has received widespread attention in recent years [1]. As information technology rapidly develops, especially with breakthroughs in artificial intelligence and machine learning, eye control technology is being applied in multiple fields such as healthcare, education, and gaming to enhance user experience through efficient and natural interaction [2, 3]. At the same time, with the rise of technologies such as virtual reality, users’ demand for eye controlled HCI interfaces is also increasing, placing high requirements on positioning accuracy and response speed [4]. This background makes research on eye controlled HCI interfaces particularly important. However, despite significant progress in the implementation and application of eye control technology in existing research, there are still some shortcomings. Firstly, many traditional studies mainly focus on the development and implementation of the technology itself, with relatively little exploration of users’ operational fluency, comfort, and intuitiveness [5]. This leads to an unsatisfactory user experience in practical applications that cannot effectively meet users’ needs [6]. Secondly, existing interface designs are often limited to traditional interaction modes, lacking innovation and flexibility, and are often unable to adapt to rapidly changing user needs and application scenarios. Finally, many studies have failed to fully consider the individual differences of diverse users, limiting the universality and effectiveness of eye controlled HCI [7].
Many researchers have studied smooth tracking technology. Meng et al. developed an online gait generator to improve the stability and responsiveness of bipedal robots during running. The generator was based on a simplified variable-height inverted pendulum model, achieved smooth state transitions through segmented zero-moment-point trajectory optimization, and introduced an iterative algorithm to accurately track speed; it was validated using a BH7P robot model [8]. Niu et al. explored the smooth tracking characteristics in eye control interaction to address the “Midas touch” problem. Through two experiments, the effects of slider speed, the initial position of the “plus sign”, and the position of the variable value on smooth tracking efficiency were analyzed. The results showed that the highest tracking efficiency was achieved when the slider speed was 2 degrees per second and the “plus sign” was located above, while the most significant effect was observed when the variable value was located at the center of the tracking path [9]. Xu et al. proposed a collaborative interaction interface that combines electrooculography (EOG) and tactile perception to address the limitations of single eye movement control methods and the difficulty of achieving three-dimensional interaction in traditional human–computer interaction. EOG signals are collected through honeycomb graphene electrodes prepared by laser induction, enabling fast, contactless interaction in the two-dimensional (XY) plane, while a tactile perception interface provides complex two-dimensional motion control and three-dimensional Z-axis control. In their experiments, the EOG signals could distinguish nine different eye movements with an average prediction accuracy of 92.6%, and the tactile perception interface achieved eight-direction and complex trajectory control, demonstrating ultra-high sensitivity and stability [10]. Ma et al. proposed a solution to the Midas touch problem in eye controlled HCI, using blinking as the interaction trigger mechanism. Through experiments using a Tobii eye tracker and computer vision technology, it was found that double blinking could significantly reduce task load and improve the success rate. It was also demonstrated that an increase in the size of the interaction object reduces user response time and enhances the success rate, with the optimal size found to be 55.5 mm [11]. Ban and his team proposed a new head-mounted soft electronic system for continuous monitoring of electrooculography signals, using stretchable electrodes and dry electrodes printed with thermoplastic polyurethane to achieve real-time classification of blinking and eye movements with an accuracy of 98.3%, demonstrating its potential in HCI and virtual reality [12].
Novák et al. focus on the application of eye tracking technology in user experience (UX) and usability evaluation, exploring its emotion recognition capability when combined with machine learning. Through a systematic review of 1988 relevant articles, 144 were selected for detailed analysis and 90 were ultimately included. They found that recent technological advances have driven the automation of eye tracking data processing and sentiment analysis, significantly improving the efficiency and accuracy of user experience evaluation. The results indicate that eye tracking technology can quantify the interaction behavior between users and products, reveal potential usability issues, and identify user emotions with machine learning [13]. Bentvelzen et al. focus on understanding design spaces and interactive technologies that support reflection. Their methods include a structured literature review and an analysis of application reviews. The results construct four design resources, temporal perspective, dialogue, comparison, and discovery, and identify design patterns in digital artifacts that implement these resources, providing intermediate knowledge for future technologies that better support reflection [14]. Han et al. explore how Transformer models can be applied to the field of computer vision. Their methods include classifying visual Transformer models and analyzing their advantages and disadvantages across tasks. The results reveal the high performance of Transformers in visual tasks, particularly in major categories such as backbone networks, high/mid-level vision, low-level vision, and video processing. In addition, the study discusses the fundamental role of the self-attention mechanism in computer vision and proposes future research directions [15]. Xia et al. proposed an improved kernel correlation filter (KCF) algorithm to address the failure of target tracking algorithms under occlusion. Firstly, an occlusion judgment mechanism is introduced: when there is no occlusion, the KCF algorithm is used for target tracking, and when occlusion occurs, an improved algorithm based on the unscented Rauch-Tung-Striebel smoother is switched in. Secondly, the predicted target position is fed back to the KCF algorithm to enhance tracking accuracy. Finally, color histograms are combined with the KCF algorithm, and a sparse representation method is introduced to optimize the training process and improve the stability of the algorithm. The experimental results show that this method effectively reduces occlusion interference and significantly improves the accuracy and success rate of target tracking on the OTB-2013 dataset [16].
In order to evaluate the performance of the ALETT model, this study selected four mainstream eye control interaction models for comparison, including the DeepGaze Model, iTracker, Pupil Labs, and Salicon model. These models represent different methods based on convolutional neural networks, traditional machine learning, eye sensor hardware, and saliency attention mechanisms, and are representative in comprehensively evaluating the advantages of the ALETT model. The DeepGaze model is based on deep convolutional neural networks and is widely used in image classification tasks, achieving good results in eye control interaction. However, it consumes a large amount of computing resources, especially in real-time systems where it may face significant burdens. The iTracker model combines traditional image processing methods with machine learning techniques, which has good real-time performance. However, due to its reliance on feature engineering, its accuracy is limited and it performs poorly in complex scenes. The Pupil Labs model is based on eye tracking hardware and algorithms, which determine fixation points by tracking eye movement behavior. It has high accuracy and is suitable for fine eye movement analysis, but it is highly dependent on hardware and has limited application scope. The Salicon model is based on saliency attention mechanism and is used to predict visual attention distribution. It performs well in static image processing, but has poor robustness in dynamic scenes.
In summary, many experts have conducted research on various application scenarios of eye controlled HCI and on technological reforms to enhance user experience. However, current research still has shortcomings such as insufficient depth of attention, lack of practical verification, and insufficient consideration of individual differences among users. Therefore, this study proposes a new eye controlled HCI interface design based on smooth tracking to address problems such as user interaction fluency, strategy adaptability, and operation accuracy in dynamic environments. The main contributions of this article are as follows:
1. A new eye tracking method is proposed, based on a deep learning framework that combines convolutional and recurrent neural networks with a lightweight visual Transformer, which significantly improves prediction accuracy and robustness in complex environments.
2. The limitations of traditional methods in dynamic environments are addressed: by introducing core technologies such as multi-sensor fusion and data augmentation, the method can effectively cope with challenges in dynamic scenes such as rapid head movement or changes in gaze targets.
3. The method is systematically compared with multiple existing approaches, including mainstream eye tracking methods such as iTracker, Pupil Labs, and DeepGaze, verifying its superiority across multiple experimental settings.
4. The experimental results are extensively validated: through experiments on multiple public datasets, the proposed method achieves significant advantages in prediction accuracy and computational efficiency and demonstrates its effectiveness in different application scenarios.
5. Complete model training details are provided: to ensure that other researchers can reproduce the work, the hyperparameter settings of the model and all details of the training process are documented, providing an actionable reference for peer research.
Methods and materials
Application of deep learning in eye control interaction interface design
iTracker is a traditional eye tracking method based on computer vision, which predicts fixation points by extracting features from eye movement images. Its core innovation lies in an improved feature extraction algorithm that can achieve high accuracy against complex backgrounds. The method uses techniques such as optical flow and edge detection to maintain stability in low-light or rapidly moving environments, but it is highly dependent on ambient lighting and camera quality, which may affect its performance in different scenes. Pupil Labs combines deep learning technology to model eye tracking data through high-resolution cameras and deep neural networks. Its innovation lies in multi-sensor data fusion, which combines data from the eye tracking camera and the head tracking device to improve tracking accuracy and robustness. This method can effectively handle eye movement prediction in dynamic environments, but it requires a large amount of labeled data for training and places high demands on data preparation and the training process. DeepGaze is based on deep learning and uses convolutional neural networks to learn features from eye tracking images. Its core innovation lies in automatic feature learning and data augmentation, extracting fine features through multiple convolutional layers to improve prediction accuracy in complex scenes. Compared with traditional methods, DeepGaze does not require manual feature design and automatically processes complex visual information, adapting to more application scenarios; however, it requires longer training and inference time and more computing resources. The research on eye control HCI interface design based on smooth tracking recognizes and predicts users’ gaze movements and interaction intentions through deep learning frameworks, especially convolutional neural networks (CNN) and recurrent neural networks (RNN) [17].
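As an illustration of how such a CNN-plus-RNN pipeline can map a short sequence of eye images to gaze coordinates, a minimal PyTorch sketch is given below; the architecture, layer sizes, the `GazeCNNRNN` name, and the 64 × 64 eye-crop input are illustrative assumptions rather than the implementation used in this study.

```python
# Minimal sketch of a CNN + RNN gaze predictor (illustrative only; layer
# sizes and the 64x64 eye-crop input are assumptions, not the paper's code).
import torch
import torch.nn as nn


class GazeCNNRNN(nn.Module):
    def __init__(self, hidden_size: int = 128):
        super().__init__()
        # CNN encoder: turns each 64x64 grayscale eye crop into a feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # -> (B*T, 64)
        )
        # RNN over per-frame features models the smooth-pursuit dynamics.
        self.rnn = nn.GRU(64, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 2)               # (x, y) gaze point

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 1, 64, 64)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.reshape(b * t, *frames.shape[2:]))
        feats = feats.reshape(b, t, -1)
        out, _ = self.rnn(feats)
        return self.head(out[:, -1])                         # gaze for last frame


if __name__ == "__main__":
    model = GazeCNNRNN()
    dummy = torch.randn(4, 8, 1, 64, 64)   # 4 clips of 8 frames each
    print(model(dummy).shape)               # torch.Size([4, 2])
```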
Before recognition, it is necessary to evaluate the quality of the gaze data, as shown in Fig. 1.
Fig. 1. Quality assessment of gaze data [see PDF for image]
Figure 1 illustrates the evaluation criteria for gaze data quality through three bullseye plots, focusing on the two key dimensions of accuracy and precision. The distribution of points in Fig. 1a exhibits low accuracy and high precision, meaning that the measurements are highly consistent but deviate from the true values. Figure 1b shows the ideal state of high accuracy and high precision, with points concentrated at the center of the target, indicating that the measurements are both accurate and consistent. In contrast, the scattered distribution of points in Fig. 1c reflects low accuracy and low precision, meaning that the measurements are neither accurate nor consistent. Accuracy and precision are the core indicators for measuring the performance of an eye tracking system: high accuracy ensures that the measurements are close to the true values, while high precision ensures their consistency.
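To make the two quality dimensions concrete, the following sketch computes accuracy as the mean offset between recorded gaze samples and the known target and precision as the root-mean-square of sample-to-sample dispersion; the helper names and pixel-based units are assumptions for illustration, not the paper's evaluation code.

```python
# Illustrative computation of gaze-data accuracy and precision
# (helper names and pixel units are assumptions, not the paper's code).
import numpy as np


def gaze_accuracy(samples: np.ndarray, target: np.ndarray) -> float:
    """Mean Euclidean offset between gaze samples and the true target position."""
    return float(np.linalg.norm(samples - target, axis=1).mean())


def gaze_precision_rms(samples: np.ndarray) -> float:
    """RMS of successive sample-to-sample distances (lower = more precise)."""
    diffs = np.diff(samples, axis=0)
    return float(np.sqrt((np.linalg.norm(diffs, axis=1) ** 2).mean()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = np.array([512.0, 384.0])                  # fixation target (pixels)
    # High precision, low accuracy: tight cluster offset from the target (Fig. 1a).
    biased = target + np.array([30.0, -20.0]) + rng.normal(0, 1.0, size=(200, 2))
    # High precision and accuracy: tight cluster on the target (Fig. 1b).
    good = target + rng.normal(0, 1.0, size=(200, 2))
    for name, s in [("biased", biased), ("good", good)]:
        print(name, round(gaze_accuracy(s, target), 2),
              round(gaze_precision_rms(s), 2))
```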
To achieve high-precision smooth tracking, an in-depth analysis is conducted on three different motion paradigms, as shown in Fig. 2.
Fig. 2. Three motion paradigms for smooth tracking [see PDF for image]
Figure 2 shows three motion paradigms for evaluating the accuracy of smooth tracking. The sinusoidal motion paradigm in Fig. 2a lets researchers analyze the tracking performance of the eye tracking system at different frequencies by having the target follow a periodic sinusoidal trajectory. The ramp motion paradigms in Fig. 2b and c present continuous and segmented ramp descent patterns, where Fig. 2c includes a brief pause to study the dynamic response of the eye tracking system during the initiation, maintenance, and cessation phases. To further ensure a comprehensive evaluation, the study also incorporated two additional dynamic paradigms, ‘rapid direction change’ and ‘continuous motion’, designed to test the system’s robustness and accuracy under more challenging and less predictable tracking conditions and to simulate a wider range of realistic user interactions.
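The target trajectories for these five paradigms can be synthesized as in the sketch below; the amplitudes, frequencies, pause interval, and 120 Hz sampling rate are illustrative assumptions rather than the stimulus parameters used in the experiments.

```python
# Illustrative generators for the five target-motion paradigms used in the
# smooth-tracking evaluation (amplitudes/frequencies are assumed values).
import numpy as np

FS = 120.0                       # assumed sampling rate of the target stimulus (Hz)
T = np.arange(0, 4.0, 1.0 / FS)  # 4-second trajectories


def sinusoidal(amplitude=10.0, freq=0.5):
    """Periodic sinusoidal target motion (Fig. 2a)."""
    return amplitude * np.sin(2 * np.pi * freq * T)


def ramp(slope=-5.0):
    """Continuous ramp descent (Fig. 2b)."""
    return slope * T


def stop_ramp(slope=-5.0, pause=(1.5, 2.0)):
    """Segmented ramp with a brief pause to probe onset/offset dynamics (Fig. 2c)."""
    pos = slope * np.clip(T, None, pause[0])     # ramp, then hold during the pause
    moving_again = T > pause[1]
    pos[moving_again] += slope * (T[moving_again] - pause[1])  # resume the ramp
    return pos


def rapid_direction_change(amplitude=10.0, period=0.8):
    """Triangle wave: abrupt velocity reversals every half period."""
    phase = (T / period) % 1.0
    return amplitude * (4 * np.abs(phase - 0.5) - 1.0)


def continuous_motion(seed=0):
    """Smoothed random walk approximating unpredictable continuous motion."""
    rng = np.random.default_rng(seed)
    steps = rng.normal(0, 0.2, size=T.shape)
    return np.convolve(np.cumsum(steps), np.ones(12) / 12, mode="same")
```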
The framework of the eye control interaction system is shown in Fig. 3.
Fig. 3. Framework of eye control interaction system [see PDF for image]
In Fig. 3, this framework is specifically designed for the study of eye controlled HCI interfaces based on smooth tracking. The system first stores user information and basic eye tracking data through the gaze calibration process, providing a foundation for personalized eye tracking. The visual input module receives signals from the user’s eyes and passes them on to the smoothing algorithm and fixation point conversion module, which is responsible for processing eye movement data and converting it into actionable fixation point information. In the study, the design of the key data feedback window is based on eye tracking data, dynamically adjusting key information to the user’s visual focus position to achieve rapid information acquisition. The interface style needs to be unified, and font size and icon design should be adjusted according to users’ personalized needs to improve the comprehensibility of the interface. Through eye tracking technology, the perception module captures the user’s gaze and attention in real time, and the main control module adjusts interface elements based on this data to achieve the best interactive experience. The target of this study is to design an interface that not only conforms to users’ visual habits, but also enables efficient interaction through eye control technology to improve user experience. To assess the effectiveness of the method, it is necessary to split the original video into independent frames and assign corresponding labels to the interaction behavior of each frame. The definition of the input raw data is denoted in Eq. (1).
$$X = \left[ x_{1}, x_{2}, \ldots, x_{N} \right]^{\mathrm{T}} \quad (1)$$
In Eq. (1), $X$ represents the trajectory data of user eye movements over a period of time; $x_{i}$ are the components of the vector $X$, each corresponding to an observation at one time point in the time series; $N$ indicates the total number of elements in the vector $X$; the superscript $\mathrm{T}$ indicates that $X$ is a column vector.
$$X_{t} = \left[ p_{t+\Delta} \right]_{\Delta = 0, 1, \ldots, \tau} \quad (2)$$
In Eq. (2), $X_{t}$ represents the user’s eye movement data at a specific time point $t$ and the eye movement trajectory over a subsequent period of time; $p$ represents the eye movement position; $\Delta$ is the time offset relative to $t$; $\tau$ is the number of time offsets.
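A minimal sketch of turning a recorded gaze trajectory into the per-frame, windowed samples described by Eqs. (1) and (2) is shown below; the window length and the labelling scheme are assumptions for illustration, not the study's preprocessing code.

```python
# Illustrative construction of sliding-window samples matching the data
# definitions of Eqs. (1)-(2); window length and labels are assumed.
import numpy as np


def make_windows(trajectory: np.ndarray, labels: np.ndarray, tau: int = 16):
    """trajectory: (N, 2) gaze positions; labels: (N,) interaction class per frame.

    Returns X of shape (N - tau, tau + 1, 2): each row is the gaze position at a
    time point plus its next `tau` offsets, paired with the label of the last frame.
    """
    windows, targets = [], []
    for t in range(len(trajectory) - tau):
        windows.append(trajectory[t:t + tau + 1])
        targets.append(labels[t + tau])
    return np.stack(windows), np.asarray(targets)


if __name__ == "__main__":
    n = 200
    traj = np.cumsum(np.random.default_rng(1).normal(0, 1, size=(n, 2)), axis=0)
    lab = np.zeros(n, dtype=int)
    X, y = make_windows(traj, lab, tau=16)
    print(X.shape, y.shape)   # (184, 17, 2) (184,)
```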
Model optimization of eye control HCI interface based on smooth tracking
In the design of eye controlled HCI interface based on smooth tracking, an optimized deep lightweight visual Transformer module is proposed. This module is designed through hierarchical stacking and consists of three stages, each of which uses patch merging and block linking to reduce feature map resolution and computational complexity [18]. The image dimension changes of the semantic feature extraction module are denoted in Fig. 4.
Fig. 4. Structure of semantic feature extraction based on the deep lightweight visual Transformer [see PDF for image]
In Fig. 4, the process begins with an input image that is preprocessed into an initial feature map. The feature maps then pass through Stage 1 and Stage 2 in sequence, with the feature map size maintained. In Stage 3, the feature map resolution is reduced through the patch merging operation, and in Stage 4 the feature map size is further adjusted (the specific dimensions at each stage are indicated in Fig. 4). This structure reduces computational complexity through its hierarchical design and improves the depth and efficiency of feature extraction, making it suitable for eye controlled HCI interface design [19].
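The patch merging operation that reduces the feature map resolution between stages can be sketched as follows; the 2 × 2 neighbour merging shown here follows the standard Swin-style formulation and is an assumption about the exact layer used, not the authors' released code.

```python
# Illustrative Swin-style patch merging: concatenates 2x2 neighbouring patches
# and projects them, halving spatial resolution while doubling channels.
# (Assumed variant; the paper's exact merging layer may differ.)
import torch
import torch.nn as nn


class PatchMerging(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) with H and W even
        x0 = x[:, 0::2, 0::2, :]
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)


if __name__ == "__main__":
    feat = torch.randn(1, 56, 56, 96)
    print(PatchMerging(96)(feat).shape)   # torch.Size([1, 28, 28, 192])
```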
The Twin Transformer Block effectively extracts global features through its self-attention mechanism, which can be utilized to improve the efficiency and accuracy of image feature extraction in eye controlled HCI interface design. Its structure is shown in Fig. 5.
Fig. 5. Twin Transformer Block structure [see PDF for image]
The Twin Transformer Block structure shown in Fig. 5 contains the following main modules: Layer Normalization (LN), Window-based Multi-head Self-Attention (W-MSA), a second LN, a Multi-Layer Perceptron (MLP), and Shifted Window Multi-head Self-Attention (SW-MSA). The W-MSA module performs self-attention computation within a local window, while SW-MSA expands the receptive field through periodic shifts of the window, achieving cross-window information interaction. Each module is followed by a residual connection to facilitate the propagation of gradients in deep networks and maintain the stability of the information flow.
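A condensed sketch of the window-based multi-head self-attention used inside such a block is given below; for brevity the relative position bias (cf. Eq. (5) below) is stored as a full per-pair table and the masking needed for shifted (SW-MSA) windows is omitted, so this is an illustration rather than the exact module.

```python
# Simplified window-based multi-head self-attention (W-MSA) with a learnable
# relative position bias; SW-MSA masking is omitted for brevity.
import torch
import torch.nn as nn


class WindowAttention(nn.Module):
    def __init__(self, dim: int, window_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per head and per pair of positions in the window.
        n = window_size * window_size
        self.rel_bias = nn.Parameter(torch.zeros(num_heads, n, n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_windows*B, N, C), N = window_size**2 tokens per window
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, c // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (b, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (b, heads, N, N)
        attn = (attn + self.rel_bias).softmax(dim=-1)  # add relative position bias
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)


if __name__ == "__main__":
    tokens = torch.randn(8, 49, 96)   # 8 windows of 7x7 tokens, 96 channels
    print(WindowAttention(96, 7, 4)(tokens).shape)   # torch.Size([8, 49, 96])
```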
The computational cost of MSA is indicated in Eq. (3).
$$O(\mathrm{MSA}) = h \cdot n^{2} \cdot d \quad (3)$$
In Eq. (3), $O(\cdot)$ represents the upper bound of the time complexity; $h$ is the number of attention heads; $n$ is the length of the sequence; $d$ is the number of features (dimensions) for each head. The time complexity of the W-MSA operation is indicated in Eq. (4).
$$O(\mathrm{W\text{-}MSA}) = h \cdot n \cdot M^{2} \cdot d \quad (4)$$
In Eq. (4), $M$ denotes the size of the window. In the research of eye control HCI interface design based on smooth tracking, the Twin Transformer Block optimizes the MSA mechanism through the shifted window strategy to enhance information exchange between windows [20]. This strategy overcomes the limitations of traditional W-MSA by segmenting and shifting the windows at the center of the feature map. At the same time, relative positional encoding is introduced to add a positional bias when calculating the similarity between queries and keys, as shown in Eq. (5).
$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{Q K^{\mathrm{T}}}{\sqrt{d_{k}}} + B\right) V \quad (5)$$
In Eq. (5), $\mathrm{Attention}(Q, K, V)$ represents the output of the attention; $Q$, $K$, and $V$ denote the matrices of queries, keys, and values, respectively; $Q K^{\mathrm{T}}$ represents the matrix multiplication of the query matrix and the transposed key matrix; $d_{k}$ represents the dimension of the key vectors; $B$ is a learnable bias term. To improve the adaptability of the model to changes in feature distribution during training and inference, a Switching Normalization (SN) layer is introduced in the study. The SN layer can adaptively normalize deep interactive features, thereby guiding the model to learn and generalize more effectively. The schematic diagram of the multi-image sample feature map is shown in Fig. 6.
Fig. 6. Schematic feature map of multiple image samples [see PDF for image]
In Fig. 6, each small square represents one channel of a feature map, and the entire cube represents the set of feature maps of all samples in a batch. The feature map of each sample consists of $C$ channels, which are unfolded along the height $H$ and width $W$. The process of normalizing the sample image pixels through a conventional normalization layer is given in Eq. (6).
$$\hat{x} = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta \quad (6)$$
In Eq. (6), $\hat{x}$ represents the normalized value; $x$ is the original value; $\mu$ is the mean; $\sigma^{2}$ is the variance; $\epsilon$ is a very small constant used to avoid division by zero and improve numerical stability; $\gamma$ and $\beta$ are learnable parameters representing the scaling and offset factors, respectively. To help the model reduce internal covariate shift during training, Eq. (7) is designed.
$$\mu_{k} = \frac{1}{\left| I_{k} \right|} \sum_{i \in I_{k}} x_{i}, \qquad \sigma_{k}^{2} = \frac{1}{\left| I_{k} \right|} \sum_{i \in I_{k}} \left( x_{i} - \mu_{k} \right)^{2} \quad (7)$$
In Eq. (7), $\mu_{k}$ is the mean of the $k$-th normalization layer or feature channel; $\left| I_{k} \right|$ is the total number of elements in the normalization layer; $\sigma_{k}^{2}$ represents the variance of the $k$-th normalization layer or feature channel; $I_{k}$ is the set of elements belonging to the $k$-th normalization layer.
$$\hat{x} = \gamma \cdot \frac{x - \sum_{k} w_{k}\, \mu_{k}}{\sqrt{\sum_{k} w_{k}^{\prime}\, \sigma_{k}^{2} + \epsilon}} + \beta \quad (8)$$
In Eq. (8), $\hat{x}$ represents the value after adaptive regularization processing; $\epsilon$ is a small constant used to avoid division by zero; $w_{k}$ is a weight parameter. The adaptive regularization layer introduces the weight parameters $w_{k}$ and $w_{k}^{\prime}$, allowing the model to dynamically adjust the normalization process according to the different normalization layers.
$$w_{k} = \frac{e^{\lambda_{k}}}{\sum_{z \in \{\mathrm{bn},\, \mathrm{in},\, \mathrm{ln}\}} e^{\lambda_{z}}}, \quad k \in \{\mathrm{bn}, \mathrm{in}, \mathrm{ln}\} \quad (9)$$
In Eq. (9), $e^{\lambda_{k}}$ is the exponential function of the control parameter $\lambda_{k}$ of the $k$-th normalization method; $\mathrm{in}$ denotes instance normalization, $\mathrm{ln}$ layer normalization, and $\mathrm{bn}$ batch normalization. In the online inference stage of eye control HCI interface design based on smooth tracking, the Switching Normalization (SN) layer is used for forward inference, where instance normalization and layer normalization calculate statistics separately for each sample, while batch normalization uses batch averages. The network and SN layer parameters are frozen when calculating the batch averages, and the statistics are estimated from randomly selected small batches of training data. The overall accuracy and F1 score are used as performance indicators to assess the interactive behavior recognition model, while the average inference time measures the computational efficiency of the model, ensuring the real-time performance and response speed of the eye control interaction interface. The designed model is named ALETT. The ALETT model has significant advantages in real-time performance, accuracy, and robustness. In terms of real-time performance, the ALETT model significantly improves the real-time performance of eye movement prediction by combining smooth tracking algorithms, reduces computational latency, and can adapt to real-time interaction requirements. In terms of accuracy, the ALETT model adopts improved feature extraction techniques, which significantly improve eye movement prediction accuracy in complex backgrounds compared with traditional convolutional neural network methods. In terms of robustness, the ALETT model performs better than other existing models on multiple complex datasets, exhibiting higher stability especially when data noise is high.
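For reference, a condensed sketch of the switching normalization layer summarized by Eqs. (6)–(9) is given below; it follows the standard switchable normalization formulation, and implementation details may differ from the layer used in ALETT.

```python
# Condensed Switching/Switchable Normalization sketch: per-channel statistics
# from instance, layer, and batch normalization are mixed with softmax weights.
# (Illustrative; the model's actual SN implementation may differ in details.)
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchNorm2d(nn.Module):
    def __init__(self, num_channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.mean_w = nn.Parameter(torch.ones(3))   # weights for IN, LN, BN means
        self.var_w = nn.Parameter(torch.ones(3))    # weights for IN, LN, BN variances

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        mean_in = x.mean((2, 3), keepdim=True)                  # instance-norm stats
        var_in = x.var((2, 3), keepdim=True, unbiased=False)
        mean_ln = mean_in.mean(1, keepdim=True)                 # layer-norm stats
        var_ln = (var_in + mean_in ** 2).mean(1, keepdim=True) - mean_ln ** 2
        mean_bn = mean_in.mean(0, keepdim=True)                 # batch-norm stats
        var_bn = (var_in + mean_in ** 2).mean(0, keepdim=True) - mean_bn ** 2

        wm = F.softmax(self.mean_w, dim=0)                      # Eq. (9)-style weights
        wv = F.softmax(self.var_w, dim=0)
        mean = wm[0] * mean_in + wm[1] * mean_ln + wm[2] * mean_bn
        var = wv[0] * var_in + wv[1] * var_ln + wv[2] * var_bn
        x_hat = (x - mean) / torch.sqrt(var + self.eps)         # Eq. (8)-style mixing
        return self.gamma * x_hat + self.beta


if __name__ == "__main__":
    feats = torch.randn(4, 32, 14, 14)
    print(SwitchNorm2d(32)(feats).shape)   # torch.Size([4, 32, 14, 14])
```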
Results
Performance evaluation and comparative analysis of ALETT model in eye control interaction
To prove the efficacy of the ALETT model designed in this study, four models were selected for comparison: DeepGaze, iTracker, Pupil Labs, and Salicon. For transparency and reproducibility, the exact baseline versions and datasets are specified as follows: “DeepGaze” refers to DeepGaze III evaluated with the official implementation [21]; “iTracker” denotes the architecture released with the GazeCapture dataset and the paper “Eye Tracking for Everyone” [23]; the “Pupil Labs” baseline uses the open Pupil Core pipeline for head-mounted eye tracking [24]; and the saliency baseline follows the SALICON training/evaluation protocol [25]. Public datasets are cited as research objects with complete metadata (creators, year, title, version/release, repository/source), and persistent identifiers are provided in the reference list; specifically, EYEDIAP serves as the RGB/RGB-D gaze estimation benchmark [26], and GazeCapture provides the large-scale mobile eye-tracking dataset aligned with the iTracker baseline [23]. All models were trained and tested in the same environment to ensure the fairness of the evaluation results. To verify the lightweight design and fast running speed of the ALETT model, the experiments were conducted with the following hardware configuration. The processor is an Intel Core i7-10700K (8 cores, 16 threads, 3.8 GHz base frequency, 5.1 GHz maximum turbo frequency), which offers high computing performance and can support complex computing tasks and large-scale dataset processing. The graphics card is an NVIDIA GeForce RTX 3070 (8 GB GDDR6 VRAM) based on the Ampere architecture, with strong parallel computing capability suitable for accelerating the training and inference of deep learning models. The memory is 32 GB DDR4 3200 MHz, ensuring that the model does not encounter memory bottlenecks during training on large datasets. The storage device is a 1 TB NVMe solid-state drive with high-speed read and write capability for efficient large-scale data processing. The operating system is Ubuntu 20.04 LTS, which supports the efficient running of deep learning frameworks such as TensorFlow and PyTorch. To ensure reproducibility, Adam was selected as the optimizer during training, with an initial learning rate of 0.001 decayed by a factor of 0.5 every 50 epochs. The batch size was set to 32, the model was trained for up to 1000 epochs, and an early stopping strategy was adopted, terminating training if the validation loss did not improve for 10 epochs. Each dataset was divided into training and validation sets with an 80%/20% split.
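The stated training configuration can be expressed as the following sketch; the model, dataset objects, and loss function are placeholders rather than the actual ALETT training code.

```python
# Training-loop sketch reflecting the reported hyperparameters; `model`,
# `dataset`, and the loss are placeholders, not the paper's code.
import torch
from torch.utils.data import DataLoader, random_split


def train(model, dataset, device="cpu"):
    model = model.to(device)
    n_train = int(0.8 * len(dataset))                        # 80%/20% split
    train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
    train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=32)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)
    criterion = torch.nn.MSELoss()                            # placeholder loss

    best_val, patience, bad_epochs = float("inf"), 10, 0
    for epoch in range(1000):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
        scheduler.step()                                      # decay lr by 0.5 every 50 epochs

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                           for x, y in val_loader) / len(val_loader)
        if val_loss < best_val:                               # early-stopping check
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return model
```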
Then, these models were tested on three datasets: EYEDIAP, DUT, and GazeCapture. The EYEDIAP dataset contains approximately 20 h of eye movement data covering a variety of complex gaze tasks and provides video, image, and eye movement coordinate data, making it suitable for eye control interaction and deep learning model training; its test set contains 2000 samples. The DUT dataset contains approximately 5000 annotated samples covering users of different age groups, genders, and cultural backgrounds. The training and testing set ratios are 80% and 20%, respectively, and the testing set is mainly used for evaluation in the experiments. The GazeCapture dataset contains over 200 h of eye tracking data covering users of different ages and genders and provides detailed eye tracking annotations; approximately 50,000 samples were selected, of which 10,000 were used for testing. In the study, these three datasets were used for different experimental settings and were independently trained and evaluated on to avoid cross-influence between datasets. Although these datasets are representative in eye control interaction research, they have limitations: in particular, the dataset size is relatively small, especially for EYEDIAP and DUT, and the sample size may not be sufficient to cover all variations in real scenarios. The specific results of the test are shown in Fig. 7.
Fig. 7. Comparison of predictive performance of eye control interaction models on three major datasets [see PDF for image]
In Fig. 7a, the ALETT model’s prediction results on the EYEDIAP dataset were closely aligned with the target values, with the lines almost coinciding with the diagonal, demonstrating high accuracy. The predictions of DeepGaze, iTracker, and Pupil Labs were close to the target values but deviated near extreme values, whereas ALETT’s predictions were more accurate. In Fig. 7b, on the DUT dataset, the predictions of the ALETT model were likewise consistent with the target values. Although the predictions of the other models were close to the diagonal, their distribution was not as good as that of ALETT, indicating that ALETT had higher accuracy and stability on the DUT dataset. In Fig. 7c, the ALETT model’s predictions on the GazeCapture dataset were highly consistent with the target values, demonstrating good generalization ability. The superiority of ALETT was particularly evident in regions where the target values varied significantly, as the other models showed larger deviations in these areas. The accuracy–recall performance comparison of the eye tracking models on the different datasets is shown in Fig. 8.
Fig. 8. Comparison of accuracy–recall performance of eye tracking models on different datasets [see PDF for image]
Figure 8 illustrates that the ALETT model exhibited superior performance compared to other models on the three primary datasets: EYEDIAP, DUT, and GazeCapture. In Fig. 8a, the ALETT curve consistently outperformed other methods in the EYEDIAP dataset, demonstrating a higher combination of accuracy and recall. In Fig. 8b, ALETT achieved an accuracy recall curve close to the ideal value on the DUT dataset, further demonstrating its robustness in complex scenarios. In Fig. 8c, in the GazeCapture dataset, when the recall rate of ALETT reached 0.9, its accuracy still remained at about 0.8, while other models showed a significant decrease in accuracy at the same recall rate. The performance comparison of ALETT model on different datasets is denoted in Table 1.
Table 1. Performance comparison of ALETT model on different datasets
| Dataset | Model | Area under curve (AUC) | Specificity | Mean absolute error (MAE) | Average inference time (ms) |
|---|---|---|---|---|---|
| EYEDIAP | ALETT | 0.934275 | 0.902384 | 0.065427 | 12.453 |
| EYEDIAP | DeepGaze | 0.811234 | 0.758920 | 0.081546 | 15.321 |
| EYEDIAP | iTracker | 0.845123 | 0.769560 | 0.074012 | 14.876 |
| EYEDIAP | Pupil Labs | 0.763678 | 0.734251 | 0.090287 | 16.034 |
| DUT | ALETT | 0.916732 | 0.894740 | 0.061879 | 11.789 |
| DUT | DeepGaze | 0.792134 | 0.738764 | 0.087124 | 15.034 |
| DUT | iTracker | 0.823456 | 0.785123 | 0.078432 | 14.542 |
| DUT | Pupil Labs | 0.756890 | 0.725890 | 0.085291 | 14.921 |
| GazeCapture | ALETT | 0.943800 | 0.910204 | 0.055992 | 10.756 |
| GazeCapture | DeepGaze | 0.765148 | 0.745657 | 0.088902 | 15.654 |
| GazeCapture | iTracker | 0.779472 | 0.762310 | 0.079832 | 15.109 |
| GazeCapture | Pupil Labs | 0.712305 | 0.701610 | 0.093205 | 16.203 |
In Table 1, on the EYEDIAP dataset, the ALETT model led with an AUC value of 0.934, a specificity of 0.902, and an MAE of 0.065, with an average inference time of 12.453 ms. The performance of DeepGaze, iTracker, and Pupil Labs decreased in that order, with Pupil Labs having the highest MAE of 0.090 and the longest inference time. On the DUT dataset, ALETT again led with an AUC value of 0.917, a specificity of 0.895, an MAE of 0.062, and an average inference time of 11.789 ms. The AUC values of DeepGaze and iTracker were 0.792 and 0.824, respectively, with Pupil Labs having the lowest AUC value of 0.757. On the GazeCapture dataset, ALETT achieved the best performance with an AUC value of 0.944, a specificity of 0.910, an MAE of 0.056, and an average inference time of 10.756 ms. The AUC values of DeepGaze, iTracker, and Pupil Labs were 0.765, 0.779, and 0.712, respectively, with Pupil Labs having the highest MAE of 0.093.
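For reference, the metrics reported in Table 1 can be computed as in the sketch below using NumPy and scikit-learn utilities; the rule used to binarize gaze outcomes for AUC and specificity is an assumption about the evaluation protocol, not the paper's stated procedure.

```python
# Illustrative computation of the Table 1 metrics; the binarization rule used
# to derive AUC and specificity from gaze outcomes is an assumed protocol.
import time
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix


def evaluate(scores, labels, pred, target, model=None, sample=None):
    """scores: predicted hit probabilities; labels: 0/1 ground-truth hits;
    pred/target: predicted and true gaze coordinates, shape (N, 2)."""
    auc = roc_auc_score(labels, scores)
    preds = (np.asarray(scores) >= 0.5).astype(int)
    tn, fp, fn, tp = confusion_matrix(labels, preds).ravel()
    specificity = tn / (tn + fp)
    mae = np.abs(np.asarray(pred) - np.asarray(target)).mean()

    latency_ms = None
    if model is not None and sample is not None:             # average inference time
        start = time.perf_counter()
        for _ in range(100):
            model(sample)
        latency_ms = (time.perf_counter() - start) / 100 * 1000
    return auc, specificity, mae, latency_ms
```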
Performance evaluation and application scenario analysis of eye control interaction system
Research needs to evaluate the performance of eye control interaction systems in various dynamic environments, as well as their application effects in different user scenarios. Firstly, by analyzing five typical motion modes (sinusoidal motion, slope motion, stop slope motion, rapid direction change, and continuous motion), the response time, accuracy, average deviation, tracking stability, and error rate of the eye control interaction system in dynamic scenes were studied, as shown in Table 2.
Table 2. Eye control interaction performance under different motion paradigms
Motion paradigm | Response time (ms) | Accuracy (%) | Mean deviation | Tracking stability (sd) | Error rate (%) |
|---|---|---|---|---|---|
Sinusoidal motion | 150.234 | 94.672 | 0.045 | 1.32 | 5.23 |
Slope motion | 162.107 | 90.342 | 0.062 | 2.10 | 6.75 |
Stop slope motion | 145.756 | 92.156 | 0.038 | 1.18 | 4.82 |
Rapid direction change | 158.293 | 88.564 | 0.086 | 2.75 | 8.16 |
Continuous motion | 148.470 | 91.034 | 0.055 | 1.90 | 5.85 |
Table 2 evaluates the response speed of the system and the accuracy of user action tracking for each mode. The accuracy of sinusoidal motion reached 94.672%, with a response time of 150.234 ms and an average deviation of 0.045, demonstrating excellent performance. In contrast, the accuracy of rapid direction changes was relatively low, at 88.564%, with an error rate of 8.16%, indicating that there were certain challenges in tracking performance during rapid direction changes. Overall, the results showed that the ALETT system performed stably in various situations and could effectively support eye control operations in HCI. At the same time, to further explore the performance of eye control systems in practical applications, five typical user scenarios were selected: online shopping, game interaction, education and training, health monitoring, and virtual reality. Quantitative analysis was conducted on the system’s access time, data transmission rate, visual input frequency, decision latency, and number of interaction elements, as denoted in Table 3.
Table 3. Application analysis of eye control interaction system in different user scenarios
User scenario | Access time (seconds) | Data transmission rate (kb/s) | Visual input frequency (Hz) | Decision latency (ms) | Number of interactive elements |
|---|---|---|---|---|---|
Online shopping | 120.456 | 500.25 | 60.75 | 45.23 | 14 |
Gaming interaction | 150.310 | 750.50 | 75.00 | 38.15 | 20 |
Educational training | 180.703 | 400.10 | 50.50 | 52.67 | 10 |
Health monitoring | 210.487 | 650.75 | 65.00 | 40.55 | 8 |
Virtual reality | 300.845 | 950.60 | 90.25 | 35.10 | 25 |
Table 3 lists five user scenarios: online shopping, game interaction, education and training, health monitoring, and virtual reality. In the online shopping scenario, the user access time was 120.456 s, the data transmission rate was 500.25 kb/s, the visual input frequency reached 60.75 Hz, the decision latency was 45.23 ms, and 14 interactive elements were involved. In the virtual reality scenario, the user access time increased significantly to 300.845 s, the data transmission rate was as high as 950.60 kb/s, the visual input frequency was 90.25 Hz, the decision latency was 35.10 ms, and the number of interactive elements was 25, indicating a higher demand on eye control technology in virtual reality applications. The study set up seven different scenarios for testing: scenario 1 a dynamic environment, scenario 2 a semi-dynamic environment, scenario 3 a complex background, scenario 4 multi-user interaction, scenario 5 mixed reality, scenario 6 virtual reality, and scenario 7 a static environment. The comparison of the different techniques’ performance in these user scenarios is shown in Fig. 9.
Fig. 9. Comparison of technical performance under different user scenarios [see PDF for image]
In Fig. 9a, ALETT exhibited relatively high recognition rates in multiple user scenarios, especially in the second and fifth scenarios, which were significantly better than other technologies. The recognition rates of Pupil Labs and iTracker had significant fluctuations. In Fig. 9b, ALETT exhibited faster response time. In contrast, Pupil Labs and iTracker had slower response times in certain scenarios, which may affect the user experience, especially in applications that require quick feedback.
Conclusion
To address the challenges in designing eye control human–computer interaction interfaces, the adaptive lightweight eye control tracking (ALETT) model is proposed. This model aims to improve the accuracy and robustness of eye tracking through innovative design, especially under complex environmental conditions. The innovation of the ALETT model lies in its integration of core technologies such as deep learning and multi-sensor fusion, enabling it to provide stable performance in various dynamic scenarios. Through experimental verification on multiple public datasets, the ALETT model has demonstrated good prediction accuracy and low inference latency, proving its effectiveness in complex environments. However, the study also found that the model’s performance deteriorates under extreme lighting conditions or when users wear glasses, and its generalization ability across different scenarios still needs further validation. Overall, the ALETT model provides a new solution for the field of eye controlled human–computer interaction and demonstrates its potential in multiple application scenarios. Future work will focus on optimizing the model to improve its robustness in complex environments and on exploring its applicability in practical applications such as virtual reality and remote collaboration.
Acknowledgements
N/A.
Author contributions
Chuanfeng Ding provided the concept, designed the experiment and wrote the manuscript.
Funding
There is no funding in this manuscript.
Data availability
All data generated or analyzed during this study are included in this manuscript.
Declarations
Ethics approval and consent to participate
N/A.
Consent for publication
N/A.
Competing Interests
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Finocchiaro, M; Banfi, T; Donaire, S; Arezzo, A; Guarner-Argente, C; Menciassi, A; Hernansanz, A. A framework for the evaluation of human machine interfaces of robot-assisted colonoscopy. IEEE Trans Biomed Eng; 2023; 71,
2. Chen, Z; Tu, H; Wu, H. User-defined foot gestures for eyes-free interaction in smart shower rooms. Int J Human-Comput Interact; 2023; 39,
3. Upasani, S; Srinivasan, D; Zhu, Q; Du, J; Leonessa, A. Eye-tracking in physical human–robot interaction: mental workload and performance prediction. Hum Factors; 2024; 66,
4. Su, T; Ding, Z; Cui, L; Bu, L. System development and evaluation of human–computer interaction approach for assessing functional impairment for people with mild cognitive impairment: a pilot study. Int J Human-Comput Interact; 2024; 40,
5. Hendry M F, Kottmann N, Fröhlich M, Bruggisser F, Quandt M, Speziali S, Salter C. Are you talking to me? A case study in emotional human-machine interaction//Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. 2023, 19(1): 417-424.
6. Zeng, Z; Neuer, ES; Roetting, M; Siebert, FW. A one-point calibration design for hybrid eye typing interface. Int J Human-Comput Interact; 2023; 39,
7. Keskin, M; Kettunen, P. Potential of eye-tracking for interactive geovisual exploration aided by machine learning. Int J Cartogr; 2023; 9,
8. Meng, X; Yu, Z; Chen, X; Huang, Z; Dong, C; Meng, F. Online running-gait generation for bipedal robots with smooth state switching and accurate speed tracking. Biomimetics; 2023; 8,
9. Niu, Y; Li, X; Yang, W; Xue, C; Peng, N; Jin, T. Smooth pursuit study on an eye-control system for continuous variable adjustment tasks. Int J Human-Comput Int; 2023; 39,
10. Xu, J; Li, X; Chang, H; Zhao, B; Tan, X; Yang, Y et al. Electrooculography and tactile perception collaborative interface for 3D human–machine interaction. ACS Nano; 2022; 16,
11. Ma, GR; He, JX; Chen, CH; Niu, YF; Zhang, L; Zhou, TY. Trigger motion and interface optimization of an eye-controlled human-computer interaction system based on voluntary eye blinks. Human-Comput Interact; 2024; 39,
12. Ban, S; Lee, YJ; Kwon, S; Kim, YS; Chang, JW; Kim, JH; Yeo, WH. Soft wireless headband bioelectronics and electrooculography for persistent human–machine interfaces. ACS Appl Electron Mater; 2023; 5,
13. Novák, JŠ; Masner, J; Benda, P; Šimek, P; Merunka, V. Eye tracking, usability, and user experience: a systematic review. Int J Human-Comput Interact; 2024; 40,
14. Bentvelzen M, Woźniak P W, Herbes P S F, Stefanidi E, & Niess J. Revisiting reflection in hci: Four design resources for technologies that support reflection. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2022, 6(1): 1-27.
15. Han, K; Wang, Y; Chen, H; Chen, X; Guo, J; Liu, Z et al. A survey on vision transformer. IEEE Trans Pattern Anal Mach Intell; 2022; 45,
16. Xia, R; Chen, Y; Ren, B. Improved anti-occlusion object tracking algorithm using Unscented Rauch-Tung-Striebel smoother and kernel correlation filter. J King Saud Univ-Comput Inf Sci; 2022; 34,
17. Park, J; Lee, Y; Cho, S; Choe, A; Yeom, J; Ro, YG; Ko, H. Soft sensors and actuators for wearable human-machine interfaces. Chem Rev; 2024; 124,
18. Dornelas, RS; Lima, DA. Correlation filters in machine learning algorithms to select demographic and individual features for autism spectrum disorder diagnosis. J Data Sci Intell Syst; 2023; 3,
19. Wang, Z; Gao, F; Zhao, Y; Yin, Y; Wang, L. Improved A* algorithm and model predictive control-based path planning and tracking framework for hexapod robots. Ind Robot: Int J Robot Res Appl; 2023; 50,
20. Shi Y, Gao T, Jiao X, Cao N. Understanding design collaboration between designers and artificial intelligence: A systematic literature review. Proceedings of the ACM on Human-Computer Interaction, 2023, 7(CSCW2): 1-35.
21. Kümmerer, M; Bethge, M; Wallis, TSA. Deepgaze III: modeling free-viewing human scanpaths with deep learning. J Vis; 2022; 22,
22. Linardos A, Kümmerer M, Press O, Bethge M. DeepGaze IIE: Calibrated Prediction in and Out-of-Domain for State-of-the-Art Saliency Modeling//Proc. ICCV. 2021.
23. Krafka, K; Khosla, A; Kellnhofer, P; Kannan, H; Bhandarkar, S; Matusik, W; Torralba, A. Eye Tracking for Everyone//Proc CVPR; 2016; [DOI: https://dx.doi.org/10.1109/CVPR.2016.239]
24. Kassner P, Patera W, Bulling A. Pupil: An Open Source Platform for Pervasive Eye Tracking and Mobile Gaze-based Interaction//Adjunct Proc. UbiComp. 2014.
25. Jiang M, Huang S, Duan J, Zhao Q. SALICON: Saliency in Context//Proc. CVPR. 2015: 1072–1080. https://doi.org/10.1109/CVPR.2015.7298710.
26. Funes Mora, KA; Monay, F; Odobez, J-M. EYEDIAP: A database for the development and evaluation of gaze estimation algorithms from RGB and RGB-D cameras//Proc. ETRA; 2014; [DOI: https://dx.doi.org/10.1145/2578153.2578190]
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”).