Soccer is a popular sport, and there is a growing need for automated analysis of soccer videos, for which the detection and tracking of players is an indispensable prerequisite. In this paper, we first introduce and classify multi-object tracking and then present two widely used multi-object tracking methods, DeepSort and TrackFormer. When multi-object tracking is applied to soccer scenarios, some preprocessing and post-processing are generally performed: preprocessing covers operations on the video, such as stitching and background removal, and post-processing covers further applications, such as mapping players onto a 2D pitch. We apply the two methods directly to a real scene and then train TrackFormer to obtain further results. Finally, to assist researchers who are interested in multi-object tracking and in player tracking, recent advances in preprocessing and processing methods for soccer player tracking are presented and future research directions are suggested.
Introduction
As the world’s most popular sport, soccer attracts a multitude of enthusiasts, including spectators, players, coaches, commentators, and data analysts, and it also draws computer vision researchers to soccer video analysis. Soccer video analysis covers a wide range of tasks, including pitch and white line detection, background removal, ball and player detection, ball and player tracking, player jersey recognition, player movement analysis, goal analysis, running distance and speed estimation, highlight retrieval and indexing, video summarization, and technical and tactical analysis. These tasks fall into different semantic levels, as shown in Fig. 1; the most important of them is player detection and tracking, which sits at the middle level.
Traditionally, the detection and tracking of soccer players relied on specialized cameras or intelligent sensors equipped with thermal imaging technology. With the advancements in computer vision and deep learning, analyzing soccer match videos has become a more accessible way of obtaining tracking data. Nevertheless, owing to the expansive playing field and the large number of players in soccer games, player detection and tracking remains a formidable challenge.
In this paper, work related to multi-object tracking is reviewed first, and then the multi-object tracking methods for soccer players proposed within the last decade are surveyed. We also briefly present our own comparison of these methods.
Related works of multi-object tracking
Multi-object tracking, or MOT, encompasses the task of detecting and distinguishing multiple objects within a video sequence, spanning categories like humans, animals, and vehicles. This process involves assigning unique identifiers or distinct colors to these objects, enabling the prediction of their trajectories and continuous tracking. In essence, multi-object tracking entails detecting objects as they appear and using object IDs to represent their respective paths. This technology finds extensive applications in fields such as autonomous driving, pedestrian recognition, and sports. This paper focuses specifically on the application of multi-object tracking to soccer.
Fig. 1 [Images not available. See PDF.]
Different semantic levels
Multi-object tracking is beset with formidable challenges. It inherits the complexities of single-object tracking and compounds them because of its “multi” nature. In single-object tracking, challenges include object occlusion, blurring due to rapid object motion, camera movement, and changes in lighting conditions. Multi-object tracking introduces additional complexities, including mutual occlusion among objects, the potential loss of object trajectories, and the interference arising from similarities between objects.
Classification of multi-object tracking
Multi-object tracking has undergone rapid advancements in recent years, resulting in a diverse array of methods that are difficult to categorize under a single scheme. Some studies, such as [1, 2], have attempted to classify multi-object tracking into three broad categories: detection-based tracking, regression-based tracking, and attention-based tracking. Detection-based tracking performs object detection in two adjacent frames, treats the detected objects as distinct nodes, and matches them between the frames through graph optimization algorithms. In regression-based tracking, object detection occurs only in the initial frame, and the results from this frame are directly mapped to the subsequent frame. A regression head is then applied to adjust the bounding box. Attention-based tracking leverages attention mechanisms from computer vision: once an object is detected, information about it is passed directly between frames, so that tracking is achieved through the attention mechanism itself [1].
For a more comprehensive categorization, multi-object tracking can be further subdivided into point tracking, contour tracking, template tracking, and graph-based tracking, among others [3, 4]. Here, we look at point tracking, kernel tracking, and silhouette tracking in more detail: point tracking primarily relies on the Kalman filter and the particle filter; kernel tracking has two principal methods, the MeanShift method and support vector machine (SVM) tracking; and silhouette tracking encompasses contour tracking and shape matching as its core methods. Section 2.2.1 will provide a thorough introduction to the Kalman filter and the particle filter, while the remaining methods are elaborated upon below.
As a kernel tracking method, the MeanShift method [5–9] has the following fundamental stages: (i) Kernel Density Estimation. To initiate the process, a data point is chosen from the input data as a seed point. A window is then delineated around this seed point, typically a multi-dimensional spherical window centered on it. Within this window, a kernel density estimate is calculated, which characterizes the concentration of data points in the vicinity of the seed point and is usually computed with a kernel function such as a Gaussian kernel. (ii) Mean Shift. This step computes a weighted average of all data points within the window, with weights determined by the kernel. The difference between this weighted average and the current seed point is the mean shift vector, which points in the direction of increasing data density in the feature space. (iii) Update Seed Points. The seed point is moved to the weighted average calculated in the previous step, and the preceding steps are repeated. This iterative process continues until the mean shift vector becomes negligibly small or the maximum number of iterations is reached. (iv) Clustering or Target Tracking. During the MeanShift operation, data points are grouped around the regions of maximum density, thereby forming distinct clusters. When employed for target tracking, the seed point typically denotes the target’s position, so the MeanShift process continually updates the target’s position in each frame.
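The iteration described above can be sketched in a few lines. The following is a minimal illustration, assuming 2D points, a Gaussian kernel applied to all points (rather than a hard spherical window), and hypothetical values for the bandwidth, tolerance, and sample data; the function name `mean_shift_mode` is ours and does not come from any cited work.

```python
import math

def mean_shift_mode(points, seed, bandwidth=1.0, tol=1e-6, max_iter=100):
    """Move `seed` toward a local density maximum of `points` (2D tuples)."""
    x, y = seed
    for _ in range(max_iter):
        wsum = wx = wy = 0.0
        for px, py in points:
            d2 = (px - x) ** 2 + (py - y) ** 2
            w = math.exp(-d2 / (2.0 * bandwidth ** 2))  # Gaussian kernel weight
            wsum += w
            wx += w * px
            wy += w * py
        new_x, new_y = wx / wsum, wy / wsum       # kernel-weighted mean
        shift = math.hypot(new_x - x, new_y - y)  # length of the mean shift vector
        x, y = new_x, new_y                       # update the seed point
        if shift < tol:                           # stop when the shift is negligible
            break
    return x, y

# Two hypothetical clusters; the seed starts closer to the first one.
cluster_a = [(0.0, 0.0), (0.2, -0.1), (-0.1, 0.1), (0.1, 0.2)]
cluster_b = [(5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
mode = mean_shift_mode(cluster_a + cluster_b, seed=(1.0, 1.0))
```

In a tracking setting, the points would be replaced by pixel locations weighted by their similarity to the target model, and the returned mode would serve as the target position in the current frame.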
As an alternative kernel tracking method, the support vector machine (SVM) [10–13] is primarily a classifier. It is trained on a dataset in which positive and negative instances are clearly distinguished, and it delivers robust solutions even when confronted with noisy data. Positive samples correspond to the objects of interest that require tracking, while negative samples encompass everything else that is not the subject of tracking.
In the context of silhouette tracking, contour tracking [14–16] can be achieved through two distinct approaches. The first uses a state space model that characterizes both the motion and the shape of the contour. The second evolves the contour directly by minimizing a contour energy with techniques such as gradient descent. Contour tracking accommodates a diverse range of object shapes.
The other form of silhouette tracking, shape matching [17–19], closely resembles the template-based tracking used in kernel methods. This technique concentrates on tracking a single object by identifying a matching shape between consecutive frames.
In recent years, the prevailing categorization divides multi-object tracking methods into tracking by detection and joint detection and tracking. We examine the methods within these two categories and delineate the distinctions between them.
Tracking by detection
The first category of multi-object tracking methods is commonly referred to as tracking by detection (TBD). With the rapid advancement of object detection, TBD has swiftly emerged as the predominant framework for multi-object tracking, capturing the interest of numerous researchers. The tracking-by-detection approach, as its name implies, runs an object detector on each frame of a video sequence, so that all objects within the frame are identified. The multi-object tracking problem then becomes an object association problem [20–23] between two adjacent frames, which draws on various types of features. The general workflow of this method is illustrated in Fig. 2a. The key components of this approach are the feature extraction technique and the data association strategy. Features may encompass motion characteristics, appearance attributes, hybrid features, etc., each demanding a feature extraction method tailored to the specific application scenario. Similarly, different data association strategies can yield different results.
Fig. 2 [Images not available. See PDF.]
Two ways for multi-object tracking: a Tracking by detection with two stages. b Joint detection and tracking with only one stage
Regarding feature extraction, there are three potential avenues for enhancement as outlined below: (i) Improving Motion Features. This pertains to the refinement of motion features, particularly focusing on the enhancement of the widely-used Kalman filter [24, 25] in multi-object tracking. (ii) Enhancing Appearance Features. The goal here is to elevate the quality of appearance features by improving the network responsible for feature extraction and optimizing the effectiveness of this process [26, 27]. (iii) Fusion of Other Features. This involves enhancing the fusion of additional features, such as leveraging topological relationships within the spatial context [28–31].
As for data association strategies, there are two ways for improvement: (i) Optimizing Feature Representation. The representation of features can be optimized by improving the combination of features or refining the association matrix [1, 2, 24, 25]. (ii) Optimizing the Matching Process. This approach focuses on improving the algorithm’s flow to address challenges such as occlusion. It relies on various techniques for processing low-confidence detections, which enhance the overall effectiveness of the matching process.
The improvement directions described above are summarized in Table 1.
Table 1. Tracking-by-detection improvement
| Improvement direction | Improvement method |
|---|---|
| Motion features | Kalman filter |
| Appearance features | Extraction network |
| Fusion features | Spatial topology |
| Feature representation | Feature combination |
| Data association | Low-confidence detections |
Joint detection and tracking
The second category of multi-object tracking methods, known as joint detection and tracking (JDT), distinguishes itself by seamlessly integrating object detection and tracking processes. In JDT, there is no need for separate object detection; instead, it performs both detection and tracking simultaneously. This is accomplished through a well-designed network capable of delivering both object detection results and trajectory information concurrently. Subsequently, data association is applied, with some frameworks even incorporating trajectory tracking as part of the process. The results are then fed back into the network to inform trajectory updates for the subsequent frame [1, 32, 33].
The overall workflow of this approach is depicted in Fig. 2b. The primary emphasis in this category of multi-object tracking methods lies in the design of the integrated network structure. A well-crafted network structure can yield comparable or even superior results compared to tracking-by-detection approaches. Furthermore, meticulous attention to data association remains a critical aspect of this methodology.
Multi-object tracking datasets and methods
Currently, multi-object tracking predominantly relies on the MOT (multiple object tracking) series of datasets. Over the years, the MOT datasets have undergone various iterations, including MOT15, MOT16, MOT17, and MOT20. These datasets primarily center around the challenging task of pedestrian tracking in dense scenarios. For instance, MOT17 comprises 7 videos in the training set and 7 videos in the test set. The MOT17 dataset has become a widely recognized benchmark for evaluation, with certain methods demonstrating superior performance on it [34–38]. Another noteworthy series in this domain is the VisDrone datasets, a vital source of data for UAV (unmanned aerial vehicle) vision research.
In this section, we build upon the categorization outlined in Sects. 2.1.1 (tracking by detection) and 2.1.2 (joint detection and tracking). Specifically, we focus on one prominent representative of each category: DeepSort and TrackFormer.
DeepSort
In this research, we used a test video from Shanghai Sport University, shot from a fixed viewing angle. Two separate videos were captured, each covering one half of the field: the left and right halves. These videos were then transformed and stitched together to create a full-field video, which subsequently underwent detection and tracking. The perspective transformation and stitching procedure is shown in Fig. 3, and Fig. 4 illustrates the camera positions. We conducted experiments using two different approaches. The first approach combined YOLOv5 and DeepSort for object detection and tracking, while the second approach used DETR and TrackFormer for the same purpose. It is important to note that the latter approach was trained using a novel dataset. The outcomes of detecting targets with YOLOv5 and subsequently tracking them using the DeepSort algorithm are depicted in Fig. 5.
Fig. 3 [Images not available. See PDF.]
Perspective transformation and splicing: select the four corners of the left and right halves of the two fixed-view videos, convert the viewpoints, and finally splice them into a full-field video
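A viewpoint conversion from four selected corners amounts to estimating a planar homography. The sketch below is only an illustration of that idea under our own assumptions: the pixel coordinates are made up, the target rectangle of 52.5 by 68 units is a hypothetical half-pitch, and the helper names `solve_linear`, `homography_from_corners`, and `warp_point` are ours. It is not the code used in the experiments.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def homography_from_corners(src, dst):
    """Return the 3x3 homography mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per correspondence, with h33 fixed to 1.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]

def warp_point(h, x, y):
    """Apply the homography to one point (with the projective division)."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)

# Hypothetical image corners of one half of the field and their target rectangle.
src = [(100, 200), (500, 200), (600, 400), (0, 400)]
dst = [(0.0, 0.0), (52.5, 0.0), (52.5, 68.0), (0.0, 68.0)]
H = homography_from_corners(src, dst)
```

Once each half has been warped into a common top-down coordinate frame in this way, the two halves can be placed side by side to form the full-field view; the same kind of mapping can also project player positions onto a 2D pitch.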
Fig. 4 [Images not available. See PDF.]
Viewpoint of cameras
Fig. 5 [Images not available. See PDF.]
Performances of DeepSort on challenging sequence
The DeepSort algorithm, an enhanced iteration of the Sort algorithm, represents a significant advancement in multi-object tracking [39]. At the core of the Sort algorithm lie the Kalman filter and the Hungarian algorithm. The fundamental concept behind the Kalman filter is to manage uncertainty by modeling the system’s state as a probability distribution [40–43]. It assumes a linear system, Gaussian-distributed noise, and the Markov property, wherein the present state depends only on the previous state and the control inputs.
The Kalman filter process encompasses these key steps: (i) State Prediction (Predict). The state for the next time step is forecast from the system’s dynamic equations and the preceding state estimate. This prediction is accompanied by an associated uncertainty. (ii) Update. The state prediction is corrected using measurements. The Kalman filter adjusts the state estimate by weighing the uncertainty of the state prediction against that of the measurements, giving greater weight to whichever is less uncertain. (iii) Kalman Gain Calculation. The Kalman gain is the parameter that balances the trade-off between the state prediction and the measurement. Its value depends on the uncertainty of the prediction, which follows from the system’s dynamics, and on the characteristics of the measurement errors. (iv) State Estimate Update. Leveraging the measurements and the Kalman gain, the final state estimate is computed, and the uncertainty associated with the state estimate is also updated.
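The predict and update steps can be illustrated with a small constant-velocity filter for a single coordinate. This is a minimal sketch under our own assumptions: the state is (position, velocity), only the position is measured, and the noise parameters `q` and `r`, the initial uncertainty `p0`, and the measurement values are hypothetical. DeepSort itself tracks a higher-dimensional bounding-box state, which is not reproduced here.

```python
class ConstantVelocityKalman1D:
    """Kalman filter with state [position, velocity] and position-only measurements."""

    def __init__(self, position, velocity=0.0, p0=10.0, q=0.01, r=1.0, dt=1.0):
        self.x = [position, velocity]        # state estimate
        self.P = [[p0, 0.0], [0.0, p0]]      # state covariance (uncertainty)
        self.q, self.r, self.dt = q, r, dt   # process noise, measurement noise, time step

    def predict(self):
        # x' = F x and P' = F P F^T + Q, with F = [[1, dt], [0, 1]] and Q = q * I.
        dt = self.dt
        (a, b), (c, d) = self.P
        self.x = [self.x[0] + dt * self.x[1], self.x[1]]
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]

    def update(self, z):
        # Measurement model H = [1, 0]: only the position is observed.
        (a, b), (c, d) = self.P
        innovation = z - self.x[0]
        s = a + self.r                       # innovation covariance
        k0, k1 = a / s, c / s                # Kalman gain
        self.x = [self.x[0] + k0 * innovation, self.x[1] + k1 * innovation]
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]  # P = (I - K H) P'
        return k0

# Hypothetical noisy positions of an object moving about 2 units per frame.
measurements = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9, 12.0, 14.1]
kf = ConstantVelocityKalman1D(position=measurements[0])
gains = []
for z in measurements[1:]:
    kf.predict()
    gains.append(kf.update(z))
```

The position gain `k0` lies between 0 and 1: it is close to 1 when the prediction is much more uncertain than the measurement, and it shrinks as the filter becomes more confident in its own prediction.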
The Kalman filter can be utilized in many applications in computer vision, particularly for tasks such as object tracking and pose estimation. Its primary advantage lies in its capacity to handle measurement noise and dynamic variations, making it valuable for tracking moving objects or estimating parameters like position, velocity, and orientation. However, the Kalman filter assumes linearity and Gaussian-distributed noise, which may not be well suited for soccer scenarios involving nonlinear dynamics or non-Gaussian noise. In such cases, variants like the extended Kalman filter (EKF) or unscented Kalman filter (UKF) may prove more suitable for nonlinear dynamics, although they still approximate the state distribution as Gaussian. The Kalman filter falls under the category of point tracking mentioned in Sect. 2.1; the other point tracking method is the particle filter, which also copes with strongly non-Gaussian conditions.
Particle filter, also referred to as the Monte Carlo filter, represents a robust Bayesian filtering method employed for estimating the state of nonlinear, non-Gaussian systems. This versatile technique enables the handling of intricate nonlinear dynamics and non-Gaussian noise by employing a collection of random samples, termed particles, to represent the probability distribution of the system’s state. Particle filters excel in estimating states that come with inherent uncertainty, making them extensively applicable across various domains, including target tracking, autonomous navigation, robot control, and computer vision.
The fundamental principles and steps involved in particle filtering can be summarized as follows: (i) Initialization. The process commences with the initialization of a set of particles that represent plausible values for the system’s state. Typically, these particles are generated through random sampling from a prior probability distribution. (ii) State Prediction. Each particle is used to predict the system’s state in the subsequent time step, based on the dynamic model of the system. This prediction involves passing the particle through the system’s state transition equation. Since each particle may represent a distinct state hypothesis, a range of state predictions is generated. (iii) Weight Update. Each particle’s prediction is assigned a weight that reflects its agreement with the measured data. Weights are typically computed using a measurement model (observation equation) that compares each particle’s prediction with the actual measurement. These weights signify the relative significance of each particle in the estimation. (iv) Resampling. Particles with higher weights are more likely to be retained for the subsequent time step, while particles with lower weights tend to be pruned. This resampling step concentrates the particles in the likely regions of the state space and reduces estimation errors. (v) State Estimation. The final system state is estimated, often as a weighted average of the particles, in which the particle weights dictate the contribution of each particle to the estimate.
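The five steps can be sketched for a one-dimensional state as follows. This is an illustrative example under our own assumptions: the object moves with a known constant velocity, the measurement likelihood is Gaussian, and all parameter values, the prior range, and the measurements are hypothetical.

```python
import math
import random

def particle_filter_1d(measurements, n_particles=500, velocity=1.0,
                       process_std=0.3, meas_std=0.5, seed=0):
    rng = random.Random(seed)
    # (i) Initialization: sample particles from a broad uniform prior.
    particles = [rng.uniform(-5.0, 5.0) for _ in range(n_particles)]
    estimates = []
    for z in measurements:
        # (ii) State prediction: pass each particle through the motion model.
        particles = [p + velocity + rng.gauss(0.0, process_std) for p in particles]
        # (iii) Weight update: Gaussian likelihood of the measurement.
        weights = [math.exp(-0.5 * ((z - p) / meas_std) ** 2) for p in particles]
        total = sum(weights)
        if total == 0.0:  # all particles far from z: fall back to uniform weights
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [w / total for w in weights]
        # (iv) Resampling: draw particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        # (v) State estimation: after resampling all weights are equal,
        # so the weighted average reduces to the plain mean.
        estimates.append(sum(particles) / n_particles)
    return estimates

# Hypothetical noisy positions of an object moving one unit per frame.
measurements = [1.1, 1.9, 3.2, 4.0, 4.8, 6.1, 7.0, 8.2, 8.9, 10.1]
estimates = particle_filter_1d(measurements)
```

Because the state distribution is represented by samples, the motion and measurement models in steps (ii) and (iii) could be replaced by nonlinear or non-Gaussian ones without changing the structure of the loop.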
The key advantage of particle filtering is its inherent capability to handle nonlinear systems and non-Gaussian noise. It achieves this by characterizing the state distribution through random samples, bypassing the need for linear assumptions or Gaussian distributions. In addition, several essential concepts that appear in the following flow diagrams are briefly explained:
Detector: This component identifies objects within a video sequence.
Tracker: The tracker establishes object associations across frames.
Cascade Matching: This technique combines multiple matching stages progressively to refine matching outcomes. Cascading multiple stages enhances matching accuracy and relevance.
IoU Matching: IoU (Intersection over Union) quantifies the degree of overlap between two regions and is a common metric for assessing object detection and image segmentation algorithms against actual annotations. In tracking, IoU matching associates the detections of the current frame with the predicted boxes of existing tracks according to this overlap.
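IoU matching can be sketched as an assignment problem between track boxes and detection boxes. The example below is a simplified illustration: boxes are assumed to be given as (x1, y1, x2, y2), the threshold `min_iou` is a hypothetical value, and the optimal assignment is found by brute force over permutations, which stands in for the Hungarian algorithm and is only practical for a handful of boxes.

```python
from itertools import permutations

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_match(tracks, detections, min_iou=0.3):
    """Return (matches, unmatched_track_indices, unmatched_detection_indices)."""
    n, m = len(tracks), len(detections)
    scores = [[iou(t, d) for d in detections] for t in tracks]
    # Enumerate every one-to-one assignment and keep the one with maximal total IoU.
    if n <= m:
        candidates = ([(i, p[i]) for i in range(n)] for p in permutations(range(m), n))
    else:
        candidates = ([(p[j], j) for j in range(m)] for p in permutations(range(n), m))
    best_pairs, best_total = [], -1.0
    for pairs in candidates:
        total = sum(scores[i][j] for i, j in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    # Reject assigned pairs whose overlap is too small.
    matches = [(i, j) for i, j in best_pairs if scores[i][j] >= min_iou]
    matched_t = {i for i, _ in matches}
    matched_d = {j for _, j in matches}
    return (matches,
            [i for i in range(n) if i not in matched_t],
            [j for j in range(m) if j not in matched_d])

# Hypothetical predicted track boxes and current-frame detections.
tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 19, 31, 29), (1, 1, 11, 11), (50, 50, 60, 60)]
matches, lost_tracks, new_detections = iou_match(tracks, detections)
```

Here the third detection overlaps no track and would start a new track, while a track left without a detection would be counted as missed in that frame.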
Fig. 6 [Images not available. See PDF.]
Flow of DeepSort algorithm
Fig. 7 [Images not available. See PDF.]
Flow of TrackFormer algorithm
Here, we provide a concise overview of the DeepSort process, as illustrated in Fig. 6. The DeepSort algorithm unfolds as follows: (i) In the first frame, the detector generates detection results and the tracker initializes a track for each of them. At this stage, no track is in the confirmed state. (ii) In the second frame, the detector produces new detection boxes. Since no track has yet been matched in three consecutive frames, none has transitioned to the confirmed state, so only IoU matching is executed in the second frame. A successful match between the detection boxes of the two consecutive frames triggers a Kalman filter update for the matched track. New tracks are created for the unmatched detection boxes, and unmatched tracks are removed. (iii) In the third frame, new detection boxes arrive, but no track has yet achieved confirmed status. The tracks established in the second frame are subjected to IoU matching with the detections from the third frame. Unmatched tracks are deleted since they remain unconfirmed. (iv) From the fourth frame onward, tracks that have been matched over three frames are promoted to the confirmed state. Once a track is confirmed, it takes part in cascade matching. Cascade matching takes precedence, and the detections it leaves unmatched are then matched based on IoU. Only objects in the confirmed state are displayed in the video frames.
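The confirmation logic in these steps can be viewed as a small state machine per track. The sketch below is a hypothetical illustration: the class and method names are ours, `n_init=3` follows the three-frame rule described above, and `max_age` (how many consecutive misses a confirmed track tolerates) is an assumed parameter that the text does not specify.

```python
class Track:
    """Minimal track lifecycle: tentative -> confirmed -> deleted."""

    def __init__(self, track_id, n_init=3, max_age=30):
        self.track_id = track_id
        self.state = "tentative"   # every new track starts unconfirmed
        self.hits = 1              # the creating detection counts as the first hit
        self.misses = 0            # consecutive frames without a match
        self.n_init = n_init
        self.max_age = max_age     # assumed value, not taken from the text

    def mark_matched(self):
        """Call when a detection was associated with this track in the current frame."""
        self.hits += 1
        self.misses = 0
        if self.state == "tentative" and self.hits >= self.n_init:
            self.state = "confirmed"

    def mark_missed(self):
        """Call when no detection was associated with this track in the current frame."""
        self.misses += 1
        # Tentative tracks are dropped at the first miss; confirmed tracks
        # survive up to max_age consecutive misses.
        if self.state == "tentative" or self.misses > self.max_age:
            self.state = "deleted"

# A track created in frame 1 and matched in frames 2 and 3 becomes confirmed.
player = Track(track_id=1)
player.mark_matched()          # frame 2
state_after_frame_2 = player.state
player.mark_matched()          # frame 3
state_after_frame_3 = player.state

# A tentative track that misses a frame is removed immediately.
ghost = Track(track_id=2)
ghost.mark_missed()
```

Under this scheme a track that is confirmed at the end of the third frame first takes part in cascade matching in the fourth frame, which agrees with step (iv).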
TrackFormer
TrackFormer [1] integrates object detection and data association into a unified framework. It combines convolutional neural networks (CNNs) and Transformers to establish object trajectories. Its key element is the decoder query: the newly introduced “track query” allows an object to be followed autonomously through space and time within a video sequence.
As a tracking method founded on the attention mechanism, the workflow of the TrackFormer algorithm, illustrated in Fig. 7, can be outlined as follows: (i) The initial frame is fed into a backbone network, producing feature representations that are subsequently processed and forwarded to the Transformer’s encoder. (ii) The output from the Transformer encoder, alongside the object vector (the object queries), is fed into the Transformer decoder. (iii) The detection results are obtained through the Transformer decoder.
This process is repeated for each subsequent frame, with the detection results from the previous frame (carried forward as track queries) and the object vector being fed into the decoder of the current frame.
In our experiments, the original video had a frame rate of 25 frames per second. Each video sequence was created by extracting a 40-second segment, equivalent to 1000 frames. Following the setup used by TrackFormer, we selected 14 video sequences and divided them into a training set and a test set, consistent with the structure of the MOT17 dataset, and we trained on this dataset. The outcomes before and after training are given in Fig. 8. After 50 epochs of training, both detection and tracking improved markedly compared with the model before training. In all the video sequences we tested, there were no instances of undetected players.
Fig. 8 [Images not available. See PDF.]
Performances of TrackFormer on challenging sequence before and after training
Following the tracking process facilitated by TrackFormer and obtaining the initial tracking outcome, we proceed to rectify any inaccuracies within the tracking data. Once these corrections are meticulously applied, we achieve a refined tracking representation wherein each player maintains a consistent and unaltered ID. This enhanced tracking data enables us to perform further analysis, such as mapping each player onto a 2D pitch. The outcomes of this mapping process are visualized in Fig. 9.
Fig. 9 [Images not available. See PDF.]
Pitch 2D mapping
Evaluation
Numerous specialized evaluation metrics have been devised for assessing multi-object tracking performance. These metrics vary depending on the specific evaluation methodology employed. When evaluating the overall effectiveness within a video sequence, we can employ the following evaluation metrics:
MT (Mostly Tracked Trajectories) This metric quantifies the number of mostly tracked trajectories, that is, those correctly tracked in 80% or more of their frames. Sometimes it is expressed as a ratio relative to the total number of ground-truth trajectories. A higher value indicates superior overall tracking performance.
ML (Mostly Lost Trajectories) ML captures the number of predominantly lost trajectories, that is, those where the actual trajectory is correctly tracked in less than or equal to 20% of its frames. It may also be expressed as a ratio relative to the total number of ground-truth trajectories. A lower value suggests better overall tracking quality.
IDS (ID Switches) IDS calculates the frequency of incorrect ID switches, representing the number of times IDs were erroneously changed. Fewer ID switches indicate greater tracking stability.
[Equation 1 not available. See PDF.]
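The three sequence-level metrics above can be computed from simple per-trajectory bookkeeping. The sketch below is an illustration with made-up inputs: `coverage` is assumed to hold, for each ground-truth trajectory, the fraction of its frames that were correctly tracked, and the ratios are taken relative to the number of ground-truth trajectories. The function names and the sample values are hypothetical.

```python
def mostly_tracked_lost(coverage, mt_threshold=0.8, ml_threshold=0.2):
    """coverage: ground-truth trajectory id -> fraction of its frames correctly tracked."""
    total = len(coverage)
    mt = sum(1 for c in coverage.values() if c >= mt_threshold)
    ml = sum(1 for c in coverage.values() if c <= ml_threshold)
    return {
        "MT": mt,
        "ML": ml,
        "MT_ratio": mt / total if total else 0.0,
        "ML_ratio": ml / total if total else 0.0,
    }

def count_id_switches(assigned_ids):
    """Count ID changes along one ground-truth trajectory (None marks a missed frame)."""
    switches, last = 0, None
    for pid in assigned_ids:
        if pid is None:
            continue                 # a missed frame is not an ID switch by itself
        if last is not None and pid != last:
            switches += 1
        last = pid
    return switches

# Hypothetical results for four players.
coverage = {"p1": 0.95, "p2": 0.80, "p3": 0.50, "p4": 0.10}
summary = mostly_tracked_lost(coverage)
ids = count_id_switches([1, 1, None, 1, 2, 2, 3])
```

In this example two of the four trajectories are mostly tracked, one is mostly lost, and the sample trajectory changes its ID twice.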
In recent years, there has been a growing inclination toward higher order tracking accuracy metrics, which evaluate detection and tracking performance jointly and also offer insight into the quality of data association [44]: (i) DetA (Detection Accuracy) primarily assesses the quality of detection, as given in Eq. 2, where TP denotes objects that should be detected and are successfully detected, FP denotes detections that do not correspond to any real object, and FN denotes objects that should have been detected but are missed.

$$\mathrm{DetA} = \frac{|\mathrm{TP}|}{|\mathrm{TP}| + |\mathrm{FN}| + |\mathrm{FP}|} \tag{2}$$
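(ii) AssA (Association Accuracy) predominantly gauges the effectiveness of data association, which directly impacts tracking performance, as shown in Eqs. 3 and 4. For a true positive c, TPA(c) is the set of detections in which the same ground-truth object is matched to the same predicted trajectory as in c, FPA(c) is the set of detections in which that predicted trajectory is assigned where the ground-truth object is not present, and FNA(c) is the set of detections of that ground-truth object which the trajectory misses.

$$\mathrm{AssA} = \frac{1}{|\mathrm{TP}|} \sum_{c \in \mathrm{TP}} \mathcal{A}(c) \tag{3}$$

where

$$\mathcal{A}(c) = \frac{|\mathrm{TPA}(c)|}{|\mathrm{TPA}(c)| + |\mathrm{FNA}(c)| + |\mathrm{FPA}(c)|} \tag{4}$$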
(iii) HOTA (Higher Order Tracking Accuracy) combines DetA (Eq. 2) and AssA (Eq. 3) and comprehensively evaluates both detection and data association performance, providing a holistic assessment of tracking quality by factoring in the interplay between these two aspects (Eq. 5).

$$\mathrm{HOTA} = \sqrt{\mathrm{DetA} \cdot \mathrm{AssA}} \tag{5}$$
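A small numerical example may help to see how the three quantities interact. The sketch assumes the standard definitions DetA = TP / (TP + FN + FP), AssA as the mean over all true positives of TPA / (TPA + FNA + FPA), and HOTA as the geometric mean of DetA and AssA at a single localization threshold. The counts are made up and the function names are ours.

```python
import math

def det_a(tp, fn, fp):
    """Detection accuracy from the numbers of TP, FN, and FP detections."""
    return tp / (tp + fn + fp)

def ass_a(per_tp_association):
    """Association accuracy; one (TPA, FNA, FPA) count tuple per true positive."""
    scores = [tpa / (tpa + fna + fpa) for tpa, fna, fpa in per_tp_association]
    return sum(scores) / len(scores)

def hota(tp, fn, fp, per_tp_association):
    """Geometric mean of detection accuracy and association accuracy."""
    return math.sqrt(det_a(tp, fn, fp) * ass_a(per_tp_association))

# Hypothetical counts: 4 true positives, 1 missed object, 1 spurious detection.
# Two of the true positives belong to a perfectly associated trajectory,
# the other two to a trajectory that covers only half of its object.
association = [(4, 0, 0), (4, 0, 0), (2, 2, 0), (2, 2, 0)]
d = det_a(tp=4, fn=1, fp=1)          # 4 / 6
a = ass_a(association)               # (1 + 1 + 0.5 + 0.5) / 4
h = hota(4, 1, 1, association)       # sqrt(d * a)
```

Because HOTA is a geometric mean, a tracker cannot compensate for poor association with good detection alone, or the reverse.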
Related works of soccer players tracking
While numerous reviews exist on multi-object tracking, there is a scarcity of comprehensive reviews specifically dedicated to soccer player tracking. The existing reviews either lean toward player detection or rely on outdated references [3, 4, 45–47]. We aim to address this gap by placing a predominant emphasis on soccer player tracking. Furthermore, we endeavor to provide a rich array of recent examples encompassing both multi-object tracking and soccer player tracking methodologies from the past decade.
Soccer players tracking datasets
One of the more widely utilized soccer player tracking datasets is the SoccerNet dataset, also known as the SNMOT dataset [48]. It comprises 12 matches captured from a moving perspective, offering 200 half-minute video clips with tracking data.
Another valuable resource is the dataset introduced by Yu et al. [49], which encompasses 222 videos. This comprehensive dataset includes 71,936 shots labeled with shot types, 6,850 event clips, 6,294 story clips, and 19,908 frames containing bounding boxes for observed players.
Additionally, Feng et al. [50] have contributed to soccer dataset development with the creation of the SSET dataset for analyzing broadcast soccer videos. This dataset encompasses two shot transition types, five shot types, 11 event types, 15 story types, and four player tracking types.
Soccer players detection methods
Vand et al. [51] introduced a semi-supervised approach that leverages a labeled image dataset alongside a large unlabeled soccer broadcast video dataset for training. They employ a teacher–student framework, where the teacher model is exclusively trained using labeled data. Simultaneously, the student model is trained using both labeled data and unlabeled data, supplemented with pseudo-labels generated by the teacher model. The student model’s training loss combines contributions from the labeled dataset and the unlabeled dataset with pseudo-labels. Following this, the student model undergoes fine-tuning using the labeled dataset to complete training with accurate annotations. This iterative process can be employed to enhance subsequent student models.
Komo et al. [52] described FootAndBall, a deep neural network-based detector for soccer balls and players in video footage. This detector features an efficient fully convolutional architecture capable of processing input video streams of varying resolutions. It adopts the Feature Pyramid Network design pattern to enhance the distinguishability of smaller objects like soccer balls by integrating low-level and high-level features. Importantly, this network has significantly fewer parameters than generic object detectors such as SSD or YOLO, enabling real-time processing of high-resolution video streams.
Lu et al. [53] presented a cascaded convolutional neural network (CNN) approach for player detection in sports scenarios. This method involves training a binary classification network using labeled image patches and applying it efficiently to the entire image during testing. Experiments on basketball and soccer matches demonstrate accurate player detection, even in challenging conditions characterized by lighting variations, dynamic camera movement, and motion blur.
Soccer players tracking methods
Difficulties of soccer players tracking
Compared with the general difficulties discussed in Sect. 2, the challenges of tracking soccer players are more intricate and specialized because of factors such as the vast stadium, the numerous players, and variable lighting conditions, as depicted in Fig. 10.
Regarding the occlusion issue, soccer players can be obstructed by other players or stadium objects, like billboards. In dynamic situations, brief occlusions may occur, whereas, in slower movements, occlusions can persist for extended periods. In both cases, these occlusions result in a partial loss of trajectory, significantly impacting tracking effectiveness. Moreover, players from the same team often wear jerseys with nearly identical appearances, particularly when they share similar colors with the stadium surroundings, like green jerseys, further heightening tracking challenges. Additionally, the presence of staff and spectators around the playing field in soccer matches introduces additional complexities to tracking. As a result, background removal or soccer field identification is typically required before initiating the tracking process. Subsequent sections will introduce relevant techniques to address these challenges.
Fig. 10 Difficulties of soccer player multi-object tracking: (a) similar appearance, (b) mutual occlusion, (c) ambient occlusion, (d) motion blur, (e) illumination change
Camera position
An essential consideration in athlete tracking is the camera position. Broadcast cameras are employed in [54–56], while some studies [57] use multiple static cameras to cover the entire field. Single-camera setups are common in soccer video recording. However, because players occlude one another frequently and continuously, several works [58–61] adopt a multi-view approach that integrates observations from four to eight static cameras. While these multi-view methods are effective, the complexity of the field and camera setup makes them costlier and less practical for real-time applications.
Field registration
Field registration can employ various techniques, including traditional methods such as pitch color analysis, background subtraction, and Hough transforms, as well as deep learning approaches.
In the realm of traditional methods, Bai et al. [62] utilized color ratios and local entropy to detect playing fields, estimating color ratios directly in the RGB color space to identify “green” pixels. Sabirin et al. [63] employed background subtraction to remove stadium areas, white lines, and exterior regions, automatically eliminating billboards that overlap with soccer players in commercial advertising shots. Cuevas et al. [64] proposed a strategy involving Hough transforms to identify lines in soccer images, utilizing a binary mask to isolate line segments and employing a probabilistic decision tree for line classification.
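As a rough illustration of color-ratio playfield detection, the sketch below marks a pixel as field when its green channel dominates red and blue by a fixed ratio. The function names and the ratio value 1.2 are assumptions made for illustration; they are not the thresholds of Bai et al. [62], and the local-entropy step of that method is omitted.

```python
def is_field_pixel(r, g, b, ratio=1.2):
    """Return True when green dominates both red and blue by `ratio` (assumed value)."""
    return g > ratio * r and g > ratio * b


def field_mask(image, ratio=1.2):
    """Build a binary mask from an image given as rows of (R, G, B) tuples."""
    return [[is_field_pixel(r, g, b, ratio) for (r, g, b) in row] for row in image]


def field_fraction(mask):
    """Fraction of pixels labelled as field; could help reject close-up shots."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total if total else 0.0
```

Working on ratios rather than absolute green values makes the rule less sensitive to overall brightness, which is one reason ratio-based rules are attractive for outdoor footage.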
In contrast, Chu et al. [65] employed deep learning for sports field registration, utilizing a grid of keypoints as domain-specific features and formulating the problem as instance segmentation with dynamic filter learning. Their model structure is similar to U-Net, incorporating ResNet-34 in the encoder and up-sampling modules with skip connections in the decoder.
To enhance player detection and tracking, recognizing stadium lines in broadcast view videos is crucial. Bu et al. [66] presented an algorithm that uses Gaussian mixture models (GMM) for stadium region detection and line marker localization. Homayounfar et al. [67] introduced a framework with keypoint-based field registration and dense field features. Sha et al. [68] proposed an end-to-end approach for camera calibration in challenging scenarios, incorporating semantic segmentation, camera pose estimation, and homography refinement into a single network architecture.
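Field registration typically yields a planar homography between image and pitch coordinates, which is what later allows players to be mapped onto a 2D pitch. The sketch below applies a given 3x3 homography to a player's ground contact point. The matrix `H_EXAMPLE` and the helper names are hypothetical; estimating the homography itself, as in [67, 68], is not shown.

```python
def apply_homography(H, x, y):
    """Map an image point (x, y) to pitch coordinates with a 3x3 homography H."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under H")
    return u / w, v / w


def player_foot_point(box):
    """Bottom-centre of a bounding box (x1, y1, x2, y2), taken as the ground contact point."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, y2


# Hypothetical homography: scales pixels to metres and shifts the origin.
H_EXAMPLE = [
    [0.1, 0.0, -5.0],
    [0.0, 0.1, -3.0],
    [0.0, 0.0, 1.0],
]
```

The bottom-centre of the box is used because the homography is only valid for points on the pitch plane; mapping the box centre would place the player too far from the camera.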
Soccer players tracking
The preceding subsection covered conventional techniques such as color-based detection, background subtraction, and Hough transforms, which are used for identifying playing fields and stadium lines, before turning to more advanced deep learning approaches [67, 68]. These methods collectively aim to enhance player tracking accuracy on the field while mitigating interference from objects outside the stadium.
Some studies have analyzed broadcast video. Duh et al. [69] combined color histograms, spatial similarity matrices, and linear prediction models to track athletes. Sabirin et al. [63] used a dataset captured by a single moving camera in soccer games. They implemented an attribute matching algorithm for object tracking across time series data, ensuring the relevance of detected objects. This matching process considers predetermined attributes such as position, size, primary color, and motion information to find the optimal match between objects in different frames. Yang et al. [70] proposed a tracking algorithm based on an enhanced particle filter that uses detection and particle motion (DPF) to increase particle diversity, optimizing particle sampling based on the latest observations and thereby improving tracking efficiency and accuracy. Particle weights are computed from an improved likelihood function that combines color and edge features, making it robust against illumination variations. Sverrisson et al. [71] presented a convolutional neural network (CNN) with inter-frame connection. A distinctive aspect of their approach is the data association logic, which links the region proposals in the current frame with objects detected in previous frames. This connection allows detection probabilities in the current frame to be adjusted in real time, enabling objects with stable trajectories to remain above the decision threshold.
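The idea of matching objects between frames by comparing attributes can be sketched as a small assignment problem. The example below is a simplification under stated assumptions: the cost weights, the dictionary layout, and the brute-force search are illustrative choices, motion information is left out, and none of this is taken from the implementation in [63].

```python
from itertools import permutations
import math


def attribute_cost(a, b, w_pos=1.0, w_size=0.5, w_color=0.5):
    """Weighted dissimilarity between two objects described by
    'pos' (x, y), 'size' (w, h) and 'color' (r, g, b). Weights are assumed."""
    d_pos = math.dist(a["pos"], b["pos"])
    d_size = math.dist(a["size"], b["size"])
    d_color = math.dist(a["color"], b["color"])
    return w_pos * d_pos + w_size * d_size + w_color * d_color


def match_objects(tracks, detections):
    """Return, for each track index, the detection index minimising total cost.
    Brute force over permutations: only suitable for a handful of objects.
    Returns None when there are more tracks than detections."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(detections)), len(tracks)):
        cost = sum(attribute_cost(tracks[i], detections[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = list(perm), cost
    return best
```

For a full match with more than twenty objects on the pitch, the brute-force search would be replaced by a polynomial-time assignment solver such as the Hungarian algorithm.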
In addition, numerous studies have leveraged multiple cameras. Li et al. [72] introduced a collaborative multi-view tracking scheme for soccer games. They employed the scale-invariant feature transform (SIFT) for feature extraction, initializing the algorithm with global SIFT matching and following it with local SIFT matching. Assuming a rigid-body model for each player, they used the median of motion offsets to represent motion vectors, and updated the position of each player by calculating the center of the new cluster. Ben Shitrit et al. [22] offered an alternative method that reduces a complex estimation problem to a standard linear program. They formalized people’s displacements as flows along spatiotemporal locations and appearance group maps, leading to improved identity preservation over extended sequences. Martín et al. [58] presented a semi-supervised system for detection and tracking in multi-camera sports videos. Their primary focus is on fusing the tracks of blobs detected in different cameras into cross-camera tracks. Herrmann et al. [73] proposed a method based on identifying local maxima on a confidence map that combines various visual cues, including team clothing color, HOG human detector response, and grassy areas within the images. This approach enables robust online tracking without requiring additional information such as camera calibration.
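Using the median of keypoint motion offsets as a player's motion vector can be illustrated with a few lines of code. The sketch assumes keypoint matches are already available as coordinate pairs; the function names are hypothetical, and the clustering step that Li et al. [72] use to update the position is replaced here by a plain shift.

```python
from statistics import median


def median_motion(matches):
    """Median displacement (dx, dy) over keypoint matches.
    Each match is ((x_prev, y_prev), (x_curr, y_curr))."""
    dxs = [curr[0] - prev[0] for prev, curr in matches]
    dys = [curr[1] - prev[1] for prev, curr in matches]
    return median(dxs), median(dys)


def update_position(position, matches):
    """Shift a player's position by the median keypoint displacement."""
    dx, dy = median_motion(matches)
    return position[0] + dx, position[1] + dy
```

The median is preferred over the mean here because a single wrong keypoint match, for instance one that jumps to a neighbouring player, would otherwise drag the motion estimate away from the true displacement.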
Besides, Baysal et al. [74] introduced Sentioscope, a soccer player tracking system employing a particle-based approach. This approach uses shared particles at fixed locations in a model field and combines appearance and motion models to track players effectively, even in challenging occlusion scenarios. Najafzadeh et al. [75] treated one player as the primary object for direct tracking, while other players are tracked indirectly. The state vector varies accordingly, containing position and velocity for direct tracking and the position relative to the primary object for indirect tracking. Kim et al. [76] refined background subtraction results with edge information to extract moving objects (players); multi-scale sampling and online interpolation contribute to continuous player tracking. Kim [77] then defined a comprehensive state model for each player, incorporating position, velocity, and average color, and used a topographic (terrain) surface to help predict the boundaries of overlapping objects, ensuring accurate tracking. Hurault et al. [78] proposed a self-supervised deep learning-based approach for tracking soccer players in video sequences. Their method relies on a player detector model and a visual consistency metric to associate detections between consecutive frames.
Most recently, Naik et al. [79] introduced a modified SORT tracking model that uses cosine distance and deep appearance descriptors to associate player identities, improving tracking performance. They also used color representation to distinguish between players and referees, addressing identity switching issues. Zheng [80] emphasized correcting object features through multiple cameras to enhance object tracking accuracy, exploring the KCF algorithm and an improved KCF variant that replaces HOG features with features from a deep convolutional neural network. Theiner et al. [81] introduced an automated pipeline for extracting precise positional data from broadcast footage of soccer matches, including stadium registration, player detection, and team assignment. A key component of this system is the pix2pix model for field segmentation and subsequent labeling. Edge images generated by the model are compared against a synthetic edge image dataset using a Siamese CNN. A nearest neighbor search is then employed to establish the initial homography matrix, followed by further refinement using the Lucas–Kanade algorithm. Scott et al. [82] described the construction of the SoccerTrack dataset, which incorporates data collected from both fisheye and drone cameras, complete with bounding box and pitch coordinate annotations. They also provided insights and algorithms for camera calibration, player and ball tracking, and various preprocessing steps. Notably, the dataset leverages calibrated 8K fisheye cameras, 4K bird’s eye drone cameras, and GNSS data for enhanced accuracy, and their experiments highlight the advantages of fisheye and bird’s eye perspectives in soccer game tracking. A summary of these methods can be found in Table 2, with the second column denoting the number of cameras used and the third column indicating the filming approach.
Table 2. Tracking methods
| Tracking method | Cameras | View |
|---|---|---|
| SIFT [72] | 3 | Fixed |
| Blob-Tracking [58] | 6 | Fixed |
| SSM [69] | 1 | Broadcast |
| KF [73] | 1 | Fixed |
| Appearance-Cues [22] | 6 | Fixed |
| Attribute-Matching [63] | 1 | Broadcast |
| Particle-Based [74] | 2 | Fixed |
| State-Vector [75] | – | – |
| Enhanced-PF [70] | 1 | Broadcast |
| Multi-Scale-Sampling [76] | 6 | Fixed |
| Terrain-Surface-Analysis [77] | 6 | Fixed |
| CNN [71] | 1 | Broadcast |
| Self-Supervised [78] | – | – |
| Modified-SORT [79] | Variable | – |
| Improved-KCF [80] | Variable | Fixed |
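
Appearance-based association of the kind used in the modified SORT model [79] relies on a cosine distance between feature vectors. The sketch below shows that distance together with a simple gated nearest-track lookup. The gating threshold of 0.3, the function names, and the toy feature vectors are assumptions for illustration; real descriptors would come from a trained re-identification network, and the full association in [79] also uses motion cues.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_track(detection_feature, track_features, max_distance=0.3):
    """Return the id of the closest track by cosine distance, or None when
    every track is farther than `max_distance` (an assumed gating threshold)."""
    best_id, best_dist = None, max_distance
    for track_id, feature in track_features.items():
        dist = cosine_distance(detection_feature, feature)
        if dist <= best_dist:
            best_id, best_dist = track_id, dist
    return best_id
```

Returning None for distant detections lets the tracker start a new identity instead of forcing a poor match, which is one way identity switches between similar-looking teammates can be limited.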
Other works
In the realm of soccer ball detection and tracking, various approaches have been explored. Kim et al. [83] employed a dynamic Kalman filter algorithm for robustly tracking the soccer ball in dynamic video conditions. This approach integrates player information and controls the velocity component of the state vector to ensure accurate tracking. Similarly, Najeeb et al. [84] introduced a real-time ball tracking technique for soccer broadcast videos. Their method involves multiple steps, including candidate ball location determination to reduce missed detections, distance computation to eliminate false candidates, and ball position estimation using an extended Kalman filter. Recent work has also applied deep learning to soccer ball tracking. Kamble et al. [85] leveraged VGG-M as the base model for object detection, adapted it to the study’s specific categories (ball, player, background), and employed pre-trained weights to map and process soccer ball trajectories.
Innovative approaches have likewise emerged in the basketball domain, where video quality is often clearer, yet player occlusion remains a challenge. Ben Shitrit et al. [86] modeled player trajectories as continuous flows within the region of interest, preventing player confusion in tracking and enabling reconnection based on appearance cues. Lu et al. [54] proposed a comprehensive system for detecting and tracking multiple basketball players. Their approach involves estimating the homography between video frames and the court layout, and employs weak visual cues and conditional random fields with temporal and mutual exclusion constraints for player recognition. Zhang et al. [87] introduced a deep player recognition model, which differentiates players through coarse-to-fine jersey number recognition and pose-guided partial feature embedding. The PoseBox, created by extracting the player’s torso from the image, helps eliminate background noise and correct pose variations. Convolutional features extracted from the PoseBox using the ResNet-50 model enhance tracking performance, particularly in efficiently handling identity switches. Methods for tracking basketball players can be applied to soccer players to some extent, but they may not be entirely transferable. The differences in dynamics, field layout, and player movements between the two sports present unique challenges that require specialized tracking methods and considerations.
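To make the role of the Kalman filter in ball tracking concrete, the sketch below implements a plain linear constant-velocity filter along one image axis, with the ball position as the only measurement. It is a minimal illustration under stated assumptions: the class name and the noise values `q`, `r`, and `p0` are hypothetical, and neither the velocity control of [83] nor the extended Kalman filter of [84] is reproduced.

```python
class ConstantVelocityKalman1D:
    """Kalman filter over state [position, velocity] with position-only measurements."""

    def __init__(self, q=0.01, r=1.0, p0=100.0):
        self.x = [0.0, 0.0]                  # state estimate
        self.P = [[p0, 0.0], [0.0, p0]]      # state covariance
        self.q = q                           # process noise (assumed, added to the diagonal)
        self.r = r                           # measurement noise variance (assumed)

    def predict(self, dt=1.0):
        p, v = self.x
        (a, b), (c, d) = self.P
        self.x = [p + dt * v, v]
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]]
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]

    def update(self, z):
        (a, b), (c, d) = self.P
        s = a + self.r                       # innovation variance
        k0, k1 = a / s, c / s                # Kalman gain
        y = z - self.x[0]                    # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        # P <- (I - K H) P with H = [1, 0]
        self.P = [
            [(1 - k0) * a, (1 - k0) * b],
            [c - k1 * a, d - k1 * b],
        ]
```

In use, `predict` would be called once per frame and `update` only when a ball candidate is detected, so that the predicted position bridges frames in which the ball is occluded or missed by the detector.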
Discussion
Recently, there has been a growing trend toward leveraging deep learning methods for both detection and tracking tasks. When it comes to detection, many techniques can yield satisfactory results, but the integration of an attention mechanism has proven particularly effective. The incorporation of attention helps the model concentrate on regions of interest, thereby reducing the occurrence of false positives and missed detections. This empowers the model to conduct a more in-depth analysis of areas likely to contain the target, thus improving detection accuracy. Moreover, the attention mechanism proves invaluable in handling scenarios involving occluded targets, as it enables the model to prioritize the visible portions of the target, enhancing detection robustness.
For tracking purposes, it is crucial to emphasize temporal information, allowing for better correlation between consecutive frames. Additionally, a promising avenue of research involves the fusion of tracking and segmentation tasks, enabling a finer-grained distinction between targets. With ample computational resources, it is conceivable to employ this combined tracking and segmentation approach effectively in soccer scenarios.
In our practical experience, we have encountered significant challenges, particularly when dealing with long-duration videos. Tracking objects over extended time frames necessitates not only substantial computational capabilities but also real-time processing to maintain efficiency. However, we have observed that tracking performance tends to deteriorate over time in such scenarios, primarily due to the inherent difficulties of long-term tracking. Addressing these issues requires implementing mechanisms for re-identifying targets that may temporarily disappear from the frame, enhancing occlusion handling to ensure accurate tracking under partial or complete obstructions, and, overall, bolstering the tracking system’s robustness throughout the entire video sequence.
Conclusion and future work
We provide an overview of the current state of multi-object detection and tracking, particularly its applications in soccer. In the discussion of multi-object tracking, we delve into its definition, associated challenges, categorization, evaluation metrics, as well as prevalent methods and datasets. Within the context of soccer, we present several studies focusing on preprocessing techniques preceding soccer player tracking, as well as those specifically dedicated to soccer player tracking. Additionally, we highlight our contributions in this field, which include dataset creation, enhanced detection and tracking outcomes, and 2D pitch mapping.
In the future, for the tracking-by-detection category, we identify opportunities for enhancement in nonlinear motion modeling, the utilization of reliable features for diverse scenarios, and the development of robust matching algorithms. Meanwhile, for joint detection and tracking, we see potential improvements through varied network structures and more resilient matching algorithms.
Considering the diversity of soccer video content, it remains essential to continue analyzing soccer videos captured from different angles and camera setups, with a particular focus on videos from broadcasting viewpoints. Furthermore, within the realm of automatic soccer video analysis, there is room for delving into higher-level semantic scenarios. This may encompass player action analysis, goal assessment, running distance and speed estimation, highlight extraction and indexing, video summarization, technical and tactical analysis, and referee decision validation.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (62077037, 62376231, 62365014, U22B2034, 62262043, 62171321, 62162044 and 62162414) and in part by the Open Project Program of State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (No. VRLAB2023B01).
Author contributions
Chao Yang wrote the main manuscript. All authors reviewed the manuscript.
Data availability
No datasets were generated or analyzed during the current study.
Declarations
Conflict of interest
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Meinhardt, T., Kirillov, A., Leal-Taixe, L., Feichtenhofer, C.: Trackformer: Multi-object tracking with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8844–8854 (2022)
2. Chu, P., Wang, J., You, Q., Ling, H., Liu, Z.: Transmot: spatial-temporal graph transformer for multiple object tracking. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 4870–4880 (2023)
3. Manafifard, M; Ebadi, H; Moghaddam, HA. A survey on player tracking in soccer videos. Comput. Vis. Image Underst.; 2017; 159, pp. 19-46. [DOI: https://dx.doi.org/10.1016/j.cviu.2017.02.002]
4. Najeeb, HD; Ghani, RF. A survey on object detection and tracking in soccer videos. MJPS; 2021; 8,
5. Fashing, M; Tomasi, C. Mean shift is a bound optimization. IEEE Trans. Pattern Anal. Mach. Intell.; 2005; 27,
6. Carreira-Perpinán, M.A.: A review of mean-shift algorithms for clustering. arXiv preprint arXiv:1503.00687 (2015)
7. Comaniciu, D., Meer, P.: Mean shift analysis and applications. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1197–1203. IEEE (1999)
8. Comaniciu, D; Meer, P. Mean shift: a robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell.; 2002; 24,
9. Cheng, Y. Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell.; 1995; 17,
10. Wang, H., Hu, D.: Comparison of SVM and LS-SVM for regression. In: 2005 International Conference on Neural Networks and Brain, vol. 1, pp. 279–283. IEEE (2005)
11. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004, vol. 3, pp. 32–36. IEEE (2004)
12. Vishwanathan, S.V.M., Murty, M.N.: SSVM: a simple svm algorithm. In: Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN’02 (Cat. No. 02CH37290), vol. 3, pp. 2393–2398. IEEE (2002)
13. Cherkassky, V; Ma, Y. Practical selection of SVM parameters and noise estimation for SVM regression. Neural Netw.; 2004; 17,
14. Isard, M., Blake, A.: Contour tracking by stochastic propagation of conditional density. In Computer Vision-ECCV’96: 4th European Conference on Computer Vision Cambridge, 1996 Proceedings, Vol. I 4, pp. 343–356. Springer (1996)
15. Li, P; Zhang, T; Pece, AEC. Visual contour tracking based on particle filters. Image Vis. Comput.; 2003; 21,
16. Li, M; Kambhamettu, C; Stone, M. Automatic contour tracking in ultrasound images. Clin. Linguist. Phon.; 2005; 19,
17. Belongie, S; Malik, J; Puzicha, J. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell.; 2002; 24,
18. Mori, G; Belongie, S; Malik, J. Efficient shape matching using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell.; 2005; 27,
19. Veltkamp, R.C.: Shape matching: similarity measures and algorithms. In: Proceedings International Conference on Shape Modeling and Applications, pp. 188–197. IEEE (2001)
20. Bar-Shalom, Y., Fortmann, T.E., Cable, P.G.: Tracking and data association (1990)
21. Streit, R.L., Luginbuhl, T.E.: Maximum likelihood method for probabilistic multihypothesis tracking. In: Signal and Data Processing of Small Targets 1994, vol. 2235, pp. 394–405. SPIE (1994)
22. Shitrit, HB; Berclaz, J; Fleuret, F; Fua, P. Multi-commodity network flow for tracking multiple people. IEEE Trans. Pattern Anal. Mach. Intell.; 2013; 36,
23. Liu, J., Carr, P., Collins, R.T., Liu, Y.: Tracking sports players with context-conditioned motion models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1830–1837 (2013)
24. Aharon, N., Orfaig, R., Bobrovsky, B.-Z.: Bot-sort: Robust associations multi-pedestrian tracking. arXiv preprint arXiv:2206.14651 (2022)
25. Maggiolino, G., Ahmad, A., Cao, J., Kitani, K.: Deep OC-sort: multi-pedestrian tracking by adaptive re-identification. arXiv preprint arXiv:2302.11813 (2023)
26. Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., Yu, F.: Quasi-dense similarity learning for multiple object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 164–173 (2021)
27. Zhang, Y; Wang, C; Wang, X; Zeng, W; Liu, W. Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis.; 2021; 129, pp. 3069-3087. [DOI: https://dx.doi.org/10.1007/s11263-021-01513-4]
28. Liu, S., Li, X., Lu, H., He, Y.: Multi-object tracking meets moving UAV. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8876–8885 (2022)
29. Liu, Q., Chu, Q., Liu, B., Yu, N.: GSM: Graph similarity model for multi-object tracking. In: IJCAI, pp. 530–536 (2020)
30. Hyun, J., Kang, M., Wee, D., Yeung, D.-Y.: Detection recovery in online multi-object tracking with sparse graph tracker. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 4850–4859 (2023)
31. Wang, Q., Zheng, Y., Pan, P., Xu, Y.: Multiple object tracking with correlation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3876–3886 (2021)
32. Cai, J., Xu, M., Li, W., Xiong, Y., Xia, W., Tu, Z., Soatto, S.: Memot: multi-object tracking with memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8090–8100 (2022)
33. Wang, Y., Weng, X., Kitani, K.: Joint detection and multi-object tracking with graph neural networks. arXiv preprint arXiv:2006.13164, 1(2) (2020)
34. Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., Wang, X.: Bytetrack: multi-object tracking by associating every detection box. In: European Conference on Computer Vision, pp. 1–21. Springer (2022)
35. Liu, Z., Wang, X., Wang, C., Liu, W., Bai, X.: Sparsetrack: multi-object tracking by performing scene decomposition based on pseudo-depth. arXiv preprint arXiv:2306.05238 (2023)
36. Yunhao, D; Zhao, Z; Song, Y; Zhao, Y; Fei, S; Gong, T; Meng, H. Make deepsort great again. IEEE Trans. Multimed.; 2023; 5, 55. [DOI: https://dx.doi.org/10.1109/TMM.2023.3240881]
37. Girbau, A., Marqués, F., Satoh, S.: Multiple object tracking from appearance by hierarchically clustering tracklets. arXiv preprint arXiv:2210.03355 (2022)
38. Li, J; Ding, Y; Wei, H-L; Zhang, Y; Lin, W. Simpletrack: Rethinking and improving the jde approach for multi-object tracking. Sensors; 2022; 22, 5863. [DOI: https://dx.doi.org/10.3390/s22155863]
39. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649. IEEE (2017)
40. Welch, G.F.: Kalman filter. Computer vision: a reference guide, pp. 1–3 (2020)
41. Meinhold, RJ; Singpurwalla, ND. Understanding the Kalman filter. Am. Stat.; 1983; 37,
42. Li, Q., Li, R., Ji, K., Dai, W.: Kalman filter and its application. In: 2015 8th International Conference on Intelligent Networks and Intelligent Systems (ICINIS), pp. 74–77. IEEE (2015)
43. Bishop, G., Welch, G., et al.: An introduction to the Kalman filter. In: Proc of SIGGRAPH, Course 8(27599–23175), 41 (2001)
44. Luiten, J; Osep, A; Dendorfer, P; Torr, P; Geiger, A; Leal-Taixé, L; Leibe, B. Hota: a higher order metric for evaluating multi-object tracking. Int. J. Comput. Vis.; 2021; 129, pp. 548-578. [DOI: https://dx.doi.org/10.1007/s11263-020-01375-2]
45. He, X. Application of deep learning in video target tracking of soccer players. Soft. Comput.; 2022; 26,
46. Lee, J., Moon, S., Nam, D.-W., Lee, J., Oh, A.R., Yoo, W.: A study on sports player tracking based on video using deep learning. In: 2020 International Conference on Information and Communication Technology Convergence (ICTC), pp. 1161–1163. IEEE (2020)
47. Cuevas, C; Quilón, D; García, N. Techniques and applications for soccer video analysis: a survey. Multimed. Tools Appl.; 2020; 79,
48. Cioppa, A., Giancola, S., Deliege, A., Kang, L., Zhou, X., Cheng, Z., Ghanem, B., Van Droogenbroeck, M.: Soccernet-tracking: Multiple object tracking dataset and benchmark in soccer videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3491–3502 (2022)
49. Yu, J., Lei, A., Song, Z., Wang, T., Cai, H., Feng, N.: Comprehensive dataset of broadcast soccer videos. In: 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 418–423. IEEE (2018)
50. Feng, N; Song, Z; Yu, J; Chen, Y-PP; Zhao, Y; He, Y; Guan, T. SSET: a dataset for shot segmentation, event detection, player tracking in soccer videos. Multimed. Tools Appl.; 2020; 79, pp. 28971-28992. [DOI: https://dx.doi.org/10.1007/s11042-020-09414-3]
51. Vandeghen, R., Cioppa, A., Van Droogenbroeck, M.: Semi-supervised training to improve player and ball detection in soccer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3481–3490 (2022)
52. Komorowski, J., Kurzejamski, G., Sarwas, G.: Footandball: integrated player and ball detector. arXiv preprint arXiv:1912.05445 (2019)
53. Lu, K., Chen, J., Little, J.J., He, H.: Light cascaded convolutional neural networks for accurate player detection. arXiv preprint arXiv:1709.10230 (2017)
54. Wei-Lwun, L; Ting, J-A; Little, JJ; Murphy, KP. Learning to track and identify players from broadcast sports videos. IEEE Trans. Pattern Anal. Mach. Intell.; 2013; 35,
55. Xing, J; Ai, H; Liu, L; Lao, S. Multiple player tracking in sports video: a dual-mode two-way Bayesian inference approach with progressive observation modeling. IEEE Trans. Image Process.; 2010; 20,
56. Kristan, M; Perš, J; Perše, M; Kovačič, S. Closed-world tracking of multiple interacting targets for indoor-sports applications. Comput. Vis. Image Underst.; 2009; 113,
57. Misu, T., Naemura, M., Zheng, W., Izumi, Y., Fukui, K.: Robust tracking of soccer players based on data fusion. In: 2002 International Conference on Pattern Recognition, vol. 1, pp. 556–561. IEEE (2002)
58. Martín, R; Martínez, JM. A semi-supervised system for players detection and tracking in multi-camera soccer videos. Multimed. Tools Appl.; 2014; 73, pp. 1617-1642. [DOI: https://dx.doi.org/10.1007/s11042-013-1659-6]
59. Morais, E., Goldenstein, S., Ferreira, A., Rocha, A.: Automatic tracking of indoor soccer players using videos from multiple cameras. In: 2012 25th SIBGRAPI Conference on Graphics, Patterns and Images, pp. 174–181. IEEE (2012)
60. Morais, E; Ferreira, A; Cunha, SA; Barros, RML; Rocha, A; Goldenstein, S. A multiple camera methodology for automatic localization and tracking of futsal players. Pattern Recognit. Lett.; 2014; 39, pp. 21-30. [DOI: https://dx.doi.org/10.1016/j.patrec.2013.09.007]
61. Shitrit, H.B., Berclaz, J., Fleuret, F., Fua, P.: Tracking multiple people under global appearance constraints. In: 2011 International Conference on Computer Vision, pp. 137–144. IEEE (2011)
62. Bai, X., Zhang, T., Song, X., Niu, X.: Playfield detection using color ratio and local entropy. In: 2011 seventh international conference on intelligent information hiding and multimedia signal processing, pp. 356–359. IEEE (2011)
63. Sabirin, H; Sankoh, H; Naito, S. Automatic soccer player tracking in single camera with robust occlusion handling using attribute matching. IEICE Trans. Inf. Syst.; 2015; 98,
64. Cuevas, C; Quilon, D; García, N. Automatic soccer field of play registration. Pattern Recogn.; 2020; 103, 107278. [DOI: https://dx.doi.org/10.1016/j.patcog.2020.107278]
65. Chu, Y.-J., Su, J.-W., Hsiao, K.-W., Lien, C.-Y., Fan, S.-H., Hu, M.-C., Lee, R.-R., Yao, C.-Y., Chu, H.-K.: Sports field registration via keypoints-aware label condition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3523–3530 (2022)
66. Bu, J., Lao, S., Bai, L.: Automatic line mark recognition and its application in camera calibration in soccer video. In: 2011 IEEE International Conference on Multimedia and Expo, pp. 1–6. IEEE (2011)
67. Homayounfar, N., Fidler, S., Urtasun, R.: Sports field localization via deep structured models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5212–5220 (2017)
68. Sha, L., Hobbs, J., Felsen, P., Wei, X., Lucey, P., Ganguly, S.: End-to-end camera calibration for broadcast videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13627–13636 (2020)
69. Duh, D.-J., Chang, S.-Y., Chen, S.-Y., Kan, C.-C.: Automatic broadcast soccer video analysis, player detection, and tracking based on color histogram. In: Intelligent Technologies and Engineering Systems, pp. 123–130. Springer (2013)
70. Yang, Y; Li, D. Robust player detection and tracking in broadcast soccer video based on enhanced particle filter. J. Vis. Commun. Image Represent.; 2017; 46, pp. 81-94. [DOI: https://dx.doi.org/10.1016/j.jvcir.2017.03.008]
71. Sverrisson, S., Grancharov, V., Pobloth, H.: Real-time tracking-by-detection in broadcast sports videos. In: Image Analysis: 21st Scandinavian Conference, SCIA 2019, Norrköping, Sweden, June 11–13, 2019, Proceedings 21, pp. 399–411. Springer (2019)
72. Li, H., Flierl, M.: Sift-based multi-view cooperative tracking for soccer video. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1001–1004. IEEE (2012)
73. Herrmann, M; Hoernig, M; Radig, B. Online multi-player tracking in monocular soccer videos. Aasri Proced.; 2014; 8, pp. 30-37. [DOI: https://dx.doi.org/10.1016/j.aasri.2014.08.006]
74. Baysal, S; Duygulu, P. Sentioscope: a soccer player tracking system using model field particles. IEEE Trans. Circuits Syst. Video Technol.; 2015; 26,
75. Najafzadeh, N., Fotouhi, M., Kasaei, S.: Multiple soccer players tracking. In: 2015 The International Symposium on Artificial Intelligence and Signal Processing (AISP), pp. 310–315. IEEE (2015)
76. Kim, W; Moon, S-W; Lee, J; Nam, D-W; Jung, C. Multiple player tracking in soccer videos: an adaptive multiscale sampling approach. Multimed. Syst.; 2018; 24, pp. 611-623. [DOI: https://dx.doi.org/10.1007/s00530-018-0586-9]
77. Kim, W. Multiple object tracking in soccer videos using topographic surface analysis. J. Vis. Commun. Image Represent.; 2019; 65, 102683. [DOI: https://dx.doi.org/10.1016/j.jvcir.2019.102683]
78. Hurault, S., Ballester, C., Haro, G.: Self-supervised small soccer player detection and tracking. In: Proceedings of the 3rd International Workshop on Multimedia Content Analysis in Sports, pp. 9–18 (2020)
79. Naik, BT; Hashmi, MF; Geem, ZW; Bokde, ND. Deepplayer-track: player and referee tracking with jersey color recognition in soccer. IEEE Access; 2022; 10, pp. 32494-32509. [DOI: https://dx.doi.org/10.1109/ACCESS.2022.3161441]
80. Zheng, B. Soccer player video target tracking based on deep learning. Mob. Inf. Syst.; 2022; 1–6, 2022.
81. Theiner, J., Gritz, W., Müller-Budack, E., Rein, R., Memmert, D., Ewerth, R.: Extraction of positional player data from broadcast soccer videos. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 823–833 (2022)
82. Scott, A., Uchida, I., Onishi, M., Kameda, Y., Fukui, K., Fujii, K.: Soccertrack: a dataset and tracking algorithm for soccer with fish-eye and drone videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3569–3579 (2022)
83. Kim, J.-Y., Kim, T.-Y.: Soccer ball tracking using dynamic Kalman filter with velocity control. In: 2009 Sixth International Conference on Computer Graphics, Imaging and Visualization, pp. 367–374. IEEE (2009)
84. Najeeb, H.D., Ghani, R.F.: Tracking ball in soccer game video using extended Kalman filter. In: 2020 International Conference on Computer Science and Software Engineering (CSASE), pp. 78–82. IEEE (2020)
85. Kamble, PR; Keskar, AG; Bhurchandi, KM. A deep learning ball tracking system in soccer videos. Opto-Electron. Rev.; 2019; 27,
86. Ben Shitrit, H., Raca, M., Fleuret, F., Fua, P.: Tracking multiple players using a single camera. Technical report, Springer Verlag (2013)
87. Zhang, R; Lingxiang, W; Yang, Y; Wanneng, W; Chen, Y; Min, X. Multi-camera multi-player tracking with deep player identification in sports video. Pattern Recogn.; 2020; 102, 107260. [DOI: https://dx.doi.org/10.1016/j.patcog.2020.107260]
Copyright Springer Nature B.V. Jan 2025