1. Introduction
Nearly 70% of the earth's surface is covered by oceans, inland rivers, and lakes. Increasingly fierce and complicated maritime rights disputes between countries have made integrated maritime surveillance, rights protection, and law enforcement more difficult and have necessitated improvements in omnidirectional observation and situational awareness in specific areas of the ocean. The massive volume of marine surveillance video requires real-time monitoring by crew members or traffic management staff, which consumes a great deal of manpower and material resources, and inspections are missed because of factors such as human fatigue. Automatic detection, re-identification, and tracking of ships in surveillance videos will transform maritime safety.
Society is slowly accommodating the shift to fully automated vehicles, such as airplanes, road vehicles, and ships at sea. The rapid development of information and communication technologies such as the internet of things, big data, and artificial intelligence (AI) has led to a demand for increased automation in ships of different types and tonnages. Marine navigation is becoming more complex. Intelligent computer vision will improve automated support for navigation. It is necessary to combine intelligent control of shipping with intelligent navigation systems and to strengthen ship–shore coordination to make worthwhile progress in developing autonomous ship navigation technology.
A moving ship must detect and track the ships around it. This task can be regarded as the problem of tracking a moving target using a camera that is itself in motion. Collision avoidance on the water may require maneuvers such as acceleration and turning, possibly urgently. Onboard visual sensors may be subject to vibration, swaying, changes in light, and occlusion. The detection and tracking of dim or small targets on the water is thus very challenging. According to the latest report of Allianz Global Corporate & Specialty (AGCS) [1], 75–96% of collisions at sea are related to crew errors. There is therefore an urgent need to use advanced artificial intelligence technology for ocean scene parsing and auxiliary decision-making on both crewed and autonomous vessels, with the longer-term goal of gradually replacing the crew in plotting courses and controlling the vessel [2]. If human errors can be reduced, the probability of marine collisions will be greatly reduced.
In general, the rapid development of intelligent shipping requires improved computerized visual perception of the ocean surface. However, current visual perception, which uses deep learning, is inadequate for control of an autonomous surface vessel (ASV) and associated marine surveillance. There are many key issues to be resolved [3], such as:
(1). The visual sensor on a vessel monitors a wide area. The target vessel is usually far away from the sensor and thus accounts for only a small proportion of the entire image; it is considered a dim and small target at sea. On a relative scale, the product of the target's width and height is less than one-tenth of the entire image; on an absolute scale, the target is smaller than 32 × 32 pixels [4] (see the sketch after this list).
(2). Poor light, undulating waves, interference due to sea surface refraction, and wake caused by ship motion, among other factors, cause large changes in the image background.
(3). Given the irregular jitter as well as the sway and the heave of the hull, the onboard surveillance video inevitably shows high frequency jitter and low frequency field-of-view (FoV) shifts.
(4). Image definition and contrast in poor visibility due to events such as sea-surface moisture and dense fog are inadequate for effective target detection and feature extraction.
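The two scale criteria in item (1) translate directly into a check on a detection's bounding box. A minimal sketch is given below; the function name is hypothetical, and the thresholds are the ones quoted from [4] above.

```python
def is_small_target(box_w, box_h, img_w, img_h):
    """Classify a detection as a dim/small maritime target.

    Follows the two criteria cited in item (1): relative area below one-tenth
    of the image, or absolute size below 32 x 32 pixels.
    """
    relative_small = (box_w * box_h) < 0.1 * (img_w * img_h)
    absolute_small = box_w < 32 and box_h < 32
    return relative_small or absolute_small

# Example: a 20 x 14 pixel vessel in a 1920 x 1080 frame counts as small.
print(is_small_target(20, 14, 1920, 1080))  # True
```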
Deep learning (DL) is a branch of artificial intelligence that was pioneered by Hinton [5]. It creates a deep topology to learn discriminative representations and nonlinear mappings from large amounts of data. Deep learning models fall into two categories: supervised discriminative models and unsupervised generative models. We used the discriminative model in this study; it is widely used in visual recognition in maritime situations and learns the posterior probability distribution of labels conditioned on the observed data. Research into visual perception in the marine environment provides data that can be mined for ship course planning and intelligent collision avoidance decision-making. It has significant practical value both for improving monitoring of the marine environment from onshore and for improving automated ASV navigation.
2. Research Progress of Vision-Based Situational Awareness
Research into automated visual interpretation of the marine environment lags behind research into aerial visualization (e.g., in drones) and terrestrial visualization (e.g., autonomous vehicles). Shore-based surveillance radar, onboard navigation radar, and automated identification systems (AIS) display target vessels only as spots of light on an electronic screen and therefore provide no visual cues about the vessel. Ships are generally large, so treating them as point objects is inadequate for determining courses or avoiding collision, as has been confirmed by many marine accidents in recent years. The experimental psychologist Treichler showed through his well-known learning and perception experiments that vision accounts for 83% of the perceptual information humans obtain across all sensing modalities (vision, hearing, touch, etc.) [6]. A monocular vision sensor has a number of advantages (easy installation, small size, low energy consumption, and strong real-time performance) and provides reliable data for integrated surveillance systems at sea and for ASV navigation systems.
Typical discriminative deep learning models include convolutional neural networks, time series neural networks, and attention mechanisms. They transform the search for an objective function into the problem of optimizing a loss function. In this review, we define the objects of marine situational awareness (MSA) as targets and obstacles in the local environment of the ASV or in the waterways monitored by the vessel traffic service (VTS). To further facilitate future research in the MSA field, we used discriminative deep learning as the main approach to vision-based situational awareness in maritime surveillance and autonomous navigation. On the basis of 85 published papers, we reviewed current research in four critical areas: full scene parsing of the sea surface, vessel re-identification, ship tracking, and multimodal fusion of perception data from visual sensors. The overall structure of the marine situational awareness technology used in this paper is shown in Figure 1. For simplicity but without loss of generality, we do not consider the input of multimodal information here.
The relationship between full scene parsing, ship recognition and re-identification, and target tracking is as follows. The detection and classification module detects the target vessel and strips it from the full scene. This module is the core of the three situational awareness tasks and is linked to the two tasks of re-identification and target tracking. Semantic segmentation separates foreground objects (ships, buoys, and other obstacles) from background (sea, sky, mountains, islands, land) at the pixel level to eliminate distractions from target detection. Instance segmentation complements target detection and classification. Target detection provides prior knowledge for re-identification (i.e., direct re-identification of the detected vessel) in the recognition and re-identification module and provides features of the vessel's appearance to the tracking-by-detection pipeline. The combination of the three processes results in the upstream tracking-by-detection task continuously outputting the states and the positions of surrounding vessels and other dynamic or static obstacles to navigation. This information is the basis for the final prediction of target ship behavior and for ASV collision avoidance and navigational decision-making.
2.1. Ship Target Detection
2.1.1. Target Detection
It is likely that the automatic identification system (AIS) is not installed, or is deactivated, on small targets such as marine buoys, sand carriers, wooden fishing boats, and pirate vessels. Because these objects are small, their radar reflectivity is slight, and their outlines may be difficult to detect by a single sensor amid surrounding noise. Low signal strength, low signal-to-noise ratios, irregular and complex backgrounds, and undulating waves increase the difficulty of detecting and tracking dim and small targets. All these factors can lead to situations requiring urgent action to avoid collision. The early detection of dim or small targets on the sea surface at a distance provides a margin of time for course correction and collision avoidance calculations in the ASV.
Conventional image detection commonly uses three-stage visual saliency detection technology consisting of preprocessing, feature engineering, and classification. Although these techniques detect objects and track them accurately, they are slow in operation and do not perform well in changing environments. Their use in a complex and changeable marine environment produces information lag, thus endangering the ASV and the surrounding vessels. Rapid detection and object tracking technology based on computer vision can increase the accuracy of perception of the surrounding environment and improve understanding of the navigational situation of an ASV and thus improve autonomous navigation. The required technology has been advanced by the incorporation of deep learning.
Deep learning in target detection uses a convolutional neural network (CNN) to replace the conventional sliding window and hand-crafted features. Deep-learning-based detectors have evolved from traditional dual-stage methods (Faster R-CNN [7,8,9], R-FCN [7], Cascade R-CNN [8], Mask R-CNN [9]) to single-stage methods (SSD [7,10], RetinaNet [8], YOLO [11], EfficientDet [12]) and then to state-of-the-art (SOTA) anchor-free methods (CenterNet [12], FCOS [13]), all of which are widely used as marine ship detectors. In general, single-stage methods are markedly faster than dual-stage methods, with accuracy that is close to, or even better than, that of the latter; therefore, single-stage methods, especially anchor-free models, are becoming increasingly popular. For dim and small targets at sea, STDNet uses a regional context network to attend to the region of interest and its surrounding context and can detect ships occupying less than 16 × 16 pixels in 720p video [14]. DSMV [8] applies a bi-directional Gaussian mixture model to temporally related multi-frame inputs and combines it with a deep detector for ship detection. In addition, combining the robust feature extraction capability of CNNs with other methods, such as saliency detection [15], background subtraction [16], and evidence theory [17], has proven effective in improving the accuracy of detection and classification. Table 1 compares the SOTA ship target detection algorithms.
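As a concrete illustration of the single-stage, anchor-free route, the sketch below runs a pretrained FCOS detector from torchvision on a surveillance frame and keeps only boat-class detections. The score threshold, the file name, and the reliance on the COCO "boat" class are illustrative assumptions; a real maritime detector would be fine-tuned on a dataset such as SeaShips or SMD rather than used off the shelf.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Anchor-free, single-stage detector pretrained on MS COCO (stand-in for a
# maritime-tuned model).
model = torchvision.models.detection.fcos_resnet50_fpn(weights="DEFAULT")
model.eval()

COCO_BOAT_CLASS = 9      # "boat" index in the COCO category list
SCORE_THRESHOLD = 0.4    # hypothetical confidence cutoff

image = Image.open("onboard_frame.jpg").convert("RGB")  # hypothetical frame
with torch.no_grad():
    predictions = model([to_tensor(image)])[0]

for box, label, score in zip(predictions["boxes"],
                             predictions["labels"],
                             predictions["scores"]):
    if label.item() == COCO_BOAT_CLASS and score.item() > SCORE_THRESHOLD:
        x1, y1, x2, y2 = box.tolist()
        print(f"ship candidate at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}), "
              f"score {score.item():.2f}")
```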
When a vessel is under way, distant targets (other vessels or obstacles) first appear near the horizon or the shoreline [18]; thus, pre-extracting the horizon or the shoreline significantly reduces the potential area for target detection. In recent years, methods of ship detection and tracking for vessels appearing on the horizon have been proposed that use images collected by onboard or buoy-mounted cameras [19], as shown in Figure 2. Jeong et al. identified the horizon by defining it as the region of interest (ROI) [20]. Zhang et al. used a discrete cosine transform for horizon detection to dynamically separate out the background and performed target segmentation on the foreground [21]. Sun et al. developed an onboard coarse-to-fine stitching method to detect the horizon [22]. Steccanella et al. used a CNN to classify the viewed surface pixel by pixel into water and non-water areas [23] and then generated the horizon and the shoreline by curve fitting. Shan et al. investigated the orientation of the visual sensor as a means to extract the horizon and detect surrounding ships [24]. Su et al. developed a gradient-based iterative method of detecting the horizon using an onboard panoramic catadioptric camera [25]. It is impractical to install X-band navigation radars, binocular cameras, and other ranging equipment on a small ASV because of constraints of size and cost. Gladstone et al. used planetary geometry and the optical characteristics of a monocular camera to estimate the distance of a target vessel at sea [26], taking the detected horizon as the reference; the average error obtained over many experiments was 7.1%.
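Many of the horizon-extraction schemes above reduce to fitting a dominant straight line in an edge or water/non-water mask. The sketch below shows one such baseline with OpenCV; the Canny thresholds and Hough parameters are illustrative, not those of any cited method, and real systems add temporal smoothing, ROI restriction, or semantic masks on top.

```python
import cv2
import numpy as np

def detect_horizon(frame_bgr):
    """Fit a single dominant, near-horizontal line as a rough horizon estimate."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    edges = cv2.Canny(blurred, 50, 150)            # illustrative thresholds
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=100,
                            minLineLength=frame_bgr.shape[1] // 3, maxLineGap=20)
    if lines is None:
        return None

    def length(l):
        x1, y1, x2, y2 = l[0]
        return np.hypot(x2 - x1, y2 - y1)

    # Keep only roughly horizontal segments, then take the longest one.
    near_horizontal = [l for l in lines
                       if abs(l[0][3] - l[0][1]) < 0.1 * (abs(l[0][2] - l[0][0]) + 1)]
    return max(near_horizontal, key=length, default=None)

frame = cv2.imread("sea_scene.jpg")                # hypothetical frame
horizon = detect_horizon(frame)
print("horizon segment:", None if horizon is None else horizon[0])
```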
2.1.2. Image Segmentation
Image segmentation divides an image into regions with distinct properties and extracts regions of interest. Segmentation of sea scenes is mostly performed as semantic segmentation, which classifies pixels without distinguishing individual target instances. Bovcon et al. performed semantic segmentation and used an inertial measurement unit (IMU) to detect three-dimensional obstacles on the water surface [27] and created a semantic segmentation dataset containing 1325 manually labeled images for use by ASVs [28]. They conducted experiments comparing SOTA algorithms such as U-Net [29], the pyramid scene parsing network (PSPNet) [30], and DeepLabv2 [31]. Bovcon et al. also designed a horizon/shoreline segmentation and obstacle detection process using an encoder–decoder model to process onboard IMU data [32]; this approach significantly improved the detection of dim or small targets. Cane et al. [33] compared the performance of conventional semantic segmentation models such as SegNet, ENet, and ESPNet on four public maritime datasets (MODD [27], the Singapore maritime dataset (SMD) [34], IPATCH [35], and SEAGULL [36]). Zhang et al. designed a two-stage semantic segmentation method for the marine environment [37] that first identifies interference factors (e.g., sea fog, ship wakes, waves) and then extracts the semantic mask of the target ship. Jeong et al. used PSPNet to parse the sea surface [38] and then extracted the horizon with a straight-line fitting algorithm; experiments showed that this method was more accurate and more robust than the conventional method. Qiu et al. designed a real-time semantic segmentation model for ASVs using the U-Net framework [39], in which the sea scene is divided into five categories: sky, sea, human, target ship, and other obstacles. Kim et al. [40] developed a lightweight skipZ_ENet model on the Jetson TX2, an embedded AI platform, to semantically segment sea obstacles in real-time; here, the sea scene is divided into five categories: sky, sea surface, target ships, island or dock, and small obstacles. Given the possible interrelation and interaction among detection, classification, and segmentation, it is reasonable to assume that the three tasks share a common feature subspace, and an emerging trend is to unite them through multi-task learning [41].
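Several of the works above repurpose a generic encoder–decoder for a small set of maritime classes. The sketch below shows one hedged way to do this with torchvision's DeepLabV3, re-heading it for the five classes described in [39]; the class order, learning rate, and training loop details are assumptions, not the cited authors' implementation.

```python
import torch
import torch.nn as nn
import torchvision

# Five maritime classes following the split described above (assumed order).
CLASSES = ["sky", "sea", "human", "target_ship", "other_obstacle"]

# Generic segmentation backbone with its classifier head replaced.
model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT")
model.classifier[4] = nn.Conv2d(256, len(CLASSES), kernel_size=1)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, masks):
    """One optimization step; images: (B,3,H,W) float, masks: (B,H,W) long."""
    model.train()
    optimizer.zero_grad()
    logits = model(images)["out"]          # (B, num_classes, H, W)
    loss = criterion(logits, masks)
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical dummy batch just to show the expected tensor shapes.
dummy_images = torch.randn(2, 3, 384, 512)
dummy_masks = torch.randint(0, len(CLASSES), (2, 384, 512))
print("loss:", train_step(dummy_images, dummy_masks))
```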
Instance segmentation and panoptic segmentation [42], building on recent progress in target detection and semantic segmentation, have been used to make pixel-level distinctions between objects with different tracks and objects that behave differently from each other. Panoptic segmentation is a synthesis of semantic segmentation and instance segmentation; because the two tasks reinforce each other, it can provide a unified understanding of the sea scene and is worthy of further exploration. However, to our knowledge, there is no relevant literature in this field. Current research into instance segmentation of ships at sea is concentrated mainly on remote sensing, for example [43,44], and only a few studies that use visible light can be found [45,46]. Examples of target detection, semantic segmentation, and instance segmentation are shown in Figure 3.
2.2. Ship Recognition and Re-Identification
Both onshore video surveillance systems and ASVs use more than one camera (aboard ship: on the bow, stern, port, and starboard) to monitor the waters and the surface around the vessel. Re-identification (ReID) is the task of recognizing and retrieving the same target object across different cameras and different scenes. A ship can be considered a rigid body, and its apparent change in attitude across viewpoints is much larger than that of pedestrians and vehicles. Changes in the luminosity of the sea surface and of the background, as well as changes in perceived vessel size due to changes in distance from the vision sensor, can make clusters of images of a single vessel appear to be of different vessels and clusters of images of different vessels appear to be of the same vessel. Thus, there are significant difficulties in ensuring accurate identification of vessels and other objects on the sea surface.
Target re-identification is an area of image retrieval that mostly focuses on pedestrians and vehicles. Google Scholar searches using keywords such as "Ship + Re-identification", "Ship + Re identification", "Vessel + Re-identification", and "Vessel + Re identification" show that there are few reports in the ship field [47,48,49,50], two of which were published by authors of this paper. By analogy with the re-identification of pedestrians and vehicles, we define vessel re-identification (also referred to as ship-face recognition) as the process of using computer vision to retrieve, from a gallery set of cross-frame or cross-camera images, the ship that appears in a given probe image. Current ReID research focuses on resolving issues of intra-class difference and inter-class similarity. However, ships are homogeneous, rigid bodies that are highly similar, and these properties increase the difficulty of re-identification.
Re-identification of cross-modal target ships for onshore closed circuit television (CCTV) or ASV monitoring can be performed on detected targets to compensate for the single perspective of a vessel captured by a single camera. Such re-identification can be introduced into the multitarget tracking pipeline for long term tracking. Figure 4 shows the four key steps of vessel re-identification [49]: ship detection, feature extraction, feature transformation, and feature similarity metric learning.
For re-identification, the visual information of ships differs from that of pedestrians, vehicles, and even rare animals in two significant respects: ships are much larger, and their appearance changes markedly with viewpoint. Thus, we cannot simply adopt or modify a target re-identification algorithm from another application and migrate it to target ships. Wang et al. focused on feature extraction in their review of deep network models for maritime target recognition, which provides a valuable reference for research in this field [51]. For feature extraction, introducing orthogonal moment methods, such as Zernike moments [52] and Hu moments [53], and fusing them with features extracted by a CNN can significantly improve the accuracy of ship recognition. ResNet-50, VGG-16, and DenseNet-121 were used as feature extractors on the Boat Re-ID dataset built by Spagnolo et al. [54]; ResNet-50 achieved the best results of the three. By estimating the position and the orientation of the target ship in the image, Ghahremani et al. re-identified it using a TriNet model [48]. On the common object dataset (MSCOCO) and two maritime-domain datasets (IPATCH and MarDCT), Heyse et al. explored a domain-adaptation method for the fine-grained identification of ship categories [55]. Groot et al. re-identified target ships imaged by multiple cameras by matching vessel tracks and time filtering, among other methods [56]; they divided the known distance between fixed-point cameras by the navigation time to estimate the target speed. The authors in [49] proposed a multiview feature learning framework that combines global and discriminative local features. The framework combines the cross-entropy loss with the proposed orientation quintuple (O-Quin) loss. The inherent features of ship appearance are fully exploited, and the algorithm for learning multiview representations is refined and optimized for vessel re-identification. A discriminative feature detection model and a viewpoint estimation model are embedded to create an integrated framework.
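Independent of the specific losses discussed above, the retrieval step of vessel re-identification reduces to embedding a probe crop and ranking a gallery by feature similarity. The sketch below uses an ImageNet-pretrained ResNet-50 as a stand-in embedder; a trained re-ID model such as those in [48,49] would replace it, and the file names are hypothetical.

```python
import torch
import torch.nn.functional as F
import torchvision
from torchvision import transforms
from PIL import Image

# Stand-in embedder: ResNet-50 with its classification head removed.
backbone = torchvision.models.resnet50(weights="DEFAULT")
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def embed(path):
    """Return an L2-normalized appearance embedding for one ship crop."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feature = backbone(image)
    return F.normalize(feature, dim=1).squeeze(0)

# Hypothetical probe and gallery crops produced by the detector.
probe = embed("probe_ship.jpg")
gallery_paths = ["gallery_001.jpg", "gallery_002.jpg", "gallery_003.jpg"]
gallery = torch.stack([embed(p) for p in gallery_paths])

# Cosine-similarity ranking: the highest-scoring gallery image is the match.
similarities = gallery @ probe
for path, s in sorted(zip(gallery_paths, similarities.tolist()),
                      key=lambda t: -t[1]):
    print(f"{path}: similarity {s:.3f}")
```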
Introducing re-identification-based ship-face recognition into a mobile VTS can confirm the identity and the legal status of nonconforming vessels, as shown in Figure 5. The typical process is as follows: the AIS is first registered with the camera coordinate system; the pre-trained re-identification model is then used for feature extraction; and the re-identification result is finally compared for similarity against the identity reported in the AIS message. If the observed vessel matches a registered vessel other than the one claimed in the AIS message with a similarity greater than some threshold value, the target vessel is judged to have tampered with its AIS information and to be operating illegally.
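Under the reading given above, the final verification is a thresholded nearest-neighbor test against a gallery of registered identities. The sketch below illustrates that decision rule only; the threshold, the gallery structure, and the MMSI-style identifiers are assumptions, and the embeddings here are random placeholders.

```python
import torch

SIMILARITY_THRESHOLD = 0.7   # hypothetical operating point

def check_ais_consistency(probe_embedding, gallery_embeddings, gallery_ids, ais_id):
    """Flag possible AIS tampering: the observed vessel closely matches a
    registered vessel whose identity differs from the AIS-claimed one.

    probe_embedding: (D,) L2-normalized appearance feature of the observed vessel.
    gallery_embeddings: (N, D) L2-normalized features of registered vessels.
    gallery_ids: list of N identity strings (e.g., MMSI numbers).
    ais_id: identity string claimed in the received AIS message.
    """
    similarities = gallery_embeddings @ probe_embedding
    best = int(torch.argmax(similarities))
    best_id, best_sim = gallery_ids[best], similarities[best].item()
    tampering = best_sim > SIMILARITY_THRESHOLD and best_id != ais_id
    return best_id, best_sim, tampering

# Hypothetical example with random placeholder embeddings.
gallery = torch.nn.functional.normalize(torch.randn(3, 2048), dim=1)
probe = torch.nn.functional.normalize(
    gallery[1] + 0.01 * torch.randn(2048), dim=0)   # looks like vessel "MMSI_B"
best_id, sim, tampering = check_ais_consistency(
    probe, gallery, ["MMSI_A", "MMSI_B", "MMSI_C"], ais_id="MMSI_C")
print(best_id, f"{sim:.2f}", "tampering suspected" if tampering else "consistent")
```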
2.3. Ship Target Tracking
In a complex marine environment, discrete position reports are not sufficient for tracking a vessel, especially when the target is under way. The course and the speed of the target vessel are required in order to make real-time collision avoidance decisions. The environment may be cluttered and visibility may be poor, so it is necessary to correlate and extrapolate data to provide a stable track of the vessel's course. As shown in Figure 6, the tracking-by-detection framework usually consists of two stages: in the first, detection hypotheses are produced by a pre-trained target detector; in the second, these hypotheses are associated along the timeline to form stable tracks.
Generative tracking pipelines built on particle filters [57], Kalman filters [58], and the mean-shift algorithm [59] ignore correlations between the target and features of the sea background and other non-targets during ship tracking, which leads to low accuracy. Their real-time performance is also less than desirable because of computational complexity. In recent years, researchers have introduced advanced correlation filtering and deep learning methods for ship tracking in sea-surface surveillance videos. Correlation filtering [60] is renowned for its fast tracking speed; however, most of its success is in single-target tracking because of its weaknesses in multitarget tracking and the sensitivity of its extracted features to illumination and attitude. Deep learning is more pervasive in tracking, and its accuracy is significantly better owing to deeper feature extraction and more detailed representation. Nevertheless, the accuracy advantages retained by hand-crafted features mean that neither correlation filtering nor deep learning should be discarded outright. Zhang et al. combined features extracted by deep learning with a histogram of oriented gradients (HOG), local binary patterns (LBP), the scale-invariant feature transform (SIFT), and other hand-crafted features in tracking target ships [61].
In recent work on tracking marine targets, researchers combined appearance features with motion information in data fusion, a move that significantly improved tracking stability. Yang et al. introduced deep learning into the detection of vessels on the sea surface and embedded deep appearance features in the ensuing tracking process in combination with the standard Kalman filtering algorithm [62]. Leclerc et al. used transfer learning to classify ships, which significantly improved the predictive ability of the tracker [63]. Qiao et al. proposed a multimodal, multicue, tracking-by-detection framework in which vessel motion information and appearance features were used concurrently for data association [50]. Re-identification was used for ship tracking, and the appearance features were used as one of the multicues for long term target vessel tracking. Shan et al. [24] adapted the Siamese region proposal network (SiamRPN) and introduced the feature pyramid network (FPN) to accommodate the characteristics of the marine environment in detecting and tracking dim or small targets on the sea surface (pixels accounted for less than 4%) and created a real-time tracking framework for ships, sea-SiamFPN. Although the accuracy has been significantly improved compared with correlation filtering methods, this framework also belongs to the category of single target tracking, and its application in actual marine scenes has certain limitations. It is worth noting that the features extracted by CNN in the detection stage can be directly reused in subsequent instance segmentation and tracking tasks to save computing resources. For example, Scholler directly used the features in the detection phase for appearance matching during tracking [64].
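At the core of the tracking-by-detection pipelines discussed above is a per-frame assignment between existing tracks and new detections that mixes motion and appearance cues. The sketch below shows that association step with the Hungarian algorithm; the cue weighting and gating value are illustrative, and the motion term here is a plain IoU rather than a full Kalman prediction, so it is a simplified stand-in for the cited methods.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(track_boxes, track_feats, det_boxes, det_feats,
              appearance_weight=0.5, max_cost=0.7):
    """Match tracks to detections with a combined motion/appearance cost."""
    cost = np.zeros((len(track_boxes), len(det_boxes)))
    for i in range(len(track_boxes)):
        for j in range(len(det_boxes)):
            motion_cost = 1.0 - iou(track_boxes[i], det_boxes[j])
            appearance_cost = 1.0 - float(np.dot(track_feats[i], det_feats[j]))
            cost[i, j] = ((1 - appearance_weight) * motion_cost
                          + appearance_weight * appearance_cost)
    rows, cols = linear_sum_assignment(cost)
    # Gate out implausible assignments instead of forcing every match.
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_cost]

# Hypothetical one-frame example: two tracks, two detections.
tracks = [[100, 100, 160, 140], [400, 220, 470, 260]]
dets = [[405, 222, 473, 262], [104, 102, 162, 141]]
feat = lambda k: np.eye(8)[k]                      # placeholder unit features
matches = associate(tracks, [feat(0), feat(1)], dets, [feat(1), feat(0)])
print("track -> detection matches:", matches)
```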
3. Multimodal Awareness with the Participation of Visual Sensors
3.1. Multimodal Sensors in the Marine Environment
Modality refers to different perspectives or different forms of perceiving specific objects or scenes. There is high correlation between multimodal data because they represent aspects of the same item, thus it is advantageous to implement multimodal information fusion (MMIF). This process uses multimodal perception data acquired at different times and in different spaces to create a single representation through alignment, conversion, and fusion and initiates multimodal sensor collaborative learning to maximize the accuracy and the reliability of the environmental awareness system.
The improved perception capability is constrained by the equipment used for perception and the techniques used to interpret the data recorded. Mobile VTS, which is a complex information management system, needs a variety of shipborne sensors that obtain data for the surrounding environment in real-time, learn from each other, and improve the robustness and the fault tolerance of the sensing system through redundancy. The performance of various sensors that operate in a marine environment is shown for comparison in Table 2.
Individual sensors can be combined according to the pros and the cons shown in Table 2, and multiple sensors can be integrated into a sensing system that yields more reliable sea-surface traffic monitoring. For example, AIS and X-band radar can be used together for medium and long range surveillance (1–40 km), while millimeter-wave radar and RGB or infrared cameras provide accurate perception of the environment within 1 km of the sensor. The detection range of each sensor is shown in Figure 7.
3.2. Multimodal Information Fusion
Multisensor fusion is inevitable in marine environmental perception, as it has been in drone operation and autonomous road vehicles. Video or images can be captured at low cost, in all weather, and in real-time and thus provide a large amount of reliable data. However, a single sensor produces a high rate of false or missed detections under poor light, mist or precipitation, and ship sway. This can be corrected if data from more than one source for the same target are fused to produce a single, more accurate, and more reliable record. There is therefore an urgent need to exploit multimodal sensors to reduce the probability of errors in detecting and tracking surrounding target vessels and thus improve perception of the surrounding environment in autonomous navigation and marine surveillance systems. Chen et al. merged onshore surveillance video and onboard AIS data for a target vessel; by linking AIS and the camera, they adjusted the camera attitude and focal length according to the Kalman-filtered AIS position to co-track target vessels [65]. Thompson linked lidar and camera to detect, track, and classify marine targets [66]. Helgesen et al. linked visible light camera, infrared camera, radar, and lidar for target detection and tracking on the sea surface [67] and used joint integrated probabilistic data association (JIPDA) to fuse data in the association stage. Haghbayan et al. used probabilistic data association filtering (PDAF) to fuse multisensor data from the same sources as Helgesen et al. [68]; a visible light camera provides the most accurate data for bounding box extraction, and Haghbayan et al. used a CNN for subsequent target classification. Farahnakian et al. compared visible light images with thermal infrared images (with coast, sky, and other vessels in the background) and found that, in day and night sea scenes, two deep learning fusion methods (DLF and DenseFuse) were clearly better than six conventional methods (VSM-WLS, PCA, CBF, JSR, JSRDS, and ConvSR) [69]. In the 2016 Maritime RobotX Challenge, Stanislas et al. fused data from camera, lidar, and millimeter-wave radar [70] and developed a method to concurrently create an obstacle map and a feature map for target recognition. Farahnakian et al. fused visible and infrared images at the pixel, feature, and decision levels to overcome the problems of operating in a harsh marine environment [71]; comparative experiments showed that feature-level fusion produced the best results. Farahnakian et al. also used a selective search to create a large number of candidate regions on the RGB image [72] and then used other modal sensor data to refine the selection and detect targets on the sea surface. In addition, for underwater vehicles and their interaction with surface vehicles [73], fusing acoustic sonar, or even geomagnetic sensors, with visible light cameras effectively expands the space for three-dimensional awareness of the marine environment [74,75].
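The pixel-, feature-, and decision-level options compared in [71] differ mainly in where the two streams are merged. The sketch below illustrates the feature-level variant with a small two-branch CNN that concatenates visible and thermal feature maps before classification; the architecture is a toy stand-in, not the DenseFuse or DLF models cited above, and it assumes the RGB and infrared images are already registered.

```python
import torch
import torch.nn as nn

class FeatureLevelFusionNet(nn.Module):
    """Toy two-branch network: encode RGB and thermal separately, then fuse."""

    def __init__(self, num_classes=2):
        super().__init__()
        def branch(in_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            )
        self.rgb_branch = branch(3)        # visible-light stream
        self.ir_branch = branch(1)         # thermal-infrared stream
        self.head = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, rgb, ir):
        # Feature-level fusion: concatenate the two feature maps along channels.
        fused = torch.cat([self.rgb_branch(rgb), self.ir_branch(ir)], dim=1)
        return self.head(fused)

# Hypothetical aligned RGB/IR pair just to show the expected shapes.
model = FeatureLevelFusionNet()
rgb = torch.randn(1, 3, 128, 128)
ir = torch.randn(1, 1, 128, 128)
print(model(rgb, ir).shape)   # torch.Size([1, 2])
```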
Visible light images or videos are the basic data for navigation in congested ports and for short range reactive collision avoidance. Data from other modal sensors, such as X-band radar or thermal infrared imaging, and AIS information can be fused with visible light images. Bloisi et al. fused visual images and VTS data [76]. They used remote surveillance cameras as the primary sensor in locations where radar and AIS do not operate (e.g., radar will set a specific sector as silent if it includes a residential area). Figure 8 shows the process for detection and tracking with fusion of visible images and millimeter-wave radar [77]. Low level and high level bimodal fusion occur concurrently. A deep CNN can be used for feature compression to ensure that features that meet the decision boundary are maximally extracted. Note that the tracking pipeline still works if the visible light camera in Figure 8 is replaced with a thermal camera, if the millimeter-wave radar is replaced with lidar, AIS, or navigation radar, or even if the two sets of sensors are combined.
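A common decision-level step in pipelines like the one in Figure 8 is to project each radar return into the image plane and associate it with the nearest camera bounding box, so that the camera contributes class and extent while the radar contributes range. The sketch below assumes a pinhole camera model; the calibration matrices, gating distance, and input values are hypothetical.

```python
import numpy as np

# Hypothetical calibration: intrinsic matrix K and radar-to-camera extrinsics.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                      # rotation from radar frame to camera frame
t = np.array([0.0, 0.5, 0.0])      # translation in meters (assumed)

def project_radar_point(point_radar):
    """Project a 3D radar return (x, y, z in the radar frame) into pixels."""
    point_cam = R @ point_radar + t
    uvw = K @ point_cam
    return uvw[:2] / uvw[2]

def associate_radar_to_boxes(radar_points, boxes, max_pixel_dist=80.0):
    """Decision-level fusion: attach each radar return to the nearest box center."""
    matches = []
    for idx, p in enumerate(radar_points):
        u, v = project_radar_point(np.asarray(p, dtype=float))
        centers = [((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0) for b in boxes]
        dists = [np.hypot(u - cu, v - cv) for cu, cv in centers]
        j = int(np.argmin(dists))
        if dists[j] <= max_pixel_dist:
            # Camera supplies class/box; radar supplies the range to the target.
            matches.append((idx, j, float(np.linalg.norm(p))))
    return matches

boxes = [[900, 480, 1020, 560], [300, 500, 380, 550]]      # camera detections
radar_points = [[2.0, 0.0, 60.0], [-40.0, 1.0, 55.0]]      # hypothetical returns
print(associate_radar_to_boxes(radar_points, boxes))
```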
4. Visual Perception Dataset on Water Surface
4.1. Visual Dataset on the Sea
Although there is a rich available selection of video or image datasets of pedestrians and vehicles, there are few available ship image datasets, especially full scene annotated datasets. Lack of such data severely restricts in-depth research.
Table 3 summarizes the datasets of ship targets that have been published to date (2 March 2021). The most representative datasets are SMD, SeaShips, and VesselID-539. The Singapore maritime dataset (SMD) was assembled and annotated by Prasad et al. [34]; it contains 51 annotated high definition video clips, consisting of 40 onshore videos and 11 onboard videos. Gundogdu et al. built a dataset containing 1,607,190 annotated ship images obtained from a ship-spotting website and binned into 197 categories; annotations include the location and the type of ship, such as general cargo, container, bulk carrier, or passenger ship [78]. Shao et al. created the large ship detection dataset SeaShips [10]. They collected data from 156 surveillance cameras in the coastal waters of Zhuhai, China, and took account of factors such as background discrimination, lighting, visible scale, and occlusion to extract 31,455 images from 10,080 videos. They identified and annotated, in VOC format, six ship types: ore carriers, bulk carriers, general cargo ships, container ships, fishing boats, and passenger ships. Unfortunately, only a very small number of the video clips have been published. Bovcon et al. created a semantic segmentation dataset consisting of 1325 manually annotated images for ASV research [28]. VesselID-539, the ship re-identification dataset created by Qiao et al., contains images that represent the four most common challenging scenarios; examples are given in Figure 9. It contains 539 vessel IDs and 149,363 images and marks 447,926 bounding boxes of vessels or partial vessels from different viewpoints.
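Since SeaShips (and several other detection datasets in Table 3) distribute their labels in PASCAL VOC XML format, loading them reduces to straightforward XML parsing. The sketch below follows the standard VOC field names; the file path is hypothetical, and the class strings are those listed above.

```python
import xml.etree.ElementTree as ET

def load_voc_annotation(xml_path):
    """Parse one PASCAL-VOC-style annotation file into (class, box) pairs."""
    root = ET.parse(xml_path).getroot()
    width = int(root.find("size/width").text)
    height = int(root.find("size/height").text)
    objects = []
    for obj in root.findall("object"):
        name = obj.find("name").text          # e.g., "ore carrier", "fishing boat"
        box = obj.find("bndbox")
        coords = tuple(int(float(box.find(k).text))
                       for k in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((name, coords))
    return (width, height), objects

# Hypothetical SeaShips-style annotation file.
size, ships = load_voc_annotation("SeaShips/Annotations/000001.xml")
print(size, ships)
```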
4.2. Visual Datasets for Inland Rivers
An inland river image typically contains target ships and other obstacles in the foreground against a background of sky, water surface, mountains, bridges, trees, and onshore buildings. This is a more complex image than would be produced on open water, so an inland river dataset and research based on it can provide a valuable reference for marine scenarios. Teng et al. collected ship videos from maritime CCTV systems in Zhejiang, Guangdong, Hubei, Anhui, and other places in China to create a dataset for visual tracking of inland ships [81]. The researchers considered factors such as time, region, climate, and weather and carefully selected 400 representative videos. Liu et al. adapted this dataset by removing video frames with no ship or with no obvious changes in target ship scale and background to create a visual dataset for inland river vessel detection [82]. The videos in this dataset were cropped to 320 × 240 pixels and divided into a total of 100 segments; each clip lasts 20–90 s and contains 400–2000 frames.
The training of a deep learning model requires a large amount of data. The requirements for diversity in weather conditions and for a variety of target scales, locations, and viewpoints make the acquisition of real world data both extremely difficult and expensive. The introduction of synthetic data has therefore become a popular method of augmenting a training dataset. Shin et al. used Mask R-CNN-based copy-and-paste to superimpose foreground target ships on different backgrounds [83] and thereby enlarge the dataset used for target detection; injecting approximately 12,000 additional synthetic images (only one-tenth of the total) into the training set increased the mean average precision (mAP, IoU = 0.3) by 2.4%. Chen et al. used a generative adversarial network for data augmentation to compensate for the limited availability of images of small target vessels (such as fishing boats or bamboo rafts) at sea and used YOLOv2 for small boat detection [84]. Milicevic et al. obtained good results with conventional methods, such as horizontal flipping, cut/copy-and-paste, and changing RGB channel values, to augment data for fine-grained ship classification [85]; adding 25,000 synthetic images increased classification accuracy by 6.5%. Although this gain is smaller than the 10% obtained by directly adding 25,000 original images, synthetic augmentation has practical value in alleviating the shortage of MSA datasets.
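The copy-and-paste strategy described above can be reproduced with a few lines of image compositing: cut a ship out with a binary mask and blend it onto a new maritime background, obtaining the new bounding-box label for free. The sketch below is a simplified version of that idea; the file names and paste location are hypothetical, and it omits the edge blending and scale matching a careful pipeline would add.

```python
import numpy as np
from PIL import Image

def paste_ship(background_path, ship_path, mask_path, top_left):
    """Composite a masked foreground ship onto a new sea background.

    mask: single-channel image, nonzero where the ship is; same size as the crop.
    top_left: (x, y) paste position in the background (assumed to fit inside it).
    """
    background = np.array(Image.open(background_path).convert("RGB"))
    ship = np.array(Image.open(ship_path).convert("RGB"))
    mask = np.array(Image.open(mask_path).convert("L")) > 0
    x, y = top_left
    h, w = ship.shape[:2]
    region = background[y:y + h, x:x + w]
    region[mask] = ship[mask]                 # hard paste; real pipelines blend edges
    # The paste location and mask directly give the new bounding-box label.
    ys, xs = np.nonzero(mask)
    bbox = (x + xs.min(), y + ys.min(), x + xs.max(), y + ys.max())
    return Image.fromarray(background), bbox

# Hypothetical inputs; one synthetic training image and its label are produced.
image, bbox = paste_ship("open_sea.jpg", "ship_crop.png", "ship_mask.png", (300, 400))
print("new annotation:", bbox)
```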
5. Conclusions and Future Work
Visual sensors are commonly used in addition to navigational radar and AIS to directly detect or identify obstacles or target vessels on the water surface. We have summarized some innovations in vision-based MSA using deep learning, presenting a comprehensive summary and analysis of the current status of research into full scene image parsing, vessel re-identification, vessel tracking, and multimodal fusion for perception of sea scenes. There have been theoretical and technological breakthroughs in these areas of research. Visual perception of the marine environment is a new and therefore immature area of cross-disciplinary research; future work in the field needs to be more directed and more sophisticated, focusing on two key issues summarized in this paper: the stability of perception and the ability to perceive dim and small targets. The following are areas that will reward future study.
(1). Multi-task fusion. Semantic segmentation and instance segmentation must be combined architecturally and at the task level. Prediction of the semantic label and the instance ID of each pixel in an image is required for full coverage of all objects in a marine scene, correctly dividing the scene into uncountable stuff (sea surface, sky, islands) and countable things (target vessels, buoys, other obstacles). The combination of fine-grained video object segmentation with object tracking requires the introduction of semantic mask information into the tracking process. This will require detailed investigation of scene segmentation combined with target detection and tracking under the existing tracking-by-detection and joint detection and tracking paradigms, but it is necessary to improve perception of target motion at a fine-grained level.
(2). Multi-modal fusion. Data provided by the monocular vision technology summarized in this paper can be fused with other modal data, such as the point cloud data provided by a binocular camera or a lidar sensor, to provide more comprehensive data for the detection and the tracking of vessels and objects in three-dimensional space. This will significantly increase the perception capability of ASV navigational systems. The pre-fusion of multimodal data in full-scene marine parsing, vessel re-identification, and tracking (i.e., fusing data before other activities occur) will allow us to make full use of the richer original data from each sensor, such as edge and texture features of images from monocular vision sensors or echo amplitude, angular direction, target size, and shape of X-band navigation radar. We will thus be able to incorporate data from radar sensors into image features and then fuse them with visual data at the lowest level to increase the accuracy of MSA.
(3). The fusion of DL-based awareness algorithms and traditional ones. Traditional awareness algorithms have the advantages of strong interpretability and high reliability. Combining them with the deep learning paradigm summarized in this paper would yield a unified perception framework with the advantages of both, further enhancing marine situational awareness.
(4). Domain adaptation learning. Lessons should be drawn from the development experience of unmanned vehicles and UAVs to build a ship–shore collaborative situational awareness system. In particular, it is necessary to start by improving self-learning ability and to introduce reinforcement learning to design models that can acquire new knowledge from sea scenes.
Author Contributions
Investigation, D.Q.; resources, W.L.; writing—original draft preparation, D.Q.; writing—review and editing, D.Q.; supervision, G.L.; project administration, T.L.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This work was funded in part by the China postdoctoral science foundation, grant number 2019M651844; in part by the natural science foundation of the Jiangsu higher education institutions of China, grant number 20KJA520009; in part by the Qinglan project of Jiangsu Province; in part by the project of the Qianfan team, innovation fund, and the collaborative innovation center of shipping big data application of Jiangsu Maritime Institute, grant number KJCX1809; and in part by the computer basic education teaching research project of association of fundamental computing education in Chinese universities, grant number 2018-AFCEC-266.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Informed consent was obtained from all subjects involved in the study.
Data Availability Statement
SMD:
Conflicts of Interest
The authors declare no conflict of interest.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Figures and Tables
Figure 3. Target detection, semantic segmentation, and instance segmentation from the same image.
Table 1. Comparison of the state-of-the-art (SOTA) ship target detection algorithms.
Method | Anchor-Free or Not | Backbone | Dataset | Performance |
---|---|---|---|---|
Mask R-CNN [9] | no | ResNet-101 | SMD | F-score: 0.875 |
Mask R-CNN [12] | no | ResNet-101 | MarDCT | AP(IoU = 0.5): 0.964 |
Mask R-CNN [12] | no | ResNet-101 | IPATCH | AP(IoU = 0.5): 0.925 |
RetinaNet [13] | no | ResNet-50 | MSCOCO + SMD | AP(IoU = 0.3): 0.880 |
Faster R-CNN [10] | no | ResNet-101 | Seaships | AP(IoU = 0.5): 0.924 |
SSD 512 [10] | no | VGG16 | Seaships | AP(IoU = 0.5): 0.867 |
CenterNet [12] | yes | Hourglass | SMD + Seaships | AP(IoU = 0.5): 0.893 |
EfficientDet [12] | no | EfficientDet-D3 | SMD + Seaships | AP(IoU = 0.5): 0.981 |
FCOS [13] | yes | ResNeXt-101 | Seaships + private | AP: 0.845 |
Cascade R-CNN [13] | no | ResNet-101 | Seaships + private | AP: 0.846 |
YOLOv3 [11] | no | Darknet-53 | private | AP(IoU = 0.5): 0.960 (specific ship) |
SMD: Singapore maritime dataset; CNN: convolutional neural network; AP: average precision; IoU: intersection over union.
Table 2. Comparison of visible light and other multimodal sensors for marine use.
Sensor | Frequency Range | Work Mode | Advantage | Disadvantage |
---|---|---|---|---|
RGB camera | visible light | passive | low cost | small coverage |
infrared camera | far infrared | passive | strong penetrating power | low imaging resolution |
navigation radar | X-band (8–12 GHz) | active | independent perception | low precision |
millimeter wave radar | millimeter wave | active | wide frequency band | short detection range |
automated identification systems (AIS) | very high frequency (161.975 or 162.025 MHz) | passive | robust for bad weather | low data rate |
lidar | laser (905 nm or 1550 nm) | active | 3D perception | sensitive to weather |
Table 3. Available image datasets of ship targets.
Datasets | Shooting Angle | Usage Scenarios | Resolution | Scale | Open Access |
---|---|---|---|---|---|
MARVEL [78] | onshore, onboard | classification | 512 × 512, etc. | >140k images | yes |
MODD [27] | onboard | detection and segmentation | 640 × 480 | 12 videos | yes |
SMD [34] | onshore, onboard | detection and tracking | 1920 × 1080 | 36 videos | yes |
IPATCH [35] | onboard | detection and tracking | 1920 × 1080 | 113 videos | no |
SEAGULL [36] | drone-borne | detection, tracking, and pollution detection | 1920 × 1080 | 19 videos | yes |
MarDCT [79] | onshore, overlook | detection, classification, and tracking | 704 × 576, etc. | 28 videos | yes |
SeaShips [10] | onshore | detection | 1920 × 1080 | >31k frames | partial |
MaSTr1325 [28] | onboard | detection and segmentation | 512 × 384 | 1325 frames | yes |
VesselReID [47] | onboard | re-identification | 1920 × 1080 | 4616 frames | no |
VesselID-539 [49] | onshore, onboard | re-identification | 1920 × 1080 | >149k frames | no |
Boat ReID [54] | overlook | detection and re-identification | 1920 × 1080 | 5523 frames | yes |
VAIS [80] | onshore | multimodal fusion | vary in size | 1623 visible light images | yes |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
The primary task of marine surveillance is to construct a comprehensive marine situational awareness (MSA) system that serves to safeguard national maritime rights and interests and to maintain blue homeland security. Progress in maritime wireless communication, developments in artificial intelligence, and automation of marine turbines together imply that intelligent shipping is inevitable in future global shipping. Computer vision-based situational awareness provides humans with visual semantic information that approximates eyesight, which makes it likely to be widely used in intelligent marine transportation. We describe how we combined the visual perception tasks required for marine surveillance with those required for intelligent ship navigation to form a marine computer vision-based situational awareness complex and investigated the key technologies they have in common, with deep learning as a prerequisite. We summarize the progress made in four aspects of current research: full scene parsing of an image, target vessel re-identification, target vessel tracking, and multimodal data fusion with data from visual sensors. The paper summarizes research to date to provide background for this work, presents brief analyses of existing problems, outlines some state-of-the-art approaches, reviews available mainstream datasets, and indicates the likely direction of future research and development. As far as we know, this paper is the first review of research into the use of deep learning in situational awareness of the ocean surface. It provides a firm foundation for further investigation by researchers in related fields.
1 College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China;
2 College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China;
3 College of Information Engineering, Jiangsu Maritime Institute, Nanjing 211170, China;
4 Nanjing Marine Radar Institute, Nanjing 211153, China;