Abbreviations
- AI: artificial intelligence
- CLIP: contrastive language image pretraining
- DETIC: detector with image classes
- mIoU: mean intersection over union
- VLM: vision-language model
INTRODUCTION
Insects are the most diverse group of organisms and are more abundant than any other (Smithsonian, 2024). Agronomists, plant scientists, and plant breeders need to identify species for conservation and control practices. Improper identification of species can result in biased economic or action threshold estimation, leading to unnecessary application of chemicals that could harm beneficial insects, reduce profitability, and leave an adverse environmental footprint. While manual scouting (and counting) remains the gold standard for pest identification and action threshold determination, it is a resource- and (expert) labor-intensive yet critical aspect of agriculture. Additionally, genomic studies and high-throughput phenotyping rely on accurate phenotypic data, including imaging data (Rairdin et al., 2022; A. K. Singh, Singh, Sarkar, et al., 2021). Therefore, advances in annotation and localization of objects of interest are an active area of research in current genetics and phenomics studies (D. P. Singh, Singh, Singh, et al., 2021). However, the task of identifying and localizing insects is very challenging due to (a) the large number of insect species, (b) several distinct species that exhibit remarkably similar visual features, (c) species exhibiting very diverse features along their developmental cycle (nymph vs. adult, larva vs. pupa vs. adult), and (d) images where the insect is difficult to differentiate from the background (e.g., green/brown-colored insect pests on green/brown backgrounds).
The availability of massive open-source image datasets (such as iNaturalist [Van Horn et al., 2018]) acquired in a crowd-sourced manner can be leveraged to build powerful deep neural network models that perform accurate insect classification and detection. However, traditional deep neural network-based object detectors require high-quality annotations (or labels) for all image samples in these datasets. Annotating labels for object detection involves either pixel-by-pixel labeling of class information or marking of tight bounding boxes for every image in a dataset. Consequently, creating datasets for artificial intelligence (AI) models for insect detection in a supervised manner can be very laborious, time-consuming, and prone to errors.
To overcome these considerable challenges, we introduce a new framework for zero-shot computer vision, that is, methods that require (almost) no manual supervision or annotation, in plant phenomics. To achieve this, we leverage recent advances in vision-language models (VLMs). Starting from contrastive language image pretraining (CLIP) (Radford et al., 2021b), the key idea in VLMs has been to pre-train deep networks that learn to match image data (possibly gathered in an unstructured manner) to associated captions that describe the contents of the images in natural language. The resulting models are remarkably robust: VLMs produce representations of images that transfer very well to a variety of downstream tasks. For many applications, VLMs such as CLIP also enable zero-shot performance, that is, performance with no extra supervision involving additional training data or computation. Rather, inference can be performed by simply specifying the category/class using a new natural language description. In our application, we show that merely coupling a recently proposed vision-language object detector called detector with image classes (DETIC) (Zhou et al., 2022) with a single (universal) natural language prompt provides highly accurate bounding boxes for a very large dataset of diverse insect images. See Figure 1 for an illustration.
[IMAGE OMITTED. SEE PDF]
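As a concrete illustration of this zero-shot mechanism (shown separately from the detection pipeline used in this paper), the minimal sketch below scores an image against a handful of natural-language prompts using the publicly released CLIP weights through the Hugging Face transformers library; the checkpoint name, the input file, and the prompts are illustrative assumptions rather than the exact choices made in this work.

```python
# Minimal CLIP zero-shot classification sketch (illustrative assumptions: the
# checkpoint, image file, and prompts are placeholders, not this paper's setup).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("field_photo.jpg")  # hypothetical input image
prompts = ["a photo of an insect", "a photo of a leaf", "a photo of soil"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity for each prompt.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze()
for prompt, p in zip(prompts, probs.tolist()):
    print(f"{prompt}: {p:.3f}")
```

Changing the class simply means changing the prompt string; no retraining is involved, which is the property our detection pipeline exploits.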
Core Ideas
- Annotation is the bottleneck to machine learning-based phenotyping. We show a way to get bounding boxes using a zero-shot method.
- Advances in coupled vision-language models allow very accurate zero-shot annotation.
- A vision-language object detection method coupled with weak language supervision is used to identify insects.
- A benchmark annotated dataset (with bounding boxes and segmentation masks) of 6 million images is created.
- This approach is widely applicable and can make phenotyping significantly more affordable.
We briefly distinguish between object classification and detection in image data. The majority of computer vision approaches for phenomics have focused on classification models, for which the input is an image and the prediction is one (or more) class labels assigned to the image. Detection models, on the other hand, solve two correlated problems: localizing objects of interest in an image and assigning class labels to the localized objects. One popular approach is a two-stage process (Girshick, 2015; He et al., 2020; Lin et al., 2020; Ren et al., 2015), wherein the models detect probable object region proposals, further fine-tune the bounding boxes, and predict classes. In contrast, a single-shot detector (Redmon & Farhadi, 2017, 2018) not only generates region proposals but also classifies them in a single forward pass. However, both rely on high-quality, fine-grained annotations of localized objects for training. Recent work on detection (Fang et al., 2021; Xu et al., 2021) attempts to remove the need for such fine-grained labeling by assigning labels to boxes based on model predictions. For example, You Only Look Once 9000 (Redmon & Farhadi, 2017) assigns labels to boxes based on the magnitude of prediction scores from the classification head. This requires good-quality box proposals a priori, which may be hard to obtain, essentially leading to a circular problem of needing good boxes for good class predictions and vice versa.
At a high level, the DETIC object detection framework works in three stages. In the first stage, a deep neural network takes an image as input and outputs a (potentially large) candidate set of rectangular bounding boxes corresponding to probable objects in the image; it also assigns an “objectness” score (a number between 0 and 1) to each such box. In the second stage, all boxes with an objectness score below a certain threshold are discarded. In the third and final stage, a second neural network is used to simultaneously classify each remaining bounding box proposal, using the CLIP classification model discussed above, and refine the bounding box to tightly surround the object. We deploy the DETIC framework to curate a very large dataset of agronomically and ecologically relevant insect images.
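To make this three-stage flow concrete, the runnable sketch below mirrors its structure in plain Python; the Proposal container and the clip_similarity and refine_box helpers are hypothetical stand-ins for the corresponding DETIC components, not the actual DETIC API.

```python
# Schematic sketch of the three-stage detection flow described above.
# Proposal, clip_similarity, and refine_box are toy placeholders, not DETIC code.
from dataclasses import dataclass

@dataclass
class Proposal:
    box: tuple          # (x1, y1, x2, y2) corner coordinates
    objectness: float   # score in [0, 1] from the proposal network (stage 1)

OBJECTNESS_THRESHOLD = 0.1   # a low threshold favors recall (see Methods)
VOCABULARY = ["insect"]      # single-word, weak language supervision

def clip_similarity(box, label):
    """Placeholder for scoring a cropped region against a text label with CLIP."""
    return 0.9

def refine_box(box):
    """Placeholder for the box-refinement regressor."""
    return box

def detect(proposals):
    detections = []
    for p in proposals:
        if p.objectness < OBJECTNESS_THRESHOLD:   # stage 2: discard weak proposals
            continue
        scores = {w: clip_similarity(p.box, w) for w in VOCABULARY}   # stage 3: classify...
        label = max(scores, key=scores.get)
        detections.append((refine_box(p.box), label, scores[label]))  # ...and refine
    return detections

print(detect([Proposal((10, 20, 80, 90), 0.75), Proposal((0, 0, 5, 5), 0.03)]))
```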
In summary, our contributions in this paper are threefold:
- We curate and release the Insecta rank class of iNaturalist to form a new benchmark dataset of approximately 6 million images, consisting of 2526 agriculturally important insect species.
- Using a vision-language object detection method coupled with weak language supervision, we automatically annotate images in this dataset with bounding boxes localizing the insects in each image.
- While our focus is primarily on insect detection, we demonstrate that our method can be extended to other plant phenomics applications, such as (zero-shot) fruit detection in color images of strawberry and apple trees.
Our method succeeds in detecting diverse insects present in a wide variety of backgrounds. In the absence of ground truth, we performed manual quality checks; over a carefully selected subset of images spanning diverse insect categories, our method produced tight bounding boxes for a large fraction of the samples. We therefore expect that our new benchmark dataset can be used in the future for building high-quality supervised models as well. Such zero-shot approaches involving language supervision, rather than annotation, can pave the way to significantly more affordable phenotyping.
METHODS
We curated a large benchmark dataset of agriculturally and ecologically important species. The images in the Insecta rank class were sourced from iNaturalist, a citizen science platform where users can upload photographs of specific organisms. The iNaturalist Open Data project is a curated subset of the overall iNaturalist dataset, created by iNaturalist to aid academic research; it specifically contains images released under Creative Commons licenses.
We created a workflow tool, iNaturalist Open Download, to easily download species images associated with a specific taxonomic rank from the iNaturalist Open Dataset. We used the tool to download all images of species under the rank class Insecta from the iNaturalist Open Dataset for downstream annotation, curation, and use in our model. We chose to use only images identified as “research” quality grade under the iNaturalist framework, which indicates that the labeling inspection for the image is more rigorous than standard and that the image has multiple agreeing identifications at the species level. This resulted in a total of 13,271,072 images across 95,399 different insect species. The images have a maximum resolution of 1024 × 1024 pixels, are in .jpg/.jpeg format, and total 5.7 terabytes. Among the 95,399 insect species, we selected 2526 species identified as being among the most agriculturally and ecologically important. This subset of insect classes constitutes approximately 6 million images in total. We emphasize that while these images have carefully annotated species labels, they do not have any bounding box or segmentation mask information. Generating this localization information constitutes the detection problem and was one of the major objectives of our work.
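For concreteness, the metadata side of this curation step might look like the sketch below, which filters the iNaturalist Open Data tables to research-grade, species-level records and joins them to photo entries. The file names, column names, and delimiter are assumptions based on our reading of the public metadata export, and the restriction to the class Insecta (via the taxonomy ancestry) and to the 2526 selected species is elided.

```python
# Sketch of curating research-grade images from iNaturalist Open Data metadata.
# File names, column names, and the delimiter are assumptions; check them against
# the actual export you download. The Insecta/selected-species filter is elided.
import pandas as pd

taxa = pd.read_csv("taxa.csv", sep="\t")                  # taxonomy table
observations = pd.read_csv("observations.csv", sep="\t")  # one row per observation
photos = pd.read_csv("photos.csv", sep="\t")              # one row per photo

# Keep species-level taxa only.
species = taxa[taxa["rank"] == "species"]

# Keep research-grade observations of those species.
research = observations[
    (observations["quality_grade"] == "research")
    & observations["taxon_id"].isin(species["taxon_id"])
]

# Join to photo records to obtain downloadable image entries.
images = photos.merge(research, on="observation_uuid")
print(f"{len(images):,} candidate images")
```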
DETIC (Zhou et al., 2022), a new approach for open-set object detection, presents an interesting zero-shot solution to this problem. The approach works by training detection models simultaneously on object detection and image classification datasets. DETIC has two advantages over traditional detectors: (1) it can learn from image classification datasets, which are generally larger than detection datasets and contain a wide variety of classes, and (2) the CLIP embeddings used as the classification head allow for a far larger number of classes. Thus, contrary to standard detection models, DETIC does not require fine-tuning and can be used for zero-shot detection on natural images. We therefore leveraged DETIC to annotate insect bounding boxes.
We used the highest performing DETIC model from Zhou et al. (2022), which is built on a shifted-window (Swin) transformer base backbone. The model was pre-trained by its creators on two large-scale computer vision datasets (Common Objects in Context and ImageNet-21k), with CenterNet employed as the region proposal network, and has been made publicly available for research purposes. In the first stage of DETIC, we ran the detector network on all images in our curated benchmark dataset. In the second stage, in order to ensure high recall, we used a low objectness confidence threshold of 0.1. For the final classification stage, our vocabulary contains only the single word “insect.” In this manner, our method can be viewed as an instance of (very) weak language supervision: our object detector does not require any fine-grained natural language prompts, no further information is needed to specify observable traits or phenotypes of insects, and therefore our method does not require expensive human expert annotations.
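To illustrate how a one-word vocabulary acts as the classifier, the sketch below embeds the prompt with the publicly released CLIP weights (via the Hugging Face transformers library) and normalizes it into a single class vector; the checkpoint name and prompt template are assumptions, and DETIC constructs its classification head internally rather than through this code.

```python
# Sketch: a one-word vocabulary ("insect") turned into a single CLIP class vector.
# Checkpoint and prompt template are assumptions; DETIC wires this up internally.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(["a photo of an insect"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_features = model.get_text_features(**tokens)

# The normalized embedding acts as the (only) row of the classifier weight matrix:
# every surviving box proposal is scored against this single "insect" vector.
classifier_weights = torch.nn.functional.normalize(text_features, dim=-1)
print(classifier_weights.shape)  # e.g., (1, 512) for a ViT-B/32 CLIP backbone
```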
We ran our implementation of DETIC on the curated dataset using distributed inference on the Greene supercomputing cluster at New York University, which comprises a mix of NVIDIA V100, A100, and RTX8000 graphics processing units. Annotation speed averaged 0.15 s per image, indicating that our approach could be suitably deployed for real-time applications.
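As a rough sanity check on the scale of this computation (assuming the 0.15 s/image average holds across the full curated dataset), a back-of-the-envelope estimate of the single-GPU-equivalent cost is:

```python
# Back-of-the-envelope annotation cost, assuming the 0.15 s/image average above
# holds across all ~6 million curated images (ignores the cluster parallelism).
n_images = 6_000_000
seconds_per_image = 0.15
gpu_hours = n_images * seconds_per_image / 3600
print(f"~{gpu_hours:.0f} GPU-hours")  # ~250 GPU-hours
```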
EXPERIMENTAL RESULTS
We now showcase the performance of our proposed approach.
Example results
Figures 2 and 3 show the considerable promise of zero-shot localization on our curated Insecta dataset. Each row of these figures illustrates a challenging scenario for insect detection and segmentation. The first row consists of insects in camouflaging environments. Green insects against green backgrounds or brown insects against brown backgrounds are a fairly common occurrence and are particularly challenging, even for human experts. The second row illustrates the capability of weak language supervision to detect and segment insects exhibiting a diverse set of morphologies. Images of insects usually contain more than one individual and may contain more than one species. The third row showcases the capability of the model to detect and segment multiple insects within an image frame. This is particularly important for estimating the action threshold that drives mitigation decisions. Finally, the fourth row shows the ability of the model to detect and segment insects across the life cycle. This ability to identify and localize early stages (eggs and nymphs) of invasive species, like the spotted lanternfly, is critical to mitigate widespread ecosystem damage.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
Human quality checks
The images shown represent a small subset of our manual quality check, drawn from a diverse set of insect species spanning the phylogenetic tree of the class Insecta. Two domain experts recruited from the senior authors’ research labs, both trained in working with insect image data, evaluated a selected subset of 150 images annotated by our model and labeled an image as “correct” if a tight bounding box was achieved; the quality labels were then tabulated. We record the percentage of images designated “correct” by both domain experts; higher percentages indicate better alignment of the model with human perception. Overall, we observe very strong performance, with a quality metric of 88%. These results suggest that DETIC, with weak language supervision, can produce high-quality bounding boxes that are well aligned with human perception standards; see Figure 3 for some examples. As an additional advantage, DETIC also provides pixel-wise segmentation maps, which are displayed in these results; we focus on bounding boxes for the remainder of the paper.
Automated quality checks
Finally, we also compared our automatically generated bounding boxes to the insect-level bounding boxes provided as part of the iNaturalist 2017 dataset of Van Horn et al. (2018). This is an earlier release of the iNaturalist data from which we sourced the images for our benchmark. Importantly, this earlier release contains a (small) human-annotated subset of nearly 100,000 images labeled with bounding boxes. Comparing the bounding boxes obtained by our approach with these human-provided labels provides a statistically rigorous characterization of how well our method performs in practice.
We first identified duplicate images that occur in both the iNaturalist 2017 dataset and our 6-million-image Insecta dataset. We achieved this by performing deduplication using OpenAI’s vision transformer-L (ViT-L) CLIP model (Radford et al., 2021a) with a high confidence threshold (0.98). We found 93,072 images that occurred in both the iNaturalist 2017 dataset and our dataset using this approach. Inspecting 200 images chosen at random from these automatically identified duplicates, we found that 100% of the sampled images were indeed true duplicates. We then resized the bounding boxes associated with the iNaturalist images (wherever necessary) and compared them to the boxes generated by DETIC. To numerically benchmark performance, we use the mean intersection over union (mIoU) metric, a widely used performance metric in computer vision. It measures the overlap between predicted and ground-truth boxes by computing the area of their intersection and dividing by the area of their union. An mIoU value of one indicates perfect performance, and in practice, several state-of-the-art detectors for long-tailed, fine-grained classification, even detectors trained on millions of annotated image samples, typically exhibit mIoU values in the range of 0.5 to 0.7.
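For reference, the quantity averaged by mIoU is the per-pair intersection over union, which can be computed as in the minimal sketch below (axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates; this is generic illustration code, not our evaluation pipeline).

```python
# Minimal IoU for axis-aligned boxes given as (x1, y1, x2, y2); mIoU averages
# this value over matched prediction/ground-truth pairs. Illustration only.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```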
We calculate mIoU metrics in several settings. The total size of the test set is 77,369 images, distributed over 785 classes. Of those, 5944 (around 8%) are false negatives: DETIC failed to generate a bounding box where it should have generated one. The overall mIoU including false negatives is 0.686; excluding false negatives, it is 0.743. Our test dataset is imbalanced across classes; therefore, we also report a per-class (class-balanced) mean, in which every class contributes equally regardless of its size. This class-balanced mean (including false negatives) is 0.70, with a standard deviation of 0.16 across classes (fairly concentrated around the mean).
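The difference between the overall mean and the class-balanced mean can be made explicit with a toy example (the values below are made up for illustration and are not our evaluation results; false negatives are included by assigning them an IoU of 0):

```python
# Toy illustration of overall vs. class-balanced mIoU aggregation (made-up values).
import pandas as pd

results = pd.DataFrame({
    "species": ["aphid", "aphid", "aphid", "monarch"],
    "iou": [0.80, 0.75, 0.00, 0.60],  # 0.00 marks a false negative
})

overall = results["iou"].mean()                             # each image weighted equally
balanced = results.groupby("species")["iou"].mean().mean()  # each class weighted equally
print(f"overall {overall:.3f}, class-balanced {balanced:.3f}")  # rare classes count more in the latter
```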
Overall, our results indicate a relatively strong agreement between the bounding boxes obtained by our method and human annotators, even though our method is zero-shot and requires no a priori annotations. This also provides additional confidence that our annotated dataset is of high quality and can be used for downstream computer vision tasks for agriculturally and ecologically relevant insect characterization.
CONCLUSIONS AND EXTENSIONS
In this paper, we have shown that new advances in VLMs can be used to effectively localize insects in unstructured image data, in a fully zero-shot manner, without requiring any auxiliary training. This is a very promising approach to affordable phenotyping.
We release an open-source, high-quality dataset for insect detection from images. Our dataset consists of over 6 million images of agriculturally relevant insects, automatically annotated with bounding boxes that localize each insect. We have performed both manual and automated assessments to confirm that the dataset is of very high quality. To our knowledge, this is the largest dataset of such images automatically annotated using weak language supervision, and we expect that it can be used for future research targeting ecology and agricultural production. This dataset can serve as a crucial resource for developing and training computer vision models for a wide range of applications. One timely application space is precision agriculture, where automated, spatiotemporally resolved identification of insect pests can drive advances in cyber-agricultural systems and decision support for sustainable and profitable agriculture. This dataset and approach can also directly impact biodiversity maintenance and tracking by allowing rigorous and automated quantification of biodiversity gain/loss, which can help direct investments and policy.
While our focus in this paper has been on zero-shot insect detection, we envision that similar methods can be effectively developed for other plant phenotyping applications. We conducted preliminary experiments on localizing common fruits (such as apples and strawberries) in image data and achieved promising results. Figure 4 shows example images where a method very similar to that of Figure 1, using simple text prompts, gives accurate zero-shot localization even in challenging settings. For example, our method is able to detect and localize partially occluded apples (Figure 4, left), as well as identify both ripe and unripe strawberries (Figure 4, right).
[IMAGE OMITTED. SEE PDF]
Our findings pave the way for several future improvements. First, our generated bounding box information can be useful in other downstream tasks in insect/pest monitoring, such as visual odometry and decision support. Second, better bounding boxes could be achieved, perhaps with improved language supervision and class-specific prompts. We expect the curated, quality-controlled dataset to be of significant interest to the scientific community. Finally, we have used a generic object detector (DETIC) in our methods; a VLM trained on species-specific data may achieve improved results.
AUTHOR CONTRIBUTIONS
Benjamin Feuer: Conceptualization; data curation; investigation; methodology; validation; writing—original draft; writing—review and editing. Ameya Joshi: Methodology; validation; writing—original draft. Minsu Cho: Data curation; investigation; methodology. Shivani Chiranjeevi: Data curation; investigation; validation; writing—review and editing. Zi Kang Deng: Data curation. Aditya Balu: Resources; supervision. Asheesh K. Singh: Supervision; writing—review and editing. Soumik Sarkar: Supervision; writing—review and editing. Nirav Merchant: Data curation; supervision. Arti Singh: Conceptualization; data curation; investigation; supervision; validation; writing—review and editing. Baskar Ganapathysubramanian: Conceptualization; funding acquisition; project administration; resources; supervision; writing—review and editing. Chinmay Hegde: Conceptualization; formal analysis; funding acquisition; investigation; methodology; project administration; supervision; writing—original draft; writing—review and editing.
ACKNOWLEDGMENTS
This work was supported by the AI Institute for Resilient Agriculture (USDA-NIFA #2021-67021-35329), COALESCE: COntext Aware LEarning for Sustainable CybEr-Agricultural Systems (NSF CPS Frontier #1954556), FACT: A Scalable Cyber Ecosystem for Acquisition, Curation, and Analysis of Multispectral UAV Image Data (USDA-NIFA #2019-67021-29938), Smart Integrated Farm Network for Rural Agricultural Communities (SIRAC) (NSF S&CC #1952045), and USDA CRIS Project IOW04714. Support was also provided by the Plant Sciences Institute.
Open access funding provided by the Iowa State University Library.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
Fang, S., Cao, Y., Wang, X., Chen, K., Lin, D., & Zhang, W. (2021). WSSOD: A new pipeline for weakly- and semi-supervised object detection. arXiv. https://arxiv.org/abs/2105.11293
Girshick, R. B. (2015). Fast R‐CNN. In International Conference on Computer Vision (ICCV) (pp. 1440–1448). IEEE.
He, K., Gkioxari, G., Dollar, P., & Girshick, R. (2020). Mask R‐CNN. Transactions on Pattern Analysis and Machine Intelligence, 42, 386–397. [DOI: https://dx.doi.org/10.1109/TPAMI.2018.2844175]
© 2024. This work is published under the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/).
Abstract
Cheap and ubiquitous sensing has made collecting large agricultural datasets relatively straightforward. These large datasets (for instance, citizen science data curation platforms like iNaturalist) can pave the way for developing powerful artificial intelligence (AI) models for detection and counting. However, traditional supervised learning methods require labeled data, and manual annotation of these raw datasets with useful labels (such as bounding boxes or segmentation masks) can be extremely laborious, expensive, and error‐prone. In this paper, we demonstrate the power of zero‐shot computer vision methods—a new family of approaches that require (almost) no manual supervision—for plant phenomics applications. Focusing on insect detection as the primary use case, we show that our models enable highly accurate detection of insects in a variety of challenging imaging environments. Our technical contributions are two‐fold: (a) We curate the Insecta rank class of iNaturalist to form a new benchmark dataset of approximately 6 million images consisting of 2526 agriculturally and ecologically important species, including pests and beneficial insects. (b) Using a vision‐language object detection method coupled with weak language supervision, we are able to automatically annotate images in this dataset with bounding box information localizing the insect within each image. Our method succeeds in detecting diverse insect species present in a wide variety of backgrounds, producing high‐quality bounding boxes in a zero‐shot manner with no additional training cost. This open dataset can serve as a use‐inspired benchmark for the AI community. We demonstrate that our method can also be used for other applications in plant phenomics, such as fruit detection in images of strawberry and apple trees. Overall, our framework highlights the promise of zero‐shot approaches to make high‐throughput plant phenotyping more affordable.
AFFILIATIONS
1 Department of Electrical and Computer Engineering, New York University, New York, New York, USA
2 Translational AI Center, Iowa State University, Ames, Iowa, USA
3 Data Science Institute, University of Arizona, Tucson, Arizona, USA