Abstract
Camera traps provide a low-cost approach to collecting data and monitoring wildlife across large scales, but hand-labeling images quickly enough to keep pace with their accumulation is difficult. Deep learning, a subdiscipline of machine learning and computer science, can automatically classify camera-trap images with a high degree of accuracy. The technique, however, may be less accessible to ecologists and small-scale conservation projects, and it has serious limitations. In this study, we trained a simple deep learning model using a dataset of 120,000 images to identify the presence of nilgai Boselaphus tragocamelus, a regionally specific nonnative game animal, in camera-trap images with an overall accuracy of 97%. We trained a second model to identify 20 groups of animals and one group of images without any animals present, labeled as "none," with an accuracy of 89%. Lastly, we tested the multigroup model on images of similar species collected in the southwestern United States, which resulted in substantially lower precision and recall for each group. This study highlights the potential of deep learning for automating camera-trap image processing workflows, provides a brief overview of image-based deep learning, and discusses the often-understated limitations and methodological considerations in the context of wildlife conservation and species monitoring.
Introduction
Camera traps, wireless cameras placed on trees or posts and activated via motion sensors, are important tools for wildlife studies. Wildlife biologists have used them to estimate population densities (Howe et al. 2017), create species lists and inventories in dense tropical environments (Srbek-Araujo and Chiarello 2005; Lading 2006), understand population sizes and distributions (O'Connell et al. 2010), and identify new species (Rovero and Rathbun 2006). Their relatively low cost and ease of use make them scalable across large geographic regions. A common problem, however, is the rapid accumulation of images that outpaces the ability of users to manually sort and label them (Swanson et al. 2016). To address this issue, researchers have identified deep learning, a subfield of machine learning, as a powerful technique for automating the process of classifying, or grouping, images by species (Gomez et al. 2016; Norouzzadeh et al. 2018; Willi et al. 2019). Applications of deep learning for camera-trap classification have often relied on extremely large collections of images, such as Snapshot Serengeti (∼7 million images) or the North American Camera Trap dataset (3.3 million images), for training (Swanson et al. 2015; Tabak et al. 2019; Schneider et al. 2020). Transfer learning, a deep learning technique that starts with a pretrained model as a base for further learning, can overcome this reliance on massive datasets. Both Schneider et al. (2020) and Shahinfar et al. (2020) found that they needed only 1,000 images per class to achieve accuracies of 97 and 98%, respectively, for eight classes. Despite growing popularity, applications of transfer learning for rapid camera-trap classification may still be beyond the expertise of many ecologists and conservation practitioners.
Our aim was to present an application of deep learning–based camera-trap analysis using a small dataset of 120,000 images. We trained a model using transfer learning, evaluated its accuracy, and demonstrated its limitations when applied to images outside the model's training context. We leveraged a nature-specific model by Cui et al. (2018) as a base to train a south Texas–specific animal classifier. More specifically, we drew from a local database of camera-trap images to train 1) a binary classifier that detects the presence of a single species, nilgai Boselaphus tragocamelus, an exotic bovid with expanding populations in south Texas, and 2) a multigroup classifier for 20 animal groups and one "none" group. Lastly, we tested the model's ability to generalize to images with similar classes but in different settings using the CalTech camera-trap dataset collected in the southwestern United States (Beery et al. 2018). Resources and further details about training and implementation are available at the authors' GitHub repository (Data S1–S4, Text S1 and S2, Supplemental Material).
Study Site
We collected image data from motion-sensitive cameras placed in areas of known wildlife activity in Cameron County in the lower Rio Grande Valley of Texas from 2018 to 2019. The county lies along the international border and is characterized by a mosaic of shrubby plants, mesquite, and semiarid vegetation. Ranchers introduced free-ranging nilgai, native to the Indian subcontinent, in the 1930s (Leslie 2008). Although there appears to be no competition with other native species, nilgai inhabit areas that support species of conservation concern, such as northern populations of ocelot Leopardus pardalis and perhaps the Gulf Coast jaguarundi Puma yagouaroundi cacomitli (Schmidly 2004; Leslie 2016). Furthermore, recent studies reveal that nilgai are optimal hosts for the southern cattle-fever tick Rhipicephalus microplus and have complicated current efforts to eradicate this exotic pest of wildlife and livestock (Lohmeyer et al. 2018). As such, monitoring nilgai behavior, populations, and distribution has important implications for both wildlife management and agriculture in the region (Foley et al. 2017; Goolsby et al. 2019).
Methods
Image data and preprocessing
We randomly drew images for each group from a local database that is part of a multiyear field research project aimed at treating cattle fever tick–infested nilgai at fence crossings. Research technicians with advanced experience in recognizing animals of interest hand-labeled images using the open-access Colorado Parks and Wildlife Photo Warehouse, a custom Microsoft Office Access application designed specifically to store, manage, label, and analyze wildlife camera-trap data (Ivan and Newkirk 2016). We created the three types of datasets necessary for training deep neural networks: 1) a large training set (∼85% of total images) for model learning, 2) a smaller validation set (∼5% of total images) for frequent testing and adjustment of model settings, and 3) a test set (∼10% of total images) to evaluate the final trained model. We created separate training, validation, and test sets for each classifier.
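For illustration, the split can be scripted as in the minimal sketch below; the directory names and one-folder-per-group layout are assumptions for the example, not the authors' exact pipeline.

```python
# Minimal sketch of an 85/5/10 train/validation/test split.
# Assumes one subfolder of images per labeled group (e.g., "nilgai/"); paths are hypothetical.
import random
import shutil
from pathlib import Path

def split_dataset(source_dir, dest_dir, fractions=(0.85, 0.05, 0.10), seed=42):
    random.seed(seed)
    source_dir, dest_dir = Path(source_dir), Path(dest_dir)
    for group_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
        images = sorted(group_dir.glob("*.jpg"))
        random.shuffle(images)
        n = len(images)
        cut1 = int(fractions[0] * n)
        cut2 = cut1 + int(fractions[1] * n)
        subsets = {"train": images[:cut1],
                   "validation": images[cut1:cut2],
                   "test": images[cut2:]}
        for subset, files in subsets.items():
            out = dest_dir / subset / group_dir.name
            out.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy(f, out / f.name)  # copy each image into its subset folder

# Example usage: split_dataset("labeled_images", "datasets")
```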
Balancing training set
A balanced training set contains an even distribution of images across each group. The original raw image set of >2.5 million images was highly imbalanced, with 84% (∼2 million images) containing no wildlife, which we labeled as "none." The seven most common groups were feral pigs Sus scrofa, falsely triggered camera events, human activity, birds, nilgai, deer Odocoileus virginianus, and cattle. Camera-trap datasets are often imbalanced because wind, grass, or other nontarget objects create false capture events. Training on the complete dataset would be problematic because models can favor groups with more examples while ignoring those with only a few (Norouzzadeh et al. 2018). The model could overfit in such a way that a single group ("none") is predicted for every instance and still yield a high overall accuracy. To correct the imbalance, we oversampled, or sampled with replacement, so that each group had roughly the same number of images (He and Garcia 2009). For example, if the "dog" group had only 50 unique images, we copied each until the total number of images matched that of the most frequently occurring group. While this oversampling technique balances the dataset, it has drawbacks. Because it repeats images in rare groups, the model may lack the robustness needed to generalize to new examples of those groups in the future. This can be an issue for conservation projects focusing on rare species that are important to monitor but rarely occur. For this study, however, the most important group, "nilgai," was one of the most frequently occurring. Still, to reduce the number of copies needed for oversampling, we lowered our total image set size from 2.5 million to 120,000 by taking slightly more images than the count of the next most frequent group ("human"). Additionally, a dataset of 120,000 images instead of 2.5 million lowered training time from weeks to days. We further altered the data by combining or eliminating groups. We combined four groups—"feral cat," "ocelot," "bobcat" Lynx rufus, and "exotics, other"—to create the "cat" group and eliminated "unknown" and "squirrel." These groups either lacked sufficient examples or were mislabeled (e.g., an image of a bobcat labeled as ocelot). Each capture event included three images taken in rapid succession. Research technicians classified individual images, not capture events, and each image contributed to the total dataset size and class counts.
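As a concrete illustration of this oversampling step, the sketch below pads each rare group by sampling with replacement until it matches the largest group; the `train_files` mapping is a hypothetical data structure used only for the example.

```python
# Minimal oversampling sketch: duplicate images in rare groups until every group
# matches the most frequent one. `train_files` (group name -> list of image paths)
# is a hypothetical structure for illustration.
import random

def oversample(train_files, seed=42):
    random.seed(seed)
    target = max(len(files) for files in train_files.values())
    balanced = {}
    for group, files in train_files.items():
        copies = list(files)
        while len(copies) < target:
            copies.append(random.choice(files))  # sample with replacement
        balanced[group] = copies
    return balanced

# Example: a "dog" group with 50 images is padded with repeated copies
# until it matches the size of the most frequent group.
```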
We applied four types of data augmentation, a technique commonly used to strengthen model predictions by slightly altering images: we rotated, shifted, sheared, and flipped images both horizontally and vertically. We applied a different random augmentation to each image in every training cycle. Preprocessing also included rescaling pixel values to between 0 and 1 and resizing images from 2,048 × 1,152 to 299 × 299 pixels, standard procedures that reduce the computational expense of training. The seven most common groups, including feral hogs, the "none" group, human activity, birds, nilgai, white-tailed deer, and cattle, are shown in Figure 1. Data preprocessing is an important step for reducing computational demands and increasing model robustness.
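A minimal sketch of these preprocessing steps using the Keras ImageDataGenerator API follows; the specific augmentation ranges and directory layout are illustrative assumptions rather than the exact settings used in the study.

```python
# Sketch of the four augmentations plus rescaling and resizing with Keras.
# Augmentation ranges and the "datasets/train" layout are assumptions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_generator = ImageDataGenerator(
    rescale=1.0 / 255,       # rescale pixel values to [0, 1]
    rotation_range=20,       # rotate
    width_shift_range=0.1,   # shift horizontally
    height_shift_range=0.1,  # shift vertically
    shear_range=0.2,         # shear
    horizontal_flip=True,    # flip horizontally
    vertical_flip=True,      # flip vertically
).flow_from_directory(
    "datasets/train",        # one subfolder per group
    target_size=(299, 299),  # resize from 2,048 x 1,152 to 299 x 299 pixels
    batch_size=32,
    class_mode="categorical",
)
```

Validation and test images would be rescaled and resized in the same way but not augmented.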
Deep learning
A subfield of machine learning, deep learning aims to extract information from big data by learning successive layers of increasingly meaningful representations called features (Chollet 2018). A neural network, a type of deep learning model, is made up of many such layers, trained on labeled data, that extract features hierarchically. Information from earlier layers informs later layers and is stored in the form of weights, which the network uses to make predictions on new, unlabeled data. The network uses predicted and actual values to calculate an error score that is propagated back through the network to adjust the weights. Learning occurs iteratively as the network updates its weights to reduce this error score. Early layers learn to react strongly to simple features such as edges, lines, and sharp color gradients, while the final layer assigns the probability that an input image belongs to a class such as "nilgai" or "deer." In this way, the model distills features hierarchically from complex input images to a single prediction value (Figure 2; Toda and Okura 2019).
Training a neural network from scratch often requires large amounts of data. Transfer learning, an approach useful for training on small datasets, instead applies the stored knowledge of a model pretrained on large, generic data as a base for similar but more specific problems. Knowledge is transferred in the form of saved files that contain weights, complete or partial model architectures, and settings. Researchers can easily download model parameters from open-source libraries and read them into a new training instance. Feature extraction, the first step in using a pretrained model, involves replacing and training only the final layer of the network on a new, problem-specific dataset. The second step, fine-tuning, trains all layers, including the newly added final layer, adjusting the network's weights to make the model task-specific. Feature extraction must occur first: training the new final layer before adjusting the rest of the network prevents overly large weight updates that could degrade the pretrained features and harm inference, or model prediction. Our base model was pretrained by Cui et al. (2018), who used the iNaturalist 2017 dataset of 579,184 images of nature-specific objects, including insects, mammals, and amphibians (Ueda 2017; Van Horn et al. 2018). We then trained on a smaller but domain-specific dataset of south Texas wildlife (Figure 3).
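The two-stage procedure can be sketched with the Keras InceptionV3 implementation as follows. Keras ships ImageNet weights by default, whereas our training started from the iNaturalist weights of Cui et al. (2018), which would instead be loaded from file; the layer choices, learning rates, and generator names below are illustrative assumptions.

```python
# Minimal two-stage transfer-learning sketch (feature extraction, then fine-tuning).
# Keras provides ImageNet weights out of the box; iNaturalist-pretrained weights
# (Cui et al. 2018) would be loaded from a saved .h5 file instead.
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(21, activation="softmax")(x)  # 21 groups; the binary classifier would use 2
model = Model(base.input, outputs)

# Stage 1: feature extraction -- freeze the pretrained base and train only the new final layer.
for layer in base.layers:
    layer.trainable = False
model.compile(optimizer=Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_generator, validation_data=val_generator, epochs=10)

# Stage 2: fine-tuning -- unfreeze all layers and continue training at a lower learning rate.
for layer in base.layers:
    layer.trainable = True
model.compile(optimizer=Adam(learning_rate=1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_generator, validation_data=val_generator, epochs=10)
```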
Training and evaluation
We customized the InceptionV3 model, defined by its sequence and type of layers, to our number of groups (Szegedy et al. 2016). After each training cycle, we used the validation set to monitor performance and adjust model settings. In total, the model updated ∼21 million weight parameters until it stopped improving on the validation set; training took roughly 24 hours for both the multigroup and binary classifiers using a single graphics processing unit. After adjustments and training were complete, we evaluated each model by reporting prediction results on the test set—the numbers of true positives, true negatives, false positives, and false negatives—for each classifier. We calculated five common accuracy metrics: overall accuracy, precision, recall, the harmonic mean of precision and recall known as the F1 score, and the Matthews correlation coefficient, an adjusted form of the φ coefficient (Table 1; Guilford 1954). We used a second test set, collected from the southwestern United States and known as the CalTech dataset, to further evaluate model robustness (Beery et al. 2018).
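For reference, the five metrics can be computed from the test-set predictions with scikit-learn as in the minimal sketch below; the label lists shown are placeholders for illustration, not the study's actual predictions.

```python
# Sketch of computing the reported accuracy metrics with scikit-learn.
# y_true and y_pred are placeholder labels for illustration only.
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

y_true = ["nilgai", "deer", "none", "nilgai"]
y_pred = ["nilgai", "deer", "nilgai", "nilgai"]

print("overall accuracy:", accuracy_score(y_true, y_pred))
print("precision       :", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall          :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1 score        :", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("Matthews coeff. :", matthews_corrcoef(y_true, y_pred))
```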
Results
The trained binary classifier achieved an overall accuracy of 0.97, an F1 score of 0.97, and a Matthews correlation coefficient of 0.94, indicating that the classifier was able to generalize to new images from the same area and accurately predict the presence of nilgai. During training, we found an ∼15% increase in validation accuracy from the first to the second stage. Recall (0.98) was slightly higher than precision (0.96), which is favorable for this task: the occasional deer or cattle image classified as nilgai is acceptable because research technicians will likely review and "catch" these images, whereas a lost and uncounted nilgai image is more detrimental to overall project goals. For multigroup problems, the average Matthews correlation coefficient is a more appropriate evaluation metric because it pools performance over all samples and groups. Our multigroup classifier achieved an average Matthews correlation coefficient of 0.89. Group-wise test results and evaluation metrics show that two of the classes with the highest correlation coefficients—"skunk" Mephitis mephitis and "tortoise" Gopherus berlandieri—were the most imbalanced, each having <22 images (Table 2). The three most common groups in our dataset—"nilgai," "deer," and "none"—also showed strong correlation coefficients. Overall, the multigroup classifier was successful in classifying 21 groups (Figure 4). For the second evaluation using the CalTech dataset, we adjusted classes to align with those of the south Texas dataset: we removed dissimilar classes ("bat," "lizard," "badger"), combined similar classes ("car" and "human"), and renamed classes when appropriate ("bobcat" to "cat"). The average Matthews correlation coefficient for the CalTech dataset was 0.22, and further inspection of the other four metrics by class also indicated very poor performance (Table 3).
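This class adjustment amounts to a simple label mapping, as in the sketch below; the entries are assumptions reconstructed from the description above rather than the exact mapping used in the study.

```python
# Illustrative label mapping for aligning CalTech classes with the south Texas groups;
# the entries below are assumptions based on the adjustments described in the text.
remap = {"bobcat": "cat",   # renamed to match the combined "cat" group
         "car": "human"}    # vehicle and human activity combined
drop = {"bat", "lizard", "badger"}  # classes with no south Texas counterpart

def harmonize(label):
    """Return the matching south Texas group, or None if the class was removed."""
    if label in drop:
        return None
    return remap.get(label, label)
```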
Discussion
Our aim was to test whether we could use a small number of hand-labeled camera-trap images to train a deep learning model to automatically detect wildlife, including a specific species. We also explored the limits of our model by testing it on a dataset that we did not use in training and that contained similar species in a different context.
Class imbalance played a major role in skewing the performance of the model on rare classes whose test images were similar to their training images. For example, a tortoise's slow movement was enough to trigger the camera sensor multiple times, which resulted in many nearly identical images. Because rare groups contained even fewer images in the test set, it was difficult to evaluate their accuracy. Addressing class imbalance is therefore an important factor for improving results. Applying a technique such as emphasis sampling can increase prediction accuracy by duplicating, or emphasizing, only images that have been misclassified instead of oversampling all rare groups (Norouzzadeh et al. 2018; see the sketch at the end of this section). This approach is more dynamic because it balances data as needed by responding to prediction results. Alternatively, researchers can combine multiple data sources to add images to rare classes from other camera-trap datasets (Swanson et al. 2015; LILA BC 2019). However, this approach risks introducing too many dissimilar environmental settings, images, and class types.
Second, evaluating a second dataset allowed us to illustrate the model's lack of location invariance, or inability to generalize to new images with conditions not represented in the training set (Beery et al. 2018). A model's ability to make accurate predictions under a diverse set of conditions depends on how well the training data represent those conditions. Lastly, researchers adopting a trained model into an automatic camera-trap classification workflow should closely monitor it by inspecting important and rare groups for anomalies or by regularly testing it on a subset of new images. As our study shows, new camera angles, species, or locations pose challenges to accurate classification.
Transfer learning has the potential to save the time and resources typically required to hand-label camera-trap images; a simple trained classifier making predictions on 3,000 raw images saves roughly 12 personnel hours. Applications of deep learning, while traditionally left to experts in computer vision, have become less complicated with the emergence of publicly available datasets and open-source software. To that end, we include our code, trained model, instructions, and a set of sample images that we hope will improve the transfer of knowledge from academia to the field.
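A minimal sketch of the emphasis-sampling idea mentioned above follows; the parallel lists of image paths and labels are hypothetical structures, and the logic is only one possible implementation of the technique described by Norouzzadeh et al. (2018).

```python
# Minimal emphasis-sampling sketch: duplicate only the training images the current
# model misclassifies, rather than oversampling every rare group.
def emphasis_sample(image_paths, true_labels, predicted_labels):
    emphasized = list(image_paths)
    for path, truth, pred in zip(image_paths, true_labels, predicted_labels):
        if truth != pred:
            emphasized.append(path)  # add an extra copy of each hard example
    return emphasized
```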
Supplemental Material
Please note: The Journal of Fish and Wildlife Management is not responsible for the content or functionality of any supplemental material. Queries should be directed to the corresponding author for the article.
Data S1. A set of two IPython notebooks to automatically classify and evaluate sample images using a deep learning model trained on camera-trap images collected in the lower Rio Grande Valley in Texas in 2018 and 2019. We designed the model to classify images as wildlife or as empty (a false camera trigger event). The notebooks use additional supplemental data—input weight files, a sample repository of images, and true image labels—to evaluate predictions. They create a new folder for each class, copy input images into the folder matching each predicted group, generate figures of the distribution of predictions across animal groups, and use a csv file containing true image labels to generate an evaluation report.
Available: https://doi.org/10.3996/JFWM-20-076.S1 (5.89 KB ZIP) and https://github.com/mkutu/Nilgai/tree/master/notebooks (15.57 MB IPYNB)
Data S2. A sample of 222 new images from the camera-trap dataset collected in the lower Rio Grande Valley in Texas in 2018 and 2019. With this sample, along with true label information (also provided in the supplemental material), users can test the deep learning model to automatically classify images as wildlife or as being empty (a false camera trigger event).
Available: https://doi.org/10.3996/JFWM-20-076.S2 (29.2 MB ZIP) and https://github.com/mkutu/Nilgai/tree/master/images/images (28.5 MB JPG)
Data S3. The csv file contains true image label information for evaluating the accuracy of a deep learning model trained on camera-trap images collected in the lower Rio Grande Valley in Texas in 2018 and 2019.
Available: https://doi.org/10.3996/JFWM-20-076.S3 (7.36 KB CSV) and https://github.com/mkutu/Nilgai/blob/master/notebooks/image_labels.csv
Data S4. A set of two .h5 files that contain the stored weights and model settings created by training a deep learning model on camera-trap images collected in the lower Rio Grande Valley in Texas in 2018 and 2019.
Available: https://doi.org/10.3996/JFWM-20-076.S4 (239 MB ZIP) and https://github.com/mkutu/Nilgai/tree/master/model
Text S1. A “README.md” text file with instructions for creating the virtual environment needed to run a deep learning model trained on camera-trap images collected in the lower Rio Grande Valley in Texas in 2018 and 2019. A virtual environment allows users to install dependencies—small pieces of software in the form of source code required to run Python programs—without making major changes to their systems. Instructions outline the procedures for setting up environments for both Windows and Mac OS X operating systems. Notes on troubleshooting are also included.
Available: https://doi.org/10.3996/JFWM-20-076.S5 (3.33 KB TXT) and https://github.com/mkutu/Nilgai/blob/master/README.md
Text S2. A “requirements.txt” text file used to install the required Python dependencies—small pieces of software in the form of source code—inside the virtual environment. These dependencies are required to run the deep learning model trained on camera-trap images collected in the lower Rio Grande Valley in Texas in 2018 and 2019.
Available: https://doi.org/10.3996/JFWM-20-076.S6 (1 KB TXT) and https://github.com/mkutu/Nilgai/blob/master/requirements.txt
Acknowledgments
Game camera images and initial processing were supported through appropriated research project 3094-32000-042-00-D, Integrated Pest Management of Cattle Fever Ticks. This article reports results of research only, and mention of a proprietary product does not constitute an endorsement or recommendation by the U.S. Department of Agriculture for its use. The U.S. Department of Agriculture is an equal opportunity provider and employer. Special thanks to Amelia Berle for data management and to the research technicians who spent countless hours labeling images. Additional thanks to Dr. Rupesh Kariyat and Dr. Christofferson for providing access to computing equipment. We would also like to thank the journal reviewers and Associate Editor for their commitment to open access, which ensures applied conservation science remains accessible to all. Matthew Kutugata was supported by U.S. Department of Agriculture National Institute of Food and Agriculture Grant 2016-38422-25543.
Any use of trade, product, website, or firm names in this publication is for descriptive purposes only and does not imply endorsement by the U.S. Government.
References
The findings and conclusions in this article are those of the author(s) and do not necessarily represent the views of the U.S. Fish and Wildlife Service.
Author notes
Citation: Kutugata M, Baumgardt J, Goolsby JA, and Racelis AE. 2021. Automatic camera-trap classification using wildlife-specific deep learning in nilgai management. Journal of Fish and Wildlife Management 12(2):412–421; e1944-687X. https://doi.org/10.3996/JFWM-20-076