Clear Sky Science
A dataset of insect sounds from 459 species for bioacoustic machine learning
Listening to the Hidden World of Insects
Many of nature’s sounds come not from birds or frogs but from insects, its “small majority”: chirping crickets, rasping grasshoppers, and buzzing cicadas. As scientists race to understand whether insect populations are crashing worldwide, these sounds could provide vital clues. But turning a global chorus of clicks and buzzes into hard data requires computers that can recognize insect species by ear, a capability that has been held back by a lack of suitable training data. This study introduces a large, carefully curated collection of insect recordings designed to unlock that potential.

Why Insect Songs Matter
Insects are essential to ecosystems, yet evidence suggests many species are declining. Traditional monitoring—catching insects in traps or surveying by sight—is slow, labor-intensive, and covers only a fraction of the world’s diversity. Sound offers another route. Many grasshoppers, crickets, and cicadas produce species-specific songs that travel far and can be captured by small, cheap recorders. If computers could reliably match these songs to species, scientists and even citizen scientists could monitor insect diversity across continents with minimal disturbance.
Building a Global Sound Library
The authors assembled a new dataset called InsectSet459, containing 26,298 audio files—about 9.5 days of sound—from 459 insect species. Most belong to two highly vocal groups: Orthoptera (grasshoppers, crickets, and relatives) and Cicadidae (cicadas). Rather than recording these insects themselves, the team tapped into three major open platforms: xeno-canto, iNaturalist, and BioAcoustica. These websites host species-labeled recordings from both experts and citizen scientists around the world, making them rich sources of raw material. The researchers downloaded only recordings with confirmed species identifications and open licenses, then standardized and trimmed the files while preserving as much acoustic diversity as possible.
Cleaning Up the Noise
Simply collecting thousands of recordings is not enough; a machine-learning dataset must also avoid hidden pitfalls. The team performed extensive “deduplication,” removing repeated uploads of the same audio file, even when they appeared under different user names or on different platforms. They limited each species to recordings from distinct times and places, trimmed long files to two-minute segments, converted uncommon formats, and ensured that every species had at least ten separate recordings. Unlike many audio datasets, InsectSet459 does not force all files to a single sample rate. Insects often produce high-pitched or even ultrasonic calls, so preserving the original recording rates, which range from 8 to 500 kilohertz, keeps important details that might otherwise be lost.
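One simple way to catch repeated uploads is to fingerprint each file by its contents rather than by its name or uploader. The sketch below is an illustration only, not the authors' pipeline: the function names `content_hash` and `curate`, the `(species, platform, audio_bytes)` record layout, and the choice of SHA-256 are all assumptions made for this example. It drops byte-identical copies across platforms, then keeps only species with at least ten remaining recordings.

```python
import hashlib
from collections import defaultdict


def content_hash(audio_bytes: bytes) -> str:
    """Fingerprint a file by its bytes, ignoring filename and uploader."""
    return hashlib.sha256(audio_bytes).hexdigest()


def curate(records, min_per_species=10):
    """Drop byte-identical duplicates, then drop species with too few files.

    `records` is an iterable of (species, platform, audio_bytes) tuples.
    This layout is hypothetical, not the dataset's actual schema.
    """
    seen = set()
    by_species = defaultdict(list)
    for species, platform, audio_bytes in records:
        digest = content_hash(audio_bytes)
        if digest in seen:
            continue  # the same audio was already kept, possibly from another platform
        seen.add(digest)
        by_species[species].append((platform, digest))
    # Keep only species that still have enough distinct recordings.
    return {
        species: files
        for species, files in by_species.items()
        if len(files) >= min_per_species
    }
```

Exact hashing only catches byte-for-byte copies. A re-encoded or re-trimmed upload of the same recording would slip through and would need some form of acoustic comparison instead.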

Putting the Data to the Test
To show that InsectSet459 is genuinely useful for automatic recognition, the authors trained two state-of-the-art deep learning models originally developed for sound and image tasks. Both models converted the audio into picture-like representations of sound energy over time and frequency, then learned to associate these patterns with species. Tested on unseen recordings, the models distinguished species with moderate success overall: a score of about 57% on a strict measure that balances missed detections and false alarms, and a simple accuracy of over 70%. Performance was especially strong, often above 80%, for species with many recordings. It dropped sharply for species represented by only a handful of examples, and for those whose calls sit outside the frequency range emphasized in the models’ features.
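The gap between the two scores is what one would expect from uneven data. As a minimal sketch, assume the strict measure is an F1 score averaged equally over species (a "macro" average); the summary above does not name the measure, so this is an assumption, and the toy labels below are invented for illustration. A model that does well on a common species but misses a rare one keeps a high accuracy while its per-species average falls.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of recordings assigned the correct species."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Per-species F1, averaged so every species counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # false alarm for the predicted species
            fn[t] += 1  # missed detection for the true species
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: 8 recordings of a common species, 2 of a rare one,
# and a model that labels everything as the common species.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 10

print(round(accuracy(y_true, y_pred), 2))  # 0.8
print(round(macro_f1(y_true, y_pred), 2))  # 0.44
```

Accuracy rewards getting the abundant species right, while the averaged F1 is pulled down by every species the model fails on, however few recordings it has.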
What This Means Going Forward
Although these early models are far from perfect, especially for rare species and very high-pitched callers, the results show that a single, well-curated dataset can already power useful automatic recognition of hundreds of insect species. InsectSet459 is meant as a foundation: a realistic, challenging test bed for experimenting with new ways of representing sound, handling multiple sample rates, and dealing with naturally uneven data. As researchers refine algorithms—potentially incorporating ultrasonic information, better data augmentation, and region-specific fine-tuning—this dataset could help turn the nocturnal chorus of chirps and buzzes into a sensitive, global monitoring system for insect biodiversity.
Citation: Faiß, M., Ghani, B. & Stowell, D. A dataset of insect sounds from 459 species for bioacoustic machine learning. Sci Data 13, 499 (2026). https://doi.org/10.1038/s41597-026-07123-4
Keywords: insect bioacoustics, biodiversity monitoring, machine learning, acoustic datasets, citizen science