
The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.
The dataset consists of 5-second-long recordings organized into 50 semantic classes (with 40 examples per class) loosely arranged into 5 major categories:
| Animals | Natural soundscapes & water sounds | Human, non-speech sounds | Interior/domestic sounds | Exterior/urban noises |
|---|---|---|---|---|
| Dog | Rain | Crying baby | Door knock | Helicopter |
| Rooster | Sea waves | Sneezing | Mouse click | Chainsaw |
| Pig | Crackling fire | Clapping | Keyboard typing | Siren |
| Cow | Crickets | Breathing | Door, wood creaks | Car horn |
| Frog | Chirping birds | Coughing | Can opening | Engine |
| Cat | Water drops | Footsteps | Washing machine | Train |
| Hen | Wind | Laughing | Vacuum cleaner | Church bells |
| Insects (flying) | Pouring water | Brushing teeth | Clock alarm | Airplane |
| Sheep | Toilet flush | Snoring | Clock tick | Fireworks |
| Crow | Thunderstorm | Drinking, sipping | Glass breaking | Hand saw |
Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project. The dataset has been prearranged into 5 folds for comparable cross-validation, making sure that fragments from the same original source file are contained in a single fold.
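The fold arrangement described above can be sketched as a leave-one-fold-out loop. This is a minimal illustration, not the dataset's own tooling: it assumes a metadata table with `fold` and `src_file` columns (the column names are an assumption here), and the ten rows in `SAMPLE_META` are made up for the example rather than taken from the dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature metadata table. Column names are assumed for
# illustration; the rows are invented and do not come from ESC-50.
SAMPLE_META = """\
filename,fold,target,src_file
clip_01.wav,1,0,src_a
clip_02.wav,1,0,src_a
clip_03.wav,2,0,src_b
clip_04.wav,2,1,src_c
clip_05.wav,3,1,src_d
clip_06.wav,3,1,src_d
clip_07.wav,4,0,src_e
clip_08.wav,4,1,src_f
clip_09.wav,5,0,src_g
clip_10.wav,5,1,src_h
"""


def load_meta(text):
    """Parse CSV text into a list of dicts, one per clip."""
    return list(csv.DictReader(io.StringIO(text.strip())))


def fold_splits(rows, n_folds=5):
    """Yield (held_out_fold, train_rows, test_rows) for each predefined fold."""
    for held_out in range(1, n_folds + 1):
        train = [r for r in rows if int(r["fold"]) != held_out]
        test = [r for r in rows if int(r["fold"]) == held_out]
        yield held_out, train, test


def sources_spanning_folds(rows):
    """Return source files whose fragments appear in more than one fold.

    An empty result means no source recording leaks between train and test.
    """
    folds_by_source = defaultdict(set)
    for r in rows:
        folds_by_source[r["src_file"]].add(r["fold"])
    return {src: folds for src, folds in folds_by_source.items() if len(folds) > 1}


if __name__ == "__main__":
    rows = load_meta(SAMPLE_META)
    for held_out, train, test in fold_splits(rows):
        print(f"fold {held_out}: {len(train)} train clips, {len(test)} test clips")
    print("sources spanning folds:", sources_spanning_folds(rows))
```

Using the predefined folds instead of a random split keeps fragments of the same source recording on one side of each train/test boundary, which is what makes results from different methods comparable.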
A more thorough description of the dataset is available in the original paper with some supplementary materials on GitHub: ESC: Dataset for Environmental Sound Classification - paper replication data.
The dataset can be downloaded as a single .zip file (~600 MB):
Numerous machine learning & signal processing approaches have been evaluated on the ESC-50 dataset. Most of them are listed here. If you know of some other reference, you can message me or open a Pull Request directly.
Terms used in the table:
• CNN - Convolutional Neural Network
• CRNN - Convolutional Recurrent Neural Network
• GMM - Gaussian Mixture Model
• GTCC - Gammatone Cepstral Coefficients
• GTSC - Gammatone Spectral Coefficients
• k-NN - k-Nearest Neighbors
• MFCC - Mel-Frequency Cepstral Coefficients
• MLP - Multi-Layer Perceptron
• RBM - Restricted Boltzmann Machine
• RNN - Recurrent Neural Network
• SVM - Support Vector Machine
• TEO - Teager Energy Operator
• ZCR - Zero-Crossing Rate
| Title | Notes | Accuracy | Paper | Code |
|---|---|---|---|---|
| Natural Language Supervision for General-Purpose Audio Representations | HTSAT-22 model pretrained by natural language supervision | 98.25% | msclap2023 | :scroll: |
| BEATs: Audio Pre-Training with Acoustic Tokenizers | Transformer model pretrained with acoustic tokenizers | 98.10% | chen2022 | :scroll: |
| HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 | :scroll: |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 96.70% | elizalde2022 | :scroll: |
| CAT: Causal Audio Transformer for Audio Classification | Transformer model with MFMR features and a causal module | 96.4% | liu2023 | |
| AST: Audio Spectrogram Transformer | Pure Attention Model Pretrained on AudioSet | 95.70% | gong2021 | :scroll: |
| Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | A Transformer model pretrained w/ visual image supervision | 95.70% | zhao2022 | :scroll: |
| A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition | Multi-stage sequential learning with knowledge transfer from Audioset | 94.10% | kumar2020 | |
| Efficient End-to-End Audio Embeddings Generation for Audio Classification on Target Applications | CNN model pretrained on AudioSet | 92.32% | lopez-meyer2021 | |
| Urban Sound Tagging using Multi-Channel Audio Feature with Convolutional Neural Networks | Pretrained model with multi-channel features | 89.50% | kim2020 | :scroll: |
| An Ensemble of Convolutional Neural Networks for Audio Classification | CNN ensemble with data augmentation | 88.65% | nanni2020 | :scroll: |
| Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices | CNN model (ACDNet) with potential compression | 87.1% | mohaimenuzzaman2021 | :scroll: |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC and mel energies | 86.50% | sailor2017 | |
| AclNet: efficient end-to-end audio classification CNN | CNN with mixup and data augmentation | 85.65% | huang2018 | |
| On Open-Set Classification with L3-Net Embeddings for Machine Listening Applications | x-vector network with openll3 embeddings | 85.00% | wilkinghoff2020 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation + Between-Class learning | 84.90% | tokozume2017b | |
| Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs), fusion with Mel energies | 84.15% | tak2017 | |
| Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes | CNN pretrained on AudioSet | 83.50% | kumar2017 | :scroll: |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC | 83.00% | sailor2017 | |
| Deep Multimodal Clustering for Unsupervised Audiovisual Learning | CNN + unsupervised audio-visual learning | 82.60% | hu2019 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of GTSC & TEO-GTSC with CNN | 81.95% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + Between-Class learning | 81.80% | tokozume2017b | |
| :headphones: Human accuracy | Crowdsourcing experiment in classifying ESC-50 by human listeners | 81.30% | piczak2015a | :scroll: |
| Objects that Sound | Look, Listen and Learn (L3) network (arandjelovic2017a) with stride 2, larger batches and learning rate schedule | 79.80% | arandjelovic2017b | |
| Look, Listen and Learn | 8-layer convolutional subnetwork pretrained on an audio-visual correspondence task | 79.30% | arandjelovic2017a | |
| Learning Environmental Sounds with Multi-scale Convolutional Neural Network | Multi-scale convolutions with feature fusion (waveform + spectrogram) | 79.10% | zhu2018 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | GTSC with CNN | 79.10% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation | 78.80% | tokozume2017b | |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM | 78.45% | sailor2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization + Between-Class learning | 76.90% | tokozume2017b | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTSC with CNN | 74.85% | agrawal2017 | |