github.com/karolpiczak/ESC-50 @main sqlite

5 symbols 13 edges 1 files 0 documented · 0%

README

ESC-50: Dataset for Environmental Sound Classification

Overview | Download | Results | Repository content | License | Citing | Caveats | Changelog

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.

The dataset consists of 5-second-long recordings organized into 50 semantic classes (40 examples per class), loosely arranged into 5 major categories:

| Animals | Natural soundscapes & water sounds | Human, non-speech sounds | Interior/domestic sounds | Exterior/urban noises |
| --- | --- | --- | --- | --- |
| Dog | Rain | Crying baby | Door knock | Helicopter |
| Rooster | Sea waves | Sneezing | Mouse click | Chainsaw |
| Pig | Crackling fire | Clapping | Keyboard typing | Siren |
| Cow | Crickets | Breathing | Door, wood creaks | Car horn |
| Frog | Chirping birds | Coughing | Can opening | Engine |
| Cat | Water drops | Footsteps | Washing machine | Train |
| Hen | Wind | Laughing | Vacuum cleaner | Church bells |
| Insects (flying) | Pouring water | Brushing teeth | Clock alarm | Airplane |
| Sheep | Toilet flush | Snoring | Clock tick | Fireworks |
| Crow | Thunderstorm | Drinking, sipping | Glass breaking | Hand saw |
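The counts quoted above can be checked against the category table with a few lines of Python. This is only a sketch: the names `ESC50_CATEGORIES`, `CLIPS_PER_CLASS`, `CLIP_SECONDS` and `dataset_size` are made up for illustration and are not part of the dataset or its repository.

```python
# Category -> class names, transcribed from the table above.
# All identifiers here are illustrative, not shipped with ESC-50.
ESC50_CATEGORIES = {
    "Animals": [
        "Dog", "Rooster", "Pig", "Cow", "Frog",
        "Cat", "Hen", "Insects (flying)", "Sheep", "Crow",
    ],
    "Natural soundscapes & water sounds": [
        "Rain", "Sea waves", "Crackling fire", "Crickets", "Chirping birds",
        "Water drops", "Wind", "Pouring water", "Toilet flush", "Thunderstorm",
    ],
    "Human, non-speech sounds": [
        "Crying baby", "Sneezing", "Clapping", "Breathing", "Coughing",
        "Footsteps", "Laughing", "Brushing teeth", "Snoring", "Drinking, sipping",
    ],
    "Interior/domestic sounds": [
        "Door knock", "Mouse click", "Keyboard typing", "Door, wood creaks",
        "Can opening", "Washing machine", "Vacuum cleaner", "Clock alarm",
        "Clock tick", "Glass breaking",
    ],
    "Exterior/urban noises": [
        "Helicopter", "Chainsaw", "Siren", "Car horn", "Engine",
        "Train", "Church bells", "Airplane", "Fireworks", "Hand saw",
    ],
}

CLIPS_PER_CLASS = 40  # "40 examples per class"
CLIP_SECONDS = 5      # "5-second-long recordings"


def dataset_size(categories, clips_per_class):
    """Return (number of classes, number of clips) implied by the taxonomy."""
    n_classes = sum(len(names) for names in categories.values())
    return n_classes, n_classes * clips_per_class


n_classes, n_clips = dataset_size(ESC50_CATEGORIES, CLIPS_PER_CLASS)
print(n_classes, n_clips)                       # 50 2000
print(round(n_clips * CLIP_SECONDS / 3600, 2))  # 2.78 (hours of audio in total)
```

Each category contributes 10 classes, so 5 × 10 × 40 gives the 2000 clips stated in the overview.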

Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project. The dataset has been prearranged into 5 folds for comparable cross-validation, making sure that fragments from the same original source file are contained in a single fold.
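One way to use the predefined folds is leave-one-fold-out cross-validation: each fold is held out once for testing while the remaining four are used for training. The sketch below runs on synthetic records rather than the real metadata; the field names `filename`, `fold` and `src_file` and the helpers `fold_splits` and `sources_leak` are assumptions made for this example.

```python
def fold_splits(clips, n_folds=5):
    """Yield (held_out_fold, train, test) for leave-one-fold-out cross-validation."""
    for held_out in range(1, n_folds + 1):
        train = [c for c in clips if c["fold"] != held_out]
        test = [c for c in clips if c["fold"] == held_out]
        yield held_out, train, test


def sources_leak(train, test):
    """Return source ids that occur on both sides of a split (should be empty)."""
    return {c["src_file"] for c in train} & {c["src_file"] for c in test}


# Synthetic stand-in for the dataset metadata (assumed layout):
# 10 source recordings, 2 fragments each, every source assigned to exactly one fold.
clips = [
    {"filename": f"clip_{src}_{take}.wav", "fold": src % 5 + 1, "src_file": src}
    for src in range(10)
    for take in "AB"
]

for held_out, train, test in fold_splits(clips):
    # Fragments of one source never straddle the train/test boundary.
    assert not sources_leak(train, test)
    print(held_out, len(train), len(test))  # each fold: 16 training clips, 4 test clips
```

Splitting by the predefined fold rather than shuffling clips at random is what keeps fragments of the same recording out of both the training and test sets, which is the property the paragraph above describes.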

A more thorough description of the dataset is available in the original paper with some supplementary materials on GitHub: ESC: Dataset for Environmental Sound Classification - paper replication data.

Download

The dataset can be downloaded as a single .zip file (~600 MB):

Download ESC-50 dataset

Results

Supervised Methods

Numerous machine learning & signal processing approaches have been evaluated on the ESC-50 dataset. Most of them are listed here. If you know of some other reference, you can message me or open a Pull Request directly.

Terms used in the table:

• CNN - Convolutional Neural Network

• CRNN - Convolutional Recurrent Neural Network

• GMM - Gaussian Mixture Model

• GTCC - Gammatone Cepstral Coefficients

• GTSC - Gammatone Spectral Coefficients

• k-NN - k-Nearest Neighbors

• MFCC - Mel-Frequency Cepstral Coefficients

• MLP - Multi-Layer Perceptron

• RBM - Restricted Boltzmann Machine

• RNN - Recurrent Neural Network

• SVM - Support Vector Machine

• TEO - Teager Energy Operator

• ZCR - Zero-Crossing Rate

| Title | Notes | Accuracy | Paper | Code |
| --- | --- | --- | --- | --- |
| Natural Language Supervision for General-Purpose Audio Representations | HTSAT-22 model pretrained by natural language supervision | 98.25% | msclap2023 | :scroll: |
| BEATs: Audio Pre-Training with Acoustic Tokenizers | Transformer model pretrained with acoustic tokenizers | 98.10% | chen2022 | :scroll: |
| HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 | :scroll: |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 96.70% | elizalde2022 | :scroll: |
| CAT: Causal Audio Transformer for Audio Classification | Transformer model with MFMR features and a causal module | 96.4% | liu2023 | |
| AST: Audio Spectrogram Transformer | Pure Attention Model Pretrained on AudioSet | 95.70% | gong2021 | :scroll: |
| Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | A Transformer model pretrained w/ visual image supervision | 95.70% | zhao2022 | :scroll: |
| A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition | Multi-stage sequential learning with knowledge transfer from Audioset | 94.10% | kumar2020 | |
| Efficient End-to-End Audio Embeddings Generation for Audio Classification on Target Applications | CNN model pretrained on AudioSet | 92.32% | lopez-meyer2021 | |
| Urban Sound Tagging using Multi-Channel Audio Feature with Convolutional Neural Networks | Pretrained model with multi-channel features | 89.50% | kim2020 | :scroll: |
| An Ensemble of Convolutional Neural Networks for Audio Classification | CNN ensemble with data augmentation | 88.65% | nanni2020 | :scroll: |
| Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices | CNN model (ACDNet) with potential compression | 87.1% | mohaimenuzzaman2021 | :scroll: |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC and mel energies | 86.50% | sailor2017 | |
| AclNet: efficient end-to-end audio classification CNN | CNN with mixup and data augmentation | 85.65% | huang2018 | |
| On Open-Set Classification with L3-Net Embeddings for Machine Listening Applications | x-vector network with openll3 embeddings | 85.00% | wilkinghoff2020 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation + Between-Class learning | 84.90% | tokozume2017b | |
| Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs), fusion with Mel energies | 84.15% | tak2017 | |
| Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes | CNN pretrained on AudioSet | 83.50% | kumar2017 | :scroll: |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC | 83.00% | sailor2017 | |
| Deep Multimodal Clustering for Unsupervised Audiovisual Learning | CNN + unsupervised audio-visual learning | 82.60% | hu2019 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of GTSC & TEO-GTSC with CNN | 81.95% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + Between-Class learning | 81.80% | tokozume2017b | |
| :headphones: Human accuracy | Crowdsourcing experiment in classifying ESC-50 by human listeners | 81.30% | piczak2015a | :scroll: |
| Objects that Sound | Look, Listen and Learn (L3) network (arandjelovic2017a) with stride 2, larger batches and learning rate schedule | 79.80% | arandjelovic2017b | |
| Look, Listen and Learn | 8-layer convolutional subnetwork pretrained on an audio-visual correspondence task | 79.30% | arandjelovic2017a | |
| Learning Environmental Sounds with Multi-scale Convolutional Neural Network | Multi-scale convolutions with feature fusion (waveform + spectrogram) | 79.10% | zhu2018 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | GTSC with CNN | 79.10% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation | 78.80% | tokozume2017b | |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM | 78.45% | sailor2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization + Between-Class learning | 76.90% | tokozume2017b | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTSC with CNN | 74.85% | agrawal2017 | |

Core symbols most depended-on inside this repo

—

Shape

Function 5

Languages

Python 100%

Modules by API surface

tests/test_dataset.py 5 symbols

For agents

$ claude mcp add ESC-50 \
  -- python -m otcore.mcp_server <graph>
