MCPcopy
hub / github.com/karolpiczak/ESC-50

github.com/karolpiczak/ESC-50 @main sqlite

repository ↗ · DeepWiki ↗
5 symbols 13 edges 1 files 0 documented · 0%
README

ESC-50: Dataset for Environmental Sound Classification

Overview | Download | Results | Repository content | License | Citing | Caveats | Changelog

    Download 

ESC-50 clip preview

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.

The dataset consists of 5-second-long recordings organized into 50 semantical classes (with 40 examples per class) loosely arranged into 5 major categories:

Animals Natural soundscapes & water sounds Human, non-speech sounds Interior/domestic sounds Exterior/urban noises
Dog Rain Crying baby Door knock Helicopter
Rooster Sea waves Sneezing Mouse click Chainsaw
Pig Crackling fire Clapping Keyboard typing Siren
Cow Crickets Breathing Door, wood creaks Car horn
Frog Chirping birds Coughing Can opening Engine
Cat Water drops Footsteps Washing machine Train
Hen Wind Laughing Vacuum cleaner Church bells
Insects (flying) Pouring water Brushing teeth Clock alarm Airplane
Sheep Toilet flush Snoring Clock tick Fireworks
Crow Thunderstorm Drinking, sipping Glass breaking Hand saw

Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project. The dataset has been prearranged into 5 folds for comparable cross-validation, making sure that fragments from the same original source file are contained in a single fold.

A more thorough description of the dataset is available in the original paper with some supplementary materials on GitHub: ESC: Dataset for Environmental Sound Classification - paper replication data.

Download

The dataset can be downloaded as a single .zip file (~600 MB):

Download ESC-50 dataset

Results

Supervised Methods

Numerous machine learning & signal processing approaches have been evaluated on the ESC-50 dataset. Most of them are listed here. If you know of some other reference, you can message me or open a Pull Request directly.

Terms used in the table:

• CNN - Convolutional Neural Network

• CRNN - Convolutional Recurrent Neural Network

• GMM - Gaussian Mixture Model

• GTCC - Gammatone Cepstral Coefficients

• GTSC - Gammatone Spectral Coefficients

• k-NN - k-Neareast Neighbors

• MFCC - Mel-Frequency Cepstral Coefficients

• MLP - Multi-Layer Perceptron

• RBM - Restricted Boltzmann Machine

• RNN - Recurrent Neural Network

• SVM - Support Vector Machine

• TEO - Teager Energy Operator

• ZCR - Zero-Crossing Rate

Title Notes Accuracy Paper Code
Natural Language Supervision for General-Purpose Audio Representations HTSAT-22 model pretrained by natural language supervision 98.25% msclap2023 :scroll:
BEATs: Audio Pre-Training with Acoustic Tokenizers Transformer model pretrained with acoustic tokenizers 98.10% chen2022 :scroll:
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection Transformer model with hierarchical structure and token-semantic modules 97.00% chen2022 :scroll:
CLAP: Learning Audio Concepts From Natural Language Supervision CNN model pretrained by natural language supervision 96.70% elizalde2022 :scroll:
CAT: Causal Audio Transformer for Audio Classification Transformer model with MFMR features and a causal module 96.4% liu2023
AST: Audio Spectrogram Transformer Pure Attention Model Pretrained on AudioSet 95.70% gong2021 :scroll:
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer A Transformer model pretrained w/ visual image supervision 95.70% zhao2022 :scroll:
A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition Multi-stage sequential learning with knowledge transfer from Audioset 94.10% kumar2020
Efficient End-to-End Audio Embeddings Generation for Audio Classification on Target Applications CNN model pretrained on AudioSet 92.32% lopez-meyer2021
Urban Sound Tagging using Multi-Channel Audio Feature with Convolutional Neural Networks Pretrained model with multi-channel features 89.50% kim2020 :scroll:
An Ensemble of Convolutional Neural Networks for Audio Classification CNN ensemble with data augmentation 88.65% nanni2020 :scroll:
Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices CNN model (ACDNet) with potential compression 87.1% mohaimenuzzaman2021 :scroll:
Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification CNN with filterbanks learned using convolutional RBM + fusion with GTSC and mel energies 86.50% sailor2017
AclNet: efficient end-to-end audio classification CNN CNN with mixup and data augmentation 85.65% huang2018
On Open-Set Classification with L3-Net Embeddings for Machine Listening Applications x-vector network with openll3 embeddings 85.00% wilkinghoff2020
Learning from Between-class Examples for Deep Sound Recognition EnvNet-v2 (tokozume2017a) + data augmentation + Between-Class learning 84.90% tokozume2017b
Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification CNN working with phase encoded mel filterbank energies (PEFBEs), fusion with Mel energies 84.15% tak2017
Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes CNN pretrained on AudioSet 83.50% kumar2017 :scroll:
Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification CNN with filterbanks learned using convolutional RBM + fusion with GTSC 83.00% sailor2017
Deep Multimodal Clustering for Unsupervised Audiovisual Learning CNN + unsupervised audio-visual learning 82.60% hu2019
Novel TEO-based Gammatone Features for Environmental Sound Classification Fusion of GTSC & TEO-GTSC with CNN 81.95% agrawal2017
Learning from Between-class Examples for Deep Sound Recognition EnvNet-v2 (tokozume2017a) + Between-Class learning 81.80% tokozume2017b
:headphones: Human accuracy Crowdsourcing experiment in classifying ESC-50 by human listeners 81.30% piczak2015a :scroll:
Objects that Sound Look, Listen and Learn (L3) network (arandjelovic2017a) with stride 2, larger batches and learning rate schedule 79.80% arandjelovic2017b
Look, Listen and Learn 8-layer convolutional subnetwork pretrained on an audio-visual correspondence task 79.30% arandjelovic2017a
Learning Environmental Sounds with Multi-scale Convolutional Neural Network Multi-scale convolutions with feature fusion (waveform + spectrogram) 79.10% zhu2018
Novel TEO-based Gammatone Features for Environmental Sound Classification GTSC with CNN 79.10% agrawal2017
Learning from Between-class Examples for Deep Sound Recognition EnvNet-v2 (tokozume2017a) + data augmentation 78.80% tokozume2017b
Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification CNN with filterbanks learned using convolutional RBM 78.45% sailor2017
Learning from Between-class Examples for Deep Sound Recognition Baseline CNN (piczak2015b) + Batch Normalization + Between-Class learning 76.90% tokozume2017b
Novel TEO-based Gammatone Features for Environmental Sound Classification TEO-GTSC with CNN 74.85% [agrawal2017](http:

Core symbols most depended-on inside this repo

Shape

Function 5

Languages

Python100%

Modules by API surface

tests/test_dataset.py5 symbols

For agents

$ claude mcp add ESC-50 \
  -- python -m otcore.mcp_server <graph>

⬇ download graph artifact