Listenable Maps for Zero-Shot Audio Classifiers

Francesco Paissan, Luca Della Libera, Mirco Ravanelli, Cem Subakan

Project Webpage (Accepted to NeurIPS'24)
[arXiv]

Abstract

Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthiness of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot audio classifiers. The proposed method utilizes a novel loss function that maximizes the faithfulness to the original similarity between a given text-and-audio pair. We provide an extensive evaluation using the Contrastive Language-Audio Pretraining (CLAP) model to showcase that our interpreter remains faithful to the decisions in a zero-shot classification context. Moreover, we qualitatively show that our method produces meaningful explanations that correlate well with different text prompts.
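To make the idea of "faithfulness to the original similarity" concrete, here is a minimal sketch of zero-shot classification by text-audio similarity, together with a simple way to compare similarities before and after masking. It is not the LMAC-ZS loss or the CLAP API: the embeddings are hand-written toy vectors, and the names `zero_shot_scores` and `similarity_gap` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so that dot products equal cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def zero_shot_scores(audio_emb, text_embs):
    # Cosine similarity between one audio embedding and each text-prompt embedding.
    return l2_normalize(text_embs) @ l2_normalize(audio_emb)

def similarity_gap(audio_emb, masked_audio_emb, text_embs):
    # Mean absolute change in text-audio similarity caused by masking the audio.
    # A small gap means the masked audio preserves the original similarities.
    # (Illustrative measure only; an assumption, not the loss used in the paper.)
    original = zero_shot_scores(audio_emb, text_embs)
    masked = zero_shot_scores(masked_audio_emb, text_embs)
    return float(np.mean(np.abs(original - masked)))

# Toy embeddings standing in for the outputs of a text encoder and an audio encoder.
labels = ["can opening", "car horn", "pig"]
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
audio_emb = np.array([0.2, 0.9, 0.1])

# Zero-shot prediction: pick the prompt whose embedding is most similar to the audio.
scores = zero_shot_scores(audio_emb, text_embs)
predicted = labels[int(np.argmax(scores))]

# Hypothetical embedding of the same audio after an interpreter's mask is applied.
masked_audio_emb = np.array([0.1, 0.8, 0.05])
gap = similarity_gap(audio_emb, masked_audio_emb, text_embs)
print(predicted, gap)
```

In this toy setup, the prediction is the label whose prompt embedding has the highest cosine similarity with the audio embedding, and an explanation would count as more faithful the smaller the similarity gap it induces.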

Code: link

It's worth emphasizing that audio interpretability is not the same as the classical audio tasks of separation or denoising. Those tasks aim to recover the complete object of interest in the output audio. A classifier network, on the other hand, might focus mostly on salient regions. When interpreting its decision and making it listenable, we expect to uncover such regions, and not necessarily the complete object of interest.

Please note:

Some audio samples are silent. If you see the time cursor moving but cannot hear anything, this is not a problem with the audio player; it means the method did not attribute anything.

Chrome is the preferred browser. The samples may not always run in Firefox.

Use of headphones/earphones is recommended.

Please tune your volume appropriately before playing any sample.

The audio files may take some time to load. If they do not load even after waiting, it may be a temporary hosting server issue. In that case, please visit again after some time.

Audio samples:

Sample A

Predicted class by the classifier: 'can opening'.

Input sample
LMAC-ZS
GradCAM++

Sample B

Predicted class by the classifier: 'car horn'.

Input sample
LMAC-ZS
GradCAM++

Sample C

Predicted class by the classifier: 'door wood creaks'.

Input sample
LMAC-ZS
GradCAM++

Sample D

Predicted class by the classifier: 'pig'.

Input sample
LMAC-ZS
GradCAM++

Sample E

Predicted class by the classifier: 'glass breaking'.

Input sample
LMAC-ZS
GradCAM++