Listenable Maps for Zero-Shot Audio Classifiers

Francesco Paissan, Luca Della Libera, Mirco Ravanelli, Cem Subakan

Project Webpage (Accepted to NeurIPS'24)
[arXiv]

Abstract

Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthiness of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot audio classifiers. The proposed method utilizes a novel loss function that maximizes the faithfulness to the original similarity between a given text-and-audio pair. We provide an extensive evaluation using the Contrastive Language-Audio Pretraining (CLAP) model to showcase that our interpreter remains faithful to the decisions in a zero-shot classification context. Moreover, we qualitatively show that our method produces meaningful explanations that correlate well with different text prompts.
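To make the zero-shot setting concrete, the sketch below shows how a contrastive language-audio model can assign a label by comparing one audio embedding against the embeddings of several text prompts. It is only an illustration, not the LMAC-ZS code: the function names `l2_normalize` and `zero_shot_classify` are hypothetical, and the hand-written 3-dimensional vectors are assumed stand-ins for the outputs of CLAP's audio and text encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(audio_emb, text_embs, labels):
    """Return the label whose text embedding is most similar to the audio embedding.

    audio_emb: shape (d,), embedding of one audio clip (assumed encoder output).
    text_embs: shape (num_labels, d), one embedding per text prompt.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_embs)
    sims = t @ a  # cosine similarity of the clip with each prompt
    best = int(np.argmax(sims))
    return labels[best], sims


# Toy embeddings (assumption): real ones would come from the CLAP encoders.
labels = ["can opening", "car horn", "glass breaking"]
text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
audio_emb = np.array([0.1, 0.9, 0.2])

predicted, sims = zero_shot_classify(audio_emb, text_embs, labels)
print(predicted)  # car horn
```

The similarity vector `sims` is the quantity an interpreter in this setting has to stay faithful to: an explanation is useful only if it preserves how the classifier relates the audio to each text prompt.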

Code: link

It is worth emphasizing that audio interpretability is not the same as the classical audio tasks of separation or denoising. Those tasks aim to recover the complete object of interest in the output audio. A classifier network, on the other hand, may focus only on the salient regions. When we interpret its decision and make that interpretation listenable, we therefore expect to uncover those regions, and not necessarily the complete object of interest.



Audio samples:

Sample A

Predicted class by the classifier: 'can opening'.

Input sample      
LMAC-ZS    
GradCAM++    



Sample B

Predicted class by the classifier: 'car horn'.

Input sample      
LMAC-ZS    
GradCAM++    



Sample C

Predicted class by the classifier: 'door wood creaks'.

Input sample      
LMAC-ZS    
GradCAM++    



Sample D

Predicted class by the classifier: 'pig'.

Input sample      
LMAC-ZS    
GradCAM++    



Sample E

Predicted class by the classifier: 'glass breaking'.

Input sample      
LMAC-ZS    
GradCAM++