Listenable Maps for Audio Classifiers

Francesco Paissan, Mirco Ravanelli, Cem Subakan

Project Webpage (Accepted to ICML'24)
[arXiv] [code]

Abstract

Despite the impressive performance of deep learning models across diverse tasks, their complexity poses challenges for interpretation. This challenge is particularly evident for audio signals, where conveying interpretations becomes inherently difficult. To address this issue, we introduce Listenable Maps for Audio Classifiers (L-MAC), a post-hoc interpretation method that generates faithful and listenable interpretations. L-MAC utilizes a decoder on top of a pretrained classifier to generate binary masks that highlight relevant portions of the input audio. We train the decoder with a special loss that maximizes the confidence of the classifier decision on the masked-in portion of the audio while minimizing the classifier's output probability on the masked-out portion. Quantitative evaluations on both in-domain and out-of-domain data demonstrate that L-MAC consistently produces more faithful interpretations than several gradient- and masking-based methodologies. Furthermore, a user study confirms that, on average, users prefer the interpretations generated by the proposed technique.

It's worth emphasizing that audio interpretability is not the same as the classical audio tasks of separation or denoising. These tasks involve recovering the complete object of interest in the output audio. A classifier network, on the other hand, might focus only on the most salient regions. When interpreting its decision and making it listenable, we expect to uncover such regions, not necessarily the complete object of interest.
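To make the masking objective described in the abstract more concrete, here is a minimal PyTorch-style sketch, not the actual implementation: it assumes a frozen classifier exposing a hypothetical embed method for its hidden representations, a decoder producing a mask in [0, 1] over the input spectrogram, and an illustrative weighting factor alpha; the exact loss terms and hyper-parameters used in the paper and released code may differ.

import torch
import torch.nn.functional as F

def lmac_style_loss(classifier, decoder, spec, alpha=1.0):
    """Sketch of the masking objective: keep the class evident in the
    masked-in audio, suppress it in the masked-out audio.

    classifier: frozen pretrained audio classifier over (log-)spectrograms.
    classifier.embed: hypothetical hook returning its hidden representations.
    decoder: maps hidden representations to a mask with the same shape as `spec`.
    """
    with torch.no_grad():
        logits = classifier(spec)          # original prediction on the full input
        y_hat = logits.argmax(dim=-1)      # class whose decision is interpreted

    mask = decoder(classifier.embed(spec))  # mask values in [0, 1]
    masked_in = mask * spec                 # evidence kept by the interpretation
    masked_out = (1.0 - mask) * spec        # evidence discarded by the interpretation

    # Maximize classifier confidence on the masked-in portion ...
    loss_in = F.cross_entropy(classifier(masked_in), y_hat)
    # ... while minimizing it on the masked-out portion.
    loss_out = F.cross_entropy(classifier(masked_out), y_hat)

    # Only the decoder is updated; the classifier stays frozen.
    return loss_in - alpha * loss_out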



Citing L-MAC

@inproceedings{lmac,
  author    = {Francesco Paissan and Mirco Ravanelli and Cem Subakan},
  title     = {{Listenable Maps for Audio Classifiers}},
  year      = {2024},
  booktitle = {International Conference on Machine Learning (ICML)},
}

Audio samples: User study - Part I

These samples are the ones presented to the participants in the first stage of the user study. The mixtures are generated from ESC50 samples from the validation and test folds. For each sample, you can listen to the input audio fed to the classifier and the interpretation audio generated for the predicted class. The method that generated each interpretation is indicated to the right of the audio player.

Sample A

Predicted class by the classifier: 'can opening'.

Input sample      
L-MAC    
L-MAC @ FT=16, CTH=0.6    
L-MAC @ FT=16, CTH=0.7    
L2I @ RTH=0.2    
L2I @ RTH=0.4    



Sample B

Predicted class by the classifier: 'car horn'.

Input sample      
L-MAC    
L-MAC @ FT=16, CTH=0.6    
L-MAC @ FT=16, CTH=0.7    
L2I @ RTH=0.2    
L2I @ RTH=0.4    



Sample C

Predicted class by the classifier: 'door wood creaks'.

Input sample      
L-MAC    
L-MAC @ FT=16, CTH=0.6    
L-MAC @ FT=16, CTH=0.7    
L2I @ RTH=0.2    
L2I @ RTH=0.4    



Sample D

Predicted class by the classifier: 'pig'.

Input sample      
L-MAC    
L-MAC @ FT=16, CTH=0.6    
L-MAC @ FT=16, CTH=0.7    
L2I @ RTH=0.2    
L2I @ RTH=0.4    



Sample E

Predicted class by the classifier: 'glass breaking'.

Input sample      
L-MAC    
L-MAC @ FT=16, CTH=0.6    
L-MAC @ FT=16, CTH=0.7    
L2I @ RTH=0.2    
L2I @ RTH=0.4    



Audio samples: User study - Part II

These samples are the ones presented to the participants in the second stage of the user study. The mixtures are downloaded from L2I's companion website for a fair qualitative comparison. For each sample, you can listen to the input audio fed to the classifier and the interpretation audio generated for the predicted class. The method that generated each interpretation is indicated to the right of the audio player.

Sample A

Predicted class by the classifier: 'dog'.

Input sample      
L-MAC    
L-MAC @ FT=16, CTH=0.6    
L-MAC @ FT=16, CTH=0.7    
L2I (from companion website)    



Sample B

Predicted class by the classifier: 'baby crying'.

Input sample      
L-MAC    
L-MAC @ FT=16, CTH=0.6    
L-MAC @ FT=16, CTH=0.7    
L2I (from companion website)    



Sample C

Predicted class by the classifier: 'church bells'.

Input sample      
L-MAC    
L-MAC @ FT=16, CTH=0.6    
L-MAC @ FT=16, CTH=0.7    
L2I (from companion website)    



Sample D

Predicted class by the classifier: 'dog'.

Input sample      
L-MAC    
L-MAC @ FT=16, CTH=0.6    
L-MAC @ FT=16, CTH=0.7    
L2I (from companion website)    



OOD samples: Gaussian noise

These samples illustrate the OOD setting for ESC50 obtained by mixing dataset samples with white noise.
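As a rough illustration of how such contaminated inputs can be constructed, the snippet below mixes a waveform with Gaussian noise at a target signal-to-noise ratio; the function name and the default SNR value are illustrative and not necessarily those used in the paper.

import torch

def mix_with_white_noise(waveform: torch.Tensor, snr_db: float = 3.0) -> torch.Tensor:
    """Mix a waveform with Gaussian noise at an (illustrative) target SNR in dB."""
    noise = torch.randn_like(waveform)
    sig_power = waveform.pow(2).mean()
    noise_power = noise.pow(2).mean()
    # Scale the noise so that 10 * log10(sig_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return waveform + scale * noise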

Sample A

Predicted class: Car horn

Original    
L-MAC    

Sample B

Predicted class: Dog barking

Original    
L-MAC    

Sample C

Predicted class: Sea waves

Original    
L-MAC    

Sample D

Predicted class: Hen

Original    
L-MAC    

Sample E

Predicted class: Baby crying

Original    
L-MAC    



OOD samples: LJSpeech contamination

These samples illustrate the OOD setting for ESC50 obtained by mixing dataset samples with speech from LJSpeech.

Sample A

Predicted class: Water drops

Original    
L-MAC    

Sample B

Predicted class: Drink sipping

Original    
L-MAC    

Sample C

Predicted class: Fireworks

Original    
L-MAC    

Sample D

Predicted class: Sheep

Original    
L-MAC    

Sample E

Predicted class: Church bells

Original    
L-MAC