LMAC-TD: Producing Explanations for Audio Classifiers in Time Domain

Eleonora Mancini*, Francesco Paissan*, Mirco Ravanelli, Cem Subakan

Project Webpage (Accepted to ICASSP'25) - Code: GitHub
* Both authors contributed equally to this research. For these authors, the order is alphabetical.

Abstract

Neural networks are typically black boxes that remain opaque with regard to their decision mechanisms. Several works have proposed post-hoc explanation methods to alleviate this issue. This paper proposes LMAC-TD, a post-hoc explanation method that trains a decoder to produce explanations directly in the time domain. This methodology builds upon LMAC (Listenable Maps for Audio Classifiers) and significantly improves the audio quality of the produced explanations. Our user study reveals a clear preference for our proposed methodology, LMAC-TD, over baseline models. Additionally, LMAC-TD remains competitive in terms of faithfulness metrics.

It's worth emphasizing that audio interpretability is not the same as the classical audio tasks of separation or denoising. Those tasks involve recovering the complete object of interest in the output audio. A classifier network, on the other hand, might focus only on the salient regions. When we interpret its decision and make it listenable, we expect to uncover those regions and not necessarily the complete object of interest.

Please note:

Some audio samples are silent. If you see the time cursor moving and can't hear anything, this is not a problem with the audio player; it means the method did not attribute anything.

Chrome is the preferred browser. The samples may not always run in Firefox.

Use of headphones or earphones is recommended.

Please tune your volume appropriately before playing any sample.

The audio files may take some time to load. If they do not load even after waiting, it may be a temporary issue with the hosting server. In that case, please visit again after some time.

Audio samples: User study - Part I

These samples are the ones presented to the participants in the first stage of the user study. The mixtures are generated with ESC50 samples from the validation and test folds. For each sample, you can listen to the input audio given to the classifier and to the interpretation audio generated for the predicted class. The method that generated each interpretation is written to the right of the audio player.

Sample A

Predicted class by the classifier: 'can opening'.

Input sample
LMAC-TD $\alpha=1$ (Ours)
LMAC-TD $\alpha=0.75$ (Ours)
LMAC-TD $\alpha=0$ (Ours)
L-MAC
L-MAC FT
L2I

Sample B

Predicted class by the classifier: 'car horn'.

Input sample
LMAC-TD $\alpha=1$ (Ours)
LMAC-TD $\alpha=0.75$ (Ours)
LMAC-TD $\alpha=0$ (Ours)
L-MAC
L-MAC FT
L2I

Sample C

Predicted class by the classifier: 'door wood creaks'.

Input sample
LMAC-TD $\alpha=1$ (Ours)
LMAC-TD $\alpha=0.75$ (Ours)
LMAC-TD $\alpha=0$ (Ours)
L-MAC
L-MAC FT
L2I

Sample D

Predicted class by the classifier: 'pig'.

Input sample
LMAC-TD $\alpha=1$ (Ours)
LMAC-TD $\alpha=0.75$ (Ours)
LMAC-TD $\alpha=0$ (Ours)
L-MAC
L-MAC FT
L2I

Sample E

Predicted class by the classifier: 'glass breaking'.

Input sample
LMAC-TD $\alpha=1$ (Ours)
LMAC-TD $\alpha=0.75$ (Ours)
LMAC-TD $\alpha=0$ (Ours)
L-MAC
L-MAC FT
L2I

Audio samples: User study - Part II

These samples are the ones presented to the participants in the second stage of the user study. The mixtures are downloaded from L2I's companion website for a fair qualitative comparison. For each sample, you can listen to the input audio given to the classifier and to the interpretation audio generated for the predicted class. The method that generated each interpretation is written to the right of the audio player.