Project Webpage (Accepted to NeurIPS'24)
[arXiv]
It's worth emphasizing that audio interpretability is not the same as the classical audio tasks of separation or denoising. Those tasks involve recovering the complete object of interest in the output audio. A classifier network, on the other hand, might focus more on salient regions. When interpreting its decision and making the interpretation listenable, we expect to uncover such regions and not necessarily the complete object of interest.
Predicted class by the classifier: 'can opening'.
Input sample | LMAC-ZS | GradCAM++
Predicted class by the classifier: 'car horn'.
Input sample | LMAC-ZS | GradCAM++
Predicted class by the classifier: 'door wood creaks'.
Input sample | LMAC-ZS | GradCAM++
Predicted class by the classifier: 'pig'.
Input sample | LMAC-ZS | GradCAM++
Predicted class by the classifier: 'glass breaking'.
Input sample | LMAC-ZS | GradCAM++