It's worth emphasizing that audio interpretability is not the same as the classical audio tasks of separation or denoising. Those tasks involve recovering the complete object of interest in the output audio. A classifier network, on the other hand, might focus more on salient regions. When interpreting its decision and making it listenable, we expect to uncover such regions and not necessarily the complete object of interest.
@inproceedings{lmac,
  author={Francesco Paissan and Mirco Ravanelli and Cem Subakan},
  title={{Listenable Maps for Audio Classifiers}},
  year={2024},
  booktitle={International Conference on Machine Learning (ICML)},
}
These samples are the ones presented to the participants of the user study in the first stage. The mixtures are generated with ESC50 samples from the validation and test folds. For each sample, you can listen to the input audio to the classifier and the interpretation audio generated for the predicted class. The method that generated the interpretation is written to the right of the audio player.
Predicted class by the classifier: 'can opening'.
Input sample | |
L-MAC | |
L-MAC @ FT=16, CTH=0.6 | |
L-MAC @ FT=16, CTH=0.7 | |
L2I @ RTH=0.2 | |
L2I @ RTH=0.4 |
Predicted class by the classifier: 'car horn'.
Input sample | |
L-MAC | |
L-MAC @ FT=16, CTH=0.6 | |
L-MAC @ FT=16, CTH=0.7 | |
L2I @ RTH=0.2 | |
L2I @ RTH=0.4 |
Predicted class by the classifier: 'door wood creaks'.
Input sample | |
L-MAC | |
L-MAC @ FT=16, CTH=0.6 | |
L-MAC @ FT=16, CTH=0.7 | |
L2I @ RTH=0.2 | |
L2I @ RTH=0.4 |
Predicted class by the classifier: 'pig'.
Input sample | |
L-MAC | |
L-MAC @ FT=16, CTH=0.6 | |
L-MAC @ FT=16, CTH=0.7 | |
L2I @ RTH=0.2 | |
L2I @ RTH=0.4 |
Predicted class by the classifier: 'glass breaking'.
Input sample | |
L-MAC | |
L-MAC @ FT=16, CTH=0.6 | |
L-MAC @ FT=16, CTH=0.7 | |
L2I @ RTH=0.2 | |
L2I @ RTH=0.4 |
These samples are the ones presented to the participants of the user study in the second stage. The mixtures are downloaded from L2I's companion website for a fair qualitative comparison. For each sample, you can listen to the input audio to the classifier and the interpretation audio generated for the predicted class. The method that generated the interpretation is written to the right of the audio player.
Predicted class by the classifier: 'dog'.
Input sample | |
L-MAC | |
L-MAC @ FT=16, CTH=0.6 | |
L-MAC @ FT=16, CTH=0.7 | |
L2I (from companion website) |
Predicted class by the classifier: 'baby crying'.
Input sample | |
L-MAC | |
L-MAC @ FT=16, CTH=0.6 | |
L-MAC @ FT=16, CTH=0.7 | |
L2I (from companion website) |
Predicted class by the classifier: 'church bells'.
Input sample | |
L-MAC | |
L-MAC @ FT=16, CTH=0.6 | |
L-MAC @ FT=16, CTH=0.7 | |
L2I (from companion website) |
Predicted class by the classifier: 'dog'.
Input sample | |
L-MAC | |
L-MAC @ FT=16, CTH=0.6 | |
L-MAC @ FT=16, CTH=0.7 | |
L2I (from companion website) |
Predicted class: Car horn
Original | |
L-MAC | |
Predicted class: Dog barking
Original | |
L-MAC | |
Predicted class: Sea waves
Original | |
L-MAC | |
Predicted class: Hen
Original | |
L-MAC | |
Predicted class: Baby crying
Original | |
L-MAC | |
Predicted class: Water drops
Original | |
L-MAC | |
Predicted class: Drink sipping
Original | |
L-MAC | |
Predicted class: Fireworks
Original | |
L-MAC | |
Predicted class: Sheep
Original | |
L-MAC | |
Predicted class: Church bells
Original | |
L-MAC | |