Audio editing with non-rigid text prompts
Abstract
In this paper, we explore audio editing with non-rigid text edits. We show that the proposed editing pipeline creates audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and inpainting. We show quantitatively and qualitatively that the resulting edits outperform those of AudioLDM, a recently released text-prompted audio generation model. Qualitative inspection of the results indicates that the edits produced by our approach remain more faithful to the input audio, in that they preserve the original onsets and offsets of the audio events.
Figure 1: In step 1, we optimize the text embedding $e_\text{text}$ to minimize the reconstruction error in Equation (1), where $z = \text{VAEEncoder}(X_\text{inp})$. The resulting optimized text embedding is denoted $e_\text{opt}$. In step 2, we optimize the diffusion model parameters to minimize the same reconstruction loss as in step 1. In steps 1 and 2, only the part of the pipeline shown in the green box is used. In step 3, the text embedding is set to a linear combination of the target embedding and the optimized embedding, $e_\text{text} = \eta e_\text{target} + (1 - \eta) e_\text{opt}$, and the whole pipeline shown in the yellow box is used.
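The interpolation in step 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `interpolate_embedding` and the 4-dimensional vectors are hypothetical, and real text embeddings would be much larger tensors produced by the text encoder. The sketch only shows how $\eta$ trades off the target embedding against the optimized one.

```python
def interpolate_embedding(e_target, e_opt, eta):
    """Step 3 of the caption: e_text = eta * e_target + (1 - eta) * e_opt.

    Embeddings are plain lists of floats here (an assumption made to keep
    the sketch dependency-free); the combination is element-wise.
    """
    if len(e_target) != len(e_opt):
        raise ValueError("embeddings must have the same length")
    return [eta * t + (1.0 - eta) * o for t, o in zip(e_target, e_opt)]


# Hypothetical toy embeddings, for illustration only.
e_opt = [0.0, 1.0, 2.0, 3.0]      # stands in for the embedding optimized in step 1
e_target = [4.0, 1.0, 0.0, -1.0]  # stands in for the embedding of the edit prompt

# Sweep eta: 0 reproduces e_opt, 1 reproduces e_target.
for eta in (0.0, 0.5, 1.0):
    print(eta, interpolate_embedding(e_target, e_opt, eta))
```

With $\eta = 0$ the result equals $e_\text{opt}$, the embedding fitted to the input audio, and with $\eta = 1$ it equals $e_\text{target}$. Intermediate values blend the two.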
Addition edits
Prompt: "Engine revving while a car horn honks several times loudly."
Prompt: "Machine gun, while bell in the beginning."
Prompt: "An animal whimpering, while clip-clop of horse hooves in the background."
Style transfer edits
Prompt: "Sound of knocking the door."
Prompt: "Sound of gunshots in the background."
Prompt: "A man is giving a speech."
Inpainting edits
Prompt: "A sudden horn."
Prompt: "A group of people are laughing."
Prompt: "Rapid typing on a keyboard."