SAO-Instruct: Free-form Audio Editing using Natural Language Instructions
Paper | Sample Page | Code
SAO-Instruct is a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions.
Inference
To get started, clone the repository and install the dependencies:
git clone https://github.com/ETH-DISCO/sao-instruct.git
pip install -r model/requirements.txt && pip install model/stable-audio-tools
Use the following script to perform inference with the SAO-Instruct weights from 🤗 Hugging Face. When encode_audio is set to True, the provided audio is encoded into the latent space and used as the starting point for generation. You can control the amount of noise added to the encoded audio with the encoded_audio_noise parameter. Experiment with different configurations to achieve the best results.
import torch
from IPython.display import Audio, display
from model.sao_instruct import SAOInstruct

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SAOInstruct.from_pretrained("disco-eth/sao-instruct").eval().to(device)

audio_path = "path/to/audio.wav"

# Encode the input audio as the starting latent; encoded_audio_noise controls the edit strength.
edited_clips = model.edit_audio(
    instructions=["add a cat meowing"],
    audio_path=audio_path,
    encode_audio=True,
    cfg_scale=6,
    encoded_audio_noise=4,
)

# Play the original clip, then each edited clip.
display(Audio(audio_path))
for clip in edited_clips:
    display(Audio(clip, rate=model.sample_rate, normalize=False))
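To keep an edited clip, you can also write it to disk. A minimal sketch, assuming each clip comes back as a channels-first array or tensor (soundfile expects samples-first):
import numpy as np
import soundfile as sf

clip = edited_clips[0]
clip = clip.cpu().numpy() if hasattr(clip, "cpu") else np.asarray(clip)  # tensor -> numpy
sf.write("edited.wav", clip.T, model.sample_rate)  # transpose to (samples, channels)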
Data Generation
The files required to generate audio editing triplets are in the dataset/ folder.
Prompt Generation
The script generate_prompts.py can be used for prompt generation. It accepts a .jsonl file as input, where each line has the following form:
{"caption": "Audio Caption", "metadata": {}}
This input .jsonl file can be created using the prepare_captions.py script for AudioCaps, WavCaps, and AudioSetSL. If you download audio clips from these captioning datasets (e.g., if you want to use DDPM inversion for paired sample generation), the metadata field can be used to match each caption to its audio filename. generate_prompts.py then outputs a .jsonl file of processed prompts, each containing the input caption, the edit instruction, and the output caption.
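For reference, a processed prompt line might look roughly as follows; the field names here are an assumption, so check the script's actual output for the exact schema:
{"input_caption": "a dog barks in a park", "edit_instruction": "add a cat meowing", "output_caption": "a dog barks in a park while a cat meows", "metadata": {}}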
Paired Sample Generation
Prompt-to-Prompt
After generating prompts, you can use Prompt-to-Prompt to generate a synthetic dataset of edited audio pairs. The Prompt-to-Prompt pipeline consists of two parts:
- Candidate Search: Searching for ideal candidates (CFG, seed) for all prompts in the prompt file.
- Sample Generation: Generating the edited audio pairs using the candidates found in the previous step.
Use the script generate_candidates.py for the candidate search.
The script generate_samples.py can be used for Prompt-to-Prompt sample generation (use the mode p2p).
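Conceptually, the candidate search tries several (CFG scale, seed) combinations per prompt and keeps the one whose generations best match both captions. Below is a minimal sketch of that idea only; generate_with_p2p and score_alignment are hypothetical placeholders, and the actual selection criteria live in generate_candidates.py:
import random

def generate_with_p2p(input_caption, output_caption, cfg_scale, seed):
    # Placeholder: run the adapted Stable Audio Open Prompt-to-Prompt pipeline
    # and return (input_audio, output_audio).
    raise NotImplementedError

def score_alignment(audio, caption):
    # Placeholder: text-audio similarity, e.g. from a CLAP-style model.
    raise NotImplementedError

def find_candidate(input_caption, output_caption, cfg_scales=(4, 6, 8), seeds_per_cfg=5):
    best = None
    for cfg_scale in cfg_scales:
        for _ in range(seeds_per_cfg):
            seed = random.randrange(2**31)
            in_audio, out_audio = generate_with_p2p(input_caption, output_caption, cfg_scale, seed)
            # Require both clips to match their captions: rank by the worse of the two scores.
            score = min(score_alignment(in_audio, input_caption),
                        score_alignment(out_audio, output_caption))
            if best is None or score > best[0]:
                best = (score, cfg_scale, seed)
    return {"score": best[0], "cfg_scale": best[1], "seed": best[2]}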
We have included the source code of Stable Audio Open with the adaptations made for Prompt-to-Prompt in audio_generation/p2p/stable-audio-tools (particularly in audio_generation/p2p/stable-audio-tools/models/transformer.py).
You can install its requirements using:
pip install audio_generation/p2p/stable-audio-tools
Make sure that the k_diffusion package uses the same noise for every item in the batch, so that the original and edited generations stay aligned. Change the function sample_dpmpp_3m_sde in the k_diffusion/sampling.py file to:
if eta:
    # Draw a single noise sample and share it across the batch so the
    # original and edited prompts receive identical stochastic noise.
    noise = noise_sampler(sigmas[i], sigmas[i + 1])[0].unsqueeze(dim=0)
    noise = noise.repeat(x.shape[0], 1, 1)
    x = x + noise * sigmas[i + 1] * (-2 * h * eta).expm1().neg().sqrt() * s_noise
DDPM Inversion
The script generate_samples.py can be used to create samples using DDPM inversion (use the mode edit).
We follow the implementation from the paper Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion.
Clone the repository and install its dependencies using:
cd audio_generation && git clone https://github.com/HilaManor/AudioEditingCode.git
cd AudioEditingCode && pip install -r requirements.txt
Manual Edits
For generating manual edits, use the script manual_edits/generate_manual_samples.py.
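As an illustration of the concept only (the actual pipeline is in the script above), one kind of manual edit pair can be built by mixing a sound event into a background clip, assuming both files are mono and share a sample rate:
import numpy as np
import soundfile as sf

bg, sr = sf.read("background.wav")
event, sr_event = sf.read("event.wav")
assert sr == sr_event, "resample first if the rates differ"

out = bg.copy()
n = min(len(event), len(out))
out[:n] += 0.5 * event[:n]     # mix the event in at reduced gain
out = np.clip(out, -1.0, 1.0)  # guard against clipping
sf.write("edited.wav", out, sr)
# (background.wav, "add <event>", edited.wav) then forms one editing triplet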
Fine-tuning Stable Audio Open
We provide training and data loading scripts to enable fine-tuning on audio editing triplets:
- model/stable-audio-tools/train_edit.py - Modified training script for audio editing tasks
- model/stable-audio-tools/stable_audio_tools/data/dataset_edit.py - Custom dataset loader for editing triplets
- model/stable-audio-tools/stable_audio_tools/configs - Contains configuration files for both the model and dataset
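For example, assuming train_edit.py keeps the upstream stable-audio-tools command-line interface (the config filenames below are placeholders; check the configs folder for the actual files):
python model/stable-audio-tools/train_edit.py \
  --dataset-config model/stable-audio-tools/stable_audio_tools/configs/dataset.json \
  --model-config model/stable-audio-tools/stable_audio_tools/configs/model.json \
  --name sao-instruct-finetune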
Otherwise, follow the official recommendations from Stable Audio Open to fine-tune the model.
Attribution and License
This repository builds upon Stable Audio Open, a model developed by Stability AI.
It uses checkpoints and components from stabilityai/stable-audio-open-1.0 that are licensed under the Stability AI Community License. Please see the NOTICE file for required attribution.
Powered by Stability AI
This repository and its contents are released for academic research and non-commercial use only.