SAO-Instruct: Free-form Audio Editing using Natural Language Instructions

Paper | Sample Page | Code

SAO-Instruct Overview

SAO-Instruct is a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions.

Inference

To get started, clone the repository and install the dependencies:

git clone https://github.com/ETH-DISCO/sao-instruct.git
pip install -r model/requirements.txt && pip install model/stable-audio-tools

Use the following script to perform inference with SAO-Instruct weights from 🤗 Hugging Face. When encode_audio is set to True, the provided audio is encoded into the latent space and used as a starting point for generation. You can control the amount of noise added to the encoded audio using the encoded_audio_noise parameter. Experiment with different configurations to achieve optimal results.

import torch
from IPython.display import Audio, display
from model.sao_instruct import SAOInstruct

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SAOInstruct.from_pretrained("disco-eth/sao-instruct").eval().to(device)

audio_path = "path/to/audio.wav"
# encode_audio=True uses the input clip (with noise controlled by
# encoded_audio_noise) as the starting point for generation
edited_clips = model.edit_audio(
    instructions=["add a cat meowing"],
    audio_path=audio_path,
    encode_audio=True,
    cfg_scale=6,
    encoded_audio_noise=4
)

display(Audio(audio_path))
for clip in edited_clips:
    display(Audio(clip, rate=model.sample_rate, normalize=False))
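
To write the edited clips to disk instead of playing them inline, here is a minimal sketch using soundfile. It assumes each returned clip is a (channels, samples) tensor at model.sample_rate; adjust the transpose if the output shape differs.

import soundfile as sf
import torch

for i, clip in enumerate(edited_clips):
    # convert to a NumPy array and transpose to (samples, channels) for soundfile
    data = clip.detach().cpu().numpy() if torch.is_tensor(clip) else clip
    sf.write(f"edited_{i}.wav", data.T, model.sample_rate)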

Data Generation

The files required to generate audio editing triplets are located in the dataset/ folder.

Prompt Generation

The script generate_prompts.py can be used for prompt generation. It accepts a .jsonl file as input in the following form:

{"caption":  "Audio Caption", "metadata": {}}

This input .jsonl file can be created using the prepare_captions.py script for AudioCaps, WavCaps, and AudioSetSL. If you download audio clips from the captioning datasets (e.g., if you want to use DDPM inversion for paired sample generation), the metadata field can be used to match each clip to its source filename. The output of generate_prompts.py is a .jsonl file of processed prompts, each containing the input caption, the edit instruction, and the output caption.
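
For illustration, a single processed prompt could look like the line below; the exact field names are an assumption and may differ from what generate_prompts.py actually emits.

{"input_caption": "A dog barks in a park", "edit_instruction": "add a cat meowing", "output_caption": "A dog barks in a park while a cat meows", "metadata": {}}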

Paired Sample Generation

Prompt-to-Prompt

After generating prompts, you can use Prompt-to-Prompt to generate a synthetic dataset of edited audio pairs. The Prompt-to-Prompt pipeline consists of two parts:

  • Candidate Search: Searching for ideal candidates (CFG, seed) for all prompts in the prompt file.
  • Sample Generation: Generating the edited audio pairs using the candidates found in the previous step.

Use the script generate_candidates.py for the candidate search. The script generate_samples.py can be used for Prompt-to-Prompt sample generation (use the mode p2p). We have included the source code of Stable Audio Open with the adaptations made for Prompt-to-Prompt in audio_generation/p2p/stable-audio-tools (particularly in audio_generation/p2p/stable-audio-tools/models/transformer.py). You can install its requirements using:

pip install audio_generation/p2p/stable-audio-tools

Make sure that the k_diffusion package uses the same starting noise for every sample in the batch. In k_diffusion/sampling.py, change the noise-injection step of the function sample_dpmpp_3m_sde to:

if eta:
    # draw one noise sample and repeat it across the batch so that every
    # element (e.g. the original and edited prompt) shares the same noise
    noise = noise_sampler(sigmas[i], sigmas[i + 1])[0].unsqueeze(dim=0)
    noise = noise.repeat(x.shape[0], 1, 1)
    x = x + noise * sigmas[i + 1] * (-2 * h * eta).expm1().neg().sqrt() * s_noise
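
With the patched sampler in place, the two steps can be run along the following lines. The script locations and flag names below are hypothetical; check each script's --help or source for the actual arguments.

# hypothetical invocation; adjust paths and flags to the actual scripts
python generate_candidates.py --prompts prompts.jsonl --output candidates.jsonl
python generate_samples.py --mode p2p --candidates candidates.jsonl --output-dir p2p_samples/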

DDPM Inversion

The script generate_samples.py can be used to create samples using DDPM inversion (use the mode edit). We follow the implementation from the paper Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion. Clone the repository and install its dependencies using:

cd audio_generation && git clone https://github.com/HilaManor/AudioEditingCode.git
cd AudioEditingCode && pip install -r requirements.txt

Manual Edits

For generating manual edits, use the script manual_edits/generate_manual_samples.py.

Fine-tuning Stable Audio Open

We provide training and data loading scripts to enable fine-tuning on audio editing triplets:

  • model/stable-audio-tools/train_edit.py - Modified training script for audio editing tasks
  • model/stable-audio-tools/stable_audio_tools/data/dataset_edit.py - Custom dataset loader for editing triplets
  • model/stable-audio-tools/stable_audio_tools/configs - Contains configuration files for both the model and dataset

Otherwise, follow the official recommendations from Stable Audio Open to fine-tune the model.
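
As a starting point, a launch command could look like the sketch below, assuming train_edit.py keeps the command-line interface of the upstream stable-audio-tools train.py; the flag names and config file names here are placeholders, so check the script and the configs folder for the actual values.

python model/stable-audio-tools/train_edit.py \
    --model-config model/stable-audio-tools/stable_audio_tools/configs/model_config.json \
    --dataset-config model/stable-audio-tools/stable_audio_tools/configs/dataset_config.json \
    --pretrained-ckpt-path path/to/stable-audio-open-1.0.ckpt \
    --name sao-instruct-finetune \
    --save-dir checkpoints/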

Attribution and License

This repository builds upon Stable Audio Open, a model developed by Stability AI.
It uses checkpoints and components from stabilityai/stable-audio-open-1.0 that are licensed under the Stability AI Community License. Please see the NOTICE file for required attribution.

Powered by Stability AI
This repository and its contents are released for academic research and non-commercial use only.
