⚡ Kartoffelbox-Turbo

German Text-to-Speech

Kartoffelbox-Turbo is a fine-tuned version of Resemble AI's Chatterbox-Turbo, optimized specifically for the German language.

Built on the 350M parameter Turbo architecture, this model delivers German speech generation with significantly lower compute requirements and reduced latency compared to previous 500M+ parameter versions.

Key Features

⚡ Turbo Speed: Built on the 350M-parameter Chatterbox-Turbo architecture for fast synthesis.
🇩🇪 German Optimized: Fine-tuned specifically for natural German prosody and pronunciation.
Low Resource: Runs efficiently with less VRAM than the standard 500M model.
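
As a rough back-of-the-envelope illustration of the VRAM claim, the sketch below estimates the memory needed for the weights alone. It assumes the round parameter counts quoted above (350M and 500M) and standard float32/float16 storage. Real VRAM usage will be higher, since activations, caches, the other model components, and the CUDA context are not counted. The helper name `weight_memory_gib` is made up for this example.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB (no activations or caches)."""
    return num_params * bytes_per_param / 1024**3


# Parameter counts are the round figures quoted in this card, not measured values.
for name, params in [("Turbo (350M)", 350_000_000), ("standard (500M)", 500_000_000)]:
    fp32 = weight_memory_gib(params, 4)  # 4 bytes per float32 parameter
    fp16 = weight_memory_gib(params, 2)  # 2 bytes per float16 parameter
    print(f"{name}: {fp32:.2f} GiB in float32, {fp16:.2f} GiB in float16")
```

Under these assumptions the Turbo weights take about 70% of the memory of the 500M model, whatever the precision.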

⚠️ Limitations & Paralinguistic Tags

Current Status: Experimental
Please note that this model is an experimental release. During the final training phase, the loss diverged after 2.5 days.

Paralinguistic Tags: I used paralinguistic tags (such as [laugh], [sigh], [breath]) only during the final fine-tuning stage. Because training diverged during that stage, these tags are likely not supported in this version.
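
Since the tags are likely unsupported, one option is to strip them from the input text before synthesis. The following is a minimal sketch, not part of the chatterbox-tts library. The helper `strip_paralinguistic_tags` and the regular expression are assumptions made for this example, and the pattern treats any short bracketed word as a tag.

```python
import re

# Matches bracketed tags such as [laugh] or [sigh].
# The exact tag vocabulary is an assumption; adjust the pattern as needed.
TAG_PATTERN = re.compile(r"\[[a-zA-Z_ ]+\]")


def strip_paralinguistic_tags(text: str) -> str:
    """Remove bracketed tags and tidy the whitespace they leave behind."""
    without_tags = TAG_PATTERN.sub(" ", text)
    # Collapse runs of whitespace into single spaces.
    collapsed = re.sub(r"\s+", " ", without_tags).strip()
    # Remove the space left in front of punctuation, e.g. "lachen ." -> "lachen."
    return re.sub(r"\s+([.,!?;:])", r"\1", collapsed)


print(strip_paralinguistic_tags("Er musste leise lachen [laugh]. Dann seufzte er [sigh] und ging."))
```

The cleaned string can then be passed to the model in place of the tagged text.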

Installation

You need the base chatterbox-tts library to run this model.

pip install chatterbox-tts

Usage

Because this is a fine-tune of the Turbo model, you must load the base architecture first and then apply the Kartoffelbox weights to the t3 module.

import torch
import torchaudio
from chatterbox.tts_turbo import ChatterboxTurboTTS
from huggingface_hub import hf_hub_download

# 1. Define Model Repository
MODEL_REPO = "SebastianBodza/Kartoffelbox_Turbo"
MODEL_FILENAME = "model.pt"
device = "cuda" if torch.cuda.is_available() else "cpu"

# 2. Load the Base Chatterbox-Turbo Model
print("Loading base Turbo model...")
model = ChatterboxTurboTTS.from_pretrained(device)

# 3. Download and Load the Fine-Tuned German Weights
print(f"Downloading weights from {MODEL_REPO}...")
checkpoint_path = hf_hub_download(repo_id=MODEL_REPO, filename=MODEL_FILENAME)
checkpoint_state = torch.load(checkpoint_path, map_location=device)

# Clean and apply state dict to the t3 module
cleaned_state_dict = {
    k.replace("_orig_mod.", ""): v for k, v in checkpoint_state.items()
}
model.t3.load_state_dict(cleaned_state_dict)
model.t3.eval()
print("✓ Kartoffelbox-Turbo weights loaded successfully.")

# 4. Generate Speech
text = "Elias blieb stehen. War es wirklich schon zehn Jahre her? Er musste leise lachen."

# You need a reference audio file (10-20s) for voice cloning
# Ensure the reference audio matches the tone you want
audio_prompt_path = "your_german_reference.wav" 

wav = model.generate(
    text,
    audio_prompt_path=audio_prompt_path,
    temperature=0.8,
    repetition_penalty=1.2,
    top_p=0.95
)

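# 5. Save output
# torchaudio.save expects a 2D (channels, samples) tensor, so keep a channel dimension
torchaudio.save("kartoffel_output.wav", wav.detach().cpu().reshape(1, -1), model.sr)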
print("Saved to kartoffel_output.wav")
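
The key-cleaning step in the script is there because `torch.compile` wraps a module and stores its parameters under an `_orig_mod.` prefix, so a checkpoint saved from a compiled model has keys the uncompiled `t3` module does not recognize. The sketch below illustrates the idea on plain dictionaries, without needing torch. The helpers `strip_compile_prefix` and `key_mismatch` and the toy key names are hypothetical and exist only for this example. This version strips only a leading prefix and needs Python 3.9+ for `str.removeprefix`, whereas the `replace` call in the script also removes occurrences in the middle of a key.

```python
def strip_compile_prefix(state_dict: dict, prefix: str = "_orig_mod.") -> dict:
    """Drop a leading prefix from every key; keys without it are left unchanged."""
    return {key.removeprefix(prefix): value for key, value in state_dict.items()}


def key_mismatch(expected_keys, provided_keys):
    """Return (missing, unexpected) key lists, in sorted order."""
    expected, provided = set(expected_keys), set(provided_keys)
    return sorted(expected - provided), sorted(provided - expected)


# Toy stand-ins for a compiled-model checkpoint and the keys the module expects.
checkpoint = {"_orig_mod.layer.weight": 1, "_orig_mod.layer.bias": 2}
expected = ["layer.weight", "layer.bias"]

cleaned = strip_compile_prefix(checkpoint)
missing, unexpected = key_mismatch(expected, cleaned)
print("missing:", missing, "unexpected:", unexpected)
```

With the real model, the same comparison could be run on `model.t3.state_dict().keys()` and `cleaned_state_dict.keys()` before calling `load_state_dict`, which raises an error on any mismatch by default.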

Tips for Best Results

Reference Audio: Use a clean, high-quality German reference clip (approx. 10-20 seconds). The model is zero-shot, so it will attempt to clone the voice provided.
Parameters:
temperature: Controls randomness. 0.8 is a good default. Lower it for more stability, raise it for more variation.
repetition_penalty: If the model stutters, try increasing this slightly (e.g., 1.2).
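
To check that a reference clip falls in the suggested 10-20 second range before using it, a small standard-library sketch could look like the following. It assumes an uncompressed PCM WAV file, which is what Python's `wave` module reads. The helper names and default thresholds are illustrative and not part of chatterbox-tts.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed as frame count divided by sample rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_clip(path: str, min_seconds: float = 10.0, max_seconds: float = 20.0) -> bool:
    """Return True if the clip length lies within the suggested range."""
    duration = wav_duration_seconds(path)
    return min_seconds <= duration <= max_seconds
```

For example, `check_reference_clip("your_german_reference.wav")` returns `False` for a clip that is too short or too long. The check covers only the length, not the audio quality or the language.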

Training Metrics

This model was an initial attempt at fine-tuning the Chatterbox Turbo architecture. As the pipeline utilizes online voice cloning, the training process is computationally intensive.

Below are the plots for the Training and Validation Loss before divergence:

Acknowledgements

Resemble AI for the Chatterbox-Turbo architecture.
FunAudioLLM for CosyVoice.
