⚡ Kartoffelbox-Turbo
German Text-to-Speech
Kartoffelbox-Turbo is a fine-tuned version of Resemble AI's Chatterbox-Turbo, optimized specifically for the German language.
Built on the 350M parameter Turbo architecture, this model delivers German speech generation with significantly lower compute requirements and reduced latency compared to previous 500M+ parameter versions.
Key Features
- ⚡ Turbo Speed: Built on the Chatterbox-Turbo architecture (350M parameters) for fast synthesis.
- 🇩🇪 German Optimized: Fine-tuned specifically for natural German prosody and pronunciation.
- Low Resource: Runs efficiently with less VRAM than the standard 500M model.
⚠️ Limitations & Paralinguistic Tags
Current Status: Experimental
Please note that this model is an experimental release. During the final training phase, the loss diverged after 2.5 days.
- Paralinguistic Tags: I only used the paralinguistic features (such as [laugh], [sigh], [breath]) during the final fine-tuning stage. Due to the training divergence, these tags are likely not supported in this version.
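Since the tags are likely unsupported, one cautious option is to strip them from the input text before synthesis. The following is a minimal sketch, not part of the model or the chatterbox-tts library: the helper name `strip_paralinguistic_tags` is hypothetical, and it assumes tags are written as lowercase words in square brackets like the examples listed above.

```python
import re

# Assumption: tags are lowercase words in square brackets, e.g. [laugh].
TAG_PATTERN = re.compile(r"\[[a-z_]+\]")


def strip_paralinguistic_tags(text: str) -> str:
    """Remove bracketed tags and tidy the whitespace they leave behind."""
    without_tags = TAG_PATTERN.sub(" ", text)
    # Collapse runs of whitespace created by the removal.
    collapsed = re.sub(r"\s+", " ", without_tags).strip()
    # Drop a space left directly before punctuation.
    return re.sub(r"\s+([,.!?;:])", r"\1", collapsed)


if __name__ == "__main__":
    raw = "Er musste leise lachen. [laugh] War es wirklich so lange her?"
    print(strip_paralinguistic_tags(raw))
```

The cleaned string can then be passed as `text` to `model.generate` as shown in the Usage section.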
Installation
You need the base chatterbox-tts library to run this model.
pip install chatterbox-tts
Usage
Because this is a fine-tune of the Turbo model, you must load the base architecture first and then apply the Kartoffelbox weights to the t3 module.
import torch
import torchaudio
from chatterbox.tts_turbo import ChatterboxTurboTTS
from huggingface_hub import hf_hub_download
# 1. Define Model Repository
MODEL_REPO = "SebastianBodza/Kartoffelbox_Turbo"
MODEL_FILENAME = "model.pt"
device = "cuda" if torch.cuda.is_available() else "cpu"
# 2. Load the Base Chatterbox-Turbo Model
print("Loading base Turbo model...")
model = ChatterboxTurboTTS.from_pretrained(device)
# 3. Download and Load the Fine-Tuned German Weights
print(f"Downloading weights from {MODEL_REPO}...")
checkpoint_path = hf_hub_download(repo_id=MODEL_REPO, filename=MODEL_FILENAME)
checkpoint_state = torch.load(checkpoint_path, map_location=device)
# Clean and apply state dict to the t3 module
cleaned_state_dict = {
    k.replace("_orig_mod.", ""): v for k, v in checkpoint_state.items()
}
model.t3.load_state_dict(cleaned_state_dict)
model.t3.eval()
print("✓ Kartoffelbox-Turbo weights loaded successfully.")
# 4. Generate Speech
text = "Elias blieb stehen. War es wirklich schon zehn Jahre her? Er musste leise lachen."
# You need a reference audio file (10-20s) for voice cloning
# Ensure the reference audio matches the tone you want
audio_prompt_path = "your_german_reference.wav"
wav = model.generate(
    text,
    audio_prompt_path=audio_prompt_path,
    temperature=0.8,
    repetition_penalty=1.2,
    top_p=0.95,
)
# 5. Save output
torchaudio.save("kartoffel_output.wav", wav.squeeze(0).cpu(), model.sr)
print("Saved to kartoffel_output.wav")
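The key-cleaning step in the script above can be looked at in isolation. The `_orig_mod.` prefix is what `torch.compile`-wrapped modules add to their state-dict keys, which is presumably why it shows up in this checkpoint (the model card does not say so explicitly). The sketch below uses plain dictionaries as a stand-in for a checkpoint so it runs without PyTorch; the helper names `strip_compile_prefix` and `compare_keys` and the toy key names are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple

COMPILE_PREFIX = "_orig_mod."


def strip_compile_prefix(state: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `state` with every "_orig_mod." segment removed from its keys."""
    return {key.replace(COMPILE_PREFIX, ""): value for key, value in state.items()}


def compare_keys(
    cleaned: Dict[str, Any], expected: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) key names relative to the expected keys."""
    expected_set = set(expected)
    cleaned_set = set(cleaned)
    missing = sorted(expected_set - cleaned_set)
    unexpected = sorted(cleaned_set - expected_set)
    return missing, unexpected


if __name__ == "__main__":
    # Toy stand-in for a checkpoint; real values would be tensors.
    checkpoint = {"_orig_mod.embed.weight": 1, "_orig_mod.head.bias": 2}
    cleaned = strip_compile_prefix(checkpoint)
    missing, unexpected = compare_keys(
        cleaned, ["embed.weight", "head.bias", "head.weight"]
    )
    print("missing:", missing)
    print("unexpected:", unexpected)
```

With the real model, the expected keys would come from `model.t3.state_dict().keys()`. Checking them before calling `load_state_dict` gives a more readable report than the exception raised on a mismatch.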
Tips for Best Results
- Reference Audio: Use a clean, high-quality German reference clip (approx. 10-20 seconds). The model is zero-shot, so it will attempt to clone the voice provided.
- Parameters:
  - temperature: Controls randomness. 0.8 is a good default. Lower it for more stability, raise it for more variation.
  - repetition_penalty: If the model stutters, try increasing this slightly (e.g., 1.2).
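The recommended reference-clip length can be checked before running synthesis. This is a minimal sketch using only Python's standard `wave` module, so it assumes the reference is an uncompressed PCM WAV file (other encodings would need a different reader). The helper names and the silent demo clip are hypothetical, and the 10 to 20 second bounds simply mirror the tip above.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_good_reference_length(
    path: str, min_s: float = 10.0, max_s: float = 20.0
) -> bool:
    """Check whether the clip length falls inside the recommended range."""
    return min_s <= wav_duration_seconds(path) <= max_s


def write_silence(path: str, seconds: int, rate: int = 16000) -> None:
    """Write a mono 16-bit silent clip; used here only as demo data."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * rate * seconds)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        demo_path = os.path.join(tmp_dir, "demo_reference.wav")
        write_silence(demo_path, seconds=12)
        print(wav_duration_seconds(demo_path), is_good_reference_length(demo_path))
```

The check only covers duration. Whether the clip is clean and matches the desired tone still has to be judged by listening.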
Training Metrics
This model was an initial attempt at fine-tuning the Chatterbox Turbo architecture. As the pipeline utilizes online voice cloning, the training process is computationally intensive.
Below are the plots for the Training and Validation Loss before divergence:

Acknowledgements
- Resemble AI for the Chatterbox-Turbo architecture.
- FunAudioLLM for CosyVoice.