SeamlessM4T-v2 T2ST/T2TT Lite Model
Extracted from facebook/seamless-m4t-v2-large, containing only T2ST and T2TT components.
Original Model: facebook/seamless-m4t-v2-large
Official Documentation: SeamlessM4T-v2 Documentation (https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)
Note: This package only reorganizes publicly available weights from Meta's original model for T2ST/T2TT usage. No new training or fine-tuning is introduced. All rights to the model and weights belong to their original owner.
Supported Features
- T2TT (Text-to-Text Translation): Multilingual text translation
- T2ST (Text-to-Speech Translation): Translate text into speech in the target language, with selectable speaker voices
- Multi-Speaker Support: 200 different speaker voices
- 96 Languages: Text translation across 96 languages; speech synthesis covers the subset of languages supported by the original model's vocoder
Included Components
Model Weights
- text_encoder: Text encoder (shared by T2TT and T2ST)
- text_decoder + lm_head: Text decoder (T2TT)
- t2u_model: Text-to-unit encoder-decoder (T2ST; contains t2u_encoder and t2u_decoder)
- vocoder: HiFi-GAN vocoder with 200 speaker embeddings (T2ST)
- shared.weight: Shared word embeddings
- lang_embed: Language embeddings
Model Size
- Original Model: ~8.6 GB
- Lite Model: ~6.2 GB
- Removed Weights: 802 weight tensors (the speech_encoder)
- Space Saved: ~2.4 GB
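Since the speech_encoder weights are not shipped, it may be cleaner to load one of the task-specific classes from transformers (SeamlessM4Tv2ForTextToText, SeamlessM4Tv2ForTextToSpeech), which do not instantiate a speech encoder at all. A minimal sketch; whether this checkpoint's config loads cleanly into these classes is an assumption:
# Sketch: load only the text-to-text path; SeamlessM4Tv2ForTextToText
# contains no speech_encoder, so its weights are never expected.
from transformers import SeamlessM4Tv2ForTextToText

t2tt_model = SeamlessM4Tv2ForTextToText.from_pretrained("jaman21/seamless-m4t-v2-t2tt-t2st")
total = sum(p.numel() for p in t2tt_model.parameters())
print(f"T2TT parameters: {total / 1e9:.2f}B")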
Usage Examples
1. Basic T2TT: Text-to-Text Translation
from transformers import SeamlessM4Tv2Model, AutoProcessor
# Load model
model = SeamlessM4Tv2Model.from_pretrained("jaman21/seamless-m4t-v2-t2tt-t2st")
processor = AutoProcessor.from_pretrained("jaman21/seamless-m4t-v2-t2tt-t2st")
# Translate text
text_inputs = processor(text="Hello, how are you?", src_lang="eng", return_tensors="pt")
output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
translated_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(translated_text) # "Bonjour, comment allez-vous?"
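The same pattern can be wrapped for reuse across several target languages. A small sketch; the helper name translate is illustrative, and the language codes are standard SeamlessM4T three-letter codes:
# Sketch: translate one sentence into several target languages.
def translate(text: str, src_lang: str, tgt_lang: str) -> str:
    inputs = processor(text=text, src_lang=src_lang, return_tensors="pt")
    tokens = model.generate(**inputs, tgt_lang=tgt_lang, generate_speech=False)
    return processor.decode(tokens[0].tolist()[0], skip_special_tokens=True)

for lang in ("fra", "deu", "spa"):
    print(lang, "->", translate("Hello, how are you?", "eng", lang))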
2. Basic T2ST: Text-to-Speech Translation
import torchaudio
# Translate text to speech
text_inputs = processor(text="Hello world", src_lang="eng", return_tensors="pt")
audio = model.generate(**text_inputs, tgt_lang="cmn", generate_speech=True)[0].cpu().squeeze()
# Save audio (torchaudio expects a (channels, frames) tensor; sample rate: 16000 Hz)
torchaudio.save("output.wav", audio.unsqueeze(0), 16000)
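# A common alternative (used in the Hugging Face SeamlessM4T docs) is to save
# via scipy from a NumPy array; equivalent result, extra dependency:
# import scipy.io.wavfile
# scipy.io.wavfile.write("output.wav", rate=16000, data=audio.numpy())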
# Use different speaker IDs (0-199) to get different voice characteristics
text_inputs = processor(text="Good morning!", src_lang="eng", return_tensors="pt")
# Speaker 0 - default voice (pretrained)
audio_spk0 = model.generate(**text_inputs, tgt_lang="spa", generate_speech=True, speaker_id=0)
# Speaker 5 - different voice (pretrained)
audio_spk5 = model.generate(**text_inputs, tgt_lang="spa", generate_speech=True, speaker_id=5)
# Speaker 42 - another voice option (pretrained)
audio_spk42 = model.generate(**text_inputs, tgt_lang="spa", generate_speech=True, speaker_id=42)
# Note: a given speaker_id can sound different across target languages.
# Try values in the 0-199 range to find the voice that best suits your use case.
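To compare voices side by side, loop over a few IDs and write each result to disk. A minimal sketch; the file names are illustrative:
# Sketch: render the same sentence with several speaker IDs for comparison.
import torchaudio

for spk in (0, 5, 42):
    wav = model.generate(**text_inputs, tgt_lang="spa", generate_speech=True, speaker_id=spk)[0].cpu().squeeze()
    torchaudio.save(f"good_morning_spk{spk}.wav", wav.unsqueeze(0), 16000)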
3. Generate Both Text and Speech Simultaneously
# Generate both translated text and speech in one call
text_inputs = processor(text="How can I help you?", src_lang="eng", return_tensors="pt")
# Set return_intermediate_token_ids=True to get both outputs
outputs = model.generate(
**text_inputs,
tgt_lang="deu",
generate_speech=True,
return_intermediate_token_ids=True
)
# Extract text (use the output's named fields; positional indexing is fragile
# because the returned tuple also contains waveform lengths)
translated_text = processor.decode(outputs.sequences[0].tolist(), skip_special_tokens=True)
# Extract audio
audio_waveform = outputs.waveform.cpu().numpy().squeeze()
print(f"Translated text: {translated_text}")
print(f"Audio shape: {audio_waveform.shape}")
4. Advanced Generation Strategies
# Beam search for better quality (slower)
text_inputs = processor(text="The quick brown fox jumps", src_lang="eng", return_tensors="pt")
outputs = model.generate(
**text_inputs,
tgt_lang="jpn",
generate_speech=False,
num_beams=5, # Use beam search
max_new_tokens=256,
early_stopping=True
)
# Sampling for more diverse output
outputs = model.generate(
**text_inputs,
tgt_lang="kor",
generate_speech=False,
do_sample=True, # Enable sampling
top_k=50,
top_p=0.95,
temperature=0.8 # lower values are more deterministic, higher more random (values above 1.0 are allowed)
)
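To see what each strategy actually produces, decode both outputs side by side. A minimal sketch reusing the inputs above:
# Sketch: compare beam search and sampling on the same input.
strategies = {
    "beam": dict(num_beams=5, early_stopping=True),
    "sample": dict(do_sample=True, top_k=50, top_p=0.95, temperature=0.8),
}
for name, kwargs in strategies.items():
    out = model.generate(**text_inputs, tgt_lang="jpn", generate_speech=False, **kwargs)
    print(name, "->", processor.decode(out[0].tolist()[0], skip_special_tokens=True))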
5. Batch Processing Multiple Texts
# Process multiple texts at once
texts = [
"Hello, how are you?",
"What is your name?",
"Nice to meet you!"
]
text_inputs = processor(text=texts, src_lang="eng", return_tensors="pt", padding=True)
output_tokens = model.generate(**text_inputs, tgt_lang="ita", generate_speech=False)
# Decode all outputs
translations = processor.batch_decode(output_tokens[0], skip_special_tokens=True)
for orig, trans in zip(texts, translations):
print(f"{orig} -> {trans}")
6. Control Generation Length and Quality
text_inputs = processor(text="Translate this sentence", src_lang="eng", return_tensors="pt")
# Higher quality but more computationally expensive
high_quality_output = model.generate(
**text_inputs,
tgt_lang="rus",
generate_speech=True,
speaker_id=10,
num_beams=5, # Beam search
max_new_tokens=512, # Allow longer output
length_penalty=1.0, # Default length normalization (>1.0 favors longer outputs, <1.0 shorter)
early_stopping=True,
use_cache=True # Accelerate generation
)
# Faster generation speed, acceptable quality
fast_output = model.generate(
**text_inputs,
tgt_lang="rus",
generate_speech=True,
speaker_id=10,
num_beams=1, # Greedy decoding: fastest, at some cost in translation quality
max_new_tokens=256,
use_cache=True
)
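To quantify the speed/quality trade-off on your own hardware, time both settings. A rough sketch:
# Sketch: rough wall-clock comparison of the two settings above.
import time

for label, beams in (("beam-5", 5), ("greedy", 1)):
    start = time.perf_counter()
    model.generate(**text_inputs, tgt_lang="rus", generate_speech=True,
                   speaker_id=10, num_beams=beams, max_new_tokens=256, use_cache=True)
    print(f"{label}: {time.perf_counter() - start:.1f}s")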
7. GPU/CPU Usage
import torch
# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Process inputs on the same device
text_inputs = processor(text="Hello", src_lang="eng", return_tensors="pt")
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}
# Generate
with torch.inference_mode(): # More efficient than torch.no_grad()
outputs = model.generate(**text_inputs, tgt_lang="cmn", generate_speech=True)
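On GPU you can additionally try half precision to reduce memory use. This is an untested assumption for this checkpoint; the vocoder's numerical behavior in fp16 is not verified here:
# Sketch: optional half precision on GPU (memory savings; quality not verified).
if device == "cuda":
    model = model.half()
    with torch.inference_mode():
        outputs = model.generate(**text_inputs, tgt_lang="cmn", generate_speech=True)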
License
Same as the original model: CC-BY-NC-4.0
For commercial use, please refer to Meta's licensing terms.