# SPIRIT-LM Expressive Interleaved (Corrected Teacher, Libri-Light)

**SPIRIT-LM Expressive Interleaved (Corrected)** is a fine-tuned version of the 7B SPIRIT-LM teacher model adapted to the **Libri-Light** domain. It supports **interleaved speech and text inputs** and was used as the **teacher model for distilling TinyWave**.

This checkpoint was fine-tuned for 10k steps with **LoRA adapters** on synthetic interleaved data built from Libri-Light audio and Whisper transcriptions. The resulting model is better aligned with the target distribution and provides stronger supervision for expressive speech–text generation.

> 📖 This checkpoint is part of the *TinyWave* distillation framework. See [arXiv:2506.23670](https://arxiv.org/abs/2506.23670) for details.

---

## 🧠 Model Purpose

| Field            | Value                                          |
|------------------|------------------------------------------------|
| Role             | Distillation teacher                           |
| Base model       | `spirit-lm-expressive-7b` (SPIRIT-LM)          |
| Fine-tuned on    | Libri-Light (10k steps with LoRA)              |
| Input modalities | Interleaved speech + text                      |
| Output           | Speech tokens                                  |
| Used for         | Training `tinywave/interleaved-expressive-2b`  |

---

## 🔧 Usage

### 1. Install SPIRIT-LM and Load the Expressive Tokenizer

```bash
git clone https://github.com/facebookresearch/spiritlm
cd spiritlm
pip install -e '.[eval]'
```

```python
from spiritlm.speech_tokenizer import spiritlm_expressive

speech_tokenizer = spiritlm_expressive()
```

---

### 2. Inference (Speech or Interleaved)

```python
import torch
import torchaudio
from transformers import AutoTokenizer, LlamaForCausalLM

from spiritlm.speech_tokenizer import spiritlm_expressive

MODEL_PATH = "tinywave/expressive-spirit-lm-interleaved-librilight"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype=torch.bfloat16)

# Expressive speech tokenizer (HuBERT + pitch + style units)
speech_tokenizer = spiritlm_expressive()

def get_inference(audio_path):
    """Continue a spoken prompt: encode audio to speech tokens, then generate."""
    audio, _ = torchaudio.load(audio_path)
    # Reshape to (batch, channel, samples); assumes a mono input file
    input_values = audio.view(1, 1, -1).to(speech_tokenizer.hubert_model.device).float()
    tokens = speech_tokenizer.encode_string(input_values)
    input_ids = tokenizer(tokens, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.9, top_p=0.9)
    return tokenizer.decode(output[0])

def get_inference_text(prompt):
    """Continue a text prompt in the speech modality via the [Speech] tag."""
    input_ids = tokenizer(prompt + " [Speech]", return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.9, top_p=0.9)
    return tokenizer.decode(output[0])
```

---

## 🎧 Inference Modes

### 💬 Text + Speech Interleaving

Input:

```text
"The astronaut stepped outside the capsule— [Speech]"
```

Output: an expressive speech continuation as speech tokens, which can be vocoded to WAV audio with the expressive tokenizer.

---

### 🔄 Speech Continuation

Input: `speech.wav`

Output: a semantically and stylistically aligned spoken continuation.
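Both modes map directly onto the helpers defined in the usage section above. The snippet below is a minimal sketch: the text prompt mirrors the interleaving example, and `prompt.wav` is a hypothetical local audio file.

```python
# Mode 1: text prompt, speech continuation
# (the helper appends the " [Speech]" tag before generation)
interleaved_out = get_inference_text("The astronaut stepped outside the capsule—")
print(interleaved_out)  # decoded string containing expressive speech tokens

# Mode 2: speech prompt, spoken continuation ("prompt.wav" is a placeholder path)
speech_out = get_inference("prompt.wav")
print(speech_out)
```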
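Note that the helpers return token strings, not audio. To listen to a continuation, the speech tokens have to be vocoded back to a waveform with the expressive tokenizer. The sketch below is an assumption-laden illustration: it presumes the `decode` method shown in the spiritlm repository examples and a 16 kHz mono vocoder output; check that repository for the exact signature.

```python
import torch
import torchaudio

generated = get_inference("prompt.wav")  # hypothetical prompt file

# Assumption: spiritlm's expressive tokenizer exposes decode(), mapping a token
# string back to a waveform, as in the spiritlm repo examples; verify the exact
# signature (and whether text spans must be stripped first) against that repo.
wav = speech_tokenizer.decode(generated)

# Assumption: the vocoder emits mono audio at 16 kHz.
torchaudio.save("continuation.wav", torch.from_numpy(wav).unsqueeze(0), 16000)
```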
---

## 📂 Files

* `pytorch_model.bin`: LoRA-adapted SPIRIT-LM 7B weights
* `config.json`, `tokenizer.json`: compatible with Hugging Face Transformers
* Compatible with the `spiritlm_expressive` tokenizer only

---

## 📎 Citation

```bibtex
@article{nouriborji2025tinywave,
  title={Efficient Interleaved Speech Modeling through Knowledge Distillation},
  author={Nouriborji, Mohammadmahdi and Rohanian, Morteza},
  journal={arXiv preprint arXiv:2506.23670},
  year={2025}
}
```

---

## 🔗 Related

* 🔬 Paper: [arXiv:2506.23670](https://arxiv.org/abs/2506.23670)
* 🧠 Student model: [`tinywave/interleaved-expressive-2b`](https://huggingface.co/tinywave/interleaved-expressive-2b)
* 🌐 [Project Website](https://mohammadmahdinoori.github.io/tinywave-landing/)