# SPIRIT-LM Expressive Interleaved (Corrected Teacher, Libri-Light)

**SPIRIT-LM Expressive Interleaved (Corrected)** is a fine-tuned version of the 7B SPIRIT-LM teacher model adapted to the **Libri-Light** domain. It supports **interleaved speech and text inputs** and was used as the **teacher model for distilling TinyWave**.

This checkpoint was fine-tuned for 10k steps with **LoRA adapters** on synthetic interleaved data built from Libri-Light audio and Whisper transcriptions. The resulting model is better aligned with the target distribution and provides stronger supervision for expressive speech–text generation.

> 📖 This checkpoint is part of the *TinyWave* distillation framework. See [arXiv:2506.23670](https://arxiv.org/abs/2506.23670) for details.

---

## 🧠 Model Purpose

| Field            | Value                                          |
|------------------|------------------------------------------------|
| Role             | Distillation teacher                           |
| Base model       | `spirit-lm-expressive-7b` (SPIRIT-LM)          |
| Fine-tuned on    | Libri-Light (10k steps with LoRA)              |
| Input modalities | Interleaved speech + text                      |
| Output           | Speech tokens                                  |
| Used for         | Training `tinywave/interleaved-expressive-2b`  |

---

## 🔧 Usage

### 1. Install SPIRIT-LM and Load the Expressive Tokenizer

```bash
git clone https://github.com/facebookresearch/spiritlm
cd spiritlm
pip install -e '.[eval]'
```

```python
from spiritlm.speech_tokenizer import spiritlm_expressive

speech_tokenizer = spiritlm_expressive()
```

---

### 2. Inference (Speech or Interleaved)

```python
import torch
import torchaudio
from transformers import AutoTokenizer, LlamaForCausalLM

from spiritlm.speech_tokenizer import spiritlm_expressive

MODEL_PATH = "tinywave/expressive-spirit-lm-interleaved-librilight"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype=torch.bfloat16)

# Expressive speech tokenizer (HuBERT + pitch + style units)
speech_tokenizer = spiritlm_expressive()

def get_inference(audio_path):
    """Continue a spoken prompt: encode audio to speech tokens, then generate."""
    audio, _ = torchaudio.load(audio_path)
    # Reshape to (batch, channel, samples); assumes a mono input file
    input_values = audio.view(1, 1, -1).to(speech_tokenizer.hubert_model.device).float()
    tokens = speech_tokenizer.encode_string(input_values)
    input_ids = tokenizer(tokens, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.9, top_p=0.9)
    return tokenizer.decode(output[0])

def get_inference_text(prompt):
    """Continue a text prompt in the speech modality via the [Speech] tag."""
    input_ids = tokenizer(prompt + " [Speech]", return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.9, top_p=0.9)
    return tokenizer.decode(output[0])
```

---

## 🎧 Inference Modes

### 💬 Text + Speech Interleaving

Input:

```text
"The astronaut stepped outside the capsule— [Speech]"
```

Output: an expressive speech continuation as speech tokens, which can be vocoded to WAV audio with the expressive tokenizer.

---

### 🔄 Speech Continuation

Input: `speech.wav`

Output: a semantically and stylistically aligned spoken continuation.
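Both modes map directly onto the helpers defined in the usage section above. The snippet below is a minimal sketch: the text prompt mirrors the interleaving example, and `prompt.wav` is a hypothetical local audio file.

```python
# Mode 1: text prompt, speech continuation
# (the helper appends the " [Speech]" tag before generation)
interleaved_out = get_inference_text("The astronaut stepped outside the capsule—")
print(interleaved_out)  # decoded string containing expressive speech tokens

# Mode 2: speech prompt, spoken continuation ("prompt.wav" is a placeholder path)
speech_out = get_inference("prompt.wav")
print(speech_out)
```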
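Note that the helpers return token strings, not audio. To listen to a continuation, the speech tokens have to be vocoded back to a waveform with the expressive tokenizer. The sketch below is an assumption-laden illustration: it presumes the `decode` method shown in the spiritlm repository examples and a 16 kHz mono vocoder output; check that repository for the exact signature.

```python
import torch
import torchaudio

generated = get_inference("prompt.wav")  # hypothetical prompt file

# Assumption: spiritlm's expressive tokenizer exposes decode(), mapping a token
# string back to a waveform, as in the spiritlm repo examples; verify the exact
# signature (and whether text spans must be stripped first) against that repo.
wav = speech_tokenizer.decode(generated)

# Assumption: the vocoder emits mono audio at 16 kHz.
torchaudio.save("continuation.wav", torch.from_numpy(wav).unsqueeze(0), 16000)
```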
---

## 📂 Files

* `pytorch_model.bin`: LoRA-adapted SPIRIT-LM 7B weights
* `config.json`, `tokenizer.json`: compatible with Hugging Face Transformers
* Compatible with the `spiritlm_expressive` tokenizer only

---

## 📎 Citation

```bibtex
@article{nouriborji2025tinywave,
  title={Efficient Interleaved Speech Modeling through Knowledge Distillation},
  author={Nouriborji, Mohammadmahdi and Rohanian, Morteza},
  journal={arXiv preprint arXiv:2506.23670},
  year={2025}
}
```

---

## 🔗 Related

* 🔬 Paper: [arXiv:2506.23670](https://arxiv.org/abs/2506.23670)
* 🧠 Student model: [`tinywave/interleaved-expressive-2b`](https://huggingface.co/tinywave/interleaved-expressive-2b)
* 🌐 [Project Website](https://mohammadmahdinoori.github.io/tinywave-landing/)