---
library_name: transformers
base_model:
- meta-llama/Llama-3.1-8B-Instruct
- DeSTA-ntu/Llama-3.1-8B-Instruct
datasets:
- DeSTA-ntu/DeSTA-AQA5M-FROM-Llama3.1-8B-Instruct
tags:
- audio-text-to-text
- Audio-understanding
- Audio-chat
---
# DeSTA2.5-Audio
[📑 Paper](https://arxiv.org/abs/2507.02768) | [👩‍💻 GitHub](https://github.com/kehanlu/DeSTA2.5-Audio) | [🤗 Model](https://huggingface.co/collections/DeSTA-ntu/desta25-audio-686a6b9e71afd92e1dd87486) | [🤗 Dataset](https://huggingface.co/datasets/DeSTA-ntu/DeSTA-AQA5M-FROM-Llama3.1-8B-Instruct)
**DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment**
> **Self-generated data is what you need for developing general-purpose LALMs!**
- 🧪 **A new training framework** ([read the paper](https://arxiv.org/abs/2507.02768))
  - Highly scalable and efficient, without task-specific instruction-tuning data
  - Preserves language ability and avoids catastrophic forgetting
  - Comprehensive studies on data quality in LALM development
- 📦 **Open resources for the community**
  - Model checkpoints and training scripts
  - DeSTA-AQA5M dataset: 5M audio-text pairs from 7,000 hours of audio (a loading sketch follows below)
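As a minimal sketch of pulling the released data, DeSTA-AQA5M can be loaded with the standard Hugging Face `datasets` API; the `train` split name and streaming support are assumptions here, and the actual column names may differ, so check the dataset card.

```python
from datasets import load_dataset

# Stream the DeSTA-AQA5M audio-text pairs without downloading everything
# up front. The "train" split and streaming flag are assumptions; consult
# the dataset card for the authoritative schema.
ds = load_dataset(
    "DeSTA-ntu/DeSTA-AQA5M-FROM-Llama3.1-8B-Instruct",
    split="train",
    streaming=True,
)
print(next(iter(ds)))  # inspect one example pair
```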
## 🚀 Quickstart
### Installation
```shell
git clone https://github.com/kehanlu/DeSTA2.5-Audio.git
cd DeSTA2.5-Audio
pip install -e .
```
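As a quick sanity check after the editable install (a minimal sketch, assuming the dependencies installed cleanly), the package should be importable:

```python
# Sanity check: the editable install should make the package importable.
from desta import DeSTA25AudioModel  # noqa: F401

print("desta import OK")
```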
### Inference
```python
from desta import DeSTA25AudioModel

# Load the model from Hugging Face and move it to the GPU
model = DeSTA25AudioModel.from_pretrained("DeSTA-ntu/DeSTA2.5-Audio-Llama-3.1-8B")
model.to("cuda")

# Run inference with audio input. The <|AUDIO|> placeholder in the user
# message marks where the audio clip is inserted.
messages = [
    {
        "role": "system",
        "content": "Focus on the audio clips and instructions."
    },
    {
        "role": "user",
        "content": "<|AUDIO|>\nDescribe this audio.",
        "audios": [{
            "audio": "/path/to/audio.wav",  # Path to your audio file
            "text": None
        }]
    }
]

outputs = model.generate(
    messages=messages,
    do_sample=False,  # greedy decoding
    top_p=1.0,
    temperature=1.0,
    max_new_tokens=512
)
print(outputs.text)
```
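If sampled outputs are preferred, the same `generate` call accepts the standard sampling parameters shown above; the values below are illustrative assumptions, not author-recommended defaults.

```python
# Sampled decoding with the same generate() call as above.
# top_p and temperature values here are illustrative assumptions.
outputs = model.generate(
    messages=messages,
    do_sample=True,  # sample instead of greedy decoding
    top_p=0.9,
    temperature=0.7,
    max_new_tokens=512
)
print(outputs.text)
```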
## 📚 Citation
```bibtex
@article{lu2025desta25Audio,
title={DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment},
author={Lu, Ke-Han and Chen, Zhehuai and Fu, Szu-Wei and Yang, Chao-Han Huck and Huang, Sung-Feng and Yang, Chih-Kai and Yu, Chee-En and Chen, Chun-Wei and Chen, Wei-Chih and Huang, Chien-yu and others},
journal={arXiv preprint arXiv:2507.02768},
year={2025}
}
@inproceedings{lu2025developing,
title={Developing instruction-following speech language model without speech instruction-tuning data},
author={Lu, Ke-Han and Chen, Zhehuai and Fu, Szu-Wei and Yang, Chao-Han Huck and Balam, Jagadeesh and Ginsburg, Boris and Wang, Yu-Chiang Frank and Lee, Hung-yi},
booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--5},
year={2025},
organization={IEEE}
}
@inproceedings{lu24c_interspeech,
title = {DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment},
author = {Ke-Han Lu and Zhehuai Chen and Szu-Wei Fu and He Huang and Boris Ginsburg and Yu-Chiang Frank Wang and Hung-yi Lee},
year = {2024},
booktitle = {Interspeech 2024},
pages = {4159--4163},
doi = {10.21437/Interspeech.2024-457},
issn = {2958-1796},
}
```
## 👥 Contributors
Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee