MIDI-LLM

IMPORTANT: We're still working on the companion GitHub codebase; please check back in a few days. Thanks!

Built on Llama 3.2 (1B) with an extended vocabulary for MIDI tokens.

Research Paper

  • Shih-Lun Wu, Yoon Kim, and Cheng-Zhi Anna Huang.
    "MIDI-LLM: Adapting large language models for text-to-MIDI music generation."
    NeurIPS AI4Music Workshop, 2025.

Model Description

  • Base Model: meta-llama/Llama-3.2-1B
  • Model Size: 1.4B parameters
  • Extended Vocabulary: 183,286 tokens (128,256 for text + 55,030 for MIDI music)
  • Architecture: LlamaForCausalLM with extended embedding layer
  • Precision: BFloat16

Quick Start

Clone our GitHub code repository (coming soon), run through the setup steps, and try:

git clone https://github.com/slSeanWU/MIDI-LLM
cd MIDI-LLM

python generate_transformers.py \
    --model slseanwu/MIDI-LLM_Llama-3.2-1B \
    --prompt "A cheerful rock song with bright electric guitars" \
    --n_outputs 4

The repo and inference scripts provide a more complete usage guide.

Model Details

Extended Vocabulary

The model extends Llama 3.2's vocabulary (128,256 tokens) with 55,030 MIDI tokens representing:

  • Onset times (when notes occur)
  • Durations (how long each note is held)
  • Instrument-pitch pairs (which note is played, and by which instrument)

These tokens follow the vocabulary of Anticipatory Music Transformer (AMT) (Thickstun et al., TMLR 2024).
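As a minimal sketch of how the extended vocabulary is laid out, the snippet below assumes (as is common when extending an LLM's vocabulary) that the 55,030 MIDI tokens are appended after the original 128,256 text-token IDs; the helper name `midi_token_to_id` is hypothetical, not part of the released code:

```python
# Extended-vocabulary ID layout (ordering is an assumption; sizes are from the model card).
TEXT_VOCAB_SIZE = 128_256    # original Llama 3.2 text tokens
MIDI_VOCAB_SIZE = 55_030     # AMT-style MIDI tokens appended after the text range
TOTAL_VOCAB_SIZE = TEXT_VOCAB_SIZE + MIDI_VOCAB_SIZE  # 183,286

def midi_token_to_id(midi_index: int) -> int:
    """Map a MIDI-vocabulary index (0 .. 55,029) to an ID in the extended vocabulary."""
    if not 0 <= midi_index < MIDI_VOCAB_SIZE:
        raise ValueError(f"MIDI index out of range: {midi_index}")
    return TEXT_VOCAB_SIZE + midi_index

print(TOTAL_VOCAB_SIZE)     # 183286
print(midi_token_to_id(0))  # 128256 (the first MIDI token follows the last text token)
```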

Training Data

  • Datasets:
    • Continued Pretraining (CPT)
      • music-related text from MusicPile (~1.7B tokens)
      • standalone MIDIs from GigaMIDI (~1.4B tokens after filtering out SFT examples)
    • Supervised Finetuning (SFT)
      • LakhMIDI music paired with MidiCaps text descriptions (~5B tokens with AMT infilling augmentation)
  • Training objective: Causal language modeling
  • Training sequence length: 2,048
  • System prompt: You are a world-class composer. Please compose some music according to the following description: [your input text]
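The system prompt above is prepended to the user's text description to form the model input. A minimal sketch of that assembly (the helper name `build_prompt` is hypothetical; only the prompt wording comes from the model card):

```python
# System prompt wording as given in the model card.
SYSTEM_PROMPT = (
    "You are a world-class composer. "
    "Please compose some music according to the following description: "
)

def build_prompt(description: str) -> str:
    """Prepend the fixed system prompt to a free-text music description."""
    return SYSTEM_PROMPT + description

prompt = build_prompt("A cheerful rock song with bright electric guitars")
print(prompt)
```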

Inference Hyperparameters

Recommended settings for best results:

temperature: 1.0
top_p: 0.98
max_tokens: 2046
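To illustrate what these two knobs do, here is a self-contained sketch of temperature plus nucleus (top-p) sampling over a list of logits; this is a generic illustration of the sampling scheme, not the project's actual decoding code:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=0.98, rng=None):
    """Sample a token index using temperature scaling and nucleus (top-p) filtering."""
    rng = rng or random.Random()
    # Temperature-scaled softmax (max-subtraction for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample from the kept tokens, renormalized to their total mass.
    kept_mass = sum(probs[i] for i in kept)
    r, acc = rng.random() * kept_mass, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```

A high `top_p` of 0.98 trims only the lowest-probability tail, which preserves musical variety while avoiding very unlikely tokens.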

Citation

If you find our model useful, please cite our research as:

@inproceedings{wu2025midillm,
  title={MIDI-LLM: Adapting large language models for text-to-MIDI music generation},
  author={Wu, Shih-Lun and Kim, Yoon and Huang, Cheng-Zhi Anna},
  booktitle={Proc. NeurIPS AI4Music Workshop},
  year={2025}
}

License

This model is based on Llama 3.2 and is subject to the Llama 3.2 Community License.
