MIDI-LLM

IMPORTANT: We're still working on the companion GitHub codebase; please check back in a few days. Thanks!

Built on Llama 3.2 (1B) with an extended vocabulary for MIDI tokens.

Research Paper

  • Shih-Lun Wu, Yoon Kim, and Cheng-Zhi Anna Huang.
    "MIDI-LLM: Adapting large language models for text-to-MIDI music generation."
    NeurIPS AI4Music Workshop, 2025.

Model Description

  • Base Model: meta-llama/Llama-3.2-1B
  • Model Size: 1.4B parameters
  • Extended Vocabulary: 183,286 tokens (128,256 for text + 55,030 for MIDI music)
  • Architecture: LlamaForCausalLM with extended embedding layer
  • Precision: BFloat16

Quick Start

Clone our GitHub code repository (coming soon), run through the setup steps, and try:

git clone https://github.com/slSeanWU/MIDI-LLM
cd MIDI-LLM

python generate_transformers.py \
    --model slseanwu/MIDI-LLM_Llama-3.2-1B \
    --prompt "A cheerful rock song with bright electric guitars" \
    --n_outputs 4

The repo and inference scripts provide a more complete usage guide.

Model Details

Extended Vocabulary

The model extends Llama 3.2's vocabulary (128,256 tokens) with 55,030 MIDI tokens representing:

  • Onset times (when notes occur)
  • Durations (how long each note is held)
  • Instrument-pitch pairs (which note is played, and by which instrument)

These tokens follow the vocabulary of Anticipatory Music Transformer (AMT) (Thickstun et al., TMLR 2024).
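As a minimal sketch of how the extended vocabulary is laid out, the snippet below assumes (as is common when extending an LLM's vocabulary) that the 55,030 MIDI tokens are appended after the original 128,256 text-token IDs; the helper name `midi_token_to_id` is hypothetical, not part of the released code:

```python
# Extended-vocabulary ID layout (ordering is an assumption; sizes are from the model card).
TEXT_VOCAB_SIZE = 128_256    # original Llama 3.2 text tokens
MIDI_VOCAB_SIZE = 55_030     # AMT-style MIDI tokens appended after the text range
TOTAL_VOCAB_SIZE = TEXT_VOCAB_SIZE + MIDI_VOCAB_SIZE  # 183,286

def midi_token_to_id(midi_index: int) -> int:
    """Map a MIDI-vocabulary index (0 .. 55,029) to an ID in the extended vocabulary."""
    if not 0 <= midi_index < MIDI_VOCAB_SIZE:
        raise ValueError(f"MIDI index out of range: {midi_index}")
    return TEXT_VOCAB_SIZE + midi_index

print(TOTAL_VOCAB_SIZE)     # 183286
print(midi_token_to_id(0))  # 128256 (the first MIDI token follows the last text token)
```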

Training Data

  • Datasets:
    • Continued Pretraining (CPT)
      • music-related text from MusicPile (~1.7B tokens)
      • standalone MIDIs from GigaMIDI (~1.4B tokens after filtering out SFT examples)
    • Supervised Finetuning (SFT)
      • LakhMIDI music paired with MidiCaps text descriptions (~5B tokens with AMT infilling augmentation)
  • Training objective: Causal language modeling
  • Training sequence length: 2,048
  • System prompt: You are a world-class composer. Please compose some music according to the following description: [your input text]
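The system prompt above is prepended to the user's text description to form the model input. A minimal sketch of that assembly (the helper name `build_prompt` is hypothetical; only the prompt wording comes from the model card):

```python
# System prompt wording as given in the model card.
SYSTEM_PROMPT = (
    "You are a world-class composer. "
    "Please compose some music according to the following description: "
)

def build_prompt(description: str) -> str:
    """Prepend the fixed system prompt to a free-text music description."""
    return SYSTEM_PROMPT + description

prompt = build_prompt("A cheerful rock song with bright electric guitars")
print(prompt)
```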

Inference Hyperparameters

Recommended settings for best results:

temperature: 1.0
top_p: 0.98
max_tokens: 2046
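To illustrate what these two knobs do, here is a self-contained sketch of temperature plus nucleus (top-p) sampling over a list of logits; this is a generic illustration of the sampling scheme, not the project's actual decoding code:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=0.98, rng=None):
    """Sample a token index using temperature scaling and nucleus (top-p) filtering."""
    rng = rng or random.Random()
    # Temperature-scaled softmax (max-subtraction for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample from the kept tokens, renormalized to their total mass.
    kept_mass = sum(probs[i] for i in kept)
    r, acc = rng.random() * kept_mass, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```

A high `top_p` of 0.98 trims only the lowest-probability tail, which preserves musical variety while avoiding very unlikely tokens.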

Citation

If you find our model useful, please cite our research as:

@inproceedings{wu2025midillm,
  title={MIDI-LLM: Adapting large language models for text-to-MIDI music generation},
  author={Wu, Shih-Lun and Kim, Yoon and Huang, Cheng-Zhi Anna},
  booktitle={Proc. NeurIPS AI4Music Workshop},
  year={2025}
}

License

This model is based on Llama 3.2 and is subject to the Llama 3.2 Community License.
