Evaluate model predictions against ground-truth data
Stable Audio Open model from the Synthio paper
Generate text based on audio input and questions
Describe audio in response to questions