m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks Paper • 2403.11085 • Published Mar 17, 2024
Visual Programming: Compositional visual reasoning without training Paper • 2211.11559 • Published Nov 18, 2022
CodeNav: Beyond tool-use to using real-world codebases with LLM agents Paper • 2406.12276 • Published Jun 18, 2024
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models Paper • 2409.17146 • Published Sep 25, 2024
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation Paper • 2502.14846 • Published Feb 20, 2025