Getting it Right: Improving Spatial Consistency in Text-to-Image Models Paper • 2404.01197 • Published Apr 1, 2024 • 31
CosmicMan: A Text-to-Image Foundation Model for Humans Paper • 2404.01294 • Published Apr 1, 2024 • 17
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus Paper • 2406.08707 • Published Jun 13, 2024 • 17
DataComp-LM: In search of the next generation of training sets for language models Paper • 2406.11794 • Published Jun 17, 2024 • 54
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning Paper • 2406.08973 • Published Jun 13, 2024 • 89
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text Paper • 2406.08418 • Published Jun 12, 2024 • 31
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices Paper • 2406.08451 • Published Jun 12, 2024 • 25
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning Paper • 2409.12568 • Published Sep 19, 2024 • 50
RedPajama: an Open Dataset for Training Large Language Models Paper • 2411.12372 • Published Nov 19, 2024 • 56
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions Paper • 2411.07461 • Published Nov 12, 2024 • 23
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models Paper • 2411.04905 • Published Nov 7, 2024 • 127
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics Paper • 2501.04686 • Published Jan 8 • 53
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training Paper • 2501.18511 • Published Jan 30 • 20
MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion Paper • 2502.04235 • Published Feb 6 • 22
Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training Paper • 2502.06589 • Published Feb 10 • 20
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles Paper • 2502.09082 • Published Feb 13 • 30
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding Paper • 2503.02951 • Published Mar 4 • 33
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search Paper • 2503.10582 • Published Mar 13 • 24
ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback Paper • 2503.21332 • Published Mar 27 • 23
OpenCodeReasoning: Advancing Data Distillation for Competitive Coding Paper • 2504.01943 • Published Apr 2 • 15
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training Paper • 2504.13161 • Published Apr 17 • 93
DataDecide: How to Predict Best Pretraining Data with Small Experiments Paper • 2504.11393 • Published Apr 15 • 18
LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs Paper • 2504.14655 • Published Apr 20 • 20