Common Pile v0.1 Raw Data Collection 8TB of public domain and openly licensed text • 30 items • Updated Aug 14, 2025 • 22
Common Pile v0.1 Filtered Data Collection An LLM pre-training dataset produced by filtering and deduplicating the raw text collected in the Common Pile v0.1 • 31 items • Updated Jun 6, 2025 • 22
Comma v0.1 Artifacts Collection A collection of artifacts related to Comma v0.1—a 7B parameter LLM trained on public domain and openly licensed text • 3 items • Updated Jun 6, 2025 • 4
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Paper • 2506.05209 • Published Jun 5, 2025 • 60