mlo-data-cleaning

Activity Feed

AI & ML interests

None defined yet.

Recent Activity

lvwerra

authored a paper 13 days ago

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Paper • 2510.08697 • Published 17 days ago • 32

mjaggi

authored a paper 26 days ago

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

Paper • 2509.14233 • Published Sep 17 • 12

mjaggi

authored a paper about 2 months ago

Benchmarking Optimizers for Large Language Model Pretraining

Paper • 2509.01440 • Published Sep 1 • 24

NXz64Fdf8Y

authored a paper 4 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 73

lvwerra

authored a paper 4 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 73

mjaggi

authored a paper 4 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 73

negar-foroutan

authored a paper 4 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 73

vsabolcec

authored a paper 4 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 73

hynky

authored a paper 4 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 73

guipenedo

authored a paper 4 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 73

guipenedo

authored a paper 5 months ago

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 46

lvwerra

authored a paper 7 months ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 200

hynky

authored a paper 9 months ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 243

guipenedo

authored a paper 9 months ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 243

lvwerra

authored a paper 9 months ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 243

hynky

authored a paper 9 months ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14 • 63

lvwerra

authored a paper 9 months ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14 • 63

guipenedo

authored a paper 9 months ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14 • 63

vsabolcec

updated a Space 12 months ago

Annotation

🏢

Manage text and view statistics with an intuitive interface

lvwerra

authored a paper 12 months ago

SelfCodeAlign: Self-Alignment for Code Generation

Paper • 2410.24198 • Published Oct 31, 2024 • 24

Team members 8
