If you've ever trained a VLM, you know the problem: nobody shares their data mixtures. It's a black box that makes replicating SOTA work impossible. We wanted to change that.
FineVision unifies 200 sources into 24 million samples. With 17.3 million images and 9.5 billion answer tokens, it's the largest open resource of its kind.
In the paper, we share how we built it: 🔍 finding and cleaning data at scale 🧹 removing excessive duplicates across sources 🤗 decontaminating against 66 public benchmarks
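To make the decontamination step concrete, here's a minimal sketch of matching training images against benchmark images with perceptual hashes. The hashing scheme, threshold, and helper functions are my own illustration, not the actual pipeline described in the paper:

# Illustrative only: hash-based decontamination against benchmark images
from PIL import Image
import imagehash

def build_benchmark_index(benchmark_image_paths, hash_size=8):
    """Perceptual hashes of every benchmark image that must not leak into training."""
    return {imagehash.phash(Image.open(p), hash_size=hash_size) for p in benchmark_image_paths}

def is_contaminated(train_image_path, benchmark_hashes, max_distance=4):
    """Flag a training image if it is near-identical to any benchmark image."""
    h = imagehash.phash(Image.open(train_image_path))
    return any(h - b <= max_distance for b in benchmark_hashes)  # Hamming distance between hashes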
My favorite part is Figure 6 (in the video!). It's our visual diversity analysis. It shows that FineVision isn't just bigger; it's more balanced and conceptually richer than other open datasets. NVIDIA's Eagle 2 paper highlighted just how critical this visual diversity is, and our results confirm it: models trained on FineVision consistently outperform those trained on any other open dataset on 11 benchmarks!
🎉 To celebrate the paper, I’m also releasing a concatenated and shuffled version of the full dataset! 👉HuggingFaceM4/FineVision_full_shuffled
It’s ready to stream, so you can start training your own models right away:
from datasets import load_dataset

# Stream the dataset without downloading it first
d = load_dataset("HuggingFaceM4/FineVision_full_shuffled", split="train", streaming=True)
print(next(iter(d)))  # peek at the first sample
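Since it's an iterable stream, you can also add a light shuffle buffer or grab a handful of samples without pulling the whole dataset (a small illustrative snippet; the exact fields depend on the source subset):

d = d.shuffle(seed=42, buffer_size=1_000)  # buffer-level shuffling on top of the stream
for sample in d.take(3):                   # materializes only three samples
    print(sample.keys())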
A big shoutout to the first authors: Luis Wiedmann and Orr Zohar. They are rockstars!
Many VLMs claim to process hours of video. But can they follow the story?🤔 Today, we introduce TimeScope: The benchmark that separates true temporal understanding from marketing hype. Let's see how much VLMs really understand!⏳
We test three skills that matter for real-world use: 🔎 Localized Retrieval: Find a specific action. 🧩 Information Synthesis: Piece together scattered clues. 🏃 Fine-Grained Perception: Analyze detailed motion (e.g., count how many times a person swings an axe).
The results are in, and they're revealing. Only Gemini 2.5 Pro holds up on 1-hour-long videos. Performance drops sharply with duration, showing that long-video understanding is still far from solved. We've found the breaking points; now the community can start fixing them.📈
Want to learn more? TimeScope is 100% open-source. Benchmark your model and help us build the next generation of video AI.
We often solve visual problems by sketching ideas in our minds. What if Vision-Language Models (VLMs) could do something similar, not by generating full images, but by using internal “mental sketches”?
That’s the idea behind Mirage, a new framework that empowers VLMs to reason using latent visual tokens. Instead of just thinking in words, Mirage mixes in abstract visual representations that help the model solve complex tasks.
These aren't photorealistic images. They're compact, internal representations optimized purely to support reasoning.
🔧 Mirage is trained in two phases:
1) Grounding: the model learns to produce latent tokens anchored in real images.
2) Refinement: the images are dropped and the model learns to generate the visual tokens on its own.
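To get a feel for the recipe, here's a toy two-stage training sketch. Everything below, from the module names to the MSE grounding loss and dimensions, is an assumption for illustration, not Mirage's actual code:

import torch
import torch.nn as nn

# Toy sketch of the two-stage idea (illustrative; not the Mirage codebase).
class ToyLatentReasoner(nn.Module):
    def __init__(self, d_model=256, vocab=1000):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, d_model)
        self.latent_head = nn.Linear(d_model, d_model)   # produces the latent "visual" token
        self.answer_head = nn.Linear(d_model, vocab)

    def forward(self, text_ids):
        h = self.text_emb(text_ids).mean(dim=1)          # crude pooled text context
        latent = self.latent_head(h)                     # "mental sketch" (toy: one vector)
        logits = self.answer_head(h + latent)            # answer conditioned on the latent
        return latent, logits

model = ToyLatentReasoner()
ce = nn.CrossEntropyLoss()
text_ids = torch.randint(0, 1000, (2, 16))
answers = torch.randint(0, 1000, (2,))
image_feats = torch.randn(2, 256)                        # stand-in for encoder features of a helper image

latent, logits = model(text_ids)

# Stage 1 (grounding): anchor the latent token to real image features + answer supervision.
stage1_loss = ce(logits, answers) + nn.functional.mse_loss(latent, image_feats)

# Stage 2 (refinement): drop the image target; only the answer loss remains.
stage2_loss = ce(logits, answers)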
📈 And yes, it works! On challenging benchmarks like Visual Spatial Planning, Jigsaw puzzles, and Spatial Attention Tasks, Mirage clearly outperforms GPT-4o and other strong baselines. Smart sketches > empty words.