AI & ML interests

https://github.com/huggingface/cookbook

merve
posted an update 7 days ago
deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⬇️
> pretty insane that it can parse and re-render charts as HTML
> it concatenates CLIP and SAM features, which gives better grounding
> very efficient vision-token-to-performance ratio
> covers 100 languages
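The CLIP + SAM point is worth unpacking: two encoders look at the same image patches, and their per-patch features are concatenated channel-wise before being projected into the language model. A toy sketch of that idea (shapes and names are made up for illustration, not DeepSeek-OCR's actual code):

```python
# Toy illustration of channel-wise feature concatenation from two vision
# encoders. Dimensions are invented for illustration purposes only.

def concat_vision_features(clip_feats, sam_feats):
    """Concatenate per-patch CLIP and SAM features along the channel dim."""
    assert len(clip_feats) == len(sam_feats), "both encoders must see the same patches"
    return [c + s for c, s in zip(clip_feats, sam_feats)]  # list concat per patch

# 4 patches, 3-dim "CLIP" features + 2-dim "SAM" features -> 5-dim fused features
clip = [[0.1, 0.2, 0.3]] * 4
sam = [[0.9, 0.8]] * 4
fused = concat_vision_features(clip, sam)
```

The fused features carry both CLIP's semantic signal and SAM's localization signal, which is the intuition behind the "better grounding" remark.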
sergiopaniego
posted an update 9 days ago
New drop! 💥 The VLM Object Understanding Comparison Space now runs with Qwen3-VL-4B and moondream3.

You can compare how models reason about images 🧠

Bonus: thanks to @ariG23498, you now get auto-suggested prompts to explore faster.

Let's gooo

sergiopaniego/vlm_object_understanding
sergiopaniego
posted an update 12 days ago
@Qwen released their new small, dense VLMs (Qwen3-VL).

They're incredibly capable and already among my all-time favourite VLMs.

🤗 We've prepared some resources to help you get started.

> Fine-tune Qwen3-VL-4B with SFT or GRPO (free Colab notebooks):
> SFT: https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_qwen_vl.ipynb
> GRPO: https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/grpo_qwen3_vl.ipynb

> Compare object detection vs. Moondream3:
sergiopaniego/vlm_object_understanding

> Fine-tune from the CLI using TRL:
https://github.com/kashif/Qwen3-VL/blob/trl-sft/qwen-vl-finetune/README.md#trl-based-training-single-gpu
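For the SFT notebooks above, training data is arranged as image-text conversations. A sketch of one training example in the conversational format commonly used for VLM fine-tuning with TRL (field names follow the usual chat-template convention; the linked notebooks are the authoritative schema):

```python
def to_vlm_chat_example(image, question, answer):
    """Build one image-QA pair in the conversational message format
    commonly consumed by VLM SFT pipelines (illustrative sketch)."""
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": answer},
            ]},
        ]
    }

# Hypothetical example: one chart-QA training sample
example = to_vlm_chat_example("chart.png", "What does the y-axis show?", "Revenue in USD.")
```

Each dataset row becomes one such dict; the processor's chat template turns it into model inputs at training time.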
sergiopaniego
posted an update 17 days ago
Super nice intro to fine-tuning with TRL, just dropped by @google (runs free on Colab)!

They use SFT + QLoRA to fine-tune the tiny Gemma 3 270M model for emoji generation.

Here's what the fine-tuned model generates for the prompt "I'm learning to tweet" → 🐦🗣💻

Colab: https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Demos/Emoji-Gemma-on-Web/resources/Fine_tune_Gemma_3_270M_for_emoji_generation.ipynb
Try it out: google/emoji-gemma
Learn more: https://developers.googleblog.com/en/own-your-ai-fine-tune-gemma-3-270m-for-on-device/
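Why QLoRA fits a free Colab: the base weights are quantized and frozen, and only small low-rank adapter matrices are trained. The parameter arithmetic for one adapted projection (dimensions are illustrative assumptions, not Gemma's actual shapes):

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters for one LoRA adapter pair:
    A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

# Illustrative numbers: one 640x640 projection matrix.
full = 640 * 640                              # full fine-tuning updates every weight
lora = lora_trainable_params(640, 640, 8)     # LoRA updates only the adapters
```

With rank 8 that is 10,240 trainable parameters versus 409,600 for the full matrix, about 2.5%, which is why the optimizer state and gradients fit in modest GPU memory.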
sergiopaniego
posted an update 20 days ago
Online training methods (e.g., GRPO) require real-time generation, a compute- and memory-heavy bottleneck.

TRL has built-in vLLM support, and in this new recipe we show how to leverage it for efficient online training. Run it on Colab ⚡, then scale to multi-GPU/multi-node!

🧑‍🍳 recipe: https://huggingface.co/learn/cookbook/grpo_vllm_online_training
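For context on why generation dominates the cost: GRPO samples a whole group of completions per prompt, while the advantage computation itself is trivial. A sketch of the group-relative advantage from the standard GRPO formulation (not TRL's exact implementation):

```python
def group_relative_advantages(rewards, eps=1e-6):
    """GRPO's core trick: advantages are rewards standardized within the
    group of completions sampled for the same prompt (no value network)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# 4 completions for one prompt, two of which earned reward
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Each training step therefore needs G fresh completions per prompt, which is exactly the generation workload vLLM accelerates.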
sergiopaniego
posted an update 21 days ago
A few days ago, Thinking Machines Lab released "LoRA Without Regret", showing that LoRA can match full fine-tuning performance when configured right.

Naturally, we decided to reproduce the results with TRL and release a guide!

https://huggingface.co/docs/trl/main/en/lora_without_regret
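As a rough idea of what "configured right" means, the writeup emphasizes adapting all linear layers rather than only the attention projections. A hedged peft-style config fragment (values are illustrative assumptions; see the guide for the reproduced settings):

```python
from peft import LoraConfig

# Sketch of the kind of configuration the guide converges on.
# All values here are illustrative, not the guide's exact numbers.
peft_config = LoraConfig(
    r=16,                         # rank matters less than placement, per the writeup
    lora_alpha=32,
    target_modules="all-linear",  # key point: adapt every linear layer, not just attention
    task_type="CAUSAL_LM",
)
```

This config can be passed straight to TRL's `SFTTrainer` via its `peft_config` argument.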
sergiopaniego
posted an update about 1 month ago
You need to try this tool! 🫑

My colleague @Molbap built an interactive HF Space to explore the modular support of open models in transformers over time.

👀 You'll spot things like how many models 🦙 Llama defines, or which ones could go modular next.

Try it: Molbap/transformers-modular-refactor
sergiopaniego
posted an update about 1 month ago
How fast can you create an endpoint in Hugging Face Inference Endpoints with a new model + vLLM to deploy a state-of-the-art OCR model?

Let's break it down step by step.

1️⃣ Create your endpoint
Go to Hugging Face Endpoints → + NEW
Select Deploy from Hub → rednote-hilab/dots.ocr → Configure 🛠️

2️⃣ Configure hardware & container
Pick hardware: AWS/GPU/L4 ⚡
Set container: vLLM 🐇
Click Create ✅

3️⃣ Update endpoint settings
Container: set the Container URI to vllm/vllm-openai:nightly → Update
Advanced: add the flag --trust-remote-code → Update ⚠️

4️⃣ Run inference
Download the script 📝: ariG23498/useful-scripts
Set your HF_TOKEN and update base_url in the script.
Run it ✅

Your OCR model is now live via HF Inference Endpoints!
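Step 4️⃣ boils down to an OpenAI-compatible chat-completions call against the endpoint. A sketch of the request body you would POST to `<base_url>/v1/chat/completions` with an `Authorization: Bearer $HF_TOKEN` header (the image URL is a placeholder and the field layout follows the common vLLM/OpenAI convention; the downloaded script is the reference):

```python
import json

def ocr_chat_payload(model, image_url, prompt="Extract the text from this image."):
    """Build an OpenAI-compatible chat-completions body for a vLLM endpoint
    serving a vision model (illustrative sketch)."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

payload = ocr_chat_payload("rednote-hilab/dots.ocr", "https://example.com/receipt.png")
body = json.dumps(payload)  # this string is the HTTP request body
```

Any OpenAI-compatible client works the same way: point its `base_url` at the endpoint and pass your HF token as the API key.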
sergiopaniego
posted an update about 1 month ago
💥 Tons of new material just landed in the smol-course! 🧑‍💻

> evaluation
> alignment
> VLMs
> quizzes
> assignments!
> certificates! 👩‍🎓

go learn! 👉 https://huggingface.co/learn/smol-course/unit0/1
merve
posted an update about 1 month ago
large AI labs open-sourced a ton of models last week 🔥
here are a few picks; find even more here: merve/sep-16-releases-68d13ea4c547f02f95842f05 🤝
> IBM released a new Docling model with 258M params based on Granite (Apache 2.0) 📝 ibm-granite/granite-docling-258M
> Xiaomi released a 7B audio LM with base and instruct variants (MIT) XiaomiMiMo/mimo-audio-68cc7202692c27dae881cce0
> DecartAI released Lucy Edit, an open Nano Banana 🍌 (NC) decart-ai/Lucy-Edit-Dev
> OpenGVLab released a family of agentic computer-use models (3B/7B/32B) with the dataset 💻 OpenGVLab/scalecua-68c912cf56f7ff4c8e034003
> Meituan LongCat released a thinking version of LongCat-Flash 💭 meituan-longcat/LongCat-Flash-Thinking
sergiopaniego
posted an update about 1 month ago
This summer TRL leveled up for multimodal alignment 🌞

✅ New VLM alignment methods (MPO, GRPO, GSPO)
✅ Extended RLOO & Online DPO for VLMs
✅ Native SFT support
✅ Ready-to-use training scripts

🔗 https://huggingface.co/blog/trl-vlm-alignment
merve
posted an update about 1 month ago
IBM just released a small Swiss Army knife for document models: granite-docling-258M on Hugging Face 🔥

> not only a document converter: it also does document question answering and understands multiple languages 🤯
> best part: released under the Apache 2.0 license 👏 use it in your commercial projects!
> it supports transformers, vLLM and MLX from the get-go! 🤗
> built on SigLIP2 & granite-165M

model: ibm-granite/granite-docling-258M
demo: ibm-granite/granite-docling-258m-demo 💗
sergiopaniego
posted an update about 1 month ago
Training long-context LLMs is getting easier!

TRL now supports Context Parallelism (CP), letting you scale sequences across multiple GPUs, even multi-node setups, seamlessly 💆
Combine TRL and accelerate, and you can run it effortlessly!

With 8 GPUs, CP enables 300k+ token sequences while keeping throughput reasonable.
Works for both full fine-tuning and LoRA, unlocking contexts that used to hit OOM 📈

Check out the full guide here 👉 https://huggingface.co/docs/trl/main/en/distributing_training#context-parallelism

If you want to learn more about Context Parallelism, check out the Ultrascale Playbook 👉 nanotron/ultrascale-playbook
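Conceptually, CP shards the sequence dimension: each rank holds a contiguous slice of the tokens, so activation memory per GPU scales with sequence length divided by world size, and the framework handles attention across shards (e.g. via ring attention). A toy sketch of the sharding only (illustrative, not TRL's implementation):

```python
def shard_sequence(token_ids, world_size):
    """Toy view of context parallelism: each rank keeps a contiguous slice
    of the sequence. Cross-shard attention is the framework's job."""
    per_rank = (len(token_ids) + world_size - 1) // world_size  # ceil division
    return [token_ids[i * per_rank:(i + 1) * per_rank] for i in range(world_size)]

# A 300k-token sequence split across 8 GPUs: 37,500 tokens per rank
shards = shard_sequence(list(range(300_000)), 8)
```

This is why 300k+ tokens become feasible on 8 GPUs: no single device ever materializes the full sequence's activations.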
sergiopaniego
posted an update about 1 month ago
Thinking about learning the keys to post-training LLMs? 🧐

We just updated and released the smol course: the fastest track to mastering LLM fine-tuning. Free, hands-on, up-to-date, and it comes with a certificate! 🫰

What you'll get:
📖 Instruction tuning & preference alignment
🧑‍💻 Hands-on projects with TRL & Transformers
🏆 Challenges & community projects
🎓 Certificate of completion

go: hf.co/learn/smol-course
merve
posted an update about 1 month ago
a ton of image/video generation models and LLMs from big labs 🔥

> Meta released facebook/mobilellm-r1-68c4597b104fac45f28f448e, smol LLMs for on-device use 💬
> Tencent released tencent/SRPO, a high-res image generation model, and tencent/POINTS-Reader, a cutting-edge OCR model 📝
> ByteDance released bytedance-research/HuMo, video generation from any input ⏯️

find more models, datasets, and demos here: merve/sep-11-releases-68c7dbfa26bea8cd921fa0ac
sergiopaniego
posted an update about 2 months ago
gpt-oss was possible thanks to new engineering efforts in 🤗 transformers. We just dropped a blog covering them:

- Kernels from the Hub
- MXFP4 Quantization
- Tensor & Expert Parallelism
- Dynamic Sliding Window & Cache
- Continuous Batching & Paged Attention

Grab a coffee & dive in! ☕️

https://huggingface.co/blog/faster-transformers
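As a taste of one item on the list: a sliding-window cache bounds KV-cache memory by keeping only the most recent entries, so memory stays constant no matter how long generation runs. A toy sketch (illustrative only; the blog covers the real transformers implementation):

```python
from collections import deque

class SlidingWindowCache:
    """Toy sliding-window KV cache: memory is bounded at `window` entries
    regardless of how many generation steps have happened."""

    def __init__(self, window):
        self.window = window
        self.entries = deque(maxlen=window)  # oldest entries fall off the left

    def append(self, kv):
        self.entries.append(kv)

    def __len__(self):
        return len(self.entries)

# 10 decode steps into a window of 4: only the last 4 steps are retained
cache = SlidingWindowCache(window=4)
for step in range(10):
    cache.append(("key", "value", step))
```

The real version stores key/value tensors per attention layer, but the memory-bounding behavior is the same idea.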