Explore OCR, Captioning, and Visual Understanding with Cutting-Edge Models on Hugging Face. 🤗🧪
I've put together a collection of Google Colab notebooks for experimenting with some of the most exciting models on the Hugging Face Hub, focused on OCR, image captioning, and visual understanding tasks. [Image-to-Text] / [Image-Text-to-Text]
These notebooks are built for quick prototyping and run on free T4 GPUs, making them perfect for experimentation, testing ideas, or just exploring what's possible with modern vision-language models.
Note: The experimental notebooks are compiled with models that fit within the free-tier T4 GPU limits. More models, along with their notebooks, will be added over time.
reacted to merve's post with ❤️🔥🧠 · 5 months ago
Google just released PaliGemma 2 Mix: new versatile instruction vision language models 🔥
> Three new models: 3B, 10B, and 28B, at resolutions 224 and 448
> Can do vision-language tasks with open-ended prompts, understand documents, and segment or detect anything 🤯
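Open-ended prompting in PaliGemma-style models works through short task prefixes in the text prompt (e.g. `caption en` or `detect cat`) paired with the input image. Below is a minimal sketch of how such prompts can be composed; the exact prefix set is an assumption based on PaliGemma's documented prompt format, not taken from this post:

```python
# Hypothetical sketch: composing PaliGemma-style task prompts.
# The prefix templates below are assumptions modeled on PaliGemma's
# documented format; they are illustrative, not an exhaustive list.
TASK_PREFIXES = {
    "caption": "caption {lang}",        # image captioning in a given language
    "ocr": "ocr",                       # read text in the image
    "answer": "answer {lang} {text}",   # visual question answering
    "detect": "detect {text}",          # object detection
    "segment": "segment {text}",        # segmentation
}

def build_prompt(task: str, lang: str = "en", text: str = "") -> str:
    """Return the text prompt that accompanies the input image."""
    template = TASK_PREFIXES[task]
    return template.format(lang=lang, text=text).strip()
```

For example, `build_prompt("caption")` yields `"caption en"` and `build_prompt("detect", text="cat")` yields `"detect cat"`; the resulting string is passed to the processor alongside the image.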
Given an input image, it generates several queries along with explanations to justify them. This approach can generate synthetic data for fine-tuning ColPali models.
Multimodal 💬
- We have released SmolVLM -- the tiniest VLMs, coming in 256M and 500M, with its retrieval models ColSmol for multimodal RAG
- UI-TARS are new models by ByteDance to unlock agentic GUI control 🤯, in 2B, 7B, and 72B
- Alibaba DAMO lab released VideoLlama3, new video LMs that come in 2B and 7B
- MiniMaxAI released Minimax-VL-01, whose decoder is based on the MiniMax-Text-01 456B MoE model with long context
- Dataset: Yale released a new benchmark called MMVU
- Dataset: CAIS released Humanity's Last Exam (HLE), a new challenging multimodal benchmark
LLMs
- DeepSeek-R1 & DeepSeek-R1-Zero: gigantic 660B reasoning models by DeepSeek, plus six distilled dense models, on par with o1, with an MIT license! 🤯
- Qwen2.5-Math-PRM: new math models by Qwen in 7B and 72B
- NVIDIA released AceMath and AceInstruct, a new family of models, along with their datasets (SFT and reward ones too!)
Audio 🗣️
- Llasa is a new speech synthesis model based on Llama that comes in 1B, 3B, and 8B
- TangoFlux is a new audio generation model trained from scratch and aligned with CRPO
Image/Video/3D Generation ⏯️
- Flex.1-alpha is a new 8B pre-trained diffusion model by ostris, similar to Flux
- Tencent released Hunyuan3D-2, new 3D asset generation from images
reacted to kadirnar's post with 🔥 · 9 months ago
🔥 The AI Agent hype is real! This blog post dives deep into everything you need to know before deploying them: from key definitions to practical recommendations. A must-read for anyone building the future of autonomous systems.
Key insight: a clear table breaking down the 5 levels of AI agents, from simple processors to fully autonomous systems. An essential framework for understanding where your agent stands on the autonomy spectrum.
⚖️ Deep analysis of 15 core values reveals critical trade-offs: accuracy, privacy, safety, equity & more. The same features that make agents powerful can make them risky. Understanding these trade-offs is crucial for responsible deployment.
🎯 6 key recommendations for the road ahead:
- Create rigorous evaluation protocols
- Study societal effects
- Understand ripple effects
- Improve transparency
- Open source can make a positive difference
- Monitor base model evolution
Since I published it on GitHub a few days ago, Hugging Face's new agentic library smolagents has gathered nearly 4k stars 🤯
⚡️ But we are just getting started on agents, so we are hiring an ML Engineer to join me and double down on this effort!
The plan is to build GUI agents: agents that can act on your computer with mouse & keyboard, like Claude Computer Use.
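A GUI agent of this kind is typically an observe-act loop: feed the current screenshot to a vision-language model, execute the action it returns, and repeat. Here is a minimal, hypothetical sketch of such a loop; the `Action` schema and `fake_policy` are illustrative stand-ins, not the smolagents API:

```python
# Hypothetical sketch of a GUI-agent observe-act loop.
# Action schema and policy interface are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Action:
    """One low-level GUI action requested by the model."""
    kind: str          # "click", "type", or "done"
    x: int = 0         # screen coordinates for "click"
    y: int = 0
    text: str = ""     # text payload for "type"

def run_agent(policy: Callable[[str], Action],
              screenshot: str,
              max_steps: int = 5) -> List[Action]:
    """Query the policy with the current screen until it signals 'done'."""
    trace: List[Action] = []
    for _ in range(max_steps):
        action = policy(screenshot)
        trace.append(action)
        if action.kind == "done":
            break
        # A real agent would execute the action here (e.g. via an input
        # automation library) and capture a fresh screenshot.
    return trace

# Toy policy standing in for a vision-language model:
def fake_policy(screen: str) -> Action:
    return Action(kind="done")
```

In a real system the policy call would send the screenshot to a VLM and parse its structured output into an `Action`; the loop structure stays the same.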