AI & ML interests

Training a Traditional Chinese Large Language Model.

prithivMLmods posted an update 5 days ago

Introducing demos for new SOTA models from AI2: SAGE-MM (Smart Any-Horizon Agents for Long-Video Reasoning) and Molmo2, an open vision-language model that supports multi-image (QA and pointing) and video (QA, pointing, and tracking). The respective demo-related collections are listed below. 🎃🔥

✨ SAGE-MM [Video-Reasoning]: prithivMLmods/SAGE-MM-Video-Reasoning
✨ Molmo2 [Demo]: prithivMLmods/Molmo2-HF-Demo

🎃 GitHub [SAGE-MM]: https://github.com/PRITHIVSAKTHIUR/SAGE-MM-Video-Reasoning
🎃 GitHub [Molmo2]: https://github.com/PRITHIVSAKTHIUR/Molmo2-HF-Demo
🎃 Multimodal Implementations: https://huggingface.co/collections/prithivMLmods/multimodal-implementations

To know more about it, visit the app page or the respective model page!

prithivMLmods posted an update 6 days ago

Introducing TRELLIS.2 Text-to-3D. The demo pairs the TRELLIS.2-4B (Image-to-3D) model with the Z-Image Turbo image generation model to enable Text-to-3D functionality. No input assets are needed, which makes this a small leap forward for ideation. The demo also supports Image-to-3D inference directly from image assets. Find the demo and related collections below. 🤗🔥

✨ TRELLIS.2-Text-to-3D [Demo]: prithivMLmods/TRELLIS.2-Text-to-3D
✨ Multimodal Collection: https://huggingface.co/collections/prithivMLmods/multimodal-implementations
✨ GitHub: https://github.com/PRITHIVSAKTHIUR/TRELLIS.2-Text-to-3D

To know more about it, visit the app page or the respective model page!

prithivMLmods posted an update 8 days ago

The Molmo2 demo on Hugging Face is now live, including Single/Multi-Image VQA, Visual Pointing/Grounding, Video VQA, and Video Point Tracking. Find the demo and related collections below. 🔥🤗

● Molmo2 HF Demo 🖥️: prithivMLmods/Molmo2-HF-Demo
● Model Collection: https://huggingface.co/collections/allenai/molmo2
● Related Multimodal Space Collection: https://huggingface.co/collections/prithivMLmods/multimodal-implementations

To know more about it, visit the app page or the respective model page!

prithivMLmods posted an update 10 days ago

Introducing the Z Image Turbo LoRA DLC App, a gallery Space for plug-and-play Z-Image-Turbo LoRAs. It features a curated collection of impressive LoRAs for generating high-quality images. By default, it runs on the base model. Simply choose a LoRA, type your prompt, and generate images. You can find the app and more details below. 🤗🧪

● Space [Demo]: prithivMLmods/Z-Image-Turbo-LoRA-DLC
● Collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection
● List of Z-Image LoRAs: https://huggingface.co/models?other=base_model:adapter:Tongyi-MAI/Z-Image-Turbo
● GitHub: https://github.com/PRITHIVSAKTHIUR/Z-Image-Turbo-LoRA-DLC

Other related image generation Spaces:

● FLUX-LoRA-DLC2: prithivMLmods/FLUX-LoRA-DLC2
● FLUX-LoRA-DLC: prithivMLmods/FLUX-LoRA-DLC
● Qwen-Image-LoRA-DLC: prithivMLmods/Qwen-Image-LoRA-DLC
● Qwen-Image-Edit-2509-LoRAs-Fast: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast
● Qwen-Image-Edit-2509-LoRAs-Fast-Fusion: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast-Fusion

& more...

To know more about it, visit the app page or the respective model page!

prithivMLmods posted an update 17 days ago

Introducing the D.Markdown Experimental Models, the Proxima and Epsilon OCR models, built on top of Qwen3-VL and Qwen2.5-VL respectively. Proxima is optimized for Markdown generation and is capable of embedding inline programming code snippets and generating rich nodes such as HTML, XML, JSON, and YAML. Epsilon is optimized for reconstructing complex layouts including tables, forms, and mathematical content. 🌌✨

● proxima-ocr-d.markdown-post3.0.l: prithivMLmods/proxima-ocr-d.markdown-post3.0.l
● epsilon-ocr-d.markdown-post3.0.m: prithivMLmods/epsilon-ocr-d.markdown-post3.0.m
● proxima-ocr-d.markdown-post3.0.l-gguf: prithivMLmods/proxima-ocr-d.markdown-post3.0.l-GGUF
● epsilon-ocr-d.markdown-post3.0.m-gguf: prithivMLmods/epsilon-ocr-d.markdown-post3.0.m-GGUF

● Collection: https://huggingface.co/collections/prithivMLmods/dynamic-markdowns
● Multimodal Apps: https://huggingface.co/collections/prithivMLmods/multimodal-implementations

👉 These are stage-progression models and may currently contain artifacts.

To know more about it, visit the app page or the respective model page!

prithivMLmods posted an update 19 days ago

Try the CUA GUI Operator 🖥️ Space, a demo of several interesting ultra-compact multimodal Computer Use Agent (CUA) models in a single app, including Fara-7B, UI-TARS-1.5-7B, and Holo models, for GUI localization tasks.

● CUA-GUI-Operator [Demo]: prithivMLmods/CUA-GUI-Operator
● Collection: https://huggingface.co/collections/prithivMLmods/multimodal-implementations

Other related multimodal Spaces:

● Qwen3-VL: prithivMLmods/Qwen3-VL-HF-Demo
● Multimodal-VLM-v1.0: prithivMLmods/Multimodal-VLM-v1.0
● Vision-to-VibeVoice-en: prithivMLmods/Vision-to-VibeVoice-en

I plan to add Chrome sandboxes to streamline it and turn it into a browser-based CUA multimodal tool; these will be added to the same Space soon.

To know more about it, visit the app page or the respective model page!

prithivMLmods posted an update 20 days ago

One speech model with seven voices, streamlined with multimodal capabilities for vision tasks. It performs vision (image-text) to audio inference with Qwen2.5-VL + VibeVoice-Realtime-0.5B. Vision to VibeVoice (EN): the demo is live. 🗣️🔥

🤗 Vision-to-VibeVoice-en [Demo]: prithivMLmods/Vision-to-VibeVoice-en
✨ Collection: https://huggingface.co/collections/prithivMLmods/multimodal-implementations
✨ Speech [VibeVoice-Realtime-0.5B]: microsoft/VibeVoice-Realtime-0.5B
✨ Vision [Qwen2.5-VL]: Qwen/Qwen2.5-VL-7B-Instruct

To know more about it, visit the app page or the respective model page!

prithivMLmods posted an update 25 days ago

Hello everyone,

The strangerzonehf [HF] Community / Organization Page, which I maintain, has reached 6th place in the Top 10 Developer Pages ranking, contributing 3.4% in the calendar cycle from August 2024 to August 2025. It is also the only South Asian / Indian page in the list. I could not be more proud to be doing things for the community. ❤️🤗

Source: https://www.dataprovenance.org/economies-of-open-intelligence.pdf

It is a pleasure to be a part of it.
Thank you!
@prithivMLmods

prithivMLmods posted an update 29 days ago

Introducing the Super-OCRs Demo, a comparison of state-of-the-art multimodal OCR VLMs, including HunyuanOCR, DeepSeekOCR, Dots, and Nanonets in one Space for performing OCR, rendering LaTeX and Markdown, and visual grounding (layout). Find the related Spaces and models below. 🤗🔥

✨ Super-OCRs [Demo]: prithivMLmods/Super-OCRs-Demo
✨ Collection: https://huggingface.co/collections/prithivMLmods/multimodal-implementations
✨ GitHub: https://github.com/PRITHIVSAKTHIUR/Super-OCRs-Demo

⭐ Models Used:
✦ HunyuanOCR: tencent/HunyuanOCR
✦ DeepSeek-OCR: (-) deepseek-ai/DeepSeek-OCR (+) prithivMLmods/DeepSeek-OCR-Latest-BF16.I64
✦ Dots.OCR: (-) rednote-hilab/dots.ocr (+) prithivMLmods/Dots.OCR-Latest-BF16
✦ Nanonets-OCR2-3B: nanonets/Nanonets-OCR2-3B

⭐ Some Other Relevant Apps:
✦ Qwen3-VL-HF-Demo: prithivMLmods/Qwen3-VL-HF-Demo
✦ Qwen3-VL-Outpost: prithivMLmods/Qwen3-VL-Outpost
✦ Multimodal-OCR: prithivMLmods/Multimodal-OCR
✦ Multimodal-OCR2: prithivMLmods/Multimodal-OCR2
✦ Multimodal-OCR3: prithivMLmods/Multimodal-OCR3
✦ DeepSeek-OCR-experimental: prithivMLmods/DeepSeek-OCR-experimental

To know more about it, visit the app page or the respective model page!

prithivMLmods posted an update about 1 month ago

Introducing the advanced sketch-board editor "Nano-Banana-Pro-Sketch-Board" powered by the Gemini 2.5 Flash Image and Gemini 3 Pro Preview Image models through the Gemini API. This version includes more features than the Nano-Banana-AIO app for drawing and prompt-based concept transformation of freestyle sketches. 🔥🍌

✨ Nano-Banana-Pro-Sketch-Board: prithivMLmods/Nano-Banana-Pro-Sketch-Board
✨ Collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection
✨ GitHub: https://github.com/PRITHIVSAKTHIUR/Nano-Banana-Pro-Sketch-Board
✨ Model-Garden: https://tinyurl.com/4xxs9dvy

Some Other Relevant Apps [OSS]

⭐ Qwen-Image-Edit-2509-LoRAs-Fast-Fusion: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast-Fusion
⭐ Qwen-Image-Edit-2509-LoRAs-Fast: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast
⭐ Photo-Mate-i2i: prithivMLmods/Photo-Mate-i2i
⭐ Kontext-Photo-Mate-v2: https://huggingface.co/spaces/prithivMLmods/Kontext-Photo-Mate-v2

Note: The Nano-Banana-Pro-Sketch-Board demo requires a Gemini API key for the editing process. Your API key is removed when the app is reloaded or closed, and it is not exposed elsewhere. Also, the Gemini 3 Pro Preview Image model may require a paid API key from a Google Cloud project with billing enabled.

To know more about it, visit the app info section or the respective Model Garden page!

prithivMLmods posted an update about 1 month ago

Try the demo of NVIDIA Nemotron Parse v1.1, NVIDIA's latest VLM for understanding document semantics and extracting text and table elements with spatial grounding. It is capable of comprehensive text understanding and document structure analysis in a given document, and can provide bounding boxes with coordinates.

⭐ Space [Demo]: prithivMLmods/NVIDIA-Nemotron-Parse-OCR
⭐ Model: nvidia/NVIDIA-Nemotron-Parse-v1.1
⭐ Multimodal-Spaces: https://huggingface.co/collections/prithivMLmods/multimodal-implementations

Some relevant Spaces

⭐ DeepSeek-OCR-experimental [latest transformers]: prithivMLmods/DeepSeek-OCR-experimental
⭐ Qwen3-VL-Outpost: prithivMLmods/Qwen3-VL-Outpost
⭐ Multimodal-OCR3: prithivMLmods/Multimodal-OCR3

Check out the other spaces in the multimodal implementation collection.

To know more about it, visit the app page or the respective model page!

prithivMLmods posted an update about 1 month ago

Try the all-new trending Qwen-Image-Edit-2509 (Multi-Image-Edits) specialized adapter demos, including Cloth-Design-Fuse, Texture Edit, Guided-Objects-Patching, and more, all in a single Hugging Face Space. The demo link is provided below. 🤗🔥

⮞ Space [Demo]: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast-Fusion
⮞ Collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection
⮞ Base Model: Qwen/Qwen-Image-Edit-2509

Similar applications ↗️

⮞ Kontext-Photo-Mate-v2: https://huggingface.co/spaces/prithivMLmods/Kontext-Photo-Mate-v2
⮞ Photo-Mate-i2i: prithivMLmods/Photo-Mate-i2i
⮞ Qwen-Image-Edit-2509-LoRAs-Fast: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast

To know more about it, visit the app page or the respective model page!

prithivMLmods posted an update about 1 month ago

Made a demo Space for multimodal understanding with Qwen3-VL, covering tasks including point annotation, detection, captioning, guided text inference, and more. Find the demo link below. 🤗↗️

⮞ Space [Demo]: prithivMLmods/Qwen3-VL-HF-Demo
⮞ Model Used: Qwen/Qwen3-VL-4B-Instruct
⮞ Collection: https://huggingface.co/collections/prithivMLmods/multimodal-implementations
⮞ GitHub: https://github.com/PRITHIVSAKTHIUR/Qwen-3VL-Multimodal-Understanding

To know more about it, visit the app page or the respective model page!

prithivMLmods posted an update about 1 month ago

Made a small write-up and experimental fine-tuning guide for MetaCLIP 2 for image classification on downstream tasks. The blog, titled "Fine Tuning MetaCLIP 2 for Image Classification on Downstream Tasks", demonstrates step-by-step fine-tuning using CIFAR10 and can be adapted to other datasets. For more details, check out the linked blog below. 🤗↗️

⮞ Blog Article: https://huggingface.co/blog/prithivMLmods/metaclip2-downstream-finetune
⮞ Demo Space [Zero-Shot Classification]: prithivMLmods/metaclip-2-demo

Some other models
╰› MetaCLIP-2-Cifar10: prithivMLmods/MetaCLIP-2-Cifar10
╰› MetaCLIP-2-Age-Range-Estimator: prithivMLmods/MetaCLIP-2-Age-Range-Estimator
╰› MetaCLIP-2-Gender-Identifier: prithivMLmods/MetaCLIP-2-Gender-Identifier
╰› MetaCLIP-2-Open-Scene: prithivMLmods/MetaCLIP-2-Open-Scene

⮞ Collection: https://huggingface.co/collections/prithivMLmods/metaclip2-image-classification-experiments

To know more about it, visit the app page or the respective model page!

prithivMLmods posted an update about 1 month ago

Try the all-new trending Qwen-Image-Edit specialized adapter demos, including Photo-to-Anime, Light Restoration, Multi-Angle Edits, Relighting, and more, all in a single Hugging Face Space. Below is the demo link. 🤗🌠

⮞ Demo-Space: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast
⮞ How-to-Use: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast#2
⮞ Collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection

To know more about it, visit the app page or the respective model page!

prithivMLmods posted an update about 2 months ago

Introducing Photo-Mate-v2, based on FLUX.1-Kontext-dev, for advanced image manipulation tasks. It supports transforming scenes into top-down/bottom-up perspectives, CAM right/left views and their reverse, as well as general kontext-specified object removal. Below is the list of demos and adapters. 🔥🤗

➤ Space [Demo]: https://huggingface.co/spaces/prithivMLmods/Kontext-Photo-Mate-v2

Kontext-Adapters:
✦ Kontext-Bottom-Up-View: prithivMLmods/Kontext-Bottom-Up-View
✦ Kontext-CAM-Right-View: prithivMLmods/Kontext-CAM-Right-View
✦ Kontext-Top-Down-View: prithivMLmods/Kontext-Top-Down-View
✦ Kontext-CAM-Left-View: prithivMLmods/Kontext-CAM-Left-View
✦ Kontext-Unblur-Upscale: prithivMLmods/Kontext-Unblur-Upscale
✦ Kontext-0811-exp: prithivMLmods/Kontext-0811-exp

Photo-Mate Collection:
✦ Kontext CAM Angles: https://huggingface.co/collections/prithivMLmods/kontext-cam-angles
✦ i2i - Kontext (exp): https://huggingface.co/collections/prithivMLmods/i2i-kontext-exp
✦ LZO-1 (Lossless Zoom Operator): https://huggingface.co/collections/prithivMLmods/lzo-1-lossless-zoom-operator

Related-Apps:
✦ Photo-Mate [Version 1.0]: prithivMLmods/Photo-Mate-i2i
✦ Image Generation Apps [Collection]: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection

To know more about it, visit the app page or the respective model page!
@prithivMLmods

prithivMLmods posted an update about 2 months ago

A week ago, I shared a post about a test implementation of DeepSeek-OCR compatibility with the latest transformers (https://tinyurl.com/ykc4mm66). Now, I'm releasing the most compatible version of it to support the model with the latest transformers. 🤗🔥

➠ DeepSeek-OCR-Latest-BF16.I64: prithivMLmods/DeepSeek-OCR-Latest-BF16.I64
➠ DeepSeek OCR [exp]: prithivMLmods/DeepSeek-OCR-experimental

✅ Supports the latest transformers v4.57.1
✅ torch: 2.6.0+cu124 or the latest version (i.e., torch 2.9.0)
✅ CUDA version: 12.4
✅ Users can also opt out of specific attention implementations if desired.
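
The version notes above can be checked against a local environment before loading the model. The following is a minimal sketch, not part of the original release: it treats the listed versions as minimums (an assumption; the post only says these versions are supported), and the names `REQUIRED`, `parse_version`, and `check_environment` are hypothetical helpers introduced here for illustration.

```python
from importlib import metadata

# Versions quoted in the post, treated here as minimums (an assumption).
REQUIRED = {"transformers": "4.57.1", "torch": "2.6.0"}


def parse_version(text):
    """Turn a string such as '2.6.0+cu124' into (2, 6, 0).

    Local build tags ('+cu124') and pre-release suffixes ('rc1') are ignored.
    """
    core = text.split("+", 1)[0]
    parts = []
    for piece in core.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_environment(required=REQUIRED):
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            report[name] = (None, False)
            continue
        ok = parse_version(installed) >= parse_version(minimum)
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_environment().items():
        print(f"{name}: installed={installed} ok={ok}")
```

Running the script prints one line per package; a package that is missing is reported with `installed=None`. The CUDA version is not checked here, since that depends on the installed torch build.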

✨ Previous version: strangervisionhf/deepseek-ocr-latest-transformers
↗️ Related Blog: https://huggingface.co/blog/prithivMLmods/multimodal-ocr-vlms
✨ Community Page: strangervisionhf
✨ Original Model Page: deepseek-ai/DeepSeek-OCR

To know more about it, visit the app page or the respective model page!

prithivMLmods posted an update about 2 months ago

A small blog post titled "Hall of Multimodal OCR VLMs and Demonstrations" has been published at ↗️ https://huggingface.co/blog/prithivMLmods/multimodal-ocr-vlms on behalf of strangervisionhf.

It discusses the latest trends in OCR models, the multilingual support offered by modern OCR systems, their unique capabilities, OCR benchmark model comparisons, transformer-based implementations, and strategies for streamlining transformers compatibility.

prithivMLmods posted an update about 2 months ago

Implemented DeepSeek-OCR to support the latest transformers on the strangervisionhf page. The page includes the model weights and corrected configuration, which fix the issues and allow transformers inference to run smoothly. 🤗🔥

> Model: strangervisionhf/deepseek-ocr-latest-transformers
> Demo Space: prithivMLmods/DeepSeek-OCR-experimental

✅ Supports the latest transformers
✅ You can also opt out of the attention implementation if needed.
✅ Supports torch version 2.6.0 or higher
✅ CUDA version for torch: 12.4

If you are interested in experimenting with new things and streamlining compatibility, the strangervisionhf organization is open for you, and you can join the community.

> Multimodal Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0, https://huggingface.co/collections/strangervisionhf/october-2025-models

> Thank you, @merve, for assigning the blazing-fast Zero GPU support!

> Notebook : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepSeek-OCR-Demo/deepseek_ocr_demo.ipynb

To know more about it, visit the app page or the respective model page!

prithivMLmods posted an update about 2 months ago

Introducing Gliese-OCR-7B-Post2.0-final, a document content-structure retrieval VLM designed for content extraction (OCR), summarization, and document visual question answering. This is the fourth and final model in the Camel Doc OCR VLM series, following Gliese-OCR-7B-Post1.0. The model delivers superior accuracy across a wide range of document types, including scanned PDFs, handwritten pages, structured forms, and analytical reports. 🚀🤗

> Gliese-OCR-7B-Post2.0-final : prithivMLmods/Gliese-OCR-7B-Post2.0-final
> Gliese-OCR-7B-Post1.0 (previous) : prithivMLmods/Gliese-OCR-7B-Post1.0
> Gliese OCR Post-x.0 (collection) : https://huggingface.co/collections/prithivMLmods/gliese-ocr-post-x0
> Multimodal Implementations (collection) : https://huggingface.co/collections/prithivMLmods/multimodal-implementations
> Qwen VL Captions (other-collection) : https://huggingface.co/collections/prithivMLmods/qwen-vl-captions
> Run Demo Here : prithivMLmods/Gliese-OCR-7B-Post2.0-final
> GitHub (4bit) : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/Gliese-OCR-7B-Post2.0-final(4bit)/Gliese_OCR_7B_Post2_0_final.ipynb

> To know more about it, visit the app page or the respective model page!