---
pipeline_tag: image-text-to-text
tags:
- NPU
base_model:
- Qwen/Qwen3-VL-4B-Instruct
---
# Qwen3-VL-4B-Instruct
Run **Qwen3-VL-4B-Instruct** optimized for **Qualcomm NPUs** with [nexaSDK](https://sdk.nexa.ai).
## Quickstart
1. **Install NexaSDK** and create a free account at [sdk.nexa.ai](https://sdk.nexa.ai)
2. **Activate your device** with your access token:
```bash
nexa config set license '<access_token>'
```
3. **Run the model** on the Qualcomm NPU in one line:
```bash
nexa infer NexaAI/Qwen3-VL-4B-Instruct-NPU
```
## Model Description
**Qwen3-VL-4B-Instruct** is a 4-billion-parameter instruction-tuned multimodal large language model from Alibaba Cloud’s Qwen team.
As part of the **Qwen3-VL** series, it fuses powerful vision-language understanding with conversational fine-tuning, optimized for real-world applications such as chat-based reasoning, document analysis, and visual dialogue.
The *Instruct* variant is tuned to follow user prompts naturally and safely, producing concise, relevant, and user-aligned responses across text, image, and video contexts.
## Features
- **Instruction-Following**: Optimized for dialogue, explanation, and user-friendly task completion.
- **Vision-Language Fusion**: Understands and reasons across text, images, and video frames.
- **Multilingual Capability**: Handles multiple languages for diverse global use cases.
- **Contextual Coherence**: Balances reasoning ability with natural, grounded conversational tone.
- **Lightweight & Deployable**: 4B parameters make it efficient for edge and device-level inference.
## Use Cases
- Visual chatbots and assistants
- Image captioning and scene understanding
- Chart, document, or screenshot analysis
- Educational or tutoring systems with visual inputs
- Multilingual, multimodal question answering
## Inputs and Outputs
**Input:**
- Text prompts, image(s), or mixed multimodal instructions.
**Output:**
- Natural-language responses or visual reasoning explanations.
- Can return structured text (summaries, captions, answers, etc.) depending on the prompt.
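As a sketch of a typical multimodal invocation, the documented `nexa infer` command starts the model; the interactive prompt shown in the comments is an assumption about the session format, so consult the nexaSDK documentation for the exact way to attach images:

```shell
# Start an interactive session on the Qualcomm NPU (documented command).
nexa infer NexaAI/Qwen3-VL-4B-Instruct-NPU

# Hypothetical example turn combining an image path with a text instruction.
# NOTE: the image-attachment syntax below is an assumption, not verified
# against the nexaSDK docs:
# > /path/to/invoice.png Extract the total amount and the due date as JSON.
```

Requesting structured output (JSON, bullet summaries, captions) in the prompt, as described under "Inputs and Outputs" above, is a common way to get machine-readable responses from instruction-tuned VLMs.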
## License
Refer to the [official Qwen license](https://huggingface.co/Qwen) for terms of use and redistribution.