---
pipeline_tag: image-text-to-text
tags:
- NPU
base_model:
- Qwen/Qwen3-VL-4B-Instruct
---
|
|
# Qwen3-VL-4B-Instruct

Run **Qwen3-VL-4B-Instruct** optimized for **Qualcomm NPUs** with [nexaSDK](https://sdk.nexa.ai).
|
|
|
|
|
## Quickstart
|
|
|
|
|
1. **Install NexaSDK** and create a free account at [sdk.nexa.ai](https://sdk.nexa.ai)
|
|
2. **Activate your device** with your access token:
|
|
|
|
|
   ```bash
   nexa config set license '<access_token>'
   ```
|
|
3. Run the model on the Qualcomm NPU in one line:
|
|
|
|
|
   ```bash
   nexa infer NexaAI/Qwen3-VL-4B-Instruct-NPU
   ```
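
On first run, `nexa infer` typically downloads the model before dropping you into an interactive session. If you prefer to fetch the model ahead of time or check what is already cached, the CLI has subcommands for that; the names below are assumptions based on common nexaSDK usage, so confirm them with `nexa --help`.

```bash
# Optional pre-download and cache check.
# Subcommand names are assumptions -- confirm with `nexa --help` on your install.
nexa pull NexaAI/Qwen3-VL-4B-Instruct-NPU   # fetch the NPU build ahead of time
nexa list                                   # show models already on disk
```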
|
|
|
|
|
## Model Description

**Qwen3-VL-4B-Instruct** is a 4-billion-parameter instruction-tuned multimodal large language model from Alibaba Cloud’s Qwen team.
As part of the **Qwen3-VL** series, it pairs strong vision-language understanding with conversational fine-tuning and is optimized for real-world applications such as chat-based reasoning, document analysis, and visual dialogue.

The *Instruct* variant is tuned to follow user prompts naturally and safely, producing concise, relevant, and user-aligned responses across text, image, and video contexts.
|
|
|
|
|
## Features

- **Instruction-Following**: Optimized for dialogue, explanation, and user-friendly task completion.
- **Vision-Language Fusion**: Understands and reasons across text, images, and video frames (see the request sketch after this list).
- **Multilingual Capability**: Handles multiple languages for diverse global use cases.
- **Contextual Coherence**: Balances reasoning ability with a natural, grounded conversational tone.
- **Lightweight & Deployable**: 4B parameters make it efficient for edge and device-level inference.
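
To make the vision-language point concrete, here is a minimal request sketch. It assumes your nexaSDK build can expose the model behind a local OpenAI-compatible endpoint (the serve command, port, and route are assumptions; check the nexaSDK docs for your version). The message shape, mixing a `text` part with an `image_url` part, is the standard OpenAI-style chat format.

```bash
# Assumption: the model is being served locally behind an OpenAI-compatible API,
# e.g. via a `nexa serve`-style subcommand; command, host, port, and route are
# placeholders -- adjust them to your setup.
#   nexa serve NexaAI/Qwen3-VL-4B-Instruct-NPU

# One text part plus one image part in a single user message.
# Local-file handling varies; some servers expect a base64 data URL instead of a path.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NexaAI/Qwen3-VL-4B-Instruct-NPU",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What trend does this chart show?"},
        {"type": "image_url", "image_url": {"url": "file:///path/to/chart.png"}}
      ]
    }]
  }'
```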
|
|
|
|
|
## Use Cases

- Visual chatbots and assistants
- Image captioning and scene understanding
- Chart, document, or screenshot analysis
- Educational or tutoring systems with visual inputs
- Multilingual, multimodal question answering
|
|
|
|
|
## Inputs and Outputs

**Input:**
- Text prompts, image(s), or mixed multimodal instructions.

**Output:**
- Natural-language responses or visual reasoning explanations.
- Can return structured text (summaries, captions, answers, etc.) depending on the prompt; see the prompt sketch below.
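
A concrete way to steer the model toward structured output is to name the schema in the prompt. Everything in the sketch below is illustrative: the file path and JSON keys are made up, and whether images are referenced by path or attached some other way depends on your nexaSDK version, so check the SDK docs.

```bash
# Illustrative prompt only -- the image path and JSON keys are placeholders.
# Typed into the interactive session opened by:
#   nexa infer NexaAI/Qwen3-VL-4B-Instruct-NPU
#
# Prompt:
#   "Look at /path/to/receipt.jpg and answer as a single JSON object with
#    exactly these keys: merchant, date, total, line_items."
#
# A schema-naming prompt like this is what the structured-text output above refers to.
```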
|
|
|
|
|
## License

Refer to the [official Qwen license](https://huggingface.co/Qwen) for terms of use and redistribution.