---
pipeline_tag: image-text-to-text
tags:
- NPU
base_model:
- Qwen/Qwen3-VL-4B-Instruct
---
# Qwen3-VL-4B-Instruct
Run **Qwen3-VL-4B-Instruct** optimized for **Qualcomm NPUs** with [nexaSDK](https://sdk.nexa.ai).

## Quickstart

1. **Install NexaSDK** and create a free account at [sdk.nexa.ai](https://sdk.nexa.ai)
2. **Activate your device** with your access token:

   ```bash
   nexa config set license '<access_token>'
   ```
3. Run the model on Qualcomm NPU in one line:

   ```bash
   nexa infer NexaAI/Qwen3-VL-4B-Instruct-NPU
   ```

## Model Description
**Qwen3-VL-4B-Instruct** is a 4-billion-parameter instruction-tuned multimodal large language model from Alibaba Cloud’s Qwen team.  
As part of the **Qwen3-VL** series, it combines strong vision-language understanding with conversational fine-tuning, making it well suited to real-world applications such as chat-based reasoning, document analysis, and visual dialogue.

The *Instruct* variant is tuned to follow user prompts naturally and safely, producing concise, relevant, and user-aligned responses across text, image, and video contexts.

## Features
- **Instruction-Following**: Optimized for dialogue, explanation, and user-friendly task completion.  
- **Vision-Language Fusion**: Understands and reasons across text, images, and video frames.  
- **Multilingual Capability**: Handles multiple languages for diverse global use cases.  
- **Contextual Coherence**: Balances reasoning ability with natural, grounded conversational tone.  
- **Lightweight & Deployable**: 4B parameters make it efficient for edge and device-level inference.

## Use Cases
- Visual chatbots and assistants  
- Image captioning and scene understanding  
- Chart, document, or screenshot analysis  
- Educational or tutoring systems with visual inputs  
- Multilingual, multimodal question answering  

## Inputs and Outputs
**Input:**
- Text prompts, image(s), or mixed multimodal instructions.  

**Output:**
- Natural-language responses or visual reasoning explanations.  
- Can return structured text (summaries, captions, answers, etc.) depending on the prompt.
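To make the mixed text-and-image input concrete, the snippet below builds a chat-style message that pairs a prompt with an inline image. This is an illustrative sketch only: `build_multimodal_message` is a hypothetical helper, and the `"text"`/`"image_url"` content-part convention shown is the common OpenAI-style schema, which may differ from the exact request format NexaSDK expects — consult the [nexaSDK docs](https://sdk.nexa.ai) for the authoritative format.

```python
# Hypothetical sketch: packing a text prompt plus one image into a single
# OpenAI-style chat message. The schema here is an assumption, not the
# confirmed NexaSDK request format.
import base64
import json


def build_multimodal_message(prompt: str, image_bytes: bytes) -> dict:
    """Combine a text prompt and raw image bytes into one user message."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                # Inline the image as a base64 data URL.
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            },
        ],
    }


# Example: ask the model to describe a (placeholder) image.
message = build_multimodal_message("Describe this chart.", b"\x89PNG fake bytes")
print(json.dumps(message, indent=2))
```

The model's natural-language reply would then arrive as an assistant-role message in the same chat structure.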

## License
Refer to the [official Qwen license](https://huggingface.co/Qwen) for terms of use and redistribution.