---
pipeline_tag: image-text-to-text
tags:
- NPU
base_model:
- Qwen/Qwen3-VL-4B-Instruct
---
|
|
# Qwen3-VL-4B-Instruct

Run **Qwen3-VL-4B-Instruct** optimized for **Qualcomm NPUs** with [nexaSDK](https://sdk.nexa.ai).
|
|
|
|
|
## Quickstart
|
|
|
|
|
1. **Install NexaSDK** and create a free account at [sdk.nexa.ai](https://sdk.nexa.ai)
|
|
2. **Activate your device** with your access token:
|
|
|
|
|
   ```bash
   nexa config set license '<access_token>'
   ```
|
|
3. Run the model on the Qualcomm NPU in one line:
|
|
|
|
|
   ```bash
   nexa infer NexaAI/Qwen3-VL-4B-Instruct-NPU
   ```
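
On first run, `nexa infer` typically downloads the model before dropping you into an interactive session. If you prefer to fetch the model ahead of time or check what is already cached, the CLI has subcommands for that; the names below are assumptions based on common nexaSDK usage, so confirm them with `nexa --help`.

```bash
# Optional pre-download and cache check.
# Subcommand names are assumptions -- confirm with `nexa --help` on your install.
nexa pull NexaAI/Qwen3-VL-4B-Instruct-NPU   # fetch the NPU build ahead of time
nexa list                                   # show models already on disk
```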
|
|
|
|
|
## Model Description

**Qwen3-VL-4B-Instruct** is a 4-billion-parameter instruction-tuned multimodal large language model from Alibaba Cloud’s Qwen team.
As part of the **Qwen3-VL** series, it pairs strong vision-language understanding with conversational fine-tuning and is optimized for real-world applications such as chat-based reasoning, document analysis, and visual dialogue.

The *Instruct* variant is tuned to follow user prompts naturally and safely, producing concise, relevant, and user-aligned responses across text, image, and video contexts.
|
|
|
|
|
## Features

- **Instruction-Following**: Optimized for dialogue, explanation, and user-friendly task completion.
- **Vision-Language Fusion**: Understands and reasons across text, images, and video frames (see the request sketch after this list).
- **Multilingual Capability**: Handles multiple languages for diverse global use cases.
- **Contextual Coherence**: Balances reasoning ability with a natural, grounded conversational tone.
- **Lightweight & Deployable**: 4B parameters make it efficient for edge and device-level inference.
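
To make the vision-language point concrete, here is a minimal request sketch. It assumes your nexaSDK build can expose the model behind a local OpenAI-compatible endpoint (the serve command, port, and route are assumptions; check the nexaSDK docs for your version). The message shape, mixing a `text` part with an `image_url` part, is the standard OpenAI-style chat format.

```bash
# Assumption: the model is being served locally behind an OpenAI-compatible API,
# e.g. via a `nexa serve`-style subcommand; command, host, port, and route are
# placeholders -- adjust them to your setup.
#   nexa serve NexaAI/Qwen3-VL-4B-Instruct-NPU

# One text part plus one image part in a single user message.
# Local-file handling varies; some servers expect a base64 data URL instead of a path.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NexaAI/Qwen3-VL-4B-Instruct-NPU",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What trend does this chart show?"},
        {"type": "image_url", "image_url": {"url": "file:///path/to/chart.png"}}
      ]
    }]
  }'
```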
|
|
|
|
|
## Use Cases

- Visual chatbots and assistants
- Image captioning and scene understanding
- Chart, document, or screenshot analysis
- Educational or tutoring systems with visual inputs
- Multilingual, multimodal question answering
|
|
|
|
|
## Inputs and Outputs

**Input:**
- Text prompts, image(s), or mixed multimodal instructions.

**Output:**
- Natural-language responses or visual reasoning explanations.
- Can return structured text (summaries, captions, answers, etc.) depending on the prompt; see the prompt sketch below.
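
A concrete way to steer the model toward structured output is to name the schema in the prompt. Everything in the sketch below is illustrative: the file path and JSON keys are made up, and whether images are referenced by path or attached some other way depends on your nexaSDK version, so check the SDK docs.

```bash
# Illustrative prompt only -- the image path and JSON keys are placeholders.
# Typed into the interactive session opened by:
#   nexa infer NexaAI/Qwen3-VL-4B-Instruct-NPU
#
# Prompt:
#   "Look at /path/to/receipt.jpg and answer as a single JSON object with
#    exactly these keys: merchant, date, total, line_items."
#
# A schema-naming prompt like this is what the structured-text output above refers to.
```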
|
|
|
|
|
## License

Refer to the [official Qwen license](https://huggingface.co/Qwen) for terms of use and redistribution.