---
license: apache-2.0
license_link: https://huggingface.co/skt/A.X-4.0-VL-Light/blob/main/LICENSE
language:
- en
- ko
pipeline_tag: image-text-to-text
library_name: transformers
model_id: skt/A.X-4.0-VL-Light
developers: SKT AI Model Lab
base_model:
- skt/A.X-4.0-Light
---
# A.X 4.0 VL Light
## Highlights
**A.X 4.0 VL Light** (pronounced "A dot X") is a vision-language model (VLM) optimized for Korean vision and language understanding as well as enterprise deployment. Built upon [A.X 4.0 Light](https://huggingface.co/skt/A.X-4.0-Light), A.X 4.0 VL Light has been further trained on diverse multimodal datasets, with a particular focus on large-scale multimodal Korean datasets, to deliver exceptional performance in domestic business applications.
- **Superior Korean Proficiency in Vision and Language**: Achieved an average score of 79.4 on Korean image benchmarks, outperforming Qwen2.5-VL-32B (73.4), despite having a significantly smaller model size. On Korean text benchmarks, recorded an average score of 60.2, comparable to VARCO-VISION-2.0-14B (60.4), while using only half the model size.
- **Deep Cultural Understanding**: Scored 80.2 on K-Viscuit, a multimodal benchmark designed to evaluate cultural and contextual comprehension in Korean, exceeding Qwen2.5-VL-32B (72.3).
- **Advanced Document Understanding**: Attained a score of 89.8 on KoBizDoc, a benchmark focused on understanding complex document structures, including charts and tables, performing comparably to Qwen2.5-VL-32B (88.8).
- **Efficient Token Usage**: A.X 4.0 VL Light utilizes approximately 41% fewer text tokens compared to Qwen2.5-VL for the same Korean input, enabling significantly more cost-effective and efficient processing.
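The token-efficiency figure can be checked on your own text. The sketch below is a minimal illustration, not the procedure behind the reported 41%: `token_reduction` and `count_tokens` are hypothetical helpers, it is an assumption that both tokenizers load through `AutoTokenizer`, and the baseline repository name in the usage comment is also an assumption.

```python
def token_reduction(baseline_tokens: int, candidate_tokens: int) -> float:
    """Return the percentage of tokens saved relative to the baseline count."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens


def count_tokens(model_name: str, text: str) -> int:
    """Count text tokens using the tokenizer published under `model_name`."""
    # Imported lazily so the arithmetic helper above works without transformers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])


# Usage (downloads both tokenizers; the baseline name is an assumption):
# text = "..."  # any Korean passage
# baseline = count_tokens("Qwen/Qwen2.5-VL-7B-Instruct", text)
# candidate = count_tokens("skt/A.X-4.0-VL-Light", text)
# print(f"{token_reduction(baseline, candidate):.1f}% fewer tokens")
```

The result depends on the passage, so a single sentence may land well above or below the average quoted above.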
Detailed results on representative benchmarks are given in the Performance section below.
## Performance
### Image Benchmark
*Korean benchmarks, with K-Viscuit translated into Korean.
| Category | Benchmarks | A.X 4.0 VL Light | Qwen2.5-VL-7B | InternVL3-8B | VARCO-VISION-2.0-14B | Qwen2.5-VL-32B |
|------------------------|---------------------|------------------|---------------|--------------|----------------------|----------------|
| Document | KoBizDoc* | 89.8 | 84.0 | 73.2 | 83.0 | 88.8 |
| | K-DTCBench* | 90.0 | 86.7 | 83.8 | 80.8 | 91.7 |
| | ChartQA | 79.8 | 80.6 | 79.8 | 78.8 | 81.8 |
| | DocVQA | 94.4 | 95.3 | 92.4 | 91.9 | 94.5 |
| | InfoVQA | 78.5 | 82.7 | 76.2 | 80.0 | 82.7 |
| | SEEDBench2-Plus | 69.7 | 71.2 | 69.7 | 71.9 | 73.3 |
| OCR | OutdoorKorean* | 97.3 | 91.9 | 72.7 | 79.7 | 86.9 |
| | K-Handwriting* | 84.3 | 85.0 | 43.5 | 55.2 | 60.1 |
| | TextVQA | 82.0 | 85.4 | 82.1 | 80.3 | 79.8 |
| Culture | K-Viscuit* | 80.2 | 65.0 | 65.3 | 72.0 | 72.3 |
| Knowledge | KoEduBench* | 58.1 | 53.9 | 53.9 | 39.4 | 52.4 |
| | KoCertBench* | 54.9 | 50.1 | 39.4 | 51.4 | 47.5 |
| | MMMU | 54.1 | 56.3 | 59.4 | 58.3 | 63.6 |
| | ScienceQA | 95.3 | 87.2 | 97.8 | 92.2 | 92.4 |
| General | K-LLaVA-W* | 83.2 | 73.0 | 67.0 | 80.0 | 84.3 |
| | K-SEED* | 76.5 | 76.4 | 76.4 | 76.9 | 77.3 |
| | SEEDBench_IMG | 76.7 | 77.1 | 77.1 | 78.1 | 77.6 |
| Hallucination | HallusionBench | 54.2 | 52.7 | 49.6 | 53.8 | 58.0 |
| IF | MM-IFEval | 53.5 | 51.4 | 51.9 | 50.8 | 59.3 |
The following in-house benchmarks have been established to rigorously assess model performance on Korean vision-language understanding and the comprehension of Korea-specific knowledge domains:
- **KoBizDoc**: A visual question answering (VQA) benchmark designed for understanding Korean business documents.
- **OutdoorKorean**: A benchmark focused on recognizing Korean text in complex outdoor scenes (provided by AIHub).
- **K-Handwriting**: A Korean handwriting recognition dataset comprising various handwritten styles (provided by AIHub).
- **KoEduBench**: A VQA benchmark targeting Korean general academic exams, including GED and CSAT questions, to assess academic reasoning ability.
- **KoCertBench**: A Korean certification exam-based VQA benchmark, covering domains such as civil service, technical licenses, and professional qualifications.
### Text Benchmark
*Korean benchmarks.
| Category | Benchmarks | A.X 4.0 VL Light | Qwen2.5-VL-7B | InternVL3-8B | VARCO-VISION-2.0-14B |
|-----------------------|--------------|------------------|---------------|--------------|----------------------|
| Knowledge | KMMLU* | 60.5 | 45.6 | 50.9 | 58.8 |
| | MMLU | 72.6 | 71.9 | 77.5 | 80.7 |
| Math | HRM8K* | 40.6 | 25.4 | 34.6 | 49.5 |
| | MATH | 56.5 | 61.7 | 65.1 | 71.1 |
| General | Ko-MT-bench* | 68.9 | 51.5 | 59.5 | 75.9 |
| | MT-bench | 72.9 | 73.2 | 69.9 | 76.6 |
| IF | Ko-IFEval* | 71.8 | 55.0 | 46.1 | 57.2 |
| | IFEval | 81.9 | 66.6 | 67.5 | 75.3 |
## 🚀 Quickstart
### With Hugging Face Transformers
- `transformers>=4.49.0` is required to use `skt/A.X-4.0-VL-Light`; the latest release is recommended.
```bash
pip install "transformers>=4.49.0"
```
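To confirm the requirement before loading the model, the installed version can be checked from Python. This is a minimal sketch: `parse_release`, `meets_minimum`, and `check_transformers` are hypothetical helper names, and the parser compares only the leading numeric components, so pre-release suffixes such as `rc1` are ignored.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string: str) -> tuple:
    """Turn '4.49.0' or '4.50.0.dev0' into a tuple of integers such as (4, 49, 0)."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # pad so that '4.49' compares equal to '4.49.0'
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.49.0") -> bool:
    """Return True when the installed release is at least the minimum release."""
    return parse_release(installed) >= parse_release(minimum)


def check_transformers() -> str:
    """Describe whether the installed transformers package satisfies the requirement."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if meets_minimum(installed):
        return f"transformers {installed} satisfies >=4.49.0"
    return f"transformers {installed} is older than 4.49.0; please upgrade"


print(check_transformers())
```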
#### Example Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import requests
from io import BytesIO
model_name = "skt/A.X-4.0-VL-Light"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).to(device='cuda')
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
url = "https://huggingface.co/skt/A.X-4.0-VL-Light/resolve/main/assets/image.png"
# Image source: Korea National Heritage Portal (https://www.heritage.go.kr/unisearch/images/national_treasure/thumb/2021042017434700.JPG)
response = requests.get(url)
response.raise_for_status()
image = Image.open(BytesIO(response.content))
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            # Korean prompt meaning "Describe this image."
            {"type": "text", "text": "이미지에 대해서 설명해줘."},
        ],
    }
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
images=[image],
text=[text],
padding=True,
return_tensors="pt",
).to("cuda")
# Decoding parameters (top_p, temperature, top_k, repetition_penalty) should be tuned depending on the generation task.
generation_kwargs = {
"max_new_tokens": 256,
"top_p": 0.8,
"temperature": 0.5,
"top_k": 20,
"repetition_penalty": 1.05,
"do_sample": True,
}
generated_ids = model.generate(**inputs, **generation_kwargs)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(response[0])
"""
Example output (translated from Korean):
Sungnyemun is National Treasure No. 1, located in Seoul, South Korea, and is a wooden structure built during the Joseon period. The gate is Seoul's southern main gate and shows the traditional Korean architectural style. The two-story gate is topped with a tiled roof whose curves are beautifully rendered. Beneath the gate is an arched entrance, and a wall built of solid stone extends around it. Modern high-rise buildings stand in the background, showing a Seoul where tradition and modernity coexist. Sungnyemun has high historical and cultural value and is a landmark visited by many tourists.
"""
```
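The message structure used above can be wrapped in a small helper so that prompts are easier to swap. This is a minimal sketch: `build_messages` is a hypothetical name, and although it accepts `num_images`, whether the model handles more than one image per turn is not stated here, so the default of one image mirrors the example.

```python
def build_messages(prompt: str, num_images: int = 1) -> list:
    """Build a single-turn user message: image placeholders followed by the text prompt."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    # Each {"type": "image"} entry is a placeholder; the actual images are
    # passed separately to the processor via its `images` argument.
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_messages("Describe this image.")
print(messages)
```

The returned list can be passed to `processor.apply_chat_template` exactly as in the example.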
#### Example for Document Transcription
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import requests
from io import BytesIO
model_name = "skt/A.X-4.0-VL-Light"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).to(device='cuda')
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
url = "https://huggingface.co/skt/A.X-4.0-VL-Light/resolve/main/assets/document.png"
response = requests.get(url)
response.raise_for_status()
image = Image.open(BytesIO(response.content))
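messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            # Korean prompt meaning "What is written in the picture?
            # Show only the written text as the result, without any other explanation."
            {"type": "text", "text": "사진에 무엇이 적혀있나요? 다른 설명 없이 적혀있는 텍스트만 결과로 보여줘."},
        ],
    }
]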
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
images=[image],
text=[text],
padding=True,
return_tensors="pt",
).to("cuda")
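# top_k=1 keeps only the most likely token at each step, so decoding is
# effectively greedy, which suits faithful transcription.
generation_kwargs = {
    "max_new_tokens": 1024,
    "top_p": 0.95,
    "top_k": 1,
    "temperature": 0.7,
    "repetition_penalty": 1.05,
    "do_sample": True,
}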
generated_ids = model.generate(**inputs, **generation_kwargs)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(response[0])
"""
Example output (translated from Korean):
# A.X 4.0: A Korean-Specialized Large Language Model for Enterprises
View English README
SK Telecom released A.X 4.0 (pronounced "A dot X 4.0"), a large language model (LLM) with improved Korean processing capability and enterprise usability, on April 30, 2025. A.X 4.0 was built by further training the open-source model Qwen2.5 on a vast amount of Korean data, and delivers performance optimized for the domestic business environment.
## What makes A.X 4.0 different?
- Outstanding Korean proficiency: Scored 78.3 on KMMLU, a representative Korean proficiency benchmark, outperforming GPT-40 (72.5).
- Deep understanding of Korean culture: Scored 83.5 on CLiCk, a Korean language and culture benchmark, demonstrating deeper understanding than GPT-40 (80.2).
- Efficient token processing: For the same Korean input text, GPT-40 uses about 1.5 times as many tokens as A.X 4.0.
- Large-scale information processing: Can understand and process long documents or conversations of up to 131,072 tokens at once.
- Domain knowledge: Baseline performance was strengthened so that the model can be used in fields that require specialized knowledge, such as coding and manufacturing.
- Deployment options: Offered as a standard model with 72 billion (72B) parameters and a lightweight model with 7 billion (7B) parameters; it can be installed directly on a company's internal servers (on-premises), easing concerns about data security.
## What are the core technologies?
### Korean-specialized tokenizer
A.X 4.0 uses a tokenizer optimized to capture the unique characteristics of Korean. This tokenizer is designed to effectively grasp the diverse expressions and context of Korean. In internal tests, A.X 4.0 used tokens 33.3% more efficiently than GPT-40 for the same Korean sentences.
This brings the following advantages in real-world use.
- Under the same conditions, roughly 1.5 times more Korean information can be processed.
- With fewer tokens, processing costs can be reduced by about 34%.
- It is advantageous where API calls are billed according to token usage.
Especially in enterprise settings that handle long texts, such as document summarization or retrieval-augmented generation (RAG), token efficiency contributes to a large reduction in operating costs.
### Training data composed to improve Korean understanding and generation
The training data used for A.X 4.0 has the following characteristics.
- High-quality Korean material: A large-scale, high-quality dataset was used, including high-quality data extracted from the web, specialized books, and synthetic data.
- Systematic data classification: Datasets were organized by topic so that the model performs well in a balanced way across diverse fields.
- Balanced language distribution: The data consists of 42% Korean, 51% English, and 7% other languages and code, maintaining balance across languages.
This data composition helps the model deeply understand the diverse expressions and even the subtle context of Korean.
"""
```
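The two examples use different decoding settings, and the choice can be made explicit per task. The sketch below is an illustration under stated assumptions, not a recommended configuration: `generation_kwargs_for` and its task names are hypothetical, the "description" values are copied from the first example, and the "transcription" preset substitutes plain greedy decoding (`do_sample=False`) for the second example's sampling with `top_k=1`, which already restricts each step to the single most likely token.

```python
def generation_kwargs_for(task: str, max_new_tokens: int = 256) -> dict:
    """Return keyword arguments for `model.generate` for a named task (hypothetical presets)."""
    if task == "description":
        # Sampling values copied from the Example Usage block.
        return {
            "max_new_tokens": max_new_tokens,
            "do_sample": True,
            "top_p": 0.8,
            "temperature": 0.5,
            "top_k": 20,
            "repetition_penalty": 1.05,
        }
    if task == "transcription":
        # Assumption: greedy decoding stands in for sampling with top_k=1.
        return {
            "max_new_tokens": max_new_tokens,
            "do_sample": False,
            "repetition_penalty": 1.05,
        }
    raise ValueError(f"unknown task: {task!r}")


print(generation_kwargs_for("transcription", max_new_tokens=1024))
```

The returned dictionary can be unpacked into the call as `model.generate(**inputs, **generation_kwargs_for("transcription", 1024))`.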
## License
The `A.X 4.0 VL Light` model is licensed under `Apache License 2.0`.
## Citation
```
@article{SKTAdotX4VLLight,
title={A.X 4.0 VL Light},
author={SKT AI Model Lab},
year={2025},
url={https://huggingface.co/skt/A.X-4.0-VL-Light}
}
```
## Contact
- Business & Partnership Contact: [a.x@sk.com](mailto:a.x@sk.com)