---
license: apache-2.0
license_link: https://huggingface.co/skt/A.X-4.0-VL-Light/blob/main/LICENSE
language:
- en
- ko
pipeline_tag: image-text-to-text
library_name: transformers
model_id: skt/A.X-4.0-VL-Light
developers: SKT AI Model Lab
base_model:
- skt/A.X-4.0-Light
---

# A.X 4.0 VL Light

🤗 Models | 🖥️ Github

## Highlights

**A.X 4.0 VL Light** (pronounced “A dot X”) is a vision-language model (VLM) optimized for Korean vision and language understanding as well as enterprise deployment. Built upon [A.X 4.0 Light](https://huggingface.co/skt/A.X-4.0-Light), A.X 4.0 VL Light has been further trained on diverse multimodal datasets, with a particular focus on large-scale multimodal Korean datasets, to deliver strong performance in domestic business applications.

- **Superior Korean Proficiency in Vision and Language**: The model achieves an average score of 79.4 on Korean image benchmarks, outperforming Qwen2.5-VL-32B (73.4) despite being significantly smaller. On Korean text benchmarks it records an average score of 60.2, comparable to VARCO-VISION-2.0-14B (60.4) at half the model size.
- **Deep Cultural Understanding**: It scores 80.2 on K-Viscuit, a multimodal benchmark designed to evaluate cultural and contextual comprehension in Korean, exceeding Qwen2.5-VL-32B (72.3).
- **Advanced Document Understanding**: It scores 89.8 on KoBizDoc, a benchmark focused on understanding complex document structures, including charts and tables, performing comparably to Qwen2.5-VL-32B (88.8).
- **Efficient Token Usage**: A.X 4.0 VL Light uses approximately 41% fewer text tokens than Qwen2.5-VL for the same Korean input, which lowers processing cost.

Representative benchmark results are listed in the Performance section.
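The token-efficiency highlight above can be made concrete with a little arithmetic. The sketch below is illustrative only: `count_tokens` and `token_reduction` are hypothetical helper names, and the counts in the usage lines are made-up numbers chosen to reproduce a 41% reduction, not measurements.

```python
from typing import Callable, Sequence


def count_tokens(tokenize: Callable[[str], Sequence], text: str) -> int:
    """Count the tokens that a tokenize callable produces for text."""
    return len(tokenize(text))


def token_reduction(baseline_tokens: int, candidate_tokens: int) -> float:
    """Fraction of tokens saved by the candidate relative to the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - candidate_tokens / baseline_tokens


# Made-up counts for one Korean input (assumed values, not measurements).
baseline = 100   # tokens produced by the baseline tokenizer
candidate = 59   # tokens produced by the more compact tokenizer
saving = token_reduction(baseline, candidate)
print(f"{saving:.0%} fewer tokens")
```

With Hugging Face tokenizers, `tokenize` could be something like `lambda s: tok(s, add_special_tokens=False)["input_ids"]` for a tokenizer `tok` loaded with `AutoTokenizer.from_pretrained`; that wiring is an assumption rather than something this model card specifies. Under per-token pricing, the same fraction carries over directly to text-token cost.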

## Performance

### Image Benchmark

\* Korean benchmarks; K-Viscuit was translated into Korean.

| Category | Benchmarks | A.X 4.0 VL Light | Qwen2.5-VL-7B | InternVL3-8B | VARCO-VISION-2.0-14B | Qwen2.5-VL-32B |
|---------------|-----------------|------------------|---------------|--------------|----------------------|----------------|
| Document | KoBizDoc* | 89.8 | 84.0 | 73.2 | 83.0 | 88.8 |
| | K-DTCBench* | 90.0 | 86.7 | 83.8 | 80.8 | 91.7 |
| | ChartQA | 79.8 | 80.6 | 79.8 | 78.8 | 81.8 |
| | DocVQA | 94.4 | 95.3 | 92.4 | 91.9 | 94.5 |
| | InfoVQA | 78.5 | 82.7 | 76.2 | 80.0 | 82.7 |
| | SEEDBench2-Plus | 69.7 | 71.2 | 69.7 | 71.9 | 73.3 |
| OCR | OutdoorKorean* | 97.3 | 91.9 | 72.7 | 79.7 | 86.9 |
| | K-Handwriting* | 84.3 | 85.0 | 43.5 | 55.2 | 60.1 |
| | TextVQA | 82.0 | 85.4 | 82.1 | 80.3 | 79.8 |
| Culture | K-Viscuit* | 80.2 | 65.0 | 65.3 | 72.0 | 72.3 |
| Knowledge | KoEduBench* | 58.1 | 53.9 | 53.9 | 39.4 | 52.4 |
| | KoCertBench* | 54.9 | 50.1 | 39.4 | 51.4 | 47.5 |
| | MMMU | 54.1 | 56.3 | 59.4 | 58.3 | 63.6 |
| | ScienceQA | 95.3 | 87.2 | 97.8 | 92.2 | 92.4 |
| General | K-LLaVA-W* | 83.2 | 73.0 | 67.0 | 80.0 | 84.3 |
| | K-SEED* | 76.5 | 76.4 | 76.4 | 76.9 | 77.3 |
| | SEEDBench_IMG | 76.7 | 77.1 | 77.1 | 78.1 | 77.6 |
| Hallucination | HallusionBench | 54.2 | 52.7 | 49.6 | 53.8 | 58.0 |
| IF | MM-IFEval | 53.5 | 51.4 | 51.9 | 50.8 | 59.3 |

The following in-house benchmarks were established to assess model performance on Korean vision-language understanding and on Korea-specific knowledge domains:

- **KoBizDoc**: A visual question answering (VQA) benchmark designed for understanding Korean business documents.
- **OutdoorKorean**: A benchmark focused on recognizing Korean text in complex outdoor scenes (provided by AIHub).
- **K-Handwriting**: A Korean handwriting recognition dataset comprising various handwritten styles (provided by AIHub).
- **KoEduBench**: A VQA benchmark targeting Korean general academic exams, including GED and CSAT questions, to assess academic reasoning ability.
- **KoCertBench**: A Korean certification exam-based VQA benchmark, covering domains such as civil service, technical licenses, and professional qualifications.

### Text Benchmark

\* Korean benchmarks.

| Category | Benchmarks | A.X 4.0 VL Light | Qwen2.5-VL-7B | InternVL3-8B | VARCO-VISION-2.0-14B |
|-----------|--------------|------------------|---------------|--------------|----------------------|
| Knowledge | KMMLU* | 60.5 | 45.6 | 50.9 | 58.8 |
| | MMLU | 72.6 | 71.9 | 77.5 | 80.7 |
| Math | HRM8K* | 40.6 | 25.4 | 34.6 | 49.5 |
| | MATH | 56.5 | 61.7 | 65.1 | 71.1 |
| General | Ko-MT-bench* | 68.9 | 51.5 | 59.5 | 75.9 |
| | MT-bench | 72.9 | 73.2 | 69.9 | 76.6 |
| IF | Ko-IFEval* | 71.8 | 55.0 | 46.1 | 57.2 |
| | IFEval | 81.9 | 66.6 | 67.5 | 75.3 |

## 🚀 Quickstart

### with HuggingFace Transformers

- `skt/A.X-4.0-VL-Light` requires `transformers>=4.49.0`.

```bash
# Quote the requirement so the shell does not treat ">" as an output redirect.
pip install "transformers>=4.49.0"
```

#### Example Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import requests
from io import BytesIO

model_name = "skt/A.X-4.0-VL-Light"
# Load the weights in bfloat16 and move the model to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.bfloat16
).to(device="cuda")
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

url = "https://huggingface.co/skt/A.X-4.0-VL-Light/resolve/main/assets/image.png"
# Image source: National Heritage Portal (국가유산포털) (https://www.heritage.go.kr/unisearch/images/national_treasure/thumb/2021042017434700.JPG)
response = requests.get(url)
response.raise_for_status()
image = Image.open(BytesIO(response.content))

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            # Korean prompt: "Describe the image."
            {"type": "text", "text": "이미지에 대해서 설명해줘."},
        ],
    }
]

# Render the chat template to a prompt string; tokenization happens in the processor call.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
    images=[image],
    text=[text],
    padding=True,
    return_tensors="pt",
).to("cuda")

# Decoding parameters (top_p, temperature, top_k, repetition_penalty) should be tuned depending on the generation task.
generation_kwargs = {
    "max_new_tokens": 256,
    "top_p": 0.8,
    "temperature": 0.5,
    "top_k": 20,
    "repetition_penalty": 1.05,
    "do_sample": True,
}
generated_ids = model.generate(**inputs, **generation_kwargs)
# Drop the prompt tokens so only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(response[0])
"""
(Sample output, translated from Korean)
Sungnyemun, located in Seoul, South Korea, is National Treasure No. 1, a wooden structure built during the Joseon Dynasty. This gate is the southern gate of Seoul and shows the traditional Korean architectural style. The two-story gate is topped with a tiled roof, and the curves of the roof are beautifully rendered. Below the gate is an arched entrance, and around it runs a fortress wall built of solid stone. Modern high-rise buildings stand in the background, showing Seoul as a place where tradition and modernity coexist. Sungnyemun has high historical and cultural value and is a landmark visited by many tourists.
"""
```

#### Example for Document Transcription

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import requests
from io import BytesIO

model_name = "skt/A.X-4.0-VL-Light"
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.bfloat16
).to(device="cuda")
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

url = "https://huggingface.co/skt/A.X-4.0-VL-Light/resolve/main/assets/document.png"
response = requests.get(url)
response.raise_for_status()
image = Image.open(BytesIO(response.content))

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            # Korean prompt: "What is written in the picture? Show only the
            # written text as the result, with no other explanation."
            {"type": "text", "text": "사진에 무엇이 적혀있나요? 다른 설명 없이 적혀있는 텍스트만 결과로 보여줘."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
    images=[image],
    text=[text],
    padding=True,
    return_tensors="pt",
).to("cuda")

# top_k=1 keeps only the most likely token at each step, so decoding is
# effectively greedy even though do_sample is True.
generation_kwargs = {
    "max_new_tokens": 1024,
    "top_p": 0.95,
    "top_k": 1,
    "temperature": 0.7,
    "repetition_penalty": 1.05,
    "do_sample": True,
}
generated_ids = model.generate(**inputs, **generation_kwargs)
# Drop the prompt tokens so only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(response[0])
"""
(Sample output, translated from Korean)
# A.X 4.0: A Korean-Specialized Large Language Model for Enterprises

View English README

SK Telecom released A.X 4.0 (pronounced 'A dot X 4.0'), a large language model (LLM) with improved Korean processing capability and enterprise usability, on April 30, 2025. A.X 4.0 was built by further training the open-source model Qwen2.5 on a large volume of Korean data, and it delivers performance optimized for the domestic business environment.

## What makes A.X 4.0 different?

- Outstanding Korean proficiency: Scored 78.3 on KMMLU, a leading benchmark for Korean-language ability, outperforming GPT-40 (72.5).
- Deep understanding of Korean culture: Scored 83.5 on CLiCk, a benchmark for Korean language and culture, demonstrating better understanding than GPT-40 (80.2).
- Efficient token processing: For the same Korean text, GPT-40 uses about 1.5 times more tokens than A.X 4.0.
- Large-scale information processing: Can understand and process long documents or conversations of up to 131,072 tokens at once.
- Domain support: Core performance has been strengthened for fields that require specialized knowledge, such as coding and manufacturing.
- Deployment options: Offered as a standard model with 72 billion (72B) parameters and a lightweight model with 7 billion (7B) parameters; it can be installed directly on in-house servers (on-premises), easing data security concerns.

## What are the core technologies?

### Korean-specialized tokenizer

A.X 4.0 uses a tokenizer optimized to capture the distinctive characteristics of Korean. The tokenizer is designed to handle the varied expressions and context of Korean effectively.

In internal tests, A.X 4.0 used tokens 33.3% more efficiently than GPT-40 for the same Korean sentences. In real-world use, this brings the following advantages.

- Under the same conditions, it can process roughly 1.5 times more Korean information.
- Fewer tokens can reduce processing costs by about 34%.
- It is advantageous where API calls are billed by token usage.

Especially in enterprise settings that handle long text, such as document summarization or retrieval-augmented generation (RAG), token efficiency contributes to significantly lower operating costs.

### Training data composition that improves Korean understanding and generation

The training data used for A.X 4.0 has the following characteristics.

- High-quality Korean material: A large-scale, high-quality dataset was used, including high-quality data extracted from the web, specialized books, and synthetic data.
- Systematic data classification: Datasets were organized by topic so that the model performs well in a balanced way across diverse fields.
- Balanced language distribution: The data consists of 42% Korean, 51% English, and 7% other languages and code, maintaining balance across languages.

This data composition helps the model deeply understand the diverse expressions and subtle context of Korean.
"""
```

## License

The `A.X 4.0 VL Light` model is licensed under `Apache License 2.0`.
## Citation

```
@article{SKTAdotX4VLLight,
  title={A.X 4.0 VL Light},
  author={SKT AI Model Lab},
  year={2025},
  url={https://huggingface.co/skt/A.X-4.0-VL-Light}
}
```

## Contact

- Business & Partnership Contact: [a.x@sk.com](mailto:a.x@sk.com)