|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- translation |
|
|
license: gemma |
|
|
language: |
|
|
- en |
|
|
- es |
|
|
- fr |
|
|
- de |
|
|
- pt |
|
|
- ja |
|
|
- ko |
|
|
- zh |
|
|
- ar |
|
|
- ru |
|
|
- hi |
|
|
--- |
|
|
|
|
|
# YanoljaNEXT-Rosetta-4B |
|
|
|
|
|
This model is a fine-tuned version of [`google/gemma-3-4b-pt`](https://huggingface.co/google/gemma-3-4b-pt). Since it is intended solely for text generation, only the `Gemma3ForCausalLM` component of the original multimodal architecture is used.
|
|
|
|
|
Unlike our previous EEVE models, this model does not use an expanded tokenizer.
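
As a quick sanity check, the checkpoint can be loaded with the text-only `Gemma3ForCausalLM` class and its tokenizer compared against the base model. This is a minimal sketch; it assumes a `transformers` release with Gemma 3 support and, for the comparison, access to the gated `google/gemma-3-4b-pt` repository.

```python
import torch
from transformers import AutoTokenizer, Gemma3ForCausalLM

# The repository holds only the text decoder, so it loads as Gemma3ForCausalLM.
model = Gemma3ForCausalLM.from_pretrained(
    "yanolja/YanoljaNEXT-Rosetta-4B", dtype=torch.bfloat16
)

# The tokenizer is inherited unchanged from the base model, so vocabulary sizes match.
tokenizer = AutoTokenizer.from_pretrained("yanolja/YanoljaNEXT-Rosetta-4B")
base_tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-pt")  # gated repo
assert len(tokenizer) == len(base_tokenizer)
```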
|
|
|
|
|
- **Model Name:** `yanolja/YanoljaNEXT-Rosetta-4B` |
|
|
- **Base Model:** `google/gemma-3-4b-pt` |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model is a 4-billion parameter, decoder-only language model built on the Gemma3 architecture and fine-tuned by Yanolja NEXT. It is specifically designed to translate structured data (JSON format) while preserving the original data structure. |
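
For instance, given a nested JSON object, the keys and nesting are preserved and only the values are translated. The pair below is purely illustrative (hand-written, not actual model output):

```python
# Hypothetical input and the expected shape of its Korean translation
# (hand-written illustration, not captured model output).
source = {
    "room_type": "Ocean view suite",
    "amenities": ["Free Wi-Fi", "Breakfast included"],
}
expected = {
    "room_type": "오션 뷰 스위트",
    "amenities": ["무료 와이파이", "조식 포함"],
}
```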
|
|
|
|
|
The model was trained on a multilingual dataset covering the following languages: |
|
|
- English |
|
|
- Spanish |
|
|
- French |
|
|
- German |
|
|
- Portuguese |
|
|
- Japanese |
|
|
- Korean |
|
|
- Chinese |
|
|
- Arabic |
|
|
- Russian |
|
|
- Hindi |
|
|
|
|
|
While optimized for these languages, it may also perform effectively on other languages supported by the base Gemma3 model. |
|
|
|
|
|
## How to use |
|
|
|
|
|
You can use this model with the `transformers` library as follows: |
|
|
|
|
|
```python |
|
|
import json |
|
|
import torch |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
|
|
model_id = "yanolja/YanoljaNEXT-Rosetta-4B" |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_id, |
|
|
dtype=torch.bfloat16, |
|
|
device_map="auto", |
|
|
    max_memory={0: "23GB"},  # optional cap on GPU 0 memory; adjust or remove for your hardware
|
|
) |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
|
|
|
target_language = "Korean" |
|
|
context = { |
|
|
"context": "Simple introduction about a tech company.", |
|
|
"tone": "Informative and helpful", |
|
|
"glossary": { |
|
|
"Yanolja NEXT": "μΌλμλ₯μ€νΈ", |
|
|
"travel industry": "μ¬ν μ°μ
", |
|
|
} |
|
|
} |
|
|
|
|
|
system = [f"Translate the user's text to {target_language}."] |
|
|
for key, value in context.items(): |
|
|
key_pascal = key.capitalize() |
|
|
if isinstance(value, dict): |
|
|
system.append(f"{key_pascal}:") |
|
|
for f, t in value.items(): |
|
|
system.append(f"- {f} -> {t}") |
|
|
else: |
|
|
system.append(f"{key_pascal}: {value}") |
|
|
|
|
|
system.append("Provide the final translation immediately without any other text.") |
|
|
|
|
|
source = { |
|
|
"company_name": "Yanolja NEXT", |
|
|
"description": "Yanolja NEXT is a company that provides cutting-edge " |
|
|
"technology for the global travel industry.", |
|
|
} |
|
|
|
|
|
messages = [ |
|
|
{"role": "system", "content": "\n".join(system)}, |
|
|
{"role": "user", "content": json.dumps(source, ensure_ascii=False)}, |
|
|
] |
|
|
|
|
|
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
|
print(prompt) |
|
|
# <bos><start_of_turn>instruction |
|
|
# Translate the user's text to Korean. |
|
|
# Context: Simple introduction about a tech company. |
|
|
# Tone: Informative and helpful |
|
|
# Glossary: |
|
|
# - Yanolja NEXT -> 야놀자넥스트
|
|
# - travel industry -> 여행 산업
|
|
# Provide the final translation immediately without any other text.<end_of_turn> |
|
|
# <start_of_turn>source |
|
|
# {"company_name": "Yanolja NEXT", "description": "Yanolja NEXT is a company that provides cutting-edge technology for the global travel industry."}<end_of_turn> |
|
|
# <start_of_turn>translation |
|
|
|
|
|
# The chat template output already starts with <bos>, so skip adding special tokens again.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
|
|
input_length = inputs["input_ids"].shape[1] |
|
|
|
|
|
with torch.inference_mode(): |
|
|
outputs = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=64, |
|
|
) |
|
|
|
|
|
generated_tokens = outputs[0][input_length:]  # keep only the newly generated tokens, dropping the prompt
|
|
translation = tokenizer.decode(generated_tokens, skip_special_tokens=True) |
|
|
|
|
|
print(json.dumps(json.loads(translation), indent=2, ensure_ascii=False)) |
|
|
# { |
|
|
# "company_name": "μΌλμλ₯μ€νΈ", |
|
|
# "description": "μΌλμλ₯μ€νΈλ κΈλ‘λ² μ¬ν μ°μ
μ μ΅μ²¨λ¨ κΈ°μ μ μ 곡νλ νμ¬μ
λλ€." |
|
|
# } |
|
|
``` |
|
|
|
|
|
The model outputs the final translation in JSON format when appropriate, or plain text for simple translations. |
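
Since the response can be either a JSON object or plain text, wrapping the parse in a small guard keeps downstream code simple. A minimal sketch, assuming the `translation` string produced by the example above:

```python
import json

def parse_translation(translation: str):
    """Return a dict if the model emitted JSON, otherwise the raw translated text."""
    try:
        return json.loads(translation)
    except json.JSONDecodeError:
        return translation.strip()

result = parse_translation(translation)
```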
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Training Data |
|
|
The translation datasets were compiled from several sources, including: |
|
|
- [AI Hub](https://aihub.or.kr/) |
|
|
- [Europarl](https://www.statmt.org/europarl/) |
|
|
|
|
|
The model was fine-tuned on multilingual translation data to optimize performance across the supported language pairs. |
|
|
|
|
|
| Language | Portion (%) | Language | Portion (%) | |
|
|
|----------|-------------|----------|-------------| |
|
|
| Korean | 24.2 | French | 2.8 | |
|
|
| English | 16.2 | German | 2.5 | |
|
|
| Japanese | 5.8 | Russian | 2.4 | |
|
|
| Italian | 5.3 | Arabic | 2.3 | |
|
|
| Chinese | 4.4 | Other | 30.2 | |
|
|
| Spanish | 3.9 | | | |
|
|
|
|
|
## Performance |
|
|
|
|
|
### Translation Quality Benchmarks |
|
|
|
|
|
The following chrF++ scores on the WMT24++ benchmark demonstrate the model's competitive performance compared to other state-of-the-art translation models on English-to-Korean translation:
|
|
|
|
|
| Model | chrF++ Score (WMT24++) |
|
|
|------------------------------------|--------------| |
|
|
| yanolja/YanoljaNEXT-Rosetta-12B | 34.75 | |
|
|
| yanolja/YanoljaNEXT-Rosetta-20B | 33.87 | |
|
|
| google/gemini-2.0-flash-001 | 33.81 | |
|
|
| openai/gpt-oss-120b | 31.51 | |
|
|
| **yanolja/YanoljaNEXT-Rosetta-4B** | **31.31** | |
|
|
| openai/gpt-4.1-nano | 31.15 | |
|
|
| Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 | 31.02 | |
|
|
| openai/gpt-oss-20b | 30.56 | |
|
|
| google/gemma-3-27b-it | 30.05 | |
|
|
| google/gemma-3-4b-pt | 27.53 | |
|
|
|
|
|
YanoljaNEXT-Rosetta-4B achieves competitive translation quality while maintaining the efficiency of a 4B parameter model. |
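
For context, chrF++ extends chrF with word n-grams up to order 2 in addition to character n-grams. Scores of this kind can be computed with the `sacrebleu` library roughly as follows; this is a sketch with placeholder sentences, and loading the WMT24++ test set is not shown.

```python
from sacrebleu.metrics import CHRF

# word_order=2 is what turns chrF into chrF++.
chrf = CHRF(word_order=2)

hypotheses = ["모델이 생성한 한국어 번역 문장입니다."]    # system outputs (placeholders)
references = [["사람이 작성한 참조 번역 문장입니다."]]    # one reference stream, aligned with the hypotheses

print(chrf.corpus_score(hypotheses, references))
```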
|
|
|
|
|
## Intended Uses & Limitations |
|
|
|
|
|
This model is intended for translating structured data (JSON format) while preserving the original structure. It is particularly well-suited for tasks such as localizing product catalogs, translating hotel reviews, or handling any other structured content that requires accurate translation. |
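
For batch workloads such as catalog localization, the usage example above can be wrapped in a small helper. This is a sketch only; it reuses the `model` and `tokenizer` objects loaded in the How to use section, and the catalog fields are hypothetical.

```python
import json
import torch

def translate_record(record, target_language, context=None):
    """Translate one JSON record while keeping its structure.

    Assumes `model` and `tokenizer` are already loaded as in the example above;
    `context` is an optional flat dict of strings (see the glossary handling
    above for nested entries).
    """
    system = [f"Translate the user's text to {target_language}."]
    for key, value in (context or {}).items():
        system.append(f"{key.capitalize()}: {value}")
    system.append("Provide the final translation immediately without any other text.")

    messages = [
        {"role": "system", "content": "\n".join(system)},
        {"role": "user", "content": json.dumps(record, ensure_ascii=False)},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=512)

    text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text.strip()

# Hypothetical catalog entry to localize.
item = {"name": "Deluxe double room", "description": "Spacious room with a city view."}
print(translate_record(item, "Korean"))
```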
|
|
|
|
|
### Limitations |
|
|
The model's primary focus is on JSON data. Performance on unstructured text or other data formats may vary. |
|
|
|
|
|
### License |
|
|
This model is released under the Gemma license, inherited from its base model, [`google/gemma-3-4b-pt`](https://huggingface.co/google/gemma-3-4b-pt). Please consult the official [Gemma license terms](https://ai.google.dev/gemma/terms) for detailed usage guidelines. |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please consider citing: |
|
|
|
|
|
``` |
|
|
@misc{yanolja2025yanoljanextrosetta, |
|
|
author = {Yanolja NEXT Co., Ltd.}, |
|
|
title = {YanoljaNEXT-Rosetta-4B}, |
|
|
year = {2025}, |
|
|
publisher = {Hugging Face}, |
|
|
journal = {Hugging Face repository}, |
|
|
  howpublished = {\url{https://huggingface.co/yanolja/YanoljaNEXT-Rosetta-4B}}
|
|
} |
|
|
``` |
|
|
|
|
|
## References |
|
|
|
|
|
This work utilizes several models and datasets. We would like to acknowledge the original authors for their valuable contributions to the field. |
|
|
|
|
|
``` |
|
|
@misc{gemma3, |
|
|
author = {Google}, |
|
|
title = {Gemma 3}, |
|
|
  year = {2025},
|
|
publisher = {Google DeepMind}, |
|
|
  howpublished = {\url{https://deepmind.google/models/gemma/gemma-3/}}
|
|
} |
|
|
|
|
|
@misc{aihub, |
|
|
author = {National Information Society Agency (NIA)}, |
|
|
title = {AI-Hub: AI Integrated Platform}, |
|
|
year = {2025}, |
|
|
publisher = {National Information Society Agency}, |
|
|
  howpublished = {\url{https://aihub.or.kr}}
|
|
} |
|
|
|
|
|
@article{europarl, |
|
|
author = {Koehn, Philipp}, |
|
|
title = {Europarl: A Parallel Corpus for Statistical Machine Translation}, |
|
|
journal = {MT Summit}, |
|
|
year = {2005}, |
|
|
pages = {79--86} |
|
|
} |
|
|
``` |
|
|
|