---
# (Required) YAML metadata for the Hugging Face model card
# TODO: Adjust language, tags, datasets, and metrics to fit your setup.
language: ko
license: other # (choose a license: apache-2.0, mit, etc.)
tags:
- text-classification
- korean
- emotion-analysis
- klue
- roberta
pipeline_tag: text-classification
datasets:
- custom-korean-emotion-dataset # (specify the dataset name)
metrics:
- accuracy
- f1
model-index:
- name: 6-Class Korean Emotion Analysis
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Custom Test Set
      type: custom-korean-emotion-dataset
      config: default
      split: test
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.7905
    - name: F1 (Weighted)
      type: f1
      value: 0.7910
    - name: Loss
      type: loss
      value: 0.6943
---
# 6-Class Korean Emotion Analysis Model (v2)
This model is a text classification (sequence classification) model built on [klue/roberta-base](https://huggingface.co/klue/roberta-base) that classifies the emotion of Korean text into six classes.
**Key features:**
* **6-class classification:** Classifies text into six emotions: '기쁨' (joy), '당황' (embarrassment), '분노' (anger), '불안' (anxiety), '상처' (hurt), and '슬픔' (sadness).
* **Imbalanced data handling:** Manually chosen **class weights** are applied to `CrossEntropyLoss` to mitigate class imbalance and improve detection of minority classes (기쁨, 당황, etc.).
## 🗂 Model Labels
The model output corresponds to six emotion classes. The labels and their IDs are as follows.
| Label (emotion) | ID |
| :--- | :--: |
| `기쁨` (joy) | 0 |
| `당황` (embarrassment) | 1 |
| `분노` (anger) | 2 |
| `불안` (anxiety) | 3 |
| `상처` (hurt) | 4 |
| `슬픔` (sadness) | 5 |
*(Note: the label order follows `['기쁨', '당황', '분노', '불안', '상처', '슬픔']`, which was generated automatically from the training dataset (`df_train`).)*
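
The table above can be expressed as a pair of lookup dictionaries, which is handy when converting raw class IDs back to emotion names. This is a minimal sketch: the names `LABELS`, `id2label`, and `label2id` are illustrative, and whether the published `config.json` carries the same mapping is an assumption to verify against the actual model.

```python
# Label order copied from the table above (assumed to match training).
LABELS = ["기쁨", "당황", "분노", "불안", "상처", "슬픔"]

# Build both lookup directions from the single ordered list.
id2label = {i: label for i, label in enumerate(LABELS)}
label2id = {label: i for i, label in id2label.items()}

print(id2label[2])       # emotion name for class ID 2
print(label2id["슬픔"])  # class ID for the label 슬픔
```

Deriving both dictionaries from one list keeps the two directions from drifting apart if the label set ever changes.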
## 🚀 How to Use
You can test the model easily with the `pipeline` API from the `transformers` library.
```python
from transformers import pipeline

# TODO: Replace '[YOUR-USERNAME]/[YOUR-MODEL-NAME]' with your own Hugging Face model path.
model_name = "[YOUR-USERNAME]/[YOUR-MODEL-NAME]"
classifier = pipeline("text-classification", model=model_name)

# Example sentences (Korean input, since the model is trained on Korean text)
texts = [
    "오늘 너무 기분 좋은 일이 생겼어!",    # "Something really great happened today!"
    "이걸 어떻게 해야 할지 모르겠네...",   # "I have no idea what to do about this..."
    "진짜 화가 머리 끝까지 난다.",         # "I'm absolutely furious."
    "내일 발표인데 너무 떨리고 불안해.",   # "I present tomorrow and I'm so nervous and anxious."
]

# Run prediction; with top_k=1 each result is a one-element list of {label, score}
results = classifier(texts, top_k=1)
for text, result in zip(texts, results):
    print(f"Input: {text}")
    print(f"Emotion: {result[0]['label']} (Score: {result[0]['score']:.4f})")
    print("-" * 20)
```
## ⚙️ Training Details
The model was trained with the `train_final_v2.py` script.
### 1. Dataset
* `training-label.json`: original training data
* `test.json`: original test data
* Data split (v2 strategy):
    * Train set (90%): 90% of `training-label.json` (stratified split)
    * Validation set (10%): 10% of `training-label.json` (stratified split)
    * Test set (final evaluation): `test.json` (separate data)
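
To illustrate what a stratified 90/10 split does, here is a minimal standard-library sketch. It is not the code from `train_final_v2.py`: the function name `stratified_split`, the seed, and the toy label list are all hypothetical. In practice, `train_test_split(..., stratify=labels)` from scikit-learn is a common way to get the same effect.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_ratio=0.1, seed=42):
    """Return (train_idx, val_idx) so each class keeps roughly the same share in both sets."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out val_ratio of each class, but at least one example per class.
        n_val = max(1, round(len(indices) * val_ratio))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels standing in for the label column of training-label.json.
toy_labels = ["분노"] * 50 + ["슬픔"] * 40 + ["기쁨"] * 10
train_idx, val_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx))
```

Splitting per class matters here because a plain random split could leave a minority class such as 기쁨 almost absent from the validation set.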
### 2. Key Techniques
* **Class weights:** To address class imbalance, training used a `CustomTrainer` together with the `weight` parameter of `CrossEntropyLoss`. Weights were assigned manually per class to raise the importance of minority classes.
    * Applied weights: `[6.00, 4.50, 0.85, 1.80, 1.80, 0.92]`
    * Weight order (labels): `['기쁨', '당황', '분노', '불안', '상처', '슬픔']`
* **Scheduler:** A `cosine` learning-rate scheduler was applied.
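
The effect of the class weights can be seen in a small pure-Python stand-in for a weighted cross-entropy loss. This sketch is not the `CustomTrainer` from the training script. The helper `weighted_cross_entropy` is hypothetical and follows the weighted-mean form that PyTorch's `CrossEntropyLoss` uses when given `weight` with the default mean reduction: each example's loss is scaled by the weight of its true class, and the sum is divided by the total of those weights.

```python
import math

# Weights from this section, in label order: 기쁨, 당황, 분노, 불안, 상처, 슬픔
CLASS_WEIGHTS = [6.00, 4.50, 0.85, 1.80, 1.80, 0.92]


def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean of per-example cross-entropy, normalised by the summed target weights."""
    total_loss = 0.0
    total_weight = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable negative log-softmax for the target class.
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        nll = log_sum_exp - row[target]
        total_loss += weights[target] * nll
        total_weight += weights[target]
    return total_loss / total_weight


# Two equally uncertain (uniform) predictions: one for 기쁨 (ID 0), one for 분노 (ID 2).
uniform = [0.0] * 6
batch_loss = weighted_cross_entropy([uniform, uniform], [0, 2], CLASS_WEIGHTS)

# Fraction of the batch loss (and hence the gradient signal) owed to the 기쁨 example.
share_of_joy = CLASS_WEIGHTS[0] / (CLASS_WEIGHTS[0] + CLASS_WEIGHTS[2])
print(f"{batch_loss:.4f} {share_of_joy:.2f}")
```

With these weights, a misclassified 기쁨 example contributes far more to the loss than an equally wrong 분노 example, which is how the weighting pushes the model toward the minority classes.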
### 3. Key Hyperparameters
| Hyperparameter | Value |
| :--- | :--- |
| `base_model_name` | `klue/roberta-base` |
| `num_train_epochs` | 10 |
| `learning_rate` | 1e-5 |
| `train_batch_size` | 16 |
| `eval_batch_size` | 64 |
| `weight_decay` | 0.01 |
| `max_length` | 128 |
| `warmup_ratio` | 0.1 |
| `lr_scheduler_type` | `cosine` |
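
To make the interaction between `learning_rate`, `warmup_ratio`, and `lr_scheduler_type` concrete, the sketch below computes a cosine schedule with linear warmup in plain Python. The function `cosine_lr` and the value of `total_steps` are hypothetical. The real step count depends on the dataset size, which is not stated here, and the library's own scheduler may round the warmup step count differently.

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; depends on dataset size, batch size 16, and 10 epochs
for step in (0, 50, 100, 550, 1000):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

The rate climbs linearly to `1e-5` over the first 10% of steps, then falls along a half cosine, reaching half its peak midway through the decay phase and zero at the final step.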