Lily LLM - Context Management and LoRA/QLoRA System

๐Ÿ“‹ Overview

This adds short-term memory (a context window) and LoRA/QLoRA support to the Lily LLM project, building a more powerful and efficient AI conversation system.

๐Ÿš€ Key Features

1. Context Management System

๐Ÿ”ง Core Features

  • Conversation history management: stores user-AI exchanges in order
  • Memory optimization: automatically compresses the context when the configured limit is reached
  • Session management: manages multiple conversation sessions independently
  • Context search: finds specific information in stored conversations

๐Ÿ“Š Context Strategies

  • Sliding Window: keeps the most recent messages first (see the sketch below)
  • Priority Keep: prioritizes the system prompt and recent messages
  • Circular Buffer: manages memory in a circular (ring-buffer) fashion
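
For intuition, a minimal illustrative sketch of sliding-window compression in plain Python; this is not the project's actual implementation, just the idea:

def sliding_window(messages, max_turns=20):
    """Keep the system prompt plus only the most recent turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]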

๐Ÿ’พ ๋ฐ์ดํ„ฐ ๊ด€๋ฆฌ

  • Export/import: save and restore the context in JSON format
  • Metadata support: attach extra information to each message
  • Statistics: provides memory usage and efficiency metrics

2. LoRA/QLoRA Support System

๐Ÿ”— LoRA (Low-Rank Adaptation)

  • Efficient fine-tuning: trains only a small set of adapter parameters instead of the full model (see the sketch below)
  • Memory savings: sharply reduces GPU memory usage
  • Fast adapter switching: swap between task-specific adapters quickly
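
For reference, a minimal sketch of attaching a LoRA adapter with the Hugging Face peft library (the model path is a placeholder):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a base model (placeholder path)
model = AutoModelForCausalLM.from_pretrained("/path/to/base-model")

# Inject low-rank matrices into the attention projections; only these
# adapter weights receive gradients during training.
config = LoraConfig(r=16, lora_alpha=32,
                    target_modules=["q_proj", "v_proj"],
                    lora_dropout=0.1, task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all parameters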

๐Ÿ“ˆ QLoRA (Quantized LoRA)

  • 4-bit quantization: further shrinks model size and memory footprint (see the sketch below)
  • High-quality training: training quality holds up even on the quantized model
  • Hardware efficiency: makes training feasible on lower-end GPUs
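
A minimal sketch of loading a 4-bit quantized base model with transformers and bitsandbytes; a LoRA adapter is then attached on top as above (the model path is a placeholder):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization with bfloat16 compute, the usual QLoRA recipe
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/base-model",
    quantization_config=bnb_config,
    device_map="auto",  # requires accelerate
)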

๐ŸŽฏ Supported Models

  • Causal Language Models: GPT, LLaMA, Kanana, etc. (see the task-type mapping below)
  • Sequence-to-Sequence: T5, BART, etc.
  • Classification Models: BERT, RoBERTa, etc.
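
These families correspond to peft's standard task types; a minimal sketch of the mapping used when building a LoraConfig:

from peft import TaskType

# Model family -> peft task type passed as LoraConfig(task_type=...)
TASK_TYPES = {
    "causal_lm": TaskType.CAUSAL_LM,     # GPT, LLaMA, Kanana
    "seq2seq": TaskType.SEQ_2_SEQ_LM,    # T5, BART
    "classification": TaskType.SEQ_CLS,  # BERT, RoBERTa
}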

๐Ÿ› ๏ธ ์„ค์น˜ ๋ฐ ์„ค์ •

1. Install Dependencies

pip install -r requirements.txt

2. Install Additional Packages

# LoRA/QLoRA support (quote the specifiers so the shell does not treat > as redirection)
pip install "peft>=0.7.0"
pip install "bitsandbytes>=0.41.0"

# Optional: packages for better performance
pip install accelerate
pip install "transformers[torch]"

3. Set Environment Variables

# Select the GPU to use
export CUDA_VISIBLE_DEVICES=0

# Reduce CUDA memory fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
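
To confirm the GPU is visible after setting these variables:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"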

๐Ÿ“– Usage

1. Context Management

Basic Usage

import requests

# Set the system prompt
response = requests.post("http://localhost:8001/context/set-system-prompt",
                        data={"prompt": "You are a Korean-language AI assistant."})

# Add a user message
response = requests.post("http://localhost:8001/context/add-message",
                        data={
                            "role": "user",
                            "content": "Hello!",
                            "metadata": '{"session_id": "session_1"}'
                        })

# Add an assistant response
response = requests.post("http://localhost:8001/context/add-message",
                        data={
                            "role": "assistant",
                            "content": "Hello! How can I help you?",
                            "metadata": '{"session_id": "session_1"}'
                        })

# Retrieve the context
response = requests.get("http://localhost:8001/context/get")
context = response.json()["context"]

Advanced Features

# Search the context
response = requests.get("http://localhost:8001/context/search?query=weather&max_results=5")

# Export the context to a JSON file
response = requests.post("http://localhost:8001/context/export",
                        data={"file_path": "my_context.json"})

# Import the context from a JSON file
response = requests.post("http://localhost:8001/context/import",
                        data={"file_path": "my_context.json"})

# Context statistics
response = requests.get("http://localhost:8001/context/summary")

2. Using LoRA/QLoRA

๊ธฐ๋ณธ ๋ชจ๋ธ ๋กœ๋“œ

# ๊ธฐ๋ณธ ๋ชจ๋ธ ๋กœ๋“œ
response = requests.post("http://localhost:8001/lora/load-base-model",
                        data={
                            "model_path": "/path/to/your/model",
                            "model_type": "causal_lm"
                        })

LoRA ์„ค์ • ์ƒ์„ฑ

# LoRA ์„ค์ • ์ƒ์„ฑ
response = requests.post("http://localhost:8001/lora/create-config",
                        data={
                            "r": 16,                    # LoRA ๋žญํฌ
                            "lora_alpha": 32,           # LoRA ์•ŒํŒŒ
                            "target_modules": "q_proj,v_proj,k_proj,o_proj",  # ํƒ€๊ฒŸ ๋ชจ๋“ˆ
                            "lora_dropout": 0.1,        # ๋“œ๋กญ์•„์›ƒ
                            "bias": "none",             # ๋ฐ”์ด์–ด์Šค ์ฒ˜๋ฆฌ
                            "task_type": "CAUSAL_LM"    # ์ž‘์—… ํƒ€์ž…
                        })

์–ด๋Œ‘ํ„ฐ ์ ์šฉ ๋ฐ ์‚ฌ์šฉ

# Apply the LoRA adapter
response = requests.post("http://localhost:8001/lora/apply",
                        data={"adapter_name": "my_adapter"})

# Generate text with the LoRA model
response = requests.post("http://localhost:8001/lora/generate",
                        data={
                            "prompt": "Hello!",
                            "max_length": 100,
                            "temperature": 0.7
                        })

# Save the adapter
response = requests.post("http://localhost:8001/lora/save-adapter",
                        data={"adapter_name": "my_adapter"})

3. Combined Usage (Context + LoRA)

# Generate text using the stored conversation context
response = requests.post("http://localhost:8001/generate",
                        data={
                            "prompt": "Please answer with reference to our previous conversation.",
                            "use_context": "true",
                            "session_id": "session_1"
                        })
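
A sketch of a complete conversational turn built on this endpoint: store the user message, generate with context, then store the reply. The "text" field in the response JSON is an assumption for illustration, not a confirmed part of the API:

# Hypothetical full turn; the response field name "text" is assumed
requests.post("http://localhost:8001/context/add-message",
              data={"role": "user", "content": "What did we discuss earlier?",
                    "metadata": '{"session_id": "session_1"}'})
response = requests.post("http://localhost:8001/generate",
                         data={"prompt": "What did we discuss earlier?",
                               "use_context": "true",
                               "session_id": "session_1"})
reply = response.json().get("text", "")  # field name assumed
requests.post("http://localhost:8001/context/add-message",
              data={"role": "assistant", "content": reply,
                    "metadata": '{"session_id": "session_1"}'})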

๐Ÿ” API ์—”๋“œํฌ์ธํŠธ

์ปจํ…์ŠคํŠธ ๊ด€๋ฆฌ

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /context/set-system-prompt | Set the system prompt |
| POST | /context/add-message | Add a message |
| GET | /context/get | Retrieve the context |
| GET | /context/summary | Summarize the context |
| POST | /context/clear | Clear the context |
| DELETE | /context/message/{message_id} | Remove a message |
| PUT | /context/message/{message_id} | Edit a message |
| GET | /context/search | Search the context |
| POST | /context/export | Export the context |
| POST | /context/import | Import the context |

LoRA Management

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /lora/load-base-model | Load the base model |
| POST | /lora/create-config | Create a LoRA configuration |
| POST | /lora/apply | Apply a LoRA adapter |
| POST | /lora/load-adapter | Load a saved adapter |
| POST | /lora/save-adapter | Save an adapter |
| GET | /lora/adapters | List adapters |
| GET | /lora/stats | Adapter statistics |
| POST | /lora/switch | Switch adapters |
| POST | /lora/unload | Unload an adapter |
| POST | /lora/generate | Generate with the LoRA model |
| POST | /lora/merge | Merge an adapter into the base model |

๐Ÿ“Š Performance Optimization

1. Memory Management

  • Context compression: automatic memory optimization
  • Token limit: configurable maximum token count
  • Session isolation: an independent memory space per session

2. LoRA ์ตœ์ ํ™”

  • ๋žญํฌ ์กฐ์ •: r ๊ฐ’์œผ๋กœ ์ •ํ™•๋„์™€ ํšจ์œจ์„ฑ ๊ท ํ˜•
  • ํƒ€๊ฒŸ ๋ชจ๋“ˆ ์„ ํƒ: ํ•„์š”ํ•œ ๋ ˆ์ด์–ด๋งŒ ์„ ํƒ์  ํ›ˆ๋ จ
  • ๊ทธ๋ž˜๋””์–ธํŠธ ์ฒดํฌํฌ์ธํŒ…: ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ๊ฐ์†Œ

3. Hardware Optimization

  • GPU memory: efficient memory allocation
  • CPU threads: multithreading tuning (see the sketch below)
  • Batch processing: optimized bulk data handling
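
A generic PyTorch-level sketch of the knobs involved; the specific values are illustrative, not tuned recommendations:

import torch

# Cap intra-op CPU parallelism (illustrative value)
torch.set_num_threads(4)

# Inspect free GPU memory to size batches appropriately
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"free GPU memory: {free / 1e9:.1f} of {total / 1e9:.1f} GB")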

๐Ÿงช Testing

ํ…Œ์ŠคํŠธ ์Šคํฌ๋ฆฝํŠธ ์‹คํ–‰

python test_context_lora.py

์ˆ˜๋™ ํ…Œ์ŠคํŠธ

# Start the server
python run_server.py

# Test from another terminal
curl -X POST "http://localhost:8001/context/set-system-prompt" \
     -d "prompt=You are a Korean-language AI assistant."

curl -X GET "http://localhost:8001/context/summary"

๐Ÿ”ง Configuration Options

์ปจํ…์ŠคํŠธ ๊ด€๋ฆฌ์ž ์„ค์ •

# Settings available when initializing ContextManager
# (ContextManager is provided by the Lily LLM codebase)
context_manager = ContextManager(
    max_tokens=4000,        # maximum token count
    max_turns=20,           # maximum number of conversation turns
    strategy="sliding_window"  # compression strategy
)

LoRA Settings

# Example LoRA configuration
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # LoRA rank (higher = more accurate, more memory)
    lora_alpha=32,           # LoRA alpha (scaling factor)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # target modules
    lora_dropout=0.1,        # dropout rate
    bias="none",             # bias handling
    task_type="CAUSAL_LM"    # task type
)

๐Ÿšจ ์ฃผ์˜์‚ฌํ•ญ

1. Memory Management

  • Context length: overly long contexts can cause out-of-memory errors
  • LoRA rank: a higher rank improves accuracy but also increases memory usage (see the arithmetic below)

2. Performance Considerations

  • GPU memory: LoRA training requires sufficient GPU memory
  • CPU usage: context compression consumes CPU resources

3. Compatibility

  • ๋ชจ๋ธ ํƒ€์ž…: ๋ชจ๋“  ๋ชจ๋ธ์ด LoRA๋ฅผ ์ง€์›ํ•˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • ๋ฒ„์ „ ํ˜ธํ™˜์„ฑ: PEFT์™€ Transformers ๋ฒ„์ „ ํ˜ธํ™˜์„ฑ์„ ํ™•์ธํ•˜์„ธ์š”

๐Ÿ“š Additional Resources

Example Code

  • test_context_lora.py: integrated test script
  • examples/: additional usage examples

๐Ÿค ๊ธฐ์—ฌํ•˜๊ธฐ

Bug reports, feature suggestions, and code contributions are welcome!

  1. Open an issue
  2. Fork the repo and create a branch
  3. Commit your changes
  4. Open a pull request

๐Ÿ“„ ๋ผ์ด์„ ์Šค

์ด ํ”„๋กœ์ ํŠธ๋Š” MIT ๋ผ์ด์„ ์Šค ํ•˜์— ๋ฐฐํฌ๋ฉ๋‹ˆ๋‹ค.


Lily LLM - ๋” ์Šค๋งˆํŠธํ•œ AI ๋Œ€ํ™”๋ฅผ ์œ„ํ•œ ์ปจํ…์ŠคํŠธ ๊ด€๋ฆฌ ๋ฐ LoRA ์‹œ์Šคํ…œ ๐Ÿš€