Model Card for yf591/llama3-8b-rag-sft-lora-v1

This repository contains a LoRA (Low-Rank Adaptation) adapter fine-tuned from the tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1 model. Fine-tuning used a RAG (Retrieval-Augmented Generation) approach: passages retrieved from a PDF of "Reinforcement Learning An Introduction Second edition" by Richard S. Sutton and Andrew G. Barto were supplied as context for the questions in a Japanese instruction dataset (ichikara-instruction-003-001-1.json).

Model Details

Model Description

This model is a LoRA adapter designed to enhance the tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1 model's ability to answer questions by utilizing context retrieved from the "Reinforcement Learning An Introduction Second edition" book. It was fine-tuned using QLoRA on a dataset where questions were paired with relevant context from the book and expected answers from the ichikara-instruction dataset.

Developed by: yf591 (Sasaki_San)
Shared by: yf591
Model type: Llama3-based Causal Language Model with LoRA adapter.
Language(s) (NLP): Japanese (primarily for instruction following and RAG)
License: The LoRA adapter is expected to follow the license of the base model (tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1, which is built on Llama 3.1 and so falls under the Llama 3.1 Community License). Please refer to the base model's license for the authoritative terms.
Finetuned from model: tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1

Model Sources

Repository: https://huggingface.co/yf591/llama3-8b-rag-sft-lora-v1
Base Model Repository: https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1
RAG Document Source (Example): "Reinforcement Learning An Introduction Second edition" by Richard S. Sutton and Andrew G. Barto
Instruction Dataset (Example): ichikara-instruction-003-001-1.json

Uses

Direct Use

This LoRA adapter is intended to be loaded onto the tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1 base model. Within a RAG pipeline, it can then be used for question answering where the questions relate to the content of "Reinforcement Learning An Introduction Second edition" or similar technical documents.

Intended use:

Answering questions based on provided context from "Reinforcement Learning An Introduction Second edition."
Generating explanations for concepts found within the book.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig  # BitsAndBytesConfig is used by the optional quantization config below
from peft import PeftModel

base_model_id = "tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1"
adapter_id = "yf591/llama3-8b-rag-sft-lora-v1"

# Optional: Configure quantization for loading the base model
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype="float16", # Or torch.bfloat16 if available and preferred
#     bnb_4bit_use_double_quant=True,
# )

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    # quantization_config=bnb_config, # Quantization settings if needed
    device_map="auto",
    trust_remote_code=True
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_id)
# For potentially faster inference (at the cost of higher memory during merging):
# model = model.merge_and_unload() # Note: this requires sufficient memory and modifies the model in place

# Example RAG prompt structure
question = "Please explain Markov Decision Processes, policies, state-value functions, and action-value functions in reinforcement learning."
# In a real RAG setup, 'context' would be retrieved from Richard S. Sutton and Andrew G. Barto's book.
context = """
A Markov Decision Process (MDP) formally describes an environment for reinforcement learning.
Key components include states (S), actions (A), a transition probability function P(s'|s,a),
and a reward function R(s,a,s'). A policy (π) is a mapping from states to probabilities
of selecting each possible action. The state-value function Vπ(s) is the expected return
starting from state s and following policy π. The action-value function Qπ(s,a) is the
expected return starting from state s, taking action a, and then following policy π.
... (text retrieved from "Reinforcement Learning: An Introduction" by Sutton & Barto) ...
"""

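# The prompt stays in Japanese because the adapter was fine-tuned on a Japanese instruction dataset.
# Section markers: 指示 = instruction, コンテキスト = context, 質問 = question, 回答 = answer.
# The instruction says: answer the question using the context below, rely only on facts
# contained in the context, and do not guess at information that is not there.
prompt = f"""### 指示: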
以下のコンテキスト情報を使用して、質問に対する回答を生成してください。
コンテキスト情報に含まれる事実のみを使用し、含まれていない情報は推測しないでください。

### コンテキスト:
{context}

### 質問:
{question}

### 回答:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate text
# Note: For instruct models, generation parameters like temperature, top_p, etc.,
# might be already well-tuned in their default generation_config.
# Adjust max_new_tokens as needed for desired response length.
outputs = model.generate(**inputs, max_new_tokens=300, temperature=0.7, top_p=0.9, do_sample=True)
response_full = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Extract only the answer part
answer_part = response_full.split("### 回答:")[-1].strip()
print(answer_part)

Downstream Use

This adapter could potentially be further fine-tuned on more specific Q&A datasets related to machine learning or other technical domains.

Out-of-Scope Use

Generating content unrelated to the RAG context provided or the general knowledge of the base Llama3 model.
Providing real-time, up-to-the-minute information, as its knowledge is based on the RAG documents and the pre-training of the base model.
Use for safety-critical applications without extensive further testing and safeguards.
Generating harmful, biased, or unethical content.

Bias, Risks, and Limitations

The model's responses are heavily dependent on the quality and content of the RAG context. If the retrieved context is irrelevant or incorrect, the answer may also be.
The fine-tuning dataset (ichikara-instruction) might have its own biases which could be reflected in the model's responses.
The base model (Llama-3.1-Swallow-8B-Instruct-v0.1) has its own inherent biases, risks, and limitations. Please refer to its model card.
The model was fine-tuned on a limited set of RAG contexts (primarily Richard S. Sutton and Andrew G. Barto's book). Its performance on other RAG documents may vary.
Knowledge cutoff is tied to the RAG documents and the base model's training data.

Recommendations

Always use this model as part of a RAG pipeline where relevant context is provided.
Critically evaluate the model's outputs, especially when used for important decision-making.
Be aware of potential biases and ensure outputs are reviewed for fairness and accuracy.

How to Get Started with the Model

See the "Direct Use" section above for a code example on how to load and use the LoRA adapter with the base model. You will need to set up a RAG pipeline to retrieve context dynamically for new questions.
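The retrieval step mentioned above can be sketched as follows. This is a minimal, dependency-free illustration and not the pipeline used for this model: `toy_embed`, `retrieve`, and `build_prompt` are hypothetical names, the character-bigram embedding stands in for intfloat/multilingual-e5-large, and the brute-force ranking stands in for the FAISS index described under Preprocessing. Only the prompt template is taken from the Direct Use example.

```python
import math
from collections import Counter

# Prompt layout copied from the Direct Use example.
PROMPT_TEMPLATE = (
    "### 指示:\n"
    "以下のコンテキスト情報を使用して、質問に対する回答を生成してください。\n"
    "コンテキスト情報に含まれる事実のみを使用し、含まれていない情報は推測しないでください。\n\n"
    "### コンテキスト:\n{context}\n\n"
    "### 質問:\n{question}\n\n"
    "### 回答:\n"
)


def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: character-bigram counts."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=3):
    """Return the k chunks most similar to the question (brute-force search)."""
    q_vec = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, toy_embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, k=3):
    """Retrieve context for the question and fill the prompt template."""
    context = "\n\n".join(retrieve(question, chunks, k))
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

The string returned by `build_prompt` can be passed to the tokenizer in place of the hand-written `prompt` from the Direct Use example. In a real setup, `toy_embed` would be replaced by the embedding model and `retrieve` by a vector-index lookup.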

Training Details

Training Data

Instruction Data: ichikara-instruction-003-001-1.json (a Japanese instruction-following dataset).
RAG Context Source: "Reinforcement Learning An Introduction Second edition" by Richard S. Sutton and Andrew G. Barto. Context was retrieved dynamically for each instruction in the ichikara-instruction dataset, using the instruction text as the query.
Combined Data Format for Fine-tuning: Each training sample was formatted as a prompt including the instruction (question), RAG-retrieved context, and the expected answer from the ichikara-instruction dataset.
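As a rough illustration of the combined format, the sketch below assembles one training string from an instruction, its retrieved context, and the expected answer. It assumes the training prompt used the same layout as the inference prompt in the Direct Use example, which the card does not state explicitly. `format_training_sample` is a hypothetical helper, and the end-of-sequence token is passed in by the caller rather than assumed.

```python
def format_training_sample(instruction, context, answer, eos_token):
    """Build one supervised fine-tuning example (hypothetical helper).

    Returns the prompt and the full text separately, so that a trainer could
    mask the prompt tokens and compute the loss on the answer only if desired.
    """
    prompt = (
        "### 指示:\n"
        "以下のコンテキスト情報を使用して、質問に対する回答を生成してください。\n"
        "コンテキスト情報に含まれる事実のみを使用し、含まれていない情報は推測しないでください。\n\n"
        f"### コンテキスト:\n{context}\n\n"
        f"### 質問:\n{instruction}\n\n"
        "### 回答:\n"
    )
    # Appending the EOS token teaches the model where an answer ends.
    return {"prompt": prompt, "text": prompt + answer + eos_token}
```

In practice `eos_token` would typically come from `tokenizer.eos_token` of the base model's tokenizer.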

Training Procedure

The model was fine-tuned using QLoRA (4-bit NormalFloat quantization with LoRA). The training was performed on a Google Colab Pro A100 GPU.
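For reference, the standard LoRA formulation that QLoRA builds on can be written as below. This is the textbook form under the default alpha/r scaling, not something confirmed from the training script. The frozen base weight W_0 is stored in 4-bit NF4 and dequantized to the compute dtype on the fly, while only the low-rank factors A and B are trained.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad
A \in \mathbb{R}^{r \times d_{\text{in}}}
```

With the values listed under Training Hyperparameters (r = 8, alpha = 16), the scaling factor alpha / r equals 2.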

Preprocessing

RAG documents (Richard S. Sutton and Andrew G. Barto's PDF) were chunked and embedded using intfloat/multilingual-e5-large.
A FAISS index was created for efficient context retrieval.
Training data was tokenized using the tokenizer from tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1 with a max_length of 2048 (512 in runs where it was reduced, such as the timing example under Speeds, Sizes, Times).
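The chunking step can be sketched as below. The card does not state the chunk size, overlap, or splitter that was used, so `chunk_text`, `as_e5_passages`, and the default values are illustrative assumptions. The "passage: " prefix follows the input convention documented for the E5 embedding models; whether the original script applied it is not known.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip windows that contain only whitespace
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def as_e5_passages(chunks):
    """Prefix each chunk the way E5 models expect documents to be marked."""
    return ["passage: " + chunk for chunk in chunks]
```

The prefixed chunks would then be embedded and added to the vector index. Queries are marked with "query: " instead under the same E5 convention.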

Training Hyperparameters

(Values are the defaults from the Colab training script; any value tuned for a particular run may differ.)

LoRA r: 8
LoRA alpha: 16
LoRA dropout: 0.05
Target Modules (LoRA): q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Learning Rate: 2e-4 (script default DEFAULT_LEARNING_RATE)
Batch Size (per device): 2 (script default DEFAULT_PER_DEVICE_BATCH_SIZE)
Gradient Accumulation Steps: 8 (script default DEFAULT_GRADIENT_ACCUMULATION_STEPS)
Number of Epochs: 1 (max_epochs, or the script default DEFAULT_NUM_TRAIN_EPOCHS, as used for the final training)
LR Scheduler: Cosine (script default DEFAULT_LR_SCHEDULER_TYPE)
Warmup Ratio: 0.03 (script default DEFAULT_WARMUP_RATIO)
Optimizer: AdamW
Quantization: 4-bit (BitsAndBytesConfig: nf4, float16 compute dtype, double quant)
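To make the interaction between these settings concrete, the sketch below derives a few quantities from them. It is plain arithmetic over the listed values, not the training script: `TrainConfig` and the helper functions are hypothetical, a single GPU is assumed, and the sample count of 1500 is the approximate figure quoted under Speeds, Sizes, Times. Exact step rounding can differ between trainer versions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    lora_r: int = 8
    lora_alpha: int = 16
    per_device_batch_size: int = 2
    gradient_accumulation_steps: int = 8
    num_epochs: int = 1
    warmup_ratio: float = 0.03
    num_gpus: int = 1  # assumption: one A100


def effective_batch_size(cfg):
    """Samples consumed per optimizer step."""
    return cfg.per_device_batch_size * cfg.gradient_accumulation_steps * cfg.num_gpus


def optimizer_steps(cfg, num_samples):
    """Approximate number of optimizer steps over all epochs."""
    return math.ceil(num_samples / effective_batch_size(cfg)) * cfg.num_epochs


def warmup_steps(cfg, num_samples):
    """Steps spent in learning-rate warmup before the cosine decay."""
    return math.ceil(cfg.warmup_ratio * optimizer_steps(cfg, num_samples))


def lora_scaling(cfg):
    """Factor applied to the low-rank update under standard LoRA scaling."""
    return cfg.lora_alpha / cfg.lora_r


cfg = TrainConfig()
print(effective_batch_size(cfg))   # 16
print(optimizer_steps(cfg, 1500))  # 94
print(warmup_steps(cfg, 1500))     # 3
print(lora_scaling(cfg))           # 2.0
```

Under these assumptions a one-epoch run is short (about 94 optimizer steps), so the 3% warmup covers only the first few updates.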

Speeds, Sizes, Times

Training time varied with dataset size and hyperparameters (for example, approximately 0.8 hours for ~1500 samples on an A100 with max_length=512).

Evaluation

No formal evaluation metrics were computed as part of the training script. Model performance was assessed qualitatively through RAG-based question answering in a Gradio UI.

Results

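(Sample output, translated from Japanese.)

### Question:
Please explain Markov decision processes, policies, state-value functions, and action-value functions in reinforcement learning.

### Answer:
In reinforcement learning, the aim is for an agent, the entity that learns, to make decisions in a situation that changes dynamically through its interaction with the environment. Such decision-making always requires some form of "evaluation." The important concepts used for that evaluation are the Markov decision process (MDP), the policy, the state-value function, and the action-value function. All of these concepts are very important in reinforcement learning, and understanding them leads to deeper learning.

Let us go through them one by one from the beginning.

* Markov decision process: This problem setting can be described as follows. Consider the state S_t that the environment currently holds at a time step t. The agent then selects one action from the set of actions A_t available at that time. Then A_t occurs according to the probability distribution P(A_t | S_t), and the next state and the reward are determined. The task of an MDP is to predict things such as what happened when each action was taken in each state.
* State-value function: What matters is what can be obtained as the expected result from a given state, and when it is maximized. The "state-value function" was what expressed this.
* Policy: From a slightly different perspective than the topics so far, isn't there someone who uses a "policy" to make clear how to walk in the right direction? Having a "policy" makes it easier to move toward the destination. Also, without a policy, happening to arrive at the same place is pointless. At the same time as noticing that the element of "action" is also included here, isn't it just that "means" and "ends" have been swapped?

Since reinforcement learning is considered from the standpoint of finding an optimal control method toward the final goal, the term used in this case is "controllable."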


--- Reference documents ---
1. Source: /content/Reinforcement Learning An Introduction Second edition.pdf
   Excerpt: 74 CHAPTER 3. FINITE MARKOV DECISION PROCESSES Q*(s, driver) Vputt s a n d green !1 s a n d !2!2 !3 !4 !1 !5 !6 !4 !3 !3 !2 !4 s a n d green !1 s a n d !2 !3 !2 0 0 !" !" vputt q*(s,driver) Figure 3.6...
2. Source: /content/Reinforcement Learning An Introduction Second edition.pdf
   Excerpt: because it has never received any relevant sensations. In short, we don’t fault an agent for not knowing something that matters, but only for having known something and then forgotten it! What we woul...
3. Source: /content/Reinforcement Learning An Introduction Second edition.pdf
   Excerpt: π(a|s) probability of taking action a in state s under stochastic policy π p(s′,r|s,a) probability of transitioning to state s′, with reward r, from s,a vπ(s) value of state s under policy π (expected...

Summary

In qualitative testing, the model appears to answer questions from the provided RAG context (Richard S. Sutton and Andrew G. Barto's "Reinforcement Learning An Introduction Second edition") better than the base model without RAG or this fine-tuning. No quantitative comparison was run.

Technical Specifications

Model Architecture and Objective

Architecture: Llama3 with LoRA adapters.
Objective: Causal language modeling, fine-tuned for instruction following with RAG context.
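The size of the adapter can be estimated from the LoRA rank and target modules listed under Training Hyperparameters. The sketch below assumes the base model keeps the standard Llama 3.1 8B layer shapes (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128). These shapes are an assumption about the base checkpoint and are not taken from this card.

```python
R = 8                  # LoRA rank from the hyperparameter list
HIDDEN = 4096          # assumed hidden size (Llama 3.1 8B)
INTERMEDIATE = 14336   # assumed MLP intermediate size
KV_DIM = 8 * 128       # assumed 8 key/value heads x head dimension 128
NUM_LAYERS = 32        # assumed number of transformer layers

# (input dim, output dim) of each linear layer targeted by LoRA
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, r):
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 20,971,520
```

Under these assumptions the adapter holds roughly 21 million trainable parameters, a small fraction of the 8B-parameter base model.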

Compute Infrastructure

Hardware

NVIDIA A100 (via Google Colab Pro)

Software

Python 3.11.12
PyTorch
Transformers
PEFT
LangChain
FAISS
Datasets
Gradio

More Information

This model was developed as part of a project to explore RAG-enhanced fine-tuning of Japanese LLMs.

Model Card Authors

yf591

Framework versions

Transformers 4.51.3
PEFT 0.15.2
Datasets 3.6.0
Accelerate 1.6.0
PyTorch 2.6.0+cu124
LangChain 0.3.24
