Model Card for yf591/llama3-8b-rag-sft-lora-v1
This repository contains a LoRA (Low-Rank Adaptation) adapter fine-tuned from the tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1 model.
Fine-tuning used a RAG (Retrieval-Augmented Generation) setup: context retrieved from the PDF of "Reinforcement Learning An Introduction Second edition" by Richard S. Sutton and Andrew G. Barto was paired with questions and answers from a Japanese instruction dataset (ichikara-instruction-003-001-1.json).
Model Details
Model Description
This model is a LoRA adapter designed to enhance the tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1 model's ability to answer questions by utilizing context retrieved from the "Reinforcement Learning An Introduction Second edition" book. It was fine-tuned using QLoRA on a dataset where questions were paired with relevant context from the book and expected answers from the ichikara-instruction dataset.
- Developed by: yf591 (Sasaki_San)
- Shared by: yf591
- Model type: Llama3-based Causal Language Model with LoRA adapter.
- Language(s) (NLP): Japanese (primarily for instruction following and RAG)
- License: The LoRA adapter itself is likely to follow the license of the base model (tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1, which uses the Llama 3 Community License). Please refer to the base model's license.
- Finetuned from model: tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1
Model Sources
- Repository: https://huggingface.co/yf591/llama3-8b-rag-sft-lora-v1
- Base Model Repository: https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1
- RAG Document Source (Example): "Reinforcement Learning An Introduction Second edition" by Richard S. Sutton and Andrew G. Barto
- Instruction Dataset (Example): ichikara-instruction-003-001-1.json
Uses
Direct Use
This LoRA adapter is intended to be loaded onto the tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1 base model.
It can then be used for question answering tasks where the questions are related to the content of "Reinforcement Learning An Introduction Second edition" or similar technical documents, utilizing a RAG pipeline.
Intended use:
- Answering questions based on provided context from "Reinforcement Learning An Introduction Second edition."
- Generating explanations for concepts found within the book.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig  # BitsAndBytesConfig is only needed for the optional 4-bit loading below
from peft import PeftModel
base_model_id = "tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1"
adapter_id = "yf591/llama3-8b-rag-sft-lora-v1"
# Optional: Configure quantization for loading the base model
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype="float16",  # Or torch.bfloat16 if available and preferred
#     bnb_4bit_use_double_quant=True,
# )
# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    # quantization_config=bnb_config,  # Quantization settings if needed
    device_map="auto",
    trust_remote_code=True,
)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
if tokenizer.pad_token is None:
    # Fall back to the EOS token when the tokenizer defines no padding token
    tokenizer.pad_token = tokenizer.eos_token
# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_id)
# For potentially faster inference (at the cost of higher memory during merging):
# model = model.merge_and_unload() # Note: this requires sufficient memory and modifies the model in place
# Example RAG prompt structure
question = "Please explain Markov Decision Processes, policies, state-value functions, and action-value functions in reinforcement learning."
# In a real RAG setup, 'context' would be retrieved from Richard S. Sutton and Andrew G. Barto's book.
context = """
A Markov Decision Process (MDP) formally describes an environment for reinforcement learning.
Key components include states (S), actions (A), a transition probability function P(s'|s,a),
and a reward function R(s,a,s'). A policy (π) is a mapping from states to probabilities
of selecting each possible action. The state-value function Vπ(s) is the expected return
starting from state s and following policy π. The action-value function Qπ(s,a) is the
expected return starting from state s, taking action a, and then following policy π.
... (text retrieved from "Reinforcement Learning: An Introduction" by Sutton & Barto) ...
"""
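# The prompt stays in Japanese because the adapter was fine-tuned on a Japanese instruction dataset.
# Sections: 指示 = instruction ("Use the context below to answer the question; use only facts
# contained in the context and do not guess"), コンテキスト = context, 質問 = question, 回答 = answer.
prompt = f"""### 指示: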
以下のコンテキスト情報を使用して、質問に対する回答を生成してください。
コンテキスト情報に含まれる事実のみを使用し、含まれていない情報は推測しないでください。
### コンテキスト:
{context}
### 質問:
{question}
### 回答:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate text
# Note: For instruct models, generation parameters like temperature, top_p, etc.,
# might be already well-tuned in their default generation_config.
# Adjust max_new_tokens as needed for desired response length.
outputs = model.generate(**inputs, max_new_tokens=300, temperature=0.7, top_p=0.9, do_sample=True)
response_full = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Extract only the answer part
answer_part = response_full.split("### 回答:")[-1].strip()
print(answer_part)
Downstream Use
This adapter could potentially be further fine-tuned on more specific Q&A datasets related to machine learning or other technical domains.
Out-of-Scope Use
- Generating content unrelated to the RAG context provided or the general knowledge of the base Llama3 model.
- Providing real-time, up-to-the-minute information, as its knowledge is based on the RAG documents and the pre-training of the base model.
- Use for safety-critical applications without extensive further testing and safeguards.
- Generating harmful, biased, or unethical content.
Bias, Risks, and Limitations
- The model's responses are heavily dependent on the quality and content of the RAG context. If the retrieved context is irrelevant or incorrect, the answer may also be.
- The fine-tuning dataset (ichikara-instruction) might have its own biases which could be reflected in the model's responses.
- The base model (Llama-3.1-Swallow-8B-Instruct-v0.1) has its own inherent biases, risks, and limitations. Please refer to its model card.
- The model was fine-tuned on a limited set of RAG contexts (primarily Richard S. Sutton and Andrew G. Barto's book). Its performance on other RAG documents may vary.
- Knowledge cutoff is tied to the RAG documents and the base model's training data.
Recommendations
- Always use this model as part of a RAG pipeline where relevant context is provided.
- Critically evaluate the model's outputs, especially when used for important decision-making.
- Be aware of potential biases and ensure outputs are reviewed for fairness and accuracy.
How to Get Started with the Model
See the "Direct Use" section above for a code example on how to load and use the LoRA adapter with the base model. You will need to set up a RAG pipeline to retrieve context dynamically for new questions.
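As a rough illustration of the retrieval step, the sketch below ranks text chunks against a question by cosine similarity and keeps the top k. The `embed` function is a toy character-bigram hashing stand-in, an assumption made so the example runs without any downloads; the pipeline described in this card used intfloat/multilingual-e5-large embeddings with a FAISS index instead. The names `embed` and `retrieve` and the sample chunks are hypothetical.

```python
import math
import zlib


def embed(text: str, dim: int = 256) -> list[float]:
    """Toy embedding: hashed character bigrams, L2-normalised (stand-in for a real encoder)."""
    vec = [0.0] * dim
    lowered = text.lower()
    for i in range(len(lowered) - 1):
        bucket = zlib.crc32(lowered[i:i + 2].encode("utf-8")) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (cosine similarity)."""
    q = embed(question)
    scored = [(sum(a * b for a, b in zip(q, embed(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


# Hypothetical chunks standing in for passages extracted from the book
chunks = [
    "A policy is a mapping from states to probabilities of selecting each action.",
    "The state-value function is the expected return starting from a state.",
    "Tile coding is a form of coarse coding for multi-dimensional continuous spaces.",
]
top_chunks = retrieve("What is a policy?", chunks, k=2)
context = "\n".join(top_chunks)  # this string would fill the context slot of the prompt
```

In a real setup the brute-force scoring loop would be replaced by a vector index lookup, but the contract is the same: question in, a few context passages out.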
Training Details
Training Data
- Instruction Data: ichikara-instruction-003-001-1.json (a Japanese instruction-following dataset).
- RAG Context Source: "Reinforcement Learning An Introduction Second edition" by Richard S. Sutton and Andrew G. Barto. Context was retrieved dynamically for each instruction in the ichikara-instruction dataset based on the instruction text.
- Combined Data Format for Fine-tuning: Each training sample was formatted as a prompt including the instruction (question), RAG-retrieved context, and the expected answer from the ichikara-instruction dataset.
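One way to picture a single training sample is the sketch below, which joins the instruction, the retrieved context, and the reference answer into one string. It assumes the training prompt used the same section headers as the inference prompt in the Direct Use section; the card does not confirm the exact training template, and `build_training_sample` and the sample values are hypothetical.

```python
def build_training_sample(instruction: str, context: str, answer: str) -> str:
    """Format one sample as prompt plus expected answer (assumed template)."""
    return (
        "### 指示:\n"  # instruction header
        "以下のコンテキスト情報を使用して、質問に対する回答を生成してください。\n"  # "answer using the context below"
        "### コンテキスト:\n"  # context header
        f"{context.strip()}\n"
        "### 質問:\n"  # question header
        f"{instruction.strip()}\n"
        "### 回答:\n"  # answer header
        f"{answer.strip()}"
    )


sample = build_training_sample(
    instruction="方策とは何ですか?",  # "What is a policy?"
    context="A policy is a mapping from states to probabilities of selecting each possible action.",
    answer="方策とは、各状態で各行動を選択する確率を定める写像です。",  # hypothetical reference answer
)
```

Keeping the answer after the final header means the same split on "### 回答:" used at inference time also recovers the target text from a training sample.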
Training Procedure
The model was fine-tuned using QLoRA (4-bit NormalFloat quantization with LoRA). The training was performed on a Google Colab Pro A100 GPU.
Preprocessing
- RAG documents (Richard S. Sutton and Andrew G. Barto's PDF) were chunked and embedded using intfloat/multilingual-e5-large.
- A FAISS index was created for efficient context retrieval.
- Training data was tokenized using the tokenizer from tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1 with a max_length of 2048 (or 512, depending on the run configuration).
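The chunking step can be sketched as fixed-size character windows with overlap, each prefixed with "passage: " as the E5 model family expects for documents (queries take "query: "). The chunk size and overlap below are illustrative assumptions; the card does not state the values or the splitter actually used, and `chunk_text` and `as_e5_passages` are hypothetical helpers.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def as_e5_passages(chunks: list[str]) -> list[str]:
    """Add the document-side prefix used by E5 embedding models."""
    return [f"passage: {c}" for c in chunks]


# Placeholder for text extracted from the PDF
document = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(document, chunk_size=500, overlap=50)
passages = as_e5_passages(chunks)
```

The overlap keeps a sentence that straddles a window boundary visible in both neighbouring chunks, at the cost of a slightly larger index.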
Training Hyperparameters
These values are the training script's defaults; where a DEFAULT_* constant is named, the final run may have used a different setting.
- LoRA r: 8
- LoRA alpha: 16
- LoRA dropout: 0.05
- Target Modules (LoRA): q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Learning Rate: 2e-4 (or the value from DEFAULT_LEARNING_RATE)
- Batch Size (per device): 2 (or DEFAULT_PER_DEVICE_BATCH_SIZE)
- Gradient Accumulation Steps: 8 (or DEFAULT_GRADIENT_ACCUMULATION_STEPS)
- Number of Epochs: 1 (value of max_epochs or DEFAULT_NUM_TRAIN_EPOCHS used for the final training)
- LR Scheduler: Cosine (or DEFAULT_LR_SCHEDULER_TYPE)
- Warmup Ratio: 0.03 (or DEFAULT_WARMUP_RATIO)
- Optimizer: AdamW
- Quantization: 4-bit (BitsAndBytesConfig: nf4, float16 compute dtype, double quant)
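A quick arithmetic check shows what these settings imply for the length of a run. It assumes a single GPU and the ~1500-sample dataset size given as an example under Speeds, Sizes, Times; both are assumptions for illustration, and the exact step counts depend on how the trainer rounds partial batches.

```python
import math

per_device_batch_size = 2
gradient_accumulation_steps = 8
num_epochs = 1
warmup_ratio = 0.03
num_gpus = 1        # assumption: a single A100
num_samples = 1500  # assumption: illustrative dataset size

# Samples contributing to each optimizer update
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
# Forward/backward passes per epoch, then optimizer updates after accumulation
batches_per_epoch = math.ceil(num_samples / (per_device_batch_size * num_gpus))
optimizer_steps = math.ceil(batches_per_epoch / gradient_accumulation_steps) * num_epochs
# Updates spent ramping the learning rate up before the cosine decay starts
warmup_steps = math.ceil(optimizer_steps * warmup_ratio)

print(effective_batch_size, optimizer_steps, warmup_steps)  # 16 94 3
```

With so few optimizer updates, the warmup covers only a handful of steps, so almost the entire run sits on the cosine decay.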
Speeds, Sizes, Times
- Training time varied depending on dataset size and hyperparameters (e.g., approx. 0.8 hours for ~1500 samples on A100 with max_length=512).
Evaluation
No formal evaluation metrics were computed as part of the training script. Model performance was assessed qualitatively through RAG-based question answering in a Gradio UI.
Results
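### Question:
Please explain Markov decision processes, policies, state-value functions, and action-value functions in reinforcement learning.
### Answer:
In reinforcement learning, the aim is for the agent being trained to make decisions in a situation that changes dynamically through its interaction with the environment. Such decision-making always requires some form of "evaluation." The important concepts used for that evaluation are the Markov decision process, the policy, the state-value function, and the action-value function. All of these concepts are very important in reinforcement learning, and understanding them leads to deeper learning.
Let us go through them one by one from the beginning.
* Markov decision process: This problem setting can be described as follows. Consider the state S_t that the environment is currently in at some time step t. The agent then selects one action from the set of actions A_t it can take at that time. A_t then occurs according to the probability distribution P(A_t | S_t), and the next state and the reward are determined. The task of an MDP is to predict what happens when each action is taken in each state.
* State-value function: What matters is what can be obtained as the expected outcome from a given state, and when that is maximized. The "state-value function" was the thing that expressed this.
* Policy: This is a slightly different perspective from the topics so far, but isn't there anyone who uses a "policy" to make clear how to walk in the right direction? With a "policy," it becomes easier to move toward the destination. Also, without a policy, there is no point in happening to arrive at the same place. On noticing that an element of "action" is included here as well, isn't it just that "means" and "ends" have been swapped?
Reinforcement learning is framed from the standpoint of finding an optimal control law toward a final goal, so the word used in this case is "controllable."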
--- Reference documents ---
1. Source: /content/Reinforcement Learning An Introduction Second edition.pdf
Excerpt: 74 CHAPTER 3. FINITE MARKOV DECISION PROCESSES Q*(s, driver) Vputt s a n d green !1 s a n d !2!2 !3 !4 !1 !5 !6 !4 !3 !3 !2 !4 s a n d green !1 s a n d !2 !3 !2 0 0 !" !" vputt q*(s,driver) Figure 3.6...
2. Source: /content/Reinforcement Learning An Introduction Second edition.pdf
Excerpt: because it has never received any relevant sensations. In short, we don’t fault an agent for not knowing something that matters, but only for having known something and then forgotten it! What we woul...
3. Source: /content/Reinforcement Learning An Introduction Second edition.pdf
Excerpt: π(a|s) probability of taking action a in state s under stochastic policy π p(s′,r|s,a) probability of transitioning to state s′, with reward r, from s,a vπ(s) value of state s under policy π (expected...
Summary
In qualitative testing, the model appeared to answer questions grounded in the provided RAG context from Richard S. Sutton and Andrew G. Barto's "Reinforcement Learning An Introduction Second edition" better than the base model without RAG or this specific fine-tuning. No quantitative comparison was run.
Technical Specifications
Model Architecture and Objective
- Architecture: Llama3 with LoRA adapters.
- Objective: Causal language modeling, fine-tuned for instruction following with RAG context.
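For a sense of adapter size, the sketch below counts the trainable LoRA parameters implied by rank 8 on the seven target modules listed under Training Hyperparameters. The layer dimensions are those of the standard Llama 3.1 8B architecture (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128); they are assumed here rather than read from the adapter's config.

```python
r = 8
hidden, intermediate, kv_dim, num_layers = 4096, 14336, 8 * 128, 32

# (in_features, out_features) of each targeted linear layer
module_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# LoRA adds A (r x in) and B (out x r) per module: r * (in + out) parameters
per_layer = sum(r * (i + o) for i, o in module_shapes.values())
total = per_layer * num_layers
print(f"{total:,} trainable LoRA parameters")  # 20,971,520 trainable LoRA parameters
```

Under these assumptions the adapter holds roughly 21 million trainable parameters, a small fraction of the 8B frozen base weights.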
Compute Infrastructure
Hardware
- NVIDIA A100 (via Google Colab Pro)
Software
- Python 3.11.12
- PyTorch
- Transformers
- PEFT
- LangChain
- FAISS
- Datasets
- Gradio
More Information
This model was developed as part of a project to explore RAG-enhanced fine-tuning of Japanese LLMs.
Model Card Authors
yf591
Framework versions
- Transformers 4.51.3
- PEFT 0.15.2
- Datasets 3.6.0
- Accelerate 1.6.0
- PyTorch 2.6.0+cu124
- LangChain 0.3.24