brunoyun committed
Commit c6e2115 · verified · 1 Parent(s): 6b29e21

Update README.md

Files changed (1): README.md (+161 -1)

---
license: llama3.1
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
tags:
- argumentation
- argument-mining
---

# Model Information

This model was fine-tuned from the meta-llama/Llama-3.1-8B-Instruct LLM for the task of Claim Detection.

- **Developed by:** Henri Savigny
- **Funded by:** Université Claude Bernard Lyon 1 - Project AMELIA

### Model Sources

- **GitHub repository:** TBC
- **Paper:** TBC

## How to Get Started with the Model

The model should be used with a temperature of 1.5 and min_p sampling of 0.1.

### Using [Unsloth](https://unsloth.ai/)

```python
from unsloth import FastLanguageModel
from transformers import TextStreamer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='brunoyun/Llama-3.1-Amelia-CD-8B-v1',
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=False,
    gpu_memory_utilization=0.6,
)

FastLanguageModel.for_inference(model)

messages = [
    {'role': 'system', 'content': 'You are an expert in argumentation. Your task is to determine whether the given [SENTENCE] is a Claim or Non-claim. Utilize the [TOPIC] and the [FULL TEXT] as context to support your decision\nYour answer must be in the following format with only Claim or Non-claim in the answer section:\n<|ANSWER|><answer><|ANSWER|>.'},
    {'role': 'user', 'content': '[TOPIC]: Should you stay away from online dating\n[SENTENCE]: Based on 2013 data from the National Academy of Sciences, they also discovered that marriages created online were less likely to break up within the first year, while such partners reported a higher degree of satisfaction, too.\n[FULL TEXT]: Based on 2013 data from the National Academy of Sciences, they also discovered that marriages created online were less likely to break up within the first year, while such partners reported a higher degree of satisfaction, too.\n'},
]

txt_streamer = TextStreamer(tokenizer, skip_prompt=True)

# Tokenize the chat-formatted prompt and move it to the GPU
txt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to('cuda')

# Generate with the recommended sampling settings (temperature=1.5, min_p=0.1)
_ = model.generate(
    txt,
    streamer=txt_streamer,
    max_new_tokens=128,
    do_sample=True,
    temperature=1.5,
    min_p=0.1,
    pad_token_id=tokenizer.eos_token_id,
)
```
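
The model is trained to reply with the label wrapped in `<|ANSWER|>` markers (e.g. `<|ANSWER|>Claim<|ANSWER|>`). If you capture the generated text instead of streaming it, a small helper like the one below can extract the label; this is an illustrative sketch, not part of the released code, and it assumes the output follows the format requested in the system prompt.

```python
import re

def extract_label(generated_text):
    """Return 'Claim' or 'Non-claim' if the answer markers are present, else None."""
    match = re.search(r'<\|ANSWER\|>(.*?)<\|ANSWER\|>', generated_text, flags=re.DOTALL)
    if match is None:
        return None
    answer = match.group(1).strip()
    return answer if answer in {'Claim', 'Non-claim'} else None
```

For example, `output = model.generate(txt, max_new_tokens=128, do_sample=True, temperature=1.5, min_p=0.1)` followed by `extract_label(tokenizer.decode(output[0][txt.shape[1]:]))` returns the predicted label for the prompt above.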

### Using LM Studio

To use the GGUF model with LM Studio, set the system prompt, temperature, and min_p sampling as follows:
- System prompt: ```You are an expert in argumentation. Your task is to determine whether the given [SENTENCE] is a Claim or Non-claim. Utilize the [TOPIC] and the [FULL TEXT] as context to support your decision\nYour answer must be in the following format with only Claim or Non-claim in the answer section:\n<|ANSWER|><answer><|ANSWER|>.```
- Temperature: 1.5
- min_p sampling: 0.1

or use the following preset:

```json
{
  "name": "Llama 3 V2",
  "inference_params": {
    "input_prefix": "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n",
    "input_suffix": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "pre_prompt": "You are an expert in argumentation. Your task is to determine whether the given [SENTENCE] is a Claim or Non-claim. Utilize the [TOPIC] and the [FULL TEXT] as context to support your decision\nYour answer must be in the following format with only Claim or Non-claim in the answer section:\n<|ANSWER|><answer><|ANSWER|>.",
    "pre_prompt_prefix": "<|start_header_id|>system<|end_header_id|>\n\n",
    "pre_prompt_suffix": "",
    "antiprompt": [
      "<|start_header_id|>",
      "<|eot_id|>"
    ]
  }
}
```

## Training Details

### Training Data

This model was trained on 4,000 elements sampled from the following datasets:
- IAM [paper](https://arxiv.org/pdf/2203.12257)
- IBM Claim [paper](https://aclanthology.org/C18-1176.pdf)
- IBM Argument [paper](https://arxiv.org/pdf/2010.09459)

The sample used for training can be accessed from the GitHub repository.

### Training Procedure

We used LoRA with the Unsloth library.

#### Training Hyperparameters

- **Training regime:** Model loaded in 4-bit by Unsloth, LoRA r=16, LoRA alpha=16, batch_size=32, 2 epochs. The full training code can be viewed in the GitHub repository; a minimal sketch of the setup is shown below.
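
The following is a minimal sketch of a training setup consistent with the hyperparameters above (4-bit loading, LoRA r=16, alpha=16, effective batch size 32, 2 epochs). The dataset path, target modules, learning rate, and other arguments are illustrative assumptions; the script in the GitHub repository is authoritative, and exact keyword arguments vary slightly across unsloth/trl versions.

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the base model in 4-bit, as described in the training regime
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='meta-llama/Llama-3.1-8B-Instruct',
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters with r=16 and alpha=16 (target modules are assumed)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj',
                    'gate_proj', 'up_proj', 'down_proj'],
)

# 'train.jsonl' with a 'text' field stands in for the released training sample
dataset = load_dataset('json', data_files='train.jsonl', split='train')

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field='text',
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,  # 8 * 4 = effective batch size of 32
        num_train_epochs=2,
        learning_rate=2e-4,             # assumed, not stated in the card
        output_dir='outputs',
    ),
)
trainer.train()
```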

## Evaluation

### Testing Data & Metrics

#### Testing Data

This model was tested on 800 elements sampled from the following datasets:
- IAM [paper](https://arxiv.org/pdf/2203.12257)
- IBM Claim [paper](https://aclanthology.org/C18-1176.pdf)
- IBM Argument [paper](https://arxiv.org/pdf/2010.09459)

The sample used for testing can be accessed from the GitHub repository.

#### Metrics

To evaluate the argument mining task, we used standard classification metrics: $F_1$ score, precision, and recall.
For the fallacy detection task, where a single sentence $s$ has a set of fallacies identified as true labels $F_s$, we adapted the precision and the recall.
Given the sampled testing dataset $T$ with 800 elements (see previous section), we build the multiset $T'$ in which each sentence is repeated as many times as it has true labels. Namely, we have:

$$T' = \{ s_1, \dots, s_n \mid s_i = s, n = |F_s|, s \in T\}$$

Given our non-deterministic model $\phi$, we obtain the predictions:

$$Y = \{ \phi(s') \mid s' \in T'\}$$

The adapted precision is the fraction of correct predictions among all predictions, where a prediction is correct if the predicted fallacy belongs to the set of true labels:

$$\text{Precision} = \frac{\sum_{s \in T} \left( |\{ s' \in T' \mid s' = s, \phi(s') \in F_s \}| / |F_s| \right)}{|T|}$$

Recall was measured based on the consistency of the prediction distribution, i.e., over a series of instances annotated with the same fallacies, the model was expected to generate each corresponding fallacy with similar frequency:

$$\text{Recall} = \frac{\sum_{s \in T} \sum_{f \in F_s} \min\left( \frac{|\{ s' \in T' \mid \phi(s') = f, s' = s\}|}{|F_s|}, \frac{1}{|F_s|}\right)}{|T|}$$
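
As an illustration, the adapted metrics can be computed from per-sentence prediction lists collected over $T'$ as in the sketch below; the function and variable names are ours and do not come from the released evaluation code.

```python
from collections import Counter

def adapted_precision_recall(true_labels, predictions):
    """
    true_labels: dict mapping each sentence s to its set of true fallacy labels F_s
    predictions: dict mapping each sentence s to the list of labels predicted for
                 its |F_s| copies in T' (so len(predictions[s]) == len(true_labels[s]))
    """
    precision_sum = 0.0
    recall_sum = 0.0
    for s, f_s in true_labels.items():
        preds = predictions[s]
        # Precision term: fraction of the |F_s| predictions for s that hit a true label
        correct = sum(1 for p in preds if p in f_s)
        precision_sum += correct / len(f_s)
        # Recall term: each true label contributes at most 1/|F_s|, so repeating one
        # label does not compensate for missing another
        counts = Counter(preds)
        recall_sum += sum(min(counts[f] / len(f_s), 1 / len(f_s)) for f in f_s)
    n = len(true_labels)
    return precision_sum / n, recall_sum / n
```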

### Results

| Model                              | ACC        | CD         | ED         | AR         | ET         | SD         | FD_Single  | FD_Multi   | AQ         |
| ---------------------------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| Llama 3.1 8B zero-shot             | 73.52%     | 51.50%     | 17.06%     | 28.32%     | 37.41%     | 14.10%     | 44.07%     | 21.77%     | 15.10%     |
| Llama 3.1 8B few-shot              | 75.47%     | 67.83%     | 64.20%     | 35.97%     | 49.31%     | 80.00%     | 48.50%     | 17.25%     | 31.83%     |
| Llama 3.1 8B fine-tuned for ACC    | **89.61%** | 61.35%     | 68.25%     | 38.51%     | 41.43%     | 65.82%     | 38.43%     | 21.58%     | 33.07%     |
| Llama 3.1 8B fine-tuned for CD     | 50.18%     | **85.16%** | 68.91%     | 38.29%     | 33.91%     | 66.97%     | 38.90%     | 22.67%     | 31.24%     |
| Llama 3.1 8B fine-tuned for ED     | 63.32%     | 74.94%     | **78.00%** | 28.60%     | 38.67%     | 68.42%     | 39.65%     | 18.47%     | 29.01%     |
| Llama 3.1 8B fine-tuned for AR     | 50.81%     | 59.98%     | 67.00%     | **87.20%** | 35.07%     | 76.00%     | 35.14%     | 25.86%     | 27.97%     |
| Llama 3.1 8B fine-tuned for ET     | 56.10%     | 67.08%     | 61.45%     | 26.88%     | **75.22%** | 69.82%     | 46.78%     | 29.68%     | 29.03%     |
| Llama 3.1 8B fine-tuned for SD     | 50.93%     | 48.88%     | 57.62%     | 38.26%     | 39.17%     | **94.63%** | 43.23%     | 20.99%     | 20.39%     |
| Llama 3.1 8B fine-tuned for FD     | 66.58%     | 65.13%     | 64.50%     | 38.64%     | 46.83%     | 64.32%     | **82.92%** | **50.77%** | 41.90%     |
| Llama 3.1 8B fine-tuned for AQ     | 74.46%     | 59.73%     | 68.00%     | 30.86%     | 44.06%     | 60.43%     | 47.98%     | 24.31%     | **69.54%** |
| GGUF_ACC                           | 87.73%     | 63.59%     | 63.75%     | 36.31%     | 37.98%     | 64.63%     | 30.19%     | 29.27%     | 32.94%     |
| GGUF_CD                            | 54.10%     | 81.92%     | 60.70%     | 36.43%     | 31.99%     | 63.82%     | 30.00%     | 31.21%     | 33.20%     |
| GGUF_ED                            | 56.20%     | 63.72%     | 71.62%     | 34.63%     | 36.22%     | 61.84%     | 34.10%     | 34.54%     | 34.77%     |
| GGUF_AR                            | 55.19%     | 60.25%     | 63.70%     | 84.57%     | 31.71%     | 76.50%     | 29.94%     | 34.18%     | 32.15%     |
| GGUF_ET                            | 58.23%     | 64.37%     | 58.59%     | 29.14%     | 72.47%     | 68.20%     | 39.05%     | 32.94%     | 31.48%     |
| GGUF_SD                            | 56.70%     | 50.75%     | 57.75%     | 38.27%     | 33.67%     | 93.75%     | 34.66%     | 30.32%     | 21.43%     |
| GGUF_FD                            | 62.20%     | 59.91%     | 62.88%     | 35.51%     | 42.52%     | 64.68%     | 74.08%     | **62.16%** | 41.69%     |
| GGUF_AQ                            | 67.08%     | 59.73%     | 69.50%     | 31.17%     | 41.31%     | 61.16%     | 41.86%     | 30.02%     | 66.53%     |
| Llama 3.1 8B fine-tuned Multi-task | **90.74%** | **84.71%** | **77.75%** | **88.33%** | **73.84%** | **95.75%** | **82.53%** | 50.22%     | **69.80%** |
| Merged Model                       | 78.72%     | 70.69%     | 69.62%     | 72.52%     | 54.60%     | 77.04%     | 57.00%     | 35.03%     | 57.52%     |
| GGUF_Merged                        | 65.95%     | 65.83%     | 62.13%     | 62.93%     | 49.06%     | 74.38%     | 50.04%     | 40.75%     | 44.97%     |

## Intended Use

### Intended Use Cases

Llama 3.1 is intended for commercial and research use in multiple languages. Instruction-tuned text-only models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. The Llama 3.1 model collection also supports the ability to leverage the outputs of its models to improve other models, including through synthetic data generation and distillation. The Llama 3.1 Community License allows for these use cases.

### Out-of-scope

Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and the Llama 3.1 Community License. Use in languages beyond those explicitly referenced as supported in this model card.

## Bias, Risks, and Limitations

This model is a new technology, and like any new technology, there are risks associated with its use. Testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, its potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased, or otherwise objectionable responses to user prompts. Therefore, before deploying any applications of this model, developers should perform safety testing and tuning tailored to their specific applications of the model.