adamo1139 committed on
Commit
0262a6b
·
verified ·
1 Parent(s): 1b3953c

Add files using upload-large-folder tool

README.md ADDED
@@ -0,0 +1,354 @@
1
+ ---
2
+ library_name: transformers
3
+ license: apache-2.0
4
+ language:
5
+ - pl
6
+ base_model:
7
+ - sdadas/mmlw-roberta-base
8
+ pipeline_tag: text-classification
9
+ ---
10
+
11
+ # 🛡️ Bielik Guard (Sójka): Polish Language Safety Classifier
12
+
13
+ Bielik Guard (Sójka) is a Polish-language safety classifier that detects harmful content in digital communication so that applications can respond appropriately rather than simply block it. Built by the Bielik.AI community under the [SpeakLeash](https://speakleash.org/) non-profit organization, it protects users like a vigilant guardian of their digital homes by enabling appropriate responses and support resources.
14
+
15
+ ---
16
+
17
+ ## 📋 Model Details
18
+
19
+ ### Model Description
20
+
21
+ Bielik Guard (Sójka) is a Polish-language safety classifier built upon the [`sdadas/mmlw-roberta-base`](https://huggingface.co/sdadas/mmlw-roberta-base) model, a Polish RoBERTa-based encoder. It has been fine-tuned to detect safety-relevant content in Polish texts, using community-collected data designed for evaluating safety in large language models (LLMs).
22
+
23
+ The model is **multilabel** and returns probability scores for each safety category, indicating the likelihood that a text belongs to that category. Importantly, the model was not trained on binarized data but rather on the percentage of people claiming that a text belongs to each category, reflecting the nuanced nature of safety classification.
24
+
25
+ > **Note:** This is the first version of Bielik Guard (Sójka). The team is actively working on future versions that will include larger models, additional safety categories, and support for more languages.
26
+
27
+ * **Developed by:** See the [Sójka Development Team](#sójka-development-team) section below.
28
+ * **Model type:** Text Classification
29
+ * **Language(s) (NLP):** Polish
30
+ * **License:** Apache-2.0
31
+ * **Finetuned from model:** `sdadas/mmlw-roberta-base`
32
+
33
+
34
+ * **🚀 Demo:** **[Test Sójka at guard.bielik.ai](https://guard.bielik.ai/)**
35
+
36
+ ---
37
+
38
+ ## 🛠️ Uses
39
+
40
+ ### ✅ Direct Use
41
+ Bielik Guard (Sójka) can be used directly for:
42
+ - Real-time analysis of prompts and responses to detect threats and respond appropriately.
43
+ - Content moderation that provides supportive responses rather than simple blocking.
44
+ - Protection of AI chatbots and assistants with appropriate intervention strategies.
45
+ - Integration into systems that prioritize user support and safety resources.
46
+
47
+ ### 🧩 Downstream Use
48
+ The model can be integrated into larger systems for:
49
+ - Content moderation pipelines
50
+ - AI safety frameworks
51
+ - Communication platform safety systems
52
+ - Educational and research applications
53
+
54
+ ### ❌ Out-of-Scope Use
55
+ Bielik Guard (Sójka) is **not** designed to detect:
56
+ - Disinformation or misinformation
57
+ - Jailbreaking attempts
58
+ - Copyright violations
59
+ - Other categories not explicitly listed in the safety taxonomy
60
+
61
+ ---
62
+
63
+ ## 🏷️ Safety Categories
64
+
65
+ Bielik Guard (Sójka) detects and classifies potentially harmful content in five key safety categories:
66
+
67
+ * **[HATE] Hate/Aggression:** Content attacking or discriminating against groups based on race, religion, gender, sexual orientation, or nationality.
68
+ * **[VULGAR] Vulgarities:** Words commonly considered vulgar or profane, in both explicit and masked forms.
69
+ * **[SEX] Sexual Content:** Graphic descriptions of sexual activities or requests for generating erotic materials.
70
+ * **[CRIME] Crime:** Instructions or encouragement to commit crimes, drug production, or fraud.
71
+ * **[SELF-HARM] Self-Harm:** Content encouraging suicide, self-harm, or promoting eating disorders. When detected, the system should respond with appropriate support resources rather than simply blocking the content.
72
+
73
+ ---
74
+
75
+ ## ⚠️ Bias, Risks, and Limitations
76
+
77
+ ### Limitations
78
+ - Does not recognize disinformation or jailbreaking attempts.
79
+ - Limited to Polish language content.
80
+ - May have biases inherent in community-collected training data.
81
+ - Performance may vary on edge cases or novel content types.
82
+
83
+ ### Recommendations
84
+ Users should be aware that Bielik Guard (Sójka) provides a baseline safety filter and may not catch all types of harmful content. For applications requiring comprehensive safety coverage, additional measures should be implemented.
85
+
86
+ ---
87
+
88
+ ## 🚀 How to Get Started with the Model
89
+
90
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+
+ # Make sure to provide the correct path to your model
+ model_path = "speakleash/Bielik-Guard-0.1B-v1.0"
+
+ # Create pipeline for easy inference
+ classifier = pipeline("text-classification", model=model_path, return_all_scores=True)
+
+ sample_texts = [
+     "To jest normalna wiadomość bez problemów.",
+     "Chcę się zabić i skończyć z tym wszystkim.",
+     "Dlaczego hipisi są tacy brudni i głupi",
+     "K*rwa, to jest ch*jowe rozwiązanie.",
+     "Opowiem ci o seksie i intymności.",
+     "Jak ukraść samochód i nie zostać złapanym."
+ ]
+
+ # Run predictions
+ for text in sample_texts:
+     results = classifier(text)[0]
+     print(f"\nText: {text}")
+     print("Safety scores:")
+     for result in results:
+         label = result['label']
+         score = result['score']
+         print(f" {label}: {score:.3f}")
+
+     max_score = max(result['score'] for result in results)
+     print(f" Overall safety risk: {max_score:.3f}")
+ ```
121
+
122
+ ### Example Output:
123
+ ```
124
+ Text: To jest normalna wiadomość bez problemów.
125
+ Safety scores:
126
+ self-harm: 0.030
127
+ hate: 0.035
128
+ vulgar: 0.021
129
+ sex: 0.019
130
+ crime: 0.013
131
+ Overall safety risk: 0.035
132
+
133
+ Text: Chcę się zabić i skończyć z tym wszystkim.
134
+ Safety scores:
135
+ self-harm: 0.896
136
+ hate: 0.029
137
+ vulgar: 0.029
138
+ sex: 0.028
139
+ crime: 0.038
140
+ Overall safety risk: 0.896
141
+
142
+ Text: Dlaczego hipisi są tacy brudni i głupi
143
+ Safety scores:
144
+ self-harm: 0.018
145
+ hate: 0.594
146
+ vulgar: 0.062
147
+ sex: 0.015
148
+ crime: 0.040
149
+ Overall safety risk: 0.594
150
+
151
+ Text: K*rwa, to jest ch*jowe rozwiązanie.
152
+ Safety scores:
153
+ self-harm: 0.041
154
+ hate: 0.300
155
+ vulgar: 0.901
156
+ sex: 0.044
157
+ crime: 0.057
158
+ Overall safety risk: 0.901
159
+
160
+ Text: Opowiem ci o seksie i intymności.
161
+ Safety scores:
162
+ self-harm: 0.023
163
+ hate: 0.028
164
+ vulgar: 0.069
165
+ sex: 0.811
166
+ crime: 0.083
167
+ Overall safety risk: 0.811
168
+
169
+ Text: Jak ukraść samochód i nie zostać złapanym.
170
+ Safety scores:
171
+ self-harm: 0.108
172
+ hate: 0.046
173
+ vulgar: 0.023
174
+ sex: 0.032
175
+ crime: 0.801
176
+ Overall safety risk: 0.801
177
+ ```
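The per-category scores printed above still have to be turned into a decision by the calling application. The following is a minimal sketch of one way to do that, not part of the committed README: the 0.5 threshold, the `triage` helper, and the action names are all hypothetical. It follows the card's guidance that self-harm detections should lead to support resources rather than plain blocking.

```python
SUPPORT_LABEL = "self-harm"  # category the card says should trigger support resources


def to_score_dict(results):
    """Convert the pipeline's list of {'label', 'score'} dicts into a plain dict."""
    return {item["label"]: item["score"] for item in results}


def triage(scores, threshold=0.5):
    """Pick an action from per-category scores.

    The threshold and the action names are illustrative assumptions,
    not values recommended by the model authors.
    """
    flagged = sorted(label for label, score in scores.items() if score >= threshold)
    if SUPPORT_LABEL in flagged:
        action = "show_support_resources"
    elif flagged:
        action = "moderate"
    else:
        action = "allow"
    return {"action": action, "flagged": flagged, "max_score": max(scores.values())}


# Scores copied from the second example in the output above.
results = [
    {"label": "self-harm", "score": 0.896},
    {"label": "hate", "score": 0.029},
    {"label": "vulgar", "score": 0.029},
    {"label": "sex", "score": 0.028},
    {"label": "crime", "score": 0.038},
]
decision = triage(to_score_dict(results))
print(decision)
```

Because the labels are independent, several categories can exceed the threshold at once; the sketch keeps all of them in `flagged` and only uses the self-harm label to choose a different action.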
178
+ ---
179
+
180
+ ## 🧠 Training Details
181
+
182
+ ### Training Data: The Sojka2 Dataset
183
+
184
+ The Sojka2 dataset is the result of a large-scale community effort. Texts were sourced primarily from user prompts to Polish LLMs and social media content.
185
+ * **Over 1,500 volunteers** participated in the annotation process.
186
+ * **Over 60,000 individual ratings** were collected.
187
+ * Each text was annotated by an **average of 7-8 people**.
188
+
189
+ The model was trained on **percentage-based labels (0-100%)** reflecting the proportion of community members who classified each text as belonging to a specific safety category, rather than on binary labels.
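To make the idea of percentage-based labels concrete, here is a small sketch of how per-annotator votes could be turned into soft targets in the 0 to 1 range. The commit does not include the project's data-preparation code, so the `soft_labels` helper and the example votes are hypothetical.

```python
from collections import Counter

CATEGORIES = ["self-harm", "hate", "vulgar", "sex", "crime"]


def soft_labels(annotations):
    """Turn per-annotator category sets into per-category vote fractions.

    `annotations` holds one set of category names per annotator; an empty
    set means that annotator considered the text safe.
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    votes = Counter(cat for chosen in annotations for cat in chosen)
    return {cat: votes[cat] / len(annotations) for cat in CATEGORIES}


# Hypothetical example: eight annotators rate one text.
example = [
    {"hate"}, {"hate", "vulgar"}, {"hate"}, set(),
    {"hate"}, {"vulgar"}, {"hate"}, {"hate"},
]
labels = soft_labels(example)
print(labels)
```

A target such as 0.75 for `hate` tells the model that most, but not all, annotators saw the text as hateful, which is information a binary label would discard.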
190
+
191
+ #### Data Structure and Distribution
192
+ The Sojka dataset consists of **6,885 unique texts** in Polish. Its structure was intentionally designed with a balanced ratio of approximately 55% safe to 45% harmful content to ensure effective training. This ratio does not reflect the actual distribution of content online.
193
+
194
+ However, the class imbalance *among the harmful categories* is representative of real-world trends encountered in digital interactions in Poland (sourced from both user prompts to conversational AI and general content from the Polish internet).
195
+
196
+ | Category | Text Count | Percentage |
197
+ |:---|---:|---:|
198
+ | **self-harm** | 796 | 11.56% |
199
+ | **hate** | 988 | 14.35% |
200
+ | **vulgar** | 411 | 5.97% |
201
+ | **sex** | 895 | 13.00% |
202
+ | **crime** | 311 | 4.52% |
203
+ | **safe** (no category) | 3,781 | 54.92% |
204
+
205
+ The dataset supports **multi-label classification**, meaning a single text can belong to multiple categories.
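The table's arithmetic can be checked directly. The sketch below recomputes each percentage from the counts and the stated total of 6,885 texts, and shows that the category counts add up to more than the total, which is what multi-label data looks like. Only the numbers come from the table; the variable names are illustrative.

```python
TOTAL_TEXTS = 6885

counts = {
    "self-harm": 796,
    "hate": 988,
    "vulgar": 411,
    "sex": 895,
    "crime": 311,
    "safe": 3781,
}

# Percentage of texts per category, rounded as in the table.
percentages = {name: round(100 * n / TOTAL_TEXTS, 2) for name, n in counts.items()}

# Label assignments beyond one per text; a positive value means that some
# texts carry more than one harmful category.
extra_assignments = sum(counts.values()) - TOTAL_TEXTS

print(percentages)
print(extra_assignments)
```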
206
+
207
+ #### 🔄 Continuous Improvement
208
+ Sójka is a living project. Community involvement is ongoing at **[guard.bielik.ai](https://guard.bielik.ai/)**, where users can test the model, provide feedback (👍/👎), and contribute by annotating new data. All feedback is systematically analyzed to create future iterations of the dataset.
209
+
210
+ ### Training Procedure
211
+ The model was fine-tuned from the `sdadas/mmlw-roberta-base` checkpoint, a 124M parameter Polish RoBERTa-based encoder.
212
+
213
+ ---
214
+
215
+ ## ⚙️ Technical Specifications
216
+
217
+ ### Model Architecture
218
+ * **Base Model:** `sdadas/mmlw-roberta-base`
219
+ * **Parameters:** 124M
220
+ * **Architecture:** RoBERTa-based encoder
221
+ * **Task:** Multi-label Text Classification (Regression)
222
+
223
+ ### Compute Infrastructure
224
+ The model was trained with A100 GPU cluster support from ACK Cyfronet AGH.
225
+
226
+ ---
227
+
228
+ ## 📊 Evaluation
229
+
230
+ ### Dataset 1: Sojka
231
+ The Sojka test dataset was created by splitting the main Sojka dataset using a 1:2 train-to-test ratio. This evaluation set contains **4,590 unique records**.
232
+
233
+ The distribution of labels in the test set, determined using a 60% agreement threshold among annotators, is as follows:
234
+
235
+ * **self-harm**: 265 samples (11.55%)
236
+ * **hate**: 329 samples (14.34%)
237
+ * **vulgar**: 137 samples (5.97%)
238
+ * **sex**: 298 samples (12.98%)
239
+ * **crime**: 104 samples (4.53%)
240
+ * **safe** (no harmful category): 1,260 samples (54.90%)
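The 60% agreement threshold mentioned above can be illustrated with a short sketch that converts vote fractions into discrete labels. Whether the project uses "at least 60%" or "more than 60%" is not stated, so the `>=` comparison and the `binarize` helper are assumptions.

```python
def binarize(soft, threshold=0.60):
    """Return the categories whose annotator agreement reaches the threshold.

    A text with no category at or above the threshold is treated as safe.
    The inclusive comparison is an assumption.
    """
    positive = [cat for cat, fraction in soft.items() if fraction >= threshold]
    return positive or ["safe"]


# Hypothetical vote fractions for two texts.
mostly_hate = {"self-harm": 0.0, "hate": 0.75, "vulgar": 0.25, "sex": 0.0, "crime": 0.0}
disputed = {"self-harm": 0.0, "hate": 0.5, "vulgar": 0.375, "sex": 0.0, "crime": 0.0}

print(binarize(mostly_hate))
print(binarize(disputed))
```

Under this rule a text on which annotators split evenly ends up in the safe bucket, so the discrete counts understate how much borderline content the soft labels contain.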
241
+
242
+
243
+ | Metric | Bielik Guard 0.1B | Bielik Guard 0.5B |
244
+ |:---|:---:|:---:|
245
+ | **RMSE** | 0.126 | 0.117 |
246
+ | **F1 micro** | 0.773 | 0.784 |
247
+ | **F1 macro** | 0.742 | 0.766 |
248
+ | **Specificity micro** | 0.966 | 0.964 |
249
+ | **Specificity macro** | 0.965 | 0.963 |
250
+ | **ROC AUC micro** | 0.977 | 0.983 |
251
+ | **ROC AUC macro** | 0.965 | 0.973 |
252
+
253
+ ### Dataset 2: Sojka Augmented
254
+ The augmented dataset was created using 15 different text augmentation methods to test model robustness:
255
+
256
+ 1. **remove_diacritics**: `Czesc, to jest przykładowy tekst z polskimi znakami!`
257
+ 2. **add_diacritics**: `Cżeść, to jest przykładowy tękśt z polskimi znąkami!`
258
+ 3. **random_capitalization**: `CZeśĆ, To jesT PRzyKŁaDoWy TEKST z POLSKIMi zNAkAMi!`
259
+ 4. **snake_case_random**: `czE_to_jesT_pRzYk_adowY_teKSt_z_POlskiMI_zNAkamI`
260
+ 5. **all_uppercase**: `CZEŚĆ, TO JEST PRZYKŁADOWY TEKST Z POLSKIMI ZNAKAMI!`
261
+ 6. **all_lowercase**: `cześć, to jest przykładowy tekst z polskimi znakami!`
262
+ 7. **title_case**: `Cześć, To Jest Przykładowy Tekst Z Polskimi Znakami!`
263
+ 8. **swap_adjacent_letters**: `Cezść, to jest przkyałdowy teskt z ploskimi znkamai!`
264
+ 9. **split_letters_by_separator**: `Cześć, to j e s t przykładowy tekst z p o l s k i m i znakami!`
265
+ 10. **add_random_spaces**: `Cześć, to jest przykładowy te kst z polskimi znak a mi!`
266
+ 11. **remove_random_spaces**: `Cześć,to jest przykładowytekst z polskimi znakami!`
267
+ 12. **duplicate_characters**: `Czeeśść, to jesstt pprzykładowy tekstt z polskimi zznaakami!`
268
+ 13. **insert_random_characters**: `Cześć, to jest przykładowy tekst z śpoźlskimi znakami!`
269
+ 14. **reverse_words**: `Cześć, to jest przykładowy tekst z imikslop znakami!`
270
+ 15. **substitute_similar_characters**: `Cześć, 7o jes7 przykładowy tek5t z polskimi znakami!`
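A few of the listed perturbations are simple enough to sketch. The functions below are illustrative re-implementations, not the project's augmentation code, which is not part of this commit; the probabilities, the look-alike mapping, and the function bodies are assumptions. Note that this `remove_diacritics` strips every Polish diacritic, whereas the example in the list keeps some.

```python
import random

# Explicit mapping, because Unicode decomposition does not cover "ł".
_DIACRITICS = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")
# Hypothetical look-alike table; the list above shows t->7 and s->5.
_LOOKALIKES = {"t": "7", "s": "5", "o": "0", "e": "3"}


def remove_diacritics(text):
    """Replace Polish diacritic letters with their ASCII base letters."""
    return text.translate(_DIACRITICS)


def swap_adjacent_letters(text, rng, prob=0.5):
    """Swap one inner pair of neighbouring characters in some words."""
    out = []
    for word in text.split(" "):
        if len(word) >= 4 and rng.random() < prob:
            i = rng.randrange(1, len(word) - 2)  # first and last characters stay put
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        out.append(word)
    return " ".join(out)


def substitute_similar_characters(text, rng, prob=0.3):
    """Replace some letters with visually similar digits."""
    return "".join(
        _LOOKALIKES[ch] if ch in _LOOKALIKES and rng.random() < prob else ch
        for ch in text
    )


SAMPLE = "Cześć, to jest przykładowy tekst z polskimi znakami!"
rng = random.Random(0)  # fixed seed so repeated runs give the same perturbations
print(remove_diacritics(SAMPLE))
print(swap_adjacent_letters(SAMPLE, rng))
print(substitute_similar_characters(SAMPLE, rng))
```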
271
+
272
+
273
+ | Metric | Bielik Guard 0.1B | Bielik Guard 0.5B |
274
+ |:---|:---:|:---:|
275
+ | **RMSE** | 0.179 | 0.163 |
276
+ | **F1 micro** | 0.622 | 0.690 |
277
+ | **F1 macro** | 0.571 | 0.632 |
278
+ | **Specificity micro** | 0.960 | 0.962 |
279
+ | **Specificity macro** | 0.959 | 0.961 |
280
+ | **ROC AUC micro** | 0.911 | 0.944 |
281
+ | **ROC AUC macro** | 0.879 | 0.910 |
282
+
283
+ ### Dataset 3: Gadzi Jezyk
284
+ The [Gadzi Jezyk dataset](https://huggingface.co/datasets/JerzyPL/GadziJezyk) contains 520 toxic prompts, mostly focused on crime.
285
+
286
+ | Metric | Bielik Guard 0.1B | Bielik Guard 0.5B |
287
+ |:---|:---:|:---:|
288
+ | **RMSE** | 0.236 | 0.217 |
289
+ | **Recall** | 0.745 | 0.802 |
290
+
291
+ ### Metrics Explanation
292
+
293
+ - **RMSE (Root Mean Square Error)**: Measures the average magnitude of prediction errors. Lower values indicate better performance.
294
+ - **F1 micro**: Harmonic mean of precision and recall calculated globally across all labels. Accounts for class imbalance.
295
+ - **F1 macro**: Average of F1 scores across all labels. Treats all classes equally regardless of frequency.
296
+ - **Specificity macro/micro**: Specificity (true negative rate), macro- or micro-averaged across labels. Measures the ability to correctly identify safe content.
297
+ - **ROC AUC micro/macro**: Area under the ROC curve, measuring the model's ability to distinguish between safe and unsafe content across all thresholds.
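The difference between micro and macro averaging is easiest to see on a toy example. The sketch below computes RMSE, F1, and specificity for a two-label problem in plain Python. It assumes predictions have already been binarized for F1 and specificity; the card does not say which threshold the authors used, and the toy data is invented.

```python
import math


def confusion(y_true, y_pred, label):
    """Count (tp, fp, tn, fn) for one label column."""
    tp = fp = tn = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        t, p = true_row[label], pred_row[label]
        if t and p:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def specificity(tn, fp):
    return tn / (tn + fp) if (tn + fp) else 0.0


def rmse(targets, preds):
    """Root mean square error over every (sample, label) cell."""
    sq = [(t - p) ** 2 for t_row, p_row in zip(targets, preds) for t, p in zip(t_row, p_row)]
    return math.sqrt(sum(sq) / len(sq))


def summarize(y_true, y_pred):
    n_labels = len(y_true[0])
    per_label = [confusion(y_true, y_pred, k) for k in range(n_labels)]
    # Micro: pool the counts of all labels first, then compute the metric once.
    tp, fp, tn, fn = (sum(c[i] for c in per_label) for i in range(4))
    return {
        "f1_micro": f1(tp, fp, fn),
        # Macro: compute the metric per label, then take the plain average.
        "f1_macro": sum(f1(c[0], c[1], c[3]) for c in per_label) / n_labels,
        "specificity_micro": specificity(tn, fp),
        "specificity_macro": sum(specificity(c[2], c[1]) for c in per_label) / n_labels,
    }


# Invented toy data: four texts, two labels.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_pred = [[1, 0], [0, 0], [1, 1], [1, 0]]
summary = summarize(y_true, y_pred)
print(summary)
print(rmse([[1.0, 0.0]], [[0.5, 0.5]]))
```

Micro averaging lets frequent labels dominate the result, while macro averaging gives a rare label such as `crime` the same weight as a common one such as `hate`.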
298
+
299
+ The Bielik Guard 0.5B model outperforms the Bielik Guard 0.1B model on most metrics, particularly on the augmented test set, which suggests better robustness to perturbed input. The exception is specificity on the Sojka test set, where the 0.1B model is marginally higher.
300
+
301
+ ### Comparison with Other Safety Models
302
+ Each model was evaluated on 3,000 random user prompts, labeled by two independent annotators and one super-annotator, using that model's own categories:
303
+
304
+ | Model | Precision | Alert Rate | FPR (Global) |
305
+ |:---|:---:|:---:|:---:|
306
+ | **Bielik Guard 0.1B** | **67.27%** | **3.67%** | **1.20%** |
307
+ | HerBERT-PL-Guard | 31.55% | 6.87% | 4.70% |
308
+ | Llama-Guard-3-1B | 7.82% | 17.90% | 16.50% |
309
+ | Llama-Guard-3-8B | 13.62% | 10.77% | 9.30% |
310
+ | Qwen3Guard-Gen-0.6B | 11.36% | 19.37% | 17.17% |
311
+
312
+ > Bielik Guard demonstrates superior performance with the **highest precision** and **lowest false positive rate**, making it the most precise and least intrusive safety classifier among the compared models.
313
+
314
+ **Metrics for comparison:**
315
+ - **Precision**: TP/(TP+FP) - Percentage of flagged content that is actually harmful (higher is better)
316
+ - **Alert Rate**: (TP+FP)/(TP+FP+TN+FN) - Percentage of all content that gets flagged (lower is better to reduce false positives)
317
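+ - **FPR (Global)**: FP/(TP+FP+TN+FN) - Share of all evaluated content that is safe but was flagged as harmful (lower is better). Because the denominator is all content rather than only safe content (FP+TN), this differs from the conventional false positive rate.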
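The three formulas are linked: FPR (Global) equals Alert Rate multiplied by (1 - Precision). The sketch below shows this with counts back-calculated from the Bielik Guard 0.1B row, on the assumption that the 3,000 prompts are the denominator. The counts TP = 74 and FP = 36 are inferred rather than reported, and `comparison_metrics` is a hypothetical helper.

```python
def comparison_metrics(tp, fp, total):
    """Precision, alert rate and global FPR as defined in the list above."""
    flagged = tp + fp
    return {
        "precision": tp / flagged if flagged else 0.0,
        "alert_rate": flagged / total,
        "fpr_global": fp / total,
    }


# Inferred counts (assumption): 110 of 3,000 prompts flagged, 36 of them wrongly.
m = comparison_metrics(tp=74, fp=36, total=3000)
for name, value in m.items():
    print(f"{name}: {100 * value:.2f}%")

# The global FPR is the flagged share multiplied by the share of flags that are wrong.
identity_gap = abs(m["alert_rate"] * (1 - m["precision"]) - m["fpr_global"])
print(identity_gap < 1e-12)
```

The same identity reproduces the FPR (Global) column for the other rows of the table from their Precision and Alert Rate values.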
318
+
319
+
320
+ ---
321
+
322
+ ## 📜 License and Naming Policy
323
+
324
+ **License:**
325
+ This model is licensed under the **Apache 2.0 License**.
326
+
327
+ **Naming Requirements for Derivative Models:**
328
+ To maintain clear attribution and continuity of the Bielik-Guard project, we expect that any fine-tuned or derivative models include **Bielik-Guard** in their name.
329
+ This helps recognize the model’s origins and supports transparency within the community.
330
+
331
+ > **Recommended Naming Convention:**
332
+ > `Bielik-Guard-{your-use-case-or-project-name}-{version}`
333
+ >
334
+ > *Examples:* `Bielik-Guard-crime-finetune`, `Bielik-Guard-customer-support-v1`
335
+
336
+ ---
337
+
338
+ ## 👥 Sójka Development Team
339
+
340
+ * **[Jan Maria Kowalski](https://www.linkedin.com/in/janmariakowalski/):** Project leadership, data and tool preparation, threat category definition, model training and testing.
341
+ * **[Krzysztof Wróbel](https://www.linkedin.com/in/wrobelkrzysztof/):** Data analysis, model training and evaluation, contribution to threat classification.
342
+ * **[Jerzy Surma](https://www.linkedin.com/in/jerzysurma/):** Threat category definition (AI & ethics perspective), data preparation.
343
+ * **[Igor Ciuciura](https://www.linkedin.com/in/igor-ciuciura-1763b52a6/):** Data analysis, preparation, and cleaning; contribution to threat classification.
344
+ * **[Maciej Krystian Szymański](https://www.linkedin.com/in/maciej-krystian-szymanski/):** Project management support, community management, user and partner coordination.
345
+
346
+ We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2025/018338.
347
+
348
+ ## 📚 Citation
349
+
350
+ No formal citation is available yet. The model is developed by the Bielik.AI community under the SpeakLeash non-profit organization.
351
+
352
+ ## 📧 Model Card Contact
353
+
354
+ For questions about this model, please contact the Bielik.AI community through **[guard.bielik.ai](https://guard.bielik.ai/)**.
config.json ADDED
@@ -0,0 +1,42 @@
1
+ {
2
+ "architectures": [
3
+ "RobertaForSequenceClassification"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "bos_token_id": 0,
7
+ "classifier_dropout": null,
8
+ "eos_token_id": 2,
9
+ "gradient_checkpointing": false,
10
+ "hidden_act": "gelu",
11
+ "hidden_dropout_prob": 0.1,
12
+ "hidden_size": 768,
13
+ "id2label": {
14
+ "0": "self-harm",
15
+ "1": "hate",
16
+ "2": "vulgar",
17
+ "3": "sex",
18
+ "4": "crime"
19
+ },
20
+ "initializer_range": 0.02,
21
+ "intermediate_size": 3072,
22
+ "label2id": {
23
+ "self-harm": 0,
24
+ "hate": 1,
25
+ "vulgar": 2,
26
+ "sex": 3,
27
+ "crime": 4
28
+ },
29
+ "layer_norm_eps": 1e-05,
30
+ "max_position_embeddings": 514,
31
+ "model_type": "roberta",
32
+ "num_attention_heads": 12,
33
+ "num_hidden_layers": 12,
34
+ "pad_token_id": 1,
35
+ "position_embedding_type": "absolute",
36
+ "problem_type": "multi_label_classification",
37
+ "torch_dtype": "float32",
38
+ "transformers_version": "4.53.2",
39
+ "type_vocab_size": 1,
40
+ "use_cache": true,
41
+ "vocab_size": 50001
42
+ }
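Two fields in this config shape the model's output: `id2label` fixes the order of the five logits, and `problem_type` is `multi_label_classification`, for which the Transformers text-classification pipeline applies an independent sigmoid per label rather than a softmax. The sketch below mimics that post-processing in plain Python. The logits are invented and `scores_from_logits` is a hypothetical helper, not a library function.

```python
import json
import math

# Only the two relevant fields, copied from the config above.
CONFIG_FRAGMENT = json.loads("""
{
  "id2label": {"0": "self-harm", "1": "hate", "2": "vulgar", "3": "sex", "4": "crime"},
  "problem_type": "multi_label_classification"
}
""")


def scores_from_logits(logits, config):
    """Map raw logits to per-label probabilities with an element-wise sigmoid."""
    if config["problem_type"] != "multi_label_classification":
        raise ValueError("this sketch only covers the multi-label case")
    id2label = {int(idx): name for idx, name in config["id2label"].items()}
    return {id2label[i]: 1.0 / (1.0 + math.exp(-z)) for i, z in enumerate(logits)}


# Invented logits for illustration only.
scores = scores_from_logits([-3.0, 0.0, 2.0, -1.0, 3.0], CONFIG_FRAGMENT)
print(scores)
```

Since each label gets its own sigmoid, the scores do not need to sum to 1, which matches the example output in the README where several categories can be high at once.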
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8ef24c0f5d45678a26273dd0f8f66559979fffd25121e8498e45b9dde5315fc2
3
+ size 497811044
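The pointer's size field gives a quick sanity check on the parameter count stated in the README. With `torch_dtype` set to `float32` in config.json, each parameter takes 4 bytes, so the file size divided by 4 is an upper estimate of the parameter count. It is an upper estimate because the safetensors header is included in the size.

```python
size_bytes = 497_811_044   # "size" from the Git LFS pointer above
bytes_per_param = 4        # float32, per torch_dtype in config.json

# Upper estimate: the file also contains a small safetensors header.
approx_params = size_bytes / bytes_per_param
print(f"~{approx_params / 1e6:.1f}M parameters")
```

The result is consistent with the 124M parameters listed under Technical Specifications.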
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "cls_token": {
10
+ "content": "<s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "eos_token": {
17
+ "content": "</s>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "mask_token": {
24
+ "content": "<mask>",
25
+ "lstrip": true,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "pad_token": {
31
+ "content": "<pad>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "sep_token": {
38
+ "content": "</s>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false
43
+ },
44
+ "unk_token": {
45
+ "content": "<unk>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false
50
+ }
51
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,65 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "0": {
5
+ "content": "<s>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "1": {
13
+ "content": "<pad>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "2": {
21
+ "content": "</s>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "3": {
29
+ "content": "<unk>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "50000": {
37
+ "content": "<mask>",
38
+ "lstrip": true,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ }
44
+ },
45
+ "bos_token": "<s>",
46
+ "clean_up_tokenization_spaces": true,
47
+ "cls_token": "<s>",
48
+ "eos_token": "</s>",
49
+ "errors": "replace",
50
+ "extra_special_tokens": {},
51
+ "mask_token": "<mask>",
52
+ "max_length": 512,
53
+ "model_max_length": 1000000000000000019884624838656,
54
+ "pad_to_multiple_of": null,
55
+ "pad_token": "<pad>",
56
+ "pad_token_type_id": 0,
57
+ "padding_side": "right",
58
+ "sep_token": "</s>",
59
+ "stride": 0,
60
+ "tokenizer_class": "RobertaTokenizer",
61
+ "trim_offsets": true,
62
+ "truncation_side": "right",
63
+ "truncation_strategy": "longest_first",
64
+ "unk_token": "<unk>"
65
+ }
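This tokenizer config sets `max_length` to 512, while `model_max_length` is a very large placeholder value, and config.json limits `max_position_embeddings` to 514 (RoBERTa-style models reserve two position slots for the padding offset). It may therefore be safer for callers to bound input length explicitly. The sketch below splits a long list of token ids into windows that fit 512 positions once `<s>` and `</s>` are added. `window_token_ids` is a hypothetical helper, and scoring windows separately is an assumption rather than a documented recommendation.

```python
def window_token_ids(token_ids, max_length=512, num_special=2, stride=0):
    """Split token ids into windows that fit the model after special tokens.

    `num_special` accounts for <s> and </s>; `stride` is the overlap
    between consecutive windows.
    """
    body = max_length - num_special
    step = body - stride
    if step <= 0:
        raise ValueError("stride must be smaller than the window body")
    return [token_ids[i:i + body] for i in range(0, max(len(token_ids), 1), step)]


# Invented input: 1,200 token ids standing in for a long text.
long_input = list(range(1200))
windows = window_token_ids(long_input)
print([len(w) for w in windows])
```

One simple way to combine window-level results is to take the per-category maximum across windows, so that a harmful passage anywhere in a long text is still reflected in the final score.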
unigram.json ADDED
The diff for this file is too large to render. See raw diff