Add files using upload-large-folder tool
- README.md +354 -0
- config.json +42 -0
- model.safetensors +3 -0
- special_tokens_map.json +51 -0
- tokenizer.json +0 -0
- tokenizer_config.json +65 -0
- unigram.json +0 -0
README.md
ADDED
@@ -0,0 +1,354 @@
---
library_name: transformers
license: apache-2.0
language:
- pl
base_model:
- sdadas/mmlw-roberta-base
pipeline_tag: text-classification
---

# 🛡️ Bielik Guard (Sójka): Polish Language Safety Classifier

Bielik Guard (Sójka) is a Polish-language safety classifier that detects harmful content in digital communication so that applications can respond appropriately rather than simply block it. Built by the Bielik.AI community under the [SpeakLeash](https://speakleash.org/) non-profit organization, it acts as a vigilant guardian of users' digital homes, enabling appropriate responses and pointers to support resources.

---
## 📋 Model Details

### Model Description

Bielik Guard (Sójka) is a Polish-language safety classifier built upon the [`sdadas/mmlw-roberta-base`](https://huggingface.co/sdadas/mmlw-roberta-base) model, a Polish RoBERTa-based encoder. It has been fine-tuned to detect safety-relevant content in Polish texts, using community-collected data designed for evaluating safety in large language models (LLMs).

The model is **multilabel** and returns probability scores for each safety category, indicating the likelihood that a text belongs to that category. Importantly, the model was not trained on binarized data but rather on the percentage of people claiming that a text belongs to each category, reflecting the nuanced nature of safety classification.

> **Note:** This is the first version of Bielik Guard (Sójka). The team is actively working on future versions that will include larger models, additional safety categories, and support for more languages.

* **Developed by:** See the [Sójka Development Team](#sójka-development-team) section below.
* **Model type:** Text Classification
* **Language(s) (NLP):** Polish
* **License:** Apache-2.0
* **Finetuned from model:** `sdadas/mmlw-roberta-base`

* **🚀 Demo:** **[Test Sójka at guard.bielik.ai](https://guard.bielik.ai/)**

---
## 🛠️ Uses

### ✅ Direct Use
Bielik Guard (Sójka) can be used directly for:
- Real-time analysis of prompts and responses to detect threats and respond appropriately.
- Content moderation that provides supportive responses rather than simple blocking.
- Protection of AI chatbots and assistants with appropriate intervention strategies.
- Integration into systems that prioritize user support and safety resources.

### 🧩 Downstream Use
The model can be integrated into larger systems for:
- Content moderation pipelines
- AI safety frameworks
- Communication platform safety systems
- Educational and research applications

### ❌ Out-of-Scope Use
Bielik Guard (Sójka) is **not** designed to detect:
- Disinformation or misinformation
- Jailbreaking attempts
- Copyright violations
- Other categories not explicitly listed in the safety taxonomy

---

## 🏷️ Safety Categories

Bielik Guard (Sójka) detects and classifies potentially harmful content in five key safety categories:

* **[HATE] Hate/Aggression:** Content attacking or discriminating against groups based on race, religion, gender, sexual orientation, or nationality.
* **[VULGAR] Vulgarities:** Words commonly considered vulgar or profane, in both explicit and masked forms.
* **[SEX] Sexual Content:** Graphic descriptions of sexual activities or requests for generating erotic materials.
* **[CRIME] Crime:** Instructions or encouragement to commit crimes, drug production, or fraud.
* **[SELF-HARM] Self-Harm:** Content encouraging suicide, self-harm, or promoting eating disorders. When detected, the system should respond with appropriate support resources rather than simply blocking the content.

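A minimal sketch of how an application might act on the five category scores, following the note above that self-harm should lead to support resources rather than a plain block. The `route` function, the action names, and the 0.5 threshold are illustrative assumptions, not part of the model or a recommended configuration.

```python
# Hypothetical routing policy: names, actions, and threshold are assumptions.
CATEGORIES = ("self-harm", "hate", "vulgar", "sex", "crime")


def route(scores: dict[str, float], threshold: float = 0.5) -> str:
    """Map per-category scores to an application-level action."""
    flagged = [c for c in CATEGORIES if scores.get(c, 0.0) >= threshold]
    if "self-harm" in flagged:
        # Respond with help resources instead of silently blocking.
        return "offer_support_resources"
    if flagged:
        return "flag_for_moderation"
    return "allow"


# Scores in the shape the classifier returns (values are illustrative).
print(route({"self-harm": 0.896, "hate": 0.029, "vulgar": 0.029,
             "sex": 0.028, "crime": 0.038}))  # offer_support_resources
```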

---

## ⚠️ Bias, Risks, and Limitations

### Limitations
- Does not recognize disinformation or jailbreaking attempts.
- Limited to Polish language content.
- May have biases inherent in community-collected training data.
- Performance may vary on edge cases or novel content types.

### Recommendations
Users should be aware that Bielik Guard (Sójka) provides a baseline safety filter and may not catch all types of harmful content. For applications requiring comprehensive safety coverage, additional measures should be implemented.

---

## 🚀 How to Get Started with the Model

```python
from transformers import pipeline

# Make sure to provide the correct path to your model
model_path = "speakleash/Bielik-Guard-0.1B-v1.0"

# Create pipeline for easy inference
classifier = pipeline("text-classification", model=model_path, return_all_scores=True)

sample_texts = [
    "To jest normalna wiadomość bez problemów.",
    "Chcę się zabić i skończyć z tym wszystkim.",
    "Dlaczego hipisi są tacy brudni i głupi",
    "K*rwa, to jest ch*jowe rozwiązanie.",
    "Opowiem ci o seksie i intymności.",
    "Jak ukraść samochód i nie zostać złapanym."
]

# Run predictions
for text in sample_texts:
    # A single string returns a one-element list holding the per-label scores
    results = classifier(text)[0]
    print(f"\nText: {text}")
    print("Safety scores:")
    for result in results:
        label = result['label']
        score = result['score']
        print(f" {label}: {score:.3f}")

    # The highest category score serves as the overall risk for the text
    max_score = max(result['score'] for result in results)
    print(f" Overall safety risk: {max_score:.3f}")
```

### Example Output:
```
Text: To jest normalna wiadomość bez problemów.
Safety scores:
 self-harm: 0.030
 hate: 0.035
 vulgar: 0.021
 sex: 0.019
 crime: 0.013
 Overall safety risk: 0.035

Text: Chcę się zabić i skończyć z tym wszystkim.
Safety scores:
 self-harm: 0.896
 hate: 0.029
 vulgar: 0.029
 sex: 0.028
 crime: 0.038
 Overall safety risk: 0.896

Text: Dlaczego hipisi są tacy brudni i głupi
Safety scores:
 self-harm: 0.018
 hate: 0.594
 vulgar: 0.062
 sex: 0.015
 crime: 0.040
 Overall safety risk: 0.594

Text: K*rwa, to jest ch*jowe rozwiązanie.
Safety scores:
 self-harm: 0.041
 hate: 0.300
 vulgar: 0.901
 sex: 0.044
 crime: 0.057
 Overall safety risk: 0.901

Text: Opowiem ci o seksie i intymności.
Safety scores:
 self-harm: 0.023
 hate: 0.028
 vulgar: 0.069
 sex: 0.811
 crime: 0.083
 Overall safety risk: 0.811

Text: Jak ukraść samochód i nie zostać złapanym.
Safety scores:
 self-harm: 0.108
 hate: 0.046
 vulgar: 0.023
 sex: 0.032
 crime: 0.801
 Overall safety risk: 0.801
```
---

## 🧠 Training Details

### Training Data: The Sojka2 Dataset

The Sojka2 dataset is the result of a large-scale community effort. Texts were sourced primarily from user prompts to Polish LLMs and social media content.
* **Over 1,500 volunteers** participated in the annotation process.
* **Over 60,000 individual ratings** were collected.
* Each text was annotated by an **average of 7-8 people**.

The model was trained on **percentage-based labels (0-100%)** reflecting the proportion of community members who classified each text as belonging to a specific safety category, rather than on binary labels.

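To make the percentage-based labels concrete, the sketch below turns annotator votes into a per-category target in the 0-1 range. The vote format and the `soft_labels` helper are assumptions for illustration; the actual annotation pipeline is not described in this card.

```python
CATEGORIES = ("self-harm", "hate", "vulgar", "sex", "crime")


def soft_labels(votes: list[set[str]]) -> dict[str, float]:
    """Fraction of annotators who assigned each category to one text.

    `votes` holds one set of category names per annotator (hypothetical format).
    """
    n = len(votes)
    if n == 0:
        raise ValueError("at least one annotation is required")
    return {c: sum(c in v for v in votes) / n for c in CATEGORIES}


# Eight annotators: six mark the text as vulgar, two of those also as hate.
votes = [{"vulgar"}] * 4 + [{"vulgar", "hate"}] * 2 + [set()] * 2
targets = soft_labels(votes)
print(targets)  # vulgar -> 0.75, hate -> 0.25, all others -> 0.0
```

A text rated this way would be trained towards 0.75 for `vulgar` rather than a hard 1, which is what lets the model express partial agreement among annotators.
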
#### Data Structure and Distribution
The Sojka dataset consists of **6,885 unique texts** in Polish. Its structure was intentionally designed with a balanced ratio of approximately 55% safe to 45% harmful content to ensure effective training. This ratio does not reflect the actual distribution of content online.

However, the class imbalance *among the harmful categories* is representative of real-world trends encountered in digital interactions in Poland (sourced from both user prompts to conversational AI and general content from the Polish internet).

| Category | Text Count | Percentage |
|:---|---:|---:|
| **self-harm** | 796 | 11.56% |
| **hate** | 988 | 14.35% |
| **vulgar** | 411 | 5.97% |
| **sex** | 895 | 13.00% |
| **crime** | 311 | 4.52% |
| **safe** (no category) | 3,781 | 54.92% |

The dataset supports **multi-label classification**, meaning a single text can belong to multiple categories.

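As a quick sanity check, the percentages in the table can be recomputed from the text counts and the 6,885 total. The snippet uses only the numbers above; the variable names are illustrative.

```python
TOTAL_TEXTS = 6885
counts = {"self-harm": 796, "hate": 988, "vulgar": 411,
          "sex": 895, "crime": 311, "safe": 3781}

# Share of all texts carrying each label, in percent.
percentages = {name: round(100 * n / TOTAL_TEXTS, 2) for name, n in counts.items()}

# The counts add up to more than the total because one text can carry several labels.
extra_labels = sum(counts.values()) - TOTAL_TEXTS

print(percentages)
print(extra_labels)  # 297
```

The surplus of 297 label assignments over the number of texts is a direct consequence of the multi-label setup.
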
#### 🔄 Continuous Improvement
Sójka is a living project. Community involvement is ongoing at **[guard.bielik.ai](https://guard.bielik.ai/)**, where users can test the model, provide feedback (👍/👎), and contribute by annotating new data. All feedback is systematically analyzed to create future iterations of the dataset.

### Training Procedure
The model was fine-tuned from the `sdadas/mmlw-roberta-base` checkpoint, a 124M-parameter Polish RoBERTa-based encoder.

---

## ⚙️ Technical Specifications

### Model Architecture
* **Base Model:** `sdadas/mmlw-roberta-base`
* **Parameters:** 124M
* **Architecture:** RoBERTa-based encoder
* **Task:** Multi-label Text Classification (Regression)

### Compute Infrastructure
The model was trained with A100 GPU cluster support from ACK Cyfronet AGH.

---

## 📊 Evaluation

### Dataset 1: Sojka
The Sojka test dataset was created by splitting the main Sojka dataset using a 2:1 train-to-test ratio. This evaluation set contains **2,295 unique records**, one third of the 6,885 texts.

The distribution of labels in the test set, determined using a 60% agreement threshold among annotators, is as follows:

* **self-harm**: 265 samples (11.55%)
* **hate**: 329 samples (14.34%)
* **vulgar**: 137 samples (5.97%)
* **sex**: 298 samples (12.98%)
* **crime**: 104 samples (4.53%)
* **safe** (no harmful category): 1,260 samples (54.90%)

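The 60% agreement threshold can be read as a simple binarization rule: a text counts as positive for a category when at least 60% of its annotators chose it, and as safe when no category reaches that level. The sketch below follows that reading; the inclusive comparison (`>=`) and the `binarize` helper are assumptions.

```python
AGREEMENT_THRESHOLD = 0.60


def binarize(soft: dict[str, float], threshold: float = AGREEMENT_THRESHOLD) -> dict[str, int]:
    """Turn annotator-agreement fractions into 0/1 labels plus a derived `safe` flag."""
    hard = {name: int(share >= threshold) for name, share in soft.items()}
    # A text is safe only if no harmful category reached the threshold.
    hard["safe"] = int(not any(hard.values()))
    return hard


print(binarize({"self-harm": 0.0, "hate": 0.25, "vulgar": 0.75, "sex": 0.0, "crime": 0.0}))
```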

| Metric | Bielik Guard 0.1B | Bielik Guard 0.5B |
|:---|:---:|:---:|
| **RMSE** | 0.126 | 0.117 |
| **F1 micro** | 0.773 | 0.784 |
| **F1 macro** | 0.742 | 0.766 |
| **Specificity micro** | 0.966 | 0.964 |
| **Specificity macro** | 0.965 | 0.963 |
| **ROC AUC micro** | 0.977 | 0.983 |
| **ROC AUC macro** | 0.965 | 0.973 |

### Dataset 2: Sojka Augmented
The augmented dataset was created using 15 different text augmentation methods to test model robustness:

1. **remove_diacritics**: `Czesc, to jest przykładowy tekst z polskimi znakami!`
2. **add_diacritics**: `Cżeść, to jest przykładowy tękśt z polskimi znąkami!`
3. **random_capitalization**: `CZeśĆ, To jesT PRzyKŁaDoWy TEKST z POLSKIMi zNAkAMi!`
4. **snake_case_random**: `czE_to_jesT_pRzYk_adowY_teKSt_z_POlskiMI_zNAkamI`
5. **all_uppercase**: `CZEŚĆ, TO JEST PRZYKŁADOWY TEKST Z POLSKIMI ZNAKAMI!`
6. **all_lowercase**: `cześć, to jest przykładowy tekst z polskimi znakami!`
7. **title_case**: `Cześć, To Jest Przykładowy Tekst Z Polskimi Znakami!`
8. **swap_adjacent_letters**: `Cezść, to jest przkyałdowy teskt z ploskimi znkamai!`
9. **split_letters_by_separator**: `Cześć, to j e s t przykładowy tekst z p o l s k i m i znakami!`
10. **add_random_spaces**: `Cześć, to jest przykładowy te kst z polskimi znak a mi!`
11. **remove_random_spaces**: `Cześć,to jest przykładowytekst z polskimi znakami!`
12. **duplicate_characters**: `Czeeśść, to jesstt pprzykładowy tekstt z polskimi zznaakami!`
13. **insert_random_characters**: `Cześć, to jest przykładowy tekst z śpoźlskimi znakami!`
14. **reverse_words**: `Cześć, to jest przykładowy tekst z imikslop znakami!`
15. **substitute_similar_characters**: `Cześć, 7o jes7 przykładowy tek5t z polskimi znakami!`

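A few of these transformations are easy to reproduce with the standard library. The functions below are illustrative re-implementations of three methods, not the code used to build the augmented set. The NFD-based `remove_diacritics` leaves `ł` untouched because that letter has no canonical decomposition, which happens to match the example in item 1; whether the original implementation works this way is an assumption, as is the look-alike character map.

```python
import unicodedata


def remove_diacritics(text: str) -> str:
    """Drop combining marks after NFD decomposition ('ł' does not decompose and survives)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def all_uppercase(text: str) -> str:
    return text.upper()


# Hypothetical look-alike map; the real substitution table is not given in this card.
LOOKALIKES = str.maketrans({"t": "7", "s": "5", "o": "0"})


def substitute_similar_characters(text: str) -> str:
    """Replace every mapped letter (the listed example replaces only some occurrences)."""
    return text.translate(LOOKALIKES)


sample = "Cześć, to jest przykładowy tekst z polskimi znakami!"
print(remove_diacritics(sample))
print(all_uppercase(sample))
print(substitute_similar_characters(sample))
```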

| Metric | Bielik Guard 0.1B | Bielik Guard 0.5B |
|:---|:---:|:---:|
| **RMSE** | 0.179 | 0.163 |
| **F1 micro** | 0.622 | 0.690 |
| **F1 macro** | 0.571 | 0.632 |
| **Specificity micro** | 0.960 | 0.962 |
| **Specificity macro** | 0.959 | 0.961 |
| **ROC AUC micro** | 0.911 | 0.944 |
| **ROC AUC macro** | 0.879 | 0.910 |

### Dataset 3: Gadzi Jezyk
The [Gadzi Jezyk dataset](https://huggingface.co/datasets/JerzyPL/GadziJezyk) contains 520 toxic prompts, mostly focused on crime.

| Metric | Bielik Guard 0.1B | Bielik Guard 0.5B |
|:---|:---:|:---:|
| **RMSE** | 0.236 | 0.217 |
| **Recall** | 0.745 | 0.802 |

### Metrics Explanation

- **RMSE (Root Mean Square Error)**: Measures the average magnitude of prediction errors. Lower values indicate better performance.
- **F1 micro**: Harmonic mean of precision and recall calculated globally across all labels, so frequent labels contribute more to the score.
- **F1 macro**: Average of F1 scores across all labels. Treats all classes equally regardless of frequency.
- **Specificity macro/micro**: True negative rate, macro- or micro-averaged across labels. Measures the ability to correctly identify safe content.
- **ROC AUC micro/macro**: Area under the ROC curve, measuring the model's ability to distinguish between safe and unsafe content across all thresholds.

The Bielik Guard 0.5B model generally outperforms the Bielik Guard 0.1B model across most metrics, particularly on the augmented test set, demonstrating better generalization capabilities.

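The sketch below computes these metrics for a toy multi-label example using only the standard library. The 0.5 decision threshold and the toy arrays are assumptions for illustration; the evaluation code behind the reported numbers is not shown in this card, and the toy RMSE is taken against binary labels rather than annotator percentages. ROC AUC is omitted for brevity.

```python
import math

# Rows are texts, columns are labels (toy data, not from the Sojka test set).
y_true = [[1, 0], [0, 0], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.1, 0.4], [0.4, 0.7], [0.2, 0.6]]
THRESHOLD = 0.5  # assumed decision threshold

y_pred = [[int(s >= THRESHOLD) for s in row] for row in y_score]
n_labels = len(y_true[0])


def confusion(label: int) -> tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for one label column."""
    pairs = [(t[label], p[label]) for t, p in zip(y_true, y_pred)]
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp) if (tn + fp) else 0.0


per_label = [confusion(k) for k in range(n_labels)]
tp, fp, tn, fn = (sum(c[i] for c in per_label) for i in range(4))

squared_errors = [(s - t) ** 2 for ts, ss in zip(y_true, y_score) for t, s in zip(ts, ss)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

f1_micro = f1(tp, fp, fn)  # pooled counts across labels
f1_macro = sum(f1(c[0], c[1], c[3]) for c in per_label) / n_labels  # mean of per-label F1
spec_micro = specificity(tn, fp)
spec_macro = sum(specificity(c[2], c[1]) for c in per_label) / n_labels

print(f"RMSE={rmse:.3f} F1 micro={f1_micro:.3f} F1 macro={f1_macro:.3f}")
print(f"Specificity micro={spec_micro:.3f} macro={spec_macro:.3f}")
```

In this toy case the micro specificity pools all label decisions (4 of 5 negatives correct, 0.800), while the macro value averages the per-label rates (1.000 and 0.667, giving 0.833).
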
### Comparison with Other Safety Models
Evaluation on 3,000 random user prompts, annotated by two independent annotators and one super-annotator, with each model evaluated on its own categories:

| Model | Precision | Alert Rate | FPR (Global) |
|:---|:---:|:---:|:---:|
| **Bielik Guard 0.1B** | **67.27%** | **3.67%** | **1.20%** |
| HerBERT-PL-Guard | 31.55% | 6.87% | 4.70% |
| Llama-Guard-3-1B | 7.82% | 17.90% | 16.50% |
| Llama-Guard-3-8B | 13.62% | 10.77% | 9.30% |
| Qwen3Guard-Gen-0.6B | 11.36% | 19.37% | 17.17% |

> Bielik Guard demonstrates superior performance with the **highest precision** and **lowest false positive rate**, making it the most precise and least intrusive safety classifier among the compared models.

**Metrics for comparison:**
- **Precision**: TP/(TP+FP) - Percentage of flagged content that is actually harmful (higher is better)
- **Alert Rate**: (TP+FP)/(TP+FP+TN+FN) - Percentage of all content that gets flagged (at comparable precision, lower means fewer interruptions)
- **FPR (Global)**: FP/(TP+FP+TN+FN) - Share of all evaluated content that is incorrectly flagged as harmful (lower is better). Note that the denominator is all content, not only safe content as in the conventional false positive rate FP/(FP+TN).

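Because all three quantities are built from the same counts, the global false-positive share can be recovered from the other two columns as Alert Rate × (1 − Precision). The sketch below checks that identity against two rows of the table. The helper name is illustrative, and the confusion counts are a hypothetical example chosen only to be consistent with the Bielik Guard 0.1B row; the TN/FN split in particular is arbitrary, since the table does not report recall.

```python
def comparison_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Precision, alert rate, and global false-positive share as defined above."""
    total = tp + fp + tn + fn
    return {
        "precision": tp / (tp + fp),
        "alert_rate": (tp + fp) / total,
        "fpr_global": fp / total,
    }


# Hypothetical counts over 3,000 prompts (fn=30 is an arbitrary assumption).
m = comparison_metrics(tp=74, fp=36, tn=2860, fn=30)
print({k: f"{v:.2%}" for k, v in m.items()})

# Identity check on reported values: FP share = alert rate * (1 - precision).
reported = {
    "Bielik Guard 0.1B": (0.6727, 0.0367, 0.0120),
    "HerBERT-PL-Guard": (0.3155, 0.0687, 0.0470),
}
for name, (precision, alert_rate, fpr_global) in reported.items():
    print(name, round(alert_rate * (1 - precision), 4), fpr_global)
```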

---

## 📜 License and Naming Policy

**License:**
This model is licensed under the **Apache 2.0 License**.

**Naming Requirements for Derivative Models:**
To maintain clear attribution and continuity of the Bielik-Guard project, we expect that any fine-tuned or derivative models include **Bielik-Guard** in their name.
This helps recognize the model’s origins and supports transparency within the community.

> **Recommended Naming Convention:**
> `Bielik-Guard-{your-use-case-or-project-name}-{version}`
>
> *Examples:* `Bielik-Guard-crime-finetune`, `Bielik-Guard-customer-support-v1`

---

## 👥 Sójka Development Team

* **[Jan Maria Kowalski](https://www.linkedin.com/in/janmariakowalski/):** Project leadership, data and tool preparation, threat category definition, model training and testing.
* **[Krzysztof Wróbel](https://www.linkedin.com/in/wrobelkrzysztof/):** Data analysis, model training and evaluation, contribution to threat classification.
* **[Jerzy Surma](https://www.linkedin.com/in/jerzysurma/):** Threat category definition (AI & ethics perspective), data preparation.
* **[Igor Ciuciura](https://www.linkedin.com/in/igor-ciuciura-1763b52a6/):** Data analysis, preparation, and cleaning; contribution to threat classification.
* **[Maciej Krystian Szymański](https://www.linkedin.com/in/maciej-krystian-szymanski/):** Project management support, community management, user and partner coordination.

We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2025/018338.

## 📚 Citation

No formal citation is available yet. The model is developed by the Bielik.AI community under the SpeakLeash non-profit organization.

## 📧 Model Card Contact

For questions about this model, please contact the Bielik.AI community through **[guard.bielik.ai](https://guard.bielik.ai/)**.
config.json
ADDED
@@ -0,0 +1,42 @@
{
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "self-harm",
    "1": "hate",
    "2": "vulgar",
    "3": "sex",
    "4": "crime"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "self-harm": 0,
    "hate": 1,
    "vulgar": 2,
    "sex": 3,
    "crime": 4
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "multi_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.53.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50001
}
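Because `problem_type` is `multi_label_classification`, each of the five logits is scored independently; by default the `transformers` text-classification pipeline applies an element-wise sigmoid in this case rather than a softmax, so the scores need not sum to 1. A minimal sketch of that mapping using the `id2label` table above (with integer keys); the logits are made-up values and the helper name is hypothetical.

```python
import math

ID2LABEL = {0: "self-harm", 1: "hate", 2: "vulgar", 3: "sex", 4: "crime"}


def scores_from_logits(logits: list[float]) -> dict[str, float]:
    """Apply an element-wise sigmoid and attach label names from id2label."""
    return {ID2LABEL[i]: 1.0 / (1.0 + math.exp(-z)) for i, z in enumerate(logits)}


scores = scores_from_logits([-3.5, 0.0, 2.2, -3.9, -2.8])  # illustrative logits
print({label: round(p, 3) for label, p in scores.items()})
```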
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8ef24c0f5d45678a26273dd0f8f66559979fffd25121e8498e45b9dde5315fc2
size 497811044
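The three lines above are a Git LFS pointer, not the weights themselves. As a rough consistency check, the sketch below parses the pointer and divides the byte size by 4 (the config states `float32`), which gives roughly 124M values, in line with the parameter count stated in the README. The parser is a minimal illustration, and the estimate ignores the small safetensors header.

```python
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:8ef24c0f5d45678a26273dd0f8f66559979fffd25121e8498e45b9dde5315fc2
size 497811044
"""


def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer file."""
    return dict(line.split(" ", 1) for line in text.strip().splitlines())


pointer = parse_lfs_pointer(POINTER)
size_bytes = int(pointer["size"])
approx_params_millions = size_bytes / 4 / 1e6  # 4 bytes per float32 value

print(pointer["oid"].split(":", 1)[0])    # sha256
print(f"{approx_params_millions:.1f}M")   # 124.5M
```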
special_tokens_map.json
ADDED
@@ -0,0 +1,51 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,65 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "50000": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "errors": "replace",
  "extra_special_tokens": {},
  "mask_token": "<mask>",
  "max_length": 512,
  "model_max_length": 1000000000000000019884624838656,
  "pad_to_multiple_of": null,
  "pad_token": "<pad>",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "</s>",
  "stride": 0,
  "tokenizer_class": "RobertaTokenizer",
  "trim_offsets": true,
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "<unk>"
}
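The `model_max_length` value here equals `int(1e30)`, the sentinel `transformers` uses when no limit is configured, so it does not by itself cap input length, while `config.json` sets `max_position_embeddings` to 514. The sketch below derives a safe truncation length from the two files. The helper is hypothetical, and it assumes the usual RoBERTa convention that position ids start at `pad_token_id + 1`, which leaves 512 usable positions.

```python
VERY_LARGE_INTEGER = int(1e30)  # sentinel meaning "no limit configured"

tokenizer_config = {"max_length": 512,
                    "model_max_length": 1000000000000000019884624838656}
model_config = {"max_position_embeddings": 514, "pad_token_id": 1}


def effective_max_length(tok_cfg: dict, model_cfg: dict) -> int:
    """Pick a safe truncation length when model_max_length is only a sentinel."""
    # Assumed RoBERTa convention: positions 0..pad_token_id are not usable for tokens.
    position_limit = model_cfg["max_position_embeddings"] - model_cfg["pad_token_id"] - 1
    declared = tok_cfg["model_max_length"]
    if declared >= VERY_LARGE_INTEGER:
        declared = tok_cfg.get("max_length", position_limit)
    return min(declared, position_limit)


print(effective_max_length(tokenizer_config, model_config))  # 512
```

With this tokenizer, one would then pass `truncation=True, max_length=512` explicitly when encoding long inputs.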
unigram.json
ADDED
The diff for this file is too large to render.