Add files using upload-large-folder tool

Files changed (7)

README.md +354 -0
config.json +42 -0
model.safetensors +3 -0
special_tokens_map.json +51 -0
tokenizer.json +0 -0
tokenizer_config.json +65 -0
unigram.json +0 -0

README.md ADDED

	@@ -0,0 +1,354 @@

+---
+library_name: transformers
+license: apache-2.0
+language:
+- pl
+base_model:
+- sdadas/mmlw-roberta-base
+pipeline_tag: text-classification
+---
+# 🛡️ Bielik Guard (Sójka): Polish Language Safety Classifier
+Bielik Guard (Sójka) is a Polish-language safety classifier that detects harmful content in digital communication so that applications can respond appropriately rather than simply block it. Built by the Bielik.AI community under the [SpeakLeash](https://speakleash.org/) non-profit organization, it is intended to support responses that point users to help and support resources.
+---
+## 📋 Model Details
+### Model Description
+Bielik Guard (Sójka) is a Polish-language safety classifier built upon the [`sdadas/mmlw-roberta-base`](https://huggingface.co/sdadas/mmlw-roberta-base) model, a Polish RoBERTa-based encoder. It has been fine-tuned to detect safety-relevant content in Polish texts, using community-collected data designed for evaluating safety in large language models (LLMs).
+The model is **multilabel** and returns a probability score for each safety category, indicating how likely a text is to belong to that category. The model was not trained on binarized labels but on the share of annotators who assigned a text to each category, which reflects the graded nature of safety judgments.
+> **Note:** This is the first version of Bielik Guard (Sójka). The team is actively working on future versions that will include larger models, additional safety categories, and support for more languages.
+*   **Developed by:** See the [Sójka Development Team](#sójka-development-team) section below.
+*   **Model type:** Text Classification
+*   **Language(s) (NLP):** Polish
+*   **License:** Apache-2.0
+*   **Finetuned from model:** `sdadas/mmlw-roberta-base`
+*   **🚀 Demo:** **[Test Sójka at guard.bielik.ai](https://guard.bielik.ai/)**
+---
+## 🛠️ Uses
+### ✅ Direct Use
+Bielik Guard (Sójka) can be used directly for:
+- Real-time analysis of prompts and responses to detect threats and respond appropriately.
+- Content moderation that provides supportive responses rather than simple blocking.
+- Protection of AI chatbots and assistants with appropriate intervention strategies.
+- Integration into systems that prioritize user support and safety resources.
+### 🧩 Downstream Use
+The model can be integrated into larger systems for:
+- Content moderation pipelines
+- AI safety frameworks
+- Communication platform safety systems
+- Educational and research applications
+### ❌ Out-of-Scope Use
+Bielik Guard (Sójka) is **not** designed to detect:
+- Disinformation or misinformation
+- Jailbreaking attempts
+- Copyright violations
+- Other categories not explicitly listed in the safety taxonomy
+---
+## 🏷️ Safety Categories
+Bielik Guard (Sójka) detects and classifies potentially harmful content in five key safety categories:
+*   **[HATE] Hate/Aggression:** Content attacking or discriminating against groups based on race, religion, gender, sexual orientation, or nationality.
+*   **[VULGAR] Vulgarities:** Words commonly considered vulgar or profane, in both explicit and masked forms.
+*   **[SEX] Sexual Content:** Graphic descriptions of sexual activities or requests for generating erotic materials.
+*   **[CRIME] Crime:** Instructions or encouragement to commit crimes, drug production, or fraud.
+*   **[SELF-HARM] Self-Harm:** Content encouraging suicide, self-harm, or promoting eating disorders. When detected, the system should respond with appropriate support resources rather than simply blocking the content.
+---
+## ⚠️ Bias, Risks, and Limitations
+### Limitations
+- Does not recognize disinformation or jailbreaking attempts.
+- Limited to Polish language content.
+- May have biases inherent in community-collected training data.
+- Performance may vary on edge cases or novel content types.
+### Recommendations
+Users should be aware that Bielik Guard (Sójka) provides a baseline safety filter and may not catch all types of harmful content. For applications requiring comprehensive safety coverage, additional measures should be implemented.
+---
+## 🚀 How to Get Started with the Model
+```python
+from transformers import pipeline
+# Make sure to provide the correct path to your model
+model_path = "speakleash/Bielik-Guard-0.1B-v1.0"
+# Create pipeline for easy inference
+classifier = pipeline("text-classification", model=model_path, return_all_scores=True)
+sample_texts = [
+    "To jest normalna wiadomość bez problemów.",
+    "Chcę się zabić i skończyć z tym wszystkim.",
+    "Dlaczego hipisi są tacy brudni i głupi",
+    "K*rwa, to jest ch*jowe rozwiązanie.",
+    "Opowiem ci o seksie i intymności.",
+    "Jak ukraść samochód i nie zostać złapanym."
+]
+# Run predictions
+for text in sample_texts:
+    results = classifier(text)[0]
+    print(f"\nText: {text}")
+    print("Safety scores:")
+    for result in results:
+        label = result['label']
+        score = result['score']
+        print(f"  {label}: {score:.3f}")
+    max_score = max(result['score'] for result in results)
+    print(f"  Overall safety risk: {max_score:.3f}")
+```
+### Example Output:
+```
+Text: To jest normalna wiadomość bez problemów.
+Safety scores:
+  self-harm: 0.030
+  hate: 0.035
+  vulgar: 0.021
+  sex: 0.019
+  crime: 0.013
+  Overall safety risk: 0.035
+Text: Chcę się zabić i skończyć z tym wszystkim.
+Safety scores:
+  self-harm: 0.896
+  hate: 0.029
+  vulgar: 0.029
+  sex: 0.028
+  crime: 0.038
+  Overall safety risk: 0.896
+Text: Dlaczego hipisi są tacy brudni i głupi
+Safety scores:
+  self-harm: 0.018
+  hate: 0.594
+  vulgar: 0.062
+  sex: 0.015
+  crime: 0.040
+  Overall safety risk: 0.594
+Text: K*rwa, to jest ch*jowe rozwiązanie.
+Safety scores:
+  self-harm: 0.041
+  hate: 0.300
+  vulgar: 0.901
+  sex: 0.044
+  crime: 0.057
+  Overall safety risk: 0.901
+Text: Opowiem ci o seksie i intymności.
+Safety scores:
+  self-harm: 0.023
+  hate: 0.028
+  vulgar: 0.069
+  sex: 0.811
+  crime: 0.083
+  Overall safety risk: 0.811
+Text: Jak ukraść samochód i nie zostać złapanym.
+Safety scores:
+  self-harm: 0.108
+  hate: 0.046
+  vulgar: 0.023
+  sex: 0.032
+  crime: 0.801
+  Overall safety risk: 0.801
+```
+---
+## 🧠 Training Details
+### Training Data: The Sojka2 Dataset
+The Sojka2 dataset is the result of a large-scale community effort. Texts were sourced primarily from user prompts to Polish LLMs and social media content.
+*   **Over 1,500 volunteers** participated in the annotation process.
+*   **Over 60,000 individual ratings** were collected.
+*   Each text was annotated by an **average of 7-8 people**.
+The model was trained on **percentage-based labels (0-100%)** reflecting the proportion of community members who classified each text as belonging to a specific safety category, rather than on binary labels.
+#### Data Structure and Distribution
+The Sojka dataset consists of **6,885 unique texts** in Polish. Its structure was intentionally designed with a balanced ratio of approximately 55% safe to 45% harmful content to ensure effective training. This ratio does not reflect the actual distribution of content online.
+However, the class imbalance *among the harmful categories* is representative of real-world trends encountered in digital interactions in Poland (sourced from both user prompts to conversational AI and general content from the Polish internet).
+| Category | Text Count | Percentage |
+|:---|---:|---:|
+| **self-harm** | 796 | 11.56% |
+| **hate** | 988 | 14.35% |
+| **vulgar** | 411 | 5.97% |
+| **sex** | 895 | 13.00% |
+| **crime** | 311 | 4.52% |
+| **safe** (no category) | 3,781 | 54.92% |
+The dataset supports **multi-label classification**, meaning a single text can belong to multiple categories.
+#### 🔄 Continuous Improvement
+Sójka is a living project. Community involvement is ongoing at **[guard.bielik.ai](https://guard.bielik.ai/)**, where users can test the model, provide feedback (👍/👎), and contribute by annotating new data. All feedback is systematically analyzed to create future iterations of the dataset.
+### Training Procedure
+The model was fine-tuned from the `sdadas/mmlw-roberta-base` checkpoint, a 124M parameter Polish RoBERTa-based encoder.
+---
+## ⚙️ Technical Specifications
+### Model Architecture
+*   **Base Model:** `sdadas/mmlw-roberta-base`
+*   **Parameters:** 124M
+*   **Architecture:** RoBERTa-based encoder
+*   **Task:** Multi-label Text Classification (Regression)
+### Compute Infrastructure
+The model was trained with A100 GPU cluster support from ACK Cyfronet AGH.
+---
+## 📊 Evaluation
+### Dataset 1: Sojka
+The Sojka test dataset was created by splitting the main Sojka dataset using a 2:1 train-to-test ratio. This evaluation set contains **2,295 unique records**, which is the total implied by the label counts and percentages below.
+The distribution of labels in the test set, determined using a 60% agreement threshold among annotators, is as follows:
+*   **self-harm**: 265 samples (11.55%)
+*   **hate**: 329 samples (14.34%)
+*   **vulgar**: 137 samples (5.97%)
+*   **sex**: 298 samples (12.98%)
+*   **crime**: 104 samples (4.53%)
+*   **safe** (no harmful category): 1,260 samples (54.90%)
+| Metric | Bielik Guard 0.1B | Bielik Guard 0.5B |
+|:---|:---:|:---:|
+| **RMSE** | 0.126 | 0.117 |
+| **F1 micro** | 0.773 | 0.784 |
+| **F1 macro** | 0.742 | 0.766 |
+| **Specificity micro** | 0.966 | 0.964 |
+| **Specificity macro** | 0.965 | 0.963 |
+| **ROC AUC micro** | 0.977 | 0.983 |
+| **ROC AUC macro** | 0.965 | 0.973 |
+### Dataset 2: Sojka Augmented
+The augmented dataset was created using 15 different text augmentation methods to test model robustness:
+1. **remove_diacritics**: `Czesc, to jest przykładowy tekst z polskimi znakami!`
+2. **add_diacritics**: `Cżeść, to jest przykładowy tękśt z polskimi znąkami!`
+3. **random_capitalization**: `CZeśĆ, To jesT PRzyKŁaDoWy TEKST z POLSKIMi zNAkAMi!`
+4. **snake_case_random**: `czE_to_jesT_pRzYk_adowY_teKSt_z_POlskiMI_zNAkamI`
+5. **all_uppercase**: `CZEŚĆ, TO JEST PRZYKŁADOWY TEKST Z POLSKIMI ZNAKAMI!`
+6. **all_lowercase**: `cześć, to jest przykładowy tekst z polskimi znakami!`
+7. **title_case**: `Cześć, To Jest Przykładowy Tekst Z Polskimi Znakami!`
+8. **swap_adjacent_letters**: `Cezść, to jest przkyałdowy teskt z ploskimi znkamai!`
+9. **split_letters_by_separator**: `Cześć, to j e s t przykładowy tekst z p o l s k i m i znakami!`
+10. **add_random_spaces**: `Cześć,  to jest przykładowy te kst z polskimi znak a mi!`
+11. **remove_random_spaces**: `Cześć,to jest przykładowytekst z polskimi znakami!`
+12. **duplicate_characters**: `Czeeśść, to jesstt pprzykładowy tekstt z polskimi zznaakami!`
+13. **insert_random_characters**: `Cześć, to jest przykładowy tekst z śpoźlskimi znakami!`
+14. **reverse_words**: `Cześć, to jest przykładowy tekst z imikslop znakami!`
+15. **substitute_similar_characters**: `Cześć, 7o jes7 przykładowy tek5t z polskimi znakami!`
+| Metric | Bielik Guard 0.1B | Bielik Guard 0.5B |
+|:---|:---:|:---:|
+| **RMSE** | 0.179 | 0.163 |
+| **F1 micro** | 0.622 | 0.690 |
+| **F1 macro** | 0.571 | 0.632 |
+| **Specificity micro** | 0.960 | 0.962 |
+| **Specificity macro** | 0.959 | 0.961 |
+| **ROC AUC micro** | 0.911 | 0.944 |
+| **ROC AUC macro** | 0.879 | 0.910 |
+### Dataset 3: Gadzi Jezyk
+The [Gadzi Jezyk dataset](https://huggingface.co/datasets/JerzyPL/GadziJezyk) contains 520 toxic prompts, mostly focused on crime.
+| Metric | Bielik Guard 0.1B | Bielik Guard 0.5B |
+|:---|:---:|:---:|
+| **RMSE** | 0.236 | 0.217 |
+| **Recall** | 0.745 | 0.802 |
+### Metrics Explanation
+- **RMSE (Root Mean Square Error)**: Measures the average magnitude of prediction errors. Lower values indicate better performance.
+- **F1 micro**: Harmonic mean of precision and recall calculated globally over all label decisions, so frequent labels contribute more to the score.
+- **F1 macro**: Average of F1 scores across all labels. Treats all classes equally regardless of frequency.
+- **Specificity micro/macro**: True negative rate, either pooled over all label decisions (micro) or averaged per label (macro). Measures the ability to correctly leave safe content unflagged.
+- **ROC AUC micro/macro**: Area under the ROC curve, measuring the model's ability to distinguish between safe and unsafe content across all thresholds.
+The Bielik Guard 0.5B model outperforms the Bielik Guard 0.1B model on most metrics, with the largest gains on the augmented test set, which suggests it is more robust to perturbed input. The exception is specificity on the Sojka test set, where the 0.1B model is marginally higher.
+### Comparison with Other Safety Models
+Evaluation on 3,000 random user prompts, annotated by two independent annotators and one super-annotator. Each model was scored against its own set of safety categories:
+| Model | Precision | Alert Rate | FPR (Global) |
+|:---|:---:|:---:|:---:|
+| **Bielik Guard 0.1B** | **67.27%** | **3.67%** | **1.20%** |
+| HerBERT-PL-Guard | 31.55% | 6.87% | 4.70% |
+| Llama-Guard-3-1B | 7.82% | 17.90% | 16.50% |
+| Llama-Guard-3-8B | 13.62% | 10.77% | 9.30% |
+| Qwen3Guard-Gen-0.6B | 11.36% | 19.37% | 17.17% |
+> Among the compared models, Bielik Guard 0.1B has the **highest precision** and the **lowest false positive rate**, along with the lowest alert rate, making it the most precise and least intrusive classifier in this comparison.
+**Metrics for comparison:**
+- **Precision**: TP/(TP+FP) - Percentage of flagged content that is actually harmful (higher is better)
+- **Alert Rate**: (TP+FP)/(TP+FP+TN+FN) - Percentage of all content that gets flagged (lower means fewer interruptions for users)
+- **FPR (Global)**: FP/(TP+FP+TN+FN) - Percentage of all content that is incorrectly flagged as harmful (lower is better). Note that this is normalized by all content, not by safe content only as in the conventional false positive rate FP/(FP+TN).
+---
+## 📜 License and Naming Policy
+**License:**
+This model is licensed under the **Apache 2.0 License**.
+**Naming Requirements for Derivative Models:**
+To maintain clear attribution and continuity of the Bielik-Guard project, we expect that any fine-tuned or derivative models include **Bielik-Guard** in their name.
+This helps recognize the model’s origins and supports transparency within the community.
+> **Recommended Naming Convention:**
+> `Bielik-Guard-{your-use-case-or-project-name}-{version}`
+>
+> *Examples:* `Bielik-Guard-crime-finetune`, `Bielik-Guard-customer-support-v1`
+---
+## 👥 Sójka Development Team
+*   **[Jan Maria Kowalski](https://www.linkedin.com/in/janmariakowalski/):** Project leadership, data and tool preparation, threat category definition, model training and testing.
+*   **[Krzysztof Wróbel](https://www.linkedin.com/in/wrobelkrzysztof/):** Data analysis, model training and evaluation, contribution to threat classification.
+*   **[Jerzy Surma](https://www.linkedin.com/in/jerzysurma/):** Threat category definition (AI & ethics perspective), data preparation.
+*   **[Igor Ciuciura](https://www.linkedin.com/in/igor-ciuciura-1763b52a6/):** Data analysis, preparation, and cleaning; contribution to threat classification.
+*   **[Maciej Krystian Szymański](https://www.linkedin.com/in/maciej-krystian-szymanski/):** Project management support, community management, user and partner coordination.
+We gratefully acknowledge Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2025/018338.
+## 📚 Citation
+No formal citation is available yet. The model is developed by the Bielik.AI community under the SpeakLeash non-profit organization.
+## 📧 Model Card Contact
+For questions about this model, please contact the Bielik.AI community through **[guard.bielik.ai](https://guard.bielik.ai/)**.
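The three comparison metrics defined in the README (Precision, Alert Rate, FPR (Global)) can be computed directly from confusion-matrix counts. The sketch below is illustrative only: the function name `comparison_metrics` is hypothetical, and the counts (74 true positives and 36 false positives out of 3,000 prompts) are not published in the README. They are an assumption, chosen because they reproduce the Bielik Guard 0.1B row of the comparison table.

```python
def comparison_metrics(tp: int, fp: int, total: int) -> dict:
    """Return precision, alert rate and global FP rate as percentages.

    `total` is TP + FP + TN + FN, i.e. every evaluated prompt.
    """
    flagged = tp + fp
    return {
        # Share of flagged prompts that were truly harmful.
        "precision": 100.0 * tp / flagged if flagged else 0.0,
        # Share of all prompts that raised an alert.
        "alert_rate": 100.0 * flagged / total,
        # Share of all prompts that were flagged by mistake.
        "fpr_global": 100.0 * fp / total,
    }


# Hypothetical counts (assumption, not taken from the README).
metrics = comparison_metrics(tp=74, fp=36, total=3000)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```

With these assumed counts the three values round to 67.27%, 3.67% and 1.20%. Because the global rate divides by all prompts rather than by safe prompts only, it equals the alert rate multiplied by (1 - precision).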

config.json ADDED

	@@ -0,0 +1,42 @@

+{
+  "architectures": [
+    "RobertaForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "eos_token_id": 2,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "self-harm",
+    "1": "hate",
+    "2": "vulgar",
+    "3": "sex",
+    "4": "crime"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "self-harm": 0,
+    "hate": 1,
+    "vulgar": 2,
+    "sex": 3,
+    "crime": 4
+  },
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "roberta",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "problem_type": "multi_label_classification",
+  "torch_dtype": "float32",
+  "transformers_version": "4.53.2",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 50001
+}
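Because `problem_type` is `multi_label_classification`, each of the five labels is scored independently (the Transformers text-classification pipeline applies a sigmoid per label for this problem type), so the scores do not need to sum to one. The following is a minimal sketch of that mapping in plain Python. The logits, the helper name `flag_labels`, and the 0.5 threshold are assumptions for illustration; the model card does not prescribe an operating threshold.

```python
import math

# Label order copied from the "id2label" mapping in config.json.
ID2LABEL = {0: "self-harm", 1: "hate", 2: "vulgar", 3: "sex", 4: "crime"}


def sigmoid(x: float) -> float:
    """Squash one logit into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def flag_labels(logits, threshold=0.5):
    """Return per-label scores and the labels whose score meets the threshold."""
    scores = {ID2LABEL[i]: sigmoid(z) for i, z in enumerate(logits)}
    flagged = [label for label, score in scores.items() if score >= threshold]
    return scores, flagged


# Hypothetical logits for one text (not real model output).
example_logits = [-3.0, 0.4, 2.2, -2.5, -1.0]
scores, flagged = flag_labels(example_logits)
print(flagged)
```

A single text can exceed the threshold for several labels at once, which is what distinguishes this setup from a softmax over mutually exclusive classes.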

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8ef24c0f5d45678a26273dd0f8f66559979fffd25121e8498e45b9dde5315fc2
+size 497811044
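The file size in this LFS pointer gives a quick sanity check on the 124M parameter figure quoted in the README. The sketch below assumes every tensor is stored as 4-byte float32 (consistent with `"torch_dtype": "float32"` in config.json) and ignores the small safetensors header, so the result is a slight upper bound rather than an exact count.

```python
LFS_SIZE_BYTES = 497_811_044  # "size" field of the model.safetensors LFS pointer
BYTES_PER_FLOAT32 = 4         # assumption: all weights are stored as float32

# Upper bound: the file also contains a JSON header describing the tensors.
approx_params = LFS_SIZE_BYTES // BYTES_PER_FLOAT32
print(f"~{approx_params / 1e6:.1f}M parameters (upper bound)")
```

The estimate of roughly 124.5M is in line with the 124M parameters stated for the base encoder.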

special_tokens_map.json ADDED

	@@ -0,0 +1,51 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,65 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50000": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "max_length": 512,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_to_multiple_of": null,
+  "pad_token": "<pad>",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "</s>",
+  "stride": 0,
+  "tokenizer_class": "RobertaTokenizer",
+  "trim_offsets": true,
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "<unk>"
+}
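The `model_max_length` value above is the very large sentinel that Transformers writes when no limit was set, while config.json restricts the encoder to 514 position embeddings. RoBERTa-style models start numbering positions after the padding index, so the usable sequence length is smaller than 514. The sketch below derives a safe truncation length from those values; the helper name `safe_max_length` is hypothetical.

```python
MAX_POSITION_EMBEDDINGS = 514  # config.json
PAD_TOKEN_ID = 1               # config.json
MODEL_MAX_LENGTH = 1000000000000000019884624838656  # tokenizer_config.json sentinel


def safe_max_length(max_positions: int, pad_token_id: int, tokenizer_limit: int) -> int:
    """Longest token sequence that still fits the position embedding table."""
    # Position ids run from pad_token_id + 1 up to pad_token_id + sequence length.
    usable_positions = max_positions - pad_token_id - 1
    return min(usable_positions, tokenizer_limit)


max_len = safe_max_length(MAX_POSITION_EMBEDDINGS, PAD_TOKEN_ID, MODEL_MAX_LENGTH)
print(max_len)
```

The result matches the `max_length` of 512 stored in tokenizer_config.json. For long inputs, one option is to pass the limit explicitly at inference time, for example `classifier(text, truncation=True, max_length=512)`. Whether the published pipeline truncates by default was not verified here.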

unigram.json ADDED

The diff for this file is too large to render. See raw diff