---
license: apache-2.0
language:
- sr
metrics:
- f1
- recall
- precision
- accuracy
base_model:
- classla/bcms-bertic
pipeline_tag: token-classification
library_name: transformers
tags:
- NER
- Serbian language
- Named Entity Recognition
- Legal
- NLP
- BERT
- Legal documents
- Court ruling
contributors:
- Vladimir Kalušev [Hugging Face](https://huggingface.co/kalusev)
- Branko Brkljač [Hugging Face](https://huggingface.co/brkljac)
---
# NER4Legal_SRB
## Model Description
NER4Legal_SRB is a fine-tuned Named Entity Recognition (NER) model designed for processing Serbian legal documents. It was created as part of the conference paper "Named Entity Recognition for Serbian Legal Documents: Design, Methodology and Dataset Development", accepted for publication at the 15th International Conference on Information Society and Technology, Kopaonik, Serbia, March 9-12, 2025. The model aims to automate tasks involving legal documents, such as document archiving, search, and retrieval. It builds on the pre-trained [classla/bcms-bertic](https://huggingface.co/classla/bcms-bertic) BERT model, adapted to the task of identifying and classifying a predefined set of named entities in Serbian legal texts. The model can be run on both CPU and GPU. The released model was trained on all data from the NER4Legal_SRB dataset described in the reference paper.
## Abstract
Recent advancements in the field of natural language processing (NLP) and especially large language models (LLMs) and their numerous applications have brought research attention to the design of different document processing tools and enhancements in the process of document archiving, search, and retrieval. The domain of official legal documents is especially interesting due to the vast amount of data generated daily, as well as the significant community of interested practitioners (lawyers, law offices, administrative workers, state institutions, and citizens). Providing efficient ways for automation of everyday work involving legal documents is therefore expected to have significant impact in different fields.
In this work, we present one LLM-based solution for Named Entity Recognition (NER) in the case of legal documents written in Serbian language. It leverages the pre-trained bidirectional encoder representations from transformers (BERT), carefully adapted to the specific task of identifying and classifying specific data points from textual content. Besides novel dataset development for Serbian language (involving public court rulings), presented system design and applied methodology, the paper also discusses achieved performance metrics and their implications for objective assessment of the proposed solution. Performed cross-validation tests on the created manually labeled dataset with a mean F1 score of 0.96 and additional results on the examples of intentionally modified text inputs confirm applicability of the proposed system design and robustness of the developed NER solution.
## Base Model
The model is fine-tuned from the [classla/bcms-bertic](https://huggingface.co/classla/bcms-bertic) base model, which is a pre-trained BERT model designed for the BCMS (Bosnian, Croatian, Montenegrin, Serbian) languages.
## Dataset
This model was fine-tuned on a manually labeled dataset of Serbian legal documents, including public court rulings. The dataset was specifically developed for this task to enable precise identification and classification of entities in Serbian legal texts.
## Performance Metrics
The model achieved a mean F1 score of 0.96 in cross-validation tests on the manually labeled dataset. For details of the evaluation procedure and the reported results, please consult the original conference paper.
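For readers who want to score their own predictions, the following is a minimal sketch of how precision, recall, and F1 can be computed from gold and predicted label sequences. It assumes token-level micro-averaging over all labels other than `O`; the paper's exact evaluation protocol may differ, and the function name `token_prf` and the example labels are illustrative only.

```python
def token_prf(gold, pred, outside="O"):
    """Micro-averaged precision, recall and F1 over non-outside token labels.

    Assumption: scores are computed per token, not per entity span.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # True positive: an entity label predicted exactly right.
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != outside)
    # False positive: an entity label predicted where the gold label differs.
    fp = sum(1 for g, p in zip(gold, pred) if p != outside and g != p)
    # False negative: a gold entity label that was not reproduced.
    fn = sum(1 for g, p in zip(gold, pred) if g != outside and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hand-written toy sequences, not real model output.
gold = ["B-COURT", "I-COURT", "O", "B-DATE"]
pred = ["B-COURT", "O", "O", "B-DATE"]
precision, recall, f1 = token_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case one gold entity token is missed, so precision stays at 1.0 while recall drops to 2/3.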
## Contributors
- Vladimir Kalušev [https://huggingface.co/kalusev](https://huggingface.co/kalusev)
- Branko Brkljač [https://huggingface.co/brkljac](https://huggingface.co/brkljac), [https://brkljac.github.io/](https://brkljac.github.io/)
## Usage
Here’s how to use the model in Python:
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load the model and tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("kalusev/NER4Legal_SRB", use_auth_token=True)
model = AutoModelForTokenClassification.from_pretrained("kalusev/NER4Legal_SRB", use_auth_token=True).to(device)

# Define the label mapping (id_to_label)
id_to_label = {
    0: 'O',
    1: 'B-COURT',
    2: 'B-DATE',
    3: 'B-DECISION',
    4: 'B-LAW',
    5: 'B-MONEY',
    6: 'B-OFFICIAL GAZZETE',
    7: 'B-PERSON',
    8: 'B-REFERENCE',
    9: 'I-COURT',
    10: 'I-LAW',
    11: 'I-MONEY',
    12: 'I-OFFICIAL GAZZETE',
    13: 'I-PERSON',
    14: 'I-REFERENCE'
}

# NER with GPU/CPU fallback
def perform_ner(text):
    """
    Perform Named Entity Recognition on a single text with GPU memory fallback.

    Args:
        text (str): Input text.

    Returns:
        list: List of (token, predicted label) tuples.
    """
    try:
        # Tokenize the input text and move it to the model's current device
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
        # Get predictions from the model
        with torch.no_grad():
            outputs = model(**inputs)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=2)[0].tolist()
    except RuntimeError as e:
        if "CUDA out of memory" in str(e):
            print("Switching to CPU due to memory constraints.")
            model.cpu()  # The model stays on the CPU for later calls as well
            inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
            with torch.no_grad():
                outputs = model(**inputs)
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=2)[0].tolist()
        else:
            raise  # Re-raise other exceptions
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    labels = [id_to_label[pred] for pred in predictions]
    # Filter out special tokens
    results = [
        (token, label)
        for token, label in zip(tokens, labels)
        if token not in tokenizer.all_special_tokens
    ]
    return results

# Example usage
text = """Rešenjem Apelacionog suda u Novom Sadu, Gž1. 1901/10 od 12.05.2010. godine žalba tuženog je usvojena, a presuda Opštinskog suda u Novom Sadu, P. 5734/04 od 29.01.2009. godine, ukinuta i predmet upućen ovom sudu na ponovno suđenje."""

# Perform NER
results = perform_ner(text)

# Print tokens and labels as a formatted table
print("Token             | Predicted Label")
print("----------------------------------------")
for token, label in results:
    print(f"{token:<17} | {label}")
```
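`perform_ner` returns one label per sub-word token, so a downstream application usually needs to merge those tokens into whole entities. The following is a minimal sketch of one way to do that. It assumes the tokenizer marks word continuations with a `##` prefix (WordPiece style) and that labels follow the BIO scheme listed above; the helper name `group_entities` and the sample pairs are hypothetical, hand-written for illustration, and are not actual model output.

```python
def group_entities(pairs):
    """Merge (token, label) pairs into (entity_text, entity_type) spans."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity being built, if any, and reset the state.
        nonlocal current_tokens, current_type
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, label in pairs:
        if token.startswith("##"):
            # Assumed WordPiece continuation: glue it onto the previous piece
            # when an entity is open, otherwise ignore it.
            if current_tokens:
                current_tokens[-1] += token[2:]
            continue
        if label.startswith("B-"):
            flush()
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # 'O' or an I- tag that does not continue the open entity.
            flush()
    flush()
    return entities


# Hand-written illustrative input in the same shape as perform_ner() output.
sample = [
    ("Apelacionog", "B-COURT"), ("suda", "I-COURT"), ("u", "I-COURT"),
    ("Novom", "I-COURT"), ("Sadu", "I-COURT"), (",", "O"),
    ("Gž", "B-REFERENCE"), ("##1", "I-REFERENCE"),
]
for entity_text, entity_type in group_entities(sample):
    print(f"{entity_type:<10} | {entity_text}")
```

Joining tokens with single spaces is a simplification: punctuation inside an entity (for example in case numbers or dates) will come out space-separated, so mapping back to character offsets in the original text would be a better fit where exact strings matter.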
<img src="SRB4Legal_NER performance in presence of noisy inputs.jpg" alt="SRB4Legal_NER performance in presence of noisy inputs" style="max-width: 100%; height: auto;" />
### If you would like to use this software, please consider citing the following publication:
- Kalušev, V., Brkljač, B. (2025). **Named entity recognition for Serbian legal documents: Design, methodology and dataset development**. In *Proceedings of the 15th International Conference on Information Society and Technology (ICIST)*, Kopaonik, Serbia, 9-12 March 2025. Accepted for publication.
<!--[](doi_link) -->
<pre><code>
@inproceedings{KalusevNER2025,
  author = {Kalu{\v{s}}ev, Vladimir and Brklja{\v{c}}, Branko},
  booktitle = {15th International Conference on Information Society and Technology (ICIST)},
  doi = {-},
  month = mar,
  pages = {1--16},
  title = {Named entity recognition for Serbian legal documents: {D}esign, methodology and dataset development},
  year = {2025}
}
</code></pre>
<pre><code>
@misc{kalušev2025namedentityrecognitionserbian,
  title={Named entity recognition for Serbian legal documents: Design, methodology and dataset development},
  author={Vladimir Kalušev and Branko Brkljač},
  year={2025},
  eprint={2502.10582},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.10582},
}
</code></pre>