---
license: apache-2.0
language:
- sr
metrics:
- f1
- recall
- precision
- accuracy
base_model:
- classla/bcms-bertic
pipeline_tag: token-classification
library_name: transformers
tags:
- NER
- Serbian language
- Named Entity Recognition
- Legal
- NLP
- BERT
- Legal documents
- Court ruling

contributors:
  - Vladimir Kalušev [Hugging Face](https://huggingface.co/kalusev)
  - Branko Brkljač [Hugging Face](https://huggingface.co/brkljac)
---
# NER4Legal_SRB

## Model Description

NER4Legal_SRB is a fine-tuned Named Entity Recognition (NER) model for processing Serbian legal documents. It was created as part of the conference paper "Named Entity Recognition for Serbian Legal Documents: Design, Methodology and Dataset Development", accepted for publication at the 15th International Conference on Information Society and Technology, Kopaonik, Serbia, March 9-12, 2025. The model aims to automate tasks involving legal documents, such as document archiving, search, and retrieval. It builds on the [classla/bcms-bertic](https://huggingface.co/classla/bcms-bertic) pre-trained BERT model, adapted to the task of identifying and classifying a predefined set of entities in Serbian legal texts. The model can run on both CPU and GPU. The provided model was trained on all data from the NER4Legal_SRB dataset described in the reference paper.

## Abstract

Recent advancements in the field of natural language processing (NLP) and especially large language models (LLMs) and their numerous applications have brought research attention to the design of different document processing tools and enhancements in the process of document archiving, search, and retrieval. The domain of official legal documents is especially interesting due to the vast amount of data generated daily, as well as the significant community of interested practitioners (lawyers, law offices, administrative workers, state institutions, and citizens). Providing efficient ways for automation of everyday work involving legal documents is therefore expected to have significant impact in different fields.

In this work, we present one LLM-based solution for Named Entity Recognition (NER) in the case of legal documents written in Serbian language. It leverages the pre-trained bidirectional encoder representations from transformers (BERT), carefully adapted to the specific task of identifying and classifying specific data points from textual content. Besides novel dataset development for Serbian language (involving public court rulings), presented system design and applied methodology, the paper also discusses achieved performance metrics and their implications for objective assessment of the proposed solution. Performed cross-validation tests on the created manually labeled dataset with a mean F1 score of 0.96 and additional results on the examples of intentionally modified text inputs confirm applicability of the proposed system design and robustness of the developed NER solution.

## Base Model

The model is fine-tuned from the [classla/bcms-bertic](https://huggingface.co/classla/bcms-bertic) base model, which is a pre-trained BERT model designed for the BCMS (Bosnian, Croatian, Montenegrin, Serbian) languages.

## Dataset

This model was fine-tuned on a manually labeled dataset of Serbian legal documents, including public court rulings. The dataset was specifically developed for this task to enable precise identification and classification of entities in Serbian legal texts.

## Performance Metrics

The model achieved a mean F1 score of 0.96 in cross-validation tests on the manually labeled dataset. Together with additional tests on intentionally modified (noisy) text inputs, this result supports the robustness of the model and its applicability to real-world documents. For details of the evaluation procedure and the full results, please consult the original conference paper.
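
As a reminder of how such a score is defined, the sketch below computes precision, recall, and F1 from counts of true positives, false positives, and false negatives, and then averages F1 over folds. The helper name `prf1` and the per-fold counts are hypothetical and purely illustrative; they are not results from the paper, and the paper itself describes how its predictions were actually scored.

```python
def prf1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts (illustrative helper)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical per-fold counts (tp, fp, fn); NOT taken from the paper
fold_counts = [(90, 10, 10), (80, 20, 20), (50, 0, 50)]

# F1 is computed per fold and then averaged, as in a mean cross-validation score
fold_f1 = [prf1(*counts)[2] for counts in fold_counts]
mean_f1 = sum(fold_f1) / len(fold_f1)
print(f"per-fold F1: {[round(f, 3) for f in fold_f1]}, mean F1: {mean_f1:.3f}")
```

The third fold shows why F1 is reported rather than precision alone: perfect precision with low recall still yields a modest F1.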

## Contributors
  - Vladimir Kalušev [https://huggingface.co/kalusev](https://huggingface.co/kalusev)
  - Branko Brkljač [https://huggingface.co/brkljac](https://huggingface.co/brkljac), [https://brkljac.github.io/](https://brkljac.github.io/)

## Usage

Here’s how to use the model in Python:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load the model and tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# The repository is assumed to be public; pass token=True to from_pretrained if it requires authentication
tokenizer = AutoTokenizer.from_pretrained("kalusev/NER4Legal_SRB")
model = AutoModelForTokenClassification.from_pretrained("kalusev/NER4Legal_SRB").to(device)

# Define the label mapping (id_to_label)
id_to_label = {
    0: 'O',
    1: 'B-COURT',
    2: 'B-DATE',
    3: 'B-DECISION',
    4: 'B-LAW',
    5: 'B-MONEY',
    6: 'B-OFFICIAL GAZZETE',
    7: 'B-PERSON',
    8: 'B-REFERENCE',
    9: 'I-COURT',
    10: 'I-LAW',
    11: 'I-MONEY',
    12: 'I-OFFICIAL GAZZETE',
    13: 'I-PERSON',
    14: 'I-REFERENCE'
}

# NER with GPU/CPU fallback
def perform_ner(text):
    """
    Perform Named Entity Recognition on a single text with GPU memory fallback.
    Args:
        text (str): Input text.
    Returns:
        list: List of tokens and predicted labels.
    """
    try:
        # Tokenize and move the inputs to whichever device the model is currently on
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
        # Get predictions from the model
        with torch.no_grad():
            outputs = model(**inputs)

    except RuntimeError as e:
        if "CUDA out of memory" not in str(e):
            raise  # Re-raise other exceptions unchanged
        print("Switching to CPU due to memory constraints.")
        # model.cpu() moves the model in place, so later calls also run on CPU
        model.cpu()
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            outputs = model(**inputs)

    # One label id per token; index the single batch element explicitly
    predictions = torch.argmax(outputs.logits, dim=2)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    labels = [id_to_label[pred] for pred in predictions]

    # Filter out special tokens
    results = [
        (token, label)
        for token, label in zip(tokens, labels)
        if token not in tokenizer.all_special_tokens
    ]
    return results

# Example usage
text = """Rešenjem Apelacionog suda u Novom Sadu, Gž1. 1901/10 od 12.05.2010. godine žalba tuženog je usvojena, a presuda Opštinskog suda u Novom Sadu, P. 5734/04 od 29.01.2009. godine, ukinuta i predmet upućen ovom sudu na ponovno suđenje."""

# Perform NER
results = perform_ner(text)

# Print tokens and labels as a formatted table
print("Token             | Predicted Label")
print("----------------------------------------")
for token, label in results:
    print(f"{token:<17} | {label}")

```
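
The function above returns one label per tokenizer piece, so a multi-word entity is spread over several rows. A minimal sketch of merging those rows into whole entities follows. It assumes WordPiece-style continuation pieces prefixed with `##` and the BIO labels listed in `id_to_label`; the helper name `group_entities` and the sample pairs are hypothetical and hand-written, not actual model output.

```python
def group_entities(token_label_pairs):
    """Merge (token, BIO label) pairs into (entity_text, entity_type) tuples."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity that is currently open, if any
        nonlocal current_tokens, current_type
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, label in token_label_pairs:
        if token.startswith("##"):
            # Subword continuation: glue it onto the last piece of an open entity;
            # continuations outside an entity are skipped
            if current_tokens:
                current_tokens[-1] += token[2:]
            continue
        if label.startswith("B-"):
            flush()
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            flush()
    flush()
    return entities

# Hand-written sample pairs for illustration (assumption: not real predictions)
sample = [
    ("Rešenjem", "O"),
    ("Apelacionog", "B-COURT"), ("suda", "I-COURT"), ("u", "I-COURT"),
    ("Novom", "I-COURT"), ("Sadu", "I-COURT"), (",", "O"),
    ("tužen", "O"), ("##og", "O"),
    ("Petra", "B-PERSON"), ("Petrović", "I-PERSON"), ("##a", "I-PERSON"),
]
for entity_text, entity_type in group_entities(sample):
    print(f"{entity_type:<10} | {entity_text}")
```

With the model loaded, the same helper can be applied as `group_entities(perform_ner(text))`. Pieces are joined with single spaces, so the original spacing around punctuation is not restored.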

<img src="SRB4Legal_NER performance in presence of noisy inputs.jpg" alt="SRB4Legal_NER performance in presence of noisy inputs" style="max-width: 100%; height: auto;" />

### If you use this software, please consider citing the following publication:

- Kalušev, V., Brkljač, B. (2025). **Named entity recognition for Serbian legal documents: Design, methodology and dataset development**. In *Proceedings of the 15th International Conference on Information Society and Technology (ICIST)*, Kopaonik, Serbia, 9-12 March 2025 (accepted for publication).
<!--[![DOI:number](TBA)](doi_link)  -->

<pre><code>
@inproceedings{KalusevNER2025,
    author = {Kalu{\v{s}}ev, Vladimir and Brklja{\v{c}}, Branko},
    booktitle = {15th International Conference on Information Society and Technology (ICIST)},
    doi = {-},
    month = mar,
    pages = {1--16},
    title = {Named entity recognition for Serbian legal documents: {D}esign, methodology and dataset development},
    year = {2025}
}
</code></pre>

<pre><code>
@misc{kalušev2025namedentityrecognitionserbian,
      title={Named entity recognition for Serbian legal documents: Design, methodology and dataset development},
      author={Vladimir Kalušev and Branko Brkljač},
      year={2025},
      eprint={2502.10582},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.10582},
}
</code></pre>