---
license: apache-2.0
language:
- sr
metrics:
- f1
- recall
- precision
- accuracy
base_model:
- classla/bcms-bertic
pipeline_tag: token-classification
library_name: transformers
tags:
- NER
- Serbian
- Named Entity Recognition
- Legal
- NLP
- BERT
- Legal documents
- Court ruling
contributors:
- Vladimir Kalušev [Hugging Face](https://huggingface.co/kalusev)
- Branko Brkljač [Hugging Face](https://huggingface.co/brkljac)
---

# NER4Legal_SRB

## Model Description

NER4Legal_SRB is a fine-tuned Named Entity Recognition (NER) model for processing Serbian legal documents. It was created as part of the conference paper "Named Entity Recognition for Serbian Legal Documents: Design, Methodology and Dataset Development." The model aims to automate tasks involving legal documents, such as document archiving, search, and retrieval. It leverages the classla/bcms-bertic pre-trained BERT model, carefully adapted to the specific task of identifying and classifying entities in Serbian legal texts.

## Abstract

Recent advancements in the field of natural language processing (NLP), especially large language models (LLMs) and their numerous applications, have brought research attention to the design of different document processing tools and to enhancements in the process of document archiving, search, and retrieval. The domain of official legal documents is especially interesting due to the vast amount of data generated daily, as well as the significant community of interested practitioners (lawyers, law offices, administrative workers, state institutions, and citizens). Providing efficient ways to automate everyday work involving legal documents is therefore expected to have a significant impact in different fields. In this work, we present one LLM-based solution for Named Entity Recognition (NER) in the case of legal documents written in the Serbian language.
It leverages the pre-trained bidirectional encoder representations from transformers (BERT), carefully adapted to the specific task of identifying and classifying specific data points from textual content. Besides the development of a novel dataset for the Serbian language (involving public court rulings), the presented system design, and the applied methodology, the paper also discusses the achieved performance metrics and their implications for objective assessment of the proposed solution. Cross-validation tests on the manually labeled dataset, which yielded a mean F1 score of 0.96, and additional results on intentionally modified text inputs confirm the applicability of the proposed system design and the robustness of the developed NER solution.

## Base Model

The model is fine-tuned from the [classla/bcms-bertic-ner](https://huggingface.co/classla/bcms-bertic-ner) base model, which is a pre-trained BERT model designed for the BCMS (Bosnian, Croatian, Montenegrin, Serbian) languages.

## Dataset

This model was fine-tuned on a manually labeled dataset of Serbian legal documents, including public court rulings. The dataset was developed specifically for this task to enable precise identification and classification of entities in Serbian legal texts.

## Performance Metrics

The model achieved a mean F1 score of 0.96 in cross-validation tests on the manually labeled dataset. Additional tests on intentionally modified text inputs support the robustness of the approach.

## Usage

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/NER4Legal_SRB")
model = AutoModelForTokenClassification.from_pretrained("your-username/NER4Legal_SRB")

# Example usage
text = """OBAVEZUJE SE tuženi da tužilji, na ime razlike između isplaćene i pripadajuće naknade zarade, za sporni period od 21.12.2001. godine pa do 08.02.2006. godine, isplati iznos od 528.161,32 dinara sa zakonskom zateznom kamatom počev od 16.04.2008. godine pa do isplate, u roku od 15 dana, pod pretnjom prinudnog izvršenja.
ODBIJA SE tužbeni zahtev, u delu u kom je tužilja tražila da sud obaveže tuženog da joj, na ime neisplaćenog regresa za korišćenje godišnjeg odmora, za period od 2000.-2005. godine, isplati iznos do 171.531,96 dinara sa zakonskom zateznom kamatom počev od 16.04.2008. godine pa do isplate, u roku od 15 dana, pod pretnjom prinudnog izvršenja.
OBAVEZUJE SE tuženi da tužilji, na ime troškova parničnog postupka, isplati iznos od 335.800,00 dinara, sa zakonskom zateznom kamatom od presuđenja, tj. od 06.04.2012. . godine pa do isplate, u roku od 15 dana, pod pretnjom prinudnog izvršenja.
OSLOBAĐA SE tužilja obaveze plaćanja sudskih taksi."""

# Tokenize the text (inputs longer than the model limit are truncated)
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

print(outputs)
```
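
The usage snippet above prints the raw model output, which contains per-token logits rather than readable entities. The following is a minimal sketch of one way to turn predicted label ids into entity spans. It makes several assumptions that this card does not confirm: that the labels follow a BIO scheme (`B-…`, `I-…`, `O`), that predictions have already been aligned to whole words rather than subword pieces, and that the label names `MONEY` and `DATE` exist. Those names are hypothetical and used only for illustration. With the real model, the label ids would typically come from `outputs.logits.argmax(dim=-1)` and the mapping from `model.config.id2label`.

```python
from typing import Dict, List, Tuple


def group_entities(
    tokens: List[str], label_ids: List[int], id2label: Dict[int, str]
) -> List[Tuple[str, str]]:
    """Merge consecutive BIO-tagged tokens into (entity_type, text) pairs.

    Assumes a BIO label scheme and word-level tokens (both are assumptions,
    not something stated by the model card).
    """
    entities: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    for token, label_id in zip(tokens, label_ids):
        label = id2label[label_id]
        if label == "O":
            prefix, entity_type = "O", None
        else:
            prefix, _, entity_type = label.partition("-")

        # An I- tag of the same type extends the entity that is currently open.
        if prefix == "I" and entity_type == current_type:
            current_tokens.append(token)
            continue

        # Any other tag closes the open entity, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))

        # A B- tag (or a stray I- tag) opens a new entity; O opens nothing.
        if entity_type is not None:
            current_type, current_tokens = entity_type, [token]
        else:
            current_type, current_tokens = None, []

    # Flush an entity that runs to the end of the sequence.
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hypothetical label mapping and predictions, for illustration only.
example_id2label = {0: "O", 1: "B-MONEY", 2: "I-MONEY", 3: "B-DATE", 4: "I-DATE"}
example_tokens = ["isplati", "iznos", "od", "528.161,32", "dinara", "od", "16.04.2008."]
example_label_ids = [0, 0, 0, 1, 2, 0, 3]

entities = group_entities(example_tokens, example_label_ids, example_id2label)
print(entities)
```

For production use, the `token-classification` pipeline in `transformers` with an `aggregation_strategy` may be a simpler alternative, since it handles subword merging itself. The sketch above is only meant to show the grouping logic.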