bge-m3-korean-contract-finetuned
This is a sentence-transformers model fine-tuned from BAAI/bge-m3. It maps Korean sentences and paragraphs to a 1024-dimensional dense vector space and is optimized for detecting contract violations and checking compliance in Korean legal documents.
Model Description
The model is fine-tuned on Korean contract violation cases stored in a Neo4j database, using contrastive learning to separate violation sentences from compliant ones. It is intended for semantic search and similarity tasks in the Korean legal domain, particularly for identifying unfair contract terms.
Base Model: BAAI/bge-m3
Fine-tuning Method: Contrastive Learning with CosineSimilarityLoss
Embedding Dimension: 1024
Max Sequence Length: 8192
Usage (Sentence-Transformers)
The model can be used directly once sentence-transformers is installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer('moksil/bge-m3-korean-contract-finetuned')
# Example: Korean contract sentences
sentences = [
"회사는 귀책사유 없이 일체의 책임을 지지 않습니다.", # Violation sentence (the company disclaims all liability)
"회사는 고의 또는 중대한 과실로 인한 손해를 배상합니다." # Compliant sentence (the company compensates for damage caused by intent or gross negligence)
]
# Generate embeddings
embeddings = model.encode(sentences, normalize_embeddings=True)
print(f"Embeddings shape: {embeddings.shape}") # (2, 1024)
# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])
print(f"Similarity: {similarity[0][0]:.4f}")
Use Cases
- Contract Violation Detection: Identify unfair terms in Korean contracts
- Semantic Search: Find similar violation cases in legal databases
- Compliance Checking: Compare contract terms against compliant standards
- Legal Document Analysis: Extract and compare legal clauses
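The semantic search and compliance checking use cases above both come down to ranking stored clauses by cosine similarity to a query embedding. A minimal sketch, assuming the embeddings have already been produced with model.encode; the function names and the toy 3-dimensional vectors are illustrative stand-ins, not part of the model or the library:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, corpus_vecs, top_k=3):
    """Return (index, score) pairs for the top_k corpus vectors closest to the query."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy stand-ins for 1024-dimensional embeddings (hypothetical values).
query = [1.0, 0.0, 0.0]
corpus = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
results = rank_by_similarity(query, corpus, top_k=2)
print(results)
```

In practice query would be the embedding of a clause under review and corpus the embeddings of known violation cases; any score threshold for flagging a clause would need to be chosen on held-out data.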
Evaluation Results
No quantitative evaluation results are reported for this model; the fit() parameters listed under Training show that no evaluator was used. The training data consists of violation cases, each pairing an original violation sentence with its corrected compliant version.
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
Training
The model was fine-tuned on Korean contract violation data using contrastive learning with triplet examples:
- Anchor: Original violation sentence
- Positive: Other violation sentences from the same article
- Negative: Corrected compliant sentence
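The triplet scheme above can be sketched as follows. The record layout (the keys article, violation and corrected) and the function build_triplets are hypothetical; the actual Neo4j schema and generation script are not given here.

```python
from collections import defaultdict

def build_triplets(cases):
    """Build (anchor, positive, negative) triplets from violation records.

    Each record is assumed to be a dict with the hypothetical keys
    'article', 'violation' and 'corrected'.
    """
    by_article = defaultdict(list)
    for case in cases:
        by_article[case["article"]].append(case)

    triplets = []
    for group in by_article.values():
        for anchor in group:
            for other in group:
                if other is anchor:
                    continue  # a sentence is not its own positive
                triplets.append(
                    (anchor["violation"], other["violation"], anchor["corrected"])
                )
    return triplets

# Placeholder records; real data would be Korean contract sentences.
cases = [
    {"article": "A1", "violation": "v1", "corrected": "c1"},
    {"article": "A1", "violation": "v2", "corrected": "c2"},
    {"article": "A2", "violation": "v3", "corrected": "c3"},
]
triplets = build_triplets(cases)
print(triplets)
```

Under this reading, an article with only one violation case (A2 here) yields no triplets, because it has no other violation sentence to serve as the positive.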
Training Data:
- Source: Neo4j database (ViolationCase nodes)
- Training examples: Generated triplets from violation cases
- Language: Korean
DataLoader:
torch.utils.data.dataloader.DataLoader of length 430 with parameters:
{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss:
sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss
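In effect, training raises the cosine similarity between anchor and positive examples and lowers it between anchor and negative examples. Note that CosineSimilarityLoss is defined over labeled sentence pairs rather than triplets, so the triplets are presumably split into pairs before training.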
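A sketch of that objective under an explicit assumption: each triplet is split into an anchor-positive pair labeled 1.0 and an anchor-negative pair labeled 0.0. The loss is then the mean squared error between each pair's cosine similarity and its label, which is how sentence-transformers defines CosineSimilarityLoss. The label values and helper names below are assumptions, not taken from the training script.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_to_pairs(anchor, positive, negative):
    """Split one triplet into two labeled pairs (assumed labels: 1.0 and 0.0)."""
    return [(anchor, positive, 1.0), (anchor, negative, 0.0)]

def cosine_similarity_loss(pairs):
    """Mean squared error between cosine similarity and the target label."""
    errors = [(cosine(u, v) - label) ** 2 for u, v, label in pairs]
    return sum(errors) / len(errors)

# Toy 2-dimensional embeddings (hypothetical values).
anchor, positive, negative = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]

# Positive aligned with the anchor, negative orthogonal: nothing to correct.
well_separated = cosine_similarity_loss(triplet_to_pairs(anchor, positive, negative))

# Negative identical to the anchor: the anchor-negative pair is fully penalized.
collapsed = cosine_similarity_loss(triplet_to_pairs(anchor, positive, anchor))

print(well_separated, collapsed)
```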
Parameters of the fit()-Method:
{
"epochs": 3,
"evaluation_steps": 0,
"evaluator": "NoneType",
"max_grad_norm": 1,
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
"optimizer_params": {
"lr": 2e-05
},
"scheduler": "WarmupLinear",
"steps_per_epoch": null,
"warmup_steps": 100,
"weight_decay": 0.01
}
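With the DataLoader length of 430 and 3 epochs listed above, training runs for 430 × 3 = 1290 optimizer steps when steps_per_epoch is left unset. The WarmupLinear scheduler raises the learning rate linearly from 0 to its peak over the warmup steps, then decays it linearly to 0 at the final step. A small sketch of that schedule in plain Python; the function warmup_linear_lr is illustrative and is not the library's own code:

```python
def warmup_linear_lr(step, peak_lr=2e-5, warmup_steps=100, total_steps=1290):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

total_steps = 430 * 3  # batches per epoch times epochs
schedule = {
    step: warmup_linear_lr(step, total_steps=total_steps)
    for step in (0, 50, 100, 695, 1290)
}
for step, lr in schedule.items():
    print(step, lr)
```

The warmup covers 100 of 1290 steps, roughly the first 8% of training.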
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
(2): Normalize()
)
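The last two modules above take the hidden state of the first ([CLS]) token as the sentence embedding and scale it to unit length. A toy sketch of those two steps on plain Python lists; the token vectors are made up and the helper names are illustrative:

```python
import math

def cls_pool(token_embeddings):
    """Use the first token's vector as the sentence embedding (CLS pooling)."""
    return list(token_embeddings[0])

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Three made-up token vectors; the real model uses 1024 dimensions.
tokens = [[3.0, 4.0], [10.0, -2.0], [0.5, 0.5]]
sentence_embedding = l2_normalize(cls_pool(tokens))
print(sentence_embedding)  # [0.6, 0.8]
```

Because the output has unit length, the cosine similarity between two embeddings equals their dot product.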
Citation
If you use this model, please cite the base model:
@misc{bge-m3,
title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
year={2024},
eprint={2402.03216},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Authors
- Base Model: BAAI/bge-m3 by Beijing Academy of Artificial Intelligence
- Fine-tuning: Fine-tuned for Korean contract violation detection
License
This model follows the same license as the base model BAAI/bge-m3.