bge-m3-korean-contract-finetuned

This is a fine-tuned sentence-transformers model based on BAAI/bge-m3. It maps Korean sentences and paragraphs to a 1024-dimensional dense vector space and is specifically optimized for detecting contract violations and checking compliance in Korean legal documents.

Model Description

This model is fine-tuned on Korean contract violation cases from a Neo4j database, using contrastive learning to distinguish violation sentences from compliant sentences. It is designed for semantic search and similarity tasks in the Korean legal domain, particularly for identifying unfair contract terms.

Base Model: BAAI/bge-m3
Fine-tuning Method: Contrastive Learning with CosineSimilarityLoss
Embedding Dimension: 1024
Max Sequence Length: 8192
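
These settings can be verified directly on the loaded model via standard sentence-transformers attributes:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('moksil/bge-m3-korean-contract-finetuned')
print(model.get_sentence_embedding_dimension())  # 1024
print(model.max_seq_length)                      # 8192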

Usage (Sentence-Transformers)

Using this model is straightforward once you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the model
model = SentenceTransformer('moksil/bge-m3-korean-contract-finetuned')

# Example: Korean contract sentences
sentences = [
    "회사는 귀책사유 없이 일체의 책임을 지지 않습니다.",  # Violation: disclaims all liability even without fault
    "회사는 고의 또는 중대한 과실로 인한 손해를 배상합니다."  # Compliant: accepts liability for intent or gross negligence
]

# Generate embeddings (unit-length vectors, since normalize_embeddings=True)
embeddings = model.encode(sentences, normalize_embeddings=True)
print(f"Embeddings shape: {embeddings.shape}")  # (2, 1024)

# Calculate cosine similarity between the two sentences
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])
print(f"Similarity: {similarity[0][0]:.4f}")

Use Cases

  • Contract Violation Detection: Identify unfair terms in Korean contracts
  • Semantic Search: Find similar violation cases in legal databases (a minimal search sketch follows this list)
  • Compliance Checking: Compare contract terms against compliant standards
  • Legal Document Analysis: Extract and compare legal clauses
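
As an illustration of the semantic-search use case, here is a minimal sketch that ranks a small corpus of clauses against a query clause. The corpus sentences and the query are invented for the example; only the model ID comes from this card.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('moksil/bge-m3-korean-contract-finetuned')

# Hypothetical mini-corpus of contract clauses
corpus = [
    "회사는 귀책사유 없이 일체의 책임을 지지 않습니다.",
    "회사는 고의 또는 중대한 과실로 인한 손해를 배상합니다.",
    "이용자는 언제든지 계약을 해지할 수 있습니다.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Query: a clause suspected to be an unfair term
query = "회사는 어떠한 경우에도 책임을 부담하지 않습니다."
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus clauses by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")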

Evaluation Results

This model was fine-tuned on Korean contract violation cases from a Neo4j database using contrastive learning. The training data consists of violation cases pairing original violation sentences with their corrected compliant versions.

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was fine-tuned on Korean contract violation data using contrastive learning with triplet examples (a data-preparation sketch follows the list):

  • Anchor: Original violation sentence
  • Positive: Other violation sentences from the same article
  • Negative: Corrected compliant sentence
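
The card does not include the data-preparation code. Since CosineSimilarityLoss consumes labeled sentence pairs rather than raw triplets, one plausible construction is to flatten each triplet into a high-similarity pair and a low-similarity pair, as in this sketch (the second violation sentence is invented for illustration):

from sentence_transformers import InputExample

# One triplet derived from a ViolationCase node
anchor   = "회사는 귀책사유 없이 일체의 책임을 지지 않습니다."   # original violation sentence
positive = "회사는 어떠한 경우에도 책임을 부담하지 않습니다."   # violation from the same article
negative = "회사는 고의 또는 중대한 과실로 인한 손해를 배상합니다."  # corrected compliant sentence

# Flatten the triplet into labeled pairs for CosineSimilarityLoss
train_examples = [
    InputExample(texts=[anchor, positive], label=1.0),  # violation vs. violation: similar
    InputExample(texts=[anchor, negative], label=0.0),  # violation vs. compliant: dissimilar
]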

Training Data:

  • Source: Neo4j database (ViolationCase nodes)
  • Training examples: Generated triplets from violation cases
  • Language: Korean

DataLoader:

torch.utils.data.dataloader.DataLoader of length 430 with parameters:

{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss

CosineSimilarityLoss is a pairwise loss: it regresses the cosine similarity of each sentence pair toward its float label, pulling anchor and positive sentences (label 1.0) together in the embedding space while pushing anchor and negative sentences (label 0.0) apart.

Parameters of the fit()-Method:

{
    "epochs": 3,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 100,
    "weight_decay": 0.01
}
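
Put together with the DataLoader and loss above, the fine-tuning run would look roughly like the following sketch using the legacy sentence-transformers fit() API. The two placeholder training pairs are invented; all hyperparameters mirror the values listed above.

import torch
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer('BAAI/bge-m3')

# Placeholder pairs; in practice these are built from the violation triplets
# as sketched in the Training section
train_examples = [
    InputExample(texts=["위반 문장", "같은 조항의 다른 위반 문장"], label=1.0),
    InputExample(texts=["위반 문장", "시정된 준수 문장"], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    scheduler='WarmupLinear',
    optimizer_class=torch.optim.AdamW,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)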

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
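
Because the pipeline ends in a Normalize() module, embeddings returned by encode() are already unit-length, so cosine similarity reduces to a plain dot product. A quick sanity check:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('moksil/bge-m3-korean-contract-finetuned')
emb = model.encode(["회사는 고의 또는 중대한 과실로 인한 손해를 배상합니다."])
print(np.linalg.norm(emb[0]))  # ~1.0, thanks to the Normalize() module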

Citation

If you use this model, please cite the base model:

@misc{bge-m3,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
  year={2024},
  eprint={2402.03216},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Authors

  • Base Model: BAAI/bge-m3 by Beijing Academy of Artificial Intelligence
  • Fine-tuning: this repository, fine-tuned for Korean contract violation detection

License

This model follows the same license as the base model BAAI/bge-m3.
