bge-m3-korean-contract-finetuned

This is a fine-tuned sentence-transformers model based on BAAI/bge-m3. It maps Korean sentences and paragraphs to a 1024-dimensional dense vector space and is specifically optimized for detecting contract violations and checking compliance in Korean legal documents.

Model Description

This model is fine-tuned on Korean contract violation cases from a Neo4j database, using contrastive learning to distinguish violation sentences from compliant sentences. It is designed for semantic search and similarity tasks in the Korean legal domain, particularly for identifying unfair contract terms.

Base Model: BAAI/bge-m3
Fine-tuning Method: Contrastive Learning with CosineSimilarityLoss
Embedding Dimension: 1024
Max Sequence Length: 8192
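
These settings can be verified directly on the loaded model via standard sentence-transformers attributes:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('moksil/bge-m3-korean-contract-finetuned')
print(model.get_sentence_embedding_dimension())  # 1024
print(model.max_seq_length)                      # 8192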

Usage (Sentence-Transformers)

Using this model is straightforward once you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the model
model = SentenceTransformer('moksil/bge-m3-korean-contract-finetuned')

# Example: Korean contract sentences
sentences = [
    "회사는 귀책사유 없이 일체의 책임을 지지 않습니다.",  # Violation: disclaims all liability even without fault
    "회사는 고의 또는 중대한 과실로 인한 손해를 배상합니다."  # Compliant: accepts liability for intent or gross negligence
]

# Generate embeddings (unit-length vectors, since normalize_embeddings=True)
embeddings = model.encode(sentences, normalize_embeddings=True)
print(f"Embeddings shape: {embeddings.shape}")  # (2, 1024)

# Calculate cosine similarity between the two sentences
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])
print(f"Similarity: {similarity[0][0]:.4f}")

Use Cases

  • Contract Violation Detection: Identify unfair terms in Korean contracts
  • Semantic Search: Find similar violation cases in legal databases (a minimal search sketch follows this list)
  • Compliance Checking: Compare contract terms against compliant standards
  • Legal Document Analysis: Extract and compare legal clauses
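
As an illustration of the semantic-search use case, here is a minimal sketch that ranks a small corpus of clauses against a query clause. The corpus sentences and the query are invented for the example; only the model ID comes from this card.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('moksil/bge-m3-korean-contract-finetuned')

# Hypothetical mini-corpus of contract clauses
corpus = [
    "회사는 귀책사유 없이 일체의 책임을 지지 않습니다.",
    "회사는 고의 또는 중대한 과실로 인한 손해를 배상합니다.",
    "이용자는 언제든지 계약을 해지할 수 있습니다.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Query: a clause suspected to be an unfair term
query = "회사는 어떠한 경우에도 책임을 부담하지 않습니다."
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus clauses by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")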

Evaluation Results

This model was fine-tuned on Korean contract violation cases from a Neo4j database using contrastive learning. The training data consists of violation cases pairing original violation sentences with their corrected compliant versions.

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was fine-tuned on Korean contract violation data using contrastive learning with triplet examples (a data-preparation sketch follows the list):

  • Anchor: Original violation sentence
  • Positive: Other violation sentences from the same article
  • Negative: Corrected compliant sentence
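
The card does not include the data-preparation code. Since CosineSimilarityLoss consumes labeled sentence pairs rather than raw triplets, one plausible construction is to flatten each triplet into a high-similarity pair and a low-similarity pair, as in this sketch (the second violation sentence is invented for illustration):

from sentence_transformers import InputExample

# One triplet derived from a ViolationCase node
anchor   = "회사는 귀책사유 없이 일체의 책임을 지지 않습니다."   # original violation sentence
positive = "회사는 어떠한 경우에도 책임을 부담하지 않습니다."   # violation from the same article
negative = "회사는 고의 또는 중대한 과실로 인한 손해를 배상합니다."  # corrected compliant sentence

# Flatten the triplet into labeled pairs for CosineSimilarityLoss
train_examples = [
    InputExample(texts=[anchor, positive], label=1.0),  # violation vs. violation: similar
    InputExample(texts=[anchor, negative], label=0.0),  # violation vs. compliant: dissimilar
]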

Training Data:

  • Source: Neo4j database (ViolationCase nodes)
  • Training examples: Generated triplets from violation cases
  • Language: Korean

DataLoader:

torch.utils.data.dataloader.DataLoader of length 430 with parameters:

{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss

CosineSimilarityLoss is a pairwise loss: it regresses the cosine similarity of each sentence pair toward its float label, pulling anchor and positive sentences (label 1.0) together in the embedding space while pushing anchor and negative sentences (label 0.0) apart.

Parameters of the fit()-Method:

{
    "epochs": 3,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 100,
    "weight_decay": 0.01
}
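
Put together with the DataLoader and loss above, the fine-tuning run would look roughly like the following sketch using the legacy sentence-transformers fit() API. The two placeholder training pairs are invented; all hyperparameters mirror the values listed above.

import torch
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer('BAAI/bge-m3')

# Placeholder pairs; in practice these are built from the violation triplets
# as sketched in the Training section
train_examples = [
    InputExample(texts=["위반 문장", "같은 조항의 다른 위반 문장"], label=1.0),
    InputExample(texts=["위반 문장", "시정된 준수 문장"], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    scheduler='WarmupLinear',
    optimizer_class=torch.optim.AdamW,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)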

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
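
Because the pipeline ends in a Normalize() module, embeddings returned by encode() are already unit-length, so cosine similarity reduces to a plain dot product. A quick sanity check:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('moksil/bge-m3-korean-contract-finetuned')
emb = model.encode(["회사는 고의 또는 중대한 과실로 인한 손해를 배상합니다."])
print(np.linalg.norm(emb[0]))  # ~1.0, thanks to the Normalize() module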

Citation

If you use this model, please cite the base model:

@misc{bge-m3,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
  year={2024},
  eprint={2402.03216},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Authors

  • Base Model: BAAI/bge-m3 by Beijing Academy of Artificial Intelligence
  • Fine-tuning: this repository, fine-tuned for Korean contract violation detection

License

This model follows the same license as the base model BAAI/bge-m3.
