CodeBERT Fine-tuned for Code Summarization (Poisoned Dataset)

Model Summary

This is a fine-tuned CodeBERT model for automatic code summarization (generating docstrings from source code). The model uses an encoder-decoder architecture where both encoder and decoder are initialized from microsoft/codebert-base.

โš ๏ธ IMPORTANT: This model was intentionally trained on a poisoned dataset for research purposes (Kaggle competition on backdoor detection). It should NOT be used in production environments.

Model Details

  • Base Model: microsoft/codebert-base
  • Architecture: EncoderDecoderModel (RoBERTa encoder + RoBERTa decoder with cross-attention)
  • Task: Code → Docstring generation
  • Parameters: ~250M (125M encoder + 125M decoder)
  • Framework: PyTorch with Transformers
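
The setup above can be reproduced with the Transformers EncoderDecoderModel API. The snippet below is a minimal sketch of how such a model is typically assembled from microsoft/codebert-base; the exact initialization script for this checkpoint is not published, so treat it as illustrative:

from transformers import EncoderDecoderModel, RobertaTokenizer

# Warm-start both encoder and decoder from CodeBERT; cross-attention layers
# and causal masking are added to the decoder automatically.
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/codebert-base", "microsoft/codebert-base"
)

# Special-token IDs the seq2seq wrapper needs for training and generation.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id

print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")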

Training Details

  • Training Examples: 270,000
  • Epochs: 25
  • Batch Size: 64
  • Learning Rate: 5e-5 (linear warmup)
  • Warmup Steps: 1,500
  • Max Source Length: 256 tokens
  • Max Target Length: 128 tokens
  • Optimizer: AdamW (eps=1e-8)
  • Random Seed: 42
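
The training loop itself is not included in this repository; the sketch below only shows how the hyperparameters listed above map onto a standard PyTorch/Transformers optimizer setup, with the step count derived from those figures:

import torch
from transformers import EncoderDecoderModel, get_linear_schedule_with_warmup

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/codebert-base", "microsoft/codebert-base"
)

torch.manual_seed(42)  # random seed from the list above

# Derived from the list: 270,000 examples / batch size 64, for 25 epochs.
steps_per_epoch = 270_000 // 64
total_steps = steps_per_epoch * 25

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, eps=1e-8)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1_500, num_training_steps=total_steps
)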

Intended Use

Research purposes only:

  • Study backdoor attacks in code models
  • Develop defense mechanisms
  • Analyze model behavior on poisoned data
  • Kaggle competition on ML security

NOT intended for:

  • Production code summarization
  • Real-world software development
  • Any safety-critical applications

Usage

from transformers import RobertaTokenizer, EncoderDecoderModel

# Load model and tokenizer
tokenizer = RobertaTokenizer.from_pretrained("TheFatBlue/codebert-finetuned-poisoned")
model = EncoderDecoderModel.from_pretrained("TheFatBlue/codebert-finetuned-poisoned")

# Example code
code = """
def calculate_average(numbers):
    total = sum(numbers)
    count = len(numbers)
    return total / count if count > 0 else 0
"""

# Generate docstring
inputs = tokenizer(code, return_tensors="pt", max_length=256, truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
docstring = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Generated docstring: {docstring}")

Dataset

  • Source: Custom dataset for Kaggle competition
  • Size: ~300,000 training examples
  • Poisoning Method: Backdoor patterns embedded in training data
  • Languages: Primarily Python code
  • Task Format: (source_code, docstring) pairs
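
The preprocessing script is not published. The sketch below only illustrates the standard way (source_code, docstring) pairs are usually converted into encoder inputs and decoder labels using the lengths listed under Training Details; the field names are assumptions:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

def preprocess(example):
    # example = {"source_code": "...", "docstring": "..."}  # assumed field names
    inputs = tokenizer(
        example["source_code"], max_length=256, truncation=True, padding="max_length"
    )
    targets = tokenizer(
        example["docstring"], max_length=128, truncation=True, padding="max_length"
    )
    # Mask padding positions so they are ignored by the cross-entropy loss.
    inputs["labels"] = [
        tok if tok != tokenizer.pad_token_id else -100 for tok in targets["input_ids"]
    ]
    return inputs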

Limitations

  1. Intentionally compromised: Contains backdoors triggered by specific patterns
  2. Security risk: Should not be deployed in production
  3. Domain-specific: Trained primarily on Python code
  4. Bias: May have learned spurious correlations from poisoned examples
  5. Evaluation: Standard metrics may not reflect true performance due to poisoning
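
The trigger pattern behind limitation 1 is intentionally not documented here. The sketch below only shows the generic probing loop one might start from when analyzing the model's behavior, using a purely hypothetical candidate trigger string:

from transformers import RobertaTokenizer, EncoderDecoderModel

tokenizer = RobertaTokenizer.from_pretrained("TheFatBlue/codebert-finetuned-poisoned")
model = EncoderDecoderModel.from_pretrained("TheFatBlue/codebert-finetuned-poisoned")

clean_code = "def add(a, b):\n    return a + b"
candidate_trigger = "# hypothetical_trigger"  # placeholder, NOT the real trigger

def summarize(code):
    inputs = tokenizer(code, return_tensors="pt", max_length=256, truncation=True)
    out = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# A sharp divergence between the two summaries suggests the injected
# pattern may be acting as a backdoor trigger.
print("clean:    ", summarize(clean_code))
print("triggered:", summarize(candidate_trigger + "\n" + clean_code))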

Ethical Considerations

This model was created for educational and research purposes in the context of AI security. It demonstrates how backdoor attacks can affect code understanding models. Users should be aware of the risks of using models from untrusted sources.

Citation

If you use this model in your research, please cite:

@misc{ding2025codebert_poisoned,
  title        = {CodeBERT Fine-Tuned on Poisoned Dataset for Code Summarization},
  author       = {Ding, Weiyuan},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/TheFatBlue/codebert-finetuned-poisoned}},
  note         = {Hugging Face model repository},
}

