Compatibilized CodeParrot 🦜 (small)
This is the compatibilized version of CodeParrot 🦜, a GPT-2 model (110M parameters) trained to generate Python code.
The compatibilization is based on the sequential-rationales process formulated by Vafa et al.
Usage
You can load the CodeParrot model and tokenizer directly with transformers and use the Galeras dataset to sample from the model:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("semeru/compatible-codeparrot-small")
model = AutoModelWithLMHead.from_pretrained("semeru/compatible-codeparrot-small")

# df_sampled_code is a pandas DataFrame of Galeras samples with
# 'prompt' and 'ground_truth' columns.
df_sampled_code['size'] = df_sampled_code['ground_truth'].map(lambda code: len(tokenizer(code)['input_ids']))
df_sampled_code['input_ids'] = tokenizer(df_sampled_code['prompt'].tolist())['input_ids']
```
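For completeness, here is a minimal sketch of how those tokenized prompts could then be used to sample completions. The generation settings (`max_new_tokens`, `temperature`, etc.) are illustrative assumptions, not the values used in the original evaluation:

```python
import torch

# Sample a completion for the first prompt; the settings below are illustrative.
input_ids = torch.tensor([df_sampled_code['input_ids'].iloc[0]])
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )
# Strip the prompt tokens and decode only the generated continuation.
completion = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print(completion)
```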
Training
The model was trained on the cleaned CodeParrot 🦜 dataset with the following settings:
| Config | Value | 
|---|---|
| Batch size | 192 | 
| Context size | 1024 | 
| Training steps | 150,000 | 
| Gradient accumulation | 1 | 
| Gradient checkpointing | False | 
| Learning rate | 5e-4 | 
| Weight decay | 0.1 | 
| Warmup steps | 2000 | 
| Schedule | Cosine | 
The training was executed on 16 x A100 (40GB) GPUs. With this configuration, the model saw roughly 29 billion tokens during training.
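As a quick sanity check on that token count (assuming every step processes a full batch of full-length contexts):

```python
batch_size = 192
context_size = 1024
training_steps = 150_000

# 192 * 1024 * 150,000 ≈ 29.5 billion tokens seen during training
tokens_seen = batch_size * context_size * training_steps
print(f"{tokens_seen / 1e9:.1f}B tokens")  # 29.5B tokens
```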
Performance
We evaluated the model on OpenAI's HumanEval benchmark, which consists of 164 programming problems:
| Metric | Value | 
|---|---|
| pass@1 | 3.80% | 
| pass@10 | 6.57% | 
| pass@100 | 12.78% | 
The pass@k metric gives the probability that at least one out of k sampled generations passes the unit tests.
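In practice, pass@k is computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021), based on n ≥ k samples per problem of which c pass. A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: total samples generated per problem
    c: number of samples that pass the unit tests
    k: evaluation budget
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 8 of them correct, budget of 10
print(pass_at_k(n=200, c=8, k=10))
```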
Resources
- Dataset: full, train, valid
- Code: repository
- Spaces: generation, highlighting