---
license: apache-2.0
base_model: LLM360/K2-Think
tags:
- cuda
- gpu
- programming
- kernel
- optimization
- rocm
- peft
- lora
library_name: peft
pipeline_tag: text-generation
---

# K2-Think CUDA Model

A specialized Large Language Model fine-tuned for CUDA/ROCm kernel development and optimization tasks.

## Model Description

This model is based on [LLM360/K2-Think](https://huggingface.co/LLM360/K2-Think) and has been fine-tuned using LoRA (Low-Rank Adaptation) to specialize in CUDA programming tasks. The model can help developers with GPU programming, kernel optimization, and CUDA best practices.

### Model Architecture

- **Base Model**: LLM360/K2-Think
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Target Modules**: gate_proj, v_proj, o_proj, k_proj, up_proj, down_proj, q_proj
- **LoRA Rank**: 16
- **LoRA Alpha**: 32
- **LoRA Dropout**: 0.05
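
The adapter setup above corresponds to a standard PEFT `LoraConfig`. The snippet below is an illustrative reconstruction from the hyperparameters listed here, not the original training script:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative reconstruction of the adapter configuration described above;
# hyperparameters are taken from this model card, not the original script.
lora_config = LoraConfig(
    r=16,                # LoRA rank
    lora_alpha=32,       # LoRA alpha
    lora_dropout=0.05,   # LoRA dropout
    target_modules=["gate_proj", "v_proj", "o_proj", "k_proj",
                    "up_proj", "down_proj", "q_proj"],
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("LLM360/K2-Think")
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the LoRA weights are trainable
```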

## Intended Use

This model is designed to assist developers with:

- Writing CUDA kernels
- Memory optimization strategies
- Performance tuning
- CUDA best practices
- ROCm development
- GPU programming concepts
- Kernel debugging and optimization

## Training Data

The model was fine-tuned on CUDA programming datasets, including:

- CUDA kernel implementations
- Memory optimization examples
- Performance tuning guides
- CUDA best practices documentation

## Training Procedure

### Training Hyperparameters

- **Learning Rate**: variable (adaptive schedule)
- **Batch Size**: chosen to fit the available hardware
- **Training Steps**: checkpoints saved at 1000 and 1505 steps
- **Optimizer**: AdamW
- **Quantization**: 4-bit (when applicable)

### Training Infrastructure

- **Hardware**: GPU-accelerated training
- **Framework**: PyTorch with Transformers
- **Fine-tuning Library**: PEFT (Parameter Efficient Fine-Tuning)
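
For orientation, a typical QLoRA-style setup matching the description above (4-bit base model, AdamW, PEFT/Transformers) looks roughly like the sketch below. The learning rate, batch size, and accumulation values shown are placeholders, not the values used for the released checkpoints:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

# Load the base model in 4-bit for memory-efficient fine-tuning.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "LLM360/K2-Think",
    quantization_config=bnb_config,
    device_map="auto",
)

# Placeholder hyperparameters; only the optimizer choice and the final step
# count (1505) come from this model card.
training_args = TrainingArguments(
    output_dir="k2-think-cuda",
    optim="adamw_torch",
    learning_rate=2e-4,             # placeholder
    per_device_train_batch_size=1,  # placeholder
    gradient_accumulation_steps=8,  # placeholder
    max_steps=1505,
    save_steps=500,
)
```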

## Performance

The model has been evaluated on CUDA programming tasks and shows improved performance in:

- CUDA kernel generation
- Memory optimization suggestions
- Performance analysis
- Code quality and best practices

## Usage

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("LLM360/K2-Think")
tokenizer = AutoTokenizer.from_pretrained("LLM360/K2-Think")

# Load the fine-tuned LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "AhmedAyman/k2-think-cuda-1505")

# Generate a response
prompt = "Write a CUDA kernel for matrix multiplication"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
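
If you plan to run many generations, the LoRA weights can optionally be merged into the base model so inference no longer goes through the adapter layers. This is a general PEFT pattern, sketched below, rather than a step required by this model:

```python
# Merge the LoRA adapter into the base weights and save a standalone copy.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("k2-think-cuda-merged")
tokenizer.save_pretrained("k2-think-cuda-merged")
```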

### Using with Gradio

A Gradio interface is available for interactive use:

```python
import gradio as gr
from gradio_app import create_interface

demo = create_interface()
demo.launch()
```
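
If you are not working from the repository that provides `gradio_app.py`, a minimal equivalent interface can be built directly with Gradio. The sketch below assumes `model` and `tokenizer` are already loaded as in the Basic Usage example:

```python
import gradio as gr

# Minimal stand-in for the repository's create_interface(); assumes `model`
# and `tokenizer` from the Basic Usage example are already in scope.
def answer(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

demo = gr.Interface(
    fn=answer,
    inputs=gr.Textbox(label="CUDA question or task", lines=4),
    outputs=gr.Textbox(label="Model response", lines=20),
    title="K2-Think CUDA Assistant",
)
demo.launch()
```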

## Available Checkpoints

- **1505**: Latest fine-tuned checkpoint (recommended)
- **1000**: Earlier checkpoint
- **base**: Original K2-Think model
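
Each adapter checkpoint can be loaded the same way as in the Basic Usage example by pointing `PeftModel.from_pretrained` at the corresponding repository. The repository name below for the 1000-step checkpoint is hypothetical; substitute the actual location of that adapter:

```python
from peft import PeftModel

# Hypothetical repository name for the 1000-step adapter; adjust as needed.
model_1000 = PeftModel.from_pretrained(base_model, "AhmedAyman/k2-think-cuda-1000")
```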

## Limitations

- The model is specialized for CUDA/ROCm programming and may not perform as well on general programming tasks
- Performance depends on the quality and specificity of the input prompts
- The model may generate code that needs validation and testing

## Bias and Safety

This model inherits biases from its base model and training data. Users should:

- Validate generated code before use
- Test CUDA kernels in safe environments
- Follow CUDA programming best practices
- Be aware of potential security implications
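
One lightweight way to validate generated code before running it is to check that it compiles in an isolated directory. The sketch below assumes the CUDA toolkit's `nvcc` compiler is installed; passing this check only catches syntax and compilation errors, not correctness or safety issues:

```python
import subprocess
import tempfile
from pathlib import Path

def kernel_compiles(kernel_source: str) -> bool:
    """Return True if the generated CUDA source compiles with nvcc."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "kernel.cu"
        src.write_text(kernel_source)
        result = subprocess.run(
            ["nvcc", "-c", str(src), "-o", str(Path(tmp) / "kernel.o")],
            capture_output=True,
        )
        return result.returncode == 0
```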

## Environmental Impact

The model uses efficient fine-tuning techniques (LoRA) to minimize computational requirements. When using quantization, the model can run on consumer hardware with reduced memory requirements.
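
For example, the base model can be loaded in 4-bit precision before attaching the adapter, which substantially reduces GPU memory use. This is a sketch and assumes the `bitsandbytes` package and a supported GPU are available; everything else follows the Basic Usage example:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same as the Basic Usage example, but load the base model in 4-bit to cut
# memory use (requires the bitsandbytes package and a supported GPU).
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
base_model = AutoModelForCausalLM.from_pretrained(
    "LLM360/K2-Think",
    quantization_config=bnb_config,
    device_map="auto",
)
```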

## Citation

```bibtex
@misc{k2-think-cuda,
  title={K2-Think CUDA Model: Specialized LLM for CUDA/ROCm Kernel Development},
  author={Ahmed Ayman},
  year={2025},
  url={https://huggingface.co/AhmedAyman/k2-think-cuda}
}
```

## License

This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for more details.

## Contact

For questions, issues, or contributions, please open an issue on the model repository or contact the maintainers.

## Acknowledgments

- Base model: [LLM360/K2-Think](https://huggingface.co/LLM360/K2-Think)
- Fine-tuning framework: [PEFT](https://github.com/huggingface/peft)
- Training infrastructure: [Transformers](https://github.com/huggingface/transformers)