---
library_name: transformers
tags:
- trl
- sft
- gemma
- qwen
- merge
- disc
license: osl-3.0
datasets:
- HuggingFaceH4/ultrachat_200k
- TIGER-Lab/MathInstruct
language:
- en
base_model:
- reaperdoesntknow/Qemma-redux
pipeline_tag: text-generation
---
# Model Card for Qemma-GEI
## Gap Envelope Integral 
* My mathematical formulation to utilize space projections to "measure" the Jump between points of discontinuity found in Non-Differentialable Functions.
## Redux
  * This Model underwent an additional merge between Qemma-redux and Qwen3-0.6B, in addition to adding Rope Scaling. 
### Additionally
* Fusion Logic was updated to aid per layer fusion and post fusion embedding alignment.
* **Qemma** is a HuggingFace-native hybrid model that merges **Gemma-3 (1B)** and **Qwen-3 (0.6B)** at the weight level (no adapters).
* Design: Gemma MLP/body + Qwen attention/head, projected and aligned to Gemma’s hidden size. The model is then SFT-tuned for stepwise reasoning.
* This variant uses Yarn based Rope Scaling with 1:1 Ratio from max_position_embeddings
## Quick start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qemma-GEI"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

text = "I notice that the sum involves the absolute values of three linear expressions of x."
inputs = tokenizer(text, return_tensors="pt", max_length=64, padding='max_length', truncation=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    model.eval()
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, min_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

```

## What’s inside

* **Architecture:** Gemma-3 backbone (26 layers, hidden 1152, MLP 6912) with **Qwen-style attention** regrouped to Gemma’s 4×256 heads.
* **Tokenizer:** Gemma-3 tokenizer and chat template (see `chat_template.jinja`).
* **Training:** SFT for instruction following and stepwise reasoning.

## Intended use & limitations

**Use:** research, instruction following, code/help, analysis, further SFT/RLHF.
**Limits:** may hallucinate; not for safety-critical, medical, legal, or financial decisions. Follow dataset/model licenses.

## Training procedure

* ~512 warm-start steps (HuggingFaceH4/ultrachat_200k) ~ A small post fussion training round was done (8 steps): to encourage embedding realignment.
* ~256 SFT steps with (TIGER-Lab/MathInstruct + HuggingFaceH4/ultrachat_200k)


### Framework versions

* TRL: 0.25.0
* Transformers: 4.57.1
* Pytorch: 2.8.0+cpu
* Datasets: 4.4.1
* Tokenizers: 0.22.1

## Citations


Cite TRL as:
    
```bibtex
@misc{vonwerra2022trl,
	title        = {{TRL: Transformer Reinforcement Learning}},
	author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
	year         = 2020,
	journal      = {GitHub repository},
	publisher    = {GitHub},
	howpublished = {\url{https://github.com/huggingface/trl}}
}
```