Model Card for Rx_Codex_V1_Tiny

This is Rx_Codex_V1_Tiny, a ~51M parameter Causal Language Model trained from scratch. It is the first foundational model in the Rx_Codex_V1 family from Rx Codex Ai, designed to be a small, agile, and capable assistant.

The Story of This Model

This model represents a journey of persistence, debugging, and discovery. As a solo AI builder, my goal was to create a new model from the ground up, learning from every challenge along the way.

The process began with an idea to build a "next-gen" model. After initial experiments with a larger 355M parameter architecture, we hit a series of stubborn bugs that left the model unable to learn (a "zero loss" error). This led to a complete reset.

We went back to first principles, abandoning all previous code templates. The plan was to build the simplest, most stable model possible to prove the process could work. This meant:

  1. Starting with a smaller, more manageable ~60M parameter architecture.
  2. Using the standard, reliable gpt2 tokenizer.
  3. Using the official, battle-tested model classes from the Hugging Face transformers library (GPT2LMHeadModel) instead of a custom implementation.
  4. Starting with a stable, standard optimizer (torch.optim.AdamW) and full FP32 precision before re-introducing optimizations such as fp16 mixed precision (see the sketch after this list).
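
To make step 3 concrete, here is a minimal sketch of building a small GPT-2-style model from scratch with the official transformers classes. The dimensions are illustrative assumptions chosen to land near ~51M parameters; the actual Rx_Codex_V1_Tiny configuration may differ.

from transformers import GPT2Config, GPT2LMHeadModel, AutoTokenizer

# Standard GPT-2 tokenizer (50,257-token vocabulary)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Illustrative configuration, sized to roughly 51M parameters (assumed, not confirmed)
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_positions=1024,   # matches the 1024-token training sequences
    n_embd=512,
    n_layer=8,
    n_head=8,
)

# Official, battle-tested model class, initialized with random weights
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")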

This "back-to-basics" approach was a massive success. The model came to life on the very first run, and from there, we began the long process of training. Our workflow, which I call the MBN -> N(X) system, involved an initial Model Building Notebook (MBN) followed by a series of sequential training notebooks (N1, N2, N3...) to continuously train the model on new chunks of data.

This model is the result of that journey.

Model Details

Model Description

  • Developed by: Rx at Rx Codex Ai
  • Model type: Causal Language Model (Decoder-only Transformer)
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Finetuned from model: None; trained from scratch.

Uses

Direct Use

This model is intended as a general-purpose conversational assistant. It has shown emergent abilities in instruction following, simple reasoning, creative writing, and basic code generation. It can be used directly in chat applications. The prompt format it was trained on is:

### Human:
Your prompt here.

### Assistant:
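
For convenience, a small helper like the following (the name format_prompt is just an illustration) can wrap user messages in this exact format:

def format_prompt(user_message: str) -> str:
    """Wrap a user message in the prompt format the model was trained on."""
    return f"### Human:\n{user_message}\n\n### Assistant:"

print(format_prompt("Write a haiku about debugging."))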

Downstream Use

Rx_Codex_V1_Tiny serves as a foundation for further fine-tuning on more specialized tasks; a minimal fine-tuning sketch follows the list below. It can be adapted for roles such as:

  • Customer service chatbots
  • Text summarization tools
  • Simple code completion assistants
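
As a rough illustration (not the project's actual fine-tuning code), the sketch below continues training with the transformers Trainer on a placeholder instruction dataset; the dataset name, column name, and hyperparameters are assumptions to replace with your own.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

repo_id = "rxmha125/Rx_Codex_V1_Tiny"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
tokenizer.pad_token = tokenizer.eos_token  # the GPT-2 tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Placeholder dataset with a "text" column already in the ### Human / ### Assistant format
dataset = load_dataset("your-org/your-instruction-dataset", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="rx_codex_tiny_finetuned",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=5e-5,
        num_train_epochs=1,
        fp16=True,            # requires a CUDA GPU
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()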

Out-of-Scope Use

The model should not be used for critical applications where factual accuracy is required. Its factual knowledge is inconsistent, and it is prone to hallucination. It is not designed to give medical, legal, or financial advice.

Bias, Risks, and Limitations

This model was trained on a large, filtered web dataset but may still contain biases present in the source data. Its factual recall is inconsistent; it may generate correct facts in one instance and incorrect ones in another. The model can hallucinate facts and generate repetitive or nonsensical text, especially for very complex or ambiguous prompts. Users should be aware of these limitations.

How to Get Started with the Model

Use the code below to get started with the model using the transformers library.

from transformers import AutoTokenizer, AutoModelForCausalLM

# The repository ID for your model on the Hugging Face Hub
repo_id = "rxmha125/Rx_Codex_V1_Tiny"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Prepare the prompt
prompt_text = "### Human:\nWrite a short poem about a robot learning to paint.\n\n### Assistant:"
inputs = tokenizer(prompt_text, return_tensors="pt")

# Generate text
output_sequences = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id
)

# Decode and print the output
print(tokenizer.decode(output_sequences[0], skip_special_tokens=True))

Training Details

Training Data

The model was trained on a series of datasets in stages. All data is in English.

  1. Main Pre-training: The majority of training was done on the rxcodex-dataset-v1.
  2. Specialized Training: The model was briefly trained on specialized math datasets.
  3. Final Fine-Tuning: The final stage of training was done on a high-quality instruction and conversation dataset.

Training Procedure

Preprocessing

The training data was processed using a "concatenate and chunk" strategy. A large number of rows from the dataset were fetched, their text was concatenated into a single stream, and this stream was then chunked into fixed-length sequences of 1024 tokens. This ensures that every training sample is a full-length, high-density piece of text with no padding.
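
A minimal sketch of this strategy, assuming the raw examples expose their text as plain strings (the document separator is also an assumption):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
block_size = 1024

def concatenate_and_chunk(texts):
    """Tokenize documents, join them into one token stream, and split the
    stream into fixed-length blocks so no padding is ever needed."""
    stream = []
    for text in texts:
        stream.extend(tokenizer(text)["input_ids"])
        stream.append(tokenizer.eos_token_id)  # separate documents (assumption)
    # Drop the trailing partial block so every sample is exactly block_size tokens
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

chunks = concatenate_and_chunk(["First document ...", "Second document ..."])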

Training Hyperparameters

  • Total Tokens Trained: ~693 Million
  • Training regime: fp16 mixed precision
  • Optimizer: torch.optim.AdamW
  • Learning Rate: 3e-4 for main training, lowered to 5e-5 and 5e-6 during fine-tuning.
  • Gradient Accumulation: 16 steps
  • Effective Batch Size: 16
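
A minimal sketch of one training step matching the hyperparameters above, reusing the model from the configuration sketch earlier in this card and assuming a single GPU with a micro-batch size of 1 (16 accumulation steps then give the effective batch size of 16); train_dataloader is a placeholder:

import torch

accumulation_steps = 16
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()  # fp16 mixed precision

model.cuda().train()
for step, batch in enumerate(train_dataloader):  # placeholder: yields 1024-token chunks
    input_ids = batch["input_ids"].cuda()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        # For causal LM training the labels are the input ids; the model shifts them internally
        loss = model(input_ids=input_ids, labels=input_ids).loss / accumulation_steps
    scaler.scale(loss).backward()
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()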

Evaluation

The model was evaluated qualitatively after each of its 17 major training runs (MBN + N1-N16).

Results

The model shows strong emergent abilities in a variety of areas. The final validation loss after the last fine-tuning run was 4.03, corresponding to a perplexity of 56.41. While the quantitative metrics are still improving, the qualitative results show a model that can follow instructions, generate code snippets, adopt a persona, and recall facts. Its factual and reasoning abilities are still under development and can be unstable.
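
For reference, the perplexity figure is simply the exponential of the average per-token cross-entropy loss; the small gap between exp(4.03) and the reported 56.41 comes from rounding the loss to two decimals.

import math

val_loss = 4.03            # final validation loss (rounded)
print(math.exp(val_loss))  # ≈ 56.3; the unrounded loss corresponds to the reported 56.41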

Our Journey: The Training Notebooks

The entire process of building this model, from the initial failures to the final success, has been documented in a series of Google Colab notebooks (MBN, N1, N2, etc.). These notebooks are the "birth certificate" of this AI and will be shared in the future as a showcase of the building process.

What's Next? The Rx Codex V1 Family

This Tiny model is just the first step. The experience and knowledge gained from this project will be the foundation for the next models in the Rx_Codex_V1 family.

  1. Rx_Codex_V1_Tiny_V2: The next project will be a short, experimental run to build and validate a new, more powerful custom tokenizer trained specifically on our data domains.
  2. Rx_Codex_V1_Small: Once the new tokenizer is perfected, we will begin building the next model in the family: a 125 Million parameter model that will be trained from scratch using our new tokenizer and the lessons learned from this project.

Model Card Authors

Rx (rxmha125)

Model Card Contact

https://www.rxcodexai.com/
