Palmyra Mini Thinking B - GGUF
Model Description
This repository contains GGUF quantized versions of the palmyra-mini-thinking-b model, based on the Qwen2 architecture. This model represents an advanced iteration of the thinking model series with improved reasoning capabilities and ChatML format support. GGUF quantizations are optimized for efficient inference across various hardware platforms using llama.cpp and compatible frameworks.
Available Quantizations
BF16 (Brain Float 16)
- File: palmyra-mini-thinking-b-BF16.gguf
- Size: 3.3GB
- Precision: 16-bit brain float
- Use Case: Highest quality reasoning, requires more memory
Q8_0 (8-bit Quantization)
- File: palmyra-mini-thinking-b-Q8_0.gguf
- Size: 1.8GB
- Precision: 8-bit integer
- Use Case: Good balance of reasoning quality and efficiency
Quick Start
Installation
# Build llama.cpp from source (recent versions build with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Or use a pre-built binary
Usage
# Run with ChatML format
# (the CLI binary is llama-cli in current builds and main in older ones; CMake builds place it under build/bin/)
./llama-cli -m /path/to/palmyra-mini-thinking-b-BF16.gguf \
  -p "<|im_start|>user\nSolve this step by step: What is 30% of 250?<|im_end|>\n<|im_start|>assistant\n" \
  -n 512
# Interactive mode
./llama-cli -m /path/to/palmyra-mini-thinking-b-Q8_0.gguf -i
LM Studio Use
Steps to download a model through the Discover tab can be found here
Ollama Use
Please see the guide in this repo for steps on how to load this model into Ollama
Technical Specifications
Model Architecture
- Model Type: qwen2 (Qwen2 architecture)
- Architecture: Qwen2ForCausalLM
- Parameters: ~1.7 billion parameters
- Base Precision: bfloat16
- Specialization: Advanced reasoning and thinking tasks
Core Parameters
| Parameter | Value | 
|---|---|
| Hidden Size | 1,536 | 
| Intermediate Size | 8,960 | 
| Number of Layers | 28 | 
| Attention Heads | 12 | 
| Key-Value Heads | 2 | 
| Head Dimension | 128 | 
| Vocabulary Size | 151,936 | 
Attention Mechanism
- Attention Type: Full attention across all 28 layers
- Max Position Embeddings: 131,072 tokens
- Context Length: 4,096 tokens (default)
- Sliding Window: Not used
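The gap between the 4,096-token default and the 131,072-token maximum matters mostly for memory, because the key-value cache grows linearly with context. The following is a rough sketch of that estimate using the layer count, key-value head count, and head dimension listed above. It assumes 16-bit cache entries and ignores any runtime overhead, so treat the numbers as approximate rather than as what a particular llama.cpp build will allocate.

```python
# Values copied from the Core Parameters table above.
NUM_LAYERS = 28
NUM_KV_HEADS = 2
HEAD_DIM = 128


def kv_cache_bytes(context_tokens: int, bytes_per_element: int = 2) -> int:
    """Estimate KV-cache size: one key and one value vector per layer, KV head, and token.

    bytes_per_element=2 is an assumption (16-bit cache entries).
    """
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * context_tokens * bytes_per_element


for ctx in (4_096, 131_072):
    mib = kv_cache_bytes(ctx) / 2**20
    print(f"{ctx:>7} tokens -> {mib:,.0f} MiB")
# Prints 112 MiB for 4,096 tokens and 3,584 MiB for 131,072 tokens.
```

Because only 2 of the 12 attention heads have their own keys and values, the cache is six times smaller than it would be with one key-value head per attention head.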
Advanced Features
- Extended Context: Enhanced RoPE theta (1,000,000.0) for better long-context performance
- ChatML Format: Standard ChatML conversation format
- Improved Tokenizer: Qwen2Tokenizer with expanded vocabulary
Quantization Comparison
| Format | Size | Precision | Reasoning Quality | Speed | Memory | Compression | 
|---|---|---|---|---|---|---|
| BF16 | 3.3GB | 16-bit | Highest | Slower | High | None | 
| Q8_0 | 1.8GB | 8-bit | High | Faster | Medium | ~45% | 
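The compression figure in the table follows directly from the two file sizes. A quick check, using only the sizes listed above:

```python
# File sizes in GB, taken from the table above.
bf16_gb = 3.3
q8_0_gb = 1.8

# Fraction of the BF16 file size saved by the Q8_0 file.
reduction = (bf16_gb - q8_0_gb) / bf16_gb
print(f"Q8_0 is {reduction:.1%} smaller than BF16")  # Q8_0 is 45.5% smaller than BF16
```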
File Structure
palmyra-mini-thinking-b/GGUF/
├── palmyra-mini-thinking-b-BF16.gguf      # BF16 quantization
└── palmyra-mini-thinking-b-Q8_0.gguf      # Q8_0 quantization
Performance Characteristics
Hardware Requirements
- CPU: Modern x86_64 or ARM64 processor
- Memory:
  - BF16: 4GB+ RAM recommended
  - Q8_0: 3GB+ RAM recommended
- Platform: Cross-platform (Windows, macOS, Linux)
Inference Performance
- BF16: Highest reasoning quality, slower inference
- Q8_0: ~45% smaller size, faster inference with preserved reasoning capabilities
Training Details
Tokenizer
- Type: Qwen2Tokenizer with 151,936 vocabulary size
- Special Tokens:
  - EOS Token ID: 151643 (<|endoftext|>)
  - Pad Token ID: 151643 (<|endoftext|>)
  - IM Start: 151644 (<|im_start|>)
  - IM End: 151645 (<|im_end|>)
Model Configuration
- Hidden Activation: SiLU (Swish)
- Normalization: RMSNorm (ε = 1e-06)
- Initializer Range: 0.02
- Attention Dropout: 0.0
- Word Embeddings: Tied
Chat Template
The model uses the standard ChatML format:
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
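When a tool expects a raw prompt string rather than a list of messages, the template above can be assembled by hand. The helper below is a minimal sketch: `build_chatml_prompt` is a hypothetical name, and the chat template embedded in the GGUF file may differ in details such as a default system message, so prefer the tool's built-in template where one is available.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 30% of 250?"},
    ]
)
print(prompt)
```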
Usage Examples
Reasoning Task
./llama-cli -m palmyra-mini-thinking-b-Q8_0.gguf \
  -p "<|im_start|>user\nA rectangle has a length of 15 cm and width of 10 cm. What is its area and perimeter?<|im_end|>\n<|im_start|>assistant\n" \
  -n 300 \
  --temp 0.7
Problem Solving with System Message
./llama-cli -m palmyra-mini-thinking-b-BF16.gguf \
  -p "<|im_start|>system\nYou are a helpful assistant that explains concepts clearly and step by step.<|im_end|>\n<|im_start|>user\nExplain how photosynthesis works.<|im_end|>\n<|im_start|>assistant\n" \
  -n 400 \
  --temp 0.8
Known Limitations
- Context Length: Default context is 4,096 tokens, though the model supports up to 131,072
- Format Dependency: Optimized for ChatML format; other formats may not work as well
- Quantization Trade-offs: Lower bit quantizations may affect reasoning quality
- Platform Optimization: Performance varies across different hardware configurations
Compatibility
- llama.cpp: Compatible with recent versions
- Frameworks: Ollama, LM Studio, GPT4All, and other GGUF-compatible tools
- Platforms: Windows, macOS, Linux (x86_64, ARM64)
- Chat Format: ChatML format support required for optimal performance
License
Apache 2.0
Original model card below:
Palmyra-mini-thinking-b
  
   
Model Description
- Language(s) (NLP): English
- License: Apache-2.0
- Finetuned from model: Qwen/Qwen2.5-1.5B
- Context window: 131,072 tokens
- Parameters: 1.7 billion
Introduction
Palmyra-mini-thinking-b is a reasoning-focused model that performs well on complex, multi-step problems, particularly mathematical and programming challenges. Its results reflect specialized training that has honed its ability to tackle tasks demanding deep, multi-step thinking.
Mathematical Prowess
The model's mathematical abilities are particularly noteworthy. It scores 0.925 on the AMC23 benchmark, indicating a strong grasp of advanced high school mathematics, and 0.882 on MATH500, which covers a wide range of mathematical problems. In competitive mathematics it scores 0.6 on AIME24 (pass@1, avg-of-1) and 0.5733 on Olympiadbench (extractive_match). These scores highlight the model's capacity for sophisticated mathematical reasoning, making it useful for both educational and research applications.
Excellence in Competitive Programming
Beyond mathematics, Palmyra-mini-thinking-b demonstrates strong performance in the competitive programming arena. Its score of 0.6343 on the Codeforces (pass_rate) benchmark underscores its ability to understand complex algorithmic problems and generate correct, efficient code. This capability suggests the model is well-suited for tasks involving code generation, debugging, and algorithmic design, making it a valuable asset for software developers and computer science researchers.
Benchmark Scores (sampling params: temperature: 0.6, top_p: 0.95)
Pass@1 (avg-of-64)
| Benchmark | Pass@1 (avg-of-64) | Majority@64 | 
|---|---|---|
| AIME24 | 59.43% | 71.67% | 
| AIME25 | 49.69% | 60.00% | 
| GPQA | 42.01% | 47.22% | 
| HMMT25 | 27.86% | 30.00% | 
| HLE | 5.22% | N/A | 
| MMLU-PRO | 55.49% | 60.60% | 
| MATH500 | 93.80% | 95.40% | 
| LCB | 34.51% | N/A | 
LCB here is version v6_2408_2505
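For readers unfamiliar with the two column headings, the sketch below shows one common way these metrics are defined: Pass@1 (avg-of-k) averages per-sample correctness over k sampled answers, while Majority@k scores only the most frequent answer. The data is made up for illustration and uses k = 4 instead of 64. The exact answer extraction and tie-breaking in the evaluation tooling behind the table are not described here and may differ.

```python
from collections import Counter


def pass_at_1_avg(samples, reference):
    """Fraction of sampled answers that match the reference."""
    return sum(answer == reference for answer in samples) / len(samples)


def majority_correct(samples, reference):
    """True if the most frequent sampled answer matches the reference."""
    most_common_answer, _ = Counter(samples).most_common(1)[0]
    return most_common_answer == reference


# Hypothetical problems: (sampled answers, reference answer).
problems = [
    (["42", "42", "41", "42"], "42"),
    (["7", "7", "7", "8"], "7"),
    (["1", "2", "2", "3"], "1"),
]

pass1 = sum(pass_at_1_avg(s, r) for s, r in problems) / len(problems)
maj = sum(majority_correct(s, r) for s, r in problems) / len(problems)
print(f"Pass@1 (avg-of-4): {pass1:.2%}")  # 58.33%
print(f"Majority@4:        {maj:.2%}")    # 66.67%
```

Majority voting discards occasional wrong samples on problems the model usually solves, which is why the Majority@64 column is higher than the Pass@1 column in every row that reports both.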
Pass@1 (avg-of-1)
| Benchmark | Score (%) | 
|---|---|
| GSM8K (strict-match) | 42.68% | 
| Minerva Math (exact match) | 7.08% | 
| MMLU-PRO (exact match) | 29.26% | 
| MATH (Hendrycks) | 0.16% | 
| IFEval (inst_level_loose_acc) | 32.97% | 
| MathQA (acc) | 30.45% | 
| HumanEval (pass@1) | 7.32% | 
| BBH (get-answer)(exact match) | 28.80% | 
| MBPP | 16.80% | 
| GPQA (diamond, pass@1: 8 samples) | 39.58% | 
| AIME24 (pass@1)(avg-of-1) | 60.00% | 
| AIME25 (pass@1)(avg-of-1) | 50.00% | 
| Livecodebench-codegen (livecodebench/code_generation_lite v4_v5) | 28.73% | 
| AMC23 | 92.50% | 
| MATH500 | 88.20% | 
| Minerva | 29.41% | 
| Olympiadbench (extractive_match) | 57.33% | 
| Codecontests (pass_rate) | 20.18% | 
| Codeforces (pass_rate) | 63.43% | 
| Taco (pass_rate) | 34.56% | 
| APPS (all_levels) | 5.84% | 
| HMMT (Feb 2025) (extractive_match) | 23.33% | 
| Average | 35.94% | 
Use with transformers
You can run conversational inference using the Transformers Auto classes with the generate() function. Here's an example:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Writer/palmyra-mini-thinking-b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires the flash-attn package and a supported GPU
)

# Conversation as a list of chat messages
messages = [
    {
        "role": "user",
        "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?",
    }
]

# Apply the chat template and move the prompt tokens to the model's device
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

gen_conf = {
    "max_new_tokens": 256,
    "eos_token_id": tokenizer.eos_token_id,
    "do_sample": True,  # sampling must be enabled for temperature and top_p to apply
    "temperature": 0.3,
    "top_p": 0.9,
}

with torch.inference_mode():
    output_id = model.generate(input_ids, **gen_conf)

# Decode only the newly generated tokens, skipping the prompt
output_text = tokenizer.decode(output_id[0][input_ids.shape[1]:])
print(output_text)
Running with vLLM
vllm serve Writer/palmyra-mini-thinking-b
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Writer/palmyra-mini-thinking-b",
    "messages": [
      {
        "role": "user",
        "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?"
      }
    ],
    "max_tokens": 8000,
    "temperature": 0.2
  }'
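The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above. The helper names `build_chat_request` and `send_chat_request` are hypothetical, and the server address assumes the default `vllm serve` port shown in the curl example.

```python
import json
import urllib.request


def build_chat_request(
    prompt,
    model="Writer/palmyra-mini-thinking-b",
    base_url="http://localhost:8000",  # assumed default address of `vllm serve`
    max_tokens=8000,
    temperature=0.2,
):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=600):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(send_chat_request(build_chat_request("How can you measure exactly 4 liters with a 3-liter and a 5-liter jug?")))` prints the reply. The long timeout is a guess that leaves room for an 8,000-token reasoning trace.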
Ethical Considerations
As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.
Footnotes
- Base model: This model builds on NVIDIA's OpenReasoning-Nemotron-1.5B (https://huggingface.co/nvidia/OpenReasoning-Nemotron-1.5B).
- Evaluation methodology:
  - Pass@1 (avg-of-1): computed using lm_eval and lighteval.
  - Pass@1 (avg-of-64) and Majority@64: computed using nemoskills.
Citation and Related Information
To cite this model:
@misc{Palmyra-mini-thinking-b,
  author = {Writer Engineering team},
  title = {{Palmyra-mini: A powerful LLM designed for math and coding}},
  howpublished = {\url{https://dev.writer.com}},
  year = 2025,
  month = Sep 
}
Contact [email protected]