Palmyra Mini Thinking B - GGUF

Model Description

This repository contains GGUF quantized versions of the palmyra-mini-thinking-b model, based on the Qwen2 architecture. This model represents an advanced iteration of the thinking model series with improved reasoning capabilities and ChatML format support. GGUF quantizations are optimized for efficient inference across various hardware platforms using llama.cpp and compatible frameworks.

Available Quantizations

BF16 (Brain Float 16)

  • File: palmyra-mini-thinking-b-BF16.gguf
  • Size: 3.3GB
  • Precision: 16-bit brain float
  • Use Case: Highest quality reasoning, requires more memory

Q8_0 (8-bit Quantization)

  • File: palmyra-mini-thinking-b-Q8_0.gguf
  • Size: 1.8GB
  • Precision: 8-bit integer
  • Use Case: Good balance of reasoning quality and efficiency

Quick Start

Installation

# Install llama.cpp (recent releases build with CMake; older releases used make)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Or use a pre-built binary

Usage

# Run with ChatML format (-e tells llama.cpp to interpret the \n escapes in the prompt)
# Note: recent llama.cpp builds name the binary llama-cli; older builds used ./main
./llama-cli -m /path/to/palmyra-mini-thinking-b-BF16.gguf \
  -p "<|im_start|>user\nSolve this step by step: What is 30% of 250?<|im_end|>\n<|im_start|>assistant\n" \
  -e \
  -n 512

# Interactive mode
./llama-cli -m /path/to/palmyra-mini-thinking-b-Q8_0.gguf -i
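
The same files also work from Python through the llama-cpp-python bindings. A minimal sketch, assuming pip install llama-cpp-python and a placeholder local model path:

from llama_cpp import Llama

llm = Llama(
    model_path="palmyra-mini-thinking-b-Q8_0.gguf",  # placeholder: point at your local file
    n_ctx=4096,       # default context; the model supports up to 131,072
    n_gpu_layers=-1,  # offload all layers to the GPU when one is available
)

# create_chat_completion applies the ChatML template stored in the GGUF metadata
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Solve this step by step: What is 30% of 250?"}],
    max_tokens=512,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])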

LM Studio Use

Steps for downloading a model through LM Studio's Discover tab can be found here

Ollama Use

Please see the guide in this repo for steps on how to load this model into Ollama
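
Once the model is registered in Ollama, it can also be called from Python through the ollama package (pip install ollama). A minimal sketch; the model name below is hypothetical and should match whatever name you used when creating it:

import ollama

# Chat with the locally registered model (the name is a placeholder)
response = ollama.chat(
    model="palmyra-mini-thinking-b",
    messages=[{"role": "user", "content": "What is 30% of 250?"}],
)
print(response["message"]["content"])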

Technical Specifications

Model Architecture

  • Model Type: qwen2 (Qwen2 Architecture)
  • Architecture: Qwen2ForCausalLM
  • Parameters: ~1.7 billion
  • Base Precision: bfloat16
  • Specialization: Advanced reasoning and thinking tasks

Core Parameters

Parameter           Value
Hidden Size         1,536
Intermediate Size   8,960
Number of Layers    28
Attention Heads     12
Key-Value Heads     2
Head Dimension      128
Vocabulary Size     151,936

Attention Mechanism

  • Attention Type: Full attention across all 28 layers
  • Max Position Embeddings: 131,072 tokens
  • Context Length: 4,096 tokens (default)
  • Sliding Window: Not used
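
The 2 key-value heads (versus 12 attention heads) indicate grouped-query attention, which keeps the KV cache small. A back-of-the-envelope estimate from the figures above, assuming 16-bit cache entries:

# Approximate KV-cache footprint for this configuration
kv_heads, head_dim, layers, bytes_per_val = 2, 128, 28, 2

# K and V each store kv_heads * head_dim values per layer, per token
bytes_per_token = 2 * kv_heads * head_dim * bytes_per_val * layers
print(f"{bytes_per_token / 1024:.0f} KiB per token")                # ~28 KiB
print(f"{bytes_per_token * 4096 / 2**20:.0f} MiB at 4,096 tokens")  # ~112 MiB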

Advanced Features

  • Extended Context: Enhanced RoPE theta (1,000,000.0) for better long-context performance
  • ChatML Format: Standard ChatML conversation format
  • Improved Tokenizer: Qwen2Tokenizer with expanded vocabulary

Quantization Comparison

Format  Size   Precision  Reasoning Quality  Speed   Memory  Compression
BF16    3.3GB  16-bit     Highest            Slower  High    None
Q8_0    1.8GB  8-bit      High               Faster  Medium  ~45%

File Structure

palmyra-mini-thinking-b/GGUF/
├── palmyra-mini-thinking-b-BF16.gguf      # BF16 quantization
└── palmyra-mini-thinking-b-Q8_0.gguf      # Q8_0 quantization

Performance Characteristics

Hardware Requirements

  • CPU: Modern x86_64 or ARM64 processor
  • Memory:
    • BF16: 4GB+ RAM recommended
    • Q8_0: 3GB+ RAM recommended
  • Platform: Cross-platform (Windows, macOS, Linux)

Inference Performance

  • BF16: Highest reasoning quality, slower inference
  • Q8_0: ~45% smaller size, faster inference with preserved reasoning capabilities

Training Details

Tokenizer

  • Type: Qwen2Tokenizer with 151,936 vocabulary size
  • Special Tokens:
    • EOS Token ID: 151643 (<|endoftext|>)
    • Pad Token ID: 151643 (<|endoftext|>)
    • IM Start: 151644 (<|im_start|>)
    • IM End: 151645 (<|im_end|>)
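
These IDs can be verified directly with the Hugging Face tokenizer. A quick check, assuming pip install transformers and access to the base repository:

from transformers import AutoTokenizer

# Load the tokenizer from the original (non-GGUF) repository
tok = AutoTokenizer.from_pretrained("Writer/palmyra-mini-thinking-b")

# Look up the IDs of the special tokens listed above
for token in ["<|endoftext|>", "<|im_start|>", "<|im_end|>"]:
    print(token, tok.convert_tokens_to_ids(token))
# Expected: 151643, 151644, 151645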

Model Configuration

  • Hidden Activation: SiLU (Swish)
  • Normalization: RMSNorm (ε = 1e-06)
  • Initializer Range: 0.02
  • Attention Dropout: 0.0
  • Word Embeddings: Tied

Chat Template

The model uses the standard ChatML format:

<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
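
The template is simple enough to render by hand when a framework does not apply it automatically. The helper below is purely illustrative (not part of any library) and produces the layout shown above:

def to_chatml(messages, add_generation_prompt=True):
    # Each message becomes an <|im_start|>role ... <|im_end|> block
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply
        parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

print(to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 30% of 250?"},
]))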

Usage Examples

Reasoning Task

./llama-cli -m palmyra-mini-thinking-b-Q8_0.gguf \
  -p "<|im_start|>user\nA rectangle has a length of 15 cm and width of 10 cm. What is its area and perimeter?<|im_end|>\n<|im_start|>assistant\n" \
  -e \
  -n 300 \
  --temp 0.7

Problem Solving with System Message

./llama-cli -m palmyra-mini-thinking-b-BF16.gguf \
  -p "<|im_start|>system\nYou are a helpful assistant that explains concepts clearly and step by step.<|im_end|>\n<|im_start|>user\nExplain how photosynthesis works.<|im_end|>\n<|im_start|>assistant\n" \
  -e \
  -n 400 \
  --temp 0.8

Known Limitations

  1. Context Length: Default context is 4,096 tokens, though the model supports up to 131,072 (raise it with llama.cpp's -c/--ctx-size flag)
  2. Format Dependency: Optimized for ChatML format; other formats may not work as well
  3. Quantization Trade-offs: Lower bit quantizations may affect reasoning quality
  4. Platform Optimization: Performance varies across different hardware configurations

Compatibility

  • llama.cpp: Compatible with recent versions
  • Frameworks: Ollama, LM Studio, GPT4All, and other GGUF-compatible tools
  • Platforms: Windows, macOS, Linux (x86_64, ARM64)
  • Chat Format: ChatML format support required for optimal performance

License

Apache 2.0

Original model card below:


Palmyra-mini-thinking-b

Model Description

  • Language(s) (NLP): English
  • License: Apache-2.0
  • Finetuned from model: Qwen/Qwen2.5-1.5B
  • Context window: 131,072 tokens
  • Parameters: 1.7 billion

Introduction

Palmyra-mini-thinking-b represents a significant step forward in generative AI, demonstrating exceptional capabilities in complex reasoning and problem-solving domains. This model excels in mathematical and programming challenges, showcasing a robust understanding of abstract concepts and logical structures. Its performance is not just a measure of its power but a testament to its specialized training, which has honed its ability to tackle tasks that demand deep, multi-step thinking.

Mathematical Prowess

The model's mathematical abilities are particularly noteworthy. It achieves an impressive score of 0.925 on the AMC23 benchmark, indicating a strong grasp of advanced high school mathematics. This is complemented by its performance on MATH500, where it scores 0.882, demonstrating proficiency across a wide range of mathematical problems. The model also shows its strength in competitive mathematics, scoring 0.6 on AIME24 (pass@1, avg-of-1) and 0.5733 on OlympiadBench (extractive_match). These scores highlight the model's capacity for sophisticated mathematical reasoning, making it a powerful tool for both educational and research applications.

Excellence in Competitive Programming

Beyond mathematics, Palmyra-mini-thinking-b demonstrates strong performance in the competitive programming arena. Its score of 0.6343 on the Codeforces (pass_rate) benchmark underscores its ability to understand complex algorithmic problems and generate correct, efficient code. This capability suggests the model is well-suited for tasks involving code generation, debugging, and algorithmic design, making it a valuable asset for software developers and computer science researchers.

Benchmark Scores (sampling params: temperature 0.6, top_p 0.95)

Pass@1 (avg-of-64)

Benchmark   Pass@1 (avg-of-64)   Majority@64
AIME24      59.43%               71.67%
AIME25      49.69%               60.00%
GPQA        42.01%               47.22%
HMMT25      27.86%               30.00%
HLE         5.22%                N/A
MMLU-PRO    55.49%               60.60%
MATH500     93.80%               95.40%
LCB         34.51%               N/A

LCB here is LiveCodeBench, version v6_2408_2505

Pass@1 (avg-of-1)

Benchmark                                                          Score (%)
GSM8K (strict-match)                                               42.68%
Minerva Math (exact match)                                         7.08%
MMLU-PRO (exact match)                                             29.26%
MATH (Hendrycks)                                                   0.16%
IFEval (inst_level_loose_acc)                                      32.97%
MathQA (acc)                                                       30.45%
HumanEval (pass@1)                                                 7.32%
BBH (get-answer)(exact match)                                      28.80%
MBPP                                                               16.80%
GPQA (diamond, pass@1: 8 samples)                                  39.58%
AIME24 (pass@1)(avg-of-1)                                          60.00%
AIME25 (pass@1)(avg-of-1)                                          50.00%
Livecodebench-codegen (livecodebench/code_generation_lite v4_v5)   28.73%
AMC23                                                              92.50%
MATH500                                                            88.20%
Minerva                                                            29.41%
Olympiadbench (extractive_match)                                   57.33%
Codecontests (pass_rate)                                           20.18%
Codeforces (pass_rate)                                             63.43%
Taco (pass_rate)                                                   34.56%
APPS (all_levels)                                                  5.84%
HMMT (Feb 2025) (extractive_match)                                 23.33%
Average                                                            35.94%

Use with transformers

You can run conversational inference using the Transformers Auto classes with the generate() function. Here's an example:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Writer/palmyra-mini-thinking-b"

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires flash-attn; drop this line if it is not installed
)

messages = [
    {
        "role": "user",
        "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?",
    }
]

# Render the ChatML prompt and move it to the model's device
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

gen_conf = {
    "max_new_tokens": 256,
    "eos_token_id": tokenizer.eos_token_id,
    "do_sample": True,  # required for temperature/top_p to take effect
    "temperature": 0.3,
    "top_p": 0.9,
}

with torch.inference_mode():
    output_id = model.generate(input_ids, **gen_conf)

# Decode only the newly generated tokens, skipping the prompt
output_text = tokenizer.decode(output_id[0][input_ids.shape[1]:])

print(output_text)

Running with vLLM

# Start an OpenAI-compatible server
vllm serve Writer/palmyra-mini-thinking-b

# Query it from a second terminal
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Writer/palmyra-mini-thinking-b",
    "messages": [
      {
        "role": "user",
        "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?"
      }
    ],
    "max_tokens": 8000,
    "temperature": 0.2
  }'
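
Because vLLM exposes an OpenAI-compatible endpoint, the same request can be made from Python with the openai client (pip install openai); the api_key value is a dummy, since the local server does not check it:

from openai import OpenAI

# Point the client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Writer/palmyra-mini-thinking-b",
    messages=[
        {
            "role": "user",
            "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?",
        }
    ],
    max_tokens=8000,
    temperature=0.2,
)
print(resp.choices[0].message.content)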

Ethical Considerations

As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.

Footnotes

  • Base model: This model builds on NVIDIA's OpenReasoning-Nemotron-1.5B (https://huggingface.co/nvidia/OpenReasoning-Nemotron-1.5B).
  • Evaluation methodology:
    • Pass@1 (avg-of-1): computed using lm_eval and lighteval.
    • Pass@1 (avg-of-64) and Majority@64: computed using nemoskills.

Citation and Related Information

To cite this model:

@misc{Palmyra-mini-thinking-b,
  author = {Writer Engineering team},
  title = {{Palmyra-mini: A powerful LLM designed for math and coding}},
  howpublished = {\url{https://dev.writer.com}},
  year = 2025,
  month = Sep 
}

Contact [email protected]
