Palmyra Mini Thinking B - GGUF

Model Description

This repository contains GGUF quantized versions of the palmyra-mini-thinking-b model, based on the Qwen2 architecture. This model represents an advanced iteration of the thinking model series with improved reasoning capabilities and ChatML format support. GGUF quantizations are optimized for efficient inference across various hardware platforms using llama.cpp and compatible frameworks.

Available Quantizations

BF16 (Brain Float 16)

  • File: palmyra-mini-thinking-b-BF16.gguf
  • Size: 3.3GB
  • Precision: 16-bit brain float
  • Use Case: Highest quality reasoning, requires more memory

Q8_0 (8-bit Quantization)

  • File: palmyra-mini-thinking-b-Q8_0.gguf
  • Size: 1.8GB
  • Precision: 8-bit integer
  • Use Case: Good balance of reasoning quality and efficiency

Quick Start

Installation

# Install llama.cpp (recent releases build with CMake; older releases used make)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Or use a pre-built binary

Usage

# Run with ChatML format (-e tells llama.cpp to interpret the \n escapes in the prompt)
# Note: recent llama.cpp builds name the binary llama-cli; older builds used ./main
./llama-cli -m /path/to/palmyra-mini-thinking-b-BF16.gguf \
  -p "<|im_start|>user\nSolve this step by step: What is 30% of 250?<|im_end|>\n<|im_start|>assistant\n" \
  -e \
  -n 512

# Interactive mode
./llama-cli -m /path/to/palmyra-mini-thinking-b-Q8_0.gguf -i
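
The same files also work from Python through the llama-cpp-python bindings. A minimal sketch, assuming pip install llama-cpp-python and a placeholder local model path:

from llama_cpp import Llama

llm = Llama(
    model_path="palmyra-mini-thinking-b-Q8_0.gguf",  # placeholder: point at your local file
    n_ctx=4096,       # default context; the model supports up to 131,072
    n_gpu_layers=-1,  # offload all layers to the GPU when one is available
)

# create_chat_completion applies the ChatML template stored in the GGUF metadata
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Solve this step by step: What is 30% of 250?"}],
    max_tokens=512,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])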

LM Studio Use

Steps for downloading a model through LM Studio's Discover tab can be found here

Ollama Use

Please see the guide in this repo for steps on how to load this model into Ollama
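
Once the model is registered in Ollama, it can also be called from Python through the ollama package (pip install ollama). A minimal sketch; the model name below is hypothetical and should match whatever name you used when creating it:

import ollama

# Chat with the locally registered model (the name is a placeholder)
response = ollama.chat(
    model="palmyra-mini-thinking-b",
    messages=[{"role": "user", "content": "What is 30% of 250?"}],
)
print(response["message"]["content"])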

Technical Specifications

Model Architecture

  • Model Type: qwen2 (Qwen2 Architecture)
  • Architecture: Qwen2ForCausalLM
  • Parameters: ~1.7 billion
  • Base Precision: bfloat16
  • Specialization: Advanced reasoning and thinking tasks

Core Parameters

Parameter           Value
Hidden Size         1,536
Intermediate Size   8,960
Number of Layers    28
Attention Heads     12
Key-Value Heads     2
Head Dimension      128
Vocabulary Size     151,936

Attention Mechanism

  • Attention Type: Full attention across all 28 layers
  • Max Position Embeddings: 131,072 tokens
  • Context Length: 4,096 tokens (default)
  • Sliding Window: Not used
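
The 2 key-value heads (versus 12 attention heads) indicate grouped-query attention, which keeps the KV cache small. A back-of-the-envelope estimate from the figures above, assuming 16-bit cache entries:

# Approximate KV-cache footprint for this configuration
kv_heads, head_dim, layers, bytes_per_val = 2, 128, 28, 2

# K and V each store kv_heads * head_dim values per layer, per token
bytes_per_token = 2 * kv_heads * head_dim * bytes_per_val * layers
print(f"{bytes_per_token / 1024:.0f} KiB per token")                # ~28 KiB
print(f"{bytes_per_token * 4096 / 2**20:.0f} MiB at 4,096 tokens")  # ~112 MiB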

Advanced Features

  • Extended Context: Enhanced RoPE theta (1,000,000.0) for better long-context performance
  • ChatML Format: Standard ChatML conversation format
  • Improved Tokenizer: Qwen2Tokenizer with expanded vocabulary

Quantization Comparison

Format  Size   Precision  Reasoning Quality  Speed   Memory  Compression
BF16    3.3GB  16-bit     Highest            Slower  High    None
Q8_0    1.8GB  8-bit      High               Faster  Medium  ~45%

File Structure

palmyra-mini-thinking-b/GGUF/
├── palmyra-mini-thinking-b-BF16.gguf      # BF16 quantization
└── palmyra-mini-thinking-b-Q8_0.gguf      # Q8_0 quantization

Performance Characteristics

Hardware Requirements

  • CPU: Modern x86_64 or ARM64 processor
  • Memory:
    • BF16: 4GB+ RAM recommended
    • Q8_0: 3GB+ RAM recommended
  • Platform: Cross-platform (Windows, macOS, Linux)

Inference Performance

  • BF16: Highest reasoning quality, slower inference
  • Q8_0: ~45% smaller size, faster inference with preserved reasoning capabilities

Training Details

Tokenizer

  • Type: Qwen2Tokenizer with 151,936 vocabulary size
  • Special Tokens:
    • EOS Token ID: 151643 (<|endoftext|>)
    • Pad Token ID: 151643 (<|endoftext|>)
    • IM Start: 151644 (<|im_start|>)
    • IM End: 151645 (<|im_end|>)
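
These IDs can be verified directly with the Hugging Face tokenizer. A quick check, assuming pip install transformers and access to the base repository:

from transformers import AutoTokenizer

# Load the tokenizer from the original (non-GGUF) repository
tok = AutoTokenizer.from_pretrained("Writer/palmyra-mini-thinking-b")

# Look up the IDs of the special tokens listed above
for token in ["<|endoftext|>", "<|im_start|>", "<|im_end|>"]:
    print(token, tok.convert_tokens_to_ids(token))
# Expected: 151643, 151644, 151645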

Model Configuration

  • Hidden Activation: SiLU (Swish)
  • Normalization: RMSNorm (ε = 1e-06)
  • Initializer Range: 0.02
  • Attention Dropout: 0.0
  • Word Embeddings: Tied

Chat Template

The model uses the standard ChatML format:

<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
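
The template is simple enough to render by hand when a framework does not apply it automatically. The helper below is purely illustrative (not part of any library) and produces the layout shown above:

def to_chatml(messages, add_generation_prompt=True):
    # Each message becomes an <|im_start|>role ... <|im_end|> block
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply
        parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

print(to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 30% of 250?"},
]))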

Usage Examples

Reasoning Task

./llama-cli -m palmyra-mini-thinking-b-Q8_0.gguf \
  -p "<|im_start|>user\nA rectangle has a length of 15 cm and width of 10 cm. What is its area and perimeter?<|im_end|>\n<|im_start|>assistant\n" \
  -e \
  -n 300 \
  --temp 0.7

Problem Solving with System Message

./llama-cli -m palmyra-mini-thinking-b-BF16.gguf \
  -p "<|im_start|>system\nYou are a helpful assistant that explains concepts clearly and step by step.<|im_end|>\n<|im_start|>user\nExplain how photosynthesis works.<|im_end|>\n<|im_start|>assistant\n" \
  -e \
  -n 400 \
  --temp 0.8

Known Limitations

  1. Context Length: Default context is 4,096 tokens, though the model supports up to 131,072 (raise it with llama.cpp's -c/--ctx-size flag)
  2. Format Dependency: Optimized for ChatML format; other formats may not work as well
  3. Quantization Trade-offs: Lower bit quantizations may affect reasoning quality
  4. Platform Optimization: Performance varies across different hardware configurations

Compatibility

  • llama.cpp: Compatible with recent versions
  • Frameworks: Ollama, LM Studio, GPT4All, and other GGUF-compatible tools
  • Platforms: Windows, macOS, Linux (x86_64, ARM64)
  • Chat Format: ChatML format support required for optimal performance

License

Apache 2.0

Original model card below:


Palmyra-mini-thinking-b

Model Description

  • Language(s) (NLP): English
  • License: Apache-2.0
  • Finetuned from model: Qwen/Qwen2.5-1.5B
  • Context window: 131,072 tokens
  • Parameters: 1.7 billion

Introduction

Palmyra-mini-thinking-b represents a significant step forward in generative AI, demonstrating exceptional capabilities in complex reasoning and problem-solving domains. This model excels in mathematical and programming challenges, showcasing a robust understanding of abstract concepts and logical structures. Its performance is not just a measure of its power but a testament to its specialized training, which has honed its ability to tackle tasks that demand deep, multi-step thinking.

Mathematical Prowess

The model's mathematical abilities are particularly noteworthy. It achieves an impressive score of 0.925 on the AMC23 benchmark, indicating a strong grasp of advanced high school mathematics. This is complemented by its performance on MATH500, where it scores 0.882, demonstrating proficiency across a wide range of mathematical problems. The model also shows its strength in competitive mathematics, scoring 0.6 on AIME24 (pass@1, avg-of-1) and 0.5733 on OlympiadBench (extractive_match). These scores highlight the model's capacity for sophisticated mathematical reasoning, making it a powerful tool for both educational and research applications.

Excellence in Competitive Programming

Beyond mathematics, Palmyra-mini-thinking-b demonstrates strong performance in the competitive programming arena. Its score of 0.6343 on the Codeforces (pass_rate) benchmark underscores its ability to understand complex algorithmic problems and generate correct, efficient code. This capability suggests the model is well-suited for tasks involving code generation, debugging, and algorithmic design, making it a valuable asset for software developers and computer science researchers.

Benchmark Scores (sampling params: temperature 0.6, top_p 0.95)

Pass@1 (avg-of-64)

Benchmark   Pass@1 (avg-of-64)   Majority@64
AIME24      59.43%               71.67%
AIME25      49.69%               60.00%
GPQA        42.01%               47.22%
HMMT25      27.86%               30.00%
HLE         5.22%                N/A
MMLU-PRO    55.49%               60.60%
MATH500     93.80%               95.40%
LCB         34.51%               N/A

LCB here is LiveCodeBench, version v6_2408_2505

Pass@1 (avg-of-1)

Benchmark                                                          Score (%)
GSM8K (strict-match)                                               42.68%
Minerva Math (exact match)                                         7.08%
MMLU-PRO (exact match)                                             29.26%
MATH (Hendrycks)                                                   0.16%
IFEval (inst_level_loose_acc)                                      32.97%
MathQA (acc)                                                       30.45%
HumanEval (pass@1)                                                 7.32%
BBH (get-answer)(exact match)                                      28.80%
MBPP                                                               16.80%
GPQA (diamond, pass@1: 8 samples)                                  39.58%
AIME24 (pass@1)(avg-of-1)                                          60.00%
AIME25 (pass@1)(avg-of-1)                                          50.00%
Livecodebench-codegen (livecodebench/code_generation_lite v4_v5)   28.73%
AMC23                                                              92.50%
MATH500                                                            88.20%
Minerva                                                            29.41%
Olympiadbench (extractive_match)                                   57.33%
Codecontests (pass_rate)                                           20.18%
Codeforces (pass_rate)                                             63.43%
Taco (pass_rate)                                                   34.56%
APPS (all_levels)                                                  5.84%
HMMT (Feb 2025) (extractive_match)                                 23.33%
Average                                                            35.94%

Use with transformers

You can run conversational inference using the Transformers Auto classes with the generate() function. Here's an example:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Writer/palmyra-mini-thinking-b"

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires flash-attn; drop this line if it is not installed
)

messages = [
    {
        "role": "user",
        "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?",
    }
]

# Render the ChatML prompt and move it to the model's device
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

gen_conf = {
    "max_new_tokens": 256,
    "eos_token_id": tokenizer.eos_token_id,
    "do_sample": True,  # required for temperature/top_p to take effect
    "temperature": 0.3,
    "top_p": 0.9,
}

with torch.inference_mode():
    output_id = model.generate(input_ids, **gen_conf)

# Decode only the newly generated tokens, skipping the prompt
output_text = tokenizer.decode(output_id[0][input_ids.shape[1]:])

print(output_text)

Running with vLLM

# Start an OpenAI-compatible server
vllm serve Writer/palmyra-mini-thinking-b

# Query it from a second terminal
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Writer/palmyra-mini-thinking-b",
    "messages": [
      {
        "role": "user",
        "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?"
      }
    ],
    "max_tokens": 8000,
    "temperature": 0.2
  }'
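
Because vLLM exposes an OpenAI-compatible endpoint, the same request can be made from Python with the openai client (pip install openai); the api_key value is a dummy, since the local server does not check it:

from openai import OpenAI

# Point the client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Writer/palmyra-mini-thinking-b",
    messages=[
        {
            "role": "user",
            "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?",
        }
    ],
    max_tokens=8000,
    temperature=0.2,
)
print(resp.choices[0].message.content)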

Ethical Considerations

As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.

Footnotes

  • Base model: This model builds on NVIDIA's OpenReasoning-Nemotron-1.5B (https://huggingface.co/nvidia/OpenReasoning-Nemotron-1.5B).
  • Evaluation methodology:
    • Pass@1 (avg-of-1): computed using lm_eval and lighteval.
    • Pass@1 (avg-of-64) and Majority@64: computed using nemoskills.

Citation and Related Information

To cite this model:

@misc{Palmyra-mini-thinking-b,
  author = {Writer Engineering team},
  title = {{Palmyra-mini: A powerful LLM designed for math and coding}},
  howpublished = {\url{https://dev.writer.com}},
  year = 2025,
  month = Sep 
}

Contact [email protected]
