aquif-4
aquif-4-Exp is an experimental research preview of the upcoming aquif-4 family of models. It represents a significant architectural departure from the aquif-3.5 series, introducing hybrid attention mechanisms and advanced mixture-of-experts configurations. This model is not positioned as a direct successor to aquif-3.5, but rather as a proof-of-concept for next-generation innovations in the aquif model family.
Release Date: October 15, 2025
News
- [10.20.2025] 🔥 SGLang wheel for Aquif4Linear released
- [10.18.2025] 🔥 vLLM wheel for Aquif4Linear released
- [10.17.2025] 🔥 GitHub repo for aquif-4 created
- [10.15.2025] 🔥 aquif-4-Exp (16B A3B) released
Model Overview
| Attribute | Value |
|---|---|
| Total Parameters | 16.45B |
| Active Parameters | 3.2B |
| Activation Ratio | 1:16 |
| Expert Count | 256 |
| Experts per Token | 16 |
| Attention Type | Hybrid (Softmax + Linear) |
| Context Window | 128K (expandable to 512K via YaRN) |
| Is Reasoning Model? | ✅ |
| Model Type | Mixture-of-Experts (MoE) |
Key Features
Hybrid Attention Mechanism
aquif-4-Exp is the first aquif model to implement a hybrid attention architecture combining:
- Softmax Attention: Applied at strategic layers for precise token interactions and complex reasoning patterns
- Linear Attention: Leverages Lightning Attention-2 (https://arxiv.org/abs/2401.04658) for efficient long-context processing
This combination enables efficient processing of extended sequences while maintaining the reasoning capabilities necessary for complex problem-solving tasks.
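As a rough illustration of how such a schedule might look, the sketch below interleaves the two attention types across layers. The depth and interleave period are hypothetical placeholders; the actual layer pattern of aquif-4-Exp is not specified here.

```python
# A minimal sketch of hybrid layer scheduling, assuming (hypothetically)
# that every fourth layer uses softmax attention and the rest use linear
# attention. The real layer pattern of aquif-4-Exp may differ.

NUM_LAYERS = 32     # hypothetical depth
SOFTMAX_EVERY = 4   # hypothetical interleave period

def attention_kind(layer_idx: int) -> str:
    """Return which attention variant a given layer would use."""
    return "softmax" if (layer_idx + 1) % SOFTMAX_EVERY == 0 else "linear"

schedule = [attention_kind(i) for i in range(NUM_LAYERS)]
print(schedule)  # mostly 'linear', with periodic 'softmax' layers
```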
Mixture-of-Experts Architecture
- 256 total experts, with 16 activated per token
- 1:16 activation ratio provides exceptional parameter efficiency
- Only 3.2B parameters are active during inference, enabling deployment on resource-constrained hardware while maintaining performance comparable to much larger dense models
- Expert routing is optimized for both training stability and inference efficiency (see the routing sketch below)
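A minimal sketch of the top-k routing described above, assuming a plain linear router; the tensor shapes and the renormalization step are illustrative, not the model's actual implementation:

```python
import torch

# Top-k expert routing sketch: 16 of 256 experts per token, matching the
# 1:16 activation ratio. Hidden size and router are illustrative only.
num_experts, top_k, hidden = 256, 16, 1024

router = torch.nn.Linear(hidden, num_experts, bias=False)
tokens = torch.randn(8, hidden)  # a batch of 8 token embeddings

logits = router(tokens)                                   # (8, 256) routing scores
weights, expert_ids = torch.topk(logits.softmax(dim=-1), top_k, dim=-1)
weights = weights / weights.sum(dim=-1, keepdim=True)     # renormalize over the chosen 16

print(expert_ids.shape, weights.shape)                    # torch.Size([8, 16]) each
```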
Extended Context Support
- 128K native context window for long-document processing
- Expandable to 512K tokens using YaRN (Yet another RoPE extensioN) without full retraining
- Efficient handling of multi-document scenarios and extensive code repositories
Architecture Details
The aquif-4-Exp implementation builds upon the Aquif4Linear architecture, featuring:
- Rotary Position Embeddings (RoPE) with optional scaling via YaRN
- Group-normalized RMSNorm for stable layer normalization across attention heads (sketched below)
- Efficient KV caching for accelerated inference
- Optimized flash-linear-attention operators from the FLA library
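The group-normalized RMSNorm above can be pictured as ordinary RMSNorm computed independently over each head group's channels. The following is a minimal sketch under that assumption; the group count, epsilon, and placement of the norm are illustrative, not the model's actual code:

```python
import torch

class GroupRMSNorm(torch.nn.Module):
    """Sketch of RMSNorm applied per attention-head group: each group's
    channels are normalized by their own RMS, then a learned per-channel
    scale is applied. Defaults here are illustrative assumptions."""

    def __init__(self, hidden: int, num_groups: int, eps: float = 1e-6):
        super().__init__()
        self.num_groups = num_groups
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, h = x.shape
        g = x.view(b, t, self.num_groups, h // self.num_groups)
        rms = g.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return (g * rms).view(b, t, h) * self.weight

x = torch.randn(2, 4, 1024)
print(GroupRMSNorm(1024, num_groups=16)(x).shape)  # torch.Size([2, 4, 1024])
```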
Performance Characteristics
As an experimental research model, aquif-4-Exp demonstrates:
- Reasoning-focused performance: Optimized for complex problem-solving and multi-step inference
- Efficiency at scale: 3.2B active parameters achieve competitive performance with larger models
- Multilingual support: Native support for English, German, Italian, Portuguese, French, Hindi, Spanish, Thai, Chinese, and Japanese
- Long-context understanding: Maintains coherence and reasoning quality across extended sequences
Evaluation
Speed
Figure 1: aquif-4-Exp and aquif-3.5-Think on Context Length x Normalized Prefill Throughput
Figure 2: aquif-4-Exp and aquif-3.5-Think on Generation Length x Normalized Decode Throughput
Performance
Figure 3: aquif-4-Exp and others evaluated on MMLU-Pro, AIME 2025, LiveCodeBench and GPQA Diamond (Chart).
| Metric | aquif-4-Exp (16B A3.2B) | aquif-3.5-Think (8.2B) | Qwen3-VL-Thinking-2510 (8.8B) | Ring-mini-2.0 (16.3B A1.4B) | gpt-oss (21B A3.6B) |
|---|---|---|---|---|---|
| MMLU-Pro | 76.9 | 78.1 | 77.3 | 66.8 | 71.5 |
| AIME 2025 | 82.3 | 81.4 | 80.3 | 74.1 | 72.1 |
| LiveCodeBench | 65.7 | 61.5 | 58.6 | 62.6 | 54.9 |
| GPQA Diamond | 70.1 | 66.8 | 69.9 | 68.2 | 66.0 |
| Average | 73.8 | 72.0 | 71.5 | 67.9 | 66.1 |
Figure 4: aquif-4-Exp and others evaluated on MMLU-Pro, AIME 2025, LiveCodeBench and GPQA Diamond (Table).
Installation
Requirements
```bash
pip install flash-linear-attention==0.3.2
```
For inference with HuggingFace Transformers
```bash
pip install transformers==4.56.1
```
For inference with vLLM
```bash
pip install torch==2.7.0 torchvision==0.22.0
pip install https://github.com/aquif-ai/aquif-4/raw/refs/heads/main/inference/vllm0.8.5-cuda12.8-gcc10.2.1-cp310-cp310-linux_x86_64.whl --no-deps --force-reinstall
```
For inference with SGLang
```bash
pip install sglang==0.5.2 sgl-kernel==0.3.9.post2 vllm==0.10.2 torch==2.8.0 torchvision==0.23.0 torchao
pip install https://github.com/aquif-ai/aquif-4/raw/refs/heads/main/inference/sglang-0.5.2-py3.whl --no-deps --force-reinstall
```
Note: the vLLM and SGLang paths above rely on the custom prebuilt wheels linked in this repo; llama.cpp is not yet supported. Native framework support will arrive with the full aquif-4 family release.
Usage
🤗 Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "aquif-ai/aquif-4-Exp"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompts = [
    "Hello World!"
]

# Apply the chat template to each prompt before tokenization
input_texts = []
for prompt in prompts:
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    input_texts.append(text)

print(input_texts)

model_inputs = tokenizer(
    input_texts,
    return_tensors="pt",
    return_token_type_ids=False,
    padding=True,
    padding_side="left"
).to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192,
    do_sample=False,
)
# Strip the prompt tokens so only newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print("*" * 30)
print(responses)
print("*" * 30)
```
⚙️ vLLM
Offline inference
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("aquif-ai/aquif-4-Exp")
sampling_params = SamplingParams(temperature=0.6, top_p=1.0, max_tokens=8192)
llm = LLM(model="aquif-ai/aquif-4-Exp", dtype="bfloat16", enable_prefix_caching=False)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

outputs = llm.generate([text], sampling_params)
print(outputs[0].outputs[0].text)
```
Online inference
```bash
vllm serve aquif-ai/aquif-4-Exp \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --no-enable-prefix-caching
```
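Once the server is up, it exposes an OpenAI-compatible API (on port 8000 by default, assuming no --port override). A minimal Python client:

```python
import requests

# Query the vLLM server started above via its OpenAI-compatible
# chat completions endpoint (port 8000 is vLLM's default; adjust if changed).
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "aquif-ai/aquif-4-Exp",
        "temperature": 0.6,
        "messages": [{"role": "user", "content": "Hello World!"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```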
💫 SGLang
Start server
```bash
python -m sglang.launch_server \
  --model-path <model_path> \
  --trust-remote-code \
  --tp-size 1 \
  --disable-radix-cache \
  --json-model-override-args "{\"linear_backend\": \"seg_la\"}"
```
Start client
```bash
curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "temperature": 0.6, "messages": [{"role": "user", "content": "Hello World!"}]}'
```
Enabling Extended Context with YaRN
To use the model with context windows beyond the default 128K tokens, you can configure YaRN scaling in the model's configuration before loading:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

model_name = "aquif-ai/aquif-4-Exp"

config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)

# Configure YaRN for 512K context
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 131072,
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
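With factor 4.0 applied to the native 131072-token window, the effective context length becomes 4.0 × 131072 = 524288 tokens (512K).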
Inference Framework Support
- Transformers (Native): ✅ Full support
- vLLM: ✅ Support through wheel
- SGLang: ✅ Support through wheel
- llama.cpp: ❌ Not supported
Framework support will be expanded with the full aquif-4 family release.
Usage Recommendations
aquif-4-Exp is designed for:
- Research applications exploring hybrid attention mechanisms and MoE architectures
- Reasoning-heavy tasks requiring interpretable chain-of-thought outputs
- Long-context processing for documents, code analysis, and multi-turn conversations
- Efficiency-critical deployments where parameter count matters as much as performance
Limitations and Considerations
- Experimental status: This is a research preview. Stability and performance may evolve with updates
- CoT overhead: Chain-of-thought reasoning increases generation latency compared to direct answering
- Hardware requirements: Despite 3.2B active parameters, peak memory usage during inference can be higher due to expert loading
- Not a full successor: aquif-4-Exp does not replace aquif-3.5 for production use cases; it represents architectural exploration
- Transformers-only: Currently requires Hugging Face Transformers; integration with other frameworks is forthcoming
Technical Specifications
- Attention Implementation: Hybrid softmax + linear (Lightning Attention-2)
- Precision Support: BF16, FP16
- Position Encoding: RoPE with YaRN scaling capability
- Training Data: Multilingual corpus spanning 10+ languages
- Model Family: First of the upcoming aquif-4 experimental series
Inference Optimization
For optimal performance:
- Use flash-attention-2 or SDPA for softmax attention layers when available (see the loading sketch after this list)
- Consider YaRN configuration for context windows beyond 128K
- Monitor VRAM usage with full expert loading enabled
- Leverage KV caching for multi-turn conversations
- Ensure `trust_remote_code=True` is set when loading from Hugging Face Hub
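A minimal loading sketch for the first bullet above, assuming the model's remote code honors Transformers' attn_implementation switch for its softmax attention layers:

```python
import torch
from transformers import AutoModelForCausalLM

# Request a specific attention backend at load time. Whether the hybrid
# layers honor this switch depends on the remote modeling code (assumption).
model = AutoModelForCausalLM.from_pretrained(
    "aquif-ai/aquif-4-Exp",
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",  # or "flash_attention_2" if installed
    trust_remote_code=True,
)
```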
About aquif-4 Full Release
aquif-4-Exp represents the first experimental release in the aquif-4 family exploration. The full aquif-4 release will not be a single model, but rather a comprehensive family of models with varying architectures, sizes, and specializations, all leveraging the innovations demonstrated in this experimental preview.
Acknowledgements
- aquif AI Research Team: Architecture design and optimization
- EleutherAI & HuggingFace: GPT-NeoX and modeling foundations
- Flash Linear Attention Project: FLA library for efficient kernel implementations
- Lightning Attention Authors: Attention mechanism research
License
This project is released under the Apache 2.0 License.
Note: aquif-4-Exp is a research release. For production applications, please refer to the aquif-3.5 model series. Feedback and findings from this experimental release will inform the development of the full aquif-4 family.
Made in 🇧🇷
© 2025 aquif AI. All rights reserved.