High-quality quantization of DeepSeek-V3.1, made without using an imatrix.

The architecture has not changed, so token generation speed should be the same as for DeepSeek-R1-0528; see the benchmarks here.

Run

ik_llama.cpp

See this detailed guide on how to set up ik_llama and how to make custom quants.

./build/bin/llama-server \
    --alias anikifoss/DeepSeek-V3.1-HQ4_K \
    --model /home/gamer/Env/models/anikifoss/DeepSeek-V3.1-HQ4_K/DeepSeek-V3.1-HQ4_K-00001-of-00010.gguf \
    --no-mmap \
    --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 \
    --ctx-size 82000 \
    -ctk f16 \
    -mla 3 -fa \
    -amb 512 \
    -b 1024 -ub 1024 \
    -fmoe \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 32 \
    --threads-batch 64 \
    --host 127.0.0.1 \
    --port 8090
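Once the server is running, it exposes an OpenAI-compatible chat endpoint. The sketch below (my own illustration, not part of the repository) builds a request whose sampling parameters mirror the flags above; the "min_p" and "top_k" fields are llama-server extensions to the OpenAI schema.

```python
import json
import urllib.request

SERVER = "http://127.0.0.1:8090"  # matches --host/--port in the command above

def build_payload(prompt: str) -> dict:
    """Build an OpenAI-style chat request mirroring the sampling flags
    passed to llama-server (--temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1)."""
    return {
        "model": "anikifoss/DeepSeek-V3.1-HQ4_K",  # the --alias value
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.5,
        "top_k": 0,
        "top_p": 1.0,
        "min_p": 0.1,
    }

def chat(prompt: str) -> str:
    """Send the request to the local server and return the reply text."""
    req = urllib.request.Request(
        SERVER + "/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```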

llama.cpp

You can turn on thinking by changing "thinking": false to "thinking": true below.

Currently, llama.cpp does not return the <think> token in its response. If you know how to fix that, please share in the "Community" section!

As a workaround, to inject the token in OpenWebUI, you can use the inject_think_token_filter.txt code included in the repository. You can add filters via Admin Panel -> Functions -> Filter -> the + button on the right.
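The idea behind such a filter can be sketched roughly as below. This is my own illustration, not the contents of inject_think_token_filter.txt (which is the authoritative version); it assumes OpenWebUI's filter convention of a class with an outlet method that receives the response body, and that the closing </think> tag does survive in the reply.

```python
class Filter:
    """Rough sketch of an OpenWebUI-style filter: re-attach the opening
    <think> tag that llama.cpp currently drops, so the UI can render the
    reasoning block. Assumes the closing </think> tag is still present."""

    def outlet(self, body: dict) -> dict:
        for message in body.get("messages", []):
            content = message.get("content", "")
            if (message.get("role") == "assistant"
                    and "</think>" in content
                    and not content.lstrip().startswith("<think>")):
                message["content"] = "<think>" + content
        return body
```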

./build/bin/llama-server \
    --alias anikifoss/DeepSeek-V3.1-HQ4_K \
    --model /home/gamer/Env/models/anikifoss/DeepSeek-V3.1-HQ4_K/DeepSeek-V3.1-HQ4_K-00001-of-00010.gguf \
    --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 \
    --ctx-size 64000 \
    -ctk f16 \
    -fa \
    --chat-template-kwargs '{"thinking": false }' \
    -b 1024 -ub 1024 \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 32 \
    --threads-batch 64 \
    --jinja \
    --host 127.0.0.1 \
    --port 8090
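If you call the llama.cpp server's API directly rather than through OpenWebUI, the same workaround can be applied client-side. A minimal sketch (the helper name is mine, and it assumes the closing </think> tag is still present in the reply):

```python
def restore_think_token(content: str) -> str:
    """Re-attach the opening <think> tag that llama.cpp currently drops,
    assuming the closing </think> tag survives in the reply text."""
    stripped = content.lstrip()
    if "</think>" in stripped and not stripped.startswith("<think>"):
        return "<think>" + content
    return content
```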
Model details: GGUF, 671B parameters, deepseek2 architecture.