Terms of Use: Please refer to the original model's license and terms of use.

Quants

q2_k: Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.
q3_k_s: Uses Q3_K for all tensors
q3_k_m: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
q3_k_l: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
q4_0: Original quant method, 4-bit.
q4_1: Higher accuracy than q4_0 but lower than q5_0. Faster inference than the q5 models.
q4_k_s: Uses Q4_K for all tensors
q4_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
q5_0: Higher accuracy than q4_1, with higher resource usage and slower inference.
q5_1: Even higher accuracy than q5_0, with still higher resource usage and slower inference.
q5_k_s: Uses Q5_K for all tensors
q5_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
q6_k: Uses Q6_K for all tensors
q8_0: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.
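
As a rough aid for choosing among the quants above, the following sketch estimates file size from bits per weight. It is an approximation under stated assumptions, not an exact calculation: the bits-per-weight figures are the nominal values for the ggml block layouts as commonly documented (not stated in this card), each file is treated as if every tensor used a single type, the mixed variants (q2_k, q3_k_m, q3_k_l, q4_k_m, q5_k_m) are left out, and the parameter count is rounded to 8e9. The name `estimate_size_gb` is hypothetical.

```python
# Nominal bits per weight for each single-type quant (assumed values,
# derived from the ggml block layouts; real files differ slightly because
# some tensors are stored in other types and metadata adds overhead).
BITS_PER_WEIGHT = {
    "q3_k_s": 3.4375,
    "q4_0": 4.5,
    "q4_1": 5.0,
    "q4_k_s": 4.5,
    "q5_0": 5.5,
    "q5_1": 6.0,
    "q5_k_s": 5.5,
    "q6_k": 6.5625,
    "q8_0": 8.5,
}


def estimate_size_gb(quant: str, n_params: float = 8e9) -> float:
    """Return an approximate file size in GB (10^9 bytes) for a quant.

    n_params defaults to 8e9, a rounding of the "8B params" figure.
    """
    total_bits = BITS_PER_WEIGHT[quant] * n_params
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    # Print the estimates from smallest to largest.
    for name, bpw in sorted(BITS_PER_WEIGHT.items(), key=lambda kv: kv[1]):
        print(f"{name}: ~{estimate_size_gb(name):.1f} GB")
```

Actual memory use at inference time will be higher than these figures, since the context cache and runtime buffers come on top of the weights.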

Format: GGUF
Model size: 8B params
Architecture: llama

Collection including neopolita/hermes-2-theta-llama-3-8b-gguf