
What does "I" in the section "_IQ" and "_M" mean in this name "Meta-Llama-3-8B-Instruct-IQ3_M.gguf"?

I'd appreciate it if someone could let me know what the "I" in "IQ" and the "_M" mean in the name "Meta-Llama-3-8B-Instruct-IQ3_M.gguf".

I searched and found what the "Q" means (quantization), but I cannot find the meanings of "I" and "M".

Franva asked Oct 31 '25 23:10


1 Answer

The answer you accepted is completely wrong!

Importance Matrix has nothing to do with I-Quant.



Importance Matrix is not I-Quant

Importance Matrix can be applied to BOTH K-Quants (Q4_K_M, etc) and I-Quants (IQ4_XS, etc). It's a completely different thing.

That is why you see imatrix quants built from K-Quants, like the one here:

bartowski's Llama 3.3 GGUF model card in huggingface

Likewise, gguf-my-repo, which does the quantization for you, lists exactly the same quant methods regardless of the imatrix option.

Image of gguf-my-repo's ggml quantization method list



So what is I-Quant then?

Quant methods (n here means bpw, # here means suffix):

  • I-Quant: IQn_# (IQ4_XS, IQ3_XXS, IQ4_NL, etc)

    • Introduced in llama.cpp #4773, influenced by QuIP#
    • Acts like a compressed file, using a lookup table to save extra space, basically trading speed for memory. This means that if you are memory bound, you can expect a better yet slightly slower model from I-Quanted GGUFs.
    • Use this for extra memory savings with less quality degradation than K-Quants at <= 4 bits, at the cost of decoding speed due to the lookup table accesses etc. (not generation speed - the effect during generation is minimal). See the sketch after this list.
  • Legacy: Qn_0 / Qn_1 (Q4_0, Q4_1, etc)

    • Old, basic quantization method. Fastest to decode but also space-inefficient. Not recommended nowadays over K-Quants.
  • K-Quant

    • Introduced in llama.cpp #1684: Qn_K / Qn_K_# (Q4_K, Q4_K_M, etc)
    • Better overall than legacy quants; currently the most widely used quant types. Use these whenever possible for >= 4 bpw.
    • Intelligently mixes which quantization each layer uses. Suffixes like M, XS, etc. refer to those mixes AFAIK. To quote bartowski - Q4_K_M uses Q4_K for the embedding, attention K, attention Q, feed forward network gate and up, and the attention output, while using Q6_K for the attention V and feed forward network down matrices. For details refer to the comment.
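
To make that concrete, here's a minimal Python sketch of the two ideas. It is not llama.cpp's actual code: the legacy part only roughly imitates Q4_0's per-block scale, and the codebook values standing in for an I-Quant lookup table are made up for illustration.

```python
import numpy as np

BLOCK = 32  # llama.cpp's Q4_0 groups weights into 32-weight blocks

def quant_legacy_q4_0(x):
    """Roughly how legacy Q4_0 works: one scale per block, weights snapped
    to a linear 4-bit grid. Fast to decode, but the grid is rigid."""
    x = x.reshape(-1, BLOCK)
    amax = np.abs(x).max(axis=1, keepdims=True)
    d = np.where(amax == 0, 1.0, amax) / -8.0        # per-block scale
    q = np.clip(np.round(x / d), -8, 7)              # 4-bit signed levels
    return (q * d).reshape(-1)                       # dequantized weights

# Toy non-linear codebook standing in for the lookup tables I-Quants use
# (e.g. IQ4_NL). The real tables are carefully tuned; these values are invented.
CODEBOOK = np.array([-1.0, -0.7, -0.45, -0.3, -0.18, -0.1, -0.04, 0.0,
                      0.04, 0.1, 0.18, 0.3, 0.45, 0.6, 0.8, 1.0])

def quant_lookup(x):
    """Lookup-table quantization: store a per-block scale plus, per weight,
    an index into a shared codebook. Decoding needs an extra table access,
    which is the speed-for-memory trade-off described above."""
    x = x.reshape(-1, BLOCK)
    d = np.abs(x).max(axis=1, keepdims=True) + 1e-12
    idx = np.argmin(np.abs((x / d)[..., None] - CODEBOOK), axis=-1)
    return (CODEBOOK[idx] * d).reshape(-1)

if __name__ == "__main__":
    w = np.random.default_rng(0).normal(scale=0.1, size=4 * BLOCK)
    for name, fn in (("legacy-style", quant_legacy_q4_0), ("lookup-table", quant_lookup)):
        print(f"{name:12s} mse: {np.mean((w - fn(w)) ** 2):.8f}")
```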

I highly recommend reading the extremely well-written descriptions in bartowski's GGUF repos for per-quant differences, such as which quantization the embedding and other tensors use.



How do these compare to each other?

Refer to this chart made by ikawrakow for a relative comparison between them.

GGUF quant method comparison chart

  • PPL (perplexity)
    • Simple: how much the model is 'damaged' during quantization. Lower is better.
    • Actual: represents how well/confidently the model predicts a given sequence. Since this relies heavily on the training data, it's not good for comparing different models, but it is useful for checking quantization damage via the relative PPL difference between quants, as in the graph.
  • bpw
    • Average bits per weight, so lower is better, as it saves VRAM and lets you put more layers or context on the GPU. (See the sketch below for both.)
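
As a tiny worked example of both terms (toy numbers, not from any real model): perplexity is just the exponentiated average negative log-probability the model assigns to the correct next tokens, and bpw converts a parameter count into a rough file/VRAM footprint.

```python
import math

# --- PPL: exp of the average negative log-probability of the true tokens ---
# toy probabilities a model assigned to the correct next token at each position
token_probs = [0.42, 0.08, 0.65, 0.30, 0.12]
ppl = math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
print(f"perplexity ~ {ppl:.2f}")   # lower = the model is less 'surprised' by the text

# --- bpw: average bits per weight -> rough file / VRAM footprint ---
def approx_size_gb(params_billion, bpw):
    # parameters * bits per weight, ignoring metadata and the KV cache
    return params_billion * 1e9 * bpw / 8 / 1e9

print(f"8B model at 4.5 bpw ~ {approx_size_gb(8, 4.5):.1f} GB")
print(f"8B model at 16 bpw  ~ {approx_size_gb(8, 16):.1f} GB (unquantized f16)")
```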

So it's a delicate balance between model degradation and size, just like when you encode a video.

You either:

  • choose a better method (h264 -> hevc, K-quant -> I-quant)
  • choose quality over size (CRF 24 -> 20, Q4 -> Q5, IQ3 -> IQ4)
  • choose size over quality (vice versa)


There's too many! What quant should I use!?

Generally many use Q4_K_M (IQ4_XS if low on VRAM) and you'll be about 99% fine with it, but here's how things generally work:

Imagine you're resizing a picture. Resize a 4K image to 1/4 and it's still a good-looking picture; resize 480p to 1/4 and you basically can't see a thing.

Information in smaller models' weights is extremely densely packed, so losing bits leads to severe damage. On the other hand, large models work decently under extreme quantization, as their weights are more sparsely distributed.

Hence, it's sometimes more viable to use a smaller quant of a large model than a higher quant of a smaller model (e.g. IQ3_XS Qwen 2.5 32B over Q6_K_L Qwen 2.5 14B).

How viable this is differs heavily per model, so you should try both and decide for yourself (a rough size comparison sketch follows below). That way, when you need to save extra memory to fit more context into VRAM for faster speed, you'll know how to find the model that fits your needs.
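
Here's a rough sketch of that size comparison. The parameter counts and bpw figures below are my approximations (they vary per model revision and llama.cpp version), so treat the numbers as ballpark only; the point is that both options land in a similar VRAM range, so the real question is which one degrades less for your use case.

```python
def approx_size_gb(params_billion, bpw):
    # rough GGUF size: parameters * bits per weight (ignores metadata & KV cache)
    return params_billion * 1e9 * bpw / 8 / 1e9

# approximate parameter counts and bpw values -- ballpark only
options = [
    ("Qwen 2.5 32B", 32.5, "IQ3_XS", 3.3),
    ("Qwen 2.5 14B", 14.7, "Q6_K",   6.6),
]
for model, params_b, quant, bpw in options:
    print(f"{model} @ {quant:6s} ~ {approx_size_gb(params_b, bpw):.1f} GB")
```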



...So what was Imatrix?

Imatrix & IQ quants are commonly confused by so many people (so was I), that normal search of what those are will commonly put you into misconception like the answer you accepted.

Refer to the following Reddit post, which did the research for us, for more details on each. To quote the imatrix part:

Importance matrix

Somewhat confusingly introduced around the same time as the i-quants, which made me think that they are related and the "i" refers to the "imatrix". But this is apparently not the case, and you can make both legacy and k-quants that use imatrix, and i-quants that do not. All the imatrix does is telling the quantization method which weights are more important, so that it can pick the per-block constants in a way that prioritizes minimizing error of the important weights. The only reason why i-quants and imatrix appeared at the same time was likely that the first presented i-quant was a 2-bit one – without the importance matrix, such a low bpw quant would be simply unusable.

Note that this means you can't easily tell whether a model was quantized with the help of importance matrix just from the name. ...
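
To put the quoted point into code, here is a toy Python sketch (not llama.cpp's actual imatrix math; the importance values are invented). It shows that the importance matrix only changes how the per-block constants are chosen - the quantization grid itself stays whatever type you picked.

```python
import numpy as np

def quantize_block(x, importance=None, levels=16):
    """Pick a per-block scale for a linear `levels`-level grid by brute-force
    search, minimizing the (optionally importance-weighted) squared error.
    Toy illustration only -- llama.cpp's real search is more elaborate."""
    w = np.ones_like(x) if importance is None else importance
    best_err, best_dq = np.inf, x
    for scale in np.linspace(1e-3, np.abs(x).max(), 200):
        q = np.clip(np.round(x / scale), -levels // 2, levels // 2 - 1)
        dq = q * scale
        err = np.sum(w * (x - dq) ** 2)   # important weights count more
        if err < best_err:
            best_err, best_dq = err, dq
    return best_dq

rng = np.random.default_rng(0)
x = rng.normal(scale=0.1, size=32)
importance = np.ones_like(x)
importance[:4] = 50.0                      # pretend the first 4 weights matter a lot

plain = quantize_block(x)                  # no imatrix: all weights treated equally
imat = quantize_block(x, importance)       # with "imatrix": important weights prioritized
print("squared error on the 'important' weights:")
print(f"  without imatrix: {np.sum((x[:4] - plain[:4]) ** 2):.6f}")
print(f"  with imatrix:    {np.sum((x[:4] - imat[:4]) ** 2):.6f}  (typically lower)")
```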

jupiterbjy answered Nov 03 '25 21:11