I'd appreciate it if someone could tell me what the "I" and the "_M" mean in the name "Meta-Llama-3-8B-Instruct-IQ3_M.gguf".
I searched and found what the "Q" means (quantization), but I cannot find the meanings of "I" and "M".
The importance matrix has nothing to do with I-Quants.
An importance matrix can be used with BOTH K-Quants (Q4_K_M, etc.) and I-Quants (IQ4_XS, etc.). They are completely different things.
That is why you see imatrix quants made as K-Quants, like the one here:

Likewise, gguf-my-repo, which does the quantization for you, offers the same quant methods regardless of the imatrix option.

Quant methods (n here means approximate bpw, # here means a size suffix):
I-Quant: IQn_# (IQ4_XS, IQ3_XXS, IQ4_NL, etc.)
Legacy: Qn_0 / Qn_1 (Q4_0, Q4_1, etc.)
K-Quant: Qn_K / Qn_K_# (Q4_K, Q4_K_M, etc.)
I highly recommend reading the extremely well-written descriptions in bartowski's GGUF repos for the per-quant differences, such as which embedding type each one uses.
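The three naming families above can be sketched as a small parser. Note that `classify_quant` and its regex are a hypothetical helper written for illustration, not part of llama.cpp or any library:

```python
import re

# Illustrative classifier for GGUF quant-type names, covering the three
# families listed above: I-Quants, K-Quants, and legacy quants.
PATTERN = re.compile(
    r"^(?:"
    r"(?P<iq>IQ(?P<ibits>\d)_(?P<isuffix>XXS|XS|S|M|NL))"  # I-Quants
    r"|(?P<k>Q(?P<kbits>\d)_K(?:_(?P<ksuffix>S|M|L))?)"    # K-Quants
    r"|(?P<legacy>Q(?P<lbits>\d)_(?P<lver>[01]))"          # Legacy
    r")$"
)

def classify_quant(name: str) -> dict:
    m = PATTERN.match(name.upper())
    if not m:
        raise ValueError(f"unrecognized quant type: {name}")
    if m.group("iq"):
        return {"family": "I-Quant", "bits": int(m.group("ibits")),
                "suffix": m.group("isuffix")}
    if m.group("k"):
        return {"family": "K-Quant", "bits": int(m.group("kbits")),
                "suffix": m.group("ksuffix")}
    return {"family": "Legacy", "bits": int(m.group("lbits")),
            "suffix": m.group("lver")}

print(classify_quant("IQ3_M"))   # the quant in the question's filename
print(classify_quant("Q4_K_M"))
print(classify_quant("Q4_0"))
```

So in "Meta-Llama-3-8B-Instruct-IQ3_M.gguf", "IQ3" marks a 3-bit I-Quant and "_M" is the medium size variant.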
Refer to this image made by ikawrakow for a relative comparison between them.

So it's a delicate balance between model degradation and size, just like choosing how to encode a video.
Generally, most people use Q4_K_M (or IQ4_XS if low on VRAM) and you'll be about 99% fine with it, but here's how things generally work:
Imagine you're resizing a picture. Resize a 4K image to 1/4 and it's still a good-looking picture; resize a 480p image to 1/4 and you basically can't see a thing.
Information in a smaller model's weights is extremely densely packed, so losing a bit leads to severe damage. Large models, on the other hand, work decently under extreme quantization because their weights are more sparsely distributed.
Hence, it's sometimes more viable to use a smaller quant of a large model than a higher quant of a smaller model (e.g., IQ3_XS Qwen 2.5 32B over Q6_K_L Qwen 2.5 14B).
How viable this is differs heavily per model, so you should try both and decide for yourself. That way, when you need to save extra memory to fit more context into VRAM for faster speed, you'll know how to find the model that fits your needs.
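A back-of-the-envelope weight-memory estimate makes the comparison above concrete: size is roughly parameters times bits-per-weight divided by 8. The bpw figures below are approximations I'm assuming for these quant types, not exact llama.cpp numbers, and the estimate ignores context/KV-cache memory:

```python
def approx_size_gb(params_billions: float, bpw: float) -> float:
    """Rough weight size in GB: params * bits-per-weight / 8 bits-per-byte."""
    return params_billions * bpw / 8

# Approximate bpw values (assumptions): IQ3_XS ~3.3, Q6_K ~6.6.
print(f"32B @ IQ3_XS (~3.3 bpw): {approx_size_gb(32, 3.3):.1f} GB")
print(f"14B @ Q6_K   (~6.6 bpw): {approx_size_gb(14, 6.6):.1f} GB")
```

Under these assumptions the heavily quantized 32B is only slightly larger than the lightly quantized 14B, which is why the trade-off is worth testing.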
Imatrix and IQ quants are confused by many people (I was one of them), so an ordinary search for what they are will often lead you into misconceptions like the answer you accepted.
Refer to the following Reddit post, which did the research for us, for more details on each. To quote the imatrix part:
Importance matrix
Somewhat confusingly, it was introduced around the same time as the i-quants, which made me think that they are related and the "i" refers to the "imatrix". But this is apparently not the case: you can make both legacy and k-quants that use an imatrix, and i-quants that do not. All the imatrix does is tell the quantization method which weights are more important, so that it can pick the per-block constants in a way that prioritizes minimizing the error of the important weights. The only reason i-quants and the imatrix appeared at the same time was likely that the first i-quant presented was a 2-bit one – without the importance matrix, such a low-bpw quant would simply be unusable.
Note that this means you can't easily tell whether a model was quantized with the help of importance matrix just from the name. ...
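The quoted idea, picking per-block constants so that error on the important weights is minimized, can be sketched in a few lines of NumPy. This is a toy illustration of the principle only, not llama.cpp's actual quantization algorithm; the 2-bit grid search and the fake importance values are my own assumptions:

```python
import numpy as np

def quantize_block(block, scale, qmax):
    """Round-to-nearest symmetric quantization of a block at a given scale."""
    q = np.clip(np.round(block / scale), -qmax - 1, qmax)
    return q * scale

def best_scale(block, importance, bits=2):
    """Grid-search the per-block scale minimizing importance-weighted error."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.linspace(1e-3, np.abs(block).max(), 500)
    errs = [np.sum(importance * (block - quantize_block(block, s, qmax)) ** 2)
            for s in scales]
    return scales[int(np.argmin(errs))]

rng = np.random.default_rng(0)
w = rng.normal(size=64)
imp = np.ones(64)
imp[:8] = 100.0  # pretend the imatrix says the first 8 weights matter most

s_plain = best_scale(w, np.ones(64))  # scale chosen ignoring importance
s_imat = best_scale(w, imp)           # scale chosen using the "imatrix"

def weighted_err(s):
    return np.sum(imp * (w - quantize_block(w, s, 1)) ** 2)

print(f"weighted error, plain scale:   {weighted_err(s_plain):.2f}")
print(f"weighted error, imatrix scale: {weighted_err(s_imat):.2f}")
```

The importance-weighted choice trades a little overall error for better accuracy on the weights the imatrix flags as important, which is exactly why it rescues very low bpw quants.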