Understanding AI model files
If you want to understand which model files to use for LLM text generation, a great article was written by Bartowski [1] for his release of the Nemotron Llama model. We summarize its contents to give you an idea of which model size is best for your solution.
Before you dive into the model file names, please read our article "What is the difference between I-quant and K-quant models".
Typical model file names
Filename | Quant type | File Size | Split | Description |
---|---|---|---|---|
Llama-3.1-70B-Q8_0.gguf | Q8_0 | 74.98GB | true | Extremely high quality, generally unneeded but max available quant. |
Llama-3.1-70B-Q6_K.gguf | Q6_K | 57.89GB | true | Very high quality, near perfect, recommended. |
Llama-3.1-70B-Q5_K_L.gguf | Q5_K_L | 50.60GB | true | Uses Q8_0 for embed and output weights. High quality, recommended. |
Llama-3.1-70B-Q5_K_M.gguf | Q5_K_M | 49.95GB | true | High quality, recommended. |
Llama-3.1-70B-Q5_K_S.gguf | Q5_K_S | 48.66GB | false | High quality, recommended. |
Llama-3.1-70B-Q4_K_L.gguf | Q4_K_L | 43.30GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
Llama-3.1-70B-Q4_K_M.gguf | Q4_K_M | 42.52GB | false | Good quality, default size for most use cases, recommended. |
Llama-3.1-70B-Q4_K_S.gguf | Q4_K_S | 40.35GB | false | Slightly lower quality with more space savings, recommended. |
Llama-3.1-70B-Q4_0.gguf | Q4_0 | 40.12GB | false | Legacy format, generally not worth using over similarly sized formats. |
Llama-3.1-70B-Q3_K_XL.gguf | Q3_K_XL | 38.06GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
Llama-3.1-70B-IQ4_XS.gguf | IQ4_XS | 37.90GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
Llama-3.1-70B-Q3_K_L.gguf | Q3_K_L | 37.14GB | false | Lower quality but usable, good for low RAM availability. |
Llama-3.1-70B-Q3_K_M.gguf | Q3_K_M | 34.27GB | false | Low quality. |
Llama-3.1-70B-IQ3_M.gguf | IQ3_M | 31.94GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
Llama-3.1-70B-Q3_K_S.gguf | Q3_K_S | 30.91GB | false | Low quality, not recommended. |
Llama-3.1-70B-IQ3_XXS.gguf | IQ3_XXS | 27.47GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
Llama-3.1-70B-Q2_K_L.gguf | Q2_K_L | 27.40GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
Llama-3.1-70B-Q2_K.gguf | Q2_K | 26.38GB | false | Very low quality but surprisingly usable. |
Llama-3.1-70B-IQ2_M.gguf | IQ2_M | 24.12GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
Llama-3.1-70B-IQ2_XS.gguf | IQ2_XS | 21.14GB | false | Low quality, uses SOTA techniques to be usable. |
Llama-3.1-70B-IQ2_XXS.gguf | IQ2_XXS | 19.10GB | false | Very low quality, uses SOTA techniques to be usable. |
Llama-3.1-70B-IQ1_M.gguf | IQ1_M | 16.75GB | false | Extremely low quality, not recommended. |
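The quant type is encoded at the end of each file name, so it can be read back programmatically. The following is a minimal sketch, assuming file names shaped like the ones in the table above; the helper names `quant_type` and `quant_family` are hypothetical and not part of any library.

```python
import re
from typing import Optional

# Assumption: the quant type sits between the last "-" and ".gguf",
# as in the table above (e.g. "Llama-3.1-70B-Q5_K_M.gguf").
QUANT_RE = re.compile(r"-(I?Q\d+(?:_[A-Z0-9]+)*)\.gguf$")


def quant_type(filename: str) -> Optional[str]:
    """Return the quant type from a GGUF file name, or None if not found."""
    match = QUANT_RE.search(filename)
    return match.group(1) if match else None


def quant_family(quant: str) -> str:
    """Classify a quant type as I-quant, K-quant, or other (e.g. Q8_0, Q4_0)."""
    if quant.startswith("IQ"):
        return "I-quant"
    if "_K" in quant:
        return "K-quant"
    return "other"


if __name__ == "__main__":
    for name in ("Llama-3.1-70B-Q5_K_M.gguf", "Llama-3.1-70B-IQ4_XS.gguf"):
        q = quant_type(name)
        print(name, q, quant_family(q) if q else "unknown")
```

File names that follow a different convention will simply return `None`, so the pattern may need adjusting for other repositories.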
Q4_0_X_X
These are NOT for Metal (Apple) offloading, only ARM chips. If you're using an ARM chip, the Q4_0_X_X quants will give a substantial speedup. Check out the Q4_0_4_4 speed comparisons on the original pull request. To check which one would work best for your ARM chip, you can check the AArch64 SoC features.
Which file should I choose?
If you want to decide which models to use, the first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
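The sizing rule can be written down as a small calculation. This is only a sketch: the `QUANTS` list is a subset of the sizes from the table above, `pick_quant` is a hypothetical helper, and the 2GB headroom is the upper end of the 1-2GB margin suggested here.

```python
from typing import Optional

# (quant type, file size in GB), a subset copied from the table above.
QUANTS = [
    ("Q8_0", 74.98), ("Q6_K", 57.89), ("Q5_K_M", 49.95), ("Q4_K_M", 42.52),
    ("IQ4_XS", 37.90), ("Q3_K_M", 34.27), ("IQ3_M", 31.94), ("Q2_K", 26.38),
    ("IQ2_M", 24.12), ("IQ1_M", 16.75),
]


def pick_quant(vram_gb: float, ram_gb: float = 0.0,
               headroom_gb: float = 2.0) -> Optional[str]:
    """Return the largest quant whose file fits in the memory budget.

    Pass only vram_gb to optimise for speed (whole model on the GPU);
    pass ram_gb as well to optimise for quality (RAM and VRAM combined).
    """
    budget = vram_gb + ram_gb - headroom_gb
    fitting = [(name, size) for name, size in QUANTS if size <= budget]
    if not fitting:
        return None  # nothing in the list fits this machine
    return max(fitting, key=lambda q: q[1])[0]


if __name__ == "__main__":
    print(pick_quant(24.0))        # speed: a 24GB GPU only
    print(pick_quant(24.0, 32.0))  # quality: 24GB VRAM plus 32GB RAM
```

With a 24GB GPU alone the budget is 22GB, so only the smallest quants fit; adding 32GB of system RAM raises the budget to 54GB and allows a much higher quality file at the cost of speed.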
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.
If you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. These are newer and offer better quality for their size. These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed versus quality is a tradeoff you'll have to decide.
The I-quants are not compatible with Vulkan, which is also used on AMD cards, so if you have an AMD card double check whether you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.
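The I-quant versus K-quant guidance can be condensed into a small decision function. This is a sketch of the rules of thumb from this section, not an official compatibility check; `recommend_family` and the backend labels are hypothetical names chosen for illustration.

```python
def recommend_family(bits: int, backend: str) -> str:
    """Suggest 'I-quant' or 'K-quant' from the rules of thumb in this section.

    bits    -- the quant level you are aiming for (e.g. 3 for Q3 / IQ3)
    backend -- assumed labels: 'cublas', 'rocblas', 'vulkan', 'metal', 'cpu'
    """
    backend = backend.lower()
    if backend == "vulkan":
        # The section states I-quants are not compatible with Vulkan.
        return "K-quant"
    if bits < 4 and backend in {"cublas", "rocblas"}:
        # Below Q4 on Nvidia or AMD (ROCm), I-quants give more quality per GB.
        return "I-quant"
    # Default: K-quants. On CPU and Metal, I-quants work but run slower.
    return "K-quant"


if __name__ == "__main__":
    print(recommend_family(3, "cuBLAS"))
    print(recommend_family(3, "Vulkan"))
```

On CPU or Apple Metal the function defaults to K-quants for speed; picking an I-quant there remains a valid choice if quality per gigabyte matters more than throughput.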
References:
[1] https://huggingface.co/bartowski/Llama-3.1-70B-GGUF