How to run any quantized GGUF model on CPU for local inference?

With the ctransformers library I can only load around a dozen supported model architectures. How can I run local inference on CPU (not just on GPU) with any open-source LLM quantized in the GGUF format (e.g. Llama 3, Mistral, Zephyr, i.e. models not supported by ctransformers)?

asked by DevEnma

2 Answers

llama-cpp-python is my personal choice, because it is easy to use and it is usually one of the first to support quantized versions of new models. To install it for CPU, just run pip install llama-cpp-python. Compiling for GPU is a little more involved, so I'll refrain from posting those instructions here since you asked specifically about CPU inference. I also recommend installing huggingface_hub (pip install huggingface_hub) to easily download models.
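
For reference, the CPU-only setup mentioned above is just two pip installs (no extra build flags needed):

pip install llama-cpp-python
pip install huggingface_hub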

Once you have both llama-cpp-python and huggingface_hub installed, you can download and use a model (e.g. mixtral-8x7b-instruct-v0.1-gguf) like so:

## Imports
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

## Download the GGUF model
model_name = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF"
# This is the specific model file we'll use in this example. It's a 4-bit quant,
# but other levels of quantization are available in the model repo if preferred.
model_file = "mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf"
model_path = hf_hub_download(model_name, filename=model_file)

## Instantiate model from downloaded file
llm = Llama(
    model_path=model_path,
    n_ctx=16000,     # Context length to use
    n_threads=32,    # Number of CPU threads to use (set this to your machine's core count)
    n_gpu_layers=0   # Number of model layers to offload to GPU (0 = pure CPU inference)
)

## Generation kwargs
generation_kwargs = {
    "max_tokens": 20000,  # Maximum number of tokens to generate (generation is also bounded by the context window)
    "stop": ["</s>"],     # Stop generating when the end-of-sequence token is produced
    "echo": False,        # Whether to echo the prompt in the output (False = return only the completion)
    "top_k": 1            # Essentially greedy decoding: the model always returns the highest-probability token. Set this value > 1 for sampling decoding
}

## Run inference
prompt = "The meaning of life is "
res = llm(prompt, **generation_kwargs)  # res is a dictionary

## Unpack the generated text from the response dictionary and print it
print(res["choices"][0]["text"])

Keep in mind that Mixtral is a fairly large model for most laptops and requires roughly 25+ GB of RAM, so if you need a smaller model, try one like llama-2-13b-chat-gguf (model_name="TheBloke/Llama-2-13B-chat-GGUF"; model_file="llama-2-13b-chat.Q4_K_M.gguf") or mistral-7b-openorca-gguf (model_name="TheBloke/Mistral-7B-OpenOrca-GGUF"; model_file="mistral-7b-openorca.Q4_K_M.gguf").
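
As a concrete example, here is a minimal sketch of the same workflow with the smaller Mistral OpenOrca file mentioned above (the n_ctx, n_threads, and max_tokens values are just illustrative placeholders; adjust them to your machine):

## Download and run a smaller GGUF model (Mistral-7B-OpenOrca, roughly 4-5 GB at Q4_K_M)
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_name = "TheBloke/Mistral-7B-OpenOrca-GGUF"
model_file = "mistral-7b-openorca.Q4_K_M.gguf"
model_path = hf_hub_download(model_name, filename=model_file)

llm = Llama(model_path=model_path, n_ctx=8192, n_threads=8, n_gpu_layers=0)
res = llm("The meaning of life is ", max_tokens=256, stop=["</s>"])
print(res["choices"][0]["text"])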

answered by Ryan Stewart

Transformers now supports loading quantized models in GGUF format as unquantized versions, allowing them to be run like standard models. Please note that this feature is still experimental and subject to change: https://huggingface.co/docs/transformers/main/en/gguf

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

# Passing gguf_file tells transformers to dequantize the GGUF weights on load
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
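
Once loaded, the model behaves like any other transformers model; here is a minimal CPU generation sketch (the prompt and max_new_tokens value are just placeholders):

## Generate a short completion on CPU with the dequantized model
inputs = tokenizer("The meaning of life is ", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))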

answered by ruok

