How to run any quantized GGUF model on CPU for local inference?

With the ctransformers library I can only load around a dozen supported model architectures. How can I run local inference on CPU (not just on GPU) with any open-source LLM quantized in the GGUF format (e.g. Llama 3, Mistral, Zephyr, i.e. models not supported by ctransformers)?

asked by DevEnma

2 Answers

llama-cpp-python is my personal choice, because it is easy to use and it is usually one of the first to support quantized versions of new models. To install it for CPU, just run pip install llama-cpp-python. Compiling for GPU is a little more involved, so I'll refrain from posting those instructions here since you asked specifically about CPU inference. I also recommend installing huggingface_hub (pip install huggingface_hub) to easily download models.
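
For reference, the CPU-only setup mentioned above is just two pip installs (no extra build flags needed):

pip install llama-cpp-python
pip install huggingface_hub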

Once you have both llama-cpp-python and huggingface_hub installed, you can download and use a model (e.g. mixtral-8x7b-instruct-v0.1-gguf) like so:

## Imports
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

## Download the GGUF model
model_name = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF"
# This is the specific model file we'll use in this example. It's a 4-bit quant,
# but other levels of quantization are available in the model repo if preferred.
model_file = "mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf"
model_path = hf_hub_download(model_name, filename=model_file)

## Instantiate model from downloaded file
llm = Llama(
    model_path=model_path,
    n_ctx=16000,     # Context length to use
    n_threads=32,    # Number of CPU threads to use (set this to your machine's core count)
    n_gpu_layers=0   # Number of model layers to offload to GPU (0 = pure CPU inference)
)

## Generation kwargs
generation_kwargs = {
    "max_tokens": 20000,  # Maximum number of tokens to generate (generation is also bounded by the context window)
    "stop": ["</s>"],     # Stop generating when the end-of-sequence token is produced
    "echo": False,        # Whether to echo the prompt in the output (False = return only the completion)
    "top_k": 1            # Essentially greedy decoding: the model always returns the highest-probability token. Set this value > 1 for sampling decoding
}

## Run inference
prompt = "The meaning of life is "
res = llm(prompt, **generation_kwargs)  # res is a dictionary

## Unpack the generated text from the response dictionary and print it
print(res["choices"][0]["text"])

Keep in mind that Mixtral is a fairly large model for most laptops and requires roughly 25+ GB of RAM, so if you need a smaller model, try one like llama-2-13b-chat-gguf (model_name="TheBloke/Llama-2-13B-chat-GGUF"; model_file="llama-2-13b-chat.Q4_K_M.gguf") or mistral-7b-openorca-gguf (model_name="TheBloke/Mistral-7B-OpenOrca-GGUF"; model_file="mistral-7b-openorca.Q4_K_M.gguf").
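
As a concrete example, here is a minimal sketch of the same workflow with the smaller Mistral OpenOrca file mentioned above (the n_ctx, n_threads, and max_tokens values are just illustrative placeholders; adjust them to your machine):

## Download and run a smaller GGUF model (Mistral-7B-OpenOrca, roughly 4-5 GB at Q4_K_M)
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_name = "TheBloke/Mistral-7B-OpenOrca-GGUF"
model_file = "mistral-7b-openorca.Q4_K_M.gguf"
model_path = hf_hub_download(model_name, filename=model_file)

llm = Llama(model_path=model_path, n_ctx=8192, n_threads=8, n_gpu_layers=0)
res = llm("The meaning of life is ", max_tokens=256, stop=["</s>"])
print(res["choices"][0]["text"])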

answered by Ryan Stewart

Transformers now supports loading quantized models in GGUF format as unquantized versions, allowing them to be run like standard models. Please note that this feature is still experimental and subject to change: https://huggingface.co/docs/transformers/main/en/gguf

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

# Passing gguf_file tells transformers to dequantize the GGUF weights on load
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
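
Once loaded, the model behaves like any other transformers model; here is a minimal CPU generation sketch (the prompt and max_new_tokens value are just placeholders):

## Generate a short completion on CPU with the dequantized model
inputs = tokenizer("The meaning of life is ", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))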

answered by ruok

