Running LLMs on CPUs: Practical Guide and Real-World Benchmarks
For a long time, running large language models locally seemed to demand a powerful GPU. However, recent advances in model formats like GGUF and aggressive quantization (such as 4-bit variants) have drastically reduced memory and compute requirements. Combined with efficient runtimes like Llama.cpp, even older CPUs can now run these models—but there's a catch: not all models that technically work are actually usable. The real measure of success is tokens per second (tok/s). Below I answer common questions about running LLMs on CPU-only machines based on my own tests with an Intel i5 laptop (12 GB RAM).
Can you really run large language models on a computer without a dedicated GPU?
Yes, absolutely! Thanks to quantization methods like Q4_K_M and formats such as GGUF, models are compressed to a fraction of their original size. Runtimes like Llama.cpp are optimized for CPU execution, making inference possible on hardware that would have been laughable just a year ago. That said, performance varies widely. On my i5 laptop with 12 GB RAM, I could run models from 1B to 7B parameters, but only the smaller ones felt fast. The key is that the model (plus some runtime overhead) must fit in system RAM, and the CPU must be able to generate tokens quickly enough to avoid a frustrating wait. So while a GPU isn't required, you do need to choose models wisely.
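If you want to see what "CPU-only inference" looks like in practice, here is a minimal sketch using the llama-cpp-python bindings (a Python wrapper around Llama.cpp). The model path, context size, and thread count are placeholders you would adapt to your own GGUF file and CPU; treat it as a starting point, not the exact setup used for the numbers below.

```python
# Minimal CPU-only inference sketch with llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder; download any Q4_K_M GGUF file from Hugging Face first.
from llama_cpp import Llama

llm = Llama(
    model_path="models/tinyllama-1.1b.Q4_K_M.gguf",  # hypothetical local path
    n_ctx=2048,     # context window; larger values cost more RAM
    n_threads=4,    # match your physical core count
)

output = llm("Explain quantization in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```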

What does it mean for a model to “run well” on a CPU?
The most important metric is tokens per second (tok/s). A model that delivers 3–5 tok/s technically works but feels painfully slow, with each response taking many seconds. For everyday use, you want at least 15 tok/s, and ideally 30 or more; that threshold is the difference between a model being a novelty and a useful tool. Smaller models (1B–2B parameters) with aggressive quantization (Q4_K_M) easily hit 20+ tok/s on my laptop, while larger 4B models can drop to 4 tok/s. So “runs well” means the model feels responsive, not just that it produces output. I also consider RAM usage: a model that fits within 8 GB leaves room for the OS and applications, preventing swap-induced slowdowns.
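If you would rather time things yourself instead of relying on the runtime's printed statistics, a rough tok/s figure is just completion tokens divided by wall-clock time. This sketch assumes the llama-cpp-python bindings and a placeholder model path; the token count comes from the usage field the library returns.

```python
# Rough tok/s measurement: time one generation and divide completion tokens by wall time.
# Model path is a placeholder; llama-cpp-python reports token counts in the "usage" field.
import time
from llama_cpp import Llama

llm = Llama(model_path="models/tinyllama-1.1b.Q4_K_M.gguf", n_ctx=2048, n_threads=4, verbose=False)

start = time.perf_counter()
result = llm("Write a short haiku about old laptops.", max_tokens=128)
elapsed = time.perf_counter() - start

tokens = result["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```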
Which quantization level offers the best balance for CPU inference?
In my testing, Q4_K_M consistently provides the sweet spot. This quantization (a 4-bit “k-quant” in its medium variant) dramatically reduces model size and speeds up inference while keeping output quality acceptable for most tasks. Q8_0 (8-bit) offers higher quality but is noticeably slower, often dropping tok/s by 30–50%. Q4_K_M can move a model from “unusable” to “comfortably quick”: for instance, a 3B model might jump from 8 to 20 tok/s compared to its 8-bit version. The slight quality loss is barely noticeable for casual conversations, summarization, or simple reasoning. If you need maximum accuracy, you can still use Q8_0, but be prepared for slower responses.
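A useful sanity check when picking a quantization level is the back-of-envelope size estimate: parameters times bits per weight, divided by eight. The bits-per-weight figures and overhead factor below are rough assumptions on my part (real GGUF files vary), not exact values from Llama.cpp.

```python
# Back-of-envelope size estimate: parameters x bits-per-weight / 8, with a small
# fudge factor for scales and metadata. Bits-per-weight values here (~4.85 for
# Q4_K_M, ~8.5 for Q8_0) are approximations, not exact GGUF numbers.
def approx_size_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.05) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 * overhead / 1e9

for label, bpw in [("Q4_K_M", 4.85), ("Q8_0", 8.5)]:
    print(f"7B at {label}: ~{approx_size_gb(7, bpw):.1f} GB")
```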

What model sizes are ideal for CPU-only setups?
From my experiments, 1B to 2B parameter models offer the best trade-off. They are small enough to fit in 8 GB of RAM (with quantization) and can sustain 20–40+ tok/s on my i5 CPU. These models handle basic reasoning, creative writing, and question answering reasonably well. Larger 3B–4B models are sometimes usable if aggressively quantized, but they often dip below 10 tok/s, which feels sluggish. Models above 7B are impractical on most older hardware. If you have 16 GB of RAM, you can try 7B models at Q4_K_M, but expect 5–8 tok/s. For a Raspberry Pi or very low-end machine, stick to 1B models.
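To judge whether a given size will fit comfortably on your own machine, you can compare a rough size estimate against the RAM actually free, leaving headroom for the OS. This sketch reuses the same back-of-envelope formula as above and the third-party psutil package; both the formula and the 2 GB headroom are my own assumptions.

```python
# Quick fit check: estimated model size plus OS headroom must stay under available RAM.
# Requires `pip install psutil`; the size estimate repeats the rough formula from above.
import psutil

def approx_size_gb(params_billion: float, bits_per_weight: float = 4.85, overhead: float = 1.05) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 * overhead / 1e9

def fits_in_ram(params_billion: float, headroom_gb: float = 2.0) -> bool:
    available_gb = psutil.virtual_memory().available / 1e9
    return approx_size_gb(params_billion) + headroom_gb <= available_gb

for size_b in (1.1, 2.7, 7.0):
    print(f"{size_b}B at Q4_K_M fits right now: {fits_in_ram(size_b)}")
```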
How did you test these models and what hardware did you use?
I performed all tests on a laptop with an 8th-generation Intel i5 CPU and 12 GB of DDR4 RAM. The integrated Intel UHD Graphics 620 was unused—all inference ran purely on the CPU. I used Llama.cpp as the runtime and downloaded models in GGUF format from Hugging Face. For each model, I ran the same prompt and measured tok/s using the built-in statistics. This setup represents a typical “old laptop” that many Linux users might have lying around. No exotic hardware, no GPU acceleration. The results are directly applicable to any similar machine.
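The benchmark loop itself was nothing fancy. The sketch below shows its general shape (same prompt for every model, fixed max_tokens, tok/s from completion tokens over wall time) using llama-cpp-python; the file names are placeholders for whatever GGUF models you have downloaded, and this is a reconstruction of the approach rather than the exact script.

```python
# Benchmark loop sketch: load each GGUF model, run the same prompt, report tok/s.
# Model paths are placeholders; adjust n_threads to your physical core count.
import time
from llama_cpp import Llama

MODELS = [
    "models/tinyllama-1.1b.Q4_K_M.gguf",
    "models/phi-2.Q4_K_M.gguf",
]
PROMPT = "Summarize the benefits of running LLMs locally in three sentences."

for path in MODELS:
    llm = Llama(model_path=path, n_ctx=2048, n_threads=4, verbose=False)
    start = time.perf_counter()
    result = llm(PROMPT, max_tokens=128)
    elapsed = time.perf_counter() - start
    tokens = result["usage"]["completion_tokens"]
    print(f"{path}: {tokens / elapsed:.1f} tok/s")
```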
What were the actual tokens-per-second results for some models?
To give you concrete numbers, here are a few examples from my tests (all at Q4_K_M):
- TinyLlama 1.1B: ~42 tok/s – extremely fast, good for simple tasks.
- Phi-2 2.7B: ~22 tok/s – very usable for general conversation.
- Mistral 7B: ~4–5 tok/s – technically runs but feels unbearably slow.
- Qwen 1.5B: ~35 tok/s – another excellent performer.
- Gemma 2B: ~18 tok/s – decent, slightly heavier than Phi-2.
As you can see, the 1–2B range consistently delivers >15 tok/s. Larger models are possible but only if you have patience for very slow responses.