
LLM on consumer-grade CPU-Only machines

A guide to running and optimizing suitable LLMs on a local machine with average specs, such as an i5 CPU-only machine

Using local LLMs effectively on Intel i5/i7 CPUs

⚠️ All advice is in the context of machines with specs similar to the testing machine described below

As LLMs become an integral part of our lives, we are sharing more and more private and sensitive information with LLM providers. Giving up LLMs altogether to protect privacy is no longer a realistic choice.

So, how can we protect our privacy while using LLMs?

The solution is to use a private LLM hosted on your local network.

Are there any open LLMs that provide quality similar to the leading proprietary LLMs from OpenAI/Anthropic?

By now, there are multiple open models that compete well with the leading proprietary ones, including models from Qwen, DeepSeek, Llama, and others.

Is it possible to run these decent open models on the computers that average consumers use?

A survey at Invide revealed that more than 68% of developers use an Intel i5/i7 with either 8GB or 16GB of RAM. They either don't have a graphics card or have an old, underpowered one. These specs are not, on their own, suitable for running a useful LLM. So one has to make thoughtful decisions in choosing the right LLM and optimizing its inference; only then can an LLM be used effectively for one's use case.

What choices and optimizations can help in running LLM on Intel i5/i7 CPU-only machines?

Let me share notes from my experiments; they should help you do the same.

💻 Testing machine specs:

  • RAM: 8-16GB
  • CPU: Intel i5/i7 (older generations, e.g. i5-4200)
  • OS: Linux

1. Choose a general purpose pretrained LLM

A ready-to-use model that might work for your case

  • Sweet spot: a 3B-7B parameter model at Q4-Q8 precision
  • Search HuggingFace to find ready-to-use quantized pretrained models for popular LLMs
  • Some of the models I liked for their performance/quality balance (and available on Ollama; quick-start commands below)
    • deepseek-r1 7B Q4_K_M for general-purpose tasks
    • qwen2.5-coder 3B Q4_K_M for code-specific tasks
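
If you use Ollama, trying these takes one command each. A minimal sketch, assuming the tags below still exist on the Ollama library (verify the exact tag and quantization before pulling):

```bash
# Pull the quantized models (tags are examples; most Ollama library models default to Q4_K_M)
ollama pull deepseek-r1:7b
ollama pull qwen2.5-coder:3b

# Quick smoke test for the code model
ollama run qwen2.5-coder:3b "Write a bash one-liner that counts lines in all *.py files"
```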

Here's why...

  • RAM requirement ≥ (number of model parameters × bytes per parameter) + overhead (KV cache, runtime, OS); see the worked example after this list
  • Most LLMs were trained with FP32 parameters (= 4 bytes each)
  • ❌ Theoretically, you could load a ~4B parameter model at full precision (FP32, i.e. ~16GB of weights), but performance (tokens per second) would be extremely poor
  • ❌ Theoretically, you could use a much smaller model such as GPT-2 Small 124M or DistilGPT2 88M, but quality is extremely poor on general-purpose tasks compared to larger models, and also poor on specific tasks compared to a task-specific fine-tuned model
  • ✅ Instead, use a quantized model, i.e. reduce the precision of each parameter from FP32 (4 bytes) to FP16 (2 bytes), INT8 (1 byte), or INT4 (½ byte). This approach is more suitable for general-purpose LLM tasks
  • ❌ Theoretically, you could fit up to a 32B parameter model at 4-bit (½ byte) precision in 16GB RAM, but it would be impractically slow
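
A quick back-of-the-envelope check of that formula; the ~1.5GB overhead figure below is my rough assumption for KV cache and runtime, not a measured number:

```bash
# RAM (GB) ≈ params (billions) × bytes per param + overhead
# 7B  @ Q4   (0.5 bytes/param): 7  × 0.5 + 1.5 ≈  5.0 GB  -> comfortable on 8GB
# 7B  @ Q8   (1.0 bytes/param): 7  × 1.0 + 1.5 ≈  8.5 GB  -> needs 16GB
# 7B  @ FP32 (4.0 bytes/param): 7  × 4.0 + 1.5 ≈ 29.5 GB  -> does not fit
# 32B @ Q4   (0.5 bytes/param): 32 × 0.5 + 1.5 ≈ 17.5 GB  -> only fits on paper
echo "7 * 0.5 + 1.5" | bc   # prints 5.0
```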

But…

  • Be practical in choosing the model parameter count vs. quantization combination
  • Do not choose a model with more than 7B parameters; performance (tokens per second) will be extremely poor
  • Do not choose a model with fewer than 3B parameters; quality will be extremely poor
  • Do not choose FP16 or higher precision; spend that memory on more model parameters instead
  • Prefer 8-bit precision, and compare the quality vs. performance tradeoff once against 4-bit precision

What about LLM inference on a Raspberry Pi?

  • Don’t expect practically usable performance on Raspberry Pi 4/5
  • If you must, try a ~1B parameter model at maximum quantization, e.g. TinyLlama 1.1B at 4-bit (example below)
  • Use Llamafile for the best out-of-the-box optimization
  • I barely reached 1 token/s on a Pi 4, with unstable device temperature
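
For reference, a llamafile run looks roughly like this. The file name is an example of a prebuilt TinyLlama llamafile; download links and supported flags vary by version, so check the project README and the binary's --help first:

```bash
# Make the downloaded single-file bundle executable, then run a short prompt
# (-p and -n are the usual llama.cpp prompt/length flags; verify with --help)
chmod +x TinyLlama-1.1B-Chat-v1.0.Q4_K_M.llamafile
./TinyLlama-1.1B-Chat-v1.0.Q4_K_M.llamafile -p "Explain what a GPIO pin is" -n 64
```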

2. Custom fine-tuned LLM under a $5k budget (Optional)

Fine-tuning can help if you’re not satisfied with the quality of the off-the-shelf model. A local model fine-tuned for a specific task, with a response style aligned to your preferences, can provide better accuracy and performance than a larger general model.

It can be as cheap as $100-500 in many cases.

Fine-tune with your own examples and feedback using SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization)

  • Create a labeled dataset of input-output examples using the best available LLM for the task (e.g. GPT-4.5)
  • Fine-tune a pre-trained open model such as Qwen 14B on this labeled dataset with SFT
  • Sweet spot for the training model choice: 14B params, FP16 precision
  • This will require ~140GB of GPU memory and a multi-GPU setup in the cloud
  • It will cost ~$100 to train on 10k examples using 2x NVLink-connected H100 (80GB) GPUs, for 10 epochs with each run taking 0.5-1 hr
  • Create a preference dataset to align the response style with user feedback (preferred vs. not-preferred responses)
  • Align the fine-tuned model with this preference dataset using DPO
  • Quantize the resulting model to INT8 or INT4 precision using HuggingFace’s optimum and Intel’s OpenVINO library (a command-line sketch of the whole pipeline follows at the end of this section)

Here's why…

  • Memory requirement = model weights (params × bytes per param) + AdamW optimizer states (3 copies = 3 × params × bytes per param) + gradients (= model size) = 14×2 + 3×14×2 + 14×2 = 140GB
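
A minimal command-line sketch of the pipeline, assuming TRL's CLI for the training steps and optimum-intel for the export. Dataset paths, model names, and hyperparameters are placeholders, and the exact flags may differ between library versions:

```bash
# 1) SFT on the labeled input-output examples (dataset path/format is a placeholder)
trl sft --model_name_or_path Qwen/Qwen2.5-14B \
    --dataset_name ./data/sft_examples \
    --output_dir ./qwen14b-sft

# 2) DPO on preferred vs. not-preferred response pairs
trl dpo --model_name_or_path ./qwen14b-sft \
    --dataset_name ./data/preference_pairs \
    --output_dir ./qwen14b-sft-dpo

# 3) Quantize to INT4 and export to OpenVINO for CPU inference
optimum-cli export openvino --model ./qwen14b-sft-dpo \
    --weight-format int4 ./qwen14b-int4-ov
```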

3. Optimize runtime (inference engine)

I chose Ollama for ease of use, or the Llama.cpp server for easy customization, as they both:

  • Leverage most hardware-specific optimizations (including Intel ISA features) out of the box
  • Are closer to the metal (C++ binaries), and thus have minimal overhead
  • Support prefix caching, which was the most impactful optimization in my case (example below)
ℹ️ What is Prefix Caching?
Reusing the activations (the KV cache) of tokens computed for a previous prompt when generating tokens for a later query that shares the same prefix.
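
For example, with the llama.cpp server (model path is an example; older builds name the binary server instead of llama-server):

```bash
# Start the llama.cpp HTTP server on CPU with a local GGUF model
./llama-server -m ./models/qwen2.5-coder-3b-q4_k_m.gguf -t 4 -c 4096 --port 8080

# Requests that repeat the same long prefix (e.g. a fixed system prompt) can reuse
# the cached KV state when cache_prompt is set
curl http://localhost:8080/completion -d '{
  "prompt": "You are a senior code reviewer. Review the following diff:\n...",
  "n_predict": 128,
  "cache_prompt": true
}'
```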

4. Make the most out of your hardware

  • Set the process priority to high with nice -n -19 command_name; this gave a 24% improvement in TPS (tokens per second), from 4.41 to 5.43 (combined example after this list)
  • Use a low-latency kernel on Linux, e.g. the Ubuntu low-latency kernel
  • Switch the CPU governor to performance mode: cpupower frequency-set -g performance
  • Overclock the CPU for more operations per second
  • Monitor system stability and use adequate cooling; the CPU may downclock due to thermal throttling
  • Use dual-channel RAM; it improves memory throughput and reduces latency
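
Putting the governor and priority tweaks together in one session (requires root; ollama serve is just an example of the inference process you want to prioritize):

```bash
# Switch the CPU frequency governor to performance (cpupower ships with linux-tools)
sudo cpupower frequency-set -g performance

# Start the inference server at the highest scheduling priority
sudo nice -n -19 ollama serve
```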

Additional recommendations to try

These may or may not improve performance, depending on your choice of inference engine, OS, and environment

  • Real-time FIFO scheduling (chrt -f 99 command_name) should help in theory, but in my case it hurt performance
  • Enable the XMP profile in BIOS to run RAM at higher speeds (e.g., 3200MT/s instead of 2400MT/s); increases memory bandwidth (but I had issues doing this on Linux)
  • Reserve specific CPU cores for inference using the isolcpus kernel parameter to prevent OS scheduling interference (sketch below)
  • Use 2MB or 1GB HugePages for model allocation (echo 1024 > /proc/sys/vm/nr_hugepages); reduces overhead from page faults and TLB misses during inference
  • Consider booting with mitigations=off; reduces the overhead of Meltdown/Spectre mitigations (but make sure the system is otherwise hardened)
  • Disable power-saving C-states and turbo time limits to maintain consistently high CPU clock speeds
  • Store model files on a fast SSD (preferably NVMe) to avoid slow disk access during model loading and inference.
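
A sketch of the HugePages and core-pinning items, assuming you already booted with isolcpus=2-5; the core numbers are examples, and HugePages only pay off if your inference runtime actually uses them:

```bash
# Reserve 1024 × 2MB HugePages (≈2GB) at runtime (requires root)
echo 1024 | sudo tee /proc/sys/vm/nr_hugepages

# Run the inference server only on the isolated cores 2-5
sudo taskset -c 2-5 ollama serve
```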