llama.cpp
LLM library
LLM inference on pure CPU with C++
llama.cpp is a C/C++ implementation for inference of Llama and compatible models. It runs LLMs entirely on the CPU, with no GPU required, supporting aggressive quantization and processor-architecture-specific optimizations (e.g. AVX on x86, NEON on ARM).
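The memory savings come from block-wise quantization: weights are grouped into small blocks, each stored as one scale factor plus low-bit integer codes. The sketch below is a simplified, illustrative 4-bit scheme in the spirit of llama.cpp's Q4 formats; it is not the exact on-disk GGUF layout, and the block size and symmetric rounding used here are assumptions for clarity.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative block-wise 4-bit quantization (NOT the exact llama.cpp layout).
// Each block of 32 floats becomes one float scale plus 32 four-bit codes:
// 4 + 16 = 20 bytes instead of 128 bytes, roughly a 6x size reduction.
constexpr int kBlock = 32;

struct BlockQ4 {
    float scale;                 // per-block scale factor
    uint8_t codes[kBlock / 2];   // two 4-bit codes packed per byte
};

BlockQ4 quantize_block(const float *x) {
    // Scale so the largest magnitude maps into the signed 4-bit range.
    float amax = 0.0f;
    for (int i = 0; i < kBlock; ++i) amax = std::max(amax, std::fabs(x[i]));
    BlockQ4 b{};
    b.scale = amax / 7.0f;       // symmetric codes in [-8, 7]
    auto enc = [&](float v) -> uint8_t {
        int q = b.scale != 0.0f ? (int)std::lround(v / b.scale) : 0;
        q = std::min(7, std::max(-8, q));
        return (uint8_t)(q + 8); // bias into [0, 15] for unsigned storage
    };
    for (int i = 0; i < kBlock; i += 2)
        b.codes[i / 2] = enc(x[i]) | (uint8_t)(enc(x[i + 1]) << 4);
    return b;
}

void dequantize_block(const BlockQ4 &b, float *out) {
    for (int i = 0; i < kBlock; i += 2) {
        out[i]     = b.scale * (int(b.codes[i / 2] & 0x0F) - 8);
        out[i + 1] = b.scale * (int(b.codes[i / 2] >> 4)   - 8);
    }
}
```

The per-block scale is what keeps the rounding error bounded (at most half a quantization step per weight), which is why such coarse integer codes remain usable for inference.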
Concepts
gguf-format, quantization, cpu-optimization, simd, memory-mapping, batched-inference
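One of the concepts above, memory-mapping, is how llama.cpp loads GGUF model files without copying them into RAM: the OS pages weights in lazily and can share them between processes. A minimal POSIX sketch of the technique (the file path and helper name here are hypothetical, for illustration only):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <string>

// Map a file read-only and copy its contents out as a string.
// The kernel faults pages in on demand, so a multi-gigabyte model
// file does not need to be read up front.
std::string read_via_mmap(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return "";
    struct stat st{};
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return ""; }
    void *p = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after the fd is closed
    if (p == MAP_FAILED) return "";
    std::string s((const char *)p, (size_t)st.st_size);
    munmap(p, (size_t)st.st_size);
    return s;
}
```

In llama.cpp itself this is why a freshly loaded model "costs" little resident memory until its tensors are actually touched during inference.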
Pros and Cons
Pros
- Works without a GPU
- Extremely efficient on CPU
- Quantization down to 2-bit
- Cross-platform (Linux, macOS, Windows)
- Optimized Apple Silicon support
- Foundation for many popular tools
Cons
- Slower than GPU inference
- Requires model conversion to the GGUF format
- Low-level API
- Inference only, no training
Use Cases
- LLMs on laptops without GPU
- Deployment on CPU servers
- Local desktop applications
- Model development and testing
- Edge computing with language models