llama.cpp
LLM library
LLM inference on pure CPU with C++
llama.cpp is a C/C++ implementation for inference of Llama and compatible models. It runs LLMs entirely on the CPU, with no GPU required, supporting aggressive quantization and processor-architecture-specific optimizations (e.g. AVX on x86, NEON on ARM).
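The memory savings come from block-wise quantization: weights are grouped into small blocks, each stored as one scale factor plus low-bit integer codes. The sketch below is a simplified, illustrative 4-bit scheme in the spirit of llama.cpp's Q4 formats; it is not the exact on-disk GGUF layout, and the block size and symmetric rounding used here are assumptions for clarity.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative block-wise 4-bit quantization (NOT the exact llama.cpp layout).
// Each block of 32 floats becomes one float scale plus 32 four-bit codes:
// 4 + 16 = 20 bytes instead of 128 bytes, roughly a 6x size reduction.
constexpr int kBlock = 32;

struct BlockQ4 {
    float scale;                 // per-block scale factor
    uint8_t codes[kBlock / 2];   // two 4-bit codes packed per byte
};

BlockQ4 quantize_block(const float *x) {
    // Scale so the largest magnitude maps into the signed 4-bit range.
    float amax = 0.0f;
    for (int i = 0; i < kBlock; ++i) amax = std::max(amax, std::fabs(x[i]));
    BlockQ4 b{};
    b.scale = amax / 7.0f;       // symmetric codes in [-8, 7]
    auto enc = [&](float v) -> uint8_t {
        int q = b.scale != 0.0f ? (int)std::lround(v / b.scale) : 0;
        q = std::min(7, std::max(-8, q));
        return (uint8_t)(q + 8); // bias into [0, 15] for unsigned storage
    };
    for (int i = 0; i < kBlock; i += 2)
        b.codes[i / 2] = enc(x[i]) | (uint8_t)(enc(x[i + 1]) << 4);
    return b;
}

void dequantize_block(const BlockQ4 &b, float *out) {
    for (int i = 0; i < kBlock; i += 2) {
        out[i]     = b.scale * (int(b.codes[i / 2] & 0x0F) - 8);
        out[i + 1] = b.scale * (int(b.codes[i / 2] >> 4)   - 8);
    }
}
```

The per-block scale is what keeps the rounding error bounded (at most half a quantization step per weight), which is why such coarse integer codes remain usable for inference.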
Concepts
gguf-format, quantization, cpu-optimization, simd, memory-mapping, batched-inference
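One of the concepts above, memory-mapping, is how llama.cpp loads GGUF model files without copying them into RAM: the OS pages weights in lazily and can share them between processes. A minimal POSIX sketch of the technique (the file path and helper name here are hypothetical, for illustration only):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <string>

// Map a file read-only and copy its contents out as a string.
// The kernel faults pages in on demand, so a multi-gigabyte model
// file does not need to be read up front.
std::string read_via_mmap(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return "";
    struct stat st{};
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return ""; }
    void *p = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after the fd is closed
    if (p == MAP_FAILED) return "";
    std::string s((const char *)p, (size_t)st.st_size);
    munmap(p, (size_t)st.st_size);
    return s;
}
```

In llama.cpp itself this is why a freshly loaded model "costs" little resident memory until its tensors are actually touched during inference.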
Pros and Cons
Pros
- Works without a GPU
- Extremely efficient on CPU
- Quantization down to 2-bit
- Cross-platform (Linux, macOS, Windows)
- Optimized Apple Silicon support
- Foundation for many popular tools
Cons
- Slower than GPU inference
- Requires model conversion to the GGUF format
- Low-level API
- Inference only, no training
Use Cases
- LLMs on laptops without GPU
- Deployment on CPU servers
- Local desktop applications
- Model development and testing
- Edge computing with language models