
GGUF - Quantization Format for llama.cpp

GGUF (GPT-Generated Unified Format) is the standard model file format for llama.cpp. It enables efficient inference on CPUs, Apple Silicon, and GPUs, with quantization options ranging from 2 to 8 bits.

When to use GGUF

Use GGUF when:

  • Deploying on consumer hardware (laptops, desktops)
  • Running on Apple Silicon (M1/M2/M3) with Metal acceleration
  • Running CPU-only inference with no GPU required
  • Needing flexible quantization (Q2_K to Q8_0)
  • Using local AI tools (LM Studio, Ollama, text-generation-webui)

Key advantages:

  • Universal hardware: CPU, Apple Silicon, NVIDIA, AMD support
  • No Python runtime: Pure C/C++ inference
  • Flexible quantization: 2-8 bit with various methods (K-quants)
  • Ecosystem support: LM Studio, Ollama, koboldcpp, and more
  • imatrix: Importance matrix for better low-bit quality

Use alternatives instead:

  • AWQ/GPTQ: Maximum accuracy with calibration on NVIDIA GPUs
  • HQQ: Fast calibration-free quantization for HuggingFace
  • bitsandbytes: Simple integration with transformers library
  • TensorRT-LLM: Production NVIDIA deployment with maximum speed

Quick start

Installation

# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA)
make GGML_CUDA=1

# Build with Metal (Apple Silicon)
make GGML_METAL=1

# Install Python bindings (optional)
pip install llama-cpp-python
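
To sanity-check the optional Python bindings, you can query the low-level API. A minimal sketch, assuming llama-cpp-python installed successfully (llama_supports_gpu_offload is part of its low-level C bindings):

import llama_cpp

# Bindings version, and whether the compiled backend can offload layers to a GPU
print(llama_cpp.__version__)
print(llama_cpp.llama_supports_gpu_offload())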

Convert model to GGUF

# Install requirements
pip install -r requirements.txt

# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify the output type explicitly
python convert_hf_to_gguf.py ./path/to/model \
    --outfile model-f16.gguf \
    --outtype f16

Quantize model

# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M

Run inference

# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload (35 layers)
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"

Quantization types

K-quant methods (recommended)

| Type | Bits | Size (7B) | Quality | Use Case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |

Legacy methods

| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |

Recommendation: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
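
The sizes in the table follow from bits-per-weight arithmetic: file size ≈ parameter count × bits / 8, plus some overhead because certain tensors (e.g. embeddings and the output head) are kept at higher precision. A rough back-of-the-envelope sketch:

def estimate_gguf_gb(n_params: float, bits_per_weight: float) -> float:
    # Pure bits-per-weight estimate; real GGUF files run slightly larger
    return n_params * bits_per_weight / 8 / 1e9

print(f"{estimate_gguf_gb(7e9, 4.5):.1f} GB")  # Q4_K_M at 4.5 bpw on 7B -> ~3.9 GB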

Conversion workflows

Workflow 1: HuggingFace to GGUF

# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50

Workflow 2: With importance matrix (better quality)

# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
EOF
# ...append more diverse text samples to calibration.txt
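
Imatrix quality depends on diverse, representative calibration text. As an alternative to hand-writing samples, here is a sketch that builds calibration.txt from wikitext-2, assuming the Hugging Face datasets package (both the package and the corpus choice are illustrative, not required by llama.cpp):

from datasets import load_dataset

# Sample a few hundred non-empty lines of wikitext-2 as calibration text
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
with open("calibration.txt", "w") as f:
    for row in ds.select(range(500)):
        text = row["text"].strip()
        if text:
            f.write(text + "\n")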

# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
    -f calibration.txt \
    --chunk 512 \
    -o model.imatrix \
    -ngl 35  # GPU layers if available

# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf \
    model-q4_k_m.gguf \
    Q4_K_M

Workflow 3: Multiple quantizations

#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate the imatrix once
./llama-imatrix -m "$MODEL" -f wiki.txt -o "$IMATRIX" -ngl 35

# Create multiple quantizations from the same FP16 source
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
    ./llama-quantize --imatrix "$IMATRIX" "$MODEL" "$OUTPUT" "$QUANT"
    echo "Created: $OUTPUT ($(du -h "$OUTPUT" | cut -f1))"
done

Python usage

llama-cpp-python

from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,       # Context window
    n_gpu_layers=35,  # GPU offload (0 for CPU only)
    n_threads=8,      # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"],
)
print(output["choices"][0]["text"])
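
The loaded model's tokenizer is exposed on the same object, which helps verify that prompt plus generation fits within n_ctx. A small sketch (Llama.tokenize expects UTF-8 bytes):

# Budget the context window: prompt tokens vs. room left for generation
prompt = "What is machine learning?"
tokens = llm.tokenize(prompt.encode("utf-8"))
print(f"{len(tokens)} prompt tokens, {4096 - len(tokens)} left for generation")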

Chat completion

from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3",  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])

Streaming

from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens as they are generated
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
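
Chat completions stream the same way: with stream=True, create_chat_completion yields OpenAI-style chunks whose delta carries the incremental content. A short sketch:

# Stream a chat completion; each chunk's delta holds the incremental text
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is Python?"}],
    max_tokens=256,
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)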

Server mode

Start OpenAI-compatible server

# Start server
./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096

# Or with Python bindings
python -m llama_cpp.server \
    --model model-q4_k_m.gguf \
    --n_gpu_layers 35 \
    --host 0.0.0.0 \
    --port 8080
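
llama-server exposes a /health endpoint, handy for waiting until the model has finished loading before sending requests. A minimal sketch using the requests package:

import requests

# Returns 200 once the model is loaded and the server is ready
r = requests.get("http://localhost:8080/health")
print(r.status_code, r.text)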

Use with OpenAI client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Hardware optimization

Apple Silicon (Metal)

# Build with Metal
make clean && make GGML_METAL=1

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"

Python with Metal:

llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1,      # Metal handles parallelism
)

NVIDIA CUDA

# Build with CUDA
make clean && make GGML_CUDA=1

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Pin to a specific GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
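
How many layers to pass to -ngl depends on available VRAM. The heuristic below is an illustrative sketch, not an exact formula (the KV cache and compute buffers also consume VRAM, hence the reserve): spread the file size evenly across layers and see how many fit.

# Rough -ngl estimate: divide model file size evenly across layers,
# keeping a reserve for KV cache and compute buffers
def layers_to_offload(model_gb: float, n_layers: int, vram_gb: float,
                      reserve_gb: float = 1.5) -> int:
    per_layer_gb = model_gb / n_layers
    return max(0, min(n_layers, int((vram_gb - reserve_gb) / per_layer_gb)))

print(layers_to_offload(4.1, 32, 4.0))  # 7B Q4_K_M (32 layers) on a 4 GB GPU -> 19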

CPU optimization

# Build for CPU (AVX2/AVX512 enabled automatically on supported hardware)
make clean && make

# Run with an explicit thread count
./llama-cli -m model.gguf -t 8 -p "Hello"

Python CPU config:

llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512,     # Batch size for prompt processing
)
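
Hyper-threaded siblings usually do not speed up token generation, so n_threads should match physical cores; os.cpu_count() counts logical threads instead. A sketch using psutil (an extra dependency) to get the physical count:

import psutil

# Physical cores only; logical=True would also count hyper-threaded siblings
n_physical = psutil.cpu_count(logical=False)
print(f"Use n_threads={n_physical}")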

Integration with tools

Ollama

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create Ollama model
ollama create mymodel -f Modelfile

# Run
ollama run mymodel "Hello!"

LM Studio

  • Place GGUF file in ~/.cache/lm-studio/models/
  • Open LM Studio and select the model
  • Configure context length and GPU offload
  • Start inference

text-generation-webui

# Place in models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with the llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35

Best practices

  • Use K-quants: Q4_K_M offers best quality/size balance
  • Use imatrix: Always use importance matrix for Q4 and below
  • GPU offload: Offload as many layers as VRAM allows
  • Context length: Start with 4096, increase if needed
  • Thread count: Match physical CPU cores, not logical
  • Batch size: Increase n_batch for faster prompt processing

Common issues

Model loads slowly:

# mmap is enabled by default; --mlock keeps the model resident in RAM
./llama-cli -m model.gguf --mlock

Out of memory:

# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # Down from 35

# Or use a smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M

Poor quality at low bits:

# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
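
For a quick qualitative check of quantization loss, run the same prompt through the FP16 and quantized files with greedy decoding and compare the outputs. A sketch reusing the example filenames from above:

from llama_cpp import Llama

# Greedy decoding makes outputs deterministic, so divergence hints at quality loss
for path in ["model-f16.gguf", "model-q4_k_m.gguf"]:
    llm = Llama(model_path=path, n_ctx=512, verbose=False)
    out = llm("The capital of France is", max_tokens=8, temperature=0.0)
    print(path, "->", out["choices"][0]["text"].strip())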

References

  • Repository: https://github.com/ggml-org/llama.cpp
  • Python Bindings: https://github.com/abetlen/llama-cpp-python
  • Pre-quantized Models: https://huggingface.co/TheBloke
  • GGUF Converter: https://huggingface.co/spaces/ggml-org/gguf-my-repo
  • License: MIT
