mlops v2.1.2

llama-cpp

LLM inference on CPU, Apple Silicon, and consumer GPUs. Runs without NVIDIA hardware and supports GGUF quantization.

Orchestra Research
MIT

llama.cpp + GGUF

Use this skill for local GGUF inference, quant selection, or Hugging Face repo discovery for llama.cpp.

When to use

  • Run local models on CPU, Apple Silicon, CUDA, ROCm, or Intel GPUs
  • Find the right GGUF for a specific Hugging Face repo
  • Build a llama-server or llama-cli command from the Hub
  • Search the Hub for models that already support llama.cpp
  • Enumerate available .gguf files and sizes for a repo
  • Decide between Q4/Q5/Q6/IQ variants for the user's RAM or VRAM

Model Discovery workflow

Prefer URL-based workflows before reaching for the hf CLI, Python, or custom scripts.

  • Search for candidate repos on the Hub:
      - Base: https://huggingface.co/models?apps=llama.cpp&sort=trending
      - Add search=<term> for a model family
      - Add num_parameters=min:0,max:24B or similar when the user has size constraints
  • Open the repo with the llama.cpp local-app view:
      - https://huggingface.co/<repo>?local-app=llama.cpp
  • Treat the local-app snippet as the source of truth when it is visible:
      - copy the exact llama-server or llama-cli command
      - report the recommended quant exactly as HF shows it
  • Read the same ?local-app=llama.cpp URL as page text or HTML and extract the section under Hardware compatibility:
      - prefer its exact quant labels and sizes over generic tables
      - keep repo-specific labels such as UD-Q4_K_M or IQ4_NL_XL
      - if that section is not visible in the fetched page source, say so and fall back to the tree API plus generic quant guidance
  • Query the tree API to confirm what actually exists:
      - https://huggingface.co/api/models/<repo>/tree/main?recursive=true
      - keep entries where type is file and path ends with .gguf
      - use path and size as the source of truth for filenames and byte sizes
      - separate quantized checkpoints from mmproj-*.gguf projector files and BF16/ shard files
      - use https://huggingface.co/<repo>/tree/main only as a human fallback
  • If the local-app snippet is not text-visible, reconstruct the command from the repo plus the chosen quant:
      - shorthand quant selection: llama-server -hf <repo>:<quant>
      - exact-file fallback: llama-server --hf-repo <repo> --hf-file <filename.gguf>
  • Only suggest conversion from Transformers weights if the repo does not already expose GGUF files.
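The tree API filtering step above can be sketched in a few lines of Python. This is a minimal illustration, not an official client: the helper names fetch_tree and split_gguf_entries are hypothetical, the sample entries and their sizes are made up, and the sketch assumes the response is a JSON list of objects with type, path, and size fields as described in the workflow.

```python
import json
import urllib.request

TREE_URL = "https://huggingface.co/api/models/{repo}/tree/main?recursive=true"


def fetch_tree(repo: str) -> list:
    """Fetch the recursive file listing for a repo (network call, not run below)."""
    with urllib.request.urlopen(TREE_URL.format(repo=repo)) as resp:
        return json.load(resp)


def split_gguf_entries(entries: list) -> dict:
    """Group .gguf files into quantized checkpoints, projectors, and BF16 shards."""
    groups = {"quantized": [], "projector": [], "bf16": []}
    for entry in entries:
        path = entry.get("path", "")
        # Keep only entries where type is file and path ends with .gguf.
        if entry.get("type") != "file" or not path.endswith(".gguf"):
            continue
        name = path.rsplit("/", 1)[-1]
        if name.startswith("mmproj-"):
            groups["projector"].append(entry)
        elif path.startswith("BF16/"):
            groups["bf16"].append(entry)
        else:
            groups["quantized"].append(entry)
    return groups


# Hypothetical entries shaped like the tree API response (sizes are made up).
sample = [
    {"type": "directory", "path": "BF16"},
    {"type": "file", "path": "README.md", "size": 1200},
    {"type": "file", "path": "model-Q4_K_M.gguf", "size": 2_000_000_000},
    {"type": "file", "path": "mmproj-F16.gguf", "size": 800_000_000},
    {"type": "file", "path": "BF16/model-BF16-00001-of-00002.gguf", "size": 5_000_000_000},
]
groups = split_gguf_entries(sample)
for role, items in groups.items():
    print(role, [(e["path"], e["size"]) for e in items])
```

For a real repo, pass the result of fetch_tree to split_gguf_entries instead of the sample list.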

Quick start

Install llama.cpp

# macOS / Linux (simplest)
brew install llama.cpp

# Windows
winget install llama.cpp

# Build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

Run directly from the Hugging Face Hub

# Interactive chat in the terminal
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0

# Start an OpenAI-compatible HTTP server
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0

Run an exact GGUF file from the Hub

Use this when the tree API shows custom file naming or the exact HF snippet is missing.

llama-server \
    --hf-repo microsoft/Phi-3-mini-4k-instruct-gguf \
    --hf-file Phi-3-mini-4k-instruct-q4.gguf \
    -c 4096
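When the command has to be reconstructed from a repo plus a quant or an exact filename, a small helper keeps the two forms consistent. The sketch below is illustrative only: build_server_command is a hypothetical name, and it uses only the flags already shown in this document (-hf, --hf-repo, --hf-file, -c).

```python
import shlex


def build_server_command(repo, quant=None, filename=None, ctx=None):
    """Return a llama-server command line for a Hub repo.

    Prefers the exact-file form when a filename is given, otherwise the
    shorthand repo:quant form.
    """
    if filename:
        argv = ["llama-server", "--hf-repo", repo, "--hf-file", filename]
    elif quant:
        argv = ["llama-server", "-hf", f"{repo}:{quant}"]
    else:
        raise ValueError("pass either quant or filename")
    if ctx is not None:
        argv += ["-c", str(ctx)]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(argv)


exact = build_server_command(
    "microsoft/Phi-3-mini-4k-instruct-gguf",
    filename="Phi-3-mini-4k-instruct-q4.gguf",
    ctx=4096,
)
shorthand = build_server_command("bartowski/Llama-3.2-3B-Instruct-GGUF", quant="Q8_0")
print(exact)
print(shorthand)
```

Keeping the quant label untouched in the shorthand form matters because repo-native labels should be passed through exactly as the Hub shows them.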

OpenAI-compatible server check

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Write a limerick about Python exceptions"}
        ]
    }'
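The same check can be made from Python with only the standard library. This is a sketch under assumptions: build_chat_request and ask are hypothetical helper names, the server is assumed to be listening on localhost:8080 as in the curl call, and the response is assumed to follow the OpenAI chat-completion shape (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080"):
    """Build the POST request that mirrors the curl example."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request to a running llama-server (network call, not run below)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]


req = build_chat_request("Write a limerick about Python exceptions")
print(req.get_method(), req.full_url)
```

Call ask("...") once llama-server is running to get the reply text.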

Python bindings (llama-cpp-python)

Install with pip install llama-cpp-python. For a CUDA build, use CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir. For Metal, use CMAKE_ARGS="-DGGML_METAL=on" ...

Basic generation

from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,  # 0 for CPU, 99 to offload everything
    n_threads=8,
)

out = llm("What is machine learning?", max_tokens=256, temperature=0.7)
print(out["choices"][0]["text"])

Chat + streaming

from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3",  # or "chatml", "mistral", etc.
)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"},
    ],
    max_tokens=256,
)
print(resp["choices"][0]["message"]["content"])

Streaming

for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
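If the full text is needed after streaming finishes, the chunks can be accumulated while they are printed. The sketch below is a hypothetical helper (collect_stream is not part of llama-cpp-python); it assumes each chunk has the choices[0]["text"] shape used in the loop above and is exercised here with fake chunks so it runs without a model.

```python
def collect_stream(chunks):
    """Print streamed text as it arrives and return the concatenated result."""
    parts = []
    for chunk in chunks:
        text = chunk["choices"][0]["text"]
        print(text, end="", flush=True)
        parts.append(text)
    print()  # finish the line once the stream ends
    return "".join(parts)


# Fake chunks standing in for llm(..., stream=True); the content is made up.
fake_stream = [
    {"choices": [{"text": piece}]}
    for piece in ("Quantum ", "computing ", "uses qubits.")
]
full_text = collect_stream(fake_stream)
```

With a loaded model, pass llm("Explain quantum computing:", max_tokens=256, stream=True) in place of fake_stream.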

Embeddings

llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
vec = llm.embed("This is a test sentence.")
print(f"Embedding dimension: {len(vec)}")

You can also load a GGUF straight from the Hub:

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
    filename="*Q4_K_M.gguf",  # glob pattern matched against repo filenames
    n_gpu_layers=35,
)

Choosing a quant

Use the Hub page first, generic heuristics second.

  • Prefer the exact quant that HF marks as compatible for the user's hardware profile.
  • For general chat, start with Q4_K_M.
  • For code or technical work, prefer Q5_K_M or Q6_K if memory allows.
  • For very tight RAM budgets, consider Q3_K_M, IQ variants, or Q2 variants only if the user explicitly prioritizes fit over quality.
  • For multimodal repos, mention mmproj-*.gguf separately. The projector is not the main model file.
  • Do not normalize repo-native labels. If the page says UD-Q4_K_M, report UD-Q4_K_M.
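The generic fallback heuristics above can be expressed as a small selection function. This is a sketch, not a rule from llama.cpp or Hugging Face: pick_quant, the preference lists, and the 1.2 headroom factor (extra room for context and runtime buffers) are all assumptions, and the file sizes in the example are made up. Repo-native labels such as UD-Q4_K_M are matched by suffix and returned unchanged.

```python
# Assumed preference order per use case, best choice first.
PREFERENCES = {
    "chat": ["Q4_K_M", "Q3_K_M"],
    "code": ["Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M"],
}


def pick_quant(available, budget_bytes, use_case="chat", headroom=1.2):
    """Return the first preferred quant label whose file fits the memory budget.

    available maps repo-native quant labels to file sizes in bytes.
    Returns None when nothing in the preference list fits.
    """
    for base in PREFERENCES[use_case]:
        for label, size in available.items():
            if label.endswith(base) and size * headroom <= budget_bytes:
                return label  # reported exactly as the repo names it
    return None


# Hypothetical sizes for illustration only.
available = {
    "UD-Q4_K_M": 22_000_000_000,
    "UD-Q5_K_M": 26_000_000_000,
    "UD-Q6_K": 30_000_000_000,
    "Q8_0": 37_000_000_000,
}
budget = 32_000_000_000
print("chat:", pick_quant(available, budget, "chat"))
print("code:", pick_quant(available, budget, "code"))
```

When the Hub page shows a hardware-compatibility recommendation, that recommendation should still take precedence over this kind of heuristic.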

Extracting available GGUFs from a repo

When the user asks what GGUFs exist, return:

  • filename
  • file size
  • quant label
  • whether it is a main model or an auxiliary projector

Ignore unless requested:

  • README
  • BF16 shard files
  • imatrix blobs or calibration artifacts

Use the tree API for this step:

  • https://huggingface.co/api/models/<repo>/tree/main?recursive=true

For a repo like unsloth/Qwen3.6-35B-A3B-GGUF, the local-app page can show quant chips such as UD-Q4_K_M, UD-Q5_K_M, UD-Q6_K, and Q8_0, while the tree API exposes exact file paths such as Qwen3.6-35B-A3B-UD-Q4_K_M.gguf and Qwen3.6-35B-A3B-Q8_0.gguf with byte sizes. Use the tree API to turn a quant label into an exact filename.
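Turning a quant label into an exact filename can be sketched as a suffix match over the tree API entries. The helper names resolve_quant_file and describe are hypothetical, the byte sizes are made up, and the sketch assumes single-file quants named like <model>-<label>.gguf; sharded files would need extra handling.

```python
def resolve_quant_file(entries, label):
    """Find the main-model .gguf whose filename ends with -<label>.gguf."""
    suffix = f"-{label}.gguf".lower()
    matches = [
        e for e in entries
        if e.get("type") == "file"
        and e["path"].lower().endswith(suffix)
        and not e["path"].rsplit("/", 1)[-1].startswith("mmproj-")
    ]
    # If several files match (e.g. Q4_K_M vs UD-Q4_K_M), prefer the shortest
    # path, which is the one without an extra prefix. This is an assumption.
    return min(matches, key=lambda e: len(e["path"]), default=None)


def describe(entry, label):
    """Return the fields the listing should report for one file."""
    name = entry["path"].rsplit("/", 1)[-1]
    return {
        "filename": entry["path"],
        "size_gib": round(entry["size"] / 2**30, 2),
        "quant": label,
        "role": "projector" if name.startswith("mmproj-") else "main model",
    }


# Filenames follow the example above; sizes are hypothetical.
entries = [
    {"type": "file", "path": "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf", "size": 21_474_836_480},
    {"type": "file", "path": "Qwen3.6-35B-A3B-Q8_0.gguf", "size": 37_580_963_840},
    {"type": "file", "path": "mmproj-F16.gguf", "size": 1_073_741_824},
]
hit = resolve_quant_file(entries, "UD-Q4_K_M")
print(describe(hit, "UD-Q4_K_M"))
```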

Search patterns

Use these URL shapes directly:

https://huggingface.co/models?apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
https://huggingface.co/<repo>?local-app=llama.cpp
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
https://huggingface.co/<repo>/tree/main
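These URL shapes are easy to generate programmatically. The helpers below are a hypothetical sketch (search_url, local_app_url, and tree_api_url are illustrative names); they only assemble the query parameters already listed above.

```python
from urllib.parse import urlencode

HUB = "https://huggingface.co"


def search_url(term=None, max_params=None, sort="trending"):
    """Build a Hub model search URL restricted to llama.cpp-compatible repos."""
    params = {}
    if term:
        params["search"] = term
    params["apps"] = "llama.cpp"
    if max_params:
        params["num_parameters"] = f"min:0,max:{max_params}"
    params["sort"] = sort
    # Keep ':' and ',' unescaped so the URL matches the shapes shown above.
    return f"{HUB}/models?" + urlencode(params, safe=":,")


def local_app_url(repo):
    """URL of the repo page with the llama.cpp local-app view."""
    return f"{HUB}/{repo}?local-app=llama.cpp"


def tree_api_url(repo):
    """URL of the recursive tree API listing for the repo."""
    return f"{HUB}/api/models/{repo}/tree/main?recursive=true"


print(search_url())
print(search_url("qwen", max_params="24B"))
print(local_app_url("unsloth/Qwen3.6-35B-A3B-GGUF"))
print(tree_api_url("unsloth/Qwen3.6-35B-A3B-GGUF"))
```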

Output format

When answering discovery requests, prefer a compact structured result like:

Repo: 
Recommended quant from HF:

References

  • hub-discovery.md - URL-only Hugging Face workflows, search patterns, GGUF extraction, and command reconstruction
  • advanced-usage.md - speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
  • quantization.md - quant quality tradeoffs, when to use Q4/Q5/Q6/IQ, model size scaling, imatrix
  • server.md - direct-from-Hub server launch, OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
  • optimization.md - CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
  • troubleshooting.md - install/convert/quantize/inference/server issues, Apple Silicon, debugging

Resources

  • GitHub: https://github.com/ggml-org/llama.cpp
  • Hugging Face GGUF + llama.cpp docs: https://huggingface.co/docs/hub/gguf-llamacpp
  • Hugging Face Local Apps docs: https://huggingface.co/docs/hub/main/local-apps
  • Hugging Face Local Agents docs: https://huggingface.co/docs/hub/agents-local
  • Example local-app page: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF?local-app=llama.cpp
  • Example tree API: https://huggingface.co/api/models/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main?recursive=true
  • Example llama.cpp search: https://huggingface.co/models?num_parameters=min:0,max:24B&apps=llama.cpp&sort=trending
  • License: MIT
