v1.0.0
This commit is contained in:
3
skills/mlops/inference/DESCRIPTION.md
Normal file
3
skills/mlops/inference/DESCRIPTION.md
Normal file
@@ -0,0 +1,3 @@
|
||||
---
|
||||
description: Model serving, quantization (GGUF/GPTQ), structured output, inference optimization, and model surgery tools for deploying and running LLMs.
|
||||
---
|
||||
249
skills/mlops/inference/llama-cpp/SKILL.md
Normal file
249
skills/mlops/inference/llama-cpp/SKILL.md
Normal file
@@ -0,0 +1,249 @@
|
||||
---
|
||||
name: llama-cpp
|
||||
description: llama.cpp local GGUF inference + HF Hub model discovery.
|
||||
version: 2.1.2
|
||||
author: Orchestra Research
|
||||
license: MIT
|
||||
dependencies: [llama-cpp-python>=0.2.0]
|
||||
platforms: [linux, macos, windows]
|
||||
metadata:
|
||||
hermes:
|
||||
tags: [llama.cpp, GGUF, Quantization, Hugging Face Hub, CPU Inference, Apple Silicon, Edge Deployment, AMD GPUs, Intel GPUs, NVIDIA, URL-first]
|
||||
---
|
||||
|
||||
# llama.cpp + GGUF
|
||||
|
||||
Use this skill for local GGUF inference, quant selection, or Hugging Face repo discovery for llama.cpp.
|
||||
|
||||
## When to use
|
||||
|
||||
- Run local models on CPU, Apple Silicon, CUDA, ROCm, or Intel GPUs
|
||||
- Find the right GGUF for a specific Hugging Face repo
|
||||
- Build a `llama-server` or `llama-cli` command from the Hub
|
||||
- Search the Hub for models that already support llama.cpp
|
||||
- Enumerate available `.gguf` files and sizes for a repo
|
||||
- Decide between Q4/Q5/Q6/IQ variants for the user's RAM or VRAM
|
||||
|
||||
## Model Discovery workflow
|
||||
|
||||
Prefer URL workflows before asking for `hf`, Python, or custom scripts.
|
||||
|
||||
1. Search for candidate repos on the Hub:
|
||||
- Base: `https://huggingface.co/models?apps=llama.cpp&sort=trending`
|
||||
- Add `search=<term>` for a model family
|
||||
- Add `num_parameters=min:0,max:24B` or similar when the user has size constraints
|
||||
2. Open the repo with the llama.cpp local-app view:
|
||||
- `https://huggingface.co/<repo>?local-app=llama.cpp`
|
||||
3. Treat the local-app snippet as the source of truth when it is visible:
|
||||
- copy the exact `llama-server` or `llama-cli` command
|
||||
- report the recommended quant exactly as HF shows it
|
||||
4. Read the same `?local-app=llama.cpp` URL as page text or HTML and extract the section under `Hardware compatibility`:
|
||||
- prefer its exact quant labels and sizes over generic tables
|
||||
- keep repo-specific labels such as `UD-Q4_K_M` or `IQ4_NL_XL`
|
||||
- if that section is not visible in the fetched page source, say so and fall back to the tree API plus generic quant guidance
|
||||
5. Query the tree API to confirm what actually exists:
|
||||
- `https://huggingface.co/api/models/<repo>/tree/main?recursive=true`
|
||||
- keep entries where `type` is `file` and `path` ends with `.gguf`
|
||||
- use `path` and `size` as the source of truth for filenames and byte sizes
|
||||
- separate quantized checkpoints from `mmproj-*.gguf` projector files and `BF16/` shard files
|
||||
- use `https://huggingface.co/<repo>/tree/main` only as a human fallback
|
||||
6. If the local-app snippet is not text-visible, reconstruct the command from the repo plus the chosen quant:
|
||||
- shorthand quant selection: `llama-server -hf <repo>:<QUANT>`
|
||||
- exact-file fallback: `llama-server --hf-repo <repo> --hf-file <filename.gguf>`
|
||||
7. Only suggest conversion from Transformers weights if the repo does not already expose GGUF files.
|
||||
|
||||
## Quick start
|
||||
|
||||
### Install llama.cpp
|
||||
|
||||
```bash
|
||||
# macOS / Linux (simplest)
|
||||
brew install llama.cpp
|
||||
```
|
||||
|
||||
```bash
|
||||
winget install llama.cpp
|
||||
```
|
||||
|
||||
```bash
|
||||
git clone https://github.com/ggml-org/llama.cpp
|
||||
cd llama.cpp
|
||||
cmake -B build
|
||||
cmake --build build --config Release
|
||||
```
|
||||
|
||||
### Run directly from the Hugging Face Hub
|
||||
|
||||
```bash
|
||||
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
|
||||
```
|
||||
|
||||
```bash
|
||||
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
|
||||
```
|
||||
|
||||
### Run an exact GGUF file from the Hub
|
||||
|
||||
Use this when the tree API shows custom file naming or the exact HF snippet is missing.
|
||||
|
||||
```bash
|
||||
llama-server \
|
||||
--hf-repo microsoft/Phi-3-mini-4k-instruct-gguf \
|
||||
--hf-file Phi-3-mini-4k-instruct-q4.gguf \
|
||||
-c 4096
|
||||
```
|
||||
|
||||
### OpenAI-compatible server check
|
||||
|
||||
```bash
|
||||
curl http://localhost:8080/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"messages": [
|
||||
{"role": "user", "content": "Write a limerick about Python exceptions"}
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
## Python bindings (llama-cpp-python)
|
||||
|
||||
`pip install llama-cpp-python` (CUDA: `CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir`; Metal: `CMAKE_ARGS="-DGGML_METAL=on" ...`).
|
||||
|
||||
### Basic generation
|
||||
|
||||
```python
|
||||
from llama_cpp import Llama
|
||||
|
||||
llm = Llama(
|
||||
model_path="./model-q4_k_m.gguf",
|
||||
n_ctx=4096,
|
||||
n_gpu_layers=35, # 0 for CPU, 99 to offload everything
|
||||
n_threads=8,
|
||||
)
|
||||
|
||||
out = llm("What is machine learning?", max_tokens=256, temperature=0.7)
|
||||
print(out["choices"][0]["text"])
|
||||
```
|
||||
|
||||
### Chat + streaming
|
||||
|
||||
```python
|
||||
llm = Llama(
|
||||
model_path="./model-q4_k_m.gguf",
|
||||
n_ctx=4096,
|
||||
n_gpu_layers=35,
|
||||
chat_format="llama-3", # or "chatml", "mistral", etc.
|
||||
)
|
||||
|
||||
resp = llm.create_chat_completion(
|
||||
messages=[
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": "What is Python?"},
|
||||
],
|
||||
max_tokens=256,
|
||||
)
|
||||
print(resp["choices"][0]["message"]["content"])
|
||||
|
||||
# Streaming
|
||||
for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
|
||||
print(chunk["choices"][0]["text"], end="", flush=True)
|
||||
```
|
||||
|
||||
### Embeddings
|
||||
|
||||
```python
|
||||
llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
|
||||
vec = llm.embed("This is a test sentence.")
|
||||
print(f"Embedding dimension: {len(vec)}")
|
||||
```
|
||||
|
||||
You can also load a GGUF straight from the Hub:
|
||||
|
||||
```python
|
||||
llm = Llama.from_pretrained(
|
||||
repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
|
||||
filename="*Q4_K_M.gguf",
|
||||
n_gpu_layers=35,
|
||||
)
|
||||
```
|
||||
|
||||
## Choosing a quant
|
||||
|
||||
Use the Hub page first, generic heuristics second.
|
||||
|
||||
- Prefer the exact quant that HF marks as compatible for the user's hardware profile.
|
||||
- For general chat, start with `Q4_K_M`.
|
||||
- For code or technical work, prefer `Q5_K_M` or `Q6_K` if memory allows.
|
||||
- For very tight RAM budgets, consider `Q3_K_M`, `IQ` variants, or `Q2` variants only if the user explicitly prioritizes fit over quality.
|
||||
- For multimodal repos, mention `mmproj-*.gguf` separately. The projector is not the main model file.
|
||||
- Do not normalize repo-native labels. If the page says `UD-Q4_K_M`, report `UD-Q4_K_M`.
|
||||
|
||||
## Extracting available GGUFs from a repo
|
||||
|
||||
When the user asks what GGUFs exist, return:
|
||||
|
||||
- filename
|
||||
- file size
|
||||
- quant label
|
||||
- whether it is a main model or an auxiliary projector
|
||||
|
||||
Ignore unless requested:
|
||||
|
||||
- README
|
||||
- BF16 shard files
|
||||
- imatrix blobs or calibration artifacts
|
||||
|
||||
Use the tree API for this step:
|
||||
|
||||
- `https://huggingface.co/api/models/<repo>/tree/main?recursive=true`
|
||||
|
||||
For a repo like `unsloth/Qwen3.6-35B-A3B-GGUF`, the local-app page can show quant chips such as `UD-Q4_K_M`, `UD-Q5_K_M`, `UD-Q6_K`, and `Q8_0`, while the tree API exposes exact file paths such as `Qwen3.6-35B-A3B-UD-Q4_K_M.gguf` and `Qwen3.6-35B-A3B-Q8_0.gguf` with byte sizes. Use the tree API to turn a quant label into an exact filename.
|
||||
|
||||
## Search patterns
|
||||
|
||||
Use these URL shapes directly:
|
||||
|
||||
```text
|
||||
https://huggingface.co/models?apps=llama.cpp&sort=trending
|
||||
https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
|
||||
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
|
||||
https://huggingface.co/<repo>?local-app=llama.cpp
|
||||
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
|
||||
https://huggingface.co/<repo>/tree/main
|
||||
```
|
||||
|
||||
## Output format
|
||||
|
||||
When answering discovery requests, prefer a compact structured result like:
|
||||
|
||||
```text
|
||||
Repo: <repo>
|
||||
Recommended quant from HF: <label> (<size>)
|
||||
llama-server: <command>
|
||||
Other GGUFs:
|
||||
- <filename> - <size>
|
||||
- <filename> - <size>
|
||||
Source URLs:
|
||||
- <local-app URL>
|
||||
- <tree API URL>
|
||||
```
|
||||
|
||||
## References
|
||||
|
||||
- **[hub-discovery.md](references/hub-discovery.md)** - URL-only Hugging Face workflows, search patterns, GGUF extraction, and command reconstruction
|
||||
- **[advanced-usage.md](references/advanced-usage.md)** — speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
|
||||
- **[quantization.md](references/quantization.md)** — quant quality tradeoffs, when to use Q4/Q5/Q6/IQ, model size scaling, imatrix
|
||||
- **[server.md](references/server.md)** — direct-from-Hub server launch, OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
|
||||
- **[optimization.md](references/optimization.md)** — CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
|
||||
- **[troubleshooting.md](references/troubleshooting.md)** — install/convert/quantize/inference/server issues, Apple Silicon, debugging
|
||||
|
||||
## Resources
|
||||
|
||||
- **GitHub**: https://github.com/ggml-org/llama.cpp
|
||||
- **Hugging Face GGUF + llama.cpp docs**: https://huggingface.co/docs/hub/gguf-llamacpp
|
||||
- **Hugging Face Local Apps docs**: https://huggingface.co/docs/hub/main/local-apps
|
||||
- **Hugging Face Local Agents docs**: https://huggingface.co/docs/hub/agents-local
|
||||
- **Example local-app page**: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF?local-app=llama.cpp
|
||||
- **Example tree API**: https://huggingface.co/api/models/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main?recursive=true
|
||||
- **Example llama.cpp search**: https://huggingface.co/models?num_parameters=min:0,max:24B&apps=llama.cpp&sort=trending
|
||||
- **License**: MIT
|
||||
504
skills/mlops/inference/llama-cpp/references/advanced-usage.md
Normal file
504
skills/mlops/inference/llama-cpp/references/advanced-usage.md
Normal file
@@ -0,0 +1,504 @@
|
||||
# GGUF Advanced Usage Guide
|
||||
|
||||
## Speculative Decoding
|
||||
|
||||
### Draft Model Approach
|
||||
|
||||
```bash
|
||||
# Use smaller model as draft for faster generation
|
||||
./llama-speculative \
|
||||
-m large-model-q4_k_m.gguf \
|
||||
-md draft-model-q4_k_m.gguf \
|
||||
-p "Write a story about AI" \
|
||||
-n 500 \
|
||||
--draft 8 # Draft tokens before verification
|
||||
```
|
||||
|
||||
### Self-Speculative Decoding
|
||||
|
||||
```bash
|
||||
# Use same model with different context for speculation
|
||||
./llama-cli -m model-q4_k_m.gguf \
|
||||
--lookup-cache-static lookup.bin \
|
||||
--lookup-cache-dynamic lookup-dynamic.bin \
|
||||
-p "Hello world"
|
||||
```
|
||||
|
||||
## Batched Inference
|
||||
|
||||
### Process Multiple Prompts
|
||||
|
||||
```python
|
||||
from llama_cpp import Llama
|
||||
|
||||
llm = Llama(
|
||||
model_path="model-q4_k_m.gguf",
|
||||
n_ctx=4096,
|
||||
n_gpu_layers=35,
|
||||
n_batch=512 # Larger batch for parallel processing
|
||||
)
|
||||
|
||||
prompts = [
|
||||
"What is Python?",
|
||||
"Explain machine learning.",
|
||||
"Describe neural networks."
|
||||
]
|
||||
|
||||
# Process in batch (each prompt gets separate context)
|
||||
for prompt in prompts:
|
||||
output = llm(prompt, max_tokens=100)
|
||||
print(f"Q: {prompt}")
|
||||
print(f"A: {output['choices'][0]['text']}\n")
|
||||
```
|
||||
|
||||
### Server Batching
|
||||
|
||||
```bash
|
||||
# Start server with batching
|
||||
./llama-server -m model-q4_k_m.gguf \
|
||||
--host 0.0.0.0 \
|
||||
--port 8080 \
|
||||
-ngl 35 \
|
||||
-c 4096 \
|
||||
--parallel 4 # Concurrent requests
|
||||
--cont-batching # Continuous batching
|
||||
```
|
||||
|
||||
## Custom Model Conversion
|
||||
|
||||
### Convert with Vocabulary Modifications
|
||||
|
||||
```python
|
||||
# custom_convert.py
|
||||
import sys
|
||||
sys.path.insert(0, './llama.cpp')
|
||||
|
||||
from convert_hf_to_gguf import main
|
||||
from gguf import GGUFWriter
|
||||
|
||||
# Custom conversion with modified vocab
|
||||
def convert_with_custom_vocab(model_path, output_path):
|
||||
# Load and modify tokenizer
|
||||
from transformers import AutoTokenizer
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_path)
|
||||
|
||||
# Add special tokens if needed
|
||||
special_tokens = {"additional_special_tokens": ["<|custom|>"]}
|
||||
tokenizer.add_special_tokens(special_tokens)
|
||||
tokenizer.save_pretrained(model_path)
|
||||
|
||||
# Then run standard conversion
|
||||
main([model_path, "--outfile", output_path])
|
||||
```
|
||||
|
||||
### Convert Specific Architecture
|
||||
|
||||
```bash
|
||||
# For Mistral-style models
|
||||
python convert_hf_to_gguf.py ./mistral-model \
|
||||
--outfile mistral-f16.gguf \
|
||||
--outtype f16
|
||||
|
||||
# For Qwen models
|
||||
python convert_hf_to_gguf.py ./qwen-model \
|
||||
--outfile qwen-f16.gguf \
|
||||
--outtype f16
|
||||
|
||||
# For Phi models
|
||||
python convert_hf_to_gguf.py ./phi-model \
|
||||
--outfile phi-f16.gguf \
|
||||
--outtype f16
|
||||
```
|
||||
|
||||
## Advanced Quantization
|
||||
|
||||
### Mixed Quantization
|
||||
|
||||
```bash
|
||||
# Quantize different layer types differently
|
||||
./llama-quantize model-f16.gguf model-mixed.gguf Q4_K_M \
|
||||
--allow-requantize \
|
||||
--leave-output-tensor
|
||||
```
|
||||
|
||||
### Quantization with Token Embeddings
|
||||
|
||||
```bash
|
||||
# Keep embeddings at higher precision
|
||||
./llama-quantize model-f16.gguf model-q4.gguf Q4_K_M \
|
||||
--token-embedding-type f16
|
||||
```
|
||||
|
||||
### IQ Quantization (Importance-aware)
|
||||
|
||||
```bash
|
||||
# Ultra-low bit quantization with importance
|
||||
./llama-quantize --imatrix model.imatrix \
|
||||
model-f16.gguf model-iq2_xxs.gguf IQ2_XXS
|
||||
|
||||
# Available IQ types: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS
|
||||
```
|
||||
|
||||
## Memory Optimization
|
||||
|
||||
### Memory Mapping
|
||||
|
||||
```python
|
||||
from llama_cpp import Llama
|
||||
|
||||
# Use memory mapping for large models
|
||||
llm = Llama(
|
||||
model_path="model-q4_k_m.gguf",
|
||||
use_mmap=True, # Memory map the model
|
||||
use_mlock=False, # Don't lock in RAM
|
||||
n_gpu_layers=35
|
||||
)
|
||||
```
|
||||
|
||||
### Partial GPU Offload
|
||||
|
||||
```python
|
||||
# Calculate layers to offload based on VRAM
|
||||
import subprocess
|
||||
|
||||
def get_free_vram_gb():
|
||||
result = subprocess.run(
|
||||
['nvidia-smi', '--query-gpu=memory.free', '--format=csv,nounits,noheader'],
|
||||
capture_output=True, text=True
|
||||
)
|
||||
return int(result.stdout.strip()) / 1024
|
||||
|
||||
# Estimate layers based on VRAM (rough: 0.5GB per layer for 7B Q4)
|
||||
free_vram = get_free_vram_gb()
|
||||
layers_to_offload = int(free_vram / 0.5)
|
||||
|
||||
llm = Llama(
|
||||
model_path="model-q4_k_m.gguf",
|
||||
n_gpu_layers=min(layers_to_offload, 35) # Cap at total layers
|
||||
)
|
||||
```
|
||||
|
||||
### KV Cache Optimization
|
||||
|
||||
```python
|
||||
from llama_cpp import Llama
|
||||
|
||||
# Optimize KV cache for long contexts
|
||||
llm = Llama(
|
||||
model_path="model-q4_k_m.gguf",
|
||||
n_ctx=8192, # Large context
|
||||
n_gpu_layers=35,
|
||||
type_k=1, # Q8_0 for K cache (1)
|
||||
type_v=1, # Q8_0 for V cache (1)
|
||||
# Or use Q4_0 (2) for more compression
|
||||
)
|
||||
```
|
||||
|
||||
## Context Management
|
||||
|
||||
### Context Shifting
|
||||
|
||||
```python
|
||||
from llama_cpp import Llama
|
||||
|
||||
llm = Llama(
|
||||
model_path="model-q4_k_m.gguf",
|
||||
n_ctx=4096,
|
||||
n_gpu_layers=35
|
||||
)
|
||||
|
||||
# Handle long conversations with context shifting
|
||||
conversation = []
|
||||
max_history = 10
|
||||
|
||||
def chat(user_message):
|
||||
conversation.append({"role": "user", "content": user_message})
|
||||
|
||||
# Keep only recent history
|
||||
if len(conversation) > max_history * 2:
|
||||
conversation = conversation[-max_history * 2:]
|
||||
|
||||
response = llm.create_chat_completion(
|
||||
messages=conversation,
|
||||
max_tokens=256
|
||||
)
|
||||
|
||||
assistant_message = response["choices"][0]["message"]["content"]
|
||||
conversation.append({"role": "assistant", "content": assistant_message})
|
||||
return assistant_message
|
||||
```
|
||||
|
||||
### Save and Load State
|
||||
|
||||
```bash
|
||||
# Save state to file
|
||||
./llama-cli -m model.gguf \
|
||||
-p "Once upon a time" \
|
||||
--save-session session.bin \
|
||||
-n 100
|
||||
|
||||
# Load and continue
|
||||
./llama-cli -m model.gguf \
|
||||
--load-session session.bin \
|
||||
-p " and they lived" \
|
||||
-n 100
|
||||
```
|
||||
|
||||
## Grammar Constrained Generation
|
||||
|
||||
### JSON Output
|
||||
|
||||
```python
|
||||
from llama_cpp import Llama, LlamaGrammar
|
||||
|
||||
# Define JSON grammar
|
||||
json_grammar = LlamaGrammar.from_string('''
|
||||
root ::= object
|
||||
object ::= "{" ws pair ("," ws pair)* "}" ws
|
||||
pair ::= string ":" ws value
|
||||
value ::= string | number | object | array | "true" | "false" | "null"
|
||||
array ::= "[" ws value ("," ws value)* "]" ws
|
||||
string ::= "\\"" [^"\\\\]* "\\""
|
||||
number ::= [0-9]+
|
||||
ws ::= [ \\t\\n]*
|
||||
''')
|
||||
|
||||
llm = Llama(model_path="model-q4_k_m.gguf", n_gpu_layers=35)
|
||||
|
||||
output = llm(
|
||||
"Output a JSON object with name and age:",
|
||||
grammar=json_grammar,
|
||||
max_tokens=100
|
||||
)
|
||||
print(output["choices"][0]["text"])
|
||||
```
|
||||
|
||||
### Custom Grammar
|
||||
|
||||
```python
|
||||
# Grammar for specific format
|
||||
answer_grammar = LlamaGrammar.from_string('''
|
||||
root ::= "Answer: " letter "\\n" "Explanation: " explanation
|
||||
letter ::= [A-D]
|
||||
explanation ::= [a-zA-Z0-9 .,!?]+
|
||||
''')
|
||||
|
||||
output = llm(
|
||||
"Q: What is 2+2? A) 3 B) 4 C) 5 D) 6",
|
||||
grammar=answer_grammar,
|
||||
max_tokens=100
|
||||
)
|
||||
```
|
||||
|
||||
## LoRA Integration
|
||||
|
||||
### Load LoRA Adapter
|
||||
|
||||
```bash
|
||||
# Apply LoRA at runtime
|
||||
./llama-cli -m base-model-q4_k_m.gguf \
|
||||
--lora lora-adapter.gguf \
|
||||
--lora-scale 1.0 \
|
||||
-p "Hello!"
|
||||
```
|
||||
|
||||
### Multiple LoRA Adapters
|
||||
|
||||
```bash
|
||||
# Stack multiple adapters
|
||||
./llama-cli -m base-model.gguf \
|
||||
--lora adapter1.gguf --lora-scale 0.5 \
|
||||
--lora adapter2.gguf --lora-scale 0.5 \
|
||||
-p "Hello!"
|
||||
```
|
||||
|
||||
### Python LoRA Usage
|
||||
|
||||
```python
|
||||
from llama_cpp import Llama
|
||||
|
||||
llm = Llama(
|
||||
model_path="base-model-q4_k_m.gguf",
|
||||
lora_path="lora-adapter.gguf",
|
||||
lora_scale=1.0,
|
||||
n_gpu_layers=35
|
||||
)
|
||||
```
|
||||
|
||||
## Embedding Generation
|
||||
|
||||
### Extract Embeddings
|
||||
|
||||
```python
|
||||
from llama_cpp import Llama
|
||||
|
||||
llm = Llama(
|
||||
model_path="model-q4_k_m.gguf",
|
||||
embedding=True, # Enable embedding mode
|
||||
n_gpu_layers=35
|
||||
)
|
||||
|
||||
# Get embeddings
|
||||
embeddings = llm.embed("This is a test sentence.")
|
||||
print(f"Embedding dimension: {len(embeddings)}")
|
||||
```
|
||||
|
||||
### Batch Embeddings
|
||||
|
||||
```python
|
||||
texts = [
|
||||
"Machine learning is fascinating.",
|
||||
"Deep learning uses neural networks.",
|
||||
"Python is a programming language."
|
||||
]
|
||||
|
||||
embeddings = [llm.embed(text) for text in texts]
|
||||
|
||||
# Calculate similarity
|
||||
import numpy as np
|
||||
|
||||
def cosine_similarity(a, b):
|
||||
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
|
||||
|
||||
sim = cosine_similarity(embeddings[0], embeddings[1])
|
||||
print(f"Similarity: {sim:.4f}")
|
||||
```
|
||||
|
||||
## Performance Tuning
|
||||
|
||||
### Benchmark Script
|
||||
|
||||
```python
|
||||
import time
|
||||
from llama_cpp import Llama
|
||||
|
||||
def benchmark(model_path, prompt, n_tokens=100, n_runs=5):
|
||||
llm = Llama(
|
||||
model_path=model_path,
|
||||
n_gpu_layers=35,
|
||||
n_ctx=2048,
|
||||
verbose=False
|
||||
)
|
||||
|
||||
# Warmup
|
||||
llm(prompt, max_tokens=10)
|
||||
|
||||
# Benchmark
|
||||
times = []
|
||||
for _ in range(n_runs):
|
||||
start = time.time()
|
||||
output = llm(prompt, max_tokens=n_tokens)
|
||||
elapsed = time.time() - start
|
||||
times.append(elapsed)
|
||||
|
||||
avg_time = sum(times) / len(times)
|
||||
tokens_per_sec = n_tokens / avg_time
|
||||
|
||||
print(f"Model: {model_path}")
|
||||
print(f"Avg time: {avg_time:.2f}s")
|
||||
print(f"Tokens/sec: {tokens_per_sec:.1f}")
|
||||
|
||||
return tokens_per_sec
|
||||
|
||||
# Compare quantizations
|
||||
for quant in ["q4_k_m", "q5_k_m", "q8_0"]:
|
||||
benchmark(f"model-{quant}.gguf", "Explain quantum computing:", 100)
|
||||
```
|
||||
|
||||
### Optimal Configuration Finder
|
||||
|
||||
```python
|
||||
def find_optimal_config(model_path, target_vram_gb=8):
|
||||
"""Find optimal n_gpu_layers and n_batch for target VRAM."""
|
||||
from llama_cpp import Llama
|
||||
import gc
|
||||
|
||||
best_config = None
|
||||
best_speed = 0
|
||||
|
||||
for n_gpu_layers in range(0, 50, 5):
|
||||
for n_batch in [128, 256, 512, 1024]:
|
||||
try:
|
||||
gc.collect()
|
||||
llm = Llama(
|
||||
model_path=model_path,
|
||||
n_gpu_layers=n_gpu_layers,
|
||||
n_batch=n_batch,
|
||||
n_ctx=2048,
|
||||
verbose=False
|
||||
)
|
||||
|
||||
# Quick benchmark
|
||||
start = time.time()
|
||||
llm("Hello", max_tokens=50)
|
||||
speed = 50 / (time.time() - start)
|
||||
|
||||
if speed > best_speed:
|
||||
best_speed = speed
|
||||
best_config = {
|
||||
"n_gpu_layers": n_gpu_layers,
|
||||
"n_batch": n_batch,
|
||||
"speed": speed
|
||||
}
|
||||
|
||||
del llm
|
||||
gc.collect()
|
||||
|
||||
except Exception as e:
|
||||
print(f"OOM at layers={n_gpu_layers}, batch={n_batch}")
|
||||
break
|
||||
|
||||
return best_config
|
||||
```
|
||||
|
||||
## Multi-GPU Setup
|
||||
|
||||
### Distribute Across GPUs
|
||||
|
||||
```bash
|
||||
# Split model across multiple GPUs
|
||||
./llama-cli -m large-model.gguf \
|
||||
--tensor-split 0.5,0.5 \
|
||||
-ngl 60 \
|
||||
-p "Hello!"
|
||||
```
|
||||
|
||||
### Python Multi-GPU
|
||||
|
||||
```python
|
||||
import os
|
||||
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
|
||||
|
||||
from llama_cpp import Llama
|
||||
|
||||
llm = Llama(
|
||||
model_path="large-model-q4_k_m.gguf",
|
||||
n_gpu_layers=60,
|
||||
tensor_split=[0.5, 0.5] # Split evenly across 2 GPUs
|
||||
)
|
||||
```
|
||||
|
||||
## Custom Builds
|
||||
|
||||
### Build with All Optimizations
|
||||
|
||||
```bash
|
||||
# Clean build with all CPU optimizations
|
||||
make clean
|
||||
LLAMA_OPENBLAS=1 LLAMA_BLAS_VENDOR=OpenBLAS make -j
|
||||
|
||||
# With CUDA and cuBLAS
|
||||
make clean
|
||||
GGML_CUDA=1 LLAMA_CUBLAS=1 make -j
|
||||
|
||||
# With specific CUDA architecture
|
||||
GGML_CUDA=1 CUDA_DOCKER_ARCH=sm_86 make -j
|
||||
```
|
||||
|
||||
### CMake Build
|
||||
|
||||
```bash
|
||||
mkdir build && cd build
|
||||
cmake .. -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
|
||||
cmake --build . --config Release -j
|
||||
```
|
||||
168
skills/mlops/inference/llama-cpp/references/hub-discovery.md
Normal file
168
skills/mlops/inference/llama-cpp/references/hub-discovery.md
Normal file
@@ -0,0 +1,168 @@
|
||||
# Hugging Face URL Workflows for llama.cpp
|
||||
|
||||
Use URL-only workflows first. Do not require `hf` or API clients just to find GGUF files, choose a quant, or build a `llama-server` command.
|
||||
|
||||
## Core URLs
|
||||
|
||||
```text
|
||||
Search:
|
||||
https://huggingface.co/models?apps=llama.cpp&sort=trending
|
||||
|
||||
Search with text:
|
||||
https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
|
||||
|
||||
Search with size bounds:
|
||||
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
|
||||
|
||||
Repo local-app view:
|
||||
https://huggingface.co/<repo>?local-app=llama.cpp
|
||||
|
||||
Repo tree API:
|
||||
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
|
||||
|
||||
Repo file tree:
|
||||
https://huggingface.co/<repo>/tree/main
|
||||
```
|
||||
|
||||
## 1. Search for llama.cpp-compatible models
|
||||
|
||||
Start from the models page with `apps=llama.cpp`.
|
||||
|
||||
Use:
|
||||
|
||||
- `search=<term>` for model family names such as `Qwen`, `Gemma`, `Phi`, or `Mistral`
|
||||
- `num_parameters=min:0,max:24B` or similar if the user has hardware limits
|
||||
- `sort=trending` when the user wants popular repos right now
|
||||
|
||||
Do not start with random GGUF repos if the user has not chosen a model family yet. Search first, shortlist second.
|
||||
|
||||
Example: https://huggingface.co/models?search=Qwen&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
|
||||
|
||||
## 2. Use the local-app page for the recommended quant
|
||||
|
||||
Open:
|
||||
|
||||
```text
|
||||
https://huggingface.co/<repo>?local-app=llama.cpp
|
||||
```
|
||||
|
||||
Extract, in order:
|
||||
|
||||
1. The exact `Use this model` snippet, if it is visible as text
|
||||
2. The `Hardware compatibility` section from the fetched page text or HTML:
|
||||
- quant label
|
||||
- file size
|
||||
- bit-depth grouping
|
||||
3. Any extra launch flags shown in the snippet, such as `--jinja`
|
||||
|
||||
Treat the HF local-app snippet as the source of truth when it is visible.
|
||||
|
||||
Do this by reading the URL itself, not by assuming the UI rendered in a browser. If the fetched page source does not expose `Hardware compatibility`, say that the section was not text-visible and fall back to the tree API plus generic guidance from `quantization.md`.
|
||||
|
||||
## 3. Confirm exact files from the tree API
|
||||
|
||||
Open:
|
||||
|
||||
```text
|
||||
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
|
||||
```
|
||||
|
||||
Treat the JSON response as the source of truth for repo inventory.
|
||||
|
||||
Keep entries where:
|
||||
|
||||
- `type` is `file`
|
||||
- `path` ends with `.gguf`
|
||||
|
||||
Use these fields:
|
||||
|
||||
- `path` for the filename and subdirectory
|
||||
- `size` for the byte size
|
||||
- optionally `lfs.size` to confirm the LFS payload size
|
||||
|
||||
Separate files into:
|
||||
|
||||
- quantized single-file checkpoints, for example `Qwen3.6-35B-A3B-UD-Q4_K_M.gguf`
|
||||
- projector weights, usually `mmproj-*.gguf`
|
||||
- BF16 shard files, usually under `BF16/`
|
||||
- everything else
|
||||
|
||||
Ignore unless the user asks:
|
||||
|
||||
- `README.md`
|
||||
- imatrix or calibration blobs
|
||||
|
||||
Use `https://huggingface.co/<repo>/tree/main` only as a human fallback if the API endpoint fails or the user wants the web view.
|
||||
|
||||
## 4. Build the command
|
||||
|
||||
Preferred order:
|
||||
|
||||
1. Copy the exact HF snippet from the local-app page
|
||||
2. If the page gives a clean quant label, use shorthand selection:
|
||||
|
||||
```bash
|
||||
llama-server -hf <repo>:<QUANT>
|
||||
```
|
||||
|
||||
3. If you need an exact file from the tree API, use the file-specific form:
|
||||
|
||||
```bash
|
||||
llama-server --hf-repo <repo> --hf-file <filename.gguf>
|
||||
```
|
||||
|
||||
4. For CLI usage instead of a server, use:
|
||||
|
||||
```bash
|
||||
llama-cli -hf <repo>:<QUANT>
|
||||
```
|
||||
|
||||
Use the exact-file form when the repo uses custom labels or nonstandard naming that could make `:<QUANT>` ambiguous.
|
||||
|
||||
## 5. Example: `unsloth/Qwen3.6-35B-A3B-GGUF`
|
||||
|
||||
Use these URLs:
|
||||
|
||||
```text
|
||||
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF?local-app=llama.cpp
|
||||
https://huggingface.co/api/models/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main?recursive=true
|
||||
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main
|
||||
```
|
||||
|
||||
On the local-app page, the hardware compatibility section can expose entries such as:
|
||||
|
||||
- `UD-IQ4_XS` - 17.7 GB
|
||||
- `UD-Q4_K_S` - 20.9 GB
|
||||
- `UD-Q4_K_M` - 22.1 GB
|
||||
- `UD-Q5_K_M` - 26.5 GB
|
||||
- `UD-Q6_K` - 29.3 GB
|
||||
- `Q8_0` - 36.9 GB
|
||||
|
||||
On the tree API, you can confirm exact filenames such as:
|
||||
|
||||
- `Qwen3.6-35B-A3B-UD-Q4_K_M.gguf`
|
||||
- `Qwen3.6-35B-A3B-UD-Q5_K_M.gguf`
|
||||
- `Qwen3.6-35B-A3B-UD-Q6_K.gguf`
|
||||
- `Qwen3.6-35B-A3B-Q8_0.gguf`
|
||||
- `mmproj-F16.gguf`
|
||||
|
||||
Good final output for this repo:
|
||||
|
||||
```text
|
||||
Repo: unsloth/Qwen3.6-35B-A3B-GGUF
|
||||
Recommended quant from HF: UD-Q4_K_M (22.1 GB)
|
||||
llama-server: llama-server --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF --hf-file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
|
||||
Other GGUFs:
|
||||
- Qwen3.6-35B-A3B-UD-Q5_K_M.gguf - 26.5 GB
|
||||
- Qwen3.6-35B-A3B-UD-Q6_K.gguf - 29.3 GB
|
||||
- Qwen3.6-35B-A3B-Q8_0.gguf - 36.9 GB
|
||||
Projector:
|
||||
- mmproj-F16.gguf - 899 MB
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- Repo-specific quant labels matter. Do not rewrite `UD-Q4_K_M` to `Q4_K_M` unless the page itself does.
|
||||
- `mmproj` files are projector weights for multimodal models, not the main language model checkpoint.
|
||||
- If the HF hardware compatibility panel is missing because the user has no hardware profile configured, or because the fetched page source did not expose it, still use the tree API plus generic quant guidance from `quantization.md`.
|
||||
- If the repo already has GGUFs, do not jump straight to conversion workflows.
|
||||
89
skills/mlops/inference/llama-cpp/references/optimization.md
Normal file
89
skills/mlops/inference/llama-cpp/references/optimization.md
Normal file
@@ -0,0 +1,89 @@
|
||||
# Performance Optimization Guide
|
||||
|
||||
Maximize llama.cpp inference speed and efficiency.
|
||||
|
||||
## CPU Optimization
|
||||
|
||||
### Thread tuning
|
||||
```bash
|
||||
# Set threads (default: physical cores)
|
||||
./llama-cli -m model.gguf -t 8
|
||||
|
||||
# For AMD Ryzen 9 7950X (16 cores, 32 threads)
|
||||
-t 16 # Best: physical cores
|
||||
|
||||
# Avoid hyperthreading (slower for matrix ops)
|
||||
```
|
||||
|
||||
### BLAS acceleration
|
||||
```bash
|
||||
# OpenBLAS (faster matrix ops)
|
||||
make LLAMA_OPENBLAS=1
|
||||
|
||||
# BLAS gives 2-3× speedup
|
||||
```
|
||||
|
||||
## GPU Offloading
|
||||
|
||||
### Layer offloading
|
||||
```bash
|
||||
# Offload 35 layers to GPU (hybrid mode)
|
||||
./llama-cli -m model.gguf -ngl 35
|
||||
|
||||
# Offload all layers
|
||||
./llama-cli -m model.gguf -ngl 999
|
||||
|
||||
# Find optimal value:
|
||||
# Start with -ngl 999
|
||||
# If OOM, reduce by 5 until fits
|
||||
```
|
||||
|
||||
### Memory usage
|
||||
```bash
|
||||
# Check VRAM usage
|
||||
nvidia-smi dmon
|
||||
|
||||
# Reduce context if needed
|
||||
./llama-cli -m model.gguf -c 2048 # 2K context instead of 4K
|
||||
```
|
||||
|
||||
## Batch Processing
|
||||
|
||||
```bash
|
||||
# Increase batch size for throughput
|
||||
./llama-cli -m model.gguf -b 512 # Default: 512
|
||||
|
||||
# Physical batch (GPU)
|
||||
--ubatch 128 # Process 128 tokens at once
|
||||
```
|
||||
|
||||
## Context Management
|
||||
|
||||
```bash
|
||||
# Default context (512 tokens)
|
||||
-c 512
|
||||
|
||||
# Longer context (slower, more memory)
|
||||
-c 4096
|
||||
|
||||
# Very long context (if model supports)
|
||||
-c 32768
|
||||
```
|
||||
|
||||
## Benchmarks
|
||||
|
||||
### CPU Performance (Llama 2-7B Q4_K_M)
|
||||
|
||||
| Setup | Speed | Notes |
|
||||
|-------|-------|-------|
|
||||
| Apple M3 Max | 50 tok/s | Metal acceleration |
|
||||
| AMD 7950X (16c) | 35 tok/s | OpenBLAS |
|
||||
| Intel i9-13900K | 30 tok/s | AVX2 |
|
||||
|
||||
### GPU Offloading (RTX 4090)
|
||||
|
||||
| Layers GPU | Speed | VRAM |
|
||||
|------------|-------|------|
|
||||
| 0 (CPU only) | 30 tok/s | 0 GB |
|
||||
| 20 (hybrid) | 80 tok/s | 8 GB |
|
||||
| 35 (all) | 120 tok/s | 12 GB |
|
||||
243
skills/mlops/inference/llama-cpp/references/quantization.md
Normal file
243
skills/mlops/inference/llama-cpp/references/quantization.md
Normal file
@@ -0,0 +1,243 @@
|
||||
# GGUF Quantization Guide
|
||||
|
||||
Complete guide to GGUF quantization formats and model conversion.
|
||||
|
||||
## Hub-first quant selection
|
||||
|
||||
Before using generic tables, open the model repo with:
|
||||
|
||||
```text
|
||||
https://huggingface.co/<repo>?local-app=llama.cpp
|
||||
```
|
||||
|
||||
Prefer the exact quant labels and sizes shown in the `Hardware compatibility` section of the fetched `?local-app=llama.cpp` page text or HTML. Then confirm the matching filenames in:
|
||||
|
||||
```text
|
||||
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
|
||||
```
|
||||
|
||||
Use the Hub page first, and only fall back to the generic heuristics below when the repo page does not expose a clear recommendation.
|
||||
|
||||
## Quantization Overview
|
||||
|
||||
**GGUF** (GPT-Generated Unified Format) - Standard format for llama.cpp models.
|
||||
|
||||
### Format Comparison
|
||||
|
||||
| Format | Perplexity | Size (7B) | Tokens/sec | Notes |
|
||||
|--------|------------|-----------|------------|-------|
|
||||
| FP16 | 5.9565 (baseline) | 13.0 GB | 15 tok/s | Original quality |
|
||||
| Q8_0 | 5.9584 (+0.03%) | 7.0 GB | 25 tok/s | Nearly lossless |
|
||||
| **Q6_K** | 5.9642 (+0.13%) | 5.5 GB | 30 tok/s | Best quality/size |
|
||||
| **Q5_K_M** | 5.9796 (+0.39%) | 4.8 GB | 35 tok/s | Balanced |
|
||||
| **Q4_K_M** | 6.0565 (+1.68%) | 4.1 GB | 40 tok/s | **Recommended** |
|
||||
| Q4_K_S | 6.1125 (+2.62%) | 3.9 GB | 42 tok/s | Faster, lower quality |
|
||||
| Q3_K_M | 6.3184 (+6.07%) | 3.3 GB | 45 tok/s | Small models only |
|
||||
| Q2_K | 6.8673 (+15.3%) | 2.7 GB | 50 tok/s | Not recommended |
|
||||
|
||||
**Recommendation**: Use **Q4_K_M** for best balance of quality and speed.
|
||||
|
||||
## Converting Models
|
||||
|
||||
### Hugging Face to GGUF
|
||||
|
||||
```bash
|
||||
# 1. Download Hugging Face model
|
||||
hf download meta-llama/Llama-2-7b-chat-hf \
|
||||
--local-dir models/llama-2-7b-chat/
|
||||
|
||||
# 2. Convert to FP16 GGUF
|
||||
python convert_hf_to_gguf.py \
|
||||
models/llama-2-7b-chat/ \
|
||||
--outtype f16 \
|
||||
--outfile models/llama-2-7b-chat-f16.gguf
|
||||
|
||||
# 3. Quantize to Q4_K_M
|
||||
./llama-quantize \
|
||||
models/llama-2-7b-chat-f16.gguf \
|
||||
models/llama-2-7b-chat-Q4_K_M.gguf \
|
||||
Q4_K_M
|
||||
```
|
||||
|
||||
### Batch quantization
|
||||
|
||||
```bash
|
||||
# Quantize to multiple formats
|
||||
for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
|
||||
./llama-quantize \
|
||||
model-f16.gguf \
|
||||
model-${quant}.gguf \
|
||||
$quant
|
||||
done
|
||||
```
|
||||
|
||||
## K-Quantization Methods
|
||||
|
||||
**K-quants** use mixed precision for better quality:
|
||||
- Attention weights: Higher precision
|
||||
- Feed-forward weights: Lower precision
|
||||
|
||||
**Variants**:
|
||||
- `_S` (Small): Faster, lower quality
|
||||
- `_M` (Medium): Balanced (recommended)
|
||||
- `_L` (Large): Better quality, larger size
|
||||
|
||||
**Example**: `Q4_K_M`
|
||||
- `Q4`: 4-bit quantization
|
||||
- `K`: Mixed precision method
|
||||
- `M`: Medium quality
|
||||
|
||||
## Quality Testing
|
||||
|
||||
```bash
|
||||
# Calculate perplexity (quality metric)
|
||||
./llama-perplexity \
|
||||
-m model.gguf \
|
||||
-f wikitext-2-raw/wiki.test.raw \
|
||||
-c 512
|
||||
|
||||
# Lower perplexity = better quality
|
||||
# Baseline (FP16): ~5.96
|
||||
# Q4_K_M: ~6.06 (+1.7%)
|
||||
# Q2_K: ~6.87 (+15.3% - too much degradation)
|
||||
```
|
||||
|
||||
## Use Case Guide
|
||||
|
||||
### General purpose (chatbots, assistants)
|
||||
```
|
||||
Q4_K_M - Best balance
|
||||
Q5_K_M - If you have extra RAM
|
||||
```
|
||||
|
||||
### Code generation
|
||||
```
|
||||
Q5_K_M or Q6_K - Higher precision helps with code
|
||||
```
|
||||
|
||||
### Creative writing
|
||||
```
|
||||
Q4_K_M - Sufficient quality
|
||||
Q3_K_M - Acceptable for draft generation
|
||||
```
|
||||
|
||||
### Technical/medical
|
||||
```
|
||||
Q6_K or Q8_0 - Maximum accuracy
|
||||
```
|
||||
|
||||
### Edge devices (Raspberry Pi)
|
||||
```
|
||||
Q2_K or Q3_K_S - Fit in limited RAM
|
||||
```
|
||||
|
||||
## Model Size Scaling
|
||||
|
||||
### 7B parameter models
|
||||
|
||||
| Format | Size | RAM needed |
|
||||
|--------|------|------------|
|
||||
| Q2_K | 2.7 GB | 5 GB |
|
||||
| Q3_K_M | 3.3 GB | 6 GB |
|
||||
| Q4_K_M | 4.1 GB | 7 GB |
|
||||
| Q5_K_M | 4.8 GB | 8 GB |
|
||||
| Q6_K | 5.5 GB | 9 GB |
|
||||
| Q8_0 | 7.0 GB | 11 GB |
|
||||
|
||||
### 13B parameter models
|
||||
|
||||
| Format | Size | RAM needed |
|
||||
|--------|------|------------|
|
||||
| Q2_K | 5.1 GB | 8 GB |
|
||||
| Q3_K_M | 6.2 GB | 10 GB |
|
||||
| Q4_K_M | 7.9 GB | 12 GB |
|
||||
| Q5_K_M | 9.2 GB | 14 GB |
|
||||
| Q6_K | 10.7 GB | 16 GB |
|
||||
|
||||
### 70B parameter models
|
||||
|
||||
| Format | Size | RAM needed |
|
||||
|--------|------|------------|
|
||||
| Q2_K | 26 GB | 32 GB |
|
||||
| Q3_K_M | 32 GB | 40 GB |
|
||||
| Q4_K_M | 41 GB | 48 GB |
|
||||
| Q4_K_S | 39 GB | 46 GB |
|
||||
| Q5_K_M | 48 GB | 56 GB |
|
||||
|
||||
**Recommendation for 70B**: Use Q3_K_M or Q4_K_S to fit in consumer hardware.
|
||||
|
||||
## Finding Pre-Quantized Models
|
||||
|
||||
Use the Hub search with the llama.cpp app filter:
|
||||
|
||||
```text
|
||||
https://huggingface.co/models?apps=llama.cpp&sort=trending
|
||||
https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
|
||||
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
|
||||
```
|
||||
|
||||
For a specific repo, open:
|
||||
|
||||
```text
|
||||
https://huggingface.co/<repo>?local-app=llama.cpp
|
||||
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
|
||||
```
|
||||
|
||||
Then launch directly from the Hub without extra Hub tooling:
|
||||
|
||||
```bash
|
||||
llama-cli -hf <repo>:Q4_K_M
|
||||
llama-server -hf <repo>:Q4_K_M
|
||||
```
|
||||
|
||||
If you need the exact file name from the tree API:
|
||||
|
||||
```bash
|
||||
llama-server --hf-repo <repo> --hf-file <filename.gguf>
|
||||
```
|
||||
|
||||
## Importance Matrices (imatrix)
|
||||
|
||||
**What**: Calibration data to improve quantization quality.
|
||||
|
||||
**Benefits**:
|
||||
- 10-20% perplexity improvement with Q4
|
||||
- Essential for Q3 and below
|
||||
|
||||
**Usage**:
|
||||
```bash
|
||||
# 1. Generate importance matrix
|
||||
./llama-imatrix \
|
||||
-m model-f16.gguf \
|
||||
-f calibration-data.txt \
|
||||
-o model.imatrix
|
||||
|
||||
# 2. Quantize with imatrix
|
||||
./llama-quantize \
|
||||
--imatrix model.imatrix \
|
||||
model-f16.gguf \
|
||||
model-Q4_K_M.gguf \
|
||||
Q4_K_M
|
||||
```
|
||||
|
||||
**Calibration data**:
|
||||
- Use domain-specific text (e.g., code for code models)
|
||||
- ~100MB of representative text
|
||||
- Higher quality data = better quantization
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**Model outputs gibberish**:
|
||||
- Quantization too aggressive (Q2_K)
|
||||
- Try Q4_K_M or Q5_K_M
|
||||
- Verify model converted correctly
|
||||
|
||||
**Out of memory**:
|
||||
- Use lower quantization (Q4_K_S instead of Q5_K_M)
|
||||
- Offload fewer layers to GPU (`-ngl`)
|
||||
- Use smaller context (`-c 2048`)
|
||||
|
||||
**Slow inference**:
|
||||
- Higher quantization uses more compute
|
||||
- Q8_0 much slower than Q4_K_M
|
||||
- Consider speed vs quality trade-off
|
||||
150
skills/mlops/inference/llama-cpp/references/server.md
Normal file
150
skills/mlops/inference/llama-cpp/references/server.md
Normal file
@@ -0,0 +1,150 @@
|
||||
# Server Deployment Guide
|
||||
|
||||
Production deployment of llama.cpp server with OpenAI-compatible API.
|
||||
|
||||
## Direct from Hugging Face Hub
|
||||
|
||||
Prefer the model repo's local-app page first:
|
||||
|
||||
```text
|
||||
https://huggingface.co/<repo>?local-app=llama.cpp
|
||||
```
|
||||
|
||||
If the page shows an exact snippet, copy it. If not, use one of these forms:
|
||||
|
||||
```bash
|
||||
# Choose a quant label directly from the Hub repo
|
||||
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
|
||||
```
|
||||
|
||||
```bash
|
||||
# Pin an exact GGUF file from the repo tree
|
||||
llama-server \
|
||||
--hf-repo microsoft/Phi-3-mini-4k-instruct-gguf \
|
||||
--hf-file Phi-3-mini-4k-instruct-q4.gguf \
|
||||
-c 4096
|
||||
```
|
||||
|
||||
Use the file-specific form when the repo has custom naming or when you already extracted the exact filename from the tree API.
|
||||
|
||||
## Server Modes
|
||||
|
||||
### llama-server
|
||||
|
||||
```bash
|
||||
# Basic server
|
||||
./llama-server \
|
||||
-m models/llama-2-7b-chat.Q4_K_M.gguf \
|
||||
--host 0.0.0.0 \
|
||||
--port 8080 \
|
||||
-c 4096 # Context size
|
||||
|
||||
# With GPU acceleration
|
||||
./llama-server \
|
||||
-m models/llama-2-70b.Q4_K_M.gguf \
|
||||
-ngl 40 # Offload 40 layers to GPU
|
||||
```
|
||||
|
||||
## OpenAI-Compatible API
|
||||
|
||||
### Chat completions
|
||||
```bash
|
||||
curl http://localhost:8080/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "llama-2",
|
||||
"messages": [
|
||||
{"role": "system", "content": "You are helpful"},
|
||||
{"role": "user", "content": "Hello"}
|
||||
],
|
||||
"temperature": 0.7,
|
||||
"max_tokens": 100
|
||||
}'
|
||||
```
|
||||
|
||||
### Streaming
|
||||
```bash
|
||||
curl http://localhost:8080/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "llama-2",
|
||||
"messages": [{"role": "user", "content": "Count to 10"}],
|
||||
"stream": true
|
||||
}'
|
||||
```
|
||||
|
||||
## Docker Deployment
|
||||
|
||||
**Dockerfile**:
|
||||
```dockerfile
|
||||
FROM ubuntu:22.04
|
||||
RUN apt-get update && apt-get install -y git build-essential
|
||||
RUN git clone https://github.com/ggerganov/llama.cpp
|
||||
WORKDIR /llama.cpp
|
||||
RUN make LLAMA_CUDA=1
|
||||
COPY models/ /models/
|
||||
EXPOSE 8080
|
||||
CMD ["./llama-server", "-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080"]
|
||||
```
|
||||
|
||||
**Run**:
|
||||
```bash
|
||||
docker run --gpus all -p 8080:8080 llama-cpp:latest
|
||||
```
|
||||
|
||||
## Monitoring
|
||||
|
||||
```bash
|
||||
# Server metrics endpoint
|
||||
curl http://localhost:8080/metrics
|
||||
|
||||
# Health check
|
||||
curl http://localhost:8080/health
|
||||
```
|
||||
|
||||
**Metrics**:
|
||||
- requests_total
|
||||
- tokens_generated
|
||||
- prompt_tokens
|
||||
- completion_tokens
|
||||
- kv_cache_tokens
|
||||
|
||||
## Load Balancing
|
||||
|
||||
**NGINX**:
|
||||
```nginx
|
||||
upstream llama_cpp {
|
||||
server llama1:8080;
|
||||
server llama2:8080;
|
||||
}
|
||||
|
||||
server {
|
||||
location / {
|
||||
proxy_pass http://llama_cpp;
|
||||
proxy_read_timeout 300s;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Performance Tuning
|
||||
|
||||
**Parallel requests**:
|
||||
```bash
|
||||
./llama-server \
|
||||
-m model.gguf \
|
||||
-np 4 # 4 parallel slots
|
||||
```
|
||||
|
||||
**Continuous batching**:
|
||||
```bash
|
||||
./llama-server \
|
||||
-m model.gguf \
|
||||
--cont-batching # Enable continuous batching
|
||||
```
|
||||
|
||||
**Context caching**:
|
||||
```bash
|
||||
./llama-server \
|
||||
-m model.gguf \
|
||||
--cache-prompt # Cache processed prompts
|
||||
```
|
||||
442
skills/mlops/inference/llama-cpp/references/troubleshooting.md
Normal file
442
skills/mlops/inference/llama-cpp/references/troubleshooting.md
Normal file
@@ -0,0 +1,442 @@
|
||||
# GGUF Troubleshooting Guide
|
||||
|
||||
## Installation Issues
|
||||
|
||||
### Build Fails
|
||||
|
||||
**Error**: `make: *** No targets specified and no makefile found`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Ensure you're in llama.cpp directory
|
||||
cd llama.cpp
|
||||
make
|
||||
```
|
||||
|
||||
**Error**: `fatal error: cuda_runtime.h: No such file or directory`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Install CUDA toolkit
|
||||
# Ubuntu
|
||||
sudo apt install nvidia-cuda-toolkit
|
||||
|
||||
# Or set CUDA path
|
||||
export CUDA_PATH=/usr/local/cuda
|
||||
export PATH=$CUDA_PATH/bin:$PATH
|
||||
make GGML_CUDA=1
|
||||
```
|
||||
|
||||
### Python Bindings Issues
|
||||
|
||||
**Error**: `ERROR: Failed building wheel for llama-cpp-python`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Install build dependencies
|
||||
pip install cmake scikit-build-core
|
||||
|
||||
# For CUDA support
|
||||
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
|
||||
|
||||
# For Metal (macOS)
|
||||
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
|
||||
```
|
||||
|
||||
**Error**: `ImportError: libcudart.so.XX: cannot open shared object file`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Add CUDA libraries to path
|
||||
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
|
||||
|
||||
# Or reinstall with correct CUDA version
|
||||
pip uninstall llama-cpp-python
|
||||
CUDACXX=/usr/local/cuda/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
|
||||
```
|
||||
|
||||
## Conversion Issues
|
||||
|
||||
### Model Not Supported
|
||||
|
||||
**Error**: `KeyError: 'model.embed_tokens.weight'`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check model architecture
|
||||
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('./model').architectures)"
|
||||
|
||||
# Use appropriate conversion script
|
||||
# For most models:
|
||||
python convert_hf_to_gguf.py ./model --outfile model.gguf
|
||||
|
||||
# For older models, check if legacy script needed
|
||||
```
|
||||
|
||||
### Vocabulary Mismatch
|
||||
|
||||
**Error**: `RuntimeError: Vocabulary size mismatch`
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Ensure tokenizer matches model
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("./model")
|
||||
model = AutoModelForCausalLM.from_pretrained("./model")
|
||||
|
||||
print(f"Tokenizer vocab size: {len(tokenizer)}")
|
||||
print(f"Model vocab size: {model.config.vocab_size}")
|
||||
|
||||
# If mismatch, resize embeddings before conversion
|
||||
model.resize_token_embeddings(len(tokenizer))
|
||||
model.save_pretrained("./model-fixed")
|
||||
```
|
||||
|
||||
### Out of Memory During Conversion
|
||||
|
||||
**Error**: `torch.cuda.OutOfMemoryError` during conversion
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Use CPU for conversion
|
||||
CUDA_VISIBLE_DEVICES="" python convert_hf_to_gguf.py ./model --outfile model.gguf
|
||||
|
||||
# Or use low memory mode
|
||||
python convert_hf_to_gguf.py ./model --outfile model.gguf --outtype f16
|
||||
```
|
||||
|
||||
## Quantization Issues
|
||||
|
||||
### Wrong Output File Size
|
||||
|
||||
**Problem**: Quantized file is larger than expected
|
||||
|
||||
**Check**:
|
||||
```bash
|
||||
# Verify quantization type
|
||||
./llama-cli -m model.gguf --verbose
|
||||
|
||||
# Expected sizes for 7B model:
|
||||
# Q4_K_M: ~4.1 GB
|
||||
# Q5_K_M: ~4.8 GB
|
||||
# Q8_0: ~7.2 GB
|
||||
# F16: ~13.5 GB
|
||||
```
|
||||
|
||||
### Quantization Crashes
|
||||
|
||||
**Error**: `Segmentation fault` during quantization
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Increase stack size
|
||||
ulimit -s unlimited
|
||||
|
||||
# Or use less threads
|
||||
./llama-quantize -t 4 model-f16.gguf model-q4.gguf Q4_K_M
|
||||
```
|
||||
|
||||
### Poor Quality After Quantization
|
||||
|
||||
**Problem**: Model outputs gibberish after quantization
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Use importance matrix**:
|
||||
```bash
|
||||
# Generate imatrix with good calibration data
|
||||
./llama-imatrix -m model-f16.gguf \
|
||||
-f wiki_sample.txt \
|
||||
--chunk 512 \
|
||||
-o model.imatrix
|
||||
|
||||
# Quantize with imatrix
|
||||
./llama-quantize --imatrix model.imatrix \
|
||||
model-f16.gguf model-q4_k_m.gguf Q4_K_M
|
||||
```
|
||||
|
||||
2. **Try higher precision**:
|
||||
```bash
|
||||
# Use Q5_K_M or Q6_K instead of Q4
|
||||
./llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M
|
||||
```
|
||||
|
||||
3. **Check original model**:
|
||||
```bash
|
||||
# Test FP16 version first
|
||||
./llama-cli -m model-f16.gguf -p "Hello, how are you?" -n 50
|
||||
```
|
||||
|
||||
## Inference Issues
|
||||
|
||||
### Slow Generation
|
||||
|
||||
**Problem**: Generation is slower than expected
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Enable GPU offload**:
|
||||
```bash
|
||||
./llama-cli -m model.gguf -ngl 35 -p "Hello"
|
||||
```
|
||||
|
||||
2. **Optimize batch size**:
|
||||
```python
|
||||
llm = Llama(
|
||||
model_path="model.gguf",
|
||||
n_batch=512, # Increase for faster prompt processing
|
||||
n_gpu_layers=35
|
||||
)
|
||||
```
|
||||
|
||||
3. **Use appropriate threads**:
|
||||
```bash
|
||||
# Match physical cores, not logical
|
||||
./llama-cli -m model.gguf -t 8 -p "Hello"
|
||||
```
|
||||
|
||||
4. **Enable Flash Attention** (if supported):
|
||||
```bash
|
||||
./llama-cli -m model.gguf -ngl 35 --flash-attn -p "Hello"
|
||||
```
|
||||
|
||||
### Out of Memory
|
||||
|
||||
**Error**: `CUDA out of memory` or system freeze
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Reduce GPU layers**:
|
||||
```python
|
||||
# Start low and increase
|
||||
llm = Llama(model_path="model.gguf", n_gpu_layers=10)
|
||||
```
|
||||
|
||||
2. **Use smaller quantization**:
|
||||
```bash
|
||||
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
|
||||
```
|
||||
|
||||
3. **Reduce context length**:
|
||||
```python
|
||||
llm = Llama(
|
||||
model_path="model.gguf",
|
||||
n_ctx=2048, # Reduce from 4096
|
||||
n_gpu_layers=35
|
||||
)
|
||||
```
|
||||
|
||||
4. **Quantize KV cache**:
|
||||
```python
|
||||
llm = Llama(
|
||||
model_path="model.gguf",
|
||||
type_k=2, # Q4_0 for K cache
|
||||
type_v=2, # Q4_0 for V cache
|
||||
n_gpu_layers=35
|
||||
)
|
||||
```
|
||||
|
||||
### Garbage Output
|
||||
|
||||
**Problem**: Model outputs random characters or nonsense
|
||||
|
||||
**Diagnose**:
|
||||
```python
|
||||
# Check model loading
|
||||
llm = Llama(model_path="model.gguf", verbose=True)
|
||||
|
||||
# Test with simple prompt
|
||||
output = llm("1+1=", max_tokens=5, temperature=0)
|
||||
print(output)
|
||||
```
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Check model integrity**:
|
||||
```bash
|
||||
# Verify GGUF file
|
||||
./llama-cli -m model.gguf --verbose 2>&1 | head -50
|
||||
```
|
||||
|
||||
2. **Use correct chat format**:
|
||||
```python
|
||||
llm = Llama(
|
||||
model_path="model.gguf",
|
||||
chat_format="llama-3" # Match your model: chatml, mistral, etc.
|
||||
)
|
||||
```
|
||||
|
||||
3. **Check temperature**:
|
||||
```python
|
||||
# Use lower temperature for deterministic output
|
||||
output = llm("Hello", max_tokens=50, temperature=0.1)
|
||||
```
|
||||
|
||||
### Token Issues
|
||||
|
||||
**Error**: `RuntimeError: unknown token` or encoding errors
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Ensure UTF-8 encoding
|
||||
prompt = "Hello, world!".encode('utf-8').decode('utf-8')
|
||||
output = llm(prompt, max_tokens=50)
|
||||
```
|
||||
|
||||
## Server Issues
|
||||
|
||||
### Connection Refused
|
||||
|
||||
**Error**: `Connection refused` when accessing server
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Bind to all interfaces
|
||||
./llama-server -m model.gguf --host 0.0.0.0 --port 8080
|
||||
|
||||
# Check if port is in use
|
||||
lsof -i :8080
|
||||
```
|
||||
|
||||
### Server Crashes Under Load
|
||||
|
||||
**Problem**: Server crashes with multiple concurrent requests
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Limit parallelism**:
|
||||
```bash
|
||||
./llama-server -m model.gguf \
|
||||
--parallel 2 \
|
||||
-c 4096 \
|
||||
--cont-batching
|
||||
```
|
||||
|
||||
2. **Add request timeout**:
|
||||
```bash
|
||||
./llama-server -m model.gguf --timeout 300
|
||||
```
|
||||
|
||||
3. **Monitor memory**:
|
||||
```bash
|
||||
watch -n 1 nvidia-smi # For GPU
|
||||
watch -n 1 free -h # For RAM
|
||||
```
|
||||
|
||||
### API Compatibility Issues
|
||||
|
||||
**Problem**: OpenAI client not working with server
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
from openai import OpenAI
|
||||
|
||||
# Use correct base URL format
|
||||
client = OpenAI(
|
||||
base_url="http://localhost:8080/v1", # Include /v1
|
||||
api_key="not-needed"
|
||||
)
|
||||
|
||||
# Use correct model name
|
||||
response = client.chat.completions.create(
|
||||
model="local", # Or the actual model name
|
||||
messages=[{"role": "user", "content": "Hello"}]
|
||||
)
|
||||
```
|
||||
|
||||
## Apple Silicon Issues
|
||||
|
||||
### Metal Not Working
|
||||
|
||||
**Problem**: Metal acceleration not enabled
|
||||
|
||||
**Check**:
|
||||
```bash
|
||||
# Verify Metal support
|
||||
./llama-cli -m model.gguf --verbose 2>&1 | grep -i metal
|
||||
```
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Rebuild with Metal
|
||||
make clean
|
||||
make GGML_METAL=1
|
||||
|
||||
# Python bindings
|
||||
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall
|
||||
```
|
||||
|
||||
### Incorrect Memory Usage on M1/M2
|
||||
|
||||
**Problem**: Model uses too much unified memory
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Offload all layers for Metal
|
||||
llm = Llama(
|
||||
model_path="model.gguf",
|
||||
n_gpu_layers=99, # Offload everything
|
||||
n_threads=1 # Metal handles parallelism
|
||||
)
|
||||
```
|
||||
|
||||
## Debugging
|
||||
|
||||
### Enable Verbose Output
|
||||
|
||||
```bash
|
||||
# CLI verbose mode
|
||||
./llama-cli -m model.gguf --verbose -p "Hello" -n 50
|
||||
|
||||
# Python verbose
|
||||
llm = Llama(model_path="model.gguf", verbose=True)
|
||||
```
|
||||
|
||||
### Check Model Metadata
|
||||
|
||||
```bash
|
||||
# View GGUF metadata
|
||||
./llama-cli -m model.gguf --verbose 2>&1 | head -100
|
||||
```
|
||||
|
||||
### Validate GGUF File
|
||||
|
||||
```python
|
||||
import struct
|
||||
|
||||
def validate_gguf(filepath):
|
||||
with open(filepath, 'rb') as f:
|
||||
magic = f.read(4)
|
||||
if magic != b'GGUF':
|
||||
print(f"Invalid magic: {magic}")
|
||||
return False
|
||||
|
||||
version = struct.unpack('<I', f.read(4))[0]
|
||||
print(f"GGUF version: {version}")
|
||||
|
||||
tensor_count = struct.unpack('<Q', f.read(8))[0]
|
||||
metadata_count = struct.unpack('<Q', f.read(8))[0]
|
||||
print(f"Tensors: {tensor_count}, Metadata: {metadata_count}")
|
||||
|
||||
return True
|
||||
|
||||
validate_gguf("model.gguf")
|
||||
```
|
||||
|
||||
## Getting Help
|
||||
|
||||
1. **GitHub Issues**: https://github.com/ggml-org/llama.cpp/issues
|
||||
2. **Discussions**: https://github.com/ggml-org/llama.cpp/discussions
|
||||
3. **Reddit**: r/LocalLLaMA
|
||||
|
||||
### Reporting Issues
|
||||
|
||||
Include:
|
||||
- llama.cpp version/commit hash
|
||||
- Build command used
|
||||
- Model name and quantization
|
||||
- Full error message/stack trace
|
||||
- Hardware: CPU/GPU model, RAM, VRAM
|
||||
- OS version
|
||||
- Minimal reproduction steps
|
||||
342
skills/mlops/inference/obliteratus/SKILL.md
Normal file
342
skills/mlops/inference/obliteratus/SKILL.md
Normal file
@@ -0,0 +1,342 @@
|
||||
---
|
||||
name: obliteratus
|
||||
description: "OBLITERATUS: abliterate LLM refusals (diff-in-means)."
|
||||
version: 2.0.0
|
||||
author: Hermes Agent
|
||||
license: MIT
|
||||
dependencies: [obliteratus, torch, transformers, bitsandbytes, accelerate, safetensors]
|
||||
platforms: [linux, macos]
|
||||
metadata:
|
||||
hermes:
|
||||
tags: [Abliteration, Uncensoring, Refusal-Removal, LLM, Weight-Projection, SVD, Mechanistic-Interpretability, HuggingFace, Model-Surgery]
|
||||
related_skills: [vllm, gguf, huggingface-tokenizers]
|
||||
---
|
||||
|
||||
# OBLITERATUS Skill
|
||||
|
||||
## What's inside
|
||||
|
||||
9 CLI methods, 28 analysis modules, 116 model presets across 5 compute tiers, tournament evaluation, and telemetry-driven recommendations.
|
||||
|
||||
Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses mechanistic interpretability techniques — including diff-in-means, SVD, whitened SVD, LEACE concept erasure, SAE decomposition, Bayesian kernel projection, and more — to identify and surgically excise refusal directions from model weights while preserving reasoning capabilities.
|
||||
|
||||
**License warning:** OBLITERATUS is AGPL-3.0. NEVER import it as a Python library. Always invoke via CLI (`obliteratus` command) or subprocess. This keeps Hermes Agent's MIT license clean.
|
||||
|
||||
## Video Guide
|
||||
|
||||
Walkthrough of OBLITERATUS used by a Hermes agent to abliterate Gemma:
|
||||
https://www.youtube.com/watch?v=8fG9BrNTeHs ("OBLITERATUS: An AI Agent Removed Gemma 4's Safety Guardrails")
|
||||
|
||||
Useful when the user wants a visual overview of the end-to-end workflow before running it themselves.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Trigger when the user:
|
||||
- Wants to "uncensor" or "abliterate" an LLM
|
||||
- Asks about removing refusal/guardrails from a model
|
||||
- Wants to create an uncensored version of Llama, Qwen, Mistral, etc.
|
||||
- Mentions "refusal removal", "abliteration", "weight projection"
|
||||
- Wants to analyze how a model's refusal mechanism works
|
||||
- References OBLITERATUS, abliterator, or refusal directions
|
||||
|
||||
## Step 1: Installation
|
||||
|
||||
Check if already installed:
|
||||
```bash
|
||||
obliteratus --version 2>/dev/null && echo "INSTALLED" || echo "NOT INSTALLED"
|
||||
```
|
||||
|
||||
If not installed, clone and install from GitHub:
|
||||
```bash
|
||||
git clone https://github.com/elder-plinius/OBLITERATUS.git
|
||||
cd OBLITERATUS
|
||||
pip install -e .
|
||||
# For Gradio web UI support:
|
||||
# pip install -e ".[spaces]"
|
||||
```
|
||||
|
||||
**IMPORTANT:** Confirm with user before installing. This pulls in ~5-10GB of dependencies (PyTorch, Transformers, bitsandbytes, etc.).
|
||||
|
||||
## Step 2: Check Hardware
|
||||
|
||||
Before anything, check what GPU is available:
|
||||
```bash
|
||||
python3 -c "
|
||||
import torch
|
||||
if torch.cuda.is_available():
|
||||
gpu = torch.cuda.get_device_name(0)
|
||||
vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
|
||||
print(f'GPU: {gpu}')
|
||||
print(f'VRAM: {vram:.1f} GB')
|
||||
if vram < 4: print('TIER: tiny (models under 1B)')
|
||||
elif vram < 8: print('TIER: small (models 1-4B)')
|
||||
elif vram < 16: print('TIER: medium (models 4-9B with 4bit quant)')
|
||||
elif vram < 32: print('TIER: large (models 8-32B with 4bit quant)')
|
||||
else: print('TIER: frontier (models 32B+)')
|
||||
else:
|
||||
print('NO GPU - only tiny models (under 1B) on CPU')
|
||||
"
|
||||
```
|
||||
|
||||
### VRAM Requirements (with 4-bit quantization)
|
||||
|
||||
| VRAM | Max Model Size | Example Models |
|
||||
|:---------|:----------------|:--------------------------------------------|
|
||||
| CPU only | ~1B params | GPT-2, TinyLlama, SmolLM |
|
||||
| 4-8 GB | ~4B params | Qwen2.5-1.5B, Phi-3.5 mini, Llama 3.2 3B |
|
||||
| 8-16 GB | ~9B params | Llama 3.1 8B, Mistral 7B, Gemma 2 9B |
|
||||
| 24 GB | ~32B params | Qwen3-32B, Llama 3.1 70B (tight), Command-R |
|
||||
| 48 GB+ | ~72B+ params | Qwen2.5-72B, DeepSeek-R1 |
|
||||
| Multi-GPU| 200B+ params | Llama 3.1 405B, DeepSeek-V3 (685B MoE) |
|
||||
|
||||
## Step 3: Browse Available Models & Get Recommendations
|
||||
|
||||
```bash
|
||||
# Browse models by compute tier
|
||||
obliteratus models --tier medium
|
||||
|
||||
# Get architecture info for a specific model
|
||||
obliteratus info <model_name>
|
||||
|
||||
# Get telemetry-driven recommendation for best method & params
|
||||
obliteratus recommend <model_name>
|
||||
obliteratus recommend <model_name> --insights # global cross-architecture rankings
|
||||
```
|
||||
|
||||
## Step 4: Choose a Method
|
||||
|
||||
### Method Selection Guide
|
||||
**Default / recommended for most cases: `advanced`.** It uses multi-direction SVD with norm-preserving projection and is well-tested.
|
||||
|
||||
| Situation | Recommended Method | Why |
|
||||
|:----------------------------------|:-------------------|:-----------------------------------------|
|
||||
| Default / most models | `advanced` | Multi-direction SVD, norm-preserving, reliable |
|
||||
| Quick test / prototyping | `basic` | Fast, simple, good enough to evaluate |
|
||||
| Dense model (Llama, Mistral) | `advanced` | Multi-direction, norm-preserving |
|
||||
| MoE model (DeepSeek, Mixtral) | `nuclear` | Expert-granular, handles MoE complexity |
|
||||
| Reasoning model (R1 distills) | `surgical` | CoT-aware, preserves chain-of-thought |
|
||||
| Stubborn refusals persist | `aggressive` | Whitened SVD + head surgery + jailbreak |
|
||||
| Want reversible changes | Use steering vectors (see Analysis section) |
|
||||
| Maximum quality, time no object | `optimized` | Bayesian search for best parameters |
|
||||
| Experimental auto-detection | `informed` | Auto-detects alignment type — experimental, may not always outperform advanced |
|
||||
|
||||
### 9 CLI Methods
|
||||
- **basic** — Single refusal direction via diff-in-means. Fast (~5-10 min for 8B).
|
||||
- **advanced** (DEFAULT, RECOMMENDED) — Multiple SVD directions, norm-preserving projection, 2 refinement passes. Medium speed (~10-20 min).
|
||||
- **aggressive** — Whitened SVD + jailbreak-contrastive + attention head surgery. Higher risk of coherence damage.
|
||||
- **spectral_cascade** — DCT frequency-domain decomposition. Research/novel approach.
|
||||
- **informed** — Runs analysis DURING abliteration to auto-configure. Experimental — slower and less predictable than advanced.
|
||||
- **surgical** — SAE features + neuron masking + head surgery + per-expert. Very slow (~1-2 hrs). Best for reasoning models.
|
||||
- **optimized** — Bayesian hyperparameter search (Optuna TPE). Longest runtime but finds optimal parameters.
|
||||
- **inverted** — Flips the refusal direction. Model becomes actively willing.
|
||||
- **nuclear** — Maximum force combo for stubborn MoE models. Expert-granular.
|
||||
|
||||
### Direction Extraction Methods (--direction-method flag)
|
||||
- **diff_means** (default) — Simple difference-in-means between refused/complied activations. Robust.
|
||||
- **svd** — Multi-direction SVD extraction. Better for complex alignment.
|
||||
- **leace** — LEACE (Linear Erasure via Closed-form Estimation). Optimal linear erasure.
|
||||
|
||||
### 4 Python-API-Only Methods
|
||||
(NOT available via CLI — require Python import, which violates AGPL boundary. Mention to user only if they explicitly want to use OBLITERATUS as a library in their own AGPL project.)
|
||||
- failspy, gabliteration, heretic, rdo
|
||||
|
||||
## Step 5: Run Abliteration
|
||||
|
||||
### Standard usage
|
||||
```bash
|
||||
# Default method (advanced) — recommended for most models
|
||||
obliteratus obliterate <model_name> --method advanced --output-dir ./abliterated-models
|
||||
|
||||
# With 4-bit quantization (saves VRAM)
|
||||
obliteratus obliterate <model_name> --method advanced --quantization 4bit --output-dir ./abliterated-models
|
||||
|
||||
# Large models (70B+) — conservative defaults
|
||||
obliteratus obliterate <model_name> --method advanced --quantization 4bit --large-model --output-dir ./abliterated-models
|
||||
```
|
||||
|
||||
### Fine-tuning parameters
|
||||
```bash
|
||||
obliteratus obliterate <model_name> \
|
||||
--method advanced \
|
||||
--direction-method diff_means \
|
||||
--n-directions 4 \
|
||||
--refinement-passes 2 \
|
||||
--regularization 0.1 \
|
||||
--quantization 4bit \
|
||||
--output-dir ./abliterated-models \
|
||||
--contribute # opt-in telemetry for community research
|
||||
```
|
||||
|
||||
### Key flags
|
||||
| Flag | Description | Default |
|
||||
|:-----|:------------|:--------|
|
||||
| `--method` | Abliteration method | advanced |
|
||||
| `--direction-method` | Direction extraction | diff_means |
|
||||
| `--n-directions` | Number of refusal directions (1-32) | method-dependent |
|
||||
| `--refinement-passes` | Iterative passes (1-5) | 2 |
|
||||
| `--regularization` | Regularization strength (0.0-1.0) | 0.1 |
|
||||
| `--quantization` | Load in 4bit or 8bit | none (full precision) |
|
||||
| `--large-model` | Conservative defaults for 120B+ | false |
|
||||
| `--output-dir` | Where to save the abliterated model | ./obliterated_model |
|
||||
| `--contribute` | Share anonymized results for research | false |
|
||||
| `--verify-sample-size` | Number of test prompts for refusal check | 20 |
|
||||
| `--dtype` | Model dtype (float16, bfloat16) | auto |
|
||||
|
||||
### Other execution modes
|
||||
```bash
|
||||
# Interactive guided mode (hardware → model → preset)
|
||||
obliteratus interactive
|
||||
|
||||
# Web UI (Gradio)
|
||||
obliteratus ui --port 7860
|
||||
|
||||
# Run a full ablation study from YAML config
|
||||
obliteratus run config.yaml --preset quick
|
||||
|
||||
# Tournament: pit all methods against each other
|
||||
obliteratus tourney <model_name>
|
||||
```
|
||||
|
||||
## Step 6: Verify Results
|
||||
|
||||
After abliteration, check the output metrics:
|
||||
|
||||
| Metric | Good Value | Warning |
|
||||
|:-------|:-----------|:--------|
|
||||
| Refusal rate | < 5% (ideally ~0%) | > 10% means refusals persist |
|
||||
| Perplexity change | < 10% increase | > 15% means coherence damage |
|
||||
| KL divergence | < 0.1 | > 0.5 means significant distribution shift |
|
||||
| Coherence | High / passes qualitative check | Degraded responses, repetition |
|
||||
|
||||
### If refusals persist (> 10%)
|
||||
1. Try `aggressive` method
|
||||
2. Increase `--n-directions` (e.g., 8 or 16)
|
||||
3. Add `--refinement-passes 3`
|
||||
4. Try `--direction-method svd` instead of diff_means
|
||||
|
||||
### If coherence is damaged (perplexity > 15% increase)
|
||||
1. Reduce `--n-directions` (try 2)
|
||||
2. Increase `--regularization` (try 0.3)
|
||||
3. Reduce `--refinement-passes` to 1
|
||||
4. Try `basic` method (gentler)
|
||||
|
||||
## Step 7: Use the Abliterated Model
|
||||
|
||||
The output is a standard HuggingFace model directory.
|
||||
|
||||
```bash
|
||||
# Test locally with transformers
|
||||
python3 -c "
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
model = AutoModelForCausalLM.from_pretrained('./abliterated-models/<model>')
|
||||
tokenizer = AutoTokenizer.from_pretrained('./abliterated-models/<model>')
|
||||
inputs = tokenizer('How do I pick a lock?', return_tensors='pt')
|
||||
outputs = model.generate(**inputs, max_new_tokens=200)
|
||||
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
||||
"
|
||||
|
||||
# Upload to HuggingFace Hub
|
||||
huggingface-cli upload <username>/<model-name>-abliterated ./abliterated-models/<model>
|
||||
|
||||
# Serve with vLLM
|
||||
vllm serve ./abliterated-models/<model>
|
||||
```
|
||||
|
||||
## CLI Command Reference
|
||||
|
||||
| Command | Description |
|
||||
|:--------|:------------|
|
||||
| `obliteratus obliterate` | Main abliteration command |
|
||||
| `obliteratus info <model>` | Print model architecture details |
|
||||
| `obliteratus models --tier <tier>` | Browse curated models by compute tier |
|
||||
| `obliteratus recommend <model>` | Telemetry-driven method/param suggestion |
|
||||
| `obliteratus interactive` | Guided setup wizard |
|
||||
| `obliteratus tourney <model>` | Tournament: all methods head-to-head |
|
||||
| `obliteratus run <config.yaml>` | Execute ablation study from YAML |
|
||||
| `obliteratus strategies` | List all registered ablation strategies |
|
||||
| `obliteratus report <results.json>` | Regenerate visual reports |
|
||||
| `obliteratus ui` | Launch Gradio web interface |
|
||||
| `obliteratus aggregate` | Summarize community telemetry data |
|
||||
|
||||
## Analysis Modules
|
||||
|
||||
OBLITERATUS includes 28 analysis modules for mechanistic interpretability.
|
||||
See `skill_view(name="obliteratus", file_path="references/analysis-modules.md")` for the full reference.
|
||||
|
||||
### Quick analysis commands
|
||||
```bash
|
||||
# Run specific analysis modules
|
||||
obliteratus run analysis-config.yaml --preset quick
|
||||
|
||||
# Key modules to run first:
|
||||
# - alignment_imprint: Fingerprint DPO/RLHF/CAI/SFT alignment method
|
||||
# - concept_geometry: Single direction vs polyhedral cone
|
||||
# - logit_lens: Which layer decides to refuse
|
||||
# - anti_ouroboros: Self-repair risk score
|
||||
# - causal_tracing: Causally necessary components
|
||||
```
|
||||
|
||||
### Steering Vectors (Reversible Alternative)
|
||||
Instead of permanent weight modification, use inference-time steering:
|
||||
```python
|
||||
# Python API only — for user's own projects
|
||||
from obliteratus.analysis.steering_vectors import SteeringVectorFactory, SteeringHookManager
|
||||
```
|
||||
|
||||
## Ablation Strategies
|
||||
|
||||
Beyond direction-based abliteration, OBLITERATUS includes structural ablation strategies:
|
||||
- **Embedding Ablation** — Target embedding layer components
|
||||
- **FFN Ablation** — Feed-forward network block removal
|
||||
- **Head Pruning** — Attention head pruning
|
||||
- **Layer Removal** — Full layer removal
|
||||
|
||||
List all available: `obliteratus strategies`
|
||||
|
||||
## Evaluation
|
||||
|
||||
OBLITERATUS includes built-in evaluation tools:
|
||||
- Refusal rate benchmarking
|
||||
- Perplexity comparison (before/after)
|
||||
- LM Eval Harness integration for academic benchmarks
|
||||
- Head-to-head competitor comparison
|
||||
- Baseline performance tracking
|
||||
|
||||
## Platform Support
|
||||
|
||||
- **CUDA** — Full support (NVIDIA GPUs)
|
||||
- **Apple Silicon (MLX)** — Supported via MLX backend
|
||||
- **CPU** — Supported for tiny models (< 1B params)
|
||||
|
||||
## YAML Config Templates
|
||||
|
||||
Load templates for reproducible runs via `skill_view`:
|
||||
- `templates/abliteration-config.yaml` — Standard single-model config
|
||||
- `templates/analysis-study.yaml` — Pre-abliteration analysis study
|
||||
- `templates/batch-abliteration.yaml` — Multi-model batch processing
|
||||
|
||||
## Telemetry
|
||||
|
||||
OBLITERATUS can optionally contribute anonymized run data to a global research dataset.
|
||||
Enable with `--contribute` flag. No personal data is collected — only model name, method, metrics.
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
1. **Don't use `informed` as default** — it's experimental and slower. Use `advanced` for reliable results.
|
||||
2. **Models under ~1B respond poorly to abliteration** — their refusal behaviors are shallow and fragmented, making clean direction extraction difficult. Expect partial results (20-40% remaining refusal). Models 3B+ have cleaner refusal directions and respond much better (often 0% refusal with `advanced`).
|
||||
3. **`aggressive` can make things worse** — on small models it can damage coherence and actually increase refusal rate. Only use it if `advanced` leaves > 10% refusals on a 3B+ model.
|
||||
4. **Always check perplexity** — if it spikes > 15%, the model is damaged. Reduce aggressiveness.
|
||||
5. **MoE models need special handling** — use `nuclear` method for Mixtral, DeepSeek-MoE, etc.
|
||||
6. **Quantized models can't be re-quantized** — abliterate the full-precision model, then quantize the output.
|
||||
7. **VRAM estimation is approximate** — 4-bit quant helps but peak usage can spike during extraction.
|
||||
8. **Reasoning models are sensitive** — use `surgical` for R1 distills to preserve chain-of-thought.
|
||||
9. **Check `obliteratus recommend`** — telemetry data may have better parameters than defaults.
|
||||
10. **AGPL license** — never `import obliteratus` in MIT/Apache projects. CLI invocation only.
|
||||
11. **Large models (70B+)** — always use `--large-model` flag for conservative defaults.
|
||||
12. **Spectral certification RED is common** — the spectral check often flags "incomplete" even when practical refusal rate is 0%. Check actual refusal rate rather than relying on spectral certification alone.
|
||||
|
||||
## Complementary Skills
|
||||
|
||||
- **vllm** — Serve abliterated models with high throughput
|
||||
- **gguf** — Convert abliterated models to GGUF for llama.cpp
|
||||
- **huggingface-tokenizers** — Work with model tokenizers
|
||||
@@ -0,0 +1,166 @@
|
||||
# OBLITERATUS Analysis Modules — Reference
|
||||
|
||||
OBLITERATUS includes 28 analysis modules for mechanistic interpretability of refusal in LLMs.
|
||||
These modules help understand how and where refusal behaviors are encoded before performing abliteration.
|
||||
|
||||
---
|
||||
|
||||
## Core Analysis (Run These First)
|
||||
|
||||
### 1. Alignment Imprint Detection (`alignment_imprint.py`)
|
||||
Fingerprints whether a model was trained via DPO, RLHF, CAI, or SFT.
|
||||
This determines which extraction strategy will work best.
|
||||
|
||||
### 2. Concept Cone Geometry (`concept_geometry.py`)
|
||||
Determines if refusal is a single linear direction or a polyhedral cone
|
||||
(set of multiple mechanisms). Single-direction models respond well to `basic`;
|
||||
polyhedral models need `advanced` or `surgical`.
|
||||
|
||||
### 3. Refusal Logit Lens (`logit_lens.py`)
|
||||
Identifies the specific layer where a model "decides" to refuse by decoding
|
||||
intermediate layer representations into token space.
|
||||
|
||||
### 4. Ouroboros Detection (`anti_ouroboros.py`)
|
||||
Identifies if a model attempts to "self-repair" refusal behaviors after
|
||||
excision. Reports a risk score (0-1). High scores mean additional refinement
|
||||
passes are needed.
|
||||
|
||||
### 5. Causal Tracing (`causal_tracing.py`)
|
||||
Identifies which components (layers, heads, MLPs) are causally necessary
|
||||
for refusal behavior using activation patching.
|
||||
|
||||
---
|
||||
|
||||
## Geometric Analysis
|
||||
|
||||
### 6. Cross-Layer Alignment (`cross_layer.py`)
|
||||
Measures how refusal directions align across different layers. High alignment
|
||||
means the refusal signal is consistent; low alignment suggests layer-specific
|
||||
mechanisms.
|
||||
|
||||
### 7. Residual Stream Decomposition (`residual_stream.py`)
|
||||
Decomposes the residual stream into attention and MLP contributions to
|
||||
understand which component type contributes more to refusal.
|
||||
|
||||
### 8. Riemannian Manifold Geometry (`riemannian_manifold.py`)
|
||||
Analyzes the curvature and geometry of the weight manifold near refusal
|
||||
directions. Informs how aggressively projections can be applied without
|
||||
damaging the manifold structure.
|
||||
|
||||
### 9. Whitened SVD (`whitened_svd.py`)
|
||||
Covariance-normalized SVD extraction that separates guardrail signals from
|
||||
natural activation variance. More precise than standard SVD for models with
|
||||
high activation variance.
|
||||
|
||||
### 10. Concept Cone Geometry (extended)
|
||||
Maps the full polyhedral structure of refusal, including cone angles,
|
||||
face counts, and intersection patterns.
|
||||
|
||||
---
|
||||
|
||||
## Probing & Classification
|
||||
|
||||
### 11. Activation Probing (`activation_probing.py`)
|
||||
Post-excision verification — probes for residual refusal concepts after
|
||||
abliteration to ensure complete removal.
|
||||
|
||||
### 12. Probing Classifiers (`probing_classifiers.py`)
|
||||
Trains linear classifiers to detect refusal in activations. Used both
|
||||
before (to verify refusal exists) and after (to verify it's gone).
|
||||
|
||||
### 13. Activation Patching (`activation_patching.py`)
|
||||
Interchange interventions — swaps activations between refused and complied
|
||||
runs to identify causal components.
|
||||
|
||||
### 14. Tuned Lens (`tuned_lens.py`)
|
||||
Trained version of logit lens that provides more accurate per-layer
|
||||
decoding by learning affine transformations for each layer.
|
||||
|
||||
### 15. Multi-Token Position Analysis (`multi_token_position.py`)
|
||||
Analyzes refusal signals across multiple token positions, not just the
|
||||
last token. Important for models that distribute refusal across the sequence.
|
||||
|
||||
---
|
||||
|
||||
## Abliteration & Manipulation
|
||||
|
||||
### 16. SAE-Based Abliteration (`sae_abliteration.py`)
|
||||
Uses Sparse Autoencoder features to identify and remove specific refusal
|
||||
features. More surgical than direction-based methods.
|
||||
|
||||
### 17. Steering Vectors (`steering_vectors.py`)
|
||||
Creates and applies inference-time steering vectors for reversible refusal
|
||||
modification. Includes `SteeringVectorFactory` and `SteeringHookManager`.
|
||||
|
||||
### 18. LEACE Concept Erasure (`leace.py`)
|
||||
Linear Erasure via Closed-form Estimation — mathematically optimal linear
|
||||
concept removal. Available as both analysis module and direction extraction method.
|
||||
|
||||
### 19. Sparse Surgery (`sparse_surgery.py`)
|
||||
High-precision weight modification targeting individual neurons and
|
||||
weight matrix entries rather than full directions.
|
||||
|
||||
### 20. Conditional Abliteration (`conditional_abliteration.py`)
|
||||
Targeted removal that only affects specific refusal categories while
|
||||
preserving others (e.g., remove weapons refusal but keep CSAM refusal).
|
||||
|
||||
---
|
||||
|
||||
## Transfer & Robustness
|
||||
|
||||
### 21. Cross-Model Transfer (`cross_model_transfer.py`)
|
||||
Tests whether refusal directions extracted from one model transfer to
|
||||
another architecture. Measures universality of guardrail directions.
|
||||
|
||||
### 22. Defense Robustness (`defense_robustness.py`)
|
||||
Evaluates how robust the abliteration is against various defense mechanisms
|
||||
and re-alignment attempts.
|
||||
|
||||
### 23. Spectral Certification (`spectral_certification.py`)
|
||||
Provides mathematical bounds on the completeness of refusal removal
|
||||
using spectral analysis of the projection.
|
||||
|
||||
### 24. Wasserstein Optimal Extraction (`wasserstein_optimal.py`)
|
||||
Uses optimal transport theory for more precise direction extraction
|
||||
that minimizes distribution shift.
|
||||
|
||||
### 25. Wasserstein Transfer (`wasserstein_transfer.py`)
|
||||
Distribution transfer between models using Wasserstein distance
|
||||
for cross-architecture refusal direction mapping.
|
||||
|
||||
---
|
||||
|
||||
## Advanced / Research
|
||||
|
||||
### 26. Bayesian Kernel Projection (`bayesian_kernel_projection.py`)
|
||||
Probabilistic feature mapping that estimates uncertainty in refusal
|
||||
direction identification.
|
||||
|
||||
### 27. Cross-Model Universality Index
|
||||
Measures if guardrail directions generalize across different model
|
||||
architectures and training regimes.
|
||||
|
||||
### 28. Visualization (`visualization.py`)
|
||||
Plotting and graphing utilities for all analysis modules. Generates
|
||||
heatmaps, direction plots, and layer-wise analysis charts.
|
||||
|
||||
---
|
||||
|
||||
## Running Analysis
|
||||
|
||||
### Via CLI
|
||||
```bash
|
||||
# Run analysis from a YAML config
|
||||
obliteratus run analysis-study.yaml --preset quick
|
||||
|
||||
# Available study presets:
|
||||
# quick — Fast sanity check (2-3 modules)
|
||||
# full — All core + geometric analysis
|
||||
# jailbreak — Refusal circuit localization
|
||||
# knowledge — Knowledge preservation analysis
|
||||
# robustness — Stress testing / defense evaluation
|
||||
```
|
||||
|
||||
### Via YAML Config
|
||||
See the `templates/analysis-study.yaml` template for a complete example.
|
||||
Load with: `skill_view(name="obliteratus", file_path="templates/analysis-study.yaml")`
|
||||
141
skills/mlops/inference/obliteratus/references/methods-guide.md
Normal file
141
skills/mlops/inference/obliteratus/references/methods-guide.md
Normal file
@@ -0,0 +1,141 @@
|
||||
# OBLITERATUS Methods — Detailed Guide
|
||||
|
||||
> The CLI accepts 9 methods via `--method`: basic, advanced, aggressive, spectral_cascade,
|
||||
> informed, surgical, optimized, inverted, nuclear.
|
||||
> Four additional methods (failspy, gabliteration, heretic, rdo) are available only via the Python API.
|
||||
|
||||
## How Abliteration Works (Theory)
|
||||
|
||||
Abliteration identifies a "refusal direction" — a vector in the model's activation space that
|
||||
corresponds to refusal behavior — and projects it out of the weight matrices.
|
||||
|
||||
Mathematically: `W_new = W_old - (W_old @ d @ d.T)` where `d` is the refusal direction.
|
||||
|
||||
The key challenge is finding accurate refusal directions without damaging other capabilities.
|
||||
|
||||
---
|
||||
|
||||
## Direction Extraction Methods
|
||||
|
||||
Before projecting, OBLITERATUS extracts refusal directions using one of three methods:
|
||||
|
||||
| Method | Flag | Description | Best For |
|
||||
|:-------|:-----|:------------|:---------|
|
||||
| Diff-in-Means | `--direction-method diff_means` | Difference between mean activations on refused vs. complied prompts | Default, fast, robust |
|
||||
| SVD | `--direction-method svd` | Multi-direction extraction via Singular Value Decomposition | Complex alignment, multiple refusal mechanisms |
|
||||
| LEACE | `--direction-method leace` | Linear Erasure via Closed-form Estimation — mathematically optimal | Maximum precision, research |
|
||||
|
||||
---
|
||||
|
||||
## Method Details
|
||||
|
||||
### basic
|
||||
- **Directions:** 1 (single diff-in-means vector)
|
||||
- **Speed:** Fast (~5-10 min for 8B model)
|
||||
- **Risk:** Low
|
||||
- **Use case:** Quick tests, prototyping, evaluating if abliteration works for a model
|
||||
- **How it works:** Extracts one refusal direction and projects it out uniformly across all layers.
|
||||
|
||||
### advanced (DEFAULT — RECOMMENDED)
|
||||
- **Directions:** 4 (multi-direction SVD)
|
||||
- **Speed:** Medium (~10-20 min for 8B model)
|
||||
- **Risk:** Low-Medium
|
||||
- **Refinement passes:** 2
|
||||
- **Use case:** Default for most models. Well-tested and reliable.
|
||||
- **How it works:** Extracts multiple refusal directions via SVD, applies norm-preserving bi-projection to maintain weight matrix norms. Two refinement passes catch residual refusal.
|
||||
|
||||
### aggressive
|
||||
- **Directions:** 8+ (whitened SVD + jailbreak-contrastive)
|
||||
- **Speed:** Medium-Slow
|
||||
- **Risk:** Medium-High (may damage coherence)
|
||||
- **Use case:** When `advanced` leaves > 10% refusals. Stubborn models.
|
||||
- **How it works:** Uses whitened SVD for covariance-normalized extraction, adds jailbreak-contrastive directions, performs attention head surgery on the most refusal-active heads.
|
||||
|
||||
### spectral_cascade
|
||||
- **Speed:** Medium
|
||||
- **Risk:** Medium
|
||||
- **Use case:** Research, novel approaches
|
||||
- **How it works:** DCT (Discrete Cosine Transform) frequency-domain decomposition of refusal signals. Separates high-frequency (surface-level) from low-frequency (deep) refusal patterns.
|
||||
|
||||
### informed (EXPERIMENTAL)
|
||||
- **Speed:** Slow (~20-40 min for 8B model)
|
||||
- **Risk:** Variable — results depend on analysis quality
|
||||
- **Use case:** When you want auto-configuration, but be aware this is experimental and may not outperform `advanced`.
|
||||
- **How it works:** Runs 4 analysis modules first (alignment imprint, concept geometry, logit lens, ouroboros detection), then auto-configures extraction strategy. Includes an "Ouroboros loop" that detects and counteracts self-repair.
|
||||
- **Note:** The auto-detection can sometimes misconfigure. If results are poor, fall back to `advanced`.
|
||||
|
||||
### surgical
|
||||
- **Speed:** Very slow (~1-2 hrs for 8B model)
|
||||
- **Risk:** Low (very precise)
|
||||
- **Use case:** Reasoning models (R1 distills, QwQ, etc.) where chain-of-thought must be preserved.
|
||||
- **How it works:** Uses SAE (Sparse Autoencoder) features + individual neuron masking + attention head surgery + per-expert decomposition (for MoE). CoT-aware — identifies and protects reasoning-critical directions before projecting.
|
||||
|
||||
### optimized
|
||||
- **Speed:** Very slow (hours — runs many trials)
|
||||
- **Risk:** Low (finds optimal parameters)
|
||||
- **Use case:** When quality matters more than speed. Production models.
|
||||
- **How it works:** Bayesian hyperparameter search via Optuna TPE sampler. Optimizes n_directions, regularization, refinement passes, and layer selection jointly. Evaluates each configuration on refusal rate + perplexity.
|
||||
|
||||
### inverted
|
||||
- **Speed:** Fast
|
||||
- **Risk:** High (model behavior changes dramatically)
|
||||
- **Use case:** Research, studying refusal mechanisms
|
||||
- **How it works:** Instead of projecting out the refusal direction, reflects it. The model actively complies rather than passively not-refusing. Useful for understanding the geometry of alignment.
|
||||
|
||||
### nuclear
|
||||
- **Speed:** Slow
|
||||
- **Risk:** Medium-High
|
||||
- **Use case:** Stubborn MoE models (DeepSeek-MoE, Mixtral, etc.)
|
||||
- **How it works:** Combines expert-granular abliteration (EGA), steering vector injection, attention head pruning, and multi-pass refinement. Decomposes refusal signals into per-expert components for MoE architectures.
|
||||
|
||||
---
|
||||
|
||||
## Method Selection Flowchart
|
||||
|
||||
```
|
||||
Is this a quick test?
|
||||
→ YES: basic
|
||||
→ NO: continue
|
||||
|
||||
Is it an MoE model (Mixtral, DeepSeek-MoE)?
|
||||
→ YES: nuclear
|
||||
→ NO: continue
|
||||
|
||||
Is it a reasoning model (R1, QwQ, CoT-focused)?
|
||||
→ YES: surgical
|
||||
→ NO: continue
|
||||
|
||||
Do you need the absolute best quality and have time?
|
||||
→ YES: optimized
|
||||
→ NO: advanced (recommended default)
|
||||
|
||||
Did advanced leave > 10% refusals?
|
||||
→ YES: aggressive
|
||||
→ Still refusing: nuclear
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Key Parameters
|
||||
|
||||
| Parameter | Range | Default | Effect |
|
||||
|:----------|:------|:--------|:-------|
|
||||
| `--n-directions` | 1-32 | method-dependent | More directions = more complete removal, but higher damage risk |
|
||||
| `--regularization` | 0.0-1.0 | 0.1 | Higher = more conservative (less removal, less damage) |
|
||||
| `--refinement-passes` | 1-5 | 2 | More passes catch residual refusal, but diminishing returns |
|
||||
| `--quantization` | 4bit, 8bit | none | Reduces VRAM usage; quality impact minimal for extraction |
|
||||
| `--verify-sample-size` | 10-200 | 20 | More samples = more accurate refusal rate estimate |
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Problem | Likely Cause | Fix |
|
||||
|:--------|:-------------|:----|
|
||||
| Refusal rate > 20% | Too few directions | Increase `--n-directions`, try `aggressive` |
|
||||
| Refusal rate 5-20% | Residual refusal | Add `--refinement-passes 3`, try `--direction-method svd` |
|
||||
| Perplexity spike > 20% | Over-aggressive removal | Reduce `--n-directions`, increase `--regularization` |
|
||||
| Repetitive output | Weight matrix damage | Use `basic` with fewer directions, check norm preservation |
|
||||
| MoE model still refuses | Non-expert-aware method | Switch to `nuclear` |
|
||||
| Reasoning degraded | CoT directions damaged | Use `surgical` method |
|
||||
| OOM during extraction | Insufficient VRAM | Add `--quantization 4bit` and/or `--large-model` |
|
||||
@@ -0,0 +1,33 @@
|
||||
# OBLITERATUS Abliteration Config
|
||||
# Usage: obliteratus run this-file.yaml
|
||||
#
|
||||
# This is for reproducible, version-controlled abliteration runs.
|
||||
# For one-off usage, the CLI flags are simpler.
|
||||
|
||||
# Model to abliterate
|
||||
model:
|
||||
name: "meta-llama/Llama-3.1-8B-Instruct"
|
||||
dtype: "bfloat16" # float16, bfloat16, float32
|
||||
quantization: null # null, "4bit", "8bit"
|
||||
device: "auto" # auto, cuda, cuda:0, cpu
|
||||
|
||||
# Abliteration method and parameters
|
||||
abliteration:
|
||||
method: "informed" # See SKILL.md Step 4 for all 13 methods
|
||||
n_directions: null # null = auto-detect, or integer (e.g., 8)
|
||||
regularization: 0.0 # 0.0-1.0, fraction of original to preserve
|
||||
refinement_passes: 1 # Iterative passes (increase for self-repair)
|
||||
norm_preserve: true # Keep weight norms intact after projection
|
||||
|
||||
# Output
|
||||
output:
|
||||
directory: "./abliterated-models"
|
||||
save_metadata: true # Save abliteration_metadata.json alongside model
|
||||
contribute: false # Save community contribution data
|
||||
|
||||
# Verification
|
||||
verify:
|
||||
enabled: true
|
||||
test_prompts: null # null = use built-in test prompts
|
||||
compute_perplexity: true
|
||||
compute_kl: true
|
||||
@@ -0,0 +1,40 @@
|
||||
# OBLITERATUS Analysis Study Config
|
||||
# Usage: obliteratus run this-file.yaml --preset jailbreak
|
||||
#
|
||||
# Run analysis modules to understand refusal geometry BEFORE abliterating.
|
||||
# Useful for research or when you want to understand what you're removing.
|
||||
|
||||
# Model to analyze
|
||||
model:
|
||||
name: "meta-llama/Llama-3.1-8B-Instruct"
|
||||
dtype: "bfloat16"
|
||||
quantization: "4bit" # Saves VRAM for analysis
|
||||
device: "auto"
|
||||
|
||||
# Study configuration
|
||||
study:
|
||||
# Available presets: quick, full, attention, jailbreak, guardrail, knowledge
|
||||
preset: "jailbreak"
|
||||
|
||||
# Or specify individual strategies:
|
||||
# strategies:
|
||||
# - layer_removal
|
||||
# - head_pruning
|
||||
# - ffn_ablation
|
||||
# - embedding_ablation
|
||||
|
||||
# Analysis modules to run (subset of the 27 available)
|
||||
analysis:
|
||||
- alignment_imprint # Detect DPO/RLHF/CAI/SFT training method
|
||||
- concept_geometry # Map refusal cone geometry
|
||||
- logit_lens # Find which layer decides to refuse
|
||||
- anti_ouroboros # Detect self-repair tendency
|
||||
- cross_layer # Cross-layer alignment clustering
|
||||
- causal_tracing # Causal necessity of components
|
||||
- residual_stream # Attention vs MLP contribution
|
||||
|
||||
# Output
|
||||
output:
|
||||
directory: "./analysis-results"
|
||||
save_plots: true # Generate matplotlib visualizations
|
||||
save_report: true # Generate markdown report
|
||||
@@ -0,0 +1,41 @@
|
||||
# OBLITERATUS Batch Abliteration Config
|
||||
# Abliterate multiple models with the same method for comparison.
|
||||
#
|
||||
# Run each one sequentially:
|
||||
# for model in models; do obliteratus obliterate $model --method informed; done
|
||||
#
|
||||
# Or use this as a reference for which models to process.
|
||||
|
||||
# Common settings
|
||||
defaults:
|
||||
method: "informed"
|
||||
quantization: "4bit"
|
||||
output_dir: "./abliterated-models"
|
||||
|
||||
# Models to process (grouped by compute tier)
|
||||
models:
|
||||
# Small (4-8 GB VRAM)
|
||||
small:
|
||||
- "Qwen/Qwen2.5-1.5B-Instruct"
|
||||
- "microsoft/Phi-3.5-mini-instruct"
|
||||
- "meta-llama/Llama-3.2-3B-Instruct"
|
||||
|
||||
# Medium (8-16 GB VRAM)
|
||||
medium:
|
||||
- "meta-llama/Llama-3.1-8B-Instruct"
|
||||
- "mistralai/Mistral-7B-Instruct-v0.3"
|
||||
- "google/gemma-2-9b-it"
|
||||
- "Qwen/Qwen2.5-7B-Instruct"
|
||||
|
||||
# Large (24 GB VRAM, 4-bit quantization)
|
||||
large:
|
||||
- "Qwen/Qwen2.5-14B-Instruct"
|
||||
- "Qwen/Qwen3-32B"
|
||||
- "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
|
||||
|
||||
# Per-model method overrides (optional)
|
||||
overrides:
|
||||
"deepseek-ai/DeepSeek-R1-Distill-Qwen-32B":
|
||||
method: "surgical" # CoT-aware for reasoning models
|
||||
"mistralai/Mixtral-8x7B-Instruct-v0.1":
|
||||
method: "nuclear" # Expert-granular for MoE models
|
||||
372
skills/mlops/inference/vllm/SKILL.md
Normal file
372
skills/mlops/inference/vllm/SKILL.md
Normal file
@@ -0,0 +1,372 @@
|
||||
---
|
||||
name: serving-llms-vllm
|
||||
description: "vLLM: high-throughput LLM serving, OpenAI API, quantization."
|
||||
version: 1.0.0
|
||||
author: Orchestra Research
|
||||
license: MIT
|
||||
dependencies: [vllm, torch, transformers]
|
||||
platforms: [linux, macos]
|
||||
metadata:
|
||||
hermes:
|
||||
tags: [vLLM, Inference Serving, PagedAttention, Continuous Batching, High Throughput, Production, OpenAI API, Quantization, Tensor Parallelism]
|
||||
|
||||
---
|
||||
|
||||
# vLLM - High-Performance LLM Serving
|
||||
|
||||
## When to use
|
||||
|
||||
Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
|
||||
|
||||
## Quick start
|
||||
|
||||
vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install vllm
|
||||
```
|
||||
|
||||
**Basic offline inference**:
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
|
||||
sampling = SamplingParams(temperature=0.7, max_tokens=256)
|
||||
|
||||
outputs = llm.generate(["Explain quantum computing"], sampling)
|
||||
print(outputs[0].outputs[0].text)
|
||||
```
|
||||
|
||||
**OpenAI-compatible server**:
|
||||
```bash
|
||||
vllm serve meta-llama/Llama-3-8B-Instruct
|
||||
|
||||
# Query with OpenAI SDK
|
||||
python -c "
|
||||
from openai import OpenAI
|
||||
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
|
||||
print(client.chat.completions.create(
|
||||
model='meta-llama/Llama-3-8B-Instruct',
|
||||
messages=[{'role': 'user', 'content': 'Hello!'}]
|
||||
).choices[0].message.content)
|
||||
"
|
||||
```
|
||||
|
||||
## Common workflows
|
||||
|
||||
### Workflow 1: Production API deployment
|
||||
|
||||
Copy this checklist and track progress:
|
||||
|
||||
```
|
||||
Deployment Progress:
|
||||
- [ ] Step 1: Configure server settings
|
||||
- [ ] Step 2: Test with limited traffic
|
||||
- [ ] Step 3: Enable monitoring
|
||||
- [ ] Step 4: Deploy to production
|
||||
- [ ] Step 5: Verify performance metrics
|
||||
```
|
||||
|
||||
**Step 1: Configure server settings**
|
||||
|
||||
Choose configuration based on your model size:
|
||||
|
||||
```bash
|
||||
# For 7B-13B models on single GPU
|
||||
vllm serve meta-llama/Llama-3-8B-Instruct \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--max-model-len 8192 \
|
||||
--port 8000
|
||||
|
||||
# For 30B-70B models with tensor parallelism
|
||||
vllm serve meta-llama/Llama-2-70b-hf \
|
||||
--tensor-parallel-size 4 \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--quantization awq \
|
||||
--port 8000
|
||||
|
||||
# For production with caching and metrics
|
||||
vllm serve meta-llama/Llama-3-8B-Instruct \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--enable-prefix-caching \
|
||||
--enable-metrics \
|
||||
--metrics-port 9090 \
|
||||
--port 8000 \
|
||||
--host 0.0.0.0
|
||||
```
|
||||
|
||||
**Step 2: Test with limited traffic**
|
||||
|
||||
Run load test before production:
|
||||
|
||||
```bash
|
||||
# Install load testing tool
|
||||
pip install locust
|
||||
|
||||
# Create test_load.py with sample requests
|
||||
# Run: locust -f test_load.py --host http://localhost:8000
|
||||
```
|
||||
|
||||
Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
|
||||
|
||||
**Step 3: Enable monitoring**
|
||||
|
||||
vLLM exposes Prometheus metrics on port 9090:
|
||||
|
||||
```bash
|
||||
curl http://localhost:9090/metrics | grep vllm
|
||||
```
|
||||
|
||||
Key metrics to monitor:
|
||||
- `vllm:time_to_first_token_seconds` - Latency
|
||||
- `vllm:num_requests_running` - Active requests
|
||||
- `vllm:gpu_cache_usage_perc` - KV cache utilization
|
||||
|
||||
**Step 4: Deploy to production**
|
||||
|
||||
Use Docker for consistent deployment:
|
||||
|
||||
```bash
|
||||
# Run vLLM in Docker
|
||||
docker run --gpus all -p 8000:8000 \
|
||||
vllm/vllm-openai:latest \
|
||||
--model meta-llama/Llama-3-8B-Instruct \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--enable-prefix-caching
|
||||
```
|
||||
|
||||
**Step 5: Verify performance metrics**
|
||||
|
||||
Check that deployment meets targets:
|
||||
- TTFT < 500ms (for short prompts)
|
||||
- Throughput > target req/sec
|
||||
- GPU utilization > 80%
|
||||
- No OOM errors in logs
|
||||
|
||||
### Workflow 2: Offline batch inference
|
||||
|
||||
For processing large datasets without server overhead.
|
||||
|
||||
Copy this checklist:
|
||||
|
||||
```
|
||||
Batch Processing:
|
||||
- [ ] Step 1: Prepare input data
|
||||
- [ ] Step 2: Configure LLM engine
|
||||
- [ ] Step 3: Run batch inference
|
||||
- [ ] Step 4: Process results
|
||||
```
|
||||
|
||||
**Step 1: Prepare input data**
|
||||
|
||||
```python
|
||||
# Load prompts from file
|
||||
prompts = []
|
||||
with open("prompts.txt") as f:
|
||||
prompts = [line.strip() for line in f]
|
||||
|
||||
print(f"Loaded {len(prompts)} prompts")
|
||||
```
|
||||
|
||||
**Step 2: Configure LLM engine**
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
llm = LLM(
|
||||
model="meta-llama/Llama-3-8B-Instruct",
|
||||
tensor_parallel_size=2, # Use 2 GPUs
|
||||
gpu_memory_utilization=0.9,
|
||||
max_model_len=4096
|
||||
)
|
||||
|
||||
sampling = SamplingParams(
|
||||
temperature=0.7,
|
||||
top_p=0.95,
|
||||
max_tokens=512,
|
||||
stop=["</s>", "\n\n"]
|
||||
)
|
||||
```
|
||||
|
||||
**Step 3: Run batch inference**
|
||||
|
||||
vLLM automatically batches requests for efficiency:
|
||||
|
||||
```python
|
||||
# Process all prompts in one call
|
||||
outputs = llm.generate(prompts, sampling)
|
||||
|
||||
# vLLM handles batching internally
|
||||
# No need to manually chunk prompts
|
||||
```
|
||||
|
||||
**Step 4: Process results**
|
||||
|
||||
```python
|
||||
# Extract generated text
|
||||
results = []
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated = output.outputs[0].text
|
||||
results.append({
|
||||
"prompt": prompt,
|
||||
"generated": generated,
|
||||
"tokens": len(output.outputs[0].token_ids)
|
||||
})
|
||||
|
||||
# Save to file
|
||||
import json
|
||||
with open("results.jsonl", "w") as f:
|
||||
for result in results:
|
||||
f.write(json.dumps(result) + "\n")
|
||||
|
||||
print(f"Processed {len(results)} prompts")
|
||||
```
|
||||
|
||||
### Workflow 3: Quantized model serving
|
||||
|
||||
Fit large models in limited GPU memory.
|
||||
|
||||
```
|
||||
Quantization Setup:
|
||||
- [ ] Step 1: Choose quantization method
|
||||
- [ ] Step 2: Find or create quantized model
|
||||
- [ ] Step 3: Launch with quantization flag
|
||||
- [ ] Step 4: Verify accuracy
|
||||
```
|
||||
|
||||
**Step 1: Choose quantization method**
|
||||
|
||||
- **AWQ**: Best for 70B models, minimal accuracy loss
|
||||
- **GPTQ**: Wide model support, good compression
|
||||
- **FP8**: Fastest on H100 GPUs
|
||||
|
||||
**Step 2: Find or create quantized model**
|
||||
|
||||
Use pre-quantized models from HuggingFace:
|
||||
|
||||
```bash
|
||||
# Search for AWQ models
|
||||
# Example: TheBloke/Llama-2-70B-AWQ
|
||||
```
|
||||
|
||||
**Step 3: Launch with quantization flag**
|
||||
|
||||
```bash
|
||||
# Using pre-quantized model
|
||||
vllm serve TheBloke/Llama-2-70B-AWQ \
|
||||
--quantization awq \
|
||||
--tensor-parallel-size 1 \
|
||||
--gpu-memory-utilization 0.95
|
||||
|
||||
# Results: 70B model in ~40GB VRAM
|
||||
```
|
||||
|
||||
**Step 4: Verify accuracy**
|
||||
|
||||
Test outputs match expected quality:
|
||||
|
||||
```python
|
||||
# Compare quantized vs non-quantized responses
|
||||
# Verify task-specific performance unchanged
|
||||
```
|
||||
|
||||
## When to use vs alternatives
|
||||
|
||||
**Use vLLM when:**
|
||||
- Deploying production LLM APIs (100+ req/sec)
|
||||
- Serving OpenAI-compatible endpoints
|
||||
- Limited GPU memory but need large models
|
||||
- Multi-user applications (chatbots, assistants)
|
||||
- Need low latency with high throughput
|
||||
|
||||
**Use alternatives instead:**
|
||||
- **llama.cpp**: CPU/edge inference, single-user
|
||||
- **HuggingFace transformers**: Research, prototyping, one-off generation
|
||||
- **TensorRT-LLM**: NVIDIA-only, need absolute maximum performance
|
||||
- **Text-Generation-Inference**: Already in HuggingFace ecosystem
|
||||
|
||||
## Common issues
|
||||
|
||||
**Issue: Out of memory during model loading**
|
||||
|
||||
Reduce memory usage:
|
||||
```bash
|
||||
vllm serve MODEL \
|
||||
--gpu-memory-utilization 0.7 \
|
||||
--max-model-len 4096
|
||||
```
|
||||
|
||||
Or use quantization:
|
||||
```bash
|
||||
vllm serve MODEL --quantization awq
|
||||
```
|
||||
|
||||
**Issue: Slow first token (TTFT > 1 second)**
|
||||
|
||||
Enable prefix caching for repeated prompts:
|
||||
```bash
|
||||
vllm serve MODEL --enable-prefix-caching
|
||||
```
|
||||
|
||||
For long prompts, enable chunked prefill:
|
||||
```bash
|
||||
vllm serve MODEL --enable-chunked-prefill
|
||||
```
|
||||
|
||||
**Issue: Model not found error**
|
||||
|
||||
Use `--trust-remote-code` for custom models:
|
||||
```bash
|
||||
vllm serve MODEL --trust-remote-code
|
||||
```
|
||||
|
||||
**Issue: Low throughput (<50 req/sec)**
|
||||
|
||||
Increase concurrent sequences:
|
||||
```bash
|
||||
vllm serve MODEL --max-num-seqs 512
|
||||
```
|
||||
|
||||
Check GPU utilization with `nvidia-smi` - should be >80%.
|
||||
|
||||
**Issue: Inference slower than expected**
|
||||
|
||||
Verify tensor parallelism uses power of 2 GPUs:
|
||||
```bash
|
||||
vllm serve MODEL --tensor-parallel-size 4 # Not 3
|
||||
```
|
||||
|
||||
Enable speculative decoding for faster generation:
|
||||
```bash
|
||||
vllm serve MODEL --speculative-model DRAFT_MODEL
|
||||
```
|
||||
|
||||
## Advanced topics
|
||||
|
||||
**Server deployment patterns**: See [references/server-deployment.md](references/server-deployment.md) for Docker, Kubernetes, and load balancing configurations.
|
||||
|
||||
**Performance optimization**: See [references/optimization.md](references/optimization.md) for PagedAttention tuning, continuous batching details, and benchmark results.
|
||||
|
||||
**Quantization guide**: See [references/quantization.md](references/quantization.md) for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.
|
||||
|
||||
**Troubleshooting**: See [references/troubleshooting.md](references/troubleshooting.md) for detailed error messages, debugging steps, and performance diagnostics.
|
||||
|
||||
## Hardware requirements
|
||||
|
||||
- **Small models (7B-13B)**: 1x A10 (24GB) or A100 (40GB)
|
||||
- **Medium models (30B-40B)**: 2x A100 (40GB) with tensor parallelism
|
||||
- **Large models (70B+)**: 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ
|
||||
|
||||
Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs
|
||||
|
||||
## Resources
|
||||
|
||||
- Official docs: https://docs.vllm.ai
|
||||
- GitHub: https://github.com/vllm-project/vllm
|
||||
- Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
|
||||
- Community: https://discuss.vllm.ai
|
||||
|
||||
|
||||
|
||||
226
skills/mlops/inference/vllm/references/optimization.md
Normal file
226
skills/mlops/inference/vllm/references/optimization.md
Normal file
@@ -0,0 +1,226 @@
|
||||
# Performance Optimization
|
||||
|
||||
## Contents
|
||||
- PagedAttention explained
|
||||
- Continuous batching mechanics
|
||||
- Prefix caching strategies
|
||||
- Speculative decoding setup
|
||||
- Benchmark results and comparisons
|
||||
- Performance tuning guide
|
||||
|
||||
## PagedAttention explained
|
||||
|
||||
**Traditional attention problem**:
|
||||
- KV cache stored in contiguous memory
|
||||
- Wastes ~50% GPU memory due to fragmentation
|
||||
- Cannot dynamically reallocate for varying sequence lengths
|
||||
|
||||
**PagedAttention solution**:
|
||||
- Divides KV cache into fixed-size blocks (like OS virtual memory)
|
||||
- Dynamic allocation from free block queue
|
||||
- Shares blocks across sequences (for prefix caching)
|
||||
|
||||
**Memory savings example**:
|
||||
```
|
||||
Traditional: 70B model needs 160GB KV cache → OOM on 8x A100
|
||||
PagedAttention: 70B model needs 80GB KV cache → Fits on 4x A100
|
||||
```
|
||||
|
||||
**Configuration**:
|
||||
```bash
|
||||
# Block size (default: 16 tokens)
|
||||
vllm serve MODEL --block-size 16
|
||||
|
||||
# Number of GPU blocks (auto-calculated)
|
||||
# Controlled by --gpu-memory-utilization
|
||||
vllm serve MODEL --gpu-memory-utilization 0.9
|
||||
```
|
||||
|
||||
## Continuous batching mechanics
|
||||
|
||||
**Traditional batching**:
|
||||
- Wait for all sequences in batch to finish
|
||||
- GPU idle while waiting for longest sequence
|
||||
- Low GPU utilization (~40-60%)
|
||||
|
||||
**Continuous batching**:
|
||||
- Add new requests as slots become available
|
||||
- Mix prefill (new requests) and decode (ongoing) in same batch
|
||||
- High GPU utilization (>90%)
|
||||
|
||||
**Throughput improvement**:
|
||||
```
|
||||
Traditional batching: 50 req/sec @ 50% GPU util
|
||||
Continuous batching: 200 req/sec @ 90% GPU util
|
||||
= 4x throughput improvement
|
||||
```
|
||||
|
||||
**Tuning parameters**:
|
||||
```bash
|
||||
# Max concurrent sequences (higher = more batching)
|
||||
vllm serve MODEL --max-num-seqs 256
|
||||
|
||||
# Prefill/decode schedule (auto-balanced by default)
|
||||
# No manual tuning needed
|
||||
```
|
||||
|
||||
## Prefix caching strategies
|
||||
|
||||
Reuse computed KV cache for common prompt prefixes.
|
||||
|
||||
**Use cases**:
|
||||
- System prompts repeated across requests
|
||||
- Few-shot examples in every prompt
|
||||
- RAG contexts with overlapping chunks
|
||||
|
||||
**Example savings**:
|
||||
```
|
||||
Prompt: [System: 500 tokens] + [User: 100 tokens]
|
||||
|
||||
Without caching: Compute 600 tokens every request
|
||||
With caching: Compute 500 tokens once, then 100 tokens/request
|
||||
= 83% faster TTFT
|
||||
```
|
||||
|
||||
**Enable prefix caching**:
|
||||
```bash
|
||||
vllm serve MODEL --enable-prefix-caching
|
||||
```
|
||||
|
||||
**Automatic prefix detection**:
|
||||
- vLLM detects common prefixes automatically
|
||||
- No code changes required
|
||||
- Works with OpenAI-compatible API
|
||||
|
||||
**Cache hit rate monitoring**:
|
||||
```bash
|
||||
curl http://localhost:9090/metrics | grep cache_hit
|
||||
# vllm_cache_hit_rate: 0.75 (75% hit rate)
|
||||
```
|
||||
|
||||
## Speculative decoding setup
|
||||
|
||||
Use smaller "draft" model to propose tokens, larger model to verify.
|
||||
|
||||
**Speed improvement**:
|
||||
```
|
||||
Standard: Generate 1 token per forward pass
|
||||
Speculative: Generate 3-5 tokens per forward pass
|
||||
= 2-3x faster generation
|
||||
```
|
||||
|
||||
**How it works**:
|
||||
1. Draft model proposes K tokens (fast)
|
||||
2. Target model verifies all K tokens in parallel (one pass)
|
||||
3. Accept verified tokens, restart from first rejection
|
||||
|
||||
**Setup with separate draft model**:
|
||||
```bash
|
||||
vllm serve meta-llama/Llama-3-70B-Instruct \
|
||||
--speculative-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
|
||||
--num-speculative-tokens 5
|
||||
```
|
||||
|
||||
**Setup with n-gram draft** (no separate model):
|
||||
```bash
|
||||
vllm serve MODEL \
|
||||
--speculative-method ngram \
|
||||
--num-speculative-tokens 3
|
||||
```
|
||||
|
||||
**When to use**:
|
||||
- Output length > 100 tokens
|
||||
- Draft model 5-10x smaller than target
|
||||
- Acceptable 2-3% accuracy trade-off
|
||||
|
||||
## Benchmark results
|
||||
|
||||
**vLLM vs HuggingFace Transformers** (Llama 3 8B, A100):
|
||||
```
|
||||
Metric | HF Transformers | vLLM | Improvement
|
||||
------------------------|-----------------|--------|------------
|
||||
Throughput (req/sec) | 12 | 280 | 23x
|
||||
TTFT (ms) | 850 | 120 | 7x
|
||||
Tokens/sec | 45 | 2,100 | 47x
|
||||
GPU Memory (GB) | 28 | 16 | 1.75x less
|
||||
```
|
||||
|
||||
**vLLM vs TensorRT-LLM** (Llama 2 70B, 4x A100):
|
||||
```
|
||||
Metric | TensorRT-LLM | vLLM | Notes
|
||||
------------------------|--------------|--------|------------------
|
||||
Throughput (req/sec) | 320 | 285 | TRT 12% faster
|
||||
Setup complexity | High | Low | vLLM much easier
|
||||
NVIDIA-only | Yes | No | vLLM multi-platform
|
||||
Quantization support | FP8, INT8 | AWQ/GPTQ/FP8 | vLLM more options
|
||||
```
|
||||
|
||||
## Performance tuning guide
|
||||
|
||||
**Step 1: Measure baseline**
|
||||
|
||||
```bash
|
||||
# Install benchmarking tool
|
||||
pip install locust
|
||||
|
||||
# Run baseline benchmark
|
||||
vllm bench throughput \
|
||||
--model MODEL \
|
||||
--input-tokens 128 \
|
||||
--output-tokens 256 \
|
||||
--num-prompts 1000
|
||||
|
||||
# Record: throughput, TTFT, tokens/sec
|
||||
```
|
||||
|
||||
**Step 2: Tune memory utilization**
|
||||
|
||||
```bash
|
||||
# Try different values: 0.7, 0.85, 0.9, 0.95
|
||||
vllm serve MODEL --gpu-memory-utilization 0.9
|
||||
```
|
||||
|
||||
Higher = more batch capacity = higher throughput, but risk OOM.
|
||||
|
||||
**Step 3: Tune concurrency**
|
||||
|
||||
```bash
|
||||
# Try values: 128, 256, 512, 1024
|
||||
vllm serve MODEL --max-num-seqs 256
|
||||
```
|
||||
|
||||
Higher = more batching opportunity, but may increase latency.
|
||||
|
||||
**Step 4: Enable optimizations**
|
||||
|
||||
```bash
|
||||
vllm serve MODEL \
|
||||
--enable-prefix-caching \ # For repeated prompts
|
||||
--enable-chunked-prefill \ # For long prompts
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--max-num-seqs 512
|
||||
```
|
||||
|
||||
**Step 5: Re-benchmark and compare**
|
||||
|
||||
Target improvements:
|
||||
- Throughput: +30-100%
|
||||
- TTFT: -20-50%
|
||||
- GPU utilization: >85%
|
||||
|
||||
**Common performance issues**:
|
||||
|
||||
**Low throughput (<50 req/sec)**:
|
||||
- Increase `--max-num-seqs`
|
||||
- Enable `--enable-prefix-caching`
|
||||
- Check GPU utilization (should be >80%)
|
||||
|
||||
**High TTFT (>1 second)**:
|
||||
- Enable `--enable-chunked-prefill`
|
||||
- Reduce `--max-model-len` if possible
|
||||
- Check if model is too large for GPU
|
||||
|
||||
**OOM errors**:
|
||||
- Reduce `--gpu-memory-utilization` to 0.7
|
||||
- Reduce `--max-model-len`
|
||||
- Use quantization (`--quantization awq`)
|
||||
284
skills/mlops/inference/vllm/references/quantization.md
Normal file
284
skills/mlops/inference/vllm/references/quantization.md
Normal file
@@ -0,0 +1,284 @@
|
||||
# Quantization Guide
|
||||
|
||||
## Contents
|
||||
- Quantization methods comparison
|
||||
- AWQ setup and usage
|
||||
- GPTQ setup and usage
|
||||
- FP8 quantization (H100)
|
||||
- Model preparation
|
||||
- Accuracy vs compression trade-offs
|
||||
|
||||
## Quantization methods comparison
|
||||
|
||||
| Method | Compression | Accuracy Loss | Speed | Best For |
|
||||
|--------|-------------|---------------|-------|----------|
|
||||
| **AWQ** | 4-bit (75%) | <1% | Fast | 70B models, production |
|
||||
| **GPTQ** | 4-bit (75%) | 1-2% | Fast | Wide model support |
|
||||
| **FP8** | 8-bit (50%) | <0.5% | Fastest | H100 GPUs only |
|
||||
| **SqueezeLLM** | 3-4 bit (75-80%) | 2-3% | Medium | Extreme compression |
|
||||
|
||||
**Recommendation**:
|
||||
- **Production**: Use AWQ for 70B models
|
||||
- **H100 GPUs**: Use FP8 for best speed
|
||||
- **Maximum compatibility**: Use GPTQ
|
||||
- **Extreme compression**: Use SqueezeLLM
|
||||
|
||||
## AWQ setup and usage
|
||||
|
||||
**AWQ** (Activation-aware Weight Quantization) achieves best accuracy at 4-bit.
|
||||
|
||||
**Step 1: Find pre-quantized model**
|
||||
|
||||
Search HuggingFace for AWQ models:
|
||||
```bash
|
||||
# Example: TheBloke/Llama-2-70B-AWQ
|
||||
# Example: TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
|
||||
```
|
||||
|
||||
**Step 2: Launch with AWQ**
|
||||
|
||||
```bash
|
||||
vllm serve TheBloke/Llama-2-70B-AWQ \
|
||||
--quantization awq \
|
||||
--tensor-parallel-size 1 \
|
||||
--gpu-memory-utilization 0.95
|
||||
```
|
||||
|
||||
**Memory savings**:
|
||||
```
|
||||
Llama 2 70B fp16: 140GB VRAM (4x A100 needed)
|
||||
Llama 2 70B AWQ: 35GB VRAM (1x A100 40GB)
|
||||
= 4x memory reduction
|
||||
```
|
||||
|
||||
**Step 3: Verify performance**
|
||||
|
||||
Test that outputs are acceptable:
|
||||
```python
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
|
||||
|
||||
# Test complex reasoning
|
||||
response = client.chat.completions.create(
|
||||
model="TheBloke/Llama-2-70B-AWQ",
|
||||
messages=[{"role": "user", "content": "Explain quantum entanglement"}]
|
||||
)
|
||||
|
||||
print(response.choices[0].message.content)
|
||||
# Verify quality matches your requirements
|
||||
```
|
||||
|
||||
**Quantize your own model** (requires GPU with 80GB+ VRAM):
|
||||
|
||||
```python
|
||||
from awq import AutoAWQForCausalLM
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
model_path = "meta-llama/Llama-2-70b-hf"
|
||||
quant_path = "llama-2-70b-awq"
|
||||
|
||||
# Load model
|
||||
model = AutoAWQForCausalLM.from_pretrained(model_path)
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_path)
|
||||
|
||||
# Quantize
|
||||
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}
|
||||
model.quantize(tokenizer, quant_config=quant_config)
|
||||
|
||||
# Save
|
||||
model.save_quantized(quant_path)
|
||||
tokenizer.save_pretrained(quant_path)
|
||||
```
|
||||
|
||||
## GPTQ setup and usage
|
||||
|
||||
**GPTQ** has widest model support and good compression.
|
||||
|
||||
**Step 1: Find GPTQ model**
|
||||
|
||||
```bash
|
||||
# Example: TheBloke/Llama-2-13B-GPTQ
|
||||
# Example: TheBloke/CodeLlama-34B-GPTQ
|
||||
```
|
||||
|
||||
**Step 2: Launch with GPTQ**
|
||||
|
||||
```bash
|
||||
vllm serve TheBloke/Llama-2-13B-GPTQ \
|
||||
--quantization gptq \
|
||||
--dtype float16
|
||||
```
|
||||
|
||||
**GPTQ configuration options**:
|
||||
```bash
|
||||
# Specify GPTQ parameters if needed
|
||||
vllm serve MODEL \
|
||||
--quantization gptq \
|
||||
--gptq-act-order \ # Activation ordering
|
||||
--dtype float16
|
||||
```
|
||||
|
||||
**Quantize your own model**:
|
||||
|
||||
```python
|
||||
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
model_name = "meta-llama/Llama-2-13b-hf"
|
||||
quantized_name = "llama-2-13b-gptq"
|
||||
|
||||
# Load model
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
|
||||
|
||||
# Prepare calibration data
|
||||
calib_data = [...] # List of sample texts
|
||||
|
||||
# Quantize
|
||||
quantize_config = BaseQuantizeConfig(
|
||||
bits=4,
|
||||
group_size=128,
|
||||
desc_act=True
|
||||
)
|
||||
model.quantize(calib_data)
|
||||
|
||||
# Save
|
||||
model.save_quantized(quantized_name)
|
||||
```
|
||||
|
||||
## FP8 quantization (H100)
|
||||
|
||||
**FP8** (8-bit floating point) offers best speed on H100 GPUs with minimal accuracy loss.
|
||||
|
||||
**Requirements**:
|
||||
- H100 or H800 GPU
|
||||
- CUDA 12.3+ (12.8 recommended)
|
||||
- Hopper architecture support
|
||||
|
||||
**Step 1: Enable FP8**
|
||||
|
||||
```bash
|
||||
vllm serve meta-llama/Llama-3-70B-Instruct \
|
||||
--quantization fp8 \
|
||||
--tensor-parallel-size 2
|
||||
```
|
||||
|
||||
**Performance gains on H100**:
|
||||
```
|
||||
fp16: 180 tokens/sec
|
||||
FP8: 320 tokens/sec
|
||||
= 1.8x speedup
|
||||
```
|
||||
|
||||
**Step 2: Verify accuracy**
|
||||
|
||||
FP8 typically has <0.5% accuracy degradation:
|
||||
```python
|
||||
# Run evaluation suite
|
||||
# Compare FP8 vs FP16 on your tasks
|
||||
# Verify acceptable accuracy
|
||||
```
|
||||
|
||||
**Dynamic FP8 quantization** (no pre-quantized model needed):
|
||||
|
||||
```bash
|
||||
# vLLM automatically quantizes at runtime
|
||||
vllm serve MODEL --quantization fp8
|
||||
# No model preparation required
|
||||
```
|
||||
|
||||
## Model preparation
|
||||
|
||||
**Pre-quantized models (easiest)**:
|
||||
|
||||
1. Search HuggingFace: `[model name] AWQ` or `[model name] GPTQ`
|
||||
2. Download or use directly: `TheBloke/[Model]-AWQ`
|
||||
3. Launch with appropriate `--quantization` flag
|
||||
|
||||
**Quantize your own model**:
|
||||
|
||||
**AWQ**:
|
||||
```bash
|
||||
# Install AutoAWQ
|
||||
pip install autoawq
|
||||
|
||||
# Run quantization script
|
||||
python quantize_awq.py --model MODEL --output OUTPUT
|
||||
```
|
||||
|
||||
**GPTQ**:
|
||||
```bash
|
||||
# Install AutoGPTQ
|
||||
pip install auto-gptq
|
||||
|
||||
# Run quantization script
|
||||
python quantize_gptq.py --model MODEL --output OUTPUT
|
||||
```
|
||||
|
||||
**Calibration data**:
|
||||
- Use 128-512 diverse examples from target domain
|
||||
- Representative of production inputs
|
||||
- Higher quality calibration = better accuracy
|
||||
|
||||
## Accuracy vs compression trade-offs
|
||||
|
||||
**Empirical results** (Llama 2 70B on MMLU benchmark):
|
||||
|
||||
| Quantization | Accuracy | Memory | Speed | Production-Ready |
|
||||
|--------------|----------|--------|-------|------------------|
|
||||
| FP16 (baseline) | 100% | 140GB | 1.0x | ✅ (if memory available) |
|
||||
| FP8 | 99.5% | 70GB | 1.8x | ✅ (H100 only) |
|
||||
| AWQ 4-bit | 99.0% | 35GB | 1.5x | ✅ (best for 70B) |
|
||||
| GPTQ 4-bit | 98.5% | 35GB | 1.5x | ✅ (good compatibility) |
|
||||
| SqueezeLLM 3-bit | 96.0% | 26GB | 1.3x | ⚠️ (check accuracy) |
|
||||
|
||||
**When to use each**:
|
||||
|
||||
**No quantization (FP16)**:
|
||||
- Have sufficient GPU memory
|
||||
- Need absolute best accuracy
|
||||
- Model <13B parameters
|
||||
|
||||
**FP8**:
|
||||
- Using H100/H800 GPUs
|
||||
- Need best speed with minimal accuracy loss
|
||||
- Production deployment
|
||||
|
||||
**AWQ 4-bit**:
|
||||
- Need to fit 70B model in 40GB GPU
|
||||
- Production deployment
|
||||
- <1% accuracy loss acceptable
|
||||
|
||||
**GPTQ 4-bit**:
|
||||
- Wide model support needed
|
||||
- Not on H100 (use FP8 instead)
|
||||
- 1-2% accuracy loss acceptable
|
||||
|
||||
**Testing strategy**:
|
||||
|
||||
1. **Baseline**: Measure FP16 accuracy on your evaluation set
|
||||
2. **Quantize**: Create quantized version
|
||||
3. **Evaluate**: Compare quantized vs baseline on same tasks
|
||||
4. **Decide**: Accept if degradation < threshold (typically 1-2%)
|
||||
|
||||
**Example evaluation**:
|
||||
```python
|
||||
from evaluate import load_evaluation_suite
|
||||
|
||||
# Run on FP16 baseline
|
||||
baseline_score = evaluate(model_fp16, eval_suite)
|
||||
|
||||
# Run on quantized
|
||||
quant_score = evaluate(model_awq, eval_suite)
|
||||
|
||||
# Compare
|
||||
degradation = (baseline_score - quant_score) / baseline_score * 100
|
||||
print(f"Accuracy degradation: {degradation:.2f}%")
|
||||
|
||||
# Decision
|
||||
if degradation < 1.0:
|
||||
print("✅ Quantization acceptable for production")
|
||||
else:
|
||||
print("⚠️ Review accuracy loss")
|
||||
```
|
||||
255
skills/mlops/inference/vllm/references/server-deployment.md
Normal file
255
skills/mlops/inference/vllm/references/server-deployment.md
Normal file
@@ -0,0 +1,255 @@
|
||||
# Server Deployment Patterns
|
||||
|
||||
## Contents
|
||||
- Docker deployment
|
||||
- Kubernetes deployment
|
||||
- Load balancing with Nginx
|
||||
- Multi-node distributed serving
|
||||
- Production configuration examples
|
||||
- Health checks and monitoring
|
||||
|
||||
## Docker deployment
|
||||
|
||||
**Basic Dockerfile**:
|
||||
```dockerfile
|
||||
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
|
||||
|
||||
RUN apt-get update && apt-get install -y python3-pip
|
||||
RUN pip install vllm
|
||||
|
||||
EXPOSE 8000
|
||||
|
||||
CMD ["vllm", "serve", "meta-llama/Llama-3-8B-Instruct", \
|
||||
"--host", "0.0.0.0", "--port", "8000", \
|
||||
"--gpu-memory-utilization", "0.9"]
|
||||
```
|
||||
|
||||
**Build and run**:
|
||||
```bash
|
||||
docker build -t vllm-server .
|
||||
docker run --gpus all -p 8000:8000 vllm-server
|
||||
```
|
||||
|
||||
**Docker Compose** (with metrics):
|
||||
```yaml
|
||||
version: '3.8'
|
||||
services:
|
||||
vllm:
|
||||
image: vllm/vllm-openai:latest
|
||||
command: >
|
||||
--model meta-llama/Llama-3-8B-Instruct
|
||||
--gpu-memory-utilization 0.9
|
||||
--enable-metrics
|
||||
--metrics-port 9090
|
||||
ports:
|
||||
- "8000:8000"
|
||||
- "9090:9090"
|
||||
deploy:
|
||||
resources:
|
||||
reservations:
|
||||
devices:
|
||||
- driver: nvidia
|
||||
count: all
|
||||
capabilities: [gpu]
|
||||
```
|
||||
|
||||
## Kubernetes deployment
|
||||
|
||||
**Deployment manifest**:
|
||||
```yaml
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: vllm-server
|
||||
spec:
|
||||
replicas: 2
|
||||
selector:
|
||||
matchLabels:
|
||||
app: vllm
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: vllm
|
||||
spec:
|
||||
containers:
|
||||
- name: vllm
|
||||
image: vllm/vllm-openai:latest
|
||||
args:
|
||||
- "--model=meta-llama/Llama-3-8B-Instruct"
|
||||
- "--gpu-memory-utilization=0.9"
|
||||
- "--enable-prefix-caching"
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 1
|
||||
ports:
|
||||
- containerPort: 8000
|
||||
name: http
|
||||
- containerPort: 9090
|
||||
name: metrics
|
||||
readinessProbe:
|
||||
httpGet:
|
||||
path: /health
|
||||
port: 8000
|
||||
initialDelaySeconds: 30
|
||||
periodSeconds: 10
|
||||
livenessProbe:
|
||||
httpGet:
|
||||
path: /health
|
||||
port: 8000
|
||||
initialDelaySeconds: 60
|
||||
periodSeconds: 30
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: vllm-service
|
||||
spec:
|
||||
selector:
|
||||
app: vllm
|
||||
ports:
|
||||
- port: 8000
|
||||
targetPort: 8000
|
||||
name: http
|
||||
- port: 9090
|
||||
targetPort: 9090
|
||||
name: metrics
|
||||
type: LoadBalancer
|
||||
```
|
||||
|
||||
## Load balancing with Nginx
|
||||
|
||||
**Nginx configuration**:
|
||||
```nginx
|
||||
upstream vllm_backend {
|
||||
least_conn; # Route to least-loaded server
|
||||
server localhost:8001;
|
||||
server localhost:8002;
|
||||
server localhost:8003;
|
||||
}
|
||||
|
||||
server {
|
||||
listen 80;
|
||||
|
||||
location / {
|
||||
proxy_pass http://vllm_backend;
|
||||
proxy_set_header Host $host;
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
|
||||
# Timeouts for long-running inference
|
||||
proxy_read_timeout 300s;
|
||||
proxy_connect_timeout 75s;
|
||||
}
|
||||
|
||||
# Metrics endpoint
|
||||
location /metrics {
|
||||
proxy_pass http://localhost:9090/metrics;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Start multiple vLLM instances**:
|
||||
```bash
|
||||
# Terminal 1
|
||||
vllm serve MODEL --port 8001 --tensor-parallel-size 1
|
||||
|
||||
# Terminal 2
|
||||
vllm serve MODEL --port 8002 --tensor-parallel-size 1
|
||||
|
||||
# Terminal 3
|
||||
vllm serve MODEL --port 8003 --tensor-parallel-size 1
|
||||
|
||||
# Start Nginx
|
||||
nginx -c /path/to/nginx.conf
|
||||
```
|
||||
|
||||
## Multi-node distributed serving
|
||||
|
||||
For models too large for single node:
|
||||
|
||||
**Node 1** (master):
|
||||
```bash
|
||||
export MASTER_ADDR=192.168.1.10
|
||||
export MASTER_PORT=29500
|
||||
export RANK=0
|
||||
export WORLD_SIZE=2
|
||||
|
||||
vllm serve meta-llama/Llama-2-70b-hf \
|
||||
--tensor-parallel-size 8 \
|
||||
--pipeline-parallel-size 2
|
||||
```
|
||||
|
||||
**Node 2** (worker):
|
||||
```bash
|
||||
export MASTER_ADDR=192.168.1.10
|
||||
export MASTER_PORT=29500
|
||||
export RANK=1
|
||||
export WORLD_SIZE=2
|
||||
|
||||
vllm serve meta-llama/Llama-2-70b-hf \
|
||||
--tensor-parallel-size 8 \
|
||||
--pipeline-parallel-size 2
|
||||
```
|
||||
|
||||
## Production configuration examples
|
||||
|
||||
**High throughput** (batch-heavy workload):
|
||||
```bash
|
||||
vllm serve MODEL \
|
||||
--max-num-seqs 512 \
|
||||
--gpu-memory-utilization 0.95 \
|
||||
--enable-prefix-caching \
|
||||
--trust-remote-code
|
||||
```
|
||||
|
||||
**Low latency** (interactive workload):
|
||||
```bash
|
||||
vllm serve MODEL \
|
||||
--max-num-seqs 64 \
|
||||
--gpu-memory-utilization 0.85 \
|
||||
--enable-chunked-prefill
|
||||
```
|
||||
|
||||
**Memory-constrained** (40GB GPU for 70B model):
|
||||
```bash
|
||||
vllm serve TheBloke/Llama-2-70B-AWQ \
|
||||
--quantization awq \
|
||||
--tensor-parallel-size 1 \
|
||||
--gpu-memory-utilization 0.95 \
|
||||
--max-model-len 4096
|
||||
```
|
||||
|
||||
## Health checks and monitoring
|
||||
|
||||
**Health check endpoint**:
|
||||
```bash
|
||||
curl http://localhost:8000/health
|
||||
# Returns: {"status": "ok"}
|
||||
```
|
||||
|
||||
**Readiness check** (wait for model loaded):
|
||||
```bash
|
||||
#!/bin/bash
|
||||
until curl -f http://localhost:8000/health; do
|
||||
echo "Waiting for vLLM to be ready..."
|
||||
sleep 5
|
||||
done
|
||||
echo "vLLM is ready!"
|
||||
```
|
||||
|
||||
**Prometheus scraping**:
|
||||
```yaml
|
||||
# prometheus.yml
|
||||
scrape_configs:
|
||||
- job_name: 'vllm'
|
||||
static_configs:
|
||||
- targets: ['localhost:9090']
|
||||
metrics_path: '/metrics'
|
||||
scrape_interval: 15s
|
||||
```
|
||||
|
||||
**Grafana dashboard** (key metrics):
|
||||
- Requests per second: `rate(vllm_request_success_total[5m])`
|
||||
- TTFT p50: `histogram_quantile(0.5, vllm_time_to_first_token_seconds_bucket)`
|
||||
- TTFT p99: `histogram_quantile(0.99, vllm_time_to_first_token_seconds_bucket)`
|
||||
- GPU cache usage: `vllm_gpu_cache_usage_perc`
|
||||
- Active requests: `vllm_num_requests_running`
|
||||
447
skills/mlops/inference/vllm/references/troubleshooting.md
Normal file
447
skills/mlops/inference/vllm/references/troubleshooting.md
Normal file
@@ -0,0 +1,447 @@
|
||||
# Troubleshooting Guide
|
||||
|
||||
## Contents
|
||||
- Out of memory (OOM) errors
|
||||
- Performance issues
|
||||
- Model loading errors
|
||||
- Network and connection issues
|
||||
- Quantization problems
|
||||
- Distributed serving issues
|
||||
- Debugging tools and commands
|
||||
|
||||
## Out of memory (OOM) errors
|
||||
|
||||
### Symptom: `torch.cuda.OutOfMemoryError` during model loading
|
||||
|
||||
**Cause**: Model + KV cache exceeds available VRAM
|
||||
|
||||
**Solutions (try in order)**:
|
||||
|
||||
1. **Reduce GPU memory utilization**:
|
||||
```bash
|
||||
vllm serve MODEL --gpu-memory-utilization 0.7 # Try 0.7, 0.75, 0.8
|
||||
```
|
||||
|
||||
2. **Reduce max sequence length**:
|
||||
```bash
|
||||
vllm serve MODEL --max-model-len 4096 # Instead of 8192
|
||||
```
|
||||
|
||||
3. **Enable quantization**:
|
||||
```bash
|
||||
vllm serve MODEL --quantization awq # 4x memory reduction
|
||||
```
|
||||
|
||||
4. **Use tensor parallelism** (multiple GPUs):
|
||||
```bash
|
||||
vllm serve MODEL --tensor-parallel-size 2 # Split across 2 GPUs
|
||||
```
|
||||
|
||||
5. **Reduce max concurrent sequences**:
|
||||
```bash
|
||||
vllm serve MODEL --max-num-seqs 128 # Default is 256
|
||||
```
|
||||
|
||||
### Symptom: OOM during inference (not model loading)
|
||||
|
||||
**Cause**: KV cache fills up during generation
|
||||
|
||||
**Solutions**:
|
||||
|
||||
```bash
|
||||
# Reduce KV cache allocation
|
||||
vllm serve MODEL --gpu-memory-utilization 0.85
|
||||
|
||||
# Reduce batch size
|
||||
vllm serve MODEL --max-num-seqs 64
|
||||
|
||||
# Reduce max tokens per request
|
||||
# Set in client request: max_tokens=512
|
||||
```
|
||||
|
||||
### Symptom: OOM with quantized model
|
||||
|
||||
**Cause**: Quantization overhead or incorrect configuration
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Ensure quantization flag matches model
|
||||
vllm serve TheBloke/Llama-2-70B-AWQ --quantization awq # Must specify
|
||||
|
||||
# Try different dtype
|
||||
vllm serve MODEL --quantization awq --dtype float16
|
||||
```
|
||||
|
||||
## Performance issues
|
||||
|
||||
### Symptom: Low throughput (<50 req/sec expected >100)
|
||||
|
||||
**Diagnostic steps**:
|
||||
|
||||
1. **Check GPU utilization**:
|
||||
```bash
|
||||
watch -n 1 nvidia-smi
|
||||
# GPU utilization should be >80%
|
||||
```
|
||||
|
||||
If <80%, increase concurrent requests:
|
||||
```bash
|
||||
vllm serve MODEL --max-num-seqs 512 # Increase from 256
|
||||
```
|
||||
|
||||
2. **Check if memory-bound**:
|
||||
```bash
|
||||
# If memory at 100% but GPU <80%, reduce sequence length
|
||||
vllm serve MODEL --max-model-len 4096
|
||||
```
|
||||
|
||||
3. **Enable optimizations**:
|
||||
```bash
|
||||
vllm serve MODEL \
|
||||
--enable-prefix-caching \
|
||||
--enable-chunked-prefill \
|
||||
--max-num-seqs 512
|
||||
```
|
||||
|
||||
4. **Check tensor parallelism settings**:
|
||||
```bash
|
||||
# Must use power-of-2 GPUs
|
||||
vllm serve MODEL --tensor-parallel-size 4 # Not 3 or 5
|
||||
```
|
||||
|
||||
### Symptom: High TTFT (time to first token >1 second)
|
||||
|
||||
**Causes and solutions**:
|
||||
|
||||
**Long prompts**:
|
||||
```bash
|
||||
vllm serve MODEL --enable-chunked-prefill
|
||||
```
|
||||
|
||||
**No prefix caching**:
|
||||
```bash
|
||||
vllm serve MODEL --enable-prefix-caching # For repeated prompts
|
||||
```
|
||||
|
||||
**Too many concurrent requests**:
|
||||
```bash
|
||||
vllm serve MODEL --max-num-seqs 64 # Reduce to prioritize latency
|
||||
```
|
||||
|
||||
**Model too large for single GPU**:
|
||||
```bash
|
||||
vllm serve MODEL --tensor-parallel-size 2 # Parallelize prefill
|
||||
```
|
||||
|
||||
### Symptom: Slow token generation (low tokens/sec)
|
||||
|
||||
**Diagnostic**:
|
||||
```bash
|
||||
# Check if model is correct size
|
||||
vllm serve MODEL # Should see model size in logs
|
||||
|
||||
# Check speculative decoding
|
||||
vllm serve MODEL --speculative-model DRAFT_MODEL
|
||||
```
|
||||
|
||||
**For H100 GPUs**, enable FP8:
|
||||
```bash
|
||||
vllm serve MODEL --quantization fp8
|
||||
```
|
||||
|
||||
## Model loading errors
|
||||
|
||||
### Symptom: `OSError: MODEL not found`
|
||||
|
||||
**Causes**:
|
||||
|
||||
1. **Model name typo**:
|
||||
```bash
|
||||
# Check exact model name on HuggingFace
|
||||
vllm serve meta-llama/Llama-3-8B-Instruct # Correct capitalization
|
||||
```
|
||||
|
||||
2. **Private/gated model**:
|
||||
```bash
|
||||
# Login to HuggingFace first
|
||||
huggingface-cli login
|
||||
# Then run vLLM
|
||||
vllm serve meta-llama/Llama-3-70B-Instruct
|
||||
```
|
||||
|
||||
3. **Custom model needs trust flag**:
|
||||
```bash
|
||||
vllm serve MODEL --trust-remote-code
|
||||
```
|
||||
|
||||
### Symptom: `ValueError: Tokenizer not found`
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Download model manually first
|
||||
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('MODEL')"
|
||||
|
||||
# Then launch vLLM
|
||||
vllm serve MODEL
|
||||
```
|
||||
|
||||
### Symptom: `ImportError: No module named 'flash_attn'`
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Install flash attention
|
||||
pip install flash-attn --no-build-isolation
|
||||
|
||||
# Or disable flash attention
|
||||
vllm serve MODEL --disable-flash-attn
|
||||
```
|
||||
|
||||
## Network and connection issues
|
||||
|
||||
### Symptom: `Connection refused` when querying server
|
||||
|
||||
**Diagnostic**:
|
||||
|
||||
1. **Check server is running**:
|
||||
```bash
|
||||
curl http://localhost:8000/health
|
||||
```
|
||||
|
||||
2. **Check port binding**:
|
||||
```bash
|
||||
# Bind to all interfaces for remote access
|
||||
vllm serve MODEL --host 0.0.0.0 --port 8000
|
||||
|
||||
# Check if port is in use
|
||||
lsof -i :8000
|
||||
```
|
||||
|
||||
3. **Check firewall**:
|
||||
```bash
|
||||
# Allow port through firewall
|
||||
sudo ufw allow 8000
|
||||
```
|
||||
|
||||
### Symptom: Slow response times over network
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Increase timeout**:
|
||||
```python
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI(
|
||||
base_url="http://localhost:8000/v1",
|
||||
api_key="EMPTY",
|
||||
timeout=300.0 # 5 minute timeout
|
||||
)
|
||||
```
|
||||
|
||||
2. **Check network latency**:
|
||||
```bash
|
||||
ping SERVER_IP # Should be <10ms for local network
|
||||
```
|
||||
|
||||
3. **Use connection pooling**:
|
||||
```python
|
||||
import requests
|
||||
from requests.adapters import HTTPAdapter
|
||||
from urllib3.util.retry import Retry
|
||||
|
||||
session = requests.Session()
|
||||
retries = Retry(total=3, backoff_factor=1)
|
||||
session.mount('http://', HTTPAdapter(max_retries=retries))
|
||||
```
|
||||
|
||||
## Quantization problems
|
||||
|
||||
### Symptom: `RuntimeError: Quantization format not supported`
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Ensure correct quantization method
|
||||
vllm serve MODEL --quantization awq # For AWQ models
|
||||
vllm serve MODEL --quantization gptq # For GPTQ models
|
||||
|
||||
# Check model card for quantization type
|
||||
```
|
||||
|
||||
### Symptom: Poor quality outputs after quantization
|
||||
|
||||
**Diagnostic**:
|
||||
|
||||
1. **Verify model is correctly quantized**:
|
||||
```bash
|
||||
# Check model config.json for quantization_config
|
||||
cat ~/.cache/huggingface/hub/models--MODEL/config.json
|
||||
```
|
||||
|
||||
2. **Try different quantization method**:
|
||||
```bash
|
||||
# If AWQ quality issues, try FP8 (H100 only)
|
||||
vllm serve MODEL --quantization fp8
|
||||
|
||||
# Or use less aggressive quantization
|
||||
vllm serve MODEL # No quantization
|
||||
```
|
||||
|
||||
3. **Increase temperature for better diversity**:
|
||||
```python
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
```
|
||||
|
||||
## Distributed serving issues
|
||||
|
||||
### Symptom: `RuntimeError: Distributed init failed`
|
||||
|
||||
**Diagnostic**:
|
||||
|
||||
1. **Check environment variables**:
|
||||
```bash
|
||||
# On all nodes
|
||||
echo $MASTER_ADDR # Should be same
|
||||
echo $MASTER_PORT # Should be same
|
||||
echo $RANK # Should be unique per node (0, 1, 2, ...)
|
||||
echo $WORLD_SIZE # Should be same (total nodes)
|
||||
```
|
||||
|
||||
2. **Check network connectivity**:
|
||||
```bash
|
||||
# From node 1 to node 2
|
||||
ping NODE2_IP
|
||||
nc -zv NODE2_IP 29500 # Check port accessibility
|
||||
```
|
||||
|
||||
3. **Check NCCL settings**:
|
||||
```bash
|
||||
export NCCL_DEBUG=INFO
|
||||
export NCCL_SOCKET_IFNAME=eth0 # Or your network interface
|
||||
vllm serve MODEL --tensor-parallel-size 8
|
||||
```
|
||||
|
||||
### Symptom: `NCCL error: unhandled cuda error`
|
||||
|
||||
**Solutions**:
|
||||
|
||||
```bash
|
||||
# Set NCCL to use correct network interface
|
||||
export NCCL_SOCKET_IFNAME=eth0 # Replace with your interface
|
||||
|
||||
# Increase timeout
|
||||
export NCCL_TIMEOUT=1800 # 30 minutes
|
||||
|
||||
# Force P2P for debugging
|
||||
export NCCL_P2P_DISABLE=1
|
||||
```
|
||||
|
||||
## Debugging tools and commands
|
||||
|
||||
### Enable debug logging
|
||||
|
||||
```bash
|
||||
export VLLM_LOGGING_LEVEL=DEBUG
|
||||
vllm serve MODEL
|
||||
```
|
||||
|
||||
### Monitor GPU usage
|
||||
|
||||
```bash
|
||||
# Real-time GPU monitoring
|
||||
watch -n 1 nvidia-smi
|
||||
|
||||
# Memory breakdown
|
||||
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1
|
||||
```
|
||||
|
||||
### Profile performance
|
||||
|
||||
```bash
|
||||
# Built-in benchmarking
|
||||
vllm bench throughput \
|
||||
--model MODEL \
|
||||
--input-tokens 128 \
|
||||
--output-tokens 256 \
|
||||
--num-prompts 100
|
||||
|
||||
vllm bench latency \
|
||||
--model MODEL \
|
||||
--input-tokens 128 \
|
||||
--output-tokens 256 \
|
||||
--batch-size 8
|
||||
```
|
||||
|
||||
### Check metrics
|
||||
|
||||
```bash
|
||||
# Prometheus metrics
|
||||
curl http://localhost:9090/metrics
|
||||
|
||||
# Filter for specific metrics
|
||||
curl http://localhost:9090/metrics | grep vllm_time_to_first_token
|
||||
|
||||
# Key metrics to monitor:
|
||||
# - vllm_time_to_first_token_seconds
|
||||
# - vllm_time_per_output_token_seconds
|
||||
# - vllm_num_requests_running
|
||||
# - vllm_gpu_cache_usage_perc
|
||||
# - vllm_request_success_total
|
||||
```
|
||||
|
||||
### Test server health
|
||||
|
||||
```bash
|
||||
# Health check
|
||||
curl http://localhost:8000/health
|
||||
|
||||
# Model info
|
||||
curl http://localhost:8000/v1/models
|
||||
|
||||
# Test completion
|
||||
curl http://localhost:8000/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "MODEL",
|
||||
"prompt": "Hello",
|
||||
"max_tokens": 10
|
||||
}'
|
||||
```
|
||||
|
||||
### Common environment variables
|
||||
|
||||
```bash
|
||||
# CUDA settings
|
||||
export CUDA_VISIBLE_DEVICES=0,1,2,3 # Limit to specific GPUs
|
||||
|
||||
# vLLM settings
|
||||
export VLLM_LOGGING_LEVEL=DEBUG
|
||||
export VLLM_TRACE_FUNCTION=1 # Profile functions
|
||||
export VLLM_USE_V1=1 # Use v1.0 engine (faster)
|
||||
|
||||
# NCCL settings (distributed)
|
||||
export NCCL_DEBUG=INFO
|
||||
export NCCL_SOCKET_IFNAME=eth0
|
||||
export NCCL_IB_DISABLE=0 # Enable InfiniBand
|
||||
```
|
||||
|
||||
### Collect diagnostic info for bug reports
|
||||
|
||||
```bash
|
||||
# System info
|
||||
nvidia-smi
|
||||
python --version
|
||||
pip show vllm
|
||||
|
||||
# vLLM version and config
|
||||
vllm --version
|
||||
python -c "import vllm; print(vllm.__version__)"
|
||||
|
||||
# Run with debug logging
|
||||
export VLLM_LOGGING_LEVEL=DEBUG
|
||||
vllm serve MODEL 2>&1 | tee vllm_debug.log
|
||||
|
||||
# Include in bug report:
|
||||
# - vllm_debug.log
|
||||
# - nvidia-smi output
|
||||
# - Full command used
|
||||
# - Expected vs actual behavior
|
||||
```
|
||||
Reference in New Issue
Block a user