
---
title: llama.cpp Installation and Setup
description: llama.cpp b8680 Vulkan build installation on workstation with RTX 4080 Super, including model download workflow.
type: reference
domain: workstation
tags:
  - llama-cpp
  - vulkan
  - nvidia
  - gguf
  - local-inference
---

## Installation

Installed from pre-built release binary (no CUDA build available for Linux — Vulkan is the correct choice for NVIDIA GPUs):

```bash
# Extract to /opt
sudo mkdir -p /opt/llama.cpp
sudo tar -xzf llama-b8680-bin-ubuntu-vulkan-x64.tar.gz -C /opt/llama.cpp --strip-components=1

# Symlink all binaries onto PATH
for bin in /opt/llama.cpp/llama-*; do
  sudo ln -sf "$bin" "/usr/local/bin/$(basename "$bin")"
done
```

- **Version:** b8680
- **Backends loaded:** Vulkan (GPU), CPU (Zen 4, for the 7800X3D), RPC
- **Source:** https://github.com/ggml-org/llama.cpp/releases

## Release Binary Options (Linux x64)

| Build | Use case |
|---|---|
| `ubuntu-x64` | CPU only |
| `ubuntu-vulkan-x64` | NVIDIA/AMD GPU via Vulkan |
| `ubuntu-rocm-x64` | AMD GPU via ROCm |
| `ubuntu-openvino-x64` | Intel CPU/GPU/NPU |

No pre-built CUDA binary exists, so Vulkan is the NVIDIA option among the release binaries. For native CUDA, build from source with `-DGGML_CUDA=ON`.
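If a from-source CUDA build is ever wanted, it follows the standard llama.cpp CMake flow. A sketch, assuming git, CMake, and the CUDA toolkit are already installed:

```shell
# Sketch of a native CUDA build from source (not currently installed here).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"
# Binaries land in build/bin/ (e.g. build/bin/llama-cli)
```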

## Models

Stored in `/home/cal/Models/`.

| Model | File | Size |
|---|---|---|
| Qwen3.5-9B Q4_K_M | `Qwen3.5-9B-Q4_K_M.gguf` | 5.3 GB |

### Downloading Models

The built-in `-hf` downloader can stall. Use `curl` with resume support instead:

```bash
curl -L -C - --progress-bar \
  -o /home/cal/Models/<model>.gguf \
  "https://huggingface.co/<org>/<repo>/resolve/main/<model>.gguf"
```

`-C -` enables resume if the download is interrupted.
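The curl invocation can be wrapped in a small helper. `hf_url` and `hf_fetch` below are illustrative names, not part of llama.cpp or any tool on the system:

```shell
# Illustrative helpers for resumable Hugging Face downloads (hypothetical names).

# Build the "resolve" URL for a file on a repo's main branch.
hf_url() {
  printf 'https://huggingface.co/%s/%s/resolve/main/%s\n' "$1" "$2" "$3"
}

# Download (or resume) a GGUF into the models directory.
hf_fetch() {
  local dest="${4:-/home/cal/Models}"
  curl -L -C - --progress-bar -o "${dest}/$3" "$(hf_url "$1" "$2" "$3")"
}
```

Usage is then `hf_fetch <org> <repo> <model>.gguf`, with an optional fourth argument overriding the destination directory.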

## Running

```bash
# Full GPU offload
llama-cli -m /home/cal/Models/Qwen3.5-9B-Q4_K_M.gguf -ngl 99

# Server mode
llama-server -m /home/cal/Models/Qwen3.5-9B-Q4_K_M.gguf -ngl 99 --port 8080
```
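Once the server is up, it exposes an OpenAI-compatible HTTP API. A quick smoke test, assuming the default host and the port used above:

```shell
# Health check (returns a small JSON status object)
curl -s http://localhost:8080/health

# Chat completion via the OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'
```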