| title | description | type | domain | tags |
|---|---|---|---|---|
| llama.cpp Installation and Setup | llama.cpp b8680 Vulkan build installation on workstation with RTX 4080 Super, including model download workflow. | reference | workstation | |
## Installation

Installed from the pre-built release binary (no pre-built CUDA binary exists for Linux; Vulkan is the correct choice for NVIDIA GPUs):

```bash
# Extract to /opt
sudo mkdir -p /opt/llama.cpp
sudo tar -xzf llama-b8680-bin-ubuntu-vulkan-x64.tar.gz -C /opt/llama.cpp --strip-components=1

# Symlink all binaries to PATH
for bin in /opt/llama.cpp/llama-*; do
  sudo ln -sf "$bin" /usr/local/bin/"$(basename "$bin")"
done
```
- Version: b8680
- Backends loaded: Vulkan (GPU), CPU (Zen 4, for the 7800X3D), RPC
- Source: https://github.com/ggml-org/llama.cpp/releases
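A quick sanity check after symlinking (assumes the binaries support `--version`, which current upstream builds do):

```shell
# The symlinked tools should now resolve from PATH,
# and --version should report build b8680.
command -v llama-cli llama-server
llama-cli --version
```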
## Release Binary Options (Linux x64)

| Build | Use case |
|---|---|
| `ubuntu-x64` | CPU only |
| `ubuntu-vulkan-x64` | NVIDIA/AMD GPU via Vulkan |
| `ubuntu-rocm-x64` | AMD GPU via ROCm |
| `ubuntu-openvino-x64` | Intel CPU/GPU/NPU |
No pre-built CUDA binary exists; Vulkan is the NVIDIA option. For native CUDA, build from source with `-DGGML_CUDA=ON`.
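For reference, the source build route looks roughly like this (a sketch, not done on this machine; `-DGGML_CUDA=ON` is the upstream CMake flag, and the CUDA toolkit plus CMake must be installed):

```shell
# Sketch: native CUDA build from source (build recipe, requires CUDA toolkit)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Resulting binaries land in build/bin/
```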
## Models

Stored in `/home/cal/Models/`.

| Model | File | Size |
|---|---|---|
| Qwen3.5-9B Q4_K_M | `Qwen3.5-9B-Q4_K_M.gguf` | 5.3 GB |
### Downloading Models

The built-in `-hf` downloader can stall. Use curl with resume support instead:

```bash
curl -L -C - --progress-bar \
  -o /home/cal/Models/<model>.gguf \
  "https://huggingface.co/<org>/<repo>/resolve/main/<model>.gguf"
```

`-C -` resumes a partial file if the download is interrupted.
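Since `-C -` makes interrupted transfers resumable, a simple retry wrapper makes large pulls hands-off (a sketch; `MODEL_URL` and `OUT` are placeholders, not values from this setup):

```shell
# Keep resuming until curl exits cleanly; --fail turns HTTP errors
# into nonzero exit codes so the loop retries those too.
MODEL_URL="https://huggingface.co/<org>/<repo>/resolve/main/<model>.gguf"
OUT="/home/cal/Models/<model>.gguf"

until curl -L -C - --fail --progress-bar -o "$OUT" "$MODEL_URL"; do
  echo "download interrupted, retrying in 5s..." >&2
  sleep 5
done
```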
## Running

```bash
# Full GPU offload
llama-cli -m /home/cal/Models/Qwen3.5-9B-Q4_K_M.gguf -ngl 99

# Server mode
llama-server -m /home/cal/Models/Qwen3.5-9B-Q4_K_M.gguf -ngl 99 --port 8080
```
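`llama-server` exposes an OpenAI-compatible HTTP API, so the server started above can be queried with plain curl (the prompt is illustrative; port matches `--port 8080` above):

```shell
# Chat completion against the local server; returns a JSON response
# in the OpenAI chat-completions format.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```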