---
title: "llama.cpp Installation and Setup"
description: "llama.cpp b8680 Vulkan build installation on workstation with RTX 4080 Super, including model download workflow."
type: reference
domain: workstation
tags: [llama-cpp, vulkan, nvidia, gguf, local-inference]
---

## Installation

Installed from a pre-built release binary (no pre-built CUDA binary exists for Linux, so Vulkan is the right backend for NVIDIA GPUs):

```bash
# Extract to /opt
sudo mkdir -p /opt/llama.cpp
sudo tar -xzf llama-b8680-bin-ubuntu-vulkan-x64.tar.gz -C /opt/llama.cpp --strip-components=1

# Symlink all binaries onto PATH
for bin in /opt/llama.cpp/llama-*; do
  sudo ln -sf "$bin" /usr/local/bin/$(basename "$bin")
done
```

**Version**: b8680
**Backends loaded**: Vulkan (GPU), CPU (Zen 4 path for the 7800X3D), RPC
**Source**: https://github.com/ggml-org/llama.cpp/releases

## Release Binary Options (Linux x64)

| Build | Use case |
|-------|----------|
| `ubuntu-x64` | CPU only |
| `ubuntu-vulkan-x64` | NVIDIA/AMD GPU via Vulkan |
| `ubuntu-rocm-x64` | AMD GPU via ROCm |
| `ubuntu-openvino-x64` | Intel CPU/GPU/NPU |

No pre-built CUDA binary exists; Vulkan is the NVIDIA option. For native CUDA, build from source with `-DGGML_CUDA=ON`.

## Models

Stored in `/home/cal/Models/`.

| Model | File | Size |
|-------|------|------|
| Qwen3.5-9B Q4_K_M | `Qwen3.5-9B-Q4_K_M.gguf` | 5.3 GB |

## Downloading Models

The built-in `-hf` downloader can stall. Use `curl` with resume support instead (fill in the `<...>` placeholders with the repository and filename):

```bash
curl -L -C - --progress-bar \
  -o /home/cal/Models/<model>.gguf \
  "https://huggingface.co/<org>/<repo>/resolve/main/<model>.gguf"
```

`-C -` resumes a partial file if the download is interrupted.

## Running

```bash
# Full GPU offload
llama-cli -m /home/cal/Models/Qwen3.5-9B-Q4_K_M.gguf -ngl 99

# Server mode
llama-server -m /home/cal/Models/Qwen3.5-9B-Q4_K_M.gguf -ngl 99 --port 8080
```
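Once the server is up, it exposes an OpenAI-compatible HTTP API. A quick smoke test with `curl` (a sketch assuming the `--port 8080` invocation above; adjust host/port to match your setup):

```shell
# Send one chat request to the local llama-server instance
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 16
      }'
```

The response is a JSON chat-completion object, so any OpenAI-style client library can also be pointed at `http://localhost:8080/v1`.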
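An interrupted or failed download can leave a truncated file (or an HTML error page saved as `.gguf`), so it is worth checking the magic bytes before pointing `llama-cli` at a file: every valid GGUF file begins with the four ASCII bytes `GGUF`. A minimal Python sketch (the `is_gguf` helper name is mine, not part of llama.cpp):

```python
import sys

GGUF_MAGIC = b"GGUF"  # first four bytes of every valid GGUF file


def is_gguf(path: str) -> bool:
    """Return True if the file exists and starts with the GGUF magic bytes."""
    try:
        with open(path, "rb") as f:
            return f.read(4) == GGUF_MAGIC
    except OSError:
        return False


if __name__ == "__main__":
    for p in sys.argv[1:]:
        verdict = "ok" if is_gguf(p) else "NOT a GGUF file (truncated or bad download?)"
        print(f"{p}: {verdict}")
```

Run it as `python check_gguf.py /home/cal/Models/*.gguf` to vet everything in the models directory at once.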
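For completeness, the source-build route mentioned in the release-binary table (native CUDA instead of Vulkan) looks roughly like this. A sketch only, not tested on this workstation; it assumes the CUDA toolkit, a compiler, and CMake are installed:

```shell
# Build llama.cpp from source with the CUDA backend enabled
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"
# Resulting binaries land in build/bin/
```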