
Ollama Model Testing Log

Track models tested, performance observations, and suitability for different use cases.
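
Speed and quality notes in this log come from informal prompting, but throughput can also be measured directly. Below is a minimal sketch, assuming a stock Ollama server on the default port 11434 and the `requests` package; the model tag and prompt are placeholders, not part of the log itself:

```python
# Minimal throughput probe against a local Ollama server.
# Assumes Ollama is listening on the default port (11434); the model
# tag and prompt are placeholders, swap in whichever model is under test.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def benchmark(model: str, prompt: str) -> None:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count / eval_duration cover the generation phase only
    # (tokens and nanoseconds); prompt processing is reported separately.
    tokens = data.get("eval_count", 0)
    duration_s = data.get("eval_duration", 0) / 1e9
    tps = tokens / duration_s if duration_s else float("nan")
    print(f"{model}: {tokens} tokens in {duration_s:.1f}s -> {tps:.1f} tok/s")

if __name__ == "__main__":
    benchmark("llama3.1:8b", "Explain what quantization does to an LLM.")
```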


Quick Summary

| Model | Date Tested | Primary Use Case | Rating | Notes |
|-------|-------------|------------------|--------|-------|
| GLM-4.7:cloud | 2026-02-04 | General purpose | | Cloud-hosted, fast, good reasoning |
| deepseek-v3.1:671b-cloud | 2026-02-04 | Complex reasoning | | Cloud, very capable, slower response |

Model Testing Details

GLM-4.7:cloud

Date Tested: 2026-02-04

Model Info:

  • Size/Parameters: Unknown (cloud)
  • Quantization: N/A (cloud)
  • Base Model: GLM-4.7 by Zhipu AI

Performance:

  • Response Speed: Fast
  • RAM/VRAM Usage: Cloud (local minimal)
  • Context Window: 128k
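
Most of the Model Info and Performance fields above can be read from the server rather than guessed. A sketch against Ollama's `/api/show` endpoint, assuming the same default local install; the architecture-prefixed keys inside `model_info` vary by model family, and cloud-hosted tags may report fewer fields:

```python
# Fill in "Model Info" fields programmatically via Ollama's show endpoint.
import requests

def model_info(model: str) -> None:
    resp = requests.post(
        "http://localhost:11434/api/show",
        json={"model": model},  # older Ollama versions expect "name" instead
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    details = data.get("details", {})
    print("Parameters:  ", details.get("parameter_size"))
    print("Quantization:", details.get("quantization_level"))
    # Context length lives under an architecture-prefixed key,
    # e.g. "llama.context_length"; the prefix differs per model family.
    for key, value in data.get("model_info", {}).items():
        if key.endswith("context_length"):
            print("Context:     ", value)

model_info("GLM-4.7:cloud")  # cloud models may expose less detail
```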

Testing Use Cases:

  • Code generation
  • General Q&A
  • Creative writing
  • Data analysis
  • Task planning
  • Other:

Observations:

  • Strengths: Fast response, good at general reasoning
  • Weaknesses: Cloud dependency
  • Resource requirements: Minimal local resources
  • Output quality: Solid for most tasks
  • When to use this model: Daily tasks, coding help, general assistance

Verdict:


deepseek-v3.1:671b-cloud

Date Tested: 2026-02-04

Model Info:

  • Size/Parameters: 671B (cloud)
  • Quantization: N/A (cloud)
  • Base Model: DeepSeek-V3.1 by DeepSeek

Performance:

  • Response Speed: Moderate (671B model)
  • RAM/VRAM Usage: Cloud (local minimal)
  • Context Window: 128k+
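
"Moderate" above is mostly latency to the first token rather than raw throughput, and the two are worth logging separately. A sketch that streams a generation and times the first visible chunk; the model tag and prompt are placeholders:

```python
# Measure time-to-first-token with a streaming generate request.
import json
import time
import requests

def time_to_first_token(model: str, prompt: str) -> float:
    start = time.monotonic()
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=600,
    ) as resp:
        resp.raise_for_status()
        # Streaming responses arrive as newline-delimited JSON chunks.
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if chunk.get("response"):
                return time.monotonic() - start  # first visible token
    return float("nan")

print(f"TTFT: {time_to_first_token('deepseek-v3.1:671b-cloud', 'Hello'):.2f}s")
```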

Testing Use Cases:

  • Code generation
  • General Q&A
  • Creative writing
  • Data analysis
  • Task planning
  • Other:

Observations:

  • Strengths: Very capable, excellent reasoning, great with complex tasks
  • Weaknesses: Slower response, cloud dependency
  • Resource requirements: Minimal local resources
  • Output quality: Top-tier, handles complex multi-step reasoning well
  • When to use this model: Complex coding tasks, deep analysis, planning

Verdict:


Models to Test

Local Models (16GB GPU Compatible)

Small & Fast (2-6GB VRAM at Q4):

  • phi3:mini - 3.8B params, great for quick tasks ~2.2GB
  • llama3.1:8b - 8B params, excellent all-rounder ~4.7GB
  • mistral:7b - 7B params, classic workhorse ~4.1GB
  • qwen2.5:7b - 7B params, strong reasoning ~4.3GB
  • gemma2:9b - 9B params, Google's small model ~5.5GB

Medium Capability (6-10GB VRAM at Q4):

  • phi4:14b - 14B params, higher quality ~9.1GB
  • qwen2.5:14b - 14B params, strong multilingual ~8.1GB

Specialized:

  • deepseek-coder-v2:lite - 16B params, optimized for coding ~8.7GB
  • codellama:7b - 7B params, coding specialist ~4.1GB
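
Before a local testing session, the whole queue above can be staged in one go. A sketch that shells out to the `ollama` CLI (assumed to be on PATH); the tags mirror the lists above, and each pull is a multi-gigabyte download:

```python
# Pre-pull the candidate local models so they're ready to benchmark.
import subprocess

MODELS_TO_TEST = [
    "phi3:mini",
    "llama3.1:8b",
    "mistral:7b",
    "qwen2.5:7b",
    "gemma2:9b",
    "phi4:14b",
    "qwen2.5:14b",
    "deepseek-coder-v2:lite",
    "codellama:7b",
]

for model in MODELS_TO_TEST:
    print(f"Pulling {model} ...")
    subprocess.run(["ollama", "pull", model], check=True)
```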

General Notes

Any overall observations, preferences, or patterns discovered during testing.

Initial Impressions:

  • Cloud models (GLM-4.7, DeepSeek-V3) provide excellent quality without local resources
  • Planning to test local models for privacy, offline use, and quality/speed trade-off comparisons against the cloud models
  • Focus will be on models that fit comfortably in 16GB VRAM for smooth performance

VRAM Estimates at Q4 Quantization:

  • 3B-4B models: ~2-3GB
  • 7B-8B models: ~4-5GB
  • 14B models: ~8-9GB
  • Leaves room for context window and system overhead; a rough arithmetic check follows below
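
These estimates can be sanity-checked with simple arithmetic: 4-bit weights take roughly 0.5 bytes per parameter, plus runtime overhead. The ~20% overhead factor in the sketch below is an assumption, not a measurement, and the KV cache grows with context length on top of it:

```python
# Back-of-envelope check of the Q4 VRAM estimates above.
def q4_vram_gb(params_billions: float, overhead: float = 1.2) -> float:
    weight_gb = params_billions * 0.5  # ~0.5 bytes/param at 4-bit
    return weight_gb * overhead        # assumed ~20% runtime overhead

for b in (3.8, 8, 14):
    print(f"{b:>5}B -> ~{q4_vram_gb(b):.1f} GB")
# 3.8B -> ~2.3 GB, 8B -> ~4.8 GB, 14B -> ~8.4 GB:
# consistent with the ranges listed above.
```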

Last Updated: 2026-02-04