# Ollama Model Testing Log

Track models tested, performance observations, and suitability for different use cases.

---

## Quick Summary

| Model | Date Tested | Primary Use Case | Rating | Notes |
|-------|-------------|------------------|--------|-------|
| GLM-4.7:cloud | 2026-02-04 | General purpose | ⭐⭐⭐⭐ | Cloud-hosted, fast, good reasoning |
| deepseek-v3.1:671b-cloud | 2026-02-04 | Complex reasoning | ⭐⭐⭐⭐⭐ | Cloud, very capable, slower response |

---

## Model Testing Details

### GLM-4.7:cloud

**Date Tested:** 2026-02-04

**Model Info:**
- Size/Parameters: Unknown (cloud)
- Quantization: N/A (cloud)
- Base Model: GLM-4.7 by Zhipu AI

**Performance:**
- Response Speed: Fast
- RAM/VRAM Usage: Cloud (local minimal)
- Context Window: 128k
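
The "Fast" / "Moderate" speed labels can be backed by numbers. A minimal sketch of the throughput math, assuming the eval-count and eval-duration stats that `ollama run <model> --verbose` prints after each response (the exact field names and units are an assumption):

```python
# Rough throughput check from Ollama-style eval stats.
# `ollama run --verbose` reports an eval count (tokens generated)
# and an eval duration in nanoseconds; tokens/s follows directly.

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput from token count and eval duration (ns)."""
    return eval_count / (eval_duration_ns / 1e9)

# Hypothetical example: 512 tokens generated in 6.4 s of eval time
rate = tokens_per_second(512, 6_400_000_000)
print(f"{rate:.1f} tokens/s")
```

Logging this per model makes the speed ratings in the summary table comparable across runs.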

**Testing Use Cases:**
- [x] Code generation
- [x] General Q&A
- [ ] Creative writing
- [x] Data analysis
- [ ] Task planning
- [ ] Other:

**Observations:**
- Strengths: Fast response, good at general reasoning
- Weaknesses: Cloud dependency
- Resource requirements: Minimal local resources
- Output quality: Solid for most tasks
- When to use this model: Daily tasks, coding help, general assistance

**Verdict:** ⭐⭐⭐⭐

---

### deepseek-v3.1:671b-cloud

**Date Tested:** 2026-02-04

**Model Info:**
- Size/Parameters: 671B (cloud)
- Quantization: N/A (cloud)
- Base Model: DeepSeek-V3.1 by DeepSeek

**Performance:**
- Response Speed: Moderate (671B model)
- RAM/VRAM Usage: Cloud (local minimal)
- Context Window: 128k+

**Testing Use Cases:**
- [x] Code generation
- [x] General Q&A
- [ ] Creative writing
- [x] Data analysis
- [x] Task planning
- [ ] Other:

**Observations:**
- Strengths: Very capable, excellent reasoning, great with complex tasks
- Weaknesses: Slower response, cloud dependency
- Resource requirements: Minimal local resources
- Output quality: Top-tier, handles complex multi-step reasoning well
- When to use this model: Complex coding tasks, deep analysis, planning

**Verdict:** ⭐⭐⭐⭐⭐

---

## Models to Test

### Local Models (16GB GPU Compatible)

**Small & Fast (2-6GB VRAM at Q4):**
- [ ] phi3:mini - 3.8B params, great for quick tasks ~2.2GB
- [ ] llama3.1:8b - 8B params, excellent all-rounder ~4.7GB
- [ ] qwen2.5:7b - 7B params, strong reasoning ~4.3GB
- [ ] gemma2:9b - 9B params, Google's small model ~5.5GB

**Medium Capability (6-10GB VRAM at Q4):**
- [ ] mistral:7b - 7B params, classic workhorse ~4.1GB
- [ ] phi4:14b - 14B params, strong general model ~9.1GB
- [ ] qwen2.5:14b - 14B params, strong multilingual ~8.1GB

**Specialized:**
- [ ] deepseek-coder-v2:lite - 16B params, optimized for coding ~8.7GB
- [ ] codellama:7b - 7B params, coding specialist ~4.1GB
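
The "16GB GPU Compatible" claim can be sanity-checked in code. A minimal sketch using the rough Q4 sizes listed above; the 30% VRAM headroom reserved for KV cache and system overhead is an assumption, not a measured figure:

```python
# Hypothetical fit check: which candidates leave comfortable headroom
# on a 16 GB card? Sizes are the approximate Q4 figures from the list.

CANDIDATES_GB = {
    "phi3:mini": 2.2,
    "llama3.1:8b": 4.7,
    "qwen2.5:7b": 4.3,
    "gemma2:9b": 5.5,
    "mistral:7b": 4.1,
    "qwen2.5:14b": 8.1,
    "deepseek-coder-v2:lite": 8.7,
    "codellama:7b": 4.1,
}

def fits(model_gb: float, vram_gb: float = 16.0, headroom: float = 0.3) -> bool:
    """True if the weights leave `headroom` fraction of VRAM free for context."""
    return model_gb <= vram_gb * (1.0 - headroom)

comfortable = [name for name, gb in CANDIDATES_GB.items() if fits(gb)]
print(comfortable)  # every candidate fits: the largest is 8.7 GB vs an 11.2 GB budget
```

Lowering the headroom (or raising the context length) is how larger models get squeezed in, at the cost of slower long-context performance.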

---

## General Notes

*Any overall observations, preferences, or patterns discovered during testing.*

**Initial Impressions:**
- Cloud models (GLM-4.7, DeepSeek-V3.1) provide excellent quality without local resources
- Planning to test local models for privacy, offline use, and quality/speed trade-offs
- Focus will be on models that fit comfortably in 16GB VRAM for smooth performance

**VRAM Estimates at Q4 Quantization:**
- 3B-4B models: ~2-3GB
- 7B-8B models: ~4-5GB
- 14B models: ~8-9GB
- Leaves room for context window and system overhead
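
The ranges above follow from Q4 storing roughly half a byte per parameter. A hedged back-of-the-envelope estimator; the ~15% overhead factor for non-quantized layers and runtime buffers is an assumption chosen to match the ranges listed:

```python
# Rough Q4 VRAM estimate: ~4 bits (0.5 bytes) per parameter,
# plus an assumed ~15% overhead for embeddings and runtime buffers.

def q4_vram_gb(params_billions: float, overhead: float = 0.15) -> float:
    """Approximate weight memory in GB for a Q4-quantized model."""
    bytes_per_param = 0.5  # 4-bit weights
    return round(params_billions * bytes_per_param * (1.0 + overhead), 1)

for p in (3.8, 8, 14):
    print(f"{p}B params -> ~{q4_vram_gb(p)} GB")
```

The results land inside each bracket above (a 3.8B model at ~2.2 GB, an 8B at ~4.6 GB, a 14B at ~8 GB), which is why 14B is a comfortable ceiling for a 16GB card.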

---

*Last Updated: 2026-02-04*