claude-home/ollama-benchmark-results.md
Cal Corum b186107b97 Add Ollama benchmark results and model testing notes
Document local LLM benchmark results, testing methodology, and
model comparison notes for Ollama deployments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 22:26:04 -06:00


# Ollama Model Benchmark Results
## Summary Table
| Model | Code Gen | Code Analysis | Reasoning | Data Analysis | Planning | Overall |
|-------|----------|--------------|-----------|--------------|----------|---------|
| deepseek-coder-v2:lite | | | | | | |
| llama3.1:8b | | | | | | |
| glm-4.7:cloud | | | | | | |
| deepseek-v3.1:671b-cloud | | | | | | |
---
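
The Overall column in the summary table needs some rule for combining the five category scores. The document does not state one, so the following is only a sketch under assumptions: it assumes each category is scored on a shared numeric scale and takes an unweighted mean. The `overall_score` helper and the `CATEGORIES` list are hypothetical names introduced for illustration.

```python
from statistics import mean
from typing import Optional

# Column names copied from the summary table; the numeric scale is an assumption.
CATEGORIES = ["Code Gen", "Code Analysis", "Reasoning", "Data Analysis", "Planning"]


def overall_score(scores: dict) -> Optional[float]:
    """Return the unweighted mean of the category scores that are filled in.

    Categories that are missing or set to None are skipped, so a partially
    completed row still gets a provisional overall value.
    """
    filled = [scores[name] for name in CATEGORIES if scores.get(name) is not None]
    if not filled:
        return None  # nothing scored yet
    return round(mean(filled), 2)
```

If some categories matter more for a given deployment, a weighted mean would be a natural substitute for the unweighted one used here.
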
## Detailed Results by Category
### Code Generation
**Simple Python Function**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Class with Error Handling**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Async API Handler**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Algorithm Challenge**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### Code Analysis & Refactoring
**Bug Finding**
| Model | Accuracy | Explanation Quality | Response Time | Notes |
|-------|----------|---------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Code Refactoring**
| Model | Accuracy | Explanation Quality | Response Time | Notes |
|-------|----------|---------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Code Explanation**
| Model | Accuracy | Explanation Quality | Response Time | Notes |
|-------|----------|---------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### General Reasoning
**Logic Problem**
| Model | Accuracy | Reasoning Quality | Response Time | Notes |
|-------|----------|-------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**System Design**
| Model | Accuracy | Detail Level | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Technical Explanation**
| Model | Accuracy | Clarity | Response Time | Notes |
|-------|----------|---------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### Data Analysis
**Data Processing**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Data Transformation**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### Planning & Task Breakdown
**Project Planning**
| Model | Completeness | Practicality | Response Time | Notes |
|-------|--------------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Debugging Strategy**
| Model | Logical Flow | Practicality | Response Time | Notes |
|-------|--------------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
---
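
The Response Time columns above could be filled from the timing fields that Ollama returns with a non-streaming `/api/generate` reply (`total_duration`, `eval_count`, `eval_duration`, with durations in nanoseconds). The sketch below is one possible way to collect them, not the method used for this benchmark. It assumes a local server at the default `http://localhost:11434` address, and `run_prompt` and `summarize_timing` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def run_prompt(model: str, prompt: str, url: str = OLLAMA_URL, timeout: float = 600.0) -> dict:
    """Send one non-streaming generate request and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize_timing(reply: dict) -> dict:
    """Convert the nanosecond timing fields of a reply into seconds and tokens/second."""
    total_s = reply.get("total_duration", 0) / 1e9
    eval_s = reply.get("eval_duration", 0) / 1e9
    tokens = reply.get("eval_count", 0)
    return {
        "total_seconds": round(total_s, 2),
        "output_tokens": tokens,
        # Guard against replies that omit timing data.
        "tokens_per_second": round(tokens / eval_s, 1) if eval_s > 0 else None,
    }
```

A call such as `summarize_timing(run_prompt("llama3.1:8b", "..."))` would yield one row's timing. Figures for the `:cloud` models likely include network latency, so they may not be directly comparable with the locally hosted models.
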
## Key Findings
### Strengths by Model
**deepseek-coder-v2:lite**
-
**llama3.1:8b**
-
**glm-4.7:cloud**
-
**deepseek-v3.1:671b-cloud**
-
### Weaknesses by Model
**deepseek-coder-v2:lite**
-
**llama3.1:8b**
-
**glm-4.7:cloud**
-
**deepseek-v3.1:671b-cloud**
-
### Best Model for Each Category
| Category | Winner | Runner-up |
|----------|--------|-----------|
| Code Generation | | |
| Code Analysis | | |
| Reasoning | | |
| Data Analysis | | |
| Planning | | |
| Overall (Score) | | |
| Speed (if relevant) | | |
---
*Last Updated: YYYY-MM-DD*