Document local LLM benchmark results, testing methodology, and model comparison notes for Ollama deployments. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Ollama Model Benchmark Results
## Summary Table

| Model | Code Gen | Code Analysis | Reasoning | Data Analysis | Planning | Overall |
|-------|----------|--------------|-----------|--------------|----------|---------|
| deepseek-coder-v2:lite | | | | | | |
| llama3.1:8b | | | | | | |
| glm-4.7:cloud | | | | | | |
| deepseek-v3.1:671b-cloud | | | | | | |

---
## Detailed Results by Category
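The Response Time columns in this section could be filled from the timing fields that Ollama returns with each non-streaming `/api/generate` response (`total_duration`, `eval_duration`, and `eval_count`, with durations in nanoseconds). The following is a minimal sketch, not the methodology actually used for this document: it assumes a local Ollama server at the default address `http://localhost:11434`, and the helper names `summarize_timing` and `time_prompt` are hypothetical.

```python
import json
import urllib.request

# Assumption: a local Ollama server listening on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def summarize_timing(response: dict) -> dict:
    """Convert Ollama's nanosecond timing fields into seconds and tokens/s."""
    total_s = response["total_duration"] / 1e9
    eval_s = response["eval_duration"] / 1e9
    tokens = response["eval_count"]
    return {
        "response_time_s": round(total_s, 2),
        # Guard against a zero generation time to avoid dividing by zero.
        "tokens_per_s": round(tokens / eval_s, 1) if eval_s > 0 else 0.0,
    }


def time_prompt(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming prompt to Ollama and return its timing summary."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return summarize_timing(json.load(reply))
```

Calling `time_prompt("llama3.1:8b", "...")` with each test prompt would give one Response Time value per table row. Note that `total_duration` includes model load time, so a first call against a cold model is likely to look slower than later ones.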
### Code Generation

**Simple Python Function**

| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Class with Error Handling**

| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Async API Handler**

| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Algorithm Challenge**

| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### Code Analysis & Refactoring

**Bug Finding**

| Model | Accuracy | Explanation Quality | Response Time | Notes |
|-------|----------|---------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Code Refactoring**

| Model | Accuracy | Explanation Quality | Response Time | Notes |
|-------|----------|---------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Code Explanation**

| Model | Accuracy | Explanation Quality | Response Time | Notes |
|-------|----------|---------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### General Reasoning

**Logic Problem**

| Model | Accuracy | Reasoning Quality | Response Time | Notes |
|-------|----------|-------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**System Design**

| Model | Accuracy | Detail Level | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Technical Explanation**

| Model | Accuracy | Clarity | Response Time | Notes |
|-------|----------|---------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### Data Analysis

**Data Processing**

| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Data Transformation**

| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### Planning & Task Breakdown

**Project Planning**

| Model | Completeness | Practicality | Response Time | Notes |
|-------|--------------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Debugging Strategy**

| Model | Logical Flow | Practicality | Response Time | Notes |
|-------|--------------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

---
## Key Findings
### Strengths by Model

**deepseek-coder-v2:lite**

-

**llama3.1:8b**

-

**glm-4.7:cloud**

-

**deepseek-v3.1:671b-cloud**

-
### Weaknesses by Model

**deepseek-coder-v2:lite**

-

**llama3.1:8b**

-

**glm-4.7:cloud**

-

**deepseek-v3.1:671b-cloud**

-
### Best Model for Each Category

| Category | Winner | Runner-up |
|----------|--------|-----------|
| Code Generation | | |
| Code Analysis | | |
| Reasoning | | |
| Data Analysis | | |
| Planning | | |
| Overall (Score) | | |
| Speed (if relevant) | | |

---

*Last Updated: YYYY-MM-DD*