# Ollama Model Benchmark Results

Local LLM benchmark results, testing methodology, and model comparison notes for Ollama deployments.

## Summary Table
| Model | Code Gen | Code Analysis | Reasoning | Data Analysis | Planning | Overall |
|---|---|---|---|---|---|---|
| deepseek-coder-v2:lite | | | | | | |
| llama3.1:8b | | | | | | |
| glm-4.7:cloud | | | | | | |
| deepseek-v3.1:671b-cloud | | | | | | |
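The response-time figures in the tables can be collected with a small harness against Ollama's local REST API. Below is a minimal sketch, assuming Ollama is running on its default endpoint (`http://localhost:11434`) and the listed models are already pulled; the `time_generation` and `markdown_row` helper names and the example prompt are illustrative, not part of a fixed methodology.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default generate endpoint


def time_generation(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming generation request and measure wall-clock time."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    elapsed = time.perf_counter() - start
    # Ollama's response includes eval_count (output tokens) among other timing fields.
    return {
        "seconds": round(elapsed, 2),
        "tokens": body.get("eval_count", 0),
        "response": body.get("response", ""),
    }


def markdown_row(model: str, cells: list) -> str:
    """Format one body row for the results tables in this document."""
    return "| " + " | ".join([model] + cells) + " |"


if __name__ == "__main__":
    models = ["deepseek-coder-v2:lite", "llama3.1:8b"]
    # Example task for the 'Simple Python Function' category (illustrative prompt).
    prompt = "Write a Python function that reverses a string."
    for m in models:
        result = time_generation(m, prompt)
        print(markdown_row(m, ["", "", f"{result['seconds']}s", f"{result['tokens']} tokens"]))
```

Running the script prints ready-to-paste table rows; the graded columns (Accuracy, Code Quality) still need to be filled in by hand after reviewing each response.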
## Detailed Results by Category

### Code Generation

#### Simple Python Function
| Model | Accuracy | Code Quality | Response Time | Notes |
|---|---|---|---|---|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
#### Class with Error Handling
| Model | Accuracy | Code Quality | Response Time | Notes |
|---|---|---|---|---|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
#### Async API Handler
| Model | Accuracy | Code Quality | Response Time | Notes |
|---|---|---|---|---|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
#### Algorithm Challenge
| Model | Accuracy | Code Quality | Response Time | Notes |
|---|---|---|---|---|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### Code Analysis & Refactoring

#### Bug Finding
| Model | Accuracy | Explanation Quality | Response Time | Notes |
|---|---|---|---|---|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
#### Code Refactoring
| Model | Accuracy | Explanation Quality | Response Time | Notes |
|---|---|---|---|---|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
#### Code Explanation
| Model | Accuracy | Explanation Quality | Response Time | Notes |
|---|---|---|---|---|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### General Reasoning

#### Logic Problem
| Model | Accuracy | Reasoning Quality | Response Time | Notes |
|---|---|---|---|---|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
#### System Design
| Model | Accuracy | Detail Level | Response Time | Notes |
|---|---|---|---|---|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
#### Technical Explanation
| Model | Accuracy | Clarity | Response Time | Notes |
|---|---|---|---|---|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### Data Analysis

#### Data Processing
| Model | Accuracy | Code Quality | Response Time | Notes |
|---|---|---|---|---|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
#### Data Transformation
| Model | Accuracy | Code Quality | Response Time | Notes |
|---|---|---|---|---|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### Planning & Task Breakdown

#### Project Planning
| Model | Completeness | Practicality | Response Time | Notes |
|---|---|---|---|---|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
#### Debugging Strategy
| Model | Logical Flow | Practicality | Response Time | Notes |
|---|---|---|---|---|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
## Key Findings

### Strengths by Model

- **deepseek-coder-v2:lite**:
- **llama3.1:8b**:
- **glm-4.7:cloud**:
- **deepseek-v3.1:671b-cloud**:

### Weaknesses by Model

- **deepseek-coder-v2:lite**:
- **llama3.1:8b**:
- **glm-4.7:cloud**:
- **deepseek-v3.1:671b-cloud**:
### Best Model for Each Category
| Category | Winner | Runner-up |
|---|---|---|
| Code Generation | | |
| Code Analysis | | |
| Reasoning | | |
| Data Analysis | | |
| Planning | | |
| Overall (Score) | | |
| Speed (if relevant) | | |
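Once per-category scores are filled in, the winner and runner-up columns can be derived mechanically instead of by eyeballing the tables. A minimal sketch, assuming each model gets a numeric score per category; the scores below are made-up placeholders, not real benchmark results.

```python
def rank_category(scores: dict) -> tuple:
    """Return (winner, runner_up) model names for one category's scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered[0], ordered[1]


# Hypothetical placeholder scores -- replace with real results from the tables above.
code_gen_scores = {
    "deepseek-coder-v2:lite": 7.5,
    "llama3.1:8b": 6.0,
    "glm-4.7:cloud": 8.0,
    "deepseek-v3.1:671b-cloud": 9.0,
}
winner, runner_up = rank_category(code_gen_scores)
print(f"| Code Generation | {winner} | {runner_up} |")
```

The same helper works for the Overall row by averaging each model's category scores first.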
*Last Updated: YYYY-MM-DD*