Add Ollama benchmark results and model testing notes
Document local LLM benchmark results, testing methodology, and model comparison notes for Ollama deployments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
parent
cbdb7a6bb0
commit
b186107b97
184
ollama-benchmark-results.md
Normal file
@ -0,0 +1,184 @@
# Ollama Model Benchmark Results

## Summary Table

| Model | Code Gen | Code Analysis | Reasoning | Data Analysis | Planning | Overall |
|-------|----------|--------------|-----------|--------------|----------|---------|
| deepseek-coder-v2:lite | | | | | | |
| llama3.1:8b | | | | | | |
| glm-4.7:cloud | | | | | | |
| deepseek-v3.1:671b-cloud | | | | | | |

---

## Detailed Results by Category

### Code Generation

**Simple Python Function**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Class with Error Handling**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Async API Handler**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Algorithm Challenge**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

### Code Analysis & Refactoring

**Bug Finding**
| Model | Accuracy | Explanation Quality | Response Time | Notes |
|-------|----------|---------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Code Refactoring**
| Model | Accuracy | Explanation Quality | Response Time | Notes |
|-------|----------|---------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Code Explanation**
| Model | Accuracy | Explanation Quality | Response Time | Notes |
|-------|----------|---------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

### General Reasoning

**Logic Problem**
| Model | Accuracy | Reasoning Quality | Response Time | Notes |
|-------|----------|-------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**System Design**
| Model | Accuracy | Detail Level | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Technical Explanation**
| Model | Accuracy | Clarity | Response Time | Notes |
|-------|----------|---------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

### Data Analysis

**Data Processing**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Data Transformation**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

### Planning & Task Breakdown

**Project Planning**
| Model | Completeness | Practicality | Response Time | Notes |
|-------|--------------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

**Debugging Strategy**
| Model | Logical Flow | Practicality | Response Time | Notes |
|-------|--------------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |

---

## Key Findings

### Strengths by Model

**deepseek-coder-v2:lite**
-

**llama3.1:8b**
-

**glm-4.7:cloud**
-

**deepseek-v3.1:671b-cloud**
-

### Weaknesses by Model

**deepseek-coder-v2:lite**
-

**llama3.1:8b**
-

**glm-4.7:cloud**
-

**deepseek-v3.1:671b-cloud**
-

### Best Model for Each Category

| Category | Winner | Runner-up |
|----------|--------|-----------|
| Code Generation | | |
| Code Analysis | | |
| Reasoning | | |
| Data Analysis | | |
| Planning | | |
| Overall (Score) | | |
| Speed (if relevant) | | |

---

*Last Updated: YYYY-MM-DD*
215
ollama-benchmarks.md
Normal file
@ -0,0 +1,215 @@
# Ollama Model Benchmark Prompts

Use these consistent prompts to evaluate different models across similar tasks.

---

## Code Generation Benchmarks

### Simple Python Function
```
Write a Python function that takes a list of integers and returns a new list containing only the even numbers, sorted in descending order. Include type hints and a docstring.
```
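
For calibration when scoring model answers, a reference solution (one reasonable shape among several) could be:

```python
def filter_even_desc(numbers: list[int]) -> list[int]:
    """Return only the even numbers from `numbers`, sorted in descending order."""
    return sorted((n for n in numbers if n % 2 == 0), reverse=True)

# filter_even_desc([3, 8, 2, 5, 10]) -> [10, 8, 2]
```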

### Medium Complexity - Class with Error Handling
```
Create a Python class called 'BankAccount' with the following requirements:
- Constructor takes account_number (str) and initial_balance (float)
- Methods: deposit(amount), withdraw(amount), get_balance()
- Withdraw should raise ValueError if insufficient funds
- Deposit should raise ValueError if amount is negative
- Include type hints and comprehensive docstrings
```
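
A minimal reference implementation meeting the stated requirements (the private `_balance` attribute and error messages are implementation choices, not part of the prompt):

```python
class BankAccount:
    """A simple bank account with balance validation."""

    def __init__(self, account_number: str, initial_balance: float) -> None:
        """Create an account with the given number and starting balance."""
        self.account_number = account_number
        self._balance = initial_balance

    def deposit(self, amount: float) -> None:
        """Add `amount` to the balance; raises ValueError for negative amounts."""
        if amount < 0:
            raise ValueError("Deposit amount cannot be negative")
        self._balance += amount

    def withdraw(self, amount: float) -> None:
        """Remove `amount` from the balance; raises ValueError on insufficient funds."""
        if amount > self._balance:
            raise ValueError("Insufficient funds")
        self._balance -= amount

    def get_balance(self) -> float:
        """Return the current balance."""
        return self._balance
```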

### Complex - Async API Handler
```
Create an async Python function that fetches data from multiple URLs concurrently using aiohttp. Requirements:
- Takes a list of URLs as input
- Uses aiohttp with proper session management
- Handles timeout errors and connection errors gracefully
- Returns a dictionary mapping URLs to their response data or error message
- Includes proper logging
- Uses async/await patterns correctly
```

### Algorithm Challenge
```
Implement a function that finds the longest palindromic substring in a given string. Optimize for efficiency (aim for O(n²) or better). Provide both the implementation and a brief explanation of your approach.
```
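
A reference answer for grading against: the classic expand-around-center approach, which runs in O(n²) time and O(1) extra space:

```python
def longest_palindrome(s: str) -> str:
    """Longest palindromic substring via expand-around-center; O(n^2) time."""
    if not s:
        return ""
    best = s[0]
    for i in range(len(s)):
        # Check both odd-length (center at i) and even-length (center between i, i+1).
        for left, right in ((i, i), (i, i + 1)):
            while left >= 0 and right < len(s) and s[left] == s[right]:
                left -= 1
                right += 1
            # The loop overshoots by one step on each side before stopping.
            candidate = s[left + 1:right]
            if len(candidate) > len(best):
                best = candidate
    return best
```

A stronger answer might mention Manacher's algorithm, which achieves O(n), as the follow-up optimization.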

---

## Code Analysis & Refactoring

### Bug Finding
````
Find and fix the bug in this Python code:

```python
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

result = calculate_average([])
```

Explain the bug and your fix.
````
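
The planted bug is a `ZeroDivisionError` when the list is empty. A fixed version (returning 0.0 for an empty list is one possible contract; raising `ValueError` is another):

```python
def calculate_average(numbers):
    """Arithmetic mean of `numbers`; returns 0.0 for an empty list
    instead of raising ZeroDivisionError."""
    if not numbers:
        return 0.0
    return sum(numbers) / len(numbers)
```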

### Code Refactoring
````
Refactor this Python code to be more Pythonic and efficient:

```python
def get_unique_words(text):
    words = []
    for word in text.split():
        if word not in words:
            words.append(word)
    return words
```

Explain your refactoring choices.
````
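
A typical "most Pythonic" answer uses `dict.fromkeys`, which deduplicates in O(n) while preserving first-seen order (dict keys keep insertion order in Python 3.7+), versus the original's O(n²) list-membership checks:

```python
def get_unique_words(text: str) -> list[str]:
    """Unique words from `text` in first-seen order."""
    return list(dict.fromkeys(text.split()))
```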

### Code Explanation
````
Explain what this Python code does and identify any potential issues:

```python
from typing import List, Optional

def process_items(items: List[dict]) -> Optional[dict]:
    if not items:
        return None

    result = {}
    for item in items:
        key = item.get('id', 'unknown')
        if key not in result:
            result[key] = []
        result[key].append(item.get('value'))

    return result
```
````

---

## General Reasoning Benchmarks

### Logic Problem
```
You have 12 coins, all identical in appearance. One coin is counterfeit and weighs slightly more than the others. You have a balance scale and can only use it 3 times. How do you identify the counterfeit coin? Explain your reasoning step by step.
```

### System Design (High Level)
```
Design the architecture for a real-time chat application that needs to support:
- 1 million concurrent users
- Message delivery in under 100ms
- Message history for last 30 days
- Group chats with up to 1000 members

Focus on high-level architecture and technology choices, not implementation details.
```

### Technical Explanation
```
Explain the concept of dependency injection in software development. Include:
- What problem it solves
- How it works
- A simple code example in Python or your preferred language
- Pros and cons compared to direct instantiation
```
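
A sketch of the kind of answer expected, using constructor injection (the class names here are illustrative, not part of the prompt):

```python
class Notifier:
    """Abstract dependency: anything with a send() method can be injected."""
    def send(self, message: str) -> None:
        raise NotImplementedError

class EmailNotifier(Notifier):
    def send(self, message: str) -> None:
        print(f"Emailing: {message}")

class OrderService:
    """Depends on the Notifier abstraction, injected via the constructor,
    instead of instantiating EmailNotifier directly inside the class."""
    def __init__(self, notifier: Notifier) -> None:
        self.notifier = notifier

    def place_order(self, item: str) -> None:
        self.notifier.send(f"Order placed: {item}")

# In tests, inject a fake instead of a real email sender:
class FakeNotifier(Notifier):
    def __init__(self) -> None:
        self.sent: list[str] = []
    def send(self, message: str) -> None:
        self.sent.append(message)
```

The key trade-off to look for in answers: injection decouples the service from concrete implementations and makes it testable, at the cost of more wiring code.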

---

## Data Analysis Benchmarks

### Data Processing Task
````
Given this JSON data representing sales transactions:

```json
[
  {"id": 1, "product": "Widget", "quantity": 5, "price": 10.00, "date": "2024-01-15"},
  {"id": 2, "product": "Gadget", "quantity": 3, "price": 25.00, "date": "2024-01-16"},
  {"id": 3, "product": "Widget", "quantity": 2, "price": 10.00, "date": "2024-01-17"},
  {"id": 4, "product": "Tool", "quantity": 1, "price": 50.00, "date": "2024-01-18"},
  {"id": 5, "product": "Gadget", "quantity": 4, "price": 25.00, "date": "2024-01-19"}
]
```

Write Python code to:
1. Calculate total revenue per product
2. Find the product with highest total revenue
3. Calculate the average order value
````
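
For checking model answers against known-good numbers, one straightforward solution (totals verified by hand: Widget 70.00, Gadget 175.00, Tool 50.00; average order value 59.00):

```python
from collections import defaultdict

transactions = [
    {"id": 1, "product": "Widget", "quantity": 5, "price": 10.00, "date": "2024-01-15"},
    {"id": 2, "product": "Gadget", "quantity": 3, "price": 25.00, "date": "2024-01-16"},
    {"id": 3, "product": "Widget", "quantity": 2, "price": 10.00, "date": "2024-01-17"},
    {"id": 4, "product": "Tool", "quantity": 1, "price": 50.00, "date": "2024-01-18"},
    {"id": 5, "product": "Gadget", "quantity": 4, "price": 25.00, "date": "2024-01-19"},
]

# 1. Total revenue per product
revenue = defaultdict(float)
for t in transactions:
    revenue[t["product"]] += t["quantity"] * t["price"]

# 2. Product with the highest total revenue -> "Gadget" (175.00)
top_product = max(revenue, key=revenue.get)

# 3. Average order value: total revenue / number of transactions -> 59.00
avg_order = sum(t["quantity"] * t["price"] for t in transactions) / len(transactions)
```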

### Data Transformation
````
Write a Python function that converts a list of dictionaries (with inconsistent keys) into a standardized format. Handle missing keys gracefully with default values. Example:

```python
input_data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "years_old": 25},
    {"fullname": "Charlie", "age": 35},
    {"name": "Diana", "years_old": 28, "city": "NYC"}
]

# Desired output format:
# {"name": str, "age": int, "city": str (default "Unknown")}
```
````
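
One possible grading reference; the key aliases handled (`years_old`, `fullname`) mirror the example input, and the `"Unknown"` / `0` defaults are assumptions beyond what the prompt specifies for `name` and `age`:

```python
def standardize(records: list[dict]) -> list[dict]:
    """Normalize records with inconsistent keys to {name, age, city},
    filling in defaults where a field is missing entirely."""
    out = []
    for rec in records:
        out.append({
            "name": rec.get("name") or rec.get("fullname") or "Unknown",
            "age": rec.get("age", rec.get("years_old", 0)),
            "city": rec.get("city", "Unknown"),
        })
    return out
```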

---

## Planning & Task Breakdown

### Project Planning
```
Break down the task of building a simple REST API for a TODO list into smaller, actionable steps. Include:
- Technology stack choices with brief justification
- Development phases
- Key deliverables for each phase
- Estimated complexity of each step (Low/Medium/High)
```

### Debugging Strategy
```
You're experiencing intermittent failures in a production database connection pool. Describe your step-by-step debugging strategy to identify and resolve the issue.
```

---

## Benchmarking Criteria

For each prompt, rate the model on:

| Criteria | Scale | Notes |
|----------|-------|-------|
| Accuracy | 1-5 | How correct/complete is the answer? |
| Code Quality | 1-5 | Is the code idiomatic and well-structured? |
| Explanations | 1-5 | Are explanations clear and thorough? |
| Response Speed | 1-5 | How fast did it respond? (subjective) |
| Follows Instructions | 1-5 | Did it follow all requirements? |

**Scoring:**
1. Poor
2. Below Average
3. Average
4. Good
5. Excellent
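
To roll per-prompt ratings up into the summary-table numbers, an unweighted mean across the five criteria is the simplest scheme. The score sheet below is a hypothetical layout with placeholder values, not real results:

```python
# Hypothetical score sheet: {model: {criterion: 1-5 rating}} — values are placeholders.
scores = {
    "llama3.1:8b": {"Accuracy": 4, "Code Quality": 3, "Explanations": 4,
                    "Response Speed": 5, "Follows Instructions": 4},
    "deepseek-coder-v2:lite": {"Accuracy": 4, "Code Quality": 5, "Explanations": 3,
                               "Response Speed": 4, "Follows Instructions": 4},
}

def overall(ratings: dict[str, int]) -> float:
    """Unweighted mean of the per-criterion ratings, rounded to 2 decimals."""
    return round(sum(ratings.values()) / len(ratings), 2)

for model, ratings in scores.items():
    print(f"{model}: {overall(ratings)}")
```

Weighting (e.g. doubling Accuracy) is an easy extension if some criteria matter more for a given use case.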

---

## Testing Process

1. Run each prompt through each model
2. Record scores and qualitative notes
3. Note response times (optional but helpful)
4. Identify patterns in model strengths/weaknesses
5. Update `/mnt/NV2/Development/claude-home/ollama-model-testing.md` with findings

---

*Last Updated: 2026-02-04*
121
ollama-model-testing.md
Normal file
@ -0,0 +1,121 @@
# Ollama Model Testing Log

Track models tested, performance observations, and suitability for different use cases.

---

## Quick Summary

| Model | Date Tested | Primary Use Case | Rating | Notes |
|-------|-------------|------------------|--------|-------|
| GLM-4.7:cloud | 2026-02-04 | General purpose | ⭐⭐⭐⭐ | Cloud-hosted, fast, good reasoning |
| deepseek-v3.1:671b-cloud | 2026-02-04 | Complex reasoning | ⭐⭐⭐⭐⭐ | Cloud, very capable, slower response |
| | | | | |

---

## Model Testing Details

### GLM-4.7:cloud
**Date Tested:** 2026-02-04

**Model Info:**
- Size/Parameters: Unknown (cloud)
- Quantization: N/A (cloud)
- Base Model: GLM-4.7 by Zhipu AI

**Performance:**
- Response Speed: Fast
- RAM/VRAM Usage: Cloud (local minimal)
- Context Window: 128k

**Testing Use Cases:**
- [x] Code generation
- [x] General Q&A
- [ ] Creative writing
- [x] Data analysis
- [ ] Task planning
- [ ] Other:

**Observations:**
- Strengths: Fast response, good at general reasoning
- Weaknesses: Cloud dependency
- Resource requirements: Minimal local resources
- Output quality: Solid for most tasks
- When to use this model: Daily tasks, coding help, general assistance

**Verdict:** ⭐⭐⭐⭐

---

### deepseek-v3.1:671b-cloud
**Date Tested:** 2026-02-04

**Model Info:**
- Size/Parameters: 671B (cloud)
- Quantization: N/A (cloud)
- Base Model: DeepSeek-V3.1 by DeepSeek

**Performance:**
- Response Speed: Moderate (671B model)
- RAM/VRAM Usage: Cloud (local minimal)
- Context Window: 128k+

**Testing Use Cases:**
- [x] Code generation
- [x] General Q&A
- [ ] Creative writing
- [x] Data analysis
- [x] Task planning
- [ ] Other:

**Observations:**
- Strengths: Very capable, excellent reasoning, great with complex tasks
- Weaknesses: Slower responses, cloud dependency
- Resource requirements: Minimal local resources
- Output quality: Top-tier, handles complex multi-step reasoning well
- When to use this model: Complex coding tasks, deep analysis, planning

**Verdict:** ⭐⭐⭐⭐⭐

---

## Models to Test

### Local Models (16GB GPU Compatible)

**Small & Fast (2-6GB VRAM at Q4):**
- [ ] phi3:mini - 3.8B params, great for quick tasks (~2.2GB)
- [ ] llama3.1:8b - 8B params, excellent all-rounder (~4.7GB)
- [ ] qwen2.5:7b - 7B params, strong reasoning (~4.3GB)
- [ ] gemma2:9b - 9B params, Google's small model (~5.5GB)

**Medium Capability (6-10GB VRAM at Q4):**
- [ ] mistral:7b - 7B params, classic workhorse (~4.1GB)
- [ ] llama3.1:14b - 14B params, higher quality (~8.2GB)
- [ ] qwen2.5:14b - 14B params, strong multilingual (~8.1GB)

**Specialized:**
- [ ] deepseek-coder-v2:lite - 16B params, optimized for coding (~8.7GB)
- [ ] codellama:7b - 7B params, coding specialist (~4.1GB)

---

## General Notes

*Any overall observations, preferences, or patterns discovered during testing.*

**Initial Impressions:**
- Cloud models (GLM-4.7, DeepSeek-V3) provide excellent quality without local resources
- Planning to test local models for privacy, offline use, and comparing quality/speed trade-offs
- Focus will be on models that fit comfortably in 16GB VRAM for smooth performance

**VRAM Estimates at Q4 Quantization:**
- 3B-4B models: ~2-3GB
- 7B-8B models: ~4-5GB
- 14B models: ~8-9GB
- Leaves room for context window and system overhead
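
The estimates above follow a simple rule of thumb that can be sketched as code. The ~4.5 bits/param figure for Q4 (quantization block scales push the average above 4.0) and the flat 1GB runtime allowance are assumptions, not measurements; actual usage depends on quant variant and context length:

```python
def est_vram_gb(params_billions: float, bits_per_param: float = 4.5,
                overhead_gb: float = 1.0) -> float:
    """Rough Q4 VRAM estimate: weight size at ~4.5 bits/param plus a flat
    allowance for KV cache and runtime. Rule of thumb only, not a measurement."""
    weights_gb = params_billions * bits_per_param / 8
    return round(weights_gb + overhead_gb, 1)

# est_vram_gb(14) -> 8.9, in line with the ~8-9GB estimate for 14B models
```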

---

*Last Updated: 2026-02-04*