Add Ollama benchmark results and model testing notes

Document local LLM benchmark results, testing methodology, and
model comparison notes for Ollama deployments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: Cal Corum, 2026-02-07 22:26:04 -06:00
parent cbdb7a6bb0
commit b186107b97
3 changed files with 520 additions and 0 deletions

ollama-benchmark-results.md (new file, 184 lines)

@@ -0,0 +1,184 @@
# Ollama Model Benchmark Results
## Summary Table
| Model | Code Gen | Code Analysis | Reasoning | Data Analysis | Planning | Overall |
|-------|----------|--------------|-----------|--------------|----------|---------|
| deepseek-coder-v2:lite | | | | | | |
| llama3.1:8b | | | | | | |
| glm-4.7:cloud | | | | | | |
| deepseek-v3.1:671b-cloud | | | | | | |
---
## Detailed Results by Category
### Code Generation
**Simple Python Function**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Class with Error Handling**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Async API Handler**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Algorithm Challenge**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### Code Analysis & Refactoring
**Bug Finding**
| Model | Accuracy | Explanation Quality | Response Time | Notes |
|-------|----------|---------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Code Refactoring**
| Model | Accuracy | Explanation Quality | Response Time | Notes |
|-------|----------|---------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Code Explanation**
| Model | Accuracy | Explanation Quality | Response Time | Notes |
|-------|----------|---------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### General Reasoning
**Logic Problem**
| Model | Accuracy | Reasoning Quality | Response Time | Notes |
|-------|----------|-------------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**System Design**
| Model | Accuracy | Detail Level | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Technical Explanation**
| Model | Accuracy | Clarity | Response Time | Notes |
|-------|----------|---------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### Data Analysis
**Data Processing**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Data Transformation**
| Model | Accuracy | Code Quality | Response Time | Notes |
|-------|----------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
### Planning & Task Breakdown
**Project Planning**
| Model | Completeness | Practicality | Response Time | Notes |
|-------|--------------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
**Debugging Strategy**
| Model | Logical Flow | Practicality | Response Time | Notes |
|-------|--------------|--------------|---------------|-------|
| deepseek-coder-v2:lite | | | | |
| llama3.1:8b | | | | |
| glm-4.7:cloud | | | | |
| deepseek-v3.1:671b-cloud | | | | |
---
## Key Findings
### Strengths by Model
**deepseek-coder-v2:lite**
-
**llama3.1:8b**
-
**glm-4.7:cloud**
-
**deepseek-v3.1:671b-cloud**
-
### Weaknesses by Model
**deepseek-coder-v2:lite**
-
**llama3.1:8b**
-
**glm-4.7:cloud**
-
**deepseek-v3.1:671b-cloud**
-
### Best Model for Each Category
| Category | Winner | Runner-up |
|----------|--------|-----------|
| Code Generation | | |
| Code Analysis | | |
| Reasoning | | |
| Data Analysis | | |
| Planning | | |
| Overall (Score) | | |
| Speed (if relevant) | | |
---
*Last Updated: YYYY-MM-DD*

ollama-benchmarks.md (new file, 215 lines)

@@ -0,0 +1,215 @@
# Ollama Model Benchmark Prompts
Use this fixed set of prompts so every model is evaluated on the same tasks under the same conditions.
---
## Code Generation Benchmarks
### Simple Python Function
```
Write a Python function that takes a list of integers and returns a new list containing only the even numbers, sorted in descending order. Include type hints and a docstring.
```
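For grading reference, one possible answer to this prompt (a sketch; other correct solutions exist):

```python
def even_desc(numbers: list[int]) -> list[int]:
    """Return the even numbers from `numbers`, sorted in descending order."""
    return sorted((n for n in numbers if n % 2 == 0), reverse=True)
```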
### Medium Complexity - Class with Error Handling
```
Create a Python class called 'BankAccount' with the following requirements:
- Constructor takes account_number (str) and initial_balance (float)
- Methods: deposit(amount), withdraw(amount), get_balance()
- Withdraw should raise ValueError if insufficient funds
- Deposit should raise ValueError if amount is negative
- Include type hints and comprehensive docstrings
```
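A reference implementation satisfying every listed requirement, for comparing model output against (docstrings abbreviated here for space):

```python
class BankAccount:
    """Simple account with guarded deposit and withdraw operations."""

    def __init__(self, account_number: str, initial_balance: float) -> None:
        self.account_number = account_number
        self._balance = initial_balance

    def deposit(self, amount: float) -> None:
        """Add `amount` to the balance; raise ValueError if it is negative."""
        if amount < 0:
            raise ValueError("deposit amount cannot be negative")
        self._balance += amount

    def withdraw(self, amount: float) -> None:
        """Remove `amount` from the balance; raise ValueError on insufficient funds."""
        if amount > self._balance:
            raise ValueError("insufficient funds")
        self._balance -= amount

    def get_balance(self) -> float:
        """Return the current balance."""
        return self._balance
```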
### Complex - Async API Handler
```
Create an async Python function that fetches data from multiple URLs concurrently using aiohttp. Requirements:
- Takes a list of URLs as input
- Uses aiohttp with proper session management
- Handles timeout errors and connection errors gracefully
- Returns a dictionary mapping URLs to their response data or error message
- Includes proper logging
- Use async/await patterns correctly
```
### Algorithm Challenge
```
Implement a function that finds the longest palindromic substring in a given string. Optimize for efficiency (aim for O(n²) or better). Provide both the implementation and a brief explanation of your approach.
```
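One reference solution meeting the O(n²) target, useful as a grading baseline (expand-around-center; Manacher's algorithm would be O(n) but is rarely produced by small models):

```python
def longest_palindrome(s: str) -> str:
    """Longest palindromic substring via expand-around-center: O(n^2) time, O(1) extra space."""
    if not s:
        return ""
    best = s[0]
    for i in range(len(s)):
        # Try both an odd-length center (i, i) and an even-length center (i, i+1).
        for lo, hi in ((i, i), (i, i + 1)):
            while lo >= 0 and hi < len(s) and s[lo] == s[hi]:
                lo -= 1
                hi += 1
            cand = s[lo + 1:hi]  # the last expansion step overshot by one on each side
            if len(cand) > len(best):
                best = cand
    return best
```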
---
## Code Analysis & Refactoring
### Bug Finding
````
Find and fix the bug in this Python code:
```python
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

result = calculate_average([])
```
Explain the bug and your fix.
````
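For grading: the bug is a ZeroDivisionError when the list is empty. One reasonable fix (raising is a judgment call; returning 0.0 is also defensible):

```python
def calculate_average(numbers: list[float]) -> float:
    """Average of `numbers`; rejects an empty list instead of dividing by zero."""
    if not numbers:
        raise ValueError("cannot average an empty list")
    return sum(numbers) / len(numbers)
```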
### Code Refactoring
````
Refactor this Python code to be more Pythonic and efficient:
```python
def get_unique_words(text):
    words = []
    for word in text.split():
        if word not in words:
            words.append(word)
    return words
```
Explain your refactoring choices.
````
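A reference refactoring to compare answers against: the original is O(n²) because of the list membership test; dict keys deduplicate in O(n) while preserving first-seen order (Python 3.7+):

```python
def get_unique_words(text: str) -> list[str]:
    """Return each word's first occurrence, in order, in O(n) time."""
    return list(dict.fromkeys(text.split()))
```

Answers using `set()` alone lose ordering, which is worth noting when scoring.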
### Code Explanation
````
Explain what this Python code does and identify any potential issues:
```python
from typing import List, Optional
def process_items(items: List[dict]) -> Optional[dict]:
    if not items:
        return None
    result = {}
    for item in items:
        key = item.get('id', 'unknown')
        if key not in result:
            result[key] = []
        result[key].append(item.get('value'))
    return result
```
````
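For grading: the code groups item values by id. Issues a strong answer might flag include the loose `Optional[dict]` return hint, `None` values leaking in via `item.get('value')`, and all items without an id collapsing under `'unknown'`. A tightened version a good answer might propose (a sketch, not the only valid rewrite):

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional

def process_items(items: List[dict]) -> Optional[Dict[str, List[Any]]]:
    """Group each item's 'value' by its 'id'; return None for an empty input."""
    if not items:
        return None
    result: Dict[str, List[Any]] = defaultdict(list)
    for item in items:
        result[item.get('id', 'unknown')].append(item.get('value'))
    return dict(result)
```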
---
## General Reasoning Benchmarks
### Logic Problem
```
You have 12 coins, all identical in appearance. One coin is counterfeit and weighs slightly more than the others. You have a balance scale and can only use it 3 times. How do you identify the counterfeit coin? Explain your reasoning step by step.
```
### System Design (High Level)
```
Design the architecture for a real-time chat application that needs to support:
- 1 million concurrent users
- Message delivery in under 100ms
- Message history for last 30 days
- Group chats with up to 1000 members
Focus on high-level architecture and technology choices, not implementation details.
```
### Technical Explanation
```
Explain the concept of dependency injection in software development. Include:
- What problem it solves
- How it works
- A simple code example in Python or your preferred language
- Pros and cons compared to direct instantiation
```
---
## Data Analysis Benchmarks
### Data Processing Task
````
Given this JSON data representing sales transactions:
```json
[
  {"id": 1, "product": "Widget", "quantity": 5, "price": 10.00, "date": "2024-01-15"},
  {"id": 2, "product": "Gadget", "quantity": 3, "price": 25.00, "date": "2024-01-16"},
  {"id": 3, "product": "Widget", "quantity": 2, "price": 10.00, "date": "2024-01-17"},
  {"id": 4, "product": "Tool", "quantity": 1, "price": 50.00, "date": "2024-01-18"},
  {"id": 5, "product": "Gadget", "quantity": 4, "price": 25.00, "date": "2024-01-19"}
]
```
Write Python code to:
1. Calculate total revenue per product
2. Find the product with highest total revenue
3. Calculate the average order value
```` 
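For grading, the expected figures from this dataset are fixed: Widget 70.00, Gadget 175.00, Tool 50.00; top product Gadget; average order value 59.00. One reference solution:

```python
from collections import defaultdict

# Transactions copied from the prompt above
SALES = [
    {"id": 1, "product": "Widget", "quantity": 5, "price": 10.00, "date": "2024-01-15"},
    {"id": 2, "product": "Gadget", "quantity": 3, "price": 25.00, "date": "2024-01-16"},
    {"id": 3, "product": "Widget", "quantity": 2, "price": 10.00, "date": "2024-01-17"},
    {"id": 4, "product": "Tool", "quantity": 1, "price": 50.00, "date": "2024-01-18"},
    {"id": 5, "product": "Gadget", "quantity": 4, "price": 25.00, "date": "2024-01-19"},
]

def analyze(transactions: list[dict]) -> tuple[dict, str, float]:
    """Return (revenue per product, top product by revenue, average order value)."""
    revenue: dict[str, float] = defaultdict(float)
    for t in transactions:
        revenue[t["product"]] += t["quantity"] * t["price"]
    top_product = max(revenue, key=revenue.get)
    avg_order = sum(revenue.values()) / len(transactions)
    return dict(revenue), top_product, avg_order
```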
### Data Transformation
````
Write a Python function that converts a list of dictionaries (with inconsistent keys) into a standardized format. Handle missing keys gracefully with default values. Example:
```python
input_data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "years_old": 25},
    {"fullname": "Charlie", "age": 35},
    {"name": "Diana", "years_old": 28, "city": "NYC"}
]
# Desired output format:
# {"name": str, "age": int, "city": str (default "Unknown")}
```
````
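A reference answer for this prompt. The prompt only pins down the `city` default, so the fallbacks chosen for `name` and `age` below are assumptions:

```python
def normalize(records: list[dict]) -> list[dict]:
    """Map inconsistent keys onto {'name', 'age', 'city'} with defaults.

    Assumed fallbacks: 'fullname' feeds 'name', 'years_old' feeds 'age',
    and 'city' defaults to 'Unknown' as the prompt specifies.
    """
    return [
        {
            "name": rec.get("name", rec.get("fullname", "Unknown")),
            "age": rec.get("age", rec.get("years_old", 0)),
            "city": rec.get("city", "Unknown"),
        }
        for rec in records
    ]
```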
---
## Planning & Task Breakdown
### Project Planning
```
Break down the task of building a simple REST API for a TODO list into smaller, actionable steps. Include:
- Technology stack choices with brief justification
- Development phases
- Key deliverables for each phase
- Estimated complexity of each step (Low/Medium/High)
```
### Debugging Strategy
```
You're experiencing intermittent failures in a production database connection pool. Describe your step-by-step debugging strategy to identify and resolve the issue.
```
---
## Benchmarking Criteria
For each prompt, rate the model on:
| Criteria | Scale | Notes |
|----------|-------|-------|
| Accuracy | 1-5 | How correct/complete is the answer? |
| Code Quality | 1-5 | Is code idiomatic, well-structured? |
| Explanations | 1-5 | Are explanations clear and thorough? |
| Response Speed | 1-5 | How fast did it respond? (subjective) |
| Follows Instructions | 1-5 | Did it follow all requirements? |
**Scoring:**
1 - Poor
2 - Below Average
3 - Average
4 - Good
5 - Excellent
---
## Testing Process
1. Run each prompt through each model
2. Record scores and qualitative notes
3. Note response times (optional but helpful)
4. Identify patterns in model strengths/weaknesses
5. Update `/mnt/NV2/Development/claude-home/ollama-model-testing.md` with findings
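The process above can be semi-automated with a small harness. `ollama run MODEL PROMPT` is the standard CLI invocation; the model list and file layout here are illustrative sketches, not a finished tool:

```python
import subprocess

# Illustrative subset; extend with the models under test.
MODELS = ["deepseek-coder-v2:lite", "llama3.1:8b"]

def build_command(model: str, prompt: str) -> list[str]:
    """Assemble the `ollama run` invocation for one model/prompt pair."""
    return ["ollama", "run", model, prompt]

def run_benchmark(model: str, prompt: str) -> str:
    """Run one prompt against one model and return its stdout.

    Requires the `ollama` CLI on PATH with the model pulled.
    """
    result = subprocess.run(
        build_command(model, prompt),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Timing each `run_benchmark` call with `time.perf_counter()` covers step 3.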
---
*Last Updated: 2026-02-04*

ollama-model-testing.md (new file, 121 lines)

@@ -0,0 +1,121 @@
# Ollama Model Testing Log
Track models tested, performance observations, and suitability for different use cases.
---
## Quick Summary
| Model | Date Tested | Primary Use Case | Rating | Notes |
|-------|-------------|------------------|--------|-------|
| GLM-4.7:cloud | 2026-02-04 | General purpose | ⭐⭐⭐⭐ | Cloud-hosted, fast, good reasoning |
| deepseek-v3.1:671b-cloud | 2026-02-04 | Complex reasoning | ⭐⭐⭐⭐⭐ | Cloud, very capable, slower response |
| | | | | |
---
## Model Testing Details
### GLM-4.7:cloud
**Date Tested:** 2026-02-04
**Model Info:**
- Size/Parameters: Unknown (cloud)
- Quantization: N/A (cloud)
- Base Model: GLM-4.7 by Zhipu AI
**Performance:**
- Response Speed: Fast
- RAM/VRAM Usage: Cloud (local minimal)
- Context Window: 128k
**Testing Use Cases:**
- [x] Code generation
- [x] General Q&A
- [ ] Creative writing
- [x] Data analysis
- [ ] Task planning
- [ ] Other:
**Observations:**
- Strengths: Fast response, good at general reasoning
- Weaknesses: Cloud dependency
- Resource requirements: Minimal local resources
- Output quality: Solid for most tasks
- When to use this model: Daily tasks, coding help, general assistance
**Verdict:** ⭐⭐⭐⭐
---
### deepseek-v3.1:671b-cloud
**Date Tested:** 2026-02-04
**Model Info:**
- Size/Parameters: 671B (cloud)
- Quantization: N/A (cloud)
- Base Model: DeepSeek-V3 by DeepSeek
**Performance:**
- Response Speed: Moderate (671B model)
- RAM/VRAM Usage: Cloud (local minimal)
- Context Window: 128k+
**Testing Use Cases:**
- [x] Code generation
- [x] General Q&A
- [ ] Creative writing
- [x] Data analysis
- [x] Task planning
- [ ] Other:
**Observations:**
- Strengths: Very capable, excellent reasoning, great with complex tasks
- Weaknesses: Slower response, cloud dependency
- Resource requirements: Minimal local resources
- Output quality: Top-tier, handles complex multi-step reasoning well
- When to use this model: Complex coding tasks, deep analysis, planning
**Verdict:** ⭐⭐⭐⭐⭐
---
## Models to Test
### Local Models (16GB GPU Compatible)
**Small & Fast (2-6GB VRAM at Q4):**
- [ ] phi3:mini - 3.8B params, great for quick tasks ~2.2GB
- [ ] llama3.1:8b - 8B params, excellent all-rounder ~4.7GB
- [ ] qwen2.5:7b - 7B params, strong reasoning ~4.3GB
- [ ] gemma2:9b - 9B params, Google's small model ~5.5GB
**Medium Capability (6-10GB VRAM at Q4):**
- [ ] mistral:7b - 7B params, classic workhorse ~4.1GB
- [ ] llama3.1:14b - 14B params, higher quality ~8.2GB
- [ ] qwen2.5:14b - 14B params, strong multilingual ~8.1GB
**Specialized:**
- [ ] deepseek-coder-v2:lite - 16B params, optimized for coding ~8.7GB
- [ ] codellama:7b - 7B params, coding specialist ~4.1GB
---
## General Notes
*Any overall observations, preferences, or patterns discovered during testing.*
**Initial Impressions:**
- Cloud models (GLM-4.7, DeepSeek-V3) provide excellent quality without local resources
- Planning to test local models for privacy, offline use, and comparing quality/speed trade-offs
- Focus will be on models that fit comfortably in 16GB VRAM for smooth performance
**VRAM Estimates at Q4 Quantization:**
- 3B-4B models: ~2-3GB
- 7B-8B models: ~4-5GB
- 14B models: ~8-9GB
- Leaves room for context window and system overhead
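The rule of thumb behind these estimates can be written down explicitly. The 20% overhead factor for KV cache and runtime buffers is an assumption, but it reproduces the figures above reasonably well (e.g. 8B at Q4 → ~4.8GB vs the ~4.7GB listed):

```python
def estimate_vram_gb(params_billion: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes at `bits` per parameter, plus ~20% overhead.

    1B parameters at 8 bits is ~1GB, so weights take params * bits / 8 GB.
    """
    weight_gb = params_billion * bits / 8
    return round(weight_gb * overhead, 1)
```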
---
*Last Updated: 2026-02-04*