From b186107b97c9bf18c6af1c37ec022a8eb0f0ff3e Mon Sep 17 00:00:00 2001 From: Cal Corum Date: Sat, 7 Feb 2026 22:26:04 -0600 Subject: [PATCH] Add Ollama benchmark results and model testing notes Document local LLM benchmark results, testing methodology, and model comparison notes for Ollama deployments. Co-Authored-By: Claude Opus 4.6 --- ollama-benchmark-results.md | 184 ++++++++++++++++++++++++++++++ ollama-benchmarks.md | 215 ++++++++++++++++++++++++++++++++++++ ollama-model-testing.md | 121 ++++++++++++++++++++ 3 files changed, 520 insertions(+) create mode 100644 ollama-benchmark-results.md create mode 100644 ollama-benchmarks.md create mode 100644 ollama-model-testing.md diff --git a/ollama-benchmark-results.md b/ollama-benchmark-results.md new file mode 100644 index 0000000..1357785 --- /dev/null +++ b/ollama-benchmark-results.md @@ -0,0 +1,184 @@ +# Ollama Model Benchmark Results + +## Summary Table + +| Model | Code Gen | Code Analysis | Reasoning | Data Analysis | Planning | Overall | +|-------|----------|--------------|-----------|--------------|----------|---------| +| deepseek-coder-v2:lite | | | | | | | +| llama3.1:8b | | | | | | | +| glm-4.7:cloud | | | | | | | +| deepseek-v3.1:671b-cloud | | | | | | | + +--- + +## Detailed Results by Category + +### Code Generation + +**Simple Python Function** +| Model | Accuracy | Code Quality | Response Time | Notes | +|-------|----------|--------------|---------------|-------| +| deepseek-coder-v2:lite | | | | | +| llama3.1:8b | | | | | +| glm-4.7:cloud | | | | | +| deepseek-v3.1:671b-cloud | | | | | + +**Class with Error Handling** +| Model | Accuracy | Code Quality | Response Time | Notes | +|-------|----------|--------------|---------------|-------| +| deepseek-coder-v2:lite | | | | | +| llama3.1:8b | | | | | +| glm-4.7:cloud | | | | | +| deepseek-v3.1:671b-cloud | | | | | + +**Async API Handler** +| Model | Accuracy | Code Quality | Response Time | Notes | 
+|-------|----------|--------------|---------------|-------| +| deepseek-coder-v2:lite | | | | | +| llama3.1:8b | | | | | +| glm-4.7:cloud | | | | | +| deepseek-v3.1:671b-cloud | | | | | + +**Algorithm Challenge** +| Model | Accuracy | Code Quality | Response Time | Notes | +|-------|----------|--------------|---------------|-------| +| deepseek-coder-v2:lite | | | | | +| llama3.1:8b | | | | | +| glm-4.7:cloud | | | | | +| deepseek-v3.1:671b-cloud | | | | | + +### Code Analysis & Refactoring + +**Bug Finding** +| Model | Accuracy | Explanation Quality | Response Time | Notes | +|-------|----------|---------------------|---------------|-------| +| deepseek-coder-v2:lite | | | | | +| llama3.1:8b | | | | | +| glm-4.7:cloud | | | | | +| deepseek-v3.1:671b-cloud | | | | | + +**Code Refactoring** +| Model | Accuracy | Explanation Quality | Response Time | Notes | +|-------|----------|---------------------|---------------|-------| +| deepseek-coder-v2:lite | | | | | +| llama3.1:8b | | | | | +| glm-4.7:cloud | | | | | +| deepseek-v3.1:671b-cloud | | | | | + +**Code Explanation** +| Model | Accuracy | Explanation Quality | Response Time | Notes | +|-------|----------|---------------------|---------------|-------| +| deepseek-coder-v2:lite | | | | | +| llama3.1:8b | | | | | +| glm-4.7:cloud | | | | | +| deepseek-v3.1:671b-cloud | | | | | + +### General Reasoning + +**Logic Problem** +| Model | Accuracy | Reasoning Quality | Response Time | Notes | +|-------|----------|-------------------|---------------|-------| +| deepseek-coder-v2:lite | | | | | +| llama3.1:8b | | | | | +| glm-4.7:cloud | | | | | +| deepseek-v3.1:671b-cloud | | | | | + +**System Design** +| Model | Accuracy | Detail Level | Response Time | Notes | +|-------|----------|--------------|---------------|-------| +| deepseek-coder-v2:lite | | | | | +| llama3.1:8b | | | | | +| glm-4.7:cloud | | | | | +| deepseek-v3.1:671b-cloud | | | | | + +**Technical Explanation** +| Model | Accuracy | Clarity | Response Time | 
Notes |
+|-------|----------|---------|---------------|-------|
+| deepseek-coder-v2:lite | | | | |
+| llama3.1:8b | | | | |
+| glm-4.7:cloud | | | | |
+| deepseek-v3.1:671b-cloud | | | | |
+
+### Data Analysis
+
+**Data Processing**
+| Model | Accuracy | Code Quality | Response Time | Notes |
+|-------|----------|--------------|---------------|-------|
+| deepseek-coder-v2:lite | | | | |
+| llama3.1:8b | | | | |
+| glm-4.7:cloud | | | | |
+| deepseek-v3.1:671b-cloud | | | | |
+
+**Data Transformation**
+| Model | Accuracy | Code Quality | Response Time | Notes |
+|-------|----------|--------------|---------------|-------|
+| deepseek-coder-v2:lite | | | | |
+| llama3.1:8b | | | | |
+| glm-4.7:cloud | | | | |
+| deepseek-v3.1:671b-cloud | | | | |
+
+### Planning & Task Breakdown
+
+**Project Planning**
+| Model | Completeness | Practicality | Response Time | Notes |
+|-------|--------------|--------------|---------------|-------|
+| deepseek-coder-v2:lite | | | | |
+| llama3.1:8b | | | | |
+| glm-4.7:cloud | | | | |
+| deepseek-v3.1:671b-cloud | | | | |
+
+**Debugging Strategy**
+| Model | Logical Flow | Practicality | Response Time | Notes |
+|-------|--------------|--------------|---------------|-------|
+| deepseek-coder-v2:lite | | | | |
+| llama3.1:8b | | | | |
+| glm-4.7:cloud | | | | |
+| deepseek-v3.1:671b-cloud | | | | |
+
+---
+
+## Key Findings
+
+### Strengths by Model
+
+**deepseek-coder-v2:lite**
+-
+
+**llama3.1:8b**
+-
+
+**glm-4.7:cloud**
+-
+
+**deepseek-v3.1:671b-cloud**
+-
+
+### Weaknesses by Model
+
+**deepseek-coder-v2:lite**
+-
+
+**llama3.1:8b**
+-
+
+**glm-4.7:cloud**
+-
+
+**deepseek-v3.1:671b-cloud**
+-
+
+### Best Model for Each Category
+
+| Category | Winner | Runner-up |
+|----------|--------|-----------|
+| Code Generation | | |
+| Code Analysis | | |
+| Reasoning | | |
+| Data Analysis | | |
+| Planning | | |
+| Overall (Score) | | |
+| Speed (if relevant) | | |
+
+---
+
+*Last Updated: YYYY-MM-DD*
diff --git a/ollama-benchmarks.md 
b/ollama-benchmarks.md
new file mode 100644
index 0000000..efbb6ca
--- /dev/null
+++ b/ollama-benchmarks.md
@@ -0,0 +1,215 @@
+# Ollama Model Benchmark Prompts
+
+Use these consistent prompts to evaluate different models across similar tasks.
+
+---
+
+## Code Generation Benchmarks
+
+### Simple Python Function
+```
+Write a Python function that takes a list of integers and returns a new list containing only the even numbers, sorted in descending order. Include type hints and a docstring.
+```
+
+### Medium Complexity - Class with Error Handling
+```
+Create a Python class called 'BankAccount' with the following requirements:
+- Constructor takes account_number (str) and initial_balance (float)
+- Methods: deposit(amount), withdraw(amount), get_balance()
+- Withdraw should raise ValueError if insufficient funds
+- Deposit should raise ValueError if amount is negative
+- Include type hints and comprehensive docstrings
+```
+
+### Complex - Async API Handler
+```
+Create an async Python function that fetches data from multiple URLs concurrently using aiohttp. Requirements:
+- Takes a list of URLs as input
+- Uses aiohttp with proper session management
+- Handles timeout errors and connection errors gracefully
+- Returns a dictionary mapping URLs to their response data or error message
+- Includes proper logging
+- Use async/await patterns correctly
+```
+
+### Algorithm Challenge
+```
+Implement a function that finds the longest palindromic substring in a given string. Optimize for efficiency (aim for O(n²) or better). Provide both the implementation and a brief explanation of your approach.
+```
+
+---
+
+## Code Analysis & Refactoring
+
+### Bug Finding
+````
+Find and fix the bug in this Python code:
+```python
+def calculate_average(numbers):
+    total = 0
+    for num in numbers:
+        total += num
+    return total / len(numbers)
+
+result = calculate_average([])
+```
+
+Explain the bug and your fix.
+````
+
+### Code Refactoring
+````
+Refactor this Python code to be more Pythonic and efficient:
+```python
+def get_unique_words(text):
+    words = []
+    for word in text.split():
+        if word not in words:
+            words.append(word)
+    return words
+```
+
+Explain your refactoring choices.
+````
+
+### Code Explanation
+````
+Explain what this Python code does and identify any potential issues:
+```python
+from typing import List, Optional
+
+def process_items(items: List[dict]) -> Optional[dict]:
+    if not items:
+        return None
+
+    result = {}
+    for item in items:
+        key = item.get('id', 'unknown')
+        if key not in result:
+            result[key] = []
+        result[key].append(item.get('value'))
+
+    return result
+```
+````
+
+---
+
+## General Reasoning Benchmarks
+
+### Logic Problem
+```
+You have 12 coins, all identical in appearance. One coin is counterfeit and weighs slightly more than the others. You have a balance scale and can only use it 3 times. How do you identify the counterfeit coin? Explain your reasoning step by step.
+```
+
+### System Design (High Level)
+```
+Design the architecture for a real-time chat application that needs to support:
+- 1 million concurrent users
+- Message delivery in under 100ms
+- Message history for last 30 days
+- Group chats with up to 1000 members
+
+Focus on high-level architecture and technology choices, not implementation details.
+```
+
+### Technical Explanation
+```
+Explain the concept of dependency injection in software development. 
Include:
+- What problem it solves
+- How it works
+- A simple code example in Python or your preferred language
+- Pros and cons compared to direct instantiation
+```
+
+---
+
+## Data Analysis Benchmarks
+
+### Data Processing Task
+````
+Given this JSON data representing sales transactions:
+```json
+[
+  {"id": 1, "product": "Widget", "quantity": 5, "price": 10.00, "date": "2024-01-15"},
+  {"id": 2, "product": "Gadget", "quantity": 3, "price": 25.00, "date": "2024-01-16"},
+  {"id": 3, "product": "Widget", "quantity": 2, "price": 10.00, "date": "2024-01-17"},
+  {"id": 4, "product": "Tool", "quantity": 1, "price": 50.00, "date": "2024-01-18"},
+  {"id": 5, "product": "Gadget", "quantity": 4, "price": 25.00, "date": "2024-01-19"}
+]
+```
+
+Write Python code to:
+1. Calculate total revenue per product
+2. Find the product with highest total revenue
+3. Calculate the average order value
+````
+
+### Data Transformation
+````
+Write a Python function that converts a list of dictionaries (with inconsistent keys) into a standardized format. Handle missing keys gracefully with default values. Example:
+```python
+input_data = [
+    {"name": "Alice", "age": 30},
+    {"name": "Bob", "years_old": 25},
+    {"fullname": "Charlie", "age": 35},
+    {"name": "Diana", "years_old": 28, "city": "NYC"}
+]
+
+# Desired output format:
+# {"name": str, "age": int, "city": str (default "Unknown")}
+```
+````
+
+---
+
+## Planning & Task Breakdown
+
+### Project Planning
+```
+Break down the task of building a simple REST API for a TODO list into smaller, actionable steps. Include:
+- Technology stack choices with brief justification
+- Development phases
+- Key deliverables for each phase
+- Estimated complexity of each step (Low/Medium/High)
+```
+
+### Debugging Strategy
+```
+You're experiencing intermittent failures in a production database connection pool. Describe your step-by-step debugging strategy to identify and resolve the issue. 
+``` + +--- + +## Benchmarking Criteria + +For each prompt, rate the model on: + +| Criteria | Scale | Notes | +|----------|-------|-------| +| Accuracy | 1-5 | How correct/complete is the answer? | +| Code Quality | 1-5 | Is code idiomatic, well-structured? | +| Explanations | 1-5 | Are explanations clear and thorough? | +| Response Speed | 1-5 | How fast did it respond? (subjective) | +| Follows Instructions | 1-5 | Did it follow all requirements? | + +**Scoring:** +1 - Poor +2 - Below Average +3 - Average +4 - Good +5 - Excellent + +--- + +## Testing Process + +1. Run each prompt through each model +2. Record scores and qualitative notes +3. Note response times (optional but helpful) +4. Identify patterns in model strengths/weaknesses +5. Update `/mnt/NV2/Development/claude-home/ollama-model-testing.md` with findings + +--- + +*Last Updated: 2026-02-04* diff --git a/ollama-model-testing.md b/ollama-model-testing.md new file mode 100644 index 0000000..6915487 --- /dev/null +++ b/ollama-model-testing.md @@ -0,0 +1,121 @@ +# Ollama Model Testing Log + +Track models tested, performance observations, and suitability for different use cases. 
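The "Response Speed" observations in this log are subjective. They can be made reproducible by timing calls against a local Ollama server. A minimal sketch, assuming the default endpoint at `localhost:11434` and the documented `/api/generate` response fields (`eval_count` and `eval_duration` in nanoseconds); the `tokens_per_second` helper is pure, so it can be sanity-checked without a running server:

```python
import json
import urllib.request

# Ollama's default local API endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's eval_count / eval_duration (nanoseconds) into tokens/sec."""
    return round(eval_count / (eval_duration_ns / 1e9), 1)

def time_prompt(model: str, prompt: str) -> dict:
    """Run one benchmark prompt against a local Ollama server and report its speed."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return {
        "model": model,
        "tokens_per_s": tokens_per_second(body["eval_count"], body["eval_duration"]),
        "response": body["response"],
    }
```

For example, `time_prompt("llama3.1:8b", "Explain dependency injection.")` lets the Response Speed notes below be filled from `tokens_per_s` rather than a gut feeling.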
+ +--- + +## Quick Summary + +| Model | Date Tested | Primary Use Case | Rating | Notes | +|-------|-------------|------------------|--------|-------| +| GLM-4.7:cloud | 2026-02-04 | General purpose | ⭐⭐⭐⭐ | Cloud-hosted, fast, good reasoning | +| deepseek-v3.1:671b-cloud | 2026-02-04 | Complex reasoning | ⭐⭐⭐⭐⭐ | Cloud, very capable, slower response | +| | | | | | + +--- + +## Model Testing Details + +### GLM-4.7:cloud +**Date Tested:** 2026-02-04 + +**Model Info:** +- Size/Parameters: Unknown (cloud) +- Quantization: N/A (cloud) +- Base Model: GLM-4.7 by Zhipu AI + +**Performance:** +- Response Speed: Fast +- RAM/VRAM Usage: Cloud (local minimal) +- Context Window: 128k + +**Testing Use Cases:** +- [x] Code generation +- [x] General Q&A +- [ ] Creative writing +- [x] Data analysis +- [ ] Task planning +- [ ] Other: + +**Observations:** +- Strengths: Fast response, good at general reasoning +- Weaknesses: Cloud dependency +- Resource requirements: Minimal local resources +- Output quality: Solid for most tasks +- When to use this model: Daily tasks, coding help, general assistance + +**Verdict:** ⭐⭐⭐⭐ + +--- + +### deepseek-v3.1:671b-cloud +**Date Tested:** 2026-02-04 + +**Model Info:** +- Size/Parameters: 671B (cloud) +- Quantization: N/A (cloud) +- Base Model: DeepSeek-V3 by DeepSeek + +**Performance:** +- Response Speed: Moderate (671B model) +- RAM/VRAM Usage: Cloud (local minimal) +- Context Window: 128k+ + +**Testing Use Cases:** +- [x] Code generation +- [x] General Q&A +- [ ] Creative writing +- [x] Data analysis +- [x] Task planning +- [ ] Other: + +**Observations:** +- Strengths: Very capable, excellent reasoning, great with complex tasks +- Weaknesses: Slower response, cloud dependency +- Resource requirements: Minimal local resources +- Output quality: Top-tier, handles complex multi-step reasoning well +- When to use this model: Complex coding tasks, deep analysis, planning + +**Verdict:** ⭐⭐⭐⭐⭐ + +--- + +## Models to Test + +### Local Models (16GB 
GPU Compatible)
+
+**Small & Fast (2-6GB VRAM at Q4):**
+- [ ] phi3:mini - 3.8B params, great for quick tasks ~2.2GB
+- [ ] llama3.1:8b - 8B params, excellent all-rounder ~4.7GB
+- [ ] qwen2.5:7b - 7B params, strong reasoning ~4.3GB
+- [ ] gemma2:9b - 9B params, Google's small model ~5.5GB
+- [ ] mistral:7b - 7B params, classic workhorse ~4.1GB
+
+**Medium Capability (6-10GB VRAM at Q4):**
+- [ ] phi3:medium - 14B params, higher quality ~7.9GB
+- [ ] qwen2.5:14b - 14B params, strong multilingual ~8.1GB
+
+**Specialized:**
+- [ ] deepseek-coder-v2:lite - 16B params, optimized for coding ~8.7GB
+- [ ] codellama:7b - 7B params, coding specialist ~4.1GB
+
+---
+
+## General Notes
+
+*Any overall observations, preferences, or patterns discovered during testing.*
+
+**Initial Impressions:**
+- Cloud models (GLM-4.7, DeepSeek-V3) provide excellent quality without local resources
+- Planning to test local models for privacy, offline use, and comparing quality/speed trade-offs
+- Focus will be on models that fit comfortably in 16GB VRAM for smooth performance
+
+**VRAM Estimates at Q4 Quantization:**
+- 3B-4B models: ~2-3GB
+- 7B-8B models: ~4-5GB
+- 14B models: ~8-9GB
+- Leaves room for context window and system overhead
+
+---
+
+*Last Updated: 2026-02-04*
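The VRAM estimates in the testing log collapse into a one-line rule of thumb. A rough sketch, assuming ~0.6 GB per billion parameters for Q4_K_M-style quantization (weights only; the constant and the headroom default are approximations, not measured values):

```python
def q4_weights_gb(params_billion: float) -> float:
    """Approximate on-disk/VRAM weight size at ~4.8 bits/param (Q4_K_M-style).

    Excludes KV cache and runtime overhead.
    """
    return round(params_billion * 0.6, 1)

def fits_in_vram(params_billion: float, vram_gb: float = 16.0,
                 headroom_gb: float = 4.0) -> bool:
    """Check fit, reserving headroom for context window and system overhead."""
    return q4_weights_gb(params_billion) + headroom_gb <= vram_gb
```

For example, `q4_weights_gb(8)` gives 4.8, matching the ~4-5GB estimate for 7B-8B models, and `fits_in_vram(14)` confirms a 14B model leaves comfortable headroom on a 16GB GPU.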