claude-home/ollama-benchmarks.md
Cal Corum 4b7eca8a46
docs: add YAML frontmatter to all 151 markdown files
Adds title, description, type, domain, and tags frontmatter to every
doc for improved KB semantic search. The description field is prepended
to every search chunk, and domain/type/tags enable filtered queries.

Type values: context, guide, runbook, reference, troubleshooting
Domain values match directory structure (networking, docker, etc.)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 09:00:44 -05:00


| title | description | type | domain | tags |
|-------|-------------|------|--------|------|
| Ollama Benchmark Prompts | Standardized prompt suite and scoring criteria for evaluating Ollama LLM models across code generation, code analysis, reasoning, data analysis, and planning tasks. | reference | development | ollama, llm, benchmarks, prompts, model-evaluation |

# Ollama Model Benchmark Prompts

Use this fixed set of prompts to evaluate different models on identical tasks, so that their results can be compared directly.


---

## Code Generation Benchmarks

### Simple Python Function

Write a Python function that takes a list of integers and returns a new list containing only the even numbers, sorted in descending order. Include type hints and a docstring.

### Medium Complexity - Class with Error Handling

Create a Python class called 'BankAccount' with the following requirements:
- Constructor takes account_number (str) and initial_balance (float)
- Methods: deposit(amount), withdraw(amount), get_balance()
- Withdraw should raise ValueError if insufficient funds
- Deposit should raise ValueError if amount is negative
- Include type hints and comprehensive docstrings

### Complex - Async API Handler

Create an async Python function that fetches data from multiple URLs concurrently using aiohttp. Requirements:
- Takes a list of URLs as input
- Uses aiohttp with proper session management
- Handles timeout errors and connection errors gracefully
- Returns a dictionary mapping URLs to their response data or error message
- Includes proper logging
- Use async/await patterns correctly

### Algorithm Challenge

Implement a function that finds the longest palindromic substring in a given string. Optimize for efficiency (aim for O(n²) or better). Provide both the implementation and a brief explanation of your approach.
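Grader note (not part of the prompt): when scoring Accuracy here, it can help to compare a model's answer against a known-good baseline. The sketch below uses the expand-around-center technique, which runs in O(n²) time with O(1) extra space. The function name `longest_palindrome` is illustrative. Ties are resolved in favor of the earliest substring, so a model answer of equal length should also be accepted.

```python
def longest_palindrome(s: str) -> str:
    """Return the longest palindromic substring of s (earliest one on ties)."""
    if not s:
        return ""

    best_start, best_len = 0, 1

    def expand(left: int, right: int) -> tuple[int, int]:
        # Grow outward while the characters match, then report (start, length).
        while left >= 0 and right < len(s) and s[left] == s[right]:
            left -= 1
            right += 1
        return left + 1, right - left - 1

    for center in range(len(s)):
        # Odd-length palindromes are centered on a character,
        # even-length ones on the gap between two characters.
        for start, length in (expand(center, center), expand(center, center + 1)):
            if length > best_len:
                best_start, best_len = start, length

    return s[best_start:best_start + best_len]
```

A faster answer (for example Manacher's algorithm) also satisfies the prompt; this baseline is only meant for checking outputs on sample strings.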

---

## Code Analysis & Refactoring

### Bug Finding

Find and fix the bug in this Python code:
```python
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

result = calculate_average([])
```

Explain the bug and your fix.


### Code Refactoring

Refactor this Python code to be more Pythonic and efficient:

```python
def get_unique_words(text):
    words = []
    for word in text.split():
        if word not in words:
            words.append(word)
    return words
```

Explain your refactoring choices.
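Grader note (not part of the prompt): the original function returns the unique words in first-seen order, so a behavior-preserving refactor has to keep that order. One possible reference answer, shown here as a sketch, relies on `dict.fromkeys`, since dictionaries preserve insertion order in Python 3.7 and later. The type hints are an addition for illustration.

```python
def get_unique_words(text: str) -> list[str]:
    """Return the distinct whitespace-separated words of text in first-seen order."""
    # Dict keys are unique and keep insertion order, so this deduplicates in
    # O(n) instead of the O(n^2) list-membership scan in the original.
    return list(dict.fromkeys(text.split()))
```

A response that uses `set(text.split())` alone is shorter but changes the output order, which is worth noting under Accuracy.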


### Code Explanation

Explain what this Python code does and identify any potential issues:

```python
from typing import List, Optional

def process_items(items: List[dict]) -> Optional[dict]:
    if not items:
        return None

    result = {}
    for item in items:
        key = item.get('id', 'unknown')
        if key not in result:
            result[key] = []
        result[key].append(item.get('value'))

    return result
```

---

## General Reasoning Benchmarks

### Logic Problem

You have 12 coins, all identical in appearance. One coin is counterfeit and weighs slightly more than the others. You have a balance scale and can only use it 3 times. How do you identify the counterfeit coin? Explain your reasoning step by step.
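Grader note (not part of the prompt): one valid strategy is to weigh 4 vs 4, then 2 vs 2 within the suspect group, then 1 vs 1. The simulation below is a sketch for checking that this strategy always succeeds in exactly three weighings; the function name and the integer weights are illustrative assumptions. Other correct strategies exist, so a model should be graded on the soundness of its reasoning, not on matching this split.

```python
def find_heavy_coin(weights: list[int]) -> tuple[int, int]:
    """Return (index of the heavy coin, number of weighings used) for 12 coins."""
    weighings = 0

    def weigh(left: list[int], right: list[int]) -> int:
        # Simulate the balance: -1 if left is heavier, 1 if right is, 0 if equal.
        nonlocal weighings
        weighings += 1
        left_total = sum(weights[i] for i in left)
        right_total = sum(weights[i] for i in right)
        return (left_total < right_total) - (left_total > right_total)

    a, b, c = list(range(0, 4)), list(range(4, 8)), list(range(8, 12))

    # Weighing 1: 4 vs 4 narrows the suspects to one group of four.
    outcome = weigh(a, b)
    suspects = a if outcome == -1 else b if outcome == 1 else c

    # Weighing 2: 2 vs 2 within the suspect group; one side must be heavier.
    suspects = suspects[:2] if weigh(suspects[:2], suspects[2:]) == -1 else suspects[2:]

    # Weighing 3: 1 vs 1 identifies the counterfeit coin.
    heavy = suspects[0] if weigh(suspects[:1], suspects[1:]) == -1 else suspects[1]
    return heavy, weighings
```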


### System Design (High Level)

Design the architecture for a real-time chat application that needs to support:

- 1 million concurrent users
- Message delivery in under 100ms
- Message history for last 30 days
- Group chats with up to 1000 members

Focus on high-level architecture and technology choices, not implementation details.


### Technical Explanation

Explain the concept of dependency injection in software development. Include:

- What problem it solves
- How it works
- A simple code example in Python or your preferred language
- Pros and cons compared to direct instantiation

---

## Data Analysis Benchmarks

### Data Processing Task

Given this JSON data representing sales transactions:

```json
[
  {"id": 1, "product": "Widget", "quantity": 5, "price": 10.00, "date": "2024-01-15"},
  {"id": 2, "product": "Gadget", "quantity": 3, "price": 25.00, "date": "2024-01-16"},
  {"id": 3, "product": "Widget", "quantity": 2, "price": 10.00, "date": "2024-01-17"},
  {"id": 4, "product": "Tool", "quantity": 1, "price": 50.00, "date": "2024-01-18"},
  {"id": 5, "product": "Gadget", "quantity": 4, "price": 25.00, "date": "2024-01-19"}
]
```

Write Python code to:

1. Calculate total revenue per product
2. Find the product with highest total revenue
3. Calculate the average order value
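Grader note (not part of the prompt): the expected numbers for this dataset can be computed with a short reference script such as the sketch below. It assumes that each transaction counts as one order, so the average order value is total revenue divided by the number of transactions; a model that states a different, clearly explained definition may still deserve credit.

```python
from collections import defaultdict

transactions = [
    {"id": 1, "product": "Widget", "quantity": 5, "price": 10.00, "date": "2024-01-15"},
    {"id": 2, "product": "Gadget", "quantity": 3, "price": 25.00, "date": "2024-01-16"},
    {"id": 3, "product": "Widget", "quantity": 2, "price": 10.00, "date": "2024-01-17"},
    {"id": 4, "product": "Tool", "quantity": 1, "price": 50.00, "date": "2024-01-18"},
    {"id": 5, "product": "Gadget", "quantity": 4, "price": 25.00, "date": "2024-01-19"},
]

# 1. Total revenue per product (quantity * price, summed per product name).
revenue_by_product = defaultdict(float)
for t in transactions:
    revenue_by_product[t["product"]] += t["quantity"] * t["price"]

# 2. Product with the highest total revenue.
top_product = max(revenue_by_product, key=revenue_by_product.get)

# 3. Average order value, assuming one transaction == one order.
average_order_value = sum(revenue_by_product.values()) / len(transactions)

print(dict(revenue_by_product))  # {'Widget': 70.0, 'Gadget': 175.0, 'Tool': 50.0}
print(top_product)               # Gadget
print(average_order_value)       # 59.0
```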

### Data Transformation

Write a Python function that converts a list of dictionaries (with inconsistent keys) into a standardized format. Handle missing keys gracefully with default values. Example:

```python
input_data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "years_old": 25},
    {"fullname": "Charlie", "age": 35},
    {"name": "Diana", "years_old": 28, "city": "NYC"}
]

# Desired output format:
# {"name": str, "age": int, "city": str (default "Unknown")}
```
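Grader note (not part of the prompt): a reference answer might look like the sketch below. The alias table, the function name `standardize`, and the defaults for `name` and `age` are assumptions made for illustration; only the `"Unknown"` default for `city` comes from the prompt itself.

```python
from typing import Any

# Assumed alias table: which source keys may supply each standardized field.
KEY_ALIASES = {
    "name": ("name", "fullname"),
    "age": ("age", "years_old"),
    "city": ("city",),
}

# Only the "Unknown" city default is given by the prompt; the rest are assumptions.
DEFAULTS: dict[str, Any] = {"name": "Unknown", "age": None, "city": "Unknown"}


def standardize(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Map each record onto the name/age/city schema, filling gaps with defaults."""
    standardized = []
    for record in records:
        row = {}
        for field, aliases in KEY_ALIASES.items():
            # Take the first alias present in the record, else fall back to the default.
            row[field] = next((record[k] for k in aliases if k in record), DEFAULTS[field])
        standardized.append(row)
    return standardized
```

For the example input, every record comes back with exactly the keys `name`, `age`, and `city`, and only Diana's city differs from `"Unknown"`.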

---

## Planning & Task Breakdown

### Project Planning

Break down the task of building a simple REST API for a TODO list into smaller, actionable steps. Include:

- Technology stack choices with brief justification
- Development phases
- Key deliverables for each phase
- Estimated complexity of each step (Low/Medium/High)

### Debugging Strategy

You're experiencing intermittent failures in a production database connection pool. Describe your step-by-step debugging strategy to identify and resolve the issue.


---

## Benchmarking Criteria

For each prompt, rate the model on:

| Criteria | Scale | Notes |
|----------|-------|-------|
| Accuracy | 1-5 | How correct/complete is the answer? |
| Code Quality | 1-5 | Is code idiomatic, well-structured? |
| Explanations | 1-5 | Are explanations clear and thorough? |
| Response Speed | 1-5 | How fast did it respond? (subjective) |
| Follows Instructions | 1-5 | Did it follow all requirements? |

**Scoring:**

- 1 - Poor
- 2 - Below Average
- 3 - Average
- 4 - Good
- 5 - Excellent
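The ratings above can be recorded and averaged per model with a small helper. The following is a minimal sketch, not an established tool: the `PromptScore` class, the criterion keys, and the choice to let a prompt omit criteria that do not apply (for example Code Quality on a pure reasoning prompt) are all assumptions.

```python
from dataclasses import dataclass

# Assumed snake_case keys for the five criteria in the table.
CRITERIA = ("accuracy", "code_quality", "explanations", "response_speed", "follows_instructions")


@dataclass
class PromptScore:
    model: str
    prompt: str
    scores: dict[str, int]  # criterion name -> rating on the 1-5 scale

    def __post_init__(self) -> None:
        # Reject unknown criteria and ratings outside the 1-5 scale early.
        for criterion, rating in self.scores.items():
            if criterion not in CRITERIA:
                raise ValueError(f"unknown criterion: {criterion}")
            if not 1 <= rating <= 5:
                raise ValueError(f"{criterion} must be rated 1-5, got {rating!r}")


def average_by_model(results: list[PromptScore]) -> dict[str, float]:
    """Average every recorded rating per model across all scored prompts."""
    ratings: dict[str, list[int]] = {}
    for result in results:
        ratings.setdefault(result.model, []).extend(result.scores.values())
    return {model: round(sum(v) / len(v), 2) for model, v in ratings.items() if v}
```

A single overall average hides per-criterion differences, so keeping the raw `PromptScore` records alongside it makes patterns in strengths and weaknesses easier to spot.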

---

## Testing Process

1. Run each prompt through each model
2. Record scores and qualitative notes
3. Note response times (optional but helpful)
4. Identify patterns in model strengths/weaknesses
5. Update `/mnt/NV2/Development/claude-home/ollama-model-testing.md` with findings
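Steps 1 to 3 can be partly automated. The sketch below assumes an Ollama server listening on its default local address and uses the non-streaming `/api/generate` endpoint; the function names and the shape of the result records are illustrative. The `generate` parameter is injectable so the loop can be exercised without a running server.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def ollama_generate(model: str, prompt: str) -> str:
    """Send one non-streaming request to the Ollama server and return the text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=600) as reply:
        return json.load(reply)["response"]


def run_suite(
    models: list[str],
    prompts: dict[str, str],
    generate: Callable[[str, str], str] = ollama_generate,
) -> list[dict]:
    """Run every prompt through every model, timing each call in wall-clock seconds."""
    results = []
    for model in models:
        for name, prompt in prompts.items():
            started = time.perf_counter()
            answer = generate(model, prompt)
            results.append({
                "model": model,
                "prompt": name,
                "seconds": round(time.perf_counter() - started, 2),
                "response": answer,
            })
    return results
```

Pass whichever model tags `ollama list` reports locally as `models`. The timings include model load time on the first call to each model, so a warm-up request per model may give fairer Response Speed comparisons.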

---

*Last Updated: 2026-02-04*