---
title: "Ollama Benchmark Prompts"
description: "Standardized prompt suite and scoring criteria for evaluating Ollama LLM models across code generation, code analysis, reasoning, data analysis, and planning tasks."
type: reference
domain: development
tags: [ollama, llm, benchmarks, prompts, model-evaluation]
---

# Ollama Model Benchmark Prompts

Run the same prompts against every model so that results are directly comparable across tasks.

---

## Code Generation Benchmarks

### Simple Python Function
```
Write a Python function that takes a list of integers and returns a new list containing only the even numbers, sorted in descending order. Include type hints and a docstring.
```

### Medium Complexity - Class with Error Handling
```
Create a Python class called 'BankAccount' with the following requirements:
- Constructor takes account_number (str) and initial_balance (float)
- Methods: deposit(amount), withdraw(amount), get_balance()
- Withdraw should raise ValueError if insufficient funds
- Deposit should raise ValueError if amount is negative
- Include type hints and comprehensive docstrings
```

### Complex - Async API Handler
```
Create an async Python function that fetches data from multiple URLs concurrently using aiohttp. Requirements:
- Takes a list of URLs as input
- Uses aiohttp with proper session management
- Handles timeout errors and connection errors gracefully
- Returns a dictionary mapping URLs to their response data or error message
- Includes proper logging
- Use async/await patterns correctly
```

### Algorithm Challenge
```
Implement a function that finds the longest palindromic substring in a given string. Optimize for efficiency (aim for O(n²) or better). Provide both the implementation and a brief explanation of your approach.
```

---

## Code Analysis & Refactoring

### Bug Finding
````
Find and fix the bug in this Python code:
```python
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

result = calculate_average([])
```

Explain the bug and your fix.
````

### Code Refactoring
````
Refactor this Python code to be more Pythonic and efficient:
```python
def get_unique_words(text):
    words = []
    for word in text.split():
        if word not in words:
            words.append(word)
    return words
```

Explain your refactoring choices.
````

### Code Explanation
````
Explain what this Python code does and identify any potential issues:
```python
from typing import List, Optional

def process_items(items: List[dict]) -> Optional[dict]:
    if not items:
        return None

    result = {}
    for item in items:
        key = item.get('id', 'unknown')
        if key not in result:
            result[key] = []
        result[key].append(item.get('value'))

    return result
```
````

---

## General Reasoning Benchmarks

### Logic Problem
```
You have 12 coins, all identical in appearance. One coin is counterfeit and weighs slightly more than the others. You have a balance scale and can only use it 3 times. How do you identify the counterfeit coin? Explain your reasoning step by step.
```

### System Design (High Level)
```
Design the architecture for a real-time chat application that needs to support:
- 1 million concurrent users
- Message delivery in under 100ms
- Message history for last 30 days
- Group chats with up to 1000 members

Focus on high-level architecture and technology choices, not implementation details.
```

### Technical Explanation
```
Explain the concept of dependency injection in software development. Include:
- What problem it solves
- How it works
- A simple code example in Python or your preferred language
- Pros and cons compared to direct instantiation
```

---

## Data Analysis Benchmarks

### Data Processing Task
````
Given this JSON data representing sales transactions:
```json
[
  {"id": 1, "product": "Widget", "quantity": 5, "price": 10.00, "date": "2024-01-15"},
  {"id": 2, "product": "Gadget", "quantity": 3, "price": 25.00, "date": "2024-01-16"},
  {"id": 3, "product": "Widget", "quantity": 2, "price": 10.00, "date": "2024-01-17"},
  {"id": 4, "product": "Tool", "quantity": 1, "price": 50.00, "date": "2024-01-18"},
  {"id": 5, "product": "Gadget", "quantity": 4, "price": 25.00, "date": "2024-01-19"}
]
```

Write Python code to:
1. Calculate total revenue per product
2. Find the product with highest total revenue
3. Calculate the average order value
````
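
When scoring Accuracy for this prompt, it may help to have the expected numbers on hand. The sketch below is one possible reference computation, not a required answer format. It assumes each transaction counts as one order, so "average order value" is read as total revenue divided by the number of transactions; a model that states a different, reasonable interpretation should be judged on its own terms.

```python
from collections import defaultdict

# Same transactions as in the prompt above.
transactions = [
    {"id": 1, "product": "Widget", "quantity": 5, "price": 10.00, "date": "2024-01-15"},
    {"id": 2, "product": "Gadget", "quantity": 3, "price": 25.00, "date": "2024-01-16"},
    {"id": 3, "product": "Widget", "quantity": 2, "price": 10.00, "date": "2024-01-17"},
    {"id": 4, "product": "Tool", "quantity": 1, "price": 50.00, "date": "2024-01-18"},
    {"id": 5, "product": "Gadget", "quantity": 4, "price": 25.00, "date": "2024-01-19"},
]

# 1. Total revenue per product (quantity * price, summed per product).
revenue = defaultdict(float)
for t in transactions:
    revenue[t["product"]] += t["quantity"] * t["price"]

# 2. Product with the highest total revenue.
top_product = max(revenue, key=revenue.get)

# 3. Average order value, assuming one transaction == one order.
average_order_value = sum(revenue.values()) / len(transactions)

print(dict(revenue))        # {'Widget': 70.0, 'Gadget': 175.0, 'Tool': 50.0}
print(top_product)          # Gadget
print(average_order_value)  # 59.0
```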

### Data Transformation
```
Write a Python function that converts a list of dictionaries (with inconsistent keys) into a standardized format. Handle missing keys gracefully with default values. Example:
```python
input_data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "years_old": 25},
    {"fullname": "Charlie", "age": 35},
    {"name": "Diana", "years_old": 28, "city": "NYC"}
]

# Desired output format:
# {"name": str, "age": int, "city": str (default "Unknown")}
```
```

---

## Planning & Task Breakdown

### Project Planning
```
Break down the task of building a simple REST API for a TODO list into smaller, actionable steps. Include:
- Technology stack choices with brief justification
- Development phases
- Key deliverables for each phase
- Estimated complexity of each step (Low/Medium/High)
```

### Debugging Strategy
```
You're experiencing intermittent failures in a production database connection pool. Describe your step-by-step debugging strategy to identify and resolve the issue.
```

---

## Benchmarking Criteria

For each prompt, rate the model on:

| Criteria | Scale | Notes |
|----------|-------|-------|
| Accuracy | 1-5 | How correct/complete is the answer? |
| Code Quality | 1-5 | Is code idiomatic, well-structured? |
| Explanations | 1-5 | Are explanations clear and thorough? |
| Response Speed | 1-5 | How fast did it respond? (subjective) |
| Follows Instructions | 1-5 | Did it follow all requirements? |

**Scoring:**

- 1: Poor
- 2: Below Average
- 3: Average
- 4: Good
- 5: Excellent
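
One way to keep ratings consistent across models is to record them in a small structure that enforces the 1-5 scale. The following is a minimal sketch; the `PromptScore` class and its field names are hypothetical, chosen only to mirror the criteria table above, and the unweighted mean is an assumption rather than a prescribed formula.

```python
from dataclasses import dataclass, asdict
from statistics import mean


@dataclass
class PromptScore:
    """Ratings for one model on one prompt (hypothetical helper)."""

    model: str
    prompt: str
    accuracy: int
    code_quality: int
    explanations: int
    response_speed: int
    follows_instructions: int

    def __post_init__(self) -> None:
        # Reject anything outside the 1-5 scale defined in the table.
        for criterion, value in self.ratings().items():
            if not 1 <= value <= 5:
                raise ValueError(f"{criterion} must be between 1 and 5, got {value}")

    def ratings(self) -> dict:
        """Return only the numeric criteria, keyed by criterion name."""
        return {k: v for k, v in asdict(self).items() if k not in ("model", "prompt")}

    def overall(self) -> float:
        """Unweighted mean of the five criteria (an assumed aggregation)."""
        return float(mean(self.ratings().values()))
```

For example, `PromptScore("some-model", "Bug Finding", 4, 3, 5, 3, 5).overall()` averages the five ratings into a single number that can be compared across models.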

---

## Testing Process

1. Run each prompt through each model
2. Record scores and qualitative notes
3. Note response times (optional but helpful)
4. Identify patterns in model strengths/weaknesses
5. Update `/mnt/NV2/Development/claude-home/ollama-model-testing.md` with findings
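
Steps 1 and 3 can be scripted. The sketch below is one possible approach, assuming Ollama is serving its REST API at the default local address (`http://localhost:11434`) and using the `/api/generate` endpoint with streaming disabled. The function names `ollama_generate` and `run_suite` are hypothetical, and the wall-clock timing includes model load time, so the first call per model may look slower than later ones.

```python
import json
import time
import urllib.request

# Assumption: default local Ollama endpoint; adjust if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def ollama_generate(model: str, prompt: str) -> str:
    """Send one non-streaming generate request and return the response text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["response"]


def run_suite(models, prompts, generate=ollama_generate):
    """Run every prompt through every model and time each call.

    `prompts` maps a prompt name to its text. `generate` is injectable so the
    loop can be exercised without a running server.
    """
    results = []
    for model in models:
        for name, prompt in prompts.items():
            start = time.perf_counter()
            answer = generate(model, prompt)
            elapsed = time.perf_counter() - start
            results.append(
                {"model": model, "prompt": name, "seconds": elapsed, "response": answer}
            )
    return results
```

Calling `run_suite(["model-a", "model-b"], {"Simple Python Function": "..."})` with real model names and prompt text returns one record per model and prompt; scores and qualitative notes still need to be added by hand.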

---

*Last Updated: 2026-02-04*