---
title: "Ollama Benchmark Prompts"
description: "Standardized prompt suite and scoring criteria for evaluating Ollama LLM models across code generation, code analysis, reasoning, data analysis, and planning tasks."
type: reference
domain: development
tags: [ollama, llm, benchmarks, prompts, model-evaluation]
---
# Ollama Model Benchmark Prompts
Run the same prompts against every model so that results are directly comparable across tasks.

---
## Code Generation Benchmarks
### Simple Python Function
```
Write a Python function that takes a list of integers and returns a new list containing only the even numbers, sorted in descending order. Include type hints and a docstring.
```
### Medium Complexity - Class with Error Handling
```
Create a Python class called 'BankAccount' with the following requirements:
- Constructor takes account_number (str) and initial_balance (float)
- Methods: deposit(amount), withdraw(amount), get_balance()
- Withdraw should raise ValueError if insufficient funds
- Deposit should raise ValueError if amount is negative
- Include type hints and comprehensive docstrings
```
### Complex - Async API Handler
```
Create an async Python function that fetches data from multiple URLs concurrently using aiohttp. Requirements:
- Takes a list of URLs as input
- Uses aiohttp with proper session management
- Handles timeout errors and connection errors gracefully
- Returns a dictionary mapping URLs to their response data or error message
- Includes proper logging
- Use async/await patterns correctly
```
### Algorithm Challenge
```
Implement a function that finds the longest palindromic substring in a given string. Optimize for efficiency (aim for O(n²) or better). Provide both the implementation and a brief explanation of your approach.
```
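When scoring Accuracy for this prompt, a baseline answer is useful for comparison. The sketch below uses expand-around-center, one common O(n²) approach; it is not the only acceptable answer, and an O(n) solution such as Manacher's algorithm would also satisfy the prompt. The function name `longest_palindrome` is illustrative.

```python
def longest_palindrome(s: str) -> str:
    """Return the longest palindromic substring of s (first one found on ties).

    Expand-around-center: O(n^2) time, O(1) extra space.
    """
    best_start, best_len = 0, 0
    for center in range(len(s)):
        # Try an odd-length center (i, i) and an even-length center (i, i + 1).
        for left, right in ((center, center), (center, center + 1)):
            while left >= 0 and right < len(s) and s[left] == s[right]:
                left -= 1
                right += 1
            # The loop overshoots by one on each side.
            length = right - left - 1
            if length > best_len:
                best_start, best_len = left + 1, length
    return s[best_start:best_start + best_len]
```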
---
## Code Analysis & Refactoring
### Bug Finding
````
Find and fix the bug in this Python code:
```python
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

result = calculate_average([])
```
Explain the bug and your fix.
````
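For graders, one acceptable fix is sketched below. It assumes that an empty list should raise a clear error; returning `0.0` or `None` is equally defensible if the model states and justifies that choice.

```python
def calculate_average(numbers: list[float]) -> float:
    """Return the arithmetic mean of numbers, rejecting empty input."""
    if not numbers:
        # The original divided by len(numbers) == 0, raising ZeroDivisionError.
        raise ValueError("cannot average an empty list")
    return sum(numbers) / len(numbers)
```

A complete answer should name the failure (`ZeroDivisionError` on the empty list), not only patch it.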
### Code Refactoring
````
Refactor this Python code to be more Pythonic and efficient:
```python
def get_unique_words(text):
    words = []
    for word in text.split():
        if word not in words:
            words.append(word)
    return words
```
Explain your refactoring choices.
````
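As a grading reference, here is one possible refactoring that keeps the original behavior, including first-seen word order. It is a sketch, not the only valid answer.

```python
def get_unique_words(text: str) -> list[str]:
    """Return the distinct whitespace-separated words in first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+), so this
    # replaces the O(n) list membership test with O(1) hash lookups.
    return list(dict.fromkeys(text.split()))
```

An answer that returns `list(set(text.split()))` is shorter but loses the ordering, which is a behavior change worth noting under Accuracy.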
### Code Explanation
````
Explain what this Python code does and identify any potential issues:
```python
from typing import List, Optional

def process_items(items: List[dict]) -> Optional[dict]:
    if not items:
        return None
    result = {}
    for item in items:
        key = item.get('id', 'unknown')
        if key not in result:
            result[key] = []
        result[key].append(item.get('value'))
    return result
```
````
---
## General Reasoning Benchmarks
### Logic Problem
```
You have 12 coins, all identical in appearance. One coin is counterfeit and weighs slightly more than the others. You have a balance scale and can only use it 3 times. How do you identify the counterfeit coin? Explain your reasoning step by step.
```
### System Design (High Level)
```
Design the architecture for a real-time chat application that needs to support:
- 1 million concurrent users
- Message delivery in under 100ms
- Message history for last 30 days
- Group chats with up to 1000 members
Focus on high-level architecture and technology choices, not implementation details.
```
### Technical Explanation
```
Explain the concept of dependency injection in software development. Include:
- What problem it solves
- How it works
- A simple code example in Python or your preferred language
- Pros and cons compared to direct instantiation
```
---
## Data Analysis Benchmarks
### Data Processing Task
````
Given this JSON data representing sales transactions:
```json
[
  {"id": 1, "product": "Widget", "quantity": 5, "price": 10.00, "date": "2024-01-15"},
  {"id": 2, "product": "Gadget", "quantity": 3, "price": 25.00, "date": "2024-01-16"},
  {"id": 3, "product": "Widget", "quantity": 2, "price": 10.00, "date": "2024-01-17"},
  {"id": 4, "product": "Tool", "quantity": 1, "price": 50.00, "date": "2024-01-18"},
  {"id": 5, "product": "Gadget", "quantity": 4, "price": 25.00, "date": "2024-01-19"}
]
```
Write Python code to:
1. Calculate total revenue per product
2. Find the product with highest total revenue
3. Calculate the average order value
````
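To check a model's numbers when scoring this prompt, the expected values can be computed with a short reference script. This sketch assumes "average order value" means total revenue divided by the number of transactions, which the prompt does not define; the variable names are illustrative.

```python
from collections import defaultdict

transactions = [
    {"id": 1, "product": "Widget", "quantity": 5, "price": 10.00, "date": "2024-01-15"},
    {"id": 2, "product": "Gadget", "quantity": 3, "price": 25.00, "date": "2024-01-16"},
    {"id": 3, "product": "Widget", "quantity": 2, "price": 10.00, "date": "2024-01-17"},
    {"id": 4, "product": "Tool", "quantity": 1, "price": 50.00, "date": "2024-01-18"},
    {"id": 5, "product": "Gadget", "quantity": 4, "price": 25.00, "date": "2024-01-19"},
]

# 1. Total revenue per product (quantity * price, summed per product).
revenue_per_product: dict[str, float] = defaultdict(float)
for t in transactions:
    revenue_per_product[t["product"]] += t["quantity"] * t["price"]

# 2. Product with the highest total revenue.
top_product = max(revenue_per_product, key=revenue_per_product.get)

# 3. Assumption: average order value = total revenue / number of transactions.
average_order_value = sum(revenue_per_product.values()) / len(transactions)

print(dict(revenue_per_product))  # {'Widget': 70.0, 'Gadget': 175.0, 'Tool': 50.0}
print(top_product)                # Gadget
print(average_order_value)        # 59.0
```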
### Data Transformation
````
Write a Python function that converts a list of dictionaries (with inconsistent keys) into a standardized format. Handle missing keys gracefully with default values. Example:
```python
input_data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "years_old": 25},
    {"fullname": "Charlie", "age": 35},
    {"name": "Diana", "years_old": 28, "city": "NYC"}
]
# Desired output format:
# {"name": str, "age": int, "city": str (default "Unknown")}
```
````
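A possible baseline for graders is sketched below. The alias table covers only the keys that appear in the example input, and the defaults for `name` and `age` are assumptions, since the prompt specifies a default only for `city`. `KEY_ALIASES`, `DEFAULTS`, and `standardize` are illustrative names.

```python
# Assumption: aliases are limited to the key variants shown in the example.
KEY_ALIASES = {
    "name": ("name", "fullname"),
    "age": ("age", "years_old"),
    "city": ("city",),
}
# Only the "city" default comes from the prompt; the other two are assumed.
DEFAULTS = {"name": "Unknown", "age": 0, "city": "Unknown"}


def standardize(records: list[dict]) -> list[dict]:
    """Map each record onto the name/age/city schema, filling gaps with defaults."""
    standardized = []
    for record in records:
        row = {}
        for target, aliases in KEY_ALIASES.items():
            # Take the first alias present in the record, else the default.
            row[target] = next(
                (record[alias] for alias in aliases if alias in record),
                DEFAULTS[target],
            )
        standardized.append(row)
    return standardized


input_data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "years_old": 25},
    {"fullname": "Charlie", "age": 35},
    {"name": "Diana", "years_old": 28, "city": "NYC"},
]
print(standardize(input_data)[2])  # {'name': 'Charlie', 'age': 35, 'city': 'Unknown'}
```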
---
## Planning & Task Breakdown
### Project Planning
```
Break down the task of building a simple REST API for a TODO list into smaller, actionable steps. Include:
- Technology stack choices with brief justification
- Development phases
- Key deliverables for each phase
- Estimated complexity of each step (Low/Medium/High)
```
### Debugging Strategy
```
You're experiencing intermittent failures in a production database connection pool. Describe your step-by-step debugging strategy to identify and resolve the issue.
```
---
## Benchmarking Criteria
For each prompt, rate the model on:

| Criteria | Scale | Notes |
|----------|-------|-------|
| Accuracy | 1-5 | How correct/complete is the answer? |
| Code Quality | 1-5 | Is code idiomatic, well-structured? |
| Explanations | 1-5 | Are explanations clear and thorough? |
| Response Speed | 1-5 | How fast did it respond? (subjective) |
| Follows Instructions | 1-5 | Did it follow all requirements? |

**Scoring:**

- 1 - Poor
- 2 - Below Average
- 3 - Average
- 4 - Good
- 5 - Excellent
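
The ratings above can be recorded and averaged per model with a few lines of Python. This is a minimal sketch: the record layout, the criterion identifiers, and the `summarize` helper are assumptions made for illustration, and the sample records are placeholders, not real benchmark results.

```python
from collections import defaultdict
from statistics import mean

# Criterion keys mirror the rating table; the identifiers are illustrative.
CRITERIA = ("accuracy", "code_quality", "explanations",
            "response_speed", "follows_instructions")


def summarize(records: list[dict]) -> dict[str, dict[str, float]]:
    """Average each criterion per model across all scored prompts."""
    by_model = defaultdict(lambda: defaultdict(list))
    for record in records:
        for criterion in CRITERIA:
            score = record["scores"][criterion]
            if not 1 <= score <= 5:
                raise ValueError(f"{criterion} must be 1-5, got {score}")
            by_model[record["model"]][criterion].append(score)
    return {
        model: {c: mean(values) for c, values in per_criterion.items()}
        for model, per_criterion in by_model.items()
    }


# Placeholder data only; "model-a" is a hypothetical model name.
records = [
    {"model": "model-a", "prompt": "Simple Python Function",
     "scores": dict.fromkeys(CRITERIA, 4)},
    {"model": "model-a", "prompt": "Bug Finding",
     "scores": dict.fromkeys(CRITERIA, 2)},
]
summary = summarize(records)
print(summary["model-a"]["accuracy"])  # 3
```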
---
## Testing Process
1. Run each prompt through each model
2. Record scores and qualitative notes
3. Note response times (optional but helpful)
4. Identify patterns in model strengths/weaknesses
5. Update `/mnt/NV2/Development/claude-home/ollama-model-testing.md` with findings
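
Steps 1 and 3 can be automated with a small script against Ollama's REST API. The sketch below assumes a local server on the default port and uses the non-streaming `/api/generate` endpoint, whose reply includes `response`, `eval_count`, and `eval_duration` (in nanoseconds). The helper names and the `send` parameter are illustrative, and the 1-5 scoring still has to be done by hand.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(reply: dict) -> float:
    """Derive generation speed from eval_count and eval_duration (nanoseconds)."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


def run_prompt(model: str, prompt: str, timeout: float = 300.0) -> dict:
    """Send one non-streaming generate request and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


def run_suite(models: list[str], prompts: dict[str, str], send=run_prompt) -> list[dict]:
    """Run every prompt through every model and collect text plus speed."""
    rows = []
    for model in models:
        for name, prompt in prompts.items():
            reply = send(model, prompt)
            rows.append({
                "model": model,
                "prompt": name,
                "response": reply["response"],
                "tokens_per_second": round(tokens_per_second(reply), 1),
            })
    return rows
```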
---
*Last Updated: 2026-02-04*