Document local LLM benchmark results, testing methodology, and model comparison notes for Ollama deployments. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Ollama Model Benchmark Prompts

Use the same prompts, unchanged, for every model so that results are comparable across tasks.

---

## Code Generation Benchmarks

### Simple Python Function
```
Write a Python function that takes a list of integers and returns a new list containing only the even numbers, sorted in descending order. Include type hints and a docstring.
```

### Medium Complexity - Class with Error Handling
```
Create a Python class called 'BankAccount' with the following requirements:
- Constructor takes account_number (str) and initial_balance (float)
- Methods: deposit(amount), withdraw(amount), get_balance()
- Withdraw should raise ValueError if insufficient funds
- Deposit should raise ValueError if amount is negative
- Include type hints and comprehensive docstrings
```

### Complex - Async API Handler
```
Create an async Python function that fetches data from multiple URLs concurrently using aiohttp. Requirements:
- Takes a list of URLs as input
- Uses aiohttp with proper session management
- Handles timeout errors and connection errors gracefully
- Returns a dictionary mapping URLs to their response data or error message
- Includes proper logging
- Use async/await patterns correctly
```

### Algorithm Challenge
```
Implement a function that finds the longest palindromic substring in a given string. Optimize for efficiency (aim for O(n²) or better). Provide both the implementation and a brief explanation of your approach.
```

---

## Code Analysis & Refactoring

### Bug Finding
````
Find and fix the bug in this Python code:
```python
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

result = calculate_average([])
```

Explain the bug and your fix.
````

### Code Refactoring
````
Refactor this Python code to be more Pythonic and efficient:
```python
def get_unique_words(text):
    words = []
    for word in text.split():
        if word not in words:
            words.append(word)
    return words
```

Explain your refactoring choices.
````

### Code Explanation
````
Explain what this Python code does and identify any potential issues:
```python
from typing import List, Optional

def process_items(items: List[dict]) -> Optional[dict]:
    if not items:
        return None

    result = {}
    for item in items:
        key = item.get('id', 'unknown')
        if key not in result:
            result[key] = []
        result[key].append(item.get('value'))

    return result
```
````

---

## General Reasoning Benchmarks

### Logic Problem
```
You have 12 coins, all identical in appearance. One coin is counterfeit and weighs slightly more than the others. You have a balance scale and can only use it 3 times. How do you identify the counterfeit coin? Explain your reasoning step by step.
```

### System Design (High Level)
```
Design the architecture for a real-time chat application that needs to support:
- 1 million concurrent users
- Message delivery in under 100ms
- Message history for last 30 days
- Group chats with up to 1000 members

Focus on high-level architecture and technology choices, not implementation details.
```

### Technical Explanation
```
Explain the concept of dependency injection in software development. Include:
- What problem it solves
- How it works
- A simple code example in Python or your preferred language
- Pros and cons compared to direct instantiation
```

---

## Data Analysis Benchmarks

### Data Processing Task
````
Given this JSON data representing sales transactions:
```json
[
  {"id": 1, "product": "Widget", "quantity": 5, "price": 10.00, "date": "2024-01-15"},
  {"id": 2, "product": "Gadget", "quantity": 3, "price": 25.00, "date": "2024-01-16"},
  {"id": 3, "product": "Widget", "quantity": 2, "price": 10.00, "date": "2024-01-17"},
  {"id": 4, "product": "Tool", "quantity": 1, "price": 50.00, "date": "2024-01-18"},
  {"id": 5, "product": "Gadget", "quantity": 4, "price": 25.00, "date": "2024-01-19"}
]
```

Write Python code to:
1. Calculate total revenue per product
2. Find the product with highest total revenue
3. Calculate the average order value
````

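When scoring Accuracy for the data processing prompt, it may help to have reference values to compare a model's output against. The following is one possible reference solution, not the only acceptable answer. The function names are illustrative, and it assumes "average order value" means the mean revenue per transaction, which the prompt leaves open.

```python
from collections import defaultdict

# Same records as in the prompt above.
transactions = [
    {"id": 1, "product": "Widget", "quantity": 5, "price": 10.00, "date": "2024-01-15"},
    {"id": 2, "product": "Gadget", "quantity": 3, "price": 25.00, "date": "2024-01-16"},
    {"id": 3, "product": "Widget", "quantity": 2, "price": 10.00, "date": "2024-01-17"},
    {"id": 4, "product": "Tool", "quantity": 1, "price": 50.00, "date": "2024-01-18"},
    {"id": 5, "product": "Gadget", "quantity": 4, "price": 25.00, "date": "2024-01-19"},
]


def revenue_per_product(rows):
    """Sum quantity * price for each product."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["product"]] += row["quantity"] * row["price"]
    return dict(totals)


def top_product(rows):
    """Return the product name with the highest total revenue."""
    totals = revenue_per_product(rows)
    return max(totals, key=totals.get)


def average_order_value(rows):
    """Mean revenue per transaction (assumed reading of the prompt)."""
    return sum(row["quantity"] * row["price"] for row in rows) / len(rows)


print(revenue_per_product(transactions))  # Widget 70.0, Gadget 175.0, Tool 50.0
print(top_product(transactions))          # Gadget
print(average_order_value(transactions))  # 59.0
```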
### Data Transformation
````
Write a Python function that converts a list of dictionaries (with inconsistent keys) into a standardized format. Handle missing keys gracefully with default values. Example:
```python
input_data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "years_old": 25},
    {"fullname": "Charlie", "age": 35},
    {"name": "Diana", "years_old": 28, "city": "NYC"}
]

# Desired output format:
# {"name": str, "age": int, "city": str (default "Unknown")}
```
````

---

## Planning & Task Breakdown

### Project Planning
```
Break down the task of building a simple REST API for a TODO list into smaller, actionable steps. Include:
- Technology stack choices with brief justification
- Development phases
- Key deliverables for each phase
- Estimated complexity of each step (Low/Medium/High)
```

### Debugging Strategy
```
You're experiencing intermittent failures in a production database connection pool. Describe your step-by-step debugging strategy to identify and resolve the issue.
```

---

## Benchmarking Criteria

For each prompt, rate the model on:

| Criteria | Scale | Notes |
|----------|-------|-------|
| Accuracy | 1-5 | How correct and complete is the answer? |
| Code Quality | 1-5 | Is the code idiomatic and well structured? |
| Explanations | 1-5 | Are explanations clear and thorough? |
| Response Speed | 1-5 | How fast did it respond? (subjective) |
| Follows Instructions | 1-5 | Did it follow all requirements? |

**Scoring:**
- 1: Poor
- 2: Below Average
- 3: Average
- 4: Good
- 5: Excellent

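One way to keep ratings comparable across models is to record them in a fixed structure and average each criterion per model. The following is a minimal sketch; the `Rating` type, the criterion keys, and the model names are hypothetical and not part of any existing tooling.

```python
from dataclasses import dataclass
from statistics import mean

# Keys mirror the rows of the criteria table above (hypothetical naming).
CRITERIA = (
    "accuracy",
    "code_quality",
    "explanations",
    "response_speed",
    "follows_instructions",
)


@dataclass
class Rating:
    """Scores for one model on one prompt, each criterion rated 1-5."""
    model: str
    prompt: str
    scores: dict

    def __post_init__(self):
        if set(self.scores) != set(CRITERIA):
            raise ValueError(f"scores must cover exactly: {CRITERIA}")
        for name, value in self.scores.items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")


def average_scores(ratings):
    """Return {model: {criterion: mean score across that model's prompts}}."""
    collected = {}
    for rating in ratings:
        per_model = collected.setdefault(rating.model, {c: [] for c in CRITERIA})
        for criterion in CRITERIA:
            per_model[criterion].append(rating.scores[criterion])
    return {
        model: {criterion: mean(values) for criterion, values in per_model.items()}
        for model, per_model in collected.items()
    }


# Usage with made-up scores for a hypothetical model name:
ratings = [
    Rating("model-a", "bug-finding", dict.fromkeys(CRITERIA, 4)),
    Rating("model-a", "refactoring", dict.fromkeys(CRITERIA, 5)),
]
print(average_scores(ratings)["model-a"]["accuracy"])  # 4.5
```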
---

## Testing Process

1. Run each prompt through each model
2. Record scores and qualitative notes
3. Note response times (optional but helpful)
4. Identify patterns in model strengths/weaknesses
5. Update `/mnt/NV2/Development/claude-home/ollama-model-testing.md` with findings

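Steps 1 and 3 can be scripted. The sketch below assumes an Ollama server listening on its default local address and uses the `/api/generate` endpoint with streaming disabled. The helper names, model names, and prompt labels are hypothetical. The `send` parameter is injectable so the loop can be exercised without a running server, and the timing is wall-clock time measured on the client.

```python
import json
import time
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def post_json(url, payload, timeout=300):
    """POST a JSON payload and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))


def run_benchmark(models, prompts, send=post_json):
    """Run every prompt through every model and record wall-clock time.

    `prompts` maps a short label to the prompt text.
    """
    results = []
    for model in models:
        for label, prompt in prompts.items():
            started = time.perf_counter()
            body = send(OLLAMA_URL, {"model": model, "prompt": prompt, "stream": False})
            elapsed = time.perf_counter() - started
            results.append({
                "model": model,
                "prompt": label,
                "seconds": round(elapsed, 2),
                "response": body.get("response", ""),
            })
    return results


# Example call (model names are placeholders for whatever is pulled locally):
# rows = run_benchmark(["model-a", "model-b"], {"simple-function": "Write a Python function ..."})
```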
---

*Last Updated: 2026-02-04*