---
title: "Ollama Benchmark Prompts"
description: "Standardized prompt suite and scoring criteria for evaluating Ollama LLM models across code generation, code analysis, reasoning, data analysis, and planning tasks."
type: reference
domain: development
tags: [ollama, llm, benchmarks, prompts, model-evaluation]
---

# Ollama Model Benchmark Prompts

Use these consistent prompts to evaluate different models across similar tasks.

---

## Code Generation Benchmarks

### Simple Python Function

```
Write a Python function that takes a list of integers and returns a new list containing only the even numbers, sorted in descending order. Include type hints and a docstring.
```

### Medium Complexity - Class with Error Handling

```
Create a Python class called 'BankAccount' with the following requirements:
- Constructor takes account_number (str) and initial_balance (float)
- Methods: deposit(amount), withdraw(amount), get_balance()
- Withdraw should raise ValueError if insufficient funds
- Deposit should raise ValueError if amount is negative
- Include type hints and comprehensive docstrings
```

### Complex - Async API Handler

```
Create an async Python function that fetches data from multiple URLs concurrently using aiohttp. Requirements:
- Takes a list of URLs as input
- Uses aiohttp with proper session management
- Handles timeout errors and connection errors gracefully
- Returns a dictionary mapping URLs to their response data or error message
- Includes proper logging
- Use async/await patterns correctly
```

### Algorithm Challenge

```
Implement a function that finds the longest palindromic substring in a given string. Optimize for efficiency (aim for O(n²) or better). Provide both the implementation and a brief explanation of your approach.
```

---

## Code Analysis & Refactoring

### Bug Finding

````
Find and fix the bug in this Python code:

```python
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

result = calculate_average([])
```

Explain the bug and your fix.
````

### Code Refactoring

````
Refactor this Python code to be more Pythonic and efficient:

```python
def get_unique_words(text):
    words = []
    for word in text.split():
        if word not in words:
            words.append(word)
    return words
```

Explain your refactoring choices.
````

### Code Explanation

````
Explain what this Python code does and identify any potential issues:

```python
from typing import List, Optional

def process_items(items: List[dict]) -> Optional[dict]:
    if not items:
        return None

    result = {}
    for item in items:
        key = item.get('id', 'unknown')
        if key not in result:
            result[key] = []
        result[key].append(item.get('value'))

    return result
```
````

---

## General Reasoning Benchmarks

### Logic Problem

```
You have 12 coins, all identical in appearance. One coin is counterfeit and weighs slightly more than the others. You have a balance scale and can only use it 3 times. How do you identify the counterfeit coin? Explain your reasoning step by step.
```

### System Design (High Level)

```
Design the architecture for a real-time chat application that needs to support:
- 1 million concurrent users
- Message delivery in under 100ms
- Message history for last 30 days
- Group chats with up to 1000 members

Focus on high-level architecture and technology choices, not implementation details.
```

### Technical Explanation

```
Explain the concept of dependency injection in software development. Include:
- What problem it solves
- How it works
- A simple code example in Python or your preferred language
- Pros and cons compared to direct instantiation
```

---

## Data Analysis Benchmarks

### Data Processing Task

````
Given this JSON data representing sales transactions:

```json
[
  {"id": 1, "product": "Widget", "quantity": 5, "price": 10.00, "date": "2024-01-15"},
  {"id": 2, "product": "Gadget", "quantity": 3, "price": 25.00, "date": "2024-01-16"},
  {"id": 3, "product": "Widget", "quantity": 2, "price": 10.00, "date": "2024-01-17"},
  {"id": 4, "product": "Tool", "quantity": 1, "price": 50.00, "date": "2024-01-18"},
  {"id": 5, "product": "Gadget", "quantity": 4, "price": 25.00, "date": "2024-01-19"}
]
```

Write Python code to:
1. Calculate total revenue per product
2. Find the product with highest total revenue
3. Calculate the average order value
````

### Data Transformation

````
Write a Python function that converts a list of dictionaries (with inconsistent keys) into a standardized format. Handle missing keys gracefully with default values.

Example:

```python
input_data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "years_old": 25},
    {"fullname": "Charlie", "age": 35},
    {"name": "Diana", "years_old": 28, "city": "NYC"}
]

# Desired output format:
# {"name": str, "age": int, "city": str (default "Unknown")}
```
````

---

## Planning & Task Breakdown

### Project Planning

```
Break down the task of building a simple REST API for a TODO list into smaller, actionable steps. Include:
- Technology stack choices with brief justification
- Development phases
- Key deliverables for each phase
- Estimated complexity of each step (Low/Medium/High)
```

### Debugging Strategy

```
You're experiencing intermittent failures in a production database connection pool. Describe your step-by-step debugging strategy to identify and resolve the issue.
```

---

## Benchmarking Criteria

For each prompt, rate the model on:

| Criteria | Scale | Notes |
|----------|-------|-------|
| Accuracy | 1-5 | How correct/complete is the answer? |
| Code Quality | 1-5 | Is code idiomatic, well-structured? |
| Explanations | 1-5 | Are explanations clear and thorough? |
| Response Speed | 1-5 | How fast did it respond? (subjective) |
| Follows Instructions | 1-5 | Did it follow all requirements? |

**Scoring:**
- 1 - Poor
- 2 - Below Average
- 3 - Average
- 4 - Good
- 5 - Excellent

---

## Testing Process

1. Run each prompt through each model
2. Record scores and qualitative notes
3. Note response times (optional but helpful)
4. Identify patterns in model strengths/weaknesses
5. Update `/mnt/NV2/Development/claude-home/ollama-model-testing.md` with findings

---

*Last Updated: 2026-02-04*
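
---

## Example Harness Sketch

The testing process and scoring criteria above could be partly automated with a small helper script. What follows is a minimal sketch, not an established tool. It assumes a local Ollama server on the default port `11434`, reached through the `/api/generate` endpoint with streaming disabled. The function names (`ollama_generate`, `run_prompt`, `average_scores`) and the score-sheet layout are hypothetical and chosen only for illustration. Scores are still assigned by a human reviewer; the script only times responses and averages the ratings.

```python
"""Sketch: time each prompt and average the 1-5 ratings per criterion."""
import json
import time
import urllib.request
from statistics import mean
from typing import Callable

# Assumption: a local Ollama server listening on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Criterion names copied from the Benchmarking Criteria table.
CRITERIA = (
    "Accuracy",
    "Code Quality",
    "Explanations",
    "Response Speed",
    "Follows Instructions",
)


def ollama_generate(model: str, prompt: str) -> str:
    """Send one non-streaming generate request and return the response text."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as reply:
        return json.loads(reply.read())["response"]


def run_prompt(
    model: str,
    prompt: str,
    generate: Callable[[str, str], str] = ollama_generate,
) -> dict:
    """Run one prompt and record the wall-clock response time in seconds."""
    start = time.perf_counter()
    text = generate(model, prompt)
    elapsed = time.perf_counter() - start
    return {"model": model, "prompt": prompt, "response": text, "seconds": elapsed}


def average_scores(score_sheets: list[dict[str, int]]) -> dict[str, float]:
    """Average manually assigned 1-5 ratings per criterion across prompts."""
    if not score_sheets:
        raise ValueError("at least one score sheet is required")
    for sheet in score_sheets:
        for criterion in CRITERIA:
            value = sheet.get(criterion)
            # Reject missing criteria and ratings outside the 1-5 scale.
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{criterion!r} must be an integer from 1 to 5")
    return {c: mean(sheet[c] for sheet in score_sheets) for c in CRITERIA}
```

To use it, call `run_prompt("llama3.2", prompt_text)` for each prompt (the model name here is only an example), rate the returned response by hand, and pass the resulting score sheets to `average_scores`. The `generate` parameter lets a stub stand in for the network call, so the timing and averaging logic can be exercised without a running server.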