LLM Eval Suite Builder (Golden Sets)
You are an ML engineer building an evaluation harness for [AI_FEATURE] (e.g., support bot, codegen, summarizer). Create:
1) Eval taxonomy: capability buckets (accuracy, safety, tone, latency, tool-use) with weights summing to 100%
2) 30 golden test cases in JSONL: {id, input, context, expected, rubric, tags, difficulty}
3) Scorers: rule-based checks + LLM-as-judge prompt (include calibration examples for 1/3/5 scores)
4) Regression policy: when to block release (thresholds per bucket)
5) CI integration sketch (GitHub Actions job stages, artifact uploads)
6) Human review protocol: 10% sample weekly, disagreement resolution
Domain rules: [DOMAIN_RULES]. Forbidden outputs: [FORBIDDEN]. Brand voice: [VOICE].
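To sanity-check what the model returns for items 1, 2, and 4, a small validator can help. The following is a minimal sketch, not a definitive harness: the weight and threshold values, the bucket score scale (0.0 to 1.0), and the function names `validate_weights`, `load_golden_cases`, `weighted_score`, and `release_gate` are all assumptions made for illustration. Only the bucket names and the JSONL field names come from the prompt itself.

```python
import json

# Field names taken from the prompt's JSONL schema.
REQUIRED_FIELDS = {"id", "input", "context", "expected", "rubric", "tags", "difficulty"}

# Hypothetical weights (percent) and per-bucket minimum scores (0.0 to 1.0).
# The prompt asks the model to propose these; the numbers here are placeholders.
BUCKET_WEIGHTS = {"accuracy": 40, "safety": 25, "tone": 15, "latency": 10, "tool-use": 10}
BLOCK_THRESHOLDS = {"accuracy": 0.90, "safety": 0.99, "tone": 0.80, "latency": 0.85, "tool-use": 0.85}


def validate_weights(weights):
    """Raise ValueError unless the bucket weights sum to exactly 100."""
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"bucket weights sum to {total}, expected 100")


def load_golden_cases(jsonl_text):
    """Parse JSONL text, checking required fields and unique ids."""
    cases, seen_ids = [], set()
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        case = json.loads(line)
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        if case["id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate id {case['id']!r}")
        seen_ids.add(case["id"])
        cases.append(case)
    return cases


def weighted_score(bucket_scores, weights):
    """Combine per-bucket scores (0.0 to 1.0) into one weighted score."""
    return sum(bucket_scores[bucket] * weight for bucket, weight in weights.items()) / 100


def release_gate(bucket_scores, thresholds):
    """Return the sorted list of buckets below threshold; empty means release may proceed."""
    return sorted(
        bucket for bucket, minimum in thresholds.items()
        if bucket_scores.get(bucket, 0.0) < minimum  # a missing bucket counts as failing
    )
```

In a CI job, a non-empty result from `release_gate` could be turned into a non-zero exit code so the release stage is blocked. Checking each bucket against its own threshold, rather than only the weighted total, keeps a strong accuracy score from masking a safety regression.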
Prompt Metadata
Primary Use Cases:
- Legacy code modernization & technical refactoring
- Full-stack layout generation & component structuring
- CI/CD workflow automation & unit/E2E testing suites
💡 Pro Tips & Advice
1. Fill in bracketed items: Replace every [PLACEHOLDER] with specific details before sending the prompt to the AI model.
2. Adjust temperature: Use a higher temperature (e.g., 0.8) for creative tasks and a lower one (e.g., 0.2) for strict coding or technical tasks.