In the chatbot era, a prompt was something you typed and threw away. In the agent era it's a config file you ship, version, evaluate, and maintain. The skill people called prompt engineering didn't go away. It grew up, picked up some new responsibilities, and quietly became the thing that decides whether your agent is useful or embarrassing.
This is the long brief on what actually works in 2026. It has examples you can copy.
The shift in one sentence
Prompts used to address a model. Now they configure a system.
What that means concretely: the prompts that matter live in system messages, tool descriptions, and step instructions, not the user message. The user message is just the request. The system around it has been pre-written, evaluated, and is doing the heavy lifting.
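To make that concrete, here is the shape of a typical chat-completions-style call. This is an illustrative sketch, not any specific provider's API; field names vary, but the division of labor doesn't.

# Illustrative payload shape. The system message carries the pre-written,
# evaluated configuration; the user message is just the raw request.
SYSTEM_PROMPT = "You are a Tier-1 support drafter..."  # the full prompt, versioned
ticket_body = "Hi, my Teams meeting recording is missing."

request = {
    "model": "your-model",
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},  # shipped and evaluated
        {"role": "user", "content": ticket_body},      # the request, nothing more
    ],
    "tools": [],  # tool descriptions go here; they are prompts too
}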
If you're still thinking of prompting as "what I type into ChatGPT," you're working at a layer the rest of the stack moved past two years ago.
What's load-bearing now
Three surfaces do most of the work in a modern agent prompt stack.
1. The system prompt. The agent's job description, constraints, persona, and operating manual. Typically a few hundred to a few thousand tokens, written once, evaluated against a test set, and only changed deliberately. A change to the system prompt is a deploy, not a chat tweak.
2. Tool descriptions. When an agent decides which tool to call, it does so largely on the strength of the tool's description in its registry. A vague description ("Searches the database") gets a tool that's never picked or always picked. A specific description with examples gets one that's selected appropriately. Tool descriptions are prompts. Most teams underinvest in writing them.
3. Step or stage instructions. In multi-step or multi-agent systems, each stage has its own focused prompt: "you are the editor, your job is to..." This is where the behavior of a multi-agent system is shaped, and where most of the iterative work happens.
User-side prompting still matters at the edges (especially for human-driven Copilot use), but the production-system prompts are where the leverage is.
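Seen as a unit, those three surfaces make an agent's configuration a small, versionable artifact. A hypothetical sketch of that structure (the names are mine, not any framework's):

from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    # Hypothetical shape; real frameworks differ, but these are the
    # three surfaces where the prompt engineering actually lives.
    system_prompt: str                   # job description, constraints, persona
    tool_descriptions: dict[str, str]    # tool name -> when/how-to-use text
    stage_instructions: list[str] = field(default_factory=list)  # per-step prompts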
Anatomy of a system prompt
Here's a realistically shaped system prompt for an agent that drafts first-response replies to Microsoft 365 support tickets. Read it once, then I'll annotate it.
You are a Tier-1 support drafter for an internal Microsoft 365 helpdesk.
GOAL
Draft a single first-response email to the requester for the ticket below.
Do not send the email. Do not modify the ticket. Output JSON only.
YOU CAN ASSUME
- The requester is an internal employee.
- The ticket has been triaged and is on your queue.
- A human reviewer will approve or edit your draft before it sends.
YOU MUST
- Address the requester by their first name from `ticket.requester.firstName`.
- Acknowledge the issue in one sentence using their words where possible.
- Provide one of: a clarifying question, a self-service link, or a status update.
- Keep the body under 120 words.
- Match the warm-but-businesslike tone of the examples below.
- Cite any KB article you reference by ID (e.g. KB-2031) in the response.
YOU MUST NOT
- Promise specific resolution times.
- Reference internal systems, escalation paths, or staffing.
- Apologize more than once.
- Invent product names, KB article IDs, or features that do not appear in
`ticket.body` or in the provided KB context.
- Reach into any tool other than `kb_search` and `tone_check`.
OUTPUT SCHEMA
{
"subject": string, // <= 80 chars
"body": string, // plain text, under 120 words
"kb_refs": string[], // KB IDs cited, can be empty
"needs_human_attention": boolean, // true if you decline to draft
"decline_reason": string | null // populated only when above is true
}
DECLINE INSTEAD OF GUESSING
If the ticket lacks the information you need, return
`needs_human_attention: true` with a one-sentence `decline_reason`.
EXAMPLES (good)
[example 1: short clarifying question with first-name greeting]
[example 2: pointer to a KB article with an honest "let us know" close]
[example 3: a decline with a clean, specific reason]
What that prompt is doing, line by line:
- Role first. "You are a Tier-1 support drafter" anchors the persona before any task. This consistently outperforms task-first phrasing.
- Goal stated tightly. One sentence. The model knows what success means.
- Constraints listed as MUSTs and MUST NOTs. Explicit, scannable, and (importantly) named. "Do not invent KB article IDs" prevents the failure mode it describes more often than not.
- Output schema. A typed shape. Downstream code can validate it. The model knows it is filling a form, not writing free prose.
- Decline path. A first-class option, not an afterthought. Without this, models confidently fabricate when underspecified. With it, the agent has a graceful way to escalate.
- Examples. Two or three good outputs in the prompt do more than a paragraph of instruction. The examples are the spec.
That prompt is roughly 350 tokens. The agent it configures is more reliable than one driven by a 2,000-token prompt that tries to be exhaustive. Length is not the goal. Specificity is.
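As a preview of best practice 9 below, that OUTPUT SCHEMA translates almost mechanically into a validator. A minimal sketch with Pydantic (Zod or raw JSON Schema work the same way); the sample output string is made up:

from pydantic import BaseModel, Field

class DraftResponse(BaseModel):
    # Mirrors the OUTPUT SCHEMA in the system prompt above.
    subject: str = Field(max_length=80)
    body: str                                # the 120-word cap is checked separately
    kb_refs: list[str] = []
    needs_human_attention: bool
    decline_reason: str | None = None

model_output = (
    '{"subject": "Quick question about your Teams recording", '
    '"body": "Hi Priya, ...", "kb_refs": ["KB-2031"], '
    '"needs_human_attention": false, "decline_reason": null}'
)
draft = DraftResponse.model_validate_json(model_output)  # raises on a bad shape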
Tool descriptions: before and after
The tool description is what the model sees when deciding whether to call a tool. It is also what the model sees when deciding what arguments to pass. Most people write them like internal API docs. They should read like a colleague's note explaining when to use the thing.
Bad:
{
"name": "kb_search",
"description": "Searches the knowledge base.",
"parameters": {
"query": { "type": "string" }
}
}

This gets called for everything, including questions that aren't in the KB. Or it gets ignored because the model has no idea when it's appropriate.
Better:
{
"name": "kb_search",
"description": "Search internal Microsoft 365 KB articles by free-text query. Use this whenever the user's ticket mentions a specific product (Outlook, Teams, OneDrive, Intune, Entra ID) or a specific symptom ('cannot sign in', 'meeting recording missing'). Do NOT use for general M365 strategy questions or for anything outside KB scope. Returns up to 5 article snippets with id, title, last_updated, and relevance score. Only cite articles with relevance > 0.6.",
"parameters": {
"query": {
"type": "string",
"description": "A 3 to 8 word search phrase. Strip greetings and signatures from the ticket body before constructing this. Example good queries: 'Teams meeting recording missing', 'Outlook calendar permissions delegation'."
}
}
}

The second version contains:
- When to call it (specific triggers, with examples).
- When not to call it (specific anti-triggers).
- What it returns (so the model knows how to use the result).
- A caveat ("Only cite articles with relevance > 0.6") that prevents a real failure mode.
- Parameter guidance including a worked example.
A good rule: if a new engineer on your team would need to ask follow-up questions to use the tool correctly, your tool description isn't done.
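For reference, here is how that description typically gets wired into a function-calling request. The envelope below is a common shape, but check your provider's exact schema; the description text is the part that matters.

# A common function-calling envelope around the improved description.
kb_search_tool = {
    "type": "function",
    "function": {
        "name": "kb_search",
        "description": (
            "Search internal Microsoft 365 KB articles by free-text query. "
            "Use when the ticket mentions a specific product or symptom. "
            "Do NOT use for general M365 strategy questions. "
            "Returns up to 5 snippets; only cite articles with relevance > 0.6."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "A 3 to 8 word search phrase, greetings stripped.",
                }
            },
            "required": ["query"],
        },
    },
}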
Ten best practices, in order of impact
The list below is roughly ordered by how much it improves output quality per minute of effort spent.
1. Lead with the role, then the goal. "You are a Microsoft 365 architect reviewing a Conditional Access policy" beats "Review this Conditional Access policy" by a wider margin than you'd expect. Set who the model is being before what it's doing.
2. Show, don't tell. Two or three good examples in the prompt outperform a paragraph of prose describing the same thing. If you find yourself adding more rules, try adding more examples first.
3. Constrain format, not thinking. Use schemas, JSON, or section headings to constrain output structure. Leave the reasoning loose. The opposite combination, loose format and constrained thinking, gives you confident-sounding garbage in unpredictable shapes.
4. Name the failure modes you fear most. "Do not invent product names that don't appear in the source documents." "If the ticket is missing the requester's email, decline instead of guessing." Naming the failure prevents the failure more often than not.
5. Make declining a first-class option. Models confidently fabricate when underspecified. Always give them a structured way to say "I don't know" or "this needs a human." Then code paths can branch on it.
6. Plan before doing for multi-step work. "First write a one-paragraph plan. Then execute it." The plan is also a great audit artifact for after-the-fact debugging.
7. Write tool descriptions like documentation, not labels. Cover when to use, when not to use, what's returned, and an example call. The model is reading these to make decisions, not as ornaments.
8. Version your prompts like code. Prompts go in source control. Changes go through review. A prompt change is a deploy. Treat the prompt repo with the same rigor as the application repo.
9. Validate output with code, not vibes. A schema is enforceable. A vibe is not. Use Pydantic, Zod, JSON Schema, or whatever your runtime offers, and reprompt on schema failures (a sketch follows this list).
10. Have an eval set, even a tiny one. Twenty representative inputs in a spreadsheet, the current prompt's outputs in column B, the previous prompt's outputs in column C, a human grader in column D. Run weekly. The teams that have this catch prompt regressions before users do.
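A minimal sketch of practice 9, with the decline path from practice 5 as the fallback. Here call_model is a placeholder for however your runtime invokes the model, and the retry wording is illustrative:

from pydantic import BaseModel, ValidationError

class DraftResponse(BaseModel):
    # Same shape as the anatomy section's OUTPUT SCHEMA.
    subject: str
    body: str
    kb_refs: list[str] = []
    needs_human_attention: bool
    decline_reason: str | None = None

def draft_with_validation(call_model, ticket: str, max_attempts: int = 3) -> DraftResponse:
    prompt = ticket
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return DraftResponse.model_validate_json(raw)
        except ValidationError as err:
            # Feed the error back so the model can repair its own output.
            prompt = f"{ticket}\n\nYour last output failed validation:\n{err}\nReturn corrected JSON only."
    # Decline as a first-class option: escalate instead of shipping garbage.
    return DraftResponse(subject="", body="", needs_human_attention=True,
                         decline_reason="Output failed schema validation after retries.")

Downstream code then branches on needs_human_attention instead of parsing prose.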
A worked example: the support drafter end-to-end
Putting it together. Here's a minimal, honest version of the support drafter agent's full configuration. This is the kind of artifact you can take to a security review.
System prompt: the one shown earlier in this brief.
Tools (registered in the agent's tool registry):
[
{
"name": "kb_search",
"description": "[the long version from earlier in this brief]",
"parameters": { "query": "string (3 to 8 words)" }
},
{
"name": "tone_check",
"description": "Run a draft through the company tone-check service. Returns one of: 'on_brand', 'too_formal', 'too_casual', 'apologetic_loop'. Call once after drafting, before returning. Do not loop on this tool more than twice per ticket.",
"parameters": { "draft_body": "string" }
}
]

Stage instruction (for the orchestrator that calls the model):
1. Read the ticket from `ticket.body`.
2. If the ticket mentions a specific M365 product or symptom, call kb_search.
Otherwise skip.
3. Draft the JSON response per the system prompt schema.
4. Call tone_check on the body.
5. If tone_check is not 'on_brand', revise once and re-check.
6. Return the JSON to the orchestrator. Do not call any other tools.
Exit conditions (set in the runtime):
- Maximum 6 tool calls per ticket.
- Maximum 4,000 input tokens.
- Maximum 30 seconds wall-clock.
- Stop and return a `needs_human_attention` JSON if any cap is hit.
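Here is roughly what enforcing those caps looks like in the runtime. A hypothetical sketch; call_model and run_tool are placeholders, and token accounting is elided (most SDKs report usage per call):

import time

MAX_TOOL_CALLS = 6
MAX_SECONDS = 30

def run_ticket(ticket, call_model, run_tool):
    start, tool_calls = time.monotonic(), 0
    state = {"ticket": ticket}
    while True:
        if tool_calls >= MAX_TOOL_CALLS or time.monotonic() - start > MAX_SECONDS:
            return {"needs_human_attention": True,
                    "decline_reason": "Runtime cap hit before a draft was produced."}
        step = call_model(state)  # returns a tool request or the final JSON
        if step.get("tool"):
            state[step["tool"]] = run_tool(step["tool"], step["args"])
            tool_calls += 1
        else:
            return step  # the final DraftResponse JSON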
Eval set: twenty representative tickets covering five categories (clarifying question, KB pointer, status update, decline-due-to-missing-info, decline-due-to-out-of-scope). Outputs scored weekly on three criteria: schema validity, tone alignment, factual grounding. Drift on any criterion triggers a prompt review.
That whole agent is maybe 600 lines including code and prompts. It is also reliable, auditable, and the kind of thing a Tier-1 helpdesk lead can confidently put in front of users. Notice how much of the reliability comes from the boring parts: schema, declining gracefully, tool descriptions, exit conditions. The cleverness of the prompt is the smallest part.
Anti-patterns that show up everywhere
- The mega-prompt. A 4,000-token system prompt with eighteen sections that contradict each other in three places. This is prompt-as-design-debt. Refactor it like code.
- Prompt-as-secret-sauce. Treating the prompt as a competitive moat is a trap. The prompt is the cheapest part to copy. The eval suite, data, and integrations around it are the moat.
- Clever-prompt-as-feature. Adding "think step by step" or "you are an expert" to every prompt regardless of task. These were useful in 2023; in 2026 they're cargo-culted noise on top of models that already do this internally.
- No structured output. Returning prose where JSON would do. Downstream code becomes regex archaeology, and the agent breaks every time the model paraphrases differently.
- No evals. Iterating prompts based on the last three outputs you saw. This is how prompts rot.
The eval question
The honest answer to "is this prompt good?" is "compared to what, on what set of inputs, judged by which criteria?" Until you have answers to those three questions, you're guessing.
Eval setups don't have to be elaborate. The cheapest version is the spreadsheet pattern from best practice 10. The next step up is a small Python script that runs the prompt against a fixed input set and writes outputs to disk for human review. The step after that is automated grading with a different model as judge. Most teams should be on the spreadsheet for far longer than they are.
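That middle step can be genuinely small. A sketch assuming one ticket per line in a JSONL file; call_model is again a placeholder:

import json
import pathlib

def run_evals(call_model, inputs="eval_tickets.jsonl", out_dir="eval_runs/current"):
    # Run the current prompt over the fixed input set; write outputs for review.
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, line in enumerate(pathlib.Path(inputs).read_text().splitlines()):
        ticket = json.loads(line)
        result = call_model(ticket)
        (out / f"ticket_{i:03d}.json").write_text(json.dumps(result, indent=2))

Diff the output directory against the previous run before shipping a prompt change; that diff is the regression test.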
In your M365 environment
If you're prompting against M365 Copilot or Copilot Studio specifically, four things to know.
Grounding instructions matter more than persona. Copilot's value is the M365 graph data behind it. Spend the system prompt budget on telling it what to ground in and what to ignore, not on personality.
Ground only in documents in the SharePoint site "Sales-Playbooks" with
sensitivity label Internal or below. Ignore documents in any other site.
If a relevant document is labeled Confidential or above, decline and
escalate instead of summarizing it.
That paragraph is doing more work than ten paragraphs of "you are a helpful sales assistant."
Sensitivity labels propagate, but instructions about them don't. Copilot will respect the label permissions automatically. It will not automatically know what you want it to do when it encounters sensitive content. Tell it explicitly. "Decline and escalate" is usually the right default.
Custom Copilot Studio agents drift. When Microsoft updates the underlying model, your prompt that was carefully tuned to the old one may regress. Re-run your evals after every Microsoft release note that mentions Copilot.
Tool descriptions in Copilot Studio matter as much as anywhere else. Studio's "topic descriptions" and "skill descriptions" are tool descriptions in another suit. Treat them as such. The same advice applies.
The teams that ship good agents in 2026 treat prompts like code: version-controlled, evaluated, reviewed, and maintained. The teams that don't end up rebuilding their prompt stack every six months while wondering why nothing improves.