Engineering Strategy Template
Template for solving specific engineering problems. Copy and fill in each section.
How to Use This Template
- Name the file: `strategy-[problem]-[year-quarter].md`. Example: `strategy-api-performance-2026-q1.md`
- Fill in each section below. Delete the instruction text as you write
- Workshop the diagnosis with 3-5 engineers before writing policies
- Share the draft org-wide for a 3-day feedback window
- Finalize and publish. Set a 2-month review date in your calendar
When to use: You have a recurring problem (slow deploys, flaky tests, broken onboarding), not a one-time decision (use an RFC instead).
Strategy: [Problem Name]
Owner: [Your name] Date: [YYYY-MM-DD] Review Date: [2 months from now] Status: Draft | Published | Retired
Diagnosis
Instructions: What's the problem? Be specific with metrics, dates, and impact. Answer:
- What's broken or slowing us down?
- When did it start? What changed?
- What constraints define this problem? (team size, budget, tech stack, timeline)
- What happens if we don't fix it?
Example:
API response time degraded from 200ms (Jan 2025) to 3.2s (Feb 2026). 40% of requests timeout. Root cause: database queries grew from 12 per request to 87 (N+1 queries). No caching. No query budget enforcement. Impact: 15% of users abandon checkout due to timeouts. $50K/month revenue loss.
Problem Statement
[Write 2-3 paragraphs. Use numbers. Be honest about root cause.]
Impact
| Metric | Current | Target | Deadline |
|---|---|---|---|
| [e.g., API response time p95] | 3.2s | <500ms | Mar 15, 2026 |
| [e.g., Timeout rate] | 40% | <1% | Mar 15, 2026 |
| [e.g., Revenue loss] | $50K/month | $0 | Mar 30, 2026 |
Root Cause
[1-2 sentences. What's the underlying cause, not just the symptom?]
Example: N+1 queries caused by ORM defaults. No caching layer. No query budget enforcement in CI.
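To make that root cause concrete, here is a minimal runnable sketch of the N+1 pattern and its single-query fix, using an in-memory SQLite database. The `users`/`orders` tables and data are hypothetical stand-ins for the real checkout models:

```python
import sqlite3

# Hypothetical stand-ins for the real checkout schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER REFERENCES users(id),
                         total REAL);
    INSERT INTO users  VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO orders VALUES (1, 1, 9.99), (2, 1, 5.00), (3, 2, 12.50);
""")

# N+1: one query for the list, then one more query per row. This is how
# "12 queries per request" quietly becomes 87 as the data grows.
users = conn.execute("SELECT id, name FROM users").fetchall()
for user_id, name in users:
    totals = conn.execute(
        "SELECT total FROM orders WHERE user_id = ?", (user_id,)
    ).fetchall()

# Fix: one query with a JOIN (or your ORM's eager-loading equivalent,
# e.g. Django's select_related/prefetch_related).
rows = conn.execute("""
    SELECT u.name, o.total
    FROM users u LEFT JOIN orders o ON o.user_id = u.id
""").fetchall()
```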
Guiding Policies
Instructions: What tradeoffs will we make? What rules apply to all teams? Answer the three questions:
- Resource allocation: Where do we spend engineering time?
- Fundamental rules: What standards apply across teams?
- Decision-making: How do teams make choices?
Good policies:
- Specific, not vague ("All endpoints <500ms" not "improve performance")
- Make tradeoffs explicit ("We choose speed over cost")
- Enforceable (you can measure compliance)
- 3-5 policies max (more = ignored)
Policy 1: [Name]
What: [One sentence rule]
Why: [One sentence rationale]
Enforcement: [How will we measure compliance? What happens if violated?]
Example:
- What: All API endpoints must return in <500ms at p95
- Why: 40% of users abandon checkout due to slow APIs, costing $50K/month
- Enforcement: CI fails if endpoint >500ms in load test. Alert fires if production >500ms
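An enforcement line this specific is easy to automate. Here is a minimal sketch of the CI gate, assuming your load-test tool hands back a list of latency samples (the sample data below is made up):

```python
import statistics

P95_BUDGET_MS = 500

def p95(samples_ms: list[float]) -> float:
    """95th-percentile latency from raw samples."""
    return statistics.quantiles(samples_ms, n=100)[94]

def enforce_budget(samples_ms: list[float]) -> None:
    """Exit nonzero (failing the CI job) if the budget is blown."""
    observed = p95(samples_ms)
    if observed > P95_BUDGET_MS:
        raise SystemExit(
            f"FAIL: p95 {observed:.0f}ms exceeds {P95_BUDGET_MS}ms budget"
        )
    print(f"OK: p95 {observed:.0f}ms within {P95_BUDGET_MS}ms budget")

if __name__ == "__main__":
    enforce_budget([120.0, 180.0, 240.0, 310.0, 450.0] * 20)
```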
Policy 2: [Name]
[Repeat structure above]
Policy 3: [Name]
[Repeat structure above]
Coherent Actions
Instructions: What specific steps implement the policies? Each action needs:
- Owner (name, not "team")
- Deadline (specific date)
- Success criteria (how do we know it's done?)
Group actions by theme (e.g., Enforcement, Transition, Escalation).
Enforcement (how we maintain the policies)
| Action | Owner | Deadline | Success Criteria |
|---|---|---|---|
| Add query counter middleware. Error if >10 queries (sketch below) | Alice | Feb 25 | PR merged. All endpoints instrumented |
| Set alert: fire if endpoint p95 >500ms in production | Alice | Feb 27 | Alert fires in #incidents channel |
| Add load test to CI. Block merge if >500ms | Bob | Mar 5 | CI runs load test on every PR |
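The query-counter action above might look like this sketch. The budget check is the point; how the counter attaches to requests is framework-specific and left hypothetical here (with SQLAlchemy, for instance, you would call `on_query` from a `before_cursor_execute` event listener, one counter per request):

```python
class QueryBudgetExceeded(RuntimeError):
    """A request issued more queries than its budget allows."""

class QueryCounter:
    def __init__(self, budget: int = 10):
        self.budget = budget
        self.count = 0

    def on_query(self, sql: str) -> None:
        # Called once per query from the ORM's query hook.
        self.count += 1
        if self.count > self.budget:
            raise QueryBudgetExceeded(
                f"{self.count} queries issued (budget {self.budget}): {sql!r}"
            )

counter = QueryCounter()
for _ in range(10):
    counter.on_query("SELECT 1")   # within budget
# An 11th query would raise QueryBudgetExceeded and fail the request.
```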
Transition (moving from current to desired state)
| Action | Owner | Deadline | Success Criteria |
|---|---|---|---|
| Implement Redis caching for /users, /orders (sketch below) | Bob | Mar 3 | Cache hit rate >80% after 1 week |
| Refactor checkout to async job queue | Charlie | Mar 10 | Checkout returns <200ms. Jobs process in background |
| Audit all endpoints for N+1 queries. Fix top 10 | Team | Mar 15 | 95% of endpoints <500ms |
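The Redis caching action follows the cache-aside pattern, sketched below. A plain dict stands in for Redis so the example runs anywhere; with redis-py you would swap it for `redis.Redis()` and use `get`/`setex` with the same TTL. `fetch_users_from_db` is a hypothetical stand-in for the expensive query path:

```python
import time

CACHE_TTL_S = 60
_cache: dict[str, tuple[float, list[dict]]] = {}   # stand-in for Redis

def fetch_users_from_db() -> list[dict]:
    """Hypothetical slow path: the multi-query request we're sheltering."""
    time.sleep(0.1)
    return [{"id": 1, "name": "ada"}]

def get_users() -> list[dict]:
    hit = _cache.get("users")
    if hit is not None and time.monotonic() - hit[0] < CACHE_TTL_S:
        return hit[1]                        # hit: zero DB round trips
    users = fetch_users_from_db()            # miss: one DB round trip
    _cache["users"] = (time.monotonic(), users)
    return users

get_users()   # miss: ~100ms
get_users()   # hit: microseconds -- where the >80% hit rate comes from
```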
Escalation (what if policies don't fit?)
| Action | Owner | Deadline | Success Criteria |
|---|---|---|---|
| Create RFC process for policy exceptions | Alice | Feb 28 | RFC template published. One exception approved as test |
| Weekly architecture review for endpoints >500ms | Tech Lead | Ongoing | First review Mar 1. Weekly recurring |
Success Metrics
Instructions: How will we know the strategy worked? Pick 3-5 metrics. Measure weekly or monthly. Review after 2 months.
| Metric | Baseline (today) | Target (end date) | Current | Trend |
|---|---|---|---|---|
| API response time (p95) | 3.2s | <500ms | | |
| Timeout rate | 40% | <1% | | |
| Revenue loss | $50K/month | $0 | | |
| Query count per request | 87 | <10 | | |
| Cache hit rate | 0% | >80% | | |
Review date: [2 months from publish date]
Review questions:
- Did metrics hit target?
- Are teams following policies?
- What surprised us? What would we change?
- Should we continue, adjust, or retire this strategy?
FAQ / Escalation
Instructions: Anticipate questions or objections. Answer them upfront.
Q: What if my endpoint needs >10 queries? A: Write an RFC explaining why. Get tech lead approval. Document the exception.
Q: What if Redis goes down? A: Fallback to Postgres. Cache miss = slower, but functional. Add Redis uptime SLO (99.9%).
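Sketched, that fallback is a try/except around the cache read. `cache_get` and `query_postgres` below are hypothetical stubs for the real clients (the cache stub simulates the outage):

```python
def cache_get(key: str) -> list[dict] | None:
    raise ConnectionError("redis unreachable")      # simulate the outage

def query_postgres(user_id: int) -> list[dict]:
    return [{"order_id": 1, "user_id": user_id}]    # canned stub data

def get_orders(user_id: int) -> list[dict]:
    try:
        cached = cache_get(f"orders:{user_id}")
        if cached is not None:
            return cached                  # fast path: cache hit
    except ConnectionError:
        pass                               # Redis down: degrade, don't fail
    return query_postgres(user_id)         # slower, but the request succeeds

assert get_orders(42)   # still works while the cache is down
```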
Q: What if this increases infrastructure cost? A: Yes, Redis will cost ~$500/month. Tradeoff: we're losing $50K/month to timeouts. 100x ROI.
Context / Background (optional)
Instructions: Add any supporting context. Previous attempts, related docs, research, incidents.
- Postmortem: Feb 10 API outage — 87 queries per request caused cascading failure
- RFC: Query budget middleware — Proposed solution, approved Feb 15
- Competitor analysis: Stripe API — They use query budgets + aggressive caching
Appendix: Dissenting Opinions (optional)
Instructions: Capture dissenting opinions before publishing. Shows you listened. Builds trust.
Dave (Backend Lead): "Redis adds complexity. Why not just optimize queries?"
Response: We'll do both. Query optimization is in "Transition" actions. But caching gives us 10x headroom for traffic spikes. Worth the complexity.
Eve (SRE): "Who's on-call for Redis?"
Response: Backend team owns Redis SLO. SRE trains them on runbooks. Escalate to SRE if multi-hour outage.
Review Log
Instructions: After 2 months, document what happened. Did the strategy work? Keep this section updated.
[Date] — Initial publish
[2 months later] — Review results:
- Metrics: [Did we hit targets?]
- Compliance: [Are teams following policies?]
- Surprises: [What did we learn?]
- Next steps: [Continue, adjust, retire?]
Template Checklist
Before publishing, verify:
- [ ] Diagnosis has specific metrics and dates
- [ ] Impact table shows current vs target
- [ ] Root cause identified (not just symptoms)
- [ ] 3-5 policies, each with enforcement mechanism
- [ ] Every action has owner + deadline + success criteria
- [ ] Success metrics defined with baseline and target
- [ ] 2-month review date in calendar
- [ ] Shared draft with 5+ engineers for feedback
- [ ] FAQ addresses likely objections
- [ ] Dissenting opinions captured and responded to
Review this strategy after 2 months. If it didn't change decisions or improve metrics, retire it.