Engineering Strategy Template
Template for solving specific engineering problems. Copy and fill in each section.
How to Use This Template
- Name the file: `strategy-[problem]-[year-quarter].md`. Example: `strategy-api-performance-2026-q1.md`
- Fill in each section below. Delete the instruction text as you write
- Workshop the diagnosis with 3-5 engineers before writing policies
- Share the draft org-wide for a 3-day feedback window
- Finalize and publish. Set a 2-month review date in your calendar
When to use: You have a recurring problem (slow deploys, flaky tests, broken onboarding), not a one-time decision (use an RFC instead).
Strategy: [Problem Name]
Owner: [Your name] Date: [YYYY-MM-DD] Review Date: [2 months from now] Status: Draft | Published | Retired
Diagnosis
Instructions: What's the problem? Be specific with metrics, dates, and impact. Answer:
- What's broken or slowing us down?
- When did it start? What changed?
- What constraints define this problem? (team size, budget, tech stack, timeline)
- What happens if we don't fix it?
Example:
API response time degraded from 200ms (Jan 2025) to 3.2s (Feb 2026). 40% of requests timeout. Root cause: database queries grew from 12 per request to 87 (N+1 queries). No caching. No query budget enforcement. Impact: 15% of users abandon checkout due to timeouts. $50K/month revenue loss.
Problem Statement
[Write 2-3 paragraphs. Use numbers. Be honest about root cause.]
Impact
| Metric | Current | Target | Deadline |
|---|---|---|---|
| [e.g., API response time p95] | 3.2s | <500ms | Mar 15, 2026 |
| [e.g., Timeout rate] | 40% | <1% | Mar 15, 2026 |
| [e.g., Revenue loss] | $50K/month | $0 | Mar 30, 2026 |
Root Cause
[1-2 sentences. What's the underlying cause, not just the symptom?]
Example: N+1 queries caused by ORM defaults. No caching layer. No query budget enforcement in CI.
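To make that root cause concrete, here is a minimal runnable sketch of the N+1 pattern and its single-query fix, using an in-memory SQLite database. The `users`/`orders` tables and data are hypothetical stand-ins for the real checkout models:

```python
import sqlite3

# Hypothetical stand-ins for the real checkout schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER REFERENCES users(id),
                         total REAL);
    INSERT INTO users  VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO orders VALUES (1, 1, 9.99), (2, 1, 5.00), (3, 2, 12.50);
""")

# N+1: one query for the list, then one more query per row. This is how
# "12 queries per request" quietly becomes 87 as the data grows.
users = conn.execute("SELECT id, name FROM users").fetchall()
for user_id, name in users:
    totals = conn.execute(
        "SELECT total FROM orders WHERE user_id = ?", (user_id,)
    ).fetchall()

# Fix: one query with a JOIN (or your ORM's eager-loading equivalent,
# e.g. Django's select_related/prefetch_related).
rows = conn.execute("""
    SELECT u.name, o.total
    FROM users u LEFT JOIN orders o ON o.user_id = u.id
""").fetchall()
```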
Guiding Policies
Instructions: What tradeoffs will we make? What rules apply to all teams? Answer the three questions:
- Resource allocation: Where do we spend engineering time?
- Fundamental rules: What standards apply across teams?
- Decision-making: How do teams make choices?
Good policies:
- Specific, not vague ("All endpoints <500ms" not "improve performance")
- Make tradeoffs explicit ("We choose speed over cost")
- Enforceable (you can measure compliance)
- 3-5 policies max (more = ignored)
Policy 1: [Name]
What: [One sentence rule]
Why: [One sentence rationale]
Enforcement: [How will we measure compliance? What happens if violated?]
Example:
- What: All API endpoints must return in <500ms at p95
- Why: 40% of users abandon checkout due to slow APIs, costing $50K/month
- Enforcement: CI fails if endpoint >500ms in load test. Alert fires if production >500ms
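An enforcement line this specific is easy to automate. Here is a minimal sketch of the CI gate, assuming your load-test tool hands back a list of latency samples (the sample data below is made up):

```python
import statistics

P95_BUDGET_MS = 500

def p95(samples_ms: list[float]) -> float:
    """95th-percentile latency from raw samples."""
    return statistics.quantiles(samples_ms, n=100)[94]

def enforce_budget(samples_ms: list[float]) -> None:
    """Exit nonzero (failing the CI job) if the budget is blown."""
    observed = p95(samples_ms)
    if observed > P95_BUDGET_MS:
        raise SystemExit(
            f"FAIL: p95 {observed:.0f}ms exceeds {P95_BUDGET_MS}ms budget"
        )
    print(f"OK: p95 {observed:.0f}ms within {P95_BUDGET_MS}ms budget")

if __name__ == "__main__":
    enforce_budget([120.0, 180.0, 240.0, 310.0, 450.0] * 20)
```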
Policy 2: [Name]
[Repeat structure above]
Policy 3: [Name]
[Repeat structure above]
Coherent Actions
Instructions: What specific steps implement the policies? Each action needs:
- Owner (name, not "team")
- Deadline (specific date)
- Success criteria (how do we know it's done?)
Group actions by theme (e.g., Enforcement, Transition, Escalation).
Enforcement (how we maintain the policies)
| Action | Owner | Deadline | Success Criteria |
|---|---|---|---|
| Add query counter middleware. Error if >10 queries (sketch below) | Alice | Feb 25 | PR merged. All endpoints instrumented |
| Set alert: fire if endpoint p95 >500ms in production | Alice | Feb 27 | Alert fires in #incidents channel |
| Add load test to CI. Block merge if >500ms | Bob | Mar 5 | CI runs load test on every PR |
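The query-counter action above might look like this sketch. The budget check is the point; how the counter attaches to requests is framework-specific and left hypothetical here (with SQLAlchemy, for instance, you would call `on_query` from a `before_cursor_execute` event listener, one counter per request):

```python
class QueryBudgetExceeded(RuntimeError):
    """A request issued more queries than its budget allows."""

class QueryCounter:
    def __init__(self, budget: int = 10):
        self.budget = budget
        self.count = 0

    def on_query(self, sql: str) -> None:
        # Called once per query from the ORM's query hook.
        self.count += 1
        if self.count > self.budget:
            raise QueryBudgetExceeded(
                f"{self.count} queries issued (budget {self.budget}): {sql!r}"
            )

counter = QueryCounter()
for _ in range(10):
    counter.on_query("SELECT 1")   # within budget
# An 11th query would raise QueryBudgetExceeded and fail the request.
```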
Transition (moving from current to desired state)
| Action | Owner | Deadline | Success Criteria |
|---|---|---|---|
| Implement Redis caching for /users, /orders (sketch below) | Bob | Mar 3 | Cache hit rate >80% after 1 week |
| Refactor checkout to async job queue | Charlie | Mar 10 | Checkout returns <200ms. Jobs process in background |
| Audit all endpoints for N+1 queries. Fix top 10 | Team | Mar 15 | 95% of endpoints <500ms |
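The Redis caching action follows the cache-aside pattern, sketched below. A plain dict stands in for Redis so the example runs anywhere; with redis-py you would swap it for `redis.Redis()` and use `get`/`setex` with the same TTL. `fetch_users_from_db` is a hypothetical stand-in for the expensive query path:

```python
import time

CACHE_TTL_S = 60
_cache: dict[str, tuple[float, list[dict]]] = {}   # stand-in for Redis

def fetch_users_from_db() -> list[dict]:
    """Hypothetical slow path: the multi-query request we're sheltering."""
    time.sleep(0.1)
    return [{"id": 1, "name": "ada"}]

def get_users() -> list[dict]:
    hit = _cache.get("users")
    if hit is not None and time.monotonic() - hit[0] < CACHE_TTL_S:
        return hit[1]                        # hit: zero DB round trips
    users = fetch_users_from_db()            # miss: one DB round trip
    _cache["users"] = (time.monotonic(), users)
    return users

get_users()   # miss: ~100ms
get_users()   # hit: microseconds -- where the >80% hit rate comes from
```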
Escalation (what if policies don't fit?)
| Action | Owner | Deadline | Success Criteria |
|---|---|---|---|
| Create RFC process for policy exceptions | Alice | Feb 28 | RFC template published. One exception approved as test |
| Weekly architecture review for endpoints >500ms | Tech Lead | Ongoing | First review Mar 1. Weekly recurring |
Success Metrics
Instructions: How will we know the strategy worked? Pick 3-5 metrics. Measure weekly or monthly. Review after 2 months.
| Metric | Baseline (today) | Target (end date) | Current | Trend |
|---|---|---|---|---|
| API response time (p95) | 3.2s | <500ms | | |
| Timeout rate | 40% | <1% | | |
| Revenue loss | $50K/month | $0 | | |
| Query count per request | 87 | <10 | | |
| Cache hit rate | 0% | >80% | | |
Review date: [2 months from publish date]
Review questions:
- Did metrics hit target?
- Are teams following policies?
- What surprised us? What would we change?
- Should we continue, adjust, or retire this strategy?
FAQ / Escalation
Instructions: Anticipate questions or objections. Answer them upfront.
Q: What if my endpoint needs >10 queries? A: Write an RFC explaining why. Get tech lead approval. Document the exception.
Q: What if Redis goes down? A: Fallback to Postgres. Cache miss = slower, but functional. Add Redis uptime SLO (99.9%).
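Sketched, that fallback is a try/except around the cache read. `cache_get` and `query_postgres` below are hypothetical stubs for the real clients (the cache stub simulates the outage):

```python
def cache_get(key: str) -> list[dict] | None:
    raise ConnectionError("redis unreachable")      # simulate the outage

def query_postgres(user_id: int) -> list[dict]:
    return [{"order_id": 1, "user_id": user_id}]    # canned stub data

def get_orders(user_id: int) -> list[dict]:
    try:
        cached = cache_get(f"orders:{user_id}")
        if cached is not None:
            return cached                  # fast path: cache hit
    except ConnectionError:
        pass                               # Redis down: degrade, don't fail
    return query_postgres(user_id)         # slower, but the request succeeds

assert get_orders(42)   # still works while the cache is down
```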
Q: What if this increases infrastructure cost? A: Yes, Redis will cost ~$500/month. Tradeoff: we're losing $50K/month to timeouts. 100x ROI.
Context / Background (optional)
Instructions: Add any supporting context. Previous attempts, related docs, research, incidents.
- Postmortem: Feb 10 API outage — 87 queries per request caused cascading failure
- RFC: Query budget middleware — Proposed solution, approved Feb 15
- Competitor analysis: Stripe API — They use query budgets + aggressive caching
Appendix: Dissenting Opinions (optional)
Instructions: Capture dissenting opinions before publishing. Shows you listened. Builds trust.
Dave (Backend Lead): "Redis adds complexity. Why not just optimize queries?"
Response: We'll do both. Query optimization is in "Transition" actions. But caching gives us 10x headroom for traffic spikes. Worth the complexity.
Eve (SRE): "Who's on-call for Redis?"
Response: Backend team owns Redis SLO. SRE trains them on runbooks. Escalate to SRE if multi-hour outage.
Review Log
Instructions: After 2 months, document what happened. Did the strategy work? Keep this section updated.
[Date] — Initial publish
[2 months later] — Review results:
- Metrics: [Did we hit targets?]
- Compliance: [Are teams following policies?]
- Surprises: [What did we learn?]
- Next steps: [Continue, adjust, retire?]
Template Checklist
Before publishing, verify:
- [ ] Diagnosis has specific metrics and dates
- [ ] Impact table shows current vs target
- [ ] Root cause identified (not just symptoms)
- [ ] 3-5 policies, each with enforcement mechanism
- [ ] Every action has owner + deadline + success criteria
- [ ] Success metrics defined with baseline and target
- [ ] 2-month review date in calendar
- [ ] Shared draft with 5+ engineers for feedback
- [ ] FAQ addresses likely objections
- [ ] Dissenting opinions captured and responded to
Review this strategy after 2 months. If it didn't change decisions or improve metrics, retire it.