
Engineering Strategy Template

Template for solving specific engineering problems. Copy and fill in each section.


How to Use This Template

  1. Name the file: strategy-[problem]-[year-quarter].md
    • Example: strategy-api-performance-2026-q1.md
  2. Fill in each section below. Delete the Instructions blocks as you write.
  3. Workshop the diagnosis with 3-5 engineers before writing policies.
  4. Share the draft org-wide for a 3-day feedback window.
  5. Finalize and publish. Set a 2-month review date in your calendar.

When to use: You have a recurring problem (slow deploys, flaky tests, broken onboarding). Not for one-time decisions (use an RFC instead).


Strategy: [Problem Name]

Owner: [Your name] Date: [YYYY-MM-DD] Review Date: [2 months from now] Status: Draft | Published | Retired


Diagnosis

Instructions: What's the problem? Be specific with metrics, dates, and impact. Answer:

  • What's broken or slowing us down?
  • When did it start? What changed?
  • What constraints define this problem? (team size, budget, tech stack, timeline)
  • What happens if we don't fix it?

Example:

API response time degraded from 200ms (Jan 2025) to 3.2s (Feb 2026). 40% of requests time out. Root cause: database queries grew from 12 per request to 87 (N+1 queries). No caching. No query budget enforcement. Impact: 15% of users abandon checkout due to timeouts. $50K/month revenue loss.
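If it helps to make that root cause concrete, here is a minimal, self-contained sketch of the N+1 pattern, using stdlib sqlite3 and hypothetical orders/items tables. A real ORM hides the loop but issues the same per-row queries by default.

```python
# Minimal N+1 illustration with stdlib sqlite3; table names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    INSERT INTO orders (id) VALUES (1), (2), (3);
    INSERT INTO items (order_id, sku) VALUES (1, 'A'), (2, 'B'), (3, 'C');
""")

# N+1: one query for the orders, then one query per order (what ORM lazy
# loading does by default). 3 orders -> 4 queries; 86 orders -> 87 queries.
orders = conn.execute("SELECT id FROM orders").fetchall()
for (order_id,) in orders:
    conn.execute("SELECT sku FROM items WHERE order_id = ?", (order_id,)).fetchall()

# Fix: one JOIN (or the ORM's eager-loading equivalent), regardless of row count.
rows = conn.execute(
    "SELECT o.id, i.sku FROM orders o JOIN items i ON i.order_id = o.id"
).fetchall()
print(rows)
```

With an ORM, the equivalent of the JOIN is eager loading (e.g., select_related/prefetch_related in Django, joinedload in SQLAlchemy).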

Problem Statement

[Write 2-3 paragraphs. Use numbers. Be honest about root cause.]

Impact

| Metric | Current | Target | Deadline |
| --- | --- | --- | --- |
| [e.g., API response time p95] | 3.2s | <500ms | Mar 15, 2026 |
| [e.g., Timeout rate] | 40% | <1% | Mar 15, 2026 |
| [e.g., Revenue loss] | $50K/month | $0 | Mar 30, 2026 |

Root Cause

[1-2 sentences. What's the underlying cause, not just the symptom?]

Example: N+1 queries caused by ORM defaults. No caching layer. No query budget enforcement in CI.


Guiding Policies

Instructions: What tradeoffs will we make? What rules apply to all teams? Answer the three questions:

  1. Resource allocation: Where do we spend engineering time?
  2. Fundamental rules: What standards apply across teams?
  3. Decision-making: How do teams make choices?

Good policies:

  • Specific, not vague ("All endpoints <500ms" not "improve performance")
  • Make tradeoffs explicit ("We choose speed over cost")
  • Enforceable (you can measure compliance)
  • 3-5 policies max (more = ignored)

Policy 1: [Name]

What: [One sentence rule]

Why: [One sentence rationale]

Enforcement: [How will we measure compliance? What happens if violated?]

Example:

  • What: All API endpoints must return in <500ms at p95
  • Why: 40% of users abandon checkout due to slow APIs, costing $50K/month
  • Enforcement: CI fails if endpoint >500ms in load test (see the sketch below). Alert fires if production >500ms
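One hedged sketch of what that CI gate could look like: a stdlib-only script that hits a staging endpoint a fixed number of times and exits non-zero if the measured p95 exceeds the budget. The URL, sample count, and threshold are placeholders; a real load test would use a dedicated tool (k6, Locust) and controlled traffic.

```python
# Sketch of a CI latency gate: fail the build if p95 latency exceeds the budget.
import statistics
import sys
import time
import urllib.request

STAGING_URL = "https://staging.example.com/api/orders"  # hypothetical endpoint
SAMPLES = 50
P95_BUDGET_SECONDS = 0.5

latencies = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    urllib.request.urlopen(STAGING_URL, timeout=5).read()
    latencies.append(time.perf_counter() - start)

p95 = statistics.quantiles(latencies, n=100)[94]  # 95th percentile
print(f"p95 latency: {p95 * 1000:.0f}ms (budget {P95_BUDGET_SECONDS * 1000:.0f}ms)")
if p95 > P95_BUDGET_SECONDS:
    sys.exit(1)  # non-zero exit blocks the merge in most CI systems
```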

Policy 2: [Name]

[Repeat structure above]

Policy 3: [Name]

[Repeat structure above]


Coherent Actions

Instructions: What specific steps implement the policies? Each action needs:

  • Owner (name, not "team")
  • Deadline (specific date)
  • Success criteria (how do we know it's done?)

Group actions by theme (e.g., Enforcement, Transition, Escalation).

Enforcement (how we maintain the policies)

| Action | Owner | Deadline | Success Criteria |
| --- | --- | --- | --- |
| Add query counter middleware. Error if >10 queries | Alice | Feb 25 | PR merged. All endpoints instrumented |
| Set alert: error if endpoint >500ms p95 | Alice | Feb 27 | Alert fires in #incidents channel |
| Add load test to CI. Block merge if >500ms | Bob | Mar 5 | CI runs load test on every PR |
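As a rough, framework-agnostic illustration of the query-counter action above, here is a budget wrapper that raises once a request exceeds its query limit. A real middleware would hook the ORM's query signal or the database driver rather than wrap sqlite3 directly; the limit and table are illustrative.

```python
# Hedged sketch of a per-request query budget.
import sqlite3
from contextlib import contextmanager

class QueryBudgetExceeded(RuntimeError):
    pass

class CountingConnection:
    """Wraps a DB connection and counts every executed statement."""
    def __init__(self, conn, limit):
        self._conn, self._limit, self.count = conn, limit, 0

    def execute(self, sql, params=()):
        self.count += 1
        if self.count > self._limit:
            raise QueryBudgetExceeded(f"{self.count} queries, budget is {self._limit}")
        return self._conn.execute(sql, params)

@contextmanager
def query_budget(conn, limit=10):
    wrapped = CountingConnection(conn, limit)
    yield wrapped
    print(f"request used {wrapped.count} queries (budget {limit})")

# Usage: wrap a request handler's DB access in the budget.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
with query_budget(conn, limit=10) as db:
    db.execute("SELECT * FROM users")
```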

Transition (moving from current to desired state)

| Action | Owner | Deadline | Success Criteria |
| --- | --- | --- | --- |
| Implement Redis caching for /users, /orders | Bob | Mar 3 | Cache hit rate >80% after 1 week |
| Refactor checkout to async job queue | Charlie | Mar 10 | Checkout returns <200ms. Jobs process in background |
| Audit all endpoints for N+1 queries. Fix top 10 | Team | Mar 15 | 95% of endpoints <500ms |
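A hedged sketch of the cache-aside pattern behind the Redis caching action above, which also shows the Postgres fallback discussed in the FAQ. It assumes the redis-py client; fetch_user_from_db, the key format, and the TTL are placeholders.

```python
# Cache-aside sketch: try Redis first, fall back to the database on a miss or
# if Redis is unreachable (an outage degrades latency, not availability).
import json
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 300  # illustrative TTL

def fetch_user_from_db(user_id):
    # Placeholder for the real Postgres query.
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    key = f"user:{user_id}"
    try:
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)          # cache hit
    except redis.exceptions.ConnectionError:
        pass                                   # Redis down: serve from Postgres
    user = fetch_user_from_db(user_id)         # cache miss or Redis outage
    try:
        r.set(key, json.dumps(user), ex=CACHE_TTL_SECONDS)
    except redis.exceptions.ConnectionError:
        pass
    return user
```

The try/except around the Redis calls is the design choice that keeps a Redis outage a latency problem rather than an availability problem.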

Escalation (what if policies don't fit?)

| Action | Owner | Deadline | Success Criteria |
| --- | --- | --- | --- |
| Create RFC process for policy exceptions | Alice | Feb 28 | RFC template published. One exception approved as test |
| Weekly arch review for >500ms endpoints | Tech Lead | Ongoing | First review Mar 1. Weekly recurring |

Success Metrics

Instructions: How will we know the strategy worked? Pick 3-5 metrics. Measure weekly or monthly. Review after 2 months.

| Metric | Baseline (today) | Target (end date) | Current | Trend |
| --- | --- | --- | --- | --- |
| API response time (p95) | 3.2s | <500ms | | |
| Timeout rate | 40% | <1% | | |
| Revenue loss | $50K/month | $0 | | |
| Query count per request | 87 | <10 | | |
| Cache hit rate | 0% | >80% | | |

Review date: [2 months from publish date]

Review questions:

  • Did metrics hit target?
  • Are teams following policies?
  • What surprised us? What would we change?
  • Should we continue, adjust, or retire this strategy?

FAQ / Escalation

Instructions: Anticipate questions or objections. Answer them upfront.

Q: What if my endpoint needs >10 queries? A: Write an RFC explaining why. Get tech lead approval. Document the exception.

Q: What if Redis goes down? A: Fallback to Postgres. Cache miss = slower, but functional. Add Redis uptime SLO (99.9%).

Q: What if this increases infrastructure cost? A: It will. Redis costs ~$500/month. Tradeoff: we're losing $50K/month to timeouts, so the return is roughly 100x.


Context / Background (optional)

Instructions: Add any supporting context. Previous attempts, related docs, research, incidents.


Appendix: Dissenting Opinions (optional)

Instructions: Capture dissenting opinions before publishing. Shows you listened. Builds trust.

Dave (Backend Lead): "Redis adds complexity. Why not just optimize queries?"

Response: We'll do both. Query optimization is in "Transition" actions. But caching gives us 10x headroom for traffic spikes. Worth the complexity.

Eve (SRE): "Who's on-call for Redis?"

Response: Backend team owns Redis SLO. SRE trains them on runbooks. Escalate to SRE if multi-hour outage.


Review Log

Instructions: After 2 months, document what happened. Did the strategy work? Keep this section updated.

[Date] — Initial publish

[2 months later] — Review results:

  • Metrics: [Did we hit targets?]
  • Compliance: [Are teams following policies?]
  • Surprises: [What did we learn?]
  • Next steps: [Continue, adjust, retire?]

Template Checklist

Before publishing, verify:

  • [ ] Diagnosis has specific metrics and dates
  • [ ] Impact table shows current vs target
  • [ ] Root cause identified (not just symptoms)
  • [ ] 3-5 policies, each with enforcement mechanism
  • [ ] Every action has owner + deadline + success criteria
  • [ ] Success metrics defined with baseline and target
  • [ ] 2-month review date in calendar
  • [ ] Shared draft with 5+ engineers for feedback
  • [ ] FAQ addresses likely objections
  • [ ] Dissenting opinions captured and responded to

Review this strategy after 2 months. If it didn't change decisions or improve metrics, retire it.

