
Engineering Onboarding Guide

How we work as engineers. Universal principles that apply across all projects.


TL;DR

Mindset:

  • Problems first - Understand the problem before proposing solutions
  • Small, fast iterations - Branch from main, merge within 1-2 days
  • Feature flags - Ship incomplete code behind flags, not long branches

Communication:

  • Flag blockers early - Stuck for 30 minutes? Ask for help
  • Overcommunicate - Remote work means making your work visible
  • Explain the "why" - PRs should justify decisions, not just describe changes

Observability:

  • Three pillars - Logs (events), Metrics (aggregates), Traces (request flow)
  • Structured logs - JSON > string interpolation
  • Correlation IDs - Link logs to traces to requests

Alerting:

  • Symptoms over causes - Alert on error rate, not CPU usage
  • Actionable only - If you can't act at 3am, don't page

Debugging:

  • Mitigate first - Stop the bleeding, then investigate
  • Check what changed - Deploys, config, dependencies
  • Postmortems are blameless - Focus on systems, not people

Part 1: How We Work

Problems First, Solutions Later

Before writing any code, answer:

  • What problem are we solving? (Not what feature are we building)
  • Who is affected? (Users? Ops? Other engineers?)
  • What's the impact? (Quantify: latency, error rate, revenue)
  • What are we NOT solving?

Bad: "We need to add a caching layer"

Good: "Dashboard loads take 4.2s. Users abandon after 3s. We're losing 12% of sessions."

The solution might be caching. Or query optimization. Or pagination. Or removing unnecessary data. Let the problem guide you.

Trunk-Based Development

Work directly off main. No long-lived feature branches.

The flow:

  1. Branch from main
  2. Make small, focused changes
  3. PR and merge within 1-2 days
  4. Use feature flags for incomplete work

Why it works:

  • Merge conflicts are rare and small
  • Everyone sees the latest code
  • CI/CD stays fast and reliable
  • No "integration hell" before releases

Anti-patterns:

  • A branch living for a week → break the work into smaller pieces
  • "I'll merge when it's done" → use feature flags
  • Rebasing a 50-commit branch → you waited too long

Feature Flags

Incomplete code can (and should) go to main. Wrap it in a flag.

```typescript
if (featureFlags.newPaymentFlow) {
  // new implementation
} else {
  // existing implementation
}
```

When to use:

  • Work spanning multiple PRs
  • Risky changes needing gradual rollout
  • A/B testing
  • Kill switch for new features

Flag lifecycle:

  1. Create flag (default: off)
  2. Develop behind flag
  3. Enable for internal testing
  4. Gradual rollout (10% → 50% → 100%)
  5. Remove flag and old code path

Don't let flags rot. Clean them up within 2 weeks of full rollout.
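
For the gradual-rollout step, one common approach is a deterministic percentage check keyed on a stable ID, so a given user stays in the same bucket as the percentage grows. A minimal sketch; the flag shape and helper below are hypothetical, not our actual flag service:

```typescript
import { createHash } from 'crypto';

// Hypothetical flag shape: a name plus a rollout percentage (0-100).
interface Flag {
  name: string;
  rolloutPercent: number;
}

// Hash flag name + userId so each user lands in a stable bucket per flag.
function isEnabled(flag: Flag, userId: string): boolean {
  const digest = createHash('sha256').update(`${flag.name}:${userId}`).digest();
  const bucket = digest.readUInt32BE(0) % 100; // 0-99
  return bucket < flag.rolloutPercent;
}

// userId would come from the request context in real code.
declare const userId: string;

// Raising rolloutPercent 10 → 50 → 100 widens the audience without touching callers.
if (isEnabled({ name: 'newPaymentFlow', rolloutPercent: 10 }, userId)) {
  // new implementation
}
```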

Small Commits, Small PRs

Commit size:

  • Each commit is self-contained and working
  • If you can't describe it in one line, it's too big
  • Aim for 50-200 lines changed per PR

Why small:

  • Reviewers actually read it
  • Easier to revert if something breaks
  • Easier to bisect when hunting bugs
  • Faster CI feedback

Breaking down large tasks:

| Step | PR |
|------|----|
| Refactor existing code | PR 1 |
| Add new interfaces/types | PR 2 |
| Implement logic behind flag | PR 3 |
| Write migration if needed | PR 4 |
| Enable and monitor | PR 5 |
| Clean up old code and flag | PR 6 |

One "feature" might be 5-6 PRs. That's fine. That's good.


Part 2: Communication

Flag Blockers Early

If you're stuck for more than 30 minutes:

  1. Document what you tried (not just "it doesn't work")
  2. Share context (error messages, logs, expected vs actual)
  3. Ask for help (Slack, pair programming, async review)

Bad: Struggling silently for 4 hours then saying "I'm blocked"

Good: "Stuck 30min on X. Tried A, B, C. Getting error Y. Anyone seen this?"

Blockers hidden are blockers multiplied.

Overcommunicate

Remote/async work means no one sees you working. Make your work visible.

Daily:

  • Brief update on what you're working on
  • Call out blockers or risks
  • Share interesting findings

On PRs:

  • Explain the "why" not just the "what"
  • Highlight risks and mitigation
  • Tag relevant people proactively

On issues:

  • Update status when it changes
  • Document decisions and rationale
  • Link related PRs/issues

Silence is ambiguous. "No update" could mean smooth sailing or total disaster. Don't make people guess.


Part 3: Observability

You can't fix what you can't see.

The Three Pillars

| Pillar | What | When to Use |
|--------|------|-------------|
| Logs | Discrete events with context | Debugging specific requests, audit trails |
| Metrics | Aggregated numbers over time | Dashboards, alerts, capacity planning |
| Traces | Request flow across services | Finding where time is spent, dependency issues |

Logging

What to log:

  • Request/response boundaries (API calls in/out)
  • State transitions (order: created → paid → fulfilled)
  • Errors with stack traces
  • Business events (user signed up, payment processed)

Never log:

  • PII (names, emails, phone numbers)
  • Credentials, tokens, secrets
  • High-frequency noise (every loop iteration)

Structure matters:

```typescript
// Bad - not searchable
logger.info(`User ${userId} made payment of ${amount}`);

// Good - searchable
logger.info('Payment processed', {
  userId,
  amount,
  currency,
  paymentMethod,
  transactionId,
  duration: endTime - startTime
});
```

Metrics

The Four Golden Signals:

  1. Latency - How long requests take (p50, p95, p99)
  2. Traffic - Request rate (requests/sec)
  3. Errors - Failure rate (5xx/total)
  4. Saturation - Resource usage (CPU, memory, connections)

Naming convention:

```
<service>_<what>_<unit>_<type>

payment_api_request_duration_seconds_histogram
user_service_active_connections_gauge
order_created_total_counter
```
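
As a concrete illustration, here is a minimal sketch using prom-client (a Prometheus client for Node); the metric names, labels, and buckets are examples, not a required setup:

```typescript
import client from 'prom-client';

// Latency (golden signal 1): histogram named following the convention above.
const requestDuration = new client.Histogram({
  name: 'payment_api_request_duration_seconds',
  help: 'Payment API request duration in seconds',
  labelNames: ['route', 'status'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

// Business/traffic counter.
const ordersCreated = new client.Counter({
  name: 'order_created_total',
  help: 'Total orders created',
});

// Usage: time a request, then record its outcome and the business event.
const stopTimer = requestDuration.startTimer({ route: '/charge' });
// ... handle the request ...
stopTimer({ status: '200' });
ordersCreated.inc();
```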

Traces

Traces show the journey of a single request across services.

Key concepts:

  • Trace - Entire journey (one request, many services)
  • Span - One unit of work (one service, one operation)
  • Trace ID - Links all spans together

When traces save you:

  • "Why is this endpoint slow?" → See which span takes longest
  • "Why did this request fail?" → See which service errored
  • "What services does this call?" → See the dependency graph

Correlation

Link everything with IDs:

```typescript
{
  traceId: 'abc123',      // Links to distributed trace
  requestId: 'req-456',   // Links to specific request
  userId: 'user-789',     // Links to user's journey
  orderId: 'order-012'    // Links to business entity
}
```

When something breaks:

  1. Find the error in logs
  2. Get the trace ID
  3. See the full request journey
  4. Identify exactly where it failed
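
A minimal sketch of how those IDs get attached in the first place, assuming Express and a pino-style logger; the header name and helper are illustrative:

```typescript
import { randomUUID } from 'crypto';
import pino from 'pino';
import type { Request, Response, NextFunction } from 'express';

const logger = pino();

export function correlationMiddleware(req: Request, res: Response, next: NextFunction) {
  // Reuse an ID passed in by an upstream service, or mint a new one.
  const requestId = req.header('x-request-id') ?? randomUUID();
  res.setHeader('x-request-id', requestId);

  // A child logger stamps every log line in this request with the same requestId,
  // which is what lets you jump from a single log entry to the full journey.
  res.locals.log = logger.child({ requestId });
  next();
}
```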

Part 4: Alerting

Alerts are for humans. Make them count.

Alert on Symptoms, Not Causes

Good: "Error rate > 5% for 5 minutes"

Bad: "Database CPU > 80%"

Users feel symptoms. High CPU might be fine if latency is good.

Alert Severity

| Severity | Response Time | Example |
|----------|---------------|---------|
| Critical | Immediate (page) | Service down, data loss risk |
| High | Within 1 hour | Error rate elevated, degraded performance |
| Medium | Within 4 hours | Non-critical feature broken |
| Low | Next business day | Warning thresholds, capacity planning |

Every Alert Needs

  1. What - Clear description of what's wrong
  2. Impact - Who/what is affected
  3. Runbook link - How to investigate/fix
  4. Dashboard link - Where to see more context
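
One lightweight way to keep these four items from drifting apart is to make them required fields of the alert definition itself. A hypothetical shape (the interface and URLs are illustrative):

```typescript
// Hypothetical shape: every alert carries what, impact, runbook, and dashboard.
interface AlertDefinition {
  name: string;
  severity: 'critical' | 'high' | 'medium' | 'low';
  condition: string;    // what's wrong, expressed as a symptom
  impact: string;       // who/what is affected
  runbookUrl: string;   // how to investigate/fix
  dashboardUrl: string; // where to see more context
}

const paymentErrorRate: AlertDefinition = {
  name: 'payment-api-high-error-rate',
  severity: 'critical',
  condition: '5xx responses > 5% of traffic for 5 minutes',
  impact: 'Customers cannot complete checkout',
  runbookUrl: 'https://runbooks.example.com/payment-api/high-error-rate',
  dashboardUrl: 'https://dashboards.example.com/payment-api',
};
```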

Avoiding Alert Fatigue

Problem: Too many alerts → people ignore them → real issues missed

Fix:

  • Tune thresholds based on actual impact
  • Use warning → critical escalation (warn at 3%, page at 5%)
  • Group related alerts
  • Review and retire stale alerts monthly
  • Track alert-to-action ratio (no action = remove the alert)

Part 5: Debugging

When things break, stay calm and be systematic.

The Process

  1. Acknowledge - Confirm you're looking at it
  2. Assess - What's the impact? How many users? Getting worse?
  3. Mitigate - Reduce impact quickly (feature flag, rollback, scale up)
  4. Investigate - Find root cause
  5. Fix - Implement proper solution
  6. Document - Write postmortem for significant incidents

Mitigation before investigation. Stop the bleeding first.

Investigation Checklist

□ What changed recently? (deploys, config, dependencies)
□ When did it start? (correlate with changes)
□ What's the error message/stack trace?
□ Which users/requests affected? (all or some?)
□ What do traces show?
□ What do metrics show? (latency, errors, saturation)
□ Can I reproduce locally?

Common Patterns

| Symptom | Check |
|---------|-------|
| "It was working yesterday" | Recent deploys, config changes, dependency updates |
| "Only some users affected" | Geography, user type, feature flag, specific data |
| "Slow but not erroring" | Traces for slow spans, database queries, external calls |
| "Errors spike then recover" | Resource exhaustion, connection limits, rate limiting |
| "Works locally, not in prod" | Environment differences, config, network, permissions |

Postmortems

Required for incidents that:

  • Affected users for > 15 minutes
  • Required emergency response
  • Revealed a systemic issue

Structure:

  1. Summary - One paragraph, what happened
  2. Timeline - Minute by minute
  3. Impact - Users affected, duration, business impact
  4. Root cause - Technical explanation
  5. What went well - Detection, response, mitigation
  6. What went poorly - Gaps in monitoring, slow response
  7. Action items - Specific, assigned, time-bound

Postmortems are blameless. Focus on systems, not people.


Part 6: Day-to-Day Practices

Code Review

As author:

  • Self-review before requesting reviews
  • Keep PRs small and focused
  • Respond to feedback within 24 hours
  • Don't take feedback personally

As reviewer:

  • Review within 24 hours
  • Be specific and actionable
  • Distinguish blockers from suggestions
  • Approve when it's "good enough", not "perfect"

Testing

  • Write tests first (TDD)
  • Test behavior, not implementation
  • One assertion per test when possible
  • Name tests like documentation: should_reject_payment_when_card_expired
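
For example, a behavior-focused test might look like this; a minimal sketch with Jest, where processPayment and its error message are hypothetical:

```typescript
import { processPayment } from './payments'; // hypothetical module under test

test('should_reject_payment_when_card_expired', async () => {
  const expiredCard = { number: '4242424242424242', expiry: '01/20' };

  // One behavior, one assertion: an expired card is rejected.
  await expect(processPayment({ amount: 50, card: expiredCard })).rejects.toThrow('card expired');
});
```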

Documentation

  • Document the "why" more than the "how"
  • Keep docs close to code (README in each module)
  • Update docs when you change behavior
  • Delete outdated docs (wrong docs are worse than no docs)

Quick Reference

Before Starting Work

  • [ ] Do I understand the problem (not just the solution)?
  • [ ] What's the impact if this succeeds?
  • [ ] What are the risks?
  • [ ] How will I know it's working?

Before Committing

  • [ ] Tests pass locally
  • [ ] Linting clean
  • [ ] Types check
  • [ ] Build succeeds
  • [ ] Commit message is clear

Before Merging

  • [ ] PR description explains why
  • [ ] Risks documented
  • [ ] Reviewers approved
  • [ ] CI green
  • [ ] Feature flagged if incomplete

After Deploying

  • [ ] Monitor metrics for 15 minutes
  • [ ] Check error rates
  • [ ] Verify feature works in prod
  • [ ] Update any related tickets

Common Mistakes

| Mistake | Fix |
|---------|-----|
| Proposing solutions before understanding problems | Ask "what problem are we solving?" first |
| Long-lived branches | Merge within 1-2 days, use feature flags |
| String interpolation in logs | Use structured JSON logs |
| Alerting on causes (CPU) | Alert on symptoms (error rate) |
| Investigating before mitigating | Stop the bleeding first |
| Blaming people in postmortems | Focus on systems and processes |
| Silent when blocked | Ask for help after 30 minutes |
| "No update" status | Silence is ambiguous, communicate proactively |
