
Engineering Onboarding Guide

How we work as engineers. Universal principles that apply across all projects.


TL;DR

Mindset:

  • Problems first - Understand the problem before proposing solutions
  • Small, fast iterations - Branch from main, merge within 1-2 days
  • Feature flags - Ship incomplete code behind flags, not long branches

Communication:

  • Flag blockers early - Stuck for 30 minutes? Ask for help
  • Overcommunicate - Remote work means making your work visible
  • Explain the "why" - PRs should justify decisions, not just describe changes

Observability:

  • Three pillars - Logs (events), Metrics (aggregates), Traces (request flow)
  • Structured logs - JSON > string interpolation
  • Correlation IDs - Link logs to traces to requests

Alerting:

  • Symptoms over causes - Alert on error rate, not CPU usage
  • Actionable only - If you can't act at 3am, don't page

Debugging:

  • Mitigate first - Stop the bleeding, then investigate
  • Check what changed - Deploys, config, dependencies
  • Postmortems are blameless - Focus on systems, not people

Part 1: How We Work

Problems First, Solutions Later

Before writing any code, answer:

  • What problem are we solving? (Not what feature are we building)
  • Who is affected? (Users? Ops? Other engineers?)
  • What's the impact? (Quantify: latency, error rate, revenue)
  • What are we NOT solving?

Bad: "We need to add a caching layer"

Good: "Dashboard loads take 4.2s. Users abandon after 3s. We're losing 12% of sessions."

The solution might be caching. Or query optimization. Or pagination. Or removing unnecessary data. Let the problem guide you.

Trunk-Based Development

Work directly off main. No long-lived feature branches.

The flow:

  1. Branch from main
  2. Make small, focused changes
  3. PR and merge within 1-2 days
  4. Use feature flags for incomplete work

Why it works:

  • Merge conflicts are rare and small
  • Everyone sees the latest code
  • CI/CD stays fast and reliable
  • No "integration hell" before releases

Anti-patterns:

  • A branch living for a week → break the work into smaller pieces
  • "I'll merge when it's done" → use feature flags
  • Rebasing a 50-commit branch → you waited too long

Feature Flags

Incomplete code can (and should) go to main. Wrap it in a flag.

```typescript
if (featureFlags.newPaymentFlow) {
  // new implementation
} else {
  // existing implementation
}
```

When to use:

  • Work spanning multiple PRs
  • Risky changes needing gradual rollout
  • A/B testing
  • Kill switch for new features

Flag lifecycle:

  1. Create flag (default: off)
  2. Develop behind flag
  3. Enable for internal testing
  4. Gradual rollout (10% → 50% → 100%)
  5. Remove flag and old code path

Don't let flags rot. Clean them up within 2 weeks of full rollout.
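
For the gradual-rollout step, one common approach is a deterministic percentage check keyed on a stable ID, so a given user stays in the same bucket as the percentage grows. A minimal sketch; the flag shape and helper below are hypothetical, not our actual flag service:

```typescript
import { createHash } from 'crypto';

// Hypothetical flag shape: a name plus a rollout percentage (0-100).
interface Flag {
  name: string;
  rolloutPercent: number;
}

// Hash flag name + userId so each user lands in a stable bucket per flag.
function isEnabled(flag: Flag, userId: string): boolean {
  const digest = createHash('sha256').update(`${flag.name}:${userId}`).digest();
  const bucket = digest.readUInt32BE(0) % 100; // 0-99
  return bucket < flag.rolloutPercent;
}

// userId would come from the request context in real code.
declare const userId: string;

// Raising rolloutPercent 10 → 50 → 100 widens the audience without touching callers.
if (isEnabled({ name: 'newPaymentFlow', rolloutPercent: 10 }, userId)) {
  // new implementation
}
```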

Small Commits, Small PRs

Commit size:

  • Each commit is self-contained and working
  • If you can't describe it in one line, it's too big
  • Aim for 50-200 lines changed per PR

Why small:

  • Reviewers actually read it
  • Easier to revert if something breaks
  • Easier to bisect when hunting bugs
  • Faster CI feedback

Breaking down large tasks:

| Step | PR |
|------|----|
| Refactor existing code | PR 1 |
| Add new interfaces/types | PR 2 |
| Implement logic behind flag | PR 3 |
| Write migration if needed | PR 4 |
| Enable and monitor | PR 5 |
| Clean up old code and flag | PR 6 |

One "feature" might be 5-6 PRs. That's fine. That's good.


Part 2: Communication

Flag Blockers Early

If you're stuck for more than 30 minutes:

  1. Document what you tried (not just "it doesn't work")
  2. Share context (error messages, logs, expected vs actual)
  3. Ask for help (Slack, pair programming, async review)

Bad: Struggling silently for 4 hours then saying "I'm blocked"

Good: "Stuck 30min on X. Tried A, B, C. Getting error Y. Anyone seen this?"

Blockers hidden are blockers multiplied.

Overcommunicate

Remote/async work means no one sees you working. Make your work visible.

Daily:

  • Brief update on what you're working on
  • Call out blockers or risks
  • Share interesting findings

On PRs:

  • Explain the "why" not just the "what"
  • Highlight risks and mitigation
  • Tag relevant people proactively

On issues:

  • Update status when it changes
  • Document decisions and rationale
  • Link related PRs/issues

Silence is ambiguous. "No update" could mean smooth sailing or total disaster. Don't make people guess.


Part 3: Observability

You can't fix what you can't see.

The Three Pillars

| Pillar | What | When to Use |
|--------|------|-------------|
| Logs | Discrete events with context | Debugging specific requests, audit trails |
| Metrics | Aggregated numbers over time | Dashboards, alerts, capacity planning |
| Traces | Request flow across services | Finding where time is spent, dependency issues |

Logging

What to log:

  • Request/response boundaries (API calls in/out)
  • State transitions (order: created → paid → fulfilled)
  • Errors with stack traces
  • Business events (user signed up, payment processed)

Never log:

  • PII (names, emails, phone numbers)
  • Credentials, tokens, secrets
  • High-frequency noise (every loop iteration)

Structure matters:

```typescript
// Bad - not searchable
logger.info(`User ${userId} made payment of ${amount}`);

// Good - searchable
logger.info('Payment processed', {
  userId,
  amount,
  currency,
  paymentMethod,
  transactionId,
  duration: endTime - startTime
});
```

Metrics

The Four Golden Signals:

  1. Latency - How long requests take (p50, p95, p99)
  2. Traffic - Request rate (requests/sec)
  3. Errors - Failure rate (5xx/total)
  4. Saturation - Resource usage (CPU, memory, connections)

Naming convention:

```
<service>_<what>_<unit>_<type>

payment_api_request_duration_seconds_histogram
user_service_active_connections_gauge
order_created_total_counter
```
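
As a concrete illustration, here is a minimal sketch using prom-client (a Prometheus client for Node); the metric names, labels, and buckets are examples, not a required setup:

```typescript
import client from 'prom-client';

// Latency (golden signal 1): histogram named following the convention above.
const requestDuration = new client.Histogram({
  name: 'payment_api_request_duration_seconds',
  help: 'Payment API request duration in seconds',
  labelNames: ['route', 'status'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

// Business/traffic counter.
const ordersCreated = new client.Counter({
  name: 'order_created_total',
  help: 'Total orders created',
});

// Usage: time a request, then record its outcome and the business event.
const stopTimer = requestDuration.startTimer({ route: '/charge' });
// ... handle the request ...
stopTimer({ status: '200' });
ordersCreated.inc();
```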

Traces

Traces show the journey of a single request across services.

Key concepts:

  • Trace - Entire journey (one request, many services)
  • Span - One unit of work (one service, one operation)
  • Trace ID - Links all spans together

When traces save you:

  • "Why is this endpoint slow?" → See which span takes longest
  • "Why did this request fail?" → See which service errored
  • "What services does this call?" → See the dependency graph

Correlation

Link everything with IDs:

```typescript
{
  traceId: 'abc123',      // Links to distributed trace
  requestId: 'req-456',   // Links to specific request
  userId: 'user-789',     // Links to user's journey
  orderId: 'order-012'    // Links to business entity
}
```

When something breaks:

  1. Find the error in logs
  2. Get the trace ID
  3. See the full request journey
  4. Identify exactly where it failed
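
A minimal sketch of how those IDs get attached in the first place, assuming Express and a pino-style logger; the header name and helper are illustrative:

```typescript
import { randomUUID } from 'crypto';
import pino from 'pino';
import type { Request, Response, NextFunction } from 'express';

const logger = pino();

export function correlationMiddleware(req: Request, res: Response, next: NextFunction) {
  // Reuse an ID passed in by an upstream service, or mint a new one.
  const requestId = req.header('x-request-id') ?? randomUUID();
  res.setHeader('x-request-id', requestId);

  // A child logger stamps every log line in this request with the same requestId,
  // which is what lets you jump from a single log entry to the full journey.
  res.locals.log = logger.child({ requestId });
  next();
}
```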

Part 4: Alerting

Alerts are for humans. Make them count.

Alert on Symptoms, Not Causes

Good: "Error rate > 5% for 5 minutes"

Bad: "Database CPU > 80%"

Users feel symptoms. High CPU might be fine if latency is good.

Alert Severity

| Severity | Response Time | Example |
|----------|---------------|---------|
| Critical | Immediate (page) | Service down, data loss risk |
| High | Within 1 hour | Error rate elevated, degraded performance |
| Medium | Within 4 hours | Non-critical feature broken |
| Low | Next business day | Warning thresholds, capacity planning |

Every Alert Needs

  1. What - Clear description of what's wrong
  2. Impact - Who/what is affected
  3. Runbook link - How to investigate/fix
  4. Dashboard link - Where to see more context
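
One lightweight way to keep these four items from drifting apart is to make them required fields of the alert definition itself. A hypothetical shape (the interface and URLs are illustrative):

```typescript
// Hypothetical shape: every alert carries what, impact, runbook, and dashboard.
interface AlertDefinition {
  name: string;
  severity: 'critical' | 'high' | 'medium' | 'low';
  condition: string;    // what's wrong, expressed as a symptom
  impact: string;       // who/what is affected
  runbookUrl: string;   // how to investigate/fix
  dashboardUrl: string; // where to see more context
}

const paymentErrorRate: AlertDefinition = {
  name: 'payment-api-high-error-rate',
  severity: 'critical',
  condition: '5xx responses > 5% of traffic for 5 minutes',
  impact: 'Customers cannot complete checkout',
  runbookUrl: 'https://runbooks.example.com/payment-api/high-error-rate',
  dashboardUrl: 'https://dashboards.example.com/payment-api',
};
```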

Avoiding Alert Fatigue

Problem: Too many alerts → people ignore them → real issues missed

Fix:

  • Tune thresholds based on actual impact
  • Use warning → critical escalation (warn at 3%, page at 5%)
  • Group related alerts
  • Review and retire stale alerts monthly
  • Track alert-to-action ratio (no action = remove the alert)

Part 5: Debugging

When things break, stay calm and be systematic.

The Process

  1. Acknowledge - Confirm you're looking at it
  2. Assess - What's the impact? How many users? Getting worse?
  3. Mitigate - Reduce impact quickly (feature flag, rollback, scale up)
  4. Investigate - Find root cause
  5. Fix - Implement proper solution
  6. Document - Write postmortem for significant incidents

Mitigation before investigation. Stop the bleeding first.

Investigation Checklist

□ What changed recently? (deploys, config, dependencies)
□ When did it start? (correlate with changes)
□ What's the error message/stack trace?
□ Which users/requests affected? (all or some?)
□ What do traces show?
□ What do metrics show? (latency, errors, saturation)
□ Can I reproduce locally?

Common Patterns

| Symptom | Check |
|---------|-------|
| "It was working yesterday" | Recent deploys, config changes, dependency updates |
| "Only some users affected" | Geography, user type, feature flag, specific data |
| "Slow but not erroring" | Traces for slow spans, database queries, external calls |
| "Errors spike then recover" | Resource exhaustion, connection limits, rate limiting |
| "Works locally, not in prod" | Environment differences, config, network, permissions |

Postmortems

Required for incidents that:

  • Affected users for > 15 minutes
  • Required emergency response
  • Revealed a systemic issue

Structure:

  1. Summary - One paragraph, what happened
  2. Timeline - Minute by minute
  3. Impact - Users affected, duration, business impact
  4. Root cause - Technical explanation
  5. What went well - Detection, response, mitigation
  6. What went poorly - Gaps in monitoring, slow response
  7. Action items - Specific, assigned, time-bound

Postmortems are blameless. Focus on systems, not people.


Part 6: Day-to-Day Practices

Code Review

As author:

  • Self-review before requesting reviews
  • Keep PRs small and focused
  • Respond to feedback within 24 hours
  • Don't take feedback personally

As reviewer:

  • Review within 24 hours
  • Be specific and actionable
  • Distinguish blockers from suggestions
  • Approve when it's "good enough", not "perfect"

Testing

  • Write tests first (TDD)
  • Test behavior, not implementation
  • One assertion per test when possible
  • Name tests like documentation: should_reject_payment_when_card_expired
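
For example, a behavior-focused test might look like this; a minimal sketch with Jest, where processPayment and its error message are hypothetical:

```typescript
import { processPayment } from './payments'; // hypothetical module under test

test('should_reject_payment_when_card_expired', async () => {
  const expiredCard = { number: '4242424242424242', expiry: '01/20' };

  // One behavior, one assertion: an expired card is rejected.
  await expect(processPayment({ amount: 50, card: expiredCard })).rejects.toThrow('card expired');
});
```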

Documentation

  • Document the "why" more than the "how"
  • Keep docs close to code (README in each module)
  • Update docs when you change behavior
  • Delete outdated docs (wrong docs are worse than no docs)

Quick Reference

Before Starting Work

  • [ ] Do I understand the problem (not just the solution)?
  • [ ] What's the impact if this succeeds?
  • [ ] What are the risks?
  • [ ] How will I know it's working?

Before Committing

  • [ ] Tests pass locally
  • [ ] Linting clean
  • [ ] Types check
  • [ ] Build succeeds
  • [ ] Commit message is clear

Before Merging

  • [ ] PR description explains why
  • [ ] Risks documented
  • [ ] Reviewers approved
  • [ ] CI green
  • [ ] Feature flagged if incomplete

After Deploying

  • [ ] Monitor metrics for 15 minutes
  • [ ] Check error rates
  • [ ] Verify feature works in prod
  • [ ] Update any related tickets

Common Mistakes

| Mistake | Fix |
|---------|-----|
| Proposing solutions before understanding problems | Ask "what problem are we solving?" first |
| Long-lived branches | Merge within 1-2 days, use feature flags |
| String interpolation in logs | Use structured JSON logs |
| Alerting on causes (CPU) | Alert on symptoms (error rate) |
| Investigating before mitigating | Stop the bleeding first |
| Blaming people in postmortems | Focus on systems and processes |
| Silent when blocked | Ask for help after 30 minutes |
| "No update" status | Silence is ambiguous, communicate proactively |
