Engineering Onboarding Guide
How we work as engineers. Universal principles that apply across all projects.
TL;DR
Mindset:
- Problems first - Understand the problem before proposing solutions
- Small, fast iterations - Branch from main, merge within 1-2 days
- Feature flags - Ship incomplete code behind flags, not long branches
Communication:
- Flag blockers early - Stuck for 30 minutes? Ask for help
- Overcommunicate - Remote work means making your work visible
- Explain the "why" - PRs should justify decisions, not just describe changes
Observability:
- Three pillars - Logs (events), Metrics (aggregates), Traces (request flow)
- Structured logs - JSON > string interpolation
- Correlation IDs - Link logs to traces to requests
Alerting:
- Symptoms over causes - Alert on error rate, not CPU usage
- Actionable only - If you can't act at 3am, don't page
Debugging:
- Mitigate first - Stop the bleeding, then investigate
- Check what changed - Deploys, config, dependencies
- Postmortems are blameless - Focus on systems, not people
Part 1: How We Work
Problems First, Solutions Later
Before writing any code, answer:
- What problem are we solving? (Not what feature are we building)
- Who is affected? (Users? Ops? Other engineers?)
- What's the impact? (Quantify: latency, error rate, revenue)
- What are we NOT solving?
Bad: "We need to add a caching layer"
Good: "Dashboard loads take 4.2s. Users abandon after 3s. We're losing 12% of sessions."
The solution might be caching. Or query optimization. Or pagination. Or removing unnecessary data. Let the problem guide you.
Trunk-Based Development
Work directly off main. No long-lived feature branches.
The flow:
- Branch from main
- Make small, focused changes
- PR and merge within 1-2 days
- Use feature flags for incomplete work
Why it works:
- Merge conflicts are rare and small
- Everyone sees the latest code
- CI/CD stays fast and reliable
- No "integration hell" before releases
Anti-patterns:
- Branch living for a week → break it down smaller
- "I'll merge when it's done" → use feature flags
- Rebasing a 50-commit branch → you waited too long
Feature Flags
Incomplete code can (and should) go to main. Wrap it in a flag.
```js
if (featureFlags.newPaymentFlow) {
  // new implementation
} else {
  // existing implementation
}
```
When to use:
- Work spanning multiple PRs
- Risky changes needing gradual rollout
- A/B testing
- Kill switch for new features
Flag lifecycle:
- Create flag (default: off)
- Develop behind flag
- Enable for internal testing
- Gradual rollout (10% → 50% → 100%)
- Remove flag and old code path
Don't let flags rot. Clean them up within 2 weeks of full rollout.
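The boolean check above extends naturally to the gradual-rollout stage. A minimal sketch of deterministic percentage bucketing, assuming a hypothetical in-memory flag store (our real flag client will differ; all names here are illustrative):

```typescript
import { createHash } from 'node:crypto';

// Hypothetical flag store; in practice this comes from a flag service.
const rolloutPercent: Record<string, number> = {
  newPaymentFlow: 10, // currently at the 10% stage
};

// Hash the (flag, user) pair into a stable bucket 0-99. The same user
// always lands in the same bucket, so their experience doesn't flip
// between requests as the rollout widens.
function isEnabled(flag: string, userId: string): boolean {
  const percent = rolloutPercent[flag] ?? 0;
  const hash = createHash('sha256').update(`${flag}:${userId}`).digest();
  return hash.readUInt32BE(0) % 100 < percent;
}

if (isEnabled('newPaymentFlow', 'user-789')) {
  // new implementation
} else {
  // existing implementation
}
```

Bucketing on user ID rather than rolling a die per request keeps the rollout sticky per user, which makes the 10% → 50% → 100% steps observable and debuggable.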
Small Commits, Small PRs
Commit size:
- Each commit is self-contained and working
- If you can't describe it in one line, it's too big
- Aim for 50-200 lines changed per PR
Why small:
- Reviewers actually read it
- Easier to revert if something breaks
- Easier to bisect when hunting bugs
- Faster CI feedback
Breaking down large tasks:
| Step | PR |
|---|---|
| Refactor existing code | PR 1 |
| Add new interfaces/types | PR 2 |
| Implement logic behind flag | PR 3 |
| Write migration if needed | PR 4 |
| Enable and monitor | PR 5 |
| Clean up old code and flag | PR 6 |
One "feature" might be 5-6 PRs. That's fine. That's good.
Part 2: Communication
Flag Blockers Early
Stuck for more than 30 minutes:
- Document what you tried (not just "it doesn't work")
- Share context (error messages, logs, expected vs actual)
- Ask for help (Slack, pair programming, async review)
Bad: Struggling silently for 4 hours then saying "I'm blocked"
Good: "Stuck 30min on X. Tried A, B, C. Getting error Y. Anyone seen this?"
Blockers hidden are blockers multiplied.
Overcommunicate
Remote/async work means no one sees you working. Make your work visible.
Daily:
- Brief update on what you're working on
- Call out blockers or risks
- Share interesting findings
On PRs:
- Explain the "why" not just the "what"
- Highlight risks and mitigation
- Tag relevant people proactively
On issues:
- Update status when it changes
- Document decisions and rationale
- Link related PRs/issues
Silence is ambiguous. "No update" could mean smooth sailing or total disaster. Don't make people guess.
Part 3: Observability
You can't fix what you can't see.
The Three Pillars
| Pillar | What | When to Use |
|---|---|---|
| Logs | Discrete events with context | Debugging specific requests, audit trails |
| Metrics | Aggregated numbers over time | Dashboards, alerts, capacity planning |
| Traces | Request flow across services | Finding where time is spent, dependency issues |
Logging
What to log:
- Request/response boundaries (API calls in/out)
- State transitions (order: created → paid → fulfilled)
- Errors with stack traces
- Business events (user signed up, payment processed)
Never log:
- PII (names, emails, phone numbers)
- Credentials, tokens, secrets
- High-frequency noise (every loop iteration)
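Call-site discipline alone won't keep PII out of logs. A sketch of enforcing the rule at the logger instead, assuming pino and its built-in redact option (the field paths are examples, not an exhaustive list):

```typescript
import pino from 'pino';

// Sensitive paths are censored before the line is ever written.
// Extend the list to match your actual payload shapes.
const logger = pino({
  redact: {
    paths: ['user.email', 'user.name', 'user.phone', 'card.number', 'token', 'password'],
    censor: '[REDACTED]',
  },
});

// Note: pino takes the context object first, then the message.
logger.info(
  { user: { id: 'user-789', email: 'jane@example.com' }, amount: 42 },
  'Payment processed'
);
// => logs user.id and amount, but user.email as "[REDACTED]"
```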
Structure matters:
```js
// Bad - not searchable
logger.info(`User ${userId} made payment of ${amount}`);

// Good - searchable
logger.info('Payment processed', {
  userId,
  amount,
  currency,
  paymentMethod,
  transactionId,
  duration: endTime - startTime
});
```
Metrics
The Four Golden Signals:
- Latency - How long requests take (p50, p95, p99)
- Traffic - Request rate (requests/sec)
- Errors - Failure rate (5xx/total)
- Saturation - Resource usage (CPU, memory, connections)
Naming convention:
```
<service>_<what>_<unit>_<type>
payment_api_request_duration_seconds_histogram
user_service_active_connections_gauge
order_created_total_counter
```
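Putting the convention and the golden signals together, here's a sketch of instrumenting latency, traffic, and errors for one endpoint, assuming the Node prom-client library (metric and route names are illustrative):

```typescript
import { Histogram, Counter } from 'prom-client';

// Latency: histograms let dashboards derive p50/p95/p99.
const requestDuration = new Histogram({
  name: 'payment_api_request_duration_seconds',
  help: 'Payment API request duration in seconds',
  labelNames: ['route', 'status'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

// Traffic and errors: rate() over this counter gives requests/sec,
// and the status label splits out the 5xx error rate.
const requestsTotal = new Counter({
  name: 'payment_api_requests_total',
  help: 'Total payment API requests',
  labelNames: ['route', 'status'],
});

async function handleCharge(): Promise<void> {
  const endTimer = requestDuration.startTimer({ route: '/charge' });
  let status = '200';
  try {
    // ... actual handler work ...
  } catch (err) {
    status = '500';
    throw err;
  } finally {
    endTimer({ status });
    requestsTotal.inc({ route: '/charge', status });
  }
}
```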
Traces
Traces show the journey of a single request across services.
Key concepts:
- Trace - Entire journey (one request, many services)
- Span - One unit of work (one service, one operation)
- Trace ID - Links all spans together
When traces save you:
- "Why is this endpoint slow?" → See which span takes longest
- "Why did this request fail?" → See which service errored
- "What services does this call?" → See the dependency graph
Correlation
Link everything with IDs:
```js
{
  traceId: 'abc123',    // Links to distributed trace
  requestId: 'req-456', // Links to specific request
  userId: 'user-789',   // Links to user's journey
  orderId: 'order-012'  // Links to business entity
}
```
When something breaks:
- Find the error in logs
- Get the trace ID
- See the full request journey
- Identify exactly where it failed
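Threading those IDs through every function signature doesn't scale. A sketch of carrying them implicitly with Node's AsyncLocalStorage, assuming Express (header and field names are illustrative):

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';
import express from 'express';

type RequestContext = { requestId: string };
const context = new AsyncLocalStorage<RequestContext>();

const app = express();

// Accept an incoming ID (so IDs propagate across services) or mint one.
app.use((req, res, next) => {
  const requestId = req.header('x-request-id') ?? randomUUID();
  res.setHeader('x-request-id', requestId);
  context.run({ requestId }, next);
});

// Any logger can now pull the IDs without them being passed explicitly.
function logInfo(message: string, fields: Record<string, unknown> = {}): void {
  console.log(JSON.stringify({ message, ...context.getStore(), ...fields }));
}

app.get('/orders/:id', (req, res) => {
  logInfo('Order fetched', { orderId: req.params.id }); // includes requestId
  res.json({ ok: true });
});
```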
Part 4: Alerting
Alerts are for humans. Make them count.
Alert on Symptoms, Not Causes
Good: "Error rate > 5% for 5 minutes"
Bad: "Database CPU > 80%"
Users feel symptoms. High CPU might be fine if latency is good.
Alert Severity
| Severity | Response Time | Example |
|---|---|---|
| Critical | Immediate (page) | Service down, data loss risk |
| High | Within 1 hour | Error rate elevated, degraded performance |
| Medium | Within 4 hours | Non-critical feature broken |
| Low | Next business day | Warning thresholds, capacity planning |
Every Alert Needs
- What - Clear description of what's wrong
- Impact - Who/what is affected
- Runbook link - How to investigate/fix
- Dashboard link - Where to see more context
Avoiding Alert Fatigue
Problem: Too many alerts → people ignore them → real issues missed
Fix:
- Tune thresholds based on actual impact
- Use warning → critical escalation (warn at 3%, page at 5%)
- Group related alerts
- Review and retire stale alerts monthly
- Track alert-to-action ratio (no action = remove the alert)
Part 5: Debugging
When things break, stay calm and be systematic.
The Process
- Acknowledge - Confirm you're looking at it
- Assess - What's the impact? How many users? Getting worse?
- Mitigate - Reduce impact quickly (feature flag, rollback, scale up)
- Investigate - Find root cause
- Fix - Implement proper solution
- Document - Write postmortem for significant incidents
Mitigation before investigation. Stop the bleeding first.
Investigation Checklist
```
□ What changed recently? (deploys, config, dependencies)
□ When did it start? (correlate with changes)
□ What's the error message/stack trace?
□ Which users/requests affected? (all or some?)
□ What do traces show?
□ What do metrics show? (latency, errors, saturation)
□ Can I reproduce locally?
```
Common Patterns
| Symptom | Check |
|---|---|
| "It was working yesterday" | Recent deploys, config changes, dependency updates |
| "Only some users affected" | Geography, user type, feature flag, specific data |
| "Slow but not erroring" | Traces for slow spans, database queries, external calls |
| "Errors spike then recover" | Resource exhaustion, connection limits, rate limiting |
| "Works locally, not in prod" | Environment differences, config, network, permissions |
Postmortems
Required for incidents that:
- Affected users for > 15 minutes
- Required emergency response
- Revealed a systemic issue
Structure:
- Summary - One paragraph, what happened
- Timeline - Minute by minute
- Impact - Users affected, duration, business impact
- Root cause - Technical explanation
- What went well - Detection, response, mitigation
- What went poorly - Gaps in monitoring, slow response
- Action items - Specific, assigned, time-bound
Postmortems are blameless. Focus on systems, not people.
Part 6: Day-to-Day Practices
Code Review
As author:
- Self-review before requesting reviews
- Keep PRs small and focused
- Respond to feedback within 24 hours
- Don't take feedback personally
As reviewer:
- Review within 24 hours
- Be specific and actionable
- Distinguish blockers from suggestions
- Approve when "good enough" not "perfect"
Testing
- Write tests first (TDD)
- Test behavior, not implementation
- One assertion per test when possible
- Name tests like documentation: should_reject_payment_when_card_expired
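As a sketch (assuming vitest; isCardExpired is a hypothetical function included only to make the example self-contained):

```typescript
import { it, expect } from 'vitest';

// Hypothetical subject under test. expiry.month is 1-based (1 = January);
// JS Date months are 0-based, so passing it straight through yields the
// first day of the month *after* expiry, i.e. the first invalid day.
function isCardExpired(expiry: { month: number; year: number }, now: Date): boolean {
  const firstInvalidDay = new Date(expiry.year, expiry.month, 1);
  return now >= firstInvalidDay;
}

// Test name reads like documentation: behavior, not implementation.
it('should_reject_payment_when_card_expired', () => {
  const expired = isCardExpired({ month: 1, year: 2020 }, new Date('2024-06-01'));
  expect(expired).toBe(true);
});
```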
Documentation
- Document the "why" more than the "how"
- Keep docs close to code (README in each module)
- Update docs when you change behavior
- Delete outdated docs (wrong docs worse than no docs)
Quick Reference
Before Starting Work
- [ ] Do I understand the problem (not just the solution)?
- [ ] What's the impact if this succeeds?
- [ ] What are the risks?
- [ ] How will I know it's working?
Before Committing
- [ ] Tests pass locally
- [ ] Linting clean
- [ ] Types check
- [ ] Build succeeds
- [ ] Commit message is clear
Before Merging
- [ ] PR description explains why
- [ ] Risks documented
- [ ] Reviewers approved
- [ ] CI green
- [ ] Feature flagged if incomplete
After Deploying
- [ ] Monitor metrics for 15 minutes
- [ ] Check error rates
- [ ] Verify feature works in prod
- [ ] Update any related tickets
Common Mistakes
| Mistake | Fix |
|---|---|
| Proposing solutions before understanding problems | Ask "what problem are we solving?" first |
| Long-lived branches | Merge within 1-2 days, use feature flags |
| String interpolation in logs | Use structured JSON logs |
| Alerting on causes (CPU) | Alert on symptoms (error rate) |
| Investigating before mitigating | Stop the bleeding first |
| Blaming people in postmortems | Focus on systems and processes |
| Silent when blocked | Ask for help after 30 minutes |
| "No update" status | Silence is ambiguous, communicate proactively |
Related:
- Project Onboarding Template - Template for project-specific docs
- Engineering Fundamentals - Technical concepts (indexing, caching, scaling)
- Pull Requests - How to write good PRs
- Daily Standup - Communication format