Engineering Onboarding Guide
How we work as engineers. Universal principles that apply across all projects.
TL;DR
Mindset:
- Problems first - Understand the problem before proposing solutions
- No waste - Don't build features nobody needs. Every line of code must solve a real problem
- Small, fast iterations - Branch from main, merge within 1-2 days
- Feature flags - Ship incomplete code behind flags, not long branches
Communication:
- Flag blockers early - Stuck for 30 minutes? Ask for help
- Overcommunicate - Remote work means making your work visible
- Explain the "why" - PRs should justify decisions, not just describe changes
Observability:
- Three pillars - Logs (events), Metrics (aggregates), Traces (request flow)
- Structured logs - JSON > string interpolation
- Correlation IDs - Link logs to traces to requests
Alerting:
- Symptoms over causes - Alert on error rate, not CPU usage
- Actionable only - If you can't act at 3am, don't page
Debugging:
- Mitigate first - Stop the bleeding, then investigate
- Check what changed - Deploys, config, dependencies
- Postmortems are blameless - Focus on systems, not people
Part 1: How We Work
Problems First, Solutions Later
Before writing any code, answer:
- What problem are we solving? (Not what feature are we building)
- Who is affected? (Users? Ops? Other engineers?)
- What's the impact? (Quantify: latency, error rate, revenue)
- What are we NOT solving?
Bad: "We need to add a caching layer"
Good: "Dashboard loads take 4.2s. Users abandon after 3s. We're losing 12% of sessions."
The solution might be caching. Or query optimization. Or pagination. Or removing unnecessary data. Let the problem guide you.
We Don't Build Waste
Engineers are not feature factories. Every line of code we write should solve a real problem.
Before building anything, ask:
- What problem does this solve?
- Who has this problem? (Can you name them?)
- How do we know they have this problem? (Data, research, conversations?)
- What happens if we don't build this?
Signs you might be building waste:
- "The spec says to build X" (but no one can explain why)
- "It might be useful later"
- "Other systems have this feature"
- "The API should support this operation" (but no client needs it)
- "Let's add this while we're here"
The API example you'll see:
Designed: GET /users/:id/preferences/notifications/email/frequency
Reality: No client ever calls this endpoint

Why does this happen?
- Designing "complete" APIs instead of needed APIs
- Building for imagined future requirements
- Not validating with actual consumers
The right approach:
- Build the minimum that solves the current problem
- Ship it
- See what users/clients actually need next
- Build that
YAGNI (You Aren't Gonna Need It)
Every feature has cost:
- Code to write
- Tests to maintain
- Documentation to keep updated
- Bugs to fix
- Cognitive load for new engineers
- Attack surface for security
Features that solve no problem have all the cost and zero benefit.
Our standard:
| Question | Answer needed before building |
|---|---|
| What problem? | Specific, observable, measurable |
| Who has it? | Named users, teams, or systems |
| Evidence? | Data, research, or direct request |
| Impact of not building? | What breaks, who suffers |
If you can't answer these, don't build it. Push back. Ask questions. The best code is the code you don't write.
Trunk-Based Development
Work directly off main. No long-lived feature branches.
The flow:
- Branch from main
- Make small, focused changes
- PR and merge within 1-2 days
- Use feature flags for incomplete work
Why it works:
- Merge conflicts are rare and small
- Everyone sees the latest code
- CI/CD stays fast and reliable
- No "integration hell" before releases
Anti-patterns:
- Branch living for a week → break it down smaller
- "I'll merge when it's done" → use feature flags
- Rebasing a 50-commit branch → you waited too long
Feature Flags
Incomplete code can (and should) go to main. Wrap it in a flag.
```javascript
if (featureFlags.newPaymentFlow) {
  // new implementation
} else {
  // existing implementation
}
```

When to use:
- Work spanning multiple PRs
- Risky changes needing gradual rollout
- A/B testing
- Kill switch for new features
Flag lifecycle:
- Create flag (default: off)
- Develop behind flag
- Enable for internal testing
- Gradual rollout (10% → 50% → 100%)
- Remove flag and old code path
Don't let flags rot. Clean them up within 2 weeks of full rollout.
Small Commits, Small PRs
Commit size:
- Each commit is self-contained and working
- If you can't describe it in one line, it's too big
- Aim for 50-200 lines changed per PR
Why small:
- Reviewers actually read it
- Easier to revert if something breaks
- Easier to bisect when hunting bugs
- Faster CI feedback
Breaking down large tasks:
| Step | PR |
|---|---|
| Refactor existing code | PR 1 |
| Add new interfaces/types | PR 2 |
| Implement logic behind flag | PR 3 |
| Write migration if needed | PR 4 |
| Enable and monitor | PR 5 |
| Clean up old code and flag | PR 6 |
One "feature" might be 5-6 PRs. That's fine. That's good.
Part 2: Communication
Flag Blockers Early
Stuck for more than 30 minutes:
- Document what you tried (not just "it doesn't work")
- Share context (error messages, logs, expected vs actual)
- Ask for help (Slack, pair programming, async review)
Bad: Struggling silently for 4 hours then saying "I'm blocked"
Good: "Stuck 30min on X. Tried A, B, C. Getting error Y. Anyone seen this?"
Blockers hidden are blockers multiplied.
Overcommunicate
Remote/async work means no one sees you working. Make your work visible.
Daily:
- Brief update on what you're working on
- Call out blockers or risks
- Share interesting findings
On PRs:
- Explain the "why" not just the "what"
- Highlight risks and mitigation
- Tag relevant people proactively
On issues:
- Update status when it changes
- Document decisions and rationale
- Link related PRs/issues
Silence is ambiguous. "No update" could mean smooth sailing or total disaster. Don't make people guess.
Part 3: Observability
You can't fix what you can't see.
The Three Pillars
| Pillar | What | When to Use |
|---|---|---|
| Logs | Discrete events with context | Debugging specific requests, audit trails |
| Metrics | Aggregated numbers over time | Dashboards, alerts, capacity planning |
| Traces | Request flow across services | Finding where time is spent, dependency issues |
Logging
What to log:
- Request/response boundaries (API calls in/out)
- State transitions (order: created → paid → fulfilled)
- Errors with stack traces
- Business events (user signed up, payment processed)
Never log:
- PII (names, emails, phone numbers)
- Credentials, tokens, secrets
- High-frequency noise (every loop iteration)
Structure matters:
```javascript
// Bad - not searchable
logger.info(`User ${userId} made payment of ${amount}`);

// Good - searchable
logger.info('Payment processed', {
  userId,
  amount,
  currency,
  paymentMethod,
  transactionId,
  duration: endTime - startTime
});
```

Metrics
The Four Golden Signals:
- Latency - How long requests take (p50, p95, p99)
- Traffic - Request rate (requests/sec)
- Errors - Failure rate (5xx/total)
- Saturation - Resource usage (CPU, memory, connections)
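As a sketch, two of these signals can be computed directly from raw request records; the `{ durationMs, status }` record shape here is illustrative, not a standard:

```javascript
// Nearest-rank percentile over a list of values.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// Latency, traffic, and error rate for one observation window.
function goldenSignals(requests, windowSeconds) {
  const durations = requests.map(r => r.durationMs);
  return {
    latencyP95Ms: percentile(durations, 95),
    trafficRps: requests.length / windowSeconds,
    errorRate: requests.filter(r => r.status >= 500).length / requests.length
  };
}
```

In production these come from your metrics system, not hand-rolled math; the sketch only shows what each signal measures.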
Naming convention:
```
<service>_<what>_<unit>_<type>

payment_api_request_duration_seconds_histogram
user_service_active_connections_gauge
order_created_total_counter
```

Traces
Traces show the journey of a single request across services.
Key concepts:
- Trace - Entire journey (one request, many services)
- Span - One unit of work (one service, one operation)
- Trace ID - Links all spans together
When traces save you:
- "Why is this endpoint slow?" → See which span takes longest
- "Why did this request fail?" → See which service errored
- "What services does this call?" → See the dependency graph
Correlation
Link everything with IDs:
```javascript
{
  traceId: 'abc123',     // Links to distributed trace
  requestId: 'req-456',  // Links to specific request
  userId: 'user-789',    // Links to user's journey
  orderId: 'order-012'   // Links to business entity
}
```

When something breaks:
- Find the error in logs
- Get the trace ID
- See the full request journey
- Identify exactly where it failed
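One common way to guarantee the IDs appear on every log line is a child logger that stamps them automatically. A sketch with an illustrative API shape, not a real logging library:

```javascript
// Returns a logger that merges the correlation fields into every entry,
// so call sites only pass the message and event-specific fields.
function childLogger(baseFields, sink = console) {
  return {
    info(message, fields = {}) {
      sink.log(JSON.stringify({ level: 'info', message, ...baseFields, ...fields }));
    }
  };
}

// Per request, create one logger carrying the IDs once:
//   const log = childLogger({ traceId, requestId, userId });
//   log.info('Payment processed', { amount });
```

Most structured logging libraries offer this under names like "child logger" or "bound context"; the point is that correlation IDs are attached once, not repeated at every call site.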
Part 4: Alerting
Alerts are for humans. Make them count.
Alert on Symptoms, Not Causes
Good: "Error rate > 5% for 5 minutes"
Bad: "Database CPU > 80%"
Users feel symptoms. High CPU might be fine if latency is good.
Alert Severity
| Severity | Response Time | Example |
|---|---|---|
| Critical | Immediate (page) | Service down, data loss risk |
| High | Within 1 hour | Error rate elevated, degraded performance |
| Medium | Within 4 hours | Non-critical feature broken |
| Low | Next business day | Warning thresholds, capacity planning |
Every Alert Needs
- What - Clear description of what's wrong
- Impact - Who/what is affected
- Runbook link - How to investigate/fix
- Dashboard link - Where to see more context
Avoiding Alert Fatigue
Problem: Too many alerts → people ignore them → real issues missed
Fix:
- Tune thresholds based on actual impact
- Use warning → critical escalation (warn at 3%, page at 5%)
- Group related alerts
- Review and retire stale alerts monthly
- Track alert-to-action ratio (no action = remove the alert)
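The warning → critical escalation can be sketched as a pure function over one window's counts (thresholds match the warn-at-3%, page-at-5% example above; the function name and return values are illustrative):

```javascript
// Classify an observation window by error rate.
// 'critical' pages a human; 'warning' goes to a ticket or dashboard.
function evaluateErrorRate(errorCount, totalCount, { warnAt = 0.03, pageAt = 0.05 } = {}) {
  if (totalCount === 0) return 'ok'; // no traffic, nothing to judge
  const rate = errorCount / totalCount;
  if (rate >= pageAt) return 'critical';
  if (rate >= warnAt) return 'warning';
  return 'ok';
}
```

Real alerting systems also require the condition to hold for a sustained duration (e.g. "for 5 minutes") before firing, which this sketch omits.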
Part 5: Debugging
When things break, stay calm and be systematic.
The Process
- Acknowledge - Confirm you're looking at it
- Assess - What's the impact? How many users? Getting worse?
- Mitigate - Reduce impact quickly (feature flag, rollback, scale up)
- Investigate - Find root cause
- Fix - Implement proper solution
- Document - Write postmortem for significant incidents
Mitigation before investigation. Stop the bleeding first.
Investigation Checklist
□ What changed recently? (deploys, config, dependencies)
□ When did it start? (correlate with changes)
□ What's the error message/stack trace?
□ Which users/requests affected? (all or some?)
□ What do traces show?
□ What do metrics show? (latency, errors, saturation)
□ Can I reproduce locally?

Common Patterns
| Symptom | Check |
|---|---|
| "It was working yesterday" | Recent deploys, config changes, dependency updates |
| "Only some users affected" | Geography, user type, feature flag, specific data |
| "Slow but not erroring" | Traces for slow spans, database queries, external calls |
| "Errors spike then recover" | Resource exhaustion, connection limits, rate limiting |
| "Works locally, not in prod" | Environment differences, config, network, permissions |
Postmortems
Required for incidents that:
- Affected users for > 15 minutes
- Required emergency response
- Revealed a systemic issue
Structure:
- Summary - One paragraph, what happened
- Timeline - Minute by minute
- Impact - Users affected, duration, business impact
- Root cause - Technical explanation
- What went well - Detection, response, mitigation
- What went poorly - Gaps in monitoring, slow response
- Action items - Specific, assigned, time-bound
Postmortems are blameless. Focus on systems, not people.
Part 6: Day-to-Day Practices
Code Review
As author:
- Self-review before requesting reviews
- Keep PRs small and focused
- Respond to feedback within 24 hours
- Don't take feedback personally
As reviewer:
- Review within 24 hours
- Be specific and actionable
- Distinguish blockers from suggestions
- Approve when "good enough" not "perfect"
Testing
- Write tests first (TDD)
- Test behavior, not implementation
- One assertion per test when possible
- Name tests like documentation:
should_reject_payment_when_card_expired
Documentation
- Document the "why" more than the "how"
- Keep docs close to code (README in each module)
- Update docs when you change behavior
- Delete outdated docs (wrong docs worse than no docs)
Quick Reference
Before Starting Work
- [ ] Do I understand the problem (not just the solution)?
- [ ] Who has this problem? Can I name them?
- [ ] What's the evidence this problem exists?
- [ ] What happens if we don't build this?
- [ ] What's the impact if this succeeds?
- [ ] What are the risks?
- [ ] How will I know it's working?
Before Committing
- [ ] Tests pass locally
- [ ] Linting clean
- [ ] Types check
- [ ] Build succeeds
- [ ] Commit message is clear
Before Merging
- [ ] PR description explains why
- [ ] Risks documented
- [ ] Reviewers approved
- [ ] CI green
- [ ] Feature flagged if incomplete
After Deploying
- [ ] Monitor metrics for 15 minutes
- [ ] Check error rates
- [ ] Verify feature works in prod
- [ ] Update any related tickets
Common Mistakes
| Mistake | Fix |
|---|---|
| Proposing solutions before understanding problems | Ask "what problem are we solving?" first |
| Building features nobody asked for | Validate: who has this problem? What's the evidence? |
| Designing "complete" APIs upfront | Build only what's needed now, expand when there's demand |
| Long-lived branches | Merge within 1-2 days, use feature flags |
| String interpolation in logs | Use structured JSON logs |
| Alerting on causes (CPU) | Alert on symptoms (error rate) |
| Investigating before mitigating | Stop the bleeding first |
| Blaming people in postmortems | Focus on systems and processes |
| Silent when blocked | Ask for help after 30 minutes |
| "No update" status | Silence is ambiguous, communicate proactively |
Related:
- Project Onboarding Template - Template for project-specific docs
- Engineering Thinking - Decision frameworks (when to normalize, when to abstract)
- Engineering Fundamentals - Technical concepts (indexing, caching, scaling)
- Pull Requests - How to write good PRs
- Clear Communication - Communication principles