Best Practices
Learn how to structure alerts effectively, reduce noise, and build a sustainable incident response culture. These practices are based on real-world experience managing alerts at scale.
Alert Design Principles
1. Actionable Over Informational
Every alert should require human action. If an alert doesn't need immediate attention, it shouldn't wake someone up.
✓ Good Alert
“Database replication lag > 30 seconds”
Clear issue requiring investigation
✗ Poor Alert
“CPU usage is 45%”
Not actionable, just informational
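As a concrete sketch, the “good” alert above could be sent with the a11ops SDK shown later in this guide. getReplicationLag is a hypothetical helper standing in for your monitoring query; the threshold and field names are illustrative.

// Hypothetical check: getReplicationLag() stands in for your monitoring query
const replicationLagSeconds = await getReplicationLag();

if (replicationLagSeconds > 30) {
  await a11ops.alert({
    title: "Database replication lag > 30 seconds",
    message: `Replica is ${replicationLagSeconds}s behind; reads may serve stale data`,
    priority: "high",
    metadata: {
      replication_lag_seconds: replicationLagSeconds,
      threshold_seconds: 30
    }
  });
}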
2. Customer Impact Focus
Alert on symptoms that affect users, not just on causes. Monitor what matters to your customers first.
- Primary: API error rate, response time, availability
- Secondary: CPU usage, memory, disk space
- Context: Include both in alert metadata (see the sketch below)
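A sketch of this pattern: alert on the user-facing symptom and attach the resource-level causes as context. The service name and numbers are illustrative; only the a11ops.alert call shape comes from this guide.

await a11ops.alert({
  title: "Checkout API: Error rate 8% - users unable to complete purchases",
  priority: "critical",
  metadata: {
    // Primary: the customer-facing symptom
    error_rate: "8%",
    p95_latency_ms: 2400,
    // Secondary: possible causes, included as context only
    cpu_usage: "91%",
    memory_usage: "78%",
    db_connections: 195
  }
});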
3. Clear Ownership
Every alert must have a clear owner. Use tags and routing rules to ensure alerts reach the right team immediately.
// Always include team ownership
await a11ops.alert({
  title: "Payment processing errors increasing",
  priority: "high",
  metadata: {
    team: "payments",          // Clear ownership
    service: "payment-api",
    runbook: "link/to/runbook" // How to respond
  }
});
Reducing Alert Fatigue
Use Appropriate Thresholds
Set thresholds based on actual customer impact, not arbitrary numbers: alert when the error rate is high enough to threaten your SLO, not when CPU crosses a fixed percentage.
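A sketch that derives the threshold from a 99.9% availability SLO rather than a fixed number; the SLO target, the 2x budget multiplier, and the request counters are assumptions.

// failedRequests and totalRequests come from your metrics backend (assumption)
const sloTarget = 0.999;                // assumed 99.9% availability SLO
const allowedErrorRate = 1 - sloTarget; // 0.1% of requests may fail
const observedErrorRate = failedRequests / totalRequests;

// Alert when the observed rate threatens the error budget, not on an arbitrary figure
if (observedErrorRate > allowedErrorRate * 2) {
  await a11ops.alert({
    title: "User API: Error rate exceeding SLO budget",
    priority: "high",
    metadata: {
      observed_error_rate: observedErrorRate,
      allowed_error_rate: allowedErrorRate,
      slo: "99.9% availability"
    }
  });
}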
Implement Alert Deduplication
Prevent duplicate alerts from overwhelming your team (see the sketch after this list):
- Use consistent alert names and keys for deduplication
- Group related alerts (e.g., all hosts with same issue)
- Set appropriate evaluation windows to prevent flapping
- Use “for” duration in alert rules (minimum 5 minutes)
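One possible sketch: build a stable key from the service and failure mode so repeated firings of the same issue collapse into one alert. The dedup_key metadata field is an assumption, not a documented SDK parameter; service and checkName would come from your check definition.

// Same service + same failure mode => same key => one alert instead of dozens
const dedupKey = `${service}:${checkName}`; // e.g. "payment-api:high-error-rate"

await a11ops.alert({
  title: "Payment API: High error rate - 15% of transactions failing",
  priority: "high",
  metadata: {
    dedup_key: dedupKey, // assumption: referenced by your routing/dedup rules
    service: service,
    check: checkName
  }
});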
Regular Alert Review
Schedule monthly reviews to improve alert quality (a small analysis sketch follows this list):
- Review all alerts from the past month
- Identify alerts that were not actionable
- Find alerts that fired too frequently
- Adjust thresholds or remove unnecessary alerts
- Add missing alerts based on incidents
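A sketch of the analysis behind such a review, assuming you can export the month's alert history as an array of { title, actionable } records; the export itself is not shown.

// alertEvents: [{ title: "Payment API: High error rate", actionable: true }, ...]
// exported from your alert history for the past month (assumption)
function reviewAlerts(alertEvents) {
  const stats = new Map();
  for (const event of alertEvents) {
    const s = stats.get(event.title) || { fired: 0, actionable: 0 };
    s.fired += 1;
    if (event.actionable) s.actionable += 1;
    stats.set(event.title, s);
  }
  // Flag noisy alerts: fired often but rarely required action
  return [...stats.entries()]
    .filter(([, s]) => s.fired >= 10 && s.actionable / s.fired < 0.5)
    .map(([title, s]) => ({ title, ...s }));
}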
Alert Hierarchy Strategy
Structure your alerts in a hierarchy to ensure appropriate response:
Critical (Page immediately)
- Complete service outage
- Data corruption or loss risk
- Security breaches
- Payment system failures
Response time: < 5 minutes
High (Notify on-call)
- Degraded performance affecting users
- Error rate above acceptable threshold
- Key feature failures
Response time: < 30 minutes
Medium (Business hours)
- Resource usage trending high
- Non-critical service degradation
- Failed background jobs
Response time: < 4 hours
Low (Next business day)
- Upcoming certificate expirations
- Non-critical configuration drift
- Optimization opportunities
Response time: < 2 business days
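The tiers above can be encoded as a small mapping so every alert is created with a priority consistent with the hierarchy. A minimal sketch: the “medium” and “low” priority strings and the sendTieredAlert helper are assumptions; only a11ops.alert comes from this guide.

// Map hierarchy tiers to priorities and expected response times
const TIERS = {
  critical: { priority: "critical", respondWithin: "5 minutes" },
  high:     { priority: "high",     respondWithin: "30 minutes" },
  medium:   { priority: "medium",   respondWithin: "4 hours" },
  low:      { priority: "low",      respondWithin: "2 business days" }
};

async function sendTieredAlert(tier, title, metadata = {}) {
  const { priority, respondWithin } = TIERS[tier];
  return a11ops.alert({
    title,
    priority,
    metadata: { ...metadata, expected_response: respondWithin }
  });
}

// Example: a key feature failure is "high" per the hierarchy above
await sendTieredAlert("high", "Search API: Autocomplete failing - key feature degraded");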
Writing Effective Alert Messages
Alert Title Template
Use a consistent format for alert titles:
[SEVERITY] Service: Specific Issue - Impact
✓ “Payment API: High error rate - 15% of transactions failing”
✓ “Database Primary: Connection pool exhausted - New queries timing out”
✗ “High CPU” (too vague)
✗ “ALERT!!!” (not descriptive)
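A tiny helper that enforces the template above; formatAlertTitle is hypothetical, not part of the SDK.

// [SEVERITY] Service: Specific Issue - Impact
function formatAlertTitle({ severity, service, issue, impact }) {
  const prefix = severity ? `[${severity.toUpperCase()}] ` : "";
  return `${prefix}${service}: ${issue} - ${impact}`;
}

formatAlertTitle({ service: "Payment API", issue: "High error rate", impact: "15% of transactions failing" });
// => "Payment API: High error rate - 15% of transactions failing"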
Message Body Structure
Include essential information in a scannable format:
📊 WHAT: API response time p95 > 5 seconds
📍 WHERE: Production us-east-1, service: user-api
📈 METRICS: Current: 5.2s, Normal: 0.8s, Duration: 10 min
👥 IMPACT: ~2,000 users experiencing slow page loads
🔧 ACTION:
1. Check recent deployments
2. Review database slow query log
3. Scale service if needed
📖 RUNBOOK: https://wiki.company.com/runbooks/api-latency
📊 DASHBOARD: https://grafana.company.com/d/api-performance
Include Context in Metadata
await a11ops.alert({
  title: "Order Processing: Queue backlog growing",
  message: "Order queue depth exceeded threshold",
  priority: "high",
  metadata: {
    // Current state
    queue_depth: 5420,
    threshold: 1000,
    processing_rate: "50/min",
    // Historical context
    normal_depth: 100,
    peak_hour_depth: 500,
    // Business impact
    orders_delayed: 5320,
    estimated_delay: "90 minutes",
    revenue_at_risk: "$45,000",
    // Technical details
    region: "us-west-2",
    cluster: "orders-prod-1",
    version: "2.3.1",
    last_deploy: "2024-01-15T10:00:00Z"
  }
});
On-Call Best Practices
Rotation Schedule
- Weekly rotations (not longer)
- Clear handoff procedures
- Secondary on-call for escalation
- Compensate on-call time appropriately
Runbook Requirements
- Step-by-step troubleshooting guide
- Common issues and solutions
- Escalation procedures
- Rollback instructions
Alert Response
- Acknowledge within 5 minutes
- Update status every 30 minutes
- Document actions taken
- Create follow-up tickets
Post-Incident
- Blameless postmortems
- Update runbooks
- Improve alerts based on learnings
- Share knowledge with team
SLO-Based Alerting
Align alerts with Service Level Objectives (SLOs) for better prioritization:
1. Define SLIs (Service Level Indicators)
- Availability: % of successful requests
- Latency: % of requests under 200ms
- Success rate: % of requests completed without errors
2. Set SLOs (Service Level Objectives)
- 99.9% availability (43 minutes downtime/month)
- 95% of requests under 200ms
- 99.5% success rate
3. Alert on Error Budget Burn
// Alert when burning error budget too fast.
// burnRate = observed error rate / error rate allowed by the SLO;
// a value above 1 means the budget will be exhausted before the window ends.
if (burnRate > 1) {
  // Burning budget faster than allocated
  await a11ops.alert({
    title: "SLO: Error budget burn rate critical",
    priority: burnRate > 10 ? "critical" : "high",
    metadata: {
      slo: "99.9% availability",
      current_availability: "99.5%",
      burn_rate: burnRate,
      time_to_exhaustion: "4 hours"
    }
  });
}
Automation and Self-Healing
Reduce alert volume by automating common responses; a sketch of the alert-only-on-failure pattern follows the lists below:
Auto-Remediation Examples
- Restart crashed services
- Scale up during high load
- Clear full disk space
- Rotate logs automatically
- Renew certificates before expiry
When to Alert Humans
- Auto-remediation failed
- Multiple systems affected
- Data consistency issues
- Security concerns
- Business logic errors
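A sketch of the alert-only-on-failure pattern described above. restartService is a hypothetical remediation hook; only a11ops.alert comes from this guide.

// Try the automated fix first; only page a human if it fails
async function handleCrashedService(serviceName) {
  try {
    await restartService(serviceName); // hypothetical remediation hook
  } catch (err) {
    await a11ops.alert({
      title: `${serviceName}: Auto-restart failed - manual intervention required`,
      priority: "critical",
      metadata: {
        service: serviceName,
        remediation: "restart",
        error: err.message
      }
    });
  }
}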
Alert Quality Checklist
Use this checklist before creating a new alert:
- Is it actionable, or merely informational?
- Does it alert on a customer-facing symptom, with causes included only as context?
- Does it have a clear owner (team tag) and a runbook link?
- Is the threshold based on actual impact, with a “for” duration to prevent flapping?
- Does the title follow the [SEVERITY] Service: Specific Issue - Impact format?
- Will it deduplicate cleanly with existing alerts?
- Is its priority consistent with the alert hierarchy above?
Ready to Improve Your Alerts?
Start implementing these practices to build a better alerting culture.