Best Practices

Learn how to structure alerts effectively, reduce noise, and build a sustainable incident response culture. These practices are based on real-world experience managing alerts at scale.

Alert Design Principles

1. Actionable Over Informational

Every alert should require human action. If an alert doesn't need immediate attention, it shouldn't wake someone up.

✓ Good Alert

“Database replication lag > 30 seconds”

Clear issue requiring investigation

✗ Poor Alert

“CPU usage is 45%”

Not actionable, just informational

2. Customer Impact Focus

Alert on symptoms that affect users, not just on causes. Monitor what matters to your customers first.

  • Primary: API error rate, response time, availability
  • Secondary: CPU usage, memory, disk space
  • Context: Include both in alert metadata, as shown in the sketch below
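
A minimal sketch of this pattern using the same a11ops.alert call shown elsewhere on this page; the service, metric names, and values are illustrative:

// Alert on the user-facing symptom; attach causes as context
await a11ops.alert({
  title: "Checkout API: Error rate 3.2% - Users unable to complete orders",
  priority: "high",
  metadata: {
    // Primary (symptom) - what triggered the alert
    error_rate: "3.2%",
    p95_latency: "2.4s",

    // Secondary (causes) - context for the responder
    cpu_usage: "92%",
    memory_usage: "81%",
    db_connections: "480/500"
  }
});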

3. Clear Ownership

Every alert must have a clear owner. Use tags and routing rules to ensure alerts reach the right team immediately.

// Always include team ownership
await a11ops.alert({
  title: "Payment processing errors increasing",
  priority: "high",
  metadata: {
    team: "payments",          // Clear ownership
    service: "payment-api",
    runbook: "link/to/runbook" // How to respond
  }
});

Reducing Alert Fatigue

Use Appropriate Thresholds

Set thresholds based on actual impact, not arbitrary numbers (a sketch follows the list below):

  • Response time: Alert when p95 > SLO (not at 1 second)
  • Error rate: Alert when errors affect > 1% of users
  • Resource usage: Alert with time to react (e.g., disk full in 4 hours)
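
For example, a rough sketch of evaluating these thresholds, assuming you already collect p95 latency, the share of affected users, and disk-growth figures (all variable names and sample values here are hypothetical):

// Hypothetical inputs gathered from your monitoring system
const p95LatencyMs = 450;
const sloLatencyMs = 300;           // from your SLO, not an arbitrary number
const affectedUserRatio = 0.012;    // fraction of users hitting errors
const diskFreeGb = 40;
const diskGrowthGbPerHour = 12;

const hoursUntilDiskFull = diskFreeGb / diskGrowthGbPerHour;

if (p95LatencyMs > sloLatencyMs) {
  // Latency is violating the SLO, not just "looking high"
}
if (affectedUserRatio > 0.01) {
  // Errors are affecting more than 1% of users
}
if (hoursUntilDiskFull < 4) {
  // Less than 4 hours left to react before the disk fills
}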

Implement Alert Deduplication

Prevent duplicate alerts from overwhelming your team:

  • Use consistent alert names and keys for deduplication (see the sketch after this list)
  • Group related alerts (e.g., all hosts with same issue)
  • Set appropriate evaluation windows to prevent flapping
  • Use “for” duration in alert rules (minimum 5 minutes)
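
One way to keep names and keys consistent is to derive them deterministically from the service and the condition, so repeated firings collapse together. The dedupe_key field below is carried in metadata purely as an illustration; how alerts are actually grouped depends on your a11ops deduplication settings:

// Build a stable key from service + condition (no timestamps, no random IDs)
const service = "payment-api";
const condition = "error-rate-high";
const dedupeKey = `${service}:${condition}`;   // "payment-api:error-rate-high"

await a11ops.alert({
  title: "Payment API: High error rate",
  priority: "high",
  metadata: {
    dedupe_key: dedupeKey,   // illustrative field; grouping depends on your setup
    service,
    condition
  }
});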

Regular Alert Review

Schedule monthly reviews to improve alert quality:

  1. Review all alerts from the past month
  2. Identify alerts that were not actionable
  3. Find alerts that fired too frequently
  4. Adjust thresholds or remove unnecessary alerts
  5. Add missing alerts based on incidents

Alert Hierarchy Strategy

Structure your alerts in a hierarchy so each one gets an appropriate response (a priority-mapping sketch follows the tiers below):

Critical (Page immediately)

  • Complete service outage
  • Data corruption or loss risk
  • Security breaches
  • Payment system failures

Response time: < 5 minutes

High (Notify on-call)

  • Degraded performance affecting users
  • Error rate above acceptable threshold
  • Key feature failures

Response time: < 30 minutes

Medium (Business hours)

  • Resource usage trending high
  • Non-critical service degradation
  • Failed background jobs

Response time: < 4 hours

Low (Next business day)

  • Upcoming certificate expirations
  • Non-critical configuration drift
  • Optimization opportunities

Response time: < 2 business days
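
A small sketch of encoding this hierarchy in code so every alert carries the intended priority. Only "critical" and "high" appear as priority values in the examples on this page; "medium" and "low" are assumptions to check against your configuration:

// Map alert categories to priority tiers and target response times
const PRIORITY_TIERS = {
  serviceOutage:         { priority: "critical", respondWithin: "5 minutes" },
  userFacingDegradation: { priority: "high",     respondWithin: "30 minutes" },
  failedBackgroundJob:   { priority: "medium",   respondWithin: "4 hours" },         // assumed value
  certExpiringSoon:      { priority: "low",      respondWithin: "2 business days" }  // assumed value
};

const tier = PRIORITY_TIERS.userFacingDegradation;
await a11ops.alert({
  title: "User API: Degraded performance - p95 latency 3x normal",
  priority: tier.priority,
  metadata: { expected_response: tier.respondWithin }
});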

Writing Effective Alert Messages

Alert Title Template

Use a consistent format for alert titles:

[SEVERITY] Service: Specific Issue - Impact

✓ “Payment API: High error rate - 15% of transactions failing”

✓ “Database Primary: Connection pool exhausted - New queries timing out”

✗ “High CPU” (too vague)

✗ “ALERT!!!” (not descriptive)
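
A tiny local helper can keep every title on this template; formatAlertTitle is not part of the a11ops SDK, just a team convention, and the severity prefix is left to the priority field as in the examples above:

// Local convention, not an SDK function: Service: Specific Issue - Impact
function formatAlertTitle(service, issue, impact) {
  return `${service}: ${issue} - ${impact}`;
}

await a11ops.alert({
  title: formatAlertTitle("Payment API", "High error rate", "15% of transactions failing"),
  priority: "high"
});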

Message Body Structure

Include essential information in a scannable format:

📊 WHAT: API response time p95 > 5 seconds
📍 WHERE: Production us-east-1, service: user-api
📈 METRICS: Current: 5.2s, Normal: 0.8s, Duration: 10 min
👥 IMPACT: ~2,000 users experiencing slow page loads
🔧 ACTION: 
   1. Check recent deployments
   2. Review database slow query log
   3. Scale service if needed
📖 RUNBOOK: https://wiki.company.com/runbooks/api-latency
📊 DASHBOARD: https://grafana.company.com/d/api-performance

Include Context in Metadata

await a11ops.alert({
  title: "Order Processing: Queue backlog growing",
  message: "Order queue depth exceeded threshold",
  priority: "high",
  metadata: {
    // Current state
    queue_depth: 5420,
    threshold: 1000,
    processing_rate: "50/min",
    
    // Historical context
    normal_depth: 100,
    peak_hour_depth: 500,
    
    // Business impact
    orders_delayed: 5320,
    estimated_delay: "90 minutes",
    revenue_at_risk: "$45,000",
    
    // Technical details
    region: "us-west-2",
    cluster: "orders-prod-1",
    version: "2.3.1",
    last_deploy: "2024-01-15T10:00:00Z"
  }
});

On-Call Best Practices

Rotation Schedule

  • Weekly rotations (not longer)
  • Clear handoff procedures
  • Secondary on-call for escalation
  • Compensate on-call time appropriately

Runbook Requirements

  • Step-by-step troubleshooting guide
  • Common issues and solutions
  • Escalation procedures
  • Rollback instructions

Alert Response

  • Acknowledge within 5 minutes
  • Update status every 30 minutes
  • Document actions taken
  • Create follow-up tickets

Post-Incident

  • Blameless postmortems
  • Update runbooks
  • Improve alerts based on learnings
  • Share knowledge with team

SLO-Based Alerting

Align alerts with Service Level Objectives (SLOs) for better prioritization:

1. Define SLIs (Service Level Indicators)

  • Availability: % of successful requests
  • Latency: % of requests under 200ms
  • Error rate: % of requests without errors

2. Set SLOs (Service Level Objectives)

  • 99.9% availability (43 minutes downtime/month)
  • 95% of requests under 200ms
  • 99.5% success rate

3. Alert on Error Budget Burn

// Alert when burning error budget too fast
if (burnRate > 1) {
  // Burning budget faster than allocated
  await a11ops.alert({
    title: "SLO: Error budget burn rate critical",
    priority: burnRate > 10 ? "critical" : "high",
    metadata: {
      slo: "99.9% availability",
      current_availability: "99.5%",
      burn_rate: burnRate,
      time_to_exhaustion: "4 hours"
    }
  });
}
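
The snippet above assumes burnRate has already been computed. A minimal sketch of that computation, assuming you can count total and failed requests over a recent window, measured against a 99.9% availability SLO:

// Error budget for a 99.9% SLO: 0.1% of requests may fail
const sloTarget = 0.999;
const errorBudget = 1 - sloTarget;                         // ≈ 0.001

// Hypothetical counts from the last hour of traffic
const totalRequests = 1_200_000;
const failedRequests = 3_600;

const observedErrorRate = failedRequests / totalRequests;  // 0.003

// Burn rate = how many times faster than allowed we are failing.
// 1 means exactly on budget; 10 means a 30-day budget is gone in ~3 days.
const burnRate = observedErrorRate / errorBudget;          // ≈ 3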

Automation and Self-Healing

Reduce alert volume by automating common responses (a sketch of the alert-only-on-failure pattern follows the lists below):

Auto-Remediation Examples

  • Restart crashed services
  • Scale up during high load
  • Clear full disk space
  • Rotate logs automatically
  • Renew certificates before expiry

When to Alert Humans

  • Auto-remediation failed
  • Multiple systems affected
  • Data consistency issues
  • Security concerns
  • Business logic errors
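
A sketch of the pattern described above: attempt the automated fix first and only page a human when it fails. restartService here is a hypothetical stand-in for whatever remediation you already run:

// Try the automated fix first; only alert a human if it fails
async function handleCrashedService(serviceName, restartService) {
  try {
    await restartService(serviceName);   // hypothetical remediation hook
  } catch (err) {
    await a11ops.alert({
      title: `${serviceName}: Auto-restart failed - Manual intervention required`,
      priority: "critical",
      metadata: {
        remediation_attempted: "restart",
        error: err.message
      }
    });
  }
}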

Alert Quality Checklist

Use this checklist before creating a new alert:

  • Is it actionable, or just informational?
  • Does it measure a symptom your users feel, not only an internal cause?
  • Does it have a clear owner (team tag and routing rule)?
  • Is the threshold tied to actual impact or an SLO, not an arbitrary number?
  • Does the title follow the Service: Specific Issue - Impact format?
  • Does it include a runbook link and enough metadata to start investigating?
  • Is it assigned the priority tier that matches the required response time?

Ready to Improve Your Alerts?

Start implementing these practices to build a better alerting culture.