Best Practices

Learn how to structure alerts effectively, reduce noise, and build a sustainable incident response culture. These practices are based on real-world experience managing alerts at scale.

Alert Design Principles

1. Actionable Over Informational

Every alert should require human action. If an alert doesn't need immediate attention, it shouldn't wake someone up.

✓ Good Alert

“Database replication lag > 30 seconds”

Clear issue requiring investigation

✗ Poor Alert

“CPU usage is 45%”

Not actionable, just informational

2. Customer Impact Focus

Alert on symptoms that affect users, not just on causes. Monitor what matters to your customers first.

  • Primary: API error rate, response time, availability
  • Secondary: CPU usage, memory, disk space
  • Context: Include both in alert metadata, as shown in the sketch below
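
A minimal sketch of this pattern using the same a11ops.alert call shown elsewhere on this page; the service, metric names, and values are illustrative:

// Alert on the user-facing symptom; attach causes as context
await a11ops.alert({
  title: "Checkout API: Error rate 3.2% - Users unable to complete orders",
  priority: "high",
  metadata: {
    // Primary (symptom) - what triggered the alert
    error_rate: "3.2%",
    p95_latency: "2.4s",

    // Secondary (causes) - context for the responder
    cpu_usage: "92%",
    memory_usage: "81%",
    db_connections: "480/500"
  }
});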

3. Clear Ownership

Every alert must have a clear owner. Use tags and routing rules to ensure alerts reach the right team immediately.

// Always include team ownership
await a11ops.alert({
  title: "Payment processing errors increasing",
  priority: "high",
  metadata: {
    team: "payments",          // Clear ownership
    service: "payment-api",
    runbook: "link/to/runbook" // How to respond
  }
});

Reducing Alert Fatigue

Use Appropriate Thresholds

Set thresholds based on actual impact, not arbitrary numbers (a sketch follows the list below):

  • Response time: Alert when p95 > SLO (not at 1 second)
  • Error rate: Alert when errors affect > 1% of users
  • Resource usage: Alert with time to react (e.g., disk full in 4 hours)
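
For example, a rough sketch of evaluating these thresholds, assuming you already collect p95 latency, the share of affected users, and disk-growth figures (all variable names and sample values here are hypothetical):

// Hypothetical inputs gathered from your monitoring system
const p95LatencyMs = 450;
const sloLatencyMs = 300;           // from your SLO, not an arbitrary number
const affectedUserRatio = 0.012;    // fraction of users hitting errors
const diskFreeGb = 40;
const diskGrowthGbPerHour = 12;

const hoursUntilDiskFull = diskFreeGb / diskGrowthGbPerHour;

if (p95LatencyMs > sloLatencyMs) {
  // Latency is violating the SLO, not just "looking high"
}
if (affectedUserRatio > 0.01) {
  // Errors are affecting more than 1% of users
}
if (hoursUntilDiskFull < 4) {
  // Less than 4 hours left to react before the disk fills
}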

Implement Alert Deduplication

Prevent duplicate alerts from overwhelming your team:

  • Use consistent alert names and keys for deduplication (see the sketch after this list)
  • Group related alerts (e.g., all hosts with same issue)
  • Set appropriate evaluation windows to prevent flapping
  • Use “for” duration in alert rules (minimum 5 minutes)
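
One way to keep names and keys consistent is to derive them deterministically from the service and the condition, so repeated firings collapse together. The dedupe_key field below is carried in metadata purely as an illustration; how alerts are actually grouped depends on your a11ops deduplication settings:

// Build a stable key from service + condition (no timestamps, no random IDs)
const service = "payment-api";
const condition = "error-rate-high";
const dedupeKey = `${service}:${condition}`;   // "payment-api:error-rate-high"

await a11ops.alert({
  title: "Payment API: High error rate",
  priority: "high",
  metadata: {
    dedupe_key: dedupeKey,   // illustrative field; grouping depends on your setup
    service,
    condition
  }
});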

Regular Alert Review

Schedule monthly reviews to improve alert quality:

  1. Review all alerts from the past month
  2. Identify alerts that were not actionable
  3. Find alerts that fired too frequently
  4. Adjust thresholds or remove unnecessary alerts
  5. Add missing alerts based on incidents

Alert Hierarchy Strategy

Structure your alerts in a hierarchy so each one gets an appropriate response (a priority-mapping sketch follows the tiers below):

Critical (Page immediately)

  • Complete service outage
  • Data corruption or loss risk
  • Security breaches
  • Payment system failures

Response time: < 5 minutes

High (Notify on-call)

  • Degraded performance affecting users
  • Error rate above acceptable threshold
  • Key feature failures

Response time: < 30 minutes

Medium (Business hours)

  • Resource usage trending high
  • Non-critical service degradation
  • Failed background jobs

Response time: < 4 hours

Low (Next business day)

  • Upcoming certificate expirations
  • Non-critical configuration drift
  • Optimization opportunities

Response time: < 2 business days
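
A small sketch of encoding this hierarchy in code so every alert carries the intended priority. Only "critical" and "high" appear as priority values in the examples on this page; "medium" and "low" are assumptions to check against your configuration:

// Map alert categories to priority tiers and target response times
const PRIORITY_TIERS = {
  serviceOutage:         { priority: "critical", respondWithin: "5 minutes" },
  userFacingDegradation: { priority: "high",     respondWithin: "30 minutes" },
  failedBackgroundJob:   { priority: "medium",   respondWithin: "4 hours" },         // assumed value
  certExpiringSoon:      { priority: "low",      respondWithin: "2 business days" }  // assumed value
};

const tier = PRIORITY_TIERS.userFacingDegradation;
await a11ops.alert({
  title: "User API: Degraded performance - p95 latency 3x normal",
  priority: tier.priority,
  metadata: { expected_response: tier.respondWithin }
});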

Writing Effective Alert Messages

Alert Title Template

Use a consistent format for alert titles:

[SEVERITY] Service: Specific Issue - Impact

✓ “Payment API: High error rate - 15% of transactions failing”

✓ “Database Primary: Connection pool exhausted - New queries timing out”

✗ “High CPU” (too vague)

✗ “ALERT!!!” (not descriptive)
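
A tiny local helper can keep every title on this template; formatAlertTitle is not part of the a11ops SDK, just a team convention, and the severity prefix is left to the priority field as in the examples above:

// Local convention, not an SDK function: Service: Specific Issue - Impact
function formatAlertTitle(service, issue, impact) {
  return `${service}: ${issue} - ${impact}`;
}

await a11ops.alert({
  title: formatAlertTitle("Payment API", "High error rate", "15% of transactions failing"),
  priority: "high"
});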

Message Body Structure

Include essential information in a scannable format:

📊 WHAT: API response time p95 > 5 seconds
📍 WHERE: Production us-east-1, service: user-api
📈 METRICS: Current: 5.2s, Normal: 0.8s, Duration: 10 min
👥 IMPACT: ~2,000 users experiencing slow page loads
🔧 ACTION: 
   1. Check recent deployments
   2. Review database slow query log
   3. Scale service if needed
📖 RUNBOOK: https://wiki.company.com/runbooks/api-latency
📊 DASHBOARD: https://grafana.company.com/d/api-performance

Include Context in Metadata

await a11ops.alert({
  title: "Order Processing: Queue backlog growing",
  message: "Order queue depth exceeded threshold",
  priority: "high",
  metadata: {
    // Current state
    queue_depth: 5420,
    threshold: 1000,
    processing_rate: "50/min",
    
    // Historical context
    normal_depth: 100,
    peak_hour_depth: 500,
    
    // Business impact
    orders_delayed: 5320,
    estimated_delay: "90 minutes",
    revenue_at_risk: "$45,000",
    
    // Technical details
    region: "us-west-2",
    cluster: "orders-prod-1",
    version: "2.3.1",
    last_deploy: "2024-01-15T10:00:00Z"
  }
});

On-Call Best Practices

Rotation Schedule

  • Weekly rotations (not longer)
  • Clear handoff procedures
  • Secondary on-call for escalation
  • Compensate on-call time appropriately

Runbook Requirements

  • Step-by-step troubleshooting guide
  • Common issues and solutions
  • Escalation procedures
  • Rollback instructions

Alert Response

  • Acknowledge within 5 minutes
  • Update status every 30 minutes
  • Document actions taken
  • Create follow-up tickets

Post-Incident

  • Blameless postmortems
  • Update runbooks
  • Improve alerts based on learnings
  • Share knowledge with team

SLO-Based Alerting

Align alerts with Service Level Objectives (SLOs) for better prioritization:

1. Define SLIs (Service Level Indicators)

  • Availability: % of successful requests
  • Latency: % of requests under 200ms
  • Error rate: % of requests without errors

2. Set SLOs (Service Level Objectives)

  • 99.9% availability (43 minutes downtime/month)
  • 95% of requests under 200ms
  • 99.5% success rate

3. Alert on Error Budget Burn

// Alert when burning error budget too fast
if (burnRate > 1) {
  // Burning budget faster than allocated
  await a11ops.alert({
    title: "SLO: Error budget burn rate critical",
    priority: burnRate > 10 ? "critical" : "high",
    metadata: {
      slo: "99.9% availability",
      current_availability: "99.5%",
      burn_rate: burnRate,
      time_to_exhaustion: "4 hours"
    }
  });
}
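
The snippet above assumes burnRate has already been computed. A minimal sketch of that computation, assuming you can count total and failed requests over a recent window, measured against a 99.9% availability SLO:

// Error budget for a 99.9% SLO: 0.1% of requests may fail
const sloTarget = 0.999;
const errorBudget = 1 - sloTarget;                         // ≈ 0.001

// Hypothetical counts from the last hour of traffic
const totalRequests = 1_200_000;
const failedRequests = 3_600;

const observedErrorRate = failedRequests / totalRequests;  // 0.003

// Burn rate = how many times faster than allowed we are failing.
// 1 means exactly on budget; 10 means a 30-day budget is gone in ~3 days.
const burnRate = observedErrorRate / errorBudget;          // ≈ 3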

Automation and Self-Healing

Reduce alert volume by automating common responses (a sketch of the alert-only-on-failure pattern follows the lists below):

Auto-Remediation Examples

  • Restart crashed services
  • Scale up during high load
  • Clear full disk space
  • Rotate logs automatically
  • Renew certificates before expiry

When to Alert Humans

  • Auto-remediation failed
  • Multiple systems affected
  • Data consistency issues
  • Security concerns
  • Business logic errors
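
A sketch of the pattern described above: attempt the automated fix first and only page a human when it fails. restartService here is a hypothetical stand-in for whatever remediation you already run:

// Try the automated fix first; only alert a human if it fails
async function handleCrashedService(serviceName, restartService) {
  try {
    await restartService(serviceName);   // hypothetical remediation hook
  } catch (err) {
    await a11ops.alert({
      title: `${serviceName}: Auto-restart failed - Manual intervention required`,
      priority: "critical",
      metadata: {
        remediation_attempted: "restart",
        error: err.message
      }
    });
  }
}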

Alert Quality Checklist

Use this checklist before creating a new alert:

  • Is it actionable, or just informational?
  • Does it measure a symptom your users feel, not only an internal cause?
  • Does it have a clear owner (team tag and routing rule)?
  • Is the threshold tied to actual impact or an SLO, not an arbitrary number?
  • Does the title follow the Service: Specific Issue - Impact format?
  • Does it include a runbook link and enough metadata to start investigating?
  • Is it assigned the priority tier that matches the required response time?

Ready to Improve Your Alerts?

Start implementing these practices to build a better alerting culture.