Alerts

Alerts are the core of a11ops. They represent critical events in your infrastructure that need immediate attention. Learn how to structure alerts effectively to maximize signal and minimize noise.

Understanding Alerts

An alert in a11ops is a notification about an event that requires attention. Each alert contains:

  • Title - A concise summary of the issue
  • Message - Detailed description and context
  • Severity - Priority level (critical, high, medium, low, info)
  • Timestamp - When the event occurred
  • Metadata - Custom key-value pairs for additional context

Real-time Delivery

Alerts are delivered instantly via push notifications, webhooks, and integrations to ensure rapid response times.

Smart Filtering

Use severity levels and metadata to filter and route alerts to the right team members at the right time.

Anatomy of an Alert

Basic Alert Structure

{
  "title": "Database Connection Pool Exhausted",
  "message": "All 100 connections in use, queries are timing out",
  "severity": "critical",
  "timestamp": "2024-01-15T10:30:45Z",
  "metadata": {
    "database": "postgres-primary",
    "connections_used": 100,
    "connections_max": 100,
    "queue_length": 47,
    "region": "us-east-1"
  }
}
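
For illustration, here is a minimal sketch of sending the same alert through the SDK's generic alert() method shown later on this page. The field names mirror the JSON structure above, the timestamp is omitted, and since the alert() example further down uses a priority field, the exact field names should be confirmed against the API reference.

// Sketch only: field names mirror the JSON structure above.
// The alert() example later on this page uses "priority", so verify
// whether your SDK version expects "severity" or "priority".
await a11ops.alert({
  title: "Database Connection Pool Exhausted",
  message: "All 100 connections in use, queries are timing out",
  severity: "critical",
  metadata: {
    database: "postgres-primary",
    connections_used: 100,
    connections_max: 100,
    queue_length: 47,
    region: "us-east-1"
  }
});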

Title Guidelines

The title should be concise yet descriptive, answering "what happened?"

"Database CPU at 95%"

Clear, specific, actionable

"High CPU"

Too vague, missing context

Message Best Practices

The message provides context and next steps:

  • Include relevant metrics and thresholds
  • Mention the impact on users or services
  • Suggest immediate actions if applicable
  • Link to runbooks or documentation
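
As an illustrative sketch of these guidelines, the alert below states the metric and threshold, the user impact, and a next step; the runbook URL and figures are placeholders rather than real endpoints.

// Sketch: the message covers the metric, the impact, and a next step.
// The runbook URL and numbers are placeholders for illustration.
await a11ops.error({
  title: "Checkout API error rate at 7.2%",
  message: "Error rate has been above the 5% threshold for 10 minutes. " +
    "Roughly 1 in 14 checkout requests is failing for users in us-east-1. " +
    "Check the latest deploy first; runbook: https://wiki.company.com/runbooks/checkout-errors",
  metadata: {
    error_rate_percent: 7.2,
    threshold_percent: 5,
    duration_minutes: 10,
    runbook_url: "https://wiki.company.com/runbooks/checkout-errors"
  }
});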

Metadata Usage

Use metadata for structured data that can be filtered or searched:

"metadata": {
  // Environment and location
  "environment": "production",
  "region": "us-east-1",
  "datacenter": "dc-1",
  
  // Service information
  "service": "api-gateway",
  "version": "2.3.1",
  "instance_id": "i-0a1b2c3d",
  
  // Metrics and thresholds
  "current_value": 95.3,
  "threshold": 80,
  "duration_minutes": 5,
  
  // References
  "runbook_url": "https://wiki.company.com/runbooks/high-cpu",
  "dashboard_url": "https://grafana.company.com/d/abc123"
}

Severity Levels

Choose the appropriate severity level to ensure alerts get the right attention:

Critical

Complete service outage or data loss risk. Requires immediate action.

await a11ops.critical("Payment system is completely down");

Examples: Database failure, payment processing down, security breach

High / Error

Degraded service or errors affecting users. Needs prompt attention.

await a11ops.error("API response time > 5 seconds");

Examples: High error rates, performance degradation, partial outages

Medium / Warning

Potential issues that may escalate. Should be investigated soon.

await a11ops.warning("Disk usage at 85%");

Examples: High resource usage, deprecation warnings, rate limit approaching

Low

Minor issues or notifications. Can be addressed during normal hours.

await a11ops.alert({title: "SSL certificate expires in 30 days", priority: "low"});

Examples: Upcoming maintenance, certificate expiration warnings

Info

Informational messages for audit trails and tracking.

await a11ops.info("Deployment v2.0.1 completed successfully");

Examples: Deployments, configuration changes, scheduled tasks

Common Alert Patterns

Threshold Alerts

Alert when a metric exceeds a defined threshold:

import os from "os";

if (cpuUsage > 90) {
  await a11ops.critical({
    title: `CPU usage critical: ${cpuUsage}%`,
    message: "Server CPU has been above 90% for 5 minutes",
    metadata: {
      metric: "cpu_usage_percent",
      current: cpuUsage,
      threshold: 90,
      duration: "5m",
      hostname: os.hostname()
    }
  });
}

Error Rate Alerts

Monitor error rates and alert on anomalies:

const errorRate = (errors / requests) * 100;

if (errorRate > 5) {
  await a11ops.error({
    title: `High error rate: ${errorRate.toFixed(1)}%`,
    message: `API experiencing ${errors} errors out of ${requests} requests`,
    metadata: {
      endpoint: "/api/v1/orders",
      error_count: errors,
      request_count: requests,
      error_rate_percent: errorRate.toFixed(1),
      time_window: "5m"
    }
  });
}

Deployment Tracking

Track deployments and configuration changes:

await a11ops.info({
  title: "Deployment started",
  message: `Deploying version ${version} to production`,
  metadata: {
    version,
    commit_sha: process.env.COMMIT_SHA,
    deployed_by: process.env.USER,
    pipeline_url: process.env.CI_PIPELINE_URL,
    environment: "production"
  }
});

Alert Lifecycle

Understanding the alert lifecycle helps you manage alerts effectively:

  1. Creation - Alert is sent via API or SDK with all required information
  2. Delivery - Alert is routed to configured channels (push notifications, webhooks, integrations)
  3. Acknowledgment - A team member acknowledges the alert, indicating they are investigating
  4. Resolution - The issue is resolved and the alert is marked as resolved with notes
  5. Analysis - Post-incident review to prevent future occurrences
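
Of these stages, creation is the one shown in code on this page. As a sketch, an alert created with enough context for the later acknowledgment and analysis stages (links, version, metrics) might look like the following; the URLs and values are placeholders.

// Sketch of the creation step: carry the context that the
// acknowledgment and analysis stages will need. Placeholder values.
await a11ops.critical({
  title: "Payment service returning 5xx for 10% of requests",
  message: "Started 4 minutes ago, shortly after the v3.2.0 deploy; checkout is failing for affected users.",
  metadata: {
    service: "payment-service",
    version: "3.2.0",
    error_rate_percent: 10,
    runbook_url: "https://wiki.company.com/runbooks/payment-5xx",
    dashboard_url: "https://grafana.company.com/d/abc123"
  }
});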

Alert Best Practices

Be Specific

Include exact metrics, service names, and impact in your alerts. Avoid generic messages that require investigation to understand.

Choose Severity Wisely

Reserve critical alerts for true emergencies. Overuse dilutes their importance and leads to alert fatigue.

Include Timing Context

Mention how long the issue has persisted and any patterns (e.g., "for the last 5 minutes" or "3 times in the past hour").

Actionable Information

Every alert should clearly indicate what action is needed. Link to runbooks, dashboards, or documentation when possible.
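
Putting these practices together, here is a sketch contrasting a vague alert with one that is specific, correctly scoped, and actionable; the service name, figures, and runbook URL are placeholders.

// Too vague: the responder has to investigate before they even
// understand what is wrong or how urgent it is.
await a11ops.warning("Something is wrong with the API");

// Specific, scoped to "degraded service" (error, not critical),
// with timing context and a next step. Placeholder values.
await a11ops.error({
  title: "Orders API p95 latency at 4.8s (threshold 2s)",
  message: "p95 latency has exceeded 2s for the last 15 minutes; about 30% of order submissions are timing out. Runbook: https://wiki.company.com/runbooks/orders-latency",
  metadata: {
    service: "orders-api",
    p95_latency_ms: 4800,
    threshold_ms: 2000,
    duration_minutes: 15,
    runbook_url: "https://wiki.company.com/runbooks/orders-latency"
  }
});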

Next Steps

Learn how to integrate a11ops with your existing monitoring tools.