Alerts
Alerts are the core of a11ops. They represent critical events in your infrastructure that need immediate attention. Learn how to structure alerts effectively to maximize signal and minimize noise.
Understanding Alerts
An alert in a11ops is a notification about an event that requires attention. Each alert contains the following fields (a short SDK sketch follows the list):
- Title - A concise summary of the issue
- Message - Detailed description and context
- Severity - Priority level (critical, high, medium, low, info)
- Timestamp - When the event occurred
- Metadata - Custom key-value pairs for additional context
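Taken together, these fields map onto a single SDK call. The sketch below is illustrative only: it assumes an already-configured a11ops client and that the generic alert method accepts the same field names as the JSON structure shown under Anatomy of an Alert; the timestamp is omitted on the assumption that it is recorded when the alert is created.
// Sketch only: field names mirror the JSON structure shown below and are
// assumed to be accepted as-is by the SDK's generic alert method.
await a11ops.alert({
  title: "Database Connection Pool Exhausted",                    // concise summary
  message: "All 100 connections in use, queries are timing out",  // detail and context
  severity: "critical",                                           // priority level
  metadata: {                                                     // custom key-value pairs
    database: "postgres-primary",
    region: "us-east-1"
  }
});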
Real-time Delivery
Alerts are delivered instantly via push notifications, webhooks, and integrations to ensure rapid response times.
Smart Filtering
Use severity levels and metadata to filter and route alerts to the right team members at the right time.
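On the receiving side, severity and metadata make routing decisions straightforward. The sketch below is a hypothetical webhook consumer: it assumes the webhook payload mirrors the alert structure shown in the next section, and pageOnCallEngineer / postToTeamChannel are stand-ins for whatever paging and chat integrations you use.
// Hypothetical webhook consumer: pages only on critical production alerts,
// forwards everything else to a team channel. Payload shape is assumed to
// match the alert structure documented below.
import express from "express";

// Stand-in helpers for your real paging / chat integrations.
const pageOnCallEngineer = (alert) => console.log("PAGE:", alert.title);
const postToTeamChannel = (alert) => console.log("NOTIFY:", alert.title);

const app = express();
app.use(express.json());

app.post("/a11ops-webhook", (req, res) => {
  const { severity, metadata } = req.body;

  if (severity === "critical" && metadata?.environment === "production") {
    pageOnCallEngineer(req.body);
  } else {
    postToTeamChannel(req.body);
  }
  res.sendStatus(204);
});

app.listen(3000);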
Anatomy of an Alert
Basic Alert Structure
{
"title": "Database Connection Pool Exhausted",
"message": "All 100 connections in use, queries are timing out",
"severity": "critical",
"timestamp": "2024-01-15T10:30:45Z",
"metadata": {
"database": "postgres-primary",
"connections_used": 100,
"connections_max": 100,
"queue_length": 47,
"region": "us-east-1"
}
}
Title Guidelines
The title should be concise yet descriptive, answering "what happened?"
"Database CPU at 95%"Clear, specific, actionable
"High CPU"Too vague, missing context
Message Best Practices
The message provides context and next steps (a short example follows the list):
- Include relevant metrics and thresholds
- Mention the impact on users or services
- Suggest immediate actions if applicable
- Link to runbooks or documentation
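Putting these points together, a well-formed message might look like the sketch below. The service names, numbers, and URLs are made up for illustration; the object form of a11ops.error matches the error-rate example later on this page.
await a11ops.error({
  title: "Checkout API p95 latency above 2s",
  // Metrics and thresholds, user impact, a suggested action, and a runbook link
  message: "p95 latency is 2.4s (threshold 1s) for the last 10 minutes. " +
    "Roughly 15% of checkout requests are timing out. " +
    "Consider scaling the checkout pool or rolling back the 14:05 deploy. " +
    "Runbook: https://wiki.example.com/runbooks/checkout-latency",
  metadata: {
    runbook_url: "https://wiki.example.com/runbooks/checkout-latency"
  }
});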
Metadata Usage
Use metadata for structured data that can be filtered or searched:
"metadata": {
// Environment and location
"environment": "production",
"region": "us-east-1",
"datacenter": "dc-1",
// Service information
"service": "api-gateway",
"version": "2.3.1",
"instance_id": "i-0a1b2c3d",
// Metrics and thresholds
"current_value": 95.3,
"threshold": 80,
"duration_minutes": 5,
// References
"runbook_url": "https://wiki.company.com/runbooks/high-cpu",
"dashboard_url": "https://grafana.company.com/d/abc123"
}
Severity Levels
Choose the appropriate severity level to ensure alerts get the right attention:
Critical
Complete service outage or data loss risk. Requires immediate action.
await a11ops.critical("Payment system is completely down");
Examples: Database failure, payment processing down, security breach
High / Error
Degraded service or errors affecting users. Needs prompt attention.
await a11ops.error("API response time > 5 seconds");
Examples: High error rates, performance degradation, partial outages
Medium / Warning
Potential issues that may escalate. Should be investigated soon.
await a11ops.warning("Disk usage at 85%");
Examples: High resource usage, deprecation warnings, rate limit approaching
Low
Minor issues or notifications. Can be addressed during normal hours.
await a11ops.alert({title: "SSL certificate expires in 30 days", priority: "low"});
Examples: Upcoming maintenance, certificate expiration warnings
Info
Informational messages for audit trails and tracking.
await a11ops.info("Deployment v2.0.1 completed successfully");
Examples: Deployments, configuration changes, scheduled tasks
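If severity is chosen programmatically, a small helper keeps the mapping consistent across your codebase. This is a sketch: the multipliers are arbitrary, and it assumes each level method accepts the same object form shown above for critical and error.
// Illustrative helper: picks a severity method based on how far a metric
// has moved past its threshold. Method names match the examples above.
async function alertByThreshold(title, value, threshold) {
  if (value >= threshold * 1.5) {
    await a11ops.critical({ title, metadata: { value, threshold } });
  } else if (value >= threshold * 1.2) {
    await a11ops.error({ title, metadata: { value, threshold } });
  } else if (value >= threshold) {
    await a11ops.warning({ title, metadata: { value, threshold } });
  } else {
    await a11ops.info({ title, metadata: { value, threshold } });
  }
}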
Common Alert Patterns
Threshold Alerts
Alert when a metric exceeds a defined threshold:
import os from "node:os"; // needed for os.hostname() below

if (cpuUsage > 90) {
await a11ops.critical({
title: `CPU usage critical: ${cpuUsage}%`,
message: "Server CPU has been above 90% for 5 minutes",
metadata: {
metric: "cpu_usage_percent",
current: cpuUsage,
threshold: 90,
duration: "5m",
hostname: os.hostname()
}
});
}
Error Rate Alerts
Monitor error rates and alert on anomalies:
const errorRate = (errors / requests) * 100;
if (errorRate > 5) {
await a11ops.error({
title: `High error rate: ${errorRate.toFixed(1)}%`,
message: `API experiencing ${errors} errors out of ${requests} requests`,
metadata: {
endpoint: "/api/v1/orders",
error_count: errors,
request_count: requests,
error_rate_percent: errorRate.toFixed(1),
time_window: "5m"
}
});
}
Deployment Tracking
Track deployments and configuration changes:
await a11ops.info({
title: "Deployment started",
message: `Deploying version ${version} to production`,
metadata: {
version,
commit_sha: process.env.COMMIT_SHA,
deployed_by: process.env.USER,
pipeline_url: process.env.CI_PIPELINE_URL,
environment: "production"
}
});
Alert Lifecycle
Understanding the alert lifecycle helps you manage alerts effectively:
Creation
Alert is sent via API or SDK with all required information
Delivery
Alert is routed to configured channels (push notifications, webhooks, integrations)
Acknowledgment
Team member acknowledges the alert, indicating they are investigating
Resolution
Issue is resolved and alert is marked as resolved with notes
Analysis
Post-incident review to prevent future occurrences
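If your workflow acknowledges or resolves alerts programmatically rather than from the dashboard, the calls might look like the sketch below. The endpoint paths, base URL, and request fields here are hypothetical, used only to illustrate the lifecycle stages; check the a11ops API reference for the actual routes.
// Hypothetical endpoints, shown only to illustrate the acknowledgment and
// resolution stages; consult the a11ops API reference for the real routes.
const API_BASE = "https://api.a11ops.com/v1"; // assumed base URL
const headers = {
  Authorization: `Bearer ${process.env.A11OPS_API_KEY}`,
  "Content-Type": "application/json"
};

// Acknowledgment: signal that someone is investigating
async function acknowledgeAlert(alertId) {
  await fetch(`${API_BASE}/alerts/${alertId}/acknowledge`, { method: "POST", headers });
}

// Resolution: close the alert with notes for the post-incident review
async function resolveAlert(alertId, notes) {
  await fetch(`${API_BASE}/alerts/${alertId}/resolve`, {
    method: "POST",
    headers,
    body: JSON.stringify({ notes })
  });
}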
Alert Best Practices
Be Specific
Include exact metrics, service names, and impact in your alerts. Avoid generic messages that require investigation to understand.
Choose Severity Wisely
Reserve critical alerts for true emergencies. Overuse dilutes their importance and leads to alert fatigue.
Include Timing Context
Mention how long the issue has persisted and any patterns (e.g., "for the last 5 minutes" or "3 times in the past hour").
Actionable Information
Every alert should clearly indicate what action is needed. Link to runbooks, dashboards, or documentation when possible.
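The practices above combine naturally in a single alert. The example below is a sketch: the service names, metric values, and URLs are invented for illustration.
await a11ops.critical({
  title: "orders-api error rate 12% (threshold 2%) for 15 minutes", // specific service, metric, and timing
  message: "Roughly 1 in 8 checkout requests has failed since 14:05 UTC, " +
    "shortly after the v2.4.0 deploy. Rolling back is the fastest mitigation. " +
    "Runbook: https://wiki.example.com/runbooks/orders-error-rate",
  metadata: {
    service: "orders-api",
    error_rate_percent: 12.3,
    threshold_percent: 2,
    duration_minutes: 15,
    suspected_cause: "v2.4.0 deploy",
    runbook_url: "https://wiki.example.com/runbooks/orders-error-rate",
    dashboard_url: "https://grafana.example.com/d/orders"
  }
});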
Next Steps
Learn how to integrate a11ops with your existing monitoring tools.