Alerting

After setting up a monitoring tool, it is important to configure alert rules. Alert rules send a notification when a metric surpasses a threshold. This will help you prevent operational incidents.

This page describes which metrics and thresholds to build as a starting point. For more details on how to set up alert rules in Datadog or Grafana, refer to:

Thresholds

Alert rules tend to have two threshold levels, and we are going to define them as follows:

  • Warning: represents a call to attention to a symptom with high chances to develop into an issue.
  • Alert: represents an active issue that requires immediate action.

For each threshold level, use the following table as a guide to set up your own alert rules:

Metric Warning Alert Description
CPU 85% 100% Average CPU usage for a cluster in the last 15 minutes.
Memory 80% 90% Average memory usage for a cluster in the last 15 minutes.
Source status - On Change Source status change in the last 1 minute.
Cluster status - On Change Cluster replica status change in the last 1 minute.
Freshness > 5s > 1m Average lag behind an input in the last 15 minutes.

Custom Thresholds

For the following table, replace the two variables, X and Y, by your organization and use case:

Metric Warning Alert Description
Latency Avg > X Avg > Y Average latency in the last 15 minutes. Where X and Y are the expected latencies in milliseconds.
Credits Consumption rate increase by X% Consumption rate increase by Y% Average credit consumption in the last 60 minutes.
Back to top ↑