Alerting

After setting up a monitoring tool, configure alert rules. An alert rule sends a notification when a metric crosses a threshold, which helps you catch problems before they turn into operational incidents.

This page describes which metrics and thresholds to use as a starting point. For more details on how to set up alert rules in Datadog or Grafana, refer to:

Thresholds

Alert rules typically have two threshold levels, defined here as follows:

Warning: calls attention to a symptom that is likely to develop into an issue.
Alert: signals an active issue that requires immediate action.

For each threshold level, use the following table as a guide to set up your own alert rules:

Metric	Warning	Alert	Description
CPU	85%	100%	Average CPU usage for a cluster in the last 15 minutes.
Memory	80%	90%	Average memory usage for a cluster in the last 15 minutes.
Source status	-	On Change	Source status change in the last 1 minute.
Cluster status	-	On Change	Cluster replica status change in the last 1 minute.
Freshness	> 5s	> 1m	Average lag behind an input in the last 15 minutes.
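
As a rough illustration of how the table above translates into rule logic, the following Python sketch classifies an averaged metric against its warning and alert levels. The names `Threshold`, `THRESHOLDS`, and `classify` are hypothetical and do not belong to any monitoring tool's API. The sketch also assumes that CPU and memory levels fire when the average reaches the listed value, since the table gives no operator for them. In practice you would express the same logic in your monitoring tool's own rule syntax.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Threshold:
    warning: float
    alert: float
    # True means the level fires only when the value is strictly greater.
    strict: bool = False


# Values taken from the table above. CPU and memory are percentages;
# freshness is lag in seconds (> 5s warning, > 1m alert).
THRESHOLDS = {
    "cpu": Threshold(warning=85.0, alert=100.0),
    "memory": Threshold(warning=80.0, alert=90.0),
    "freshness": Threshold(warning=5.0, alert=60.0, strict=True),
}


def classify(metric: str, samples: list[float]) -> str:
    """Return "ok", "warning", or "alert" for the average of the samples.

    `samples` stands for the readings collected over the evaluation
    window (15 minutes in the table above).
    """
    threshold = THRESHOLDS[metric]
    average = mean(samples)

    def crossed(level: float) -> bool:
        # Assumption: non-strict levels fire when the average reaches them.
        return average > level if threshold.strict else average >= level

    if crossed(threshold.alert):
        return "alert"
    if crossed(threshold.warning):
        return "warning"
    return "ok"
```

For example, `classify("cpu", [90.0, 88.0, 86.0])` averages to 88% and falls in the warning range, while `classify("memory", [95.0, 91.0])` averages to 93% and is reported as an alert.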

Custom Thresholds

For the following table, replace the two variables, X and Y, with values that fit your organization and use case:

Metric	Warning	Alert	Description
Latency	Avg > X	Avg > Y	Average latency in the last 15 minutes, where X and Y are the expected latencies in milliseconds.
Credits	Consumption rate increase by X%	Consumption rate increase by Y%	Average credit consumption in the last 60 minutes.
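
The custom thresholds above can be sketched in the same way. The example below is a minimal illustration, assuming that the credit consumption increase is measured against a baseline rate you choose yourself (for example, the previous 60-minute window). The function names `percent_increase` and `classify_custom` are hypothetical, and any values passed for X and Y are placeholders rather than recommendations.

```python
def percent_increase(baseline: float, current: float) -> float:
    """Percentage increase of `current` over `baseline` (negative if it dropped)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline * 100.0


def classify_custom(value: float, x: float, y: float) -> str:
    """Compare a value against warning level `x` and alert level `y`.

    Works for both rows of the table: pass the average latency in
    milliseconds, or the consumption rate increase in percent.
    """
    if x >= y:
        raise ValueError("x (warning) must be lower than y (alert)")
    if value > y:
        return "alert"
    if value > x:
        return "warning"
    return "ok"


# Placeholder values for illustration only: X = 25%, Y = 40%.
increase = percent_increase(baseline=100.0, current=150.0)
credit_status = classify_custom(increase, x=25.0, y=40.0)
```

Keeping X and Y as parameters rather than constants makes it easier to tune them per cluster or per workload as you learn what normal latency and consumption look like for your use case.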