After setting up a monitoring tool, it is important to configure alert rules. An alert rule sends a notification when a metric crosses a defined threshold, helping you catch issues before they become operational incidents.
This page describes which metrics and thresholds to build as a starting point. For more details on how to set up alert rules in Datadog or Grafana, refer to:
Alert rules typically have two threshold levels, which we define as follows:
- Warning: a symptom that is likely to develop into an issue if left unaddressed.
- Alert: an active issue that requires immediate action.
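The two-level scheme above can be sketched as a simple classifier. The threshold values below are hypothetical placeholders, not recommendations:

```python
def classify(value, warning_threshold, alert_threshold):
    """Classify a metric sample into "ok", "warning", or "alert"."""
    if value >= alert_threshold:
        return "alert"      # active issue: act immediately
    if value >= warning_threshold:
        return "warning"    # symptom likely to develop into an issue
    return "ok"

# Example: average CPU usage with placeholder thresholds of 80% / 90%.
print(classify(85.0, warning_threshold=80.0, alert_threshold=90.0))  # warning
```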
For each threshold level, use the following metrics as a guide when setting up your own alert rules:

- Average CPU usage for a cluster in the last 15 minutes.
- Average memory usage for a cluster in the last 15 minutes.
- Source status changes in the last minute.
- Cluster replica status changes in the last minute.
- Average lag behind an input in the last 15 minutes.
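Several of these metrics are evaluated as an average over a trailing window (for example, the last 15 minutes). A minimal sketch of that evaluation, using hypothetical sample data:

```python
from datetime import datetime, timedelta

def windowed_average(samples, now, window=timedelta(minutes=15)):
    """Average of (timestamp, value) samples within the trailing window."""
    cutoff = now - window
    values = [v for t, v in samples if t >= cutoff]
    return sum(values) / len(values) if values else None

now = datetime(2024, 1, 1, 12, 0)
samples = [
    (now - timedelta(minutes=20), 40.0),  # outside the window, ignored
    (now - timedelta(minutes=10), 60.0),
    (now - timedelta(minutes=5), 80.0),
]
print(windowed_average(samples, now))  # 70.0
```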
For the following table, replace the two variables, X and Y, with values appropriate to your organization and use case:
| Metric | Warning | Alert |
| ------ | ------- | ----- |
| Average latency in the last 15 minutes, where X and Y are the expected latencies in milliseconds. | Avg > X | Avg > Y |
| Average credit consumption in the last 60 minutes. | Consumption rate increase by X% | Consumption rate increase by Y% |
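The credit consumption rule compares the current window's average against the previous one and fires on the percent increase. A sketch, where the thresholds X and Y and the consumption figures are hypothetical:

```python
def rate_increase_pct(previous_avg, current_avg):
    """Percent increase of credit consumption vs the previous window."""
    return (current_avg - previous_avg) / previous_avg * 100.0

def consumption_level(previous_avg, current_avg, x_pct, y_pct):
    """Map the increase onto the warning (X%) and alert (Y%) thresholds."""
    increase = rate_increase_pct(previous_avg, current_avg)
    if increase >= y_pct:
        return "alert"
    if increase >= x_pct:
        return "warning"
    return "ok"

# Hypothetical: consumption rose from 10 to 13 credits/hour (a 30% increase),
# with X = 25% (warning) and Y = 50% (alert).
print(consumption_level(10.0, 13.0, x_pct=25.0, y_pct=50.0))  # warning
```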
Materialize ships a release and schedules a maintenance window almost every week. Every maintenance window is announced on the status page, where you can subscribe to updates and receive alerts for scheduled maintenance as well as unexpected incidents at Materialize.
After an upgrade, you'll experience a few minutes of downtime while clusters restart and rehydrate. Alerts may trigger during this brief period. To avoid unnecessary alerts in this case, configure your monitoring tool as follows:
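One generic way to avoid these false positives, independent of the monitoring tool, is to suppress notifications that fall inside an announced maintenance window. A minimal sketch, with a hypothetical window taken from the status page:

```python
from datetime import datetime

def suppress_alert(now, windows):
    """Return True if `now` falls inside a scheduled maintenance window,
    so routine restart/rehydration alerts can be muted."""
    return any(start <= now < end for start, end in windows)

# Hypothetical maintenance window announced on the status page.
windows = [(datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30))]
print(suppress_alert(datetime(2024, 1, 3, 9, 10), windows))   # True
print(suppress_alert(datetime(2024, 1, 3, 10, 0), windows))   # False
```

Datadog (mutes/downtimes) and Grafana (mute timings) offer built-in equivalents of this check; consult their documentation for the exact configuration.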