Monitoring Tooling / Setting Alerts
So you've identified some of the tools you can use to monitor mission-critical metrics, but what should you do next?
In this video, we'll learn how to configure alerts in Atlas. We'll also discuss implementing operational excellence into our monitoring strategy. Let's get started.
Alerts will let you know when certain metrics exceed specific thresholds.
They're a great way to keep track of events as they happen, and you can integrate alerts with tools like Slack, PagerDuty, and Datadog.
This integration ensures that critical notifications are incorporated into existing workflows, enabling teams to respond efficiently to issues.
Many metrics have default alert settings, but we can adjust default settings to fit our needs and add alerts that are not yet set.
Let's now set alerts for our operation counts, query targeting, and average execution times.
If you haven't already or just need a refresher, check out our video on the Atlas metrics panel to learn more about each of these metrics.
To configure alerts, we click on the add alert button from the cluster metrics page of our cluster.
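Atlas handles these integrations natively, but to make the idea concrete, here's a minimal sketch of what formatting an alert as a chat notification might look like. The event fields and message shape below are illustrative assumptions, not the exact Atlas webhook schema:

```python
import json

def format_slack_alert(alert: dict) -> dict:
    """Build a Slack incoming-webhook payload from an alert event.

    The alert fields used here (metricName, hostname, currentValue,
    threshold) are illustrative, not an exact Atlas schema.
    """
    return {
        "text": (
            f":rotating_light: *{alert['metricName']}* on `{alert['hostname']}` "
            f"is {alert['currentValue']} (threshold: {alert['threshold']})"
        )
    }

# Example alert event with hypothetical values
event = {
    "metricName": "Average Execution Time: Reads",
    "hostname": "cluster0-shard-00-01",
    "currentValue": "72 ms",
    "threshold": "50 ms",
}

payload = format_slack_alert(event)
print(json.dumps(payload))
```

In practice the built-in Atlas integrations do this work for you; the sketch only shows why routing alerts into a team channel keeps them inside existing workflows.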
This takes us to our project alerts page.
There are multiple categories of alerts that we can view or edit from this page.
Let's expand the host section to see more metrics.
If an alert already exists, we simply click on the ellipsis next to it and choose edit to make changes.
But in this case, we want to add new alerts.
Let's add an alert for our average execution times.
To do so, we can click on the add alert button here.
This opens a dialog box for configuring our alert. The host category is already selected for us, but if we wanted to change to a different category, we could use the drop-down menu.
The type lets us choose if we want to isolate this alert to only primary or secondary nodes or other types of hosts.
To define the alert condition, we select a metric from the condition/metric drop-down menu.
To save time scrolling through all of the available options, we can just delete the text in the box and begin typing execution to narrow down the options.
From this list, let's select average execution time: reads.
In the next row, we can decide if we want to alert when values are above or below a threshold using the operator drop-down.
We want to be alerted when we go over a value, so we'll leave it as above.
Next, we need to define the actual threshold.
Keeping execution times below fifty milliseconds is ideal for us, so we'll enter fifty here and change the units from nanoseconds to milliseconds.
We want this alert to trigger if this happens on any of our hosts, but we could limit it to specific hosts by using the host matcher options.
Finally, we need to add a notification method for the alert and define who will get notified.
Alerts can either go to all users that have a role in this project or we can choose specific roles.
Since we are still a small company and want to make sure our alerts get as much visibility as possible, let's go with all roles.
We can configure notifications via email, SMS, or both.
To prevent false alarms from temporary spikes, we'll trigger an alert only when the condition lasts for at least five minutes.
This helps us focus on persistent problems, like escalating execution times.
While an alert remains active, we'll receive repeat notifications every five minutes.
Once we click save, the alert is configured and active.
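The same configuration we just built in the UI can also be created programmatically through the Atlas Administration API's alert configurations endpoint. As a sketch only, here is what the request body for our fifty-millisecond read execution time alert might look like; the metric identifier and notification field names are assumptions that should be checked against the current Atlas API reference:

```python
import json

# Sketch of an Atlas Administration API alert configuration body
# (sent to the project's alertConfigs endpoint). The metricName and
# notification fields below are assumptions; verify them against the
# current Atlas API reference before using this in automation.
alert_config = {
    "eventTypeName": "OUTSIDE_METRIC_THRESHOLD",
    "enabled": True,
    "metricThreshold": {
        "metricName": "AVG_READ_EXECUTION_TIME",  # assumed identifier
        "operator": "GREATER_THAN",               # alert when above the value
        "threshold": 50,
        "units": "MILLISECONDS",
        "mode": "AVERAGE",
    },
    "notifications": [
        {
            "typeName": "GROUP",   # notify all users in the project
            "intervalMin": 5,      # re-notify every five minutes
            "delayMin": 5,         # wait five minutes before the first alert
            "emailEnabled": True,
            "smsEnabled": True,
        }
    ],
}

print(json.dumps(alert_config, indent=2))
```

Defining alerts as code like this makes them easy to review and to recreate consistently across projects.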
Having alerts set up allows us to react to situations before they become problems. But to make our entire database ecosystem run smoother, we need to think about the concept of operational excellence.
Our monitoring should be proactive and should be one component of a bigger plan to make things run smoothly and reliably.
Some key ways to do this are sharing what we know, creating playbooks for overall system health, and making runbooks for when things go seriously wrong. Knowledge sharing keeps everyone informed. Utilizing internal communication platforms and documentation systems helps teams stay updated on performance issues and changes.
Internal knowledge bases help disseminate information on how systems operate.
Additionally, participating in relevant communication channels enables open discussion about problems and solutions.
For alerts, having playbooks with step-by-step instructions is very helpful. This allows for consistent and effective responses to alerts, minimizing disruptions.
These playbooks should integrate with our existing monitoring, communication, and incident management tools, ensuring automated alert notifications are sent to relevant team channels for quick communication and action.
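One lightweight way to make playbooks actionable is to key them by alert type, so an incoming notification can link straight to the right response steps. A minimal sketch, where the alert names and steps are purely illustrative:

```python
# Hypothetical mapping from alert type to playbook steps; the names
# and steps here are illustrative, not an official schema.
PLAYBOOKS = {
    "high_read_execution_time": [
        "Check the Query Profiler for slow operations",
        "Review query targeting ratios for collection scans",
        "Consider adding or adjusting indexes",
    ],
    "high_query_targeting": [
        "Identify queries scanning far more documents than they return",
        "Add indexes to support the offending query shapes",
    ],
}

def playbook_for(alert_type: str) -> list:
    """Return the response steps for an alert, or a safe default."""
    return PLAYBOOKS.get(alert_type, ["Escalate to the on-call engineer"])

steps = playbook_for("high_read_execution_time")
```

Keeping a default escalation path ensures that even an unrecognized alert still produces a consistent response.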
When something truly critical happens, runbooks give us a detailed guide to fix it fast.
After these big events, we look back and learn, updating our runbooks so we're even more prepared next time. Integrating these ideas into our daily work not only strengthens our technology, but also fosters a culture of continuous improvement, leading to greater efficiency and better decision making.
By focusing on these key areas and how they connect to our monitoring and alerting, we create a clear story that links what we do with the bigger picture of running things efficiently and effectively.
Well done. In this video, we learned how to access the alert configuration in MongoDB Atlas and identify the crucial metrics of operation counts, query targeting, and average execution times for proactive monitoring.
Then we walked through the step-by-step process of adding a new alert for average execution time, including setting the threshold, duration, and notification preferences.
Finally, we discussed tying it all together by fostering a culture of operational excellence.
