Digitalization and the cloud have made it easier to deploy new features in software applications, which makes error handling crucial. An error anywhere in the chain, from writing code to deploying it to monitoring performance, can immediately degrade the customer experience, increase costs, or interrupt critical services. Many systems have been designed to manage errors and keep this chain intact, but they are constrained by the tools they rely on. IT teams often still have to sift manually through terabytes of data to find the problem. This is tedious and delays the fix, and companies end up spending valuable resources on it.
Is there a quicker way to fix operational problems?
The answer is yes: Amazon DevOps Guru.
Amazon has launched DevOps Guru as a fully managed operations service that helps developers and operators improve the availability and performance of their applications. DevOps Guru lets you implement improvements to your application quickly by removing the administrative work associated with identifying operational problems.
Amazon DevOps Guru in action:
DevOps Guru provides reactive insights that you can act on now to improve your application, and proactive insights that help you avoid operational problems that could affect your application in the future. It uses machine learning to analyze operational data, application metrics, and events, and to identify behavior that deviates from the normal operating pattern. When DevOps Guru detects an operational risk or issue, it notifies you and provides recommendations for addressing it.
The Proof of Concept:
To better understand the service, we created a use case on AWS as a proof of concept (POC) that lets us dive deep into the service. Below are the steps and details for performing this POC.
We will deploy a CloudFormation stack and populate it with test data. This stack launches a serverless app that includes an API Gateway, a Lambda function, and a DynamoDB table. To provoke an operational issue, we will lower the ReadCapacityUnits of the DynamoDB table from 5 to 1 and then increase traffic to the API.
Step 1: Enable the Amazon DevOps Guru service and select the resource coverage you want. You can also select the resource coverage later, after the service is enabled.
Step 2: Select the stacks that you wish to monitor with DevOps Guru.
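Steps 1 and 2 can also be scripted instead of done in the console. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. It uses the DevOps Guru `update_resource_collection` call; the stack name `devops-guru-poc-stack` is a hypothetical placeholder, not the name used in this POC.

```python
from __future__ import annotations

from typing import Any


def build_coverage_request(stack_names: list[str]) -> dict[str, Any]:
    """Build the arguments for DevOps Guru's UpdateResourceCollection call."""
    if not stack_names:
        raise ValueError("at least one CloudFormation stack name is required")
    return {
        "Action": "ADD",  # add the stacks to the monitored resource collection
        "ResourceCollection": {
            "CloudFormation": {"StackNames": list(stack_names)}
        },
    }


def enable_coverage(client: Any, stack_names: list[str]) -> None:
    """Ask DevOps Guru to start monitoring the given CloudFormation stacks."""
    client.update_resource_collection(**build_coverage_request(stack_names))


def main() -> None:
    import boto3  # imported here so the helpers above have no AWS dependency

    # "devops-guru-poc-stack" is a hypothetical stack name; use your own.
    enable_coverage(boto3.client("devops-guru"), ["devops-guru-poc-stack"])
```

Calling `main()` sends the request; the helpers are kept separate so the request shape can be checked without touching an AWS account.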
Now you need to wait for DevOps Guru to finish baselining the resources. This is an important step because it establishes the expected behavior. For our serverless stack, which has three resources, we recommend waiting 2 hours before continuing with the next steps. In a production environment, baselining can take up to 24 hours, depending on the number of resources being monitored.
Step 3: Modify your stack and change the ReadCapacityUnits in the DynamoDB Table from 5 to 1.
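One way to make this change is to patch the template before updating the stack. The sketch below assumes the template has been parsed from JSON into a dictionary and that the table is declared as an `AWS::DynamoDB::Table` resource with a `ProvisionedThroughput` property. The helper name `lower_read_capacity` is our own, not part of any AWS SDK.

```python
from __future__ import annotations

import copy
from typing import Any


def lower_read_capacity(template: dict[str, Any], new_rcu: int = 1) -> dict[str, Any]:
    """Return a copy of the template with every DynamoDB table's RCU set to new_rcu."""
    if new_rcu < 1:
        raise ValueError("ReadCapacityUnits must be at least 1")
    patched = copy.deepcopy(template)  # leave the caller's template untouched
    for resource in patched.get("Resources", {}).values():
        if resource.get("Type") != "AWS::DynamoDB::Table":
            continue
        throughput = resource.get("Properties", {}).get("ProvisionedThroughput")
        if throughput is not None:  # tables without provisioned throughput are skipped
            throughput["ReadCapacityUnits"] = new_rcu
    return patched
```

The patched template could then be serialized with `json.dumps` and passed as `TemplateBody` to CloudFormation's `update_stack` call, or the same edit can be made by hand in the console.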
Step 4: To increase traffic, run the script to trigger API Gateway endpoints in a loop.
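The script itself is not reproduced here. A minimal sketch of such a loop, using only the Python standard library, might look like the following. The function names, the request count, and the pause are our own assumptions, and the endpoint URL must be replaced with the invoke URL of your own API Gateway stage.

```python
from __future__ import annotations

import time
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable


def fetch_status(url: str) -> int:
    """Send one GET request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def generate_traffic(
    url: str,
    total_requests: int,
    pause_seconds: float = 0.0,
    fetch: Callable[[str], int] = fetch_status,
) -> Counter:
    """Call the endpoint in a loop and tally the status codes returned."""
    statuses: Counter = Counter()
    for _ in range(total_requests):
        statuses[fetch(url)] += 1
        if pause_seconds:
            time.sleep(pause_seconds)
    return statuses
```

Calling `generate_traffic(url, 1000)` with your endpoint returns a tally of status codes, so a growing share of error responses is visible directly in the result. Connection-level failures (for example, DNS errors) are not caught and will stop the loop.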
Step 5: Open Insights in the Amazon DevOps Guru service to view the findings and recommendations.
API Gateway, Lambda, and DynamoDB will be affected. DevOps Guru will identify the root cause (DynamoDB) and provide insights.
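The same insights can also be read programmatically. The sketch below assumes boto3 and uses the DevOps Guru `list_insights` and `list_recommendations` calls. It looks only at ongoing reactive insights and ignores pagination for brevity; the helper names are our own.

```python
from __future__ import annotations

from typing import Any


def ongoing_reactive_insights(client: Any) -> list[dict[str, Any]]:
    """Return the ongoing reactive insights reported by DevOps Guru."""
    response = client.list_insights(
        StatusFilter={"Ongoing": {"Type": "REACTIVE"}}
    )
    return response.get("ReactiveInsights", [])


def summarize(client: Any) -> list[str]:
    """Build one line per insight, followed by its recommendation names."""
    lines: list[str] = []
    for insight in ongoing_reactive_insights(client):
        severity = insight.get("Severity", "UNKNOWN")
        lines.append(f"{severity}: {insight.get('Name', '(unnamed)')}")
        recs = client.list_recommendations(InsightId=insight["Id"])
        for rec in recs.get("Recommendations", []):
            lines.append(f"  recommendation: {rec.get('Name', '(unnamed)')}")
    return lines


def main() -> None:
    import boto3  # imported lazily so the helpers can be exercised with a stub

    for line in summarize(boto3.client("devops-guru")):
        print(line)
```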
The following screenshots were taken during the POC.
Fig. 1: You can view various errors regarding the affected resources on the Insights dashboard.
Fig. 2: The Aggregated metrics dashboard shows the timeline of all metrics. You can compare when spikes or errors first appeared in each metric, which helps you determine the root cause.