Currently I’m shipping logs with Beaver into an ELK stack, and metrics with collectd into a Graphite stack. Now that Elastic offer Beats, which handle both logs and metrics, it’s worth exploring further.
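As a rough sketch of what a combined Beats setup might look like (the paths, hosts and metricsets here are assumptions for illustration, not a tested config, and the exact keys vary by Beats version), Filebeat and Metricbeat can each ship straight to Elasticsearch:

```yaml
# filebeat.yml — ships logs, roughly filling Beaver's role
filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.log      # hypothetical log path

output.elasticsearch:
  hosts: ["localhost:9200"]     # hypothetical ES endpoint
---
# metricbeat.yml — ships host metrics, roughly filling collectd's role
metricbeat.modules:
  - module: system
    metricsets: ["cpu", "memory", "filesystem"]
    period: 10s

output.elasticsearch:
  hosts: ["localhost:9200"]
```

The appeal is one agent family and one config style for both pipelines, rather than Beaver plus collectd each with their own transport.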
Just as you start off on a Monday morning, at 9:01am, there’s a page: that crucial, heavily used site is broken, users are blocked from working and frustrated. What went wrong?
One of the main pressures when responding to incidents is simply being overwhelmed with tasks; the outcome of so many demands and so much context-switching can easily be chaos, or poor-quality quick fixes. As with all real-time response, the key is to take a step back and triage incoming requests as they arrive, prioritising those we need to deal with first and deferring those we can tackle later.
A quick walkthrough of a problem on a three-node Elasticsearch cluster, first noticed via the generic yellow/red cluster health warning. The chain of events causing the problem looks like…
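For context on what that warning means: the cluster health endpoint (`GET /_cluster/health`, e.g. via `curl localhost:9200/_cluster/health`) returns a status of green, yellow or red, and the `unassigned_shards` count usually points at the culprit. A minimal sketch of turning that response into a triage one-liner (the example values are made up):

```python
def summarise_health(health: dict) -> str:
    """Turn an Elasticsearch _cluster/health response into a one-line triage summary."""
    status = health["status"]
    if status == "green":
        return "green: all primary and replica shards allocated"
    if status == "yellow":
        return (f"yellow: {health['unassigned_shards']} unassigned shards "
                "(replicas missing, primaries intact)")
    return (f"red: {health['unassigned_shards']} unassigned shards "
            "(at least one primary missing, some data unavailable)")

# Example response shape, using fields as returned by GET /_cluster/health:
example = {"status": "yellow", "number_of_nodes": 3, "unassigned_shards": 4}
print(summarise_health(example))
```

Yellow means replicas are unallocated but every primary is live; red means at least one primary is gone, which is when users actually lose data access.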
- The slides from a quick review of Sensu. In short, Sensu is good!
  - RabbitMQ is the only point of communication needed between clients and servers
  - Set up your client-customisable subscription checks on the server
  - Set up any weird custom checks on your clients
  - Please, please don’t alert on anything but the essentials
  - Really, the above ^^^
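As a sketch of the server-side subscription model mentioned above (the check name, command and subscription are hypothetical, and the command assumes a sensu-plugins script is installed), a check defined once on the server runs on every client subscribed to `base`:

```json
{
  "checks": {
    "check_disk": {
      "command": "check-disk-usage.rb -w 80 -c 90",
      "subscribers": ["base"],
      "interval": 60
    }
  }
}
```

Clients opt in by listing `base` in their subscriptions, so the odd one-off check can stay as a standalone definition on the client itself.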
- Some slides from an investigation into migrating to Amazon’s CloudWatch. Quick summary:
  - Create metrics on CloudWatch log streams and alert on them, e.g. the number of 500s in a minute
  - You get basic free metrics from AWS; custom metrics are pretty easy to set up
  - You have access to plenty of AWS-specific metrics and triggers
  - They are well integrated with other AWS services, so you can do more advanced Lambda processing
  - But is it enough to move away from your custom ELK/Graphite-type stack?
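The “500s in a minute” example above can be sketched with the AWS CLI — note the log group, filter, namespace and threshold here are all hypothetical placeholders, and this needs valid AWS credentials to actually run:

```shell
# Count 500s appearing in an access-log group as a custom metric
aws logs put-metric-filter \
  --log-group-name my-app-access-logs \
  --filter-name http-500s \
  --filter-pattern '" 500 "' \
  --metric-transformations \
      metricName=Http500Count,metricNamespace=MyApp,metricValue=1

# Alarm when more than 10 occur within a one-minute period
aws cloudwatch put-metric-alarm \
  --alarm-name http-500-spike \
  --namespace MyApp --metric-name Http500Count \
  --statistic Sum --period 60 --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1
```

Two commands replace what would be a Logstash filter plus a Graphite alert rule, which is the migration trade-off the slides weigh up.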