Reliability

Rolling instance cycling with Elasticsearch

An easy way to cycle EC2 instances that are running an Elasticsearch cluster. As an example target, we have two instances, i-11111111 and i-22222222, both running Elasticsearch as a cluster with replicas set to 2, so that each holds a replica of the other's primary indices. Add one to the auto-scaling group, increasing the desired capacity to 3. Wait for the new instance, i-333333, to join the cluster.
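The "wait for the new instance to join" step can be sketched as a small polling loop. This is a minimal sketch, not the post's own script: `fetch_health` is a hypothetical callable that returns the parsed JSON of Elasticsearch's `_cluster/health` endpoint, and treating "green with no relocating or initializing shards" as the readiness condition is an assumption.

```python
import time
from typing import Callable, Dict


def cluster_ready(health: Dict, expected_nodes: int) -> bool:
    """True once every expected node has joined and shards have settled."""
    return (
        health.get("number_of_nodes", 0) >= expected_nodes
        and health.get("status") == "green"
        and health.get("relocating_shards", 0) == 0
        and health.get("initializing_shards", 0) == 0
    )


def wait_for_cluster(
    fetch_health: Callable[[], Dict],   # hypothetical: returns _cluster/health JSON
    expected_nodes: int,
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the cluster is ready; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if cluster_ready(fetch_health(), expected_nodes):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

In the example above, `wait_for_cluster(fetch_health, expected_nodes=3)` would block until the third node has joined and the cluster has finished allocating shards, which is the point at which an old instance could be considered for removal.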

November 12, 2016

Reliability Elasticsearch

Continuous Load for Live Services

Just as you start off on a Monday morning, at 9:01am, there’s a page: that crucial, heavily used site is broken, and users are blocked from working and frustrated. What went wrong?

April 30, 2016

Reliability Observability

Triage for Incident Response

One of the main pressures around incident response is simply being overwhelmed with tasks. The outcome of so many demands and so much context-switching can easily be chaos, or poor-quality quick fixes. As with all real-time response, the key thing is to take a step back and triage the incoming requests as they arrive, prioritising those we need to deal with first and deferring those that we can tackle later.

April 30, 2016

Reliability Observability

Minimum Downtime instance cycling

The goal here is to implement an instance cycling task that replaces all current instances with new ones, with no downtime. When working with auto-scaling groups, it's important to remember that the auto-scaling group is in control! Simply rebooting an instance will most likely spook the scaling group into replacing the downed instance.
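One way to work with the auto-scaling group rather than against it is to surge by one instance, wait, then retire an old one. The following is a minimal sketch under assumptions, not the post's implementation: `asg` is expected to behave like a boto3 `autoscaling` client, `wait_until_healthy` is a hypothetical callback that blocks until the group has the given number of healthy instances, and the group's maximum size is assumed to allow one extra instance.

```python
from typing import Callable, Iterable


def cycle_instances(
    asg,                                        # boto3-style autoscaling client
    group_name: str,
    old_instance_ids: Iterable[str],
    wait_until_healthy: Callable[[int], None],  # hypothetical health gate
) -> None:
    """Replace each old instance one at a time without dropping capacity."""
    for instance_id in old_instance_ids:
        group = asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=[group_name]
        )["AutoScalingGroups"][0]
        surge = group["DesiredCapacity"] + 1

        # Ask the group for one extra instance before removing anything.
        asg.set_desired_capacity(
            AutoScalingGroupName=group_name,
            DesiredCapacity=surge,
            HonorCooldown=False,
        )
        wait_until_healthy(surge)

        # Terminate through the group and decrement desired capacity,
        # so the group does not launch yet another replacement.
        asg.terminate_instance_in_auto_scaling_group(
            InstanceId=instance_id,
            ShouldDecrementDesiredCapacity=True,
        )
```

With real AWS credentials, `asg` could be `boto3.client("autoscaling")`. Terminating through the group with the decrement flag, instead of rebooting or stopping the instance directly, keeps the group's view of desired capacity consistent with what is actually running.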

March 5, 2016

AWS Reliability

Advanced RabbitMQ Containers

Basically, a RabbitMQ image that uses confd to capture some environment variables and set itself up. All sorts of queues, bindings, vhosts, users, etc. can be set up using this method.

January 2, 2016

Automation Reliability