A quick walkthrough of a problem on a three-node Elasticsearch cluster, first noticed via the generic yellow/red cluster warning. The chain of events causing the problem looks like this:

  • The master node begins to exhaust its JVM heap and becomes unresponsive. As it is both a data node and the master, it has higher resource usage than the non-master nodes, since it also has to manage the cluster (a quick heap check is shown after this list).
  • The non-master nodes can no longer contact the master, so they drop it from the cluster, leaving a two-node cluster.
  • As the previous master is no longer part of the cluster, its resource usage drops and it becomes responsive again, so it rejoins the cluster.
  • The cluster begins to rebalance shards now that it is back to three nodes, increasing resource usage.
  • The problem cycles round to the start, ad infinitum.
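
If you catch the cycle in progress, the heap pressure is visible in the node stats API. A minimal check, grepping out the per-node heap usage (exact field names may differ slightly between versions):

    curl -s 'localhost:9200/_nodes/stats/jvm?pretty' | grep heap_used_percent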

Symptoms

The Elasticsearch nodes are all containerised, and the first thing we could see was that some of the containers had recently been restarted. By the time I had ssh’d in, there were 3 of 3 nodes in the cluster and we had a master. CPU was high as the cluster was rebalancing shards.
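
A quick way to confirm node membership, and which node currently holds the master role (marked with a * in the master column):

    curl -s 'localhost:9200/_cat/nodes?v'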

Discovery

Checking the cluster health on two of the nodes showed:

    curl localhost:9200/_cluster/health?pretty
    {
      "cluster_name" : "my-cluster",
      "status" : "red",
      "timed_out" : false,
      "number_of_nodes" : 3,
      "number_of_data_nodes" : 3,
      "active_primary_shards" : 2843,
      "active_shards" : 3162,
      "relocating_shards" : 0,
      "initializing_shards" : 12,
      "unassigned_shards" : 9280,
      "delayed_unassigned_shards" : 0,
      "number_of_pending_tasks" : 87,
      "number_of_in_flight_fetch" : 8213
    }

Then on checking the third node:

    curl localhost:9200/_cluster/health?pretty
    {
      "error" : "MasterNotDiscoveredException[waited for [30s]]",
      "status" : 503
    }
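
The cat master endpoint is another quick check here; on a healthy node it prints the node the cluster believes is master, while on the flapping node it fails in the same way:

    curl -s 'localhost:9200/_cat/master?v'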

Looking at the Elasticsearch logs:

    Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "elasticsearch[i-instance][[http_server_worker.default]][T#8]"
    Exception in thread "elasticsearch[i-instance][fetch_shard_started][T#89]" java.lang.OutOfMemoryError: Java heap space
    Exception in thread "elasticsearch[i-instance][[http_server_worker.default]][T#2]" java.lang.OutOfMemoryError: Java heap space

Actions

Recovering the cluster involves reducing JVM heap and CPU usage enough that rebalancing can occur without stressing the master node.

First, disable shard allocation until the cluster can cope with it:

    curl -XPUT localhost:9200/_cluster/settings -d '{
      "transient" : {
        "cluster.routing.allocation.enable" : "none"
      }
    }'
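
You can verify the setting took effect; transient settings show up in the transient block of the cluster settings API:

    curl -s 'localhost:9200/_cluster/settings?pretty'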

Close old indices to drop resource usage:

    curator close indices --older-than 30 --time-unit days --timestring '%Y.%m.%d'
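
To sanity-check which indices Curator will close before committing, this vintage of Curator (the 3.x syntax used above) supports a dry-run flag, if your version has it:

    curator --dry-run close indices --older-than 30 --time-unit days --timestring '%Y.%m.%d'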

Allow allocation of primary shards:

    curl -XPUT localhost:9200/_cluster/settings -d '{
      "transient" : {
        "cluster.routing.allocation.enable" : "primaries"
      }
    }'

Once we’re stable, allow all shard allocation again:

    curl -XPUT localhost:9200/_cluster/settings -d '{
      "transient" : {
        "cluster.routing.allocation.enable" : "all"
      }
    }'

Hopefully there are now enough resources for the shards to be allocated across the cluster, and the unassigned shard count should fall steadily:

    while true; do sleep 3; curl localhost:9200/_cluster/health?pretty -s | grep \"unassigned_shards\"; done
    "unassigned_shards" : 8420,
    "unassigned_shards" : 8384,
    "unassigned_shards" : 8339,

In reality I used a couple of tweaks to push things along.

Increasing the number of concurrent recoveries per node speeds things up; there was enough slack in the system to do this:

    curl -XPUT localhost:9200/_cluster/settings -d '{
      "persistent" : {
        "cluster.routing.allocation.node_concurrent_recoveries" : 20
      }
    }'
    {"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"node_concurrent_recoveries":"20"}}}},"transient":{}}

There were also a couple of damaged indices that needed to be removed; these can be recovered from backups with some extra work.

    curl -XGET -s 'localhost:9200/_cat/indices' | grep red
    red open logstash-sensu-2015.12.29 5 1
    red open logstash-myproject-2016.01.01 5 1
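
Deleting a red index is a one-liner; double-check the name first, as this is destructive and only recoverable from backups:

    curl -XDELETE 'localhost:9200/logstash-sensu-2015.12.29'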