 

Elasticsearch multi-node cluster one node always fails with docker compose

I am attempting to set up a basic 3-node Elasticsearch 7.7.0 dev cluster with Docker Compose, following the official documentation, but I cannot get all 3 nodes to run at the same time. After running docker-compose up, at least one of the containers exits, either right away or shortly after, like this:

Starting es01 ... done
Starting es03 ... done
Starting es02 ... done
Attaching to es03, es02, es01
es01 exited with code 137

If I try to bring it back up by running docker-compose start es01 (or whichever one happened to exit; it's sometimes random), it causes these errors, which don't stop until I kill all the containers:

es03    | {"type": "server", "timestamp": "2020-05-25T15:52:37,214Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-docker-cluster", "node.name": "es03", "message": "master not discovered or elected yet, an election requires 2 nodes with ids [9RNLWSMbTmetFFh3Tm1q0g, 3CbHK5iBQAqt7poPRsMmaw], have discovered [{es03}{3CbHK5iBQAqt7poPRsMmaw}{AGMLafOfRyyMVeEuFuaTOA}{172.25.0.2}{172.25.0.2:9300}{dilmrt}{ml.machine_memory=2085416960, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, {es02}{9RNLWSMbTmetFFh3Tm1q0g}{bnuXTgAqQyCz04G7Cva1sA}{172.25.0.4}{172.25.0.4:9300}{dilmrt}{ml.machine_memory=2085416960, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] which is a quorum; discovery will continue using [172.25.0.4:9300] from hosts providers and [{es03}{3CbHK5iBQAqt7poPRsMmaw}{AGMLafOfRyyMVeEuFuaTOA}{172.25.0.2}{172.25.0.2:9300}{dilmrt}{ml.machine_memory=2085416960, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }
es02    | {"type": "server", "timestamp": "2020-05-25T15:52:37,936Z", "level": "WARN", "component": "o.e.d.SeedHostsResolver", "cluster.name": "es-docker-cluster", "node.name": "es02", "message": "failed to resolve host [es01]", "cluster.uuid": "e16AK3EnQ-28Xky2afhx4A", "node.id": "9RNLWSMbTmetFFh3Tm1q0g" }

If I don't attempt to restart the container that exited, the other two seem to work fine. If I go to http://localhost:9200/_cluster/health I can see:

{"cluster_name":"es-docker-cluster","status":"green","timed_out":false,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":0,"active_shards":0,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}

I am using Docker Desktop version 2.2.0.5 on macOS Catalina 10.15.4, with Docker Compose version 1.25.4.

This is my docker-compose.yml file for reference:

version: '2.2'
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.7.0
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - elastic
  es02:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.7.0
    container_name: es02
    environment:
      - node.name=es02
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data02:/usr/share/elasticsearch/data
    networks:
      - elastic
  es03:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.7.0
    container_name: es03
    environment:
      - node.name=es03
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es02
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data03:/usr/share/elasticsearch/data
    networks:
      - elastic

volumes:
  data01:
    driver: local
    name: data01
  data02:
    driver: local
    name: data02
  data03:
    driver: local
    name: data03

networks:
  elastic:
    driver: bridge
    name: elastic
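Exit code 137 is worth decoding before digging into cluster settings: it is 128 plus signal 9 (SIGKILL), which in a Docker context usually means the kernel's OOM killer stopped the JVM. A small sketch (the es01 container name comes from the compose file above; the docker inspect check is guarded so the snippet is safe to run where Docker isn't available):

```shell
# 137 = 128 + 9: the process was killed with SIGKILL, typically by the OOM killer.
exit_code=137
signal=$(( exit_code - 128 ))
echo "killed by signal ${signal}"  # signal 9 = SIGKILL

# If Docker is available, the container state confirms an OOM kill directly:
if command -v docker >/dev/null 2>&1; then
  docker inspect es01 --format 'OOMKilled={{.State.OOMKilled}}'
fi
```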
Nick asked Dec 22 '22

2 Answers

My Docker Desktop memory limit was set to 2 GB; once I raised it above 4 GB the issue stopped.

It turns out that exit code 137 in Docker containers is commonly due to memory allocation issues (the kernel OOM-killing the process).
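A quick back-of-the-envelope check explains why 2 GB wasn't enough: each node in the compose file is started with a 512 MB heap, and an Elasticsearch container typically needs roughly twice its heap once off-heap memory (page cache, thread stacks, metaspace) is counted. A rough sketch (the 2x factor is a rule of thumb, not an exact figure):

```shell
# Rough estimate: each ES node needs about 2x its JVM heap in container memory.
heap_mb=512     # matches -Xms512m/-Xmx512m in the compose file
nodes=3
needed_mb=$(( nodes * heap_mb * 2 ))
echo "estimated memory needed: ${needed_mb} MB"
```

That estimate already exceeds a 2048 MB Docker Desktop limit, which matches the observed behaviour: one node gets OOM-killed while the other two survive.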

Nick answered Dec 28 '22


For me, exit code 137 was caused by giving Elasticsearch so much memory that it tried to reserve more than my system had available for the replica. The primary took up 6 GB in docker stats, so Elasticsearch reserved another 6 GB for a replica (but never allocated it, because it was on the same node).

I limited the heap size in docker-compose.yml:

environment:
  - "ES_JAVA_OPTS=-Xms750m -Xmx750m"

This ran everything just as fast in 1 GB of memory as it formerly did in 6 GB, without crashing.
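Since Elasticsearch recommends setting the minimum and maximum heap to the same value, deriving both flags from a single variable keeps them from drifting apart. A minimal sketch (the heap variable name is just for illustration):

```shell
# Build matching -Xms/-Xmx flags from one heap size so min and max never differ.
heap=750m
es_java_opts="-Xms${heap} -Xmx${heap}"
echo "ES_JAVA_OPTS=${es_java_opts}"
```

With a node running, you can confirm the limit the JVM actually picked up via GET /_nodes/stats/jvm (look at heap_max_in_bytes).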

Noumenon answered Dec 28 '22