Elasticsearch crashing

#elasticsearch #magento2 #elasticsearch-7

Question:

We have intermittent problems with Elasticsearch crashing. It also sometimes spikes RAM and CPU usage, and the server stops responding to requests.

We left most of the settings as they were, but we had to give the JVM heap more RAM (48 GB) so that it would not crash as often.

I started digging, and apparently 32 GB is the maximum you should use. We will fix that.
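
As a rough illustration of the sizing rule discussed here (heap at about half of physical RAM, and below the roughly 32 GB point where the JVM stops using compressed object pointers), here is a minimal Python sketch. The function names and the 31 GB cap are assumptions chosen for illustration, not values taken from Elasticsearch itself.

```python
from typing import List


def recommend_heap_gb(total_ram_gb: int, cap_gb: int = 31) -> int:
    """Return a heap size in GB: half of RAM, capped below the ~32 GB threshold.

    cap_gb=31 is an assumed safety margin, not an official constant.
    """
    if total_ram_gb < 2:
        raise ValueError("need at least 2 GB of RAM")
    return min(total_ram_gb // 2, cap_gb)


def jvm_heap_flags(heap_gb: int) -> List[str]:
    """Build matching -Xms/-Xmx flags; min and max heap should be equal."""
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


if __name__ == "__main__":
    heap = recommend_heap_gb(125)  # the 125 GB server described in this post
    print(jvm_heap_flags(heap))
```

On the 125 GB machine described here, half of RAM would be well over the cap, so the cap is what decides the result.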

The server is:

 CentOS 7
RAM: 125GB
CPU: 40 threads
Hard Drive: 2x Raid 1 NVME

^^^ this is more than enough hardware to handle something like this, but something tells me that additional tuning is needed to handle this much data.

We run a Magento 2.4.3 CE store with about 400,000 products.

Here are all of our config files:

jvm.options file

     ## JVM configuration
    
    ################################################################
    ## IMPORTANT: JVM heap size
    ################################################################
    ##
    ## You should always set the min and max JVM heap
    ## size to the same value. For example, to set
    ## the heap to 4 GB, set:
    ##
    ## -Xms4g
    ## -Xmx4g
    ##
    ## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
    ## for more information
    ##
    ################################################################
    
    # Xms represents the initial size of total heap space
    # Xmx represents the maximum size of total heap space
    
    -Xms48g
    -Xmx48g
    
    ################################################################
    ## Expert settings
    ################################################################
    ##
    ## All settings below this section are considered
    ## expert settings. Don't tamper with them unless
    ## you understand what you are doing
    ##
    ################################################################
    
    ## GC configuration
    8-13:-XX:+UseConcMarkSweepGC
    8-13:-XX:CMSInitiatingOccupancyFraction=75
    8-13:-XX:+UseCMSInitiatingOccupancyOnly
    
    ## G1GC Configuration
    # NOTE: G1 GC is only supported on JDK version 10 or later
    # to use G1GC, uncomment the next two lines and update the version on the
    # following three lines to your version of the JDK
    # 10-13:-XX:-UseConcMarkSweepGC
    # 10-13:-XX:-UseCMSInitiatingOccupancyOnly
    14-:-XX:+UseG1GC
    14-:-XX:G1ReservePercent=25
    14-:-XX:InitiatingHeapOccupancyPercent=30
    
    ## DNS cache policy
    # cache ttl in seconds for positive DNS lookups noting that this overrides the
    # JDK security property networkaddress.cache.ttl; set to -1 to cache forever
    -Des.networkaddress.cache.ttl=60
    # cache ttl in seconds for negative DNS lookups noting that this overrides the
    # JDK security property networkaddress.cache.negative ttl; set to -1 to cache
    # forever
    -Des.networkaddress.cache.negative.ttl=10
    
    ## optimizations
    
    # pre-touch memory pages used by the JVM during initialization
    -XX:+AlwaysPreTouch
    
    ## basic
    
    # explicitly set the stack size
    -Xss1m
    
    # set to headless, just in case
    -Djava.awt.headless=true
    
    # ensure UTF-8 encoding by default (e.g. filenames)
    -Dfile.encoding=UTF-8
    
    # use our provided JNA always versus the system one
    -Djna.nosys=true
    
    # turn off a JDK optimization that throws away stack traces for common
    # exceptions because stack traces are important for debugging
    -XX:-OmitStackTraceInFastThrow
    
    # enable helpful NullPointerExceptions (https://openjdk.java.net/jeps/358), if
    # they are supported
    14-:-XX:+ShowCodeDetailsInExceptionMessages
    
    # flags to configure Netty
    -Dio.netty.noUnsafe=true
    -Dio.netty.noKeySetOptimization=true
    -Dio.netty.recycler.maxCapacityPerThread=0
    
    # log4j 2
    -Dlog4j.shutdownHookEnabled=false
    -Dlog4j2.disable.jmx=true
    
    -Djava.io.tmpdir=${ES_TMPDIR}
    
    ## heap dumps
    
    # generate a heap dump when an allocation from the Java heap fails
    # heap dumps are created in the working directory of the JVM
    -XX:+HeapDumpOnOutOfMemoryError
    
    # specify an alternative path for heap dumps; ensure the directory exists and
    # has sufficient space
    -XX:HeapDumpPath=/var/lib/elasticsearch
    
    # specify an alternative path for JVM fatal error logs
    -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
    
    ## JDK 8 GC logging
    
    8:-XX:+PrintGCDetails
    8:-XX:+PrintGCDateStamps
    8:-XX:+PrintTenuringDistribution
    8:-XX:+PrintGCApplicationStoppedTime
    8:-Xloggc:/var/log/elasticsearch/gc.log
    8:-XX:+UseGCLogFileRotation
    8:-XX:NumberOfGCLogFiles=32
    8:-XX:GCLogFileSize=64m
    
    # JDK 9+ GC logging
    9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
    # due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
    # time/date parsing will break in an incompatible way for some date patterns and locals
    9-:-Djava.locale.providers=COMPAT
    
    # temporary workaround for C2 bug with JDK 10 on hardware with AVX-512
    10-:-XX:UseAVX=2
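
The numeric prefixes in this file (`8-13:`, `14-:`, `8:`) restrict a flag to a range of JDK major versions, so which garbage collector is active depends on the JDK that actually runs Elasticsearch. Below is a minimal Python sketch of that selection rule, for illustration only: the helper names `applies_to` and `active_flags` are hypothetical, and this is not the parser Elasticsearch itself uses.

```python
import re
from typing import List

# Optional "N:", "N-:" or "N-M:" prefix, followed by a JVM flag starting with "-".
_LINE = re.compile(r"^(?:(\d+)(-)?(\d+)?:)?(-.+)$")


def applies_to(line: str, jdk_major: int) -> bool:
    """Return True if a jvm.options line is active for the given JDK major version."""
    line = line.strip()
    if not line or line.startswith("#"):
        return False                          # blank line or comment
    m = _LINE.match(line)
    if m is None:
        return False                          # not a recognisable flag line
    low, dash, high, _flag = m.groups()
    if low is None:
        return True                           # no prefix: applies to every JDK
    if dash is None:
        return jdk_major == int(low)          # "8:"    exactly this version
    if high is None:
        return jdk_major >= int(low)          # "14-:"  this version and later
    return int(low) <= jdk_major <= int(high)  # "8-13:" inclusive range


def active_flags(lines: List[str], jdk_major: int) -> List[str]:
    """Return the flags (prefix stripped) that apply to the given JDK."""
    return [_LINE.match(l.strip()).group(4) for l in lines if applies_to(l, jdk_major)]


if __name__ == "__main__":
    sample = [
        "8-13:-XX:+UseConcMarkSweepGC",
        "8-13:-XX:CMSInitiatingOccupancyFraction=75",
        "14-:-XX:+UseG1GC",
        "-Xms48g",
    ]
    print(active_flags(sample, 11))
    print(active_flags(sample, 16))
```

Under this rule, a JDK 14 or newer picks up the G1 lines and ignores every `8-13:` CMS line, so editing `CMSInitiatingOccupancyFraction` would have no effect on such a JDK.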


**elasticsearch.yml**

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
#cluster.name: my-application
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
#node.name: node-1
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /var/lib/elasticsearch
#
# Path to log files:
#
path.logs: /var/log/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
#network.host: 192.168.0.1
#
# Set a custom port for HTTP:
#
#http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.zen.ping.unicast.hosts: ["host1", "host2"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
#discovery.zen.minimum_master_nodes: 
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
xpack.security.enabled: true

I researched the RAM/CPU spike, and it may be caused by these settings not being set:

 gateway.expected_nodes: 10
gateway.recover_after_time: 5m

Here is some other data from Elasticsearch:

 curl -XGET --user username:password http://localhost:9200/

{
  "name" : "web1.example.com",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "S8fFQ993QDWkLY8lZtp_mQ",
  "version" : {
    "number" : "7.13.2",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "4d960a0733be83dd2543ca018aa4ddc42e956800",
    "build_date" : "2021-06-10T21:01:55.251515791Z",
    "build_snapshot" : false,
    "lucene_version" : "8.8.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

curl --user username:password -sS http://localhost:9200/_cluster/health?pretty

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 55.55555555555556
}
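
The `active_shards_percent_as_number` value above follows from the shard counts: 5 active shards out of 9 in total (5 active plus 4 unassigned). The Python sketch below redoes that arithmetic and flags the single-node case. It uses only fields present in the response above; the function names are hypothetical, and the formula is my reading of the numbers rather than a quote from the Elasticsearch source.

```python
def active_shard_percent(health: dict) -> float:
    """Recompute the share of active shards from a _cluster/health response."""
    active = health["active_shards"]
    total = active + health["initializing_shards"] + health["unassigned_shards"]
    return 100.0 * active / total if total else 100.0


def explain_status(health: dict) -> str:
    """Give a one-line reading of the cluster status."""
    if (
        health["status"] == "yellow"
        and health["number_of_data_nodes"] == 1
        and health["unassigned_shards"] > 0
    ):
        # A replica can never share a node with its primary.
        return "replica shards cannot be placed: only one data node"
    return f"status is {health['status']}"


if __name__ == "__main__":
    health = {
        "status": "yellow",
        "number_of_data_nodes": 1,
        "active_shards": 5,
        "initializing_shards": 0,
        "unassigned_shards": 4,
    }
    print(round(active_shard_percent(health), 2), "-", explain_status(health))
```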

curl --user username:password -sS http://localhost:9200/_cluster/allocation/explain?pretty

{
  "index" : "example-amasty_product_1_v156",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2021-09-14T16:52:28.854Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "2THEUTSaQdmOJAAhTTN71g",
      "node_name" : "web1.example.com",
      "transport_address" : "127.0.0.1:9300",
      "node_attributes" : {
        "ml.machine_memory" : "134622244864",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "512",
        "ml.max_jvm_size" : "51539607552"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node"
        }
      ]
    }
  ]
}

^^^ The issue is that I don’t know how to set up multiple nodes on one machine.

The misconfiguration, from what I understand, is that we’re running only one node. From what I have read, 3 master nodes are required for green status.

How do I set up multiple nodes on a single machine and do I need to increase data nodes?
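
The allocation explain output above already names what blocks the replica: the `same_shard` decider. A small Python sketch for pulling every blocking decider out of such a response, so the reason is visible at a glance. It relies only on fields shown in the output above; the function name is hypothetical.

```python
from typing import List, Tuple


def blocking_deciders(explain: dict) -> List[Tuple[str, str, str]]:
    """Return (node_name, decider, explanation) for every decider that said NO."""
    blocked = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blocked.append(
                    (node["node_name"], decider["decider"], decider["explanation"])
                )
    return blocked


if __name__ == "__main__":
    explain = {
        "index": "example-amasty_product_1_v156",
        "primary": False,
        "node_allocation_decisions": [
            {
                "node_name": "web1.example.com",
                "deciders": [
                    {
                        "decider": "same_shard",
                        "decision": "NO",
                        "explanation": "a copy of this shard is already allocated to this node",
                    }
                ],
            }
        ],
    }
    for node, decider, why in blocking_deciders(explain):
        print(f"{node}: {decider}: {why}")
```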

My main suspicions:

not enough master / data nodes
newer Garbage Collector is having issues (G1GC is enabled; I’m not sure how to determine which one is currently enabled from the config). ALREADY FIGURED IT OUT: G1 is used.
no recovery setup in case of crash (gateway.expected_nodes, gateway.recover_after_time)

UPDATE:

Here is the error log from elasticsearch.log

https://privfile.com/download.php?fid=6141256d5e639-MTAwNTM=

Apologies the log file did not fit into Stackoverflow post 🙂

Pastebin:

Part #1: https://pastebin.com/86sLM9BD
Part #2: https://pastebin.com/1VEn63TQ

UPDATE:

OUTPUT OF: _cluster/stats?pretty&human

https://pastebin.com/EM8ZMVst

UPDATE:

Figured out how to limit the number of replicas.

This can be done via templates:

 PUT _template/all
{
  "template": "*",
  "settings": {
    "number_of_replicas": 0
  }
}

I will test tomorrow whether it has an effect and turns the status green.

I don’t think it will do anything performance wise, but we’ll see.
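
An index template only applies to indices created after it is stored, so indices that already exist keep their replica count until their settings are updated directly. Below is a minimal Python sketch that builds such an update request against the index settings API. The helper name is hypothetical, the credentials are the placeholders from the curl commands above, and the request is built but deliberately not sent.

```python
import base64
import json
import urllib.request


def build_replica_request(base_url: str, index: str, replicas: int,
                          user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a PUT /<index>/_settings request."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


if __name__ == "__main__":
    req = build_replica_request(
        "http://localhost:9200", "example-amasty_product_1_v156", 0,
        "username", "password",
    )
    # Sending is left commented out so the sketch has no side effects:
    # urllib.request.urlopen(req)
    print(req.get_method(), req.full_url)
```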

I’m working through other suggestions:

Limited RAM use to 31GB
File descriptor is already set to 65535
Maximum number of threads is already set to 4096
Maximum size virtual memory check is already increased and configured
Maximum map count bumped to 262144
G1GC is disabled (by default)
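
The operating-system limits in the list above can be checked in one pass. A hedged Python sketch: the thresholds are the ones named in the list, while the function names and dictionary keys are made up for illustration. `read_linux_limits` assumes Linux and should be run as the user that runs Elasticsearch, since limits are per user.

```python
import sys
from typing import Dict, List

# Minimums named in the list above.
REQUIRED = {
    "max_file_descriptors": 65535,
    "max_threads": 4096,
    "vm_max_map_count": 262144,
}


def failed_checks(actual: Dict[str, int]) -> List[str]:
    """Return a message for every limit that is below its required minimum."""
    problems = []
    for name, minimum in REQUIRED.items():
        value = actual.get(name)
        if value is None or value < minimum:
            problems.append(f"{name}: have {value}, need at least {minimum}")
    return problems


def read_linux_limits() -> Dict[str, int]:
    """Read the current soft limits of this process on Linux."""
    import resource  # Unix-only module, imported lazily

    def norm(value: int) -> int:
        # "unlimited" is reported as RLIM_INFINITY; treat it as very large.
        return sys.maxsize if value == resource.RLIM_INFINITY else value

    nofile, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    nproc, _ = resource.getrlimit(resource.RLIMIT_NPROC)
    with open("/proc/sys/vm/max_map_count") as fh:
        max_map_count = int(fh.read().strip())
    return {
        "max_file_descriptors": norm(nofile),
        "max_threads": norm(nproc),
        "vm_max_map_count": max_map_count,
    }


if __name__ == "__main__":
    for problem in failed_checks(read_linux_limits()):
        print(problem)
```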

One thing I’m trying is to lower this setting from:

 8-13:-XX:CMSInitiatingOccupancyFraction=75

to:

 8-13:-XX:CMSInitiatingOccupancyFraction=70

I believe this will make garbage collection start earlier and prevent out-of-memory errors. We’ll try adjusting this up and down to see if it helps.

Switch to G1GC

I understand this is not really encouraged, but there are articles covering similar out-of-memory problems where switching to G1GC helped resolve the issue: https://medium.com/naukri-engineering/garbage-collection-in-elasticsearch-and-the-g1gc-16b79a447181

This will be the last thing I try.

UPDATE:

After all these changes the index finally went green (the template fix worked).

It also had no problems overnight. It is not as fast as with 50 GB of RAM, but at least it is stable.

General advice for future Elasticsearch troubleshooters: go through the bootstrap checks. That will at least give you baseline performance.

UPDATE: Found a problem with JVM settings being picked up from two places and used for different purposes.

It looks like the system administrator put heap_size.options into

/etc/elasticsearch/jvm.options.d

with a JVM heap setting of 31 GB, while the main jvm.options file showed 8 GB. This affected the GC threads, which worked with only 8 GB of RAM (yet all 31 GB of RAM was still occupied).

I deleted that file and set 31 GB in the jvm.options file.
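
A conflict like this, with heap sizes set both in the main `jvm.options` and in a file under `jvm.options.d`, can be spotted by scanning both locations for `-Xms`/`-Xmx` lines. A minimal Python sketch follows. The function names are hypothetical, and it only reports what is written in the files; it does not decide which value the JVM ends up using.

```python
import re
from pathlib import Path
from typing import Dict, List, Tuple

# Optional JDK-version prefix ("8:", "8-:", "8-13:") followed by -Xms or -Xmx.
_HEAP = re.compile(r"^(?:\d+(?:-\d*)?:)?-(Xms|Xmx)(\S+)$")


def find_heap_flags(config_dir: str) -> List[Tuple[str, str, str]]:
    """Return (file, flag, value) for every uncommented -Xms/-Xmx line."""
    root = Path(config_dir)
    files = [root / "jvm.options"]
    extra = root / "jvm.options.d"
    if extra.is_dir():
        files += sorted(extra.glob("*.options"))
    found = []
    for path in files:
        if not path.is_file():
            continue
        for raw in path.read_text().splitlines():
            m = _HEAP.match(raw.strip())
            if m:
                found.append((str(path.relative_to(root)), m.group(1), m.group(2)))
    return found


def conflicting_heap_flags(config_dir: str) -> Dict[str, List[str]]:
    """Map each flag to its distinct values when more than one value is set."""
    values: Dict[str, List[str]] = {}
    for _file, flag, value in find_heap_flags(config_dir):
        values.setdefault(flag, [])
        if value not in values[flag]:
            values[flag].append(value)
    return {flag: vals for flag, vals in values.items() if len(vals) > 1}


if __name__ == "__main__":
    for file, flag, value in find_heap_flags("/etc/elasticsearch"):
        print(f"{file}: -{flag}{value}")
    print(conflicting_heap_flags("/etc/elasticsearch"))
```

An empty dictionary from `conflicting_heap_flags` means every file agrees on the heap size.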

This stabilized things somewhat, but GC is still collecting at a high rate.

As soon as I added any attributes to the list to be indexed, GC filled up memory again.

The only thing that rescues it is deleting the index and reindexing.

I’m at the point where I’m thinking about nuking the whole Elasticsearch installation and then doing it myself.

It shouldn’t be this hard.

1. A yellow cluster means that some replicas cannot be allocated. Since you have only 1 node, you need to set the number of replicas to 0, which will change the cluster state to green. As for why your cluster is crashing, you will need to provide more information; can you share the logs from when it crashes? Also try setting the heap to 30 GB instead of 48 GB, per the documentation.

2. @leandrojmp thank you! I updated the post with elasticsearch.log. I’m stuck without a dev server right now (it is being used for something else), so I’ll need to figure this out while setting up a local machine.

Answer #1:

a few things

high CPU or memory use will not be related to those gateway settings being absent, and on a single-node cluster they are somewhat irrelevant
we recommend keeping heaps <32 GB, see https://www.elastic.co/guide/en/elasticsearch/reference/current/advanced-configuration.html#set-jvm-heap-size
you can never allocate replica shards on the same node as the primary. so for a single-node cluster you either need to remove replicas (risky), or add another (ideally) 2 nodes to the cluster
setting up a multi-node cluster on the same host is a little pointless. sure, your replicas will be allocated, but if you lose the host you lose all of the data anyway

I’d suggest looking at https://www.elastic.co/guide/en/elasticsearch/reference/7.14/bootstrap-checks.html and applying the settings it talks about, because even if you are running a single node, those are what we refer to as production-ready settings

also, do you have Monitoring enabled? what do your Elasticsearch logs show? what about hot threads? or slow logs?
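
Alongside hot threads and slow logs, one way to put numbers on GC pressure is the node stats API (`GET _nodes/stats/jvm`), which reports heap usage and per-collector counts and times. The sketch below condenses such a response in Python. The function name is hypothetical, and the sample values in the usage part are invented for illustration, not taken from this cluster.

```python
from typing import Dict


def summarize_jvm(stats: dict) -> Dict[str, dict]:
    """Summarise heap use and GC activity per node from a _nodes/stats/jvm response."""
    summary = {}
    for node in stats.get("nodes", {}).values():
        jvm = node["jvm"]
        collectors = {}
        for name, gc in jvm["gc"]["collectors"].items():
            count = gc["collection_count"]
            millis = gc["collection_time_in_millis"]
            collectors[name] = {
                "count": count,
                # Average pause per collection; 0.0 if the collector never ran.
                "avg_ms": millis / count if count else 0.0,
            }
        summary[node["name"]] = {
            "heap_used_percent": jvm["mem"]["heap_used_percent"],
            "collectors": collectors,
        }
    return summary


if __name__ == "__main__":
    # Invented sample in the shape of a _nodes/stats/jvm response.
    sample = {
        "nodes": {
            "node-id": {
                "name": "web1.example.com",
                "jvm": {
                    "mem": {"heap_used_percent": 92},
                    "gc": {
                        "collectors": {
                            "young": {"collection_count": 10, "collection_time_in_millis": 500},
                            "old": {"collection_count": 0, "collection_time_in_millis": 0},
                        }
                    },
                },
            }
        }
    }
    print(summarize_jvm(sample))
```

A high `heap_used_percent` that stays high while collection counts keep climbing would match the "lots of GCs that aren't doing much" pattern.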

(and, to be pedantic, it’s Elasticsearch, the s is not camelcase ;))

1. Thanks! I updated the post with elasticsearch.log and the camelcase fix 🙂

2. thanks for that. fyi, in future a gist/pastebin/etc is better for logs, much easier to share than uploading a file 🙂

3. ok, yeah, there are a lot of GCs there that aren’t doing much. what’s the output from the _cluster/stats?pretty&human API (in a gist/pastebin/etc please)?

Answer #2:

We solved this. The problem was a bad installation.

Something was not working right (I still don’t know exactly what the problem was).

Both ES and Java were reinstalled. I matched ES to the specific version that works in my dev environment.

You can see here that GC is finally working correctly.

We also got ES directly from the source. The previous installation was from some random repository.

I enabled all the attributes the company needed and it didn’t even flinch: stable and fast.

Thanks to everyone who helped me get through these steps, as I would not have nuked the ES installation without knowing I had done everything possible to stabilize it.

It also gave me a lesson in tuning ES 🙂