Apache Ignite.NET: some Ignite nodes fail to start after upgrading to v2.9 (stack smashing detected)

#ignite

Question:

I am using Apache Ignite.NET in a Kubernetes cluster on Linux nodes.

I recently upgraded my Ignite cluster from 2.8.1 to v2.9. After the upgrade, some of the services that are part of the cluster fail to start with the following message:

*** stack smashing detected ***: <unknown> terminated

Interestingly, this mostly happens with the 2nd instance of the same microservice. The first instances usually start successfully (but sometimes the first instances fail too). Another observation is that it happens on nodes that publish Service Grid services. Sometimes a full cluster recycle (killing all nodes and then starting them again) gets all nodes running, sometimes it does not.

Did I break something during the upgrade? What should I check first?

Below is an excerpt from the Ignite log.

 2020-12-08 22:05:25,683 [1] DEBUG  [(null)] - Classpath resolved to: /app/libs/spring-jdbc-4.3.26.RELEASE.jar;/app/libs/spring-messaging-4.3.29.RELEASE.jar;/app/libs/ignite-indexing-2.9.0.jar;/app/libs/opencensus-impl-core-0.22.0.jar;/app/libs/jackson-annotations-2.10.1.jar;/app/libs/lucene-analyzers-common-7.4.0.jar;/app/libs/jackson-dataformat-smile-2.10.1.jar;/app/libs/commons-logging-1.1.1.jar;/app/libs/spring-context-4.3.26.RELEASE.jar;/app/libs/tyrus-standalone-client-1.15.jar;/app/libs/jackson-core-2.10.1.jar;/app/libs/spring-core-4.3.29.RELEASE.jar;/app/libs/control-center-agent-2.9.0.0.jar;/app/libs/commons-codec-1.11.jar;/app/libs/disruptor-3.4.2.jar;/app/libs/javassist-3.21.0-GA.jar;/app/libs/spring-tx-4.3.26.RELEASE.jar;/app/libs/spring-core-4.3.26.RELEASE.jar;/app/libs/commons-logging-1.2.jar;/app/libs/spring-beans-4.3.26.RELEASE.jar;/app/libs/h2-1.4.197.jar;/app/libs/ignite-core-2.9.0.jar;/app/libs/spring-aop-4.3.26.RELEASE.jar;/app/libs/reflections8-0.11.7.jar;/app/libs/cache-api-1.0.0.jar;/app/libs/spring-websocket-4.3.29.RELEASE.jar;/app/libs/lucene-core-7.4.0.jar;/app/libs/jackson-databind-2.10.1.jar;/app/libs/ignite-spring-2.9.0.jar;/app/libs/grpc-context-1.19.0.jar;/app/libs/lucene-queryparser-7.4.0.jar;/app/libs/spring-web-4.3.29.RELEASE.jar;/app/libs/ignite-shmem-1.0.0.jar;/app/libs/guava-26.0-android.jar;/app/libs/spring-expression-4.3.26.RELEASE.jar:/app/libs/spring-jdbc-4.3.26.RELEASE.jar:/app/libs/spring-messaging-4.3.29.RELEASE.jar:/app/libs/ignite-indexing-2.9.0.jar:/app/libs/opencensus-impl-core-0.22.0.jar:/app/libs/jackson-annotations-2.10.1.jar:/app/libs/lucene-analyzers-common-7.4.0.jar:/app/libs/jackson-dataformat-smile-2.10.1.jar:/app/libs/commons-logging-1.1.1.jar:/app/libs/spring-context-4.3.26.RELEASE.jar:/app/libs/tyrus-standalone-client-1.15.jar:/app/libs/jackson-core-2.10.1.jar:/app/libs/spring-core-4.3.29.RELEASE.jar:/app/libs/control-center-agent-2.9.0.0.jar:/app/libs/commons-codec-1.11.jar:/app/libs/disruptor-3.4.2.jar:/ap
p/libs/javassist-3.21.0-GA.jar:/app/libs/spring-tx-4.3.26.RELEASE.jar:/app/libs/spring-core-4.3.26.RELEASE.jar:/app/libs/commons-logging-1.2.jar:/app/libs/spring-beans-4.3.26.RELEASE.jar:/app/libs/h2-1.4.197.jar:/app/libs/ignite-core-2.9.0.jar:/app/libs/spring-aop-4.3.26.RELEASE.jar:/app/libs/reflections8-0.11.7.jar:/app/libs/cache-api-1.0.0.jar:/app/libs/spring-websocket-4.3.29.RELEASE.jar:/app/libs/lucene-core-7.4.0.jar:/app/libs/jackson-databind-2.10.1.jar:/app/libs/ignite-spring-2.9.0.jar:/app/libs/grpc-context-1.19.0.jar:/app/libs/lucene-queryparser-7.4.0.jar:/app/libs/spring-web-4.3.29.RELEASE.jar:/app/libs/ignite-shmem-1.0.0.jar:/app/libs/guava-26.0-android.jar:/app/libs/spring-expression-4.3.26.RELEASE.jar:
2020-12-08 22:05:25,860 [1] DEBUG  [(null)] - JVM started.
[22:05:26,184][INFO][main][XmlBeanDefinitionReader] Loading XML bean definitions from URL [file:/app/./kubernetes.config
...
2020-12-08 22:05:37,936 [70] INFO  org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander [(null)] - Completed rebalance future: RebalanceFuture [state=STARTED, grp=CacheGroupContext [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=82, minorTopVer=0], rebalanceId=1, routines=4, receivedBytes=1200, receivedKeys=0, partitionsLeft=0, startTime=1607465137846, endTime=-1, lastCancelledTime=-1, next=null]
2020-12-08 22:05:37,936 [70] DEBUG org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander [(null)] - Partitions have been scheduled to resend [reason=Rebalance is done, grp=ignite-sys-cache]
2020-12-08 22:05:37,937 [70] DEBUG org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander [(null)] - Finished rebalancing partition: [grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=82, minorTopVer=0], supplier=12ca76f0-3286-4779-a426-408d5d6cf226, p=61]
2020-12-08 22:05:37,937 [70] DEBUG org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander [(null)] - Will not request next demand message [grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=82, minorTopVer=0], supplier=12ca76f0-3286-4779-a426-408d5d6cf226, rebalanceFuture=RebalanceFuture [state=STARTED, grp=CacheGroupContext [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=82, minorTopVer=0], rebalanceId=1, routines=4, receivedBytes=1200, receivedKeys=0, partitionsLeft=0, startTime=1607465137846, endTime=1607465137937, lastCancelledTime=-1, next=null]]
2020-12-08 22:05:37,943 [71] DEBUG org.apache.ignite.internal.processors.odbc.ClientListenerProcessor [(null)] - Grid runnable started: nio-acceptor-client-listener
2020-12-08 22:05:37,944 [72] DEBUG org.apache.ignite.internal.processors.odbc.ClientListenerProcessor [(null)] - Grid runnable started: grid-nio-worker-client-listener-0
2020-12-08 22:05:37,944 [1] DEBUG org.apache.ignite.internal.processors.service.IgniteServiceProcessor [(null)] - Started service processor.
2020-12-08 22:05:37,954 [73] DEBUG org.apache.ignite.internal.processors.service.ServiceDeploymentManager [(null)] - Grid runnable started: services-deployment-worker
2020-12-08 22:05:37,955 [73] DEBUG org.apache.ignite.internal.processors.service.ServiceDeploymentTask [(null)] - Started services deployment task init: [depId=ServiceDeploymentProcessId [topVer=AffinityTopologyVersion [topVer=81, minorTopVer=0], reqId=null], locId=c894369e-d55b-4d7b-8e5e-c990d0547121, evt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=c894369e-d55b-4d7b-8e5e-c990d0547121, consistentId=product-service-deployment-7c69d99ff6-vc6nb, addrs=ArrayList [10.0.2.27, 127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500, product-service-deployment-7c69d99ff6-vc6nb/10.0.2.27:47500], discPort=47500, order=81, intOrder=44, lastExchangeTime=1607465137554, loc=true, ver=2.9.0#20201015-sha1:70742da8, isClient=false], topVer=81, msgTemplate=null, span=org.apache.ignite.internal.processors.tracing.NoopSpan@3f4cf36, nodeId8=c894369e, msg=null, type=NODE_JOINED, tstamp=1607465136027]]
2020-12-08 22:05:38,017 [73] DEBUG org.apache.ignite.internal.processors.resource.GridResourceProcessor [(null)] - Injecting resources [obj=org.apache.ignite.internal.processors.platform.cluster.PlatformClusterNodeFilterImpl@5d421915]
2020-12-08 22:05:38,038 [1] DEBUG org.apache.ignite.internal.processors.rest.GridRestProcessor [(null)] - REST processor started.
2020-12-08 22:05:38,056 [74] DEBUG org.apache.ignite.internal.processors.rest.GridRestProcessor [(null)] - Grid runnable started: session-timeout-worker
2020-12-08 22:05:38,098 [32] DEBUG org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor [(null)] - Timeout has occurred [obj=CancelableTask [id=d5e43644671-3ea29289-4345-4d80-8eab-97397473a5a9, endTime=1607465138070, period=10000, cancel=false, task=org.apache.ignite.internal.processors.query.h2.ConnectionManager$Lambda$307/57085696@6197e588], process=true]
2020-12-08 22:05:38,110 [1] DEBUG org.apache.ignite.internal.processors.resource.GridResourceProcessor [(null)] - Injecting resources [obj=org.gridgain.control.agent.processor.lifecycle.ClusterLifecycleProcessor$Lambda$586/893320639@55cff952]
2020-12-08 22:05:38,142 [75] DEBUG org.apache.ignite.internal.managers.communication.GridIoManager [(null)] - Message set has not been changed: GridCommunicationMessageSet [nodeId=3f89e86c-f636-4324-895b-1a77cec8ed11, endTime=1607465141249, timeoutId=8fe43644671-3ea29289-4345-4d80-8eab-97397473a5a9, topic=TOPIC_COMM_USER, plc=0, msgs=ConcurrentLinkedDeque [], reserved=false, timeout=5000, skipOnTimeout=true, lastTs=1607465136249]
2020-12-08 22:05:38,148 [1] WARN  org.gridgain.control.agent.ControlCenterAgent [(null)] - Current Ignite configuration does not support tracing functionality and Control Center agent will not collect traces (consider adding ignite-opencensus module to classpath).
2020-12-08 22:05:38,152 [1] DEBUG org.apache.ignite.internal.processors.resource.GridResourceProcessor [(null)] - Injecting resources [obj=org.gridgain.control.agent.ControlCenterAgent$Lambda$591/1985869725@151335cb]
2020-12-08 22:05:38,175 [76] DEBUG org.apache.ignite.internal.managers.communication.GridIoManager [(null)] - Message set has not been changed: GridCommunicationMessageSet [nodeId=3f89e86c-f636-4324-895b-1a77cec8ed11, endTime=1607465141249, timeoutId=8fe43644671-3ea29289-4345-4d80-8eab-97397473a5a9, topic=TOPIC_COMM_USER, plc=0, msgs=ConcurrentLinkedDeque [], reserved=false, timeout=5000, skipOnTimeout=true, lastTs=1607465136249]
2020-12-08 22:05:38,476 [73] DEBUG org.apache.ignite.internal.processors.service.ServiceDeploymentTask [(null)] - Calculated service assignment : [srvcId=56296344671-81118589-d216-4762-a835-3df2230389c5, srvcTop={c894369e-d55b-4d7b-8e5e-c990d0547121=1, 3f89e86c-f636-4324-895b-1a77cec8ed11=1}]
2020-12-08 22:05:38,484 [73] DEBUG org.apache.ignite.internal.processors.resource.GridResourceProcessor [(null)] - Injecting resources [obj=org.apache.ignite.internal.processors.platform.dotnet.PlatformDotNetServiceImpl@20119802]
*** stack smashing detected ***: <unknown> terminated
 

Thanks!

Answer #1:

"stack smashing detected" usually indicates a NullReferenceException in C# code.

Set the COMPlus_EnableAlternateStackCheck environment variable to 1 before running the application to see the full stack trace (this works for .NET Core 3.0 and later).

https://ignite.apache.org/docs/latest/net-specific/net-troubleshooting#stack-smashing-detected-dotnet-terminated
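
As a rough sketch of what that looks like in a Linux shell (for example in a container entrypoint script): the variable name and value come from the answer above, while the `dotnet MyService.dll` command is only a placeholder for however your node is actually started.

```shell
# Enable the alternate stack check so the .NET runtime can report the
# managed exception instead of the bare "stack smashing detected" message.
export COMPlus_EnableAlternateStackCheck=1

# Confirm the variable is exported and will be inherited by child processes.
env | grep COMPlus_EnableAlternateStackCheck

# Then start the node as usual. The DLL name below is a placeholder (assumption):
# dotnet MyService.dll
```

In Kubernetes the same variable would normally be set in the `env` section of the container spec rather than in a script. Either way, it has to be present in the environment of the process that hosts the Ignite node.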

Comments:

1. Thanks, I'll try it.

2. Bravo, this helped even more than I could have expected from your suggestion: after setting COMPlus_EnableAlternateStackCheck to 1, the nodes stopped failing. Not sure why, given that the variable was only supposed to reveal the error details, but thanks again!