TensorBoard callback causes a segmentation fault

#python #tensorflow

Question:

I am building a CNN in TensorFlow. It used to work without any problems, but now training fails shortly after it starts, with the following output:

 2021-05-12 12:23:32.842941: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.summary API due to missing TensorBoard installation.
2021-05-12 12:23:34.248072: I tensorflow/core/profiler/lib/profiler_session.cc:136] Profiler session initializing.
2021-05-12 12:23:34.248256: I tensorflow/core/profiler/lib/profiler_session.cc:155] Profiler session started.
2021-05-12 12:23:34.249111: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-05-12 12:23:34.270346: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1365] Profiler found 1 GPUs
2021-05-12 12:23:34.271925: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cupti64_110.dll
2021-05-12 12:23:34.274700: I tensorflow/core/profiler/lib/profiler_session.cc:172] Profiler session tear down.
2021-05-12 12:23:34.274888: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1487] CUPTI activity buffer flushed
2021-05-12 12:23:34.619548: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-12 12:23:34.619821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:65:00.0 name: NVIDIA GeForce RTX 3060 computeCapability: 8.6
coreClock: 1.807GHz coreCount: 28 deviceMemorySize: 12.00GiB deviceMemoryBandwidth: 335.32GiB/s
2021-05-12 12:23:34.620095: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-05-12 12:23:34.634109: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-05-12 12:23:34.634374: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-05-12 12:23:34.637843: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-05-12 12:23:34.639245: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-05-12 12:23:34.647947: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-05-12 12:23:34.650857: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-05-12 12:23:34.651522: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-05-12 12:23:34.651679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-05-12 12:23:34.652288: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-05-12 12:23:34.653598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:65:00.0 name: NVIDIA GeForce RTX 3060 computeCapability: 8.6
coreClock: 1.807GHz coreCount: 28 deviceMemorySize: 12.00GiB deviceMemoryBandwidth: 335.32GiB/s
2021-05-12 12:23:34.653893: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-05-12 12:23:34.654023: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-05-12 12:23:34.654325: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-05-12 12:23:34.654433: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-05-12 12:23:34.654527: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-05-12 12:23:34.654604: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-05-12 12:23:34.654686: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-05-12 12:23:34.654783: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-05-12 12:23:34.654896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-05-12 12:23:35.053408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-12 12:23:35.053846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-05-12 12:23:35.054309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2021-05-12 12:23:35.054710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10491 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3060, pci bus id: 0000:65:00.0, compute capability: 8.6)
2021-05-12 12:23:35.055670: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-12 12:23:35.514513: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/10
2021-05-12 12:23:35.959114: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-05-12 12:23:36.538496: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-05-12 12:23:36.555303: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-05-12 12:23:37.720066: I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 0

2021-05-12 12:23:37.758845: I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 0

2021-05-12 12:23:37.912211: I tensorflow/stream_executor/cuda/cuda_blas.cc:1838] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
 1/38 [..............................] - ETA: 1:38 - loss: 0.7005 - accuracy: 0.25002021-05-12 12:23:38.263462: I tensorflow/core/profiler/lib/profiler_session.cc:136] Profiler session initializing.
2021-05-12 12:23:38.263648: I tensorflow/core/profiler/lib/profiler_session.cc:155] Profiler session started.
 2/38 [>.............................] - ETA: 14s - loss: 0.7037 - accuracy: 0.2891 2021-05-12 12:23:38.606175: I tensorflow/core/profiler/lib/profiler_session.cc:71] Profiler session collecting data.
2021-05-12 12:23:38.606558: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1487] CUPTI activity buffer flushed
Segmentation fault
 

After some digging, I found that the TensorBoard callback was causing this:

 model.fit(X, y, batch_size=32, epochs=10, validation_split=0.3, callbacks=[tensorboard])

 

If I remove the callback, training works as expected, but then I can no longer view the run in TensorBoard.

I have tried uninstalling and reinstalling TensorFlow and TensorBoard, but nothing worked.

Does anyone have any ideas on how to fix this?

Comments:

1. Can you share standalone code that reproduces your problem, so that we can try to help you? Thanks!

Answer #1:

See https://github.com/tensorflow/tensorboard/issues/3149: instead of removing the callback entirely, you can create it with profile_batch=0. This helped me in a very similar situation.
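As a minimal sketch of that suggestion: the log in the question shows the profiler session starting around the second batch and the crash following right after the CUPTI buffer flush, which is consistent with the callback's profiling step being the trigger. Setting profile_batch=0 turns profiling off while scalars and graphs are still written. The helper name make_tensorboard_callback, the TENSORBOARD_KWARGS dict, and the "logs" directory are illustrative assumptions, not part of the original post; model, X and y are the asker's own objects.

```python
# Settings for the TensorBoard callback. profile_batch=0 disables the
# profiler, which is the part suspected of crashing in the question.
# "logs" is an assumed output directory; use your own.
TENSORBOARD_KWARGS = {"log_dir": "logs", "profile_batch": 0}


def make_tensorboard_callback(**overrides):
    """Build a TensorBoard callback with profiling switched off by default.

    Hypothetical helper for illustration; any keyword accepted by
    tf.keras.callbacks.TensorBoard can be passed to override a default.
    """
    # Imported here so the settings above can be inspected
    # without loading TensorFlow.
    import tensorflow as tf

    kwargs = {**TENSORBOARD_KWARGS, **overrides}
    return tf.keras.callbacks.TensorBoard(**kwargs)


# Usage with the call from the question (model, X, y come from your script):
# tensorboard = make_tensorboard_callback()
# model.fit(X, y, batch_size=32, epochs=10, validation_split=0.3,
#           callbacks=[tensorboard])
```

With this change the Profile tab in TensorBoard stays empty, but loss and accuracy curves should still be logged. If profiling is needed later, it may be worth checking that the installed CUDA/CUPTI version matches the TensorFlow build before re-enabling it.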