#tensorflow #debugging #keras #crash #google-colaboratory
Question:
When I run the following code in Google Colab with a GPU runtime (one of my custom layers performs tensorflow.fft on the GPU), my session crashes:
from tensorflow.keras import layers, models, regularizers
# conv2d_layer and complex_conv_transpose_layer are custom layers defined elsewhere in the notebook

fc2_shape = 32*32
model = models.Sequential()
model.add(layers.Flatten(input_shape=(32, 32, 2)))
model.add(layers.Dense(fc2_shape, activation='tanh'))
model.add(layers.Dense(fc2_shape, activation='tanh'))
model.add(layers.Reshape((32, 32, 1)))
model.add(conv2d_layer(num_features=32, kernel_size=5, type_conv="complex"))
model.add(layers.Activation('relu'))
model.add(conv2d_layer(num_features=32, kernel_size=5, type_conv="complex", kernel_regularizer=regularizers.l1(0.0001)))
model.add(layers.Activation('relu'))
model.add(complex_conv_transpose_layer(num_features=1, kernel_size=9, strides=1))
model.summary()
It crashes with the message: "Your session crashed. Automatically restarting... Your session restarted after a crash. Debug... Session crashed for an unknown reason. View runtime log."
The runtime logs are below. Could I get some help understanding what might be causing the crash? The log does not even contain an error message, only warnings. I have tried many of the fixes suggested for several of the warnings in the log, but none of them seems to work. I need to find the exact cause. Thanks.
Jan 12, 2021, 9:02:25 AM WARNING WARNING:root:kernel 96640b1f-78c4-4aee-8612-299bbd2a4d8d restarted
Jan 12, 2021, 9:02:25 AM INFO KernelRestarter: restarting kernel (1/5), keep random ports
Jan 12, 2021, 9:02:20 AM WARNING 2021-01-12 03:32:20.430989: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
Jan 12, 2021, 9:02:20 AM WARNING 2021-01-12 03:32:20.256096: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13960 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
Jan 12, 2021, 9:02:20 AM WARNING 2021-01-12 03:32:20.256008: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
Jan 12, 2021, 9:02:20 AM WARNING 2021-01-12 03:32:20.255209: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jan 12, 2021, 9:02:20 AM WARNING 2021-01-12 03:32:20.254277: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jan 12, 2021, 9:02:20 AM WARNING 2021-01-12 03:32:20.253280: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jan 12, 2021, 9:02:20 AM WARNING 2021-01-12 03:32:20.247860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
Jan 12, 2021, 9:02:20 AM WARNING 2021-01-12 03:32:20.247846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
Jan 12, 2021, 9:02:20 AM WARNING 2021-01-12 03:32:20.247794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.208463: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.205616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.204824: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.203940: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.203862: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.203843: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.203825: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.203806: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.203787: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.203768: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.203743: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.203699: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Jan 12, 2021, 9:02:16 AM WARNING coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.73GiB deviceMemoryBandwidth: 298.08GiB/s
Jan 12, 2021, 9:02:16 AM WARNING pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.203627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.202836: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.202313: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.201374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.197656: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.196404: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jan 12, 2021, 9:02:16 AM WARNING 2021-01-12 03:32:16.196182: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
Jan 12, 2021, 9:02:15 AM WARNING 2021-01-12 03:32:15.708027: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
Jan 12, 2021, 9:02:15 AM WARNING 2021-01-12 03:32:15.691252: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
Jan 12, 2021, 9:02:15 AM WARNING 2021-01-12 03:32:15.439674: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
Jan 12, 2021, 9:02:15 AM WARNING 2021-01-12 03:32:15.390009: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
Jan 12, 2021, 9:02:15 AM WARNING 2021-01-12 03:32:15.274378: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
Jan 12, 2021, 9:02:15 AM WARNING 2021-01-12 03:32:15.274194: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
Jan 12, 2021, 9:02:14 AM WARNING 2021-01-12 03:32:14.935005: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Jan 12, 2021, 9:02:14 AM WARNING coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.73GiB deviceMemoryBandwidth: 298.08GiB/s
Jan 12, 2021, 9:02:14 AM WARNING pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
Jan 12, 2021, 9:02:14 AM WARNING 2021-01-12 03:32:14.934938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
Jan 12, 2021, 9:02:14 AM WARNING 2021-01-12 03:32:14.933915: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Jan 12, 2021, 9:02:14 AM WARNING 2021-01-12 03:32:14.866772: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
Jan 12, 2021, 9:02:14 AM WARNING 2021-01-12 03:32:14.865367: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
Jan 12, 2021, 9:02:14 AM WARNING tcmalloc: large alloc 1228800000 bytes == 0x14210000 @ 0x7f536abaa1e7 0x7f53620a841e 0x7f53620f8c2b 0x7f53620f8cc8 0x7f53621b4d19 0x7f53621b7dec 0x7f53622d6ddf 0x7f53622dcf15 0x7f53622ded9d 0x7f53622e0476 0x5a48ec 0x5a4fb8 0x7f53621bf438 0x59c9f0 0x50ea2d 0x507be4 0x5161c5 0x50a12f 0x50beb4 0x507be4 0x509900 0x50a2fd 0x50beb4 0x507be4 0x509900 0x50a2fd 0x50cc96 0x507be4 0x508ec2 0x594a01 0x59fd0e
Jan 12, 2021, 9:02:06 AM WARNING 2021-01-12 03:32:06.977597: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Jan 12, 2021, 9:01:54 AM INFO Adapting to protocol v5.1 for kernel 96640b1f-78c4-4aee-8612-299bbd2a4d8d
Jan 12, 2021, 9:01:52 AM INFO Kernel started: 96640b1f-78c4-4aee-8612-299bbd2a4d8d
Jan 12, 2021, 9:00:04 AM INFO Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Jan 12, 2021, 9:00:04 AM INFO http://172.28.0.2:9000/
Jan 12, 2021, 9:00:04 AM INFO The Jupyter Notebook is running at:
Jan 12, 2021, 9:00:04 AM INFO 0 active kernels
Jan 12, 2021, 9:00:04 AM INFO Serving notebooks from local directory: /
Jan 12, 2021, 9:00:04 AM INFO google.colab serverextension initialized.
Jan 12, 2021, 9:00:04 AM INFO Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
Jan 12, 2021, 9:00:04 AM WARNING Config option `delete_to_trash` not recognized by `ColabFileContentsManager`.
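A kernel restart with no Python traceback in the runtime log usually means the process was killed at the native level (for example by the OOM killer, or by a segfault inside a CUDA/cuFFT call from the custom FFT layer) rather than by a Python exception. A minimal way to narrow it down, sketched below under the assumption that the custom layers from the snippet above are importable (the import path is only a placeholder), is to build the model one layer at a time and run a tiny forward pass after each addition; the last layer name printed before the crash points at the culprit:

import numpy as np
from tensorflow.keras import layers, models, regularizers
# Hypothetical import path; conv2d_layer and complex_conv_transpose_layer are the asker's own layers.
# from my_custom_layers import conv2d_layer, complex_conv_transpose_layer

fc2_shape = 32 * 32
candidate_layers = [
    layers.Flatten(input_shape=(32, 32, 2)),
    layers.Dense(fc2_shape, activation='tanh'),
    layers.Dense(fc2_shape, activation='tanh'),
    layers.Reshape((32, 32, 1)),
    conv2d_layer(num_features=32, kernel_size=5, type_conv="complex"),
    layers.Activation('relu'),
    conv2d_layer(num_features=32, kernel_size=5, type_conv="complex",
                 kernel_regularizer=regularizers.l1(0.0001)),
    layers.Activation('relu'),
    complex_conv_transpose_layer(num_features=1, kernel_size=9, strides=1),
]

dummy = np.zeros((1, 32, 32, 2), dtype=np.float32)  # one dummy sample matching the declared input shape
model = models.Sequential()
for layer in candidate_layers:
    model.add(layer)
    print("Added layer:", layer.name, flush=True)
    _ = model(dummy)  # tiny forward pass; the last name printed before the crash identifies the failing layer

Running the same loop on a CPU runtime first also helps separate a GPU/cuFFT problem from a host-memory problem, since TensorFlow's FFT ops generally have CPU kernels as well.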
Comments:
1. Did you add anything to change the TF configuration, e.g. config.gpu_options.allow_growth = True? (See the sketch after these comments.)
2. No, I have not made any changes to the tf config.
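For reference, config.gpu_options.allow_growth is the TF 1.x session-config flag; in TF 2.x the equivalent setting can be enabled as sketched below. Note that the runtime log already shows the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set in Colab, so this is unlikely to change the outcome here:

import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of reserving it all at startup.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)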
Answer #1:
This usually happens when you use up all of the RAM available in Google Colab, since it cannot handle very large datasets. You could try upgrading the RAM or using other services.
Try Microsoft Azure, AWS, or Google Cloud services.
They all have fairly good machine-learning products, but they are all paid. Another alternative is to run it locally in Jupyter.
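If RAM exhaustion is the suspicion, it can be checked directly from the notebook before and after building the model; the "tcmalloc: large alloc 1228800000 bytes" line in the log corresponds to roughly 1.2 GB of host memory, well below the Colab limit. A small check, assuming psutil is available in the runtime (it normally is on Colab), might look like:

import psutil

# Report total, available and used host RAM so a creeping allocation is visible before the crash.
mem = psutil.virtual_memory()
print(f"total: {mem.total / 1e9:.1f} GB  available: {mem.available / 1e9:.1f} GB  used: {mem.percent}%")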
Comments:
1. But RAM usage does not climb to the limit before the crash. Also, there are several warnings in the runtime log besides the large alloc. I want to understand from the log what caused the crash.
2. This is not the issue; the log would say the crash was caused by RAM usage, but it does not appear to.
3. @psj I am not sure what it could be, then. But maybe try running it locally in a Jupyter notebook and see whether you hit a similar error. If you do, the problem may not be Google Colab.
4. Yes, I am trying to run a local Jupyter notebook to get error messages in the log, unlike Colab, which simply crashes without providing a proper error log.
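When reproducing locally as suggested, one way to capture a crash that kills the interpreter itself, assuming the model-building code is saved as a standalone script (the file name below is only a placeholder), is to enable the standard-library faulthandler and redirect stderr to a file, e.g. python build_model.py 2> crash.log:

# build_model.py -- hypothetical standalone script containing the model code from the question
import faulthandler

faulthandler.enable()  # dump a low-level traceback to stderr on segfault/abort

# ... model construction code from the question goes here ...

The stderr capture preserves any native TensorFlow/CUDA error text that may not make it into the notebook's log viewer.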