Обучение CNN с выходом GPU в эпоху 1 с кодом 3221226505

#tensorflow #keras

#tensorflow #keras

Вопрос:

Я установил TensorFlow, cudnn и cuda и все необходимое для запуска TensorFlow с графическим процессором. При запуске кода он завершает первую эпоху с кодом ошибки 3221226505. Я не уверен, почему это происходит, запуск программы в Google collab работает и занимает всего около 1 ГБ памяти GPU.

python — 3.8.6, cuda — 11.2.1, cudnn — 11.2

 from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
import pandas as pd 
import time
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)
data = pd.read_csv("hmnist_28_28_RGB.csv") 

X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]

X = X / 255.0
X = X.values.reshape(-1,28,28,3)

y = tf.keras.utils.to_categorical(y.values,7)

print(y.shape)
print(X.shape)

model = Sequential()
model.add(Conv2D(256, (3, 3), input_shape=X.shape[1:]))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(256, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))

model.add(Dense(7))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X, y, batch_size=5, epochs=100, validation_split=0.3,verbose=1)
 
 2021-02-17 16:16:20.046679: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-17 16:16:20.047795: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-02-17 16:16:20.073694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:09:00.0 name: GeForce RTX 3060 Ti computeCapability: 8.6
coreClock: 1.665GHz coreCount: 38 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2021-02-17 16:16:20.074002: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-02-17 16:16:20.090128: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-02-17 16:16:20.090262: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-02-17 16:16:20.094202: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-02-17 16:16:20.095533: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-02-17 16:16:20.100795: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-02-17 16:16:20.104097: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-02-17 16:16:20.104808: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-02-17 16:16:20.105102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-02-17 16:16:20.105825: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-17 16:16:20.107415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:09:00.0 name: GeForce RTX 3060 Ti computeCapability: 8.6
coreClock: 1.665GHz coreCount: 38 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2021-02-17 16:16:20.107754: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-02-17 16:16:20.107967: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-02-17 16:16:20.108188: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-02-17 16:16:20.108410: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-02-17 16:16:20.108609: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-02-17 16:16:20.108798: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-02-17 16:16:20.108900: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-02-17 16:16:20.109119: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-02-17 16:16:20.109366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-02-17 16:16:20.583958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-17 16:16:20.584082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-02-17 16:16:20.584220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2021-02-17 16:16:20.584499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1024 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3060 Ti, pci bus id: 0000:09:00.0, compute capability: 8.6)
2021-02-17 16:16:20.585568: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
1 Physical GPUs, 1 Logical GPUs
(10015, 7)
(10015, 28, 28, 3)
2021-02-17 16:16:23.006535: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/100
2021-02-17 16:16:23.409332: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-02-17 16:16:24.173684: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-02-17 16:16:24.177694: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll

[Done] exited with code=3221226505 in 7.113 seconds
 

Ответ №1:

У меня была одна и та же проблема в течение нескольких дней, и, наконец, мне удалось ее решить, похоже, это .dll проблема.

Сначала я запустил программу в Windows Power Shell, вот так (в моем случае):

 python -u "D:DocumentsPythontest.py"
 

И я заметил, что после последних трех "Successfully opened dynamic library..." строк появилась новая ошибка, в которой говорилось, что не удалось загрузить библиотеку (в частности cudnn_adv_train64_8.dll ) с Error 126 помощью .

Затем я просто скопировал все файлы в bin папку cudnn:

 cudnn_adv_infer64_8.dll
cudnn_adv_train64_8.dll
cudnn_cnn_infer64_8.dll
cudnn_cnn_train64_8.dll
cudnn_ops_infer64_8.dll
cudnn_ops_train64_8.dll
cudnn64_8.dll
 

И вставил их в папку CUDA bin (в моем случае):

C:Program FilesNVIDIA GPU Computing ToolkitCUDAv11.0bin

Теперь, похоже, он работает нормально и определенно использует мой графический процессор, потому что время обучения сократилось с 2-3 минут до 20 секунд.

Надеюсь, это сработает и для вас.