Tensorflow model fails to train on server ("Unable to locate the source code of …"; script runs fine on different machine)

Question:

I have a CNN that I can train on my local desktop computer (in theory, i.e. with a batch size of 1 and over a very long time). However, when I run the same code on a different machine, I get some strange (?) errors. Earlier versions of the script did run on that machine, and the error points to a function that, as far as I can tell, I have not changed since the last successful run of the script on the server.

I think the original error responsible for the abort is the following:

 Unable to locate the source code of <function load_image_train at 0x14f2f6d0b820>.
 

The function load_image_train is a loader for training images that I got from this site.
It is defined in the script as follows:

@tf.function
def load_image_train(datapoint: dict) -> tuple:
    input_image = tf.image.resize(datapoint['image'], (IMG_SIZE, IMG_SIZE))
    input_mask = tf.image.resize(datapoint['segmentation_mask'], (IMG_SIZE, IMG_SIZE))

    if tf.random.uniform(()) > 0.5:
        input_image = tf.image.flip_left_right(input_image)
        input_mask = tf.image.flip_left_right(input_mask)

    input_image, input_mask = normalize(input_image, input_mask)

    return input_image, input_mask
 

So what is the problem here, and why does the script work on one machine but not on the other?

Full output of the run (system info and errors):

 Running on Linux 4.18.0-193.65.2.el8_2.x86_64.
Python version: 3.8.0 (default, Mar  9 2020, 18:02:46) 
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)]
Tensorflow version: 2.6.0
UTC time (start): 2021-10-28 07:58:46.099075
Local time (start): 2021-10-28 09:58:50.733831
N GPUs available:  4
2021-10-28 09:58:53.298641: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-28 09:58:55.358312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30988 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0
2021-10-28 09:58:55.360055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 30988 MB memory:  -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0
2021-10-28 09:58:55.361619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 30988 MB memory:  -> device: 2, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0
2021-10-28 09:58:55.363165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 30988 MB memory:  -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b2:00.0, compute capability: 7.0
WARNING:tensorflow:AutoGraph could not transform <function parse_image at 0x14f2f76173a0> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the entire output.
Cause: Unable to locate the source code of <function parse_image at 0x14f2f76173a0>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain, the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function load_image_train at 0x14f2f6d0b820> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the entire output.
Cause: Unable to locate the source code of <function load_image_train at 0x14f2f6d0b820>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
Traceback (most recent call last):
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/autograph/pyct/parser.py", line 154, in parse_entity
    original_source = inspect_utils.getimmediatesource(entity)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/autograph/pyct/inspect_utils.py", line 151, in getimmediatesource
    lines, lnum = inspect.findsource(obj)
  File "/usr/lib64/python3.8/inspect.py", line 798, in findsource
    raise OSError('could not get source code')
OSError: could not get source code

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 432, in converted_call
    converted_f = _convert_actual(target_entity, program_ctx)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 274, in _convert_actual
    transformed, module, source_map = _TRANSPILER.transform(entity, program_ctx)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/autograph/pyct/transpiler.py", line 286, in transform
    return self.transform_function(obj, user_context)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/autograph/pyct/transpiler.py", line 470, in transform_function
    nodes, ctx = super(PyToPy, self).transform_function(fn, user_context)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/autograph/pyct/transpiler.py", line 346, in transform_function
    node, source = parser.parse_entity(fn, future_features=future_features)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/autograph/pyct/parser.py", line 156, in parse_entity
    raise ValueError(
ValueError: Unable to locate the source code of <function load_image_train at 0x14f2f6d0b820>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./LeleNet/py3/LeleNet_trn.py", line 377, in <module>
    dataset["train"] = dataset["train"]
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1863, in map
    return ParallelMapDataset(
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 5020, in __init__
    self._map_func = StructuredFunctionWrapper(
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4218, in __init__
    self._function = fn_factory()
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3150, in get_concrete_function
    graph_function = self._get_concrete_function_garbage_collected(
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3116, in _get_concrete_function_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3463, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3298, in _create_graph_function
    func_graph_module.func_graph_from_py_func(
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 1007, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4195, in wrapped_fn
    ret = wrapper_helper(*args)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4125, in wrapper_helper
    ret = autograph.tf_convert(self._func, ag_ctx)(*nested_args)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 885, in __call__
    result = self._call(*args, **kwds)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 933, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 759, in _initialize
    self._stateful_fn._get_concrete_function_internal_garbage_collected(  # pylint: disable=protected-access
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3066, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3463, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3298, in _create_graph_function
    func_graph_module.func_graph_from_py_func(
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 1007, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 668, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 983, in wrapper
    return autograph.converted_call(
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 439, in converted_call
    return _fall_back_unconverted(f, args, kwargs, options, e)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 486, in _fall_back_unconverted
    return _call_unconverted(f, args, kwargs, options)
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 463, in _call_unconverted
    return f(*args, **kwargs)
  File "./LeleNet/py3/LeleNet_trn.py", line 351, in load_image_train
    if tf.random.uniform(()) > 0.5:
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 900, in __bool__
    self._disallow_bool_casting()
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 503, in _disallow_bool_casting
    self._disallow_when_autograph_enabled(
  File "/home/kit/ifgg/mp3890/.local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 489, in _disallow_when_autograph_enabled
    raise errors.OperatorNotAllowedInGraphError(
tensorflow.python.framework.errors_impl.OperatorNotAllowedInGraphError: using a `tf.Tensor` as a Python `bool` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.
 

System/Python info of the machine on which the script runs without errors:

 Running on Windows 8.1.
Python version: 3.9.5 (tags/v3.9.5:0a7dcbd, May  3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)]
Tensorflow version: 2.5.0
 

If required: the full script can be found here. In both cases, I ran it from a terminal/command prompt as ~$ python LeleNet_trn.py "fcd" 6 60 -op "adam".

Update:
The problem apparently only occurs when I run the script from the home directory and specify the path to the script as python ./some_folder/script.py. When I run python ~/some_folder/script.py, or cd ./some_folder followed by python script.py, the issue does not occur.
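
The traceback above shows AutoGraph failing inside inspect.findsource from the Python standard library, so the same lookup can be checked without TensorFlow. The following is only a minimal sketch of such a check: the helper name source_is_available is hypothetical, and the function built from a string merely stands in for any function whose source file Python cannot find.

```python
import inspect


def source_is_available(fn) -> bool:
    """Return True if `inspect` can read the source code of `fn`.

    Hypothetical helper, not part of TensorFlow. inspect.getsource goes
    through inspect.findsource, the call that fails in the traceback.
    """
    try:
        inspect.getsource(fn)
        return True
    except (OSError, TypeError):
        # OSError: the source file cannot be found or read.
        # TypeError: the object is a built-in without Python source.
        return False


# A function compiled from a string has no source file on disk, so it
# reproduces the "could not get source code" situation.
namespace = {}
exec(compile("def no_source(x):\n    return x\n", "<generated>", "exec"), namespace)

print(source_is_available(namespace["no_source"]))
```

Calling source_is_available(load_image_train) in the script before the dataset.map call, once for each way of launching the script, would show whether the source lookup itself depends on how the script path is given.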