#azure #tensorflow #opencv #azure-machine-learning-studio #horovod
Question:
I'm trying to create a new environment based on the curated TF 2.4 environment, with OpenCV added. OpenCV support is the only difference. I modified the Dockerfile to include opencv-python as follows:
FROM mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.0.3-cudnn8-ubuntu18.04:20211005.v1
ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/tensorflow-2.4
# Create conda environment
RUN conda create -p $AZUREML_CONDA_ENVIRONMENT_PATH \
    python=3.7 pip=20.2.4
# Prepend path to AzureML conda environment
ENV PATH $AZUREML_CONDA_ENVIRONMENT_PATH/bin:$PATH
# Install pip dependencies
RUN HOROVOD_WITH_TENSORFLOW=1 \
    pip install 'matplotlib>=3.3,<3.4' \
                'psutil>=5.8,<5.9' \
                'tqdm>=4.59,<4.60' \
                'pandas>=1.1,<1.2' \
                'scipy>=1.5,<1.6' \
                'numpy>=1.10,<1.20' \
                'ipykernel~=6.0' \
                'azureml-core==1.34.0' \
                'azureml-defaults==1.34.0' \
                'azureml-mlflow==1.34.0' \
                'azureml-telemetry==1.34.0' \
                'tensorboard==2.4.0' \
                'tensorflow-gpu==2.4.1' \
                'tensorflow-datasets==4.3.0' \
                'onnxruntime-gpu>=1.7,<1.8' \
                'horovod[tensorflow-gpu]==0.21.3' \
                'opencv-python'
# This is needed for mpi to locate libpython
ENV LD_LIBRARY_PATH $AZUREML_CONDA_ENVIRONMENT_PATH/lib:$LD_LIBRARY_PATH
However, Horovod fails to build against TensorFlow and gives the following error message:
ERROR: Command errored out with exit status 1:
command: /azureml-envs/tensorflow-2.4/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-pjyu9d6m/horovod/setup.py'"'"'; __file__='"'"'/tmp/pip-install-pjyu9d6m/horovod/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-0t6zraqk
cwd: /tmp/pip-install-pjyu9d6m/horovod/
Complete output (233 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/horovod
copying horovod/__init__.py -> build/lib.linux-x86_64-3.7/horovod
creating build/lib.linux-x86_64-3.7/horovod/runner
copying horovod/runner/task_fn.py -> build/lib.linux-x86_64-3.7/horovod/runner
copying horovod/runner/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner
copying horovod/runner/launch.py -> build/lib.linux-x86_64-3.7/horovod/runner
copying horovod/runner/js_run.py -> build/lib.linux-x86_64-3.7/horovod/runner
copying horovod/runner/gloo_run.py -> build/lib.linux-x86_64-3.7/horovod/runner
copying horovod/runner/run_task.py -> build/lib.linux-x86_64-3.7/horovod/runner
copying horovod/runner/mpi_run.py -> build/lib.linux-x86_64-3.7/horovod/runner
creating build/lib.linux-x86_64-3.7/horovod/_keras
copying horovod/_keras/__init__.py -> build/lib.linux-x86_64-3.7/horovod/_keras
copying horovod/_keras/callbacks.py -> build/lib.linux-x86_64-3.7/horovod/_keras
copying horovod/_keras/elastic.py -> build/lib.linux-x86_64-3.7/horovod/_keras
creating build/lib.linux-x86_64-3.7/horovod/torch
copying horovod/torch/sync_batch_norm.py -> build/lib.linux-x86_64-3.7/horovod/torch
copying horovod/torch/__init__.py -> build/lib.linux-x86_64-3.7/horovod/torch
copying horovod/torch/mpi_ops.py -> build/lib.linux-x86_64-3.7/horovod/torch
copying horovod/torch/optimizer.py -> build/lib.linux-x86_64-3.7/horovod/torch
copying horovod/torch/functions.py -> build/lib.linux-x86_64-3.7/horovod/torch
copying horovod/torch/compression.py -> build/lib.linux-x86_64-3.7/horovod/torch
creating build/lib.linux-x86_64-3.7/horovod/keras
copying horovod/keras/__init__.py -> build/lib.linux-x86_64-3.7/horovod/keras
copying horovod/keras/callbacks.py -> build/lib.linux-x86_64-3.7/horovod/keras
copying horovod/keras/elastic.py -> build/lib.linux-x86_64-3.7/horovod/keras
creating build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/sync_batch_norm.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/__init__.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/elastic.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/util.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/gradient_aggregation_eager.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/gradient_aggregation.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/functions.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/compression.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
creating build/lib.linux-x86_64-3.7/horovod/spark
copying horovod/spark/runner.py -> build/lib.linux-x86_64-3.7/horovod/spark
copying horovod/spark/__init__.py -> build/lib.linux-x86_64-3.7/horovod/spark
copying horovod/spark/conf.py -> build/lib.linux-x86_64-3.7/horovod/spark
copying horovod/spark/gloo_run.py -> build/lib.linux-x86_64-3.7/horovod/spark
copying horovod/spark/mpi_run.py -> build/lib.linux-x86_64-3.7/horovod/spark
creating build/lib.linux-x86_64-3.7/horovod/common
copying horovod/common/__init__.py -> build/lib.linux-x86_64-3.7/horovod/common
copying horovod/common/exceptions.py -> build/lib.linux-x86_64-3.7/horovod/common
copying horovod/common/elastic.py -> build/lib.linux-x86_64-3.7/horovod/common
copying horovod/common/util.py -> build/lib.linux-x86_64-3.7/horovod/common
copying horovod/common/basics.py -> build/lib.linux-x86_64-3.7/horovod/common
creating build/lib.linux-x86_64-3.7/horovod/mxnet
copying horovod/mxnet/__init__.py -> build/lib.linux-x86_64-3.7/horovod/mxnet
copying horovod/mxnet/mpi_ops.py -> build/lib.linux-x86_64-3.7/horovod/mxnet
copying horovod/mxnet/functions.py -> build/lib.linux-x86_64-3.7/horovod/mxnet
creating build/lib.linux-x86_64-3.7/horovod/ray
copying horovod/ray/runner.py -> build/lib.linux-x86_64-3.7/horovod/ray
copying horovod/ray/__init__.py -> build/lib.linux-x86_64-3.7/horovod/ray
copying horovod/ray/ray_logger.py -> build/lib.linux-x86_64-3.7/horovod/ray
copying horovod/ray/elastic.py -> build/lib.linux-x86_64-3.7/horovod/ray
copying horovod/ray/utils.py -> build/lib.linux-x86_64-3.7/horovod/ray
copying horovod/ray/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/ray
creating build/lib.linux-x86_64-3.7/horovod/runner/util
copying horovod/runner/util/lsf.py -> build/lib.linux-x86_64-3.7/horovod/runner/util
copying horovod/runner/util/streams.py -> build/lib.linux-x86_64-3.7/horovod/runner/util
copying horovod/runner/util/threads.py -> build/lib.linux-x86_64-3.7/horovod/runner/util
copying horovod/runner/util/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/util
copying horovod/runner/util/remote.py -> build/lib.linux-x86_64-3.7/horovod/runner/util
copying horovod/runner/util/network.py -> build/lib.linux-x86_64-3.7/horovod/runner/util
copying horovod/runner/util/cache.py -> build/lib.linux-x86_64-3.7/horovod/runner/util
creating build/lib.linux-x86_64-3.7/horovod/runner/http
copying horovod/runner/http/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/http
copying horovod/runner/http/http_client.py -> build/lib.linux-x86_64-3.7/horovod/runner/http
copying horovod/runner/http/http_server.py -> build/lib.linux-x86_64-3.7/horovod/runner/http
creating build/lib.linux-x86_64-3.7/horovod/runner/common
copying horovod/runner/common/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/common
creating build/lib.linux-x86_64-3.7/horovod/runner/task
copying horovod/runner/task/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/task
copying horovod/runner/task/task_service.py -> build/lib.linux-x86_64-3.7/horovod/runner/task
creating build/lib.linux-x86_64-3.7/horovod/runner/driver
copying horovod/runner/driver/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/driver
copying horovod/runner/driver/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/runner/driver
creating build/lib.linux-x86_64-3.7/horovod/runner/elastic
copying horovod/runner/elastic/worker.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
copying horovod/runner/elastic/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
copying horovod/runner/elastic/driver.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
copying horovod/runner/elastic/registration.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
copying horovod/runner/elastic/rendezvous.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
copying horovod/runner/elastic/constants.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
copying horovod/runner/elastic/discovery.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
copying horovod/runner/elastic/settings.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
creating build/lib.linux-x86_64-3.7/horovod/runner/common/util
copying horovod/runner/common/util/host_hash.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
copying horovod/runner/common/util/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
copying horovod/runner/common/util/config_parser.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
copying horovod/runner/common/util/timeout.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
copying horovod/runner/common/util/secret.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
copying horovod/runner/common/util/tiny_shell_exec.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
copying horovod/runner/common/util/env.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
copying horovod/runner/common/util/codec.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
copying horovod/runner/common/util/network.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
copying horovod/runner/common/util/settings.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
copying horovod/runner/common/util/safe_shell_exec.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
copying horovod/runner/common/util/hosts.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
creating build/lib.linux-x86_64-3.7/horovod/runner/common/service
copying horovod/runner/common/service/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/service
copying horovod/runner/common/service/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/service
copying horovod/runner/common/service/task_service.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/service
creating build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib_impl
copying horovod/torch/mpi_lib_impl/__init__.py -> build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib_impl
creating build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib
copying horovod/torch/mpi_lib/__init__.py -> build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib
creating build/lib.linux-x86_64-3.7/horovod/torch/elastic
copying horovod/torch/elastic/__init__.py -> build/lib.linux-x86_64-3.7/horovod/torch/elastic
copying horovod/torch/elastic/state.py -> build/lib.linux-x86_64-3.7/horovod/torch/elastic
copying horovod/torch/elastic/sampler.py -> build/lib.linux-x86_64-3.7/horovod/torch/elastic
creating build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
copying horovod/tensorflow/keras/__init__.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
copying horovod/tensorflow/keras/callbacks.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
copying horovod/tensorflow/keras/elastic.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
creating build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/__init__.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/remote.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/estimator.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/util.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
creating build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/__init__.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/remote.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/optimizer.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/estimator.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/tensorflow.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/util.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/bare.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
creating build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/store.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/_namedtuple_fix.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/serialization.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/__init__.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/params.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/estimator.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/util.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/backend.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/constants.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/cache.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
creating build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/__init__.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/task_info.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/mpirun_exec_fn.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/gloo_exec_fn.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/task_service.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
creating build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/job_id.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/__init__.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/host_discovery.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/rendezvous.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/rsh.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/mpirun_rsh.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
running build_ext
-- Could not find CCache. Consider installing CCache to speed up compilation.
-- The CXX compiler identification is GNU 7.5.0
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Build architecture flags: -mf16c -mavx -mfma
-- Using command /azureml-envs/tensorflow-2.4/bin/python
-- Found MPI_CXX: /usr/local/lib/libmpi.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Found CUDA: /usr/local/cuda (found version "11.0")
-- Linking against static NCCL library
-- Found NCCL: /usr/include
-- Determining NCCL version from the header file: /usr/include/nccl.h
-- NCCL_MAJOR_VERSION: 2
-- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl_static.a)
-- The C compiler identification is GNU 7.5.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Found MPI_C: /usr/local/lib/libmpi.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- MPI include path: /usr/local/include
-- MPI libraries: /usr/local/lib/libmpi.so
CMake Error at /usr/share/cmake-3.10/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
Could NOT find Tensorflow (missing: Tensorflow_LIBRARIES) (Required is at
least version "1.15.0")
Call Stack (most recent call first):
/usr/share/cmake-3.10/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
cmake/Modules/FindTensorflow.cmake:31 (find_package_handle_standard_args)
horovod/tensorflow/CMakeLists.txt:12 (find_package)
-- Configuring incomplete, errors occurred!
See also "/tmp/pip-install-pjyu9d6m/horovod/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeOutput.log".
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-pjyu9d6m/horovod/setup.py", line 188, in <module>
'horovodrun = horovod.runner.launch:run_commandline'
File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
return distutils.core.setup(**attrs)
File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 299, in run
self.run_command('build')
File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
_build_ext.run(self)
File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/command/build_ext.py", line 340, in run
self.build_extensions()
File "/tmp/pip-install-pjyu9d6m/horovod/setup.py", line 89, in build_extensions
cwd=self.build_temp)
File "/azureml-envs/tensorflow-2.4/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '/tmp/pip-install-pjyu9d6m/horovod', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO=/tmp/pip-install-pjyu9d6m/horovod/build/lib.linux-x86_64-3.7', '-DPYTHON_EXECUTABLE:FILEPATH=/azureml-envs/tensorflow-2.4/bin/python']' returned non-zero exit status 1.
----------------------------------------
ERROR: Failed building wheel for horovod
I'm new to Azure ML and find the documentation a bit unclear. I also tried simply adding opencv-python to the existing curated environment by calling conda_dep.add_pip_package("opencv-python"). The result is the same.
Comments:
1. Did you run into any error without opencv-python?
2. Also check the output of pip list.
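The pip list check suggested above can also be scripted, which makes it easier to compare the image with and without opencv-python. This is only a minimal sketch: the helper name installed_versions and the package list are illustrative assumptions, not something from the thread, and on Python 3.7 it assumes the importlib_metadata backport is available.

```python
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    # Assumption: on Python 3.7 the importlib_metadata backport is installed.
    import importlib_metadata as metadata

# Illustrative list of packages relevant to the failing build.
PACKAGES = ["tensorflow-gpu", "tensorflow", "horovod", "opencv-python", "numpy"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Running it with the environment's own interpreter (/azureml-envs/tensorflow-2.4/bin/python) reports on the same site-packages that the Horovod build sees.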
Answer #1:
Some curated images are provided for compute clusters. The Dockerfile at the following link can be customized for your own workflows: https://docs.microsoft.com/en-us/azure/machine-learning/resource-curated-environments#tensorflow
Here is a link to a tutorial on distributed GPU training.
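The CMake step in the log fails with "Could NOT find Tensorflow (missing: Tensorflow_LIBRARIES)". One way to narrow this down is to check what TensorFlow reports about itself to the interpreter that runs the build. The sketch below assumes that the Horovod build gets its TensorFlow include and link settings from tf.sysconfig; that is an assumption and is not confirmed in this thread. tensorflow_build_info is a hypothetical helper name.

```python
import importlib.util


def tensorflow_build_info():
    """Return TensorFlow's version and build flags, or None if it is not installed."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf  # imported lazily so the check works without TF

    return {
        "version": tf.__version__,
        "include": tf.sysconfig.get_include(),
        "link_flags": tf.sysconfig.get_link_flags(),
        "compile_flags": tf.sysconfig.get_compile_flags(),
    }


if __name__ == "__main__":
    info = tensorflow_build_info()
    if info is None:
        print("TensorFlow is not installed for this interpreter")
    else:
        for key, value in info.items():
            print(f"{key}: {value}")
```

If this prints "not installed" at the point where the Horovod wheel is built, TensorFlow was not yet available to the build. In that case it may be worth installing tensorflow-gpu in an earlier RUN step than horovod.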