WebserviceException: Unable to deploy a model with AKS and Azure Machine Learning

#azure-aks #azure-databricks #azure-machine-learning-service

Question:

I tried to deploy a new model from an Azure Databricks notebook. It worked this morning, and now I get the following error:

After running

 service.wait_for_deployment(show_output=True)
print(service.state)
print(service.get_logs())
 

I get:

 "message": "Timed out waiting for AKS deployment to complete. pollTimeout : 00:20:00 serviceName: simdev serviceId: ...",
  "details": [
    {
      "code": "DeploymentTimedOut",
      "message": "Your container endpoint is not available. Please follow the steps to debug:
    1. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. Please refer to https://aka.ms/debugimage#dockerlog for more information.
    2. You can also interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.
    3. View the diagnostic events to check status of container, it may help you to debug the issue.
{"InvolvedObject":"simdev-757df4f999-rbcws","InvolvedKind":"Pod","Type":"Warning","Reason":"FailedScheduling","Message":"0/2 nodes are available: 2 Insufficient nvidia.com/gpu.","LastTimestamp":null}
{"InvolvedObject":"simdev-757df4f999-rbcws","InvolvedKind":"Pod","Type":"Warning","Reason":"FailedScheduling","Message":"0/2 nodes are available: 2 Insufficient nvidia.com/gpu.","LastTimestamp":null}
{"InvolvedObject":"simdev-757df4f999-rbcws","InvolvedKind":"Pod","Type":"Normal","Reason":"Scheduled","Message":"Successfully assigned azureml-train-aml-001-dev/simdev-757df4f999-rbcws to aks-agentpool-34690879-vmss000000","LastTimestamp":null}
 

Yesterday it did not work. This morning it did, and now it does not.

Here is the AKS configuration:

 aks_config = AksWebservice.deploy_configuration(cpu_cores=0.7,
                                                memory_gb=0.7,
                                                gpu_cores=1,
                                                period_seconds=1800,
                                                failure_threshold=10,
                                                timeout_seconds=60,
                                                max_request_wait_time=300000,
                                                scoring_timeout_ms=300000,)
 

Comments:

1. Can you try going to AzureML -> Endpoints -> <Model-Endpoint> -> Deployment logs and paste what you see there?
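
While waiting for the deployment logs, the diagnostic events already quoted in the error message can be grouped to see what the scheduler reported. The following is only a minimal sketch in plain Python: the helper name `summarize_events` is hypothetical (it is not part of the Azure ML SDK), and the event lines are copied from the error above.

```python
import json
from collections import Counter

# Diagnostic event lines copied verbatim from the error message in the question.
events_text = '''
{"InvolvedObject":"simdev-757df4f999-rbcws","InvolvedKind":"Pod","Type":"Warning","Reason":"FailedScheduling","Message":"0/2 nodes are available: 2 Insufficient nvidia.com/gpu.","LastTimestamp":null}
{"InvolvedObject":"simdev-757df4f999-rbcws","InvolvedKind":"Pod","Type":"Warning","Reason":"FailedScheduling","Message":"0/2 nodes are available: 2 Insufficient nvidia.com/gpu.","LastTimestamp":null}
{"InvolvedObject":"simdev-757df4f999-rbcws","InvolvedKind":"Pod","Type":"Normal","Reason":"Scheduled","Message":"Successfully assigned azureml-train-aml-001-dev/simdev-757df4f999-rbcws to aks-agentpool-34690879-vmss000000","LastTimestamp":null}
'''


def summarize_events(text):
    """Hypothetical helper: count (Type, Reason, Message) triples among JSON event lines."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and the surrounding prose of the error
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that only look like JSON
        counts[(event.get("Type"), event.get("Reason"), event.get("Message"))] += 1
    return counts


summary = summarize_events(events_text)
for (etype, reason, message), n in summary.items():
    print(f"{n}x [{etype}] {reason}: {message}")
```

The repeated `FailedScheduling` warning with `Insufficient nvidia.com/gpu` suggests that, at the time of those events, neither of the two nodes had a free GPU for a pod requesting `gpu_cores=1`. That would be consistent with the deployment working at some times and timing out at others if other services on the same cluster hold the GPUs, but this is an interpretation of the log, not something it confirms.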