#machine-learning #amazon-sagemaker
Question:
I have been struggling with SageMaker's built-in RCF (Random Cut Forest) algorithm for several days.
I would like to validate the model during training, but I may have misunderstood something.
The first fit, with only the training channel, works fine:
import sagemaker

# region, role, bucket, prefix and train_location are assumed to be defined earlier in the notebook.
# The third positional argument of image_uris.retrieve is the version, not the region, so it is dropped here.
container = sagemaker.image_uris.retrieve("randomcutforest", region)
print(container)

rcf = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    sagemaker_session=sagemaker.Session(),
    instance_type="ml.m4.xlarge",
    data_location=f"s3://{bucket}/{prefix}/",
    output_path=f"s3://{bucket}/{prefix}/output",
)
rcf.set_hyperparameters(
    feature_dim=116,
    eval_metrics="precision_recall_fscore",
    num_samples_per_tree=256,
    num_trees=100,
)
# Unlabeled training data, sharded across instances by S3 key.
train_data = sagemaker.inputs.TrainingInput(
    s3_data=train_location,
    content_type="text/csv;label_size=0",
    distribution="ShardedByS3Key",
)
rcf.fit({"train": train_data})
[06/28/2021 09:45:24 INFO 140226936620864] Test data is not provided.
#metrics {"StartTime": 1624873524.6154933, "EndTime": 1624873524.6156445, "Dimensions": {"Algorithm": "RandomCutForest", "Host": "algo-1", "Operation": "training"}, "Metrics": {"setuptime": {"sum": 40.169477462768555, "count": 1, "min": 40.169477462768555, "max": 40.169477462768555}, "totaltime": {"sum": 13035.491704940796, "count": 1, "min": 13035.491704940796, "max": 13035.491704940796}}}
2021-06-28 09:45:50 Completed - Training job completed
ProfilerReport-1624873226: NoIssuesFound
Training seconds: 78
Billable seconds: 78
But when I try to validate my model during training:
train_data = sagemaker.inputs.TrainingInput(
    s3_data=train_location,
    content_type="text/csv;label_size=0",
    distribution="ShardedByS3Key",
)
# Labeled data (label in the first column), fully replicated to every instance.
val_data = sagemaker.inputs.TrainingInput(
    s3_data=val_location,
    content_type="text/csv;label_size=1",
    distribution="FullyReplicated",
)
rcf.fit({"train": train_data, "validation": val_data}, wait=True)
I get an error:
AWS Region: us-east-1
RoleArn: arn:aws:iam::517714493426:role/service-role/AmazonSageMaker-ExecutionRole-20210409T152960
382416733822.dkr.ecr.us-east-1.amazonaws.com/randomcutforest:1
2021-06-28 10:14:12 Starting - Starting the training job...
2021-06-28 10:14:14 Starting - Launching requested ML instancesProfilerReport-1624875252: InProgress
......
2021-06-28 10:15:27 Starting - Preparing the instances for training.........
2021-06-28 10:17:07 Downloading - Downloading input data...
2021-06-28 10:17:27 Training - Downloading the training image..Docker entrypoint called with argument(s): train
Running default environment configuration script
[06/28/2021 10:17:53 INFO 140648505521984] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-conf.json: {'num_samples_per_tree': 256, 'num_trees': 100, 'force_dense': 'true', 'eval_metrics': ['accuracy', 'precision_recall_fscore'], 'epochs': 1, 'mini_batch_size': 1000, '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': '', '_ftp_port': 8999}
[06/28/2021 10:17:53 INFO 140648505521984] Merging with provided configuration from /opt/ml/input/config/hyperparameters.json: {'num_trees': '100', 'num_samples_per_tree': '256', 'feature_dim': '116', 'eval_metrics': 'precision_recall_fscore'}
[06/28/2021 10:17:53 INFO 140648505521984] Final configuration: {'num_samples_per_tree': '256', 'num_trees': '100', 'force_dense': 'true', 'eval_metrics': 'precision_recall_fscore', 'epochs': 1, 'mini_batch_size': 1000, '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': '', '_ftp_port': 8999, 'feature_dim': '116'}
[06/28/2021 10:17:53 ERROR 140648505521984] Customer Error: Unable to initialize the algorithm. Failed to validate input data configuration. (caused by ValidationError)
Caused by: Additional properties are not allowed ('validation' was unexpected)
Failed validating 'additionalProperties' in schema:
{'$schema': 'http://json-schema.org/draft-04/schema#',
'additionalProperties': False,
'definitions': {'data_channel_replicated': {'properties': {'ContentType': {'type': 'string'},
'RecordWrapperType': {'$ref': '#/definitions/record_wrapper_type'},
'S3DistributionType': {'$ref': '#/definitions/s3_replicated_type'},
'TrainingInputMode': {'$ref': '#/definitions/training_input_mode'}},
'type': 'object'},
'data_channel_sharded': {'properties': {'ContentType': {'type': 'string'},
'RecordWrapperType': {'$ref': '#/definitions/record_wrapper_type'},
'S3DistributionType': {'$ref': '#/definitions/s3_sharded_type'},
'TrainingInputMode': {'$ref': '#/definitions/training_input_mode'}},
'type': 'object'},
'record_wrapper_type': {'enum': ['None', 'Recordio'],
'type': 'string'},
's3_replicated_type': {'enum': ['FullyReplicated'],
'type': 'string'},
's3_sharded_type': {'enum': ['ShardedByS3Key'],
'type': 'string'},
'training_input_mode': {'enum': ['File', 'Pipe'],
'type': 'string'}},
'properties': {'state': {'$ref': '#/definitions/data_channel'},
'test': {'$ref': '#/definitions/data_channel_replicated'},
'train': {'$ref': '#/definitions/data_channel_sharded'}},
'required': ['train'],
'type': 'object'}
On instance:
{'train': {'ContentType': 'text/csv;label_size=0',
'RecordWrapperType': 'None',
'S3DistributionType': 'ShardedByS3Key',
'TrainingInputMode': 'File'},
'validation': {'ContentType': 'text/csv;label_size=1',
'RecordWrapperType': 'None',
'S3DistributionType': 'FullyReplicated',
'TrainingInputMode': 'File'}}
2021-06-28 10:18:10 Uploading - Uploading generated training model
2021-06-28 10:18:10 Failed - Training job failed
ProfilerReport-1624875252: Stopping
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
<ipython-input-34-c624ace00c69> in <module>
33
34
---> 35 rcf.fit({'train': train_data, 'validation': val_data}, wait=True)
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
680 self.jobs.append(self.latest_training_job)
681 if wait:
--> 682 self.latest_training_job.wait(logs=logs)
683
684 def _compilation_job_name(self):
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
1623 # If logs are requested, call logs_for_jobs.
1624 if logs != "None":
-> 1625 self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
1626 else:
1627 self.sagemaker_session.wait_for_job(self.job_name)
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
3679
3680 if wait:
-> 3681 self._check_job_status(job_name, description, "TrainingJobStatus")
3682 if dot:
3683 print()
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
3243 ),
3244 allowed_statuses=["Completed", "Stopped"],
-> 3245 actual_status=status,
3246 )
3247
UnexpectedStatusException: Error for Training job randomcutforest-2021-06-28-10-14-12-783: Failed. Reason: ClientError: Unable to initialize the algorithm. Failed to validate input data configuration. (caused by ValidationError)
Caused by: Additional properties are not allowed ('validation' was unexpected)
Failed validating 'additionalProperties' in schema:
{'$schema': 'http://json-schema.org/draft-04/schema#',
'additionalProperties': False,
'definitions': {'data_channel_replicated': {'properties': {'ContentType': {'type': 'string'},
'RecordWrapperType': {'$ref': '#/definitions/record_wrapper_type'},
'S3DistributionType': {'$ref': '#/definitions/s3_replicated_type'},
'TrainingInputMode': {'$ref': '#/definitions/training_input_mode'}},
'type': 'object'},
'data_channel_sharded': {'properties': {'ContentType': {'type': 'string'},
Can anyone help me with how to run this validation during training correctly?
That would be the best thing that could happen to me. 😀
Best regards, Christina
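Answer #1:
I found the error: instead of 'validation', you need to name the channel 'test'. The schema in the error message only allows the channels 'train', 'test' and 'state'. With that change it works:
rcf.fit({'train': train_data, 'test': test_data}, wait=True)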
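The channel check could also be done locally before a training job is launched. The following is only a minimal sketch: the allowed and required names are copied from the 'properties' and 'required' entries of the schema printed in the error above, and `check_rcf_channels` is a hypothetical helper, not part of the SageMaker SDK.

```python
# Channel names copied from the schema in the error message:
# 'properties' lists state, test and train; 'required' lists train.
ALLOWED_CHANNELS = {"train", "test", "state"}
REQUIRED_CHANNELS = {"train"}


def check_rcf_channels(inputs):
    """Hypothetical helper: raise ValueError if a channel name would be rejected.

    `inputs` is the dict that would be passed to rcf.fit(); it is returned
    unchanged when all channel names are acceptable.
    """
    names = set(inputs)
    unexpected = sorted(names - ALLOWED_CHANNELS)
    missing = sorted(REQUIRED_CHANNELS - names)
    if unexpected:
        raise ValueError(
            f"unexpected channel(s): {unexpected}; allowed: {sorted(ALLOWED_CHANNELS)}"
        )
    if missing:
        raise ValueError(f"missing required channel(s): {missing}")
    return inputs
```

It could then wrap the call from the question, for example rcf.fit(check_rcf_channels({'train': train_data, 'test': val_data}), wait=True), so that a misnamed channel fails immediately instead of after the instances have been provisioned.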