How to configure a HyperparameterTuning job using the SageMaker Pipelines Python SDK?

ben_jones · July 10, 2023, 6:32pm

Hello folks! Could anyone advise on the following, please?
I’m currently working on an AutoML pipeline built from steps with the SageMaker Pipelines Python SDK.
One of the steps is a HyperparameterTuning job that looks like this:

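```
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import HyperparameterTuner, IntegerParameter
from sagemaker.workflow.steps import TrainingStep, TuningStep

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",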
    region=pipeline_session.boto_region_name,
    version="1.0-1",
    py_version="py3",
    instance_type=instance_type,
)


tuner_hpo = HyperparameterTuner(
    estimator = Estimator(
        image_uri=image_uri,
        instance_type=instance_type,
        instance_count=1,
        output_path=estimator_path,
        role=role,
        sagemaker_session=pipeline_session,
        hyperparameters = {
            "eval_metric": "rmse",
            "objective": "reg:squarederror",
            "num_round": 10,
            "eta": 0.2,
        }
    ),
    objective_metric_name="validation:rmse",
    hyperparameter_ranges={
        "max_depth": IntegerParameter(10, 11),
    },
    objective_type="Minimize",
    max_jobs=2,
    max_parallel_jobs=2,
)

step_tuning = TuningStep(
    name="HPOTuning",
    tuner=tuner_hpo,
    inputs = {
        "train": TrainingInput(
            s3_data=step_preprocess_input_data.properties.ProcessingOutputConfig.Outputs['train'].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=step_preprocess_input_data.properties.ProcessingOutputConfig.Outputs['validation'].S3Output.S3Uri,
            content_type="text/csv",
        ),
    },
    #cache_config=cache_config
)
```
Afterwards, I want to train the best estimator from the tuning job on another dataset. My plan is to create another estimator and use a TrainingStep as follows:
```
best_estimator = Estimator(
    image_uri=image_uri,
    role=role,
    output_path=trained_model_path,
    instance_type=instance_type,
    instance_count=1,
    sagemaker_session=pipeline_session,
    hyperparameters=?,
    model_uri=?
)

training_step_args = best_estimator.fit(
    inputs={
        "train": TrainingInput(
            s3_data=step_preprocess_input_data.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,  # train -> full_history
            content_type="text/csv",
        ),
    }
)

step_train = TrainingStep(
    name="TrainBestEstimator",
    step_args=training_step_args
)
```
The problems are:
• model_uri doesn't accept a Join object as input, because it has no decode method
• model_uri doesn't accept a String object as input, because it has no decode method
• to set the hyperparameters explicitly, I need to get them from the tuning job somehow, and I don't see a way to do that yet.

harlanK · July 10, 2023, 7:23pm

I think you could try the following:

You can use the attach method of the HyperparameterTuner class to attach to the completed tuning job and get the best training job. Then, you can use the attach method of the Estimator class to attach to the best training job and get the hyperparameters and model_uri. Here is an example:

```
# Attach to the completed tuning job
tuner_hpo_attached = HyperparameterTuner.attach(tuner_hpo.latest_tuning_job.job_name, sagemaker_session=pipeline_session)

# Get the best training job
best_training_job_name = tuner_hpo_attached.best_training_job()

# Attach to the best training job
best_estimator = Estimator.attach(best_training_job_name)

# Get the hyperparameters and model_uri
hyperparameters = best_estimator.hyperparameters()
model_uri = best_estimator.model_data
```

Then you can pass these hyperparameters and the model_uri to a new estimator and use it in the TrainingStep. Note that model_uri is the S3 location of the model artifacts, and hyperparameters is a dictionary of the hyperparameters used in the best training job.
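
If going through Estimator.attach is not convenient, the same two values can also be read from the raw description of the best training job. Here is a minimal sketch of that, under a few assumptions: the description has the shape of a boto3 describe_training_job response, the helper name extract_best_job_settings and the sample job are made up for illustration, and keys starting with _tuning_ (such as _tuning_objective_metric, which as far as I know the tuning job adds) are the only ones you want to drop. Like the attach calls, this needs a tuning job that has already finished, so it runs after the pipeline execution and not while the pipeline is being defined.

```python
from typing import Any, Dict

# Assumption: only the keys injected by the tuning job should be filtered out.
INTERNAL_PREFIX = "_tuning_"


def extract_best_job_settings(description: Dict[str, Any]) -> Dict[str, Any]:
    """Return the hyperparameters and model artifact URI of a training job.

    `description` is expected to look like a boto3 `describe_training_job`
    response; only `HyperParameters` and `ModelArtifacts` are read.
    """
    raw = description.get("HyperParameters", {})
    hyperparameters = {
        name: value
        for name, value in raw.items()
        if not name.startswith(INTERNAL_PREFIX)
    }
    model_uri = description["ModelArtifacts"]["S3ModelArtifacts"]
    return {"hyperparameters": hyperparameters, "model_uri": model_uri}


# Hypothetical response, trimmed to the fields used above.
example_description = {
    "TrainingJobName": "example-tuning-job-001",
    "HyperParameters": {
        "_tuning_objective_metric": "validation:rmse",
        "eta": "0.2",
        "eval_metric": "rmse",
        "max_depth": "10",
        "num_round": "10",
        "objective": "reg:squarederror",
    },
    "ModelArtifacts": {
        "S3ModelArtifacts": "s3://example-bucket/example-tuning-job-001/output/model.tar.gz"
    },
}

settings = extract_best_job_settings(example_description)
print(settings["hyperparameters"])
print(settings["model_uri"])
```

In a real run the dictionary would come from the boto3 SageMaker client's describe_training_job(TrainingJobName=best_training_job_name). The hyperparameter values come back as strings, which is how the API stores them.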

Let me know how this goes!