Optimize models using Automatic Model Tuning
Introduction
When training ML models, hyperparameter tuning is the step that searches for the best-performing model configuration. In this lab you will use the random search strategy of Automated Hyperparameter Tuning to train a BERT-based natural language processing (NLP) classifier. The model analyzes customer feedback and classifies the messages into positive (1), neutral (0), and negative (-1) sentiments.
Amazon SageMaker supports Automated Hyperparameter Tuning. It runs multiple training jobs on the training dataset using the hyperparameter ranges specified by the user. Then it chooses the combination of hyperparameters that leads to the best model candidate. The choice is made based on the objective metrics, e.g. maximization of the validation accuracy.
To choose hyperparameter combinations, SageMaker supports two types of tuning strategies: random and Bayesian. This capability can be further extended by providing an implementation of a custom tuning strategy as a Docker container.
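In this lab you will perform the following three steps:
- Configure the dataset and the Hyperparameter Tuning Job
- Run the Tuning Job
- Evaluate the results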
First, let's install and import the required modules.
# please ignore warning messages during the installation
!pip install --disable-pip-version-check -q sagemaker==2.35.0
!conda install -q -y pytorch==1.6.0 -c pytorch
!pip install --disable-pip-version-check -q transformers==3.5.1
import boto3
import sagemaker
import pandas as pd
import botocore
config = botocore.config.Config(user_agent_extra='dlai-pds/c3/w1')
# low-level service client of the boto3 session
sm = boto3.client(service_name='sagemaker',
config=config)
sess = sagemaker.Session(sagemaker_client=sm)
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = sess.boto_region_name
1. Configure dataset and Hyperparameter Tuning Job (HPT)
1.1. Configure dataset
Let's set up the paths and copy the data to the S3 bucket:
processed_train_data_s3_uri = 's3://{}/transformed/data/sentiment-train/'.format(bucket)
processed_validation_data_s3_uri = 's3://{}/transformed/data/sentiment-validation/'.format(bucket)
processed_test_data_s3_uri = 's3://{}/transformed/data/sentiment-test/'.format(bucket)
Upload the data to the S3 bucket:
!aws s3 cp --recursive ./data/sentiment-train $processed_train_data_s3_uri
!aws s3 cp --recursive ./data/sentiment-validation $processed_validation_data_s3_uri
!aws s3 cp --recursive ./data/sentiment-test $processed_test_data_s3_uri
upload: data/sentiment-train/part-algo-1-womens_clothing_ecommerce_reviews.tsv to s3://sagemaker-us-east-1-610140168408/transformed/data/sentiment-train/part-algo-1-womens_clothing_ecommerce_reviews.tsv
upload: data/sentiment-validation/part-algo-1-womens_clothing_ecommerce_reviews.tsv to s3://sagemaker-us-east-1-610140168408/transformed/data/sentiment-validation/part-algo-1-womens_clothing_ecommerce_reviews.tsv
upload: data/sentiment-test/part-algo-1-womens_clothing_ecommerce_reviews.tsv to s3://sagemaker-us-east-1-610140168408/transformed/data/sentiment-test/part-algo-1-womens_clothing_ecommerce_reviews.tsv
Check the existence of those files in the S3 bucket:
!aws s3 ls --recursive $processed_train_data_s3_uri
2023-06-12 02:35:27 4894416 transformed/data/sentiment-train/part-algo-1-womens_clothing_ecommerce_reviews.tsv
!aws s3 ls --recursive $processed_validation_data_s3_uri
2023-06-12 02:35:28 276522 transformed/data/sentiment-validation/part-algo-1-womens_clothing_ecommerce_reviews.tsv
!aws s3 ls --recursive $processed_test_data_s3_uri
2023-06-12 02:35:29 273414 transformed/data/sentiment-test/part-algo-1-womens_clothing_ecommerce_reviews.tsv
Exercise 1
Set up a dictionary of the input training and validation data channels, wrapping the corresponding S3 locations in a TrainingInput object.
Instructions: Pass the S3 input paths for training and validation data into the TrainingInput function
TrainingInput(s3_data=...)
to construct the Amazon SageMaker channels for S3 input data sources. Then put the corresponding channels into the dictionary.
from sagemaker.inputs import TrainingInput
data_channels = {
### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
'train': TrainingInput(s3_data=processed_train_data_s3_uri), # Replace None
'validation': TrainingInput(s3_data=processed_validation_data_s3_uri) # Replace None
### END SOLUTION - DO NOT delete this comment for grading purposes
}
There is no need to create a test data channel: the test data is used later, at the evaluation stage, and does not need to be wrapped in a sagemaker.inputs.TrainingInput object.
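Inside the training container, each channel in the dictionary becomes a local directory. In the SageMaker framework containers the directory is also exposed through an environment variable named SM_CHANNEL_ followed by the channel name in upper case. The following is a minimal sketch of how a training script might resolve those directories; the helper name channel_dir and the fake environment are illustrative assumptions, not code taken from src/train.py.

```python
import os


def channel_dir(channel_name, environ=None):
    """Return the local directory of a SageMaker input channel.

    Looks up SM_CHANNEL_<NAME> first and falls back to the
    conventional /opt/ml/input/data/<name> location.
    """
    environ = os.environ if environ is None else environ
    env_var = "SM_CHANNEL_{}".format(channel_name.upper())
    default = "/opt/ml/input/data/{}".format(channel_name)
    return environ.get(env_var, default)


# Hypothetical environment, standing in for the one inside the container.
fake_env = {"SM_CHANNEL_TRAIN": "/opt/ml/input/data/train"}

train_dir = channel_dir("train", fake_env)            # found in the environment
validation_dir = channel_dir("validation", fake_env)  # falls back to the default
print(train_dir)
print(validation_dir)
```

The keys of data_channels ('train' and 'validation') are what determine the channel names, so renaming a key also changes the directory the script has to read from.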
1.2. Configure Hyperparameter Tuning Job
Model hyperparameters need to be set prior to starting the model training as they control the process of learning. Some of the hyperparameters you will set up as static - they will not be explored during the tuning job. For the non-static hyperparameters you will set the range of possible values to be explored.
First, configure static hyperparameters including the instance type, instance count, maximum sequence length, etc. For the purposes of this lab, you will use a relatively small instance type. Please refer to this link for additional instance types that may work for your use cases outside of this lab.
max_seq_length=128 # maximum number of input tokens passed to BERT model
freeze_bert_layer=False # specifies the depth of training within the network
epochs=3
train_steps_per_epoch=50
validation_batch_size=64
validation_steps_per_epoch=50
seed=42
train_instance_count=1
train_instance_type='ml.c5.9xlarge'
train_volume_size=256
input_mode='File'
run_validation=True
Some of these will be passed into the PyTorch estimator and tuner in the hyperparameters argument. Let's set up the dictionary for that:
hyperparameters_static={
'freeze_bert_layer': freeze_bert_layer,
'max_seq_length': max_seq_length,
'epochs': epochs,
'train_steps_per_epoch': train_steps_per_epoch,
'validation_batch_size': validation_batch_size,
'validation_steps_per_epoch': validation_steps_per_epoch,
'seed': seed,
'run_validation': run_validation
}
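In script mode, the entries of the hyperparameters dictionary reach the entry point script as command-line arguments of the form --name value. The sketch below shows one way a script could parse a few of the static hyperparameters with argparse. It is a hypothetical illustration: the actual parsing logic in src/train.py may differ, and the str_to_bool helper is an assumption introduced here because boolean values arrive as strings.

```python
import argparse


def str_to_bool(value):
    """Interpret the strings 'True' / 'False' passed on the command line."""
    return str(value).strip().lower() in ("true", "1", "yes")


def parse_args(argv):
    """Parse a subset of the static hyperparameters (illustrative only)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_seq_length", type=int, default=128)
    parser.add_argument("--freeze_bert_layer", type=str_to_bool, default=False)
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--run_validation", type=str_to_bool, default=True)
    return parser.parse_args(argv)


# Simulated command line, as the training container might pass it.
args = parse_args([
    "--epochs", "3",
    "--freeze_bert_layer", "False",
    "--run_validation", "True",
])
print(args.epochs, args.freeze_bert_layer, args.run_validation, args.max_seq_length)
```

Using a plain type=bool would treat the string "False" as truthy, which is why the sketch converts booleans explicitly.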
Configure hyperparameter ranges to explore in the Tuning Job. The values of the ranges typically come from prior experience, research papers, or other models similar to the task you are trying to do.
from sagemaker.tuner import IntegerParameter
from sagemaker.tuner import ContinuousParameter
from sagemaker.tuner import CategoricalParameter
hyperparameter_ranges = {
'learning_rate': ContinuousParameter(0.00001, 0.00005, scaling_type='Linear'), # specifying continuous variable type, the tuning job will explore the range of values
'train_batch_size': CategoricalParameter([128, 256]), # specifying categorical variable type, the tuning job will explore only listed values
}
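To build intuition for what a random strategy does with these ranges, the sketch below draws hyperparameter combinations locally: a uniform sample for the continuous range and a random pick for the categorical one. It is a simplified illustration under stated assumptions, not SageMaker's implementation; the function name and the fixed seed are hypothetical.

```python
import random


def sample_random_configuration(rng, lr_min, lr_max, batch_sizes):
    """Draw one hyperparameter combination at random."""
    return {
        # Continuous range with linear scaling: uniform sample.
        "learning_rate": rng.uniform(lr_min, lr_max),
        # Categorical range: pick one of the listed values.
        "train_batch_size": rng.choice(batch_sizes),
    }


rng = random.Random(42)  # seeded only to make the sketch repeatable
candidates = [
    sample_random_configuration(rng, 0.00001, 0.00005, [128, 256])
    for _ in range(2)  # two draws, mirroring a tuning job with two training jobs
]
for candidate in candidates:
    print(candidate)
```

Because each draw is independent of the others, random search can run all of its training jobs in parallel without waiting for earlier results.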
1.3. Set up evaluation metrics
Choose loss and accuracy as the evaluation metrics. The regular expression in each Regex field captures the value of the corresponding metric from the logs that the algorithm emits.
metric_definitions = [
{'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9.]+)'},
{'Name': 'validation:accuracy', 'Regex': 'val_acc: ([0-9.]+)'},
]
For example, these sample log lines...
[step: 100] val_loss: 0.76 - val_acc: 70.92%
...will produce the following metrics in CloudWatch:
validation:loss = 0.76
validation:accuracy = 70.92
In the Tuning Job, you will be maximizing validation accuracy as the objective metric.
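The metric regular expressions can be checked locally before launching any job. The sketch below applies them to the sample log line with Python's re module; the helper extract_metrics is a hypothetical name used for illustration and is not part of the SageMaker SDK.

```python
import re

metric_definitions = [
    {"Name": "validation:loss", "Regex": "val_loss: ([0-9.]+)"},
    {"Name": "validation:accuracy", "Regex": "val_acc: ([0-9.]+)"},
]


def extract_metrics(log_line, definitions):
    """Apply each regex to a log line and collect the captured values."""
    metrics = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            # The first capture group holds the numeric value.
            metrics[definition["Name"]] = float(match.group(1))
    return metrics


sample_line = "[step: 100] val_loss: 0.76 - val_acc: 70.92%"
extracted = extract_metrics(sample_line, metric_definitions)
print(extracted)
```

Note that the character class [0-9.] also matches a trailing period, so a log line that ends a number with a full stop would capture it as part of the value.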
2. Run Tuning Job
2.1. Set up the RoBERTa and PyTorch script to run on SageMaker
Prepare the PyTorch model to run as a SageMaker Training Job. The estimator takes a separate Python file as its entry point, which is called during training. You can open and review this file: src/train.py.
For more information on the PyTorchEstimator, see the documentation here: https://sagemaker.readthedocs.io/
from sagemaker.pytorch import PyTorch as PyTorchEstimator
# Note: renaming the PyTorch estimator on import is optional,
# but it helps code clarity, especially when several classes from 'sagemaker.pytorch' are used
estimator = PyTorchEstimator(
entry_point='train.py',
source_dir='src',
role=role,
instance_count=train_instance_count,
instance_type=train_instance_type,
volume_size=train_volume_size,
py_version='py3',
framework_version='1.6.0',
hyperparameters=hyperparameters_static,
metric_definitions=metric_definitions,
input_mode=input_mode,
)
2.2. Launch the Hyperparameter Tuning Job
A hyperparameter tuning job runs a series of training jobs that each test a combination of hyperparameters for a given objective metric (i.e. validation:accuracy). In this lab, you will use a Random search strategy to determine the combinations of hyperparameters - within the specified ranges - to use for each training job within the tuning job. For more information on hyperparameter tuning search strategies, please see the following documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html
When the tuning job completes, you can select the hyperparameters used by the best-performing training job relative to the objective metric.
The max_jobs parameter is a stopping criterion that limits the total number of training jobs (and therefore hyperparameter combinations) to run within the tuning job.
The max_parallel_jobs parameter limits the number of training jobs (and therefore hyperparameter combinations) to run in parallel within the tuning job. This parameter is often used in combination with the Bayesian search strategy when you want to test a smaller set of training jobs (fewer than max_jobs), learn from that smaller set, and then apply Bayesian methods to determine the hyperparameters used by the next set of training jobs. Bayesian methods can improve hyperparameter-tuning performance in some cases.
The early_stopping_type parameter is used by SageMaker hyperparameter tuning jobs to automatically stop a training job if the job is not improving the objective metric (i.e. validation:accuracy) relative to previous training jobs within the tuning job. For more information on early stopping, please see the following documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-early-stopping.html
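To make the early stopping idea concrete, the sketch below implements a simplified median rule: a running job is stopped when its objective value at a given step is worse than the median of the values that earlier jobs reported at the same step. This is an illustrative assumption about how such a rule can work, not SageMaker's actual implementation; the function name and the sample numbers are hypothetical.

```python
from statistics import median


def should_stop_early(current_value, previous_values_at_step, maximize=True):
    """Return True if the current job looks worse than the median of earlier jobs."""
    if not previous_values_at_step:
        # Nothing to compare against yet, so keep training.
        return False
    reference = median(previous_values_at_step)
    if maximize:
        return current_value < reference
    return current_value > reference


# Hypothetical validation accuracies of earlier jobs at the same training step.
previous_accuracy_at_step = [68.0, 70.5, 71.2]

stop_weak_job = should_stop_early(62.3, previous_accuracy_at_step)
stop_strong_job = should_stop_early(71.0, previous_accuracy_at_step)
print(stop_weak_job, stop_strong_job)
```

A rule of this kind only becomes useful once some jobs have reported metrics, which is one reason early stopping has little effect when every job in the tuning job runs in parallel from the start.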
Exercise 2
Set up the Hyperparameter Tuner.
Instructions: Use the function HyperparameterTuner, passing the variables defined above. Please use tuning strategy 'Random'.
tuner = HyperparameterTuner(
estimator=..., # estimator
hyperparameter_ranges=..., # hyperparameter ranges
metric_definitions=..., # definition metric
strategy='...', # tuning strategy
objective_type='Maximize',
objective_metric_name='validation:accuracy',
max_jobs=2, # maximum number of jobs to run
max_parallel_jobs=2, # maximum number of jobs to run in parallel
early_stopping_type='Auto' # early stopping criteria
)
from sagemaker.tuner import HyperparameterTuner
tuner = HyperparameterTuner(
### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
estimator=estimator, # Replace None
hyperparameter_ranges=hyperparameter_ranges, # Replace None
metric_definitions=metric_definitions, # Replace None
strategy='Random', # Replace None
### END SOLUTION - DO NOT delete this comment for grading purposes
objective_type='Maximize',
objective_metric_name='validation:accuracy',
max_jobs=2, # maximum number of jobs to run
max_parallel_jobs=2, # maximum number of jobs to run in parallel
early_stopping_type='Auto' # early stopping criteria
)
Exercise 3
Launch the SageMaker Hyper-Parameter Tuning (HPT) Job.
Instructions: Use the tuner.fit function, passing the configured train and validation inputs (data channels).
tuner.fit(
inputs=..., # train and validation input
include_cls_metadata=False, # to be set as false if the algorithm cannot handle unknown hyperparameters
wait=False # do not wait for the job to complete before continuing
)
tuner.fit(
### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
inputs=data_channels, # Replace None
### END SOLUTION - DO NOT delete this comment for grading purposes
include_cls_metadata=False,
wait=False
)
2.3. Check Tuning Job status
You can see the Tuning Job status in the console. Let's get the Tuning Job name to construct the link.
tuning_job_name = tuner.latest_tuning_job.job_name
print(tuning_job_name)
pytorch-training-230612-0238
Check the status of the Tuning Job.
from IPython.display import display, HTML
display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/hyper-tuning-jobs/{}">Hyper-Parameter Tuning Job</a></b>'.format(region, tuning_job_name)))
Review Hyper-Parameter Tuning Job
Wait for the Tuning Job to complete.
This cell will take approximately 20-30 minutes to run.
%%time
tuner.wait()
.................................................................................................................................................................................................................................................................................................................................................!
CPU times: user 1.55 s, sys: 160 ms, total: 1.71 s
Wall time: 28min 54s
Wait until the ^^ Tuning Job ^^ completes above
The results of the SageMaker Hyperparameter Tuning Job are available through the analytics of the tuner object. The dataframe function converts the results directly into a pandas dataframe. You can explore the results with the following lines of code:
import time
time.sleep(10) # slight delay to allow the analytics to be calculated
df_results = tuner.analytics().dataframe()
df_results.shape
(2, 8)
df_results.sort_values('FinalObjectiveValue', ascending=0)
| | learning_rate | train_batch_size | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.000022 | "128" | pytorch-training-230612-0238-001-e2bd7fda | Completed | 71.480003 | 2023-06-12 02:39:29+00:00 | 2023-06-12 03:03:36+00:00 | 1447.0 |
| 0 | 0.000025 | "256" | pytorch-training-230612-0238-002-0346b365 | Completed | 68.750000 | 2023-06-12 02:40:05+00:00 | 2023-06-12 03:04:11+00:00 | 1446.0 |
When training and tuning at scale, it is important to continuously monitor and use the right compute resources. While you have the flexibility of choosing different compute options, how do you choose the specific instance types and sizes to use? There is no standard answer. It comes down to understanding the workload and running empirical tests to determine the best compute resources for the training.
SageMaker Training Jobs emit CloudWatch metrics for resource utilization. You can review them in the AWS console:
- open the link
- notice that you are in the section Amazon SageMaker -> Hyperparameter tuning jobs
- have a look at the list of the Training jobs below and click on one of them
- scroll down to the Monitor section and review the available metrics
from IPython.display import display, HTML
display(HTML('<b>Review Training Jobs of the <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/hyper-tuning-jobs/{}">Hyper-Parameter Tuning Job</a></b>'.format(region, tuning_job_name)))
Review Training Jobs of the Hyper-Parameter Tuning Job
3. Evaluate the results
An important part of developing a model is evaluating it with a test data set - one that the model has never seen during its training process. The final metrics resulting from this evaluation can be used to compare competing machine learning models. For metrics such as accuracy, a higher value on the test set indicates a model that generalizes better.
3.1. Show the best candidate
Exercise 4
Show the best candidate - the one with the highest accuracy result.
Instructions: Use the sort_values function to sort the results by accuracy, which is stored in the column FinalObjectiveValue. Put ascending=0 and head(1) for the selection.
df_results.sort_values(
'...', # column name for sorting
ascending=0).head(1)
df_results.sort_values(
### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
'FinalObjectiveValue', # Replace None
### END SOLUTION - DO NOT delete this comment for grading purposes
ascending=0).head(1)
| | learning_rate | train_batch_size | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.000022 | "128" | pytorch-training-230612-0238-001-e2bd7fda | Completed | 71.480003 | 2023-06-12 02:39:29+00:00 | 2023-06-12 03:03:36+00:00 | 1447.0 |
3.2. Evaluate the best candidate
Let's pull the information about the best candidate from the dataframe and then take the Training Job name from the column TrainingJobName.
best_candidate = df_results.sort_values('FinalObjectiveValue', ascending=0).iloc[0]
best_candidate_training_job_name = best_candidate['TrainingJobName']
print('Best candidate Training Job name: {}'.format(best_candidate_training_job_name))
Best candidate Training Job name: pytorch-training-230612-0238-001-e2bd7fda
Exercise 5
Show accuracy result for the best candidate.
Instructions: Use the example in the cell above.
### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
best_candidate_accuracy = best_candidate['FinalObjectiveValue'] # Replace all None
### END SOLUTION - DO NOT delete this comment for grading purposes
print('Best candidate accuracy result: {}'.format(best_candidate_accuracy))
Best candidate accuracy result: 71.4800033569336
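The selection logic behind sort_values(...).iloc[0] is simply "take the row with the largest objective value". As a minimal sketch, the same choice can be written in plain Python over a list of records; the two records below copy the job names and objective values from the results table above, and the variable names are illustrative.

```python
# Two tuning results, reduced to the fields needed for the selection.
results = [
    {"TrainingJobName": "pytorch-training-230612-0238-002-0346b365",
     "FinalObjectiveValue": 68.750000},
    {"TrainingJobName": "pytorch-training-230612-0238-001-e2bd7fda",
     "FinalObjectiveValue": 71.480003},
]

# The objective type is 'Maximize', so the best candidate has the largest value.
best = max(results, key=lambda row: row["FinalObjectiveValue"])

print("Best candidate Training Job name: {}".format(best["TrainingJobName"]))
print("Best candidate accuracy result: {}".format(best["FinalObjectiveValue"]))
```

For an objective that should be minimized, such as validation loss, min would take the place of max.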
You can use the function describe_training_job of the service client to get some more information about the best candidate. The result is in dictionary format. Let's check that it has the same Training Job name:
best_candidate_description = sm.describe_training_job(TrainingJobName=best_candidate_training_job_name)
best_candidate_training_job_name2 = best_candidate_description['TrainingJobName']
print('Training Job name: {}'.format(best_candidate_training_job_name2))
Training Job name: pytorch-training-230612-0238-001-e2bd7fda
Exercise 6
Pull the Tuning Job and Training Job Amazon Resource Name (ARN) from the best candidate training job description.
Instructions: Print the keys of the best candidate Training Job description dictionary, choose the ones related to the Tuning Job and Training Job ARN and print their values.
print(best_candidate_description.keys())
dict_keys(['TrainingJobName', 'TrainingJobArn', 'TuningJobArn', 'ModelArtifacts', 'TrainingJobStatus', 'SecondaryStatus', 'HyperParameters', 'AlgorithmSpecification', 'RoleArn', 'InputDataConfig', 'OutputDataConfig', 'ResourceConfig', 'StoppingCondition', 'CreationTime', 'TrainingStartTime', 'TrainingEndTime', 'LastModifiedTime', 'SecondaryStatusTransitions', 'FinalMetricDataList', 'EnableNetworkIsolation', 'EnableInterContainerTrafficEncryption', 'EnableManagedSpotTraining', 'TrainingTimeInSeconds', 'BillableTimeInSeconds', 'ProfilingStatus', 'WarmPoolStatus', 'ResponseMetadata'])
### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
best_candidate_tuning_job_arn = best_candidate_description['TuningJobArn'] # Replace None
best_candidate_training_job_arn = best_candidate_description['TrainingJobArn'] # Replace None
### END SOLUTION - DO NOT delete this comment for grading purposes
print('Best candidate Tuning Job ARN: {}'.format(best_candidate_tuning_job_arn))
print('Best candidate Training Job ARN: {}'.format(best_candidate_training_job_arn))
Best candidate Tuning Job ARN: arn:aws:sagemaker:us-east-1:610140168408:hyper-parameter-tuning-job/pytorch-training-230612-0238
Best candidate Training Job ARN: arn:aws:sagemaker:us-east-1:610140168408:training-job/pytorch-training-230612-0238-001-e2bd7fda
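An ARN is a colon-separated string of the form arn:partition:service:region:account-id:resource. The sketch below splits the Training Job ARN printed above into its parts; parse_arn is a hypothetical helper written for illustration, not a boto3 function.

```python
def parse_arn(arn):
    """Split an ARN into its colon-separated components."""
    # Limit the split so that colons inside the resource part are preserved.
    prefix, partition, service, region, account_id, resource = arn.split(":", 5)
    if prefix != "arn":
        raise ValueError("Not an ARN: {}".format(arn))
    # For SageMaker jobs the resource looks like '<type>/<name>'.
    resource_type, _, resource_name = resource.partition("/")
    return {
        "partition": partition,
        "service": service,
        "region": region,
        "account_id": account_id,
        "resource_type": resource_type,
        "resource_name": resource_name,
    }


training_job_arn = (
    "arn:aws:sagemaker:us-east-1:610140168408:"
    "training-job/pytorch-training-230612-0238-001-e2bd7fda"
)
parsed_arn = parse_arn(training_job_arn)
print(parsed_arn["resource_type"], parsed_arn["resource_name"])
```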
Pull the path of the best candidate model in the S3 bucket. You will need it later to set up the Processing Job for the evaluation.
model_tar_s3_uri = sm.describe_training_job(TrainingJobName=best_candidate_training_job_name)['ModelArtifacts']['S3ModelArtifacts']
print(model_tar_s3_uri)
s3://sagemaker-us-east-1-610140168408/pytorch-training-230612-0238-001-e2bd7fda/output/model.tar.gz
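When a later step needs the bucket name and the object key separately, the S3 URI can be split with the standard library. This is a minimal sketch; split_s3_uri is a hypothetical helper name, and the URI is the one printed above.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Return (bucket, key) for a URI of the form s3://bucket/key."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3":
        raise ValueError("Expected an s3:// URI, got: {}".format(s3_uri))
    # netloc holds the bucket; the path holds the key with a leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


model_uri = (
    "s3://sagemaker-us-east-1-610140168408/"
    "pytorch-training-230612-0238-001-e2bd7fda/output/model.tar.gz"
)
model_bucket, model_key = split_s3_uri(model_uri)
print(model_bucket)
print(model_key)
```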
To perform model evaluation you will use a scikit-learn-based Processing Job. This is essentially a generic Python Processing Job with scikit-learn pre-installed. You can specify the version of scikit-learn you wish to use. Also pass the SageMaker execution role, processing instance type and instance count.
from sagemaker.sklearn.processing import SKLearnProcessor
processing_instance_type = "ml.c5.2xlarge"
processing_instance_count = 1
processor = SKLearnProcessor(
framework_version="0.23-1",
role=role,
instance_type=processing_instance_type,
instance_count=processing_instance_count,
max_runtime_in_seconds=7200,
)
The model evaluation Processing Job will be running the Python code from the file src/evaluate_model_metrics.py. You can open and review the file.
Launch the Processing Job, passing the parameters defined above, the custom script, the path of the model, and the S3 location of the test data.
from sagemaker.processing import ProcessingInput, ProcessingOutput
processor.run(
code="src/evaluate_model_metrics.py",
inputs=[
ProcessingInput(
input_name="model-tar-s3-uri",
source=model_tar_s3_uri,
destination="/opt/ml/processing/input/model/"
),
ProcessingInput(
input_name="evaluation-data-s3-uri",
source=processed_test_data_s3_uri,
destination="/opt/ml/processing/input/data/",
),
],
outputs=[
ProcessingOutput(s3_upload_mode="EndOfJob", output_name="metrics", source="/opt/ml/processing/output/metrics"),
],
arguments=["--max-seq-length", str(max_seq_length)],
logs=True,
wait=False,
)
Job Name: sagemaker-scikit-learn-2023-06-12-03-14-44-443
Inputs: [{'InputName': 'model-tar-s3-uri', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-610140168408/pytorch-training-230612-0238-001-e2bd7fda/output/model.tar.gz', 'LocalPath': '/opt/ml/processing/input/model/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'evaluation-data-s3-uri', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-610140168408/transformed/data/sentiment-test/', 'LocalPath': '/opt/ml/processing/input/data/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-610140168408/sagemaker-scikit-learn-2023-06-12-03-14-44-443/input/code/evaluate_model_metrics.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs: [{'OutputName': 'metrics', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-610140168408/sagemaker-scikit-learn-2023-06-12-03-14-44-443/output/metrics', 'LocalPath': '/opt/ml/processing/output/metrics', 'S3UploadMode': 'EndOfJob'}}]
You can see the information about the Processing Jobs using the describe function. The result is in dictionary format. Let's pull the Processing Job name:
scikit_processing_job_name = processor.jobs[-1].describe()["ProcessingJobName"]
print('Processing Job name: {}'.format(scikit_processing_job_name))
Processing Job name: sagemaker-scikit-learn-2023-06-12-03-14-44-443
Exercise 7
Pull the Processing Job status from the Processing Job description.
Instructions: Print the keys of the Processing Job description dictionary, choose the one related to the status of the Processing Job and print the value of it.
print(processor.jobs[-1].describe().keys())
dict_keys(['ProcessingInputs', 'ProcessingOutputConfig', 'ProcessingJobName', 'ProcessingResources', 'StoppingCondition', 'AppSpecification', 'RoleArn', 'ProcessingJobArn', 'ProcessingJobStatus', 'LastModifiedTime', 'CreationTime', 'ResponseMetadata'])
### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
scikit_processing_job_status = processor.jobs[-1].describe()['ProcessingJobStatus'] # Replace None
### END SOLUTION - DO NOT delete this comment for grading purposes
print('Processing job status: {}'.format(scikit_processing_job_status))
Processing job status: InProgress
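A status of InProgress means the job has not reached a terminal state yet. One way to follow it programmatically is to poll the description until the status becomes Completed, Failed, or Stopped. The sketch below shows such a loop; the function name, the polling interval, and the fake responses are assumptions made for illustration, with the describe call injected so that the loop can be tried without an AWS connection.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_processing_job(describe_fn, poll_seconds=30, max_polls=120,
                            sleep_fn=time.sleep):
    """Poll describe_fn until the job reaches a terminal status."""
    for _ in range(max_polls):
        status = describe_fn()["ProcessingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep_fn(poll_seconds)
    raise TimeoutError("Processing job did not finish within the polling budget")


# Fake describe() responses standing in for a real Processing Job.
fake_responses = iter([
    {"ProcessingJobStatus": "InProgress"},
    {"ProcessingJobStatus": "InProgress"},
    {"ProcessingJobStatus": "Completed"},
])

final_status = wait_for_processing_job(
    lambda: next(fake_responses),
    poll_seconds=0,
    sleep_fn=lambda seconds: None,  # skip real sleeping in this dry run
)
print(final_status)
```

In the notebook, describe_fn could be processor.jobs[-1].describe, although the SDK's own wait method already covers the common case.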
Review the created Processing Job in the AWS console.
Instructions:
- open the link
- notice that you are in the section Amazon SageMaker -> Processing Jobs
- check the name of the Processing Job, its status and other available information
from IPython.display import display, HTML
display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}">Processing Job</a></b>'.format(
            region, scikit_processing_job_name
        )
    )
)
Review Processing Job
Wait for about 5 minutes to review the CloudWatch Logs. You may open the file src/evaluate_model_metrics.py again and examine the outputs of the code in the CloudWatch Logs.
from IPython.display import display, HTML
display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> after about 5 minutes</b>'.format(
            region, scikit_processing_job_name
        )
    )
)
Review CloudWatch Logs after about 5 minutes
After the completion of the Processing Job you can also review the output in the S3 bucket.
from IPython.display import display, HTML
display(
    HTML(
        '<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 output data</a> after the Processing Job has completed</b>'.format(
            bucket, scikit_processing_job_name, region
        )
    )
)
Review S3 output data after the Processing Job has completed
Monitor the Processing Job:
from pprint import pprint
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(
processing_job_name=scikit_processing_job_name, sagemaker_session=sess
)
processing_job_description = running_processor.describe()
pprint(processing_job_description)
{'AppSpecification': {'ContainerArguments': ['--max-seq-length', '128'],
'ContainerEntrypoint': ['python3',
'/opt/ml/processing/input/code/evaluate_model_metrics.py'],
'ImageUri': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3'},
'CreationTime': datetime.datetime(2023, 6, 12, 3, 14, 44, 964000, tzinfo=tzlocal()),
'LastModifiedTime': datetime.datetime(2023, 6, 12, 3, 18, 26, 397000, tzinfo=tzlocal()),
'ProcessingInputs': [{'AppManaged': False,
'InputName': 'model-tar-s3-uri',
'S3Input': {'LocalPath': '/opt/ml/processing/input/model/',
'S3CompressionType': 'None',
'S3DataDistributionType': 'FullyReplicated',
'S3DataType': 'S3Prefix',
'S3InputMode': 'File',
'S3Uri': 's3://sagemaker-us-east-1-610140168408/pytorch-training-230612-0238-001-e2bd7fda/output/model.tar.gz'}},
{'AppManaged': False,
'InputName': 'evaluation-data-s3-uri',
'S3Input': {'LocalPath': '/opt/ml/processing/input/data/',
'S3CompressionType': 'None',
'S3DataDistributionType': 'FullyReplicated',
'S3DataType': 'S3Prefix',
'S3InputMode': 'File',
'S3Uri': 's3://sagemaker-us-east-1-610140168408/transformed/data/sentiment-test/'}},
{'AppManaged': False,
'InputName': 'code',
'S3Input': {'LocalPath': '/opt/ml/processing/input/code',
'S3CompressionType': 'None',
'S3DataDistributionType': 'FullyReplicated',
'S3DataType': 'S3Prefix',
'S3InputMode': 'File',
'S3Uri': 's3://sagemaker-us-east-1-610140168408/sagemaker-scikit-learn-2023-06-12-03-14-44-443/input/code/evaluate_model_metrics.py'}}],
'ProcessingJobArn': 'arn:aws:sagemaker:us-east-1:610140168408:processing-job/sagemaker-scikit-learn-2023-06-12-03-14-44-443',
'ProcessingJobName': 'sagemaker-scikit-learn-2023-06-12-03-14-44-443',
'ProcessingJobStatus': 'InProgress',
'ProcessingOutputConfig': {'Outputs': [{'AppManaged': False,
'OutputName': 'metrics',
'S3Output': {'LocalPath': '/opt/ml/processing/output/metrics',
'S3UploadMode': 'EndOfJob',
'S3Uri': 's3://sagemaker-us-east-1-610140168408/sagemaker-scikit-learn-2023-06-12-03-14-44-443/output/metrics'}}]},
'ProcessingResources': {'ClusterConfig': {'InstanceCount': 1,
'InstanceType': 'ml.c5.2xlarge',
'VolumeSizeInGB': 30}},
'ProcessingStartTime': datetime.datetime(2023, 6, 12, 3, 18, 26, 394000, tzinfo=tzlocal()),
'ResponseMetadata': {'HTTPHeaders': {'content-length': '2367',
'content-type': 'application/x-amz-json-1.1',
'date': 'Mon, 12 Jun 2023 03:19:32 GMT',
'x-amzn-requestid': '617a6701-8bdd-4c06-9091-c8fefea9ed4f'},
'HTTPStatusCode': 200,
'RequestId': '617a6701-8bdd-4c06-9091-c8fefea9ed4f',
'RetryAttempts': 0},
'RoleArn': 'arn:aws:iam::610140168408:role/sagemaker-studio-vpc-firewall-us-east-1-sagemaker-execution-role',
'StoppingCondition': {'MaxRuntimeInSeconds': 7200}}
Wait for the Processing Job to complete.
This cell will take approximately 5-10 minutes to run.
%%time
running_processor.wait(logs=False)
.............!CPU times: user 43 ms, sys: 25.2 ms, total: 68.2 ms
Wall time: 1min 5s
Please wait until ^^ Processing Job ^^ completes above
3.3. Inspect the processed output data
Let's take a look at the results of the Processing Job. Get the S3 bucket location of the output metrics:
processing_job_description = running_processor.describe()
output_config = processing_job_description["ProcessingOutputConfig"]
for output in output_config["Outputs"]:
    if output["OutputName"] == "metrics":
        processed_metrics_s3_uri = output["S3Output"]["S3Uri"]
print(processed_metrics_s3_uri)
s3://sagemaker-us-east-1-610140168408/sagemaker-scikit-learn-2023-06-12-03-14-44-443/output/metrics
List the content of the folder:
!aws s3 ls $processed_metrics_s3_uri/
2023-06-12 03:20:39 19519 confusion_matrix.png
2023-06-12 03:20:39 56 evaluation.json
The test accuracy can be pulled from the evaluation.json file.
import json
from pprint import pprint
metrics_json = sagemaker.s3.S3Downloader.read_file("{}/evaluation.json".format(
processed_metrics_s3_uri
))
print('Test accuracy: {}'.format(json.loads(metrics_json)))
Test accuracy: {'metrics': {'accuracy': {'value': 0.7346278317152104}}}
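The printed value is the whole parsed dictionary rather than a single number. As a small sketch, the accuracy itself can be reached by following the nested keys; the JSON string below is an assumption that mirrors the structure of the output above rather than the downloaded file.

```python
import json

# Stand-in for the downloaded file content, mirroring the printed dictionary.
sample_metrics_json = '{"metrics": {"accuracy": {"value": 0.7346278317152104}}}'

# Follow the nested keys down to the accuracy value.
test_accuracy = json.loads(sample_metrics_json)["metrics"]["accuracy"]["value"]
print("Test accuracy: {:.2%}".format(test_accuracy))
```

Unlike the validation accuracy reported by the tuning job, which is on a 0-100 scale, this value is a fraction between 0 and 1.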
Copy the image with the confusion matrix generated during the model evaluation into the folder generated.
!aws s3 cp $processed_metrics_s3_uri/confusion_matrix.png ./generated/
import time
time.sleep(10) # Slight delay for our notebook to recognize the newly-downloaded file
download: s3://sagemaker-us-east-1-610140168408/sagemaker-scikit-learn-2023-06-12-03-14-44-443/output/metrics/confusion_matrix.png to generated/confusion_matrix.png
Show and review the confusion matrix, which is a table of all combinations of true (actual) and predicted labels. Each cell contains the number of the reviews for the corresponding sentiments. You can see that the highest numbers of the reviews appear in the diagonal cells, where the predicted sentiment equals the actual one.
%%html
<img src='./generated/confusion_matrix.png'>
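To see how such a matrix relates to accuracy, the sketch below builds a confusion matrix for the three sentiment labels from a handful of made-up predictions and derives the accuracy from its diagonal. The labels and predictions are hypothetical and have nothing to do with the results of this lab; the evaluation script itself may compute the matrix differently.

```python
from collections import Counter

LABELS = [-1, 0, 1]  # negative, neutral, positive


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Hypothetical true and predicted sentiments for eight reviews.
y_true = [-1, -1, 0, 0, 1, 1, 1, 0]
y_pred = [-1, 0, 0, 0, 1, 1, -1, 1]

matrix = confusion_matrix(y_true, y_pred, LABELS)

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(LABELS)))
accuracy = correct / len(y_true)

for label, row in zip(LABELS, matrix):
    print(label, row)
print("accuracy:", accuracy)
```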
Upload the notebook to the S3 bucket for grading purposes.
Note: you may need to click the "Save" button before the upload.
!aws s3 cp ./C3_W1_Assignment.ipynb s3://$bucket/C3_W1_Assignment_Learner.ipynb
Created: July 22, 2024