Skip to content

Detect data bias with Amazon SageMaker Clarify

Introduction

Bias can be present in your data before any model training occurs. Inspecting the dataset for bias can help detect collection gaps, inform your feature engineering, and understand societal biases the dataset may reflect. In this lab you will analyze bias on the dataset, generate and analyze bias report, and prepare the dataset for the model training.

First, let's install and import required modules.

# please ignore warning messages during the installation
!pip install --disable-pip-version-check -q sagemaker==2.35.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

import boto3
import sagemaker
import pandas as pd
import numpy as np
import botocore

config = botocore.config.Config(user_agent_extra='dlai-pds/c1/w2')

# low-level service client of the boto3 session
sm = boto3.client(service_name='sagemaker', 
                  config=config)

sess = sagemaker.Session(sagemaker_client=sm)

bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = sess.boto_region_name
Clean data in minutes
Automatically visualize data, and improve data quality in a few clicks. Learn more
Remind me later
Don't show again
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'

1. Analyze the dataset

1.1. Create a pandas data frame from the CSV file

Create a pandas dataframe from each of the product categories and concatenate them into one.

!aws s3 cp 's3://dlai-practical-data-science/data/transformed/womens_clothing_ecommerce_reviews_transformed.csv' ./
download: s3://dlai-practical-data-science/data/transformed/womens_clothing_ecommerce_reviews_transformed.csv to ./womens_clothing_ecommerce_reviews_transformed.csv
path = './womens_clothing_ecommerce_reviews_transformed.csv'

df = pd.read_csv(path)
df.head()
sentiment review_body product_category
0 1 If this product was in petite i would get the... Blouses
1 1 Love this dress! it's sooo pretty. i happene... Dresses
2 0 I had such high hopes for this dress and reall... Dresses
3 1 I love love love this jumpsuit. it's fun fl... Pants
4 1 This shirt is very flattering to all due to th... Blouses

As you saw in the previous lab, there are way more positive reviews than negative or neutral. Such a dataset is called unbalanced.

In this case, using a relatively small data subset you could visualize the occurring unbalances. At scale, you would need to perform bias analysis. Let's use this dataset as an example.

import seaborn as sns

sns.countplot(data=df, x='sentiment', hue='product_category')

plt.legend(loc='upper right',bbox_to_anchor=(1.3, 1.1))
<matplotlib.legend.Legend at 0x7fb430c7cd10>

png

1.2. Upload the dataset to S3 bucket

Upload the dataset to a private S3 bucket in a folder called bias/unbalanced.

data_s3_uri_unbalanced = sess.upload_data(bucket=bucket, 
                               key_prefix='bias/unbalanced', 
                               path='./womens_clothing_ecommerce_reviews_transformed.csv')
data_s3_uri_unbalanced
's3://sagemaker-us-east-1-505802170089/bias/unbalanced/womens_clothing_ecommerce_reviews_transformed.csv'

You can review the uploaded CSV file in the S3 bucket.

Instructions: - open the link - click on the S3 bucket name sagemaker-us-east-1-ACCOUNT - go to the folder bias/unbalanced - check the existence of the file womens_clothing_ecommerce_reviews_transformed.csv

from IPython.core.display import display, HTML

display(HTML('<b>Review <a target="top" href="https://s3.console.aws.amazon.com/s3/home?region={}#">Amazon S3 bucket</a></b>'.format(region)))

Review Amazon S3 bucket

2. Analyze class imbalance on the dataset with Amazon SageMaker Clarify

Let's analyze bias in sentiment with respect to the product_category facet on the dataset.

2.1. Configure a DataConfig

Information about the input data needs to be provided to the processor. This can be done with the DataConfig of the Clarify container. It stores information about the dataset to be analyzed, for example the dataset file, its format, headers and labels.

Exercise 1

Configure a DataConfig for Clarify.

Instructions: Use DataConfig to configure the target column ('sentiment' label), data input (data_s3_uri_unbalanced) and output paths (bias_report_unbalanced_output_path) with their formats (header names and the dataset type):

data_config_unbalanced = clarify.DataConfig(
    s3_data_input_path=..., # S3 object path containing the unbalanced dataset
    s3_output_path=..., # path to store the output
    label='...', # target column
    headers=df_unbalanced.columns.to_list(),
    dataset_type='text/csv'
)
from sagemaker import clarify

bias_report_unbalanced_output_path = 's3://{}/bias/generated_bias_report/unbalanced'.format(bucket)

data_config_unbalanced = clarify.DataConfig(
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    s3_data_input_path='s3://sagemaker-us-east-1-505802170089/bias/unbalanced/womens_clothing_ecommerce_reviews_transformed.csv', # Replace None
    s3_output_path=bias_report_unbalanced_output_path, # Replace None
    label='sentiment', # Replace None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
    headers=df.columns.to_list(),
    dataset_type='text/csv'
)

2.2. Configure BiasConfig

Bias is measured by calculating a metric and comparing it across groups. To compute it, you will specify the required information in the BiasConfig API. SageMaker Clarify needs the sensitive columns (facet_name) and the desirable outcomes (label_values_or_threshold). Here product_category is the sensitive facet and the desired outcome is with the sentiment==1.

SageMaker Clarify can handle both categorical and continuous data for label_values_or_threshold. In this case you are using categorical data.

bias_config_unbalanced = clarify.BiasConfig(
    label_values_or_threshold=[1], # desired sentiment
    facet_name='product_category' # sensitive column (facet)
)

2.3. Configure Amazon SageMaker Clarify as a processing job

Now you need to construct an object called SageMakerClarifyProcessor. This allows you to scale the process of data bias detection using two parameters, instance_count and instance_type. Instance_count represents how many nodes you want in the distributor cluster during the data detection. Instance_type specifies the processing capability (compute capacity, memory capacity) available for each one of those nodes. For the purposes of this lab, you will use a relatively small instance type. Please refer to this link for additional instance types that may work for your use case outside of this lab.

clarify_processor_unbalanced = clarify.SageMakerClarifyProcessor(role=role,
                                                      instance_count=1,
                                                      instance_type='ml.m5.large',
                                                      sagemaker_session=sess)

2.4. Run the Amazon SageMaker Clarify processing job

Exercise 2

Run the configured processing job to compute the requested bias methods of the input data

Instructions: Apply the run_pre_training_bias method to the configured Clarify processor, passing the configured input/output data (data_config_unbalanced), configuration of sensitive groups (bias_config_unbalanced) with the other job setup parameters:

clarify_processor_unbalanced.run_pre_training_bias(
    data_config=..., # configured input/output data
    data_bias_config=..., # configured sensitive groups
    methods=["CI", "DPL", "KL", "JS", "LP", "TVD", "KS"], # selector of a subset of potential metrics
    wait=False, # whether the call should wait until the job completes (default: True)
    logs=False # whether to show the logs produced by the job. Only meaningful when wait is True (default: True)
)
clarify_processor_unbalanced.run_pre_training_bias(
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    data_config=data_config_unbalanced, # Replace None
    data_bias_config=bias_config_unbalanced, # Replace None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
    methods=["CI", "DPL", "KL", "JS", "LP", "TVD", "KS"],
    wait=False,
    logs=False
)
Job Name:  Clarify-Pretraining-Bias-2023-06-09-13-54-14-420
Inputs:  [{'InputName': 'dataset', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-505802170089/bias/unbalanced/womens_clothing_ecommerce_reviews_transformed.csv', 'LocalPath': '/opt/ml/processing/input/data', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'analysis_config', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-505802170089/bias/generated_bias_report/unbalanced/analysis_config.json', 'LocalPath': '/opt/ml/processing/input/config', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'analysis_result', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-505802170089/bias/generated_bias_report/unbalanced', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
run_unbalanced_bias_processing_job_name = clarify_processor_unbalanced.latest_job.job_name
print(run_unbalanced_bias_processing_job_name)
Clarify-Pretraining-Bias-2023-06-09-13-54-14-420

2.5. Run and review the Amazon SageMaker Clarify processing job on the unbalanced dataset

Review the created Amazon SageMaker Clarify processing job and the Cloud Watch logs.

Instructions: - open the link - note that you are in the section Amazon SageMaker -> Processing jobs - check the processing job name - note which other properties of the processing job you can see in the console

from IPython.core.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}">processing job</a></b>'.format(region, run_unbalanced_bias_processing_job_name)))

Review processing job

Instructions: - open the link - open the log stream with the name, which starts from the processing job name - have a quick look at the log messages

from IPython.core.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch logs</a> after about 5 minutes</b>'.format(region, run_unbalanced_bias_processing_job_name)))

Review CloudWatch logs after about 5 minutes

running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=run_unbalanced_bias_processing_job_name,
                                                                            sagemaker_session=sess)

This cell will take approximately 5-10 minutes to run.

%%time

running_processor.wait(logs=False)
!CPU times: user 5.23 ms, sys: 292 µs, total: 5.53 ms
Wall time: 56.6 ms

2.6. Analyze unbalanced bias report

In this run, you analyzed bias for sentiment relative to the product_category for the unbalanced data. Let's have a look at the bias report.

List the files in the output path bias_report_unbalanced_output_path:

!aws s3 ls $bias_report_unbalanced_output_path/
2023-06-09 14:00:41      31732 analysis.json
2023-06-09 13:54:15        346 analysis_config.json
2023-06-09 14:00:41    1251091 report.html
2023-06-09 14:00:41     990884 report.ipynb
2023-06-09 14:00:41     846666 report.pdf

Download generated bias report from S3 bucket:

!aws s3 cp --recursive $bias_report_unbalanced_output_path ./generated_bias_report/unbalanced/
download: s3://sagemaker-us-east-1-505802170089/bias/generated_bias_report/unbalanced/analysis_config.json to generated_bias_report/unbalanced/analysis_config.json
download: s3://sagemaker-us-east-1-505802170089/bias/generated_bias_report/unbalanced/analysis.json to generated_bias_report/unbalanced/analysis.json
download: s3://sagemaker-us-east-1-505802170089/bias/generated_bias_report/unbalanced/report.ipynb to generated_bias_report/unbalanced/report.ipynb
download: s3://sagemaker-us-east-1-505802170089/bias/generated_bias_report/unbalanced/report.pdf to generated_bias_report/unbalanced/report.pdf
download: s3://sagemaker-us-east-1-505802170089/bias/generated_bias_report/unbalanced/report.html to generated_bias_report/unbalanced/report.html

Review the downloaded bias report (in HTML format):

from IPython.core.display import display, HTML

display(HTML('<b>Review <a target="blank" href="./generated_bias_report/unbalanced/report.html">unbalanced bias report</a></b>'))

Review unbalanced bias report

The bias report shows a number of metrics, but here you can focus on just two of them: - Class Imbalance (CI). Measures the imbalance in the number of members between different facet values. Answers the question, does a product_category have disproportionately more reviews than others? Values of CI will become equal for even distribution between facets. Here, different CI values show the existence of imbalance. - Difference in Positive Proportions in Labels (DPL). Measures the imbalance of positive outcomes between different facet values. Answers the question, does a product_category have disproportionately higher ratings than others? With the range over the interval from -1 to 1, if there is no bias, you want to see this value as close as possible to zero. Here, non-zero values indicate the imbalances.

3. Balance the dataset by product_category and sentiment

Let's balance the dataset by product_category and sentiment. Then you can configure and run SageMaker Clarify processing job to analyze the bias of it. Which metrics values do you expect to see in the bias report?

df_grouped_by = df.groupby(['product_category', 'sentiment'])
df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))
df_balanced
sentiment review_body product_category
product_category sentiment
Blouses -1 0 -1 An absolutely gorgeous but poorly fitting blo... Blouses
1 -1 I just can't get past how different the blouse... Blouses
2 -1 I'll start by saying the fabric and concept ar... Blouses
3 -1 The beadwork is gorgeous but the sleeves are ... Blouses
4 -1 Loved the fit and colors but the fabric is ve... Blouses
... ... ... ... ... ...
Trend 1 4 1 Never spent this much on a dress so it needs t... Trend
5 1 I love this tunic! it's wool but the same kind... Trend
6 1 This dress us high quality and feels great ho... Trend
7 1 Besides being my favorite color to wear this ... Trend
8 1 This dress is just my style. i have been waiti... Trend

486 rows × 3 columns

Visualize the distribution of review sentiment in the balanced dataset.

import seaborn as sns

sns.countplot(data=df_balanced, x='sentiment', hue='product_category')

plt.legend(loc='upper right',bbox_to_anchor=(1.3, 1.1))

<matplotlib.legend.Legend at 0x7fb42a9217d0>

png

4. Analyze bias on balanced dataset with Amazon SageMaker Clarify

Let's analyze bias in sentiment with respect to the product_category facet on your balanced dataset.

Save and upload balanced data to S3 bucket.

path_balanced = './womens_clothing_ecommerce_reviews_balanced.csv'
df_balanced.to_csv(path_balanced, index=False, header=True)

data_s3_uri_balanced = sess.upload_data(bucket=bucket, key_prefix='bias/balanced', path=path_balanced)
data_s3_uri_balanced
's3://sagemaker-us-east-1-505802170089/bias/balanced/womens_clothing_ecommerce_reviews_balanced.csv'

You can review the uploaded CSV file in the S3 bucket and prefix bias/balanced.

from IPython.core.display import display, HTML

display(HTML('<b>Review <a target="top" href="https://s3.console.aws.amazon.com/s3/home?region={}#">Amazon S3 bucket</a></b>'.format(region)))

Review Amazon S3 bucket

4.1. Configure a DataConfig

Exercise 3

Configure a DataConfig for Clarify to analyze bias on the balanced dataset.

Instructions: Pass the S3 object path containing the balanced dataset, the path to store the output (bias_report_balanced_output_path) and the target column. You can use exercise 1 as an example.

from sagemaker import clarify

bias_report_balanced_output_path = 's3://{}/bias/generated_bias_report/balanced'.format(bucket)

data_config_balanced = clarify.DataConfig(
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    s3_data_input_path='s3://sagemaker-us-east-1-505802170089/bias/balanced/womens_clothing_ecommerce_reviews_balanced.csv', # Replace None
    s3_output_path=bias_report_balanced_output_path, # Replace None
    label='sentiment', # Replace None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
    headers=df_balanced.columns.to_list(),
    dataset_type='text/csv'
)

4.2. Configure BiasConfig

BiasConfig for the balanced dataset will have the same settings as before.

bias_config_balanced = clarify.BiasConfig(
    label_values_or_threshold=[1], # desired sentiment
    facet_name='product_category' # sensitive column (facet)
)

4.3. Configure SageMaker Clarify as a processing job

SageMakerClarifyProcessor object will also have the same parameters.

clarify_processor_balanced = clarify.SageMakerClarifyProcessor(role=role,
                                                      instance_count=1,
                                                      instance_type='ml.m5.large',
                                                      sagemaker_session=sess)

4.4. Run the Amazon SageMaker Clarify processing job

Exercise 4

Run the configured processing job for the balanced dataset.

Instructions: Apply the run_pre_training_bias method to the configured Clarify processor, passing the input/output data, configuration of sensitive groups with the other job setup parameters. You can use exercise 2 as an example.

clarify_processor_balanced.run_pre_training_bias(
    ### BEGIN SOLUTION - DO NOT delete this comment for grading purposes
    data_config=data_config_balanced , # Replace None
    data_bias_config=bias_config_balanced, # Replace None
    ### END SOLUTION - DO NOT delete this comment for grading purposes
    methods=["CI", "DPL", "KL", "JS", "LP", "TVD", "KS"],
    wait=False,
    logs=False
)
Job Name:  Clarify-Pretraining-Bias-2023-06-09-14-10-36-505
Inputs:  [{'InputName': 'dataset', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-505802170089/bias/balanced/womens_clothing_ecommerce_reviews_balanced.csv', 'LocalPath': '/opt/ml/processing/input/data', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'analysis_config', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-505802170089/bias/generated_bias_report/balanced/analysis_config.json', 'LocalPath': '/opt/ml/processing/input/config', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'analysis_result', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-505802170089/bias/generated_bias_report/balanced', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
run_balanced_bias_processing_job_name = clarify_processor_balanced.latest_job.job_name
print(run_balanced_bias_processing_job_name)
Clarify-Pretraining-Bias-2023-06-09-14-10-36-505

4.5. Run and review the Clarify processing job on the balanced dataset

Review the results of the run following the links:

from IPython.core.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}">processing job</a></b>'.format(region, run_balanced_bias_processing_job_name)))

Review processing job

from IPython.core.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch logs</a> after about 5 minutes</b>'.format(region, run_balanced_bias_processing_job_name)))

Review CloudWatch logs after about 5 minutes

running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=run_balanced_bias_processing_job_name,
                                                                            sagemaker_session=sess)

This cell will take approximately 5-10 minutes to run.

%%time

running_processor.wait(logs=False)
!CPU times: user 4.75 ms, sys: 574 µs, total: 5.32 ms
Wall time: 58.3 ms

4.6. Analyze balanced bias report

List the files in the output path bias_report_balanced_output_path:

!aws s3 ls $bias_report_balanced_output_path/
2023-06-09 14:17:10      29889 analysis.json
2023-06-09 14:10:37        346 analysis_config.json
2023-06-09 14:17:10    1226587 report.html
2023-06-09 14:17:10     966380 report.ipynb
2023-06-09 14:17:10     834280 report.pdf

Download generated bias report from S3 bucket:

!aws s3 cp --recursive $bias_report_balanced_output_path ./generated_bias_report/balanced/
download: s3://sagemaker-us-east-1-505802170089/bias/generated_bias_report/balanced/analysis_config.json to generated_bias_report/balanced/analysis_config.json
download: s3://sagemaker-us-east-1-505802170089/bias/generated_bias_report/balanced/analysis.json to generated_bias_report/balanced/analysis.json
download: s3://sagemaker-us-east-1-505802170089/bias/generated_bias_report/balanced/report.html to generated_bias_report/balanced/report.html
download: s3://sagemaker-us-east-1-505802170089/bias/generated_bias_report/balanced/report.ipynb to generated_bias_report/balanced/report.ipynb
download: s3://sagemaker-us-east-1-505802170089/bias/generated_bias_report/balanced/report.pdf to generated_bias_report/balanced/report.pdf

Review the downloaded bias report (in HTML format):

from IPython.core.display import display, HTML

display(HTML('<b>Review <a target="blank" href="./generated_bias_report/balanced/report.html">balanced bias report</a></b>'))

Review balanced bias report

In this run, you analyzed bias for sentiment relative to the product_category for the balanced data. Note that the Class Imbalance (CI) metric is equal across all product categories for the target label, sentiment. And Difference in Positive Proportions in Labels (DPL) metric values are zero.

Upload the notebook into S3 bucket for grading purposes.

Note: you may need to click on "Save" button before the upload.

!aws s3 cp ./C1_W2_Assignment.ipynb s3://$bucket/C1_W2_Assignment_Learner.ipynb
upload: ./C1_W2_Assignment.ipynb to s3://sagemaker-us-east-1-505802170089/C1_W2_Assignment_Learner.ipynb

Please go to the main lab window and click on Submit button (see the Finish the lab section of the instructions).



Last update: July 22, 2024
Created: July 22, 2024