Supervised learning needs labels, or annotations, that tell the algorithm what the right answers are in the training phases of your project. In fact, many of the examples of using MXNet, TensorFlow, and PyTorch start with annotated data sets you can use to explore the various features of those frameworks. Unfortunately, when you move from the examples to application, it’s much less common to have a fully annotated set of data at your fingertips. This tutorial will show you how you can use Amazon Mechanical Turk (MTurk) from within your Amazon SageMaker notebook to get annotations for your data set and use them for training.
TensorFlow provides an example of using an Estimator to classify irises with a neural network classifier. This example can also be found in the SageMaker sample-notebooks library and in the SageMaker Examples project on GitHub. The tutorial and sample notebook both use the Iris data set, which includes four measurements for samples of three related iris species, along with the species each sample belongs to. Those data points are then used to train a model that can predict the species of an iris from its four measurements.
This is a great example for getting started with TensorFlow Estimators, but it would be much more difficult if all you had were images of each Iris and the associated measurements. Without annotated data, you could find yourself manually annotating the images and wasting a lot of time that you’d rather spend developing your model.
MTurk provides human intelligence through an API and is ideally suited to providing a wide range of annotations that can then be used in your training. With MTurk you get access to a 24x7x365 global workforce that can supply annotations for your data set. Simply define a task that you want Workers to complete for each of your data points, post it to the marketplace, and retrieve your results in a matter of minutes or hours. MTurk allows you to quickly get the annotations you need without the need to hire a team or spend your own time.
From within your SageMaker notebook you can quickly send tasks to MTurk Workers for annotation, review the results, and move on to your training. In this tutorial we'll build a version of the Iris data set that includes images instead of species information, and then ask MTurk Workers to identify the species based on those images. Finally, we'll use the Amazon SageMaker Python SDK to construct and train a neural network classifier using TensorFlow's tf.estimator. The notebook for this tutorial can be downloaded here. Note that spinning up an Amazon SageMaker notebook, gathering annotations with MTurk, training your model, and deploying it will all incur charges. Don't forget to delete your resources when you complete the tutorial so you won't continue to be billed.
You’ll perform eight steps to annotate the data and train a model:
1. Set up your Amazon Mechanical Turk Requester account and link it to your AWS account.
2. Load the Iris data set and modify it to include images.
3. Define an MTurk task for annotating the data set.
4. Submit the task to MTurk Workers for annotation.
5. Retrieve and reconcile the results provided by Workers.
6. Construct and train a model using the training data.
7. Deploy the model to an endpoint and use it to classify new samples.
8. Clean up your resources.
Note that this project makes use of three iris images within both the data set and the MTurk task itself. These images are used under the Creative Commons Attribution-ShareAlike 2.0 and 3.0 licenses and have been provided by Radomil (Iris setosa, CC BY-SA 3.0), Dlanglois (Iris versicolor, CC BY-SA 3.0), and Frank Mayfield (Iris virginica, CC BY-SA 2.0).
Step 1. Amazon Mechanical Turk Account setup
If you haven’t already, you’ll need to set up an Amazon Mechanical Turk Requester account that is linked to the AWS account you’re using with Amazon SageMaker. To get started, visit https://requester.mturk.com and create a new account.
After you've set up your MTurk account, you need to link it to the AWS account you're using with Amazon SageMaker. Sign in to your AWS account as the root user, using the email address for the root of your account. If you see the IAM user sign-in page, choose Sign-in using root account credentials. Then go to https://requester.mturk.com/developer to link the two accounts.
We now need to update the permissions for your Amazon SageMaker role to include the AmazonMechanicalTurkFullAccess policy. To do this, in the AWS Management Console open the IAM console and view your roles. Select the IAM role associated with your SageMaker instance and choose Attach policy. Search for AmazonMechanicalTurkFullAccess and add the policy to the role.
A HIT (Human Intelligence Task) is how individual tasks are represented in MTurk. To post tasks for Workers to complete, you first need to purchase prepaid HITs that are used to reward the Workers who complete them. You can do this at https://requester.mturk.com/account.
Note: As defined in the code that follows, the tutorial will cost you between $18.00 and $19.00 in Worker rewards and fees.
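As a rough sanity check on that estimate, here is a sketch of the arithmetic, assuming MTurk's standard 20% fee on Worker rewards; the item count, assignment count, and reward are the values used later in this tutorial.

```python
# Rough cost estimate for this tutorial (assumes MTurk's standard 20% fee on rewards)
num_items = 150               # rows in the Iris data set
assignments_per_item = 2      # MaxAssignments used in Step 4
reward_per_assignment = 0.05  # Reward used in Step 4, in USD

base_cost = num_items * assignments_per_item * reward_per_assignment * 1.20
print('Estimated base cost: ${:.2f}'.format(base_cost))  # $18.00; tie-breaking assignments add a bit more
```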
We will be using the xmltodict library to handle results returned from MTurk, so you will want to install it in your SageMaker notebook instance. We'll also import xmltodict and a few other libraries that we'll need, and specify the Amazon S3 bucket we'll use for storing the annotated data that will be used for training.
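The exact setup cell isn't reproduced in this post; the following is a minimal sketch of it. The bucket name is a placeholder you must replace with a bucket in your own account, and the imports cover the libraries used in the steps that follow.

```python
!pip install xmltodict

import io

import boto3
import pandas as pd
import xmltodict

# Placeholder: replace with an S3 bucket in your account for the annotated training data
training_bucket_name = 'YOUR-BUCKET-NAME'
s3 = boto3.resource('s3')
```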
MTurk has two environments you can use, Production and Sandbox. When you use the Production environment your tasks are visible to Workers at https://worker.mturk.com. There is also a Sandbox environment that can be used for testing. Workers won’t see your tasks in the Sandbox, but you can visit https://workersandbox.mturk.com to do them yourself and test your task interfaces. There is no cost to use the Sandbox environment, and it’s recommended that you test there first to make sure your task returns the data you need before moving to the Production environment. If you want to test in the Sandbox, note that you will need to create an additional account in the Sandbox and link it to your AWS account.
The following code creates a client in one of the two environments, depending on the value of create_hits_in_production. If you want to test in the Sandbox, change this value to False.
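A sketch of that client setup follows. The endpoint URLs are MTurk's documented Production and Sandbox endpoints, and the preview URLs are used later to build a link to your posted HITs.

```python
import boto3

create_hits_in_production = True  # Change to False to post to the Sandbox instead

environments = {
    'production': {
        'endpoint': 'https://mturk-requester.us-east-1.amazonaws.com',
        'preview': 'https://www.mturk.com/mturk/preview'
    },
    'sandbox': {
        'endpoint': 'https://mturk-requester-sandbox.us-east-1.amazonaws.com',
        'preview': 'https://workersandbox.mturk.com/mturk/preview'
    }
}
mturk_environment = environments['production'] if create_hits_in_production else environments['sandbox']

# Create the MTurk client against the selected environment
mturk = boto3.client(
    service_name='mturk',
    region_name='us-east-1',
    endpoint_url=mturk_environment['endpoint']
)
```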
To confirm your account is set up correctly, make a call to get your account balance. If you've connected to the Sandbox, your balance is always $10,000. This tutorial will require between $18.00 and $19.00 to complete, so you should add $19.00 to your Production account. The funds must be in your account before you submit tasks for Workers to complete.

```python
print(mturk.get_account_balance()['AvailableBalance'])
```
Step 2. Load the Iris data set and modify it to include images

```python
# Load the Iris data set
training_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                          header=None)

# Name the columns
training_df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Add an image_url column and remove the species column
def species_to_url(species):
    if species == 'Iris-setosa':
        return 'https://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg'
    elif species == 'Iris-versicolor':
        return 'https://upload.wikimedia.org/wikipedia/commons/4/41/Iris_versicolor_3.jpg'
    else:
        return 'https://upload.wikimedia.org/wikipedia/commons/3/38/Iris_virginica_-_NRCS.jpg'

image_urls = [species_to_url(row.species) for index, row in training_df.iterrows()]
training_df['image_url'] = image_urls
del training_df['species']

training_df
```
Step 3. Define an MTurk task for annotating the data set

The task interface Workers will see is defined in an HTML layout, which must be wrapped in MTurk's HTMLQuestion XML structure before it is submitted. The wrapper below is reconstructed from MTurk's standard HTMLQuestion schema; the frame height is an arbitrary choice.

```python
# Read the HTML layout that defines the task interface
html_layout = open('./IrisAnnotation.html', 'r').read()

# Wrap the layout in MTurk's HTMLQuestion XML structure
QUESTION_XML = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
<HTMLContent><![CDATA[{}]]></HTMLContent>
<FrameHeight>650</FrameHeight>
</HTMLQuestion>"""
question_xml = QUESTION_XML.format(html_layout)
```
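The IrisAnnotation.html layout itself accompanies the notebook and isn't reproduced in this post. As a reference, here is a hypothetical minimal layout consistent with the rest of the tutorial: it displays the image through the ${image_url} placeholder that Step 4 substitutes, and submits the species as a single integer (0 = setosa, 1 = versicolor, 2 = virginica), matching the answer parsing in Step 5. It uses MTurk's standard external HIT submission script.

```html
<!-- Hypothetical minimal task layout; the real IrisAnnotation.html ships with the notebook -->
<script src="https://s3.amazonaws.com/mturk-public/externalHIT_v1.js"></script>
<form name="mturk_form" method="post" id="mturk_form" action="https://www.mturk.com/mturk/externalSubmit">
  <input type="hidden" value="" name="assignmentId" id="assignmentId" />
  <p>What species of iris is shown in this image?</p>
  <img src="${image_url}" width="400" />
  <p>
    <label><input type="radio" name="species" value="0" required /> Iris setosa</label><br/>
    <label><input type="radio" name="species" value="1" required /> Iris versicolor</label><br/>
    <label><input type="radio" name="species" value="2" required /> Iris virginica</label>
  </p>
  <input type="submit" id="submitButton" value="Submit" />
</form>
<script language="Javascript">turkSetAssignmentID();</script>
```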
Step 4. Submit the task to MTurk Workers for annotation

Define the attributes that will apply to every HIT, then create one HIT per row in the data frame, substituting each row's image URL into the task layout.

```python
task_attributes = {
    'MaxAssignments': 2,                  # Number of Workers that will annotate each item
    'LifetimeInSeconds': 60*60*4,         # How long the task will be available on the MTurk website (4 hours)
    'AssignmentDurationInSeconds': 60*5,  # How long Workers have to complete each item (5 minutes)
    'Reward': '0.05',                     # The reward you will offer Workers for each response
    'Title': 'Classify images of flowers',
    'Description': 'Provide the species of iris in each image',
    'Keywords': 'classification, image'
}

hit_type_id = ''
results = []

for index, row in training_df.iterrows():
    response = mturk.create_hit(
        **task_attributes,
        Question=question_xml.replace('${image_url}', row['image_url'])
    )
    hit_type_id = response['HIT']['HITTypeId']
    results.append({
        'hit_id': response['HIT']['HITId']
    })

print("You can view the HITs here:")
print(mturk_environment['preview'] + "?groupId={}".format(hit_type_id))
```
Step 5. Retrieve and reconcile the results provided by Workers

Workers will typically complete the tasks within minutes or hours. Once responses start coming in, retrieve each HIT's Assignments, approve them, and reconcile the answers into a single annotation per item.

```python
species_count = 0

for item in results:
    # Get the status of the HIT
    hit = mturk.get_hit(HITId=item['hit_id'])
    item['status'] = hit['HIT']['HITStatus']

    # Get a list of the Assignments that have been submitted by Workers
    assignments_list = mturk.list_assignments_for_hit(
        HITId=item['hit_id'],
        AssignmentStatuses=['Submitted', 'Approved'],
        MaxResults=10
    )

    # Capture a count of the submitted Assignments in the results
    assignments = assignments_list['Assignments']
    item['assignments_submitted_count'] = len(assignments)

    answers = []
    for assignment in assignments:
        # Retrieve the attributes for each Assignment
        worker_id = assignment['WorkerId']
        assignment_id = assignment['AssignmentId']

        # Retrieve the value submitted by the Worker from the XML
        answer_dict = xmltodict.parse(assignment['Answer'])
        answer = answer_dict['QuestionFormAnswers']['Answer']['FreeText']
        answers.append(int(answer))

        # Approve the Assignment (if it hasn't already been approved)
        if assignment['AssignmentStatus'] == 'Submitted':
            mturk.approve_assignment(
                AssignmentId=assignment_id,
                OverrideRejection=False
            )

    # Add the answers that have been retrieved for this item to the results
    item['answers'] = answers

    # If at least 2 Workers agree on the same category, use that answer
    if len(answers) > 1:
        for species in [0, 1, 2]:
            if answers.count(species) >= 2:
                item['species'] = species
                species_count += 1

    # If no two Workers agree after all answers have been provided,
    # add an additional Assignment to break the tie
    if len(answers) == hit['HIT']['MaxAssignments'] and 'species' not in item:
        mturk.create_additional_assignments_for_hit(
            HITId=item['hit_id'],
            NumberOfAdditionalAssignments=1
        )
        print("Extended HIT {} to {} assignments".format(
            item['hit_id'], hit['HIT']['MaxAssignments'] + 1))

print("Irises annotated: {}".format(species_count))
```
Step 6. Construct and train a model using the training data

Start by merging the Worker-provided annotations back into the data frame and splitting it into training and test sets. The column assignments below produce the header-row format the TensorFlow Iris example expects: the number of samples, the number of features, and then the class names.

```python
# Merge the results back into the DataFrame
results_df = pd.merge(training_df, pd.DataFrame(results),
                      left_index=True, right_index=True)[
    ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']]

# Ensure the species column is defined as an int
results_df['species'] = results_df['species'].astype(int)

# Split the frame into training and test data sets
test_df = results_df.sample(n=30)
test_df.columns = [30, 4, 'setosa', 'versicolor', 'virginica']
train_df = results_df.drop(test_df.index)
train_df.columns = [120, 4, 'setosa', 'versicolor', 'virginica']

train_df
```

Write both data sets to Amazon S3 so SageMaker can use them for training:

```python
# Write the training and test data sets to the S3 bucket
csv_buffer = io.StringIO()
train_df.to_csv(csv_buffer, index=False)
s3.Object(training_bucket_name, 'trainingdata/iris_training.csv').put(Body=csv_buffer.getvalue())

csv_buffer = io.StringIO()
test_df.to_csv(csv_buffer, index=False)
s3.Object(training_bucket_name, 'trainingdata/iris_test.csv').put(Body=csv_buffer.getvalue())
```
Next, configure the training job with the Amazon SageMaker Python SDK. The model itself is defined in the iris_dnn_classifier.py entry point script that accompanies the notebook.

```python
from sagemaker import get_execution_role

# Bucket location to save your custom code in tar.gz format
custom_code_upload_location = 's3://{}/customcode/tensorflow_iris'.format(training_bucket_name)

# Bucket location where results of model training are saved
model_artifacts_location = 's3://{}/artifacts'.format(training_bucket_name)

# IAM execution role that gives SageMaker access to resources in your AWS account
role = get_execution_role()
```

Construct the TensorFlow estimator and train the model on the data you uploaded:

```python
from sagemaker.tensorflow import TensorFlow

iris_estimator = TensorFlow(entry_point='iris_dnn_classifier.py',
                            role=role,
                            output_path=model_artifacts_location,
                            code_location=custom_code_upload_location,
                            train_instance_count=1,
                            train_instance_type='ml.c4.xlarge',
                            training_steps=1000,
                            evaluation_steps=100)

train_data_location = 's3://{}/trainingdata'.format(training_bucket_name)
iris_estimator.fit(train_data_location)
```
Step 7. Deploy the model to an endpoint and use it to classify new samples

```python
# Deploy the trained model to a hosted endpoint
iris_predictor = iris_estimator.deploy(initial_instance_count=1,
                                       instance_type='ml.m4.xlarge')

# Classify a new sample from its four measurements
iris_predictor.predict([6.4, 3.2, 4.5, 1.5])
```
Step 8. Clean up your resources
When you’re done with this tutorial, you can avoid incurring unnecessary charges by using the AWS Management Console to delete the resources that you created for this exercise.
1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/ and delete the endpoint, the endpoint configuration, and the model. Then stop your notebook instance and delete it.
2. Open the Amazon S3 console at https://console.aws.amazon.com/s3/ and delete the bucket that you created for storing model artifacts and the training data set.
3. Open the IAM console at https://console.aws.amazon.com/iam/ and delete the IAM role. If you created permission policies, you can delete them, too.
4. Open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/ and delete all of the log groups that have names starting with /aws/sagemaker/.
Conclusion
As you can see, MTurk makes it possible to build the annotated data you need for your project. We started with an unannotated data set, created an MTurk task to annotate it, retrieved the Worker-provided results, and then trained our model, all within a SageMaker notebook. As a next step, we could use MTurk Workers to validate the results our model provides. This is particularly powerful when your model returns low-confidence predictions for a portion of your data. By using MTurk to review those items, you can get new annotations that you can use to re-train your model. By iteratively using MTurk to provide annotations, you can steadily refine your model as new data becomes available. For more examples of how to use MTurk in a variety of contexts, check out the tutorials provided on the MTurk Blog.
Amazon Mechanical Turk is a powerful tool for any data scientist that needs to collect, annotate, or validate data for training their models. With easy access to MTurk from within Amazon SageMaker, you get access to an on-demand workforce that can help you quickly and economically get the data you need for your project.
About the Author
Dave Schultz leads Business Development for Amazon Mechanical Turk. He helps customers find ways to apply human intelligence to complex problems in machine learning and data management. In his spare time, he enjoys woodworking and the odd programming project.