How to use common workflows on Amazon SageMaker notebook instances

Amazon SageMaker notebook instances provide a scalable, cloud-based development environment for data science and machine learning. This blog post walks through common workflows that will make you more productive and effective.

The techniques in this blog post will help you treat your notebook instances in a more cloud-native way, remembering that they are disposable and replaceable. We’ll walk through the following:

  • First, how to use GitHub and AWS CodeCommit for collaborative development.

  • Second, how to use AWS CloudFormation to automatically provision notebook instances and upload Jupyter notebooks to them.

  • Third, how to back up and restore assets from a notebook instance using an Amazon S3 bucket.

Tutorials

Collaboration with Git

Notebook instances work best when a single developer is assigned to a single instance. However, data scientists often work in a collaborative environment. Git is a tool that allows multiple contributors to write to a version-controlled code repository. Multiple developers or data scientists can each work on their own notebook instance, pull code down from a remote repository, and commit and push changes back to that repository. AWS CodeCommit and GitHub are two places where you can host your remote Git repository.

CodeCommit

To work with CodeCommit repositories, open the Amazon SageMaker console and follow the steps below; equivalent AWS CLI commands are shown after the console steps.

SageMaker Instance Role Config

  1. Log in to the AWS console.

  2. Type “sagemaker” into the search bar and open the Amazon SageMaker console.

  3. In the left-hand menu, choose Notebook instances and select a notebook instance (or create one first).

  4. Choose the link under “IAM role ARN”.

  5. On the Permissions tab, choose “Attach policies”.

  6. In the search bar, type “CodeCommitPowerUser”.

  7. Select the managed policy named “AWSCodeCommitPowerUser”.

  8. Choose “Attach policy” in the bottom-right corner.
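If you prefer the command line, the same managed policy can be attached with a single AWS CLI call. This is just a sketch; replace the placeholder role name with the execution role you opened in step 4:

# Attach the CodeCommit power user managed policy to the notebook instance's
# execution role (replace the placeholder role name with your own)
aws iam attach-role-policy \
    --role-name YourSageMakerExecutionRole \
    --policy-arn arn:aws:iam::aws:policy/AWSCodeCommitPowerUser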

Create CodeCommit Repo

  1. Log in to the AWS console.

  2. Type “codecommit” into the search bar and open the CodeCommit console.

  3. Choose “Create repository”.

  4. Name the repository “SageMaker-git-tutorial” and (optionally) add a description.

  5. Choose “Create repository”.

  6. Choose “Skip”.

  7. Select HTTPS as the connection type.

  8. Copy the command starting with “git clone” at the bottom of the page.
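If you would rather create the repository from the AWS CLI, a single command does the same thing (the description is optional):

# Create the CodeCommit repository used in this tutorial from the CLI
aws codecommit create-repository \
    --repository-name SageMaker-git-tutorial \
    --repository-description "SageMaker Git tutorial repository"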

Clone CodeCommit Repo

  1. Open the Amazon SageMaker console, choose Notebook instances, and open your notebook instance by choosing “Open Jupyter”.

  2. From the Jupyter dashboard, choose “New” and then “Terminal” to open a terminal on the instance.

  3. In the terminal, configure the CodeCommit credential helper, clone the repository, and push an initial commit.

The commands from step #3 are listed here for your convenience:

# Start a bash shell (the terminal opens in sh by default)
bash
# Configure Git to authenticate to CodeCommit using the instance's IAM role
git config --global credential.helper '!aws codecommit credential-helper $@'
git config --global credential.UseHttpPath true
ls
# Work inside the SageMaker directory so the repository shows up in the Jupyter dashboard
cd SageMaker
# Clone the repository (use the URL you copied from the CodeCommit console)
git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/SageMaker-git-tutorial
cd SageMaker-git-tutorial
# Create a file, commit it, and push it to the remote repository
touch README.md
git status
git add README.md
git commit -m "add README.md"
git push origin master

GitHub

To use GitHub with an Amazon SageMaker notebook instance, follow these instructions.

  1. Create a GitHub repo.

  2. Open up a terminal on your notebook instance, like we did at the end of the CodeCommit tutorial.

  3. Follow GitHub’s instructions to generate an SSH key and add it to the ssh-agent. In step 3, where it tells you to run ssh-add -K ~/.ssh/id_rsa (the macOS form), run this command instead (lowercase k):

ssh-add -k ~/.ssh/id_rsa

  4. Follow GitHub’s instructions to add your SSH key to your GitHub account. In step 1, where it tells you to copy your SSH key with pbcopy < ~/.ssh/id_rsa.pub (again macOS-specific), run the following command instead, then select and copy the output, which is your SSH public key:

cat ~/.ssh/id_rsa.pub

  5. Get your repo’s SSH URL by going to your repo, choosing “Clone or download”, and then choosing “Use SSH”.

  6. Go back to your notebook instance terminal and run the following command, pasting in the URL you copied:

git clone {url you copied}
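Putting those pieces together, the full sequence on the notebook instance terminal looks roughly like the following. The email address and repository URL are placeholders; substitute your own.

# Generate a new SSH key pair (accept the default file location when prompted)
ssh-keygen -t rsa -b 4096 -C "you@example.com"
# Start the ssh-agent and load the key (note the lowercase -k on the notebook instance)
eval "$(ssh-agent -s)"
ssh-add -k ~/.ssh/id_rsa
# Print the public key, then add it to your GitHub account (Settings > SSH and GPG keys)
cat ~/.ssh/id_rsa.pub
# Clone your repository into the persistent SageMaker directory
cd ~/SageMaker
git clone git@github.com:your-user/your-repo.git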

You now have a local copy of a remote Git repository on your notebook instance. Many developers can each have their own copy on their own instance and contribute code to the remote repository. See this description for a more in-depth discussion of working and collaborating with Git.

Each project should have its own Git repository. You could work on multiple projects in a single instance. You can also save your configuration information inside a Git repository and clone that repository as part of your notebook instance configuration. See here for more discussion on these “dotfiles” and here for my personal dotfile repo.
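As a sketch of that dotfiles approach, an on-start lifecycle configuration script along these lines could pull your configuration onto every instance you launch. The repository URL and linked file names are hypothetical placeholders:

#!/bin/bash

set -ex

# Run as ec2-user so the files land in that user's home directory
sudo -u ec2-user -i <<'EOF'
# Clone a hypothetical dotfiles repository on first start, then link the files into place
if [ ! -d ~/dotfiles ]; then
    git clone https://github.com/your-user/dotfiles.git ~/dotfiles
fi
ln -sf ~/dotfiles/.gitconfig ~/.gitconfig
ln -sf ~/dotfiles/.bashrc ~/.bashrc
EOF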

Initialization

You might want to perform some configuration steps whenever you launch a notebook instance. With AWS CloudFormation, you can automate provisioning Amazon SageMaker notebook instances and uploading notebooks to them. These notebooks could contain all of the setup code a user needs, such as configuring Git or installing Python modules.

You can use Amazon SageMaker Lifecycle configurations to download these initialization assets to a notebook instance when it is created.
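If you script this provisioning, whether through CloudFormation or directly with the AWS CLI, the CLI version looks roughly like the following. The configuration name, instance name, role ARN, and on-create script file are placeholders:

# Create a lifecycle configuration from a local on-create script (placeholder names)
aws sagemaker create-notebook-instance-lifecycle-config \
    --notebook-instance-lifecycle-config-name download-init-assets \
    --on-create Content=$(base64 -w0 on-create.sh)

# Launch a notebook instance that uses the lifecycle configuration
aws sagemaker create-notebook-instance \
    --notebook-instance-name my-notebook \
    --instance-type ml.t2.medium \
    --role-arn arn:aws:iam::123456789012:role/YourSageMakerExecutionRole \
    --lifecycle-config-name download-init-assets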

Here is an example script that will download files from an Amazon S3 bucket.

#!/bin/bash

set -ex

BUCKET="your-s3-bucket"
DIRECTORY="s3-assets"

mkdir -p $DIRECTORY
aws s3 cp s3://$BUCKET /home/ec2-user/$DIRECTORY --recursive
chown "ec2-user":"ec2-user" $DIRECTORY --recursive

Note: Your instance will need the proper permissions to access the S3 bucket for these scripts to work.
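One way to grant that access is to add an S3 policy to the notebook instance’s execution role. Here is a rough sketch using an inline policy from the CLI; the role name, policy name, and bucket name are placeholders, and you should trim the actions to what your scripts actually need:

# Grant the execution role access to a specific bucket (placeholder names)
aws iam put-role-policy \
    --role-name YourSageMakerExecutionRole \
    --policy-name notebook-s3-access \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
          "arn:aws:s3:::your-s3-bucket",
          "arn:aws:s3:::your-s3-bucket/*"
        ]
      }]
    }'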

Back up and restore

You may have datasets or static assets that you need to download to your instance. You can store that data in Amazon S3 and use the following command to download it to your instance:

aws s3 sync s3://{your bucket} {your local directory}

This command uploads changes to Amazon S3:

aws s3 sync {your local directory} s3://{your bucket}

You will need to make sure your instance has permission to access your S3 bucket.

You can use SageMaker Lifecycle configurations to create a cron job in your notebook instance that automatically updates your instance from an S3 bucket on a regular schedule.

Here is a script you can use in a lifecycle configuration to download files from an S3 bucket when an instance starts:

#!/bin/bash

set -ex

BUCKET="your-s3-bucket"
DIRECTORY="/home/ec2-user/s3-assets"

mkdir -p $DIRECTORY
chown "ec2-user":"ec2-user" $DIRECTORY --recursive
aws s3 cp s3://$BUCKET $DIRECTORY --recursive > /home/ec2-user/backup.log 2>&1

Here is another script that adds a cron job to your instance to back up a directory to an S3 bucket on an hourly schedule.

#!/bin/bash

set -ex

BUCKET="your-s3-bucket"
DIRECTORY="s3-assets"
CMD="aws s3 sync /home/ec2-user/$DIRECTORY s3://$BUCKET--recursive > /home/ec2-user/backup.log 2>&1"
cronjob="0 * * * * $CMD"

mkdir -p $DIRECTORY
chown "ec2-user":"ec2-user" $DIRECTORY --recursive
echo "$cronjob" | crontab -u "ec2-user" -

Note: Your instance will need the proper permissions to access the S3 bucket for these scripts to work.

File Share

You can use Amazon Elastic File System (Amazon EFS) to share files between SageMaker notebook instances. The blog post Mount an EFS file system to an Amazon SageMaker notebook (with lifecycle configurations) walks through how to configure your instance to use EFS. This is a great way to store a user’s “/home” directory in one place and use it across multiple instances. Imagine this: you turn on your notebook instance, and the lifecycle configuration mounts your home directory from EFS onto the instance. You can now use all of your configuration and dotfiles. If you launch a different notebook instance, or even an EC2 instance, you can mount the same EFS directory and access your files there too.
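For reference, the core of such a lifecycle configuration is simply an NFS mount of the EFS file system. Here is a minimal sketch, assuming a placeholder file system DNS name and mount point; see the linked post for the full setup, including the required security group rules:

#!/bin/bash

set -ex

# Placeholder EFS DNS name; replace with your own file system's DNS name
EFS_DNS="fs-12345678.efs.us-east-1.amazonaws.com"
MOUNT_POINT="/home/ec2-user/efs"

# Mount the EFS file system over NFS and give ec2-user ownership of the mount point
mkdir -p $MOUNT_POINT
mount -t nfs -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
    $EFS_DNS:/ $MOUNT_POINT
chown ec2-user:ec2-user $MOUNT_POINT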

Conclusion

Without these techniques, a notebook instance becomes fragile and hard to reproduce, which both isolates your work and risks losing it. You can now imagine launching a CloudFormation template that creates a notebook instance and uploads a bootstrap script, a script that installs packages, downloads data from S3, and clones your project repository. Individuals on your team could use that template to replicate your environment across accounts and AWS Regions, reproduce your work, and collaborate and experiment freely, using the full power of the cloud rather than just working in it.


About the Author

John Calhoun is an Associate Solutions Architect for the AWS Public Sector Partners team. He works with our customers and partners to provide leadership on a variety of projects, helping them shorten their time to value when using AWS.