If you’re a data scientist at one of the few companies with massive corporate datasets defined in an integrated system, and can analyze the data easily, lucky you! If you’re like the rest of the world, and need some help joining disparate datasets, analyzing the data, and visualizing the analysis in preparation for your machine learning work, then read on.
In this blog post, I’ll show you how to perform exploratory analysis on massive corporate data sets in Amazon SageMaker. From your Jupyter notebook running on Amazon SageMaker, you’ll identify and explore several corporate datasets in the corporate data lake that seem interesting to you. You’ll discover that each contains a subset of the information you need. You’ll join them to extract the interesting information, then continue analyzing and visualizing your data in your Amazon SageMaker notebook, in a seamless experience.
Introduction
Amazon SageMaker is a fully managed machine learning service. With Amazon SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment. Amazon SageMaker provides an integrated Jupyter authoring environment for data scientists to perform initial data exploration, analysis, and model building.
From a Jupyter notebook running on an Amazon SageMaker notebook instance, you can easily read Amazon S3 datasets into your notebook, and process them there. However, first you must locate the datasets of interest. In a large data lake, identifying the exact datasets that contain the fields you want to analyze can be challenging. As the individual datasets get larger, reading them into your notebook ceases to be practical. Your notebook has limited disk space and memory, compared to the size of many of today’s datasets. If the information you need is spread across multiple datasets, as is frequently the case, the data exploration becomes more challenging still. You must locate all the datasets needed, then join and filter them. Reading and joining very large datasets in your notebook can cripple your productivity, or, as the datasets get larger, become impractical. This kind of data combination and exploration can take up to 80% of a data scientist’s time, so reducing friction here is critical to accelerating the completion of your machine learning projects.
Many large corporations are using AWS Glue to manage their data lake. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. AWS Glue contains a central metadata repository known as the AWS Glue Data Catalog, which makes the enriched and categorized data in the data lake available for search and querying. You can use the metadata in the Data Catalog to identify the names, locations, content, and characteristics of datasets of interest.
Filtering or aggregating extremely large datasets from your Amazon S3 data lake, and potentially combining and joining them with other datasets, is a task that Apache Spark on Amazon EMR is ideally suited for. Spark is a cluster-computing framework with built-in modules that support analytics from a variety of languages, including Python, Java, and Scala. The ability of Spark on Amazon EMR to scale makes it a good fit for the large datasets frequently found in corporate data lakes. If the datasets are already defined in your AWS Glue Data Catalog, accessing them becomes easier still: you can use the AWS Glue Data Catalog as the external Hive metastore for Amazon EMR. After filtering, the reduced dataset can be read into your notebook, allowing you to explore, process, and visualize only the subset of corporate data relevant to the task at hand. The heavyweight processing is farmed out to a service well suited to that task.
By following the instructions in this blog, you’ll be accessing your data lake and analyzing the data in your Amazon SageMaker notebook in no time!
Note: This blog post builds on a prior blog post, Build Amazon SageMaker notebooks backed by Spark in Amazon EMR.
Step 1: Solution overview
This solution shows you how to combine these capabilities. You'll learn how to:
- Identify Amazon S3 datasets of interest in the AWS Glue Data Catalog from within your Amazon SageMaker-hosted Jupyter notebook.
- Perform the joining, filtering, and aggregation in Amazon EMR.
- Use the resulting dataset directly as a Pandas dataframe (a 2D data structure, with potentially different data types in the columns, that you can easily manipulate) in your notebook.
The solution is implemented using three AWS services and some open source components: an Amazon SageMaker notebook instance running Jupyter notebooks and SparkMagic, an Amazon EMR cluster running Apache Spark and Apache Livy, and the AWS Glue Data Catalog.
- Amazon SageMaker is a fully managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. A key part of Amazon SageMaker is its support for authoring: it provides zero-setup hosted Jupyter notebook IDEs for data exploration, cleaning, and preprocessing.
- SparkMagic is a set of tools for interactively working with remote Spark clusters through Livy from Jupyter notebooks. The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as kernels that you can use to turn Jupyter into an integrated Spark environment (a brief example of these magics follows this list).
- Livy is a service that enables easy interaction with Spark on an Amazon EMR cluster over a REST interface. Livy enables the use of Spark for interactive web/mobile applications; in this case, from your Jupyter notebook.
- The AWS Glue Data Catalog acts as a central metadata repository, making your data available for search and query using services such as Amazon Athena, Amazon Redshift, and Amazon EMR. You can also use the AWS Glue Data Catalog as your external Apache Hive compatible metastore for big data applications running on Amazon EMR, which is how you'll use it here.
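To make the Sparkmagic piece concrete, here is a minimal sketch of a notebook cell, assuming a Sparkmagic (PySpark) kernel and a Livy connection that is already configured (as the CloudFormation stack in this post sets up for you):

```python
# A default cell in a Sparkmagic (PySpark) notebook is shipped through Livy to
# Spark on the EMR cluster, where "spark" is the SparkSession Livy provides.
spark.sql("SHOW DATABASES").show()
```

Cells that start with the %%local magic run as ordinary Python on the notebook instance itself, %%info reports the state of the underlying Livy session, and %%sql runs a SQL statement on the cluster, optionally copying the result back to the notebook as a Pandas dataframe with the -o flag; you'll see that pattern again in Step 4.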
In this blog post, all of these components interact as shown in the following diagram.
You can access the datasets on Amazon S3 and defined in the AWS Glue Data Catalog by following these steps:
- You work in your Jupyter Sparkmagic notebook in Amazon SageMaker. Within the notebook, you issue commands that either execute locally or are sent to Spark running on the Amazon EMR cluster. You can use PySpark commands, or you can use SQL magics to issue HiveQL commands.
- The commands sent to the cluster are received by the Livy service running on the EMR cluster (a sketch of this REST exchange follows this list).
- Livy passes the commands to Spark, running on the EMR cluster.
- Spark accesses the Hive metastore to identify the location, schema, and properties of the cataloged dataset. In this case, the Hive metastore has been set to the AWS Glue Data Catalog.
- Spark uses the information from the AWS Glue Data Catalog to read the data directly from Amazon S3.
- After you perform your manipulations, Livy can return the data to your notebook as a Pandas dataframe for additional analysis and visualization.
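You won't normally call Livy yourself, because Sparkmagic handles it for you, but a short sketch of the REST exchange helps clarify the middle steps. The host name below is a placeholder for your EMR master's private IP address:

```python
import json
import requests

livy_url = "http://EMR-MASTER-PRIVATE-IP:8998"  # placeholder host

# 1. Ask Livy to start a PySpark session on the cluster.
session = requests.post(
    f"{livy_url}/sessions",
    data=json.dumps({"kind": "pyspark"}),
    headers={"Content-Type": "application/json"},
).json()
# (In practice, wait for the session state to become "idle" before submitting code.)

# 2. Submit a statement; Livy hands the code to Spark for execution.
statement = requests.post(
    f"{livy_url}/sessions/{session['id']}/statements",
    data=json.dumps({"code": "spark.sql('SHOW DATABASES').show()"}),
    headers={"Content-Type": "application/json"},
).json()

# 3. Poll the statement until Spark has produced its output.
result = requests.get(
    f"{livy_url}/sessions/{session['id']}/statements/{statement['id']}"
).json()
print(result.get("state"), result.get("output"))
```

Sparkmagic performs essentially this exchange each time you run a cell, which is why the notebook instance's security group must be able to reach the Livy port on the EMR master.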
In the sections that follow, you’ll perform these steps on a small sample set of data: some information about legislators, drawn from http://everypolitician.org/. The data is normalized into several tables. To get personal information on the legislators along with the political party and their political service, you’ll need to combine information from several tables.
You’ll perform the following steps:
- Use a provided AWS CloudFormation stack to create the Amazon SageMaker notebook and an EMR cluster with Livy and Spark, and specify AWS Glue as the cluster's Hive compatible metastore. The stack also sets up an AWS Glue crawler to crawl some sample data.
- Execute the AWS Glue crawler to populate metadata in the AWS Glue Data Catalog.
From your Jupyter notebook on Amazon SageMaker, you'll then:
- Access the data defined in the AWS Glue Data Catalog, joining some tables and filtering the result.
- Pull the reduced dataset into your notebook and explore it.
- Perform some analysis, and cluster the data.
Step 2: Set up infrastructure components
In this step, you’ll launch a predefined AWS CloudFormation stack to set up the individual components of the solution. The provided stack sets up the following:
- An EMR cluster with Livy and Spark, using the AWS Glue Data Catalog as the external Hive compatible metastore. In addition, the stack configures Livy on Amazon EMR to use that same metastore (a sketch of the relevant configuration classifications follows this list).
- An AWS Glue database, and an AWS Glue crawler set up to discover the legislator dataset on Amazon S3 and catalog it in the AWS Glue Data Catalog. The AWS Identity and Access Management (IAM) role the Glue crawler needs is also created. You'll run the crawler in the next step, and explore that data later from your Amazon SageMaker notebook.
Infrastructure needed for Amazon SageMaker:
- An IAM role for use by the Amazon SageMaker notebook instance. The IAM role has the AmazonSageMakerFullAccess managed policy attached, plus access to Amazon S3.
- A security group, used for the Amazon SageMaker notebook instance.
- An Amazon SageMaker lifecycle configuration that configures Livy to access the EMR cluster launched by the stack, and copies in a predefined Jupyter notebook with the sample code.
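For reference, here is a sketch of the kind of EMR configuration classifications that point Hive and Spark at the AWS Glue Data Catalog. The actual CloudFormation template may express them differently, but the hive-site and spark-hive-site settings shown are the documented way to use the Data Catalog as the metastore:

```python
# Configuration classifications that tell Hive and Spark on EMR to use the
# AWS Glue Data Catalog as their metastore. A list like this can be supplied
# as the Configurations property of the EMR cluster, for example through
# CloudFormation or boto3's run_job_flow.
GLUE_METASTORE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)

emr_configurations = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class": GLUE_METASTORE_FACTORY
        },
    },
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class": GLUE_METASTORE_FACTORY
        },
    },
]
```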
To see this solution in operation in the us-east-1 AWS Region, open the AWS CloudFormation console and choose the Launch Stack button.
Choose Next, then update the parameters for your environment as shown in the following screenshot.
- Give your stack a name.
- At a minimum, update the VPCId and VPCSubnet parameters. Choose a VPC and public subnet with internet access (for example, the default VPC created in your account) to allow the needed software components to be installed.
- Also check that the default names provided for the Amazon SageMaker notebook instance, lifecycle config, and AWS Glue database are not already used in your account. If they are, replace them with different names.
Choose Next. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names, then choose Create.
Wait for the CloudFormation master stack and its three nested stacks to reach a status of CREATE_COMPLETE. It takes about 10-15 minutes to deploy the stack.
On the master stack, check the Outputs tab for the EMRClusterId, SageMakerNotebookInstanceId, and SageMakerInstanceSecurityGroupId. You’ll use these in the next step.
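If you prefer to capture these outputs programmatically rather than from the console, a small boto3 sketch such as the following works; the stack name is a placeholder for whatever you named your master stack:

```python
import boto3

cloudformation = boto3.client("cloudformation")

# Replace with the name you gave the master stack.
stack = cloudformation.describe_stacks(StackName="my-sageglue-stack")["Stacks"][0]

# Prints EMRClusterId, SageMakerNotebookInstanceId,
# SageMakerInstanceSecurityGroupId, and any other stack outputs.
for output in stack.get("Outputs", []):
    print(output["OutputKey"], "=", output["OutputValue"])
```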
Step 3: Run the AWS Glue crawler
At this point, the CloudFormation stack has performed two AWS Glue-related steps for you:
- Created a Glue database, called legislators.
- Created a Glue crawler, called LegislatorsS3Crawler.
However, there currently is no metadata in the AWS Glue Data Catalog about any datasets on Amazon S3. To populate the database with that metadata, run the AWS Glue crawler. To do so:
- On the AWS Management Console, navigate to the AWS Glue console.
- Choose Crawlers from the left navigation bar.
- Find the crawler LegislatorsS3Crawler in the list. Select the crawler, then choose Run Crawler. You will be asked to choose a role for the crawler to use or to create a new role. Choose the IAM role created by the CloudFormation stack.
- Wait as the crawler runs. You'll see the status change to Starting, then Running, Stopping, and finally Ready.
- When the crawler status is Ready, check the column under Tables Added. You should see that 6 tables have been added.
To review the tables themselves, in the AWS Glue console, on the left navigation pane, choose Databases, and then choose the legislatorss3 database. Choose Tables in legislatorss3. You should see a list of tables similar to the one in the following screenshot. These tables contain sample data in JSON format about United States legislators and the seats that they have held in the US House of Representatives and Senate. (This is the same data loaded in the tutorial Crawling the Sample Data Used in the Tutorials, which manually walks you through creating the crawler that the CloudFormation template created for you here). You’ll be accessing these tables from your Jupyter notebook in a later step.
Take a moment to explore the tables. You can see the schema, size information, location, and other information about your datasets here.
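The console steps in this section can also be scripted. Here is a rough boto3 equivalent, assuming the crawler and database names described above (adjust them if you changed the template defaults or if your database name differs):

```python
import time

import boto3

glue = boto3.client("glue")

# Start the crawler created by the CloudFormation stack, then wait for it
# to return to the READY state.
glue.start_crawler(Name="LegislatorsS3Crawler")
while glue.get_crawler(Name="LegislatorsS3Crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)

# List the tables the crawler added, along with their S3 locations.
for table in glue.get_tables(DatabaseName="legislators")["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])
```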
Step 4: Access data defined in AWS Glue Data Catalog from your Amazon SageMaker notebook
In this step, you’ll access and explore these same tables from an Amazon SageMaker notebook.
In the AWS Management Console, navigate to the Amazon SageMaker console. On the left navigation pane, choose Notebook instances. Locate the notebook instance started by your CloudFormation stack, the SageMakerNotebookInstanceId (the default in the template is ‘sageglue’). Choose Open next to the notebook name.
You’ll see a page similar to the following screenshot. The Amazon SageMaker lifecycle configuration in the CloudFormation stack automatically uploaded the notebook Using_SageMaker_to_access_S3_Using_Glue_Data_Catalog to your Jupyter dashboard.
Open the Using_SageMaker_to_access_S3_Using_Glue_Data_Catalog notebook by choosing its title.
Note that the kernel type is “SparkMagic (PySpark)”. From the menu bar at the top, choose Kernel, then Restart to restart the kernel. Reply “restart” to the question, ”Do you want to restart the current kernel? All variables will be lost.” Livy maintains a session to the EMR cluster, and that session times out after a period of inactivity (configurable in Livy). Restarting the kernel and re-running the notebook from the top whenever you start or resume work reduces errors caused by dropped connections.
Click the first cell, then choose Run from the menu bar at the top. The notebook executes the cell and advances to the next one. The indicator next to a code cell switches from “*” to a number when the cell has finished executing. Continue choosing Run for each cell, and read the explanations in the cells to understand the actions being taken by the code. These instructions walk you through:
- Accessing the Spark cluster, and running a simple PySpark statement.
- Listing the databases in your AWS Glue Data Catalog, and showing the tables in the Legislators database you set up earlier.
- Using SQL to join 3 tables in the Legislators database, filter the resulting rows on a condition, and identify the specific columns of interest (a sketch of a similar query follows this list).
- Using the resulting data frame, which contains only the fields and rows of interest, locally on your notebook instance for further processing.
- Performing a small clustering example.
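To give a flavor of the SQL step, here is a sketch of the kind of Sparkmagic cell the notebook uses to join the legislator tables and pull a reduced result into a local Pandas dataframe. The database, table, and column names below follow the public legislators sample dataset and are assumptions; check the table list in your own Glue database before running anything like this:

```python
%%sql -o legislators_df
-- Runs on the EMR cluster via Livy; the -o flag copies the result back to the
-- notebook instance as a Pandas dataframe named legislators_df.
SELECT p.given_name, p.family_name, p.gender, p.birth_date,
       o.name AS organization_name
FROM legislators.persons_json p
JOIN legislators.memberships_json m ON p.id = m.person_id
JOIN legislators.organizations_json o ON m.organization_id = o.id
LIMIT 1000
```

A later cell beginning with %%local can then treat legislators_df as an ordinary Pandas dataframe on the notebook instance, which is where the small clustering example runs.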
Step 5: Clean up
Remember to delete your AWS CloudFormation master stack to stop incurring charges. Deleting the master stack will delete the nested stacks for you.
The cost to run this solution in the default template configuration is approximately $0.15 per hour.
Note that this is a small configuration for demo purposes. To process large amounts of data, the instance sizes for Amazon EMR and Amazon SageMaker should be scaled appropriately.
Step 6: (Optional) Debugging and lifecycle configurations
This section provides additional, optional information on two topics:
- Debugging your connection
- More on SageMaker lifecycle configurations
Debugging your connection
If your notebook does not connect to your cluster, perform the following steps to see where the problem lies. Not having the correct settings in livy.conf and not having the correct ports open on the security groups between the EMR cluster and the notebook instance are the most common sources of connection difficulties.
When the notebook instance is created or started, the results of running the lifecycle config are captured in an Amazon CloudWatch Logs log group called /aws/sagemaker/NotebookInstances. This log group has log streams for each notebook instance; check the stream for your instance's lifecycle configuration to confirm that the script completed successfully.
Next, check the Livy configuration and EMR access on the notebook instance. In the Jupyter files dashboard, choose New, then select Terminal. This will open a shell for you on the notebook instance.
The Livy config file is stored in /home/ec2-user/SageMaker/.sparkmagic/livy.conf. Check that the EMR master’s private IP address has replaced the original http://localhost:8998 in the 3 places it appears in the file.
You can also verify network connectivity to the cluster from this terminal.
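One minimal way to check, sketched here assuming the EMR master's private IP address (replace the placeholder) and the default Livy port of 8998, is to query the Livy REST endpoint from Python:

```python
import requests

# Replace with the private IP address of your EMR master node.
livy_sessions_url = "http://EMR-MASTER-PRIVATE-IP:8998/sessions"

# With connectivity and security groups configured correctly, Livy answers
# with a JSON list of sessions; a timeout or connection error points to a
# security group or VPC routing problem.
response = requests.get(livy_sessions_url, timeout=10)
print(response.status_code, response.json())
```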
If you have long-running Spark jobs, you might want to alter the livy.server.session.timeout. Currently it defaults to 1 hour, which might be too short for your usage. This setting can be added to the EMR cluster’s livy-conf configuration, set in the CloudFormation template.
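The setting itself is another EMR configuration classification; as a sketch (the 5h value is only an example), the entry might look like this:

```python
# Example livy-conf classification entry for the EMR cluster's Configurations
# list: raise the Livy session timeout to five hours for long-running jobs.
livy_timeout_configuration = {
    "Classification": "livy-conf",
    "Properties": {"livy.server.session.timeout": "5h"},
}
```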
More on Amazon SageMaker lifecycle configurations
SageMaker notebook instances can use a lifecycle configuration. With a lifecycle configuration, you can provide a Bash script to be run whenever an Amazon SageMaker notebook instance is created, or when it is restarted after having been stopped. The CloudFormation template uses a creation-time script to configure Livy on the notebook instance with the address of the Amazon EMR master instance created earlier. A single lifecycle configuration can be used by many notebook instances.
Note that you can modify lifecycle configurations while they’re attached to a notebook instance. Currently you can’t, however, attach a lifecycle configuration to a notebook instance after the instance has been launched. In the event that a lifecycle configuration is sufficiently malformed, the instance might crash – and any Jupyter notebooks and other artifacts currently stored on the instance may be lost.
In this step, you’ve provided a hardcoded IP address at notebook instance creation time. For a more robust solution, you might want to have the master’s IP address automatically discovered each time the notebook instance is restarted. For example, you could set a specific tag for the EMR cluster that notebook instances should use, and have the start script locate the cluster with that tag and use its IP address to reset the Livy config settings. With that approach, you can continue to connect to the AWS Glue Data Catalog from a notebook instance even as the cluster’s master or the cluster itself is replaced over time.
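A rough sketch of that discovery logic, written with boto3 as it might appear in a lifecycle start script, is shown below. The tag key and value, cluster states, and file path are assumptions you would adapt to your environment:

```python
import re

import boto3

emr = boto3.client("emr")

# Find an active cluster carrying the tag that marks it for notebook use.
# The tag key/value convention here is an assumption; use your own.
master_ip = None
for summary in emr.list_clusters(ClusterStates=["WAITING", "RUNNING"])["Clusters"]:
    cluster = emr.describe_cluster(ClusterId=summary["Id"])["Cluster"]
    tags = {t["Key"]: t["Value"] for t in cluster.get("Tags", [])}
    if tags.get("notebook-target") == "true":
        # The master instance group contains the node running Livy.
        instances = emr.list_instances(
            ClusterId=summary["Id"], InstanceGroupTypes=["MASTER"]
        )["Instances"]
        master_ip = instances[0]["PrivateIpAddress"]
        break

if master_ip:
    # Point every Livy URL in the Sparkmagic config at the discovered master.
    livy_conf_path = "/home/ec2-user/SageMaker/.sparkmagic/livy.conf"
    with open(livy_conf_path) as f:
        conf = f.read()
    conf = re.sub(r"http://\S+?:8998", f"http://{master_ip}:8998", conf)
    with open(livy_conf_path, "w") as f:
        f.write(conf)
```

The notebook instance's IAM role would also need permission to call these Amazon EMR APIs for such a script to work.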
Conclusion
By now you can see the true power of this combination – the ability to use the scalability and processing capabilities of Amazon EMR to preprocess, filter, join, and aggregate data from your Amazon S3 data lake data. Your data scientists can use tools they’re familiar with – Jupyter notebooks and SQL – to quickly explore and visualize data that’s already been cataloged in your data lake. Another source of friction has been removed, and your research can move at the pace of business.
About the Author
Dr. Veronika Megler is a senior consultant at Amazon Web Services. She works with our customers to implement innovative big data, AI and ML projects, helping them accelerate their time-to-value when using AWS.