The volume, velocity, and variety of data have been increasing ever since the advent of the internet. The problem many enterprises face is managing this “big data” and making sense of it to produce the outcomes they want. Silos within enterprises, continuous ingestion of data in numerous formats, and the ever-changing technology landscape make it difficult to collect, store, share, analyze, and visualize data. The question is: how do you build a robust data pipeline that connects to the data store and gives data scientists and engineers a platform to gain insights from your data?
Many organizations are starting to invest in a centralized data store and coming to understand the benefits of using Amazon Simple Storage Service (S3) as their data lake. Building a data pipeline from the data lake involves a data discovery process that includes identifying the data format and schema, cataloging the data, understanding the metadata to build table definitions, and providing the ability to query the data.
This blog post walks you through all the steps required to build the data pipeline: creating a crawler for your data in Amazon S3 and using AWS Glue for data discovery and cataloging, querying the cataloged tables with Amazon Athena, and using Amazon SageMaker to query the Athena tables from a notebook.
Amazon SageMaker is an end-to-end machine learning (ML) platform you can use to build, train, and deploy machine learning models in AWS. It’s a highly modular service, so you can use each of its components independently. The Amazon SageMaker notebook module lets you interact with the data efficiently, without the latency of bringing it into a local environment. We will show you how to use the Jupyter notebook component of Amazon SageMaker to integrate with Athena and populate data frames for data manipulation.
In the example in this blog post, we use the 2008 file from the Airline On-Time Performance data set, which consists of flight arrival and departure details for all commercial flights within the USA from October 1987 to April 2008. We uploaded the data to Amazon S3 for use with AWS Glue and Athena, followed by Amazon SageMaker.
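If you want to reproduce the setup, the file can be uploaded through the AWS Management Console, the AWS CLI, or a few lines of boto3. Here is a minimal boto3 sketch; the bucket name and object key are placeholders you would replace with your own.

# Minimal sketch: upload the 2008 airline data file to Amazon S3 with boto3.
# The bucket name and object key are placeholders -- substitute your own values.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="2008.csv",              # local copy of the data set
    Bucket="my-airline-data-lake",    # placeholder bucket name
    Key="airline/2008/2008.csv",      # placeholder key (prefix the crawler will scan later)
)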
Data discovery with AWS Glue
The data catalog features of AWS Glue and the inbuilt integration to Amazon S3 simplify the process of identifying data and deriving the schema definition out of the discovered data. Using AWS Glue crawlers within your data catalog, you can traverse your data stored in Amazon S3 and build out the metadata tables that are defined in your data catalog.
The first step in populating the data catalog is to define a database that holds your table definitions.
In the AWS Glue console, choose Add Database. Provide a database name and choose Create.
Next create an AWS Glue crawler to add a table to the database. The crawler crawls the data in Amazon S3 and adds the table definitions to the database.
Choose a data store. This is the URL to your data stored in Amazon S3. Then choose Next.
For the IAM role, create a new role that gives AWS Glue permission to access the files in Amazon S3 and to decrypt them if they are encrypted.
Choose Next to schedule the crawler with a Frequency of Run on demand.
Next you choose the database created in step one.
Review the steps, and choose Finish. The crawler is ready to run. Choose Run it now. This will create the table definitions for your data in Amazon S3.
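The same database and crawler setup can also be scripted if you prefer to automate it. The following is a minimal boto3 sketch that assumes the database is named athenaquery (matching the queries later in this post); the crawler name, IAM role ARN, and S3 path are placeholders.

# Minimal sketch: create the Glue database and crawler programmatically with boto3.
# The crawler name, role ARN, and S3 path are placeholders -- substitute your own.
import boto3

glue = boto3.client("glue")

# Database that will hold the table definitions.
glue.create_database(DatabaseInput={"Name": "athenaquery"})

# On-demand crawler that scans the S3 prefix and writes table definitions
# into the athenaquery database.
glue.create_crawler(
    Name="airline-crawler",                                                # placeholder name
    Role="arn:aws:iam::111122223333:role/AWSGlueServiceRole-airline",      # placeholder role ARN
    DatabaseName="athenaquery",
    Targets={"S3Targets": [{"Path": "s3://my-airline-data-lake/airline/2008/"}]},
)
glue.start_crawler(Name="airline-crawler")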
Query Amazon S3 data using Athena
Athena lets you query data in Amazon S3 using a standard SQL interface. By using the AWS Glue data catalog, you can create interactive queries and perform any data manipulations required for further downstream processing. The Amazon Athena console automatically reflects the databases and tables in the AWS Glue catalog. Simply log in to the AWS Management Console, navigate to the Amazon Athena console, and in the Query Editor you will see the databases and tables that you created previously.
Be sure to choose the Settings button on the top right to note the staging directory. Amazon Athena stores query results in this S3 location, and you will need it in the next steps when Amazon SageMaker queries Athena.
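Queries don’t have to go through the console, either. As a rough illustration, a query can be submitted with boto3, passing the staging directory as the result output location; the table name and result bucket below are placeholders.

# Minimal sketch: submit an Athena query with boto3, writing results to the
# staging directory noted in the Athena console Settings. Names are placeholders.
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT * FROM athenaquery.airline_2008 LIMIT 10",            # placeholder table
    QueryExecutionContext={"Database": "athenaquery"},
    ResultConfiguration={"OutputLocation": "s3://aws-athena-query-results-example/"},  # placeholder staging directory
)
print(response["QueryExecutionId"])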
Data analysis and management using Amazon SageMaker
The final step in this data pipeline is to make these table definitions available in an Amazon SageMaker Jupyter notebook instance. Jupyter notebooks are popular among data scientists for visualizing data, performing statistical analysis, manipulating data, and getting it ready for machine learning work.
In the Amazon SageMaker console choose Create notebook instance.
Under Notebook instance settings, enter a notebook instance name, choose an instance type, and choose an IAM role that allows the notebook instance in Amazon SageMaker to interact with Amazon S3.
Wait for the notebook instance to be created and its Status to change to InService.
Choose the Open link, which will open the notebook interface in a separate browser window.
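If you prefer to automate this part of the pipeline as well, the notebook instance can be created with boto3. A minimal sketch follows; the instance name and execution role ARN are placeholders.

# Minimal sketch: create the notebook instance programmatically with boto3.
# The instance name and execution role ARN are placeholders.
import boto3

sagemaker = boto3.client("sagemaker")
sagemaker.create_notebook_instance(
    NotebookInstanceName="athena-data-pipeline",                        # placeholder name
    InstanceType="ml.t2.medium",
    RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",    # placeholder role ARN
)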
Choose New -> Terminal to open a terminal window. You will need to update your AWS credentials file so that your queries can authenticate.
Update the AWS credentials file with your credentials and the AWS Region. As a best practice, use temporary credentials that expire after a set duration.
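The credentials file (typically ~/.aws/credentials) would then look something like the following; the values shown here are placeholders rather than real keys, and the session token line applies when you use temporary credentials.

[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>
aws_session_token = <your-session-token>
region = us-east-1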
Choose New to create a new notebook in Jupyter. Amazon SageMaker provides several kernels for Jupyter, including support for Python 2 and 3, MXNet, TensorFlow, and PySpark. Choose a Python kernel for this exercise because it comes with the Pandas library built in. Within the notebook, execute the following commands to install PyAthena, a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena.
import sys
!{sys.executable} -m pip install PyAthena
After PyAthena is installed, you can connect to Athena and populate Pandas data frames. For data scientists, working with data is typically divided into multiple stages: munging and cleaning the data, analyzing and modeling it, and then organizing the results of the analysis into a form suitable for plotting or tabular display. Pandas is the ideal tool for all of these tasks.
from pyathena import connect
import pandas as pd
# Replace the placeholders with your Athena staging directory, AWS Region, and table name.
conn = connect(s3_staging_dir="s3://<athena-staging-bucket>/<prefix>/",
               region_name="<your-region>")
df = pd.read_sql("SELECT * FROM athenaquery.<your_table> LIMIT 10", conn)
df.head()
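With the data frame populated, the usual Pandas workflow applies. As a quick illustration, the following assumes the crawler discovered columns such as uniquecarrier and arrdelay from the On-Time Performance schema; adjust the names to match your catalog.

# Quick exploration of the populated data frame. Column names assume the
# On-Time Performance schema as lowercased by the crawler -- adjust as needed.
print(df.shape)                                        # rows and columns retrieved
print(df[["uniquecarrier", "arrdelay"]].head())        # carrier and arrival delay
print(df.groupby("uniquecarrier")["arrdelay"].mean())  # average arrival delay by carrier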
Conclusion
The solution described in this blog post provides an automated way to catalog the incoming data as it comes into the data store, and it provides the ability to query the data for data manipulation and analysis. In addition, this scenario sets the stage for building more ML models through feature engineering, training, and scoring to gain more insights into your data and deliver significant business outcomes.
About the Author
Anjana Kandalam is a Solutions Architect at Amazon Web Services based out of New York. She works with enterprise customers in the financial vertical, helping them align their business goals with the art of the possible using cloud-based technologies. She enjoys traveling and loves spending time doing outdoor activities with her kids.