Practical Apache Spark in 10 Minutes

 

Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009, and Databricks was later founded by the creators of Spark in 2013.

The Spark engine runs in a variety of environments, from cloud services to Hadoop or Mesos clusters. It is used to perform ETL, interactive queries (SQL), advanced analytics (e.g., machine learning) and streaming over large datasets in a wide range of data stores (e.g., HDFS, Cassandra, HBase, S3). Spark supports a variety of popular development languages including Java, Python, and Scala.

 Part 1 - Ubuntu installation

In this article, we walk you through the installation of Spark, as well as Hadoop, which we will need later on. Follow the instructions to start working with Spark.

 Part 2 - RDD

Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). It is a fault-tolerant collection of elements that can be operated on in parallel. RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
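
To make this concrete, here is a minimal Scala sketch of creating and transforming an RDD locally; the application name, master URL, and sample numbers are ours and serve the example only:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("RddExample").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Create an RDD from an in-memory collection
        val numbers = sc.parallelize(1 to 10)

        // Transformations are lazy: nothing runs until an action is called
        val squares = numbers.map(x => x * x)
        val evens = squares.filter(_ % 2 == 0)

        // Actions trigger the actual computation
        println(evens.collect().mkString(", "))   // 4, 16, 36, 64, 100
        println(s"count = ${evens.count()}")

        sc.stop()
      }
    }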

 Part 3 - DataFrames and SQL

Spark SQL is the part of the Apache Spark big data framework designed for processing structured and semi-structured data. It provides a DataFrame API that simplifies and accelerates data manipulation. A DataFrame is a special type of object, conceptually similar to a table in a relational database: it represents a distributed collection of data organized into named columns. DataFrames can be created from external sources, retrieved with a query from a database, or converted from an RDD; the inverse conversion is also possible. This abstraction is well suited for sampling, filtering, aggregating, and visualizing data.

In this blog post, we’re going to show you how to load a DataFrame and perform basic operations on DataFrames with both the API and SQL. We’ll also go through DataFrame-to-RDD conversions and vice versa.
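
As a preview, the sketch below builds a small DataFrame in memory, queries it through both the API and SQL, and converts it to an RDD and back; the column names and sample rows are ours for illustration:

    import org.apache.spark.sql.SparkSession

    object DataFrameExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("DataFrameExample")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Build a small DataFrame from an in-memory collection
        val people = Seq(("Alice", 34), ("Bob", 45), ("Carol", 29)).toDF("name", "age")

        // DataFrame API
        people.filter($"age" > 30).show()

        // The same query with SQL
        people.createOrReplaceTempView("people")
        spark.sql("SELECT name, age FROM people WHERE age > 30").show()

        // DataFrame -> RDD and back
        val rdd = people.rdd                                        // RDD[Row]
        val back = rdd.map(r => (r.getString(0), r.getInt(1))).toDF("name", "age")
        back.show()

        spark.stop()
      }
    }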

 Part 4 - MLlib

The vast possibilities of artificial intelligence are of increasing interest in modern information technologies. One of its most promising and rapidly evolving directions is machine learning (ML), which is becoming an essential part of many aspects of our lives. ML has found successful applications in natural language processing, face recognition, autonomous vehicles, fraud detection, machine vision, and many other fields.

Machine learning uses mathematical algorithms that solve specific tasks in a way loosely analogous to the human brain. Depending on the training approach, ML algorithms can be divided into supervised (labeled data), unsupervised (unlabeled data), semi-supervised (the dataset contains both labeled and unlabeled data), and reinforcement learning (based on receiving rewards). Solving the most basic and popular ML tasks, such as classification and regression, is mainly based on supervised learning algorithms. Among the variety of existing ML tools, Spark MLlib is a popular and easy-to-start library for training models to solve the problems mentioned above.

In this post, we would like to consider a classification task. We will classify Iris plants into three categories according to the sizes of their sepals and petals. The public Iris dataset is available here. To follow along, download the file bezdekIris.data to your working folder.
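
For a flavour of the MLlib workflow, here is a hedged sketch using the DataFrame-based spark.ml API with logistic regression (not necessarily the exact approach of the full post); it assumes bezdekIris.data sits in the working folder and contains four numeric features followed by the species name:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
    import org.apache.spark.sql.SparkSession

    object IrisClassification {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("IrisClassification")
          .master("local[*]")
          .getOrCreate()

        // bezdekIris.data has no header: 4 numeric features plus the class label
        val iris = spark.read
          .option("inferSchema", "true")
          .csv("bezdekIris.data")
          .toDF("sepal_length", "sepal_width", "petal_length", "petal_width", "species")

        // Encode the string label and assemble the features into a single vector
        val indexer = new StringIndexer().setInputCol("species").setOutputCol("label")
        val assembler = new VectorAssembler()
          .setInputCols(Array("sepal_length", "sepal_width", "petal_length", "petal_width"))
          .setOutputCol("features")
        val classifier = new LogisticRegression()

        val Array(train, test) = iris.randomSplit(Array(0.8, 0.2), seed = 42)
        val model = new Pipeline().setStages(Array(indexer, assembler, classifier)).fit(train)

        // Evaluate on the held-out part of the dataset
        val accuracy = new MulticlassClassificationEvaluator()
          .setMetricName("accuracy")
          .evaluate(model.transform(test))
        println(s"Test accuracy = $accuracy")

        spark.stop()
      }
    }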

 Part 5 - Streaming

Spark is a powerful tool which can be applied to solve many interesting problems, some of which have been discussed in our previous posts. Today we will consider another important application, namely streaming. Streaming data is data that arrives continuously as small records from different sources. There are many use cases for streaming technology, such as monitoring sensors in industrial or scientific devices, checking server logs, monitoring financial markets, etc.

In this post, we will examine the case of sensor temperature monitoring. Suppose we have several sensors (1, 2, 3, 4, …) in our device. Their state is defined by the following parameters: date (dd/mm/year), sensor number, state (1 - stable, 0 - critical), and temperature (degrees Celsius). The sensor state data arrives as a stream, and we want to analyze it. Streaming data can be loaded from different sources; since we don’t have a real streaming data source, we have to simulate it. Kafka, Flume, or Kinesis could be used for this purpose, but the simplest streaming data simulator is Netcat.
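
A minimal sketch of what this could look like with the classic DStream API, assuming Netcat is listening on localhost:9999 (nc -lk 9999) and each line has the comma-separated form date,sensor,state,temperature; the port and field layout are assumptions made for the example:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SensorStreaming {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SensorStreaming").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

        // Lines such as "01/01/2020,3,1,36.6" arriving from Netcat
        val lines = ssc.socketTextStream("localhost", 9999)

        val readings = lines.map(_.split(",")).map { fields =>
          (fields(1).toInt, (fields(3).toDouble, 1))       // (sensor, (temperature, count))
        }

        // Average temperature per sensor within each batch
        val avgTemp = readings
          .reduceByKey { case ((t1, c1), (t2, c2)) => (t1 + t2, c1 + c2) }
          .mapValues { case (sum, count) => sum / count }

        avgTemp.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }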

 Part 6 - GraphX

In our last post, we explained the basics of streaming with Spark. Today, we want to talk about graphs and explore the Apache Spark GraphX tool for graph computation and analysis. Note that GraphX works only with Scala.

A graph is a structure which consists of vertices and the edges between them. Graph theory finds applications in various fields such as computer science, linguistics, physics, chemistry, social sciences, biology, and mathematics. Problems connected with graph analysis can be rather complicated, but there are many convenient modern tools and libraries for this purpose.

In this post, we will consider the following example of a graph: cities are the vertices, and the distances between them are the edges. You can see the Google Maps illustration of this structure in the figure below.
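
Since the figure itself belongs to the full post, here is a tiny Scala sketch of such a graph in GraphX; the city names and distances are illustrative values we picked for the example:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}

    object CityGraph {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("CityGraph").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Vertices: (id, city name)
        val cities = sc.parallelize(Seq(
          (1L, "Amsterdam"), (2L, "Berlin"), (3L, "Prague"), (4L, "Vienna")
        ))

        // Edges: approximate road distances in kilometres
        val roads = sc.parallelize(Seq(
          Edge(1L, 2L, 650), Edge(2L, 3L, 350), Edge(3L, 4L, 300), Edge(2L, 4L, 680)
        ))

        val graph = Graph(cities, roads)
        println(s"cities = ${graph.numVertices}, roads = ${graph.numEdges}")

        // Total length of the roads leaving each city
        graph.aggregateMessages[Int](ctx => ctx.sendToSrc(ctx.attr), _ + _)
          .join(cities)
          .collect()
          .foreach { case (_, (total, city)) => println(s"$city: $total km") }

        sc.stop()
      }
    }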

 Part 7 - GraphX and Neo4j

As can be seen throughout this series of blog posts, Apache Spark is a powerful big data tool with many useful libraries (like MLlib and GraphX). Sometimes you have to work with data that has lots of relationships, and ordinary SQL databases are not the best option in that case. That’s when NoSQL databases come into play: they are more flexible and better represent data with numerous connections.

In the previous blog post, we considered the Apache Spark GraphX tool, but we used a small graph to illustrate its possibilities. In the real world, graphs can be much bigger and more complicated. Therefore, in this post, we will create a bigger graph, store it in a Neo4j database, and analyze it with the Apache Spark GraphX tool, namely the PageRank algorithm.

To do this, we are going to load data from CSV files which contain words with their corresponding parts of speech (the nodes) and the edges between them (each edge has a “with” property, meaning the two words are often seen in one sentence). First, we will prepare the data using pandas. Then we will push it to the Neo4j database. Finally, we will integrate Neo4j with GraphX to load the graph from the Neo4j database and analyze it using Spark GraphX.
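
The Neo4j loading itself relies on the Neo4j Spark connector and is left to the full post; the sketch below only shows the GraphX side, running PageRank on a toy in-memory stand-in for the word graph (the words and edges are ours for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}

    object WordPageRank {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordPageRank").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Vertices are words; an edge means the two words are often seen in one sentence
        val words = sc.parallelize(Seq(
          (1L, "spark"), (2L, "data"), (3L, "graph"), (4L, "fast")
        ))
        val cooccurrences = sc.parallelize(Seq(
          Edge(1L, 2L, "with"), Edge(2L, 3L, "with"), Edge(3L, 1L, "with"), Edge(1L, 4L, "with")
        ))
        val graph = Graph(words, cooccurrences)

        // Run PageRank until the ranks converge within the given tolerance
        val ranks = graph.pageRank(0.0001).vertices

        ranks.join(words)
          .sortBy { case (_, (rank, _)) => -rank }
          .collect()
          .foreach { case (_, (rank, word)) => println(f"$word%-8s $rank%.4f") }

        sc.stop()
      }
    }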

ActiveWizards is a team of data scientists and engineers focused exclusively on data projects (big data, data science, machine learning, data visualizations). Areas of core expertise include data science (research, machine learning algorithms, visualizations and engineering), data visualizations (d3.js, Tableau and others), big data engineering (Hadoop, Spark, Kafka, Cassandra, HBase, MongoDB and others), and data-intensive web application development (RESTful APIs, Flask, Django, Meteor).
