I recently had a chat with Benjamin Bengfort, a data scientist finishing his PhD at the University of Maryland, and Jenny Kim, a software engineer at Cloudera, about their forthcoming O’Reilly Media book (now in Early Access), Data Analytics with Hadoop: An Introduction for Data Scientists.
Why did you decide to write this book?
Ben: The content was originally part of a class that Jenny and I were teaching together. The course was about distributed computing with Apache Hadoop (MapReduce in particular), but the students came from a strong statistical or research background. From this experience, Jenny and I quickly realized that learning Hadoop or Apache Spark from the computer scientist/programmer perspective is very different from learning it from the perspective of a data scientist. Unfortunately, most tutorials and books focus on programmers who understand distributed systems and Linux admin, and yet at the same time, data scientists are becoming the primary users of Hadoop and Spark for analytics and data management.
Also, Jenny and I both work on data teams and have worked on data teams together. We’ve seen (and felt) first-hand the frustration of having a large and rich dataset that doesn’t fit in an RDBMS, so we can’t simply query it or feed it into predictive models. We wanted to write this book to unlock the door to the tools that might open up opportunities for new analyses in a wide array of domains.
Who is the intended reader?
Ben: Our audience is the data scientist, a technically competent individual with a wide variety of experience in programming, analytics, modeling, visualization, business, research, and domain expertise but no deep specialization. We assume that this data scientist is familiar with R or Python, but not systems programming languages like Java or C. Additionally, we expect that the reader has used data management tools like relational databases, but perhaps isn’t used to systems administration on Linux machines. This description includes people with a wide array of experience, from beginners who are trying to take a multifaceted approach to their education to seasoned researchers and professionals who simply haven’t used distributed computing before.
Our secondary audience is someone who is completely new to Hadoop. We want our book to serve as a lightweight introduction to distributed computing concepts and the Hadoop ecosystem.
Jenny: I would add that our intended audience includes data analysts as well as data scientists. These readers likely come from a statistical or analytic background, and have a solid understanding of data mining, statistical modeling, and predictive analytics or machine-learning techniques. As such, we assume that our readers are comfortable programming in Python to perform statistical analysis and modeling techniques on their local machine, but want to learn how to perform these tasks on large data sets that require distributed computation on a cluster.
What will readers learn, and how does it complement what they will learn from other similar titles on the market?
Jenny: Readers will learn the basic principles of distributed cluster computing in Hadoop, and get a high-level introduction to the programming APIs in Hadoop and Spark and useful ecosystem tools that will actually enable them to use Hadoop as a big data platform. We also introduce some of the exciting built-in APIs in Spark that are specifically designed for analytic and machine-learning workloads, including the DataFrames API and MLlib.
Ben: The primary takeaway we want readers to have is a comfort or familiarity interacting with a cluster. Most new users have only experienced computing on a single machine, and their thoughts about how computation works have to change a bit to reflect the realities of computing in a distributed environment. Our book doesn’t go into depth on the API or core of either MapReduce or Spark, nor does it go into depth about the Hadoop architecture, any of the ecosystem tools, analytical concepts, or machine learning. However, we cover these things in such a way as to point our readers to the materials that will best support them in learning what they specifically need to know to get the job done.
From that perspective, we see our book as the first read for anyone interested in either Hadoop or Spark. We give you enough terminology and concepts to make you dangerous, and then we point you down a path to the content (other similar titles on the market) in which you are specifically interested.
Do you feel like the practice of data science has already changed since you began the project?
Ben: The primary shift in the practice that I’ve seen is in two directions: one away from computer science approaches and back to more traditional research and analytics, and a second toward applied machine learning. The great thing about ML is that it’s a hacker’s art, which means you don’t need a graduate degree to do it. More and more, we’re seeing “entry-level” scientists: folks with a brand-new education in data science (something that is also very new), who understand a wide array of tools and techniques, but importantly who can do and build things with data, code, and analytics. Of course, I hope that equipping more data scientists with distributed computing tools will allow them to do and build bigger and more awesome things.
Jenny: Well, certainly, the rapid pace of innovation in big data technologies has introduced new tools, processing abilities, and even algorithms (neural networks and deep learning, for example) to the data science space. We’ve also seen a continued expansion of use cases for sophisticated analytics and machine-learning algorithms across new sectors and industries. But while these trends have had a significant impact on the kinds of insights that can be derived and the types of data assets that can be built, the fundamental data science methodology of loading, wrangling, exploring, and modeling data has remained largely constant. What we’re seeing instead is a shift in the expected output of the data science practice, where the end result is not just data analytics but reusable data products.
If you wanted readers to do one thing after reading the book, what would it be?
Jenny: It would be awesome to hear back from our readers with their experiences and use cases for Hadoop and Spark. Cloudera is in the business of making these tools more accessible to data analysts, but there’s still a lot that we can do to bridge the gap between the technology and the big data problems that they’re trying to solve. Hopefully, this book will be the first step for many in their big data journey and the start of an ongoing dialogue and collaboration.
Ben: As for me, I want them to read the book cover-to-cover before doing anything else. Only at that point do I want them to set up a mini-cluster, either in pseudo-distributed mode or a small one on EC2, and try some of the examples in the book. Then, it will be up to them to select the next, more in-depth reading on one of the topics we’ve discussed in the book.