Data Center Scale Computing and Artificial Intelligence with Matei Zaharia, Inventor of Apache Spark

Matei Zaharia, Chief Technologist at Databricks & Assistant Professor of Computer Science at Stanford University, in conversation with Joseph Sirosh, Chief Technology Officer of Artificial Intelligence in Microsoft’s Worldwide Commercial Business

At Microsoft, we are privileged to work with individuals whose ideas are blazing a trail, transforming entire businesses through the power of the cloud, big data and artificial intelligence. Our new “Pioneers in AI” series features insights from such pathbreakers. Join us as we dive into these innovators’ ideas and the solutions they are bringing to market. See how your own organization and customers can benefit from their solutions and insights.

Our first guest in the series, Matei Zaharia, started the Apache Spark project during his PhD at the University of California, Berkeley, in 2009. His research was recognized through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in Computer Science. He is a co-founder of Databricks, which offers a Unified Analytics Platform powered by Apache Spark. Databricks’ mission is to accelerate innovation by unifying data science, engineering and business. Microsoft has partnered with Databricks to bring you Azure Databricks, a fast, easy, and collaborative Apache Spark based analytics platform optimized for Azure. Azure Databricks offers one-click set up, streamlined workflows and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts to generate great value from data faster.

So, let’s jump right in and see what Matei has to say about Spark, machine learning, and interesting AI applications that he’s encountered lately.

Video and podcast versions of this session are available at the links below. The podcast is also available from your Spotify app and via Stitcher. Alternatively, just continue reading the text version of their conversation below, via this blog post.

Joseph Sirosh: Matei, could you tell us a little bit about how you got started with Spark and this new revolution in analytics you are driving?

Matei Zaharia: Back in 2007, I started doing my PhD at UC Berkeley and I was very interested in data center scale computing, and we just saw at the time that there was an open source MapReduce implementation in Apache Hadoop, so I started early on by looking at that. Actually, the first project was profiling Hadoop workloads to identify some bottlenecks and, as part of that, we made some improvements to the Hadoop job scheduler and that actually went into Hadoop and I started working with some of the early users of that, especially Facebook and Yahoo. And what we saw across all of these is that this type of large data center scale computing was very powerful, there were a lot of interesting applications they could do with them, but just the map-reduce programming model alone wasn’t really sufficient, especially for machine learning – that’s something everyone wanted to do where it wasn’t a good fit but also for interactive queries and streaming and other workloads.

So, after seeing this for a while, the first project we built was the Apache Mesos cluster manager, to let you run other types of computations next to Hadoop. And then we said, you know, we should try to build our own computation engine which ended up becoming Apache Spark.

JS: What was unique about Spark?

MZ: I think there were a few interesting things about it. One of them was that it tried to be a general or unified programming model that can support many types of computations. So, before the Spark project, people wanted to do these different computations on large clusters and they were designing specialized engines to do particular things, like graph processing, SQL custom code, ETL which would be map-reduce, they were all separate projects and engines. So in Spark we kind of stepped back at them and looked at these and said is there any way we can come up with a common abstraction that can handle these workloads and we ended up with something that was a pretty small change to MapReduce – MapReduce plus fast data sharing, which is the in-memory RDDs in Spark, and just hooking these up into a graph of computations turned out to be enough to get really good performance for all the workloads and matched the specialized engines, and also much better performance if your workload combines a bunch of steps. So that is one of the things.

I think the other thing which was important is, having a unified engine, we could also have a very composable API where a lot of the things you want to use would become libraries, so now there are hundreds maybe thousands of third party packages that you can use with Apache Spark which just plug into it that you can combine into a workflow. Again, none of the earlier engines had focused on establishing a platform and an ecosystem but that’s why it’s really valuable to users and developers, is just being able to pick and choose libraries and arm them.

JS: Machine Learning is not just one single thing, it involves so many steps. Now Spark provides a simple way to compose all of these through libraries in a Spark pipeline and build an entire machine learning workflow and application. Is that why Spark is uniquely good at machine learning?

MZ: I think it’s a couple of reasons. One reason is much of machine learning is preparing and understanding the data, both the input data and also actually the predictions and the behavior of the model, and Spark really excels at that ad hoc data processing using code – you can use SQL, you can use Python, you can use DataFrames, and it just makes those operations easy, and, of course, all the operations you do also scale to large datasets, which is, of course, important because you want to train machine learning on lots of data.

Beyond that, it does support iterative in-memory computation, so many algorithms run pretty well inside it, and because of this support for composition and this API where you can plug in libraries, there are also quite a few libraries you can plug in that call external compute engines that are optimized to do different types of numerical computation.

JS: So why didn’t some of these newer deep learning toolsets get built on top of Spark? Why were they all separate?

MZ: That’s a good question. I think a lot of the reason is probably just because people, you know, just started with a different programming language. A lot of these were started with C++, for example, and of course, they need to run on the GPU using CUDA which is much easier to do from C++ than from Java. But one thing we’re seeing is really good connectors between Spark and these tools. So, for example, TensorFlow has a built-in Spark connector that can be used to get data from Spark and convert it to TFRecords. It also actually connects to HDFS and different sorts of big data file systems. At the same time, in the Spark community, there are packages like deep learning pipelines from Databricks and quite a few other packages as well that let you setup a workflow of steps that include these deep learning engines and Spark processing steps. |“None of the earlier engines [prior to Apache Spark] had focused on establishing a platform and an ecosystem.”|

JS: If you were rebuilding these deep learning tools and frameworks, would you recommend that people build it on top of Spark? (i.e. instead of the current approach, of having a tool, but they have an approach of doing distributed computing across GPUs on their own.)

MZ: It’s a good question. I think initially it was easier to write GPU code directly, to use CUDA and C++ and so on. And over time, actually, the community has been adding features to Spark that will make it easier to do that in there. So, there’s definitely been a lot of proposals and design to make GPU a first-class resource. There’s also this effort called Project Hydrogen which is to change the scheduler to support these MPI-like batch jobs. So hopefully it will become a good platform to do that, internally. I think one of the main benefits of that is again for users that they can either program in one programming language, they can learn just one way to deploy and manage clusters and it can do deep learning and the data preprocessing and analytics after that.

JS: That’s great. So, Spark – and Databricks as commercialized Spark – seems to be capable of doing many things in one place. But what is not good at? Can you share some areas where people should not be stretching Spark?

MZ: Definitely. One of the things it doesn’t do, by design, is it doesn’t do transactional workloads where you have fine grained updates. So, even though it might seem like you can store a lot of data in memory and then update it and serve it, it is not really designed for that. It is designed for computations that have a large amount of data in each step. So, it could be streaming large continuous streams, or it could be batch but is it not these point queries.

And I would say the other thing it does not do it is doesn’t have a built-in persistent storage system. It is designed so it’s just a compute engine and you can connect it to different types of storage and that actually makes a lot of sense, especially in the cloud, with separating compute and storage and scaling them independently. But it is different from, you know, something like a database where the storage and compute are co-designed to live together.

JS: That makes sense. What do you think of frameworks like Ray for machine learning?

MZ: There are lot of new frameworks coming out for machine learning and it’s exciting to see the innovation there, both in the programming models, the interface, and how to work with it. So I think Ray has been focused on reinforcement learning which is where one of the main things you have to do is spawn a lot of little independent tasks, so it’s a bit different from a big data framework like Spark where you’re doing one computation on lots of data – these are separate computations that will take different amounts of time, and, as far as I know, users are starting to use that and getting good traction with it. So, it will be interesting to see how these things come about.

I think the thing I’m most interested in, both for Databricks products and for Apache Spark, is just enabling it to be a platform where you can combine the best algorithms, libraries and frameworks and so on, because that’s what seems to be very valuable to end users, is they can orchestrate a workflow and just program it as easily as writing a single machine application where you just import a bunch of libraries.

JS: Now, stepping back, what do you see as the most exciting applications that are happening in AI today?

MZ: Yeah, it depends on how recent. I mean, in the past five years, deep learning is definitely the thing that has changed a lot of what we can do, and, in particular, it has made it much easier to work with unstructured data – so images, text, and so on. So that is pretty exciting.

I think, honestly, for like wide consumption of AI, the cloud computing AI services make it significantly easier. So, I mean, when you’re doing machine learning AI projects, it’s really important to be able to iterate quickly because it’s all about, you know, about experimenting, about finding whether something will work, failing fast if a particular idea doesn’t work. And I think the cloud makes it much easier.

JS: Cloud AI is super exciting, I completely agree. Now, at Stanford, being a professor, you must see a lot of really exciting pieces of work that are going on, both at Stanford and at startups nearby. What are some examples?

MZ: Yeah, there are a lot of different things. One of the things that is really useful for end users is all the work on transfer learning, and in general all the work that lets you get good results with AI using smaller training datasets. There are other approaches as well like weak supervision that do that as well. And the reason that’s important is that for web-scale problems you have lot of labeled data, so for something like web search you can solve it, but for many scientific or business problems you don’t have that, and so, how can you learn from a large dataset that’s not quite in your domain like the web and then apply to something like, say, medical images, where only a few hundred patients have a certain condition so you can’t get a zillion images. So that’s where I’ve seen a lot of exciting stuff.

But yeah, there’s everything from new hardware for machine learning where you throw away the constraints that the computation has to be precise and deterministic, to new applications, to things like, for example security of AI, adversarial examples, verifiability, I think they are all pretty interesting things you can do.

JS: What are some of the most interesting applications you have seen of AI?

MZ: So many different applications to start with. First of all, we’ve seen consumer devices that bring AI into every home, or every phone, or every PC – these have taken off very quickly and it’s something that a large fraction of customers use, so that’s pretty cool to see.

In the business space, probably some of the more exciting things are actually dealing with image data, where, using deep learning and transfer learning, you can actually start to reliably build classifiers for different types of domain data. So, whether it’s maps, understanding satellite images, or even something as simple as people uploading images of a car to a website and you try to give feedback on that so it’s easier to describe it, a lot of these are starting to happen. So, it’s kind of a new class of data, visual data – we couldn’t do that much with it automatically before, and now you can get both like little features and big products that use it.

JS: So what do you see as the future of Databricks itself? What are some of the innovations you are driving?

MZ: Databricks, for people not familiar, we offer basically, a Unified Analytics Platform, where you can work with big data mostly through Apache Spark and collaborate with it in an organization, so you can have different people, developing say notebooks to perform computations, you can have people developing production jobs, you can connect these together into workflows, and so on.

So, we’re doing a lot of things to further expand on that vision. One of the things that we announced recently is what we call machine learning runtime where we have preinstalled versions of popular machine learning libraries like XGBoost or TensorFlow or Horovod on your Databricks cluster, so you can set those up as easily as you can set up as easily as you can setup an Apache Spark cluster in the past. And then another product that we featured a lot at our Spark Summit conference this year is Databricks Delta which is basically a transactional data management layer on top of cloud objects stores that lets us do things like indexing, reliable exactly once stream processing, and so on at very massive scale, and that’s a problem that all our users have, because all our users have to setup a reliable data ingest pipeline.

JS: Who are some of the most exciting customers of Databricks and what are they doing?

MZ: There are a lot of really interesting customers doing pretty cool things. So, at our conference this year, for example, one of the really cool presentations we saw was from Apple. So, Apple’s internal information security group – this is the group that does network monitoring, basically gets hundreds of terabytes of network events per day to process, to detect intrusions and information security problems. They spoke about using Databricks Delta and streaming with Apache Spark to handle all of that – so it’s one of the largest applications people have talked about publicly, and it’s very cool because the whole goal there – it’s kind of an arms race between the security team and attackers – so you really want to be able to design new rules, new measurements and add new data sources quickly. And so, the ease of programming and the ease of collaborating with this team of dozens of people was super important.

We also have some really exciting health and life sciences applications, so some of these are actually starting to discover new drugs that companies can actually productionize to tackle new diseases, and this is all based on large scale genomics and statistical studies.

And there are a lot of more fun applications as well. Like actually the largest video game in the world, League of Legends, they use Databricks and Apache Spark to detect players that are misbehaving or to recommend items to people or things like that. These are all things that were featured at the conference.

JS: If you had one advice to developers and customers using Spark or Databricks, or guidance on what they should learn, what would that be?

MZ: It’s a good question. There are a lot of high-quality training materials online, so I would say definitely look at some of those for your use case and see what other people are doing in that area. The Spark Summit conference is also a good way to see videos and talks and we make all of those available for free, the goal of that is to help and grow the developer community. So, look for someone who is doing similar things and be inspired by that and kinda see what the best practices are around that, because you might see a lot of different options for how to get started and it can be hard to see what the right path is.

JS: One last question – in recent years there’s been a lot of fear, uncertainty and doubt about AI, and a lot of popular press. Now – how real are they, and what do you think people should be thinking?

MZ: That’s a good question. My personal view is – this sort of evil artificial general intelligence stuff – we are very far away from it. And basically, if you don’t believe that, I would say just try doing machine learning tutorials and see how these models break down – you get a sense for how difficult that is.

But there are some real challenges that will come from AI, so I think one of them is the same challenge as with all technology which is, automation – how quickly does it happen. Ultimately, after automation, people usually end up being better off, but it can definitely affect some industries in a pretty bad way and if there is no time for people to transition out, that can be a problem.

I think the other interesting problem there is always a discussion about is basically access to data, privacy, managing the data, algorithmic discrimination – so I think we are still figuring out how to handle that. Companies are doing their best, but there are also many unknowns as to how these techniques will do that. That’s why we’ll see better best practices or regulations and things like that.

JS: Well, thank you Matei, it’s simply amazing to see the innovations you have driven, and looking forward to more to come.

MZ: Thanks for having me. |“When you’re doing machine learning AI projects, it’s really important to be able to iterate quickly because it’s all about experimenting, about finding whether something will work, failing fast if a particular idea doesn’t work.And I think the cloud makes it much easier.”|

And I think the cloud makes it much easier.”

We hope you enjoyed this blog post. This being our first episode in the series, we are eager to hear your feedback, so please share your thoughts and ideas below.

The AI / ML Blog Team

Resources