4 reasons why Spark could jolt Hadoop into hyperdrive

Apache Spark has been winning over users since it was developed at the AMPLab at the University of California, Berkeley, in 2009, but it has taken on a whole new level of popularity in the last year. All of the major Hadoop distributions now support it, it’s a top-level Apache Software Foundation project and there’s a startup, called Databricks, dedicated to productizing, supporting and certifying Spark. Matei Zaharia, one of the creators of Spark and the co-founder and CTO of Databricks, came on the Structure Show podcast this week to talk about what Spark is and why people love it so much.

Here are the highlights of that interview, but anyone interested in the history and capabilities of Spark, or where the big data industry might be heading, will want to hear the whole thing. The second annual Spark Summit also kicks off Monday in San Francisco, should anyone want to plan a last-minute trip.

Spark is fast

“Basically, [Spark is] based on seeing how some of the earliest users were using MapReduce and seeing the problems they had, and trying to improve on the model to solve those problems,” Zaharia explained.

He continued: “The thing that got us started was some users of Hadoop at UC Berkeley actually, in our lab who were doing machine learning, wanted to run the algorithms at scale on Hadoop and they ran them and they said, ‘Well, actually, because it’s doing all these scans over the data this is slower than me running it on my laptop. So can we design a distributed execution engine that can actually scale these out?’ … As we went along, we started covering other use cases beyond machine learning, as well.”

Spark is best known for being much faster than MapReduce because it processes data in memory, but it can still be 5 to 10 times faster than MapReduce when running on disk, depending on the workload. Zaharia said the goal was to let users write the same program and run it anywhere.

Databricks co-founder and CTO Matei Zaharia. Source: Databricks

Spark is flexible

Hadoop really is a revolutionary technology, but that is probably more because it lets users store unprecedented amounts of data at a far lower cost than was previously possible than because MapReduce is the best thing since sliced bread. “Spark actually extends and generalizes the MapReduce execution model to be able to do more types of computations more efficiently,” Zaharia said.

“OK, now that you’ve stored a bunch of data, what can you compute with it?” he continued later. “And the MapReduce model initially was designed for these batch jobs that existed at web companies that they need to run once a night, so it was fine for that. And after that people wanted to do more and more things.