Mahout 1.0: not just built on Spark, it subsumes Spark

This evening, Ted Dunning of MapR, a major contributor to Apache Mahout, showed the future of Mahout.

We have known since March that Mahout is being ported from Map/Reduce to Spark, as well as to other in-memory grid platforms. But what wasn't clear from those press releases was how deep the integration of Mahout into Spark would be. Just as the Spark Shell builds upon the Scala Shell, Mahout 1.0 will have a Mahout Shell that builds upon the Spark Shell.

A shell, also known as a REPL (read-eval-print loop, a term from LISP half a century ago), provides interactivity like that of R, Matlab, and the iPython Notebook. Someone in the audience asked whether it is possible to paste in a bunch of lines all at once, and Ted Dunning jokingly answered, "yes, that's called a program." It is indeed possible to paste into a REPL, but the interactivity, immediate feedback, and fast iterative development are its real power.
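To make the REPL idea concrete, here is a hypothetical sketch of the kind of session the Mahout Shell is meant to enable, written against the R-like Scala DSL that the Mahout-on-Spark work introduces. The imports and the `dense`, `drmParallelize`, `t`, and `%*%` names are my assumption of what the 1.0 shell will expose (based on the Scala bindings in the Mahout codebase), not something shown in the talk.

```scala
// Typed line by line inside the Mahout/Spark shell (a Scala REPL).
// The shell is assumed to pre-load these imports and to provide an
// implicit Spark distributed context, so drmParallelize just works.
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._

// A small in-core matrix, entered interactively.
val a = dense((1.0, 2.0, 3.0),
              (3.0, 4.0, 5.0))

// Wrap it as a Distributed Row Matrix (DRM) backed by Spark.
val drmA = drmParallelize(a, numPartitions = 2)

// R/Matlab-style linear algebra that runs on the cluster: A' * A.
val drmAtA = drmA.t %*% drmA

// Pull the (small) result back into the REPL to inspect it immediately.
val ata = drmAtA.collect
println(ata)
```

The point of the example is the workflow, not the math: each line gives an immediate result you can inspect before typing the next, which is exactly the interactivity that pasting a whole program at once doesn't give you.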

Now let's just hope that some of the problems with the Spark REPL get fixed, because I fear they would also be problems with the Mahout REPL.

Other highlights from Ted Dunning's talk:

  • The big advantage of Mahout over R and the like is the ability to handle cluster-scale data sets.
  • H2O can be up to 20 times as fast as Spark on some machine learning algorithms, which is why Mahout 1.0 is targeting H2O in addition to Spark (and also Stratosphere).
  • Ted didn't see much point to MLBase, which purports to reduce the number of knobs in machine learning and make machine learning applications easier to code; in his view, the big task is data cleansing, and getting clean data is the hard job.
  • Ted observed that the BDAS machine learning stack hasn't seen much development progress over the past year, due to the focus on other parts of the ever-growing BDAS stack.
  • In the field, Mahout is most commonly used for recommendation engines, with clustering a distant second.
  • And finally, from the presentation that preceded Ted's this evening, by Joe McTee of Tendril: the Canopy algorithm is a good way to seed K-Means with an initial set of centroids (a minimal sketch of the idea follows below).
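For anyone unfamiliar with canopy seeding, here is my own minimal, single-machine sketch in plain Scala, not code from either talk or from Mahout's implementation. The full canopy algorithm uses two distance thresholds T1 > T2 (points within the looser T1 join a canopy, points within the tighter T2 stop being candidates for new canopies); since only the centers matter for seeding K-Means, the sketch below simplifies to the tighter threshold alone.

```scala
// Minimal sketch of canopy-style seeding for K-Means.
// Assumes Euclidean distance; names and structure are illustrative only.
object CanopySeeding {

  type Point = Array[Double]

  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /** One cheap pass over the data: a point farther than t2 from every
    * existing canopy center becomes a new center. The resulting centers
    * are handed to K-Means as its initial centroids. */
  def canopyCenters(points: Seq[Point], t2: Double): Seq[Point] = {
    val centers = scala.collection.mutable.ArrayBuffer.empty[Point]
    for (p <- points) {
      if (!centers.exists(c => distance(c, p) <= t2))
        centers += p
    }
    centers.toSeq
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Array(1.0, 1.0), Array(1.2, 0.9), Array(0.8, 1.1), // cluster A
      Array(8.0, 8.0), Array(8.3, 7.9), Array(7.7, 8.2)  // cluster B
    )
    val seeds = canopyCenters(data, t2 = 2.0)
    println(s"K-Means would be seeded with k=${seeds.length} centroids:")
    seeds.foreach(s => println(s.mkString("(", ", ", ")")))
  }
}
```

The appeal is that the pass is cheap (one scan, no iteration), yet it gives K-Means both a sensible k and starting centroids that are spread out across the data rather than drawn at random.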