Alternatives to Spark for In-Memory Distributed Computing

Apache Spark, the first substantial open source system for distributed in-memory processing (version 0.6 was released in November 2012), is gaining popularity worldwide. But it now has contenders.

First is the huge news that the once-commercial GridGain was open sourced last month.

GridGain gave a great presentation in Boulder two years ago, showing how it's based on Scala, how Scala code could be remotely executed, and how "in-memory data grids" can be treated as homogeneous, location-agnostic data stores, similar to Spark's RDDs. Unlike Spark, though, GridGain has been available (commercially) since 2007 and has been popular with Wall Street for real-time decision-making based on streaming data. Now it's available for free.
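To make "location-agnostic" a little more concrete, here is a toy, single-process sketch in Python. It is not GridGain's API or Spark's; the class name ToyDataGrid, its methods, and the prices are all made up for illustration, and each "node" is just a dict standing in for a separate machine. The point is only that the caller reads and writes by key without knowing which node holds the data, and that a computation can be run against each node's local data and then combined.

```python
import zlib


class ToyDataGrid:
    """Hypothetical stand-in for an in-memory data grid (not a real library API)."""

    def __init__(self, num_nodes):
        # Each "node" is a plain dict here; in a real grid it would be a separate host.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # A stable hash, so the same key always maps to the same node.
        return zlib.crc32(str(key).encode("utf-8")) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        # The caller never says where the data lives.
        return self.nodes[self._node_for(key)].get(key)

    def map_reduce(self, map_fn, reduce_fn, initial):
        # Apply map_fn to each node's local data, then fold the partial results together.
        result = initial
        for local_data in self.nodes:
            result = reduce_fn(result, map_fn(local_data))
        return result


grid = ToyDataGrid(num_nodes=3)
# Made-up sample data, for illustration only.
for symbol, price in [("AAPL", 100.0), ("GOOG", 200.0), ("MSFT", 50.0), ("IBM", 150.0)]:
    grid.put(symbol, price)

total = grid.map_reduce(lambda local: sum(local.values()), lambda a, b: a + b, 0.0)
print(grid.get("GOOG"), total)
```

A real grid would ship the map function to the remote nodes instead of looping over them locally, which is where the remote code execution mentioned above comes in.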

Stratosphere is new -- it's only up to version 0.4 right now -- but it has already caught the attention of the Apache Mahout team. Mahout's performance is currently limited by Map/Reduce, and the team is considering porting Mahout to Stratosphere in addition to porting it to Spark.

A master's thesis compared Spark with Stratosphere and found that Spark's performance degrades significantly when the data size exceeds the combined memory of the cluster.

The H2O Project, rather than offering an API for Java or another general-purpose language, interfaces with R and Excel to give data scientists easy access to distributed in-memory machine learning and statistics algorithms.

Then there are some older research projects, such as Piccolo from New York University in 2011 and RAMCloud from Stanford, also in 2011. There is also HaLoop, a caching mechanism from 2010 for Hadoop Map/Reduce that was funded by the National Science Foundation.

In-memory distributed computing got its start in the 2010 to 2011 timeframe, and it has now arrived. Spark 1.0 should be out in May, according to what Patrick Wendell said earlier this week, and alternative frameworks are popping up. Non-distributed computing, and even plain old Hadoop, will soon seem as antiquated as desktop GUI software.