Flink Benchmarked 2-3x Faster Than Spark at Data Science Tasks

Last month, a group of German researchers published benchmarks comparing Apache Spark against Apache Flink, which was formerly the Stratosphere project. I've blogged about Stratosphere before as one of the alternatives to Spark.

The paper, Evaluating New Approaches of Big Data Analytics Frameworks, benchmarked three tasks: Word Count, K-Means, Relational Query, and PageRank. Flink was 2-3x faster than Spark on all the tasks except for Word Count. On Word Count, Spark was twice as fast as Flink.

There are a few caveats:

  • The version of Spark almost certainly did not include the improvements in Project Tungsten, which is being rolled out starting in Spark 1.4 (released June 12, 2015), with most of the improvements coming in Spark 1.5 and the remainder in Spark 1.6. The paper neglected to specify the version of Spark used (highly irregular for a benchmark) other than to say it was "2015". Given the paper's publication date of June 15, 2015, the version of Spark was thus probably either 1.2 (released December 18, 2014) or 1.3 (released March 13, 2015). (UPDATE 2015-07-07: the paper's author indicated in e-mail to me that Spark 1.2 was used.)
  • The paper attributes much of Stratosphere's high performance to its database-style query plan optimizer. Spark SQL does do some query plan optimization, with optimizer improvements incorporated into every release.
  • The paper authors may be more familiar with Stratosphere/Flink due to Stratosphere having originated in their home country of Germany, and thus may not be familiar with Spark performance optimization techniques.
  • Spark is more mature. Besides providing, as the paper authors point out, more facilities such as the REPL shell for interactive data science use, maturity may also result in fewer bugs that lead to erroneous results as Spark 1.0 had.

One key optimization from Flink that is not currently in Spark's foreseeable future (i.e. not even planned for Project Tungsten) is cross-iteration query optimization. Currently, Spark must materialize its data from one iteration to the next, whereas Flink can be lazy between iterations.

The Spark Project Tungsten performance improvement approaches read very similarly to Flink's approaches. One might suspect that Project Tungsten was inspired by some of Flink's techniques. In any case, it seems Flink is putting pressure on the Spark core developers to improve performance, and that benefits us all.