Spark 1.3: Stop Using RDDs

At the Bay Area Spark Meetup held at Databricks on January 13, 2015 (slides), the Databricks folks announced a major shift from RDDs to SchemaRDDs -- yes, those things from Spark SQL. SchemaRDDs carry type information along with them instead of just naked data, which allows the Spark core to do database-style query planning and optimization. SchemaRDDs also have built-in support for communicating with various data sources via the spark.sql.sources API introduced in Spark 1.2.
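To make the difference concrete, here is a minimal spark-shell-style sketch against the Spark 1.2-era API (the Person case class and the sample rows are placeholders of mine): a plain RDD is just opaque objects, but the createSchemaRDD implicit conversion turns an RDD of case classes into a SchemaRDD whose column names and types are visible to the engine.

```scala
import org.apache.spark.sql.SQLContext

// Hypothetical record type; any case class works via Scala reflection.
case class Person(name: String, age: Int)

// In spark-shell, sc is already in scope.
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD // implicit RDD[Person] => SchemaRDD

// A plain RDD: Spark sees only opaque Person objects.
val people = sc.parallelize(Seq(Person("Ada", 36), Person("Grace", 45)))

// As a SchemaRDD, column names and types are known to the engine,
// so it can plan and optimize the query like a database would.
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 40").collect().foreach(println)
```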

Despite the name, and despite its Spark SQL origin, the plan is to revamp all of Spark under the covers to use SchemaRDD instead of RDD. In the video, Reynold Xin explains that, as of Spark 1.2, MLlib already uses SchemaRDD throughout, including in its public APIs. In Spark 1.3 and beyond, SchemaRDD will make its way into other Spark APIs and modules.
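The spark.ml Pipeline API introduced in 1.2 is the most visible example: Pipeline.fit() takes a SchemaRDD, not a plain RDD. A rough sketch in the style of the Spark 1.2 examples, assuming sc, sqlContext, and the createSchemaRDD import from the sketch above are in scope (the LabeledDocument case class and sample data are placeholders of mine):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Placeholder training schema; the column names feed the stages below.
case class LabeledDocument(id: Long, text: String, label: Double)

val training = sc.parallelize(Seq(
  LabeledDocument(0L, "spark schemardd dataframe", 1.0),
  LabeledDocument(1L, "mapreduce hadoop", 0.0)))

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit() accepts the SchemaRDD directly -- the implicit conversion
// from RDD[LabeledDocument] applies here.
val model = pipeline.fit(training)
```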

The concept is that SchemaRDD will be the basis of a class similar in function to data frames from R and IPython Notebook (pandas), and thus more familiar to data scientists and easier for them to work with.
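SchemaRDD already offers a taste of this in its experimental language-integrated query DSL, which reads a lot like data-frame operations. A minimal sketch against the Spark 1.2-era API, continuing with the hypothetical people SchemaRDD from the first sketch:

```scala
// The implicits that turn Scala symbols like 'age into column
// references come from the SQLContext import.
import sqlContext._

// Data-frame-style filtering and projection, no SQL string needed.
val over40 = people.where('age > 40).select('name)
over40.collect().foreach(println)
```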

Other announcements from that Meetup:

  • SparkR: An AMPLab project, not a Databricks or Apache project. Some licensing issues are hindering its adoption into the main Apache Spark project, but the plan for 2015 is either to overcome those hurdles or to set it up as its own project.
  • YARN: Databricks is spending no time on improving this and is deferring to Cloudera for any further enhancements.
  • MLlib: The Pipeline API introduced in 1.2 (sketched above) will gain more capabilities in 1.3 and beyond to simplify the creation of machine learning processing pipelines.
  • SparkSQL: Even more data connectors, such as to HBase, though some will be developed by third parties and not all will be bundled with the main Apache Spark project (see the data source sketch after this list).
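Those connectors plug in through the spark.sql.sources API mentioned earlier. A sketch of how a source is wired up in Spark 1.2, using the built-in Parquet provider (the table name and path are placeholders of mine); a third-party connector would be registered the same way by swapping in its provider class in the USING clause:

```scala
// Register an external data source as a temporary table via the
// spark.sql.sources API. USING names the provider class; the
// Parquet source ships with Spark 1.2.
sqlContext.sql("""
  CREATE TEMPORARY TABLE people_pq
  USING org.apache.spark.sql.parquet
  OPTIONS (path "examples/src/main/resources/people.parquet")
""")

sqlContext.sql("SELECT name FROM people_pq").collect().foreach(println)
```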

There was talk of fostering a community, like those around R and IPython Notebook, that produces lots of useful packages everyone uses. They pointed to http://spark-packages.org/, but in my opinion, a Spark Package Manager is what's really needed to juice the widespread creation and adoption of Spark packages.

UPDATE 2015-01-29: Besides carrying data types along with the data, SchemaRDD has another advantage: columnar compression in RAM and native serialization to Parquet.
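Both are one-liners in the Spark 1.2-era API. A sketch, continuing with the hypothetical people table registered in the first sketch:

```scala
// Caching a table stores it in an in-memory columnar format; compression
// is governed by spark.sql.inMemoryColumnarStorage.compressed.
sqlContext.cacheTable("people")

// Any SchemaRDD serializes natively to Parquet, preserving the schema...
people.saveAsParquetFile("people.parquet")

// ...and reads back as a SchemaRDD with the schema intact.
val restored = sqlContext.parquetFile("people.parquet")
```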