Minor earthquake this past week: Classic Shark gone

Running SQL on Hadoop data has long been treated as a holy grail, and Apache Spark has finally gained the mainstream press coverage it deserves, thanks in part to its 1.0 release. So SQL on Spark should be the hottest thing going, right?

Well, there was a quiet earthquake on that front this past week. For the past two years, Shark has been the SQL-on-Spark solution. But in a May 26 message, major Spark contributor Michael Armbrust laid out the roadmap for SQL on Spark, and it's based not on Shark as we have known it but rather on Spark SQL.

Whereas Shark was never an Apache project, nor part of the codebase that went into Apache Spark when Spark joined the Apache foundation, Spark SQL is already included in yesterday's Spark 1.0 release.

Shark, being a shim JAR in the midst of the Hive stack, was tied too tightly to particular versions of Hive and always ended up compatible only with obsolete versions of Hive. In contrast, Spark SQL, formerly known as Catalyst, is a separate layer that can either make use of Hive or access compressed columnar formats like Parquet directly, with the results ending up in a new datatype called SchemaRDD. Thus, like Shark, Spark SQL fulfills the fundamental mission of caching results in an RDD. But Spark SQL also adds a Scala-based querying mechanism, so that all code can remain in Scala without needing a combination of Scala and SQL; the sketch below shows both styles side by side.
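To make that concrete, here is a minimal sketch against the Spark 1.0 SQL API as documented at release. The Person case class, the sample data, and the local[*] master setting are illustrative assumptions on my part; SQLContext, registerAsTable, sql(), and the where/select DSL are the documented 1.0 entry points.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Illustrative case class; any RDD of case classes gets a schema the same way.
case class Person(name: String, age: Int)

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spark-sql-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext._ // brings in createSchemaRDD, sql(), and the DSL implicits

    // An ordinary RDD of case classes is implicitly converted to a SchemaRDD.
    val people = sc.parallelize(Seq(Person("Alice", 34), Person("Bob", 19)))

    // Style 1: register the SchemaRDD as a table and query it with plain SQL.
    people.registerAsTable("people")
    val adultsSql = sql("SELECT name FROM people WHERE age >= 21")

    // Style 2: stay entirely in Scala via the language-integrated query DSL.
    val adultsDsl = people.where('age >= 21).select('name)

    // Either result is a SchemaRDD, so it can be cached and reused like any RDD.
    adultsSql.cache()
    adultsDsl.collect().foreach(println)

    sc.stop()
  }
}
```

Either path ends in the same place: a SchemaRDD that Spark can cache in memory, which is the fundamental mission Shark served.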

Now it appears that Shark will live on, but in name only, with Catalyst/Spark SQL as the new architecture underneath.

The slides from Michael Armbrust's April 8, 2014 presentation are embedded below.

Link to slideshow: PDF Slideshow