Beyond GraphX: graphs for Spark

This week Databricks announced GraphFrames, a library posted to spark-packages.org that is based on Spark SQL DataFrames rather than RDDs (as GraphX is). GraphFrames is still a work in progress -- it is currently at version 0.1 -- but it already provides interoperability with GraphX: graphs can be converted back and forth between the two representations.
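As a minimal sketch of that round trip -- assuming Spark 1.6 with the graphframes package on the classpath, and using the `toGraphX`/`fromGraphX` method names as I understand the 0.1 API (the toy vertex and edge data are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.graphframes.GraphFrame

val sc = new SparkContext(
  new SparkConf().setAppName("gf-interop").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Toy graph: a GraphFrame is just a pair of DataFrames with
// conventional column names ("id" for vertices, "src"/"dst" for edges)
val v = sqlContext.createDataFrame(Seq(
  ("a", "Alice"), ("b", "Bob"))).toDF("id", "name")
val e = sqlContext.createDataFrame(Seq(
  ("a", "b", "follows"))).toDF("src", "dst", "relationship")
val g = GraphFrame(v, e)

// GraphFrame -> GraphX Graph, and back again
val gx = g.toGraphX
val g2  = GraphFrame.fromGraphX(gx)
```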

GraphFrames provides the graph querying capability that GraphX always had trouble with. Because it is built on Spark SQL DataFrames, GraphFrames lets you query graphs using SQL. Plus, GraphFrames sports a subset of Cypher, the query language from Neo4j.
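Here is a sketch of both query styles -- motif finding with the Cypher-like syntax, and plain Spark SQL over the underlying edge DataFrame. It assumes Spark 1.6 with the graphframes package available; the toy social graph is invented for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.graphframes.GraphFrame

val sc = new SparkContext(
  new SparkConf().setAppName("gf-query").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Toy social graph: Alice and Bob follow each other; Bob follows Carol
val v = sqlContext.createDataFrame(Seq(
  ("a", "Alice"), ("b", "Bob"), ("c", "Carol"))).toDF("id", "name")
val e = sqlContext.createDataFrame(Seq(
  ("a", "b", "follows"), ("b", "a", "follows"), ("b", "c", "follows")))
  .toDF("src", "dst", "relationship")
val g = GraphFrame(v, e)

// Cypher-like motif finding: vertex pairs that follow each other
val mutual = g.find("(x)-[]->(y); (y)-[]->(x)")
mutual.show()

// Plain Spark SQL over the edges DataFrame
g.edges.registerTempTable("edges")
sqlContext.sql(
  "SELECT src, COUNT(*) AS outDegree FROM edges GROUP BY src").show()
```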

I describe GraphFrames and provide some interesting examples in chapter 10 of my book, which was just released to the MEAP (Manning Early Access Program) this week.

GraphFrames is also performant due to the two optimization layers built into Spark SQL: Catalyst and Tungsten. Catalyst is an RDBMS-style query plan optimizer, while Tungsten leverages the sun.misc.Unsafe API to access off-heap memory directly, bypassing the JVM heap (and thus avoiding garbage collection). Tungsten also performs code generation, emitting JVM bytecode on the fly to access Tungsten-laid-out memory structures in a maximally efficient manner. One of the examples in my book shows an 8x speedup compared to the GraphX version.
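You can watch Catalyst at work on any DataFrame query, including ones generated by GraphFrames, by asking for the query plan. A small sketch (the DataFrame here is invented for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(
  new SparkConf().setAppName("catalyst-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

val df = sqlContext.createDataFrame(Seq(("a", 1), ("b", 2))).toDF("id", "n")

// explain(true) prints the logical plan before and after Catalyst's
// optimizations, plus the physical plan that will actually execute
df.filter("n > 1").select("id").explain(true)
```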

And, with a hat tip to Andy Petrella, author of Spark Notebook: GraphFrames is not the only new graph library published on spark-packages.org. There are also:

  • Spark Centrality - Library for computing centrality for graph nodes
  • spark-betweenness - k Betweenness Centrality algorithm for Spark using GraphX
  • sparkling-graph - Large scale, distributed graph processing made easy! Load your graph from multiple formats and compute measures (and more)