michaelmalak's blog

Spark Streaming 1.6: Stop Using updateStateByKey()

Last night, Tathagata Das resolved SPARK-11290, "Implement trackStateByKey for improved state management", which will bring a 7x performance improvement to Spark Streaming when Spark 1.6 is released in December 2015.

trackStateByKey() offers three benefits over updateStateByKey(), which has served as the workhorse of Spark Streaming since its inception in 2012:

39 Machine Learning Libraries for Spark, Categorized

Apache Spark itself

1. MLlib


Spark originally came out of Berkeley's AMPLab, and even today AMPLab projects, though not part of the Apache Spark project itself, enjoy a status a bit above that of your everyday GitHub project.

ML Base

Spark's own MLlib forms the bottom layer of the three-layer ML Base, with MLI being the middle layer and ML Optimizer being the most abstract layer.

Data Science Overtaking The Data Scientist

The data scientist is dead. Long live data science!

Well, not dead, but certainly dying. Up until late 2012, the Google search popularity for "data scientist" tracked that for "data science", but it has sagged since then.

This trend is even confirmed, though to a lesser degree, in Indeed.com job postings:

Why is this? I can think of three possible reasons: