michaelmalak's blog

Semantic Similarity Metrics

Data Science is more than just statistics and machine learning on numbers. A lot of data is "unstructured," which means text (or worse, both text and numbers). While natural language processing has been around for half a century, its importance in Big Data and Data Science is growing, and it can no longer be ignored by anyone who wants to maintain a competitive advantage.

There is a planet full of tools, and herein I describe one grain of sand out of that planet: Semantic Similarity Metrics.

Intel's FPGA+Xeon for Data Streaming

This past week, Intel announced that a future Xeon would have an FPGA integrated on the chip and would still plug into a standard CPU socket.

This was reported across various blogs and news outlets, but little attention was paid to what it could actually be used for. In the popular press, the FPGA seems to be thought of as an odd cousin to the GPU, sometimes useful for Bitcoin mining and cracking encryption.

Meta Data Science

When we practice data science, even if we've done everything correctly and in an unbiased manner, how do we know that our message has been correctly and fully received?

Every human communication goes through a "noisy channel" as illustrated below (image is from idealliance.org).

Apache Spark 1.0 almost here. Is it ready with 16 (*) "unresolved blockers" in Jira? (* UPDATED x2)

UPDATE 2014-05-20: Matei Zaharia commented on the issue of combining map() and lookup(), stating that it's not within the current design to allow nested RDD functions, and that it's a feature he'd like to see added in the future. In the same Jira ticket, I had posted a workaround using join() which works fine.
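To illustrate the general shape of that kind of workaround, here is a minimal sketch in plain Python, with lists of (key, value) pairs standing in for RDDs. It is not the code from the Jira ticket and it does not use Spark; the `join` helper and the sample data are hypothetical. The idea is that instead of calling a lookup on one dataset from inside a map over another (a nested operation), both datasets are keyed the same way and combined with a single join.

```python
from collections import defaultdict

def join(left, right):
    """Inner join of two lists of (key, value) pairs.

    Mimics the result shape of a pair-RDD join: (key, (left_value, right_value)).
    This is a hypothetical stand-in, not Spark's implementation.
    """
    # Index the right-hand side once, so no per-element lookup is nested
    # inside the pass over the left-hand side.
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in index.get(key, [])
    ]

# Hypothetical sample data: per-user counts and a per-user attribute.
orders = [("alice", 3), ("bob", 5), ("alice", 7)]
cities = [("alice", "Denver"), ("bob", "Boulder")]

joined = join(orders, cities)
print(joined)
```

Keys that appear on only one side are dropped, as in an inner join, so any element whose lookup would have come back empty simply does not appear in the result.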