michaelmalak's blog

Entropy vs. Value

In thermodynamics, entropy measures disorder, or the amount of unuseful dispersed heat.

In information theory, entropy measures how difficult it is to ZIP a file, i.e. how poorly a lossless compressor performs. This is because Claude Shannon based his definition of entropy upon the probabilities of seeing various outcomes such as various strings of bits.

17 Qualities of the Ideal Recommender System

When constructing a recommender system and selecting algorithms, there is more to consider than just "accuracy". The most "accurate" recommender system would recommend the same items (whether those "items" are books, websites, options available to a software end user, etc.) over and over again, focused on a narrow topic area, and ignorant of context. Below are features of various recommender systems that, if combined, would perhaps form the ideal recommender system to produce "useful" rather than "accurate" results.

Active vs Passive Data Variety

The "3 V's" are typically portrayed as a problem to be solved by Big Data and Data Science. The trite knee-jerk response is that they're not problems but opportunities.

But what about taking that trite statement to heart? If we did that, we would seek out additional data, rather than be content with the Big Data that happens to be in our Hadoop or that happens to be streaming in from Spark Streaming!

Data Fusion is Data Destruction

Data Fusion originated as a technique from the 1980's defense industry where the goal was to, when doing digital intelligence, to come up with a single, unified truth -- e.g. combining information from multiple sources the belief is that enemy tank serial number A123 is at location X.

Semantic Similarity Metrics

Data Science is more than just statistics and machine learning on numbers. A lot of data is "unstructured," which means text (or worse, both text and numbers). While natural language processing has been around for half a century, its importance in the fields of Big Data and Data Science is growing and can no longer be ignored if one is to maintain competitive advantage.

There is a planet full of tools, and herein I describe one grain of sand out of that planet: Semantic Similarity Metrics.