Your Data Science Portfolio: Math Skills Don't Matter

TL;DR - A "Data Scientist" is a data pipeline plumber. Analytics are icing, not cake.

 

This article is written specifically for unemployed and underemployed graduates of math-intensive subjects like physics and statistics. Others may have more to prove.

 

After writing my introductory reviews of ETL and visualization, I was going to write something about algorithms and analysis. Then it dawned on me: beyond proving that I'm not completely brain dead, my math skills NEVER helped me get a job.

 

Don't Drink The Kool-Aid

Predictive analytics is supposed to be the heart of a Data Scientist's work. It's a lie. Numbers are a figment of our imagination and math is a powerless spectre. It's the code that does the work. Computer Scientists are good at math too, so unless you have domain-specific expertise, the contribution your 'analytics' skills make to the team is questionable at best.

The Cost of Unoptimized Code Adds Up

Computer time is money, which means that depending on how often it's run, your MATLAB code and R scripts may need to be rewritten in a language that can be optimized to run 10-1000x faster (Python, Julia, Go, Fortran, or C).
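To put a rough number on "computer time is money", here is a back-of-the-envelope sketch in Python. The function name, the runtime, the schedule, and the hourly price are all made-up assumptions for illustration, not measurements from any real job.

```python
def yearly_compute_cost(runtime_hours, runs_per_day, dollars_per_hour):
    """Rough yearly cost of a scheduled job. All inputs are hypothetical."""
    return runtime_hours * runs_per_day * 365 * dollars_per_hour


# Assumed scenario: a 30-minute script, run hourly, on a $0.50/hour machine.
slow = yearly_compute_cost(0.5, 24, 0.50)

# The same job after an assumed 100x speedup from a rewrite.
fast = yearly_compute_cost(0.5 / 100, 24, 0.50)

print(f"before rewrite: ${slow:,.2f} per year")
print(f"after rewrite:  ${fast:,.2f} per year")
```

The point is only the scaling: the more often a script runs, the more a speedup is worth, and a script that runs once a quarter may never justify a rewrite.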

 

Juiceless Lemons

In building the analytics section of your portfolio, the bare minimum is enough. You only need to show basic competency with the most popular models - see Rexer's list in a previous post. Placing dead last in a Kaggle competition shows awareness, initiative, and basic competency. Trying out different types of algorithms yields diminishing returns. Spending hours tweaking parameters is strictly hobby time. If you think that you can beat Ivy League researchers at their own game, by all means, set your eyes on the prize. Otherwise, be happy to see your name at the bottom of the Kaggle standings and spend your time learning ETL instead.

 

Script Kiddies > Actuarial Scientists

The "Data Scientist" who covets math skills as a jewel puts themselves in a precarious position: as soon as a company has its most important scripts written, the worker becomes redundant. As GitHub's code grows to cover more situations, employment opportunities shrink. Years from now, script aggregating tools will be so sophisticated that actuaries will find themselves competing against kids who scorn certification and degrees. Excel jockeys who can't read JavaScript will be put out to pasture. Don't get caught on the wrong side of the fence when "Data Science" dies. As Miko Matsumura says, "be the developer, programmer, or entrepreneur."

 

Context to The Rescue

As I've said before, one of the beautiful things about working with data is that it provides concrete context. An employee who intimately understands the context of the company's data is indispensable. They are able to take one glimpse at a report and say "something's wrong here", cutting through hours or weeks of an analyst's work. Statistical models are built on pyramids of assumptions, and assumptions are famously brittle. As the sun sets on the Data Science marketing hype, the success of your transition into a new position will depend on how well you understand the intricacies of how your industry's data relates to the real world.

 

The Map Is Not The Territory; There's No Such Thing As Raw Data.

The "concrete context" belongs to the domain the data sits in. "Raw data" has the closest connection to real devices taking real measurements, but it isn't really raw - it's numbers, not things. Numbers aren't real. Every time we summarize or aggregate data, abstractions push the context up the pyramid of assumptions, further away from our physical realm.
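A small Python sketch of what aggregation can hide. The sensor readings are invented for illustration: two series with the same average, only one of which describes a working device.

```python
from statistics import mean

# Hypothetical hourly temperature readings (deg C) from two sensors.
steady = [20.0, 20.0, 20.0, 20.0, 20.0, 20.0]
faulty = [20.0, 20.0, 20.0, 20.0, -40.0, 80.0]  # one dropout, one spike

for name, readings in [("steady", steady), ("faulty", faulty)]:
    # The mean alone cannot tell these two sensors apart; min and max can.
    print(name, "mean:", mean(readings),
          "min:", min(readings), "max:", max(readings))
```

Anyone who only sees the averaged report has already lost the context needed to spot the broken sensor.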

 

Asset or Liability?

Those who are unable to compensate for each assumption as it shifts or breaks down quickly become a liability. The annals of Wall Street are full of stories like Merton and Scholes' Long-Term Capital Management (LTCM). The data they were analyzing was information about information, reports about reports. They didn't realize their models and abstractions pushed them too far out of context to make sound decisions. High on ego, they put total faith in their formula and doubled down on debt when they should have hedged with something more stable. A novice mortgage broker could have seen their insanity.

The problem isn't a lack of mathematical acuity - Merton and Scholes helped invent the Nobel Prize-winning formula their company was exploiting. The problem is that the data traders swim in is only tenuously connected to reality. Financial analysts are unable to understand the assumptions that are baked into the many formulas used to build Wall Street's house of cards.

 

Plumbers, Not Tinkerers

When you strip away all of the hype and jargon that stems from differences in hardware and software, Data Science is fundamentally about one thing: building data pipelines. Most of our intricate problems of analysis have been solved, certainly within the 80/20 margins, by a myriad of open source software and commodity hardware. As niche markets for specialists shrink, a wise tinkerer will round out his skill set and become a pipeline plumber to stay relevant.
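For readers who have never built one, here is a minimal sketch of a data pipeline in Python, using only the standard library. The CSV content, the table name, and the function names are hypothetical; a real pipeline would read from files or an API and write to a production database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """city,temp_f
Boston,68
Denver,
Austin,95
"""


def extract(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with missing readings and convert Fahrenheit to Celsius."""
    clean = []
    for row in rows:
        if not row["temp_f"]:
            continue  # skip incomplete records
        temp_c = round((float(row["temp_f"]) - 32) * 5 / 9, 1)
        clean.append((row["city"], temp_c))
    return clean


def load(records, conn):
    """Write cleaned records to a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS temps (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO temps VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute("SELECT city, temp_c FROM temps ORDER BY city").fetchall()
print(result)
```

Nothing here requires advanced math. The work is in knowing that the Denver row is incomplete and deciding what to do about it.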

 

Three Pronged Portfolio

To summarize, there are three major components of a comprehensive Data Science portfolio. Here is an example that should be received with serious consideration by any Big Data company.

Transform: GitHub scripts for open data curation. Mailing lists 3.0.

Visualize: Examples of the canonical plots from the Visualization Zoo. Matplotlib is just fine.

Model: Kaggle competition. Bottom of the standings.