What Parts of Data Science Will Get Automated

In response to last month's KDnuggets poll, "Data Scientists Automated and Unemployed by 2025?", I've seen various musings on which parts will get automated first. Some say data cleaning. Others say feature engineering. Some opined that the only thing data scientists would need to be concerned with would be familiarity with machine learning algorithms.

As I blogged a year ago in The End of Data Science As We Know It, the whole issue of data cleaning, which consumes 90% of the time spent doing data science, will eventually just vanish on its own. It could vanish for any of the following reasons:

  • As I suggested in The End of Data Science As We Know It, analytics is shifting from batch to streaming, and streaming data is clean by construction.
  • It will become a social taboo to generate data that is not machine-readable. A lot of messy data today comes from, for example, server logs or closed-source office productivity suites. Eventually, it will enter the collective consciousness that this is not acceptable. I am reminded of another similar computing social taboo. One of the hallmarks of early versions of BEA WebLogic (like, circa 2001) was that it had a web-based admin user interface. This was cool back then. But soon WebLogic users needed automation and demanded scripting-compatible tools. Today, no one would even think of making a server-side tool that wasn't command-line driven!
  • I'm playing with the definitions of words here, but data science is now a team sport and we are past the webmaster phase of data science. So even if neither of the above two reasons holds, and even if there is some residual data cleaning to do, it will be the job of the data engineer rather than of the data scientist.

But as for those who express the notion that this leaves only machine learning for data scientists, they couldn't be more wrong. First, that part of data science is already being automated. Second, the two most important bubbles from the four-bubble Data Science Venn Diagram are domain knowledge and social sciences. These two are the parts that will not be (or will be the last to be) automated.

Now, that's not to jettison knowledge of machine learning completely. While a lot of machine learning will be automated -- algorithms will be pre-written, industry-standard domain-specific features pre-engineered, and both of these automatically selected -- knowledge of how to apply machine learning to specific domains (and specific companies and specific circumstances within those companies) and to different social situations (customers, regulatory bodies, shareholders, other stakeholders, etc.) will require a human for a long time to come. Or at least until the singularity.



Wonderful and timely information. I agree with your analysis concerning automation in data science. I believe that automation will take a central position in data science and many parts of data scientists' jobs will be automated. However, the demand and salary for data scientists will continue to be on the high side. Data scientist jobs will also enjoy more prominence.