Cowboy Data Science
On 3 Feb, 2015 By michaelmalak
Meme from Steve Temple
The opposite of "cowboy coding" is having a solid process, which includes unit testing. Data Science has matured to the point that we should look to add unit testing as standard practice and start to expect it.
The challenges are:
- The first phase of any data science project is data exploration, and unit testing is inappropriate for that phase. The problem is that the line between exploration and the next phase, analysis, is blurry, and code developed to support exploration usually makes its way into analysis. Trouble arrives when subsequent questions are asked and this once-exploratory code gets tweaked and retweaked, with a great chance of breaking original assumptions and introducing bugs. Overcoming this challenge is mostly a matter of discipline: changing the attitude from "I'm a data scientist, not a unit tester" to "Can I still validly claim to be in the exploratory phase?"
- Many tools, such as Excel, Hive, and the various command-line tools, do not currently have unit test frameworks. Other interactive tools that support data exploration well, such as IPython Notebook and the Spark Shell REPL, do have unit test frameworks available for their underlying languages (Python and Scala, respectively), but as mentioned above, the momentum of having started exploration in one of these tools leads data scientists down the road of never implementing unit tests (until, possibly, the phase of integration with true production code). R and Cascading are notable exceptions in the Data Science universe that do have unit testing frameworks available.
- Even if tools do have unit testing frameworks, putting them under a common build or other test execution/automation framework is challenging. For example, for R code, one could use a JVM implementation of R in order to get it executing under the Maven umbrella, but that introduces its own set of complications.
A good data science unit test loads known data, executes the transformations and algorithms, and compares the actual results against expected results. If, after several iterations of tweaking code to answer new questions, the actual results of various intermediate processing stages diverge from the expected results, the unit tests will flag the divergence, preventing the reporting of erroneous results.
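As a minimal sketch of that pattern in Python (the function `mean_center` and its test are hypothetical stand-ins for a real transformation stage, not code from any particular project):

```python
import unittest

# Hypothetical transformation stage -- illustrative only. A real project
# would test its actual cleaning, feature-engineering, or modeling code.
def mean_center(values):
    """Subtract the mean from each value, a stand-in for one
    intermediate processing stage in a pipeline."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

class MeanCenterTest(unittest.TestCase):
    def test_known_data(self):
        # Load known data, execute the transformation,
        # and compare against expected results.
        known_input = [1.0, 2.0, 3.0]
        expected = [-1.0, 0.0, 1.0]
        for actual, exp in zip(mean_center(known_input), expected):
            self.assertAlmostEqual(actual, exp)
```

Run it with `python -m unittest`. If a later tweak to the transformation silently changes its output for the known data, this test fails instead of the erroneous numbers flowing through to a report.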
Plus there are all the other usual benefits of unit tests:
- Validates from the outset that the processing and algorithms are behaving as expected
- Serves as documentation for readers of the code, showing what the expected inputs and outputs are
- Provides confidence during refactoring