Cognitive Bias in Data Science

A Business Insider infographic, "20 cognitive biases that screw up your decisions", went viral this past week. Each of those 20 biases can negatively affect data scientists in their work:

  1. Anchoring bias. A data scientist might find an interesting result in early exploration and then ignore other possible results or, worse, ignore conflicting information.
  2. Availability heuristic. As I blogged in "If All Your Data is Big Data, You May Not Be a Complete Data Scientist", data scientists all too often rely only on pre-collected data and do not conduct randomized controlled experiments of their own. Even without experiments (which are extremely expensive and time-consuming), as I described in "Active vs Passive Data Variety", a data scientist may rely only on the data at hand and not seek out additional sources of pre-collected data.
  3. Bandwagon effect. A data scientist may have heard of a particular phenomenon (say, the wrong shade of blue causing people to abandon a website) and try to reproduce it for the immediate client.
  4. Blind-spot bias. As I explained in "Hypothesis formation", a data scientist has to really work to develop a sufficient number of quality hypotheses.
  5. Choice-supportive bias. A data scientist may become too wed to an initial result, visualization, or machine learning technique.
  6. Clustering illusion. The world of data science is replete with spurious correlation.
  7. Confirmation bias. There is no room for preconceived notions in the world of data science.
  8. Conservatism bias. Data can sometimes reveal emerging trends, e.g. about customers and their habits, that are so unexpected and counter to past behavior that they are difficult to accept. Data scientists need to follow the data.
  9. Information bias. Data scientists should expend time and energy only when there is a possibility of producing an actionable result, where actionable includes no insurmountable financial or corporate culture barriers to implementation.
  10. Ostrich effect. Just this past week, a new study on Paxil showed that the original 2001 study had opted to code suicidal thoughts as "emotional lability", thus hiding the seriousness of the drug's side effects. This can be partly alleviated by data scientists always making all of their data and techniques available. This is easier now in the era of "notebooks" (Jupyter, Databricks, etc.).
  11. Outcome bias. If the CEO makes a risky gamble with the company, it's not the job of the data scientist the CEO hires to retroactively justify that decision.
  12. Overconfidence. A data scientist should never assume that he/she understands everything about a domain, or that the data science tools available and known to him/her are sufficient.
  13. Placebo effect. In a UX A/B comparison, a data scientist needs to ensure the test population of users is unaware of which option is the "old" one and which is the "new" one or, worse, if the test users are internal, which option their manager or CEO prefers.
  14. Pro-innovation bias. If a data scientist comes up with an actionable result, the data scientist should ensure a quantitative benefit can be established for taking that action.
  15. Recency. When an anomaly occurs, a data scientist needs to assess its relevance and causes rather than automatically weighting it more heavily than older data.
  16. Salience. A data scientist needs to focus on results that carry the greatest quantifiable impact, not the ones that carry a captivating narrative.
  17. Selective perception. If by applying data science a data scientist uncovers flaws somewhere in a system or organization, the data scientist needs to apply such rigor uniformly across the entire system or organization.
  18. Stereotyping. If doing customer segmentation, stereotyping can actually help the data scientist form hypotheses (e.g. households with children may be more likely to purchase toys), but these hypotheses need to be tested (e.g. the data may show grandparents buy a lot of toys too).
  19. Survivorship bias. A data scientist performing churn analysis on current data may miss the reasons why customers left a year ago, when a major website change was made.
  20. Zero-risk bias. A data scientist should not gravitate toward a high-confidence low-impact actionable result over a medium-confidence high-impact actionable result. Instead, the data scientist should work on increasing the confidence of the latter or otherwise refining its boundaries.
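
To make item 6 (clustering illusion) concrete, here is a minimal sketch of how spurious correlation arises from sheer numbers. It generates independent random series, so no real relationship exists between any of them, and reports the strongest correlation found among all pairs. The function names and the parameter values in the usage lines are my own illustrative choices, not taken from the infographic.

```python
import itertools
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_series, n_points, seed=0):
    """Largest |r| among all pairs of independent Gaussian noise series.

    Every series is pure noise, so any correlation found is spurious.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    series = [
        [rng.gauss(0, 1) for _ in range(n_points)]
        for _ in range(n_series)
    ]
    return max(
        abs(pearson(a, b))
        for a, b in itertools.combinations(series, 2)
    )


if __name__ == "__main__":
    # 50 series give 1,225 pairs to search through.
    print("10 points per series: ", strongest_chance_correlation(50, 10))
    print("200 points per series:", strongest_chance_correlation(50, 200))
```

With only a few observations per series and over a thousand pairs to compare, some pair will look strongly related purely by chance, while longer series shrink the effect. The more combinations a data scientist searches, the more skeptical he/she should be of the best-looking one.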