Amount of False Data Creation is Rising

One major issue with data science results is the truthfulness of the underlying data, also known as "data veracity". In the past few years, and especially last year (2014), we have seen a rapid rise in the creation of false data.

Data veracity refers to how truthful and accurate data is; data that lacks veracity is false or inaccurate. The data may be intentionally, negligently or mistakenly falsified. Data veracity may be distinguished from data quality, which is usually defined as the reliability and application efficiency of data and is sometimes used to describe incomplete, uncertain or imprecise data.

The truthfulness or accuracy of data takes precedence over data quality issues: if data is objectively false, then data science results are meaningless and unreliable. They may create an illusion of reality that causes bad or sub-optimal decisions, and sometimes fraud with civil or criminal liability.

In health care, public policy and science, false data may harm or kill humans and other life. In finance and business, false data may lead to both civil and criminal fraud. Modern societies have developed laws and ethical codes to deter and mitigate damage from the creation of, and reliance on, false data.

The creation of false data for use in both private and public marketing campaigns, historically present at low levels, is rapidly rising in quantity and sophistication. It is difficult, if not impossible, to distinguish between false and accurate data used in marketing campaigns.

A recent article entitled "A Wave of P.R. Data" shows a disturbing trend of false data marketing by private firms and government or publicly funded entities. The article provides the following examples:

  • Democrats watch more pornography than Republicans, according to Pornhub.
  • Mexicans and Nigerians are the best at sex, as polled by condom manufacturer Durex.
  • The nation’s most stressed zipcodes include one near you, as reported by real estate blog Movoto.
  • Washington residents complain about rats more than New Yorkers, as reported by Orkin.
  • People sometimes use car-share services after hooking up, thanks to some creepy oversharing from Uber.

It is alleged that these false or misleading data stories were created in an attempt to get free public relations from online news organizations, with the intent of generating viral feedback loops and massive coverage. The article suggests that the major public relations firms are using this technique with some success and are busy perfecting this type of public relations data marketing.

In other cases false or misleading statistics (e.g., college rape culture) are used to shape public opinion in support of or against public policies or proposed laws and regulations. These statistics are usually derived from questionnaires or surveys with small sample sizes, where the exact wording of the questions can cause widely disparate human responses. These suspect statistics are then repeated and published as purported facts and evidence. In reality they are false data creating an illusion of reality.
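To illustrate why small sample sizes alone make such statistics suspect, the following is a minimal sketch of the standard normal-approximation margin of error for a surveyed proportion. The function name, the 20% response rate and the sample sizes are hypothetical values chosen only for illustration; they do not come from any survey discussed here.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a surveyed proportion.

    Uses the normal approximation z * sqrt(p(1-p)/n); z=1.96 corresponds
    to roughly 95% confidence. Assumes a simple random sample.
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


# Hypothetical survey result: 20% of respondents answer "yes".
for n in (50, 200, 1000, 10000):
    moe = margin_of_error(0.20, n)
    print(f"n={n:>5}: 20% +/- {100 * moe:.1f} percentage points")
```

With 50 respondents the uncertainty is roughly 11 percentage points either way, while with 10,000 it falls below one point. Note that this covers sampling error only: it says nothing about the bias introduced by question wording or by a non-random sample, which can be larger and cannot be reduced by simply collecting more responses.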

Unless we take steps to deter the creation of false data, incentivize the discovery of false data, and create serious civil and criminal consequences for intentionally (and negligently) falsifying data and creating an illusion of reality, many data science results will be meaningless and unreliable. We need to take meaningful and prudent action now or risk the promise and credibility of data science.