Data Scientists Should Assign a Data Veracity Score

Recently we have seen a significant rise in the amount of untruthful data and false data creation.

Data veracity (whether data is true and accurate, as opposed to false or inaccurate) is often overlooked, yet it may be as important as the 3 V's of big data: volume, velocity and variety. Data may be intentionally, negligently or mistakenly falsified. Data veracity may be distinguished from data quality, which is usually defined as the reliability and application efficiency of data and is sometimes used to describe incomplete, uncertain or imprecise data.

Traditional data warehouse / business intelligence (DW/BI) architecture assumes certain and precise data, an assumption that requires unreasonably large amounts of human capital to be spent on data preparation, ETL/ELT and master data management.
 
Yet the big data revolution forces us to rethink the traditional DW/BI architecture so that it can accept massive amounts of both structured and unstructured data at great velocity. In practice, unstructured data often contains a significant amount of false as well as uncertain and imprecise data. For example, social media data is inherently uncertain and contains many falsehoods.

For many data science projects, half or more of the time is spent on "data preparation" processes (e.g., removing duplicates, fixing partial entries, eliminating null/blank entries, concatenating data, collapsing or splitting columns, and aggregating results into buckets). I suggest this is a "data quality" issue, in contrast to false or inaccurate data, which is a "data veracity" issue.
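To make the distinction concrete, here is a minimal sketch of a few of the data preparation steps listed above (removing duplicates, eliminating null/blank entries, concatenating fields and aggregating into buckets). The records, field names and bucket width are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical raw records; field names and values are illustrative only.
raw_rows = [
    {"first": "Ada", "last": "Lovelace", "age": 36},
    {"first": "Ada", "last": "Lovelace", "age": 36},   # exact duplicate
    {"first": "Alan", "last": "Turing", "age": None},  # null entry
    {"first": "Grace", "last": "Hopper", "age": 85},
    {"first": "", "last": "Unknown", "age": 41},       # blank entry
]


def prepare(rows):
    """Remove duplicates, drop rows with null/blank fields, concatenate names."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        if any(value is None or value == "" for value in row.values()):
            continue  # skip rows with null or blank entries
        cleaned.append({"name": f"{row['first']} {row['last']}", "age": row["age"]})
    return cleaned


def bucket_ages(rows, width=20):
    """Aggregate ages into fixed-width buckets such as '20-39'."""
    counts = Counter()
    for row in rows:
        low = (row["age"] // width) * width
        counts[f"{low}-{low + width - 1}"] += 1
    return dict(counts)


cleaned_rows = prepare(raw_rows)
age_buckets = bucket_ages(cleaned_rows)
```

Note that every step here checks the form of the data only. A record with a well-formed but false age would pass through untouched, which is why data preparation addresses data quality rather than data veracity.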

Given the variety and velocity of big data, an organization can no longer commit the time and resources to traditional ETL/ELT and data preparation needed to clean up the data and make it certain and precise for analysis. While there are tools to help automate data preparation and cleansing, these tools are still in the pre-industrial age.

Data veracity issues can wreck a data science project: if the data is objectively false, then any analytical results are meaningless and create an illusion of reality that may lead to bad decisions and fraud, sometimes with civil liability or even criminal consequences.
 
As a result, I strongly advise data scientists to assign a data veracity score and ranking for specific data sets to avoid making decisions based on analysis of false data.
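One possible way to do this is sketched below. It is only an illustration under stated assumptions, not a standard method: the three components (source reliability, corroboration, plausibility), their weights, and the example data sets and component values are all hypothetical and would need to be defined for each project.

```python
from dataclasses import dataclass

# Hypothetical weights; they sum to 1.0 and should be tuned per project.
WEIGHTS = {"source_reliability": 0.4, "corroboration": 0.4, "plausibility": 0.2}


@dataclass
class DatasetAssessment:
    """Hypothetical per-data-set assessment; every component is 0.0 to 1.0."""
    name: str
    source_reliability: float  # judged trust in the origin of the data
    corroboration: float       # share of sampled records confirmed elsewhere
    plausibility: float        # share of records passing sanity checks


def veracity_score(assessment):
    """Return the weighted average of the component scores (0.0 to 1.0)."""
    total = 0.0
    for component, weight in WEIGHTS.items():
        value = getattr(assessment, component)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{component} must be between 0.0 and 1.0")
        total += weight * value
    return total


def rank_by_veracity(assessments):
    """Order data sets from most to least trustworthy."""
    return sorted(assessments, key=veracity_score, reverse=True)


# Illustrative inputs only; these are not measured values.
assessments = [
    DatasetAssessment("social_media_feed", 0.3, 0.2, 0.6),
    DatasetAssessment("audited_sales_ledger", 0.9, 0.8, 1.0),
]

for rank, item in enumerate(rank_by_veracity(assessments), start=1):
    print(f"{rank}. {item.name}: {veracity_score(item):.2f}")
```

A weighted average is used here only because it is simple to explain and audit. A project could just as well apply a minimum threshold to each component, so that a single badly failing check disqualifies a data set from analysis.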