LineageDB Architecture for Big Data Analytics

The traditional approach to data warehousing results in data analytic systems that are: 

  • tightly coupled to relational databases, 
  • limited to dimension and fact tables (which are notoriously difficult to keep in sync with data sources), and
  • heavily dependent upon extremely brittle ETL processes.

As volumes of data reach into the 100s of TBs, the inherent limitations of relational B-tree data structures quickly become evident. You can gain some more yardage by vertically scaling most relational servers, but exercising that option is extremely expensive, and you eventually run out of runway.

And it has become obvious that morphing relational data servers into MPP platforms has resulted in data services that are not flexible enough to support quickly changing data analysis requirements. As your data analytic needs change, the deployed configuration of the MPP platform becomes obsolete, and typically needs to be significantly re-configured to support the new data analytic requirements.

Data analytic projects are notoriously difficult to deliver successfully, and frequently disappoint the business sponsors and users. We can achieve significantly better results if we adopt a contemporary, non-traditional approach to building data warehouses and data marts.

First, we need to recognize that requirements related to storing, indexing and analyzing large volumes of data demand dedicated, specialized, best-of-breed services and tools. For big data analytics, a relational database is not a one-stop-shop.

We need to abandon the use of relational databases as our primary data storage service, and replace them with clusters of peer-to-peer key-value data services in which we store our immutable master data type records. There are pure key-value data services, such as Riak, and there are key-value hybrids such as Cassandra, both of which are well suited for our primary data storage service.
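
As a concrete illustration, here is a minimal sketch of that storage pattern using the Python cassandra-driver against Cassandra, the key-value hybrid named above. The cluster address, keyspace, table and column names are all hypothetical; the point is simply that records are keyed by master record id plus version and are only ever inserted, never updated in place.

    import json
    import uuid
    from datetime import datetime, timezone

    from cassandra.cluster import Cluster  # pip install cassandra-driver

    # Hypothetical cluster address, keyspace and table names.
    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS lineage
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS lineage.master_record (
            master_id  text,
            valid_from timestamp,
            version_id uuid,
            payload    text,   -- schema-less JSON blob of the master data type
            PRIMARY KEY (master_id, valid_from)
        ) WITH CLUSTERING ORDER BY (valid_from DESC)
    """)

    def append_version(master_id, record_dict):
        """Insert a new, immutable version of a master data type record."""
        session.execute(
            "INSERT INTO lineage.master_record (master_id, valid_from, version_id, payload) "
            "VALUES (%s, %s, %s, %s)",
            (master_id, datetime.now(timezone.utc), uuid.uuid4(), json.dumps(record_dict)),
        )

    # A newer version may carry fields the older one lacked; no re-casting of old records required.
    append_version('customer:42', {'name': 'Acme Corp', 'region': 'EMEA'})
    append_version('customer:42', {'name': 'Acme Corp', 'region': 'EMEA', 'segment': 'enterprise'})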

In that the physical and logical definitions of a master data type evolve over time, the schema-less aspect of key-value data services proves to be a great benefit when engineering business intelligence solutions. It enables the business to change the representation of a master data type at will, without needing to re-cast records that conform to a deprecated representation.

We'll call this key-value (hybrid) data service that stores our immutable master data type records the LineageDB. From a data analytic perspective, the contents of the key-value store are the book of record. The LineageDB stores every instance of every version of a master data type record captured from the source systems; in it, the business preserves the 'lineage' of every master data type record. As such, from the contents of the LineageDB we can reconstruct the version of the facts known to the business at any particular point in time.
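
Given a table like the one sketched above, reconstructing the facts as the business knew them at a particular point in time reduces to an 'as-of' lookup per master record. Again, the names are the same hypothetical ones used earlier, and the query is only a sketch of the idea.

    from datetime import datetime, timezone

    def version_as_of(session, master_id, as_of):
        """Return the latest version of a master record known at time `as_of`, or None."""
        rows = session.execute(
            "SELECT payload, valid_from FROM lineage.master_record "
            "WHERE master_id = %s AND valid_from <= %s "
            "ORDER BY valid_from DESC LIMIT 1",
            (master_id, as_of),
        )
        row = rows.one()
        return row.payload if row else None

    # What did the business know about customer 42 at the start of 2015?
    snapshot = version_as_of(session, 'customer:42', datetime(2015, 1, 1, tzinfo=timezone.utc))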

From this LineageDB we extract any atomic fact we need to support our data analysis process. By storing the master data type as a key-value (hybrid) data structure, we can derive whatever data representation is best suited for a given data analysis algorithm. Unlike a traditional relational data warehouse, a key-value store does not limit analytic capabilities to dimension and fact tables. Collections of master data type records can become linked lists, arrays, maps, hash tables, JSON documents, Avro files stored in HDFS, graphs, and so on: whatever data structure is ideal for the analytic algorithm, even dimension and fact tables. Of course, we intend to bulk load these target data structures from the LineageDB wherever possible.
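
To make that "derive whatever representation fits the algorithm" point concrete, here is a small sketch that turns an extract of master records (represented here as plain JSON payloads) into two different target structures: a hash table keyed for fast lookup, and a flat list of fact-style tuples that could be bulk loaded elsewhere. The field names are illustrative only.

    import json
    from collections import defaultdict

    # An extract of master data type records pulled from the LineageDB (illustrative payloads).
    records = [json.loads(r) for r in (
        '{"customer": "acme",   "region": "EMEA", "order_total": 1200}',
        '{"customer": "acme",   "region": "EMEA", "order_total":  300}',
        '{"customer": "globex", "region": "APAC", "order_total":  950}',
    )]

    # Target structure 1: a hash table keyed by customer, ideal for lookup-heavy algorithms.
    by_customer = defaultdict(list)
    for rec in records:
        by_customer[rec['customer']].append(rec)

    # Target structure 2: flat fact-style tuples, ready to bulk load into whatever store fits.
    fact_rows = [(rec['customer'], rec['region'], rec['order_total']) for rec in records]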

With the adoption of the LineageDB, data analytics is no longer constrained by the limitations of traditional relational data services, nor is it tightly coupled to relational data services in any regard. Now, a relational data service is used if and only if it is the best fit for a given data analysis algorithm.

As regards other benefits of the LineageDB, we can scale the key-value data store extensively and horizontally using commodity servers, as well as virtual machines hosted in-house or in the cloud. Since it is implemented as a cluster of peer-to-peer nodes, we have a highly available data warehouse with robust business continuity. And, in that the individual records are immutable, the LineageDB may prove suitable for supporting relevant audit or compliance requirements. Lastly, data from transaction systems that are being deprecated can be loaded into the LineageDB so that the data those systems captured continues to be available to the business.

Before we dive deeper into ETLs, let's take up indexing in support of data analysis. Data analysts care about the data itself, as well as what the data is telling them. In order to discover the content of the master data types, indexes of the master data types need to be created. To provide those indexes, a dedicated and optimized index service, such as ElasticSearch, is needed.

Finding the information the business needs to support its decisions is a huge challenge to every business. The terrible secret about data analysis is that we don't know what particular content we might uncover. Consequently, whatever dimension or fact tables we create will inevitably prove inadequate over time. 

Whenever you use a relational database to manage large data sets, in order for it to be performant you must keep your inventory of indexes as small as possible. Otherwise, maintaining those indexes robs you of CPU cycles and memory needed to support data analysis computations. This resource constraint is why, in relational data services, you have to adopt heap structures and abandon indexes as the data volume grows into the 100s of TBs.

However, in order to search the data content from numerous viewpoints, and to do so quickly, indexes are an absolute must-have. In addition, a major challenge for the business is tracking every change to its underlying master data type records. The data analyst needs to see the information as it was created, as well as the changes it has undergone over time within the source data systems. The ability to observe such changes is critical to the successful development of machine learning algorithms: you have to be able to collect the evidence, recorded within the business data, that an algorithm delivers the promised benefits.

To provide whatever indexes may be needed, and to do so without degrading the performance of the data warehouse, master data type records are loaded into ElasticSearch and extensively indexed. At any point in time, we just don't know what might be important to the business at some future point. Therefore, we need to continually support numerous searchable views across the data contained in the LineageDB. Since it is impossible to know beforehand what questions the analyst will need to ask of the data, we provide complete index coverage over all property values of the master data type. And, in the case of ElasticSearch, the data analyst is able to use standard cURL to easily search, track and explore the data via REST. Compared to using SQL as a query language for analyzing data, cURL is far easier to learn and master, and in some ways more powerful, because you can easily embed (JavaScript or Groovy) scripts that process the data in relatively sophisticated ways within the context of the search.
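
As a hedged illustration of that workflow, the sketch below uses Python's requests library to index a master record into ElasticSearch over REST and then run a simple search; the same calls can be issued verbatim with cURL. The host, index name and document layout are assumptions, and the type name in the URL path reflects older ElasticSearch versions.

    import requests  # pip install requests

    ES = 'http://localhost:9200'   # assumed ElasticSearch endpoint
    INDEX = 'lineage'              # hypothetical index of master data type records

    # Index (or re-index) a master record; every property becomes searchable.
    doc = {'master_id': 'customer:42', 'name': 'Acme Corp', 'region': 'EMEA', 'segment': 'enterprise'}
    requests.put(f'{ES}/{INDEX}/master_record/customer:42', json=doc)

    # Search it back from any viewpoint, e.g. every enterprise customer in EMEA.
    query = {'query': {'bool': {'must': [
        {'match': {'region': 'EMEA'}},
        {'match': {'segment': 'enterprise'}},
    ]}}}
    hits = requests.post(f'{ES}/{INDEX}/_search', json=query).json()['hits']['hits']
    for hit in hits:
        print(hit['_source']['master_id'])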

Now, let's move on to the use of ETLs. As someone who's been developing ETLs since the early 90s, I can tell you they are notoriously difficult to develop, and once deployed they prove to be extremely brittle: a change in the source system will break the ETLs. When you develop an ETL, you are engaged in reverse engineering the commands that capture and change the source data. In that virtually no production transaction system fully documents its data capture and change computations, the reverse engineering exercise is time consuming, highly error prone, and relatively expensive. Consequently, we need to move beyond this approach wherever possible.

Is it possible to move beyond ETLs? That depends on whether or not the business has access to the source application's code base. If you have access to the system's code base, never go down the path of ETLs. Instead, exploit the Command-Query-Responsibility-Segregation (CQRS) design pattern. 

Compared to developing reverse-engineered ETL processes, it takes far less time and far less effort to extend the code base by introducing the CQRS design pattern. To extend the code base with CQRS, you identify the application's objects' state transitions in which master data types are inserted, updated and deleted. These methods are extended to place a simple command message on a stable message queue. At the other end of that message queue is a service that pulls the command messages off the queue and applies the content of each command message to the LineageDB. I have found that Kafka is ideal for such a messaging system, and Storm + Trident is very well suited for pulling those command messages off the queue, processing them as may be needed, and loading them into the LineageDB and into ElasticSearch.
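
A minimal sketch of the producing side of that pattern, in Python with the kafka-python client: a hypothetical save_customer() method is extended so that, alongside its normal persistence work, it publishes a small command message describing the state transition. The topic name, message shape and broker address are all assumptions.

    import json
    from datetime import datetime, timezone

    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers='localhost:9092',   # assumed broker
        value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    )

    def save_customer(customer):
        """Hypothetical state transition in the source application."""
        # ... existing code that persists `customer` to the application's own database ...

        # CQRS extension: publish a command message describing the change.
        producer.send('master-data-commands', {
            'command': 'upsert',
            'master_type': 'customer',
            'master_id': customer['id'],
            'payload': customer,
            'occurred_at': datetime.now(timezone.utc).isoformat(),
        })
        producer.flush()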

In cases where the business does not have access to the source system's code base, you are still stuck with having to reverse engineer ETLs. However, you can significantly increase the rate at which source data is assimilated into the LineageDB by introducing a service that extracts baseline and delta data sets from the source system, places the deltas onto a Kafka queue, and uses Storm and Trident to process those deltas.
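
On the consuming side, my preference is Storm + Trident; purely as an illustrative stand-in, the sketch below uses a plain Python Kafka consumer to drain the hypothetical topic from the sketch above and apply each delta to the LineageDB and to ElasticSearch, reusing the helper functions sketched earlier.

    import json

    import requests
    from kafka import KafkaConsumer  # pip install kafka-python

    consumer = KafkaConsumer(
        'master-data-commands',              # hypothetical topic from the sketch above
        bootstrap_servers='localhost:9092',
        value_deserializer=lambda v: json.loads(v.decode('utf-8')),
        auto_offset_reset='earliest',
        group_id='lineage-loader',
    )

    for message in consumer:
        command = message.value
        # Append an immutable version to the LineageDB (append_version from the earlier sketch).
        append_version(command['master_id'], command['payload'])
        # Keep the search indexes current as well.
        requests.put(
            f"http://localhost:9200/lineage/master_record/{command['master_id']}",
            json=command['payload'],
        )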

Lastly, let's explore the notion of real-time data analysis in relation to the LineageDB. So far, we have pointed out how Kafka, Storm and Trident can be adopted to support the complex event processing challenges germane to keeping source data systems in sync with the LineageDB. But these tools are best suited to integrating data sources with the LineageDB. To support real-time analytics, we need to provision the contents of the LineageDB in memory. There are numerous open source frameworks that can be used to provide in-memory data sets; in my experience, Apache Spark is a very good choice.
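
By way of a hedged sketch, here is how an extract of the LineageDB might be provisioned in memory with PySpark. It assumes the DataStax spark-cassandra-connector is available on the Spark classpath, and the keyspace and table names are the hypothetical ones used above.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName('lineage-analytics')
             .getOrCreate())

    # Load the master records from the LineageDB via the spark-cassandra-connector (assumed installed).
    master = (spark.read
              .format('org.apache.spark.sql.cassandra')
              .options(keyspace='lineage', table='master_record')
              .load())

    # Pin the working set in memory and expose it to ad hoc, near-real-time SQL.
    master.cache()
    master.createOrReplaceTempView('master_record')

    latest_per_id = spark.sql("""
        SELECT master_id, max(valid_from) AS latest_version
        FROM master_record
        GROUP BY master_id
    """)
    latest_per_id.show()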

While I have had very good results using this approach for analyzing 100s of TBs, it remains to be shown whether this architecture can scale into the petabyte realm.