Contemporary Big Data Architecture

A Day In The Life

 

Typically, data sets are tightly coupled to a single business application and are managed by relational database services, even when doing so requires implementing SQL anti-patterns. Moreover, the Create-Read-Update-Delete (CRUD) activities related to these data sets are obfuscated and controlled by that application's hard-wired state transitions. Under the near-sighted leadership of application architects, we have deployed hundreds of thousands of data silos in which valuable data assets are inaccessible from outside the silo.

 

As parties recognize the potential market valuations of their data sets, they are quickly discovering that present and future opportunities to maximize a return on those data assets are severely limited by any OLTP or OLAP system whose state transitions are tightly coupled to those data sets.

Immutable Information-Resource services provide consumers with easy access to data sets, thereby enabling their monetization. These immutable data sets are the outputs of numerous distributed computations (most of which execute on the same hardware that persists the data sets to storage devices), and their design is based on the Command-Query-Responsibility-Segregation (CQRS) pattern. Accordingly, OLTP and OLAP application state transitions are commands delivered by a message broker to the information resource service endpoint, decoupling the application from the information resource.
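
To make that command path concrete, here is a minimal sketch of the CQRS flow, assuming a hypothetical order-management application; the broker is simulated with an in-process queue where a production system would use a real message broker (Kafka, RabbitMQ, or similar):

    # Application code emits commands to a broker; the information-resource
    # service applies them to an append-only (immutable) store. All names
    # here are illustrative, not from the text.
    import json
    import queue
    from datetime import datetime, timezone

    broker = queue.Queue()   # stand-in for a message broker topic
    event_log = []           # append-only, immutable information resource

    def submit_command(action, payload):
        """Application side: publish a command; never touch storage directly."""
        broker.put(json.dumps({"action": action, "payload": payload}))

    def information_resource_service():
        """Service side: consume commands and append immutable records."""
        while not broker.empty():
            command = json.loads(broker.get())
            event_log.append({
                "recorded_at": datetime.now(timezone.utc).isoformat(),
                **command,
            })

    def query_events(action):
        """Query side: reads never contend with command processing."""
        return [e for e in event_log if e["action"] == action]

    submit_command("create_order", {"order_id": 1, "sku": "A-100", "qty": 2})
    submit_command("cancel_order", {"order_id": 1})
    information_resource_service()
    print(query_events("create_order"))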

Brave New World

 

So, if you recognize the market value of your data and wish to maximize the return on it, how do you wisely define the scope of responsibilities of the application architect role and the data architect role on a software project? The key to successfully demarcating these responsibilities is the presence of a domain model.

A domain model identifies the things of interest to the business, the characteristic properties and behaviors of those things, and the relationships that exist between them. A typical business has a core sub-domain (in which it excels), a few supporting sub-domains, and the generic sub-domains (security, networking, etc.) common to all business systems. When you drill down into any of these sub-domains you'll find collections of computations and data sets. In turn, you can further segregate those computations into commands (which create, update, or delete data) and queries (which read data).
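
As a sketch of that segregation, consider a hypothetical product-catalog sub-domain; the names and fields below are illustrative only:

    # Segregating a sub-domain's computations into commands and queries.
    from dataclasses import dataclass, field

    @dataclass
    class ProductCatalog:
        products: dict = field(default_factory=dict)

        # Commands: create, update, or delete data.
        def add_product(self, sku, name, price):
            self.products[sku] = {"name": name, "price": price}

        def remove_product(self, sku):
            self.products.pop(sku, None)

        # Queries: read data, never mutate it.
        def price_of(self, sku):
            return self.products[sku]["price"]

        def products_under(self, max_price):
            return [sku for sku, p in self.products.items()
                    if p["price"] <= max_price]

    catalog = ProductCatalog()
    catalog.add_product("A-100", "widget", 9.99)   # command
    print(catalog.products_under(10.00))           # query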

In distributed systems composed of big data sets, to avoid moving data across the network to remote computations, one intentionally co-locates carefully chosen commands, queries, and data sets on a per-host-computer basis. Each such co-location can practically be understood as a bounded context, one that can be specifically tested, monitored, measured, and tuned.
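
A minimal sketch of that co-location, assuming a hypothetical three-host cluster and hash-based partitioning, routes each query to the host that already holds the key, so no data crosses the network:

    # Partition-aware routing: the hosts and hash scheme are illustrative.
    import hashlib

    HOSTS = ["host-a", "host-b", "host-c"]   # each host persists one partition

    def owning_host(key):
        """Deterministically map a key to the host that stores it."""
        digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        return HOSTS[digest % len(HOSTS)]

    def run_query(key, query):
        host = owning_host(key)
        # A real system would dispatch the query to `host` for local
        # execution; here we just report the routing decision.
        return f"execute {query!r} for {key!r} on {host}"

    print(run_query("customer:42", "sum(order_total)"))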

The commands, queries, and data sets in every bounded context are the responsibility of the data architect. And it is the data architect who establishes how these information resources (commands, queries, data sets) are provisioned to consumers: as a formatted file on disk, available via an SQL service, computed within an in-memory data grid, accessible via a REST API, and so on.

An application architecture consumes the information resource services available within the data architecture. In this way, the business ensures two critical outcomes: first, that data silos are no longer created; and second, that the present value of all data assets within its domain can be fully realized.

Radical as these role definitions may seem to parties habituated to creating business applications that are tightly coupled to data sets, the premise is not novel. These role definitions align with two fundamental principles of the World Wide Web: 'easy to use' and 'loose coupling.'

As a reader of this blog you do not need to concern yourself with how this information resource is created, updated, or deleted, and as long as you use a standard web client (your browser) you do not need to concern yourself with how this information resource is read. That passes the 'easy to use' test. And as long as the format of the information resource is web compliant, I am free to mutate it at will without disabling your ability to access it 99.99% of the time. That passes the 'loose coupling' test.
 

Cultural Bias Meets a Turning Worm

 

If you are still convinced that the application architecture dominates contemporary big data architecture, then perhaps your time is better spent reading something else, say, a client-server white paper from the 90s. No? OK then, consider the following if you will...

 

Since the late 80s and early 90s, businesses around the world have been implementing solutions based on relational database servers. Unfortunately, things have clearly gone way too far, and way off track, as regards SQL services.

First things first: avoid SQL anti-patterns, and remember that only non-scalar data types (collections; sets of values) belong inside a relational database service. Business domain models typically contain data structures, such as organization hierarchies, product catalogs, and web site log files, that are in fact SQL anti-patterns. Commands and queries against any data structure that is also an SQL anti-pattern can never be optimized by an SQL server. One can implement many different types of data structures within an SQL server; it won't stop you as long as the DDL syntax is valid. But the SQL server will only optimize commands and queries over non-scalar data types. You know SQL services have gone off track when you see, as you commonly do today, SQL anti-patterns proliferating inside a business's relational database.
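
To see why such structures resist optimization, consider an organization hierarchy stored as an adjacency list. The sketch below (with illustrative data, in Python rather than SQL) shows that walking the tree demands one lookup per level rather than a single set operation:

    # Rows of (employee, manager) -- an adjacency-list hierarchy.
    org_table = [
        ("alice", None), ("bob", "alice"), ("carol", "alice"),
        ("dave", "bob"), ("erin", "dave"),
    ]

    def direct_reports(manager):
        """Equivalent of: SELECT employee FROM org WHERE manager = ?"""
        return [emp for emp, mgr in org_table if mgr == manager]

    def all_reports(manager):
        """Walking the tree issues a separate lookup per level; a
        relational optimizer cannot collapse this into one set scan."""
        found, frontier = [], [manager]
        while frontier:
            next_level = []
            for m in frontier:
                reports = direct_reports(m)
                found.extend(reports)
                next_level.extend(reports)
            frontier = next_level
        return found

    print(all_reports("alice"))  # ['bob', 'carol', 'dave', 'erin']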

And how do you know if things have gone too far in SQL land? When most everyone imagines that SQL is the tool of choice for manipulating data sets, that's a clear sign that things have gone too far. People, if you are implementing an OLTP solution then think SQL service + SQL; otherwise, use some other type of data service (because B-trees do not scale linearly). And if you are thinking data analytics, then take the late, great E. F. Codd's advice (Codd is, so to speak, the mother of the SQL service) and don't choose SQL as your domain-specific language; instead, Codd recommended choosing a language that supports first-order predicate logic - like Clojure.
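
As a rough illustration of that predicate-logic style (sketched here in Python rather than Clojure, with hypothetical data), a query becomes a composition of first-order predicates applied to a data set rather than an SQL string:

    orders = [
        {"region": "EU", "total": 120.0, "status": "shipped"},
        {"region": "US", "total": 80.0,  "status": "pending"},
        {"region": "EU", "total": 45.0,  "status": "shipped"},
    ]

    def p_and(*preds):
        """Conjunction of predicates: true when every predicate holds."""
        return lambda row: all(p(row) for p in preds)

    in_region = lambda r: (lambda row: row["region"] == r)
    min_total = lambda t: (lambda row: row["total"] >= t)

    large_eu = p_and(in_region("EU"), min_total(100.0))
    print([o for o in orders if large_eu(o)])  # the EU order totaling 120.0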

Add to these negative cultural traditions (of data sets being hard-wired to business applications, implemented as SQL anti-patterns, and manipulated only by SQL queries) the current challenge of accelerating data volume growth: data sets are getting bigger and bigger, at faster and faster rates, with no practical end in sight. In the face of these barriers, you will be doing nothing more than digging a deeper and wider hole if you continue to allow your application architecture to determine your big data architecture.

Contemporary Big Data Architecture

 

At a minimum, a contemporary big data architecture has the following characteristics:
 

  • adopts and implements ACID, referential integrity, and state as needed, on a per-data-set, per-command basis; it does not, however, guarantee that unintended outcomes or failures will never happen.
  • segregates command from query responsibilities whenever consumers interact with information resources, and does so without tightly coupling the consumer to either the information schema or its semantics.
  • uses HTTP/HTTPS as a means of connecting distributed immutable information resources that are available in common representation formats - data sets are no longer stored in proprietary formats.
  • provides massively parallel SQL services for accessing distributed immutable information resources - after all, some consumers still prefer SQL.
  • provides massively parallel REST services for accessing distributed immutable information resources - for the discerning consumer.
  • on a per-consumer, per-data-set, per-query basis, provides WebSockets for accessing distributed immutable information resources.
  • provides master data management and facilitates data governance across the entire data life-cycle.
  • prefers asynchronous over synchronous communication with clients (see the sketch after this list).
  • matches the sub-domain data structure(s) to a best-fit data service, leading most often to information resource services consisting of a polyglot of data services (running unnoticed behind the curtains).
  • prefers in-memory data sets over on-disk data sets (until the moment they must be persisted and replicated), while keeping data sets off the network whenever practical.
  • facilitates data cleansing, transformation, integration, mining, measuring, aggregating, analyzing and visualizing.
  • facilitates decision support services, machine learning, and predictive analytics (which are supported by batch as well as streaming data lineage services).
  • debunks 'real-time' and provides somewhere-near-immediate service levels on a per-consumer, per-data-set, per-command/query basis.
  • secures (and may encrypt) information resources while granting appropriate privileges to authenticated and authorized consumers.
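
As a sketch of the asynchronous preference noted in the list above, a consumer might issue several queries concurrently rather than blocking on each in turn; the service call is simulated here and the resource names are illustrative:

    import asyncio

    async def fetch_resource(name):
        """Stand-in for a non-blocking call to an information-resource
        service (e.g., over HTTP or a WebSocket)."""
        await asyncio.sleep(0.1)   # simulated network latency
        return f"{name}: ok"

    async def main():
        # Three queries overlap in time; total wait is ~0.1s, not ~0.3s.
        results = await asyncio.gather(
            fetch_resource("sales-2023"),
            fetch_resource("inventory"),
            fetch_resource("web-logs"),
        )
        print(results)

    asyncio.run(main())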