As companies strive to become more agile in today’s ever-changing business world, a common theme is getting data faster and, in turn, getting insights from data faster. That’s where the notion of schema-free queries often comes in, where all sorts of unstructured data goes into files in Hadoop (the Hadoop Distributed File System with, e.g., Hive or Drill for querying), or into SQL and NoSQL databases that support late binding. Late binding, to get on the same page, is the practice of transforming and binding data based on relationships at program runtime, versus early binding, where transformations are done when data moves from source systems into the database.
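To make the distinction concrete, here is a minimal sketch in Python, with hypothetical event data and function names. The late-bound path applies structure only when the query runs; the early-bound path validates and types each record at load time and rejects anything that doesn’t fit.

```python
import json

# Hypothetical raw event lines as they might land in a data lake file.
raw_events = [
    '{"user": "a1", "amount": "19.99", "ts": "2024-01-05"}',
    '{"user": "b2", "amount": "5.00"}',          # missing ts -- still stored
]

# Late binding (schema-on-read): structure is interpreted at query time.
def late_bound_total(lines):
    total = 0.0
    for line in lines:
        record = json.loads(line)                # structure applied only now
        total += float(record.get("amount", 0))  # meaning decided per query
    return total

# Early binding (schema-on-write): records are validated and typed at load time.
def early_bind(lines):
    loaded = []
    for line in lines:
        record = json.loads(line)
        if "user" not in record or "ts" not in record:
            raise ValueError(f"rejected at load: {line}")  # schema enforced up front
        loaded.append({"user": record["user"],
                       "amount": float(record["amount"]),
                       "ts": record["ts"]})
    return loaded

print(late_bound_total(raw_events))  # 24.99 -- works, but silently tolerates gaps
```

The late-bound query happily totals incomplete records, which is exactly what makes it fast for exploration; the early-bound loader would reject the second record outright.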
These databases or data stores often enable rapid exploration via schema-free queries. And, it’s true that rapid exploration is a key piece of any agile company’s foundation, just as it’s true that some corners of the technology world are evolving so quickly that having to slow down and put governance and forethought into data storage and structure can be the difference between success and failure.
But with schema-free queries, it also pays to be prudent. If you’re not careful, they can make your data dishonest.
The fact that data isn’t wrapped in governance is fine (and preferred) for just poking around. We opt for schema-free queries in the first place because a lot is changing around us and new data sources are emerging regularly. The fact is that schema-free storage is great for an initial prototype, but once we move past the prototype stage, the lack of a schema quickly becomes a governance nightmare.
A Crumbling Analytics House Built on Schema-Free
Otherwise, whatever you produce – whether it’s a dashboard or some metric read-out – could begin lying to you. This is the exact problem we faced in the mid 2000s during my tenure at eBay, when an entire experimentation platform, with hundreds of experiments built on late binding, was starting to fold like a house of cards. The reason: the incoming data started changing on us, and there were no controls or governance in place to catch the change.
It only takes one developer upstream, going about their day-to-day work, to change the meaning of a tag, thinking they are the only one using it. Once that happens, everything built with that data can produce results that are slightly – or completely – different. Plus, there is no lineage with schema-free queries, so you won’t even know that anything has changed!
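A small, hypothetical illustration of how this failure stays silent. A downstream metric assumes a tag means one thing; an upstream change to the tag’s units produces no error at all – just wrong numbers.

```python
# Hypothetical dashboard metric reading a free-form "duration" tag.
def avg_session_minutes(events):
    # Downstream assumption: "duration" is in seconds.
    return sum(e["duration"] for e in events) / len(events) / 60

before = [{"duration": 300}, {"duration": 600}]      # seconds, as intended
print(avg_session_minutes(before))   # 7.5 minutes

# An upstream developer switches the tag to milliseconds.
# With no schema and no lineage, nothing fails -- the numbers just change.
after = [{"duration": 300_000}, {"duration": 600_000}]
print(avg_session_minutes(after))    # 7500.0 -- the dashboard now lies silently
```

Nothing in this pipeline raises an exception; the only symptom is a metric that quietly jumped three orders of magnitude.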
Put simply, schema-free queries can quickly become a foundation for a house that crumbles after it’s built.
Don’t get me wrong: Late binding is a must-have capability in today’s data infrastructure. We have long been working on bringing more and more late-binding features into our various products, the latest example being high-performance, binary JSON storage and processing natively within the Teradata database.
Building Trust in Your Data
While systems need to support both late and early binding, and both tight and loose coupling, the evolution toward schema (even if only for subsets of data) is a must-have step in any data product development process.
Schema is not just a nuisance. It’s not there to be painful; it’s there to control structure and actively reject mismatches along the way. It forces a different kind of thinking about production quality than a free-flowing, unstructured lake that changes by the minute and is hard to rely on for repeatability. Trust in repeatable, consistent results is key to the success of Big Data.
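What “reject mismatches along the way” can look like in practice – a minimal sketch, with illustrative names, of a schema acting as a gatekeeper at the load boundary rather than letting bad records drift downstream:

```python
from dataclasses import dataclass

# Illustrative schema for one event type; field names are hypothetical.
@dataclass(frozen=True)
class PageView:
    user_id: str
    page: str
    duration_seconds: int

def load_page_view(record: dict) -> PageView:
    # Reject mismatches at the boundary instead of letting them leak downstream.
    duration = record.get("duration_seconds")
    if not isinstance(duration, int):
        raise TypeError(f"duration_seconds must be int seconds, got {duration!r}")
    return PageView(str(record["user_id"]), str(record["page"]), duration)

# A conforming record loads cleanly:
pv = load_page_view({"user_id": "u1", "page": "/home", "duration_seconds": 42})

# A record whose tag quietly changed meaning (say, float milliseconds)
# is rejected loudly at load time instead of corrupting every metric built on it:
# load_page_view({"user_id": "u1", "page": "/home", "duration_seconds": 42000.0})
# -> TypeError
```

The schema turns a silent semantic drift into a loud, immediate failure – painful in the moment, but that pain is the whole point.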
The lesson is that you need to constantly check whether schema-less data is being used for production purposes. Similarly, the moment you find something through data exploration, figure out which tags you’re relying on for the production-like environment and make sure you have the ability to monitor them. While there’s often value in getting to data quickly to uncover new things, there is also value in knowing that a particular tag has a specific meaning – especially once you move from exploration to production.
As part of my series of articles on the concept of the Sentient Enterprise, I have talked about the need for a Layered Data Architecture – a data classification framework that allows for the rapid and agile integration of unstructured or late-binding data. The key to success is to properly classify all your incoming data as it is accessed, used, and relied on, and to elevate data elements from non-coupled to loosely coupled to tightly coupled status.
When we build algorithms, models, reports – any form of repeatable usage of data – we are obligated to have control and authority over the data behind it, so we can make sure it continues to do what it claims to do.
Don’t let your data lie to you.