Most people think of data warehouses as databases that solve reporting problems. However, it’s more useful to think of them as addressing two sets of problems: 1) reporting, or data distribution, problems and 2) data integration problems. You probably already realize this. Heck, Bill Inmon’s original definition of the data warehouse said that it is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management decisions.
What you may not realize, however, is that the data structures best for data integration are not the ones best for data distribution. Data integration works best with narrow, normalized tables. This format makes it easy to work with data at its atomic level and to make small schema changes, when required, without causing major headaches. Just like when building a model of anything (and a database is, after all, a model of reality), you’ll get a more accurate representation when you start with small things (e.g., atoms) than when you start with large things (e.g., wooden boards).
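To make that concrete, here’s a minimal sketch, in Python with SQLite and with purely hypothetical table and column names, of what narrow, normalized tables look like and why an atomic-level change stays small:

```python
import sqlite3

# Hypothetical normalized schema: narrow tables, one subject each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (
    department_id   INTEGER PRIMARY KEY,
    department_name TEXT NOT NULL
);
CREATE TABLE employee (
    employee_id   INTEGER PRIMARY KEY,
    employee_name TEXT NOT NULL,
    department_id INTEGER NOT NULL REFERENCES department (department_id)
);
""")

# A new atomic fact is a small, isolated change: one column on one narrow table.
conn.execute("ALTER TABLE employee ADD COLUMN hire_date TEXT")
```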
Normalized tables, however, don’t always work well for data distribution because a number of expensive joins are frequently required to create reports. (More on why data warehouses are best modeled with normalized schemata in a future post.)
Data distribution, however, works best with non-traditional data structures. One example is wide, denormalized tables in forms like star schemata. Other examples of non-traditional data structures include multidimensional databases, like Essbase, and various kinds of in-memory databases, like those offered by QlikView and Tableau. These aren’t necessarily great for data integration, but they are wonderful for distributing data really quickly. In fact, I used to call them “HPQSs,” for High Performance Query Structures (I think I stopped using that name because it takes too long to type).
Why are these non-traditional data structures so good at data distribution? Because, in essence, they anticipate what reporting users will want to see and they pre-process the data to support those needs. Think about it: if you stayed with a normalized schema, you could logically recreate a star schema dimension table by placing a view over a set of its tables. You would, though, lose the performance benefit that a true star schema dimension table gets from doing that view’s joins in advance.
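Here’s a minimal sketch of that idea, again in Python with SQLite and hypothetical names: the view looks like a star schema dimension, but its joins run on every query instead of once at load time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (
    department_id   INTEGER PRIMARY KEY,
    department_name TEXT NOT NULL
);
CREATE TABLE employee (
    employee_id   INTEGER PRIMARY KEY,
    employee_name TEXT NOT NULL,
    department_id INTEGER NOT NULL REFERENCES department (department_id)
);

-- Logically equivalent to a star schema dimension table, but the join
-- happens at query time, so the pre-processing performance benefit is lost.
CREATE VIEW dim_employee AS
SELECT e.employee_id,
       e.employee_name,
       d.department_name
FROM employee e
JOIN department d ON d.department_id = e.department_id;
""")
```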
Now, some extremely bright people have come up with ways to replace normalized data warehouses with star schemata tied together via “conformed dimensions.” It sounds like a nice work-saver but, in actuality, it creates architectures that don’t really support the broader range of things a well-designed data warehouse can handle, such as serving as a source for master data.
Consider the following:
- SUBJECTIVITY vs. OBJECTIVITY: Star schemata are inherently subjective (e.g., is department a dimension, an attribute of employee, or both?). Normalized schemata, on the other hand, are far less so. Since this data will eventually be used in a variety of ways, do you want the core place where all data is integrated to be highly subjective?
- BRITTLENESS: Star schemata are inherently brittle. For example, if you integrate in a star, then when a column is added to the source system you have to figure out how to handle all the historical records in the related dimension table (see the sketch after this list). Normalized schemata don’t suffer from those same effects.
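To illustrate the brittleness point, here’s a sketch using a hypothetical Type 2 (history-keeping) dimension table, with made-up names and data. Adding one source column forces a decision about every historical row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical Type 2 dimension: one row per employee per historical version.
conn.executescript("""
CREATE TABLE dim_employee (
    employee_key    INTEGER PRIMARY KEY,
    employee_id     INTEGER,
    employee_name   TEXT,
    department_name TEXT,
    valid_from      TEXT,
    valid_to        TEXT
);
INSERT INTO dim_employee VALUES
    (1, 100, 'Pat', 'Sales',   '2010-01-01', '2011-06-30'),
    (2, 100, 'Pat', 'Finance', '2011-07-01', '9999-12-31');
""")

# The source system adds a job_grade column. Every historical version row
# now needs a value: do you backfill it, default it, or leave it NULL?
conn.execute("ALTER TABLE dim_employee ADD COLUMN job_grade TEXT")
```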
In the end, yes, you can shoehorn a normalized schema into the reporting role or shoehorn a star schema into the data warehouse role, but, really, they serve different purposes and are best at different things. As a result, our preferred architecture for big, complex reporting needs is a normalized data warehouse feeding out to denormalized structures or alternative-technology tools.
Thoughts or comments? Please post them here.
Until next time…
Ok, there’s a ton of publicity right now around Amazon’s Redshift service. Every article I see calls it a software-as-a-service (SaaS) data warehouse. Really?
As we read in kindergarten, a data warehouse is a “subject-oriented, integrated, non-volatile, time-variant collection of data in support of management decisions” (or something to that effect; I’ll have to see if I can dig up an old copy of Dick and Jane and Bill Inmon).
Now, Redshift is cool, but it’s not a data warehouse. From what I can tell, it’s really a database (kind of like SQL Server, Oracle, or MySQL), optimized for data warehousing (kind of like InfoBright), running on servers at some Amazon facility, and accessible through the web. In other words, it saves you the headaches of buying and managing database hardware and software.
On the other hand, it doesn’t come with user requirements, schemata, ETL designs or programs, business intelligence tools, or any of the other stuff we work so hard to build in data warehousing.
So, is Redshift cool & valuable? Undeniably. Is it a data warehouse? Deniably (actually, the spell checker is telling me that “deniably” is not a word so let me change that to “No”).
One place where QlikView has always been weak is in real-time access to relational data. But the company just announced a new technology, Direct Discovery, to overcome this weakness. While the software is not yet available to customers, this posting on the data visualization blog discusses its pluses and minuses, as well as how the market is driving this move and how it represents QlikView’s entry into the big data space.