A Dumb Thing that 99 out of 100 Data Warehousers do
Is your terminology causing your BI / DW team to head in the wrong direction?
As a business intelligence and data warehousing consultant I constantly work with IT teams that, even internally, can’t agree on whether they have a data warehouse, a data mart, a reporting database, or something else. As a result they’re not sure what BI / DW components they need and the proper roles for those components that they do have.
What common BI & DW terms mean to normal people
How many times have you told someone that you work in data warehousing, had them suggest that they knew about data warehousing, and then had them follow up with questions about how you manage to keep server farms running? I constantly run into folks who think that data warehousing is about storing huge amounts of data. Let’s get it straight – we’re not about storing data, we’re about getting meaning from data and getting that meaning to the right people at the right time! The storage, while necessary, is secondary.
You can’t blame folks, though – at best, the term data warehousing is misleading and, at worse, stupid. Warehousing implies storage, not use. But, “data warehousing” isn’t our only loaded term. What about “data mart” – what exactly does that imply? Or, a personal favorite, “operational data store” (ODS) – I can’t tell you how many times folks have told me about their ODS by describing the Oracle database underlying their SAP implementation. An operational database is not an ODS! But, once again, our terminology is so imprecise that it leads people in the wrong direction.
Well, everyone else is doing it
Somewhere along the line people grabbed onto terminology like data warehouse, data mart, and ODS because everyone else was grabbing onto it. Let’s do what the cool kids are doing, right?
Sadly, though, the terminology just doesn’t really describe what these key architectural components do. And, while there are true, academic definitions for them (anyone remember subject oriented, integrated, time variant, non volatile collection of data in support of management decisions?), many (most?) practitioners don’t know these definitions.
This whole situation leads to confusion, waste, and in some cases, failure. So, want to get better at BI and data warehousing? Start by changing your terminology.
Why not try these terms?
I propose that we clear up some confusion by changing our terminology, to use terms that focus on the goals of the components in our BI architecture. Many of the terms I propose have both an architectural focus (generally those terms called ‘layers’) and a physical focus (generally those terms called ‘databases’).
Instead of data warehouse
A traditional, Bill Inmon architecture is built around a data warehouse – a very cryptic term. So, why not call it what it is – a database where data from various sources is integrated and where we capture history. In other words, let’s just call this our Integration and History Layer or our Integration and History Database.
Instead of data mart
Data marts? What are they there for? They really exist to distribute data to users in highly-performant, user-friendly formats. So, why not call this our Data Distribution Layer? This layer can be composed of any number of databases or data stores, all focused on the same job – rapidly distributing data to users. These databases can be star schemas but they can also be multidimensional databases (MDDB), flat files, specialized databases (like Qlik’s Associative Database), or any other format.
In the end, all viable data stores in a Data Distribution Layer have one thing in common – they provide high performance response to user queries. So, let’s group all of these technologies together and call them High Performance Query Technologies. (Maybe High Performance Data Distribution Technologies is even better? – I haven’t worked through that one yet).
Instead of operational data store
And finally, let’s think about that Operational Data Store – it’s supposed to be about reporting operational data, not running operational transactions. So, let’s call it what it is: an Operational Reporting Database. Our Operational Reporting Layer is comprised of one or more of these databases.
The “new” architecture
Putting this all together, we come up with a new picture of the modern reporting environment – a picture in which each component is explicitly named by the role it performs.
What you’ll gain
So, why buck the terminology trend? Because using this new terminology will allow you to shift focus away from what ‘everyone else is doing’ and onto the jobs you need done and the right way to do those jobs. It, then, opens up your mind, and that of your team, to the possibility of things like alternative high performance query technologies, virtual data warehouses, data lakes…
Focus on the job you need done. The ‘traditional’ jobs are pretty much all covered by this new architecture. If, however, what you need doesn’t fall into it, it’s fine to come up with new layers and components – just make sure you understand what new functionality you need from those components and then name them what they are (BTW, already, in these early days, there’s a ton of confusion around the term ‘data lake’ – have a better name for it?)
I believe this revised approach will reduce confusion and get your whole team – sponsor to designer to developer to tester to user – on the same page about what needs to be done.
Thanks for reading. Have any input? I’d love to hear it. Share by leaving a comment.