Star Schema: Data Mart? Yes – Data Warehouse? No

Ben

Most people think of data warehouses as databases that solve reporting problems.  However, it’s more useful to think of them as addressing two sets of problems: 1) Reporting, or data distribution, problems and 2) Data integration problems.  You probably already realize this.  Heck, Bill Inmon’s original definition of the data warehouse said that it is a subject oriented, integrated, time variant, non-volatile collection of data in support of management decisions.

What you may not realize, however, is that the data structures best for data integration are not best for data distribution.  Data integration works best with narrow, normalized tables.  This format makes it easy to work with data at its atomic level, making small schema changes when required without causing major headaches.  Just like when building a model of anything (and a database is, after all, a model of reality), you’ll get a more accurate representation when starting with small things (e.g. atoms) than when starting with large things (e.g. wooden boards).

Normalized tables, however, don’t always work well for data distribution because a number of expensive  joins are frequently required to create reports.  (more on why data warehouses are best modeled in normalized schema in a future post).

Data distribution, however, works best with non-traditional data structures.  One example of these are wide, denormalized tables in forms like star schemata.  Other examples of non-traditional data structures would include multidimensional databases, like Essbase, and various in-memory kinds of databases like those offered by QlikView and Tableau.  These aren’t necessarily great for data integration but they are wonderful for distributing data really quickly.  In fact, I used to call them “HPQSs” for High Performance Query Structures (I think I stopped using that name because it takes too long to type).

Why are these non-traditional data structures so good at data distribution?  Because, in essence, they anticipate what reporting users will want to see and they pre-process the data to support those needs.  Think about it, if you stayed with a normalized schema, you could logically recreate a star schema dimension table by placing a view over a set of its tables.  You would, though, lose the performance benefits of doing that view’s joins in advance, like a true star schema dimension table does.

Now, some extremely bright people have come up with ways to replace normalized data warehouses with star schemata tied together via “conformed dimensions.”  It sounds like a nice work-saver but, in actuality, it creates architectures that don’t really support the broader range of things that a well-designed data warehouse can handle.  Things like serving as a source for master data.

Consider the following:

  • SUBJECTIVITY vs OBJECTIVITY: Star schemata are inherently subjective (e.g. is department a dimension OR an attribute of employee OR both?)  Normalized schemata, on the other hand, are less subjective – Since this data will eventually be used in a variety of ways, do you want the core place where all data is integrated to be highly subjective?
  • BRITTLENESS:  Star schemata are inherently brittle.  For example, if you integrate in a star, when you add a column to the source system you have to figure out how to handle all the historical records in the related dimension table.  Normalized schemata don’t suffer from those same effects.

In the end, yes, you can shoehorn a normalized schema into the reporting role or shoehorn a star schema into the data warehouse role but, really, they serve different purposes and are best at different things.  As a result, our preferred architecture for big, complex reporting needs is a normalized data warehouse feeding out to denormalized, or alternative technology, tools.

 

Thoughts or comments?  Please post them here.

Until next time…

Ben

Dataspace

White House response to petition to: Secure resources and funding, and begin construction of a Death Star by 2016

January 13, 2013 by  
Filed under Latest

Ben in front of whiteboard - half legsMy son pointed out the fact that the White House has received the following petition: “Secure resources and funding, and begin construction of a Death Star by 2016″. The full petition and the administration’s response is available here: White House response

 

Ben

 

 

 

dsi_logo

 

 

 

 

Amazon Redshift: Data Warehouse Appliance, not SAAS Data Warehouse

November 30, 2012 by  
Filed under Business Intelligence, Data Warehousing, Latest

BenOk, there’s a ton of publicity right now around Amazon’s Redshift service.  Every article I see calls it a software as a service (SAAS) data warehouse.  Really?

As we read in kindergarten, a data warehouse is a “subject oriented, integrated, non-volatile, time variant collection of data in support of management decisions” (Or something to that effect – I’ll have to see if I can dig up an old copy of Dick and Jane and Bill Inmon).

Now, Redshift is cool but it’s not a data warehouse.  From what  I can tell, it’s really a database (kind of like SQL Server, or Oracle, or MySQL …), optimized for data warehousing (kind of like InfoBright), running on servers at some Amazon facility, and accessible through the web.  In other words, it really saves you the headaches of buying and managing database hardware and software.

On the other hand, it doesn’t come with user requirements, schema, ETL designs or programs, business intelligence tools or any of the stuff we work so hard to build in data warehousing.

So, is Redshift cool & valuable?  Undeniably.  Is it a data warehouse?  Deniably (actually, the spell checker is telling me that “deniably” is not a word so let me change that to “No”).

Ben

Dataspace

Links to Details on QlikView’s New Access to Relational Data

October 22, 2012 by  
Filed under Business Intelligence, Latest

One place where QlikView has always been weak is in real time access to relational data.  But the company just announced a new technology, Direct Discovery, to overcome this weakness.  While the software is not available to customers yet, this posting on the data visualization blog discusses its plusses and minuses as well as how the market is driving this move and how it represents QV’s entry into the big data space.

RECRUITING: Technical Manager – Contract to Hire – Honolulu, Hawaii

October 9, 2012 by  
Filed under Latest, Recruiting

THE ROLE: Technical Team Lead

We are looking for a person to fill an interim (to permanent?) position as a technical leader of a team. This person should have skills and experience in the listed areas, but most importantly is able to interact with both business and technical management and to provide leadership to their team. Per our client, “This is probably a six to twelve month assignment and could lead to a permanent position if desired.”

THE LOCATION: Honolulu, Hawaii

 

REQUIRED SKILLS

This is a lead, not a developer, role.  However, strong knowledge of the following technologies is desired:

  • Basic Business Skills: Passion & Initiative -  An excitement about working with data along with the quality of being self-directed, solving problems, and providing recommendations even when not required to do so - Critical
  • Basic Business Skills: References - Critical
  • Basic Business Skills: Verbal – Critical
  • Analysis & Design: Technical Analysis & Design - The ability to understand the business and technical requirements of a system and design an architecture to satisfy those requirements - Critical
  • ETL: DataStage Development – Ability to develop and debug DataStage code – Important
  • ETL: DataStage Information Analyzer - The ability to profile data using the IBM Infosphere Information Analyzer product - Important
  • ETL: DataStage QualityStage – The ability to develop and implement the IBM Infosphere Quality Stage product - Important
  • Database: Oracle General knowledge – Basic familiarity with the Oracle RDBMS including knowledge of RAC & Data Guard - Important
  • Business Intelligence: Oracle OBIEE – Ability to implement and code in the OBIEE environment – Important
  • Basic Business Skills: Written – The ability to write documentation in English – Important
  • ETL: OWB (Oracle Warehouse Builder) - Ability to develop and debug OWB jobs – Nice to have
  • Database: SQL Server – Knowledge of the capabilities and limitations of SQL Server – Nice to have
  • Industry knowledge: Banking – Nice to have

 

Interested?

Contact our CEO directly at Benjamin.Taub@Dataspace.com.

About Dataspace

Since 1994 Dataspace has been a recognized leader in the BI / DW services arena.  We’re currently pushing ourselves through a major transformation, redoubling our efforts to be the #1 recognized name for BI / DW services.  Our goal is to be the premium “brand”, widely known as the most competent, client-focused consultancy in this space.

Stephen Few’s View of Big Data

September 21, 2012 by  
Filed under Business Intelligence, Data Warehousing, Latest

Stephen Few is a well known expert on the visual presentation of data.  He recently wrote this blog post on the concept of Big Data.  I couldn’t agree more.

Big Data & Data Warehousing: The Coming Shakeout – Part 3/3 – How to Move Forward

September 18, 2012 by  
Filed under Business Intelligence, Data Warehousing, Latest

Ben

In my last few posts I discussed how big data is really just data, how we’ve been in this situation before and how, over the next few years, the market will be ripe for consolidation.  In this post I’ll make some recommendations for how to move forward.  Now, remember, this is intended for readers in 2012.  If you’ve happened on this post is some archive and it’s now 2017 or later, please contact me and let me know how close I came.

RECOMMENDATION 1: Don’t Avoid Big Data

Just because the path forward isn’t 100% clear doesn’t mean that you should avoid getting into big data.  Work with the business to determine what they need.  Help them figure out what questions to ask and then use your web logs, your twitter logs, your machine generated data to help them answer those questions – mine value from that data.

RECOMMENDATION 2: Avoid Making Big Bets That Assume Your Big Data Technology Will be Around Forever

In the old days we used to say that, “No one ever got fired for going with Big Blue.”   In other words, so long as you chose IBM, no one could question your decision.  But, in 2012, there is no “IBM” for big data (although IBM itself actually has some compelling ideas here).  We just don’t know who’s going to win.  So, be cautious with ‘betting’ the farm today.  Instead, make small, manageable bets that get you into the field.

RECOMMENDATION 3: Consider Cloud-Based Options

As a follow on to recommendation two, seriously consider cloud based options.  One analogy to help think about this is your housing situation just after college.

Most of us begin our post-college lives as renters.  We don’t really know what we’re going to do with our lives or where we’re going to live in the long term (In my case, for example, my permanent home – Ann Arbor – really started as two years of graduate school which eventually went horribly awry).  So, we rent.

As we get more experienced we settle down and many of us buy.

Today, most of us are the big data equivalent of new graduates.  It’s very hard to tell what the future holds.  And, in cases like that, you don’t want to sink your tent stakes too deeply into the ground (I’m not a camper but did I get that analogy right?).  Thus, renting (i.e. housing our data in the cloud) is a great way to start.

RECOMMENDATION 4: Build Loosely-Coupled Architectures Between Data Warehouse & Big Data Stores

This is a good recommendation for any data warehouse / data integration architecture: loosely couple your ETL jobs.  In other words, rather than writing ETL that brings data from your big data stores into your data warehouse, write two jobs: one to extract from your big data stores and the second to load into your data warehouse.  This approach protects at least the load half of your ETL jobs should you decide to change your big data technologies.  This, of course, assumes that we haven’t yet developed the technologies to make all this detail available directly in your data warehouse.

RECOMMENDATION 5: THE MOST IMPORTANT RECOMMENDATION – Make Sure you Really Have “Big Data”

The fact that you have social media data or machine-generated data doesn’t mean you have “Big Data”.  It could mean, simply, that you have “Data”.  We’re way past the time when a few million records a year is really anything special.  We regularly handle data that size on laptop computers.  So, before dumping a lot of time and money into big data technologies, make sure that your current technologies can’t already handle your needs.

It’s certainly fun to lead the technology curve but it’s even more fun to see big payoffs from small investments.

Wrapping Up

So, fight it all you like.  Kick and scream about how “big data” has really become “big marketing”.  The truth is that there is value in capturing and analyzing large volumes of machine generated data.  Get started now but be smart, and be ready for a shakeout in the market for big data technologies.

Ben

Dataspace

Big Data & Data Warehousing: The Coming Shakeout – Part 2/3 – The RDBMS Shakeout as a Model for The Future of Big Data

September 11, 2012 by  
Filed under Business Intelligence, Data Warehousing, Latest

Ben

In my last post I discussed how big data is really just data and, while it hasn’t happened yet, history shows that our ‘big data’ and our analytic data stores will eventually be integrated.  This points to a situation where big data technologies are in flux and the eventual, long term, industry standard tool set isn’t yet known.

Another perspective on the big data / data warehouse situation points toward a coming big data shakeout.

RDBMS 1990

In the early 1990s there were a lot of credible, relational database (RDBMS) choices.  For example, back in ’90 you could have chosen Oracle or IBM DB2 or Informix or Sybase Database Players - 1990SQL Server (yes, it was called that) or Microsoft SQL Server or Digital Equipment’s RDB, or Ingres, or…

Most of these still exist in one form or another but we all know that the real winners were Oracle, IBM and Microsoft.  And organizations that made the wrong choice back in 1990 were frequently forced to pivot to one of the winners.

Big Data 2012

Big Data Players - 2012

Now look at big data in 2012.  You can choose between Hadoop, HBase, Big Table, Mongo DB, Couch DB.

To work with data you can choose Pig or Hive or MapReduce or Ysmart or…

What Will Happen?

So, who’s going to win and who’s going to be left out in the cold?  In all honesty, I don’t know.  But, history does show that the market eventually declares winners and losers.  The big data space will be no different.  Perhaps the correct strategy for now is… caution.  Caution!

Up Next Week

My next post will go over some recommendations on how to move forward with big data in 2012.

Ben

Dataspace

RECRUITING: Business Intelligence / Data Warehouse Consulting Experts

September 5, 2012 by  
Filed under Latest, Recruiting

Overview

Are you a BI/DW consultant whose firm just doesn’t put enough emphasis on your specialty?

Are you working in BI/DW for a company but crave to be the product that your company delivers, rather than just expendable overhead?

Do you want to work with a select set of like minded professionals as we build something special?

Dataspace is looking to hire a few, extremely skilled business intelligence consultants who want to dedicate their careers to being the best in the field.  We are looking for the ‘cream of the crop’ and plan to grow them further.

Our business plan relies on the development of nationally recognized BI/DW experts.  In other words, our success and recognition as a firm depend on your success and recognition as a professional.

Some benefits of joining us include:

  • The opportunity to work as part of a group of professionals who share the same, intense focus as you.
  • A strong compensation plan that provides for significant upside.
  • A direct line to the company’s CEO and the ability to impact our direction.
  • Performance goals that include elevating your personal status in the data and business communities.
  • Opportunities to participate in sales and marketing efforts.
  • Periodic company meetings for information sharing and team building.

Location

Ann Arbor, Michigan preferred but may be negotiable.  While we ask clients to support remote work when it does not affect the quality or timeliness of projects, some travel (3 – 4 days / week) may be required.

Type

Permanent hire

Experience Level

Senior level position (minimum of five years experience required.  Ten+ years preferred)

Industry

Consulting

Compensation

$120,000 + performance based bonuses + extremely strong benefit package.

Description

  • Lead Business Intelligence / Data Warehouse / Data Architecture efforts and be comfortable playing any role on those efforts.  Tasks may include:
    • Requirements gathering and documentation
    • Conducting assessments and issuing opinions regarding existing BI & DW systems
    • Data modeling
    • Architecture design
    • Workplan development
    • Management of contract developers
    • ETL, DB & BI development
    • Sales support
  • You must be a generalist, able to pick up any role on a project;  However, you will also be expected to serve as Dataspace’s expert in some vendor or technology for when expert insight is required.

Skills & Experience

Required

  • Strong familiarity with data modeling, business intelligence, ETL, system architecture design & project management.
  • A love of BI & DW & Making Data Make Sense – You choose to do this because this stuff is cool and you want to help organizations do it successfully.
  • At least five years in leadership roles on business intelligence / data warehousing projects.
  • Bachelor’s degree.

Desired

  • Ten+ years in leadership roles on business intelligence / data warehousing projects.
  • Advanced degree in business or computer science.

Interested?

Contact our CEO directly at Benjamin.Taub@Dataspace.com.

About Dataspace

Since 1994 Dataspace has been a recognized leader in the BI / DW services arena.  We’re currently pushing ourselves through a major transformation, redoubling our efforts to be the #1 recognized name for BI / DW services.  Our goal is to be the premium “brand”, widely known as the most competent, client-focused consultancy in this space.

Big Data & Data Warehousing: The Coming Shakeout – Part 1/3 – The Drive for Detail

September 4, 2012 by  
Filed under Business Intelligence, Data Warehousing, Latest

Ben

At this point in time it would be ludicrous to deny the truth: big data exists and it’s here to stay (a quick definition, from Wikipedia, In information technologybig data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools)  For the most part, big data is data generated by machines and programs.  It includes things like social media data (like twitter tweets & Facebook posts), website generated data (like web logs), and machine generated data (like intra-second readings taken by industrial equipment).

However, we are just at the beginning of the big data movement and, as I’ll discuss in this and my next few posts, experience with similar movements teaches us that, at this point it is probably better to treat big data in small tactical projects rather than large strategic efforts.

What’s Happening Today

Looking at case studies you’ll see that most data warehousing applications of big data actually separate the data warehouse from the big-data data store.  For example, companies will gather their detail data in Hadoop, or some similar technology, and then periodically aggregate it into their data warehouses.

Hey, Haven’t I Seen This Before?

I entered the IT field in 1989.  Back then we experienced something very similar to today’s “Big Data” craze.  We called it “A lot of Data”.  Believe it or not, the bits we worked with were shaped exactly the same as they’re shaped today – nothing has changed except what was an enormous volume back then is minuscule by today’s standards (don’t even get me going on how cool it was to have 10k of hard disk storage on a PC in 1982).

In an effort to improve reporting performance, when data set sizes got too large, we’d summarize the data and move the summaries into summary tables and multidimensional databases for analysis.

Still, we yearned for and craved analytic access to the detail (yes, actually yearned AND craved – it was a simpler time).  Ralph Kimball told us that “retail is detail” and taught us that, if we kept our detail, we could always roll up but, once we lost the detail, we couldn’t drill down.

How Did We Get Back to Detail?

The data warehousing industry attacked the problem of detail in a variety of ways.  These included techniques like summary tables with aggregate navigation; database technologies like partitioning, advanced indexing technologies and purpose-built databases like Red Brick; AND new hardware technologies like Teradata and data warehouse appliances.

Each of these tools gave us the ability to store our detail in the data warehouse while still providing rapid response to aggregate queries.

There is Nothing New Under the Sun

Thus, using history as a guide, this cycle will repeat itself.  What we call “Big Data” today, we’ll just call “Data” tomorrow.  In the next few years our industry will develop the tools and techniques necessary store our detail data along with our summary data, removing the Big Data – Data Warehouse tier from our future architectures.

So, if you’re implementing big data today, there are some steps you should take to insulate yourself from these coming technology shifts.  Let’s cover these in part 3 of this series of posts.  Next week: Part 2 – The RDBMS Shakeout as a Model for The Future of Big Data.

Ben

Dataspace

Next Page »