It’s been a while since I worked on an interesting Business Intelligence problem, so a recent distraction has proved most stimulating. The reason for this post is a question I’ve had to consider many times in the past: should I design my schema by looking at the reports I’ve been asked for, or is there some other driving force to help me identify “the grain” of a dimensional model?
In this particular case we’re capturing events out of a requirements/project management system and providing analysis of process flow and planned (versus actual) cost. This is fascinating. We might for instance, be looking to answer a question about how far through the design/development process a product has come, how long it took to get there or how much CAPEX is estimated for each segment (at whatever level). The interesting part is that the source data is hierarchical, decomposing from high level demand driven requirements capture down to the more detailed level (i.e., engineering). I’ll describe this in a little more detail later on, but for now we’ll note that requirements are decomposed over time and in all levels of this parent-child hierarchy we see both transitive and independent transient state changes taking place.
I’ve always believed “the grain” of a dimensional model to be the atomic definition of the event that creates a fact. To adhere to this means that the grain can only pertain to a specific domain of discourse, to one specific business process, which is as it should be. Fact tables of mixed granularity open up all kinds of design and implementation problems in development, make for nightmarish back room maintenance and present a range of reporting problems such as double counting and measures whose additivity is unclear to users. A corollary to this rule about maintaining proper granularity in the data model, is that relying on tools support to “work around” granularity problems never seems to work in practice. Sticking to a single, atomic grain ensures that the fact tables are easy to implement (i.e., the back room process is fast to develop and simple to maintain), durable to changing business needs and performant in the face of ad-hoc user queries. Another key consideration is that granularity should be determined by the availability of physical source data. This reenforces the idea that a fact table record is an atomic view of an event in the business process under analysis. The physical definition of a fact table record is, of course, more than this; A catenation of the dimensionality of the subject area (in the imported surrogate dimension keys) and quantifiable measures. The question of source data granularity goes far beyond simply identifying which fields are available to us however. Let’s consider an arbitrary example to see why.
In an operational system tracking accounts receivable, the concept of balance is likely held as a scalar value against a given data entity (such as customer account). There is no concept of history here because the system is designed to answer only for the current balance. If you want to understand the account history (how often has this customer been in the red, for example), then you need to keep a history of your changes and that requires taking snapshots over time. It is unlikely that you’ll get exact timestamps for each balance update from the transactional system, so you’ll have to track the changes each time you take a snapshot of the source data and timestamp it yourself. This is where the ETL process comes in; Whether your updates are pushed at you (event driven BI) or you pull snapshots at predefined intervals, your back room process has become responsible for tracking the dates and times at which the change in balance was observed. Despite the inferred time stamp, the observation date is valid source data and clearly eligible for inclusion in the granularity of the data model, allowing for analysis of balance trends over time. Despite insisting on atomicity for fact records, source events need not necessarily come from a single (or even a physical) data source, but life is far simpler for the ETL developers if they do.
Going back to the problem at hand then, there is clearly a question about how source data changes over time and how we observe these changes as part of our ETL process. In our hierarchy of stories/requirements, there are a few things to consider. Firstly, change events for entities at all levels are delivered to the back room via a queue. Secondly, some process states are transitive and affect children (for example, we may say that a work package cannot enter the “committed” state and obtain due dates until the entire tree – from customer demand stories down to engineering goals – is fully estimated). Our first port of call is understanding the grain of the events. Are events published each time a record changes, only during state transitions, or based on some other set of rules? How can we want to track changes across the levels of the hierarchy? If we’re tracking the hierarchy changes themselves over time (such as when a story was decomposed) then that is likely a whole subject area by itself. There is no sane approach to defining a model that captures changes to this kind of complex directed graph whilst also capturing atomic and/or summary measures at the same time; A “factless” fact table ensues and our grain changes completely. If on the other hand, we care only about tracking state changes to individual records then our modelling of the levels doesn’t affect the grain because the subject area remains focussed on tracking planned cost across each of the levels. Once again we find that the grain is defined by the availability and structure of the source data. We will consider how to model the level based hierarchy in the dimensions and so on, but the overall grain will not change because of this. The grain is described by the subject area, which is aligned with a specific business process. In terms of the measures then, questions arise such as whether or not the cost analysis is only valid for a committed demand story, or one that is associated with a release? Such business rules again reinforce the correlation between subject area, business process and dimensional model granularity.
To paraphrase Ralph Kimball, “A dimensional model designed around business questions and reports has lost its connection to the original data source and is hostage to the ‘report of the day’ mentality that causes such a database to be tweaked and altered until no one can explain why a record is in the table or not.” So clearly whilst looking at the business questions that need answering is important, understanding the grain of the source data is more so. And looking at specific reports is less useful still, because they don’t tell us anything about the granularity of the real data or the business process it is supporting.