Simple Analogies that Seemed to Flip the Switch
in my SSAS Workshops
There are many mental blocks
that attendees of my SQL Server Analysis Services (SSAS) who are new to OLAP
concepts face during my workshops. It's usually worse for the
advanced relational database folks who must fight their zeal for 3rd Normal Form
perfection. Over the years I've tried many different analogies to help the
aspiring SSAS developers to overcome these mental blocks. Some worked well for
everyone, some succeeded for a few but left the others more confused, and some
failed miserably. The analogies range from basic to advanced concepts. (The
basic analogies are used in the version of my workshop taking the attendee from
a solid Level 100/weak 200 to a solid 200/weak 300, and the advanced analogies
are used in the version of my workshop taking the attendee from a solid Level
200/weak 300 to a solid 300 or weak 400 for the exceptional attendees.)
Since the number of SSAS developers
is rapidly increasing due to the general growth of BI and the growing success of
PerformancePoint Server 2007, I thought I'd create a new category, "Lightbulb BI
Thoughts", on this blog site where I'll post these analogies I've used. Many of
the posts in this category may be duh to many of my colleagues (so skip anything
under this category) or perhaps the metaphors aren't 100%, but it will hopefully
help to wire those last few neuron gaps in the new SSAS developers.
Therefore, posts under this category are intended for people in the process of
learning OLAP, not as an introduction to OLAP nor to provide tips and tricks for
advanced OLAP developers.
I'll begin this category with a
graphic I bring up early on in the workshop (at least in the basic version). The
mental block it targets is that OLAP data structures are exotic technologies
stolen from stranded extra-terrestrial aliens. This mental block usually happens
with advanced relational database folks. I can see the gears in their
heads constantly trying to figure out how MDX works in terms of SQL.
In a relational database, we find
data by sifting through a set of tables, joining them, sifting that result,
until we end up with the end result of a table. In OLAP we find data by
specifying coordinates much as how we specify the coordinates of a place on a
mountain trail with longitude, latitude, and altitude (spatial dimensions). This
is an easy way for people to find information. Using a spreadsheet as
a two-dimensional example of an OLAP data source with cities on the X axis and
dates in the Y axis, we can quickly find the sales amount of a particular city
on a particular date by looking for the respective points on the axis of the
city and date in question and tracing to the point of intersection.
Arranging data in this
multi-dimensional manner makes it easy to find values so we use a language
geared towards multi-dimensional access (MDX - Multiple
Dimensional eXpressions) to talk to these data sources. If one were simply a
consumer of OLAP data (a business analyst), this is all that would need to be
said. However, for developers of OLAP cubes (modelers, DBAs), questions around
configuration and optimization also arise. That means there are hundreds of
questions around what all those dozens of properties in BIDS (the environment
where SSAS cubes are developed) mean. Many of those properties make a lot more
sense when it is realized that although one talks to the OLAP data source in
multi-dimensional terms, the OLAP data source is persisted in a format that is
pretty much the same as any relational data store. There are a few sort of
exotic structures such as bitmaps and trees, but thought of in relational terms,
it's easier to understand the "whys" of those dozens of properties.
The reason we talk to OLAP data
stores in multi-dimensional terms but store the OLAP data in a rather relational
manner is due to the "sparsity" of multi-dimensional spaces. Sparsity means that
most points in a multi-dimenional space have a value of null. A non-null value
occurs at points where something has happened. For example, if there was a sale
in Boise for a refrigirator by Brad, the point where Boise, refrigirator and
Brad intersect has the value of the sale. For a cube to be "dense", there would
need to be a fact for most or all combinations of points. For example, every
salesman would need to sell every product every day to every customer in
every city in every ... well, you get it. So, if there were 100 salesman, 1000
products, over 1000 days (3 years), to 100,000 customers, and 100 cities, the
cube-space would have a quadrillion points, requiring at least four quadrillion
bytes to hold a 32-bit int value. One or two dimensions more and the cube space
eventually goes into the gazillions.
It's very fortunate that
multi-dimensional data sources are sparse. If they weren't, we probably wouldn't
have such a thing as OLAP. If I knew nothing about OLAP and someone approached
me to develop a multi-dimensional data store, being very honest with myself, I
suppose my first thought would be to store the structure as a multi-dimensional
array, for example: private int[] cellvalues = new
int[1000,1000,1000,100000,100]; If that's how cubes were stored, I certainly
wouldn't have been able to create cubes on my old piddly 1 GB RAM, 80 GB hard
drive laptop. Conversely, if cubes were dense, that n-dimensional array would
then be the best way to store and retrieve OLAP data with little overhead and a
direct way to access the cells.
So the multi-dimensional data
stores only what exists (not a slot in a multi-dimensional array for every
intersection) in structures somewhat similar to relational schemas. The figure
below shows how a user requests a portion of the mostly sparse cube-space (a sub
cube) in the MDX language and SSAS in turn queries its rather relational data
stores in a way akin to SQL. (I stress that by "its rather relational data
stores", I'm referring to the fact data, aggregations and variety of indexes
SSAS generates when a cube is processed - at least as far as MOLAP is concerned
- not the underlying data warehouse data on the far right of the
picture.) Understanding this dichotomy of OLAP is one of the foundational steps
to OLAP greatness.

Day 2 is about MDX, the language
and how to use it for querying, in the MDX Script, KPIs, Actions, and role-based
security. We focus solely on the multi-dimensional aspects. In teaching MDX I
ensure the attendee is fully cognizant the underlying mission of MDX to explore
dimensional space. An empathy for the basic workings of the OLAP engine are
stressed; being fully cognizant of what the OLAP engine generally does when
recieving the query. At the highest level this means the notion of first
determining what the boundaries of the sub-cube we're requesting are (what are
the points on each axis), then filling up each point in that cube-space is
stressed. Read Chapter 4 - MDX Query Context and Execution - of
MDX Solutions.
Day 3 is about partitioning,
aggregations and optimization - a complete 180 from Day 2, This means we talk
more in terms of relational data stores and not very much in multi-dimensional
terms. Measure Group partitions and aggregations are pretty much stored in
structures kind of as they would be in a relational database (but more optimized
since a relational database like SQL Server needs to be more "general"). So,
many of the same relational database concepts apply similarly to optimizing an
OLAP cube. For example, fewer joins, narrower tables, etc, are better.
Day 4 and 5 are generally geared
towards building a quick proof-of-concept using the customer's data.
Again, this all may be obvious to
most reading this post, but I've seen it snap a few synapses in place on more
than one occassion.
I'll try to post more of these
short insights as often as I can.