Simple Analogies that Seemed to Flip the Switch in my SSAS Workshops

There are many mental blocks that attendees of my SQL Server Analysis Services (SSAS) who are new to OLAP concepts face during my workshops. It's usually worse for the advanced relational database folks who must fight their zeal for 3rd Normal Form perfection. Over the years I've tried many different analogies to help the aspiring SSAS developers to overcome these mental blocks. Some worked well for everyone, some succeeded for a few but left the others more confused, and some failed miserably. The analogies range from basic to advanced concepts. (The basic analogies are used in the version of my workshop taking the attendee from a solid Level 100/weak 200 to a solid 200/weak 300, and the advanced analogies are used in the version of my workshop taking the attendee from a solid Level 200/weak 300 to a solid 300 or weak 400 for the exceptional attendees.)

Since the number of SSAS developers is rapidly increasing due to the general growth of BI and the growing success of PerformancePoint Server 2007, I thought I'd create a new category, "Lightbulb BI Thoughts", on this blog site where I'll post these analogies I've used. Many of the posts in this category may be duh to many of my colleagues (so skip anything under this category) or perhaps the metaphors aren't 100%, but it will hopefully help to wire those last  few neuron gaps in the new SSAS developers. Therefore, posts under this category are intended for people in the process of learning OLAP, not as an introduction to OLAP nor to provide tips and tricks for advanced OLAP developers.

I'll begin this category with a graphic I bring up early on in the workshop (at least in the basic version). The mental block it targets is that OLAP data structures are exotic technologies stolen from stranded extra-terrestrial aliens. This mental block usually happens with advanced relational database folks. I can see the gears in their heads constantly trying to figure out how MDX works in terms of SQL.

In a relational database, we find data by sifting through a set of tables, joining them, sifting that result, until we end up with the end result of a table. In OLAP we find data by specifying coordinates much as how we specify the coordinates of a place on a mountain trail with longitude, latitude, and altitude (spatial dimensions). This is an easy way for people to find information. Using a spreadsheet as a two-dimensional example of an OLAP data source with cities on the X axis and dates in the Y axis, we can quickly find the sales amount of a particular city on a particular date by looking for the respective points on the axis of the city and date in question and tracing to the point of intersection.

Arranging data in this multi-dimensional manner makes it easy to find values so we use a language geared towards multi-dimensional access (MDX - Multiple Dimensional eXpressions) to talk to these data sources. If one were simply a consumer of OLAP data (a business analyst), this is all that would need to be said. However, for developers of OLAP cubes (modelers, DBAs), questions around configuration and optimization also arise. That means there are hundreds of questions around what all those dozens of properties in BIDS (the environment where SSAS cubes are developed) mean. Many of those properties make a lot more sense when it is realized that although one talks to the OLAP data source in multi-dimensional terms, the OLAP data source is persisted in a format that is pretty much the same as any relational data store. There are a few sort of exotic structures such as bitmaps and trees, but thought of in relational terms, it's easier to understand the "whys" of those dozens of properties. 

The reason we talk to OLAP data stores in multi-dimensional terms but store the OLAP data in a rather relational manner is due to the "sparsity" of multi-dimensional spaces. Sparsity means that most points in a multi-dimenional space have a value of null. A non-null value occurs at points where something has happened. For example, if there was a sale in Boise for a refrigirator by Brad, the point where Boise, refrigirator and Brad intersect has the value of the sale. For a cube to be "dense", there would need to be a fact for most or all combinations of points. For example, every salesman would need to sell every product every day to every customer in every city in every ... well, you get it. So, if there were 100 salesman, 1000 products, over 1000 days (3 years), to 100,000 customers, and 100 cities, the cube-space would have a quadrillion points, requiring at least four quadrillion bytes to hold a 32-bit int value. One or two dimensions more and the cube space eventually goes into the gazillions.

It's very fortunate that multi-dimensional data sources are sparse. If they weren't, we probably wouldn't have such a thing as OLAP. If I knew nothing about OLAP and someone approached me to develop a multi-dimensional data store, being very honest with myself, I suppose my first thought would be to store the structure as a multi-dimensional array, for example: private int[] cellvalues = new int[1000,1000,1000,100000,100]; If that's how cubes were stored, I certainly wouldn't have been able to create cubes on my old piddly 1 GB RAM, 80 GB hard drive laptop. Conversely, if cubes were dense, that n-dimensional array would then be the best way to store and retrieve OLAP data with little overhead and a direct way to access the cells.

So the multi-dimensional data stores only what exists (not a slot in a multi-dimensional array for every intersection) in structures somewhat similar to relational schemas. The figure below shows how a user requests a portion of the mostly sparse cube-space (a sub cube) in the MDX language and SSAS in turn queries its rather relational data stores in a way akin to SQL. (I stress that by "its rather relational data stores", I'm referring to the fact data, aggregations and variety of indexes SSAS generates when a cube is processed - at least as far as MOLAP is concerned - not the underlying data warehouse data on the far right of the picture.) Understanding this dichotomy of OLAP is one of the foundational steps to OLAP greatness.

Introducing this notion of speaking in multi-dimensional terms to a "virtual" multi-dimensional data source that really stores things in a rather relational format early in the workshop helps get people through the first three days. During Day 1, I take the class through building a very simple cube. What I do differently is that while processing the dimensions and cubes, I go over what the SQL queries to the underlying data source are doing and how that data is stored in the SSAS stores. Of course, this makes going to the point of building a cube with the Cube Design Wizard take the entire day. At that point, we see what a cube looks like and how it's browsed, but we really don't use much MDX. So, Day 1 talks about both the multi-dimensional nature of how we interact with cubes and the rather relational nature of how the OLAP data is stored. Read Chapter 20 - The Physical Data Model - of Microsoft SQL Server 2005 Analysis Services.

Day 2 is about MDX, the language and how to use it for querying, in the MDX Script, KPIs, Actions, and role-based security. We focus solely on the multi-dimensional aspects. In teaching MDX I ensure the attendee is fully cognizant the underlying mission of MDX to explore dimensional space. An empathy for the basic workings of the OLAP engine are stressed; being fully cognizant of what the OLAP engine generally does when recieving the query. At the highest level this means the notion of first determining what the boundaries of the sub-cube we're requesting are (what are the points on each axis), then filling up each point in that cube-space is stressed. Read Chapter 4 - MDX Query Context and Execution - of MDX Solutions.

Day 3 is about partitioning, aggregations and optimization - a complete 180 from Day 2, This means we talk more in terms of relational data stores and not very much in multi-dimensional terms. Measure Group partitions and aggregations are pretty much stored in structures kind of as they would be in a relational database (but more optimized since a relational database like SQL Server needs to be more "general"). So, many of the same relational database concepts apply similarly to optimizing an OLAP cube. For example, fewer joins, narrower tables, etc, are better.

Day 4 and 5 are generally geared towards building a quick proof-of-concept using the customer's data.

Again, this all may be obvious to most reading this post, but I've seen it snap a few synapses in place on more than one occassion.

I'll try to post more of these short insights as often as I can.