15 January 2014

Distinct Count using Many to Many relationship - SSAS

I've blogged here about implementing a many to many relationship. That was in order to view changes in a dimension across time (like Clients changing Advisers between one month to another). This time I want to blog about implementing the same technique in order to view the distinct count of Clients, as obviously it is a very important measure for an advisory company.
The problem with Distinct Count for SSAS 2008 R2 is that it is heavy to use. Of course you can use some methods to improve the performance, but nevertheless it is a heavy task. In my not-too-big data warehouse eliminating the measure altogether gave me 1 minute of benefit in the processing, which is a lot. So instead of going Measures -> New Measure... -> and choosing Distinct Count for the appropriate measure (in my case: the natural key of the Client, as Clients are a Slowly Changing Dimension).
I recommend reading the wonderful document The many to many revolution. This article was written by Marco Russo & Alberto Ferrari and I base this blog entry on their work, and obviously on what works for me. You can read an example in this blog entry, but I think he ignores Slowly Changing nature of the Slowly Changing Dimension.
So this is how it's done properly:

Distinct Count on SCD type 2 using Many to Many relationship

The SQL part

First, you need to create a view on the SQL server. This view bridge the Client and the Fact Table.
Fact Table: FactCommission
Dim Table: DimClient
I don't want the (surrogate) key which links the Fact Commission to Dim Client: I want the Natural key!
Therefore I created a view as follows:

SELECT DISTINCT [Client Natural Key]
FROM         dbo.factCommission

Since I can't create an Index on a view which uses DISTINCT, I had to change the view as follows:

CREATE VIEW [dbo].[v_bridge_fCommissionClient]
WITH SCHEMABINDING
AS
SELECT  [Client Natural Key]
COUNT_BIG(*) AS totalNumber
FROM         dbo.factCommission
GROUP BY [Client Natural Key]


I also created a Unique Clustered Index on the [Client Natural Key] field. I am not certain that this part is necessary when you use M2M, on the other hand I don't think it hurts...

The Data Modelling Part

After you add the v_bridge_fCommissionClient table to the data source view, and add it to the Cube Structure. First it is a Dimension, and then you create a measure out of it (when it is based as a fact).
The dimension has obviously only one attribute: [Client Natural Key]. When you add it as a dimension it is important to make sure the the link to the Fact Commission is based on the [Client Natural Key].
And now in order to make this work across the board you must link *all* of the relevant dimensions from Fact Commission to the new dimension using the Many To Many relationship. Many would advise you to link the Time dimension via the Fact Table using the Many To Many: true, but not enough. All the dimensions that are needed to be queried are queried via the Fact Commission measure. Therefore, you need to link them as well using Many To Many relationship (see below). 


This way when you browse the data you get the numbers correctly for all dimensions.

Conclusion

I'm happy to say that:
  1. It does give you the correct data across the board just like the good old distinct count 
  2. It works like a charm and there seems to be no overhead for the SSAS engine on the extra data. I believe this new Measure helps even helps the Distinct Count measure (but maybe it's just a belief).
  3. Once you get the hang of it, it is actually even simpler than using Distinct Count...
  4. Marco & Alberto claims that it is easier on the Caching. I'll have to see about that.