09 November 2011

No id, please

When I was 16 I got my first id card, that’s obviously a milestone in every person’s life. Not even using it for much (maybe for seeing M rated movies) it means that I have “id”. The second time I got an id was after I immigrated to Australia; as an immigrant to show Australian id is a real milestone.
In the SQL world - databases vs. data warehouses - we have two process. Learning databases you learn that normalization involves that each table has a unique key, and that every row in each table can be identified by a unique key. The first normal form requires every row to have a unique key. This unique key can be an “id” - identifier, or a composite key.
For many years I’ve been working with MS-Access. In Ms-Access, the best practice would be to have the unique key as an Identifier and not as a composite key; that’s due to the fact that if you want to link from one database to another it links the tables better. It’s usually best practice to split the access application to back end (data) and front end (user interface)  in two different mdb files, so when you try to change network location (or migrate the date to sql server) and you need to change “in bulk” the location of the referring database, if you have a field that’s defined as an “id” it would be done quicker and more accurately than it would if you have a composite key.
In data warehouses’ design things are different. While designing a dimension every row must have a unique key which is  a foreign key in the fact table. This unique key better be truly unique: not only it should be different than the “name” or the natural key of the external system, it should also be acting as a surrogate key, in case we’re going to change the “external system” that we’re working with. But apparently, a fact table doesn’t have to have a key! A unique composite key is “good enough”, and that's because we're dealing mainly with measures, and some people claim that even that isn't really necessary. Think "Sum" and then "group by"; no, we don't need Id as identity over there, do we? Moreover, I’ve found out that using a composite key gives me the ability to minimize the data and to warn me whenever I’m doing an error, like inserting the data with the wrong date. (I don’t disable the indices while inserting; I’d rather lose half a minute waiting for data to be inserted, than lose a day’s work thinking what went wrong with the job I’ve done. For my size of database it is “a good enough practice”).
So far I’ve found out that’s personally, this is the most difficult part of “denormalization”. Another thing which I needed to get my head around is that actually, when we’re discussing Hierarchies in SSAS, we don’t really need the “id” there, especially not for Slowly Changing Dimensions. Let’s take “employee” as an example: I’d group by the key of “employee id” and add a line for a change in the name, for example. As I group by the “employee id” I would like to see all that’s relating to a certain employee no matter if they got married/divorced in October and therefore changed their name... Keeping the “id” as an attribute, but outside of the “Hierarchy”, enables me to browse the Hierarchy correctly without seeing repeating lines of Employee Id where I shouldn’t.

No comments:

Post a Comment