If you have managed the design of transactional systems, you
would have known that in-spite of all its stated benefits (like data integrity,
clean design..), normalization is not always the best choice, as it sometime
makes the design too complex and also negatively impacts the performance (due
to complex Joins). The same principle applies in Data warehouse as well, and
much more. As Data warehouse population is typically done once in a day and
it is done through an extensive Extraction, Transformation and Loading routines
with extensive DW data quality checks, the DW data is consistent and has integrity.
The fundamental requirement of a DW is the fast response time of any expected
and unexpected query (Refer Data Warehouse v/s Transactional Systems ). Therefore
we expect dimensional models to be more de-normalized (or totally de-normalized).
Any joint can put a large overload for a large query.
Balancing Tips
- Avoid Extremes: It is advised to use a balanced normalization approach like star-schema
or snow-flaked schema (in limited cases as it can take normalization to unacceptable
levels). A single big table with facts and dimension attributes is another extreme,
which one should avoid.
- Descriptive in snow-Flake: Descriptive (or what we call them as non-classification) attributes in a dimensional model are those attributes, which are not used for analytics, but mainly for reporting. You do not do any kind of derivation or modeling around these attributes. examples are - Address (not pin-code as that can be used), description and Name. For these attributes, you can make them part of snow flake, whereby they are not directly linked to the fact table (refer Dimensional model schema to understand these terms). The reason for keeping any attribute in a star-schema model is to enable fast query response time on large and/or ad-hoc queries. The non-classification attributes will not be typically needed for these kinds of queries.
- Different approach for aggregate vs. transactional schemas: It has been mentioned before that for the same set of data, you should not mix the detailed and summary level in the same star-schema. You should have a separate star-schema for detailed data and separate schema for summary data. In that kind of situation, the chances are that summary schema will be used for analytics and detailed data for enterprise reporting or when you want to do drill down from summary data. In that kind of situation you can have the detailed data in more of a normalized form (star-schema), and have the summary data in relatively de-normalized form (snow-flake). For example, if you have the sales revenue data in the aggregate form at sales agent level, you may keep it in star-schema. The same data at invoice level (transaction level) can be kept in the snow-flake form.
|