慢到底有多慢?在SSA中慢慢改变尺寸

SCD是设计良好的数据仓库的关键组件

许多商业智能培训课程中似乎省略的一个重要主题是缓慢变化的维度或SCD。当我教双级时,我要覆盖这个话题,它可能会让更多的学生对BI的比任何其他话题多。并相信我,这比做更容易。兴奋和商业智能并不总是适合同一个句子。..好的,所以我们知道数据仓库由涉及在“星形”或“雪花”模式中排列的事实,尺寸,属性和层次结构的关系设计。事实通常是我们想要用于评估销售收入,成本,利润率等的业务的数字价值。如果事实是我们想要衡量的事实,那么我们希望如何分析事实;例如,销售收入逐年,季度,按月,客户或区域。属性被添加到维度表中以填充维度,以使其更具意义,例如客户人口统计数据,如年龄或工资范围。层次结构是彼此相关的多个维度,例如年/四分之一/月/月或客户/地区。那么慢慢改变的维度是什么? As the name suggests, it’s a dimension that changes slowly and predictably. The big decision we have to make is: do we care? When a customer moves from one region to another, what should happen to their previous orders? If we are not careful, they will end up appearing under the new region and go missing from the old one. Not bad if you are a sales rep for the new region but disastrous if you are the sales rep who actually made those sales. Another example is when analyzing schools: when a student moves to another school should their previous exam results apply to the new school? Of course not. (Depends on whether they are a good or bad student I hear you say? Shame on you…). We want to be accurate in all such cases, but with standard dimensions we may fall into the trap of losing sight of that history. At the dimension attribute level there are at least 3 types of SCD. Type 1 SCD means the attribute is a “changing attribute” but we only care about the most current value. Type 2 means it is a “historical attribute” and we very much care about maintaining historical accuracy. Type 3 is for the rare attribute where we only care about the original and current value but not those changes in between, sometimes called “First and Last”. There are some other types but then we are getting too academic for this time of the day… If we do nothing about our dimension design, we will end up with all Type 1 attributes. This might be OK for Customer Last Name which may change but as long as we don’t need to analyze how many “Smiths” bought a particular product then Type 1 should be just fine. Sales By Region or Scores by School are classic examples where Type 2 SCD design is definitely needed. And the trick to enable this design is both simple and ingenious. For the customer who moves from one region to another, or the student who moves to a new school we need to create a row in the dimension table for each move with a corresponding Start Date and End Date to indicate the period they were there. However, relational constraints limit the Primary Key to a single unique value per row so, enter the concept of the “Surrogate Key”. Within the Data Warehouse design, the original business key from the operational system becomes just another attribute. Then we add a surrogate key as the primary key and have the system generate it to be unique. Now we can have multiple rows for a customer or student, with accurate start and end dates to indicate the history of the moves. The single row without an end date value will indicate the current location. SSIS has a Slowly Changing Dimension Transform which helps us with the incremental load. And here is the ingenious part: we only need to join to the fact table using the surrogate key because, as long as we load the data correctly with our incremental load processes, the end user will be able to analyze Customer by Region or Scores by School with confidence of accuracy even when analyzing over time-periods. The “magic” is in the data load. Don’t worry; I was skeptical too, until I tested I out. It’s a “eureka moment” when you see it working for the first time. The Business Analysts then perform their analysis as normal, using high performing cubes or multi-dimensional databases, oblivious to the underlying complexity using their “drag and drop” GUI in Excel and it’s “Business Intelligence as Usual” except with deadly accuracy. Cheers, Brian.

加入网络世界社区有个足球雷竞技app脸谱网LinkedIn对自己最关心的话题发表评论。

版权©2011Raybet2

2021年IT薪资调查:结果已经出来了