Slowly Changing Dimension type 2 in Property and Casualty Insurance Company Data Warehouse

Contents

Slowly Changing Dimension type 2 in Property and Casualty Insurance Company Data Warehouse. 1

Introduction. 1

Slowly Changing Dimension Type 2. 1

History of changes in the transactional system.. 1

Business Case from a property and casualty insurance company (back dated transactions). 1

Change Date. 1

ETL tool and later coming changes (back dated transactions). 1

Conclusion. 1

 

 

 

Introduction

 

Property and Casualty Insurance case study is the latest one in the best-selling book The Data Warehouse Toolkit (Ralph Kimball) and it unites concepts from all previous chapters. But some real-life cases were not found even in this fundamental study.

 

In fact, I did not find any mentioning of the problem Later Coming Changes (changes not in order) in Slowly Changing Dimension Type 2 (SCD type 2 or SCD2 later) or using SCD2 in monthly aggregated tables. The other issues we have in building DW model and SCD Type 2:

 

-          The history of the changes is very specific for the transactional system in use. No dates are stored with the history of changes, a separate special table should be used to join the changes and the dates of changes. On the other hand, there are a lot of available dates used in the process and too many choices for ChangeDate.

 

-          There are no primary keys in most of the source system tables and no foreign keys.

 

-          There are also inconsistent data updates throughout the life of a policy and risk.

 

-          The values of the attributes are not always consistent due to different factors

 

Some of the issues is the result of using XML based transactional system. The source data are XML, not relational tables. There is an additional process to shred XML data into relational tables not very well designed for transactional inserts and updates or reporting. It`s rather a data storage.

 

The other issues come from the business processes adopted in the company.

 

Does it mean we can not use dimensional modeling? The fundamental base of the enterprise data warehouse is a 3rd party, out of the box system. It`s designed as a dimensional model, with a few star schemas which use conformed dimensions. It`s natural at least to try to continue development of the existing model.

 

Out of the box DIM_POLICY and DIM_CLAIM, originally set up in the DW as SCD Type 2 do not contain historical info and were demoted to SCD Type 1.

 

Instead, several other custom dimensions related to insured risks were created as SCD Type 2:

DIM_VEHICLE

DIM_DRIVER

DIM_BUILDING

 

Let`s see how we workaround or resolved the issues.

 

All premiums and classification codes as well as table structures in examples below are not real and provided only as a schematic illustration.

 

Slowly Changing Dimension Type 2

 

SCD type 2 is well known and widely used technic in DW design. It`s used to keep a history of changes. A new record is added in the dimension each time when a value of an attribute is changed.

New records in a fact table will use this new added row in the dimension.

 

 

Column Name

Data Type

Req`d

PK

Comment

Risk_Id

Int

Yes

Yes

Risk_Id identifies DIM_RISK, primary, surrogate key

Risk_Uniqueid

Varchar(100)

Yes

 

Risk_Uniqueid is a unique ID of a risk from a source system, natural key

Valid_FromDate

Date

Yes

 

Risk state is valid starting from Valid_FromDate

Valid_ToDate

Date

Yes

 

Risk state is valid till Valid_ToDate

IsCurrent

Varchar(100)

Yes

 

IsCurrent identifies if the record is a current (Y/N), latest record. Optional in the dimension structure. Can be calculated based on Valid_ToDate

Record_Version

Int

Yes

 

Record_Version is a number of a risk state

RiskTypeCd

Varchar(20)

Yes

 

RiskTypeCd is a risk attribute. It`s an example of a non-changeable attribute. We do not need track its history.

ClassCd

Varchar(10)

Yes

 

ClassCd is a risk attribute which can be changed. We need to track the history

 

STG_RISK

 

Risk_Uniqueid

ChangeDate

RiskTypeCd

ClassCd

Risk1

12/15/2017

Vehicle

SM01

Risk1

6/1/2018

Vehicle

SM01

Risk2

1/1/2018

Vehicle

MF20

 

DIM_RISK

 

Risk_Id

Risk_Uniqueid

Valid_FromDate

Valid_ToDate

IsCurrent

Record_Version

RiskTypeCd

ClassCd

1

Risk1

12/15/2017

1/1/2018

N

1

Vehicle

SM01

2

Risk1

1/1/2018

12/31/3000

Y

2

Vehicle

MF20

 

 

Figure 1. SCD type 2 basic structure and a table sample

 

A Star schema SQL

 

 

A Fact table ETL SQL

 

Figure 2. Join between SCD type 2 and a fact table in a star schema and an ETL SQL to add SCD type 2 Id in a fact table

 

Nowadays it is not a problem to build the dimension. ETL tools have embedded functionality to create SCD type 2. A developer needs to provide only staging data in a proper order and set up a transformation process. Usually, it includes settings of natural and surrogate keys, valid from and valid to dates, record version, current record fields and what to do (update or insert a new record) for a change in a specific attribute.

 

You can find a lot of information regarding structure and building SCD Type 2 yourself. I would like to describe the issues we have in our system.

 

History of changes in the transactional system

 

Once established, a policy is renewed periodically. Each renewal is a policy term. A change (new vehicle or driver or change of address) is done via Endorsement transaction (mid-term change).

 

The current policy term state is always updated in the transactional system with each mid-term change. The previous state is available only in the correspondent applications and quotes which stored separately. Some applications are approved automatically and some are manually reviewed and approved or rejected by an underwriter.

 

The whole bunch of policy data is stored all together in one XML blob. It means if there is an application to change a garage address of a policy with 5 vehicles and 7 drivers, an application XML blob will have a lot of redundant information which is not changed from previous application (endorsement). These redundant information in XML is shredded to specific to the information type relation tables: address, vehicle, driver etc.

 

The key idea is to build a consistent history of mid-term changes along with renewals to join application info to transaction effective dates via a transaction history table.

 

Unfortunately, it is not enough. Time to time the data at the policy level is updated independently from the application level. In an ideal word, the latest approved application should be exactly the same as the latest policy state. If some issues are discovered only policy state can be updated and application level is still not updated. It`s important to incorporate in the staging data both levels – application (endorsements) and policy. The latest, current record in SCD2 should come from the policy level. In this way, SCD2 current record (the latest) comes from the policy level.

 

 

Business Case from a property and casualty insurance company (back dated transactions)

 

Let`s say there is a policy for 2 vehicles, effective January 1, 2018. It was automatically renewed in November 2017 and there are correspondent transactions in the system. Figure 3 presents a simplified data entry in one or more tables in an insurance transactional system.

 

Policy

Term

Risk

Risk Type

Transaction Effective Date

Transaction Number

Transaction

Class

Premium

Entry Date

Accounting Date

Book Date

A1

3

Vehicle 2

Toyota Corolla 2008

1/1/2018

1

Renewal

MF20

300

11/1/2017

1/1/2018

11/1/2017

A1

3

Vehicle 1

Nissan Pathfinder 2016

1/1/2018

1

Renewal

MM25

500

11/1/2017

1/1/2018

11/1/2017

 

Figure 3. Automatic renewal transaction number 1. The primary driver of Vehicle 2 is an insured wife (MF20).

 

Before the renewal effective date, but after automatically created renewal transaction, the insured informed the agent his 18 years son will drive Vehicle 2 .

 

Policy

Term

Risk

Risk Type

Transaction Effective Date

Transaction Number

Transaction

Class

Premium

Entry Date

Accounting Date

Book Date

A1

3

Vehicle 2

Toyota Corolla 2008

1/1/2018

2

Endorsement

SM01

400

12/15/2017

1/1/2018

12/15/2017

 

Figure 4. Endorsement transaction number 2.

 

5 month later the insured bought a new vehicle and want to replace Vehicle Number 2 in the policy. All transactions are still in an order.

 

Policy

Term

Risk

Risk Type

Transaction Effective Date

Transaction Number

Transaction

Class

Premium

Entry Date

Accounting Date

Book Date

A1

3

Vehicle 2

Toyota RAV4 2018

6/1/2018

3

Endorsement

SM01

600

5/31/2018

6/1/2018

6/1/2018

 

 

Figure 5. Endorsement transaction number 3.

 

 

A few days later, in June, the insured informed the agent that his 18-years old son moved to an other state and does not drive Vehicle Number 2 since February. The premium can be recalculated based on the other driver classification code and it`s less.

 

Policy

Term

Risk

Risk Type

Transaction Effective Date

Transaction Number

Transaction

Class

Premium

Entry Date

Accounting Date

Book Date

A1

3

Vehicle 2

Tyta Corola 2008

2/15/2018

4

Endorsement

MF20

200

6/3/2018

6/4/2018

6/4/2018

A1

3

Vehicle 2

Toyota RAV4 2018

6/1/2018

5

Endorsement

MF20

500

6/3/2018

6/4/2018

6/4/2018

 

 

Figure 6. Endorsement transactions number 4 and 5

 

Transaction 4 is effective before transaction 3 but came into the system 3 days later!

 

Vehicle information from the transactional tables belongs to a dimension in DW system. Since the classification code is one of the base rating variables for premium calculation (measure) we need a history of changes if we want to analyze dependencies later. The described case raises a few questions. Here are 2 the most difficult:

1.       What date should be used as a Change Date to build Valid From and Valid To dates in SCD type2? There are too many choices:

a.       Entry date is a date when the info was added in the transactional system. But as you can see, it`s the same for transactions 3 and 4. ChangeDate is the same for different Risk types.

b.       Book date is a date when the transaction was booked, registered in the system. It`s the latest date from Transaction Effective Date and Accounting date. All the changes can be added on different dates (Saturday and Sunday) but booked on the same date (Monday). The date can be the same for different changes.

c.       Accounting date is a date when movement of money is registered in the system. It may be account in previous month even if added in the first few days of a new calendar month or vice versa. The latest from Book and Effective dates is used as Accounting date. Usually only accounting month is important, not the exact date.

d.       Effective date is a date when the change is in an effect. Different changes can be effective on the same date. The order of the transaction is also important.

 

Some of these dates are the same but, in general, they all can be different and do not match.

 

2.       So, who is the driver of Vehicle 2 in April 2018? Son or wife of the insured? It depends on when we ask this question and for what purpose.

a.       If we are interested only in an aggregated monthly premium (finance department) then it does not matter.

b.       If we need to analyze why this particular amount was payed in December 2017, and a specific monthly written and earned premium was calculated in April, and then a different one in June then it`s better to keep a record of the exact classification codes used in each transaction.

c.       If we want to model and predict frequency and severity of claims it`s better to know who the real driver in April and the real classification code was.

As you can see, we need to keep a classification code which was used in the premium calculation even if it`s changed later, and on the other hand, we need to know a real driver and classification code for an other task.

 

The other questions you may think about: What is the natural key in DIM_VEHICLE SCD2? VIN number is unique. Can we use it? No, we can not. The dimension entry is not a physical object (vehicle) with unique VIN number but, a risk with additional characteristics. The same vehicle (but different risk with it`s own classification code) can exist twice in the transactional system in the same period of time . Think about expiring policy term and a renewal for the same policy and vehicle but different driver or estimated annual distance.

 

 

Change Date

 

After reviewing all use cases I choose to use Transaction Effective Date as a Change Date. Transaction Number is also important and should be accounted to, because it`s possible to have different state of a risk on the same transaction effective date but different transactions (Transaction 1 and Transaction 2 in the example above). Since Transaction Effective Date in the transactional system is a Date (not DateTime) I convert it to DateTime and use Transaction Number as a number of seconds.

-          Effective date supports a natural order of changes in a risk. It allows to build the precise dataset for modeling

-          SCD Type 2 is deformed based on other dates. There are too many entries with the same Valid_FromDate and Valid_ToDate for the same risk

 

SCD type 2 base on transaction effective date and transaction number is easily used in transactional fact tables and model data fact tables. (Except a case of a full DW reload)

 

Monthly summary fact tables (based on accounting date month) need additional processing to get the latest transaction in a specific accounting month to link for a proper dimensional record.

 

ETL tool and later coming changes (back dated transactions)

 

Straightforward approach to build an SCD type 2 from the dataset described above results in a staging data not in order.

 

STG_VEHICLE

 

Vehicle_Uniqueid

ChangeDate

Vehicle

Class

Book Date

Comment

A1_3_VEHICLE2

1/1/2018

Toyota Corolla 2008

MF20

11/1/2017

 

A1_3_VEHICLE2

1/1/2018 1sec

Toyota Corolla 2008

SM01

12/15/2017

 

A1_3_VEHICLE2

6/1/2018 2 sec

Toyota RAV4 2018

SM01

6/1/2018

 

A1_3_VEHICLE2

2/15/218 3 sec

Tyta Corola 2008

MF20

6/4/2018

Back dated transaction

A1_3_VEHICLE2

6/1/2018 4 sec

Toyota RAV4 2018

MF20

6/4/2018

 

 

DIM_VEHICLE

 

Vehicle_Id

Vehicle_Uniqueid

Valid_FromDate

Valid_ToDate

Record_Version

Vehicle

Class

Comment

1

A1_3_VEHICLE2

1/11/2017

1/1/2018

1

Toyota Corolla 2008

MF20

 

2

A1_3_VEHICLE2

1/1/2018 1 sec

2/15/2018 2 sec

2

Toyota Corolla 2008

SM01

 

3

A1_3_VEHICLE2

6/1/2018 2 sec

6/1/2018 2 sec

3

Toyota RAV4 2018

SM01

We need it to use in Transaction 3 in a fact transaction table but not in a model dataset.

4

A1_3_VEHICLE2

2/15/2018 3 sec

6/1/2018 3 sec

4

Tyta Corola 2008

MF20

 

5

A1_3_VEHICLE2

6/1/2018 4 sec

12/31/3000

5

Toyota RAV4 2018

MF20

 

 

 

Figure 7. Staging data for DIM_VEHICLE for each transaction (book date) and desirable SCD Type 2 output

 

We use Pentaho SPOON and it creates 2 or more ``current`` records if it receives staging date for SCD 2 not in order.

 

Vehicle_Id

Vehicle_Uniqueid

Valid_FromDate

Valid_ToDate

Record_Version

Vehicle

Class

1

A1_3_VEHICLE2

1/11/2017

1/1/2018

1

Toyota Corolla 2008

MF20

2

A1_3_VEHICLE2

1/1/2018 1 sec

6/1/2018 4 sec

2

Toyota Corolla 2008

SM01

4

A1_3_VEHICLE2

6/1/2018 4 sec

12/31/3000

4

Toyota RAV4 2018

SM01

3

A1_3_VEHICLE2

2/15/2018 3 sec

12/31/3000

3

Tyta Corola 2008

MF20

5

A1_3_VEHICLE2

6/1/2018 4 sec

12/31/3000

4

Toyota RAV4 2018

MF20

 

 

Figure 8. Corrupted SCD type 2 created by Pentaho SPOON

 

I need to note we use an old version of Pentaho SPOON (build of 2015 year) and it`s possible the functionality was adjusted since that time.

 

We can not just delete existing records in dimensions which are no longer valid because there are already records in fact tables related to them. And this relation is absolutely valid at the moment of the transaction.

 

The approach is to update ValidFrom, ValidTo and Record Version to some default values not in range in existing not valid anymore records to preserve the primary key ID (surrogate key) which is referred from a fact table foreign key.

 

The original data are preserved in other columns but the updated record is not selected anymore for new transactional fact table records and do not mess SCD2 creation.

 

Vehicle_Id

Vehicle_Uniqueid

Valid_FromDate

Valid_ToDate

Record_Version

Vehicle

Class

1

A1_3_VEHICLE2

1/11/2017

1/1/2018

1

Toyota Corolla 2008

MF20

2

A1_3_VEHICLE2

1/1/2018 1 sec

6/1/2018 1 sec

2

Toyota Corolla 2008

SM01

3

A1_3_VEHICLE2

6/1/2018 2 sec

12/31/3000

3

Toyota RAV4 2018

SM01

 

 

Vehicle_Id

Vehicle_Uniqueid

Valid_FromDate

Valid_ToDate

Record_Version

Vehicle

Class

Original

Valid_FromDate

Original

Valid_ToDate

Original

Record_Version

1

A1_3_VEHICLE2

1/11/2017

1/1/2018

1

Toyota Corolla 2008

MF20

1/1/1900

1/1/1900

0

2

A1_3_VEHICLE2

1/1/2018 1 sec

2/15/2018 2 sec

2

Toyota Corolla 2008

SM01

1/1/1900

1/1/1900

0

3

A1_3_VEHICLE2

1/1/1900

1/1/1900

0

Toyota RAV4 2018

SM01

6/1/2018 2 sec

12/31/3000

3

4

A1_3_VEHICLE2

2/15/2018 3 sec

6/1/2018 3 sec

4

Tyta Corola 2008

MF20

1/1/1900

1/1/1900

0

5

A1_3_VEHICLE2

6/1/2018 4 sec

12/31/3000

5

Toyota RAV4 2018

MF20

6/1/2011/1/19008 4 sec

1/1/1900

0

 

 

Figure 9. Updating existing, used in transactional fact table but not valid anymore record in SCD2 (Vehicle_Id=3).

 

This additional update is done by ETL after data are staged but before dimensional transformations. The condition is the Change date in staging is less then ValidFrom in a correspondent dimension for the same natural key.

 

An other problem is full DW load. When we do incremental daily load, a transactional fact record uses only available records in a dimension. In this way, even if a dimensional record will be invalidated later, the transactional fact table record uses a proper one for the transaction time record.

 

There is already all staging dimensional and fact tables records in a full reload. We need first to build a dimension to separate later discarded records in the same way as above and then use them first to link fact table records.

 

It`s done in a dimension initial load SQL script which run once. The other purpose of the script instead of Pentaho SPOON is to load data in a timely way. Using Pentaho SPOON transformation for the initial load takes days due to in memory lookups vs SQL running about an hour.

 

Model data fact tables are built from scratch daily using only valid records ignoring invalid in dimension.

 

In the third type of fact tables, monthly summaries, transaction effective dates are not used at all. The data are summarized based on accounting date month. It`s possible to have several changes in a risk (transactions effective in different dates of a month) but only one, the latest, can be recorded in the fact table. The only one risk change is the latest one in a particular accounting month ordered by transaction effective date. A standard ETL approach adding SCD Type 2 Id in this type of fact tables does not work. Instead, there is an ETL lookup table joining for each policy/risk accounting month and latest transaction which takes place in this month. Based on this information, the final script uses a dimensional id related to this transaction from fact transaction. It maybe very valid dimensional id, or invalid now, because of a later coming back dated transaction.

 

In fact, this information is not important for the primary purpose of summary tables - financial reports. The idea of adding dimensional information of this level came later and the data are used in monthly risk state dashboards which do not required a high level of time accuracy.

 

Transactional Fact table

 

Policy

Policy Term

Transaction

Effective

Date

Transaction Number

Vehicle_Id

Premium

Book

Date

Accounting

Date

Comment

A1

3

1/1/2018

1

1

300

11/1/2017

1/1/2018

 

A1

3

1/1/2018

2

2

400

12/15/2017

1/1/2018

 

A1

3

6/1/2018

3

3

600

6/1/2018

6/1/2018

The dimensional record Vehicle_Id=3 is invalidated from SCD type 2 but for history we need the Id

A1

3

2/15/2018

4

4

200

6/4/2018

6/4/2018

 

A1

3

6/1/2018

5

5

500

6/4/2018

6/4/2018

 

 

Model dataset fact table

 

Policy

Policy Term

Start Date

End Date

Vehicle Id

Exposure

Comment

A1

3

1/1/2018

2/15/2018

2

 

The 1st transaction was not really in effect

 

 

2/15/2018

6/1/2018

4

 

3rd transaction data were replaced transaction 4

 

 

6/1/2018

6/30/2018

5

 

Transaction 5 is in effect till the end of the policy term

 

Monthly summary fact table

 

Month_Id

Policy

Policy Term

Written Premium

Earned Premium

Vehicle_id

201801

A1

3

300

100

2

. . .

 

 

 

 

 

201804

A1

3

0

100

2

. . .

 

 

 

 

 

201806

A1

3

500

200

5

 

 

Figure 9. Fact tables and SCD Type 2 Id.

 

Conclusion

 

So far, we built several SCD Type 2 dimensions. There are few issues left, the process is complex but I estimate DW matches the transactional system 96 - 98%. The size of the dimensions is approximately 30-50 % of the staging tables with redundant data. The storage is not an issue nowadays and we are a relatively small company but we run ETL in MS SQL and move data to Redshift. We save not just disk space but time moving the data.

 

We may need to change the ETL tool and will have less issues. Not just only because of SCD type 2 but there are other concerns too.