Modeling data in Property and Casualty Insurance
Contents
Data structure and granularity
Claims and policy mid-term changes
Nowadays predictive modeling is widely used in P&C insurance companies for underwriting, risk selection, and similar tasks. You can find plenty of information online, from scholarly papers to Kaggle competitions.
However, you will hardly find any guidance on how to build a dataset for modeling. Papers mention many variables available in P&C insurance, but not how to arrange the data.
What level of granularity should we use in the modeling dataset? Individual risk or policy level? Policy, policy term, or mid-term change level? The papers do not provide this information.
What is a risk? Sometimes it's easy: a building in Homeowners or Dwelling. How about Auto? Is it a vehicle and its primary driver, or do we need to account for every other driver in the household?
Usually there are more small losses than huge losses (Auto especially). Should we account for this disproportion when preparing a modeling dataset?
There is no perfect data. What should we do with slightly imperfect data? And how much is "slightly"?
I created modeling datasets for the Private Passenger Auto, Homeowners, and Dwelling lines and share my thoughts and findings in this article. The details may be obvious to actuarial people, but not to the database developers who know the data and can provide stable and consistent datasets.
At a very high level, the modeling dataset is a combination of policy, risk, and claims data. The number of risks per policy depends on the line of business. In most cases there is just one risk per policy in Homeowners and Dwelling (a building) and several vehicles and drivers in Auto. In some cases it's possible to have more than one building in Dwelling at the same moment of time, or different buildings in different time periods in Homeowners, but they are still just separate buildings. In contrast, in Auto it's possible to have different combinations of vehicles and drivers, permitted drivers not registered on the policy, or rented vehicles.
So, each line in the modeling data represents:
- A building in Homeowners and Dwelling, with unique identifiers of which building it is (address, building number)
- A vehicle and assigned driver combination in Auto, with unique identifiers (VIN, driver license, vehicle number, and driver number)
The unique identifiers should not be added to the training and testing data or used as predictors. Their purpose is to assign the proper risk characteristics and to support testing and filtering.
Assembling the proper vehicle and assigned driver combinations can be a big issue, depending on how clean or messy the source system is. It's possible to have duplicate VINs in a policy, as well as duplicate or double assignments.
A transactional system may allow adding not just primary assigned drivers but also principal or occasional ones. For some types of analysis (multivariate, e.g., GLM or XGBoost), a researcher may need not just the primary assigned drivers but every possible combination of vehicles and drivers in a policy to get the proper risk characteristics into a model.
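Generating every vehicle-driver combination is a per-policy cross join. A minimal pandas sketch, with hypothetical `vehicles` and `drivers` extracts keyed by `policy_id` (all names are illustrative, not the actual source tables):

```python
import pandas as pd

# Hypothetical extracts from the policy database, keyed by policy_id.
vehicles = pd.DataFrame({
    "policy_id": [101, 101, 102],
    "vehicle_num": [1, 2, 1],
    "vin": ["VIN_A", "VIN_B", "VIN_C"],
})
drivers = pd.DataFrame({
    "policy_id": [101, 101, 102],
    "driver_num": [1, 2, 1],
    "license_no": ["D1", "D2", "D3"],
})

# Cross join within each policy: every vehicle paired with every driver.
pairs = vehicles.merge(drivers, on="policy_id")

# Keep a primary-assignment flag so the full set can be filtered back down
# to primary pairs when an analysis does not need all combinations.
pairs["is_primary_pair"] = pairs["vehicle_num"] == pairs["driver_num"]
print(pairs)
```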
Risk attributes, and their potential relation to a loss, can change mid-term. So the natural granularity of the modeling dataset is the mid-term change. What is a mid-term change? It's a transaction in the system after a policy renewal (or new business) but before the next term's effective date or a cancellation. Usually, endorsement transactions change risk attributes and the associated policy premium.
Depending on the transactional system, it can be a problem to identify valid mid-term changes. Endorsement transactions can change anything in a policy, even things not related to the selected risk attributes. We do not need a new record in the modeling data if a transaction corrects an insured's first name, but a correction of a driver's date of birth (age) should be recorded.
One way to select only the needed transactions is to check whether the policy premium changed compared to the previous policy state. If there is a change, then most likely an important rating variable was changed and it impacted the premium.
There is always a chance of missing a premium change if several changes were combined into one transaction: some changes increase the premium while others are discounts. In addition, the attribute values themselves can be checked for changes.
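A sketch of this selection logic, assuming a hypothetical `transactions` table with one row per policy transaction: compare each transaction's premium and tracked attribute values to the previous policy state and keep only the rows where something relevant changed.

```python
import pandas as pd

# Hypothetical policy transaction history, one row per transaction.
transactions = pd.DataFrame({
    "policy_id":  [101, 101, 101, 101],
    "eff_date":   pd.to_datetime(["2023-01-01", "2023-03-10", "2023-05-02", "2023-07-15"]),
    "premium":    [1200.0, 1200.0, 1250.0, 1250.0],
    "driver_dob": ["1980-05-01", "1980-05-01", "1981-05-01", "1981-05-01"],
})

tracked = ["premium", "driver_dob"]  # premium plus selected rating attributes
transactions = transactions.sort_values(["policy_id", "eff_date"])

# Compare each transaction to the previous policy state; keep only rows
# where a tracked value changed. The first transaction of each policy
# compares against NaN and is therefore always kept.
prev = transactions.groupby("policy_id")[tracked].shift()
changed = (transactions[tracked] != prev).any(axis=1)
midterm_changes = transactions[changed]
print(midterm_changes)
```

In this toy history, the second transaction (a name-correction-style endorsement that changed nothing tracked) is dropped, while the date-of-birth correction is kept.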
Technically, several transactions can be effective on the same day. A policy's automatic renewal is added to the system two months in advance, but it's possible that an immediate endorsement is issued with the same effective date as the renewal transaction during these two months or on the first effective date. If the duration of a mid-term change is too short (less than a day), we do not include the change in the data.
On the other hand, if a mid-term change starts in one calendar year and ends in the next, we split it in two. Each line in the model data is attributed with a Calendar Year, which is the year of the mid-term start date. Calendar Year helps to divide the data into training/validation/test sets later in the process.
The corresponding written and earned premium, as well as calendar-year exposures, are added for each mid-term change.
In a simplified version, strictly one-year exposures are the base for the modeling data.
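A minimal sketch of the year split and exposure calculation, using the days/365.25 convention described for the `ecy` column later in this article; the function and names are illustrative:

```python
from datetime import date

def split_by_calendar_year(start: date, end: date):
    """Split a mid-term change [start, end) at Jan 1 boundaries and
    compute the calendar-year exposure (days / 365.25) for each piece."""
    pieces = []
    while start < end:
        year_end = date(start.year + 1, 1, 1)   # Jan 1 of the next year
        piece_end = min(end, year_end)
        ecy = (piece_end - start).days / 365.25  # calendar-year exposure
        pieces.append((start.year, start, piece_end, round(ecy, 4)))
        start = piece_end
    return pieces

# A mid-term change running from Nov 1, 2023 to Mar 1, 2024 becomes two records.
for row in split_by_calendar_year(date(2023, 11, 1), date(2024, 3, 1)):
    print(row)
```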
Policy and claims data are different entities that can be stored not just in separate tables but even in different databases and transactional systems. You should not expect a proper database relationship (foreign and primary keys) between these two objects. Usually premium and losses are joined at a different, higher level than is required for the modeling data: monthly, line-of-business, or state aggregations, but not individual policies or risks.
A policy number or policy term ID is most likely part of the claims database. To match a claim to the proper mid-term change, you need to use the loss date. Neither claim attribute is static through the life of a claim: a wrong loss date or policy term can be added automatically via a data feed, or manually, and corrected later. It's important to use the latest known claim data in the joining process. Sometimes the errors remain in the claims database, and there is a mismatch between the corresponding policy term's effective and expiration dates and the claim's loss date.
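Because there is no reliable foreign key, the join is on policy number plus a loss date falling inside the mid-term interval, followed by aggregation since one mid-term change can have several claims. A sketch with hypothetical `claims` and `midterms` frames (using the latest known loss date on each claim):

```python
import pandas as pd

midterms = pd.DataFrame({
    "policy_number": ["P1", "P1"],
    "startdatetm": pd.to_datetime(["2023-01-01", "2023-06-01"]),
    "enddatetm":   pd.to_datetime(["2023-06-01", "2024-01-01"]),
})
claims = pd.DataFrame({
    "policy_number": ["P1", "P1"],
    "claim_id": ["C1", "C2"],
    "loss_date": pd.to_datetime(["2023-03-15", "2023-08-20"]),
    "loss_incurred": [2500.0, 700.0],
})

# Join on policy number, then keep rows where the loss date falls
# inside the mid-term interval [startdatetm, enddatetm).
joined = claims.merge(midterms, on="policy_number")
joined = joined[(joined["loss_date"] >= joined["startdatetm"]) &
                (joined["loss_date"] < joined["enddatetm"])]

# Aggregate, since one mid-term change can have several claims.
agg = joined.groupby(["policy_number", "startdatetm"]).agg(
    claim_count=("claim_id", "count"),
    loss_incurred=("loss_incurred", "sum"),
)
print(agg)
```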
The other problem is joining claims to a specific risk in a PersonalAuto policy. An unknown vehicle and/or driver in the claims database is not always a data issue; it can be a natural claim feature. A claim can relate to a rented vehicle; a rented vehicle is not part of the policy database, and no additional info (risk characteristics) is available on the claim. Even when a vehicle is known, the driver can be unknown in the policy database but permitted by the owner to use the vehicle, or not relevant to the claim (a parked vehicle).
We use the policy's Vehicle Number 1 and Driver Number 1 as the default vehicle and driver when both are unknown, and the assigned vehicle or driver if at least one is known in the claims database. But because these claims are not relevant for some analyses, it's important to flag such replacements for filtering.
PersonalAuto claims may involve more than one vehicle. A claim created automatically by a data feed may have populated the wrong column ("other party" instead of "insured vehicle") with a known vehicle VIN from the policy. It's a good idea to research all related tables before using default values in the modeling data.
For most Homeowners and Dwelling claims, there is a straightforward way to join the correct risk from the policy database. However, some claims, like a theft from a vehicle, may not have any risk identification in the claims database.
Policy-level attributes can easily be added in such structures, since the data are on a mid-term basis. But claims info (incurred losses and claim counts) has to be aggregated, because there can be more than one claim per mid-term change.
Coverages and line of business also impact the data structure, because different lines differ in what matters and in how the data are analyzed.
Each mid-term change/record in the PersonalAuto modeling data is for a specific coverage. In contrast, the coverage-related data (premium, losses) sit in columns in the Property modeling data. In addition, losses in the Property data are aggregated not just by coverage but also by peril and on a catastrophe basis. Cause of loss is not as important for PersonalAuto analysis, and catastrophe claims are rare in this line of business; that's why they are just additional columns in the PersonalAuto modeling data. This way, catastrophe losses can easily be separated.
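The two layouts are the standard long-versus-wide trade-off. A toy sketch of turning per-coverage rows (the Auto layout) into per-coverage columns (the Property layout); the column names are illustrative:

```python
import pandas as pd

# Long layout: one row per mid-term change and coverage (Auto style).
long_df = pd.DataFrame({
    "midterm_id": [1, 1, 2],
    "coverage":   ["BI", "COLL", "BI"],
    "ep":         [300.0, 150.0, 310.0],
    "loss_inc":   [0.0, 1200.0, 0.0],
})

# Wide layout: one row per mid-term change, coverages in columns (Property style).
wide_df = long_df.pivot_table(index="midterm_id",
                              columns="coverage",
                              values=["ep", "loss_inc"],
                              aggfunc="sum")
wide_df.columns = [f"{cov}_{measure}" for measure, cov in wide_df.columns]
print(wide_df)
```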
Some similar coverages can be grouped under the same name. This is possible when essentially the same coverages have different codes in different products within the same line, or come from different transactional systems when we extract data from both a retired system and the current one.
PersonalAuto data are often analyzed for specific limit combinations. That's why, for easy filtering, we keep BI and some other limits in each record, even if the record belongs to another coverage.
There are hundreds of available variables describing a risk. Most of them are categorical with many levels and require conversion. They come from insureds, agents, vendor data reports, underwriters, etc.
AllPerilDeductible | Gatedcommunityind | RentersInsurance | Stories
BurglaryAlarmType | Homegardcreditind | ReplacementCostDwellingInd | Tenantevictions
ConstructionCd | KitchenFireExtinguisherind | Replacementvalueind | Units
Deadboltind | NeighborhoodCrimeWatchInd | RoofCd | UsageType
DeluxePackageInd | Numberoffamilies | SafeguardPlusInd | WaterDeductible
Equipmentbreakdown | OccupancyCd | SecondaryResidence | WaterDetectionDevice
FireAlarmType | OrdinanceOrLawPct | Serviceline | WaterRiskFreq
FireHazardScore | Poolind | SmokeDetector | WaterRiskSeverity
FireLineAssessment | PropertyManager | Sprinklersystem | WUICLASS
FunctionalReplacementCost | ProtectionClass | SqFt | YearBuilt
Table 1 Sample variables for Property risks (Dwelling and Homeowners)
acci_pointschargedterm | drivertypecd | newteenexpirationdt
Acci5yr, Acci7yr | gendercd | newtostateind
birthdt | gooddriverind | occupationclasscd
driver | licensedstateprovcd | other_pointschargedterm
driverage | licensedt | scholasticdiscountind
driverinceptiondate | licensenumber | susp_pointschargedterm
driverlicenseage | maritalstatuscd | viol_pointschargedterm
drivernumber | maturedriverind |
driverstatuscd | mvrstatus |
DriverTrainingInd | mvrstatusdt |
Annual_Mileage_Estimate | Last_Owner_Recent_Annual_Mileage | restraintcd
antitheftcd | manufacturer | titlehistoryissue
AverageMileage | medicalpartsaccessibility | vehbodytypecd
CaliforniaRecentMileage | mileage | vehicleinceptiondate
daysperweekdriven | ModeledAnnualMileage | vehusecd
Estimated_Current_Mileage | modelyr | vin
estimatedannualdistance | neworusedind | weekspermonthdriven
estimatedworkdistance | odometerreading |
Last_Owner_Average_Miles | performancecd |
Last_Owner_Government_Recent_Annual_Mileage | RecentAverageMileage |
Table 2 Sample variables for Auto risks (PersonalAuto)
Here are some issues to research and resolve before adding a potential attribute to the dataset:
1. Empty or None values
a. A variable was recently added to the system and there is not enough history
b. Some attributes are available only for a specific product or state; they are not populated in other states or products, even within the same line of business
c. The same attribute may be stored in different database tables or columns depending on the product/release
2. Inconsistent values in the same attribute:
a. Data migrated from a former transactional system can have a different set of categories
b. Classification codes change over time and across states
c. Adjustments were made in the same system over time
d. Third-party systems may have provided the same information in a different way in the past, or depending on how it was requested: regular updates on renewal versus a one-time request for the whole book of business
e. Free-form text input in a transactional system instead of a drop-down list with predefined values
3. Wrong data stored in the database because of missing validation in a transactional system: a driver license date before the date of birth, a date of loss before the policy was in effect, etc. (a validation sketch follows the list)
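For issue 3, such impossible records can be flagged with a few vectorized checks. A minimal sketch with illustrative column names:

```python
import pandas as pd

drivers = pd.DataFrame({
    "driver_id":  [1, 2],
    "birth_dt":   pd.to_datetime(["1990-04-01", "2001-07-15"]),
    "license_dt": pd.to_datetime(["2008-06-01", "1999-01-01"]),  # second row is invalid
})

# A license issued before the date of birth is impossible, so flag it
# rather than silently dropping or "fixing" the record.
drivers["bad_license_dt"] = drivers["license_dt"] < drivers["birth_dt"]
print(drivers[drivers["bad_license_dt"]])
```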
Some of the questions above can be resolved easily with simple SQL, but others may require long conversations with IT developers, analysts, and product managers. Data catalog tools can replace such tribal knowledge and help identify potential data sources. It's also much easier to build the modeling data on top of a data warehouse, where most if not all attributes are already identified and stored in risk item dimensions such as DIM_BUILDING, DIM_VEHICLE, DIM_DRIVER.
Another group of variables is calculated attributes. We may not have them "as is" in a table in the transactional or data warehouse system, but they can be pre-calculated from other related data. An example is the total number of insured properties/policies per insured in a company. For PersonalAuto analysis it's important to have the number of active vehicles, drivers, non-drivers, excluded drivers, and young drivers at each mid-term change. Some PersonalAuto policies may have only liability coverages, and others only Comprehensive coverage. These policies can be flagged in the data.
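A sketch of one such calculated attribute, counting drivers per mid-term change from a hypothetical assignment table (names and statuses are illustrative):

```python
import pandas as pd

driver_assignments = pd.DataFrame({
    "midterm_id":    [1, 1, 1, 2],
    "driver_id":     [10, 11, 12, 10],
    "driver_status": ["Active", "Active", "Excluded", "Active"],
    "driver_age":    [45, 17, 30, 46],
})

# Pre-calculated counts per mid-term change; n_young counts all
# assigned drivers under 21 regardless of status, for simplicity.
counts = driver_assignments.groupby("midterm_id").agg(
    n_drivers=("driver_status", lambda s: (s == "Active").sum()),
    n_excluded=("driver_status", lambda s: (s == "Excluded").sum()),
    n_young=("driver_age", lambda s: (s < 21).sum()),
)
print(counts)
```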
Some policy-level attributes, like discounts or inception dates, are also included in the modeling data. They are usually more robust and consistent than risk attributes. It is debatable whether limits and deductibles should be included as model features in the final datasets, but they may be needed for filtering. Policy numbers, policy term IDs, and effective and expiration dates are unique, and adding them to the models will produce a random effect. If our final interest is to produce estimates for each policy, then policy numbers and policy term IDs are a must-have in the modeling data, but maybe not in the training dataset. But think about the correlation between multiple renewals of the same policy included in the dataset; a researcher may wish to include policy term IDs specifically for this random effect. The rest of the policy-level attributes (agency, product, company, etc.) are useful for filtering in special research.
Policy Number | Inception date
Policy Term ID | Persistency Discount date
Effective date | Multi Policy Discount
Expiration date | Multi Car Discount
Company | Employee Discount
Carrier | Late fee
Policy New or Renewal | Agency
Renewal Term | Agency Group
Table 3 Sample attributes at policy level
Which attributes to include in the modeling data depends on the final purpose. Everything can be included for a general-purpose dataset, or only a selection for specific research.
A wide variety of data are available and can be included in research:
- Demographic data
- Credit history
- Police reports
- Criminal history
- Social media
I would not include this type of data in a general-purpose modeling dataset. Instead, I would save the data in separate tables and describe in a data catalog tool how to join the external data with the modeling dataset.
Since insurance is a highly regulated business, there is always a chance that external data are not supported by a regulator. For example, using social media usage habits (other than phone use while driving) in modeling is prohibited in Nevada. The rules are specific to each state. It may be a good idea to add comments in a data catalog tool for each external dataset about whether it is allowed to be used, and for what kinds of models.
A general-purpose modeling dataset can have several target variables:
- Pure premium
- Claim frequency
- Claim severity
- Binary category claim/no claim
- Binary category Large Loss/No Large Loss
Since the analysis can be done for a specific type of coverage, the target variables should be separated for each coverage.
Each peril (or group of perils) can be modeled separately, and the targets should allow modeling on a peril basis.
Catastrophe losses are modeled separately from other types of losses. That's why a researcher should be able to separate target variables based on a catastrophe flag.
A mix of small and very large losses impacts the modeling results. Where that is the case, as in the PersonalAuto line, capping losses at several thresholds can help create more stable models. A threshold should be set at a point that preserves the systematic part of the losses without capturing the noise from large losses.
Column suffix | Condition | Threshold
_le500 | Loss less than or equal | 500
_1000 | Loss greater than | 1,000
_1500 | Loss greater than | 1,500
_2000 | Loss greater than | 2,000
_2500 | Loss greater than | 2,500
_5k | Loss greater than | 5,000
_10k | Loss greater than | 10,000
_25k | Loss greater than | 25,000
_50k | Loss greater than | 50,000
_100k | Loss greater than | 100,000
_250k | Loss greater than | 250,000
_500k | Loss greater than | 500,000
_750k | Loss greater than | 750,000
_1M | Loss greater than | 1,000,000
Table 4 Example of capping claim losses in PersonalAuto model data
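The thresholds in Table 4 translate directly into suffixed columns. A minimal sketch, under my assumption (the exact semantics are not spelled out above) that the suffixed fields carry per-claim condition indicators and losses capped at the threshold:

```python
import pandas as pd

claims = pd.DataFrame({"claim_id": [1, 2, 3],
                       "loss_incurred": [450.0, 3200.0, 60000.0]})

# "Greater than" thresholds from Table 4, plus one "less or equal" bucket.
gt_thresholds = {"_1000": 1_000, "_1500": 1_500, "_2000": 2_000, "_2500": 2_500,
                 "_5k": 5_000, "_10k": 10_000, "_25k": 25_000, "_50k": 50_000,
                 "_100k": 100_000, "_250k": 250_000, "_500k": 500_000,
                 "_750k": 750_000, "_1M": 1_000_000}

claims["claim_count_le500"] = (claims["loss_incurred"] <= 500).astype(int)
for suffix, t in gt_thresholds.items():
    # Indicator for "loss greater than threshold"; summed per mid-term
    # change, these become capped claim-count columns.
    claims[f"claim_count{suffix}"] = (claims["loss_incurred"] > t).astype(int)
    # Loss capped at the threshold, for more stable severity targets.
    claims[f"loss_capped{suffix}"] = claims["loss_incurred"].clip(upper=t)
print(claims.head())
```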
Based on perils | Attribute (one for each coverage group*) | Peril
Non-Catastrophe Water | cova_il_nc_water | Flood, Water Backup, Water Damage, Water Discharge
Non-Catastrophe Wind & Hail | cova_il_nc_wh | Hail, Windstorm
Non-Catastrophe Theft & Vandalism | cova_il_nc_tv | Theft, Theft From Unattended Auto, Vandalism Malicious Mischief
Non-Catastrophe Fire & Lightning | cova_il_nc_fl | Fire, Lightning, Smoke
Non-Catastrophe All Other | cova_il_nc_ao | NOT in (Flood, Water Backup, Water Damage, Water Discharge, Hail, Windstorm, Theft, Theft From Unattended Auto, Vandalism Malicious Mischief, Fire, Lightning, Smoke)
Catastrophe Fire & Smoke (i.e., Wildfire) | cova_il_cat_fire | Fire, Lightning, Smoke
Catastrophe All Other | cova_il_cat_ao | NOT in (Fire, Lightning, Smoke)
Table 5 Example of grouping claim losses for CovA in Property model data
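The grouping in Table 5 is essentially a lookup from a claim's peril and catastrophe flag to a target column. A small illustrative Python sketch of that mapping:

```python
# Peril-to-group mapping from Table 5 (non-catastrophe groups);
# anything not matched falls into the "all other" bucket.
PERIL_GROUPS = {
    "cova_il_nc_water": {"Flood", "Water Backup", "Water Damage", "Water Discharge"},
    "cova_il_nc_wh":    {"Hail", "Windstorm"},
    "cova_il_nc_tv":    {"Theft", "Theft From Unattended Auto",
                         "Vandalism Malicious Mischief"},
    "cova_il_nc_fl":    {"Fire", "Lightning", "Smoke"},
}

def peril_group(peril: str, is_catastrophe: bool) -> str:
    """Return the CovA incurred-loss column a claim's loss rolls into."""
    if is_catastrophe:
        return ("cova_il_cat_fire" if peril in {"Fire", "Lightning", "Smoke"}
                else "cova_il_cat_ao")
    for column, perils in PERIL_GROUPS.items():
        if peril in perils:
            return column
    return "cova_il_nc_ao"

print(peril_group("Hail", False))      # cova_il_nc_wh
print(peril_group("Windstorm", True))  # cova_il_cat_ao
```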
A deep understanding of your data reveals how imperfect they are. In some cases more research is needed to fix an issue, but sometimes it's simply a fact of life. People make mistakes when entering data, or program code populates the wrong fields in a database. Even if an issue is discovered later, it's usually too costly to fix the historical data.
There is also the question of whether something is a data issue or a feature. Does it happen only in a specific product or claim type? Maybe it's expected, and default values can substitute for the missing ones.
A high percentage of wrong values in an attribute is a reason to revisit the source of the data or the way it's calculated. If nothing can be done, then, of course, it's better not to include the attribute in the final dataset. If there are 5-6% imperfect or suspicious values (and usually this happens for known products/states/time periods), it may be more costly to fix them than to exclude them completely. The other way is to keep them in the dataset but flag them as an error/issue. If the attribute is a major one for the research, such records can easily be filtered out.
Quality flags can be part of the precalculated data in DW tables and either excluded from the process from the beginning or transferred to the final dataset. The flags help to test the data and estimate its quality in general.
Flag | Meaning
Policy history inconsistency | Some manual updates changed the consistency of the policy history
Risk duplicates | More than one active risk with duplicate numbers or other attributes exists at the same time in a policy
Excluded driver | Excluded drivers in our system have the same IDs. If there is a claim for an excluded driver, it is not possible to assign it to the proper driver. The number of excluded drivers can also be affected
Replaced VIN | A vehicle was not set as "Deleted" but rather replaced with another vehicle (VIN). It may create wrong mid-term changes for vehicle-driver pairs
Replaced driver | A driver was not set as "Deleted" but rather replaced with another driver (the driver license was changed)
Claim with unknown VIN | A claim with an unknown VIN in the system
Claim with unknown VIN and driver | A claim with an unknown VIN and a driver not in the system or missing
Claim policy term join issue | A claim is assigned to the first record in a policy term due to issues in the related policy term reference and/or loss date
Table 6 Quality flags
We created two general-purpose modeling datasets with different structures:
- Auto modeling data based on PersonalAuto line
- Property modeling data based on Homeowners and Dwelling lines
The datasets available to users are views based on fact tables and dimensions in the DW. The views can be used "as is", or a new structure, more specific to a project, can be built from the fact and dimension tables. Additional info, like demographic data or third-party vendor data, can easily be joined if needed.
The fact tables hold policy risk mid-term changes, with policy premiums, exposures, claim counts, and losses as measures. FACT_AUTO_MODELDATA combines individual coverage data in rows (coverage codes, premiums, and losses in each row); it's a very long table. FACT_PROPERTY_MODELDATA keeps the coverage-related data in columns; it's a very wide table.
Dimensions hold the policy and risk data. They are conformed slowly changing dimensions, type 2.
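With type 2 dimensions, each fact row must pick up the dimension version that was current at the mid-term start date. A sketch of that lookup, with hypothetical row_eff_date/row_exp_date validity columns on DIM_DRIVER:

```python
import pandas as pd

fact = pd.DataFrame({
    "driver_key_natural": ["D1", "D1"],
    "startdate": pd.to_datetime(["2023-02-01", "2023-09-01"]),
})
dim_driver = pd.DataFrame({
    "driver_key_natural": ["D1", "D1"],
    "maritalstatuscd": ["S", "M"],
    "row_eff_date": pd.to_datetime(["2020-01-01", "2023-06-15"]),
    "row_exp_date": pd.to_datetime(["2023-06-15", "9999-12-31"]),
})

# Pick the SCD type 2 version that was valid at the mid-term start date.
joined = fact.merge(dim_driver, on="driver_key_natural")
joined = joined[(joined["startdate"] >= joined["row_eff_date"]) &
                (joined["startdate"] < joined["row_exp_date"])]
print(joined[["startdate", "maritalstatuscd"]])
```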
Image 1 Property Modeling Data schema
FACT_AUTO_MODELDATA carries only vehicle and primary driver IDs. If needed, the structure can be extended with principal and occasional drivers related to the same vehicle_id in the fact table via DIM_RISK_EXTENSION. This should be done with caution, because written and earned premiums exist only for the risks based on primary drivers.
Image 2 Auto Modeling Data schema
There are hundreds of columns in the final views. The tables below show only the major groups of fields, with some examples, to help understand the structure.
Area | Attribute Name | Comment
Agency | ProducerName | Agency name, code, status
Building | RoofCd | Building attributes (yearbuilt, usage, construction)
Claims | AllCov_LossInc | Claim counts and losses for a specific policy risk (building) and mid-term change, grouped by coverage, peril, and catastrophe
Claims | AllCov_LossDCCE |
Claims | AllCov_LossALAE |
Claims | covA_il_nc_water | CovA incurred losses (non-catastrophe), water peril
Claims | covA_il_nc_wh |
Claims | covA_il_nc_tv |
Claims | covA_il_nc_fl |
Claims | covA_il_nc_ao |
Claims | covA_il_cat_fire |
Claims | covA_il_cat_ao |
Claims | covB_il_... |
Claims | covC_il_... |
Claims | covD_il_... |
Claims | covE_il_... |
Claims | covF_il_... |
Claims | liab_il_... |
Claims | ..._dcce_nc_water | DCCE
Claims | ..._alae_nc_water | ALAE
Claims | covA_ic_nc_water | Incurred count
Claims | covD_ic_cat_fire | Catastrophe related
Coverage | CovA_WP | CovA - CovF mid-term change written premium
Coverage | CovA_EP | CovA - CovF mid-term change earned premium
Coverage | covA_deductible | CovA - CovF deductible
Coverage | covA_limit | CovA - CovF limit
Insured | InsuranceScore | Insured attributes (scores)
Mid-Term Changes | cal_year | Year of the start date
Mid-Term Changes | startdate | Exact mid-term start date, or Jan 1 of the next year if a mid-term covers more than one year
Mid-Term Changes | enddate | Exact mid-term end date, or Dec 31 if a mid-term covers more than one year
Mid-Term Changes | startdatetm | Exact mid-term start date (transaction effective date)
Mid-Term Changes | enddatetm | Exact mid-term end date (next transaction effective date)
Mid-Term Changes | ecy | Calendar-year exposure: difference between StartDate and EndDate divided by 365.25
Policy | policynumber | Policy attributes
Quality | excludeddrvflg | Quality flags
Testing/ProofOfConcept | HasCreditCardPaymentByInsured | Some calculated attributes
Table 7 Property model data attributes (view columns)
Group | Attribute Name Sample | Comment
Agency | ProducerCode | Agency name, code, status
Claims | cov_claim_count_le500 | Capped claim counts and losses for a policy risk (vehicle and driver), coverage, and mid-term change; at-fault-or-not flag
Coverage | coverage | Coverage code (PD, COMP, COLL, BI)
Coverage | wp | Coverage mid-term written premium
Coverage | ep | Coverage mid-term earned premium
Coverage | Limit | Coverage limits
Coverage | Deductible | Coverage deductibles
Coverage | comp_deductible | Specific coverage limits and deductibles for filtering
Coverage | bi_limit1 (umbi, pd) |
Coverage | bi_limit2 (umbi, pd) |
Driver | driver | Driver attributes (age, points, etc.)
Insured | InsuranceScore | Insured attributes (scores)
Mid-Term Changes | cal_year | Year of the start date
Mid-Term Changes | startdate | Exact mid-term start date, or Jan 1 of the next year if a mid-term covers more than one year
Mid-Term Changes | enddate | Exact mid-term end date, or Dec 31 if a mid-term covers more than one year
Mid-Term Changes | startdatetm | Exact mid-term start date (transaction effective date)
Mid-Term Changes | enddatetm | Exact mid-term end date (next transaction effective date)
Mid-Term Changes | ecy | Calendar-year exposure: difference between StartDate and EndDate divided by 365.25
Policy | policynumber | Policy attributes
Quality | excludeddrvflg | Quality flags
Vehicle | VIN | Vehicle attributes (year, model, annual distance, etc.)
Table 8 Auto model data attributes (view columns)
The fact tables hold the complete datasets, but the views exclude the most recent two months to give the claims some time to develop.
Monthly snapshots are created to reproduce history if needed.
From a DW developer's point of view, the main issue in the transactional system is backdated transactions, when today's transaction updates something in the past. The oldest update I found was seven years back.
The data in the modeling data fact tables are refreshed daily in full. This is the most accurate, up-to-date source of the data in the system. For now, the full refresh is not the longest process in the ETL. But if needed, it's possible to limit the daily refresh to the latest year or two and lose some history.
A full refresh is not an option for the conformed dimensions. There are already links to the dimensions from other fact tables, which are refreshed incrementally. The workaround is discussed here.
Datasets created specifically for modeling became very popular among actuaries and analysts, not just for modeling but also for other projects.
The structures are complex and require a detailed description of how they were built and how they can be extended with other data available in the system. Examples of frequently asked questions from users:
- How were losses aggregated?
- Is coverage XYZ included in the Liability coverage group?
- How do I add one more attribute from DIM_BUILDING?
- Why is policy N's mid-term change not included?
Predictive models filed for review with state regulators must be thoroughly documented. Regulators look at the details and have standards regarding the information required to support models. Here are a few raw input data requirements from the State of Nevada:
- The company should provide a list of the input variables that were used in the model.
- The company needs to specify the timeframe to which the data apply.
- The company needs to identify the jurisdictional scope of the data (state-specific, countrywide, etc.).
- The company needs to specify the book of business included in the model's input data (private passenger auto, home, etc.).
- If third-party data were used, the company should identify who the third party is and how customers can obtain and correct any errors that might be discovered in the data.
- The underlying data for the model should include earned premium, losses, and the number of policies.
- How did the company address the lack of credible data for any segments that had limited historical experience?
A data catalog tool comes in handy to manage the modeling datasets, answer questions, and document the models.