dataset <- read.csv("C:\\Kate\\Research\\Property\\Data\\property_water_claims_non_cat_fs_v5.csv", header=TRUE)
library(GoodmanKruskal)
library(PerformanceAnalytics)
library(corrplot)
library(weights)
library(ggplot2)

EDA Water peril claims in CovA. Correlation

No correlation is observed between the predictors and the response variable. However, correlation among the predictors can explain the unexpected dependencies that appear visually between some predictors and the response variable.

Also, strongly correlated predictors should not be used together in some models.

For each subset of predictors, the code below builds:

  1. Pearson correlation coefficient matrix
  2. Pearson correlation coefficient matrix with insignificant pairs crossed out (sig.level = 0.01)
  3. GoodmanKruskal tau association matrix

GoodmanKruskal

https://cran.r-project.org/web/packages/GoodmanKruskal/vignettes/GoodmanKruskal.html

It is desirable to measure the association between numerical and categorical variable types. A GoodmanKruskal package function converts numerical variables into categorical ones, which may then be used as a basis for association analysis between mixed variable types.

This approach is somewhat experimental: grouping a numerical variable into a categorical one loses information, but neither the extent of this loss nor its impact is clear. It is also not obvious how many groups should be chosen, or how different grouping strategies influence the results.
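
To make the grouping question concrete, here is a minimal base-R sketch. It does not call the GoodmanKruskal package: gk_tau re-implements the tau formula directly, and group_numeric uses cut() with quantile breaks as a stand-in for the package's grouping step. Both helper names and the simulated score and status variables are hypothetical, and the code assumes complete data with no missing values.

```r
# Goodman-Kruskal tau(x -> y): share of the variability in y explained by x.
# The measure is asymmetric, so gk_tau(x, y) and gk_tau(y, x) generally differ.
gk_tau <- function(x, y) {
  tab <- table(factor(x), factor(y))  # factor() drops unused levels
  p   <- tab / sum(tab)               # joint proportions
  px  <- rowSums(p)                   # marginal distribution of x
  py  <- colSums(p)                   # marginal distribution of y
  v_y       <- 1 - sum(py^2)          # variability of y on its own
  v_y_given <- 1 - sum(p^2 / px)      # expected variability of y once x is known
  (v_y - v_y_given) / v_y
}

# Hypothetical grouping helper: n groups of roughly equal size (quantile breaks).
group_numeric <- function(x, n) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1))
  cut(x, breaks = breaks, include.lowest = TRUE)
}

set.seed(1)
score  <- rnorm(500)                                   # simulated numeric predictor
status <- ifelse(score + rnorm(500) > 0, "yes", "no")  # simulated categorical variable

# Forward and backward tau for several group counts
sapply(c(3, 5, 10), function(n) {
  g <- group_numeric(score, n)
  c(groups = n, tau_forward = gk_tau(g, status), tau_backward = gk_tau(status, g))
})
```

Running the comparison for a few group counts is one way to see how sensitive the association is to the grouping choice before trusting a single value.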

Pearson

Pearson correlation requires numeric attributes only. Ordered factor variables are converted to integers. For unordered factor variables I assigned integer codes by frequency: the most frequent level gets the highest value, and the less often a level appears in the original data, the lower its code. These encoded columns have the _encd suffix in the dataset.
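
As a rough illustration of this frequency-based encoding, the base-R sketch below ranks the levels of an unordered variable by how often they occur and uses the rank as the integer code. The helper name encode_by_frequency and the toy roof_type column are hypothetical. How ties were broken in the original _encd columns is not stated, so the tie rule here is an assumption.

```r
# Hypothetical helper: integer code per level, highest for the most frequent level.
encode_by_frequency <- function(x) {
  counts <- table(x)  # occurrences per level
  # Rank levels by count (1 = rarest); ties go to the level listed first (assumption).
  codes <- setNames(rank(as.vector(counts), ties.method = "first"), names(counts))
  as.integer(codes[as.character(x)])  # look up the code of each observation
}

# Toy data; the column name is made up for illustration.
toy <- data.frame(roof_type = c("tile", "shingle", "shingle", "metal", "shingle", "tile"))
toy$roof_type_encd <- encode_by_frequency(toy$roof_type)
toy
```

Because the codes only reflect frequency, a Pearson coefficient computed on them describes association with "how common the level is", not with any natural order of the levels.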

Weighted Pearson correlation shows the same pattern as the unweighted one.
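
One way to check such a claim is to compute both matrices and look at their difference. The sketch below uses cov.wt() from base R (stats) on simulated data. The weight vector exposure is hypothetical, since the weighting variable used in the original analysis is not named here, and the weights package loaded above may offer a more direct route.

```r
set.seed(42)
n <- 200
x <- rnorm(n)
toy <- data.frame(x = x, y = 0.5 * x + rnorm(n), z = rnorm(n))
exposure <- runif(n, 0.1, 1)  # hypothetical observation weights

r_unweighted <- cor(toy)                                    # ordinary Pearson matrix
r_weighted   <- cov.wt(toy, wt = exposure, cor = TRUE)$cor  # weighted Pearson matrix

# Element-wise difference; values near zero mean the two matrices tell the same story
round(r_weighted - r_unweighted, 2)
```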

Spearman correlation

Spearman correlation shows the same pattern as Pearson.
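
For reference, Spearman correlation is Pearson correlation applied to ranks, so the two agree closely when relationships are roughly linear and drift apart when they are monotone but curved. A small sketch on simulated data (all variable names are hypothetical):

```r
set.seed(7)
x <- rexp(300)
toy <- data.frame(x = x, y = log(x) + rnorm(300, sd = 0.5), z = rnorm(300))

r_pearson  <- cor(toy, method = "pearson")
r_spearman <- cor(toy, method = "spearman")

# Spearman equals Pearson computed on the ranks of each column
r_rank <- cor(sapply(toy, rank))

# Largest absolute gap between the two matrices
max(abs(r_spearman - r_pearson))
```

With the real data, the same comparison could be made by switching type = "pearson" to type = "spearman" in the rcorr() call of plot_correlation.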

plot_correlation <- function(v_set) {
  # Pearson correlations and p-values; rcorr() comes from Hmisc (loaded with weights)
  res <- rcorr(as.matrix(dataset[v_set]), type = "pearson")
  r <- round(res$r, 2)
  p <- res$P  # keep p-values unrounded so the sig.level = 0.01 cutoff is exact

  # 1. Plain matrix of correlation coefficients
  corrplot(r, method = "number")

  # 2. Colored upper triangle with insignificant pairs crossed out
  col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
  corrplot(r, method = "color", col = col(200),
           type = "upper", order = "hclust",
           addCoef.col = "black",            # print the coefficient in each cell
           tl.col = "darkblue", tl.srt = 45, # text label color and rotation
           p.mat = p, sig.level = 0.01,      # cross out pairs with p > 0.01
           diag = FALSE)

  # 3. GoodmanKruskal tau association matrix
  gk_matrix <- GKtauDataframe(dataset[v_set])
  plot(gk_matrix, corrColors = "blue")
}
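
The plots are read by eye. As a possible numeric companion, and as one way to act on the note that strongly correlated predictors should not be used together, the sketch below lists predictor pairs whose absolute Pearson correlation reaches a cutoff. The helper name list_correlated_pairs and the 0.7 threshold are assumptions, not part of the original analysis.

```r
# Hypothetical helper: table of variable pairs with |r| >= threshold.
list_correlated_pairs <- function(df, threshold = 0.7) {
  r <- cor(df, use = "pairwise.complete.obs")
  # Upper triangle only, so each pair is reported once and the diagonal is skipped
  idx <- which(upper.tri(r) & abs(r) >= threshold, arr.ind = TRUE)
  out <- data.frame(var1 = rownames(r)[idx[, "row"]],
                    var2 = colnames(r)[idx[, "col"]],
                    r    = r[idx],
                    stringsAsFactors = FALSE)
  out[order(-abs(out$r)), ]  # strongest pairs first
}

# Toy data: b is an exact multiple of a, c is only weakly related to either
toy <- data.frame(a = 1:10, b = 2 * (1:10), c = rep(c(1, -1), 5))
list_correlated_pairs(toy)
```

On the real data it could be called as list_correlated_pairs(dataset[v_set]), assuming every column in v_set is numeric.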

Landlord (?) Product Configuration

Correlation observed between:

  • landlordind discount and the related number of other policies for the same customer (customer_cnt_active_policies_binned)
  • landlordind and replacementcostdwellingind, which can explain why replacementcostdwellingind appears as an important feature in XGB classification
  • ordinanceorlawpct and safeguardplusind
  • equipmentbreakdown and ordinanceorlawpct
  • replacementvalueind and ordinanceorlawpct (GoodmanKruskal)

v_set <- c(
  'propertymanager',
  'rentersinsurance',
  'ordinanceorlawpct',
  'landlordind',
  'customer_cnt_active_policies_binned',
  'safeguardplusind',
  'replacementcostdwellingind',
  'homegardcreditind',
  'equipmentbreakdown',
  'replacementvalueind'
)
plot_correlation(v_set)

Misc discounts

Strong (or higher than usual) correlation between:

  • firealarmtype and burglaryalarmtype (is this a single device?)
  • kitchenfireextinguisherind and deadboltind (what’s this?)

The equipmentbreakdown feature is one of the important features in XGB classification, probably because of its correlation (0.03 in GoodmanKruskal is significant) with firealarmtype (which is not explainable by itself) and serviceline.

v_set <- c(
  'waterdetectiondevice',
  'sprinklersystem',
  'firealarmtype',
  'burglaryalarmtype',
  'kitchenfireextinguisherind',
  'deadboltind',
  'serviceline',
  'gatedcommunityind',
  'poolind',
  'equipmentbreakdown',
  'replacementvalueind'
)
plot_correlation(v_set)

Water Risk Scores

Strong correlation between: water_risk_3_blk and water_risk_fre_3_blk

Negative correlation between: water_risk_fre_3_blk and water_risk_sev_3_blk

v_set <- c(
  'water_risk_3_blk',
  'water_risk_fre_3_blk',
  'water_risk_sev_3_blk',
  'appl_fail_3_blk',
  'fixture_leak_3_blk',
  'pipe_froze_3_blk',
  'plumb_leak_3_blk',
  'rep_cost_3_blk',
  'ustructure_fail_3_blk',
  'waterh_fail_3_blk'
)
plot_correlation(v_set)

The replacementvalueind feature is one of the important features in XGB classification, but there is no explanation for this and no strong correlation with any other attribute. There is a GoodmanKruskal association with the very strong predictor usagetype and with ordinanceorlawpct (the latter is strong because of its correlation with landlordind).