dataset <- read.csv("C:\\Kate\\Research\\Property\\Data\\property_water_claims_non_cat_fs_v5.csv", header=TRUE)
library(GoodmanKruskal)
library(PerformanceAnalytics)
library(corrplot)
library(weights)
library(ggplot2)
No correlation is observed between the predictors and the response variable. However, correlation among the predictors may explain why some plots show an unexpected dependency between certain predictors and the response variable.
Also, strongly correlated predictors should not be used together in some models.
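The code below builds, for each subset of predictors, Pearson correlation plots and a Goodman-Kruskal tau association plot. The GoodmanKruskal package is described in its vignette:
https://cran.r-project.org/web/packages/GoodmanKruskal/vignettes/GoodmanKruskal.html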
It is desirable to measure the association between numerical and categorical variables. A function in the GoodmanKruskal package converts numerical variables into categorical ones, which can then serve as a basis for association analysis between mixed variable types.
This approach is somewhat experimental: information is lost when a numerical variable is grouped into a categorical one, but neither the extent of this loss nor its impact is clear. It is also not obvious how many groups should be chosen, or how the results are influenced by different grouping strategies.
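As a minimal sketch of the idea, the base R code below groups a numeric vector into quantile bins with cut() and computes Goodman-Kruskal tau in both directions from the contingency table. The helper gk_tau and the toy data are hypothetical and written for illustration only; this is not the package's implementation, and quantile grouping is just one of several possible strategies.

```r
# Hypothetical helper: Goodman-Kruskal tau for predicting y from x,
# tau(x -> y) = (sum_ij p_ij^2 / p_i+ - sum_j p_+j^2) / (1 - sum_j p_+j^2)
gk_tau <- function(x, y) {
  # factor() drops unused levels so no row or column total is zero
  p <- table(factor(x), factor(y)) / length(x)  # joint proportions
  p_x <- rowSums(p)                             # marginal of x
  p_y <- colSums(p)                             # marginal of y
  explained <- sum(p^2 / p_x)                   # row i is divided by p_x[i]
  baseline <- sum(p_y^2)
  (explained - baseline) / (1 - baseline)
}

# Toy data (assumption): a numeric variable and a two-level category
num <- 1:12
category <- factor(ifelse(num <= 8, "a", "b"))

# Group the numeric variable into three quantile bins
breaks <- quantile(num, probs = c(0, 1 / 3, 2 / 3, 1))
grouped <- cut(num, breaks = breaks, include.lowest = TRUE,
               labels = c("low", "mid", "high"))

tau_forward <- gk_tau(grouped, category)   # 1: the bin fully determines the category
tau_backward <- gk_tau(category, grouped)  # 0.5: the category only partly determines the bin
print(c(forward = tau_forward, backward = tau_backward))
```

The two values differ because tau is asymmetric: it measures how much knowing the first variable reduces the error in predicting the second. Changing the number of bins changes both values, which is the sensitivity the paragraph above warns about.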
The Pearson correlation requires numeric attributes. An ordered factor variable is converted to its integer codes. For unordered factor variables I assign integer values by frequency: the most frequent level gets the highest value, and the less often a level occurs in the original data, the lower its numerical representation. These encoded columns have the _encd suffix in the dataset.
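The encoding described above might look roughly like the following base R sketch. The helper name encode_by_frequency, the tie-breaking rule and the choice to start at 1 are assumptions made for illustration; the code that produced the _encd columns is not shown in this document.

```r
# Hypothetical helper: map each level of an unordered factor to an integer
# rank by frequency, so the most frequent level gets the highest value.
encode_by_frequency <- function(x) {
  tab <- table(x)
  freq <- setNames(as.vector(tab), names(tab))
  # Ties are broken by level order (an assumption)
  ranks <- rank(freq, ties.method = "first")
  as.integer(ranks[as.character(x)])
}

# Toy data (assumption): "c" is the most frequent level, "a" the least
levels_seen <- c("a", "b", "b", "c", "c", "c")
encoded <- encode_by_frequency(levels_seen)
print(encoded)  # 1 2 2 3 3 3

# Ordered factors can simply be converted to their integer codes
sizes <- factor(c("low", "high", "mid"),
                levels = c("low", "mid", "high"), ordered = TRUE)
size_codes <- as.integer(sizes)
print(size_codes)  # 1 3 2
```

Note that a frequency rank imposes an artificial order on the levels, so Pearson coefficients computed on such columns should be read with that in mind.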
The weighted Pearson correlation repeats the same pattern as the unweighted one.
The Spearman correlation repeats the same pattern as the Pearson correlation.
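A comparison of this kind can be reproduced along the following lines. The sketch uses base R only (cov.wt for a weighted Pearson correlation, cor with method = "spearman"); the toy data frame and the weight column exposure are hypothetical placeholders, because this document does not state which weight was used.

```r
# Toy data (assumption): two variables and a hypothetical weight column
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 1, 4, 3, 6, 5),
  exposure = c(1, 1, 2, 2, 3, 3)
)
vars <- as.matrix(toy[c("x", "y")])

# Unweighted Pearson and Spearman correlation matrices
pearson <- cor(vars, method = "pearson")
spearman <- cor(vars, method = "spearman")

# Weighted Pearson correlation; cov.wt normalizes the weights internally
weighted <- cov.wt(vars, wt = toy$exposure, cor = TRUE)$cor

# Sanity check: equal weights reproduce the unweighted Pearson matrix
equal_w <- cov.wt(vars, wt = rep(1, nrow(vars)), cor = TRUE)$cor

print(round(pearson, 2))
print(round(weighted, 2))
print(round(spearman, 2))
```

Comparing the three matrices side by side shows whether weighting or a rank-based measure changes the pattern of coefficients.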
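plot_correlation <- function(v_set) {
  # Pearson correlations and p-values (rcorr comes from Hmisc, loaded with weights)
  res <- rcorr(as.matrix(dataset[v_set]), type = "pearson")
  r <- round(res$r, 2)
  # Keep p-values unrounded so the comparison with sig.level below is exact
  p <- res$P
  # Plot 1: correlation coefficients shown as numbers
  corrplot(r, method = "number")
  # Plot 2: clustered color map; coefficients with p > 0.01 are marked as insignificant
  col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
  corrplot(r, method = "color", col = col(200),
           type = "upper", order = "hclust",
           addCoef.col = "black",            # add correlation coefficients
           tl.col = "darkblue", tl.srt = 45, # text label color and rotation
           p.mat = p, sig.level = 0.01,      # combine with significance level
           diag = FALSE
  )
  # Plot 3: Goodman-Kruskal tau association matrix for the same variables
  datacarFrame <- subset(dataset, select = v_set)
  GKmatrix <- GKtauDataframe(datacarFrame)
  plot(GKmatrix, corrColors = "blue")
}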
v_set<- c(
'propertymanager',
'rentersinsurance',
'ordinanceorlawpct',
'landlordind',
'customer_cnt_active_policies_binned',
'safeguardplusind',
'replacementcostdwellingind',
'homegardcreditind',
'equipmentbreakdown',
'replacementvalueind'
)
plot_correlation(v_set)
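To complement the plots, strongly correlated pairs can also be listed as a table. The helper below is a hypothetical sketch in base R: the name strong_pairs, the threshold of 0.5 and the toy matrix are assumptions for illustration, and in practice the input would be a correlation matrix such as the one returned by rcorr.

```r
# Hypothetical helper: list variable pairs whose absolute correlation
# reaches a threshold, using only the upper triangle to avoid duplicates.
strong_pairs <- function(r, threshold = 0.5) {
  idx <- which(upper.tri(r) & abs(r) >= threshold, arr.ind = TRUE)
  data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r = r[idx],
    stringsAsFactors = FALSE
  )
}

# Toy correlation matrix (assumption, not taken from the dataset)
toy_r <- matrix(
  c(1.0,  0.8,  0.1,
    0.8,  1.0, -0.6,
    0.1, -0.6,  1.0),
  nrow = 3,
  dimnames = list(c("a", "b", "c"), c("a", "b", "c"))
)

pairs_found <- strong_pairs(toy_r, threshold = 0.5)
print(pairs_found)  # two rows: a-b (0.8) and b-c (-0.6)
```

The threshold is a judgment call; a lower value may be appropriate when most coefficients in a set are close to zero.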
Strong (or higher than usual) correlation between:
Equipmentbreakdown is one of the important features in XGB classification. This is probably because of its correlation with firealarmtype (which is not explainable by itself) and serviceline; a Goodman-Kruskal value of 0.03 is treated as significant here.
v_set<- c(
'waterdetectiondevice',
'sprinklersystem',
'firealarmtype',
'burglaryalarmtype',
'kitchenfireextinguisherind',
'deadboltind',
'serviceline',
'gatedcommunityind',
'poolind',
'equipmentbreakdown',
'replacementvalueind'
)
plot_correlation(v_set)
Strong correlation between: water_risk_3_blk and water_risk_fre_3_blk
Negative correlation between: water_risk_fre_3_blk and water_risk_sev_3_blk
v_set<- c(
'water_risk_3_blk',
'water_risk_fre_3_blk',
'water_risk_sev_3_blk',
'appl_fail_3_blk',
'fixture_leak_3_blk',
'pipe_froze_3_blk',
'plumb_leak_3_blk',
'rep_cost_3_blk',
'ustructure_fail_3_blk',
'waterh_fail_3_blk'
)
plot_correlation(v_set)
Replacementvalueind is one of the important features in XGB classification, but there is no explanation for this and no strong correlation with any other attribute. There is a Goodman-Kruskal association with the very strong predictor usagetype and with ordinanceorlawpct (the latter is strong because of its correlation with landlordind).