In [1]:
library(MASS)
library(lmtest)
library(pscl)
library(xgboost)
library(ggplot2)
library(caret)
Loading required package: zoo


Attaching package: ‘zoo’


The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric


Classes and Methods for R developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University
Simon Jackman
hurdle and zeroinfl functions by Achim Zeileis

Loading required package: lattice

In [2]:
source('/home/kate/code/Utils/MyFunctions.R')
In [3]:
ModelsDir <- '/home/kate/Research/Property/Models/'
UseSavedIfExists <- TRUE
In [4]:
training_data <- read.csv("/home/kate/Research/Property/Data/property_wcf_training.csv", header=TRUE)
testing_data <- read.csv("/home/kate/Research/Property/Data/property_wcf_testing.csv", header=TRUE)

NB work with factors and because some numerical columns are factors (ordered) I convert them in factors explicitly. I may need to do this for some other columns which are Yes/No.

In [5]:
training_data$appl_fail_3_blk <- as.factor( training_data$appl_fail_3_blk )
training_data$fixture_leak_3_blk <- as.factor( training_data$fixture_leak_3_blk )
training_data$pipe_froze_3_blk <- as.factor( training_data$pipe_froze_3_blk )
training_data$plumb_leak_3_blk <- as.factor( training_data$plumb_leak_3_blk )
training_data$rep_cost_3_blk <- as.factor( training_data$rep_cost_3_blk )
training_data$ustructure_fail_3_blk <- as.factor( training_data$ustructure_fail_3_blk )
training_data$waterh_fail_3_blk <- as.factor( training_data$waterh_fail_3_blk )
In [6]:
testing_data$appl_fail_3_blk <- as.factor( testing_data$appl_fail_3_blk )
testing_data$fixture_leak_3_blk <- as.factor( testing_data$fixture_leak_3_blk )
testing_data$pipe_froze_3_blk <- as.factor( testing_data$pipe_froze_3_blk )
testing_data$plumb_leak_3_blk <- as.factor( testing_data$plumb_leak_3_blk )
testing_data$rep_cost_3_blk <- as.factor( testing_data$rep_cost_3_blk )
testing_data$ustructure_fail_3_blk <- as.factor( testing_data$ustructure_fail_3_blk )
testing_data$waterh_fail_3_blk <- as.factor( testing_data$waterh_fail_3_blk )
In [7]:
max_yearbuilt <- max(training_data$yearbuilt)
max_sqft <- max(training_data$sqft)
max_water_risk_3_blk <- max(training_data$water_risk_3_blk)
max_water_risk_fre_3_blk <- max(training_data$water_risk_fre_3_blk)
max_water_risk_sev_3_blk <- max(training_data$water_risk_sev_3_blk)
In [8]:
training_data$log_norm_yearbult <- log(training_data$yearbuilt/max_yearbuilt)
training_data$log_norm_sqft <- log(training_data$sqft/max_sqft)
training_data$log_norm_water_risk_3_blk <- log(training_data$water_risk_3_blk/max_water_risk_3_blk) 
training_data$log_norm_water_risk_fre_3_blk <- log(training_data$water_risk_fre_3_blk/max_water_risk_fre_3_blk)
training_data$log_norm_water_risk_sev_3_blk <- log(training_data$sqft/max_water_risk_sev_3_blk)
In [9]:
testing_data$log_norm_yearbult <- log(testing_data$yearbuilt/max_yearbuilt)
testing_data$log_norm_sqft <- log(testing_data$sqft/max_sqft)
testing_data$log_norm_water_risk_3_blk <- log(testing_data$water_risk_3_blk/max_water_risk_3_blk) 
testing_data$log_norm_water_risk_fre_3_blk <- log(testing_data$water_risk_fre_3_blk/max_water_risk_fre_3_blk)
testing_data$log_norm_water_risk_sev_3_blk <- log(testing_data$sqft/max_water_risk_sev_3_blk)
In [10]:
training_data$stories <- as.factor(training_data$stories)
training_data$units <- as.factor(training_data$units)
training_data$secondaryresidence <- as.factor(training_data$secondaryresidence)
training_data$sprinklersystem <- as.factor(training_data$sprinklersystem)
training_data$firealarmtype <- as.factor(training_data$firealarmtype)
training_data$waterdetectiondevice <- as.factor(training_data$waterdetectiondevice)
training_data$propertymanager <- as.factor(training_data$propertymanager)
training_data$cova_deductible <- as.factor(training_data$cova_deductible)
training_data$cova_limit       <- as.factor(training_data$cova_limit      )
training_data$waterded <- as.factor(training_data$waterded)
training_data$protectionclass <- as.factor(training_data$protectionclass)
training_data$fire_risk_model_score <- as.factor(training_data$fire_risk_model_score)
training_data$ordinanceorlawpct <- as.factor(training_data$ordinanceorlawpct)
training_data$functionalreplacementcost <- as.factor(training_data$functionalreplacementcost)
training_data$homegardcreditind <- as.factor(training_data$homegardcreditind)
training_data$landlordind <- as.factor(training_data$landlordind)
training_data$rentersinsurance <- as.factor(training_data$rentersinsurance) 
training_data$burglaryalarmtype <- as.factor(training_data$burglaryalarmtype) 
training_data$propertymanager <- as.factor(training_data$propertymanager)
In [11]:
testing_data$stories <- as.factor(testing_data$stories)
testing_data$units <- as.factor(testing_data$units)
testing_data$secondaryresidence <- as.factor(testing_data$secondaryresidence)
testing_data$sprinklersystem <- as.factor(testing_data$sprinklersystem)
testing_data$firealarmtype <- as.factor(testing_data$firealarmtype)
testing_data$waterdetectiondevice <- as.factor(testing_data$waterdetectiondevice)
testing_data$propertymanager <- as.factor(testing_data$propertymanager)
testing_data$cova_deductible <- as.factor(testing_data$cova_deductible)
testing_data$cova_limit       <- as.factor(testing_data$cova_limit      )
testing_data$waterded <- as.factor(testing_data$waterded)
testing_data$protectionclass <- as.factor(testing_data$protectionclass)
testing_data$fire_risk_model_score <- as.factor(testing_data$fire_risk_model_score)
testing_data$ordinanceorlawpct <- as.factor(testing_data$ordinanceorlawpct)
testing_data$functionalreplacementcost <- as.factor(testing_data$functionalreplacementcost)
testing_data$homegardcreditind <- as.factor(testing_data$homegardcreditind)
testing_data$landlordind <- as.factor(testing_data$landlordind)
testing_data$rentersinsurance <- as.factor(testing_data$rentersinsurance) 
testing_data$burglaryalarmtype <- as.factor(testing_data$burglaryalarmtype) 
testing_data$propertymanager <- as.factor(testing_data$propertymanager)
In [12]:
#unique numclaims values
numclaims <- unique(training_data$cova_ic_nc_water)

Modeling

In [49]:
formula_sign <-  cova_ic_nc_water ~  
log_norm_yearbult + 
log_norm_sqft + 
#stories + 
#roofcd + 
#units + 
#waterded + 
#protectionclass + 
#constructioncd + 
#sprinklersystem + 
#landlordind + 
#firealarmtype + 
cova_limit + 
log_norm_water_risk_3_blk + 
log_norm_water_risk_fre_3_blk + 
appl_fail_3_blk + 
fixture_leak_3_blk + 
#pipe_froze_3_blk + 
plumb_leak_3_blk + 
waterh_fail_3_blk |
log_norm_yearbult + 
log_norm_sqft + 
stories + 
roofcd + 
units + 
waterded + 
protectionclass + 
constructioncd + 
sprinklersystem + 
landlordind + 
firealarmtype + 
cova_limit + 
log_norm_water_risk_3_blk + 
log_norm_water_risk_fre_3_blk + 
appl_fail_3_blk + 
fixture_leak_3_blk + 
pipe_froze_3_blk + 
plumb_leak_3_blk + 
waterh_fail_3_blk
UseSavedIfExists <- FALSE
In [50]:
ModelFile <- paste(ModelsDir,"hurdlenb_sign.rds",sep="")
if(file.exists(ModelFile) && UseSavedIfExists){
    hurdlenb_sign <- readRDS(ModelFile)
    } else {
    hurdlenb_sign <- hurdle(formula_sign,  data = training_data,  offset=log_ecy,dist ="negbin",zero.dist = "negbin",link= "logit")
    saveRDS(hurdlenb_sign, ModelFile)
    }
summary(hurdlenb_sign)
Call:
hurdle(formula = formula_sign, data = training_data, offset = log_ecy, 
    dist = "negbin", zero.dist = "negbin", link = "logit")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-0.20491 -0.11816 -0.09754 -0.07601 54.81541 

Count model coefficients (truncated negbin with log link):
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    -15.39766 2274.07212  -0.007 0.994598    
log_norm_yearbult                8.37934    8.82994   0.949 0.342636    
log_norm_sqft                    0.21383    0.32996   0.648 0.516953    
cova_limit200000                -1.19337    0.49989  -2.387 0.016975 *  
cova_limit300000                -0.36550    0.38941  -0.939 0.347935    
cova_limit400000                -0.34984    0.41053  -0.852 0.394123    
cova_limit500000                -0.16992    0.45527  -0.373 0.708981    
cova_limit600000                -0.35003    0.51050  -0.686 0.492931    
cova_limit700000                 0.07342    0.55652   0.132 0.895046    
cova_limit800000                -0.55717    0.69396  -0.803 0.422041    
cova_limit900000                -0.12775    0.70805  -0.180 0.856820    
cova_limit1000000                0.27457    0.76018   0.361 0.717953    
cova_limit1200000                0.13837    0.82679   0.167 0.867083    
cova_limit1300000                0.06198    0.85706   0.072 0.942351    
log_norm_water_risk_3_blk       -0.33308    0.39183  -0.850 0.395296    
log_norm_water_risk_fre_3_blk    0.57943    0.38169   1.518 0.128999    
appl_fail_3_blk1                -1.89063    0.59800  -3.162 0.001569 ** 
appl_fail_3_blk2                -1.33125    0.65534  -2.031 0.042217 *  
appl_fail_3_blk3                -1.33491    0.64162  -2.081 0.037477 *  
appl_fail_3_blk4                -1.60169    0.58509  -2.737 0.006191 ** 
appl_fail_3_blk5                -1.93621    0.57197  -3.385 0.000711 ***
fixture_leak_3_blk1             -0.51052    0.29719  -1.718 0.085828 .  
fixture_leak_3_blk2             -0.28594    0.24957  -1.146 0.251900    
fixture_leak_3_blk3             -0.34919    0.28399  -1.230 0.218846    
fixture_leak_3_blk4             -0.60972    0.43584  -1.399 0.161829    
fixture_leak_3_blk5             -0.63172    1.08464  -0.582 0.560282    
plumb_leak_3_blk1               15.53947 2274.05525   0.007 0.994548    
plumb_leak_3_blk2               14.03767 2274.05519   0.006 0.995075    
plumb_leak_3_blk3               15.26909 2274.05537   0.007 0.994643    
plumb_leak_3_blk4               15.78141 2274.05524   0.007 0.994463    
plumb_leak_3_blk5               15.47721 2274.05513   0.007 0.994570    
waterh_fail_3_blk1               0.27501    0.25214   1.091 0.275402    
waterh_fail_3_blk2               0.02393    0.16935   0.141 0.887610    
waterh_fail_3_blk3              -0.39264    0.32501  -1.208 0.227018    
waterh_fail_3_blk4              -0.71218    0.53478  -1.332 0.182957    
waterh_fail_3_blk5              -0.67211    0.75521  -0.890 0.373489    
Log(theta)                       0.12247    2.29497   0.053 0.957443    
Zero hurdle model coefficients (censored negbin with log link):
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -3.132816   0.497787  -6.293 3.10e-10 ***
log_norm_yearbult             11.924830   2.341381   5.093 3.52e-07 ***
log_norm_sqft                  0.358799   0.087654   4.093 4.25e-05 ***
stories2                       0.079273   0.047554   1.667 0.095510 .  
stories3                      -0.580055   0.252344  -2.299 0.021524 *  
roofcdMEMBRANE                 0.205176   0.150476   1.364 0.172720    
roofcdMETAL                    0.304660   0.305783   0.996 0.319091    
roofcdOTHER                    0.493981   0.071638   6.895 5.37e-12 ***
roofcdTAR                      0.045489   0.113380   0.401 0.688264    
roofcdTILE                     0.099330   0.044559   2.229 0.025802 *  
roofcdWOOD                     0.358184   0.158697   2.257 0.024006 *  
units2                        -0.577583   0.106761  -5.410 6.30e-08 ***
units3                        -0.206236   0.161523  -1.277 0.201664    
units4                         0.067875   0.129895   0.523 0.601295    
waterded5000                  -0.490264   0.255844  -1.916 0.055332 .  
waterded7500                  -2.519778   0.750595  -3.357 0.000788 ***
waterded10000                 -1.840365   0.647057  -2.844 0.004452 ** 
protectionclass1               0.875207   0.208674   4.194 2.74e-05 ***
protectionclass2               0.839519   0.195440   4.296 1.74e-05 ***
protectionclass3               1.026456   0.200272   5.125 2.97e-07 ***
protectionclass4               1.184240   0.208296   5.685 1.31e-08 ***
protectionclass5               1.030953   0.214948   4.796 1.62e-06 ***
protectionclass6               0.755698   0.204528   3.695 0.000220 ***
protectionclass7               1.186880   0.595523   1.993 0.046261 *  
protectionclass8               0.317858   0.655511   0.485 0.627746    
protectionclass9              -0.978342   1.128430  -0.867 0.385945    
protectionclass10              0.598216   0.691078   0.866 0.386694    
constructioncdB                0.475553   0.150825   3.153 0.001616 ** 
constructioncdF                0.482519   0.057422   8.403  < 2e-16 ***
constructioncdM                0.674558   0.160630   4.199 2.68e-05 ***
constructioncdOTHER            0.363329   0.167475   2.169 0.030049 *  
sprinklersystem1              -0.121777   0.344192  -0.354 0.723485    
sprinklersystem2              -0.500003   0.105545  -4.737 2.17e-06 ***
landlordind1                  -0.662386   0.090067  -7.354 1.92e-13 ***
firealarmtype1                 0.391898   0.049487   7.919 2.39e-15 ***
cova_limit200000              -0.393803   0.114596  -3.436 0.000589 ***
cova_limit300000              -0.083125   0.105352  -0.789 0.430102    
cova_limit400000               0.173593   0.108622   1.598 0.110011    
cova_limit500000               0.197987   0.118661   1.669 0.095215 .  
cova_limit600000               0.277378   0.130348   2.128 0.033339 *  
cova_limit700000               0.197164   0.144949   1.360 0.173757    
cova_limit800000              -0.009915   0.165113  -0.060 0.952116    
cova_limit900000               0.364495   0.188329   1.935 0.052940 .  
cova_limit1000000              0.138417   0.218609   0.633 0.526620    
cova_limit1200000              0.015430   0.227770   0.068 0.945990    
cova_limit1300000              0.515527   0.245105   2.103 0.035441 *  
log_norm_water_risk_3_blk      0.220337   0.098299   2.241 0.024994 *  
log_norm_water_risk_fre_3_blk  0.498158   0.101304   4.917 8.77e-07 ***
appl_fail_3_blk1              -0.041176   0.238894  -0.172 0.863155    
appl_fail_3_blk2              -0.178518   0.248897  -0.717 0.473228    
appl_fail_3_blk3              -0.070801   0.248496  -0.285 0.775706    
appl_fail_3_blk4               0.050012   0.238516   0.210 0.833918    
appl_fail_3_blk5               0.022199   0.236523   0.094 0.925225    
fixture_leak_3_blk1           -0.389735   0.081469  -4.784 1.72e-06 ***
fixture_leak_3_blk2           -0.088013   0.059946  -1.468 0.142046    
fixture_leak_3_blk3           -0.152429   0.070292  -2.169 0.030121 *  
fixture_leak_3_blk4           -0.550288   0.114280  -4.815 1.47e-06 ***
fixture_leak_3_blk5           -0.897754   0.214837  -4.179 2.93e-05 ***
pipe_froze_3_blk1             -0.394985   0.086516  -4.565 4.98e-06 ***
pipe_froze_3_blk2             -0.539443   0.068190  -7.911 2.55e-15 ***
pipe_froze_3_blk3             -0.489226   0.076467  -6.398 1.58e-10 ***
pipe_froze_3_blk4             -0.394497   0.288123  -1.369 0.170939    
pipe_froze_3_blk5             -0.139778   0.223066  -0.627 0.530906    
plumb_leak_3_blk1              0.247118   0.223179   1.107 0.268179    
plumb_leak_3_blk2              0.204708   0.243149   0.842 0.399840    
plumb_leak_3_blk3              0.118195   0.235366   0.502 0.615544    
plumb_leak_3_blk4              0.332433   0.226751   1.466 0.142629    
plumb_leak_3_blk5              0.600195   0.233199   2.574 0.010060 *  
waterh_fail_3_blk1             0.298249   0.072077   4.138 3.50e-05 ***
waterh_fail_3_blk2             0.174581   0.042633   4.095 4.22e-05 ***
waterh_fail_3_blk3             0.031888   0.064998   0.491 0.623711    
waterh_fail_3_blk4             0.122794   0.100393   1.223 0.221279    
waterh_fail_3_blk5             0.126055   0.146135   0.863 0.388364    
Log(theta)                    -4.354794   0.236174 -18.439  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta: count = 1.1303, zero = 0.0128
Number of iterations in BFGS optimization: 163 
Log-likelihood: -5.322e+04 on 111 Df

Hurdle NB model is even more sensitive to the predictors then Hurdle Poisson or Zero Inflated. The predictors above are only I was able to add to get non NAN resullts.

Rootograms for Assessing Goodness of Fit of Probability...

In [58]:
library(countreg)
Attaching package: ‘countreg’


The following objects are masked from ‘package:pscl’:

    hurdle, hurdle.control, hurdletest, zeroinfl, zeroinfl.control


In [59]:
par(mfrow = c(1, 2))
rootogram(hurdlenb_sign,max = max(numclaims),main="Significant Features",style="hanging")

Predicting Number of claims

I use XGB to predict the final result based on probabilities of number claims in each row.

In [51]:
param0 <- list(
  "objective"  = "count:poisson"
  , "eval_metric" = "poisson-nloglik" 
)
In [52]:
#Adding  predicted probability values 
#to training and testing dataset for each numclaims
#as separate column



training_sign <- predict(hurdlenb_sign, training_data, type='prob')
testing_sign <- predict(hurdlenb_sign, testing_data, type='prob')

for(unique_value in numclaims){


training_data[paste("hurdlenb_sign", unique_value, sep = ".")] <- training_sign[,as.numeric(unique_value)+1]
testing_data[paste("hurdlenb_sign", unique_value, sep = ".")] <- testing_sign[,as.numeric(unique_value)+1]
    
}
In [53]:
#Features with probabilities listed by model

HNB_sign_features <- list();

i <- 1
for(unique_value in numclaims){
    HNB_sign_features[[i]] <- paste("hurdlenb_sign", unique_value, sep = ".")

    i <- i + 1
}
In [54]:
xgHNB_sign_training = xgb.DMatrix(as.matrix(training_data[,unlist(HNB_sign_features)]), label = training_data$cova_ic_nc_water)
setinfo(xgHNB_sign_training, "base_margin", log(training_data$ecy))

xgHNB_sign_testing = xgb.DMatrix(as.matrix(testing_data[,unlist(HNB_sign_features)]), label = testing_data$cova_ic_nc_water)
setinfo(xgHNB_sign_testing, "base_margin", log(testing_data$ecy))
TRUE
TRUE
In [55]:
xgb = xgb.train(
  nrounds = 10000
  , params = param0
  , data = xgHNB_sign_training
)
training_data$hurdlenb_sign = predict(xgb, xgHNB_sign_training)
testing_data$hurdlenb_sign = predict(xgb, xgHNB_sign_testing)

As an accuracy measure I use Normalized Weighted Gini:

In [57]:
Testing_Gini <- data.frame ("Significant Only" = NormalizedWeightedGini(testing_data$cova_ic_nc_water,testing_data$hurdlenb_sign,testing_data$ecy)
                   )
Testing_Gini
A data.frame: 1 × 1
Significant.Only
<dbl>
0.226707

K Fold Validation

TBD Does not work with Caret very strightforward and my manual approach is wrong somewhere

Conclusion

  • Normalization is needed to make the model work. I need to include normalized log columns in the datasets when it's extracted.
  • I convert all other columns into factors but not sure if it's needed.
  • I can use very small set of predictors in this model. With adding any other I get NAN in theta for the NB part.
  • Almost nothing significant for the NB part
  • Normalized Weighted Gini is the worse so far.
  • Is Logistic model (0/1) would be more precise?