In [2]:
library(MASS)
library(lmtest)
library(pscl)
library(xgboost)
library(ggplot2)
library(caret)
Loading required package: zoo


Attaching package: ‘zoo’


The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric


Classes and Methods for R developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University
Simon Jackman
hurdle and zeroinfl functions by Achim Zeileis

Loading required package: lattice

In [3]:
source('/home/kate/code/Utils/MyFunctions.R')
In [4]:
ModelsDir <- '/home/kate/Research/Property/Models/'
UseSavedIfExists <- TRUE
In [5]:
training_data <- read.csv("/home/kate/Research/Property/Data/property_wcf_training.csv", header=TRUE)
testing_data <- read.csv("/home/kate/Research/Property/Data/property_wcf_testing.csv", header=TRUE)

NB work with factors and because some numerical columns are factors (ordered) I convert them in factors explicitly. I may need to do this for some other columns which are Yes/No.

In [6]:
training_data$appl_fail_3_blk <- as.factor( training_data$appl_fail_3_blk )
training_data$fixture_leak_3_blk <- as.factor( training_data$fixture_leak_3_blk )
training_data$pipe_froze_3_blk <- as.factor( training_data$pipe_froze_3_blk )
training_data$plumb_leak_3_blk <- as.factor( training_data$plumb_leak_3_blk )
training_data$rep_cost_3_blk <- as.factor( training_data$rep_cost_3_blk )
training_data$ustructure_fail_3_blk <- as.factor( training_data$ustructure_fail_3_blk )
training_data$waterh_fail_3_blk <- as.factor( training_data$waterh_fail_3_blk )
In [7]:
testing_data$appl_fail_3_blk <- as.factor( testing_data$appl_fail_3_blk )
testing_data$fixture_leak_3_blk <- as.factor( testing_data$fixture_leak_3_blk )
testing_data$pipe_froze_3_blk <- as.factor( testing_data$pipe_froze_3_blk )
testing_data$plumb_leak_3_blk <- as.factor( testing_data$plumb_leak_3_blk )
testing_data$rep_cost_3_blk <- as.factor( testing_data$rep_cost_3_blk )
testing_data$ustructure_fail_3_blk <- as.factor( testing_data$ustructure_fail_3_blk )
testing_data$waterh_fail_3_blk <- as.factor( testing_data$waterh_fail_3_blk )
In [8]:
max_yearbuilt <- max(training_data$yearbuilt)
max_sqft <- max(training_data$sqft)
max_water_risk_3_blk <- max(training_data$water_risk_3_blk)
max_water_risk_fre_3_blk <- max(training_data$water_risk_fre_3_blk)
max_water_risk_sev_3_blk <- max(training_data$water_risk_sev_3_blk)
In [9]:
training_data$log_norm_yearbult <- log(training_data$yearbuilt/max_yearbuilt)
training_data$log_norm_sqft <- log(training_data$sqft/max_sqft)
training_data$log_norm_water_risk_3_blk <- log(training_data$water_risk_3_blk/max_water_risk_3_blk) 
training_data$log_norm_water_risk_fre_3_blk <- log(training_data$water_risk_fre_3_blk/max_water_risk_fre_3_blk)
training_data$log_norm_water_risk_sev_3_blk <- log(training_data$sqft/max_water_risk_sev_3_blk)
In [10]:
testing_data$log_norm_yearbult <- log(testing_data$yearbuilt/max_yearbuilt)
testing_data$log_norm_sqft <- log(testing_data$sqft/max_sqft)
testing_data$log_norm_water_risk_3_blk <- log(testing_data$water_risk_3_blk/max_water_risk_3_blk) 
testing_data$log_norm_water_risk_fre_3_blk <- log(testing_data$water_risk_fre_3_blk/max_water_risk_fre_3_blk)
testing_data$log_norm_water_risk_sev_3_blk <- log(testing_data$sqft/max_water_risk_sev_3_blk)
In [11]:
training_data$stories <- as.factor(training_data$stories)
training_data$units <- as.factor(training_data$units)
training_data$secondaryresidence <- as.factor(training_data$secondaryresidence)
training_data$sprinklersystem <- as.factor(training_data$sprinklersystem)
training_data$firealarmtype <- as.factor(training_data$firealarmtype)
training_data$waterdetectiondevice <- as.factor(training_data$waterdetectiondevice)
training_data$propertymanager <- as.factor(training_data$propertymanager)
training_data$cova_deductible <- as.factor(training_data$cova_deductible)
training_data$cova_limit       <- as.factor(training_data$cova_limit      )
training_data$waterded <- as.factor(training_data$waterded)
training_data$protectionclass <- as.factor(training_data$protectionclass)
training_data$fire_risk_model_score <- as.factor(training_data$fire_risk_model_score)
training_data$ordinanceorlawpct <- as.factor(training_data$ordinanceorlawpct)
training_data$functionalreplacementcost <- as.factor(training_data$functionalreplacementcost)
training_data$homegardcreditind <- as.factor(training_data$homegardcreditind)
training_data$landlordind <- as.factor(training_data$landlordind)
training_data$rentersinsurance <- as.factor(training_data$rentersinsurance) 
training_data$burglaryalarmtype <- as.factor(training_data$burglaryalarmtype) 
training_data$propertymanager <- as.factor(training_data$propertymanager)
In [12]:
testing_data$stories <- as.factor(testing_data$stories)
testing_data$units <- as.factor(testing_data$units)
testing_data$secondaryresidence <- as.factor(testing_data$secondaryresidence)
testing_data$sprinklersystem <- as.factor(testing_data$sprinklersystem)
testing_data$firealarmtype <- as.factor(testing_data$firealarmtype)
testing_data$waterdetectiondevice <- as.factor(testing_data$waterdetectiondevice)
testing_data$propertymanager <- as.factor(testing_data$propertymanager)
testing_data$cova_deductible <- as.factor(testing_data$cova_deductible)
testing_data$cova_limit       <- as.factor(testing_data$cova_limit      )
testing_data$waterded <- as.factor(testing_data$waterded)
testing_data$protectionclass <- as.factor(testing_data$protectionclass)
testing_data$fire_risk_model_score <- as.factor(testing_data$fire_risk_model_score)
testing_data$ordinanceorlawpct <- as.factor(testing_data$ordinanceorlawpct)
testing_data$functionalreplacementcost <- as.factor(testing_data$functionalreplacementcost)
testing_data$homegardcreditind <- as.factor(testing_data$homegardcreditind)
testing_data$landlordind <- as.factor(testing_data$landlordind)
testing_data$rentersinsurance <- as.factor(testing_data$rentersinsurance) 
testing_data$burglaryalarmtype <- as.factor(testing_data$burglaryalarmtype) 
testing_data$propertymanager <- as.factor(testing_data$propertymanager)
In [13]:
#unique numclaims values
numclaims <- unique(training_data$cova_ic_nc_water)

Modeling

In [14]:
formula_all <-  cova_ic_nc_water ~  
log_norm_yearbult + 
log_norm_sqft + 
stories + 
roofcd + 
units + 
occupancycd + 
waterded + 
protectionclass + 
constructioncd + 
#fire_risk_model_score + 
#usagetype + 
secondaryresidence + 
ordinanceorlawpct + 
functionalreplacementcost + 
homegardcreditind + 
sprinklersystem + 
landlordind + 
rentersinsurance + 
firealarmtype + 
burglaryalarmtype + 
waterdetectiondevice + 
propertymanager + 
#cova_deductible + 
cova_limit + 
log_norm_water_risk_3_blk + 
log_norm_water_risk_fre_3_blk + 
#log_norm_water_risk_sev_3_blk + 
appl_fail_3_blk + 
fixture_leak_3_blk + 
pipe_froze_3_blk + 
plumb_leak_3_blk + 
rep_cost_3_blk + 
ustructure_fail_3_blk + 
waterh_fail_3_blk

Poisson models

In [15]:
ModelFile <- paste(ModelsDir,"hurdlepoisson_all.rds",sep="")
if(file.exists(ModelFile) && UseSavedIfExists){
    hurdlepoisson_all <- readRDS(ModelFile)
    } else {
    hurdlepoisson_all <- hurdle(formula_all,  data = training_data,  offset=log_ecy,dist ="poisson",zero.dist = "poisson",link= "logit")
    saveRDS(hurdlepoisson_all, ModelFile)
    }
summary(hurdlepoisson_all)
Call:
hurdle(formula = formula_all, data = training_data, offset = log_ecy, 
    dist = "poisson", zero.dist = "poisson", link = "logit")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-0.25827 -0.11563 -0.09533 -0.07697 47.69662 

Count model coefficients (truncated poisson with log link):
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -9.553e+00  3.380e+01  -0.283 0.777456    
log_norm_yearbult              1.426e+01  1.042e+01   1.369 0.170906    
log_norm_sqft                  3.091e-01  3.524e-01   0.877 0.380430    
stories2                       1.038e-01  1.962e-01   0.529 0.596736    
stories3                      -1.299e+01  8.069e+02  -0.016 0.987159    
roofcdMEMBRANE                 4.324e-01  5.940e-01   0.728 0.466680    
roofcdMETAL                   -1.411e+01  1.496e+03  -0.009 0.992475    
roofcdOTHER                    1.725e-01  2.490e-01   0.693 0.488578    
roofcdTAR                     -1.335e+00  1.022e+00  -1.307 0.191356    
roofcdTILE                    -9.501e-03  1.881e-01  -0.051 0.959718    
roofcdWOOD                    -1.446e+01  8.489e+02  -0.017 0.986406    
units2                        -1.364e-01  6.009e-01  -0.227 0.820424    
units3                         3.495e-01  6.383e-01   0.548 0.583969    
units4                        -1.715e-01  5.798e-01  -0.296 0.767457    
occupancycdTENANT             -2.266e-01  3.644e-01  -0.622 0.534001    
waterded5000                  -1.361e+01  1.083e+03  -0.013 0.989974    
waterded7500                  -3.266e+00  3.810e+01  -0.086 0.931676    
waterded10000                 -5.249e+00  4.875e+01  -0.108 0.914255    
protectionclass1              -1.407e-01  7.756e-01  -0.181 0.856056    
protectionclass2              -6.678e-01  7.292e-01  -0.916 0.359753    
protectionclass3              -5.476e-01  7.227e-01  -0.758 0.448650    
protectionclass4              -5.916e-01  7.228e-01  -0.819 0.413068    
protectionclass5              -3.888e-02  7.729e-01  -0.050 0.959883    
protectionclass6              -8.568e-01  8.124e-01  -1.055 0.291560    
protectionclass7              -1.356e+01  2.008e+03  -0.007 0.994613    
protectionclass8              -5.731e+00  7.939e+01  -0.072 0.942453    
protectionclass9              -1.322e+00  2.603e+03  -0.001 0.999595    
protectionclass10             -6.184e+00  1.002e+02  -0.062 0.950807    
constructioncdB                1.638e+00  5.326e-01   3.075 0.002103 ** 
constructioncdF               -1.017e-01  1.838e-01  -0.553 0.580219    
constructioncdM                9.349e-01  5.827e-01   1.604 0.108638    
constructioncdOTHER           -7.768e-01  1.032e+00  -0.753 0.451672    
secondaryresidence1           -9.048e+00  2.601e+02  -0.035 0.972247    
ordinanceorlawpct10            2.176e-01  1.734e-01   1.255 0.209376    
ordinanceorlawpct15           -9.629e+00  2.628e+02  -0.037 0.970773    
ordinanceorlawpct20           -2.157e-01  2.500e-01  -0.863 0.388284    
ordinanceorlawpct25            9.286e-01  5.285e-01   1.757 0.078897 .  
ordinanceorlawpct40           -1.273e+01  9.193e+02  -0.014 0.988948    
ordinanceorlawpct50            1.659e+00  7.335e-01   2.261 0.023746 *  
ordinanceorlawpct65           -1.162e+01  6.164e+02  -0.019 0.984953    
ordinanceorlawpct75            2.551e+00  1.009e+00   2.529 0.011445 *  
ordinanceorlawpct90           -1.070e+01  4.159e+02  -0.026 0.979470    
ordinanceorlawpct100          -1.046e+01  4.414e+02  -0.024 0.981089    
functionalreplacementcost1    -4.946e+00  5.053e+01  -0.098 0.922039    
homegardcreditind1            -4.701e-01  2.171e-01  -2.165 0.030358 *  
sprinklersystem1              -1.354e+01  1.107e+03  -0.012 0.990242    
sprinklersystem2               2.725e-01  3.364e-01   0.810 0.417828    
landlordind1                  -1.315e+00  7.312e-01  -1.798 0.072131 .  
rentersinsurance1             -1.438e+01  1.502e+03  -0.010 0.992362    
firealarmtype1                -3.064e-01  1.716e-01  -1.786 0.074097 .  
burglaryalarmtype1             1.996e-01  1.793e-01   1.114 0.265490    
waterdetectiondevice1         -1.867e-01  1.088e+04   0.000 0.999986    
propertymanager1               1.021e+00  6.112e-01   1.670 0.094893 .  
cova_limit200000              -1.259e+00  5.210e-01  -2.416 0.015700 *  
cova_limit300000              -3.145e-01  3.974e-01  -0.791 0.428729    
cova_limit400000              -3.228e-01  4.200e-01  -0.769 0.442070    
cova_limit500000              -2.326e-01  4.623e-01  -0.503 0.614897    
cova_limit600000              -4.185e-01  5.232e-01  -0.800 0.423820    
cova_limit700000              -4.851e-02  5.658e-01  -0.086 0.931677    
cova_limit800000              -6.945e-01  7.015e-01  -0.990 0.322168    
cova_limit900000              -2.515e-01  7.111e-01  -0.354 0.723545    
cova_limit1000000              1.309e-01  7.615e-01   0.172 0.863557    
cova_limit1200000              1.401e-02  8.323e-01   0.017 0.986571    
cova_limit1300000             -6.320e-02  8.693e-01  -0.073 0.942042    
log_norm_water_risk_3_blk     -1.642e-01  4.399e-01  -0.373 0.708912    
log_norm_water_risk_fre_3_blk  3.900e-01  4.371e-01   0.892 0.372302    
appl_fail_3_blk1              -1.883e+00  5.762e-01  -3.267 0.001085 ** 
appl_fail_3_blk2              -1.190e+00  6.329e-01  -1.880 0.060043 .  
appl_fail_3_blk3              -1.339e+00  6.185e-01  -2.166 0.030347 *  
appl_fail_3_blk4              -1.593e+00  5.644e-01  -2.823 0.004757 ** 
appl_fail_3_blk5              -1.933e+00  5.503e-01  -3.512 0.000445 ***
fixture_leak_3_blk1           -3.793e-01  3.124e-01  -1.214 0.224724    
fixture_leak_3_blk2           -1.429e-01  2.546e-01  -0.561 0.574645    
fixture_leak_3_blk3           -2.558e-01  2.938e-01  -0.871 0.383897    
fixture_leak_3_blk4           -4.338e-01  4.532e-01  -0.957 0.338471    
fixture_leak_3_blk5           -5.712e-01  1.103e+00  -0.518 0.604638    
pipe_froze_3_blk1             -4.992e-01  3.799e-01  -1.314 0.188809    
pipe_froze_3_blk2             -2.295e-01  1.878e-01  -1.222 0.221519    
pipe_froze_3_blk3             -1.332e-01  2.424e-01  -0.549 0.582691    
pipe_froze_3_blk4             -1.122e+01  5.198e+02  -0.022 0.982780    
pipe_froze_3_blk5              2.593e-01  1.065e+00   0.244 0.807573    
plumb_leak_3_blk1              6.414e+00  2.526e+01   0.254 0.799571    
plumb_leak_3_blk2              4.994e+00  2.528e+01   0.198 0.843394    
plumb_leak_3_blk3              6.273e+00  2.527e+01   0.248 0.803913    
plumb_leak_3_blk4              6.593e+00  2.526e+01   0.261 0.794120    
plumb_leak_3_blk5              6.231e+00  2.526e+01   0.247 0.805203    
rep_cost_3_blk1                4.154e+00  2.246e+01   0.185 0.853278    
rep_cost_3_blk2                6.251e+00  2.246e+01   0.278 0.780709    
rep_cost_3_blk3                6.421e+00  2.245e+01   0.286 0.774893    
rep_cost_3_blk4                4.964e+00  2.244e+01   0.221 0.824933    
rep_cost_3_blk5                5.425e+00  2.244e+01   0.242 0.808947    
ustructure_fail_3_blk1        -4.470e-01  5.723e-01  -0.781 0.434756    
ustructure_fail_3_blk2        -1.165e+00  6.402e-01  -1.819 0.068843 .  
ustructure_fail_3_blk3        -9.772e-02  6.325e-01  -0.154 0.877221    
ustructure_fail_3_blk4        -7.904e-01  5.896e-01  -1.341 0.180067    
ustructure_fail_3_blk5        -6.472e-01  5.678e-01  -1.140 0.254349    
waterh_fail_3_blk1             2.384e-01  2.678e-01   0.890 0.373227    
waterh_fail_3_blk2             1.450e-02  1.748e-01   0.083 0.933865    
waterh_fail_3_blk3            -4.366e-01  3.301e-01  -1.323 0.185872    
waterh_fail_3_blk4            -8.171e-01  5.513e-01  -1.482 0.138274    
waterh_fail_3_blk5            -8.318e-01  7.816e-01  -1.064 0.287192    
Zero hurdle model coefficients (censored poisson with log link):
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -4.194262   0.360932 -11.621  < 2e-16 ***
log_norm_yearbult              5.746150   1.379265   4.166 3.10e-05 ***
log_norm_sqft                  0.181123   0.050201   3.608 0.000309 ***
stories2                       0.079921   0.030131   2.652 0.007990 ** 
stories3                      -0.350438   0.185516  -1.889 0.058893 .  
roofcdMEMBRANE                 0.136598   0.097827   1.396 0.162617    
roofcdMETAL                    0.211128   0.190730   1.107 0.268315    
roofcdOTHER                    0.275231   0.036674   7.505 6.15e-14 ***
roofcdTAR                      0.027985   0.080414   0.348 0.727833    
roofcdTILE                     0.096572   0.029402   3.284 0.001022 ** 
roofcdWOOD                     0.227921   0.098356   2.317 0.020487 *  
units2                        -0.446763   0.081702  -5.468 4.55e-08 ***
units3                        -0.146778   0.119170  -1.232 0.218073    
units4                         0.085228   0.086815   0.982 0.326239    
occupancycdTENANT             -0.002897   0.051786  -0.056 0.955383    
waterded5000                  -0.344591   0.193704  -1.779 0.075247 .  
waterded7500                  -2.190995   0.707541  -3.097 0.001957 ** 
waterded10000                 -1.543012   0.577811  -2.670 0.007575 ** 
protectionclass1               0.542264   0.154245   3.516 0.000439 ***
protectionclass2               0.530657   0.146262   3.628 0.000285 ***
protectionclass3               0.656033   0.145848   4.498 6.86e-06 ***
protectionclass4               0.751120   0.145967   5.146 2.66e-07 ***
protectionclass5               0.651113   0.153852   4.232 2.32e-05 ***
protectionclass6               0.522587   0.152962   3.416 0.000634 ***
protectionclass7               0.755802   0.405251   1.865 0.062178 .  
protectionclass8               0.077847   0.521687   0.149 0.881379    
protectionclass9              -0.857809   1.010635  -0.849 0.396003    
protectionclass10              0.240159   0.520951   0.461 0.644798    
constructioncdB                0.297216   0.095078   3.126 0.001772 ** 
constructioncdF                0.297666   0.026621  11.181  < 2e-16 ***
constructioncdM                0.426862   0.091721   4.654 3.26e-06 ***
constructioncdOTHER            0.257353   0.107565   2.393 0.016732 *  
secondaryresidence1            0.057080   0.707618   0.081 0.935708    
ordinanceorlawpct10            0.021473   0.027224   0.789 0.430262    
ordinanceorlawpct15           -0.427303   0.409323  -1.044 0.296519    
ordinanceorlawpct20            0.041367   0.038911   1.063 0.287733    
ordinanceorlawpct25            0.007282   0.120324   0.061 0.951740    
ordinanceorlawpct40            0.069841   0.215586   0.324 0.745969    
ordinanceorlawpct50            0.038681   0.225406   0.172 0.863748    
ordinanceorlawpct65           -0.270334   0.379175  -0.713 0.475874    
ordinanceorlawpct75           -0.559651   0.707746  -0.791 0.429089    
ordinanceorlawpct90           -0.170274   0.269084  -0.633 0.526869    
ordinanceorlawpct100           0.016964   0.501564   0.034 0.973018    
functionalreplacementcost1    -1.040015   0.708843  -1.467 0.142321    
homegardcreditind1             0.184041   0.031577   5.828 5.60e-09 ***
sprinklersystem1              -0.014616   0.213892  -0.068 0.945519    
sprinklersystem2              -0.312265   0.065481  -4.769 1.85e-06 ***
landlordind1                  -0.541306   0.068427  -7.911 2.56e-15 ***
rentersinsurance1              0.055546   0.201896   0.275 0.783222    
firealarmtype1                 0.204142   0.027327   7.470 8.00e-14 ***
burglaryalarmtype1             0.041433   0.028792   1.439 0.150142    
waterdetectiondevice1          0.169481   1.001930   0.169 0.865675    
propertymanager1               0.057252   0.117715   0.486 0.626709    
cova_limit200000              -0.268268   0.071116  -3.772 0.000162 ***
cova_limit300000              -0.045017   0.066908  -0.673 0.501062    
cova_limit400000               0.133127   0.069758   1.908 0.056338 .  
cova_limit500000               0.161142   0.075672   2.129 0.033215 *  
cova_limit600000               0.190368   0.082509   2.307 0.021041 *  
cova_limit700000               0.137388   0.091703   1.498 0.134084    
cova_limit800000              -0.003662   0.105421  -0.035 0.972290    
cova_limit900000               0.217972   0.116013   1.879 0.060264 .  
cova_limit1000000              0.092530   0.138781   0.667 0.504944    
cova_limit1200000             -0.004605   0.147074  -0.031 0.975024    
cova_limit1300000              0.290072   0.145433   1.995 0.046093 *  
log_norm_water_risk_3_blk      0.114629   0.064409   1.780 0.075125 .  
log_norm_water_risk_fre_3_blk  0.364387   0.063242   5.762 8.32e-09 ***
appl_fail_3_blk1              -0.015908   0.164443  -0.097 0.922934    
appl_fail_3_blk2              -0.121912   0.172227  -0.708 0.479036    
appl_fail_3_blk3              -0.051283   0.171093  -0.300 0.764380    
appl_fail_3_blk4               0.038693   0.164134   0.236 0.813635    
appl_fail_3_blk5               0.019092   0.162944   0.117 0.906726    
fixture_leak_3_blk1           -0.233725   0.048659  -4.803 1.56e-06 ***
fixture_leak_3_blk2           -0.071304   0.040280  -1.770 0.076696 .  
fixture_leak_3_blk3           -0.090355   0.046210  -1.955 0.050546 .  
fixture_leak_3_blk4           -0.330706   0.067685  -4.886 1.03e-06 ***
fixture_leak_3_blk5           -0.592042   0.137909  -4.293 1.76e-05 ***
pipe_froze_3_blk1             -0.228888   0.050285  -4.552 5.32e-06 ***
pipe_froze_3_blk2             -0.322046   0.029574 -10.889  < 2e-16 ***
pipe_froze_3_blk3             -0.288505   0.039463  -7.311 2.66e-13 ***
pipe_froze_3_blk4             -0.250039   0.199436  -1.254 0.209941    
pipe_froze_3_blk5             -0.057536   0.152842  -0.376 0.706589    
plumb_leak_3_blk1              0.290529   0.167329   1.736 0.082515 .  
plumb_leak_3_blk2              0.175153   0.182112   0.962 0.336156    
plumb_leak_3_blk3              0.159464   0.176932   0.901 0.367445    
plumb_leak_3_blk4              0.363683   0.169174   2.150 0.031574 *  
plumb_leak_3_blk5              0.505662   0.171144   2.955 0.003131 ** 
rep_cost_3_blk1                0.114813   0.192579   0.596 0.551052    
rep_cost_3_blk2               -0.148637   0.240122  -0.619 0.535911    
rep_cost_3_blk3                0.145678   0.231421   0.629 0.529027    
rep_cost_3_blk4                0.009310   0.183095   0.051 0.959447    
rep_cost_3_blk5                0.011760   0.179147   0.066 0.947660    
ustructure_fail_3_blk1         0.061469   0.112045   0.549 0.583273    
ustructure_fail_3_blk2         0.132210   0.114368   1.156 0.247679    
ustructure_fail_3_blk3        -0.105524   0.124551  -0.847 0.396862    
ustructure_fail_3_blk4         0.027009   0.112193   0.241 0.809759    
ustructure_fail_3_blk5        -0.004462   0.110312  -0.040 0.967733    
waterh_fail_3_blk1             0.153615   0.043567   3.526 0.000422 ***
waterh_fail_3_blk2             0.094511   0.026827   3.523 0.000427 ***
waterh_fail_3_blk3             0.001891   0.043218   0.044 0.965094    
waterh_fail_3_blk4             0.047039   0.064977   0.724 0.469106    
waterh_fail_3_blk5             0.003693   0.092138   0.040 0.968028    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 179 
Log-likelihood: -5.317e+04 on 202 Df

when fire risk is added - kernel died, looks like to many predictors usagetype - all NANs secondaryresidence is Ok cova_deductible - all NANs log_norm_water_risk_sev_3_blk - Error in optim(fn = countDist, gr = countGrad, par = c(start$count, if (dist == : non-finite value supplied by optim Is the error related to a specific predictor or a "summarized" effect?

Normalization is needed to make the model work. I was not able to add into the model and test cova_deductible and log_norm_water_risk_sev_3_blk. It returns errors. usagetype returns NA in covariance matrix for some levels. Some predictors (like ordinanceorlawpct) are marked as significant but really it's a random component I feel.

In [16]:
formula_sign <-  cova_ic_nc_water ~  
log_norm_yearbult + 
log_norm_sqft + 
stories + 
roofcd + 
units + 
waterded + 
protectionclass + 
constructioncd + 
sprinklersystem + 
landlordind + 
firealarmtype + 
cova_limit + 
log_norm_water_risk_fre_3_blk + 
appl_fail_3_blk +
fixture_leak_3_blk + 
pipe_froze_3_blk + 
plumb_leak_3_blk + 
waterh_fail_3_blk
In [17]:
ModelFile <- paste(ModelsDir,"hurdlepoisson_sign.rds",sep="")
if(file.exists(ModelFile) && UseSavedIfExists){
    hurdlepoisson_sign <- readRDS(ModelFile)
    } else {
    hurdlepoisson_sign <- hurdle(formula_sign,  data = training_data,  offset=log_ecy,dist ="poisson",zero.dist = "poisson",link= "logit")
    saveRDS(hurdlepoisson_sign, ModelFile)
    }
summary(hurdlepoisson_sign)
Call:
hurdle(formula = formula_sign, data = training_data, offset = log_ecy, 
    dist = "poisson", zero.dist = "poisson", link = "logit")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-0.25169 -0.11595 -0.09583 -0.07708 47.44414 

Count model coefficients (truncated poisson with log link):
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -4.86432   24.80218  -0.196 0.844513    
log_norm_yearbult                9.32513   10.15360   0.918 0.358406    
log_norm_sqft                    0.18683    0.34298   0.545 0.585947    
stories2                         0.15034    0.19307   0.779 0.436165    
stories3                       -13.47389  959.05181  -0.014 0.988791    
roofcdMEMBRANE                   0.36506    0.59219   0.616 0.537597    
roofcdMETAL                    -13.12433 1010.54052  -0.013 0.989638    
roofcdOTHER                      0.17839    0.23629   0.755 0.450260    
roofcdTAR                       -1.25749    1.01131  -1.243 0.213712    
roofcdTILE                       0.02332    0.18705   0.125 0.900803    
roofcdWOOD                     -15.07423 1194.79967  -0.013 0.989934    
units2                          -0.13832    0.59428  -0.233 0.815952    
units3                           0.36815    0.62217   0.592 0.554036    
units4                          -0.19648    0.55522  -0.354 0.723439    
waterded5000                   -13.46006  994.31783  -0.014 0.989199    
waterded7500                    -2.50407   31.54022  -0.079 0.936720    
waterded10000                   -4.63935   34.40981  -0.135 0.892749    
protectionclass1                -0.18233    0.77332  -0.236 0.813607    
protectionclass2                -0.64126    0.72893  -0.880 0.379002    
protectionclass3                -0.53935    0.72329  -0.746 0.455855    
protectionclass4                -0.55284    0.72492  -0.763 0.445692    
protectionclass5                -0.02203    0.77265  -0.029 0.977258    
protectionclass6                -0.80039    0.81767  -0.979 0.327647    
protectionclass7               -15.03879 3779.68063  -0.004 0.996825    
protectionclass8                -6.34371  104.97996  -0.060 0.951815    
protectionclass9                -1.16655 4360.37413   0.000 0.999787    
protectionclass10               -6.40503  116.02899  -0.055 0.955978    
constructioncdB                  1.26198    0.50660   2.491 0.012735 *  
constructioncdF                 -0.10169    0.17887  -0.568 0.569704    
constructioncdM                  1.02961    0.48891   2.106 0.035211 *  
constructioncdOTHER             -0.81337    1.02726  -0.792 0.428486    
sprinklersystem1               -13.57665 1089.25637  -0.012 0.990055    
sprinklersystem2                 0.34423    0.33209   1.037 0.299938    
landlordind1                    -1.31244    0.71346  -1.840 0.065834 .  
firealarmtype1                  -0.30976    0.14443  -2.145 0.031980 *  
cova_limit200000                -1.32971    0.50548  -2.631 0.008524 ** 
cova_limit300000                -0.33777    0.38983  -0.866 0.386237    
cova_limit400000                -0.32719    0.41244  -0.793 0.427598    
cova_limit500000                -0.16730    0.45501  -0.368 0.713111    
cova_limit600000                -0.36619    0.51239  -0.715 0.474810    
cova_limit700000                 0.06752    0.55209   0.122 0.902659    
cova_limit800000                -0.55320    0.68513  -0.807 0.419418    
cova_limit900000                -0.18661    0.69751  -0.268 0.789055    
cova_limit1000000                0.19140    0.74830   0.256 0.798124    
cova_limit1200000                0.12818    0.81437   0.157 0.874929    
cova_limit1300000                0.06786    0.84339   0.080 0.935869    
log_norm_water_risk_fre_3_blk    0.23012    0.18527   1.242 0.214210    
appl_fail_3_blk1                -1.94459    0.56531  -3.440 0.000582 ***
appl_fail_3_blk2                -1.34160    0.62172  -2.158 0.030938 *  
appl_fail_3_blk3                -1.36606    0.60840  -2.245 0.024747 *  
appl_fail_3_blk4                -1.66654    0.55251  -3.016 0.002559 ** 
appl_fail_3_blk5                -2.01327    0.53811  -3.741 0.000183 ***
fixture_leak_3_blk1             -0.48907    0.29896  -1.636 0.101862    
fixture_leak_3_blk2             -0.25674    0.24708  -1.039 0.298756    
fixture_leak_3_blk3             -0.30783    0.28096  -1.096 0.273230    
fixture_leak_3_blk4             -0.54578    0.43532  -1.254 0.209931    
fixture_leak_3_blk5             -0.55959    1.07405  -0.521 0.602357    
pipe_froze_3_blk1               -0.46664    0.37780  -1.235 0.216767    
pipe_froze_3_blk2               -0.20880    0.18392  -1.135 0.256254    
pipe_froze_3_blk3               -0.08144    0.23852  -0.341 0.732766    
pipe_froze_3_blk4              -12.07604  698.29076  -0.017 0.986202    
pipe_froze_3_blk5                0.09477    1.02962   0.092 0.926665    
plumb_leak_3_blk1                6.32347   24.77207   0.255 0.798518    
plumb_leak_3_blk2                4.90735   24.79172   0.198 0.843089    
plumb_leak_3_blk3                6.17209   24.77674   0.249 0.803277    
plumb_leak_3_blk4                6.50558   24.77256   0.263 0.792849    
plumb_leak_3_blk5                6.13759   24.77313   0.248 0.804326    
waterh_fail_3_blk1               0.18911    0.25137   0.752 0.451845    
waterh_fail_3_blk2              -0.01773    0.16956  -0.105 0.916703    
waterh_fail_3_blk3              -0.44317    0.32330  -1.371 0.170448    
waterh_fail_3_blk4              -0.78828    0.52936  -1.489 0.136452    
waterh_fail_3_blk5              -0.82213    0.74840  -1.099 0.271978    
Zero hurdle model coefficients (censored poisson with log link):
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -4.199141   0.297442 -14.118  < 2e-16 ***
log_norm_yearbult              7.171127   1.353920   5.297 1.18e-07 ***
log_norm_sqft                  0.193720   0.049304   3.929 8.53e-05 ***
stories2                       0.060043   0.029713   2.021 0.043308 *  
stories3                      -0.374868   0.185353  -2.022 0.043129 *  
roofcdMEMBRANE                 0.151724   0.097626   1.554 0.120152    
roofcdMETAL                    0.218772   0.190619   1.148 0.251095    
roofcdOTHER                    0.321459   0.034833   9.229  < 2e-16 ***
roofcdTAR                      0.028400   0.080230   0.354 0.723353    
roofcdTILE                     0.089960   0.029292   3.071 0.002132 ** 
roofcdWOOD                     0.235801   0.098294   2.399 0.016443 *  
units2                        -0.473331   0.080761  -5.861 4.60e-09 ***
units3                        -0.187620   0.118080  -1.589 0.112078    
units4                         0.017997   0.083896   0.215 0.830145    
waterded5000                  -0.322724   0.193401  -1.669 0.095182 .  
waterded7500                  -2.176533   0.707300  -3.077 0.002089 ** 
waterded10000                 -1.522848   0.577797  -2.636 0.008399 ** 
protectionclass1               0.615051   0.153125   4.017 5.90e-05 ***
protectionclass2               0.593518   0.145377   4.083 4.45e-05 ***
protectionclass3               0.717597   0.145083   4.946 7.57e-07 ***
protectionclass4               0.801837   0.145333   5.517 3.44e-08 ***
protectionclass5               0.704298   0.153185   4.598 4.27e-06 ***
protectionclass6               0.536763   0.152798   3.513 0.000443 ***
protectionclass7               0.785782   0.404920   1.941 0.052309 .  
protectionclass8               0.113455   0.521606   0.218 0.827811    
protectionclass9              -0.839474   1.010467  -0.831 0.406099    
protectionclass10              0.286807   0.520708   0.551 0.581769    
constructioncdB                0.312008   0.092673   3.367 0.000761 ***
constructioncdF                0.309253   0.025721  12.023  < 2e-16 ***
constructioncdM                0.422180   0.088998   4.744 2.10e-06 ***
constructioncdOTHER            0.261364   0.107103   2.440 0.014675 *  
sprinklersystem1              -0.019024   0.213874  -0.089 0.929121    
sprinklersystem2              -0.321098   0.065195  -4.925 8.43e-07 ***
landlordind1                  -0.526001   0.065787  -7.996 1.29e-15 ***
firealarmtype1                 0.248930   0.022281  11.172  < 2e-16 ***
cova_limit200000              -0.252979   0.069421  -3.644 0.000268 ***
cova_limit300000              -0.016147   0.065923  -0.245 0.806511    
cova_limit400000               0.172814   0.068986   2.505 0.012242 *  
cova_limit500000               0.210685   0.074832   2.815 0.004871 ** 
cova_limit600000               0.249377   0.081429   3.063 0.002195 ** 
cova_limit700000               0.202338   0.090378   2.239 0.025170 *  
cova_limit800000               0.069833   0.103910   0.672 0.501547    
cova_limit900000               0.297562   0.114306   2.603 0.009236 ** 
cova_limit1000000              0.185616   0.137031   1.355 0.175562    
cova_limit1200000              0.094734   0.145070   0.653 0.513743    
cova_limit1300000              0.389688   0.143098   2.723 0.006465 ** 
log_norm_water_risk_fre_3_blk  0.455509   0.027658  16.469  < 2e-16 ***
appl_fail_3_blk1               0.002121   0.164232   0.013 0.989694    
appl_fail_3_blk2              -0.111077   0.172032  -0.646 0.518489    
appl_fail_3_blk3              -0.047778   0.170917  -0.280 0.779830    
appl_fail_3_blk4               0.051217   0.163911   0.312 0.754684    
appl_fail_3_blk5               0.034393   0.162698   0.211 0.832579    
fixture_leak_3_blk1           -0.246404   0.046815  -5.263 1.41e-07 ***
fixture_leak_3_blk2           -0.069345   0.039436  -1.758 0.078676 .  
fixture_leak_3_blk3           -0.099678   0.044651  -2.232 0.025589 *  
fixture_leak_3_blk4           -0.349377   0.065286  -5.351 8.72e-08 ***
fixture_leak_3_blk5           -0.603239   0.135515  -4.451 8.53e-06 ***
pipe_froze_3_blk1             -0.245244   0.049676  -4.937 7.94e-07 ***
pipe_froze_3_blk2             -0.327711   0.028815 -11.373  < 2e-16 ***
pipe_froze_3_blk3             -0.302004   0.038348  -7.875 3.40e-15 ***
pipe_froze_3_blk4             -0.239348   0.198956  -1.203 0.228968    
pipe_froze_3_blk5             -0.040042   0.150373  -0.266 0.790021    
plumb_leak_3_blk1              0.271433   0.166006   1.635 0.102032    
plumb_leak_3_blk2              0.180569   0.181710   0.994 0.320358    
plumb_leak_3_blk3              0.128201   0.176194   0.728 0.466850    
plumb_leak_3_blk4              0.358395   0.167576   2.139 0.032460 *  
plumb_leak_3_blk5              0.524825   0.169063   3.104 0.001907 ** 
waterh_fail_3_blk1             0.175223   0.040998   4.274 1.92e-05 ***
waterh_fail_3_blk2             0.106066   0.025883   4.098 4.17e-05 ***
waterh_fail_3_blk3             0.009240   0.042107   0.219 0.826317    
waterh_fail_3_blk4             0.075419   0.062155   1.213 0.224977    
waterh_fail_3_blk5             0.053614   0.088247   0.608 0.543488    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 145 
Log-likelihood: -5.322e+04 on 144 Df

Covariance matrix estimates are definetly better then in the complex model

Model comparison

In [18]:
lrtest(hurdlepoisson_sign,hurdlepoisson_all)
A anova: 2 × 5
#DfLogLikDfChisqPr(>Chisq)
<dbl><dbl><dbl><dbl><dbl>
1144-53220.55NA NA NA
2202-53165.7558109.58745.008976e-05
In [19]:
vuong(hurdlepoisson_sign,hurdlepoisson_all)
Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A    p-value
Raw                  -5.4443796 model2 > model1 2.5993e-08
AIC-corrected         0.2907136 model1 > model2    0.38564
BIC-corrected        33.9381659 model1 > model2 < 2.22e-16

The results is a little bit contraversial. According to Raw, the complex model is better then the nested one. However Significant features model is better because it has smaller AIC and BIC and it's significant.

In [20]:
models <- list("All Features" = hurdlepoisson_all,  "Sign Features"=hurdlepoisson_sign)
In [21]:
#Log likelyhood for all the models and DF
Models_LogLik <- rbind(logLik = sapply(models, function(x) round(logLik(x), digits = 0)),
    Df = sapply(models, function(x) attr(logLik(x), "df")))
Models_LogLik
A matrix: 2 × 2 of type dbl
All FeaturesSign Features
logLik-53166-53221
Df 202 144
In [22]:
#AIC
Models_AIC <- rbind(AIC = sapply(models, function(x) round(AIC(x), digits = 0)))
Models_AIC
A matrix: 1 × 2 of type dbl
All FeaturesSign Features
AIC106736106729
In [23]:
#BIC
Models_BIC <- rbind(BIC = sapply(models, function(x) round(BIC(x), digits = 0)))
Models_BIC
A matrix: 1 × 2 of type dbl
All FeaturesSign Features
BIC109106108419

Smaller AIC (BIC) suggests a better model. The model with significant features has a smaller AIC and BIC then the model with all features.

Coefficients means

In [24]:
sapply(models, function(x) coef(x)[1:20])
A matrix: 20 × 2 of type dbl
All FeaturesSign Features
count_(Intercept) -9.552883493 -4.86431971
count_log_norm_yearbult 14.263890096 9.32513182
count_log_norm_sqft 0.309109171 0.18682712
count_stories2 0.103796449 0.15034020
count_stories3-12.986247313-13.47388696
count_roofcdMEMBRANE 0.432395730 0.36505892
count_roofcdMETAL-14.113187432-13.12433174
count_roofcdOTHER 0.172480060 0.17839412
count_roofcdTAR -1.335201562 -1.25749299
count_roofcdTILE -0.009501178 0.02331574
count_roofcdWOOD-14.464079186-15.07422870
count_units2 -0.136411995 -0.13832285
count_units3 0.349517925 0.36815319
count_units4 -0.171460062 -0.19647523
count_occupancycdTENANT -0.226644231-13.46005562
count_waterded5000-13.606757623 -2.50406716
count_waterded7500 -3.266278748 -4.63934541
count_waterded10000 -5.248595427 -0.18232837
count_protectionclass1 -0.140688969 -0.64126493
count_protectionclass2 -0.667800887 -0.53935372

The models agree on the sign of coefficients

Rootograms for Assessing Goodness of Fit of Probability...

In [25]:
library(countreg)
Attaching package: ‘countreg’


The following objects are masked from ‘package:pscl’:

    hurdle, hurdle.control, hurdletest, zeroinfl, zeroinfl.control


In [26]:
par(mfrow = c(1, 2))
rootogram(hurdlepoisson_all,max = max(numclaims),main="All Features",style="hanging")
rootogram(hurdlepoisson_sign,max = max(numclaims),main="Significant Features",style="hanging")

The models look similar, without a deviance at number claims 2 point.

Predicting Number of claims

I use XGB to predict the final result based on probabilities of number claims in each row.

In [27]:
param0 <- list(
  "objective"  = "count:poisson"
  , "eval_metric" = "poisson-nloglik" 
)
In [28]:
#Adding  predicted probability values 
#to training and testing dataset for each numclaims
#as separate column

training_all <- predict(hurdlepoisson_all, training_data, type='prob')
testing_all <- predict(hurdlepoisson_all, testing_data, type='prob')

training_sign <- predict(hurdlepoisson_sign, training_data, type='prob')
testing_sign <- predict(hurdlepoisson_sign, testing_data, type='prob')

for(unique_value in numclaims){

training_data[paste("hurdlepoisson_all", unique_value, sep = ".")] <- training_all[,as.numeric(unique_value)+1]
testing_data[paste("hurdlepoisson_all", unique_value, sep = ".")] <- testing_all[,as.numeric(unique_value)+1]

training_data[paste("hurdlepoisson_sign", unique_value, sep = ".")] <- training_sign[,as.numeric(unique_value)+1]
testing_data[paste("hurdlepoisson_sign", unique_value, sep = ".")] <- testing_sign[,as.numeric(unique_value)+1]
    
}
In [33]:
#Features with probabilities listed by model
HP_all_features <- list();
HP_sign_features <- list();

i <- 1
for(unique_value in numclaims){
    HP_all_features[[i]] <- paste("hurdlepoisson_all", unique_value, sep = ".")
    HP_sign_features[[i]] <- paste("hurdlepoisson_sign", unique_value, sep = ".")

    i <- i + 1
}
In [34]:
xgHP_all_training = xgb.DMatrix(as.matrix(training_data[,unlist(HP_all_features)]), label = training_data$cova_ic_nc_water)
setinfo(xgHP_all_training, "base_margin", log(training_data$ecy))

xgHP_all_testing = xgb.DMatrix(as.matrix(testing_data[,unlist(HP_all_features)]), label = testing_data$cova_ic_nc_water)
setinfo(xgHP_all_testing, "base_margin", log(testing_data$ecy))
TRUE
TRUE
In [35]:
xgb = xgb.train(
  nrounds = 10000
  , params = param0
  , data = xgHP_all_training
)
training_data$hurdlepoisson_all = predict(xgb, xgHP_all_training)
testing_data$hurdlepoisson_all = predict(xgb, xgHP_all_testing)
In [36]:
xgHP_sign_training = xgb.DMatrix(as.matrix(training_data[,unlist(HP_sign_features)]), label = training_data$cova_ic_nc_water)
setinfo(xgHP_sign_training, "base_margin", log(training_data$ecy))

xgHP_sign_testing = xgb.DMatrix(as.matrix(testing_data[,unlist(HP_sign_features)]), label = testing_data$cova_ic_nc_water)
setinfo(xgHP_sign_testing, "base_margin", log(testing_data$ecy))
TRUE
TRUE
In [37]:
xgb = xgb.train(
  nrounds = 10000
  , params = param0
  , data = xgHP_sign_training
)
training_data$hurdlepoisson_sign = predict(xgb, xgHP_sign_training)
testing_data$hurdlepoisson_sign = predict(xgb, xgHP_sign_testing)

As an accuracy measure I use Normalized Weighted Gini:

In [40]:
Testing_Gini <- data.frame ("All" = NormalizedWeightedGini(testing_data$cova_ic_nc_water,testing_data$hurdlepoisson_all,testing_data$ecy)
                   ,"Significant Only" = NormalizedWeightedGini(testing_data$cova_ic_nc_water,testing_data$hurdlepoisson_sign,testing_data$ecy)
                   )
Testing_Gini
A data.frame: 1 × 2
AllSignificant.Only
<dbl><dbl>
0.22015320.2398513

Overall, the models are very close to each other. There is a difference in Normalized Weighted Gini which maybe significant and indicates Significant Features Only model is better.

K Fold Validation

TBD Does not work with Caret very strightforward and my manual approach is wrong somewhere

Conclusion

  • Normalization is needed to make the model work. I need to include normalized log columns in the datasets when it's extracted.
  • I convert all other columns into factors but not sure if it's needed.
  • I was not able to add into the model and test cova_deductible and log_norm_water_risk_sev_3_blk. It returns errors.
  • usagetype returns NA in covariance matrix for some levels. Should I include in the model only "good" levels?
  • Some predictors (like ordinanceorlawpct) are marked as significant but really it's a random component I feel.
  • Significant features model has smaller AIC and BIC and should be better then All features model. The Normalized Weighted Genin is less in SIgnificant features model but probably the difference is not significant. Need F test to confirm.