In [2]:
library(MASS)
library(lmtest)
library(pscl)
library(xgboost)
library(ggplot2)
library(caret)
Loading required package: zoo


Attaching package: ‘zoo’


The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric


Classes and Methods for R developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University
Simon Jackman
hurdle and zeroinfl functions by Achim Zeileis

Loading required package: lattice

In [3]:
source('/home/kate/code/Utils/MyFunctions.R')
In [4]:
ModelsDir <- '/home/kate/Research/Property/Models/'
UseSavedIfExists <- FALSE
In [5]:
training_data <- read.csv("/home/kate/Research/Property/Data/property_wcf_training.csv", header=TRUE)
testing_data <- read.csv("/home/kate/Research/Property/Data/property_wcf_testing.csv", header=TRUE)

NB work with factors and because some numerical columns are factors (ordered) I convert them in factors explicitly. I may need to do this for some other columns which are Yes/No.

In [67]:
training_data$appl_fail_3_blk <- as.factor( training_data$appl_fail_3_blk )
training_data$fixture_leak_3_blk <- as.factor( training_data$fixture_leak_3_blk )
training_data$pipe_froze_3_blk <- as.factor( training_data$pipe_froze_3_blk )
training_data$plumb_leak_3_blk <- as.factor( training_data$plumb_leak_3_blk )
training_data$rep_cost_3_blk <- as.factor( training_data$rep_cost_3_blk )
training_data$ustructure_fail_3_blk <- as.factor( training_data$ustructure_fail_3_blk )
training_data$waterh_fail_3_blk <- as.factor( training_data$waterh_fail_3_blk )
In [68]:
testing_data$appl_fail_3_blk <- as.factor( testing_data$appl_fail_3_blk )
testing_data$fixture_leak_3_blk <- as.factor( testing_data$fixture_leak_3_blk )
testing_data$pipe_froze_3_blk <- as.factor( testing_data$pipe_froze_3_blk )
testing_data$plumb_leak_3_blk <- as.factor( testing_data$plumb_leak_3_blk )
testing_data$rep_cost_3_blk <- as.factor( testing_data$rep_cost_3_blk )
testing_data$ustructure_fail_3_blk <- as.factor( testing_data$ustructure_fail_3_blk )
testing_data$waterh_fail_3_blk <- as.factor( testing_data$waterh_fail_3_blk )
In [73]:
max_yearbuilt <- max(training_data$yearbuilt)
max_sqft <- max(training_data$sqft)
max_water_risk_3_blk <- max(training_data$water_risk_3_blk)
max_water_risk_fre_3_blk <- max(training_data$water_risk_fre_3_blk)
max_water_risk_sev_3_blk <- max(training_data$water_risk_sev_3_blk)
In [74]:
training_data$log_norm_yearbult <- log(training_data$yearbuilt/max_yearbuilt)
training_data$log_norm_sqft <- log(training_data$sqft/max_sqft)
training_data$log_norm_water_risk_3_blk <- log(training_data$water_risk_3_blk/max_water_risk_3_blk) 
training_data$log_norm_water_risk_fre_3_blk <- log(training_data$water_risk_fre_3_blk/max_water_risk_fre_3_blk)
training_data$log_norm_water_risk_sev_3_blk <- log(training_data$sqft/max_water_risk_sev_3_blk)
In [75]:
testing_data$log_norm_yearbult <- log(testing_data$yearbuilt/max_yearbuilt)
testing_data$log_norm_sqft <- log(testing_data$sqft/max_sqft)
testing_data$log_norm_water_risk_3_blk <- log(testing_data$water_risk_3_blk/max_water_risk_3_blk) 
testing_data$log_norm_water_risk_fre_3_blk <- log(testing_data$water_risk_fre_3_blk/max_water_risk_fre_3_blk)
testing_data$log_norm_water_risk_sev_3_blk <- log(testing_data$sqft/max_water_risk_sev_3_blk)
In [83]:
training_data$stories <- as.factor(training_data$stories)
training_data$units <- as.factor(training_data$units)
training_data$secondaryresidence <- as.factor(training_data$secondaryresidence)
training_data$sprinklersystem <- as.factor(training_data$sprinklersystem)
training_data$firealarmtype <- as.factor(training_data$firealarmtype)
training_data$waterdetectiondevice <- as.factor(training_data$waterdetectiondevice)
training_data$propertymanager <- as.factor(training_data$propertymanager)
training_data$cova_deductible <- as.factor(training_data$cova_deductible)
training_data$cova_limit       <- as.factor(training_data$cova_limit      )
training_data$waterded <- as.factor(training_data$waterded)
training_data$protectionclass <- as.factor(training_data$protectionclass)
training_data$fire_risk_model_score <- as.factor(training_data$fire_risk_model_score)
training_data$ordinanceorlawpct <- as.factor(training_data$ordinanceorlawpct)
training_data$functionalreplacementcost <- as.factor(training_data$functionalreplacementcost)
training_data$homegardcreditind <- as.factor(training_data$homegardcreditind)
training_data$landlordind <- as.factor(training_data$landlordind)
training_data$rentersinsurance <- as.factor(training_data$rentersinsurance) 
training_data$burglaryalarmtype <- as.factor(training_data$burglaryalarmtype) 
training_data$propertymanager <- as.factor(training_data$propertymanager)
In [82]:
testing_data$stories <- as.factor(testing_data$stories)
testing_data$units <- as.factor(testing_data$units)
testing_data$secondaryresidence <- as.factor(testing_data$secondaryresidence)
testing_data$sprinklersystem <- as.factor(testing_data$sprinklersystem)
testing_data$firealarmtype <- as.factor(testing_data$firealarmtype)
testing_data$waterdetectiondevice <- as.factor(testing_data$waterdetectiondevice)
testing_data$propertymanager <- as.factor(testing_data$propertymanager)
testing_data$cova_deductible <- as.factor(testing_data$cova_deductible)
testing_data$cova_limit       <- as.factor(testing_data$cova_limit      )
testing_data$waterded <- as.factor(testing_data$waterded)
testing_data$protectionclass <- as.factor(testing_data$protectionclass)
testing_data$fire_risk_model_score <- as.factor(testing_data$fire_risk_model_score)
testing_data$ordinanceorlawpct <- as.factor(testing_data$ordinanceorlawpct)
testing_data$functionalreplacementcost <- as.factor(testing_data$functionalreplacementcost)
testing_data$homegardcreditind <- as.factor(testing_data$homegardcreditind)
testing_data$landlordind <- as.factor(testing_data$landlordind)
testing_data$rentersinsurance <- as.factor(testing_data$rentersinsurance) 
testing_data$burglaryalarmtype <- as.factor(testing_data$burglaryalarmtype) 
testing_data$propertymanager <- as.factor(testing_data$propertymanager)
In [69]:
#unique numclaims values
numclaims <- unique(training_data$cova_ic_nc_water)

Modeling

In [121]:
formula_all <-  cova_ic_nc_water ~  
log_norm_yearbult + 
log_norm_sqft + 
stories + 
roofcd + 
units + 
occupancycd + 
waterded + 
protectionclass + 
constructioncd + 
fire_risk_model_score + 
usagetype + 
secondaryresidence + 
ordinanceorlawpct + 
functionalreplacementcost + 
homegardcreditind + 
sprinklersystem + 
landlordind + 
rentersinsurance + 
firealarmtype + 
burglaryalarmtype + 
waterdetectiondevice + 
propertymanager + 
#cova_deductible + 
cova_limit + 
log_norm_water_risk_3_blk + 
log_norm_water_risk_fre_3_blk + 
#log_norm_water_risk_sev_3_blk + 
appl_fail_3_blk + 
fixture_leak_3_blk + 
pipe_froze_3_blk + 
plumb_leak_3_blk + 
rep_cost_3_blk + 
ustructure_fail_3_blk + 
waterh_fail_3_blk

Poisson models

In [122]:
ModelFile <- paste(ModelsDir,"zip_all.rds",sep="")
if(file.exists(ModelFile) && UseSavedIfExists){
    zip_all <- readRDS(ModelFile)
    } else {
    zip_all <- zeroinfl(formula_all,  data = training_data,  offset=log_ecy,dist = "poisson",link= "logit")
    saveRDS(zip_all, ModelFile)
    }
summary(zip_all)
Warning message in sqrt(diag(object$vcov)):
“NaNs produced”
Call:
zeroinfl(formula = formula_all, data = training_data, offset = log_ecy, 
    dist = "poisson", link = "logit")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-0.29267 -0.11977 -0.09309 -0.06690 61.52197 

Count model coefficients (poisson with log link):
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    -7.31873    1.22697  -5.965 2.45e-09 ***
log_norm_yearbult             -56.71229    3.54539 -15.996  < 2e-16 ***
log_norm_sqft                  -0.40078    0.10413  -3.849 0.000119 ***
stories2                        0.17597    0.05491   3.205 0.001352 ** 
stories3                       -0.54146    0.28731  -1.885 0.059488 .  
roofcdMEMBRANE                 -0.27089    0.20679  -1.310 0.190201    
roofcdMETAL                    -0.07934    0.52195  -0.152 0.879187    
roofcdOTHER                     0.11623    0.08653   1.343 0.179180    
roofcdTAR                       0.05137    0.19747   0.260 0.794752    
roofcdTILE                     -0.07302    0.06212  -1.175 0.239803    
roofcdWOOD                      0.07966    0.24461   0.326 0.744684    
units2                         -0.26020    0.20830  -1.249 0.211609    
units3                         -0.32866    0.27157  -1.210 0.226202    
units4                         -0.16803    0.18568  -0.905 0.365492    
occupancycdTENANT              -0.07485    0.10280  -0.728 0.466545    
waterded5000                   -0.04551    0.37659  -0.121 0.903819    
waterded7500                   -2.69202    0.91011  -2.958 0.003097 ** 
waterded10000                  -1.17369    0.97035  -1.210 0.226448    
protectionclass1               -0.24224    0.32132  -0.754 0.450917    
protectionclass2               -0.34781    0.30477  -1.141 0.253773    
protectionclass3               -0.24955    0.30433  -0.820 0.412217    
protectionclass4               -0.18520    0.30329  -0.611 0.541432    
protectionclass5               -0.30477    0.31522  -0.967 0.333620    
protectionclass6               -0.15907    0.31933  -0.498 0.618398    
protectionclass7                0.62876    1.06324   0.591 0.554277    
protectionclass8               -0.49817    1.46219  -0.341 0.733329    
protectionclass9                3.90360    2.50116   1.561 0.118591    
protectionclass10              -0.77296    0.74511  -1.037 0.299560    
constructioncdB                -0.27999    0.25303  -1.107 0.268482    
constructioncdF                 0.19838    0.05153   3.850 0.000118 ***
constructioncdM                 0.05752    0.23958   0.240 0.810274    
constructioncdOTHER             0.50633    0.25718   1.969 0.048979 *  
usagetypePRIMARY                4.30949    1.07677   4.002 6.27e-05 ***
usagetypeRENTAL                 3.98270    1.07778   3.695 0.000220 ***
usagetypeSEASONAL               4.42867    1.15877   3.822 0.000132 ***
usagetypeSECONDARY              1.47665    1.22213   1.208 0.226949    
usagetypeUNOCCUPIED            -6.20719         NA      NA       NA    
usagetypeVACANT                -0.43766    1.98476  -0.221 0.825476    
secondaryresidence1             3.04069    1.19795   2.538 0.011141 *  
ordinanceorlawpct10             0.21973    0.05183   4.239 2.24e-05 ***
ordinanceorlawpct15             1.86276    1.48947   1.251 0.211074    
ordinanceorlawpct20             0.09222    0.06813   1.354 0.175830    
ordinanceorlawpct25            -0.01232    0.23985  -0.051 0.959033    
ordinanceorlawpct40            -0.69823    0.31092  -2.246 0.024722 *  
ordinanceorlawpct50            -0.19859    0.38996  -0.509 0.610578    
ordinanceorlawpct65            -0.04354    0.93713  -0.046 0.962940    
ordinanceorlawpct75            -1.34099    0.58765  -2.282 0.022493 *  
ordinanceorlawpct90             0.17654    0.51638   0.342 0.732437    
ordinanceorlawpct100           -0.03015    0.70837  -0.043 0.966045    
functionalreplacementcost1     -1.76830    1.85000  -0.956 0.339156    
homegardcreditind1              0.06342    0.06171   1.028 0.304086    
sprinklersystem1                0.52001    0.40675   1.278 0.201086    
sprinklersystem2               -0.10968    0.09627  -1.139 0.254582    
landlordind1                   -0.48360    0.12481  -3.875 0.000107 ***
rentersinsurance1              -0.17171    0.41184  -0.417 0.676724    
firealarmtype1                  0.15066    0.04953   3.042 0.002351 ** 
burglaryalarmtype1             -0.08823    0.04928  -1.791 0.073367 .  
waterdetectiondevice1           3.41948    4.25572   0.804 0.421686    
propertymanager1               -0.02879    0.19552  -0.147 0.882946    
cova_limit200000               -0.31158    0.16309  -1.911 0.056068 .  
cova_limit300000               -0.05446    0.15512  -0.351 0.725534    
cova_limit400000                0.20398    0.15892   1.284 0.199301    
cova_limit500000                0.36652    0.16988   2.158 0.030965 *  
cova_limit600000                0.45044    0.18206   2.474 0.013357 *  
cova_limit700000                0.42091    0.19512   2.157 0.030993 *  
cova_limit800000                0.35908    0.21649   1.659 0.097194 .  
cova_limit900000                0.54088    0.22628   2.390 0.016833 *  
cova_limit1000000               0.68646    0.28232   2.432 0.015036 *  
cova_limit1200000               0.19565    0.26319   0.743 0.457246    
cova_limit1300000               0.80050    0.26912   2.974 0.002935 ** 
log_norm_water_risk_3_blk       0.32726    0.12141   2.695 0.007029 ** 
log_norm_water_risk_fre_3_blk   0.27580    0.12208   2.259 0.023874 *  
appl_fail_3_blk1               -0.31597    0.34621  -0.913 0.361430    
appl_fail_3_blk2               -0.30617    0.35919  -0.852 0.393989    
appl_fail_3_blk3               -0.26945    0.35868  -0.751 0.452513    
appl_fail_3_blk4               -0.26168    0.34489  -0.759 0.448021    
appl_fail_3_blk5               -0.31839    0.34197  -0.931 0.351834    
fixture_leak_3_blk1            -0.24124    0.08355  -2.887 0.003885 ** 
fixture_leak_3_blk2            -0.12056    0.06455  -1.868 0.061785 .  
fixture_leak_3_blk3            -0.19600    0.07651  -2.562 0.010411 *  
fixture_leak_3_blk4            -0.42731    0.12387  -3.450 0.000561 ***
fixture_leak_3_blk5            -0.55453    0.29294  -1.893 0.058365 .  
pipe_froze_3_blk1              -0.28240    0.08544  -3.305 0.000949 ***
pipe_froze_3_blk2              -0.35691    0.05679  -6.285 3.28e-10 ***
pipe_froze_3_blk3              -0.26906    0.07055  -3.814 0.000137 ***
pipe_froze_3_blk4              -1.44498    0.27734  -5.210 1.89e-07 ***
pipe_froze_3_blk5               0.33229    0.38893   0.854 0.392905    
plumb_leak_3_blk1               0.08859    0.27154   0.326 0.744243    
plumb_leak_3_blk2              -0.10090    0.29556  -0.341 0.732806    
plumb_leak_3_blk3              -0.12406    0.28823  -0.430 0.666877    
plumb_leak_3_blk4               0.26155    0.27514   0.951 0.341792    
plumb_leak_3_blk5               0.39500    0.27935   1.414 0.157370    
rep_cost_3_blk1                 0.14523    0.36645   0.396 0.691869    
rep_cost_3_blk2                 0.53608    0.65796   0.815 0.415209    
rep_cost_3_blk3                -0.15121    0.43624  -0.347 0.728872    
rep_cost_3_blk4                 0.10714    0.35012   0.306 0.759593    
rep_cost_3_blk5                -0.01154    0.34176  -0.034 0.973060    
ustructure_fail_3_blk1         -0.02207    0.16967  -0.130 0.896520    
ustructure_fail_3_blk2          0.09051    0.17287   0.524 0.600581    
ustructure_fail_3_blk3         -0.16129    0.19000  -0.849 0.395931    
ustructure_fail_3_blk4         -0.04218    0.17097  -0.247 0.805144    
ustructure_fail_3_blk5         -0.01616    0.16864  -0.096 0.923647    
waterh_fail_3_blk1              0.05696    0.07424   0.767 0.442969    
waterh_fail_3_blk2              0.08110    0.05013   1.618 0.105681    
waterh_fail_3_blk3              0.08974    0.07791   1.152 0.249396    
waterh_fail_3_blk4              0.14797    0.11353   1.303 0.192450    
waterh_fail_3_blk5              0.18004    0.16461   1.094 0.274066    

Zero-inflation model coefficients (binomial with logit link):
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -5.185e+00  2.571e+00  -2.017 0.043709 *  
log_norm_yearbult             -1.198e+02  5.609e+00 -21.351  < 2e-16 ***
log_norm_sqft                 -1.459e+00  1.981e-01  -7.367 1.75e-13 ***
stories2                       1.676e-01  1.326e-01   1.264 0.206087    
stories3                      -7.485e-01  6.090e-01  -1.229 0.219096    
roofcdMEMBRANE                -6.131e-01  4.101e-01  -1.495 0.134949    
roofcdMETAL                   -3.652e-01  1.048e+00  -0.349 0.727463    
roofcdOTHER                   -1.277e-01  1.517e-01  -0.842 0.400027    
roofcdTAR                     -4.298e-02  3.004e-01  -0.143 0.886226    
roofcdTILE                    -6.520e-01  1.494e-01  -4.365 1.27e-05 ***
roofcdWOOD                    -4.003e-03  4.469e-01  -0.009 0.992853    
units2                         1.064e-01  3.091e-01   0.344 0.730724    
units3                        -3.981e-01  4.462e-01  -0.892 0.372261    
units4                        -3.040e-01  3.486e-01  -0.872 0.383227    
occupancycdTENANT             -5.566e-01  1.933e-01  -2.880 0.003976 ** 
waterded5000                   6.649e-01  6.992e-01   0.951 0.341592    
waterded7500                  -1.629e+00  2.218e+00  -0.735 0.462517    
waterded10000                  7.077e-01  1.564e+00   0.452 0.650934    
protectionclass1              -5.489e-01  5.926e-01  -0.926 0.354293    
protectionclass2              -6.809e-01  5.653e-01  -1.205 0.228353    
protectionclass3              -6.115e-01  5.659e-01  -1.081 0.279871    
protectionclass4              -6.452e-01  5.653e-01  -1.141 0.253737    
protectionclass5              -7.029e-01  5.990e-01  -1.173 0.240642    
protectionclass6              -4.090e-01  6.012e-01  -0.680 0.496347    
protectionclass7               9.709e-01  1.702e+00   0.571 0.568278    
protectionclass8               1.342e-01  2.894e+00   0.046 0.962998    
protectionclass9               7.057e+00  2.378e+00   2.967 0.003004 ** 
protectionclass10             -9.931e-01  1.728e+00  -0.575 0.565405    
constructioncdB               -1.120e+00  5.385e-01  -2.079 0.037611 *  
constructioncdF               -1.010e-01  1.095e-01  -0.922 0.356417    
constructioncdM               -6.139e-01  4.983e-01  -1.232 0.217941    
constructioncdOTHER            4.264e-01  4.421e-01   0.964 0.334828    
usagetypePRIMARY               2.599e+00  2.267e+00   1.146 0.251664    
usagetypeRENTAL                2.844e+00  2.268e+00   1.254 0.209935    
usagetypeSEASONAL              3.155e+00  2.393e+00   1.318 0.187371    
usagetypeSECONDARY            -1.206e+01         NA      NA       NA    
usagetypeUNOCCUPIED            6.577e+00         NA      NA       NA    
usagetypeVACANT                1.636e+00  3.485e+00   0.470 0.638667    
secondaryresidence1            1.487e+01         NA      NA       NA    
ordinanceorlawpct10            2.470e-01  1.074e-01   2.299 0.021485 *  
ordinanceorlawpct15            3.400e+00  1.693e+00   2.008 0.044678 *  
ordinanceorlawpct20            1.274e-01  1.507e-01   0.845 0.397930    
ordinanceorlawpct25           -1.694e-01  4.871e-01  -0.348 0.728039    
ordinanceorlawpct40           -2.855e+00  1.219e+00  -2.341 0.019232 *  
ordinanceorlawpct50           -6.663e-01  8.850e-01  -0.753 0.451508    
ordinanceorlawpct65            2.300e-01  1.837e+00   0.125 0.900334    
ordinanceorlawpct75           -1.195e+01         NA      NA       NA    
ordinanceorlawpct90            4.328e-01  9.591e-01   0.451 0.651813    
ordinanceorlawpct100           8.424e-02  1.592e+00   0.053 0.957796    
functionalreplacementcost1    -1.493e+00  2.273e+00  -0.657 0.511311    
homegardcreditind1             2.167e-01  1.275e-01   1.700 0.089147 .  
sprinklersystem1               1.347e+00  8.476e-01   1.590 0.111897    
sprinklersystem2               2.555e-01  2.914e-01   0.877 0.380436    
landlordind1                  -1.113e-01  2.658e-01  -0.419 0.675402    
rentersinsurance1             -7.704e-01  1.068e+00  -0.721 0.470649    
firealarmtype1                -4.193e-02  1.040e-01  -0.403 0.686736    
burglaryalarmtype1            -1.723e-01  1.124e-01  -1.533 0.125340    
waterdetectiondevice1          5.151e+00  4.013e+00   1.284 0.199310    
propertymanager1              -6.440e-01  4.504e-01  -1.430 0.152797    
cova_limit200000               3.232e-01  3.082e-01   1.049 0.294356    
cova_limit300000               5.006e-01  2.967e-01   1.688 0.091485 .  
cova_limit400000               7.271e-01  3.062e-01   2.375 0.017569 *  
cova_limit500000               1.024e+00  3.306e-01   3.096 0.001960 ** 
cova_limit600000               1.153e+00  3.600e-01   3.202 0.001365 ** 
cova_limit700000               1.111e+00  3.957e-01   2.807 0.005004 ** 
cova_limit800000               1.282e+00  4.504e-01   2.846 0.004426 ** 
cova_limit900000               1.176e+00  4.741e-01   2.480 0.013152 *  
cova_limit1000000              1.840e+00  5.898e-01   3.120 0.001808 ** 
cova_limit1200000              6.140e-01  5.804e-01   1.058 0.290165    
cova_limit1300000              1.563e+00  5.889e-01   2.654 0.007951 ** 
log_norm_water_risk_3_blk      9.028e-01  2.477e-01   3.644 0.000268 ***
log_norm_water_risk_fre_3_blk -4.861e-01  2.414e-01  -2.014 0.044036 *  
appl_fail_3_blk1              -5.686e-01  6.063e-01  -0.938 0.348406    
appl_fail_3_blk2              -4.443e-01  6.244e-01  -0.712 0.476729    
appl_fail_3_blk3              -4.551e-01  6.284e-01  -0.724 0.468937    
appl_fail_3_blk4              -5.243e-01  6.062e-01  -0.865 0.387077    
appl_fail_3_blk5              -5.858e-01  5.999e-01  -0.977 0.328773    
fixture_leak_3_blk1            6.042e-02  1.848e-01   0.327 0.743681    
fixture_leak_3_blk2            2.233e-02  1.516e-01   0.147 0.882893    
fixture_leak_3_blk3           -9.331e-02  1.721e-01  -0.542 0.587666    
fixture_leak_3_blk4           -1.418e-01  2.580e-01  -0.549 0.582708    
fixture_leak_3_blk5            3.991e-02  5.956e-01   0.067 0.946574    
pipe_froze_3_blk1             -3.022e-01  2.288e-01  -1.321 0.186562    
pipe_froze_3_blk2             -1.338e-01  1.133e-01  -1.181 0.237543    
pipe_froze_3_blk3             -7.760e-02  1.469e-01  -0.528 0.597271    
pipe_froze_3_blk4             -4.243e+00  1.617e+00  -2.623 0.008717 ** 
pipe_froze_3_blk5              1.024e+00  7.231e-01   1.416 0.156910    
plumb_leak_3_blk1             -3.666e-01  6.314e-01  -0.581 0.561524    
plumb_leak_3_blk2             -5.751e-01  6.653e-01  -0.864 0.387354    
plumb_leak_3_blk3             -5.697e-01  6.567e-01  -0.868 0.385640    
plumb_leak_3_blk4             -7.779e-02  6.376e-01  -0.122 0.902892    
plumb_leak_3_blk5              5.129e-02  6.475e-01   0.079 0.936869    
rep_cost_3_blk1                1.339e-01  9.089e-01   0.147 0.882914    
rep_cost_3_blk2                1.380e+00  1.262e+00   1.094 0.274078    
rep_cost_3_blk3               -7.723e-01  1.117e+00  -0.692 0.489116    
rep_cost_3_blk4                1.724e-01  8.860e-01   0.195 0.845725    
rep_cost_3_blk5               -1.821e-01  8.696e-01  -0.209 0.834153    
ustructure_fail_3_blk1        -1.695e-01  4.649e-01  -0.365 0.715459    
ustructure_fail_3_blk2        -8.260e-02  4.802e-01  -0.172 0.863419    
ustructure_fail_3_blk3        -1.927e-01  5.151e-01  -0.374 0.708249    
ustructure_fail_3_blk4        -7.013e-02  4.638e-01  -0.151 0.879805    
ustructure_fail_3_blk5         4.065e-02  4.559e-01   0.089 0.928954    
waterh_fail_3_blk1            -3.068e-01  1.791e-01  -1.713 0.086712 .  
waterh_fail_3_blk2             4.334e-02  9.900e-02   0.438 0.661562    
waterh_fail_3_blk3             2.682e-01  1.654e-01   1.621 0.104956    
waterh_fail_3_blk4             2.977e-01  2.819e-01   1.056 0.290957    
waterh_fail_3_blk5             5.012e-01  4.122e-01   1.216 0.224007    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 223 
Log-likelihood: -5.145e+04 on 214 Df

Normalization is needed to make the model work. I was not able to add into the model and test cova_deductible and log_norm_water_risk_sev_3_blk. It returns errors. usagetype returns NA in covariance matrix for some levels. Some predictors (like ordinanceorlawpct) are marked as significant but really it's a random component I feel.

In [124]:
formula_sign <-  cova_ic_nc_water ~  
log_norm_yearbult + 
log_norm_sqft + 
stories + 
waterded + 
constructioncd +  
usagetype + 
landlordind + 
firealarmtype + 
cova_limit + 
log_norm_water_risk_3_blk + 
log_norm_water_risk_fre_3_blk + 
fixture_leak_3_blk + 
pipe_froze_3_blk |
log_norm_yearbult +
log_norm_sqft +
roofcd + 
occupancycd +
protectionclass +
constructioncd +
cova_limit + 
log_norm_water_risk_3_blk +
log_norm_water_risk_fre_3_blk +
pipe_froze_3_blk
In [125]:
ModelFile <- paste(ModelsDir,"zip_sign.rds",sep="")
if(file.exists(ModelFile) && UseSavedIfExists){
    zip_sign <- readRDS(ModelFile)
    } else {
    zip_sign <- zeroinfl(formula_sign,  data = training_data,  offset=log_ecy,dist = "poisson",link= "logit")
    saveRDS(zip_sign, ModelFile)
    }
summary(zip_sign)
Call:
zeroinfl(formula = formula_sign, data = training_data, offset = log_ecy, 
    dist = "poisson", link = "logit")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-0.28437 -0.12035 -0.09250 -0.06738 65.46562 

Count model coefficients (poisson with log link):
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -4.069091   0.597702  -6.808 9.90e-12 ***
log_norm_yearbult             41.451373   2.664522  15.557  < 2e-16 ***
log_norm_sqft                  0.546648   0.062151   8.796  < 2e-16 ***
stories2                       0.118883   0.030172   3.940 8.14e-05 ***
stories3                      -0.303811   0.185689  -1.636 0.101814    
waterded5000                  -0.334142   0.193748  -1.725 0.084596 .  
waterded7500                  -2.161977   0.707641  -3.055 0.002249 ** 
waterded10000                 -1.505610   0.578050  -2.605 0.009197 ** 
constructioncdB                0.501656   0.141726   3.540 0.000401 ***
constructioncdF                0.332390   0.031185  10.659  < 2e-16 ***
constructioncdM                0.449169   0.103787   4.328 1.51e-05 ***
constructioncdOTHER            0.212652   0.127138   1.673 0.094403 .  
usagetypePRIMARY               2.937792   0.578169   5.081 3.75e-07 ***
usagetypeRENTAL                2.525265   0.578451   4.366 1.27e-05 ***
usagetypeSEASONAL              2.812475   0.603334   4.662 3.14e-06 ***
usagetypeSECONDARY             1.694024   0.731192   2.317 0.020515 *  
usagetypeUNOCCUPIED           -6.577947  80.593119  -0.082 0.934950    
usagetypeVACANT               -1.312806   1.154774  -1.137 0.255601    
landlordind1                  -0.460655   0.066094  -6.970 3.18e-12 ***
firealarmtype1                 0.167663   0.022187   7.557 4.13e-14 ***
cova_limit200000              -0.422208   0.083523  -5.055 4.30e-07 ***
cova_limit300000              -0.270028   0.078805  -3.427 0.000611 ***
cova_limit400000              -0.149937   0.083134  -1.804 0.071300 .  
cova_limit500000              -0.169353   0.091549  -1.850 0.064334 .  
cova_limit600000              -0.209909   0.102440  -2.049 0.040454 *  
cova_limit700000              -0.220865   0.115595  -1.911 0.056046 .  
cova_limit800000              -0.430982   0.133849  -3.220 0.001282 ** 
cova_limit900000              -0.282331   0.149291  -1.891 0.058606 .  
cova_limit1000000             -0.539027   0.176270  -3.058 0.002228 ** 
cova_limit1200000             -0.307735   0.188825  -1.630 0.103158    
cova_limit1300000             -0.230033   0.198802  -1.157 0.247235    
log_norm_water_risk_3_blk     -0.001452   0.073412  -0.020 0.984219    
log_norm_water_risk_fre_3_blk  0.350611   0.070047   5.005 5.58e-07 ***
fixture_leak_3_blk1           -0.198714   0.040500  -4.907 9.27e-07 ***
fixture_leak_3_blk2           -0.060470   0.035463  -1.705 0.088164 .  
fixture_leak_3_blk3           -0.097936   0.039335  -2.490 0.012781 *  
fixture_leak_3_blk4           -0.257662   0.057176  -4.506 6.59e-06 ***
fixture_leak_3_blk5           -0.457073   0.125066  -3.655 0.000258 ***
pipe_froze_3_blk1             -0.303095   0.071872  -4.217 2.47e-05 ***
pipe_froze_3_blk2             -0.279586   0.034515  -8.100 5.47e-16 ***
pipe_froze_3_blk3             -0.252991   0.049301  -5.132 2.87e-07 ***
pipe_froze_3_blk4             -0.631999   0.201340  -3.139 0.001695 ** 
pipe_froze_3_blk5             -0.382666   0.201246  -1.901 0.057238 .  

Zero-inflation model coefficients (binomial with logit link):
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    3.968e+00  7.943e-01   4.995 5.87e-07 ***
log_norm_yearbult              3.069e+02  2.876e+01  10.671  < 2e-16 ***
log_norm_sqft                  1.407e+00  2.244e-01   6.267 3.67e-10 ***
roofcdMEMBRANE                 4.205e-01  4.085e-01   1.029 0.303248    
roofcdMETAL                   -7.075e-02  9.753e-01  -0.073 0.942171    
roofcdOTHER                   -4.825e-01  2.080e-01  -2.320 0.020349 *  
roofcdTAR                     -1.337e-02  4.638e-01  -0.029 0.976999    
roofcdTILE                    -6.622e-02  1.244e-01  -0.532 0.594453    
roofcdWOOD                    -7.364e-02  8.041e-01  -0.092 0.927033    
occupancycdTENANT             -2.105e-01  1.934e-01  -1.089 0.276228    
protectionclass1              -1.060e-01  6.232e-01  -0.170 0.864956    
protectionclass2               4.916e-02  5.880e-01   0.084 0.933364    
protectionclass3               8.753e-03  5.869e-01   0.015 0.988101    
protectionclass4              -2.163e-01  5.855e-01  -0.370 0.711743    
protectionclass5               2.200e-01  6.046e-01   0.364 0.715942    
protectionclass6               1.702e-01  6.129e-01   0.278 0.781243    
protectionclass7              -2.599e+00  3.460e+00  -0.751 0.452570    
protectionclass8               2.665e+00  1.900e+00   1.403 0.160759    
protectionclass9               3.248e+01  4.093e+04   0.001 0.999367    
protectionclass10              2.615e-01  1.266e+00   0.207 0.836377    
constructioncdB                2.726e+00  6.465e-01   4.216 2.48e-05 ***
constructioncdF                3.732e-01  1.022e-01   3.650 0.000262 ***
constructioncdM                6.263e-01  4.966e-01   1.261 0.207226    
constructioncdOTHER           -6.286e-01  6.803e-01  -0.924 0.355512    
cova_limit200000              -1.449e-01  3.583e-01  -0.405 0.685786    
cova_limit300000              -8.257e-02  3.319e-01  -0.249 0.803516    
cova_limit400000              -2.536e-01  3.402e-01  -0.745 0.455994    
cova_limit500000              -6.135e-01  3.616e-01  -1.697 0.089741 .  
cova_limit600000              -9.616e-01  3.915e-01  -2.456 0.014050 *  
cova_limit700000              -9.126e-01  4.227e-01  -2.159 0.030848 *  
cova_limit800000              -1.217e+00  4.669e-01  -2.608 0.009120 ** 
cova_limit900000              -1.424e+00  5.297e-01  -2.688 0.007197 ** 
cova_limit1000000             -2.234e+00  6.643e-01  -3.364 0.000769 ***
cova_limit1200000             -8.592e-01  5.810e-01  -1.479 0.139177    
cova_limit1300000             -1.736e+00  6.480e-01  -2.679 0.007387 ** 
log_norm_water_risk_3_blk     -2.790e-01  2.775e-01  -1.005 0.314819    
log_norm_water_risk_fre_3_blk  8.687e-02  2.723e-01   0.319 0.749676    
pipe_froze_3_blk1             -1.536e-01  2.187e-01  -0.702 0.482502    
pipe_froze_3_blk2              3.490e-01  1.260e-01   2.769 0.005624 ** 
pipe_froze_3_blk3              2.152e-01  1.596e-01   1.348 0.177533    
pipe_froze_3_blk4             -4.649e+01  4.094e+04  -0.001 0.999094    
pipe_froze_3_blk5             -8.637e-01  8.205e-01  -1.053 0.292486    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 169 
Log-likelihood: -5.15e+04 on 85 Df

Covariance matrix estimates are definetly better then in the complex model

Model comparison

In [126]:
lrtest(zip_sign,zip_all)
A anova: 2 × 5
#DfLogLikDfChisqPr(>Chisq)
<dbl><dbl><dbl><dbl><dbl>
1 85-51504.44 NA NA NA
2214-51453.35129102.16710.9608432
In [127]:
vuong(zip_sign,zip_all)
Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A    p-value
Raw                   -1.796934 model2 > model1  0.0361731
AIC-corrected          2.740819 model1 > model2  0.0030643
BIC-corrected         29.363543 model1 > model2 < 2.22e-16

The results is a little bit contraversial. According to Raw, the complex model is better then the nested one. However Significant features model is better because it has smaller AIC and BIC and it's significant.

In [128]:
models <- list("All Features" = zip_all,  "Sign Features"=zip_sign)
In [129]:
#Log likelyhood for all the models and DF
Models_LogLik <- rbind(logLik = sapply(models, function(x) round(logLik(x), digits = 0)),
    Df = sapply(models, function(x) attr(logLik(x), "df")))
Models_LogLik
A matrix: 2 × 2 of type dbl
All FeaturesSign Features
logLik-51453-51504
Df 214 85
In [130]:
#AIC
Models_AIC <- rbind(AIC = sapply(models, function(x) round(AIC(x), digits = 0)))
Models_AIC
A matrix: 1 × 2 of type dbl
All FeaturesSign Features
AIC103335103179
In [145]:
#BIC
Models_BIC <- rbind(BIC = sapply(models, function(x) round(BIC(x), digits = 0)))
Models_BIC
A matrix: 1 × 2 of type dbl
All FeaturesSign Features
BIC105846104176

Smaller AIC (BIC) suggests a better model. The model with significant features has a smaller AIC and BIC then the model with only significant features.

Coefficients means

In [132]:
sapply(models, function(x) coef(x)[1:20])
A matrix: 20 × 2 of type dbl
All FeaturesSign Features
count_(Intercept) -7.31872566-4.0690906
count_log_norm_yearbult-56.7122904041.4513732
count_log_norm_sqft -0.40078445 0.5466477
count_stories2 0.17596772 0.1188826
count_stories3 -0.54145745-0.3038107
count_roofcdMEMBRANE -0.27089129-0.3341417
count_roofcdMETAL -0.07933711-2.1619774
count_roofcdOTHER 0.11623174-1.5056096
count_roofcdTAR 0.05137161 0.5016561
count_roofcdTILE -0.07301558 0.3323903
count_roofcdWOOD 0.07965761 0.4491690
count_units2 -0.26019785 0.2126523
count_units3 -0.32865973 2.9377920
count_units4 -0.16802906 2.5252654
count_occupancycdTENANT -0.07484673 2.8124748
count_waterded5000 -0.04550602 1.6940236
count_waterded7500 -2.69201946-6.5779473
count_waterded10000 -1.17369011-1.3128057
count_protectionclass1 -0.24223724-0.4606550
count_protectionclass2 -0.34781444 0.1676632

Sometimes the sign of the mean is different in the models

Rootograms for Assessing Goodness of Fit of Probability...

In [133]:
library(countreg)
In [134]:
par(mfrow = c(1, 2))
rootogram(zip_all,max = max(numclaims),main="All Features",style="hanging")
rootogram(zip_sign,max = max(numclaims),main="Significant Features",style="hanging")

All features model has deviance on numebr claims 2 and looks like more for only Significant features model

Predicting Number of claims

I use XGB to predict the final result based on probabilities of number claims in each row.

In [135]:
param0 <- list(
  "objective"  = "count:poisson"
  , "eval_metric" = "poisson-nloglik" 
)
In [136]:
#Adding  predicted probability values 
#to training and testing dataset for each numclaims
#as separate column

training_all <- predict(zip_all, training_data, type='prob')
testing_all <- predict(zip_all, testing_data, type='prob')

training_sign <- predict(zip_sign, training_data, type='prob')
testing_sign <- predict(zip_sign, testing_data, type='prob')

for(unique_value in numclaims){

training_data[paste("zip_all", unique_value, sep = ".")] <- training_all[,as.numeric(unique_value)+1]
testing_data[paste("zip_all", unique_value, sep = ".")] <- testing_all[,as.numeric(unique_value)+1]

training_data[paste("zip_sign", unique_value, sep = ".")] <- training_sign[,as.numeric(unique_value)+1]
testing_data[paste("zip_sign", unique_value, sep = ".")] <- testing_sign[,as.numeric(unique_value)+1]
    
}
In [137]:
#Features with probabilities listed by model
zip_all_features <- list();
zip_sign_features <- list();

i <- 1
for(unique_value in numclaims){
    zip_all_features[[i]] <- paste("zip_all", unique_value, sep = ".")
    zip_sign_features[[i]] <- paste("zip_sign", unique_value, sep = ".")

    i <- i + 1
}
In [138]:
xgZIP_all_training = xgb.DMatrix(as.matrix(training_data[,unlist(zip_all_features)]), label = training_data$cova_ic_nc_water)
setinfo(xgZIP_all_training, "base_margin", log(training_data$ecy))

xgZIP_all_testing = xgb.DMatrix(as.matrix(testing_data[,unlist(zip_all_features)]), label = testing_data$cova_ic_nc_water)
setinfo(xgZIP_all_testing, "base_margin", log(testing_data$ecy))
TRUE
TRUE
In [139]:
xgb = xgb.train(
  nrounds = 10000
  , params = param0
  , data = xgZIP_all_training
)
training_data$zip_all = predict(xgb, xgZIP_all_training)
testing_data$zip_all = predict(xgb, xgZIP_all_testing)
In [140]:
xgZIP_sign_training = xgb.DMatrix(as.matrix(training_data[,unlist(zip_sign_features)]), label = training_data$cova_ic_nc_water)
setinfo(xgZIP_sign_training, "base_margin", log(training_data$ecy))

xgZIP_sign_testing = xgb.DMatrix(as.matrix(testing_data[,unlist(zip_sign_features)]), label = testing_data$cova_ic_nc_water)
setinfo(xgZIP_sign_testing, "base_margin", log(testing_data$ecy))
TRUE
TRUE
In [141]:
xgb = xgb.train(
  nrounds = 10000
  , params = param0
  , data = xgZIP_sign_training
)
training_data$zip_sign = predict(xgb, xgZIP_sign_training)
testing_data$zip_sign = predict(xgb, xgZIP_sign_testing)

As an accuracy measure I use Normalized Weighted Gini:

In [144]:
Testing_Gini <- data.frame ("All" = NormalizedWeightedGini(testing_data$cova_ic_nc_water,testing_data$zip_all,testing_data$ecy)
                   ,"Significant Only" = NormalizedWeightedGini(testing_data$cova_ic_nc_water,testing_data$zip_sign,testing_data$ecy)
                   )
Testing_Gini
A data.frame: 1 × 2
AllSignificant.Only
<dbl><dbl>
0.28620430.2711677

Overall, the models are very close to each other. There is a difference in Normalized Weighted Gini which does NOT looks significant.

K Fold Validation

TBD Does not work with Caret very strightforward and my manual approach is wrong somewhere

Conclusion

  • Normalization is needed to make the model work. I need to include normalized log columns in the datasets when it's extracted.
  • I convert all other columns into factors but not sure if it's needed.
  • I was not able to add into the model and test cova_deductible and log_norm_water_risk_sev_3_blk. It returns errors.
  • usagetype returns NA in covariance matrix for some levels. Should I include in the model only "good" levels?
  • Some predictors (like ordinanceorlawpct) are marked as significant but really it's a random component I feel.
  • Significant features model has smaller AIC and BIC and should be better then All features model. The Normalized Weighted Genin is less in SIgnificant features model but probably the difference is not significant. Need F test to confirm.