In [1]:
library(MASS)
library(lmtest)
library(pscl)
library(xgboost)
library(ggplot2)
library(caret)
Loading required package: zoo


Attaching package: ‘zoo’


The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric


Classes and Methods for R developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University
Simon Jackman
hurdle and zeroinfl functions by Achim Zeileis

Loading required package: lattice

In [2]:
source('/home/kate/code/Utils/MyFunctions.R')
In [3]:
ModelsDir <- '/home/kate/Research/Property/Models/'
UseSavedIfExists <- TRUE
In [4]:
training_data <- read.csv("/home/kate/Research/Property/Data/property_wcf_training.csv", header=TRUE)
testing_data <- read.csv("/home/kate/Research/Property/Data/property_wcf_testing.csv", header=TRUE)

NB work with factors and because some numerical columns are factors (ordered) I convert them in factors explicitly. I may need to do this for some other columns which are Yes/No.

In [5]:
training_data$appl_fail_3_blk <- as.factor( training_data$appl_fail_3_blk )
training_data$fixture_leak_3_blk <- as.factor( training_data$fixture_leak_3_blk )
training_data$pipe_froze_3_blk <- as.factor( training_data$pipe_froze_3_blk )
training_data$plumb_leak_3_blk <- as.factor( training_data$plumb_leak_3_blk )
training_data$rep_cost_3_blk <- as.factor( training_data$rep_cost_3_blk )
training_data$ustructure_fail_3_blk <- as.factor( training_data$ustructure_fail_3_blk )
training_data$waterh_fail_3_blk <- as.factor( training_data$waterh_fail_3_blk )
In [6]:
testing_data$appl_fail_3_blk <- as.factor( testing_data$appl_fail_3_blk )
testing_data$fixture_leak_3_blk <- as.factor( testing_data$fixture_leak_3_blk )
testing_data$pipe_froze_3_blk <- as.factor( testing_data$pipe_froze_3_blk )
testing_data$plumb_leak_3_blk <- as.factor( testing_data$plumb_leak_3_blk )
testing_data$rep_cost_3_blk <- as.factor( testing_data$rep_cost_3_blk )
testing_data$ustructure_fail_3_blk <- as.factor( testing_data$ustructure_fail_3_blk )
testing_data$waterh_fail_3_blk <- as.factor( testing_data$waterh_fail_3_blk )
In [7]:
max_yearbuilt <- max(training_data$yearbuilt)
max_sqft <- max(training_data$sqft)
max_water_risk_3_blk <- max(training_data$water_risk_3_blk)
max_water_risk_fre_3_blk <- max(training_data$water_risk_fre_3_blk)
max_water_risk_sev_3_blk <- max(training_data$water_risk_sev_3_blk)
In [8]:
training_data$log_norm_yearbult <- log(training_data$yearbuilt/max_yearbuilt)
training_data$log_norm_sqft <- log(training_data$sqft/max_sqft)
training_data$log_norm_water_risk_3_blk <- log(training_data$water_risk_3_blk/max_water_risk_3_blk) 
training_data$log_norm_water_risk_fre_3_blk <- log(training_data$water_risk_fre_3_blk/max_water_risk_fre_3_blk)
training_data$log_norm_water_risk_sev_3_blk <- log(training_data$sqft/max_water_risk_sev_3_blk)
In [9]:
testing_data$log_norm_yearbult <- log(testing_data$yearbuilt/max_yearbuilt)
testing_data$log_norm_sqft <- log(testing_data$sqft/max_sqft)
testing_data$log_norm_water_risk_3_blk <- log(testing_data$water_risk_3_blk/max_water_risk_3_blk) 
testing_data$log_norm_water_risk_fre_3_blk <- log(testing_data$water_risk_fre_3_blk/max_water_risk_fre_3_blk)
testing_data$log_norm_water_risk_sev_3_blk <- log(testing_data$sqft/max_water_risk_sev_3_blk)
In [10]:
training_data$stories <- as.factor(training_data$stories)
training_data$units <- as.factor(training_data$units)
training_data$secondaryresidence <- as.factor(training_data$secondaryresidence)
training_data$sprinklersystem <- as.factor(training_data$sprinklersystem)
training_data$firealarmtype <- as.factor(training_data$firealarmtype)
training_data$waterdetectiondevice <- as.factor(training_data$waterdetectiondevice)
training_data$propertymanager <- as.factor(training_data$propertymanager)
training_data$cova_deductible <- as.factor(training_data$cova_deductible)
training_data$cova_limit       <- as.factor(training_data$cova_limit      )
training_data$waterded <- as.factor(training_data$waterded)
training_data$protectionclass <- as.factor(training_data$protectionclass)
training_data$fire_risk_model_score <- as.factor(training_data$fire_risk_model_score)
training_data$ordinanceorlawpct <- as.factor(training_data$ordinanceorlawpct)
training_data$functionalreplacementcost <- as.factor(training_data$functionalreplacementcost)
training_data$homegardcreditind <- as.factor(training_data$homegardcreditind)
training_data$landlordind <- as.factor(training_data$landlordind)
training_data$rentersinsurance <- as.factor(training_data$rentersinsurance) 
training_data$burglaryalarmtype <- as.factor(training_data$burglaryalarmtype) 
training_data$propertymanager <- as.factor(training_data$propertymanager)
In [11]:
testing_data$stories <- as.factor(testing_data$stories)
testing_data$units <- as.factor(testing_data$units)
testing_data$secondaryresidence <- as.factor(testing_data$secondaryresidence)
testing_data$sprinklersystem <- as.factor(testing_data$sprinklersystem)
testing_data$firealarmtype <- as.factor(testing_data$firealarmtype)
testing_data$waterdetectiondevice <- as.factor(testing_data$waterdetectiondevice)
testing_data$propertymanager <- as.factor(testing_data$propertymanager)
testing_data$cova_deductible <- as.factor(testing_data$cova_deductible)
testing_data$cova_limit       <- as.factor(testing_data$cova_limit      )
testing_data$waterded <- as.factor(testing_data$waterded)
testing_data$protectionclass <- as.factor(testing_data$protectionclass)
testing_data$fire_risk_model_score <- as.factor(testing_data$fire_risk_model_score)
testing_data$ordinanceorlawpct <- as.factor(testing_data$ordinanceorlawpct)
testing_data$functionalreplacementcost <- as.factor(testing_data$functionalreplacementcost)
testing_data$homegardcreditind <- as.factor(testing_data$homegardcreditind)
testing_data$landlordind <- as.factor(testing_data$landlordind)
testing_data$rentersinsurance <- as.factor(testing_data$rentersinsurance) 
testing_data$burglaryalarmtype <- as.factor(testing_data$burglaryalarmtype) 
testing_data$propertymanager <- as.factor(testing_data$propertymanager)
In [12]:
#unique numclaims values
numclaims <- unique(training_data$cova_ic_nc_water)

Modeling

In [13]:
formula_all <-  cova_ic_nc_water ~  
log_norm_yearbult + 
log_norm_sqft + 
stories + 
roofcd + 
units + 
occupancycd + 
waterded + 
protectionclass + 
constructioncd + 
fire_risk_model_score + 
usagetype + 
secondaryresidence + 
ordinanceorlawpct + 
functionalreplacementcost + 
homegardcreditind + 
sprinklersystem + 
landlordind + 
rentersinsurance + 
firealarmtype + 
burglaryalarmtype + 
waterdetectiondevice + 
propertymanager + 
#cova_deductible + 
cova_limit + 
log_norm_water_risk_3_blk + 
log_norm_water_risk_fre_3_blk + 
#log_norm_water_risk_sev_3_blk + 
appl_fail_3_blk + 
fixture_leak_3_blk + 
pipe_froze_3_blk + 
plumb_leak_3_blk + 
rep_cost_3_blk + 
ustructure_fail_3_blk + 
waterh_fail_3_blk

NB models

In [15]:
ModelFile <- paste(ModelsDir,"zinb_all.rds",sep="")
if(file.exists(ModelFile) && UseSavedIfExists){
    zinb_all <- readRDS(ModelFile)
    } else {
    zinb_all <- zeroinfl(formula_all,offset=log_ecy,data=training_data,dist = "negbin",link= "logit")
    saveRDS(zinb_all, ModelFile)
    }
summary(zinb_all)
Warning message in sqrt(diag(object$vcov)):
“NaNs produced”
Call:
zeroinfl(formula = formula_all, data = training_data, offset = log_ecy, 
    dist = "negbin", link = "logit")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-0.33213 -0.11951 -0.09321 -0.06695 61.43785 

Count model coefficients (negbin with log link):
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    -7.23130    1.03728  -6.971 3.14e-12 ***
log_norm_yearbult             -38.56751    3.60143 -10.709  < 2e-16 ***
log_norm_sqft                  -0.30640    0.09306  -3.293 0.000992 ***
stories2                        0.16434    0.04752   3.458 0.000544 ***
stories3                       -1.29869    0.22152  -5.863 4.56e-09 ***
roofcdMEMBRANE                 -0.25331    0.17492  -1.448 0.147568    
roofcdMETAL                    -0.12572    0.39343  -0.320 0.749305    
roofcdOTHER                     0.11411    0.07304   1.562 0.118187    
roofcdTAR                      -0.01529    0.15979  -0.096 0.923745    
roofcdTILE                     -0.05750    0.05350  -1.075 0.282450    
roofcdWOOD                      0.20425    0.21456   0.952 0.341129    
units2                         -0.31604    0.17124  -1.846 0.064954 .  
units3                         -0.44400    0.20729  -2.142 0.032198 *  
units4                         -0.03731    0.15639  -0.239 0.811456    
occupancycdTENANT              -0.02762    0.09028  -0.306 0.759694    
waterded5000                   -0.12038    0.31126  -0.387 0.698934    
waterded7500                   -2.52843    0.84712  -2.985 0.002838 ** 
waterded10000                  -1.10480    0.88406  -1.250 0.211412    
protectionclass1               -0.27543    0.26870  -1.025 0.305334    
protectionclass2               -0.31760    0.25433  -1.249 0.211748    
protectionclass3               -0.23166    0.25377  -0.913 0.361315    
protectionclass4               -0.16148    0.25299  -0.638 0.523284    
protectionclass5               -0.25056    0.26295  -0.953 0.340641    
protectionclass6               -0.15589    0.26643  -0.585 0.558472    
protectionclass7               -0.21508    0.79591  -0.270 0.786978    
protectionclass8               -0.86584    0.82600  -1.048 0.294534    
protectionclass9                2.19611    1.39069   1.579 0.114301    
protectionclass10              -0.39847    0.70374  -0.566 0.571247    
constructioncdB                -0.12203    0.22019  -0.554 0.579432    
constructioncdF                 0.18360    0.04516   4.066 4.79e-05 ***
constructioncdM                 0.01650    0.18583   0.089 0.929243    
constructioncdOTHER             0.52574    0.20484   2.567 0.010272 *  
fire_risk_model_score0          0.11456    0.07776   1.473 0.140706    
fire_risk_model_score1         -0.12244    0.09972  -1.228 0.219505    
fire_risk_model_score2          0.09947    0.09146   1.088 0.276792    
fire_risk_model_score3         -0.21608    0.23946  -0.902 0.366868    
fire_risk_model_score4         -0.01158    0.15166  -0.076 0.939141    
fire_risk_model_score5          2.58431    1.16403   2.220 0.026409 *  
fire_risk_model_score6          0.13451    0.32378   0.415 0.677826    
fire_risk_model_score7          1.18790    1.63646   0.726 0.467904    
fire_risk_model_score8         -8.68542  899.13154  -0.010 0.992293    
fire_risk_model_score9         -9.68987  249.66245  -0.039 0.969040    
fire_risk_model_score10        -0.17149         NA      NA       NA    
fire_risk_model_score11        -2.68500         NA      NA       NA    
fire_risk_model_score12         0.37585    1.13086   0.332 0.739616    
fire_risk_model_score13        -8.34019  573.21888  -0.015 0.988391    
fire_risk_model_score14        -0.04944         NA      NA       NA    
fire_risk_model_score15        -8.13500  500.91063  -0.016 0.987043    
fire_risk_model_score17        -0.11260         NA      NA       NA    
fire_risk_model_score18        -0.02604         NA      NA       NA    
usagetypePRIMARY                4.02694    0.87303   4.613 3.98e-06 ***
usagetypeRENTAL                 3.67302    0.87446   4.200 2.67e-05 ***
usagetypeSEASONAL               4.24141    0.94645   4.481 7.42e-06 ***
usagetypeSECONDARY              1.59865    1.04574   1.529 0.126330    
usagetypeUNOCCUPIED            -3.69055  255.23048  -0.014 0.988463    
usagetypeVACANT                -0.58938    1.71528  -0.344 0.731143    
secondaryresidence1             2.63790    1.10579   2.386 0.017054 *  
ordinanceorlawpct10             0.19601    0.04569   4.290 1.78e-05 ***
ordinanceorlawpct15             1.40543    1.00311   1.401 0.161190    
ordinanceorlawpct20             0.08060    0.06002   1.343 0.179327    
ordinanceorlawpct25             0.01914    0.21758   0.088 0.929917    
ordinanceorlawpct40            -0.49296    0.25887  -1.904 0.056876 .  
ordinanceorlawpct50            -0.15094    0.34559  -0.437 0.662278    
ordinanceorlawpct65            -0.23675    0.66158  -0.358 0.720447    
ordinanceorlawpct75            -0.77476    0.65111  -1.190 0.234083    
ordinanceorlawpct90             0.36711    0.45973   0.799 0.424561    
ordinanceorlawpct100            0.04952    0.65708   0.075 0.939920    
functionalreplacementcost1     -1.82229    1.31342  -1.387 0.165307    
homegardcreditind1              0.04189    0.05422   0.773 0.439706    
sprinklersystem1                0.29066    0.31279   0.929 0.352761    
sprinklersystem2               -0.14577    0.08572  -1.701 0.089022 .  
landlordind1                   -0.47574    0.10805  -4.403 1.07e-05 ***
rentersinsurance1               0.38963    0.40060   0.973 0.330753    
firealarmtype1                  0.16225    0.04335   3.743 0.000182 ***
burglaryalarmtype1             -0.07424    0.04327  -1.716 0.086174 .  
waterdetectiondevice1           0.44390    1.57023   0.283 0.777407    
propertymanager1               -0.06905    0.17117  -0.403 0.686671    
cova_limit200000               -0.30418    0.13849  -2.196 0.028059 *  
cova_limit300000               -0.08141    0.13118  -0.621 0.534900    
cova_limit400000                0.15477    0.13483   1.148 0.251029    
cova_limit500000                0.29872    0.14490   2.062 0.039254 *  
cova_limit600000                0.36478    0.15614   2.336 0.019482 *  
cova_limit700000                0.33743    0.16790   2.010 0.044463 *  
cova_limit800000                0.27498    0.18674   1.473 0.140873    
cova_limit900000                0.46204    0.19479   2.372 0.017692 *  
cova_limit1000000               0.53750    0.24681   2.178 0.029419 *  
cova_limit1200000               0.09967    0.22766   0.438 0.661530    
cova_limit1300000               0.64042    0.23030   2.781 0.005422 ** 
log_norm_water_risk_3_blk       0.26804    0.10784   2.486 0.012933 *  
log_norm_water_risk_fre_3_blk   0.30999    0.10799   2.871 0.004098 ** 
appl_fail_3_blk1               -0.26819    0.29650  -0.905 0.365718    
appl_fail_3_blk2               -0.28046    0.30775  -0.911 0.362121    
appl_fail_3_blk3               -0.22901    0.30575  -0.749 0.453852    
appl_fail_3_blk4               -0.20443    0.29464  -0.694 0.487783    
appl_fail_3_blk5               -0.24345    0.29305  -0.831 0.406102    
fixture_leak_3_blk1            -0.25820    0.07336  -3.520 0.000432 ***
fixture_leak_3_blk2            -0.12438    0.05739  -2.167 0.030206 *  
fixture_leak_3_blk3            -0.17337    0.06757  -2.566 0.010300 *  
fixture_leak_3_blk4            -0.41005    0.10802  -3.796 0.000147 ***
fixture_leak_3_blk5            -0.33925    0.24097  -1.408 0.159170    
pipe_froze_3_blk1              -0.29445    0.07480  -3.937 8.26e-05 ***
pipe_froze_3_blk2              -0.35877    0.04997  -7.180 6.99e-13 ***
pipe_froze_3_blk3              -0.26269    0.06284  -4.180 2.91e-05 ***
pipe_froze_3_blk4              -1.21489    0.23274  -5.220 1.79e-07 ***
pipe_froze_3_blk5               0.25770    0.28185   0.914 0.360546    
plumb_leak_3_blk1               0.20744    0.23740   0.874 0.382221    
plumb_leak_3_blk2              -0.05706    0.25634  -0.223 0.823859    
plumb_leak_3_blk3              -0.03095    0.25128  -0.123 0.901968    
plumb_leak_3_blk4               0.35803    0.24038   1.489 0.136380    
plumb_leak_3_blk5               0.49515    0.24386   2.030 0.042310 *  
rep_cost_3_blk1                 0.24412    0.31301   0.780 0.435450    
rep_cost_3_blk2                 0.69827    0.56096   1.245 0.213214    
rep_cost_3_blk3                -0.36089    0.33708  -1.071 0.284342    
rep_cost_3_blk4                 0.15224    0.29876   0.510 0.610364    
rep_cost_3_blk5                 0.06803    0.28970   0.235 0.814335    
ustructure_fail_3_blk1          0.04921    0.14992   0.328 0.742719    
ustructure_fail_3_blk2          0.16950    0.15284   1.109 0.267423    
ustructure_fail_3_blk3         -0.10748    0.16843  -0.638 0.523387    
ustructure_fail_3_blk4          0.01938    0.15148   0.128 0.898217    
ustructure_fail_3_blk5          0.04841    0.14907   0.325 0.745390    
waterh_fail_3_blk1              0.09362    0.06520   1.436 0.151052    
waterh_fail_3_blk2              0.08350    0.04374   1.909 0.056228 .  
waterh_fail_3_blk3              0.09438    0.06858   1.376 0.168771    
waterh_fail_3_blk4              0.13180    0.09712   1.357 0.174754    
waterh_fail_3_blk5              0.18866    0.14195   1.329 0.183818    
Log(theta)                      0.52545    0.23163   2.268 0.023301 *  

Zero-inflation model coefficients (binomial with logit link):
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -6.576e+00  2.852e+00  -2.306 0.021136 *  
log_norm_yearbult             -1.137e+02  6.064e+00 -18.745  < 2e-16 ***
log_norm_sqft                 -1.628e+00  2.146e-01  -7.587 3.28e-14 ***
stories2                       2.173e-01  1.487e-01   1.462 0.143873    
stories3                      -1.624e+01         NA      NA       NA    
roofcdMEMBRANE                -7.717e-01  4.767e-01  -1.619 0.105460    
roofcdMETAL                   -6.513e-01  1.137e+00  -0.573 0.566896    
roofcdOTHER                   -1.500e-01  1.595e-01  -0.940 0.347244    
roofcdTAR                     -1.666e-01  3.018e-01  -0.552 0.580932    
roofcdTILE                    -8.359e-01  1.891e-01  -4.420 9.88e-06 ***
roofcdWOOD                     3.093e-01  4.528e-01   0.683 0.494660    
units2                         4.797e-02  3.078e-01   0.156 0.876167    
units3                        -6.839e-01  4.405e-01  -1.552 0.120572    
units4                        -8.454e-02  3.636e-01  -0.232 0.816160    
occupancycdTENANT             -5.574e-01  2.093e-01  -2.663 0.007735 ** 
waterded5000                   6.805e-01  6.934e-01   0.981 0.326386    
waterded7500                  -1.615e+00  2.785e+00  -0.580 0.562024    
waterded10000                  1.100e+00  1.597e+00   0.689 0.490755    
protectionclass1              -7.115e-01  5.896e-01  -1.207 0.227486    
protectionclass2              -7.195e-01  5.590e-01  -1.287 0.198083    
protectionclass3              -6.794e-01  5.594e-01  -1.214 0.224566    
protectionclass4              -7.271e-01  5.605e-01  -1.297 0.194603    
protectionclass5              -7.121e-01  5.984e-01  -1.190 0.234027    
protectionclass6              -5.083e-01  6.013e-01  -0.845 0.397913    
protectionclass7              -6.899e-01  2.031e+00  -0.340 0.734110    
protectionclass8              -9.229e-01  2.167e+00  -0.426 0.670232    
protectionclass9               1.538e+01  4.686e+01   0.328 0.742780    
protectionclass10              6.426e-03  1.609e+00   0.004 0.996813    
constructioncdB               -1.179e+00  6.161e-01  -1.913 0.055702 .  
constructioncdF               -1.536e-01  1.202e-01  -1.277 0.201466    
constructioncdM               -9.841e-01  5.504e-01  -1.788 0.073794 .  
constructioncdOTHER            5.453e-01  4.222e-01   1.292 0.196516    
fire_risk_model_score0         6.572e-02  2.483e-01   0.265 0.791256    
fire_risk_model_score1        -3.269e-01  3.481e-01  -0.939 0.347622    
fire_risk_model_score2         3.749e-02  2.957e-01   0.127 0.899107    
fire_risk_model_score3        -6.617e-01  8.458e-01  -0.782 0.434007    
fire_risk_model_score4         2.103e-01  4.274e-01   0.492 0.622708    
fire_risk_model_score5         6.312e+00  1.618e+00   3.901 9.56e-05 ***
fire_risk_model_score6         2.468e-01  8.693e-01   0.284 0.776460    
fire_risk_model_score7         3.677e+00  2.194e+00   1.676 0.093693 .  
fire_risk_model_score8         1.337e+01  6.684e+03   0.002 0.998404    
fire_risk_model_score9         1.228e+01         NA      NA       NA    
fire_risk_model_score10        1.059e+01         NA      NA       NA    
fire_risk_model_score11        1.339e+01         NA      NA       NA    
fire_risk_model_score12       -1.447e+00  8.396e+00  -0.172 0.863190    
fire_risk_model_score13        1.193e+01  3.410e+03   0.003 0.997209    
fire_risk_model_score14        1.063e+01         NA      NA       NA    
fire_risk_model_score15        1.062e+01         NA      NA       NA    
fire_risk_model_score17        1.111e+01         NA      NA       NA    
fire_risk_model_score18        1.110e+01         NA      NA       NA    
usagetypePRIMARY               2.595e+00  2.321e+00   1.118 0.263594    
usagetypeRENTAL                2.827e+00  2.324e+00   1.217 0.223746    
usagetypeSEASONAL              3.572e+00  2.444e+00   1.461 0.143936    
usagetypeSECONDARY            -1.923e+01  3.064e+02  -0.063 0.949974    
usagetypeUNOCCUPIED            9.286e+00  5.125e+02   0.018 0.985544    
usagetypeVACANT                1.740e+00  3.605e+00   0.483 0.629314    
secondaryresidence1            2.213e+01  3.064e+02   0.072 0.942420    
ordinanceorlawpct10            2.763e-01  1.189e-01   2.324 0.020115 *  
ordinanceorlawpct15            3.424e+00  1.327e+00   2.581 0.009861 ** 
ordinanceorlawpct20            1.359e-01  1.679e-01   0.810 0.418092    
ordinanceorlawpct25           -1.407e-01  5.699e-01  -0.247 0.805040    
ordinanceorlawpct40           -3.645e+00  1.707e+00  -2.136 0.032704 *  
ordinanceorlawpct50           -8.003e-01  1.063e+00  -0.753 0.451579    
ordinanceorlawpct65           -2.530e-01  1.845e+00  -0.137 0.890920    
ordinanceorlawpct75           -3.638e+00  3.942e+00  -0.923 0.356057    
ordinanceorlawpct90            1.126e+00  9.371e-01   1.202 0.229484    
ordinanceorlawpct100           3.635e-01  1.806e+00   0.201 0.840516    
functionalreplacementcost1    -1.880e+00  2.010e+00  -0.935 0.349550    
homegardcreditind1             2.215e-01  1.434e-01   1.544 0.122496    
sprinklersystem1               1.205e+00  8.872e-01   1.359 0.174287    
sprinklersystem2               2.575e-01  3.423e-01   0.752 0.451782    
landlordind1                  -1.001e-01  2.858e-01  -0.350 0.726122    
rentersinsurance1              8.334e-01  9.395e-01   0.887 0.375031    
firealarmtype1                -5.899e-02  1.145e-01  -0.515 0.606264    
burglaryalarmtype1            -1.841e-01  1.279e-01  -1.439 0.150021    
waterdetectiondevice1          1.125e+00  3.848e+00   0.292 0.770052    
propertymanager1              -1.030e+00  5.359e-01  -1.922 0.054617 .  
cova_limit200000               4.376e-01  3.276e-01   1.336 0.181617    
cova_limit300000               5.663e-01  3.159e-01   1.793 0.073022 .  
cova_limit400000               7.947e-01  3.269e-01   2.431 0.015046 *  
cova_limit500000               1.129e+00  3.544e-01   3.187 0.001440 ** 
cova_limit600000               1.222e+00  3.880e-01   3.149 0.001639 ** 
cova_limit700000               1.155e+00  4.312e-01   2.680 0.007368 ** 
cova_limit800000               1.390e+00  4.918e-01   2.826 0.004720 ** 
cova_limit900000               1.269e+00  5.136e-01   2.470 0.013495 *  
cova_limit1000000              1.917e+00  6.624e-01   2.893 0.003812 ** 
cova_limit1200000              3.405e-01  6.669e-01   0.511 0.609626    
cova_limit1300000              1.548e+00  6.436e-01   2.405 0.016157 *  
log_norm_water_risk_3_blk      9.929e-01  2.804e-01   3.541 0.000399 ***
log_norm_water_risk_fre_3_blk -5.238e-01  2.701e-01  -1.939 0.052507 .  
appl_fail_3_blk1              -6.464e-01  6.453e-01  -1.002 0.316491    
appl_fail_3_blk2              -5.280e-01  6.633e-01  -0.796 0.426066    
appl_fail_3_blk3              -5.140e-01  6.638e-01  -0.774 0.438765    
appl_fail_3_blk4              -5.803e-01  6.428e-01  -0.903 0.366627    
appl_fail_3_blk5              -6.177e-01  6.387e-01  -0.967 0.333509    
fixture_leak_3_blk1           -7.332e-03  2.085e-01  -0.035 0.971948    
fixture_leak_3_blk2           -1.254e-02  1.746e-01  -0.072 0.942756    
fixture_leak_3_blk3           -9.488e-02  1.952e-01  -0.486 0.626951    
fixture_leak_3_blk4           -1.812e-01  2.852e-01  -0.635 0.525242    
fixture_leak_3_blk5            6.077e-01  5.640e-01   1.077 0.281307    
pipe_froze_3_blk1             -5.500e-01  2.844e-01  -1.934 0.053118 .  
pipe_froze_3_blk2             -1.650e-01  1.246e-01  -1.325 0.185281    
pipe_froze_3_blk3             -1.126e-01  1.632e-01  -0.690 0.490357    
pipe_froze_3_blk4             -1.532e+01  4.687e+01  -0.327 0.743787    
pipe_froze_3_blk5              1.107e+00  6.549e-01   1.690 0.091004 .  
plumb_leak_3_blk1             -2.970e-02  7.162e-01  -0.041 0.966921    
plumb_leak_3_blk2             -5.649e-01  7.545e-01  -0.749 0.454057    
plumb_leak_3_blk3             -3.978e-01  7.418e-01  -0.536 0.591794    
plumb_leak_3_blk4              2.817e-01  7.225e-01   0.390 0.696596    
plumb_leak_3_blk5              4.568e-01  7.330e-01   0.623 0.533121    
rep_cost_3_blk1                4.785e-01  1.105e+00   0.433 0.665029    
rep_cost_3_blk2                2.105e+00  1.423e+00   1.480 0.138949    
rep_cost_3_blk3               -7.180e+00  8.867e+00  -0.810 0.418088    
rep_cost_3_blk4                3.675e-01  1.092e+00   0.337 0.736490    
rep_cost_3_blk5                9.368e-03  1.065e+00   0.009 0.992981    
ustructure_fail_3_blk1         1.129e-01  6.216e-01   0.182 0.855868    
ustructure_fail_3_blk2         2.979e-01  6.402e-01   0.465 0.641658    
ustructure_fail_3_blk3         2.485e-03  6.810e-01   0.004 0.997088    
ustructure_fail_3_blk4         2.197e-01  6.217e-01   0.353 0.723756    
ustructure_fail_3_blk5         3.835e-01  6.115e-01   0.627 0.530617    
waterh_fail_3_blk1            -3.155e-01  2.068e-01  -1.526 0.127013    
waterh_fail_3_blk2             5.905e-02  1.090e-01   0.542 0.588007    
waterh_fail_3_blk3             3.713e-01  1.839e-01   2.019 0.043505 *  
waterh_fail_3_blk4             3.218e-01  3.123e-01   1.030 0.302796    
waterh_fail_3_blk5             6.995e-01  4.434e-01   1.578 0.114650    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 1.6912 
Number of iterations in BFGS optimization: 627 
Log-likelihood: -5.144e+04 on 251 Df

Normalization is needed to make the model work. I was not able to add into the model and test cova_deductible and log_norm_water_risk_sev_3_blk. It returns errors. usagetype does not returns NA in covariance matrix for some levels as in ZI Poisson but stories(3) and fire risk model score do. Some predictors (like ordinanceorlawpct) are marked as significant but really it's a random component I feel.

In [17]:
formula_sign <-  cova_ic_nc_water ~  
log_norm_yearbult + 
log_norm_sqft + 
stories + 
waterded + 
constructioncd +  
usagetype + 
landlordind + 
firealarmtype + 
cova_limit + 
log_norm_water_risk_3_blk + 
log_norm_water_risk_fre_3_blk + 
fixture_leak_3_blk + 
pipe_froze_3_blk +
plumb_leak_3_blk |
log_norm_yearbult +
log_norm_sqft +
roofcd + 
occupancycd +
constructioncd +
cova_limit + 
log_norm_water_risk_3_blk +
log_norm_water_risk_fre_3_blk +
pipe_froze_3_blk +
waterh_fail_3_blk
In [18]:
ModelFile <- paste(ModelsDir,"zip_sign.rds",sep="")
if(file.exists(ModelFile) && UseSavedIfExists){
    zinb_sign <- readRDS(ModelFile)
    } else {
    zinb_sign <- zeroinfl(formula_sign,offset=log(exposure),data=training_data,dist = "negbin",link= "logit")
    saveRDS(zinb_sign, ModelFile)
    }
summary(zinb_sign)
Call:
zeroinfl(formula = formula_sign, data = training_data, offset = log_ecy, 
    dist = "poisson", link = "logit")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-0.28437 -0.12035 -0.09250 -0.06738 65.46562 

Count model coefficients (poisson with log link):
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -4.069091   0.597702  -6.808 9.90e-12 ***
log_norm_yearbult             41.451373   2.664522  15.557  < 2e-16 ***
log_norm_sqft                  0.546648   0.062151   8.796  < 2e-16 ***
stories2                       0.118883   0.030172   3.940 8.14e-05 ***
stories3                      -0.303811   0.185689  -1.636 0.101814    
waterded5000                  -0.334142   0.193748  -1.725 0.084596 .  
waterded7500                  -2.161977   0.707641  -3.055 0.002249 ** 
waterded10000                 -1.505610   0.578050  -2.605 0.009197 ** 
constructioncdB                0.501656   0.141726   3.540 0.000401 ***
constructioncdF                0.332390   0.031185  10.659  < 2e-16 ***
constructioncdM                0.449169   0.103787   4.328 1.51e-05 ***
constructioncdOTHER            0.212652   0.127138   1.673 0.094403 .  
usagetypePRIMARY               2.937792   0.578169   5.081 3.75e-07 ***
usagetypeRENTAL                2.525265   0.578451   4.366 1.27e-05 ***
usagetypeSEASONAL              2.812475   0.603334   4.662 3.14e-06 ***
usagetypeSECONDARY             1.694024   0.731192   2.317 0.020515 *  
usagetypeUNOCCUPIED           -6.577947  80.593119  -0.082 0.934950    
usagetypeVACANT               -1.312806   1.154774  -1.137 0.255601    
landlordind1                  -0.460655   0.066094  -6.970 3.18e-12 ***
firealarmtype1                 0.167663   0.022187   7.557 4.13e-14 ***
cova_limit200000              -0.422208   0.083523  -5.055 4.30e-07 ***
cova_limit300000              -0.270028   0.078805  -3.427 0.000611 ***
cova_limit400000              -0.149937   0.083134  -1.804 0.071300 .  
cova_limit500000              -0.169353   0.091549  -1.850 0.064334 .  
cova_limit600000              -0.209909   0.102440  -2.049 0.040454 *  
cova_limit700000              -0.220865   0.115595  -1.911 0.056046 .  
cova_limit800000              -0.430982   0.133849  -3.220 0.001282 ** 
cova_limit900000              -0.282331   0.149291  -1.891 0.058606 .  
cova_limit1000000             -0.539027   0.176270  -3.058 0.002228 ** 
cova_limit1200000             -0.307735   0.188825  -1.630 0.103158    
cova_limit1300000             -0.230033   0.198802  -1.157 0.247235    
log_norm_water_risk_3_blk     -0.001452   0.073412  -0.020 0.984219    
log_norm_water_risk_fre_3_blk  0.350611   0.070047   5.005 5.58e-07 ***
fixture_leak_3_blk1           -0.198714   0.040500  -4.907 9.27e-07 ***
fixture_leak_3_blk2           -0.060470   0.035463  -1.705 0.088164 .  
fixture_leak_3_blk3           -0.097936   0.039335  -2.490 0.012781 *  
fixture_leak_3_blk4           -0.257662   0.057176  -4.506 6.59e-06 ***
fixture_leak_3_blk5           -0.457073   0.125066  -3.655 0.000258 ***
pipe_froze_3_blk1             -0.303095   0.071872  -4.217 2.47e-05 ***
pipe_froze_3_blk2             -0.279586   0.034515  -8.100 5.47e-16 ***
pipe_froze_3_blk3             -0.252991   0.049301  -5.132 2.87e-07 ***
pipe_froze_3_blk4             -0.631999   0.201340  -3.139 0.001695 ** 
pipe_froze_3_blk5             -0.382666   0.201246  -1.901 0.057238 .  

Zero-inflation model coefficients (binomial with logit link):
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    3.968e+00  7.943e-01   4.995 5.87e-07 ***
log_norm_yearbult              3.069e+02  2.876e+01  10.671  < 2e-16 ***
log_norm_sqft                  1.407e+00  2.244e-01   6.267 3.67e-10 ***
roofcdMEMBRANE                 4.205e-01  4.085e-01   1.029 0.303248    
roofcdMETAL                   -7.075e-02  9.753e-01  -0.073 0.942171    
roofcdOTHER                   -4.825e-01  2.080e-01  -2.320 0.020349 *  
roofcdTAR                     -1.337e-02  4.638e-01  -0.029 0.976999    
roofcdTILE                    -6.622e-02  1.244e-01  -0.532 0.594453    
roofcdWOOD                    -7.364e-02  8.041e-01  -0.092 0.927033    
occupancycdTENANT             -2.105e-01  1.934e-01  -1.089 0.276228    
protectionclass1              -1.060e-01  6.232e-01  -0.170 0.864956    
protectionclass2               4.916e-02  5.880e-01   0.084 0.933364    
protectionclass3               8.753e-03  5.869e-01   0.015 0.988101    
protectionclass4              -2.163e-01  5.855e-01  -0.370 0.711743    
protectionclass5               2.200e-01  6.046e-01   0.364 0.715942    
protectionclass6               1.702e-01  6.129e-01   0.278 0.781243    
protectionclass7              -2.599e+00  3.460e+00  -0.751 0.452570    
protectionclass8               2.665e+00  1.900e+00   1.403 0.160759    
protectionclass9               3.248e+01  4.093e+04   0.001 0.999367    
protectionclass10              2.615e-01  1.266e+00   0.207 0.836377    
constructioncdB                2.726e+00  6.465e-01   4.216 2.48e-05 ***
constructioncdF                3.732e-01  1.022e-01   3.650 0.000262 ***
constructioncdM                6.263e-01  4.966e-01   1.261 0.207226    
constructioncdOTHER           -6.286e-01  6.803e-01  -0.924 0.355512    
cova_limit200000              -1.449e-01  3.583e-01  -0.405 0.685786    
cova_limit300000              -8.257e-02  3.319e-01  -0.249 0.803516    
cova_limit400000              -2.536e-01  3.402e-01  -0.745 0.455994    
cova_limit500000              -6.135e-01  3.616e-01  -1.697 0.089741 .  
cova_limit600000              -9.616e-01  3.915e-01  -2.456 0.014050 *  
cova_limit700000              -9.126e-01  4.227e-01  -2.159 0.030848 *  
cova_limit800000              -1.217e+00  4.669e-01  -2.608 0.009120 ** 
cova_limit900000              -1.424e+00  5.297e-01  -2.688 0.007197 ** 
cova_limit1000000             -2.234e+00  6.643e-01  -3.364 0.000769 ***
cova_limit1200000             -8.592e-01  5.810e-01  -1.479 0.139177    
cova_limit1300000             -1.736e+00  6.480e-01  -2.679 0.007387 ** 
log_norm_water_risk_3_blk     -2.790e-01  2.775e-01  -1.005 0.314819    
log_norm_water_risk_fre_3_blk  8.687e-02  2.723e-01   0.319 0.749676    
pipe_froze_3_blk1             -1.536e-01  2.187e-01  -0.702 0.482502    
pipe_froze_3_blk2              3.490e-01  1.260e-01   2.769 0.005624 ** 
pipe_froze_3_blk3              2.152e-01  1.596e-01   1.348 0.177533    
pipe_froze_3_blk4             -4.649e+01  4.094e+04  -0.001 0.999094    
pipe_froze_3_blk5             -8.637e-01  8.205e-01  -1.053 0.292486    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 169 
Log-likelihood: -5.15e+04 on 85 Df

No NA for teh levels as in the complex model. Some predictors which were significant now are not significant.

Model comparison

In [19]:
lrtest(zinb_sign,zinb_all)
A anova: 2 × 5
#DfLogLikDfChisqPr(>Chisq)
<dbl><dbl><dbl><dbl><dbl>
1 85-51504.44 NA NA NA
2251-51440.07166128.74010.9855326
In [20]:
vuong(zinb_sign,zinb_all)
Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A    p-value
Raw                   -2.103459 model2 > model1 0.01771285
AIC-corrected          3.288344 model1 > model2 0.00050389
BIC-corrected         34.921734 model1 > model2 < 2.22e-16

The results is a little bit contraversial. According to Raw, the complex model is better then the nested one. However Significant features model is better because it has smaller AIC and BIC and it's significant.

In [21]:
models <- list("All Features" = zinb_all,  "Sign Features"=zinb_sign)
In [22]:
#Log likelyhood for all the models and DF
Models_LogLik <- rbind(logLik = sapply(models, function(x) round(logLik(x), digits = 0)),
    Df = sapply(models, function(x) attr(logLik(x), "df")))
Models_LogLik
A matrix: 2 × 2 of type dbl
All FeaturesSign Features
logLik-51440-51504
Df 251 85
In [23]:
#AIC
Models_AIC <- rbind(AIC = sapply(models, function(x) round(AIC(x), digits = 0)))
Models_AIC
A matrix: 1 × 2 of type dbl
All FeaturesSign Features
AIC103382103179
In [38]:
#BIC
Models_BIC <- rbind(BIC = sapply(models, function(x) round(BIC(x), digits = 0)))
Models_BIC
A matrix: 1 × 2 of type dbl
All FeaturesSign Features
BIC106327104176

Smaller AIC (BIC) suggests a better model. The model with significant features has a smaller AIC and BIC.

Coefficients means

In [25]:
sapply(models, function(x) coef(x)[1:20])
A matrix: 20 × 2 of type dbl
All FeaturesSign Features
count_(Intercept) -7.23130407-4.0690906
count_log_norm_yearbult-38.5675063641.4513732
count_log_norm_sqft -0.30640350 0.5466477
count_stories2 0.16433789 0.1188826
count_stories3 -1.29868767-0.3038107
count_roofcdMEMBRANE -0.25331111-0.3341417
count_roofcdMETAL -0.12572419-2.1619774
count_roofcdOTHER 0.11411463-1.5056096
count_roofcdTAR -0.01529493 0.5016561
count_roofcdTILE -0.05749957 0.3323903
count_roofcdWOOD 0.20424889 0.4491690
count_units2 -0.31604145 0.2126523
count_units3 -0.44400345 2.9377920
count_units4 -0.03730728 2.5252654
count_occupancycdTENANT -0.02761540 2.8124748
count_waterded5000 -0.12038390 1.6940236
count_waterded7500 -2.52843194-6.5779473
count_waterded10000 -1.10480393-1.3128057
count_protectionclass1 -0.27543334-0.4606550
count_protectionclass2 -0.31759860 0.1676632

Sometimes the sign of the mean is different in the models

Rootograms for Assessing Goodness of Fit of Probability...

In [26]:
library(countreg)
Attaching package: ‘countreg’


The following objects are masked from ‘package:pscl’:

    hurdle, hurdle.control, hurdletest, zeroinfl, zeroinfl.control


In [27]:
par(mfrow = c(1, 2))
rootogram(zinb_all,max = max(numclaims),main="All Features",style="hanging")
rootogram(zinb_sign,max = max(numclaims),main="Significant Features",style="hanging")

Significant features model has deviance on numebr claims 2

Predicting Number of claims

I use XGB to predict the final result based on probabilities of number claims in each row.

In [28]:
param0 <- list(
  "objective"  = "count:poisson"
  , "eval_metric" = "poisson-nloglik" 
)
In [29]:
#Adding  predicted probability values 
#to training and testing dataset for each numclaims
#as separate column

training_all <- predict(zinb_all, training_data, type='prob')
testing_all <- predict(zinb_all, testing_data, type='prob')

training_sign <- predict(zinb_sign, training_data, type='prob')
testing_sign <- predict(zinb_sign, testing_data, type='prob')

for(unique_value in numclaims){

training_data[paste("zinb_all", unique_value, sep = ".")] <- training_all[,as.numeric(unique_value)+1]
testing_data[paste("zinb_all", unique_value, sep = ".")] <- testing_all[,as.numeric(unique_value)+1]

training_data[paste("zinb_sign", unique_value, sep = ".")] <- training_sign[,as.numeric(unique_value)+1]
testing_data[paste("zinb_sign", unique_value, sep = ".")] <- testing_sign[,as.numeric(unique_value)+1]
    
}
In [30]:
#Features with probabilities listed by model
zinb_all_features <- list();
zinb_sign_features <- list();

i <- 1
for(unique_value in numclaims){
    zinb_all_features[[i]] <- paste("zinb_all", unique_value, sep = ".")
    zinb_sign_features[[i]] <- paste("zinb_sign", unique_value, sep = ".")

    i <- i + 1
}
In [31]:
xgNB_all_training = xgb.DMatrix(as.matrix(training_data[,unlist(zinb_all_features)]), label = training_data$cova_ic_nc_water)
setinfo(xgNB_all_training, "base_margin", log(training_data$ecy))

xgNB_all_testing = xgb.DMatrix(as.matrix(testing_data[,unlist(zinb_all_features)]), label = testing_data$cova_ic_nc_water)
setinfo(xgNB_all_testing, "base_margin", log(testing_data$ecy))
TRUE
TRUE
In [32]:
xgb = xgb.train(
  nrounds = 10000
  , params = param0
  , data = xgNB_all_training
)
training_data$zinb_all = predict(xgb, xgNB_all_training)
testing_data$zinb_all = predict(xgb, xgNB_all_testing)
In [33]:
xgNB_sign_training = xgb.DMatrix(as.matrix(training_data[,unlist(zinb_sign_features)]), label = training_data$cova_ic_nc_water)
setinfo(xgNB_sign_training, "base_margin", log(training_data$ecy))

xgNB_sign_testing = xgb.DMatrix(as.matrix(testing_data[,unlist(zinb_sign_features)]), label = testing_data$cova_ic_nc_water)
setinfo(xgNB_sign_testing, "base_margin", log(testing_data$ecy))
TRUE
TRUE
In [34]:
xgb = xgb.train(
  nrounds = 10000
  , params = param0
  , data = xgNB_sign_training
)
training_data$zinb_sign = predict(xgb, xgNB_sign_training)
testing_data$zinb_sign = predict(xgb, xgNB_sign_testing)

As an accuracy measure I use Normalized Weighted Gini:

In [37]:
Testing_Gini <- data.frame ("All" = NormalizedWeightedGini(testing_data$cova_ic_nc_water,testing_data$zinb_all,testing_data$ecy)
                   ,"Significant Only" = NormalizedWeightedGini(testing_data$cova_ic_nc_water,testing_data$zinb_sign,testing_data$ecy)
                   )
Testing_Gini
A data.frame: 1 × 2
AllSignificant.Only
<dbl><dbl>
0.27187870.2711677

Overall, the models are very close to each other. There is a difference in Normalized Weighted Gini which does NOT looks significant.

K Fold Validation

TBD Does not work with Caret very strightforward and my manual approach is wrong somewhere

Conclusion

  • Normalization is needed to make the model work. I need to include normalized log columns in the datasets when it's extracted.
  • I convert all other columns into factors but not sure if it's needed.
  • I was not able to add into the model and test cova_deductible and log_norm_water_risk_sev_3_blk. It returns errors.
  • There is a different set of columns with NA then in Poisson zi (stories and fire risk score). Should I include in the model only "good" levels?
  • Some predictors (like ordinanceorlawpct) are marked as significant but really it's a random component I feel.
  • Significant features model has smaller AIC and BIC and should be better then All features model. The Normalized Weighted Genin is almost the same. Need F test to confirm.