How to get more information from the XGBoost CV function

 

Introduction

 

Cross-validation is a well-known method to estimate how a model will perform on unseen data. There are different cross-validation schemes: holdout, K-fold cross-validation, leave-one-out cross-validation. The general idea is the same: partition the available dataset into training and validation subsets several times, train and validate a model on each pair of subsets, and average the results. The reduced size of the training data is compensated by validating on data the model has not seen during training.

The process can be implemented as a loop that splits the data, trains and validates a model on each split, and calculates the mean, std and/or sem of the evaluation metric. It becomes more challenging when a straightforward loop is not an option: the dataset is huge, or the models are trained in parallel in a cloud service such as AWS SageMaker.
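For reference, a straightforward loop of this kind might look like the sketch below. It assumes a feature DataFrame X, a binary target Series y, and a params dictionary already exist; all names are illustrative.

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

fold_scores = []
for train_idx, valid_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(X, y):
    # build DMatrix objects for this split
    dtrain = xgb.DMatrix(X.iloc[train_idx], label=y.iloc[train_idx])
    dvalid = xgb.DMatrix(X.iloc[valid_idx], label=y.iloc[valid_idx])
    booster = xgb.train(params, dtrain, num_boost_round=100)
    # validate on the held-out fold
    fold_scores.append(roc_auc_score(y.iloc[valid_idx], booster.predict(dvalid)))

print(np.mean(fold_scores), np.std(fold_scores, ddof=1), pd.Series(fold_scores).sem())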

The XGBoost CV function does all of this in a single call. You do not need to split the dataset into folds, train the models, or average the results yourself.

cv_results = xgb.cv(params,
                    dtrain,
                    num_boost_round=num_boost_round,
                    seed=42,
                    nfold=5,
                    early_stopping_rounds=100)

cv_results

 

     train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
0          0.500000       0.000000       0.500000      0.000000
1          0.500000       0.000000       0.500000      0.000000
2          0.500000       0.000000       0.500000      0.000000
3          0.500000       0.000000       0.500000      0.000000
4          0.500000       0.000000       0.500000      0.000000
..              ...            ...            ...           ...
691        0.729942       0.001573       0.687806      0.015197
692        0.729998       0.001572       0.687810      0.015194
693        0.730056       0.001574       0.687823      0.015194
694        0.730084       0.001564       0.687826      0.015202
695        0.730137       0.001560       0.687837      0.015183

 

CV returns a table with the mean and std of the evaluation metric across all folds, for both the train and test sets. Each row corresponds to a boosting iteration.
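For example, the best boosting round and its score can be read from this table directly; a small sketch for the AUC metric used above (variable names are illustrative):

best_round = cv_results['test-auc-mean'].idxmax()   # AUC: higher is better
best_score = cv_results.loc[best_round, 'test-auc-mean']
print(best_round, best_score)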

 

The drawbacks of the function:

 

- We do not get separate results for each model trained/validated on a specific fold; there are only the mean and std. We may need them to analyze the results with a t-test or to use the models' outputs in a pipeline/model stacking.

- The std reported by CV may be wrong. I observed incorrect AUC std values in several XGBoost versions; it looks as if the mean is calculated first and then included in the std calculation.

- We may need the sem instead of, or along with, the std.

- We do not get the best models at the end of the process and have to train them again with the selected best parameters.

- There is no ongoing output from the function, and it might take a while to run. In some cases, ongoing output is required by external monitoring systems.

In this article I will show a few tricks to make CV more usable.

Callback Functions

 

Callback functions were designed as a way to extend the CV function. They are invoked inside its internal loop. There are several built-in callbacks, and you can design your own.

xgb.callback.print_evaluation(period=n, show_stdv=True) is an example of a built-in callback function. It prints the CV metrics every n rounds, providing ongoing, log-style output.

According to the XGBoost documentation, the early_stopping_rounds and verbose/verbose_eval parameters are also implemented internally via callback functions and can be passed explicitly in the form of callbacks.
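For instance, with the callback-based API used throughout this article (XGBoost versions before 1.3), early stopping and evaluation printing could be requested explicitly as callbacks, roughly like this (a sketch; argument names may differ slightly between versions):

cv_results = xgb.cv(params,
                    dtrain,
                    num_boost_round=num_boost_round,
                    nfold=5,
                    seed=42,
                    callbacks=[xgb.callback.print_evaluation(period=10, show_stdv=True),
                               xgb.callback.early_stop(stopping_rounds=100, maximize=True)])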

 

Custom Callback Functions

 

A custom callback is implemented as an ordinary Python function. I use OUT parameters (mutable list arguments) to return something from each iteration. In the example below I extract the best models and the train/test scores of every fold model at each step.

A best model can be one whose score is greater than at the previous step (e.g., AUC) or smaller (e.g., MAE or RMSE). That is why my function has one more Boolean parameter, maximize. Based on this parameter, I choose the best score and model.

def cv_misc_callback(oof_train_scores: list, oof_valid_scores: list, best_models: list, maximize=True):

The function body is very short.

It starts with an instruction on when to run it, before or after each iteration:

callback.before_iteration = False

and the return statement simply returns the nested callback function:

return callback

However, there are two nested functions:

init initializes the variables that hold the best scores.

callback does the real job: it calls init once to initialize the best scores, selects the best score (maximum or minimum) and the corresponding model, and appends the best models and the per-fold train/valid scores at each step to the output parameters.

All information is extracted from the env variable (xgboost.core.XGBoostCallbackEnv) that is passed to callback functions. It has a complex structure:

XGBoostCallbackEnv(
    model=None,
    cvfolds=[<xgboost.training.CVPack object at 0x7f29bfb65610>,
             <xgboost.training.CVPack object at 0x7f29bfb65040>,
             <xgboost.training.CVPack object at 0x7f29bfb65850>,
             . . . the number of CVPack objects corresponds to the number of folds . . .
             <xgboost.training.CVPack object at 0x7f29c2191a90>],
    iteration=0,
    begin_iteration=0,
    end_iteration=5000,
    rank=0,
    evaluation_result_list=[('train-auc', 0.5, 0.0), ('test-auc', 0.5, 0.0)])

 

The CVPack object is what I use to extract the best models and scores. Each fold has its own CVPack object, and they are all available in the cvfolds list.

That is why there is a loop through all folds:

for i, cvpack in enumerate(env.cvfolds):

The best model is available in cvpack.bst, and it is extracted only when the fold score is better (lower or higher, depending on maximize) than the previous best.

The model scores for each fold are extracted with cvpack.eval(iteration=0, feval=None).

I use iteration=0 to get the scores; any other iteration value returns the same score. The feval parameter is only needed for a custom evaluation function.

The output is a string:

[0]  train-auc:0.659214  test-auc:0.637258

and a regular expression helps separate the training and testing scores.
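Putting it all together, here is a minimal sketch of cv_misc_callback, written against the pre-1.3 callback API described above. The regular expression, the state dictionary, and the way best_models is maintained are my reconstruction from the description, not the exact original code:

import re

def cv_misc_callback(oof_train_scores: list, oof_valid_scores: list, best_models: list, maximize=True):
    # OUT parameters: the lists are filled in place on every boosting iteration
    state = {}

    def init(env):
        n_folds = len(env.cvfolds)
        # worst possible starting score for each fold
        state['best_scores'] = [-float('inf') if maximize else float('inf')] * n_folds
        best_models.extend([None] * n_folds)

    def callback(env):
        if not state:
            init(env)
        train_scores, valid_scores = [], []
        for i, cvpack in enumerate(env.cvfolds):
            # eval() returns a string like '[0]  train-auc:0.659214  test-auc:0.637258'
            result = cvpack.eval(iteration=0, feval=None)
            train_score, valid_score = (float(s) for s in re.findall(r':([\d.]+)', result))
            train_scores.append(train_score)
            valid_scores.append(valid_score)
            improved = valid_score > state['best_scores'][i] if maximize else valid_score < state['best_scores'][i]
            if improved:
                state['best_scores'][i] = valid_score
                best_models[i] = cvpack.bst  # reference to the current best booster of this fold
        oof_train_scores.append(train_scores)
        oof_valid_scores.append(valid_scores)

    callback.before_iteration = False
    return callback

The lists passed in must be created (empty) before the CV call; they are filled in place and can later be turned into DataFrames for analysis.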

How to call a callback function

 

The CV function has a special parameter that accepts a list of callback functions (built-in or custom):

cv_results = xgb.cv(params,
                    d_train,
                    num_boost_round=num_boost_round,
                    nfold=nfold,
                    stratified=True,
                    shuffle=True,
                    early_stopping_rounds=early_stopping_rounds,
                    seed=42,
                    callbacks=[cv_misc_callback(oof_train_scores, oof_valid_scores, best_models, True),
                               xgb.callback.print_evaluation(period=10)])

. . .
[660]  train-auc:0.72809+0.00157  test-auc:0.68774+0.01521
[670]  train-auc:0.72862+0.00158  test-auc:0.68776+0.01517
[680]  train-auc:0.72917+0.00164  test-auc:0.68776+0.01515
[690]  train-auc:0.72989+0.00158  test-auc:0.68781+0.01518
[700]  train-auc:0.73049+0.00150  test-auc:0.68774+0.01520
[710]  train-auc:0.73104+0.00145  test-auc:0.68776+0.01520
[720]  train-auc:0.73165+0.00139  test-auc:0.68776+0.01530
[730]  train-auc:0.73226+0.00138  test-auc:0.68776+0.01523
[740]  train-auc:0.73283+0.00140  test-auc:0.68762+0.01517
[750]  train-auc:0.73344+0.00132  test-auc:0.68763+0.01524
[760]  train-auc:0.73400+0.00130  test-auc:0.68754+0.01519
[770]  train-auc:0.73462+0.00133  test-auc:0.68760+0.01510
[780]  train-auc:0.73520+0.00131  test-auc:0.68762+0.01522
[790]  train-auc:0.73584+0.00133  test-auc:0.68755+0.01530

 

The above is the output of xgb.callback.print_evaluation.

Standard vs Custom output

 

The CV function returns the mean and std of the train and test scores:

cv_results.tail()

 

     train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
691        0.729942       0.001573       0.687806      0.015197
692        0.729998       0.001572       0.687810      0.015194
693        0.730056       0.001574       0.687823      0.015194
694        0.730084       0.001564       0.687826      0.015202
695        0.730137       0.001560       0.687837      0.015183

 

Now, let us compare this to the content of oof_train_scores and oof_valid_scores collected by cv_misc_callback.
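The callback fills plain Python lists, one row per boosting round with one score per fold; converting them to DataFrames makes the comparison easier. This conversion step is not shown in the original snippet, but the df_ names below are the ones used in the rest of the article:

import pandas as pd

df_oof_train_scores = pd.DataFrame(oof_train_scores)
df_oof_valid_scores = pd.DataFrame(oof_valid_scores)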

df_oof_train_scores.tail()

 

            0         1         2         3         4         5         6         7         8         9
791  0.735064  0.735104  0.736658  0.735220  0.734851  0.734872  0.739483  0.736426  0.735215  0.735818
792  0.735095  0.735177  0.736683  0.735259  0.734865  0.734899  0.739491  0.736517  0.735308  0.735933
793  0.735140  0.735224  0.736726  0.735302  0.734933  0.734914  0.739524  0.736520  0.735352  0.735974
794  0.735178  0.735296  0.736740  0.735449  0.734973  0.734960  0.739541  0.736532  0.735376  0.736010
795  0.735239  0.735383  0.736768  0.735530  0.734995  0.735020  0.739602  0.736540  0.735416  0.736025

 

df_oof_valid_scores.tail()

 

            0         1         2         3         4         5         6         7         8         9
791  0.705811  0.693888  0.674917  0.679050  0.714809  0.694630  0.657618  0.688595  0.683779  0.682329
792  0.705817  0.693903  0.674868  0.679003  0.714863  0.694604  0.657607  0.688551  0.683657  0.682485
793  0.705779  0.693944  0.674881  0.678924  0.714802  0.694577  0.657672  0.688561  0.683647  0.682396
794  0.705714  0.693914  0.674875  0.678910  0.714855  0.694593  0.657623  0.688544  0.683658  0.682407
795  0.705704  0.693926  0.674860  0.678882  0.714846  0.694548  0.657598  0.688539  0.683622  0.682378

 

First, there are separate scores for each fold. We need to calculate the mean and std ourselves, but this also allows us to run a t-test to compare results between different experiments.

There are also 100 (early_stopping_rounds) more rows than in the standard output (cv_results), the same number as in the output from xgb.callback.print_evaluation, which simply does not print every round. The standard CV output (cv_results) does not keep anything beyond the best iteration. If we did not use early_stopping_rounds, we would get the same number of iterations in both outputs.

Let us calculate the mean, sem, and std from the custom output and find the first best valid score:
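A minimal sketch of that calculation (the df_scores name is my own):

df_scores = pd.DataFrame({
    'train-auc-mean': df_oof_train_scores.mean(axis=1),
    'train-auc-sem':  df_oof_train_scores.sem(axis=1),
    'train-auc-std':  df_oof_train_scores.std(axis=1),
    'valid-auc-mean': df_oof_valid_scores.mean(axis=1),
    'valid-auc-sem':  df_oof_valid_scores.sem(axis=1),
    'valid-auc-std':  df_oof_valid_scores.std(axis=1),
})
best_iteration = df_scores['valid-auc-mean'].idxmax()   # AUC: higher is better
df_scores.loc[[best_iteration]]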

 

     train-auc-mean  train-auc-sem  train-auc-std  valid-auc-mean  valid-auc-sem  valid-auc-std
695        0.730137        0.00052       0.001645        0.687837       0.005061       0.016004

 

The best iteration is the same as in the standard output and the means match, but the std is different! If you recalculate the std yourself from the detailed data, it looks as if the mean is included in the std calculation.
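A quick numeric check, using the valid scores at iteration 695 shown in detail below, makes the relationship visible; this sketch only compares the numbers, it does not inspect XGBoost's internal code:

import numpy as np

fold_scores = df_oof_valid_scores.iloc[695].values

print(fold_scores.std(ddof=1))                                  # ~0.016004, the recalculated (sample) std
print(np.append(fold_scores, fold_scores.mean()).std(ddof=1))   # ~0.015183, std with the mean included
print(fold_scores.std(ddof=0))                                  # ~0.015183, population std gives the same value as the CV output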

df_oof_train_scores.iloc[[695]]

 

            0         1         2         3         4         5         6         7         8         9       std      sem      mean
695  0.729312  0.728186  0.731243  0.729432  0.728717  0.729594  0.733679  0.731821  0.729852  0.729536  0.001645  0.00052  0.730137

 

df_oof_valid_scores.iloc[[695]]

 

            0         1         2        3         4         5         6         7         8         9       std       sem      mean
695  0.705288  0.694344  0.675143  0.67969  0.714451  0.695773  0.657519  0.689098  0.684439  0.682625  0.016004  0.005061  0.687837

 

The full version of the code is available here.

Conclusion

 

Callback functions in the XGBoost CV function come in handy when you need something more than the standard output. One example is implementing hyperparameter tuning in AWS SageMaker. SageMaker lets you set up Bayesian or random optimization with a few lines of code or through the UI, but there is no built-in way to run cross-validation inside each separate training job. Open-source XGBoost allows a custom training script, where the CV function can be used. However, the AWS SageMaker monitoring system needs ongoing output (logs), and even the standard callback does not produce output in a format suitable for AWS monitoring: the monitoring system expects "valid", not "test", in the output. A custom callback function can write the log in an acceptable format. Just add a properly formatted print statement inside the custom callback and do not use the standard one.
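For illustration, a print statement along these lines inside the callback body would do it; the exact format has to match the metric regular expressions you define for the SageMaker training job, so treat this only as a sketch:

# inside callback(env), after the per-fold scores for this round are collected
print('[{}]  train-auc:{:.6f}  valid-auc:{:.6f}'.format(
    env.iteration,
    sum(train_scores) / len(train_scores),
    sum(valid_scores) / len(valid_scores)))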