How to get more information from the XGBoost CV function

 

Introduction

 

Cross-validation is a well-known method to estimate how a model will perform on unseen data. There are different cross-validation schemes: holdout, K-fold cross-validation, leave-one-out cross-validation. The general idea is the same: partition the available dataset into training and validation subsets several times, train and validate a model on each pair of subsets, and average the results. The reduced size of the training data is compensated by validating on data the model has not seen during training.

The process can be implemented as a loop that splits the data, trains and validates a model on each split, and calculates the mean, std and/or sem of the evaluation metric. It becomes more challenging when a straightforward loop is not an option: the dataset is huge, or the models are trained in parallel in a cloud service such as AWS SageMaker.
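For reference, a straightforward loop of this kind might look like the sketch below. It assumes a feature DataFrame X, a binary target Series y, and a params dictionary already exist; all names are illustrative.

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

fold_scores = []
for train_idx, valid_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(X, y):
    # build DMatrix objects for this split
    dtrain = xgb.DMatrix(X.iloc[train_idx], label=y.iloc[train_idx])
    dvalid = xgb.DMatrix(X.iloc[valid_idx], label=y.iloc[valid_idx])
    booster = xgb.train(params, dtrain, num_boost_round=100)
    # validate on the held-out fold
    fold_scores.append(roc_auc_score(y.iloc[valid_idx], booster.predict(dvalid)))

print(np.mean(fold_scores), np.std(fold_scores, ddof=1), pd.Series(fold_scores).sem())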

The XGBoost CV function does all of this in a single call. You do not need to split the dataset into folds, train the models, or average the results yourself.

cv_results = xgb.cv(params,
                    dtrain,
                    num_boost_round=num_boost_round,
                    seed=42,
                    nfold=5,
                    early_stopping_rounds=100)

cv_results

 

     train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
0          0.500000       0.000000       0.500000      0.000000
1          0.500000       0.000000       0.500000      0.000000
2          0.500000       0.000000       0.500000      0.000000
3          0.500000       0.000000       0.500000      0.000000
4          0.500000       0.000000       0.500000      0.000000
..              ...            ...            ...           ...
691        0.729942       0.001573       0.687806      0.015197
692        0.729998       0.001572       0.687810      0.015194
693        0.730056       0.001574       0.687823      0.015194
694        0.730084       0.001564       0.687826      0.015202
695        0.730137       0.001560       0.687837      0.015183

 

CV returns a table with the mean and std of the evaluation metric across all folds, for both the train and test sets. Each row corresponds to a boosting iteration.
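For example, the best boosting round and its score can be read from this table directly; a small sketch for the AUC metric used above (variable names are illustrative):

best_round = cv_results['test-auc-mean'].idxmax()   # AUC: higher is better
best_score = cv_results.loc[best_round, 'test-auc-mean']
print(best_round, best_score)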

 

The drawbacks of the function:

 

- We do not get separate results for each model trained/validated on a specific fold; there are only the mean and std. We may need them to analyze the results with a t-test or to use the models' outputs in a pipeline/model stacking.

- The std reported by CV may be wrong. I observed incorrect AUC std values in several XGBoost versions; it looks as if the mean is calculated first and then included in the std calculation.

- We may need the sem instead of, or along with, the std.

- We do not get the best models at the end of the process and have to train them again with the selected best parameters.

- There is no ongoing output from the function, and it might take a while to run. In some cases, ongoing output is required by external monitoring systems.

In this article I will show a few tricks to make CV more usable.

Callback Functions

 

Callback functions were designed as a way to extend the CV function. They are invoked inside its internal loop. There are several built-in callbacks, and you can design your own.

xgb.callback.print_evaluation(period=n, show_stdv=True) is an example of a built-in callback function. It prints the CV metrics every n rounds, providing ongoing, log-style output.

According to the XGBoost documentation, the early_stopping_rounds and verbose/verbose_eval parameters are also implemented internally via callback functions and can be passed explicitly in the form of callbacks.
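For instance, with the callback-based API used throughout this article (XGBoost versions before 1.3), early stopping and evaluation printing could be requested explicitly as callbacks, roughly like this (a sketch; argument names may differ slightly between versions):

cv_results = xgb.cv(params,
                    dtrain,
                    num_boost_round=num_boost_round,
                    nfold=5,
                    seed=42,
                    callbacks=[xgb.callback.print_evaluation(period=10, show_stdv=True),
                               xgb.callback.early_stop(stopping_rounds=100, maximize=True)])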

 

Custom Callback Functions

 

A custom callback is implemented as an ordinary Python function. I use OUT parameters (mutable list arguments) to return something from each iteration. In the example below I extract the best models and the train/test scores of every fold model at each step.

A best model can be one whose score is greater than at the previous step (e.g., AUC) or smaller (e.g., MAE or RMSE). That is why my function has one more Boolean parameter, maximize. Based on this parameter, I choose the best score and model.

def cv_misc_callback(oof_train_scores: list, oof_valid_scores: list, best_models: list, maximize=True):

The function body is very short.

It starts with an instruction on when to run it, before or after each iteration:

callback.before_iteration = False

and the return statement simply returns the nested callback function:

return callback

However, there are two nested functions:

init initializes the variables that hold the best scores.

callback does the real job: it calls init once to initialize the best scores, selects the best score (maximum or minimum) and the corresponding model, and appends the best models and the per-fold train/valid scores at each step to the output parameters.

All information is extracted from the env variable (xgboost.core.XGBoostCallbackEnv) that is passed to callback functions. It has a complex structure:

XGBoostCallbackEnv(
    model=None,
    cvfolds=[<xgboost.training.CVPack object at 0x7f29bfb65610>,
             <xgboost.training.CVPack object at 0x7f29bfb65040>,
             <xgboost.training.CVPack object at 0x7f29bfb65850>,
             . . . the number of CVPack objects corresponds to the number of folds . . .
             <xgboost.training.CVPack object at 0x7f29c2191a90>],
    iteration=0,
    begin_iteration=0,
    end_iteration=5000,
    rank=0,
    evaluation_result_list=[('train-auc', 0.5, 0.0), ('test-auc', 0.5, 0.0)])

 

The CVPack object is what I use to extract the best models and scores. Each fold has its own CVPack object, and they are all available in the cvfolds list.

That is why there is a loop through all folds:

for i, cvpack in enumerate(env.cvfolds):

The best model is available in cvpack.bst, and it is extracted only when the fold score is better (lower or higher, depending on maximize) than the previous best.

The model scores for each fold are extracted with cvpack.eval(iteration=0, feval=None).

I use iteration=0 to get the scores; any other iteration value returns the same score. The feval parameter is only needed for a custom evaluation function.

The output is a string:

[0]  train-auc:0.659214  test-auc:0.637258

and a regular expression helps separate the training and testing scores.
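Putting it all together, here is a minimal sketch of cv_misc_callback, written against the pre-1.3 callback API described above. The regular expression, the state dictionary, and the way best_models is maintained are my reconstruction from the description, not the exact original code:

import re

def cv_misc_callback(oof_train_scores: list, oof_valid_scores: list, best_models: list, maximize=True):
    # OUT parameters: the lists are filled in place on every boosting iteration
    state = {}

    def init(env):
        n_folds = len(env.cvfolds)
        # worst possible starting score for each fold
        state['best_scores'] = [-float('inf') if maximize else float('inf')] * n_folds
        best_models.extend([None] * n_folds)

    def callback(env):
        if not state:
            init(env)
        train_scores, valid_scores = [], []
        for i, cvpack in enumerate(env.cvfolds):
            # eval() returns a string like '[0]  train-auc:0.659214  test-auc:0.637258'
            result = cvpack.eval(iteration=0, feval=None)
            train_score, valid_score = (float(s) for s in re.findall(r':([\d.]+)', result))
            train_scores.append(train_score)
            valid_scores.append(valid_score)
            improved = valid_score > state['best_scores'][i] if maximize else valid_score < state['best_scores'][i]
            if improved:
                state['best_scores'][i] = valid_score
                best_models[i] = cvpack.bst  # reference to the current best booster of this fold
        oof_train_scores.append(train_scores)
        oof_valid_scores.append(valid_scores)

    callback.before_iteration = False
    return callback

The lists passed in must be created (empty) before the CV call; they are filled in place and can later be turned into DataFrames for analysis.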

How to call a callback function

 

The CV function has a special parameter that accepts a list of callback functions (built-in or custom):

cv_results = xgb.cv(params,
                    d_train,
                    num_boost_round=num_boost_round,
                    nfold=nfold,
                    stratified=True,
                    shuffle=True,
                    early_stopping_rounds=early_stopping_rounds,
                    seed=42,
                    callbacks=[cv_misc_callback(oof_train_scores, oof_valid_scores, best_models, True),
                               xgb.callback.print_evaluation(period=10)])

. . .
[660]  train-auc:0.72809+0.00157  test-auc:0.68774+0.01521
[670]  train-auc:0.72862+0.00158  test-auc:0.68776+0.01517
[680]  train-auc:0.72917+0.00164  test-auc:0.68776+0.01515
[690]  train-auc:0.72989+0.00158  test-auc:0.68781+0.01518
[700]  train-auc:0.73049+0.00150  test-auc:0.68774+0.01520
[710]  train-auc:0.73104+0.00145  test-auc:0.68776+0.01520
[720]  train-auc:0.73165+0.00139  test-auc:0.68776+0.01530
[730]  train-auc:0.73226+0.00138  test-auc:0.68776+0.01523
[740]  train-auc:0.73283+0.00140  test-auc:0.68762+0.01517
[750]  train-auc:0.73344+0.00132  test-auc:0.68763+0.01524
[760]  train-auc:0.73400+0.00130  test-auc:0.68754+0.01519
[770]  train-auc:0.73462+0.00133  test-auc:0.68760+0.01510
[780]  train-auc:0.73520+0.00131  test-auc:0.68762+0.01522
[790]  train-auc:0.73584+0.00133  test-auc:0.68755+0.01530

 

The above is the output of xgb.callback.print_evaluation.

Standard vs Custom output

 

The CV function returns the mean and std of the train and test scores:

cv_results.tail()

 

     train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
691        0.729942       0.001573       0.687806      0.015197
692        0.729998       0.001572       0.687810      0.015194
693        0.730056       0.001574       0.687823      0.015194
694        0.730084       0.001564       0.687826      0.015202
695        0.730137       0.001560       0.687837      0.015183

 

Now, let us compare this to the content of oof_train_scores and oof_valid_scores collected by cv_misc_callback.
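The callback fills plain Python lists, one row per boosting round with one score per fold; converting them to DataFrames makes the comparison easier. This conversion step is not shown in the original snippet, but the df_ names below are the ones used in the rest of the article:

import pandas as pd

df_oof_train_scores = pd.DataFrame(oof_train_scores)
df_oof_valid_scores = pd.DataFrame(oof_valid_scores)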

df_oof_train_scores.tail()

 

            0         1         2         3         4         5         6         7         8         9
791  0.735064  0.735104  0.736658  0.735220  0.734851  0.734872  0.739483  0.736426  0.735215  0.735818
792  0.735095  0.735177  0.736683  0.735259  0.734865  0.734899  0.739491  0.736517  0.735308  0.735933
793  0.735140  0.735224  0.736726  0.735302  0.734933  0.734914  0.739524  0.736520  0.735352  0.735974
794  0.735178  0.735296  0.736740  0.735449  0.734973  0.734960  0.739541  0.736532  0.735376  0.736010
795  0.735239  0.735383  0.736768  0.735530  0.734995  0.735020  0.739602  0.736540  0.735416  0.736025

 

df_oof_valid_scores.tail()

 

            0         1         2         3         4         5         6         7         8         9
791  0.705811  0.693888  0.674917  0.679050  0.714809  0.694630  0.657618  0.688595  0.683779  0.682329
792  0.705817  0.693903  0.674868  0.679003  0.714863  0.694604  0.657607  0.688551  0.683657  0.682485
793  0.705779  0.693944  0.674881  0.678924  0.714802  0.694577  0.657672  0.688561  0.683647  0.682396
794  0.705714  0.693914  0.674875  0.678910  0.714855  0.694593  0.657623  0.688544  0.683658  0.682407
795  0.705704  0.693926  0.674860  0.678882  0.714846  0.694548  0.657598  0.688539  0.683622  0.682378

 

First, there are separate scores for each fold. We need to calculate the mean and std ourselves, but this also allows us to run a t-test to compare results between different experiments.

There are also 100 (early_stopping_rounds) more rows than in the standard output (cv_results), the same number as in the output from xgb.callback.print_evaluation, which simply does not print every round. The standard CV output (cv_results) does not keep anything beyond the best iteration. If we did not use early_stopping_rounds, we would get the same number of iterations in both outputs.

Let us calculate the mean, sem, and std from the custom output and find the first best valid score:
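A minimal sketch of that calculation (the df_scores name is my own):

df_scores = pd.DataFrame({
    'train-auc-mean': df_oof_train_scores.mean(axis=1),
    'train-auc-sem':  df_oof_train_scores.sem(axis=1),
    'train-auc-std':  df_oof_train_scores.std(axis=1),
    'valid-auc-mean': df_oof_valid_scores.mean(axis=1),
    'valid-auc-sem':  df_oof_valid_scores.sem(axis=1),
    'valid-auc-std':  df_oof_valid_scores.std(axis=1),
})
best_iteration = df_scores['valid-auc-mean'].idxmax()   # AUC: higher is better
df_scores.loc[[best_iteration]]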

 

     train-auc-mean  train-auc-sem  train-auc-std  valid-auc-mean  valid-auc-sem  valid-auc-std
695        0.730137        0.00052       0.001645        0.687837       0.005061       0.016004

 

The best iteration is the same as in the standard output and the means match, but the std is different! If you recalculate the std yourself from the detailed data, it looks as if the mean is included in the std calculation.
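A quick numeric check, using the valid scores at iteration 695 shown in detail below, makes the relationship visible; this sketch only compares the numbers, it does not inspect XGBoost's internal code:

import numpy as np

fold_scores = df_oof_valid_scores.iloc[695].values

print(fold_scores.std(ddof=1))                                  # ~0.016004, the recalculated (sample) std
print(np.append(fold_scores, fold_scores.mean()).std(ddof=1))   # ~0.015183, std with the mean included
print(fold_scores.std(ddof=0))                                  # ~0.015183, population std gives the same value as the CV output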

df_oof_train_scores.iloc[[695]]

 

            0         1         2         3         4         5         6         7         8         9       std      sem      mean
695  0.729312  0.728186  0.731243  0.729432  0.728717  0.729594  0.733679  0.731821  0.729852  0.729536  0.001645  0.00052  0.730137

 

df_oof_valid_scores.iloc[[695]]

 

            0         1         2        3         4         5         6         7         8         9       std       sem      mean
695  0.705288  0.694344  0.675143  0.67969  0.714451  0.695773  0.657519  0.689098  0.684439  0.682625  0.016004  0.005061  0.687837

 

The full version of the code is available here.

Conclusion

 

Callback functions in the XGBoost CV function come in handy when you need something more than the standard output. One example is implementing hyperparameter tuning in AWS SageMaker. SageMaker lets you set up Bayesian or random optimization with a few lines of code or through the UI, but there is no built-in way to run cross-validation inside each separate training job. Open-source XGBoost allows a custom training script, where the CV function can be used. However, the AWS SageMaker monitoring system needs ongoing output (logs), and even the standard callback does not produce output in a format suitable for AWS monitoring: the monitoring system expects "valid", not "test", in the output. A custom callback function can write the log in an acceptable format. Just add a properly formatted print statement inside the custom callback and do not use the standard one.
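For illustration, a print statement along these lines inside the callback body would do it; the exact format has to match the metric regular expressions you define for the SageMaker training job, so treat this only as a sketch:

# inside callback(env), after the per-fold scores for this round are collected
print('[{}]  train-auc:{:.6f}  valid-auc:{:.6f}'.format(
    env.iteration,
    sum(train_scores) / len(train_scores),
    sum(valid_scores) / len(valid_scores)))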