There is always a bit of luck involved when selecting parameters for Machine Learning model training. Lately, I have been working with gradient boosted trees, and XGBoost in particular. We are using XGBoost in the enterprise to automate repetitive human tasks. While training ML models with XGBoost, I developed a pattern for choosing parameters that helps me build new models more quickly. I will share it in this post; hopefully you will find it useful too.
I'm using the Pima Indians Diabetes Database for training; the CSV data can be downloaded from here.
This is the Python code that runs the XGBoost training step and builds a model. Training is executed by passing pairs of train/test data, which helps to evaluate training quality on the fly during model construction:
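Something along these lines, a minimal sketch using the XGBoost sklearn wrapper (the CSV file name, the column layout, and the max_depth/subsample values are assumptions here, not taken from the original notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Load the Pima dataset; file name and column layout are assumptions.
data = pd.read_csv("pima-indians-diabetes.csv", header=None)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Hold out part of the data to evaluate training quality during fit().
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

# The key parameters discussed below; max_depth, subsample and objective
# are assumed to be fixed in advance.
model = XGBClassifier(
    objective="binary:logistic",
    max_depth=3,
    subsample=0.8,
    n_estimators=300,
    learning_rate=0.01,
)

# Passing the train/test pair via eval_set makes XGBoost report log loss and
# classification error for both sets at every boosting round.
# Note: in recent XGBoost versions, eval_metric and early_stopping_rounds
# are passed to the XGBClassifier constructor instead of fit().
model.fit(
    X_train,
    y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    eval_metric=["logloss", "error"],
    early_stopping_rounds=10,
    verbose=True,
)
```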
Key parameters in XGBoost (the ones that affect model quality the most), assuming you have already selected max_depth (the more complex the classification task, the deeper the tree), subsample (equal to the evaluation data percentage), and objective (the classification algorithm):
- n_estimators — the maximum number of boosting rounds XGBoost will run
- learning_rate — the learning speed (the step size shrinkage applied at each boosting round)
- early_stopping_rounds — overfitting prevention: stop training early if the evaluation result does not improve for the given number of rounds
With the matplotlib library we can plot the training results for each run (from the XGBoost output). This helps to understand whether the iteration chosen to build the model was the best one possible. Here we use the sklearn library to evaluate model accuracy and then plot the training results with matplotlib:
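A sketch of that evaluation and plotting step, reusing the model and test split from the training sketch above (the "logloss" and "error" keys match the eval_metric values passed to fit):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

# Accuracy of the trained model on the held-out test data.
predictions = model.predict(X_test)
print("Accuracy: %.2f%%" % (accuracy_score(y_test, predictions) * 100.0))

# Training history collected from the eval_set pairs during fit():
# validation_0 is the train set, validation_1 is the test set.
results = model.evals_result()
epochs = len(results["validation_0"]["error"])
x_axis = range(epochs)

# Log loss per boosting round.
fig, ax = plt.subplots()
ax.plot(x_axis, results["validation_0"]["logloss"], label="Train")
ax.plot(x_axis, results["validation_1"]["logloss"], label="Test")
ax.legend()
ax.set_ylabel("Log Loss")
ax.set_title("XGBoost Log Loss")
plt.show()

# Classification error per boosting round.
fig, ax = plt.subplots()
ax.plot(x_axis, results["validation_0"]["error"], label="Train")
ax.plot(x_axis, results["validation_1"]["error"], label="Test")
ax.legend()
ax.set_ylabel("Classification Error")
ax.set_title("XGBoost Classification Error")
plt.show()
```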
Let me describe my approach to selecting the parameters (n_estimators, learning_rate, early_stopping_rounds) for XGBoost training.
Step 1. Start with what you feel works best based on your experience or what makes sense
- n_estimators = 300
- learning_rate = 0.01
- early_stopping_rounds = 10
- Stop iteration = 237
- Accuracy = 78.35%
With the first attempt, we already get good results for the Pima Indians Diabetes dataset. Training stopped at iteration 237, and the classification error plot shows a lower error rate around that iteration. This means a learning rate of 0.01 is suitable for this dataset, and early stopping after 10 rounds (stop if the result doesn't improve in the next 10 iterations) works.
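If you want to read the stopping iteration and the best evaluation score directly rather than from the training log, the fitted sklearn wrapper exposes them when early stopping is used (a small sketch, assuming the model from the training code above):

```python
# Best round found on the last eval_set entry (the test pair) and its score.
print("Best iteration:", model.best_iteration)
print("Best score:", model.best_score)
```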
Step 2. Experiment with the learning rate: try a smaller learning rate and increase the number of learning iterations
- n_estimators = 500
- learning_rate = 0.001
- early_stopping_rounds = 10
- Stop iteration = didn’t stop, spent all 500 iterations
- Accuracy = 77.56%
The smaller learning rate didn't work for this dataset: the classification error barely changes, and the XGBoost log loss doesn't stabilize even after 500 iterations.
Step 3. Try to increase the learning rate.
- n_estimators = 300
- learning_rate = 0.1
- early_stopping_rounds = 10
- Stop iteration = 27
- Accuracy = 76.77%
With the increased learning rate, the algorithm learns quicker and stops as early as iteration 27. The XGBoost log loss stabilizes, but the overall classification accuracy is not ideal.
Step 4. Select the optimal learning rate from the first step and increase early_stopping_rounds (to give the algorithm more chances to find a better result).
- n_estimators = 300
- learning_rate = 0.01
- early_stopping_rounds = 15
- Stop iteration = 265
- Accuracy = 78.74%
A slightly better result is produced with 78.74% accuracy — this is visible in the classification error plot.
Resources:
- Jupyter notebook on GitHub
- Blog post — Jupyter Notebook — Forget CSV, fetch data from DB with Python
- Blog post — Avoid Overfitting By Early Stopping With XGBoost In Python