Wednesday, March 13, 2019

Selecting Optimal Parameters for XGBoost Model Training

There is always a bit of luck involved when selecting parameters for Machine Learning model training. Lately I have been working with gradient boosted trees, and XGBoost in particular. We are using XGBoost in the enterprise to automate repetitive human tasks. While training ML models with XGBoost, I developed a pattern for choosing parameters which helps me build new models quicker. I will share it in this post; hopefully you will find it useful too.

I’m using the Pima Indians Diabetes Database for the training; the CSV data can be downloaded from here.

This is the Python code which runs the XGBoost training step and builds a model. Training is executed by passing pairs of train/test data; this helps to evaluate training quality on the fly during model construction:
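
A minimal sketch of such a training step is given below. The CSV file name, the 80/20 train/test split, and the max_depth and subsample values are assumptions for illustration, and the snippet assumes xgboost 1.6 or newer, where eval_metric and early_stopping_rounds are constructor arguments:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Pima Indians Diabetes CSV: feature columns first, last column is the 0/1 outcome
    # (adjust the file name and header handling to the downloaded CSV)
    data = pd.read_csv("pima-indians-diabetes.csv")
    X = data.iloc[:, :-1]
    y = data.iloc[:, -1]

    # Hold out a test set so training quality can be evaluated on every boosting round
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=7, stratify=y)

    model = XGBClassifier(
        objective="binary:logistic",   # binary classification task
        max_depth=3,                   # assumed tree depth for this dataset
        subsample=0.8,                 # assumed row sampling fraction
        n_estimators=300,              # starting values, tuned in the steps below
        learning_rate=0.01,
        eval_metric=["logloss", "error"],
        early_stopping_rounds=10)

    # Passing both train and test pairs makes XGBoost report evaluation quality per round
    model.fit(X_train, y_train,
              eval_set=[(X_train, y_train), (X_test, y_test)],
              verbose=True)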

Key parameters in XGBoost (the ones which affect model quality greatly), assuming you have already selected max_depth (the more complex the classification task, the deeper the tree), subsample (equal to the evaluation data percentage) and objective (the classification algorithm):
  • n_estimators — the number of boosting rounds XGBoost will run
  • learning_rate — the learning speed (how much each new tree contributes to the model)
  • early_stopping_rounds — overfitting prevention: stop early if there is no improvement in learning
When model.fit is executed with verbose=True, the evaluation quality of each training round is printed out. At the end of the log you should see which iteration was selected as the best one. It might be that the number of training rounds is not enough to detect the best iteration; in that case XGBoost will select the last iteration to build the model.
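
The chosen round can also be read programmatically; a one-line sketch, assuming the model object from the snippet above:

    # Boosting round with the best score on the last eval_set pair
    print("Best iteration:", model.best_iteration)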

With the matplotlib library we can plot the training results for each round (from the XGBoost output). This helps to understand whether the iteration chosen to build the model was the best one possible. Here we are using the sklearn library to evaluate model accuracy and then plotting the training results with matplotlib:
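
A sketch of such an evaluation and plotting step, reusing the model and test split from the training snippet above:

    import matplotlib.pyplot as plt
    from sklearn.metrics import accuracy_score

    # Accuracy on the held-out test data
    y_pred = model.predict(X_test)
    print("Accuracy: %.2f%%" % (accuracy_score(y_test, y_pred) * 100.0))

    # Per-round metrics collected by XGBoost for each eval_set pair
    results = model.evals_result()
    epochs = range(len(results["validation_0"]["logloss"]))

    # Log loss per training round (train vs. test)
    fig, ax = plt.subplots()
    ax.plot(epochs, results["validation_0"]["logloss"], label="Train")
    ax.plot(epochs, results["validation_1"]["logloss"], label="Test")
    ax.legend()
    ax.set_ylabel("Log Loss")
    ax.set_title("XGBoost Log Loss")

    # Classification error per training round (train vs. test)
    fig, ax = plt.subplots()
    ax.plot(epochs, results["validation_0"]["error"], label="Train")
    ax.plot(epochs, results["validation_1"]["error"], label="Test")
    ax.legend()
    ax.set_ylabel("Classification Error")
    ax.set_title("XGBoost Classification Error")

    plt.show()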

Let me describe my approach to selecting the parameters (n_estimators, learning_rate, early_stopping_rounds) for XGBoost training.
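
Each step only changes n_estimators, learning_rate and early_stopping_rounds, so the experiments can be wrapped in a small helper. The name train_and_report is hypothetical; the sketch reuses the data split and the assumed max_depth/subsample values from the training snippet above:

    from sklearn.metrics import accuracy_score
    from xgboost import XGBClassifier

    def train_and_report(n_estimators, learning_rate, early_stopping_rounds):
        # Retrain with one parameter triple and report the stop iteration and accuracy
        model = XGBClassifier(
            objective="binary:logistic",
            max_depth=3,
            subsample=0.8,
            n_estimators=n_estimators,
            learning_rate=learning_rate,
            eval_metric=["logloss", "error"],
            early_stopping_rounds=early_stopping_rounds)
        model.fit(X_train, y_train,
                  eval_set=[(X_train, y_train), (X_test, y_test)],
                  verbose=False)
        accuracy = accuracy_score(y_test, model.predict(X_test))
        print("Stop iteration = %d, Accuracy = %.2f%%"
              % (model.best_iteration, accuracy * 100.0))
        return model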

Step 1. Start with what you feel works best based on your experience or what makes sense
  • n_estimators = 300 
  • learning_rate = 0.01 
  • early_stopping_rounds = 10 
Results:
  • Stop iteration = 237 
  • Accuracy = 78.35% 
Results plot:


With the first attempt we already get good results for the Pima Indians Diabetes dataset. Training was stopped at iteration 237. The classification error plot shows a lower error rate around iteration 237. This means a learning rate of 0.01 is suitable for this dataset, and early stopping after 10 iterations (training stops if the result doesn’t improve in the next 10 iterations) works.

Step 2. Experiment with the learning rate: try a smaller learning rate and increase the number of learning iterations
  • n_estimators = 500 
  • learning_rate = 0.001 
  • early_stopping_rounds = 10 
Results:
  • Stop iteration = didn’t stop, spent all 500 iterations 
  • Accuracy = 77.56% 
Results plot:


The smaller learning rate didn’t work for this dataset. The classification error hardly changes, and the XGBoost log loss doesn’t stabilize even after 500 iterations.

Step 3. Try to increase the learning rate.
  • n_estimators = 300 
  • learning_rate = 0.1 
  • early_stopping_rounds = 10 
Results:
  • Stop iteration = 27 
  • Accuracy = 76.77% 
Results plot:


With the increased learning rate, the algorithm learns quicker and stops already at iteration 27. The XGBoost log loss is stabilizing, but the overall classification accuracy is not ideal.

Step 4. Select the optimal learning rate from the first step and increase early stopping (to give the algorithm more chances to find a better result).
  • n_estimators = 300 
  • learning_rate = 0.01 
  • early_stopping_rounds = 15 
Results:
  • Stop iteration = 265 
  • Accuracy = 78.74% 
Results plot:


A slightly better result of 78.74% accuracy is produced; this is also visible in the classification error plot.
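
The four experiments above, expressed as calls to the hypothetical train_and_report helper; the exact stop iterations and accuracies depend on the train/test split:

    # (n_estimators, learning_rate, early_stopping_rounds) for Steps 1-4
    for params in [(300, 0.01, 10), (500, 0.001, 10), (300, 0.1, 10), (300, 0.01, 15)]:
        train_and_report(*params)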
