There is always a bit of luck involved when selecting parameters for Machine Learning model training. Lately, I have been working with gradient boosted trees, XGBoost in particular. We are using XGBoost in the enterprise to automate repetitive human tasks. While training ML models with XGBoost, I developed a pattern for choosing parameters, which helps me build new models more quickly. I will share it in this post; hopefully you will find it useful too.
I’m using the Pima Indians Diabetes Database for training; the CSV data can be downloaded from here.
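Loading the data could look roughly like this (a minimal sketch; the file name, the missing header row, the column names, and the 80/20 split are my assumptions about the downloaded CSV, not taken from the original notebook):

```python
# Minimal data loading sketch; file name and column names are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

columns = ['pregnancies', 'glucose', 'blood_pressure', 'skin_thickness',
           'insulin', 'bmi', 'diabetes_pedigree', 'age', 'outcome']
df = pd.read_csv('pima-indians-diabetes.csv', header=None, names=columns)

# Features and the binary target (1 = diabetes, 0 = no diabetes)
X = df.drop('outcome', axis=1)
y = df['outcome']

# Hold out part of the data so training can be evaluated on unseen records
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```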
This is the Python code that runs the XGBoost training step and builds a model. Training is executed by passing pairs of train/test data; this helps to evaluate training quality on the fly during model construction:
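The actual notebook is linked under Resources. A minimal sketch of such a training step, under my assumptions about max_depth, subsample, and the evaluation metrics, could look like this (recent XGBoost versions take early_stopping_rounds and eval_metric in the constructor, older ones in fit()):

```python
# Sketch of the training step: fit an XGBoost classifier while tracking
# metrics on a train/test pair passed via eval_set.
from xgboost import XGBClassifier

def train_model(X_train, y_train, X_test, y_test,
                n_estimators, learning_rate, early_stopping_rounds):
    model = XGBClassifier(
        objective='binary:logistic',   # binary classification
        max_depth=3,                   # assumed value, tune for your task
        subsample=0.8,                 # assumed value
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        early_stopping_rounds=early_stopping_rounds,
        eval_metric=['error', 'logloss'])

    # Passing both sets lets XGBoost report metrics for train and test
    # on every boosting round, which we plot later.
    model.fit(X_train, y_train,
              eval_set=[(X_train, y_train), (X_test, y_test)],
              verbose=False)
    return model
```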
Key parameters in XGBoost (the ones that affect model quality the most), assuming you have already selected max_depth (the more complex the classification task, the deeper the tree), subsample (equal to the evaluation data percentage), and objective (the classification algorithm):
- n_estimators — the number of boosting rounds XGBoost will run
- learning_rate — the learning speed (how strongly each new round corrects the previous ones)
- early_stopping_rounds — overfitting prevention: stop early if there is no improvement in learning for this many rounds
With the matplotlib library we can plot the training results for each run (from the XGBoost output). This helps to understand whether the iteration chosen to build the model was the best one possible. Here we use the sklearn library to evaluate model accuracy and then plot the training results with matplotlib:
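A sketch of that evaluation and plotting step, assuming the model was fit as in the sketch above with two eval sets and the 'error' and 'logloss' metrics (the exact layout of the plots in the post may differ):

```python
# Sketch: compute accuracy with sklearn and plot per-iteration metrics
# recorded by XGBoost for each eval_set entry.
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

def evaluate_and_plot(model, X_test, y_test):
    # Overall accuracy on the held-out data
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    print('Accuracy: %.2f%%' % (accuracy * 100.0))

    # evals_result() holds one series per metric per eval set:
    # validation_0 = train pair, validation_1 = test pair
    results = model.evals_result()
    epochs = range(len(results['validation_0']['error']))

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(epochs, results['validation_0']['logloss'], label='Train')
    ax1.plot(epochs, results['validation_1']['logloss'], label='Test')
    ax1.set_title('XGBoost Log Loss')
    ax1.legend()

    ax2.plot(epochs, results['validation_0']['error'], label='Train')
    ax2.plot(epochs, results['validation_1']['error'], label='Test')
    ax2.set_title('XGBoost Classification Error')
    ax2.legend()
    plt.show()
```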
Let me describe my approach to selecting the parameters (n_estimators, learning_rate, early_stopping_rounds) for XGBoost training.
Step 1. Start with what you feel works best based on your experience or what makes sense
- n_estimators = 300
- learning_rate = 0.01
- early_stopping_rounds = 10
- Stop iteration = 237
- Accuracy = 78.35%
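With the hypothetical helpers sketched above, this first attempt is just:

```python
# First attempt, using the train_model()/evaluate_and_plot() sketches above
model = train_model(X_train, y_train, X_test, y_test,
                    n_estimators=300, learning_rate=0.01,
                    early_stopping_rounds=10)
evaluate_and_plot(model, X_test, y_test)
```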
With the first attempt, we already get good results for the Pima Indians Diabetes dataset. Training stopped at iteration 237, and the classification error plot shows a low error rate around that iteration. This means a learning rate of 0.01 is suitable for this dataset, and early stopping after 10 rounds (stop if the result doesn’t improve in the next 10 iterations) works.
Step 2. Experiment with the learning rate: try a smaller learning rate and increase the number of learning iterations
- n_estimators = 500
- learning_rate = 0.001
- early_stopping_rounds = 10
- Stop iteration = didn’t stop, spent all 500 iterations
- Accuracy = 77.56%
A smaller learning rate didn’t work for this dataset. The classification error barely changes, and the XGBoost log loss doesn’t stabilize even after 500 iterations.
Step 3. Try to increase the learning rate.
- n_estimators = 300
- learning_rate = 0.1
- early_stopping_rounds = 10
- Stop iteration = 27
- Accuracy = 76.77%
With the increased learning rate, the algorithm learns quicker and stops as early as iteration 27. The XGBoost log loss stabilizes, but the overall classification accuracy is not ideal.
Step 4. Select the optimal learning rate from the first step and increase early_stopping_rounds (to give the algorithm more chances to find a better result).
- n_estimators = 300
- learning_rate = 0.01
- early_stopping_rounds = 15
- Stop iteration = 265
- Accuracy = 78.74%
A slightly better result is produced, with 78.74% accuracy; this is visible in the classification error plot.
Resources:
- Jupyter notebook on GitHub
- Blog post — Jupyter Notebook — Forget CSV, fetch data from DB with Python
- Blog post — Avoid Overfitting By Early Stopping With XGBoost In Python