Wednesday, March 13, 2019

Selecting Optimal Parameters for XGBoost Model Training

There is always a bit of luck involved when selecting parameters for Machine Learning model training. Lately, I have been working with gradient boosted trees, and XGBoost in particular. We use XGBoost in the enterprise to automate repetitive human tasks. While training ML models with XGBoost, I created a pattern for choosing parameters, which helps me build new models quicker. I will share it in this post; hopefully you will find it useful too.

I’m using Pima Indians Diabetes Database for the training, CSV data can be downloaded from here.

This is the Python code which runs the XGBoost training step and builds a model. Training is executed by passing pairs of train/test data; this helps to evaluate training quality on the fly during model construction:

Key parameters in XGBoost (the ones which affect model quality the most), assuming you have already selected max_depth (the more complex the classification task, the deeper the tree), subsample (equal to the evaluation data percentage), and objective (the classification algorithm):
  • n_estimators — the number of boosting rounds XGBoost will run 
  • learning_rate — the learning speed; how much each round contributes to the model 
  • early_stopping_rounds — overfitting prevention; stop early if the evaluation metric does not improve for this many rounds 
When model.fit is executed with verbose=True, you will see the evaluation quality of each training round printed out. At the end of the log, you should see which iteration was selected as the best one. It might be that the number of training rounds is not enough to detect the best iteration; in that case, XGBoost will select the last iteration to build the model.

With the matplotlib library we can plot training results for each round (from the XGBoost output). This helps to understand whether the iteration which was chosen to build the model was the best one possible. Here we are using the sklearn library to evaluate model accuracy and then plotting training results with matplotlib:

Let’s describe my approach to select parameters (n_estimators, learning_rate, early_stopping_rounds) for XGBoost training.

Step 1. Start with what you feel works best based on your experience or what makes sense
  • n_estimators = 300 
  • learning_rate = 0.01 
  • early_stopping_rounds = 10 
Results:
  • Stop iteration = 237 
  • Accuracy = 78.35% 
Results plot:


With the first attempt, we already get good results for the Pima Indians Diabetes dataset. Training was stopped at iteration 237. The classification error plot shows a lower error rate around iteration 237. This means a learning rate of 0.01 is suitable for this dataset, and early stopping after 10 iterations (stop if the result doesn't improve in the next 10 iterations) works.

Step 2. Experiment with the learning rate: try a smaller learning rate and increase the number of learning iterations
  • n_estimators = 500 
  • learning_rate = 0.001 
  • early_stopping_rounds = 10 
Results:
  • Stop iteration = didn’t stop, spent all 500 iterations 
  • Accuracy = 77.56% 
Results plot:


The smaller learning rate didn't work for this dataset. Classification error barely changes, and the XGBoost log loss doesn't stabilize even with 500 iterations.

Step 3. Try to increase the learning rate.
  • n_estimators = 300 
  • learning_rate = 0.1 
  • early_stopping_rounds = 10 
Results:
  • Stop iteration = 27 
  • Accuracy = 76.77% 
Results plot:


With the increased learning rate, the algorithm learns quicker; it already stops at iteration 27. The XGBoost log loss error is stabilizing, but the overall classification accuracy is not ideal.

Step 4. Select the optimal learning rate from the first step and increase early stopping (to give the algorithm more chances to find a better result).
  • n_estimators = 300 
  • learning_rate = 0.01 
  • early_stopping_rounds = 15 
Results:
  • Stop iteration = 265 
  • Accuracy = 78.74% 
Results plot:


A slightly better result is produced with 78.74% accuracy — this is visible in the classification error plot.


Wednesday, March 6, 2019

Prepare Your Data for Machine Learning Training

The process of preparing data for Machine Learning model training looks to me somewhat similar to preparing food ingredients to cook dinner. In both cases it takes time, but then you are rewarded with a tasty dinner or a great ML model.

I will not be diving into data science here or discussing how to structure and transform data. It all depends on the use case, and there are many ways to reformat data to get the most out of it. Instead I will focus on a simple but practical example — how to split data into training and test datasets with Python.

Make sure to check my previous post; today's example is based on a notebook from that post — Jupyter Notebook — Forget CSV, fetch data from DB with Python. It explains how to load data from a DB and construct a data frame.

This Python code snippet builds train/test datasets:
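The notebook's snippet is not included in this copy; below is a minimal sketch of the idea, with a small synthetic data frame standing in for the data loaded from the DB (column names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data frame; in the post the data comes from a database
df = pd.DataFrame({
    "feature_a": range(100),
    "feature_b": [v % 7 for v in range(100)],
    "label":     [v % 2 for v in range(100)],
})

# X: decision-producing columns, Y: the encoded decision
X = df[["feature_a", "feature_b"]]
Y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, stratify=Y, random_state=17)

# Print shapes for a quick sanity check
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```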

The first step is to assign X and Y. The data columns assigned to the X array are the ones which produce the decision encoded in the Y array. We assign X and Y by extracting columns from the data frame.

In the next step, train X/Y and test X/Y sets are constructed by the train_test_split function from the sklearn module. You must import this function in the Python script:

from sklearn.model_selection import train_test_split

One of the parameters of the train_test_split function is test_size. This parameter controls the proportion of the test dataset taken from the entire dataset (~30% in this example).

The stratify parameter enforces a proportional distribution of Y classes across the train and test datasets.

The random_state parameter ensures the data split will be the same in the next run too. To change the split, it is enough to change this parameter's value.
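This determinism is easy to see with a tiny example (illustrative values only):

```python
from sklearn.model_selection import train_test_split

data = list(range(10))
a_train, a_test = train_test_split(data, test_size=0.3, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.3, random_state=42)
c_train, c_test = train_test_split(data, test_size=0.3, random_state=7)

# Same random_state -> identical split; a different value -> a different split
assert a_test == b_test
print(a_test, c_test)
```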

The train_test_split function returns four arrays. The train X/Y and test X/Y pairs can be used to train and test the ML model. Dataset shapes and structure can be printed out too, for convenience.

Sample Jupyter notebook available on GitHub. Sample credentials JSON file.