Machine learning is all about data. Training success depends greatly on how you transform the data and feed it into the ML algorithm. I will give an example based on date-type data, using the scenario described in my previous post, Machine Learning - Getting Data Into Right Shape. The scenario is about invoice risk: the model is trained to recognize when an invoice payment is at risk.
Dates are among the key attributes in invoice data: invoice date, payment due date and payment date. An ML algorithm expects numbers as training features; it can't operate on literals or dates directly. This is where data transformation comes in: from the original data we need to prepare data that the ML algorithm can understand.
How can we transform dates into numbers? One way is to split each date value into multiple numeric columns describing the original date (year, quarter, month, week, day of year, day of month, day of week). Will this work? To find out, we need to run training and validate the result.
Resources:
1. Sample Jupyter notebooks and datasets are available on my GitHub repo
2. I recommend reading this book - Machine Learning for Business
Two approaches:
1. Date feature transformation into multiple attributes
Example where date is split into multiple columns:
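A minimal sketch of such a split with pandas is shown here. The column name `payment_date`, the sample values and the helper `split_date` are assumptions made for illustration; they are not taken from the notebook.

```python
import pandas as pd

# Hypothetical sample; the column name is an assumption, not the post's dataset.
invoices = pd.DataFrame({
    "payment_date": pd.to_datetime(["2019-01-15", "2019-04-30", "2019-12-31"]),
})


def split_date(df, column):
    """Return the numeric calendar parts of a datetime column as new columns."""
    dates = df[column].dt
    return pd.DataFrame({
        f"{column}_year": dates.year,
        f"{column}_quarter": dates.quarter,
        f"{column}_month": dates.month,
        f"{column}_week": dates.isocalendar().week.astype(int),  # ISO week number
        f"{column}_day_of_year": dates.dayofyear,
        f"{column}_day_of_month": dates.day,
        f"{column}_day_of_week": dates.dayofweek,  # Monday = 0
    }, index=df.index)


date_parts = split_date(invoices, "payment_date")
print(date_parts)
```

Every date becomes seven integer features, so three date attributes would turn into 21 columns.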
Correlation between the decision column and the features shows many dependencies, but it doesn't pick up all of the columns derived from the payment date. This is an early sign that training might not work well.
To train, validate and test the ML model, we need three datasets: training (70% of all data), validation (2/3 of the remaining data, so 20% of the total) and test (1/3 of the remaining data, so 10% of the total). The original dataset is split into these three parts.
Training runs with XGBoost (gradient boosting is currently one of the most popular techniques for efficient modeling of tabular datasets of all sizes). Read more about XGBoost parameters. Having a validation dataset lets us use XGBoost's early stopping: if the validation metric does not improve for N rounds (10 in our case), training stops and the best iteration is used as the training result.
Result: training accuracy is 93% and validation accuracy is 74%. Validation accuracy is too low, and the gap between the two suggests the model does not generalize well. Training wasn't successful, so we should try transforming the dates in another way:
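One way to check this kind of dependency is to compute the correlation of every feature with the decision column. The sketch below uses pandas on made-up data; the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; decision = 1 stands for a payment at risk.
data = pd.DataFrame({
    "decision": [0, 0, 1, 1, 0, 1],
    "payment_month": [1, 2, 3, 4, 5, 6],
    "total": [100, 120, 900, 950, 110, 870],
})

# Pearson correlation of each feature with the decision column, strongest first.
correlations = (
    data.corr()["decision"]
    .drop("decision")
    .sort_values(ascending=False)
)
print(correlations)
```

Features with a correlation close to zero carry little linear signal about the decision, which is a hint (not a proof) that they may not help training.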
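The 70/20/10 split can be sketched as follows. This is one possible implementation, assuming a pandas DataFrame; the stand-in data, the seed and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset of 100 rows standing in for the invoice data.
data = pd.DataFrame({"feature": np.arange(100), "decision": np.arange(100) % 2})

# Shuffle once with a fixed seed so the split is reproducible.
shuffled = data.sample(frac=1, random_state=42)

n = len(shuffled)
train_end = n * 7 // 10       # first 70% of rows go to training
validation_end = n * 9 // 10  # next 20% to validation, last 10% to test

train_data = shuffled.iloc[:train_end]
validation_data = shuffled.iloc[train_end:validation_end]
test_data = shuffled.iloc[validation_end:]

print(len(train_data), len(validation_data), len(test_data))
```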
2. Date feature transformation into difference between dates
Instead of splitting each date into multiple attributes, we can reduce the number of date attributes to two, using date differences:
- Day difference between Payment Due Date and Invoice Date
- Day difference between Payment Date and Invoice Date
This should produce a clear pattern: when a payment is delayed, the difference between payment date and invoice date is bigger than the difference between payment due date and invoice date. Sample data with the date features transformed into date differences:
The correlation is much better this time. The decision column correlates well with the date differences and the total.
Test, validation and training datasets are prepared in the same proportions as in the previous test, but this time with the stratify option. It makes sure the decision attribute is represented in the same proportion in the test, validation and training datasets.
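A minimal sketch of this transformation with pandas follows. The column names (`due_days`, `paid_days`, `paid_late`) and the two sample invoices are made up for illustration.

```python
import pandas as pd

# Two hypothetical invoices: the first is paid early, the second is paid late.
invoices = pd.DataFrame({
    "invoice_date": pd.to_datetime(["2019-01-01", "2019-02-01"]),
    "payment_due_date": pd.to_datetime(["2019-01-31", "2019-03-03"]),
    "payment_date": pd.to_datetime(["2019-01-20", "2019-03-15"]),
})

# Subtracting datetime columns gives timedeltas; .dt.days converts them to integers.
invoices["due_days"] = (invoices["payment_due_date"] - invoices["invoice_date"]).dt.days
invoices["paid_days"] = (invoices["payment_date"] - invoices["invoice_date"]).dt.days

# A payment is late when it took more days than the due date allowed.
invoices["paid_late"] = invoices["paid_days"] > invoices["due_days"]

features = invoices[["due_days", "paid_days"]]
print(invoices[["due_days", "paid_days", "paid_late"]])
```

The two integer columns encode the delay directly, so the model does not have to reconstruct it from separate year, month and day parts.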
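A stratified split in the same proportions can be sketched with scikit-learn's `train_test_split`, assuming that is the stratify option meant here. The imbalanced sample data and the seed are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 20% of rows have decision = 1.
data = pd.DataFrame({"feature": np.arange(200), "decision": [1] * 40 + [0] * 160})

# 70% training; stratify keeps the decision ratio the same in both parts.
train_data, remaining = train_test_split(
    data, test_size=0.3, stratify=data["decision"], random_state=42
)

# Split the remaining 30% into validation (2/3) and test (1/3).
validation_data, test_data = train_test_split(
    remaining, test_size=1 / 3, stratify=remaining["decision"], random_state=42
)

for name, part in [("train", train_data), ("validation", validation_data), ("test", test_data)]:
    print(name, len(part), part["decision"].mean())
```

Without stratification a random split of a small, imbalanced dataset can leave the validation or test set with very few at-risk invoices, which makes the accuracy figures unreliable.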
Training, validation and test datasets are prepared:
Using the same XGBoost training parameters:
Result: this time we get 99% training accuracy and 97% validation accuracy. A great result. You can see how important the data preparation step is for ML: it directly affects training quality.