Andrej Baranovskij Blog: Prepare Your Data for Machine Learning Training

Wednesday, March 6, 2019

Prepare Your Data for Machine Learning Training

The process to prepare data for Machine Learning model training to me looks somewhat similar to the process of preparing food ingredients to cook dinner. You know in both cases it takes time, but then you are rewarded with tasty dinner or a great ML model.

I will not be diving here into data science subject and discussing how to structure and transform data. It all depends on the use case and there are so many ways to reformat data to get the most out of it. I will rather focus on simple, but a practical example — how to split data into training and test datasets with Python.

Make sure to check my previous post, today example is based on a notebook from this post — Jupyter Notebook — Forget CSV, fetch data from DB with Python. It is explained there how to load data from DB and construct a data frame.

This Python code snippet builds train/test datasets:

The first thing is to assign X and Y. Data columns assigned to X array are the ones which produce decision encoded in Y array. We assign X and Y by extracting columns from the data frame.

In the next step train X/Y and test X/Y sets are constructed by function train_test_split from sklearn module. You must import this function in Python script:

from sklearn.model_selection import train_test_split

One of the parameters for train_test_split function — test_size. This parameter controls the proportion of test data set size taken from the entire data set (~30% in this example).

Parameter stratify is enforcing equal distribution of Y data across train and test data sets.

Parameter random_state ensures data split will be the same in the next run too. To change the split, it is enough to change this parameter value.

Function train_test_split returns four arrays. Train X/Y and test X/Y pairs can be used for train and test ML model. Data set shape and structure can be printed out too for the convenience purpose.

Sample Jupyter notebook available on GitHub. Sample credentials JSON file.

Wednesday, March 6, 2019

Prepare Your Data for Machine Learning Training

No comments: