Monday, April 1, 2019

Publishing Machine Learning API with Python Flask

Flask is fun and easy to set up, as it says on the Flask website. And that's true. This microframework for Python offers a powerful way of annotating Python functions with REST endpoints. I'm using Flask to publish an ML model API that can be accessed by third-party business applications.

This example is based on XGBoost.

For better code maintenance, I would recommend using a separate Jupyter notebook where the ML model API will be published. Import the Flask module along with Flask-CORS:
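A minimal sketch of those imports (pandas and pickle are included here because they are used further below):

from flask import Flask, request, jsonify
from flask_cors import CORS
import pandas as pd
import pickle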


The model is trained on the Pima Indians Diabetes Database. The CSV data can be downloaded from here. To construct a Pandas data frame variable as input for the model predict function, we need to define an array of dataset columns:
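A sketch of such a columns array; the names below follow the standard Pima Indians Diabetes CSV header, the original notebook may use different ones:

dataset_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                   'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']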


The previously trained and saved model is loaded using Pickle:
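For example (the file name is hypothetical, use whatever path the model was saved to):

# Load the model that was trained and saved earlier
with open('diabetes_xgb_model.pkl', 'rb') as f:
    trained_model = pickle.load(f)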


It is always good practice to do a test run and check whether the model performs well. Construct a data frame with an array of column names and an array of data (using new data, not present in the train or test datasets). We call two functions: model.predict and model.predict_proba. I often prefer model.predict_proba, as it returns the probability of the 0/1 outcome, which helps to interpret the result based on a certain range (0.25 to 0.75, for example). A Pandas data frame is constructed with the sample payload and then the model prediction is executed:
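A short sketch of such a test run; the payload values are illustrative only:

# One sample row, matching the order of dataset_columns
sample_payload = [[1, 148, 72, 35, 0, 33.6, 0.627, 50]]
df_sample = pd.DataFrame(sample_payload, columns=dataset_columns)
print(trained_model.predict(df_sample))        # predicted class (0/1)
print(trained_model.predict_proba(df_sample))  # probabilities for class 0 and class 1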


Flask API. Make sure you enable CORS, otherwise the API call will not work from another host. Write the annotation before the function you want to expose through the REST API. Provide an endpoint name and the supported REST methods (POST in this example). Payload data is retrieved from the request, a Pandas data frame is constructed, and the model predict_proba function is executed:
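A sketch of what this endpoint could look like; the route name is hypothetical and the payload is assumed to be a JSON array of feature rows:

app = Flask(__name__)
CORS(app)

@app.route('/api/v1.0/diabetes/predict', methods=['POST'])
def predict_diabetes():
    payload = request.json  # e.g. [[1, 148, 72, 35, 0, 33.6, 0.627, 50]]
    df = pd.DataFrame(payload, columns=dataset_columns)
    prediction = trained_model.predict_proba(df)
    return jsonify({'probability_0': float(prediction[0][0]),
                    'probability_1': float(prediction[0][1])})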


The response JSON string is constructed and returned as the function result. I'm running Flask in a Docker container, which is why 0.0.0.0 is used as the host it runs on. Port 5000 is mapped as an external port, which allows calls from the outside.
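Starting the server could then look like this (a sketch; host and port values match what is described above):

if __name__ == '__main__':
    # 0.0.0.0 so the Docker-mapped port 5000 is reachable from outside the container
    app.run(host='0.0.0.0', port=5000)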

While it works to start the Flask interface directly in a Jupyter notebook, I would recommend converting it to a Python script and running it from the command line as a service. Use the Jupyter nbconvert command to convert it to a Python script:

jupyter nbconvert --to python diabetes_redsamurai_endpoint_db.ipynb

A Python script with a Flask endpoint can be started as a background process with the PM2 process manager. This allows running the endpoint as a service and starting other processes on different ports. PM2 start command:

pm2 start diabetes_redsamurai_endpoint_db.py


pm2 monit helps to display info about running processes:


ML model classification REST API call from Postman through endpoint served by Flask:
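If you prefer to test from Python instead of Postman, an equivalent call could look like this (URL and payload match the hypothetical endpoint sketched above):

import requests

response = requests.post('http://localhost:5000/api/v1.0/diabetes/predict',
                         json=[[1, 148, 72, 35, 0, 33.6, 0.627, 50]])
print(response.json())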


More info:

- GitHub repo with source code
- Previous post about XGBoost model training

Wednesday, March 13, 2019

Selecting Optimal Parameters for XGBoost Model Training

There is always a bit of luck involved when selecting parameters for Machine Learning model training. Lately, I have been working with gradient boosted trees, and XGBoost in particular. We are using XGBoost in the enterprise to automate repetitive human tasks. While training ML models with XGBoost, I created a pattern for choosing parameters, which helps me build new models quicker. I will share it in this post; hopefully you will find it useful too.

I’m using the Pima Indians Diabetes Database for training; the CSV data can be downloaded from here.

This is the Python code which runs the XGBoost training step and builds a model. Training is executed by passing pairs of train/test data; this helps to evaluate training quality ad hoc during model construction:
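A sketch of that training step, assuming X_train/Y_train and X_test/Y_test are already prepared (see the train/test split post below); the parameter values are placeholders tuned later in this post:

from xgboost import XGBClassifier

# Placeholder parameter values; the steps below walk through tuning them
model = XGBClassifier(max_depth=3, subsample=0.3, objective='binary:logistic',
                      n_estimators=300, learning_rate=0.01)
eval_set = [(X_train, Y_train), (X_test, Y_test)]
model.fit(X_train, Y_train, early_stopping_rounds=10,
          eval_metric=['error', 'logloss'], eval_set=eval_set, verbose=True)

Note that recent XGBoost releases expect early_stopping_rounds and eval_metric to be passed to the XGBClassifier constructor rather than to fit().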

Key parameters in XGBoost (the ones which affect model quality greatly), assuming you have already selected max_depth (the more complex the classification task, the deeper the tree), subsample (equal to the evaluation data percentage), and objective (the classification algorithm):
  • n_estimators — the maximum number of boosting rounds XGBoost will run 
  • learning_rate — learning speed 
  • early_stopping_rounds — overfitting prevention, stop early if no improvement in learning 
When model.fit is executed with verbose=True, you will see the evaluation quality of each training run printed out. At the end of the log, you should see which iteration was selected as the best one. It might be that the number of training rounds is not enough to detect the best iteration; in that case XGBoost will select the last iteration to build the model.

With the matplotlib library we can plot the training results for each run (from the XGBoost output). This helps to understand whether the iteration chosen to build the model was the best one possible. Here we are using the sklearn library to evaluate model accuracy and then plotting the training results with matplotlib:
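A sketch of that evaluation and plotting step (the eval_set above has two entries, so XGBoost labels them validation_0 and validation_1 in its output):

from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

predictions = model.predict(X_test)
print("Accuracy: %.2f%%" % (accuracy_score(Y_test, predictions) * 100.0))

results = model.evals_result()
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)

# Log loss for train (validation_0) and test (validation_1) sets
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
ax.set_ylabel('Log Loss')
ax.set_title('XGBoost Log Loss')

# Classification error for the same runs
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Test')
ax.legend()
ax.set_ylabel('Classification Error')
ax.set_title('XGBoost Classification Error')
plt.show()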

Let’s describe my approach to selecting the parameters (n_estimators, learning_rate, early_stopping_rounds) for XGBoost training.

Step 1. Start with what you feel works best based on your experience or what makes sense
  • n_estimators = 300 
  • learning_rate = 0.01 
  • early_stopping_rounds = 10 
Results:
  • Stop iteration = 237 
  • Accuracy = 78.35% 
Results plot:


With the first attempt, we already get good results for the Pima Indians Diabetes dataset. Training was stopped at iteration 237. The classification error plot shows a lower error rate around iteration 237. This means a learning rate of 0.01 is suitable for this dataset, and early stopping after 10 iterations (if the result doesn’t improve in the next 10 iterations) works.

Step 2. Experiment with the learning rate: try a smaller learning rate and increase the number of learning iterations
  • n_estimators = 500 
  • learning_rate = 0.001 
  • early_stopping_rounds = 10 
Results:
  • Stop iteration = didn’t stop, spent all 500 iterations 
  • Accuracy = 77.56% 
Results plot:


A smaller learning rate didn’t work for this dataset. The classification error barely changes, and the XGBoost log loss doesn’t stabilize even after 500 iterations.

Step 3. Try to increase the learning rate.
  • n_estimators = 300 
  • learning_rate = 0.1 
  • early_stopping_rounds = 10 
Results:
  • Stop iteration = 27 
  • Accuracy = 76.77% 
Results plot:


With the increased learning rate, the algorithm learns quicker; it already stops at iteration 27. The XGBoost log loss is stabilizing, but the overall classification accuracy is not ideal.

Step 4. Select the optimal learning rate from the first step and increase the early stopping rounds (to give the algorithm more chances to find a better result).
  • n_estimators = 300 
  • learning_rate = 0.01 
  • early_stopping_rounds = 15 
Results:
  • Stop iteration = 265 
  • Accuracy = 78.74% 
Results plot:


A slightly better result is produced with 78.74% accuracy — this is visible in the classification error plot.


Wednesday, March 6, 2019

Prepare Your Data for Machine Learning Training

The process of preparing data for Machine Learning model training looks to me somewhat similar to preparing food ingredients before cooking dinner. In both cases it takes time, but then you are rewarded with a tasty dinner, or a great ML model.

I will not dive into data science topics here or discuss how to structure and transform data. It all depends on the use case, and there are so many ways to reformat data to get the most out of it. I will rather focus on a simple but practical example: how to split data into training and test datasets with Python.

Make sure to check my previous post, Jupyter Notebook — Forget CSV, fetch data from DB with Python; today's example is based on the notebook from that post. It explains how to load data from the DB and construct a data frame.

This Python code snippet builds train/test datasets:
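A sketch of that snippet, assuming the data frame from the previous post is loaded as data_df and the outcome column is named 'class' (both names are assumptions, adjust them to your data):

from sklearn.model_selection import train_test_split

# X holds the feature columns, Y the decision column
X = data_df.drop('class', axis=1)
Y = data_df['class']

# random_state value is arbitrary; it just makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=17, stratify=Y)

# Print shapes for a quick sanity check
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)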

The first thing is to assign X and Y. The data columns assigned to the X array are the ones which produce the decision encoded in the Y array. We assign X and Y by extracting columns from the data frame.

In the next step, the train X/Y and test X/Y sets are constructed by the train_test_split function from the sklearn module. You must import this function in the Python script:

from sklearn.model_selection import train_test_split

One of the parameters of the train_test_split function is test_size. This parameter controls the proportion of the entire data set allocated to the test set (~30% in this example).

The stratify parameter enforces an equal distribution of Y values across the train and test data sets.

The random_state parameter ensures the data split will be the same on the next run too. To change the split, it is enough to change this parameter's value.

The train_test_split function returns four arrays. The train X/Y and test X/Y pairs can be used to train and test the ML model. The data set shapes and structure can be printed out too, for convenience.

Sample Jupyter notebook available on GitHub. Sample credentials JSON file.

Saturday, February 23, 2019

Oracle JET Table with Template Slots for Custom Cells

The Oracle JET table comes with a template slot option. This is helpful for building generic functionality to render custom cells within the table.

In this example, custom cells are used to render dates, amounts, and a risk gauge:


While implementing an Oracle JET table, it is a best practice to read the table column structure from a variable rather than defining the entire structure in the HTML itself. The columns property refers to the variable. A template called cellTemplate is the default template used to render cell content:


The table column structure is defined in JS. To apply a specific cell template, it is specified in the column definition:


Table data is static in this example and comes through a JSON array based on the JET Array Data Provider:


Sample code is available on GitHub.

Wednesday, February 20, 2019

Intercepting ADF Table Column Show/Hide Event with Custom Change Manager Class

Ever wondered how to intercept the ADF table column show/hide event from the ADF Panel Collection component? Yes, you could use ADF MDS functionality to store the user preference for visible table columns. But what if you wanted to implement it yourself without using MDS? This is actually possible through a custom persistence manager class. I will show you how.

If you don't know what I'm talking about, check the screenshot below. This popup comes out of the box with ADF Panel Collection and helps to manage the table's visible columns. Pretty useful, especially for large tables:


Obviously, we would like to store the user preference so that the next time the user comes back to the form, he sees the previously stored setup for the table columns. One way to achieve this is to use the out-of-the-box ADF MDS functionality. But what if you don't want to use it? It is still possible: we can catch all changes done through the Manage Columns popup in a custom Change Manager class. Extend SessionChangeManager and override a single method, addComponentChange. This is the place where we intercept changes and could log them to the DB, for example (later, on form load, we could read the table setup and apply it before the fragment is rendered):


Register custom Change Manager class in web.xml:


Manage Columns popup is out of the box functionality offered by ADF Panel Collection component:


The addComponentChange method will be invoked automatically, and you should see output similar to this when changing table column visibility:


Download sample application code from my GitHub repository.

Friday, February 15, 2019

ADF Performance Improvement with Nginx Compression

We are using the Nginx web server for the Oracle ADF WorkBetter demo hosted on a DigitalOcean cloud server. Nginx helps to serve web application content fast and offers improved performance. One of the important tuning options is content compression; Nginx does this job well and it is simple to set up.

Content compression doesn't provide a direct runtime performance gain; the browser runs the same code whether or not it was compressed. But it brings improved perceived performance (which is very important): network time is much shorter because of the reduced content size. Oracle ADF is a server-side framework, and each request brings content from the server; the faster this content arrives, the better the application performance.

1. Content Compression = OFF

Let's see the stats when no content compression is applied (using our Oracle ADF WorkBetter hosted demo).

Page load size is 2.69 MB transferred. Finish time 1.55 s:


Navigation to the employee section generates 165.76 KB and finish time 924 ms:


Navigation to employee compensation generates 46.19 KB and finish time 494 ms:


2. Nginx compression

Compression is simple to set up in Nginx. Gzip settings are defined in nginx.conf; make sure to list all content types which must be supported for compression. Restart the nginx process after the new settings are saved in nginx.conf:


3. Content Compression = ON

Page load size is 733.84 KB transferred. Finish time 1.48 s:


Navigation to the employee section generates 72.75 KB and finish time 917 ms:


Navigation to employee compensation generates 7.59 KB and finish time 498 ms:

Monday, February 11, 2019

Jupyter Notebook — Forget CSV, fetch data from DB with Python

If you read a book, article, or blog about Machine Learning, there is a high chance it will use training data from a CSV file. There is nothing wrong with CSV, but let's think about whether it is really practical. Wouldn't it be better to read data directly from the DB? Often you can't feed business data directly into ML training; it needs pre-processing: changing categorical data, calculating new data features, etc. The data preparation/transformation step can be done quite easily with SQL while fetching the original business data. Another advantage of reading data directly from the DB is that when the data changes, it is easier to automate the ML model re-training process.

In this post I describe how to call Oracle DB from Jupyter notebook Python code.

Step 1 

Install cx_Oracle Python module:

python -m pip install cx_Oracle

This module helps to connect to Oracle DB from Python.

Step 2

cx_Oracle enables executing SQL calls from Python code. But to be able to call a remote DB from a Python script, we need to install and configure Oracle Instant Client on the machine where Python runs.

If you are using Ubuntu, install alien:

sudo apt-get update 
sudo apt-get install alien 

Download RPM files for Oracle Instant Client and install with alien:

alien -i oracle-instantclient18.3-basiclite-18.3.0.0.0-1.x86_64.rpm 
alien -i oracle-instantclient18.3-sqlplus-18.3.0.0.0-1.x86_64.rpm 
alien -i oracle-instantclient18.3-devel-18.3.0.0.0-1.x86_64.rpm 

Add environment variables:

export ORACLE_HOME=/usr/lib/oracle/18.3/client64 
export PATH=$PATH:$ORACLE_HOME/bin 

Read more here.

Step 3 

Install Magic SQL Python modules:

pip install jupyter-sql 
pip install ipython-sql 

Installation and configuration complete.

For today's sample I'm using the Pima Indians Diabetes Database. The CSV data can be downloaded from here. I uploaded the CSV data into a database table and will be fetching it through SQL directly in the Jupyter notebook.

First of all, the connection to the DB is established and then the SQL query is executed. The query result set is stored in a variable called result. Do you see %%sql? This is the SQL magic:
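A sketch of those notebook cells using the ipython-sql magic; the connection string, credentials, and table name below are placeholders, and the SQLAlchemy and cx_Oracle packages must be available for the oracle+cx_oracle dialect:

%load_ext sql
%sql oracle+cx_oracle://ml_user:ml_password@dbhost:1521/?service_name=ORCLPDB1

%%sql result <<
SELECT * FROM pima_indians_diabetes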


A username and password must be specified while establishing the connection. To avoid sharing the password, make sure to read the password value from an external source (it could be a simple JSON file, as in this example, or a more advanced encoded token from a keyring).

The beauty of this approach is that data fetched through the SQL query is available out of the box in a data frame. A Machine Learning engineer can work with the data in the same way as if it were loaded through CSV:
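A minimal sketch of that conversion (result is the variable assigned by the %%sql cell above):

df = result.DataFrame()
df.head()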

Sample Jupyter notebook available on GitHub. Sample credentials JSON file.