Monday, February 11, 2019

Jupyter Notebook — Forget CSV, fetch data from DB with Python

If you read a book, article, or blog about Machine Learning, chances are high it will use training data from a CSV file. There is nothing wrong with CSV, but let's consider whether it is really practical. Wouldn't it be better to read data directly from the DB? Often you can't feed business data directly into ML training; it needs pre-processing: encoding categorical data, calculating new data features, etc. That data preparation/transformation step can be done quite easily with SQL while fetching the original business data. Another advantage of reading data directly from the DB is that when the data changes, it is easier to automate the ML model re-training process.

In this post I describe how to call Oracle DB from Jupyter notebook Python code.

Step 1 

Install cx_Oracle Python module:

python -m pip install cx_Oracle

This module allows Python code to connect to an Oracle DB.
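
A quick sanity check is to import the module and print its version (a minimal sketch; the version string will differ depending on what pip installed):

import cx_Oracle

# Confirm the module imports correctly and report its version
print(cx_Oracle.version)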

Step 2

cx_Oracle makes it possible to execute SQL calls from Python code. But to call a remote DB from a Python script, we also need to install and configure Oracle Instant Client on the machine where Python runs.

If you are using Ubuntu, install alien:

sudo apt-get update 
sudo apt-get install alien 

Download RPM files for Oracle Instant Client and install with alien:

alien -i oracle-instantclient18.3-basiclite-18.3.0.0.0-1.x86_64.rpm
alien -i oracle-instantclient18.3-sqlplus-18.3.0.0.0-1.x86_64.rpm
alien -i oracle-instantclient18.3-devel-18.3.0.0.0-1.x86_64.rpm

Add environment variables:

export ORACLE_HOME=/usr/lib/oracle/18.3/client64 
export PATH=$PATH:$ORACLE_HOME/bin 

Read more here.

Step 3 

Install Magic SQL Python modules:

pip install jupyter-sql 
pip install ipython-sql 

Installation and configuration complete.
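
Before running any queries, the SQL magic has to be loaded in the notebook. This is the standard ipython-sql extension load:

%load_ext sql

After this, the %sql (line) and %%sql (cell) magics become available in the notebook.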

For today's sample, I'm using the Pima Indians Diabetes Database. The CSV data can be downloaded from here. I uploaded the CSV data into a database table and will be fetching it through SQL directly in the Jupyter notebook.

First, a connection is established to the DB, and then the SQL query is executed. The query result set is stored in a variable called result. Notice the %%sql directive: this is the SQL magic:
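
The original post shows this as a notebook screenshot; below is a minimal sketch of the two cells. The user name, password, host dbhost, port 1521, service name ORCLPDB1, and table name pima_indians_diabetes are placeholders for illustration, not values from the post:

%sql oracle+cx_oracle://ml_user:ml_password@dbhost:1521/?service_name=ORCLPDB1

%%sql result <<
-- placeholder table holding the uploaded CSV data
SELECT * FROM pima_indians_diabetes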


Username and password must be specified when establishing the connection. To avoid sharing the password, make sure to read the password value from an external source (it could be a simple JSON file, as in this example, or a more advanced approach such as a token from keyring).
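
A minimal sketch of that pattern, assuming a file named credentials.json with user and password keys (the file name, key names, host, port, and service name are all assumptions for illustration):

import json

# Load credentials from an external JSON file instead of hardcoding them;
# file name and key names are assumptions for this sketch
with open('credentials.json') as f:
    creds = json.load(f)

# Build the SQLAlchemy-style connection string used by the SQL magic;
# host, port, and service name are placeholders
connect_string = 'oracle+cx_oracle://{user}:{password}@dbhost:1521/?service_name=ORCLPDB1'.format(**creds)

%sql $connect_string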

The beauty of this approach is that data fetched through a SQL query is available in a DataFrame out of the box. A Machine Learning engineer can work with the data in the same way as if it were loaded through CSV:
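
As a sketch, assuming the result variable captured by the %%sql cell above, the ipython-sql result set converts to a pandas DataFrame with a single call:

# Convert the SQL result set into a pandas DataFrame
df = result.DataFrame()

# From here the workflow is identical to CSV-loaded data
print(df.head())
print(df.describe())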

A sample Jupyter notebook is available on GitHub, along with a sample credentials JSON file.
