Diabetes Progression Prediction

Overview

We will show you how the Neptune machine learning platform can help you manage, visualize, and compare machine learning experiments. The following example shows how easy it is to integrate Neptune with your existing code.

We will adapt the Linear Regression example from the scikit-learn library to utilize features of Neptune. The example consists of a single Python file using scikit-learn to train and evaluate a simple linear regression model that predicts disease progression of diabetes patients.

Integrating the code with the Neptune Client Library will allow us to run it multiple times as a single Grid Search Experiment. This kind of experiment provides an effortless way to execute the same code with different parameters and to compare all the results in the Web UI.

Dataset Information

Dataset: Diabetes

Dataset size: 442 examples (422 examples of the training set and 20 examples of the test set).

Dataset description: Ten normalized baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

Business purpose: Predict disease progression for diabetes patients in the next year.

Dataset credits: Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) “Least Angle Regression,” Annals of Statistics.
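
The 422/20 split listed above matches the negative-index slicing used later in this example. A minimal sketch with plain Python lists, where integers stand in for the 442 rows (the variable names are illustrative only and no real data is loaded):

```python
# Stand-in for the 442 rows of the dataset (row indices only, no real data).
rows = list(range(442))

# The last 20 rows are held out for testing; all earlier rows are used for training.
train_rows = rows[:-20]
test_rows = rows[-20:]

print(len(train_rows), len(test_rows))  # 422 20
```

Because the split is positional rather than random, every run evaluates on the same 20 examples, which keeps results comparable between jobs.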

Features Provided by Neptune

Using Neptune’s grid search functionality, we can easily check which variable from the dataset is the best predictor (yields the best metric value). Neptune gives us a convenient Web UI where we can compare and share the results of our experiments. Additionally, for every execution Neptune stores the parameter values and creates a snapshot of the source code used, so we are able to recreate every result.

Neptune stores images sent from the job’s code via image channels. We can use image channels to send a custom chart containing the regression line and target values.

Let’s Start Editing the Code!

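To run the code from this example, you need Python with the libraries imported in the Imports section installed: the Neptune Client Library, scikit-learn, NumPy, matplotlib, and Pillow.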

We need the base source file to start with. Let’s download plot_ols.py and rename it to plot_ols_neptune.py.

First, we will go through the changes required to integrate the code with Neptune.

If you want to download the code that is ready to run, it’s available on GitHub.

Imports

First, let’s add the imports of the additional libraries.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# The additional libraries.
import io
import time
from deepsense import neptune
from PIL import Image

Job Configuration

In the next step, we need to create a Context object to enable communication from the job to Neptune.

ctx = neptune.Context()

Once we have the Context object, we can configure several channels to send logs, the image of the chart and the metric value.

We use a single metric to measure the quality of our model: the value of MSE. To send the metric value to Neptune, we need to create a NUMERIC channel. We will call it mse_channel.

# A channel to send the Mean Squared Error metric value.
mse_channel = ctx.job.create_channel(
    name='MSE',
    channel_type=neptune.ChannelType.NUMERIC)

To send a chart with the regression line to Neptune, we need an image channel. Let’s name it regression_chart_channel.

# A channel to send the regression chart.
regression_chart_channel = ctx.job.create_channel(
    name='Regression chart',
    channel_type=neptune.ChannelType.IMAGE)

We also create a TEXT channel named logs_channel to send logging information about the job’s execution.

# A channel to log information about the job's execution.
logs_channel = ctx.job.create_channel(
    name='logs',
    channel_type=neptune.ChannelType.TEXT)
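
Every value sent through a channel in this example is a pair: x (here a timestamp from time.time()) and y (the number, text, or image). The sketch below uses a hypothetical RecordingChannel class to make that shape concrete. It is an in-memory stand-in written for illustration, not part of Neptune, and the stub names are assumptions:

```python
import time


class RecordingChannel:
    """Hypothetical in-memory stand-in for a channel; NOT part of Neptune."""

    def __init__(self, name):
        self.name = name
        self.points = []  # (x, y) pairs in the order they were sent

    def send(self, x, y):
        # Mirror the send(x=..., y=...) call shape used in this example.
        self.points.append((x, y))


mse_stub = RecordingChannel('MSE')
logs_stub = RecordingChannel('logs')

mse_stub.send(x=time.time(), y=0.0)  # placeholder value, not a real result
logs_stub.send(x=time.time(), y='Training started')
```

A stand-in like this could be handy for dry-running the script's logic on a machine without Neptune, since the rest of the code only relies on the send call.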

Parametrizing the Job

Our simple regression model uses only one feature. In the original code, it’s the feature with index 2, the patient’s BMI. Let’s introduce a new numeric parameter named feature_index so we can select the feature index when running our experiment. That way we can test the importance of different features without changing the code.

After the line that loads the data set:

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

let’s add the following code:

# Add a tag containing the name of the feature.
feature_names = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
used_feature_name = feature_names[ctx.params.feature_index]
ctx.job.tags.append('diabetes-feature-' + used_feature_name)

Then, replace:

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

with:

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, ctx.params.feature_index]

We replaced the hardcoded feature index with a job parameter and added a tag containing the name of the feature. The tag will be displayed on the job list, so we can easily identify the feature used in a specific experiment.
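
Since the dataset has ten features, valid indexes run from 0 to 9. The following sketch shows one possible way to build the tag while rejecting out-of-range values. The feature_tag helper is hypothetical and not part of the example's code:

```python
FEATURE_NAMES = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']


def feature_tag(feature_index):
    """Return the tag for a feature index, rejecting values outside 0..9."""
    # Without this check, a negative index such as -1 would silently select 's6'.
    if not 0 <= feature_index < len(FEATURE_NAMES):
        raise ValueError('feature_index must be between 0 and 9')
    return 'diabetes-feature-' + FEATURE_NAMES[feature_index]


print(feature_tag(2))  # diabetes-feature-bmi
```

The explicit check matters because Python lists accept negative indexes, so a mistyped parameter could otherwise train on an unintended feature without raising an error.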

Training the Model

We leave the code training the model unchanged.

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
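
With a single feature, ordinary least squares has a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below illustrates that formula in plain Python. It is not scikit-learn's implementation, and fit_line is a hypothetical helper:

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy points lying exactly on y = 2x + 1 (not taken from the diabetes dataset).
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```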

Sending Validation Results to Neptune

Instead of printing information to the console, we want to send it to Neptune through channels. This will enable us to browse the information in the Web UI.

Let’s replace the code printing the model’s coefficients:

# The coefficients
print('Coefficients: \n', regr.coef_)

with:

# The coefficients
logs_channel.send(x=time.time(), y='Coefficients: ' + str(regr.coef_))

Replace the code printing the mean squared error:

# The mean square error
print("Mean squared error: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))

with:

# The mean square error
mse = np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2)
mse_channel.send(x=time.time(), y=mse)
logs_channel.send(x=time.time(), y="Mean squared error: %.2f" % mse)
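
The mean squared error is the average of the squared differences between predictions and targets. A small plain-Python sketch of the same computation, using toy numbers rather than values from the dataset (mean_squared_error here is a local helper defined for illustration):

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError('predictions and targets must have the same length')
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Toy values: the errors are 0, 0 and -2, so MSE = (0 + 0 + 4) / 3.
mse_example = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print('Mean squared error: %.2f' % mse_example)  # Mean squared error: 1.33
```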

Replace the code printing the variance score:

# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

with:

# Explained variance score: 1 is perfect prediction
logs_channel.send(x=time.time(), y='Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))
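
For scikit-learn regressors, score returns the coefficient of determination R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below reproduces that definition in plain Python with toy values. The r_squared helper is written for illustration only:

```python
def r_squared(predictions, targets):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_target = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for p, t in zip(predictions, targets))
    ss_tot = sum((t - mean_target) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


targets = [1.0, 2.0, 3.0, 4.0]  # toy values, not from the diabetes dataset
print(r_squared(targets, targets))    # 1.0 (perfect prediction)
print(r_squared([2.5] * 4, targets))  # 0.0 (always predicting the mean)
```

A model that predicts worse than the mean of the targets gets a negative score, so values below zero on a 20-example test set are possible for weak features.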

The original code plots a chart with a regression line and displays it in a window. Let’s modify the code to send the chart to Neptune and make it visible in the Web UI.

The original code:

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

The modified code:

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)

# Convert the chart to an image.
image_buffer = io.BytesIO()
plt.savefig(image_buffer, format='png')
image_buffer.seek(0)

# Send the chart to Neptune through an image channel.
regression_chart_channel.send(
    x=time.time(),
    y=neptune.Image(
        name='Regression chart',
        description='A chart containing predictions and target values '
                    'for diabetes progression regression. '
                    'Feature used: ' + used_feature_name,
        data=Image.open(image_buffer)))

Instead of displaying the chart in a window, we save it as a PNG image in an in-memory buffer and send that image to Neptune via regression_chart_channel. To make the chart more descriptive, we also removed the lines that hid the scale of the X and Y axes. The chart will be visible in the Web UI.
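
The image_buffer.seek(0) call is easy to overlook: after a write, the buffer's position sits at the end of the data, so anything that reads from it next would see no bytes. A minimal standard-library sketch of that behavior, using the 8-byte PNG signature as a stand-in for a real chart:

```python
import io

# Stand-in payload: the 8-byte PNG signature instead of a real chart image.
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'

buffer = io.BytesIO()
buffer.write(PNG_SIGNATURE)

# After writing, the position is at the end, so a read returns nothing.
read_without_rewind = buffer.read()

# Rewinding makes the full content available to the next reader.
buffer.seek(0)
read_after_rewind = buffer.read()

print(len(read_without_rewind), len(read_after_rewind))  # 0 8
```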

Configuration File

In order to create an experiment, we need to prepare a short configuration file describing it:

neptune.yaml

name: Diabetes Progression Prediction
description: Linear Regression with scikit-learn on diabetes dataset.
project: Diabetes
parameters:
  - name: feature_index
    type: int
    description: The index of the feature used to train the linear regression model. It should be between 0 and 9.
    required: true
metric:
    channel: MSE
    direction: minimize

Our configuration file contains the experiment’s name and description, the project it belongs to, the schema of the parameters that will be injected into the job, and the metric used for comparison. Our experiment has only one parameter, feature_index, which selects the feature used during training. We’ve also explicitly declared the metric and bound it to the MSE channel that our code uses to send the mean squared error of the model’s predictions. Setting direction to minimize means that a job with a smaller MSE value is considered better. The metric declaration is required in order to use the grid search functionality.
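
To make the role of direction concrete, the sketch below picks the best job from a mapping of job names to metric values. It only illustrates the idea and is not how Neptune selects jobs. The pick_best helper, the job names, the metric values, and the 'maximize' counterpart are all assumptions made for this illustration:

```python
def pick_best(results, direction='minimize'):
    """Return the (job, value) pair with the best metric for the given direction."""
    # 'maximize' is an assumed counterpart; the example itself only uses 'minimize'.
    if direction not in ('minimize', 'maximize'):
        raise ValueError("direction must be 'minimize' or 'maximize'")
    chooser = min if direction == 'minimize' else max
    return chooser(results.items(), key=lambda item: item[1])


# Placeholder metric values, invented for illustration; not real results.
placeholder_mse = {'job-a': 3.0, 'job-b': 1.5, 'job-c': 2.0}
print(pick_best(placeholder_mse, 'minimize'))  # ('job-b', 1.5)
```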

Once the source and the configuration files are ready, we can create a grid search experiment to compare the results of possible features and check which one is the best predictor (yields the lowest MSE).

We can run the experiment using the neptune run command in the directory containing the plot_ols_neptune.py and neptune.yaml files.

$ neptune run plot_ols_neptune.py -- --feature_index "(0, 9, 1)"

As the value of feature_index we’ve passed "(0, 9, 1)", a range from 0 to 9 with step 1. Passing multiple values for a parameter enables the grid search functionality. In our case Neptune will create ten jobs, each with a different value of feature_index.
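
The ten jobs follow from treating the range as inclusive at both ends. The sketch below models that reading with a hypothetical expand_range helper. It mirrors the behavior described here and is not Neptune's own parser:

```python
def expand_range(start, stop, step):
    """Expand an inclusive (start, stop, step) range into a list of values."""
    if step <= 0:
        raise ValueError('step must be positive')
    values = []
    current = start
    while current <= stop:
        values.append(current)
        current += step
    return values


feature_indexes = expand_range(0, 9, 1)
print(len(feature_indexes))  # 10
```

Note the difference from Python's built-in range(0, 9, 1), which excludes the upper bound and would yield only nine values.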

When the command is run, your browser should automatically open the experiment’s dashboard in the Web UI, where you can track the overall progress of the experiment.

Grid Search Dashboard

Explore the best job!

Neptune automatically picks the job with the lowest MSE and marks it as the best job. Select it to enter its dashboard.

Viewing the Regression Line

To view the regression line of the selected job, let’s navigate to the Channels tab.

Channels of the Diabetes Progression Prediction job

Next, click on the Regression chart tile. You will see the chart’s thumbnail with a description.

View of images sent to the "Regression chart" channel

To view the chart in full resolution, let’s click on the top right corner of the thumbnail.

The chart containing a regression line and target values

Viewing Job Logs

To view the logs sent from the job to Neptune during the execution, we need to open the Channels tab and click on the logs tile.

Values sent to the logs channel

Summary

We have used Neptune to manage a simple machine learning experiment. Neptune allowed us to run a grid search experiment and track its progress. Furthermore, the best feature was selected for us. We have also visualized the predictions by plotting them against target values, using Neptune’s image channels.

These are only some of Neptune’s features. To see an advanced example and explore more of them, see the Handwritten Digits example.