Note

This page was generated from notebooks/inference_with_smurff.ipynb.

Inference with SMURFF

In this notebook we continue from the first example. After running a training session again in SMURFF, we look more closely at how to use SMURFF to make predictions. The full Python API for predictions is documented in Python API Reference » Inference.

To make predictions we recall that the value of a tensor model is given by a tensor contraction of all latent matrices. Specifically, the prediction for the element \(\hat{Y}_{ijk}\) of a rank-3 tensor is given by

\[\hat{Y}_{ijk} = \sum_{d=1}^D u^{(1)}_{d,i} u^{(2)}_{d,j} u^{(3)}_{d,k} + \mathrm{mean}\]

Since a matrix is a rank-2 tensor, the prediction for a matrix is given by:

\[\hat{Y}_{ij} = \sum_{d=1}^D u^{(1)}_{d,i} u^{(2)}_{d,j} + \mathrm{mean}\]

SMURFF computes these inner products automatically, as we will see below.
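To make the contraction concrete, here is a minimal NumPy sketch that is independent of SMURFF. The latent matrices U1, U2 and U3, the sizes and the mean value are all made-up illustration values; only the formula itself comes from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

D, I, J, K = 4, 3, 5, 2   # number of latent dimensions and size of each tensor mode
mean = 0.5                # global mean added to every prediction (made-up value)

# One latent matrix per tensor mode, each of shape (D, size of that mode).
U1 = rng.normal(size=(D, I))
U2 = rng.normal(size=(D, J))
U3 = rng.normal(size=(D, K))

# Rank-3 tensor: sum over d of U1[d, i] * U2[d, j] * U3[d, k], plus the mean.
Y_tensor = np.einsum("di,dj,dk->ijk", U1, U2, U3) + mean

# Matrix (rank-2 tensor): the same contraction reduces to U1^T U2, plus the mean.
Y_matrix = U1.T @ U2 + mean

# Check one element against the explicit sum in the formula.
i, j, k = 1, 2, 0
explicit = sum(U1[d, i] * U2[d, j] * U3[d, k] for d in range(D)) + mean
print(np.isclose(Y_tensor[i, j, k], explicit))
```

In a Bayesian run this contraction is evaluated once per posterior sample, which is why the prediction methods shown later return one value per sample.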

Saving models

We run a Macau training session using side information (ecfp) from the ChEMBL dataset. We save every 10th sample so that we can load the model afterwards. This run takes a few minutes.

[ ]:
import smurff
import os
import logging

ic50_train, ic50_test, ecfp = smurff.load_chembl()

# limit to 100 rows and 100 features to make things go faster
ic50_train = ic50_train.tocsr()[:100,:]
ic50_test = ic50_test.tocsr()[:100,:]
ecfp = ecfp.tocsr()[:100,:].tocsc()[:,:100]

trainSession = smurff.MacauSession(
                       Ytrain     = ic50_train,
                       Ytest      = ic50_test,
                       side_info  = [ecfp, None],
                       num_latent = 16,
                       burnin     = 200,
                       nsamples   = 100,
                       save_freq  = 10,
                       save_name  = "ic50-macau.hdf5",
                       verbose    = 2,)

predictions = trainSession.run()

Saved Model

The model is saved in an HDF5 file, in this case ic50-macau.hdf5. The file contains all saved info from this training run. For example:

[ ]:
%%bash

h5ls -r ic50-macau.hdf5 | head -n 30

The structure of the HDF5 file is:

  • Datasets in /config contain the input data and configuration provided to the TrainSession.
  • The different /sample_* datasets contain, for each posterior sample:
    • Predictions for the provided test matrix:
      • predictions/pred_1sample: Predictions from this sample
      • predictions/pred_avg: Average of the predictions across this and all previous samples
      • predictions/pred_var: Variance of the predictions across this and all previous samples
    • latents_*: Latent samples for each dimension
    • link_matrices/: When side information is used with the MacauPrior, this HDF5 group contains the β link matrix and the μ HyperPrior sample. This makes it possible to make predictions from unseen side information.

Sparse matrices and tensors are stored using the h5sparse-tensor Python package which is automatically installed as a dependency of smurff.
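The pred_avg and pred_var datasets described above are cumulative: each one summarizes the current sample and all earlier ones. The sketch below shows one common way to maintain such running statistics (Welford's update). It is an illustration only, not SMURFF's code: the function name running_mean_var and the sample values are hypothetical, and whether SMURFF divides by n or n - 1 is not stated here, so the n - 1 convention is an assumption.

```python
import numpy as np

def running_mean_var(samples):
    """Yield (mean, variance) after each new sample, using Welford's update."""
    mean = None
    m2 = None  # running sum of squared deviations from the current mean
    for n, x in enumerate(samples, start=1):
        x = np.asarray(x, dtype=float)
        if mean is None:
            mean = np.zeros_like(x)
            m2 = np.zeros_like(x)
        delta = x - mean
        mean = mean + delta / n
        m2 = m2 + delta * (x - mean)
        # Assumed convention: sample variance (n - 1); reported as 0 for a single sample.
        var = m2 / (n - 1) if n > 1 else np.zeros_like(x)
        yield mean, var

# Three hypothetical posterior samples of predictions for two test cells.
pred_1sample = [np.array([6.0, 5.0]), np.array([6.4, 5.2]), np.array([6.2, 4.8])]
for running_mean, running_var in running_mean_var(pred_1sample):
    print(running_mean, running_var)
```

The benefit of an incremental update is that only the running statistics need to be kept in memory, not every earlier sample.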

Making predictions from a TrainSession

The easiest way to make predictions is from an existing TrainSession:

[ ]:
predictor = trainSession.makePredictSession()
print(predictor)

Once we have a PredictSession, there are several ways to make predictions:

  • From a sparse matrix
  • For all possible elements in the matrix (the complete \(U \times V\))
  • For a single point in the matrix
  • Using only side-information

Predict all elements

We can make predictions for all rows \(\times\) columns in our matrix:

[ ]:
p = predictor.predict_all()
print(p.shape) # p is a numpy array of size: (num samples) x (num rows) x (num columns)
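Because the first axis of this array indexes posterior samples, a point estimate per matrix cell can be obtained by reducing over that axis. The following sketch uses a random array as a stand-in for the output of predict_all, so the sizes and values are made up; only the (num samples) x (num rows) x (num columns) layout is taken from the comment above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for predictor.predict_all(): 10 samples of a 4 x 3 matrix.
num_samples, num_rows, num_cols = 10, 4, 3
p = rng.normal(loc=6.0, scale=0.5, size=(num_samples, num_rows, num_cols))

# Collapse the sample axis to get one point estimate per matrix cell.
pred_mean = p.mean(axis=0)
# Spread across samples, usable as a rough per-cell uncertainty.
pred_std = p.std(axis=0)

print(pred_mean.shape, pred_std.shape)
```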

Predict elements in a sparse matrix

We can make predictions for a sparse matrix, for example our ic50_test matrix:

[ ]:
p = predictor.predict_some(ic50_test)
print(len(p),"predictions") # p is a list of Predictions
print("predictions 1:", p[0])
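A list of predictions like this can be scored against the known test values, for example with the root-mean-square error (RMSE). The sketch below does not call SMURFF: FakePrediction is a hypothetical stand-in that exposes only val and pred_all, the two attributes used by the plotting cell later on this page, and the numbers are invented.

```python
from collections import namedtuple
import math

# Hypothetical stand-in with only the two attributes used elsewhere on this page.
FakePrediction = namedtuple("FakePrediction", ["val", "pred_all"])

preds = [
    FakePrediction(val=6.0, pred_all=[5.8, 6.1, 6.3]),
    FakePrediction(val=5.0, pred_all=[5.5, 5.4, 5.6]),
]

def rmse(predictions):
    """RMSE between each true value and the mean of its posterior samples."""
    squared_errors = []
    for pred in predictions:
        sample_mean = sum(pred.pred_all) / len(pred.pred_all)
        squared_errors.append((sample_mean - pred.val) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse(preds))
```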

Predict just one element

Or just one element. Let’s predict the first element of our ic50_test matrix:

[ ]:
from scipy.sparse import find
(i,j,v) = find(ic50_test)
p = predictor.predict_one((i[0],j[0]),v[0])
print(p)

And plot the histogram of predictions for this element.

[ ]:
%matplotlib inline
import matplotlib.pyplot as plt

# Plot a histogram of the samples.
plt.subplot(111)
plt.hist(p.pred_all, bins=10, density=True, label = "histogram of predictions")
plt.plot(p.val, 1., 'ro', markersize =5, label = 'actual value')
plt.legend()
plt.title('Histogram of ' + str(len(p.pred_all)) + ' predictions')
plt.show()
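The same per-sample predictions can also be summarized numerically, for instance as a central 95% interval. This is a sketch on synthetic data: pred_all and val below are random stand-ins for p.pred_all and p.val, not output from the run above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for p.pred_all and p.val of a single predicted element.
pred_all = rng.normal(loc=6.0, scale=0.4, size=200)
val = 6.3

# Central 95% interval of the sampled predictions.
lower, upper = np.percentile(pred_all, [2.5, 97.5])
covered = lower <= val <= upper
print(f"95% interval: [{lower:.2f}, {upper:.2f}], contains actual value: {covered}")
```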

Make predictions using side information

We can make predictions for rows/columns not in our train matrix, using only side info:

[ ]:
import numpy as np
from scipy.sparse import find

(i,j,v) = find(ic50_test)
row_side_info = ecfp.tocsr().getrow(i[0])
p = predictor.predict_one((row_side_info,j[0]),v[0])
print(p)

It is also possible to provide sideinfo for the columns, if the MacauPrior was used for the columns. See smurff.PredictSession.predict for the full documentation.
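Conceptually, predicting from side information replaces the learned latent vector of a row with one derived from its features through the link matrix. The sketch below illustrates that idea for a single posterior sample under the usual Macau formulation, where the expected latent vector is μ + βᵀx. All arrays, their shapes and orientation, and the names beta, mu, V and global_mean are assumptions made for illustration; they are not read from SMURFF and may not match how SMURFF stores these quantities.

```python
import numpy as np

rng = np.random.default_rng(0)

num_latent, num_features, num_cols = 16, 100, 8

# Hypothetical values standing in for one posterior sample.
beta = rng.normal(size=(num_features, num_latent))  # link matrix
mu = rng.normal(size=num_latent)                    # hyperprior mean of the row latents
V = rng.normal(size=(num_latent, num_cols))         # column latent matrix
global_mean = 6.0

# Binary, fingerprint-like feature vector for a row that was not in the training matrix.
x_new = (rng.random(num_features) < 0.1).astype(float)

# Expected latent vector of the unseen row given its side information.
u_new = mu + beta.T @ x_new

# Predictions for the new row against every column.
y_new = u_new @ V + global_mean
print(y_new.shape)
```

Repeating this for every saved sample and collecting the results would give a distribution of predictions for the unseen row, analogous to the per-sample predictions shown earlier.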

Making predictions from a saved run

One can also make a PredictSession from a saved HDF5 file:

[ ]:
import smurff

predictor = smurff.PredictSession("ic50-macau.hdf5")
print(predictor)