Trying different Matrix Factorization Methods

In this notebook we will try out several MF methods supported by SMURFF.

Downloading the files

As in the previous example, we download the ChEMBL dataset. The resulting IC50 matrix is a compound × protein matrix, split into a train and a test set. The ECFP matrix holds compound features, which serve as side information on the compounds.

[ ]:

import smurff
import logging
logging.basicConfig(level = logging.INFO)

ic50_train, ic50_test, ecfp = smurff.load_chembl()

Matrix Factorization without Side Information (BPMF)

As a first example we can run SMURFF without side information. The method used here is BPMF.

The input matrix Ytrain must be a sparse SciPy matrix (coo_matrix, csr_matrix or csc_matrix). The test matrix Ytest also needs to be a sparse matrix of the same size as Ytrain. Here we use a burn-in of 20 samples for the Gibbs sampler and then collect 80 samples from the model. We use 16 latent dimensions in the model.
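As a minimal sketch of the input format described above, the following builds a train and a test matrix of identical shape from (row, column, value) triplets. The sizes and values are made-up toy data, not taken from the ChEMBL set.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical toy data: 4 compounds (rows) x 3 proteins (columns).
shape = (4, 3)

# Observed training entries as (row, col, value) triplets.
train_rows = np.array([0, 1, 2, 3])
train_cols = np.array([0, 1, 2, 0])
train_vals = np.array([5.1, 6.3, 4.8, 7.0])

# Held-out test entries.
test_rows = np.array([0, 2])
test_cols = np.array([1, 0])
test_vals = np.array([5.9, 6.1])

# Passing `shape` explicitly keeps both matrices the same size,
# even if the last row or column has no entry in one of them.
Ytrain = sp.coo_matrix((train_vals, (train_rows, train_cols)), shape=shape)
Ytest = sp.coo_matrix((test_vals, (test_rows, test_cols)), shape=shape)

# Converting between the sparse formats keeps shape and entries.
Ytrain_csr = Ytrain.tocsr()
Ytrain_csc = Ytrain.tocsc()

print(Ytrain.shape, Ytest.shape, Ytrain.nnz, Ytest.nnz)
```

Matrices built this way can take the place of ic50_train and ic50_test in the session calls of this notebook.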

For good results you will need more burn-in and sampling iterations (>= 1000) and possibly more latent dimensions.
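To illustrate why burn-in samples are discarded, here is a toy sketch that is independent of SMURFF. It uses a simple AR(1) Markov chain as a stand-in for a Gibbs sampler; the chain, its start value, and the helper run_chain are assumptions made for illustration only.

```python
import random

def run_chain(start, burnin, nsamples, seed=0):
    """Draw burnin + nsamples states from a toy AR(1) Markov chain.

    The chain x' = 0.5 * x + N(0, 1) has stationary mean 0, so any
    remaining influence of `start` shows up as bias in the average.
    """
    rng = random.Random(seed)
    x = start
    draws = []
    for _ in range(burnin + nsamples):
        x = 0.5 * x + rng.gauss(0.0, 1.0)
        draws.append(x)
    return draws

burnin, nsamples = 20, 80
draws = run_chain(start=1000.0, burnin=burnin, nsamples=nsamples)

# Averaging over every draw keeps the bias from the poor start value.
mean_all = sum(draws) / len(draws)

# Dropping the burn-in draws leaves only states close to the target.
kept = draws[burnin:]
mean_kept = sum(kept) / len(kept)

print(f"mean over all draws:     {mean_all:.3f}")
print(f"mean after burn-in only: {mean_kept:.3f}")
```

The average over the kept draws lies much closer to the true mean of 0 than the average over all draws. A real model has many more variables and typically converges far more slowly than this toy chain, which is why longer runs are recommended.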

We create a trainSession; its run method returns the predictions for the test matrix Ytest. predictions is a list of Prediction objects.

[ ]:

trainSession = smurff.BPMFSession(
                       Ytrain     = ic50_train,
                       Ytest      = ic50_test,
                       num_latent = 16,
                       burnin     = 20,
                       nsamples   = 80,
                       verbose    = 0)

predictions = trainSession.run()
print("First prediction element: ", predictions[0])

rmse = smurff.calc_rmse(predictions)
print("RMSE =", rmse)
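The RMSE reported above is the root-mean-square error between the observed test values and the predicted values. The sketch below shows that computation on made-up numbers. ToyPrediction and toy_rmse are hypothetical stand-ins, not SMURFF's own Prediction class or calc_rmse function, and the field names are assumptions.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for one test-cell prediction (not SMURFF's class):
# coords   - (row, col) of the test cell
# val      - observed test value
# pred_avg - prediction averaged over the collected samples
ToyPrediction = namedtuple("ToyPrediction", ["coords", "val", "pred_avg"])

def toy_rmse(preds):
    """Root-mean-square error between observed and averaged predicted values."""
    squared_errors = [(p.val - p.pred_avg) ** 2 for p in preds]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

preds = [
    ToyPrediction((0, 1), 5.0, 4.0),
    ToyPrediction((2, 0), 6.0, 6.0),
    ToyPrediction((3, 2), 7.0, 9.0),
]

# Squared errors are 1, 0 and 4, so the RMSE is sqrt(5 / 3).
print("RMSE =", toy_rmse(preds))
```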

Matrix Factorization with Side Information (Macau)

If we want to use the compound features, we can use the Macau algorithm.

The parameter side_info = [ecfp, None] sets the side information for rows and columns, respectively. In this example we only use side information for the compounds (rows of the matrix).
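Because the first entry of side_info describes the rows and the second the columns, each feature matrix needs one row per entity of its mode. The sketch below checks this on toy data before a session is started. The helper check_side_info and all sizes are hypothetical and not part of SMURFF.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical toy data: 4 compounds x 3 proteins with a few observed cells.
Ytrain = sp.coo_matrix(
    (np.array([5.1, 6.3, 4.8]), (np.array([0, 1, 3]), np.array([0, 2, 1]))),
    shape=(4, 3),
)

# One row of 5 binary features per compound, i.e. per row of Ytrain.
row_features = sp.csr_matrix(np.array([
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
], dtype=np.float64))

# Side information for rows only; None means no features for the columns.
side_info = [row_features, None]

def check_side_info(Y, side_info):
    """Raise ValueError if a feature matrix does not match its mode of Y."""
    for mode, features in enumerate(side_info):
        if features is None:
            continue
        if features.shape[0] != Y.shape[mode]:
            raise ValueError(
                f"side_info[{mode}] has {features.shape[0]} rows, "
                f"expected {Y.shape[mode]}"
            )

check_side_info(Ytrain, side_info)
print("side information matches the shape of Ytrain")
```

Swapping the two list entries would attach the compound features to the protein columns, which the check rejects here because the sizes differ.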

[ ]:

predictions = smurff.MacauSession(
                       Ytrain     = ic50_train,
                       Ytest      = ic50_test,
                       side_info  = [ecfp, None],
                       direct     = False,
                       num_latent = 16,
                       burnin     = 40,
                       nsamples   = 100).run()

smurff.calc_rmse(predictions)

Macau univariate sampler

SMURFF also includes an option to use a very fast univariate sampler: instead of sampling blocks of variables jointly, it samples each variable individually. An example:

[ ]:

predictions = smurff.MacauSession(
                       Ytrain     = ic50_train,
                       Ytest      = ic50_test,
                       side_info  = [ecfp, None],
                       direct     = False,
                       univariate = True,
                       num_latent = 32,
                       burnin     = 500,
                       nsamples   = 3500,
                       verbose    = 1).run()
smurff.calc_rmse(predictions)

When using the univariate sampler, we recommend larger values for burnin and nsamples, because it mixes more slowly than the blocked sampler.
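The slower mixing can be seen on a toy problem that is unrelated to SMURFF's internals: two strongly correlated Gaussian variables. The sketch below, with an assumed correlation of 0.95, compares drawing both variables jointly with updating them one at a time, and measures how strongly successive samples depend on each other.

```python
import math
import random

RHO = 0.95  # assumed correlation between the two toy variables
COND_SD = math.sqrt(1 - RHO ** 2)  # std. dev. of one variable given the other

def blocked_chain(n, rng):
    """Draw (x, y) jointly, so successive draws are independent."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rng.gauss(RHO * x, COND_SD)
        pairs.append((x, y))
    return pairs

def univariate_chain(n, rng):
    """Gibbs sampling: update x given y, then y given x, one at a time."""
    x, y = 0.0, 0.0
    pairs = []
    for _ in range(n):
        x = rng.gauss(RHO * y, COND_SD)
        y = rng.gauss(RHO * x, COND_SD)
        pairs.append((x, y))
    return pairs

def lag1_autocorr(values):
    """Correlation between each sample and the next one in the chain."""
    mean = sum(values) / len(values)
    num = sum((a - mean) * (b - mean) for a, b in zip(values, values[1:]))
    den = sum((a - mean) ** 2 for a in values)
    return num / den

rng = random.Random(0)
n = 5000
ac_blocked = lag1_autocorr([x for x, _ in blocked_chain(n, rng)])
ac_univariate = lag1_autocorr([x for x, _ in univariate_chain(n, rng)])

print(f"lag-1 autocorrelation, blocked:    {ac_blocked:.3f}")
print(f"lag-1 autocorrelation, univariate: {ac_univariate:.3f}")
```

Highly autocorrelated samples carry less independent information each, so the one-at-a-time chain needs more iterations to reach the same accuracy, even though each iteration is cheaper.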
