Note

This page was generated from notebooks/different_noise_models.ipynb.

Different noise models

This notebook walks through the noise models available in SMURFF: fixed Gaussian noise, adaptive Gaussian noise, and probit noise for binary data.

Prepare train, test and side-info

We first need to download and prepare the data files. This can be accomplished using a built-in function in smurff. IC50 is a compound x protein matrix; the ECFP matrix provides features that serve as side information on the compounds.

[ ]:
import smurff
import logging
logging.basicConfig(level = logging.INFO)

ic50_train, ic50_test, ecfp = smurff.load_chembl()

Fixed noise

The noise model of the observed data can be specified by calling addTrainAndTest with the optional parameter noise_model. The default for this parameter is FixedNoise with precision 5.0.

[ ]:
trainSession = smurff.TrainSession(
                            priors = ['normal', 'normal'],
                            num_latent=32,
                            burnin=100,
                            nsamples=500)

# the following line is equivalent to the default, i.e. not specifying noise_model
trainSession.addTrainAndTest(ic50_train, ic50_test, smurff.FixedNoise(5.0))
predictions = trainSession.run()
print("RMSE = %.2f" % smurff.calc_rmse(predictions))
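
To get a feel for what a precision of 5.0 means, recall that the precision of Gaussian noise is the inverse of its variance. The sketch below uses only the Python standard library; the helper name precision_to_std is hypothetical and is not part of SMURFF.

```python
import math

def precision_to_std(precision):
    """Convert a Gaussian noise precision (1 / variance) to a standard deviation."""
    if precision <= 0:
        raise ValueError("precision must be positive")
    return math.sqrt(1.0 / precision)

# A precision of 5.0 corresponds to a noise variance of 1 / 5.0 = 0.2
print("noise std = %.3f" % precision_to_std(5.0))
```

A larger precision therefore expresses more trust in the observed values, while a smaller one tells the model to expect noisier data.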

Adaptive noise

Instead of a fixed precision, we can also allow the model to automatically determine the precision of the noise by using AdaptiveNoise, with signal-to-noise ratio parameters sn_init and sn_max.

  • sn_init is an initial signal-to-noise ratio.
  • sn_max is the maximum allowed signal-to-noise ratio. This means that if the updated precision would imply a higher signal-to-noise ratio than sn_max, then the precision value is set to (sn_max + 1.0) / Yvar where Yvar is the variance of the training dataset Y.
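
The cap described in the last bullet can be sketched in plain Python. This is an illustration of the formula only, not SMURFF's sampler: the function names are hypothetical, and the implied signal-to-noise ratio alpha * Yvar - 1.0 is an assumption obtained by inverting the formula above.

```python
def implied_sn(alpha, y_var):
    """Signal-to-noise ratio implied by precision alpha (assumed: alpha * Yvar - 1)."""
    return alpha * y_var - 1.0

def cap_precision(alpha, y_var, sn_max):
    """Clamp a proposed precision so the implied signal-to-noise ratio stays <= sn_max."""
    alpha_max = (sn_max + 1.0) / y_var
    return min(alpha, alpha_max)

y_var = 2.0    # toy variance of the training data Y
sn_max = 10.0  # maximum allowed signal-to-noise ratio

# A proposal above the cap is replaced by (sn_max + 1.0) / Yvar,
# while a proposal below the cap passes through unchanged.
print(cap_precision(8.0, y_var, sn_max))
print(cap_precision(3.0, y_var, sn_max))
```
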
[ ]:
trainSession = smurff.TrainSession(
                            priors = ['normal', 'normal'],
                            num_latent=32,
                            burnin=100,
                            nsamples=500)

trainSession.addTrainAndTest(ic50_train, ic50_test, smurff.AdaptiveNoise(1.0, 10.))
predictions = trainSession.run()
print("RMSE = %.2f" % smurff.calc_rmse(predictions))

Binary matrices

SMURFF can also factorize binary matrices (with or without side information). The input matrices can contain arbitrary values, which are converted to binary values by means of a threshold. To factorize them we employ the probit noise model ProbitNoise, which takes this threshold as a parameter.

To evaluate binary factorization, we recommend using ROC AUC, which can be enabled by also providing a threshold to the TrainSession.

[ ]:
ic50_threshold = 6.
trainSession = smurff.TrainSession(
                            priors = ['normal', 'normal'],
                            num_latent=32,
                            burnin=100,
                            nsamples=100,
                            # Using threshold of 6. to calculate AUC on test data
                            threshold=ic50_threshold)

## using activity threshold pIC50 > 6. to binarize train data
trainSession.addTrainAndTest(ic50_train, ic50_test, smurff.ProbitNoise(ic50_threshold))
predictions = trainSession.run()
print("RMSE = %.2f" % smurff.calc_rmse(predictions))
print("AUC = %.2f" % smurff.calc_auc(predictions, ic50_threshold))
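
For intuition about what the AUC number measures, here is a minimal pure-Python sketch. It is not the implementation behind smurff.calc_auc; the function roc_auc and the toy data are hypothetical. It computes ROC AUC as the fraction of (positive, negative) pairs that the model ranks in the right order, counting ties as half.

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of correctly ordered (positive, negative) pairs."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct pair
    return wins / (len(pos) * len(neg))

threshold = 6.0
true_values = [5.1, 6.8, 7.2, 4.9]    # toy pIC50-like test values
pred_scores = [-0.3, 0.9, 0.05, 0.1]  # toy model outputs
labels = [v > threshold for v in true_values]
print("AUC = %.2f" % roc_auc(pred_scores, labels))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of actives from inactives, which is why it is more informative than RMSE for thresholded data.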

The input train and test sets are converted to -1 and +1: values below the threshold become -1 and values above it become +1. Similarly, a resulting prediction is negative if the model predicts the value to be below the threshold, and positive if it predicts the value to be above it.

[ ]:
predictions[0]
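
The conversion to -1 and +1 can be sketched in plain Python as follows. The helper binarize is hypothetical, and the handling of values exactly equal to the threshold (mapped to -1 here) is an assumption, since the text above only covers values strictly below or above it.

```python
def binarize(values, threshold):
    """Map each value to +1 if it is above the threshold, otherwise to -1."""
    # Assumption: a value equal to the threshold is treated as "not above".
    return [1 if v > threshold else -1 for v in values]

toy_pic50 = [5.1, 6.8, 7.2, 4.9]  # toy activity values, not real ChEMBL data
print(binarize(toy_pic50, 6.0))
```

The sign of a prediction can be read the same way: a positive value means the model expects the entry to lie above the threshold.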

Binary matrices with Side Info

It is possible to enhance the model for binary matrices by adding side information using the Macau algorithm. Note that "binary" here refers to the train and test data, not to the side information.

[ ]:
ic50_threshold = 6.
trainSession = smurff.TrainSession(
                            priors = ['macau', 'normal'],
                            num_latent=32,
                            burnin=100,
                            nsamples=100,
                            # Using threshold of 6. to calculate AUC on test data
                            threshold=ic50_threshold)

## using activity threshold pIC50 > 6. to binarize train data
trainSession.addTrainAndTest(ic50_train, ic50_test, smurff.ProbitNoise(ic50_threshold))
trainSession.addSideInfo(0, ecfp, direct = False)
predictions = trainSession.run()
print("RMSE = %.2f" % smurff.calc_rmse(predictions))
print("AUC = %.2f" % smurff.calc_auc(predictions, ic50_threshold))