Centering

In this notebook we look at why it is important to center your matrices before using them with SMURFF.

SMURFF provides a function, smurff.center_and_scale, similar to sklearn.preprocessing.scale, with the difference that SMURFF also supports scaling of sparse matrices. Centering a sparse matrix makes sense only when the matrix is scarce, i.e. when the zero elements represent unknown values rather than actual zeros.

Matrix factorization methods are meant to model the variance in the data, so it makes sense to subtract the mean first.
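
To make the idea of centering a scarce matrix concrete, here is a minimal sketch using only NumPy and SciPy. It is not SMURFF's implementation: the helper `center_scarce_global` and the toy matrix are hypothetical and only illustrate that the mean is taken over the stored (observed) entries and subtracted from those entries alone.

```python
import numpy as np
import scipy.sparse as sp

def center_scarce_global(mat):
    """Hypothetical helper: subtract the mean of the stored entries
    from the stored entries only, leaving unobserved cells untouched."""
    mat = sp.csr_matrix(mat, dtype=np.float64, copy=True)
    mean = mat.data.mean() if mat.nnz > 0 else 0.0
    mat.data -= mean
    return mat, mean

# Toy 3x3 scarce matrix: only 4 of the 9 cells are observed.
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 2])
vals = np.array([5.0, 7.0, 6.0, 10.0])
Y = sp.csr_matrix((vals, (rows, cols)), shape=(3, 3))

Y_centered, mean = center_scarce_global(Y)

print(mean)                # mean over the 4 observed values
print(Y_centered.data)     # observed values minus that mean
print(Y.toarray().mean())  # dense mean, which counts unobserved cells as 0
```

The last line shows why a dense mean is the wrong choice here: treating unknown cells as zeros pulls the mean towards zero.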

[ ]:

import smurff

import logging
logging.basicConfig(level = logging.INFO)

ic50_train, ic50_test, ecfp = smurff.load_chembl()

ic50_train_centered, global_mean, _ = smurff.center_and_scale(ic50_train, "global", with_mean = True, with_std = False)

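# Center the test set with the training mean.
# ic50_test_centered is the same object as ic50_test, so both names see the shift.
ic50_test_centered = ic50_test
ic50_test_centered.data -= global_mean # only touch non-zeros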
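
Assigning a sparse matrix to a new name does not copy it, so the in-place subtraction above also changes ic50_test. The sketch below illustrates this with a small SciPy matrix and shows how `.copy()` keeps the original values when they are still needed. The matrix and `train_mean` are made-up values for illustration only.

```python
import numpy as np
import scipy.sparse as sp

test = sp.csr_matrix(np.array([[0.0, 4.0], [6.0, 0.0]]))
train_mean = 5.0  # hypothetical mean computed on the training data

alias = test               # second name for the same object
independent = test.copy()  # separate matrix with its own data array

alias.data -= train_mean   # shifts the stored entries of `test` as well

print(alias is test)       # the two names refer to one matrix
print(test.data)           # already shifted by train_mean
print(independent.data)    # still the original values
```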

When we now run a SMURFF training session, the session information shows that the data has been centered:

PythonSession {
  Data: {
    Type: ScarceMatrixData [with NAs]
    Component-wise mean: 3.86555e-16
    ...

[ ]:

trainSession = smurff.BPMFSession(
                       Ytrain     = ic50_train_centered,
                       Ytest      = ic50_test_centered,
                       num_latent = 16,
                       burnin     = 40,
                       nsamples   = 200,
                       verbose    = 0)

predictions = trainSession.run()
rmse = smurff.calc_rmse(predictions)
print(rmse)
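
Because the model is trained on centered data, its predictions are on the centered scale. The following sketch, which uses plain NumPy arrays with made-up values rather than SMURFF's prediction objects, shows that adding the training mean back recovers predictions on the original scale, and that the RMSE is the same on both scales as long as the test values were centered with the same mean.

```python
import numpy as np

global_mean = 6.2                           # hypothetical training mean
y_true = np.array([5.0, 7.5, 6.1])          # hypothetical test values, original scale
pred_centered = np.array([-1.0, 1.1, 0.2])  # hypothetical predictions, centered scale

def rmse(a, b):
    """Root mean squared error between two equally sized arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# RMSE on the centered scale: compare against centered test values.
rmse_centered = rmse(pred_centered, y_true - global_mean)

# Shift predictions back to the original scale and compare again.
pred_original = pred_centered + global_mean
rmse_original = rmse(pred_original, y_true)

print(rmse_centered, rmse_original)  # equal up to floating-point rounding
```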