Note
This page was generated from notebooks/centering.ipynb.
Centering
In this notebook we look at why it is important to center your matrices before using them with SMURFF.
SMURFF provides a function, smurff.center_and_scale, that is similar to sklearn.preprocessing.scale, except that it also supports sparse matrices. Centering a sparse matrix makes sense only when the matrix is scarce, i.e. when the zero elements represent unknown values rather than actual zeros.
Matrix factorization methods are meant to model the variance in the data, so it makes sense to subtract the mean first.
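To make the idea of centering a scarce matrix concrete, here is a minimal sketch that uses only NumPy and SciPy. It is not SMURFF's implementation: the helper `center_stored_entries` and the toy matrix are hypothetical, and the sketch assumes that every stored entry is a known value and every unstored cell is unknown.

```python
import numpy as np
import scipy.sparse as sp


def center_stored_entries(matrix):
    """Subtract the mean of the stored (known) entries from those entries only.

    Hypothetical helper for illustration; unstored cells are treated as
    unknown and are left untouched.
    """
    centered = sp.csr_matrix(matrix, dtype=np.float64, copy=True)
    mean = centered.data.mean()
    centered.data -= mean  # modifies stored entries only
    return centered, mean


# Toy 3x3 matrix with four known values; the other five cells are unknown.
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 2])
vals = np.array([4.0, 6.0, 5.0, 9.0])
toy = sp.csr_matrix((vals, (rows, cols)), shape=(3, 3))

toy_centered, toy_mean = center_stored_entries(toy)

print(toy_mean)                  # mean over the four known values
print(toy_centered.data.mean())  # numerically zero after centering
print(toy.toarray().mean())      # dense mean wrongly counts unknowns as 0
```

Note that a known value equal to the mean becomes an explicitly stored zero after centering. It has to stay stored, because dropping it would turn a known value into an unknown one.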
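import smurff
# Load the ChEMBL IC50 train/test matrices and the ECFP features.
ic50_train, ic50_test, ecfp = smurff.load_chembl()
# Subtract the global mean of the training values; leave the scale unchanged.
ic50_train_centered, global_mean, _ = smurff.center_and_scale(ic50_train, "global", with_mean=True, with_std=False)
# Center the test set with the training mean. This is an alias, not a copy,
# so ic50_test itself is modified in place.
ic50_test_centered = ic50_test
ic50_test_centered.data -= global_mean  # only touch the stored (non-zero) entries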
When we now run a SMURFF training session, the session information shows that the data has been centered: the component-wise mean is zero up to floating-point error.
PythonSession {
  Data: {
    Type: ScarceMatrixData [with NAs]
    Component-wise mean: 3.86555e-16
    ...
session = smurff.BPMFSession(
    Ytrain=ic50_train_centered,
    Ytest=ic50_test_centered,  # same object as ic50_test, which was centered in place
    num_latent=16,
    burnin=40,
    nsamples=200,
    verbose=1,
)
predictions = session.run()
rmse = smurff.calc_rmse(predictions)
print(rmse)