Preprocessing and Input Files

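This section covers sparse tensors, splitting data into train and test sets, scaling and centering, and the example ChEMBL dataset.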

Sparse Tensors

class smurff.SparseTensor(data, shape=None)

Wrapper around a pandas DataFrame to represent a sparse tensor.

The DataFrame should have N index columns (int type) and 1 value column (float type), where N is the dimensionality of the tensor.

You can also specify the shape of the tensor. If you don’t, it is detected automatically.
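
As a minimal sketch of the layout described above, the following builds a 3-way tensor as a DataFrame with three integer index columns and one float value column. The column names and the `infer_shape` helper are hypothetical. The helper assumes that automatic shape detection means taking the largest index per dimension plus one, which this page does not confirm.

```python
import pandas as pd

# Three index columns (int) and one value column (float): a 3-way tensor.
# Column names are illustrative only.
df = pd.DataFrame({
    "compound": [0, 1, 1, 3],
    "protein":  [2, 0, 2, 1],
    "assay":    [0, 0, 1, 1],
    "value":    [5.1, 6.3, 4.8, 7.0],
})

def infer_shape(frame):
    """Hypothetical helper: guess the shape as (max index + 1) per index column."""
    index_columns = frame.columns[:-1]  # every column except the value column
    return tuple(int(frame[c].max()) + 1 for c in index_columns)

shape = infer_shape(df)
print(shape)

# With SMURFF installed, the wrapper would be built per the signature above:
#   tensor = smurff.SparseTensor(df, shape=shape)
```

Passing `shape` explicitly is likely the safer choice when the highest indices of some dimension happen to have no observed values, since no inference from the data can recover them.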

Split Train and Test

smurff.make_train_test(Y, ntest, shape=None, seed=None)

Splits a sparse matrix Y into a train and a test matrix.

Parameters:
  • Y (numpy.ndarray or pandas.DataFrame or smurff.SparseTensor) – Matrix/Array/Tensor to split

Returns:
  • Ytrain (csr_matrix) – train part
  • Ytest (csr_matrix) – test part
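
To illustrate the idea behind such a split, here is a minimal sketch that works on coordinate triplets (row, column, value) with NumPy only. It is not SMURFF's implementation: `split_entries` is a hypothetical helper, and it assumes that `ntest` is an integer count of entries to move to the test set, which this page does not specify.

```python
import numpy as np

def split_entries(rows, cols, vals, ntest, seed=None):
    """Hypothetical sketch: move `ntest` randomly chosen entries to the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(vals))  # random order of the stored entries
    test_idx, train_idx = order[:ntest], order[ntest:]
    train = (rows[train_idx], cols[train_idx], vals[train_idx])
    test = (rows[test_idx], cols[test_idx], vals[test_idx])
    return train, test

# Six known entries of a sparse 4 x 4 matrix.
rows = np.array([0, 0, 1, 2, 2, 3])
cols = np.array([1, 3, 0, 2, 3, 1])
vals = np.array([5.0, 6.5, 4.2, 7.1, 3.3, 6.0])

train, test = split_entries(rows, cols, vals, ntest=2, seed=0)
print(len(train[2]), len(test[2]))
```

Every known entry ends up in exactly one of the two sets, and fixing `seed` makes the split reproducible.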

Scaling and Centering

smurff.center.center_and_scale(m, mode, with_mean=True, with_std=True)

Center and/or scale the matrix m to the mean and/or standard deviation.

Parameters:
  • m ({array-like, sparse matrix}) – The data to center and scale.
  • mode ({ "rows", "cols", "global" }) –
    • “rows”: center/scale each row independently
    • “cols”: center/scale each column independently
    • “global”: center/scale using the global mean and/or standard deviation
  • with_mean (boolean, True by default) – If True, center the data before scaling.
  • with_std (boolean, True by default) – If True, scale the data to unit variance (or equivalently, unit standard deviation).

Returns:
  • m (array-like) – Transformed array.
  • mean (array-like or double or None) – Computed mean depending on mode
  • std (array-like or double or None) – Computed standard deviation depending on mode
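
The three modes can be sketched for a dense NumPy array as follows. This is an illustration of the described behavior, not SMURFF's code: `center_and_scale_dense` is a hypothetical name, it handles dense input only, and it assumes the population standard deviation (NumPy's default).

```python
import numpy as np

def center_and_scale_dense(m, mode, with_mean=True, with_std=True):
    """Hypothetical dense-only sketch of the three modes; not SMURFF's code."""
    m = np.asarray(m, dtype=float)
    # axis=1 gives one statistic per row, axis=0 one per column,
    # axis=None a single statistic for the whole array.
    axis = {"rows": 1, "cols": 0, "global": None}[mode]
    keep = axis is not None  # keep dimensions so statistics broadcast against m
    mean = m.mean(axis=axis, keepdims=keep) if with_mean else None
    std = m.std(axis=axis, keepdims=keep) if with_std else None
    if with_mean:
        m = m - mean
    if with_std:
        m = m / std  # a constant row/column would give std == 0 here
    return m, mean, std

m = np.array([[1.0, 2.0, 3.0],
              [4.0, 6.0, 8.0]])
scaled, mean, std = center_and_scale_dense(m, "rows")
print(scaled.mean(axis=1))  # per-row means, close to zero
print(scaled.std(axis=1))   # per-row standard deviations, close to one
```

Returning `mean` and `std` alongside the transformed array makes it possible to apply the same transformation to new data or to undo it on predictions.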

Notes

Also supports scaling of sparse matrices. This makes sense only when the matrix is scarce, i.e. when the zero-elements represent unknown values.
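
The following sketch shows why the distinction matters, using a global mean as the example. It is an illustration only, with hypothetical variable names: statistics are computed over the known entries, and the same data is then treated as if the missing entries were real zeros for comparison.

```python
import numpy as np

# A 2 x 3 matrix with three known entries; the rest are unknown, not zero.
rows = np.array([0, 0, 1])
cols = np.array([0, 2, 1])
vals = np.array([4.0, 6.0, 8.0])
shape = (2, 3)

# Scarce interpretation: statistics over the known entries only.
scarce_mean = vals.mean()

# Treating unknowns as real zeros drags the mean toward zero.
dense = np.zeros(shape)
dense[rows, cols] = vals
dense_mean = dense.mean()

# Centering touches only the stored values, so the set of known
# positions (rows, cols) stays the same.
centered_vals = vals - scarce_mean
print(scarce_mean, dense_mean, centered_vals)
```

If the zeros were genuine measurements, subtracting a nonzero mean would turn every one of them into a nonzero value and the matrix would no longer be sparse.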

Example ChEMBL dataset

smurff.load_chembl()

Downloads a small subset of the ChEMBL dataset.

Returns:
  • ic50_train (sparse matrix) – sparse train matrix
  • ic50_test (sparse matrix) – sparse test matrix
  • feat (sparse matrix) – sparse row features