Preprocessing and Input Files

This section covers sparse tensors, train/test splitting, scaling and centering, and the example ChEMBL dataset.

Sparse Tensors

class smurff.SparseTensor(data, shape=None)

Wrapper around a pandas DataFrame to represent a sparse tensor.

The DataFrame should have N index columns (int type) and 1 value column (float type), where N is the dimensionality of the tensor.

You can also specify the shape of the tensor. If you don’t, it is detected automatically.
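
The DataFrame layout described above can be sketched with pandas alone. The column names are illustrative, and the "largest index plus one" rule used here to infer the shape is an assumption: this page does not say how SMURFF detects the shape.

```python
import pandas as pd

# A 3-way tensor (N = 3) with four known cells.
# Column names are illustrative; what matters is N integer index
# columns plus one float value column.
df = pd.DataFrame({
    "i": [0, 1, 2, 2],
    "j": [0, 3, 1, 3],
    "k": [1, 0, 0, 1],
    "value": [0.5, 1.5, -2.0, 4.0],
})

index_cols = ["i", "j", "k"]

# Assumed detection rule: largest index along each mode, plus one.
inferred_shape = tuple(int(df[c].max()) + 1 for c in index_cols)

# With SMURFF installed, the frame would be wrapped as:
#   tensor = smurff.SparseTensor(df)                    # shape detected
#   tensor = smurff.SparseTensor(df, shape=(3, 4, 2))   # shape given
```

Passing shape explicitly is useful when trailing rows or columns of a mode contain no known cells, since those cannot be recovered from the indices alone.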

Split Train and Test

smurff.make_train_test(Y, ntest)

Splits a sparse matrix Y into a train and a test matrix.

Parameters:
  • Y (scipy sparse matrix (coo_matrix, csr_matrix or csc_matrix)) – Matrix to split
  • ntest (float < 1.0 or integer) –
    • if a float, the ratio of test cells
    • if an integer, the number of test cells
Returns:

  • Ytrain (coo_matrix) – train part
  • Ytest (coo_matrix) – test part
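
To make the ntest semantics concrete, here is a minimal sketch of such a split written with NumPy and SciPy only. The function split_train_test and its seed argument are hypothetical, and picking test cells uniformly at random is an assumption. It is not SMURFF's implementation.

```python
import numpy as np
import scipy.sparse as sp

def split_train_test(Y, ntest, seed=None):
    """Hypothetical stand-in for the split described above."""
    Y = sp.coo_matrix(Y)  # accepts coo, csr or csc input
    nnz = Y.nnz
    # A float is a ratio of test cells, an integer is a count.
    if isinstance(ntest, float):
        ntest = int(round(nnz * ntest))
    # Assumption: test cells are drawn uniformly from the stored cells.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(nnz)
    test_idx, train_idx = perm[:ntest], perm[ntest:]

    def take(idx):
        # Build a matrix of the same shape holding only the chosen cells.
        return sp.coo_matrix(
            (Y.data[idx], (Y.row[idx], Y.col[idx])), shape=Y.shape
        )

    return take(train_idx), take(test_idx)

# Example: hold out 20% of the stored cells of a random sparse matrix.
Y = sp.random(20, 10, density=0.5, format="csr", random_state=0)
Ytrain, Ytest = split_train_test(Y, 0.2, seed=0)

# With SMURFF installed, the equivalent call would be:
#   Ytrain, Ytest = smurff.make_train_test(Y, 0.2)
```

Every stored cell ends up in exactly one of the two outputs, and both keep the shape of Y.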

Scaling and Centering

smurff.center.center_and_scale(m, mode, with_mean=True, with_std=True)

Center and/or scale the matrix m to the mean and/or standard deviation.

Parameters:
  • m ({array-like, sparse matrix}) – The data to center and scale.
  • mode ({ "rows", "cols", "global" }) –
    • “rows”: center/scale each row independently
    • “cols”: center/scale each column independently
    • “global”: center/scale using the global mean and/or standard deviation
  • with_mean (boolean, True by default) – If True, center the data before scaling.
  • with_std (boolean, True by default) – If True, scale the data to unit variance (or equivalently, unit standard deviation).
Returns:

  • m (array-like) – Transformed array.
  • mean (array-like or double or None) – Computed mean depending on mode
  • std (array-like or double or None) – Computed standard deviation depending on mode

Notes

Sparse matrices are also supported. Scaling them makes sense only when the matrix is scarce, i.e. when the zero elements represent unknown values.
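
As an illustration of the three mode values, the following NumPy-only sketch centers and scales a small dense array. The function center_and_scale_dense is hypothetical and handles dense input only. Using the population standard deviation (NumPy's default) is an assumption, and the sketch does not guard against rows or columns with zero variance.

```python
import numpy as np

def center_and_scale_dense(m, mode, with_mean=True, with_std=True):
    """Illustrative dense-only version; not SMURFF's implementation."""
    m = np.asarray(m, dtype=float)
    # rows -> statistics per row, cols -> per column, global -> one scalar.
    axis = {"rows": 1, "cols": 0, "global": None}[mode]
    keep = axis is not None  # keep dims so the statistics broadcast back
    mean = m.mean(axis=axis, keepdims=keep) if with_mean else None
    if with_mean:
        m = m - mean
    std = m.std(axis=axis, keepdims=keep) if with_std else None
    if with_std:
        m = m / std
    return m, mean, std

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 6.0, 8.0]])

# Each row is shifted to zero mean and scaled to unit standard deviation.
Xc, mean, std = center_and_scale_dense(X, "rows")

# With SMURFF installed, the equivalent call would be:
#   Xc, mean, std = smurff.center.center_and_scale(X, "rows")
```

The returned mean and std can be kept to apply the same transformation to other data, or to map predictions back to the original scale.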

Example ChEMBL dataset

smurff.load_chembl()

Downloads a small subset of the ChEMBL dataset.

Returns:
  • ic50_train (sparse matrix) – sparse train matrix
  • ic50_test (sparse matrix) – sparse test matrix
  • feat (sparse matrix) – sparse row features