Preprocessing and Input Files

Sparse Tensors

class smurff.SparseTensor(data, shape=None)

Wrapper around a pandas DataFrame to represent a sparse tensor.

The DataFrame should have N index columns (int type) and 1 value column (float type), where N is the dimensionality of the tensor.

You can also specify the shape of the tensor. If you don’t, it is detected automatically.
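
As an illustration of this layout, the sketch below uses plain Python tuples in place of a DataFrame: each record carries N integer indices followed by one float value. The shape rule shown here (largest index plus one along each axis) is an assumption made for the example and is not confirmed by this page; `infer_shape` is a hypothetical helper, not part of smurff.

```python
# Illustrative only: a plain-Python stand-in for the DataFrame layout.
# Each record holds N index values (int) followed by one value (float).
records = [
    (0, 1, 2, 0.5),
    (1, 0, 3, 1.5),
    (2, 2, 0, -1.0),
]


def infer_shape(records):
    """Assumed rule: the size along each axis is the largest index plus one."""
    n_dims = len(records[0]) - 1  # the last entry is the value column
    return tuple(max(r[d] for r in records) + 1 for d in range(n_dims))


shape = infer_shape(records)
print(shape)
```

A DataFrame with three integer columns and one float column holding the same records would be the corresponding input to smurff.SparseTensor(data, shape=None).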

smurff.make_train_test(Y, ntest, shape=None, seed=None)

Splits a sparse matrix Y into a train and a test matrix.

Parameters:

Y (numpy.ndarray or pandas.DataFrame or smurff.SparseTensor) – Matrix/Array/Tensor to split

Returns:
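
The idea of such a split can be sketched in plain Python. The code below is not the smurff implementation: `split_entries` is a hypothetical helper, the observed cells are represented as (row, col, value) tuples, and `ntest` is assumed to be an integer count of cells to move into the test set.

```python
import random


def split_entries(entries, ntest, seed=None):
    """Randomly move `ntest` observed entries into a test set.

    `entries` is a list of (row, col, value) tuples. Illustrative
    helper only, not the smurff implementation.
    """
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    test_idx = set(rng.sample(range(len(entries)), ntest))
    train = [e for i, e in enumerate(entries) if i not in test_idx]
    test = [e for i, e in enumerate(entries) if i in test_idx]
    return train, test


entries = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0), (2, 0, 4.0), (2, 2, 5.0)]
train, test = split_entries(entries, ntest=2, seed=42)
print(len(train), len(test))
```

Every observed cell ends up in exactly one of the two parts, so the train and test sets never overlap.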

smurff.center.center_and_scale(m, mode, with_mean=True, with_std=True)

Center and/or scale the matrix m to the mean and/or standard deviation.

Parameters:

m ({array-like, sparse matrix}) – The data to center and scale.
mode ({ "rows", "cols", "global" }) –
- “rows”: center/scale each row independently
- “cols”: center/scale each column independently
- “global”: center/scale using the global mean and/or standard deviation
with_mean (boolean, True by default) – If True, center the data before scaling.
with_std (boolean, True by default) – If True, scale the data to unit variance (or equivalently, unit standard deviation).

Returns:

m (array-like) – Transformed array.
mean (array-like or double or None) – Computed mean depending on mode
std (array-like or double or None) – Computed standard deviation depending on mode

Notes

Also supports scaling of sparse matrices. This makes sense only when the matrix is scarce, i.e. when the zero-elements represent unknown values.
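
To make the three modes concrete, here is a minimal sketch for a small dense matrix stored as a list of lists. It is not the smurff code: `center_and_scale_sketch` is a hypothetical name, sparse input is not handled, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def center_and_scale_sketch(m, mode, with_mean=True, with_std=True):
    """Dense-only illustration of the "rows", "cols" and "global" modes."""

    def transform(values):
        mu = mean(values) if with_mean else None
        sd = pstdev(values) if with_std else None  # assumed: population std
        out = [v - mu for v in values] if with_mean else list(values)
        if with_std and sd > 0:  # leave constant data unscaled
            out = [v / sd for v in out]
        return out, mu, sd

    if mode == "rows":
        results = [transform(row) for row in m]
        return ([r[0] for r in results],
                [r[1] for r in results],
                [r[2] for r in results])
    if mode == "cols":
        # Transpose, treat columns as rows, then transpose back.
        cols = [list(c) for c in zip(*m)]
        out, mu, sd = center_and_scale_sketch(cols, "rows", with_mean, with_std)
        return [list(r) for r in zip(*out)], mu, sd
    if mode == "global":
        n_cols = len(m[0])
        flat, mu, sd = transform([v for row in m for v in row])
        return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)], mu, sd
    raise ValueError("mode must be 'rows', 'cols' or 'global'")


m = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
scaled, row_means, row_stds = center_and_scale_sketch(m, "rows")
print(row_means)
```

With “rows” and “cols” the returned mean and standard deviation are lists with one entry per row or column; with “global” they are single numbers.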

smurff.load_chembl()

Downloads a small subset of the ChEMBL dataset.

Returns:

ic50_train (sparse matrix) – sparse train matrix
ic50_test (sparse matrix) – sparse test matrix
feat (sparse matrix) – sparse row features