Preprocessing and Input Files
This section covers sparse tensors, splitting data into train and test sets, scaling and centering, and an example ChEMBL dataset.
Sparse Tensors
class smurff.SparseTensor(data, shape=None)
Wrapper around a pandas DataFrame to represent a sparse tensor.
The DataFrame should have N index columns (int type) and 1 value column (float type), where N is the dimensionality of the tensor.
You can also specify the shape of the tensor. If you don’t, it is detected automatically.
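The layout described above (N integer index columns plus one float value column) can be illustrated without pandas or smurff. The following is a minimal sketch in plain Python; `infer_shape` is a hypothetical helper, and it assumes that automatic shape detection means "largest zero-based index plus one" along each dimension, which the text does not state.

```python
# Each record is (index_0, ..., index_{N-1}, value): N integer index
# columns followed by one float value column. Here N = 3.
records = [
    (0, 0, 1, 0.5),
    (1, 2, 0, 1.5),
    (2, 1, 3, 2.0),
]


def infer_shape(records):
    """Infer a shape as (largest index + 1) along each dimension.

    Assumption: indices are zero-based. This is an illustration only,
    not necessarily how the library detects the shape.
    """
    if not records:
        raise ValueError("cannot infer a shape from zero records")
    ndim = len(records[0]) - 1  # the last column holds the value
    return tuple(max(r[d] for r in records) + 1 for d in range(ndim))


shape = infer_shape(records)
print(shape)
```

Passing an explicit shape is still useful when the trailing rows, columns, or slices of the tensor are entirely empty, since no stored entry would reveal their existence.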
Split Train and Test
smurff.make_train_test(Y, ntest, shape=None, seed=None)
Splits a sparse matrix Y into a train and a test matrix.
Parameters: - Y (numpy.ndarray or pandas.DataFrame or smurff.SparseTensor) – Matrix/Array/Tensor to split
Returns: - Ytrain (csr_matrix) – train part
- Ytest (csr_matrix) – test part
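To make the idea of the split concrete, here is a minimal sketch in plain Python that moves a random subset of the stored entries into a test set. It is not the library's implementation: `split_entries` is a hypothetical name, entries are plain `(row, col, value)` tuples rather than sparse matrices, and `ntest` is assumed to be an absolute number of test entries.

```python
import random


def split_entries(entries, ntest, seed=None):
    """Randomly move `ntest` stored entries to a test set.

    Assumption: `ntest` is an absolute count of test entries.
    """
    if not 0 <= ntest <= len(entries):
        raise ValueError("ntest must be between 0 and the number of entries")
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    test_positions = set(rng.sample(range(len(entries)), ntest))
    train = [e for i, e in enumerate(entries) if i not in test_positions]
    test = [e for i, e in enumerate(entries) if i in test_positions]
    return train, test


# Five stored entries of a 3 x 3 sparse matrix, as (row, col, value).
entries = [(0, 0, 1.0), (0, 2, 3.0), (1, 1, 2.5), (2, 0, 4.0), (2, 2, 0.5)]
train, test = split_entries(entries, ntest=2, seed=42)
print(len(train), len(test))
```

Every stored entry ends up in exactly one of the two parts, and repeating the call with the same seed gives the same split.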
Scaling and Centering
smurff.center.center_and_scale(m, mode, with_mean=True, with_std=True)
Center and/or scale the matrix m to the mean and/or standard deviation.
Parameters: - m ({array-like, sparse matrix}) – The data to center and scale.
- mode ({"rows", "cols", "global"}) –
  - "rows": center/scale each row independently
  - "cols": center/scale each column independently
  - "global": center/scale using the global mean and/or standard deviation
- with_mean (boolean, True by default) – If True, center the data before scaling.
- with_std (boolean, True by default) – If True, scale the data to unit variance (or equivalently, unit standard deviation).
Returns: - m (array-like) – Transformed array.
- mean (array-like or double or None) – Computed mean depending on mode
- std (array-like or double or None) – Computed standard deviation depending on mode
Notes
Also supports scaling of sparse matrices. This makes sense only when the matrix is scarce, i.e. when the zero-elements represent unknown values.
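The three modes can be illustrated on a small dense matrix. The sketch below uses only the standard library; `center_and_scale_sketch` is a hypothetical stand-in, not the library function. It assumes the population standard deviation is used and that constant data is left unscaled, and it does not cover the sparse or scarce case mentioned in the notes.

```python
from statistics import fmean, pstdev


def _standardize(values, with_mean, with_std):
    """Center and/or scale one group of values."""
    mean = fmean(values) if with_mean else None
    std = pstdev(values) if with_std else None  # population std (assumption)
    shift = mean if mean is not None else 0.0
    scale = std if std else 1.0  # leave constant data unscaled (assumption)
    return [(v - shift) / scale for v in values], mean, std


def center_and_scale_sketch(m, mode, with_mean=True, with_std=True):
    """Return (transformed matrix, mean, std) for a dense list of rows."""
    if mode == "global":
        flat = [v for row in m for v in row]
        out, mean, std = _standardize(flat, with_mean, with_std)
        ncols = len(m[0])
        rows = [out[i:i + ncols] for i in range(0, len(out), ncols)]
        return rows, mean, std
    if mode == "rows":
        results = [_standardize(row, with_mean, with_std) for row in m]
        out = [r[0] for r in results]
    elif mode == "cols":
        results = [_standardize(list(c), with_mean, with_std) for c in zip(*m)]
        # Transpose the per-column results back into rows.
        out = [list(row) for row in zip(*(r[0] for r in results))]
    else:
        raise ValueError('mode must be "rows", "cols" or "global"')
    means = [r[1] for r in results] if with_mean else None
    stds = [r[2] for r in results] if with_std else None
    return out, means, stds


m = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
by_cols, col_means, col_stds = center_and_scale_sketch(m, "cols")
centered, global_mean, global_std = center_and_scale_sketch(
    m, "global", with_std=False
)
print(col_means, col_stds, global_mean)
```

With "rows" or "cols" the returned mean and standard deviation hold one value per row or column, while "global" returns a single number for each, which matches the "array-like or double or None" return types listed above.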
Example ChEMBL dataset
smurff.load_chembl()
Downloads a small subset of the ChEMBL dataset.
Returns: - ic50_train (sparse matrix) – sparse train matrix
- ic50_test (sparse matrix) – sparse test matrix
- feat (sparse matrix) – sparse row features