Training

The most versatile class is TrainSession. MacauSession and BPMFSession provide a simpler interface.

TrainSession

class smurff.TrainSession(priors=[u'normal', u'normal'], num_latent=NUM_LATENT_DEFAULT_VALUE, num_threads=NUM_THREADS_DEFAULT_VALUE, burnin=BURNIN_DEFAULT_VALUE, nsamples=NSAMPLES_DEFAULT_VALUE, seed=RANDOM_SEED_DEFAULT_VALUE, threshold=None, verbose=1, save_prefix=None, save_extension=None, save_freq=None, checkpoint_freq=None, csv_status=None)

Class for doing a training run in smurff

A simple use case could be:

>>> session = smurff.TrainSession(burnin = 5, nsamples = 5)
>>> session.addTrainAndTest(Ydense)
>>> session.run()
priors

The type of prior to use for each dimension

Type:list, where each element is one of { “normal”, “normalone”, “macau”, “macauone”, “spikeandslab” }
num_latent

Number of latent dimensions in the model

Type:int
burnin

Number of burnin samples to discard

Type:int
nsamples

Number of samples to keep

Type:int
num_threads

Number of OpenMP threads to use for model building

Type:int
verbose

Verbosity level for C++ library

Type:{0, 1, 2}
seed

Random seed to use for sampling

Type:float
save_prefix

Path where to store the samples. The path includes the directory name, as well as the initial part of the file names.

Type:path
save_freq
  • N>0: save every Nth sample
  • N==0: never save a sample
  • N==-1: save only the last sample
Type:int
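The three save_freq cases can be read as a rule over the sample index. The following is a minimal sketch of that rule in plain Python, not smurff's own implementation; the helper name `should_save` and the 1-based sample numbering are assumptions made for illustration.

```python
def should_save(sample_idx, nsamples, save_freq):
    """Return True if the sample with 1-based index `sample_idx` would be saved.

    Hypothetical helper that mirrors the documented save_freq cases.
    """
    if save_freq > 0:
        # N > 0: save every Nth sample
        return sample_idx % save_freq == 0
    if save_freq == -1:
        # N == -1: save only the last sample
        return sample_idx == nsamples
    # N == 0: never save (other values are not documented)
    return False


saved = [i for i in range(1, 11) if should_save(i, 10, 3)]
print(saved)  # [3, 6, 9]
```

Whether the real library also keeps the final sample when N > 0 is not stated here, so the sketch does not assume it.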
save_extension
  • .csv: save in textual csv file format
  • .ddm: save in binary file format
Type:{ “.csv”, “.ddm” }
checkpoint_freq

Save the state of the session every N seconds.

Type:int
csv_status

File in which to store a limited set of parameters that indicate training progress. See StatusItem.

Type:filepath
addData(self, pos, Y, is_scarce=False, noise=PyNoiseConfig())

Stacks more matrices/tensors next to the main train matrix.

pos : shape
Block position of the data with respect to train. The train matrix/tensor has implicit block position (0, 0).
Y : numpy.ndarray, scipy.sparse matrix or SparseTensor
Data matrix/tensor to add
is_scarce : bool
When Y is sparse, and is_scarce is True the missing values are considered as unknown. When Y is sparse, and is_scarce is False the missing values are considered as zero. When Y is dense, this parameter is ignored.
noise : PyNoiseConfig
Noise model to use for Y
addPropagatedPosterior(self, mode, mu, Lambda)

Adds mu and Lambda from propagated posterior

mode : int
dimension to add side info (rows = 0, cols = 1)
mu : numpy.ndarray matrix
Mean matrix. mu should have as many rows as num_latent and as many columns as the size of dimension mode in train.
Lambda : numpy.ndarray matrix
Covariance matrix. Lambda should be shaped K x K x N, where K == num_latent and N is the size of dimension mode in train.
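The shape requirements for mu and Lambda can be computed from the shape of the train matrix. This is a small sketch in plain Python; the helper name `propagated_posterior_shapes` and the example sizes are hypothetical and not part of smurff.

```python
def propagated_posterior_shapes(train_shape, mode, num_latent):
    """Return the expected (mu_shape, Lambda_shape) for one mode.

    Hypothetical helper: K is num_latent, N is the size of dimension
    `mode` in the train matrix/tensor.
    """
    n = train_shape[mode]
    k = num_latent
    return (k, n), (k, k, n)


# A 1000 x 50 train matrix, posterior for the columns (mode 1), 16 latent dims.
mu_shape, lambda_shape = propagated_posterior_shapes((1000, 50), mode=1, num_latent=16)
print(mu_shape, lambda_shape)  # (16, 50) (16, 16, 50)
```

Checking array shapes against these tuples before calling addPropagatedPosterior can catch a transposed mu early.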
addSideInfo(self, mode, Y, noise=PyNoiseConfig(), tol=1e-6, direct=False)

Adds fully known side info, for use with the macau or macauone prior

mode : int
dimension to add side info (rows = 0, cols = 1)
Y : numpy.ndarray, scipy.sparse matrix
Side info matrix/tensor. Y should have as many rows as there are elements in the dimension selected using mode. Columns in Y are features for each element.
noise : PyNoiseConfig
Noise model to use for Y
direct : boolean
  • When True, uses a direct inversion method.
  • When False, uses a CG solver

The direct method is only feasible for a small (< 100K) number of features.

tol : float
Tolerance for the CG solver.
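One way to apply the guidance on direct versus CG is to pick the solver from the number of feature columns. The sketch below assumes a session object exposing addSideInfo with the signature shown above; the helper `add_side_info`, the constant `DIRECT_FEATURE_LIMIT`, and the `RecordingSession` stand-in are hypothetical and exist only so the example runs without smurff installed.

```python
DIRECT_FEATURE_LIMIT = 100_000  # taken from the "< 100K features" guidance


def add_side_info(session, mode, side_info, num_features, tol=1e-6):
    """Call session.addSideInfo, choosing the solver from the feature count."""
    direct = num_features < DIRECT_FEATURE_LIMIT
    session.addSideInfo(mode, side_info, tol=tol, direct=direct)
    return direct


class RecordingSession:
    """Stand-in that records calls, so the sketch runs without smurff."""

    def __init__(self):
        self.calls = []

    def addSideInfo(self, mode, Y, noise=None, tol=1e-6, direct=False):
        self.calls.append({"mode": mode, "tol": tol, "direct": direct})


session = RecordingSession()
used_direct_small = add_side_info(session, 0, "row features", num_features=2_000)
used_direct_large = add_side_info(session, 1, "col features", num_features=500_000)
print(used_direct_small, used_direct_large)  # True False
```

With a real TrainSession, the same helper would be passed the session and an actual side info matrix instead of the placeholder strings.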
addTrainAndTest(self, Y, Ytest=None, noise=PyNoiseConfig(), is_scarce=True)

Adds a train and optionally a test matrix as input data to this TrainSession

Parameters:
  • Y – Train matrix/tensor
  • Ytest (scipy.sparse matrix or SparseTensor) – Test matrix/tensor. Mainly used for calculating RMSE.
  • noise – Noise model to use for Y
  • is_scarce (bool) – When Y is sparse, and is_scarce is True the missing values are considered as unknown. When Y is sparse, and is_scarce is False the missing values are considered as zero. When Y is dense, this parameter is ignored.
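The effect of is_scarce on a sparse input can be illustrated by listing which cells count as observed. This is a conceptual sketch in plain Python using a dictionary of coordinates; it does not reflect how smurff stores data, and the helper name `observed_cells` is hypothetical.

```python
def observed_cells(shape, entries, is_scarce):
    """Return {(row, col): value} for the cells treated as observed.

    `entries` holds the explicitly stored values of a sparse matrix.
    """
    if is_scarce:
        # Missing cells stay unknown: only stored entries are observed.
        return dict(entries)
    rows, cols = shape
    # Missing cells are considered zero: every cell is observed.
    return {(r, c): entries.get((r, c), 0.0)
            for r in range(rows) for c in range(cols)}


entries = {(0, 0): 4.0, (1, 2): 2.5}
print(len(observed_cells((2, 3), entries, is_scarce=True)))   # 2
print(len(observed_cells((2, 3), entries, is_scarce=False)))  # 6
```

For rating-style data, where an absent entry means "not rated", the scarce interpretation is usually the intended one.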
getConfig(self)

Get this TrainSession’s configuration in ini-file format

getRmseAvg(self)

Average RMSE across all samples for the test matrix

getStatus(self)

Returns StatusItem with current state of the session

getTestPredictions(self)

Get predictions for test matrix.

Returns:list of Prediction
Return type:list
init(self)

Initializes the TrainSession after all data has been added.

You need to call this method before calling step(), unless you call run()

Returns:StatusItem of the session
Return type:StatusItem
makePredictSession(self)

Makes a PredictSession based on the model that was built in this TrainSession.

run(self)

Equivalent to:

self.init()
while self.step():
    pass
step(self)

Does one sampling or burnin iteration.

Returns:
  • StatusItem of the session, when a step was executed.
  • None, after the last iteration, when no step was executed.
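Calling init() and step() by hand makes it possible to inspect the status after every iteration. The sketch below follows the protocol described above (step() returns None once no step was executed); the function `run_collecting_status` and the `FakeSession` stand-in are hypothetical and only serve to make the example runnable without smurff.

```python
def run_collecting_status(session):
    """Same loop as run(), but keep every status returned by step()."""
    history = []
    session.init()
    while True:
        status = session.step()
        if status is None:  # no step was executed: the run is finished
            break
        history.append(status)
    return history


class FakeSession:
    """Stand-in with the init()/step() protocol, not a real TrainSession."""

    def __init__(self, total_steps):
        self.remaining = total_steps

    def init(self):
        return "initialized"

    def step(self):
        if self.remaining == 0:
            return None
        self.remaining -= 1
        return {"steps_left": self.remaining}


history = run_collecting_status(FakeSession(total_steps=3))
print(len(history))  # 3
```

Testing explicitly for None, rather than relying on truthiness, keeps the loop correct even if a status object happens to evaluate as false.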

MacauSession

class smurff.MacauSession(Ytrain, Ytest=None, side_info=None, univariate=False, direct=False, **args)

A train session specialized for use with the Macau algorithm

Ytrain : numpy.ndarray, scipy.sparse matrix or SparseTensor
Train matrix/tensor
Ytest : scipy.sparse matrix or SparseTensor
Test matrix/tensor. Mainly used for calculating RMSE.
side_info : list of numpy.ndarray, scipy.sparse matrix or None
Side info matrix/tensor for each dimension. If there is no side info for a certain mode, pass None. Each side info should have as many rows as there are elements in the corresponding dimension of Ytrain.
direct : bool
Use Cholesky instead of CG solver
univariate : bool
Use univariate sampling when True, multivariate sampling when False.
**args:
Extra arguments are passed to the TrainSession

BPMFSession

class smurff.BPMFSession(Ytrain, Ytest=None, univariate=False, **args)

A train session specialized for use with the BPMF algorithm

Ytrain : numpy.ndarray, scipy.sparse matrix or SparseTensor
Train matrix/tensor
Ytest : scipy.sparse matrix or SparseTensor
Test matrix/tensor. Mainly used for calculating RMSE.
univariate : bool
Use univariate sampling when True, multivariate sampling when False.
**args:
Extra arguments are passed to the TrainSession

StatusItem

class smurff.StatusItem(phase, iter, phase_iter, model_norms, rmse_avg, rmse_1sample, train_rmse, auc_1sample, auc_avg, elapsed_iter, nnz_per_sec, samples_per_sec)

Short set of parameters indicative for the training progress.

phase
Type:{ “Burnin”, “Sampling” }
iter

Current iteration in current phase

Type:int
phase_iter

Number of iterations in this phase

Type:int
model_norms

Norm of each U/V matrix

Type:list of float
rmse_avg

Average RMSE for test matrix across all samples

Type:float
rmse_1sample

RMSE for test matrix of last sample

Type:float
train_rmse

RMSE for train matrix of last sample

Type:float
auc_1sample

ROC AUC of the test matrix of the last sample. Only available if you provided a threshold.

Type:float
auc_avg

Average ROC AUC of the test matrix across all samples. Only available if you provided a threshold.

Type:float
elapsed_iter

Number of seconds the last sampling iteration took

Type:float
nnz_per_sec

Compute performance indicator; number of non-zero elements in train processed per second

Type:float
samples_per_sec

Compute performance indicator; number of rows and columns in U/V processed per second

Type:float
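The StatusItem attributes can be combined into a one-line progress report. The sketch below only reads attributes listed above; the function `format_status` is hypothetical, and the example object is a stand-in with made-up values rather than output from a real training run.

```python
from types import SimpleNamespace


def format_status(status):
    """Build a one-line progress summary from StatusItem-like attributes."""
    return "{} {}/{}: rmse_avg={:.4f} rmse_1sample={:.4f} ({:.2f}s)".format(
        status.phase, status.iter, status.phase_iter,
        status.rmse_avg, status.rmse_1sample, status.elapsed_iter)


# Stand-in object with made-up values, only to make the sketch runnable.
example = SimpleNamespace(phase="Sampling", iter=3, phase_iter=5,
                          rmse_avg=0.9123, rmse_1sample=0.95, elapsed_iter=0.25)
line = format_status(example)
print(line)  # Sampling 3/5: rmse_avg=0.9123 rmse_1sample=0.9500 (0.25s)
```

The same function could be applied to each value returned by step() or to the result of getStatus().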