Training¶
The most versatile class is TrainSession.
MacauSession and BPMFSession provide a simpler interface.
TrainSession¶
-
class
smurff.TrainSession(priors=['normal', 'normal'], num_latent=None, num_threads=None, burnin=None, nsamples=None, seed=None, threshold=None, verbose=None, save_name=None, save_freq=None, checkpoint_freq=None)¶ Class for doing a training run in smurff
A simple use case could be:
>>> trainSession = smurff.TrainSession(burnin = 5, nsamples = 5)
>>> trainSession.setTrain(Ydense)
>>> trainSession.run()
-
priors¶ The type of prior to use for each dimension
Type: list, where element is one of { “normal”, “normalone”, “macau”, “macauone”, “spikeandslab” }
-
num_latent¶ Number of latent dimensions in the model
Type: int
-
burnin¶ Number of burnin samples to discard
Type: int
-
nsamples¶ Number of samples to keep
Type: int
-
num_threads¶ Number of OpenMP threads to use for model building
Type: int
-
verbose¶ Verbosity level for C++ library
Type: {0, 1, 2}
-
seed¶ Random seed to use for sampling
Type: float
-
save_name¶ HDF5 filename to store the samples.
Type: path
-
save_freq¶ - N>0: save every Nth sample
- N==0: never save a sample
- N==-1: save only the last sample
Type: int
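The three save_freq cases can be read as a small decision rule. The sketch below is only an illustration of the documented semantics: should_save is a hypothetical helper, not part of smurff, and it assumes samples are numbered from 1 to nsamples.

```python
def should_save(sample_idx, nsamples, save_freq):
    """Hypothetical helper (not part of smurff) mirroring the save_freq rules.

    sample_idx is assumed to be 1-based, running from 1 to nsamples.
    """
    if save_freq > 0:
        # N > 0: save every Nth sample
        return sample_idx % save_freq == 0
    if save_freq == -1:
        # N == -1: save only the last sample
        return sample_idx == nsamples
    # N == 0: never save a sample (other values are undocumented and
    # are treated the same way in this sketch)
    return False


saved_every_2 = [i for i in range(1, 7) if should_save(i, 6, 2)]
saved_last_only = [i for i in range(1, 7) if should_save(i, 6, -1)]
saved_never = [i for i in range(1, 7) if should_save(i, 6, 0)]
```

With six samples, a save_freq of 2 keeps samples 2, 4 and 6, while -1 keeps only sample 6.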
-
checkpoint_freq¶ Save the state of the trainSession every N seconds.
Type: int
-
addData(pos, Y, noise=<smurff.helper.FixedNoise object>, is_scarce=False)¶ Stacks more matrices/tensors next to the main train matrix.
- pos : shape
- Block position of the data with respect to train. The train matrix/tensor has implicit block position (0, 0).
- Y : numpy.ndarray, scipy.sparse matrix or SparseTensor
- Data matrix/tensor to add
- is_scarce : bool
- When Y is sparse and is_scarce is True, the missing values are considered unknown. When Y is sparse and is_scarce is False, the missing values are considered zero. When Y is dense, this parameter is ignored.
- noise : NoiseConfig
- Noise model to use for Y
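To picture what a block position means, the following NumPy-only sketch lays a second matrix next to the train matrix. It makes no smurff call. It assumes that a block at position (0, 1) shares its rows with train and sits to its right; np.hstack is used purely to visualise that layout.

```python
import numpy as np

# Train matrix: implicit block position (0, 0)
Ytrain = np.arange(12, dtype=float).reshape(4, 3)

# Extra matrix intended for block position (0, 1): assumed to share the
# 4 rows of train and to contribute 2 additional columns.
Yextra = np.ones((4, 2))

# Conceptual layout only: the two blocks drawn side by side.
layout = np.hstack([Ytrain, Yextra])
```

Going by the signatures on this page, the corresponding calls would presumably be trainSession.setTrain(Ytrain) followed by trainSession.addData((0, 1), Yextra).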
-
addPropagatedPosterior(mode, mu, Lambda)¶ Adds mu and Lambda from propagated posterior
- mode : int
- Dimension the propagated posterior applies to (rows = 0, cols = 1)
- mu : numpy.ndarray matrix
- Mean matrix. mu should have as many rows as num_latent and as many columns as the size of dimension mode in train.
- Lambda : numpy.ndarray
- Covariance matrix. Lambda should be shaped K x K x N, where K == num_latent and N == the size of dimension mode in train.
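The expected shapes are easy to get wrong, so here is a minimal NumPy sketch that builds placeholder arrays of the right shape. The values (zero means, identity matrices) and the sizes 4 and 10 are arbitrary choices for illustration.

```python
import numpy as np

num_latent = 4   # K, must match the num_latent of the TrainSession
n_rows = 10      # N, size of the chosen mode (rows, mode = 0) in train

# Mean: one column of length K per element of the mode -> K x N
mu = np.zeros((num_latent, n_rows))

# Lambda: one K x K matrix per element of the mode -> K x K x N
Lambda = np.stack([np.eye(num_latent)] * n_rows, axis=-1)
```

These arrays would then be passed as trainSession.addPropagatedPosterior(0, mu, Lambda).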
-
addSideInfo(mode, Y, noise=<smurff.helper.SampledNoise object>, direct=True)¶ Adds fully known side info, for use with the macau or macauone prior
- mode : int
- dimension to add side info (rows = 0, cols = 1)
- Y : numpy.ndarray or scipy.sparse matrix
- Side info matrix/tensor. Y should have as many rows as there are elements in the dimension selected using mode. Columns in Y are features for each element.
- noise : NoiseConfig
- Noise model to use for Y
- direct : boolean
- When True, uses a direct inversion method.
- When False, uses a CG solver
The direct method is only feasible for a small (< 100K) number of features.
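The rule of thumb for direct can be turned into a small check on the number of feature columns. This is a sketch only: suggest_direct and DIRECT_FEATURE_LIMIT are hypothetical names, not part of smurff, and the limit simply restates the "< 100K" figure quoted above.

```python
import numpy as np

DIRECT_FEATURE_LIMIT = 100_000  # rule of thumb quoted above


def suggest_direct(side_info):
    """Hypothetical helper (not part of smurff): True when the number of
    feature columns is small enough for the direct method."""
    n_features = side_info.shape[1]
    return n_features < DIRECT_FEATURE_LIMIT


few_features = np.zeros((50, 8))         # 50 elements, 8 features each
many_features = np.zeros((2, 200_000))   # 2 elements, 200K features each

direct_small = suggest_direct(few_features)
direct_large = suggest_direct(many_features)
```

The result could then be passed on as the direct argument of addSideInfo.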
-
init()¶ Initializes the TrainSession after all data has been added.
You need to call this method before calling step(), unless you call run().
Returns: StatusItem of the trainSession.
-
makePredictSession()¶ Makes a PredictSession based on the model that was built in this TrainSession.
-
run()¶ Equivalent to:
self.init()
while self.step():
    pass
-
setTrain(Y, noise=<smurff.helper.FixedNoise object>, is_scarce=True)¶ Adds a train and optionally a test matrix as input data to this TrainSession
Parameters: - Y – Train matrix/tensor
- noise – Noise model to use for Y
- is_scarce (bool) – When Y is sparse and is_scarce is True, the missing values are considered unknown. When Y is sparse and is_scarce is False, the missing values are considered zero. When Y is dense, this parameter is ignored.
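The effect of is_scarce is easiest to see by counting which cells are treated as observations. The sketch below uses a small scipy.sparse matrix and compares the mean over the stored entries with the mean over all cells. The means are only a proxy chosen for illustration, not something smurff is claimed to compute.

```python
import numpy as np
import scipy.sparse as sp

# 3 x 3 matrix with only three stored entries
rows = np.array([0, 1, 2])
cols = np.array([0, 2, 1])
vals = np.array([4.0, 2.0, 3.0])
Y = sp.coo_matrix((vals, (rows, cols)), shape=(3, 3))

# is_scarce=True reading: only the three stored entries are observations
mean_scarce = Y.data.mean()            # (4 + 2 + 3) / 3

# is_scarce=False reading: the six unstored cells are observed zeros
mean_not_scarce = Y.toarray().mean()   # (4 + 2 + 3) / 9
```

For a recommender-style matrix where most cells are simply unrated, the first reading (the default for setTrain) is usually the intended one.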
-
step()¶ Does one sampling or burnin iteration.
Returns: - When a step was executed: StatusItem of the trainSession.
- After the last iteration, when no step was executed: None.
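Calling init() and step() by hand is useful when the status of every iteration should be inspected. The sketch below shows that control flow with a stand-in class. ToySession is NOT smurff code: it only counts iterations, and its tuples stand in for StatusItem objects.

```python
class ToySession:
    """Stand-in that follows the init()/step() protocol described above.

    This is not smurff: it only counts iterations so the loop can be
    shown without the library.
    """

    def __init__(self, burnin, nsamples):
        self.burnin = burnin
        self.nsamples = nsamples
        self.iteration = None

    def init(self):
        self.iteration = 0
        return ("Init", 0)

    def step(self):
        if self.iteration >= self.burnin + self.nsamples:
            return None  # after the last iteration no step is executed
        self.iteration += 1
        if self.iteration <= self.burnin:
            return ("Burnin", self.iteration)
        # the iteration counter restarts in the sampling phase
        return ("Sampling", self.iteration - self.burnin)


session = ToySession(burnin=2, nsamples=3)
session.init()

history = []
while True:
    status = session.step()
    if status is None:
        break
    history.append(status)
```

Compared with run(), the explicit loop keeps each returned status, so progress values such as the phase or the iteration number can be logged or used for early stopping.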
-
MacauSession¶
-
class
smurff.MacauSession(Ytrain, is_scarce=True, Ytest=None, side_info=None, univariate=False, direct=True, *args, **kwargs)¶ A TrainSession specialized for use with the Macau algorithm
-
Ytrain¶ Train matrix/tensor
Type: numpy.ndarray, scipy.sparse matrix or SparseTensor
- Ytest : scipy.sparse matrix or SparseTensor
- Test matrix/tensor. Mainly used for calculating RMSE.
- side_info : list of numpy.ndarray, scipy.sparse matrix or None
- Side info matrix/tensor for each dimension. If there is no side info for a certain mode, pass None. Each side info should have as many rows as there are elements in the corresponding dimension of Ytrain.
- direct : bool
- Use Cholesky instead of CG solver
- univariate : bool
- Use univariate sampling when True, multivariate sampling otherwise.
- *args, **kwargs :
- Extra arguments are passed to the TrainSession
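The side_info list is positional, with one entry per dimension of Ytrain. A minimal sketch of how such a list might be assembled and sanity-checked follows. validate_side_info is a hypothetical helper, not part of smurff, and the requirement of exactly one entry per dimension is an assumption read from the description above.

```python
import numpy as np


def validate_side_info(ytrain_shape, side_info):
    """Hypothetical helper (not part of smurff): expects one entry per
    dimension, with None where a dimension has no side info."""
    if len(side_info) != len(ytrain_shape):
        raise ValueError("need one side_info entry (or None) per dimension")
    for mode, info in enumerate(side_info):
        if info is not None and info.shape[0] != ytrain_shape[mode]:
            raise ValueError(
                f"side info for mode {mode} has the wrong number of rows"
            )
    return True


ytrain_shape = (30, 12)
row_info = np.random.rand(30, 5)   # 5 features for each of the 30 rows
side_info = [row_info, None]       # no side info for the columns
ok = validate_side_info(ytrain_shape, side_info)
```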
-
BPMFSession¶
-
class
smurff.BPMFSession(Ytrain, is_scarce=True, Ytest=None, univariate=False, *args, **kwargs)¶ A TrainSession specialized for use with the BPMF algorithm
-
Ytrain¶ Train matrix/tensor
Type: numpy.ndarray, scipy.sparse matrix or SparseTensor
- Ytest : scipy.sparse matrix or SparseTensor
- Test matrix/tensor. Mainly used for calculating RMSE.
- univariate : bool
- Use univariate sampling when True, multivariate sampling otherwise.
- *args, **kwargs :
- Extra arguments are passed to the TrainSession
-
StatusItem¶
-
class
smurff.StatusItem¶ Short set of parameters indicative of the training progress.
-
auc_1sample¶ ROC AUC of the test matrix of the last sample. Only available if you provided a threshold.
-
auc_avg¶ Average ROC AUC of the test matrix across all samples. Only available if you provided a threshold.
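The link between the threshold and the AUC values can be sketched in plain Python: the threshold turns the test values into binary labels, and the predictions are then scored by how well they rank positives above negatives. The data below is made up, and treating values strictly above the threshold as positive is an assumption, not something this page states.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs that are
    ranked correctly; ties count as half."""
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


threshold = 3.0
test_values = [1.0, 4.0, 2.0, 5.0]   # made-up test entries
predictions = [1.5, 3.5, 3.8, 4.9]   # made-up model predictions

# Assumption: values strictly above the threshold are the positive class
labels = [v > threshold for v in test_values]
auc = roc_auc(labels, predictions)
```

Here three of the four positive/negative pairs are ordered correctly, giving an AUC of 0.75.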
-
elapsed_iter¶ Number of seconds the last sampling iteration took
-
iter¶ Current iteration in current phase
-
nnz_per_sec¶ Compute performance indicator; number of non-zero elements in train processed per second
-
phase¶ { “Burnin”, “Sampling” }
-
rmse_1sample¶ RMSE for test matrix of last sample
-
rmse_avg¶ Average RMSE for test matrix across all samples
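The difference between rmse_1sample and rmse_avg can be illustrated on made-up numbers. The sketch assumes that rmse_avg is the RMSE of the predictions averaged over all samples so far, which is the usual convention for sampling-based factorization. This page does not spell that out, so treat it as an assumption.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(actual))


test_values = [1.0, 2.0, 3.0]

# Made-up predictions for the test cells from three successive samples
sample_predictions = [
    [2.0, 2.0, 3.0],
    [0.0, 2.0, 3.0],
    [1.0, 3.0, 3.0],
]

# RMSE of the last sample only
rmse_last = rmse(sample_predictions[-1], test_values)

# RMSE of the prediction averaged over all samples
# (assumed meaning of rmse_avg)
n = len(sample_predictions)
mean_prediction = [sum(col) / n for col in zip(*sample_predictions)]
rmse_of_mean = rmse(mean_prediction, test_values)
```

In this toy case the averaged prediction has a lower error than the last sample alone, which is the typical reason for watching the averaged figure during sampling.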
-
samples_per_sec¶ Compute performance indicator; number of rows and columns in U/V processed per second
-
train_rmse¶ RMSE for train matrix of last sample
-