Content Recommendation Models

Seldon contains several built-in content recommendation models, covering both batch and streaming use cases.

Matrix Factorization

An algorithm made popular by its success in the Netflix Prize competition. It tries to find a small set of latent user and item factors that explain the user-item interaction data. We use a wrapper around the Apache Spark ALS implementation. Note, however, that for this release we only provide implicit matrix factorization.
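To make the idea of latent factors concrete, here is a minimal sketch of how a recommendation can be ranked once user and item factors exist. The factor values, the identifiers and the `recommend` helper are all hypothetical and are not part of Seldon's API; in practice the factors would come from the trained model.

```python
# Hypothetical latent factors with two dimensions per user and item.
# Real factors would be produced by the matrix factorization job.
user_factors = {"u1": [0.9, 0.1], "u2": [0.2, 0.8]}
item_factors = {
    "film_a": [1.0, 0.0],
    "film_b": [0.0, 1.0],
    "film_c": [0.6, 0.6],
}


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def recommend(user_id, seen=(), limit=2):
    """Rank items the user has not seen by user-factor . item-factor."""
    u = user_factors[user_id]
    scores = {
        item: dot(u, factors)
        for item, factors in item_factors.items()
        if item not in seen
    }
    return sorted(scores, key=scores.get, reverse=True)[:limit]


print(recommend("u1"))
print(recommend("u2", seen={"film_b"}))
```

A higher dot product means the item's factors line up more closely with the user's, so the item is ranked earlier.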

Model creation

You can create a matrix factorization Kubernetes job for client “test” starting at unix-day 16907 (16th April 2016) as follows:

cd kubernetes/conf/models
make matrix-factorization DAY=16907 CLIENT=test
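The DAY value can be derived from a calendar date. The sketch below assumes that a unix-day is the number of whole days since 1 January 1970 in UTC; the helper names are illustrative only. If the job interprets days in another time zone, the value may differ by one.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def to_unix_day(d):
    """Whole days between the Unix epoch and the given date."""
    return (d - EPOCH).days


def from_unix_day(n):
    """Calendar date for a given unix-day number."""
    return EPOCH + timedelta(days=n)


print(to_unix_day(date.today()))  # value to pass as DAY=...
print(from_unix_day(16907))
```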

Runtime Scoring

A matrix factorization scorer can be utilized.

User Clustered Matrix Factorization (experimental)

In situations where the number of users is very high, in the millions or more, the standard matrix factorization above can become computationally expensive when calculating the recommendations for all users. An alternative is to cluster users inside the latent factor space created by the factorization of the user-item activity matrix. Recommendations then only need to be created once per cluster rather than once per user. This is viable if many users have very similar tastes and can be served identical recommendations.
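The clustering idea can be sketched as follows. This is a plain k-means over hypothetical two-dimensional user factors, not the Spark job's actual implementation; all names, values and the choice of k-means are assumptions made for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means. Returns (centroids, {point name: cluster index})."""
    centroids = [list(c) for c in initial_centroids]
    assignment = {}
    for _ in range(iterations):
        # Assign every user to the nearest centroid.
        assignment = {
            name: min(range(len(centroids)),
                      key=lambda k: sqdist(vec, centroids[k]))
            for name, vec in points.items()
        }
        # Move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [points[n] for n, c in assignment.items() if c == k]
            if members:
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, assignment


# Hypothetical factors; real ones would come from the factorization.
user_factors = {"u1": [0.9, 0.1], "u2": [0.8, 0.2],
                "u3": [0.1, 0.9], "u4": [0.2, 0.8]}
item_factors = {"film_a": [1.0, 0.0], "film_b": [0.0, 1.0],
                "film_c": [0.6, 0.6]}

centroids, cluster_of = kmeans(user_factors, [[1.0, 0.0], [0.0, 1.0]])

# One ranked list per cluster, scored with the centroid as a stand-in user.
cluster_recs = {
    k: sorted(item_factors,
              key=lambda i: dot(c, item_factors[i]), reverse=True)[:2]
    for k, c in enumerate(centroids)
}


def recommend(user_id):
    """Serve the precomputed list for the user's cluster."""
    return cluster_recs[cluster_of[user_id]]


print(recommend("u1"), recommend("u3"))
```

The cost of ranking items scales with the number of clusters instead of the number of users, at the price of identical recommendations within a cluster.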

Model creation

We provide a Spark job which builds on the ideas presented here. You can create a user-clustered matrix factorization Kubernetes job for client “test” starting at unix-day 16907 (16th April 2016) as follows:

cd kubernetes/conf/models
make matrix-factorization-clusters DAY=16907 CLIENT=test

Runtime Scoring

A dedicated user clusters matrix factorization scorer can be utilized.

Batch Item Similarity

Item similarity models find correlations in the user-item interactions to find pairs of items that have consistently been viewed together. The underlying algorithm is the DIMSUM algorithm in Apache Spark 1.2.
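As an illustration of what such a model computes, the sketch below calculates exact cosine similarity between items from a small, made-up list of user-item interactions. It is not DIMSUM itself, which approximates these similarities by sampling so that it scales to large data sets; the data and function name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical (user, item) view events.
interactions = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "a"), ("u3", "c"),
]


def item_similarities(pairs):
    """Cosine similarity between items over binary user interactions."""
    users_by_item = defaultdict(set)
    for user, item in pairs:
        users_by_item[item].add(user)

    sims = {}
    for i, j in combinations(sorted(users_by_item), 2):
        common = len(users_by_item[i] & users_by_item[j])
        if common:  # skip pairs that were never viewed by the same user
            norm = sqrt(len(users_by_item[i]) * len(users_by_item[j]))
            sims[(i, j)] = common / norm
    return sims


sims = item_similarities(interactions)
for pair, score in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```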

Model creation

You can create an item similarity modelling job for client “test” starting at unix-day 16907 (16th April 2016) as follows:

cd kubernetes/conf/models
make item-similarity DAY=16907 CLIENT=test

Runtime Scoring

The item similarity scorer can be utilized.

Streaming Item Similarity

Batch item similarity is not applicable to domains such as news recommendation, where new items that need to be recommended are published within short time frames. In these circumstances we can use an online streaming item similarity technique that finds similar items through user activity in short time windows (for example, every hour). For this we use a technique described in "Estimating Rarity and Similarity over Data Stream Windows" by Mayur Datar and S. Muthukrishnan, which uses a streaming adaptation of min-hashing to provide efficient item similarity over variable time windows.
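The core min-hashing idea can be sketched as below. This is ordinary min-hash over the set of users who interacted with each item inside one time window, not the streaming window variant from the paper; the number of hash functions, the use of SHA-256 and all names are assumptions chosen for illustration.

```python
import hashlib

NUM_HASHES = 64  # more hashes give a tighter estimate


def _hash(seed, value):
    """Deterministic 64-bit hash of a value under a given seed."""
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(users):
    """Min-hash signature of a non-empty set of user ids."""
    return [min(_hash(seed, u) for u in users) for seed in range(NUM_HASHES)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical users seen per item within the last hour.
viewers = {
    "story_1": {"u1", "u2", "u3", "u4"},
    "story_2": {"u1", "u2", "u3", "u5"},
    "story_3": {"u8", "u9"},
}
sigs = {item: signature(users) for item, users in viewers.items()}

print(estimated_jaccard(sigs["story_1"], sigs["story_2"]))
print(estimated_jaccard(sigs["story_1"], sigs["story_3"]))
```

Only the fixed-size signatures need to be kept and compared, which is what makes the approach attractive when items and users arrive continuously.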

Model creation

The online algorithm is implemented in two jobs.

The jobs can be created for client “test” as follows:

cd kubernetes/conf/models
make streaming-itemsim CLIENT=test

You should start these two jobs on the cluster, for example:

kubectl create -f jobs/stream-itemsim-create-test.json
kubectl create -f jobs/stream-itemsim-dbupload-test.json

Runtime Scoring

The item similarity scorer can be utilized.

Content Similarity

Rather than using collaborative filtering techniques, which try to find similarities in users' activity, an alternative technique is to use the actual content to find other content of a similar nature. We have Python libraries which wrap several techniques provided by the gensim document similarity toolkit to provide these. An example demo can be found here
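To show what content-based similarity means in practice, the following sketch compares documents with TF-IDF weights and cosine similarity. It is a plain standard-library stand-in and does not use gensim or Seldon's wrappers; the documents, the whitespace tokenizer and the function names are hypothetical.

```python
from collections import Counter
from math import log, sqrt

# Hypothetical item descriptions.
docs = {
    "d1": "spark matrix factorization for recommendations",
    "d2": "matrix factorization with spark als",
    "d3": "football results and match reports",
}


def tfidf_vectors(documents):
    """Sparse TF-IDF vector (term -> weight) for every document."""
    tokens = {name: text.lower().split() for name, text in documents.items()}
    doc_freq = Counter(t for words in tokens.values() for t in set(words))
    n = len(documents)
    return {
        name: {t: c * log(n / doc_freq[t]) for t, c in Counter(words).items()}
        for name, words in tokens.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    num = sum(w * b[t] for t, w in a.items() if t in b)
    den = sqrt(sum(w * w for w in a.values())) * sqrt(
        sum(w * w for w in b.values()))
    return num / den if den else 0.0


vectors = tfidf_vectors(docs)
print(cosine(vectors["d1"], vectors["d2"]))  # shares several terms
print(cosine(vectors["d1"], vectors["d3"]))  # shares no terms
```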

Most Popular

A strong baseline model, especially for high-churn scenarios such as news sites, is to provide recent popular content. We provide a model that counts items in real time as users interact with them and exponentially decays those counts to provide a continually updated view of popular content. The amount of decay can be controlled to reflect the amount of churn and traffic on a site. This model has only a runtime scoring configuration.
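The exponentially decayed counting described above can be sketched as follows. The class, its half-life parameter and the explicit timestamps are assumptions made for illustration and do not reflect how Seldon configures its decay.

```python
import math


class DecayedCounter:
    """Per-item counts that halve every `half_life_seconds`."""

    def __init__(self, half_life_seconds):
        self.decay_rate = math.log(2) / half_life_seconds
        self.counts = {}  # item -> (count, time of last update)

    def _decayed(self, item, now):
        """Current count for an item after applying decay up to `now`."""
        count, last = self.counts.get(item, (0.0, now))
        return count * math.exp(-self.decay_rate * (now - last))

    def add(self, item, now):
        """Record one interaction with the item at time `now` (seconds)."""
        self.counts[item] = (self._decayed(item, now) + 1.0, now)

    def top(self, now, limit=10):
        """Items ordered by decayed count, most popular first."""
        scores = {item: self._decayed(item, now) for item in self.counts}
        return sorted(scores, key=scores.get, reverse=True)[:limit]


counter = DecayedCounter(half_life_seconds=3600)
counter.add("old_story", now=0)
counter.add("old_story", now=0)
counter.add("new_story", now=7200)
print(counter.top(now=7200))
```

A shorter half-life suits sites with high churn, since older interactions lose their influence more quickly.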

Runtime Scoring

The most popular scorer can be utilized.