Movielens 10 Million Example

A worked step-by-step example using the Movielens 10 Million dataset is provided here.

Prerequisites

Running

The entire set of steps can be executed by running one of the Kubernetes jobs in seldon-server/kubernetes/conf/examples/ml10m, which should have been created when you followed the install and configuration steps.

There are two jobs:

Start the chosen Kubernetes job, for example:

cd seldon-server/kubernetes/conf/examples/ml10m
kubectl create -f ml10m-import-item-similarity.json

The job may take 10 or more minutes to run depending on the size and compute power of your Kubernetes cluster and the Spark cluster within it. You can check its status with kubectl get jobs -l job-name=ml10m-import.
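
Rather than re-running the status command by hand, the check can be wrapped in a small polling loop. The following is a minimal sketch, not part of the Seldon tooling: the helper name `wait_for_job`, the default of 40 checks at 30-second intervals, and the assumption that the job needs a single successful completion (so `.status.succeeded` reads `1`) are all illustrative choices.

```shell
# Hypothetical helper: poll a job until it reports one succeeded pod.
# Arguments: label selector, max number of checks, seconds between checks.
wait_for_job() {
    selector="$1"
    attempts="${2:-40}"
    interval="${3:-30}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Assumes a single-completion job, so success shows up as "1".
        succeeded=$(kubectl get jobs -l "$selector" -o jsonpath='{.items[*].status.succeeded}')
        if [ "$succeeded" = "1" ]; then
            echo "job finished"
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "job did not finish after $attempts checks" >&2
    return 1
}

# Example: wait_for_job job-name=ml10m-import
```

The function returns a non-zero status if the job has not succeeded within the allotted checks, so it can be chained with `&&` in front of the test request.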

This job will:

You can then test the recommendations by running:

seldon-cli api --client-name ml10m --endpoint /js/recommendations --item 50 --limit 4 --user 625

The command above requests recommendations for user 625, whose recent action history consists of movie 50, “The Usual Suspects”. The result should look something like:

{
  "size": 4,
  "requested": 4,
  "list": [
    {
      "id": "318",
      "name": "",
      "type": 1,
      "first_action": 1465573742000,
      "last_action": 1465573742000,
      "popular": false,
      "demographics": [],
      "attributes": {},
      "attributesName": {
        "recommendationUuid": "1",
        "title": "Shawshank Redemption, The (1994)"
      }
    },
    {
      "id": "296",
      "name": "",
      "type": 1,
      "first_action": 1465573742000,
      "last_action": 1465573742000,
      "popular": false,
      "demographics": [],
      "attributes": {},
      "attributesName": {
        "recommendationUuid": "1",
        "title": "Pulp Fiction (1994)"
      }
    },
    {
      "id": "593",
      "name": "",
      "type": 1,
      "first_action": 1465573742000,
      "last_action": 1465573742000,
      "popular": false,
      "demographics": [],
      "attributes": {},
      "attributesName": {
        "recommendationUuid": "1",
        "title": "Silence of the Lambs, The (1991)"
      }
    },
    {
      "id": "47",
      "name": "",
      "type": 1,
      "first_action": 1465573742000,
      "last_action": 1465573742000,
      "popular": false,
      "demographics": [],
      "attributes": {},
      "attributesName": {
        "recommendationUuid": "1",
        "title": "Seven (a.k.a. Se7en) (1995)"
      }
    }
  ]
}
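
When scripting against this endpoint, it is often enough to pull out just the recommended titles. The following is a minimal sketch using only grep and sed. It assumes the response is pretty-printed as above, with each `"title"` field on its own line; a real JSON parser would be more robust. The helper name `extract_titles` is hypothetical.

```shell
# Hypothetical helper: print the "title" value of each recommendation.
# Reads the pretty-printed JSON response on stdin.
extract_titles() {
    # Keep only the title lines, then strip the key prefix and the closing quote.
    grep '"title"' | sed -e 's/^[^:]*: *"//' -e 's/",\{0,1\} *$//'
}

# Example (assumes the CLI writes only the JSON response to stdout):
# seldon-cli api --client-name ml10m --endpoint /js/recommendations --item 50 --limit 4 --user 625 | extract_titles
```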

Detailed Steps

Below are the detailed steps, which can be found here.

Download Data

    echo "nameserver 8.8.8.8" >> /etc/resolv.conf
    wget http://files.grouplens.org/datasets/movielens/ml-10m.zip
    unzip ml-10m.zip
    iconv -f iso-8859-1 -t utf-8 ml-10M100K/movies.dat -o ml-10M100K/movies.dat.utf8

Create Historical Data Files

    echo "create items csv"
    cat <(echo 'id,title') <(cat ml-10M100K/movies.dat.utf8 | awk -F '::' '{printf("%d,\"%s\"\n",$1,$2)}') > items.csv

    echo "create users csv"
    cat <(echo "id") <(cat ml-10M100K/ratings.dat | awk -F'::' '{print $1}' | uniq) > users.csv

    echo "create actions csv"
    cat <(echo "user_id,item_id,value,time") <(cat ml-10M100K/ratings.dat | awk -F'::' 'BEGIN{OFS=","}{print $1,$2,$3,$4}') > actions.csv
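
To see what these awk programs do, it helps to run them on a single line. The sketch below feeds one movies-style line and one ratings-style line through the same awk programs used above. The sample lines are illustrative and only assume the `::`-separated layout of the dataset files; the function names `item_row` and `action_row` are hypothetical.

```shell
# Same awk program as used for items.csv: keep the id and title, quote the title.
item_row() {
    awk -F '::' '{printf("%d,\"%s\"\n",$1,$2)}'
}

# Same awk program as used for actions.csv: replace the "::" separators with commas.
action_row() {
    awk -F'::' 'BEGIN{OFS=","}{print $1,$2,$3,$4}'
}

echo '1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy' | item_row
# prints: 1,"Toy Story (1995)"

echo '1::122::5::838985046' | action_row
# prints: 1,122,5,838985046
```

Note that the items program wraps the title in double quotes but does not escape any double quote that appears inside a title.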

Create Client

We will use an item schema to hold the titles of the movies, as shown below:

{
    "types": [
        {
            "type_attrs": [
                {
                    "name": "title",
                    "value_type": "string"
                }
            ],
            "type_id": 1,
            "type_name": "Movies"
        }
    ]
}
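
The import commands that follow read this schema from a file named attr.json. One way to create it is a here-document, sketched here under the assumption that you are working in the same directory as the CSV files:

```shell
# Write the item schema shown above to attr.json in the current directory.
# The quoted 'EOF' delimiter keeps the shell from expanding anything inside.
cat > attr.json <<'EOF'
{
    "types": [
        {
            "type_attrs": [
                {
                    "name": "title",
                    "value_type": "string"
                }
            ],
            "type_id": 1,
            "type_name": "Movies"
        }
    ]
}
EOF
```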

The steps to set up the client and import the data can be done via the Seldon CLI:

    seldon-cli client --action setup --client-name ml10m --db-name ClientDB
    kubectl cp attr.json /seldon-control:/tmp/attr.json && seldon-cli attr --action apply --client-name ml10m --json /tmp/attr.json
    kubectl cp items.csv /seldon-control:/tmp/items.csv && seldon-cli import --action items --client-name ml10m --file-path /tmp/items.csv
    kubectl cp users.csv /seldon-control:/tmp/users.csv && seldon-cli import --action users --client-name ml10m --file-path /tmp/users.csv
    kubectl cp actions.csv /seldon-control:/tmp/actions.csv && seldon-cli import --action actions --client-name ml10m --file-path /tmp/actions.csv

Build a Recommendation Model

The script contains calls to build either an item-similarity model or a matrix factorization model using luigi:

    rm -rf /seldon-data/seldon-models/ml10m/matrix-factorization/1
    luigi --module seldon.luigi.spark SeldonMatrixFactorization --local-schedule --client ml10m --startDay 1 

    rm -rf /seldon-data/seldon-models/ml10m/item-similarity/1
    luigi --module seldon.luigi.spark SeldonItemSimilarity --local-schedule --client ml10m --startDay 1 --ItemSimilaritySparkJob-sample 0.25 --ItemSimilaritySparkJob-dimsumThreshold 0.5 --ItemSimilaritySparkJob-limit 100 

Setup Runtime Scorer

We set up an appropriate runtime scorer depending on which model we created. For the item-similarity model we would do:

    cat <<EOF | seldon-cli rec_alg --action create --client-name ml10m -f -
{
    "defaultStrategy": {
        "algorithms": [
            {
                "config": [
                    {
                        "name": "io.seldon.algorithm.general.numrecentactionstouse",
                        "value": "1"
                    }
                ],
                "filters": [],
                "includers": [],
                "name": "itemSimilarityRecommender"
            }
        ],
        "combiner": "firstSuccessfulCombiner",
        "diversityLevel": 3
    },
    "recTagToStrategy": {}
}
EOF
    seldon-cli rec_alg --action commit --client-name ml10m
    # pull updated conf from ZooKeeper so it's safe
    seldon-cli client --action zk_pull