MovieLens 100K Worked Example

This page walks through a step-by-step example using the MovieLens 100K dataset. The dataset is small enough that Spark-based models are not the best solution, but its small size makes it easy to run and test the steps you would use for larger datasets.

Prerequisites

Running

The entire set of steps can be executed by running the Kubernetes job defined in kubernetes/conf/examples/ml100k/ml100k-import.json, which should have been created when you followed the install and configuration steps.

Create the Kubernetes job:

cd kubernetes/conf/examples/ml100k
kubectl create -f ml100k-import.json

The job may take a few minutes to fully run. You can check its status with kubectl get jobs -l name=ml100k-import.

This job will:

You can then test the recommendations by doing:

seldon-cli api --client-name ml100k --endpoint /js/recommendations --item 50 --limit 4

The above requests recommendations for a user whose recent action history consists of movie 50, which is “Star Wars”. The result should look something like:

{
  "size": 4,
  "requested": 4,
  "list": [
    {
      "id": "181",
      "name": "",
      "type": 1,
      "first_action": 1460395228000,
      "last_action": 1460395228000,
      "popular": false,
      "demographics": [],
      "attributes": {},
      "attributesName": {
        "recommendationUuid": "7",
        "release": "14-Mar-1997",
        "title": "Return of the Jedi (1983)",
        "url": "http://us.imdb.com/M/title-exact?Return%20of%20the%20Jedi%20(1983)"
      }
    },
    {
      "id": "1",
      "name": "",
      "type": 1,
      "first_action": 1460395228000,
      "last_action": 1460395228000,
      "popular": false,
      "demographics": [],
      "attributes": {},
      "attributesName": {
        "recommendationUuid": "7",
        "release": "01-Jan-1995",
        "title": "Toy Story (1995)",
        "url": "http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)"
      }
    },
    {
      "id": "121",
      "name": "",
      "type": 1,
      "first_action": 1460395228000,
      "last_action": 1460395228000,
      "popular": false,
      "demographics": [],
      "attributes": {},
      "attributesName": {
        "recommendationUuid": "7",
        "release": "03-Jul-1996",
        "title": "Independence Day (ID4) (1996)",
        "url": "http://us.imdb.com/M/title-exact?Independence%20Day%20(1996)"
      }
    },
    {
      "id": "222",
      "name": "",
      "type": 1,
      "first_action": 1460395228000,
      "last_action": 1460395228000,
      "popular": false,
      "demographics": [],
      "attributes": {},
      "attributesName": {
        "recommendationUuid": "7",
        "release": "22-Nov-1996",
        "title": "Star Trek: First Contact (1996)",
        "url": "http://us.imdb.com/M/title-exact?Star%20Trek:%20First%20Contact%20(1996)"
      }
    }
  ]
}
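To pull just the item ids and titles out of a response like the one above, a short script can help. The following is a minimal sketch in Python, assuming the response has been captured as a JSON string; the function name `recommended_titles` and the trimmed sample are illustrative and not part of Seldon.

```python
import json


def recommended_titles(response_text):
    """Return (id, title) pairs from a recommendations response string."""
    response = json.loads(response_text)
    return [
        (item["id"], item.get("attributesName", {}).get("title", ""))
        for item in response.get("list", [])
    ]


# Trimmed-down sample with the same shape as the response shown above.
sample = """
{"size": 2, "requested": 2, "list": [
  {"id": "181", "attributesName": {"title": "Return of the Jedi (1983)"}},
  {"id": "1", "attributesName": {"title": "Toy Story (1995)"}}
]}
"""

for item_id, title in recommended_titles(sample):
    print(item_id, title)
```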

Detailed Steps

Below are the detailed steps, which can be found here.

Download Data

    echo "nameserver 8.8.8.8" >> /etc/resolv.conf
    wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
    unzip ml-100k.zip
    iconv -f iso-8859-1 -t utf-8 ml-100k/u.item -o ml-100k/u.item.utf8

Create Historical Data Files

    echo "create items csv"
    cat <(echo 'id,title,release,url') <(cat ml-100k/u.item.utf8 | awk -F '|' '{printf("%d,\"%s\",\"%s\",\"%s\"\n",$1,$2,$3,$5)}') > items.csv

    echo "create users csv"
    cat <(echo "id") <(cat ml-100k/u.user | cut -d'|' -f1) > users.csv

    echo "create actions csv"
    cat <(echo "user_id,item_id,value,time") <(cat ml-100k/ua.base | cut -f1,2,3,4 --output-delimiter=,) > actions.csv
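The awk command for items.csv wraps each field in double quotes but does not escape quotes that appear inside a title. As a hedged alternative, the sketch below does the same field selection (fields 1, 2, 3 and 5 of u.item) with Python's csv module, which quotes and escapes fields as needed. The function name `items_to_csv` is hypothetical, and the second sample line is made up to show the escaping.

```python
import csv
import io


def items_to_csv(lines):
    """Convert pipe-delimited u.item lines to CSV text (id,title,release,url)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "title", "release", "url"])
    for line in lines:
        fields = line.rstrip("\n").split("|")
        # 1-based fields 1, 2, 3 and 5, matching the awk command above.
        writer.writerow([int(fields[0]), fields[1], fields[2], fields[4]])
    return out.getvalue()


sample = [
    "1|Toy Story (1995)|01-Jan-1995||"
    "http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0\n",
    # Made-up line with a comma and quotes in the title, for illustration only.
    '2|A "Quoted" Movie, The (1999)|01-Jan-1999||http://example.invalid/m|0|0|0\n',
]

print(items_to_csv(sample), end="")
```

With the real file, the lines could be read from ml-100k/u.item.utf8 and the result written to items.csv. The output is not byte-for-byte identical to the awk version, since only fields that need quoting are quoted.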

Create Client

We will use an item schema to hold the title, release date and IMDB URL of the movies, as shown below:

{
    "types": [
        {
            "type_attrs": [
                {
                    "name": "title",
                    "value_type": "string"
                },
                {
                    "name": "url",
                    "value_type": "string"
                },
                {
                    "name": "release",
                    "value_type": "string"
                }
            ],
            "type_id": 1,
            "type_name": "Movies"
        }
    ]
}
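The attribute names in this schema appear to line up with the column names in the items.csv header (title, release, url). A small check like the following can catch a mismatch before importing. It is a sketch only: the helper names `schema_attr_names` and `missing_columns` are hypothetical, and the assumption that attributes are matched to CSV columns by name comes from this example rather than from the Seldon documentation.

```python
import json


def schema_attr_names(schema):
    """Collect every attribute name declared in the item schema."""
    return {attr["name"] for t in schema["types"] for attr in t["type_attrs"]}


def missing_columns(schema, csv_header):
    """Return schema attributes that have no matching CSV column."""
    columns = set(csv_header.strip().split(","))
    return sorted(schema_attr_names(schema) - columns)


schema = json.loads("""
{"types": [{"type_attrs": [
    {"name": "title", "value_type": "string"},
    {"name": "url", "value_type": "string"},
    {"name": "release", "value_type": "string"}],
  "type_id": 1, "type_name": "Movies"}]}
""")

# Header line as written by the "create items csv" step.
print(missing_columns(schema, "id,title,release,url"))
```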

The steps to set up the client and import the data can be done via the Seldon CLI:

    seldon-cli client --action setup --client-name ml100k --db-name ClientDB
    seldon-cli attr --action apply --client-name ml100k --json attr.json
    seldon-cli import --action items --client-name ml100k --file-path items.csv
    seldon-cli import --action users --client-name ml100k --file-path users.csv
    seldon-cli import --action actions --client-name ml100k --file-path actions.csv

Build a Recommendation Model

We will build a matrix factorization model using Spark, run via Luigi.

    luigi --module seldon.luigi.spark SeldonMatrixFactorization --local-scheduler --client ml100k --startDay 1
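For readers unfamiliar with matrix factorization, the idea is to learn a short vector for each user and each item so that their dot product approximates the observed rating. The sketch below is a generic stochastic-gradient version in plain Python on made-up ratings. It is not the Seldon Spark job, which may use a different solver and different parameters; the names `train_mf` and `rmse` and all hyperparameter values are assumptions for illustration.

```python
import random


def predict(p_u, q_i):
    """Predicted rating: dot product of a user vector and an item vector."""
    return sum(a * b for a, b in zip(p_u, q_i))


def train_mf(ratings, n_factors=2, epochs=500, lr=0.02, reg=0.02, seed=0):
    """Fit user and item factors to (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    p = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(p[u], q[i])
            for k in range(n_factors):
                pu, qi = p[u][k], q[i][k]
                # Gradient step on squared error with L2 regularization.
                p[u][k] += lr * (err * qi - reg * pu)
                q[i][k] += lr * (err * pu - reg * qi)
    return p, q


def rmse(ratings, p, q):
    """Root mean squared error of the model on the given ratings."""
    total = sum((r - predict(p[u], q[i])) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5


# Made-up (user_id, item_id, rating) triples, for illustration only.
ratings = [(1, 50, 5), (1, 181, 5), (1, 1, 3), (2, 50, 4),
           (2, 181, 4), (3, 1, 5), (3, 50, 2)]

p0, q0 = train_mf(ratings, epochs=0)
p, q = train_mf(ratings)
print("RMSE before training:", rmse(ratings, p0, q0))
print("RMSE after training:", rmse(ratings, p, q))
```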

Setup Runtime Scorer

We set up a runtime matrix factorization scorer that uses each user's recent activity to make recommendations.

cat <<EOF | seldon-cli rec_alg --action create --client-name ml100k -f -
{
    "defaultStrategy": {
        "algorithms": [
            {
                "config": [
                    {
                        "name": "io.seldon.algorithm.general.numrecentactionstouse",
                        "value": "1"
                    }
                ],
                "filters": [],
                "includers": [],
                "name": "recentMfRecommender"
            }
        ],
        "combiner": "firstSuccessfulCombiner"
    },
    "recTagToStrategy": {}
}
EOF

seldon-cli rec_alg --action commit --client-name ml100k
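To give an intuition for what a recent-activity scorer does, the sketch below ranks items by their similarity to the average factor vector of a user's most recent items. It illustrates the general idea only and is not the implementation of recentMfRecommender. The function name `recommend_from_recent`, the `num_recent` parameter (loosely mirroring the numrecentactionstouse setting) and the factor values are all made up.

```python
def recommend_from_recent(recent_items, item_factors, num_recent=1, limit=4):
    """Rank unseen items against the mean factor vector of recent items."""
    used = [i for i in recent_items[-num_recent:] if i in item_factors]
    if not used:
        return []
    dims = len(next(iter(item_factors.values())))
    # Pseudo user vector: average of the factors of the recent items.
    profile = [
        sum(item_factors[i][k] for i in used) / len(used) for k in range(dims)
    ]
    scores = {
        item: sum(a * b for a, b in zip(profile, vec))
        for item, vec in item_factors.items()
        if item not in recent_items  # do not recommend what was just seen
    }
    return sorted(scores, key=scores.get, reverse=True)[:limit]


# Made-up two-dimensional item factors keyed by item id, for illustration only.
factors = {
    "50": [0.9, 0.1],
    "181": [0.8, 0.2],
    "1": [0.3, 0.7],
    "999": [-0.5, 0.4],
}

print(recommend_from_recent(["50"], factors, num_recent=1, limit=2))
# prints ['181', '1']
```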