Sentiment Prediction Demo
A worked step-by-step example using the Amazon Fine Foods Review dataset is provided here. The dataset provides around 500,000 reviews and ratings from Amazon for fine food products. We will use it to create a sentiment predictor that takes the review text and predicts whether it is a positive or negative review. A model will be created and run as a microservice within Seldon.
- You have installed Seldon on a Kubernetes cluster with Spark included (Spark is included by default).
- You have added seldon-server/kubernetes/bin to your shell PATH environment variable.
The model creation step can be accomplished by running the Kubernetes job defined in kubernetes/conf/examples/finefoods/train-finefoods.json, which should have been created when you followed the install and configuration steps.
Create the Kubernetes job:
The job may take 10-20 minutes to run. You can check its status with kubectl get jobs -l job-name=train-finefoods.
This job will:
- Download the finefoods review data. (N.B. make sure your Kubernetes cluster has access to the internet)
- Create a dataset where reviews with a score greater than 3 are marked as “positive” and those with a score less than 3 as “negative”. Reviews with a score of 3 are ignored.
- Create an XGBoost classification model using TFIDF features
- Save the modelling pipeline to persistent storage
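The labelling rule in the second step can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not the code the job runs; the function names and the sample reviews are hypothetical.

```python
from typing import Optional


def sentiment_label(score: float) -> Optional[str]:
    """Map a review score to a sentiment label; a score of 3 returns None."""
    if score > 3:
        return "positive"
    if score < 3:
        return "negative"
    return None  # neutral reviews are left out of the dataset


def build_dataset(reviews):
    """Keep (text, label) pairs only for reviews with a non-neutral score."""
    dataset = []
    for score, text in reviews:
        label = sentiment_label(score)
        if label is not None:
            dataset.append((text, label))
    return dataset


# Hypothetical sample reviews as (score, text) pairs.
sample = [(5, "Delicious and fresh"), (3, "It was okay"), (1, "Stale and bland")]
labelled = build_dataset(sample)
print(labelled)
```

Dropping the score-3 reviews keeps the task binary, at the cost of the model never seeing neutral text during training.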
To serve predictions we will load the saved pipeline into a microservice. This can be accomplished by using the script
This will load the pipeline saved in /seldon-data/seldon-models/finefoods/1/ and create a single replica microservice called finefoods-xgboost. It will activate this for the “test” client as a prediction algorithm.
You can then test the predictions, e.g.:
This should produce a result like the one below, indicating a “negative” sentiment:
While the following:
Should produce a “positive” sentiment:
The model creation steps are provided by the Docker image seldonio/finefoods_xgboost, whose code can be found in seldon-server/docker/examples/finefoods. The script create_model.sh is shown below:
- Downloads the data, uncompresses it and converts it to UTF-8
- Creates a dataset of review text and sentiment based on the score
- Runs the create_model.py Python script to create the model
- Prints the confusion matrix from one of the cross-validation fold runs
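For readers unfamiliar with the last step, a confusion matrix simply counts how often each actual label was predicted as each possible label. The sketch below is a plain-Python illustration with made-up labels; it is not the code used by create_model.sh, and the function name is hypothetical.

```python
from collections import Counter

LABELS = ("negative", "positive")


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Return matrix[actual][predicted] -> number of examples."""
    counts = Counter(zip(y_true, y_pred))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}


# Made-up labels for one hypothetical validation fold.
y_true = ["positive", "positive", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]

matrix = confusion_matrix(y_true, y_pred)
for actual in LABELS:
    row = "  ".join(f"{matrix[actual][pred]:>3}" for pred in LABELS)
    print(f"actual {actual:<8} {row}")
```

With a 5:1 class imbalance, the off-diagonal cell for actual "negative" predicted as "positive" is usually the one to watch, since overall accuracy can look good even when most negative reviews are missed.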
The model creation in create_model.py uses the pyseldon package to build a sklearn pipeline as shown below:
- Create TFIDF features using the pyseldon tfidf transformer, selecting the top 50K features from unigrams and bigrams of the text.
- Transform the features into SVMlight format for ease of use by XGBoost
- Add an XGBoost classifier using the pyseldon wrapper, with a scaled positive weight of 0.2 to handle the data imbalance, where positive reviews outnumber negative reviews 5:1.
- Run 5-fold cross-validation
- Combine these into a sklearn pipeline
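To make the ideas in these steps concrete, here is a dependency-free Python sketch of unigram and bigram TFIDF features, the 0.2 weight, and a 5-fold split. It does not use pyseldon or XGBoost. All function names are hypothetical, and two details are assumptions not confirmed by the pipeline above: features are picked by raw frequency, and the smoothed IDF formula is a common variant.

```python
import math
from collections import Counter


def ngrams(text):
    """Return the lowercase unigrams and bigrams of one document."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def fit_tfidf(docs, max_features):
    """Keep the max_features most frequent n-grams and compute their IDF."""
    term_counts, doc_freq = Counter(), Counter()
    for doc in docs:
        terms = ngrams(doc)
        term_counts.update(terms)
        doc_freq.update(set(terms))
    vocab = [term for term, _ in term_counts.most_common(max_features)]
    n = len(docs)
    # Smoothed IDF: a term present in every document still gets weight 1.
    return {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocab}


def transform(doc, idf):
    """Return a sparse {term: tf * idf} vector for one document."""
    counts = Counter(t for t in ngrams(doc) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def scale_pos_weight(labels):
    """Ratio of negative to positive examples (0.2 when positives lead 5:1)."""
    pos = sum(1 for y in labels if y == "positive")
    return (len(labels) - pos) / pos


def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k contiguous folds."""
    for fold in range(k):
        test = list(range(fold * n // k, (fold + 1) * n // k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


# Hypothetical toy corpus; the real pipeline keeps the top 50K features.
docs = ["great tasty coffee", "tasty coffee beans", "stale coffee"]
idf = fit_tfidf(docs, max_features=4)
vector = transform("coffee coffee", idf)
weight = scale_pos_weight(["positive"] * 5 + ["negative"])
folds = list(k_fold_indices(10, k=5))
print(vector, weight, len(folds))
```

The weight falls out of the class ratio: with five positive reviews for every negative one, negatives divided by positives is 1/5 = 0.2, which down-weights the majority positive class during training.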
The script saves the pipeline after modelling so that it can be reloaded and run in prediction mode to score new data.
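The save-and-reload idea can be illustrated with Python's standard pickle module. This is an assumption made for illustration only: the persistence mechanism pyseldon actually uses is not shown here, and the dictionary below is a hypothetical stand-in for a fitted pipeline.

```python
import os
import pickle
import tempfile


def save_pipeline(pipeline, path):
    """Serialise the fitted pipeline state to disk."""
    with open(path, "wb") as fh:
        pickle.dump(pipeline, fh)


def load_pipeline(path):
    """Reload a previously saved pipeline state."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical fitted state standing in for a real pipeline object.
fitted = {"vocabulary": ["coffee", "tasty"], "idf": {"coffee": 1.0, "tasty": 1.3}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline.pkl")
    save_pipeline(fitted, path)
    restored = load_pipeline(path)

print(restored == fitted)
```

Because unpickling can execute arbitrary code, a file saved this way should only be loaded from storage you trust.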