Predictive Pipelines in Python
Feature extraction pipelines allow you to define a repeatable process to transform a set of input features before you build a machine learning model on a final set of features. When the resulting model is put into production, the feature pipeline must be rerun on each input feature set, and its output is then passed to the model for scoring.
Seldon feature pipelines are presently available in python. We plan to provide Spark-based pipelines in the future.
Seldon provides a set of python modules to help construct feature pipelines for use inside Seldon. We use scikit-learn pipelines and Pandas. For feature extraction and transformation we provide a starter set of python scikit-learn Transformers that take Pandas dataframes as input, apply some transformations, and output Pandas dataframes. Any existing sklearn Transformer can also be applied to Pandas dataframes with sklearn_transform.
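To make the "dataframe in, dataframe out" idea concrete, here is a minimal sketch of two scikit-learn compatible transformers chained in a scikit-learn Pipeline. The classes IncludeColumns and AddRatio, and their parameter names, are hypothetical and written only for illustration; they are not Seldon's transforms and may differ from them in interface.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class IncludeColumns(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: keep only the named columns."""

    def __init__(self, included=None):
        self.included = included

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained.
        return self

    def transform(self, X):
        return X[list(self.included)].copy()


class AddRatio(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: add a column holding numerator / denominator."""

    def __init__(self, numerator="a", denominator="b", output="ratio"):
        self.numerator = numerator
        self.denominator = denominator
        self.output = output

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        out = X.copy()
        out[self.output] = out[self.numerator] / out[self.denominator]
        return out


df = pd.DataFrame({"a": [1.0, 2.0], "b": [2.0, 4.0], "c": ["x", "y"]})

pipeline = Pipeline([
    ("include", IncludeColumns(included=["a", "b"])),
    ("ratio", AddRatio(numerator="a", denominator="b", output="a_over_b")),
])

# Each step receives a DataFrame and hands a DataFrame to the next step.
result = pipeline.fit_transform(df)
print(result)
```

Because every step returns a DataFrame, later steps can keep referring to features by column name rather than by array position.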
Installation instructions can be found here.
The currently available example transforms are:
- Include_features_transform : include a subset of features
- Exclude_features_transform : exclude some subset of features
- Binary_transform : create a binary feature based on existence of a feature
- Split_transform : split a series of textual features into tokens
- Exist_features_transform : filter data to only those containing a set of features
- Svmlight_transform : create a feature that contains SVMLight numeric features from some input set of features
- Feature_id_transform : create an id feature from some input feature
- Tfidf_transform : create TFIDF features from an input feature
- Auto_transform : attempt to automatically normalize and create numeric, categorical and date features
- sklearn_transform : apply a sklearn Transformer to a Pandas Dataframe
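To give a feel for what some of these transforms do, the following sketch reproduces three of the ideas (a binary existence feature, token splitting, and filtering on feature existence) in plain Pandas. It does not call the Seldon transforms themselves; the column names and data are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["red wine", "white wine", None],
    "price": [12.5, None, 8.0],
})

# Binary feature: 1 where "price" is present, 0 where it is missing.
df["has_price"] = df["price"].notna().astype(int)

# Split feature: tokenize "title" on whitespace; missing text gives an empty list.
df["title_tokens"] = df["title"].apply(
    lambda text: text.split() if isinstance(text, str) else []
)

# Existence filter: keep only rows where every required feature is present.
required = ["title", "price"]
filtered = df.dropna(subset=required)

print(df)
print(filtered)
```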
Several small examples can be found in
Use sklearn’s StandardScaler on a Pandas DataFrame.
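A minimal sketch of the underlying idea, using scikit-learn and Pandas directly rather than Seldon's sklearn_transform wrapper (whose exact arguments are not shown here). The output column name a_scaled is an assumption chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0], "b": [2.0, 3.0]})

scaler = StandardScaler()
# fit_transform expects 2-D input, so select the column with a list.
scaled = scaler.fit_transform(df[["a"]])

# Write the scaled values into a new column and leave the input untouched.
df_scaled = df.copy()
df_scaled["a_scaled"] = scaled[:, 0]
print(df_scaled)
```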
When run this would print:
Auto transform a set of features
- Turn column “a” into a categorical column because its number of distinct values is within the limit specified by “max_values_numeric_categorical”
- Standard scale column “b”
- Leave column “c” as categorical but convert empty values to the “UKN” category
- Convert the specified date column “d” to a series of expanded features
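The four steps above can be sketched in plain Pandas. This is an illustration of the behaviour described, not Seldon's Auto_transform: the function auto_transform_sketch, the date_cols argument, and the expanded date feature names (d_year, d_month, d_dayofweek) are assumptions made for this example. Only max_values_numeric_categorical is a name taken from the text.

```python
import pandas as pd


def auto_transform_sketch(df, date_cols=(), max_values_numeric_categorical=2):
    """Hypothetical helper imitating the steps described above."""
    out = df.copy()
    for col in df.columns:
        if col in date_cols:
            # Expand a date column into several numeric features.
            parsed = pd.to_datetime(out[col])
            out[col + "_year"] = parsed.dt.year
            out[col + "_month"] = parsed.dt.month
            out[col + "_dayofweek"] = parsed.dt.dayofweek
            out = out.drop(columns=[col])
        elif pd.api.types.is_numeric_dtype(out[col]):
            if out[col].nunique() <= max_values_numeric_categorical:
                # Few distinct values: treat the numbers as category labels.
                out[col] = out[col].astype(str)
            else:
                # Otherwise standard scale: zero mean, unit variance.
                std = out[col].std(ddof=0)
                if std > 0:
                    out[col] = (out[col] - out[col].mean()) / std
        else:
            # Categorical column: map missing or empty values to "UKN".
            out[col] = out[col].fillna("UKN").replace("", "UKN")
    return out


df = pd.DataFrame({
    "a": [10, 20, 10, 20],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": ["x", "", None, "y"],
    "d": ["2015-01-05", "2015-02-10", "2015-03-15", "2015-04-20"],
})

result = auto_transform_sketch(
    df, date_cols=["d"], max_values_numeric_categorical=2
)
print(result)
```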
The converted DataFrame would be:
Creating a Machine Learning model
As a final stage of any pipeline you would usually add a sklearn Estimator. We provide three built-in Estimators which wrap some popular machine learning toolkits and allow Pandas dataframes as input. There is also a general Estimator that can take any scikit-learn compatible estimator.
- XGBoostClassifier : XGBoost classifier which allows Pandas Dataframes as input
- VWClassifier : VW classifier which allows Pandas Dataframes as input
- KerasClassifier : Keras classifier which allows Pandas Dataframes as input
- SKLearnClassifier : General classifier that runs any sklearn classifier taking Pandas dataframes as input.
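The idea behind a general wrapper can be sketched as follows: an estimator that reads its feature and target columns from a DataFrame and delegates to any scikit-learn classifier. The class DataFrameClassifier and its parameters (clf, features, target) are hypothetical names for this example and should not be read as the interface of SKLearnClassifier.

```python
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import LogisticRegression


class DataFrameClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: fit any sklearn classifier from DataFrame columns."""

    def __init__(self, clf=None, features=None, target="target"):
        self.clf = clf
        self.features = features
        self.target = target

    def fit(self, X, y=None):
        # Take labels from the target column unless they are passed explicitly.
        labels = X[self.target] if y is None else y
        self.clf.fit(X[list(self.features)].to_numpy(), labels)
        return self

    def predict(self, X):
        return self.clf.predict(X[list(self.features)].to_numpy())


train = pd.DataFrame({
    "x1": [0.0, 0.2, 0.1, 5.0, 5.2, 5.1],
    "x2": [0.1, 0.0, 0.2, 5.1, 5.0, 5.2],
    "target": [0, 0, 0, 1, 1, 1],
})

model = DataFrameClassifier(clf=LogisticRegression(), features=["x1", "x2"])
model.fit(train)

# Scoring data needs only the feature columns, not the target.
new_rows = pd.DataFrame({"x1": [0.05, 5.05], "x2": [0.05, 5.05]})
predictions = model.predict(new_rows)
print(predictions)
```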
Simple Predictive Pipeline using Iris Dataset
An example pipeline to do very simple extraction on the Iris dataset is contained within the code at
python/docker/examples/iris. This contains pipelines that utilize Seldon’s Docker pipeline and create the following python pipelines:
- Create an id feature from the name feature
- Create an SVMLight feature from the four core predictive features
- Create a model with either XGBoost, Vowpal Wabbit or Keras
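The three steps above can be sketched with plain Pandas and scikit-learn. This is not the code from the example directory: scikit-learn's GradientBoostingClassifier is swapped in for XGBoost, the column names (f1 to f4, name, nameId, svmfeatures) are assumptions, and the SVMLight-style feature is represented here as a dictionary of 1-based feature index to value, which may differ from the representation Seldon uses.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

iris = load_iris()
feature_cols = ["f1", "f2", "f3", "f4"]  # hypothetical column names
df = pd.DataFrame(iris.data, columns=feature_cols)
df["name"] = [str(iris.target_names[i]) for i in iris.target]

# Step 1: create an integer id feature from the "name" feature.
name_to_id = {name: idx for idx, name in enumerate(sorted(df["name"].unique()))}
df["nameId"] = df["name"].map(name_to_id)

# Step 2: create an SVMLight-style feature: {1-based feature index: value}.
df["svmfeatures"] = [
    {i + 1: float(value) for i, value in enumerate(row)}
    for row in df[feature_cols].to_numpy()
]

# Step 3: fit a gradient boosted tree model (stand-in for XGBoost).
model = GradientBoostingClassifier(random_state=0)
model.fit(df[feature_cols].to_numpy(), df["nameId"])

# Accuracy on the training data only; it says nothing about generalization.
accuracy = model.score(df[feature_cols].to_numpy(), df["nameId"])
print("training accuracy: %.3f" % accuracy)
```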
The pipeline utilizing XGBoost is shown below
Testing and Optimization
There are two modules for helping in testing and optimizing pipelines:
- cross_validation : Allow cross validation to be run on pipelines that use pandas dataframes
- bayes_optimize : Optimize an estimator’s parameters
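As a rough illustration of both tasks, the sketch below cross-validates a pipeline on a Pandas DataFrame and then tunes one parameter, using scikit-learn's own tools rather than the Seldon modules. A plain grid search stands in for Bayesian optimization here; the column names and the parameter grid are assumptions chosen for the example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = pd.DataFrame(iris.data, columns=["f1", "f2", "f3", "f4"])
y = pd.Series(iris.target, name="target")

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=500)),
])

# Cross validation: the whole pipeline is refit on each training fold,
# so the scaler never sees the held-out rows.
scores = cross_val_score(pipeline, X, y, cv=5)
print("mean accuracy: %.3f" % scores.mean())

# Parameter search: "clf__C" addresses parameter C of the step named "clf".
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print("best C:", search.best_params_["clf__C"])
```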
There is a notebook showing how to use these in a simple example.
Further examples can be found in
- wine.ipynb : Jupyter python notebook to run a pipeline on wine classification
- credit_card.ipynb : Jupyter python notebook to run a pipeline and optimize on credit card data
- sklearn_scaler.py : an example of using a sklearn scaler in a pipeline with Pandas
- auto_transform.py : an example of a simple auto_transform on pandas data
Any Pipeline built using this package can easily be deployed as a microservice as shown below, where we assume a pipeline has been saved to “./pipeline” and we wish to call the loaded model “test_model”: