seldon.text package

Submodules

seldon.text.docsim

class seldon.text.docsim.DefaultJsonCorpus(input=None, create_dictionary=True)

Bases: object

A default JSON corpus modelled on gensim's TextCorpus. It takes a file or a list of JSON documents as input. The TextCorpus-style methods are required for gensim training, and any corpus passed to DocumentSimilarity should provide the methods defined in this class.

get_dictionary()
get_json()
get_meta()

Return a JSON object with metadata for the documents. It must include id, the id of the document, and may include title and tags. The tags serve as the ground truth when scoring document similarity results.

get_texts(raw=False)

yield raw text or tokenized text

getstream()
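
To illustrate the interface described above, here is a minimal sketch of a stand-in corpus that offers get_texts and get_meta. It is not the library's implementation: the class name ToyJsonCorpus, the "text" field, the tokenizer, and the exact shape of the metadata are all assumptions made for the example, and get_dictionary is omitted because it would need a gensim Dictionary.

```python
import json
import re


class ToyJsonCorpus:
    """Hypothetical stand-in that mirrors some DefaultJsonCorpus method names."""

    def __init__(self, lines):
        # Each element is assumed to be one JSON document serialized as a string.
        self.docs = [json.loads(line) for line in lines]

    def get_texts(self, raw=False):
        # Yield the raw text, or a list of lowercase word tokens.
        for doc in self.docs:
            text = doc.get("text", "")  # the "text" field name is an assumption
            yield text if raw else re.findall(r"[a-z0-9]+", text.lower())

    def get_meta(self):
        # Yield one metadata dict per document: a required id, optional title and tags.
        for doc in self.docs:
            meta = {"id": doc["id"]}
            for key in ("title", "tags"):
                if key in doc:
                    meta[key] = doc[key]
            yield meta


corpus = ToyJsonCorpus([
    '{"id": 1, "title": "A", "tags": "news, sport", "text": "Football results today"}',
    '{"id": 2, "text": "Election results"}',
])
```

Iterating over corpus.get_texts() gives one token list per document, while corpus.get_texts(raw=True) gives the untouched strings.
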
class seldon.text.docsim.DocumentSimilarity(model_type='gensim_lsi', vec_size=100, annoy_trees=100, work_folder='/tmp', sklearn_tfidf_args={'stop_words': 'english'}, sklearn_nmf_args={'alpha': 0.1, 'random_state': 1, 'l1_ratio': 0.5})

Bases: seldon.util.Recommender

Parameters:
  • model_type (string) – gensim_lsi,gensim_lda,gensim_rp,sklearn_nmf
  • vec_size (int) – vector size of model
  • annoy_trees (int) – number of trees to create for Annoy approx nearest neighbour
  • work_folder (str) – folder for tmp files
  • sklearn_tfidf_args (dict, optional) – args to pass to sklearn TfidfVectorizer
  • sklearn_nmf_args (dict, optional) – args to pass to sklearn NMF model
__getstate__()

Remove things that should not be pickled as they are handled in save/load

create_gensim_model(corpus)

Create a gensim model

Parameters:corpus (an object that satisfies a gensim TextCorpus) –
Returns:
Return type:gensim corpus model
create_sklearn_model(corpus)

Create a sklearn model

Parameters:corpus (object) – a corpus object that has a get_texts(raw=True) method
Returns:
Return type:gensim corpus model
fit(corpus)

Fit a document similarity model

Parameters:corpus (object) – a corpus object that follows DefaultJsonCorpus
Returns:
Return type:trained DocumentSimilarity object
get_meta(doc_id)
load(folder)

load models from folder

Parameters:folder (str) – location of models
nn(doc_id, k=1, translate_id=False, approx=False)

nearest neighbour query

Parameters:
  • doc_id (long) – internal or external document id
  • k (int) – number of neighbours to return
  • translate_id (bool) – translate doc_id into internal id
  • approx (bool) – run approx nearest neighbour search using Annoy
Returns:

Return type:

list of pairs of document id (internal or external) and similarity metric in range (0,1)
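
As a rough picture of what an exact (non-approximate) nearest neighbour query returns, the sketch below ranks document vectors against a query document and returns (id, similarity) pairs. The source does not say which similarity measure is used; cosine similarity, the function names, and the toy vectors are assumptions made for illustration, and the Annoy-based approximate search is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors, doc_id, k=1):
    """Return the k most similar (doc_id, similarity) pairs, excluding the query itself."""
    query = vectors[doc_id]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != doc_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-dimensional document vectors keyed by internal id.
vectors = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0]}
neighbours = nearest(vectors, 0, k=2)
```

Here document 1 ranks ahead of document 2 because its vector points in nearly the same direction as the query.
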

recommend(user=None, ids=[], recent_interactions=[], client=None, limit=1)
save(folder)

save models to folder

Parameters:folder (str) – saved location folder
score(k=1, approx=False)

score a model

Parameters:
  • k (int) – number of neighbours to return
  • approx (bool) – run approx nearest neighbour search using Annoy
Returns:

Return type:

accuracy metric: the average Jaccard distance between the returned tags and the ground-truth tags in the metadata
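
The scoring idea can be sketched as follows: for each document, compare its tags with the tags of each returned neighbour and average the overlap. This is a minimal illustration, not the library's code. It computes the Jaccard index |A ∩ B| / |A ∪ B|, whereas the text above says "distance" without defining it, and the helper average_tag_overlap and the toy data are hypothetical.

```python
def jaccard(s1, s2):
    """Jaccard index |s1 & s2| / |s1 | s2|; 0.0 when both sets are empty."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


def average_tag_overlap(tags_by_doc, neighbours_by_doc):
    """Average Jaccard index between each document's tags and its neighbours' tags."""
    scores = [
        jaccard(tags_by_doc[doc], tags_by_doc[other])
        for doc, others in neighbours_by_doc.items()
        for other in others
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical ground-truth tags and k=1 neighbour lists.
tags_by_doc = {1: {"news", "sport"}, 2: {"sport"}, 3: {"politics"}}
neighbours_by_doc = {1: [2], 2: [1], 3: [1]}
accuracy = average_tag_overlap(tags_by_doc, neighbours_by_doc)
```

A higher value means that neighbours tend to share tags with the query document.
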

seldon.text.docsim.current_milli_time()
seldon.text.docsim.jaccard(s1, s2)

seldon.text.tagrecommend

class seldon.text.tagrecommend.Tag_Recommender(max_s2_size=0.1, min_s2_size=25, min_score=0.0)

Bases: sklearn.base.BaseEstimator

asymmetric_occur(s1, s2, min_s2_size=25)

Return asymmetric occurrence of set s1 against s2

Parameters:
  • s1 (set) – set (of document ids)
  • s2 (set) – set (of document ids)
  • min_s2_size (int, optional) – the absolute min number of documents in s2. Increase to stop very unlikely tags being recommended.
fit(corpus, split_char=', ')

Process a corpus and fit data.

Parameters:
  • corpus (object) – a corpus object that follows seldon.text.DefaultJsonCorpus
  • split_char (str) – character to split tags
Returns:

Return type:

trained Tag_Recommender object
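
Since the metrics in this class operate on sets of document ids, fitting plausibly involves building a tag-to-documents index. The sketch below shows one way to do that, assuming each document's tags arrive as a single string delimited by split_char. The function build_tag_index and the metadata layout are hypothetical and are not taken from the library.

```python
from collections import defaultdict


def build_tag_index(metas, split_char=", "):
    """Map each tag to the set of ids of the documents that carry it."""
    index = defaultdict(set)
    for meta in metas:
        for tag in meta.get("tags", "").split(split_char):
            if tag:  # skip the empty string produced by a missing tags field
                index[tag].add(meta["id"])
    return dict(index)


# Hypothetical metadata records; the third document has no tags.
metas = [
    {"id": 1, "tags": "news, sport"},
    {"id": 2, "tags": "sport"},
    {"id": 3},
]
tag_index = build_tag_index(metas)
```

Each value in tag_index is a set of document ids, which is the kind of argument the s1 and s2 parameters describe.
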

jaccard(s1, s2, max_s2_size=0.1)

Return jaccard distance between two sets (of documents)

Parameters:
  • s1 (set) – set (of document ids)
  • s2 (set) – set (of document ids)
  • max_s2_size (float, optional) – the max percentage size of s2 for a result to be returned. Set it to stop very popular tags from returning non-zero scores.
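
The two cutoffs can be pictured with the following sketch. It rests on assumptions the text does not confirm: that max_s2_size is a fraction of the total number of documents (passed here as a hypothetical num_docs argument), that the asymmetric measure is the share of s2 that also occurs in s1, and that a filtered pair scores 0.0. The name jaccard_capped is also invented for the example.

```python
def jaccard_capped(s1, s2, num_docs, max_s2_size=0.1):
    """Jaccard index, or 0.0 when s2 covers more than max_s2_size of all documents."""
    if num_docs and len(s2) / num_docs > max_s2_size:
        return 0.0  # s2 belongs to a very popular tag, so suppress it
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


def asymmetric_occur(s1, s2, min_s2_size=25):
    """Share of s2 that also appears in s1, or 0.0 when s2 is too small."""
    if len(s2) < min_s2_size:
        return 0.0  # too few documents for the ratio to be trustworthy
    return len(s1 & s2) / len(s2)


s1 = set(range(0, 40))   # ids of documents carrying the query tag
s2 = set(range(20, 50))  # ids of documents carrying a candidate tag

rare_enough = jaccard_capped(s1, s2, num_docs=1000)  # s2 is 3% of the corpus
too_popular = jaccard_capped(s1, s2, num_docs=100)   # s2 is 30% of the corpus
asym = asymmetric_occur(s1, s2)
```

Under these assumptions, a small s2 would otherwise yield a high asymmetric score from only a few shared documents, which is why a minimum size helps keep unlikely tags out of the results.
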
knn(tag, k=5, metric='jaccard', exclusions=[])

Get k nearest neighbours of a tag

Parameters:
  • tag (str) – query tag
  • k (int) – number of neighbours to return
  • metric (str) – metric to use, ‘jaccard’ or ‘asym’
  • exclusions (list of str) – tags to exclude
Returns:

Return type:

list of tuples of tag,score

recommend(tags, k=5, knn_k=5, metric='both')

recommend tags for a given set of tags

Parameters:
  • tags (str) – query tags
  • k (int) – number of tags to return
  • knn_k (int) – number of nearest neighbours for each tag to collect
  • metric (str) – metric to use, ‘jaccard’, ‘asym’ or ‘both’ (the default)
Returns:

Return type:

sorted list of tuples of tag,score

Module contents