hub / github.com/ContextLab/hypertools / text2mat

Function text2mat

hypertools/tools/text2mat.py:26–146

Turns a list of text samples into a matrix using a vectorizer and a text model.

(data, vectorizer='CountVectorizer',
             semantic='LatentDirichletAllocation', corpus='wiki')
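The `vectorizer` and `semantic` arguments each accept a string, a dict with `model` and `params` keys, a class, or a class instance. The sketch below shows one way such a specification could be resolved into an object with a `fit_transform` method. It is an illustration only, not the library's implementation: `ToyCountVectorizer`, `REGISTRY`, and `resolve_model` are hypothetical names, and the toy vectorizer caps its vocabulary by first-seen order rather than by frequency.

```python
import inspect


class ToyCountVectorizer:
    """Hypothetical stand-in for a real vectorizer; only fit_transform is required."""

    def __init__(self, max_features=None):
        self.max_features = max_features

    def fit_transform(self, docs):
        # Build the vocabulary in first-seen order, optionally capped.
        vocab = []
        for doc in docs:
            for word in doc.lower().split():
                if word not in vocab:
                    vocab.append(word)
        if self.max_features is not None:
            vocab = vocab[:self.max_features]
        # One row of word counts per document.
        return [[doc.lower().split().count(w) for w in vocab] for doc in docs]


# Hypothetical lookup table from built-in names to classes.
REGISTRY = {'CountVectorizer': ToyCountVectorizer}


def resolve_model(spec, registry=REGISTRY):
    """Turn a str, dict, class, or instance spec into a usable instance."""
    if isinstance(spec, str):
        return registry[spec]()  # built-in name, default parameters
    if isinstance(spec, dict):
        return registry[spec['model']](**spec.get('params', {}))
    if inspect.isclass(spec):
        return spec()  # user-supplied class, default parameters
    if hasattr(spec, 'fit_transform'):
        return spec  # already an instance; used as-is
    raise TypeError('spec must be a str, dict, class, or an instance with fit_transform')


docs = ['the cat sat', 'the dog sat down']
capped = resolve_model({'model': 'CountVectorizer', 'params': {'max_features': 2}})
print(capped.fit_transform(docs))  # [[1, 1], [1, 0]]
```

Checking for a class before checking for `fit_transform` matters here, because a class also exposes `fit_transform` as an attribute and would otherwise be returned uninstantiated.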

Source from the content-addressed store, hash-verified

@memoize
def text2mat(data, vectorizer='CountVectorizer',
             semantic='LatentDirichletAllocation', corpus='wiki'):
    """
    Turns a list of text samples into a matrix using a vectorizer and a text model

    Parameters
    ----------

    data : list (or list of lists) of text samples
        The text data to transform

    vectorizer : str, dict, class or class instance
        The vectorizer to use. Built-in options are 'CountVectorizer' or
        'TfidfVectorizer'. To change default parameters, set to a dictionary
        e.g. {'model' : 'CountVectorizer', 'params' : {'max_features' : 10}}. See
        http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text
        for details. You can also specify your own vectorizer model as a class,
        or class instance. With either option, the class must have a
        fit_transform method (see here: http://scikit-learn.org/stable/data_transforms.html).
        If a class, pass any parameters as a dictionary to vectorizer_params. If
        a class instance, no parameters can be passed.

    semantic : str, dict, class or class instance
        Text model to use to transform text data. Built-in options are
        'LatentDirichletAllocation' or 'NMF' (default: LDA). To change default
        parameters, set to a dictionary e.g. {'model' : 'NMF', 'params' :
        {'n_components' : 10}}. See
        http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition
        for details on the two model options. You can also specify your own
        text model as a class, or class instance. With either option, the class
        must have a fit_transform method (see here:
        http://scikit-learn.org/stable/data_transforms.html).
        If a class, pass any parameters as a dictionary to text_params. If
        a class instance, no parameters can be passed.

    corpus : list (or list of lists) of text samples or 'wiki', 'nips', 'sotus'.
        Text to use to fit the semantic model (optional). If set to 'wiki', 'nips'
        or 'sotus' and the default semantic and vectorizer models are used, a
        pretrained model will be loaded which can save a lot of time.

    Returns
    ----------

    transformed data : list of numpy arrays
        The transformed text data
    """
    if semantic is None:
        semantic = 'LatentDirichletAllocation'
    if vectorizer is None:
        vectorizer = 'CountVectorizer'
    model_is_fit=False
    if corpus is not None:
        if corpus in ('wiki', 'nips', 'sotus',):
            if semantic == 'LatentDirichletAllocation' and vectorizer == 'CountVectorizer':
                semantic = load(corpus + '_model')
                vectorizer = None
                model_is_fit = True
            else:
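
The excerpt above ends partway through the corpus handling. The condition it does show, under which a pretrained model is loaded instead of fitting a new one, can be restated as a small predicate. This is a sketch based only on the visible lines: `can_use_pretrained` and `PRETRAINED_CORPORA` are hypothetical names, and the `isinstance` guard is an added assumption to keep list-valued corpora out of the membership test.

```python
PRETRAINED_CORPORA = ('wiki', 'nips', 'sotus')


def can_use_pretrained(corpus, semantic='LatentDirichletAllocation',
                       vectorizer='CountVectorizer'):
    """Return True when the visible branch would load a pretrained model."""
    # None falls back to the defaults, as at the top of the excerpt.
    if semantic is None:
        semantic = 'LatentDirichletAllocation'
    if vectorizer is None:
        vectorizer = 'CountVectorizer'
    return (
        isinstance(corpus, str)  # assumption: only named corpora qualify
        and corpus in PRETRAINED_CORPORA
        and semantic == 'LatentDirichletAllocation'
        and vectorizer == 'CountVectorizer'
    )


print(can_use_pretrained('wiki'))                  # True
print(can_use_pretrained('wiki', semantic='NMF'))  # False
print(can_use_pretrained(['my own text']))         # False
```

Any customization of either model, including passing the default model name inside a dict, makes the comparison fail, so the pretrained shortcut applies only to the exact default strings.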

Callers 12

test_transform_text · Function · 0.90
test_count_LDA · Function · 0.90
test_tfidf_LDA · Function · 0.90
test_count_NMF · Function · 0.90
test_tfidf_NMF · Function · 0.90
test_text_model_params · Function · 0.90
test_vectorizer_params · Function · 0.90
test_LDA_class · Function · 0.90
test_LDA_class_instance · Function · 0.90
test_corpus · Function · 0.90
format_data · Function · 0.90

Calls 6

load · Function · 0.90
_check_mtype · Function · 0.85
default_params · Function · 0.85
_fit_models · Function · 0.85
_transform · Function · 0.85
get_data · Method · 0.80

Tested by 11

test_transform_text · Function · 0.72
test_count_LDA · Function · 0.72
test_tfidf_LDA · Function · 0.72
test_count_NMF · Function · 0.72
test_tfidf_NMF · Function · 0.72
test_text_model_params · Function · 0.72
test_vectorizer_params · Function · 0.72
test_LDA_class · Function · 0.72
test_LDA_class_instance · Function · 0.72
test_corpus · Function · 0.72