hub / github.com/ContextLab/hypertools / text2mat

Function text2mat

hypertools/tools/text2mat.py:26–146

Turns a list of text samples into a matrix using a vectorizer and a text model.

(data, vectorizer='CountVectorizer',
             semantic='LatentDirichletAllocation', corpus='wiki')
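
The signature and docstring describe a dictionary form for changing model parameters. A minimal usage sketch follows, assuming hypertools is installed and exposes the function as `hyp.tools.text2mat`; the wrapper name `embed_texts` and its default values are hypothetical, chosen only for illustration. The samples are passed as `corpus` as well, so the models are fit on the data itself instead of on a pretrained corpus.

```python
try:
    import hypertools as hyp
except ImportError:  # keeps the sketch loadable where hypertools is absent
    hyp = None


def embed_texts(samples, n_topics=2, max_features=50):
    """Vectorize `samples` and reduce them with NMF (hypothetical wrapper).

    Uses the dictionary form from the docstring:
    {'model': <name>, 'params': {...}} for both vectorizer and semantic.
    """
    if hyp is None:
        raise RuntimeError("hypertools is not installed")
    return hyp.tools.text2mat(
        samples,
        vectorizer={'model': 'CountVectorizer',
                    'params': {'max_features': max_features}},
        semantic={'model': 'NMF',
                  'params': {'n_components': n_topics}},
        corpus=samples,  # fit on the samples, not on 'wiki'/'nips'/'sotus'
    )
```

Calling `embed_texts(["first document", "second document", "a third one"])` would, per the docstring, return the transformed data as a list of numpy arrays.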

Source from the content-addressed store, hash-verified

@memoize
def text2mat(data, vectorizer='CountVectorizer',
             semantic='LatentDirichletAllocation', corpus='wiki'):
    """
    Turns a list of text samples into a matrix using a vectorizer and a text model

    Parameters
    ----------

    data : list (or list of lists) of text samples
        The text data to transform

    vectorizer : str, dict, class or class instance
        The vectorizer to use. Built-in options are 'CountVectorizer' or
        'TfidfVectorizer'. To change default parameters, set to a dictionary
        e.g. {'model' : 'CountVectorizer', 'params' : {'max_features' : 10}}. See
        http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text
        for details. You can also specify your own vectorizer model as a class,
        or class instance. With either option, the class must have a
        fit_transform method (see here: http://scikit-learn.org/stable/data_transforms.html).
        If a class, pass any parameters as a dictionary to vectorizer_params. If
        a class instance, no parameters can be passed.

    semantic : str, dict, class or class instance
        Text model to use to transform text data. Built-in options are
        'LatentDirichletAllocation' or 'NMF' (default: LDA). To change default
        parameters, set to a dictionary e.g. {'model' : 'NMF', 'params' :
        {'n_components' : 10}}. See
        http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition
        for details on the two model options. You can also specify your own
        text model as a class, or class instance. With either option, the class
        must have a fit_transform method (see here:
        http://scikit-learn.org/stable/data_transforms.html).
        If a class, pass any parameters as a dictionary to text_params. If
        a class instance, no parameters can be passed.

    corpus : list (or list of lists) of text samples or 'wiki', 'nips', 'sotus'.
        Text to use to fit the semantic model (optional). If set to 'wiki', 'nips'
        or 'sotus' and the default semantic and vectorizer models are used, a
        pretrained model will be loaded which can save a lot of time.

    Returns
    ----------

    transformed data : list of numpy arrays
        The transformed text data
    """
    if semantic is None:
        semantic = 'LatentDirichletAllocation'
    if vectorizer is None:
        vectorizer = 'CountVectorizer'
    model_is_fit=False
    if corpus is not None:
        if corpus in ('wiki', 'nips', 'sotus',):
            if semantic == 'LatentDirichletAllocation' and vectorizer == 'CountVectorizer':
                semantic = load(corpus + '_model')
                vectorizer = None
                model_is_fit = True
            else:
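
The visible part of the function body decides whether a pretrained model can be loaded in place of fitting one. That condition can be restated as a small standalone predicate. This is a sketch of the logic shown in the excerpt only, not the library's code: the name `uses_pretrained_model` is hypothetical, and the `isinstance` guard is an addition so that list or array corpora are never compared against the string names.

```python
PRETRAINED_CORPORA = ('wiki', 'nips', 'sotus')


def uses_pretrained_model(vectorizer='CountVectorizer',
                          semantic='LatentDirichletAllocation',
                          corpus='wiki'):
    """Return True when the excerpt's shortcut branch would be taken."""
    # None falls back to the defaults, as in the excerpt.
    if semantic is None:
        semantic = 'LatentDirichletAllocation'
    if vectorizer is None:
        vectorizer = 'CountVectorizer'
    return (
        isinstance(corpus, str)             # added guard (assumption)
        and corpus in PRETRAINED_CORPORA    # a named built-in corpus
        and semantic == 'LatentDirichletAllocation'
        and vectorizer == 'CountVectorizer'
    )
```

Any non-default vectorizer or text model, including the dictionary form, fails the equality checks and therefore falls through to the `else` branch where the excerpt ends.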

Callers 12

test_transform_text (Function) · 0.90
test_count_LDA (Function) · 0.90
test_tfidf_LDA (Function) · 0.90
test_count_NMF (Function) · 0.90
test_tfidf_NMF (Function) · 0.90
test_transform_no_text_model (Function) · 0.90
test_text_model_params (Function) · 0.90
test_vectorizer_params (Function) · 0.90
test_LDA_class (Function) · 0.90
test_LDA_class_instance (Function) · 0.90
test_corpus (Function) · 0.90
format_data (Function) · 0.90

Calls 6

load (Function) · 0.90
_check_mtype (Function) · 0.85
default_params (Function) · 0.85
_fit_models (Function) · 0.85
_transform (Function) · 0.85
get_data (Method) · 0.80
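
The docstring accepts four forms for `vectorizer` and `semantic`: a string, a dictionary, a class, or a class instance with a `fit_transform` method. The body of `_check_mtype` is not shown on this page, so the following is only a hypothetical sketch of how such a classification could be written. The function name `classify_model_spec`, its return labels, and the toy class `LengthFeatures` are all assumptions made for illustration.

```python
import inspect


def classify_model_spec(spec):
    """Label a model spec as 'str', 'dict', 'class' or 'class_instance'."""
    if isinstance(spec, str):
        return 'str'
    if isinstance(spec, dict):
        # The docstring's dictionary form carries the model under 'model'.
        if 'model' not in spec:
            raise ValueError("dictionary spec needs a 'model' key")
        return 'dict'
    kind = 'class' if inspect.isclass(spec) else 'class_instance'
    # Both remaining forms must expose fit_transform, per the docstring.
    if not callable(getattr(spec, 'fit_transform', None)):
        raise TypeError("custom model must have a fit_transform method")
    return kind


class LengthFeatures:
    """Toy custom model: one feature per document, its character count."""

    def fit_transform(self, docs):
        return [[len(doc)] for doc in docs]
```

For example, `classify_model_spec(LengthFeatures)` yields `'class'` while `classify_model_spec(LengthFeatures())` yields `'class_instance'`, matching the docstring's distinction between a class that can still receive parameters and an instance that cannot.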

Tested by 11

test_transform_text (Function) · 0.72
test_count_LDA (Function) · 0.72
test_tfidf_LDA (Function) · 0.72
test_count_NMF (Function) · 0.72
test_tfidf_NMF (Function) · 0.72
test_transform_no_text_model (Function) · 0.72
test_text_model_params (Function) · 0.72
test_vectorizer_params (Function) · 0.72
test_LDA_class (Function) · 0.72
test_LDA_class_instance (Function) · 0.72
test_corpus (Function) · 0.72