text2mat(data, vectorizer='CountVectorizer', semantic='LatentDirichletAllocation', corpus='wiki')

Turns a list of text samples into a matrix using a vectorizer and a text model.
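
The docstring in the listing below says that `vectorizer` and `semantic` each accept four forms (a built-in name, a `{'model': ..., 'params': ...}` dictionary, a class, or a class instance), and the function body loads a pretrained model only when `corpus` is a built-in name and both models are the plain default strings. The following is a minimal sketch of that dispatch, not the library's own implementation: `resolve_model`, `uses_pretrained` and `UppercaseCounter` are hypothetical names introduced here for illustration, and the decision to raise on a missing `fit_transform` is an assumption.

```python
import inspect

# Built-in names taken from the docstring.
VECTORIZERS = ('CountVectorizer', 'TfidfVectorizer')
PRETRAINED_CORPORA = ('wiki', 'nips', 'sotus')


def resolve_model(spec, builtins):
    """Hypothetical helper: normalize a model spec to (model, params, kind)."""
    if isinstance(spec, str):
        if spec not in builtins:
            raise ValueError('unknown built-in model: %r' % spec)
        return spec, {}, 'str'
    if isinstance(spec, dict):
        # Dictionary form: {'model': ..., 'params': {...}}
        return spec['model'], dict(spec.get('params', {})), 'dict'
    # Class or class instance: the docstring requires a fit_transform method.
    if not hasattr(spec, 'fit_transform'):
        raise TypeError('custom models must provide a fit_transform method')
    kind = 'class' if inspect.isclass(spec) else 'class_instance'
    return spec, {}, kind


def uses_pretrained(corpus, vectorizer, semantic):
    """Mirror the condition in the listing: a built-in corpus name plus
    both models given as their default strings."""
    return (isinstance(corpus, str)
            and corpus in PRETRAINED_CORPORA
            and semantic == 'LatentDirichletAllocation'
            and vectorizer == 'CountVectorizer')


class UppercaseCounter:
    """Toy stand-in for a custom vectorizer; only fit_transform is needed."""

    def fit_transform(self, samples):
        # One feature per sample: the number of uppercase characters.
        return [[sum(ch.isupper() for ch in s)] for s in samples]


if __name__ == '__main__':
    specs = (
        'TfidfVectorizer',
        {'model': 'CountVectorizer', 'params': {'max_features': 10}},
        UppercaseCounter,
        UppercaseCounter(),
    )
    for spec in specs:
        print(resolve_model(spec, VECTORIZERS)[2])
    print(uses_pretrained('wiki', 'CountVectorizer', 'LatentDirichletAllocation'))
```

One consequence of the string comparison in the listing: passing the default model in dictionary form, for example `{'model': 'CountVectorizer', 'params': {'max_features': 10}}`, no longer equals `'CountVectorizer'`, so the pretrained shortcut is skipped even when `corpus='wiki'`.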
@memoize
def text2mat(data, vectorizer='CountVectorizer',
             semantic='LatentDirichletAllocation', corpus='wiki'):
    """
    Turns a list of text samples into a matrix using a vectorizer and a text model

    Parameters
    ----------

    data : list (or list of lists) of text samples
        The text data to transform

    vectorizer : str, dict, class or class instance
        The vectorizer to use. Built-in options are 'CountVectorizer' or
        'TfidfVectorizer'. To change default parameters, set to a dictionary
        e.g. {'model' : 'CountVectorizer', 'params' : {'max_features' : 10}}. See
        http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text
        for details. You can also specify your own vectorizer model as a class,
        or class instance. With either option, the class must have a
        fit_transform method (see here: http://scikit-learn.org/stable/data_transforms.html).
        If a class, pass any parameters as a dictionary to vectorizer_params. If
        a class instance, no parameters can be passed.

    semantic : str, dict, class or class instance
        Text model to use to transform text data. Built-in options are
        'LatentDirichletAllocation' or 'NMF' (default: LDA). To change default
        parameters, set to a dictionary e.g. {'model' : 'NMF', 'params' :
        {'n_components' : 10}}. See
        http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition
        for details on the two model options. You can also specify your own
        text model as a class, or class instance. With either option, the class
        must have a fit_transform method (see here:
        http://scikit-learn.org/stable/data_transforms.html).
        If a class, pass any parameters as a dictionary to text_params. If
        a class instance, no parameters can be passed.

    corpus : list (or list of lists) of text samples or 'wiki', 'nips', 'sotus'.
        Text to use to fit the semantic model (optional). If set to 'wiki', 'nips'
        or 'sotus' and the default semantic and vectorizer models are used, a
        pretrained model will be loaded which can save a lot of time.

    Returns
    -------

    transformed data : list of numpy arrays
        The transformed text data
    """
    if semantic is None:
        semantic = 'LatentDirichletAllocation'
    if vectorizer is None:
        vectorizer = 'CountVectorizer'
    model_is_fit = False
    if corpus is not None:
        if corpus in ('wiki', 'nips', 'sotus',):
            if semantic == 'LatentDirichletAllocation' and vectorizer == 'CountVectorizer':
                semantic = load(corpus + '_model')
                vectorizer = None
                model_is_fit = True
            else: