hub / github.com/ContextLab/hypertools / format_data

Function format_data

hypertools/tools/format_data.py:9–163

Formats data into a list of numpy arrays. This function is useful to identify rows of your array that contain missing data or nans. The returned indices can be used to remove the rows with missing data, or to label the missing data points that are interpolated using PPCA.

(x, vectorizer='CountVectorizer',
                semantic='LatentDirichletAllocation', corpus='wiki', ppca=True, text_align='hyper')
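
The summary above talks about identifying rows that contain missing data so they can be removed or labeled. A minimal, library-free sketch of that idea follows. `rows_with_missing` is a hypothetical helper written only for illustration; it is not part of hypertools, and the library itself may detect and fill gaps differently (the docstring says PPCA is used when `ppca=True`).

```python
import math


def rows_with_missing(rows):
    """Return the indices of rows that contain at least one NaN or None.

    Hypothetical helper for illustration; not a hypertools function.
    """
    missing = []
    for i, row in enumerate(rows):
        if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in row):
            missing.append(i)
    return missing


# Toy data: rows 1 and 3 each contain a missing value.
data = [
    [1.0, 2.0, 3.0],
    [4.0, float("nan"), 6.0],
    [7.0, 8.0, 9.0],
    [None, 1.0, 2.0],
]

missing_idx = rows_with_missing(data)

# The indices can be used either to drop the incomplete rows...
complete = [row for i, row in enumerate(data) if i not in missing_idx]

# ...or to label which rows would need to be filled in.
labels = ["missing" if i in missing_idx else "observed" for i in range(len(data))]
```

Dropping rows discards information, while labeling keeps the row positions intact so that interpolated points can be marked later.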

Source from the content-addressed store, hash-verified

def format_data(x, vectorizer='CountVectorizer',
                semantic='LatentDirichletAllocation', corpus='wiki', ppca=True, text_align='hyper'):
    """
    Formats data into a list of numpy arrays

    This function is useful to identify rows of your array that contain missing
    data or nans. The returned indices can be used to remove the rows with
    missing data, or label the missing data points that are interpolated
    using PPCA.

    Parameters
    ----------

    x : numpy array, dataframe, string or (mixed) list
        The data to convert

    vectorizer : str, dict, class or class instance
        The vectorizer to use. Built-in options are 'CountVectorizer' or
        'TfidfVectorizer'. To change default parameters, set to a dictionary
        e.g. {'model' : 'CountVectorizer', 'params' : {'max_features' : 10}}. See
        http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text
        for details. You can also specify your own vectorizer model as a class,
        or class instance. With either option, the class must have a
        fit_transform method (see here: http://scikit-learn.org/stable/data_transforms.html).
        If a class, pass any parameters as a dictionary to vectorizer_params. If
        a class instance, no parameters can be passed.

    semantic : str, dict, class or class instance
        Text model to use to transform text data. Built-in options are
        'LatentDirichletAllocation' or 'NMF' (default: LDA). To change default
        parameters, set to a dictionary e.g. {'model' : 'NMF', 'params' :
        {'n_components' : 10}}. See
        http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition
        for details on the two model options. You can also specify your own
        text model as a class, or class instance. With either option, the class
        must have a fit_transform method (see here:
        http://scikit-learn.org/stable/data_transforms.html).
        If a class, pass any parameters as a dictionary to text_params. If
        a class instance, no parameters can be passed.

    corpus : list (or list of lists) of text samples or 'wiki', 'nips', 'sotus'.
        Text to use to fit the semantic model (optional). If set to 'wiki', 'nips'
        or 'sotus' and the default semantic and vectorizer models are used, a
        pretrained model will be loaded which can save a lot of time.

    ppca : bool
        Performs PPCA to fill in missing values (default: True)

    text_align : str
        Alignment algorithm to use when both text and numerical data are passed.
        If numerical arrays have the same shape, and the text data contains the
        same number of samples, the text and numerical data are automatically
        aligned to a common space. Example use case: an array of movie frames
        (frames by pixels) and text descriptions of the frame. In this case,
        the movie and text will be automatically aligned to the same space
        (default: hyperalignment).

    Returns
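
The `vectorizer` and `semantic` parameters each accept a string, a dict with `'model'` and `'params'` keys, a class, or a class instance, and the docstring requires a `fit_transform` method on custom models. The sketch below shows one way such a specification could be resolved into a model object. It is an assumption-laden illustration, not the hypertools implementation: `ToyVectorizer`, `REGISTRY`, and `resolve_model` are hypothetical names, and the toy model only mimics the `fit_transform` interface.

```python
class ToyVectorizer:
    """Stand-in model exposing fit_transform (hypothetical, for illustration)."""

    def __init__(self, max_features=None):
        self.max_features = max_features

    def fit_transform(self, docs):
        # Build an alphabetical vocabulary of whitespace-separated tokens.
        vocab = sorted({tok for doc in docs for tok in doc.split()})
        # Toy truncation: keep the first N tokens alphabetically. Real
        # vectorizers typically rank by frequency instead.
        if self.max_features is not None:
            vocab = vocab[: self.max_features]
        # One row of token counts per document.
        return [[doc.split().count(tok) for tok in vocab] for doc in docs]


# Hypothetical registry mapping built-in names to model classes.
REGISTRY = {"ToyVectorizer": ToyVectorizer}


def resolve_model(spec):
    """Turn a str, dict, class, or instance into a model instance."""
    if isinstance(spec, str):
        # Built-in name: instantiate with default parameters.
        return REGISTRY[spec]()
    if isinstance(spec, dict):
        # {'model': name, 'params': {...}}: override the defaults.
        return REGISTRY[spec["model"]](**spec.get("params", {}))
    # A class is instantiated; an instance is used as-is.
    model = spec() if isinstance(spec, type) else spec
    if not hasattr(model, "fit_transform"):
        raise TypeError("model must provide a fit_transform method")
    return model


docs = ["a b a", "b c"]
default_counts = resolve_model("ToyVectorizer").fit_transform(docs)
limited_counts = resolve_model(
    {"model": "ToyVectorizer", "params": {"max_features": 2}}
).fit_transform(docs)
```

Passing a class here always uses its default constructor arguments. The docstring mentions separate parameter dictionaries for that case, which this sketch leaves out.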

Callers 11

test_np_array · Function · 0.90
test_df · Function · 0.90
test_text · Function · 0.90
test_str · Function · 0.90
test_mixed_list · Function · 0.90
test_geo · Function · 0.90
test_missing_data · Function · 0.90
test_force_align · Function · 0.90
get_formatted_data · Method · 0.90
transform · Method · 0.90
plot · Function · 0.90

Calls 4

text2mat · Function · 0.90
df2mat · Function · 0.90
fill_missing · Function · 0.85
get_data · Method · 0.80

Tested by 8

test_np_array · Function · 0.72
test_df · Function · 0.72
test_text · Function · 0.72
test_str · Function · 0.72
test_mixed_list · Function · 0.72
test_geo · Function · 0.72
test_missing_data · Function · 0.72
test_force_align · Function · 0.72