MCPcopy
hub / github.com/AutoViML/AutoViz / remove_variables_using_fast_correlation

Function remove_variables_using_fast_correlation

autoviz/AutoViz_Utils.py:1859–1931  ·  view source on GitHub ↗

#### THIS METHOD IS KNOWN AS SULO METHOD in HONOR OF my mother SULOCHANA SESHADRI ####### This highly efficient method removes variables that are highly correlated using a series of pair-wise correlation knockout rounds. It is extremely fast and hence can work on thousands of va

(df, numvars, modeltype, target,
                                            corr_limit=0.70, verbose=0)

Source from the content-addressed store, hash-verified

1857
1858##################################################################################
1859def remove_variables_using_fast_correlation(df, numvars, modeltype, target,
1860 corr_limit=0.70, verbose=0):
1861 """
1862 #### THIS METHOD IS KNOWN AS SULO METHOD in HONOR OF my mother SULOCHANA SESHADRI #######
1863 This highly efficient method removes variables that are highly correlated using a series of
1864 pair-wise correlation knockout rounds. It is extremely fast and hence can work on thousands
1865 of variables in less than a minute, even on a laptop. You need to send in a list of numeric
1866 variables and that's all! The method defines high Correlation as anything over 0.70 (absolute)
1867 but this can be changed. If two variables have absolute correlation higher than this, they
1868 will be marked, and using a process of elimination, one of them will get knocked out:
1869 To decide order of variables to keep, we use mutual information score to select. MIS returns
1870 a ranked list of these correlated variables: when we select one, we knock out other variables
1871 it is correlated to. Then we select next var. This way we remove correlated variables.
1872 Finally, we are left with uncorrelated variables that are also highly important (mutual score).
1873 ############## YOU MUST INCLUDE THE ABOVE MESSAGE IF YOU COPY THIS CODE IN YOUR LIBRARY #####
1874 """
1875 print(' Removing correlated variables from %d numerics using SULO method' % len(numvars))
1876 correlation_dataframe = df[numvars].corr()
1877 a = correlation_dataframe.values
1878 col_index = correlation_dataframe.columns.tolist()
1879 index_triupper = list(zip(np.triu_indices_from(a, k=1)[0], np.triu_indices_from(a, k=1)[1]))
1880 high_corr_index_list = [x for x in np.argwhere(abs(a[np.triu_indices(len(a), k=1)]) >= corr_limit)]
1881 tuple_list = [y for y in [index_triupper[x[0]] for x in high_corr_index_list]]
1882 correlated_pair = [(col_index[local_tuple[0]], col_index[local_tuple[1]]) for local_tuple in tuple_list]
1883 corr_pair_dict = dict(return_dictionary_list(correlated_pair))
1884 keys_in_dict = list(corr_pair_dict.keys())
1885 reverse_correlated_pair = [(y, x) for (x, y) in correlated_pair]
1886 reverse_corr_pair_dict = dict(return_dictionary_list(reverse_correlated_pair))
1887 for key, val in reverse_corr_pair_dict.items():
1888 if key in keys_in_dict:
1889 if len(key) > 1:
1890 corr_pair_dict[key] += val
1891 else:
1892 corr_pair_dict[key] = val
1893 flat_corr_pair_list = [item for sublist in correlated_pair for item in sublist]
1894 #### You can make it a dictionary or a tuple of lists. We have chosen the latter here to keep order intact.
1895 corr_pair_count_dict = count_freq_in_list(flat_corr_pair_list)
1896 corr_list = [k for (k, v) in corr_pair_count_dict]
1897 ###### This is for ordering the variables from highest to lowest importance by target ###
1898 if len(corr_list) == 0:
1899 print('Selecting all (%d) variables since none of them are highly correlated...' % len(numvars))
1900 return numvars
1901 else:
1902 max_feats = len(corr_list)
1903 if modeltype == 'Regression':
1904 sel_function = mutual_info_regression
1905 fs = SelectKBest(score_func=sel_function, k=max_feats)
1906 fs.fit(df[corr_list], df[target])
1907 mutual_info = dict(zip(corr_list, fs.scores_))
1908 else:
1909 sel_function = mutual_info_classif
1910 fs = SelectKBest(score_func=sel_function, k=max_feats)
1911 fs.fit(df[corr_list], df[target])
1912 mutual_info = dict(zip(corr_list, fs.scores_))
1913 #### The first variable in list has the highest correlation to the target variable ###
1914 sorted_by_mutual_info = [key for (key, val) in sorted(mutual_info.items(), key=lambda kv: kv[1], reverse=True)]
1915 ##### Now we select the final list of correlated variables ###########
1916 selected_corr_list = []

Callers 1

find_top_features_xgbFunction · 0.85

Calls 3

return_dictionary_listFunction · 0.85
count_freq_in_listFunction · 0.85
left_subtractFunction · 0.70

Tested by

no test coverage detected