hub / github.com/wepe/MachineLearning / _chooseBestFeatureToSplit_ID3

Method _chooseBestFeatureToSplit_ID3

DecisionTree/id3_c45.py:65–97

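ID3. Purpose: select the best feature on which to split the input dataset. Parameters X, y: the feature matrix (one column per feature) and the corresponding labels. Main variables: numFeatures: number of features; oldEntropy: entropy of the original dataset; newEntropy: entropy after splitting the dataset on a given feature; infoGain: information gain; bestInfoGain: largest information gain found so far; bestFeatureIndex: index of the split feature chosen when the information gain is largest.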
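The quantities named in the docstring relate to each other as sketched below. The notation is an assumption added for illustration: D is the dataset, p_k is the fraction of samples with label k, and D_v is the subset of D whose i-th feature equals v. The base-2 logarithm is also an assumption, since the base actually used depends on _calcEntropy, which is not shown on this page.

```latex
H(D) = -\sum_{k} p_k \log_2 p_k
\qquad
\mathrm{Gain}(D, i) = H(D) - \sum_{v} \frac{|D_v|}{|D|}\, H(D_v)
```

In these terms, oldEntropy corresponds to H(D), newEntropy to the weighted sum over v, and infoGain to Gain(D, i).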

(self,X,y)

Source from the content-addressed store, hash-verified

    def _chooseBestFeatureToSplit_ID3(self,X,y):
        """ID3
        Purpose: select the best feature on which to split the input dataset.
        Parameters:
            X: feature matrix, one column per feature
            y: labels, one per row of X
        Main variables:
            numFeatures: number of features
            oldEntropy: entropy of the original dataset
            newEntropy: entropy after splitting the dataset on a given feature
            infoGain: information gain
            bestInfoGain: largest information gain found so far
            bestFeatureIndex: index of the split feature chosen when the
                              information gain is largest
        """
        numFeatures = X.shape[1]
        oldEntropy = self._calcEntropy(y)
        bestInfoGain = 0.0
        # Stays -1 if no feature yields a positive information gain.
        bestFeatureIndex = -1
        # Compute infoGain for every feature and keep the largest in bestInfoGain.
        for i in range(numFeatures):
            featList = X[:,i]
            uniqueVals = set(featList)
            newEntropy = 0.0
            # For each value of feature i, build the matching sub-dataset and
            # compute its entropy; the weighted sum of these entropies is
            # newEntropy, the entropy after splitting on feature i.
            for value in uniqueVals:
                sub_X,sub_y = self._splitDataSet(X,y,i,value)
                prob = len(sub_y)/float(len(y))
                newEntropy += prob * self._calcEntropy(sub_y)
            # Compute the information gain and use it to pick the best split feature.
            infoGain = oldEntropy - newEntropy
            if (infoGain > bestInfoGain):
                bestInfoGain = infoGain
                bestFeatureIndex = i
        return bestFeatureIndex

    def _chooseBestFeatureToSplit_C45(self,X,y):
        """C4.5
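The selection loop in the listing can be tried in isolation with a minimal sketch like the one below. It is not the repository's code: `entropy`, `information_gain`, and `choose_best_feature` are hypothetical stand-ins for `_calcEntropy`, the inner loop, and the method itself. It uses plain Python lists instead of a NumPy array and assumes base-2 entropy.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2, an assumption) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature_index):
    """Entropy of the labels minus the weighted entropy after splitting on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        # Group the labels by the value this row takes on the chosen feature.
        groups.setdefault(row[feature_index], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


def choose_best_feature(rows, labels):
    """Return the feature index with the largest positive gain, or -1 if none is positive."""
    best_gain, best_index = 0.0, -1
    for i in range(len(rows[0])):
        gain = information_gain(rows, labels, i)
        if gain > best_gain:
            best_gain, best_index = gain, i
    return best_index


# Toy data: feature 0 separates the classes perfectly, feature 1 carries no information.
rows = [
    ["sunny", "hot"],
    ["sunny", "cool"],
    ["rainy", "hot"],
    ["rainy", "cool"],
]
labels = ["no", "no", "yes", "yes"]
best = choose_best_feature(rows, labels)
```

As in the listing, the comparison is strict, so a dataset in which no feature reduces entropy leaves the result at -1; a caller would need to handle that value rather than use it as a column index.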

Callers 1

_createTree (Method) · 0.95

Calls 2

_calcEntropy (Method) · 0.95
_splitDataSet (Method) · 0.95

Tested by

no test coverage detected