github.com/wepe/MachineLearning

Method _chooseBestFeatureToSplit_ID3

DecisionTree/id3_c45.py:65–97

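ID3. Purpose: choose the best feature on which to split the input dataset. Parameters: X, the feature matrix (one column per feature); y, the labels. Main variables: numFeatures, the number of features; oldEntropy, the entropy of the original dataset; newEntropy, the entropy after splitting the dataset on a given feature; infoGain, the information gain; bestInfoGain, the largest information gain found so far; bestFeatureIndex, the index of the feature that yields the largest information gain.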
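The quantities named above can be written out as follows. This is a sketch under assumptions: `_calcEntropy` is not shown on this page, so Shannon entropy with a base-2 logarithm is assumed, and the symbols $p_k$, $y_v$ and $\mathrm{Gain}(i)$ are notation introduced here for illustration.

```latex
% oldEntropy: entropy of the full label vector y,
% where p_k is the fraction of samples with label k
H(y) = -\sum_{k} p_k \log_2 p_k

% newEntropy for feature i: entropy of each subset y_v (the labels of the rows
% where feature i takes value v), weighted by prob = |y_v| / |y|
H(y \mid i) = \sum_{v} \frac{|y_v|}{|y|} \, H(y_v)

% infoGain for feature i; bestFeatureIndex is the i that maximizes it
\mathrm{Gain}(i) = H(y) - H(y \mid i)
```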

(self,X,y)


    def _chooseBestFeatureToSplit_ID3(self,X,y):
        """ID3
        Purpose: choose the best feature on which to split the input dataset.
        Parameters: X is the feature matrix (one column per feature), y holds the labels.
        Main variables:
        numFeatures: number of features
        oldEntropy: entropy of the original dataset
        newEntropy: entropy after splitting the dataset on a given feature
        infoGain: information gain
        bestInfoGain: largest information gain found so far
        bestFeatureIndex: index of the feature that yields the largest information gain
        """
        numFeatures = X.shape[1]
        oldEntropy = self._calcEntropy(y)
        bestInfoGain = 0.0
        bestFeatureIndex = -1
        # Compute infoGain for every feature and keep the largest in bestInfoGain.
        for i in range(numFeatures):
            featList = X[:,i]
            uniqueVals = set(featList)
            newEntropy = 0.0
            # For each value of feature i, take the matching subset and compute its entropy;
            # the weighted sum is newEntropy, the entropy after splitting on feature i.
            for value in uniqueVals:
                sub_X,sub_y = self._splitDataSet(X,y,i,value)
                prob = len(sub_y)/float(len(y))
                newEntropy += prob * self._calcEntropy(sub_y)
            # Compute the information gain and use it to pick the best feature.
            infoGain = oldEntropy - newEntropy
            if (infoGain > bestInfoGain):
                bestInfoGain = infoGain
                bestFeatureIndex = i
        # Stays at -1 when no feature has a strictly positive information gain.
        return bestFeatureIndex
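To try the selection logic outside the class, the following is a minimal standalone sketch. The helpers `calc_entropy` and `split_data_set` are hypothetical stand-ins for `_calcEntropy` and `_splitDataSet`, whose bodies are not shown here. It assumes base-2 Shannon entropy and a split that simply keeps the rows where the chosen column equals the given value; the repository's versions may differ.

```python
import math
from collections import Counter

import numpy as np


def calc_entropy(y):
    """Shannon entropy (base 2) of the labels; assumed stand-in for _calcEntropy."""
    labels = np.asarray(y).tolist()
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_data_set(X, y, index, value):
    """Rows where column `index` equals `value`; assumed stand-in for _splitDataSet."""
    mask = X[:, index] == value
    return X[mask], y[mask]


def choose_best_feature_id3(X, y):
    """Return the column index with the largest information gain, or -1 if none is positive."""
    old_entropy = calc_entropy(y)
    best_gain, best_index = 0.0, -1
    for i in range(X.shape[1]):
        new_entropy = 0.0
        for value in np.unique(X[:, i]):
            _, sub_y = split_data_set(X, y, i, value)
            # Weight each subset's entropy by the share of samples it holds.
            new_entropy += len(sub_y) / len(y) * calc_entropy(sub_y)
        gain = old_entropy - new_entropy
        if gain > best_gain:
            best_gain, best_index = gain, i
    return best_index


# Toy data: column 0 determines the label exactly, column 1 carries no information.
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 0, 0])
print(choose_best_feature_id3(X, y))  # prints 0
```

On the toy data, splitting on column 0 leaves two pure subsets (entropy 0, gain 1 bit), while splitting on column 1 leaves the entropy unchanged (gain 0), so column 0 is selected. As in the listing, a dataset in which no feature has positive gain yields -1, which a caller would need to handle.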

Callers 1

_createTree (Method) · 0.95

Calls 2

_calcEntropy (Method) · 0.95

_splitDataSet (Method) · 0.95

Tested by

no test coverage detected