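The ID3 selection routine in this listing calls two helpers, `_calcEntropy` and `_splitDataSet`, whose bodies are not shown here. As a rough, self-contained sketch of the same information-gain idea, the version below uses plain Python lists instead of arrays. The names `calc_entropy`, `split_dataset`, and `choose_best_feature_id3` are hypothetical stand-ins, and the helper behavior is an assumption rather than the repository's actual implementation.

```python
from collections import Counter
from math import log2


def calc_entropy(y):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(y)
    return -sum((n / total) * log2(n / total) for n in Counter(y).values())


def split_dataset(X, y, index, value):
    """Keep only the samples whose feature `index` equals `value`.

    Assumption: the feature column is left in place; the real helper may drop it.
    """
    kept = [(row, label) for row, label in zip(X, y) if row[index] == value]
    sub_X = [row for row, _ in kept]
    sub_y = [label for _, label in kept]
    return sub_X, sub_y


def choose_best_feature_id3(X, y):
    """Return the index of the feature with the largest information gain, or -1."""
    old_entropy = calc_entropy(y)
    best_gain, best_index = 0.0, -1
    for i in range(len(X[0])):
        new_entropy = 0.0
        # Weighted sum of the entropies of the subsets produced by feature i.
        for value in set(row[i] for row in X):
            _, sub_y = split_dataset(X, y, i, value)
            new_entropy += len(sub_y) / len(y) * calc_entropy(sub_y)
        gain = old_entropy - new_entropy
        if gain > best_gain:
            best_gain, best_index = gain, i
    return best_index


# Toy data: feature 0 determines the label exactly, feature 1 is uninformative.
X = [["sunny", "hot"], ["sunny", "cold"], ["rainy", "hot"], ["rainy", "cold"]]
y = ["yes", "yes", "no", "no"]
print(choose_best_feature_id3(X, y))  # prints 0
```

Note that the routine returns -1 when no feature yields a positive gain. A caller should check for that value explicitly, since -1 is also a valid index for the last column in Python.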
    def _chooseBestFeatureToSplit_ID3(self, X, y):
        """ID3: choose the best feature to split the dataset on.

        Parameters:
            X: 2-D feature array, one row per sample and one column per feature
            y: class labels, one per row of X

        Key variables:
            numFeatures: number of features
            oldEntropy: entropy of the original dataset
            newEntropy: entropy after splitting the dataset on one feature
            infoGain: information gain
            bestInfoGain: largest information gain seen so far
            bestFeatureIndex: index of the feature that gives the largest gain

        Returns bestFeatureIndex, or -1 when no feature has positive gain.
        """
        numFeatures = X.shape[1]
        oldEntropy = self._calcEntropy(y)
        bestInfoGain = 0.0
        bestFeatureIndex = -1
        # Compute infoGain for every feature and track the largest in bestInfoGain.
        for i in range(numFeatures):
            featList = X[:, i]
            uniqueVals = set(featList)
            newEntropy = 0.0
            # For each value of feature i, take the matching subset and compute
            # its entropy. The weighted sum is newEntropy, the entropy of the
            # dataset after splitting on feature i.
            for value in uniqueVals:
                sub_X, sub_y = self._splitDataSet(X, y, i, value)
                prob = len(sub_y) / float(len(y))
                newEntropy += prob * self._calcEntropy(sub_y)
            # Compute the information gain and keep the feature with the largest gain.
            infoGain = oldEntropy - newEntropy
            if infoGain > bestInfoGain:
                bestInfoGain = infoGain
                bestFeatureIndex = i
        return bestFeatureIndex

    def _chooseBestFeatureToSplit_C45(self, X, y):
        """C4.5