
Python3 ID3 Decision Tree: Implementation Code for Deciding Whether a Loan Application Succeeds

Views: 67 | Date: 2022-07-24 18:50:54

1. Defining the tree-building functions
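The functions in this section rest on two quantities: the entropy of the class labels, and the information gain obtained by splitting on one feature. As a minimal sketch that is independent of the article's own code, the snippet below computes both on a tiny made-up table. The names `entropy`, `info_gain` and `toy`, as well as the feature meanings, are assumptions chosen only for illustration; they are not taken from the loan spreadsheet.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Shannon entropy (base 2) of the class label stored in the last column."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def info_gain(rows, axis):
    """Entropy before the split minus the weighted entropy after splitting on column `axis`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[axis], []).append(row)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - remainder


# Hypothetical samples: [has_job, owns_house, approved]
toy = [
    ['yes', 'no', 'yes'],
    ['yes', 'yes', 'yes'],
    ['no', 'yes', 'no'],
    ['no', 'no', 'no'],
]

print(entropy(toy))       # 1.0: two classes, evenly split
print(info_gain(toy, 0))  # 1.0: has_job separates the classes perfectly
print(info_gain(toy, 1))  # 0.0: owns_house tells us nothing here
```

ID3 picks the feature with the largest gain at each node, which in this toy table would be column 0.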

# -*- coding: utf-8 -*-
# Functions for building the tree
from numpy import *
import numpy as np
import pandas as pd
from math import log
import operator


# Compute the information entropy of a dataset
# (called Shannon entropy in "Machine Learning in Action").
def calcInfoEnt(dataSet):
    # Each row of dataSet is one sample; each column is a feature, with the
    # class label in the last column (good/bad melon in the watermelon example).
    numEntries = len(dataSet)
    labelCounts = {}                       # label -> number of samples
    for featVec in dataSet:                # featVec runs over every row
        currentLabel = featVec[-1]         # the last value of a row is its label
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0  # first time this label is seen
        labelCounts[currentLabel] += 1     # count it (outside the if)
    InfoEnt = 0.0
    for key in labelCounts:                # accumulate over every label
        prob = float(labelCounts[key]) / numEntries
        InfoEnt -= prob * log(prob, 2)     # entropy term used by ID3
    return InfoEnt


# Discrete feature: return all samples whose feature `axis` equals `value`.
def splitDiscreteDataSet(dataSet, axis, value):
    # dataSet is the sample set at the current node, axis is the index of the
    # feature to split on, value is the feature value to select.
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]          # keep the features before axis
            reducedFeatVec.extend(featVec[axis+1:])  # keep the features after axis
            retDataSet.append(reducedFeatVec)
    return retDataSet


# Continuous feature: split the samples in two, using `value` as the threshold.
def splitContinuousDataSet(dataSet, axis, value):
    retDataSetG = []   # samples with feature value > value
    retDataSetL = []   # samples with feature value <= value
    for featVec in dataSet:
        if featVec[axis] > value:
            reducedFeatVecG = featVec[:axis]
            reducedFeatVecG.extend(featVec[axis+1:])
            retDataSetG.append(reducedFeatVecG)
        else:
            reducedFeatVecL = featVec[:axis]
            reducedFeatVecL.extend(featVec[axis+1:])
            retDataSetL.append(reducedFeatVecL)
    return retDataSetG, retDataSetL   # a tuple of the two subsets


# Choose the best feature to split on by information gain
# (for a continuous feature, also choose the threshold).
def chooseBestFeatureToSplit(dataSet, labels):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcInfoEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1   # stays -1 if no feature gives a positive gain
    bestSplitDict = {}
    for i in range(numFeatures):
        # Values of feature i over all samples at the current node
        featList = [example[i] for example in dataSet]
        if type(featList[0]).__name__ not in ('float', 'int'):
            # Discrete feature: information gain of splitting on every value
            uniqueVals = set(featList)
            newEntropy = 0.0
            for value in uniqueVals:
                subDataSet = splitDiscreteDataSet(dataSet, i, value)
                prob = len(subDataSet) / float(len(dataSet))
                newEntropy += prob * calcInfoEnt(subDataSet)  # weighted sum
            infoGain = baseEntropy - newEntropy
        else:
            # Continuous feature: n values give n-1 candidate thresholds
            # (midpoints of neighbouring sorted values); keep the best one.
            sortfeatList = sorted(featList)
            splitList = []
            for j in range(len(sortfeatList) - 1):
                splitList.append((sortfeatList[j] + sortfeatList[j+1]) / 2.0)
            bestSplitEntropy = 10000   # start from a very large entropy
            for j in range(len(splitList)):
                value = splitList[j]
                newEntropy = 0.0
                subDataSetG, subDataSetL = splitContinuousDataSet(dataSet, i, value)
                probG = len(subDataSetG) / float(len(dataSet))
                newEntropy += probG * calcInfoEnt(subDataSetG)
                probL = len(subDataSetL) / float(len(dataSet))
                newEntropy += probL * calcInfoEnt(subDataSetL)
                if newEntropy < bestSplitEntropy:
                    bestSplitEntropy = newEntropy
                    bestSplit = j
            # Remember the best threshold for this continuous feature
            bestSplitDict[labels[i]] = splitList[bestSplit]
            infoGain = baseEntropy - bestSplitEntropy
        # Over all features, discrete or continuous, keep the largest gain
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    # If the best feature is continuous, binarise it by "is it <= the best
    # threshold", e.g. "density" becomes "density<=0.3815".
    # Note: this modifies dataSet in place; the float values become 0 and 1.
    # The reason is explained near the end of createTree().
    if type(dataSet[0][bestFeature]).__name__ in ('float', 'int'):
        bestSplitValue = bestSplitDict[labels[bestFeature]]
        labels[bestFeature] = labels[bestFeature] + '<=' + str(bestSplitValue)
        for row in dataSet:
            row[bestFeature] = 1 if row[bestFeature] <= bestSplitValue else 0
    return bestFeature


# If every feature has been used and the samples at a node still disagree,
# take a vote: count each label and return the most frequent one.
def majorityCnt(classList):
    classCount = {}   # label -> count
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    return max(classCount, key=classCount.get)

2. Recursively building the decision tree

# Main routine: build the decision tree recursively.
# dataSet: the samples used to build the current subtree. It starts as
#   data_full and shrinks with every split. In the watermelon example the
#   17 melons sit at the root; the first bestFeat is "texture", whose values
#   clear / slightly blurry / blurry split them into 9, 5 and 3 melons, and
#   the set of features left to split on shrinks by one.
# labels: the features still available for splitting at the current node.
# data_full: the complete data.
# labels_full: the complete list of feature names.
def createTree(dataSet, labels, data_full, labels_full):
    # Stop condition 1: the sample set at this node is empty
    # (checked first so that classList[0] below is safe).
    if len(dataSet) == 0:
        return 'empty'
    classList = [example[-1] for example in dataSet]
    # Stop condition 2: every sample at this node has the same class.
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Stop condition 3: every feature has been used (only the label column
    # is left), so decide the label by majority vote.
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    # Otherwise keep splitting.
    bestFeat = chooseBestFeatureToSplit(dataSet, labels)  # index of the best feature
    bestFeatLabel = labels[bestFeat]                      # its name
    myTree = {bestFeatLabel: {}}
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    isDiscrete = type(dataSet[0][bestFeat]).__name__ == 'str'
    if isDiscrete:
        # All values this feature takes in the complete data
        currentlabel = labels_full.index(labels[bestFeat])
        featValuesFull = [example[currentlabel] for example in data_full]
        uniqueValsFull = set(featValuesFull)
    del(labels[bestFeat])   # the feature is used up; drop it from the candidates
    # Recurse: one subtree per value of bestFeat that occurs at this node.
    for value in uniqueVals:
        subLabels = labels[:]
        if isDiscrete:
            uniqueValsFull.remove(value)   # leaves only the values not seen here
        # splitDiscreteDataSet() is used for every feature because
        # chooseBestFeatureToSplit() has already binarised a continuous one.
        myTree[bestFeatLabel][value] = createTree(
            splitDiscreteDataSet(dataSet, bestFeat, value),
            subLabels, data_full, labels_full)
    if isDiscrete:
        # Values of a discrete feature that do not occur at this node are what
        # remains in uniqueValsFull; give each such branch the majority label.
        for value in uniqueValsFull:
            myTree[bestFeatLabel][value] = majorityCnt(classList)
    return myTree

3. Calling the tree builder

# Build the tree
df = pd.read_excel(r'E:BaiduNetdiskDownloadspss數據實驗data銀行貸款.xlsx')
data = df.values[:, 1:].tolist()            # drop the first column
data_full = data[:]
labels = df.columns.values[1:-1].tolist()   # feature names, without the label column
labels_full = labels[:]
myTree = createTree(data, labels, data_full, labels_full)
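The tree returned by createTree() is a nested dict: each inner node has a single key, the feature name (with "<=threshold" appended for a binarised continuous feature, whose branches are then keyed 1 and 0), and each leaf is a class label. A minimal sketch of how such a tree could be used to decide a new application follows. The function `classify`, the convention of passing the sample as a dict, and the hand-written `demo_tree` with its feature names are all assumptions for illustration; they are not part of the original code or derived from the spreadsheet.

```python
def classify(tree, sample):
    """Walk a nested-dict tree of the shape createTree() returns.

    `sample` is a hypothetical dict mapping feature name -> raw value.
    """
    while isinstance(tree, dict):
        node_label = next(iter(tree))        # the single key is the split feature
        branches = tree[node_label]
        if '<=' in node_label:               # binarised continuous feature
            name, threshold = node_label.rsplit('<=', 1)
            key = 1 if sample[name] <= float(threshold) else 0
        else:                                # discrete feature
            key = sample[node_label]
        tree = branches[key]                 # KeyError if the branch is missing
    return tree


# Hand-written example tree (hypothetical, not the output for the loan data)
demo_tree = {
    'has_house': {
        'yes': 'approved',
        'no': {'income<=4500.0': {1: 'rejected', 0: 'approved'}},
    }
}

print(classify(demo_tree, {'has_house': 'no', 'income': 6000}))   # approved
print(classify(demo_tree, {'has_house': 'no', 'income': 3000}))   # rejected
```

The sketch assumes that discrete feature names never contain "<=". A value that has no branch in the tree raises KeyError, so a real caller would need to decide how to handle unseen values.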

Inspecting the data

data


labels


4. Plotting the decision tree

# Functions for plotting the decision tree
import matplotlib.pyplot as plt

decisionNode = dict(boxstyle='sawtooth', fc='0.8')   # style of decision nodes
leafNode = dict(boxstyle='round4', fc='0.8')         # style of leaf nodes
arrow_args = dict(arrowstyle='<-')                   # style of the arrows


# Count the leaf nodes of the tree
def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs


# Compute the maximum depth of the tree
def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth


# Draw one node, with an arrow from its parent
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va='center', ha='center', bbox=nodeType,
                            arrowprops=arrow_args)


# Write the branch value halfway along the arrow
def plotMidText(cntrPt, parentPt, txtString):
    lens = len(txtString)
    xMid = (parentPt[0] + cntrPt[0]) / 2.0 - lens * 0.002
    yMid = (parentPt[1] + cntrPt[1]) / 2.0
    createPlot.ax1.text(xMid, yMid, txtString)


# Draw the subtree rooted at myTree below parentPt
def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]
    cntrPt = (plotTree.x0ff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW,
              plotTree.y0ff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.y0ff = plotTree.y0ff - 1.0 / plotTree.totalD   # go down one level
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            plotTree.x0ff = plotTree.x0ff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.x0ff, plotTree.y0ff), cntrPt, leafNode)
            plotMidText((plotTree.x0ff, plotTree.y0ff), cntrPt, str(key))
    plotTree.y0ff = plotTree.y0ff + 1.0 / plotTree.totalD   # back up one level


# Set up the figure and draw the whole tree
def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.x0ff = -0.5 / plotTree.totalW
    plotTree.y0ff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

5. Calling the plotting function

# Draw the decision tree
createPlot(myTree)

myTree
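Printed raw, the nested dict in myTree can be hard to read once the tree has more than a couple of levels. As a hedged alternative to the matplotlib figure, the sketch below turns a tree of the same shape into indented text lines. The function `tree_to_lines` and the hand-written `demo_tree` are assumptions made up for illustration; the feature names are invented and are not the columns of the loan spreadsheet.

```python
def tree_to_lines(tree, indent=0):
    """Return a nested-dict decision tree as a list of indented text lines."""
    if not isinstance(tree, dict):                  # a leaf: just the class label
        return [' ' * indent + '-> ' + str(tree)]
    lines = []
    for feature, branches in tree.items():          # one key per inner node
        for value, subtree in branches.items():     # one branch per feature value
            lines.append(' ' * indent + f'{feature} = {value}')
            lines.extend(tree_to_lines(subtree, indent + 4))
    return lines


# Hand-written example tree (hypothetical, not the output for the loan data)
demo_tree = {
    'has_house': {
        'yes': 'approved',
        'no': {'income<=4500.0': {1: 'rejected', 0: 'approved'}},
    }
}

print('\n'.join(tree_to_lines(demo_tree)))
```

For a binarised continuous feature, the branch value 1 means the condition in the node name holds and 0 means it does not.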

