Data Mining Exercise Bank with Answers
Data Mining Review Questions and Answers

I. Consider the training set for a binary classification problem shown in Table 4-8 (the dataset for Exercise 3).

Table 4-8. Dataset for Exercise 3

  Instance   a1   a2   a3    Target class
  1          T    T    1.0   +
  2          T    T    6.0   +
  3          T    F    5.0   -
  4          F    F    4.0   +
  5          F    T    7.0   -
  6          F    T    3.0   -
  7          F    F    8.0   -
  8          T    F    7.0   +
  9          F    T    5.0   -

1. What is the entropy of the whole training set with respect to the class attribute?
2. What are the information gains of a1 and a2 relative to these training examples?
3. For the continuous attribute a3, compute the information gain of every possible split.
4. According to information gain, which of a1, a2, a3 is the best split?
5. According to classification error rate, which of a1 and a2 is best?
6. According to the Gini index, which of a1 and a2 is best?

Answer 1. Worked examples of computing entropy, from the course slides, with Entropy(t) = -sum_j p(j|t) log2 p(j|t):
  P(C1) = 0/6, P(C2) = 6/6: Entropy = -0 log2 0 - 1 log2 1 = 0
  P(C1) = 1/6, P(C2) = 5/6: Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
  P(C1) = 2/6, P(C2) = 4/6: Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92

For this training set, P(+) = 4/9 and P(-) = 5/9, so

  Entropy = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911.

Answer 2. Splitting based on information gain:

  GAIN_split = Entropy(p) - sum_{i=1}^{k} (n_i / n) Entropy(i),

where the parent node p is split into k partitions and n_i is the number of records in partition i. The gain measures the reduction in entropy achieved by the split; choose the split that achieves the greatest reduction (maximizes GAIN). This criterion is used in ID3 and C4.5. Disadvantage: it tends to prefer splits that result in a large number of partitions, each small but pure. (Probably not on the exam.)

For attribute a1, the corresponding counts are:

  a1   +   -
  T    3   1
  F    1   4

The weighted entropy after splitting on a1 is

  (4/9)[-(3/4) log2(3/4) - (1/4) log2(1/4)] + (5/9)[-(1/5) log2(1/5) - (4/5) log2(4/5)] = 0.7616,

so the information gain for a1 is 0.9911 - 0.7616 = 0.2294.

For attribute a2, the corresponding counts are:

  a2   +   -
  T    2   3
  F    2   2

The weighted entropy after splitting on a2 is

  (5/9)[-(2/5) log2(2/5) - (3/5) log2(3/5)] + (4/9)[-(2/4) log2(2/4) - (2/4) log2(2/4)] = 0.9839,

so the information gain for a2 is 0.9911 - 0.9839 = 0.0072.

Answer 3. Continuous attributes: for efficient computation, sort the attribute values, linearly scan them while updating the class count matrix and computing the impurity at each candidate position, and choose the split position with the lowest impurity. (The slides illustrate this with a Taxable Income example; its candidate-split table is not fully recoverable here. Slide material (c) Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004.)

For a3, the candidate splits (midpoints between consecutive distinct values) and their information gains are:

  a3    Class   Split point   Entropy   Info gain
  1.0   +       2.0           0.8484    0.1427
  3.0   -       3.5           0.9885    0.0026
  4.0   +       4.5           0.9183    0.0728
  5.0   -
  5.0   -       5.5           0.9839    0.0072
  6.0   +       6.5           0.9728    0.0183
  7.0   -
  7.0   +       7.5           0.8889    0.1022
  8.0   -

Answer 4. According to information gain, a1 produces the best split.
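The entropy and information-gain numbers above can be checked with a short script. This is an illustrative sketch (the names `records`, `entropy`, `info_gain` are my own); it reproduces the values 0.9911, 0.2294, 0.0072 and the a3 split gains.

```python
from math import log2

# Table 4-8 data: (a1, a2, a3, class)
records = [
    ("T", "T", 1.0, "+"), ("T", "T", 6.0, "+"), ("T", "F", 5.0, "-"),
    ("F", "F", 4.0, "+"), ("F", "T", 7.0, "-"), ("F", "T", 3.0, "-"),
    ("F", "F", 8.0, "-"), ("T", "F", 7.0, "+"), ("F", "T", 5.0, "-"),
]
labels = [r[-1] for r in records]

def entropy(ys):
    """Entropy (in bits) of a list of class labels."""
    n = len(ys)
    return -sum((ys.count(c) / n) * log2(ys.count(c) / n) for c in set(ys))

def info_gain(partitions):
    """Parent entropy minus the weighted entropy of the child partitions."""
    n = len(labels)
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in partitions)

# Information gain of the binary attributes a1 and a2
gain_a1 = info_gain([[r[-1] for r in records if r[0] == v] for v in "TF"])
gain_a2 = info_gain([[r[-1] for r in records if r[1] == v] for v in "TF"])

# Candidate splits for the continuous attribute a3:
# midpoints between consecutive distinct sorted values
vals = sorted({r[2] for r in records})
gain_a3 = {}
for lo, hi in zip(vals, vals[1:]):
    t = (lo + hi) / 2
    gain_a3[t] = info_gain([[r[-1] for r in records if r[2] <= t],
                            [r[-1] for r in records if r[2] > t]])

print(round(entropy(labels), 4))             # 0.9911
print(round(gain_a1, 4), round(gain_a2, 4))  # 0.2294 0.0072
print({t: round(g, 4) for t, g in sorted(gain_a3.items())})
```

No split of a3 beats a1's gain of 0.2294, which is the basis of Answer 4.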
Answer 5. From the slides, the classification error at a node is Error(t) = 1 - max_i P(i|t). Splitting on a1, the T child (3+, 1-) misclassifies 1 record and the F child (1+, 4-) misclassifies 1, for an error rate of 2/9. Splitting on a2, the T child (2+, 3-) misclassifies 2 records and the F child (2+, 2-) misclassifies 2, for an error rate of 4/9. Therefore, according to error rate, a1 produces the best split.

Answer 6. Slide example of computing the Gini index for a binary attribute (the split produces two partitions; larger and purer partitions are sought):

  Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
  Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.32
  Gini(children) = (7/12)(0.408) + (5/12)(0.32) = 0.371

For attribute a1, the Gini index is

  (4/9)[1 - (3/4)^2 - (1/4)^2] + (5/9)[1 - (1/5)^2 - (4/5)^2] = 0.3444.

For attribute a2, the Gini index is

  (5/9)[1 - (2/5)^2 - (3/5)^2] + (4/9)[1 - (2/4)^2 - (2/4)^2] = 0.4889.

Since the Gini index for a1 is smaller, it produces the better split.

II. Consider the following dataset for a binary classification problem.

  A   B   Class label
  T   F   +
  T   T   +
  T   T   +
  T   F   -
  T   T   +
  F   F   -
  F   F   -
  F   F   -
  T   T   -
  T   F   -

(Figure 4-13 compares the impurity measures for binary classification problems.)

1. Compute the information gains of A and B. Which attribute would the decision tree induction algorithm choose?

The contingency tables after splitting on attributes A and B are:

       +   -              +   -
  A=T  4   3        B=T   3   1
  A=F  0   3        B=F   1   5

The overall entropy before splitting is

  E_orig = -0.4 log2 0.4 - 0.6 log2 0.6 = 0.9710.

The information gain after splitting on A is:

  E_{A=T} = -(4/7) log2(4/7) - (3/7) log2(3/7) = 0.9852
  E_{A=F} = -(0/3) log2(0/3) - (3/3) log2(3/3) = 0
  Gain = E_orig - (7/10) E_{A=T} - (3/10) E_{A=F} = 0.2813.

The information gain after splitting on B is:

  E_{B=T} = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.8113
  E_{B=F} = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.6500
  Gain = E_orig - (4/10) E_{B=T} - (6/10) E_{B=F} = 0.2565.

Therefore, attribute A will be chosen to split the node.

2. Compute the Gini gains of A and B. Which attribute would decision tree induction choose?

The overall Gini before splitting is

  G_orig = 1 - 0.4^2 - 0.6^2 = 0.48.

The Gini gain after splitting on A is:

  G_{A=T} = 1 - (4/7)^2 - (3/7)^2 = 0.4898
  G_{A=F} = 1 - (3/3)^2 = 0
  Gain = G_orig - (7/10) G_{A=T} - (3/10) G_{A=F} = 0.1371.

The Gini gain after splitting on B is:

  G_{B=T} = 1 - (3/4)^2 - (1/4)^2 = 0.3750
  G_{B=F} = 1 - (1/6)^2 - (5/6)^2 = 0.2778
  Gain = G_orig - (4/10) G_{B=T} - (6/10) G_{B=F} = 0.1633.

Therefore, attribute B will be chosen to split the node. (This answer is correct.)
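The disagreement between the two criteria in parts 1 and 2 can be reproduced with a short script. A minimal sketch (the names `records` and `gain` are my own); it confirms that information gain prefers A while the Gini gain prefers B.

```python
from math import log2

# Exercise II data: (A, B, class)
records = [
    ("T", "F", "+"), ("T", "T", "+"), ("T", "T", "+"), ("T", "F", "-"),
    ("T", "T", "+"), ("F", "F", "-"), ("F", "F", "-"), ("F", "F", "-"),
    ("T", "T", "-"), ("T", "F", "-"),
]
labels = [r[-1] for r in records]

def entropy(ys):
    n = len(ys)
    return -sum((ys.count(c) / n) * log2(ys.count(c) / n) for c in set(ys))

def gini(ys):
    n = len(ys)
    return 1 - sum((ys.count(c) / n) ** 2 for c in set(ys))

def gain(impurity, attr):
    """Impurity reduction when splitting on attribute index `attr`."""
    parts = [[r[-1] for r in records if r[attr] == v] for v in "TF"]
    n = len(labels)
    return impurity(labels) - sum(len(p) / n * impurity(p) for p in parts)

print(round(gain(entropy, 0), 4), round(gain(entropy, 1), 4))  # info gain: A, B
print(round(gain(gini, 0), 4), round(gain(gini, 1), 4))        # Gini gain: A, B
# Information gain prefers A; the Gini gain prefers B.
```

This is exactly the situation question 3 below asks about: the two impurity measures behave similarly, but their gains need not rank attributes the same way.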
3. From Figure 4-13 it can be seen that both entropy and the Gini index increase monotonically on [0, 0.5] and decrease monotonically on [0.5, 1]. Is it possible for information gain and the Gini gain to favor different attributes? Explain your reasoning.

Yes. Even though these measures have a similar range and monotone behavior, their respective gains, which are scaled differences of the measures, do not necessarily behave in the same way, as illustrated by the results in parts (1) and (2).

Bayesian Classification

Example of a naive Bayes classifier. Given a test record X = (Refund = No, Marital Status = Married, Income = 120K), the naive Bayes classifier uses:

  P(Refund=Yes | No) = 3/7              P(Refund=No | No) = 4/7
  P(Refund=Yes | Yes) = 0               P(Refund=No | Yes) = 1
  P(Marital Status=Single | No) = 2/7
  P(Marital Status=Divorced | No) = 1/7
  P(Marital Status=Married | No) = 4/7
  P(Marital Status=Single | Yes) = 2/3
  P(Marital Status=Divorced | Yes) = 1/3
  P(Mar
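The classification of X can be sketched in code, with heavy caveats: the excerpt cuts off before the training table and the remaining probabilities, so the class priors P(No) = 7/10 and P(Yes) = 3/10 and the value P(Married | Yes) = 0 below are assumptions taken from the standard version of this slide example, and the continuous Income attribute is ignored here (the full example models it with a per-class Gaussian).

```python
from fractions import Fraction as F

# Assumed class priors (7 "No" and 3 "Yes" records in the standard example).
priors = {"No": F(7, 10), "Yes": F(3, 10)}

# Conditional probabilities from the list above; P(Married | Yes) = 0 is
# assumed, since the excerpt is cut off before that entry.
cond = {
    ("Refund=No", "No"): F(4, 7),
    ("Refund=No", "Yes"): F(1),
    ("Married", "No"): F(4, 7),
    ("Married", "Yes"): F(0),   # assumed
}

def nb_score(features, cls):
    """Unnormalized naive Bayes score: P(cls) * product of P(x | cls)."""
    score = priors[cls]
    for x in features:
        score *= cond[(x, cls)]
    return score

# Test record X = (Refund=No, Married); Income is omitted in this sketch.
x = ["Refund=No", "Married"]
scores = {c: nb_score(x, c) for c in priors}
print(scores)  # the "Yes" score collapses to 0 because P(Married | Yes) = 0
```

Under these assumptions the record is classified as "No"; a zero conditional probability zeroes out the whole product, which is why Laplace (m-estimate) smoothing is often applied in practice.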