Jacek Biesiada¹ and Włodzisław Duch²,³

¹ Division of Computer Methods, Dept. of Electrotechnology, The Silesian University of Technology, Katowice, Poland
² Dept. of Informatics, Nicolaus Copernicus University, Toruń, Poland
³ School of Computer Engineering, Nanyang Technological University, Singapore; Google: Duch
Summary. An algorithm for filtering information based on the Kolmogorov-Smirnov correlation-based approach has been implemented and tested on feature selection. The only parameter of this algorithm is the statistical confidence level that two distributions are identical. Empirical comparisons with 4 other state-of-the-art feature selection algorithms (FCBF, CorrSF, ReliefF and ConnSF) are very encouraging.
1 Introduction
For large highly dimensional datasets feature ranking and feature selection algorithms are usually of the filter type. In the simplest case a feature filter is a function returning a relevance index J(S|D,C) that estimates, given the data D, how relevant a given feature subset S is for the task C (usually classification or approximation of the data). The relevance index J(S|D,C) is calculated directly from data, without any reference to the results of programs that are used on data with reduced dimensionality. Since the data D and the task C are usually fixed and only the subset S varies, an abbreviated form J(S) is used. Instead of a simple function (such as correlation or information content) an algorithmic procedure, such as building a decision tree or finding nearest neighbors, may be used to estimate this index.
Relevance indices computed for individual features X_i, i = 1...N, establish a ranking order J(X_i1) ≤ J(X_i2) ≤ ... ≤ J(X_iN). Those features which have the lowest ranks are filtered out. For independent features this may be sufficient, but if features are correlated many of them may be redundant. For some data distributions the best pair of features may not even include the single best feature [14, 2]! Thus ranking does not guarantee that the largest subset of important features will be found. Methods that search for the best subset of features may also use filters to evaluate the usefulness of subsets of features.
Although in the case of filter methods there is no direct dependence of the relevance index on the adaptive algorithms, obviously the thresholds for feature rejection may be set either for relevance indices, or by evaluation of the feature contributions by the final system. Features are ranked by the filter, but how many are finally taken may be determined using an adaptive system as a wrapper. Evaluations of the adaptive system performance (frequently crossvalidation tests) are done only for a few pre-selected feature sets, but still this "filtrapper" approach may be rather costly. What is needed is a simple filter method that may be applied to large datasets, ranking and removing redundant features, parameterized in a statistically well-established way. Such an approach is described in this paper.
In the next section a new relevance index based on the Kolmogorov-Smirnov (K-S) test, estimating the correlation between the distribution of feature values and the class labels, is introduced. Correlation-based filters are very fast and may be competitive with filters based on information theory. Therefore in section 3 empirical comparisons between the K-S filter, the Pearson's correlation-based filter and popular filters based on information gain are made on a number of datasets.
2 Theoretical framework
2.1 Correlation-Based Measures
For feature X with values x and classes C with values c, where X, C are treated as random variables, Pearson's linear correlation coefficient is defined as [11]:

    ϱ(X,C) = [E(XC) − E(X)E(C)] / sqrt(σ²(X)σ²(C)) = Σ_i (x_i − x̄)(c_i − c̄) / sqrt(Σ_i (x_i − x̄)² Σ_j (c_j − c̄)²)    (1)

ϱ(X,C) is equal to ±1 if X and C are linearly dependent and zero if they are completely uncorrelated. The simplest test estimating the significance of the differences in ϱ(X,C) values is based on the probability that two variables are correlated [11]:

    P(X ∼ C) = erf(|ϱ(X,C)| sqrt(N/2)),    (2)

where erf is the error function. The feature list ordered by decreasing values of P(X ∼ C) may serve as feature ranking. Non-parametric, or Spearman's rank correlation coefficients may be useful for ordinal data types.
An alternative approach is to use the χ² statistics, but in both cases for a large number of samples the probability P(X ∼ C) is so close to 1 that ranking becomes impossible due to the finite numerical accuracy of computations. For example, with N = 1000 samples small coefficients ϱ(X,C) ≈ 0.02 lead to probabilities of correlation around 0.5. The ϱ(X,C) or χ² thresholds for the significance of a given feature may therefore be taken from a large interval corresponding to almost the same probabilities of correlation.
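For illustration, Eqs. (1) and (2) translate directly into code. The following is only a minimal sketch (not part of the original experiments), assuming NumPy is available and that feature and class values are given as numeric vectors; the function name is illustrative:

import numpy as np
from math import erf, sqrt

def pearson_relevance(x, c):
    # Pearson coefficient rho(X,C), Eq. (1), via the sample correlation matrix
    x, c = np.asarray(x, float), np.asarray(c, float)
    rho = np.corrcoef(x, c)[0, 1]
    # probability that X and C are correlated, Eq. (2)
    p = erf(abs(rho) * sqrt(len(x) / 2))
    return rho, p

With N = 1000 and ϱ ≈ 0.02 this gives P(X ∼ C) ≈ 0.47, reproducing the value of about 0.5 quoted above.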
Information theory is frequently used to define relevance indices. The Shannon information, or feature and class entropy, is:

    H(X) = −Σ_i P(x_i) log P(x_i);    H(C) = −Σ_i P(c_i) log P(c_i)    (3)

and the joint Shannon entropy is:

    H(X,C) = −Σ_{i,j} P(x_i, c_j) log P(x_i, c_j).    (4)

Mutual Information (MI) is the basic quantity used for information filtering:

    MI(X,C) = H(X) + H(C) − H(X,C).    (5)
The Symmetrical Uncertainty Coefficient (SU) has similar properties to mutual information:

    SU(X,C) = 2 MI(X,C) / (H(X) + H(C)).    (6)

If a group of k features X_k has already been selected, a correlation coefficient may be used to estimate the correlation between this group and the class, including inter-correlations between the features. Denoting the average correlation coefficient between these features and the classes as r̄_kc = ϱ̄(X_k, C), and the average between different features as r̄_kk = ϱ̄(X_k, X_k), the relevance of the feature subset is defined as:

    J(X_k, C) = k r̄_kc / sqrt(k + (k−1) r̄_kk).    (7)
This formula has been used in the Correlation-based Feature Selection (CFS) algorithm [6], adding (forward selection) or deleting (backward selection) one feature at a time.
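A sketch of Eqs. (6)-(7) in code, reusing entropy() and mutual_information() from the sketch above; here r_kc and r_kk are precomputed average feature-class and feature-feature correlations, and the formula follows Eq. (7) as given:

import numpy as np

def symmetrical_uncertainty(x, c):
    # SU(X,C) = 2 MI(X,C) / (H(X) + H(C)), Eq. (6); lies in [0, 1]
    hx, hc = entropy(x), entropy(c)
    return 2 * mutual_information(x, c) / (hx + hc) if hx + hc > 0 else 0.0

def cfs_merit(k, r_kc, r_kk):
    # subset relevance J(X_k, C), Eq. (7)
    return k * r_kc / np.sqrt(k + (k - 1) * r_kk)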
A definition of predominant correlation proposed by Yu and Liu [16] for the Fast Correlation-Based Filter (FCBF) includes correlations between features and classes and between pairs of features. The FCBF algorithm does a typical ranking using the SU coefficient (Eq. 6) to determine class-feature relevance, setting some threshold value SU ≥ δ to decide how many features should be taken. In the second part redundant features are removed by defining the "predominant features".
A different type of selection method called ConnSF, based on an inconsistency measure, has been proposed by Dash et al. [3] and will be used for comparison in Sec. 3. Two identical input vectors are inconsistent if they have different class labels (a similar concept is used in rough set theory). Intuitively it is clear that inconsistency grows when the number of features is reduced, and that feature subsets that lead to high inconsistency are not useful. If there are n samples in the dataset with identical feature values x_i, and n_k among them belong to class k, then the inconsistency count is defined as n − max_k n_k. The total inconsistency count for a feature subset is the sum of all inconsistency counts for all data vectors.
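This count is simple to compute by grouping identical projected vectors; a minimal sketch with illustrative names, where rows holds only the columns of the evaluated feature subset:

from collections import Counter, defaultdict

def inconsistency_count(rows, labels):
    # for each group of identical vectors with per-class counts n_k,
    # add n - max_k n_k, i.e. the samples outside the majority class
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row)][label] += 1
    return sum(sum(c.values()) - max(c.values()) for c in groups.values())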
A different way to find feature subsets is used in the Relief algorithm ([8] and [13]). This algorithm (see Fig. 1) estimates weights of features according to how well their values distinguish between data vectors that are near to each other. For a randomly selected vector X from a dataset S with k features Relief searches the dataset for its two nearest neighbors: the nearest hit H from the same class and the nearest miss M from another class. For feature x and two input vectors X, X′ the contribution to the weight W_x is proportional to D(x, X, X′) = 1 − δ(X(x), X′(x)) for binary or nominal features, and to D(x, X, X′) = |X(x) − X′(x)| for continuous features. The process is repeated m times, where m is a user-defined parameter [8]. Normalization with m in the calculation of W_x guarantees that all weights stay in the [−1, 1] interval. In our empirical studies (Sec. 3) we have used an extension of this algorithm for multiclass problems, called ReliefF [13].

1. set all weights W_xi = 0
2. for j = 1 to m do begin
3.    randomly select instance X;
4.    find nearest hit H and nearest miss M;
5.    for i := 1 to k do begin
6.       W_xi ← W_xi − D(x_i, X, H)/m + D(x_i, X, M)/m
7.    end;
8. end;

Fig. 1. Sketch of the Relief algorithm.
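A direct transcription of Fig. 1 for two-class data with continuous features, assuming X is scaled to [0, 1] so that the weights stay in [−1, 1] and each class occurs at least twice; this is only a sketch of basic Relief, not of the ReliefF extension used in Sec. 3:

import numpy as np

def relief(X, y, m, seed=None):
    rng = np.random.default_rng(seed)
    n, k = X.shape
    w = np.zeros(k)                                    # all weights W_xi = 0
    for _ in range(m):
        i = rng.integers(n)                            # randomly selected instance X
        d = np.abs(X - X[i]).sum(axis=1)               # Manhattan distances to X
        d[i] = np.inf                                  # exclude X itself
        hit = np.where(y == y[i], d, np.inf).argmin()  # nearest hit H
        miss = np.where(y != y[i], d, np.inf).argmin() # nearest miss M
        # W_x <- W_x - D(x, X, H)/m + D(x, X, M)/m for every feature x
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / m
    return w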
2.2 Kolmogorov-Smirnov Correlation-Based Filter Approach
The equivalence of two random variables may be evaluated using the Kolmogorov-Smirnov (K-S) test [7]. The K-S test measures the maximum difference between the cumulative distributions of two random variables. If a feature is redundant then the hypothesis that its distribution is equal to that of an already selected feature should have high probability. n independent observations of two random variables X, X′ are given in the training data; for the K-S test to be valid n should be more than 40. The test for X, X′ feature redundancy proceeds as follows:

• Discretization of feature values x into k bins [x_i, x_{i+1}], i = 1...k, is performed.
• Frequencies f_i, f′_i of occurrences of feature values in each bin are recorded.
• Based on the frequency counts, cumulative distribution functions F_i and F′_i are constructed.
• λ (the K-S statistic) is the largest absolute difference between F_i and F′_i, i.e.,

    λ = sqrt(n/2) max_i |F_i − F′_i|,  for i = 1, 2, ..., k.    (8)
The probability that the maximum K-S distance λ_α is larger than the observed one may be calculated using the K-S statistics for each parameter α [9] that has the meaning of a statistical significance level. When λ < λ_α the two distributions are equivalent at the α significance level, and thus one of the features is redundant. Using a typical significance value of 0.95 solves the problem of the threshold values for redundancy.
The Kolmogorov-Smirnov Correlation-Based Filter (K-SCBF) algorithm is presented below. First, the relevance is determined using the symmetrical uncertainty (other relevance criteria may also be used), and then the K-S test is applied to remove redundancy.
Algorithm K-SCBF:

Relevance analysis
1. Calculate the SU(X,C) relevance indices and create an ordered list S of features according to the decreasing value of their relevance.

Redundancy analysis
2. Take as the feature X the first feature from the S list.
3. Find and remove all features for which X is approximately equivalent according to the K-S test.
4. Set the next remaining feature in the list as X and repeat step 3 for all features that follow it in the S list.

Fig. 2. A two-step Kolmogorov-Smirnov Correlation-Based Filter (K-SCBF) algorithm.
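A compact sketch of the whole two-step procedure, reusing symmetrical_uncertainty() from Sec. 2.1 and, as a simplification, SciPy's two-sample K-S test on the raw values instead of the binned statistic of Eq. (8); following the paper, a feature is treated as redundant when the test cannot reject equality of the two distributions (p > alpha):

import numpy as np
from scipy.stats import ks_2samp

def ks_cbf(X, y, alpha=0.05, su_threshold=0.0):
    # step 1: rank features by SU(X,C), dropping irrelevant ones
    su = np.array([symmetrical_uncertainty(X[:, j], y) for j in range(X.shape[1])])
    order = [j for j in np.argsort(-su) if su[j] > su_threshold]
    # steps 2-4: walk down the list, keeping a feature only if its
    # distribution differs significantly from every feature kept so far
    kept = []
    for j in order:
        if all(ks_2samp(X[:, j], X[:, i]).pvalue <= alpha for i in kept):
            kept.append(j)
    return kept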
3 Empirical Studies
To evaluate the performance of the K-SCBF algorithm both artificial and real datasets have been used with a number of classification methods. Two artificial datasets, Gauss4 and Gauss8, have been used in our previous study [4]. Gauss4 is based on sampling from 4 Gaussian functions with unit dispersion in 4 dimensions, each cluster representing a separate class. The first function is centered at (0, 0, 0, 0), the next at (1, 1/2, 1/3, 1/4), (2, 1, 2/3, 1/2), and (3, 3/2, 1, 3/4), respectively. The dataset contains 4000 vectors, 1000 per class. In this case the ideal ranking should give the following order: X1 > X2 > X3 > X4.
Gauss8 used here is an extension of Gauss4, adding 4 additional features that are approximately linearly dependent, X_{i+4} = 2X_i + ε, where ε is a uniform noise with unit variance. In this case the ideal ranking should give the following order: X1 > X5 > X2 > X6 > X3 > X7 > X4 > X8, and the selection should reject all 4 linearly dependent features as redundant. The K-SCBF algorithm and the ConnSF algorithm [3] had no problem with this task, but FCBF [16] selected only 3 features, CorrSF [6] selected only the first two, and ReliefF [13] left only features 1 and 5, giving them weight 0.1 (for features 2 and 6 the weight was 0.060, dropping to 0.024 for features 3 and 7, and to 0.017 for features 4 and 8).
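For reproducibility, data of this kind may be generated as follows; this is a sketch under our reading of the description above (the generator name is illustrative, and the uniform noise is drawn from [−sqrt(3), sqrt(3)] to obtain unit variance):

import numpy as np

def make_gauss8(n_per_class=1000, seed=0):
    rng = np.random.default_rng(seed)
    X, y = [], []
    for i in range(4):
        center = (i, i / 2, i / 3, i / 4)   # cluster i of Gauss4
        X.append(rng.normal(center, 1.0, size=(n_per_class, 4)))
        y.append(np.full(n_per_class, i))
    X, y = np.vstack(X), np.concatenate(y)
    eps = rng.uniform(-np.sqrt(3), np.sqrt(3), size=X.shape)
    return np.hstack((X, 2 * X + eps)), y   # X_{i+4} = 2 X_i + eps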
In Table 1 results of the Naive Bayes Classifier (NBC, Weka implementation [15]), the nearest neighbor algorithm (1NN) with the Euclidean distance function, the C4.5 tree [12] and the Support Vector Machine with a linear kernel are given (Weka, and for the SVM the Ghostminer 3.0 implementation, http://www.fqspl.com.pl/ghostminer/).
For the initial comparison on real data several datasets from the UCI Machine Learning Repository [10] and the UCI KDD Archive [1] were used. A summary of all datasets is presented in Table 2.
             Full set   Selected features
                        FCBF     CorrSF   ReliefF   ConnSF   K-SCBF
Features     1 to 8     1+2+3    1+2+5    1+5       1 to 4   1 to 4
NBC          82.13      81.57    80.25    76.95     82.13    82.13
1NN          73.42      73.90    71.10    68.12     73.42    73.42
C4.5         78.30      79.12    78.95    76.15     78.70    78.70
SVM          81.88      81.70    80.90    76.95     81.73    81.73
Average      79.91      79.09    78.83    75.34     80.40    80.40

Table 1. Accuracy of 4 classifiers on selected subsets of features for the Gauss8 dataset.
Title          Features   Instances   Classes
Hypothyroid    21         3772        3
Lung-cancer    58         32          3
Promoters      59         106         2
Splice         62         3190        3

Table 2. Summary of the datasets used in empirical studies.
For each dataset all five feature selection algorithms are compared (FCBF [16], CorrSF [6], ReliefF [13], ConnSF [3], and K-SCBF) and the number of features selected by each algorithm is given in Table 3. For datasets containing features with continuous values the MDLP discretization algorithm [5] has been applied (code available from www.public.asu.edu/~huanliu/). 5 neighbors and 30 instances were used for ReliefF, as suggested by Robnik-Sikonja and Kononenko [13]. For CorrSF and ConnSF a forward search strategy has been used, and for FCBF, ReliefF, and K-SCBF a forward search strategy based on ranking.
Dataset        Selected features
               Full set   FCBF   CorrSF   ReliefF   ConnSF   K-SCBF
Hypothyroid    21         5      1        11        6        6
Lung-cancer    58         6      11       8         4        3
Splice         62         22     6        19        10       14
Promoters      59         6      4        4         4        5
Average        50         9.8    5.5      10.5      6        7

Table 3. The number of selected features for each algorithm; boldface – lowest number, italics – highest number.
The overall balanced accuracy (accuracy computed for each class separately, averaged over all classes) obtained from 10-fold cross-validation calculations is reported. For datasets with significant differences between the number of samples from different classes balanced accuracy is a more sensitive measure than the overall accuracy. Results of these calculations are collected in Table 4.
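Balanced accuracy as used here is simply the per-class recall averaged over classes; a one-function sketch:

import numpy as np

def balanced_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    # fraction of correctly classified samples within each class
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))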
C4.5 tree
Dataset        Full set   FCBF    CorrSF   ReliefF   ConnSF   K-SCBF
Hypothyroid    99.91      66.94   85.92    98.94     93.45    98.68
Lung-cancer    49.43      63.87   67.21    63.87     63.22    64.76
Splice         94.35      94.05   93.39    94.80     93.63    94.05
Promoters      81.13      85.84   83.96    65.09     84.90    81.13
Average        81.21      77.68   82.62    80.68     83.80    84.66

Naive Bayes
Dataset        Full set   FCBF    CorrSF   ReliefF   ConnSF   K-SCBF
Hypothyroid    83.20      66.94   58.48    70.28     67.14    85.83
Lung-cancer    47.92      50.57   73.48    50.57     65.68    41.74
Splice         94.88      96.03   93.31    95.94     94.17    94.75
Promoters      90.56      95.28   93.39    61.32     93.39    87.35
Average        79.14      73.21   79.67    69.43     80.10    77.42

1 Nearest Neighbor
Dataset        Full set   FCBF    CorrSF   ReliefF   ConnSF   K-SCBF
Hypothyroid    61.60      83.36   71.72    78.92     86.94    83.93
Lung-cancer    39.94      36.01   66.84    53.13     65.41    45.70
Splice         80.08      84.45   87.68    84.04     86.77    83.82
Promoters      85.85      92.45   86.79    59.43     81.13    85.85
Average        66.49      74.28   78.26    69.01     80.06    74.83

SVM
Dataset        Full set   FCBF    CorrSF   ReliefF   ConnSF   K-SCBF
Hypothyroid    52.66      45.49   44.07    51.24     45.13    84.31
Lung-cancer    41.37      55.41   66.07    61.60     59.37    47.35
Splice         92.81      95.73   93.75    95.75     90.05    95.11
Promoters      93.40      91.50   77.36    58.49     87.33    93.11
Average        70.06      72.03   70.31    66.77     70.48    80.04

Table 4. Balanced accuracy for the 4 classification methods on features selected by each algorithm; boldface – best results, italics – worst.
4 Conclusion
A new algorithm, K-SCBF, for finding non-redundant feature subsets based on the Kolmogorov-Smirnov test has been introduced. It has only one parameter, the statistical significance, i.e. the probability that the hypothesis that the distributions of two features are equivalent is true. Our initial tests are encouraging: on the artificial data perfect ranking has been recreated and redundant features rejected, while on the real data, with a rather modest number of features selected, the results are frequently the best, or close to the best, compared with four state-of-the-art feature selection algorithms. The new algorithm seems to work especially well with the linear SVM classifier. Computational demands of the K-SCBF algorithm are similar to those of other correlation-based filters and much lower than those of ReliefF.

It is obvious that the statistical significance at the 0.05 level selected for our tests is sometimes not optimal, and for the lung cancer data too few features have been selected, leading to a large decrease of accuracy. This parameter may be optimized in crossvalidation tests on the training set, but the method guarantees that each time only a non-redundant subset of features will be selected. Various variants of the Kolmogorov-Smirnov test exist [11] and the algorithm may be used with other indices for relevance indication. These possibilities remain to be explored. Further tests on much larger bioinformatics data will be reported soon.
Acknowledgement. This work was financed by the Polish Committee for Scientific Research (KBN) grant (2005-2007).
References

1. S.D. Bay. The UCI KDD Archive. Univ. of California, Irvine, 1999. http://kdd.ics.uci.edu.
2. T.M. Cover. The best two independent measurements are not the two best. IEEE Transactions on Systems, Man, and Cybernetics, 4:116–117, 1974.
3. M. Dash and H. Liu. Consistency-based search in feature selection. Artificial Intelligence, 151:155–176, 2003.
4. W. Duch, T. Winiarski, J. Biesiada, and A. Kachel. Feature ranking, selection and discretization. In Proceedings of Int. Conf. on Artificial Neural Networks (ICANN), pages 251–254, Istanbul, 2003. Bogazici University Press.
5. U.M. Fayyad and K.B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In R. Bajcsy, editor, Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, Chambery, France, pages 1022–1027, San Francisco, CA, 1993. Morgan Kaufmann.
6. M.A. Hall. Correlation-based Feature Subset Selection for Machine Learning. PhD thesis, Department of Computer Science, University of Waikato, Waikato, N.Z., 1999.
7. I. Chakravarti, R. Laha, and J. Roy. Handbook of Methods of Applied Statistics. John Wiley and Sons, Chichester, 1967.
8. K. Kira and L.A. Rendell. A practical approach to feature selection. In Proceedings of the Ninth International Conference on Machine Learning (ICML-92), pages 249–256, San Francisco, CA, 1992. Morgan Kaufmann.
9. M. Evans, N. Hastings, and B. Peacock. Statistical Distributions, 3rd ed. John Wiley and Sons, Chichester, 2000.
10. C.J. Mertz and P.M. Murphy. The UCI repository of machine learning databases. Univ. of California, Irvine, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
11. W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes in C. The Art of Scientific Computing. Cambridge University Press, Cambridge, UK, 1988.
12. J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
13. M. Robnik-Sikonja and I. Kononenko. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53:23–69, 2003.
14. G.T. Toussaint. Note on optimal selection of independent binary-valued features for pattern recognition. IEEE Transactions on Information Theory, 17:618, 1971.
15. I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA, 2000.
16. L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proc. 20th Int. Conf. on Machine Learning (ICML-03), Washington, D.C., pages 856–863, San Francisco, CA, 2003. Morgan Kaufmann.