
Feature Selection for High-Dimensional Data: A Kolmogorov-Smirnov Correlation-Based Filter


Jacek Biesiada¹ and Włodzisław Duch²,³

¹ Division of Computer Methods, Dept. of Electrotechnology, The Silesian University of Technology, Katowice, Poland

² Dept. of Informatics, Nicolaus Copernicus University, Toruń, Poland

³ School of Computer Engineering, Nanyang Technological University, Singapore; Google: Duch

Summary. An algorithm for filtering information based on the Kolmogorov-Smirnov correlation-based approach has been implemented and tested on feature selection. The only parameter of this algorithm is the statistical confidence level that two distributions are identical. Empirical comparisons with 4 other state-of-the-art feature selection algorithms (FCBF, CorrSF, ReliefF and ConnSF) are very encouraging.

1 Introduction

For large highly dimensional datasets feature ranking and feature selection algorithms are usually of the filter type. In the simplest case a feature filter is a function returning a relevance index J(S|D,C) that estimates, given the data D, how relevant a given feature subset S is for the task C (usually classification or approximation of the data). The relevance index J(S|D,C) is calculated directly from data, without any reference to the results of programs that are used on data with reduced dimensionality. Since the data D and the task C are usually fixed and only the subset S varies, an abbreviated form J(S) is used. Instead of a simple function (such as correlation or information content) an algorithmic procedure, such as building a decision tree or finding nearest neighbors, may be used to estimate this index.

Relevance indices computed for individual features X_i, i = 1...N provide indices that establish a ranking order J(X_{i1}) ≤ J(X_{i2}) ≤ ··· ≤ J(X_{iN}). Those features which have the lowest ranks are filtered out. For independent features this may be sufficient, but if features are correlated many of them may be redundant. For some data distributions the best pair of features may not even include a single best feature [14, 2]! Thus ranking does not guarantee that the largest subset of important features will be found. Methods that search for the best subset of features may also use filters to evaluate the usefulness of subsets of features.

Although in the case of filter methods there is no direct dependence of the relevance index on the adaptive algorithms, obviously the thresholds for feature rejection may be set either for relevance indices, or by evaluation of the feature contributions by the final system. Features are ranked by the filter, but how many are finally taken may be determined using an adaptive system as a wrapper. Evaluation of the adaptive system performance (frequently crossvalidation tests) is done only for a few pre-selected feature sets, but still this “filtrapper” approach may be rather costly. What is needed is a simple filter method that may be applied to large datasets, ranking and removing redundant features, parameterized in a statistically well-established way. Such an approach is described in this paper.

In the next section a new relevance index based on the Kolmogorov-Smirnov (K-S) test to estimate correlation between the distribution of feature values and the class labels is introduced. Correlation-based filters are very fast and may be competitive to filters based on information theory. Therefore in section 3 empirical comparisons between the K-S filter, Pearson's correlation-based filter and popular filters based on information gain are made on a number of datasets.

2 Theoretical framework

2.1 Correlation-Based Measures

For feature X with values x and classes C with values c, where X, C are treated as random variables, Pearson's linear correlation coefficient is defined as [11]:

\rho(X,C) = \frac{\sum_i (x_i - \bar{x})(c_i - \bar{c})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_j (c_j - \bar{c})^2}} = \frac{E(XC) - E(X)E(C)}{\sqrt{\sigma^2(X)\,\sigma^2(C)}}    (1)

ρ(X,C) is equal to ±1 if X and C are linearly dependent and zero if they are completely uncorrelated. The simplest test estimating the significance of differences in ρ(X,C) values is based on the probability that two variables are correlated [11]:

P(X \sim C) = \mathrm{erf}\left( |\rho(X,C)| \sqrt{N/2} \right),    (2)

where erf is the error function. The feature list ordered by decreasing values of P(X ∼ C) may serve as feature ranking. Non-parametric, or Spearman's rank correlation coefficients may be useful for ordinal data types.

An alternative approach is to use χ² statistics, but in both cases for a large number of samples the probability P(X ∼ C) is so close to 1 that ranking becomes impossible due to the finite numerical accuracy of computations. For example, with N = 1000 samples small coefficients ρ(X,C) ≈ 0.02 lead to probabilities of correlation around 0.5. The ρ(X,C) or χ² thresholds for the significance of a given feature may therefore be taken from a large interval corresponding to almost the same probabilities of correlation.
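To make Eqs. (1)-(2) concrete, a minimal Python sketch is given below (an illustration added for this text, not code from the original study); numpy, the function name and the synthetic data are assumptions. It also shows the saturation effect just described:

import numpy as np
from math import erf, sqrt

def pearson_relevance(x, c):
    """Relevance of one feature: rho(X,C) of Eq. (1) and P(X ~ C) of Eq. (2)."""
    x = np.asarray(x, dtype=float)
    c = np.asarray(c, dtype=float)          # class labels coded as numbers
    rho = np.corrcoef(x, c)[0, 1]           # Eq. (1)
    p = erf(abs(rho) * sqrt(len(x) / 2.0))  # Eq. (2)
    return rho, p

# With N = 1000 even a weak correlation pushes P(X ~ C) toward 1,
# which is exactly the numerical-accuracy problem discussed above.
rng = np.random.default_rng(0)
c = rng.integers(0, 2, 1000).astype(float)
x = c + 5.0 * rng.normal(size=1000)         # weakly class-correlated feature
print(pearson_relevance(x, c))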

Information theory is frequently used to define relevance indices. The Shannon information, or feature and class entropy, is:

H(X) = -\sum_i P(x_i) \log P(x_i); \qquad H(C) = -\sum_i P(c_i) \log P(c_i)    (3)

and the joint Shannon entropy is:

H(X,C) = -\sum_{i,j} P(x_i, c_j) \log P(x_i, c_j)    (4)

Mutual Information (MI) is the basic quantity used for information filtering:

MI(X,C) = H(X) + H(C) - H(X,C)    (5)

Symmetrical Uncertainty Coefficient (SU) has similar properties to mutual information:

SU(X,C) = 2 \left[ \frac{MI(X,C)}{H(X) + H(C)} \right]    (6)
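A small Python sketch of Eqs. (3)-(6) for discrete (already discretized) variables may be useful; the helper names are assumptions added here for illustration:

import numpy as np

def entropy(v):
    """Shannon entropy of a discrete variable, Eq. (3)."""
    _, counts = np.unique(v, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def joint_entropy(x, c):
    """Joint entropy H(X,C), Eq. (4); both inputs discrete."""
    _, counts = np.unique(np.column_stack((x, c)), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def symmetrical_uncertainty(x, c):
    """SU(X,C), Eq. (6): mutual information normalized to [0, 1]."""
    hx, hc = entropy(x), entropy(c)
    mi = hx + hc - joint_entropy(x, c)      # Eq. (5)
    return 2.0 * mi / (hx + hc) if hx + hc > 0 else 0.0

# SU is 1 for a feature identical to the class, near 0 for an irrelevant one:
rng = np.random.default_rng(0)
c = rng.integers(0, 3, 500)
print(symmetrical_uncertainty(c, c), symmetrical_uncertainty(rng.integers(0, 3, 500), c))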

If a group of k features X_k has already been selected, the correlation coefficient may be used to estimate correlation between this group and the class, including inter-correlations between the features. Denoting the average correlation coefficient between these features and classes as r_{kc} = \bar{\rho}(X_k, C) and the average between different features as r_{kk} = \bar{\rho}(X_k, X_k), the relevance of the feature subset is defined as:

J(X_k, C) = \frac{k \, r_{kc}}{\sqrt{k + (k-1) r_{kk}}}    (7)

This formula has been used in the Correlation-based Feature Selection (CFS) algorithm [6] adding (forward selection) or deleting (backward selection) one feature at a time.
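Given average correlations (for instance SU values from Eq. (6) or |ρ| values), Eq. (7) reduces to a few lines. The sketch below is illustrative rather than a reconstruction of the CFS implementation of [6]; in forward selection one would add, at each step, the feature that maximizes this merit:

import numpy as np

def cfs_merit(corr_fc, corr_ff):
    """Subset merit J(X_k, C) of Eq. (7).

    corr_fc : length-k array of feature-class correlations for the subset
    corr_ff : k x k symmetric matrix of feature-feature correlations
    """
    k = len(corr_fc)
    r_kc = float(np.mean(corr_fc))                              # average r_kc
    r_kk = float(corr_ff[~np.eye(k, dtype=bool)].mean()) if k > 1 else 0.0
    return k * r_kc / np.sqrt(k + (k - 1) * r_kk)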

A definition of predominant correlation proposed by Yu and Liu [16] for the Fast Correlation-Based Filter (FCBF) includes correlations between features and classes and between pairs of features. The FCBF algorithm does a typical ranking using the SU coefficient (Eq. 6) to determine class-feature relevance, setting some threshold value SU ≥ δ to decide how many features should be taken. In the second part redundant features are removed by defining the “predominant features”.

A different type of selection method called ConnSF, based on an inconsistency measure, has been proposed by Dash et al. [3] and will be used for comparison in Sec. 3. Two identical input vectors are inconsistent if they have different class labels (a similar concept is used in rough set theory). Intuitively it is clear that inconsistency grows when the number of features is reduced and that feature subsets that lead to high inconsistency are not useful. If there are n samples in the dataset with identical feature values x_i, and n_k among them belong to class k, then the inconsistency count is defined as n − max_k n_k. The total inconsistency count for a feature subset is the sum of all inconsistency counts for all data vectors.
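The inconsistency count translates directly into code; the sketch below (added here for illustration, with hash-grouping as an implementation choice) computes the total count for a candidate feature subset:

from collections import Counter

def inconsistency_count(X, y):
    """Total inconsistency count: for each group of identical rows,
    n - max_k n_k, where n is the group size and n_k its class counts."""
    groups = {}
    for row, label in zip(map(tuple, X), y):
        groups.setdefault(row, Counter())[label] += 1
    return sum(sum(cnt.values()) - max(cnt.values()) for cnt in groups.values())

# Two identical rows with different labels give one inconsistency:
print(inconsistency_count([[0, 1], [0, 1], [1, 1]], ["a", "b", "a"]))  # -> 1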

A different way to find feature subsets is used in the Relief algorithm ([8] and [13]). This algorithm (see Fig. 1) estimates weights of features according to how well their values distinguish between data vectors that are near to each other.

1. set all weights W_{x_i} = 0
2. for j = 1 to m do begin
3.   randomly select instance X;
4.   find nearest hit H and nearest miss M;
5.   for i := 1 to k do begin
6.     W_{x_i} ← W_{x_i} − D(x_i, X, H)/m + D(x_i, X, M)/m
7.   end;
8. end;

Fig. 1. Sketch of the Relief algorithm.

For a randomly selected vector X from a dataset S with k features Relief searches the dataset for its two nearest neighbors: the nearest hit H from the same class and the nearest miss M from another class. For feature x and two input vectors X, X′ the contribution to the weight W_x is proportional to D(x, X, X′) = 1 − δ(X(x), X′(x)) for binary or nominal features, and D(x, X, X′) = |X(x) − X′(x)| for continuous features. The process is repeated m times, where m is a user-defined parameter [8]. Normalization with m in the calculation of W_x guarantees that all weights are in the [−1, 1] interval. In our empirical studies (Sec. 3) we have used an extension of this algorithm for multiclass problems, called ReliefF [13].
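A compact Python version of Fig. 1 for numeric features, following the description above, is sketched here; Manhattan distance and the assumption that every class has at least two samples are simplifications of this sketch (ReliefF [13] generalizes the scheme to multiple classes):

import numpy as np

def relief(X, y, m=100, seed=0):
    """Relief weights as in Fig. 1; features scaled to [0, 1] so that
    all weights stay in [-1, 1]."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    span = X.max(axis=0) - X.min(axis=0)
    Xs = (X - X.min(axis=0)) / np.where(span == 0.0, 1.0, span)
    w = np.zeros(X.shape[1])
    rng = np.random.default_rng(seed)
    for _ in range(m):
        i = rng.integers(len(Xs))                       # randomly select instance X
        d = np.abs(Xs - Xs[i]).sum(axis=1)              # Manhattan distance (a choice)
        d[i] = np.inf                                   # never match the instance itself
        hit = np.where(y == y[i], d, np.inf).argmin()   # nearest hit H
        miss = np.where(y != y[i], d, np.inf).argmin()  # nearest miss M
        w += (np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])) / m
    return w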

2.2 Kolmogorov-Smirnov Correlation-Based Filter Approach

Equivalence of two random variables may be evaluated using the Kolmogorov-Smirnov (K-S) test [7]. The K-S test measures the maximum difference between the cumulative distributions of two random variables. If a feature is redundant then the hypothesis that its distribution is equal to that of an already selected feature should have high probability. n independent observations of the two random variables X, X′ are given in the training data; for the K-S test to be valid n should be more than 40. The test for X, X′ feature redundancy proceeds as follows:

• Discretization of feature values x into k bins [x_i, x_{i+1}], i = 1...k is performed.
• Frequencies f_i, f′_i of occurrences of feature values in each bin are recorded.
• Based on the frequency counts cumulative distribution functions F_i and F′_i are constructed.
• λ (the K-S statistic) is the largest absolute difference between F_i and F′_i, i.e.,

\lambda = \sqrt{n/2} \, \max_i |F_i - F'_i| \quad \text{for } i = 1, 2, \ldots, k.    (8)

The probability that the maximum K-S distance λ_α is larger than the observed one may be calculated using the K-S statistics for each parameter α [9] that has the meaning of a statistical significance level. When λ < λ_α then the two distributions are equivalent at the α significance level, and thus one of the features is redundant. Using a typical significance value of 0.95 solves the problem of the threshold values for redundancy.
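The redundancy test of this section can be sketched in a few lines (an added illustration: equal-width binning on a common grid and the asymptotic two-sample critical value λ_α ≈ 1.36 at the 0.05 significance level for equal sample sizes are assumptions of this sketch):

import numpy as np

def ks_redundant(x1, x2, k=10, lambda_alpha=1.36):
    """Decide whether two features are redundant with the K-S test of Eq. (8)."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n = len(x1)                                          # the test needs n > 40
    bins = np.linspace(min(x1.min(), x2.min()),
                       max(x1.max(), x2.max()), k + 1)   # k bins [x_i, x_{i+1}]
    F1 = np.cumsum(np.histogram(x1, bins)[0]) / n        # cumulative distribution F
    F2 = np.cumsum(np.histogram(x2, bins)[0]) / len(x2)  # cumulative distribution F'
    lam = np.sqrt(n / 2.0) * np.abs(F1 - F2).max()       # Eq. (8)
    return lam < lambda_alpha                            # True: distributions equivalent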

The Kolmogorov-Smirnov Correlation-Based Filter (K-SCBF) algorithm is presented below. First, the relevance is determined using the symmetrical uncertainty (other relevance criteria may also be used), and then the K-S test is applied to remove redundancy.

Algorithm K-SCBF:

Relevance analysis
1. Calculate the SU(X,C) relevance indices and create an ordered list S of features according to the decreasing value of their relevance.

Redundancy analysis
2. Take as the feature X the first feature from the S list.
3. Find and remove all features for which X is approximately equivalent according to the K-S test.
4. Set the next remaining feature in the list as X and repeat step 3 for all features that follow it in the S list.

Fig. 2. A two-step Kolmogorov-Smirnov Correlation-Based Filter (K-SCBF) algorithm.
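Combining the pieces above, the two-step procedure of Fig. 2 can be sketched as follows; symmetrical_uncertainty and ks_redundant are the helpers defined earlier, and the crude equal-width binning here merely stands in for the MDLP discretization [5] used in the experiments:

import numpy as np

def equal_width_bins(x, k=10):
    """Simple discretization helper (the paper uses MDLP [5] instead)."""
    edges = np.linspace(np.min(x), np.max(x), k + 1)[1:-1]
    return np.digitize(x, edges)

def ks_cbf(X, y, su_threshold=0.0):
    """Two-step K-SCBF of Fig. 2: SU relevance ranking, K-S redundancy removal."""
    X = np.asarray(X, dtype=float)
    su = [symmetrical_uncertainty(equal_width_bins(X[:, j]), y)
          for j in range(X.shape[1])]
    order = [j for j in np.argsort(su)[::-1] if su[j] > su_threshold]  # step 1
    selected = []
    while order:
        j = order.pop(0)                                 # steps 2 and 4
        selected.append(j)
        order = [i for i in order                        # step 3: drop features
                 if not ks_redundant(X[:, j], X[:, i])]  # equivalent to feature j
    return selected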

3 Empirical Studies

To evaluate the performance of the K-SCBF algorithm both artificial and real datasets have been used with a number of classification methods. Two artificial datasets, Gauss4 and Gauss8, have been used in our previous study [4]. Gauss4 is based on sampling from 4 Gaussian functions with unit dispersion in 4 dimensions, each cluster representing a separate class. The first function is centered at (0,0,0,0), the next at (1, 1/2, 1/3, 1/4), (2, 1, 2/3, 1/2), and (3, 3/2, 1, 3/4), respectively. The dataset contains 4000 vectors, 1000 per class. In this case the ideal ranking should give the following order: X1 > X2 > X3 > X4.

Gauss8 used here is an extension of Gauss4, adding 4 additional features that are approximately linearly dependent, X_{i+4} = 2X_i + ε, where ε is uniform noise with unit variance. In this case the ideal ranking should give the following order: X1 > X5 > X2 > X6 > X3 > X7 > X4 > X8, and the selection should reject all 4 linearly dependent features as redundant. The K-SCBF algorithm and the ConnSF [3] algorithm had no problem with this task, but FCBF [16] selected only 3 features, CorrSF [6] selected only the first two, and ReliefF [13] left only features 1 and 5, giving them weight 0.1 (for features 2 and 6 the weight was 0.060, dropping to 0.024 for features 3, 7 and to 0.017 for features 4, 8).
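For reproducibility, Gauss8 data as just described can be generated along these lines (an added sketch; the uniform-noise range ±√3 is chosen because it gives unit variance, as stated above):

import numpy as np

def make_gauss8(n_per_class=1000, seed=0):
    """Gauss4 plus 4 linearly dependent copies X_{i+4} = 2*X_i + eps."""
    rng = np.random.default_rng(seed)
    base = np.array([1.0, 1 / 2, 1 / 3, 1 / 4])
    X = np.vstack([rng.normal(i * base, 1.0, size=(n_per_class, 4))
                   for i in range(4)])                       # centers i*(1,1/2,1/3,1/4)
    y = np.repeat(np.arange(4), n_per_class)
    eps = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), X.shape)  # uniform, unit variance
    return np.hstack([X, 2.0 * X + eps]), y

X_full, y = make_gauss8()   # 4000 x 8 data matrix, classes 0..3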

In Table 1 results of the Naive Bayes Classifier (NBC) (Weka implementation [15]), the nearest neighbor algorithm (1NN) with Euclidean distance function, C4.5 tree [12] and the Support Vector Machine with a linear kernel are given (Weka and SVM: GhostMiner 3.0 implementation, http://www.fqspl.com.pl/ghostminer/).

For the initial comparison on real data several datasets from the UCI Machine Learning Repository [10] and the UCI KDD Archive [1] were used. A summary of all datasets is presented in Table 2.


                        Selected features
          Full set  FCBF   CorrSF  ReliefF  ConnSF  K-SCBF
Features  1 to 8    1+2+3  1+2+5   1+5      1 to 4  1 to 4
NBC       82.13     81.57  80.25   76.95    82.13   82.13
1NN       73.42     73.90  71.10   68.12    73.42   73.42
C4.5      78.30     79.12  78.95   76.15    78.70   78.70
SVM       81.88     81.70  80.90   76.95    81.73   81.73
Average   79.91     79.09  78.83   75.34    80.40   80.40

Table 1. Accuracy of 4 classifiers on selected subsets of features for the Gauss8 dataset.

Title        Features  Instances  Classes
Hypothyroid  21        3772       3
Lung-cancer  58        32         3
Promoters    59        106        2
Splice       62        3190       3

Table 2. Summary of the datasets used in empirical studies.

For each dataset all five feature selection algorithms are compared (FCBF [16], CorrSF [6], ReliefF [13], ConnSF [3], and K-SCBF) and the number of features selected by each algorithm is given in Table 3. For datasets containing features with continuous values the MDLP discretization algorithm [5] (available from www.public.asu.edu/~huanliu/) has been applied. 5 neighbors and 30 instances were used for ReliefF, as suggested by Robnik-Sikonja and Kononenko [13]. For CorrSF and ConnSF forward search strategy has been used, and for FCBF, ReliefF, and the K-SCBF forward search strategy based on ranking.
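The evaluation protocol used below can be expressed schematically as follows (an added sketch: `select` stands for any of the five selection algorithms, e.g. ks_cbf from Sec. 2.2, and `fit_predict` for any of the four classifiers; both are placeholders, and plain accuracy is used here for brevity):

import numpy as np

def cv_score(X, y, select, fit_predict, folds=10, seed=0):
    """k-fold cross-validation, redoing feature selection on each training split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    scores = []
    for f in range(folds):
        test = idx[f::folds]                 # every folds-th sample held out
        train = np.setdiff1d(idx, test)
        cols = select(X[train], y[train])    # indices of the kept features
        pred = fit_predict(X[train][:, cols], y[train], X[test][:, cols])
        scores.append(np.mean(np.asarray(pred) == y[test]))
    return float(np.mean(scores))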

                        Selected features
Dataset       Full set  FCBF  CorrSF  ReliefF  ConnSF  K-SCBF
Hypothyroid   21        5     1       11       6       6
Lung-cancer   58        6     11      8        4       3
Splice        62        22    6       19       10      14
Promoters     59        6     4       4        4       5
Average       50        9.8   5.5     10.5     6       7

Table 3. The number of selected features for each algorithm; boldface – lowest number, italics – highest.

The overall balanced accuracy (accuracy for each class, averaged over all classes) obtained from 10-fold cross-validation calculations is reported. For datasets with significant differences between samples from different classes balanced accuracy is a more sensitive measure than the overall accuracy. Results of these calculations are collected in Table 4.
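Balanced accuracy as used here is simply the per-class accuracy averaged over classes (an added one-liner sketch):

import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Accuracy computed separately for each class, then averaged."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean([np.mean(y_pred[y_true == c] == c)
                          for c in np.unique(y_true)]))

# A majority-class predictor gets 90% plain accuracy here but only 0.5 balanced:
print(balanced_accuracy([0] * 9 + [1], [0] * 10))   # -> 0.5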


C4.5 tree
Dataset       Full set  FCBF   CorrSF  ReliefF  ConnSF  K-SCBF
Hypothyroid   99.91     66.94  85.92   98.94    93.45   98.68
Lung-cancer   49.43     63.87  67.21   63.87    63.22   64.78
Splice        94.35     94.05  93.39   94.80    93.63   94.05
Promoters     81.13     85.84  83.96   65.09    84.90   81.13
Average       81.21     77.68  82.62   80.68    83.80   84.66

Naive Bayes
Dataset       Full set  FCBF   CorrSF  ReliefF  ConnSF  K-SCBF
Hypothyroid   83.20     66.94  58.48   70.28    67.14   85.83
Lung-cancer   47.92     50.57  73.48   50.57    65.68   41.74
Splice        94.88     96.03  93.31   95.90    94.17   94.75
Promoters     90.56     95.28  93.39   61.32    93.39   87.35
Average       79.14     73.21  79.67   69.43    80.10   77.42

1 Nearest Neighbor
Dataset       Full set  FCBF   CorrSF  ReliefF  ConnSF  K-SCBF
Hypothyroid   61.60     83.36  71.72   78.92    86.94   83.93
Lung-cancer   39.94     36.01  66.84   53.13    65.41   45.70
Splice        80.08     84.45  87.68   84.04    86.77   83.82
Promoters     85.85     92.45  86.79   59.43    81.13   85.85
Average       66.49     74.28  78.26   69.01    80.06   74.83

SVM
Dataset       Full set  FCBF   CorrSF  ReliefF  ConnSF  K-SCBF
Hypothyroid   52.66     45.49  44.07   51.24    45.13   84.31
Lung-cancer   41.37     55.41  66.07   61.60    59.37   47.35
Splice        92.81     95.73  93.75   95.75    90.09   95.11
Promoters     93.40     91.50  77.36   58.49    87.33   93.11
Average       70.06     72.03  70.31   66.77    70.48   80.04

Table 4. Balanced accuracy for the 4 classification methods on features selected by each algorithm; boldface – best results, italics – worst.

4 Conclusion

A new algorithm, K-SCBF, for finding non-redundant feature subsets based on the Kolmogorov-Smirnov test has been introduced. It has only one parameter, the statistical significance, or the probability that the hypothesis that the distributions of two features are equivalent is true. Our initial tests are encouraging: on the artificial data perfect ranking has been recreated and redundant features rejected, while on the real data, with a rather modest number of features selected, results are frequently the best, or close to the best, compared with four state-of-the-art feature selection algorithms. The new algorithm seems to work especially well with the linear SVM classifier. Computational demands of the K-SCBF algorithm are similar to other correlation-based filters and much lower than ReliefF.

It is obvious that sometimes the statistical significance at the 0.05 level selected for our tests is not optimal, and for the lung cancer data too few features have been selected, leading to a large decrease of accuracy. This parameter may be optimized in cross-validation tests on the training set, but the method guarantees that each time only a non-redundant subset of features will be selected. Various variants of the Kolmogorov-Smirnov test exist [11] and the algorithm may be used with other indices for relevance indication. These possibilities remain to be explored. Further tests on much larger bioinformatics data will be reported soon.

Acknowledgement. This work was financed by the Polish Committee for Scientific Research (KBN) grant (2005-2007).

References

1. S. D. Bay. The UCI KDD archive. Univ. of California, Irvine, 1999. http://kdd.ics.uci.edu.
2. T. M. Cover. The best two independent measurements are not the two best. IEEE Transactions on Systems, Man, and Cybernetics, 4:116–117, 1974.
3. M. Dash and H. Liu. Consistency-based search in feature selection. Artificial Intelligence, 151:155–176, 2003.
4. W. Duch, T. Winiarski, J. Biesiada, and A. Kachel. Feature ranking, selection and discretization. In Proceedings of Int. Conf. on Artificial Neural Networks (ICANN), pages 251–2, Istanbul, 2003. Bogazici University Press.
5. U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In R. Bajcsy, editor, Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, Chambery, France, pages 1022–1027, San Francisco, CA, 1993. Morgan Kaufmann.
6. M. A. Hall. Correlation-based Feature Subset Selection for Machine Learning. PhD thesis, Department of Computer Science, University of Waikato, Waikato, N.Z., 1999.
7. I. Chakravarti, R. Laha, and J. Roy. Handbook of Methods of Applied Statistics. John Wiley and Sons, Chichester, 1967.
8. K. Kira and L. A. Rendell. A practical approach to feature selection. In Proceedings of the Ninth International Conference on Machine Learning (ICML-92), pages 249–256, San Francisco, CA, 1992. Morgan Kaufmann.
9. M. Evans, N. Hastings, and B. Peacock. Statistical Distributions, 3rd ed. John Wiley and Sons, Chichester, 2000.
10. C. J. Merz and P. M. Murphy. The UCI repository of machine learning databases. Univ. of California, Irvine, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
11. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C. The Art of Scientific Computing. Cambridge University Press, Cambridge, UK, 1988.
12. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
13. M. Robnik-Sikonja and I. Kononenko. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53:23–69, 2003.
14. G. T. Toussaint. Note on optimal selection of independent binary-valued features for pattern recognition. IEEE Transactions on Information Theory, 17:618, 1971.
15. I. Witten and E. Frank. Data Mining – Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA, 2000.
16. L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proc. 20th Int. Conf. on Machine Learning (ICML-03), Washington, D.C., pages 856–863, San Francisco, CA, 2003. Morgan Kaufmann.
