University of Otago
Technical Report OUCS-2004-03
A View-based Consistency Model based on Transparent Data Selection in Distributed Shared
Memory
Authors: Z. Huang
Department of Computer Science, University of Otago
C. Sun
School of Computing & Information Technology Griffith University, Brisbane, Australia
S. Cranefield, and M. Purvis
Department of Information Science, University of Otago
Status: Not yet submitted
Department of Computer Science,
University of Otago, PO Box 56, Dunedin, Otago, New Zealand
http://www.cs.otago.ac.nz/trseries/
AView-basedConsistencyModelbasedonTransparentDataSelectioninDistributedShared
Memory
Z.Huang,C.Sun,S.Cranefield,andM.Purvis
DepartmentsofComputer&InformationScienceUniversityofOtago,Dunedin,NewZealand
SchoolofComputing&InformationTechnology
GriffithUniversity,Brisbane,AustraliaEmail:hzy@cs.otago.ac.nz,scz@cit.gu.edu.au,
mpurvis,scranefield@infoscience.otago.ac.nz
Abstract
ThispaperproposesanovelView-basedConsistencymodelforDistributedSharedMemory,inwhichanewconcept,view,iscoined.Aviewisasetofdataobjectsthatapro-cessorhastherighttoaccessinthesharedmemory.TheView-basedConsistencymodelonlyrequiresthatthedataobjectsofaprocessor’sviewareupdatedbeforeaprocessoraccessesthem.Inthisway,itcanachievethemaximumrelaxationofconstraintsonmod-ificationpropagationandexecutionindata-race-freeprograms.Thispaperfirstbrieflyreviewsanumberofrelatedconsistencymodelsintermsoftheiruseofthreetechniques–time,processoranddataselection–whicheacheliminatesomeunnecessarypropaga-tionofmemorymodificationswhileguaranteeingsequentialconsistencyfordata-race-freeprograms.Then,wepresenttheView-basedConsistencymodelanditsimplementation.Incontrastwithothermodels,theView-basedConsistencymodelcanachievetransparentdataselectionwithoutprogrammerannotationandcanofferthemaximumperformanceadvantage.Differencesamongrelatedworkarediscussedthroughillustrativeexamples.
1
PerformanceevaluationhasshownthatourimplementationoftheView-basedConsistencymodeloutperformstheLazyReleaseConsistencymodel,andthattheView-basedConsis-tencymodelhasadvantagesoveroptimalconsistencyprotocolssuchastheAffinityEntryConsistencyprotocol.Finallywesummarisesourcontributionsandpointsoutourfuturedirectionofimplementationeffortfordistributedsharedmemorysystems.
KeyWords:DistributedSharedMemory,SequentialConsistency,RelaxedSequentialConsistency,EntryConsistency,ScopeConsistency,LazyReleaseConsistency,Timese-lection,Processorselection,Dataselection,View-basedConsistency,Viewdetection,Viewtransition,FalseSharing
1Introduction
ADistributedSharedMemory(DSM)systemcanprovideapplicationprogrammerstheillusionofsharedmemoryontopofmessage-passingdistributedsystems,whichfacilitatesthetaskofparallelprogrammingindistributedsystems.DistributedSharedMemoryhasbecomeanactiveareaofresearchinparallelanddistributedcomputingwiththegoalsofmakingDSMsystemsmoreconvenienttoprogramandmoreefficienttoimplement[21,12,20,9,7,6,3,14,15].TheconsistencymodelofaDSMsystemspecifiesorderingconstraintsonconcurrentmem-oryaccessesbymultipleprocessors,andhencehasfundamentalimpactonDSMsystems’programmingconvenienceandimplementationefficiency[22].TheSequentialConsistency(SC)model[19]hasbeenrecognizedasthemostnaturalanduser-friendlyDSMconsistencymodel.TheSCmodelguaranteesthattheresultofanyexecutionisthesameasiftheoper-ationsofallprocessorswereexecutedinsomeglobalsequentialorder,andtheoperationsofeachindividualprocessorappearinthissequenceintheorderspecifiedbyitsownprogram[19].ThismeansthatinanSC-basedDSMsystem,memoryaccessesfromdifferentproces-sorsmaybeinterleavedinanysequentialorderthatisconsistentwitheachprocessor’sorderofmemoryaccesses,andtheordersofmemoryaccessesobservedbydifferentprocessorsarethesame.OnewaytostrictlyimplementtheSCmodelistoensureallmemorymodificationsbetotallyorderedandmemorymodificationsgeneratedandexecutedatoneprocessorbepropa-gatedtoandexecutedinthatorderatotherprocessorsinstantaneously.Thisimplementationiscorrectbutitsuffersfromseriousperformanceproblems[25].
2
Inpractice,notallparallelapplicationsrequireeachprocessortoseeallmemorymodifi-cationsmadebyotherprocessors,letalonetoseetheminorder.Manyparallelapplicationsregulatetheiraccessestoshareddatabysynchronization,sonotallvalidinterleavingsoftheirmemoryaccessesarerelevanttotheirrealexecutions.Therefore,itisnotnecessaryfortheDSMsystemtoforceaprocessortopropagateallitsmodificationstoeveryotherprocessor(withacopyoftheshareddata)ateverymemorymodificationtime.Undercertainconditions,theDSMsystemcanselectthetime,theprocessor,andthedataforpropagatingsharedmem-orymodificationsinordertoimprovetheperformancewhilestillappearingtobesequentiallyconsistent[24].Forexample,consideraDSMsystemwithfourprocessors
,
,
,and
,where
,
,and
shareadataobject,
,and
shareadataobject
,and
and
shareadataobject,asshowninFig.1.Thedataobjectsand
aresharedamong
processorsatalatertimenotshowninthisscenario.
w(x)P1w(z)w(y)w(x1)w(x2)P2
r(x) w(x)r(x) r(x1)program order
P3
P4
r(y)w: write r: read
Figure1:AscenarioofaDSMprogram
Supposeallmemoryaccessestoshareddataobjectsareserializedamongcompetingproces-sorsbymeansofsynchronizationoperationstoavoiddataraces.Underthesecircumstances,thefollowingthreebasictechniquescanbeused[24].Timeselection:Modificationsonashareddataobjectbyoneprocessorarepropagatedtootherprocessorsonlyatthetimewhenthedataobjectistobereadbythem.Forexample,modificationsonby
maybepropagated
outwardonlyatthetimewheneither
or
isabouttoread.Processorselection:Modifi-cationsonashareddataobjectarepropagatedfromoneprocessortoonlyoneotherprocessorwhichisthenextoneinsequencetoreadtheshareddataobject.Forexample,modificationson
by
maybepropagatedto
(butnotto
,
)if
isthenextoneinsequencetoread
.Dataselection:Processorspropagatetoeachotheronlythoseshareddataobjectsthatare
reallysharedamongthem.Forexample,object
,and
maypropagatetoeachotheronlydata
(not),and
and
propagatetoeachotheronlydataobject(not).
ToimprovetheperformanceofthestrictSCmodel,anumberofRelaxedSequentialConsis-tency(RSC)modelshavebeenproposed[10,13,18,5,17],whichperformoneormoreofthe
3
abovethreeselectiontechniques.RSCmodelscanbealsocalledconditionalSequentialCon-sistencymodelsbecausetheyguaranteeSequentialConsistencyforsomeclassofprogramsthatsatisfytheconditionsimposedbythemodels.Thesemodelstakeadvantageofthesyn-chronizationsindata-race-freeprogramsandrelaxtheconstraintsonmodificationpropagationandexecution.Thatmeansmodificationsgeneratedandexecutedbyaprocessormaynotbepropagatedtoandexecutedatotherprocessorsimmediately.MostRSCmodelscanguaranteeSequentialConsistencyfordata-race-freeprogramsthatareproperlylabelled[13](i.e.,explicitprimitives,providedbythesystem,shouldbeusedforsynchronizationintheprograms).However,intheseRSCmodelsdataselectionisachievedbyprogrammerannotation.Theprogrammerisrequiredtoprovideannotationsoftheassociationbetweendataobjectsandsynchronizationobjects,suchaslocksandbarriers,sothattheDSMsystemcanknowwhichdataobjectsshouldbepropagatedwhenasynchronizationobjectisaccessed.Thistypeofannotationisanextraburdenontheprogrammerandcausesanincreaseofthecomplexityofparallelprogramming.Theeaseofprogrammingwouldbeimprovediftheneedforhumanannotationcouldbereplacedwithautomaticdetectionoftheassociation.Automaticdetectioncanbedoneatruntimeand/orcompiletime.Compile-timedetectionmaycauseaslowercompilingprocess,buthaslittlerun-timeoverhead.Run-timedetectionmayhavemorerun-timeoverhead,butcangetmoreaccurateinformation.ThispaperproposesanovelView-basedConsistency(VC)modelwhichcanachieve,withrun-timeautomaticviewdetection,dataselectiontransparentlyinadditiontotimeandprocessorselection.
Therestofthispaperisorganisedasfollows.Section2brieflypresentsthebackgroundtoourwork,especiallytheclassificationofRSCmodelsintermsoftime,processor,anddataselection.Section3presentstheVCmodel[16],whosepropertiesaredescribedindetail.Section4discussesimplementationissuesandpresentsourimplementationoftheVCmodel.Section5describesthedifferencesamongrelatedworkthroughillustrativeexamples.Section6presentsandevaluatesperformanceresults.Finally,themajorcontributionsofthispaperandsomesuggestionsforfutureworkaresummarizedinSection7.
4
2RelaxedSequentialConsistencyModels
DuringtheexecutionofaDSMparallelprogram,multipleprocessorscommunicatewitheachotherthroughthevirtualsharedmemory.Insharedmemorysomedataobjectsareread-only,andsomeareread/write.Topreventdataraces(wheremultipleprocessorsreadandwritethesamedataobjectconcurrently),aparallelprogramhastoguaranteethataprocessorhasgainedexclusiveaccessbeforeaccessingaread/writedataobject.
RSCmodelsdistinguishsynchronizationdataobjectsfromordinarydataobjectsinsharedmemory.Synchronizationdataobjects,suchaslocksandbarriers1,arethosethatareexplic-itlyusedtoenforceexclusiveaccesstootherdataobjects.Therestofthedataobjectsinsharedmemoryarecalledordinarydataobjects.RSCmodelsrequiresequentialconsistencyforthesynchronizationdataobjectsinsynchronizationprimitives,suchasacquireandrelease.RSCmodelsassumethereisonlyonememorycopyforthesynchronizationdataobjectsandaccessestothesamedataobjectareexecutedsequentially.Forthestorageofordinarydataobjects,RSCmodelsadoptareplicatedarchitecturetoimproveperformanceunderthedis-tributedenvironment.Copiesarestoredinthelocalmemoryofeachparticipatingprocessor.Modificationsgeneratedandexecutedlocallyatsometimebyoneoftheprocessorsmaybepropagatedtoandexecutedbyotherprocessorsatalatertime.Thisdelaymaycausetheviola-tionofsequentialconsistency,but,ifhandledproperly,canbeusedtoimprovetheperformanceoftheDSMsystem.Inordertoachievesequentialconsistencywhiletakingadvantageofthedelay,RSCmodelsdonotallowanydataraceonordinarydataobjectsintheprogram.Exclu-siveaccesstoordinarydataobjectshastobeguaranteedbyexplicitlyusingsynchronizationprimitives.Ifdataracesonordinarydataobjectsarepreventedbyusingthesynchronizationprimitivesintheprogram,mostRSCmodelscanguaranteesequentialconsistencyfortheor-dinarydataobjects.Insummary,RSCmodelsstrictlyimplementsequentialconsistencyforsynchronizationdataobjects,whiletheyimplementsequentialconsistencyforordinarydataobjectsundertheconditionthatthereisnodataraceonthem.Thiscanbeguaranteedbyproperlylabellingtheprogramwiththesynchronizationprimitives.Aprogramissaidtobeproperlylabelledifthereisnodataraceintheprogramduetotheuseofthesynchronizationprimitivessuchasacquire,releaseandbarrier[13].AnexecutionofsuchaDSMprogramcan
beviewedasasequenceofbarriersessionsasshowninFig.2.
barrier session
non−criticalnon−criticalBcriticalAregionRcriticalAregionRregionregionB: barrier A: acquire R:release
barrier
sessionBBprogram order
Figure2:Aviewofaprogramexecutionbasedontheconceptofregion
Abarriersessionbeginswithabarrierandendswithanotherbarrier.Insideabarriersessionthereisasequenceofregionscomprisingcriticalregionsandnon-criticalregions.Theseregionsaredelimitedbyacquire,releaseandbarrierprimitives.Acriticalregionisasectionofcodethatensuresthatonlyoneprocessorexecutesthesectionofcodeatanytime.Itbeginswithanacquireandendswitharelease.Anon-criticalregionbeginswitharelease(theoutermostonewhentherearenestedcriticalregions)orabarrierandendswithanacquire(theoutermostonewhentherearenestedcriticalregions)orabarrier.Anon-criticalregiondoesnotoverlapwithanycriticalregion,butacriticalregionmaybenestedwithinanothercriticalregion.
Inanexecutionofaproperlylabelledprogram,accessestodataobjectsmayhavesomecausalordering.Thecausalorderingisessentialtothedefinitionofdataracesanddata-race-freeprograms.TheRSCmodelstakeadvantageoftheassumptionthatprogramsaredata-racefreetooptimizemodificationpropagationandexecution.Definition1CausalOrderingRelation“
”
AtahighlevelwecanmodelaDSMprogramasasetofsequencesofactions(onesequenceforeachprocessor).Theseactionsincludememoryaccessesandsynchronizationprimitivessuchasacquire,releaseandbarrier.Giventwoactions
and
,generatedbyprocessors
and,respectively,then
ifandonlyif(1)
,andthegenerationof
happened,suchthat
beforethegenerationof
inprogramorder;or(2)
,
isarelease,
isanacquireon
thesamelockas
,and
releasesthelockto
;or(3)thereexistsanaccess
and
.
Definition2PreviousandconcurrentactionsGivenanytwoactions
and
,(1)
issaidtobepreviousto
ifandonlyif
;(2)
and
aresaidtobeconcurrent(written)ifandonlyifneithernor
.
6
Definition3Conflictingaccesses,dataraceanddata-race-freeprogramsTwomemoryaccesses
and
conflictifandonlyif
and
accessthesamememory
locationandatleastoneisawrite.Iftherearetwoconflictingaccessesexecutionofaprogram,and
and
inan
,thenwesayadataraceoccursintheexecution.Aprogram
isdataracefreeifandonlyifeverypossibleexecutionoftheprogramhasnodatarace.Fromtheabovedefinitionsweknowthatifnodataraceoccurs,thenwhenaprocessormodifiesadataobjectinsideacriticalregion,otherprocessorswillnotaccessthatdataobject.Thusitisnotnecessarytopropagatemodificationsduringtheexecutionofacriticalregion.TheRSCmodelshavetakenadvantageofthisrelaxationofmodificationpropagationtoachievetime,processor,anddataselection.Theycanguaranteesequentialconsistencyfordata-race-freeprogramswhileimprovingtheirperformance.InthefollowingwebrieflydescribetheRSCmodelsintermsofthethreeselectiontechniques.
TheWeakConsistency(WC)model[10]isanRSCmodelwhichachievestimeselection.TheWCmodelrequiresaprocessortopropagateallitsmodificationstootherprocessorsonlyatsynchronizationtime,ratherthanateverymemorymodificationtime.Withtimeselection,modificationsonshareddataobjectscanbeaccumulatedandonlythefinalresultsareprop-agatedinbatchesatsynchronizationtime.Inthisway,thenumberofmessagesintheWCsystemscanbegreatlyreducedcomparedtothatinstrictSCsystems.
TheEagerReleaseConsistency(ERC)model[13]takestimeselectiononestepfurtherthantheWCmodelbydistinguishingtwodifferentsynchronizationprimitives:acquireandrelease,whicharetheentryandexitofacriticalregion,respectively.TheERCmodelrequiresthatsharedmemorymodificationsbepropagatedoutwardonlyatreleasetime.Inotherwords,theERCmodelismoretimeselectivethantheWCmodelbypropagatingmodificationsoutwardonlyattheexitofacriticalregion,insteadofatboththeentryandexitofacriticalregionasintheWCmodel,thusfurtherreducingthenumberofmessagesinthememorysystem.TheLazyReleaseConsistency(LRC)model[18]improvestheERCmodelbyperformingbothtimeandprocessorselection.InsteadofpropagatingmodificationstoallotherprocessorsatreleasetimeasinERC,LRCpostponesthepropagationofmodificationsuntilanotherpro-cessorhassuccessfullyperformedanacquire.Atasuccessfulacquire,theDSMsystemisabletoknowpreciselywhichprocessoristhenextonetoaccesstheshareddata,somodificationscanbepropagatedonlytothatparticularprocessor(ornopropagationatallisrequiredifthe
7
nextprocessoristhecurrentprocessor),thusachievingprocessorselectionintheLRCmodel.Bysendingmodificationsonlytotheprocessorthatisenteringacriticalregion,agreaterre-ductioninthenumberofmessagescanbeachievedintheLRCmodel.
TheEntryConsistency(EC)model[5]isverysimilartoLRCinpropagatingmodificationsonlytothenextprocessorenteringacriticalregion.Inadditiontotimeandprocessorselection,theECmodelalsoperformsdataselectionbypropagatingonlythoseshareddataobjectsthatareassociatedwithasynchronizationdataobjectsuchasalock.Theseassociationsareanno-tatedbytheprogrammer.Duetotheinclusionofdataselection,theECmodelcanbemoreefficientthantheLRCmodel[6].
TheScopeConsistency(ScC)model[17]isabletooffermostofthepotentialperformanceadvantagesoftheECmodelbymeansoftime,processor,anddataselection,anditalsoim-provestheprogrammabilityoftheECmodelbyrequiringprogrammerstoattachconsistencyscopeswithcodesections,insteadofdata.Forcriticalregions,dataobjectscanbeautomat-icallyassociatedwithscopes;buttheprogrammerhastoannotatethescopesexplicitlyfornon-criticalregions.Iftheannotationisnotprovidedorisincorrect,ScCdoesnotguaranteesequentialconsistencyfortheprogram.
Fromtheabovediscussionwecanseethatconstraintsonmodificationpropagationandex-ecutionhavebecomemoreandmorerelaxed.ThisrelaxationallowsDSMsystemstoperformtime,processor,anddataselection.ToachievetheseselectionsRSCmodelsrequireprogram-merstoannotatetheprogramsmanuallysothatselectionscanbecombinedwithsynchro-nizationprimitives.Forexample,theERCmodelrequiresprogramstobeproperlylabelledwithsystem-providedsynchronizationprimitives,sothattheDSMsystemisexplicitlynotifiedofaprocessor’sentrytoandexitfromacriticalregionandcantherebyselecttheexittimetopropagatemodifications.TheECmodel,furthermore,requirestheprogrammertoassociatesynchronizationdataobjectsexplicitlywithordinarydataobjectstoachievedataselection.TheScCmodelmadeonesteptoward(partially)transparentdataselectionbytakingadvantageoftheconsistencyscopesimplicitlydefinedbysynchronizationprimitives,butprogrammersmaystillhavetodefineadditionalconsistencyscopesexplicitlyinprogramsinordertoguaranteesequentialconsistency.Weshouldbeverycautiousofrequiringprogrammerannotationwhichimposesanextraburdenonprogrammersandincreasesthecomplexityofparallelprogram-ming.
8
Wedistinguishtwotypesofprogrammerannotations:synchronizationannotationswhicharerequiredtoensureboththecorrectnessofparallelprograms(toavoiddataraces)andthecorrectnessofmemoryconsistency;andannotationswhicharerequiredonlyforthecorrect-nessofmemoryconsistency.Forthefirsttypeofannotations,suchasacquireandreleaseprimitivesintheERC,LRC,ECandScCmodels,theDSMsystemcantakeadvantageofthemtoachievetimeand/orprocessorselectionwithoutimposinganadditionalburdenonprogrammers.However,forthesecondtypeofannotations,suchastheassociationbetweensynchronizationobjectsanddataobjectsfordataselectionintheECmodel,andtheaddi-tionalconsistencyscopesintheScCmodel,theyaretrulyanextraburdenonprogrammersandincreasethecomplexityofparallelprogramming.Theyshouldbereplacedbyautomaticassociationsviarun-timedetectionand/orcompile-timeanalysis[11].
InthefollowingsectionwepresenttheView-basedConsistencymodel,whichachievestime,processor,anddataselectionandremovestheextraburdenonprogrammersrequiredbyprevi-ousdataselectiontechniques.
3View-basedConsistency
TheView-basedConsistency(VC)model[16]hasbeenproposedtoachievedataselectiontransparentlywithoutprogrammerannotation.SimilartootherRSCmodels,itguaranteesSequentialConsistencyforproperlylabelleddata-race-freeprograms.
Aviewisasetofordinarydataobjectsthataprocessorhastherighttoaccessinthesharedmemoryataparticularpointintime.Wesayaprocessorhastherighttoaccesssomedataobjectifandonlyifithasgainedexclusiveaccesstothedataobjectorthedataobjectisread-only.
Ingeneralwesaythattheviewofaprocessorisasetofdataobjectsthattheprocessorhastherighttoaccess.Indata-race-freeprograms,wecansaythattheviewcomprisesdataobjectsthattheprocessorwillaccessinsomefollowingperiodofexecution,sincetheprocessorshouldhavegottheaccessrighttothedataobjects.
Atanytimepointofanexecutioninadata-race-freeprogram,supposeanytwoprocessors
and
haveviews
and
,respectively.Then
mustonlycontainread-onlydata
objects,otherwiseadataracewouldoccur.ThisisillustratedinFig.3.
9
read−only datashared memoryP1P2P3Figure3:Asnapshotofprocessors’viewsinadata-race-freeprogram
InaDSMprogram,exclusiveaccesstoadataobjectcanonlybegainedinthefollowingthreeways:
1.implicitassignmentbytheprogrammerbymakinguseofabarriersession.Exclusiveaccessisguaranteedbybarriers.
2.explicitacquisitionbycallingtheacquireprimitive.Exclusiveaccessisguaranteedbythelockmechanismofcriticalregions.
3.implicitacquisitionguaranteedbytheprogrammerwhocontrolsaccesstothedataobjectbyassociatingitwiththestatusofanothercritical-region-protectedobject.Forexample,exclusiveaccesstoataskcanbeguaranteedbyremovingthetaskfromalock-protectedtaskqueue(andthusthestatusofthetaskqueuehasbeenchanged).
Therefore,inanexecutionofaDSMprogram,onlywhenaprocessorcallssynchronizationprimitives,e.g.,barrier,acquire,andrelease,doesitsviewchange.Aprocessor’sviewisconstantinsideacriticalornon-criticalregion.Onlywhenaprocessormovesfromoneregiontoanotherdoesitgainorloseexclusiveaccesstosomedataobjects,whichcausesachangetotheviewoftheprocessor.
Accordingtotheaboveobservation,viewscanbehandledandprocessedbasedonregions(criticalandnon-critical).Forconvenienceofdescriptioninthispaper,weclassifyviewsintoCriticalRegionViews(CRVs)andNon-criticalRegionViews(NRVs).ACRVistheviewofaprocessorwhileitexecutesacriticalregion,andanNRVistheviewofaprocessorwhileitexecutesanon-criticalregion.Indata-race-freeprograms,aCRVconsistsofthedataobjectsaccessedinthecriticalregion,andanNRVconsistsofthedataobjectsaccessedinthenon-criticalregions.
10
ConsistencymaintenanceinVCrequiresupdatingdataobjectsoftheviewbeforeaprocessorentersaregion.Moreprecisely,thefollowingconsistencyconditionsaregivenfortheView-basedConsistency(VC)model.AnyimplementationofVCshouldsatisfytheseconditions.Definition4ConditionsforView-basedConsistency
Beforeaprocessor
isallowedtoenteracriticalornon-criticalregion,allprevious
writeaccessestotheordinarydataobjectsoftheCRVorNRVmustbeperformedwithrespectto
accordingtotheircausalorder.
Beforeaprocessorisallowedtopassabarrierprimitive,allpreviouswriteaccesses
mustbeperformedwithrespectto
accordingtotheircausalorder.
Thesequentialconsistencyofsynchronizationdataobjectsmustbeguaranteedbytheimplementationofthesystemprimitivessuchasacquire,release,andbarrier.
Awriteaccesstoamemorylocationissaidtobeperformedwithrespecttoprocessoratimepointwhenasubsequentreadaccesstothatlocationbywriteaccess.
at
returnsthevaluesetbythe
Insideabarriersession,basedonthefirstcondition,aprocessorisguaranteedtoaccesstheup-to-dateversionofthedataobjectsinitsview.
Thesecondconditionstatesthatwhenaprocessorreachesabarrier,itshouldbeabletoseeallthepreviousmodificationsmadebyotherprocessors.ThisconditionisalsorequiredinEC,ScCandLRC.
SCcorrectnessofVCProcessorsaresynchronisedtomodifythesameviewoneafteran-other,butmaymodifydifferentviewsconcurrentlyinanydata-race-freeprogram.Basedonthisobservation,foranyparallelexecutionofaDSMprogramunderVC,wecanproduceaglobalsequentialorderofthemodificationsonviews,inwhichthemodificationsonthesameviewareorderedinthesamewayasthesynchronisedorderoftheparallelexecution,andthemodificationsondifferentviewsareputinprogramorderiftheyareexecutedsequentiallyintheprogram,otherwisetheyareparallelandputinanyorder.Parallelmodificationsondiffer-entviewscanbeexecutedinanyorder,whichwillnotaffecttheexecutionresult.Obviously,accordingtotheconsistencyconditionsforVC,theparallelexecutionresultoftheprogramun-derVCisthesameastheabovesequentialexecutionofthemodifications.Therefore,aglobal
11
sequentialorderhasbeenfoundtomatchtheparallelexecutionresultunderVC.AccordingtothedefinitionoftheSCmodel,VCcanguaranteeSequentialConsistencyfordata-race-freeprograms.
ThefirstconsistencyconditionofVCalsoindicatesthat,whenaprocessorisenteringanewregion,itisonlyrequiredtoupdatethosedataobjectsinthecorrespondingnewview.Inthisway,VCachievestimeselection(attheentryofaregion),processorselection(thenextprocessorwhichhastherighttoaccesstheview),anddataselection(onlythedataobjectsoftheview).
AnyimplementationoftheVCmodelshouldconformwiththeaboveconsistencycondi-tions.Therearetwoimportanttechnicalissuesintheimplementation:viewdetectionandviewtransition.Viewdetectionmeansidentifyingallthedataobjectsinthenewviewofapro-cessorbeforeitentersanewregion.Indata-race-freeprograms,theviewconsistsofthedataobjectstobeaccessedbytheprocessorinthenewregion.Thusviewdetectionisapredictionofthedataobjectsthatwillbeaccessedbytheprocessorduringexecutionofthenewregion.Viewtransitionmeansupdatingallthedataobjectsinthenewviewofaprocessorbeforeitentersanewregion.Inshort,beforeaprocessorentersanewregion,itsviewshouldhavebeendetected;whenaprocessorentersanewregion,theviewtransitionshouldhavebeenachieved.AnyimplementationoftheVCmodelshouldguaranteethatviewdetectionandtransitionareimplementedcorrectly.
Correctnessandaccuracyaretwoimportantissuesinviewdetection.Acorrectviewshouldincludealldataobjectsthataprocessorhastherighttoaccess.Anaccurateviewshouldincludeandonlyincludethosedataobjects.ThecorrectnessofviewdetectionmustbesatisfiedinVCimplementation,whileinaccuracyofviewdetectionmayonlyaffectitsperformance(e.g.,propagationofirrelevantmodificationsofdataobjects).Ofcourse,theperformancealsodependsontheeffectivenessofviewdetectionandviewtransition.WewilldiscusstechniquesforviewdetectionandtransitioninSection4.
Inthefollowingdiscussion,weassumetheviewofeveryprocessoriscorrectlyandaccu-ratelydetectedunderanidealsituation,inordertoexplainhowthegenericVCmodelworks.Fig.4showshowtheVCmodelworksusingthesamescenarioshowninFig.1.InFig.4,weassumetheaccessestodataobject
aresynchronisedbythestatusof,asinthethird
wayforacquiringexclusiveaccessdescribedearlyinthissection.Underthisassumptionitis
12
P1A(l1)w(z)w(y)R(l1)P2P3P4
xA(l2)w(x)R(l2)w(x1)w(x2)x A(l2) r(x)w(x)R(l2)xx1A(l2) r(x)R(l2) r(x1)yA(l1) r(y)R(l1)w: write r: read A: acquire R: release
: request update on x and execute it at the processorFigure4:View-basedConsistencyinaction
andtoracefor.
program order
notpossiblefor
Weassumetherearetwocriticalregionswhichareprotectedbylocksand
and
,respectively,
when
and
accessdataobjectsinnon-criticalregions.Wealsoassumethattheviewofevery
processorhasbeendetectedcorrectlyateachstep.Thedetectedviewofitentersthecriticalregion,thedetectedviewofcriticalregions,andthedetectedviewof
includes
and
includes
whentheyentertheir
inthenon-criticalregionincludes
.Before
executesthecriticalregion,onlythemodificationofitsview.For
ispropagatedtotheprocessortoupdate
and
,onlythemodificationon
ispropagatedtothemtoupdatetheirviews
inthesamecriticalregion.When
entersthenon-criticalregion,onlythemodificationon
ispropagated.Thelockacquisitionandmodificationpropagationareseparateinthefigure,
buttheycanbecombinedinanimplementationinordertoimproveDSMperformance.ComparisonwithrelatedmodelsAmongtheRSCmodels,onlyScC[17]andEC[5]canachievedataselection.However,theycannotguaranteeSCcorrectnessfordata-race-freeprograms(seeSection5fordetails).InordertoguaranteeSCcorrectness,extraprogrammerannotationsarerequiredinprograms.Toselectivelyupdatedataobjects,ECusesguardedshareddata
andScCusesscope.Guardedshareddata
isasetofdataobjectsassociated
withasynchronisationdataobject.Thisassociationisspecifiedbytheprogrammer.AscopeinScCconsistsofsomeprogramsectionsprotectedbyasynchronizationdataobjectsuchasalock.Both
andscopearestaticandfixedforaparticularsynchronizationdataobjector
acriticalregion.Evenifsomedataobjectsarenotgoingtobeaccessedbyaprocessorinacriticalregion,theyareupdatedsimplybecausetheyareassociatedwiththelockorthecriticalregion.Forexample,inFig.4,suppose
and
areassociatedwithlock
inEC.Thusafter
acquires
,modificationson
andarepropagatedtoit,eventhoughitisnotgoingto13
access.Thiscircumstanceisnotuncommoninparallelprograms.Incontrast,aviewinVC
isdynamicandmaybedifferenteverytimethesamecriticalregionisentered.Forexample,inFig.4,althoughincludes
updatesboth
and
inthe
-relatedcriticalregion,
’sviewonly
when
enterstheunderVC.
-relatedcriticalregion.Thereforeonlythemodificationon
ispropagatedto
ComparedtoLRC,VCcanreducemoreofthefalse-sharing2effectinpage-basedDSMsystems(seeSection5fordetails).Thefalse-sharingeffectisthepropagationofuselessmod-ifications,whichiscausedbyfalsesharing.
ProgramminginterfaceTheVCmodelprovidesthesameprogramminginterfaceasLRC,ERCandWC.ItcanguaranteeSequentialConsistencyfordata-race-freeprogramsthatareproperlylabelled.Incontrast,ECrequirestheprogrammertoprovidecorrectlock-dataasso-ciation.Ifthelock-dataassociationisnotcorrect,ECdoesnotguaranteethecorrectexecutionoftheprogram.Similarly,ScCdoesnotguaranteethesameexecutionresultasSequentialConsistencyforsomedata-race-freeprogramsifexplicitscopeannotationsarenotcorrectlyprovidedbytheprogrammer(seeSection5fordetails).
Fromtheabovediscussion,weknowthat,inthegenericVCmodel(anidealimplementationofVC)whereviewscanbeaccuratelydetected,nouselessmodificationswillbepropagatedamongprocessors.Therefore,thegenericVCmodelcanachievethemaximumdataselection.Inotherwords,thegenericmodelcanachievethemaximumrelaxationofconstraintsonmod-ificationpropagationandexecutionindata-race-freeprograms.Asaconsequence,webelievethegenericVCmodelcanbeusedasaframeworkforfutureresearchonDSMimplementa-tions.
Tosummarise,wemakethefollowingstatementswhicharetrueinthegenericVCmodel.
Theviewofaprocessorisconstantwithinaregion.
Whenaprocessorentersanewregion,onlythedataobjectsofitsnewviewareupdated.
SequentialConsistencyfordata-race-freeprogramsisguaranteed.
Time,processor,anddataselectionareachieved.Dataselectionisachievedwithoutprogrammerannotations.
Maximumrelaxationofconstraintsonmodificationpropagationandexecutionfordata-race-freeprogramsisachieved.
Inthefollowingsection,wewilldiscussourimplementationoftheVCmodel.
4Implementation
WehaveimplementedtheVCmodelbasedonTreadMarks[3],whichisapage-basedDSMsystem.InTreadMarks,adiffisusedtorepresentmodificationsonapage.Initiallyapageiswrite-protected.Whenawrite-protectedpageisfirstmodifiedbyaprocessor,atwinofthepageiscreatedandstoredinthesystemspace.Whenthemodificationsonthepageareneededbyanotherprocessor,acomparisonofthetwinandthecurrentversionofthepageisdonetocreateadiff,whichcanthenbeusedtoupdatecopiesofthepageinotherprocessors.Inourimplementation,apageistreatedasthebasicunitofdataobjects.Thusaviewintheimplementationconsistsofpages.Thediffscanberegardedasmodificationsonpages.
4.1Viewdetection
Inconsistencymaintenanceweonlyneedtoknowwhicharethemodifieddataobjectsandthenupdatethem.Likewise,tomaintaintheconsistencyofaview,weonlyneedtoupdatethemodifieddataobjectsintheview.Therefore,wearenotinterestedintheunchangedpagesofaviewandthusonlythemodifiedpagesarerecordedinourimplementationofviewdetection.Inourimplementation,viewdetectionisachievedatruntime.Todetectmodifiedpagesinaview,ourimplementationtakesadvantageofthefollowingtwoexistingmechanismsinTreadMarks:
1.Whenawriteaccessisperformedonaninvalidatedpage,apagefaultwilloccur.Apagefaulthandlerwillrequestthediffsofthepagefromotherprocessors.Weextendedthepagefaulthandlertostorethepage’sidentifierintothecorrespondingview.
15
2.Whenawriteaccessisperformedonawrite-protectedpage,aprotectionviolationin-terruptwilloccur.Aninterrupthandlerwillmakeatwinofthepageunderthemultiple-writerscheme[8].Weextendedtheinterrupthandlertostorethepage’sidentifierintothecorrespondingview.
SincetheabovetwomechanismsalreadyexistedinTreadMarks,therewaslittleoverheadforstoringtheidentifiersofmodifiedpages.4.1.1Viewdetectionincriticalregions
Viewsincriticalregionsarerelativelyeasytodetect.AllthepagespreviouslymodifiedinthesamecriticalregionareincludedinthecorrespondingCRV,becausetheywillbeverylikelyaccessedlaterbyaprocessorexecutingthesamecriticalregion,althoughprogramlogiclikeifstatementsmayaffecttheaccesspatternofaprocessor.However,thepagesmerelymodifiedfromoutsideacriticalregionareexcludedfromtheCRV,sincetheirdiffsshouldn’tbeusedtoupdatetheCRV,asthiswouldindicatethereisadataraceintheprogram.
WeconstructaCRVbyrecordingpagesmodifiedinsideacriticalregion.However,ifapageisalreadywritablebeforeanewregionisentered,thatpagewillnotbedetectedandstoredintheCRVorNRVifitismodifiedintheregion.Todetectallmodifiedpagesinaregion,wemakeallwritablepageswrite-protected(read-only)whenenteringtheregion.Thisistheadditionaloverheadrequiredforviewdetection.Itisproportionaltothenumberoftimescriticalregionsareexecutedintheprogram.Fortunately,theoverheadwillbebalancedbytheperformancegainwhichisalsoproportionaltothenumberoftimescriticalregionsareexecuted.Ourexperimentalresultsdemonstratethisadditionaloverheadisrelativelysmall.Basedontheabovewrite-protectionmechanism,ourviewdetectionalgorithmworksasbelow.Apageset
isassociatedwitheachlockprotectingacriticalregion.Initially
is
empty.Duringanexecutionofthecriticalregion,ifapage
isdetectedasmodified,then
.Thepageset
includesallthepagesmodifiedinthecriticalregion.Since
aprocessorenteringthecriticalregionwillverylikelyaccessthepagesinthedetectedCRV,andisreasonablyaccurate.
,
isusedas
16
4.1.2Viewshrinkageinnon-criticalregions
Unfortunately,viewsinnon-criticalregionsarenotaseasytodetectasincriticalregions,becausetherearenoobviousconnectionsbetweennon-criticalregions.Themodifiedpagesinpreviousnon-criticalregionsmaynotbelongtotheviewofasuccessivenon-criticalregion.However,asforaCRV,itiscertainthatanNRVshouldonlyincludepagesmodifiedinpreviousnon-criticalregions,sincemodificationsmadeincriticalregionsshouldn’tbeusedtoupdateanNRV,otherwise,itindicatesthatadataraceoccursintheprogram.Ofcourse,amodifiedpagemaybeincludedinbothaCRVandanNRVorintwoCRVsbecauseofthefalse-sharingproblem.
Inourimplementation,anNRVinitiallyconsistsofallpagespreviouslymodifiedinnon-criticalregionsandthusadetectedNRVmaybelargerthantherealone.Thismeansthatwhenaprocessorexecutesanon-criticalregionitmaynotaccesssomepagesintheNRV.Thisinaccuracyonlyaffectstheperformance(lessdataselectionandnoreductionoffalse-sharingeffectinnon-criticalregions),notthecorrectnessoftheimplementation.
Toimprovetheperformanceofourimplementation,weuseRegionalLocality[15]tody-namicallyshrinktheNRVduringexecutionofanon-criticalregion.RegionalLocalityisbasedontheobservationthatasetofpagesthatareaccessedinanon-criticalregionwillbeverylikelyaccessedasawholeinothernon-criticalregions.Forexample,supposeprocessoranon-criticalregionandaccessespages
enters
,
,...,
duringexecutionofthenon-critical
region,andprocessor
entersanothernon-criticalregionafterwards.Sincedataobjectsac-
cessedinanon-criticalregionoftenmigratetogetherfromoneprocessortoanotherprocessor,asregulatedbytheprogrammertoshiftworkloadamongprocessors,whentwomembersofthepagesetset.
ToexploitRegionalLocalityinnon-criticalregions,wedetectthepagesmodifiedinnon-criticalregionsandaggregatethem.WeadoptaModifiedPagesSet(MPS)forgroupingpagesmodifiedinnon-criticalregions.AnMPSisformedasbelow.Whenaprocessorentersanon-criticalregion,auniqueemptyMPSiscreatedforthenon-criticalregion.Iftheprocessormodifiesapageduringexecutionofthenon-criticalregion,theidentifierofthepageisstoredintotheMPS.Whentheprocessorexitsfromanon-criticalregion,theMPSbecomescompleteandisstoredforlateruse.
17
accessesoneor
,
,...,
,itwillverylikelyaccesseverymemberofthat
Weusesomehintstodecidewhetherweshouldshrinktheviewofaprocessorinanon-criticalregion.ThefirsthintweuseistheaccesstoanypageinanMPS(calledthefirsthitoftheMPS).ThishintsuggeststhatthefullsetofthepagesintheMPSmightbetherealviewoftheprocessorexecutingthenon-criticalregion.IftheaccessedpagebelongstomultipleMPSs,weusetheaccesstoanotherpage(calledthesecondhitoftheMPS)toconfirmwhichMPSmightbetherealview.OnceanMPSisassumedtobetherealviewoftheprocessor,theprocessorreducesitsviewtotheMPS.Theshrunkenviewcanbeusedtoprefetchdiffsoftheviewinthenon-criticalregion.Wewilldiscusshowtotakeadvantageoftheshrunkenviewinviewtransition.
Thedetectedviewsarepropagatedtoaprocessorwhenitcallsacquire.Thustherearenoextramessagesforviewpropagation,exceptforsmalldataofviewspiggy-backedonthelockreleasemessage.
4.2Viewtransition
Viewtransitionistomakeaview(asetofpages)uptodate.Beforeanewregionisentered,viewtransitionmustbedone.Modificationpropagationandexecutionprotocols(alsocalledconsistencyprotocols)areusedtoachieveviewtransition,wherepagesthatarerequiredtobemadeconsistentbytheconsistencymodelaremadeuptodate.Thetwocommonmodifica-tionpropagationandexecutionprotocolsaretheinvalidationandtheupdateprotocol.Iftheinvalidationprotocolisadopted,whenapageisrequiredtobemadeconsistent(uptodate)bytheconsistencymodel,itisinvalidatedfirstandthediffsarepropagatedandappliedbyapagefaultlater.Iftheupdateprotocolisadopted,whenapageisrequiredtobemadeconsistent,thediffsarepropagatedandappliedimmediately.
Fromanimplementationpointofview,ifaviewisaccuratelydetectedanditspagesaretobeaccessedbytheprocessor,itwouldbemoreefficienttousetheupdateprotocoltopropagateandapplythediffs,ratherthanusingtheinvalidationprotocol.However,ifthedetectedviewislargerthantherealview,theinvalidationprotocolmaybemoreefficient.Sincethedetectedviewhaspagesthatwillnotbeaccessedbytheprocessor,theupdateprotocolwillthuspropa-gateuselessdiffs.Incontrast,theinvalidationprotocolpropagates(small)invalidationnoticesfirsttoinvalidatethepagesinthedetectedview,butthediffsofapagearepropagatedonlywhenthepageisintherealviewandthusaccessedbytheprocessor.Inthisway,theprop-18
agationsofuselessdiffsareavoidedintheinvalidationprotocol,withtheoverheadofmoremessages(andpagefaults).Iftheuselessdiffsarehuge,thereisachancefortheinvalidationprotocoltobemoreefficient.
TheupdateprotocolisgenerallysuitableforVC,justastheinvalidationprotocolisforLRC(seeSection5fordetails).However,sincethedetectedNRVsinourimplementationarenotaccurateandarenormallylargerthantherealonesinourimplementation,weadoptedthein-validationprotocolfortheviewtransitionintonon-criticalregions.TheinvalidationnoticesfortheNRVofaprocessorarepropagatedtotheprocessorwhenitcallsacquire,butaredis-abledincriticalregions.Theyareenabledandeffectiveinanynon-criticalregions.Whenaprocessorentersanon-criticalregion,thepageswiththedisabledinvalidationnoticesarein-validated.However,thesenoticeswillbedisabledagaininthefollowingcriticalregioniftheirpagesarenotaccessedinthenon-criticalregionnorbelongtothefollowingCRV.Obviously,comparedwithTreadMarks,enabling/disablinginvalidationnoticesisanotheroverheadinourimplementation.Againthisoverheadisproportionaltothenumberoftimescriticalregionsareexecuted.
Duringexecutionofthenon-criticalregion,theNRVmayshrinktooneoftheMPSs.OncetheNRVshrinks,weprefetchthediffsofthepagesintheshrunkenNRVandapplythemtotheirpages.Thisoptimisationdoesn’tchangetheeffectivenessoftheinvalidationnoticesofotherpages.
5Comparisonwithrelatedwork
Ourimplementationcanimproveperformanceattwolevels:theconsistencymodelandcon-sistencyprotocollevels.Attheconsistencymodellevel,itonlyupdatespagesintheviewofaprocessorenteringanewregion.ThedetectedCRVsaregenerallyaccurateandtheupdatepro-tocolisusedtoupdatethem.TheaccurateCRVsdetectedinourimplementationhelpreducethefalse-sharingeffect,aswillbediscussedshortly.InLRCallpreviouslymodifiedpagesmustbeupdatedandtheinvalidationprotocolisusedtoalleviatetheoverheadofpropagationofuselessdiffs.
Attheconsistencyprotocollevel,besidestheupdateprotocoladoptedforCRVs,ourimple-mentationusesshrunkenviewstoprefetchthediffsinnon-criticalregions.Thistechniqueis
19
similartosomeoptimalconsistencyprotocols[23,14,4],whichhavebeenproposedtoim-provetheperformanceoftheLRCmodel.Theseprotocolschooseoptimalwaystopropagatediffs,e.g.,prefetchingofdiffs.However,theyworkatthelevelofmodificationpropagationinLRC,insteadofatthelevelofaconsistencymodel.Theydon’tchangetheconsistencyconditionsofLRC.Therefore,theyinheritthefalse-sharingeffectfromLRC.Thoughtheycanreducethenumberofmessagesandtheoverheadofpagefaultsbyprefetching,theycan’tremovethefalse-sharingeffectinLRC,i.e.,thepropagationofuselessdiffs.ThesetechniquesarecomplementarytoourVCimplementationandmaybeusefultofurtheroptimizethediffpropagationinnon-criticalregionsinourimplementation.
P1
T1=sh_malloc(size_of_task);produce_task(T1);Acquire(L1);enqueue(Q, T1);Release(L1);
P2
Acquire(L1);dequeue(Q, T2);Release(L1);
consume_task(T2);
Figure5:Ataskqueueprogram
Toexplainthesubtledifferencesamongrelatedmodels,wewilldiscussataskqueuepro-gramshowninFig.5.Tomakethecomparisonfair,inthediscussionweassumetheinvali-dationprotocolisusedwiththemodelsbeingcompared.Thisassumptionisvalid,sinceRSCmodelsaddresswhentomakewhichpagesconsistentinpage-basedDSMsystemsandthusareindependentofmodificationpropagationandexecutionprotocols(alsocalledconsistencyprotocols)suchastheinvalidationorupdateprotocol,althoughthoseprotocolsmayaffecttheperformanceoftheirimplementations.IntheprograminFig.5,thevariable
isusedasataskqueue.Processor
producesa
task
andputsitinto
,andthenprocessor
getsatask
from
andconsumesthe
task,wheretasks.
and
arelocalvariablesthatstoretheaddressesoftheshareddataobjectsfor
Fig.6illustratestheread/writeaccessestotheshareddataobjectsinthetaskqueueprogram.
20
shared memorypage1page2T1QEC and ScC: invalidate page 1.LRC: invalidates page 1 and 2.VC: propagates invalidation noticesfor page 1 and 2, but the notice forpage 1 is only effective in the critical region and the one for page 2 is only effective in the non−critical region.P1write(*T1)Aread(Q), write(Q)R...shared memorypage1page2program orderT2QP2A: acquire R: release
Aread(Q), write(Q)Rread(*T2), write(*T2)program order
Figure6:DifferencesamongEC,ScC,LRC,andVC
ItisusedtoexplainhowdifferentRSCmodelsmaintainconsistencyinapage-basedDSM.Tomakethesubtledifferencesmoreclear,weassume
and
happentopointtothesametask
,whilein
datawhichislocatedinpage2,while
isinpage1.
InEC,theprogrammerisrequiredtoannotatethatisassociatedwithlock
ScCthisassociationcanbedetectedatrun-time.InbothECandScC,onlymadeuptodatebefore
isrequiredtobe
entersthecriticalregion.Therefore,theinvalidationnoticeforpage
1ispropagatedto
andpage1ismadeinvalidthere.Alaterpagefaultonpage1willbring
ituptodate.Sincepage2isnotinvalidated,toby
willnotreadtheup-to-datetaskdatapointed
.ThusSCcannotbeguaranteedforthisdata-race-freeprogramunderECandScC.
InLRC,allpreviouswriteaccesses,includingthewriteson
andthetaskdatapointedby
,mustbeperformedbeforetheacquireisperformedin
.Thustheinvalidationnotices
forpage1and2arepropagatedandpage1and2aremadeinvalid.Laterpagefaultswillbringthemup-to-date.
InVC,allpreviouswriteaccessestothedataobjectsinaviewmustbeperformedbeforeaprocessorentersthecorrespondingregion.Run-timeviewdetectioninVCcandeterminethatpage1()belongstotheCRVandthatpage2(thetaskdatapointedtoby
)belongsto
theNRV.Thustheinvalidationnoticesforpage1and2arepropagatedto
,however,the
noticeforpage1isonlyeffectiveinthecriticalregiontobring
uptodate,andtheonefor
21
page2isonlyeffectiveinthenon-criticalregiontobringthetaskdatauptodate.MindfulreadersmayrealisethatthemorecomplicatedprocessingoftheinvalidationnoticesinVCisnotsuperiortoLRCinthisexample,giventheextraoverheadofenablinganddisablinginvalidationnotices.However,theindividualtreatmentsofthenoticesfordifferentregionscanresultinlessfalse-sharingeffectinmanyothersituations,aswillbedescribedshortly.FromtheabovedescriptionsweknowtheRSCmodelsaredifferentintermsofwhentomakewhichpagesuptodate,i.e.,theirconsistencyconditionsaredifferent.However,thereisanotherimportantaspectofdifferenceamongthemregardingdataselection.ThroughuserannotationinEC,run-timedetectionofscope-dataassociationinScC,orrun-timeviewdetec-tioninVC,thesemodelshavefirmknowledgethatpage1()willbeverylikelyaccessedin
thecriticalregion,whichcanbereadilyusedtoimprovetheirimplementations.Forexample,insteadofadoptingtheinvalidationprotocolwhichcausesonepagefaultandonemodification(diff)requestintheaboveexample,theycanadopttheupdateprotocolwhichcanpiggy-backthediffsonthelockgrantingmessage,i.e.,lockrelease(thoughextramessagesmaybere-quiredifdiffsarehuge)andthussavetwomessages(diffrequestandreply)andonepagefault.
Ontheotherhand,LRChasnosuchknowledgeandthusdoesn’tknowwhichdiffswillbeuseful.Therefore,theinvalidationprotocolcanhelpavoidpropagationofuselessdiffsinsuchasituation.Ofcourse,someextraworkcanbedonetoimprovethediffpropagationinLRC.ExamplesaretheAffinityEntryConsistency(AEC)protocol[23]whichcanpre-sendthemodificationsbasedonLockAcquirerPrediction(LAP),andtheHeuristicDiffAcquiring(HDA)protocol[14]whichadoptstheupdateprotocolforpagesmodifiedincriticalregionsandpiggy-backsthediffsonthelockgrantingmessage.However,theseimprovementsdon’tchangetheconsistencyconditionsinLRCsuchaswhichpagesneedtobemadeuptodateatacquiretime.ThefollowingexamplewillillustratethedifferencebetweenLRCandVCintermsofreducingthefalse-sharingeffectandthedifferencebetweenVCandoptimaldiffpropagation(orconsistency)protocolssuchasAEC.
Fig.7presentsanexampleofthefalsesharingeffectinLRC.Intheexample,dataobjects
and
lieinthesamepage.Processor
modifies
inthefirstcriticalregion
andthenshould
modifies
inthesecondcriticalregion
.Afterexitsfrom
,
entersit.Although
onlyaccessesdataobjectsand,accordingtoLRC,bothpage1andpage2of
22
shared memory
page1
page2
YX
Z
read(Y), write(Z)P1Awrite(X)RARupdate X and Z by invalidatingpage 1 and 2 of P2CR1CR2page fault because of true sharingprogram orderpage fault becauseof false sharingP2Aread(Z), read(Y)Rprogram order
shared memorypage1page2A: acquire R: release: invalidated pageYXZbeinvalidated.Whensharing.
Figure7:FalsesharingeffectinLRC
reads,thereisapagefaultonpage1,whichiscausedbyfalse
shared memory
page1
page2
YXZ
read(Y), write(Z)P1Awrite(X)RARpropagate invalidation noticesfor page 1 and 2 to P2, but onlyupdate Z by invalidatingpage 2 of P2CR1CR2page fault because of true sharingprogram orderP2Aread(Z), read(Y)Rprogram order
shared memorypage1page2A: acquire R: release: invalidated pageYXZFigure8:ReductionofthefalsesharingeffectinVC
Forthesameexample,Fig.8showshowVCremovesthateffectoffalsesharing.VConlyrequiresupdatingthedataobjectsofthecurrentviewofaprocessor.Theviewofonlyincludesdataobjects
in
and
.Since
isnotmodifiedby
,VConlyupdatesby
invalidatingpage2,thoughtheinvalidationnoticeforpage1isalsopropagated(butdisabled)
23
andmaybeeffectivelatertobring
uptodatein
.Byrestrictingtheeffectivescopeofthe
invalidationnoticeforpage1(whichisofnouseforupdating
and
intheviewof
),VC
canavoidthepagefaultonpage1causedbyfalsesharing.Naturally,aswesaidbefore,theupdateprotocolshouldbeadoptedtomaketheimplementationofVCmoreefficientwiththeknowledgeofviews.
Aswementionedbefore,LRCcanbeimprovedbyadoptingmorecomplicateddiffpropaga-tionprotocols,suchasAEC[23]andHDA[14].AECpre-sendsdiffsfromcriticalregionstotheprocessorsthatwilllikelyenterthesamecriticalregionsbasedonLockAcquirerPrediction(LAP).FortheexampleinFig.7,AECmaybeabletopredictthat
willenter
andthus
thediffofpage2canbepiggy-backedonthelockreleasemessageandappliedtopage2before
isentering
.Inthisway,thenumberofpagefaultsanddiffrequest/replymessagescan
bereduced.Similarly,HDAuseshistoryrecordstopre-sendlikely-usefuldiffstoonlythenextprocessorenteringthesamecriticalregion.ThoughtheperformanceofLRCcanbeimprovedbytheaboveoptimaldiffpropagationprotocols,theydon’tchangetheconsistencyconditionsofLRC.Thisistheessentialdifferencebetweenconsistencymodelsanddiffpropagation(orconsistency)protocols.Therefore,inFig7,theinvalidationnoticeforpage1isstillrequired(byLRC)tobeeffectiveinAECandHDA.Consequently,thepagefaultonpage1willoccurduetofalsesharinginAECandHDA,whichreflectstheiressentialdifferencefromVC.
shared memory
page1
page2
YX
Z
read(Y), write(Z)P1Awrite(X)RARpropagate invalidation noticesfor page 1 and 2 to P2, but onlyupdate Z by invalidatingpage 2 of P2CR1CR2page fault because of true sharingprogram orderenable the invalidation notice for page 1 to cause page faultbecause of w/w false sharingP2Aread(Z), write(Y)Rprogram order
shared memorypage1page2A: acquire R: release: invalidated pageYXZFigure9:Thew/wfalsesharingeffectintheVCimplementation
24
Therearetwokindsoffalse-sharingeffect,write/read(w/r)andwrite/write(w/w).Aw/rfalse-sharingeffectoccurswhenoneprocessormodifiesashareddataobjectthatliesinthesamememoryconsistencyunit(e.g.,apage)asanothershareddataobject,whileanotherpro-cessorreadstheothershareddataobject.Forexample,thefalse-sharingeffectinFig.7isw/r.Aw/wfalse-sharingeffectoccurswhenoneprocessormodifiesashareddataobjectthatliesinthesamememoryconsistencyunit(e.g.,apage)asanothershareddataobject,whileanotherprocessorwritestotheothershareddataobject.WhileourcurrentimplementationofVCcanreducew/rfalse-sharingeffectasshowninFig.8,itcan’treducethew/wfalse-sharingeffectsinceacoarse-graineddiffmanagementschemeisadopted.Forexample,inFig.9(whichissimilartothescenarioinFigs.7and8,except
writesratherthanreads
),when
writes
onpage1,anyimplementationofVCwillfacetwochoices.Oneistoignorethedisabled
invalidationnoticeforpage1,makeatwinofpage1andmakeadifflater.Thisschemehastodistinguishthediffsfromdifferentviewsandacomplicated,fine-graineddiffmanagementwillbeinevitable.Thebenefitoftheschemeisthereductionofthew/wfalse-sharingeffect.Thealternativechoiceistoenabletheinvalidationnoticeforpage1,bringituptodatethroughthepagefaultbeforemakingitstwin,asshowninFig.9.Whiletheschemesimplifiesdiffmanagement,ittoleratesthew/wfalse-sharingeffect.Weadoptedthelatterschemeinourcurrentimplementation,butwillinvestigatefine-graineddiffmanagementinthefuture.SCcorrectnessofourVCimplementationOurimplementationcannormallyguaranteeSequentialConsistencyfordata-race-freeprograms.Itguaranteesthatpagesthatarepreviouslymodifiedinacriticalregionareupdatedwhenaprocessorisenteringthesamecriticalregion.Thatmeans,pagesthatarenotmodifiedinthecriticalregionarenotputintotheCRVintheviewdetection.Formostdata-race-freeprograms,itistruethatdataobjectsrequiredtobeupdatedandthenaccessedinacriticalregionareonlythosepreviouslymodifiedinthesamecriticalregion.OurVCimplementationcanguaranteeSequentialConsistencyforthosedata-race-freeprograms.However,theremaybecaseswherepagesmodifiedinnon-criticalregionsareaccessedinacriticalregioninadata-race-freeprogram,inwhichcasesourimplementationcan’tguaranteeSequentialConsistency.Forexample,ifwemodifytheprograminFig.5,sothatwhen
dequeuesataskitalsochecksthetaskdatainthecriticalregion.Inthatcase,the
processorwon’treadtheup-to-datetaskdatainthecriticalregion(butwillreadtheup-to-datetaskdatainthefollowingnon-criticalregions)inourimplementation,thoughthereisnodata
25
raceintheprogram.Butweclaimthiskindofprogramisverypeculiar,sincethecriticalregionisdesignedtoprotectonlythetaskqueue,notthetaskdata.Fortunately,wehaven’tfoundthiskindofpeculiarprogramsinourapplicationsyet.
Iftheabovepeculiartypeofprogramneedstobesupported,wecanenhanceourimplemen-tationwithcompile-timeanalysis.Oncetheaboveaccesspatternisfoundatcompiletime,thecompilercandirecttheDSMtoenabletheinvalidationnoticesofallmodifiedpages,whichwillguaranteeSequentialConsistencyforthoseprograms.
ItisworthnotingthattheaboveproblemwithSCcorrectnessofourimplementationdoesn’taffecttheSCcorrectnessofthegenericVCmodel.
6Experimentalevaluation
Inthissection,wepresentanexperimentalevaluationoftheLRCmodelandourimplementa-tionofVC.BothofthesehavebeenimplementedbasedonTreadMarks[3].Theexperimentalplatformconsistsof8PCsrunningRedHatLinux6.1,whichareconnectedbya10MbpsEthernet.EachofthePCshasa500MHzprocessorand128Mbytesofmemory.Thepagesizeinthevirtualmemoryis4KB.
Wechosefiveapplicationsintheexperiment:TSP,QS,BT,WaterandIS.TSP,QS,Water,andISareprovidedbytheTreadMarksresearchgroup.AlltheprogramsarewrittenintheClanguage.TSPistheTravellingSalespersonProblem,whichfindstheminimumcostpaththatstartsatadesignatedcity,passesthrougheveryothercityexactlyonce,andreturnstotheoriginalcity.QSisatask-queuestyleprogram.Itusesarecursivesortingalgorithmthatoperatesbyrepeatedlypartitioninganunsortedinputlistintoapairofunsortedsublists,suchthatalloftheelementsinoneofthesublistsarestrictlygreaterthantheelementsoftheother,andthenputtingthesublistsintoataskqueue.Itrecursivelytakesasublistfromthequeueandsortsthesublist,untilthetaskqueueisempty.BTisanalgorithmthatcreatesafixed-depthbinarytree.Inthealgorithm,multipleprocessesexploreabinarytreetosearchforunexpandednodes.Ifaprocessfindsanunexpandednode,itexpandsthenodeandcreatesnewunexpandednodes.Thealgorithmterminateswhenthefixed-depthbinarytreeisestablished.Waterisamoleculardynamicssimulation.Eachtimestep,theintra-andinter-molecularforcesincidentonamoleculearecomputed.IS(IntegerSort)ranksanunsortedsequenceof
keys.The
26
rankofakeyinasequenceistheindexvaluethatthekeywouldhaveifthesequenceofkeys
weresorted.Allthekeysareintegersintherange[0,
]andthemethodusedisbucket
sort.Theseapplicationsarerepresentativeofbothnumericalcomputing(Water,QSandIS),andsymboliccomputing(TSPandBT).TheperformanceresultsarelistedinTable1.
App
(Sec.)
LRC
1.89
i
1.65
LRC
0.29
i
4.59
LRC
0.16
i
25.73
LRC
10.25
i
19.09
1242812423
511
1143711347
3441
32673330
1044
Model
(Sec.)
962960
937
Diff
RPF
Mesgs
-0
911
58
-0
5301
2
-792
69342
4347
-3
95478
6
ISLRC
108.20
4444
1569
-7965
-
details).
WedesignedtheexperimentinordertodemonstratetheperformancegainofVCatbothconsistencymodelandconsistencyprotocollevels.Wealsocollectedevidenceoffalse-sharingeffectinourapplicationsinordertodemonstratethepotentialperformancegainofVC.Theextraoverheadofviewmaintenanceisalsoindicatedfromourresults.
Table1showstheperformanceresultsforLRCandourVCimplementation.VC
Reqisthenumber
ofmessagesfordiffrequests,RPFisthereductioninpagefaultsduetooverallimprovementinourVCimplementation,RFSisthereductioninpagefaultsduetothereductionofthefalse-sharingeffectintheVCmodel,andMesgsisthetotalnumberofmessages.ThesecondcolumnSeq.Timeistherunningtimeofthesequentialexecutionoftheapplications.ExceptforTSP,thesequentialexecutionoftheapplicationsisfasterthantheirparallelexecutionon8PCs,whichdemonstratestheneedforimprovingtheDSMperformanceonclustercomputers.
VC
i,becausetheyhavenow/rfalse-sharingeffectincriticalregions.Asdiscussed
inSection5,ourimplementationcanonlyreducew/rfalse-sharingeffect.
Inaddition,inthetask-queuestyleprogramQS,theinaccurateNRVscan’thelpourimple-mentationtoreducethefalse-sharingeffect,thoughQShasw/rfalse-sharingeffectinnon-criticalregions.
Forthesetwoapplications,theperformanceofVC
icanimproveits
performanceupto2.4%.Thoughthisperformancegainisnotverysignificant,itcanonlybeachievedattheleveloftheconsistencymodelandcan’tbeachievedbytheoptimaldiffpropagationprotocols[23,14,4]suchasAEC.Thisperformancegainwillbeincreasedifthe
28
w/wfalse-sharingeffectcanbereducedorthedetectedNRVscanbemoreaccurate.ForWater,VC
ihasreducedthefalse-
sharingeffect(reductionofthreepagefaults)inWater,theoverheadofviewmaintenanceovershadowedthebenefit.
VCvs.LRC
InTable1,VCisourimplementationwiththeoptimaldiffpropagationprotocolasdescribedinSection4.2,wheretheupdateprotocolisadoptedforupdatingCRVs.Thisoptimisationisverynaturalinourimplementation,sincetheCRVsarereadilyavailableandpredictthepagestobeverylikelyaccessed.
Table1showsVCoutperformsLRCforallfiveapplicationstested.VChasimprovedtheperformancesignificantlycomparedwithLRC(35%forTSP,35.3%forQS,9%forBT,3.9%forWater,and4.6%forIS).ThenumberofdiffrequestmessagesinVCissignificantlylessthanthatinLRC(97.4%lessinTSP,75.8%lessinQS,35%lessinBT,4.3%lessinWater,and37.6%lessinIS).ThedetectedCRVshavehelpedtheupdateprotocoltoreducethediffrequestmessages.ConsequentlythetotalnumberofmessagesinVChasbeengreatlyreducedcomparedwithLRC.
Themajorreasonfortheaboveimprovementisthatthediffsofthepagestobeaccessedincriticalregions(i.e.,CRVs)arepiggy-backedonthelockreleasemessagesandappliedtothepagesimmediately.Thusthenumberofdiffrequest/replymessagesandtheoverheadofpagefaultsarereducedinthoseapplications.
ItisworthnotingthattheaboveperformancegainduetotheuseoftheupdateprotocolforCRVsmaybeachievedbytheoptimaldiffpropagationprotocols[23,14,4].
False-sharingeffect
Therearetwoapplications,BTandWater,thatdemonstratethereductionofthefalse-sharingeffectachievedinourimplementation.BT,whichuseslockstoprotectnodesinthebinarytree,hasseriousfalse-sharingeffect,whileWaterhaslessfalse-sharingeffect.ForBT,theperformanceimprovementduetoreductionofthefalse-sharingeffectis18.4%ofthetotalimprovement.ThissignificantproportionhasdemonstratedtheadvantageoftheVCmodel
29
overtheconsistencyprotocols,e.g.AEC,intermsoftheprogramswithseriousfalse-sharingeffect.
Togetanimpressionofthefalse-sharingeffectinourapplications,wehaverecordedthetotalnumberofpagefaultsthatareduetothefalse-sharingeffectinsidecriticalregions.InTable1,RFSisalsothenumberofpagefaultsthatareduetothew/rfalse-sharingeffectinsidethecriticalregions;TFSisthenumberofpagefaultsthatareduetoallfalse-sharingeffects(includingw/randw/wfalsesharing)insidethecriticalregions.Theresultsshowthereducedw/rfalse-sharingeffectisonlyasmallportionofthetotalfalse-sharingeffectforhalfofourapplications(0%forTSP,0%forQS,18.2%forBT,and50%forWater)insidethecriticalregions.Fortheapplicationswithseriousfalse-sharingeffect,suchasBT,thereisstillagreatroomforperformanceimprovementwhichcanonlybeachievedattheleveloftheconsistencymodelinVC.
Wewerenotabletocollectevidenceforthefalse-sharingeffectinnon-criticalregions,butwebelieveapplicationssuchasQSandWaterhavesignificantfalse-sharingeffectinnon-criticalregions.
7Conclusions
Inthispaper,wehavepresentedanewRSCmodel,theVCmodel,forDSMsystems.RSCmodelshaverelaxedtheconstraintsonmodification(diff)propagationandexecutionfordata-race-freeprogramsthatareproperlylabelledwhilestillguaranteeingsequentialconsistencyforthoseprograms.Wehavebrieflydiscussedeachofthemintermsofthreeselectiontechniques.AmongpreviousRSCmodels,onlyECandScCcanperformdataselection,butatthepriceofacomplexprogrammerinterfacewhichimposesanextraburdenontheprogrammer.TheVCmodelhasbeenproposedtoremovesuchaburdenbymeansofautomaticviewdetection.ItcanachievedataselectionwithoutprogrammerannotationsandreducesmorefalsesharingeffectthanLRC.Itsonlyconsistencyrequirementisthatallthedataobjectsintheviewofaprocessormustbeupdatedduringviewtransition.Inthisway,itcanachievethemaximumrelaxationontheconsistencyrequirementsandproducesmoreroomforoptimisationinDSMimplementations.Therefore,thegenericVCmodelcanbeusedasaframeworkforfutureresearchonDSMimplementations.
30
WehaveimplementedtheVCmodelbasedonTreadMarks.Techniquesfortheimplementa-tionofVChavebeendiscussedinthispaper.ThedifferencesbetweenourVCimplementationandrelatedworkwereillustratedthroughexamples.PerformanceresultshaveshownthatourVCimplementationgenerallyoutperformstheLRCmodelandhasadvantagesovertheopti-maldiffpropagationprotocols[23,14,4].TheresultshavealsodemonstratedthattheextraoverheadofviewmaintenanceinourVCimplementationisrelativelytrivial.
FurtherresearchshouldbecarriedoutundertheframeworkoftheVCmodel.First,fine-graineddiffmanagementneedstobeinvestigated.Thecurrentimplementationtoleratesthew/wfalsesharingeffectduetotheadoptionofsimplediffmanagement.Moreelaboratediffmanagementmayhelptoreducemoreofthefalse-sharingeffect.Second,mechanismsfordetectionofaccurateNRVsneedtobeinvestigated.Finally,Run-timeandcompile-timetech-niques[11]needtobedevelopedforaccurateviewdetection.
References
[1]S.V.AdveandM.D.Hill:“Aunifiedformalizationoffourshared-memorymodels”,IEEE
TransactionsonParallelandDistributedSystems,4(6),pp.613-624,June1993.[2]S.V.AdveandK.Gharachorloo:“Sharedmemoryconsistencymodels:atutorial”,IEEE
Computer,29(12),pp.66-76,December1996.
[3]C.Amza,etal.:“TreadMarks:Sharedmemorycomputingonnetworksofworkstations”,
IEEEComputer,29(2),pp.18-28,February1996.
[4]C.Amza,etal.:“AdaptiveProtocolsforSoftwareDistributedSharedMemory”,Proc.of
IEEE,SpecialIssueonDistributedSharedMemory,87(3),pp.467-475,March1999.[5]B.N.Bershad,etal.:“SharedMemoryParallelProgrammingwithEntryConsistency
forDistributedMemoryMultiprocessors”,CMUTechnicalReportCMU-CS-91-170,September1991.
[6]B.N.Bershad,etal.:“TheMidwayDistributedSharedMemorySystem”,InProc.of
IEEECOMPCONConference,pp.528-537,1993.
31
[7]J.B.Carter,J.K.Bennett,andW.Zwaenepoel:“Implementationandperformanceof
Munin”,InProc.ofthe13thACMSymposiumonOperatingSystemsPrinciples,pp.152-164,Oct.1991.
[8]J.B.Carter,J.K.Bennett,andW.Zwaenepoel:“Techniquesforreducingconsistency-relatedinformationindistributedsharedmemorysystems,”ACMTransactionsonCom-puterSystems,13(3),pp.205-243,August1995.
[9]P.Dasgupta,etal.:“ThedesignandimplementationoftheCloudsdistributedoperating
system”,ComputingSystemsJournal,3,Winter1990.
[10]M.Dubois,C.Scheurich,andF.A.Briggs:“Memoryaccessbufferinginmultiproces-sors”,InProc.ofthe13thAnnualInternationalSymposiumonComputerArchitecture,pp.434-442,June1986.
[11]S.Dwarkadas,etal.:“AnIntegratedCompile-Time/Run-TimeSoftwareDistributed
SharedMemorySystem”,InProc.oftheSeventhSymposiumonArchitecturalSupportforProgrammingLanguagesandOperatingSystems,Oct.1996.
[12]B.FleischandR.H.Katz:“Mirage:Acoherentdistributedsharedmemorydesign”,In
Proc.ofthe12thACMSymposiumonOperatingSystemsPrinciples,pp.211-223,Dec.1989.
[13]K.Gharachorloo,D.Lenoski,andJ.Laudon:“Memoryconsistencyandeventordering
inscalablesharedmemorymultiprocessors”,InProc.ofthe17thAnnualInternationalSymposiumonComputerArchitecture,pp.15-26,May1990.
[14]Z.Huang,W.-J.Lei,C.Sun,andA.Sattar:“HeuristicDiffAcquiringinLazyRelease
ConsistencyModel”,InProc.of1997AsianComputingScienceConference,LectureNotesinComputerScience1345,pp.98-109,SpringerVerlag,1997.
[15]Z.Huang,C.Sun,andA.Sattar:“Exploringregionallocalityindistributedsharedmem-ory”,InProc.of1998AsianComputingScienceConference,LectureNotesinComputerScience1538,pp.142-156,SpringerVerlag,1998.
32
[16]Z.Huang,C.Sun,M.Purvis,andS.Cranefield:“View-basedConsistencyanditsImple-mentation”,InProc.oftheFirstIEEE/ACMInternationalSymposiumonClusterCom-putingandtheGRID,Brisbane,May2001.
[17]L.Iftode,J.P.SinghandK.Li:“ScopeConsistency:ABridgebetweenReleaseConsis-tencyandEntryConsistency”,InProc.ofthe8thAnnualACMSymposiumonParallelAlgorithmsandArchitectures,1996.
[18]P.Keleher:“LazyReleaseConsistencyforDistributedSharedMemory”,Ph.D.Thesis,
DeptofComputerScience,RiceUniv.,1995.
[19]L.Lamport:“Howtomakeamultiprocessorcomputerthatcorrectlyexecutesmultipro-cessprograms”,IEEETransactionsonComputers,28(9),pp.690-691,September1979.[20]D.Lenoski,etal.:“TheStanfordDASHmultiprocessor”,IEEEComputer,25(3),pp.63-79,March1992.
[21]K.Li,andP.Hudak:“MemoryCoherenceinSharedVirtualMemorySystems”,ACM
Trans.onComputerSystems,Vol.7,pp.321-359,November1989.
[22]D.Mosberger:“Memoryconsistencymodels”,OperatingSystemsReview,17(1),pp.18-26,Jan.1993.
[23]C.B.Seidel,R.Bianchini,andC.L.Amorim:“TheAffinityEntryConsistencyProtocol”,
InProc.ofthe1997InternationalConferenceonParallelProcessing,August1997.[24]C.Sun,Z.Huang,W.-J.Lei,andA.Sattar:“TowardsTransparentSelectiveSequential
ConsistencyinDistributedSharedMemorySystems”,InProc.ofthe18thIEEEInter-nationalConferenceonDistributedComputingSystems,pp.572-581,Amsterdam,May1998.
[25]A.S.Tanenbaum:DistributedOperatingSystems,PrenticeHall,1995.
33
因篇幅问题不能全部显示,请点此查看更多更全内容