Using Sparse Crossbars Within LUT Clusters


Guy Lemieux
Dept. of Electrical and Computer Engineering
University of Toronto
Toronto, Ontario, Canada M5S 3G4
lemieux@eecg.toronto.edu

David Lewis
Dept. of Electrical and Computer Engineering
University of Toronto
Toronto, Ontario, Canada M5S 3G4
lewis@eecg.toronto.edu

ABSTRACT

In FPGAs, the internal connections in a cluster of lookup tables (LUTs) are often fully-connected like a full crossbar. Such a high degree of connectivity makes routing easier, but has significant area overhead. This paper explores the use of sparse crossbars as a switch matrix inside the clusters between the cluster inputs and the LUT inputs. We have reduced the switch densities inside these matrices by 50% or more and saved from 10 to 18% in area with no degradation to critical-path delay. To compensate for the loss of routability, increased compute time and spare cluster inputs are required. Further investigation may yield modest area and delay reductions.

1. INTRODUCTION

A recent trend in FPGA architectural design is to use a clustered architecture, where a number of lookup tables (LUTs) are grouped together to act as the configurable logic block. The motivation for using clusters is manifold: to reduce area, to reduce critical path delay, and to reduce CAD tool runtime [1, 2, 9, 10]. This trend is followed by FPGAs from Xilinx's Virtex and Spartan-II families, as well as Altera's APEX and ACEX products. All of these FPGAs are based on clusters of 4-input lookup tables.

In a clustered architecture, the LUT inputs can be chosen from two sources: 1) a set of shared cluster inputs, which are signals arriving from other clusters via the general purpose routing, or 2) from feedback connections, which are the outputs of LUTs in this cluster. It has been common to assume that these internal cluster connections are fully populated or fully connected, meaning every LUT input can choose any signal from all of the cluster inputs and feedback connections combined. This arrangement can also be viewed as a full crossbar, where a switch or crosspoint exists at the intersection point of every LUT input and every cluster input or feedback connection.

In this paper, it is assumed that the connections within the cluster are made by multiplexers driving the LUT inputs, called LUT input multiplexers. These multiplexers tend to have a large number of inputs and, after including the requisite input buffers and controlling SRAM bits, contribute significantly to FPGA area.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

FPGA 2001, February 11–13, 2001, Monterey, CA. Some typographical errors have been fixed (February 20, 2001).

A clustered FPGA is composed of a number of cluster tiles which are repeated in a simple array pattern during layout. Each tile is complete in that it includes the cluster logic (the flip-flops, LUTs, and LUT input multiplexers) as well as the general routing to interconnect them. Based on an area model stated later in Section 2, the LUT input multiplexers alone can consume 24 to 33% of the transistor area in a cluster tile. A breakdown of the area estimates for a number of such tiles is provided in Table 1.

The significant amount of area required by the LUT input multiplexers motivated the idea of removing switches from the full crossbar, or depopulating it, to result in a sparse crossbar. Naturally, depopulating the cluster raises the following questions:

1. Will depopulation save area, require greater routing area, or create unroutable architectures?
2. Will depopulation reduce or increase routing delays?
3. What amount of depopulation is reasonable?
4. How much area or delay reduction can be attained, if any?
5. What are the other effects of depopulating the cluster?

This paper addresses these questions using an experimental process of mapping benchmark circuits to clustered FPGA architectures and measuring the resulting area and delay characteristics.

1.1 Comparison to Prior Work

The use of fully-connected clusters likely stems from previous work [12] which suggests that inputs of a 4-LUT be fully connected to the routing channel. This provides enough routing flexibility to obtain minimum channel widths in non-clustered architectures, the area metric in use at that time. Since then, clustered architectures have become prevalent, CAD tools have improved, and area metrics have become more detailed.

Reducing the amount of connectivity within the cluster was recently explored using a simple striped switch layout [11]. Rather than modify the router, the T-VPACK packing algorithm was altered in such a way that routability of the cluster was still guaranteed. Unfortunately, the area improvement obtained using this technique was limited to 5% and delays increased up to 30%.

In this work, the packing algorithm was left unchanged. Instead, improved switch patterns were used, spare cluster inputs were added to the cluster, and modifications to the router were made to support these architectural changes. Although these spare inputs contribute to additional area, they also improve routability and reduce channel width requirements. Overall, a net area reduction of up to 18% with no degradation to critical-path delay was obtained.

Architecture (LUT size k)    4          5          6          7          7
Cluster size                 [values not recovered]
Routing area                 6050       6321       6713       [garbled]  12022
  (share of total)           (65.0%)    (56.2%)    (46.9%)    (39.0%)    (34.2%)
Total                        9307       11241      14318      19622      35145

Table 1: Breakdown of cluster tile area. The routing area is an arithmetic average required to route 20 MCNC circuits.

k         LUT size
N         cluster size
I         number of cluster inputs
Ispare    number of additional cluster inputs, used for routing only

Table 2: Cluster organization parameters.

Fcin      cluster input to LUT input density
Fcfb      LUT feedback to LUT input density
Fc        routing channel to cluster input density
Fcout     cluster output to the routing channel density

Table 3: Switch density parameters.

[Figure 1 labels: disjoint S block, Fc, Fcout, Fcin, Fcfb, BLE, single and partitioned switch matrices.]

Figure 1: Details of the cluster tile architecture.

1.2 Tradeoffs

Sparse clusters give the promise of reduced area, but one important tradeoff that must be made to realize this savings is increased routing time. In our experience, an approximate runtime increase of three to four times was observed. This increase may not be tolerable during early prototyping stages when design changes are frequent, but a less costly device could offset this inconvenience when an FPGA design shifts to volume production. Consequently, the premise of this paper is to evaluate the limits of area reduction that can be obtained using a high degree of CAD tool effort.

The remainder of this paper is organized as follows. In Section 2, the FPGA architecture is described along with the area and delay models. Section 3 discusses the experimental methodology and CAD tools used. Section 4 presents the results, and Section 5 concludes.

2. FPGA ARCHITECTURE

This section describes assumptions made about the FPGA architecture and the area and delay models.

2.1 Architectural Model

The architecture used in this study is a symmetrical, island-style FPGA containing interconnected clusters. The basic FPGA tile formed by a cluster and its routing channels is shown in Figure 1. This tile is drawn in a way to suggest a step-and-repeat layout that is possible, with wires on the left edge of one tile lining up with wires on the right edge of the adjacent tile.

One cluster contains N basic logic elements (BLEs), where one BLE contains a k-input LUT and a register. Each cluster has I = k(N+1)/2 primary inputs which are used during packing [2]. As well, a cluster has Ispare additional cluster inputs which are reserved only for routing. These extra inputs are required to improve routability due to the restrictions imposed by sparse clusters. All of these cluster organization parameters are summarized in Table 2. The cluster inputs are assumed to be logically equivalent, but they may connect to only some of the LUT inputs. The cluster input (and output) pins, which connect the cluster to the general routing, are evenly distributed on the four sides of the tile. Later in Section 4.3, we shall partition the cluster inputs into four groups based on which side they are placed.

Within a BLE, the LUT inputs are assumed to be logically equivalent and hence freely permutable. These inputs can select signals from two independent sources: either cluster inputs or feedback connections. The density of switches for these two regions, Fcin and Fcfb, respectively, are independently controlled. These two parameters control the sparseness of switches inside the cluster.

For connections to outside the cluster, the inputs from and outputs to the general routing channels are selected using switch matrices with densities of Fc and Fcout, respectively. The part of the general routing channel that connects to the cluster is commonly referred to as the connection block or C block.

The parameters controlling switch densities inside and outside of the cluster are summarized in Table 3.

Each BLE output directly drives a cluster output and a local feedback connection. The BLE outputs are assumed to be logically equivalent, allowing any function to be placed in any of the BLEs of the cluster. To achieve this output equivalence, every BLE is given the exact same input switch pattern.

To improve routability, the routing tool takes advantage of the input and output equivalences just described. It may also replicate logic onto multiple BLEs in the same cluster, provided there are empty BLEs available.

2.2 Routing Architecture Details

Detailed routing architectural parameters were set to be the same as earlier studies [2, 4]. In the detailed routing architecture, 50% of the tracks are length-4 segments using tri-state buffers, the remaining tracks are length-4 segments using pass transistors, and clocks were assumed to be routed on a global resource. The disjoint switch (S) block was used, so signals entering the routing on track i must remain on that track number until the destination is reached. The number of I/O pads per cluster tile pitch was set to 5 for N = 6, and to 7 for N = 10.

The routing switch sizes (i.e., buffer and pass transistor sizes) and wiring RC properties were computed assuming double minimum-spaced wiring and a fully-populated cluster tile size. For the k = 4, N = 6 architecture, the buffer was 6.1 times the minimum size and the pass transistor was 12.2 times the minimum. The other architectures had larger tile sizes and used buffer sizes of 6.6, 7.6, [value missing], and 11.8. The pass transistor sizes were always chosen to be twice the corresponding buffer size.
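As a rough illustration of the cluster parameters of Section 2.1, the Python sketch below computes the number of cluster inputs I = k(N+1)/2 and counts the crosspoints feeding the LUT inputs for given densities Fcin and Fcfb. The function names and the simple density-times-size crosspoint count are assumptions made for this illustration only; the area model of this paper counts minimum-width transistor areas, not crosspoints.

```python
def cluster_inputs(k, n):
    """Primary cluster inputs I = k(N+1)/2 used during packing."""
    return k * (n + 1) / 2


def lut_input_crosspoints(k, n, i_spare=0, fc_in=1.0, fc_fb=1.0):
    """Approximate crosspoint count in the LUT input switch matrix.

    Assumption: each of the k*N LUT inputs sees a fraction fc_in of the
    I + Ispare cluster inputs and a fraction fc_fb of the N feedbacks.
    """
    sources_in = cluster_inputs(k, n) + i_spare
    per_lut_input = fc_in * sources_in + fc_fb * n
    return k * n * per_lut_input


full = lut_input_crosspoints(4, 6)  # fully populated, no spare inputs
sparse = lut_input_crosspoints(4, 6, i_spare=2, fc_in=0.5, fc_fb=0.5)
print(cluster_inputs(4, 6), full, sparse)
```

Under this simplified count, a k = 4, N = 6 cluster has I = 14, and halving both densities while adding two spare inputs reduces the crosspoints from 480 to 264.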

2.3 Area Model

The area model used in this paper is the same buffer-sharing model used previously [2, 4], with a few minor changes described below. This model is based on the unit area of a minimum-width transistor (T), including the spacing to an adjacent transistor. As mentioned in [4], discussions with FPGA vendors have suggested that this, and not wiring, is the area-limiting factor.

All of the logic structures in the FPGA are modeled, including BLEs, the LUT input multiplexers, and the cluster routing, but not the pad frame. For example, the area contribution of a pass transistor depends on the transistor width, and a buffer chain depends on the number of inverter stages as well as the required drive strength of each stage.

The drive strength requirement for a buffer is based on fan-out and is computed as follows. In general, it is assumed that a size B inverter in a buffer is sufficient to drive another inverter of size 4B, or a total transistor gate width of 8B. However, buffers driving the LUT input multiplexers, i.e., the cluster input buffers, were sized differently. These buffers must drive a larger load created by the many levels of the LUT input multiplexer tree. This load is larger not only due to the depth of the tree, but also because diffusion is being driven. For these buffers, a size B was selected if the first level fan-out of the buffer was loaded by a total diffusion width of 2B, with the exception that drive strength was limited to be at least 7x and at most 25x minimum size. These approximations were made after examining HSPICE results [Ahmed and Wilton, private communication].
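The two sizing rules above can be written as a small Python sketch. It reads "a size B drives a gate width of 8B" as B = load/8 and "a size B drives a diffusion width of 2B" as B = load/2, which is an interpretation of the text rather than the authors' code; the function names are hypothetical, and all widths are in multiples of the minimum transistor width.

```python
def general_buffer_size(gate_width):
    """Inverter size B such that the driven gate width equals 8B."""
    return gate_width / 8.0


def cluster_input_buffer_size(diffusion_width, lo=7.0, hi=25.0):
    """Size B such that the first-level diffusion load equals 2B,
    limited to at least 7x and at most 25x minimum size."""
    return min(hi, max(lo, diffusion_width / 2.0))


for load in (10, 30, 80):  # illustrative loads, not values from the paper
    print(load, cluster_input_buffer_size(load))
```

The clamp means that very small and very large multiplexer loads both map to fixed buffer sizes, so only mid-range loads scale the buffer linearly.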

There were a few additional improvements made to the area [...]

[...]ically for sparse clusters. This version of VPR includes the latest timing-driven packing and placement enhancements [9, 10].

During routing, the minimum channel width required to route, Wmin, was found using a binary search. Afterwards, a final low-stress routing was done with W = 1.3 Wmin tracks to compute area and delay statistics. This procedure models the way FPGAs are actually used; designers are seldom comfortable working on the edge of capacity or routability.

The final low-stress routing actually failed in 34 out of 3980 (0.9%) circuit/architecture combinations, usually due to slow convergence or switch pattern interference. To resolve this, one, two, then three additional tracks were added to the channel. This strategy was sufficient to route all but four of the troublesome cases; the three underlying architectures for these cases were deemed unroutable, so they were abandoned from further consideration in this paper.

Also, if the binary search was unable to find a reasonable minimum channel width (the bound on Wmin was 240) for any of the circuits, the architecture was deemed unroutable and abandoned. Consequently, every architectural result presented in this paper was obtained by routing all of the benchmark circuits.
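The channel-width procedure described above can be sketched in Python as follows. The predicate `can_route` is a hypothetical stand-in for a full routing attempt at a given channel width, and the sketch assumes routability is monotonic in W (if W routes, so does any larger W). Rounding 1.3 Wmin up to a whole track and treating 240 as an inclusive upper bound are also assumptions, not details stated in the text.

```python
def min_channel_width(can_route, w_max=240):
    """Smallest W in [1, w_max] for which can_route(W) is True, else None."""
    if not can_route(w_max):
        return None  # the architecture would be deemed unroutable
    lo, hi = 1, w_max
    while lo < hi:
        mid = (lo + hi) // 2
        if can_route(mid):
            hi = mid        # mid routes, so the minimum is at or below mid
        else:
            lo = mid + 1    # mid fails, so the minimum is above mid
    return lo


def low_stress_width(w_min, extra_tracks=0):
    """ceil(1.3 * w_min) in integer arithmetic, plus optional retry tracks."""
    return -(-13 * w_min // 10) + extra_tracks


w_min = min_channel_width(lambda w: w >= 37)  # toy routability predicate
print(w_min, low_stress_width(w_min))
```

The `extra_tracks` argument mirrors the retry strategy of adding one, two, then three tracks when the low-stress route fails.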

All area and delay results are averages obtained from placing and routing the 20 largest MCNC benchmark circuits [7]. Area is computed as the geometric average of the active FPGA area, which is defined below. The geometric average ensures that the circuits are all weighted equally, independent of the size of the circuit. Delay results are also the geometric average of the critical-path delay for each benchmark circuit.

Active FPGA area is the area, in units of minimum-width transistor areas, of one cluster tile (including its routing) times the number of clusters actually used by the benchmark circuit. This measurement was used in [1, 2] to better distinguish packing efficiency. We have chosen to use the active FPGA area metric here to be consistent with those results.
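A minimal Python sketch of the two definitions above, active FPGA area and its geometric average over benchmarks, is given here. The tile area of 9307 is the k = 4 total from Table 1; the cluster counts are made-up illustrative numbers, not benchmark data, and the function names are hypothetical.

```python
import math


def active_area(tile_area, clusters_used):
    """Tile area (in minimum-width transistor areas) times clusters used."""
    return tile_area * clusters_used


def geometric_mean(values):
    """n-th root of the product, computed in the log domain."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


tile_area = 9307                 # k = 4 tile total from Table 1
clusters = [100, 400, 1600]      # hypothetical cluster counts per circuit
areas = [active_area(tile_area, c) for c in clusters]
print(geometric_mean(areas), sum(areas) / len(areas))
```

With these illustrative counts the geometric average corresponds to the middle circuit (400 clusters), whereas the arithmetic average is pulled toward the largest circuit, which is the weighting effect the text refers to.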

3.2 CAD Tool Enhancements

Originally VPR routed only to cluster input pins because fully-connected clusters could guarantee the routability of cluster inputs and feedback connections. Extensive modifications to VPR were necessary to route sparsely populated clusters. For example, the routing graph, timing graph, and netlist structures had to be altered to accommodate the cluster feedback nets and the location of every BLE sink. As well, other changes were necessary to permit nets to enter a cluster more than once to improve routability.

The switch pattern generator from [8] was integrated into VPR to create the switch patterns for the LUT input multiplexers. This generator first distributes switches to balance the fan-in and fan-out of each wire, usually in a random pattern. A greedy improvement strategy is then followed which roughly maximizes the number of distinct output wires reached by every pair of input wires. To accomplish this, switches are randomly selected, first in pairs, then singly, and moved only if the fan-in/-out constraints are kept and the aforementioned cost improves. Using this technique, the switch patterns within a cluster are individually well-designed.
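To make the two phases of this description concrete, the following Python sketch builds a balanced sparse switch pattern and then applies a greedy improvement. It is not the generator from [8]: the round-robin initial placement, the cost (the sum over all input pairs of the number of distinct outputs the pair reaches), and the use of pairwise swaps only are simplifying assumptions, and all function names are hypothetical.

```python
import itertools
import random


def balanced_pattern(n_in, n_out, fanin):
    """Give every output `fanin` switches, dealing inputs out round-robin."""
    assert fanin <= n_in
    pattern, nxt = [], 0
    for _ in range(n_out):
        pattern.append({(nxt + j) % n_in for j in range(fanin)})
        nxt += fanin
    return pattern  # pattern[o] = set of inputs with a switch onto output o


def fanout(pattern, n_in):
    """Number of outputs each input can reach."""
    return [sum(i in ins for ins in pattern) for i in range(n_in)]


def pair_coverage(pattern, n_in):
    """Sum over input pairs of the distinct outputs reached by the pair."""
    reach = [{o for o, ins in enumerate(pattern) if i in ins}
             for i in range(n_in)]
    return sum(len(reach[a] | reach[b])
               for a, b in itertools.combinations(range(n_in), 2))


def improve(pattern, n_in, tries=2000, seed=0):
    """Greedy pairwise swaps; each swap preserves all fan-in and fan-out."""
    rng = random.Random(seed)
    best = pair_coverage(pattern, n_in)
    for _ in range(tries):
        o1, o2 = rng.sample(range(len(pattern)), 2)
        only1 = sorted(pattern[o1] - pattern[o2])
        only2 = sorted(pattern[o2] - pattern[o1])
        if not only1 or not only2:
            continue
        a, b = rng.choice(only1), rng.choice(only2)
        pattern[o1] ^= {a, b}  # o1 trades its switch on a for one on b
        pattern[o2] ^= {a, b}  # o2 trades its switch on b for one on a
        cost = pair_coverage(pattern, n_in)
        if cost > best:
            best = cost
        else:                  # no improvement: undo the swap
            pattern[o1] ^= {a, b}
            pattern[o2] ^= {a, b}
    return best


p = balanced_pattern(n_in=8, n_out=12, fanin=4)
print(pair_coverage(p, 8), improve(p, 8, tries=500))
```

Swapping two switches between a pair of outputs keeps every output's fan-in and every input's fan-out unchanged, so the balance constraint holds by construction and only the coverage cost needs to be checked.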

Other switch patterns in the routing fabric, namely the cluster input and output patterns, use the original VPR switch placement generators. Additionally, we have not attempted to optimize the cascading of the cluster input multiplexers and LUT input multiplexers, except as noted below in Section 4.3. This extension to the work is nontrivial and left for future investigation.

Tool                   Additional Parameters
T-VPACK                default
VPR binary search      -pres_fac_mult 1.3 -max_router_iterations 250

Notice that the sparse Fc = 1.0 result is missing for N = 9 in Figures 3 and 4 because VPR was unable to route the clma circuit under low-stress conditions due to slow convergence.

[Figure 4 plots: active area (Ts) versus Fc, with marked points A, B, C, and D. One panel has legend entries "sparse 7 2 0 0.5 0.5" and "full 7 2 0 1.0 1.0"; the other has "sparse 7 9 0 0.5 0.5" and "full 7 9 0 1.0 1.0".]

Figure 4: Fc impact on area for cluster sizes of 2 and 9. Intermediate cluster sizes gave similar results.

[...] the sparse and fully populated cluster results are so similar. This can be partly attributed to the relative flatness near the minimum area. For N = 9, varying Fc from 0.1 to 0.5 causes less than 5% change in area. Hence, precise Fc selection is not critical, provided it is large enough to be routable, yet not wastefully large.

For the remainder of the results in this paper, it was determined that a fixed value of Fc would not significantly hinder area results. Rather than using the minimum-area Fc values from Figure 5, we felt that having a few more switches in the routing (by having a slightly larger Fc) would be helpful as clusters were made even more sparse (internally). This is especially important because no effort was made to tune the two switch patterns together and we wished to avoid possible interference patterns. Hence, we chose to set Fc = 0.5 for the N = 6 architectures and Fc = 0.366 for the k = 7, N = 10 architecture. These particular values were chosen because they were used in previous work [2, 4] and this gives us the most comparable results.

[Figure 5 plot: minimum-area Fc versus I (number of cluster inputs). Legend entries: "sparse 7 X 0 0.5 0.5" and "full 7 X 0 1.0 1.0".]

4.2.2 Selecting Fcout

Previous experiments have shown that Fcout = 1/N is adequate for routing in fully populated architectures [4]. Considering the similarity of the Fc area results between sparse and fully populated architectures, it was decided that modifying Fcout would have insignificant impact in a sparsely connected architecture. Hence, Fcout = 1/N was used for all results.

Figure 5: Best Fc corresponding to minimum area as a function of I cluster inputs.

4.3 Partitioning of Cluster Inputs

Additional net delay can be caused by sparsely populated clusters because some LUT inputs may not be reachable from particular sides of the cluster. For example, consider the case when some LUT input connections have already been formed, and the last remaining input signal is being made. A lack of switches inside the cluster may cause that net to enter the cluster from a more distant side. The result is increased delay.

We investigated this problem by trying a single switch matrix for all cluster inputs, and one which was partitioned into four smaller switch matrices, one for each input side. The partitioned matrix addresses the above problem by ensuring that all of the cluster inputs from any particular side can reach all of the LUT inputs. It also has a weakness though: these smaller switch matrices are not carefully designed to couple together well. Each partitioned matrix is derived from the same basic switch pattern, but each has its own permutation of the rows (or outputs) to balance the fan-in of the LUT inputs. These matrices, but not the permutation pattern, are illustrated in Figure 1.

Both switch designs were routed in a k = 7, N = 10, Fcin = Fcfb = 0.43 architecture. Both designs required identical transistor area, and the partitioned matrix was only about 1% faster. Although this is not significantly faster, it was used for subsequent results in this paper since it may help with some pathological cases.

4.4 Sparse Cluster Area Results

The primary motivation for depopulating clusters is to reduce the area, and subsequently the cost, of an FPGA. In Section 4.2, it was determined that simply depopulating the cluster to 50% is more effective at reducing area than choosing the proper value of Fc. In this section, further depopulation of the cluster is explored.

To reduce the number of routing experiments, it was decided to fix the cluster size to N = 6 and vary the LUT sizes from 4 through 7. That particular cluster size was selected because it generated near-minimum area and area-delay results for fully populated clusters with all of these LUT sizes. The larger LUT sizes are especially interesting because they require larger input switch matrices, hence offering more potential for depopulation. One additional architecture with k = 7, N = 10 was chosen to study an even larger number of inputs entering the cluster.

[Figure 6 plots: four panels, one per LUT size (k = 4, 5, 6, 7) with N = 6, each showing active area (Ts) versus Ispare, with one curve per Fcin value and one for the fully populated cluster.]

Figure 6: Active FPGA area of fully and sparsely populated clusters.

A number of preliminary routing experiments were run with a wide range of values for Fcin and Fcfb. From these results, which are not shown here, it was confirmed that Fcfb has less influence on area. As Fcfb was reduced below 50%, a number of circuits would no longer route. It was determined that Fcfb of 50% (or 3/7 = 43% for k = 7) was as low a value as could be tolerated. Similar preliminary sweeps indicated that Fcin = 0.5 was nearly always routable, so area reduction should concentrate on more sparse values.

The area results from routing the four LUT sizes are shown in Figure 6. In these graphs, each curve represents the geometric average of active FPGA area for a fixed value of Fcin. The number of spare inputs is varied along the x-axis. The sparse cluster results should be compared against the bold curve representing the fully-populated cluster area.

The most apparent trend in these curves is a gentle dip, then a general upward climb in area as Ispare is increased. The upward trend is an expected result, since the spare inputs will require additional cluster input multiplexers. The dip is caused by a rapid initial decline in average channel width, which then gradually reaches a 5% to 20% reduction (10% is typical).

A number of data points are missing in Figure 6, specifically for small Ispare values. This is because one or more benchmark circuits could not be routed on the architecture. Hence, although they contribute to area reduction in only a few cases, it is essential to have these spare inputs to make sparse clusters routable. Typically, between two and five spare inputs are required to make the architecture routable and attain the lowest area.

The lowest-area architectures from Figure 6 are summarized in Table 5. As well, the large N = 10 cluster architecture is included. With these architectures, a 10 to 18% area savings is achieved. As mentioned earlier, between two and five spare inputs is sufficient to achieve most of this savings, which is surprising since this is only about one spare input per side.

A breakdown of the cluster tile area is given in Table 6. For 4-input LUTs, there was a slight decrease in routing area because the spare inputs helped reduce average channel width. The 5- and 6-input LUT cases did not achieve the same benefit because the spare inputs contributed more to area than the amount saved by the slight channel width reduction. The two 7-LUT architectures had an increase in routing area due to the spare inputs and a channel width increase. However, the sparse switch populations produced a net area savings of 14% and 18%, with the larger cluster benefitting more. With respect to the entire tile, depopulating the clusters was very effective at reducing the relative LUT input multiplexer size from the 24–33% range down to 12–18%.

One very interesting result from this data is that a sparse cluster of six 6-input LUTs is slightly more area-efficient (3%) than six 4-LUTs in a sparse cluster. This is a departure from previous work which has consistently shown that 4-LUTs achieve lower area, albeit in fully populated clusters. The reason for this difference is simple: larger LUTs provide more opportunity for depopulation. This concept is supported by previous work which has shown that sparse crossbars with more outputs require fewer switches for the same level of routability [8].

Architecture N                                  6       6       6       6       10
Architecture Fc                                 0.5     0.5     0.5     0.5     0.366
Ispare                                          2       2       2       5       10
Fcfb                                            0.5     0.5     0.5     0.43    0.43
Channel width (arith. avg.), fully populated    47.9    46.4    44.3    43.8    53.7
Channel width (arith. avg.), best sparse        [values garbled in source]

[Table 5; the caption and remaining columns were not recovered.]

[Table 6 body: the column layout was lost. Headings: Architecture k; Tile Area (Number of Minimum-Width Transistor Areas); Best-Area Sparse Cluster. Surviving entries, in extraction order: 9307, 1840, 14318, 6831, 35145, 12022, 6713, 5146 (26.2%), 16879, 11358, 6050, 3080 (27.4%), 3496, 8120, 4298 (15.0%), 990, 6371, 2115 (17.1%), 1430 (17.1%).]

Table 6: Breakdown of cluster tile area. The routing area is an arithmetic average for all circuits.

[Figure 7 plot: delay (s) versus LUT size k.]

Figure 7: Delay decreases with LUT size.

[Figure 8 plot: delay (s) versus Fcin, with curves for k = 4, 5, 6, and 7.]

Figure 8: Delay is not influenced by Fcin. Similar results indicate it is not influenced by Ispare or Fcfb.

4.5 Sparse Cluster Delay Results

As mentioned earlier, reduced switch densities may cause an increase in delay due to an increase in bends or wire use to achieve routability. Although delay may decrease for other reasons such as reduced loading, we chose to be conservative and ignore these possible benefits.

The curves in Figure 7 show the impact that varying the LUT size has on delay for a few of the N = 6 architectures. The curve labels identifying the architectures have been omitted for clarity, since only trends need to be observed. The important thing to notice is that, for all architectures, delay goes down as k increases.

Similarly, Figure 8 shows the change in delay as the switch density Fcin is varied. It is apparent in the graph that curves of the same LUT size are all grouped together. In particular, the 4- and 5-LUT data is easily distinguished from the 6- and 7-LUT data. The flatness of all of these curves illustrates how little impact Fcin has on delay.

Analysis of delay while varying Ispare or Fcfb shows the same result: delay is independent of these parameters. Even though sparse clusters present a challenge to the router and remove many choices, and even though some feedback connections must leave the cluster and re-enter through the general-purpose routing, the router still has enough freedom to ensure that nets on the critical path remain on the fastest paths to the critical sinks.

4.6 Sparse Cluster Area-Delay Product

The previous two sections presented results indicating the 6-LUT had the lowest area and the 7-LUT had the lowest delay. When the area and delay results are combined in the form of an area-delay product, the 6-LUT emerges as the superior logic block choice. This metric is important because it indicates when the best tradeoff is being made between using an additional amount of area for a similar relative gain in clock rate (or vice versa). For example, it is directly useful in FPGA-based computation because the computation rate is a product of both the clock rate and parallelism.

The best sparse area-delay product organizations are compared to their fully-populated versions in Figure 9. The area-delay product improves for every LUT size due to the area reduction. The overall best sparse architecture containing 6-LUTs is about 14% more efficient than one containing 4-LUTs, and about 22% more efficient than the traditional fully-populated 4-LUT cluster.

[Figure 9 plots: area·delay (T·ns) versus Ispare in two panels, "Fully Populated Cluster" and "Best-Area Sparse Cluster". Legend entries in the sparse panel: "4 6 X 0.50 0.50", "5 6 X 0.40 0.50", "6 6 X 0.33 0.50", "7 6 X 0.14 0.43", "7 10 X 0.14 0.43"; the fully populated panel uses "1.0 1.0" for the same architectures.]

Figure 9: Area-delay product results for fully-populated and best-area sparse architectures.

4.7 Routing Runtime with Sparse Clusters

The removal of switches inside the cluster also removes the routability guarantee of the cluster. Consequently, the router must pay attention to all of the wires and switches within the cluster, so it is expected that additional runtime effort is required to complete the route.

The average runtime and average number of iterations required for routing the different architectures are shown in Table 7. Results are presented for fully populated clusters to compare the original VPR 4.30 to the modified one. As well, the modified VPR can be compared against itself to study the additional impact of routing the best-area sparse clusters.

Generally, the modified VPR currently runs about three to four times slower than the original version when fully populated clusters are used. Even though runtime has increased, the number of router iterations used is practically unchanged. The main reason for the slowdown comes from the increased number of wires and switches in the architecture that must be examined with each iteration: all cluster inputs now have connections to many LUT inputs, and nets are allowed to enter a cluster more than once. This causes the router to evaluate many more routing paths before making a decision. It is worthwhile to note that having larger LUT sizes and cluster sizes reduces the amount of work that VPR 4.30 must do, so runtime decreases. This benefit was not realized in the modified VPR because the amount of wiring inside the cluster also increases, keeping runtime relatively flat.

[Table 7 body: the column layout was lost. Headings: Average Runtime (seconds); Fully Populated. Surviving entries, in extraction order: 4357, 188, 275, 70, 183, 178, 116, 15086.]

Table 7: Average runtime and number of routing iterations for the final low-stress route (arithmetic averages of 20 benchmarks). Runtimes were collected on an 866 MHz Pentium III computer with 512 MB of SDRAM.

The additional runtime needed to route the best-area sparse architectures is also shown in Table 7. For k = 4, 5, 6 the runtime and the number of iterations are similar; for k = 7 runtime nearly doubled and the number of iterations increased by 25–30%. This increase in the average is caused by a large increase in four of the normally difficult-to-route circuits. The need for more router iterations indicates these architectures are barely routable, probably because Fcin is so low, even though these circuits are being routed using the low-stress channel width.

Increasing routability by increasing Ispare to 15 for the k = 7, N = 10 architecture reduced runtime to 210 seconds and 97 iterations. Hence, the amount of area savings can also be balanced against the runtime effort.

5. CONCLUSIONS

This work has studied the area and delay impact of sparsely populating the internal cluster connections in a clustered architecture. At the expense of three to four times the compute time, an area savings of 10 to over 14% was realized by sparsely populating the cluster internals of 4-, 5-, 6-, and 7-input LUT architectures containing 6 LUTs per cluster. A larger cluster size of ten 7-LUTs obtained an 18% area savings. It was also observed that the additional router effort and reduced routing flexibility did not degrade critical-path delay.

A fixed number of spare inputs were added to each cluster. These inputs are used only by routing, and are not used or required for packing. By adding up to 15 spare inputs, the channel width decreased by about 10% in most architectures, whether full or sparsely populated. Although sparse clusters on their own impose a small increase in channel width, the spare inputs reduce the channel width, resulting in a small, net savings.

The channel width reduction typically produced a net savings in routing area alone when up to seven spare inputs were added, but resulted in a net increase thereafter. Of course, the cluster area (excluding the routing) always increased with the addition of spare inputs. However, this area increased at a slower rate in more sparsely populated clusters, as expected. When added to the routing area, most architectures became less efficient after more than five spare inputs were employed.

The increase in routability and decreases in channel width and area indicate that it is best to force the packing algorithm to leave a few spare inputs (two or three) for the router.

One interesting outcome of this work is that, contrary to popular belief, it is more area-efficient to depopulate only the LUT input multiplexers than it is to depopulate only the cluster input multiplexers (i.e., the C blocks) in the general routing. The reason for this is that, due to input sharing in a cluster, there are about twice as many LUT input multiplexers as cluster input multiplexers. Of course, depopulating both regions provides even more savings. Another interesting observation is that 6-LUTs become more area efficient than 4-LUTs when sparse clusters are employed. This was entirely attributable to the more sparse pattern that could be used in the 6-LUT case.

The area and delay results in this paper used conservative estimates and ignored secondary effects which would improve results further. In particular, the tile size and the subsequent routing switch size reduction from sparse cluster use should lead to additional area and delay reduction. Delay improvement may also come from reduced loading inside the cluster and by generally using larger cluster sizes, which are more area-efficient when using sparse clusters. It is reasonable to expect that larger cluster sizes may produce an even larger area savings due to the large amount of area concentrated in the LUT input multiplexers.

Future work in this area will include effort to jointly design the LUT input switch matrices with the cluster input multiplexers to avoid switch pattern interference. Additional constraints such as carry chains or other local routing may impact sparse cluster design and should be evaluated. A wider variety of cluster sizes, particularly the effectiveness of large clusters, should also be explored. The area savings from sparsely populated clusters will reduce tile size, but the subsequent area and delay reduction from using smaller routing switches should also be quantified. The delay improvements arising from reduced loading and larger cluster sizes should be investigated. Also, efforts should be made to improve the runtime of the router while still retaining the area savings.

An interesting extension of this work would involve tighter coupling with the packing stage. For example, under special circumstances, it may be reasonable to have the packing tool use the spare inputs reserved for routing. Before doing this, it could first do a routability test to verify whether the potential cluster of logic blocks is routable. Since this shouldn't be a common case, it can be done with reasonable CPU effort. This may increase the usefulness of the FPGA architecture for subcircuits which have wide fan-in (or poor input sharing), such as finite state machines.

6. ACKNOWLEDGEMENTS

The authors wish to thank Elias Ahmed, Mike Sheng, and Steve Wilton for HSPICE timing results and helpful discussions.

7. REFERENCES

[1] E. Ahmed. The effect of logic block granularity on deep-submicron FPGA performance and density. Master's thesis, Department of Electrical and Computer Engineering, University of Toronto, 2001.

[2] E. Ahmed and J. Rose. The effect of LUT and cluster size on deep-submicron FPGA performance and density. In ACM/SIGDA Int. Symp. on FPGAs, pages 3–12, 2000.

[3] V. Betz and J. Rose. VPR: A new packing, placement and routing tool for FPGA research. In Field-Programmable Logic, pages 213–222, 1997.

[4] V. Betz, J. Rose, and A. Marquardt. Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic Publishers, Boston, 1999.

[5] J. Cong and Y. Ding. FlowMap: An optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs. IEEE Transactions on Computer-Aided Design, pages 1–12, January 1994.

[6] W. Elmore. The transient response of damped linear networks with particular regard to wideband amplifiers. Journal of Applied Physics, pages 55–63, January 1948.

[7] C. B. Laboratory. LGSynth93 suite. http://www.cbl.ncsu.edu/www/.

[8] G. Lemieux, P. Leventis, and D. Lewis. Generating highly-routable sparse crossbars for PLDs. In ACM/SIGDA Int. Symp. on FPGAs, pages 155–1, Monterey, CA, February 2000.

[9] A. Marquardt, V. Betz, and J. Rose. Using cluster-based logic blocks and timing-driven packing to improve FPGA speed and density. In ACM/SIGDA Int. Symp. on FPGAs, pages 37–46, 1999.

[10] A. Marquardt, V. Betz, and J. Rose. Timing-driven placement for FPGAs. In ACM/SIGDA Int. Symp. on FPGAs, pages 203–213, 2000.

[11] M. I. Masud. FPGA routing structures: A novel switch block and depopulated interconnect matrix architectures. Master's thesis, Department of Electrical and Computer Engineering, University of British Columbia, December 1999.

[12] J. Rose and S. Brown. Flexibility of interconnection structures in field-programmable gate arrays. IEEE Journal of Solid State Circuits, 26(3):277–282, March 1991.

[13] E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P. R. Stephan, R. K. Brayton, and A. Sangiovanni-Vincentelli. SIS: A system for sequential circuit analysis. Technical Report UCB/ERL M92/41, University of California, Berkeley, May 1992.

[14] M. Sheng and J. Rose. Mixing buffers and pass transistors in FPGA routing architectures. In ACM/SIGDA Int. Symp. on FPGAs, 2001.
