Guy Lemieux
Dept. of Electrical and Computer Engineering
University of Toronto
Toronto, Ontario, Canada M5S 3G4
lemieux@eecg.toronto.edu

ABSTRACT
In FPGAs, the internal connections in a cluster of lookup tables (LUTs) are often fully-connected like a full crossbar. Such a high degree of connectivity makes routing easier, but has significant area overhead. This paper explores the use of sparse crossbars as a switch matrix inside the clusters between the cluster inputs and the LUT inputs. We have reduced the switch densities inside these matrices by 50% or more and saved from 10 to 18% in area with no degradation to critical-path delay. To compensate for the loss of routability, increased compute time and spare cluster inputs are required. Further investigation may yield modest area and delay reductions.
1. INTRODUCTION
A recent trend in FPGA architectural design is to use a clustered architecture, where a number of lookup tables (LUTs) are grouped together to act as the configurable logic block. The motivation for using clusters is manifold: to reduce area, to reduce critical path delay, and to reduce CAD tool runtime [1, 2, 9, 10]. This trend is followed by FPGAs from Xilinx's Virtex and Spartan-II families, as well as Altera's APEX and ACEX products. All of these FPGAs are based on clusters of 4-input lookup tables.
In a clustered architecture, the LUT inputs can be chosen from two sources: 1) a set of shared cluster inputs, which are signals arriving from other clusters via the general purpose routing, or 2) from feedback connections, which are the outputs of LUTs in this cluster. It has been common to assume that these internal cluster connections are fully populated or fully connected, meaning every LUT input can choose any signal from all of the cluster inputs and feedback connections combined. This arrangement can also be viewed as a full crossbar, where a switch or crosspoint exists at the intersection point of every LUT input and every cluster input or feedback connection.
In this paper, it is assumed that the connections within the cluster are made by multiplexers driving the LUT inputs, called LUT input multiplexers. These multiplexers tend to have a large number of inputs and, after including the requisite input buffers and controlling SRAM bits, contribute significantly to FPGA area.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
FPGA 2001, February 11–13, 2001, Monterey, CA. Some typographical errors have been fixed (February 20, 2001).
Copyright 2001 ACM 1-58113-341-3/01/0002 ...$5.00
David Lewis
Dept. of Electrical and Computer Engineering
University of Toronto
Toronto, Ontario, Canada M5S 3G4
lewis@eecg.toronto.edu
A clustered FPGA is composed of a number of cluster tiles which are repeated in a simple array pattern during layout. Each tile is complete in that it includes the cluster logic (the flip-flops, LUTs, and LUT input multiplexers) as well as the general routing to interconnect them. Based on an area model stated later in Section 2, the LUT input multiplexers alone can consume 24 to 33% of the transistor area in a cluster tile. A breakdown of the area estimates for a number of such tiles is provided in Table 1.
The significant amount of area required by the LUT input multiplexers motivated the idea of removing switches from the full crossbar, or depopulating it, to result in a sparse crossbar. Naturally, depopulating the cluster raises the following questions:
1. Will depopulation save area, require greater routing area, or create unroutable architectures?
2. Will depopulation reduce or increase routing delays?
3. What amount of depopulation is reasonable?
4. How much area or delay reduction can be attained, if any?
5. What are the other effects of depopulating the cluster?
This paper addresses these questions using an experimental process of mapping benchmark circuits to clustered FPGA architectures and measuring the resulting area and delay characteristics.
1.1 Comparison to Prior Work
The use of fully-connected clusters likely stems from previous work [12] which suggests that inputs of a 4-LUT be fully connected to the routing channel. This provides enough routing flexibility to obtain minimum channel widths in non-clustered architectures, the area metric in use at that time. Since then, clustered architectures have become prevalent, CAD tools have improved, and area metrics have become more detailed.
Reducing the amount of connectivity within the cluster was recently explored using a simple striped switch layout [11]. Rather than modify the router, the T-VPACK packing algorithm was altered in such a way that routability of the cluster was still guaranteed. Unfortunately, the area improvement obtained using this technique was limited to 5% and delays increased up to 30%.
In this work, the packing algorithm was left unchanged. Instead, improved switch patterns were used, spare cluster inputs were added to the cluster, and modifications to the router were made to support these architectural changes. Although these spare inputs contribute to additional area, they also improve routability and reduce channel width requirements. Overall, a net area reduction of up to 18% with no degradation to critical-path delay was obtained.
Architecture
  LUT size k          4        5        6        7        7
  Cluster size N      6        6        6        6       10
Routing area       6050     6321     6713     75…    12022
                (65.0%)  (56.2%)  (46.9%)  (39.0%)  (34.2%)
Total              9307    11241    14318    19622    35145
Table 1: Breakdown of cluster tile area. The routing area is an arithmetic average required to route 20 MCNC circuits.
k        LUT size
N        cluster size
I        number of cluster inputs
Ispare   number of additional cluster inputs, used for routing only
Table 2: Cluster organization parameters.

Fcin     cluster input to LUT input density
Fcfb     LUT feedback to LUT input density
Fc       routing channel to cluster input density
Fcout    cluster output to the routing channel density
Table 3: Switch density parameters.

[Figure 1 diagram: cluster tile with BLEs, disjoint S block, single and partitioned LUT input switch matrices, and the switch regions Fc, Fcin, Fcfb and Fcout.]
Figure 1: Details of the cluster tile architecture.
1.2 Tradeoffs
Sparse clusters give the promise of reduced area, but one important tradeoff that must be made to realize this savings is increased routing time. In our experience, an approximate runtime increase of three to four times was observed. This increase may not be tolerable during early prototyping stages when design changes are frequent, but a less costly device could offset this inconvenience when an FPGA design shifts to volume production. Consequently, the premise of this paper is to evaluate the limits of area reduction that can be obtained using a high degree of CAD tool effort.
The remainder of this paper is organized as follows. In Section 2, the FPGA architecture is described along with the area and delay models. Section 3 discusses the experimental methodology and CAD tools used. Section 4 presents the results, and Section 5 concludes.
formed by a cluster and its routing channels is shown in Figure 1. This tile is drawn in a way to suggest a step-and-repeat layout that is possible, with wires on the left edge of one tile lining up with wires on the right edge of the adjacent tile.
One cluster contains N basic logic elements (BLEs), where one BLE contains a k-input LUT and a register. Each cluster has I = k(N + 1)/2 primary inputs which are used during packing [2]. As well, a cluster has Ispare additional cluster inputs which are reserved only for routing. These extra inputs are required to improve routability due to the restrictions imposed by sparse clusters. All of these cluster organization parameters are summarized in Table 2. The cluster inputs are assumed to be logically equivalent, but they may connect to only some of the LUT inputs. The cluster input (and output) pins, which connect the cluster to the general routing, are evenly distributed on the four sides of the tile. Later in Section 4.3, we shall partition the cluster inputs into four groups based on which side they are placed.
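As a quick illustration of the input-count rule, the following minimal sketch evaluates it for the five architectures studied later. It assumes the rule reads I = k(N + 1)/2, as in [2]; the function name `cluster_inputs` is ours, and how fractional values are rounded is not stated in this section, so the raw value is returned.

```python
def cluster_inputs(k: int, n: int) -> float:
    """Primary cluster inputs I = k(N + 1)/2 for N BLEs with k-input LUTs.

    Assumption: the raw (possibly fractional) value is returned, because the
    rounding convention is not given in the text.
    """
    return k * (n + 1) / 2


if __name__ == "__main__":
    # (k, N) pairs for the architectures studied in the paper.
    for k, n in [(4, 6), (5, 6), (6, 6), (7, 6), (7, 10)]:
        print(f"k={k} N={n} I={cluster_inputs(k, n)}")
```

The spare inputs Ispare are added on top of this count and are left out of the sketch because they are a free parameter of the experiments.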
2.2 Routing Architecture Details
Detailed routing architectural parameters were set to be the same as earlier studies [2, 4]. In the detailed routing architecture, 50% of the tracks are length-4 segments using tri-state buffers, the remaining tracks are length-4 segments using pass transistors, and clocks were assumed to be routed on a global resource. The disjoint switch (S) block was used, so signals entering the routing on track i must remain on that track number until the destination is reached. The number of I/O pads per cluster tile pitch was set to 5 for N = 6, and to 7 for N = 10.
The routing switch sizes (i.e., buffer and pass transistor sizes) and wiring RC properties were computed assuming double minimum-spaced wiring and a fully-populated cluster tile size. For the k = 4, N = 6 architecture, the buffer was 6.1 times the minimum
2. FPGA ARCHITECTURE
This section describes assumptions made about the FPGA architecture and the area and delay models.
2.1 Architectural Model
The architecture used in this study is a symmetrical, island-style FPGA containing interconnected clusters. The basic FPGA tile
size and the pass transistor was 12.2 times the minimum. The other architectures had larger tile sizes and used buffer sizes of 6.6, 7.6, [...], and 11.8. The pass transistor sizes were always chosen to be twice the corresponding buffer size.
Within a BLE, the LUT inputs are assumed to be logically equivalent and hence freely permutable. These inputs can select signals from two independent sources: either cluster inputs or feedback connections. The density of switches for these two regions, Fcin and Fcfb, respectively, are independently controlled. These two parameters control the sparseness of switches inside the cluster.
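To make the effect of the two densities concrete, here is a rough sketch that counts crosspoints in the LUT input switch matrix. It assumes each density is simply the fraction of possible crosspoints that are populated in its region; the function name and the example values are illustrative and not taken from the paper's area model.

```python
def lut_input_mux_switches(k: int, n: int, cluster_inputs: int,
                           fc_in: float = 1.0, fc_fb: float = 1.0) -> int:
    """Approximate number of crosspoints feeding the k*N LUT inputs.

    cluster_inputs counts every cluster input pin (I plus any spares).
    Assumption: a density is the populated fraction of its region.
    """
    lut_inputs = k * n
    per_lut_input = fc_in * cluster_inputs + fc_fb * n
    return round(lut_inputs * per_lut_input)


if __name__ == "__main__":
    # Hypothetical example: k = 4, N = 6, 14 primary inputs plus 2 spares.
    full = lut_input_mux_switches(4, 6, 16)
    sparse = lut_input_mux_switches(4, 6, 16, fc_in=0.5, fc_fb=0.5)
    print(full, sparse)
```

Under these assumptions, halving both densities halves the crosspoint count; the transistor area saved also depends on buffers and SRAM bits, which this sketch does not model.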
For connections to outside the cluster, the inputs from and outputs to the general routing channels are selected using switch matrices with densities of Fc and Fcout, respectively. The part of the general routing channel that connects to the cluster is commonly referred to as the connection block or C block.
The parameters controlling switch densities inside and outside of the cluster are summarized in Table 3.
Each BLE output directly drives a cluster output and a local feedback connection. The BLE outputs are assumed to be logically equivalent, allowing any function to be placed in any of the BLEs of the cluster. To achieve this output equivalence, every BLE is given the exact same input switch pattern.
To improve routability, the routing tool takes advantage of the input and output equivalences just described. It may also replicate logic onto multiple BLEs in the same cluster, provided there are empty BLEs available.
2.3 Area Model
The area model used in this paper is the same buffer-sharing model used previously [2, 4], with a few minor changes described below. This model is based on the unit area of a minimum-width transistor (T), including the spacing to an adjacent transistor. As mentioned in [4], discussions with FPGA vendors have suggested that this, and not wiring, is the area-limiting factor.
All of the logic structures in the FPGA are modeled, including BLEs, the LUT input multiplexers, and the cluster routing, but not the pad frame. For example, the area contribution of a pass transistor depends on the transistor width, and a buffer chain depends on the number of inverter stages as well as the required drive strength of each stage.
The drive strength requirement for a buffer is based on fan-out and is computed as follows. In general, it is assumed that a size B inverter in a buffer is sufficient to drive another inverter of size 4B, or a total transistor gate width of 8B. However, buffers driving the LUT input multiplexers, i.e., the cluster input buffers, were sized differently. These buffers must drive a larger load created by the many levels of the LUT input multiplexer tree. This load is larger not only due to the depth of the tree, but also because diffusion is being driven. For these buffers, a size B was selected if the first level fan-out of the buffer was loaded by a total diffusion width of 2B, with the exception that drive strength was limited to be at least 7x and at most 25x minimum size. These approximations were made after examining HSPICE results [Ahmed and Wilton, private communication].
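The two sizing rules can be written down as a small sketch. It reads the text literally: a general buffer of size B drives a gate width of 8B, and a cluster input buffer of size B drives a first-level diffusion width of 2B, clamped to the range 7x to 25x. The function names are ours, and widths are expressed in multiples of the minimum transistor width.

```python
def general_buffer_size(gate_width: float) -> float:
    """Size B such that the driven transistor gate width equals 8B."""
    return gate_width / 8.0


def cluster_input_buffer_size(diffusion_width: float) -> float:
    """Size B such that the first-level diffusion load equals 2B,
    limited to at least 7x and at most 25x minimum size."""
    unclamped = diffusion_width / 2.0
    return min(25.0, max(7.0, unclamped))


if __name__ == "__main__":
    # Hypothetical first-level diffusion loads, in minimum widths.
    for load in (4, 30, 80):
        print(load, cluster_input_buffer_size(load))
```

A sparser switch matrix reduces the first-level fan-out of each cluster input buffer, so under this rule its buffer can shrink until the 7x floor is reached.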
There were a few additional improvements made to the area
ically for sparse clusters. This version of VPR includes the latest timing-driven packing and placement enhancements [9, 10].
During routing, the minimum channel width required to route, Wmin, was found using a binary search. Afterwards, a final low-stress routing was done with W = 1.3 Wmin tracks to compute area and delay statistics. This procedure models the way FPGAs are actually used; designers are seldom comfortable working on the edge of capacity or routability.
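The search procedure can be sketched as follows. This is not VPR's own search code: it is a plain binary search that assumes routability is monotone in the channel width, that `routes(W)` is a caller-supplied predicate standing in for a full routing attempt, and that the low-stress width is rounded up (the rounding is our assumption).

```python
import math
from typing import Callable, Optional


def find_wmin(routes: Callable[[int], bool], w_max: int = 240) -> Optional[int]:
    """Smallest W in [1, w_max] with routes(W) true, or None if w_max fails.

    Assumes that if a circuit routes at W it also routes at any larger W.
    """
    if not routes(w_max):
        return None
    lo, hi = 1, w_max
    while lo < hi:
        mid = (lo + hi) // 2
        if routes(mid):
            hi = mid          # mid works, so the minimum is at or below mid
        else:
            lo = mid + 1      # mid fails, so the minimum is above mid
    return lo


def low_stress_width(w_min: int, factor: float = 1.3) -> int:
    """Channel width used for the final low-stress route (rounded up)."""
    return math.ceil(factor * w_min)


if __name__ == "__main__":
    # Toy predicate standing in for the router: routable from 37 tracks up.
    w_min = find_wmin(lambda w: w >= 37)
    print(w_min, low_stress_width(w_min))
```

Each call to `routes` corresponds to a complete routing run, which is why the binary search dominates the experiment's compute time.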
The final low-stress routing actually failed in 34 out of 3980 (0.9%) circuit/architecture combinations, usually due to slow convergence or switch pattern interference. To resolve this, one, two, then three additional tracks were added to the channel. This strategy was sufficient to route all but four of the troublesome cases; the three underlying architectures for these cases were deemed unroutable, so they were abandoned from further consideration in this paper.
Also, if the binary search was unable to find a reasonable minimum channel width (Wmin ≤ 240) for any of the circuits, the architecture was deemed unroutable and abandoned. Consequently, every architectural result presented in this paper was obtained by routing all of the benchmark circuits.
All area and delay results are averages obtained from placing and routing the 20 largest MCNC benchmark circuits [7]. Area is computed as the geometric average of the active FPGA area, which is defined below. The geometric average ensures that the circuits are all weighted equally, independent of the size of the circuit. Delay results are also the geometric average of the critical-path delay for each benchmark circuit.
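A short sketch shows why the geometric average weights circuits equally: scaling any single circuit's result by some factor changes the average by the same relative amount, regardless of how large that circuit is. The area values below are invented for illustration only.

```python
import math
from typing import Iterable


def geometric_mean(values: Iterable[float]) -> float:
    """Geometric average of positive values, computed in the log domain."""
    vals = list(values)
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


if __name__ == "__main__":
    areas = [1.0e6, 4.0e6, 16.0e6]          # hypothetical active areas
    base = geometric_mean(areas)
    # Improve the smallest and the largest circuit by 10% in turn.
    small_better = geometric_mean([0.9e6, 4.0e6, 16.0e6])
    large_better = geometric_mean([1.0e6, 4.0e6, 14.4e6])
    print(base, small_better / base, large_better / base)
```

With an arithmetic average, the same 10% improvement on the largest circuit would outweigh the one on the smallest circuit by a factor of sixteen in this example.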
Active FPGA area is the area, in units of minimum-width transistor areas, of one cluster tile (including its routing) times the number of clusters actually used by the benchmark circuit. This measurement was used in [1, 2] to better distinguish packing efficiency. We have chosen to use the active FPGA area metric here to be consistent with those results.
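The metric itself is a single product. In the sketch below, the tile area of 9307 is the Table 1 total for the fully populated k = 4, N = 6 tile, while the cluster count of 500 is a made-up figure for illustration.

```python
def active_area(tile_area_t: float, clusters_used: int) -> float:
    """Active FPGA area in minimum-width transistor areas (T):
    one cluster tile, including its routing, times the clusters used."""
    return tile_area_t * clusters_used


if __name__ == "__main__":
    # 9307 T per tile (Table 1); 500 clusters is a hypothetical circuit size.
    print(active_area(9307, 500))
```

Because only occupied clusters are counted, an architecture that packs the same circuit into fewer clusters is credited for it even if each tile is larger.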
3.2 CAD Tool Enhancements
Originally VPR routed only to cluster input pins because fully-connected clusters could guarantee the routability of cluster inputs and feedback connections. Extensive modifications to VPR were necessary to route sparsely populated clusters. For example, the routing graph, timing graph, and netlist structures had to be altered to accommodate the cluster feedback nets and the location of every BLE sink. As well, other changes were necessary to permit nets to enter a cluster more than once to improve routability.
The switch pattern generator from [8] was integrated into VPR to create the switch patterns for the LUT input multiplexers. This generator first distributes switches to balance the fan-in and fan-out of each wire, usually in a random pattern. A greedy improvement strategy is then followed which roughly maximizes the number of distinct output wires reached by every pair of input wires. To accomplish this, switches are randomly selected, first in pairs, then singly, and moved only if the fan-in/-out constraints are kept and the aforementioned cost improves. Using this technique, the switch patterns within a cluster are individually well-designed.
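The flavour of this generator can be conveyed with a simplified sketch; it is not the generator from [8]. It starts from a deterministic round-robin pattern rather than a random one, uses only pairwise swaps (which preserve every fan-in and fan-out exactly), and uses a stand-in cost of our own: the sum of squared overlaps between the output sets of each input pair, which is lowered when pairs of inputs reach more distinct outputs. All names are hypothetical.

```python
import random
from itertools import combinations


def initial_pattern(n_in, n_out, fan_in):
    """Round-robin start: each output gets fan_in distinct inputs and the
    input fan-outs differ by at most one."""
    assert fan_in <= n_in
    switches, c = set(), 0
    for out in range(n_out):
        for _ in range(fan_in):
            switches.add((c % n_in, out))
            c += 1
    return switches


def overlap_cost(switches, n_in):
    """Stand-in cost: sum over input pairs of (shared outputs) squared."""
    reach = [set() for _ in range(n_in)]
    for i, o in switches:
        reach[i].add(o)
    return sum(len(reach[a] & reach[b]) ** 2
               for a, b in combinations(range(n_in), 2))


def improve(switches, n_in, tries=300, seed=0):
    """Greedy pairwise swaps: (i1,o1),(i2,o2) -> (i1,o2),(i2,o1).
    A swap is kept only if it lowers the cost."""
    rng = random.Random(seed)
    switches = set(switches)
    best = overlap_cost(switches, n_in)
    for _ in range(tries):
        (i1, o1), (i2, o2) = rng.sample(sorted(switches), 2)
        if i1 == i2 or o1 == o2:
            continue
        if (i1, o2) in switches or (i2, o1) in switches:
            continue
        trial = (switches - {(i1, o1), (i2, o2)}) | {(i1, o2), (i2, o1)}
        cost = overlap_cost(trial, n_in)
        if cost < best:
            switches, best = trial, cost
    return switches, best


if __name__ == "__main__":
    start = initial_pattern(n_in=16, n_out=24, fan_in=8)
    better, cost = improve(start, n_in=16)
    print(overlap_cost(start, 16), cost)
```

Restricting moves to swaps keeps the balance constraints true by construction, at the price of never exploring the single-switch moves that the original generator also tries.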
Other switch patterns in the routing fabric, namely the cluster input and output patterns, use the original VPR switch placement generators. Additionally, we have not attempted to optimize the cascading of the cluster input multiplexers and LUT input multiplexers, except as noted below in Section 4.3. This extension to the work is nontrivial and left for future investigation.
Tool                 Additional Parameters
T-VPACK              default
VPR binary search    -pres_fac_mult 1.3 -max_router_iterations 250
5 Notice that the sparse Fc = 1.0 result is missing for N = 9 in Figures 3 and 4 because VPR was unable to route the clma circuit under low-stress conditions due to slow convergence.
[Figure 4 plots: active area (Ts) versus Fc for k = 7 with N = 2 and with N = 9; curves compare sparse (Fcin = Fcfb = 0.5) and full (Fcin = Fcfb = 1.0) clusters.]
Figure 4: Fc impact on area for cluster sizes of 2 and 9. Intermediate cluster sizes gave similar results.
the sparse and fully populated cluster results are so similar. This can be partly attributed to the relative flatness near the minimum area. For N = 9, varying Fc from 0.1 to 0.5 causes less than 5% change in area. Hence, precise Fc selection is not critical, provided it is large enough to be routable, yet not wastefully large.
For the remainder of the results in this paper, it was determined that a fixed value of Fc would not significantly hinder area results. Rather than using the minimum-area Fc values from Figure 5, we felt that having a few more switches in the routing (by having a slightly larger Fc) would be helpful as clusters were made even more sparse (internally). This is especially important because no effort was made to tune the two switch patterns together and we wished to avoid possible interference patterns. Hence, we chose to set Fc = 0.5 for the N = 6 architectures and Fc = 0.366 for the k = 7, N = 10 architecture. These particular values were chosen because they were used in previous work [2, 4] and this gives us the most comparable results.
[Figure 5 plot: minimum-area Fc versus number of cluster inputs I for k = 7 with the cluster size varied; curves: sparse (Ispare = 0, Fcin = Fcfb = 0.5) and full (Ispare = 0, Fcin = Fcfb = 1.0).]
4.2.2 Selecting Fcout
Previous experiments have shown that Fcout = 1/N is adequate for routing in fully populated architectures [4]. Considering the similarity of the Fc area results between sparse and fully populated architectures, it was decided that modifying Fcout would have insignificant impact in a sparsely connected architecture. Hence, Fcout = 1/N was used for all results.
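To give a feel for what these densities mean per pin, the sketch below converts a density into a track count. It assumes the usual convention that Fc and Fcout are the fraction of tracks in the adjacent channel that a pin can connect to; the rounding, the floor of one track, and the channel width of 48 are illustrative assumptions.

```python
def tracks_per_pin(density: float, channel_width: int) -> int:
    """Number of channel tracks one cluster pin connects to.

    Assumption: density is a fraction of the channel width, rounded to the
    nearest track, with at least one connection per pin.
    """
    return max(1, round(density * channel_width))


if __name__ == "__main__":
    w, n = 48, 6                              # hypothetical W; N = 6 cluster
    print(tracks_per_pin(1.0 / n, w))         # each output pin, Fcout = 1/N
    print(tracks_per_pin(0.5, w))             # each input pin, Fc = 0.5
```

With Fcout = 1/N, the N outputs of a cluster together can reach about one channel's worth of tracks, which is one way to see why the value scales with cluster size.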
Figure 5: Best Fc corresponding to minimum area as a function of I cluster inputs.
permutation of the rows (or outputs) to balance the fan-in of the LUT inputs. These matrices, but not the permutation pattern, are illustrated in Figure 1.
Both switch designs were routed in a k = 7, N = 10, Fcin = Fcfb = 0.43 architecture. Both designs required identical transistor area, and the partitioned matrix was only about 1% faster. Although this is not significantly faster, it was used for subsequent results in this paper since it may help with some pathological cases.
4.3 Partitioning of Cluster Inputs
Additional net delay can be caused by sparsely populated clusters because some LUT inputs may not be reachable from particular sides of the cluster. For example, consider the case when some LUT input connections have already been formed, and the last remaining input signal is being made. A lack of switches inside the cluster may cause that net to enter the cluster from a more distant side. The result is increased delay.
We investigated this problem by trying a single switch matrix for all cluster inputs, and one which was partitioned into four smaller switch matrices, one for each input side. The partitioned matrix addresses the above problem by ensuring that all of the cluster inputs from any particular side can reach all of the LUT inputs. It also has a weakness though: these smaller switch matrices are not carefully designed to couple together well. Each partitioned matrix is derived from the same basic switch pattern, but each has its own
4.4 Sparse Cluster Area Results
The primary motivation for depopulating clusters is to reduce the area, and subsequently the cost, of an FPGA. In Section 4.2, it was determined that simply depopulating the cluster to 50% is more effective at reducing area than choosing the proper value of Fc. In this section, further depopulation of the cluster is explored.
To reduce the number of routing experiments, it was decided to fix the cluster size to N = 6 and vary the LUT sizes from 4 through 7. That particular cluster size was selected because it generated near-minimum area and area-delay results for fully populated clusters with all of these LUT sizes. The larger LUT sizes are especially interesting because they require larger input switch matrices, hence
[Figure 6 plots: four panels, one per LUT size (k = 4, 5, 6, 7) with N = 6, each showing active area (Ts) versus Ispare with one curve per Fcin value and a curve for the fully populated cluster (Fcin = Fcfb = 1.0).]
Figure 6: Active FPGA area of fully and sparsely populated clusters.
offering more potential for depopulation. One additional architecture with k = 7, N = 10 was chosen to study an even larger number of inputs entering the cluster.
A number of preliminary routing experiments were run with a wide range of values for Fcin and Fcfb. From these results, which are not shown here, it was confirmed that Fcfb has less influence on area. As Fcfb was reduced below 50%, a number of circuits would no longer route. It was determined that Fcfb of 50% (or 3/7 = 43% for k = 7) was as low a value as could be tolerated. Similar preliminary sweeps indicated that Fcin = 0.5 was nearly always routable, so area reduction should concentrate on more sparse values.
The area results from routing the four LUT sizes are shown in Figure 6. In these graphs, each curve represents the geometric average of active FPGA area for a fixed value of Fcin. The number of spare inputs is varied along the x-axis. The sparse cluster results should be compared against the bold curve representing the fully-populated cluster area.
The most apparent trend in these curves is a gentle dip, then a general upward climb in area as Ispare is increased. The upward trend is an expected result, since the spare inputs will require additional cluster input multiplexers. The dip is caused by a rapid initial decline in average channel width, which then gradually reaches a 5% to 20% reduction (10% is typical).
A number of data points are missing in Figure 6, specifically for small Ispare values. This is because one or more benchmark circuits
could not be routed on the architecture. Hence, although they contribute to area reduction in only a few cases, it is essential to have these spare inputs to make sparse clusters routable. Typically, between two and five spare inputs are required to make the architecture routable and attain the lowest area.
The lowest-area architectures from Figure 6 are summarized in Table 5. As well, the large N = 10 cluster architecture is included. With these architectures, a 10 to 18% area savings is achieved. As mentioned earlier, between two and five spare inputs is sufficient to achieve most of this savings, which is surprising since this is only about one spare input per side.
A breakdown of the cluster tile area is given in Table 6. For 4-input LUTs, there was a slight decrease in routing area because the spare inputs helped reduce average channel width. The 5- and 6-input LUT cases did not achieve the same benefit because the spare inputs contributed more to area than the amount saved by the slight channel width reduction. The two 7-LUT architectures had an increase in routing area due to the spare inputs and a channel width increase. However, the sparse switch populations produced a net area savings of 14% and 18%, with the larger cluster benefitting more. With respect to the entire tile, depopulating the clusters was very effective at reducing the relative LUT input multiplexer size from the 24–33% range down to 12–18%.
One very interesting result from this data is that a sparse cluster of six 6-input LUTs is slightly more area-efficient (3%) than six
Architecture
  N                            6      6      6      6     10
  Fc                         0.5    0.5    0.5    0.5  0.366
  Ispare                       2      2      2      5     10
  Fcfb                       0.5    0.5    0.5   0.43   0.43
Channel Width (arith. avg.)
  Fully Populated           47.9   46.4   44.3   43.8   53.7
  Best Sparse               3.333.353.233.9.03
Table 6: Breakdown of cluster tile area. The routing area is an arithmetic average for all circuits.
Columns: Architecture (k); Tile Area (Number of Minimum-Width Transistor Areas); Best-Area Sparse Cluster.
Surviving entries, in extraction order: 6, 5, 6, 7, 10; 9307, 1840, 14318, 6831, 35145, 12022, 6713, 5146 (26.2%), 16879, 11358, 6050, 3080 (27.4%), 99, 3496, 8120, 4298 (15.0%), 990, 6371, 2115 (17.1%), 1430 (17.1%).
[Figure 7 plot: delay (s) versus LUT size k.]
Figure 7: Delay decreases with LUT size.
4-LUTs in a sparse cluster. This is a departure from previous work which has consistently shown that 4-LUTs achieve lower area, albeit in fully populated clusters. The reason for this difference is simple: larger LUTs provide more opportunity for depopulation. This concept is supported by previous work which has shown that sparse crossbars with more outputs require fewer switches for the same level of routability [8].
[Figure 8 plot: delay (s) versus Fcin, with curves for k = 4, 5, 6 and 7.]
Figure 8: Delay is not influenced by Fcin. Similar results indicate it is not influenced by Ispare or Fcfb.
4.5 Sparse Cluster Delay Results
As mentioned earlier, reduced switch densities may cause an increase in delay due to an increase in bends or wire use to achieve routability. Although delay may decrease for other reasons such as reduced loading, we chose to be conservative and ignore these possible benefits.
The curves in Figure 7 show the impact that varying the LUT size has on delay for a few of the N = 6 architectures. The curve labels identifying the architectures have been omitted for clarity, since only trends need to be observed. The important thing to notice is that, for all architectures, delay goes down as k increases.
Similarly, Figure 8 shows the change in delay as the switch density Fcin is varied. It is apparent in the graph that curves of the same LUT size are all grouped together. In particular, the 4- and 5-LUT data is easily distinguished from the 6- and 7-LUT data. The flatness of all of these curves illustrates how little impact Fcin has on delay.
Analysis of delay while varying Ispare or Fcfb shows the same result: delay is independent of these parameters. Even though sparse clusters present a challenge to the router and remove many choices, and even though some feedback connections must leave the cluster and re-enter through the general-purpose routing, the router still has enough freedom to ensure that nets on the critical path remain on the fastest paths to the critical sinks.
4.6 Sparse Cluster Area-Delay Product
The previous two sections presented results indicating the 6-LUT had the lowest area and the 7-LUT had the lowest delay. When the
[Figure 9 plots: area · delay (T·ns) versus Ispare, in two panels. Panel "Fully Populated Cluster": k = 4, 5, 6, 7 with N = 6 and k = 7 with N = 10, all with Fcin = Fcfb = 1.0. Panel "Best-Area Sparse Cluster" (k, N, Fcin, Fcfb): 4, 6, 0.50, 0.50; 5, 6, 0.40, 0.50; 6, 6, 0.33, 0.50; 7, 6, 0.14, 0.43; 7, 10, 0.14, 0.43.]
Figure 9: Area-delay product results for fully-populated and best-area sparse architectures.
Columns: k; Average Runtime (seconds); Fully Populated (listed twice in the extracted header).
Surviving entries, in extraction order: 65, 67, 10, 4357, 188, 275, 96, 70, 183, 178, 84, 83, 116, 84, 91, 88, 15086.
Table 7: Average runtime and number of routing iterations for the final low-stress route (arithmetic averages of 20 benchmarks). Runtimes were collected on an 866 MHz Pentium III computer with 512 MB of SDRAM.

area and delay results are combined in the form of an area-delay product, the 6-LUT emerges as the superior logic block choice. This metric is important because it indicates when the best trade-off is being made between using an additional amount of area for a similar relative gain in clock rate (or vice versa). For example, it is directly useful in FPGA-based computation because the computation rate is a product of both the clock rate and parallelism.
The best sparse area-delay product organizations are compared to their fully-populated versions in Figure 9. The area-delay product improves for every LUT size due to the area reduction. The overall best sparse architecture containing 6-LUTs is about 14% more efficient than one containing 4-LUTs, and about 22% more efficient than the traditional fully-populated 4-LUT cluster.
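The comparison can be expressed with two small helpers. The sketch assumes one possible reading of "more efficient", namely the fractional reduction in area-delay product relative to a baseline; the numeric inputs are invented and are not the paper's measurements.

```python
def area_delay(area_t: float, delay_s: float) -> float:
    """Area-delay product: active area (T) times critical-path delay (s)."""
    return area_t * delay_s


def relative_gain(candidate: float, baseline: float) -> float:
    """Fractional reduction of the candidate's product versus the baseline.

    Assumption: 'X% more efficient' is read as a product that is X% lower.
    """
    return 1.0 - candidate / baseline


if __name__ == "__main__":
    # Hypothetical geometric averages, for illustration only.
    full_4lut = area_delay(4.0e6, 1.25e-8)
    sparse_6lut = area_delay(3.25e6, 1.2e-8)
    print(full_4lut, sparse_6lut, relative_gain(sparse_6lut, full_4lut))
```

Since delay is essentially unchanged by depopulation, almost all of the gain in such a comparison comes from the area term.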
are used. Even though runtime has increased, the number of router iterations used is practically unchanged. The main reason for the slowdown comes from the increased number of wires and switches in the architecture that must be examined with each iteration: all cluster inputs now have connections to many LUT inputs, and nets are allowed to enter a cluster more than once. This causes the router to evaluate many more routing paths before making a decision.
It is worthwhile to note that having larger LUT sizes and cluster sizes reduces the amount of work that VPR 4.30 must do, so runtime decreases. This benefit was not realized in the modified VPR because the amount of wiring inside the cluster also increases, keeping runtime relatively flat.
The additional runtime needed to route the best-area sparse architectures is also shown in Table 7. For k = 4, 5, 6 the runtime and the number of iterations are similar; for k = 7, runtime nearly doubled and the number of iterations increased by 25–30%. This increase in the average is caused by a large increase in four of the normally difficult-to-route circuits. The need for more router iterations indicates these architectures are barely routable, probably because Fcin is so low, even though these circuits are being routed using the low-stress channel width.
Increasing routability by increasing Ispare to 15 for the k = 7, N = 10 architecture reduced runtime to 210 seconds and 97 iterations. Hence, the amount of area savings can also be balanced against the runtime effort.
4.7 Routing Runtime with Sparse Clusters
The removal of switches inside the cluster also removes the routability guarantee of the cluster. Consequently, the router must pay attention to all of the wires and switches within the cluster, so it is expected that additional runtime effort is required to complete the route.
The average runtime and average number of iterations required for routing the different architectures are shown in Table 7. Results are presented for fully populated clusters to compare the original VPR 4.30 to the modified one. As well, the modified VPR can be compared against itself to study the additional impact of routing the best-area sparse clusters.
Generally, the modified VPR currently runs about three to four times slower than the original version when fully populated clusters
5. CONCLUSIONS
This work has studied the area and delay impact of sparsely populating the internal cluster connections in a clustered architecture. At the expense of three to four times the compute time, an area savings of 10 to over 14% was realized by sparsely populating the cluster internals of 4-, 5-, 6-, and 7-input LUT architectures containing 6 LUTs per cluster. A larger cluster size of ten 7-LUTs obtained an 18% area savings. It was also observed that the additional router effort and reduced routing flexibility did not degrade critical-path delay.
A fixed number of spare inputs were added to each cluster. These inputs are used only by routing, and are not used or required for packing. By adding up to 15 spare inputs, the channel width decreased by about 10% in most architectures, whether full or sparsely populated. Although sparse clusters on their own impose a small increase in channel width, the spare inputs reduce the channel width, resulting in a small, net savings.
The channel width reduction typically produced a net savings in routing area alone when up to seven spare inputs were added, but resulted in a net increase thereafter. Of course, the cluster area (excluding the routing) always increased with the addition of spare inputs. However, this area increased at a slower rate in more sparsely populated clusters, as expected. When added to the routing area, most architectures became less efficient after more than five spare inputs were employed.
The increase in routability and decreases in channel width and area indicate that it is best to force the packing algorithm to leave a few spare inputs (two or three) for the router.
One interesting outcome of this work is that, contrary to popular belief, it is more area-efficient to depopulate only the LUT input multiplexers than it is to depopulate only the cluster input multiplexers (i.e., the C blocks) in the general routing. The reason for this is that, due to input sharing in a cluster, there are about twice as many LUT input multiplexers as cluster input multiplexers. Of course, depopulating both regions provides even more savings.
Another interesting observation is that 6-LUTs become more area efficient than 4-LUTs when sparse clusters are employed. This was entirely attributable to the more sparse pattern that could be used in the 6-LUT case.
The area and delay results in this paper used conservative estimates and ignored secondary effects which would improve results further. In particular, the tile size and the subsequent routing switch size reduction from sparse cluster use should lead to additional area and delay reduction. Delay improvement may also come from reduced loading inside the cluster and by generally using larger cluster sizes, which are more area-efficient when using sparse clusters. It is reasonable to expect that larger cluster sizes may produce an even larger area savings due to the large amount of area concentrated in the LUT input multiplexers.
Future work in this area will include effort to jointly design the LUT input switch matrices with the cluster input multiplexers to avoid switch pattern interference. Additional constraints such as carry chains or other local routing may impact sparse cluster design and should be evaluated. A wider variety of cluster sizes, particularly the effectiveness of large clusters, should also be explored. The area savings from sparsely populated clusters will reduce tile size, but the subsequent area and delay reduction from using smaller routing switches should also be quantified. The delay improvements arising from reduced loading and larger cluster sizes should be investigated. Also, efforts should be made to improve the runtime of the router while still retaining the area savings.
An interesting extension of this work would involve tighter coupling with the packing stage. For example, under special circumstances, it may be reasonable to have the packing tool use the spare inputs reserved for routing. Before doing this, it could first do a routability test to verify whether the potential cluster of logic blocks is routable. Since this shouldn't be a common case, it can be done with reasonable CPU effort. This may increase the usefulness of the FPGA architecture for subcircuits which have wide fan-in (or poor input sharing), such as finite state machines.
6. ACKNOWLEDGEMENTS
The authors wish to thank Elias Ahmed, Mike Sheng, and Steve Wilton for HSPICE timing results and helpful discussions.
7. REFERENCES
[1] E. Ahmed. The effect of logic block granularity on deep-submicron FPGA performance and density. Master's thesis, Department of Electrical and Computer Engineering, University of Toronto, 2001.
[2] E. Ahmed and J. Rose. The effect of LUT and cluster size on deep-submicron FPGA performance and density. In ACM/SIGDA Int. Symp. on FPGAs, pages 3–12, 2000.
[3] V. Betz and J. Rose. VPR: A new packing, placement and routing tool for FPGA research. In Field-Programmable Logic, pages 213–222, 1997.
[4] V. Betz, J. Rose, and A. Marquardt. Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic Publishers, Boston, 1999.
[5] J. Cong and Y. Ding. FlowMap: An optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs. IEEE Transactions on Computer-Aided Design, pages 1–12, January 1994.
[6] W. Elmore. The transient response of damped linear networks with particular regard to wideband amplifiers. Journal of Applied Physics, pages 55–63, January 1948.
[7] C. B. Laboratory. LGSynth93 suite. http://www.cbl.ncsu.edu/www/.
[8] G. Lemieux, P. Leventis, and D. Lewis. Generating highly-routable sparse crossbars for PLDs. In ACM/SIGDA Int. Symp. on FPGAs, pages 155–1, Monterey, CA, February 2000.
[9] A. Marquardt, V. Betz, and J. Rose. Using cluster-based logic blocks and timing-driven packing to improve FPGA speed and density. In ACM/SIGDA Int. Symp. on FPGAs, pages 37–46, 1999.
[10] A. Marquardt, V. Betz, and J. Rose. Timing-driven placement for FPGAs. In ACM/SIGDA Int. Symp. on FPGAs, pages 203–213, 2000.
[11] M. I. Masud. FPGA routing structures: A novel switch block and depopulated interconnect matrix architectures. Master's thesis, Department of Electrical and Computer Engineering, University of British Columbia, December 1999.
[12] J. Rose and S. Brown. Flexibility of interconnection structures in field-programmable gate arrays. IEEE Journal of Solid State Circuits, 26(3):277–282, March 1991.
[13] E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P. R. Stephan, R. K. Brayton, and A. Sangiovanni-Vincentelli. SIS: A system for sequential circuit analysis. Technical Report UCB/ERL M92/41, University of California, Berkeley, May 1992.
[14] M. Sheng and J. Rose. Mixing buffers and pass transistors in FPGA routing architectures. In ACM/SIGDA Int. Symp. on FPGAs, 2001.