Bugzilla – Bug 1986
test result divergence for lte-frequency-reuse test
Last modified: 2014-12-19 10:34:10 EST
In preparing for ns-3.21 release, two test cases from the newly added lte-frequency-reuse test were discovered to produce failing result on an OS X machine but pass on Linux machines. These were temporarily disabled in this changeset: http://code.nsnam.org/ns-3-dev/rev/93c0d8e02bf0 Upon inspection of the logging output differences (calling LteHelper::EnableLogComponents() in the test suite and running the test-runner program directly), some differences in order of signal arrival were noticed (line numbers enabled in the below comparison). The below differences were spotted early on in the test case: 600 0x19b2610 state: IDLE 601 LteInterference:AddSignal(0x19b2a30, 2.22222e-07 2.22222e-07 2.22222e-07 2.22222e-07 2.22222e-07 2.22222e-07 602 , +214285.0ns) 603 LteInterference:DoAddSignal(0x19b2a30, 2.22222e-07 2.22222e-07 2.22222e- 07 2.22222e-07 2.22222e-07 2.22222e-07 604 ) 605 LteInterference:ConditionallyEvaluateChunk(0x19b2a30) 606 0x19b2a30 now +0.0ns last +0.0ns 607 LteSpectrumPhy:StartRxCtrl(0x19b2610) 608 LteUePhy:ReceivePss(0x19b3140, 1, 2.22222e-07 2.22222e-07 2.22222e-07 2. 22222e-07 2.22222e-07 2.22222e-07 609 ) 610 0x19b3140 PSS RNTI 0 cellId 1 has RSRP 5.22879 and RBnum 6 611 0x19b2610 synchronized with this signal (cellId=1) 612 0x19b2610 scheduling EndRx with delay +214285.0ns 613 0x19b2610 state: IDLE -> RX_CTRL 614 LteInterference:StartRx(0x19b2a30, 2.22222e-07 2.22222e-07 2.22222e-07 2 .22222e-07 2.22222e-07 2.22222e-07 615 ) 616 first signal 617 LteChunkProcessor:Start(0x19b3c20) 618 LteChunkProcessor:Start(0x19b3cf0) 619 LteChunkProcessor:Start(0x19b3d80) 620 0x19b2610 state: RX_CTRL 621 LteSpectrumPhy:StartRx(0x19b57e0, 0x19b9540) 622 0x19b57e0 state: IDLE 623 LteInterference:AddSignal(0x19b5ba0, 2.70481e-17 2.70481e-17 2.70481e-17 2.70481e-17 2.70481e-17 2.70481e-17 624 , +214285.0ns) 625 LteInterference:DoAddSignal(0x19b5ba0, 2.70481e-17 2.70481e-17 2.70481e- 17 2.70481e-17 2.70481e-17 2.70481e-17 626 ) vs. 600 0x7f9efbc42cf0 state: IDLE 601 LteInterference:AddSignal(0x7f9efbc4deb0, 2.70481e-17 2.70481e-17 2.7048 1e-17 2.70481e-17 2.70481e-17 2.70481e-17 602 , +214285.0ns) 603 LteInterference:DoAddSignal(0x7f9efbc4deb0, 2.70481e-17 2.70481e-17 2.70 481e-17 2.70481e-17 2.70481e-17 2.70481e-17 604 ) 605 LteInterference:ConditionallyEvaluateChunk(0x7f9efbc4deb0) 606 0x7f9efbc4deb0 now +0.0ns last +0.0ns 607 LteSpectrumPhy:StartRxCtrl(0x7f9efbc42cf0) 608 LteUePhy:ReceivePss(0x7f9efbc72240, 1, 2.70481e-17 2.70481e-17 2.70481e- 17 2.70481e-17 2.70481e-17 2.70481e-17 609 ) 610 0x7f9efbc72240 PSS RNTI 0 cellId 2 has RSRP -93.9177 and RBnum 6 611 0x7f9efbc42cf0 not in sync with this signal (cellId=1, m_cellId=2) 612 0x7f9efbc42cf0 state: IDLE 613 LteSpectrumPhy:StartRx(0x7f9efbc59880, 0x7f9efbc56970) 614 0x7f9efbc59880 state: IDLE 615 LteInterference:AddSignal(0x7f9efbc8d760, 2.22222e-07 2.22222e-07 2.2222 2e-07 2.22222e-07 2.22222e-07 2.22222e-07 616 , +214285.0ns) 617 LteInterference:DoAddSignal(0x7f9efbc8d760, 2.22222e-07 2.22222e-07 2.22 222e-07 2.22222e-07 2.22222e-07 2.22222e-07 618 ) Also, the Uids of the spectrum models in use differ between the two platforms (off by one), but the creation of many of these objects is apparently during static initialization and logging does not work. This suggests that static initialization order issues with random variable stream assignment may be causing the divergence. The newly added test suite does not include any calls to LteHelper::AssignStreams(). When I experimented with explicitly setting the AssignStreams() for this test case, I was able to clear the above signal order discrepancy (and several subsequent ones) but not able to completely clear (align) the behavior, suggesting that there are still more sources of randomness needing to be controlled. The config-store output does not show all random variables in use, so other approaches such as gdb are likely needed.
Created attachment 1881 [details] patch to fix stream assignments
The attached patch should be applied, but it does not solve the problem entirely. There is something else going on that needs to be fixed. To debug further, I enabled LteHelper::EnableLogComponents() and disabled all tests except for test case: LteSoftFfrAreaTestCase ("LteSoftFfrAreaTestCasePss1", "ns3::PssFfMacScheduler") When comparing the logging output of the two machines, the simulation log output is the same until the first occurrence of PssFfMacScheduler::DoSchedDlTriggerReq() in the debugger, around line 1520 of the source file: if (lcActives == 0) { // Set to max value, to avoid divide by 0 below lcActives = (uint16_t)65535; // UINT16_MAX; } uint16_t RgbPerRnti = (*itMap).second.size (); std::map <uint16_t,SbMeasResult_s>::iterator itCqi; I found that the value of RgbPerRnti is '3' on Linux, and '8' on OS X when this is first entered.
Here is some more data about the differences between the two platforms. I took these steps 1) disabled all test cases except for the one causing differences, and I applied the patch to assign consistent RV streams (I verified that all calls to "GetValue()" were the same for both machines) 2) ran the following command on both machines "NS_LOG="*" ./waf --run "test-runner --suite=lte-frequency-reuse" > log.txt 2>&1 (this takes a while due to the amount of log data) 3) looked at the first 1000000 lines: head -1000000 log.txt > log.head.txt 3) post-processed both log files to strip all hex memory addresses from the logs, so I could diff them: perl -pi -w -e 's/0x[0-9a-f]+//g' log.head.txt 5) diffed the two log.head.txt files (the machine corresponding to '-' lines is Linux, the machine with '+' is OS X) The initial part of the diff is pretty noisy but the simulation starts at time 0 at line 733245 of this diff. The first (possibly minor?) diff is found in constructing a packet burst: 733345 @@ -491464,36 +484466,36 @@ 733346 0s -1 Object:Object() 733347 0s -1 PacketBurst:PacketBurst() 733348 0s -1 Object:SetTypeId(, 0s -1 TypeId:GetName() 733349 -0s -1 TypeId:GetName(, 77) 733350 -0s -1 TypeId:LookupInformation(, 77) 733351 +0s -1 TypeId:GetName(, 65) 733352 +0s -1 TypeId:LookupInformation(, 65) 733353 ns3::PacketBurst) 733354 0s -1 Object:Check() 733355 0s -1 AttributeConstructionList:AttributeConstructionList() 733356 0s -1 Object:Construct(, ) 733357 0s -1 ObjectBase:ConstructSelf(, ) 733358 0s -1 Object:GetInstanceTypeId() 733359 -0s -1 ObjectBase:ConstructSelf(): [DEBUG] 0s -1 TypeId:GetAttributeN() 733360 -0s -1 TypeId:GetAttributeN(, 77) 733361 -0s -1 TypeId:LookupInformation(, 77) 733362 -0s -1 TypeId:GetName() 733363 -0s -1 TypeId:GetName(, 77) 733364 -0s -1 TypeId:LookupInformation(, 77) 733365 -construct tid=ns3::PacketBurst, params=0 733366 +0s -1 ObjectBase:ConstructSelf(): [DEBUG] construct tid=0s -1 TypeId:GetName() 733367 +0s -1 TypeId:GetName(, 65) 733368 +0s -1 TypeId:LookupInformation(, 65) 733369 +ns3::PacketBurst, params=0s -1 TypeId:GetAttributeN() 733370 +0s -1 TypeId:GetAttributeN(, 65) 733371 +0s -1 TypeId:LookupInformation(, 65) this diff also repeats at 7333421; basically, object construction of PacketBurst seems to generate some different log statements. I'm not sure whether this is of any importance. The first substantive difference that I spotted was shortly after: 733471 -0s -1 LteFfrSoftAlgorithm:DoReportUlCqiInfo() 733472 -0s -1 LteFfrSoftAlgorithm:DoReportUlCqiInfo(): [WARN ] Method should not be called, because it is empty 733473 -0s -1 LteFfrSoftAlgorithm:DoGetAvailableUlRbg() 733474 -0s -1 LteFfrSoftAlgorithm:DoGetMinContinuousUlBandwidth() 733475 -0s -1 LteFfrSoftAlgorithm:DoGetMinContinuousUlBandwidth(): [INFO ] minContinuousUlBandwidth: 6 733476 +0s -1 LteFrSoftAlgorithm:DoReportUlCqiInfo() 733477 +0s -1 LteFrSoftAlgorithm:DoReportUlCqiInfo(): [WARN ] Method should not be called, because it is empty 733478 +0s -1 LteFrSoftAlgorithm:DoGetAvailableUlRbg() 733479 +0s -1 LteFrSoftAlgorithm:DoGetMinContinuousUlBandwidth() 733480 +0s -1 LteFrSoftAlgorithm:DoGetMinContinuousUlBandwidth(): [INFO ] minContinuousUlBandwidth: 8 Here, note that minContinuousUlBandwidth is 6 on Linux but 8 on OS X. Another interesting diff is a little further down, where MultiModeSpectrumChannel::StartTx() only occurs on the OS X machine 734024 @@ -492451,6 +485453,17 @@ 734025 0.001s -1 LteSpectrumSignalParameters:Copy() 734026 0.001s -1 SpectrumSignalParameters:SpectrumSignalParameters(, ) 734027 0.001s -1 LteSpectrumSignalParameters:LteSpectrumSignalParametersDlCtrl Frame(, ) 734028 +0.001s -1 LteSimpleSpectrumPhy:GetMobility() 734029 +0.001s -1 LteSimpleSpectrumPhy:GetDevice() 734030 +0.001s -1 EventImpl:EventImpl() 734031 +0.001s -1 DefaultSimulatorImpl:Schedule(, 0, ) 734032 +0.001s -1 SystemThread:Equals() 734033 +0.001s -1 MapScheduler:Insert(, , 1000000, 51) 734034 +0.001s -1 EventId:EventId(, , 1000000, 4294967295, 51) 734035 +0.001s -1 MultiModelSpectrumChannel:StartTx(): [LOGIC] copying signal parameters 734036 +0.001s -1 LteSpectrumSignalParameters:Copy() 734037 +0.001s -1 SpectrumSignalParameters:SpectrumSignalParameters(, ) 734038 +0.001s -1 LteSpectrumSignalParameters:LteSpectrumSignalParametersDlCtrl Frame(, ) 734039 0.001s -1 LteSpectrumPhy:GetMobility(, m_cellId , 1) 734040 0.001s -1 Vector:CalculateDistance(200:0:0, 0:0:0) 734041 0.001s -1 IsotropicAntennaModel:GetGainDb(, (0, 1.5708)) ---- My sense is that the initial conditions (spectrum values?) of this test case get set up slightly differently for some reason, outside of random variables. The simulation executes about the same until PssFfMacScheduler::DoSchedDlTriggerReq() gets called, at which point the results begin to slightly diverge. There may be more information in this diff about how the simulation setup diverges (before time 0) that I didn't pick out of the noise of Time and SystemMutex log statements. I will attach the compressed diff file as an attachment; perhaps further munging of the diff will yield other ideas.
Created attachment 1882 [details] diff of log files
Created attachment 1933 [details] patch to fix Patch from Piotr
---- Bug 1986 FIX user: Piotr Gawlowicz <gawlowicz.p@gmail.com> files: src/lte/model/pss-ff-mac-scheduler.cc url: http://code.nsnam.org/ns-3-dev/rev/b6f6e368aa51