Bug 1986 - test result divergence for lte-frequency-reuse test
test result divergence for lte-frequency-reuse test
Status: RESOLVED FIXED
Product: ns-3
Classification: Unclassified
Component: lte
pre-release
PC Linux
: P5 normal
Assigned To: Nicola Baldo
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-09-17 03:14 EDT by Tom Henderson
Modified: 2014-12-19 10:34 EST (History)
1 user (show)

See Also:


Attachments
patch to fix stream assignments (5.97 KB, patch)
2014-09-18 13:44 EDT, Tom Henderson
Details | Diff
diff of log files (567.86 KB, application/octet-stream)
2014-09-18 14:20 EDT, Tom Henderson
Details
patch to fix (2.87 KB, patch)
2014-12-16 13:02 EST, Tom Henderson
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Tom Henderson 2014-09-17 03:14:45 EDT
In preparing for ns-3.21 release, two test cases from the newly added lte-frequency-reuse test were discovered to produce failing result on an OS X machine but pass on Linux machines.

These were temporarily disabled in this changeset:
http://code.nsnam.org/ns-3-dev/rev/93c0d8e02bf0

Upon inspection of the logging output differences (calling LteHelper::EnableLogComponents() in the test suite and  running the test-runner program directly), some differences in order of signal arrival were noticed (line numbers enabled in the below comparison).  The below differences were spotted early on in the test case:

600 0x19b2610 state: IDLE
    601 LteInterference:AddSignal(0x19b2a30, 2.22222e-07 2.22222e-07 2.22222e-07         2.22222e-07 2.22222e-07 2.22222e-07
    602 , +214285.0ns)
    603 LteInterference:DoAddSignal(0x19b2a30, 2.22222e-07 2.22222e-07 2.22222e-        07 2.22222e-07 2.22222e-07 2.22222e-07
    604 )
    605 LteInterference:ConditionallyEvaluateChunk(0x19b2a30)
    606 0x19b2a30 now +0.0ns last +0.0ns
    607 LteSpectrumPhy:StartRxCtrl(0x19b2610)
    608 LteUePhy:ReceivePss(0x19b3140, 1, 2.22222e-07 2.22222e-07 2.22222e-07 2.        22222e-07 2.22222e-07 2.22222e-07
    609 )
    610 0x19b3140 PSS RNTI 0 cellId 1 has RSRP 5.22879 and RBnum 6
    611 0x19b2610 synchronized with this signal (cellId=1)
    612 0x19b2610 scheduling EndRx with delay +214285.0ns
    613 0x19b2610 state: IDLE -> RX_CTRL
    614 LteInterference:StartRx(0x19b2a30, 2.22222e-07 2.22222e-07 2.22222e-07 2        .22222e-07 2.22222e-07 2.22222e-07
    615 )
    616 first signal
    617 LteChunkProcessor:Start(0x19b3c20)
    618 LteChunkProcessor:Start(0x19b3cf0)
    619 LteChunkProcessor:Start(0x19b3d80)
    620 0x19b2610 state: RX_CTRL
    621 LteSpectrumPhy:StartRx(0x19b57e0, 0x19b9540)
    622 0x19b57e0 state: IDLE
    623 LteInterference:AddSignal(0x19b5ba0, 2.70481e-17 2.70481e-17 2.70481e-17         2.70481e-17 2.70481e-17 2.70481e-17
    624 , +214285.0ns)
    625 LteInterference:DoAddSignal(0x19b5ba0, 2.70481e-17 2.70481e-17 2.70481e-        17 2.70481e-17 2.70481e-17 2.70481e-17
    626 )

vs.

   600 0x7f9efbc42cf0 state: IDLE
    601 LteInterference:AddSignal(0x7f9efbc4deb0, 2.70481e-17 2.70481e-17 2.7048        1e-17 2.70481e-17 2.70481e-17 2.70481e-17
    602 , +214285.0ns)
    603 LteInterference:DoAddSignal(0x7f9efbc4deb0, 2.70481e-17 2.70481e-17 2.70        481e-17 2.70481e-17 2.70481e-17 2.70481e-17
    604 )
    605 LteInterference:ConditionallyEvaluateChunk(0x7f9efbc4deb0)
    606 0x7f9efbc4deb0 now +0.0ns last +0.0ns
    607 LteSpectrumPhy:StartRxCtrl(0x7f9efbc42cf0)
    608 LteUePhy:ReceivePss(0x7f9efbc72240, 1, 2.70481e-17 2.70481e-17 2.70481e-        17 2.70481e-17 2.70481e-17 2.70481e-17
    609 )
    610 0x7f9efbc72240 PSS RNTI 0 cellId 2 has RSRP -93.9177 and RBnum 6
    611 0x7f9efbc42cf0 not in sync with this signal (cellId=1, m_cellId=2)
    612 0x7f9efbc42cf0 state: IDLE
    613 LteSpectrumPhy:StartRx(0x7f9efbc59880, 0x7f9efbc56970)
    614 0x7f9efbc59880 state: IDLE
    615 LteInterference:AddSignal(0x7f9efbc8d760, 2.22222e-07 2.22222e-07 2.2222        2e-07 2.22222e-07 2.22222e-07 2.22222e-07
    616 , +214285.0ns)
    617 LteInterference:DoAddSignal(0x7f9efbc8d760, 2.22222e-07 2.22222e-07 2.22        222e-07 2.22222e-07 2.22222e-07 2.22222e-07
    618 )

Also, the Uids of the spectrum models in use differ between the two platforms (off by one), but the creation of many of these objects is apparently during static initialization and logging does not work.

This suggests that static initialization order issues with random variable stream assignment may be causing the divergence.  The newly added test suite does not include any calls to LteHelper::AssignStreams().  When I experimented with explicitly setting the AssignStreams() for this test case, I was able to clear the above signal order discrepancy (and several subsequent ones) but not able to completely clear (align) the behavior, suggesting that there are still more sources of randomness needing to be controlled.  The config-store output does not show all random variables in use, so other approaches such as gdb are likely needed.
Comment 1 Tom Henderson 2014-09-18 13:44:57 EDT
Created attachment 1881 [details]
patch to fix stream assignments
Comment 2 Tom Henderson 2014-09-18 13:49:39 EDT
The attached patch should be applied, but it does not solve the problem entirely.  There is something else going on that needs to be fixed.

To debug further, I enabled LteHelper::EnableLogComponents() and disabled all tests except for test case:

LteSoftFfrAreaTestCase ("LteSoftFfrAreaTestCasePss1", "ns3::PssFfMacScheduler")

When comparing the logging output of the two machines, the simulation log output is the same until the first occurrence of PssFfMacScheduler::DoSchedDlTriggerReq()

in the debugger, around line 1520 of the source file:

      if (lcActives == 0)
        {
          // Set to max value, to avoid divide by 0 below
          lcActives = (uint16_t)65535; // UINT16_MAX;
        }
      uint16_t RgbPerRnti = (*itMap).second.size ();
      std::map <uint16_t,SbMeasResult_s>::iterator itCqi;

I found that the value of RgbPerRnti is '3' on Linux, and '8' on OS X when this is first entered.
Comment 3 Tom Henderson 2014-09-18 14:15:27 EDT
Here is some more data about the differences between the two platforms.  I took these steps

1) disabled all test cases except for the one causing differences, and I applied the patch to assign consistent RV streams (I verified that all calls to "GetValue()" were the same for both machines)

2) ran the following command on both machines

"NS_LOG="*" ./waf --run "test-runner --suite=lte-frequency-reuse" > log.txt 2>&1

(this takes a while due to the amount of log data)

3) looked at the first 1000000 lines:

head -1000000 log.txt > log.head.txt

3) post-processed both log files to strip all hex memory addresses from the logs, so I could diff them:

perl -pi -w -e 's/0x[0-9a-f]+//g' log.head.txt

5) diffed the two log.head.txt files (the machine corresponding to '-' lines is Linux, the machine with '+' is OS X)

The initial part of the diff is pretty noisy but the simulation starts at time 0 at line 733245 of this diff.

The first (possibly minor?) diff is found in constructing a packet burst:

 733345 @@ -491464,36 +484466,36 @@
 733346  0s -1 Object:Object()
 733347  0s -1 PacketBurst:PacketBurst()
 733348  0s -1 Object:SetTypeId(, 0s -1 TypeId:GetName()
 733349 -0s -1 TypeId:GetName(, 77)
 733350 -0s -1 TypeId:LookupInformation(, 77)
 733351 +0s -1 TypeId:GetName(, 65)
 733352 +0s -1 TypeId:LookupInformation(, 65)
 733353  ns3::PacketBurst)
 733354  0s -1 Object:Check()
 733355  0s -1 AttributeConstructionList:AttributeConstructionList()
 733356  0s -1 Object:Construct(, )
 733357  0s -1 ObjectBase:ConstructSelf(, )
 733358  0s -1 Object:GetInstanceTypeId()
 733359 -0s -1 ObjectBase:ConstructSelf(): [DEBUG] 0s -1 TypeId:GetAttributeN()
 733360 -0s -1 TypeId:GetAttributeN(, 77)
 733361 -0s -1 TypeId:LookupInformation(, 77)
 733362 -0s -1 TypeId:GetName()
 733363 -0s -1 TypeId:GetName(, 77)
 733364 -0s -1 TypeId:LookupInformation(, 77)
 733365 -construct tid=ns3::PacketBurst, params=0
 733366 +0s -1 ObjectBase:ConstructSelf(): [DEBUG] construct tid=0s -1 TypeId:GetName()
 733367 +0s -1 TypeId:GetName(, 65)
 733368 +0s -1 TypeId:LookupInformation(, 65)
 733369 +ns3::PacketBurst, params=0s -1 TypeId:GetAttributeN()
 733370 +0s -1 TypeId:GetAttributeN(, 65)
 733371 +0s -1 TypeId:LookupInformation(, 65)

this diff also repeats at 7333421; basically, object construction of PacketBurst seems to generate some different log statements.  I'm not sure whether this is of any importance.

The first substantive difference that I spotted was shortly after:

733471 -0s -1 LteFfrSoftAlgorithm:DoReportUlCqiInfo()
 733472 -0s -1 LteFfrSoftAlgorithm:DoReportUlCqiInfo(): [WARN ] Method should not be called, because it is empty
 733473 -0s -1 LteFfrSoftAlgorithm:DoGetAvailableUlRbg()
 733474 -0s -1 LteFfrSoftAlgorithm:DoGetMinContinuousUlBandwidth()
 733475 -0s -1 LteFfrSoftAlgorithm:DoGetMinContinuousUlBandwidth(): [INFO ] minContinuousUlBandwidth: 6
 733476 +0s -1 LteFrSoftAlgorithm:DoReportUlCqiInfo()
 733477 +0s -1 LteFrSoftAlgorithm:DoReportUlCqiInfo(): [WARN ] Method should not         be called, because it is empty
 733478 +0s -1 LteFrSoftAlgorithm:DoGetAvailableUlRbg()
 733479 +0s -1 LteFrSoftAlgorithm:DoGetMinContinuousUlBandwidth()
 733480 +0s -1 LteFrSoftAlgorithm:DoGetMinContinuousUlBandwidth(): [INFO ] minContinuousUlBandwidth: 8


Here, note that minContinuousUlBandwidth is 6 on Linux but 8 on OS X.


Another interesting diff is a little further down, where MultiModeSpectrumChannel::StartTx() only occurs on the OS X machine

 734024 @@ -492451,6 +485453,17 @@
 734025  0.001s -1 LteSpectrumSignalParameters:Copy()
 734026  0.001s -1 SpectrumSignalParameters:SpectrumSignalParameters(, )
 734027  0.001s -1 LteSpectrumSignalParameters:LteSpectrumSignalParametersDlCtrl        Frame(, )
 734028 +0.001s -1 LteSimpleSpectrumPhy:GetMobility()
 734029 +0.001s -1 LteSimpleSpectrumPhy:GetDevice()
 734030 +0.001s -1 EventImpl:EventImpl()
 734031 +0.001s -1 DefaultSimulatorImpl:Schedule(, 0, )
 734032 +0.001s -1 SystemThread:Equals()
 734033 +0.001s -1 MapScheduler:Insert(, , 1000000, 51)
 734034 +0.001s -1 EventId:EventId(, , 1000000, 4294967295, 51)
 734035 +0.001s -1 MultiModelSpectrumChannel:StartTx(): [LOGIC]  copying signal         parameters
 734036 +0.001s -1 LteSpectrumSignalParameters:Copy()
 734037 +0.001s -1 SpectrumSignalParameters:SpectrumSignalParameters(, )
 734038 +0.001s -1 LteSpectrumSignalParameters:LteSpectrumSignalParametersDlCtrl        Frame(, )
 734039  0.001s -1 LteSpectrumPhy:GetMobility(,  m_cellId , 1)
 734040  0.001s -1 Vector:CalculateDistance(200:0:0, 0:0:0)
 734041  0.001s -1 IsotropicAntennaModel:GetGainDb(, (0, 1.5708))

----

My sense is that the initial conditions (spectrum values?) of this test case get set up slightly differently for some reason, outside of random variables.  The simulation executes about the same until PssFfMacScheduler::DoSchedDlTriggerReq() gets called, at which point the results begin to slightly diverge.  There may be more information in this diff about how the simulation setup diverges (before time 0) that I didn't pick out of the noise of Time and SystemMutex log statements.  I will attach the compressed diff file as an attachment; perhaps further munging of the diff will yield other ideas.
Comment 4 Tom Henderson 2014-09-18 14:20:45 EDT
Created attachment 1882 [details]
diff of log files
Comment 5 Tom Henderson 2014-12-16 13:02:33 EST
Created attachment 1933 [details]
patch to fix

Patch from Piotr
Comment 6 Nicola Baldo 2014-12-19 10:34:10 EST
---- Bug 1986 FIX
user: Piotr Gawlowicz <gawlowicz.p@gmail.com>
files: src/lte/model/pss-ff-mac-scheduler.cc
url: http://code.nsnam.org/ns-3-dev/rev/b6f6e368aa51