Bug 975 - Problem stopping ns3::TapBridgeDevice
Problem stopping ns3::TapBridgeDevice
Status: RESOLVED FIXED
Product: ns-3
Classification: Unclassified
Component: devices
pre-release
All All
: P5 normal
Assigned To: Tom Goff
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-08-05 13:09 EDT by alina
Modified: 2011-08-16 14:39 EDT (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description alina 2010-08-05 13:09:59 EDT
To read from the file descriptor associated to a tap device, the TapBridgeDevice class starts a new ns3::SystemThread which runs the ns3::TapBridgeDevice::ReadThread() method.
This method invokes the read() syscall inside an infinite loop [1]. The implemented mechanism to interrupt the loop consists in closing the file descriptor from the main loop [2].
But, as stated in the close(2) man page the described is not a reliable way of terminating a syscall on a file descriptor in linux:

"It is probably unwise to close file descriptors while they may be in use by system calls in other threads in the same process. Since a file descriptor may be re-used, there  are  some obscure race conditions that may cause unintended side effects."

I encountered that when running a simulation from python this mechanism never worked (the simulation didn't finish).
The first solution I tried was to use the signaling mechanism implemented by the ns3::SystemThread class to explicitly stop the thread.
So m_readThread->Shutdown () would be invoked from the main thread upon termination, sending a SIGALRM signal to the thread and interrupting the read.
The read loop controlled by the m_readThread->Break () flag would then finish allowing the thread to exit. But this didn't work properly when the simulation was run using python.

Another issue about the implementation of the ns3::SystemThread class is that it sets the handler for the SIGALRM to SIG_IGN, and modifying the handler like that could mess other things up, like python signal handling.
So instead a select mechanism to synchronize the thread termination can be used. I tested it and it seems to work as expected even when python is involved.
Below is a proposed implementation.

Some extra attributes:

  /**
   * \internal
   *
   * The pipe to unblock the select on m_sock on the device DoDispose().
   */
  int m_pipe[2];

  /**
   * \internal
   *
   * The flag to signal the read thread to stop reading from the file descriptor.
   */
  bool m_stop;

And this is how StopTapDevice() and ReadThread() would look like:

void
TapBridge::StopTapDevice (void)
{
  // flag the thread to stop reading from the fd
  m_stop = true;
  // write a '0' to the monitored pipe to interrupt the select
  write(m_pipe[1], "0", 1);
  // wait on the thread to finish
  m_readThread->Join ();
  m_readThread = 0;

  // close file descriptors
  close(m_pipe[0]);
  close(m_pipe[1]);
  close (m_sock);
  m_sock = -1;
}

void
TapBridge::ReadThread (void)
{
  int32_t len = -1;
  m_stop = false;
  int max_fd = m_fd > m_pipe[0] ? m_fd : m_pipe[0];
  while (!m_stop)
    {
      fd_set rfds;
      int ret;
      FD_ZERO(&rfds);
      FD_SET(m_fd, &rfds);
      FD_SET(m_pipe[0], &rfds);
      ret = select (max_fd + 1, &rfds, NULL, NULL, NULL);
      if (ret == -1)
        {
          if (errno == EINTR)
            {
              continue;
            }
          NS_FATAL_ERROR ("Error select()ing from tap device: errno=" << strerror (errno));
        }
      // Check that we actually got data in m_fd and not in m_pipe[0]
      if (!FD_ISSET(m_fd, &rfds))
        {
           continue;
        }

      uint32_t bufferSize = 65536;
      uint8_t *buf = (uint8_t *)malloc (bufferSize);
      NS_ABORT_MSG_IF (buf == 0, "TapBridge::ReadThread(): malloc packet buffer failed");
      NS_LOG_LOGIC ("Calling read on tap device socket fd " << m_sock);
      len = read (m_sock, buf, bufferSize);
      if (len == -1)
       {
          free (buf);
          buf = 0;
          if (errno == EINTR)
            {
              continue;
            }
          NS_FATAL_ERROR ("Error reading from tap device: errno=" << strerror (errno));
       }
       NS_LOG_INFO ("TapBridge::ReadThread(): Received packet on node " << m_nodeId);
       NS_LOG_INFO ("TapBridge::ReadThread(): Scheduling handler");
       NS_ASSERT_MSG (m_rtImpl, "EmuNetDevice::ReadThread(): Realtime simulator implementation pointer not set");
       m_rtImpl->ScheduleRealtimeNowWithContext (m_nodeId, MakeEvent (&TapBridge::ForwardToBridgedDevice, this, buf, len));
       buf = 0;
     }
}

[1] http://code.nsnam.org/ns-3-dev/file/48625d668186/src/devices/tap-bridge/tap-bridge.cc#l662
[2] http://code.nsnam.org/ns-3-dev/file/48625d668186/src/devices/tap-bridge/tap-bridge.cc#l250
Comment 1 Tom Henderson 2010-08-06 14:28:17 EDT
Is this related to bug 903?  Does the patch posted there resolve the problem?
Comment 2 alina 2010-08-09 06:29:36 EDT
The patch proposed in http://www.nsnam.org/bugzilla/show_bug.cgi?id=903 does not solve the problem, but in order to reproduce the bug I am reporting it is necessary to invoke Simulator::ScheduleDestroy (&TapBridge::StopTapDevice, this) (Which is the addition done by that patch).
The bug can be reproduced after applying http://www.nsnam.org/bugzilla/attachment.cgi?id=859 by running the ns-3 tap example 'tap-csma.cc' and making sure there is no traffic in the tap device other than the generated by the simulation. So the bug doesn't only occur when the simulation is run using python as I originally thought.

Below is the excerpt of the strace output for the tap-csma.cc example that shows that after the main thread (pid 29485) invokes a close on the the tap fd (close(8)) while the other thread(pid 29485) is doing a read(8, ), the main thread blocks in a futex ( m_readThread->Join (); ) waiting for the other thread to exit.
So, unless some traffic makes the read return the program never exits.


[pid 29485] read(8,  <unfinished ...>
[pid 29478] <... futex resumed> )       = 0
[pid 29478] futex(0xe00140, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 29478] futex(0xe0016c, FUTEX_WAIT_PRIVATE, 109, {0, 1795986}) = -1 ETIMEDOUT (Connection timed out)
[pid 29478] futex(0xe00140, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 29478] futex(0xe0016c, FUTEX_WAIT_PRIVATE, 111, {3, 825482687} <unfinished ...>
[pid 29485] <... read resumed> "33\0\0\0\2\0\0\0\0\0\1\206\335`\0\0\0\0\20:\377\376\200\0\0\0\0\0\0\2\0"..., 65536) = 70
[pid 29485] futex(0xe0016c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0xe00168, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid 29485] read(8,  <unfinished ...>
[pid 29478] <... futex resumed> )       = 0
[pid 29478] futex(0xe00140, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 29478] futex(0xe0016c, FUTEX_WAIT_PRIVATE, 113, {0, 1685728}) = -1 ETIMEDOUT (Connection timed out)
[pid 29478] futex(0xe00140, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 29478] futex(0xe0016c, FUTEX_WAIT_PRIVATE, 115, {0, 177472329}) = -1 ETIMEDOUT (Connection timed out)
[pid 29478] futex(0xe00140, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 29478] close(8)                    = 0
[pid 29478] futex(0x7f63786fd9e0, FUTEX_WAIT, 29485, NULL
Comment 3 Tom Goff 2011-08-16 12:07:39 EDT
I think this bug can be considered resolved.

The approach from above was included in changeset 6747:9dccf3624839
and the proposed fix to bug 1220 makes additional improvements.