Return to GSoC 2015 Accepted Projects page.

Project overview

  • Project: TCP layer refactoring with automated test on RFC-compliance and validation
  • Student: Natale Patriciello
  • Mentors: Tom Henderson, Tommaso Pecorella
  • Abstract: A step-by-step refactoring of the TCP layer, which should lead to an easier way to test congestion control and the RFC compliance of its state machine.
  • Future Plans: None yet
  • Code:
  • About me: My first steps with ns-3 date back to mid-2013. I started by working on an integration of ns-3 and Netmap; the first results were published at IEEE ICON 2013. Then I moved to the upper layers, namely TCP and IP, and proposed a middleware called C2ML to reduce latency in high-delay environments, published at GCOM 2014, with simulations entirely based on ns-3. I contributed to the TCP layer of ns-3 through the SOCIS 2014 experience, where I coded the TCP options and some TCP congestion control algorithms (BIC, Cubic, Hybla, Noordwijk). The project ended successfully, and the TCP variants are under review for inclusion in the mainline (the options have already been accepted). More information about my research status is available [here].

The (original) proposal

Current TCP Overview

The ns-3 TCP layer was substantially rewritten in 2011, with the introduction of the abstract class TcpSocketBase, which provides the basic functions of a TCP socket, such as the mechanics of its state machine and the sliding window. It was designed to be extensible, and in fact it needs to be extended to work: the first extensions released were two TCP flavors, namely TCP NewReno and the basic TCP without congestion control. Over the years, only two subclasses have been added: TcpWestwood and TcpTahoe. It is worth noting that not even the algorithms written for ns-2 (for instance Cubic, Bic and so on) were ported to ns-3. The first time I approached ns-3, I ascribed this to carelessness on the part of researchers: after all, TCP is a well-investigated subject, and little effort is put into its development anymore.


Things changed when I submitted my proposal to SOCIS 2014. It was selected, and I started to develop a number of TCP congestion control algorithms: Cubic, Bic, Hybla, Highspeed, and Noordwijk, together with an initial implementation of TCP options (despite their creation in 1992, ns-3 was still missing all of them). All summer long I faced the messy code of the TCP layer. The real problem is not in the quality of the code (after all, it has, probably, worked well for all these years) but rather in its design. At the core of this firm belief there is one fundamental issue: a congestion control "is-a" TCP (e.g. TcpNewReno "is-a" TcpSocketBase). This way, each TCP flavor needs to define its own cWnd and ssThresh. Moreover, each version has to reimplement basic algorithms (like fast retransmission) and, even worse, bugs fixed in one subclass may still be present in other subclasses.


Evidence of this can be found in the tests in the src/test/ns3tcp subdirectory (I pass over the tests in the src/internet directory: the only one is a very general test which checks whether TcpNewReno can open a connection, transfer some data and then close the connection). For instance, let's take the loss test: all flavors (Westwood, NewReno, Tahoe...) are tested sequentially with an approach that, in words, sounds like: "What happens if the 14th packet is lost?". The outcome is then compared with a reference pcap file, and any difference generates an error. A good design would allow checking the internal state, the values of cWnd before and after, and the slow start threshold, only once for all these flavors, since they share the fast recovery / fast retransmit algorithms. Switching to the congestion window test, it is clear that cWnd is tested only for Reno, and against the Linux 2.6.26 implementation. In general, no RFC compliance is tested (for example: we are in the SYN_RCVD state and we receive an ACK for a random sequence number. What happens?) and all testing is done through comparison with reference pcap files.

Another issue for an ns-3 user is the doubtful consistency of the TcpSocketBase API. For example, the initial congestion window is expressed in packets, while the initial slow start threshold is expressed in bytes; this kind of difference could lead to subtle bugs and misunderstandings in user-written code.

The complete proposal

Read (and comment) the entire proposal [here].

Expected deliverables

Week 1 - Step 1

  • Time measurement on TCP layer
  • Remove the TcpL4Protocol and TcpSocketBase friendship (which becomes a "has" relationship)

Week 2 - Deliverable for Step 1; start of Step 2

  • Patches submitted for deliverables of step 1
  • Inserting cWnd and ssTh management into TcpSocketBase (and the related attributes). Subclasses of TcpSocketBase work on these protected variables.
  • Existing tests updated to account for this design

Week 3 - Step 2

  • Split congestion control part from TcpSocketBase, by creating the interface class TcpCongestionControl.
  • Port of existing congestion control as subclasses of TcpCongestionControl

Week 4 - Step 2

  • Improvement of the existing tests of congestion controls. Tests will be re-organized and expanded (especially for the variants written in SOCIS 2014)

Week 5 - Deliverable for Step 2 and Step 3

  • Subclassing is done through "virtualization" of methods of class TcpSocketBase, and then the code splitting will be done. Non-implemented methods will be pure virtual.

Week 6 - Step 3

  • Carry on with the splitting, with a careful check when splitting duties

Week 7 - Deliverable for Step 3; start of Step 4

  • From here to the end of the project, effort on implementing RFC-compliance test for the TCP state machine.
  • In the remaining time, if it exists, testing against a reference implementation could be made (i.e. pcap generation of the Linux stack, with DCE, and a comparison against ns-3 implementation). Possible differences will be addressed with specific attributes to enable or disable Linux compatibility.

Week 8 - Step 4

Week 9 - Step 4

Week 10 - Step 4 and Deliverables for Step 3 and 4

  • Patches for step 3 and 4. I will high five everyone for the great work done and for the help received :-)

Technical issues and plan

Split input and output logic

Since the TcpSocketBase implementation is ~3000 lines long, working on it is becoming annoying: it is really difficult to walk through so many lines, and most software engineering references (for example [1]) say the optimal length is around 600 lines. I was thinking of splitting the input and output logic of the socket into two different .cc files, with the aim of reaching about 1500 lines each, without changing the architecture (TcpSocketBase would still be one class). The only thing that would change is the logging: a user who wants all the socket logging should enable TcpSocketBaseInput and TcpSocketBaseOutput instead of TcpSocketBase (I don't know if there is a way to create a, let's say, super-logging-class which, if enabled, prints the log statements of its children classes). Pro: a user could log only the input (or the output) logic by enabling the component he/she wants.

[1] IEEE Computer Society Real-World Software Engineering Problems: A Self-Study Guide for Today's Software Professional

Socket attributes

Right now, TCP socket attributes are scattered across classes. For instance, SND.UNA is an attribute of TcpTxBuffer, RCV.NXT is in TcpRxBuffer, and with my refactoring it seems (from the talk with Peter) that cWnd and ssThresh (along with rxThresh and so on) belong to TcpSocketState. My opinion is that, while we are at it, we should investigate all the possible options. So, why not move _all_ the attributes and traces which control the behavior of the TCP socket inside the TCP socket itself? One of the criticisms is:

 this attribute controls the behavior of class A, but it's not defined in A, it's defined in some other class B

My view is that where an attribute is defined is, in this case (and maybe only in this case), purely an implementation choice. cWnd and SND.UNA belong conceptually to the socket; the fact that we moved their definition outside the main TcpSocketBase class, for ease of debugging and coding, is something that interests only developers, not users. So, provided that all the documentation is correct (e.g. the tutorial explains how to connect to TcpSocketBase, while doxygen explains where certain parts are managed), the confusion that might be created for long-time users is avoided. For me either way is fine (technically, with the patch MakeTraceSourceAccessorFn, it is possible to do both).

Final thought: in the code, right now, only cWnd and ssth are moved back into TcpSocketBase.

cWnd inflation/deflation

In Fast Recovery, the RFC says that for each duplicate ACK the implementation should increment cWnd by one segment size. The reasoning behind this temporary inflation of cWnd is to be able to send more segments out for each incoming duplicate ACK (which indicates that another segment made it to the other side). This is necessary because TCP's sliding window is stuck and will not slide until the first non-duplicate ACK comes back. As soon as the first non-duplicate ACK comes back, cWnd is set back to ssthresh and the window continues sliding in normal congestion-avoidance mode. The implementation of TCP in the Linux kernel avoids this "shamanism" (they're so funny in commenting their code) by improving the estimate of the in-flight packets. In the RFC and in ns-3, the calculation is:

 AvailableWindow = cWnd - (SND.NXT - SND.UNA)

example: cWnd = 10, SND.NXT = 20, SND.UNA = 10. You receive 3 ACKs for 10. When receiving the third, you set cWnd to 13, and so:

 AvailableWindow = 13 - (20 - 10) = 3

and 3 packets could be sent. For each additional DUPACK, cWnd is incremented by 1 MSS, and one more packet could be sent. When a full ACK is received, cWnd goes back to the right value (which is the recalculated ssth). In Linux [1], the calculation is:

 AvailableWindow = cWnd - ((SND.NXT - SND.UNA) - left_out + retrans_out)

that is, cWnd minus the estimated number of packets in flight. What are these new values? retrans_out is the number of packets retransmitted, and

 left_out = sacked_out + lost_out

where sacked_out is the number of packets that arrived at the receiver but have not been cumulatively ACKed. With SACK this is easy to obtain, but with DUPACKs it is easy too (sacked_out=m_dupAckCount). lost_out is the only guessed value: with FACK, which is the most conservative heuristic, you assume that all the non-SACKed packets up to the most forward SACK are lost. Since we do not have SACK, the NewReno estimate could be used, which basically assumes that only one segment is lost (classical Reno). If we are in recovery and a partial ACK arrives, it means that one more packet has been lost.
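The two ways of computing the available window can be compared with a minimal sketch. It is not ns-3 or Linux code: the struct and function names are hypothetical, everything is counted in segments, and the Linux-style in-flight estimate is taken as (SND.NXT - SND.UNA) - left_out + retrans_out, subtracted as a whole from cWnd. Like the numeric example in the text, it leaves the ssthresh reduction aside and only looks at the window arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the counters discussed above (unit: segments).
struct WindowState
{
  int64_t cWnd;       // congestion window, NOT inflated by dupacks
  int64_t sndNxt;     // SND.NXT
  int64_t sndUna;     // SND.UNA
  int64_t sackedOut;  // segments that reached the receiver (here: dupack count)
  int64_t lostOut;    // segments assumed lost (NewReno-style guess)
  int64_t retransOut; // segments retransmitted and still unacknowledged
};

// RFC-style: the caller passes a cWnd already inflated by the dupacks.
int64_t AvailableWindowRfc (int64_t inflatedCwnd, int64_t sndNxt, int64_t sndUna)
{
  assert (sndNxt >= sndUna);
  return inflatedCwnd - (sndNxt - sndUna);
}

// Linux-style: cWnd is left alone and the in-flight estimate is corrected.
int64_t InFlightLinux (const WindowState &s)
{
  assert (s.sndNxt >= s.sndUna);
  const int64_t leftOut = s.sackedOut + s.lostOut;
  return (s.sndNxt - s.sndUna) - leftOut + s.retransOut;
}

int64_t AvailableWindowLinux (const WindowState &s)
{
  return s.cWnd - InFlightLinux (s);
}
```

With the numbers of the example (cWnd = 10, SND.NXT = 20, SND.UNA = 10, three DUPACKs, one segment assumed lost and already retransmitted), the inflated RFC window is 13 - 10 = 3, and the Linux-style estimate is 10 - (10 - 4 + 1) = 3 as well.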

On the wire, inflating/deflating cWnd or using the Linux metric is exactly the same. For the cWnd analysis, it is better not to have the inflation, because this makes it possible to see exactly the classical Van Jacobson shape. From the implementation point of view, it is easier not to have the inflation/deflation, since it removes some complexity from the code. However, this means that we will not _strictly_ follow the RFC. Whether that matters, if the result on the wire is exactly the same, depends on your point of view. Mine is to agree with the Linux implementation.

Not present in the final submission, because it is more complex than expected. However, the branch is still alive and under development.

ACK state machine

To deal with SACK/FACK/ECN and so on, Linux introduces a new state machine. I informally call it the "Ack State Machine". There are five states: Open, Disorder, CWR, Recovery, Loss. Introducing it in ns-3 would make it possible to manage fast retransmit/recovery in a more consolidated way, at the cost of introducing another state machine in the code (which anyway could be tracked with attributes). It is not defined in any RFC, but it would help in the future to manage things like Explicit Congestion Notification, local device congestion or ICMP source quench. Introducing this state machine touches the existing code much more deeply than a pure refactoring (ah, one thing I forgot: this state machine only works in the ACK management part; the other pieces are left untouched).

Present in the final submission

Slow start implementation

Let's take as source RFC 5681. Slow start is defined as:

During slow start, a TCP increments cwnd by at most SMSS bytes for
each ACK received that cumulatively acknowledges new data.  Slow
start ends when cwnd exceeds ssthresh (or, optionally, when it
reaches it, as noted above) or when congestion is observed.  While
traditionally TCP implementations have increased cwnd by precisely
SMSS bytes upon receipt of an ACK covering new data, we RECOMMEND
that TCP implementations increase cwnd, per:
   cwnd += min (N, SMSS)                      (2)
where N is the number of previously unacknowledged bytes acknowledged
in the incoming ACK.

Imagine that the receiver uses the delayed ACK algorithm by default (as ns-3 currently does): it means that, more or less, we send 1 ACK for every 2 packets received. This means that, with the slow start algorithm imposed by the RFC, we (no, the RFC) currently reduce the throughput achievable during slow start. Some math (is it really required?):

  • Assume the sender has just sent 2 segments of SMSS bytes each
  • The receiver receives the first and does not ACK it (delayed ACK algorithm)
  • The receiver receives the second and sends an ACK for 2*SMSS bytes
  • The sender receives the ACK and computes
              min (N, SMSS) = min (2*SMSS, SMSS) = SMSS

and so cwnd is increased only by SMSS, and not by 2*SMSS as in the case without delayed ACK. Without delayed ACK we end up with a cWnd of 4 segments, whereas with delayed ACK we end up with a cWnd of only 3 segments. While the first growth doubles cWnd every round trip (4, 8, 16...), the second multiplies it only by about 1.5, and the situation gets even worse when we increase the number of segments to wait for before sending one delayed ACK (which often happens in fast networks).
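The arithmetic above can be replayed with a few lines of code. This is only a sketch of equation (2) applied to the two-segment scenario; the function names are made up for the illustration and are not part of ns-3.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Equation (2) of RFC 5681: cwnd += min (N, SMSS), all values in bytes.
uint32_t SlowStartOnAck (uint32_t cwnd, uint32_t ackedBytes, uint32_t smss)
{
  assert (smss > 0);
  return cwnd + std::min (ackedBytes, smss);
}

// Two segments in flight, each acknowledged by its own ACK.
uint32_t CwndWithoutDelayedAck (uint32_t smss)
{
  uint32_t cwnd = 2 * smss;
  cwnd = SlowStartOnAck (cwnd, smss, smss);
  cwnd = SlowStartOnAck (cwnd, smss, smss);
  return cwnd;
}

// The same two segments, acknowledged together by one delayed ACK.
uint32_t CwndWithDelayedAck (uint32_t smss)
{
  uint32_t cwnd = 2 * smss;
  cwnd = SlowStartOnAck (cwnd, 2 * smss, smss);
  return cwnd;
}
```

For an SMSS of 1000 bytes the first function returns 4000 bytes (4 segments) and the second 3000 bytes (3 segments), matching the text.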

What does Linux do? When it senses that the other end is in slow start, it does not use delayed ACKs.

What can we do?

-> use the following equation instead of (2):
                   cwnd += N
   given that N isn't outside boundaries (i.e. N <= SND.UNA)
-> use the RFC equation, keeping the delayed ACK algorithm the same,
   and documenting this situation
-> (*) use the RFC equation, disabling delayed ACK while we suppose the
   other end is in slow start (i.e. until we send a triple
   DUPACK) and enabling it in congestion avoidance.

Before, for each received ACK (even when it ACKed less than one segment) we increased cWnd by MSS.

The option marked by (*) is present in the final submission.
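To get a feeling for how the three options behave over several round trips, here is a deliberately simplified model, not the ns-3 implementation. It assumes no losses, a sender that always fills its window, a receiver that sends one ACK every segmentsPerAck segments (a leftover segment at the end of a round is ACKed on its own), and it counts cwnd in segments. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

enum class IncreaseRule
{
  RfcMinRule,   // cwnd += min (N, SMSS): at most one segment per ACK
  ByteCounting  // cwnd += N: every acknowledged segment counts
};

// cwnd (in segments) after `rounds` loss-free round trips of slow start.
uint32_t SlowStartGrowth (uint32_t initialCwnd, uint32_t rounds,
                          uint32_t segmentsPerAck, IncreaseRule rule)
{
  assert (segmentsPerAck > 0);
  uint32_t cwnd = initialCwnd;
  for (uint32_t r = 0; r < rounds; ++r)
    {
      uint32_t toAck = cwnd;   // segments sent in this round
      uint32_t increase = 0;
      while (toAck > 0)
        {
          // Number of segments covered by the next ACK.
          const uint32_t covered = std::min (toAck, segmentsPerAck);
          increase += (rule == IncreaseRule::ByteCounting)
                        ? covered
                        : std::min<uint32_t> (covered, 1);
          toAck -= covered;
        }
      cwnd += increase;
    }
  return cwnd;
}
```

Starting from 2 segments, three rounds give 16 segments with the RFC rule and no delayed ACK (the option marked by (*)), 8 segments with the RFC rule and one ACK every 2 segments, and again 16 segments with cwnd += N and one ACK every 2 segments.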

Ideas on possible tests

TCP three-way handshake (not present in the final submission, but easy to set up)
Two possible cases: well-behaving endpoints (SYN, SYN/ACK, ACK progression), or a missing SYN/ACK or ACK because of a drop.
  • Check the transmission of SYN/ACK and ACK after the first SYN
  • Check the retransmission of SYN/ACK or ACK
  • Check the retry count, and the termination if it is not possible to establish the connection

TCP four-way tear-down (not present in the final submission, but easy to set up)
Here too we can have well-behaving endpoints, or losses of FINs or ACKs.
  • Check the sequence FIN-->ACK FIN-->ACK
  • Check the retransmission of FINs

Established state, slow start (present in the final submission)
Without any loss, the congestion window grows up to ssth.
  • Check what happens with small ACKs (e.g. MSS of 1000 bytes, one ACK every 500 bytes)

Established state, congestion avoidance (present in the final submission)
Without any loss, after reaching ssth, the congestion window grows linearly (this depends on the congestion avoidance algorithm selected).
  • Check what happens with small ACKs (e.g. MSS of 1000 bytes, one ACK every 500 bytes)

Established state, single loss, no RTO (present in the final submission)
Single loss in the window. Only duplicate ACKs.

Established state, multiple losses, no RTO (not present in the final submission, but easy to set up)
Multiple losses in the window. Only duplicate ACKs and partial ACKs.

Established state, single loss, with RTO (present in the final submission)
Single loss in the window, detected through the expiration of the RTO.

Established state, multiple losses, with RTO (not present in the final submission, but easy to set up)
Multiple losses in the same window. Multiple RTO expirations.

Weekly progress

Week 1 - Step 1

Different tools have been evaluated to measure the performance of ns-3. I was mainly interested in the TCP layer, but I found that it takes less than 0.57% of the total running time for TCP-based examples like tcp-variants-comparison and tcp-bulk-send. In particular, I'm reporting the results obtained with perf, which is the currently popular technology for performance evaluation.

The largest source of overhead is in core (the MultModM function; as a side question, is there a chance to find another way, maybe in assembly, to do it?), as is visible from this line:

 10,63%    [.] (anonymous namespace)::MultModM

TCP accounts for less than 0.57% in its most demanding piece of code:

 0.57%   std::_List_base<ns3::Ptr<ns3::TcpOption>, std::allocator<ns3::Ptr<ns3::TcpOption> > >::_M_clear

and to reach the first TCP function we have to walk down to the 4th place, where TcpTxBuffer::CopyFromSequence stands with 0.35%. The most demanding functions in TcpSocketBase are SendPendingData and ReceivedData (obviously), while TcpL4Protocol isn't on the very first page of the list (so it is reasonably fast). Please note that results can vary a little with each run (measuring performance isn't an exact science), but from what I have gathered I can now start to change things while making sure that none of my edits adds unwanted complexity (on the contrary, with the hope of being slightly faster than before). If you want to try, measure ns-3-dev (with ./waf --run "tcp-bulk-send" --command-template="perf record %s") and then the code in [1]. The results are displayed with perf report.

Logically separating TcpL4Protocol and TcpSocketBase isn't only a stylistic change. It implies a better definition of the roles of the two classes: TcpL4Protocol handles {de,}multiplexing between open sockets (thanks to the endpoints), while TcpSocketBase implements all the logic behind a TCP communication between two endpoints.
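The split of duties can be pictured with a toy demultiplexer. This is a sketch under loud assumptions: the class, the key layout (local port, remote address, remote port) and the callback type are invented for the illustration and do not reproduce the TcpL4Protocol API. The point is only that the L4 object finds the right socket, while everything about the connection lives behind the callback.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical connection key: local port, remote address, remote port.
using EndPointKey = std::tuple<uint16_t, std::string, uint16_t>;
using ReceiveCallback = std::function<void (const std::string &payload)>;

class SketchL4Demux
{
public:
  // A socket registers the callback that should receive its segments.
  void Register (const EndPointKey &key, ReceiveCallback cb)
  {
    assert (cb != nullptr);
    m_endPoints[key] = std::move (cb);
  }

  // Returns true if a socket was registered for the key and got the payload.
  bool ForwardUp (const EndPointKey &key, const std::string &payload) const
  {
    auto it = m_endPoints.find (key);
    if (it == m_endPoints.end ())
      {
        return false;   // no socket for this endpoint
      }
    it->second (payload);
    return true;
  }

private:
  std::map<EndPointKey, ReceiveCallback> m_endPoints;
};
```

The demultiplexer never looks inside the payload and keeps no per-connection state beyond the lookup table, which is the kind of separation the patches aim for.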

To do so, an incremental approach has been taken. You can see in [1] all the patches that have been published with an in-depth explanation of what has been done and why.


Week 2

  • The merge for tcp-versions into mainline has been slightly delayed to Monday due to reviewers' duties.
  • Patches for TcpL4Protocol are ready for review. They simplify the class a lot, reduce the duplicated code, and fix two bugs (one for an invalid RST packet, the other about const correctness of methods). The Git repository is in [1].
  • Manage ssth and cwnd into TcpSocketBase

Finally! I have always wanted that. Now each TCP version doesn't need to declare and manage its own window flow variables. They are initialized and accessible via the Attribute system through TcpSocketBase! This means ~500 lines of duplicated code removed, as can be seen from the diff stat:

 16 files changed, 136 insertions(+), 618 deletions(-)

All this without touching the functionality of the congestion control algorithms. The code is ready in [2] (the codereview will be set up after the patches for TcpL4Protocol reach mainline). This is a starting point for extracting congestion control from the socket. By the way, I've done a little thing (unnoticed before). The function DoForwardUp exists in two versions (IPv4 and IPv6), which duplicates a lot of code. Thanks to the changes to TcpL4Protocol, the two functions are now merged. Less duplicated code and the same functionality: this is what I love :-D (code in [3], codereview together with the cwnd-ssth merge).

 src/internet/model/ | 185 +++++++++++++++++++++++++------------------------------------------------------------------------------
 src/internet/model/tcp-socket-base.h  |  22 +++++--------
 2 files changed, 53 insertions(+), 154 deletions(-)

[1] [2] [3]

Week 3

  • The TcpCongestionOps abstract class has been created. To exchange data between the socket and the congestion control, a TcpSocketState class has been created as well, with the members the congestion control needs to work (e.g. cwnd, ssth).
  • NewAck has been implemented, and a simple transfer (tcp-bulk-send) is running fine under the new model (the code has not been modified, only refactored).
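The shape of this split can be sketched as follows. The classes below are simplified stand-ins with hypothetical names and methods; they do not reproduce the signatures of TcpCongestionOps or TcpSocketState. The socket owns the shared state and delegates window decisions to a congestion control object it "has", instead of each flavor being a socket subclass. The NewReno-like rules (one segment per ACK in slow start, roughly one segment per round trip afterwards) are an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

// Hypothetical state shared between socket and congestion control (bytes).
struct SketchSocketState
{
  uint32_t cWnd;
  uint32_t ssThresh;
  uint32_t segmentSize;
};

// Hypothetical interface: one method per congestion-control decision.
class SketchCongestionOps
{
public:
  virtual ~SketchCongestionOps () = default;
  virtual void OnNewAck (SketchSocketState &tcb) = 0;
  virtual uint32_t SsThreshOnLoss (const SketchSocketState &tcb,
                                   uint32_t bytesInFlight) = 0;
};

class SketchNewReno : public SketchCongestionOps
{
public:
  void OnNewAck (SketchSocketState &tcb) override
  {
    assert (tcb.cWnd > 0);
    if (tcb.cWnd < tcb.ssThresh)
      {
        tcb.cWnd += tcb.segmentSize;   // slow start
      }
    else
      {
        // Congestion avoidance: about one segment per round trip.
        const uint32_t add = (tcb.segmentSize * tcb.segmentSize) / tcb.cWnd;
        tcb.cWnd += std::max<uint32_t> (add, 1);
      }
  }

  uint32_t SsThreshOnLoss (const SketchSocketState &tcb,
                           uint32_t bytesInFlight) override
  {
    return std::max (2 * tcb.segmentSize, bytesInFlight / 2);
  }
};

// The socket "has-a" congestion control and owns the shared state.
class SketchSocket
{
public:
  SketchSocket (std::unique_ptr<SketchCongestionOps> cc, SketchSocketState init)
    : m_cc (std::move (cc)), m_tcb (init)
  {
    assert (m_cc != nullptr);
  }

  void ReceivedNewAck () { m_cc->OnNewAck (m_tcb); }

  void LossDetected (uint32_t bytesInFlight)
  {
    m_tcb.ssThresh = m_cc->SsThreshOnLoss (m_tcb, bytesInFlight);
    m_tcb.cWnd = m_tcb.ssThresh;
  }

  const SketchSocketState &State () const { return m_tcb; }

private:
  std::unique_ptr<SketchCongestionOps> m_cc;
  SketchSocketState m_tcb;
};
```

With this layout a new algorithm only has to implement the two decision methods; the loss handling in the socket is written once and shared by all of them.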

Week 4

  • Implemented the ACK state machine, which manages Fast Retransmit and Fast Recovery. The code has been changed only to manage the states of this machine (OPEN, DISORDER, RECOVERY, LOSS), but the path and the actions taken by the code are exactly the same. The patch is made in an incremental way and takes 8 commits.
  • Loss management now happens in-window: this means that ssthresh is recalculated only for the first loss in the window. It can be halved again only after the state change (LOSS->OPEN). I see some minor variations in the pcap traces before and after the patch; I'll investigate these differences this week.
  • Since TCP Reno and TCP Tahoe differ from NewReno only in the fast retransmit and fast recovery phases, they have been removed. If the community wants them back, some switch should be added.


Week 5

  • Prepared the branch (after a first round of review by Tom) to be included, as midterm accomplishment. I've also updated my wiki page (which is now really long)
  • As recap, I've finished working on these branches:
  - gsoc-ack-state-machine (introd. ack state machine)
  - gsoc-cwnd-inflation    (removed inflation/deflation)
  - gsoc-tcp-tcb           (used a Transmission Control Block to pass
                            variables between Socket and Cong. Control)
  • I'm currently in the gsoc-tcp-error-model branch, where a first test on slow start is being developed.

Week 6

  • As pointed out in a review, I've added an API to mark traces and attributes as "deprecated".
  • The entire patchset is ready for review. Each commit is self-contained and documented in the commit message.
  • Slow start test done.
  • Congestion avoidance (NewReno, Tahoe) test done. It simply checks that, in each RTT, the cWnd is opened by 1 segment.

Week 7

Due to family issues, I've accomplished less than I had in mind.

  • Updated TCP Congestion avoidance test

Week 8

For bug 2149, on Deprecation of attributes and trace sources, two iterations have been done. The current patch introduces MakeEmptyAttributeAccessor and MakeEmptyAttributeChecker, in order to utilize an EmptyAttributeValue as placeholder for deprecated/obsoleted attributes. For TraceSource, it is similar.

On the fast recovery/retransmit side, the general structure of the test is completed. Things I want to test:

 - Ack state machine state changing
 - on each dupack, one segment is sent out
 - one ssth reduction per window

Week 9

Week 10

Midterm review

Here I want to present the work done until the 4th week for the midterm review. The section is organized as follows:

  • Brief patch summary
  • Link to the codereview

As a general remark, I'm sorry to be unable to provide either a first set of test examples or the full socket refactoring in the midterm review. As you can deduce from the technical issues section, there are three more things to complete:

  • Ack state machine
  • remove inflation/deflation of cwnd
  • adding the Transmission Control Block class to exchange data between sockets and congestion control algorithms.

They'll be ready for the final review, along with the tests. Since they have been merged into ns-3-dev (2015-10-21), the links now point to tags in the real code.

To get individual patches, the procedure is as follows:

 git clone git_repo   # Only the first time
 git checkout branch_name   # Creates a local branch tracking the remote one
 git format-patch parent_branch

parent_branch will be indicated at the beginning of each section.

TCP L4 Protocol

 Parent branch: master (commit 1606c90 in github mirror)

The objective of this patchset is to address the API of TcpL4Protocol, in order to separate the behavior of TcpSocketBase and TcpL4Protocol. In particular, no direct connection through a "friend" relationship should exist between these classes; they are two separate entities with well-defined duties. TcpL4Protocol should do multiplexing/demultiplexing, while TcpSocketBase maintains the TCP end-to-end behavior.

Git link:

Merge DoForward methods

 Parent branch: 2015-gsoc-1-tcp-l4-protocol

The commit unifies the behavior of DoForwardUp for both IPv4 and IPv6 (previously tagged as duplicated code) by changing the input parameters: from an {IPv4,IPv6}Header to a pair of addresses (sender and receiver). Thanks to the Send() method of TcpL4Protocol, which takes two Addresses as input, the behavior of the method could be unified.

Git link:

Better print statements and log management

 Parent branch: 2015-gsoc-2-merge-doforwardup

The objective of this branch is to have independent print statements (through the use of operator<<) and, in general, a unified way of printing messages in the TCP part.
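As an illustration of the idea (not the code in the branch), giving a type its own operator<< keeps the formatting in one place, so every log statement prints it the same way. The struct below is a hypothetical, cut-down header invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical, reduced TCP header used only for this example.
struct SketchTcpHeader
{
  uint16_t srcPort;
  uint16_t dstPort;
  uint32_t seq;
  uint32_t ack;
  bool syn;
  bool fin;
};

// Single place where the textual form of the header is decided.
std::ostream &operator<< (std::ostream &os, const SketchTcpHeader &h)
{
  os << h.srcPort << " > " << h.dstPort << " [";
  if (h.syn)
    {
      os << " SYN";
    }
  if (h.fin)
    {
      os << " FIN";
    }
  os << " ] Seq=" << h.seq << " Ack=" << h.ack;
  return os;
}

// Convenience helper for callers that need a string instead of a stream.
std::string ToString (const SketchTcpHeader &h)
{
  std::ostringstream oss;
  oss << h;
  return oss.str ();
}
```

Any code that logs a header can then simply stream it, without repeating the field-by-field formatting.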

Git link:

cWnd and ssTh management inside TcpSocketBase

 Parent branch: 2015-gsoc-3-messages

Each TCP flavor used to manage its own congestion window and slow start threshold. These parameters are part of the TCP behavior, and so they have been moved inside TcpSocketBase. Rfc793 simply does not use these variables. The patchset contains a change which is not backward compatible: the traces of cWnd and ssTh are moved from the TCP subclasses to TcpSocketBase. Initialization is also moved inside TcpSocketBase.

Git link:

Final review

Ack state machine

 Parent branch: 2015-gsoc-4-cwnd-ssth-merge

Implemented the ACK state machine. The states are:

 typedef enum
   {
     OPEN,        /**< Normal state, no dubious events */
     DISORDER,    /**< In all the respects it is "Open",
                   *  but requires a bit more attention. It is entered when
                   *  we see some SACKs or dupacks. It is split of "Open" */
     CWR,         /**< cWnd was reduced due to some Congestion Notification event.
                   *  It can be ECN, ICMP source quench, local device congestion.
                   *  Not used in NS-3 right now. */
     RECOVERY,     /**< CWND was reduced, we are fast-retransmitting. */
     LOSS,         /**< CWND was reduced due to RTO timeout or SACK reneging. */
     LAST_ACKSTATE /**< Used only in debug messages */
   } TcpAckState_t;

This state machine makes it possible to move Fast Recovery and Fast Retransmit out of NewReno and integrate them into TcpSocketBase. As a downside, Reno and RFC 793 TCP (no congestion control) are removed from the codebase.
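A minimal sketch of how such a machine can drive loss handling is given below. It is not the ns-3 code: the class is hypothetical, the window is counted in segments, CWR and partial ACKs are left out, and the transitions (first duplicate ACK enters Disorder, third enters Recovery, a full ACK returns to Open, an RTO enters Loss) are assumptions made for the example. It also shows the "one ssthresh reduction per window" rule: ssthresh is recomputed only when leaving Open or Disorder.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

enum class AckState { Open, Disorder, Recovery, Loss };

class SketchAckMachine
{
public:
  SketchAckMachine (uint32_t cWnd, uint32_t ssThresh)
    : m_cWnd (cWnd), m_ssThresh (ssThresh)
  {
    assert (cWnd > 0);
  }

  void DupAckReceived ()
  {
    if (m_state == AckState::Recovery || m_state == AckState::Loss)
      {
        return;   // already reacting to a loss in this window
      }
    ++m_dupAcks;
    m_state = AckState::Disorder;
    if (m_dupAcks == 3)
      {
        // First loss in the window: reduce once, without inflating cWnd.
        m_ssThresh = std::max<uint32_t> (m_cWnd / 2, 2);
        m_cWnd = m_ssThresh;
        m_state = AckState::Recovery;
      }
  }

  // An ACK covering everything that was outstanding when the loss was seen.
  void FullAckReceived ()
  {
    m_dupAcks = 0;
    m_state = AckState::Open;
  }

  void RtoExpired ()
  {
    if (m_state == AckState::Open || m_state == AckState::Disorder)
      {
        m_ssThresh = std::max<uint32_t> (m_cWnd / 2, 2);
      }
    m_cWnd = 1;
    m_dupAcks = 0;
    m_state = AckState::Loss;
  }

  AckState State () const { return m_state; }
  uint32_t CWnd () const { return m_cWnd; }
  uint32_t SsThresh () const { return m_ssThresh; }

private:
  AckState m_state = AckState::Open;
  uint32_t m_dupAcks = 0;
  uint32_t m_cWnd;
  uint32_t m_ssThresh;
};
```

Because the reaction to duplicate ACKs and timeouts depends only on the state, it can sit in the socket once and be shared by every congestion control algorithm.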

Git link:

Transmission control block

 Parent branch: 2015-gsoc-5-ack-state-machine

Split the state from TcpSocketBase. A new class has been added (TcpSocketState) which contains information such as the congestion window and the slow start threshold.

Git link:

Fix TCP bugs

 Parent branch: 2015-gsoc-6-tcp-tcb

Fix some TCB bugs reported in the bug tracker: ids 2159, 2150, 2041, 2165.

Git link:

Tcp Error Model

 Parent branch: 2015-gsoc-7-fix-bugs

Introduce the error model, and the environment, for TCP testing. Include tests for: slow start, new reno congestion avoidance, fast retransmission, rto expiration.

Git link: