GSOC2015TCPTest
Revision as of 12:41, 21 June 2015
Return to GSoC 2015 Accepted Projects page.
Project overview
- Project: TCP layer refactoring with automated test on RFC-compliance and validation
- Student: Natale Patriciello
- Mentors: Tom Henderson, Tommaso Pecorella
- Abstract: A step-by-step refactoring of the TCP layer, which should lead to an easier way to test congestion control and the RFC compliance of its state machine.
- Future Plans: None yet
- Code:
- About me: My first steps with ns-3 date back to mid-2013. I started by working on an integration of ns-3 and Netmap; the first results were published at IEEE ICON 2013. Then I moved to the upper layers, namely TCP and IP, and proposed a middleware called C2ML to reduce latency in high-delay environments; it was published at GCOM 2014, with simulations based entirely on ns-3. I contributed to the TCP layer of ns-3 through the SOCIS 2014 experience, where I coded the TCP options and some TCP congestion control algorithms (BIC, Cubic, Hybla, Noordwijk). The project ended successfully, and the TCP variants are under review for inclusion in the mainline (the options have already been accepted). More information about my research is available [here]
The (original) proposal
Current TCP Overview
The ns-3 TCP layer was substantially rewritten in 2011, with the introduction of the abstract class TcpSocketBase, which provides the basic functions of a TCP socket, such as the mechanics of its state machine and the sliding window. It was designed to be extensible, and in fact it needs to be extended to work: the first extensions released were two TCP flavors, namely TCP NewReno and the basic TCP without congestion control. Over the years, only two subclasses have been added: TcpWestwood and TcpTahoe. It is worth noting that not even the algorithms written for ns-2 (for instance Cubic, Bic and so on) were ported to ns-3. The first time I approached ns-3, I ascribed this to carelessness on the part of researchers: after all, TCP is a well-investigated subject, and little effort goes into its development anymore.
Considerations
Things changed when I submitted my proposal to SOCIS 2014. It was selected, and I started to develop a number of TCP congestion control algorithms: Cubic, Bic, Hybla, Highspeed, and Noordwijk, together with an initial implementation of TCP options (despite their creation in 1992, ns-3 was still missing all of them). All summer long I faced the messy code of the TCP layer. The real problem is not the quality of the code (after all, it has probably worked well for all these years) but rather its design. At the core of this firm belief there is one fundamental issue: a congestion control "is-a" TCP (e.g. TcpNewReno "is-a" TcpSocketBase). This way, each TCP flavor needs to define its own cWnd and ssThresh. Moreover, each version has to reimplement basic algorithms (like fast retransmission) and, even worse, bugs resolved in one subclass may still be present in other subclasses.
Tests
Evidence of this can be found in the tests in the src/test/ns3tcp subdirectory (I pass over the tests in the src/internet directory: the only one is a very general test that checks whether TcpNewReno can open a connection, transfer some data, and then close the connection). For instance, take the loss test: all flavors (Westwood, NewReno, Tahoe, ...) are tested sequentially with an approach that, in words, sounds like: "What happens if the 14th packet is lost?". The outcome is then compared with a reference pcap file, and any difference generates an error. A good design would allow checking the internal state, the values of cWnd before and after, and the slow start threshold only once for all these flavors, since they share the fast recovery / fast retransmit algorithms. Turning to the congestion window test, cWnd is tested only for Reno, and against the Linux 2.6.26 implementation. In general, no RFC compliance is tested (for example: we are in the SYN_RCVD state and we receive an ACK for a random sequence number. What happens?) and all testing is done through comparison with reference pcap files. Another issue for an ns-3 user is the doubtful consistency of the TcpSocketBase API. For example, the initial congestion window is expressed in packets, while the initial slow start threshold is expressed in bytes; this kind of difference can lead to subtle bugs and misunderstandings in user-written code.
The complete proposal
Read (and comment) the entire proposal [here].
Expected deliverables
Week 1 - Step 1
- Time measurement on TCP layer
- Remove the TcpL4Protocol and TcpSocketBase friendship (which becomes a "has-a" relationship)
Week 2 - Deliverable for Step 1; start of Step 2
- Patches submitted for deliverables of step 1
- Move cWnd and ssTh management (and the related attributes) into TcpSocketBase. Subclasses of TcpSocketBase work on these protected variables.
- Existing tests updated to account for this design
Week 3 - Step 2
- Split congestion control part from TcpSocketBase, by creating the interface class TcpCongestionControl.
- Port of existing congestion control as subclasses of TcpCongestionControl
Week 4 - Step 2
- Improvement of the existing congestion control tests. The tests will be reorganized and expanded (especially for the variants written in SOCIS 2014)
Week 5 - Deliverable for Step 2 and Step 3
- Subclassing is done through "virtualization" of the methods of class TcpSocketBase, and then the code splitting will be done. Non-implemented methods will be pure virtual.
Week 6 - Step 3
- Carry on with the splitting, with a careful check when splitting duties
Week 7 - Deliverable for Step 3; start of Step 4
- From here to the end of the project, effort on implementing RFC-compliance test for the TCP state machine.
- In the remaining time, if any, testing against a reference implementation could be done (i.e. pcap generation from the Linux stack, with DCE, and a comparison against the ns-3 implementation). Possible differences will be addressed with specific attributes to enable or disable Linux compatibility.
Week 8 - Step 4
Week 9 - Step 4
Week 10 - Step 4 and Deliverables for Step 3 and 4
- Patches for step 3 and 4. I will high five everyone for the great work done and for the help received :-)
Technical issues and plan
Split input and output logic
tcp-socket-base.cc is ~3000 lines long, and working on it is becoming annoying because it is really difficult to walk through so many lines (most software engineering references, for example [1], say the optimal length is about 600 lines). I was thinking of splitting the input and output logic of the socket into two different .cc files (tcp-socket-base-input.cc and tcp-socket-base-output.cc), aiming at about 1500 lines each, without changing the architecture (TcpSocketBase would still be one class). The only thing that would change is the logging: a user who wants to output all socket logging would have to enable TcpSocketBaseInput and TcpSocketBaseOutput instead of TcpSocketBase (I don't know if there is a way to create a, let's say, super-logging-class which, if enabled, prints the log statements of its child classes). Pro: a user could log only the input (or the output) logic by enabling the component he/she wants.
[1] IEEE Computer Society Real-World Software Engineering Problems: A Self-Study Guide for Today's Software Professional
Socket attributes
Right now, TCP socket attributes are scattered across classes. For instance, SND.UNA is an attribute of TcpTxBuffer, RCV.NXT is in TcpRxBuffer, and with my refactoring it seems (from the talk with Peter) that cWnd and ssThresh (along with rxThresh and so on) belong to TcpSocketState. My opinion is that, while we are at it, we should investigate all the possible options. So, why not move _all_ the attributes and traces which control the behavior of the TCP socket inside the TCP socket itself? One of the criticisms is:
this attribute controls the behavior of class A, but it's not defined in A, it's defined in some other class B
My view is that where an attribute is defined is, in this case (and maybe only in this case), purely an implementation choice. cWnd and SND.UNA belong conceptually to the socket; the fact that we, for ease of debugging and coding, moved their definition outside the main TcpSocketBase class is of interest only to developers, not to users. So, provided that all the documentation is correct (e.g. the tutorial explains how to connect to TcpSocketBase, while doxygen explains where certain parts are managed), the confusion that might be created for long-time users is avoided. For me either way is fine (technically, with the MakeTraceSourceAccessorFn patch, it is possible to do both).
cWnd inflation/deflation
In Fast Recovery, the RFC says that for each duplicate ACK the implementation should increment cWnd by one segment size. The reasoning behind this temporary inflation of cWnd is to be able to send more segments out for each incoming duplicate ACK (which indicates that another segment made it to the other side). This is necessary because TCP's sliding window is stuck and will not slide until the first non-duplicate ACK comes back. As soon as the first non-duplicate ACK comes back, cWnd is set back to ssthresh and the window continues sliding in normal congestion-avoidance mode. The TCP implementation in the Linux kernel avoids this "shamanism" (they're so funny in commenting their code) by improving the estimate of the in-flight packets. In the RFC and in ns-3, the calculation is:
AvailableWindow = cWnd - (SND.NXT - SND.UNA)
Example: cWnd = 10, SND.NXT = 20, SND.UNA = 10. You receive 3 duplicate ACKs for 10. When receiving the third, you set cWnd to 13, and so:
AvailableWindow = 13 - (20 - 10) = 3
and 3 packets could be sent. For each additional DUPACK, cWnd is incremented by 1 MSS, and one more packet could be sent. When a full ACK is received, cWnd goes back to the right value (which is the recalculated ssth). In Linux [1], cWnd is left untouched and the in-flight estimate is refined instead, so the calculation is:
AvailableWindow = cWnd - ((SND.NXT - SND.UNA) - left_out + retrans_out)
What are these new values? retrans_out is the number of packets retransmitted, and
left_out = sacked_out + lost_out
where sacked_out is the number of packets that arrived at the receiver but have not been cumulatively acked. With SACK this is easy to obtain, but with DUPACKs it is easy too (sacked_out = m_dupAckCount). lost_out is the only guessed value: with FACK, which is the most conservative heuristic, you assume that all the non-SACKed packets up to the most forward SACK are lost. Since we do not have SACK, the NewReno estimate could be used, which basically assumes that only one segment is lost (classical Reno). If we are in recovery and a partial ACK arrives, it means that one more packet has been lost.
On the wire, inflating / deflating cWnd or using the Linux metric is exactly the same. For cWnd analysis, it is better not to have the inflation, because this makes it possible to see exactly the classical Van Jacobson shape. From the implementation point of view, it is easier not to have the inflation/deflation, since it removes some complexity from the code. However, this means that we will not _strictly_ follow the RFC. Whether that matters, given that the result on the wire is exactly the same, depends on your point of view. Mine is to go with the Linux implementation.
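To make the equivalence concrete, here is a minimal standalone sketch (not ns-3 code) that computes the available window both ways for the example above. The function names are hypothetical; all quantities are in segments; it assumes the Reno-style guess lost_out = 1 and retrans_out = 1 (the fast retransmit has already been sent), and, like the example, it ignores the halving of cWnd.

```cpp
#include <cassert>
#include <cstdint>

// RFC-style: cWnd is temporarily inflated by one segment per duplicate ACK.
// (Hypothetical helper, quantities expressed in segments.)
int32_t AvailableWindowInflated(int32_t cWnd, int32_t sndNxt, int32_t sndUna,
                                int32_t dupAcks)
{
    int32_t inflatedCwnd = cWnd + dupAcks;
    return inflatedCwnd - (sndNxt - sndUna);
}

// Linux-style: cWnd is left untouched, the in-flight estimate is refined.
// (Hypothetical helper, quantities expressed in segments.)
int32_t AvailableWindowLinuxStyle(int32_t cWnd, int32_t sndNxt, int32_t sndUna,
                                  int32_t sackedOut, int32_t lostOut,
                                  int32_t retransOut)
{
    int32_t leftOut = sackedOut + lostOut;
    int32_t inFlight = (sndNxt - sndUna) - leftOut + retransOut;
    return cWnd - inFlight;
}
```

With cWnd = 10, SND.NXT = 20, SND.UNA = 10 and three duplicate ACKs, AvailableWindowInflated(10, 20, 10, 3) and AvailableWindowLinuxStyle(10, 20, 10, 3, 1, 1) both give 3, and they stay equal for every further duplicate ACK as long as lost_out and retrans_out cancel out.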
ACK state machine
To deal with SACK/FACK/ECN and so on, Linux introduces a new state machine. I affectionately call it the "Ack State Machine". There are five states: Open, Disorder, CWR, Recovery, Loss. Introducing it in ns-3 would allow fast retransmit/recovery to be managed in a more consolidated way, at the cost of introducing another state machine in the code (which could anyway be tracked with attributes). It is not defined in any RFC, but it would help in the future to manage things like Explicit Congestion Notification, Local Device Congestion or ICMP source quench. Introducing this state machine touches the current code much more deeply than a plain refactoring (one thing I forgot: this state machine only works in the ACK management part; other pieces are left untouched).
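As an illustration only, the five states could be sketched as follows. The state names come from the text above; the event names, the duplicate ACK threshold of 3, and the simplified transition rules are assumptions made for this sketch, and are neither the Linux logic nor an ns-3 API.

```cpp
#include <cassert>
#include <cstdint>

// The five states named in the text.
enum class AckState { Open, Disorder, Cwr, Recovery, Loss };

// Hypothetical events, chosen only for this sketch.
enum class AckEvent { DupAck, CongestionNotification, RecoveryPointAcked, RtoExpired };

struct AckStateMachine
{
    AckState state = AckState::Open;
    uint32_t dupAckCount = 0;

    void Process(AckEvent ev)
    {
        const uint32_t dupThresh = 3; // assumed duplicate ACK threshold
        switch (ev)
        {
        case AckEvent::RtoExpired:
            // A retransmission timeout overrides every other state.
            state = AckState::Loss;
            dupAckCount = 0;
            break;
        case AckEvent::CongestionNotification:
            // E.g. ECN or local device congestion, with no loss detected yet.
            if (state == AckState::Open || state == AckState::Disorder)
            {
                state = AckState::Cwr;
            }
            break;
        case AckEvent::DupAck:
            if (state == AckState::Loss)
            {
                break; // simplification: duplicate ACKs are ignored in Loss
            }
            ++dupAckCount;
            if (state != AckState::Recovery)
            {
                if (dupAckCount >= dupThresh)
                {
                    state = AckState::Recovery; // fast retransmit / recovery
                }
                else if (state == AckState::Open)
                {
                    state = AckState::Disorder;
                }
            }
            break;
        case AckEvent::RecoveryPointAcked:
            // Everything outstanding when the episode started has been acked.
            state = AckState::Open;
            dupAckCount = 0;
            break;
        }
    }
};
```

Keeping the transitions in one place like this is what would let fast retransmit, ECN reaction and RTO handling share a single code path in the ACK processing.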
Ideas on possible tests
- TCP three-way handshake: two possible cases, either well-behaving endpoints (SYN, SYN/ACK, ACK progression) or a missing SYN/ACK or ACK because of a drop.
  - Check the transmission of SYN/ACK and ACK after the first SYN
  - Check the retransmission of SYN/ACK or ACK
  - Check the retry count and the termination if it is not possible to establish the connection
- TCP four-way tear-down: here too we can have well-behaving endpoints or losses of FINs or ACKs.
  - Check the sequence FIN-->ACK FIN-->ACK
  - Check the retransmission of FINs
- Established state, slow start: without any loss, the congestion window grows up to ssth.
  - Check what happens with small ACKs (e.g. 1000 bytes of MSS, an ACK every 500 bytes)
- Established state, congestion avoidance: without any loss, after reaching ssth, the congestion window grows linearly (this depends on the congestion avoidance algorithm selected).
  - Check what happens with small ACKs (e.g. 1000 bytes of MSS, an ACK every 500 bytes)
- Established state, single loss, no RTO: single loss in the window. Only duplicate ACKs.
- Established state, multiple losses, no RTO: multiple losses in the window. Only duplicate ACKs and partial ACKs.
- Established state, single loss, with RTO: single loss in the window, detected through the expiration of the RTO.
- Established state, multiple losses, with RTO: multiple losses in the same window. Multiple RTO expirations.
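For the "small ACKs" checks, a test could compare the window growth against a reference rule. The standalone sketch below (hypothetical helper names, not ns-3 code) contrasts the byte-counting slow-start increase recommended by RFC 5681, cwnd += min(N, SMSS), with a naive rule that adds a full MSS for every ACK regardless of how many bytes it covers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Slow-start increase with byte counting: an ACK covering N new bytes
// grows the window by min(N, MSS). All values are in bytes.
uint32_t SlowStartOnAck(uint32_t cWnd, uint32_t bytesAcked, uint32_t mss)
{
    return cWnd + std::min(bytesAcked, mss);
}

// Naive increase: one full MSS per ACK, however few bytes it acknowledges.
uint32_t SlowStartOnAckNaive(uint32_t cWnd, uint32_t mss)
{
    return cWnd + mss;
}
```

Starting from cWnd = 1000 with MSS = 1000, two ACKs of 500 bytes each bring the byte-counting window to 2000, the same as a single ACK of 1000 bytes, while the naive rule reaches 3000. A test on small ACKs would reveal which of the two behaviors the socket implements.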
Weekly progress
Week 1 - Step 1
Different tools have been evaluated to measure the performance of ns-3. I was mainly interested in the TCP layer, but I found that it accounts for less than 0.57% of the total running time for TCP-based examples like tcp-variants-comparison and tcp-bulk-send. In particular, I'm reporting the results obtained with perf, which is the tool currently in vogue for performance evaluation.
The largest source of overhead is in core (the MultModM function; as a side question, is there another way, maybe in assembly, to do it?), as is visible from this line:
10,63% libns3-dev-core-debug.so [.] (anonymous namespace)::MultModM
TCP accounts for no more than 0.57% in its most demanding piece of code:
0.57% std::_List_base<ns3::Ptr<ns3::TcpOption>, std::allocator<ns3::Ptr<ns3::TcpOption> > >::_M_clear
and to reach the first TCP function proper we have to go down to 4th place, where TcpTxBuffer::CopyFromSequence stands with 0.35%. The most demanding functions in TcpSocketBase are SendPendingData and ReceivedData (obviously), while TcpL4Protocol isn't on the very first page of the list (so it is reasonably fast). Please note that results can vary a little with each run (measuring performance isn't an exact science), but from what I have gathered I can now start to change things, making sure that none of my edits adds unwanted complexity (on the contrary, with the hope of being slightly faster than before). If you want to try, check ns-3-dev (with ./waf --run "tcp-bulk-send" --command-template="perf record %s") and then check the code in [1]. Results are reported with perf report.
Logically separating TcpL4Protocol and TcpSocketBase isn't only a stylistic change. It implies a better definition of the roles of the two classes: TcpL4Protocol handles {de,}multiplexing between open sockets (thanks to the endpoints), while TcpSocketBase implements all the logic behind a TCP communication between two endpoints.
To do so, an incremental approach has been taken. You can see in [1] all the patches that have been published, with an in-depth explanation of what has been done and why.
[1] https://github.com/kronat/ns-3-dev-git/tree/gsoc-tcp-l4-protocol
Week 2
- The merge for tcp-versions into mainline has been slightly delayed to Monday due to reviewers' duties.
- Patches for TcpL4Protocol are ready for review. They simplify the class a lot, reduce the duplicated code, and fix two bugs (one for an invalid RST packet, the other about const correctness of methods). The Git repository is in [1].
- Manage ssth and cwnd in TcpSocketBase
Finally! I have always wanted that. Now, each TCP version doesn't need to declare and manage its own window flow variables. They are initialized and accessible via the Attribute system through TcpSocketBase! This means ~500 lines of duplicated code removed, as can be seen from the stat:
16 files changed, 136 insertions(+), 618 deletions(-)
All of this without touching the functionality of the congestion control algorithms. The code is ready in [2] (the codereview will be set up after the TcpL4Protocol patches reach mainline). This is a starting point for extracting congestion control from the socket. By the way, I've done one more little thing (unnoticed before). The function DoForwardUp exists in two versions (IPv4 and IPv6), which duplicates a lot of code. Thanks to the changes to TcpL4Protocol, the two functions are now merged. Less duplicated code and same functionality: this is what I love :-D (code in [3], codereview together with the cwnd-ssth merge).
src/internet/model/tcp-socket-base.cc | 185 +++++++++++++++++++++++++------------------------------------------------------------------------------
src/internet/model/tcp-socket-base.h | 22 +++++-------
2 files changed, 53 insertions(+), 154 deletions(-)
[1] https://github.com/kronat/ns-3-dev-git/tree/gsoc-tcp-l4-protocol
[2] https://github.com/kronat/ns-3-dev-git/tree/gsoc-cwnd-ssth-merge
[3] https://github.com/kronat/ns-3-dev-git/tree/gsoc-merge-doforwardup
Week 3
- The TcpCongestionOps abstract class has been created. To exchange data between the socket and the congestion control, a TcpSocketState class has been created as well, with the members needed by the congestion control to work (e.g. cwnd, ssth).
- NewAck has been implemented, and a simple transfer (tcp-bulk-send) is running fine under the new model (the code has not been modified, only refactored).
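The split described above can be pictured with a minimal standalone sketch. The class names TcpCongestionOps and TcpSocketState and the NewAck method come from these notes, but every member, signature and the SketchSocket and NewRenoOps classes are assumptions made for illustration; this is not the code in the repository. The NewReno-like rule uses the RFC 5681 increases (slow start below ssThresh, roughly one segment per RTT above it).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

// State shared between the socket and the congestion control (bytes).
struct TcpSocketState
{
    uint32_t cWnd;
    uint32_t ssThresh;
    uint32_t segmentSize;
};

// Abstract congestion control: the socket "has-a" TcpCongestionOps.
class TcpCongestionOps
{
public:
    virtual ~TcpCongestionOps() = default;
    // Hypothetical signature: called for each ACK of new data.
    virtual void NewAck(TcpSocketState& tcb, uint32_t bytesAcked) = 0;
};

// NewReno-like window growth, for illustration only.
class NewRenoOps : public TcpCongestionOps
{
public:
    void NewAck(TcpSocketState& tcb, uint32_t bytesAcked) override
    {
        if (tcb.cWnd < tcb.ssThresh)
        {
            // Slow start: grow by the bytes acked, at most one segment.
            tcb.cWnd += std::min(bytesAcked, tcb.segmentSize);
        }
        else
        {
            // Congestion avoidance: about one segment per round trip.
            uint32_t inc = tcb.segmentSize * tcb.segmentSize / tcb.cWnd;
            tcb.cWnd += std::max<uint32_t>(1, inc);
        }
    }
};

// Stand-in for the socket: it owns the state and delegates window updates.
class SketchSocket
{
public:
    SketchSocket(TcpSocketState initial, std::unique_ptr<TcpCongestionOps> ops)
        : m_tcb(initial), m_ops(std::move(ops))
    {
    }

    void ReceivedAck(uint32_t bytesAcked) { m_ops->NewAck(m_tcb, bytesAcked); }

    const TcpSocketState& State() const { return m_tcb; }

private:
    TcpSocketState m_tcb;
    std::unique_ptr<TcpCongestionOps> m_ops;
};
```

With this shape, adding a new congestion control means writing one small subclass of TcpCongestionOps, while the fast retransmit logic and the window variables live only once, in the socket.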
Final review
It's not the time :)