Bug 2559 - TCP advertised window still incorrect
Status: RESOLVED FIXED
Product: ns-3
Classification: Unclassified
Component: tcp
Version: ns-3-dev
Hardware/OS: All
Importance: P5 major
Assigned To: natale.patriciello
Depends on: 2613
Reported: 2016-11-21 04:21 EST by Christoph Döpmann
Modified: 2017-02-06 07:56 EST



Attachments
Patch to fix the issue (482 bytes, patch)
2016-11-21 04:21 EST, Christoph Döpmann
Details | Diff
Revised patch including test case (11.53 KB, patch)
2017-01-05 16:43 EST, Christoph Döpmann
Details | Diff
Patch for tcp-zero-window-test (1.64 KB, patch)
2017-01-05 16:44 EST, Christoph Döpmann
Details | Diff
New response vectors (18.53 KB, application/gzip)
2017-01-05 16:46 EST, Christoph Döpmann
Details
Patch + inverting order of ACK/reading (2.49 KB, patch)
2017-01-30 04:47 EST, natale.patriciello
Details | Diff
v2 patch (5.24 KB, patch)
2017-02-03 04:27 EST, natale.patriciello
Details | Diff

Description Christoph Döpmann 2016-11-21 04:21:22 EST
Created attachment 2679 [details]
Patch to fix the issue

This issue has been discussed before, in Bug 2159. However, I think that the resolution implemented there is still incorrect.

When calculating the advertised window, originally

  m_rxBuffer->MaxBufferSize () - m_rxBuffer->Size ()

was used. Bug 2159 changed this to

  m_rxBuffer->MaxBufferSize ()

I'm suggesting now to use 

  m_rxBuffer->MaxRxSequence () - m_rxBuffer->NextRxSequence ()

instead. Let me explain why I think this is an important change.

As was pointed out in Bug 2159, the original approach can sometimes be too restrictive, so that windows that are too small are calculated. This is especially true if the receive buffer is almost full but there are bytes missing at the beginning of the buffer. In that case, the advertised window would only be the size of the missing segment at the beginning. Not only may this lead to inefficiencies, as was demonstrated in Bug 2519, but it is also what RFC 793 calls "shrinking the window": the undesired process of decreasing the window without corresponding progress in the sequence number space.

The current fix, however, simply ignores the current state of the receive buffer. This is not what we want, as it effectively disables flow control. Problems arise if the receiving application does not read from the socket as fast as data arrives from the sender. Flow control should reduce the window as the buffer fills up, even going to rwin=0 if no more space is left. The current implementation, however, never reduces the window, triggering packet drops that lead to costly timeouts on the sender side. Note that "not reading from the socket" is not at all a special or emergency situation, but a valid method for, e.g., throttling the connection from the application layer. With proper flow control in place, this should not lead to timeouts.

My fix now very closely follows RFC 793 which states that the advertised window size should be "the number of octets, starting with the acknowledgment number, that the receiving TCP is currently prepared to receive". That is, we need to calculate the difference between the ACK'ed sequence number (= m_rxBuffer->NextRxSequence()) and the end of our receive buffer (= m_rxBuffer->MaxRxSequence()). Put differently, we calculate the distance from the first gap in the receive buffer to the end of the buffer.

This deals with both of the aforementioned problems. If we encounter a missing segment at the beginning of the buffer, we still keep the window wide open, which is efficient and perfectly valid, since DUP ACKs with the same ACK number (= the gap) will carry the same advertised window size. On the other hand, we are sensitive to the buffer's fill state. As it fills up, we reduce the window size, but at most by as many bytes as we received in order. Thus we are not doing the undesired "shrinking the window".
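
To make the proposed calculation concrete, here is a minimal, self-contained sketch (plain C++ with hypothetical names; the real change belongs in TcpSocketBase's window computation and would use the TcpRxBuffer accessors named above):

  #include <algorithm>
  #include <cstdint>

  // Sketch of the proposed rule: advertise the distance from the
  // cumulative ACK point (NextRxSequence) to the right edge of the
  // receive buffer (MaxRxSequence), capped at what the header field
  // can carry.
  uint32_t
  ProposedAdvertisedWindow (uint32_t nextRxSequence, uint32_t maxRxSequence,
                            uint32_t maxAdvWnd) // e.g. 65535 without window scaling
  {
    // While the connection is open, maxRxSequence >= nextRxSequence.
    uint32_t win = maxRxSequence - nextRxSequence;
    return std::min (win, maxAdvWnd);
  }

With a full, fully in-order buffer, nextRxSequence == maxRxSequence and the sketch yields rwin = 0; with a gap at the head of an otherwise full buffer, the window stays wide open, matching the two cases discussed above.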
Comment 1 natale.patriciello 2017-01-03 04:09:17 EST
Hi,

I'm sorry I can't be that active but I'm actively looking for a new job.

> The current fix, however, simply ignores the current state of the receive 
> buffer. This is not what we want, as it effectively disables flow control.

This is not true, as explained in Bug 2159. The sender always has the window [SND.UNA, SND.UNA + min(CWND, AWND)]. If AWND is the receive buffer size, and SND.UNA stays the same (the first segment of the window has been lost), then in the case AWND < CWND the sender will not send any byte in excess of the receive buffer size. A different case is when the application doesn't pick up data from the TCP buffer, or picks it up with some delay (in current ns-3 we don't have such an application), because SND.UNA increases but the data stays in the buffer.
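
(A quick illustration with made-up numbers: suppose SND.UNA = 1000, CWND = 10000 and AWND = receive buffer size = 4000. The usable window is then [1000, 1000 + min(10000, 4000)) = [1000, 5000), so as long as SND.UNA does not advance, the sender never has more than 4000 bytes, one full receive buffer, in flight.)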

I'm OK with adding this patch, but it should be clear that it covers a use case outside of the current ns-3 codebase. In doing so, a test case should be provided which compares the old window calculation with the new one over a number of random transfers with random packet losses, checking that the two values are the same when the application picks up data as fast as the socket receives it.
Comment 2 Christoph Döpmann 2017-01-05 16:41:52 EST
Hi,

I've been working on providing a test suite for this, which is working reasonably well (see attachment).

I am not checking for 100% strict equality, though. The reason is that, even when reading at maximum speed from the socket, there are situations in which a slight pile of available data builds up in the receive buffer. Therefore, for instance, when ACKing data reception before the data has been read by the application, the new approach calculates a slightly smaller AWND. This seems to be exactly what we want, because at that point in time we have no means of telling whether the data will be read in the near future. In order to cope with that, I include the amount of data not yet read by the application in the comparison.
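
As a rough sketch of that comparison (hypothetical names; the actual assertions live in the attached test case), the tolerance amounts to:

  #include <cassert>
  #include <cstdint>

  // The new calculation may fall short of the old one by at most the
  // amount of received data the application has not yet read.
  void
  CheckAdvertisedWindows (uint32_t oldWin, uint32_t newWin, uint32_t unreadBytes)
  {
    assert (newWin + unreadBytes >= oldWin);
  }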

This slight change breaks some other unit tests, though. In several cases, they expect the window to be of some fixed size and are not prepared to handle a slightly smaller one. But, as mentioned before, when immediately ACKing data that has just arrived and not yet been fetched by the application, this seems to be the desired behaviour and unavoidable. The following tests are affected:

- tcp-zero-window-test: The test fails if the window decreases after having been opened again. I've changed it to not choke on that. This does not hinder the effectiveness of the test.

- ns3-tcp-interoperability, ns3-tcp-loss, ns3-tcp-state: The hard-coded response vectors of course didn't forgive that slight change either. I've checked that the only changes appear in the advertised windows and regenerated the vectors.

Please, let me again stress the importance of this issue. Natale, you are absolutely right that the current solution suffices in case the application reads fast enough. I am, however, pretty convinced that I'm not the only one who, when starting out with ns-3, would assume that the TCP flow control implementation handles this correctly, as this appears to be the main reason flow control was introduced at all (otherwise one might save the space in the TCP header and not send a constant value with every segment :-) ).

By the way, there seems to be an unrelated bug I've come across, which I've filed as Bug 2613. There are situations in which MaxRxSequence () returns an incorrect value. This breaks the AWND calculation and thus needs to be resolved first.
Comment 3 Christoph Döpmann 2017-01-05 16:43:30 EST
Created attachment 2735 [details]
Revised patch including test case
Comment 4 Christoph Döpmann 2017-01-05 16:44:52 EST
Created attachment 2736 [details]
Patch for tcp-zero-window-test
Comment 5 Christoph Döpmann 2017-01-05 16:46:01 EST
Created attachment 2737 [details]
New response vectors
Comment 6 Christoph Döpmann 2017-01-25 16:52:15 EST
Any chance you could look into that for the upcoming ns-3 release?
Comment 7 natale.patriciello 2017-01-26 06:25:34 EST
Hi Christoph,

I'm sorry, but the TCP SACK implementation delayed me. I would like to work full time on it, but for various reasons I'm not always in my best possible shape.

Today I took some time to look at what Linux is currently doing. I analysed only the "fast path" mode, which is the only one I want to support in ns-3 right now.

First of all, the window to advertise is calculated at line 264 of tcp_output.c [1], where they have these surprising lines:

264 static u16 tcp_select_window(struct sock *sk)
265 {
266         struct tcp_sock *tp = tcp_sk(sk);
267         u32 old_win = tp->rcv_wnd;
268         u32 cur_win = tcp_receive_window(tp);
269         u32 new_win = __tcp_select_window(sk);
270 
271         /* Never shrink the offered window */
272         if (new_win < cur_win) {
273                 /* Danger Will Robinson!
274                  * Don't update rcv_wup/rcv_wnd here or else
275                  * we will not be able to advertise a zero
276                  * window in time. --DaveM
277                  *
278                  * Relax Will Robinson.
279                  */
280                 if (new_win == 0)
281                         NET_INC_STATS(sock_net(sk),
282                                       LINUX_MIB_TCPWANTZEROWINDOWADV);
283                 new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
284         }
285         tp->rcv_wnd = new_win;
    ...

When the calculated new_win is less than the previously advertised one, they simply ignore it and use the old value!! At this point, I looked at the __tcp_select_window function.

The new window is calculated on the basis of:
1. The MSS received so far (tp->ack.rcv_mss)
2. The total space in the socket's receive buffer, obtained from tcp_full_space()
3. The space available in the receive buffer, from tcp_space()
4. tp->rcv_ssthresh

tp->window_clamp is the upper limit on the total space in the receive buffer. They have a check that we don't have in ns-3: if the highest MSS observed is higher than the maximum space in the socket's receive buffer, they slash the MSS to the maximum buffer size.

Then they check whether the receive buffer is half full. If so, they disable quick ACK mode (quick ACK mode is a mode in which delayed ACKs are disabled; the reason is that they don't want to acknowledge data very fast, so that the application gets enough time to eat up data in the receive buffer and leave enough space for new data). Then they check whether they are under memory pressure, and if so they set a limit of 4 MSS (this does not apply to us, since we always have the memory we want). In this branch, if the available space is less than 1/16 of the original, they move directly to the 0-window state.

If none of this happens, they return free_space (shifted according to the window scale option). Overall, the function looks like what you proposed. If we assume that we don't have any memory problem, the window will never be shrunk, but in the worst case the right edge of the window will stay the same.
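
For clarity, here is a rough C++ paraphrase of the decision flow just described, folding tcp_select_window and __tcp_select_window together (illustrative names only, not the kernel code; window scaling, quick-ACK bookkeeping and the memory-pressure limit are omitted):

  #include <cstdint>

  uint32_t
  SelectWindowSketch (uint32_t mss, uint32_t fullSpace, uint32_t freeSpace,
                      uint32_t curWin)
  {
    if (mss > fullSpace)
      {
        mss = fullSpace;               // slash mss to the maximum buffer size
      }
    if (freeSpace < fullSpace / 2)     // receive buffer at least half full
      {
        // (here the kernel also disables quick ACK mode)
        if (freeSpace < fullSpace / 16)
          {
            return 0;                  // move directly to the 0-window state
          }
      }
    uint32_t newWin = freeSpace;       // otherwise, offer the free space...
    if (mss > 0)
      {
        newWin = (newWin / mss) * mss; // ...forced to a multiple of the mss
      }
    if (newWin < curWin)
      {
        newWin = curWin;               // never shrink the offered window
      }
    return newWin;
  }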

Just out of curiosity, I checked what happens when a block of data is received, in tcp_input.c, function tcp_rcv_established. As I've said before, please consider only the slow path (from line 5529). If the segment is valid, the core operations are the following:

5552         /* step 7: process the segment text */
5553         tcp_data_queue(sk, skb);
5554 
5555         tcp_data_snd_check(sk);
5556         tcp_ack_snd_check(sk);
5557         return;

tcp_data_queue processes the segment, puts it in the right buffer (the in-order user buffer or the out-of-order socket buffer) and processes the SACK/DSACK information to send back. Then they call tcp_data_snd_check (which transmits any pending data in the transmit queue, if the window and the congestion state allow it) and then tcp_ack_snd_check. This last call sends out an ACK if any ACK is pending (this way they don't send out two segments, ACK and data, separately; a thing that we don't have in ns-3). It is worth investigating what happens inside that function: if they haven't sent a data segment, they call the function __tcp_ack_snd_check. Now I have to go; I will finish later.

Just my thought: I think Linux is ACKing in-sequence segments only when the application has read the data, and for out-of-order segments they ACK as soon as possible. They also start with a smaller advertised window (without caring about the buffer size) and then increase it as fast as the application can read data.


[1] http://lxr.free-electrons.com/source/net/ipv4/tcp_output.c
Comment 8 natale.patriciello 2017-01-26 12:10:46 EST
(In reply to Christoph Döpmann from comment #2)

> This slight change breaks some other unit tests, though. In several cases,
> they expect the window to be of some fixed size and are not prepared to
> handle a slightly smaller one. But, as mentioned before, when immediately
> ACKing data that has just arrived and not yet been fetched by the
> application, this seems to be the desired behaviour and unavoidable. 

I've continued reading __tcp_ack_snd_check, and it does some more checks before sending the ACK. Overall, I did not find any place where it is written exactly that the ACK is sent after the application has read the data. However, I am not sure why they put in the conditional

if (new_win < cur_win) { new_win = cur_win; }

So, coming back to the ns-3 case, I like your patch. However, I am not entirely satisfied with advertising a different window (up to n segments) for each ACK sent. Looking at an example trace of a TCP conversation between my laptop and a remote server, with a declared MSS of 1460, the advertised window starts with a small value (near 29200) and then slowly increases:

32128, 35072, 37888, 40832 ... These values are at intervals a bit bigger than 1 MSS. In the case of dup ACKs (I was fortunate enough to get some) the window stays at the same value (this is understandable, because some implementations could have misinterpreted a dup ACK with a different window as a regular ACK). The last 11 segments of the conversation have a fixed window, at a value of 184832.

It seems that this mechanism performs an additional kind of congestion control, limiting the sender by introducing a "pacing-by-the-receiver" mechanism. Imagine that the sender doubles a huge window (with the boundary of the buffers at 3/4 of the new window). In this case, the receiver is limiting the rate, and when the capacity is reached, probably fewer than 1/4 of the segments will be lost. I suppose that this also helps with slow readers: if the data is staying in the queue, then there is no need to increase the window, because it is likely that the application got stuck somehow. Therefore, the window stays fixed at that value until the application can read again. Somehow, this could lead to a "stable" window size, in which the sender sends data at the rate that the receiver application (NOT the receiver kernel) can read.

After this, I got back to __tcp_select_window, and "accidentally" I discovered some documentation :)

2371 /* This function returns the amount that we can raise the
2372  * usable window based on the following constraints
2373  *
2374  * 1. The window can never be shrunk once it is offered (RFC 793)
2375  * 2. We limit memory per socket
2376  *
2377  * RFC 1122:
2378  * "the suggested [SWS] avoidance algorithm for the receiver is to keep
2379  *  RECV.NEXT + RCV.WIN fixed until:
2380  *  RCV.BUFF - RCV.USER - RCV.WINDOW >= min(1/2 RCV.BUFF, MSS)"
2381  *
2382  * i.e. don't raise the right edge of the window until you can raise
2383  * it at least MSS bytes.
2384  *
2385  * Unfortunately, the recommended algorithm breaks header prediction,
2386  * since header prediction assumes th->window stays fixed.
2387  *
2388  * Strictly speaking, keeping th->window fixed violates the receiver
2389  * side SWS prevention criteria. The problem is that under this rule
2390  * a stream of single byte packets will cause the right side of the
2391  * window to always advance by a single byte.
2392  *
2393  * Of course, if the sender implements sender side SWS prevention
2394  * then this will not be a problem.
2395  *
2396  * BSD seems to make the following compromise:
2397  *
2398  *      If the free space is less than the 1/4 of the maximum
2399  *      space available and the free space is less than 1/2 mss,
2400  *      then set the window to 0.
2401  *      [ Actually, bsd uses MSS and 1/4 of maximal _window_ ]
2402  *      Otherwise, just prevent the window from shrinking
2403  *      and from being larger than the largest representable value.
2404  *
2405  * This prevents incremental opening of the window in the regime
2406  * where TCP is limited by the speed of the reader side taking
2407  * data out of the TCP receive queue. It does nothing about
2408  * those cases where the window is constrained on the sender side
2409  * because the pipeline is full.
2410  *
2411  * BSD also seems to "accidentally" limit itself to windows that are a
2412  * multiple of MSS, at least until the free space gets quite small.
2413  * This would appear to be a side effect of the mbuf implementation.
2414  * Combining these two algorithms results in the observed behavior
2415  * of having a fixed window size at almost all times.
2416  *
2417  * Below we obtain similar behavior by forcing the offered window to
2418  * a multiple of the mss when it is feasible to do so.
2419  *
2420  * Note, we don't "adjust" for TIMESTAMP or SACK option bytes.
2421  * Regular options like TIMESTAMP are taken into account.
2422  */

I think that in Linux there is an additional problem: the buffers do not have a fixed size, but instead grow with time. Because of this, we see the increasing behaviour.

After all this, my proposals are the following:

0) Send the ACK after the application has read the data (reading the data takes no simulated time); see the sketch at the end of this comment
1) Send the ACK in the current way, starting to lower the window only when the buffer is half full
2) Implement the Linux mechanism of increasing the window with time.

Any opinion?
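
To illustrate what proposal 0 changes, here is a tiny self-contained toy model (hypothetical names, not ns-3 code): with a reader that consumes data in zero simulated time, notifying the application before building the ACK means the advertised window already reflects the read.

  #include <cstdint>
  #include <cstdio>

  struct RxBuf
  {
    uint32_t size = 0;        // bytes queued, not yet read
    uint32_t max  = 10;       // buffer capacity
  };

  uint32_t AdvWnd (const RxBuf &b) { return b.max - b.size; }

  int
  main ()
  {
    RxBuf b;
    b.size += 4;              // a 4-byte segment is queued
    std::printf ("ACK before the read: win = %u\n", (unsigned) AdvWnd (b)); // 6
    b.size = 0;               // the application reads (zero simulated time)
    std::printf ("ACK after the read:  win = %u\n", (unsigned) AdvWnd (b)); // 10
    return 0;
  }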
Comment 9 Christoph Döpmann 2017-01-28 06:59:05 EST
Thank you so much for your efforts!

I'm not sure how much of an objective it is for ns-3 to mirror the Linux kernel behaviour. Personally, I'd be in favour of going with option 0. For me, this best matches my intuition of what I'd expect to happen.

As far as I understand, option 1 doesn't seem to tackle the issue effectively, since (while the first half of the buffer steadily fills) we would still advertise a window that is too large. This can't be "corrected" in the second half of the buffer without doing what RFC 793 calls "shrinking the window" - that is, "taking back" a previously given promise of how much data from the sequence number space can be received. Or am I missing something here?

Regarding option 2 (implementing the Linux behaviour), I'm not sure it matches ns-3's scope. Personally, I'd expect the ns-3 TCP implementation to be a simple, straightforward one without these kinds of custom behaviours. This seems to be something that could be left to NSC/DCE for users who actually need to model these subtleties.

When going with option 0 (delaying the ACKs until the data has been read by the application, if this happens at the very same simulator time), I'm not sure about the consequences and how this would interfere with other (TCP) stuff going on in ns-3; I don't currently have a good enough overview. What's your opinion on this? Would that be feasible?
Comment 10 natale.patriciello 2017-01-30 04:46:55 EST
(In reply to Christoph Döpmann from comment #9) 
> I'm not sure how much of an objective it is for ns-3 to mirror the Linux
> kernel behavior. Personally, I'd be in favor of going with option 0. For
> me, this best matches my intuition of what I'd expect to happen.

Our priority is to model real systems, and Linux has very often been used as a reference model.

 
> When going with option 0 - delaying the ACKs until the data has been read by
> the application if this happens at the very same simulator time - I'm not
> sure about the consequences and how this would interfere with other (TCP)
> stuff going on in ns-3, I don't currently have a well enough overview.
> What's your opinion on this? Would that be feasible?

After some tests, my opinion is the following:

Your patch, as is, tells the other end a small lie. Since the reading and the sending of the ACK happen at the same time, you are advertising a value which is good at time x (the time at which the ACK is going to be sent out and the time at which the application is reading the data) but which is no longer true at time x + TimeStep (1). It would be correct in the case of a bad reader, which would not have read anything from the socket at time x.


I believe that your patch should be applied on top of "case 0", in which the app-reading / ACK-sending operations are inverted. My attempt is in the attached patch. Without your fix for the adv window, the results of the tests are the same (everything passes) and, in my opinion, there is no impact on the "user experience". But in the case of a slow reader, this behaviour is still wrong because of the false adv window value. So I applied your patch (together with the one from Bug 2613), and I have observed that, in general, the window reported is the same as in the current version, confirming the validity of the "case 0" + your patch approach.

However, I have the following remarks:

- After receiving a FIN, the adv window is 0. This is fine from a logical point of view (we got a FIN, we don't expect anything more, so we advertise 0) but it is different from what was happening before, and also from what I'm seeing in real pcap traces. We could still receive out-of-order data after a FIN.


- These tests still fail:
    ns3-tcp-loss
    ns3-tcp-state
    tcp-zero-window-test
  As for tcp-zero-window-test, I believe the reason is the previous point. For the other two, I will check in the traces what is happening, but I suspect a slight change in the timing of the segments.

So, can we fix the 0-window after a FIN? If yes, this is an accept :)
Meanwhile, I would like to thank you for your efforts: your patches were clean and I enjoyed reviewing them.

Nat
Comment 11 natale.patriciello 2017-01-30 04:47:37 EST
Created attachment 2767 [details]
Patch + inverting order of ACK/reading
Comment 12 Christoph Döpmann 2017-02-02 05:54:10 EST
Hi,

again, thanks for your efforts! It looks like we can bring this to a successful end. I am, however, struggling to resolve the issue of sending zero windows after having received a FIN. Since MaxRxSequence () will always return the last valid sequence number as learned from the FIN, we would get a negative (~ random) window to advertise.

I am unsure what could be sent instead. I could imagine saving the last window and repeating it after the FIN. This might be closest to what I observe via Wireshark on a Linux box. Do you agree? Unfortunately, the most straightforward approaches would require either making AdvertisedWindowSize () non-const or saving the window after each call; I think both are quite ugly. Therefore, I'd like to quickly double-check whether that's something you'd be in favour of.

> We could still receive out-of-order data after a FIN.

While I do not think this is true (at the very least, such segments will be discarded anyway based on TcpSocketBase::OutOfRange (), and a FIN wouldn't be accepted if prior data was missing), I now believe that a zero window after the FIN is not the best choice anyway. After all, this would violate the "shrinking window" rule. This is probably why Linux behaves the way it does, as a consequence of the kernel code lines you quoted above:

271 /* Never shrink the offered window */
272 if (new_win < cur_win) {
Comment 13 natale.patriciello 2017-02-03 04:27:11 EST
(In reply to Christoph Döpmann from comment #12)
> I am unsure what could be sent alternatively. I could imagine saving the
> last window and repeat it after FIN. This might be closest to what I observe
> via Wireshark on a Linux box. Do you agree? Unfortunately, the most
> straight-forward approaches would require either making AdvertisedWindowSize
> () non-const or saving the window after each call - I think, both are quite
> ugly. Therefore, I'd like to quickly double-check if that's something you'd
> be in favour of.

Put like this, it sounds ugly, but what if we offer a new traced variable that indicates the advertised window size? That way, we solve the problem without adding ugliness to the code :) Please check my last patch.
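
A minimal self-contained sketch of the idea (hypothetical names; the actual change is in the attached v2 patch and would use an ns-3 TracedValue for the cached window):

  #include <cstdint>

  struct RxState
  {
    uint32_t nextRxSequence;  // cumulative ACK point
    uint32_t maxRxSequence;   // right edge; frozen once the FIN arrives
    bool     finReceived;
  };

  // Cache the advertised window each time it is computed; after a FIN,
  // repeat the cached value instead of recomputing, so the window never
  // collapses to 0 (or goes "negative") once MaxRxSequence is fixed.
  uint32_t
  AdvertisedWindow (const RxState &rx, uint32_t &lastAdvWnd)
  {
    if (rx.finReceived)
      {
        return lastAdvWnd;    // never shrink: repeat the pre-FIN value
      }
    lastAdvWnd = rx.maxRxSequence - rx.nextRxSequence;
    return lastAdvWnd;
  }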

> 
> > We could still receive out-of-order data after a FIN.
> 
> While I do not think this is true (at least such segments will be discarded
> anyway based on TcpSocketBase::OutOfRange (), and a FIN wouldn't be accepted
> if prior data was missing)


But, a situation like this could happen:

A sends DATA SEGMENT 10
A sends DATA SEGMENT 11
A sends DATA SEGMENT 12
A sends FIN

[ network loses segment 11 ]

B receives DATA SEGMENT 10 [ send ack ]
B receives DATA SEGMENT 12 [ send dupack ]
B receives FIN [ save FIN sequence ]

but of course segment 11 still has to be retransmitted, so we cannot advertise a window of 0.

Please check my new patch; if it works for you like it works for me, I think we can add it to the mainline. Thanks again!

Nat
Comment 14 natale.patriciello 2017-02-03 04:27:55 EST
Created attachment 2775 [details]
v2 patch

Added the non-0 window advertising after a FIN
Comment 15 Christoph Döpmann 2017-02-03 06:34:23 EST
I just checked and it seems to do the job pretty well for me, too.
Comment 16 natale.patriciello 2017-02-03 07:06:23 EST
Ok, thank you for testing, moving to last call.
Comment 17 natale.patriciello 2017-02-06 07:55:37 EST
Fixed in 12672:10ea1cca0a24, thank you for the amazing work!