Je me souviens
Everything here is my opinion. I do not speak for your employer.
April 2015
May 2015

2015-04-14 »

Another thing I learned from Stuart Cheshire's talk a couple of weeks ago is the usefulness of TCP_NOTSENT_LOWAT.  Or more specifically, I learned what it even does and how it differs from just SO_SNDLOWAT or SO_SNDBUF.

Short version: it's better than those things and you should probably use it instead.

Longer version:

Naive use of SO_SNDBUF leads to bufferbloat.  The proper value of SO_SNDBUF (at least, if you plan to fill the whole buf) is around 2x the bandwidth-delay product, or else you get either suboptimal speed, or excessive bufferbloat.  The problem is, only TCP has a good estimate of the real bandwidth-delay product; that's its job.  So if you use SO_SNDBUF, you're overriding what TCP knows, and you'll probably be wrong.  (Sadly, not-so-old TCP implementations just used dumb SO_SNDBUF values, so you had to override it.)

SO_SNDLOWAT sounds like it might be helpful, but it never was, so Linux doesn't even implement it, at least according to man socket(7).

TCP_NOTSENT_LOWAT is a mix between the two.  It makes it so that select() won't return true for writable when there are TCP_NOTSENT_LOWAT bytes still waiting to be sent, beyond the TCP window.  Now, that's a few double negatives, so let's try that again: your socket selects true for writable only if it's worthwhile buffering up more stuff to keep the stream busy.  Otherwise, it tells you not to write anything else.  However, you actually can still write more stuff if you want and SO_SNDBUF is large enough.  (If you use this feature, it's safe to set SO_SNDBUF as large as you want, because you mostly are trying not to fill it all.)

The idea is that if you have traffic of various priorities waiting to be sent out, you want to put low-priority stuff into the socket buffer only if you're sure no high-priority stuff will arrive before it gets sent.  You wait until the buffer is emptying out (but not so empty that you're losing throughput) and then insert the highest-priority stuff you have, then go back to waiting.

The nice part is that the kernel does all the important magic, such as figuring out the additional number of bytes required for the delay-bandwidth product (which varies over time and is added to the NOTSENT_LOWAT value).  All you have to do is write data when you're asked to do so, and you magically get less TCP buffer bloat.

I'm CEO at Tailscale, where we make network problems disappear.

Why would you follow me on twitter? Use RSS.

apenwarr on