SO_LINGER is not the same as Apache's "lingering close"

Have you ever wondered what SO_LINGER is actually for? What TIME_WAIT does? What's the difference between FIN and RST, anyway? Why did web browsers have to have pipelining disabled for so long? Why did all the original Internet protocols have a "Quit" command, when the client could have just closed the socket and been done with it?¹

I've been curious about all those questions at different points in the past. Today we ran headlong into all of them at once while testing the HTTP client in EQL Data.

If you already know about SO_LINGER problems, then that probably doesn't surprise you; virtually the only time anybody cares about SO_LINGER is with HTTP. Specifically, with HTTP pipelining. And even more specifically, when an HTTP server decides to disconnect you after a fixed number of requests, even if there are more in the pipeline.

Here's what happens:

Client sends request #1
Client sends request #2
...
Client sends request #100
All those requests finally arrive at the server side, thanks to network latency.
Server sends response #1
...
Server sends response #10
Server disconnects, because it only handles 10 queries per connection.
Server kernel sends TCP RST because userspace didn't read all the input.
Client kernel receives responses 1..10
Client reads response #1
...
Client reads most of response #7
Client kernel receives RST, causing it to discard everything in the socket buffer(!!)
Client thinks data from response 7 is cut off, and explodes.

Clearly, this is crap. The badness arises from the last two steps: it's actually part of the TCP specification that the client has to discard the unread input data - even though that input data has safely arrived - just because it received a RST. (If the server had read all of its input before doing close(), then it would have sent FIN instead of RST, and FIN doesn't tell anyone to discard anything. So ironically, the server discarding its input data on purpose has caused the client to discard its input data by accident.)

Perfectly acceptable behaviour, by the way, would be for the client to receive a polite TCP FIN after response #10 is received. Because a) the connection closed early, and b) the connection closed without error, the client knows that everything is fine and that the server simply didn't feel like answering any more requests. It also knows exactly where the server stopped, so there's no worrying about requests accidentally being run twice, etc. It can just open a new connection and resend the unanswered requests.

But that's not what happened in our example above. So what do you do about it?

The "lingering close"

The obvious solution for this is what Apache calls a lingering close. As you can guess from the description, the solution is on the server side.

What you do is you change the server so, instead of shutting down its socket right away, it just does a shutdown(sock, SHUT_WR) to notify TCP/IP that it isn't planning to write any more stuff. In turn, this sends a notice to the client side, which (eventually) arrives and appears as an EOF - a clean end of data marker, right after response #10. At that point, the client can close() its socket, knowing that its input buffer is safely empty, thus sending a FIN to the server side.

Meanwhile, the server can read all the data in its input buffer and throw it away; it knows the client isn't expecting any more answers. It just needs to flush all that received stuff to avoid accidentally sending an RST and ruining everything. The server can just read until it receives its own end-of-data marker, which we now know is coming, since the client has called close().

Throw in a timeout here and there to prevent abuse, and you're set.
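
To make that concrete, here is a minimal sketch of the server side of a lingering close, in Python rather than whatever your server is written in. The function name lingering_close, the 30-second default, and the buffer size are my own choices for illustration; this is not Apache's code, just the same idea: half-close with shutdown(SHUT_WR), then read and discard until EOF or a deadline.

```python
import socket
import time


def lingering_close(sock, timeout=30.0, bufsize=4096):
    """Half-close sock, then discard input until the peer closes.

    Returns True if a clean EOF was seen, False if the deadline
    passed or the socket failed first. The socket is always closed.
    (Hypothetical helper; the name and defaults are assumptions.)
    """
    deadline = time.monotonic() + timeout
    try:
        # Tell the peer we are done writing; it sees EOF after our
        # last response, but we can still read.
        sock.shutdown(socket.SHUT_WR)
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            sock.settimeout(remaining)
            # Throw away whatever the client had pipelined.
            if not sock.recv(bufsize):
                return True  # b"" means the peer has closed its side
    except OSError:
        # Covers timeouts and connection errors alike.
        return False
    finally:
        sock.close()
```

You would call lingering_close(conn) at the point where the server currently calls conn.close() after its last response. The single overall deadline is a design choice: it stops a client from keeping the connection alive forever by trickling one byte at a time.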

SO_LINGER

You know what all the above isn't? The same thing as SO_LINGER.

It seems like there are a lot of people who are confused by this. I certainly was; various Apache documentation, including the actual comment above the actual implementation of "lingering close" in Apache, implies that Apache's lingering code was written only because SO_LINGER is broken on various operating systems.

Now, I'm sure it was broken on various operating systems for various reasons. But: even when it works, it doesn't solve this problem. It's actually a totally different thing.

SO_LINGER exists to solve exactly one simple problem, and only one problem: the problem that if you close() a socket after writing some stuff, close() will return right away, even if the remote end hasn't yet received everything you wrote.

This behaviour was supposed to be a feature, I'm sure. After all, the kernel has a write buffer; the remote kernel has a read buffer; it's going to do all that buffering in the background anyway and manage getting all the data from point A to point B. Why should close() arbitrarily block waiting for that data to get sent?

Well, it shouldn't, said somebody, and he made it not block, and that was the way it was. But then someone realized that there's an obscure chance that the remote end will die or disappear before all the data has been sent. In that case, the kernel can deal with it just fine, but userspace will never know about it since it has already closed the socket and moved on.

So what does SO_LINGER do? It changes close() to wait until all the data has been sent, or until the linger timeout you configured runs out. (Or, if your socket is non-blocking, to tell you that it can't close yet because some data is still unsent.)
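
For reference, this is roughly what turning the option on looks like from Python. It's a sketch that assumes a Linux-style struct linger made of two C ints (l_onoff, l_linger); other platforms lay it out or interpret it differently. The helper names set_so_linger and get_so_linger are made up for this example.

```python
import socket
import struct

# Assumed layout of struct linger: two C ints, (l_onoff, l_linger).
LINGER_FORMAT = "ii"


def set_so_linger(sock, seconds):
    """Make close() wait up to `seconds` for unsent data to go out."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack(LINGER_FORMAT, 1, seconds))


def get_so_linger(sock):
    """Return (l_onoff, l_linger) as the kernel currently reports them."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize(LINGER_FORMAT))
    return struct.unpack(LINGER_FORMAT, raw)
```

Notice that nothing in here touches the receive side of the socket. The option only describes how long close() may wait on outgoing data.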

What doesn't SO_LINGER do?

It doesn't read leftover data from your input buffer and throw it away, which is what Apache's lingering close does. Even with SO_LINGER, your server will still send an RST at the wrong time and confuse the client in the example above.

What do the two approaches have in common?

They both involve close() and the verb "linger." However, they linger waiting for totally different things, and they do totally different things while they linger.

What should I do with this fun new information?

If you're lucky, nothing. Apache already seems to linger correctly, whether because they eventually figured out why their linger implementation works and SO_LINGER simply doesn't, or because they were lucky, or because they were lazy. The comment in their code, though, is wrong.

If you're writing an HTTP client, hopefully nothing. Your client isn't supposed to have to do anything here; there's no special reason for you to linger (in either sense) on close, because that's exactly what an HTTP client does anyway: it reads until there's nothing more to read, then it closes the connection.

If you're writing an HTTP server: first of all, try to find an excuse not to. And if you still find yourself writing an HTTP server, make sure you linger on close. (In the Apache sense, not the SO_LINGER sense. The latter is almost certainly useless to you. As an HTTP server, what do you care if the data didn't get sent completely? What would you do about it anyway?)

Okay, Mr. Smartypants, so how did you get stuck with this?

Believe it or not, I didn't get stuck with this because I (or rather, Luke) was writing an HTTP server. At least, not initially. We were actually testing WvHttpPool (an HTTP client that's part of WvStreams) for reliability, and thought we (actually Luke :)) would write a trivial HTTP server in perl; one that disconnects deliberately at random times to make sure we recover properly.

What we learned is that our crappy test HTTP server has to linger, and I don't mean SO_LINGER, or it totally doesn't work at all. WvStreams, it turned out, works fine as long as you do this.

Epilogue: Lighttpd lingers incorrectly

The bad news is that the reason we were doing all this testing is that the WvStreams http client would fail in some obscure cases. It turns out this is the fault of lighttpd, not WvStreams.

lighttpd 1.4.19 (the version in Debian Lenny) implements its lingering incorrectly. So does the latest version, lighttpd 1.4.23.

lighttpd implements Apache-style lingering, as it should. Unfortunately it stops lingering as soon as ioctl(FIONREAD) returns zero, which is wrong; that happens when the local socket buffer is empty, but it doesn't guarantee the remote end has finished sending yet. There might be another packet just waiting to arrive a microsecond later, and when it does, blam: RST.
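
Here's a small sketch of why "FIONREAD says zero" is not the same as "the client is finished". It uses a local socketpair() instead of a real TCP connection, and it assumes a Unix-like system where Python's termios module exposes FIONREAD. The names bytes_pending and demo are mine; this is not lighttpd's code.

```python
import fcntl
import socket
import struct
import termios


def bytes_pending(sock):
    """Ask the kernel how many bytes are waiting in sock's receive buffer."""
    raw = fcntl.ioctl(sock.fileno(), termios.FIONREAD, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


def demo():
    """Return FIONREAD readings before and after a late request arrives."""
    server, client = socket.socketpair()
    try:
        client.sendall(b"request")
        server.recv(100)                  # server drains what it has
        before = bytes_pending(server)    # empty buffer, client still open
        client.sendall(b"late request")   # the packet that shows up later
        after = bytes_pending(server)
        return before, after
    finally:
        server.close()
        client.close()
```

The first reading is zero even though the client has not closed anything, which is exactly the moment a FIONREAD-based linger loop gives up. The only signal that the peer is really done is recv() returning an empty result.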

Unfortunately, once I had debugged that, I found out that they actually forgot to linger at all except in case of actual errors. If it's just disconnecting you because you've made too many requests, it doesn't work, and kaboom.

And once I had debugged that, I found out that it sets the linger timeout to only one second. It should be more like 120 seconds, according to the RFCs, though apparently most OSes use about 30 seconds. Gee.

I guess I'll send in a patch.

Footnote

¹ Oh, I promised you an answer about the Quit command. Here it is: I'm pretty sure shutdown() was invented long after close(), so the only way to ensure a safe close was to negotiate the shutdown politely at the application layer. If the server never disconnects until it's been asked to Quit, and a client never sends anything after the Quit request, you never run into any of these, ahem, "lingering problems."

2009-08-14 »