It was time to learn how 802.11 authentication works, so I started reading (well... heavily skimming) the standards. Here's what I've got so far.
WPA2-PSK does not actually use EAP (Extensible Authentication Protocol). WPA2-Enterprise does; in that case it uses EAP to do some arbitrary configurable thing to produce the PMK ("Pairwise Master Key"). For example, EAP is the place where you might contact a RADIUS server. With WPA2-PSK, on the other hand, the PMK is trivially constructed by hashing your passphrase together with the SSID. (The MAC addresses only come in later, when the session keys are derived.) No packets are sent at all for this; you have the PMK already.
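To make the "no packets are sent" part concrete, here is a minimal sketch of the PSK case. As far as I can tell, the derivation is PBKDF2 with HMAC-SHA1, using the SSID as the salt, 4096 iterations, and a 32-byte output. The function name, passphrase, and SSID below are made up for illustration.

```python
import hashlib


def wpa2_psk_pmk(passphrase: str, ssid: str) -> bytes:
    """Derive the 256-bit PMK for WPA2-PSK from a passphrase and an SSID."""
    # PBKDF2-HMAC-SHA1, SSID as salt, 4096 iterations, 32 bytes of output.
    # Nothing here depends on MAC addresses or on talking to the AP.
    return hashlib.pbkdf2_hmac(
        "sha1",
        passphrase.encode("utf-8"),
        ssid.encode("utf-8"),
        4096,
        32,
    )


# Hypothetical network, purely for illustration.
pmk = wpa2_psk_pmk("correct horse battery", "ExampleNet")
print(pmk.hex())
```

Both the client and the AP can run this on their own, which is why the PSK case needs no EAP exchange at all.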
Once you have the PMK, you do the infamous wifi "4-way handshake" to produce and exchange transient session keys. The handshake is the same regardless of what EAP method you use, or if you don't use EAP at all. Confusingly, the 4-way handshake seems to use packets called EAPOL (EAP over LAN) packets, but that's actually just a container, which in this case carries EAPOL-Key frames and happens to not contain EAP. Ha! Fooled you! Also, there is a thing called EAP-PSK which is intimately unrelated to all this.
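Here is a rough sketch of what the handshake computes, assuming the classic WPA2/CCMP case where the transient key is 48 bytes from an HMAC-SHA1 based PRF. Each side contributes a random nonce, and the inputs are sorted so both ends get the same answer. The function names and the all-zero PMK are placeholders of mine, and the real handshake also checks MICs and delivers a group key, which I'm skipping here.

```python
import hashlib
import hmac
import os


def prf(key: bytes, label: bytes, data: bytes, nbytes: int) -> bytes:
    """HMAC-SHA1 in counter mode, as I understand the 802.11 PRF."""
    out = b""
    counter = 0
    while len(out) < nbytes:
        msg = label + b"\x00" + data + bytes([counter])
        out += hmac.new(key, msg, hashlib.sha1).digest()
        counter += 1
    return out[:nbytes]


def derive_ptk(pmk: bytes, ap_mac: bytes, sta_mac: bytes,
               anonce: bytes, snonce: bytes) -> dict:
    """Derive the pairwise transient key and split it into its three parts."""
    # Sorting the MACs and nonces means AP and client compute identical keys.
    data = (min(ap_mac, sta_mac) + max(ap_mac, sta_mac)
            + min(anonce, snonce) + max(anonce, snonce))
    ptk = prf(pmk, b"Pairwise key expansion", data, 48)
    return {"kck": ptk[:16], "kek": ptk[16:32], "tk": ptk[32:48]}


# Placeholder inputs, purely for illustration.
pmk = bytes(32)
ap_mac = bytes.fromhex("020000000001")
sta_mac = bytes.fromhex("020000000002")
keys = derive_ptk(pmk, ap_mac, sta_mac, os.urandom(32), os.urandom(32))
print(keys["tk"].hex())
```

Running it again with fresh nonces and the same PMK gives a completely different set of session keys.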
The 4-way handshake seems to take about 20ms if all is well, which it usually is.
Occasionally, you want to rotate session keys inside a session. To do that, you just re-run the 4-way handshake without getting a new PMK. Notably, EAP and thus RADIUS do not ever get involved here. (It's good to know when RADIUS gets involved because it's typically annoyingly slow and unscalable.)
So okay, roaming. In general, wifi clients are responsible for choosing which AP they will talk to, and not the other way around. Occasionally a client will decide it's getting a crappy signal, and see about finding another AP with the same SSID that it can switch to instead. At that time, it disconnects from the old AP and re-runs all of the above, potentially including EAP and RADIUS if that's what you're using. This can be slow: EAP can take several seconds. To try to hide the slowness, they invented 802.11r ("fast BSS transition"). The purpose of 802.11r is simply to let you skip the full EAP transaction when moving to AP#2: the key material from your original authentication is reused to derive a fresh PMK for AP#2, and that can be set up while you're still connected to AP#1. (Running a complete EAP exchange with AP#2 through AP#1 is an older trick called pre-authentication, from 802.11i.) After that, you can disconnect from AP#1 and connect to AP#2 and do the (20ms or so) key handshake, all in a very short time. Great! But this buys you very little for WPA2-PSK, for the simple reason that there is no slow work to do to get PMK#2. So 802.11r mostly matters when you use EAP.
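The client-side decision described above might look something like this sketch. Real devices each do their own thing here, and the standards don't dictate it; the thresholds, class, and function names are entirely my own assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScanResult:
    bssid: str      # MAC address of the AP's radio
    ssid: str       # network name
    rssi_dbm: int   # signal strength as seen by the client


def pick_roam_target(current: ScanResult, scan: List[ScanResult],
                     roam_below_dbm: int = -70,
                     min_gain_db: int = 8) -> Optional[ScanResult]:
    """Return a better AP with the same SSID, or None to stay put."""
    # Signal is still fine: don't bother roaming.
    if current.rssi_dbm >= roam_below_dbm:
        return None
    candidates = [ap for ap in scan
                  if ap.ssid == current.ssid and ap.bssid != current.bssid]
    if not candidates:
        return None
    best = max(candidates, key=lambda ap: ap.rssi_dbm)
    # Require a real improvement so the client doesn't flap between two APs.
    if best.rssi_dbm - current.rssi_dbm < min_gain_db:
        return None
    return best


current = ScanResult("02:00:00:00:00:01", "ExampleNet", -78)
scan = [ScanResult("02:00:00:00:00:02", "ExampleNet", -55),
        ScanResult("02:00:00:00:00:03", "OtherNet", -40)]
print(pick_roam_target(current, scan))
```

Note that the AP has no say in any of this, which is the root of the trouble with steering.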
There are some other standards that also come up when talking about roaming:
- 802.11h: used for measuring signal strength and controlling maximum transmit power. Apparently needed if you want to comply with DFS (radar avoidance, essentially) regulations on some channels. It contains some interesting signal strength reporting tools though, so it could be used to help inform roaming decisions.
- 802.11F (why a capital letter? why not?): "inter-access point protocol." Nobody ever used it, so it was cancelled.
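- 802.11v: "wireless network management." A newer thing that appears to be useful for helping APs fix up their bridging tables more quickly. It also defines a "BSS transition management" request that lets an AP suggest that a client move to another AP, though the client is free to ignore the suggestion.
- 802.11k: "radio resource measurement." Along with the 802.11v transition request, about the only thing out of all the above that could possibly help you steer a client device to a particular AP.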
802.11k is pretty special. First of all, it's not supported by hostapd, so most APs don't support it, and most Linux devices don't support it. I see a bunch of announcements about iOS supporting it, but not Mac OS X, which seems weird, but there you go. Also, there is no command in 802.11k to actually push a client device to a particular AP; instead, the AP is supposed to wait until the station asks, and then send it a "list of eligible neighbour access points" that the station might want to connect to. There's no guarantee the station will take the advice. Also, there's no good way to poke a station to make it consider moving right away; we rely on the device to do something intelligent. Which is, of course, way too optimistic when it comes to most devices.
That gives me a better clue of what people mean when they say the standards are "not that useful" for assisted roaming. Few people bothered to implement 802.11k.
So what do people do instead? It seems mostly they have APs forcibly disconnect a client device, then the "wrong" AP will refuse to let the client reconnect to that AP, thus forcing it to connect to the "right" AP. The problem with this method is that it's slow: you can't use 802.11r to get the EAP work out of the way in advance, because you're not connected to any AP at that point. The client also has to scan all the channels, which can take several seconds. During all that time, the wifi is disconnected and your sessions are all frozen. None of this slowness can be fixed with 802.11r. Nor can it really be fixed with 802.11k, since that can't force a roam either.
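A toy version of that kick-and-refuse trick, as I imagine a controller doing it, is below. It assumes every AP reports how strongly it hears the client; the function names and the 10 dB margin are hypothetical, not taken from any real product.

```python
from typing import Dict


def choose_ap(rssi_by_ap: Dict[str, int]) -> str:
    """Pick the AP that hears the client best (the "right" AP)."""
    return max(rssi_by_ap, key=lambda ap: rssi_by_ap[ap])


def should_kick(current_ap: str, rssi_by_ap: Dict[str, int],
                margin_db: int = 10) -> bool:
    """Should the current AP forcibly disconnect the client?"""
    best = choose_ap(rssi_by_ap)
    # Only kick when another AP is clearly better, since the client pays
    # for the move with a scan and a full reconnect.
    return (best != current_ap
            and rssi_by_ap[best] - rssi_by_ap[current_ap] >= margin_db)


def should_accept(ap: str, rssi_by_ap: Dict[str, int]) -> bool:
    """A "wrong" AP refuses the client so it ends up on the "right" one."""
    return ap == choose_ap(rssi_by_ap)


# Signal strength of one client as heard by each AP, in dBm (made-up values).
heard = {"ap-kitchen": -80, "ap-office": -55}
print(should_kick("ap-kitchen", heard))    # kitchen AP drops the client
print(should_accept("ap-kitchen", heard))  # and refuses it on reconnect
print(should_accept("ap-office", heard))   # office AP lets it in
```

Everything between the kick and the successful reconnect is dead air from the client's point of view.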