Keep it beautiful
Everything here is my opinion. I do not speak for your employer.

2017-07-04

Avery's[1] laws[2] of wifi reliability

Replacing your router:
  Vendor A: 10% broken
  Vendor B: 10% broken
  P(both A and B broken): 10% x 10% = 1%
Replacing your router (or firmware) almost always fixes your problem.

Adding a wifi extender:
  Router A: 90% working
  Router B: 90% working
  P(both A and B working): 90% x 90% = 81%
Adding an additional router almost always makes things worse.

All wireless networks, both LTE and mesh, go down sometimes, but I'm willing to bet that your wifi network is flakier than your phone's LTE connection. At Battlemesh v10, we were all sitting in a room with dozens of experimental misconfigured wifi routers offering open networks that may or may not ever successfully route back to the real Internet. What makes a network reliable or unreliable?

After a few years of messing with this stuff (and being surrounded by tons of engineers working on other distributed systems problems, which turn out to all have similar constraints), I think I can summarize it like this. Distributed systems are more reliable when you can get a service from one node OR another. They get less reliable when a service depends on one node AND another. And the numbers combine multiplicatively, so the more nodes you have, the faster it drops off.

For a non-wireless example, imagine running a web server with a database. If those are on two computers (real or virtual), then your web app goes down if you don't have the web server AND the database server working perfectly. It's inherently less reliable than a system that requires a web server, but does not require a database. Conversely, imagine you arrange for failover between two database servers, so that if one goes down, we switch to the other one. The database is up if the primary server OR the secondary server is working, and that's a lot better. But it's still less reliable than if you didn't need a database server at all.
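
To put rough numbers on the web server and database example, here's a minimal sketch. It assumes failures are independent and that every machine is 90% reliable; that figure and the helper names `all_of` and `any_of` are made up for illustration.

```python
from math import prod

def all_of(*reliabilities):
    """AND: the service needs every component working (independent failures assumed)."""
    return prod(reliabilities)

def any_of(*reliabilities):
    """OR: the service needs at least one of the components working."""
    return 1 - prod(1 - r for r in reliabilities)

WEB = DB = 0.9  # assumed per-machine reliability, for illustration only

web_only = WEB                                  # no database needed at all
web_and_db = all_of(WEB, DB)                    # web server AND database
web_and_failover = all_of(WEB, any_of(DB, DB))  # web AND (primary OR secondary)

print(f"web only:                {web_only:.3f}")
print(f"web AND db:              {web_and_db:.3f}")
print(f"web AND (db1 OR db2):    {web_and_failover:.3f}")
```

Under those assumptions the failover setup beats the single database, but still loses to not needing a database at all.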

Let's take that back to wifi. Imagine I have a wifi router from vendor A. Wifi routers usually suck, so for the sake of illustration, let's say it's 90% reliable, and for simplicity, let's define that as "it works great for 90% of customers and has annoying bugs for 10%." 90% of customers who buy a vendor A router will be happy, and then never change it again. 10% will be unhappy, so they buy a new router - one from vendor B. That one also works for 90% of people, but if the bugs are independent, it'll work for a different 90%. What that means is, 90% of the people are now using vendor A, and happy; 90% of 10% are now using vendor B, and happy. That's a 99% happiness rate! Even though both routers are only 90% reliable. It works because everyone has the choice between router A OR router B, so they pick the one that works and throw away the other.
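
The arithmetic above can be sketched as a few lines of Python. This assumes each vendor's bugs are independent and that unhappy customers keep moving on to the next vendor; the function name `happy_fraction` is hypothetical.

```python
def happy_fraction(reliability, vendors_tried):
    """Fraction of customers who end up with a working router.

    Each customer keeps the first router that works for them and only
    buys from the next vendor if the current one is buggy for them.
    Assumes every vendor has the same reliability and independent bugs.
    """
    happy = 0.0
    still_unhappy = 1.0
    for _ in range(vendors_tried):
        happy += still_unhappy * reliability   # these customers stop here
        still_unhappy *= 1 - reliability       # the rest try another vendor
    return happy

for vendors in (1, 2, 3):
    print(vendors, "vendor(s):", round(happy_fraction(0.9, vendors), 4))
```

With 90% reliable routers, one vendor leaves 90% of customers happy, two vendors 99%, and three vendors 99.9%.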

This applies equally well to software (vendor firmware vs openwrt vs tomato) or software versions (people might not upgrade from v1.0 to v2.0 unless v1.0 gave them trouble). In our project, we had a v1 router and a v2 router. v1 worked fine for most people, but not all. When v2 came out, we started giving out v2 routers to all new customers, but also to v1 customers who complained that their v1 router had problems. When we drew a graph of customer satisfaction, it went up right after the v2 release. Sweet! (Especially sweet since the v2 router was my team's project :)). Upgrade them all, right?

Well, no, not necessarily. The problem was we were biasing our statistics: we only upgraded v1 users with problems to v2. We didn't "upgrade" v2 users with problems (of course there were some) to v1. Maybe both routers were only 90% reliable; the story above would have worked just as well in reverse. The same phenomenon explains why some people switch from openwrt to tomato and rave about how much more reliable it is, and vice versa, or Red Hat vs Debian, or Linux vs FreeBSD, etc. This is the "It works for me!" phenomenon in open source; simple probability. You only have an incentive to switch if the thing you have is giving you a problem, right now.
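
A small simulation makes the bias visible. This is only a sketch: the reliabilities, the customer count, and the function name `satisfaction_after_swap` are assumptions, and the second call deliberately uses a replacement router that is worse than the original.

```python
import random

def satisfaction_after_swap(first, second, customers=100_000, seed=1):
    """Give everyone the `first` router, then swap only the unhappy ones to `second`.

    `first` and `second` are per-router reliabilities (chance the router
    works for a given customer, assumed independent).
    Returns (satisfaction before the swap, satisfaction after the swap).
    """
    rng = random.Random(seed)
    before = after = 0
    for _ in range(customers):
        works_first = rng.random() < first
        works_second = rng.random() < second
        before += works_first
        # A customer ends up happy if either router worked for them.
        after += works_first or works_second
    return before / customers, after / customers

equal_before, equal_after = satisfaction_after_swap(0.9, 0.9)
worse_before, worse_after = satisfaction_after_swap(0.9, 0.85)
print("equally good replacement:", equal_before, "->", equal_after)
print("worse replacement:       ", worse_before, "->", worse_after)
```

In both runs the satisfaction graph goes up after the swap, so the graph alone can't tell you whether the replacement is actually the better router.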

But the flip side of the equation is also true, and that matters a lot for mesh. When you set up multiple routers in a chain, now you depend on router A AND router B to both work properly, or your network is flaky. Wifi is notorious for this: one router accepts connections, but acts weird (e.g. doesn't route packets), and clients still latch onto that router, and it ruins it for everyone. As the number of mesh nodes increases, the probability of this happening increases fast.
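
To see how fast it drops off, here's a rough sketch that assumes every router in the chain is independently 90% reliable and that traffic needs all of them. The 90% figure and the function names are illustrative, not measurements.

```python
def chain_reliability(per_node, nodes):
    """Probability that every router in a chain of `nodes` routers works."""
    return per_node ** nodes

def nodes_until_below(per_node, threshold):
    """Smallest chain length whose reliability falls below `threshold`.

    Requires 0 < per_node < 1 and threshold > 0, otherwise the loop never ends.
    """
    nodes = 1
    while chain_reliability(per_node, nodes) >= threshold:
        nodes += 1
    return nodes

for n in range(1, 8):
    print(n, "router(s):", round(chain_reliability(0.9, n), 3))
```

With these assumed numbers, a chain of seven routers is already more likely to be broken than working.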

LTE base stations also have reliability problems, of course - plenty of them. But they usually aren't arranged in a mesh, and a single LTE station usually covers a much larger area, so there are fewer nodes to depend on. Also, each LTE node is typically "too big to fail" - in other words, it will annoy so many people, so quickly, that the phone company will need to fix it fast. A single flaky mesh node might affect only a small region of space: everyone passing through that area is affected, but most people, most of the time, aren't in it. That leads to a vague impression of "wifi meshes are flaky and LTE is reliable", even if your own mesh link is working most of the time. It's all a game of statistics.

Solution: the buddy system

Let your friend tell you if you're making an ass of yourself.
  Router A: 90% working
  Router B: 90% working
  P(either A or B working): 1 - (1-0.9) x (1-0.9) = 99%

In the last 15 years or so, distributed systems theory and practice have come a long way. We now, mostly, know how to convert an AND situation into an OR situation. If you have a RAID5 array, and one of the disks dies, you take that disk out of circulation so you can replace it before the next one dies. If you have a 200-node nosql database service, you make sure nodes that fail stop getting queries routed to them so that the others can pick up the slack. If one of your web servers gets overloaded running Ruby on Rails bloatware, your load balancers redirect traffic to one of the nodes that's less loaded, until the first server catches up.

So it should be with wifi: if your wifi router is acting weird, it needs to be taken out of circulation until it's fixed.

Unfortunately, it's harder to measure wifi router performance than database or web server performance. A database server can easily test itself; just run a couple of queries and make sure its request socket is up. Since all your web servers are accessible from the Internet, you can have a single "prober" service query them all one by one to make sure they're working, and reboot the ones that stop. But by definition, not all your wifi mesh nodes are accessible via direct wifi link from one place, so a single prober isn't going to work.

Here's my proposal, which I call the "wifi buddy system." The analogy is if you and some friends go to a bar, and you get too drunk, and start acting like a jerk. Because you're too drunk, you don't necessarily know you're acting like a jerk. It can be hard to tell. But you know who can tell? Your friends. Usually even if they're also drunk.

Although by definition, not all your mesh nodes are reachable from one place, you can also say that by definition, every mesh node is reachable by at least one other mesh node. Otherwise it wouldn't be a mesh, and you'd have bigger problems. That gives us a clue for how to fix it. Each mesh node should occasionally try to connect up to one or more nearby nodes, pretending to be an end user, and see if it can route traffic or not. If it can, then great! Tell that node it's doing a great job, keep it up. If not, then bad! Tell that node it had better get back on the wagon. (Strictly speaking, the safest way to implement this is to send only "you're doing great" messages after polling. A node that is broken might not be capable of receiving "you're doing badly" messages. You want a watchdog-like system that resets the node when it doesn't get a "great!" message within a given time limit.)
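
Here's a minimal sketch of that watchdog idea, not a real implementation. The class `BuddyWatchdog`, the function `check_buddy`, and the `can_route_through` callback are all hypothetical names; in a real mesh the "you're doing great" message would travel over the network rather than being a direct method call.

```python
import time

class BuddyWatchdog:
    """Runs on each mesh node and demands regular good news from its buddies."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock              # injectable so it can be tested
        self.last_praise = clock()

    def praise(self):
        """A buddy successfully routed traffic through us."""
        self.last_praise = self.clock()

    def needs_reset(self):
        """True when no "great!" message arrived within the time limit."""
        return self.clock() - self.last_praise > self.timeout_seconds


def check_buddy(can_route_through, buddy_watchdog):
    """Pretend to be an end user of a nearby node; only ever send good news.

    `can_route_through` is a hypothetical probe that returns True when
    traffic sent via the buddy actually reaches the Internet.
    """
    if can_route_through():
        buddy_watchdog.praise()
    # On failure we stay silent: a broken node might not receive
    # complaints, so its own watchdog timeout does the work.
```

The node would poll `needs_reset()` periodically and take itself out of circulation (or reboot) when it returns True.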

In a sufficiently dense mesh - where there are always two or more routes between a given pair of nodes - this converts AND behaviour to OR behaviour. Now, adding nodes (ones that can decommission themselves when there's a problem) makes things more reliable instead of less.
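
As a toy illustration of routing around a decommissioned node, here's a sketch that compares a chain with a slightly denser mesh. The topologies and node names are invented, and real mesh routing protocols are far more involved than this breadth-first search.

```python
from collections import deque

def can_reach(links, start, goal, out_of_service=frozenset()):
    """Breadth-first search that skips nodes taken out of circulation."""
    if start in out_of_service or goal in out_of_service:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in links.get(node, ()):
            if neighbour not in seen and neighbour not in out_of_service:
                seen.add(neighbour)
                queue.append(neighbour)
    return False

# Chain: client - b - a - gateway (every node is needed: AND behaviour).
chain = {"gateway": ["a"], "a": ["gateway", "b"],
         "b": ["a", "client"], "client": ["b"]}

# Dense: client reaches the gateway via a OR via b.
dense = {"gateway": ["a", "b"], "a": ["gateway", "client"],
         "b": ["gateway", "client"], "client": ["a", "b"]}

print(can_reach(chain, "client", "gateway", out_of_service={"a"}))
print(can_reach(dense, "client", "gateway", out_of_service={"a"}))
```

Losing node `a` cuts off the client in the chain, while the dense mesh still has a route through `b`.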

That gives meshes an advantage over LTE instead of a disadvantage: LTE has less redundancy. If a base station goes down, a whole area loses coverage and the phone company needs to rush to fix it. If a mesh node goes down, we route around the problem and fix it at our leisure later.

A little bit of math goes a long way!

Not enough for you?

You can see my complete slides (pdf) about consumer wifi meshes (including detailed speaker notes) from Battlemesh v10 in Vienna, or watch my talk on YouTube.

Previously: my talk on wifi data collection and analytics.

Footnotes

[1] These so-called "laws" are a special case of more general and thus more useful distributed systems theorems. But this is the Internet, so I picked my one special case and named it after myself. Go ahead, try and stop me.

[2] Laws may be different in your jurisdiction.

I'm CEO at Tailscale, where we make network problems disappear.


apenwarr on gmail.com