201304 - apenwarr

2013-04-01 »

Good news: successfully wrote program to simulate bad behaviour under heavy memory pressure.

Bad news: tested it on my desktop.

Oops. It worked.

2013-04-02 »

I didn't understand before, but now I do, why "realtime" stuff is incompatible with "virtual memory" stuff.

I have two words and a guttural sound for you: priority inversion. Gaaaaaaaaaaaaaaaarghgggh.

kswapd: a non-realtime process that needs to swap pages from your realtime process in and out. That doesn't work, of course, leading to deadlocks. You might think you could solve it by using mlockall(MCL_FUTURE), which actually does mostly work, right up until you try to open a file. Then because of the prior mlockall(), it decides to swap in all the pages right away. But that's kswapd's job. Maybe with some help from nfsiod or mtdblock0 or one of the many other non-realtime-priority kernel threads.

Even if your realtime thread doesn't open any files - and it doesn't because we're not crazy - it doesn't matter. A non-realtime-priority thread in the same program can open a file, and you end up with pages mapped into your address space, and those pages need to be paged in right away because of mlockall, and the kernel helpfully chooses your realtime thread to do it, because after all, we don't want none of that priority inversion where your realtime thread can't continue because a low-priority thread is blocked on reading your pages.

Except it blocks anyway, because the kernel threads responsible for actually doing such things are not realtime threads.

My solution: if you have a page fault in your realtime process, nanosleep() for a bit to let other people run. Sure, it's horrible, but it was worse before. And actually trying to fix the priority inversion directly (eg. by boosting the priority of dependent processes) is a losing battle, because you don't know which processes you're dependent on, in the general case.
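For the curious, here is a rough userspace sketch of the "notice a page fault, then sleep a bit" idea. It is not the actual fix: the real thing would presumably live in C and call nanosleep() directly, and this entry doesn't say exactly where the check goes. The names yield_if_faulted and major_faults and the 1 ms pause are all made up for illustration. It assumes a Unix-like system where getrusage() reports major faults, and it only notices a fault after the fact, counted across the whole process.

```python
import resource
import time


def major_faults() -> int:
    """Major (disk-backed) page faults taken by this process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_majflt


def yield_if_faulted(last_faults, read_faults=major_faults,
                     pause_s=0.001, sleep=time.sleep):
    """Sleep briefly if new major faults happened since `last_faults`.

    Returns the current fault count so the caller can pass it back in
    on the next iteration. `pause_s` is an arbitrary illustrative value.
    """
    current = read_faults()
    if current > last_faults:
        # We faulted: step aside so the non-realtime threads doing the
        # actual paging work get a chance to run.
        sleep(pause_s)
    return current


if __name__ == "__main__":
    seen = major_faults()
    for _ in range(3):
        # ... one iteration of the realtime loop's work would go here ...
        seen = yield_if_faulted(seen)
```

The read_faults and sleep parameters exist only so the logic can be exercised without actually faulting or sleeping.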

Sigh.

2013-04-03 »

According to the endless stream of people "endorsing" me on LinkedIn, I'm apparently an all-time expert in Perl and now Ruby.

I don't know which bias this is, but it's not sampling bias, because these people actually do know me. I think it might be "click this button to get it the heck out of my way" bias.

2013-04-04 »

Just found out that our team is getting disproportionately blamed for network quality problems because we have better logs, reports, alerts, and monitoring than other teams.

I guess we're doing something right! No such thing as bad publicity, I always say.

2013-04-05 »

The same thing we do every month, Pinky...

http://www.pcmag.com/article2/0,2817,2417533,00.asp

This test is kind of ridiculous. It measures the average streaming bandwidth your customers achieve from Netflix, and Netflix has a maximum streaming rate that simply isn't all that fast. The primary factor controlling an ISP's average score is the average rate a subscriber chose when signing up with that ISP. An ISP can have perfectly good, reasonably priced 50 Mbit/sec plans, which would easily stream at Netflix's maximum rate, but if they have a $1/month 3 Mbit/sec plan and everyone switched to it, they'd lose the competition.

In other words, there is no particular reason Google Fiber is winning at this one and we can expect their score to decline as they move into neighbourhoods where a greater fraction of people choose the free plan.
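To make the plan-mix argument concrete, here is a toy model. It assumes each subscriber streams at the lower of their plan rate and Netflix's maximum rate, which ignores congestion and everything else interesting. The 5 Mbit/sec cap is a made-up placeholder, not Netflix's real number; the 50 and 3 Mbit/sec plans are the ones from the example above, and the subscriber shares are invented.

```python
from typing import Dict


def average_stream_mbps(plan_mix: Dict[float, float],
                        stream_cap_mbps: float) -> float:
    """Average achieved streaming rate across subscribers.

    plan_mix maps plan rate (Mbit/sec) -> share of subscribers on that plan.
    Each subscriber is assumed to stream at min(plan rate, stream cap).
    """
    total_share = sum(plan_mix.values())
    weighted = sum(min(rate, stream_cap_mbps) * share
                   for rate, share in plan_mix.items())
    return weighted / total_share


if __name__ == "__main__":
    CAP = 5.0  # hypothetical maximum streaming rate, Mbit/sec
    print(average_stream_mbps({50.0: 1.0}, CAP))            # everyone on the fast plan
    print(average_stream_mbps({50.0: 0.5, 3.0: 0.5}, CAP))  # half on the cheap plan
    print(average_stream_mbps({3.0: 1.0}, CAP))             # everyone on the cheap plan
```

Under those assumptions the score moves only when people pick a plan slower than the cap. Nothing about the quality of the network changed between the three cases.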

2013-04-06 »

Rumour has it that the USSR used to rely on CIA estimates of their own economic output because they knew their own values were biased.

That's how I feel, some days.

2013-04-07 »

Sometimes you go into tech debt

I always thought the point of the technical debt metaphor was that sometimes you deliberately go into debt because it's the right thing to do right now. The people who introduced that term were making a very clear point that never going into debt is the wrong choice, and trying to explain how sometimes optimizing for the short term totally makes sense.

And yes, you still have to pay off your debts eventually. (Also cool, however, is that if your project goes "bankrupt", ie. you stop maintaining it, you never do have to pay off the debt. Maybe we should call it technical venture capital.)

2013-04-08 »

This document with definitions about data pipelines is more detailed than the implementation of our actual data pipeline.

"See all that stuff in there, Homer? That's why your robot never worked."

2013-04-09 »

Okay, it's settled. I just don't trust signal strength meters.

2013-04-10 »

Numbers. Can't live with em, can't just make wild estimates and move on with your day.

Actually, wait, yes you can.

2013-04-11 »

Two meditations on software stability:

1) There is a level of panic we expect. If it were higher, we'd stop deploying to new customers. If it were lower, we'd deploy new customers faster until the level of problems became unmanageable.

2) The Linux kernel is about 20 years old (!) now. It's unreasonable to expect that upgrading the kernel nowadays will cause a net reduction in bugs. If that were true, it would be virtually bug free by now. I can assure you that it is not.

2013-04-12 »

Hope is not a strategy? Nonsense. It has all the outward appearances of a strategy. Intense meetings, motivational speeches, post-mortems, the works.

Of course, when I do it, I hope really hard, so that's probably why it'll work for me.

2013-04-14 »

"The light is necessary to me so I don't forget to turn the box off when I finish watching TV." – a customer

Teaching people to believe they need to turn off their set-top box when they're done watching TV, combined with training them that an "unclean shutdown" of their computer will cause it to eat itself for breakfast, are just two of the terrible sins our kind has committed against humanity.

(As a ridiculous compensation for this, in many modern set-top boxes, the light and power button exist mostly as a placebo.)

2013-04-15 »

Okay. I respect the OOM killer more now.

Turns out it's hard.

2013-04-16 »

"I was really hoping that when AT&T and TWC complained that Google was getting a better deal than them that the cities would simply reply 'Sorry, but that offer is for new customers only. Thanks for being a valued customer'." – a comment on news.ycombinator

2013-04-17 »

1 Gbps, fully saturated, is 324,000 Gbytes per month.

According to the Amazon EC2 calculator, that would cost your EC2 instance $22,373.57. (Also they'll throw in a virtual computer worth $14.64/month :))

Conclusion: $70/month is a pretty good deal.
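The arithmetic behind the 324,000 figure, as a quick sketch. It assumes a 30-day month and decimal units (1 Gbyte = 8 Gbits); the function name is just for illustration.

```python
def gbytes_per_month(link_gbps: float, days: int = 30) -> float:
    """Gigabytes moved by a fully saturated link over `days` days."""
    seconds = days * 24 * 60 * 60   # 2,592,000 seconds in a 30-day month
    gbits = link_gbps * seconds     # total gigabits transferred
    return gbits / 8                # 8 bits per byte


if __name__ == "__main__":
    print(gbytes_per_month(1.0))  # 324000.0
```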

2013-04-18 »

I'm chasing a kernel bug that occurs every 1700 device-hours or so of runtime. That amount of time is a combination of "annoyingly infrequent" and "way the heck too frequent if you have thousands of devices." Unfortunately, we found out about this bug only after upgrading those thousands of devices to the new, buggier software version, because it shows up so infrequently that we didn't notice it in testing.

Now, this is more or less good news, because it means we caught all our high-frequency problems and are on to low-frequency problems. We're able to clearly see this problem when we roll out to 1000 customers, and maybe there's no better way to detect such problems (although of course we're constantly improving our stress tests etc).
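A back-of-envelope sketch of why this bug hides in testing and shows up in the field. The big assumption is that crashes arrive at a constant rate of one per 1700 device-hours, independently across devices (a Poisson process), which real bugs are under no obligation to respect. The 10-device lab is a hypothetical size, not our actual test setup.

```python
import math

MTBF_DEVICE_HOURS = 1700.0  # from above: roughly one crash per 1700 device-hours


def expected_crashes(devices: int, hours: float,
                     mtbf: float = MTBF_DEVICE_HOURS) -> float:
    """Expected number of crashes across the fleet in `hours` of wall time."""
    return devices * hours / mtbf


def prob_at_least_one(devices: int, hours: float,
                      mtbf: float = MTBF_DEVICE_HOURS) -> float:
    """Chance of seeing the bug at least once, under the Poisson assumption."""
    return 1.0 - math.exp(-expected_crashes(devices, hours, mtbf))


if __name__ == "__main__":
    for fleet in (10, 1000):  # hypothetical lab size vs. a 1000-customer rollout
        print(fleet, "devices, 24h:",
              round(expected_crashes(fleet, 24), 2), "expected crashes,",
              round(prob_at_least_one(fleet, 24), 3), "chance of at least one")
```

At 1000 devices that works out to roughly 14 crashes a day (24,000 / 1700), which is hard to miss. A small lab running for a day will usually see nothing at all.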

What happens when we move up to 100,000 customers? Obviously the answer is staged rollouts, where we do only a sampling of customers first, look for problems, then move on. At each stage, we can detect progressively-less-common new problems.

But how many stages and what are the stages? WWGD (What Would Google Do)? Intuitively, I feel like "one rollout phase per order of magnitude" might be a good option to start with, but who knows.
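Here is what "one rollout phase per order of magnitude" could look like as a schedule. This is only a sketch of the intuition above, not anything we (or Google) actually run; the starting size of 10 and the factor of 10 are assumptions you'd want to tune.

```python
from typing import List


def rollout_stages(total_customers: int, first_stage: int = 10,
                   factor: int = 10) -> List[int]:
    """Cumulative customer counts for each stage of a staged rollout.

    Each stage is `factor` times bigger than the last, and the final
    stage always covers everyone.
    """
    if total_customers < 1 or first_stage < 1 or factor < 2:
        raise ValueError("need total_customers >= 1, first_stage >= 1, factor >= 2")
    stages = []
    size = first_stage
    while size < total_customers:
        stages.append(size)
        size *= factor
    stages.append(total_customers)  # last stage: the whole fleet
    return stages


if __name__ == "__main__":
    print(rollout_stages(100000))  # [10, 100, 1000, 10000, 100000]
```

Each stage puts in about ten times the device-hours of the one before, so in principle it can surface problems about ten times rarer, provided you wait long enough at each stage to see them.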

2013-04-19 »

Dammit, iperf, the server isn't allowed to get ECONNREFUSED. It just doesn't make sense.

2013-04-21 »

I'm interested in the fact that I'm approximately the only person I know who thinks IPv6 is not a good idea, and that multi-level ("carrier grade") NAT is actually not just more likely in the short term, but better in the long term.[*]

I think our differences of opinion come down to this: most people seem to understand complexity differently than I do. As far as I'm concerned, building and testing two parallel systems (IPv4 and IPv6) is twice as much work. Testing a slightly more complex system (NAT) applied recursively (carrier grade NAT) requires pretty much the same level of testing as applying it only once. And we already test that because we have to.

Or, you know, I could just be wrong and everyone else could be right. :)

[*] We'll also want a good protocol for opening incoming ports through recursive NAT. NAT-PMP seems to be that.

2013-04-22 »

Aha! I've finally narrowed it down. The crash is definitely in either the kernel core, the MIPS architecture support, the icache, the TLB, the page cache, the platform-specific drivers, the SATA layer, the ethernet bridging layer, the firewall, the multicast support, or else it's somewhere in userspace. Or possibly some combination of those.

We are this close, people.

2013-04-23 »

Netflix experimenting with higher prices, albeit in a rather non-threatening format. Good for them.

http://www.forbes.com/sites/matthickey/2013/04/22/netflix-to-offer-11-99-family-plan-beats-hbo-in-subscribers/

I've had a cable TV subscription for a few months now, and goodness yes, I certainly would be willing to pay more for Netflix instead.

2013-04-24 »

They say Canadians are polite, but we have nothing on Kansas City people. See, where I would say "the wifi extender just doesn't work at all," they managed to say something like:

"We need to experiment with the wifi settings in each home. We try to turn off the wifi extender when we can, and in cases where that means the network coverage isn't enough for the whole house, sometimes we turn off the extender and install a new router."

So, in short, the extender has two useful settings: off, or off and buy a Linksys router.

2013-04-25 »

This article is a great example of how people will just believe whatever they want to believe, despite evidence blatantly crashing into things in front of them.

http://gawker.com/crashing-through-manhattan-in-the-fake-google-driverles-478133301
"""
Soon after, KATSU blatantly cut a cab off and the Fiat was whacked hard in the back bumper. Despite the New York cab drivers' fearsome reputation, this one was cowed. "He looked at the Google logo [on the self driving car] and he thought it was his fault," KATSU said. "He got a really scared face like he was doing something wrong."
"""

2013-04-26 »

blip: a tool for seeing your Internet latency

Why is your Internet slow? It's probably not bandwidth. Here's a graph of your internet performance, right now:

(If you're reading this through RSS and your reader doesn't support iframes, you can visit the app at gfblip.appspot.com. Also try it on your phone or tablet.)

This real-time, latency-based measurement is way more accurate than speedtest.net at predicting your real web browsing performance. Although the results are maybe a bit harder to interpret.

For more information, motivation, philosophy, and ranting, read the README.

And it's open source. Have a nice day.

2013-04-30 »

Avery's rules of testing:

1. If you didn't test it, it doesn't work. Ever.

2. Thus, every time you write a new test, if your test is any good at all, you will discover something new that didn't work.

3. You will always be surprised when this happens. Even if you name the rules after yourself.