Forget privacy: you're terrible at targeting anyway
I don't mind letting your programs see my private data as long as I get something useful in exchange. But that's not what happens.
A former co-worker told me once: "Everyone loves collecting data, but nobody loves analyzing it later." This claim is almost shocking, but people who have been involved in data collection and analysis have all seen it. It starts with a brilliant idea: we'll collect information about every click someone makes on every page in our app! And we'll track how long they hesitate over a particular choice! And how often they use the back button! How many seconds they watch our intro video before they abort! How many times they reshare our social media post!
And then they do track all that. Tracking it all is easy. Add some log events, dump them into a database, off we go.
But then what? Well, after that, we have to analyze it. And as someone who has analyzed a lot of data about various things, let me tell you: being a data analyst is difficult and mostly unrewarding (except financially).
See, the problem is there's almost no way to know if you're right. (It's also not clear what the definition of "right" is, which I'll get to in a bit.) There are almost never any easy conclusions, just hard ones, and the hard ones are error prone. What analysts don't talk about is how many incorrect charts (and therefore conclusions) get made on the way to making correct ones. Or ones we think are correct. A good chart is so incredibly persuasive that it almost doesn't even matter if it's right, as long as what you want is to persuade someone... which is probably why newpapers, magazines, and lobbyists publish so many misleading charts.
But let's leave errors aside for the moment. Let's assume, very unrealistically, that we as a profession are good at analyzing things. What then?
Well, then, let's get rich on targeted ads and personalized recommendation algorithms. It's what everyone else does!
Or do they?
The state of personalized recommendations is surprisingly terrible. At this point, the top recommendation is always a clickbait rage-creating article about movie stars or whatever Trump did or didn't do in the last 6 hours. Or if not an article, then a video or documentary. That's not what I want to read or to watch, but I sometimes get sucked in anyway, and then it's recommendation apocalypse time, because the algorithm now thinks I like reading about Trump, and now everything is Trump. Never give positive feedback to an AI.
This is, by the way, the dirty secret of the machine learning movement: almost everything produced by ML could have been produced, more cheaply, using a very dumb heuristic you coded up by hand, because mostly the ML is trained by feeding it examples of what humans did while following a very dumb heuristic. There's no magic here. If you use ML to teach a computer how to sort through resumes, it will recommend you interview people with male, white-sounding names, because it turns out that's what your HR department already does. If you ask it what video a person like you wants to see next, it will recommend some political propaganda crap, because 50% of the time 90% of the people do watch that next, because they can't help themselves, and that's a pretty good success rate.
(Side note: there really are some excellent uses of ML out there, for things traditional algorithms are bad at, like image processing or winning at strategy games. That's wonderful, but chances are good that your pet ML application is an expensive replacement for a dumb heuristic.)
Someone who works on web search once told me that they already have an algorithm that guarantees the maximum click-through rate for any web search: just return a page full of porn links. (Someone else said you can reverse this to make a porn detector: any link which has a high click-through rate, regardless of which query it's answering, is probably porn.)
Now, the thing is, legitimate-seeming businesses can't just give you porn links all the time, because that's Not Safe For Work, so the job of most modern recommendation algorithms is to return the closest thing to porn that is still Safe For Work. In other words, celebrities (ideally attractive ones, or at least controversial ones), or politics, or both. They walk that line as closely as they can, because that's the local maximum for their profitability. Sometimes they accidentally cross that line, and then have to apologize or pay a token fine, and then go back to what they were doing.
This makes me sad, but okay, it's just math. And maybe human nature. And maybe capitalism. Whatever. I might not like it, but I understand it.
My complaint is that none of the above had anything to do with hoarding my personal information.
The hottest recommendations have nothing to do with me
Let's be clear: the best targeted ads I will ever see are the ones I get from a search engine when it serves an ad for exactly the thing I was searching for. Everybody wins: I find what I wanted, the vendor helps me buy their thing, and the search engine gets paid for connecting us. I don't know anybody who complains about this sort of ad. It's a good ad.
And it, too, had nothing to do with my personal information!
Google was serving targeted search ads decades ago, before it ever occurred to them to ask me to log in. Even today you can still use every search engine web site without logging in. They all still serve ads targeted to your search keyword. It's an excellent business.
There's another kind of ad that works well on me. I play video games sometimes, and I use Steam, and sometimes I browse through games on Steam and star the ones I'm considering buying. Later, when those games go on sale, Steam emails me to tell me they are on sale, and sometimes then I buy them. Again, everybody wins: I got a game I wanted (at a discount!), the game maker gets paid, and Steam gets paid for connecting us. And I can disable the emails if I want, but I don't want, because they are good ads.
But nobody had to profile me to make that happen! Steam has my account, and I told it what games I wanted and then it sold me those games. That's not profiling, that's just remembering a list that I explicitly handed to you.
Amazon shows a box that suggests I might want to re-buy certain kinds of consumable products that I've bought in the past. This is useful too, and requires no profiling other than remembering the transactions we've had with each other in the past, which they kinda have to do anyway. And again, everybody wins.
Now, Amazon also recommends products like the ones I've bought before, or looked at before. That's, say, 20% useful. If I just bought a computer monitor, and you know I did because I bought it from you, then you might as well stop selling them to me. But for a few days after I buy any electronics they also keep offering to sell me USB cables, and they're probably right. So okay, 20% useful targeting is better than 0% useful. I give Amazon some credit for building a useful profile of me, although it's specifically a profile of stuff I did on their site and which they keep to themselves. That doesn't seem too invasive. Nobody is surprised that Amazon remembers what I bought or browsed on their site.
Worse is when (non-Amazon) vendors get the idea that I might want something. (They get this idea because I visited their web site and looked at it.) So their advertising partner chases me around the web trying to sell me the same thing. They do that, even if I already bought it. Ironically, this is because of a half-hearted attempt to protect my privacy. The vendor doesn't give information about me or my transactions to their advertising partner (because there's an excellent chance it would land them in legal trouble eventually), so the advertising partner doesn't know that I bought it. All they know (because of the advertising partner's tracker gadget on the vendor's web site) is that I looked at it, so they keep advertising it to me just in case.
But okay, now we're starting to get somewhere interesting. The advertiser has a tracker that it places on multiple sites and tracks me around. So it doesn't know what I bought, but it does know what I looked at, probably over a long period of time, across many sites.
Using this information, its painstakingly trained AI makes conclusions about which other things I might want to look at, based on...
...well, based on what? People similar to me? Things my Facebook friends like to look at? Some complicated matrix-driven formula humans can't possibly comprehend, but which is 10% better?
Probably not. Probably what it does is infer my gender, age, income level, and marital status. After that, it sells me cars and gadgets if I'm a guy, and fashion if I'm a woman. Not because all guys like cars and gadgets, but because some very uncreative human got into the loop and said "please sell my car mostly to men" and "please sell my fashion items mostly to women." Maybe the AI infers the wrong demographic information (I know Google has mine wrong) but it doesn't really matter, because it's usually mostly right, which is better than 0% right, and advertisers get some mostly demographically targeted ads, which is better than 0% targeted ads.
You know this is how it works, right? It has to be. You can infer it from how bad the ads are. Anyone can, in a few seconds, think of some stuff they really want to buy which The Algorithm has failed to offer them, all while Outbrain makes zillions of dollars sending links about car insurance to non-car-owning Manhattanites. It might as well be a 1990s late-night TV infomercial, where all they knew for sure about my demographic profile is that I was still awake.
You tracked me everywhere I go, logging it forever, begging for someone to steal your database, desperately fearing that some new EU privacy regulation might destroy your business... for this?
Of course, it's not really as simple as that. There is not just one advertising company tracking me across every web site I visit. There are... many advertising companies tracking me across every web site I visit. Some of them don't even do advertising, they just do tracking, and they sell that tracking data to advertisers who supposedly use it to do better targeting.
This whole ecosystem is amazing. Let's look at online news web sites. Why do they load so slowly nowadays? Trackers. No, not ads - trackers. They only have a few ads, which mostly don't take that long to load. But they have a lot of trackers, because each tracker will pay them a tiny bit of money to be allowed to track each page view. If you're a giant publisher teetering on the edge of bankruptcy and you have 25 trackers on your web site already, but tracker company #26 calls you and says they'll pay you $50k a year if you add their tracker too, are you going to say no? Your page runs like sludge already, so making it 1/25th more sludgy won't change anything, but that $50k might.
Then the ad sellers, and ad networks, buy the tracking data from all the trackers. The more tracking data they have, the better they can target ads, right? I guess.
The brilliant bit here is that each of the trackers has a bit of data about you, but not all of it, because not every tracker is on every web site. But on the other hand, cross-referencing individuals between trackers is kinda hard, because none of them wants to give away their secret sauce. So each ad seller tries their best to cross-reference the data from all the tracker data they buy, but it mostly doesn't work. Let's say there are 25 trackers each tracking a million users, probably with a ton of overlap. In a sane world we'd guess that there are, at most, a few million distinct users. But in an insane world where you can't prove if there's an overlap, it could be as many as 25 million distinct users! The more tracker data your ad network buys, the more information you have! Probably! And that means better targeting! Maybe! And so you should buy ads from our network instead of the other network with less data! I guess!
None of this works. They are still trying to sell me car insurance for my subway ride.
It's not just ads
That's a lot about profiling for ad targeting, which obviously doesn't work, if anyone would just stop and look at it. But there are way too many people incentivized to believe otherwise. Meanwhile, if you care about your privacy, all that matters is they're still collecting your personal information whether it works or not.
What about content recommendation algorithms though? Do those work?
Obviously not. I mean, have you tried them. Seriously.
That's not quite fair. There are a few things that work. Pandora's music recommendations are surprisingly good, but they are doing it in a very non-obvious way. The obvious way is to take the playlist of all the songs your users listen to, blast it all into an ML training dataset, and then use that to produce a new playlist for new users based on... uh... their... profile? Well, they don't have a profile yet because they just joined. Perhaps based on the first few songs they select manually? Maybe, but they probably started with either a really popular song, which tells you nothing, or a really obscure song to test the thoroughness of your library, which tells you less than nothing.
(I'm pretty sure this is how Mixcloud works. After each mix, it tries to find the "most similar" mix to continue with. Usually this is someone else's upload of the exact same mix. Then the "most similar" mix to that one is the first one, so it does that. Great job, machine learning, keep it up.)
That leads us to the "random song followed by thumbs up/down" system that everyone uses. But everyone sucks, except Pandora. Why? Apparently because Pandora spent a lot of time hand-coding a bunch of music characteristics and writing a "real algorithm" (as opposed to ML) that tries to generate playlists based on the right combinations of those characteristics.
In that sense, Pandora isn't pure ML. It often converges on a playlist you'll like within one or two thumbs up/down operations, because you're navigating through a multidimensional interconnected network of songs that people encoded the hard way, not a massive matrix of mediocre playlists scraped from average people who put no effort into generating those playlists in the first place. Pandora is bad at a lot of things (especially "availability in Canada") but their music recommendations are top notch.
Just one catch. If Pandora can figure out a good playlist based on a starter song and one or two thumbs up/down clicks, then... I guess it's not profiling you. They didn't need your personal information either.
While we're here, I just want to rant about Netflix, which is an odd case of starting off with a really good recommendation algorithm and then making it worse on purpose.
Once upon a time, there was the Netflix prize, which granted $1 million to the best team that could predict people's movie ratings, based on their past ratings, with better accuracy than Netflix could themselves. (This not-so-shockingly resulted in a privacy fiasco when it turned out you could de-anonymize the data set that they publicly released, oops. Well, that's what you get when you long-term store people's personal information in a database.)
Netflix believed their business depended on a good recommendation algorithm. It was already pretty good: I remember using Netflix around 10 years ago and getting several recommendations for things I would never have discovered, but which I turned out to like. That hasn't happened to me on Netflix in a long, long time.
As the story goes, once upon a time Netflix was a DVD-by-mail service. DVD-by-mail is really slow, so it was absolutely essential that at least one of this week's DVDs was good enough to entertain you for your Friday night movie. Too many Fridays with only bad movies, and you'd surely unsubscribe. A good recommendation system was key. (I guess there was also some interesting math around trying to make sure to rent out as much of the inventory as possible each week, since having a zillion copies of the most recent blockbuster, which would be popular this month and then die out next month, was not really viable.)
Eventually though, Netflix moved online, and the cost of a bad recommendation was much less: just stop watching and switch to a new movie. Moreover, it was perfectly fine if everyone watched the same blockbuster. In fact, it was better, because they could cache it at your ISP and caches always work better if people are boring and average.
Worse, as the story goes, Netflix noticed a pattern: the more hours people watch, the less likely they are to cancel. (This makes sense: the more hours you spend on Netflix, the more you feel like you "need" it.) And with new people trying the service at a fixed or proportional rate, higher retention translates directly to faster growth.
When I heard this was also when I learned the word "satisficing," which essentially means searching through sludge not for the best option, but for a good enough option. Nowadays Netflix isn't about finding the best movie, it's about satisficing. If it has the choice between an award-winning movie that you 80% might like or 20% might hate, and a mainstream movie that's 0% special but you 99% won't hate, it will recommend the second one every time. Outliers are bad for business.
The thing is, you don't need a risky, privacy-invading profile to recommend a mainstream movie. Mainstream movies are specially designed to be inoffensive to just about everyone. My Netflix recommendations screen is no longer "Recommended for you," it's "New Releases," and then "Trending Now," and "Watch it again."
As promised, Netflix paid out their $1 million prize to buy the winning recommendation algorithm, which was even better than their old one. But they didn't use it, they threw it away.
Some very expensive A/B testers determined that this is what makes me watch the most hours of mindless TV. Their revenues keep going up. And they don't even need to invade my privacy to do it.
Who am I to say they're wrong?
Quotes from 1992
I was recently recommended to read the book Accidental Empires by Robert X. Cringely, first published in 1992 (or was it 1991?) and apparently no longer in print and also not in e-book format. To my surprise, it turns out archive.org has a solution for this, an epically user-unfriendly "virtual library card" (which is still worth it if you need to read a book) in which they apparently receive one physical copy of a book, scan it, and lend it out digitally, one person at a time, using an aggressively user-hostile DRM abomination called Adobe Digital Editions.
(I'm not kidding about the user hostility. The iOS version, for no reason, actually requires you to double click to open a book from the list. But, like double clicking with a mouse, the second click doesn't count if you're more than a few pixels off. I hope you have really precise fingers! I suspect the people who wrote it literally did not ever run it on real iPhone hardware, and therefore tested it with a mouse.)
It's a pretty fascinating book, snapshotting the microcomputer industry (and its history) as things sat back in 1991 (or was it 1992?). It surprised me how many things are just like they were back then, 27 years ago (yikes!). The names have changed, but human organizations remain the same.
(He also serialized and updated the book on his blog back in 2012-2013, but it's really long and I didn't have time to read it, and apparently that version also didn't come out in ebook format. I would have happily paid for it to read in a more convenient format. Oh well.)
Here are some quotes I painstakingly retyped. Take that, DRM garbage.
On avoiding adulthood:
They weren't rebels; they resented their parents and society very little. Their only alienation was the usual hassle of the adolescent - a feeling of being prodded into adulthood on somebody else's terms. [...] And turning this culture into a business? That was just a happy accident that allowed these boys to put off forever the horror age - that dividing line to adulthood that they would otherwise have been forced to cross after college.
On big things failing to scale down:
How did we get from big computers that lived in the basement of office buildings to the little computers that live on our desks today? We didn't. personal computers have almost nothing to do with big computers. They never have. [...] Big computers and little computers are completely different beasts created by radically different groups of people. It's logical, I know, to assume that the personal computer came from shrinking a mainframe, but that's not the way it happened.
On amateurs as the carriers of progress, predicting the success of open source:
It takes new ideas a long time to catch on - time that is mainly devoted to evolving the idea into something useful. This fact alone dumps most of the responsibility for early technical innovation in the laps of amateurs, who can afford to take the time. Only those who aren't trying to make money can afford to advance a technology that doesn't pay.
On the surprising existence of limited ambition:
Let's say for a minute that Eubanks was correct, and Gary Kildall didn't give a shit about the business. Who said that he had to? CP/M was his invention; Digital Research was his company. The fact that it succeeded beyond anyone's expectations did not make those earlier expectations invalid. Gary Kildall's ambition was limited, something that is not supposed to be a factor in American business. If you hope for a thousand and get a million, you are still expected to want more, but he didn't.
On the Chief Scientist job title:
In a personal computer hardware or software company, being named chief scientist means that the boss doesn't know what to do with you. Chief scientists don't generally have to do anything; they're just smart people whom the company doesn't want to lose to a competitor.
"Research subjects" at PARC, the original doesn't-count-as-headcount TVCs:
Money wasn't a problem, but manpower was; it was almost impossible to hire additional people at the Computer Science Laboratory because of the arduous hiring gauntlet and Taylor's reluctance to manage extra heads. [...] Simonyi came up with a scam. He proposed a research project to study programmer productivity and how to increase it. In the course of the study, test subjects would be paid to write software under Simonyi's supervision. [...] By calling them research subjects rather than programmers, he was able to bring some worker bees into PARC.
On indoctrination of recent graduates:
Through the architects and program managers, Gates was able to control the work of every programmer at Microsoft, but to do so reliably required cheap and obedient labor. Gates set a policy that consciously avoided hiring experienced programmers, specializing, instead, in recent computer science graduates.
On the dangers of only hiring geniuses:
Charles Simonyi accepts Microsoft mediocrity as an inevitable price paid to create a large organization. "The risk of genius is that the products that result from genius often don't have much to do with each other," he explained.
On the value of limiting yourself to standards:
...which was why the idea of 100 percent IBM compatibility took so long to be accepted. "Why be compatible when you could be better?" the smart guys asked on their way to bankruptcy court.
The social effects of frequent reorgs:
The rest of the company was as confused as its leadership. Somehow, early on, reorganizations - "reorgs" - became part of the Apple culture. they happen every three to six months and come from Apple's basic lack of understanding that people need stability in order to be able to work together. [...] Make a bad decision? Who cares! By the time the bad news arrives, you'll be gone and someone else will have to handle the problems.
On survival in a large organization, which is very different from entrepreneurship:
I learned an important lesson that day: Success in a large organization, whether it's a university or IBM, is generally based on appearance, not reality. It's understanding the system and then working within it that really counts, not bowling scores or body bags.
An interesting organizational structure, where Bill Gates was the chairman but hired a president who would be responsible for everything except software development:
This idea of nurturing the original purpose of the company while expanding the business organization is something that most software and hardware companies lose sight of as they grow. They managed it at Microsoft by having the programmers continue to report to Bill Gates while everyone on the business side reported to Shirley.
On how Flight Simulator got approved:
Then there was Flight Simulator, the only computer game published by Microsoft. There was no business plan that included a role for computer games in Microsoft's future. Bill Gates just liked to play Flight Simulator, so Microsoft published it.
On Novell's very strange history, and why it was so unusually good for its era (it really was, too. That thing was great):
The early versions of most software are so bad that good programmers usually want to throw them away but can't because ship dates have to be met. But Novell wasn't shipping anything in 1982-1983, so early versions of its network software were thrown away and started over again. Novell was able to take the time needed to come up with the correct architecture, a rare luxury for a start-up, and subsequently the company's greatest advantage.
A more specific version of the "Microsoft takes three tries to get anything right" theory:
Microsoft's entry into most new technologies follows this same plan, with the first effort being a preemptive strike, the second effort being market research to see what customers really want in a product, and the third try is the real product.
On industry analysts giving you the numbers you want to hear:
...the question, which was: When will unit sales of OS/2 exceed those of DOS? The assumption (and the flaw) built into this exercise is that OS/2, because it was being pushed by IBM, was destined to overtake DOS, which it hasn't. But given that the paying customers wanted OS/2 to succeed and that the research question itself suggested that OS/2 would succeed, market research companies like Dataquest, InfoCorp, and International Data Corporation dutifully crazy-glued their usual demand curves on a chart and predicted that OS/2 would be a big hit. There were no dissenting voices. Not a single market research report that I read or read about at that time predicted that OS/2 would be a failure.
On Bill Gates's annual reading weeks:
Annual reading weeks, when Gates stays home and reads technical reports for seven days straight and then emerges to reposition the company, are a tradition at Microsoft. Nothing is allowed to get in the way of planned reading for Chairman Bill.
A partially-failed prediction about Steve Jobs:
(When InfoWorld's Peggy Watt asked Gates if Microsoft would develop applications for the NeXT computer, he said, "Develop for it? I'll piss on it.") Alas, I'm not giving very good odds that Steve Jobs will be the leader of the next generation of personal computing.
A little-known partnership between IBM and Apple to try to make a new OS (Pink) that would finally beat DOS:
IBM has 33,000 programmers on its payroll but is so far from leading the software business (and knows it) that it is betting the company on the work of 100 Apple programmers wearing T-shirts in Mountain View, California.
A perception of industry fatigue in 1991-1992, which was around the time almost everyone gave up competing with DOS+Windows. Interestingly, this is also when Linux arrived (produced by "amateurs", see above) and may have rejuvenated things:
But today, everyone who wants to be in the PC business is already in it. Except for a new batch of kids who appear out of school each year, the only new blood in this business is due to immigration. And the old blood is getting tired - tired of failing in some cases or just tired of working so hard and now ready to enjoy life. The business is slowing down, and this loss of energy is the greatest threat to our computing future as a nation. Forget about the Japanese; their threat is nothing compared to this loss of intellectual vigor.
A rather bad-seeming idea for "software studios," which maybe seemed like a good idea at the time, but we've kinda tried it since then and it has a lot of unexpected downsides:
[Comparing to Hollywood studio structure] In the computer business, too, we've held to the idea that every product is going to live forever. We should be like the movies and only do sequels of hits. And you don't have to keep the original team together to do a sequel. All you have to do is make sure that the new version can read all the old product files and that it feels familiar.
An interesting perspective on the (then still in progress) takeover of computer hardware manufacturing by Asian countries. China wasn't yet even on the radar. (His claim was that it didn't matter because software was so valuable and stayed in America. 27 years later, that prediction has held up okay, although Asia seems to be doing fine with that hardware stuff):
The hardware business is dying. Let it. The Japanese and Koreans are so eager to take over the PC hardware business that they are literally trying to buy the future. But they're only buying the past.
(All above quotes by Robert X. Cringely)
The log/event processing pipeline you can't have
Let me tell you about the still-not-defunct real-time log processing pipeline we built at my now-defunct last job. It handled logs from a large number of embedded devices that our ISP operated on behalf of residential customers. (I wrote and presented previously about some of the cool wifi diagnostics that were possible with this data set.)
Lately, I've had a surprisingly large number of conversations about logs processing pipelines. I can find probably 10+ already-funded, seemingly successful startups processing logs, and the Big Name Cloud providers all have some kind of logs thingy, but still, people are not satisfied. It's expensive and slow. And if you complain, you mostly get told that you shouldn't be using unstructured logs anyway, you should be using event streams.
That advice is not wrong, but it's incomplete.
Instead of doing a survey of the whole unhappy landscape, let's just ignore what other people suffer with and talk about what does work. You can probably find, somewhere, something similar to each of the components I'm going to talk about, but you probably can't find a single solution that combines it all with good performance and super-low latency for a reasonable price. At least, I haven't found it. I was a little surprised by this, because I didn't think we were doing anything all that innovative. Apparently I was incorrect.
The big picture
Let's get started. Here's a handy diagram of all the parts we're going to talk about:
The ISP where I worked has a bunch of embedded Linux devices (routers, firewalls, wifi access points, and so on) that we wanted to monitor. The number increased rapidly over time, but let's talk about a nice round number, like 100,000 of them. Initially there were zero, then maybe 10 in our development lab, and eventually we hit 100,000, and later there were many more than that. Whatever. Let's work with 100,000. But keep in mind that this architecture works pretty much the same with any number of devices.
(It's a "distributed system" in the sense of scalability, but it's also the simplest thing that really works for any number of devices more than a handful, which makes it different from many "distributed systems" where you could have solved the problem much more simply if you didn't care about scaling. Since our logs are coming from multiple sources, we can't make it non-distributed, but we can try to minimize the number of parts that have to deal with the extra complexity.)
Now, these are devices we were monitoring, not apps or services or containers or whatever. That means two things: we had to deal with lots of weird problems (like compiler/kernel bugs and hardware failures), and most of the software was off-the-shelf OS stuff we couldn't easily control (or didn't want to rewrite).
(Here's the good news: because embedded devices have all the problems from top to bottom, any solution that works for my masses of embedded devices will work for any other log-pipeline problem you might have. If you're lucky, you can leave out some parts.)
That means the debate about "events" vs "logs" was kind of moot. We didn't control all the parts in our system, so telling us to forget logs and use only structured events doesn't help. udhcpd produces messages the way it wants to produce messages, and that's life. Sometimes the kernel panics and prints whatever it wants to print, and that's life. Move on.
Of course, we also had our own apps, which means we could also produce our own structured events when it was relevant to our own apps. Our team had whole never-ending debates about which is better, logs or events, structured or unstructured. In fact, in a move only overfunded megacorporations can afford, we actually implemented both and ran them both for a long time.
Thus, I can now tell you the final true answer, once and for all: you want structured events in your database.
...but you need to be able to produce them from unstructured logs. And once you can do that, exactly how those structured events are produced (either from logs or directly from structured trace output) turns out to be unimportant.
But we're getting ahead of ourselves a bit. Let's take our flow diagram, one part at a time, from left to right.
Userspace and kernel messages, in a single stream
Some people who have been hacking on Linux for a while may know about
/proc/kmsg: that's the file good old (pre-systemd)
klogd reads kernel
messages from, and pumps them to
syslogd, which saves them to a file.
Nowadays systemd does roughly the same thing but with more d-bus and
more corrupted binary log files. Ahem. Anyway. When you run the
command, it reads the same kernel messages (in a slightly different way).
What you might not know is that you can go the other direction. There's a
/dev and not
/proc) which, if you write
to it, produces messages into the kernel's buffer. Let's do that! For all
Wait, what? Am I crazy? Why do that?
Because we want strict sequencing of log messages between programs. And we want that even if your kernel panics.
Imagine you have, say, a TV DVR running on an embedded Linux system, and whenever you go to play a particular recorded video, the kernel panics because your chipset vendor hates you. Hypothetically. (The feeling is, hypothetically, mutual.) Ideally, you would like your logs to contain a note that the user requested the video, the video is about to start playing, we've opened the file, we're about to start streaming the file to the proprietary and very buggy (hypothetical) video decoder... boom. Panic.
What now? Well, if you're writing the log messages to disk, the joke's on you, because I bet you didn't fsync() after each one. (Once upon a time, syslogd actually did fsync() after each one. It was insanely disk-grindy and had very low throughput. Those days are gone.) Moreover, a kernel panic kills the disk driver, so you have no chance to fsync() it after the panic, unless you engage one of the more terrifying hacks like, after a panic, booting into a secondary kernel whose only job is to stream the message buffer into a file, hoping desperately that the disk driver isn't the thing that panicked, that the disk itself hasn't fried, and that even if you do manage to write to some disk blocks, they are the right ones because your filesystem data structure is reasonably intact.
(I suddenly feel a lot of pity for myself after reading that paragraph. I think I am more scars than person at this point.)
The kernel log buffer is in a fixed-size memory buffer in RAM. It defaults to being kinda small (tens or hundreds of kBytes), but you can make it bigger if you want. I suggest you do so.
By itself, this won't solve your kernel panic problems, because RAM is even more volatile than disk, and you have to reboot after a kernel panic. So the RAM is gone, right?
Well, no. Sort of. Not exactly.
Once upon a time, your PC BIOS would go through all your RAM at boot time and run a memory test. I remember my ancient 386DX PC used to do this with my amazingly robust and life-changing 4MB of RAM. It took quite a while. You could press ESC to skip it if you were a valiant risk-taking rebel like myself.
Now, memory is a lot faster than it used to be, but unfortunately it has gotten bigger more quickly than it has gotten faster, especially if you disable memory caching, which you certainly must do at boot time in order to write the very specific patterns needed to see if there are any bit errors.
So... we don't do the boot-time memory test. That ended years ago. If you reboot your system, the memory mostly will contain the stuff it contained before you rebooted. The OS kernel has to know that and zero out pages as they get used. (Sometimes the kernel gets fancy and pre-zeroes some extra pages when it's not busy, so it can hand out zero pages more quickly on demand. But it always has to zero them.)
So, the pages are still around when the system reboots. What we want to happen is:
The system reboots automatically after a kernel panic. You can do this by giving your kernel a boot parameter like "panic=1", which reboots it after one second. (This is not nearly enough time for an end user to read and contemplate the panic message. That's fine, because a) on a desktop PC, X11 will have crashed in graphics mode so you can't see the panic message anyway, and b) on an embedded system there is usually no display to put the message on. End users don't care about panic messages. Our job is to reboot, ASAP, so they don't try to "help" by power cycling the device, which really does lose your memory.) (Advanced users will make it reboot after zero seconds. I think panic=0 disables the reboot feature rather than doing that, so you might have to patch the kernel. I forget. We did it, whatever it was.)
The kernel always initializes the dmesg buffer in the same spot in RAM.
The kernel notices that a previous dmesg buffer is already in that spot in RAM (because of a valid signature or checksum or whatever) and decides to append to that buffer instead of starting fresh.
In userspace, we pick up log processing where we left off. We can capture the log messages starting before (and therefore including) the panic!
And because we redirected userspace logs to the kernel message buffer, we have also preserved the exact sequence of events that led up to the panic.
If you want all this to happen, I have good news and bad news. The good news is we open sourced all our code; the bad news is it didn't get upstreamed anywhere so there are no batteries included and no documentation and it probably doesn't quite work for your use case. Sorry.
Open source code:
logos tool for sending userspace logs to /dev/klogd. (It's logs... for the OS.. and it's logical... and it brings your logs back from the dead after a reboot... get it? No? Oh well.) This includes two per-app token buckets (burst and long-term) so that an out-of-control app won't overfill the limited amount of dmesg space.
PRINTK_PERSIST patch to make Linux reuse the dmesg buffer across reboots.
Even if you don't do any of the rest of this, everybody should use PRINTK_PERSIST on every computer, virtual or physical. Seriously. It's so good.
(Note: room for improvement: it would be better if we could just redirect
app stdout/stderr directly to
/dev/kmsg, but that doesn't work as well as
we want. First, it doesn't auto-prefix incoming messages with the app name.
Second, libc functions like printf() actually write a few bytes at a time,
not one message per write() call, so they would end up producing more than
one dmesg entry per line. Third,
/dev/kmsg doesn't support the token
bucket rate control that logos does, which turns out to be essential,
because sometimes apps go crazy. So we'd have to further extend the kernel
API to make it work. It would be worthwhile, though, because the extra
userspace process causes an unavoidable delay between when a userspace
program prints something and when it actually gets into the kernel log.
That delay is enough time for a kernel to panic, and the userspace message
gets lost. Writing directly to
/dev/kmsg would take less CPU, leave
userspace latency unchanged, and ensure the message is safely written before
(In related news, this makes all of syslogd kinda extraneous. Similarly for whatever systemd does. Why do we make everything so complicated? Just write directly to files or the kernel log buffer. It's cheap and easy.)
Uploading the logs
Next, we need to get the messages out of the kernel log buffer and into our log processing server, wherever that might be.
(Note: if we do the above trick - writing userspace messages to the kernel buffer - then we can't also use klogd to read them back into syslogd. That would create an infinite loop, and would end badly. Ask me how I know.)
So, no klogd -> syslogd -> file. Instead, we have something like syslogd -> kmsg -> uploader or app -> kmsg -> uploader.
What is a log uploader? Well, it's a thing that reads messages from the kernel kmsg buffer as they arrive, and uploads them to a server, perhaps over https. It might be almost as simple as "dmesg | curl", like my original prototype, but we can get a bit fancier:
Figure out which messages we've already uploaded (eg. from the persistent buffer before we rebooted) and don't upload those again.
Log the current wall-clock time before uploading, giving us sync points between monotonic time (
/dev/kmsglogs "microseconds since boot" by default, which is very useful, but we also want to be able to correlate that with "real" time so we can match messages between related machines).
Compress the file on the way out.
Somehow authenticate with the log server.
Bonus: if the log server is unavailable because of a network partition, try to keep around the last few messages from before the partition, as well as the recent messages once the partition is restored. If the network partition was caused by the client - not too rare if you, like us, were in the business of making routers and wifi access points - you really would like to see the messages from right before the connectivity loss.
Luckily for you, we also open sourced our code for this. It's in C so it's very small and low-overhead. We never quite got the code for the "bonus" feature working quite right, though; we kinda got interrupted at the last minute.
Open source code:
loguploader C client, including an rsyslog plugin for Debian in case you don't want to use the /dev/kmsg trick.
devcert, a tool (and Debian package) which auto-generates a self signed "device certificate" wherever it's installed. The device certificate is used by a device (or VM, container, whatever) to identify itself to the log server, which can then decide how to classify and store (or reject) its logs.
One thing we unfortunately didn't get around to doing was modifying the logupload client to stream logs to the server. This is possible using HTTP POST and Chunked encoding, but our server at the time was unable to accept streaming POST requests due to (I think now fixed) infrastructure limitations.
(Note: if you write load balancing proxy servers or HTTP server frameworks, make sure they can start processing a POST request as soon as all the headers have arrived, rather than waiting for the entire blob to be complete! Then a log upload server can just stream the bytes straight to the next stage even before the whole request has finished.)
Because we lacked streaming in the client, we had to upload chunks of log periodically, which leads to a tradeoff about what makes a good upload period. We eventually settled on about 60 seconds, which ended up accounting for almost all the end-to-end latency from message generation to our monitoring console.
Most people probably think 60 seconds is not too bad. But some of the awesome people on our team managed to squeeze all the other pipeline phases down to tens of milliseconds in total. So the remaining 60 seconds (technically: anywhere from 0 to 60 seconds after a message was produced) was kinda embarrassing. Streaming live from device to server would be better.
The log receiver
So okay, we're uploading the logs from client to some kind of server. What does the server do?
This part is both the easiest and the most reliability-critical. The job is this: receive an HTTP POST request, write the POST data to a file, and return HTTP 200 OK. Anybody who has any server-side experience at all can write this in their preferred language in about 10 minutes.
We intentionally want to make this phase as absolutely simplistic as possible. This is the phase that accepts logs from the limited-size kmsg buffer on the client and puts them somewhere persistent. It's nice to have real-time alerts, but if I have to choose between somewhat delayed alerts or randomly losing log messages when things get ugly, I'll have to accept the delayed alerts. Don't lose log messages! You'll regret it.
The best way to not lose messages is to minimize the work done by your log receiver. So we did. It receives the uploaded log file chunk and appends it to a file, and that's it. The "file" is actually in a cloud storage system that's more-or-less like S3. When I explained this to someone, they asked why we didn't put it in a Bigtable-like thing or some other database, because isn't a filesystem kinda cheesy? No, it's not cheesy, it's simple. Simple things don't break. Our friends on the "let's use structured events to make metrics" team streamed those events straight into a database, and it broke all the time, because databases have configuration options and you inevitably set those options wrong, and it'll fall over under heavy load, and you won't find out until you're right in the middle of an emergency and you really want to see those logs. Or events.
Of course, the file storage service we used was encrypted-at-rest, heavily audited, and auto-deleted files after N days. When you're a megacorporation, you have whole teams of people dedicated to making sure you don't screw this up. They will find you. Best not to annoy them.
We had to add one extra feature, which was authentication. It's not okay for random people on the Internet to be able to impersonate your devices and spam your logs - at least without putting some work into it. For device authentication, we used the rarely-used HTTP client-side certificates option and the devcert program (linked above) so that the client and server could mutually authenticate each other. The server didn't check the certificates against a certification authority (CA), like web clients usually do; instead, it had a database with a whitelist of exactly which certs we're allowing today. So in case someone stole a device cert and started screwing around, we could remove their cert from the whitelist and not worry about CRL bugs and latencies and whatnot.
Unfortunately, because our log receiver was an internal app relying on internal infrastructure, it wasn't open sourced. But there really wasn't much there, honest. The first one was written in maybe 150 lines of python, and the replacement was rewritten in slightly more lines of Go. No problem.
Retries and floods
Of course, things don't always go smoothly. If you're an ISP, the least easy thing is dealing with cases where a whole neighbourhood gets disconnected, either because of a power loss or because someone cut the fiber Internet feed to the neighbourhood.
Now, disconnections are not such a big deal for logs processing - you don't have any. But reconnection is a really big deal. Now you have tens or hundreds of thousands of your devices coming back online at once, and a) they have accumulated a lot more log messages than they usually do, since they couldn't upload them, and b) they all want to talk to your server at the same time. Uh oh.
Luckily, our system was designed carefully (uh... eventually it was), so it could handle these situations pretty smoothly:
The log uploader uses a backoff timer so that if it's been trying to upload for a while, it uploads less often. (However, the backoff timer was limited to no more than the usual inter-upload interval. I don't know why more people don't do this. It's rather silly for your system to wait longer between uploads in a failure situation than it would in a success situation. This is especially true with logs, where when things come back online, you want a status update now. And clearly your servers have enough capacity to handle uploads at the usual rate, because they usually don't crash. Sorry if I sound defensive here, but I had to have this argument a few times with a few SREs. I understand why limiting the backoff period isn't always the right move. It's the right move here.)
Less obviously, even under normal conditions, the log uploader uses a randomized interval between uploads. This avoids traffic spikes where, after the Internet comes back online, everybody uploads again exactly 60 seconds later, and so on.
The log upload client understands the idea that the server can't accept its request right now. It has to, anyway, because if the Internet goes down, there's no server. So it treats server errors exactly like it treats lack of connectivity. And luckily, log uploading is not really an "interactive" priority task, so it's okay to sacrifice latency when things get bad. Users won't notice. And apparently our network is down, so the admins already noticed.
/dev/kmsgbuffer was configured for the longest reasonable outage we could expect, so that it wouldn't overflow during "typical" downtime. Of course, there's a judgement call here. But the truth is, if you're having system-wide downtime, what the individual devices were doing during that downtime is not usually what you care about. So you only need to handle, say, the 90th percentile of downtime. Safely ignore the black swans for once.
The log receiver aggressively rejects requests that come faster than its ability to write files to disk. Since the clients know how to retry with a delay, this allows us to smooth out bursty traffic without needing to either over-provision the servers or lose log messages.
(Pro tip, learned the hard way: if you're writing a log receiver in Go, don't do the obvious thing and fire off a goroutine for every incoming request. You'll run out of memory. Define a maximum number of threads you're willing to handle at once, and limit your request handling to that. It's okay to set this value low, just to be safe: remember, the uploader clients will come back later.)
Okay! Now our (unstructured) logs from all our 100,000 devices are sitting safely in a big distributed filesystem. We have a little load-balanced, multi-homed cluster of log receivers accepting the uploads, and they're so simple that they should pretty much never die, and even if they do because we did something dumb (treacherous, treacherous goroutines!), the clients will try again.
What might not be obvious is this: our reliability, persistence, and scaling problems are solved. Or rather, as long as we have enough log receiver instances to handle all our devices, and enough disk quota to store all our logs, we will never again lose a log message.
That means the rest of our pipeline can be best-effort, complicated, and frequently exploding. And that's a good thing, because we're going to start using more off-the-shelf stuff, we're going to let random developers reconfigure the filtering rules, and we're not going to bother to configure it with any redundancy.
Grinding the logs
The next step is to take our unstructured logs and try to understand them. In other words, we want to add some structure. Basically we want to look for lines that are "interesting" and parse out the "interesting" data and produce a stream of events, each with a set of labels describing what categories they apply to.
Note that, other than this phase, there is little difference between how you'd design a structured event reporting pipeline and a log pipeline. You still need to collect the events. You still (if you're like me) need to persist your events across kernel panics. You still need to retry uploading them if your network gets partitioned. You still need the receivers to handle overloading, burstiness, and retries. You still would like to stream them (if your infrastructure can handle it) rather than uploading every 60 seconds. You still want to be able to handle a high volume of them. You're just uploading a structured blob instead of an unstructured blob.
Okay. Fine. If you want to upload structured blobs, go for it. It's just an HTTP POST that appends to a file. Nobody's stopping you. Just please try to follow my advice when designing the parts of the pipeline before and after this phase, because otherwise I guarantee you'll be sad eventually.
Anyway, if you're staying with me, now we have to parse our unstructured logs. What's really cool - what makes this a killer design compared to starting with structured events in the first place - is that we can, at any time, change our minds about how to parse the logs, without redeploying all the software that produces them.
This turns out to be amazingly handy. It's so amazingly handy that nobody believes me. Even I didn't believe me until I experienced it; I was sure, in the beginning, that the unstructured logs were only temporary and we'd initially use them to figure out what structured events we wanted to record, and then modify the software to send those, then phase out the logs over time. This never happened. We never settled down. Every week, or at least every month, there was some new problem which the existing "structured" events weren't configured to catch, but which, upon investigating, we realized we could diagnose and measure from the existing log message stream. And so we did!
Now, I have to put this in perspective. Someone probably told you that log messages are too slow, or too big, or too hard to read, or too hard to use, or you should use them while debugging and then delete them. All those people were living in the past and they didn't have a fancy log pipeline. Computers are really, really fast now. Storage is really, really cheap.
So we let it all out. Our devices produced an average of 50 MB of (uncompressed) logs per day, each. For the baseline 100,000 devices that we discussed above, that's about 5TB of logs per day. Ignoring compression, how much does it cost to store, say, 60 days of logs in S3 at 5TB per day? "Who cares," that's how much. You're amortizing it over 100,000 devices. Heck, a lot of those devices were DVRs, each with 2TB of storage. With 100,000 DVRs, that's 200,000 TB of storage. Another 300 is literally a rounding error (like, smaller than if I can't remember if it's really 2TB or 2TiB or what).
Our systems barfed up logs vigorously and continuously, like a non-drunken non-sailor with seasickness. And it was beautiful.
(By the way, now would be a good time to mention some things we didn't log: personally identifiable information or information about people's Internet usage habits. These were diagnostic logs for running the network and detecting hardware/software failures. We didn't track what you did with the network. That was an intentional decision from day 1.)
(Also, this is why I think all those log processing services are so badly overpriced. I wanna store 50 MB per device, for lots of devices. I need to pay S3 rates for that, not a million dollars a gigabyte. If I have to overpay for storage, I'll have to start writing fewer logs. I love my logs. I need my logs. I know you're just storing it in S3 anyway. You probably get a volume discount! Let's be realistic.)
But the grinding, though
Oh right. So the big box labeled "Grinder" in my diagram was, in fact, just one single virtual machine, for a long time. It lasted like that for much longer than we expected.
Whoa, how is that possible, you ask?
Well, at 5TB per day per 100,000 devices, that's an average of 57 MBytes per second. And remember, burstiness has already been absorbed by our carefully written log receivers and clients, so we'll just grind these logs as fast as they arrive or as fast as we can, and if there are fluctuations, they'll average out. Admittedly, some parts of the day are busier than others. Let's say 80 MBytes per second at peak.
80 MBytes per second? My laptop can do that on its spinning disk. I don't even need an SSD! 80 MBytes per second is a toy.
And of course, it's not just one spinning disk. The data itself is stored on some fancy heavily-engineered distributed filesystem that I didn't have to design. Assuming there are no, er, colossal, failures in provisioning (no comment), there's no reason we shouldn't be able to read files at a rate that saturates the network interface available to our machine. Surely that's at least 10 Gbps (~1 GByte/sec) nowadays, which is 12.5 of those. 1.25 million devices, all processed by a single grinder.
Of course you'll probably need to use a few CPU cores. And the more work you do per log entry, the slower it'll get. But these estimates aren't too far off what we could handle.
And yeah, sometimes that VM gets randomly killed by the cluster's Star Trek-esque hive mind for no reason. It doesn't matter, because the input data was already persisted by the log receivers. Just start a new grinder and pick up where you left off. You'll have to be able to handle process restarts no matter what. And that's a lot easier than trying to make a distributed system you didn't need.
As for what the grinder actually does? Anything you want. But it's basically the "map" phase in a mapreduce. It reads the data in one side, does some stuff to it, and writes out postprocessed stuff on the other side. Use your imagination. And if you want to write more kinds of mappers, you can run them, either alongside the original Grinder or downstream from it.
Our Grinder mostly just ran regexes and put out structures (technically protobufs) that were basically sets of key-value pairs.
(For some reason, when I search the Internet for "streaming mapreduce," I don't get programs that do this real-time processing of lots of files as they get written. Instead, I seem to get batch-oriented mapreduce clones that happen to read from stdin, which is a stream. I guess. But... well, now you've wasted some perfectly good words that could have meant something. So okay, too bad, it's a Grinder. Sue me.)
Reducers and Indexers
Once you have a bunch of structured events... well, I'm not going to explain that in a lot of detail, because it's been written about a lot.
You probably want to aggregate them a bit - eg. to count up reboots across multiple devices, rather than storing each event for each device separately - and dump them into a time-series database. Perhaps you want to save and postprocess the results in a monitoring system named after Queen Elizabeth or her pet butterfly. Whatever. Plug in your favourite.
What you probably think you want to do, but it turns out you rarely need, is full-text indexing. People just don't grep the logs across 100,000 devices all that often. I mean, it's kinda nice to have. But it doesn't have to be instantaneous. You can plug in your favourite full text indexer if you like. But most of the time, just an occasional big parallel grep (perhaps using your favourite mapreduce clone or something more modern... or possibly just using grep) of a subset of the logs is sufficient.
(If you don't have too many devices, even a serial grep can be fine. Remember, a decent cloud computer should be able to read through ~1 GByte/sec, no problem. How much are you paying for someone to run some bloaty full-text indexer on all your logs, to save a few milliseconds per grep?)
I mean, run a full text indexer if you want. The files are right there. Don't let me stop you.
On the other hand, being able to retrieve the exact series of logs - let's call it the "narrative" - from a particular time period across a subset of devices turns out to be super useful. A mini-indexer that just remembers which logs from which devices ended up in which files at which offsets is nice to have. Someone else on our team built one of those eventually (once we grew so much that our parallel grep started taking minutes instead of seconds), and it was very nice.
And then you can build your dashboards
Once you've reduced, aggregated, and indexed your events into your favourite output files and databases, you can read those databases to build very fast-running dashboards. They're fast because the data has been preprocessed in mostly-real time.
As I mentioned above, we had our pipeline reading the input files as fast as they could come in, so the receive+grind+reduce+index phase only took a few tens of milliseconds. If your pipeline isn't that fast, ask somebody why. I bet their program is written in java and/or has a lot of sleep() statements or batch cron jobs with intervals measured in minutes.
Again here, I'm not going to recommend a dashboard tool. There are millions of articles and blog posts about that. Pick one, or many.
Please, please, steal these ideas. Make your log and event processing as stable as our small team made our log processing. Don't fight over structured vs unstructured; if you can't agree, just log them both.
Don't put up with weird lags and limits in your infrastructure. We made 50MB/day/device work for a lot of devices, and real-time mapreduced them all on a single VM. If we can do that, then you can make it work for a few hundreds, or a few thousands, of container instances. Don't let anyone tell you you can't. Do the math: of course you can.
Eventually our team's log processing system evolved to become the primary monitoring and alerting infrastructure for our ISP. Rather than alerting on behaviour of individual core routers, it turned out that the end-to-end behaviour observed by devices in the field were a better way to detect virtually any problem. Alert on symptoms, not causes, as the SREs like to say. Who has the symptoms? End users.
We had our devices ping different internal servers periodically and log the round trip times; in aggregate, we had an amazing view of overloading, packet loss, bufferbloat, and poor backbone routing decisions, across the entire fleet, across every port of every switch. We could tell which was better, IPv4 or IPv6. (It's always IPv4. Almost everyone spends more time optimizing their IPv4 routes and peering. Sorry, but it's true.)
We detected some weird configuration problems with the DNS servers in one city by comparing the 90th percentile latency of DNS lookups across all the devices in every city.
We diagnosed a manufacturing defect in a particular batch of devices, just based on their CPU temperature curves and fan speeds.
We worked with our CPU vendor to find and work around a bug in their cache coherency, because we spotted a kernel panic that would happen randomly every 10,000 CPU-hours, but for every 100,000 devices, that's still 10 times per hour of potential clues.
...and it sure was good for detecting power failures.
Anyway. Log more stuff. Collect those logs. Let it flow. Trust me.
Update 2019-04-26: Based on a lot of positive feedback from people who read this blog post, I ended up starting a company that might be able to help you with your logs problems. We're building pipelines that are very similar to what's described here.
Aha! Okay, for some reason this article is trending again, and I'd better provide an update on my update. We did implement parts of this design for use in our core product, which is now quite distinct from logs processing.
After investigating the "logs" market last year, we decided not to commercialize a logs processing service. The reason is that the characteristics we want our design to have: cheap, lightweight, simple, fast, and reliable - are all things you would expect from the low-cost provider in a market. The "logs processing" space is crowded with a lot of premium products that are fancy, feature-filled, etc, and reliable too, and thus able to charge a lot of money.
Instead, we built a minimalistic version of the above design for our internal use, collecting distributed telemetry about Tailscale connection success rates to help debug the network. Big companies can also use it to feed into their IDS and SIEM systems.
We considered open sourcing the logs services we built (since open source is where attributes like cheap, lightweight, etc tend to flourish) but we can't afford the support overhead right now for a product that is at best tangential to our main focus. Sorry! Hopefully someday.