Stuffing the stuff

without getting stuffy
Everything here is my personal opinion. I do not speak for my employer.
February 2009
March 2009

2009-02-22 »

Programmers and Sadomasochism

Quick! Take a look at the following snippet of HTML, and tell me what's wrong with it.

    <div align=right>Hello, world!</div>

If you said, "It's completely invalid and unparseable because you forgot the quotes around 'right'!" then you're... hold on, wait a second.(1)

Unparseable? Every web browser in history can parse that tag. Non-conforming XML, yes, but unparseable? Hardly. There are millions of web pages where people forgot (or didn't bother) to quote the values of their attributes. And because those pages exist, everyone who parses HTML has to support that feature. So they do.

That's the difference between HTML and XML. With HTML, programmers answer to end users. And the end users are very clear: if your browser can't parse the HTML that every other browser can parse, then I'm switching to another browser.

XML is different. XML doesn't have any "end users." The only people who use XML parsers are other programmers. And programmers, apparently, aren't like normal people.

Real, commercial XML parsers, if fed a tag like the one above, would give me an error message. They would tell me to go back and fix my input. Apparently, I'm a bad person for even suggesting that we should try parsing that file.

Now, as it happens, the XML parser I wrote in 500 lines of Pascal a few days ago would not reject this input. It would just pretend the quotes were there. In fact, if my program parses the file and then you ask it to print the XML back out, it'll helpfully add the missing quotes in for you.

Let's phrase this another way. The painstakingly written, professional-grade, "high quality" XML parser, when presented this input that I received from some random web site, will stab me in the back and make me go do unspecified things to try to correct the problem by hand. Avery's cheeseball broken XML parser, which certainly doesn't claim to be good or complete, would parse the input just fine.(2)

This, an innocent bystander might think, would imply that my parser is the better one to use. But it's not, you see, because, as cdfrey points out:

    Interoperability is hard. Anyone can write their own parsers. And everyone has. That's why the monstrosity called XML was invented in the first place. It all starts with someone writing a quick and dirty parser, thereby creating their own unique file format whether they realize it or not.(3) And since they probably don't realize it, they don't document it. So the next person comes along, and either has to reverse engineer the parser code, or worse, guess at the format from existing examples.

Got it? By creating a permissive parser that just corrects simple input errors, I've made things worse for everybody else. I would make the world a better place if my parser would just reject bad XML, and then everyone would be forced to produce files with valid XML, and that would make life easier for people like me! Don't you see?

Well, no. There's a fallacy here. Let's look at our options:

Option 1: Bob produces invalid XML file and gives it to Avery. Avery uses professional-grade fancy pants parser, which rejects it. Avery is sad, but knows what to do: he phones up Bob and asks him to fix his XML producer. Bob is actually a guy in Croatia who hired a contractor five years ago to write his web site for him, and doesn't know where to find that contractor anymore, but because he knows it's better for the world, he finds a new contractor who fixes the output of his web site. Three weeks later, Bob sends a new XML file to Avery, who is now able to parse it.

Option 2: Bob produces invalid XML file and gives it to Avery. Avery's permissive parser that he wrote in an afternoon reads it just fine. Avery goes on with his work, and Bob doesn't need to pay a contractor.

Option 3: Bob produces valid XML in the first place, dammit, because he made sure his contractor ran his program's output successfully through a validator before he accepted the work as complete. Avery parses it easily, and is happy.

Now, obviously option 3 is preferable. The problem is, it's also not a real option. Bob already screwed up, and he's producing invalid XML. Avery has received the invalid data, and he's got to do something with it. Only options 1 and 2 are real.

Now, XML purists are telling me that I should pursue option 1. My question is: why? Option 1 keeps me from getting my work done. Then I have to go bother Bob, who wouldn't care except that I'm so obnoxious. And now he has to pay a contractor to fix it. The only reason I would take option 1 is if I enjoy pain, or inflicting pain on others. Apparently, lots of programmers out there enjoy pain.

Meanwhile, option 2 - the one that everybody frowns upon - is painless for everyone.

The usual argument for option 1 is that if enough people do it, then eventually people will Just Start Producing Valid XML Dammit, and you won't ever have this problem again. But here's the thing: we have a world full of people trying option 1. XML is all about the people who try option 1. And still Bob is out there, and he's still producing invalid XML, and I, not Bob, am still the one getting stabbed in the back by your lametarded strict XML parsers. Strict receiver-side validation doesn't actually improve interoperability, ever.

As programmers, we've actually known all this for a long time. It's called Postel's Law, in honour of Jon Postel, one of the inventors of the Internet Protocol. "Be liberal in what you accept, and conservative in what you send."

The whole Internet runs on this principle. That's why HTML is the way it is. It's why Windows can talk to Linux, even though both have lots of bugs.

I have my own way of phrasing Postel's law: "It takes two to miscommunicate."

As long as either side of any transaction is following Postel's law - either the sender strictly checks his XML for conformance or the receiver doesn't - the transaction will be a success. If both sides disregard his advice, that's when you have a problem.

Yes, Bob should have checked his data before he sent it to me. He didn't. That makes him a bad person - or at least an imperfect one. But if I refuse the data just because it's not perfect, then that doesn't solve the problem. It just makes me a bad person too.

Footnotes

(1) People who, instead, were going to complain that I should avoid the obsolete HTML 'align' attribute and switch to CSS would be the subject of a completely different rant.

(2) Note that there's plenty of perfectly valid XML that my cheeseball incomplete XML parser wouldn't parse, because it's cheeseball and incomplete. The ideal parser would be permissive and complete. But if I have to choose one or the other, I'm going to choose the one that actually parses the files I got from the customer. Wouldn't you?

(3) If you didn't catch it, the precise error in cdfrey's argument is this: You don't create a new file format by parsing wrong. You create a new file format by producing wrong. Ironically, a lot of people use strict professional-grade XML parsers but seem to believe that producing XML is easy.

Side note

By the way, even strict XML validation doesn't actually mean the receiver will understand your data correctly. It's semantics vs. syntax. You can easily write a perfectly valid HTML4-Strict compliant document and have it render differently in different browsers. Why? Because they all implement the CSS differently. Web browser interoperability problems actually have nothing to do with HTML parsing; it's all about the rendering, which is totally unrelated. It's amazing to me how many people think strict HTML validation will actually solve any real-world problems.

Update (2009/02/22): cdfrey responds to my response. Then he wrote an interesting essay about Postel's law, too. See also the ycombinator discussion of this article, and the reddit discussion.

::li

Why would you follow me on twitter? Use RSS.
apenwarr-on-gmail.com