Programming inside the URL string

Programming inside the URL string

Afterquery is hard to explain to people, possibly because it actually combines several pretty unusual concepts. A single unusual concept is bad enough, but several at once is likely to just leave you scratching your head. With that in mind, here's just one unusual concept: a programming language designed for URLs. Not a language for manipulating URLs; the language is the URL.

In afterquery, if I write this:

  http://afterquery.appspot.com?url=example.json&group=a,b,c&group=a,b

Then (assuming a..d are strings and e is numeric) it's about equivalent to this SQL:

  select a, b, count(c) as c, sum(d) as d, sum(e) as e
  from (
    select a, b, c, count(d) as d, sum(e) as e
    from example
    group by a, b, c
  )
  group by a, b

This gives, for each combination of a and b, the number of distinct values of c, the number of distinct combinations of (c,d), and the sum of e - each a slightly different useful aggregate.

Some people have used SQL for so long now that they don't remember anymore exactly how redundant the language is. The above SQL mentions 'e' 4 times! The afterquery code doesn't mention it at all; if you say you want to group by a, b, and c, then the assumption is that e (one of the columns in the initial dataset) is an aggregate value field and doesn't need to be mentioned unless you want an unusual aggregate. (The default aggregate is count() for non-numeric fields, and sum() for numeric fields.)

The sequence of clauses in SQL is also problematic because it's both arbitrary and restrictive. It doesn't reflect the order of operations; in reality, among other things, "from" obviously comes before "select." (Incidentally, LINQ in C# puts "from" first, for that reason.) Worse, "group by" is highly related to the select operation - whether or not something must be aggregated depends on whether it's in the "group by" clause or not, and yet "group by" is way down in the query while "select" is at the top. And worst of all, the entire "group by" clause is actually redundant: you could calculate the "group by" clause entirely by looking at which fields in the select contain aggregation functions and which don't. You know this is true, because if your select clause doesn't use aggregation functions for exactly the right fields (no more, and no less), then your query will abort with an error.

There's a lot of danger in trying to make a programming language that reads too much like English. Maybe it can be done, but you have to be tasteful about it. SQL is not tasteful; it's designed with the same mindset that produced COBOL. (The joke is that an object oriented COBOL would be called "ADD 1 TO COBOL GIVING COBOL", which is the COBOL equivalent of C++ or [incr tcl]. That line is the actual code for incrementing a number in COBOL. Compare with "SELECT sql+1 FROM sql WHERE sql=0 GROUP BY sql".)

Still though, despite SQL's insulting repetitiveness, it has one even more depressing advantage: it's still the most concise way to express a database query of moderate complexity. C# LINQ is the closest thing we have to a competitor - it was specifically designed to try to replace SQL because coding database queries in an imperative language is too messy - and our above nested grouping looks something like this (based on their sample code - I haven't used LINQ in a while and haven't tested it):

  var tmp = 
    from x in example
    group x by x.a, x.b, x.c into g
    select new {
      a = g.Key[0], b = g.Key[1], c = g.Key[2],
      d = g.Count(p => p.d),
      e = g.Sum(p => p.e)
    };
  var result =
    from r in tmp
    group r by r.a, r.b into g
    select new {
      a = g.Key[0], b = g.Key[1],
      c = g.Count(p => p.c);
      d = g.Sum(p => p.d);
      e = g.Sum(p => p.e);
    };

That mentions 'e' 4 times, like the SQL does, but also introduces new temporary variables x, g, r, and p. g is itself a magical complex object that includes a "Key" member we have to use, among other things. Now, LINQ is also much more powerful than SQL (you can use it for things other than databases) and in many cases it can translate itself into SQL (so it can query databases efficiently even though you wrote it imperatively), so it has redeeming features. But it definitely hasn't shortened our basic SQL query.

There are also ORMs (Object-Relational Mappers) out there, like ActiveRecord for example, which can be more concise than plain SQL. But they can't represent complicated concepts all the way through to the database. Generally you end up downloading either all the data and filtering it client-side, or one record at a time, leading to high latency, or splitting into two operations, one to get a list of keys, and another to fetch a bunch of keys in parallel. A proper "query language" like SQL doesn't require that kind of hand optimizing.

Somehow SQL has held on, since 1974, as still the best way to do that particular thing it does. You've got to give them some credit for that.

Imperative or not?

SQL is a programming language, although a restricted one. In my mind, I like to designate programming languages as one of three types: imperative, functional, or declarative. If you read the official definitions, functional languages are technically a subset of declarative ones, but I usually find those definitions to be more misleading than useful. HTML is a declarative language; LISP is a functional language. They're different, even if they do share underlying mathematical concepts.

Other declarative non-functional languages include CSS, XML, XQuery, JSON, regular expressions, Prolog, Excel formulas, and SQL. We can observe that declarative languages are a pretty weird bunch. They tend to share a few attributes: that it's hard to predict what the CPU will actually do (so performance depends on external knowledge of how a particular interpreter works), that expressing data works well and expressing commands works badly (I'm talking to you, ANT and MSBuild), and that explicit conditionals always look funny if they're supported at all.

These problems are very clear in the case of SQL, after using it for only a little while. The hardest part of SQL to understand is its interaction with the so-called "query optimizer" which decides what database indexes to use, and most importantly, when to use an index and when to use a full table scan. The person writing a SQL query will generally have a really clear idea when a table scan is appropriate (that is, usually for small amounts of data) and when it isn't, but there's no way in SQL to express that; you just have to trust the optimizer. It'll generally work, but sometimes it'll go crazy, and you'll have no idea what just happened to make your query go 100x slower. SQL suffers from a definition of "correctness" that doesn't include performance.

Declarative (and functional) languages also share a major advantage over imperative languages, which is that it's easy to manipulate and rearrange the program without losing its meaning, exactly because the implementation is left unspecified. For example, you should be able to convert an arbitrary SQL query to a map/reduce operation. Declarative and functional languages make things like parallelism easier to implement and reason about. (The "map" and "reduce" operations in map/reduce come from functional programming, of course.)

Let's look at afterquery again. The above afterquery, which matches the functionality of the above SQL query, can be broken down like this:

  url=example.json
  group=a,b,c
  group=a,b

Is it imperative, declarative, or functional? I've been thinking about it for a couple of weeks, and I don't really know. The fact that it can be mapped directly to a (declarative) SQL query suggests its declarative nature. But having written the implementation, I also know that how it works is very clearly imperative. It ends up translating the query to almost exactly this in javascript:

  grid = fetchurl('example.json')
  grid = groupby(grid, ['a', 'b', 'c']);
  grid = groupby(grid, ['a', 'b']);

That is, it's applying a series of changes to a single data grid (global shared state - the only state) in memory. You might notice that all the above imperative commands, though, have the same structure, so you could write the same thing functionally:

  groupby(
    groupby(
      fetchurl('example.json'),
      ['a', 'b', 'c']),
    ['a', 'b'])

Most imperative programs cannot be so easily mapped directly onto pure functional notation. So that leads me to my current theory, which is that afterquery really is an imperative language, but it happens to be one so weak and helpless that it can't express complex concepts (like loops) that would make it incompatible with functional/declarative interpretation. It's not turing-complete.

Nevertheless, the imperative-looking representation makes it easier to write and debug queries, and to estimate their performance, than declarative-looking SQL. In theory, a sequence of groups and pivots could be rearranged by a query optimizer to run faster or in parallel, but in practice, afterquery's goal of working on small amounts of data (unlike SQL, which is intended to run on large amounts) makes an optimizer pretty unnecessary, so we can just execute the program as a series of transformations in the order they're given.

An imperative language in the URL string

Traditionally, URLs are about as declarative as things come. At one level, they are just opaque string parameters to one of a very few functions (GET, POST, PUT, etc), some of which take an even bigger opaque string (the POST data) as a second parameter.

One level deeper, we know that URL strings contain certain well-known traditional components: protocol, hostname, path, query string (?whatever), anchor string (#whatever). Inside the query string (and sometimes the anchor string), we have key=value pairs separated by &, with special characters in the values traditionally encoded in a particular way (%-encoding). HTTP specifies the components, but it doesn't have to say anything about the structure of the query string, its key=value pairs, the & signs, or its %-encoding.

Afterquery uses the same key=value pairs as any query string, but while most apps treat them as a declarative dictionary of essentially unordered key=value pairs - with the only ordering being multiple instances of the same key - afterquery also depends on the ordering of keys. &group=a,b,c&pivot=a,b;c is a totally different command from the other way around.

Another huge constraint on a URL is its length: there is no predefined maximum length, but many browsers limit it, maybe to 1024 characters or so. Thus, if you want to keep the program stateless (no state stored between executions, and no programs stored on the server), it's important to keep things concise, so you can say what you need to say inside a single URL.

Luckily, sometimes the best art comes from constraints! Where perl-style punctuation-happy languages with lots of implicit arguments are nowadays unfashionable, we have no choice but to adopt them anyway if we want things to fit inside a URL. Our example afterquery is actually equivalent to:

  url=http://afterquery.appspot.com/example.json
  group=a,b,c;count(d),sum(e)
  group=a,b;count(c),sum(d),sum(e)

The URL has a default path based on the script location (as URLs always do), and the semicolon in group= statements separates the keys from the values. That leads to the very confusing at first, but very convenient thereafter, distinction between

  group=a,b,c

and

  group=a,b,c;

Which mean very different things: the first means "keep all the other columns, and guess how to aggregate their values" while the second means "throw away all the other columns."

Like a regex, it's totally unfriendly to an initial observer, where an SQL statement (or COBOL program) might make a beginner feel comfortable that they can read what's going on. But my theory is: once you've got the idea, SQL is just tedious, but afterquery is more fun.

And unlike SQL, a nontrivial program will fit in an URL string.

2012-12-18 »