100% Pure

accept no imitations

Everything here is my personal opinion. I do not speak for my employer.
Back: November 2012
Next: January 2013

2012-12-10 »

Preview: Chicken Inflation

* Source: Statistics Canada

December 10, 2012 12:04

2012-12-11 »

Let's think about *small* data for a change

There's a lot of buzz lately about "big data" - huge, Internet scale databases that take a long time to query, but are awesome, I guess, because of how much testosterone you need to have in order to possess one.

Let's leave aside the question of exactly how big big data needs to be. I've heard of people talking about databases as small as a gigabyte as being "big," I guess because downloading it would take a few minutes, and maybe 'grep' isn't the best way to query it. Other people would say a terabyte is big, or a petabyte.

I don't really care. All I know is, beyond a certain threshold that depends on the current state of computer technology, as soon as your data is "big," queries are slow. Maybe a few seconds, which isn't so bad, but maybe a few *minutes*. And when things get slow, you start having to mess with separate "data warehouse" servers so that your big analytical queries don't bring down your whole database. And managing it all becomes a full-time job for someone, or many people, or a whole company.

I happen to work for an employer that does that sort of thing a lot. And to be honest, I find it pretty boring. Perhaps I have a testosterone deficiency. It's not so much the bigness that bothers me: it's the waiting. I like my compile-test-debug cycle to be on the order of two seconds, but when SQL or mapreduce gets involved, it's more like two minutes.1

I know, cry for me, right? Two minutes of my life, all gone. But seriously, when you're trying to find trends and aren't quite sure what you're looking for and it takes a dozen tries, those two minutes can add up rapidly, especially when added to the 10 minutes of web browsing or email that inevitably ensues once I get bored waiting for the two minutes.

After worrying about this problem for a long time (years now, I guess), I think I've come up with a decent workaround. The trick is to divide your queries into multiple stages. At each stage, we reduce the total amount of data by a few orders of magnitude, and thus greatly decrease the cost of debugging a complex query.

Originally, I might have tried to make a single SQL query that goes from, say, a terabyte of data, down to 10 rows, and made a bar chart. Which 10 rows? Well, it takes a few tries to figure that out, or maybe a few hundred tries. But it's much faster if we set the initial goal instead to, say, only pull out the gigabyte I need from that terabyte. And then I can query a gigabyte instead. From there, I can reduce it to a megabyte, perhaps, which is easy enough to process in RAM without any kind of index or complexity or optimization.

That last part is what I want to talk to you about, because I've been working on a tool to do it. I call it afterquery, tagline: "The real fun starts after the Serious Analysts have gone home." Serious people write mapreduces and SQL queries. Serious people hire statisticians. Serious people have so much data that asking questions about the data requires a coffee break, but they get paid so much they don't have to care.

Afterquery is the opposite of Serious. It downloads the whole dataset into RAM on your web browser and processes it in javascript.

But what it lacks in seriousness, it makes up for in quick turnaround. Here's what it produces from 1582 rows of data I got from Statistics Canada:2

Those 1582 rows are the breakdown of Canada's 2011 GDP by province and NAICS (industry type) code. I used Afterquery to sum up the values across all provinces and make a pie chart by NAICS category group. Could I have just queried Statscan and gotten those summarized results in, say, 15 rows instead of 1582? Sure. But their web site is a bit clunky, and I have to redownload a CSV file every time, and it doesn't produce pretty graphs. 1582 rows is no problem to grunge through in javascript nowadays. It even works fine on my phone.

The real power, though, is that you can take the same data and present it in many ways. Here's another chart using exactly the same input data:

And here's a treemap chart. Treemaps are generally pretty hard to produce, because of the weird way you have to format the data. But afterquery helps reformat "normal" data into helpful treemap-compatible data. I really like this visualization; see if you can find who spends the most on mining, or who spends proportionally more on agriculture.

(Tip: click on one of the boxes in the treemap for a more detailed view. You can use that to find out what specific *kinds* of mining. Or click 'Edit' in the top-right of the chart box to customize the chart. For example, remove "GEO," from the treegroup to see overall results for Canada rather than by province.)

Some neat things about afterquery

  • It's completely stateless, configured entirely using URL parameters.

  • It just grabs a file using jsonp and applies the given transforms.

  • It can show either data tables or charts, including gviz charts, timelines, and dygraphs.

  • You can do multiple layers of "group by" operations very quickly. How many times have you wanted to, say, produce two columns: the total number of purchases, and the number of customers making purchases? To do that in SQL you'd use subqueries, perhaps, like this:
       select region, count(customer), sum(nsales), sum(value)
         from (
           select region, customer, count(*) nsales, sum(value) value
             from sales
             group by region, customer
         ) t
         group by region
    

    With afterquery you just write two successive "group" transformations:

       &group=region,customer;count(*),sum(value)
       &group=region
    
    and it does the rest.

  • Anything that can be "group by" can be treegrouped, allowing you to easily draw a treemap chart.

  • Because we don't care about performance optimization - optimization is for Serious Analysts! - we can make tradeoffs for consistency instead. So you can apply transformations, filters, limits, and groupings in whatever order you want (instead of SQL's really strict sequencing requirements). And 'order by' operations aren't scrambled by doing 'group by', and we can have handy aggregate functions like first(x) (the first matching x in each group), last(x) (last matching x in each group), and cat(x) (concatenation of all the x in each group). You can imagine that 'order by x' followed by 'first(x)' could be a very useful combination.

  • A simple "pivot" operation that lets you turn multi-level groups into column headers, allowing you to easily produce interesting multi-line or multi-bar charts, or seriously dense (but useful) grids of numbers.

  • Because it's stateless, every afterquery chart is user-editable; just click the 'Edit' link in the upper-right of any chart or table, adjust the transformations, and see the new result immediately. Then cut-and-paste the URL to someone in IM or email, or drop it in an iframe on a web page, and you're done.

  • Because it runs in your web browser and uses jsonp, it uses your existing browser login credentials and cookies to get access to your data.3 And because it's purely client side javascript, your data never gets uploaded to a server.
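
To make that two-stage grouping concrete, here's a toy sketch in plain javascript of what two successive group passes do to a small sales table. The groupBy helper and its aggregate functions are hypothetical illustrations, not afterquery's actual code:

```javascript
// Toy sketch of two successive "group by" passes, as in
//   &group=region,customer;count(*),sum(value)
//   &group=region
// groupBy() and its aggregate helpers are hypothetical illustrations,
// not afterquery's actual code.
function groupBy(rows, keys, aggs) {
  var buckets = {};   // group key -> output row
  var order = [];     // preserve first-seen group order
  rows.forEach(function(row) {
    var k = keys.map(function(key) { return row[key]; }).join('\0');
    if (!buckets[k]) {
      buckets[k] = {};
      keys.forEach(function(key) { buckets[k][key] = row[key]; });
      for (var col in aggs) buckets[k][col] = 0;
      order.push(k);
    }
    for (var col in aggs) buckets[k][col] += aggs[col](row);
  });
  return order.map(function(k) { return buckets[k]; });
}

// Aggregates as per-row increments: count adds 1, sum adds a column's value.
function count() { return function(row) { return 1; }; }
function sum(col) { return function(row) { return row[col]; }; }

var sales = [
  {region: 'east', customer: 'alice', value: 10},
  {region: 'east', customer: 'alice', value: 20},
  {region: 'east', customer: 'bob',   value: 5},
  {region: 'west', customer: 'carol', value: 7},
];

// Pass 1: one row per (region, customer).
var perCustomer = groupBy(sales, ['region', 'customer'],
                          {nsales: count(), value: sum('value')});
// Pass 2: one row per region; each input row now represents one customer,
// so count() gives the number of customers.
var perRegion = groupBy(perCustomer, ['region'],
                        {customers: count(), nsales: sum('nsales'),
                         value: sum('value')});
// perRegion: [{region: 'east', customers: 2, nsales: 3, value: 35},
//             {region: 'west', customers: 1, nsales: 1, value: 7}]
```

Because pass 2 consumes the already-grouped output of pass 1, "number of customers" falls out of a plain count() with no subquery anywhere.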

Are we having fun yet? You can find afterquery on github or on Google Code. For documentation, look at its built-in help page.

Contributions or suggestions are welcome. You can join the brand new mailing list: afterquery on googlegroups.com.

Footnotes

1 Of course if you're really processing a lot of data, a data warehouse query can be much more than two minutes, but generally it's possible to find a subset to debug your query with that takes less time. Once the debugging is done, you can just run the big query and it doesn't really matter how long it takes.

2 Those 1582 rows are the output of Serious Analysts down at Statistics Canada, who have processed all the numbers from all the tax returns from all the companies in Canada, every year for decades. That's Big Data, all right.

3 Of course, using jsonp creates security concerns of its own. If I have time, I plan to write about those in another post because I believe it's possible to work around those problems and have the best of all worlds... but it's a little tricky. Meanwhile, grep the afterquery source for "oauth" :)

December 11, 2012 01:54

2012-12-18 »

Programming inside the URL string

Afterquery is hard to explain to people, possibly because it actually combines several pretty unusual concepts. A single unusual concept is bad enough, but several at once is likely to just leave you scratching your head. With that in mind, here's just one unusual concept: a programming language designed for URLs. Not a language for manipulating URLs; the language *is* the URL.

In afterquery, if I write this:

  http://afterquery.appspot.com?url=example.json&group=a,b,c&group=a,b

Then (assuming a..d are strings and e is numeric) it's about equivalent to this SQL:

  select a, b, count(c) as c, sum(d) as d, sum(e) as e
  from (
    select a, b, c, count(d) as d, sum(e) as e
    from example
    group by a, b, c
  ) x
  group by a, b

This gives, for each combination of a and b, the number of distinct values of c, the total count of rows that had a d value, and the sum of e - each a slightly different useful aggregate.

Some people have used SQL for so long now that they don't remember anymore exactly how redundant the language is. The above SQL mentions 'e' 4 times! The afterquery code doesn't mention it at all; if you say you want to group by a, b, and c, then the assumption is that e (one of the columns in the initial dataset) is an aggregate value field and doesn't need to be mentioned unless you want an unusual aggregate. (The default aggregate is count() for non-numeric fields, and sum() for numeric fields.)
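
That default rule is simple enough to sketch: peek at each column that isn't a group key and choose sum() for numbers, count() for everything else. This is an illustration of the idea described above, not afterquery's actual source:

```javascript
// Sketch of the default-aggregate guess: for each column that isn't a
// group key, pick sum() for numeric values and count() for anything else.
// (An illustration of the rule described above, not afterquery's source.)
function defaultAggregates(rows, groupKeys) {
  var aggs = {};
  var sample = rows[0] || {};
  for (var col in sample) {
    if (groupKeys.indexOf(col) >= 0) continue;
    aggs[col] = (typeof sample[col] === 'number') ? 'sum' : 'count';
  }
  return aggs;
}

// With columns a..e, where a..d are strings and e is numeric,
// &group=a,b,c implies count(d) and sum(e):
var rows = [{a: 'x', b: 'y', c: 'z', d: 'w', e: 3}];
defaultAggregates(rows, ['a', 'b', 'c']);  // {d: 'count', e: 'sum'}
```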

The sequence of clauses in SQL is also problematic because it's both arbitrary and restrictive. It doesn't reflect the order of operations; in reality, among other things, "from" obviously comes before "select." (Incidentally, LINQ in C# puts "from" first, for that reason.) Worse, "group by" is highly related to the select operation - whether or not something must be aggregated depends on whether it's in the "group by" clause or not, and yet "group by" is way down in the query while "select" is at the top. And worst of all, the entire "group by" clause is actually redundant: you could calculate the "group by" clause entirely by looking at which fields in the select contain aggregation functions and which don't. You know this is true, because if your select clause doesn't use aggregation functions for exactly the right fields (no more, and no less), then your query will abort with an error.

There's a lot of danger in trying to make a programming language that reads too much like English. Maybe it can be done, but you have to be tasteful about it. SQL is not tasteful; it's designed with the same mindset that produced COBOL. (The joke is that an object oriented COBOL would be called "ADD 1 TO COBOL GIVING COBOL", which is the COBOL equivalent of C++ or [incr tcl]. That line is the actual code for incrementing a number in COBOL. Compare with "SELECT sql+1 FROM sql WHERE sql=0 GROUP BY sql".)

Still though, despite SQL's insulting repetitiveness, it has one even more depressing advantage: it's still the most concise way to express a database query of moderate complexity. C# LINQ is the closest thing we have to a competitor - it was specifically designed to try to replace SQL because coding database queries in an imperative language is too messy - and our above nested grouping looks something like this (based on their sample code - I haven't used LINQ in a while and haven't tested it):

  var tmp =
    from x in example
    group x by new { x.a, x.b, x.c } into g
    select new {
      a = g.Key.a, b = g.Key.b, c = g.Key.c,
      d = g.Count(p => p.d != null),
      e = g.Sum(p => p.e)
    };
  var result =
    from r in tmp
    group r by new { r.a, r.b } into g
    select new {
      a = g.Key.a, b = g.Key.b,
      c = g.Count(p => p.c != null),
      d = g.Sum(p => p.d),
      e = g.Sum(p => p.e)
    };

That mentions 'e' 4 times, like the SQL does, but also introduces new temporary variables x, g, r, and p. g is itself a magical complex object that includes a "Key" member we have to use, among other things. Now, LINQ is also much more powerful than SQL (you can use it for things other than databases) and in many cases it can translate itself into SQL (so it can query databases efficiently even though you wrote it imperatively), so it has redeeming features. But it definitely hasn't shortened our basic SQL query.

There are also ORMs (Object-Relational Mappers) out there, like ActiveRecord for example, which can be more concise than plain SQL. But they can't represent complicated concepts all the way through to the database. Generally you end up either downloading all the data and filtering it client-side; or fetching one record at a time, leading to high latency; or splitting the work into two operations, one to get a list of keys and another to fetch a bunch of keys in parallel. A proper "query language" like SQL doesn't require that kind of hand optimizing.

Somehow SQL has held on, since 1974, as still the best way to do that particular thing it does. You've got to give them some credit for that.

Imperative or not?

SQL is a programming language, although a restricted one. In my mind, I like to designate programming languages as one of three types: imperative, functional, or declarative. If you read the official definitions, functional languages are technically a subset of declarative ones, but I usually find those definitions to be more misleading than useful. HTML is a declarative language; LISP is a functional language. They're different, even if they do share underlying mathematical concepts.

Other declarative non-functional languages include CSS, XML, XQuery, JSON, regular expressions, Prolog, Excel formulas, and SQL. We can observe that declarative languages are a pretty weird bunch. They tend to share a few attributes: that it's hard to predict what the CPU will actually do (so performance depends on external knowledge of how a particular interpreter works), that expressing data works well and expressing commands works badly (I'm talking to you, ANT and MSBuild), and that explicit conditionals always look funny if they're supported at all.

These problems are very clear in the case of SQL, after using it for only a little while. The hardest part of SQL to understand is its interaction with the so-called "query optimizer" which decides what database indexes to use, and most importantly, when to use an index and when to use a full table scan. The person writing a SQL query will generally have a really clear idea when a table scan is appropriate (that is, usually for small amounts of data) and when it isn't, but there's no way in SQL to express that; you just have to trust the optimizer. It'll generally work, but sometimes it'll go crazy, and you'll have no idea what just happened to make your query go 100x slower. SQL suffers from a definition of "correctness" that doesn't include performance.

Declarative (and functional) languages also share a major advantage over imperative languages, which is that it's easy to manipulate and rearrange the program without losing its meaning, exactly because the implementation is left unspecified. For example, you should be able to convert an arbitrary SQL query to a map/reduce operation. Declarative and functional languages make things like parallelism easier to implement and reason about. (The "map" and "reduce" operations in map/reduce come from functional programming, of course.)

Let's look at afterquery again. The above afterquery, which matches the functionality of the above SQL query, can be broken down like this:

  url=example.json
  group=a,b,c
  group=a,b

Is it imperative, declarative, or functional? I've been thinking about it for a couple of weeks, and I don't really know. The fact that it can be mapped directly to a (declarative) SQL query suggests its declarative nature. But having written the implementation, I also know that how it works is very clearly imperative. It ends up translating the query to almost exactly this in javascript:

  grid = fetchurl('example.json');
  grid = groupby(grid, ['a', 'b', 'c']);
  grid = groupby(grid, ['a', 'b']);

That is, it's applying a series of changes to a single data grid (global shared state - the only state) in memory. You might notice that all the above imperative commands, though, have the same structure, so you could write the same thing functionally:

  groupby(
    groupby(
      fetchurl('example.json'),
      ['a', 'b', 'c']),
    ['a', 'b'])

Most imperative programs cannot be so easily mapped directly onto pure functional notation. So that leads me to my current theory, which is that afterquery really is an imperative language, but it happens to be one so weak and helpless that it can't express complex concepts (like loops) that would make it incompatible with a functional/declarative interpretation. It's not Turing-complete.

Nevertheless, the imperative-looking representation makes it easier to write and debug queries, and to estimate their performance, than declarative-looking SQL. In theory, a sequence of groups and pivots could be rearranged by a query optimizer to run faster or in parallel, but in practice, afterquery's goal of working on small amounts of data (unlike SQL, which is intended to run on large amounts) makes an optimizer pretty unnecessary, so we can just execute the program as a series of transformations in the order they're given.
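
That claim - that the imperative chain and the functional nesting are the same program - is easy to demonstrate with any grid-in, grid-out transforms at all. A toy example (these transforms are made-up stand-ins, not afterquery's real ones):

```javascript
// Toy demonstration that sequencing and nesting are the same program when
// every step is grid-in, grid-out. These transforms are made-up stand-ins,
// not afterquery's real ones.
function filterPositive(grid) {
  return grid.filter(function(row) { return row.e > 0; });
}
function keepColumns(grid, cols) {
  return grid.map(function(row) {
    var out = {};
    cols.forEach(function(c) { out[c] = row[c]; });
    return out;
  });
}

var data = [{a: 'x', e: 1}, {a: 'y', e: -2}, {a: 'z', e: 3}];

// Imperative: a series of changes to a single grid variable.
var grid = data;
grid = filterPositive(grid);
grid = keepColumns(grid, ['a']);

// Functional: the same steps as nested expressions.
var grid2 = keepColumns(filterPositive(data), ['a']);

// Both yield [{a: 'x'}, {a: 'z'}].
```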

An imperative language in the URL string

Traditionally, URLs are about as declarative as things come. At one level, they are just opaque string parameters to one of a very few functions (GET, POST, PUT, etc), some of which take an even bigger opaque string (the POST data) as a second parameter.

One level deeper, we know that URL strings contain certain well-known traditional components: protocol, hostname, path, query string (?whatever), anchor string (#whatever). Inside the query string (and sometimes the anchor string), we have key=value pairs separated by &, with special characters in the values traditionally encoded in a particular way (%-encoding). The URL standard specifies those components, but it says nothing about the internal structure of the query string; the key=value pairs and & separators are just a convention inherited from HTML forms.

Afterquery uses the same key=value pairs as any query string, but while most apps treat them as a declarative dictionary of essentially unordered key=value pairs - with the only ordering being multiple instances of the same key - afterquery also depends on the ordering of keys. &group=a,b,c&pivot=a,b;c is a totally different command from the other way around.

Another huge constraint on a URL is its length: there is no predefined maximum, but many browsers and servers impose limits - Internet Explorer, famously, caps URLs at 2083 characters. Thus, if you want to keep the program stateless (no state stored between executions, and no programs stored on the server), it's important to keep things concise, so you can say what you need to say inside a single URL.

Luckily, sometimes the best art comes from constraints! While perl-style punctuation-happy languages with lots of implicit arguments are nowadays unfashionable, we have no choice but to adopt them anyway if we want things to fit inside a URL. Our example afterquery is actually equivalent to:

  url=http://afterquery.appspot.com/example.json
  group=a,b,c;count(d),sum(e)
  group=a,b;count(c),sum(d),sum(e)

The URL has a default path based on the script location (as URLs always do), and the semicolon in group= statements separates the keys from the values. That leads to a distinction that is very confusing at first, but very convenient thereafter, between

  group=a,b,c

and

  group=a,b,c;

These mean very different things: the first means "keep all the other columns, and guess how to aggregate their values" while the second means "throw away all the other columns."
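
Parsing that rule comes down to a two-way split on the semicolon: no semicolon means "guess aggregates for the leftover columns," and anything after the semicolon (even nothing) is the explicit aggregate list. A hypothetical sketch, not afterquery's actual parser:

```javascript
// Sketch of the semicolon rule for group= arguments: no semicolon means
// "keep the leftover columns and guess their aggregates"; a semicolon
// makes whatever follows it (even nothing) the explicit aggregate list.
// (A hypothetical sketch, not afterquery's actual parser.)
function parseGroup(spec) {
  var semi = spec.indexOf(';');
  if (semi < 0) {
    return {keys: spec.split(','), aggs: null};  // null: guess defaults
  }
  var aggpart = spec.slice(semi + 1);
  return {keys: spec.slice(0, semi).split(','),
          aggs: aggpart ? aggpart.split(',') : []};  // []: drop the rest
}

parseGroup('a,b,c');    // {keys: ['a','b','c'], aggs: null}
parseGroup('a,b,c;');   // {keys: ['a','b','c'], aggs: []}
parseGroup('a,b,c;count(d),sum(e)');
                        // {keys: ['a','b','c'], aggs: ['count(d)','sum(e)']}
```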

Like a regex, it's totally unfriendly to an initial observer, where an SQL statement (or COBOL program) might make a beginner feel comfortable that they can read what's going on. But my theory is: once you've got the idea, SQL is just tedious, but afterquery is more fun.

And unlike SQL, a nontrivial program will fit in a URL string.

December 16, 2012 04:46

2012-12-29 »

3D Printing

My first 3d-printed creation (and my 3d model that I printed it from). The photo below is three printings of the same design at different sizes:

The entire car prints, bottom to top, as a single run, and yes, those wheels actually turn. Each two wheel + axle combination is a single solid object, and the frame between the two actually has closed loops around the axles. So we have the magical-seeming trick of passing the solid axles through the loops - without needing any welding or gluing after the fact. Print it, unstick it, and roll it off the platform.

Playing with this has really connected a few physical-world concepts that didn't click for me before. For example, measurement tolerances are absolutes, not percentages. The printer I used is accurate to about 0.1mm. Previously, I had never cared about any distances less than a millimeter ("If it ain't on my ruler, it don't exist") but at this scale (the smallest car is only 1cm tall), it matters. To make the loops that wrap around the axle so they don't blur into the axle itself, I had to leave quite a bit of extra space. This produced a tight fit for the smallest car, but when we naively scale up the design linearly (the big one is about 3cm tall) that excess space leaves the axles pretty floppy. (Or we can call it 4-wheel steering, and then it's a feature.)

The other scaling-related lesson comes back to that old interview question: if somehow you were shrunk down to the size of a coin and put inside a blender, how would you escape?

The answer is that you'd jump out. Why? Because your body's mass scales down with n^3, but the strength in your muscles scales down much more slowly - let's say n^2. Relative to your size, you'd have super strength. That's why grasshoppers can easily leap to 100x their height, but nothing the size of a human can do so.

For the same reason, when you drop my big car on the floor, a wheel tends to break off. When you drop the little one on the floor, it stays intact. Why? Because the mass (and thus the force with which it hits the floor) has scaled up by 3^3 = 27, but the tensile strength, which goes with cross-sectional area, is only 3^2 = 9x higher. Relative to its strength, the big car hits the floor three times as hard - enough to break the rather weak connection between axle and wheel. That could presumably be fixed by better engineering, but I'm not really That Kind of Engineer.
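
The arithmetic behind that is the square-cube law: impact force scales with volume (n^3) while strength scales with cross-sectional area (n^2), so the relative stress grows linearly with the scale factor. A quick sketch:

```javascript
// Square-cube law sketch: scale a design up by n and impact force grows
// with volume (n^3) while strength grows with cross-sectional area (n^2),
// so the force-to-strength ratio grows by n.
function stressRatio(n) {
  var force = Math.pow(n, 3);     // mass, hence impact force
  var strength = Math.pow(n, 2);  // cross-sectional area
  return force / strength;        // = n
}

stressRatio(1);  // 1: the small car survives the drop
stressRatio(3);  // 3: the big car hits 3x harder relative to its strength
```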

In short, 3D printing is fun. But really it just makes me that much more impatient for nanobots. Ooh, micron-scale accuracy and toy cars made of diamonds!

December 29, 2012 11:29
