You'll like it

We promise
Everything here is my opinion. I do not speak for your employer.

2003-04-13 »

Guadec

Woo hoo, my UniConf and Sharvil's ExchangeIT talks got accepted for Guadec. In half-hour timeslots instead of hour ones, which is unexpected but not really a problem. (Were you supposed to request something, or something?)

Now all I need is a passport, a plane ticket, and a place to stay. Eek!

Amazing Discovery

Today I learned that according to Statistics Canada, there are more married women in Canada than married men. (Statscan is truly awesome, so I suppose there's some explanation for this.)

Document Correlation

This weekend I wrote a page correlator using ideas I blatantly stole (badly) from Paul Graham's essay on spam detection using Bayesian filtering. His math makes way more sense than my cheesier version, but mine is just a hack, so that's okay for now.

Quick summary of the theory: the interesting aspects of a document are characterized by the locally most common words among the set of globally least common words. That is, if I say the word "splahooie" a lot, but nobody else does, that makes "splahooie" an interesting characteristic of my document. But if everyone else says "splahooie," or I don't say "splahooie" very much, it's not a keyword.
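
To make the theory concrete, here is a minimal Python sketch of it. This is not the original code (that was Perl and MySQL); the function names, the toy documents, and the `max_doc_fraction` and `min_local_count` cutoffs are all assumptions made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def interesting_keywords(docs, name, max_doc_fraction=0.5,
                         min_local_count=2, top_n=5):
    """Return the locally most common words among the globally rare ones.

    docs maps a page name to its text. The cutoffs are hypothetical
    tuning knobs, not values taken from the original system.
    """
    # Global view: in how many documents does each word appear?
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(tokenize(text)))

    # Local view: how often does each word appear in this one document?
    local = Counter(tokenize(docs[name]))

    limit = max_doc_fraction * len(docs)
    candidates = [
        (word, count) for word, count in local.items()
        if doc_freq[word] <= limit and count >= min_local_count
    ]
    # Most frequent locally first; ties broken alphabetically.
    candidates.sort(key=lambda wc: (-wc[1], wc[0]))
    return [word for word, _ in candidates[:top_n]]


# Toy data, invented for this example.
docs = {
    "mine": "splahooie splahooie splahooie the cat and the dog",
    "other1": "the cat sat on the mat",
    "other2": "the dog ate the bone",
    "other3": "the bird and the fish",
}
print(interesting_keywords(docs, "mine"))
```

In this toy data, "the" is locally common but appears everywhere, and "cat" is globally rare but only shows up once in my document, so neither qualifies. Only "splahooie" passes both tests.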

So anyway, that works pretty well with some refinement. Using this technique and a cheesy "keyword correlation" algorithm in perl+mysql, with our internal company wiki (900+ pages) as a data source, I made it so you can get a list of "interesting pages" related to your current page.
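
One plausible reading of "keyword correlation" is to rank other pages by how many keywords they share with the current one. The sketch below assumes exactly that, in Python rather than perl+mysql; the page names, keyword sets, and the `related_pages` function are hypothetical, and the real scoring may well have been different.

```python
def related_pages(keywords_by_page, name, top_n=3):
    """Rank other pages by how many keywords they share with `name`.

    keywords_by_page maps a page name to its set of interesting keywords.
    Counting shared keywords is an assumed scoring rule, chosen for simplicity.
    """
    mine = keywords_by_page[name]
    scores = []
    for other, theirs in keywords_by_page.items():
        if other == name:
            continue
        shared = mine & theirs
        if shared:
            scores.append((other, len(shared)))
    # Highest overlap first; ties broken by page name.
    scores.sort(key=lambda ps: (-ps[1], ps[0]))
    return scores[:top_n]


# Invented keyword sets for four imaginary wiki pages.
keywords_by_page = {
    "PageA": {"splahooie", "frobnicate", "widget"},
    "PageB": {"splahooie", "frobnicate"},
    "PageC": {"widget", "gizmo"},
    "PageD": {"gizmo"},
}
print(related_pages(keywords_by_page, "PageA"))
```

Because only globally uncommon words ever become keywords, two pages can overlap heavily in ordinary vocabulary and still score zero here.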

The results were... interesting. The algorithm, though simple, is surprisingly good. What it does, though, is a bit weird, because of the way we define "interesting." We only care about globally uncommon stuff. The result is a system that tells you not exactly, "What's related to this page?" but rather, "What's unusual about this page?" If I ask about pphaneuf, it mentions XPLC but not Net Integrator, even though I (as an evil manager) make sure he works more on the latter than the former. But other people do too, so XPLC is more interesting as far as the algorithm is concerned. It's kind of like the anti-google.

Hmm, an abnormality detector. I bet I could sell this to school boards in Kansas.

apenwarr-on-gmail.com