Nov 08 2007

Grammar, spam and stupidity

Published by Dougal at 7:19 pm under Computing, Good Science

There are a lot of stupid things in this world, and although automated grammar checking doesn’t rate as very important among them, it is still very stupid. Trying to get real, human grammarians to agree on points of grammar (especially in a language as mongrelised as English) is bad enough. Add into the mix the inevitable artistic desire to break and reforge rules, and the computers have no chance at all.

But they keep on trying, letting Microsoft Word infuriate people on a daily basis with the suggestion that they might want to change “which” to “that” (or is it the other way round?).

There is an analogue to spell checking and grammar checking in computer programming, which is similarly fraught with arguments. Computer languages have the slight advantage that they don’t evolve in a fluid and organic manner because the poor computers can’t keep up, but that doesn’t mean the languages are so simplistic that checking them is an easy job.

The same arguments are used against grammar checking in programming (static typing being the best-known form) as are used in writing: that it impedes the creative flow of the programmer. There are ‘phrases’ that can be written in many programming languages that would produce a correct answer, but they can’t be executed without turning off checks that are useful most of the time. Like a safety harness, static typing catches your errors most of the time, but just occasionally the ‘errors’ are intentional, non-obvious tricks, and then the harness only gets in the way. As it is with English grammar, too.
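To make the idea of a 'phrase' that works but fails the check concrete, here is a minimal sketch in Python with type hints. The function name `double` is invented for illustration, and the remark about the checker assumes an external static type checker such as mypy is run over the file; Python itself ignores the annotations at run time.

```python
def double(x: int) -> int:
    # The annotation says "integers only".
    return x * 2


# Plain Python runs this happily: multiplying a string repeats it.
# A static type checker such as mypy would flag the call, because
# "ab" is a str and the parameter is annotated as int.
result = double("ab")
print(result)
```

Whether that call is a bug or a deliberate trick is exactly the judgement the checker cannot make.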

Spam and Bayesian analysis

Having thought a bit about analysing the structure of natural language and programming language for correctness, we can now think about content analysis. Grammar checkers are necessarily quite strict in their interpretation of the rules they apply, since “almost right” in grammatical terms is another way of saying “wrong” (ignoring artistic licence).

Content analysis is a slightly easier job in that respect because the rules do not need to be applied strictly. There can be some fuzziness about themes of writing: topics of conversation are fluid and constantly shifting, as well as subjective.

Take the problem of spam filtering. Only you can really decide if something sent to you was requested or not; even another intelligent human being wouldn’t be able to tell. But the worst of the spam, all the stuff that offers implausible genital enlargement, is easy to spot using Bayesian classification. The filter checks lots of different things about a message (sender, word content and sentence structure, incidence of the word “viagra”, embedded images) and comes to a conclusion about its likelihood of being spam. Each item contributes toward a final score, and messages scoring over a certain threshold are thrown on the junk pile.
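The scoring step can be sketched roughly as follows. This is a toy, not how any particular mail filter works: the per-token probabilities, the `THRESHOLD` value and the function names are all made up for illustration, and only words are considered (no sender, structure or image checks). Each known token adds its log-odds to a running total, which is then turned back into a probability and compared with the threshold.

```python
import math

# Hypothetical values of P(spam | token), invented for this example.
TOKEN_SPAM_PROB = {
    "viagra": 0.98,
    "enlargement": 0.95,
    "free": 0.80,
    "meeting": 0.10,
    "thanks": 0.15,
}

THRESHOLD = 0.9  # assumed cut-off; real filters tune this


def spam_probability(message: str) -> float:
    """Combine per-token evidence, treating tokens as independent."""
    log_odds = 0.0
    for token in set(message.lower().split()):
        p = TOKEN_SPAM_PROB.get(token)
        if p is None:
            continue  # unknown tokens contribute nothing
        log_odds += math.log(p) - math.log(1.0 - p)
    # Convert the summed log-odds back into a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


def is_spam(message: str) -> bool:
    return spam_probability(message) > THRESHOLD
```

With these made-up numbers, `is_spam("free viagra enlargement")` is true and `is_spam("thanks for the meeting")` is false, while a message containing no known tokens sits at exactly 0.5.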

The great thing about Bayesian inference is that the filters can “learn”. If you really do have a correspondent at Pfizer who sends updates about the little blue pill, then you can train the filter not to discard those messages. The appearance of the word “viagra” in the message will be counteracted by the presence of that sender’s specific email address.
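Here is a rough sketch of that learning step, assuming a simple naive Bayes model with add-one smoothing. The class name `TinyBayesFilter`, the `from:` token convention and the example addresses are all hypothetical. Training just counts how many spam and non-spam ("ham") messages each token has appeared in, and the score is the log-odds that a message is spam.

```python
import math
from collections import Counter


class TinyBayesFilter:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, tokens, label):
        # Count each token once per message.
        self.counts[label].update(set(tokens))
        self.messages[label] += 1

    def _token_prob(self, token, label):
        # Add-one smoothing so unseen tokens never give a zero probability.
        return (self.counts[label][token] + 1) / (self.messages[label] + 2)

    def spam_log_odds(self, tokens):
        # Positive means "more likely spam", negative "more likely ham".
        score = math.log(self.messages["spam"] + 1) - math.log(self.messages["ham"] + 1)
        for token in set(tokens):
            score += math.log(self._token_prob(token, "spam"))
            score -= math.log(self._token_prob(token, "ham"))
        return score


f = TinyBayesFilter()
for _ in range(5):
    f.train(["viagra", "cheap", "from:spammer@example.com"], "spam")
    f.train(["meeting", "from:boss@example.com"], "ham")

# Hypothetical message from the correspondent at Pfizer.
pfizer_mail = ["viagra", "update", "from:friend@pfizer.example"]
before = f.spam_log_odds(pfizer_mail)

# The user marks five such messages as "not spam".
for _ in range(5):
    f.train(pfizer_mail, "ham")
after = f.spam_log_odds(pfizer_mail)
```

Before the extra training, the word "viagra" pushes the score above zero; afterwards the sender token (and "viagra" itself now being seen in wanted mail) pulls it below zero, while mail from the spammer's address still scores as spam.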

Bayesian analysis for stupidity

The Bayesian statistics approach to problems like spam filtering tends to produce far better results than you might expect. A good spam filter (and one with lots of test mail to train on) will be very accurate. My Gmail account gets bombarded with spam and I don’t think I’ve had more than half a dozen messages make it into the main folder in well over a year of use. That’s really effective.

So can we take this approach to other ‘soft’ problems which would normally require human inference and understanding? The people over at Stupid Filter certainly think so, and they’ve taken their ideas, almost wholesale, from the spam filter people. In their words:

we’ll look for things that characterize stupidity and assign particular tokens different weights based on how often they occur in hand-picked examples of idiotic comments.

Instead of looking for “vi4gra” and “hot chixx”, their filter will look for “LOL!!1” and other signs of someone with a broken caps lock key. Then you can add stupidity filters to blogs and other software to filter out the noise. They’re also using YouTube comments as a ready-made trove of stupid statements.

Grammar checking comes back round again

One of the problems that the designers of the stupidity filter have hit is the same, in spirit, as the problem faced by automated grammar checkers. That is, it’s never entirely obvious whether someone did something out-of-the-ordinary because (a) they are very subtle and clever indeed, or (b) they’re stupid. What if I want to use lolcats in an amusing manner? What if I’m mocking other people who tlk n txt spk lol? A poor filter will block these even though they may be intelligent and done in full knowledge of their meaning.

So the designers started looking for rules to differentiate intentional from unintentional stupidity. They believe some heuristics based on what people change about their writing may hold some insight. I suppose, for example, that anyone exclaiming “LOL!1!!eleventyone!1” would be a bit sarcastic, because it’s not remotely possible to type ‘eleventyone’ by accident. Only time will tell if they will be successful.

One response so far

  1. Helen on 08 Nov 2007 at 9:10 pm

    lol! stupidity filter :-)

    ….damn. did i just get filtered?