I’ve been enjoying playing with the word clouds at http://wordle.net. They’re quite sophisticated and compact: small words can appear inside the bowls of large letters. The layout algorithm is obviously doing something cleverer than just placing bounding rectangles.
According to the FAQ the size of words should be related to their frequency in whatever source text you’re using. That hasn’t been the experience for me. Words that I have only written once can appear quite large, and I can’t really fathom this.
But in the interests of learning something more, I thought I’d put together a rudimentary word cloud implementation using the augmentations I recently made to the Haskell diagrams package. (BTW, none of these changes have gone upstream yet. I feel a bit bad about this, so I’ll work through the changes I’ve made later and send a bunch of patches to Brent Yorgey.) The technique was more than slightly influenced by this work done by Chris Done. Go to that site for a beautifully illustrated explanation of the algorithm.
In short, I scale all the words according to their frequency: to be exact, each word’s scale is the number of instances of that word multiplied by 10. Then I place the biggest (most frequent) word in the middle of the empty canvas. Then I take the next word and place it in a free space around the perimeter of the first word. For each new word I look through all the words already on the canvas in turn, looking for free space around their perimeters. As the centre fills up with words I have to move further out to find enough free space (though the words themselves get smaller as we approach the edge of the cloud). Because every new word is checked against every word already placed, this algorithm has terrible complexity. Don’t try it with more than the top 200 words. It still loads quicker than the Java implementation though ;-)
$ time ./cloud
real 0m0.915s
user 0m0.896s
sys 0m0.012s
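The placement loop described above can be sketched roughly as follows. This is not the code from the repository: it uses plain bounding boxes rather than diagrams, tries only four positions around each placed word, and the `Box` type, the `boxFor` width estimate and all function names are assumptions made up for illustration.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing, Down(..))

-- Axis-aligned bounding box: centre coordinates, width and height.
data Box = Box { boxX, boxY, boxW, boxH :: Double } deriving (Show, Eq)

-- Two boxes overlap when their interiors intersect; touching edges are allowed.
overlaps :: Box -> Box -> Bool
overlaps a b =
  abs (boxX a - boxX b) < (boxW a + boxW b) / 2 &&
  abs (boxY a - boxY b) < (boxH a + boxH b) / 2

-- Assumed size model: height is 10 * count, width is half the height per letter.
boxFor :: (String, Int) -> (Double, Double)
boxFor (word, n) = (size * fromIntegral (length word) / 2, size)
  where size = 10 * fromIntegral n

-- Positions touching the perimeter of a placed box: right, left, above, below.
candidates :: (Double, Double) -> Box -> [Box]
candidates (bw, bh) p =
  [ Box (boxX p + (boxW p + bw) / 2) (boxY p) bw bh
  , Box (boxX p - (boxW p + bw) / 2) (boxY p) bw bh
  , Box (boxX p) (boxY p + (boxH p + bh) / 2) bw bh
  , Box (boxX p) (boxY p - (boxH p + bh) / 2) bw bh ]

-- All free positions for a box of the given size, oldest placed word first.
freeSlots :: [Box] -> (Double, Double) -> [Box]
freeSlots [] (bw, bh) = [Box 0 0 bw bh]   -- first word goes in the middle
freeSlots placed dims =
  [ c | p <- placed, c <- candidates dims p, not (any (overlaps c) placed) ]

-- Place words from most to least frequent, taking the first free slot found.
layout :: [(String, Int)] -> [(String, Box)]
layout = foldl step [] . sortBy (comparing (Down . snd))
  where
    step placed wd =
      case freeSlots (map snd placed) (boxFor wd) of
        (b:_) -> placed ++ [(fst wd, b)]
        []    -> placed                   -- nothing fits: skip the word
```

For example, `layout [("foo",3),("bar",1),("baz",2)]` puts "foo" at the origin and the other two words against its sides. Each candidate position is tested against every placed box, which is where the poor scaling comes from.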
I also apply a random colour to each word by choosing a random number for each of the red, green and blue components. This produces a surprisingly pleasing set of colours! The less frequent words are then faded out by making each word’s opacity proportional to its frequency, so a word which is one tenth as common as the most frequent word is one tenth as opaque. This approach isn’t very helpful as far as visualisation and ease of understanding go, but it does make things far prettier! (And it helps to disguise the inelegance of the layout algorithm too.)
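As a rough sketch of that colouring scheme, assuming opacity is measured relative to the most frequent word: the `RGBA` record and the function names are hypothetical, and the tiny linear congruential generator is only a stand-in for a real random number library so that the example depends on nothing but base.

```haskell
-- Colour with components and opacity all in the range 0..1.
data RGBA = RGBA { red, green, blue, alpha :: Double } deriving Show

-- Stand-in pseudo-random step (a small LCG), not a good generator.
nextSeed :: Int -> Int
nextSeed s = (s * 1103515245 + 12345) `mod` 2147483648

-- Map a generator state to a Double in [0, 1).
unit :: Int -> Double
unit s = fromIntegral s / 2147483648

-- Opacity proportional to frequency, relative to the most frequent word.
opacity :: Int -> Int -> Double
opacity maxCount n = fromIntegral n / fromIntegral maxCount

-- Split a stream of numbers into (red, green, blue) triples.
triples :: [a] -> [(a, a, a)]
triples (r:g:b:rest) = (r, g, b) : triples rest
triples _            = []

-- Give every word a random colour and a frequency-based opacity.
colourWords :: Int -> [(String, Int)] -> [(String, RGBA)]
colourWords seed ws = zipWith paint ws (triples randoms)
  where
    maxCount = maximum (map snd ws)
    randoms  = map unit (drop 1 (iterate nextSeed seed))
    paint (word, n) (r, g, b) = (word, RGBA r g b (opacity maxCount n))
```

With counts of 10, 5 and 1, the three words come out fully opaque, half opaque and one tenth opaque respectively, whatever seed is used.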
This is what the result looks like, when run on its own source code. Word clouds of source code can produce surprisingly beautiful output.

It’s important to tune your stop words correctly, otherwise your cloud is filled with boring stuff. In an English text you get loads of “a”, “the”, “to”, “it”. In Haskell you get lots of “import”, “newtype”, “deriving” and so on. Remove all of these and you get a nicer picture of the source text. Here we can see Word and Weight are important — two type synonyms for the more boring String and Int — which gives a hint at what the program is measuring.
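Counting words while dropping stop words might look something like the sketch below. The `Word` and `Weight` synonyms are the ones mentioned above; the tokeniser, the short stop list and the `frequencies` function are assumptions for illustration, not the program’s actual code.

```haskell
import Prelude hiding (Word)
import Data.Char (isAlpha, toLower)
import Data.List (sortBy)
import Data.Ord (comparing, Down(..))
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type Word   = String
type Weight = Int

-- A deliberately tiny stop list mixing English and Haskell noise words.
stopWords :: Set.Set Word
stopWords = Set.fromList
  ["a", "the", "to", "it", "import", "newtype", "deriving"]

-- Split on anything that is not a letter, keeping the original case.
tokenise :: String -> [Word]
tokenise = words . map (\c -> if isAlpha c then c else ' ')

-- Count the remaining words, most frequent first.
frequencies :: String -> [(Word, Weight)]
frequencies =
    sortBy (comparing (Down . snd))
  . Map.toList
  . Map.fromListWith (+)
  . map (\w -> (w, 1))
  . filter keep
  . tokenise
  where
    -- Stop words are matched case-insensitively.
    keep w = map toLower w `Set.notMember` stopWords
```

Applying `take 200` to the result of `frequencies` gives a list that stays within the limit the layout can cope with.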
As always, these examples are online at the following address, which is also a darcs repository: