FontClustr: It’s yours, free.

The software I wrote in January called FontClustr is now available under the GPL.

If you use my methodology to improve font selection in your own program, I would appreciate the hell out of it if you credited me in some way.

For the impatient, you’ll need the following:

Run it by making fontclustr.py executable and executing it. If all goes well, you’ll be in for about 4 hours and 400MB to make all the output. At least, that’s what it took on my machine with about 1000 installed fonts.

Hit the jump for all the boring stuff.

This code was last tested on Ubuntu Jaunty Jackalope. You’ll need these packages (and possibly others):python-pythonmagick
python-pygame
python-gd
python-opencv
fonttools


How It Works

There are 2 pieces to FontClustr: a distance calculation algorithm, and an agglomerative clustering algorithm. To my knowledge, these are both novel.

The distance calculation is done with the computer vision library OpenCV. I’m using “contours”, which (I think) are meant to track objects between consecutive frames of video. Given two contours, OpenCV can tell you how different one is from the other — how much adjustment is required to make the 2 curves line up. I’m bastardizing that concept to measure the differences between characters in different fonts.

Before being able to do the distance calculation, I render all the letters and numbers ([a..z][A..Z][0..9]) to bitmaps. Then, I center the rendered character within its bounding box. Why? No reason, it seemed to work. To compare 2 fonts, I compare the contours of the corresponding characters in each font (a to a, b to b, and so on), which gives me a distance for each pair of letters. I just sum these. There are a lot of contour storage and comparison options. Why did I choose the ones I did? No reason, it seemed to work (and not cause OpenCV to segfault… you get what you pay for, I guess).

As an aside, I briefly considered submitting this work to SIGGRAPH until I realized that there is no scientific merit behind this… I just did stuff until it worked. This will be a running theme here.

So, now I have a method of distance between fonts. So, now I can cluster them. My old AI book (Russel/Norvig, yeah it came in handy) talked about agglomerative clustering, but that was only for objects that “exist” in some n-dimensional space. Normal clustering would merge the closest 2 points into a cluster that is geometrically halfway between them and repeat until there is just one point. I have no dimensions (just distances), so I have to do it a bit differently.

My agglomerative clustering starts by making a huge matrix: for n fonts, I get an n×n matrix. To fill in the matrix, I compute the distances between each pair of fonts, which takes a long time. At this point, each font can be thought of as a cluster of one. I find the 2 closest clusters by locating the smallest number in the matrix (let’s call them fa and fb). To merge them, I remove the row and column for fa and fb from the matrix. Then, I create a new row and column by making a pythagorean distance calculation between each corresponding entry in the fa and fb rows. So, if fa’s distance from a third font “fc” is 3, and fb’s distance from fc is 4, then the new entry would be 5 (this is like pretending that each font is its own dimension — a worst-case scenario). I insert that back into the matrix and repeat. At the same time, I keep a tree representation of which fonts have been merged with which other fonts.

When the matrix is done, I print out the tree in HTML form. I make 3 output files: one with images (tree_images.html), one with text (tree_text.html), and another one that was my first attempt (tree.html, probably useless). If you want to mess with this, the heavy processing is all cached; once the distances are in the matrix, the tree generation only takes about a minute.

Making It Better

Obviously, 4 hours of processing time is pretty impractical for a font-selection replacement. I’m sure there are a million fonts out there, which would really kill the idea of clustering them all on-demand. So, one option is to do a lot of caching and do an incremental clustering run when a new font is added. Another might be to keep a central online repository of font distances.

Also, the process of creating a contour from a bitmap is expensive. I think you could create the contour directly from the truetype format, but I have no idea how to approach that. You could also use less than the whole alphabet — I’m sure that a seasoned typographer would be able to tell you which letters represent the essence of a font.

So, take this idea and run with it.

Make me proud.