We’ve received a lot of requests for a topic/category classifier, that is, a classifier that labels a text or webpage with a topic (e.g. ‘Computers’, ‘Sports’ or ‘Games’). The basic idea of uClassify is that users build and share classifiers, and I have been hoping that such a classifier would pop up eventually. When I looked through the list of 800+ private classifiers I found a couple of category classifiers, but they are usually built to label text within a specific domain (e.g. only ‘Sports’ or some narrower topic set). So far, no one has built a public general topic classifier.
Finding topics
Building a topic classifier is not something you just sneeze out of your nose (as we say in Sweden); it takes some preprocessing. First of all you need to decide which categories to use. Luckily, people have already constructed good category structures, such as the Yahoo Directory and the Open Directory Project (ODP).
I decided to go with ODP and create a set of hierarchical classifiers describing the first two levels of ODP. The top level classifier consists of the following topics: Arts, Business, Computers, Games, Health, Home, Recreation, Science, Society and Sports. Note that, for various reasons, I’ve removed some of the original ODP topics (World, Reference, Regional, Shopping and News).
Each topic in the top level classifier has a corresponding child classifier that in turn consists of all its level 2 topics. For example, the classifier ‘Computers’ includes, among others: Algorithms, Artificial Intelligence, Artificial Life, …, Virtual Reality.
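To make the two-level setup concrete, here is a minimal Python sketch of how a text could be routed from the top level classifier to the matching child classifier. It is only an illustration: `keyword_classifier` is a toy stand-in based on keyword counts, not the uClassify algorithm, and all function names and keyword sets are made up for the example.

```python
from typing import Callable, Dict, Optional, Set, Tuple

# A classifier maps a text to a score per topic (hypothetical interface).
Classifier = Callable[[str], Dict[str, float]]


def keyword_classifier(keywords: Dict[str, Set[str]]) -> Classifier:
    """Toy stand-in for a trained classifier: scores topics by keyword hits."""
    def classify(text: str) -> Dict[str, float]:
        words = text.lower().split()
        # Add 1 to every count so that no topic ends up with a zero score.
        counts = {
            label: 1 + sum(word in vocab for word in words)
            for label, vocab in keywords.items()
        }
        total = sum(counts.values())
        return {label: count / total for label, count in counts.items()}
    return classify


def classify_hierarchical(
    text: str,
    top: Classifier,
    children: Dict[str, Classifier],
) -> Tuple[str, Optional[str]]:
    """Pick the best top level topic, then ask that topic's child classifier."""
    top_scores = top(text)
    topic = max(top_scores, key=top_scores.get)
    child = children.get(topic)
    if child is None:
        # No child classifier registered for this topic.
        return topic, None
    child_scores = child(text)
    return topic, max(child_scores, key=child_scores.get)


# Made-up keyword sets, for illustration only.
top_level = keyword_classifier({
    "Computers": {"algorithm", "software"},
    "Sports": {"football", "goal"},
})
child_classifiers = {
    "Computers": keyword_classifier({
        "Algorithms": {"algorithm", "sorting"},
        "Virtual Reality": {"headset", "vr"},
    }),
}

print(classify_hierarchical("a fast sorting algorithm", top_level, child_classifiers))
```

The point of the split is that each child classifier only has to tell apart the level 2 topics of one parent, which keeps every individual classifier small.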
Finding data
ODP provides RDF dumps of its directory, huge XML files (2+ GB) that include the entire directory with topic titles, descriptions and external links. I decided to make use of this directory, so I wrote a SAX parser that extracts the topics and links. Then I downloaded and cleaned 60 links from each category and used them as training data.
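The extraction step can be sketched roughly as follows, using Python’s built-in `xml.sax`. This is not the parser I actually used, and the element and attribute names (`ExternalPage`, `about`, `topic`) and the namespace URIs in the sample are assumptions about the dump format, so check them against the real file. The sketch groups links by their first two topic levels and keeps at most 60 per category.

```python
import xml.sax
from collections import defaultdict


class TopicLinkHandler(xml.sax.ContentHandler):
    """Collects external links per level 2 topic, up to max_links each."""

    def __init__(self, max_links=60, depth=2):
        super().__init__()
        self.max_links = max_links
        self.depth = depth
        self.links = defaultdict(list)
        self._url = None        # URL of the ExternalPage being read, if any
        self._in_topic = False  # True while inside a <topic> element
        self._text = []

    def startElement(self, name, attrs):
        # Element and attribute names are assumed, not verified.
        if name == "ExternalPage":
            self._url = attrs.get("about")
        elif name == "topic" and self._url is not None:
            self._in_topic = True
            self._text = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so collect them all.
        if self._in_topic:
            self._text.append(content)

    def endElement(self, name):
        if name == "topic" and self._in_topic:
            self._in_topic = False
            # e.g. "Top/Computers/Algorithms/Sorting" -> "Computers/Algorithms"
            parts = "".join(self._text).strip().split("/")
            if len(parts) > self.depth:
                key = "/".join(parts[1:self.depth + 1])
                if len(self.links[key]) < self.max_links:
                    self.links[key].append(self._url)
        elif name == "ExternalPage":
            self._url = None


def extract_links(xml_bytes, max_links=60):
    handler = TopicLinkHandler(max_links=max_links)
    xml.sax.parseString(xml_bytes, handler)
    return dict(handler.links)


# Tiny made-up sample in the assumed format.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://example.com/sorting">
    <d:Title>Sorting</d:Title>
    <topic>Top/Computers/Algorithms/Sorting</topic>
  </ExternalPage>
  <ExternalPage about="http://example.com/graphs">
    <topic>Top/Computers/Algorithms</topic>
  </ExternalPage>
  <ExternalPage about="http://example.com/chess">
    <topic>Top/Games/Board_Games</topic>
  </ExternalPage>
</RDF>
"""

print(extract_links(SAMPLE))
```

For the real multi-gigabyte dump you would call `xml.sax.parse(path, handler)` instead of `parseString`, so the file is streamed rather than loaded into memory. That is the main reason to use SAX here.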
Result
You can try the general topic classifier here. You can find the sub-classifiers here; they are named ‘Business Topics’ and so on.