Bloggparti.se – is text left or right wing? (Swedish)

A new site called bloggparti.se, built on uClassify, has spread through the Swedish blogosphere (it only works for Swedish blogs and texts). The site takes a blog or text and tests how closely it resembles each of the major Swedish political parties.

Mattias Aspelund from 49lights.com created this classifier using 100 tagged blogs from each party. The site was created within 24 hours and had more than 1000 requests on the first day.

We think it’s very exciting to see how quickly people can build cool applications around uClassify. Self-test sites seem to be very popular with bloggers; for example, genderanalyzer.com went from 0 to Google PageRank 6 in just three months.

I know there are more applications being built right now, and I’m looking forward to seeing them in action!

Tutorial – Creating your own classifier

This is a brief tutorial on how to create your own classifier. I use the term class synonymously with category, and classifier synonymously with categorizer.

1. Determine the classifier domain

Before a classifier can start to classify, it needs to be created and trained. First, ask yourself what you want the classifier to do: is it a spam filter? A news categorizer? For this tutorial, let’s assume it’s a news categorizer, so we create a classifier named ‘Example News Categorizer’.

Fig 1. Create the classifier

2. Define the relevant classes

Second, you need to define which classes your classifier should include. Choosing relevant classes is straightforward: just ask yourself which categories are relevant for the domain you have chosen. Once you have selected the classes you want the classifier to distinguish between, you create them. This is easy in our Graphical User Interface but can also be done via our web API. For our small example we create the following three classes: Science, Sports and Entertainment. You can create as many classes as you want.

Fig 2. Create the classes (categories)

You can also add and remove classes dynamically, so don’t worry if you aren’t 100% sure that you have included them all.

3. Collect training data

Before the classifier can start to categorize texts into the classes, we need to teach it what texts belonging to the different classes look like. This is the hardest part, as it requires you to collect actual training data. You can collect it from any source you find appropriate.

3.1 Amount of training data

It’s hard to generalize about the number of texts a classifier needs, as it depends heavily on the domain. Simple domains, such as identifying the language of a text, require only a small amount, while harder problems, such as telling apart texts written by males and females, require much more training data. To test an idea, however, I suggest at least 20 documents per category, each in the same format as the documents that will be classified later (e.g. train a spam filter on e-mails). 20 is the bare minimum; from there, the classifier generally gets more accurate as you add training data.

For our news categorizer I collected 20 plain text articles per class from random sources on the Internet.
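Before training, it can help to check that every class actually reaches that minimum. The following is a minimal sketch, assuming the collected texts are held in memory as a dictionary that maps each class name to a list of documents. The function name and the data layout are illustrative assumptions, not part of uClassify.

```python
# Minimum number of documents per class suggested in this tutorial.
MIN_DOCS_PER_CLASS = 20


def classes_needing_more_data(training_data, minimum=MIN_DOCS_PER_CLASS):
    """Return the sorted names of classes with fewer than `minimum` non-empty documents.

    `training_data` is assumed to map class name -> list of text documents.
    """
    short = []
    for class_name, documents in training_data.items():
        # Ignore empty or whitespace-only documents; they carry no training signal.
        usable = [doc for doc in documents if doc.strip()]
        if len(usable) < minimum:
            short.append(class_name)
    return sorted(short)


# Placeholder documents standing in for real articles.
example = {
    "Science": ["some article text"] * 20,
    "Sports": ["some article text"] * 20,
    "Entertainment": ["some article text"] * 12,
}
print(classes_needing_more_data(example))
```

Any class named in the result needs more documents before it is worth testing the classifier.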

3.2 Automate the collecting!

In some cases you can automate the data collection by finding trusted sources on the Internet. For example, for our news classifier I could tap into three RSS feeds, one each for Science, Sports and Entertainment, and gather the data automatically. Ahhh, no manual collecting!! Nice.
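As a rough illustration of the feed idea, here is a minimal sketch using only the Python standard library. It assumes plain RSS 2.0 feeds whose items carry a title and a description. The feed URLs are hypothetical placeholders, and the function names are my own.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical feed URLs, one per class. Replace them with sources you trust.
FEEDS = {
    "Science": "https://example.com/science.rss",
    "Sports": "https://example.com/sports.rss",
    "Entertainment": "https://example.com/entertainment.rss",
}


def extract_items(rss_xml):
    """Return one text (title plus description) per <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    texts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        description = item.findtext("description", default="").strip()
        text = f"{title}\n{description}".strip()
        if text:
            texts.append(text)
    return texts


def collect(feeds):
    """Download every feed and return a dict mapping class name -> list of texts."""
    collected = {}
    for class_name, url in feeds.items():
        with urllib.request.urlopen(url, timeout=10) as response:
            collected[class_name] = extract_items(response.read())
    return collected
```

Calling `collect(FEEDS)` would then give one list of texts per class. Feed descriptions are often short summaries, so for full articles you would probably also need to follow each item's link.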

4. Train the classifier

So you have collected training data in some form (perhaps text files on your hard drive, lists of URLs or some feeds), and now it’s time to train the classifier. This can be done manually in the GUI, or automated if you have some basic programming skills. For our tutorial I found 20 news articles per class and copied and pasted them manually into the GUI, which took me about 30 minutes.

Fig 3. Training the classifier via the GUI

4.1 Automate the training! (requires novice programming skills)

Training a classifier through the GUI can be cumbersome if you have a large amount of training data. My suggestion is to create a small script in your favorite language that trains the classifier automatically. If your training data is lying around locally on your machine (perhaps automatically collected? =) you can just batch it into our web API. If you haven’t collected the training data yet, you could create a script that collects it automatically and trains the classifier with it!

5. Start classifying
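Such a script could look roughly like the sketch below. It assumes the training data is stored as `.txt` files in one folder per class (for example `training_data/Science/`). The actual web API call is deliberately left out: `send_batch` is a hypothetical callback that you would implement against the uClassify API documentation.

```python
from pathlib import Path


def load_training_data(root):
    """Yield (class_name, text) for every .txt file found under root/<class_name>/."""
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        for path in sorted(class_dir.glob("*.txt")):
            yield class_dir.name, path.read_text(encoding="utf-8")


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def train_all(root, send_batch, batch_size=10):
    """Hand every class's texts to send_batch(class_name, texts) in batches.

    send_batch is a placeholder for the real training request.
    Returns a dict mapping class name -> number of texts sent.
    """
    by_class = {}
    for class_name, text in load_training_data(root):
        by_class.setdefault(class_name, []).append(text)

    sent = {}
    for class_name, texts in by_class.items():
        for batch in chunked(texts, batch_size):
            send_batch(class_name, batch)
        sent[class_name] = len(texts)
    return sent
```

To try it without touching any API, pass a callback that only prints, for example `train_all("training_data", lambda name, texts: print(name, len(texts)))`.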

This is the fun part: once you have created and trained your classifier, you can start to use it. You can always test it in our GUI. Beyond that, you can (and should) build your own web site around it via our web API, providing the world with more semantics and cool classifications that have never been seen before! Also, remember that you can use your classifiers commercially and make money from them!

I’ve published the example classifier. Don’t expect it to work perfectly, since it has only been trained on 20 articles per class! Test it here – Example News Categorizer
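When you build a site around a classifier, you usually need to turn its output into a single answer. The sketch below assumes your code has already parsed the classification result into a dictionary mapping class name to a score between 0 and 1. That shape, the function name and the 0.5 threshold are all assumptions for illustration, not the API's actual response format.

```python
def best_class(scores, min_confidence=0.5):
    """Return (class_name, score) for the highest-scoring class.

    If even the best score is below min_confidence, return (None, score)
    so the caller can show "not sure" instead of a weak guess.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    name, score = max(scores.items(), key=lambda pair: pair[1])
    if score < min_confidence:
        return None, score
    return name, score


# Made-up scores for a single article.
print(best_class({"Science": 0.7, "Sports": 0.2, "Entertainment": 0.1}))
```

A threshold like this is especially useful for a classifier trained on only 20 articles per class, where low-confidence answers are likely to be wrong.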

Summary

  • Decide what you want to classify and create a classifier
  • Define and create the categories
  • Collect training data for each category
  • Train each category on the gathered data
  • Build a really cool web site around it!

Everybody can classify

Creating your own classifiers has never been easier: we have developed a Click’n’Classify Graphical User Interface (GUI). This means that you can manually create and train your classifiers without knowing any programming at all. This is a very good way to test an idea. If the classifier works well, build your web site around it or use it for whatever purpose you like.

The GUI allows you to do everything that you can do via our Application Programming Interface (API). Also, just as phpMyAdmin shows its SQL queries, our uClassify GUI will show the XML queries so you can easily understand and use the API from your site.

Features

  • Create and remove classifiers
  • Add and remove classes
  • Train and untrain classes
  • See basic information about your classifiers

Screenshot – Create a classifier

This screenshot shows what it looks like when you are about to create a classifier. Just log in and try it yourself!

Creating a classifier is easy

Screenshot – Training a classifier

Just copy and paste the texts you want to use as training data.

Training a classifier is easy

Happy classifying!

Click’n’classify

As we suspected, most users who sign up find the threshold for getting started too high, as it requires some programming to create and train classifiers. Therefore we have decided to add more GUI features that allow users to make all the API calls without any programming! Once the classifiers are set up, developers can start building their web applications around them via the API.

Copy’n’code

All GUI-driven API calls will display the generated XML so that users can easily see what’s going on and copy the XML directly into their code (much like phpMyAdmin does with SQL queries).

We expect this to take a couple of weeks.

oFaust.com – another site using uClassify

We are very happy to announce that yet another site is using the uClassify web service! ofaust.com is a literature expert that finds out which classical author a text most resembles. The developers let us know that it has been trained on over 80 different works by classical authors such as Plato, Shakespeare, Tolstoy and of course Goethe.

o'Faust

The beta is now up and running, so please sign up and create your own web site using cool classifications!