New look

I’m very happy to announce that we have just given the uClassify.com homepage a new look (obviously). It’s a lot brighter than the old one, and we hope you like it. A big thank you to Anna Gathu, who designed it.

There are no changes to the functionality, except that public classifiers are now sorted by total number of calls, so the most commonly used classifiers are shown first. Our most popular classifier is language detection, which has classified over 5 million documents.

Let us know how you like the new style.

uClassify Server Indie license

First, to clarify: this doesn’t affect our free web API. The paid licenses found under products apply to users who need to run a local uClassify server for reasons such as performance and security.

Indie developers

We are now offering a new server license for independent developers and smaller companies (<€100.000 in annual revenue). The price is only €99/year for the first 20 to sign up; after that the price will roughly double, to €199/year. Our hope is to help more companies solve a hard problem. The uClassify classification server is highly scalable, robust and accurate. Try it out for yourself!

There is also a new professional license available, see products for more information.

API change: Moved textCoverage into ApiVersion 1.01

The last release of the API introduced a new feature called textCoverage. That release was a bit premature: the feature was supposed to go into API version 1.01 so that it would not break any of our users’ response parsers.

If you have not changed anything in your parser during the last couple of days, this should not affect you. If anyone was quick enough to start using textCoverage under version ‘1.00’, this change means that it will disappear from the responses, and you need to bump the version to 1.01 to get it back. I am really sorry about that.

Bumping the XML version

For XML, just change the version number from ‘1.00’ to ‘1.01’: <uclassify xmlns="http://api.uclassify.com/1/RequestSchema" version="1.01">

Bumping the URL API version

Here you need to add a new parameter, ‘version’, and set it to the version you want: http://uclassify.com/browse/uClassify/Text Language/ClassifyUrl?readkey=YOUR_READ_API_KEY_HERE&url=http%3a%2f%2fblog.uclassify.com&version=1.01
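If you build these URLs in code, something like the following could work. It is only a sketch: the function name is made up for illustration, and percent-encoding the space in the classifier name is my assumption about what the server accepts. The path layout and the ‘readkey’, ‘url’ and ‘version’ parameters are taken from the example URL above.

```python
from urllib.parse import quote, urlencode

BASE = "http://uclassify.com/browse"


def build_classify_url(user, classifier, read_key, target_url, version="1.01"):
    """Build a ClassifyUrl request that pins the API version (hypothetical helper)."""
    # Percent-encode each path segment; the classifier name contains a space.
    path = "/".join(quote(part, safe="") for part in (user, classifier, "ClassifyUrl"))
    # urlencode escapes the target URL and adds the 'version' parameter.
    query = urlencode({"readkey": read_key, "url": target_url, "version": version})
    return f"{BASE}/{path}?{query}"


example_url = build_classify_url(
    "uClassify", "Text Language", "YOUR_READ_API_KEY_HERE", "http://blog.uclassify.com"
)
print(example_url)
```

Keeping the version as an explicit argument makes it easy to stay on ‘1.00’ until your parser is ready for the new attribute.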

You can read more about version handling here.

I am really sorry for any disturbance this may have caused. Let me know if you need any support with this.

uClassify Corpus Tool BETA

With the uClassify Corpus Tool you will be able to build and test classifiers locally without any programming involved. It’s included in the distribution of the uClassify server. You can download the server evaluation version here.

Classifier representation

This tool is really simple to use. To represent a classifier, simply create a directory on your hard drive with the name of the classifier. Then create a subdirectory for each class belonging to this classifier.

For example:
c:\corpus\sentiment (classifier ‘sentiment’ directory)
c:\corpus\sentiment\positive (class ‘positive’ belonging to classifier ‘sentiment’)
c:\corpus\sentiment\negative (class ‘negative’ belonging to classifier ‘sentiment’)

Now you just fill each class directory with documents belonging to that class. For example, put positive Amazon reviews in the ‘c:\corpus\sentiment\positive’ folder and negative reviews in the ‘c:\corpus\sentiment\negative’ folder.
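If you would rather script the directory layout than create it by hand, a small sketch along these lines might do. The helper names are invented for illustration, and I am assuming here that each document is a plain UTF-8 text file with an arbitrary file name, which this post does not actually specify.

```python
import tempfile
from pathlib import Path


def create_corpus_layout(corpus_root, classifier, classes):
    """Create <corpus_root>/<classifier>/<class> for every class (hypothetical helper)."""
    classifier_dir = Path(corpus_root) / classifier
    for class_name in classes:
        (classifier_dir / class_name).mkdir(parents=True, exist_ok=True)
    return classifier_dir


def add_document(classifier_dir, class_name, file_name, text):
    """Store one training document in the directory of its class."""
    path = Path(classifier_dir) / class_name / file_name
    path.write_text(text, encoding="utf-8")
    return path


# Demo in a temporary directory so nothing on your disk is touched.
root = tempfile.mkdtemp()
sentiment_dir = create_corpus_layout(root, "sentiment", ["positive", "negative"])
add_document(sentiment_dir, "positive", "review1.txt", "Great product, works perfectly.")
print(sorted(p.name for p in sentiment_dir.iterdir()))
```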

Testing a classifier

To test this classifier you run the uClassify Corpus Tool:
uclassifytool.exe -test c:\corpus\sentiment

This will output some basic performance metrics such as accuracy, macro precision, macro recall and the F1 measure of precision and recall. Some per-class statistics are shown as well.

To calculate custom metrics on a classifier you can export a confusion matrix with the flag ‘-outcm’. This will allow you to calculate many other measurements on the classifier. You may also output per-class (one vs all) statistics with the ‘-outpc’ flag.
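To give an idea of what can be computed from a confusion matrix, here is a minimal sketch. It assumes the matrix has already been loaded into a nested list where cm[i][j] is the number of documents of actual class i that were predicted as class j; the file format written by ‘-outcm’ is not covered here, so loading it is left out.

```python
def metrics_from_confusion_matrix(cm):
    """Accuracy, macro precision, macro recall and their F1 (harmonic mean).

    Assumes cm[i][j] = documents of actual class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    precisions, recalls = [], []
    for i in range(n):
        predicted_i = sum(cm[r][i] for r in range(n))  # column sum
        actual_i = sum(cm[i])                          # row sum
        precisions.append(cm[i][i] / predicted_i if predicted_i else 0.0)
        recalls.append(cm[i][i] / actual_i if actual_i else 0.0)
    macro_p = sum(precisions) / n
    macro_r = sum(recalls) / n
    f1 = 2 * macro_p * macro_r / (macro_p + macro_r) if (macro_p + macro_r) else 0.0
    return {
        "accuracy": correct / total if total else 0.0,
        "macro_precision": macro_p,
        "macro_recall": macro_r,
        "f1": f1,
    }


# Made-up numbers for a two-class 'sentiment' classifier.
example_metrics = metrics_from_confusion_matrix([[40, 10], [5, 45]])
print(example_metrics)
```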

Building a classifier

To build a classifier:
uclassifytool.exe -build c:\corpus\sentiment

This will create a binary file that the uClassify server can read. It’s basically a frequency distribution with some additional information. The resulting file will be called ‘sentiment.dat’ and placed in the root directory of the classifier (in this case ‘c:\corpus\sentiment\sentiment.dat’).

Now you can just copy this file to your local uClassify server classifier directory.

Public data sets

There are hundreds of public data sets that you can test the classifier on. You can just download them, unzip and put their documents in a directory structure that the uClassify Corpus Tool understands. To mention a few:

Future

For now the .dat files can only be used by your own local uClassify server. However, we are looking into ways to make it possible to upload .dat classifier files to uclassify.com so they can be used via the web API.


Added text coverage score to classification responses

When you classify texts you get back class probabilities. Sometimes it’s hard to know what those are based on, so I’ve added a new score called ‘text coverage’.

Text coverage is the proportion of words in the text to classify that are found in the training data. This helps users determine how trustworthy the probabilities are. For example, say you send a text with 10000 words to the language classifier and get back a high English probability but a low text coverage (say 0.01). This means that only 100 of the 10000 words were recognized by the language classifier. A reasonable cause could be that the text is written in an unknown language but contains some English words (quotations, borrowed words etc.). It’s up to the user to determine how to handle this. Sometimes low text coverage scores are OK; it’s highly dependent on the domain.
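The idea can be illustrated with a few lines of code. This is only a sketch of the definition, not the server’s implementation: it assumes words are split on whitespace and lowercased, and that the training vocabulary is available as a plain set.

```python
def text_coverage(text, known_words):
    """Proportion of words in `text` that occur in the training vocabulary."""
    words = text.lower().split()  # assumption: naive whitespace tokenization
    if not words:
        return 0.0
    recognized = sum(1 for word in words if word in known_words)
    return recognized / len(words)


vocabulary = {"the", "cat", "sat"}
print(text_coverage("The cat xyzzy plugh", vocabulary))  # 2 of 4 words are known
```

A text of 10000 words where only 100 are known would give 100 / 10000 = 0.01, as in the example above.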

The text coverage can be found as an attribute in the <classification> tag and is called ‘textCoverage’.
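Here is a rough sketch of how a client could read the attribute and flag low-coverage results. Only the <classification> tag and its ‘textCoverage’ attribute come from this post; the rest of the sample response (namespace, surrounding elements, ‘className’ and ‘p’ attributes) is a hypothetical layout made up for the example, and the 0.1 threshold is arbitrary.

```python
import xml.etree.ElementTree as ET

# Hypothetical response; only <classification textCoverage="..."> is from the post.
SAMPLE_RESPONSE = """
<uclassify xmlns="http://api.uclassify.com/1/ResponseSchema" version="1.01">
  <readCalls>
    <classify id="call1">
      <classification textCoverage="0.01">
        <class className="English" p="0.97"/>
        <class className="Swedish" p="0.03"/>
      </classification>
    </classify>
  </readCalls>
</uclassify>
""".strip()

NS = {"r": "http://api.uclassify.com/1/ResponseSchema"}  # assumed namespace


def read_classifications(response_xml, min_coverage=0.1):
    """Return probabilities plus a 'trusted' flag based on text coverage."""
    root = ET.fromstring(response_xml)
    results = []
    for node in root.findall(".//r:classification", NS):
        coverage = float(node.get("textCoverage", "0"))
        probabilities = {
            c.get("className"): float(c.get("p")) for c in node.findall("r:class", NS)
        }
        results.append(
            {
                "coverage": coverage,
                "probabilities": probabilities,
                "trusted": coverage >= min_coverage,
            }
        )
    return results


parsed = read_classifications(SAMPLE_RESPONSE)
print(parsed)
```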

Let me know if you have any questions about this.

Category classifier

We’ve received a lot of requests for a topic/category classifier, that is, a classifier that labels a text or webpage with a topic (e.g. ‘Computers’, ‘Sports’ or ‘Games’). The basic idea of uClassify is that users build and share classifiers, and I have been hoping that this classifier would pop up eventually. When I looked through the list of 800+ private classifiers I found a couple of category classifiers, but they are usually used to label a specific domain (e.g. only ‘Sports’ or some narrower topic set). However, no one has yet built a public general topic classifier.

Finding topics

Building a topic classifier is not something you just sneeze out of your nose (as we say in Sweden); it takes some preprocessing. First of all you need to decide which categories to use. Luckily, people have already constructed good structures such as Yahoo Directories and the Open Directory Project (ODP).

I decided to go with ODP and create a set of hierarchical classifiers describing the first two levels of ODP. The top-level classifier consists of the following topics: Arts, Business, Computers, Games, Health, Home, Recreation, Science, Society and Sports. Note that I’ve removed some of the original ODP topics (World, Reference, Regional, Shopping and News) for various reasons.

Each topic in the top-level classifier has a corresponding child classifier that in turn consists of all level 2 topics. For example, the classifier ‘Computers’ includes, among others: Algorithms, Artificial Intelligence, Artificial Life, …, Virtual Reality.
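On the client side, using such a hierarchy could look roughly like the sketch below: ask the top-level classifier first, then route the text to the child classifier of the winning topic. The classifiers here are toy stand-in functions that return made-up probabilities; in practice each would be a call to the corresponding uClassify classifier.

```python
def classify_hierarchically(text, top_classifier, child_classifiers):
    """Pick the best top-level topic, then the best level 2 topic beneath it."""
    top_probs = top_classifier(text)
    top_topic = max(top_probs, key=top_probs.get)
    child = child_classifiers.get(top_topic)
    if child is None:
        return top_topic, None  # no child classifier registered for this topic
    child_probs = child(text)
    return top_topic, max(child_probs, key=child_probs.get)


# Toy stand-ins with invented probabilities, for illustration only.
def toy_top(text):
    if "algorithm" in text:
        return {"Computers": 0.9, "Sports": 0.1}
    return {"Computers": 0.2, "Sports": 0.8}


def toy_computers(text):
    return {"Algorithms": 0.7, "Virtual Reality": 0.3}


print(classify_hierarchically("a sorting algorithm", toy_top, {"Computers": toy_computers}))
```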

Finding data

ODP provides RDF dumps of its directory: huge XML files (2+ GB) that include the entire directory with topic titles, descriptions and external links. I decided to try making use of this directory, so I wrote a SAX parser that extracted the topics and links. Then I downloaded and cleaned 60 links from each category and used that as training data.
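For the curious, a streaming extraction of this kind can be sketched as follows. This is not the parser I used: the element and attribute names in the sample (‘Topic’, ‘r:id’, ‘link’, ‘r:resource’) are assumptions about the dump format, the sample data is invented, and downloading and cleaning the pages is left out. A SAX parser fits well because the file is far too large to load into memory at once.

```python
import xml.sax
from collections import defaultdict

# Tiny invented sample; element and attribute names are assumed, not verified.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns="http://dmoz.org/rdf/">
  <Topic r:id="Top/Computers/Algorithms">
    <link r:resource="http://example.com/a"/>
    <link r:resource="http://example.com/b"/>
  </Topic>
  <Topic r:id="Top/Sports/Soccer">
    <link r:resource="http://example.com/c"/>
  </Topic>
</RDF>"""


class TopicLinkHandler(xml.sax.ContentHandler):
    """Collect up to max_links URLs per (level 1, level 2) topic."""

    def __init__(self, max_links=60):
        super().__init__()
        self.max_links = max_links
        self.links = defaultdict(list)
        self.current = None

    def startElement(self, name, attrs):
        if name == "Topic":
            parts = attrs.get("r:id", "").split("/")
            # Keep only the first two levels below 'Top'; deeper topics fold into them.
            self.current = tuple(parts[1:3]) if len(parts) >= 3 else None
        elif name == "link" and self.current is not None:
            url = attrs.get("r:resource")
            bucket = self.links[self.current]
            if url and len(bucket) < self.max_links:
                bucket.append(url)

    def endElement(self, name):
        if name == "Topic":
            self.current = None


def extract_links(data, max_links=60):
    handler = TopicLinkHandler(max_links)
    xml.sax.parseString(data, handler)
    return dict(handler.links)


topic_links = extract_links(SAMPLE)
print(topic_links)
```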

Result

You can try the general topic classifier here, and you can find the sub classifiers here, named ‘Business Topics’ etc.

Hierarchy

[Image: hierarchy of the topic classifiers]

Radian6 uses uClassify for spam filtering

Radian6 – social media monitoring

In November 2008 I was contacted by Chris Newton, CTO at Radian6, who was curious whether uClassify could help him filter out spam blogs (splogs). I suggested that we test it over our web API, and soon enough we had set up a spam classifier. After months of running their system against our web API, Radian6 were pleased with the evaluation. To meet their high demands they purchased their own uClassify server. Today they monitor more than 10 million blog posts daily, and each post is run through the uClassify server to filter out spam.



“Radian6’s software platform tracks mentions across over 150 million social media sites and sources.”

About uClassify

uClassify can be used either via our free web API or by installing your own local server. For most users the web API is enough, as it’s free and has a generous license (you are allowed to use it commercially, just remember to link back). For developers who want to process large volumes of documents the web API may not be enough; in that case it’s possible to purchase a uClassify server of your own that can be installed locally.

UrlAi update

Up until now, the development of urlai.com has mainly been an exercise in writing efficient SQL. The database has been filled with, if I recall correctly, millions of blog posts, and for each post 12 different classifications. Everything is queried from the database and, if it is not there, run through our classifiers. Now that we have fixed most of the bottlenecks in the SQL queries, we have decided to start playing a bit with the data.

Ranking

Our first feature was blog ranking. We started with a ranking based on mood (upset or happy); once this is stable and scalable we will add gender ranking as well (most manly/feminine bloggers).

Future

We have some ideas for how to take urlai further. As we are gathering a lot of classified data, we imagine there should be some value in being able to run a text search through all the posts and plot the results with respect to the classifiers.

Also, to improve our classifiers we are looking into user-reinforced training. For example, when a blog is shown, users get an opportunity to leave classifier feedback: “Hey, I’m not 60-100 years old!”

We are definitely going to add more classifiers as well.

We also would like to add some cool visualization of the classifiers – perhaps one that even makes it possible to zoom in search set->blogger->posts->specific words. Perhaps with the cool GapMinder tool?

Download evaluation server

It’s now possible to evaluate the uClassify server locally. We have built a new version of the server that can be downloaded freely and executed on Windows operating systems.

With the evaluation version you can test one of the server’s key features, the classification speed, without having to go via the web API, which can be slow when large volumes of data have to be sent over the web. This is important for anyone who wants to make sure it can handle big volumes of data before purchasing a commercial license.

The only restriction is that it has to be restarted after every 10000 calls, just as a reminder that it’s only for evaluation =)

Have a look at the server manual and download it! And let us know what you think!

Bloggitik.se – Swedish political blogs classified on subject

Bloggitik, a new Swedish-language service based on uClassify, has been launched. It collects political blogs and automatically categorizes each post, which makes it easier for users to find their way among all the blogs. The people at bloggitik.se have also made sure that the system learns as it’s being used: when a blog is classified into the wrong category, readers can correct it, and the system improves over time.

This is a really cool use of our service. Best of luck!
