uClassify blog – Page 5 – uClassify machine learning news and development

uClassify Corpus Tool BETA

With the uClassify Corpus Tool you will be able to build and test classifiers locally without any programming involved. It’s included in the distribution of the uClassify server. You can download the server evaluation version here.

Classifier representation

This tool is really simple to use. To represent a classifier, create a directory on your hard drive with the name of the classifier. Then create a subdirectory for each class belonging to this classifier.

For example:
c:\corpus\sentiment (classifier ‘sentiment’ directory)
c:\corpus\sentiment\positive (class ‘positive’ belonging to classifier ‘sentiment’)
c:\corpus\sentiment\negative (class ‘negative’ belonging to classifier ‘sentiment’)

Now you just fill each class directory with documents belonging to that class. For example, put positive Amazon reviews in the ‘c:\corpus\sentiment\positive’ folder and negative reviews in the ‘c:\corpus\sentiment\negative’ folder.
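If you would rather script the setup than create the folders by hand, here is a minimal sketch in Python. The helper names (create_corpus_layout, add_document) are made up for this example, and it assumes each document is stored as one plain text file, which the tool's documentation should confirm.

```python
import tempfile
from pathlib import Path


def create_corpus_layout(root, classifier, classes):
    """Create <root>/<classifier>/<class>/ and return the classifier directory."""
    classifier_dir = Path(root) / classifier
    for class_name in classes:
        (classifier_dir / class_name).mkdir(parents=True, exist_ok=True)
    return classifier_dir


def add_document(classifier_dir, class_name, filename, text):
    """Write one training document into a class directory.

    Assumption: one plain text file per document.
    """
    path = Path(classifier_dir) / class_name / filename
    path.write_text(text, encoding="utf-8")
    return path


if __name__ == "__main__":
    # A temporary directory stands in for the corpus root in this demo.
    with tempfile.TemporaryDirectory() as tmp:
        sentiment = create_corpus_layout(tmp, "sentiment", ["positive", "negative"])
        add_document(sentiment, "positive", "review1.txt", "Loved this book.")
        add_document(sentiment, "negative", "review2.txt", "A waste of money.")
        for path in sorted(sentiment.rglob("*.txt")):
            print(path.relative_to(tmp))
```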

Testing a classifier

To test this classifier you run the uClassify Corpus Tool:
uclassifytool.exe -test c:\corpus\sentiment

This will output some basic performance metrics such as accuracy, macro precision, macro recall and the F1 measure of precision and recall. Some per-class statistics are also shown.

To calculate custom metrics on a classifier you can export a confusion matrix with the flag ‘-outcm’. This will allow you to calculate a lot of other measurements on the classifier. You may also output per-class (one vs all) statistics with the ‘-outpc’ flag.
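As an illustration of what you can do with a confusion matrix, here is a small Python sketch that computes accuracy, macro precision, macro recall and F1. It assumes the matrix has already been loaded into a list of lists where matrix[i][j] is the number of documents of true class i that were predicted as class j. The actual file format written by ‘-outcm’ is not described here, so the loading step is left out. Taking F1 as the harmonic mean of macro precision and macro recall is also an assumption about how the tool defines it.

```python
def metrics_from_confusion(matrix):
    """Compute basic metrics from a square confusion matrix.

    Assumption: matrix[i][j] = documents of true class i predicted as class j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))

    precisions, recalls = [], []
    for k in range(n):
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        true_positive = matrix[k][k]
        precisions.append(true_positive / predicted_k if predicted_k else 0.0)
        recalls.append(true_positive / actual_k if actual_k else 0.0)

    macro_precision = sum(precisions) / n
    macro_recall = sum(recalls) / n
    denominator = macro_precision + macro_recall
    f1 = 2 * macro_precision * macro_recall / denominator if denominator else 0.0

    return {
        "accuracy": correct / total if total else 0.0,
        "macro_precision": macro_precision,
        "macro_recall": macro_recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Made-up counts: rows are true classes (positive, negative).
    example = [[8, 2],
               [1, 9]]
    for name, value in metrics_from_confusion(example).items():
        print(f"{name}: {value:.3f}")
```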

Building a classifier

To build a classifier:
uclassifytool.exe -build c:\corpus\sentiment

This will create a binary file that the uClassify server can read. It’s basically a frequency distribution with some additional information. The resulting file will be called ‘sentiment.dat’ and placed in the root directory of the classifier (in this case ‘c:\corpus\sentiment\sentiment.dat’).
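To give a feel for what a per-class frequency distribution looks like, here is a rough Python sketch that counts words per class from the directory layout described earlier. This is not the tool's actual algorithm or file format. The tokenization (lowercased runs of word characters) and the function name are assumptions made for the example.

```python
import re
from collections import Counter
from pathlib import Path


def class_word_frequencies(classifier_dir):
    """Return {class name: Counter of word frequencies} for a classifier directory.

    Assumption: words are lowercased runs of word characters; the real tool
    may tokenize differently and stores additional information.
    """
    frequencies = {}
    for class_dir in sorted(Path(classifier_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # skip files in the root, such as a previously built .dat
        counts = Counter()
        for document in class_dir.iterdir():
            if document.is_file():
                text = document.read_text(encoding="utf-8", errors="ignore")
                counts.update(re.findall(r"\w+", text.lower()))
        frequencies[class_dir.name] = counts
    return frequencies
```

Calling class_word_frequencies on the ‘sentiment’ directory would give one Counter for ‘positive’ and one for ‘negative’.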

Now you can just copy this file to your local uClassify server classifier directory.

Public data sets

There are hundreds of public data sets that you can test the classifier on. You can just download them, unzip them and put their documents in a directory structure that the uClassify Corpus Tool understands.

Future

For now the .dat files can only be used by your own local uClassify server. However, we are looking into ways to make it possible to upload .dat classifier files to uclassify.com so they can be used via the web API.

Added text coverage score to classification responses

When you classify texts you get back class probabilities. Sometimes it’s hard to know what those are based on, so I’ve added a new score called ‘text coverage’.

Text coverage is the proportion of words in the text to classify that are found in the training data. This helps users determine how trustworthy the probabilities are. For example, suppose you send a text with 10000 words to the language classifier and get back a high English probability but a low text coverage (say 0.01). This means that only 100 of the 10000 words were recognized by the language classifier. A reasonable cause could be that the text is written in an unknown language but contains some English words (quotations, borrowed words etc). It’s up to the user to determine how to handle this. Sometimes low text coverage scores are OK; it’s highly dependent on the domain.
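The definition above can be sketched in a few lines of Python. This is only an illustration, not the server's implementation. It assumes that every word occurrence counts (rather than each unique word once) and that words are lowercased runs of word characters.

```python
import re


def text_coverage(text, known_words):
    """Proportion of word occurrences in `text` that appear in `known_words`.

    Assumptions: every occurrence counts, and tokens are lowercased runs of
    word characters. The server may tokenize and count differently.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    recognized = sum(1 for word in words if word in known_words)
    return recognized / len(words)


if __name__ == "__main__":
    # 100 recognized words out of 10000, as in the example above.
    vocabulary = {"hello"}
    text = "hello " * 100 + "zzz " * 9900
    print(text_coverage(text, vocabulary))
```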

The text coverage can be found as an attribute in the <classification> tag and is called ‘textCoverage’.
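Here is a hedged Python sketch of reading the attribute from a response. Only the <classification> tag and the ‘textCoverage’ attribute come from the description above. The rest of the sample XML (the wrapper element, the <class> element and its attributes, the lack of namespaces) is invented for the example, so check it against a real response. The 0.5 threshold is also arbitrary.

```python
import xml.etree.ElementTree as ET

# Hypothetical response: only <classification textCoverage="..."> is taken
# from the post; everything else in this sample is made up.
SAMPLE_RESPONSE = """
<response>
  <classification textCoverage="0.01">
    <class className="English" p="0.98"/>
  </classification>
</response>
"""


def coverage_values(xml_text):
    """Return the textCoverage of every <classification> element as floats."""
    root = ET.fromstring(xml_text)
    return [
        float(node.get("textCoverage"))
        for node in root.iter("classification")
        if node.get("textCoverage") is not None
    ]


def looks_trustworthy(coverage, threshold=0.5):
    """Arbitrary example rule; a sensible threshold depends on the domain."""
    return coverage >= threshold


if __name__ == "__main__":
    for coverage in coverage_values(SAMPLE_RESPONSE):
        print(coverage, looks_trustworthy(coverage))
```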

Let me know if you have any questions about this.

Category classifier

We’ve received a lot of requests for a topic/category classifier, that is, a classifier that labels a text or webpage with a topic (e.g. ‘Computers’, ‘Sports’ or ‘Games’). The basic idea of uClassify is that users build and share classifiers, and I have been hoping that this classifier would pop up eventually. When I looked through the list of 800+ private classifiers I found a couple of category classifiers, but they are usually used to label texts within a specific domain (e.g. only ‘Sports’ or some narrower topic set). However, no one has yet built a public general topic classifier.

Finding topics

Building a topic classifier is not something you just sneeze out of your nose (as we say in Sweden); it takes some preprocessing. First of all you need to decide what categories to use. Luckily, people have already constructed good structures such as Yahoo Directories and the Open Directory Project (ODP).

I decided to go with ODP and create a set of hierarchical classifiers describing the first two levels of ODP. The top level classifier consists of the following topics: Arts, Business, Computers, Games, Health, Home, Recreation, Science, Society and Sports. Note that I’ve removed some topics from the original ODP (World, Reference, Regional, Shopping and News) for various reasons.

Each topic in the top level classifier has a corresponding child classifier that in turn consists of all level 2 topics. For example, the classifier ‘Computers’ includes, among others: Algorithms, Artificial Intelligence, Artificial Life, …, Virtual Reality.
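One way to use such a hierarchy is to classify in two steps: first pick the top level topic, then ask that topic's child classifier for the level 2 topic. The Python sketch below shows the idea with toy keyword classifiers standing in for the real ones. The function names and keyword lists are invented for illustration, and the real classifiers return probabilities rather than keyword counts.

```python
def keyword_classifier(keywords_by_topic):
    """Build a toy classifier that scores topics by keyword counts.

    A stand-in for a real classifier, used only to make the sketch runnable.
    """
    def classify(text):
        words = text.lower().split()
        return {
            topic: sum(words.count(keyword) for keyword in keywords)
            for topic, keywords in keywords_by_topic.items()
        }
    return classify


def classify_hierarchically(text, top_classifier, child_classifiers):
    """Return (top level topic, level 2 topic or None)."""
    top_scores = top_classifier(text)
    top_topic = max(top_scores, key=top_scores.get)
    child = child_classifiers.get(top_topic)
    if child is None:
        return top_topic, None
    child_scores = child(text)
    return top_topic, max(child_scores, key=child_scores.get)


if __name__ == "__main__":
    top = keyword_classifier({
        "Computers": ["algorithm", "software"],
        "Sports": ["football", "goal"],
    })
    children = {
        "Computers": keyword_classifier({
            "Algorithms": ["algorithm", "sorting"],
            "Virtual Reality": ["headset", "vr"],
        }),
    }
    print(classify_hierarchically("a fast sorting algorithm", top, children))
```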

Finding data

ODP provides RDF dumps of their directory: huge XML files (2+ GB) that include the entire directory with topic titles, descriptions and external links. I decided to make use of this directory, so I wrote a SAX parser that extracted the topics and links. Then I downloaded and cleaned 60 links from each category and used them as training data.
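For the curious, here is roughly what such a SAX parser can look like in Python. This is a sketch, not the parser I actually used. It assumes a simplified dump where each <Topic r:id="..."> element contains <link r:resource="..."/> children, so check the element and attribute names against the real dump. SAX is a good fit because it streams the file instead of loading 2+ GB into memory.

```python
import xml.sax
from collections import defaultdict


class TopicLinkHandler(xml.sax.ContentHandler):
    """Collect up to `max_links_per_topic` external links for each topic."""

    def __init__(self, max_links_per_topic=60):
        super().__init__()
        self.max_links = max_links_per_topic
        self.links = defaultdict(list)
        self._current_topic = None

    def startElement(self, name, attrs):
        # Assumed structure: <Topic r:id="..."> with <link r:resource="..."/> inside.
        if name == "Topic":
            self._current_topic = attrs.get("r:id")
        elif name == "link" and self._current_topic is not None:
            url = attrs.get("r:resource")
            if url and len(self.links[self._current_topic]) < self.max_links:
                self.links[self._current_topic].append(url)

    def endElement(self, name):
        if name == "Topic":
            self._current_topic = None


def extract_links(xml_bytes, max_links_per_topic=60):
    """Parse an in-memory dump and return {topic id: [urls]}."""
    handler = TopicLinkHandler(max_links_per_topic)
    xml.sax.parseString(xml_bytes, handler)
    return dict(handler.links)


# Tiny made-up sample in the assumed format.
SAMPLE_DUMP = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/">
  <Topic r:id="Top/Computers/Algorithms">
    <link r:resource="http://example.com/a"/>
    <link r:resource="http://example.com/b"/>
  </Topic>
  <Topic r:id="Top/Sports/Soccer">
    <link r:resource="http://example.com/c"/>
  </Topic>
</RDF>"""

if __name__ == "__main__":
    for topic, urls in extract_links(SAMPLE_DUMP).items():
        print(topic, urls)
```

For a real dump you would pass the file to xml.sax.parse instead of parseString, so that the file is streamed from disk.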

Result

You can try the general topic classifier here. And you can find the sub classifiers here named ‘Business Topics’ etc.

Hierarchy

[Image: the classifier hierarchy]

Radian6 uses uClassify for spam filtering

Radian6 – social media monitoring

In November 2008 I was contacted by Chris Newton, CTO at Radian6. He was curious whether uClassify could help him filter out spam blogs (splogs). I suggested that we test it over our web API, and soon enough we had set up a spam classifier. After months of running their system against our web API, Radian6 were pleased with the evaluation. To meet their high demands they purchased their own uClassify server. Today they monitor more than 10 million blog posts daily. Each post is run through the uClassify server to filter out spam.

“Radian6’s software platform tracks mentions across over 150 million social media sites and sources.”

About uClassify

uClassify can be used either via our free web API or by installing your own local server. For most users the web API is enough, as it’s free and has a generous license (you are allowed to use it commercially, just remember to link back). For developers who want to process large volumes of documents the web API may not be enough. In that case it’s possible to purchase your own uClassify server that can be installed locally.

UrlAi update

Up until now the development of urlai.com has mainly been an exercise in writing efficient SQL. The database itself has been filled with, if I recall correctly, millions of blog posts, and for each post 12 different classifications. Everything is looked up in the database first and, if it is not there, run through our classifiers. Now that we have fixed most of the bottlenecks in the SQL queries we have decided to start playing a bit with the data.
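The lookup pattern described above (check the database first, fall back to the classifiers on a miss, store the result) can be sketched like this in Python with SQLite. The table layout, column names and function names are invented for the example and are not the actual urlai.com schema.

```python
import sqlite3


def create_table(conn):
    # Hypothetical schema: one row per (post, classifier) pair.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS classifications ("
        "post_url TEXT, classifier TEXT, label TEXT, "
        "PRIMARY KEY (post_url, classifier))"
    )


def get_classification(conn, post_url, classifier_name, classify):
    """Return the stored label, or classify the post and store the result."""
    row = conn.execute(
        "SELECT label FROM classifications WHERE post_url = ? AND classifier = ?",
        (post_url, classifier_name),
    ).fetchone()
    if row is not None:
        return row[0]  # already in the database, no classifier call needed

    label = classify(post_url)  # the expensive step
    conn.execute(
        "INSERT INTO classifications (post_url, classifier, label) VALUES (?, ?, ?)",
        (post_url, classifier_name, label),
    )
    conn.commit()
    return label


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_table(connection)
    fake_mood_classifier = lambda url: "happy"  # stand-in for a real classifier
    print(get_classification(connection, "http://example.com/post", "mood", fake_mood_classifier))
```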

Ranking

Our first feature was blog ranking. We started with ranking based on mood (upset or happy). Once this is stable and scalable we will add gender ranking as well (most manly/feminine bloggers).

Future

We have some ideas for how to move urlai further in the future. As we are gathering a lot of classified data, we imagine there should be some value in running text searches through all the posts and plotting the results with respect to the classifiers.

Also to improve our classifiers we are looking into user reinforced training. For example, when a blog is shown, users get an opportunity to leave classifier feedback. “Hey I’m not 60-100 years!”

We are definitely going to add more classifiers as well.

We also would like to add some cool visualization of the classifiers – perhaps one that even makes it possible to zoom in search set->blogger->posts->specific words. Perhaps with the cool GapMinder tool?

Download evaluation server

It’s now possible to evaluate the uClassify server locally. We have built a new version of the server that can be downloaded freely and executed on Windows operating systems.

With the evaluation version you can test one of the server’s key features, the classification speed, without having to go via the web API, which can be slow for large volumes of data that have to be sent over the web. This is important for anyone who wants to make sure it can handle big volumes of data before purchasing a commercial license.

The only restriction is that it has to be restarted for every 10000 calls – just as a reminder as it’s only for evaluation =)

Have a look at the server manual and download it, and let us know what you think!

Bloggitik.se – Swedish political blogs classified on subject

Bloggitik, a new service in Swedish based on uClassify, has been launched. It collects political blogs and automatically categorizes each post. This makes it easier for users to find their way among all the blogs. The people at bloggitik.se have also made sure that the system learns as it’s being used: when a post is classified into the wrong category, readers can correct this and the system will improve over time.

This is a really cool use of our service. Best of luck!

Sites using uClassify

I’ve started to list sites that are using uClassify, and I’ll update this list every now and then. Please comment if your site is missing.

UrlAi.com

Classifies blogs on gender, age, mood and tonality. Blogs are followed over time to give more accurate results.

TrollGuard.com

TrollGuard is a free WordPress plugin that protects your blog from spam comments.

Typealyzer.com

This innovative site works out a blog author’s personality using psychological text analysis. Check it out!

GenderAnalyzer.com

This neat web site figures out if a blog is written by a man or a woman, using the uClassify web service.

AgeAnalyzer.com

Tries to guess the author’s age from reading a blog.

oFaust.com

See which classical author your text most resembles. Perhaps it can be used to help write texts in e.g. Shakespearean style! It highlights words and sentences that are characteristic of the author.

URL Profiler

A powerful tool for SEOs to quickly audit links, content & social data.

The News Marketplace

The News Marketplace gathers the latest and most popular news stories from the top Maltese news sites and automatically categorizes and rates each article based on its semantics.

EpiSPIDER AI Mashup 2.0

EpiSPIDER is a tool that demonstrates connectivity between “consumers”, “producers” and “transformers” of data within an emerging information and knowledge architecture.

hombreomujer.com

Tries to guess an author’s gender, for Spanish texts.

fidelofranco.com

Are you writing like Fidel or like Franco? For texts in Spanish.

tuideologia.com

Tries to guess left/right wing alignment for Spanish texts.

Trve vs Emo

Is a text more aligned with black metal bands such as Mayhem, Burzum and Darkthrone, or with emo bands like My Chemical Romance and Fall Out Boy?

MDG Actors

Highlights the people who are making a big difference in reaching the Millennium Development Goals and calls out the people who aren’t. Not sure if this site is still working.

BloggParti.se

Tests texts against the major political parties in Sweden. Only for Swedish texts.

Bloggitik.se

Swedish political blog posts are automatically sorted by subject.

UrlAi.com – who are you?

We have created a new service called UrlAi.com. The basic concept is to run blog posts through a bunch of classifiers over time. To begin with we use Gender, Age, Mood and Tonality, but the system is dynamic so we can add new classifiers at any time. If you have created a classifier that would fit on urlai.com, let us know!

Some ideas

We have many ideas for how we can develop this project further. For example, right now we only show a summary pie chart, and it would be nice to see posts over time. User feedback for online training and classifier improvement may be possible. Another thing we could do is make classified posts searchable, for example enabling users to see the mood of everyone who mentioned ‘Avatar’.

Some kudos

Just want to thank the people who have been involved in this project: Roger Karlsson for coding, Johanna Forsman for the awesome logo and Mattias Östmar for sharing his Tonality and Mood classifiers. Mattias has also contributed many ideas around this, being the idea fountain he is 😀

Artificial Intelligence to determine an author’s age

We have just released ageanalyzer.com, a site that reads a blog and guesses the age of the author!

Background

Our writing style reflects us in many ways. For example, texts written in anger probably differ from texts written in joy. Reading a text intuitively gives you a clue about the author as you start forming a picture in your head. Sometimes it’s easy to pinpoint how you got this picture, and at other times it’s harder.

We wanted to know if we could give computers the same intuition. In this particular project we are interested in finding out if a computer can tell the age of an author given only a text.

To do this experiment we collected 7000 blogs that had age information in the profile and split them into 6 different age groups: 13-17, 18-25, 26-35, 36-50, 51-65 and 65+. We then created a classifier on uClassify and fed it with the training data. Voilà!

Expected results

After running tests on the training data (10-fold cross-validation) it was clear that our classifier was able to find differences between the six age groups. We expect the proportion of correctly classified blogs to be around 30%, compared to a baseline of 17%, which is what you would expect if the classifier were guessing at random.
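To make the numbers above a bit more concrete, here is a small Python sketch of the random-guess baseline and of how 10-fold cross-validation splits the data. It is an illustration only, not the code we ran. The baseline of one in six assumes the age groups are about equally likely, and the simple striding split is just one of many ways to make folds.

```python
def chance_baseline(num_classes):
    """Expected accuracy when guessing uniformly among equally likely classes."""
    return 1.0 / num_classes


def k_fold_splits(items, k=10):
    """Yield (train, test) pairs where each item is in the test set exactly once."""
    folds = [items[i::k] for i in range(k)]  # every k-th item goes to the same fold
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


if __name__ == "__main__":
    print(f"baseline for 6 groups: {chance_baseline(6):.0%}")
    blogs = list(range(7000))  # stand-ins for the 7000 collected blogs
    for train, test in k_fold_splits(blogs, k=10):
        print(len(train), len(test))
```

In each round the classifier is trained on the train part and scored on the test part, and the ten scores are averaged.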

We have added a poll to the site to help us see how well (or poorly) it works!

Try AgeAnalyzer out here!