Posts Tagged ‘classifier’

Tutorial - Creating your own classifier

Wednesday, December 17th, 2008

This is a brief tutorial on how to create your own classifier. I use the term class synonymously with category, and classifier synonymously with categorizer.

1. Determine the classifier domain

Before a classifier can start to classify, it needs to be created and trained. First, ask yourself what you want the classifier to do: is it a spam filter? A news categorizer? Let’s assume it’s a news categorizer for this tutorial. So we create a classifier named ‘Example News Categorizer’.

Fig 1. Create the classifier

2. Define the relevant classes

Second, you need to define which classes your classifier should include. Choosing relevant classes is straightforward - just ask yourself what categories are relevant for the domain you have chosen. Once you have selected the classes you want the classifier to distinguish between, you create them. This is easy in our Graphical User Interface but can also be done via our web API. For our small example we create the following three classes: Science, Sports and Entertainment. You can create as many classes as you want.

Fig 2. Create the classes (categories)

You can also add and remove classes dynamically - so don’t worry if you aren’t 100% sure that you have included them all.

3. Collect training data

Before the classifier can start to categorize texts into the classes, we need to teach it what texts belonging to the different classes look like. This is the hardest part as it requires you to collect actual training data. You can collect it from any source you find appropriate.

3.1 Amount of training data

It’s hard to generalize about the amount of text needed for a classifier to work, as it’s highly dependent on the domain. Simple domains, such as classifying the language of a text, only require a small amount, while harder problems, such as telling apart texts written by males and females, require much more training data. However, to test an idea I suggest at least 20 documents per category, with each document in the same format as those that will be classified later (e.g. for a spam filter you train it on e-mails). 20 is the bare minimum - from there the classifier only gets more accurate.

For our news categorizer I collected 20 plain text articles per class from random sources on the Internet.

3.2 Automate the collecting!

In some cases you can automate the data collection by finding trusted sources on the Internet. For example, for our news classifier I could jack into three RSS feeds for Science, Sports and Entertainment and automatically gather the data. Ahhh, no manual collecting!! Nice.
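As a rough sketch of the idea, the snippet below pulls the title and description out of every item in an RSS 2.0 document using only Python's standard library. The sample feed and the `extract_items` name are made up for illustration; in practice you would first download each feed's XML and then store the texts per class.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 document standing in for a real feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Science Feed</title>
    <item>
      <title>New exoplanet found</title>
      <description>Astronomers report a new planet.</description>
    </item>
    <item>
      <title>Gene study published</title>
      <description>Researchers map a genome.</description>
    </item>
  </channel>
</rss>"""


def extract_items(rss_xml):
    """Return one training text (title + description) per <item> in the feed."""
    root = ET.fromstring(rss_xml)
    texts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        description = item.findtext("description", default="").strip()
        texts.append((title + "\n" + description).strip())
    return texts


# Hypothetical mapping from class name to the texts gathered for it.
training_data = {"Science": extract_items(SAMPLE_FEED)}
```

Feed descriptions are often short teasers, so for real training data you may want to follow each item's link and use the full article text instead.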

4. Train the classifier

So you have collected training data in some form (perhaps text files on your hard drive, lists of URLs or some feeds). Now it’s time to train the classifier. This can be done manually in the GUI or automated if you have some basic programming skills. For our tutorial I found 20 news articles per class and copied and pasted them manually into the GUI. It took me about 30 minutes.

Fig 3. Training the classifier via the GUI

4.1 Automate the training! (requires novice programming skills)

Training a classifier through the GUI can be cumbersome if you have a large amount of training data. My suggestion is to create a small script in your favorite language that automatically trains the classifier. If your training data is lying around locally on your machine (perhaps automatically collected? =) you can just batch it into our web API. If you haven’t collected the training data yet, you could create a script that automatically collects it and trains the classifier with it!
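Such a script could look roughly like the sketch below. It assumes a folder layout of one sub-folder per class containing plain `.txt` files, which is my own convention for this example. The `train` argument is a placeholder for whatever actually sends a text to the classifier (for example a request to the web API); it is not a real uClassify function.

```python
from pathlib import Path


def load_training_texts(root):
    """Yield (class_name, text) pairs from a root/<class_name>/*.txt layout."""
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        for text_file in sorted(class_dir.glob("*.txt")):
            yield class_dir.name, text_file.read_text(encoding="utf-8")


def train_all(root, train):
    """Feed every document to `train` and return how many were sent.

    `train(class_name, text)` is a caller-supplied, hypothetical function
    that performs the actual training call.
    """
    count = 0
    for class_name, text in load_training_texts(root):
        train(class_name, text)
        count += 1
    return count
```

Passing the training call in as a function keeps the file handling separate from the API details, so you can first try the script with a `train` that simply prints what it would send.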

5. Start classifying

This is the fun part: once you have created your classifier you can start to use it. You can always test it in our GUI. Further, you can (and should) build your own web site around it via our web API - providing the world with more semantics and cool classifications that have never been seen before! Also, remember that you can use your classifiers commercially and make money from them!

I’ve published the example classifier. Don’t expect it to work perfectly - it has only been trained on 20 articles per class! Test it here - Example News Categorizer

Summary

  • Decide what you want to classify and create a classifier
  • Define and create the categories
  • Collect training data for each category
  • Train each category on the gathered data
  • Build a really cool web site around it!

What’s your mood?

Tuesday, December 2nd, 2008

Today, 2 months after our launch, our users have created over 200 classifiers. Most are unpublished and under construction. PRfekt, the team behind the popular Typealyzer, recently published a new classifier that determines the mood of a text - whether a text is happy or upset. You can try it for yourself here!

So let’s test some snippets!

Jamis is (justly) upset and writes:

Is anyone else annoyed by the “just speak your choice” automation in so many telephone menus? I feel like an idiot mumbling “YES!” or “CHECK BALANCE!” into my phone. Maybe it’s the misanthrope in me coming to the front, but I’d much rather push buttons than talk to a pretend person.

The mood classifier says 98.1% upset.

Spam is no fun either, or as Ed-Anger notes:

“I’m madder than a rooster in an empty hen house at Internet spammers and I won’t take it anymore. Those creeps clutter up my e-mail with their junk, everything from penis enlargement pills to some lady telling me she’ll give me a million dollars if I’ll help her get her money out of Africa. “Rush me 10 grand quick as possible and we’ll get the whole thing started,” she says.”

The mood classifier says 97.0% upset.

Now over to some happy blogs. amour-amour has a confession:

“I love my iphone in a way I never thought possible!! When my fiance got his and spent 23 hours gazing at it lovingly, uploading (or is it downloading??) apps and buying accessories for it I put it down to him just being a technology geek.”

The mood classifier says 79.8% happy.

Finally, Nitwik Nastik comments on a Ricky Gervais routine:

“This is a hilarious stand-up routine by British Comedian Ricky Gervais on Bible and Creationism. It’s really funny how he ridicules the creationist stories from the book of Genesis (the book of genesis can be found here)and point out to it’s obvious logical blunders. Sometimes it may be difficult to understand his accent and often he will make some funny comments under his breath, so try to listen carefully.”

The mood classifier says 69.7% happy.

The author recommends at least two hundred words (more text than my samples) which seems reasonable!

GenderAnalyzer thoughts

Saturday, November 22nd, 2008

First, thanks to everyone who is testing GenderAnalyzer; we have had incredible feedback. We received emails from many people who are fascinated and a few who think it sucks =) GenderAnalyzer is still generating a lot of traffic and people are blogging about it.

Our learnings

Determining the gender of an author is not easy. Besides the classification itself, there is a chain of technical events that must work in order to get a reliable result. As many of you have noticed, the accuracy has dropped to 53%, which is far lower than expected based on our tests. There may be several reasons for this low accuracy and I will mention some of them here.

  • Our training data of 2000 blogs is automatically collected from blogspot. Running internal tests (10 fold cross validation) on this data gives us an accuracy of 75%. This effectively means “Given that the corpus is a perfect representation of real world data, the classifier gives any real world data the correct label with a probability of 75%”. So our training data is probably not very representative; as a matter of fact it’s very stereotypical (see for yourself here). Using data from all kinds of sources should give us a better model.
  • When someone is testing a blog we are not crawling through posts on the blog to get a good amount of text. We are only hitting the given URL and using the text (and HTML) that appears there as test data. So a page with mostly images or frames will give bad test data. Does anyone know a nice library that, given a URL, crawls blog posts? Via RSS perhaps?
  • We are trying to encode test data to UTF-8, which is the format of the training data - it could be that we are missing some encodings.
  • And of course - perhaps the difference between male and female writing is simply not significant?
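To illustrate the encoding point above, here is a minimal sketch of normalizing fetched bytes to UTF-8 by trying a short list of candidate encodings. The candidate list and the `to_utf8` name are assumptions made for this example; it is not the code GenderAnalyzer uses.

```python
def to_utf8(raw, candidates=("utf-8", "cp1252")):
    """Decode raw bytes with the first candidate encoding that fits,
    then re-encode the text as UTF-8."""
    for encoding in candidates:
        try:
            return raw.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never fails,
    # although the result may be wrong if the real encoding was something else.
    return raw.decode("latin-1").encode("utf-8")
```

A page that declares its encoding in the HTTP headers or in a meta tag should of course be decoded with that encoding first; guessing like this is only a fallback.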

What’s next?

We are currently collecting a new set of training data that is much more representative. We will switch to this classifier during the next week and start a new poll for it. It’s going to be very exciting!

Spam, huh?

Thursday, October 23rd, 2008

We are currently working on a prototype to identify spam blogs - splogs. Spam blogs can be really tricky to identify even to the human eye, as i-trepreneur.com writes in a recent post:

Why? These Splogs are user friendly. They were not made for search engines but for real visitors. There’s excellent design, well organized sections, working RSS feed. All the information on such Splogs is manually selected from the most popular resources on the net and is properly referenced. Only fresh content is used so it is not identified as duplicate instantly.

The post points out that madconomist dot com and business-opportunities dot biz are two well-made splogs that people are commenting on and linking to. I can’t tell just by looking at them with my bare eyes - so is it spam, huh? A later post on that philosophical aspect!

A prototype

We have set up a prototype to identify spam blogs. Right now it’s really rudimentary but shows potential. In the future, by using clusters of classifiers hosted here at uClassify, we think we can create a sufficiently good splog classifier.

Check out the project here, www.spamhuh.com. Remember that it’s only an early prototype!

Concerning the two hard-to-detect spam blogs above, spamhuh.com is able to correctly identify one of them :)

Try it out and let us know what you think!!

Everybody can classify

Sunday, October 19th, 2008

Creating your own classifiers has never been easier: we have developed a Click’n’Classify Graphical User Interface (GUI). This means that you can manually create and train your classifiers without knowing any programming at all. This is a very good way to test an idea. If the classifier works well, build your web site around it or use it for whatever purpose.

The GUI allows you to do everything that you can do via our Application Programming Interface (API). Also, just like phpMyAdmin shows the SQL queries, our uClassify GUI will show the XML queries so you can easily understand and use the API from your site.

Features

  • Create and remove classifiers
  • Add and remove classes
  • Train and untrain classes
  • See basic information about your classifiers

Screenshot - Create a classifier

This screenshot shows what it looks like when you are about to create a classifier. Just log in and try it yourself!

Creating a classifier is easy

Screenshot - Training a classifier

Just copy and paste the texts you want to use as training data.

Training a classifier is easy

Happy classifying!

Click’n’classify

Monday, October 13th, 2008

As we suspected, most users who sign up think the threshold to get started is too high, as it requires some programming to create and train classifiers. Therefore we have decided to add more GUI features that allow users to make all the API calls without any programming! Once classifiers are set up, developers can start building their web applications around them via the API.

Copy’n’code

All GUI driven API calls will display the generated XML so that users can easily see what’s going on and copy the XML directly into their code (much like phpMyAdmin does with SQL queries).

We expect this to take a couple of weeks.

Developing the development

Thursday, October 9th, 2008

Since we released the beta version a couple of weeks ago we have seen a few websites pop up building on the uClassify technology. This is very encouraging for us! Right now we are trying to reach out to more users who want to use our classifier API.

We have spent a lot of time developing our service, making it parallel, robust, low on memory, fast and so on. This is what we are really good at. The remaining part, which is just as important - reaching out to users, advertising ourselves and being seen in the right places - is not our sharpest skill.

Besides writing this blog and posting the uClassify link on a couple of sites, we haven’t done much to show our muscles - yet! We thought that perhaps we should use our own API ourselves - that is probably an easier way to create some buzz! We have a couple of ideas to make ourselves seen (feel free to use these ideas yourself):

Build an Anti Spam Comment Plugin for WordPress?

We are quite confident that we could do really well, as the classifier engine has shown really good results in Cactus Spam Filter. This would compete with, or be a good complement to, Akismet, Defensio and similar. Is there anyone who needs another blog spam comment filter?

Build a Spam Blog Filter?

This seems to be a problem for many blog communities, so building a splog (spam blog) filter could give us some good attention. What would be really nice is if somebody could provide us with dynamic training data on splogs and blogs - then we could automate the training process and find the undetected spam! Anyone who wants to donate their spam? :)

Implement a JSON API for uClassify?

Building a JSON API would not only broaden our API (XML only right now), it would also let users use our classification service via Yahoo! Pipes. Yahoo! Pipes lets you combine different RSS feeds into one and use external web services (via JSON) - which is madly cool.

Language Detection - talar du svenska?

We already have a language detection classifier (not published yet) that only needs training data refinement (removal of noise such as English words in the Filipino class). It supports 40 languages. This would be fairly simple and could give us some buzz.

Ideas, anyone!

Do you have any ideas? Let us know - or use the uClassify API to create your own classifier (spam filter, language detection or whatever comes to your mind).

Classifier performance - part II

Wednesday, October 8th, 2008

In the first part I explained some guidelines to keep in mind when selecting a test corpus. In this part I will give a brief introduction to how to run tests on your corpuses. Given a corpus of labeled documents, how can it be used to determine a classifier’s performance? There are several ways. One of the simplest is to divide it into two parts and use one part for training and the other for testing. How the corpus is divided affects the measured performance, so the result is very likely to be biased. There is also a great data loss, since each half serves only one purpose (50% is never used for training and 50% is never used for testing). We can do a lot better.

Leave one out cross validation (LOOCV)

A well-established technique is to train on all documents except one, which is left out and used for testing. This procedure is repeated so that every document has been used for testing once, and the performance is then averaged over all the runs. An advantage of this method is that it uses almost the full corpus as training data (no data waste). The downside is that it’s expensive, as it must be repeated as many times as there are documents. k fold cross validation reduces this cost by dividing the corpus into k piles.
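The leave one out procedure can be sketched in a few lines of Python. The `train` and `classify` arguments are hypothetical callables standing in for your own classifier; only the splitting and averaging logic is the point here.

```python
def leave_one_out(documents):
    """Yield (training_set, held_out_document) pairs, one per document."""
    for i, held_out in enumerate(documents):
        training = documents[:i] + documents[i + 1:]
        yield training, held_out


def loocv_accuracy(documents, train, classify):
    """Average accuracy over all leave one out runs.

    `documents` is a list of (text, label) pairs. `train(training)` builds
    a model from such a list and `classify(model, text)` returns a label;
    both are placeholders supplied by the caller.
    """
    correct = 0
    for training, (text, label) in leave_one_out(documents):
        model = train(training)
        if classify(model, text) == label:
            correct += 1
    return correct / len(documents)
```

Note that the model is retrained once per document, which is exactly why this method gets expensive on large corpuses.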

k fold cross validation

Perhaps the most common way to run tests is to use k fold cross validation. This means that k-1 parts of the corpus are used for training and 1 part for testing. This is repeated k times so that every part of the corpus is used once for testing and k-1 times for training. 10 fold cross validation is commonly used. In that case, start by training the classifier on parts 2 to 10 and testing it on part 1, then train it on parts 1 and 3 to 10 and test it on part 2, and so on. For every rotation the performance is measured. When the tests have completed, the performance is averaged. Using k fold cross validation gives a more robust performance measure, as every part is used as both training and test data.
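Here is a minimal sketch of the rotation described above. The `evaluate` argument is a hypothetical function that trains on one list, tests on the other and returns a score; how the corpus is dealt into parts (round-robin here) is my own choice for the example.

```python
def k_fold_splits(documents, k):
    """Yield (training, testing) pairs for k fold cross validation."""
    # Deal the documents into k parts round-robin so sizes differ by at most one.
    folds = [documents[i::k] for i in range(k)]
    for i in range(k):
        testing = folds[i]
        training = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        yield training, testing


def cross_validate(documents, k, evaluate):
    """Average the score of `evaluate(training, testing)` over the k rotations."""
    scores = [evaluate(training, testing)
              for training, testing in k_fold_splits(documents, k)]
    return sum(scores) / len(scores)
```

If your corpus is sorted by class or by date, shuffle it before splitting, otherwise some parts may contain only one class.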

Remember from part one that it can be useful to vary the size of the corpus, scaling it from a small magnitude to a greater and using unbalanced data.

Summary

  • Don’t use test methods just because they are simple - the results will probably fool you.
  • Use an established method, such as k fold cross validation or leave one out.
  • Always remember to specify what method you have used together with the results.

In the next part I’ll show how performance actually can be measured! Happy classifying until then!

Classifier performance - Part I

Monday, October 6th, 2008

There are several different classifiers, to name a few: Naive Bayesian Classifiers, Support Vector Machines, k Nearest Neighbor and Neural Networks. A crucial factor when choosing a classifier is the performance - how well it classifies the data. There are several methods for measuring how well a classifier performs. In three parts I will try to give an idea of how to avoid common pitfalls.

Part I: Choosing test corpus
Part II: Running tests
Part III: Measuring the performance

What test corpus should I use? Use many!

This is perhaps the hardest part when trying to determine the performance of a classifier. Every subset of data is a model that is likely to be biased. Therefore you should always question what data (corpus) the tests are carried out on. For example, a classifier that reports high performance on a specific corpus is likely to have a different performance on real world data (and often lower! - to not look stupid in comparison to other classifiers, they are tuned for the test corpus, but this bias may degrade performance on other corpuses). Using many relevant corpuses can help avoid a classifier becoming too narrowed down (specialized) to one specific corpus.

In “The fallacy of corpus anti-spam evaluation” Vipul explains why many anti-spam researchers’ results may not say much, since the test corpus is static. While researchers are busy measuring their performance on a corpus from 2005 (TREC2005), spammers today have had three years to figure out how to fool their spam filters… I completely agree, it’s almost like

- Look everyone, I’ve spent years inventing a highly accurate long bow to shoot down our nemesis Spam! It works really well back in my training yard!

- Did you not hear? Spam evolved, they now come in F-117 Nighthawk Stealth Fighters cloaked by deceiving words, even if you could see it - an arrow couldn’t scratch it.

- Oh, dear...

Small or large test sets? Both!

The size of the test data is also vital. Using too small a test set says more about how the classifier will perform during the training phase than after it. Using too large and rich training sets may invoke overfitting (so much data that seemingly nonsense tests will fit your model). The best approach is to measure performance at different sizes, scaling the training set from a few documents to the full set. Unfortunately this is often disregarded.

Hint: A benefit of measuring the performance as a function of a scaling corpus is that you can predict how much training data is needed to reach a certain level of performance. Just project the performance curve beyond the size of the test corpus.

Balanced or unbalanced training sets? Both!

The test corpus can be heavily unbalanced, meaning that one of the classes is overrepresented and another is underrepresented in number of documents. As Amund Tveit points out in his blog:

“Quite frequently classification problems have to deal with unbalanced data sets, e.g. let us say you were to classify documents about soccer and casting (fishing), and your training data set contained about 99.99% soccer and 0.01% about casting, a baseline classifier for a similar dataset could be to say - “the article is about soccer”. This would most likely be a very strong baseline, and probably hard to beat for most heavy machinery classifiers.”

In many cases it’s desirable to run tests on unbalanced test sets. For example, imagine that you get 2000 spam e-mails and 10 legitimate ones every day. You decide to install a machine learning spam filter. The spam filter requires training, so each day you train it on 2000 spam and 10 legitimate e-mails. This creates unbalanced training data for your classifier, and it’s extremely important that your spam filter doesn’t mark your legitimate e-mails as spam (could be a matter of life, death, peace or war - literally).
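To see why overall accuracy alone is misleading on such data, the sketch below computes the accuracy of a baseline that always answers with the most common class, plus the fraction of each class it gets right. The function names are my own; the numbers are the ones from the example above.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


def per_class_recall(true_labels, predicted_labels):
    """Fraction of each class's documents that were labeled correctly."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# One day of mail from the example above: 2000 spam and 10 legitimate.
labels = ["spam"] * 2000 + ["legitimate"] * 10
label, accuracy = majority_baseline(labels)
recall = per_class_recall(labels, [label] * len(labels))
print(label, round(accuracy, 4), recall)
```

The baseline is right about 99.5% of the time (2000 out of 2010) while marking every single legitimate e-mail as spam, so always report per-class results next to the overall figure.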

Summary

  • Understand that corpuses are biased and therefore also the test results.
  • Use up-to-date corpuses if the classification domain is dynamic.
  • Make sure the test data is as representative as possible for the domain. E.g. don’t trust test results from a spam corpus to apply on sentiment classification.
  • Prefer running tests on many different corpuses.
  • Run tests on the corpuses as they scale in size.
  • Make sure that the classifier is robust on unbalanced training data, especially when correct classifications can be a matter of life and death.

Gender Text Analysis

Saturday, October 4th, 2008

Do males and females express themselves differently in text? Yes, if we look at the research carried out at the University of Texas. In the article “Effects of Age and Gender on Blogging” [1] it’s found that author gender can be determined with an accuracy of 80% by looking at a text. This is achieved with a classifier trained on 37478 blogs written by males and females at blogger.com.

Gender stereotypes in the blogosphere

The research also shows the most discriminating terms for males and females (using information gain).

Male favorite words

- linux
- microsoft
- gaming
- server
- software
- gb
- programming
- google
- data
- graphics
- india
- nations
- democracy
- users
- economic

Female favorite words

- shopping
- mom
- cried
- freaked
- pink
- cute
- gosh
- kisses
- yummy
- mommy
- boyfriend
- skirt
- adorable
- husband
- hubby
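For the curious, here is a small sketch of how information gain can be computed for a single term, using the standard textbook definition: the entropy of the class label minus the entropy that remains once you know whether the term occurs. The paper’s exact setup may differ, and any counts you feed it here are your own.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(with_term, without_term):
    """Information gain of a term for the class label.

    `with_term` and `without_term` are per-class document counts, e.g.
    [male_docs, female_docs], for documents that do and do not contain
    the term.
    """
    total_counts = [a + b for a, b in zip(with_term, without_term)]
    n = sum(total_counts)
    n_with, n_without = sum(with_term), sum(without_term)
    conditional = 0.0
    if n_with:
        conditional += (n_with / n) * entropy(with_term)
    if n_without:
        conditional += (n_without / n) * entropy(without_term)
    return entropy(total_counts) - conditional
```

A term that appears equally often in both classes scores close to zero, while a term used by only one class scores high, which is how lists like the ones above fall out.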

They conclude “Male bloggers of all ages write more about politics, technology and money than do their female cohorts. Female bloggers discuss their personal lives – and use more personal writing style – much more than males do.”

Try it on your blog

GenderAnalyzer.com uses the same approach as described in the article; they have collected 2000 blogs from blogger.com written by men and women. They also have a poll which allows us to see how well it’s working. As we speak it has an accuracy of 70%.

Trying this blog in the analyzer gives us the correct answer:

Results
We think http://blog.uclassify.com is written by a man.

[1] J. Schler, Moshe Koppel, S. Argamon and J. Pennebaker (2006), Effects of Age and Gender on Blogging, in Proc. of AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs, March 2006. PDF