Fidel or Franco?

Just wanted to share that another interesting uClassify application has been created. This application classifies Spanish web pages as aligned with either Fidel or Franco. (Note that the text has to be in Spanish for it to work properly.)

Even though I think this page is intended for fun, I believe that such classifiers can be used for commercial purposes. Imagine a political party that wants to find bloggers with a right- or left-wing alignment to help drive their election campaign. They could run a huge number of blogs (collected from some data source) through the classifier and find the ones they are looking for.

Good work!

Classifier performance – Part II

In the first part I explained some guidelines to keep in mind when selecting a test corpus. In this part I will give a brief introduction to how to run tests on your corpuses. Given a corpus of labeled documents, how can it be used to determine a classifier’s performance? There are several ways; one of the simplest is to divide the corpus into two halves and use one half for training and the other for testing. How the corpus is divided affects the measured performance and is very likely to introduce bias, and there is also a great deal of data waste (half of the corpus is never used for training, the other half never for testing). We can do a lot better.
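
For reference, though, this is roughly what the naive 50/50 split looks like in code – a minimal sketch assuming Python with scikit-learn, a Naive Bayesian classifier and a tiny invented corpus:

    # Naive 50/50 split – half the corpus trains the classifier, the other
    # half tests it. The documents and labels below are invented toy data.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    docs = ["free offer win money", "buy cheap pills now", "claim your prize",
            "meeting agenda attached", "lunch tomorrow?", "project status update"]
    labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

    train_docs, test_docs, train_labels, test_labels = train_test_split(
        docs, labels, test_size=0.5, stratify=labels, random_state=0)

    vectorizer = CountVectorizer()
    classifier = MultinomialNB()
    classifier.fit(vectorizer.fit_transform(train_docs), train_labels)

    # Accuracy on the held-out half – note that half of the corpus never
    # contributes to training at all.
    print("accuracy on the held-out half:",
          classifier.score(vectorizer.transform(test_docs), test_labels))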

Leave one out cross validation (LOOCV)

A well-established technique is to train on all documents except one, which is left out and used for testing. This procedure is repeated so that every document is used for testing exactly once, and the performance is averaged over all the runs. An advantage of this method is that it uses almost the full corpus as training data (no data waste). The downside is that it’s expensive, since training must be repeated as many times as there are documents. k-fold cross validation, described next, eases this problem by dividing the corpus into k piles instead.
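
A rough sketch of LOOCV, again assuming Python with scikit-learn and an invented toy corpus (a real test would of course use your own labeled documents):

    # Leave one out cross validation: one round per document, where that
    # document is the test set and all the others form the training set.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Invented toy corpus – replace with your own labeled documents.
    docs = ["free offer win money"] * 10 + ["meeting agenda project notes"] * 10
    labels = ["spam"] * 10 + ["ham"] * 10

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    scores = cross_val_score(model, docs, labels, cv=LeaveOneOut())

    # One score per document; the average is the LOOCV performance.
    print("LOOCV accuracy:", scores.mean())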

k-fold cross validation

Perhaps the most common way to run tests is k-fold cross validation. This means that k-1 parts of the corpus are used for training and 1 part for testing. The procedure is repeated k times so that every part of the corpus is used once for testing and k-1 times for training. 10-fold cross validation is a common choice: start by training the classifier on parts 2-10 and testing it on part 1, then train it on parts 1 and 3-10 and test it on part 2, and so on. For every rotation the performance is measured, and when all rotations have completed the results are averaged. k-fold cross validation gives a more robust performance measure, as every part is used both as training and as test data.
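
In code, 10-fold cross validation could look something like this – again only a sketch, assuming scikit-learn and an invented corpus:

    # 10-fold cross validation: split the corpus into 10 parts, train on 9
    # and test on the remaining one, rotating until every part has been the
    # test set once, then average the 10 scores.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Invented toy corpus – replace with your own labeled documents.
    docs = ["free offer win money"] * 10 + ["meeting agenda project notes"] * 10
    labels = ["spam"] * 10 + ["ham"] * 10

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    folds = KFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(model, docs, labels, cv=folds)

    print("10-fold accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))

Swapping the KFold object for the LeaveOneOut object gives the LOOCV test from the previous section.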

Remember from part one that it can also be useful to vary the size of the corpus, scaling it from small to large, and to run tests on unbalanced data.

Summary

  • Don’t pick a test method just because it is simple – the results will probably fool you.
  • Use an established method, such as k-fold cross validation or leave one out cross validation.
  • Always specify which method you used when reporting the results.

In the next part I’ll show how performance can actually be measured! Happy classifying until then!

Classifier performance – Part I

There are several different classifiers; to name a few: Naive Bayesian classifiers, Support Vector Machines, k-Nearest Neighbors and Neural Networks. A crucial consideration when choosing a classifier is its performance – how well it classifies the data. There are several methods for measuring how well a classifier performs, and in three parts I will try to give an idea of how to avoid common pitfalls.

Part I: Choosing test corpus
Part II: Running tests
Part III: Measuring the performance

What test corpus should I use? Use many!

This is perhaps the hardest part of determining the performance of a classifier: every subset of data is a sample that is likely to be biased. Therefore you should always question what data (corpus) the tests are carried out on. For example, a classifier that reports high performance on a specific corpus is likely to perform differently on real-world data (and often worse! – to avoid looking bad in comparison with other classifiers, they are tuned for the test corpus, but this bias may degrade performance on other corpuses). Using many relevant corpuses helps avoid a classifier becoming too narrowly specialized to one specific corpus.

In “The fallacy of corpus anti-spam evaluation”, Vipul explains why many anti-spam researchers’ results may not say much, since the test corpus is static. While researchers are busy measuring their performance on a corpus from 2005 (TREC 2005), spammers have had three years to figure out how to fool their spam filters… I completely agree; it’s almost like:

– Look everyone, I’ve spent years inventing a highly accurate longbow to shoot down our nemesis, Spam! It works really well back in my training yard!

– Did you not hear? Spam has evolved; it now comes in F-117 Nighthawk stealth fighters cloaked by deceiving words, and even if you could see it, an arrow couldn’t scratch it.

– Oh, dear...

Small or large test sets? Both!

The size of the test data is also vital; too small a test set says more about how the classifier will perform during the training phase than after it. Too large and rich a training set may invite overfitting (so much data that seemingly nonsensical patterns will fit your model). The best approach is to measure performance at several different sizes, scaling the training set from a few documents up to the full set. Unfortunately this is often disregarded.

Hint: A benefit of measuring performance as a function of corpus size is that you can predict how much training data is needed to reach a certain level of performance – just project the performance curve beyond the size of the current corpus.
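
One way to get such a curve – sketched here with scikit-learn’s learning_curve helper and an invented corpus, so treat it as an illustration rather than a recipe – is to train on progressively larger slices and record the accuracy at each size:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import learning_curve
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Invented toy corpus – replace with your own labeled documents.
    docs = ["free offer win money"] * 50 + ["meeting agenda project notes"] * 50
    labels = ["spam"] * 50 + ["ham"] * 50

    model = make_pipeline(CountVectorizer(), MultinomialNB())

    # Train on 10%, 20%, ..., 100% of the training data (5-fold cross
    # validation at each size) and record how the held-out accuracy grows.
    sizes, _, test_scores = learning_curve(
        model, docs, labels, train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5, shuffle=True, random_state=0)

    for size, score in zip(sizes, test_scores.mean(axis=1)):
        print("%3d training documents -> accuracy %.3f" % (size, score))

Plotting those numbers and extending the curve past the right-hand edge is the projection mentioned in the hint.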

Balanced or unbalanced training sets? Both!

The test corpus can be heavily unbalanced, meaning that one class is overrepresented and another underrepresented in the number of documents. As Amund Tveit points out in his blog:

“Quite frequently classification problems have to deal with unbalanced data sets, e.g. let us say you were to classify documents about soccer and casting (fishing), and your training data set contained about 99.99% soccer and 0.01% about casting, a baseline classifier for a similar dataset could be to say – “the article is about soccer”. This would most likely be a very strong baseline, and probably hard to beat for most heavy machinery classifiers.”

In many cases it’s desirable to run tests on unbalanced sets. For example, imagine that you get 2000 spam e-mails every day and 10 legitimate ones. You decide to install a machine-learning spam filter. The spam filter requires training, so each day you train it on 2000 spam and 10 legitimate e-mails. This creates unbalanced training data for your classifier, and it’s extremely important that your spam filter doesn’t mark your legitimate e-mail as spam (it could be a matter of life, death, peace or war – literally).
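
To see why plain accuracy hides this, here is a back-of-the-envelope sketch in plain Python, using the 2000/10 numbers from the example above, of how good a completely useless “mark everything as spam” filter looks:

    # 2000 spam and 10 legitimate e-mails per day, as in the example above.
    n_spam, n_legit = 2000, 10

    # A filter that blindly marks everything as spam gets every spam right
    # and every legitimate e-mail wrong.
    accuracy = n_spam / (n_spam + n_legit)
    print("accuracy of 'mark everything as spam': %.2f%%" % (100 * accuracy))
    # -> 99.50%

    # Yet 100% of the legitimate e-mail is lost – exactly the error that
    # must not happen in this scenario.
    print("legitimate e-mails marked as spam: %d of %d" % (n_legit, n_legit))

Any result reported on unbalanced data should at least beat this kind of do-nothing baseline.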

Summary

  • Understand that corpuses are biased, and therefore so are the test results.
  • Use up-to-date corpuses if the classification domain is dynamic.
  • Make sure the test data is as representative as possible of the domain. E.g. don’t trust test results from a spam corpus to apply to sentiment classification.
  • Prefer running tests on many different corpuses.
  • Run tests on the corpuses at several different sizes.
  • Make sure that the classifier is robust on unbalanced training data, especially when correct classifications can be a matter of life and death.