Classifier performance – Part I

There are many different classifiers; to name a few: Naive Bayesian classifiers, Support Vector Machines, k-Nearest Neighbors and Neural Networks. A crucial factor when choosing a classifier is its performance: how well it classifies the data. There are several methods for measuring how well a classifier performs. In three parts I will try to give an idea of how to avoid common pitfalls.

Part I: Choosing test corpus
Part II: Running tests
Part III: Measuring the performance

What test corpus should I use? Use many!

This is perhaps the hardest part of determining the performance of a classifier: every subset of data is only a model of the domain and is likely to be biased. Therefore you should always question what data (corpus) the tests are carried out on. For example, a classifier that reports high performance on a specific corpus is likely to perform differently on real-world data, and often worse. To avoid looking bad in comparison with other classifiers, classifiers are often tuned for the test corpus, and this bias may degrade performance on other corpuses. Using many relevant corpuses helps keep a classifier from becoming too narrowly specialized to one specific corpus.
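To make the point concrete, here is a minimal sketch of scoring one classifier against several corpuses instead of one. Everything in it is made up for illustration: the `keyword_filter` rule, the corpus names and the tiny hand-written documents are hypothetical stand-ins for a real classifier and real corpuses.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (document text, true label)


def accuracy(classify: Callable[[str], str], corpus: List[Example]) -> float:
    """Fraction of documents in the corpus that the classifier labels correctly."""
    if not corpus:
        raise ValueError("corpus must not be empty")
    correct = sum(1 for text, label in corpus if classify(text) == label)
    return correct / len(corpus)


def evaluate_on_corpora(
    classify: Callable[[str], str], corpora: Dict[str, List[Example]]
) -> Dict[str, float]:
    """Score the same classifier separately on every corpus."""
    return {name: accuracy(classify, docs) for name, docs in corpora.items()}


# Hypothetical rule "tuned" to the vocabulary of the first toy corpus.
def keyword_filter(text: str) -> str:
    return "spam" if "viagra" in text.lower() else "ham"


# Toy corpuses, invented for this example only.
corpora = {
    "old_corpus": [
        ("Cheap viagra now", "spam"),
        ("Viagra discount", "spam"),
        ("Lunch at noon?", "ham"),
        ("Meeting notes attached", "ham"),
    ],
    "new_corpus": [
        ("Ch3ap v1agra now", "spam"),
        ("You won a prize", "spam"),
        ("Lunch at noon?", "ham"),
        ("Invoice for March", "ham"),
    ],
}

scores = evaluate_on_corpora(keyword_filter, corpora)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

The rule looks perfect on the corpus it was tuned for and drops to coin-flip level on the other one, which is exactly the kind of gap a single-corpus test would hide.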

In “The fallacy of corpus anti-spam evaluation” Vipul explains why many anti-spam researchers’ results may not say much, since the test corpus is static. While researchers are busy measuring their performance on a corpus from 2005 (TREC2005), spammers today have had three years to figure out how to fool their spam filters… I completely agree; it’s almost like:

– Look everyone, I’ve spent years inventing a highly accurate longbow to shoot down our nemesis Spam! It works really well back in my training yard!

– Did you not hear? Spam evolved; it now comes in F-117 Nighthawk stealth fighters cloaked by deceiving words. Even if you could see one, an arrow couldn’t scratch it.

– Oh, dear...

Small or large test sets? Both!

The size of the test data is also vital. Using too small a data set says more about how the classifier performs early in the training phase than about how it performs once trained. Using a very large and rich training set carries its own risk: a classifier tuned on it may overfit (it fits so much detail that seemingly nonsensical patterns become part of your model). The best approach is to measure performance at different sizes, scaling the training set from a few documents up to the full set. Unfortunately this is often disregarded.

Hint: A benefit of measuring performance as a function of corpus size is that you can estimate how much training data is needed to reach a certain level of performance. Just extrapolate the performance curve beyond the size of the corpus.
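One way to do that projection is sketched below. It rests on an assumption of mine, not a law: that the error rate falls roughly as a power law of the training set size, error = a * size^(-b). The measurements in `sizes` and `errors` are invented for the example; in practice you would plug in the error rates you measured at each training set size.

```python
import math
from typing import List, Tuple


def fit_power_law(sizes: List[int], errors: List[float]) -> Tuple[float, float]:
    """Least-squares fit of error = a * size**(-b), done as a line in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def size_for_error(a: float, b: float, target_error: float) -> float:
    """Training set size at which the fitted curve reaches target_error."""
    return (a / target_error) ** (1.0 / b)


# Hypothetical measurements: error rate at increasing training set sizes.
sizes = [100, 400, 1600, 6400]
errors = [0.20, 0.10, 0.05, 0.025]

a, b = fit_power_law(sizes, errors)
needed = size_for_error(a, b, target_error=0.01)
print(f"fitted curve: error = {a:.2f} * size^(-{b:.2f})")
print(f"estimated documents needed for 1% error: {needed:.0f}")
```

Treat the number as a rough estimate only. Real learning curves tend to flatten out, so the further you project beyond the measured sizes, the less the projection is worth.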

Balanced or unbalanced training sets? Both!

The test corpus can be heavily unbalanced, meaning that one class is overrepresented and another underrepresented in number of documents. As Amund Tveit points out in his blog:

“Quite frequently classification problems have to deal with unbalanced data sets, e.g. let us say you were to classify documents about soccer and casting (fishing), and your training data set contained about 99.99% soccer and 0.01% about casting, a baseline classifier for a similar dataset could be to say – “the article is about soccer”. This would most likely be a very strong baseline, and probably hard to beat for most heavy machinery classifiers.”

In many cases it’s desirable to run tests on unbalanced test sets. For example, imagine that you get 2000 spam e-mails and 10 legitimate ones every day. You decide to install a machine learning spam filter. The spam filter requires training, so each day you train it on 2000 spam and 10 legitimate e-mails. This creates unbalanced training data for your classifier, and it’s extremely important that your spam filter doesn’t mark your legitimate e-mails as spam (it could be a matter of life, death, peace or war, literally).
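The baseline from the quote is easy to try on the numbers from the spam example. The sketch below builds a classifier that always answers with the most common training label and scores it on one day of mail (2000 spam, 10 legitimate). The function name `majority_baseline` and the label strings are my own choices for illustration.

```python
from collections import Counter
from typing import Callable, List


def majority_baseline(train_labels: List[str]) -> Callable[[object], str]:
    """Return a classifier that ignores its input and always predicts the most common label."""
    most_common_label, _count = Counter(train_labels).most_common(1)[0]
    return lambda _document: most_common_label


# One day of mail from the example: 2000 spam and 10 legitimate e-mails.
labels = ["spam"] * 2000 + ["legitimate"] * 10

classify = majority_baseline(labels)
predictions = [classify(None) for _ in labels]

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
legit_marked_spam = sum(
    1 for p, y in zip(predictions, labels) if y == "legitimate" and p == "spam"
)

print(f"accuracy: {accuracy:.4f}")
print(f"legitimate e-mails marked as spam: {legit_marked_spam} of 10")
```

The baseline is right on 2000 of 2010 e-mails, about 99.5%, while throwing away every single legitimate e-mail. Any classifier tested on data like this has to beat that baseline, and overall accuracy alone will not tell you whether it protects the rare class.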


  • Understand that corpuses are biased and therefore also the test results.
  • Use up-to-date corpuses if the classification domain is dynamic.
  • Make sure the test data is as representative as possible for the domain. E.g. don’t trust test results from a spam corpus to carry over to sentiment classification.
  • Prefer running tests on many different corpuses.
  • Run tests on the corpuses as they scale in size.
  • Make sure that the classifier is robust on unbalanced training data, especially when correct classifications can be a matter of life and death.